Bug #2560

Many top processes exist on hypervisors

Added by novid Agha Hasani almost 6 years ago. Updated almost 6 years ago.

Status: Closed
Start date: 12/11/2013
Priority: Normal
Due date:
Assignee: Javi Fontan
% Done: 0%
Category: Drivers - Monitor
Target version: Release 4.6
Resolution: worksforme
Pull request:
Affected Versions: OpenNebula 4.4

Description

When the CPU load was very high (load average: 13.17, 11.51, 11.19), I checked the running processes and saw many top and ruby processes:


oneadmin   390  0.5  0.0  19472  1724 ?        S    17:33   0:00 top -bin2
oneadmin   657  1.0  0.0  19472  1720 ?        S    17:33   0:00 top -bin2
oneadmin   839  0.5  0.0  19480  1732 ?        S    17:33   0:00 top -bin2
oneadmin   869  0.5  0.0  19468  1716 ?        S    17:33   0:00 top -bin2
oneadmin  1081  1.0  0.0  19476  1724 ?        S    17:33   0:00 top -bin2
oneadmin  1198  1.0  0.0  19472  1712 ?        S    17:33   0:00 top -bin2
oneadmin  1267  0.5  0.0  19476  1724 ?        S    17:33   0:00 top -bin2
oneadmin  1467  1.0  0.0  19472  1716 ?        S    17:33   0:00 top -bin2
oneadmin  1562  1.0  0.0  19484  1728 ?        S    17:33   0:00 top -bin2
oneadmin  1687  1.0  0.0  19488  1728 ?        S    17:33   0:00 top -bin2
oneadmin  1787  1.0  0.0  19484  1720 ?        S    17:33   0:00 top -bin2
oneadmin  1800  2.0  0.0  19488  1724 ?        S    17:33   0:00 top -bin2
oneadmin  1828  1.0  0.0  19480  1728 ?        S    17:33   0:00 top -bin2
oneadmin  1840  1.0  0.0  19488  1728 ?        S    17:33   0:00 top -bin2
oneadmin  1975  1.0  0.0  19484  1724 ?        S    17:33   0:00 top -bin2
oneadmin  2096  1.0  0.0  19484  1716 ?        S    17:33   0:00 top -bin2
oneadmin  2157  1.0  0.0  19472  1704 ?        S    17:33   0:00 top -bin2
oneadmin  2861  2.0  0.0  19472  1704 ?        S    17:33   0:00 top -bin2
oneadmin  2866  2.0  0.0  19476  1724 ?        S    17:33   0:00 top -bin2
oneadmin  2907  0.0  0.0  19468  1712 ?        S    17:33   0:00 top -bin2
oneadmin  3085  0.0  0.0  19468  1700 ?        S    17:33   0:00 top -bin2
oneadmin  3170  0.0  0.0  19484  1716 ?        S    17:33   0:00 top -bin2
oneadmin  3193  0.0  0.0  19476  1708 ?        S    17:33   0:00 top -bin2
oneadmin  3408  0.0  0.0  19468  1704 ?        S    17:33   0:00 top -bin2
oneadmin  3750  0.0  0.0  19468  1704 ?        S    17:33   0:00 top -bin2
oneadmin  3843  0.0  0.0  19468  1704 ?        S    17:33   0:00 top -bin2
oneadmin  4203  0.0  0.0  19356  1528 ?        S    17:33   0:00 top -bin2
oneadmin 32282  0.6  0.0  19484  1724 ?        S    17:33   0:00 top -bin2
oneadmin 32561  0.5  0.0  19476  1712 ?        S    17:33   0:00 top -bin2

And this is the output of my pstree command:

├─35*[ruby─┬─run_probes───run_probes───run_probes───ruby─┬─top]
│ │ └─{ruby}]
│ └─{ruby}]
├─3*[ruby─┬─run_probes───run_probes───run_probes───monitor_ds.sh.d]
│ └─{ruby}]
├─ruby─┬─run_probes───run_probes───run_probes───monitor_ds.sh
│ └─{ruby}

I can't kill the top processes. Is that because they are created by poll.sh?

But when I kill the ruby processes, the server load drops from 10 to between 1 and 2!

Is there a problem with hypervisor monitoring?


Related issues

Related to Bug #2656: Monitor continually cycles through finding machines RUNNI... Closed 01/17/2014

Associated revisions

Revision 414fdf8f
Added by Javi Fontan almost 6 years ago

bug #2560: add remote to kill wild collectd processes

Revision 24625364
Added by Javi Fontan almost 6 years ago

bug #2560: add remote to kill wild collectd processes

(cherry picked from commit 414fdf8fb6b2813a49213616d795147c9b56c520)

History

#1 Updated by Ruben S. Montero almost 6 years ago

  • Category set to Drivers - Monitor
  • Status changed from Pending to New
  • Target version set to Release 4.6

This may be caused by the poll process starting a new probe before the previous one ends. Or maybe you have started multiple collectd clients (by adding the same host multiple times); this should not happen either.

#2 Updated by novid Agha Hasani almost 6 years ago

Ruben S. Montero wrote:

This may be caused by the poll process starting a new probe before the previous one ends. Or maybe you have started multiple collectd clients (by adding the same host multiple times); this should not happen either.

And what can I do?

As a temporary solution, I'm writing a script that automatically kills the ruby process when the server load goes above a specific number.
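A workaround of this kind could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names `load_exceeds` and `pids_matching` and the threshold of 10 are hypothetical, and the sketch only decides which PIDs would be signalled. Killing the ruby monitoring process stops monitoring for that host, so this hides the symptom rather than fixing the cause.

```python
import os


def load_exceeds(threshold, loadavg=None):
    """Return True when the 1-minute load average is above the threshold.

    `loadavg` is a (1 min, 5 min, 15 min) tuple; when omitted it is read
    from the running system with os.getloadavg().
    """
    if loadavg is None:
        loadavg = os.getloadavg()
    return loadavg[0] > threshold


def pids_matching(ps_lines, command_substring):
    """Return PIDs from `ps aux`-style lines whose command contains the substring.

    Assumes the usual 11-column layout:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    pids = []
    for line in ps_lines:
        fields = line.split(None, 10)  # keep the full command in the last field
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or unexpected format
        if command_substring in fields[10]:
            pids.append(int(fields[1]))
    return pids


if __name__ == "__main__":
    # Hypothetical threshold; nothing is killed here, the PIDs are only printed.
    if load_exceeds(10):
        print("load is high; candidate PIDs would be selected with pids_matching()")
```

The selected PIDs could then be passed to `os.kill(pid, signal.SIGTERM)`, which is deliberately left out of the sketch.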

I have a similar problem on the OpenNebula server (frontend) too: when I start OpenNebula, the server load goes up rapidly and Sunstone can't respond very well...

This is my "vmstat -2" output while OpenNebula is running:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  1      0 5274216 147200 174484    0    0    21   903  216  433  1  1 62 36
 0  1      0 5273720 147200 174504    0    0     8   636  253  861  1  0 65 34
 0  1      0 5272592 147200 174524    0    0    12   668  278  836  1  2 61 37
 0  1      0 5272228 147208 174564    0    0    10   672  238  785  0  0 64 36
 0  1      0 5271740 147208 174576    0    0    18   652  247  793  1  1 65 33
 0  1      0 5271732 147208 174592    0    0    10   656  217  746  0  0 65 34
 0  1      0 5264232 147208 174644    0    0     4   676  695 2752 25 11 36 28


And the output of the ps command:
ps -ef | grep ruby
root      3518  1612  0 14:43 pts/2    00:00:00 grep --color=auto ruby
oneadmin 32104 32084  0 14:19 ?        00:00:00 ruby /usr/lib/one/mads/one_vmm_exec.rb -t 2 -r 0 kvm
oneadmin 32174 32084  0 14:19 ?        00:00:00 ruby /usr/lib/one/mads/one_im_exec.rb -r 3 -t 2 kvm
oneadmin 32187 32084  0 14:19 ?        00:00:00 ruby /usr/lib/one/mads/one_tm.rb -t 15 -d dummy,lvm,shared,fs_lvm,qcow2,ssh,vmfs,ceph
oneadmin 32203 32084  0 14:19 ?        00:00:00 ruby /usr/lib/one/mads/one_hm.rb
oneadmin 32220 32084  0 14:19 ?        00:00:00 ruby /usr/lib/one/mads/one_datastore.rb -t 5 -d dummy,fs,vmfs,lvm,ceph
oneadmin 32236 32084  0 14:19 ?        00:00:00 ruby /usr/lib/one/mads/one_auth_mad.rb --authn ssh,x509,ldap,server_cipher,server_x509

After stopping OpenNebula:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 5353436 148296 179568    0    0    20   874  214 1052  2  1 62 35
 0  0      0 5353428 148296 179568    0    0     0     0   46   48  0  0 100  0
 0  0      0 5353428 148304 179568    0    0     0    16   32   39  0  0 99  1
 0  0      0 5353428 148304 179568    0    0     0     0   25   19  0  0 100  0
 0  0      0 5353428 148304 179568    0    0     0     0   29   39  0  0 100  0
 0  0      0 5353428 148304 179572    0    0     0     0   30   43  0  0 100  0


And this is about the "b" column:
"Number of processes in uninterruptible sleep (b) can be used to identify the CPU power. If the value is constantly greater than zero then you may not have enough CPU power."

Which is true? Is my CPU not powerful enough to run the OpenNebula frontend, or is OpenNebula not optimized for my server?
Or is this an OpenNebula bug related to the ruby scripts?

My CPU is a Dual-Core E5700 @ 3GHz, and OpenNebula manages just 4 hosts and 14 VMs.

#3 Updated by Ruben S. Montero almost 6 years ago

  • Status changed from New to Closed
  • Resolution set to worksforme

I cannot see any problem (regarding the ps output) from the OpenNebula point of view. Those processes are the drivers; they sleep most of the time and execute operations when needed.

Are there other operations going on in the system? For example, registering a big file in a local datastore can slow down the system if its storage does not perform well...

We can keep looking at it more closely, and reopen the bug if we find any performance problem with OpenNebula.

Thanks for your feedback

#4 Updated by Ruben S. Montero almost 6 years ago

  • Status changed from Closed to New

Wrongly closed this issue; the original problem still exists: too many monitor processes on the host.

#5 Updated by Ruben S. Montero almost 6 years ago

  • Related to Bug #2656: Monitor continually cycles through finding machines RUNNING and stat UNKNOWN added

#6 Updated by Javi Fontan almost 6 years ago

  • Status changed from New to Assigned
  • Assignee set to Javi Fontan

#7 Updated by Javi Fontan almost 6 years ago

It's possible that your problem is the same as in this thread http://lists.opennebula.org/pipermail/users-opennebula.org/2014-January/026234.html.

The start script for the collectd-client daemon in the hosts checks if a daemon is already running in the host using a pid file that resides in /tmp. When this file cannot be correctly written (maybe permissions) or contains an incorrect pid (when this file is shared with other hosts), the script does not know a daemon is already running and starts another one.
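The failure mode described above can be illustrated with a small Python sketch of a pid-file guard. This is not the actual OpenNebula start script; the function names are hypothetical and the logic is only an assumption about how such a guard typically works. It shows why an unreadable file or a PID that does not exist on this host makes the guard conclude that no daemon is running.

```python
import os


def read_pid(pid_file):
    """Return the positive PID stored in pid_file, or None if it cannot be read."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return None  # missing file, no permission, or not a number
    return pid if pid > 0 else None


def pid_is_alive(pid):
    """Check whether a process with this PID exists on the local host."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def should_start_daemon(pid_file):
    """Return True when the guard would start a new daemon."""
    pid = read_pid(pid_file)
    return pid is None or not pid_is_alive(pid)
```

With a pid file that is shared between hosts, `read_pid` returns a PID written by another machine, `pid_is_alive` most likely reports it as missing locally, and every poll cycle starts one more daemon.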

Can you check that the file /tmp/one-collectd-client.pid has read/write permissions for the oneadmin user, that the file is not shared with other hosts, and that it contains the PID of one of the 'collectd' daemons already running?
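These checks could be scripted on the host, for example with the Python sketch below. It is an illustration under assumptions: `check_pid_file` is a hypothetical helper, it relies on the Linux /proc filesystem to read the command line of the recorded PID, and it should be run as the oneadmin user so that the permission test is meaningful. It cannot detect whether /tmp is shared between hosts; that has to be verified separately.

```python
import os


def check_pid_file(pid_file, expected_name="collectd"):
    """Return a list of problems found with the pid file (empty if none)."""
    if not os.path.isfile(pid_file):
        return ["%s does not exist" % pid_file]

    problems = []
    if not os.access(pid_file, os.R_OK | os.W_OK):
        problems.append("%s is not readable and writable by this user" % pid_file)

    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        problems.append("%s does not contain a numeric PID" % pid_file)
        return problems

    # Linux-specific: the command line of a process is exposed under /proc.
    try:
        with open("/proc/%d/cmdline" % pid, "rb") as handle:
            cmdline = handle.read().replace(b"\0", b" ").decode(errors="replace")
    except OSError:
        problems.append("PID %d is not a running process on this host" % pid)
        return problems

    if expected_name not in cmdline:
        problems.append("PID %d is running %r, not %s" % (pid, cmdline.strip(), expected_name))
    return problems


if __name__ == "__main__":
    for problem in check_pid_file("/tmp/one-collectd-client.pid"):
        print(problem)
```

An empty result means the file is accessible and points at a process whose command line mentions collectd.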

We are going to change the path of this file to the remotes directory. The advantage of this is that this directory can be configured in oned.conf so the administrator has more control over it.

#8 Updated by Javi Fontan almost 6 years ago

  • Status changed from Assigned to Closed
