Bug #2560
Many leftover top processes on hypervisors
| Status: | Closed | Start date: | 12/11/2013 |
|---|---|---|---|
| Priority: | Normal | Due date: | |
| Assignee: | Javi Fontan | % Done: | 0% |
| Category: | Drivers - Monitor | | |
| Target version: | Release 4.6 | | |
| Resolution: | worksforme | Pull request: | |
| Affected Versions: | OpenNebula 4.4 | | |
Description
When the CPU load is very high (load average: 13.17, 11.51, 11.19), I checked the processes on my host and saw that many `top` and ruby processes exist:
```
oneadmin   390  0.5  0.0 19472 1724 ?  S 17:33  0:00 top -bin2
oneadmin   657  1.0  0.0 19472 1720 ?  S 17:33  0:00 top -bin2
oneadmin   839  0.5  0.0 19480 1732 ?  S 17:33  0:00 top -bin2
oneadmin   869  0.5  0.0 19468 1716 ?  S 17:33  0:00 top -bin2
oneadmin  1081  1.0  0.0 19476 1724 ?  S 17:33  0:00 top -bin2
oneadmin  1198  1.0  0.0 19472 1712 ?  S 17:33  0:00 top -bin2
oneadmin  1267  0.5  0.0 19476 1724 ?  S 17:33  0:00 top -bin2
oneadmin  1467  1.0  0.0 19472 1716 ?  S 17:33  0:00 top -bin2
oneadmin  1562  1.0  0.0 19484 1728 ?  S 17:33  0:00 top -bin2
oneadmin  1687  1.0  0.0 19488 1728 ?  S 17:33  0:00 top -bin2
oneadmin  1787  1.0  0.0 19484 1720 ?  S 17:33  0:00 top -bin2
oneadmin  1800  2.0  0.0 19488 1724 ?  S 17:33  0:00 top -bin2
oneadmin  1828  1.0  0.0 19480 1728 ?  S 17:33  0:00 top -bin2
oneadmin  1840  1.0  0.0 19488 1728 ?  S 17:33  0:00 top -bin2
oneadmin  1975  1.0  0.0 19484 1724 ?  S 17:33  0:00 top -bin2
oneadmin  2096  1.0  0.0 19484 1716 ?  S 17:33  0:00 top -bin2
oneadmin  2157  1.0  0.0 19472 1704 ?  S 17:33  0:00 top -bin2
oneadmin  2861  2.0  0.0 19472 1704 ?  S 17:33  0:00 top -bin2
oneadmin  2866  2.0  0.0 19476 1724 ?  S 17:33  0:00 top -bin2
oneadmin  2907  0.0  0.0 19468 1712 ?  S 17:33  0:00 top -bin2
oneadmin  3085  0.0  0.0 19468 1700 ?  S 17:33  0:00 top -bin2
oneadmin  3170  0.0  0.0 19484 1716 ?  S 17:33  0:00 top -bin2
oneadmin  3193  0.0  0.0 19476 1708 ?  S 17:33  0:00 top -bin2
oneadmin  3408  0.0  0.0 19468 1704 ?  S 17:33  0:00 top -bin2
oneadmin  3750  0.0  0.0 19468 1704 ?  S 17:33  0:00 top -bin2
oneadmin  3843  0.0  0.0 19468 1704 ?  S 17:33  0:00 top -bin2
oneadmin  4203  0.0  0.0 19356 1528 ?  S 17:33  0:00 top -bin2
oneadmin 32282  0.6  0.0 19484 1724 ?  S 17:33  0:00 top -bin2
oneadmin 32561  0.5  0.0 19476 1712 ?  S 17:33  0:00 top -bin2
```
And this is the output of the pstree command:
```
├─35*[ruby─┬─run_probes───run_probes───run_probes───ruby─┬─top]
│          │                                             └─{ruby}]
│          └─{ruby}]
├─3*[ruby─┬─run_probes───run_probes───run_probes───monitor_ds.sh.d]
│         └─{ruby}]
├─ruby─┬─run_probes───run_probes───run_probes───monitor_ds.sh
│      └─{ruby}
```
I can't kill the top processes, because they are created by poll.sh. But when I kill the ruby processes, the server load drops from 10 down to 1 or 2. It seems there is a problem in the hypervisor monitoring?
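As a quick diagnostic before killing anything, the stray probes can be counted and traced back to their parents. A shell sketch, using the `oneadmin` user and the `top -bin2` pattern from the ps output above:

```shell
# Count leftover `top -bin2` probes left behind by the monitoring scripts.
stray=$(pgrep -u oneadmin -f 'top -bin2' | wc -l)
echo "stray top probes: $stray"

# Show the parent chain of one of them to confirm it hangs off run_probes.
pid=$(pgrep -u oneadmin -f 'top -bin2' | head -n 1)
[ -n "$pid" ] && ps -o pid,ppid,cmd -p "$pid" || true
```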
Related issues
Associated revisions
bug #2560: add remote to kill wild collectd processes
bug #2560: add remote to kill wild collectd processes
(cherry picked from commit 414fdf8fb6b2813a49213616d795147c9b56c520)
History
#1 Updated by Ruben S. Montero over 7 years ago
- Category set to Drivers - Monitor
- Status changed from Pending to New
- Target version set to Release 4.6
This may be caused by the poll process starting a new probe before the previous one ends. Or maybe you have started multiple collectd clients (by adding the same host multiple times); this should not happen either.
#2 Updated by novid Agha Hasani over 7 years ago
Ruben S. Montero wrote:
This may be caused by the poll process starting a new probe before the previous one ends. Or maybe you have started multiple collectd clients (by adding the same host multiple times); this should not happen either.
And what can I do?
As a temporary solution, I'm writing a script that automatically kills the ruby processes when the server load rises above a specific threshold.
I have a similar problem on the OpenNebula server itself (frontend): when I start OpenNebula, the server load rises rapidly and Sunstone can't respond very well.
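Such a workaround could look like the following sketch (hypothetical, not part of OpenNebula; MAX_LOAD and the process pattern are assumptions to adapt):

```shell
#!/bin/sh
# Hypothetical watchdog: kill leftover probe processes when the 1-minute
# load average exceeds MAX_LOAD. Could be run from cron every minute.
MAX_LOAD=10

load=$(cut -d' ' -f1 /proc/loadavg)   # e.g. "13.17"
load_int=${load%%.*}                  # integer part for the comparison

if [ "$load_int" -ge "$MAX_LOAD" ]; then
    echo "load $load >= $MAX_LOAD, killing stray probe processes"
    pkill -u oneadmin -f 'top -bin2'
fi
```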
This is my "vmstat 2" output while OpenNebula is running:
```
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  1      0 5274216 147200 174484    0    0    21   903  216  433  1  1 62 36
 0  1      0 5273720 147200 174504    0    0     8   636  253  861  1  0 65 34
 0  1      0 5272592 147200 174524    0    0    12   668  278  836  1  2 61 37
 0  1      0 5272228 147208 174564    0    0    10   672  238  785  0  0 64 36
 0  1      0 5271740 147208 174576    0    0    18   652  247  793  1  1 65 33
 0  1      0 5271732 147208 174592    0    0    10   656  217  746  0  0 65 34
 0  1      0 5264232 147208 174644    0    0     4   676  695 2752 25 11 36 28
```
And the output of the ps command:
```
$ ps -ef | grep ruby
root      3518  1612  0 14:43 pts/2 00:00:00 grep --color=auto ruby
oneadmin 32104 32084  0 14:19 ?     00:00:00 ruby /usr/lib/one/mads/one_vmm_exec.rb -t 2 -r 0 kvm
oneadmin 32174 32084  0 14:19 ?     00:00:00 ruby /usr/lib/one/mads/one_im_exec.rb -r 3 -t 2 kvm
oneadmin 32187 32084  0 14:19 ?     00:00:00 ruby /usr/lib/one/mads/one_tm.rb -t 15 -d dummy,lvm,shared,fs_lvm,qcow2,ssh,vmfs,ceph
oneadmin 32203 32084  0 14:19 ?     00:00:00 ruby /usr/lib/one/mads/one_hm.rb
oneadmin 32220 32084  0 14:19 ?     00:00:00 ruby /usr/lib/one/mads/one_datastore.rb -t 5 -d dummy,fs,vmfs,lvm,ceph
oneadmin 32236 32084  0 14:19 ?     00:00:00 ruby /usr/lib/one/mads/one_auth_mad.rb --authn ssh,x509,ldap,server_cipher,server_x509
```
After stopping OpenNebula:
```
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 5353436 148296 179568    0    0    20   874  214 1052  2  1 62 35
 0  0      0 5353428 148296 179568    0    0     0     0   46   48  0  0 100 0
 0  0      0 5353428 148304 179568    0    0     0    16   32   39  0  0 99  1
 0  0      0 5353428 148304 179568    0    0     0     0   25   19  0  0 100 0
 0  0      0 5353428 148304 179568    0    0     0     0   29   39  0  0 100 0
 0  0      0 5353428 148304 179572    0    0     0     0   30   43  0  0 100 0
```
And this is what I read about the "b" column: the number of processes in uninterruptible sleep (b) can be used to identify CPU saturation. If the value is constantly greater than zero, you may not have enough CPU power.
Which is true: is my CPU not powerful enough to run the OpenNebula frontend, or is OpenNebula not optimized on my server?
Or is this an OpenNebula bug related to the ruby scripts?
My CPU is a Dual-Core E5700 @ 3GHz, and OpenNebula manages just 4 hosts and 14 VMs.
#3 Updated by Ruben S. Montero over 7 years ago
- Status changed from New to Closed
- Resolution set to worksforme
I cannot see any problem (regarding the ps output) from OpenNebula's point of view. Those processes are the drivers; they are sleeping most of the time and execute operations when needed.
Are there other operations going on in the system? For example, registering a big file in a local datastore can slow down the system if it does not have performant storage.
We can keep looking at this more closely, and reopen the bug if we find a performance problem with OpenNebula.
Thanks for your feedback
#4 Updated by Ruben S. Montero over 7 years ago
- Status changed from Closed to New
Wrongly closed this issue; the original problem still exists: too many monitor processes on the host.
#5 Updated by Ruben S. Montero over 7 years ago
- Related to Bug #2656: Monitor continually cycles through finding machines RUNNING and stat UNKNOWN added
#6 Updated by Javi Fontan over 7 years ago
- Status changed from New to Assigned
- Assignee set to Javi Fontan
#7 Updated by Javi Fontan over 7 years ago
It's possible that your problem is the same as in this thread http://lists.opennebula.org/pipermail/users-opennebula.org/2014-January/026234.html.
The start script for the collectd-client daemon on the hosts checks whether a daemon is already running, using a pid file that resides in /tmp. When this file cannot be written correctly (maybe due to permissions) or contains an incorrect pid (for example when the file is shared with other hosts), the script does not know a daemon is already running and starts another one.
Can you check that the file /tmp/one-collectd-client.pid has read/write permissions for the oneadmin user, that the file is not shared with other hosts, and that it contains the PID of one of the 'collectd' daemons already running?
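These checks can be scripted; a sketch, using the pid file path named above:

```shell
#!/bin/sh
# Check the collectd-client pid file on a host.
PIDFILE=/tmp/one-collectd-client.pid

if [ -r "$PIDFILE" ]; then
    ls -l "$PIDFILE"                 # should be owned and writable by oneadmin
    pid=$(cat "$PIDFILE")
    if kill -0 "$pid" 2>/dev/null; then
        echo "pid file OK: process $pid is running"
        ps -o pid,user,cmd -p "$pid"
    else
        echo "stale pid file: no process with pid $pid"
    fi
else
    echo "no readable pid file at $PIDFILE"
fi
```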
We are going to change the path of this file to the remotes directory. The advantage is that this directory can be configured in oned.conf, so the administrator has more control over it.
#8 Updated by Javi Fontan over 7 years ago
- Status changed from Assigned to Closed