Bug #1060
remotes/vmm/kvm/poll_ganglia:73: undefined method `[]' for nil:NilClass
| Status: | Closed | Start date: | 01/13/2012 |
|---|---|---|---|
| Priority: | Normal | Due date: | |
| Assignee: | Javi Fontan | % Done: | 0% |
| Category: | Drivers - Auth | | |
| Target version: | Release 3.2.1 | | |
| Resolution: | fixed | Pull request: | |
| Affected Versions: | OpenNebula 3.0 | | |
Description
Hi,
I've followed configuration directions here: http://opennebula.org/documentation:rel3.0:ganglia
The IM seems to be working correctly; onehost list output:
    ID NAME RVM TCPU FCPU ACPU TMEM FMEM AMEM STAT
     2 thor   4  400  382  360 2.9G 2.5G 2.7G on
     3 odin   6  400  362  340 3.9G 2.9G 3.5G on
Hosts are configured with:
    IM_MAD : im_ganglia
    VM_MAD : vmm_kvm
    TM_MAD : tm_shared
I've enabled the poll_ganglia local poll by adding -l poll=poll_ganglia to the vmm_kvm arguments.
I've configured the cron job to push the VMS_INFORMATION base64 array to Ganglia; it appears both in the host details in the Ganglia web interface and in the output of onehost show for both nodes:
    $ onehost show 3 | grep VMS_INFORMATION | awk -F'"' '{print $2}' | base64 -d
    ---
    one-35:
      :state: a
      :nettx: 0
      :usedcpu: "2.5"
      :name: one-35
      :usedmemory: 65536
      :netrx: 185856
    one-37:
      :state: a
      :nettx: 0
      :usedcpu: "2.5"
      :name: one-37
      :usedmemory: 65536
      :netrx: 183694
    $ onehost show 2 | grep VMS_INFORMATION | awk -F'"' '{print $2}' | base64 -d
    ---
    one-36:
      :state: a
      :nettx: 0
      :usedcpu: "2.8"
      :name: one-36
      :usedmemory: 65536
      :netrx: 172851
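As a side note, the attribute can be decoded programmatically the same way the shell pipeline above does it: the payload is a Base64-encoded YAML hash keyed by domain name. Below is a minimal sketch in Ruby using stub data modelled on the dump above (the encoding step stands in for what the cron job pushes; it is not the driver's actual code):

```ruby
require 'base64'
require 'yaml'

# Stub payload modelled on the VMS_INFORMATION dump above;
# in the real setup the cron job pushes this string to Ganglia.
encoded = Base64.strict_encode64(<<YAML)
one-35:
  :state: a
  :nettx: 0
  :usedcpu: "2.5"
  :usedmemory: 65536
  :netrx: 185856
YAML

# Decode it back into a hash keyed by domain name. The inner keys
# start with ':' and load as Ruby symbols, so they must be permitted.
doms_info = YAML.safe_load(Base64.decode64(encoded),
                           permitted_classes: [Symbol])
puts doms_info["one-35"][:usedcpu]   # prints 2.5
```

Note that the `permitted_classes:` keyword needs a reasonably modern Psych (Ruby 2.6+); the 2012-era scripts simply called YAML.load.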
However, something is broken:
    $ onevm list
    ID USER     GROUP    NAME  STAT CPU  MEM HOSTNAME    TIME
    28 oneadmin oneadmin tty_0 runn   0   0K odin     04 21:26:28
    29 oneadmin oneadmin tty_1 runn   0   0K thor     04 21:26:28
    30 oneadmin oneadmin tty_2 runn   0   0K odin     04 21:26:28
    31 oneadmin oneadmin tty_3 runn   0   0K thor     04 21:26:28
    32 oneadmin oneadmin tty_4 runn   0   0K odin     04 21:26:27
    33 oneadmin oneadmin tty_5 runn   0   0K thor     04 21:26:27
    34 oneadmin oneadmin tty_6 runn   0   0K odin     04 21:26:27
    35 oneadmin oneadmin tty_7 runn   0   0K odin     04 21:26:27
    36 oneadmin oneadmin tty_8 runn   0   0K thor     04 21:26:27
    37 oneadmin oneadmin tty_9 runn   0   0K odin     04 21:26:26
Polls are constantly failing:
    [VMM][D]: Message received: LOG I 37 Command execution fail: /var/cloud/one/var/remotes/vmm/kvm/poll_ganglia one-37 odin 37 odin
    [VMM][D]: Message received: LOG I 37 /var/cloud/one/var/remotes/vmm/kvm/poll_ganglia:73: undefined method `[]' for nil:NilClass (NoMethodError)
    [VMM][D]: Message received: LOG I 37 ExitCode: 1
    [VMM][D]: Message received: POLL FAILURE 37 -
Looking at the code, it seems that the earlier call:
    doms_info = ganglia.get_vms_information
is returning nil, but I wasn't able to dig any deeper than this.
This is a test environment based on Debian Squeeze amd64; I can provide external SSH access if it would help debug the issue.
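Since the crash comes from indexing a nil result at line 73, a defensive wrapper around that call would at least turn the backtrace into a readable error. The sketch below is purely illustrative, not the shipped script; `vm_info_or_abort` and `FakeGanglia` are hypothetical names standing in for the driver's real objects:

```ruby
# Hypothetical guard around the call that returned nil in this report.
def vm_info_or_abort(ganglia, host, domain)
  doms_info = ganglia.get_vms_information
  if doms_info.nil? || !doms_info.key?(domain)
    STDERR.puts "Cannot find monitoring data for #{domain} on host #{host}"
    exit(-1)
  end
  doms_info[domain]
end

# Stand-in for the driver's Ganglia helper, returning stub data.
class FakeGanglia
  def get_vms_information
    { "one-35" => { :state => "a", :usedcpu => "2.5" } }
  end
end

info = vm_info_or_abort(FakeGanglia.new, "odin", "one-35")
```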
Associated revisions
bug #1060: fix parameter order in ganglia poll
bug #1060: set VM to unknown when it cannot be monitored
bug #1060: fix parameter order in ganglia poll (cherry picked from commit cd9f6d670545d1d69aeb36ed65ff12bf7373d4fd)
bug #1060: set VM to unknown when it cannot be monitored (cherry picked from commit 2f4bf9bbe214bb84e138bb57fa354a0e3e7cefa8)
History
#1 Updated by Giovanni Toraldo over 9 years ago
Oh nice, I've realized that the parameters passed to poll_ganglia are swapped.
I've fixed it by changing var/remotes/vmm/kvm/poll_ganglia lines 52-54 from:
    domain=ARGV[0]
    dom_id=ARGV[1]
    host=ARGV[2]
to:
    domain=ARGV[0]
    dom_id=ARGV[2]
    host=ARGV[1]
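This matches the invocation recorded in the log above (poll_ganglia one-37 odin 37 odin), i.e. domain first, then host, then the deployment id. A tiny argument check makes that ordering explicit; this is a sketch with a hypothetical helper name, not the shipped script:

```ruby
# Hypothetical helper: validate and unpack poll_ganglia's arguments
# in the corrected order (<domain> <host> <dom_id>), matching the
# logged invocation: poll_ganglia one-37 odin 37 odin
def parse_poll_args(argv)
  domain, host, dom_id = argv[0], argv[1], argv[2]
  if [domain, host, dom_id].any? { |a| a.nil? || a.to_s.empty? }
    raise ArgumentError, "usage: poll_ganglia <domain> <host> <dom_id>"
  end
  [domain, host, dom_id]
end

domain, host, dom_id = parse_poll_args(["one-37", "odin", "37", "odin"])
```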
Here is the output of onevm list:
    $ onevm list
    ID USER     GROUP    NAME  STAT CPU  MEM HOSTNAME    TIME
    28 oneadmin oneadmin tty_0 runn   0   0K odin     04 22:37:02
    29 oneadmin oneadmin tty_1 runn   0   0K thor     04 22:37:02
    30 oneadmin oneadmin tty_2 runn   0   0K odin     04 22:37:02
    31 oneadmin oneadmin tty_3 runn   0   0K thor     04 22:37:02
    32 oneadmin oneadmin tty_4 runn   0   0K odin     04 22:37:01
    33 oneadmin oneadmin tty_5 runn   0   0K thor     04 22:37:01
    34 oneadmin oneadmin tty_6 runn   0   0K odin     04 22:37:01
    35 oneadmin oneadmin tty_7 runn   1  64M odin     04 22:37:01
    36 oneadmin oneadmin tty_8 runn   2  64M thor     04 22:37:01
    37 oneadmin oneadmin tty_9 runn   1  64M odin     04 22:37:00
Now the running VMs have updated CPU and memory usage; however, killed instances stay in the running state instead of the expected unknown state. Is this behavior tied to how Ganglia works?
#2 Updated by Javi Fontan over 9 years ago
It seems the problem is that the host name in Ganglia differs from the name you have registered in OpenNebula. The Ganglia XML is composed of host nodes, and each of those nodes has a host name:
    <HOST NAME="host.name.com" IP="10.10.10.10" REPORTED="1326470344" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="0">
We use the name from the OpenNebula database to locate the node where the VM is running so we can check the info stored there. I suppose the NAME in Ganglia differs from the ones you are using in OpenNebula (odin and thor). Is this correct?
PS: I agree that the script should be more reliable and give a meaningful error message.
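One way to run that comparison by hand is to dump the XML that gmond serves (on TCP port 8649 by default) and list the HOST NAME attributes it contains. The sketch below uses a stub document modelled on the HOST node shown above; the hostnames are placeholders, not taken from this environment:

```ruby
require 'rexml/document'

# Stub Ganglia XML modelled on the HOST node quoted above;
# a real dump can be fetched from gmond's TCP port (8649 by default).
xml = <<XML
<GANGLIA_XML>
  <CLUSTER NAME="test">
    <HOST NAME="odin.example.com" IP="10.10.10.10"/>
    <HOST NAME="thor.example.com" IP="10.10.10.11"/>
  </CLUSTER>
</GANGLIA_XML>
XML

doc = REXML::Document.new(xml)
ganglia_hosts = doc.elements.to_a("//HOST").map { |h| h.attributes["NAME"] }
# Compare these names against the hostnames shown by `onehost list`.
puts ganglia_hosts.inspect
```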
#3 Updated by Giovanni Toraldo over 9 years ago
I'm not a Ganglia guru, but the hostnames should correspond to the OpenNebula ones: http://twitpic.com/86sa8n/full
How can I check it to be sure?
#4 Updated by Giovanni Toraldo over 9 years ago
I've killed a KVM process; oned.log:
    [VMM][I]: Monitoring VM 36.
    [VMM][D]: Message received: LOG I 36 ExitCode: 0
    [VMM][D]: Message received: POLL SUCCESS 36 -
The script exits without error, but it doesn't print anything, and the state stays running:
    $ onevm list | grep 36
    36 oneadmin oneadmin tty_8 runn 6 64M odin 04 23:40:04
However, an error is logged in the machine instance template:
    $ onevm show 36
    [..]
    ERROR=[
      MESSAGE="Error parsing monitoring str:\"POLL SUCCESS 36 - \"",
      TIMESTAMP="Fri Jan 13 17:16:28 2012" ]
    [..]
#5 Updated by Javi Fontan over 9 years ago
- Category set to Drivers - Auth
- Assignee set to Javi Fontan
You are right about the parameter switch. We changed the parameter order in the drivers but failed to update the Ganglia drivers accordingly. The problem with manually killed machines is known, but we haven't had time to fix it yet. I'll fix both the parameter order and the unknown state and attach the patch here.
I'm leaving this ticket open to keep track of these issues.
Thanks for reporting and discovering the cause of the problem.
#6 Updated by Javi Fontan over 9 years ago
I've uploaded patches that fix the parameter order and set the VM to unknown when it disappears.
#7 Updated by Javi Fontan over 9 years ago
- Status changed from New to Closed
- Resolution set to fixed
#8 Updated by Carlos Martín over 9 years ago
- Target version set to Release 3.2.1