Bug #1060

remotes/vmm/kvm/poll_ganglia:73: undefined method `[]' for nil:NilClass

Added by Giovanni Toraldo over 9 years ago. Updated over 9 years ago.

Status: Closed
Start date: 01/13/2012
Priority: Normal
Due date:
Assignee: Javi Fontan
% Done: 0%
Category: Drivers - Auth
Target version: Release 3.2.1
Resolution: fixed
Pull request:
Affected Versions: OpenNebula 3.0

Description

Hi,

I've followed configuration directions here: http://opennebula.org/documentation:rel3.0:ganglia

The IM seems to be working correctly. Output of onehost list:

  ID NAME               RVM   TCPU   FCPU   ACPU   TMEM   FMEM   AMEM   STAT
   2 thor                 4    400    382    360   2.9G   2.5G   2.7G     on
   3 odin                 6    400    362    340   3.9G   2.9G   3.5G     on

Hosts are configured with:

IM_MAD                : im_ganglia          
VM_MAD                : vmm_kvm             
TM_MAD                : tm_shared  

I've enabled the poll_ganglia local poll by adding -l poll=poll_ganglia to the vmm_kvm arguments.

I've configured a cron job to push the base64-encoded VMS_INFORMATION array to Ganglia. It appears both in the host details in the Ganglia web interface and in the output of onehost show for both nodes:

$ onehost show 3|grep VMS_INFORMATION|awk -F'"' '{print$2}'|base64 -d
--- 
one-35: 
  :state: a
  :nettx: 0
  :usedcpu: "2.5" 
  :name: one-35
  :usedmemory: 65536
  :netrx: 185856
one-37: 
  :state: a
  :nettx: 0
  :usedcpu: "2.5" 
  :name: one-37
  :usedmemory: 65536
  :netrx: 183694
$ onehost show 2|grep VMS_INFORMATION|awk -F'"' '{print$2}'|base64 -d
--- 
one-36: 
  :state: a
  :nettx: 0
  :usedcpu: "2.8" 
  :name: one-36
  :usedmemory: 65536
  :netrx: 172851
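
The shell pipeline above can also be reproduced in Ruby, which makes it easier to see the structure the poll script has to index into. This is only a sketch: `decode_vms_information` is a hypothetical helper name, and the sample document simply copies one entry from the output above rather than reading it from Ganglia.

```ruby
require "yaml"

# Decode a base64-encoded YAML document such as the VMS_INFORMATION value
# and return it as a Hash keyed by domain name (for example "one-35").
# decode_vms_information is a hypothetical helper, not part of OpenNebula.
def decode_vms_information(encoded)
  yaml_text = encoded.unpack1("m") # "m" is the base64 directive
  # The inner keys (:state, :usedcpu, ...) are Ruby symbols, so Symbol
  # has to be permitted explicitly when loading safely.
  YAML.safe_load(yaml_text, permitted_classes: [Symbol]) || {}
end

# Sample data copied from the onehost show output; a stand-in for the
# value published to Ganglia.
sample_yaml = <<~DOC
  ---
  one-35:
    :state: a
    :nettx: 0
    :usedcpu: "2.5"
    :name: one-35
    :usedmemory: 65536
    :netrx: 185856
DOC

encoded = [sample_yaml].pack("m0")
vms = decode_vms_information(encoded)
puts vms["one-35"][:usedmemory]
```

Note that the outer keys are plain strings while the inner keys are symbols, so a lookup has to use `vms["one-35"][:state]` and not `vms["one-35"]["state"]`.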

However, something is broken:

$ onevm list
    ID USER     GROUP    NAME         STAT CPU     MEM        HOSTNAME        TIME
    28 oneadmin oneadmin tty_0        runn   0      0K            odin 04 21:26:28
    29 oneadmin oneadmin tty_1        runn   0      0K            thor 04 21:26:28
    30 oneadmin oneadmin tty_2        runn   0      0K            odin 04 21:26:28
    31 oneadmin oneadmin tty_3        runn   0      0K            thor 04 21:26:28
    32 oneadmin oneadmin tty_4        runn   0      0K            odin 04 21:26:27
    33 oneadmin oneadmin tty_5        runn   0      0K            thor 04 21:26:27
    34 oneadmin oneadmin tty_6        runn   0      0K            odin 04 21:26:27
    35 oneadmin oneadmin tty_7        runn   0      0K            odin 04 21:26:27
    36 oneadmin oneadmin tty_8        runn   0      0K            thor 04 21:26:27
    37 oneadmin oneadmin tty_9        runn   0      0K            odin 04 21:26:26

Polls are constantly failing:

[VMM][D]: Message received: LOG I 37 Command execution fail: /var/cloud/one/var/remotes/vmm/kvm/poll_ganglia one-37 odin 37 odin
[VMM][D]: Message received: LOG I 37 /var/cloud/one/var/remotes/vmm/kvm/poll_ganglia:73: undefined method `[]' for nil:NilClass (NoMethodError)
[VMM][D]: Message received: LOG I 37 ExitCode: 1
[VMM][D]: Message received: POLL FAILURE 37 -

Looking at the code, it seems that the earlier call:

doms_info=ganglia.get_vms_information

is returning nil, but I'm not able to dig deeper than this.
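
For illustration, the `NoMethodError` in the log is what Ruby raises when `[]` is called on nil. A guard like the one below would turn that into a readable message. This is a minimal sketch, not the actual poll_ganglia code: `domain_info` is a hypothetical helper, and `doms_info` stands for whatever `ganglia.get_vms_information` returned.

```ruby
# Look up one domain's monitoring data, failing with a descriptive message
# instead of "undefined method `[]' for nil:NilClass".
# domain_info is a hypothetical helper; doms_info may be nil when no VM
# information was found for the requested host.
def domain_info(doms_info, domain, host)
  if doms_info.nil?
    raise ArgumentError, "no VM information found in Ganglia for host '#{host}'"
  end
  doms_info[domain] # still nil when the domain itself is not reported
end

begin
  # "37" as the host name mimics a lookup done with the wrong argument.
  domain_info(nil, "one-37", "37")
rescue ArgumentError => e
  warn e.message
end
```

With a message like this in oned.log, a wrong host name passed to the script would be visible directly in the error text.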

This is a test environment based on Debian Squeeze amd64. I can provide external SSH access if that would help debug the issue.

Associated revisions

Revision cd9f6d67
Added by Javi Fontan over 9 years ago

bug #1060: fix parameter order in ganglia poll

Revision 2f4bf9bb
Added by Javi Fontan over 9 years ago

bug #1060: set VM to unknown when it cannot be monitoried

Revision 5d357af6
Added by Javi Fontan over 9 years ago

bug #1060: fix parameter order in ganglia poll(cherry picked from commit cd9f6d670545d1d69aeb36ed65ff12bf7373d4fd)

Revision 6060d482
Added by Javi Fontan over 9 years ago

bug #1060: set VM to unknown when it cannot be monitoried(cherry picked from commit 2f4bf9bbe214bb84e138bb57fa354a0e3e7cefa8)

History

#1 Updated by Giovanni Toraldo over 9 years ago

Oh nice, I've realized that the parameters passed to poll_ganglia are swapped.

I fixed it by changing var/remotes/vmm/kvm/poll_ganglia lines 52-54 from:

domain=ARGV[0]
dom_id=ARGV[1]
host=ARGV[2]

to:
domain=ARGV[0]
dom_id=ARGV[2]
host=ARGV[1]
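
As a sketch of how the argument handling could be made harder to get wrong, the positions can be named and checked in one place. The order used here (domain, host, VM id) is taken from the failing command line in the log, `poll_ganglia one-37 odin 37 odin`. The helper name `parse_poll_args` and the assumption that the VM id is always numeric are mine, not part of the driver.

```ruby
# Parse the poll arguments in the order seen in the failing log line:
#   poll_ganglia one-37 odin 37 odin  ->  domain, host, VM id
# parse_poll_args is a hypothetical helper, not part of OpenNebula.
def parse_poll_args(argv)
  domain, host, dom_id = argv
  if domain.nil? || host.nil? || dom_id.nil?
    raise ArgumentError, "usage: poll_ganglia <domain> <host> <vm_id>"
  end
  # Assumption: VM ids are plain integers, so a non-numeric value here
  # most likely means the host and id positions were swapped.
  unless dom_id.match?(/\A\d+\z/)
    raise ArgumentError, "VM id must be numeric, got '#{dom_id}' (arguments swapped?)"
  end
  { domain: domain, host: host, dom_id: Integer(dom_id) }
end

p parse_poll_args(%w[one-37 odin 37 odin])
```

Passing the arguments in the old order, for example `%w[one-37 37 odin]`, raises an `ArgumentError` instead of silently querying Ganglia for a host named "37".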

Here is the output of onevm list:

$ onevm list
    ID USER     GROUP    NAME         STAT CPU     MEM        HOSTNAME        TIME
    28 oneadmin oneadmin tty_0        runn   0      0K            odin 04 22:37:02
    29 oneadmin oneadmin tty_1        runn   0      0K            thor 04 22:37:02
    30 oneadmin oneadmin tty_2        runn   0      0K            odin 04 22:37:02
    31 oneadmin oneadmin tty_3        runn   0      0K            thor 04 22:37:02
    32 oneadmin oneadmin tty_4        runn   0      0K            odin 04 22:37:01
    33 oneadmin oneadmin tty_5        runn   0      0K            thor 04 22:37:01
    34 oneadmin oneadmin tty_6        runn   0      0K            odin 04 22:37:01
    35 oneadmin oneadmin tty_7        runn   1     64M            odin 04 22:37:01
    36 oneadmin oneadmin tty_8        runn   2     64M            thor 04 22:37:01
    37 oneadmin oneadmin tty_9        runn   1     64M            odin 04 22:37:00

Now the running VMs show updated CPU and memory usage. However, killed instances stay in the running state instead of the expected unknown state. Is this behavior tied to how Ganglia works?

#2 Updated by Javi Fontan over 9 years ago

The problem seems to be that the host name in Ganglia differs from the name you have registered in OpenNebula. The Ganglia XML is composed of host nodes, each of which has a host name:

<HOST NAME="host.name.com" IP="10.10.10.10" REPORTED="1326470344" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="0">

We use the name from the OpenNebula database to find the node where the VM is running, so we can check the information stored there. I suppose the NAME in Ganglia differs from the names you are using in OpenNebula (odin and thor). Is this correct?

PS: I agree that the script should be more reliable and give a meaningful error message.
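
One way to check the host names is to list the NAME attribute of every HOST element in the Ganglia XML and compare it with the name registered in OpenNebula. The sketch below uses REXML and a made-up sample document. The helper names and the `odin.example.com` host are illustrative assumptions, and how the XML is obtained from gmond or gmetad is left out.

```ruby
require "rexml/document"

# Return the NAME attribute of every <HOST> element in a Ganglia XML dump.
def ganglia_host_names(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//HOST").map { |host| host.attributes["NAME"] }
end

# Report whether the name registered in OpenNebula appears among the
# Ganglia host names. check_host_name is a hypothetical helper.
def check_host_name(xml, one_name)
  names = ganglia_host_names(xml)
  return "ok: '#{one_name}' found" if names.include?(one_name)
  "'#{one_name}' not found in Ganglia; known hosts: #{names.join(', ')}"
end

# Made-up sample; a real dump contains many more attributes and metrics.
sample = <<~DOC
  <GANGLIA_XML>
    <CLUSTER NAME="test">
      <HOST NAME="odin.example.com" IP="10.10.10.10"/>
    </CLUSTER>
  </GANGLIA_XML>
DOC

puts check_host_name(sample, "odin")
```

In this sample the short name "odin" does not match the fully qualified name reported by Ganglia, which is the kind of mismatch described above.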

#3 Updated by Giovanni Toraldo over 9 years ago

I'm not a Ganglia guru, but the host names should correspond to the OpenNebula ones: http://twitpic.com/86sa8n/full

How can I check it to be sure?

#4 Updated by Giovanni Toraldo over 9 years ago

I've killed a kvm process, oned.log:

[VMM][I]: Monitoring VM 36.
[VMM][D]: Message received: LOG I 36 ExitCode: 0
[VMM][D]: Message received: POLL SUCCESS 36 -

The script exits without error, but it doesn't print anything.

And the state stays at running:

$ onevm list|grep 36
    36 oneadmin oneadmin tty_8        runn   6     64M            odin 04 23:40:04

However, an error is logged in the VM template:

$ onevm show 36
[..]
ERROR=[
  MESSAGE="Error parsing monitoring str:\"POLL SUCCESS 36 -
\"",
  TIMESTAMP="Fri Jan 13 17:16:28 2012" ]
[..]
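
The parse error suggests the poll script printed nothing when the domain was missing from the Ganglia data. A sketch of a fallback that always prints a state is shown below. The `KEY=value` layout and the `d` state letter are assumptions made for illustration, and `monitor_string` is a hypothetical helper; the committed fix may differ.

```ruby
# Build the string printed for one poll. When the domain is missing from the
# Ganglia data, report a state instead of printing nothing.
# ASSUMPTIONS: the KEY=value layout and the "d" state letter are
# illustrative; monitor_string is not part of OpenNebula.
def monitor_string(dom_info)
  return "STATE=d" if dom_info.nil?
  dom_info.reject { |key, _| key == :name }
          .map { |key, value| "#{key.to_s.upcase}=#{value}" }
          .join(" ")
end

puts monitor_string(nil)
puts monitor_string({ state: "a", usedcpu: "2.5", name: "one-36" })
```

The point of the design is that the script never exits successfully with empty output, so oned always has something to parse.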

#5 Updated by Javi Fontan over 9 years ago

  • Category set to Drivers - Auth
  • Assignee set to Javi Fontan

You are right about the parameter switch. We changed the parameter order in the drivers but failed to update the Ganglia drivers accordingly. The problem with manually killed machines is known, but we haven't had time to fix it yet. I'll fix both the parameter order and the unknown state and will attach the patch here.

I'm leaving this ticket open to keep track of these issues.

Thanks for reporting and discovering the cause of the problem.

#6 Updated by Javi Fontan over 9 years ago

Patches are uploaded that fix the parameter order and set the VM to unknown when it disappears.

#7 Updated by Javi Fontan over 9 years ago

  • Status changed from New to Closed
  • Resolution set to fixed

#8 Updated by Carlos Martín over 9 years ago

  • Target version set to Release 3.2.1
