Bug #1896

VMs stuck in UNKNOWN mode

Added by Cyrille Duverne about 8 years ago. Updated about 8 years ago.

Status: Closed
Start date: 04/12/2013
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -
Resolution: fixed
Pull request:
Affected Versions: OpenNebula 3.8

Description

Hello,

I've been facing really strange behaviour for a few weeks now.

I shut down my whole ONE infrastructure and brought it back online.

All the VMs were restored via Sunstone.
And 90% of them went back to RUNNING.

But on 2 hosts only, the VMs are still stuck in UNKNOWN status.

"virsh list" on host is saying that they are running.
e.g.:
virsh list
Id Name State
----------------------------------
1 one-294 running

"ruby /var/tmp/one/vmm/kvm/poll one-xxx" is giving an answer with monitoring information.
e.g.:
ruby -wd /var/tmp/one/vmm/kvm/poll one-294
STATE=a NETTX=231069507 USEDCPU=0.1 USEDMEMORY=2118112 NETRX=507801157

But in oned.log the poll still answers STATE=d.
e.g.:
Fri Apr 12 08:13:56 2013 [VMM][I]: Monitoring VM 294.
Fri Apr 12 08:13:56 2013 [VMM][D]: Message received: LOG I 294 ExitCode: 0
Fri Apr 12 08:13:56 2013 [VMM][D]: Message received: POLL SUCCESS 294 STATE=d

I verified that oneadmin can still access the hosts without a password and that it is correctly configured.

I'm at a bit of a loss here, since the value doesn't seem to be cached anywhere, though I didn't check the DB.
Any help would be much appreciated.

Kind regards
Cyrille

History

#1 Updated by Ruben S. Montero about 8 years ago

The answers are different indeed:

POLL SUCCESS 294 STATE=d

vs

STATE=a NETTX=231069507 USEDCPU=0.1 USEDMEMORY=2118112 NETRX=507801157

Maybe not all the scripts were copied properly. You could try removing /var/tmp/one/ and copying it again in full (either manually or with onehost sync).
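
To make the mismatch concrete, here is a minimal Python sketch that pulls the KEY=VALUE pairs out of the two lines quoted in this thread and compares their STATE values. The helper names parse_poll and states_differ are made up for illustration and are not part of OpenNebula.

```python
import re


def parse_poll(line):
    """Return the KEY=VALUE pairs found in a poll answer or an oned.log line.

    Hypothetical helper for illustration only, not an OpenNebula function.
    """
    return dict(re.findall(r"\b([A-Z_]+)=(\S+)", line))


def states_differ(probe_line, log_line):
    """True when the two lines report a different STATE value."""
    return parse_poll(probe_line).get("STATE") != parse_poll(log_line).get("STATE")


# The two answers quoted in this report.
probe = "STATE=a NETTX=231069507 USEDCPU=0.1 USEDMEMORY=2118112 NETRX=507801157"
logged = ("Fri Apr 12 08:13:56 2013 [VMM][D]: "
          "Message received: POLL SUCCESS 294 STATE=d")

print(parse_poll(probe)["STATE"], parse_poll(logged)["STATE"],
      states_differ(probe, logged))
```

Run against the lines above, this shows STATE=a from the manual probe and STATE=d from the driver message, so whatever the driver executes is not producing the same answer as the script run by hand.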

#2 Updated by Cyrille Duverne about 8 years ago

Hello,

I ran onehost sync and now, as if by magic, ALL VMs are in UNKNOWN status, except the ones running on the Sunstone machine...

That's a change, not a good one, but still...

Thanks in advance for your feedback.
Cyrille

#3 Updated by Ruben S. Montero about 8 years ago

Is /var/tmp/one being recreated on the host? Do you have any problem (e.g. disk space) in /var/tmp on that host?

#4 Updated by Cyrille Duverne about 8 years ago

Well,

No space issue on the hosts.

But from the master: ls -lArth /var/tmp: drwxr-xr-x 10 oneadmin oneadmin 4.0K Sep 30 2012 one
From the remote hosts: ls -lArth /var/tmp: drwxr-xr-x 9 oneadmin oneadmin 4.0K Nov 23 10:43 one

Is it normal that after the onehost sync the folder is still dated 23/11?

It seems really weird to me... I really don't understand the issue here.

All accesses are granted to oneadmin, and polling directly on the host gives a good answer, but when the poll comes from the master it does not seem to work...
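
A directory's date only changes when entries are added to or removed from it, so the timestamp alone does not show whether the probe scripts inside match. A rough Python sketch for comparing two copies of the tree by content is below. It assumes both trees are readable from one machine (for example after copying the host's directory over with scp), and tree_digest and diff_trees are illustrative names, not OpenNebula tools.

```python
import hashlib
from pathlib import Path


def tree_digest(root):
    """Map each file path (relative to root) to the SHA-256 of its content."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_trees(reference, copy):
    """Return (missing, extra, changed) file lists of copy relative to reference.

    Hypothetical helper: both arguments are local directory paths.
    """
    ref, cpy = tree_digest(reference), tree_digest(copy)
    missing = sorted(set(ref) - set(cpy))   # in reference, absent from copy
    extra = sorted(set(cpy) - set(ref))     # in copy only
    changed = sorted(f for f in set(ref) & set(cpy) if ref[f] != cpy[f])
    return missing, extra, changed
```

Calling diff_trees("/path/to/reference", "/path/to/host-copy") with the two real locations (the paths here are placeholders) returns three empty lists if the sync copied everything faithfully.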

#5 Updated by Ruben S. Montero about 8 years ago

I do not see how this would affect the execution of run_probes and make the result differ between the command line and the driver (ssh). But it may be worth syncing the clocks of master and host, recreating /var/tmp/one, and seeing if that fixes the problem...

#6 Updated by Cyrille Duverne about 8 years ago

Well well well, I finally managed to solve this.

By... drum roll... removing the Ganglia options in oned.conf!

I was using Ganglia, and for an unknown reason it seems that it was not responding on 2 hosts...

So I chose to remove it.

Do you know of any other monitoring software that doesn't need an agent running on the VM for basic monitoring tasks?

Let me say that this kind of feature integrated into ONE would be GREAT!!!

Thanks a lot for your investigation and time.
Have a great weekend.
Cyrille

#7 Updated by Ruben S. Montero about 8 years ago

  • Status changed from New to Closed
  • Resolution set to fixed

OK. Great!!!

Yes, we are thinking of developing a very thin and light agent using the current probe mechanism and the XML-RPC API of OpenNebula. Basically, adding an option to change the monitoring strategy from pull (polling-based) to push...
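
As a purely conceptual illustration of pull versus push, here is a tiny Python sketch. The Collector class and its methods are hypothetical and say nothing about how OpenNebula implements or would implement either strategy. The only point is who initiates the update.

```python
class Collector:
    """Hypothetical front-end store holding the latest state of each VM."""

    def __init__(self):
        self.states = {}

    def poll(self, probes):
        """Pull: the collector calls every probe itself on each cycle."""
        for vm_id, probe in probes.items():
            self.states[vm_id] = probe()

    def report(self, vm_id, state):
        """Push: an agent on the host calls this when it has a new reading."""
        self.states[vm_id] = state


collector = Collector()
collector.poll({294: lambda: "a"})  # pull: the collector initiates
collector.report(295, "a")          # push: the agent initiates
print(collector.states)
```

With pull, a probe that stops answering stalls or falsifies the collector's view on every cycle. With push, the collector only has to notice that reports have stopped arriving.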
