Bug #317

Negative VM counters for hosts

Added by Szabolcs Székelyi almost 11 years ago. Updated over 9 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: Core & System
Target version: Release 3.2 - Beta1
Start date: 08/17/2010
Due date:
% Done: 0%
Resolution: fixed
Pull request:
Affected Versions: OpenNebula 3.0

Description

It looks like there's a very ugly bug in ONE 1.4, namely `onehost list` saying:

$ onehost list
  ID NAME                      RVM   TCPU   FCPU   ACPU    TMEM    FMEM STAT
  14 node4.vh                   -2    400    395    395 8192016 8130084  off
  15 node4.vh                  -16    400    400    400 8192016 8136420  off
  16 node1.vh                  -32    400    394    394 8192016 8026628   on

These negative numbers in place of the VM counters don't look very nice... At the same time `onevm list` shows:

$ onevm list
  ID     USER     NAME STAT CPU     MEM        HOSTNAME        TIME
 909 oneadmin cc-debia runn   0  262144        node1.vh 00 04:07:33
1012 oneadmin    nc-99 fail   0       0        node1.vh 00 00:00:37
1022 oneadmin   nc-106 fail   0       0        node1.vh 00 00:00:42
1023 oneadmin   nc-107 fail   0       0        node1.vh 00 00:00:47
1024 oneadmin   nc-108 fail   0       0        node1.vh 00 00:00:54
1025 oneadmin   nc-110 fail   0       0        node1.vh 00 00:01:03
1026 oneadmin   nc-111 fail   0       0        node1.vh 00 00:01:12
1027 oneadmin   nc-112 fail   0       0        node1.vh 00 00:01:14
1028 oneadmin   nc-113 fail   0       0        node1.vh 00 00:01:16
1029 oneadmin   nc-109 fail   0       0        node1.vh 00 00:01:24
1030 oneadmin   nc-114 fail   0       0        node1.vh 00 00:01:25

It looks like we only use node1.vh, but when I try to delete node4.vh I get:

$ onehost delete 14
Host still has associated VMs. It will be disabled instead.
$ onehost delete 15
Host still has associated VMs. It will be disabled instead.

oned.log (3.6 MB) Szabolcs Székelyi, 08/18/2010 06:28 PM

one.db (809 KB) Szabolcs Székelyi, 08/18/2010 06:28 PM

Associated revisions

Revision ff683fc5
Added by Carlos Martín over 10 years ago

Bug #317: Before migration, the RM checks if another migration has just started.

Revision 231b3e80
Added by Carlos Martín over 10 years ago

Bug #317: Before migration, the RM checks if another migration has just started.
(cherry picked from commit ff683fc5014c334080e363911b7157cf93ac8ffa)

History

#1 Updated by Javi Fontan almost 11 years ago

Could you send us the oned.log file so we can check what might be happening? The one.db file would also be helpful. Thanks

#2 Updated by Szabolcs Székelyi almost 11 years ago

Unfortunately our oned.log doesn't contain much info, since oned has been restarted a few times since this error appeared (so far we didn't care about it because everything else worked fine, but when we tried to remove a host it hit us very badly), and it clears the log on startup -- which is also a serious bug. Attached anyway, along with one.db.

#3 Updated by Ruben S. Montero over 10 years ago

  • Category set to Core & System
  • Status changed from New to Closed
  • Assignee set to Carlos Martín
  • Target version set to Release 2.2
  • Resolution set to fixed

This seems to be caused by wrong counter handling when two migrations occur simultaneously on the same VM. The fix for this is in the one-2.0 branch and master.
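
As an illustration of the race, here is a minimal sketch in Python (not OpenNebula code; the class and function names are made up). It shows how processing the same migration request twice can drive a host's RVM counter negative, which is what the fix guards against by checking whether another migration has just started:

# Hypothetical sketch of the double-decrement race, not OpenNebula code.
class Host:
    def __init__(self, name, running_vms=0):
        self.name = name
        self.running_vms = running_vms

def migrate(vm, source, target):
    # Without an "is a migration already in progress?" check, a duplicate
    # request decrements the source counter a second time.
    source.running_vms -= 1
    target.running_vms += 1
    vm["host"] = target.name

node4 = Host("node4.vh", running_vms=1)   # one VM currently placed here
node1 = Host("node1.vh")

vm = {"host": node4.name}
migrate(vm, node4, node1)   # first migration request
migrate(vm, node4, node1)   # duplicate request for the same VM

print(node4.name, node4.running_vms)   # node4.vh -1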

#4 Updated by Krzysztof Pawlik almost 10 years ago

I think this bug is still not solved completely:

oneadmin@ii1:~$ onehost list | grep on-10-177-32-58 -B 1
  ID NAME              CLUSTER  RVM   TCPU   FCPU   ACPU    TMEM    FMEM STAT
   0 on-10-177-32-58   default    1   2400   2372   1700  141.9G  129.1G   on
oneadmin@ii1:~$ onevm list | grep on-10-177-32-58
 2036 oneadmin X runn   3    7.5G on-10-177-32-58 09 18:37:31
 2039 oneadmin X runn   3  517.4M on-10-177-32-58 03 16:49:24
 2055 oneadmin X runn   3    7.5G on-10-177-32-58 01 18:06:40
 2060 oneadmin X runn   1    1.7G on-10-177-32-58 00 19:59:53
oneadmin@ii1:~$

This is from OpenNebula version 2.2.1. When I update the value of the running_vms column in the database to the correct value (4), it gets reset to the incorrect one (1) shortly afterwards, so it's a counting bug somewhere in oned. Running /tmp/vmm/kvm/poll --kvm on this node correctly returns 4 entries:

root@on-10-177-32-58:~# /tmp/vmm/kvm/poll --kvm | base64 -d
--- 
one-2060: 
  :state: a
  :nettx: "38320459" 
  :usedcpu: "1.7" 
  :name: one-2060
  :usedmemory: 1781760
  :netrx: "98076876" 
one-2039: 
  :state: a
  :nettx: "161570270" 
  :usedcpu: "3.5" 
  :name: one-2039
  :usedmemory: 529916
  :netrx: "349813672" 
one-2055: 
  :state: a
  :nettx: "55957833" 
  :usedcpu: "3.6" 
  :name: one-2055
  :usedmemory: 7864320
  :netrx: "1706234403" 
one-2036: 
  :state: a
  :nettx: "71730336" 
  :usedcpu: "3.4" 
  :name: one-2036
  :usedmemory: 7864320
  :netrx: "1145203212"

#5 Updated by Carlos Martín almost 10 years ago

  • Status changed from Closed to Assigned
  • Target version changed from Release 2.2 to Release 3.0
  • Resolution deleted (fixed)

If you change the DB with opennebula running, the cached object will still contain the old data. Sooner or later it is saved to the DB, overwriting your changes.

Could you provide more info? Any operations that could lead to the wrong counter?
Did you apply migrations, resubmit, cancel or suspend operations?

#6 Updated by Krzysztof Pawlik almost 10 years ago

Carlos Martín wrote:

> If you change the DB with opennebula running, the cached object will still contain the old data. Sooner or later it is saved to the DB, overwriting your changes.

Yes, I've found that out and updated the counters when oned was down.

> Could you provide more info? Any operations that could lead to the wrong counter?
> Did you apply migrations, resubmit, cancel or suspend operations?

No migrations or suspends, only a few resubmits and normal shutdowns. I'll try to provide more history info from the database when it happens again.
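
As a side note on the workaround mentioned above, this is a minimal sketch of fixing the counter by hand, assuming the default SQLite backend and that the counter lives in a running_vms column of a host_pool table. Those table and column names are an assumption here and may differ between releases, so verify them against your own one.db first. Run it only while oned is stopped, otherwise the cached object will overwrite the change:

# Sketch only: host_pool / running_vms are assumed names, check your schema.
import sqlite3

DB_PATH = "/var/lib/one/one.db"   # adjust to your installation
HOST_ID = 0                       # oid of the affected host (on-10-177-32-58 above)
CORRECT_RVM = 4                   # value counted from onevm list / the poll script

conn = sqlite3.connect(DB_PATH)
cur = conn.cursor()

# Inspect the current value first.
cur.execute("SELECT oid, running_vms FROM host_pool WHERE oid = ?", (HOST_ID,))
print(cur.fetchone())

# Write back the corrected counter, then restart oned.
cur.execute("UPDATE host_pool SET running_vms = ? WHERE oid = ?",
            (CORRECT_RVM, HOST_ID))
conn.commit()
conn.close()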

#7 Updated by Carlos Martín almost 10 years ago

  • Assignee deleted (Carlos Martín)

#8 Updated by Ruben S. Montero almost 10 years ago

  • Target version deleted (Release 3.0)
  • Affected Versions OpenNebula 3.0 added

#9 Updated by Ruben S. Montero almost 10 years ago

  • Status changed from Assigned to New

#10 Updated by Ruben S. Montero almost 10 years ago

  • Target version set to Release 3.4

#11 Updated by Ruben S. Montero almost 10 years ago

  • Target version changed from Release 3.4 to Release 3.2 - S0

#12 Updated by Ruben S. Montero almost 10 years ago

  • Status changed from New to Assigned

#13 Updated by Ruben S. Montero over 9 years ago

  • Target version changed from Release 3.2 - S0 to Release 3.2 - S1

#14 Updated by Ruben S. Montero over 9 years ago

  • Target version changed from Release 3.2 - S1 to Release 3.2 - Beta1

#15 Updated by Ruben S. Montero over 9 years ago

  • Status changed from Assigned to Closed
  • Resolution set to fixed

There was a bug when getting resources by name. In the case of the reported example (two hosts with the same name), or when deleting and adding hosts with the same name, that bug could have the effect described here. The same bug caused the same effect for networks.
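
For reference, a minimal sketch in plain Python (hypothetical names, not the actual OpenNebula code) of how a lookup by name goes wrong once two hosts share a name, so counter updates land on the wrong record:

# Hypothetical sketch of the ambiguous by-name lookup, not OpenNebula code.
hosts = [
    {"oid": 14, "name": "node4.vh", "running_vms": 0},
    {"oid": 15, "name": "node4.vh", "running_vms": 3},
]

def get_host_by_name(name):
    # Returns the first match, which is not necessarily the host the caller
    # meant once names are duplicated.
    return next(h for h in hosts if h["name"] == name)

# A VM finishes on host 15, but the by-name lookup decrements host 14 instead.
get_host_by_name("node4.vh")["running_vms"] -= 1

print(hosts[0])   # running_vms drifts to -1 on host 14
print(hosts[1])   # stays at 3 on host 15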

Closing this issue, hope not to have to reopen it again ;)
