Bug #317

Negative VM counters for hosts

Added by Szabolcs Székelyi almost 11 years ago. Updated over 9 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: Core & System
Target version: Release 3.2 - Beta1
Start date: 08/17/2010
Due date:
% Done: 0%
Resolution: fixed
Pull request:
Affected Versions: OpenNebula 3.0

Description

It looks like there's a very ugly bug in ONE 1.4, namely `onehost list` saying:

$ onehost list
  ID NAME                      RVM   TCPU   FCPU   ACPU    TMEM    FMEM STAT
  14 node4.vh                   -2    400    395    395 8192016 8130084  off
  15 node4.vh                  -16    400    400    400 8192016 8136420  off
  16 node1.vh                  -32    400    394    394 8192016 8026628   on

These negative numbers in place of the VM counters don't look very nice... At the same time `onevm list` shows:

$ onevm list
  ID     USER     NAME STAT CPU     MEM        HOSTNAME        TIME
 909 oneadmin cc-debia runn   0  262144        node1.vh 00 04:07:33
1012 oneadmin    nc-99 fail   0       0        node1.vh 00 00:00:37
1022 oneadmin   nc-106 fail   0       0        node1.vh 00 00:00:42
1023 oneadmin   nc-107 fail   0       0        node1.vh 00 00:00:47
1024 oneadmin   nc-108 fail   0       0        node1.vh 00 00:00:54
1025 oneadmin   nc-110 fail   0       0        node1.vh 00 00:01:03
1026 oneadmin   nc-111 fail   0       0        node1.vh 00 00:01:12
1027 oneadmin   nc-112 fail   0       0        node1.vh 00 00:01:14
1028 oneadmin   nc-113 fail   0       0        node1.vh 00 00:01:16
1029 oneadmin   nc-109 fail   0       0        node1.vh 00 00:01:24
1030 oneadmin   nc-114 fail   0       0        node1.vh 00 00:01:25

It looks like we only use node1.vh, but when I try to delete node4.vh I get:

$ onehost delete 14
Host still has associated VMs. It will be disabled instead.
$ onehost delete 15
Host still has associated VMs. It will be disabled instead.

oned.log (3.6 MB) Szabolcs Székelyi, 08/18/2010 06:28 PM

one.db (809 KB) Szabolcs Székelyi, 08/18/2010 06:28 PM

Associated revisions

Revision ff683fc5
Added by Carlos Martín over 10 years ago

Bug #317: Before migration, the RM checks if another migration has just started.

Revision 231b3e80
Added by Carlos Martín over 10 years ago

Bug #317: Before migration, the RM checks if another migration has just started.
(cherry picked from commit ff683fc5014c334080e363911b7157cf93ac8ffa)

History

#1 Updated by Javi Fontan almost 11 years ago

Could you send us the oned.log file so we can check what might be happening? The one.db file would also be helpful. Thanks

#2 Updated by Szabolcs Székelyi almost 11 years ago

Unfortunately our oned.log doesn't contain much info, since oned has been restarted a few times since this error appeared (so far we didn't care about it because everything else worked fine, but when we tried to remove a host it hit us very badly), and it clears the log on startup -- which is also a serious bug. Attached anyway, along with one.db.

#3 Updated by Ruben S. Montero over 10 years ago

  • Category set to Core & System
  • Status changed from New to Closed
  • Assignee set to Carlos Martín
  • Target version set to Release 2.2
  • Resolution set to fixed

This seems to be caused by wrong counter handling when two migrations occur simultaneously on the same VM. The fix for this is in the one-2.0 branch and master.
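
As an illustration of the race, here is a minimal sketch in Python (not OpenNebula code; the class and function names are made up). It shows how processing the same migration request twice can drive a host's RVM counter negative, which is what the fix guards against by checking whether another migration has just started:

# Hypothetical sketch of the double-decrement race, not OpenNebula code.
class Host:
    def __init__(self, name, running_vms=0):
        self.name = name
        self.running_vms = running_vms

def migrate(vm, source, target):
    # Without an "is a migration already in progress?" check, a duplicate
    # request decrements the source counter a second time.
    source.running_vms -= 1
    target.running_vms += 1
    vm["host"] = target.name

node4 = Host("node4.vh", running_vms=1)   # one VM currently placed here
node1 = Host("node1.vh")

vm = {"host": node4.name}
migrate(vm, node4, node1)   # first migration request
migrate(vm, node4, node1)   # duplicate request for the same VM

print(node4.name, node4.running_vms)   # node4.vh -1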

#4 Updated by Krzysztof Pawlik almost 10 years ago

I think this bug is still not solved completely:

oneadmin@ii1:~$ onehost list | grep on-10-177-32-58 -B 1
  ID NAME              CLUSTER  RVM   TCPU   FCPU   ACPU    TMEM    FMEM STAT
   0 on-10-177-32-58   default    1   2400   2372   1700  141.9G  129.1G   on
oneadmin@ii1:~$ onevm list | grep on-10-177-32-58
 2036 oneadmin X runn   3    7.5G on-10-177-32-58 09 18:37:31
 2039 oneadmin X runn   3  517.4M on-10-177-32-58 03 16:49:24
 2055 oneadmin X runn   3    7.5G on-10-177-32-58 01 18:06:40
 2060 oneadmin X runn   1    1.7G on-10-177-32-58 00 19:59:53
oneadmin@ii1:~$

This is from OpenNebula version 2.2.1. When I update the value of the running_vms column in the database to the correct value (4), it gets reset to the incorrect one (1) shortly afterwards, so it's a counting bug somewhere in oned. Running /tmp/vmm/kvm/poll --kvm on this node correctly returns 4 entries:

root@on-10-177-32-58:~# /tmp/vmm/kvm/poll --kvm | base64 -d
--- 
one-2060: 
  :state: a
  :nettx: "38320459" 
  :usedcpu: "1.7" 
  :name: one-2060
  :usedmemory: 1781760
  :netrx: "98076876" 
one-2039: 
  :state: a
  :nettx: "161570270" 
  :usedcpu: "3.5" 
  :name: one-2039
  :usedmemory: 529916
  :netrx: "349813672" 
one-2055: 
  :state: a
  :nettx: "55957833" 
  :usedcpu: "3.6" 
  :name: one-2055
  :usedmemory: 7864320
  :netrx: "1706234403" 
one-2036: 
  :state: a
  :nettx: "71730336" 
  :usedcpu: "3.4" 
  :name: one-2036
  :usedmemory: 7864320
  :netrx: "1145203212"

#5 Updated by Carlos Martín almost 10 years ago

  • Status changed from Closed to Assigned
  • Target version changed from Release 2.2 to Release 3.0
  • Resolution deleted (fixed)

If you change the DB with opennebula running, the cached object will still contain the old data. Sooner or later it is saved to the DB, overwriting your changes.

Could you provide more info? Any operations that could lead to the wrong counter?
Did you apply migrations, resubmit, cancel or suspend operations?

#6 Updated by Krzysztof Pawlik almost 10 years ago

Carlos Martín wrote:

> If you change the DB with opennebula running, the cached object will still contain the old data. Sooner or later it is saved to the DB, overwriting your changes.

Yes, I've found that out and updated the counters when oned was down.

> Could you provide more info? Any operations that could lead to the wrong counter?
> Did you apply migrations, resubmit, cancel or suspend operations?

No migrations or suspends, only a few resubmits and normal shutdowns. I'll try to provide more history info from the database when it happens again.
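
As a side note on the workaround mentioned above, this is a minimal sketch of fixing the counter by hand, assuming the default SQLite backend and that the counter lives in a running_vms column of a host_pool table. Those table and column names are an assumption here and may differ between releases, so verify them against your own one.db first. Run it only while oned is stopped, otherwise the cached object will overwrite the change:

# Sketch only: host_pool / running_vms are assumed names, check your schema.
import sqlite3

DB_PATH = "/var/lib/one/one.db"   # adjust to your installation
HOST_ID = 0                       # oid of the affected host (on-10-177-32-58 above)
CORRECT_RVM = 4                   # value counted from onevm list / the poll script

conn = sqlite3.connect(DB_PATH)
cur = conn.cursor()

# Inspect the current value first.
cur.execute("SELECT oid, running_vms FROM host_pool WHERE oid = ?", (HOST_ID,))
print(cur.fetchone())

# Write back the corrected counter, then restart oned.
cur.execute("UPDATE host_pool SET running_vms = ? WHERE oid = ?",
            (CORRECT_RVM, HOST_ID))
conn.commit()
conn.close()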

#7 Updated by Carlos Martín almost 10 years ago

  • Assignee deleted (Carlos Martín)

#8 Updated by Ruben S. Montero almost 10 years ago

  • Target version deleted (Release 3.0)
  • Affected Versions OpenNebula 3.0 added

#9 Updated by Ruben S. Montero almost 10 years ago

  • Status changed from Assigned to New

#10 Updated by Ruben S. Montero almost 10 years ago

  • Target version set to Release 3.4

#11 Updated by Ruben S. Montero almost 10 years ago

  • Target version changed from Release 3.4 to Release 3.2 - S0

#12 Updated by Ruben S. Montero almost 10 years ago

  • Status changed from New to Assigned

#13 Updated by Ruben S. Montero over 9 years ago

  • Target version changed from Release 3.2 - S0 to Release 3.2 - S1

#14 Updated by Ruben S. Montero over 9 years ago

  • Target version changed from Release 3.2 - S1 to Release 3.2 - Beta1

#15 Updated by Ruben S. Montero over 9 years ago

  • Status changed from Assigned to Closed
  • Resolution set to fixed

There was a bug when getting resources by name. In the case of the reported example (two hosts with the same name), or when deleting and adding hosts with the same name, that bug could have the effect described here. The same bug caused the same effect for networks.
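
For reference, a minimal sketch in plain Python (hypothetical names, not the actual OpenNebula code) of how a lookup by name goes wrong once two hosts share a name, so counter updates land on the wrong record:

# Hypothetical sketch of the ambiguous by-name lookup, not OpenNebula code.
hosts = [
    {"oid": 14, "name": "node4.vh", "running_vms": 0},
    {"oid": 15, "name": "node4.vh", "running_vms": 3},
]

def get_host_by_name(name):
    # Returns the first match, which is not necessarily the host the caller
    # meant once names are duplicated.
    return next(h for h in hosts if h["name"] == name)

# A VM finishes on host 15, but the by-name lookup decrements host 14 instead.
get_host_by_name("node4.vh")["running_vms"] -= 1

print(hosts[0])   # running_vms drifts to -1 on host 14
print(hosts[1])   # stays at 3 on host 15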

Closing this issue, hope not to have to reopen it again ;)
