Bug #317
Negative VM counters for hosts
| Status: | Closed | Start date: | 08/17/2010 |
|---|---|---|---|
| Priority: | Normal | Due date: | |
| Assignee: | - | % Done: | 0% |
| Category: | Core & System | | |
| Target version: | Release 3.2 - Beta1 | | |
| Resolution: | fixed | Pull request: | |
| Affected Versions: | OpenNebula 3.0 | | |
Description
It looks like there's a very ugly bug in ONE 1.4, namely `onehost list` saying:
$ onehost list
  ID NAME      RVM TCPU FCPU ACPU    TMEM    FMEM STAT
  14 node4.vh   -2  400  395  395 8192016 8130084 off
  15 node4.vh  -16  400  400  400 8192016 8136420 off
  16 node1.vh  -32  400  394  394 8192016 8026628 on
These negative numbers in place of the VM counters don't look very nice... At the same time `onevm list` shows:
$ onevm list
   ID USER     NAME     STAT CPU    MEM HOSTNAME TIME
  909 oneadmin cc-debia runn   0 262144 node1.vh 00 04:07:33
 1012 oneadmin nc-99    fail   0      0 node1.vh 00 00:00:37
 1022 oneadmin nc-106   fail   0      0 node1.vh 00 00:00:42
 1023 oneadmin nc-107   fail   0      0 node1.vh 00 00:00:47
 1024 oneadmin nc-108   fail   0      0 node1.vh 00 00:00:54
 1025 oneadmin nc-110   fail   0      0 node1.vh 00 00:01:03
 1026 oneadmin nc-111   fail   0      0 node1.vh 00 00:01:12
 1027 oneadmin nc-112   fail   0      0 node1.vh 00 00:01:14
 1028 oneadmin nc-113   fail   0      0 node1.vh 00 00:01:16
 1029 oneadmin nc-109   fail   0      0 node1.vh 00 00:01:24
 1030 oneadmin nc-114   fail   0      0 node1.vh 00 00:01:25
It looks like we only use node1.vh, but when I try to delete node4.vh I get:
$ onehost delete 14
Host still has associated VMs. It will be disabled instead.
$ onehost delete 15
Host still has associated VMs. It will be disabled instead.
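For readers hitting the same symptom: the `onehost delete` message suggests that deletion is refused whenever the host's running-VM counter is non-zero, so a counter that has drifted negative blocks deletion just like a positive one would. The following is only an illustrative sketch with an invented `Host` struct and `delete_host()` function, not OpenNebula's actual code:

```cpp
#include <iostream>
#include <string>

// Hypothetical, simplified host record -- not OpenNebula's real Host class.
struct Host
{
    int         oid;
    std::string name;
    int         running_vms;   // the RVM counter shown by `onehost list`
    bool        enabled;
};

// Guard sketch: any non-zero counter -- including one driven negative by the
// accounting bug -- makes the host look busy, so it is disabled, not deleted.
bool delete_host(Host& host)
{
    if (host.running_vms != 0)
    {
        std::cout << "Host still has associated VMs. "
                     "It will be disabled instead.\n";
        host.enabled = false;
        return false;
    }

    // Actual removal from the host pool would happen here.
    return true;
}

int main()
{
    Host node4{14, "node4.vh", -2, true};   // counter already corrupted
    delete_host(node4);                     // refuses to delete, disables it
}
```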
Associated revisions
Bug #317: Before migration, the RM checks if another migration has just started.
Bug #317: Before migration, the RM checks if another migration has just started.
(cherry picked from commit ff683fc5014c334080e363911b7157cf93ac8ffa)
History
#1 Updated by Javi Fontan almost 11 years ago
Could you send us the oned.log file so we can check what might be happening? The one.db file would be nice as well. Thanks
#2 Updated by Szabolcs Székelyi almost 11 years ago
Unfortunately our oned.log doesn't contain much info, since oned has been restarted a few times since this error appeared (so far we hadn't cared about it because everything else worked fine, but when we tried to remove a host it hit us badly), and oned clears the log on startup -- which is also a serious bug. Attached anyway, along with one.db.
#3 Updated by Ruben S. Montero over 10 years ago
- Category set to Core & System
- Status changed from New to Closed
- Assignee set to Carlos Martín
- Target version set to Release 2.2
- Resolution set to fixed
This seems to be caused by wrong counter handling when two migrations occur simultaneously on the same VM. The fix for this is in the one-2.0 branch and master.
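The associated revision ("Before migration, the RM checks if another migration has just started") points at a race where two overlapping migrate requests each adjust the host counters. Below is only a minimal illustrative model of that kind of guard, using invented types rather than OpenNebula's real RequestManager or VirtualMachine classes:

```cpp
#include <iostream>
#include <map>

// Hypothetical, simplified model of the race described above.
enum class VmState { RUNNING, MIGRATING };

struct Vm   { int oid; VmState state; int host_id; };
struct Host { int oid; int running_vms; };

std::map<int, Host> hosts;

// Without the guard, two near-simultaneous migrate requests for the same VM
// would each decrement the source host's counter, leaving it too low
// (eventually negative). The state check makes the second request a no-op.
bool migrate(Vm& vm, int target_host)
{
    if (vm.state == VmState::MIGRATING)
    {
        std::cout << "VM " << vm.oid
                  << ": another migration is already in progress\n";
        return false;
    }

    vm.state = VmState::MIGRATING;

    hosts[vm.host_id].running_vms--;   // leave the source host
    hosts[target_host].running_vms++;  // arrive at the target host
    vm.host_id = target_host;

    return true;
}

int main()
{
    hosts[14] = {14, 1};
    hosts[16] = {16, 0};

    Vm vm{909, VmState::RUNNING, 14};

    migrate(vm, 16);   // first request: counters move once
    migrate(vm, 16);   // duplicate request: rejected, counters untouched
}
```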
#4 Updated by Krzysztof Pawlik almost 10 years ago
I think this bug is still not solved completely:
oneadmin@ii1:~$ onehost list | grep on-10-177-32-58 -B 1
ID NAME CLUSTER RVM TCPU FCPU ACPU TMEM FMEM STAT
0 on-10-177-32-58 default 1 2400 2372 1700 141.9G 129.1G on
oneadmin@ii1:~$ onevm list | grep on-10-177-32-58
2036 oneadmin X runn 3 7.5G on-10-177-32-58 09 18:37:31
2039 oneadmin X runn 3 517.4M on-10-177-32-58 03 16:49:24
2055 oneadmin X runn 3 7.5G on-10-177-32-58 01 18:06:40
2060 oneadmin X runn 1 1.7G on-10-177-32-58 00 19:59:53
oneadmin@ii1:~$
This is from OpenNebula version 2.2.1. When I update the value of the `running_vms` column in the database to the correct value (4), it gets reset to the incorrect value (1) shortly after, so it's a counting bug somewhere in `oned`. Running `/tmp/vmm/kvm/poll --kvm` on this node correctly returns 4 entries:
root@on-10-177-32-58:~# /tmp/vmm/kvm/poll --kvm | base64 -d
---
one-2060:
:state: a
:nettx: "38320459"
:usedcpu: "1.7"
:name: one-2060
:usedmemory: 1781760
:netrx: "98076876"
one-2039:
:state: a
:nettx: "161570270"
:usedcpu: "3.5"
:name: one-2039
:usedmemory: 529916
:netrx: "349813672"
one-2055:
:state: a
:nettx: "55957833"
:usedcpu: "3.6"
:name: one-2055
:usedmemory: 7864320
:netrx: "1706234403"
one-2036:
:state: a
:nettx: "71730336"
:usedcpu: "3.4"
:name: one-2036
:usedmemory: 7864320
:netrx: "1145203212"
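To make the "counting bug somewhere in oned" point concrete: the RVM column is an incrementally maintained counter, so a single missed or doubled update drifts forever, whereas recomputing it from the VM records always matches what the poll script reports. A rough, self-contained cross-check sketch follows; the row structs are invented stand-ins, not the real one.db schema:

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical records standing in for rows of one.db -- not the real schema.
struct VmRow   { int oid; std::string state; std::string hostname; };
struct HostRow { int oid; std::string name; int running_vms; };

// Recompute the counter from the VM records and compare it with the stored
// incremental value; any mismatch points at a missed or doubled update.
void check_host(const HostRow& host, const std::vector<VmRow>& vms)
{
    long actual = std::count_if(vms.begin(), vms.end(),
        [&](const VmRow& vm)
        {
            return vm.hostname == host.name && vm.state == "runn";
        });

    if (actual != host.running_vms)
    {
        std::cout << host.name << ": stored RVM=" << host.running_vms
                  << " but " << actual << " VMs are actually running\n";
    }
}

int main()
{
    HostRow host{0, "on-10-177-32-58", 1};   // stored counter from the report

    std::vector<VmRow> vms = {
        {2036, "runn", "on-10-177-32-58"},
        {2039, "runn", "on-10-177-32-58"},
        {2055, "runn", "on-10-177-32-58"},
        {2060, "runn", "on-10-177-32-58"},
    };

    check_host(host, vms);   // stored RVM=1 but 4 VMs are actually running
}
```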
#5 Updated by Carlos Martín almost 10 years ago
- Status changed from Closed to Assigned
- Target version changed from Release 2.2 to Release 3.0
- Resolution deleted (fixed)
If you change the DB with opennebula running, the cached object will still contain the old data. Sooner or later it is saved to the DB, overwriting your changes.
Could you provide more info? Any operations that could lead to the wrong counter?
Did you apply migrations, resubmit, cancel or suspend operations?
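To illustrate the point about the cache: a manual UPDATE on one.db while oned is running only changes the on-disk copy, and the next time oned flushes its in-memory object the edit is silently reverted. Below is a toy model of that write-back behaviour, with invented class names rather than OpenNebula's internals:

```cpp
#include <iostream>

// Toy model of the cache/write-back behaviour described above.
struct HostRecord { int running_vms; };

struct Database
{
    HostRecord row{1};                          // the value oned last wrote

    HostRecord load() const        { return row; }
    void write(const HostRecord& r) { row = r; }
};

struct Oned
{
    Database&  db;
    HostRecord cached;                          // in-memory copy held by oned

    explicit Oned(Database& d) : db(d), cached(d.load()) {}

    // Periodic flush: whatever is in memory wins over what is on disk.
    void flush() { db.write(cached); }
};

int main()
{
    Database db;
    Oned oned(db);                              // oned caches running_vms = 1

    db.row.running_vms = 4;                     // manual "UPDATE" while oned runs
    std::cout << "after manual edit: " << db.row.running_vms << "\n";   // 4

    oned.flush();                               // next write-back reverts the edit
    std::cout << "after oned flush:  " << db.row.running_vms << "\n";   // 1
}
```

This is also why the next comment only updates the counters while oned is down.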
#6 Updated by Krzysztof Pawlik almost 10 years ago
Carlos Martín wrote:

> If you change the DB with opennebula running, the cached object will still contain the old data. Sooner or later it is saved to the DB, overwriting your changes.

Yes, I've found that out and updated the counters while `oned` was down.

> Could you provide more info? Any operations that could lead to the wrong counter?
> Did you apply migrations, resubmit, cancel or suspend operations?

No migrations or suspends, only a few resubmits and normal shutdowns. I'll try to provide more history info from the database when it happens again.
#7 Updated by Carlos Martín almost 10 years ago
- Assignee deleted (Carlos Martín)
#8 Updated by Ruben S. Montero almost 10 years ago
- Target version deleted (Release 3.0)
- Affected Versions OpenNebula 3.0 added
#9 Updated by Ruben S. Montero almost 10 years ago
- Status changed from Assigned to New
#10 Updated by Ruben S. Montero almost 10 years ago
- Target version set to Release 3.4
#11 Updated by Ruben S. Montero almost 10 years ago
- Target version changed from Release 3.4 to Release 3.2 - S0
#12 Updated by Ruben S. Montero almost 10 years ago
- Status changed from New to Assigned
#13 Updated by Ruben S. Montero over 9 years ago
- Target version changed from Release 3.2 - S0 to Release 3.2 - S1
#14 Updated by Ruben S. Montero over 9 years ago
- Target version changed from Release 3.2 - S1 to Release 3.2 - Beta1
#15 Updated by Ruben S. Montero over 9 years ago
- Status changed from Assigned to Closed
- Resolution set to fixed
There was a bug when getting resources by name. In the case of the example reported (two hosts with the same name), or when deleting and re-adding hosts with the same name, that bug can produce the effect described here. The same bug caused the same effect for networks...
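As a rough illustration of how a get-by-name lookup can corrupt counters when names collide (the original report does show two hosts named node4.vh), here is a sketch with structures invented for the example, not OpenNebula's pool classes:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Invented, minimal host pool -- not OpenNebula's HostPool API.
struct Host { int oid; std::string name; int running_vms; };

// Name-based lookup that simply returns the first match. With two hosts
// sharing a name, counter updates intended for one host can land on the
// other, which is enough to drive one of the counters negative over time.
Host* get_by_name(std::vector<Host>& pool, const std::string& name)
{
    for (Host& h : pool)
    {
        if (h.name == name)
        {
            return &h;
        }
    }
    return nullptr;
}

int main()
{
    std::vector<Host> pool = {
        {14, "node4.vh", 0},   // old entry
        {15, "node4.vh", 0},   // re-added host with the same name
    };

    // A VM finishes on host 15, but the name lookup resolves to host 14:
    get_by_name(pool, "node4.vh")->running_vms--;

    for (const Host& h : pool)
    {
        std::cout << "host " << h.oid << " RVM=" << h.running_vms << "\n";
    }
    // host 14 RVM=-1, host 15 RVM=0
}
```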
Closing this issue, hope not to have to reopen it again ;)