Bug #3419

leases not correctly freed

Added by Edouard Bourguignon over 6 years ago. Updated over 6 years ago.

Status:ClosedStart date:12/09/2014
Priority:NormalDue date:
Assignee:Carlos Martín% Done:

0%

Category:Core & System
Target version:Release 4.10.2
Resolution:worksforme Pull request:
Affected Versions:OpenNebula 4.10

Description

I still don't know yet how to reproduce, but after deploying and then deleting some VMs, a few IP are still marked as used in AR of their VNET.

onevnet show 126

VIRTUAL NETWORK 126 INFORMATION                                                 
ID             : 126                 
NAME           : x0010.data          
USER           : oneadmin            
GROUP          : oneadmin            
CLUSTER        : -                   
BRIDGE         : br0                 
VLAN           : Yes                 
VLAN ID        : 2010                
USED LEASES    : 1                   

PERMISSIONS                                                                     
OWNER          : um-                 
GROUP          : ---                 
OTHER          : ---                 

VIRTUAL NETWORK TEMPLATE                                                        
BRIDGE="br0" 
CLUSTER="private_networks" 
DESCRIPTION="reseau DATA pour x0010" 
NETWORK_ADDRESS="192.168.2.0" 
NETWORK_MASK="255.255.255.0" 
PHYDEV="" 
VLAN="YES" 
VLAN_ID="2010" 

ADDRESS RANGE POOL                                                              
 AR TYPE    SIZE LEASES               MAC              IP          GLOBAL_PREFIX
  0 IP4      190      1 02:00:c0:a8:02:0a    192.168.2.10                      -

LEASES                                                                          
AR  OWNER                    MAC              IP                      IP6_GLOBAL
0   VM : 1147  02:00:c0:a8:02:12    192.168.2.18 

onevm show 1147

VIRTUAL MACHINE 1147 INFORMATION                                                
ID                  : 1147                
NAME                : appli0.x0010        
USER                : oneadmin            
GROUP               : oneadmin            
STATE               : DONE                
LCM_STATE           : LCM_INIT            
RESCHED             : No                  
START TIME          : 12/03 11:42:50      
END TIME            : 12/03 11:50:39      
DEPLOY ID           : -                   

VIRTUAL MACHINE MONITORING                                                      
USED MEMORY         : 0K                  
USED CPU            : 0                   
NET_TX              : 0K                  
NET_RX              : 0K                  

PERMISSIONS                                                                     
OWNER               : um-                 
GROUP               : ---                 
OTHER               : ---                 

VM DISKS                                                                        
 ID TARGET IMAGE                               TYPE SAVE SAVE_AS
  0 sda    soclelinux-2012-2.31-jboss430-x86_6 file   NO       -

VM NICS                                                                         
 ID NETWORK              VLAN BRIDGE       IP              MAC              
  0 admin                 yes br0          192.168.199.3   02:00:c0:a8:c7:03
  1 x0010.appli           yes br0          192.168.1.19    02:00:c0:a8:01:13
  2 x0010.data            yes br0          192.168.2.18    02:00:c0:a8:02:12

USER TEMPLATE                                                                   
COMMENT="Serveur applicatif pour la plateforme x0010" 
HYPERVISOR="kvm" 
LOGO="images/logos/centos.png" 
SCHED_REQUIREMENTS="TYPE=\"HYPERVISOR\"" 

VIRTUAL MACHINE TEMPLATE                                                        
AUTOMATIC_REQUIREMENTS="!(PUBLIC_CLOUD = YES)" 
CONTEXT=[
  DISK_ID="1",
  ETH0_DNS="10.156.0.2",
  ETH0_GATEWAY="192.168.199.254",
  ETH0_IP="192.168.199.3",
  ETH0_MAC="02:00:c0:a8:c7:03",
  ETH1_DNS="192.168.1.254",
  ETH1_GATEWAY="192.168.1.254",
  ETH1_IP="192.168.1.19",
  ETH1_MAC="02:00:c0:a8:01:13",
  ETH1_MASK="255.255.255.0",
  ETH1_NETWORK="192.168.1.0",
  ETH2_IP="192.168.2.18",
  ETH2_MAC="02:00:c0:a8:02:12",
  ETH2_MASK="255.255.255.0",
  ETH2_NETWORK="192.168.2.0",
  FILES_DS="/var/lib/one//datastores/2/1c9b55dd6cccaf334177609daf9ec5d9:'init.sh' ",
  NETWORK="YES",
  PROJET="x0010",
  PUPPET_ENV="x0010",
  PUPPET_PORT="8140",
  PUPPET_SERVER="192.168.199.200",
  SET_HOSTNAME="appli0.x0010",
  TARGET="hda" ]
CPU="0.1" 
GRAPHICS=[
  KEYMAP="fr",
  LISTEN="0.0.0.0",
  PORT="7047",
  TYPE="VNC" ]
MEMORY="768" 
OS=[
  ARCH="x86_64" ]
TEMPLATE_ID="5" 
VMID="1147" 

It's impossible to delete vnet 126 because openNebula says there is on IP in use in the leases. Ip is belonging to a VM that has been deleted and doesn't not exist anymore.
Any idea how to fix that ?

onevnet delete 126
[VirtualNetworkDelete] Cannot delete virtual network. Can not remove a virtual network with leases in use

Best regards,

Edouard.

Capture d_écran de 2014-12-10 15_24_17.png - accounting screenshot (111 KB) Edouard Bourguignon, 12/10/2014 03:14 PM

one-121.xml Magnifier - one-121.xml (12.3 KB) Edouard Bourguignon, 12/11/2014 12:10 PM

one-1147.xml Magnifier - one-1147.xml (4.69 KB) Edouard Bourguignon, 12/11/2014 12:13 PM

History

#1 Updated by Ruben S. Montero over 6 years ago

  • Category set to Core & System

#2 Updated by Edouard Bourguignon over 6 years ago

It seems I have a similar problem with the accounting. It shows an old deleted VM that is still consuming CPU hours (this VM is now using any IP lease). Should I open a new bug report?

#3 Updated by Ruben S. Montero over 6 years ago

  • Assignee set to Carlos Martín
  • Target version set to Release 4.10.2

Thanks Edouard, we'll take a look art both issues within this ticket. thanks again for your feedbak!

#4 Updated by Edouard Bourguignon over 6 years ago

Here is a screenshot of the accounting problem. VM 121 is using a lot of CPU hours.

Here is the running VMs:

    ID USER     GROUP    NAME            STAT UCPU    UMEM HOST             TIME
   171 oneadmin oneadmin puppet-0.s22    runn   21      1G pvidgsh104  13d 03h09
   172 oneadmin oneadmin puppetenc-0.s22 runn    0    768M pvidgsh103  13d 03h09
  1290 edouard  users    socle-2012-2.31 poff    0      0K pvidgsh101   0d 03h25
  1291 edouard  users    modif-13-1291   runn    0    768M pvidgsh104   0d 01h37
  1292 oneadmin oneadmin koan-1292       runn    0    512M pvidgsh102   0d 00h03

Here is details on VM 121:

VIRTUAL MACHINE 121 INFORMATION                                                 
ID                  : 121                 
NAME                : CentOS6-121         
USER                : benoit              
GROUP               : users               
STATE               : DONE                
LCM_STATE           : LCM_INIT            
RESCHED             : No                  
START TIME          : 11/18 14:41:16      
END TIME            : 11/26 11:00:25      
DEPLOY ID           : one-121             

VIRTUAL MACHINE MONITORING                                                      
USED MEMORY         : 0K                  
USED CPU            : 0                   
NET_TX              : 0K                  
NET_RX              : 1K                  

PERMISSIONS                                                                     
OWNER               : um-                 
GROUP               : ---                 
OTHER               : ---                 

VM DISKS                                                                        
 ID TARGET IMAGE                               TYPE SAVE SAVE_AS
  0 vda    CentOS 6.5                          file   NO       -

VM NICS                                                                         
 ID NETWORK              VLAN BRIDGE       IP              MAC              
  0 admin                 yes br0          192.168.199.108 02:00:c0:a8:c7:6c
  1 appli                 yes br0          192.168.200.75  02:00:c0:a8:c8:4b

VIRTUAL MACHINE HISTORY                                                         
SEQ HOST            ACTION             DS           START        TIME     PROLOG
  0 pvidgsh103      live-migrate        0  11/18 14:41:35   2d 01h22m   0h00m14s
  1 pvidgsh102      live-migrate        0  11/20 16:04:15   0d 00h21m   0h00m00s
  2 pvidgsh103      stop                0  11/20 16:26:00   0d 00h15m   0h00m00s
  3 pvidgsh104      none                0  11/20 16:52:24   0d 00h43m   0h00m00s
  4 pvidgsh104      none                0  11/20 17:53:11  19d 21h36m   0h00m00s
  5 pvidgsh104      none                0  11/20 17:54:10  19d 21h35m   0h00m00s
  6 pvidgsh104      none                0  11/20 17:56:13  19d 21h33m   0h00m00s
  7 pvidgsh104      none                0  11/20 17:56:41  19d 21h33m   0h00m00s
  8 pvidgsh104      none                0  11/20 17:57:44  19d 21h32m   0h00m00s
  9 pvidgsh104      none                0  11/20 17:58:48  19d 21h31m   0h00m00s
 10 pvidgsh104      none                0  11/20 17:59:46  19d 21h30m   0h00m00s
 11 pvidgsh104      none                0  11/20 18:01:52  19d 21h28m   0h00m00s
 12 pvidgsh104      stop                0  11/20 18:03:04   4d 21h58m   0h00m00s
 13 pvidgsh101      stop                0  11/25 16:07:54   0d 01h29m   0h00m00s
 14 pvidgsh103      stop                0  11/25 17:42:24   0d 00h00m   0h00m00s

USER TEMPLATE                                                                   
ERROR="Tue Nov 25 17:42:24 2014 : Error restoring VM: Could not restore from /var/lib/one//datastores/0/121/checkpoint" 
HYPERVISOR="kvm" 
LOGO="images/logos/centos.png" 

VIRTUAL MACHINE TEMPLATE                                                        
AUTOMATIC_REQUIREMENTS="!(PUBLIC_CLOUD = YES)" 
CONTEXT=[
  DISK_ID="1",
  ETH0_DNS="10.156.0.2",
  ETH0_GATEWAY="192.168.199.254",
  ETH0_IP="192.168.199.108",
  ETH0_MAC="02:00:c0:a8:c7:6c",
  ETH1_IP="192.168.200.75",
  ETH1_MAC="02:00:c0:a8:c8:4b",
  NETWORK="YES",
  TARGET="hda" ]
CPU="1" 
GRAPHICS=[
  KEYMAP="fr",
  LISTEN="0.0.0.0",
  PORT="6021",
  TYPE="VNC" ]
MEMORY="768" 
OS=[
  ARCH="x86_64" ]
TEMPLATE_ID="0" 
VMID="121" 

I have already done a onedb fsck with no luck. Glad to help :)

#5 Updated by Carlos Martín over 6 years ago

Hi,

Could you please send us the individual VM logs for both VMs? We would also need the 'onevm show -x' output.
Thanks.

#6 Updated by Carlos Martín over 6 years ago

About the accounting problem, definitely looks like the history records 4 to 11 do not have an end time, and are all still considered open (running). Let's see if the logs provide any clue to find out why.

#7 Updated by Edouard Bourguignon over 6 years ago

Here is the onevm show 121 -x.

For the individual VM logs do you mean the libvirt log? Or some file in /var/log/one? Because in /var/log/one/121.log I only have this:

Tue Nov 25 17:42:24 2014 [Z0][VMM][E]: Could not restore from /var/lib/one//datastores/0/121/checkpoint
Tue Nov 25 17:42:24 2014 [Z0][VMM][E]: Error restoring VM: Could not restore from /var/lib/one//datastores/0/121/checkpoint
Wed Nov 26 11:00:25 2014 [Z0][TM][W]: Ignored: TRANSFER SUCCESS 121 -

I remember I try to use snaphot on the disk of this vm without success, but the VM was running VM. At least this one in particular.

I also add the onevm show 1147 -x. This VM has only this in its log:

Wed Dec  3 11:50:39 2014 [Z0][DiM][I]: New VM state is DONE.

For the accounting problem, here is log for VM 11:

Wed Nov 12 18:26:59 2014 [Z0][DiM][I]: New VM state is ACTIVE.
Wed Nov 12 18:27:00 2014 [Z0][LCM][I]: New VM state is PROLOG.
Wed Nov 12 18:27:32 2014 [Z0][LCM][I]: New VM state is BOOT
Wed Nov 12 18:27:32 2014 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/11/deployment.0
Wed Nov 12 18:27:32 2014 [Z0][VMM][I]: ExitCode: 0
Wed Nov 12 18:27:32 2014 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Wed Nov 12 18:27:32 2014 [Z0][VMM][I]: ExitCode: 0
Wed Nov 12 18:27:32 2014 [Z0][VMM][I]: Successfully execute virtualization driver operation: deploy.
Wed Nov 12 18:27:33 2014 [Z0][VMM][I]: post: Executed "sudo ovs-vsctl set Port vnet2 tag=199".
Wed Nov 12 18:27:33 2014 [Z0][VMM][I]: post: Executed "sudo ovs-ofctl add-flow br0 in_port=5,arp,dl_src=02:00:c0:a8:c7:09,priority=45000,actions=drop".
Wed Nov 12 18:27:33 2014 [Z0][VMM][I]: post: Executed "sudo ovs-ofctl add-flow br0 in_port=5,arp,dl_src=02:00:c0:a8:c7:09,nw_src=192.168.199.9,priority=46000,actions=normal".
Wed Nov 12 18:27:33 2014 [Z0][VMM][I]: post: Executed "sudo ovs-ofctl add-flow br0 in_port=5,dl_src=02:00:c0:a8:c7:09,priority=40000,actions=normal".
Wed Nov 12 18:27:33 2014 [Z0][VMM][I]: post: Executed "sudo ovs-ofctl add-flow br0 in_port=5,priority=39000,actions=drop".
Wed Nov 12 18:27:33 2014 [Z0][VMM][I]: ExitCode: 0
Wed Nov 12 18:27:33 2014 [Z0][VMM][I]: Successfully execute network driver operation: post.
Wed Nov 12 18:27:33 2014 [Z0][LCM][I]: New VM state is RUNNING
Wed Nov 12 19:16:20 2014 [Z0][LCM][I]: New VM state is CLEANUP.
Wed Nov 12 19:16:21 2014 [Z0][DiM][I]: New VM state is DONE
Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 Driver command for 11 cancelled

Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 error: failed to get domain 'one-11'

Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 error: Domaine non trouvé : no domain with matching name 'one-11'

Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 ExitCode: 0

Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 Successfully execute virtualization driver operation: cancel.

Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 clean: Executed "sudo ovs-ofctl del-flows br0 ,in_port=5,dl_src=02:00:c0:a8:c7:09".

Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 clean: Executed "sudo ovs-ofctl del-flows br0 ,arp,in_port=5,dl_src=02:00:c0:a8:c7:09".

Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 clean: Executed "sudo ovs-ofctl del-flows br0 ,arp,in_port=5,dl_src=02:00:c0:a8:c7:09,arp_spa=192.168.199.9".

Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 clean: Executed "sudo ovs-ofctl del-flows br0 in_port=5".

Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 ExitCode: 0

Wed Nov 12 19:16:21 2014 [Z0][VMM][W]: Ignored: LOG I 11 Successfully execute network driver operation: clean.

Wed Nov 12 19:16:22 2014 [Z0][VMM][W]: Ignored: LOG I 11 Successfully execute transfer manager driver operation: tm_delete.

Wed Nov 12 19:16:22 2014 [Z0][VMM][W]: Ignored: LOG I 11 Successfully execute transfer manager driver operation: tm_delete.

Wed Nov 12 19:16:22 2014 [Z0][VMM][W]: Ignored: CLEANUP SUCCESS 11

It seem interesting that it did find domain one-11 on the hypervisor, but that the cleanup was however successful. Same for other VM 4 to 10. Is it possible that this "failed to get domain 'one-11'" error could result from an unavailable ssh connection for oneadmin user?

#8 Updated by Edouard Bourguignon over 6 years ago

Sorry I meant "It seem interesting that it did not find domain one-11 on the hypervisor, but that the cleanup was however successful."

#9 Updated by Carlos Martín over 6 years ago

Hi there,

I've been trying to reproduce your problem with the leases, and I found that they can get stuck as used when the AR is resized to be smaller, see #3443.

Can it be the reason in your case? Did you change your AR size?

#10 Updated by Edouard Bourguignon over 6 years ago

No I'm pretty sure I didn't change the AR size. All the vnet + AR were created using some scripts (stress tests to create many vnets/VM). So today I try to rerun this scripts and the problem didn't occur.
The only thing I can think of is that during the first deployement and deletion of all this VMs, when the problem did occur, the network were really busy, and then the NFS get a bit stuck, perhaps preventing some SSH connections to the hypervisors. I remember having some VM still running on the hypervisors but shown as DONE/deleted from OpenNebula.
It's not a big deal if there is a procedure to clean this problem. Is it the case? How to I free this leases?

#11 Updated by Ruben S. Montero over 6 years ago

  • Status changed from Pending to Closed
  • Resolution set to worksforme

Marking this as worksforme as we cannot reproduce it.

The only way to clean the problem would be by implementing #3056. Meanwhile you can add a new AR with the addresses not freed, this is not very elegant but functionally oned could continue using those leases.

#12 Updated by Edouard Bourguignon over 6 years ago

Great #3056 is what we need to go in production

Also available in: Atom PDF