Bug #1324

stopped kvm VMs can't be resumed (using ssh transfer driver)

Added by Jochem Ippers over 7 years ago. Updated over 7 years ago.

Status:ClosedStart date:06/30/2012
Priority:NormalDue date:
Assignee:Jaime Melis% Done:

0%

Category:Drivers - Auth
Target version:Release 3.6
Resolution:fixed Pull request:
Affected Versions:OpenNebula 3.4

Description

Hi,
when using the ssh transfer manager driver stopped VMs can't be resumed. (While resuming after the local suspending works.)
The called restore-script (/var/tmp/one/vmm/kvm/restore) doesn't work for the remote stop operation.

Kind regards.
Jochem

from the vm log:

1. stopped VM moved from node to frontend:

Sat Jun 30 00:31:10 2012 [LCM][I]: New VM state is RUNNING
Sat Jun 30 00:37:51 2012 [VMM][I]: ExitCode: 0
...
Sat Jun 30 02:27:45 2012 [LCM][I]: New VM state is SAVE_STOP
Sat Jun 30 02:27:45 2012 [VMM][I]: save: Moving old checkpoint file /var/lib/one//datastores/0/11/checkpoint
Sat Jun 30 02:27:45 2012 [VMM][I]: ExitCode: 0
Sat Jun 30 02:27:45 2012 [VMM][I]: Successfully execute virtualization driver operation: save.
Sat Jun 30 02:27:45 2012 [VMM][I]: ExitCode: 0
Sat Jun 30 02:27:45 2012 [VMM][I]: Successfully execute network driver operation: clean.
Sat Jun 30 02:27:45 2012 [LCM][I]: New VM state is EPILOG_STOP
Sat Jun 30 02:27:46 2012 [TM][I]: ExitCode: 0
Sat Jun 30 02:27:55 2012 [TM][I]: mv: Moving a089:/var/lib/one//datastores/0/11 to b006:/var/lib/one/datastores/0/11
Sat Jun 30 02:27:55 2012 [TM][I]: ExitCode: 0
Sat Jun 30 02:27:55 2012 [DiM][I]: New VM state is STOPPED
Sat Jun 30 02:29:22 2012 [DiM][I]: New VM state is PENDING.
Sat Jun 30 02:29:41 2012 [DiM][I]: New VM state is ACTIVE.
Sat Jun 30 02:29:41 2012 [LCM][I]: New VM state is PROLOG.
Sat Jun 30 02:29:41 2012 [TM][I]: ExitCode: 0
Sat Jun 30 02:29:41 2012 [TM][I]: mv: Moving b006:/var/lib/one/datastores/0/11 to b006:/var/lib/one//datastores/0/11
Sat Jun 30 02:29:41 2012 [TM][I]: ExitCode: 0
Sat Jun 30 02:29:41 2012 [LCM][I]: New VM state is BOOT
Sat Jun 30 02:29:41 2012 [VMM][I]: ExitCode: 0
Sat Jun 30 02:29:41 2012 [VMM][I]: Successfully execute network driver operation: pre.
Sat Jun 30 02:29:42 2012 [VMM][I]: Command execution fail: /var/tmp/one/vmm/kvm/restore /var/lib/one//datastores/0/11/checkpoint b006 11 b006
Sat Jun 30 02:29:42 2012 [VMM][E]: restore: Command "virsh --connect qemu:///system restore /var/lib/one//datastores/0/11/checkpoint" failed: error: Failed to restore domain from /var/lib/one//datastores/0/11/checkpoint
Sat Jun 30 02:29:42 2012 [VMM][I]: error: Failed to create file '/var/lib/one//datastores/0/11/checkpoint': Operation not permitted
Sat Jun 30 02:29:42 2012 [VMM][E]: Could not restore from /var/lib/one//datastores/0/11/checkpoint
Sat Jun 30 02:29:42 2012 [VMM][I]: ExitCode: 1
Sat Jun 30 02:29:42 2012 [VMM][I]: Failed to execute virtualization driver operation: restore.
Sat Jun 30 02:29:42 2012 [VMM][E]: Error restoring VM: Could not restore from /var/lib/one//datastores/0/11/checkpoint
Sat Jun 30 02:29:42 2012 [DiM][I]: New VM state is FAILED

2. VM runs on the frontend and is stopped there:

Sat Jun 30 02:34:30 2012 [LCM][I]: New VM state is RUNNING
Sat Jun 30 02:34:40 2012 [LCM][I]: New VM state is SAVE_STOP
Sat Jun 30 02:34:41 2012 [VMM][I]: save: Moving old checkpoint file /var/lib/one//datastores/0/11/checkpoint
Sat Jun 30 02:34:41 2012 [VMM][I]: ExitCode: 0
Sat Jun 30 02:34:41 2012 [VMM][I]: Successfully execute virtualization driver operation: save.
Sat Jun 30 02:34:41 2012 [VMM][I]: ExitCode: 0
Sat Jun 30 02:34:41 2012 [VMM][I]: Successfully execute network driver operation: clean.
Sat Jun 30 02:34:41 2012 [LCM][I]: New VM state is EPILOG_STOP
Sat Jun 30 02:34:41 2012 [TM][I]: ExitCode: 0
Sat Jun 30 02:34:42 2012 [TM][I]: mv: Moving b006:/var/lib/one//datastores/0/11 to b006:/var/lib/one/datastores/0/11
Sat Jun 30 02:34:42 2012 [TM][I]: ExitCode: 0
Sat Jun 30 02:34:42 2012 [DiM][I]: New VM state is STOPPED
Sat Jun 30 02:35:14 2012 [DiM][I]: New VM state is PENDING.
Sat Jun 30 02:35:41 2012 [DiM][I]: New VM state is ACTIVE.
Sat Jun 30 02:35:41 2012 [LCM][I]: New VM state is PROLOG.
Sat Jun 30 02:35:41 2012 [TM][I]: ExitCode: 0
Sat Jun 30 02:35:41 2012 [TM][I]: mv: Moving b006:/var/lib/one/datastores/0/11 to b006:/var/lib/one//datastores/0/11
Sat Jun 30 02:35:41 2012 [TM][I]: ExitCode: 0
Sat Jun 30 02:35:41 2012 [LCM][I]: New VM state is BOOT
Sat Jun 30 02:35:41 2012 [VMM][I]: ExitCode: 0
Sat Jun 30 02:35:41 2012 [VMM][I]: Successfully execute network driver operation: pre.
Sat Jun 30 02:35:41 2012 [VMM][I]: Command execution fail: /var/tmp/one/vmm/kvm/restore /var/lib/one//datastores/0/11/checkpoint b006 11 b006
Sat Jun 30 02:35:41 2012 [VMM][E]: restore: Command "virsh --connect qemu:///system restore /var/lib/one//datastores/0/11/checkpoint" failed: error: Failed to restore domain from /var/lib/one//datastores/0/11/checkpoint
Sat Jun 30 02:35:41 2012 [VMM][I]: error: Failed to create file '/var/lib/one//datastores/0/11/checkpoint': Operation not permitted
Sat Jun 30 02:35:41 2012 [VMM][E]: Could not restore from /var/lib/one//datastores/0/11/checkpoint
Sat Jun 30 02:35:41 2012 [VMM][I]: ExitCode: 1
Sat Jun 30 02:35:41 2012 [VMM][I]: Failed to execute virtualization driver operation: restore.
Sat Jun 30 02:35:41 2012 [VMM][E]: Error restoring VM: Could not restore from /var/lib/one//datastores/0/11/checkpoint
Sat Jun 30 02:35:41 2012 [DiM][I]: New VM state is FAILED

Associated revisions

Revision fb42cdd8
Added by Jaime Melis over 7 years ago

Bug #1324: stopped kvm VMs can't be resumed (using ssh transfer driver)

Revision 921aa55b
Added by Jaime Melis over 7 years ago

Bug #1324: stopped kvm VMs can't be resumed (using ssh transfer driver)
(cherry picked from commit fb42cdd8ce3f4f0b0fb85c99d94e870a573ce786)

History

#1 Updated by Ruben S. Montero over 7 years ago

  • Status changed from New to Assigned
  • Assignee set to Jaime Melis

#2 Updated by Jaime Melis over 7 years ago

  • Status changed from Assigned to Closed
  • Resolution set to worksforme

It works for me. It seems like a problem with KVM configuration, especifically dynamic_ownership:
http://opennebula.org/documentation:rel3.6:kvmg#kvm_configuration

Closing the ticket.

#3 Updated by Jochem Ippers over 7 years ago

Hi,
I found the reason for the failures (my kvm settings were ok). It's a small thing in the shell script (Opennebula v.3.4.1):

/var/lib/one/remotes/tm/ssh/mv

The problem is in line 54:

if [ "$SRC" == "$DST" ]; then

If $SRC and $DST are the same path this comparison still fails because as you can see from the logs above their values are different:
[TM][I]: mv: Moving b006:/var/lib/one/datastores/0/11 to b006:/var/lib/one//datastores/0/11

The destination path has a double slash in it while the source path doesn't have it, and so the comparison doesn't return true as it should.

So to correct this and get rid of the double slashes you can put the usual two short calls after sourcing $TMCOMMON in the mv script:

SRC=`fix_dir_slashes $SRC`
DST=`fix_dir_slashes $DST`

(Maybe the problem appears in othe scripts too. Searching log files for double slashes could tell about it.)

Greetings from Aachen, germany

P.S.: Congratulations to the european football championship, anyway spain was my bet from the start ;-)

mini patch:


--- mv.orig     2012-07-03 19:44:56.122917522 +0200
+++ mv  2012-07-03 20:30:15.837082633 +0200
@@ -36,6 +36,10 @@
 # Return if moving a disk, we will move them when moving the whole system_ds
 # directory for the VM
 #-------------------------------------------------------------------------------
+
+SRC=`fix_dir_slashes $SRC`
+DST=`fix_dir_slashes $DST`
+
 SRC_PATH=`arg_path $SRC`
 DST_PATH=`arg_path $DST`

#4 Updated by Ruben S. Montero over 7 years ago

  • Status changed from Closed to Assigned
  • Assignee changed from Jaime Melis to Javi Fontan
  • Resolution deleted (worksforme)

Re-openning it to include fix_dir slashes probably in arg_path

#5 Updated by Javi Fontan over 7 years ago

arg_path function already calls fix_dir_slashes since e4017e610685a8466aa4fc7ed122e4800585d052 (ONE 3.4 beta). Are you sure that you have one 3.4.1?

#6 Updated by Jochem Ippers over 7 years ago

Hi,
yes, it is rpm package opennebula-3.4.1-2.x86_64 on Opensuse 11.4.

#7 Updated by Jochem Ippers over 7 years ago

Ah, that's the mistake:
The path comparison in line 54 is done between $SRC and $DST, but arg_path is only called ON both of them for the value assertion of SRC_PATH and DST_PATH in line 39 and 40:

 39 SRC_PATH=`arg_path $SRC`
 40 DST_PATH=`arg_path $DST`

SRC and DST themselves don't get corrected with arg_path before the path comparison.

So the comparison should rather be like that:

54 if [ "$SRC_PATH" == "$DST_PATH" ]; then
55     log "Not moving $SRC_PATH to $DST_PATH, they are the same path" 

#8 Updated by Jaime Melis over 7 years ago

  • Status changed from Assigned to Closed
  • Assignee changed from Javi Fontan to Jaime Melis
  • Resolution set to fixed

Hello Jochem,

apologies, the reason why it was working for me is that I was resuming the VM in the same host. You're right, it's a bug and we have applied your patch.

Thanks a lot for reporting this!

#9 Updated by Jochem Ippers over 7 years ago

Hi,
I think that all shell script variables that get initialized with the command line arguments $1, $2... and that contain paths as their values should be corrected this way (with the function fix_dir_slashes to get rid of double slashes). Because sooner or later that could again cause problems. Also the annoying double slashes in the logs would be gone then.

You can easily find the candidates to change for $1 this way:
grep -r ="\$1" /var/tmp/one/*

So, possible candidates would be the variables:
SRC, DST, src, dest_host, domain, HYPERVISOR_DIR

Kind regards
Jochem

Also available in: Atom PDF