Bug #191

onevm resume (restore) deletes checkpoint and vm disks after failed restore

Added by Marlon Nerling over 11 years ago. Updated over 11 years ago.

Status: Closed
Start date: 02/10/2010
Priority: High
Due date:
Assignee: Ruben S. Montero
% Done: 0%
Category: Core & System
Target version: Release 1.4.2
Resolution: worksforme
Pull request:
Affected Versions:

Description

If a restore procedure fails, OpenNebula simply deletes all files of the VM.
In my opinion it should retry the restore if for some reason the host could not restore it, and by no means delete the disks!

Wed Feb 10 09:06:30 2010 [LCM][I]: New VM state is SAVE_STOP
Wed Feb 10 09:06:35 2010 [LCM][I]: New VM state is EPILOG_STOP
Wed Feb 10 09:11:15 2010 [TM][I]: tm_mv.sh: Source : 172.22.0.6:/var/lib/one//3256/images
Wed Feb 10 09:11:15 2010 [TM][I]: tm_mv.sh: Destination: 172.22.0.6:/one-master/var/lib/one//3256
Wed Feb 10 09:11:15 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.6 mkdir -p /one-master/var/lib/one//3256".
Wed Feb 10 09:11:15 2010 [TM][I]: tm_mv.sh: Holding 172.22.0.6:/var/lib/one//3256/images back
Wed Feb 10 09:11:15 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.6 cp -ar /var/lib/one//3256/images /one-master/var/lib/one//3256".
Wed Feb 10 09:11:15 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.6 rm -rf /var/lib/one//3256/images".
Wed Feb 10 09:11:15 2010 [DiM][I]: New VM state is STOPPED
Wed Feb 10 09:31:07 2010 [DiM][I]: New VM state is PENDING.
Wed Feb 10 09:31:11 2010 [DiM][I]: New VM state is ACTIVE.
Wed Feb 10 09:31:11 2010 [LCM][I]: New VM state is PROLOG.
Wed Feb 10 09:35:38 2010 [TM][I]: tm_mv.sh: Source : 172.22.0.12:/one-master//var/lib/one/3256/images
Wed Feb 10 09:35:38 2010 [TM][I]: tm_mv.sh: Destination: 172.22.0.12:/var/lib/one//3256
Wed Feb 10 09:35:38 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.12 mkdir -p /var/lib/one//3256".
Wed Feb 10 09:35:38 2010 [TM][I]: tm_mv.sh: Holding abatesting:/var/lib/one/3256/images back
Wed Feb 10 09:35:38 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.12 cp -ar /one-master//var/lib/one/3256/images /var/lib/one//3256".
Wed Feb 10 09:35:38 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.12 rm -rf /one-master//var/lib/one/3256/images".
Wed Feb 10 09:35:38 2010 [LCM][I]: New VM state is BOOT
Wed Feb 10 09:35:39 2010 [VMM][I]: Command execution fail: virsh restore /var/lib/one//3256/images/checkpoint
Wed Feb 10 09:35:39 2010 [VMM][I]: STDERR follows.
Wed Feb 10 09:35:39 2010 [VMM][I]: Warning: Permanently added '172.22.0.12' (RSA) to the list of known hosts.
Wed Feb 10 09:35:39 2010 [VMM][I]: libvir: QEMU error : operation failed: failed to start VM
Wed Feb 10 09:35:39 2010 [VMM][I]: error: Failed to restore domain from /var/lib/one//3256/images/checkpoint
Wed Feb 10 09:35:39 2010 [VMM][I]: ExitCode: 1
Wed Feb 10 09:35:39 2010 [VMM][E]: Error restoring VM, -
Wed Feb 10 09:35:39 2010 [DiM][I]: New VM state is FAILED
Wed Feb 10 09:36:03 2010 [TM][W]: Ignored: LOG - 3256 tm_delete.sh: Source: 172.22.0.12:/var/lib/one//3256/images
Wed Feb 10 09:36:03 2010 [TM][W]: Ignored: LOG - 3256 tm_delete.sh: Destiny: 172.22.0.12:/var/lib/one//3256/images
Wed Feb 10 09:36:03 2010 [TM][W]: Ignored: LOG - 3256 tm_delete.sh: Executed "ssh 172.22.0.12 mkdir -p /var/lib/one//3256".
Wed Feb 10 09:36:03 2010 [TM][W]: Ignored: LOG - 3256 tm_delete.sh: Holding 172.22.0.12:/var/lib/one//3256/images back
Wed Feb 10 09:36:03 2010 [TM][W]: Ignored: LOG - 3256 tm_delete.sh: Executed "ssh 172.22.0.12 rm -rf /var/lib/one//3256".
Wed Feb 10 09:36:03 2010 [TM][W]: Ignored: TRANSFER SUCCESS 3256 -

one_vmm_kvm.rb.diff - Patch to retry virsh restore if failed (839 Bytes) Marlon Nerling, 03/05/2010 05:06 PM

Associated revisions

Revision 61a04f8e
Added by Tino Vázquez about 8 years ago

Bug #191: Fix wrong vifs usage (using -c to copy, instead of -p to upload)

Revision 4392349d
Added by Abel Coronado over 4 years ago

F #5001 added trash when push terminate button (#191)

History

#1 Updated by Marlon Nerling over 11 years ago

I think the real problem is that the scheduler is not selecting a host to restore/resume the VM, but taking the old one.
If that host is full and has no free RAM/CPU, libvirt returns the error shown above, which cascades into this issue.
So, in my opinion, there is no need to retry the restore; what is needed is to correctly pick the host where the VM should be restored.

#2 Updated by Ruben S. Montero over 11 years ago

  • Status changed from New to Closed
  • Assignee set to Ruben S. Montero
  • Target version changed from Release 1.4 to Release 1.4.2
  • Resolution set to worksforme

Hi Marlon

This is actually the life-cycle coded in the OpenNebula core: as you describe, if the restore fails the disks are removed. However, your next comment seems strange:

  • suspend/resume: Saves the VM; images are left on the host, but resources (CPU, RAM) are not freed. So, when you resume the VM on the same host, you are sure to have enough CPU and RAM.
  • stop/resume: Saves the VM; images are saved in the VM_DIR and resources are freed. When you resume the VM it is assigned to a host with enough resources (it need not be the previous one). In case of failure, images are deleted from the target host, but you have a copy in the VM_DIR. (See the sketch after this list.)
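To make the stop/resume flow concrete, here is a hypothetical usage sketch with the OCA Ruby bindings of later OpenNebula releases (these bindings did not exist in this form in the 1.4 series; VM id 3256 is taken from the log above):

require 'opennebula'
include OpenNebula

client = Client.new                               # credentials from ONE_AUTH
vm     = VirtualMachine.new_with_id(3256, client)

rc = vm.stop      # checkpoint and disks go to the VM_DIR, host CPU/RAM freed
raise rc.message if OpenNebula.is_error?(rc)

# Later: the scheduler assigns any host with enough free CPU/RAM.
rc = vm.resume
raise rc.message if OpenNebula.is_error?(rc)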

I'll close this one and mark it as worksforme. We will reopen it if OpenNebula does not behave as described above in your installation.

#3 Updated by Marlon Nerling over 11 years ago

I agree in principle with closing this issue.
The delete behavior was partly my fault: I changed tm_mv.sh myself to delete from $SRC, since OpenNebula does not manage the disk files. I have now resolved that with a local patch.
The problem with stop/resume remains.
I don't see it as an OpenNebula bug, since it is libvirt/kvm that cannot restore the domain, but I'm thinking of patching /usr/lib/one/mads/one_vmm_kvm.rb to work around the issue, for example by retrying the resume.
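A minimal sketch of such a retry wrapper (hypothetical: the method name, the ssh invocation and the pause are illustrative choices, not the actual code of the attached one_vmm_kvm.rb patch):

require 'open3'

RETRIES = 5   # the attached patch also retries five times

# Re-run `virsh restore` on the target host a few times before giving up,
# instead of marking the VM as FAILED on the first error.
def restore_with_retries(host, checkpoint)
  RETRIES.times do |attempt|
    _out, err, status = Open3.capture3('ssh', host,
                                       "virsh restore #{checkpoint}")
    return true if status.success?
    STDERR.puts "restore attempt #{attempt + 1} failed: #{err}"
    sleep 5   # give the hypervisor a moment before retrying
  end
  false       # caller would then set the VM to FAILED
end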

Best regards.

#4 Updated by Marlon Nerling over 11 years ago

For those who are interested:
I have attached the patch here.

It helped in my case; my point is not to see it ported into OpenNebula, but to help admins with the same problem.
With this patch, restore tries 'virsh restore' five times before setting the VM to failed. It works most of the time:

(snip)
Fri Mar 5 16:11:01 2010 [DiM][I]: New VM state is STOPPED
Fri Mar 5 16:14:01 2010 [DiM][I]: New VM state is PENDING.
Fri Mar 5 16:14:30 2010 [DiM][I]: New VM state is ACTIVE.
Fri Mar 5 16:14:30 2010 [LCM][I]: New VM state is PROLOG.
Fri Mar 5 16:28:19 2010 [TM][I]: tm_mv.sh: Source : 172.22.0.9:/one-master/var/lib/one/3733/images
Fri Mar 5 16:28:19 2010 [TM][I]: tm_mv.sh: Destination: 172.22.0.9:/var/lib/one/3733
Fri Mar 5 16:28:19 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.9 mkdir -p /var/lib/one/3733".
Fri Mar 5 16:28:19 2010 [TM][I]: tm_mv.sh: Holding abatesting:/var/lib/one/3733/images back
Fri Mar 5 16:28:19 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.9 ionice -c3 cp -ar /one-master/var/lib/one/3733/images /var/lib/one/3733".
Fri Mar 5 16:28:19 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.9 rm -rf /one-master/var/lib/one/3733/images".
Fri Mar 5 16:28:19 2010 [LCM][I]: New VM state is BOOT ## HERE IT TRIES FIRST .. and would fail
Fri Mar 5 16:28:30 2010 [VMM][I]: Command execution fail:
Fri Mar 5 16:28:30 2010 [VMM][I]: STDERR follows.
Fri Mar 5 16:28:30 2010 [VMM][I]: Warning: Permanently added '172.22.0.9' (RSA) to the list of known hosts.
Fri Mar 5 16:28:30 2010 [VMM][I]: stdin: is not a tty ## Bad, bad qemu/kvm
Fri Mar 5 16:28:30 2010 [VMM][I]: libvir: QEMU error : operation failed: failed to start VM
Fri Mar 5 16:28:30 2010 [VMM][I]: error: Failed to restore domain from /var/lib/one/3733/images/checkpoint
Fri Mar 5 16:28:30 2010 [VMM][I]: ExitCode: 1 ## HERE IT TRIES ONE MORE TIME
Fri Mar 5 16:28:57 2010 [LCM][I]: New VM state is RUNNING ## AND HERE WE GET THE BACON..
(snip)

Best Regards

#5 Updated by Ruben S. Montero over 11 years ago

THANKS FOR THIS CONTRIBUTION!
