Bug #191
onevm resume (restore) deletes checkpoint and vm disks after failed restore
Status: Closed
Priority: High
Assignee: Ruben S. Montero
Category: Core & System
Target version: Release 1.4.2
Resolution: worksforme
Start date: 02/10/2010
Due date:
% Done: 0%
Pull request:
Affected Versions:
Description
If the restore procedure fails, OpenNebula simply deletes all files of the VM.
In my opinion it should retry the restore if, for some reason, the host could not restore it, and by no means delete the disks! Log excerpt:
Wed Feb 10 09:06:30 2010 [LCM][I]: New VM state is SAVE_STOP
Wed Feb 10 09:06:35 2010 [LCM][I]: New VM state is EPILOG_STOP
Wed Feb 10 09:11:15 2010 [TM][I]: tm_mv.sh: Source : 172.22.0.6:/var/lib/one//3256/images
Wed Feb 10 09:11:15 2010 [TM][I]: tm_mv.sh: Destination: 172.22.0.6:/one-master/var/lib/one//3256
Wed Feb 10 09:11:15 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.6 mkdir -p /one-master/var/lib/one//3256".
Wed Feb 10 09:11:15 2010 [TM][I]: tm_mv.sh: Holding 172.22.0.6:/var/lib/one//3256/images back
Wed Feb 10 09:11:15 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.6 cp -ar /var/lib/one//3256/images /one-master/var/lib/one//3256".
Wed Feb 10 09:11:15 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.6 rm -rf /var/lib/one//3256/images".
Wed Feb 10 09:11:15 2010 [DiM][I]: New VM state is STOPPED
Wed Feb 10 09:31:07 2010 [DiM][I]: New VM state is PENDING.
Wed Feb 10 09:31:11 2010 [DiM][I]: New VM state is ACTIVE.
Wed Feb 10 09:31:11 2010 [LCM][I]: New VM state is PROLOG.
Wed Feb 10 09:35:38 2010 [TM][I]: tm_mv.sh: Source : 172.22.0.12:/one-master//var/lib/one/3256/images
Wed Feb 10 09:35:38 2010 [TM][I]: tm_mv.sh: Destination: 172.22.0.12:/var/lib/one//3256
Wed Feb 10 09:35:38 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.12 mkdir -p /var/lib/one//3256".
Wed Feb 10 09:35:38 2010 [TM][I]: tm_mv.sh: Holding abatesting:/var/lib/one/3256/images back
Wed Feb 10 09:35:38 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.12 cp -ar /one-master//var/lib/one/3256/images /var/lib/one//3256".
Wed Feb 10 09:35:38 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.12 rm -rf /one-master//var/lib/one/3256/images".
Wed Feb 10 09:35:38 2010 [LCM][I]: New VM state is BOOT
Wed Feb 10 09:35:39 2010 [VMM][I]: Command execution fail: virsh restore /var/lib/one//3256/images/checkpoint
Wed Feb 10 09:35:39 2010 [VMM][I]: STDERR follows.
Wed Feb 10 09:35:39 2010 [VMM][I]: Warning: Permanently added '172.22.0.12' (RSA) to the list of known hosts.
Wed Feb 10 09:35:39 2010 [VMM][I]: libvir: QEMU error : operation failed: failed to start VM
Wed Feb 10 09:35:39 2010 [VMM][I]: error: Failed to restore domain from /var/lib/one//3256/images/checkpoint
Wed Feb 10 09:35:39 2010 [VMM][I]: ExitCode: 1
Wed Feb 10 09:35:39 2010 [VMM][E]: Error restoring VM, -
Wed Feb 10 09:35:39 2010 [DiM][I]: New VM state is FAILED
Wed Feb 10 09:36:03 2010 [TM][W]: Ignored: LOG - 3256 tm_delete.sh: Source: 172.22.0.12:/var/lib/one//3256/images
Wed Feb 10 09:36:03 2010 [TM][W]: Ignored: LOG - 3256 tm_delete.sh: Destiny: 172.22.0.12:/var/lib/one//3256/images
Wed Feb 10 09:36:03 2010 [TM][W]: Ignored: LOG - 3256 tm_delete.sh: Executed "ssh 172.22.0.12 mkdir -p /var/lib/one//3256".
Wed Feb 10 09:36:03 2010 [TM][W]: Ignored: LOG - 3256 tm_delete.sh: Holding 172.22.0.12:/var/lib/one//3256/images back
Wed Feb 10 09:36:03 2010 [TM][W]: Ignored: LOG - 3256 tm_delete.sh: Executed "ssh 172.22.0.12 rm -rf /var/lib/one//3256".
Wed Feb 10 09:36:03 2010 [TM][W]: Ignored: TRANSFER SUCCESS 3256 -
History
#1 Updated by Marlon Nerling over 11 years ago
I think the real problem is that the scheduler does not select a new host to restore/resume the VM, but takes the old one.
But if that host is full and has no free RAM/CPU, then libvirt returns the error shown above, which cascades into this issue.
So in my opinion there is no need to retry the restore; what is needed is to correctly pick the host where the VM should be restored.
#2 Updated by Ruben S. Montero over 11 years ago
- Status changed from New to Closed
- Assignee set to Ruben S. Montero
- Target version changed from Release 1.4 to Release 1.4.2
- Resolution set to worksforme
Hi Marlon
This is actually the life-cycle coded into the OpenNebula core: as you describe, if the restore fails the disks are removed. However, your comment seems strange:
- suspend/resume: saves the VM; the images are left on the host, but the resources (CPU, RAM) are not freed. So when you resume the VM on the same host, you are sure to have enough CPU and RAM.
- stop/resume: saves the VM; the images are saved in the VM_DIR and the resources are freed. When you resume the VM, it is assigned to a host with enough resources (it does not need to be the previous one). In case of failure the images are deleted from the target host, but you still have a copy in the VM_DIR.
I'll close this one and mark it as "works for me". We will reopen it if OpenNebula does not behave as described above in your installation.
#3 Updated by Marlon Nerling over 11 years ago
I agree in principle to close this issue.
The delete behavior was partly my fault: I had changed tm_mv.sh myself to also delete from $SRC, since OpenNebula does not manage the disk files. I have now resolved that problem with a local patch.
The problem with stop/resume remains.
I don't see it as an OpenNebula bug, since it is libvirt/KVM that cannot restore the domain, but I'm thinking of patching /usr/lib/one/mads/one_vmm_kvm.rb to work around this issue, for example by retrying the resume.
Best regards.
#4 Updated by Marlon Nerling over 11 years ago
- File one_vmm_kvm.rb.diff added
For those who are interested:
I have attached the patch here. It helped in my case; my intention is not to see it ported into OpenNebula, but to help admins facing the same problem.
With this patch, restore tries 'virsh restore' 5 times before setting the VM to failed (a rough sketch of the retry idea follows after the log). It works most of the time:
(snip)
Fri Mar 5 16:11:01 2010 [DiM][I]: New VM state is STOPPED
Fri Mar 5 16:14:01 2010 [DiM][I]: New VM state is PENDING.
Fri Mar 5 16:14:30 2010 [DiM][I]: New VM state is ACTIVE.
Fri Mar 5 16:14:30 2010 [LCM][I]: New VM state is PROLOG.
Fri Mar 5 16:28:19 2010 [TM][I]: tm_mv.sh: Source : 172.22.0.9:/one-master/var/lib/one/3733/images
Fri Mar 5 16:28:19 2010 [TM][I]: tm_mv.sh: Destination: 172.22.0.9:/var/lib/one/3733
Fri Mar 5 16:28:19 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.9 mkdir -p /var/lib/one/3733".
Fri Mar 5 16:28:19 2010 [TM][I]: tm_mv.sh: Holding abatesting:/var/lib/one/3733/images back
Fri Mar 5 16:28:19 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.9 ionice -c3 cp -ar /one-master/var/lib/one/3733/images /var/lib/one/3733".
Fri Mar 5 16:28:19 2010 [TM][I]: tm_mv.sh: Executed "ssh 172.22.0.9 rm -rf /one-master/var/lib/one/3733/images".
Fri Mar 5 16:28:19 2010 [LCM][I]: New VM state is BOOT ## HERE IT TRIES FIRST .. and would fail
Fri Mar 5 16:28:30 2010 [VMM][I]: Command execution fail:
Fri Mar 5 16:28:30 2010 [VMM][I]: STDERR follows.
Fri Mar 5 16:28:30 2010 [VMM][I]: Warning: Permanently added '172.22.0.9' (RSA) to the list of known hosts.
Fri Mar 5 16:28:30 2010 [VMM][I]: stdin: is not a tty ## Bad, bad qemu/kvm
Fri Mar 5 16:28:30 2010 [VMM][I]: libvir: QEMU error : operation failed: failed to start VM
Fri Mar 5 16:28:30 2010 [VMM][I]: error: Failed to restore domain from /var/lib/one/3733/images/checkpoint
Fri Mar 5 16:28:30 2010 [VMM][I]: ExitCode: 1 ## HERE IT TRIES ONE MORE TIME
Fri Mar 5 16:28:57 2010 [LCM][I]: New VM state is RUNNING ## AND HERE IT FINALLY SUCCEEDS..
(snip)
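(The attached one_vmm_kvm.rb.diff itself is not reproduced in this ticket. The lines below are only a rough, hypothetical Ruby sketch of the retry idea, assuming the driver ends up running "virsh restore <checkpoint>" on the remote host over SSH; the constant, method and variable names are illustrative and not taken from the actual patch.)

# Hypothetical sketch of the retry idea, not the attached diff.
MAX_RESTORE_TRIES = 5

# Try "virsh restore" up to MAX_RESTORE_TRIES times before giving up,
# so a transient failure on the host does not immediately fail the VM.
def restore_with_retries(host, checkpoint)
  MAX_RESTORE_TRIES.times do |attempt|
    # run the restore on the hypervisor host; system returns true on exit code 0
    return true if system("ssh #{host} virsh restore #{checkpoint}")
    STDERR.puts "virsh restore failed (attempt #{attempt + 1}/#{MAX_RESTORE_TRIES}), retrying..."
    sleep 5 # short pause before the next attempt
  end
  false # all attempts failed: the core will then mark the VM as FAILED
end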
Best Regards
#5 Updated by Ruben S. Montero over 11 years ago
THANKS FOR THIS CONTRIBUTION!