Bug #4055

disk snapshot restore while VM is running is leading to broken VM

Added by Anton Todorov over 5 years ago. Updated over 5 years ago.

Status: Closed
Start date: 10/14/2015
Priority: Normal
Due date:
Assignee: Ruben S. Montero
% Done: 0%
Category: Core & System
Target version: Release 4.14.2
Resolution: worksforme
Pull request:
Affected Versions: OpenNebula 4.14

Description

Steps to reproduce:
  • create VM
  • create disk snapshot 1
  • do something inside VM
  • create snapshot 2
  • select first snapshot (1) and hit Restore
  • after the restore is complete, reboot the VM (PowerOff + Resume)
  • open VNC console to see the broken VM

The above procedure restores the disk of a SUSPENDED VM. After the restore, the VM kernel holds one view of the filesystem while the VM disk practically contains another. So when the VM is resumed, the filesystem is broken.

I can see two options to solve the issue:
  • In any case, the disk restore procedure should be powerOff - snap_revert - resume.
  • After thinking for a while, I've implemented another solution in our storage addon, as follows: TM_MAD/snap_create saves the checkpoint file, TM_MAD/snap_revert restores the checkpoint file, and TM_MAD/snap_delete deletes the checkpoint file. This way the VM is resumed with the checkpoint file for the given snapshot. There is probably a better solution, but this one works for our setups (a minimal sketch follows this list).
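
For illustration, a minimal sketch of the second option as TM driver fragments. The variables and paths ($SYSTEM_DS_PATH, $VMID, $SNAP_ID, the per-snapshot directory) are illustrative assumptions, not the actual StorPool addon code or the real TM driver argument layout:

# tm/snap_create fragment: keep a copy of the checkpoint next to the snapshot
CHECKPOINT="$SYSTEM_DS_PATH/$VMID/checkpoint"        # assumed checkpoint location
SNAP_DIR="$SYSTEM_DS_PATH/$VMID/snap-$SNAP_ID"       # assumed per-snapshot directory
mkdir -p "$SNAP_DIR"
[ -f "$CHECKPOINT" ] && cp "$CHECKPOINT" "$SNAP_DIR/checkpoint"

# tm/snap_revert fragment: put the saved checkpoint back before the VM is resumed
[ -f "$SNAP_DIR/checkpoint" ] && cp "$SNAP_DIR/checkpoint" "$CHECKPOINT"

# tm/snap_delete fragment: drop the saved checkpoint together with the snapshot
rm -f "$SNAP_DIR/checkpoint"
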
Other problems I've found (probably I should open separate cases for them?):
  • The revert procedure completes even though TM_MAD/snap_revert returns an error exit code. So I've implemented an extra measure to ensure a proper VM disk/checkpoint for the rest of the process.
  • From this state I cannot PowerOff (hard) the VM to re-restore from a snapshot - hard PowerOff is not working when the VM is stuck in the VM BIOS.
  • in oned.log, DISKSNAPSHOTCREATE is logged at the end of the snapshot revert procedure:
    ...
    Wed Oct 14 12:36:33 2015 [Z0][ReM][D]: Req:9296 UID:0 VirtualMachineDiskSnapshotRevert invoked , 100, 0, 0
    Wed Oct 14 12:36:33 2015 [Z0][ReM][D]: Req:9296 UID:0 VirtualMachineDiskSnapshotRevert result SUCCESS, 100
    Wed Oct 14 12:36:48 2015 [Z0][VMM][D]: Message received: LOG I 100 ExitCode: 0
    Wed Oct 14 12:36:48 2015 [Z0][VMM][D]: Message received: LOG I 100 Successfully execute virtualization driver operation: save.
    Wed Oct 14 12:36:48 2015 [Z0][VMM][D]: Message received: LOG I 100 ExitCode: 0
    Wed Oct 14 12:36:48 2015 [Z0][VMM][D]: Message received: LOG I 100 Successfully execute network driver operation: clean.
    Wed Oct 14 12:36:48 2015 [Z0][ReM][D]: Req:9232 UID:0 VirtualMachineInfo invoked , 100
    Wed Oct 14 12:36:48 2015 [Z0][ReM][D]: Req:9232 UID:0 VirtualMachineInfo result SUCCESS, "<VM><ID>100</ID><UID..." 
    Wed Oct 14 12:36:51 2015 [Z0][VMM][D]: Message received: LOG I 100 Successfully execute transfer manager driver operation: tm_snap_revert.
    Wed Oct 14 12:36:51 2015 [Z0][VMM][D]: Message received: LOG I 100 ExitCode: 0
    Wed Oct 14 12:36:51 2015 [Z0][VMM][D]: Message received: LOG I 100 Successfully execute network driver operation: pre.
    Wed Oct 14 12:36:52 2015 [Z0][VMM][D]: Message received: LOG I 100 ExitCode: 0
    Wed Oct 14 12:36:52 2015 [Z0][VMM][D]: Message received: LOG I 100 Successfully execute virtualization driver operation: restore.
    Wed Oct 14 12:36:52 2015 [Z0][VMM][D]: Message received: LOG I 100 ExitCode: 0
    Wed Oct 14 12:36:52 2015 [Z0][VMM][D]: Message received: LOG I 100 Successfully execute network driver operation: post.
    Wed Oct 14 12:36:52 2015 [Z0][VMM][D]: Message received: DISKSNAPSHOTCREATE SUCCESS 100 
    ...
    

Kind Regards,
Anton Todorov


Related issues

Related to Bug #4056: Snapshot revert problems Closed 10/14/2015

History

#1 Updated by Anton Todorov over 5 years ago

#2 Updated by Anton Todorov over 5 years ago

Well, the proposed (second) solution does not work if there is more than one disk attached to the VM :(

Only the first proposal is left. Or is there another solution?

Kind Regards,
Anton Todorov

#3 Updated by Ruben S. Montero over 5 years ago

Currently, users need to guarantee that the snapshots can be safely taken, and they need to be aware that snapshots require syncing the guest OS state. The driver currently supports three modes for VMs in RUNNING:

1.- interactive: the TM driver + hypervisor need to support live snapshots. You need to be aware of possible caching and syncing issues.

2.- suspend: the VM is suspended/resumed. You may also have syncing issues.

3.- detach: the disk is detached from and re-attached to the VM to take the snapshot. You probably need to unmount it first, and it will not work for the root FS.

We are working on making the cloud view more flexible, so in case end-users are not aware of what is going on behind the scenes, you could for example disable snapshots in RUNNING. Also, we could force this for all VMs at the driver level...

Also, for KVM, the guest agent may be used to quiesce the guest. We did not include this because it implies extra configuration steps for the guest. But offering it as an option could work.
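
For illustration, a hedged sketch of what quiescing via the agent could look like from the hypervisor host. The domain name one-100 is illustrative, and virsh domfsfreeze/domfsthaw need a reasonably recent libvirt (otherwise the equivalent guest-fsfreeze-freeze/guest-fsfreeze-thaw agent commands can be sent with virsh qemu-agent-command):

DOMAIN=one-100                 # illustrative OpenNebula libvirt domain name
virsh domfsfreeze "$DOMAIN"    # flush and freeze the guest filesystems via the agent
# ... take the disk snapshot here (TM driver / storage specific) ...
virsh domfsthaw "$DOMAIN"      # thaw the guest filesystems again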

#4 Updated by Ruben S. Montero over 5 years ago

  • Related to Bug #4056: Snapshot revert problems added

#5 Updated by Ruben S. Montero over 5 years ago

BTW, I've also created a new issue, #4056, for the other problems. Feel free to update it.

THANKS!!

#6 Updated by Anton Todorov over 5 years ago

Hi Ruben,

I've delayed my response to recheck/verify the following.

I found the following interfaces regarding snapshots:

1. entire VM snapshot:
VM details -> Snapshots -> Take snapshot (ACTIVE/RUNNING)
/var/tmp/one/vmm/kvm/snapshot_create
  • it calls virsh --connect $LIBVIRT_URI snapshot-create-as $DOMAIN. I have not investigated it in detail because our libvirt integration is not ready yet.
  • is this the interactive snapshot that you mention?
2. Disk (hot) Save as. There was a sort of deferred copy in 4.12, but I cannot find it in 4.14.
VM details -> Storage -> Save as (ACTIVE/RUNNING)
/var/lib/one/remotes/tm/<mad>/cpds
  • (inconsistent) "live" snapshot via tm_mad/cpds driver
  • I would classify this as an interactive snapshot too - is that right?
3. Disk snapshot by suspending the running VM
VM details -> Storage -> Snapshot (ACTIVE/RUNNING)
/var/tmp/one/vmm/kvm/save (to checkpoint)
/var/lib/one/remotes/tm/<mad>/snap_create
/var/tmp/one/vmm/kvm/restore (from checkpoint)
  • (inconsistent) "suspend" snapshot.
4. Disk snapshot on PoweredOff VM
VM details -> Storage -> Snapshot (POWEROFF/LCM_INIT)
/var/lib/one/remotes/tm/<mad>/snap_create
  • (consistent) "poweroff" snapshot
  • You do not mention this snapshot.

I could not find where/how the detach snapshot is triggered.

All of the above types/methods create more or less consistent snapshots. In most cases the VM will boot, at least after a quick fsck for the inconsistent ones.

Utilizing/Reverting snapshots

1. VM snapshots - I did not test this one :(
2. "Save as" snapshots are practically new images in the datastore, so we can use them via templates.
3. "suspeded" snapshots are ok for restore on powered off VM-s. On running VM, IMO it is guarantee for total disaster for the root FS disk, I agree it will work for other disks, if they are unmounted in the VM before restore.
  • IMO there a bold red warning should be displayed so users must know/confirm what are they doing

4. "power off" snapshots are same as (3.) - ok on poweroff, probable on unmounted disks, disaster on root fs disk.

Also, for KVM, the guest agent may be used to quiesce the guest. We did not include this because it implies extra configuration steps for the guest. But offering it as an option could work.

For me, FS freeze/thaw via the QEMU guest agent is the best, safest and fastest option for taking snapshots. I do not agree that too many extra configuration steps are needed for the guests. Here are my tests for most of the distributions imported from the Marketplace:

Ubuntu 14.04 LTS
root@ubuntu:~# apt-get update
root@ubuntu:~# apt-get install qemu-guest-agent

Ubuntu 15.04
root@ubuntu:~# apt-get update
root@ubuntu:~# apt-get install qemu-guest-agent

CentOS 6.5
[root@localhost ~]# yum install qemu-guest-agent
[root@localhost ~]# service qemu-ga start

CentOS 7.1
[root@localhost ~]# yum install qemu-guest-agent
[root@localhost ~]# systemctl start qemu-guest-agent

Debian 7 "Wheezy" 
root@debian:~# echo "deb http://http.debian.net/debian wheezy-backports main" >> /etc/apt/sources.list.d/wheezy-backports.list
root@debian:~# apt-get update
root@debian:~# apt-get install qemu-guest-agent

Debian 8 "Jessie" (virtio-blk only - extlinux expects /dev/vda1)
root@debian:~# apt-get update
root@debian:~# apt-get install qemu-guest-agent

As you can see, the only thing needed is to install/enable the QEMU guest agent. It is trickier to enable the guest agent socket in libvirt on the hypervisor hosts, especially on Debian/Ubuntu ones (AppArmor). But once it is done, it works without further problems.
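
For reference, a hedged sketch of the hypervisor-side part. The channel XML below is the standard libvirt guest agent channel (depending on the libvirt version an explicit <source> element with a socket path may be required), and one-100 is an illustrative domain name:

# Add the guest agent channel under <devices> in the domain XML, e.g. via
# virsh edit one-100 or whatever mechanism is used to customize the domain:
#
#   <channel type='unix'>
#     <target type='virtio' name='org.qemu.guest_agent.0'/>
#   </channel>
#
# Then check that the agent inside the guest answers:
virsh qemu-agent-command one-100 '{"execute":"guest-ping"}'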

What do you think about a solution like this: if the VMM drivers know the context in which they are called (via STATE/LCM_STATE?), they could test for guest agent availability and do an FS freeze/thaw, and otherwise fall back to the current logic (a sketch of the idea follows).
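
A minimal sketch of that idea, assuming the driver can derive the libvirt domain name (one-$VMID here is illustrative) and with take_disk_snapshot as a placeholder for the storage-specific snapshot call:

DOMAIN="one-$VMID"    # illustrative domain name

if virsh qemu-agent-command "$DOMAIN" '{"execute":"guest-ping"}' >/dev/null 2>&1; then
    # guest agent reachable: freeze, snapshot, thaw
    virsh domfsfreeze "$DOMAIN"
    take_disk_snapshot            # placeholder for the storage-specific call
    virsh domfsthaw "$DOMAIN"
else
    # no guest agent: fall back to the current logic (suspend or crash-consistent snapshot)
    take_disk_snapshot
fi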

Kind Regards,
Anton Todorov
PS. While testing I've spotted some more (unrelated) issues; I'll open separate tickets for them :)

#7 Updated by Stefan Kooman over 5 years ago

@Anton Todorov, I created #4064 for --quiesce support for qcow2-based disk snapshots. This feature (--quiesce) lived in a development OpenNebula version for a few days, but when a VM had no guest agent running it caused issues creating snapshots, and therefore the support was removed. #4064 was created to make the disk snapshot driver smarter (i.e. check whether the VM has guest agent support) and allow consistent snapshots.

#8 Updated by Anton Todorov over 5 years ago

Hi Stefan,

I've checked the --quiesce option, and it looks like it will work only on storage backends that have native libvirt support. My suggestion is actually equivalent to --quiesce, but done outside libvirt, so it will work with both natively and non-natively supported libvirt block backends. Testing whether the qemu-guest-agent is working is simple:

virsh qemu-agent-command <DOMAIN_ID> '{"execute":"guest-ping"}'

At least for the StorPool driver, if the FS freeze inside the VM is not working, it will take a snapshot in a state no worse than the current situation.

Kind Regards,
Anton Todorov

#9 Updated by Anton Todorov over 5 years ago

Update about my statement:

I could not find where/how the detach snapshot is triggered.

I am so sorry that I totally missed snap_create_live and the new vmm_exec options where the related configuration is placed. Javi even hinted about it a few months ago, but I totally lost track of it.

So the live snapshot function with the qemu-guest-agent enabled is implemented in our addon via tm/snap_create_live and is working like a charm - snapshot creation takes about 2 seconds.

Regards,
Anton Todorov

#10 Updated by Ruben S. Montero over 5 years ago

  • Assignee set to Ruben S. Montero

#11 Updated by Ruben S. Montero over 5 years ago

  • Status changed from Pending to Closed
  • Resolution set to worksforme

OK. So I am closing this issue, we have #4056 for the other issues and #4064 for the agent.
