Bug #3937

After disk snapshot with suspend / resume vlan does not get (re-)tagged on openvswitch

Added by Stefan Kooman almost 6 years ago. Updated almost 6 years ago.

Status: Closed
Start date: 08/14/2015
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Drivers - VM
Target version: Release 4.14
Resolution: fixed
Pull request:
Affected Versions: OpenNebula 4.12

Description

The VLAN does not get (re-)tagged after the VM is resumed following a disk snapshot create action on a VM with a RAW image.

Output of "ovs-vsctl show":

Port "vnet1"
Interface "vnet1"

While it was

Port "vnet1"
tag: 228
Interface "vnet1"

before the "DISK_SNAPSHOT" / suspend action
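
As a manual check / temporary workaround (port name and tag taken from the output above), the tag can be read back and re-applied with ovs-vsctl:

  # Read the current tag on the port (empty when the bug hits)
  ovs-vsctl get Port vnet1 tag

  # Re-apply the VLAN tag by hand
  ovs-vsctl set Port vnet1 tag=228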

Associated revisions

Revision 988cf671
Added by Jaime Melis almost 6 years ago

Bug #3937: After disk snapshot with suspend / resume vlan does not get
(re-)tagged on openvswitch

Revision 40caf75b
Added by Jaime Melis almost 6 years ago

Bug #3937: Apply network drivers after disk-snapshot-revert

History

#1 Updated by Javi Fontan almost 6 years ago

  • Category set to Drivers - VM
  • Target version set to Release 4.14

#2 Updated by Jaime Melis almost 6 years ago

  • Status changed from Pending to Closed
  • Resolution set to fixed
  • Affected Versions OpenNebula 4.12 added

#3 Updated by Stefan Kooman almost 6 years ago

I replaced /var/lib/one/remotes/vmm/one_vmm_exec.rb with this new version and did a "onehost sync --force" afterwards. The tag does not get re-applied. Besides that, it looks like the VM does not get a "poweroff --hard" before the revert, leading to an OS crash. Is this the right way to test this fix?

#4 Updated by Stefan Kooman almost 6 years ago

I just checked out master and recompiled / reinstalled (/usr/lib/one/mads/one_vmm_exec.rb is also the new version). The VM is shut down but seems to be resumed, as there is no normal boot sequence (BIOS -> boot) when the VM is running again. After a reboot the VM ends in a stack trace with ext4 inode errors ...

#5 Updated by Jaime Melis almost 6 years ago

  • Status changed from Closed to Assigned

I have updated the fix and added a part that was missing, so that the drivers are also re-applied on revert, not only on create. It should not be necessary to run onehost sync or to reinstall; just replace one_vmm_exec.rb and restart OpenNebula (see the sketch below).
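
For reference, roughly what that deploy looks like (the file path is the one from comment #4; the restart command is an assumption for a typical packaged install):

  # Replace the patched driver on the front-end (path as in comment #4)
  sudo cp one_vmm_exec.rb /usr/lib/one/mads/one_vmm_exec.rb

  # Restart OpenNebula so the patched driver is loaded
  # (service name/command assumed for a standard packaged install)
  sudo service opennebula restart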

I don't understand exactly what you are doing. My workflow is as follows:

  • VM is running
  • onevm disk-snapshot-create <vmid> <diskid> <snapshot name>
  • I observe how the VM disappears from libvirt for a second as it's being suspended, and then reappears and the vnm drivers are reapplied (testing with ovswitch)
  • onevm disk-snapshot-revert ...
  • I observe the same thing as with disk-snapshot-create

Note that there is no poweroff --hard involved here: I'm doing this while the VM is running, and after the operation the VM is running again. OpenNebula does the suspend behind the scenes; I only need to instruct it to do disk-snapshot-create.
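
A minimal way to verify that, with the VM id, disk id and snapshot name below as placeholders:

  # Tag is present while the VM is running
  ovs-vsctl get Port vnet1 tag

  # OpenNebula suspends and resumes the VM as part of this operation
  onevm disk-snapshot-create <vmid> <diskid> "test-snap"

  # After the resume the tag should still be set; before the fix it came back empty
  ovs-vsctl get Port vnet1 tag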

Can you clarify what you mean with your previous comments? Maybe posting your workflow will help.

#6 Updated by Stefan Kooman almost 6 years ago

Note that there is no poweroff --hard involved here, I'm doing this while the VM is running, and after the operation the VM is running again. OpenNebula does the suspend behind the scenes, I only need to instruct it to do disk-snapshot-create.

I think that's the problem: if you replace the root disk of a VM (rootfs) with a previous snapshot and resume the VM again, you will end up with a corrupted filesystem. The system expects files / inodes / fscache at certain places, and all of a sudden they are gone or somewhere else. I believe the correct way to do a "snapshot_revert_while_running" is "poweroff --hard" -> onevm disk-snapshot-revert -> poweron.
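
A sketch of the sequence I have in mind (ids are placeholders; I'm assuming "onevm resume" is the command for the poweron step):

  # Stop the guest so the root filesystem is not mounted during the revert
  onevm poweroff --hard <vmid>

  # Revert the disk to the snapshot while the VM is powered off
  onevm disk-snapshot-revert <vmid> <diskid> <snapshot_id>

  # Bring the VM back up from the reverted disk
  onevm resume <vmid>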

#7 Updated by Stefan Kooman almost 6 years ago

TL;DR The vlan tags get applied nicely, so the bug is fixed.

I assumed a "poweroff --hard" would have been issued, to avoid corrupting the filesystem of the running VM. But apparently that's not how it's designed. Maybe a warning should be added that a revert on a mounted filesystem is very dangerous and will lead to data loss.

#8 Updated by Jaime Melis almost 6 years ago

  • Status changed from Assigned to Closed
