Bug #265
Do not delete VM images after migration failure
| Status: | Closed | Start date: | 06/21/2010 |
|---|---|---|---|
| Priority: | Normal | Due date: | |
| Assignee: | Ruben S. Montero | % Done: | 0% |
| Category: | Core & System | | |
| Target version: | Release 3.0 | | |
| Resolution: | fixed | Pull request: | |
| Affected Versions: | | | |
Description
From Shi Jin's mail:
I recently had a very serious problem. I called "onevm stop" on a VM to hibernate it into a checkpoint file. Then I tried to call "onevm resume" to bring it back online. However, the resumption process went wrong. There can be several reasons for it to go wrong: for example, libvirt will fail if there is another volume attached to the VM, but that is not relevant to this thread (I am planning to start a new one on it soon). The key point here is that, as soon as the restore fails, the OpenNebula code triggers DEPLOY_FAILURE in the LCM. This can be found in src/vmm/VirtualMachineManagerDriver.cc:

```cpp
399     else if ( action == "RESTORE" )
400     {
401         Nebula &ne = Nebula::instance();
402         LifeCycleManager *lcm = ne.get_lcm();
403
404         if (result == "SUCCESS")
405         {
406             lcm->trigger(LifeCycleManager::DEPLOY_SUCCESS, id);
407         }
408         else
409         {
410             string info;
411
412             getline(is,info);
413
414             os.str("");
415             os << "Error restoring VM, " << info;
416
417             vm->log("VMM",Log::ERROR,os);
418
419             lcm->trigger(LifeCycleManager::DEPLOY_FAILURE, id);
420         }
421     }
```

The LCM then eventually deletes the images directory, and the user loses all the precious data he/she has obtained so far, with no way to get it back! So I desperately need to prevent OpenNebula from deleting the precious images. A quick hack I did was to comment out line 419 above so that the LCM is not triggered at all, but I am sure this is not clean and we need more than this. I am thinking that maybe one needs a way to distinguish a freshly booting VM from a resuming one; for now they are no different to OpenNebula and are both in the BOOT state. So please let me know if what I reported is a bug and if it can be fixed in the future. I could submit this on the dev site as well. Thank you very much.
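For context on why the images disappear: DEPLOY_FAILURE ends up in LifeCycleManager::failure_action(), which (as the patch in comment #1 below shows) asks the TransferManager to run EPILOG_DELETE on the VM's host files. A minimal sketch of that failure path, simplified for illustration and with the surrounding bookkeeping omitted, could look like this:

```cpp
// Simplified sketch of the failure path discussed in this ticket;
// state changes and logging around these triggers are omitted.
void LifeCycleManager::failure_action(VirtualMachine * vm)
{
    // ... set VM/host state, log the failure reason ...

    // Hand the VM over to the Dispatch Manager as FAILED.
    dm->trigger(DispatchManager::FAILED, vm->get_oid());

    // This is the trigger Shi Jin's patch comments out: it removes the
    // VM directory (checkpoint and disk images) from the host.
    tm->trigger(TransferManager::EPILOG_DELETE, vm->get_oid());
}
```

So a failed RESTORE goes through the same cleanup as any other deployment failure, which is why the checkpoint and disks are wiped.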
Associated revisions
bug #265: Failure actions will NOT remove VM files in the host. Host files will be removed from the remote host upon VM resubmission or deletion. This will let sysadmins easily debug any failure or perform forensic analysis.
bug #265: Failure actions will NOT remove VM files in the host. Host files will be removed from the remote host upon VM resubmission or deletion. This will let sysadmins easily debug any failure or perform forensic analysis.
(cherry picked from commit c6a8c1fbdcc1d11df23f8ead30a1fd0df3d2630e)
History
#1 Updated by Shi Jin about 11 years ago
Now that I think about it more, I don't think it makes sense to delete images on any error at all.
Even a fresh start failure should leave the images there for debugging purposes.
So the simplest solution is to remove the image deletion from the code. Here is the patch:
```diff
diff --git a/src/lcm/LifeCycleStates.cc b/src/lcm/LifeCycleStates.cc
index 7828397..a1ccda5 100644
--- a/src/lcm/LifeCycleStates.cc
+++ b/src/lcm/LifeCycleStates.cc
@@ -793,7 +793,7 @@ void LifeCycleManager::failure_action(VirtualMachine * vm)
 
     dm->trigger(DispatchManager::FAILED,vm->get_oid());
 
-    tm->trigger(TransferManager::EPILOG_DELETE,vm->get_oid());
+    //tm->trigger(TransferManager::EPILOG_DELETE,vm->get_oid());
 }
 
 /* -------------------------------------------------------------------------- */
```
#2 Updated by Ruben S. Montero about 11 years ago
Well, we do not want to leave cluster worker nodes with disk images from failed VMs. This would fill the worker node filesystem and force the admin to manually delete those volumes/images.
#3 Updated by Shi Jin about 11 years ago
Agree.
But the delete action will definitely remove the images, right?
When a VM fails, the user still needs to issue the delete command anyway, correct?
I guess they don't have to, but then the failed VM keeps showing up in the list of VMs, which can be very annoying.
So this is indeed not a perfect solution.
Shi
Ruben S. Montero wrote:
Well, we do not want to leave cluster worker nodes with disk images from failed VMs. This would fill the worker node filesystem and force the admin to manually delete those volumes/images.
#4 Updated by Shi Jin about 11 years ago
One more thought: on a properly set up system, a failed VM should get the admin's attention anyway.
Then he/she has a chance to fix the problem and then delete the images.
#5 Updated by Ruben S. Montero about 11 years ago
Yes, I had not realized that a VM will end up in a failed state and you have to delete it (and hence delete the images) anyway... Note that delete came after the life-cycle implementation; back in 1.2, when we did not have a delete command, there was no other way to remove the images of a failed VM... Have you applied the patch to your system?
#6 Updated by Ruben S. Montero over 10 years ago
- Target version set to Release 3.0
#7 Updated by Ruben S. Montero almost 10 years ago
- Tracker changed from Request to Feature
#8 Updated by Ruben S. Montero almost 10 years ago
- Tracker changed from Feature to Bug
#9 Updated by Ruben S. Montero almost 10 years ago
- Status changed from New to Closed
- Resolution set to fixed
This is now implemented. When a failure occurs, VM files are not removed from the host; host files are cleaned up when the VM is finally deleted or resubmitted. Sysadmins can now easily debug any problems or keep VM images for forensic analysis.
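For illustration only, a sketch of the behaviour described above (not the actual code from the commits referenced under "Associated revisions"): the failure path now leaves the VM files on the host, and the EPILOG_DELETE transfer only runs when the VM is deleted or resubmitted.

```cpp
// Sketch of the post-fix behaviour, assuming the same trigger names as
// in the patch from comment #1; handler names below are hypothetical.
void LifeCycleManager::failure_action(VirtualMachine * vm)
{
    // Mark the VM as FAILED but leave its files on the host so the
    // admin can debug the failure or perform forensic analysis.
    dm->trigger(DispatchManager::FAILED, vm->get_oid());
    // No TransferManager::EPILOG_DELETE here any more.
}

// Hypothetical cleanup hook: only when the failed VM is finally deleted
// or resubmitted are its host files removed.
void LifeCycleManager::cleanup_action(VirtualMachine * vm)
{
    tm->trigger(TransferManager::EPILOG_DELETE, vm->get_oid());
}
```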