Bug #265
Do not delete VM images after migration failure
| Status: | Closed | Start date: | 06/21/2010 |
|---|---|---|---|
| Priority: | Normal | Due date: | |
| Assignee: | Ruben S. Montero | % Done: | 0% |
| Category: | Core & System | | |
| Target version: | Release 3.0 | | |
| Resolution: | fixed | Pull request: | |
| Affected Versions: | | | |
Description
From Shi Jin's mail:
I recently had a very serious problem. I called "onevm stop" on a VM to hibernate it into a checkpoint file. Then I tried to call "onevm resume" to bring it back online. However, the resumption process went wrong. There can be several reasons for it to go wrong: for example, libvirt will fail if there is another volume attached to the VM, but that is not relevant to this thread (I am planning to start a new one on it soon). The key point here is that, as soon as the restore fails, the OpenNebula code triggers DEPLOY_FAILURE in the LCM. This can be found in src/vmm/VirtualMachineManagerDriver.cc:

```cpp
399     else if ( action == "RESTORE" )
400     {
401         Nebula &ne = Nebula::instance();
402         LifeCycleManager *lcm = ne.get_lcm();
403
404         if (result == "SUCCESS")
405         {
406             lcm->trigger(LifeCycleManager::DEPLOY_SUCCESS, id);
407         }
408         else
409         {
410             string info;
411
412             getline(is,info);
413
414             os.str("");
415             os << "Error restoring VM, " << info;
416
417             vm->log("VMM",Log::ERROR,os);
418
419             lcm->trigger(LifeCycleManager::DEPLOY_FAILURE, id);
420         }
421     }
```

The LCM then eventually deletes the images directory, and the user loses all the precious data he/she has obtained so far, with no way to get it back! So I desperately need to prevent OpenNebula from deleting the precious images. A quick hack I did was to comment out line 419 above so that the LCM is not triggered at all, but I am sure this is not clean and we need more than this. I am thinking that maybe one needs a way to distinguish a freshly booting VM from a resuming one; for now they are no different to OpenNebula and are both in the BOOT state. So please let me know if what I reported is a bug and if it can be fixed in the future. I could submit this on the dev site as well. Thank you very much.
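For context on why the images disappear: DEPLOY_FAILURE ends up in LifeCycleManager::failure_action(), which (as the patch in comment #1 below shows) asks the TransferManager to run EPILOG_DELETE on the VM's host files. A minimal sketch of that failure path, simplified for illustration and with the surrounding bookkeeping omitted, could look like this:

```cpp
// Simplified sketch of the failure path discussed in this ticket;
// state changes and logging around these triggers are omitted.
void LifeCycleManager::failure_action(VirtualMachine * vm)
{
    // ... set VM/host state, log the failure reason ...

    // Hand the VM over to the Dispatch Manager as FAILED.
    dm->trigger(DispatchManager::FAILED, vm->get_oid());

    // This is the trigger Shi Jin's patch comments out: it removes the
    // VM directory (checkpoint and disk images) from the host.
    tm->trigger(TransferManager::EPILOG_DELETE, vm->get_oid());
}
```

So a failed RESTORE goes through the same cleanup as any other deployment failure, which is why the checkpoint and disks are wiped.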
Associated revisions
bug #265: Failure actions will NOT remove VM files in the host. Host files will be removed from the remote host upon VM resubmission or deletion. This will let sysadmins easily debug any failure or perform forensic analysis.
bug #265: Failure actions will NOT remove VM files in the host. Host files will be removed from the remote host upon VM resubmission or deletion. This will let sysadmins easily debug any failure or perform forensic analysis.
(cherry picked from commit c6a8c1fbdcc1d11df23f8ead30a1fd0df3d2630e)
History
#1 Updated by Shi Jin about 11 years ago
Now that I think about it more, I don't think it makes sense to delete images on any error at all.
Even a fresh start failure should leave the images there for debugging purposes.
So the simplest solution is to remove the image deletion from the code. Here is the patch:
```diff
diff --git a/src/lcm/LifeCycleStates.cc b/src/lcm/LifeCycleStates.cc
index 7828397..a1ccda5 100644
--- a/src/lcm/LifeCycleStates.cc
+++ b/src/lcm/LifeCycleStates.cc
@@ -793,7 +793,7 @@ void LifeCycleManager::failure_action(VirtualMachine * vm)
 
     dm->trigger(DispatchManager::FAILED,vm->get_oid());
 
-    tm->trigger(TransferManager::EPILOG_DELETE,vm->get_oid());
+    //tm->trigger(TransferManager::EPILOG_DELETE,vm->get_oid());
 }
 
 /* -------------------------------------------------------------------------- */
```
#2 Updated by Ruben S. Montero about 11 years ago
Well, we do not want to leave cluster worker nodes with disk images from failed VMs. This would fill the worker node filesystem and force the admin to manually delete those volumes/images.
#3 Updated by Shi Jin about 11 years ago
Agree.
But the delete action will definitely remove the images, right?
When a VM fails, the user still needs to issue the delete command anyway, correct?
I guess they don't have to, but then the failed VM keeps showing up in the list of VMs, which can be very annoying.
So this is indeed not a perfect solution.
Shi
Ruben S. Montero wrote:
Well, we do not want to leave cluster worker nodes with disk images from failed VMs. This would fill the worker node filesystem and force the admin to manually delete those volumes/images.
#4 Updated by Shi Jin about 11 years ago
One more thought: on a properly set up system, a failed VM should get the admin's attention anyway.
Then he/she has a chance to fix the problem and then delete the images.
#5 Updated by Ruben S. Montero about 11 years ago
Yes, I had not realized that a VM will end up in a failed state and you have to delete it (and hence delete the images) anyway... Note that delete came after the life-cycle implementation; back in 1.2, when we did not have a delete command, there was no other way to remove the images of a failed VM... Have you applied the patch to your system?
#6 Updated by Ruben S. Montero over 10 years ago
- Target version set to Release 3.0
#7 Updated by Ruben S. Montero almost 10 years ago
- Tracker changed from Request to Feature
#8 Updated by Ruben S. Montero almost 10 years ago
- Tracker changed from Feature to Bug
#9 Updated by Ruben S. Montero almost 10 years ago
- Status changed from New to Closed
- Resolution set to fixed
This is now implemented. When a failure occurs, VM files are not removed from the host; host files are cleaned up when the VM is finally deleted or resubmitted. Sysadmins can now easily debug any problems or keep VM images for forensic analysis.
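For illustration only, a sketch of the behaviour described above (not the actual code from the commits referenced under "Associated revisions"): the failure path now leaves the VM files on the host, and the EPILOG_DELETE transfer only runs when the VM is deleted or resubmitted.

```cpp
// Sketch of the post-fix behaviour, assuming the same trigger names as
// in the patch from comment #1; handler names below are hypothetical.
void LifeCycleManager::failure_action(VirtualMachine * vm)
{
    // Mark the VM as FAILED but leave its files on the host so the
    // admin can debug the failure or perform forensic analysis.
    dm->trigger(DispatchManager::FAILED, vm->get_oid());
    // No TransferManager::EPILOG_DELETE here any more.
}

// Hypothetical cleanup hook: only when the failed VM is finally deleted
// or resubmitted are its host files removed.
void LifeCycleManager::cleanup_action(VirtualMachine * vm)
{
    tm->trigger(TransferManager::EPILOG_DELETE, vm->get_oid());
}
```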