Bug #4221
disk detach / attach leaves VM in broken state
Status: | Closed | Start date: | 12/02/2015 |
---|---|---|---|
Priority: | Normal | Due date: | |
Assignee: | Jaime Melis | % Done: | 0% |
Category: | Drivers - Storage | | |
Target version: | Release 5.0 | | |
Resolution: | fixed | Pull request: | |
Affected Versions: | OpenNebula 4.14 | | |
Description
We have noticed, and can reproduce, that a symlink is sometimes created incorrectly during a disk detach / attach operation (we are not yet sure what triggers it). Instead of pointing to a file, the symlink points to a directory, which leads to the following error message on the host that is trying to start the VM:
qemu-system-x86_64: -drive file=/var/lib/one//datastores/103/148/disk.3,if=none,id=drive-virtio-disk1,format=qcow2,cache=none,aio=native: could not open disk image /var/lib/one//datastores/103/148/disk.3: Could not open '/var/lib/one//datastores/103/148/disk.3': Is a directory
2015-12-02 16:43:25.116+0000: shutting down
At this point the system datastore directory for this VM looks like this:
/var/lib/one/datastores/103/148# ls -lrt
total 428
lrwxrwxrwx 1 oneadmin oneadmin 65 Sep 2 18:41 disk.0.snap -> /var/lib/one/datastores/104/f6da286fb8dd7c81bf8f7fa541a525ac.snap
lrwxrwxrwx 1 oneadmin oneadmin 65 Sep 2 18:41 disk.1.snap -> /var/lib/one/datastores/104/019e519ccd4cab21fb0af369e6fe8ad9.snap
-rw-rw-r-- 1 oneadmin oneadmin 372736 Sep 2 18:41 disk.2
lrwxrwxrwx 1 oneadmin oneadmin 38 Sep 2 18:41 disk.2.iso -> /var/lib/one/datastores/103/148/disk.2
-rw-rw-r-- 1 oneadmin oneadmin 1301 Sep 2 18:41 deployment.0
-rw-rw-r-- 1 oneadmin oneadmin 1301 Oct 2 21:14 deployment.9
-rw-rw-r-- 1 oneadmin oneadmin 1301 Oct 23 13:04 deployment.10
-rw-rw-r-- 1 oneadmin oneadmin 1301 Oct 23 13:05 deployment.11
-rw-rw-r-- 1 oneadmin oneadmin 1301 Oct 23 13:06 deployment.12
-rw-rw-r-- 1 oneadmin oneadmin 1301 Oct 23 13:10 deployment.14
-rw-rw-r-- 1 oneadmin oneadmin 1301 Oct 23 13:17 deployment.16
lrwxrwxrwx 1 oneadmin oneadmin 68 Oct 23 13:18 disk.1 -> /var/lib/one//datastores/104/019e519ccd4cab21fb0af369e6fe8ad9.snap/2
-rw-rw-r-- 1 oneadmin oneadmin 1301 Oct 23 13:24 deployment.17
lrwxrwxrwx 1 oneadmin oneadmin 68 Oct 23 13:39 disk.0 -> /var/lib/one//datastores/104/f6da286fb8dd7c81bf8f7fa541a525ac.snap/7
-rw-rw-r-- 1 oneadmin oneadmin 1301 Dec 2 16:59 deployment.18
lrwxrwxrwx 1 4294967294 4294967294 65 Dec 2 17:39 disk.3.snap -> /var/lib/one/datastores/104/019e519ccd4cab21fb0af369e6fe8ad9.snap
lrwxrwxrwx 1 4294967294 4294967294 103 Dec 2 17:40 disk.3 -> /var/lib/one/datastores/104/019e519ccd4cab21fb0af369e6fe8ad9.snap/019e519ccd4cab21fb0af369e6fe8ad9.snap
Interesting to note is that the directory name "019e519ccd4cab21fb0af369e6fe8ad9.snap" appears twice in the disk.3 link target. So instead of linking to a file inside "019e519ccd4cab21fb0af369e6fe8ad9.snap", it links to the directory itself.
See attachment for corresponding opennebula front-end logging regarding the (several) detach / attach operations.
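For anyone wanting to check other VMs for the same symptom, here is a rough sketch of a check. The function name find_dir_disk_links is made up for this note, and it assumes (based only on the listing above) that disk.N.snap links are expected to point at directories while plain disk.N links should resolve to files.

```shell
# Hypothetical helper: list disk.N symlinks in a VM directory that
# resolve to a directory instead of a file.
# Assumption: *.snap links legitimately point at directories, so skip them.
find_dir_disk_links() {
    for link in "$1"/disk.*; do
        case "$link" in
            *.snap) continue ;;
        esac
        # -L: the entry is a symlink; -d follows it and tests the target
        if [ -L "$link" ] && [ -d "$link" ]; then
            printf '%s -> %s\n' "$link" "$(readlink "$link")"
        fi
    done
}
```

Calling it as find_dir_disk_links /var/lib/one/datastores/103/148 on the affected host should print the disk.3 link shown above and nothing for the healthy disks.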
Associated revisions
Bug #4221: disk detach / attach leaves VM in broken state
Bug #4221: disk detach / attach leaves VM in broken state
(cherry picked from commit 06bedfe8f6227043d2804aef2479a69545af6346)
History
#1 Updated by Ruben S. Montero over 5 years ago
- Category set to Drivers - Storage
- Assignee set to Jaime Melis
- Target version set to 82
#2 Updated by Stefan Kooman over 5 years ago
We have noticed that one way to reproduce this issue is to issue "detach" / "attach" operations in quick succession. After ~10 operations the VM will end up with a broken disk image.
#3 Updated by Stefan Kooman over 5 years ago
@Jaime Melis: we have a test environment which we can give you access to through a tmate session.
#4 Updated by Ruben S. Montero over 5 years ago
- Target version changed from 82 to Release 5.0
#5 Updated by Jaime Melis over 5 years ago
- Status changed from Pending to Closed
The bug was caused by this ln -s behaviour:
$ mkdir a
$ ln -sf a b
$ find -ls
26476589 4 drwxr-xr-x 3 jmelis jmelis 4096 Jan 20 16:12 .
26480716 0 lrwxrwxrwx 1 jmelis jmelis 1 Jan 20 16:12 ./b -> a
30016735 4 drwxr-xr-x 2 jmelis jmelis 4096 Jan 20 16:05 ./a
$ ln -sf a b
$ find -ls
26476589 4 drwxr-xr-x 3 jmelis jmelis 4096 Jan 20 16:12 .
26480716 0 lrwxrwxrwx 1 jmelis jmelis 1 Jan 20 16:12 ./b -> a
30016735 4 drwxr-xr-x 2 jmelis jmelis 4096 Jan 20 16:12 ./a
30016737 0 lrwxrwxrwx 1 jmelis jmelis 1 Jan 20 16:12 ./a/a -> a #WTF??!?
Fix applied and backported to one-4.14
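For reference, the same behaviour as a small script that can be run in a scratch directory. This is only a sketch: the names a and b are the ones from the session above, and the temporary directory is an addition so nothing else gets touched. On the second call b already exists and resolves to a directory, so ln treats it as a destination directory and creates the new link inside it.

```shell
#!/bin/sh
# Reproduce the ln -sf pitfall in a throwaway directory.
set -eu
work=$(mktemp -d)
cd "$work"

mkdir a
ln -sf a b    # b does not exist yet: creates b -> a
ln -sf a b    # b now resolves to directory a, so ln creates a/a -> a
              # instead of replacing b

readlink b    # prints: a
readlink a/a  # prints: a  (the stray link inside the directory)
```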
#6 Updated by Rolandas Naujikas over 5 years ago
You can use "ln -snf a b" to replace an existing symlink with another, even if it points to a directory.
From man ln:
-n, --no-dereference treat LINK_NAME as a normal file if it is a symbolic link to a directory
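A quick sketch of what that option changes, assuming GNU coreutils ln. The directory names a and c and the scratch directory are only for illustration. With -n the existing link b is treated as the thing to replace, so nothing is created inside the directory it points to.

```shell
#!/bin/sh
# Show that -n makes ln replace a symlink that points to a directory.
set -eu
work=$(mktemp -d)
cd "$work"

mkdir a c
ln -s a b      # b -> a
ln -snf c b    # -n: do not follow b; -f: replace it, giving b -> c

readlink b     # prints: c
ls a           # prints nothing: no stray link was created inside a
```

Note that -n only covers the case where b is a symlink. If b is a real directory, the new link is still created inside it.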
#7 Updated by Jaime Melis over 5 years ago
- Status changed from Closed to Assigned
Reopening to apply Rolandas' suggestion (ln -nsf)
#8 Updated by Stefan Kooman over 5 years ago
We can confirm (after thorough stress testing) that both patches (incl. ln -nsf) work
#9 Updated by Javi Fontan about 5 years ago
- Status changed from Assigned to Closed
- Resolution set to fixed
We are going to use the rm command before the link.
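A minimal sketch of that approach, for illustration only. The helper name relink is hypothetical and this is not the actual driver code. Removing the old link first means ln always sees a free name, regardless of what the old link pointed to.

```shell
# Hypothetical helper, not the actual OpenNebula driver code.
# Usage: relink TARGET LINK_NAME
relink() {
    # rm -f removes the symlink itself (never its target) and stays
    # silent if LINK_NAME does not exist yet.
    rm -f "$2"
    ln -s "$1" "$2"
}
```

For example, relink /path/to/new/target disk.3 replaces disk.3 even if it currently points to a directory. One trade-off of rm followed by ln is that the link briefly does not exist between the two commands.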