Bug #4221

disk detach / attach leaves VM in broken state

Added by Stefan Kooman over 5 years ago. Updated about 5 years ago.

Status: Closed
Start date: 12/02/2015
Priority: Normal
Due date:
Assignee: Jaime Melis
% Done: 0%
Category: Drivers - Storage
Target version: Release 5.0
Resolution: fixed
Pull request:
Affected Versions: OpenNebula 4.14

Description

We have noticed (and can reproduce) that a symlink is sometimes not created correctly during a disk detach / attach operation (we are not yet sure what the trigger is). Instead of pointing to a file, the symlink points to a directory, which leads to the following error message on the host that is trying to start the VM:

qemu-system-x86_64: -drive file=/var/lib/one//datastores/103/148/disk.3,if=none,id=drive-virtio-disk1,format=qcow2,cache=none,aio=native: could not open disk image /var/lib/one//datastores/103/148/disk.3: Could not open '/var/lib/one//datastores/103/148/disk.3': Is a directory
2015-12-02 16:43:25.116+0000: shutting down

At this point the system datastore directory for this VM looks like this:

/var/lib/one/datastores/103/148# ls -lrt
total 428
lrwxrwxrwx 1 oneadmin oneadmin 65 Sep 2 18:41 disk.0.snap -> /var/lib/one/datastores/104/f6da286fb8dd7c81bf8f7fa541a525ac.snap
lrwxrwxrwx 1 oneadmin oneadmin 65 Sep 2 18:41 disk.1.snap -> /var/lib/one/datastores/104/019e519ccd4cab21fb0af369e6fe8ad9.snap
-rw-rw-r-- 1 oneadmin oneadmin 372736 Sep 2 18:41 disk.2
lrwxrwxrwx 1 oneadmin oneadmin 38 Sep 2 18:41 disk.2.iso -> /var/lib/one/datastores/103/148/disk.2
-rw-rw-r-- 1 oneadmin oneadmin 1301 Sep 2 18:41 deployment.0
-rw-rw-r-- 1 oneadmin oneadmin 1301 Oct 2 21:14 deployment.9
-rw-rw-r-- 1 oneadmin oneadmin 1301 Oct 23 13:04 deployment.10
-rw-rw-r-- 1 oneadmin oneadmin 1301 Oct 23 13:05 deployment.11
-rw-rw-r-- 1 oneadmin oneadmin 1301 Oct 23 13:06 deployment.12
-rw-rw-r-- 1 oneadmin oneadmin 1301 Oct 23 13:10 deployment.14
-rw-rw-r-- 1 oneadmin oneadmin 1301 Oct 23 13:17 deployment.16
lrwxrwxrwx 1 oneadmin oneadmin 68 Oct 23 13:18 disk.1 -> /var/lib/one//datastores/104/019e519ccd4cab21fb0af369e6fe8ad9.snap/2
-rw-rw-r-- 1 oneadmin oneadmin 1301 Oct 23 13:24 deployment.17
lrwxrwxrwx 1 oneadmin oneadmin 68 Oct 23 13:39 disk.0 -> /var/lib/one//datastores/104/f6da286fb8dd7c81bf8f7fa541a525ac.snap/7
-rw-rw-r-- 1 oneadmin oneadmin 1301 Dec 2 16:59 deployment.18
lrwxrwxrwx 1 4294967294 4294967294 65 Dec 2 17:39 disk.3.snap -> /var/lib/one/datastores/104/019e519ccd4cab21fb0af369e6fe8ad9.snap
lrwxrwxrwx 1 4294967294 4294967294 103 Dec 2 17:40 disk.3 -> /var/lib/one/datastores/104/019e519ccd4cab21fb0af369e6fe8ad9.snap/019e519ccd4cab21fb0af369e6fe8ad9.snap

Interesting to note is that the directory name "019e519ccd4cab21fb0af369e6fe8ad9.snap" appears twice in the target of disk.3. So instead of linking to a file inside "019e519ccd4cab21fb0af369e6fe8ad9.snap", it is linking to the directory itself.
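A quick way to spot this state on a host is to look for disk links that resolve to a directory. The following is only a sketch: the helper name find_dir_links is hypothetical, the demo runs against a throwaway temporary directory rather than a real datastore, and it assumes (based on the listing above) that the disk.N.snap links are expected to point at snapshot directories and can be skipped.

```shell
#!/bin/sh
# Hypothetical helper: print every disk.* symlink in a VM directory
# whose final target is a directory (the broken state described above).
find_dir_links() {
    vm_dir=$1
    for link in "$vm_dir"/disk.*; do
        # Assumption: disk.N.snap links legitimately point at directories.
        case "$link" in *.snap) continue ;; esac
        # -L tests the link itself; -d follows it and tests the target.
        if [ -L "$link" ] && [ -d "$link" ]; then
            printf '%s -> %s\n' "$link" "$(readlink "$link")"
        fi
    done
}

# Demo layout in a temporary directory (not a real datastore).
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
mkdir "$tmp/img.snap"
touch "$tmp/img.snap/0"
ln -s "$tmp/img.snap/0" "$tmp/disk.0"      # healthy: points to a file
ln -s "$tmp/img.snap"   "$tmp/disk.0.snap" # expected directory link
ln -s "$tmp/img.snap"   "$tmp/disk.3"      # broken: points to a directory

find_dir_links "$tmp"
```

In the demo only disk.3 is reported, since it is the only non-.snap link that resolves to a directory.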

See attachment for corresponding opennebula front-end logging regarding the (several) detach / attach operations.

one-148-broken-state.txt (22 KB) Stefan Kooman, 12/02/2015 05:21 PM

Associated revisions

Revision 06bedfe8
Added by Jaime Melis over 5 years ago

Bug #4221: disk detach / attach leaves VM in broken state

Revision 9555a98c
Added by Jaime Melis over 5 years ago

Bug #4221: disk detach / attach leaves VM in broken state

(cherry picked from commit 06bedfe8f6227043d2804aef2479a69545af6346)

History

#1 Updated by Ruben S. Montero over 5 years ago

  • Category set to Drivers - Storage
  • Assignee set to Jaime Melis
  • Target version set to 82

#2 Updated by Stefan Kooman over 5 years ago

We have noticed that one way to reproduce this issue is to issue "detach" / "attach" operations in quick succession. After ~10 operations the VM will end up with a broken disk image.

#3 Updated by Stefan Kooman over 5 years ago

@Jaime Melis: we have a test environment which we can give you access to through a tmate session.

#4 Updated by Ruben S. Montero over 5 years ago

  • Target version changed from 82 to Release 5.0

#5 Updated by Jaime Melis over 5 years ago

  • Status changed from Pending to Closed

The bug was caused by this ln -s behaviour:

$  mkdir a

$  ln -sf a b

$  find -ls  
26476589    4 drwxr-xr-x   3 jmelis   jmelis       4096 Jan 20 16:12 .
26480716    0 lrwxrwxrwx   1 jmelis   jmelis          1 Jan 20 16:12 ./b -> a
30016735    4 drwxr-xr-x   2 jmelis   jmelis       4096 Jan 20 16:05 ./a

$  ln -sf a b

$  find -ls  
26476589    4 drwxr-xr-x   3 jmelis   jmelis       4096 Jan 20 16:12 .
26480716    0 lrwxrwxrwx   1 jmelis   jmelis          1 Jan 20 16:12 ./b -> a
30016735    4 drwxr-xr-x   2 jmelis   jmelis       4096 Jan 20 16:12 ./a
30016737    0 lrwxrwxrwx   1 jmelis   jmelis          1 Jan 20 16:12 ./a/a -> a #WTF??!?

Fix applied and backported to one-4.14

#6 Updated by Rolandas Naujikas over 5 years ago

You can use "ln -snf a b" to replace an existing symlink with another one, even if it points to a directory.

From man ln:

       -n, --no-dereference
              treat LINK_NAME as a normal file if it is a symbolic link to a directory
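
To make the difference concrete, here is a small sketch comparing the two invocations in a temporary directory. It assumes GNU coreutils ln; the names a and b are just placeholders, as in the session quoted earlier in this thread.

```shell
#!/bin/sh
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1

mkdir a
ln -sf a b    # b does not exist yet, so the link b -> a is created
ln -sf a b    # b now resolves to directory a, so ln creates a/a instead

# -L tests the link itself (a/a is self-referential, so -e would fail).
[ -L a/a ] && echo "without -n: stray link a/a -> $(readlink a/a)"

rm a/a
ln -sfn a b   # -n: b is treated as a plain file and replaced in place

[ -L a/a ] || echo "with -n: no stray link inside a"
```

With -n the existing link b is replaced rather than followed, so nothing is created inside the directory it points to.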

#7 Updated by Jaime Melis over 5 years ago

  • Status changed from Closed to Assigned

Reopen to apply Roland's suggestion (ln -nsf)

#8 Updated by Stefan Kooman over 5 years ago

We can confirm (after thorough stress testing) that both patches (incl. ln -nsf) work.

#9 Updated by Javi Fontan about 5 years ago

  • Status changed from Assigned to Closed
  • Resolution set to fixed

We are going to use the rm command before the link.
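
For illustration, a minimal sketch of the "rm before the link" approach. The helper name relink is hypothetical and this is not the actual driver code; it only shows why removing the old link first avoids the nested link.

```shell
#!/bin/sh
# Hypothetical helper: (re)create a symlink by removing any old link first.
relink() {
    target=$1
    link=$2
    rm -f "$link"            # removes the symlink itself, never its target
    ln -s "$target" "$link"  # link name is now free, so nothing is followed
}

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp" || exit 1

mkdir a
relink a b
relink a b    # repeating the call does not create a/a

[ -L a/a ] || echo "no stray link inside a"
echo "b -> $(readlink b)"
```

This variant does not depend on ln supporting -n, at the cost of a short window between rm and ln in which the link does not exist.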
