Bug #1682: Race condition when getting TID for iSCSI volumes - OpenNebula - OpenNebula Development pages

Bug #1682

Race condition when getting TID for iSCSI volumes

Added by Ruben S. Montero over 8 years ago. Updated over 8 years ago.

Status:

Closed

Start date:

12/06/2012

Priority:

Normal

Due date:

Assignee:

Jaime Melis

% Done:

Category:

Drivers - Auth

Target version:

Release 4.0

Resolution:

Pull request:

Affected Versions:

OpenNebula 3.8

Description

This issue has been reported by Alain Pannetrat in the mailing list:

I have been testing the bulk instantiation of virtual machines in
sunstone, where I initiate the bulk creation of 8 virtual machines in
parallel. I have noticed that between 2 and 4 machines just fail to
instantiate correctly with the typical following error message:

08 2012 [TM][I]: Command execution fail:
/var/lib/one/remotes/tm/iscsi/clone
iqn.2012-02.org.opennebula:san.vg-one.lv-one-26
compute.admin.lan:/var/lib/one//datastores/0/111/disk.0 111 101
Thu Dec  6 14:40:08 2012 [TM][E]: clone: Command "    set -e
Thu Dec  6 14:40:08 2012 [TM][I]: set -x
Thu Dec  6 14:40:08 2012 [TM][I]:
Thu Dec  6 14:40:08 2012 [TM][I]: # get size
Thu Dec  6 14:40:08 2012 [TM][I]: SIZE=$(sudo lvs --noheadings -o
lv_size "/dev/vg-one/lv-one-26")
Thu Dec  6 14:40:08 2012 [TM][I]:
Thu Dec  6 14:40:08 2012 [TM][I]: # create lv
Thu Dec  6 14:40:08 2012 [TM][I]: sudo lvcreate -L${SIZE} vg-one -n
lv-one-26-111
Thu Dec  6 14:40:08 2012 [TM][I]:
Thu Dec  6 14:40:08 2012 [TM][I]: # clone lv with dd
Thu Dec  6 14:40:08 2012 [TM][I]: sudo dd if=/dev/vg-one/lv-one-26
of=/dev/vg-one/lv-one-26-111 bs=64k
Thu Dec  6 14:40:08 2012 [TM][I]:
Thu Dec  6 14:40:08 2012 [TM][I]: # new iscsi target
Thu Dec  6 14:40:08 2012 [TM][I]: TID=$(sudo tgtadm --lld iscsi --op
show --mode target |             grep "Target" | tail -n 1 |
  awk '{split($2,tmp,":"); print tmp[1]+1;}')
Thu Dec  6 14:40:08 2012 [TM][I]:
Thu Dec  6 14:40:08 2012 [TM][I]: sudo tgtadm --lld iscsi --op new
--mode target --tid $TID  --targetname
iqn.2012-02.org.opennebula:san.vg-one.lv-one-26-111
Thu Dec  6 14:40:08 2012 [TM][I]: sudo tgtadm --lld iscsi --op bind
--mode target --tid $TID -I ALL
Thu Dec  6 14:40:08 2012 [TM][I]: sudo tgtadm --lld iscsi --op new
--mode logicalunit --tid $TID  --lun 1 --backing-store
/dev/vg-one/lv-one-26-111
Thu Dec  6 14:40:08 2012 [TM][I]: sudo tgt-admin --dump |sudo tee
/etc/tgt/targets.conf > /dev/null 2>&1" failed: + sudo lvs
--noheadings -o lv_size /dev/vg-one/lv-one-26
Thu Dec  6 14:40:08 2012 [TM][I]: 131072+0 records in
Thu Dec  6 14:40:08 2012 [TM][I]: 131072+0 records out
Thu Dec  6 14:40:08 2012 [TM][I]: 8589934592 bytes (8.6 GB) copied,
898.903 s, 9.6 MB/s
Thu Dec  6 14:40:08 2012 [TM][I]: tgtadm: this target already exists
Thu Dec  6 14:40:08 2012 [TM][E]: Error cloning
compute.admin.lan:/dev/vg-one/lv-one-26-111
Thu Dec  6 14:40:08 2012 [TM][I]: ExitCode: 22
Thu Dec  6 14:40:08 2012 [TM][E]: Error executing image transfer
script: Error cloning compute.admin.lan:/dev/vg-one/lv-one-26-111
Thu Dec  6 14:40:09 2012 [DiM][I]: New VM state is FAILED

After adding traces in the code, I found that there seems to be a race
condition in /var/lib/one/remotes/tm/iscsi/clone here the following
commands get executed:

TID=\$($SUDO $(tgtadm_next_tid))
$SUDO $(tgtadm_target_new "\$TID" "$NEW_IQN")

These commands are typically expanded to something like this:

TID=$(sudo tgtadm --lld iscsi --op show --mode target | grep "Target" 
| tail -n 1 | awk '{split($2,tmp,":");
sudo tgtadm --lld iscsi --op new --mode target --tid $TID
--targetname iqn.2012-02.org.opennebula:san.vg-one.lv-one-26-111

What seems to happens is two (or more) calls to the first command
tgtadm_next_tid happen simultaneously before the second command gets a
chance to get executed, and then TID as the same value for two (or
more) VMs.

The workaround I found is to replace the line:
TID=\$($SUDO $(tgtadm_next_tid))
with
TID=$VMID
in /var/lib/one/remotes/tm/iscsi/clone

Since $VMID is globally unique no race conditions can happen here.
I've tested this and the failures don't happen anymore in my setting.
Of course I'm not sure this is the ideal fix, since perhaps VMID can
take values that are out of range for tgtadm. So futher testing would
be needed.

Related issues

History

#1 Updated by Ruben S. Montero over 8 years ago

translation missing: en.field_story_points set to 0

#2 Updated by Ruben S. Montero over 8 years ago

Status changed from Assigned to Closed

This bug is solved as part of #1592 merge.

Also available in: Atom PDF

OpenNebula

Issues

Custom queries

Bug #1682

Race condition when getting TID for iSCSI volumes

History

#1 Updated by Ruben S. Montero over 8 years ago

#2 Updated by Ruben S. Montero over 8 years ago