Bug #1682
Race condition when getting TID for iSCSI volumes
Status: | Closed | Start date: | 12/06/2012 | |
---|---|---|---|---|
Priority: | Normal | Due date: | ||
Assignee: | Jaime Melis | % Done: | 0% | |
Category: | Drivers - Auth | |||
Target version: | Release 4.0 | |||
Resolution: | Pull request: | |||
Affected Versions: | OpenNebula 3.8 |
Description
This issue has been reported by Alain Pannetrat in the mailing list:
I have been testing the bulk instantiation of virtual machines in sunstone, where I initiate the bulk creation of 8 virtual machines in parallel. I have noticed that between 2 and 4 machines just fail to instantiate correctly with the typical following error message: 08 2012 [TM][I]: Command execution fail: /var/lib/one/remotes/tm/iscsi/clone iqn.2012-02.org.opennebula:san.vg-one.lv-one-26 compute.admin.lan:/var/lib/one//datastores/0/111/disk.0 111 101 Thu Dec 6 14:40:08 2012 [TM][E]: clone: Command " set -e Thu Dec 6 14:40:08 2012 [TM][I]: set -x Thu Dec 6 14:40:08 2012 [TM][I]: Thu Dec 6 14:40:08 2012 [TM][I]: # get size Thu Dec 6 14:40:08 2012 [TM][I]: SIZE=$(sudo lvs --noheadings -o lv_size "/dev/vg-one/lv-one-26") Thu Dec 6 14:40:08 2012 [TM][I]: Thu Dec 6 14:40:08 2012 [TM][I]: # create lv Thu Dec 6 14:40:08 2012 [TM][I]: sudo lvcreate -L${SIZE} vg-one -n lv-one-26-111 Thu Dec 6 14:40:08 2012 [TM][I]: Thu Dec 6 14:40:08 2012 [TM][I]: # clone lv with dd Thu Dec 6 14:40:08 2012 [TM][I]: sudo dd if=/dev/vg-one/lv-one-26 of=/dev/vg-one/lv-one-26-111 bs=64k Thu Dec 6 14:40:08 2012 [TM][I]: Thu Dec 6 14:40:08 2012 [TM][I]: # new iscsi target Thu Dec 6 14:40:08 2012 [TM][I]: TID=$(sudo tgtadm --lld iscsi --op show --mode target | grep "Target" | tail -n 1 | awk '{split($2,tmp,":"); print tmp[1]+1;}') Thu Dec 6 14:40:08 2012 [TM][I]: Thu Dec 6 14:40:08 2012 [TM][I]: sudo tgtadm --lld iscsi --op new --mode target --tid $TID --targetname iqn.2012-02.org.opennebula:san.vg-one.lv-one-26-111 Thu Dec 6 14:40:08 2012 [TM][I]: sudo tgtadm --lld iscsi --op bind --mode target --tid $TID -I ALL Thu Dec 6 14:40:08 2012 [TM][I]: sudo tgtadm --lld iscsi --op new --mode logicalunit --tid $TID --lun 1 --backing-store /dev/vg-one/lv-one-26-111 Thu Dec 6 14:40:08 2012 [TM][I]: sudo tgt-admin --dump |sudo tee /etc/tgt/targets.conf > /dev/null 2>&1" failed: + sudo lvs --noheadings -o lv_size /dev/vg-one/lv-one-26 Thu Dec 6 14:40:08 2012 [TM][I]: 131072+0 records in Thu Dec 6 14:40:08 2012 [TM][I]: 131072+0 records out Thu Dec 6 14:40:08 2012 [TM][I]: 8589934592 bytes (8.6 GB) copied, 898.903 s, 9.6 MB/s Thu Dec 6 14:40:08 2012 [TM][I]: tgtadm: this target already exists Thu Dec 6 14:40:08 2012 [TM][E]: Error cloning compute.admin.lan:/dev/vg-one/lv-one-26-111 Thu Dec 6 14:40:08 2012 [TM][I]: ExitCode: 22 Thu Dec 6 14:40:08 2012 [TM][E]: Error executing image transfer script: Error cloning compute.admin.lan:/dev/vg-one/lv-one-26-111 Thu Dec 6 14:40:09 2012 [DiM][I]: New VM state is FAILED After adding traces in the code, I found that there seems to be a race condition in /var/lib/one/remotes/tm/iscsi/clone here the following commands get executed: TID=\$($SUDO $(tgtadm_next_tid)) $SUDO $(tgtadm_target_new "\$TID" "$NEW_IQN") These commands are typically expanded to something like this: TID=$(sudo tgtadm --lld iscsi --op show --mode target | grep "Target" | tail -n 1 | awk '{split($2,tmp,":"); sudo tgtadm --lld iscsi --op new --mode target --tid $TID --targetname iqn.2012-02.org.opennebula:san.vg-one.lv-one-26-111 What seems to happens is two (or more) calls to the first command tgtadm_next_tid happen simultaneously before the second command gets a chance to get executed, and then TID as the same value for two (or more) VMs. The workaround I found is to replace the line: TID=\$($SUDO $(tgtadm_next_tid)) with TID=$VMID in /var/lib/one/remotes/tm/iscsi/clone Since $VMID is globally unique no race conditions can happen here. I've tested this and the failures don't happen anymore in my setting. Of course I'm not sure this is the ideal fix, since perhaps VMID can take values that are out of range for tgtadm. So futher testing would be needed.
Related issues
History
#1 Updated by Ruben S. Montero over 8 years ago
- translation missing: en.field_story_points set to 0
#2 Updated by Ruben S. Montero over 8 years ago
- Status changed from Assigned to Closed
This bug is solved as part of #1592 merge.