Request #262

If a host fails too often, ONE should disable it automatically.

Added by Marlon Nerling about 11 years ago. Updated about 8 years ago.

Status: Closed
Start date: 06/14/2010
Priority: High
Due date:
Assignee: -
% Done: 0%
Category: Core & System
Target version: -
Pull request:

Description

This is related to Bug #196 and Bug #187.
I could not apply the patches related to this bug, so I cannot use the RANK option yet.
Over the weekend, the mount of /var/lib/one on one host was automatically remounted read-only, so all attempts to create new VMs on it failed.
That would not have been a big problem, had ONE not obstinately kept trying to place every subsequent VM on this very host:

4627 root win764DE runn 0 4117504 172.22.0.20 02 21:13:18
4628 root Txp32DE fail 0 0 172.22.0.2 00 00:00:16
4629 root Tvista64 fail 0 0 172.22.0.2 00 00:00:26
4630 root Twin764D fail 0 0 172.22.0.2 00 00:00:06
4631 root T200332D fail 0 0 172.22.0.2 00 00:00:16
4634 root Thardy-x fail 0 0 172.22.0.2 00 00:00:10
4635 root Txp32DE fail 0 0 172.22.0.2 00 00:00:21
4636 root Tvista64 fail 0 0 172.22.0.2 00 00:00:30
4637 root Twin764D fail 0 0 172.22.0.2 00 00:00:10
4638 root T200332D fail 0 0 172.22.0.2 00 00:00:20
4639 root Twin764D fail 0 0 172.22.0.2 00 00:00:15
4640 root Twin764D fail 0 0 172.22.0.2 00 00:00:13
4641 root 200332DE fail 0 0 172.22.0.2 00 00:00:04
4642 root xp32DE fail 0 0 172.22.0.2 00 00:00:07

This morning I noticed the failure and disabled the host.
I must say, the reputation of our Virtual Machine Management has reached a new low!

I propose that ONE should check the writability of /var/lib/one on the hosts and automatically disable a host in case of a filesystem failure.
I'm working on a patch for this and will post it ASAP.
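
The proposed check could look roughly like the following. This is a minimal sketch, not the patch mentioned above: the function names and the RUN_ON_HOST and DISABLE_HOST hooks are hypothetical and exist only so that the remote probe and the disable step can be swapped out; only ssh and `onehost disable` are real commands. Note that an unreachable host also fails the probe, so this sketch would disable it as well.

```shell
#!/bin/sh
# Sketch only: probe a directory for writability and disable the host on failure.
# RUN_ON_HOST and DISABLE_HOST are hypothetical hooks, not OpenNebula settings.

ONE_DIR="${ONE_DIR:-/var/lib/one}"

# Local probe: returns 0 only if a file can be created in directory $1.
# A read-only remount makes touch fail even when the permission bits look fine.
dir_is_writable() {
    probe="$1/.write_probe.$$"
    touch "$probe" 2>/dev/null && rm -f "$probe"
}

# Default hooks: run the same probe over ssh, disable through the onehost CLI.
run_on_host() {
    ssh "$1" "touch '$2/.write_probe' && rm -f '$2/.write_probe'"
}
disable_host() {
    onehost disable "$1"
}

# Check one host and disable it if the probe fails.
# An ssh connection error counts as a failed probe as well.
check_host() {
    host="$1"
    if "${RUN_ON_HOST:-run_on_host}" "$host" "$ONE_DIR"; then
        echo "$host: $ONE_DIR is writable"
    else
        echo "$host: $ONE_DIR is not writable, disabling host"
        "${DISABLE_HOST:-disable_host}" "$host"
    fi
}
```

Such a check could be run periodically for every host, for example `check_host 172.22.0.2` from a cron job on the front-end.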

Associated revisions

Revision 711e0f1b
Added by Sergio Semedi about 4 years ago

Bug #5113 Improve exception handling ec2 drivers (#262)

History

#1 Updated by Marlon Nerling about 11 years ago

I worked around it in /usr/lib/one/tm_commands/nfs/tm_clone.sh:
(SNIP)
--- tm_clone.sh (revision 1890)
+++ tm_clone.sh (working copy)
@@ -39,7 +39,13 @@
 DST_DIR=`dirname $DST_PATH`
 
 log "Creating directory $DST_DIR"
-exec_and_log "ssh $DST_HOST mkdir -p $DST_DIR"
+if ssh $DST_HOST mkdir -p $DST_DIR
+then
+    echo "successfully created Directory"
+else
+    ONE_AUTH=/root/.one/auth onehost disable $DST_HOST
+    exec_and_log ' echo Could not create directory ; false'
+fi
 exec_and_log "ssh $DST_HOST chmod a+w $DST_DIR"
(SNIP)

It is a dirty patch, but I could not think of a better one.
Best regards
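
A gentler variant of the workaround above would disable the host only after several consecutive failures, which is closer to the title of this request. The following is a rough sketch under stated assumptions: STATE_DIR, MAX_FAILURES, DISABLE_HOST and both function names are hypothetical, the counter is a plain file per host, and concurrent transfer scripts are not protected against each other.

```shell
#!/bin/sh
# Sketch only: disable a host after several consecutive failures.
# STATE_DIR, MAX_FAILURES and DISABLE_HOST are hypothetical names.

STATE_DIR="${STATE_DIR:-/tmp/one_host_failures}"
MAX_FAILURES="${MAX_FAILURES:-3}"

# Default action: disable the host through the onehost CLI.
disable_host() {
    onehost disable "$1"
}

# Record one failure for a host; disable it once the limit is reached.
record_failure() {
    host="$1"
    mkdir -p "$STATE_DIR"
    # A missing counter file means no failures so far.
    count=$(cat "$STATE_DIR/$host" 2>/dev/null || echo 0)
    count=$((count + 1))
    echo "$count" > "$STATE_DIR/$host"
    if [ "$count" -ge "$MAX_FAILURES" ]; then
        "${DISABLE_HOST:-disable_host}" "$host"
    fi
}

# A success resets the counter, so only consecutive failures add up.
record_success() {
    rm -f "$STATE_DIR/$1"
}
```

In the patched tm_clone.sh, the else branch could then call `record_failure $DST_HOST` instead of disabling the host right away, and the then branch could call `record_success $DST_HOST`.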

#2 Updated by Javi Fontan over 10 years ago

  • Target version changed from Release 1.4 to Release 1.4.2

#3 Updated by Ruben S. Montero over 10 years ago

  • Tracker changed from Bug to Request

#4 Updated by Ruben S. Montero almost 10 years ago

  • Target version deleted (Release 1.4.2)

#5 Updated by Ruben S. Montero about 8 years ago

  • Status changed from New to Closed

We now have the error state: the scheduler will not assign VMs to a host in that state, and we keep polling it in case the failure is temporary. Closing.
