Request #262: If a host fails too much one should disable it automatically. - OpenNebula - OpenNebula Development pages

Request #262

If a host fails too much one should disable it automatically.

Added by Marlon Nerling about 11 years ago. Updated about 8 years ago.

Status: Closed
Start date: 06/14/2010
Priority: High
Due date:
Assignee:
% Done:
Category: Core & System
Target version:
Pull request:

Description

This is related to Bug #196 and Bug #187.
I could not apply the patches related to those bugs, so I cannot use the RANK option yet.
Over the weekend, /var/lib/one on one host was automatically remounted read-only, so every attempt to create a new VM there failed.
That would not have been a big problem, had OpenNebula not obstinately kept placing every subsequent VM on this very host:

4627 root win764DE runn 0 4117504 172.22.0.20 02 21:13:18
4628 root Txp32DE  fail 0       0 172.22.0.2  00 00:00:16
4629 root Tvista64 fail 0       0 172.22.0.2  00 00:00:26
4630 root Twin764D fail 0       0 172.22.0.2  00 00:00:06
4631 root T200332D fail 0       0 172.22.0.2  00 00:00:16
4634 root Thardy-x fail 0       0 172.22.0.2  00 00:00:10
4635 root Txp32DE  fail 0       0 172.22.0.2  00 00:00:21
4636 root Tvista64 fail 0       0 172.22.0.2  00 00:00:30
4637 root Twin764D fail 0       0 172.22.0.2  00 00:00:10
4638 root T200332D fail 0       0 172.22.0.2  00 00:00:20
4639 root Twin764D fail 0       0 172.22.0.2  00 00:00:15
4640 root Twin764D fail 0       0 172.22.0.2  00 00:00:13
4641 root 200332DE fail 0       0 172.22.0.2  00 00:00:04
4642 root xp32DE   fail 0       0 172.22.0.2  00 00:00:07

This morning I noticed the failure and disabled the host.
I must say, the reputation of our virtual machine management has reached a new low!

I propose that the writability of /var/lib/one be checked on the hosts and that a host be disabled automatically in case of a filesystem failure.
I'm working on a patch for this and will post it ASAP.
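
The proposed check could look roughly like the sketch below. It is only an illustration, not the announced patch: the function names, the probe file name, and the REMOTE_SH, ONEHOST and ONE_DIR override variables are hypothetical. Only ssh, /var/lib/one and `onehost disable` are taken from this report.

```shell
#!/bin/sh
# Sketch only: probe whether /var/lib/one is writable on a host and
# disable the host in OpenNebula when the probe fails.
# REMOTE_SH, ONEHOST and ONE_DIR are hypothetical overrides (handy for dry runs).
REMOTE_SH=${REMOTE_SH:-ssh}
ONEHOST=${ONEHOST:-onehost}
ONE_DIR=${ONE_DIR:-/var/lib/one}

# Returns 0 if a probe file can be created and removed in $ONE_DIR on host $1.
host_dir_writable() {
    probe="$ONE_DIR/.write_probe.$$"
    $REMOTE_SH "$1" "touch '$probe' && rm -f '$probe'"
}

# Disables host $1 when the write probe fails; returns 1 in that case.
disable_if_readonly() {
    if host_dir_writable "$1"; then
        return 0
    fi
    echo "$ONE_DIR is not writable on $1, disabling host" >&2
    $ONEHOST disable "$1"
    return 1
}
```

A call such as `disable_if_readonly 172.22.0.2` could be run before deployment or from cron. Note that an unreachable host fails the probe as well, so this sketch does not distinguish a read-only filesystem from a network problem.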

Associated revisions

Revision 711e0f1b
Added by Sergio Semedi about 4 years ago

Bug #5113 Improve exception handling ec2 drivers (#262)

History

#1 Updated by Marlon Nerling about 11 years ago

I worked around it in /usr/lib/one/tm_commands/nfs/tm_clone.sh:
(SNIP)
--- tm_clone.sh (revision 1890)
+++ tm_clone.sh (working copy)
@@ -39,7 +39,13 @@
 DST_DIR=`dirname $DST_PATH`

 log "Creating directory $DST_DIR"
-exec_and_log "ssh $DST_HOST mkdir -p $DST_DIR"
+if  ssh $DST_HOST mkdir -p $DST_DIR
+then
+       echo "sucessfully created Directory"
+else
+       ONE_AUTH=/root/.one/auth onehost disable $DST_HOST
+       exec_and_log ' echo Could not create directory ; false'
+fi
 exec_and_log "ssh $DST_HOST chmod a+w $DST_DIR"
(SNIP)

It is a dirty patch, but I could not think of a better one.
Best regards
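
The workaround above disables the host on the first failed mkdir. A variant closer to the title of this request would count failures per host and disable the host only once a threshold is reached. The following is a minimal sketch under stated assumptions: the counter directory FAIL_DIR, the threshold MAX_FAILURES, the ONEHOST override and both function names are hypothetical and not part of OpenNebula. Only `onehost disable` comes from the patch.

```shell
#!/bin/sh
# Sketch only: per-host failure counter kept in plain files.
# FAIL_DIR, MAX_FAILURES and ONEHOST are hypothetical settings.
FAIL_DIR=${FAIL_DIR:-/var/tmp/one_host_failures}
MAX_FAILURES=${MAX_FAILURES:-3}
ONEHOST=${ONEHOST:-onehost}

# Increments the failure count of host $1 and disables the host
# once the count reaches $MAX_FAILURES.
record_host_failure() {
    mkdir -p "$FAIL_DIR" || return 1
    counter="$FAIL_DIR/$1"
    count=0
    [ -f "$counter" ] && count=$(cat "$counter")
    count=$((count + 1))
    echo "$count" > "$counter"
    if [ "$count" -ge "$MAX_FAILURES" ]; then
        $ONEHOST disable "$1"
    fi
}

# Clears the failure count of host $1, e.g. after a successful deployment.
reset_host_failures() {
    rm -f "$FAIL_DIR/$1"
}
```

In tm_clone.sh, the else branch could then call `record_host_failure $DST_HOST` instead of disabling the host directly, and the success branch could call `reset_host_failures $DST_HOST`. The sketch does no locking, so concurrent transfers to the same host may miscount.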