Bug #2001

Add timeout to ssh polling

Added by Laurent Grawet about 8 years ago. Updated about 8 years ago.

Status: Closed
Start date: 05/07/2013
Priority: High
Due date:
Assignee: -
% Done: 0%
Category: Core & System
Target version: Release 4.0
Resolution: fixed
Pull request:
Affected Versions: OpenNebula 3.8

Description

Hi,

We had a problem with a Xen hypervisor. It crashed and became unreachable, and ssh polls piled up on the frontend.
There were many processes like these:

ssh -n xen1.mydomain if [ -x "/var/tmp/one/vmm/xen/poll" ]; then /var/tmp/one/vmm/
sh -c ssh -n xen1.mydomain 'if [ -x "/var/tmp/one/vmm/xen/poll" ]; then /var/tmp/o

As a consequence, monitoring stopped working in OpenNebula. All hosts were in the "init" state, and the crashed VMs from the failed hypervisor were still shown as running. The scheduler stopped working as well. I tried to reinstantiate the VMs; they were stuck in the "BOOT" state and nothing happened on the target hosts. I had to shut down oned and reboot the frontend for a quick recovery.

There must be an ssh timeout to handle this case: something like "-o ConnectTimeout=15", or a shorter timeout such as 5 seconds combined with retries, if retries are not already implemented. Ideally the timeout would be configurable in oned.conf.
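To illustrate the request, here is a minimal sketch of a bounded ssh probe with retries. The `retry` helper is hypothetical and is not part of OpenNebula; the 5-second timeout and 3 attempts are example values only. `ConnectTimeout` and `BatchMode` are standard OpenSSH client options.

```shell
#!/bin/sh
# Hypothetical helper (not OpenNebula code): run a command up to $1 times,
# stopping at the first success. Returns 1 if every attempt fails.
retry() {
    attempts=$1
    shift
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
    done
    return 1
}

# Example invocation (commented out, since it needs a reachable host):
# give up on each connection attempt after 5 seconds, try 3 times.
# BatchMode=yes makes ssh fail instead of waiting for a password prompt.
# retry 3 ssh -n -o BatchMode=yes -o ConnectTimeout=5 xen1.mydomain true
```

With these example values, an unreachable host costs at most about 15 seconds of connection attempts per poll instead of blocking indefinitely.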

Best regards,

Laurent

History

#1 Updated by Ruben S. Montero about 8 years ago

  • Status changed from New to Closed
  • Resolution set to fixed

The monitor process has been improved in version 4.0, especially to tackle problems like the one you have just described.

However, it is a good idea to tune the ssh config parameters. It would be better to use the oneadmin ssh config file for this. I've updated the documentation to include your suggestion.

http://opennebula.org/documentation:rel4.0:ignc#secure_shell_access_front-end
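As a rough sketch of what such a setting could look like in the oneadmin user's ~/.ssh/config: the values below are assumptions for illustration, not a copy of the linked documentation. `ConnectTimeout` only bounds connection setup, so the `ServerAlive*` options are included to also drop sessions that hang after connecting.

```
# ~/.ssh/config for oneadmin (illustrative values, adjust to your site)
Host *
    # Abort if the TCP connection is not established within 5 seconds
    ConnectTimeout 5
    # Send a keepalive probe after 10 seconds of silence from the server
    ServerAliveInterval 10
    # Disconnect after 3 unanswered keepalive probes
    ServerAliveCountMax 3
```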

THANKS

#2 Updated by Laurent Grawet about 8 years ago

Hi,

Thanks for the information. I had the same idea about .ssh/config, and I've updated the config. Is it better to use a small timeout (5 sec) as in the doc? Are there retries? Or is it safer to use a longer timeout such as 15 sec?

#3 Updated by Laurent Grawet about 8 years ago

OK, I've just seen that I can configure retries in oned.conf with "-r". The default is 0.

VM_MAD = [
    name       = "vmm_xen",
    executable = "one_vmm_exec",
    arguments  = "-t 15 -r 0 xen",
    default    = "vmm_exec/vmm_exec_xen.conf",
    type       = "xen" ]

Thanks
