Bug #2001
Add timeout to ssh polling
| Status: | Closed | Start date: | 05/07/2013 |
|---|---|---|---|
| Priority: | High | Due date: | |
| Assignee: | - | % Done: | 0% |
| Category: | Core & System | | |
| Target version: | Release 4.0 | | |
| Resolution: | fixed | Pull request: | |
| Affected Versions: | OpenNebula 3.8 | | |
Description
Hi,
We had a problem with a Xen hypervisor. It became unreachable after a crash, and ssh polls kept accumulating on the frontend.
There were a lot of processes like:
ssh -n xen1.mydomain if [ -x "/var/tmp/one/vmm/xen/poll" ]; then /var/tmp/one/vmm/
sh -c ssh -n xen1.mydomain 'if [ -x "/var/tmp/one/vmm/xen/poll" ]; then /var/tmp/o
As a consequence, monitoring stopped working in OpenNebula. All hosts were in the "init" state, and the crashed VMs from the failed hypervisor were still shown as running. The scheduler stopped working as well. I tried to re-instantiate the VMs, but they were stuck in the "BOOT" state and nothing was done on the target hosts. I had to shut down oned and reboot the frontend for a quick recovery.
There must be an ssh timeout to handle this case: something like "-o ConnectTimeout=15", or a shorter timeout such as 5 seconds combined with retries, if retries are not already implemented. Ideally, the timeout should be configurable in oned.conf.
Best regards,
Laurent
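The request above (a bounded ssh connection attempt plus retries) can be sketched in shell as follows. This is only an illustration, not how OpenNebula's drivers implement polling: `run_with_retries` and `RETRY_DELAY` are hypothetical names, and the timeout and retry values are examples. `ConnectTimeout` and `BatchMode` are standard OpenSSH client options.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenNebula): run a command and retry it
# up to $1 additional times if it exits with a non-zero status.
run_with_retries() {
    retries=$1
    shift
    attempt=0
    while :; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        # Give up once the allowed number of retries is used up.
        if [ "$attempt" -ge "$retries" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        # RETRY_DELAY is an assumed knob; default pause is 1 second.
        sleep "${RETRY_DELAY:-1}"
    done
}

# Example invocation (host name from the report, values illustrative):
# run_with_retries 2 ssh -n -o BatchMode=yes -o ConnectTimeout=5 xen1.mydomain true
```

With `ConnectTimeout=5`, each connection attempt to an unreachable host fails after about five seconds instead of hanging, so the worst case for the whole call stays bounded by the number of attempts.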
History
#1
Updated by Ruben S. Montero about 8 years ago
- Status changed from New to Closed
- Resolution set to fixed
The monitor process has been improved in version 4.0, especially to tackle problems like the one you have just described.
However, it is a good idea to tune the ssh config parameters. It'd be better to use the oneadmin ssh config file for this. I've updated the documentation to include your suggestion.
http://opennebula.org/documentation:rel4.0:ignc#secure_shell_access_front-end
THANKS
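A minimal sketch of the kind of ssh client settings discussed above, assuming they are placed in the oneadmin user's ~/.ssh/config on the front-end. ConnectTimeout is the option named in the report. ServerAliveInterval and ServerAliveCountMax are not mentioned in this thread; they are added here on the assumption that sessions which hang after the connection is established should also be dropped. All values are illustrative, not the ones from the OpenNebula documentation.

```
# ~/.ssh/config of the oneadmin user (illustrative values)
Host *
    # Give up on hosts that do not answer within 5 seconds
    ConnectTimeout 5
    # Assumption: also detect sessions that stall after connecting
    ServerAliveInterval 10
    ServerAliveCountMax 3
```

ConnectTimeout only bounds the connection phase, which is why the keepalive options are shown alongside it.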
#2
Updated by Laurent Grawet about 8 years ago
Hi,
Thanks for the information. I had the same idea about .ssh/config and have updated my configuration. Is it better to use a short timeout (5 seconds), as in the documentation, and are there retries? Or is it safer to use a longer timeout, such as 15 seconds?
#3
Updated by Laurent Grawet about 8 years ago
OK, I've just seen that I can configure retries in oned.conf with "-r". The default is 0.
VM_MAD = [
    name       = "vmm_xen",
    executable = "one_vmm_exec",
    arguments  = "-t 15 -r 0 xen",
    default    = "vmm_exec/vmm_exec_xen.conf",
    type       = "xen" ]
Thanks