Bug #5353

undeploy fails when using ceph system datastore

Added by Tobias Fischer almost 2 years ago. Updated over 1 year ago.

Status: Pending
Start date: 09/06/2017
Priority: Normal
Due date:
Assignee: Ruben S. Montero
% Done: 0%
Category: Core & System
Target version: Release 5.6
Resolution:
Pull request:
Affected Versions: OpenNebula 5.4

Description

Hello,

When I use a Ceph system datastore, undeploying VMs fails with the following error:

Wed Sep 6 12:31:51 2017 [Z0][TM][I]: Command execution fail: /var/lib/one/remotes/tm/ceph/mv node01.example.com:/var/lib/one//datastores/130/23214 opennebula:/var/lib/one//datastores/130/23214 23214 130
Wed Sep 6 12:31:51 2017 [Z0][TM][I]: mv: Moving node01.example.com:/var/lib/one/datastores/130/23214 to opennebula:/var/lib/one/datastores/130/23214
Wed Sep 6 12:31:51 2017 [Z0][TM][E]: mv: Command "set -e -o pipefail
Wed Sep 6 12:31:51 2017 [Z0][TM][I]:
Wed Sep 6 12:31:52 2017 [Z0][TM][I]: tar -C /var/lib/one/datastores/130 --sparse -cf - 23214 | ssh opennebula 'tar -C /var/lib/one/datastores/130 --sparse -xf -'
Wed Sep 6 12:31:52 2017 [Z0][TM][I]: rm -rf /var/lib/one/datastores/130/23214" failed: ssh: Could not resolve hostname opennebula: Name or service not known
Wed Sep 6 12:31:52 2017 [Z0][TM][E]: Error copying disk directory to target host
Wed Sep 6 12:31:52 2017 [Z0][TM][I]: ExitCode: 255
Wed Sep 6 12:31:53 2017 [Z0][TM][E]: Error executing image transfer script: Error copying disk directory to target host
Wed Sep 6 12:31:53 2017 [Z0][VM][I]: New LCM state is EPILOG_UNDEPLOY_FAILURE
Wed Sep 6 12:34:36 2017 [Z0][VM][I]: New LCM state is EPILOG_UNDEPLOY

With an NFS system datastore, undeployment works as expected.

The question is: why is "opennebula" used for the controller instead of "opennebula.example.com"? Do I have to configure this somewhere?
A temporary fix is to add "opennebula" with its IP to /etc/hosts. It would be nice to fix this differently, though, so that we don't have to change /etc/hosts on all blades if the controller's IP ever changes :-)
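
The step that fails in the log is name resolution of the short name "opennebula" on the node. A minimal Python sketch for checking this from a node, using the names from the log; `can_resolve` is a hypothetical helper for illustration and is not part of OpenNebula:

```python
import socket


def can_resolve(name, resolver=socket.getaddrinfo):
    """Return True if `name` resolves to at least one address."""
    try:
        return len(resolver(name, None)) > 0
    except socket.gaierror:
        # Same class of failure as "Name or service not known" in the log.
        return False


# Example use on a node such as node01.example.com:
#   can_resolve("opennebula")              # short name used by the mv script
#   can_resolve("opennebula.example.com")  # FQDN of the controller
```

If the short name fails while the FQDN resolves, the node's resolver is fine and the problem lies in the name the frontend reports for itself.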

Thanks

History

#1 Updated by Ruben S. Montero almost 2 years ago

  • Category set to Drivers - Storage
  • Assignee set to Vlastimil Holer
  • Target version set to Release 5.4.3

#2 Updated by Vlastimil Holer almost 2 years ago

This is a problem with the core. The frontend hostname is detected with gethostname, which does not necessarily return the FQDN. It returns the FQDN only if the FQDN itself is set as the hostname.
https://github.com/OpenNebula/one/blob/512da1ee67ee83aef9df736aaa9988349a62d0d2/src/nebula/Nebula.cc#L53

Example 1:

$ hostname
thunder
$ hostname -f
thunder.localdomain

and gethostname returns thunder.

Example 2:

$ hostname
thunder.localdomain
$ hostname -f
thunder.localdomain

and gethostname returns thunder.localdomain.
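
The two examples can also be checked programmatically. A minimal Python sketch, assuming a Linux host where `socket.gethostname()` reports the same name as the `hostname` command and `socket.getfqdn()` is roughly comparable to `hostname -f`; `is_fqdn` and `hostname_report` are hypothetical helpers, and the "contains a domain part" test is only a heuristic:

```python
import socket


def is_fqdn(name):
    """Heuristic: a name counts as fully qualified if it has a non-empty domain part."""
    host, _, domain = name.partition(".")
    return bool(host) and bool(domain)


def hostname_report():
    """Collect the configured hostname and the resolver's view of the FQDN."""
    short = socket.gethostname()  # what `hostname` prints
    fqdn = socket.getfqdn()       # roughly what `hostname -f` prints
    return {
        "gethostname": short,
        "fqdn": fqdn,
        "hostname_is_fqdn": is_fqdn(short),
    }
```

In Example 1 `hostname_is_fqdn` would be false for "thunder"; in Example 2 it would be true for "thunder.localdomain".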

More sophisticated frontend FQDN detection will be needed, preferably with an override that is configurable in oned.conf. The frontend can have multiple IPs for public and cluster-private communication; without an override option it could use the wrong interface, with e.g. some performance penalty.
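
One possible shape for such detection, sketched in Python rather than in the C++ core and purely as an illustration: an explicit override wins, then the name from gethostname if it already looks fully qualified, then a resolver lookup. The `override` parameter stands in for a hypothetical oned.conf setting (no actual attribute name is implied), and `detect_frontend_name` is an invented helper:

```python
import socket


def detect_frontend_name(override=None,
                         gethostname=socket.gethostname,
                         getfqdn=socket.getfqdn):
    """Pick the name that nodes should use to reach the frontend.

    Order: explicit override, hostname if it already contains a domain,
    resolver-derived FQDN, and finally the bare hostname as a last resort.
    """
    if override:
        return override  # hypothetical admin-configured value

    name = gethostname()
    if "." in name:
        return name  # hostname is already set to the FQDN (Example 2)

    fqdn = getfqdn(name)
    if "." in fqdn:
        return fqdn  # resolver knows the domain (Example 1)

    return name  # nothing better available; matches the current behaviour
```

The lookup functions are passed in as parameters so the ordering can be exercised without touching the real host configuration. An override also covers the multi-interface case, where no automatic lookup can know which name the nodes should use.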

#3 Updated by Vlastimil Holer almost 2 years ago

Tobias,

Temporary fix is to add "opennebula" with IP to /etc/hosts. But would be nice to fix it differently so we don't have to change /etc/hosts on all blades in case we have to change the IP of controller :-)

A better fix for you, for now, is to make sure the FQDN is set as the hostname before starting OpenNebula (check with the plain "hostname" command without parameters; see the examples in my previous comment).

Although this is considered bad practice, it is now often the recommended way:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/ch-configure_host_names

However, Red Hat recommends that both static and transient names
match the fully-qualified domain name (FQDN) used for the machine
in DNS, such as host.example.com. 

Best regards,
Vlastimil

#4 Updated by Tobias Fischer almost 2 years ago

Hello Vlastimil,

Thanks for your help, much appreciated!

Best Regards,
Tobias

#5 Updated by Ruben S. Montero over 1 year ago

  • Target version changed from Release 5.4.3 to Release 5.6

#6 Updated by Vlastimil Holer over 1 year ago

  • Category changed from Drivers - Storage to Core & System
  • Assignee changed from Vlastimil Holer to Ruben S. Montero
