Bug #2099

Possible race condition preventing driver from loading

Added by Simon Boulet about 8 years ago. Updated about 8 years ago.

Status:ClosedStart date:05/30/2013
Priority:NormalDue date:
Assignee:-% Done:

0%

Category:Core & System
Target version:Release 4.2
Resolution:fixed Pull request:
Affected Versions:OpenNebula 4.0

Description

Sometime when restarting OpenNebula, some of my drivers wont load correctly. OpenNebula is complaining that the driver "did not answer INIT command". I haven't had time to look into this yet.

...
Thu May 30 14:31:18 2013 [VMM][I]: Loading driver: a (XML)
Thu May 30 14:31:18 2013 [VMM][I]: Driver a loaded.
Thu May 30 14:31:18 2013 [InM][I]: Loading Information Manager drivers.
Thu May 30 14:31:18 2013 [InM][I]: Loading driver: b
Thu May 30 14:31:18 2013 [AuM][E]: Auth Error: Could not find Authorization driver
Thu May 30 14:31:18 2013 [ReM][D]: Req:416 UID:- ClusterPoolInfo invoked
Thu May 30 14:31:18 2013 [ReM][E]: Req:416 UID:- ClusterPoolInfo result FAILURE [ClusterPoolInfo] User couldn't be authenticated, aborting call.
Thu May 30 14:31:21 2013 [ReM][D]: Req:9600 UID:0 VirtualMachinePoolInfo invoked, 2, -1, -1, -1
Thu May 30 14:31:21 2013 [ReM][D]: Req:9600 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL><VM><ID>375..."
Thu May 30 14:31:21 2013 [ReM][D]: Req:9408 UID:0 VirtualMachinePoolInfo invoked, -2, -1, -1, -1
Thu May 30 14:31:21 2013 [ReM][D]: Req:9408 UID:0 VirtualMachinePoolInfo result SUCCESS, "<VM_POOL><VM><ID>375..."
Thu May 30 14:31:21 2013 [AuM][E]: Auth Error: Could not find Authorization driver
Thu May 30 14:31:21 2013 [ReM][D]: Req:8240 UID:
UserPoolInfo invoked
Thu May 30 14:31:21 2013 [ReM][E]: Req:8240 UID:- UserPoolInfo result FAILURE [UserPoolInfo] User couldn't be authenticated, aborting call.
Thu May 30 14:31:23 2013 [MAD][E]: MAD did not answer INIT command
Thu May 30 14:31:23 2013 [TM][I]: Loading Transfer Manager driver.
Thu May 30 14:31:23 2013 [TM][I]: Transfer manager driver loaded
Thu May 30 14:31:23 2013 [HKM][I]: Loading Hook Manager driver.
Thu May 30 14:31:23 2013 [HKM][I]: Hook Manager loaded
Thu May 30 14:31:23 2013 [ImM][I]: Loading Image Manager driver.
Thu May 30 14:31:23 2013 [ImM][I]: Image Manager loaded
Thu May 30 14:31:23 2013 [AuM][I]: Loading Auth. Manager driver.
Thu May 30 14:31:23 2013 [AuM][I]: Auth Manager loaded
...

As you can see the problem seems to happen when a user hits the API at the exact time the driver is being loaded.

I don't remind having this issue back with 3.8.

One very simple solution could be to enable the XML-RPC server only after all drivers have been loaded. However perhaps we should look at why the driver is failing at the first place.

Also perhaps OpenNebula should fail to start if one of the driver can't be loaded.

Simon

opennebula-start-rm-after-mads.patch Magnifier (1.73 KB) Simon Boulet, 06/03/2013 05:28 PM

Associated revisions

Revision 0b8325da
Added by Ruben S. Montero about 8 years ago

bug #2099: Move XML-RPC initialization after MADS are loaded. It prevents load failures, by Simon Boulet

Revision 6586e85b
Added by Ruben S. Montero about 8 years ago

bug #2099: Move XML-RPC initialization after MADS are loaded. It prevents load failures, by Simon Boulet

(cherry picked from commit 0b8325da7ca73da2f553175f04ed86e08326d3f9)

Revision 8155950f
Added by Ruben S. Montero about 8 years ago

bug #2099: Increased MAD timeout to 30s

Revision fcc859db
Added by Ruben S. Montero about 8 years ago

bug #2099: Increased MAD timeout to 30s

(cherry picked from commit 8155950f015b6178db3dcaa4b11622916fd96349)

History

#1 Updated by Simon Boulet about 8 years ago

Simple patch that moves the start of the XML-RPC endpoint (request manager) after all drivers (mads) have been initialized. Seems to fix the issue for me. Not sure why an incoming XML-RPC request in the middle of the drivers initialization is causing this issue.

#2 Updated by Ruben S. Montero about 8 years ago

  • Category set to Core & System
  • Status changed from Pending to Closed
  • Target version set to Release 4.2

Applied.

This is probably because of the timeout in the core to receive the driver answer to the initialization command.

We've usually seen this error when running ONE in a VM. Any overhead will make the timeout expire, thus failing the load of the driver. I guess that you are seeing the same, the RM thread may induce that additional load...

#3 Updated by Simon Boulet about 8 years ago

Do you think we could increase the timeout from 5 seconds to 30 seconds too? src/mad/Mad.cc line 200.

Or perhaps make the timeout a config option?

Simon

#4 Updated by Ruben S. Montero about 8 years ago

Simon Boulet wrote:

Do you think we could increase the timeout from 5 seconds to 30 seconds too? src/mad/Mad.cc line 200.

Or perhaps make the timeout a config option?

Simon

Indeed, We've increased that value in the past, 5 secs seems to be a lot already... I could not reproduce the original problem. Did increasing the timeout help in your case?

#5 Updated by Simon Boulet about 8 years ago

Ruben S. Montero wrote:

Indeed, We've increased that value in the past, 5 secs seems to be a lot already... I could not reproduce the original problem. Did increasing the timeout help in your case?

It seems that moving the XML-RPC after the mads initialization did help. However, I still face issues where sometime the driver will not initialize (INIT timeout). I'm seeing this on my development server, which is a virtualized environment.

My main problem is that when the INIT times out, OpenNebula starts without my driver being loaded, which obviously causes issues for creating new VMs (in case my VMM timed out), monitoring host (in cases the IM timed out), etc.

Also a longer timeout opens the door for further customization using more complex drivers that take longer to load. I don't see 30 seconds being an issue for regular users since their driver should load faster than that. Also perhaps OpenNebula should fail to start if driver fails to initialize, instead of starting with a non-working environment (with missing drivers).

Otherwise I think we should make the timeout a configuration parameter for people running complex drivers are running OpenNebula on busy hosts or in a virtualized environment with unpredictable performances.

I've set the timeout to 30 secs and everything has been working fine since.

Simon

#6 Updated by Ruben S. Montero about 8 years ago

I've set the timeout to 30 secs and everything has been working fine since.

Simon

OK. I've incremented the timeout in master and one-4.0

Cheers

#7 Updated by Ruben S. Montero about 8 years ago

  • Resolution set to fixed

Also available in: Atom PDF