Bug #5386

After setting up OpenNebula HA, oned crashes every day

Added by Bing Wang over 2 years ago. Updated over 2 years ago.

Status: Closed
Start date: 09/22/2017
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Core & System
Target version: -
Resolution: duplicate
Pull request:
Affected Versions: OpenNebula 5.4

Description

I have a 3-node OpenNebula cluster with OpenNebula HA set up.

oneadmin@137:~$ onezone show 0
ZONE 0 INFORMATION
ID : 0
NAME : OpenNebula

ZONE SERVERS
ID NAME ENDPOINT
0 server137 http://192.168.137.137:2633/RPC2
1 server138 http://192.168.137.138:2633/RPC2
2 server139 http://192.168.137.139:2633/RPC2

HA & FEDERATION SYNC STATUS
ID NAME STATE TERM INDEX COMMIT VOTE FED_INDEX
0 server137 follower 76812 80481 80481 -1 -1
1 server138 follower 76812 82756 82756 -1 -1
2 server139 leader 76812 82756 82756 2 -1

ZONE TEMPLATE
ENDPOINT="http://localhost:2633/RPC2"
oneadmin@137:~$

The next day I find that oned has crashed on two of the nodes. For example:

root@137:~# onezone show 0
ZONE 0 INFORMATION
ID : 0
NAME : OpenNebula

ZONE SERVERS
ID NAME ENDPOINT
0 server137 http://192.168.137.137:2633/RPC2
1 server138 http://192.168.137.138:2633/RPC2
2 server139 http://192.168.137.139:2633/RPC2

HA & FEDERATION SYNC STATUS
ID NAME STATE TERM INDEX COMMIT VOTE FED_INDEX
0 server137 candidate 75978 79991 79991 -1 -1
1 server138 error - - - -
2 server139 error - - - -

ZONE TEMPLATE
ENDPOINT="http://localhost:2633/RPC2"
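The degraded state in the output above (servers reported as `error`) can be detected from a script instead of by eye. The following is a minimal sketch, assuming the `HA & FEDERATION SYNC STATUS` table keeps the column layout shown in this report (ID, NAME, STATE, ...). The function name `servers_in_error` is hypothetical and not part of OpenNebula.

```python
from typing import List

# Sample text copied from the `onezone show 0` output in this report.
SAMPLE = """\
HA & FEDERATION SYNC STATUS
ID NAME STATE TERM INDEX COMMIT VOTE FED_INDEX
0 server137 candidate 75978 79991 79991 -1 -1
1 server138 error - - - -
2 server139 error - - - -

ZONE TEMPLATE
ENDPOINT="http://localhost:2633/RPC2"
"""


def servers_in_error(onezone_output: str) -> List[str]:
    """Return the names of servers whose STATE column reads 'error'.

    Assumes the table starts after the 'HA & FEDERATION SYNC STATUS'
    heading and ends at the first blank line, as in this report.
    """
    names = []
    in_status = False
    for line in onezone_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("HA & FEDERATION SYNC STATUS"):
            in_status = True
            continue
        if not in_status:
            continue
        fields = stripped.split()
        if not fields:
            break  # a blank line ends the status table
        if fields[0] == "ID":
            continue  # skip the column header row
        if len(fields) >= 3 and fields[2] == "error":
            names.append(fields[1])
    return names


if __name__ == "__main__":
    print(servers_in_error(SAMPLE))
```

In practice the text would come from running `onezone show 0` as oneadmin and passing its standard output to the function.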

Program terminated with signal SIGABRT, Aborted.(core dumped)
(gdb) bt
#0 0x00007fb3c07c6c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fb3c07ca028 in __GI_abort () at abort.c:89
#2 0x00007fb3c10d5535 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007fb3c10d36d6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007fb3c10d3703 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007fb3c10d3922 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x0000000000615bd0 in (anonymous namespace)::throwIfError (env=...) at value.cpp:53
#7 0x0000000000618d82 in xmlrpc_c::cNewStringWrapper::cNewStringWrapper (this=0x7fb3527fb7d0,
cppvalue="\030\t\000L\263\177\000\000\200\272\177R\263\177\000\000\300\272\177R\263\177\000\000\000\000\000\000\000\000\000\000\060\274\177R\263\177\000\000݇^\000\000\000\000\000`\274\177R\263\177\000\000T\274\177R\263\177\000\000S\274\177R\263\177\000\000н\177R\263\177\000\000\300\270\177R\002\000\000\000\220\317\031\001\000\000\000\000\000\000\000\000\\:\001\000\v,\001\000\000\000\000\000\030\t\000L\263\177\000\000x\212\022\301\263\177\000\000\370Q\000\260\263\177\000\000r\264\200\300\263\177\000\000\020(\000\260\263\177\000\000[\253\200\300", '\000' <repeats 12 times>, "P\337\002L\263\177", '\000' <repeats 11 times>, "\272\177R\263\177\000\000@\t\000L\263\177\000\000"..., nlCode=xmlrpc_c::value_string::nlCode_all) at value.cpp:638
#8 0x00000000006174aa in xmlrpc_c::value_string::value_string (this=0x7fb3527fb890,
cppvalue="\177ELF\002\001\001\000\000\000\000\000\000\000\000\000\002\000>\000\001\000\000\000\v\002B\000\000\000\000\000@\000\000\000\000\000\000\000\270\215\017\000\000\000\000\000\000\000\000\000@\000\070\000\t\000@\000\034\000\033\000\006\000\000\000\005\000\000\000@\000\000\000\000\000\000\000@\000@\000\000\000\000\000@\000@\000\000\000\000\000\370\001\000\000\000\000\000\000\370\001\000\000\000\000\000\000\b\000\000\000\000\000\000\000\003\000\000\000\004\000\000\000\070\002\000\000\000\000\000\000\070\002@\000\000\000\000\000\070\002@\000\000\000\000\000\034\000\000\000\000\000\000\000\034\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\001\000\000\000\005", '\000' <repeats 13 times>, "@\000\000\000\000\000"...) at value.cpp:659
#9 0x00000000005e87dd in RaftManager::xmlrpc_replicate_log (this=0x119cf90, follower_id=2, lr=0x7fb3527fbdd0, success=@0x7fb3527fbc53: true, fterm=@0x7fb3527fbc54: 76811, error="")
at src/raft/RaftManager.cc:1024
#10 0x00000000005efe67 in HeartBeatThread::replicate (this=0x7fb3b000b680) at src/raft/ReplicaThread.cc:309
#11 0x00000000005ef6ac in ReplicaThread::do_replication (this=0x7fb3b000b680) at src/raft/ReplicaThread.cc:112
#12 0x00000000005ef539 in replication_thread (arg=0x7fb3b000b680) at src/raft/ReplicaThread.cc:56
#13 0x00007fb3c1381184 in start_thread (arg=0x7fb3527fc700) at pthread_create.c:312
#14 0x00007fb3c088dffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

And sometimes:

Program terminated with signal SIGABRT, Aborted.
#0 0x00007f526236dc37 in _quicksort (pbase=0x0, total_elems=<optimized out>, size=32, cmp=0x7f51f80008d8, arg=0x7f51f8000940) at qsort.c:125
125 qsort.c: No such file or directory.

uname -a
Linux 138 4.1.35-server #1 SMP Wed Sep 13 00:31:07 CST 2017 x86_64 x86_64 x86_64 GNU/Linux

My OpenNebula version is 5.4.0-1.

RAFT configuration is:
RAFT = [
LOG_RETENTION = 500000,
LOG_PURGE_TIMEOUT = 600,
ELECTION_TIMEOUT_MS = 5000,
BROADCAST_TIMEOUT_MS = 500,
XMLRPC_TIMEOUT_MS = 2000
]

History

#1 Updated by Ruben S. Montero over 2 years ago

  • Status changed from Pending to Closed
  • Resolution set to duplicate

Could you please upgrade to 5.4.1? You are hitting a bug that was fixed in this commit:

https://github.com/OpenNebula/one/commit/a6addb314e63361aeabc2a63803572456debd85c

There was an error setting up the session string for the replicate API call (the one related to xmlrpc_c::cNewStringWrapper::cNewStringWrapper in your logs).

Cheers
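To see why a badly initialised session string can abort the daemon: frame #7 of the backtrace shows `cppvalue` beginning with bytes that look like raw memory rather than text. Assuming xmlrpc-c requires string values to be valid UTF-8 and throws when they are not, an exception that nothing catches in the replication thread would end in `std::terminate` and SIGABRT, which matches frames #1 to #6. The sketch below is an illustration in Python, not the fix from the commit linked above. It only checks that the leading bytes from frame #7 are not valid UTF-8. The name `is_valid_utf8` and the sample credential string are hypothetical.

```python
# Leading bytes of `cppvalue` in frame #7, transcribed from gdb's octal
# escapes: \030 \t \000 L \263 \177 \000 \000
FRAME7_PREFIX = b"\x18\t\x00L\xb3\x7f\x00\x00"

# Hypothetical placeholder for a well-formed session string.
PLAUSIBLE_SESSION = b"oneadmin:token"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_valid_utf8(PLAUSIBLE_SESSION))  # well-formed text
    print(is_valid_utf8(FRAME7_PREFIX))      # bytes taken from the backtrace
```

The byte 0xB3 (`\263` in the gdb output) is a UTF-8 continuation byte with no lead byte before it, so the decode fails at that position.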
