Bug #4662
Segfault in Debian Jessie
| Status: | Closed | Start date: | 07/20/2016 |
|---|---|---|---|
| Priority: | High | Due date: | |
| Assignee: | - | % Done: | 0% |
| Category: | Core & System | | |
| Target version: | Release 5.2 | | |
| Resolution: | fixed | Pull request: | |
| Affected Versions: | OpenNebula 4.12, OpenNebula 4.14, OpenNebula 5.0 | | |
Description
As already reported [1], OpenNebula uses libpthread incorrectly. This results in SEGFAULTs on startup.
This behavior can be reproduced with OpenNebula 4.12 and 5.0.
It affects only systems with Intel CPUs that have the TSX feature.
[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=824191
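The underlying failure mode can be illustrated with a minimal sketch (an assumed reproduction, not OpenNebula code): unlocking a mutex that is not locked is undefined behavior, and on glibc builds with TSX-based lock elision the stray unlock faults inside __lll_unlock_elision instead of being silently tolerated, which is why only TSX-capable CPUs are affected.

```cpp
#include <pthread.h>
#include <cstdio>

// Minimal sketch of the bug class (not OpenNebula code): the second
// unlock is undefined behavior. On glibc with TSX lock elision it
// typically segfaults in __lll_unlock_elision; on older hardware it
// was often silently ignored, which is how the bug stayed hidden.
int main()
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    pthread_mutex_lock(&m);
    pthread_mutex_unlock(&m);

    pthread_mutex_unlock(&m); // double unlock: crashes under elision

    std::puts("survived (lock elision probably inactive)");
    return 0;
}
```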
Associated revisions
bug #4662: Do not try to unlock non locked mutex for PoolObjectSQL class
(cherry picked from commit 979c506becd33c59fe737f1e0091bc6acb1b4336)
bug #4662: Removed additional double unlock()
(cherry picked from commit 7676ab6c62b979489c5b686329b0ef54d9243eb7)
bug #4662: Removes double unlocks from MarketPlace. Adds SSL callbacks for multithread calls
(cherry picked from commit 8bba5b8b7380c84e38d08fccae13ae48bc4d4a59)
bug #4662: Do not double unlock on deploy/migrate operations
(cherry picked from commit 61a246400f65156529008cc34d760ca0fe86fc8a)
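For context on the "Adds SSL callbacks for multithread calls" commit: OpenSSL releases before 1.1.0 require a multithreaded application to register its own locking callbacks; without them, concurrent SSL use from several threads is unsafe. The registration looks roughly like this (a sketch under that assumption, not the actual OpenNebula code):

```cpp
#include <openssl/crypto.h>
#include <pthread.h>
#include <vector>

// OpenSSL < 1.1.0 expects the application to provide one mutex per
// internal lock slot plus a locking callback and a thread-id callback.
static std::vector<pthread_mutex_t> ssl_locks;

static void ssl_locking_cb(int mode, int n, const char*, int)
{
    if (mode & CRYPTO_LOCK)
    {
        pthread_mutex_lock(&ssl_locks[n]);
    }
    else
    {
        pthread_mutex_unlock(&ssl_locks[n]);
    }
}

static unsigned long ssl_id_cb()
{
    return (unsigned long) pthread_self();
}

void init_ssl_locks()
{
    ssl_locks.resize(CRYPTO_num_locks());

    for (auto& m : ssl_locks)
    {
        pthread_mutex_init(&m, nullptr);
    }

    CRYPTO_set_id_callback(ssl_id_cb);
    CRYPTO_set_locking_callback(ssl_locking_cb);
}
```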
History
#1 Updated by Tino Vázquez almost 5 years ago
- Target version set to Release 5.2
#2 Updated by Ruben S. Montero almost 5 years ago
Hi Benjamin
I've followed the Debian issue and committed a patch to address it. I do not have access to a TSX processor. Could you give it a try?
There are patches for master and one-5.0, although they may also apply to one-4.14.
THANKS!
#3 Updated by Benjamin Taubmann almost 5 years ago
Hi Ruben,
I applied your patch to the source package of Debian (OpenNebula version 4.12).
At least oned no longer segfaults at startup.
However, it still fails in other places (dmesg output):
traps: mm_sched[27955] general protection ip:7f160737e440 sp:7ffe709bc0c8 error:0 in libpthread-2.23.so[7f160736c000+18000]
traps: oned[27954] general protection ip:7f33d7457440 sp:7fff8aa22048 error:0 in libpthread-2.23.so[7f33d7445000+18000]
I can provide more debugging information if required.
Thanks for helping!
#4 Updated by Ruben S. Montero almost 5 years ago
Thank you!
It seems that we missed some double mutex unlocks. It would be super to have a GDB backtrace like the one posted in the Debian bug report; that would help us a lot to trace this. Note that we seem to have the same problem in both the scheduler and oned.
THANKS again!!
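(For reference, one way to capture such a backtrace; the commands are illustrative:)

```
# attach to the running daemon (same idea for mm_sched)
gdb -p $(pidof oned)
(gdb) continue
            # ... reproduce the crash ...
(gdb) bt    # print the backtrace once SIGSEGV is reported
```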
#5 Updated by Benjamin Taubmann almost 5 years ago
Yes it seems so!
Here are the traces:
The backtrace for mm_sched:
Thread 1 "mm_sched" received signal SIGSEGV, Segmentation fault.
__lll_unlock_elision (lock=lock@entry=0x7ffd8894baf0, private=0)
at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or directory.
(gdb) bt
#0 __lll_unlock_elision (lock=lock@entry=0x7ffd8894baf0, private=0)
at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
#1 0x00007f25989f5187 in __pthread_mutex_unlock_usercnt (mutex=mutex@entry=0x7ffd8894baf0, decr=decr@entry=1)
at pthread_mutex_unlock.c:64
#2 0x00007f25989f520a in __GI_pthread_mutex_unlock (mutex=mutex@entry=0x7ffd8894baf0)
at pthread_mutex_unlock.c:314
#3 0x00007f25992c8ace in ActionManager::unlock (this=0x7ffd8894ba98) at include/ActionManager.h:150
#4 ActionManager::~ActionManager (this=0x7ffd8894ba98, __in_chrg=<optimized out>)
at src/common/ActionManager.cc:45
#5 0x00007f259929c195 in Scheduler::~Scheduler (this=0x7ffd8894b980, __in_chrg=<optimized out>)
at src/scheduler/include/Scheduler.h:69
#6 RankScheduler::~RankScheduler (this=0x7ffd8894b980, __in_chrg=<optimized out>)
at src/scheduler/src/sched/mm_sched.cc:38
#7 0x00007f25992987a0 in main (argc=<optimized out>, argv=<optimized out>)
at src/scheduler/src/sched/mm_sched.cc:68
And the backtrace for oned:
[Switching to Thread 0x7f6ef0600400 (LWP 9369)]
__lll_unlock_elision (lock=lock@entry=0x7f6ef05ff138, private=0)
at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or directory.
(gdb) bt
#0 __lll_unlock_elision (lock=lock@entry=0x7f6ef05ff138, private=0)
at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
#1 0x00007f6eef114187 in __pthread_mutex_unlock_usercnt (mutex=mutex@entry=0x7f6ef05ff138, decr=decr@entry=1)
at pthread_mutex_unlock.c:64
#2 0x00007f6eef11420a in __GI_pthread_mutex_unlock (mutex=mutex@entry=0x7f6ef05ff138)
at pthread_mutex_unlock.c:314
#3 0x00007f6ef042334e in ActionManager::unlock (this=0x7f6ef05ff0e0) at include/ActionManager.h:150
#4 ActionManager::~ActionManager (this=0x7f6ef05ff0e0, __in_chrg=<optimized out>)
at src/common/ActionManager.cc:45
#5 0x00007f6ef036f3a2 in SyncRequest::~SyncRequest (this=0x7f6ef05ff0a0, __in_chrg=<optimized out>)
at include/SyncRequest.h:41
#6 AuthRequest::~AuthRequest (this=0x7f6ef05ff0a0, __in_chrg=<optimized out>) at include/AuthRequest.h:42
#7 UserPool::authenticate_server (this=this@entry=0x7f6ef11c3b60, user=<optimized out>,
user@entry=0x7f6e94001600, token="oneadmin:BDnC1cFbbGN2jFGygE/U/hJCtnRmsiQfHymPulSCm5k=",
password="7fee24db4289fa5d98c6aca31380cb62ef29acb7", user_id=@0x7f6ef05ff490: 0,
group_id=@0x7f6ef05ff494: 0, uname="oneadmin", gname="oneadmin",
group_ids=std::set with 1 elements = {...}, umask=@0x7f6ef05ff54c: 127) at src/um/UserPool.cc:586
#8 0x00007f6ef0371fab in UserPool::authenticate (this=0x7f6ef11c3b60,
session="serveradmin:oneadmin:BDnC1cFbbGN2jFGygE/U/hJCtnRmsiQfHymPulSCm5k=",
password="7fee24db4289fa5d98c6aca31380cb62ef29acb7", user_id=@0x7f6ef05ff490: 0,
group_id=@0x7f6ef05ff494: 0, uname="oneadmin", gname="oneadmin",
group_ids=std::set with 1 elements = {...}, umask=@0x7f6ef05ff54c: 127) at src/um/UserPool.cc:953
#9 0x00007f6ef0351aa7 in Request::execute (this=0x7f6ef11e41a0, _paramList=..., _retval=<optimized out>)
at src/rm/Request.cc:48
#10 0x00007f6eee726999 in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server++.so.8
#11 0x00007f6ee9d05029 in xmlrpc_dispatchCall () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server.so.3
#12 0x00007f6ee9d05178 in xmlrpc_registry_process_call2 () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server.so.3
#13 0x00007f6eee726543 in xmlrpc_c::registry::processCall(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, xmlrpc_c::callInfo const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) const () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server++.so.8
#14 0x00007f6eee92f164 in xmlrpc_c::serverAbyss_impl::processCall(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, TSession, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server_abyss++.so.8
#15 0x00007f6eee92f232 in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server_abyss++.so.8
#16 0x00007f6ee9f0e3ca in xmlrpc_handleIfXmlrpcReq () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server_abyss.so.3
#17 0x00007f6ee9af77a6 in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_abyss.so.3
#18 0x00007f6ee9af7a6e in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_abyss.so.3
#19 0x00007f6ee9af20b7 in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_abyss.so.3
#20 0x00007f6ee9afac3b in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_abyss.so.3
#21 0x00007f6eef110464 in start_thread (arg=0x7f6ef0600400) at pthread_create.c:333
#22 0x00007f6eed7a230d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
#6 Updated by Ruben S. Montero almost 5 years ago
THANKS!!!
Hopefully both are the same issue. There is a new commit to fix this. We went through the code and it seems there are no other ones; we'll run a couple of our tests with valgrind to double-check.
In the meantime, if you can try it, that would be great!
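(For concreteness, by "double-check with valgrind" we mean runs along these lines; the binary paths and flags are illustrative:)

```
# helgrind reports "unlocked a not-locked lock" errors directly
valgrind --tool=helgrind /usr/bin/oned -f

# plain memcheck for the memory side
valgrind --leak-check=full /usr/bin/mm_sched
```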
#7 Updated by Benjamin Taubmann almost 5 years ago
I tried to patch OpenNebula 4.14 [1]. I hope it is correct.
But it still segfaults when I try to log in.
Here is the backtrace of mm_sched:
Thread 1 "mm_sched" received signal SIGSEGV, Segmentation fault.
__lll_unlock_elision (lock=lock@entry=0x7ffd62905100, private=0)
at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or directory.
(gdb) bt
#0 __lll_unlock_elision (lock=lock@entry=0x7ffd62905100, private=0)
at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
#1 0x00007f76c5a56187 in __pthread_mutex_unlock_usercnt (mutex=mutex@entry=0x7ffd62905100, decr=decr@entry=1)
at pthread_mutex_unlock.c:64
#2 0x00007f76c5a5620a in __GI_pthread_mutex_unlock (mutex=mutex@entry=0x7ffd62905100)
at pthread_mutex_unlock.c:314
#3 0x00007f76c6329ace in ActionManager::unlock (this=0x7ffd629050a8) at include/ActionManager.h:150
#4 ActionManager::~ActionManager (this=0x7ffd629050a8, __in_chrg=<optimized out>)
at src/common/ActionManager.cc:45
#5 0x00007f76c62fd195 in Scheduler::~Scheduler (this=0x7ffd62904f90, __in_chrg=<optimized out>)
at src/scheduler/include/Scheduler.h:69
#6 RankScheduler::~RankScheduler (this=0x7ffd62904f90, __in_chrg=<optimized out>)
at src/scheduler/src/sched/mm_sched.cc:38
#7 0x00007f76c62f97a0 in main (argc=<optimized out>, argv=<optimized out>)
at src/scheduler/src/sched/mm_sched.cc:68
#8 Updated by Ruben S. Montero almost 5 years ago
This one has to be applied as well.
As you can see, it removes line 45 of ActionManager.cc, which is the one causing the error:
#4 ActionManager::~ActionManager (this=0x7ffd629050a8, __in_chrg=<optimized out>)
at src/common/ActionManager.cc:45
If you have already applied it, then it seems that mm_sched didn't get reinstalled/restarted properly.
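In other words, the pattern being removed is a destructor unlocking a mutex that the destructing thread does not hold (a sketch of the idea with hypothetical names, not the literal diff):

```cpp
#include <pthread.h>

// Sketch of the fix in ActionManager's destructor (hypothetical
// rendering, not the literal diff): unlocking a mutex that is not
// held is undefined behavior and faults under TSX lock elision.
class ActionManagerSketch
{
    pthread_mutex_t mutex;

public:
    ActionManagerSketch()
    {
        pthread_mutex_init(&mutex, nullptr);
    }

    ~ActionManagerSketch()
    {
        // Before (roughly): pthread_mutex_unlock(&mutex); // UB here
        pthread_mutex_destroy(&mutex); // after: only destroy it
    }
};

int main()
{
    ActionManagerSketch am; // destructor now runs cleanly at scope exit
    return 0;
}
```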
#9 Updated by Ruben S. Montero almost 5 years ago
By the way, I caught a couple of additional errors with valgrind in the MarketPlace, but you should be fine on the 4.14 branch.
#10 Updated by Benjamin Taubmann almost 5 years ago
I applied the patch in ActionManager.cc and now it seems to work :)
Thanks!
It would be nice if these changes were applied to the Debian packages.
#11 Updated by Benjamin Taubmann almost 5 years ago
Actually, it now fails at a different place. I will try to provide traces ASAP.
#12 Updated by Benjamin Taubmann almost 5 years ago
When I added a new network interface, the oned daemon crashed again.
Here is the trace:
#0 __lll_unlock_elision (lock=lock@entry=0x7f1040001098, private=0)
at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
#1 0x00007f1093c5a187 in __pthread_mutex_unlock_usercnt (mutex=mutex@entry=0x7f1040001098, decr=decr@entry=1)
at pthread_mutex_unlock.c:64
#2 0x00007f1093c5a20a in __GI_pthread_mutex_unlock (mutex=mutex@entry=0x7f1040001098)
at pthread_mutex_unlock.c:314
#3 0x00007f1094f24f74 in PoolObjectSQL::~PoolObjectSQL (this=0x7f1040000f80, __in_chrg=<optimized out>)
at include/PoolObjectSQL.h:127
#4 VirtualNetwork::~VirtualNetwork (this=0x7f1040000f80, __in_chrg=<optimized out>)
at src/vnm/VirtualNetwork.cc:66
#5 0x00007f1094f250c1 in VirtualNetwork::~VirtualNetwork (this=0x7f1040000f80, __in_chrg=<optimized out>)
at src/vnm/VirtualNetwork.cc:69
#6 0x00007f1094f0574a in PoolSQL::allocate (this=this@entry=0x7f1096d8d910,
objsql=objsql@entry=0x7f1040000f80, error_str="") at src/pool/PoolSQL.cc:130
#7 0x00007f1094f26a1e in VirtualNetworkPool::allocate (this=0x7f1096d8d910, uid=0, gid=0, uname="oneadmin",
gname="oneadmin", umask=127, pvid=-1, vn_template=0x7f1040003b60, oid=0x7f109512424c, cluster_id=-1,
cluster_name="", error_str="") at src/vnm/VirtualNetworkPool.cc:133
#8 0x00007f1094e59d06 in VirtualNetworkAllocate::pool_allocate (this=<optimized out>, paramList=...,
tmpl=<optimized out>, id=<optimized out>, error_str=..., att=..., cluster_id=-1, cluster_name="")
at src/rm/RequestManagerAllocate.cc:284
#9 0x00007f1094e5b0be in RequestManagerAllocate::request_execute (this=0x7f1096da3270, params=..., att=...)
at src/rm/RequestManagerAllocate.cc:182
#10 0x00007f1094e97bb4 in Request::execute (this=0x7f1096da3270, _paramList=..., _retval=<optimized out>)
at src/rm/Request.cc:58
#11 0x00007f109326c999 in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server++.so.8
#12 0x00007f108e84b029 in xmlrpc_dispatchCall () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server.so.3
#13 0x00007f108e84b178 in xmlrpc_registry_process_call2 () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server.so.3
#14 0x00007f109326c543 in xmlrpc_c::registry::processCall(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, xmlrpc_c::callInfo const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) const () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server++.so.8
#15 0x00007f1093475164 in xmlrpc_c::serverAbyss_impl::processCall(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, TSession, std::_cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server_abyss++.so.8
#16 0x00007f1093475232 in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server_abyss++.so.8
#17 0x00007f108ea543ca in xmlrpc_handleIfXmlrpcReq () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server_abyss.so.3
#18 0x00007f108e63d7a6 in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_abyss.so.3
#19 0x00007f108e63da6e in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_abyss.so.3
#20 0x00007f108e6380b7 in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_abyss.so.3
#21 0x00007f108e640c3b in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_abyss.so.3
#22 0x00007f1093c56464 in start_thread (arg=0x7f1095125400) at pthread_create.c:333
#23 0x00007f10922e830d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
#13 Updated by Ruben S. Montero almost 5 years ago
Hmm, I don't see anything at line 130:
#6 0x00007f1094f0574a in PoolSQL::allocate (this=this@entry=0x7f1096d8d910,
objsql=objsql@entry=0x7f1040000f80, error_str="") at src/pool/PoolSQL.cc:130
Note that the first patch removes the unlock() call at line 129; see http://dev.opennebula.org/projects/opennebula/repository/revisions/979c506becd33c59fe737f1e0091bc6acb1b4336
Could you check that both patches are applied? I have backported this to the one-4.14 branch to make it easier; you should be able to compile from the head of one-4.14.
THANKS
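(The shape of that error path, sketched with hypothetical names rather than the real PoolSQL.cc: on a failed insert, the object was unlocked once more right before being deleted.)

```cpp
#include <pthread.h>
#include <string>

// Hypothetical stand-in for PoolObjectSQL: the object owns a mutex
// that its destructor cleans up.
struct ObjectSketch
{
    pthread_mutex_t mutex;

    ObjectSketch()  { pthread_mutex_init(&mutex, nullptr); }
    ~ObjectSketch() { pthread_mutex_destroy(&mutex); }

    void unlock() { pthread_mutex_unlock(&mutex); }
};

// Sketch of the allocate() error path (not the real code): the patch
// drops the redundant unlock and lets the destructor do the cleanup.
int allocate_sketch(ObjectSketch* objsql, bool insert_failed, std::string& error_str)
{
    if (insert_failed)
    {
        error_str = "insert failed";
        // Before the patch (roughly): objsql->unlock(); // double unlock
        delete objsql;
        return -1;
    }

    delete objsql;
    return 0;
}

int main()
{
    std::string err;
    return allocate_sketch(new ObjectSketch, true, err) == -1 ? 0 : 1;
}
```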
#14 Updated by Benjamin Taubmann almost 5 years ago
Sorry, I forgot to remove that line! It seems to work now!
#15 Updated by Benjamin Taubmann almost 5 years ago
It seems there is another bug. When I start a VM, it segfaults again with this backtrace:
Thread 31 "oned" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fdc5e328400 (LWP 5103)]
__lll_unlock_elision (lock=lock@entry=0x7fdbfc003078, private=0) at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or directory.
(gdb) bt
#0 __lll_unlock_elision (lock=lock@entry=0x7fdbfc003078, private=0) at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
#1 0x00007fdc5ce3c187 in __pthread_mutex_unlock_usercnt (mutex=mutex@entry=0x7fdbfc003078, decr=decr@entry=1) at pthread_mutex_unlock.c:64
#2 0x00007fdc5ce3c20a in __GI_pthread_mutex_unlock (mutex=mutex@entry=0x7fdbfc003078) at pthread_mutex_unlock.c:314
#3 0x00007fdc5e04bd62 in PoolObjectSQL::unlock (this=0x7fdbfc002f60) at include/PoolObjectSQL.h:290
#4 VirtualMachineDeploy::request_execute (this=0x7fdc5f9ca8c0, paramList=..., att=...) at src/rm/RequestManagerVirtualMachine.cc:859
#5 0x00007fdc5e079bb4 in Request::execute (this=0x7fdc5f9ca8c0, _paramList=..., _retval=<optimized out>) at src/rm/Request.cc:58
#6 0x00007fdc5c44e999 in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server++.so.8
#7 0x00007fdc57a2d029 in xmlrpc_dispatchCall () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server.so.3
#8 0x00007fdc57a2d178 in xmlrpc_registry_process_call2 () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server.so.3
#9 0x00007fdc5c44e543 in xmlrpc_c::registry::processCall(std::_cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, xmlrpc_c::callInfo const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) const () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server++.so.8
#10 0x00007fdc5c657164 in xmlrpc_c::serverAbyss_impl::processCall(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, TSession, std::_cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server_abyss++.so.8
#11 0x00007fdc5c657232 in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server_abyss++.so.8
#12 0x00007fdc57c363ca in xmlrpc_handleIfXmlrpcReq () from /usr/lib/x86_64-linux-gnu/libxmlrpc_server_abyss.so.3
#13 0x00007fdc5781f7a6 in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_abyss.so.3
#14 0x00007fdc5781fa6e in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_abyss.so.3
#15 0x00007fdc5781a0b7 in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_abyss.so.3
#16 0x00007fdc57822c3b in ?? () from /usr/lib/x86_64-linux-gnu/libxmlrpc_abyss.so.3
#17 0x00007fdc5ce38464 in start_thread (arg=0x7fdc5e328400) at pthread_create.c:333
#18 0x00007fdc5b4ca30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
I hope I didn't forget another line to patch!
#16 Updated by Ruben S. Montero almost 5 years ago
What version are you using: 4.14 or 5.0?
#17 Updated by Ruben S. Montero almost 5 years ago
Hold on, I saw it
#18 Updated by Benjamin Taubmann almost 5 years ago
I have 4.12 (from the Debian source package).
Here you can find the RequestManagerVirtualMachine.cc
#19 Updated by Ruben S. Montero almost 5 years ago
OK, I just uploaded a patch for master, 5.0, and 4.14:
https://github.com/OpenNebula/one/commit/61a246400f65156529008cc34d760ca0fe86fc8a
#20 Updated by Benjamin Taubmann almost 5 years ago
I updated to 4.14 now, and it seems to work as well as it can with Xen.
#21 Updated by Ruben S. Montero almost 5 years ago
- Category set to Core & System
- Status changed from Pending to Closed
- Resolution set to fixed
- Affected Versions OpenNebula 4.12, OpenNebula 4.14 added
Great, I'm closing it, but will reopen if more failures appear.