Bug #4447

Possible mem leak on 'vm.info' in oned

Added by Boris Parak about 5 years ago. Updated about 5 years ago.

Status: Closed
Start date: 05/02/2016
Priority: High
Due date:
Assignee: Ruben S. Montero
% Done: 0%
Category: Core & System
Target version: Release 5.0.1
Resolution: fixed
Pull request:
Affected Versions: OpenNebula 4.14

Description

Iterating over large (60k+) VM pools causes oned to accumulate memory at an alarming rate. The problem is not caused by the initial pool.info call, but by the subsequent vm.info calls. This usually results in oned being killed by the OOM killer.

This should help to reproduce the problem:

require 'opennebula'

# Connect with the credentials from ONE_AUTH and the default endpoint
vm_pool = OpenNebula::VirtualMachinePool.new(OpenNebula::Client.new)

# One pool.info call to fetch the whole pool (all users, all VM states)
vm_pool.info(OpenNebula::Pool::INFO_ALL, -1, -1, OpenNebula::VirtualMachinePool::INFO_ALL_VM)

# One vm.info call per VM; these are the calls that make oned accumulate memory
vm_pool.each { |vmachine| vmachine.info }

Associated revisions

Revision 565d0f04
Added by Ruben S. Montero about 5 years ago

bug #4447: Reduce the default cache size to reduce the memory
requirements when big VMs (in terms of number of snapshots, history
records...) are used.
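The default cache size referred to here is presumably the PoolSQL::MAX_POOL_SIZE constant discussed in the comments below. A sketch of the kind of edit involved (the new value shown is a placeholder for illustration, not necessarily the one chosen in the commit):

// PoolSQL.cc (illustrative sketch only; the actual new default may differ)
// Fewer cached objects means a smaller worst-case footprint when individual
// VM documents are large (many snapshots, long history records...).
const unsigned int PoolSQL::MAX_POOL_SIZE = 4096;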

History

#1 Updated by Carlos Martín about 5 years ago

  • Target version set to Release 5.0

#2 Updated by Ruben S. Montero about 5 years ago

  • Assignee set to Ruben S. Montero

#3 Updated by Ruben S. Montero about 5 years ago

Hi Boris,

I am not able to reproduce this. In order to speed up the tests I've reduced the PoolObjects cache to 150 in PoolSQL.cc, where the default is:

const unsigned int PoolSQL::MAX_POOL_SIZE = 15000;   

This constant is the maximum number of objects kept in memory. I've done several runs with 200, 500, and 1,000 onevm show invocations under valgrind, and the amount of memory still in use at exit is the same each time:

                                                                             
==20501== HEAP SUMMARY:                                                               
==20501==     in use at exit: 78,592 bytes in 17 blocks                               
==20501==   total heap usage: 380,892 allocs, 380,875 frees, 74,788,992 bytes allocated 

==23456== HEAP SUMMARY:                                                               
==23456==     in use at exit: 78,592 bytes in 17 blocks                               
==23456==   total heap usage: 901,107 allocs, 901,090 frees, 175,591,059 bytes allocated 

==26857== HEAP SUMMARY:                                                               
==26857==     in use at exit: 78,592 bytes in 17 blocks                               
==26857==   total heap usage: 1,617,203 allocs, 1,617,186 frees, 314,881,092 bytes allocated  

As you can see, the same amount of bytes is left in use in every run (caused by the xmlrpc library and the static initialization of some pool mutexes).

I've also tried different cache sizes (50, 100, 150 and 200 objects), and memory is likewise bounded by a maximum no matter how many VMs are shown (tested with up to 10x the cache size)...
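For reference, the bounded behaviour described above is what you would expect from a fixed-size object cache. The sketch below is not the actual PoolSQL implementation; the class and its members are made up purely to illustrate why memory stays bounded by the cache size regardless of how many objects are requested.

// Minimal sketch of a size-bounded object cache (hypothetical, not PoolSQL).
// Once max_size entries are cached, fetching a new object evicts the oldest
// one, so resident memory is bounded by max_size * average_object_size.
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

class BoundedPool
{
public:
    explicit BoundedPool(size_t max_size) : max_size_(max_size) {}

    // Return the cached XML body for oid, loading (and possibly evicting) on a miss.
    const std::string& get(int oid)
    {
        auto it = cache_.find(oid);

        if (it == cache_.end())
        {
            if (cache_.size() >= max_size_)      // cache full: evict the oldest entry
            {
                cache_.erase(lru_.front());
                lru_.pop_front();
            }

            it = cache_.emplace(oid, load_from_db(oid)).first;
            lru_.push_back(oid);
        }

        return it->second;
    }

private:
    std::string load_from_db(int oid)            // stand-in for the real DB read
    {
        return "<VM><ID>" + std::to_string(oid) + "</ID></VM>";
    }

    size_t                               max_size_;
    std::list<int>                       lru_;    // oldest object id at the front
    std::unordered_map<int, std::string> cache_;
};

With a cache like this, showing 10x more VMs than the cache holds only churns entries; it does not grow the resident set, which matches the valgrind numbers above.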

So:
- For a 15K cache size, we expect no more than 800MB of RSS. Could you check how oned's memory evolves over time, e.g. with ps -v -p `pgrep oned` and/or cat /proc/`pgrep oned`/status?

- Could it be another process on the server?

- If you are using Ruby, the client side may also be eating a lot of memory. You can try the shell instead, which starts and terminates a Ruby process for each call:

# query each VM once through the CLI, discarding the XML output
for i in `seq 0 30000` ; do echo $i; onevm show $i > /dev/null ; done

So far I cannot reproduce this.

#4 Updated by Boris Parak about 5 years ago

Hi Ruben,

thanks for taking the time to investigate this. I'll follow your pointers and run a few local tests. I'll get back to you with the results ASAP.

#5 Updated by Ruben S. Montero about 5 years ago

  • Status changed from Pending to Closed
  • Resolution set to worksforme

Hi Boris,

I'm doing housekeeping on the issue list for 5.0. I'm closing this as worksforme, but we can reopen it if you find something. Thanks again for the feedback.

#6 Updated by Boris Parak about 5 years ago

Hi Ruben,

I have an update for you. I now believe this is not a memory leak, just strange caching behavior. I'm not sure why, but it looks like oned's cache is filling up faster than it can purge/replace old cache entries (or the entries themselves grow in size). I have a sanitized DB dump, several GBs in size. If you are up for it, I can send it to you via a private channel and it should help you replicate this issue in your testbed with reasonable effort.

Thanks!

#7 Updated by Boris Parak about 5 years ago

Or we could give you access to a running testbed, if that would be easier for you.

#8 Updated by Ruben S. Montero about 5 years ago

Sorry Boris, yes, I'm interested in it. If you send me a link (rsmontero at opennebula dot org), I'll download the DB and take a look. Sorry for the late response; the 5.0 release took all my cycles ;)

#9 Updated by Ruben S. Montero about 5 years ago

  • Status changed from Closed to New
  • Target version changed from Release 5.0 to Release 5.2
  • Resolution deleted (worksforme)

Reopening to test with the DB.

#10 Updated by Boris Parak about 5 years ago

Ruben S. Montero wrote:

Sorry Boris, yes, I'm interested in it. If you send me a link (rsmontero at opennebula dot org), I'll download the DB and take a look. Sorry for the late response; the 5.0 release took all my cycles ;)

Done. Thank you!

#11 Updated by Ruben S. Montero about 5 years ago

  • Status changed from New to Closed
  • Resolution set to fixed

This issue was caused by specific DB contents; no memory leak was found after profiling the code.

After looking at this issue, it may be useful to reduce the default cache size to lower the memory footprint of oned and to better deal with "big" VM documents (e.g. a large number of snapshots, history records...).
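A rough back-of-the-envelope illustrates why large VM documents matter here. The per-object sizes below are assumptions for illustration (the small-VM figure is roughly the 800MB / 15,000 objects estimate from comment #3; the large-VM figure is hypothetical), not measurements taken from this issue.

// Back-of-the-envelope only; per-object sizes are illustrative assumptions.
#include <cstdio>

int main()
{
    const double cache_objects = 15000;  // default PoolSQL::MAX_POOL_SIZE
    const double small_vm_kb   = 55;     // roughly 800MB / 15,000 objects (comment #3)
    const double large_vm_kb   = 350;    // hypothetical VM with many snapshots/history records

    std::printf("small VM documents: ~%.1f GB\n", cache_objects * small_vm_kb / (1024 * 1024));
    std::printf("large VM documents: ~%.1f GB\n", cache_objects * large_vm_kb / (1024 * 1024));

    return 0;
}

Under these assumptions the same 15,000-entry cache goes from under 1GB to roughly 5GB of cached XML, without any leak, which is why lowering the default cache size helps with this kind of DB content.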

#12 Updated by Ruben S. Montero about 5 years ago

  • Target version changed from Release 5.2 to Release 5.0.1
