Bug #4796

When failing over HA controllers, hypervisor collectd probes do not switch to new controller

Added by John Noss almost 5 years ago. Updated almost 4 years ago.

Status: Closed
Start date: 09/19/2016
Priority: Normal
Due date:
Assignee: Javi Fontan
% Done: 0%
Category: Core & System
Target version: Release 5.4
Resolution: fixed
Pull request:
Affected Versions: OpenNebula 5.0

Description

After a failover of the HA controllers, the collectd monitoring probes on the hypervisors do not switch to sending data to the new controller. The new controller can still monitor the hypervisors (apparently through the SSH pull that happens when oned has not received data for a while), but the collectd-client.rb processes running on the hypervisors keep sending data to the old controller's IP. It looks like the monitoring scripts only make sure that collectd-client.rb is running; they do not restart it.

The workaround is to run a onehost sync (or to offline/online the hosts). I tested for a few hours and the probes did not switch over automatically. This setup runs with IM_MAD kvm udp-push.
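To make the suspected mechanism concrete, here is a minimal Python model of it. This is only a sketch and not OpenNebula code: `PushClient`, `Hypervisor`, `ensure_running`, `restart`, and the IP addresses are all hypothetical names chosen for illustration. It assumes the push client captures the controller address once at launch, which matches the behaviour observed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PushClient:
    """Hypothetical stand-in for a collectd-client.rb process."""
    target_ip: str  # controller address captured once, at launch


class Hypervisor:
    """Hypothetical model of the probe control logic on one host."""

    def __init__(self) -> None:
        self.client: Optional[PushClient] = None

    def ensure_running(self, controller_ip: str) -> None:
        # Behaviour described in this report: start the client only if
        # none is running, so an existing client keeps its old target.
        if self.client is None:
            self.client = PushClient(controller_ip)

    def restart(self, controller_ip: str) -> None:
        # Alternative behaviour: always relaunch when the controller
        # polls actively, so the client picks up the current address.
        self.client = PushClient(controller_ip)


# Illustrative addresses from the RFC 5737 documentation range.
OLD_CONTROLLER = "192.0.2.10"
NEW_CONTROLLER = "192.0.2.11"

stale = Hypervisor()
stale.ensure_running(OLD_CONTROLLER)
stale.ensure_running(NEW_CONTROLLER)  # active poll after failover
print(stale.client.target_ip)         # still the old controller

fixed = Hypervisor()
fixed.ensure_running(OLD_CONTROLLER)
fixed.restart(NEW_CONTROLLER)         # active poll after failover
print(fixed.client.target_ip)         # now the new controller
```

In this model, relaunching the client on every active poll is what lets it follow the controller after a failover, which is the same effect the onehost sync workaround has.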


Related issues

Related to Feature #4809: Simplify HA management in OpenNebula Closed 09/21/2016

Associated revisions

Revision 5db34212
Added by Javi Fontan almost 4 years ago

B #4796: restart collectd one active monitorization

Revision b22664c2
Added by Javi Fontan almost 4 years ago

B #4796: restart collectd one active monitorization

(cherry picked from commit 5db34212ef837b5314205aa89cedd3f4c229418a)

History

#1 Updated by Ruben S. Montero almost 5 years ago

  • Related to Feature #4809: Simplify HA management in OpenNebula added

#2 Updated by Ruben S. Montero almost 5 years ago

  • Category set to Core & System
  • Target version set to Release 5.4

#3 Updated by Kristian Feldsam almost 5 years ago

In a standard clustered HA setup, you should have one floating IP that collectd, oned, sunstone, nginx, etc. sit on. When the active node fails, the second node gets the floating IP and continues running.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_add_a_resource.html

So you probably have a problem in your HA setup if collectd ends up on a different IP.

#4 Updated by Ruben S. Montero about 4 years ago

  • Assignee set to Jaime Melis

#5 Updated by Javi Fontan almost 4 years ago

  • Status changed from Pending to Closed
  • Assignee changed from Jaime Melis to Javi Fontan
  • Resolution set to fixed

Fixed both in master and one-5.2
