Bug #4796: When failing over HA controllers, hypervisor collectd probes do not switch to new controller - OpenNebula - OpenNebula Development pages

Bug #4796

When failing over HA controllers, hypervisor collectd probes do not switch to new controller

Added by John Noss almost 5 years ago. Updated almost 4 years ago.

Status: Closed
Start date: 09/19/2016
Priority: Normal
Due date:
Assignee: Javi Fontan
% Done:
Category: Core & System
Target version: Release 5.4
Resolution: fixed
Pull request:
Affected Versions: OpenNebula 5.0

Description

After a failover of HA controllers, the collectd monitoring probes on the hypervisors do not switch to sending data to the new controller. The new controller is still able to monitor the hypervisors (apparently via the ssh pull that happens when oned has not received data for a while), but the collectd-client.rb processes running on the hypervisors continue sending data to the old controller's IP. It looks like the monitoring scripts only make sure that collectd-client.rb is running; they do not restart it.

The workaround is to run onehost sync (or to take the hosts offline and bring them back online). I tested for a few hours and the probes did not switch over automatically. This is running with IM_MAD kvm udp-push.
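To make the reported behaviour concrete, here is a minimal Ruby sketch of the difference between a check that only verifies the push client is alive and one that also compares its target with the active controller. It is an illustration under stated assumptions, not OpenNebula code: `ClientState`, `liveness_only_action`, `target_aware_action` and the IP addresses are all hypothetical names and values.

```ruby
# Hypothetical model of a push-monitoring client on a hypervisor.
# None of these names come from OpenNebula; they only illustrate the logic.
ClientState = Struct.new(:running, :target_ip)

# Liveness-only check, as the report describes: a running client is left
# alone, so a stale target IP is never noticed.
def liveness_only_action(client)
  client.running ? :keep : :start
end

# Target-aware check: also restart when the client is still pushing to an
# address other than the controller that is currently active.
def target_aware_action(client, active_controller_ip)
  return :start unless client.running
  client.target_ip == active_controller_ip ? :keep : :restart
end

# After a failover the client is alive but points at the old controller.
stale = ClientState.new(true, "10.0.0.1")   # old controller (example address)
new_controller = "10.0.0.2"                 # new controller (example address)

puts liveness_only_action(stale)                  # prints "keep"
puts target_aware_action(stale, new_controller)   # prints "restart"
```

With the liveness-only check the stale client survives the failover, which matches what was observed here; any approach that restarts the client once the active controller changes would avoid that.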

Related issues

Associated revisions

Revision 5db34212
Added by Javi Fontan almost 4 years ago

B #4796: restart collectd one active monitorization

Revision b22664c2
Added by Javi Fontan almost 4 years ago

B #4796: restart collectd one active monitorization

(cherry picked from commit 5db34212ef837b5314205aa89cedd3f4c229418a)

History

#1 Updated by Ruben S. Montero almost 5 years ago

Related to Feature #4809: Simplify HA management in OpenNebula added

#2 Updated by Ruben S. Montero almost 5 years ago

Category set to Core & System
Target version set to Release 5.4

#3 Updated by Kristian Feldsam almost 5 years ago

In a standard clustered HA setup, you should have one floating IP on which collectd, oned, sunstone, nginx, etc. all sit. When the active node fails, the second node gets the floating IP and continues running.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_add_a_resource.html

So you probably have a problem in your HA setup if collectd ends up on a different IP.

#4 Updated by Ruben S. Montero about 4 years ago

Assignee set to Jaime Melis

#5 Updated by Javi Fontan almost 4 years ago

Status changed from Pending to Closed
Assignee changed from Jaime Melis to Javi Fontan
Resolution set to fixed

Fixed both in master and one-5.2
