Feature #4206

Poll script on host should skip failed checks

Added by Stefan Kooman over 5 years ago. Updated about 5 years ago.

Status: Closed
Start date: 11/26/2015
Priority: Normal
Due date:
Assignee: Javi Fontan
% Done: 0%
Category: Drivers - Monitor
Target version: Release 5.0
Resolution: fixed
Pull request:

Description

When the poll script on a host encounters an error, no information is sent to the frontend. Example:

Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][I]: 2015-11-26 08:45:34.511115 7f6590276840 0 monclient(hunting): authenticate timed out after 300
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][I]: 2015-11-26 08:45:34.511178 7f6590276840 0 librados: client.libvirt authentication error (110) Connection timed out
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][I]: rbd: couldn't connect to the cluster!
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][I]: ../../vmm/kvm/poll:349:in `block in get_disk_usage': undefined method `text' for nil:NilClass (NoMethodError)
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][I]: from /usr/lib/ruby/1.9.1/rexml/element.rb:905:in `block in each'
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][I]: from /usr/lib/ruby/1.9.1/rexml/xpath.rb:67:in `each'
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][I]: from /usr/lib/ruby/1.9.1/rexml/xpath.rb:67:in `each'
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][I]: from /usr/lib/ruby/1.9.1/rexml/element.rb:905:in `each'
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][I]: from ../../vmm/kvm/poll:329:in `get_disk_usage'
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][I]: from ../../vmm/kvm/poll:145:in `block in get_all_vm_info'
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][I]: from ../../vmm/kvm/poll:129:in `each'
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][I]: from ../../vmm/kvm/poll:129:in `get_all_vm_info'
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][I]: from ../../vmm/kvm/poll:856:in `print_all_vm_template'
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][I]: from ../../vmm/kvm/poll:908:in `<main>'
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][E]: Error executing poll.sh
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][E]: Error executing collectd-client_control.sh
Nov 26 08:45:34 test-oned2 oned8488: [Z0][InM][I]: ExitCode: 1

In this case the (test) Ceph cluster is unavailable, which results in a "blackout" of all VMs running on the host. Environments that depend on accounting info from ONE will lose billing info. Instead of failing completely, I would suggest that the poll script report the successfully collected metrics and emit an error or warning for the failed check.
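
The suggested behaviour can be sketched in Ruby, the language of the poll script. This is a minimal illustration, not the code from the associated revisions: `run_check`, `collect_vm_metrics`, and the metric names are hypothetical, and the failing disk check only imitates the `nil.text` error from the trace above. Each check runs in its own rescue block, so one failure becomes a warning and the remaining metrics are still reported.

```ruby
# Hypothetical sketch: none of these names come from the real poll script.

# Run a single named check. If it raises, record a warning and return nil
# instead of letting the exception abort the whole poll.
def run_check(name, warnings)
  yield
rescue StandardError => e
  warnings << "#{name} failed: #{e.class}: #{e.message}"
  nil
end

# Collect metrics from a hash of name => callable, skipping failed checks.
# Returns the collected metrics and the list of warnings.
def collect_vm_metrics(checks)
  warnings = []
  metrics = {}
  checks.each do |name, check|
    value = run_check(name, warnings) { check.call }
    metrics[name] = value unless value.nil?
  end
  [metrics, warnings]
end

# Example: the disk check fails the same way as in the trace (nil.text).
checks = {
  'CPU'       => -> { 12.5 },
  'MEMORY'    => -> { 524_288 },
  'DISK_SIZE' => -> { nil.text }
}

metrics, warnings = collect_vm_metrics(checks)

# Successfully collected metrics go to STDOUT, warnings to STDERR.
puts metrics.map { |k, v| "#{k}=#{v}" }.join(' ')
warnings.each { |w| warn "WARNING: #{w}" }
```

With this structure the script can still exit with code 0, so the frontend receives the CPU and memory values while the disk failure is only reported as a warning.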

Associated revisions

Revision 7d6f91a3
Added by Javi Fontan over 5 years ago

feature #4206: do not crash getting disk info in poll

Revision d8ffdf33
Added by Javi Fontan over 5 years ago

feature #4206: do not crash getting disk info in poll

(cherry picked from commit 7d6f91a369cdd31a788d8d44cece531357fa7fb3)

Revision 253416bc
Added by Javi Fontan over 5 years ago

feature #4206: do not crash getting disk info in poll

(cherry picked from commit 7d6f91a369cdd31a788d8d44cece531357fa7fb3)

History

#1 Updated by Ruben S. Montero over 5 years ago

  • Category set to Drivers - Monitor
  • Target version set to Release 5.0

#2 Updated by Ruben S. Montero over 5 years ago

  • Status changed from Pending to New

#3 Updated by Javi Fontan over 5 years ago

  • Assignee set to Javi Fontan

The drivers log STDERR only when the command fails. This should be changed so that messages written to STDERR during a successful execution (warnings?) also appear in the log files.
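
A rough sketch of that change, assuming a driver that runs the probe as a child process: `run_and_log` and the `W` level tag are hypothetical and not taken from the OpenNebula driver code. It captures STDERR with Ruby's standard `Open3.capture3` and turns it into log lines whether or not the command succeeded.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper: run a command and build log lines from its STDERR,
# regardless of the exit status. STDERR from a successful run is tagged
# 'W' (warning, an assumed level), from a failed run 'E' (error).
def run_and_log(*cmd)
  stdout, stderr, status = Open3.capture3(*cmd)
  level = status.success? ? 'W' : 'E'
  log_lines = stderr.each_line.map { |line| "[#{level}]: #{line.chomp}" }
  [stdout, log_lines, status.exitstatus]
end

# Example: a probe that succeeds but still writes a warning to STDERR.
probe = 'warn "disk check skipped"; puts "CPU=12.5"'
out, log, code = run_and_log(RbConfig.ruby, '-e', probe)

puts out
log.each { |line| puts line }
```

Here the probe exits with status 0, yet its warning is still returned as a log line instead of being dropped.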

#4 Updated by Ruben S. Montero about 5 years ago

  • Status changed from New to Closed
  • Resolution set to fixed
