Bug #3726

Ceph monitor check broken during normal cluster degradation

Added by Joel Merrick almost 5 years ago. Updated almost 5 years ago.

Status: Closed
Priority: High
Assignee: Javi Fontan
Category: Documentation
Target version: Release 4.12.1
Affected Versions: OpenNebula 4.12
Start date: 03/30/2015
Due date:
% Done: 0%
Resolution:
Pull request:

Description

There is a bug in the new Ceph monitor. It assumes the columns in the ceph/rados output are always in fixed positions; however, when there is any degradation (perfectly normal if you've lost a disk and rebalancing is happening), the awk picks up the wrong column (0 in my case). See https://gist.github.com/joelio/64ae2b9fe9116fcca4c6

This is a serious fault as it stops all provisioning on an otherwise healthy cluster.

I think it'd be more prudent to use the JSON output and parse that properly?
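As a minimal sketch of what keyed parsing could look like, assuming a JSON layout along the lines of what `ceph df --format json` emits (the exact field names vary between Ceph releases, so the sample below is illustrative, not authoritative):

```python
import json

# Illustrative sample resembling `ceph df --format json` output.
# The schema here is an assumption; real field names differ across releases.
sample = """
{
  "stats": {"total_bytes": 63752622514176, "total_avail_bytes": 57304254390272},
  "pools": [
    {"name": "one", "id": 3,
     "stats": {"bytes_used": 47105835008, "max_avail": 0, "objects": 11350}}
  ]
}
"""

data = json.loads(sample)

# Keyed lookups survive reordered or missing columns, unlike positional awk.
one_pool = next(p for p in data["pools"] if p["name"] == "one")
print(one_pool["stats"]["bytes_used"])       # -> 47105835008
print(data["stats"]["total_avail_bytes"])    # -> 57304254390272
```

The point is that a degraded cluster can add or drop columns in the human-readable table, but field names in the structured output stay stable.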

History

#1 Updated by Joel Merrick almost 5 years ago

I've been notified that XML output is supported:

ceph df -f xml

This might be better than JSON for ONE's usage.
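For the XML route, a sketch with the standard library could look like this; the element names below are assumptions, not taken from a real Ceph release:

```python
import xml.etree.ElementTree as ET

# Illustrative sample resembling `ceph df -f xml` output.
# The element names are assumptions; check them against a real cluster.
sample = """
<stats>
  <pools>
    <pool>
      <name>one</name>
      <stats><kb_used>43871</kb_used><max_avail>0</max_avail></stats>
    </pool>
  </pools>
</stats>
"""

root = ET.fromstring(sample)
# Look up pools by name and fields by tag instead of column position.
for pool in root.iter("pool"):
    name = pool.findtext("name")
    kb_used = int(pool.findtext("stats/kb_used"))
    print(name, kb_used)  # -> one 43871
```

As with JSON, tag-based lookups don't break when the table layout shifts under degradation.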

#2 Updated by Ruben S. Montero almost 5 years ago

Are you running 0.87.1?

Anyway, using XML/JSON should be better.

#3 Updated by Ruben S. Montero almost 5 years ago

  • Target version set to Release 4.14

#4 Updated by Joel Merrick almost 5 years ago

Yea, current stable (0.87 Giant)

ceph:
  Installed: 0.87.1-1trusty
  Candidate: 0.87.1-1trusty
  Version table:
 *** 0.87.1-1trusty 0
        999 http://ceph.com/debian-giant/ trusty/main amd64 Packages

#5 Updated by Ruben S. Montero almost 5 years ago

  • Status changed from Pending to New
  • Target version changed from Release 4.14 to Release 4.12.1

#6 Updated by Ruben S. Montero almost 5 years ago

  • Assignee set to Javi Fontan

#7 Updated by Ruben S. Montero almost 5 years ago

Hi,

Checking it; I have

# rados df
pool name       category                 KB      objects       clones     degraded      unfound           rd        rd KB           wr        wr KB
...

but with ceph:

# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED 
...
POOLS:
    NAME                ID     USED       %USED     MAX AVAIL     OBJECTS 
...

so you say that ceph df becomes rados df under degradation?

Cheers

#8 Updated by Joel Merrick almost 5 years ago

I'm noticing there's no MAX AVAIL in my output either

root@vm-head-01:/var/log/one# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    59304G     53306G     5960G        10.05
POOLS:
    NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
    rbd                 0      0          0         0             0
    cephfs_data         1      1930G      3.26      0             2062873
    cephfs_metadata     2      38567k     0         0             75344
    one                 3      44924M     0.07      0             11350

#10 Updated by Joel Merrick almost 5 years ago

This is a fresh install too; nothing further has been done with Ceph for OpenNebula apart from setting up auth.

If it's just a case of getting MAX AVAIL working, then we need to understand what to do and add it to the docs :)

#11 Updated by Joel Merrick almost 5 years ago

Ok, so I'm thinking this is a bug in Ceph now, sorry for the confusion!

http://tracker.ceph.com/issues/10257

#12 Updated by Joel Merrick almost 5 years ago

Yes, I can confirm this bug with Ceph, not ONE.

You can close (or perhaps mark documentation?) this bug as there's not a lot you guys can do to mitigate.

Basically, ensure ALL OSDs are in and up, or do not use 0.87.1 specifically; it looks fixed in later releases.

Thanks again guys! :)

#13 Updated by Ruben S. Montero almost 5 years ago

  • Category changed from Drivers - Storage to Documentation

Great! I'm not 100% sure, but I think this bug also appears when the OSDs are weighted to 0... A note in the docs would be nice, at least in the release notes pages.

Thanks for letting us know

Cheers

Joel Merrick wrote:

Yes, I can confirm this bug with Ceph, not ONE.

You can close (or perhaps mark documentation?) this bug as there's not a lot you guys can do to mitigate.

Basically, ensure ALL OSDs are in and up, or do not use 0.87.1 specifically; it looks fixed in later releases.

Thanks again guys! :)

#14 Updated by Ruben S. Montero almost 5 years ago

  • Status changed from New to Closed
  • Resolution set to worksforme

#15 Updated by Ruben S. Montero almost 5 years ago

  • Status changed from Closed to Pending
  • Resolution deleted (worksforme)

#16 Updated by Jaime Melis almost 5 years ago

  • Status changed from Pending to Closed

Documented
