Bug #3335
Use CEPH "MAX AVAIL"
| Status: | Closed | Start date: | 11/13/2014 |
|---|---|---|---|
| Priority: | High | Due date: | |
| Assignee: | - | % Done: | 0% |
| Category: | Drivers - Storage | | |
| Target version: | - | | |
| Resolution: | worksforme | Pull request: | |
| Affected Versions: | OpenNebula 4.12 | | |
Description
The "MAX AVAIL" value reported per pool by ceph df should be used when monitoring the datastore, since it accounts for the number of replicas in the pool; the raw cluster totals do not.
Proposed modification:
MONITOR_SCRIPT=$(cat <<EOF
# Convert a size string as printed by 'ceph df' (e.g. "2747G") to megabytes.
to_megabytes() {
    value=\$(echo "\$1" | sed 's/[^0123456789].*\$//g')
    units=\$(echo "\$1" | sed 's/^[0123456789]*//g' | tr '[:upper:]' '[:lower:]')
    case "\$units" in
    t|tb) value=\$(expr \$value \* 1024 \* 1024) ;;
    g|gb) value=\$(expr \$value \* 1024) ;;
    m|mb) value=\$value ;;
    k|kb) value=\$(expr \$value / 1024) ;;
    b|'') value=0 ;;
    *)
        value=
        echo "Unsupported units '\$units'" >&2
        ;;
    esac
    echo "\$value"
}
# USED is column 3 and MAX AVAIL is column 5 of the pool's row in 'ceph df'.
used=\$(ceph df | awk '{if (\$1 == "$POOL_NAME") {print \$3}}')
total=\$(ceph df | awk '{if (\$1 == "$POOL_NAME") {print \$5}}')
used_mb=\$(to_megabytes \$used)
total_mb=\$(to_megabytes \$total)
free_mb=\$(expr \$total_mb - \$used_mb)
echo "USED_MB=\$used_mb"
echo "FREE_MB=\$free_mb"
echo "TOTAL_MB=\$total_mb"
EOF
)
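A quick, hypothetical sanity check of the conversion helper (run after the heredoc has been expanded into the actual script; the inputs below are illustrative, not from a real cluster):

to_megabytes 2747G   # prints 2812928 (2747 * 1024)
to_megabytes 8530M   # prints 8530
to_megabytes 319T    # prints 334495744 (319 * 1024 * 1024)
to_megabytes 512K    # prints 0 (expr performs integer division)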
Associated revisions
feature #3335: Use MAX AVAIL to compute Ceph datastore usage
feature #3335: Better compute of TOTAL. Cluster total cannot be used as it does not consider placement rules in the crushmap
History
#1 Updated by Ruben S. Montero over 6 years ago
- Tracker changed from Request to Feature
- Status changed from Pending to New
- Target version set to Release 4.12
#2 Updated by Bill Campbell over 6 years ago
Great idea; however, I see one problem with your script: it uses the MAX_AVAIL metric in place of the datastore's TOTAL, so the reported total shrinks as you allocate images. I recommend the following modifications to the above:
- Replica size is pulled from the cluster (utilizing 'ceph osd pool get $POOL_NAME size')
- TOTAL is calculated by pulling the total size from 'rados df', converting it to megabytes, and dividing by the replica size
- MAX_AVAIL is collected to show free space in the cluster
- USED stays the same
This way you have a static total, MAX_AVAIL is shown as the free space, and we don't need to worry about changing this per pool (in case you use multiple pools in your cluster with different replication sizes), all while still taking the pool's replication size into account (see the worked numbers below).
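For illustration, here is the TOTAL computation with hypothetical numbers, assuming a 300T raw cluster ('rados df' reports KB) and a replica size of 3:

total_kb=$(expr 300 \* 1024 \* 1024 \* 1024)   # 322122547200 KB of raw space
total_mb=$(expr $total_kb / 1024)              # 314572800 MB
total_pool=$(expr $total_mb / 3)               # 104857600 MB (~100T) reported as TOTAL_MB

FREE_MB (from MAX AVAIL) then shrinks as images are allocated, while TOTAL_MB stays put.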
Example modification to the above; I'm sure the OpenNebula gurus can clean this up a bit:
MONITOR_SCRIPT=$(cat <<EOF
# Replica size for the pool, taken from the cluster itself.
REPSIZE=\$(/usr/bin/ceph osd pool get "$POOL_NAME" size | awk '{print \$2}')
to_megabytes() {
    value=\$(echo "\$1" | sed 's/[^0123456789].*\$//g')
    units=\$(echo "\$1" | sed 's/^[0123456789]*//g' | tr '[:upper:]' '[:lower:]')
    case "\$units" in
    t|tb) value=\$(expr \$value \* 1024 \* 1024) ;;
    g|gb) value=\$(expr \$value \* 1024) ;;
    m|mb) value=\$value ;;
    k|kb) value=\$(expr \$value / 1024) ;;
    b|'') value=0 ;;
    *)
        value=
        echo "Unsupported units '\$units'" >&2
        ;;
    esac
    echo "\$value"
}
# USED and MAX AVAIL come from the pool's row in 'ceph df'.
used=\$(ceph df | awk '{if (\$1 == "$POOL_NAME") {print \$3}}')
free=\$(ceph df | awk '{if (\$1 == "$POOL_NAME") {print \$5}}')
# TOTAL comes from the raw cluster size ('rados df' reports KB),
# divided by the replica size.
total=\$($RADOS df | grep 'total space' | awk '{print \$3}')
used_mb=\$(to_megabytes \$used)
free_mb=\$(to_megabytes \$free)
total_mb=\$(expr \$total / 1024)
total_pool=\$(expr \$total_mb / \$REPSIZE)
echo "USED_MB=\$used_mb"
echo "FREE_MB=\$free_mb"
echo "TOTAL_MB=\$total_pool"
EOF
)
#3 Updated by Ruben S. Montero over 6 years ago
- Status changed from New to Closed
- Resolution set to fixed
This is now in master; I've used Bill's suggestions. A few notes for reference:
MAX AVAIL is very conservative: it takes the lowest "projected" available space (per OSD) in the PG map. The relative weight of the OSD in the CRUSH placement rule is used to make the projection (see [1]); i.e. if an OSD has a relative weight of 0.5, because its weight is half of the total weight sum in the ruleset, the estimated free space in the pool is 2x the free space of that OSD. This should be taken into account while a cluster is being re-weighted.
Also, I'm getting "max_avail": 4611686018427387904 for our development cluster (that is 2^62 bytes, i.e. the 4096P shown by ceph df). It seems the code does not consider overall weights of 0. This should not happen in production, but I added a fallback for when MAX AVAIL is 0.
TOTAL is computed as suggested, dividing by the replica size; the very same is done in the Ceph code for MAX AVAIL. This gives a better overview than the raw total. Note also that this will not work when the pool is set to erasure coding.
[1] https://github.com/ceph/ceph/blob/master/src/mon/PGMonitor.cc#L1303
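A minimal sketch of that fallback (not the exact code merged into master; shown unescaped, as it would appear in the generated script, with variable names following Bill's script above):

# If MAX AVAIL parsed to 0 or empty (e.g. all OSD weights are 0 on a test
# cluster), derive free space from the computed total and usage instead.
if [ -z "$free_mb" ] || [ "$free_mb" -eq 0 ]; then
    free_mb=$(expr $total_pool - $used_mb)
fi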
#4 Updated by Mo McRoberts over 6 years ago
Hi,
For info, this (I'm pretty sure it's this, though I could be wrong) breaks on Ceph 0.80.7-0ubuntu0.14.04.1: it reports FREE_MB as zero, causing OpenNebula to refuse to schedule any VMs that depend upon images in the Ceph pool. I just upgraded to 4.12.0, discovered this the hard way, and had to do a little digging to track it down (fortunately, I've got copies of the old Ceph datastore scripts to compare with).
The culprit appears to be that the line

MAX_AVAIL=\$($CEPH df | grep "$POOL_NAME" | awk '{print \$5}')

in remotes/datastore/ceph/monitor always results in MAX_AVAIL being zero.
My ceph df output looks like this:
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    327T     319T      8530G        2.54
POOLS:
    NAME                  ID     USED       %USED     OBJECTS
    data                  0      73039M     0.02      28120
    metadata              1      247M       0         1784
    rbd                   2      0          0         0
    one                   3      2747G      0.82      736798
                          13     0          0         0
    .rgw                  15     3273       0         21
    .rgw.root             16     822        0         3
    .rgw.control          17     0          0         8
    .rgw.gc               18     0          0         32
    .rgw.buckets          19     8414M      0         322262
    .rgw.buckets.index    20     0          0         11
    .log                  21     0          0         0
    .intent-log           22     0          0         0
    .usage                23     0          0         0
    .users                24     24         0         3
    .users.email          25     24         0         3
    .users.swift          26     0          0         0
    .users.uid            27     1033       0         6
I've worked around it by reverting the monitor script to:
MONITOR_SCRIPT=$(cat <<EOF
$RADOS df | $AWK '{
    if (\$1 == "total") {
        space = int(\$3/1024)
        if      (\$2 == "used")  {var = "USED_MB"}
        else if (\$2 == "avail") {var = "FREE_MB"}
        else if (\$2 == "space") {var = "TOTAL_MB"}
        print var "=" space
    }
}'
EOF
)
(And ensuring that RADOS is set to pass --id if CEPH_USER is set.)
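For context, the awk in that workaround keys off the summary lines that rados df prints after the per-pool rows; the values are in KB, looking roughly like this (figures made up for illustration):

total used        12345678
total avail      334282342
total space      343226163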
I haven't seen any indication of a minimum Ceph version requirement for OpenNebula, nor that 0.80.7 is too old, although it may be that I've simply missed it. It's also possible that this broke in a previous release and was worked around by somebody else who did the platform upgrade in our setup. Apologies if this is unhelpful!
#5 Updated by Jaime Melis over 6 years ago
- Tracker changed from Feature to Bug
- Status changed from Closed to Pending
- Target version deleted (Release 4.12)
- Resolution deleted (fixed)
- Affected Versions OpenNebula 4.12 added
Review the FREE_MB value in Ceph 0.80.7
#6 Updated by Ruben S. Montero over 6 years ago
Just double checking this... In the testing environment we have:
ii  ceph         0.80.7-0ubuntu0.14.10.1  amd64  distributed storage and file system
ii  ceph-common  0.80.7-0ubuntu0.14.10.1  amd64  common utilities to mount and interact with a ceph storage cluster
And the output:
oneadmin@kvm1:~$ ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    1951M     1445M     372M         19.07
POOLS:
    NAME         ID     USED     %USED     MAX AVAIL     OBJECTS
    data         0      0        0         4096P         0
    metadata     1      0        0         4096P         0
    rbd          2      0        0         4096P         0
    one          3      164M     8.40      4096P         45
It seems that we are using the same version; I do not understand the differences...
EDIT: BTW, we don't have 4096 petabytes for testing; it seems to be a bug when all the OSDs have 0 weight, but the column is there.
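A quick way to confirm the zero-weight situation on such a test cluster (standard Ceph CLI; osd.0 and the weight below are illustrative):

ceph osd tree                        # WEIGHT column shows 0 for every OSD
ceph osd crush reweight osd.0 1.0    # give OSDs a real CRUSH weight so the projection has a nonzero basis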
#7 Updated by Mo McRoberts over 6 years ago
Hello!
Thanks for digging into this… and I've done some digging of my own.
It turns out the reason seems to be that some of the packages on one of the nodes in the cluster were out of date, and it was holding the whole cluster back to an older version.
My output looks a lot more sensible now, so I should be able to restore the monitor script back to its shipped version:
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    327T     317T      10102G       3.01
POOLS:
    NAME                  ID     USED      %USED     MAX AVAIL     OBJECTS
    data                  0      360G      0.11      105T          104339
    metadata              1      248M      0         105T          1716
    rbd                   2      0         0         105T          0
    one                   3      2990G     0.89      105T          799143
                          13     0         0         105T          0
    .rgw                  15     3273      0         105T          21
    .rgw.root             16     822       0         105T          3
    .rgw.control          17     0         0         105T          8
    .rgw.gc               18     0         0         105T          32
    .rgw.buckets          19     8504M     0         105T          325058
    .rgw.buckets.index    20     0         0         105T          11
    .log                  21     0         0         105T          0
    .intent-log           22     0         0         105T          0
    .usage                23     0         0         105T          0
    .users                24     24        0         105T          3
    .users.email          25     24        0         105T          3
    .users.swift          26     0         0         105T          0
    .users.uid            27     1033      0         105T          6
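With the MAX AVAIL column present again, the field extraction in remotes/datastore/ceph/monitor lines up; a quick check against the 'one' pool above (paraphrasing the shipped grep/awk):

$ ceph df | grep one | awk '{print $5}'
105T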
Sorry about the noise!
M.
#8 Updated by Ruben S. Montero over 6 years ago
- Status changed from Pending to Closed
- Resolution set to worksforme
Thanks for letting us know; closing this one.
Mo McRoberts wrote:
> Hello!
> Thanks for digging into this… and I've done some digging of my own.
> It turns out the reason seems to be that some of the packages on one of the nodes in the cluster were out of date, and it was holding the whole cluster back to an older version.
> My output looks a lot more sensible now, so I should be able to restore the monitor script back to its shipped version: [...]
> Sorry about the noise!
> M.