OSD information in a scriptable format

If you need the OSD IDs and the corresponding node IP address mappings in a scriptable format, use the following command:

# ceph osd find <OSD-num>

This prints the OSD number, the IP address, the host name, and the default root in the CRUSH map, as a JSON object.

# ceph osd find 2
{ "osd": 2,
"ip": "\/5311",
"crush_location": { "host": "node4", "root": "default"}}

The output is in JSON, which is a key:value format. It can be parsed with awk/sed, or with any programming language that supports JSON; all recent ones do.
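If you capture this output in a script, any JSON library can load it directly. Here is a minimal Python sketch, using a made-up sample record shaped like the output above (the address is an invented placeholder, since the real one is elided in the output shown):

```python
import json

# Sample 'ceph osd find' output (assumed shape; the address is made up)
raw = '{"osd": 2, "ip": "192.0.2.14:6804/5311", "crush_location": {"host": "node4", "root": "default"}}'

doc = json.loads(raw)
# Pull out the fields we care about: OSD id, host, and CRUSH root
osd_id = doc["osd"]
host = doc["crush_location"]["host"]
root = doc["crush_location"]["root"]
print(osd_id, host, root)
```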

For a listing of all the OSDs and related information, get the number of OSDs in the cluster, and then use that number to probe the OSDs.

# for i in `seq 0 $(($(ceph osd stat | awk '{print $3}') - 1))`; do
ceph osd find $i; echo; done

This should output:

{ "osd": 0,
"ip": "\/2579",
"crush_location": { "host": "node3",
"root": "ssd"}}
{ "osd": 1,
"ip": "\/955",
"crush_location": { "host": "node3",
"root": "ssd"}}
{ "osd": 2,
"ip": "\/5311",
"crush_location": { "host": "node4",
"root": "default"}}
{ "osd": 3,
"ip": "\/5626",
"crush_location": { "host": "node4",
"root": "default"}}
{ "osd": 4,
"ip": "\/4194",
"crush_location": { "host": "node5",
"root": "default"}}
{ "osd": 5,
"ip": "\/4521",
"crush_location": { "host": "node5",
"root": "default"}}
{ "osd": 6,
"ip": "\/5614",
"crush_location": { "host": "node2",
"root": "ssd"}}
{ "osd": 7,
"ip": "\/1719",
"crush_location": { "host": "node2",
"root": "ssd"}}
{ "osd": 8,
"ip": "\/5842",
"crush_location": { "host": "node6",
"root": "default"}}
{ "osd": 9,
"ip": "\/4356",
"crush_location": { "host": "node6",
"root": "default"}}
{ "osd": 10,
"ip": "\/4517",
"crush_location": { "host": "node7",
"root": "default"}}
{ "osd": 11,
"ip": "\/4821",
"crush_location": { "host": "node7",
"root": "default"}}
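Once each record is collected, building a host-to-OSD mapping is just a matter of grouping on the 'host' key. A small Python sketch, using a few records with the hosts and ids from the listing above (addresses omitted):

```python
from collections import defaultdict

# A few records shaped like the 'ceph osd find' output above
records = [
    {"osd": 0, "crush_location": {"host": "node3", "root": "ssd"}},
    {"osd": 1, "crush_location": {"host": "node3", "root": "ssd"}},
    {"osd": 2, "crush_location": {"host": "node4", "root": "default"}},
    {"osd": 3, "crush_location": {"host": "node4", "root": "default"}},
]

# Group OSD ids by the host they live on
by_host = defaultdict(list)
for rec in records:
    by_host[rec["crush_location"]["host"]].append(rec["osd"])

print(dict(by_host))
```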

What is data scrubbing?

Data scrubbing is an error-checking and correction routine that ensures the data on a file system is in pristine condition and free of errors. Data integrity is of primary concern today, given the humongous amounts of data being read and written daily.

A simple example of scrubbing is a file system check done with tools like 'e2fsck' on EXT2/3/4, or 'xfs_repair' on XFS. Ceph also performs daily scrubbing as well as weekly scrubbing, which we will discuss in detail in another article.

This feature is also available in most hardware RAID controllers, in backup tools, and in software RAID implementations such as MD-RAID.

Btrfs is one of the file systems that can schedule an internal scrubbing automatically, to ensure that corruption is detected and corrective measures are taken without intervention. Since Btrfs can maintain multiple copies of data, once it finds an error in the primary copy it can fetch a good copy (if mirroring is used) and replace the bad one.
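As a toy illustration of the general idea (not the actual Btrfs or Ceph algorithm), a scrub pass can be modelled as checksumming every replica and repairing any copy that disagrees with the majority:

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def scrub(replicas):
    """Toy scrub: compare each replica's checksum against the majority
    version and overwrite any mismatching replica with a good copy."""
    sums = [checksum(r) for r in replicas]
    # Pick the most common checksum as the 'good' version
    good = max(set(sums), key=sums.count)
    good_copy = replicas[sums.index(good)]
    return [r if checksum(r) == good else good_copy for r in replicas]

replicas = [b"hello", b"hello", b"hellX"]  # third copy is corrupted
print(scrub(replicas))
```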

We will be looking more into scrubbing, especially how it is implemented in Ceph, and the various tunables, in an upcoming post.

‘noout’ flag in Ceph

You may have seen the 'noout' flag set in the output of 'ceph -s'. What does this actually mean?

This is a cluster-wide flag. When it is set, an OSD that goes down is not marked out of the cluster, and the rebalancing that would normally start in order to maintain the replica count does not happen. By default, the monitors mark an OSD out of the acting set if it is not reachable for 300 seconds, i.e. 5 minutes.
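The monitor's decision can be sketched as a simplified model (this is illustrative pseudologic, not the actual monitor code; the 300-second interval is the default mentioned above):

```python
DOWN_OUT_INTERVAL = 300  # seconds; the default interval discussed above

def should_mark_out(seconds_down: float, noout_set: bool) -> bool:
    """Simplified model of the mark-out decision: an OSD that has been
    down longer than the interval is marked out, unless 'noout' is set."""
    if noout_set:
        return False
    return seconds_down > DOWN_OUT_INTERVAL

print(should_mark_out(600, noout_set=False))  # OSD gets marked out
print(should_mark_out(600, noout_set=True))   # OSD stays in the acting set
```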

To know the default value set in your cluster, use:

# ceph daemon /var/run/ceph/ceph-mon.*.asok config show | grep mon_osd_down_out_interval

When an OSD is marked out, another OSD takes its place, and data replication to it starts, according to each pool's replica count.

If the 'noout' flag is set, the monitors will not mark OSDs out of the acting set. The PGs on a down OSD will report a degraded state, but the OSD will remain in the acting set.

This can be helpful when we want to take an OSD offline briefly, but don't want its data objects to be replicated over to another OSD in the meantime.

To set the 'noout' flag, use:

# ceph osd set noout

Once the planned work is finished, unset it with:

# ceph osd unset noout

How to change the filling ratio for a Ceph OSD?

There are many scenarios where you may need to change the space-usage thresholds on a Ceph OSD. One such case is when OSD usage is about to hit the full ratio and the cluster keeps sending you warnings.

For one reason or another, you may need to raise the threshold for some time. In such a case, you don't need to change the configuration in ceph.conf and push it across the cluster. Instead, you can do it while the cluster is online, from the command line.

'ceph tell' is a very useful command in that the administrator doesn't need to stop/start the OSDs, MONs, etc. after a configuration change. In our case, we want to set 'mon_osd_full_ratio' to 98%. We can do that with:

# ceph tell mon.* injectargs "--mon_osd_full_ratio .98"

In an earlier post (https://goo.gl/xjXOoI) we saw how to get all the configurable options from a monitor. If I understand correctly, almost all configuration values can be changed online by injecting them with 'ceph tell'.
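The full ratio itself is just the fraction of used to total space compared against a threshold. A quick sketch of the arithmetic, using a hypothetical helper (not a Ceph API):

```python
def is_full(used_bytes: int, total_bytes: int, full_ratio: float = 0.95) -> bool:
    """Hypothetical helper: True when usage crosses the full ratio.
    0.95 is Ceph's usual default; the article raises it to 0.98."""
    return used_bytes / total_bytes >= full_ratio

used, total = 97 * 1024**3, 100 * 1024**3  # 97 GiB used of 100 GiB
print(is_full(used, total))        # over the 0.95 default
print(is_full(used, total, 0.98))  # under the raised 0.98 threshold
```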

How to remove a host from a Ceph cluster?

I’m still studying Ceph, and recently faced a scenario in which one of my Ceph nodes went down due to hardware failure. Even though my data was safe due to the replication factor, I was not able to remove the node from the cluster.

I could remove the OSDs on the node, but I didn't find a way to stop the node from being listed in 'ceph osd tree'. I ended up editing the CRUSH map by hand to remove the host, and uploaded it back. This worked as expected. The following are the steps I took.

a) This was the state just after the node went down:

# ceph osd tree

# id    weight      type name               up/down     reweight
 -1     0.08997     root default
 -2     0.01999         host hp-m300-5
 0      0.009995            osd.0           up          1
 4      0.009995            osd.4           up          1
 -3     0.009995        host hp-m300-9
 1      0.009995            osd.1           down        0
 -4     0.05998         host hp-m300-4
 2      0.04999             osd.2           up          1
 3      0.009995            osd.3           up          1

# ceph -w

    cluster 62a6a880-fb65-490c-bc98-d689b4d1a3cb
     health HEALTH_WARN 64 pgs degraded; 64 pgs stuck unclean; recovery 261/785 objects degraded (33.248%)
     monmap e1: 1 mons at {hp-m300-4=}, election epoch 1, quorum 0 hp-m300-4
     osdmap e130: 5 osds: 4 up, 4 in
     pgmap v8465: 196 pgs, 4 pools, 1001 MB data, 262 objects
         7672 MB used, 74192 MB / 81865 MB avail
         261/785 objects degraded (33.248%)
         64 active+degraded
         132 active+clean

I started by marking the OSDs on the node out and removing them. Note that I didn't need to stop the OSD (osd.1), since the node carrying it was down and not accessible.

b) If not, you would have to stop the OSD first, using:

 # sudo service ceph stop osd.1

c) Mark the OSD out. This is not strictly needed in this case, since the node is already down:

 # ceph osd out osd.1

d) Remove the OSD from the CRUSH map, so that it no longer receives any data. Alternatively, you can fetch the CRUSH map, de-compile it, remove the OSD, re-compile it, and upload it back.

This removes item id 1, with the name 'osd.1', from the CRUSH map:

 # ceph osd crush remove osd.1

e) Remove the OSD authentication key

 # ceph auth del osd.1

f) At this stage, I had to remove the OSD host from the listing, but was not able to find a way to do so. 'ceph-deploy' didn't have any tool for this other than 'purge' and 'uninstall', and since the node was not accessible, these wouldn't work anyway. A 'ceph-deploy purge' failed with the following errors, which is expected since the node is not reachable.

 # ceph-deploy purge hp-m300-9

[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
 [ceph_deploy.cli][INFO  ] Invoked (1.5.22-rc1): /usr/bin/ceph-deploy purge hp-m300-9
 [ceph_deploy.install][INFO  ] note that some dependencies *will not* be removed because they can cause issues with qemu-kvm
 [ceph_deploy.install][INFO  ] like: librbd1 and librados2
 [ceph_deploy.install][DEBUG ] Purging from cluster ceph hosts hp-m300-9
 [ceph_deploy.install][DEBUG ] Detecting platform for host hp-m300-9 ...
 ssh: connect to host hp-m300-9 port 22: No route to host
 [ceph_deploy][ERROR ] RuntimeError: connecting to host: hp-m300-9 resulted in errors: HostNotFound hp-m300-9

I ended up fetching the CRUSH map, removing the OSD host from it, and uploading it back.

g) Get the CRUSH map

 # ceph osd getcrushmap -o /tmp/crushmap

h) De-compile the CRUSH map

 # crushtool -d /tmp/crushmap -o crush_map

i) I had to remove the entries pertaining to the host-to-be-removed from the following sections:

a) devices
b) types
c) And from the ‘root’ default section as well.
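The hand edit in step i) can be sketched as a small text-processing script. This is a hedged sketch: it assumes the simple one-brace bucket layout that 'crushtool -d' emits, and any edited map should still be verified by recompiling it.

```python
import re

def remove_host(crush_text: str, host: str) -> str:
    """Sketch: drop a host's bucket block and any 'item <host> ...'
    references (e.g. under root default) from a decompiled CRUSH map.
    Assumes the usual flat bucket layout crushtool emits."""
    # Remove the 'host <name> { ... }' bucket block
    crush_text = re.sub(r"host %s \{[^}]*\}\n?" % re.escape(host), "", crush_text)
    # Remove 'item <name> ...' lines referencing the host
    crush_text = re.sub(r"^\s*item %s .*\n" % re.escape(host), "", crush_text, flags=re.M)
    return crush_text

sample = """host hp-m300-9 {
\tid -3
\titem osd.1 weight 0.010
}
root default {
\tid -1
\titem hp-m300-5 weight 0.020
\titem hp-m300-9 weight 0.010
}
"""
cleaned = remove_host(sample, "hp-m300-9")
print(cleaned)
```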

j) Once the entries were removed, I compiled the map and uploaded it back.

 # crushtool -c crush_map -o /tmp/crushmap
 # ceph osd setcrushmap -i /tmp/crushmap

k) The 'ceph osd tree' output looks much cleaner now 🙂

 # ceph osd tree

# id    weight      type name               up/down     reweight
 -1     0.07999     root default
 -2     0.01999         host hp-m300-5
 0      0.009995            osd.0           down        0
 4      0.009995            osd.4           down        0
 -4     0.06            host hp-m300-4
 2      0.04999             osd.2           up          1
 3      0.009995            osd.3           up          1

There may be a more direct method to remove an OSD host from the listing, but I'm not aware of one, given my limited knowledge of Ceph. Perhaps I'll come across something as I progress. Comments welcome.