‘noout’ flag in Ceph

You may have seen the ‘noout’ flag set in the output of ‘ceph -s’. What does this actually mean?

This is a cluster-wide flag. When it is set, an OSD that goes down is not marked out of the cluster, and the rebalancing that would normally start to maintain the replica count does not kick in. By default, the monitors mark an OSD out of the acting set if it has been unreachable for 300 seconds, i.e., 5 minutes.

To find the value configured in your cluster, use:

# ceph daemon /var/run/ceph/ceph-mon.*.asok config show | grep mon_osd_report_timeout

When an OSD is marked out, another OSD takes its place, and data replication to that OSD starts according to the replica count of each pool.

If this flag (noout) is set, the monitors will not mark OSDs out of the acting set. The affected PGs will report a degraded state while an OSD is down, but the OSD will remain in the acting set.

This can be helpful when we want to take an OSD down for maintenance, but don’t want its data objects to be replicated over to another OSD in the meantime.

To set the ‘noout’ flag, use:

# ceph osd set noout

Once the planned maintenance is finished, you can unset the flag using:

# ceph osd unset noout
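Whether the flag is active also shows up in the output of ‘ceph osd dump’ and ‘ceph -s’. Here is a minimal sketch of checking for it from a script, run against a canned flags line; on a live cluster you would capture the real one, e.g. with ceph osd dump | grep '^flags':

```shell
# Canned 'flags' line as it appears in 'ceph osd dump' output
# (sample only; on a live cluster: flags_line=$(ceph osd dump | grep '^flags'))
flags_line='flags noout,sortbitwise'

# Check whether 'noout' appears among the cluster flags:
case "$flags_line" in
  *noout*) echo "noout is set" ;;
  *)       echo "noout is not set" ;;
esac
```

This is handy in maintenance scripts that should refuse to start a reboot unless the flag is confirmed to be in place.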

How to dynamically change a configuration value in a Ceph cluster?

It is possible to change a particular configuration setting in a Ceph cluster dynamically, and I think it is a very neat and useful feature.

Imagine the case where you want to change the replica count of a particular pool from 3 to 4. How would you change this without restarting the Ceph cluster itself? That is where the ‘ceph tell’ command comes in.

As we saw in the previous post, you can get the list of configuration settings using the administrator socket, from either a monitor or an OSD node.

To change a configuration use:

# ceph tell mon.* injectargs '--{tunable value_to_be_set}'

For example, the timeout after which an unreachable OSD is marked down and out can be changed with:

# ceph tell mon.* injectargs '--mon_osd_report_timeout 400'

By default, it is 300 seconds, i.e., 5 minutes.

How to fetch the entire list of tunables along with the values for a Ceph cluster node?

In many cases we would like to get the active configurations from a Ceph node, either a monitor or an OSD node. A neat feature, I must say, is to probe the administrative socket file to get a listing of all the active configurations, be it on the OSD node or the monitor node.

This comes in handy when we have changed a setting and want to confirm that it has indeed taken effect.

The admin socket file exists for both the monitors and the OSD nodes. The monitor node will have a single admin socket file, while the OSD nodes will have an admin socket for each of the OSDs present on the node.

  • Listing of the admin socket on a monitor node

 # ls /var/run/ceph/ -l
 total 4
 srwxr-xr-x. 1 root root 0 May 13 05:13 ceph-mon.hp-m300-2.asok
 -rw-r--r--. 1 root root 7 May 13 05:13 mon.hp-m300-2.pid

  • Listing of the admin sockets on an OSD node

 # ls -l /var/run/ceph/
 total 20
 srwxr-xr-x. 1 root root 0 May  8 02:42 ceph-osd.0.asok
 srwxr-xr-x. 1 root root 0 May 26 11:18 ceph-osd.2.asok
 srwxr-xr-x. 1 root root 0 May 26 11:18 ceph-osd.3.asok
 srwxr-xr-x. 1 root root 0 May  8 02:42 ceph-osd.4.asok
 srwxr-xr-x. 1 root root 0 May 26 11:18 ceph-osd.5.asok
 -rw-r--r--. 1 root root 8 May  8 02:42 osd.0.pid
 -rw-r--r--. 1 root root 8 May 26 11:18 osd.2.pid
 -rw-r--r--. 1 root root 8 May 26 11:18 osd.3.pid
 -rw-r--r--. 1 root root 8 May  8 02:42 osd.4.pid
 -rw-r--r--. 1 root root 8 May 26 11:18 osd.5.pid

For example, consider that we have changed the ‘mon_osd_full_ratio’ value, and need to confirm that the cluster has picked up the change.

We can get a listing of the active configured settings and grep out the setting we are interested in.

# ceph daemon /var/run/ceph/ceph-mon.*.asok config show

The above command prints out a listing of all the active configurations and their current values. We can easily grep out ‘mon_osd_full_ratio’ from this list.

# ceph daemon /var/run/ceph/ceph-mon.*.asok config show | grep mon_osd_full_ratio

On my test cluster, this printed out ‘0.75’. With this setting, the cluster treats an OSD as full once it has reached 75% of its capacity.
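As a sketch of what that confirmation could look like in a script, here is the value being extracted from a canned fragment of ‘config show’ output; the exact output format differs between Ceph versions, so treat the parsing as illustrative:

```shell
# Canned fragment of 'config show' output (format varies by version;
# on a live cluster, pipe the real 'ceph daemon ... config show' instead):
config='mon_osd_full_ratio = 0.75
mon_osd_nearfull_ratio = 0.85'

# Extract just the value for the key we care about:
full_ratio=$(echo "$config" | awk '$1 == "mon_osd_full_ratio" {print $3}')
echo "$full_ratio"
```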

This can be checked by probing the OSD admin socket as well.

NOTE: In case you are probing a particular OSD, please make sure to use the OSD admin socket on the node hosting that OSD. To locate the node an OSD is on, use:

# ceph osd tree

Example: probing the OSD admin socket on its node for ‘mon_osd_full_ratio’, as we did on the monitor, should return the same value.

# ceph daemon /var/run/ceph/ceph-osd.5.asok config show | grep mon_osd_full_ratio

NOTE: Another command exists which should print the same configuration settings, but only for OSDs.

# ceph daemon osd.5 config show

One drawback worth mentioning: this must be executed on the node on which the OSD is present. To find the OSD-to-node mapping, use ‘ceph osd tree’.
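The OSD-to-node mapping can also be pulled out programmatically. A small sketch, run here on a canned ‘ceph osd tree’ body (pipe the real command on a live cluster); the awk logic simply remembers the last ‘host’ bucket seen:

```shell
# Canned 'ceph osd tree' body (columns: id, weight, type, name, ...):
tree='-1 0.08997 root default
-2 0.01999 host hp-m300-5
0 0.009995 osd.0 up 1
4 0.009995 osd.4 up 1
-4 0.05998 host hp-m300-4
2 0.04999 osd.2 up 1'

# Remember the last 'host' bucket seen; print it next to each osd entry.
echo "$tree" | awk '$3 == "host" {h=$4} $3 ~ /^osd\./ {print $3, "->", h}'
```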

How to change the filling ratio for a Ceph OSD?

There could be many scenarios where you’d need to change the percentage of space usage on a Ceph OSD. One such use case would be when your OSD space is about to hit the hard limit, and is constantly sending you warnings.

For one reason or another, you may need to extend the threshold limit for some time. In such a case, you don’t need to change/add the configuration in ceph.conf and push it across. Rather, you can do it while the cluster is online, from the command line.

‘ceph tell’ is a very useful command, in the sense that the administrator doesn’t need to stop/start the OSDs, MONs, etc. after a configuration change. In our case, we are looking to set ‘mon_osd_full_ratio’ to 98%. We can do that with:

# ceph tell mon.* injectargs "--mon_osd_full_ratio .98"
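Since a typo here (say, 9.8 instead of .98) could be harmful, a small pre-flight check before injecting doesn’t hurt. This helper is my own addition, not part of Ceph; the actual injectargs line is commented out since it needs a live cluster:

```shell
# Hypothetical pre-flight check: only inject the value if it is a sane
# ratio in (0, 1]. Not a Ceph feature, just defensive scripting.
ratio=.98
if awk -v r="$ratio" 'BEGIN { exit !(r > 0 && r <= 1) }'; then
  echo "ok: $ratio"
  # ceph tell mon.* injectargs "--mon_osd_full_ratio $ratio"
else
  echo "refusing to inject $ratio" >&2
fi
```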

In an earlier post (https://goo.gl/xjXOoI) we saw how to get all the configurable options from a monitor. If I understand correctly, almost all configuration values can be changed online by injecting them with ‘ceph tell’.

How to remove a host from a Ceph cluster?

I’m still studying Ceph, and recently faced a scenario in which one of my Ceph nodes went down due to hardware failure. Even though my data was safe due to the replication factor, I was not able to remove the node from the cluster.

I could remove the OSDs on the node, but I didn’t find a way to stop the node from being listed in ‘ceph osd tree’. I ended up editing the CRUSH map by hand to remove the host, and uploaded it back. This worked as expected. The following are the steps I took.

a) This was the state just after the node went down:

# ceph osd tree

# id    weight     type    name              up/down    reweight
 -1     0.08997    root    default
 -2     0.01999        host hp-m300-5
 0      0.009995           osd.0             up         1
 4      0.009995           osd.4             up         1
 -3     0.009995       host hp-m300-9
 1      0.009995           osd.1             down       0
 -4     0.05998        host hp-m300-4
 2      0.04999            osd.2             up         1
 3      0.009995           osd.3             up         1

# ceph -w

    cluster 62a6a880-fb65-490c-bc98-d689b4d1a3cb
     health HEALTH_WARN 64 pgs degraded; 64 pgs stuck unclean; recovery 261/785 objects degraded (33.248%)
     monmap e1: 1 mons at {hp-m300-4=}, election epoch 1, quorum 0 hp-m300-4
     osdmap e130: 5 osds: 4 up, 4 in
     pgmap v8465: 196 pgs, 4 pools, 1001 MB data, 262 objects
         7672 MB used, 74192 MB / 81865 MB avail
         261/785 objects degraded (33.248%)
         64 active+degraded
         132 active+clean
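When watching recovery like this, the number I keep an eye on is the degraded-object percentage. Here is a quick, illustrative way to pull it out of a captured status line (the sample below is canned; on a live cluster, pipe ‘ceph -s’):

```shell
# Canned health line as printed by 'ceph -w' / 'ceph -s':
health='health HEALTH_WARN 64 pgs degraded; recovery 261/785 objects degraded (33.248%)'

# Grab the parenthesised percentage and strip the parentheses:
pct=$(echo "$health" | grep -o '([0-9.]*%)' | tr -d '()')
echo "$pct"
```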

I started by marking the OSD on the node out and removing it. Note that I don’t need to stop the OSD (osd.1), since the node carrying it is down and not accessible.

b) If the node were accessible, you would first have to stop the OSD using:

 # sudo service ceph stop osd.1

c) Mark the OSD out. This is not strictly needed in this case, since the node is already down and osd.1 has been marked out automatically.

 # ceph osd out osd.1

d) Remove the OSD from the CRUSH map, so that it no longer receives any data. Alternatively, you can fetch the CRUSH map, de-compile it, remove the OSD, re-compile it, and upload it back.

Remove item id 1 with the name ‘osd.1’ from the CRUSH map.

 # ceph osd crush remove osd.1

e) Remove the OSD authentication key

 # ceph auth del osd.1

f) At this stage, I had to remove the OSD host from the listing, but was not able to find a way to do so. ‘ceph-deploy’ didn’t offer any tools for this other than ‘purge’ and ‘uninstall’. Since the node was not accessible, these wouldn’t work anyway. A ‘ceph-deploy purge’ failed with the following errors, which is expected since the node is not reachable.

 # ceph-deploy purge hp-m300-9

 [ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
 [ceph_deploy.cli][INFO  ] Invoked (1.5.22-rc1): /usr/bin/ceph-deploy purge hp-m300-9
 [ceph_deploy.install][INFO  ] note that some dependencies *will not* be removed because they can cause issues with qemu-kvm
 [ceph_deploy.install][INFO  ] like: librbd1 and librados2
 [ceph_deploy.install][DEBUG ] Purging from cluster ceph hosts hp-m300-9
 [ceph_deploy.install][DEBUG ] Detecting platform for host hp-m300-9 ...
 ssh: connect to host hp-m300-9 port 22: No route to host
 [ceph_deploy][ERROR ] RuntimeError: connecting to host: hp-m300-9 resulted in errors: HostNotFound hp-m300-9

I ended up fetching the CRUSH map, removing the OSD host from it, and uploading it back.

g) Get the CRUSH map

 # ceph osd getcrushmap -o /tmp/crushmap

h) De-compile the CRUSH map

 # crushtool -d /tmp/crushmap -o crush_map

i) I had to remove the entries pertaining to the host-to-be-removed from the following sections:

  • devices
  • types
  • the ‘root’ default section
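The hand-edit in step i) can also be scripted. Below is a sketch that deletes a host bucket and any item lines referencing it with sed, run here on a canned fragment of a decompiled map; on a real map, always verify the result still compiles with ‘crushtool -c’ before uploading:

```shell
# Canned fragment of a decompiled CRUSH map (illustrative weights/ids):
map='host hp-m300-5 {
        id -2
        item osd.0 weight 0.010
}
host hp-m300-9 {
        id -3
        item osd.1 weight 0.010
}
root default {
        item hp-m300-5 weight 0.020
        item hp-m300-9 weight 0.010
}'

# Delete the hp-m300-9 bucket block, and any item lines referencing it:
echo "$map" | sed '/^host hp-m300-9 {/,/^}/d; /item hp-m300-9/d'
```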

j) Once the entries were removed, I went ahead and compiled the map, and inserted it back.

 # crushtool -c crush_map -o /tmp/crushmap
 # ceph osd setcrushmap -i /tmp/crushmap

k) A ‘ceph osd tree’ looks much cleaner now 🙂

 # ceph osd tree

# id    weight     type    name              up/down    reweight
 -1     0.07999    root    default
 -2     0.01999        host hp-m300-5
 0      0.009995           osd.0             down       0
 4      0.009995           osd.4             down       0
 -4     0.06           host hp-m300-4
 2      0.04999            osd.2             up         1
 3      0.009995           osd.3             up         1

There may be a more direct method to remove the OSD host from the listing, but I’m not aware of one, based on my limited knowledge. Perhaps I’ll come across something as I progress with Ceph. Comments welcome.

How to list all the configuration settings in a Ceph cluster monitor?

It can be really helpful to have a single command to list all the configuration settings in a monitor node, in a Ceph cluster.

This is possible by interacting directly with the monitor’s unix socket file, which can be found under /var/run/ceph/. By default, the admin socket for the monitor will be at /var/run/ceph/ceph-mon.<short-hostname>.asok.

The location can vary in case you defined a different one at installation time. To know the actual socket path, use the following command:

# ceph-conf --name mon.$(hostname -s) --show-config-value admin_socket

This should print the location of the admin socket. In most cases, it should be something like /var/run/ceph/ceph-mon.$(hostname -s).asok
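The stock naming scheme means the path can also be constructed directly from the short hostname; here is a trivial sketch (the ‘ceph-conf’ command above remains the authoritative check, since the path is configurable):

```shell
# Build the default monitor admin-socket path from the short hostname.
# This mirrors the stock naming scheme; ceph-conf remains authoritative.
host=$(hostname -s)
sock="/var/run/ceph/ceph-mon.${host}.asok"
echo "$sock"
```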

Once you have the monitor admin socket, use that location to show the various configuration settings with:

# ceph daemon /var/run/ceph/ceph-mon.*.asok config show

The output is long and won’t fit on a single screen. You can either pipe it to ‘less’ or grep for a specific value in case you know what you are looking for.

For example, if I need to look at the ratio at which the OSD would be considered full, I’ll be using:

#  ceph daemon /var/run/ceph/ceph-mon.*.asok config show | grep mon_osd_full_ratio