Ceph OSD heartbeats

Ceph OSD daemons need to ensure that the neighbouring OSDs are functioning properly so that the cluster remains in a healthy state.

For this, each Ceph OSD process (ceph-osd) sends a heartbeat signal to the neighbouring OSDs. By default, the heartbeat signal is sent every 6 seconds (`osd_heartbeat_interval`), which is of course configurable.

If an OSD does not hear back from a peer within the value set for `osd_heartbeat_grace`, which is 20 seconds by default, it reports that peer OSD (the one that didn’t respond within the grace period) as down, to the MONs. Once an OSD has reported the non-responding OSD as `down` three times, the MON acknowledges it and marks the OSD as down.
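Both settings can be inspected and changed at runtime. As a minimal illustration (the OSD id and the 30-second value are only examples), the current grace period can be read from an OSD’s admin socket on the node hosting it, and a new value can be injected into all OSDs:

# ceph daemon osd.0 config get osd_heartbeat_grace
# ceph tell osd.* injectargs '--osd_heartbeat_grace 30'

Note that injected values do not survive a daemon restart; to make such a change permanent, also set it under the [osd] section of ceph.conf.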

The monitor then updates the cluster map and sends it over to the participating nodes in the cluster.

[Image: OSD-heartbeat-1]

When an OSD can’t reach another OSD for a heartbeat, it reports the following in the OSD logs:

osd.510 1497 heartbeat_check: no reply from osd.11 since back 2016-04-28 20:49:42.088802

In Ceph Jewel, the MONs require a minimum of two Ceph OSDs, from nodes in different CRUSH subtrees, to report a specific OSD as down before actually marking it down. This is controlled by the following tunables:

From ‘common/config_opts.h’:

[1] OPTION(mon_osd_min_down_reporters, OPT_INT, 2) // number of OSDs from different subtrees who need to report a down OSD for it to count

[2] OPTION(mon_osd_reporter_subtree_level, OPT_STR, "host") // in which level of parent bucket the reporters are counted

Image courtesy: Red Hat Ceph Storage 1.3.2 Configuration Guide
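To override these defaults, the options can go under the [mon] section of ceph.conf; the values below are purely illustrative (requiring three reporters, counted per rack instead of per host):

[mon]
mon osd min down reporters = 3
mon osd reporter subtree level = rack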

Monitor maps, how to edit them?

The MON map is used by the monitors in a Ceph cluster, where they keep track of various attributes relevant to the working of the cluster.

Similar to the CRUSH map, a monitor map can be pulled out of the cluster, inspected, changed, and injected back to the monitors, manually. A frequent use-case is when the IP address of a monitor changes and the monitors cannot agree on a quorum.

Monitors use the monitor map (monmap) to get the details of other monitors. So just changing the monitor address in ‘ceph.conf‘ and pushing the configuration to all the nodes won’t help to propagate the changes.
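For reference, the monitor address typically appears in ceph.conf in lines similar to the following (hostname and address here are illustrative); editing only these lines changes what clients and new daemons read, but not the monmap the existing monitors rely on:

[global]
mon initial members = node2
mon host = 192.168.122.73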

In most cases, starting a monitor with a wrong monitor map would make it commit suicide, since it would find conflicting information about itself in the monmap due to the IP address change.

There are two methods to fix this problem. The first is to add enough new monitors, let them form a quorum, and remove the faulty monitors; this doesn’t need any further explanation. The second, cruder way is to edit the monitor map directly, set the new IP address, and upload the monmap back to the monitors.

This article discusses the second method, i.e., how to edit the monmap and inject it back. This can be done using the ‘monmaptool‘ utility.

1. As the first step, log in to one of the monitors and get the monitor map:

# ceph mon getmap -o /tmp/monitor_map.bin
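Note that ‘ceph mon getmap‘ needs a working quorum. If the monitors cannot form a quorum (which is often the case when this procedure is needed), the map can instead be extracted from a monitor’s local store while the monitor daemon is stopped; the monitor id below is just an example:

# ceph-mon -i node2 --extract-monmap /tmp/monitor_map.bin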

2. Inspect what the monitor map contains:

# monmaptool --print /tmp/monitor_map.bin

  • An example from my cluster:

# monmaptool --print monmap

monmaptool: monmap file monmap epoch 1
fsid d978794d-5835-4ac3-8fe3-3855b18b9572
last_changed 0.000000 created 0.000000
0: 192.168.122.73:6789/0 mon.node2

3. Remove the node which has the wrong IP address, referring to it by its hostname:

# monmaptool --rm node2 /tmp/monitor_map.bin

4. Inspect the monitor map to see if the monitor is indeed removed.

# monmaptool --print /tmp/monitor_map.bin

monmaptool: monmap file monmap epoch 1
fsid d978794d-5835-4ac3-8fe3-3855b18b9572
last_changed 0.000000 created 0.000000

5. Add a new monitor (or the existing monitor with its new IP):

# monmaptool --add node3 192.168.122.76:6789 /tmp/monitor_map.bin

monmaptool: monmap file monmap
monmaptool: writing epoch 1 to monmap (1 monitors)

6. Check the monitor map to confirm the changes

# monmaptool --print monmap

monmaptool: monmap file monmap epoch 1
fsid d978794d-5835-4ac3-8fe3-3855b18b9572
last_changed 0.000000 created 0.000000
0: 192.168.122.76:6789/0 mon.node3

7. Make sure the mon processes are not running on the monitor nodes

# service ceph stop mon

8. Upload the changes by injecting the modified map on each monitor node:

# ceph-mon -i <monitor_id> --inject-monmap /tmp/monitor_map.bin

9. Start the mon process on each monitor

# service ceph start mon

10. Check if the cluster has taken in the changes.

# ceph -s
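Apart from ‘ceph -s‘, the monitor list itself can be verified once quorum is back; the following command dumps the monmap currently in use and should show the monitor with its new address:

# ceph mon dump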

 

Compacting a Ceph monitor store

It is common for the Ceph monitor store to grow large in a busy Ceph cluster.

If a ‘ceph -s‘ takes considerable time to return information, one of the possibilities is that the monitor database has grown large.

Other reasons include network lag between the client and the monitor, the monitor not responding properly due to system load, firewall settings on the client or the monitor, etc.

The best way to deal with a large monitor database is to compact the monitor store. The monitor store is a leveldb store which stores key/value pairs.

There are two ways to compact a levelDB store, either on the fly or at the monitor process startup.

To compact the store dynamically, use:

# ceph tell mon.[ID] compact

To compact the levelDB store every time the monitor process starts, add the following in /etc/ceph/ceph.conf under the [mon] section:

mon compact on start = true

The second option would compact the levelDB store each and every time the monitor process starts.

The monitor database is stored under /var/lib/ceph/mon/<cluster>-<hostname>/store.db/ as files with the extension ‘.sst‘, which stands for ‘Sorted String Table‘.
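To get a rough idea of how much space the store occupies (for example, before and after a compaction), checking the size of that directory is usually enough; the path below assumes the default data location:

# du -sh /var/lib/ceph/mon/*/store.db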

To read more on levelDB, please refer:

https://en.wikipedia.org/wiki/LevelDB

http://leveldb.googlecode.com/svn/trunk/doc/impl.html

http://google-opensource.blogspot.in/2011/07/leveldb-fast-persistent-key-value-store.html

How to list all the configuration settings in a Ceph cluster monitor?

It can be really helpful to have a single command that lists all the configuration settings of a monitor node in a Ceph cluster.

This is possible by interacting directly with the monitor’s unix socket file. This can be found under /var/run/ceph/. By default, the admin socket for the monitor will be in the path /var/run/ceph/ceph-mon.<hostname-s>.asok.

The location can vary in case you defined a different one at the time of installation. To find the actual socket path, use the following command:


# ceph-conf --name mon.$(hostname -s) --show-config-value admin_socket

This should print the location of the admin socket. In most cases, it should be something like /var/run/ceph/ceph-mon.$(hostname -s).asok

Once you have the monitor admin socket, use that location to show the various configuration settings with:


# ceph daemon /var/run/ceph/ceph-mon.*.asok config show

The output will be long and won’t fit on a single screen. You can either pipe it to ‘less’ or grep for a specific value in case you know what you are looking for.

For example, if I need to look at the ratio at which an OSD would be considered full, I’d use:


#  ceph daemon /var/run/ceph/ceph-mon.*.asok config show | grep mon_osd_full_ratio
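Alternatively, when the exact option name is already known, the admin socket can return just that value with ‘config get’ instead of grepping the full dump; the option below is just the one from the previous example:

# ceph daemon /var/run/ceph/ceph-mon.*.asok config get mon_osd_full_ratio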