Custom CRUSH rulesets and pools

Ceph supports custom rulesets via CRUSH. These can be used to group hardware by characteristics such as speed, to set custom weights, and more.

Pools, the buckets where the data is written, can be created on custom rulesets, which places them on specific hardware according to the administrator’s needs.

A large Ceph cluster may have lots of pools and rulesets specific for multiple use-cases. There may be times when we’d like to understand the pool to ruleset mapping.

The default CRUSH ruleset is named ‘replicated_ruleset’. The available CRUSH rulesets can be listed with:

$ ceph osd crush rule ls

On a fresh cluster, or one without any custom rulesets, you’d find the following being printed to stdout.

# ceph osd crush rule ls

I’ve got a couple more on my cluster, and this is how it looks:

# ceph osd crush rule ls

Since this article looks into the mapping of pools to CRUSH rulesets, it’d be good to add in how to list the pools, as a refresher.

# ceph osd lspools

On my Ceph cluster, it turned out to be:

# ceph osd lspools
0 data,1 metadata,2 rbd,21 .rgw,22 .rgw.root,23 .rgw.control,24 .rgw.gc,25 .users.uid,26 .users,27 .users.swift,28 test_pool,

Since you have the pool name you’re interested in, let’s see how to map it to the ruleset. The command syntax is:

# ceph osd pool get <pool_name> crush_ruleset

I was interested to understand the ruleset on which the pool ‘test_pool’ was created. The command to list this was:

# ceph osd pool get test_pool crush_ruleset
crush_ruleset: 1

Note that rulesets are numbered from ‘0’, so ‘1’ here maps to the CRUSH ruleset ‘replicated_ssd’.
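To see the ruleset of every pool at once, the two commands above can be combined in a short script. This is only a minimal sketch: the helper names `parse_lspools` and `pool_rulesets` are made up for illustration, and it assumes the `ceph` binary is on the PATH and that `ceph osd lspools` prints the comma-separated form shown earlier.

```python
import subprocess

def parse_lspools(text):
    """Turn '0 data,1 metadata,' into [(0, 'data'), (1, 'metadata')]."""
    pools = []
    for entry in text.strip().split(","):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty item after the trailing comma
        pool_id, name = entry.split(" ", 1)
        pools.append((int(pool_id), name))
    return pools

def pool_rulesets():
    """Query each pool's crush_ruleset. Needs a live cluster (assumption)."""
    listing = subprocess.check_output(["ceph", "osd", "lspools"]).decode()
    result = {}
    for _, name in parse_lspools(listing):
        out = subprocess.check_output(
            ["ceph", "osd", "pool", "get", name, "crush_ruleset"]).decode()
        # The reply looks like 'crush_ruleset: 1'
        result[name] = int(out.split(":")[1])
    return result

# Parsing alone can be tried without a cluster:
print(parse_lspools("0 data,1 metadata,2 rbd,28 test_pool,"))
```

Calling `pool_rulesets()` on a monitor node should then return a dictionary of pool names and ruleset numbers.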

We’ll try to understand how a custom ruleset is created, in another article.

OSD information in a scriptable format

To get the OSD ID and the corresponding node IP address mapping in a scriptable format, use the following command:

# ceph osd find <OSD-num>

This will print the OSD number, the IP address, the host name, and the root in the CRUSH map, as JSON (which looks much like a Python dictionary).

# ceph osd find 2
{ "osd": 2,
"ip": "\/5311",
"crush_location": { "host": "node4", "root": "default"}}

The output is in JSON format, i.e. key:value pairs. It can be parsed using awk/sed, or in any programming language that supports JSON. All recent ones do.
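As an illustration of the parsing, here is a minimal Python sketch using the standard `json` module. The function name `summarize_osd` is hypothetical, and the address in the sample string is a made-up documentation address, not one from my cluster.

```python
import json

def summarize_osd(find_output):
    """Return (osd_id, ip, host, root) from `ceph osd find` JSON text."""
    info = json.loads(find_output)
    location = info.get("crush_location", {})
    return (info["osd"], info.get("ip"),
            location.get("host"), location.get("root"))

# Made-up sample in the same shape as the output above.
sample = ('{"osd": 2, "ip": "192.0.2.14:6800\\/5311", '
          '"crush_location": {"host": "node4", "root": "default"}}')
print(summarize_osd(sample))
```

Note that the JSON escape `\/` is decoded to a plain `/` by the parser.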

For a listing of all the OSDs and related information, get the number of OSDs in the cluster, and then use that number to probe the OSDs. OSD IDs start at 0, so the last ID is the count minus one (this assumes the IDs are contiguous).

# for i in $(seq 0 $(( $(ceph osd stat | awk '{print $3}') - 1 ))); do ceph osd find $i; echo; done

This should output:

{ "osd": 0,
"ip": "\/2579",
"crush_location": { "host": "node3",
"root": "ssd"}}
{ "osd": 1,
"ip": "\/955",
"crush_location": { "host": "node3",
"root": "ssd"}}
{ "osd": 2,
"ip": "\/5311",
"crush_location": { "host": "node4",
"root": "default"}}
{ "osd": 3,
"ip": "\/5626",
"crush_location": { "host": "node4",
"root": "default"}}
{ "osd": 4,
"ip": "\/4194",
"crush_location": { "host": "node5",
"root": "default"}}
{ "osd": 5,
"ip": "\/4521",
"crush_location": { "host": "node5",
"root": "default"}}
{ "osd": 6,
"ip": "\/5614",
"crush_location": { "host": "node2",
"root": "ssd"}}
{ "osd": 7,
"ip": "\/1719",
"crush_location": { "host": "node2",
"root": "ssd"}}
{ "osd": 8,
"ip": "\/5842",
"crush_location": { "host": "node6",
"root": "default"}}
{ "osd": 9,
"ip": "\/4356",
"crush_location": { "host": "node6",
"root": "default"}}
{ "osd": 10,
"ip": "\/4517",
"crush_location": { "host": "node7",
"root": "default"}}
{ "osd": 11,
"ip": "\/4821",
"crush_location": { "host": "node7",
"root": "default"}}
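Once the per-OSD records are collected, they can be summarised, for example by CRUSH root. The sketch below is one possible way to do it; `group_by_root` is a hypothetical helper, and the three records are taken from the listing above with the "ip" field left out.

```python
import json
from collections import defaultdict

def group_by_root(find_outputs):
    """Map each CRUSH root to a sorted list of (osd_id, host) pairs."""
    groups = defaultdict(list)
    for text in find_outputs:
        info = json.loads(text)
        location = info["crush_location"]
        groups[location["root"]].append((info["osd"], location["host"]))
    return {root: sorted(osds) for root, osds in groups.items()}

records = [
    '{"osd": 0, "crush_location": {"host": "node3", "root": "ssd"}}',
    '{"osd": 2, "crush_location": {"host": "node4", "root": "default"}}',
    '{"osd": 6, "crush_location": {"host": "node2", "root": "ssd"}}',
]
for root, osds in sorted(group_by_root(records).items()):
    print(root, osds)
```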

Monitor maps, how to edit them?

The MON map is used by the monitors in a Ceph cluster, where they keep track of various attributes relevant to the working of the cluster.

Similar to the CRUSH map, a monitor map can be pulled out of the cluster, inspected, changed, and injected back to the monitors, manually. A frequent use-case is when the IP address of a monitor changes and the monitors cannot agree on a quorum.

Monitors use the monitor map (monmap) to get the details of the other monitors. So just changing the monitor address in ‘ceph.conf’ and pushing the configuration to all the nodes won’t propagate the change.

In most cases, starting a monitor with a wrong monitor map makes it shut itself down, since it finds conflicting information about itself in the monmap due to the IP address change.

There are two ways to fix this problem. The first is to add enough new monitors, let them form a quorum, and then remove the faulty monitors; this doesn’t need any explanation. The second, cruder way is to edit the monitor map directly, set the new IP address, and upload the monmap back to the monitors.

This article discusses the second method, i.e. how to edit the monmap and re-inject it. This can be done using ‘monmaptool’.

1. As the first step, log in to one of the monitors, and get the monitor map:

# ceph mon getmap -o /tmp/monitor_map.bin

2. Inspect what the monitor map contains:

# monmaptool --print /tmp/monitor_map.bin

  • An example from my cluster:

# monmaptool --print monmap

monmaptool: monmap file monmap epoch 1
fsid d978794d-5835-4ac3-8fe3-3855b18b9572
last_changed 0.000000 created 0.000000
0: mon.node2

3. Remove the node which has the wrong IP address, referring to it by its hostname:

# monmaptool --rm node2 /tmp/monitor_map.bin

4. Inspect the monitor map to see if the monitor is indeed removed.

# monmaptool --print /tmp/monitor_map.bin

monmaptool: monmap file monmap epoch 1
fsid d978794d-5835-4ac3-8fe3-3855b18b9572
last_changed 0.000000 created 0.000000

5. Add a new monitor (or the existing monitor with its new IP)

# monmaptool --add node3  /tmp/monitor_map.bin

monmaptool: monmap file monmap
monmaptool: writing epoch 1 to monmap (1 monitors)

6. Check the monitor map to confirm the changes

# monmaptool --print monmap

monmaptool: monmap file monmap epoch 1
fsid d978794d-5835-4ac3-8fe3-3855b18b9572
last_changed 0.000000 created 0.000000
0: mon.node3

7. Make sure the mon processes are not running on the monitor nodes

# service ceph stop mon

8. Upload the changes

# ceph-mon -i monitor_node --inject-monmap /tmp/monitor_map.bin

9. Start the mon process on each monitor

# service ceph start mon

10. Check if the cluster has taken in the changes.

# ceph -s
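The ten steps above can also be laid out as a dry-run plan before touching a live monitor. The following is a sketch under assumptions, not a tested tool: `monmap_edit_plan` is a hypothetical helper that only builds the command lines and executes nothing, and the address `192.0.2.10:6789` is a made-up placeholder. It assumes that `monmaptool --add` takes the monitor name followed by its address.

```python
def monmap_edit_plan(old_name, new_name, new_addr, mon_id,
                     map_path="/tmp/monitor_map.bin"):
    """Return the commands of the procedure, in order. Nothing is executed."""
    return [
        ["ceph", "mon", "getmap", "-o", map_path],           # step 1
        ["monmaptool", "--print", map_path],                 # step 2
        ["monmaptool", "--rm", old_name, map_path],          # step 3
        ["monmaptool", "--print", map_path],                 # step 4
        ["monmaptool", "--add", new_name, new_addr, map_path],  # step 5
        ["monmaptool", "--print", map_path],                 # step 6
        ["service", "ceph", "stop", "mon"],                  # step 7
        ["ceph-mon", "-i", mon_id, "--inject-monmap", map_path],  # step 8
        ["service", "ceph", "start", "mon"],                 # step 9
        ["ceph", "-s"],                                      # step 10
    ]

# Placeholder names and address, for illustration only.
for argv in monmap_edit_plan("node2", "node3", "192.0.2.10:6789", "node3"):
    print(" ".join(argv))
```

Printing the plan first makes it easy to check that every step refers to the same monmap file.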


Calculate a PG id from the hex values in Ceph OSD debug logs

Recently, I had an incident where the OSDs were crashing at the time of startup. Obviously, the next step was to enable debug logs for the OSDs and understand where they were crashing.

I enabled the OSD debug logs dynamically by injecting the settings with:

# ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'

NOTE: This command can be run from the MON nodes.

Once this was done, the OSDs were started manually (since they were crashing and not running) and I watched for the next crash. It came with the following logs:

*read_log 107487'1 (0'0) modify f6b07b93/rbd_data.hash/head//12 by client.version date, time
*osd/ In function ‘static bool PGLog::read_log(ObjectStore*, coll_t, hobject_t, const pg_info_t&,
std::map<eversion_t, hobject_t>&, PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&,
std::set<std::basic_string<char> >*)’ thread thread time date, time
*osd/ 809: FAILED assert(last_e.version.version < e.version.version)
ceph version version-details

1: (PGLog::read_log(ObjectStore*, coll_t, hobject_t, pg_info_t const&, std::map<eversion_t, hobject_t,
std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&,
pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&,
std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0x13ee) [0x6efcae]
2: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x315) [0x7692f5]
3: (OSD::load_pgs()+0xfff) [0x639f8f]
4: (OSD::init()+0x7bd) [0x63c10d]
5: (main()+0x2613) [0x5ecd43]
6: (__libc_start_main()+0xf5) [0x7fdc338f9af5]
7: /usr/bin/ceph-osd() [0x5f0f69]

The above is a log snippet at which the OSD process was crashing. The ceph-osd process was reading through the log areas of each PG in the OSD, and once it reached the problematic PG it crashed due to failing an assert condition.

Checking the source at ‘osd/’, we see that this error is logged from ‘PGLog::read_log’.

void PGLog::read_log(ObjectStore *store, coll_t pg_coll,
coll_t log_coll,
ghobject_t log_oid,
const pg_info_t &info,
map<eversion_t, hobject_t> &divergent_priors,
set<string> *log_keys_debug)

if (!log.log.empty()) {
pg_log_entry_t last_e(log.log.back());
assert(last_e.version.version < e.version.version);    // <== the assert at which read_log fails for a particular PG
assert(last_e.version.epoch <= e.version.epoch);

In order to make the OSD start, we needed to move this PG to a different location using the ‘ceph_objectstore_tool’ so that the ceph-osd can bypass the problematic PG. To understand the PG where it was crashing, we had to do some calculations based on the logs.

The ‘read_log’ line in the debug logs contains a hex value after the string “modify”; this is the object hash from which the PG number is derived. The last number in that series is the pool id (12 in our case). The following python code will help to calculate the PG id based on the arguments passed to it.

This program accepts three arguments, the first being the hex value we talked about, the second being the pg_num of the pool, and the third one being the pool id.

#!/usr/bin/env python
# Calculate the PG ID from the object hash
import sys

def pg_id_calc(*args):
    # Exactly three arguments are required; anything else prints the usage.
    if len(args) != 3:
        help()
        sys.exit(1)
    hash_hex = args[0]
    pg_num = int(args[1])
    pool_id = int(args[2])
    hash_dec = int(hash_hex, 16)
    # Plain modulo matches the placement only when pg_num is a power of two.
    id_dec = hash_dec % pg_num
    id = hex(id_dec)
    # Strip the leading "0x" and prefix the pool id.
    pg_id = str(pool_id) + "." + str(id)[2:]
    print("\nThe PG ID is %s\n" % pg_id)

def help():
    print("This script expects the hash (in Hex), pg_num of the pool, and the pool id as arguments, in order")
    print("./ 0x8e2fe5d7 2048 12")

if __name__ == '__main__':
    pg_id_calc(*sys.argv[1:])

An example of the program in action:

# python 0xf6b07b93 2048 12
The PG ID is 12.393
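The hash and the pool id can also be pulled out of the log line automatically instead of being copied by hand. The sketch below makes a few assumptions: the helper names are hypothetical, the regular expression only covers lines shaped like the ‘read_log’ line shown earlier, and `stable_mod` is my reading of Ceph's `ceph_stable_mod`, which gives the same result as a plain modulo when pg_num is a power of two.

```python
import re

# hash/object-name/snapshot//pool, as it appears after "modify"
READ_LOG_RE = re.compile(r"modify ([0-9a-fA-F]+)/.*//(\d+)")

def parse_read_log(line):
    """Return (hash_hex, pool_id) from a read_log debug line."""
    match = READ_LOG_RE.search(line)
    if match is None:
        raise ValueError("no object hash found in line")
    return match.group(1), int(match.group(2))

def stable_mod(x, pg_num):
    """Sketch of ceph_stable_mod; equals x % pg_num for powers of two."""
    mask = (1 << (pg_num - 1).bit_length()) - 1
    if (x & mask) < pg_num:
        return x & mask
    return x & (mask >> 1)

def pg_id_from_line(line, pg_num):
    """Build the 'pool.pg' id string (PG part in hex) from a log line."""
    hash_hex, pool_id = parse_read_log(line)
    return "%d.%x" % (pool_id, stable_mod(int(hash_hex, 16), pg_num))

line = ("read_log 107487'1 (0'0) modify "
        "f6b07b93/rbd_data.hash/head//12 by client.version")
print(pg_id_from_line(line, 2048))
```

The pg_num of the pool still has to be supplied by hand, as with the script above.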

Once we get the PG ID, we can proceed using ‘ceph_objectstore_tool’ to move the PG to a different location altogether. More on how to use ‘ceph_objectstore_tool’ in an upcoming journal.

Mapping Placement Groups and Pools

Understanding the mapping of Pools and Placement Groups can be very useful while troubleshooting Ceph problems.

A direct method is to dump information on the PGs via :

# ceph pg dump

This command should output something like the following:

pg_stat  objects  mip  degr  unf  bytes  log  disklog  state
5.7a     0        0    0     0    0      0    0        active+clean

The output will have more information, and I’ve omitted it for the sake of explanation.

The first field is the PG ID, which consists of two values separated by a single dot (.). The value on the left is the pool ID, and the value on the right is the actual PG number. This means a PG belongs to exactly one pool, i.e. PGs are never shared across pools. Note, however, that an OSD can be shared across multiple PGs.

To get the pools and associated numbers, use:

# ceph osd lspools

0 data,1 metadata,2 rbd,5 ssdtest,6 ec_pool,

So, the PG 5.7a belongs to the pool numbered ‘5’, i.e. ‘ssdtest’, and the PG number is ‘7a’.

The output of ‘ceph pg dump’ also shows other important information, such as the acting OSD set, the primary OSD, the last time the PG was reported, the state of the PG, and the times at which the last normal scrub and deep-scrub were run.
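The lookup from a PG ID to its pool can be scripted as well. Below is a minimal sketch with hypothetical helper names. It assumes the PG part of the ID is hexadecimal and that the pool list has the comma-separated form printed by ‘ceph osd lspools’ above.

```python
def split_pg_id(pg_id):
    """Split '5.7a' into (pool_id, pg_number); the PG part is hexadecimal."""
    pool_part, pg_part = pg_id.split(".", 1)
    return int(pool_part), int(pg_part, 16)

def pool_name_for(pg_id, lspools_text):
    """Look up the pool name for a PG ID in `ceph osd lspools` output."""
    pool_id, _ = split_pg_id(pg_id)
    for entry in lspools_text.split(","):
        fields = entry.split()
        if len(fields) == 2 and int(fields[0]) == pool_id:
            return fields[1]
    return None  # no pool with that id in the listing

pools = "0 data,1 metadata,2 rbd,5 ssdtest,6 ec_pool,"
print(split_pg_id("5.7a"), pool_name_for("5.7a", pools))
```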

Resetting Calamari password

‘Calamari’ is the monitoring interface for a Ceph cluster.

The Calamari interface password can be reset/changed using the ‘calamari-ctl’ command.

# calamari-ctl change_password --password {password} {user-name}

calamari-ctl can also be used to add a user, as well as to disable, enable, and rename a user account. Running it with ‘--help’ should print all the available subcommands.

# calamari-ctl --help

Compacting a Ceph monitor store

The Ceph monitor store growing to a big size is a common occurrence in a busy Ceph cluster.

If ‘ceph -s’ takes considerable time to return information, one possibility is that the monitor database has grown large.

Other possible reasons include network lag between the client and the monitor, the monitor not responding properly due to system load, and firewall settings on the client or the monitor.

The best way to deal with a large monitor database is to compact the monitor store. The monitor store is a leveldb store which stores key/value pairs.

There are two ways to compact a levelDB store, either on the fly or at the monitor process startup.

To compact the store dynamically, use :

# ceph tell mon.[ID] compact

To compact the levelDB store every time the monitor process starts, add the following in /etc/ceph/ceph.conf under the [mon] section:

mon compact on start = true

The second option would compact the levelDB store each and every time the monitor process starts.

The monitor database is stored at /var/lib/ceph/mon/<hostname>/store.db/ as files with the extension ‘.sst’, which stands for ‘Sorted String Table’.
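To judge whether a compaction is worthwhile, it can help to check how much space the ‘.sst’ files take before and after. The following is a small sketch using only the Python standard library; `sst_usage` is a hypothetical helper, and the path in the example call is a placeholder to be replaced with the store.db directory of your monitor.

```python
import os

def sst_usage(store_dir):
    """Return (file_count, total_bytes) for the .sst files under store_dir."""
    count = 0
    total = 0
    for root, _dirs, files in os.walk(store_dir):
        for name in files:
            if name.endswith(".sst"):
                count += 1
                total += os.path.getsize(os.path.join(root, name))
    return count, total

# Placeholder path; a missing directory simply yields (0, 0).
count, total = sst_usage("/var/lib/ceph/mon/example/store.db")
print("%d .sst files, %.1f MiB" % (count, total / (1024.0 * 1024.0)))
```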

To read more on levelDB, please refer: