Sharding the Ceph RADOS Gateway bucket index

Sharding is the process of splitting data across multiple locations in order to increase parallelism and distribute load. It is a common feature in databases; read more on this at Wikipedia [2].

Ceph uses the concept of sharding to split the bucket index in the RADOS Gateway.

RGW or RADOS Gateway keeps an index for all the objects in its buckets for faster and easier lookup. For each RGW bucket created in a pool, the corresponding index is created in the XX.index pool.

For example, for each of the buckets created in the .rgw pool, the bucket index is created in the .rgw.buckets.index pool. For each bucket, the index is stored in a single RADOS object.

When the number of objects in a bucket increases, the size of its index object increases as well. Two problems arise from the increased index size.

  1. RADOS does not work well with large objects, since it was not designed for them. Operations such as recovery, scrubbing, etc. work on a single object at a time. If the object size increases, OSDs may start hitting timeouts because reading a large object can take a long time. This is one of the reasons that all RADOS client interfaces such as RBD, RGW, and CephFS use a standard 4MB object size.
  2. Since the index is stored in a single RADOS object, only a single operation can be performed on it at any given time. As the number of objects grows, so does the index stored in that RADOS object, and the number of operations on it is likely to grow as well. With a single index serving that many objects, the operations cannot run in parallel and have to wait in a queue, so the index can end up being a bottleneck.

In order to work around these problems, the bucket index is sharded into multiple parts. Each shard is kept on a separate RADOS object within the index pool.

Sharding is configured with the tunable bucket_index_max_shards. By default, this tunable is set to 0, which means that the bucket index is not sharded.

How to check if Sharding is set?

  1. List the buckets
    # radosgw-admin metadata bucket list
    [
     "my-new-bucket"
    ]
    
  2. Get information on the bucket in question
    
    # radosgw-admin metadata get bucket:my-new-bucket
    {
        "key": "bucket:my-new-bucket",
        "ver": {
            "tag": "_bGZAVUgayKVwGNgNvI0328G",
            "ver": 1
        },
        "mtime": 1458940225,
        "data": {
            "bucket": {
                "name": "my-new-bucket",
                "pool": ".rgw.buckets",
                "data_extra_pool": ".rgw.buckets.extra",
                "index_pool": ".rgw.buckets.index",
                "marker": "default.2670570.1",
                "bucket_id": "default.2670570.1"
            },
            "owner": "rgw_user",
            "creation_time": 1458940225,
            "linked": "true",
            "has_bucket_info": "false"
        }
    }
    
    
  3. Use the bucket ID to get more information, including the number of shards.
radosgw-admin metadata get bucket.instance:my-new-bucket:default.2670570.1
{
    "key": "bucket.instance:my-new-bucket:default.2670570.1",
    "ver": {
        "tag": "_xILkVKbfQD7reDFSOB4a5VU",
        "ver": 1
    },
    "mtime": 1458940225,
    "data": {
        "bucket_info": {
            "bucket": {
                "name": "my-new-bucket",
                "pool": ".rgw.buckets",
                "data_extra_pool": ".rgw.buckets.extra",
                "index_pool": ".rgw.buckets.index",
                "marker": "default.2670570.1",
                "bucket_id": "default.2670570.1"
            },
            "creation_time": 1458940225,
            "owner": "rgw_user",
            "flags": 0,
            "region": "default",
            "placement_rule": "default-placement",
            "has_instance_obj": "true",
            "quota": {
                "enabled": false,
                "max_size_kb": -1,
                "max_objects": -1
            },
            "num_shards": 0,
            "bi_shard_hash_type": 0
        },
        "attrs": [
            {
                "key": "user.rgw.acl",
                "val": "AgKPAAAAAgIaAAAACAAAAHJnd191c2VyCgAAAEZpcnN0IFVzZXIDA2kAAAABAQAAAAgAAAByZ3dfdXNlcg8AAAABAAAACAAAAHJnd191c2VyAwM6AAAAAgIEAAAAAAAAAAgAAAByZ3dfdXNlcgAAAAAAAAAAAgIEAAAADwAAAAoAAABGaXJzdCBVc2VyAAAAAAAAAAA="
            },
            {
                "key": "user.rgw.idtag",
                "val": ""
            },
            {
                "key": "user.rgw.manifest",
                "val": ""
            }
        ]
    }
}

Note that `num_shards` is set to 0, which means that sharding is not enabled.
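
Incidentally, the index objects can also be seen directly in the index pool. The following is a quick sketch from my setup; I'm assuming the `.dir.<bucket_id>` naming seen on this version, which may differ on others. An unsharded bucket shows a single index object, while a sharded bucket would show one object per shard, each with a numeric suffix.

# rados -p .rgw.buckets.index ls | grep default.2670570.1
.dir.default.2670570.1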

How to configure Sharding?

To configure sharding, we need to first dump the region info.

NOTE: By default, RGW has a region named default even if regions are not configured.

# radosgw-admin region get > /tmp/region.txt 

# cat /tmp/region.txt
{
    "name": "default",
    "api_name": "",
    "is_master": "true",
    "endpoints": [],
    "hostnames": [],
    "master_zone": "",
    "zones": [
        {
            "name": "default",
            "endpoints": [],
            "log_meta": "false",
            "log_data": "false",
            "bucket_index_max_shards": 0
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": []
        }
    ],
    "default_placement": "default-placement"
}

Edit the file /tmp/region.txt, change the value of `bucket_index_max_shards` to the desired shard count (we’re setting it to 8 in this example), and inject it back into the region.

# radosgw-admin region set < /tmp/region.txt
{
    "name": "default",
    "api_name": "",
    "is_master": "true",
    "endpoints": [],
    "hostnames": [],
    "master_zone": "",
    "zones": [
        {
            "name": "default",
            "endpoints": [],
            "log_meta": "false",
            "log_data": "false",
            "bucket_index_max_shards": 8
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": []
        }
    ],
    "default_placement": "default-placement"
}
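
Note that the new shard count only applies to buckets created after this change; existing buckets keep their current index layout. The follow-up steps below are a sketch: depending on the version you may also need to update the region map and restart the gateway (the service name varies between releases), and the bucket name, bucket ID, and output shown are illustrative.

# radosgw-admin regionmap update
# service ceph-radosgw restart
# radosgw-admin metadata get bucket.instance:<new-bucket>:<new-bucket-id> | grep num_shards
            "num_shards": 8,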

References:

  1. Red Hat Ceph Storage 1.3 RADOS Gateway documentation
  2. https://en.wikipedia.org/wiki/Shard_(database_architecture)

Ceph OSD heartbeats

Ceph OSD daemons need to ensure that the neighbouring OSDs are functioning properly so that the cluster remains in a healthy state.

For this, each Ceph OSD process (ceph-osd) sends a heartbeat signal to the neighbouring OSDs. By default, the heartbeat signal is sent every 6 seconds (the osd_heartbeat_interval tunable), which is of course configurable.

If an OSD doesn’t hear back from a peer within the value set for `osd_heartbeat_grace` (20 seconds by default), it reports that peer, the one that didn’t respond within the grace period, as down to the MONs. Once the non-responding OSD has been reported down three times, the MON acknowledges the reports and marks the OSD as down.

The monitor then updates the cluster map and distributes it to the participating nodes in the cluster.

[Figure: OSD heartbeat checks between peer OSDs, and down reports sent to the monitor]

When an OSD can’t reach another OSD for a heartbeat, it reports the following in the OSD logs:

osd.510 1497 heartbeat_check: no reply from osd.11 since back 2016-04-28 20:49:42.088802

In Ceph Jewel, the MONs require that at least two OSDs, located in different CRUSH subtrees (hosts, by default), report a specific OSD as down before it is actually marked down. This behaviour is controlled by the following tunables:

From ‘common/config_opts.h’:

[1] OPTION(mon_osd_min_down_reporters, OPT_INT, 2) // number of OSDs from different subtrees who need to report a down OSD for it to count

[2] OPTION(mon_osd_reporter_subtree_level, OPT_STR, "host") // in which level of parent bucket the reporters are counted

Image courtesy: Red Hat Ceph Storage 1.3.2 Configuration Guide
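
The current values of these heartbeat settings can be checked on a running OSD through its admin socket, and adjusted at runtime with injectargs. This is a sketch; the exact output formatting differs between versions, and the value 30 below is just an example.

# ceph daemon osd.0 config get osd_heartbeat_grace
{ "osd_heartbeat_grace": "20"}

# ceph tell osd.* injectargs '--osd_heartbeat_grace 30'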

Ceph Rados Block Device (RBD) and TRIM

I recently came across a scenario where the objects in a RADOS pool backing an RBD block device don’t get removed, even after the files created through the mount point are deleted.

I had an RBD image from an RHCS 1.3 cluster mapped to a RHEL 7.1 client machine, with an XFS filesystem created on it and mounted locally. I created a 5GB file, and I could see the objects being created in the rbd pool of the Ceph cluster.

1. RBD block device information

# rbd info rbd_img
rbd image 'rbd_img':
size 10240 MB in 2560 objects
order 22 (4096 kB objects)
block_name_prefix: rb.0.1fcbe.2ae8944a
format: 1

An XFS file system was created on this block device, and mounted at /test.

2. Write a file onto the RBD-mapped mount point. Used ‘dd’ to write a 5GB file.

# dd if=/dev/zero of=/test/rbd_image.img bs=1G count=5
 5+0 records in
 5+0 records out
 5368709120 bytes (5.4 GB) copied, 8.28731 s, 648 MB/s

3. Check the objects in the backend RBD pool

# rados -p rbd ls | wc -l
 < Total number of objects in the 'rbd' pool >
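
To count only the objects that back this particular image, the block_name_prefix reported by ‘rbd info’ in step 1 can be used as a filter. A sketch based on that prefix:

# rados -p rbd ls | grep rb.0.1fcbe.2ae8944a | wc -l
 < Number of objects backing the 'rbd_img' image >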

4. Delete the file from the mount point.

# rm /test/rbd_image.img -f
 # ls /test/
 --NO FILES LISTED--

5. List the objects in the RBD pool

# rados -p rbd ls | wc -l
< Total number of objects in the 'rbd' pool >

The number of objects doesn’t go down after the file deletion, as we might expect. It remains the same as in step 3.

Why does this happen? This is due to the fact that traditional file systems do not delete the underlying data blocks even if the files are deleted.

The process of writing a file onto a file system involves several steps: finding free blocks and allocating them for the new file, creating an entry in the directory entry structure of the parent folder, setting the name and inode number in that directory entry, setting pointers from the inode to the data blocks allocated for the file, and so on.

When data is written to the file, the data blocks are used to store the data. Additional information such as the file size, access times, etc. is updated in the inode after the writes.

Deleting a file involves removing the pointers from the inode to the corresponding data blocks, and clearing the name<->inode mapping from the directory entry structure of the parent folder. But the underlying data blocks are not cleared, since that would be an I/O-intensive operation. So the data remains on the disk even though the file is no longer present. A new write will make the allocator take these blocks for the new data, since they are marked as not-in-use.

This applies for the files created on an RBD device as well. The files created on top of the RBD-mapped mount point will ultimately be mapped to objects in the RADOS cluster. When the file is deleted from the mount point, since the entry is removed, it doesn’t show up in the mount point.

But, since the file system doesn’t clear off the underlying block device, the objects remain in the RADOS pool. They would normally be overwritten when a new file is created via the mount point.

There is a catch, though. Since the pool holds the objects even after the files are deleted, they keep consuming space in the RADOS pool (until they are overwritten). An administrator won’t be able to get a clear picture of the space usage if the pool is used heavily and multiple writes are coming in.

In order to clear up the underlying blocks, or actually remove them, we can rely on the TRIM support most modern disks offer. Read more about TRIM at Wikipedia.

TRIM is a set of commands supported by HDD/SSDs which allow the operating systems to let the disk know about the locations which are not currently being used. Upon receiving a confirmation from the file system layer, the disk can go ahead and mark the blocks as not used.

For the TRIM commands to work, both the disk and the file system have to support them. All modern file systems have built-in support for TRIM. Mount the file system with the ‘discard’ option, and you’re good to go.

# mount -o discard /dev/rbd{X}{Y} /{mount-point}
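
If mounting with ‘discard’ is not desirable (it adds a small overhead to every delete), an on-demand trim can be issued instead. This is a sketch: whether the discards actually reach the OSDs depends on the kernel’s krbd discard support in your environment, and the fstrim output format varies.

# fstrim -v /test
/test: <bytes> trimmed

# rados -p rbd ls | wc -l
 < The object count should drop once the discards are processed >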

OSD information in a scriptable format

In case you are trying to get the OSD ID and the corresponding node IP address mappings in a script-able format, use the following command:

# ceph osd find <OSD-num>

This will print the OSD number, the IP address, the host name, and the CRUSH root, as JSON.

# ceph osd find 2
{ "osd": 2,
"ip": "192.168.122.112:6800\/5311",
"crush_location": { "host": "node4", "root": "default"}}

The output is JSON, which is a key:value format. It can be parsed using awk/sed, or any programming language that supports JSON. All recent ones do.
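
For example, here is a one-liner using the Python interpreter that ships with the node. It is a sketch: it assumes Python 2 (as on RHEL 7) and the field names shown above.

# ceph osd find 2 | python -c 'import sys, json; d = json.load(sys.stdin); print d["osd"], d["crush_location"]["host"]'
2 node4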

For a listing of all the OSDs and related information, get the number of OSDs in the cluster, and then use that number to probe the OSDs.

# for i in $(seq 0 $(($(ceph osd stat | awk '{print $3}') - 1))); do ceph osd find $i; echo; done

This should output:

{ "osd": 0,
"ip": "192.168.122.244:6805\/2579",
"crush_location": { "host": "node3",
"root": "ssd"}}
{ "osd": 1,
"ip": "192.168.122.244:6800\/955",
"crush_location": { "host": "node3",
"root": "ssd"}}
{ "osd": 2,
"ip": "192.168.122.112:6800\/5311",
"crush_location": { "host": "node4",
"root": "default"}}
{ "osd": 3,
"ip": "192.168.122.112:6805\/5626",
"crush_location": { "host": "node4",
"root": "default"}}
{ "osd": 4,
"ip": "192.168.122.82:6800\/4194",
"crush_location": { "host": "node5",
"root": "default"}}
{ "osd": 5,
"ip": "192.168.122.82:6805\/4521",
"crush_location": { "host": "node5",
"root": "default"}}
{ "osd": 6,
"ip": "192.168.122.73:6801\/5614",
"crush_location": { "host": "node2",
"root": "ssd"}}
{ "osd": 7,
"ip": "192.168.122.73:6800\/1719",
"crush_location": { "host": "node2",
"root": "ssd"}}
{ "osd": 8,
"ip": "192.168.122.10:6805\/5842",
"crush_location": { "host": "node6",
"root": "default"}}
{ "osd": 9,
"ip": "192.168.122.10:6800\/4356",
"crush_location": { "host": "node6",
"root": "default"}}
{ "osd": 10,
"ip": "192.168.122.109:6800\/4517",
"crush_location": { "host": "node7",
"root": "default"}}
{ "osd": 11,
"ip": "192.168.122.109:6805\/4821",
"crush_location": { "host": "node7",
"root": "default"}}

Monitor maps, how to edit them?

The MON map is used by the monitors in a Ceph cluster to keep track of various attributes relevant to the working of the cluster.

Similar to the CRUSH map, a monitor map can be pulled out of the cluster, inspected, changed, and injected back to the monitors, manually. A frequent use-case is when the IP address of a monitor changes and the monitors cannot agree on a quorum.

Monitors use the monitor map (monmap) to get the details of other monitors. So just changing the monitor address in ‘ceph.conf‘ and pushing the configuration to all the nodes won’t help to propagate the changes.

In most cases, starting a monitor with a wrong monitor map would make it commit suicide, since it would find conflicting information about itself in the mon map due to the IP address change.

There are two methods to fix this problem. The first is to add enough new monitors, let them form a quorum, and remove the faulty monitors; this doesn’t need any explanation. The second and more crude way is to edit the monitor map directly, set the new IP address, and upload the monmap back to the monitors.

This article discusses the second method, i.e. how to edit the monmap and re-inject it. This can be done using the ‘monmaptool’ utility.

1. As the first step, log in to one of the monitors and get the monitor map:

# ceph mon getmap -o /tmp/monitor_map.bin

2. Inspect what the monitor map contains:

# monmaptool --print /tmp/monitor_map.bin

  • An example from my cluster:

# monmaptool --print monmap

monmaptool: monmap file monmap
epoch 1
fsid d978794d-5835-4ac3-8fe3-3855b18b9572
last_changed 0.000000
created 0.000000
0: 192.168.122.73:6789/0 mon.node2

3. Remove the node which has the wrong IP address, referring to it by its hostname

# monmaptool --rm node2 /tmp/monitor_map.bin

4. Inspect the monitor map to see if the monitor is indeed removed.

# monmaptool --print /tmp/monitor_map.bin

monmaptool: monmap file /tmp/monitor_map.bin
epoch 1
fsid d978794d-5835-4ac3-8fe3-3855b18b9572
last_changed 0.000000
created 0.000000

5. Add a new monitor (or the existing monitor with its new IP)

# monmaptool --add node3 192.168.122.76:6789 /tmp/monitor_map.bin

monmaptool: monmap file /tmp/monitor_map.bin
monmaptool: writing epoch 1 to /tmp/monitor_map.bin (1 monitors)

6. Check the monitor map to confirm the changes

# monmaptool --print /tmp/monitor_map.bin

monmaptool: monmap file /tmp/monitor_map.bin
epoch 1
fsid d978794d-5835-4ac3-8fe3-3855b18b9572
last_changed 0.000000
created 0.000000
0: 192.168.122.76:6789/0 mon.node3

7. Make sure the mon processes are not running on the monitor nodes

# service ceph stop mon

8. Upload the changes by injecting the edited monmap on each monitor node

# ceph-mon -i monitor_node --inject-monmap /tmp/monitor_map.bin

9. Start the mon process on each monitor

# service ceph start mon

10. Check if the cluster has taken in the changes.

# ceph -s
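
Additionally, the monitor map as seen by the running cluster can be dumped to confirm that the new address took effect. A quick sketch, using the hostnames from this example:

# ceph mon dump | grep node3
0: 192.168.122.76:6789/0 mon.node3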

 

Calculate a PG id from the hex values in Ceph OSD debug logs

Recently, I had an incident where the OSDs were crashing at the time of startup. Obviously, the next step was to enable debug logs for the OSDs and understand where they were crashing.

I enabled OSD debug logs dynamically by injecting the settings with:

# ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'

NOTE: This command can be run from the MON nodes.

Once this was done, the OSDs were started manually (since they were crashing and not running) and I watched for the next crash. It crashed with the following logs:

*read_log 107487'1 (0'0) modify f6b07b93/rbd_data.hash/head//12 by client.version date, time
*osd/PGLog.cc: In function 'static bool PGLog::read_log(ObjectStore*, coll_t, hobject_t, const pg_info_t&,
std::map<eversion_t, hobject_t>&, PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&,
std::set<std::basic_string<char> >*)' thread thread time date, time
*osd/PGLog.cc: 809: FAILED assert(last_e.version.version < e.version.version) ceph version version-details

1: (PGLog::read_log(ObjectStore*, coll_t, hobject_t, pg_info_t const&, std::map<eversion_t, hobject_t,
std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&,
pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&,
std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0x13ee) [0x6efcae]
2: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x315) [0x7692f5]
3: (OSD::load_pgs()+0xfff) [0x639f8f]
4: (OSD::init()+0x7bd) [0x63c10d]
5: (main()+0x2613) [0x5ecd43]
6: (__libc_start_main()+0xf5) [0x7fdc338f9af5]
7: /usr/bin/ceph-osd() [0x5f0f69]

The above is a log snippet at which the OSD process was crashing. The ceph-osd process was reading through the log areas of each PG in the OSD, and once it reached the problematic PG it crashed due to failing an assert condition.

Checking the source at ‘osd/PGLog.cc’, we see that this error is logged from ‘PGLog::read_log’.

void PGLog::read_log(ObjectStore *store, coll_t pg_coll,
coll_t log_coll,
ghobject_t log_oid,
const pg_info_t &info,
map<eversion_t, hobject_t> &divergent_priors,
IndexedLog &log,
pg_missing_t &missing,
ostringstream &oss,
set<string> *log_keys_debug)
{

if (!log.log.empty()) {
pg_log_entry_t last_e(log.log.back());
assert(last_e.version.version < e.version.version);    // <== The assert condition at which read_log is failing for a particular PG
assert(last_e.version.epoch <= e.version.epoch);

In order to make the OSD start, we needed to move this PG to a different location using the ‘ceph_objectstore_tool’ so that the ceph-osd can bypass the problematic PG. To understand the PG where it was crashing, we had to do some calculations based on the logs.

The ‘read_log’ line in the debug logs contains a hex value after the string “modify”; that is the object hash from which the PG number can be derived. The last number in that series is the pool id (12 in our case). The following python code will help to calculate the PG ID based on the arguments passed to it.

This program accepts three arguments, the first being the hex value we talked about, the second being the pg_num of the pool, and the third one being the pool id.


#!/usr/bin/env python
# Calculate the PG ID from the object hash
# vimal@redhat.com
import sys

def pg_id_calc(*args):
    if any([len(args) == 0, len(args) > 3, len(args) < 3]):
        help()
    else:
        hash_hex = args[0]
        pg_num = int(args[1])
        pool_id = int(args[2])
        hash_dec = int(hash_hex, 16)
        id_dec = hash_dec % pg_num
        id = hex(id_dec)
        pg_id = str(pool_id) + "." + str(id)[2:]
        print("\nThe PG ID is %s\n" % pg_id)

def help():
    print("Usage:")
    print("This script expects the hash (in Hex), pg_num of the pool, and the pool id as arguments, in order")
    print("\nExample:")
    print("./pg_id_calc.py 0x8e2fe5d7 2048 12")
    sys.exit()

if __name__ == '__main__':
    pg_id_calc(*sys.argv[1:])

An example of the program in action:

# python pg_id_calc.py 0xf6b07b93 2048 12
The PG ID is 12.393
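
As a sanity check, the calculated PG ID can be mapped to its OSD set to confirm that such a PG exists in the cluster. A sketch; the epoch and OSD numbers below are illustrative:

# ceph pg map 12.393
osdmap e1497 pg 12.393 (12.393) -> up [1,5,9] acting [1,5,9]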

Once we get the PG ID, we can proceed using ‘ceph_objectstore_tool’ to move the PG to a different location altogether. More on how to use ‘ceph_objectstore_tool’ in an upcoming journal.

Mapping Placement Groups and Pools

Understanding the mapping of Pools and Placement Groups can be very useful while troubleshooting Ceph problems.

A direct method is to dump information on the PGs via:

# ceph pg dump

This command should output something like the following:

pg_stat   objects   mip   degr   unf   bytes   log   disklog   state
5.7a      0         0     0      0     0       0     0         active+clean

The output will have more information, and I’ve omitted it for the sake of explanation.

The first field is the PG ID, which is two values separated by a single dot (.). The left-hand value is the POOL ID, while the right-hand value is the actual PG number. It means that a specific PG can only be present under a specific pool, i.e. no PGs can be shared across pools. But please note that OSDs can be shared across multiple PGs.

To get the pools and associated numbers, use:

# ceph osd lspools

0 data,1 metadata,2 rbd,5 ssdtest,6 ec_pool,

So, the PG 5.7a belongs to the pool numbered ‘5’, i.e. ‘ssdtest’, and the PG number is ‘7a’.

The output of ‘ceph pg dump’ also shows various other important information, such as the acting OSD set, the primary OSD, the last time the PG was reported, the state of the PG, the time at which a normal scrub as well as a deep-scrub was last run, etc.
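
To narrow things down to a single pool, the PG IDs can be filtered on the pool number, and an individual PG can be mapped to its OSD set. A sketch; the OSD numbers and epoch shown are illustrative:

# ceph pg dump pgs_brief | grep '^5\.'
# ceph pg map 5.7a
osdmap e634 pg 5.7a (5.7a) -> up [3,8] acting [3,8]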