Calculate a PG id from the hex values in Ceph OSD debug logs

Recently, I had an incident where the OSDs were crashing at startup. Obviously, the next step was to enable debug logs for the OSDs and understand where they were crashing.

The OSD debug logs were enabled dynamically by injecting the debug options:

# ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'

NOTE: This command can be run from the MON nodes.

Once this was done, the OSDs were started manually (since they were crashing and not running) and we watched for the next crash. It crashed with the following logs:

*read_log 107487'1 (0'0) modify f6b07b93/rbd_data.<hash>/head//12 by client.<version> <date, time>
*osd/PGLog.cc: In function 'static bool PGLog::read_log(ObjectStore*, coll_t, hobject_t, const pg_info_t&,
std::map<eversion_t, hobject_t>&, PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&,
std::set<std::basic_string<char> >*)' thread <thread> time <date, time>
*osd/PGLog.cc: 809: FAILED assert(last_e.version.version < e.version.version)

ceph version <version-details>
1: (PGLog::read_log(ObjectStore*, coll_t, hobject_t, pg_info_t const&, std::map<eversion_t, hobject_t,
std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&,
pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&,
std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0x13ee) [0x6efcae]
2: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x315) [0x7692f5]
3: (OSD::load_pgs()+0xfff) [0x639f8f]
4: (OSD::init()+0x7bd) [0x63c10d]
5: (main()+0x2613) [0x5ecd43]
6: (__libc_start_main()+0xf5) [0x7fdc338f9af5]
7: /usr/bin/ceph-osd() [0x5f0f69]

The above is the log snippet at the point where the OSD process was crashing. The ceph-osd process was reading through the log areas of each PG in the OSD, and once it reached the problematic PG it crashed because an assert condition failed.

Checking the source at ‘osd/PGLog.cc’, we see that this error is logged from ‘PGLog::read_log’.

void PGLog::read_log(ObjectStore *store, coll_t pg_coll,
                     coll_t log_coll,
                     ghobject_t log_oid,
                     const pg_info_t &info,
                     map<eversion_t, hobject_t> &divergent_priors,
                     IndexedLog &log,
                     pg_missing_t &missing,
                     ostringstream &oss,
                     set<string> *log_keys_debug)
{

  if (!log.log.empty()) {
    pg_log_entry_t last_e(log.log.back());
    assert(last_e.version.version < e.version.version);   // <== The assert condition at which read_log is failing for a particular PG
    assert(last_e.version.epoch <= e.version.epoch);

In order to make the OSD start, we needed to move this PG to a different location using the ‘ceph_objectstore_tool’ so that ceph-osd could bypass the problematic PG. To identify the PG on which it was crashing, we had to do some calculations based on the logs.

The ‘read_log’ line in the debug logs contains a hex value after the string “modify”; that is the hash of the object, from which the PG number can be derived. The last number in that series is the pool id (12 in our case). The following Python code calculates the PG id from the arguments passed to it.

This program accepts three arguments: the first is the hex value we talked about, the second is the pg_num of the pool, and the third is the pool id.


#!/usr/bin/env python
# Calculate the PG ID from the object hash
# vimal@redhat.com
import sys

def pg_id_calc(*args):
    if len(args) != 3:
        help()
    else:
        hash_hex = args[0]
        pg_num = int(args[1])
        pool_id = int(args[2])
        # The PG number is the object hash modulo the pool's pg_num
        hash_dec = int(hash_hex, 16)
        id_dec = hash_dec % pg_num
        pg_hex = hex(id_dec)
        # The PG ID is "<pool id>.<PG number in hex>"
        pg_id = str(pool_id) + "." + str(pg_hex)[2:]
        print("\nThe PG ID is %s\n" % pg_id)

def help():
    print("Usage:")
    print("This script expects the hash (in hex), the pg_num of the pool, and the pool id as arguments, in that order")
    print("\nExample:")
    print("./pg_id_calc.py 0x8e2fe5d7 2048 12")
    sys.exit()

if __name__ == '__main__':
    pg_id_calc(*sys.argv[1:])

An example of the program in action:

# python pg_id_calc.py 0xf6b07b93 2048 12
The PG ID is 12.393
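To make the arithmetic explicit, here is the same calculation done step by step in a Python shell, using the hash and pg_num from the example above (the script simply takes the hash modulo pg_num, as described earlier):

>>> hash_dec = int("0xf6b07b93", 16)   # the object hash from the read_log line
>>> hash_dec
4138761107
>>> pg_dec = hash_dec % 2048           # pg_num of the pool is 2048
>>> pg_dec
915
>>> hex(pg_dec)                        # 915 in hex is 0x393
'0x393'
>>> "12." + hex(pg_dec)[2:]            # prefix the pool id (12)
'12.393'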

Once we get the PG ID, we can proceed using ‘ceph_objectstore_tool’ to move the PG to a different location altogether. More on how to use ‘ceph_objectstore_tool’ in an upcoming journal.

Mapping Placement Groups and Pools

Understanding the mapping of Pools and Placement Groups can be very useful while troubleshooting Ceph problems.

A direct method is to dump information on the PGs via:

# ceph pg dump

This command should output something like the following:

pg_stat   objects   mip   degr   unf   bytes   log   disklog   state
5.7a      0         0     0      0     0       0     0         active+clean

The output will have more information, and I’ve omitted it for the sake of explanation.

The first field is the PG ID, which is two values separated by a single dot (.). The left-hand value is the pool ID, while the right-hand value is the actual PG number. This means that a specific PG can only exist under a specific pool, i.e., no PG can be shared across pools. But please note that OSDs can be shared across multiple PGs.

To get the pools and associated numbers, use:

# ceph osd lspools

0 data,1 metadata,2 rbd,5 ssdtest,6 ec_pool,

So, the PG 5.7a belongs to the pool numbered ‘5’, i.e., ‘ssdtest’, and the PG number is ‘7a’.
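To illustrate this mapping in code, the short snippet below splits a PG ID into its pool id and PG number, and resolves the pool name from the ‘ceph osd lspools’ output shown above (the lspools string is hard-coded here purely as an example):

# Split a PG ID into pool id and PG number, and map the pool id to its name.
# The lspools output is hard-coded below as an example; in practice it could
# be read from 'ceph osd lspools'.
lspools_out = "0 data,1 metadata,2 rbd,5 ssdtest,6 ec_pool,"

pools = {}
for entry in lspools_out.split(","):
    entry = entry.strip()
    if entry:
        num, name = entry.split(" ", 1)
        pools[int(num)] = name

pg_id = "5.7a"
pool_id, pg_number = pg_id.split(".")
print("PG %s -> pool %s (%s), PG number %s" % (pg_id, pool_id, pools[int(pool_id)], pg_number))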

The output of ‘ceph pg dump’ also shows various other important pieces of information, such as the acting OSD set, the primary OSD, the last time the PG was reported, the state of the PG, the times at which a normal scrub and a deep-scrub were last run, etc.
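If you prefer to read these fields programmatically instead of eyeballing the wide output, ‘ceph pg dump’ can also emit JSON. Below is a rough sketch; the field names (‘pg_stats’, ‘pgid’, ‘up’, ‘acting’, ‘state’) are what my test cluster returned and may differ between Ceph versions, so verify them against your own output.

# Rough sketch: pull the state and acting set of a single PG from
# 'ceph pg dump --format json'. The field names (pg_stats, pgid, up, acting,
# state) are assumptions based on my test cluster and may vary by version.
import json
import subprocess

out = subprocess.check_output(["ceph", "pg", "dump", "--format", "json"])
dump = json.loads(out)

for pg in dump.get("pg_stats", []):
    if pg.get("pgid") == "5.7a":
        print("state : %s" % pg.get("state"))
        print("up    : %s" % pg.get("up"))
        print("acting: %s" % pg.get("acting"))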

Resetting Calamari password

‘Calamari’ is the monitoring interface for a Ceph cluster.

The Calamari interface password can be reset/changed using the ‘calamari-ctl’ command.

# calamari-ctl change_password --password {password} {user-name}

calamari-ctl can also be used to add a user, as well as to disable, enable, and rename a user account. A ‘--help’ should print out all the available subcommands.

# calamari-ctl --help

Compacting a Ceph monitor store

The Ceph monitor store growing to a large size is a common occurrence in a busy Ceph cluster.

If a ‘ceph -s’ takes considerable time to return information, one possibility is that the monitor database has grown large.

Other reasons include network lag between the client and the monitor, the monitor not responding properly due to system load, firewall settings on the client or the monitor, etc.

The best way to deal with a large monitor database is to compact the monitor store. The monitor store is a leveldb store which stores key/value pairs.

There are two ways to compact a levelDB store, either on the fly or at the monitor process startup.

To compact the store dynamically, use:

# ceph tell mon.[ID] compact

To compact the levelDB store every time the monitor process starts, add the following in /etc/ceph/ceph.conf under the [mon] section:

mon compact on start = true

The second option would compact the levelDB store each and every time the monitor process starts.

The monitor database is stored at /var/lib/ceph/mon/<hostname>/store.db/ as files with the extension ‘.sst’, which stands for ‘Sorted String Table’.
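Before deciding to compact, it can be useful to see how large the store has actually grown. A small sketch that totals the size of the .sst files (assuming the default store path mentioned above):

# Rough sketch: total the size of the .sst files in the monitor store.
# Assumes the default store path mentioned above; adjust the glob if your
# monitor data directory lives elsewhere.
import glob
import os

sst_files = glob.glob("/var/lib/ceph/mon/*/store.db/*.sst")
total_bytes = sum(os.path.getsize(f) for f in sst_files)
print("%d .sst files, %.1f MB in total" % (len(sst_files), total_bytes / (1024.0 * 1024.0)))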

To read more on levelDB, please refer to:

https://en.wikipedia.org/wiki/LevelDB

http://leveldb.googlecode.com/svn/trunk/doc/impl.html

http://google-opensource.blogspot.in/2011/07/leveldb-fast-persistent-key-value-store.html

What is data scrubbing?

Data scrubbing is an error-checking and correction routine that ensures the data on a file system is in pristine condition and free of errors. Data integrity is a primary concern these days, given the humongous amounts of data being read and written daily.

A simple example of scrubbing is a file system check done with tools like ‘e2fsck’ on EXT2/3/4, or ‘xfs_repair’ on XFS. Ceph also includes a daily scrubbing as well as a weekly scrubbing, which we will talk about in detail in another article.

This feature is available in most hardware RAID controllers and backup tools, as well as in software that emulates RAID, such as MD-RAID.

Btrfs is one of the file systems that can schedule an internal scrub automatically, to ensure that corruption is detected and corrective measures are taken automatically. Since Btrfs can maintain multiple copies of data, once it finds an error in the primary copy, it can check for a good copy (if mirroring is used) and replace the bad one.

We will be looking more into scrubbing, especially how it is implemented in Ceph, and the various tunables, in an upcoming post.

Another method to dynamically change a Ceph configuration

In a previous post, we saw how to change a tunable dynamically on a running Ceph cluster. Unfortunately, such a change is not permanent, and will revert to the previous setting once the Ceph daemons are restarted.

Rather than using the command ‘ceph tell‘, I recently came upon another way to change configuration values.

We’ll try changing the tunable ‘mon_osd_full_ratio‘ once again.

1. Get the current setting

# ceph daemon osd.1 config get mon_osd_full_ratio
{ "mon_osd_full_ratio": "0.75"}

2. Change the configuration value using ‘ceph daemon’.

# ceph daemon osd.1 config set mon_osd_full_ratio 0.85
{ "success": "mon_osd_full_ratio = '0.85' "}

3. Check if the change has been introduced.

# ceph daemon osd.1 config get mon_osd_full_ratio
{ "mon_osd_full_ratio": "0.85"}

4. Restart the ‘ceph’ service

# service ceph restart

5. Check the status

# ceph daemon osd.1 config get mon_osd_full_ratio
{ "mon_osd_full_ratio": "0.75"}

NOTE: Changes introduced with ‘ceph tell’ as well as ‘ceph daemon’ are not persistent across process restarts.
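Since ‘ceph daemon ... config get/set’ returns JSON, the get/set/verify round-trip above can also be scripted. A minimal sketch, assuming it is run on the node hosting osd.1:

# Minimal sketch of the get/set/verify round-trip shown above, using the
# admin socket via 'ceph daemon'. Assumes it runs on the node hosting osd.1.
import json
import subprocess

def osd_config_get(osd, key):
    out = subprocess.check_output(["ceph", "daemon", osd, "config", "get", key])
    return json.loads(out)[key]

def osd_config_set(osd, key, value):
    subprocess.check_call(["ceph", "daemon", osd, "config", "set", key, value])

print("before: %s" % osd_config_get("osd.1", "mon_osd_full_ratio"))
osd_config_set("osd.1", "mon_osd_full_ratio", "0.85")
print("after : %s" % osd_config_get("osd.1", "mon_osd_full_ratio"))
# Remember: this change is lost as soon as the OSD process restarts.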

‘noout’ flag in Ceph

You may have seen the ‘noout‘ flag set in the output of ‘ceph -s‘. What does this actually mean?

This is a cluster-wide flag, which means that if an OSD goes down, it is not marked out of the cluster, and data rebalancing does not start in order to maintain the replica count. By default, the monitors mark an OSD out of the acting set if it is not reachable for 300 seconds, i.e., 5 minutes.

To know the default value set in your cluster, use:

# ceph daemon /var/run/ceph/ceph-mon.*.asok config show | grep mon_osd_report_timeout

When an OSD is marked out, another OSD takes its place in the acting set, and data replication to that OSD starts, depending on the replica count of each pool.

If this flag (noout) is set, the monitor will not mark the OSDs out of the acting set. The PGs will report a degraded state, but the OSD will still be in the acting set.

This can be helpful when we want to remove an OSD from the server, but don’t want the data objects to be replicated over to another OSD.

To set the ‘noout‘ flag, use:

# ceph osd set noout

Once everything you’ve planned has been done/finished, you can reset it back using:

# ceph osd unset noout
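To confirm whether the flag is actually in effect, the cluster flags can be read from ‘ceph osd dump’. A small sketch is below; the ‘flags’ field name is what the JSON output shows on my test cluster, so verify it against yours.

# Small sketch to check whether the 'noout' flag is currently set, by reading
# the cluster flags from 'ceph osd dump --format json'. The 'flags' field name
# is an assumption based on my test cluster; verify it on your own output.
import json
import subprocess

out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
flags = json.loads(out).get("flags", "")

if "noout" in flags.split(","):
    print("noout is set; down OSDs will not be marked out")
else:
    print("noout is not set")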

How to dynamically change a configuration value in a Ceph cluster?

It is possible to change a particular configuration setting in a Ceph cluster dynamically, and I think it is a very neat and useful feature.

Imagine the case where you want to change the replica count of a particular pool from 3 to 4. How would you change this without restarting the Ceph cluster itself? That is where the ‘ceph tell’ command comes in.

As we saw in the previous post, you can get the list of configuration settings using the administrator socket, from either a monitor or an OSD node.

To change a configuration setting, use:


# ceph tell mon.* injectargs '--{tunable value_to_be_set}'

For example, the timeout value after which an OSD is marked down and out can be changed with:


# ceph tell mon.* injectargs '--mon_osd_report_timeout 400'

By default, it is 300 seconds, i.e., 5 minutes.
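When the same tunable needs to be pushed to all the monitors as part of some automation, the injection can be wrapped in a few lines of Python. A minimal sketch, using the example tunable from above:

# Minimal sketch: inject a tunable into all monitors with 'ceph tell ... injectargs'.
# Remember that injected values are not persistent across daemon restarts.
import subprocess

def inject_mon_setting(tunable, value):
    arg = "--%s %s" % (tunable, value)
    subprocess.check_call(["ceph", "tell", "mon.*", "injectargs", arg])

inject_mon_setting("mon_osd_report_timeout", "400")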

How to fetch the entire list of tunables along with the values for a Ceph cluster node?

In many cases we would like to get the active configuration from a Ceph node, be it a monitor or an OSD node. A neat feature, I must say, is the ability to probe the administrative socket file to get a listing of all the active configuration settings, whether on an OSD node or a monitor node.

This comes in handy when we have changed a setting and want to confirm whether it has indeed changed.

An admin socket file exists on both the monitor and the OSD nodes. A monitor node will have a single admin socket file, while an OSD node will have an admin socket for each of the OSDs present on the node.

  • Listing of the admin socket on a monitor node

 # ls /var/run/ceph/ -l
 total 4
 srwxr-xr-x. 1 root root 0 May 13 05:13 ceph-mon.hp-m300-2.asok
 -rw-r--r--. 1 root root 7 May 13 05:13 mon.hp-m300-2.pid
 

  • Listing of the admin sockets on an OSD node

 # ls -l /var/run/ceph/
 total 20
 srwxr-xr-x. 1 root root 0 May  8 02:42 ceph-osd.0.asok
 srwxr-xr-x. 1 root root 0 May 26 11:18 ceph-osd.2.asok
 srwxr-xr-x. 1 root root 0 May 26 11:18 ceph-osd.3.asok
 srwxr-xr-x. 1 root root 0 May  8 02:42 ceph-osd.4.asok
 srwxr-xr-x. 1 root root 0 May 26 11:18 ceph-osd.5.asok
 -rw-r--r--. 1 root root 8 May  8 02:42 osd.0.pid
 -rw-r--r--. 1 root root 8 May 26 11:18 osd.2.pid
 -rw-r--r--. 1 root root 8 May 26 11:18 osd.3.pid
 -rw-r--r--. 1 root root 8 May  8 02:42 osd.4.pid
 -rw-r--r--. 1 root root 8 May 26 11:18 osd.5.pid
 

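Based on the listings above, here is a small sketch that discovers the admin sockets present on a node and classifies them as monitor or OSD sockets (assuming the default /var/run/ceph location):

# Small sketch: discover the admin sockets on this node and classify them.
# Assumes the default /var/run/ceph location shown in the listings above.
import glob
import os

for sock in sorted(glob.glob("/var/run/ceph/*.asok")):
    name = os.path.basename(sock)
    if name.startswith("ceph-mon."):
        print("monitor socket: %s" % sock)
    elif name.startswith("ceph-osd."):
        print("OSD socket    : %s" % sock)
    else:
        print("other socket  : %s" % sock)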
For example, consider that we have changed the ‘mon_osd_full_ratio’ value, and need to confirm that the cluster has picked up the change.

We can get a listing of the active configured settings and grep out the setting we are interested in.


# ceph daemon /var/run/ceph/ceph-mon.*.asok config show

The above command prints out a listing of all the active configurations and their current values. We can easily grep out ‘mon_osd_full_ratio’ from this list.


# ceph daemon /var/run/ceph/ceph-mon.*.asok config show | grep mon_osd_full_ratio

On my test cluster, this printed out ‘0.75’ which is the default setting. The cluster should print out ‘near full’ warnings once any OSD has reached 75% of its size.
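Instead of grepping, the ‘config show’ output can also be parsed as JSON, which is handy when checking several settings at once. A rough sketch, assuming a monitor admin socket path like the one listed earlier:

# Rough sketch: read the full 'config show' output from a monitor admin socket
# as JSON and pick out a few settings. The socket path below is an assumption;
# adjust it to the .asok file present on your monitor node.
import json
import subprocess

sock = "/var/run/ceph/ceph-mon.hp-m300-2.asok"
out = subprocess.check_output(["ceph", "daemon", sock, "config", "show"])
conf = json.loads(out)

for key in ("mon_osd_full_ratio", "mon_osd_report_timeout"):
    print("%s = %s" % (key, conf.get(key)))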

This can be checked by probing the OSD admin socket as well.

NOTE: In case you are probing a particular OSD, please make sure to use the OSD admin socket on the node on which that OSD resides. To locate the OSD and the node it is on, use:


# ceph osd tree

Example: Probing the OSD admin socket on its node for ‘mon_osd_full_ratio’, as we did on the monitor, should return the same value.


# ceph daemon /var/run/ceph/ceph-osd.5.asok config show | grep mon_osd_full_ratio

NOTE: Another command exists which should print the same configuration settings, but only for OSDs.


# ceph daemon osd.5 config show

A drawback worth mentioning: this has to be executed on the node on which the OSD is present. To find the OSD-to-node mapping, use ‘ceph osd tree’.

How to change the filling ratio for a Ceph OSD?

There could be many scenarios where you’d need to change the allowed space-usage percentage on a Ceph OSD. One such use case is when the OSD usage is about to hit the hard limit and the cluster is constantly sending you warnings.

For one reason or another, you may need to extend the threshold for some time. In such a case, you don’t need to change or add the setting in ceph.conf and push it across; rather, you can do it while the cluster is online, from the command line.

‘ceph tell’ is a very useful command in the sense that the administrator doesn’t need to stop/start the OSDs, MONs, etc. after a configuration change. In our case, we are looking to set ‘mon_osd_full_ratio’ to 98%. We can do that with:


# ceph tell mon.* injectargs "--mon_osd_full_ratio .98"

In an earlier post (https://goo.gl/xjXOoI) we saw how to get all the configurable options from a monitor. If I understand correctly, almost all the configuration values can be changed online by injecting them using ‘ceph tell’.