Calculate a PG id from the hex values in Ceph OSD debug logs

Recently, I had an incident where the OSDs were crashing at the time of startup. Obviously, the next step was to enable debug logs for the OSDs and understand where they were crashing.

Enabled the OSD debug logs dynamically by injecting the arguments with:

# ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'

NOTE: This command can be run from the MON nodes.

Once this was done, the OSDs were started manually (since they were crashing and not staying up) and we watched for the next crash. It crashed with the following logs:

*read_log 107487’1 (0’0) modify f6b07b93/rbd_data.hash/head//12 by client.version date, time
*osd/PGLog.cc: In function ‘static bool PGLog::read_log(ObjectStore*, coll_t, hobject_t, const pg_info_t&,
std::map<eversion_t, hobject_t>&, PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&,
std::set<std::basic_string<char> >*)’ thread thread time date, time
*osd/PGLog.cc: 809: FAILED assert(last_e.version.version < e.version.version)
ceph version version-details

1: (PGLog::read_log(ObjectStore*, coll_t, hobject_t, pg_info_t const&, std::map<eversion_t, hobject_t,
std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&,
pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&,
std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0x13ee) [0x6efcae]
2: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x315) [0x7692f5]
3: (OSD::load_pgs()+0xfff) [0x639f8f]
4: (OSD::init()+0x7bd) [0x63c10d]
5: (main()+0x2613) [0x5ecd43]
6: (__libc_start_main()+0xf5) [0x7fdc338f9af5]
7: /usr/bin/ceph-osd() [0x5f0f69]

The above log snippet shows where the OSD process was crashing. The ceph-osd process was reading through the log areas of each PG in the OSD, and once it reached the problematic PG it crashed due to a failed assert condition.

Checking the source at ‘osd/PGLog.cc’, we see that this error is logged from ‘PGLog::read_log’.

void PGLog::read_log(ObjectStore *store, coll_t pg_coll,
coll_t log_coll,
ghobject_t log_oid,
const pg_info_t &info,
map<eversion_t, hobject_t> &divergent_priors,
IndexedLog &log,
pg_missing_t &missing,
ostringstream &oss,
set<string> *log_keys_debug)
{

if (!log.log.empty()) {
pg_log_entry_t last_e(log.log.back());
assert(last_e.version.version < e.version.version);    <== The assert condition at which read_log is failing for a particular PG
assert(last_e.version.epoch <= e.version.epoch);
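To make the failing condition concrete, here is a minimal Python sketch (mine, not Ceph code) of the invariant those two asserts enforce: reading the PG log in order, each entry’s version must be strictly greater than the previous entry’s, and epochs must never decrease.

```python
# Minimal sketch (not Ceph code) of the invariant enforced in read_log:
# versions strictly increase and epochs never decrease, in log order.
def log_is_consistent(entries):
    """entries: list of (epoch, version) tuples in log order."""
    for last_e, e in zip(entries, entries[1:]):
        if not last_e[1] < e[1]:   # assert(last_e.version.version < e.version.version)
            return False
        if not last_e[0] <= e[0]:  # assert(last_e.version.epoch <= e.version.epoch)
            return False
    return True

# A healthy log passes; a duplicated or out-of-order entry is
# exactly the situation that makes the OSD hit the failed assert.
print(log_is_consistent([(107487, 1), (107487, 2), (107490, 3)]))  # True
print(log_is_consistent([(107487, 2), (107487, 2)]))               # False
```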

In order to make the OSD start, we needed to move this PG to a different location using ‘ceph_objectstore_tool’ so that ceph-osd could bypass the problematic PG. To find out which PG it was crashing on, we had to do some calculations based on the logs.

The ‘read_log’ line in the debug logs contains a hex value after the string “modify”; that is the hash of the object being modified, from which the PG number is derived. The last number in that series is the pool id (12 in our case). The following Python code will calculate the PG id based on the arguments passed to it.

This program accepts three arguments, the first being the hex value we talked about, the second being the pg_num of the pool, and the third one being the pool id.


#!/usr/bin/env python
# Calculate the PG ID from the object hash
# vimal@redhat.com
import sys

def pg_id_calc(*args):
    if len(args) != 3:
        help()
    else:
        hash_hex = args[0]
        pg_num = int(args[1])
        pool_id = int(args[2])
        # The PG number is the object hash modulo the pool's pg_num
        hash_dec = int(hash_hex, 16)
        id_dec = hash_dec % pg_num
        pg_hex = hex(id_dec)
        # A PG ID has the form "<pool id>.<pg number in hex>"
        pg_id = str(pool_id) + "." + str(pg_hex)[2:]
        print("\nThe PG ID is %s\n" % pg_id)

def help():
    print("Usage:")
    print("This script expects the hash (in Hex), pg_num of the pool, and the pool id as arguments, in order")
    print("\nExample:")
    print("./pg_id_calc.py 0x8e2fe5d7 2048 12")
    sys.exit()

if __name__ == '__main__':
    pg_id_calc(*sys.argv[1:])

An example of the program in action:

# python pg_id_calc.py 0xf6b07b93 2048 12
The PG ID is 12.393
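Rather than reading the hash and pool id off the log line by eye, the two values can also be extracted programmatically. A small sketch, assuming the ‘hash/object/head//pool’ layout seen in the log above (the function name and regex are mine, not Ceph’s):

```python
import re

def parse_read_log_line(line, pg_num):
    """Extract the object hash and pool id from a read_log debug line
    and compute the PG ID, assuming the 'hash/name/head//pool' layout."""
    m = re.search(r"modify\s+([0-9a-f]+)/\S+/head//(\d+)", line)
    if m is None:
        return None
    hash_hex, pool_id = m.group(1), int(m.group(2))
    # Same calculation as the script above: hash modulo pg_num
    pg_number = int(hash_hex, 16) % pg_num
    return "%d.%x" % (pool_id, pg_number)

line = "read_log 107487'1 (0'0) modify f6b07b93/rbd_data.hash/head//12 by client.version"
print(parse_read_log_line(line, 2048))  # 12.393
```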

Once we get the PG ID, we can proceed using ‘ceph_objectstore_tool’ to move the PG to a different location altogether. More on how to use ‘ceph_objectstore_tool’ in an upcoming journal.

Mapping Placement Groups and Pools

Understanding the mapping of Pools and Placement Groups can be very useful while troubleshooting Ceph problems.

A direct method is to dump information on the PGs via:

# ceph pg dump

This command should output something like the following:

pg_stat  objects  mip  degr  unf  bytes  log  disklog  state
5.7a     0        0    0     0    0      0    0        active+clean

The output will have more information, and I’ve omitted it for the sake of explanation.

The first field is the PG ID, which is two values separated by a single dot (.). The value on the left is the pool ID, while the value on the right is the actual PG number. This means a specific PG can only exist within a specific pool, i.e., no PG is shared across pools. Note, however, that OSDs can be shared across multiple PGs.

To get the pools and associated numbers, use:

# ceph osd lspools

0 data,1 metadata,2 rbd,5 ssdtest,6 ec_pool,

So, the PG 5.7a belongs to the pool numbered ‘5’, ie.. ‘ssdtest’, and the PG number is ‘7a’.
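That lookup is easy to script as well. A small sketch that parses the ‘ceph osd lspools’ output shown above and splits a PG ID into its pool name and PG number (the helper names are mine):

```python
def pool_map(lspools_output):
    """Parse 'ceph osd lspools' output ('id name,id name,...') into a dict."""
    pools = {}
    for entry in lspools_output.strip().strip(",").split(","):
        pool_id, name = entry.strip().split(" ", 1)
        pools[int(pool_id)] = name
    return pools

def describe_pg(pg_id, pools):
    """Split a PG ID like '5.7a' into (pool name, PG number)."""
    pool_id, pg_number = pg_id.split(".")
    return pools[int(pool_id)], pg_number

pools = pool_map("0 data,1 metadata,2 rbd,5 ssdtest,6 ec_pool,")
print(describe_pg("5.7a", pools))  # ('ssdtest', '7a')
```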

The output of ‘ceph pg dump’ also shows other important information, such as the acting OSD set, the primary OSD, the last time the PG was reported, the state of the PG, and the times at which a normal scrub as well as a deep scrub were last run.

How to identify the journal disk for a Ceph OSD?

In many cases, one would like to identify the journal disk a particular OSD is using. There are two methods to find this out:

a) This is the most direct method, and should give you details on the OSD disks and the corresponding journal disks.


# ceph-disk list

This should output something like:


# ceph-disk list
 /dev/sda :
  /dev/sda1 other, xfs, mounted on /boot
  /dev/sda2 other, LVM2_member
 /dev/sr0 other, unknown
 /dev/vda :
  /dev/vda1 ceph data, active, cluster ceph, osd.0, journal /dev/vda2
  /dev/vda2 ceph journal, for /dev/vda1
 /dev/vdb :
  /dev/vdb1 ceph data, active, cluster ceph, osd.1, journal /dev/vdc1
 /dev/vdc :
  /dev/vdc1 ceph journal, for /dev/vdb1
 

b) The second method is cruder, and involves listing the OSD mount point on the file system.


# ls -l /var/lib/ceph/osd/ceph-0/

total 52
 -rw-r--r--.  1 root root  191 Aug  3 18:02 activate.monmap
 -rw-r--r--.  1 root root    3 Aug  3 18:02 active
 -rw-r--r--.  1 root root   37 Aug  3 18:02 ceph_fsid
 drwxr-xr-x. 70 root root 4096 Aug  4 00:38 current
 -rw-r--r--.  1 root root   37 Aug  3 18:02 fsid
 lrwxrwxrwx.  1 root root   58 Aug  3 18:02 journal -> /dev/disk/by-partuuid/d9ebc4bd-7b5e-4e12-b909-0c72c4f58ee0
 -rw-r--r--.  1 root root   37 Aug  3 18:02 journal_uuid
 -rw-------.  1 root root   56 Aug  3 18:02 keyring
 -rw-r--r--.  1 root root   21 Aug  3 18:02 magic
 -rw-r--r--.  1 root root    6 Aug  3 18:02 ready
 -rw-r--r--.  1 root root    4 Aug  3 18:02 store_version
 -rw-r--r--.  1 root root   42 Aug  3 18:02 superblock
 -rw-r--r--.  1 root root    0 Aug  5 13:09 sysvinit
 -rw-r--r--.  1 root root    2 Aug  3 18:02 whoami

# ls -l /dev/disk/by-partuuid/d9ebc4bd-7b5e-4e12-b909-0c72c4f58ee0
 lrwxrwxrwx. 1 root root 10 Aug  5 13:08 /dev/disk/by-partuuid/d9ebc4bd-7b5e-4e12-b909-0c72c4f58ee0 -> ../../vda2

As you can see, the file ‘journal’ is a symlink to the journal disk. The first method is much easier, but it’s always better to know how things are laid out underneath.
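The second method can also be scripted: resolving the ‘journal’ symlink under the OSD mount point gives the device directly. A minimal sketch (the function name is mine, and it assumes the default /var/lib/ceph/osd layout):

```python
import os

def osd_journal_device(osd_id, base="/var/lib/ceph/osd"):
    """Resolve an OSD's 'journal' symlink to the underlying device."""
    link = os.path.join(base, "ceph-%d" % osd_id, "journal")
    # realpath follows the by-partuuid symlink down to e.g. /dev/vda2
    return os.path.realpath(link)

# e.g. osd_journal_device(0) on the node above would resolve to /dev/vda2
```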