Data Structures – Arrays

Arrays are a commonly used data structure, and one of the first a data structures student looks into.

An array is created as a collection of contiguous memory locations. Each location stores data of a specific type, determined by the array’s type.

Advantages:

  • Arrays are straightforward to create, since the size and the type are specified at creation time.
  • Arrays have constant access/lookup time, since a lookup simply reads the memory location at an offset from the base (first) element. Hence the complexity is O(1).
  • Arrays are contiguous in memory, i.e., a 10-cell array of single-byte elements could start at address 0x0001 and end at 0x000A.

Disadvantages:

  • The size of an array has to be defined at the time of its creation. This makes the size static; the array cannot be resized later.
  • An array can only accommodate a single data type, which has to be defined at creation time. Hence, if an array is set to store integers, each of its memory locations can only store an integer.
  • Since the size of an array is fixed at creation, allocating an array may fail depending on its size and the available memory on the machine.
  • Inserting an element into an array can be expensive depending on the location. To insert an element at position ‘n’, the element already at ‘n’ has to be moved to ‘n + 1’, the element at ‘n + 1’ to ‘n + 2’, and so on. Hence, if the position being written to is at the start of the array, the operation is expensive; if it is at the end, it is not.

What is the lookup time in an array?

The elements of an array are contiguous in memory. The address of an element is looked up as an `offset` from the primary or base element. Hence, the lookup of any element in the array takes constant time, denoted O(1).
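As a quick illustration, the ‘array’ module that ships with Python (covered later in this post) exposes the base address and element size of its underlying buffer, so the offset arithmetic can be seen directly. This is only a sketch; the actual addresses will differ on every run.

import array

# In a contiguous array, the address of element i is simply
# base_address + i * element_size, which is why lookup is O(1).
arr = array.array('i', [10, 20, 30, 40, 50])

base_address, length = arr.buffer_info()   # address of the first element, number of elements
element_size = arr.itemsize                # size of one element, in bytes

for index in range(length):
    print("element %d is at address 0x%x" % (index, base_address + index * element_size))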

Arrays in Python

Python doesn’t have a direct implementation of an array. The built-in type that most closely resembles an array in Python is the `list`.

The major differences between an array and a list are:

  • The size of a list is not static. It can be grown or shrunk using the `append` and `remove` methods; arrays are static in size.
  • Lists can accommodate multiple data types, while arrays cannot.

In [1]: mylist = []

In [2]: type(mylist)
Out[2]: list

In [3]: mylist.append("string")

In [4]: mylist.append(1000)

In [5]: mylist
Out[5]: ['string', 1000]


Time complexity of Arrays

  • Indexing    – O(1)
  • Insertion/Deletion at the beginning – O(n) (every existing element has to be shifted by one position)
  • Insertion at the end – O(1) (if the array has space left at the end)
  • Deletion at the end – O(1) (since no elements have to be moved or reordered)
  • Insertion in the middle – O(n) (since the elements to the right have to be moved to make room; see the sketch below)
  • Deletion in the middle – O(n) (since the element is removed and the ones to its right are moved to the left)
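To make the shifting concrete, here is a rough sketch in Python (a list is used purely as a stand-in for a fixed-size array, and insert_at is a hypothetical helper, not a standard function):

def insert_at(arr, position, value):
    # Grow by one slot, then shift every element from 'position' onwards
    # one step to the right before writing the new value in its place.
    arr.append(arr[-1])                      # assumes a non-empty list
    for i in range(len(arr) - 2, position, -1):
        arr[i] = arr[i - 1]
    arr[position] = value

numbers = [1, 2, 3, 5, 6]
insert_at(numbers, 3, 4)
print(numbers)                               # [1, 2, 3, 4, 5, 6]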

The ‘array’ module

Python comes with a module named ‘array’ which emulates the behavior of arrays.

In [24]: import array

In [25]: myarray = array.array('i', [1,2,3,4])

In [26]: myarray
Out[26]: array('i', [1, 2, 3, 4])
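Unlike a list, an ‘array’ enforces the type declared at creation; trying to append a value of a different type raises a TypeError. A quick sketch:

import array

myarray = array.array('i', [1, 2, 3, 4])
myarray.append(5)             # fine, 5 is an integer

try:
    myarray.append("six")     # not an integer, so the array rejects it
except TypeError as err:
    print("TypeError: %s" % err)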

Code complexity – The Big O notation [O(n)]

Efficiency, or complexity, is a measure of how well you are using your resources to get your code to run.

Efficiency can be calculated on the basis of how much time your code takes to run/execute.

Understanding the efficiency of your code helps you reduce its complexity, thus improving its runtime further. Getting the same job done in less time and with fewer system resources is always good.

Once you find the efficiency of your program, you can start to find ways for:

  • Reducing the complexity (or increasing the efficiency), which reduces the program’s run time and frees up computer resources at a proportional rate.
  • Maintaining a constant or reduced run time for a growing data set, which helps your program fare well when the input size grows.

In Computer Science, the `Big O` notation is used to indicate the efficiency or complexity of a program. It is written as O(f(n)), where ‘f(n)’ is a mathematical function of ‘n’, the size of the input.

Some examples:

O(n)
O(n³)
O(n log(n))
O(√n)
O(1)

Efficiency can be measured for the best, average, and worst cases. For example, consider finding a specific letter in a jumbled list of all 26 letters of the alphabet.

  • The worst case is your program running through all 26 iterations and finding the letter as the very last value.
  • The best case is your program finding it in the first iteration itself.
  • The average case is your program finding the letter in around 13 iterations, i.e., half of the worst case. The sketch after this list walks through all three cases.
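Here is a minimal sketch of that search, counting how many comparisons a simple linear scan needs in each case (the linear_search function below is just an illustration, not a library routine):

import string

def linear_search(items, target):
    """Return the number of comparisons needed to find target, or None."""
    for count, item in enumerate(items, start=1):
        if item == target:
            return count
    return None

letters = list(string.ascii_lowercase)   # pretend this list is jumbled up

print(linear_search(letters, 'a'))   # best case: found on the 1st comparison
print(linear_search(letters, 'm'))   # average-ish case: 13 comparisons
print(linear_search(letters, 'z'))   # worst case: all 26 comparisons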

The study of data structures and algorithms aims to understand program complexity, and to reduce it as much as possible.

Algorithms are a series of well-defined steps you follow to solve a problem, whereas data structures are specific ways of laying out your data. Applying well-known algorithms in a program can help reduce its run time.

More on Time complexity and the Big O notation can be read at:

https://en.wikipedia.org/wiki/Time_complexity
https://en.wikipedia.org/wiki/Big_O_notation

`ceph-check` – A Ceph installation checker

Many users want to know whether a Ceph cluster installation follows the suggested guidelines.

Technologies like RAID are better avoided under Ceph, since they add an extra layer of redundancy that Ceph already takes care of.

I’ve started writing a tool which can be run from the Admin node, and it aims to check various such points.

The code can be seen at https://github.com/arvimal/ceph_check

The work is slow, really slow, due to my daily work, procrastination, and what not, even though I intend to finish this fast.

How to get a Ceph MON/OSD map at a specific epoch?

To get a MON map or an OSD map of a specific epoch, use:

# ceph osd getmap <epoch-value>
# ceph mon getmap <epoch-value>

The map can be saved to a file as follows:

# ceph osd getmap <epoch-value> -o /tmp/ceph_osd_getmap.bin

This would be in a binary format, and hence will need to be dumped to a human-readable form.

# osdmaptool --print /tmp/ceph_osd_getmap.bin

This will print the OSD map from that epoch, similar to the output of ‘ceph osd dump’.

Where this command shines is when you can fetch maps from previous epochs, and pull information on specific placement groups in those epochs.

For example, I had all the OSDs on one of my nodes go down some time back (in a previous epoch). The ability to query a previous epoch gives the administrator the power to understand exactly how the cluster looked at that point in time.

List RBD images, snapshots, and clones in Ceph pools

This is a crude bash one-liner I did to get the details of all the RBD images, as well as the information on snapshots and clones created from them.

# for pool in `rados lspools`;
    do echo "POOL :" $pool;
       rbd ls -l $pool;
       echo "-----";
    done

This will print an output similar to the following:

POOL : rbd
NAME             SIZE   PARENT  FMT PROT LOCK
test_img        10240M            1
test_img2        1024M            2
test_img2@snap2  1024M            2  yes
-----
POOL : .rgw.root
-----
POOL : .rgw.control
-----
POOL : .rgw
-----
POOL : .rgw.gc
-----
POOL : .users.uid
-----
POOL : .users
-----
POOL : .users.swift
-----
POOL : .users.email
-----
POOL : .rgw.buckets.index
-----
POOL : images
NAME     SIZE   PARENT               FMT PROT LOCK
clone1  1024M   rbd/test_img2@snap2    2
-----

range() and enumerate()

The usual way to iterate over a range of numbers or a list in python is to use range().

Example 0:

colors = ["yellow", "red", "blue", "white", "black"]

for i in range(len(colors)):
    print(i, colors[i])

This should output:

(0, 'yellow')
(1, 'red')
(2, 'blue')
(3, 'white')
(4, 'black')

The output shows tuples because, in Python 2, print is a statement and `(i, colors[i])` is a tuple (Python 3’s print() function would print the two values separated by a space). If we want to print this in a more presentable way, we’ll need to find the index at which each value sits, and print that as well. Let’s re-write the code a bit to achieve the desired output:

colors = ["yellow", "red", "blue", "white", "black"]

for i in range(len(colors)):
    color = colors[i]
    print("%d: %s" % (i, color))

This should print:

0: yellow
1: red
2: blue
3: white
4: black

We can see that the above output starts with ‘0’ since python starts counting from ‘0’. To change that to ‘1’, we’ll need to tweak the print() statement.

colors = ["yellow", "red", "blue", "white", "black"]

for i in range(len(colors)):
    color = colors[i]
    print("%d: %s" % (i + 1, color))

This should print:

1: yellow
2: red
3: blue
4: white
5: black

Even though the above code snippet isn’t that complex, a much better way exists to do this. This is where the builtin function enumerate() comes in.

enumerate() yields an (index, value) tuple for each item when passed an object which supports iteration, for example a list. It also accepts a second argument named ‘start‘, which defaults to 0 and controls where the numbering begins. We’ll check what ‘start‘ does towards the end of this article.

colors = ["yellow", "red", "blue", "white", "black"]
print(list(enumerate(colors)))

This returns a list of tuples.

[(0, 'yellow'), (1, 'red'), (2, 'blue'), (3, 'white'), (4, 'black')]

To get to what we desire, modify it as:

for i, color in enumerate(colors):
    print('%d: %s' % (i, color))

This outputs:

0: yellow
1: red
2: blue
3: white
4: black

Remember we mentioned that enumerate() takes a second argument named ‘start‘, which defaults to ‘0’? Let’s check how that helps here.

The above output starts with ‘0’. ‘start’ can help to change that.

for i, color in enumerate(colors, start=1):
    print('%d: %s' % (i, color))

This should change the output as:

1: yellow
2: red
3: blue
4: white
5: black

Ceph and unfound objects

In certain cases, a Ceph cluster may move away from a HEALTHY state due to “unfound” objects.

A “ceph -s” should show if you have any unfound objects. So, what are unfound objects? How does an object become “unfound”? This article tries to explain why/how “unfound” objects come into existence.

Let’s look into the life cycle of a write to a pool.

  • The client contacts a Ceph monitor and fetches the cluster map, which includes:
    • MON map
    • OSD map
    • PG map
    • CRUSH map
    • MDS map

Once the client has the maps, the Ceph client-side algorithm breaks the data being written into objects (the object size depends on the client-side configuration). Clients such as RBD and RGW use a 4MB object size, but RADOS itself doesn’t actually have such a limitation.

Each pool has a set of Placement Groups (PGs) assigned to it at the time of creation, and the client always writes to a pool. Since the client has the maps describing the entire cluster, it knows the placement groups within the pool it is writing to, and the OSDs assigned to each placement group. The client talks to the OSDs directly, without going through any intermediary such as a monitor.

The PG map has the ACTING and UP OSD sets for each PG. To see the ACTING and UP sets for the PGs, as well as a plethora of other information, use:

# ceph pg dump

The ACTING set is the set of OSDs that currently stores the replicas for that particular PG. The UP set is the set of OSDs that are currently up and running. Usually, the ACTING set and the UP set are the same. When an OSD in the ACTING set is not reachable, the other OSDs wait for 5 minutes (which is configurable) for it to come back online (this is checked with a heartbeat).

The unreachable OSD is removed from the UP set. If it doesn’t come back online within the configured period, it is marked out of the ACTING set as well as the UP set. When it comes back, it is added back to the ACTING/UP sets, and a peering happens during which the data is synced back.

Let’s discuss a scenario where an “unfound” object can come into existence. Imagine a pool with a two-replica configuration. A write that goes into the pool is split into objects and stored on the OSDs in the ACTING set of a PG.

  • One OSD in the ACTING set goes down.
  • The write is done on the second OSD which is UP and ACTING.
  • The first OSD, which went down, comes back up.
  • The peering process starts between the first OSD (that came back) and the second OSD (that serviced the write).
    • Peering refers to the process of arriving at an understanding of the object states between the OSDs in an ACTING set, and syncing up the metadata/data between them.
  • Both OSDs reach an understanding on which objects need to be synced.
  • The second OSD, which had the objects ready to be synced, goes down before the sync process starts or while it is midway.

In this situation, the first OSD knows about the objects that were written to the second OSD, but cannot probe it. The first OSD will try to probe other possible locations for copies, provided there are more replicas. If it is able to find another location, the data will be synced up.

But if there are no other copies, and the OSD with the only copy is not coming up anytime soon (perhaps due to a disk crash, file system corruption, etc.), the only way out is to either mark the object as “lost” or revert it to a previous version. Reverting to a previous version is not possible for a new object, and in such cases the only options are to mark it as “lost” or to restore it from a backup.

1. For a new object without a previous version:

# ceph pg {pg.num} mark_unfound_lost delete

2. For an object which is likely to have a previous version:

# ceph pg {pg.num} mark_unfound_lost revert

NOTE: The upstream Ceph documentation has an excellent write-up about “unfound” objects.

I suggest reading that documentation prior to taking any action when you see “unfound” objects in your cluster.

Ceph Rados Block Device (RBD) and TRIM

I recently came across a scenario where the objects in a RADOS pool backing an RBD block device don’t get removed, even after the files created through the mount point are removed.

I had an RBD image from an RHCS 1.3 cluster mapped to a RHEL 7.1 client machine, with an XFS filesystem created on it and mounted locally. I created a 5GB file, and could see the objects being created in the rbd pool of the Ceph cluster.

1. RBD block device information

# rbd info rbd_img
rbd image 'rbd_img':
size 10240 MB in 2560 objects
order 22 (4096 kB objects)
block_name_prefix: rb.0.1fcbe.2ae8944a
format: 1

An XFS file system was created on this block device, and mounted at /test.

2. Write a file onto the RBD-mapped mount point. I used ‘dd’ to write a 5GB file.

# dd if=/dev/zero of=/test/rbd_image.img bs=1G count=5
 5+0 records in
 5+0 records out
 5368709120 bytes (5.4 GB) copied, 8.28731 s, 648 MB/s

3. Check the objects in the backend RBD pool

# rados -p rbd ls | wc -l
 < Total number of objects in the 'rbd' pool >

4. Delete the file from the mount point.

# rm /test/rbd_image.img -f
 # ls /test/
 --NO FILES LISTED--

5. List the objects in the RBD pool

# rados -p rbd ls | wc -l
< Total number of objects in the 'rbd' pool >

The number of objects doesn’t go down after the file deletion, as we would expect. It remains the same as in step 3.

Why does this happen? This is due to the fact that traditional file systems do not delete the underlying data blocks even if the files are deleted.

The process of writing a file onto a file system involves several steps like finding free blocks and allocating them for the new file, creating an entry in the directory entry structure of the parent folder, setting the name and inode number in the directory entry structure, setting pointers from the inode to the data blocks allocated for the file etc..

When data is written to the file, the data blocks are used to store the data. Additional information such as the file size, access times etc.. are updated in the inode after the writes.

Deleting a file involves removing the pointers from the inode to the corresponding data blocks, and also clearing the name<->inode mapping from the directory entry structure of the parent folder. But, the underlying data blocks are not cleared off, since that is a high I/O intensive operation. So, the data remains on the disk, even if the file is not present. A new write will make the allocator take these blocks for the new data, since they are marked as not-in-use.

This applies to files created on an RBD device as well. The files created on top of the RBD-mapped mount point are ultimately mapped to objects in the RADOS cluster. When a file is deleted from the mount point, its directory entry is removed, so it no longer shows up in the mount point.

But since the file system doesn’t clear off the underlying block device, the objects remain in the RADOS pool. They would normally be over-written when a new file is created via the mount point.

There is a catch, though. Since the pool retains the objects even after the files are deleted, they keep consuming space in the RADOS pool (even if they will eventually be overwritten). An administrator won’t be able to get a clear picture of the space usage if the pool is used heavily and multiple writes are coming in.

In order to clear up the underlying blocks, or actually remove them, we can rely on the TRIM support most modern disks offer. Read more about TRIM at Wikipedia.

TRIM is a set of commands supported by HDDs/SSDs which allow the operating system to tell the disk about the locations that are no longer in use. Upon receiving that confirmation from the file system layer, the disk can go ahead and mark those blocks as unused.

For the TRIM commands to work, both the disk and the file system have to support them. All modern file systems have built-in TRIM support. Mount the file system with the ‘discard‘ option, and you’re good to go.

# mount -o discard /dev/rbd{X}{Y} /{mount-point}

Custom CRUSH rulesets and pools

Ceph supports custom rulesets via CRUSH, which can be used to sort hardware based on various features such as speed and other factors, set custom weights, and do a lot of other useful things.

Pools, the buckets where the data is written, can be created on custom rulesets, thus positioning the pools on specific hardware as per the administrator’s need.

A large Ceph cluster may have lots of pools and rulesets specific for multiple use-cases. There may be times when we’d like to understand the pool to ruleset mapping.

The default CRUSH ruleset is named ‘replicated_ruleset’. The available CRUSH rulesets can be listed with:

$ ceph osd crush rule ls

On a fresh cluster, or one without any custom rulesets, you’d find the following being printed to stdout.

# ceph osd crush rule ls
[
    "replicated_ruleset"
]

I’ve got a couple more on my cluster, and this is how it looks:

# ceph osd crush rule ls
[
    "replicated_ruleset",
    "replicated_ssd",
    "erasure-code"
]

Since this article looks into the mapping of pools to CRUSH rulesets, it’d be good to add in how to list the pools, as a refresher.

# ceph osd lspools

On my Ceph cluster, it turned out to be:

# ceph osd lspools
0 data,1 metadata,2 rbd,21 .rgw,22 .rgw.root,23 .rgw.control,24 .rgw.gc,25 .users.uid,26 .users,27 .users.swift,28 test_pool,

Since you have the pool name you’re interested in, let’s see how to map it to the ruleset. The command syntax is:

# ceph osd pool get <pool_name> crush_ruleset

I was interested to understand the ruleset on which the pool ‘test_pool’ was created. The command to list this was:

# ceph osd pool get test_pool crush_ruleset
crush_ruleset: 1

Please note that the rulesets are numbered from ‘0’, and hence ‘1’ would map to the CRUSH ruleset ‘replicated_ssd’.
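If you want this pool-to-ruleset mapping for every pool in one go, a small script can shell out to the same two commands used above. This is only a sketch, and it assumes the comma-separated ‘ceph osd lspools’ output format shown earlier:

import subprocess

# 'ceph osd lspools' prints entries like "0 data,1 metadata,2 rbd,"
pools_out = subprocess.check_output(["ceph", "osd", "lspools"]).decode().strip()

for entry in pools_out.split(","):
    entry = entry.strip()
    if not entry:
        continue
    pool_name = entry.split(" ", 1)[1]   # drop the leading pool number
    ruleset = subprocess.check_output(
        ["ceph", "osd", "pool", "get", pool_name, "crush_ruleset"]).decode().strip()
    # 'ceph osd pool get <pool> crush_ruleset' prints "crush_ruleset: <N>"
    print("%s -> ruleset %s" % (pool_name, ruleset.split(":")[1].strip()))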

We’ll try to understand how a custom ruleset is created, in another article.

OSD information in a scriptable format

In case you are trying to get the OSD IDs and the corresponding node IP address mappings in a scriptable format, use the following command:

# ceph osd find <OSD-num>

This will print the OSD number, the IP address, the host name, and the CRUSH root the OSD falls under, as JSON.

# ceph osd find 2
{ "osd": 2,
  "ip": "192.168.122.112:6800\/5311",
  "crush_location": { "host": "node4", "root": "default"}}

The output is JSON, which is a key:value format. It can be parsed with awk/sed, or with any programming language that supports JSON; all recent ones do.
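For instance, here is a minimal Python sketch that parses that JSON instead of scraping it with awk/sed (OSD id 2 is just the example used above):

import json
import subprocess

raw = subprocess.check_output(["ceph", "osd", "find", "2"]).decode()
osd = json.loads(raw)

print(osd["osd"])                       # 2
print(osd["ip"])                        # e.g. 192.168.122.112:6800/5311
print(osd["crush_location"]["host"])    # e.g. node4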

For a listing of all the OSDs and related information, get the number of OSDs in the cluster, and then use that number to probe the OSDs.

# for i in `seq 0 $(($(ceph osd stat | awk '{print $3}') - 1))`; do
      ceph osd find $i; echo
  done

This should output:

{ "osd": 0,
  "ip": "192.168.122.244:6805\/2579",
  "crush_location": { "host": "node3",
      "root": "ssd"}}
{ "osd": 1,
  "ip": "192.168.122.244:6800\/955",
  "crush_location": { "host": "node3",
      "root": "ssd"}}
{ "osd": 2,
  "ip": "192.168.122.112:6800\/5311",
  "crush_location": { "host": "node4",
      "root": "default"}}
{ "osd": 3,
  "ip": "192.168.122.112:6805\/5626",
  "crush_location": { "host": "node4",
      "root": "default"}}
{ "osd": 4,
  "ip": "192.168.122.82:6800\/4194",
  "crush_location": { "host": "node5",
      "root": "default"}}
{ "osd": 5,
  "ip": "192.168.122.82:6805\/4521",
  "crush_location": { "host": "node5",
      "root": "default"}}
{ "osd": 6,
  "ip": "192.168.122.73:6801\/5614",
  "crush_location": { "host": "node2",
      "root": "ssd"}}
{ "osd": 7,
  "ip": "192.168.122.73:6800\/1719",
  "crush_location": { "host": "node2",
      "root": "ssd"}}
{ "osd": 8,
  "ip": "192.168.122.10:6805\/5842",
  "crush_location": { "host": "node6",
      "root": "default"}}
{ "osd": 9,
  "ip": "192.168.122.10:6800\/4356",
  "crush_location": { "host": "node6",
      "root": "default"}}
{ "osd": 10,
  "ip": "192.168.122.109:6800\/4517",
  "crush_location": { "host": "node7",
      "root": "default"}}
{ "osd": 11,
  "ip": "192.168.122.109:6805\/4821",
  "crush_location": { "host": "node7",
      "root": "default"}}