Max file-name length in an EXT4 file system.

A recent discussion at work brought up the question “What can be the length of a file name in EXT4?”. In other words, what is the maximum length a file name can have in EXT4?

Wikipedia states that it’s 255 Bytes, but how does that come to be? Is it 255 Bytes or 255 characters?

In the kernel source for the 2.6 kernel series (the question was for a RHEL6/EXT4 combination), fs/ext4/ext4.h defines the following:


#define EXT4_NAME_LEN 255

struct ext4_dir_entry {
    __le32 inode;             /* Inode number */
    __le16 rec_len;           /* Directory entry length */
    __le16 name_len;          /* Name length */
    char name[EXT4_NAME_LEN]; /* File name */
};

/*
 * The new version of the directory entry. Since EXT4 structures are
 * stored in intel byte order, and the name_len field could never be
 * bigger than 255 chars, it's safe to reclaim the extra byte for the
 * file_type field.
 */

struct ext4_dir_entry_2 {
    __le32 inode;             /* Inode number */
    __le16 rec_len;           /* Directory entry length */
    __u8 name_len;            /* Name length */
    __u8 file_type;
    char name[EXT4_NAME_LEN]; /* File name */
};

This shows that there are two versions of the directory entry structure, ie.. ext4_dir_entry and ext4_dir_entry_2.

A directory entry structure carries the file/folder name and the corresponding inode number under every directory.

Both structs use an element named name_len to denote the length of the file/folder name.

If the EXT filesystem feature filetype is not set, the directory entry structure falls back to the first version, ext4_dir_entry; else the second one, ie.. ext4_dir_entry_2, is used.

By default, the file system feature filetype is set, hence the directory entry structure in use is ext4_dir_entry_2. As seen above, its name_len field is only 8 bits wide.
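Whether the filetype feature is enabled on a given filesystem can be checked with tune2fs; the device name and the exact feature list below are just an example:

# tune2fs -l /dev/sda1 | grep -i 'features'
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super huge_file uninit_bg dir_nlink extra_isize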

__u8 represents an unsigned 8-bit integer in C, and can store values from 0 to 255.

ie.. 2^8 = 256 possible values (0 to 255).

ext4_dir_entry declares name_len as __le16, but even there the file-name length is effectively capped at 255, since the name array itself is only EXT4_NAME_LEN (255) bytes long.

Observations:

  1. The maximum name length is 255 bytes on Linux machines (for single-byte encodings, that is also 255 characters).
  2. The actual name length of a file/folder is stored in name_len in its directory entry, under the parent folder. So if the file name is 5 characters long, name_len for that particular file is set to 5, ie.. the actual length.
  3. An ASCII character consumes one byte of storage, so the number of characters in a file name maps to the same number of bytes. A file with a name_len of 5 therefore uses 5 bytes to store its name. Multi-byte encodings such as UTF-8 can use more than one byte per character, in which case fewer characters fit within the limit.

Hence, name_len denotes the length of the name stored in the directory entry. Since __u8 is 8 bits wide, name_len can record a length of at most 255.

So, coming back to the original question: the limit is 255 Bytes. For plain ASCII names that also means 255 characters, but with a multi-byte encoding like UTF-8, fewer characters fit into those 255 bytes.
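A quick way to see this on a live system (a rough sketch, assuming a bash shell and an EXT4 mount; the output is abridged):

# getconf NAME_MAX /
255

# touch "$(printf 'a%.0s' {1..255})"     # 255 single-byte characters: works
# touch "$(printf 'a%.0s' {1..256})"     # 256 bytes: rejected
touch: cannot touch 'aaaa...': File name too long

# touch "$(printf 'れ%.0s' {1..100})"     # 100 UTF-8 characters = 300 bytes: also rejected
touch: cannot touch 'れれ...': File name too long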

NOTE:

The initial dir entry structure ext4_dir_entry had __le16 for name_len; it was later shrunk to __u8 in ext4_dir_entry_2, by taking 8 bits away from the existing 16 bits of name_len.

The byte freed from name_len was reassigned to store the file type, in ext4_dir_entry_2. It was named file_type, with size __u8.

file_type helps to identify file types such as regular files, directories, symlinks, sockets, character devices, block devices etc..

References:

  1. RHEL6 kernel-2.6.32-573.el6 EXT4 header file (ext4.h)
  2. EXT4 Wiki – Disk layout
  3. http://unix.stackexchange.com/questions/32795/what-is-the-maximum-allowed-filename-and-folder-size-with-ecryptfs

FSCache and the on-disk structure of the cached data

The ‘cachefiles’ kernel module, driven by the ‘cachefilesd’ daemon, will create two directories at the location specified in /etc/cachefilesd.conf. By default it’s /var/cache/fscache/.

[root@montypython ~]# lsmod |grep -i cache
cachefiles             40871  1
fscache                62354  3 nfs,cachefiles,nfsv4

Those are /var/cache/fscache/cache and /var/cache/fscache/graveyard.

The cache structure is maintained inside ‘/var/cache/fscache/cache/’, while anything that is retired or culled is moved to ‘graveyard’. The ‘cachefilesd’ daemon monitors ‘graveyard’ using ‘dnotify’ and will delete anything that is in there.
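For reference, a minimal /etc/cachefilesd.conf looks roughly like the following; the values here are the usual shipped defaults and are shown only as an illustration:

# cat /etc/cachefilesd.conf
dir /var/cache/fscache
tag mycache
brun 10%
bcull 7%
bstop 3%
frun 10%
fcull 7%
fstop 3%

The ‘brun/bcull/bstop’ limits control culling based on free block space, and ‘frun/fcull/fstop’ based on free file count, in the filesystem backing the cache.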

We’ll try an example. Consider an NFS share mounted with fscache support. The share contains the following files, with some random text.

# ls /vol1
files1.txt  files2.txt  files3.txt  files4.txt

a) Configure ‘cachefiles’ by editing ‘/etc/cachefilesd.conf’, and start the ‘cachefilesd’ daemon.

# systemctl start cachefilesd

b) Mount the NFS share on the client with the ‘fsc’ mount option, to enable ‘fscache’ support.

# sudo mount localhost:/vol1 /vol1-backup/ -o fsc

c) Access the data from the mount point, and fscache will create the backing cache index at the location specified in /etc/cachefilesd.conf. By default, it’s /var/cache/fscache/.

d) Once the files are accessed on the client side, fscache builds an index as follows:

NOTE: The index structure is dependent on the netfs (NFS in our case). The netfs driver can structure the cache index as it sees fit.

Explanation of the caching structure:

# tree /var/cache/fscache/
/var/cache/fscache/cache/
└── @4a
    └── I03nfs
        ├── @22
        │   └── Jo00000008400000000000000000000000400
        │       └── @59
        │           └── J110000000000000000w080000000000000000000000
        │               ├── @53
        │               │   └── EE0g00sgwB-90600000000ww000000000000000
        │               ├── @5e
        │               │   └── EE0g00sgwB-90600000000ww000000000000000
        │               ├── @61
        │               │   └── EE0g00sgwB-90600000000ww000000000000000
        │               ├── @62
        │               │   └── EE0g00sgwB-90600000000ww000000000000000
        │               ├── @70
        │               │   └── EE0g00sgwB-90600000000ww000000000000000
        │               ├── @7c
        │               │   └── EE0g00sgwB-90600000000ww000000000000000
        │               └── @e8
        │                   └── EE0g00sgwB-90600000000ww0000000000000000
        └── @42
            └── Jc000000000000EggDj00
                └── @0a

a) The ‘cache‘ directory under /var/cache/fscache/ is a special index and can be seen as the root of the entire cache index structure.

b) Data objects (actual cached files) are represented as files if they have no children, or as directories if they do. If represented as a directory, a data object will have a file inside named ‘data’ which holds the data.

c) The ‘cachefiles’ kernel module represents:

i)   index objects as directories, with names starting with either ‘I’ or ‘J’.

ii)  data objects as files, with names beginning with ‘D’ or ‘E’.

iii) special objects, which are similar to data objects, with names starting with ‘S’ or ‘T’.

In general, any object would be represented as a folder, if that object has children.

d) In the directory hierarchy, immediately between a parent object and its child objects, are directories named with *hash values* of the immediate child object keys, starting with an ‘@’.

The child objects are placed inside this directory. These child objects are folders if they have children of their own, or files if they hold the cached data itself. This continues down the path until the file containing the cached data is reached.

Representation of the object indexes (for NFS, in this case):

INDEX     INDEX      INDEX                             DATA FILES
========= ========== ================================= ================
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry
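To confirm the cache is actually being exercised, a couple of places can be checked; both depend on the kernel configuration and version, so treat this as a sketch:

# FS-Cache statistics (present when the kernel is built with FS-Cache stats support)
cat /proc/fs/fscache/stats

# NFS superblocks, including an FSC column showing whether caching is enabled for the mount
cat /proc/fs/nfsfs/volumes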

FS-Cache and CacheFS, what are the differences?

FS-Cache and CacheFS. Are there any differences between the two? Initially, I thought both were the same. But no, they’re not.

CacheFS is the backend implementation which caches the data onto the disk and manipulates it, while FS-Cache is an interface which talks to CacheFS.

So why do we need two levels here?

FS-Cache was introduced as an API or front-end for CacheFS, which can be used by any file system driver. The file system driver talks to the FS-Cache API, which in turn talks to CacheFS in the back-end. Hence, FS-Cache acts as a common interface for the file system drivers, without them needing to understand the backend CacheFS complexities and how it’s implemented.

The only drawback is the additional code that needs to go into each file system driver that wants to use FS-Cache. ie.. every file system driver that needs to talk to FS-Cache has to be patched with support for doing so. Moreover, the cache structure differs slightly between the file systems using it, so there is no single standard. This, unfortunately, prevents FS-Cache from being used by every network filesystem out there.

The data flow would be as:

VFS <-> File system driver (NFS/CIFS etc..) <-> FS-Cache <-> CacheFS <-> Cached data

CacheFS need not cache every file in its entirety; it can also cache files partially. This partial caching is possible because FS-Cache caches ‘pages’ rather than whole files. Pages are smaller fixed-size segments of data, and they are cached depending on how much of the file is read initially.
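The pages in question are ordinary kernel memory pages; their size can be checked as below (4096 bytes is the typical value on x86, shown just as an example):

# getconf PAGESIZE
4096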

FS-Cache does not require an open file to be loaded into the cache prior to being accessed. This is a nice feature as far as I understand, and the reasons are:

a) Not every open file on the remote file system can be loaded into the cache, due to size limits. In such a case, only certain parts (pages) may be loaded, and the rest of the file has to be accessed normally over the network.

b) The cache won’t necessarily be large enough to hold all the open files on the remote system.

c) Even if the cache is not populated properly, the file should still be accessible, ie.. it should be possible to bypass the cache entirely.

This hopefully clears up the differences between FS-Cache and CacheFS.

FS-Cache and FUSE

I will be working on enabling FS-Cache support in the FUSE kernel module, as part of my undergraduate project.

Niels De Vos, from Red Hat Engineering, will act as my mentor and guide throughout this project. He will also be presenting this idea at the ‘Linux Plumbers Conference’ being held in Germany, in October 2014.

More details on the talk can be seen at http://www.linuxplumbersconf.org/2014/ocw/sessions/2247

This feature has had quite a few requests from the FOSS world, and I’m glad I could work on it. For now, I’m trying to get a hold on FS-Cache, how it works with other file systems, and trying to build FUSE with some customizations. Ultimately, it is the FUSE module where the code additions will go, not FS-Cache.

I’ll try to keep this blog updated, so that I have a journal to refer to later.

“Error: open /tmp/docker-import-123456789/repo/bin/json: no such file or directory”

I’ve been trying to create a minimal docker image for RHEL versions, for one of my projects. The following were the steps I followed:

a) Installed a RHEL6.5 server with ‘Minimal Installation’.

b) Registered it to the local satellite.

c) Created a tar-ball of the filesystem with the command below:


# tar --numeric-owner --exclude=/proc --exclude=/sys --exclude=/mnt --exclude=/var/cache \
      --exclude=/usr/share/doc --exclude=/tmp --exclude=/var/log \
      -zcvf /mnt/rhel6.5-base.tar.gz /

d) Loaded the tar.gz image using ‘docker load’ (as per the man page of ‘docker load’):


# docker load -i rhel6.5-base.tar.gz

This is where it erred with the message:


2014/08/16 20:37:42 Error: open /tmp/docker-import-123456789/repo/bin/json: no such file or directory

After a bit of searching and testing, I found that ‘docker load -i’ doesn’t work as expected with this kind of tarball. The workaround is to cat and pipe the tar.gz file, as shown below:


# cat rhel6.5-base.tar.gz | docker import - rhel6/6.5

This results in the image showing up in ‘docker images’:


# docker images

REPOSITORY    TAG       IMAGE ID        CREATED               VIRTUAL SIZE
rhel6/6.1     latest    32b4b345454a    About a minute ago    1.251 GB

Update: ‘docker load -i <image-file>’ only works if the image was created as a layered docker image. If the <image-file> is a tar ball created from a root filesystem, you need to use ‘cat <image-file> | docker import - <name>’.
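In other words, the two commands pair up with different tarball formats. Roughly (the image names here are just examples):

# Layered image tarballs are produced by 'docker save' and consumed by 'docker load'
docker save rhel6/6.5 > rhel6.5-layered.tar
docker load -i rhel6.5-layered.tar

# Flat root-filesystem tarballs are consumed by 'docker import'
# ('docker export <container>' produces the same kind of flat tarball)
cat rhel6.5-base.tar.gz | docker import - rhel6/6.5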

lsusb and chroot in anaconda.. Is usbfs mounted in anaconda %post installation ?

The binary ‘/sbin/lsusb’ has problems running properly in a chroot-ed environment. I have not checked this in a manually created chroot environment or with tools like ‘mock’.

The scenario is as follows:

We were trying to check the output of ‘lsusb’ in the %post section of a kickstart installation. I had specified ‘noreboot’ in the kickstart file, so the machine waits for the user to reboot it manually. This helps in checking the logs and the state of the machine just after the installation finishes.

After the installation and prior to the reboot, I checked in the second available terminal (Alt + F2) created by anaconda, and was astonished to see that the command ‘lsusb’ does not give the required output, but an error that ‘/usr/share/hwdata/usb.ids’ is not accessible or found.

By default, I think only the ‘installation’, ie.. the %post section, runs in chroot mode, and the terminal available is not chroot-ed. So we have to use ‘/mnt/sysimage/sbin/lsusb’. This didn’t work as expected, since the ‘lsusb’ binary needs to read ‘/usr/share/hwdata/usb.ids’ and can’t find it.

So I did a chroot from the second terminal and ran /sbin/lsusb (since /sbin is not in the ‘PATH’ by default). That too didn’t work out. But this time it didn’t even complain about anything. Just nothing at all, no output. Last time, at least it complained it could not find something. So how do we go forward now? Here comes ‘strace’ to the rescue!

strace is of course a really nice tool to see what system calls are made and a lot of the internal stuff a binary does while being executed. But ‘strace’ is not installed by default on a RHEL5 machine, which is the case here. As most of you would know, anaconda creates a virtual file system which consists of most of the folders found under a normal Linux /. The location where the OS is installed is mounted under /mnt/sysimage.

Since we already have the ISO from which we booted the machine (DVD/CD), we are free to mount it on the filesystem, which is what we did:

# mkdir /mnt/source
# mount -t iso9660 /dev/hdc /mnt/source
# cd /mnt/source/Server/

In case you want to know how the DVD/CD drive was detected, all you need to do is execute ‘dmesg’ in the available terminal, ie.. after pressing ‘Alt + Ctrl + F2’.

So we went ahead and mounted the DVD to /mnt/source, changed to /mnt/source/Server where all the rpm packages reside, and installed the ‘strace’ package using ‘rpm -ivh’. Please note that we need to pass ‘--root /mnt/sysimage’, since we are installing the package into our newly installed file system at /mnt/sysimage. If this is not used, rpm will try to install the package into the virtual environment created in memory.

# cd /mnt/source/Server
# rpm -ivh strace-<version>.rpm --root /mnt/sysimage
# cd
# chroot /mnt/sysimage

This will make /mnt/sysimage as the working root, ie.. where our installation was done. OK.. now for the ‘strace’ stuff.

# strace -fxvto strace.log -s 1024 /sbin/lsusb

The strace output will be saved to ‘strace.log’, which we can open in a text editor of our choice. Opening it in ‘vi’ shows a lot of stuff, such as the command run, the default language, the libraries loaded, the environment variables etc.. In this case we are only interested in the last part, ie.. where the binary failed:

15:16:17 open("/dev/bus/usb", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = -1 ENOENT (No such file or directory) = 03067
15:16:17 open("/proc/bus/usb", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 33067
15:16:17 fstat(3, {st_dev=makedev(0, 3), st_ino=4026532146, st_mode=S_IFDIR|0555, st_nlink=2, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2009/09/25-15:16:17, st_mtime=2009/09/25-15:16:17, st_ctime=2009/09/25-15:16:17}) = 03067
15:16:17 fcntl(3, F_SETFD, FD_CLOEXEC) = 03067
15:16:17 getdents(3, {{d_ino=4026532146, d_off=1, d_reclen=24, d_name="."} {d_ino=4026531879, d_off=2, d_reclen=24, d_name=".."}}, 4096) = 483067
15:16:17 getdents(3, {}, 4096) = 03067
15:16:17 close(3) = 03067
15:16:17 exit_group(1) = ?

The above trace output shows how the ‘lsusb’ binary proceeded in its final moments and where it failed. We can see that it tried to open ‘/dev/bus/usb’, only to find that the said location does not exist. We can tell that it expects a directory from the call

open("/dev/bus/usb", O_RDONLY|O_NONBLOCK|O_DIRECTORY)

Ok, fine.. so what does it do next?

As the next step, it tries to open ‘/proc/bus/usb’, which succeeds; we know that because there is no ‘No such file or directory’ error this time. The binary then does an ‘fstat’ on the open directory, sets the close-on-exec flag on the file descriptor using ‘fcntl’, and goes on to list the directory contents using ‘getdents’.

This is where we find the interesting output :

getdents(3, {{d_ino=4026532146, d_off=1, d_reclen=24, d_name="."} {d_ino=4026531879, d_off=2, d_reclen=24, d_name=".."}}, 4096) = 48

As you can see in the above trace, getdents returns only ‘.’ and ‘..’, which means there is nothing in /proc/bus/usb. So what we understand is that ‘lsusb’ refers to /dev/bus/usb and /proc/bus/usb for its output. If the directory had been missing altogether, strace would have shown an error, which obviously would have made life much easier.

And that’s how ‘/sbin/lsusb’ failed silently.. Isn’t strace a nice tool ??

Okay, for those who want to know why this is so: ‘lsusb’ needs either /mnt/sysimage/proc/bus/usb or /mnt/sysimage/dev/bus/usb to be populated in order to work properly. Anaconda does not mount the ‘usbfs’ file system on /mnt/sysimage/proc/bus/usb in the limited installation environment, and hence ‘lsusb’ fails.

And we have a fix for that which goes into yuminstall.py in the anaconda source :

try:
    isys.mount("/proc/bus/usb", anaconda.rootPath + "/proc/bus/usb", "usbfs")
except Exception, e:
    log.error("error mounting usbfs: %s" %(e,))

This piece of Python code tries mounting /proc/bus/usb onto /mnt/sysimage/proc/bus/usb as ‘usbfs’. If that fails, the code catches the exception and logs “error mounting usbfs”.
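If patching anaconda is not an option, mounting usbfs by hand from the non-chroot-ed terminal should achieve the same thing; a sketch, not tested across anaconda versions:

# Mount usbfs into the installed system, then run lsusb inside the chroot
mount -t usbfs none /mnt/sysimage/proc/bus/usb
chroot /mnt/sysimage /sbin/lsusb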

Device Mapper and applications

What is device-mapper ?

Device mapper is a modular driver for the 2.6 Linux kernel. It can be described as a framework which helps to create or map logical sectors of a pseudo block device onto an underlying physical block device. So what device-mapper does is keep a table of mappings which equate the logical block devices to the physical block devices.

Applications such as LVM2, EVMS, software raid aka dmraid, multipathing, block encryption mechanisms such as cryptsetup etc… use device-mapper to work. All these applications excluding EVMS use the libdevmapper library to communicate with device-mapper.

The applications communicate with device-mapper’s API to create the mappings. Due to this, device-mapper does not need to know what LVM or dmraid is, how it works, what LVM metadata is, etc.. It is up to the application to create the pseudo devices pointing to the physical volumes using one of device-mapper’s targets, and then update the mapping table.
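The resulting mappings can be inspected with ‘dmsetup’. For example, on a host using LVM (the names and numbers below are illustrative):

# dmsetup ls
vg0-root    (253:0)
vg0-swap    (253:1)

# dmsetup table vg0-root
0 20971520 linear 8:2 2048

The table line reads: starting at logical sector 0 of the pseudo device, the next 20971520 sectors are mapped with the ‘linear’ target onto device 8:2 (/dev/sda2), beginning at sector 2048 of that device.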

The device-mapper mapping table :

The mapping table used by device-mapper doesn’t take much space, and is kept in a btree. A btree (related to, but not the same as, a binary search tree) is a data structure to which data can be added, and from which it can be removed or queried, efficiently.

To know more about btrees and binary search trees, and the concepts behind them, read:

http://en.wikipedia.org/wiki/Binary_search_tree

http://en.wikipedia.org/wiki/B-tree

Types of device-mapper targets :

Applications which use device-mapper actually use one or more of its mapping targets to achieve their purpose. A target can be thought of as a method or type of mapping implemented by device-mapper. The general mapping targets are:

a) Linear – Used by linear logical volumes, ie.. the default data layout method used by LVM2.

b) Striped – Used by striped logical volumes as well as software RAID0.

c) Mirror – Used by software RAID1 and LVM mirroring.

d) Crypt – Used by disk encryption utilities.

e) Snapshot – Used to take online snapshots of block devices, an example is LVM snapshot.

f) Multipath – Used by device-mapper-multipath.

g) RAID45 – Software raid using device-mapper, ie.. dmraid

h) Error – I/O to sectors of the pseudo device mapped with this target is made to fail.

There are a few more targets, such as ‘flakey’ (mainly used for testing), which are not used much.
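To see a target in action, here is a minimal sketch of a hand-built linear mapping with ‘dmsetup’; the backing device /dev/sdb and the sizes are just examples, so try this only on a disposable disk:

# Table format: <start sector> <number of sectors> <target> <target-specific arguments>
# Map a 1 GiB (2097152 x 512-byte sectors) pseudo device linearly onto /dev/sdb, offset 0
echo "0 2097152 linear /dev/sdb 0" | dmsetup create example-linear

# The pseudo device appears as /dev/mapper/example-linear; show its mapping table
dmsetup table example-linear

# Tear the mapping down when done
dmsetup remove example-linear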

I’ll write on how device-mapper works in LVM, in the next post…