ReiserFS - Everything2.com

ReiserFS is a B+Tree based journaling file system aimed at systems running the Linux kernel. It was named after its conceiver, the former UC Berkeley student Hans Reiser, who started the project alone in the middle 90's. Later, Reiser went to Russia and along with a group of about fifteen Russian hackers, founded The Naming Systems Venture (a.k.a. Namesys); the company that today administers the development of the ReiserFS line of file systems.

At the moment, ReiserFS is the paramount practical evolution of Reiser's senior thesis, which dealt with namespace design issues (more specifically, the differences in design paradigms between computer scientists and "traditional" science - physicists, philosophers, etc.). The interesting thing is that Hans Reiser and his companions have only begun to implement the lower part of a scheme that aims to become a completely new naming system framework, capable of extending the hierarchic nature of the traditional UNIX file system to accept structured queries for file system objects. ReiserFS is the first step towards that goal, representing the initial implementation of a storage layer for the desired naming system. ReiserFS first "interesting" version is 3.5, because it was at that point that it was first merged with an official release of the Linux kernel. ReiserFS will stand on the '3.6.X' branch, since Namesys has already frozen v3 (except for bug fixes, of course) and is now working on v4 alone, which will probably be named Reiser4.

Architecture overview

Namesys website isn't particularly clear regarding specific implementation details of ReiserFS. It poses some interesting theoretical views on the future of naming systems, but I missed more straight and comprehensive documentation about the concrete implementation. The source code, both of ReiserFS itself and of reiserfsutils package, and a gathering of third party articles around the Web proved to be more useful to understand the implementation.

The ReiserFS Tree:

The data structures employed by ReiserFS are basically B+Trees, which means that its algorithms employ tree-like pointer arrangements formations that do not store actual data on any but on the leaf nodes. Thus, the ReiserFS tree is a structured view over a subset of the one-dimensional array of 4K blocks on disk. Exactly what blocks are nodded in the in-memory tree depends on cache mechanisms and usage patterns (whether files are big or small on average, customarily how many files are open at a given moment, etc).

Nodes that contain pointers to other nodes and other meta-information useful for the tree traversing algorithms are said formatted nodes; nodes that contain only "raw" data, i.e., file content, are said unformatted nodes. Nodes in the tree are 4K bytes in size, and they can be of two different types, depending on their level in the tree:

INTERNAL NODES: these are nodes that are on levels other than the leaf level. In the ReiserFS tree, internal nodes contain only keys, which identify the node individually, and pointers to child nodes. These nodes exist to provide a path to the leaf nodes, and they are kept logically ordered by their keys.

LEAF NODES: Leaf nodes store actual data in the tree. They can hold diverse kinds of items, namely:

Stat items: Stat items are what in a "normal" file system would be called inodes. They store bookkeeping information about files on disk. When the kernel receives a system call to manipulate a file in a ReiserFS partition, the function installed by ReiserFS in the VFS layer, with this specific purpose, will eventually act upon this kind of object.

Directory entries: These are lists of object_id:filename pairs or, more strictly speaking, are lists of integers associated with the filenames in a given directory. In these very data structures and in the code that traverses them lies the explanation for ReiserFS' rapidness on parsing huge directories (e.g. millions of files), because the code that lists entries in directories is much faster than the linear linked lists that are traditionally applied to such purpose (ext2 uses them).

Direct items: You certainly have already learnt that ReiserFS has superb performance when it comes to handling small files (files whose content is shorter than 16K bytes). Well, the elucidation lies in the existence of direct items: ReiserFS caches the entire contents of small files in the tree. In the case of files large than 16K, direct items can store the tail of the file.

Indirect items: Indirect items are nodes that hold addresses of block on disk, these nodes identify the stat item to which the blocks pertains and the information about the disk extent that holds the actual data.

Storage

Blocks of 4096 bytes are the smallest individually accessible portion of the disk. Physically, files (ordinary files, directories, pipes, etc) are groups of blocks. Files smaller than one block (i.e. whose size is less than 4K) can be packed together inside a block. To sum it up, the storage layer makes the necessary to present the array or sectors it finds in the disk to the semantics layer (or to any other application accessing it, like the VFS) as a storage area structured in a tree of directories. Now let's set some limits:

You can store a maximum of 4,294,967,296 files in a reiserfs partition.
You can put no more than 2,147,483,648 files in a directory.
The maximum number of subdirectories inside a directory is 64.536.
The maximum file-size is 17.6 terabytes on 32 bit architectures.
You can have as much as 4,294,967,296 links to a file.
And finally your file system overall maximum size will be 4,294,967,296 x 4K blocks, i.e., 17.6 terabytes.

Creating ReiserFS file systems

Reiserfs sees the underlying media as a set of partitions, each one is a logical sub area in the 512 bytes sectors array known as a "disk". As said above, ReiserFS can embrace partitions of up to 17.6 TB. As with many other inode-based file systems, Reiserfs sees a partition as a mere string of contiguous blocks, which are the minimal portion of a partition that can be individually addressed. The only currently supported block size in Reiserfs is 4096 bytes.

In order to make use of a partition, reiserfs needs to know it first; introducing a filesystem and a disk partition to each other is a process popularly known as "formatting". To create a reiserfs filesystem in a partition you will need a tool named mkreiserfs, available for download off the Namesys server (but you shouldn't need bother going there, because mkreiserfs ships with most GNU/Linux distributions theses days).

A partition formatted by mkreiserfs looks like this:

+--------+------------+---------------+-------------+~. . .~+----------------+--------------+
| 0..64K | superblock | 1st bitmap of | 32,768 x 4K |       | last bitmap of | 32,768 x 4K  |
| empty  |(meta-info) | free/used blks| data blocks |       | free/used blks | data blocks  |
+--------+------------+---------------+-------------+~. . .~+----------------+--------------+

Reiser4

The next generation of ReiserFS has Reiser4 as its provisory name and, according to Namesys, will focus on modularity - allowing for easy expansion by third part plugins, and semantics richness - it should offer more flexibility on the naming methods. It will, for example, allow a program to query the filesystem for a file by specifying arbitrary individual attributes of the file, not just their path. But that, I suspect, is another matter entirely.

References

Namesys website: http://www.namesys.com
p-nand-q.com: http://p-nand-q.com/download/rfstool/reiserfs_docs.html
The source code of both reiserfs kernel modules and the reiserfsutils package.

ext3	journaling filesystem	Was not cleanly unmounted, check forced	SuSE
NTFS	xfs	JFS	ext2
balanced tree	filesystem	Doing something with your life	lost+found
Red Black Tree	The Original GNU Announcement	Virtual File System	The Linux Directory Structure
AVL tree	storage device	QNX	31337
B+Tree	WinFS	Paul Reiser	Network File System