The background

It is common knowledge that old school hackers all have large beards. Alan Cox, RMS and maddog are brilliant examples. The reason for this is that growing a beard is the most interesting use of one's time when the computer is waiting for fsck to finish messing around after a system crash, and on large filesystems, you'll have plenty of time to waste (this might also be why there are so few female hackers; they can't grow beards). Journaling filesystems are God's (in the handy incarnation of SGI, IBM, Red Hat and Hans Reiser's merry men) gift to us all, since they mean no more fscking around. However, actually using the filesystem should not be an excuse to grow a beard either, and after a recent re-install of Debian and migrating to a new hard drive, growing a beard and going bald was what I was doing while restoring my old /home from its backup. I decided to take a look at what could be done. After putting together a simple shell script to automatically make, mount and benchmark XFS filesystems with various mkfs and mount options, I went to an Aikido session, and came home to find a nice large pile of numbers waiting for me. Yes, it's sad, numbers make me all warm and fuzzy inside.

The filesystem

XFS is a powerful journaling file system created by Silicon Graphics (SGI), originally for use with their IRIX-based servers, workstations and visualization systems. It has a number of interesting features absent from many other filesystems (although, to be fair, some of them are available as add-ons) including Access Control Lists, Extended Attributes and the DMAPI. When SGI decided to embrace Linux, one of its contributions to the Linux community was a complete port of XFS (after having its lawyers make damn sure they wouldn't catch any intellectual property related flak for it -- a wise decision, considering the current SCO vs. Linux fiasco). Because of the common scenario SGI systems are deployed in (e.g. data pumps moving gigantic amounts of bytes around, for making animations and special effects in movies and the like), the filesystem is specifically optimized for high performance with massive files, and can theoretically scale to handle a maximum file size of 9 million terabytes. Not that I have a 9 million terabyte hard drive in my home Linux box, though. Most of what I have in my /home is Ogg Vorbis files, various images, videos and large archives of code, making XFS look like the logical choice. Weaknesses of XFS include very poor delete performance, and poor to mediocre performance with very small files.

The machine

This round of tests will be performed using socrates, my Debian GNU/Linux box. An aging beast, socrates has dual 600MHz Pentium III processors, 256 megabytes of PC700 RDRAM (yes, I know Rambus is Satan, I didn't when I built this thing though, get off my back would you?), an Intel i840 ICH ATA-66 I/O chipset, and a 60GB Seagate Barracuda ATA IV hard drive. While it also has lots of other things, you don't need to know about any of them to get an impression of my storage hardware performance. The test area for the filesystem is a 45 GB LVM logical volume, which will eventually become my /home.

The benchmark

The benchmark I will be using here is bonnie++. Since the point here is not to measure how XFS copes with files of various sizes, I used only one bonnie++ run per filesystem option. The invocation I used:

bonnie++ -u bonnie:bonnie -d mtpt/bonnie/ -s 496m -m socrates -n 16:100000:16:64

The interesting option is -n. The numbers mean bonnie++ will make 16*1024 files of random sizes ranging between 16 and 100000 bytes, distributed in 64 directories. The size range should give a more or less realistic impression of general performance. In my results below, all the numbers in sequential output and sequential input are measured in kilobytes per second, random seeks are seeks per second, and all the others are files per second. The percentages measure CPU load during the operation. A certain amount of statistical uncertainty does come into play with bonnie++, because of the random file sizes; this should be taken into account when reading the results.
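For the curious, the -n argument can be unpacked with a few lines of shell. This is only a sketch: describe_n is a made-up helper name, not part of bonnie++, and the files-per-directory figure is a plain average that assumes the files end up spread evenly.

```shell
#!/bin/sh
# Decode a bonnie++ -n argument of the form num:max:min:dirs.
# describe_n is a hypothetical helper name used for illustration only.
describe_n() {
    # Split the argument on colons into the positional parameters.
    old_ifs=$IFS
    IFS=:
    set -- $1
    IFS=$old_ifs
    num=$1
    max=$2
    min=$3
    dirs=$4
    # bonnie++ counts files in multiples of 1024.
    files=$((num * 1024))
    echo "files=$files min_bytes=$min max_bytes=$max dirs=$dirs files_per_dir=$((files / dirs))"
}

describe_n 16:100000:16:64
```

For the invocation above this works out to 16384 files, or 256 per directory on average.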

The competition

Other Linux journaling filesystems include Red Hat's Third Extended Filesystem (ext3), Hans Reiser's ReiserFS and IBM's JFS. I decided to also benchmark ext3 and ReiserFS for comparison, to highlight the strengths and weaknesses of XFS. Ext3 is a modified version of the old ext2, and is often described as a "poor man's journaling filesystem", because its main feature is its ability to upgrade from the widely used ext2 filesystem without reformatting the partition. It is the most widely used journaling filesystem on Linux, mainly for this reason. ReiserFS was developed from scratch by Hans Reiser and a team of predominantly Russian filesystem hackers, and has a small-file performance that is the stuff of legend. It can also delete files obscenely quickly, making it sort of the Anti-XFS. However, this article is not meant as a showdown between the filesystems -- choice of journaling file systems is the most common cause of armed conflict in the world, second only to economy and religion.

Enough talk! Show me numbers!

My first test was a comparison of the three filesystems:

          Sequential Output                Sequential Input
          Char       Block      Rewrite    Char       Block
Ext3      4581(95%)  5532(14%)  1906(8%)   3893(82%)  4338(28%)
ReiserFS  3211(46%)  4872(15%)  2037(8%)   3743(82%)  4301(20%)
XFS       4347(83%)  5202(6%)   1995(12%)  3816(81%)  4222(13%)

          RndSeeks   Sequential Create           Random Create
                     Create   Read     Delete    Create   Read    Delete
Ext3      146.2(3%)  67(6%)   66(19%)  1757(23%) 23(66%)  8(37%)  553(8%)
ReiserFS  151.2(3%)  93(17%)  50(15%)  3999(98%) 83(16%)  41(15%) 1750(60%)
XFS       162.0(3%)  72(8%)   71(28%)  495(30%)  72(8%)   33(10%) 232(14%)

The most obvious number here is the monstrous delete performance of ReiserFS, and the absolutely abysmal delete performance of XFS. Apart from that, it's a close run. Ext3 sucks in random create and read.

The most significant tweak first. While not strictly a filesystem tweak, running hdparm to tune my IDE hard drive's parameters is a good place to begin. I put the drive into DMA mode, enabled unmask_irq and set it to 32-bit synced mode, using hdparm -c3 -d1 -u1 /dev/hda. The unmask_irq feature shouldn't affect the benchmarks too much, but it does drastically improve system responsiveness during disk load. You can also do a lot of other interesting things with this program, although you should read its man page first, since some of the things it can do are dangerous for your drive. Before going on to more bonnie++ benchmarks, I had hdparm do a simple drive benchmark (with hdparm -tT) to see the effects of those three changes:

                     Before                   After
Buffer-cache reads : 128M/0.81s= 158.02 M/s   128M/0.82s = 156.10 M/s
Buffered disk reads: 64M/15.24s= 4.20 M/s     64M/1.59s  = 40.25 M/s
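The before and after figures can be turned into ratios with a one-line awk call. A minimal sketch: speedup is a hypothetical helper name, and the inputs are simply the M/s figures from the table above.

```shell
#!/bin/sh
# Ratio of two throughput figures, printed to one decimal place.
# speedup is a hypothetical helper name used for illustration only.
speedup() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1fx\n", after / before }'
}

speedup 158.02 156.10   # buffer-cache reads
speedup 4.20 40.25      # buffered disk reads
```

The buffer-cache figure stays at roughly 1.0x, which is expected since it measures memory rather than the drive, while the buffered disk reads come out at about 9.6x.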

Nearly a 10x speedup on buffered disk reads! This looks promising indeed. Let's see how the three filesystems fare after this tweak:

          Sequential Output                Sequential Input
          Char       Block      Rewrite    Char       Block
Ext3      8387(99%)  37393(56%) 17701(24%) 8946(97%)  26447(15%)
ReiserFS  7949(98%)  46297(89%) 14280(19%) 8814(96%)  41262(27%)
XFS       8950(99%)  49915(40%) 14142(15%) 8938(97%)  41418(26%)

          RndSeeks   Sequential Create           Random Create
                     Create   Read     Delete    Create   Read    Delete
Ext3      203.7(1%)  194(16%) 243(11%) 1932(24%) 191(16%) 69(3%)  639(11%)
ReiserFS  216.4(1%)  592(65%) 134(7%)  4145(99%) 573(62%) 85(4%)  1836(52%)
XFS       232.1(2%)  369(25%) 270(14%) 1695(51%) 362(25%) 57(2%)  340(12%)

I immediately notice that (apart from the massive boost of performance in all tests) the XFS delete performance was dramatically improved, while the two others have more modest improvements (although they are still ahead of XFS).

The first XFS tweak I'll try relates to XFS' practice of adding a flag to all unwritten extents. This is a safety feature, and it can be disabled with an option at filesystem creation time (mkfs.xfs -d unwritten=0). Flagging unwritten extents should decrease write performance by at least some margin (although disabling it may not necessarily be wise, since it does increase filesystem safety somewhat), so I try switching it off and seeing what happens:

          Sequential Output                Sequential Input
          Char       Block      Rewrite    Char       Block
Enabled   8950(99%)  49915(40%) 14142(15%) 8938(97%)  41418(26%)
Disabled  8930(99%)  51465(41%) 13599(15%) 8850(96%)  41586(27%)

          RndSeeks   Sequential Create           Random Create
                     Create   Read     Delete    Create   Read    Delete
Enabled   232.1(2%)  369(25%) 270(14%) 1695(51%) 362(25%) 57(2%)  340(12%)
Disabled  202.7(2%)  367(25%) 243(13%) 1607(54%) 353(24%) 57(3%)  404(16%)

A few improvements, a few setbacks, but they're all at the level of statistical noise. Disabling unwritten extent flagging doesn't seem to be terribly useful here.

The next possible tweak is to change the number of allocation groups XFS decides to use. Allocation groups are an XFS-specific technology: an allocation group is sort of a "sub-filesystem" which allows the Linux kernel to write to several parts of an XFS filesystem simultaneously (this is especially nice on a dual- or multiprocessor system), and is how XFS achieves its high parallelism. At least one allocation group is needed per 4 gigs of space. On the 45 GB partition, XFS chooses to use 45 allocation groups. I decide to see what using 16 or 90 of them would yield (mkfs.xfs -d agcount=XX, where XX is 16 or 90):

          Sequential Output                Sequential Input
          Char       Block      Rewrite    Char       Block
45        8950(99%)  49915(40%) 14142(15%) 8938(97%)  41418(26%)  
16        8916(99%)  49797(41%) 15531(17%) 8926(97%)  41607(27%)  
90        8957(99%)  52770(44%) 18222(20%) 8820(96%)  41646(25%)  

          RndSeeks   Sequential Create           Random Create
                     Create   Read     Delete    Create   Read    Delete
45        232.1(2%)  369(25%) 270(14%) 1695(51%) 362(25%) 57(2%)  340(12%)
16        236.7(2%)  385(27%) 278(13%) 1580(49%) 350(24%) 58(2%)  379(16%)
90        241.0(2%)  369(26%) 266(13%) 1722(54%) 359(24%) 56(2%)  334(11%)
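To get a feel for what those agcount values mean on the 45 GB volume, the group sizes can be worked out with shell arithmetic. A rough sketch: ag_size_mb and min_agcount are hypothetical helper names, 1 GB is taken as 1024 MB, and the 4 GB ceiling per group is the figure quoted above rather than something checked against mkfs.xfs.

```shell
#!/bin/sh
# ag_size_mb FS_GB AGCOUNT: approximate size of each allocation group in MB.
ag_size_mb() {
    echo $(( $1 * 1024 / $2 ))
}

# min_agcount FS_GB MAX_AG_GB: fewest groups that keep each one under the cap
# (integer ceiling division).
min_agcount() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

for ag in 16 45 90; do
    echo "agcount=$ag: about $(ag_size_mb 45 "$ag") MB per group"
done
echo "minimum agcount for 45 GB: $(min_agcount 45 4)"
```

So the default of 45 groups corresponds to about 1 GB each, 16 groups to just under 3 GB each, and the smallest count the 4 GB rule would allow here is 12.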

Again, nothing really significant, except for the 90 AG filesystem's much higher sequential rewrite performance. A downside of having many allocation groups is that they start hogging CPU time when the filesystem fills up, slowing the filesystem down drastically when it is close to full. 90 AGs would be more trouble than they're worth (16 could be nice though, but would hamper me if I should ever decide to grow my filesystem).

This is getting boring, so I decide to try out something else. The next option that seems interesting from a performance perspective is the size of the log (or journal). By default, XFS creates this filesystem with a 22 megabyte log, but it would be interesting to try something a little larger. This should benefit performance, because a larger log means it takes longer for the log to fill up with filesystem transactions. I decide to try a 32M and a 64M one (mkfs.xfs -l size=XXm, where XX is 32 or 64). Here are the results:

          Sequential Output                Sequential Input
          Char       Block      Rewrite    Char       Block
Default   8950(99%)  49915(40%) 14142(15%) 8938(97%)  41418(26%)
32M       8993(99%)  51814(43%) 13409(15%) 8744(95%)  41656(27%)
64M       8972(100%) 48672(39%) 18786(20%) 8941(97%)  41603(27%)

          RndSeeks   Sequential Create           Random Create
                     Create   Read     Delete    Create   Read    Delete
Default   232.1(2%)  369(25%) 270(14%) 1695(51%) 362(25%) 57(2%)  340(12%)
32M       235.9(2%)  374(26%) 262(13%) 1746(56%) 361(25%) 57(2%)  534(22%)
64M       238.8(2%)  386(27%) 275(12%) 1750(55%) 352(24%) 58(2%)  798(32%)

Now we're getting somewhere! It appears that the log is the key to improving XFS delete performance, although some of the other write operations also see some slight improvements.

I decide to make another log-related tweak, although this time I do so with an option given when mounting the filesystem rather than when creating it. Specifically, I decide to try using 6 or 8 log buffers (parts of the log held in RAM for speed -- each log buffer takes up 32K of RAM, though, and SGI advises against using 8 on a system with 128M RAM or less) instead of the default 2. The command used for this mount is mount -o logbufs=X, where X is 2 to 8, inclusive. The results (on a filesystem with no mkfs options):

          Sequential Output                Sequential Input
          Char       Block      Rewrite    Char       Block
2         8950(99%)  49915(40%) 14142(15%) 8938(97%)  41418(26%)
6         8927(99%)  48610(40%) 15106(17%) 8923(97%)  41658(24%)
8         8954(99%)  49257(41%) 18367(20%) 8954(97%)  41622(24%)

          RndSeeks   Sequential Create           Random Create
                     Create   Read     Delete    Create   Read    Delete
2         232.1(2%)  369(25%) 270(14%) 1695(51%) 362(25%) 57(2%)  340(12%)
6         240.4(2%)  325(22%) 270(14%) 1355(46%) 302(21%) 58(2%)  358(14%)
8         243.3(2%)  414(30%) 268(12%) 2050(70%) 404(29%) 58(3%)  458(19%)

Again, we see clear improvements in delete performance, and some slight increases in some other areas with 8 log buffers (6 buffers, oddly, did worse than the default on creates and sequential deletes).

The next interesting option is to switch off the filesystem's logging of access times. The main function of access time logging is to make your filesystem slower. Seriously. Nobody ever uses that feature anyway, and even SGI suggests turning it off to improve performance. The mount command I use for this is mount -o noatime,nodiratime.

          Sequential Output                Sequential Input
          Char       Block      Rewrite    Char       Block
Enabled   8950(99%)  49915(40%) 14142(15%) 8938(97%)  41418(26%)
Disabled  8940(99%)  50378(41%) 18720(21%) 8917(97%)  41561(26%)

          RndSeeks   Sequential Create           Random Create
                     Create   Read     Delete    Create   Read    Delete
Enabled   232.1(2%)  369(25%) 270(14%) 1695(51%) 362(25%) 57(2%)  340(12%)
Disabled  237.2(2%)  374(26%) 283(14%) 1830(58%) 349(24%) 61(3%)  319(12%)

Not really a lot happened here, although there were more gains than losses. I notice random delete performance has dropped, which seems odd. Still, having atime off means less disk activity, which is always a good thing.
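The whole series of runs above lends itself to a small driver script. What follows is not the script used for these tests, only a dry-run sketch of how one might look: it prints the commands instead of running them, and the device and mount point names are hypothetical placeholders. The -f flag to mkfs.xfs (overwrite an existing filesystem) and the mkdir/chown step are additions that recreating the filesystem on the same volume would need.

```shell
#!/bin/sh
# Dry run only: every command is printed, none is executed.
DEV=/dev/vg0/test   # hypothetical LVM volume, not the author's real one
MNT=/mnt/test       # hypothetical mount point

# plan_run MKFS_OPTS MOUNT_OPTS: print the steps for one benchmark run.
# Empty arguments mean "use the defaults".
plan_run() {
    echo "mkfs.xfs -f ${1:+$1 }$DEV"
    echo "mount -t xfs ${2:+-o $2 }$DEV $MNT"
    echo "mkdir -p $MNT/bonnie && chown bonnie:bonnie $MNT/bonnie"
    echo "bonnie++ -u bonnie:bonnie -d $MNT/bonnie/ -s 496m -m socrates -n 16:100000:16:64"
    echo "umount $MNT"
}

plan_run "" ""                                  # baseline
plan_run "-d unwritten=0" ""                    # unwritten extent flagging off
for ag in 16 90; do plan_run "-d agcount=$ag" ""; done
for size in 32 64; do plan_run "-l size=${size}m" ""; done
for bufs in 6 8; do plan_run "" "logbufs=$bufs"; done
plan_run "" "noatime,nodiratime"                # access time logging off
```

Printing first and executing later is a deliberate choice here: mkfs.xfs destroys whatever is on the target volume, so it pays to read the plan before piping it into a shell.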

Grand Finale

I decide that the best combination of options appears to be an XFS filesystem created with the default 45 allocation groups, but a 64 megabyte log, and mounted with 8 log buffers, and atime disabled (mount -o noatime,nodiratime,logbufs=8). Here are the benches for this über-tweaked XFS, compared to a standard XFS (and the one before the hdparm tricks, just for kicks):

          Sequential Output                Sequential Input
          Char       Block      Rewrite    Char       Block
nohdparm  4347(83%)  5202(6%)   1995(12%)  3816(81%)  4222(13%)
Standard  8950(99%)  49915(40%) 14142(15%) 8938(97%)  41418(26%)
Tweaked   8936(99%)  53158(42%) 15079(17%) 8923(97%)  41401(25%)

          RndSeeks   Sequential Create           Random Create
                     Create   Read     Delete    Create   Read    Delete
nohdparm  162.0(3%)  72(8%)   71(28%)  495(30%)  72(8%)   33(10%) 232(14%)
Standard  232.1(2%)  369(25%) 270(14%) 1695(51%) 362(25%) 57(2%)  340(12%)
Tweaked   237.5(2%)  437(32%) 275(13%) 2331(87%) 396(29%) 63(3%)  872(35%)
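To put rough numbers on the comparison, the relative change between two table entries can be computed with awk. A small sketch: pct_change is a hypothetical helper name, and the inputs are copied from the tables in this article.

```shell
#!/bin/sh
# pct_change OLD NEW: signed percentage change from OLD to NEW.
# pct_change is a hypothetical helper name used for illustration only.
pct_change() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%+.1f%%\n", (new - old) / old * 100 }'
}

# Tweaked XFS against standard XFS
pct_change 1695 2331     # sequential delete
pct_change 340 872       # random delete
pct_change 49915 53158   # block output

# Tweaked XFS against ext3 (post-hdparm figures)
pct_change 1932 2331     # sequential delete
pct_change 639 872       # random delete
```

Against the standard XFS, that is about +37.5% on sequential deletes and +156.5% on random deletes, with block output up about 6.5%. Against ext3, the tweaked XFS deletes roughly 20.7% (sequential) and 36.5% (random) more files per second.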

Improvements more or less across the board, and the few setbacks are so small that they can be written off as statistical noise. More interestingly, my delete performance has actually surpassed that of ext3, for both random and sequential deletes! The biggest weakness of XFS has been eliminated, and my spankin' new filesystem is ready to rock. Cheers!
