
Monday, November 14, 2011

RHEL 6 Part IV: Placing XFS into production and measuring performance

Making an XFS filesystem for a production environment

It's about time we saw some action, beyond the first sysadmin impressions of RHEL 6 described in the previous article of the series. One of the first fundamental differences between RHEL 5 and 6 is support for XFS filesystem deployments. Why would you care about XFS? Simply put, apart from its multi-threaded performance, if you are an ext4 kind of guy and you are likely to store more than 16 TiB on a single volume, then XFS is your best choice (ext4 can in theory support filesystems up to 1 EiB, however the accompanying filesystem utilities, and the support coverage on those utilities, limit the supported size of a volume to 16 TiB).

Another kind of 'gotcha' (which I really dislike about Red Hat) is that in RHEL 6 you should not take XFS support for granted, unless your license includes the fees for the appropriate layered product, called the "Scalable File System Add-On" (my own translation: "give us your money if you want filesystem support over 16 TiB" :-) ). If you have paid for a basic RHEL 6 license, your machine is registered with RHN, mkfs.xfs is missing from your root path and a yum search xfsprogs returns nothing, you know that you need to look into your pocket and not your yum repository config.

If you do not want to spend money and are willing to risk running an XFS installation without support, head over to the nearest CentOS 6 repository, download the xfsprogs and xfsprogs-devel RPMs, run a yum install on those two RPMs and you will be good to go.
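As a quick sanity check before spending money or hunting for RPMs, a small sketch along these lines tells you whether xfsprogs is already on the box; the yum localinstall hint it prints is illustrative of the CentOS fallback, not an exact recipe:

```shell
# Check for mkfs.xfs on the PATH; if absent, suggest the CentOS 6 RPM route.
if command -v mkfs.xfs >/dev/null 2>&1; then
    echo "xfsprogs present: $(command -v mkfs.xfs)"
else
    echo "mkfs.xfs not found; fetch xfsprogs and xfsprogs-devel from a CentOS 6 mirror, then:"
    echo "  yum localinstall xfsprogs-*.rpm xfsprogs-devel-*.rpm"
fi
```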

I used a simple Direct Attached Storage setup: a Dell PowerEdge R815 server fitted with an H800 PERC 6Gb SAS controller, driving a single Dell MD1200 cabinet holding 12 x 2 TB Nearline 6Gb SAS drives. Four of them were used for the purposes of the test, in a RAID-0 config. To be precise, for those of you familiar with the OMSA setup, here is the exact config as reported by the omreport storage vdisk OMSA command:

ID                  : 2
Status              : Ok
Name                : EMBNETGALAXY
State               : Ready
Encrypted           : Not Applicable
Layout              : RAID-0
Size                : 7,450.00 GB (7999376588800 bytes)
Device Name         : /dev/sdd
Bus Protocol        : SAS
Media               : HDD
Read Policy         : Read Ahead
Write Policy        : Write Back
Cache Policy        : Not Applicable
Stripe Element Size : 256 KB
Disk Cache Policy   : Enabled

Returning to OS land, the first step is to hand the hardware-built virtual disk (vdisk) over to LVM2, so I can have the luxury of expanding the filesystem at will in the future.

[root@biotin src]# pvcreate /dev/sdd
  Physical volume "/dev/sdd" successfully created
[root@biotin src]# vgcreate VGEMBGalaxy /dev/sdd
  Volume group "VGEMBGalaxy" successfully created
[root@biotin src]# lvcreate -L 5T -n LVembgalaxy VGEMBGalaxy
  Logical volume "LVembgalaxy" created

At this point, I have tagged the hardware-created vdisk (/dev/sdd) as an LVM physical volume, created my Volume Group and made a 5 TiB Logical Volume on which to build my XFS filesystem (I am deliberately not using the full size of the PV, so I can demonstrate XFS expansion later on). Now, let's build the actual XFS filesystem:

[root@biotin src]# mkfs.xfs -d su=256k,sw=4 /dev/VGEMBGalaxy/LVembgalaxy
meta-data=/dev/VGEMBGalaxy/LVembgalaxy isize=256    agcount=5, agsize=268435392 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=1342176960, imaxpct=5
         =                       sunit=64     swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
This builds the XFS filesystem on top of the LVM Logical Volume (/dev/VGEMBGalaxy/LVembgalaxy). You might have noticed that the specified stripe unit (su) size and number of disks (sw) match the config of the H800 vdisk, as given earlier by the output of the omreport storage vdisk command. Good system practice dictates that these parameters are passed to the mkfs.xfs utility, in order to improve filesystem performance.
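You can cross-check the geometry that mkfs.xfs reported against the controller config with a little arithmetic; the sunit/swidth values below are the ones from the mkfs.xfs output above, expressed in filesystem blocks:

```shell
# Cross-check the stripe geometry mkfs.xfs reported against the H800 vdisk.
# sunit/swidth are printed in filesystem blocks (bsize=4096 here) and should
# correspond to the 256 KB stripe element and the 4-disk RAID-0.
bsize=4096
sunit=64      # blocks, from the mkfs.xfs output
swidth=256    # blocks, from the mkfs.xfs output
echo "stripe unit : $(( sunit * bsize )) bytes"   # 262144 bytes = 256 KiB
echo "data disks  : $(( swidth / sunit ))"        # 4, matching sw=4
```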

We are now ready to mount the filesystem, so we make sure the mountpoint exists and enter an entry to the /etc/fstab:

/dev/VGEMBGalaxy/LVembgalaxy    /storage/tools          xfs rw,nobarrier,inode64            0 0

Note the nobarrier and inode64 flags. The first (also applicable to ext4 filesystems) gives you a bit of an extra performance boost, if and only if your disk controller cache memory is battery backed (and the battery is good AND you have a UPS to shut the system down properly). The inode64 flag serves the same performance objective, although it can break some older applications (old NFS v3 clients that import the XFS partition over NFS, and applications whose binaries are more than 4-5 years old and write locally to the disk). A mount -a later and you should be able to see the XFS filesystem accessible:

[root@biotin src]# df -h 
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VGEMBGalaxy-LVembgalaxy
                      5.0T   33M  5.0T   1% /storage/tools

One thing you will also note is that the default settings give you a substantially larger number of available inodes, compared to similarly sized ext4 filesystems:

[root@biotin src]# df -ih
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
                        346M      12    346M    1% /storage/area1
                        346M      11    346M    1% /storage/area2
                        1.0G       3    1.0G    1% /storage/tools
Now, let's say that all is good, you go ahead and use the filesystem, and after some time your users fill up the volume. How about expanding it by, say, a couple of TiB, to give them some breathing space? Sure, quite easily, without even taking the filesystem off-line (unmounting it). First, we extend the LV:

[root@biotin src]# lvextend -L+2T /dev/VGEMBGalaxy/LVembgalaxy
  Extending logical volume LVembgalaxy to 7.00 TiB
  Logical volume LVembgalaxy successfully resized

And then tell XFS to grow to the size of the extended LV:

[root@biotin src]# xfs_growfs /storage/tools/
meta-data=/dev/mapper/VGEMBGalaxy-LVembgalaxy isize=256    agcount=5, agsize=268435392 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=1342176960, imaxpct=5
         =                       sunit=64     swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 1342176960 to 1879048192

Now, a df -h confirms the almost instant resize operation:

[root@biotin src]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VGEMBGalaxy-LVembgalaxy
                      7.0T   33M  7.0T   1% /storage/tools

Quick, simple and efficient. The sort of thing you would expect from a scalable filesystem.

Measuring the performance envelope of XFS

So what can XFS really do in terms of performance? There is useful info on the web, and many sysadmins have tried to compare and contrast XFS against the popular ext4 filesystem. Here is my method:

I employ iozone, a well-tested filesystem benchmarking tool, first on an ext4 volume and then on the newly constructed XFS volume. Both volumes have exactly the same RAID config (RAID 0, 4 disks), run on the same type of hardware, and use the same filesystem block size (4 KiB).

Both filesystems were mounted with the flags from their respective fstab entries; for XFS these are the rw,nobarrier,inode64 flags shown earlier.
The benchmarks are run in the following order:

  • first, the ext4 benchmark is run
  • the box is then rebooted, to make sure no VFS cache/memory effects carry over into the results
  • finally, the XFS volume benchmark is run
During both tests all other I/O activity on the box is excluded (no user logins, and services are kept to a minimum; you might also find it useful to disable SELinux. There is always the option of running the benchmarks in single user mode, but I wanted to monitor the box remotely as I was writing this).

The entire procedure is repeated five times, and the arithmetic mean of the results is reported on the graphs.
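Since iozone writes its figures into the .xls report files, a small helper along the following lines can compute the mean of the five repetitions; the throughput numbers below are placeholders for illustration, not measured results:

```shell
# Average the throughput of the five repeated runs of one test point.
# The values are hypothetical KB/s figures pulled from the five reports.
runs="512000 498000 505000 501000 499000"
mean=$(echo $runs | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "%d", s / NF }')
echo "mean throughput: ${mean} KB/s"
```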

Both tests were performed by using the following iozone command:

nohup ./iozone -S 512 -f volume_file_path -P0 -Ra -i0 -i1 -i2 -i4 -n 512g -g 1024g -y64k -q512k > fileresultsFSTYPE.xls &

The iozone manual will help you decipher the meaning of the switch options, but briefly, the command encompasses parameters that ensure we get meaningful results, given the server's RAM size, the processor cache size and the test conditions. The volume_file_path is the absolute path on the volume where the test file should reside (i.e. the volume/partition under test).
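For reference, here is a summary of what each switch does, as I read the iozone documentation (the note on -n assumes, as stated above, that the file sizes were chosen with the server's RAM in mind):

```shell
# Switch summary for the iozone command above (per the iozone documentation):
#   -S 512      processor cache size hint, in KB
#   -f <path>   location of the test file (on the volume under test)
#   -P 0        bind iozone to processor 0, for repeatable runs
#   -R -a       Excel-style report output, fully automatic mode
#   -i 0        write / re-write test
#   -i 1        read / re-read test
#   -i 2        random read / random write test
#   -i 4        re-write-record test
#   -n 512g     minimum test file size (well above RAM, to defeat caching)
#   -g 1024g    maximum test file size
#   -y 64k      minimum record (request) size
#   -q 512k     maximum record (request) size
```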

Please note that these tests take weeks to complete properly, so should you wish to perform similar tests on a system, make sure you schedule enough downtime to complete them without additional activity on the box.

Here are the results.

These should make the difference clear, showing in summary that as far as sequential I/O performance is concerned, XFS is better. For random I/O (the smaller figures on the right of the graphs), XFS is also faster on random writes.

Want a scalable solution that can give you decent performance, and have you been on ext4 so far while your single-volume data production rises? Think again and consider XFS!


  1. Nice article.

    Is the standard deviation between the various test runs negligible, and are the runs comparable?

    Did you collect/graph performance stats (eg. usage, cpu, queue time) ?

    Did you measure the results changing the queue scheduler elevator?

    Thx again for your post + Peace,

  2. Hi piccolapatria,

    Thanks for your kind words. In terms of your questions:
    1) I have looked at the standard deviation between the iterations and I find the results comparable. If you would like, I can give you the spreadsheets, so you can check.

    2) Yes, I collected performance stats using the dstat tool: CPU usage, wait queue, context switches and interrupts.

    3) No, I did not change the default queue scheduler. The idea was to test a default config of RHEL 5 versus a default config of RHEL 6. Obviously, by tuning the schedulers you could improve the results for certain workloads.