
The Best File System in the World? Part 2

Peter Baer Galvin

In the last Solaris Corner, I began coverage of the startling new Solaris 10+ file system -- ZFS. This month, I conclude that coverage by showing the advanced features of ZFS, as we all champ at the bit waiting for ZFS to be available for production use sometime in 2006.

ZFS Overview

Last month, I described ZFS in general, including its design goals and the basics of use, so there is no reason to repeat that information here. But by way of a quick summary to refresh the topic, consider that ZFS is a 128-bit integrated file system and volume manager, with advanced features, high performance, and ease of use.

Last month, the following features were described and demonstrated:

  • Creation, deletion, and modification of a RAID-ed pool of storage via the zpool command.
  • Creation of several file systems via the zfs command, including file systems within file systems.
  • Setting of parameters of those file systems, including compression and mount point.
  • Taking a snapshot of ZFS file systems.

With that background, we can now consider the more advanced features of ZFS.

Quotas, Reservations, and NFS

The best way to think of a ZFS file system is as a collection of attributes. This is far different from the normal view of a file system as a disk allocation. For example, a ZFS file system can contain a hierarchy of ZFS file systems, each assuming the attributes of its parent unless told otherwise (via zfs set). It is very reasonable to have thousands of ZFS file systems on a computer. For example, Sun kernel engineers create a ZFS file system per engineer. Then attributes like compression can be set per engineer. Each file system can have a quota to limit its growth. It can also have a reservation to guarantee that the reserved amount of space is available to that file system, as in:

# zfs set reservation=5g zfs/opt
Without a reservation set, a ZFS file system takes essentially no disk space until it gets some content. This attribute model is also used for file system sharing. To export a file system for NFS mounting:

# zfs set sharenfs=rw=pbg zfs/test
which results in the appropriate system information being set in /etc/dfs/sharetab but no entry in /etc/dfs/dfstab:

# cat /etc/dfs/sharetab    
/zfs/test       -       nfs     rw=pbg
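The attribute model makes space management a one-line affair per file system. As a dry-run sketch of the commands involved (the zfs shell function below shadows the real command so each call only prints what it would run; the file system names here are illustrative, not from this system):

```shell
#!/bin/sh
# Dry-run sketch of the ZFS attribute model. The zfs function shadows the
# real command, so each call just prints the command line it would run.
# On a real system, remove the stub and run these as root.
zfs() { echo "zfs $*"; }

demo() {
    zfs set quota=10g zfs/home/alice        # cap this engineer's space at 10 GB
    zfs set reservation=5g zfs/opt          # guarantee 5 GB of pool space
    zfs set sharenfs=rw=pbg zfs/test        # export over NFS, rw for host pbg
    zfs get quota,reservation zfs/home/alice
}
demo
```

Quota, reservation, and sharenfs are all just attributes set the same way, which is the point: there is no separate exports file or volume-resize step to keep in sync.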

Lots of Snapshots

Now let's explore the more advanced features of ZFS. To begin, I'll take advantage of ZFS's ability to create a nearly infinite number of very low-cost snapshots. To control and manage the snapshots, a script that can delete old snapshots and create new ones at given intervals for all ZFS file systems is desirable. The following script was written by Chris Gerhard and posted at:

http://blog.sun.com/roller/page/chrisg?entry=snapping_every_minute
but I think it is instructive to include it here to show the power of ZFS (with the date format "%e" changed to "%d" to fix a bug):

#!/bin/ksh -p

function take_snap
{
        if zfs list -H -o name $1 >/dev/null 2>&1
        then
                zfs destroy $1
        fi
        zfs snapshot ${1}
}

case ${1:-boot} in
        "boot")
                snap=$(date '+%F-%T')
                ;;
        "minute")
                snap=minute_$(date +%M)
                ;;
        "hour")
                snap=hour_$(date +%H)
                ;;
        "day")
                snap=day_$(date +%d)
                ;;
        "month")
                snap=month_$(date +%m)
                ;;
esac

for fs in $(zfs list -H -o name -t filesystem)
do
        take_snap ${fs}@${snap}
done
Now these cron entries will create snapshots for each ZFS file system, once a minute for an hour, once an hour for a day, once a day for a month, and once a month for a year:

* * * * * /opt/snapshot minute > /dev/null 2>&1
0 * * * * /opt/snapshot hour > /dev/null 2>&1
1 1 * * * /opt/snapshot day > /dev/null 2>&1
2 1 1 * * /opt/snapshot month > /dev/null 2>&1
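The rotation falls out of the naming scheme: a minute snapshot is always named for its minute-of-hour, so only sixty names exist and take_snap destroys the old holder of a name before reusing it. A small sketch of that naming logic, mirroring the case statement in the script (this uses GNU date's -d flag, which Solaris /usr/bin/date lacks, purely to pin the timestamp for the example):

```shell
#!/bin/sh
# Sketch of the snapshot-naming scheme used by the cron script. snapname
# maps an interval and a timestamp to the snapshot name the script would
# use; because names cycle (00-59 minutes, 00-23 hours, ...), each new
# snapshot replaces its counterpart from the previous period.
snapname() {
    case $1 in
        minute) date -d "$2" +minute_%M ;;
        hour)   date -d "$2" +hour_%H ;;
        day)    date -d "$2" +day_%d ;;
        month)  date -d "$2" +month_%m ;;
    esac
}

snapname minute "2006-03-15 14:05:09"   # minute_05
snapname day    "2006-03-15 14:05:09"   # day_15
```

One hour later, "minute_05" comes up again, the old minute_05 snapshot is destroyed, and a fresh one takes its name, which is exactly the windowed retention the cron entries describe.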
The result on my home terabyte file server looks like this, for file system zfs/zones:

# zfs list | grep zones
zfs/zones              145M   454G   113M  /opt/zones
zfs/zones@day_20       142K      -   110M  -
zfs/zones@day_21       119K      -   110M  -
zfs/zones@day_22       119K      -   110M  -
zfs/zones@day_23       119K      -   110M  -
zfs/zones@day_24       120K      -   110M  -
zfs/zones@day_25       121K      -   110M  -
zfs/zones@day_26       121K      -   110M  -
zfs/zones@day_27       122K      -   110M  -
zfs/zones@day_28       123K      -   110M  -
zfs/zones@day_29       123K      -   110M  -
zfs/zones@day_30       124K      -   110M  -
zfs/zones@day_31       124K      -   110M  -
zfs/zones@month_01     125K      -   110M  -
zfs/zones@hour_11     55.0K      -   113M  -
zfs/zones@hour_12     55.0K      -   113M  -
zfs/zones@hour_13     55.0K      -   113M  -
zfs/zones@hour_14     55.0K      -   113M  -
. . .
Currently, this script has created 996 snapshots on the server. Consider the ramifications of this use of snapshots before adopting it in your own environment -- original versions of modified (and deleted) files will stay on the system, taking up space for a year! If changes are made that you don't want around, you can destroy snapshots:

# zfs destroy zfs@minute_27
You might also want to delete all the snapshots of a ZFS file system when major changes are made (and you know you won't need to undo them) via this "snapdestroy" script posted by Carson Little at an opensolaris.org forum:

for fs in $(zfs list -H -o name -t snapshot | grep "$1")
do
        zfs destroy -R $fs
done
Note that this script accepts one command-line argument: the name of the file system that should have all of its snapshots deleted.

To reset the state of the file system to that of a snapshot, simply use the rollback option:

# zfs rollback -r zfs/zones@minute_00
The -r option tells ZFS to delete all of the snapshots between the one selected and the current file system. Those wouldn't make any sense anyway, once the file system was rolled back. Without -r, only the latest snapshot can be used as the rollback source.

Clones

Once snapshots are mastered, several other ZFS features build upon them. One such feature is the clone. A clone is a writable file system based on a read-only snapshot. A clone cannot be created directly from a ZFS file system; it must be made from a snapshot:

# zfs clone zfs/zones@minute_34 zfs/zones/clone-zones-minute_34
# df -kh
Filesystem    size  used  avail capacity  Mounted on
. . .
zfs/zones     1.1T  113M   478G     1%    /opt/zones
zfs/zones/clone-zones-minute_34
              1.1T  113M   478G     1%    /opt/zones/clone-zones-minute_34
Now this system has a clone of the zones@minute_34 snapshot, mounted under /opt/zones/clone-zones-minute_34. This directory is exactly what the zones file system looked like when the snapshot was created, and is writeable to boot. Note that, as with a snapshot, it takes up very little space, because only blocks that change after the clone is created consume new storage. In this case, the clone is taking up 33K:

# zfs list -o name,type,used,available,referenced,mountpoint \
  zfs/zones/clone-zones-minute_34
NAME                             TYPE        USED  AVAIL REFER MOUNTPOINT
zfs/zones/clone-zones-minute_34  filesystem  33.0K  478G  113M /opt/zones/ \
                                                       clone-zones-minute_34

Backup/Restore and Replication

For the most part, any backup software that uses the standard Unix file system interfaces should work with ZFS. NetBackup seems to work, for example, except for the new Access Control List (ACL) mechanism in ZFS. The backups and restores will work for the files, but the ACLs are silently dropped. I hope Veritas is working to solve this. On the other hand, ZFS has its own basic backup/restore mechanism built-in.

For example, ZFS can create a backup file containing the contents of a snapshot:

# zfs backup zfs/test@minute_55 > /tmp/zfs-test-backup
The resulting file contains all of the information needed by ZFS to recreate that file system. It is also a standard data file, so it could be backed up by any appropriate software.

A restore occurs via:

# zfs restore zfs/test/restore < /tmp/zfs-test-backup
which results in:

# df -kh
Filesystem             size   used  avail capacity  Mounted on
. . .
zfs/test               1.1T    17K   476G     1%    /zfs/test
zfs/test/restore       1.1T    16K   476G     1%    /zfs/test/restore
Both full and incremental backups are available. In this example from the ZFS documentation, the delta between the two specified snapshots is sent to a tape drive:

# zfs backup -i tank/dana@111505 tank/dana@now > /dev/rmt/0
Replication is available by piping a backup from one machine to a restore on another.
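A sketch of what that pipeline would look like (the host and dataset names are made up, and replicate() below only prints the command rather than running it, since the real thing needs root and a second machine):

```shell
#!/bin/sh
# Sketch of snapshot replication between hosts: a backup stream piped over
# ssh into a restore on the far side. The function only prints the
# pipeline; snapshot, host, and destination names are illustrative.
replicate() {
    snap=$1 host=$2 dest=$3
    echo "zfs backup $snap | ssh $host zfs restore $dest"
}
replicate tank/dana@now backuphost tank/dana-copy
```

Combined with the -i incremental form shown above, this is enough to keep a warm copy of a file system on another machine from cron.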

All of these issues are described in more detail at Tim Foster's blog:

http://blogs.sun.com/roller/page/timf?entry=zfs_backup

ZFS in Zones

ZFS and zones (a.k.a. containers) have a special relationship. Certainly zones can be created on top of ZFS file systems. But, uniquely, one or more ZFS file systems can be given to a zone to own. The file systems are wholly owned by the child zone and, for example, do not appear in the global zone's df output.

Consider my server system, as seen from the global zone:

# df -kh
Filesystem             size   used  avail capacity  Mounted on
. . .
zfs                    1.1T    18K   476G     1%    /zfs
zfs/big                1.1T   462G   476G    50%    /zfs/big
zfs/home               1.1T   164G   476G    26%    /export/home
zfs/opt                1.1T   3.8G   478G     1%    /opt
zfs/zones              1.1T   113M   476G     1%    /opt/zones
zfs/mqueue             1.1T    16K   476G     1%    /var/spool/mqueue
And now from the "www" zone:

# zlogin www
[Connected to zone 'www' pts/4]
Last login: Mon Dec 19 20:47:37 on console
Sun Microsystems Inc.   SunOS 5.11      snv_28  October 2007
 
# df -kh
Filesystem             size   used  avail capacity  Mounted on
. . .
zfs/www                1.1T    17K   476G     1%    /zfs/www
zfs/www/apache         1.1T    10M   476G     1%    /var/apache
zfs/www/mail           1.1T    16K   476G     1%    /zfs/www/mail
zfs/www/mqueue         1.1T    16K   476G     1%    /var/spool/mqueue
Note the zfs/www file system that is not visible to the global zone. The zone given ownership of a ZFS file system can change attributes of the file system, as well as create new ZFS file systems within the file system. This is a very handy feature.

To give a ZFS file system to a zone, create the file system in the global zone, and then use zonecfg:

# zonecfg -z www
zonecfg:www> add dataset
zonecfg:www:dataset> set name=zfs/www
zonecfg:www:dataset> end               
zonecfg:www> info
zonename: www    
zonepath: /opt/zones/www
autoboot: true
pool:
inherit-pkg-dir:
        dir: /lib
inherit-pkg-dir:
        dir: /platform
inherit-pkg-dir:
        dir: /sbin
inherit-pkg-dir:
        dir: /usr
inherit-pkg-dir:
        dir: /opt
net:
        address: 10.1.112.24
        physical: e1000g0
dataset:
        name: zfs/www
A reboot of the zone makes the new file system available to it.
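The same delegation can be scripted by feeding zonecfg its commands on standard input instead of typing them interactively. A dry-run sketch (zonecfg is stubbed here to echo what it receives; on a real system, drop the stub and run as root):

```shell
#!/bin/sh
# Dry-run sketch of non-interactive dataset delegation. The zonecfg stub
# prints its arguments and the command file it is fed, instead of actually
# configuring the zone.
zonecfg() { echo "zonecfg $*"; sed 's/^/    /'; }

delegate() {
    zonecfg -z www <<EOF
add dataset
set name=zfs/www
end
commit
EOF
}
delegate
```

Scripting zonecfg this way makes the delegation repeatable, which matters if you are creating many zones, each owning its own ZFS file system.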

Discussions

There is a lively discussion of ZFS at:

http://www.opensolaris.org/jive/forum.jspa?forumID=80&start=0
about advanced features and their uses. For example, there is a useful script showing how to automatically create a clone, perform a backup from that clone, and finally delete the clone. Over at blogs.sun.com:

http://www.blogs.sun.com/roller/page/jclingan?entry=create_a_zone_using_zfs
there is other interesting ZFS discussion, including a script (unsupported) to create a new zone via ZFS in about a second!

Also in the forums is mention of a useful ZFS testing technique. Without even having spare partitions available, you can get started experimenting with ZFS by creating an empty file on a current file system (e.g., via mkfile) and creating a zpool from that empty file.
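A sketch of that technique follows. The two stub functions make this a dry run that only prints the steps; on a real Solaris system, remove the stubs and run as root (the paths, sizes, and pool name are arbitrary examples):

```shell
#!/bin/sh
# Dry-run sketch of building a throwaway ZFS pool on top of plain files.
# mkfile and zpool are stubbed to print what they would do.
mkfile() { echo "mkfile $*"; }
zpool()  { echo "zpool $*"; }

make_test_pool() {
    mkfile 128m /tmp/zdisk0 /tmp/zdisk1     # two 128 MB backing files
    zpool create testpool mirror /tmp/zdisk0 /tmp/zdisk1
}
make_test_pool
```

Because the "disks" are just files, you can experiment with mirroring, deliberately corrupt a backing file to watch self-healing, and then throw the whole pool away without touching real storage.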

A nice cheat sheet showing some of the things you can do with ZFS is available at:

http://www.colinseymour.co.uk/techie/solaris-10-stuff/zfs-cheatsheet/

Conclusions

ZFS is a revolutionary file system/volume management facility. Even though it is not yet shipping in production, it has a rich feature set and tremendously easy management. Before it is included in a production Solaris release, it is likely to improve even more, making it a strong candidate for many sites when that time comes.

Perhaps the most astounding bit of data comes from Eric Schrock's blog at:

http://blogs.sun.com/roller/page/eschrock?entry=ufs_svm_vs_zfs_code
Here is a count of the number of lines in the Solaris implementation of UFS and its volume manager (SVM), compared to ZFS:

-------------------------------------------------
  UFS: kernel= 46806   user= 40147   total= 86953
  SVM: kernel= 75917   user=161984   total=237901
TOTAL: kernel=122723   user=202131   total=324854
-------------------------------------------------
  ZFS: kernel= 50239   user= 21073   total= 71312
-------------------------------------------------
This information is certainly food for thought and reason to hope that ZFS will be stable and reliable when it ships for production use. I can hardly wait.

Peter Baer Galvin (http://www.petergalvin.info) is the Chief Technologist for Corporate Technologies (www.cptech.com), a premier systems integrator and VAR. Before that, Peter was the systems manager for Brown University's Computer Science Department. He has written articles for Byte and other magazines, and previously wrote Pete's Wicked World, the security column, and Pete's Super Systems, the systems management column for Unix Insider (http://www.unixinsider.com). Peter is coauthor of the Operating Systems Concepts and Applied Operating Systems Concepts textbooks. As a consultant and trainer, Peter has taught tutorials and given talks on security and systems administration worldwide.