The Best File System in the World? Part 2
Peter Baer Galvin
In the last Solaris Corner, I began coverage of the startling
new Solaris 10+ file system -- ZFS. This month, I conclude that
coverage by showing the advanced features of ZFS, as we all champ
at the bit waiting for ZFS to be available for production use sometime
in 2006.
ZFS Overview
Last month, I described ZFS in general, including its design goals
and the basics of use, so there is no reason to repeat that information
here. As a quick refresher, though, ZFS is a 128-bit integrated file
system and volume manager, with advanced features, high performance,
and ease of use.
Last month, the following features were described and demonstrated:
- Creation, deletion, and modification of a RAID-ed pool of storage
via the zpool command.
- Creation of several file systems via the zfs command,
including file systems within file systems.
- Setting of parameters of those file systems, including compression
and mount point.
- Taking a snapshot of ZFS file systems.
With that background, we can now consider the more advanced features
of ZFS.
Quotas, Reservations, and NFS
The best way to think of a ZFS file system is as a collection
of attributes. This is far different from the normal view of a file
system as a disk allocation. For example, a ZFS file system can
contain a hierarchy of ZFS file systems, each assuming the attributes
of its parent unless told otherwise (via zfs set). It is
very reasonable to have thousands of ZFS file systems on a computer.
For example, Sun kernel engineers create a ZFS file system per engineer.
Then attributes like compression can be set per engineer. Each file
system can have a quota to limit its growth. It can also have a
reservation to guarantee that the reserved amount of space is available
to that file system, as in:
# zfs set reservation=5g zfs/opt
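The per-engineer layout described above could be sketched roughly as follows. This is only an illustration: the function name make_user_fs, the parent file system, and the sizes are hypothetical, and the only zfs operations used are create and set.

```shell
#!/bin/sh
# Sketch: one ZFS file system per user, each carrying its own attributes.
# make_user_fs is a hypothetical helper name, not part of ZFS.

make_user_fs() {
    parent=$1    # existing parent file system, e.g. zfs/home
    user=$2      # user name; becomes the child file system name
    quota=$3     # upper limit on growth, e.g. 10g
    reserve=$4   # space guaranteed to this file system, e.g. 1g

    # Create the child, then attach a quota and a reservation to it.
    # Attributes that are not set here are inherited from the parent.
    zfs create "$parent/$user" &&
    zfs set quota="$quota" "$parent/$user" &&
    zfs set reservation="$reserve" "$parent/$user"
}
```

Calling make_user_fs zfs/home pbg 10g 1g would create zfs/home/pbg, cap it at 10 GB, and guarantee it 1 GB.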
Without a reservation set, a ZFS file system essentially takes no
disk space until it gets some content. This attribute model is also
used for file system sharing. To export a file system for NFS mounting:
# zfs set sharenfs=rw=pbg zfs/test
which records the share in /etc/dfs/sharetab but adds no entry to
/etc/dfs/dfstab:
# cat /etc/dfs/sharetab
/zfs/test - nfs rw=pbg
Lots of Snapshots
Now let's explore the more advanced features of ZFS. To begin,
I'll take advantage of ZFS's ability to create a nearly infinite
number of very low-cost snapshots. To control and manage the snapshots,
a script that can delete old snapshots and create new ones at given
intervals for all ZFS file systems is desirable. The following script
was written by Chris Gerhard and posted at:
http://blog.sun.com/roller/page/chrisg?entry=snapping_every_minute
but I think it is instructive to include it here to show the power
of ZFS (with the date format "%e" changed to "%d" to fix a bug):
#!/bin/ksh -p
function take_snap
{
    if zfs list -H -o name $1 >/dev/null 2>&1
    then
        zfs destroy $1
    fi
    zfs snapshot ${1}
}

case ${1:-boot} in
    "boot")
        snap=$(date '+%F-%T')
        ;;
    "minute")
        snap=minute_$(date +%M)
        ;;
    "hour")
        snap=hour_$(date +%H)
        ;;
    "day")
        snap=day_$(date +%d)
        ;;
    "month")
        snap=month_$(date +%m)
        ;;
esac

for fs in $(zfs list -H -o name -t filesystem)
do
    take_snap ${fs}@${snap}
done
Now this cron entry will create snapshots for each ZFS file system,
once a minute for an hour, once an hour for a day, once a day for
a month, and once a month for a year:
* * * * * /opt/snapshot minute > /dev/null 2>&1
0 * * * * /opt/snapshot hour > /dev/null 2>&1
1 1 * * * /opt/snapshot day > /dev/null 2>&1
2 1 1 * * /opt/snapshot month > /dev/null 2>&1
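Because the snapshot names repeat (minute_00 through minute_59, and so on), each new snapshot replaces the one taken in the same slot a cycle earlier, so the number kept per file system is bounded. A small sketch of that arithmetic follows; the function names are hypothetical, and the "boot" snapshots are left out because their timestamped names never repeat and so are never replaced.

```shell
#!/bin/sh
# Sketch: upper bound on the recycled snapshots kept per file system
# under the minute/hour/day/month schedule.

snapshots_per_fs() {
    minutes=60   # minute_00 .. minute_59
    hours=24     # hour_00 .. hour_23
    days=31      # day_01 .. day_31
    months=12    # month_01 .. month_12
    echo $((minutes + hours + days + months))
}

# Bound for a whole pool, given the number of file systems in it.
# The count is supplied by the caller; it is not taken from the column.
snapshots_total() {
    echo $(( $(snapshots_per_fs) * $1 ))
}
```

snapshots_per_fs prints 127, so a pool with eight file systems (a made-up figure) would settle at no more than 1016 recycled snapshots.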
The result on my home terabyte file server looks like this, for file
system zfs/zones:
# zfs list | grep zones
zfs/zones 145M 454G 113M /opt/zones
zfs/zones@day_20 142K - 110M -
zfs/zones@day_21 119K - 110M -
zfs/zones@day_22 119K - 110M -
zfs/zones@day_23 119K - 110M -
zfs/zones@day_24 120K - 110M -
zfs/zones@day_25 121K - 110M -
zfs/zones@day_26 121K - 110M -
zfs/zones@day_27 122K - 110M -
zfs/zones@day_28 123K - 110M -
zfs/zones@day_29 123K - 110M -
zfs/zones@day_30 124K - 110M -
zfs/zones@day_31 124K - 110M -
zfs/zones@month_01 125K - 110M -
zfs/zones@hour_11 55.0K - 113M -
zfs/zones@hour_12 55.0K - 113M -
zfs/zones@hour_13 55.0K - 113M -
zfs/zones@hour_14 55.0K - 113M -
. . .
Currently, this script has created 996 snapshots on the server. Consider
the ramifications before using snapshots this way in your own
environment: original versions of modified (and deleted) files will
remain on the system, taking up space, for up to a year! If changes
are made that you don't want kept around, you can destroy snapshots:
# zfs destroy zfs@minute_27
You might also want to delete all the snapshots of a ZFS file system
when major changes are made (and you know you won't need to undo them)
via this "snapdestroy" script posted by Carson Little at an opensolaris.org
forum:
for fs in $(zfs list -H -o name -t snapshot | grep "$1")
do
    zfs destroy -R $fs
done
Note that this script accepts one command-line argument: the name
of the file system that should have all of its snapshots deleted.
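One caveat with a plain grep is that it matches the argument anywhere in the snapshot name, so a request for zfs/zones would also select snapshots of zfs/zones/www. A stricter variant might look like the sketch below. The function names are hypothetical, and this version omits the -R flag, so it will refuse to destroy a snapshot that still has clones.

```shell
#!/bin/sh
# Sketch: select only the snapshots that belong to exactly one file system.

# Reads snapshot names on stdin, one per line, and prints those whose
# file system part (the text before "@") equals the first argument.
snaps_of() {
    fs=$1
    while read -r name
    do
        case $name in
            "$fs"@*) echo "$name" ;;
        esac
    done
}

# Destroy every snapshot of the named file system, and nothing else.
destroy_snaps_of() {
    zfs list -H -o name -t snapshot | snaps_of "$1" |
    while read -r snap
    do
        zfs destroy "$snap"
    done
}
```

For example, destroy_snaps_of zfs/zones leaves the snapshots of any child file systems alone.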
To reset the state of the file system to that of a snapshot, simply
use the rollback option:
# zfs rollback -r zfs/zones@minute_00
The -r option tells ZFS to delete all of the snapshots between
the one selected and the current file system. Those wouldn't make
any sense anyway, once the file system was rolled back. Without -r,
only the latest snapshot can be used as the rollback source.
Clones
Once snapshots are mastered, there are several other ZFS features
built upon them. One such feature is the "clone". A clone is a read-write
snapshot, based on a read-only snapshot. A clone cannot be created
directly from a ZFS file system. It must be made from a snapshot:
# zfs clone zfs/zones@minute_34 zfs/zones/clone-zones-minute_34
# df -kh
Filesystem size used avail capacity Mounted on
. . .
zfs/zones 1.1T 113M 478G 1% /opt/zones
zfs/zones/clone-zones-minute_34
1.1T 113M 478G 1% /opt/zones/clone-zones-minute_34
Now this system has a clone of the zones@minute_34 snapshot, mounted
under /opt/zones/clone-zones-minute_34. This directory is exactly
what the zones file system looked like when the snapshot was created
and is writeable to boot. Note that it takes up very little space,
as with a snapshot, because only the blocks changed since the snapshot
was taken are stored in the clone. In this case, the clone is taking
up 33K:
# zfs list -o name,type,used,available,referenced,mountpoint \
zfs/zones/clone-zones-minute_34
NAME TYPE USED AVAIL REFER MOUNTPOINT
zfs/zones/clone-zones-minute_34 filesystem 33.0K 478G 113M /opt/zones/clone-zones-minute_34
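If clones are made often, the naming pattern used above (clone-, then the file system's last component, then the snapshot tag) can be generated rather than typed. A minimal sketch, with the hypothetical helper names clone_name and clone_snapshot:

```shell
#!/bin/sh
# Sketch: derive a clone name from a snapshot name, following the
# zfs/zones@minute_34 -> zfs/zones/clone-zones-minute_34 pattern.

clone_name() {
    snap=$1            # e.g. zfs/zones@minute_34
    fs=${snap%@*}      # file system part:    zfs/zones
    tag=${snap#*@}     # snapshot tag:        minute_34
    base=${fs##*/}     # last path component: zones
    echo "$fs/clone-$base-$tag"
}

# Create the clone under the file system the snapshot came from.
clone_snapshot() {
    zfs clone "$1" "$(clone_name "$1")"
}
```

clone_snapshot zfs/zones@minute_34 then runs the same zfs clone command shown earlier.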
Backup/Restore and Replication
For the most part, any backup software that uses the standard
Unix file system interfaces should work with ZFS. NetBackup seems
to work, for example, except for the new Access Control List (ACL)
mechanism in ZFS. The backups and restores will work for the files,
but the ACLs are silently dropped. I hope Veritas is working to
solve this. On the other hand, ZFS has its own basic backup/restore
mechanism built-in.
For example, ZFS can create a backup file containing the contents
of a snapshot:
# zfs backup zfs/test@minute_55 > /tmp/zfs-test-backup
The resulting file contains all of the information needed by ZFS to
recreate that file system. It is also a standard data file, so it
could be backed up by any appropriate software.
A restore occurs via:
# zfs restore zfs/test/restore < /tmp/zfs-test-backup
which results in:
# df -kh
Filesystem size used avail capacity Mounted on
. . .
zfs/test 1.1T 17K 476G 1% /zfs/test
zfs/test/restore 1.1T 16K 476G 1% /zfs/test/restore
Both full and incremental backups are available, as this example from
the ZFS documentation shows. Here, the delta between the two specified
snapshots is sent to a tape drive:
# zfs backup -i tank/dana@111505 tank/dana@now > /dev/rmt/0
Replication is available by piping a backup from one machine to a
restore on another.
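Such a pipeline might look like the sketch below. It uses the zfs backup and zfs restore subcommands exactly as they appear in this column and assumes ssh access to the receiving machine. The function name replicate, the host name, and the target file system are hypothetical.

```shell
#!/bin/sh
# Sketch: replicate a snapshot to another machine by piping a backup
# stream into a restore over ssh.

replicate() {
    snap=$1     # local snapshot to send, e.g. zfs/test@minute_55
    host=$2     # receiving machine (must have ZFS and allow ssh)
    target=$3   # file system to create on the receiving machine

    # The backup stream goes to stdout; ssh hands it to the remote
    # restore on stdin, so no intermediate file is needed.
    zfs backup "$snap" | ssh "$host" zfs restore "$target"
}
```

For example: replicate zfs/test@minute_55 otherhost zfs/test-copy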
All of these issues are described in more detail at Tim Foster's
blog:
http://blogs.sun.com/roller/page/timf?entry=zfs_backup
ZFS in Zones
ZFS and zones (a.k.a. containers) have a special relationship.
Certainly zones can be created on top of ZFS file systems. But,
uniquely, one or more ZFS file systems can be given to a zone to
own. The file systems are wholly owned by the child zone and, for
example, do not appear in the global zone's df output.
Consider my server system, as seen from the global zone:
# df -kh
Filesystem size used avail capacity Mounted on
. . .
zfs 1.1T 18K 476G 1% /zfs
zfs/big 1.1T 462G 476G 50% /zfs/big
zfs/home 1.1T 164G 476G 26% /export/home
zfs/opt 1.1T 3.8G 478G 1% /opt
zfs/zones 1.1T 113M 476G 1% /opt/zones
zfs/mqueue 1.1T 16K 476G 1% /var/spool/mqueue
And now from the "www" zone:
# zlogin www
[Connected to zone 'www' pts/4]
Last login: Mon Dec 19 20:47:37 on console
Sun Microsystems Inc. SunOS 5.11 snv_28 October 2007
# df -kh
Filesystem size used avail capacity Mounted on
. . .
zfs/www 1.1T 17K 476G 1% /zfs/www
zfs/www/apache 1.1T 10M 476G 1% /var/apache
zfs/www/mail 1.1T 16K 476G 1% /zfs/www/mail
zfs/www/mqueue 1.1T 16K 476G 1% /var/spool/mqueue
Note that the zfs/www file system is not visible from the global
zone. The zone given ownership of a ZFS file system can change attributes
of the file system, as well as create new ZFS file systems within
the file system. This is a very handy feature.
To give a ZFS file system to a zone, create the file system in
the global zone, and then use zonecfg:
# zonecfg -z www
zonecfg:www> add dataset
zonecfg:www:dataset> set name=zfs/www
zonecfg:www:dataset> end
zonecfg:www> info
zonename: www
zonepath: /opt/zones/www
autoboot: true
pool:
inherit-pkg-dir:
        dir: /lib
inherit-pkg-dir:
        dir: /platform
inherit-pkg-dir:
        dir: /sbin
inherit-pkg-dir:
        dir: /usr
inherit-pkg-dir:
        dir: /opt
net:
        address: 10.1.112.24
        physical: e1000g0
dataset:
        name: zfs/www
After the zone is rebooted, the new file system is available within
it.
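The same delegation can be scripted, since zonecfg also accepts its subcommands as a single quoted argument. The sketch below assumes the file system already exists in the global zone. The function name delegate_dataset is hypothetical.

```shell
#!/bin/sh
# Sketch: hand an existing ZFS file system to a zone without an
# interactive zonecfg session, then reboot the zone to pick it up.

delegate_dataset() {
    zone=$1      # zone that will own the file system, e.g. www
    dataset=$2   # file system created in the global zone, e.g. zfs/www

    # Subcommands are separated by semicolons inside one argument.
    zonecfg -z "$zone" "add dataset; set name=$dataset; end" &&
    zoneadm -z "$zone" reboot
}
```

For example: delegate_dataset www zfs/www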
Discussions
There is a lively discussion of ZFS at:
http://www.opensolaris.org/jive/forum.jspa?forumID=80&start=0
about advanced features and their uses. For example, there is a useful
script showing how to automatically create a clone, perform a backup
from that clone, and finally delete the clone. Over on blogs.sun.com
at:
http://www.blogs.sun.com/roller/page/jclingan?entry=create_a_zone_using_zfs
there is other interesting ZFS discussion, including a script (unsupported)
to create a new zone via ZFS in about a second!
Also in the forums is mention of a useful ZFS testing technique.
Without even having spare partitions available, you can get started
experimenting with ZFS by creating an empty file on a current file
system (e.g., via mkfile) and creating a zpool from that empty file.
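A sketch of that technique follows. The pool name, file path, and size are hypothetical, and the backing file must be given as an absolute path. Treat the 128 MB default as an assumption about a workable size rather than a documented minimum.

```shell
#!/bin/sh
# Sketch: build a throwaway pool on top of an ordinary file, for
# experimenting with ZFS without spare partitions.

make_test_pool() {
    pool=$1            # name for the scratch pool, e.g. testpool
    file=$2            # absolute path of the backing file
    size=${3:-128m}    # size handed to mkfile

    mkfile "$size" "$file" &&
    zpool create "$pool" "$file"
}

# Tear the experiment down again: destroy the pool, then remove the file.
remove_test_pool() {
    zpool destroy "$1" && rm -f "$2"
}
```

For example: make_test_pool testpool /var/tmp/zfs-test-disk, followed later by remove_test_pool testpool /var/tmp/zfs-test-disk.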
A nice cheat sheet showing some of the things you can do with
ZFS is available at:
http://www.colinseymour.co.uk/techie/solaris-10-stuff/zfs-cheatsheet/
Conclusions
ZFS is a revolutionary file system/volume management facility.
Even though it is not yet shipping in production, it has a rich
feature set and tremendously easy management. Before it is included
in a production Solaris release, it is likely to improve even more,
making it a strong candidate for many sites when that time
comes.
Perhaps the most astounding bit of data comes from Eric Schrock's
blog at:
http://blogs.sun.com/roller/page/eschrock?entry=ufs_svm_vs_zfs_code
Here is a count of the number of lines in the Solaris implementation
of UFS and the volume manager, compared to ZFS:
-------------------------------------------------
UFS: kernel= 46806 user= 40147 total= 86953
SVM: kernel= 75917 user=161984 total=237901
TOTAL: kernel=122723 user=202131 total=324854
-------------------------------------------------
ZFS: kernel= 50239 user= 21073 total= 71312
-------------------------------------------------
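To put those counts in proportion, the ratios can be worked out with a few lines of shell arithmetic. The figures are copied from the table; the helper name percent_of is hypothetical, and the results are rounded down to whole percentages.

```shell
#!/bin/sh
# Sketch: ZFS line counts as a percentage of the combined UFS + SVM counts.

# Integer percentage of part ($1) relative to whole ($2), rounded down.
percent_of() {
    echo $(( $1 * 100 / $2 ))
}

# Line counts from the table above.
zfs_kernel=50239;  ufs_svm_kernel=122723
zfs_user=21073;    ufs_svm_user=202131
zfs_total=71312;   ufs_svm_total=324854

echo "kernel: $(percent_of $zfs_kernel $ufs_svm_kernel)%"
echo "user:   $(percent_of $zfs_user $ufs_svm_user)%"
echo "total:  $(percent_of $zfs_total $ufs_svm_total)%"
```

By this measure, ZFS is about 40 percent of the size of UFS plus SVM in the kernel, about 10 percent in user space, and about 21 percent overall.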
This information is certainly food for thought and reason to hope
that ZFS will be stable and reliable when it ships for production
use. I can hardly wait.
Peter Baer Galvin (http://www.petergalvin.info) is the
Chief Technologist for Corporate Technologies (www.cptech.com),
a premier systems integrator and VAR. Before that, Peter was the
systems manager for Brown University's Computer Science Department.
He has written articles for Byte and other magazines, and
previously wrote Pete's Wicked World, the security column, and Pete's
Super Systems, the systems management column for Unix Insider
(http://www.unixinsider.com). Peter is coauthor of the Operating
System Concepts and Applied Operating System Concepts
textbooks. As a consultant and trainer, Peter has taught tutorials
and given talks on security and systems administration worldwide.