February 2006

The Best File System in the World? Part 1

Peter Baer Galvin

In case you haven't noticed, there continues to be a lot going on at Sun. In the last few months, there have been quite a few new product announcements and upgrades. Sun has made both their premier development environment, Studio 11, and their Java Enterprise System tool set open source and free to use. They have included the open source Postgres database with Solaris and are supporting it via the usual channels.

Sun also has announced a new CPU, the UltraSPARC T1. This chip has up to eight cores, with four threads per core (a la hyperthreading). Sun has also released the first systems in their next-generation Opteron-based line (the X2100, X4100, and X4200). These systems have an outstanding feature set and continue to set benchmark records.

Sun's efforts to create a community around OpenSolaris are also bearing fruit, with three non-Sun distributions and counting:

  • Belenix -- http://belenix.sarovar.org/belenix_background.html
  • Nexenta -- http://www.gnusolaris.org/gswiki
  • Schillix -- http://schillix.berlios.de/index.php?id=news or ftp://ftp.berlios.de/pub/schillix/

All of these releases add value to what is delivered with OpenSolaris, with perhaps the most excitement being generated by Nexenta with its use of the OpenSolaris kernel and Debian tools and package management on top.

Although each of these topics deserves its own coverage, they are not the subject of this article. Rather, in this column I'll discuss Sun's new ZFS file system. The anticipation and excitement around this Solaris feature are beyond compare. As you will find out if you read further, this is for good reason.

ZFS

ZFS has been gestating within Sun for several years. As of November 2005, the beta-test program has been rolled into the OpenSolaris release. OpenSolaris build Nevada 27a includes the full source code for ZFS and, as of this writing, is the only way to get access to ZFS. Note that this is neither a production release nor a supported one; that will happen sometime in 2006, when Sun ships an update to Solaris 10. The feature set described in this column, and the examples shown, are all based on the pre-production OpenSolaris Nevada 27a release. OpenSolaris is available for free download from:

http://www.opensolaris.org

The genesis of ZFS was the idea of creating a modern file system from scratch. Rather than asking how to modify an existing file system, Sun's engineers asked what a file system should be, and many fundamental ideas came together in a new approach to file management. For example, ZFS combines disk management and file management in one. These areas, traditionally "the file system" and "the volume manager," were not considered separately. Sun's contention is that evolution and happenstance produced the current split, and that the split causes unnecessary complexity and inefficiency.

Another area in need of vast improvement was reliability. While previous technologies reasonably handle some common failure scenarios (such as a single disk failure repaired by RAID), many other failures (such as a controller bug) cause revealed or hidden corruption. Finally, the feature set should gather together the utility found scattered throughout other file systems, such as snapshots, replication, and compression.

ZFS Features

The net result of this engineering effort is a new way of thinking about and managing storage. There are no volumes. Rather, there are storage pools made up of disks (or slices, if desired) in various RAID configurations. File systems are no longer large, monolithic entities that are difficult or impossible to change. Instead, they are allocated out of pools and can contain other file systems. All aspects of a file system can be changed dynamically without loss of access to the data. And the list goes on. Here, then, is a summary of the features of the first release of ZFS:

  • Integral checksumming of all data and file system entities (directories et al.) for data correctness and seamless error recovery. Checksums are created on write and are recalculated and checked on reads for always-on data protection.
  • All writes are copy-on-write, so data is always consistent on disk. There is no fsck command for ZFS. Note that NVRAM is not needed for this consistency implementation.
  • ZFS is a 128-bit facility, for (almost) limitless scalability.
  • A file system allocated from a pool grows as data is allocated within it (i.e., the utility storage model is implemented).
  • Storage pools support mirroring, dynamic striping, and RAID-Z. Dynamic striping automatically expands stripes across disks as disks are added to a pool. RAID-Z is like RAID-5, but it always performs full-stripe writes, both for performance and to avoid on-disk inconsistency.
  • File systems are hierarchical, inheriting properties from their parent.
  • There are per-file system quotas and per-file system reservations (guaranteed disk space availability).
  • Built-in NFS sharing.
  • Full access-control lists.
  • A pool can be used by multiple file systems, allowing improved performance (depending on the configuration) by involving many disks in each I/O cycle.
  • Automatic, permanent mounting of created file systems (this, as with almost all ZFS features, can be configured or disabled).
  • Optional data "scrubbing" to read data and confirm its checksum correctness.
  • Import and export of pools (allowing pools to move between systems).
  • Endian-independent file system representation (allowing pools to move between endian-different systems, such as SPARC and Opteron).
  • Instantaneous snapshots and clones (read-write snapshots).
  • Integrated backup and restore options, including incremental and full backups.
  • Replication between systems using the integrated backup and restore functions.
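Several of these per-file-system controls are single-command property settings. As a hypothetical sketch (the pool and file system names are illustrative, and the commands assume a Solaris host with ZFS installed):

```shell
# Cap the space a file system (and its descendants) may consume.
zfs set quota=200m pool1/src

# Guarantee a file system a minimum amount of pool space.
zfs set reservation=50m pool1/src/opensolaris

# Share a file system over NFS -- no /etc/dfs/dfstab editing needed.
zfs set sharenfs=on pool1/src

# Review the resulting settings.
zfs get quota,reservation,sharenfs pool1/src
```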

ZFS includes all of these features, while maintaining a simple and easy-to-understand command structure. In fact, there are only two commands: zfs and zpool. The combination of features with simplicity and ease of use is perhaps the most stunning aspect of ZFS.

ZFS Operation

These examples are from an AMD-based workstation:

bash-3.00# uname -a
SunOS unknown 5.11 snv_27 i86pc i386 i86pc

On the example system, this is the initial disk status:

# df -kh

Filesystem           size   used  avail capacity  Mounted on
/dev/dsk/c0d0s0      6.7G   3.3G   3.4G    49%    /
/devices               0K     0K     0K     0%    /devices
ctfs                   0K     0K     0K     0%    /system/contract
proc                   0K     0K     0K     0%    /proc
mnttab                 0K     0K     0K     0%    /etc/mnttab
swap                 1.5G   648K   1.5G     1%    /etc/svc/volatile
objfs                  0K     0K     0K     0%    /system/object
/usr/lib/libc/libc_hwcap1.so.1
                     6.7G   3.3G   3.4G    49%    /lib/libc.so.1
fd                     0K     0K     0K     0%    /dev/fd
swap                 1.5G    40K   1.5G     1%    /tmp
swap                 1.5G    20K   1.5G     1%    /var/run

There are also some unused slices. Note that whole disks, and not just slices, are zpool-manageable. In this case, the available slices are c0d0s4, c0d0s5, c0d0s6, and c0d0s7 as shown by format -> partition -> print:

Part      Tag   Flag    Cylinders        Size           Blocks
  0       root   wm    409 - 1301        6.84GB   (893/0/0)  14346045
  1       swap   wu      3 -  133        1.00GB   (131/0/0)   2104515
  2     backup   wm      0 - 1301        9.97GB   (1302/0/0) 20916630
  3 unassigned   wm      0               0        (0/0/0)           0
  4 unassigned   wm    134 -  201      533.41MB   (68/0/0)    1092420
  5 unassigned   wm    202 -  269      533.41MB   (68/0/0)    1092420
  6 unassigned   wm    270 -  337      533.41MB   (68/0/0)    1092420
  7 unassigned   wm    338 -  405      533.41MB   (68/0/0)    1092420
  8       boot   wu      0 -    0        7.84MB   (1/0/0)       16065
  9 alternates   wu      1 -    2       15.69MB   (2/0/0)       32130

Now, let's create a mirror using two of the slices and call the resulting pool "pool1". The commands after the create show information about the pools on the system. Again, this is an example; in normal production, you should never mirror two slices on one disk!

# zpool create pool1 mirror c0d0s4 c0d0s5

# zpool list

NAME           SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
pool1          528M   33.0K    528M     0%  ONLINE     -

bash-3.00# zpool status
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
pool1       ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c0d0s4  ONLINE       0     0     0
    c0d0s5  ONLINE       0     0     0

The pool is also a file system, and is automatically mounted, as evidenced by the new df:

# df -kh

Filesystem           size   used  avail capacity  Mounted on
/dev/dsk/c0d0s0      6.7G   3.3G   3.4G    49%    /
/devices               0K     0K     0K     0%    /devices
ctfs                   0K     0K     0K     0%    /system/contract
proc                   0K     0K     0K     0%    /proc
mnttab                 0K     0K     0K     0%    /etc/mnttab
swap                 1.5G   648K   1.5G     1%    /etc/svc/volatile
objfs                  0K     0K     0K     0%    /system/object
/usr/lib/libc/libc_hwcap1.so.1
                     6.7G   3.3G   3.4G    49%    /lib/libc.so.1
fd                     0K     0K     0K     0%    /dev/fd
swap                 1.5G    40K   1.5G     1%    /tmp
swap                 1.5G    20K   1.5G     1%    /var/run
pool1                512M     8K   512M     1%    /pool1

Now let's create a file system within the pool:

# zfs create pool1/dev

It is also quite easy to delete a file system. Obviously, all data stored in the file system is deleted when you destroy it:

# zfs destroy pool1/dev

# zfs create pool1/src

# df -kh

Filesystem            size used  avail capacity Mounted on
. . .
pool1                 512M   8K   512M     1%   /pool1
pool1/src             512M   8K   512M     1%   /pool1/src
Within the "src" file system, let's create a new file system:

# zfs create pool1/src/opensolaris

# df -kh

Filesystem            size used  avail capacity Mounted on

. . .
pool1                 512M   8K   512M     1%   /pool1
pool1/src             512M   8K   512M     1%   /pool1/src
pool1/src/opensolaris 512M   8K   512M     1%   /pool1/src/opensolaris

Many options are available with respect to ZFS pools and file systems. In the next example, I change the mount point of file system "opensolaris". Note that this takes effect immediately:

# zfs set mountpoint=/opensolaris pool1/src/opensolaris

# df -kh
Filesystem             size   used  avail capacity  Mounted on
. . .
pool1                  512M     8K   512M     1%    /pool1
pool1/src              512M     8K   512M     1%    /pool1/src
pool1/src/opensolaris  512M     8K   512M     1%    /opensolaris

# cd /opensolaris

Now it's time to start using these new file systems. I'll start by untarring the OpenSolaris source code:

# cd /opensolaris
# tar xf /var/tmp/opensolaris-src* &
[1] 1279

Now let's see how the ZFS pools are doing:

# zpool iostat -v 5

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool1       19.1M   509M      0    680      0  3.03M
  mirror    19.1M   509M      0    680      0  3.03M
    c0d0s4      -      -      0     94      0  3.02M
    c0d0s5      -      -      0     93      0  3.03M
----------  -----  -----  -----  -----  -----  -----

^C

While the untar is continuing, let's add another mirror set to the available storage in the pool and watch the result:

# zpool add pool1 mirror c0d0s6 c0d0s7
# zpool iostat -v 5

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool1        142M   914M      0    208      0  2.05M
  mirror     124M   404M      0    109      0  1.10M
    c0d0s4      -      -      0     33      0  1.10M
    c0d0s5      -      -      0     33      0  1.10M
  mirror    18.5M   509M      0     99      0   976K
    c0d0s6      -      -      0     13      0   908K
    c0d0s7      -      -      0     14      0   976K
----------  -----  -----  -----  -----  -----  -----

^C

The operation completed seamlessly, without any change to file system availability. ZFS used its "dynamic striping" feature to spread I/O across both mirrors, and thus all four slices, in this example. Observant readers will note that, in this case, performance actually decreased when the second mirror set was added. I'm sure those same observant readers can determine why (hint: all four slices live on the same physical disk) and why this would not normally be the case. Now, let's check our status again:

# zpool status
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
pool1       ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c0d0s4  ONLINE       0     0     0
    c0d0s5  ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c0d0s6  ONLINE       0     0     0
    c0d0s7  ONLINE       0     0     0

# zpool list
NAME           SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
pool1         1.03G    283M    773M    26%  ONLINE     -

Many aspects of ZFS are dynamic. For example, compression can be turned on and off without affecting file system availability. In this case, all future writes to the "opensolaris" file system will be compressed (including updates to existing uncompressed data):

# zfs set compression=on pool1/src/opensolaris
# zpool iostat -v 5
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool1        292M   764M      0    222  5.08K  1.65M
  mirror     231M   297M      0    180  5.08K  1.31M
    c0d0s4      -      -      0     30  1.35K  1.32M
    c0d0s5      -      -      0     30  1.73K  1.32M
  mirror    60.5M   467M      0    157      0  1.26M
    c0d0s6      -      -      0     28  2.31K  1.29M
    c0d0s7      -      -      0     28  2.31K  1.29M
----------  -----  -----  -----  -----  -----  -----
^C

ZFS file systems have many properties, some of which are changeable and some of which are fixed:

# zfs get all pool1/src/opensolaris
NAME                   PROPERTY       VALUE                    SOURCE
pool1/src/opensolaris  type           filesystem                -
pool1/src/opensolaris  creation       Tue Nov 29 10:33 2005     -
pool1/src/opensolaris  used           188M                      -
pool1/src/opensolaris  available      851M                      -
pool1/src/opensolaris  referenced     188M                      -
pool1/src/opensolaris  compressratio  1.75x                     -
pool1/src/opensolaris  mounted        yes                       -
pool1/src/opensolaris  quota          none                    default
pool1/src/opensolaris  reservation    none                    default
pool1/src/opensolaris  recordsize     128K                    default
pool1/src/opensolaris  mountpoint     /opensolaris            local
pool1/src/opensolaris  sharenfs       off                     default
pool1/src/opensolaris  checksum       on                      default
pool1/src/opensolaris  compression    on                      local
pool1/src/opensolaris  atime          on                      default
pool1/src/opensolaris  devices        on                      default
pool1/src/opensolaris  exec           on                      default
pool1/src/opensolaris  setuid         on                      default
pool1/src/opensolaris  readonly       off                     default
pool1/src/opensolaris  zoned          off                     default
pool1/src/opensolaris  snapdir        visible                 default
pool1/src/opensolaris  aclmode        groupmask               default
pool1/src/opensolaris  aclinherit     secure                  default

There is a lot more to ZFS, including snapshots, clones, replication, and the use of ZFS with zones, but those topics will have to wait until the next Solaris Companion column. Here's a taste, just to whet your appetite:

# zfs snapshot pool1/src/opensolaris@pbg1
# zfs list
NAME                        USED  AVAIL  REFER  MOUNTPOINT
pool1                       189M   851M  8.50K  /pool1
pool1/src                   188M   851M     8K  /pool1/src
pool1/src/opensolaris       188M   851M   188M  /opensolaris
pool1/src/opensolaris@pbg1     0      -   188M  -
# touch foo
# ls
ReleaseNotes  foo           usr
# cd /opensolaris/.zfs/snapshot/pbg1
# ls
ReleaseNotes  usr

ZFS Limits

There are many things that ZFS currently is, and a few that it is not. The most frustrating current limit is that ZFS cannot be the root file system; a project is underway to remove that restriction, however. It certainly would be nice to install a system with ZFS as the root file system and then have features like snapshots available for systems work. Consider taking a snapshot, making a change that causes a problem on the system (e.g., installing a bad patch), and then reverting the system to its pre-patch state by rolling back to the snapshot.
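Once ZFS can host the root file system, that patch-recovery workflow might look something like the following sketch (the pool, file system, and snapshot names are hypothetical, and the commands assume a Solaris host with ZFS):

```shell
# Preserve the current state before risky maintenance.
zfs snapshot rootpool/sol@prepatch

# ...apply the patch; suppose it leaves the system broken...

# Discard all changes made since the snapshot was taken.
zfs rollback rootpool/sol@prepatch
```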

Also, pools can be exported and imported, but ZFS is not a true "clustered file system": only one host can access a given pool at a time.
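Moving a pool between hosts is a two-command operation; a sketch (the pool name and disk connectivity are assumed, and the commands require a ZFS host):

```shell
# On the old host: unmount the pool's file systems and release the pool.
zpool export pool1

# Physically move or re-cable the disks, then on the new host:
zpool import pool1

# With no pool name, zpool import lists pools available for import.
zpool import
```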

An open issue is the support of ISVs, such as Oracle, for the use of a ZFS file system to store their data. I'm sure this will come over time.

Hot spares are not currently implemented. If a disk fails, a zpool replace command must be executed to replace the bad disk with a good one.
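The manual replacement itself is a one-liner; a hypothetical sketch (device names are illustrative, and the commands require a ZFS host):

```shell
# Replace failed slice c0d0s4 with healthy slice c0d1s4;
# ZFS resilvers the mirror onto the new device automatically.
zpool replace pool1 c0d0s4 c0d1s4

# Watch the resilver progress until the pool is healthy again.
zpool status pool1
```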

One half of a mirror can be removed with the zpool detach command, but RAID-Z sets and non-mirrored disks currently cannot be removed from a pool.
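Detaching, and later reattaching, one side of a mirror looks like this sketch (again, device names are illustrative and a ZFS host is assumed):

```shell
# Drop c0d0s5 from its mirror; the pool keeps running unmirrored.
zpool detach pool1 c0d0s5

# Later, rebuild the mirror alongside the surviving device c0d0s4;
# ZFS resilvers the new side automatically.
zpool attach pool1 c0d0s4 c0d0s5
```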

Also, currently there is no built-in encryption, and at this point ZFS is a Solaris-only feature. Whether ports will be done to other operating systems remains to be seen.

ZFS Ramifications

ZFS has the potential to make low-cost disks and storage arrays perform and function much like (or even better than) their more expensive cousins. This could lead to a revolution in the cost of disk used in production environments. I believe that, between the Solaris 10 network performance improvements and the ZFS feature set, running a Solaris server as a group or company file server will once again be an option to consider. The price/performance/feature set should be outstanding.

The potential uses of ZFS are endless, especially once it can be used as the boot file system. Bart Smaalders (http://blogs.sun.com/roller/page/barts) has a blog entry containing his ZFS-futures thoughts.

ZFS Readiness

A brief word on the readiness of ZFS for production use: usually, a new file system would not even be considered for production until quite a while after it first ships. ZFS, on the other hand, may well see production use immediately upon its production release. The testing that has gone into ZFS is astounding; in fact, testing was considered a first-class component of the ZFS design and implementation. See the blogs listed below for more details on the ZFS torture tests.

More on ZFS

The ZFS documentation is available at:

http://www.opensolaris.org/os/community/zfs/docs/

There is a plethora of other information available about ZFS. Specifically, Sun engineers are blogging about the facility in a variety of areas. The best place to start is Bryan Cantrill's blog, which points to many others:

http://blogs.sun.com/roller/page/bmc

There is also an active ZFS community at:

http://www.opensolaris.org/os/community/zfs

Peter Baer Galvin (http://www.petergalvin.info) is the Chief Technologist for Corporate Technologies (http://www.cptech.com), a premier systems integrator and VAR. Before that, Peter was the systems manager for Brown University's Computer Science Department. He has written articles for Byte and other magazines, and previously wrote Pete's Wicked World, the security column, and Pete's Super Systems, the systems management column, for Unix Insider (http://www.unixinsider.com). Peter is coauthor of the Operating System Concepts and Applied Operating System Concepts textbooks. As a consultant and trainer, Peter has taught tutorials and given talks on security and systems administration worldwide.