The Best File System in the World? Part 1
Peter Baer Galvin
In case you haven't noticed, there continues to be a lot going
on at Sun. In the last few months, there have been quite a few new
product announcements and upgrades. Sun has made both their premier
development environment, Studio 11, and their Java Enterprise System
tool set open source and free to use. They have included the open
source Postgres database with Solaris and are supporting it via
the usual channels.
Sun also has announced a new CPU, the UltraSPARC T1. This chip
has up to eight cores, with four threads per core (a la hyperthreading).
Sun has also released the first systems in their next-generation
Opteron-based line (the X2100, X4100, and X4200). These systems
have an outstanding feature set and continue to set benchmark records.
Sun's efforts to create a community around OpenSolaris are also
bearing fruit, with three non-Sun distributions and counting:
- Belenix -- http://belenix.sarovar.org/belenix_background.html
- Nexenta -- http://www.gnusolaris.org/gswiki
- Schillix -- http://schillix.berlios.de/index.php?id=news
or ftp://ftp.berlios.de/pub/schillix/
All of these releases add value to what is delivered with OpenSolaris,
with perhaps the most excitement being generated by Nexenta with
its use of the OpenSolaris kernel and Debian tools and package management
on top.
Although each of these topics deserves its own coverage,
they are not the subject of this article. Rather, in this column
I'll discuss Sun's new ZFS file system. The anticipation and excitement
around this Solaris feature are beyond compare. As you will find
out if you read further, this is for good reason.
ZFS
ZFS has been gestating for several years within Sun. As of November
2005, the beta-test program has been rolled into the OpenSolaris
release. OpenSolaris build Nevada 27a includes the full source code
for ZFS, and as of this writing was the best and only way to get
access to ZFS. Note that this is neither a production nor a supported
release. That will happen sometime in 2006, when an update to Solaris
10 is shipped by Sun. The feature set described in this column,
and examples shown, are all based on the pre-production OpenSolaris
Nevada 27a release. OpenSolaris is available for free download from:
http://www.opensolaris.org
The genesis of ZFS was the idea of creating a modern file system from
scratch. Many fundamental ideas about what a file system should be,
rather than how to modify an existing one, came together in a new
approach to file management. For example, ZFS combines disk management
and file management in one. These areas, traditionally "the file system"
and "the volume manager," are not treated separately. Sun's contention
is that evolution and happenstance produced the current split and
that the split causes unnecessary complexity and inefficiency.
Another area in need of vast improvement was reliability. While
some common failure scenarios are reasonably solved by previous
technologies (such as a single disk failure repaired by RAID), many
other failure scenarios cause revealed or hidden corruption (such
as a controller bug). Finally, the feature set should gather together
the utilities found scattered across other file systems, such as
snapshots, replication, and compression.
ZFS Features
The net result of this engineering effort is a new way of thinking
about and managing storage. There are no volumes. Rather, there
are storage pools made up of disks (or slices if desired) in various
RAID configurations. File systems are no longer large, monolithic
entities that are difficult or impossible to change. Instead, they are
allocated out of pools and can contain other file systems. All aspects of
a file system can be changed dynamically without loss of access
to the data. And the list goes on. Here then is a summary of the
features of the first release of ZFS:
- Integral checksumming of all data and file system entities
(directories et al.) for data correctness and seamless error recovery.
Checksums are created on write and are recalculated and checked
on reads for always-on data protection.
- All writes are copy-on-write, so data is always consistent
on disk. There is no fsck command for ZFS. Note that NVRAM
is not needed for this consistency implementation.
- ZFS is a 128-bit facility, for (almost) limitless scalability.
- A file system allocated from a pool grows as data is allocated
within it (i.e., the utility storage model is implemented).
- Storage pools support mirroring, dynamic striping, and RAID-Z.
Dynamic striping automatically expands stripes across disks as
disks are added to a pool. RAID-Z is like RAID-5, but it always
performs full-stripe writes for performance and to avoid disk
inconsistency.
- File systems are hierarchical, inheriting properties from their
parent.
- There are per-file system quotas and per-file system reservations
(guaranteed disk space availability).
- Built-in NFS sharing.
- Full access-control lists.
- A pool can be used by multiple file systems, allowing improved
performance (depending on the configuration) by involving many
disks in each I/O cycle.
- Automatic, permanent mounting of created file systems (this,
as with almost all ZFS features, can be configured or disabled).
- Optional data "scrubbing" to read data and confirm its checksum
correctness.
- Import and export of pools (allowing pools to move between
systems).
- Endian-independent file system representation (allowing pools
to move between endian-different systems, such as SPARC and Opteron).
- Instantaneous snapshots and clones (read-write snapshots).
- Integrated backup and restore options, including incremental
and full backups.
- Replication between systems using the integrated backup and
restore functions.
ZFS includes all of these features, while maintaining a simple
and easy-to-understand command structure. In fact, there are only
two commands: zfs and zpool. The combination of features
with simplicity and ease of use is perhaps the most stunning aspect
of ZFS.
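As a taste of how several of the listed features are exposed through those two commands, here is a hedged sketch of setting a quota, a reservation, and NFS sharing. The property names are the documented ones, but the values and the file system name (pool1/src, which matches the examples below) are illustrative, and exact behavior may differ in the pre-production release:

```shell
# Cap the space a file system may consume at 100 MB.
zfs set quota=100m pool1/src

# Guarantee 50 MB of pool space to the same file system.
zfs set reservation=50m pool1/src

# Share the file system over NFS with one command -- no editing
# of /etc/dfs/dfstab required.
zfs set sharenfs=on pool1/src

# Confirm the settings.
zfs get quota,reservation,sharenfs pool1/src
```

Because file systems inherit properties from their parent, setting sharenfs on pool1/src would also apply to file systems created beneath it.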
ZFS Operation
These examples are from an AMD-based workstation:
bash-3.00# uname -a
SunOS unknown 5.11 snv_27 i86pc i386 i86pc
On the example system, this is the initial disk status:
# df -kh
Filesystem size used avail capacity Mounted on
/dev/dsk/c0d0s0 6.7G 3.3G 3.4G 49% /
/devices 0K 0K 0K 0% /devices
ctfs 0K 0K 0K 0% /system/contract
proc 0K 0K 0K 0% /proc
mnttab 0K 0K 0K 0% /etc/mnttab
swap 1.5G 648K 1.5G 1% /etc/svc/volatile
objfs 0K 0K 0K 0% /system/object
/usr/lib/libc/libc_hwcap1.so.1
6.7G 3.3G 3.4G 49% /lib/libc.so.1
fd 0K 0K 0K 0% /dev/fd
swap 1.5G 40K 1.5G 1% /tmp
swap 1.5G 20K 1.5G 1% /var/run
There are also some unused slices. Note that whole disks, and not
just slices, are zpool-manageable. In this case, the available slices
are c0d0s4, c0d0s5, c0d0s6, and c0d0s7 as shown by format ->
partition -> print:
Part Tag Flag Cylinders Size Blocks
0 root wm 409 - 1301 6.84GB (893/0/0) 14346045
1 swap wu 3 - 133 1.00GB (131/0/0) 2104515
2 backup wm 0 - 1301 9.97GB (1302/0/0) 20916630
3 unassigned wm 0 0 (0/0/0) 0
4 unassigned wm 134 - 201 533.41MB (68/0/0) 1092420
5 unassigned wm 202 - 269 533.41MB (68/0/0) 1092420
6 unassigned wm 270 - 337 533.41MB (68/0/0) 1092420
7 unassigned wm 338 - 405 533.41MB (68/0/0) 1092420
8 boot wu 0 - 0 7.84MB (1/0/0) 16065
9 alternates wu 1 - 2 15.69MB (2/0/0) 32130
Now, let's create a mirror using two of the slices and call the resulting
pool "pool1". The commands after the create show information
about the pools on the system. Again, this is an example; in normal
production, you should never mirror two slices on one disk!
# zpool create pool1 mirror c0d0s4 c0d0s5
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
pool1 528M 33.0K 528M 0% ONLINE -
# zpool status
pool: pool1
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
pool1 ONLINE 0 0 0
mirror ONLINE 0 0 0
c0d0s4 ONLINE 0 0 0
c0d0s5 ONLINE 0 0 0
The pool is also a file system, and is automatically mounted, as evidenced
by the new df:
# df -kh
Filesystem size used avail capacity Mounted on
/dev/dsk/c0d0s0 6.7G 3.3G 3.4G 49% /
/devices 0K 0K 0K 0% /devices
ctfs 0K 0K 0K 0% /system/contract
proc 0K 0K 0K 0% /proc
mnttab 0K 0K 0K 0% /etc/mnttab
swap 1.5G 648K 1.5G 1% /etc/svc/volatile
objfs 0K 0K 0K 0% /system/object
/usr/lib/libc/libc_hwcap1.so.1
6.7G 3.3G 3.4G 49% /lib/libc.so.1
fd 0K 0K 0K 0% /dev/fd
swap 1.5G 40K 1.5G 1% /tmp
swap 1.5G 20K 1.5G 1% /var/run
pool1 512M 8K 512M 1% /pool1
Now let's create a file system within the pool:
# zfs create pool1/dev
It is also quite easy to delete a file system. Obviously, all data
stored in a file system is lost when you destroy it:
# zfs destroy pool1/dev
# zfs create pool1/src
# df -kh
Filesystem size used avail capacity Mounted on
. . .
pool1 512M 8K 512M 1% /pool1
pool1/src 512M 8K 512M 1% /pool1/src
Within the "src" file system, let's create a new file system:
# zfs create pool1/src/opensolaris
# df -kh
Filesystem size used avail capacity Mounted on
. . .
pool1 512M 8K 512M 1% /pool1
pool1/src 512M 8K 512M 1% /pool1/src
pool1/src/opensolaris 512M 8K 512M 1% /pool1/src/opensolaris
Many options are available with respect to ZFS pools and file systems.
In the next example, I change the mount point of file system "opensolaris".
Note that this takes effect immediately:
# zfs set mountpoint=/opensolaris pool1/src/opensolaris
# df -kh
Filesystem size used avail capacity Mounted on
. . .
pool1 512M 8K 512M 1% /pool1
pool1/src 512M 8K 512M 1% /pool1/src
pool1/src/opensolaris 512M 8K 512M 1% /opensolaris
# cd /opensolaris
Now it's time to start using these new file systems. I'll start by
untarring the OpenSolaris source code:
# cd /opensolaris
# tar xf /var/tmp/opensolaris-src* &
[1] 1279
Now let's see how the ZFS pools are doing:
# zpool iostat -v 5
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
pool1 19.1M 509M 0 680 0 3.03M
mirror 19.1M 509M 0 680 0 3.03M
c0d0s4 - - 0 94 0 3.02M
c0d0s5 - - 0 93 0 3.03M
---------- ----- ----- ----- ----- ----- -----
^C
While the untar is continuing, let's add another mirror set to the
available storage in the pool and watch the result:
# zpool add pool1 mirror c0d0s6 c0d0s7
# zpool iostat -v 5
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
pool1 142M 914M 0 208 0 2.05M
mirror 124M 404M 0 109 0 1.10M
c0d0s4 - - 0 33 0 1.10M
c0d0s5 - - 0 33 0 1.10M
mirror 18.5M 509M 0 99 0 976K
c0d0s6 - - 0 13 0 908K
c0d0s7 - - 0 14 0 976K
---------- ----- ----- ----- ----- ----- -----
^C
The operation was done seamlessly without any change to file system
availability. ZFS used its "dynamic striping" feature to spread I/O
across both mirrors and all four disks in this example. Observant
readers will note that, in this case, performance actually decreased
when the other mirror set was added. I'm sure those same observant
readers can determine why this is the case (and why this would not
normally be the case). Now, let's check our status again:
# zpool status
pool: pool1
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
pool1 ONLINE 0 0 0
mirror ONLINE 0 0 0
c0d0s4 ONLINE 0 0 0
c0d0s5 ONLINE 0 0 0
mirror ONLINE 0 0 0
c0d0s6 ONLINE 0 0 0
c0d0s7 ONLINE 0 0 0
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
pool1 1.03G 283M 773M 26% ONLINE -
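The optional data scrubbing mentioned earlier is driven from this same zpool command. A minimal sketch (output elided, and timings will vary with pool size):

```shell
# Read every block in the pool and verify it against its checksum,
# repairing from the mirror copy where a mismatch is found.
zpool scrub pool1

# The "scrub:" line of zpool status reports progress and,
# eventually, completion.
zpool status pool1
```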
Many aspects of ZFS are dynamic. For example, compression can be turned
on and off dynamically without affecting file system availability.
In this case, all future writes to the "opensolaris" file system will
be compressed (including updates to existing uncompressed data):
# zfs set compression=on pool1/src/opensolaris
# zpool iostat -v 5
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
pool1 292M 764M 0 222 5.08K 1.65M
mirror 231M 297M 0 180 5.08K 1.31M
c0d0s4 - - 0 30 1.35K 1.32M
c0d0s5 - - 0 30 1.73K 1.32M
mirror 60.5M 467M 0 157 0 1.26M
c0d0s6 - - 0 28 2.31K 1.29M
c0d0s7 - - 0 28 2.31K 1.29M
---------- ----- ----- ----- ----- ----- -----
^C
ZFS file systems have many properties, some of which are changeable
and some of which are fixed:
# zfs get all pool1/src/opensolaris
NAME PROPERTY VALUE SOURCE
pool1/src/opensolaris type filesystem -
pool1/src/opensolaris creation Tue Nov 29 10:33 2005 -
pool1/src/opensolaris used 188M -
pool1/src/opensolaris available 851M -
pool1/src/opensolaris referenced 188M -
pool1/src/opensolaris compressratio 1.75x -
pool1/src/opensolaris mounted yes -
pool1/src/opensolaris quota none default
pool1/src/opensolaris reservation none default
pool1/src/opensolaris recordsize 128K default
pool1/src/opensolaris mountpoint /opensolaris local
pool1/src/opensolaris sharenfs off default
pool1/src/opensolaris checksum on default
pool1/src/opensolaris compression on local
pool1/src/opensolaris atime on default
pool1/src/opensolaris devices on default
pool1/src/opensolaris exec on default
pool1/src/opensolaris setuid on default
pool1/src/opensolaris readonly off default
pool1/src/opensolaris zoned off default
pool1/src/opensolaris snapdir visible default
pool1/src/opensolaris aclmode groupmask default
pool1/src/opensolaris aclinherit secure default
There is a lot more to ZFS, including snapshots, clones, replication,
and zones plus ZFS, but those topics will have to wait until the next
Solaris Companion column. Here's a taste just to whet your appetite:
# zfs snapshot pool1/src/opensolaris@pbg1
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
pool1 189M 851M 8.50K /pool1
pool1/src 188M 851M 8K /pool1/src
pool1/src/opensolaris 188M 851M 188M /opensolaris
pool1/src/opensolaris@pbg1 0 - 188M -
# touch foo
# ls
ReleaseNotes foo usr
# cd /opensolaris/.zfs/snapshot/pbg1
# ls
ReleaseNotes usr
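Clones, the read-write counterpart of snapshots, come from the same machinery. A one-line sketch based on the pbg1 snapshot above (pool1/play is a hypothetical name):

```shell
# Create a writable file system whose initial contents are those of
# the snapshot; unchanged blocks are shared with the original.
zfs clone pool1/src/opensolaris@pbg1 pool1/play
```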
ZFS Limits
There are many things that ZFS currently is, and a few that it
is not. The most frustrating current limit is that ZFS cannot be
the root file system. A project is underway to resolve that issue,
however. It certainly would be nice to install a system with ZFS
as the root file system and then to have features like snapshots
available for systems work. Consider taking a snapshot, making a
change that causes a problem on the system (e.g., installing a bad
patch), and then reverting the system to its pre-patch state by
restoring the snapshot.
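For a non-root file system, that kind of revert is already possible today via the zfs rollback subcommand. A sketch, using the pbg1 snapshot created above:

```shell
# Discard all changes made to the file system since the snapshot
# was taken, returning it to its exact earlier state.
zfs rollback pool1/src/opensolaris@pbg1
```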
Also, ZFS pools can be imported and exported, but ZFS is not a true
"clustered file system," in that only one host can access a given
pool at a time.
An open issue is the support of ISVs, such as Oracle, for the
use of a ZFS file system to store their data. I'm sure this will
come over time.
Hot spares are not implemented currently. If a disk fails, a zpool
replace command must be executed for the bad disk to be replaced
by a good one.
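The manual replacement itself is a single command. Here c0d1s4 is a hypothetical spare slice on a second disk:

```shell
# Replace the failed slice c0d0s4 with the (hypothetical) spare
# c0d1s4; ZFS resilvers the mirror onto the new device automatically.
zpool replace pool1 c0d0s4 c0d1s4

# Watch the resilver progress.
zpool status pool1
```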
One half of a mirror can be removed with the zpool detach
command, but RAID-Z sets and non-mirrored disks currently cannot
be removed from a pool.
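Detaching one side of the example mirror would look like this (the remaining side keeps serving data, but redundancy is lost until a new device is attached):

```shell
# Remove c0d0s5 from its mirror in pool1; the pool stays online
# with the surviving half of the mirror.
zpool detach pool1 c0d0s5
```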
Also, currently there is no built-in encryption, and at this point
ZFS is a Solaris-only feature. Whether ports will be done to other
operating systems remains to be seen.
ZFS Ramifications
ZFS has the potential to make low-cost disks and storage arrays
perform and function much like (or even better than) their more
expensive cousins. This could lead to a revolution in the cost of
disk used in production environments. I believe that between the
Solaris 10 network performance improvements, and the ZFS feature
set, running a Solaris server as a group or company file server
will once again be an option to consider. The price/performance/feature
set should be outstanding.
The potential uses of ZFS are endless, especially once it can
be used as the boot file system. Bart Smaalders (http://blogs.sun.com/roller/page/barts)
has a blog entry containing his ZFS-futures thoughts.
ZFS Readiness
Just a brief word on the readiness of ZFS for production use.
Usually, a new file system would not even be considered for production
use for quite a while after its first release. ZFS, on the other hand,
may well see production use immediately upon its production release.
The testing that has gone into ZFS is astounding, and in fact testing
was considered a first-class component of the ZFS design and implementation.
See the blogs listed below for more details on the ZFS torture tests.
More on ZFS
The ZFS documentation is available at:
http://www.opensolaris.org/os/community/zfs/docs/
There is a plethora of other information available about ZFS. Specifically,
Sun engineers are blogging about the facility in a variety of areas.
The best place to start is Bryan Cantrill's blog, which points to
many others:
http://blogs.sun.com/roller/page/bmc
There is also an active ZFS community at:
http://www.opensolaris.org/os/community/zfs
Peter Baer Galvin (http://www.petergalvin.info)
is the Chief Technologist for Corporate Technologies (http://www.cptech.com),
a premier systems integrator and VAR. Before that, Peter was the systems
manager for Brown University's Computer Science Department. He has
written articles for Byte and other magazines, and previously
wrote Pete's Wicked World, the security column, and Pete's Super Systems,
the systems management column for Unix Insider (http://www.unixinsider.com).
Peter is coauthor of the Operating System Concepts and Applied
Operating System Concepts textbooks. As a consultant and trainer,
Peter has taught tutorials and given talks on security and systems
administration worldwide. |