
ZFS Administration

Corey Brune

In June 2006, Sun Microsystems released ZFS into the Solaris Enterprise System -- also known as Solaris 10 6/06. It had originally been released to the public in OpenSolaris the previous year. ZFS is a new file system and data management tool that provides a simple command-line interface and a Web interface. To date, it is the most advanced file system available.

ZFS is designed for demanding applications, where scalability and performance are of the utmost importance. It is POSIX compliant; therefore, modifications are not needed for existing applications or storage infrastructure. It runs on any hardware that the Solaris 10 operating system supports. The main features of ZFS are pooled storage, data protection and correction, on-disk data consistency, scrubbing, snapshots and clones, superior backup and restore, high scalability, and built-in compression. Although similar data management products exist, what makes ZFS revolutionary is its ease of management, availability, performance, and scalability.

Ease of Management

The ease of management comes primarily from eliminating the mundane tasks administrators perform each day; automation is built into ZFS. Because the storage pool manager is a built-in function, the traditional volume manager is eliminated. The first step in using ZFS is to create pools and file systems. ZFS allows the user to decide whether to create a pool utilizing one slice or the entire disk. Additionally, you will need to decide what type of RAID best fits your requirements. The four types available are RAID-0, RAID-1 (mirror), RAID-Z, and RAID-Z2. RAID-1 is two or more disks that contain exact copies of the data; as writes occur, all disks in the mirror are modified, creating the mirror effect. Listing 1 shows an example of how to create a mirrored pool that manages the entire disk.
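
As a minimal sketch of such a command (the pool name and device names here are hypothetical -- substitute your own):

```shell
# Create a pool named "mypool" that mirrors two whole disks.
# Using whole disks lets zpool(1M) format and label them automatically.
zpool create mypool mirror c1t0d0 c2t0d0

# Verify the pool and its mirror vdev
zpool status mypool
```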

Sun recommends that a pool utilize the entire disk so that zpool(1M) will format and label the disk automatically. Otherwise, an administrator will need to manually perform these functions. Creating a pool automatically creates a file system mounted at /<poolname>. If you wish, you may also create additional file systems (Listing 2).
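
For example, creating an additional file system inside a hypothetical pool named mypool is a single command; the new file system is mounted automatically:

```shell
# Create a child file system; it is mounted at /mypool/home
# with no newfs, mount, or /etc/vfstab editing required.
zfs create mypool/home

# List all ZFS file systems and their mount points
zfs list
```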

Your second file system is now ready to use. Notice that ZFS has eliminated the need to run newfs(1M) to prepare the file system for use. Mounting the file system is automatic and requires no additional steps. Additionally, there is no need to edit /etc/vfstab to add the file system, since ZFS handles that housekeeping as well.

To delete a pool and make the underlying devices available for other uses, use the following command:

zpool destroy <pool name>
            
This will destroy everything that exists in the given pool. You can specify the -f option if the pool contains mounted file systems. This option will automatically unmount and destroy the underlying file systems.

Similarly, you can delete a file system using the following command:

zfs destroy <filesystem name>
    
To utilize NFS share options on a file system, use the zfs(1M) command. With other file systems you would edit /etc/dfs/dfstab, enable the NFS server daemons, and share the file system. However, in ZFS these steps are eliminated. The command is simply:

zfs set sharenfs=on <filesystem name>
Please note that setting sharenfs=on gives all hosts read and write access to the share.
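
As a sketch, sharing a hypothetical file system mypool/home over NFS, and then inspecting the property:

```shell
# Share the file system over NFS (all hosts get read/write access)
zfs set sharenfs=on mypool/home

# Alternatively, pass share(1M)-style options, e.g. read-only
zfs set sharenfs=ro mypool/home

# Inspect the current setting
zfs get sharenfs mypool/home
```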

Listing 3 shows a file system with no options and a file system with share(1M) options. The share(1M) command displays all NFS shared file systems.

If it is necessary to change the mount point for a file system, use the command:

zfs set mountpoint=<dir> <fs name>
ZFS creates the new mount point, unmounts the file system, and remounts it at its new location, all automatically. With ZFS, you simply express your intent, and the system takes care of the rest.

There are two ways to import a ZFS file system into a zone. One way is to create the file system in the global zone, run the zonecfg(1M) command, and add it to the local zone. However, there are security ramifications with this approach: if a file system is mounted in the global zone, an administrator could assign the same file system to two zones. ZFS resolves this issue, which leads us to the second way to import. After the file system is created, set mountpoint=legacy (Listing 4).

This command specifies that the file system will not be mounted in the global zone. Next, use the zonecfg(1M) command to add the file system to a local zone (Listing 5).
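
The two steps might look like the following sketch, assuming a hypothetical dataset mypool/zonefs and a zone named myzone:

```shell
# In the global zone: keep the file system from being mounted there
zfs set mountpoint=legacy mypool/zonefs

# Add the file system to the local zone via an interactive
# zonecfg(1M) session (zone, dataset, and mount directory
# names are hypothetical)
zonecfg -z myzone
zonecfg:myzone> add fs
zonecfg:myzone:fs> set type=zfs
zonecfg:myzone:fs> set special=mypool/zonefs
zonecfg:myzone:fs> set dir=/export/data
zonecfg:myzone:fs> end
zonecfg:myzone> commit
zonecfg:myzone> exit
```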

If there is a need for another person, typically an outside group, to manage a ZFS file system within a local zone, you will need to run the command zonecfg(1M) to provide management access (Listing 6).

A unique feature of ZFS is pooled storage. Before ZFS, an admin had to create a file system on a single disk or have a volume manager handle multiple devices. Both approaches have their drawbacks. ZFS file systems are grouped together in a dynamically shared storage pool. Users no longer need to run newfs(1M) to recalculate parameters, recalculate stripe sizes, or manually grow file systems. The result is an efficient file system that minimizes administration time while preserving the integrity of data. Once a device is allocated to a pool (Listing 7), ZFS distributes I/O evenly across all devices in the storage pool.

Please note that with this feature, a file system can consume all available disk space in a pool. To limit the amount of space, you will need to set a quota (Listing 8).
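
A quota is just another zfs(1M) property; a sketch with a hypothetical file system and limit:

```shell
# Cap mypool/home at 10 GB of pool space
zfs set quota=10G mypool/home

# Confirm the quota
zfs get quota mypool/home
```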

As space requirements increase, so will your need for additional storage. The zpool(1M) command will add a disk to a pool, simplifying the process of allocating additional storage. Anytime a disk is added to or replaced in a mirrored or RAID-Z device, ZFS will re-silver. Re-silvering copies data from the original device(s) to the new device(s). Please note that in the event a scrub is in progress, re-silvering will be deferred until the scrub has completed.
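
A sketch of growing a hypothetical mirrored pool with another pair of disks, then watching the re-silver:

```shell
# Add a second mirrored pair to the pool
# (pool and device names are hypothetical)
zpool add mypool mirror c3t0d0 c4t0d0

# zpool status reports re-silvering progress
zpool status mypool
```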

Similar to most volume managers, ZFS allows exporting and importing of pools in order to move them between hosts. However, unlike the average volume manager, ZFS provides portability: a user can export a pool from a SPARC system and import it onto an x64/x86 system without any modifications. The command to export a pool is:

zpool export <pool name>
The command to import a pool is:

zpool import <pool name>
Availability

ZFS contains several data protection features. One feature is Copy On Write (COW). When a write request is made, ZFS creates a copy of the specified block. All modifications are then written to the copy. The original block is kept until all changes are committed. When the write is complete, pointers are re-routed to the new location. This ensures that the on-disk copy is always valid, eliminating the need to fsck(1M) the file system.

A second feature is the use of checksums to detect errors and data corruption. Checksum validation is performed on every read and write. Because each block's checksum is stored in its parent block, ZFS can trust the checksum rather than the potentially corrupted data; if an inconsistency is detected in data or metadata, ZFS uses the checksum to locate a known-good copy and repair the damage.

A third feature is RAID-Z and RAID-Z2, ZFS's data-protection and storage features. Since ZFS automatically corrects corrupted data, you can now use inexpensive disks. These new RAID redundancies utilize XOR parity and checksums for error detection. RAID-Z is similar to RAID-5, and RAID-Z2 is similar to RAID-6; the exceptions are that RAID-Z and RAID-Z2 do not suffer from the write-hole issue, are faster, and use dynamically sized stripes (every block is written as its own RAID-Z stripe across the disks, so stripes can differ in size).

Similar to all other ZFS writes, if power is lost during a full-stripe write the data is not lost since the changes have not been committed.

To create a RAID-Z group, you must execute the zpool(1M) command. An example of this command is:

zpool create mypool raidz c1t0d0 c2t0d0 c3t0d0
According to the zpool(1M) man page, there must be a minimum of 2 drives in a RAID-Z configuration, and 3-9 devices per group are recommended. Roch Bourbonnais' "When to (and not to) use RAID-Z" blog entry addresses the performance considerations of choosing between RAID-Z and a mirror.

Scrubbing is a fourth feature, a correction mechanism used to recover data from corruption. It reads data and compares checksums to identify errors. If a pool is protected with RAID-Z, RAID-Z2, or a mirror, a damaged block is repaired automatically when a checksum error is detected, before the data is returned to the application. In addition, an administrator can explicitly scrub an entire pool (Listing 9) utilizing the command:

zpool scrub <pool name>
After a manual scrub is performed, any errors that are detected (Listing 11) are visible after entering the command:

zpool status
Notice that at no point was the file system offline during the data scrubbing. This is a valuable availability feature.

Once the administrator views the error status, the pool must be cleared in order to eliminate device errors (Listing 10).

In the event of a silent disk failure, ZFS will notify Solaris Fault Manager, which will in turn notify the administrator. If a disk failure has occurred or a device needs to be replaced, an administrator will replace the faulty drive and run the command:

zpool replace <pool name> <old disk> <new disk>
ZFS will automatically re-silver to populate the new device with valid data.

A fifth feature is the use of clones and snapshots for backups and restores. Clones are read-write copies of a file system, whereas snapshots are read-only. Use a clone rather than a snapshot when you need a writable copy of a file system, for example to perform upgrades or patches. One nice feature is that they do not consume large amounts of storage, since only blocks that differ between a snapshot and the live version are stored; any blocks that have not changed are shared between the snapshot, any clones, and the live file system. You can create snapshots (Listing 11) over a course of time and use the rollback feature (Listing 12) to restore any of the images.
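
A sketch of the snapshot-and-rollback cycle, using a hypothetical file system and a date-based snapshot name:

```shell
# Take a snapshot, named by date for easy tracking
zfs snapshot mypool/home@20070115

# List existing snapshots
zfs list -t snapshot

# Roll the live file system back to the snapshot
zfs rollback mypool/home@20070115
```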

Notice in the above listings that the snapshot and rollback operations each take approximately 1 second.

With rollbacks, you can perform upgrades to application, Web, or database servers without the worry of restore time and extended outages. If an upgrade or patch doesn't take, simply run the zfs rollback command. Next, take a snapshot of the failed upgrade and send it to a development or test server to research the failure. zfs send (Listing 13) and zfs receive (Listing 14) are commands that allow an administrator to send or receive snapshots remotely.
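
Because zfs send writes a stream to stdout and zfs receive reads one from stdin, the two compose naturally over ssh. A sketch with hypothetical host, pool, and snapshot names:

```shell
# Stream a snapshot to another host and recreate it there
zfs send mypool/home@20070115 | ssh devhost zfs receive devpool/home
```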

To access a snapshot, use either of the following:

cd <filesystem name|volume name>/.zfs 
ls <filesystem name|volume name>/.zfs 
If a write copy is needed you will need to create a clone using the following command:

zfs clone <snapshot> <filesystem name>
To access a clone, you will need to refer to the file system specified in the clone command.

When a snapshot or clone is no longer needed, use the following command to delete it and free up disk space:

zfs destroy <snapshot name>
Please note that you may wish to name snapshots by date in order to organize, track, and retrieve them more efficiently.

Performance

ZFS introduced an I/O pipelining technique. The pipeline's main responsibility is to execute I/O requests based on interdependencies and priorities; each I/O is assigned a deadline by the pipeline. ZFS also provides built-in compression. Once enabled, compression can reduce the size of data by up to a factor of 10 with no significant performance degradation; in fact, it may actually increase performance, because less data is read from and written to disk. There can be many reasons for performance degradation: for example, if there are too many or too few disks in a storage pool, or if the record size does not match the database block size, performance may decrease. If performance degradation is encountered, this command:

zpool iostat
will measure I/O performance and throughput on a per vdev or per pool basis (Listing 15). Use the dtrace(1M) command if you require deeper performance diagnostics.
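
As a sketch, compression is enabled per file system, and iostat can report per-vdev detail at a chosen interval (file system and pool names are hypothetical):

```shell
# Enable built-in compression on one file system
zfs set compression=on mypool/home

# Per-vdev I/O statistics for the pool, refreshed every 5 seconds
zpool iostat -v mypool 5
```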

When adding additional devices to a pool, ZFS dynamically stripes at write time, which eases administration and boosts performance. The user does not need to recalculate the stripe width after adding additional devices to the pool. ZFS will automatically stripe data across all available devices according to internal performance metrics.

Scalability

ZFS is the first 128-bit file system. Unlike other file systems, ZFS is virtually unlimited in the amount of storage it can handle. Its scalability shows primarily in the number of disks, files, directories, and snapshots available to the user. Additionally, ZFS uses dynamic metadata, which eliminates the need to preallocate inodes: if a file system requires an additional inode, one is allocated dynamically. No more /usr/ucb/df -i.

Conclusion

Whether you use ZFS in the open source or enterprise edition, it is a data management and file system tool that will aid in productivity and job ease. I have listed select benefits and administration commands that make this product unique from other data management products on the market. For further information on ZFS, visit http://www.opensolaris.org.

I thank Jarod Jenson and Bill Moore for their contributions.

Resources

http://blogs.sun.com/bonwick/entry/smokin_mirrors

http://opensolaris.org/os/community/zfs/docs/zfsadmin.pdf

http://opensolaris.org/os/community/zfs/source/

http://www.sun.com/bigadmin/features/articles/zfs_part1.scalable.html

http://blogs.sun.com/bill/entry/zfs_vs_the_benchmark

http://docs.sun.com/app/docs/doc/819-5461

http://blogs.sun.com/eschrock/entry/behind_the_scenes_of_zpool

http://blogs.sun.com/timc/

http://blogs.sun.com/roch/entry/when_to_and_not_to

http://blogs.sun.com/bonwick/date/20051118

Corey Brune is currently a Director of Systems and Database Engineering in Plano, Texas. He has been a Unix Engineer for more than 10 years, specializing in Systems Development, Networking, Architecture, and Performance and Tuning. He can be contacted at mcbrune@gmail.com.