Article

jul2006.tar

RAIDing FreeBSD with GEOM

Stephen Corbesero

I recently set up a new file server at my home. I have always been a fan of FreeBSD, and my current file server is a FreeBSD 4.11 system that uses the vinum system to mirror two ATA133 40-GB drives for part of the system area and all of the data storage. The motherboard on my new system has support for the typical four (ATA-100) devices, but it also had support for two SATA-300 drives. I had been eager to check out these new SATA drives, so this was a perfect opportunity.

Over the years and through many versions, FreeBSD has provided support for different hardware and software RAID systems. I generally prefer software RAID solutions because they are often more flexible. Besides hardware RAID on specific disk controllers, there are several software RAID solutions on FreeBSD. Remember that RAID stands for Redundant Array of Independent Disks, and the most common levels are RAID-0 (striping), RAID-1 (mirroring), RAID-5 (distributed data plus parity), and RAID-10 (striping plus mirroring). I tend to prefer RAID-1 on systems with fewer than four drives and RAID-10 with four or more drives.

Important safety note: Having your system and data areas protected by RAID is no reason to avoid backups. RAID can only protect your system from a routine drive failure. A catastrophic event like a nearby lightning strike or flood will cheerfully destroy all your mirrors simultaneously. Maintain the discipline of a regular backup strategy including off-site archiving!

ccd: This is the concatenated disk driver, and it has been in FreeBSD since the mid-90s [1]. It is a standard pseudo-device driver that can be added to the kernel to provide simple disk striping and mirroring. I used the ccd device to create a large, fast news spool for an ISP. It did not have many frills, but it worked great!
GEOM: GEOM is the latest abstraction layer for the FreeBSD disk system, which first appeared in FreeBSD 5 [2]. The GEOM is dubbed as a "Modular Disk Transformation Framework." It provides a set of stackable drivers that can act as provider or consumer classes for storage objects. With this new and flexible framework, "geom classes" can be linked together to provide various services. Besides the RAID classes like striping and mirroring, there are also classes to support disk label management and encrypted file systems. However, it is important to realize that the fundamental GEOM framework does not provide a complete set of logical volume management (LVM) services. For that level of abstraction, you should consider vinum.
vinum and gvinum: inum is a very powerful logical volume manager originally written by Greg Lehey and was first available in FreeBSD 3 [3]. Somewhat patterned after the Veritas volume manager, this software is very configurable and flexible. As with ccd, it does disk striping and mirroring, and also supports RAID-5. My current file server was built using vinum mirroring. gvinum is a rewrite of vinum that uses the GEOM framework, but it is not well-documented -- there is not even a manual page!

I chose GEOM mirroring for my new file server instead of staying with vinum for several reasons. First, I wanted to get some experience with the new GEOM classes. Second, vinum predates the new GEOM standard for the FreeBSD disk subsystem, and there are compatibility issues. With FreeBSD 5.X and 6.X, the new gvinum should be used, but the lack of documentation was more than enough to stick with raw GEOM classes. Unfortunately, as mentioned previously, not using gvinum means not having a true LVM, so many operations must be done manually.

Components

My system is composed of an ASUS Pentium motherboard with a 2.9GHz Celeron processor, 512MB of RAM, one 120GB ATA-100 drive, two 160GB SATA-300 drives, a CD-ROM drive, case, power supply, etc.

Let's assume that the FreeBSD drive device assignments for this configuration are ad0 for the ATA disk drive and ad4 and ad6 for the SATA disk drives. The goal of the following procedures will be to create a mirrored system volume across ad0, ad4, and ad6.

Planning

Before I undertook this project, I started by looking for references and documentation that might give me a good direction along with a set of instructions. The FreeBSD Handbook [4] has an entire chapter dedicated to the new GEOM framework [5]. Although this chapter contains a great deal of useful information, reading it did not make me feel confident enough to try it. After some Web searching, I found two very good articles online. The first one was titled "FreeBSD Disk Mirroring" [6]. This was a good reference, but the article approached the problem with a shell script-like solution.

The other article was titled "Using Software RAID-1 with FreeBSD" [7]. This was also a good article, but it only presented the whole-disk-mirror approach, which I find to be too restrictive. However, this article did suggest an interesting but undocumented direct technique to convert an existing and populated slice into a GEOM gmirror. All the GEOM classes require a block of bookkeeping information that is placed at the end of the object, which could theoretically be a single slice, an entire partition, or even the whole drive.

I had concerns about how the file system would react to a block being taken away after the file system was created. Also, the article applied this technique after partitioning the entire drive as a single FreeBSD slice. This affects the initial offset of the disk label in the slice. I experimented with this approach for multiple slices, but it did not work well. I had problems with partition offsets within my slices, and I had to abandon it. However, if the configuration of a single slice is acceptable for your application, I would still recommend trying it, because it saves a considerable amount of data juggling. Since neither of the previous approaches worked out of the box for me, I developed the following procedure as an amalgam of the above with a dash of my own philosophies on disk administration.

When setting up any new system, it is always best to spend time planning before installing. Some very important choices must be made that will affect how the actual disk space will be allocated and configured. One of the most important choices is that of using single or multiple slices for the disks. The GEOM classes make it possible to mirror entire disks or individual slices of disks. Although mirroring the entire drive seems appealing for its apparent simplicity, it is not very flexible. For example, I was concerned about replacing a drive if one should fail in the future. Would I need to find an exact replacement? If I were forced to replace a failed drive with a larger one, would I lose the extra space? By applying GEOM to individual slices, I would be free to add slices that might not need to be mirrored.

Questions like these led me to the decision that I would have at least two slices that would each be under GEOM control. One slice would be just for the operating system, and the other would be for a data area such as home directories. There are many advantages to multiple- rather than single-slice configurations. The single most significant advantage is flexibility. By having a separate data slice, operating system upgrades and backups can be simpler. Also, there is a much smaller chance of running out of partition letters.

Another advantage is the possibility of having multiple versions of FreeBSD on separate slices, so you can experiment with different releases on their own slices without having to overwrite or potentially damage an installation. Such experimental slices might not even need to be mirrored, because they are likely to be transient.

With this layout in mind, the most important task would be to get the system sized, installed, and mirrored. Unfortunately, the FreeBSD install program, sysinstall, still does not provide an option for mirroring at install time. To accomplish our goal, the procedure will involve a three-phase process of installing a base system, creating a single-drive "mirror" of it, and finally destroying the original system to make the system fully mirrored. At this point, additional system or data slices, mirrored, striped, or not, can easily be added.

The next step is to decide on a size for the system volume. Table 1 shows the layout I used. I generally follow very standard allocation rules and try to leave sufficient space for expansion.

Procedures

From this point on, it is assumed that you are comfortable with BSD commands and procedures in general as well as with using sysinstall to install a new system or modify an existing one.

Phase 1: Base Installation

Perform a standard installation of FreeBSD on the first slice of ad0. If using my layout, size this slice at 30 GB and make note of the number of sectors. Be sure to install the standard FreeBSD boot manager.

Do not get very attached to the result of this install, because there is always a chance you may have to do it again. I went through the complete installation but did not bother to load the ports collection, configure X, or create any user accounts. After the install was finished, I must confess that I did install emacs and bash from the installation CD-ROM, but those are two packages that I just cannot live without. After the install finishes, remove the CD-ROM, and reboot from the hard drive. You should have a minimal but fully functional system.

Phase 2: Create the First System Mirror

As root, use sysinstall to put a slice on the second drive, ad4, which is the same size as the system slice on the first drive. Be sure to also let sysinstall put the boot manager there as well. If sysinstall can't write the slice information, run the sysctl command (below) to enable GEOM writes and then run sysinstall again. GEOM uses the end of a slice or disk to store its configuration metadata, so you will actually lose a small amount of storage. Unless you tried to size your system volume down to an exact number of sectors, this should not be an issue. Let us assume that this slice becomes ad4s1. Use the following commands to start the process of creating the first mirror component.

gmirror  load
sysctl  kern.geom.debugflags=16
gmirror  label  -v  -b  round-robin  sys0  /dev/ad4s1

The first command loads the GEOM mirror module into memory. The second command is equally, if not more important. It disables a kernel safety feature that will normally prevent altering GEOM data on a disk that is currently mounted for writing. This may not be needed just yet, but it definitely will be needed once we start creating file systems. Whenever you're working with low-level GEOM operations like fdisk and bsdlabel, if you get an error message that the operation was not permitted or a write failed, you probably need to set this kernel variable back to 16 for the command(s) to succeed. This variable is initialized to 0 at system startup so do not worry about resetting it.

There is one additional caveat. If you happen to create a slice that goes all the way to the end of the disk drive, GEOM will get confused and think that the entire disk is under the control of a single GEOM class, as opposed to each slice being controlled by independent GEOM classes. To avoid this, make sure the last slice does not go all the way to the end. Leaving the last sector of the drive out of the slice is enough for GEOM to get it right.

The third command creates the mirror, sets the I/O scheduling to be round robin, and names the new "device" /dev/mirror/sys0. I chose the name "sys" so that it would be descriptive and added the trailing 0 to make it somewhat similar to most disk naming conventions. It is very important from this point on to refer to all disk operations by the logical mirror name, /dev/mirror/sys0, and not to use the physical /dev/ad4s1 name. Any operations on /dev/ad4s1 could seriously hurt the configuration, and you would likely need to start over.

You now need to create the file systems inside the mirror. Unfortunately, sysinstall does not know about mirrored drives, so you have to put a bootstrap on the slice and edit the disk label manually. The command bsdlabel -wB /dev/mirror/sys0 will initialize the mirror with a bootstrap and a minimal disk label. Next use the command bsdlabel -e /dev/mirror/sys0 to make the file system partitions. This may take a few tries, so I set the EDITOR environment variable to be emacs for my comfort, otherwise it would default to vi.

Luckily, the bsdlabel command has become more user-friendly over the years. When editing the slice with bsdlabel, it is now possible to specify partition sizes in units like mega- and gigabytes instead of just sectors. Also, the * can be used as a place holder for bsdlabel to compute the partition offsets automatically. The following commands will write a bootstrap for the mirror, and then let you edit (create) a label. I urge you to read the man page for bsdlabel.

According to the man page, you should be able to set the label to what is shown in Figure 1. Do not change the c partition! Theoretically, when you leave the editor, bsdlabel will correctly determine the values of the asterisks for the offsets. However, I had a problem with the last partition. bsdlabel could not figure out the size of the last partition to replace that asterisk. I had done the label in two steps. In the first edit, I created the last partition at a small size. When I did the second edit, I could see the various sector sizes, and I could subtract the sector offset of the last partition from the total number of sectors in the entire slice (from partition c) to determine the real (and maximum) size (in sectors) for the f partitions.

If the labeling went well, you should now be able to make all the file systems using newfs commands and then check them with fsck:

 for  part  in  a  d  e  f;  do
     newfs  -U  /dev/mirror/sys0$part
     fsck  /dev/mirror/sys0$part
 done

Copy the data from your boot disk file systems to the sys0 mirror. You can use dump and restore for this, but I always use cpio in pass mode. For example, to copy the root system, use the following commands:

 cd  /
 mount  /dev/mirror/sys0a  /mnt
 find  -x  .  -print  |  cpio  -pmud  /mnt
 umount  /mnt

Then, repeat the above sequence using the other file systems names and partitions: /var to sys0d, /usr to sys0e, and (if you did install ports) /usr/ports to sys0f. Note that it is possible to mount all the mirror file systems under /mnt and do one mass copy, but I prefer doing one at a time just to be safe.

There are some minor configuration files that must now be edited to make the mirror the actual boot drive. The /etc/fstab on the mirror must refer to mirrored file systems and not the ad0 file systems. Remount /dev/mirror/sys0a on /mnt and then edit /mnt/etc/fstab. Change all occurrences of /dev/ad0 to /dev/mirror/sys0, keeping the partition letters the same.

Also, make sure that the GEOM mirroring module will be loaded at boot time and that the loader will correctly read the kernel from the mirrored slice. The following commands will correctly set the necessary configuration on both the mirrored and unmirrored slices:

#  echo  'geom_mirror_load="YES"'   >  /boot/loader.conf
#  echo  '1:ad(4,a)/boot/loader'    >  /boot/boot.config
#  echo  'geom_mirror_load="YES"'   >  /mnt/boot/loader.conf
#  echo  '1:ad(4,a)/boot/loader'    >  /mnt/boot/boot.config

At this point, your mirror should be fully functional and ready to be the system boot device. You can unmount the mirrored root, do a couple of syncs, and then reboot.

Watch the boot messages; you should see messages about the GEOM mirror module being loaded and the root going to /dev/mirror/sys0a. If not, you are still in pretty good shape. You can boot from the system on ad0a manually by telling the loader. At that point, check the /etc/fstab on the mirrored volume and the boot files to make sure they are correct. You may wish to fsck the mirrored file systems to make sure they are intact.

Phase 3: Complete the System Mirror

If your system has booted from the mirror, you can now finish the job by creating additional mirrors of the system volume. This is actually the easy part, but you do have some choices.

Again, become root. Do a df to make sure that nothing is mounted from ad0. If so, double-check that that the size of ad0s1 and ad4s1 are the exact same number of sectors. You can use fdisk to read these values. If they are not the same size, use sysinstall to erase and recreate the ad0s1 to be exactly the same size in sectors as ad4s1! Once this is the case, add that partition to the mirror with the following command:

#  gmirror  insert  sys0  /dev/ad0s1

At this point, the mirroring module will automatically begin copying data from the source on ad4s1 to the new mirror component until the two are fully synchronized. This can take quite a while if you have chosen a large slice size. You can monitor the progress with the gmirror status and gmirror list commands. The output from the status command will include a notation of DEGRADED to indicate that synchronization is in progress along with a percentage completed measure. In my installation, I had a third drive, so I created a third mirror for the system volume by making the ad6s1 slice and then adding it with another gmirror insert command.

Congratulations. At this point, your system is mirrored and ready to go. You may now choose to create one or more mirrored data slices. Data slices are much simpler. For example, I created a slice to hold file systems for my home and data directories using the following sequence. I also chose to mirror these only on the SATA drives and not the ad0 drive:

1. Use sysinstall or fdisk to create an s2 slice on drives ad4 and ad6. You will undoubtedly need to issue the sysctl command to allow writing on the GEOM provider.

2. Use gmirror label -v -b round-robin data0 /dev/ad4s2 to build the first component of the mirror.

3. Use bsdlabel -w /dev/mirror/data0 and bsdlabel -e /dev/mirror/data0 commands to create an initial label and to lay out the partitions for the file systems.

4. Use the newfs and fsck commands to make and check the new file systems.

5. Use gmirror insert data0 /dev/ad6s2 to add the other drive's slice to the mirror.

6. Add the file systems you created in the mirror to /etc/fstab, remembering to use their /dev/mirror/data0x partition names. Don't forget to create their mount points.

My final /etc/fstab is shown in Figure 2.

After the mirrors synchronize, the system should be fully operational. Enjoy!

Conclusions

This procedure shows a step-by-step account of how I set up a small file server for my home with basic RAID capabilities. In retrospect, although working with the GEOM mirroring features was instructional, I would have to say that using gvinum is still probably the best way to go to get the flexibility of a true logical volume management. Since my system is stable and completely functional, I doubt that I will change it. The only thing holding me back from using gvinum in a system is the lack of documentation; however, it seems like a good project to try on a spare system.

References

1. ccd(4): Concatenated disk driver. FreeBSD 6.0 manpage, August 1995. University of Utah.

2. Kamp, P.-H. Geom(4): Modular disk i/o request transformation framework. FreeBSD 6.0 manpage, March 2002.

3. Lehey, G. Vinum(4): Logical volume manager. FreeBSD 6.0 manpage, May 2002.

4. Various Authors. 1995-2006. The FreeBSD Handbook. http://www.freebsd.org.

5. Rhodes, T. 1995-2006. The FreeBSD Handbook, ch. GEOM: Modular Disk Transformation Framework. http://www.freebsd.org.

6. Engelschall, R. S. 2005 (February). FreeBSD system disk mirroring. Daemonnews. http://www.daemonnews.org.

7. Lavigne, D. 2005 (November). "Using software RAID-1 with FreeBSD". O'Reilly ONLAMP.com. http://www.onlamp.com.

Stephen Corbesero is an associate professor of Computer Science at Moravian College in Bethlehem, Pennsylvania. His teaching and research interests include operating systems, systems administration, and networking. He also lives in Bethlehem with his dogs, Cursor and Chip, and his cat, Pixel, who always provide interesting input to the software development process.