
Clones, Failovers, and Migrations

Mike Scott

While working at a customer site recently, I had a requirement to design a mechanism whereby a large legacy database server could be failed over from one site to another. Over the years, several methods of doing so had been investigated and subsequently abandoned for various financial, technical, and political reasons.

Years after the initial service was implemented, the system was still a major single point of failure for the customer and was scheduled to remain in active duty for up to 18 months longer. The size and criticality of the system meant that the risk of implementing a creative solution was deemed unacceptable -- until a single, significant event forced the client to reassess the situation.

We were required to construct a solution around the existing system, keeping change to a minimum, and absolutely avoiding any direct interference with the database and application. We had a healthy budget to work with; however, we felt that a competent solution could be delivered utilizing a fraction of the available funds.

In this article, I'll discuss the technologies considered during this exercise and describe the methods employed to ensure consistent and reliable failover.

Legacy Systems

We briefly considered constructing a solution based on a clustering product; Veritas Cluster Server is heavily used by this particular client and would be the product of choice. VCS is a great framework for encompassing the complicated dependencies for starting/stopping and monitoring applications. As an aside, I notice that the Solaris 10 SMF provides a similar, perhaps even better, framework for this task, albeit without the multi-host capabilities of VCS -- perhaps an interesting future direction.

This particular 64-way E10k has been around for ages. It has been upgraded, broken, fixed, patched, unpatched, crashed, and recovered -- it even has custom-written kernel patches provided by Sun, because we are running it beyond the designed capabilities of the somewhat geriatric Solaris 2.6.

The outline of the final infrastructure of our solution can be seen in Figure 1, where it is apparent that it closely resembles many standard DR failover architectures. This style of failover is easy to implement in a green-field site, but without some extra work, the failover process (and more importantly, the service running on the contingent node) can prove to be problematic when dealing with legacy systems.

The biggest concern I had while considering this project was to ensure that the contingent host had a suitable and accurate replica operating system, and that once we had cloned the system, future changes to the production system would be replicated accurately. (As it happens, we didn't end up using VCS, opting instead for a manual ten-step failover procedure that could be invoked by any local sys admin.)

Basic Failover Mechanics

There are two basic infrastructure requirements for a successful failover:

1. IP Connectivity

Great flexibility can be gained by using DNS to provide a bridge between a service name and an IP address that may change. There is a significant drawback, though: for the entire history of the system, the service name (i.e., the DNS CNAME record used to signify the application interface) must have been used by all parties.
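In a green-field design, that indirection is a single CNAME record that can be repointed at failover time. A hypothetical zone fragment (all names and addresses here are illustrative):

```
; zone fragment -- names and addresses are illustrative
appdb     IN  CNAME  prodhost.example.com.  ; repoint to standby on failover
prodhost  IN  A      192.0.2.10
standby   IN  A      192.0.2.20
```

Clients that resolve "appdb" follow the CNAME, so failover is a one-record change (subject to TTLs and caching).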

On the face of it, this sounds like a relatively straightforward and sensible assertion ("why would it be any other way?", someone once asked me). However, given the wide-reaching use of network services and firewalls, and the hundreds of developers, administrators, and third parties that connect to the system, this is not something we can often guarantee on a legacy system. Instead, we are restricted to ensuring that the same IP address can be used for the production and standby nodes.

To achieve this, we examined two possibilities:

  • The use of Cisco LocalDirectors to distribute inbound connections to the correct node.
  • Migration to a cross-site VLAN. The subnet the system was originally installed on was available only on the production site. Another recent project had required the use of a number of cross-site VLANs that we could make use of.

2. Storage Availability

Even as far back as the days of SCSI direct-attached storage, we could configure arrays with multiple initiators on the bus to provide a semblance of failover capability, but this didn't address failures or outages influenced by locality (i.e., entire datacenter outages). Only when storage systems began to move to fiber did this sort of thing really take off. Initially, we had SSAs and A5x00 arrays using longwave GBICs to provide long-distance connectivity.

These days, large volume corporate data storage is almost exclusively in the arena of proprietary SAN-attached arrays. All the major vendors have their own products that provide remote site data replication. These proprietary hardware solutions generally use dedicated fiber links and have a suitably high price tag (in terms of both capital and operational costs) associated with them to put off everybody except those with very specific requirements.

Alternative, host-based solutions are available. These generally run as part of the operating system, and make use of network connectivity to transfer the data from one system to another. Veritas Volume Replicator (VVR) is one such product -- it integrates with Veritas Volume Manager to provide volume-level synchronization.

I also stumbled across an interesting concept of "Network Block Devices" on Linux (see http://nbd.sourceforge.net), which, as its name suggests, allows a standard block device to be made available over a network link. This is a remarkably simple concept that makes remote mirroring both a trivial task and one that can be done with a normal local volume manager.
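The concept really is that simple: an exported block device on one host, attached as a local device on another, and mirrored with the local volume manager. A hypothetical sketch (port numbers, paths, and device names are illustrative, using the classic nbd command syntax):

```
# On the host exporting the spare disk:
nbd-server 2000 /dev/sdb1

# On the host doing the mirroring, attach the remote disk as /dev/nbd0,
# then add it as a mirror half with the local volume manager:
nbd-client storageserver 2000 /dev/nbd0
```

The remote disk then looks like any other local block device, so ordinary mirroring tools handle the replication.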

This is an interesting distraction; however, the real core of the failover procedure is the part that ensures that the standby OS configuration is as close to production as possible. We do that by cloning the operating system and holding it in sync while it is in standby. To overcome the potential for service failure when the service is migrated to the standby host, we need to ensure that even the application is unaware of the node upon which it is running.

We therefore set out to replicate the entire OS via rsync (over SSH). This sounds easy in principle, but there are some parts of the OS that you cannot replicate. (Think about what might happen if you accidentally copied "/dev" from one machine onto another -- would you expect it to work when you couldn't guarantee identically configured hardware?)

Cloning the System

The first step in creating the standby server is to replicate it completely. We start the process by creating a new root file system. To do this, the domain is booted from the network and the file system created using newfs, as one might expect (assuming c0t0d0 is your boot device):

# newfs /dev/rdsk/c0t0d0s0
# mount /dev/dsk/c0t0d0s0 /a

Ufsdump/ufsrestore is then used to copy the root file system across. In this example, the server has no other OS file systems -- if we did have separate /var, /usr, /opt, or other file systems, then these would have to be copied in the same manner.

On the production machine:

# ufsdump 0f /somewherebig/root.dump /

Here, "/somewherebig" is a file system suitably sized to take the entire ufsdump. The dump was then transferred to the destination machine and applied to the newly created file system. For my purposes, sharing the "/somewherebig" file system via NFS met my requirements and allowed the dump to be extracted.

Storing the ufsdump (and not, for example, piping the dump directly across the network to a ufsrestore running on the other node) also provided the additional benefit of a point-in-time backup that could be reused, should the configuration on the contingent node go disastrously wrong:

# mkdir /tmp/nfs
# mount -o ro production:/somewherebig /tmp/nfs
# cd /a ; ufsrestore rf /tmp/nfs/root.dump

There are four immediate issues that must be addressed before attempting to boot:

  • IP addressing -- If we're on the same VLAN as the production machine, then remember that we've just copied across the "/etc/hostname.*" files, and booting that image may result in a production-threatening IP clash. I chose to remove all /etc/hostname.* files and work solely through the console until the system was ready to boot independently.
  • You will likely have your boot device mirrored, either by SVM or VxVM (as is the case in our example). The Solaris image that was just copied onto the standby contains the configuration for a mirrored/encapsulated rootdisk, which must be manually backed out in order to boot the environment (we can then re-encapsulate/mirror later).
  • The device tree (/dev and /devices and /etc/path_to_inst) will have to change, unless you are in the fortunate position of having an absolutely identically configured contingent system. To ensure a clean build, we bootstrapped a completely new tree.
  • Bootblock -- We've manually created a root file system and will have to ensure that we also install a suitable bootblock onto the device.
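The first two points translate into a couple of commands run against the mounted clone. Here is a minimal sketch -- exercised against a scratch directory so it can be tried safely anywhere; on the real standby, ROOT is simply /a:

```shell
#!/bin/sh
# Pre-boot cleanup of the restored image. On the real system, set
# ROOT=/a; the scratch directory below just makes for a safe dry run.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/etc"
touch "$ROOT/etc/hostname.hme0"                     # interface config from the dump
echo "/dev/vx/dsk/rootvol - / ufs 1 no -" > "$ROOT/etc/vfstab"

# 1. Remove interface configuration to avoid an IP clash with
#    production; interfaces are plumbed manually from the console later.
rm -f "$ROOT"/etc/hostname.*

# 2. Keep a pristine copy of vfstab before the unencapsulation edits.
cp "$ROOT/etc/vfstab" "$ROOT/etc/vfstab.bak"

ls "$ROOT/etc"                                      # vfstab vfstab.bak
```

The remaining two points (the device tree and the bootblock) are dealt with in the sections that follow.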

Root Disk Unencapsulation

To make the system bootable, we must remove the configuration of the volume management software that was, at the time of the ufsdump, managing the root file system. For reference, here is the procedure for removing either VxVM or SVM from a rootdisk.

VxVM:

# cd /a/etc/
# cp vfstab vfstab.bak
# vi vfstab
< edit the vfstab, and ensure that all OS filesystems refer to \
  the underlying devices >
# cp system system.bak
# vi system
< remove or comment out the two entries:
rootdev:/pseudo/vxio@0:0
set xio:vol_rootdev_is_volume=1
>
# cd vx/reconfig.d/state.d
# touch install-db
# rm root-done

SVM:

# cd /a/etc/
# cp vfstab vfstab.bak
# vi vfstab
< edit the vfstab, and ensure that all OS filesystems refer to \
  the underlying devices >
# cp system system.bak
# vi system
< remove or comment out the entry:
rootdev:/pseudo/md@0:0,1,blk
>

Bootstrapping a New Device Tree

A simple boot -r should be able to boot the system and rebuild the device tree to the spec of the new server. However, there are a number of disadvantages to this simple approach -- some technical, others administrative:

  • Your controller numbers will not start at "c0", "c1", etc. Instead, they will generally start numbering where the source machine left off. On a large server, this could mean that the controller numbers start in the twenties ("c21", "c22"). Although this doesn't affect the running of the server, it makes the system configuration irregular and therefore harder to support.
  • Depending on the storage configuration and OS revision, particular circumstances may arise (e.g., reusing the same disk controllers at the same hardware device path) in which you are in danger of exceeding the LUN-per-controller limit (this was a particular concern in this example).

So, the theory is simple -- we want to clear out /dev and /devices (and preferably /etc/path_to_inst). However, if we simply delete these files, the system will be unbootable.

Instead, while we are still booted into single-user mode from the network, we must erase the old device trees and generate new ones, and there is a specific trick for the path_to_inst.

Take a deep breath and delete the old trees:

# rm -rf /a/dev/* /a/devices/*
# echo '#path_to_inst_bootstrap_1' >/a/etc/path_to_inst

Notice that last command; we're overwriting /etc/path_to_inst with a very particular string. Simply emptying this file and expecting a reconfiguration won't work (even with a boot -a, which appears to suggest that it can rebuild a missing path_to_inst). Replacing the file with this token will cause the machine to generate a new file from scratch. A good example of this in action is on a Jumpstart boot image.

We can then generate the new trees:

# drvconfig -r /a/devices
# devlinks -r /a

Remember that we're working on a legacy system here; on newer revisions of Solaris, devfsadm is the preferred method of doing this. And, all being well, you should now have a reasonably complete device tree. All that remains is to install a new bootblock and perform a reconfiguration reboot to check:

# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk \
  /dev/rdsk/c0t0d0s0
# reboot -- -rs

If all has gone to plan, the domain should reboot, building a new path_to_inst and updating /dev and /devices as it goes. It ought to end up in single-user mode, where the remaining configuration (re-mirroring the rootdisk, setting up networking) is left as an exercise for the reader.

OS Syncing

After briefly considering the available options, I settled on rsync as the file synchronization tool of choice. Rsync is a great tool for this sort of thing -- it's like an enhanced "rcp" or "scp". It'll handle just about anything you care to throw at it, including sparse files, device files, named pipes, on-the-fly compression, and, a particularly nifty party piece, block-level incremental transfer. That means, where "rdist" will check a set of files and only transfer the subset that has changed, rsync will go one step further and transfer only those parts of the files that have been altered.

It's tempting to have rsync simply synchronize the whole of the root file system; however, that will clearly overwrite all the good configuration that we have just done and also have profound effects on the bootability of the machine as it overwrites "/dev" and "/devices" (note that later versions of Solaris have a separate pseudo file system for "/dev").

The ideal method seemed, therefore, to be to provide both an "include" and an "exclude" list: the files/directories in the "include" list are synchronized recursively, except where a match is made against the "exclude" list (via rsync's "--exclude" flag).

The include/exclude settings in the script (see Listing 1) will synchronize the whole of the root file system and exclude particularly irrelevant parts (/dev, /devices). The OS configuration starts to get complicated under "/etc", so I've deliberately excluded that and explicitly stated (via the "include" list) the files and directories underneath /etc that I want to keep synchronized. One day, perhaps I'll find the time to explore and map out the contents of "/etc" more accurately.
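Ordering matters here: rsync evaluates its filter rules in sequence, and the first match wins. A self-contained sketch against throwaway directories (the filter list is a tiny, illustrative subset of what Listing 1 does):

```shell
#!/bin/sh
# Demonstrate rsync's first-match-wins filter rules on scratch directories.
SRC=$(mktemp -d); DST=$(mktemp -d)
mkdir -p "$SRC/etc" "$SRC/dev"
echo "root:x:0:0" > "$SRC/etc/passwd"     # explicitly included below
echo "secret"     > "$SRC/etc/shadow"     # caught by the /etc/* exclude
echo "junk"       > "$SRC/dev/rmt0"       # the device tree must never travel

# passwd matches the include rule before the /etc/* exclude can apply;
# everything else under /etc and /dev is filtered out.
rsync -aH \
    --include='/etc/passwd' \
    --exclude='/etc/*' \
    --exclude='/dev/*' \
    "$SRC"/ "$DST"/

ls "$DST/etc"    # only passwd made it across
```

The real script extends the same pattern with a much longer include list for /etc and excludes for /dev, /devices, and other host-specific trees.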

Frequency

While using the proposed procedure for synchronizing a host, some thought must be given to the frequency of the sync process. If it's not frequent enough, you may be missing vital configuration in the event of a disaster; if it's too frequent, you may be compromising some of the functionality of a failover system.

For example, imagine what might happen if you had a file system corruption that was copied across to the failover host -- you might not be able to prevent this from happening, but you can reduce the odds of its causing you an issue on both hosts by reducing the frequency of updates. Perhaps different subsets of files should be copied at differing frequencies.

For our situation, we chose to split the frequency. As we were not using network-based authentication, it was important to keep the password, shadow, and group files as current as possible (the machine is in 24x7 interactive use, with in excess of 16,000 entries and up to 2,500 concurrent telnet sessions). These files are sync'd once an hour, ensuring that, in the event of a failover, as many user accounts as possible will be completely up to date (with respect to account creations, deletions, and password changes).

Everything else was then set to sync once daily -- this caught any other configuration issues (print queues, cron changes, etc.).
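On the production host, this split might be driven by two crontab entries along these lines (the wrapper script names are illustrative; the real work is done by the rsync script in Listing 1):

```
# root's crontab on the production node -- paths are illustrative
0 * * * *   /usr/local/bin/sync_auth   # hourly: passwd, shadow, group
15 2 * * *  /usr/local/bin/sync_os     # daily: everything else in the include list
```

Staggering the daily run into the early hours also keeps the bulk transfer away from peak interactive load.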

Failover Execution

Because of the very high profile of our specific situation, any service-affecting issue would be immediately noticed by our on-site operations staff and escalated to technical support. Depending on the nature of the issue, the problem should be analyzed and discussed by the teams involved. The goal of this initial analysis is to ascertain the estimated time to recover using both a fix-on-fail and a failover strategy and to determine which approach should restore service earliest with the least risk.

Summary

An "ideal world" is usually a faraway proposition for legacy systems that have evolved over many years. Many teams of sys admins, DBAs, app support, and developers will have been involved -- perhaps hundreds of personnel in total, each of whom has the capability to break a failover-capable system through hardcoding, misconfiguration, or just plain old error.

One of the difficulties of having a failover-capable system is that unless a regular test of the failover solution is organized (with customer expectations set that problems may be found, and with opportunity given to fix them), any inability to fail over successfully will be discovered at the worst possible moment.

The discussed procedure is a useful one; it treats the operating system as just a bunch of files on a file system -- a commodity viewpoint whereby an initially complex problem becomes trivial. I've also used this procedure to migrate a host, reducing a complex server migration from a multi-hour outage to what appeared, to the end user, to be a trivial system reboot. We cloned the system a month in advance of the planned migration in order to test the hardware thoroughly before committing to our decision. In the month leading up to the migration, we used the rsync script to hold the destination machine in sync with the source.

There is a cloud on the horizon for this type of operation, though -- traditionally, Solaris has been configured solely by a collection of flat files (in contrast to the "registry"-style databases of Windows and AIX). In Solaris 10, Sun has begun to introduce more complex configuration databases -- the SMF has begun to emerge (I say "begun", as the transition away from the old "rc" files still appears to be underway). That's not to say this approach is no longer possible, just that we'll have to work a little harder to be that much smarter in the future.

Mike Scott is the director of Hindsight IT Ltd, a small Solaris consultancy based in Central Scotland. He has been working in the North East and the central belt for longer than he cares to calculate, specializing in systems management with a keen interest in security and performance management. He can be contacted at: sysadmin@hindsight.it.