
Dynamic Patching via State and Run-time Control

James Hartley

This project stemmed from my desire to approach Unix administration from the standpoint of cooperating agents. I wanted to explore how scripts could cooperate to accomplish complex tasks that systems administrators face day to day. As the number of systems in the data center increases, it becomes an almost impossible task to maintain systems by hand. By designing agents that perform various tasks and by chaining these together via plans and goals, systems can begin to learn how to maintain themselves.

I consider this experiment a first step in learning how machines can perceive their environment, determine some identity information about themselves, and act via the Unix API and Unix commands to perform the complex day-to-day tasks themselves, without administration by humans.

In such a setup, humans become the orchestrators of activity and not slaves to the servers. Humans set goals and desires and determine plans, and the software agents attempt to execute these goals by determining their environment and communicating with other agents required for the process. This experiment tests the cooperation between agents to complete the complex task of patching a system automatically once a month. The higher level design would simply state the goal of patching; the machines would then figure out how to patch by combining a series of agents to do the job. In this article, grabpatch, detach_attach_mirrors, and patchit are the agents that patch the machine; each performs a task that could be reused in different goals to do different jobs.

Future work will concentrate on new agents and on designing the framework to introduce goals and desires into the mix. Then, in the patch case, all that would be required of an administrator would be to set the goal "patch host on the third Saturday of the month with the current patch cluster", and all systems would cooperate to load the appropriate agents and set up the appropriate environment to patch the hosts.

Introduction

The intent of this article is to present the automated scripts and explore some modifications that might be useful in future revisions of the code. It also presents a novel method of using state to control the operation of cooperating scripts, and demonstrates high-level control of system operation by dynamically altering system files during the execution of the patching scripts. In this case, the /etc/inittab file is dynamically altered to control the boot process.

Adding patch clusters in Solaris is primarily a manual process. Currently patch clusters are manually pulled down from SunSolve's Web site, distributed to the servers, and installed by running the install_cluster script within the patch cluster. Attempts to automate the process can still require some manual intervention to either install or reboot the server to fully install the cluster.

Several problems arise from any manual process. For patching, these include: the time to manually pull the correct cluster from SunSolve, manually copying patch clusters to the appropriate host, pre-patch preparation, and the actual installation. Not only is time the enemy, but manual methods are prone to error; in a large shop it is possible to forget to add the patch cluster to some hosts, install or attempt to install the wrong patch cluster, and/or forget to reboot a host that has installed the patch cluster.

The solution is to completely automate the patch process and provide enough logging information so that the administrator can simply orchestrate events. This is achieved by running several scripts that automate each step of the patching process. Each step has its own script that can be modified for various business models.

The Scripts

Automating installation of the patch clusters is performed using four scripts. The four scripts that define the fully automated procedure are pull_cluster, grabpatch, detach_attach_mirrors, and patchit.

The pull_cluster script sits on a patch server. It is responsible for downloading the patches for each version of the operating system and placing them in the appropriate directories.

The grabpatch script is located on each host; it is responsible for downloading the correct patch for that host based upon the operating system version the host is running. The patch is loaded into the correct directory that patchit will use to install the patch cluster.

The detach_attach_mirrors script is a pre-processing script which, in our business environment, breaks Disksuite mirrors on our root partitions. Detach_attach_mirrors shows how to plug in your own pre-processing scripts required by your environment before installation of the cluster. If you use Disksuite, it might be of interest to study the code. Detach_attach_mirrors lives on each host. It is smart enough to determine whether there are any metadevices on the host to be patched and, if so, whether the mirrors can be successfully detached.

Finally, patchit is the script that does the actual installation. It dynamically modifies /etc/inittab to control the run level at which patching is done and reboots to run level 3 after patching is complete. This script allows installation of the patch cluster at the appropriate run level and then reboots the system after installation.

These scripts take advantage of the crontab file, startup scripts, and the ability in Unix to dynamically edit files during the execution of scripts.

The scripts on the host to be patched must cooperate to determine the success or failure of each previous step. If any previous step is unsuccessful, then patching must stop. To control each script, state flags in the form of files are used to define the state of the patching process. Before I explain each script in detail, however, take a look at Figure 1, which shows the architectural layout of the system. Note the existence of a patch server and which scripts are loaded onto which hosts.

Run-Time Control

Note that pull_cluster runs only on the patch cluster server. It is independent of the other scripts. Its job is to pull down the appropriate patch clusters from SunSolve. It can be scheduled via cron once a month to pull down the latest release of the cluster. It is desirable to patch only if the patch cluster has changed. To accomplish this, pull_cluster compares the md5 checksum of the new patch cluster it gets from Sun with that of the last patch cluster; if they differ, the zip files are stored in the appropriate OS version directory. If the patch clusters are the same, the zip files are deleted.

Patching continues when the grabpatch script is run from the host being patched. The grabpatch script will fail if there are no patch cluster zip files on the patch server. This halts the patching process, and the other scripts will not run. Synchronization between pull_cluster and grabpatch is accomplished by scheduling pull_cluster on the patch server before running grabpatch as a cron job on the host to be patched. The code for pull_cluster is shown in Listing 1.

Pull_cluster

Notice that the main routine is very straightforward. For each OS type in the OS array, we clean up the old zip files, get the new patch cluster zip files, construct an md5 digest of the zip files, then check these against the checksums of the previous patch clusters. The check_production function is used in our business environment to determine whether older patches should be applied to our production machines. Our business model requires that production machines get the patch cluster last given to the development machines. The heart of the script uses "wget" to grab the cluster. It would be very easy to model different patch policies within this framework by adding the appropriate functions. It is the responsibility of the check_development and check_production functions to delete the zip files if the previously applied cluster is the same as the new cluster.
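The checksum logic can be sketched as follows. This is not Listing 1 itself: the names OS_VERSIONS and PATCH_ROOT and the directory layout are hypothetical, the wget call is elided, and md5sum stands in for whatever md5 tool your platform provides (on Solaris, `digest -a md5` is one option):

```shell
#!/bin/sh
# Sketch of pull_cluster's download-and-compare loop (hypothetical names).

# cluster_changed NEW_ZIP SUM_FILE
# Return 0 (changed) if NEW_ZIP's md5 differs from the stored checksum,
# recording the new checksum; return 1 if the cluster is unchanged.
cluster_changed() {
    new_sum=`md5sum "$1" | awk '{print $1}'`
    old_sum=""
    [ -f "$2" ] && old_sum=`cat "$2"`
    if [ "$new_sum" = "$old_sum" ]; then
        return 1                      # same cluster as last month
    fi
    echo "$new_sum" > "$2"            # remember it for the next run
    return 0
}

pull_clusters() {
    # One directory per OS version, e.g. $PATCH_ROOT/9_Recommended
    for os in $OS_VERSIONS; do
        dir="$PATCH_ROOT/$os"
        # wget ...                    # network fetch elided in this sketch
        if cluster_changed "$dir/${os}.zip.new" "$dir/md5.last"; then
            mv "$dir/${os}.zip.new" "$dir/${os}.zip"
        else
            rm -f "$dir/${os}.zip"*   # unchanged: leave nothing for grabpatch
        fi
    done
}
```

Deleting the zip files on a checksum match is what later makes grabpatch fail cleanly and halt the whole chain.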

Once the clusters are in place, control moves to the host to be patched. Again, cron is used as a synchronization tool, along with three flags that track the state of the patch process. As mentioned before, the order of execution is grabpatch, detach_attach_mirrors, and patchit. Each will be discussed in turn along with the use of each flag.

On each host to be patched the scripts are installed in the following locations. Grabpatch and detach_attach_mirrors are placed in /usr/local/bin. Patchit is placed in /etc/init.d/patchit, and symlinks are created to it in /etc/rc1.d/S202patchit and /etc/rc3.d/S202patchit. Crontab entries are made into root's crontab file to first run grabpatch (in our model the night before patching), then detach_attach_mirrors (pre-processing step to break Disksuite mirrors, usually the day of patch installation), and finally to run patchit.
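Concretely, installing the agents on a host amounts to something like the sketch below. SRC and ROOT are knobs I've added for illustration (ROOT lets you rehearse the layout in a scratch directory; leave it empty on a real host), and the crontab times are merely examples of the schedule described above:

```shell
#!/bin/sh
# Sketch of the per-host installation layout described in the article.
SRC=${SRC:-/var/tmp/patch-scripts}   # hypothetical staging directory
ROOT=${ROOT:-}                       # set to a scratch dir to rehearse

install_patch_scripts() {
    cp "$SRC/grabpatch" "$SRC/detach_attach_mirrors" "$ROOT/usr/local/bin/"
    cp "$SRC/patchit" "$ROOT/etc/init.d/patchit"
    # patchit doubles as a run control script at run levels 1 and 3
    ln -s ../init.d/patchit "$ROOT/etc/rc1.d/S202patchit"
    ln -s ../init.d/patchit "$ROOT/etc/rc3.d/S202patchit"
}

# Illustrative crontab entries for root (grabpatch the night before,
# the pre-processing step and patchit on patch day):
#   0 22 * * 5 /usr/local/bin/grabpatch
#   0 6  * * 6 /usr/local/bin/detach_attach_mirrors -p
#   0 8  * * 6 /etc/init.d/patchit
```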

Because each step of the patch process must be successful before the next step can be taken, we must communicate state information to each script. Also notice that patchit is both a cron script and a run control script. Patchit requires additional flags to determine whether or not to patch at run level 1 and at run level 3. Patchit has this complexity because it is desirable to patch at run level 1 and then reboot after patching. However, there are other reasons to reboot besides patching, and it is not desirable to patch the box during every reboot or every time we execute init 1. The flow is as follows.

When patching is desired, the administrator creates the /etc/patchflag file on the host. This flag must be present when the crontab entry for grabpatch executes. When grabpatch executes, it looks for the patch cluster zip files within the patch cluster server directory. It creates another flag file called /etc/patchlog. The /etc/patchlog file also doubles as a log file for the patch process. We use these two flag files to communicate with the other processes. In this case, if grabpatch sees zip files for the patch clusters, it will download the zip files and log the process within the /etc/patchlog file. Patching continues with the next script, and /etc/patchlog and /etc/patchflag remain.

If grabpatch does not see any zip files, it logs failure into /etc/patchlog, removes the /etc/patchflag file, and then renames the /etc/patchlog file to /etc/patchlog.`date +%y%m%d`. Because the other patch scripts require the existence of both /etc/patchlog and /etc/patchflag to run, once the files are renamed or removed, patching halts. This means that if grabpatch finds a patch cluster, patching will continue; if it does not, the other scripts in the cron service will exit without continuing the patch process. The code for grabpatch is shown in Listing 2.

Grabpatch

Grabpatch does some interesting things. It starts by reading a file on the patch cluster server called "host_type". If it finds its hostname, it continues; if not, it erases /etc/patchflag and renames /etc/patchlog, halting the patching process as before. If the host is identified in the host_type file, the host reads where to place the cluster on the hard drive for installation. This location is placed into /etc/patchlog so that patchit knows where to find the patch cluster when it reboots the box for installation.

Grabpatch also determines its OS version from the uname command; it uses this to determine which cluster to pull from the patch cluster server. This is particularly nice because the host "knows" which cluster to pull, and the sys admin does not have to keep track of this detail. Once the patches are pulled to the host, the clusters are unzipped, and any cluster in the patch cluster directory on the host is logged into /etc/patchlog. Success keeps the /etc/patchlog and /etc/patchflag files intact; any failure of this script will rename the log file and remove the patchflag, halting the next step in the process.
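A sketch of these self-identification steps follows. The two-column "hostname install-dir" host_type format and the cluster-naming rule are assumptions made for illustration; the real logic is in Listing 2:

```shell
#!/bin/sh
# How grabpatch might identify itself and pick a cluster (a sketch;
# the host_type file format is assumed, not taken from Listing 2).

# lookup_install_dir HOST_TYPE_FILE HOSTNAME
# Print the install directory for this host, or fail if it is absent.
lookup_install_dir() {
    awk -v h="$2" '$1 == h { print $2; found = 1 } END { exit !found }' "$1"
}

# cluster_name OS_RELEASE
# Map a uname -r release to a cluster directory, e.g. 5.9 -> 9_Recommended
# (an assumed naming convention for the patch server).
cluster_name() {
    echo "$1" | sed 's/^5\.//; s/$/_Recommended/'
}

# In grabpatch proper, these would drive the download, roughly:
#   dir=`lookup_install_dir host_type \`uname -n\`` || fail_and_halt
#   fetch "`cluster_name \`uname -r\``.zip" into "$dir"
```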

After grabpatch, the business model requires that we break any root mirrors before patching. This script can be considered a pre-processing step to the patch procedure. In other words, any scripts that are considered necessary to prepare an environment for patching can be placed between grabpatch and patchit, as long as they obey the rules. The rules are: if successful, leave /etc/patchflag and /etc/patchlog named as they are; otherwise, remove /etc/patchflag and rename the log file.
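A pre-processing script that obeys these rules might be structured as follows. This is a skeleton, not code from the article; the PATCHFLAG/PATCHLOG variables default to the real paths but exist so the sketch can be exercised elsewhere, and main is left uncalled so the fragment is inert:

```shell
#!/bin/sh
# Skeleton for a site-specific pre-processing step that slots between
# grabpatch and patchit in root's crontab.
PATCHFLAG=${PATCHFLAG:-/etc/patchflag}
PATCHLOG=${PATCHLOG:-/etc/patchlog}

prepare_for_patching() {
    # ... your environment-specific work goes here (e.g., break mirrors) ...
    return 0
}

main() {
    # Rule 1: if a previous step failed, the flags are gone -- do nothing.
    [ -f "$PATCHFLAG" ] || exit 0
    [ -f "$PATCHLOG" ]  || exit 0

    if prepare_for_patching >> "$PATCHLOG" 2>&1; then
        :   # Rule 2: success -- leave both files exactly as they are.
    else
        # Rule 3: failure -- halt the chain the same way the other scripts do.
        rm -f "$PATCHFLAG"
        mv "$PATCHLOG" "$PATCHLOG.`date +%y%m%d`"
        exit 1
    fi
}

# main    # uncomment when installing
```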

Detach_attach_mirrors

The next script in the chain is detach_attach_mirrors. This script is useful as a system script in its own right. Because of this, a -p switch is added to make it useful in patch mode. When patching, the -p switch must be present and is the only option the script allows. The script can clean up its files in /var/tmp, it can detach root mirrors, and it can attach root mirrors. During patching, the -p switch requires the existence of the flag files in order to proceed; otherwise it exits early, does not break mirrors, and stops the patch process in the usual manner. It might be useful to anyone using Disksuite to break root mirrors. Metadevice health is checked before the mirrors are detached. Listing 3 displays the code for detach_attach_mirrors.

This script uses temporary files in /var/tmp to keep track of which submirror is broken from the mirror so that you can reattach the appropriate submirror to the original mirror after patching is complete. The "metastat_before_breaking" file is necessary to reattach mirrors; hence it is checked by the script when the attach option is given. Remember, you can use the script in patching mode or standalone mode. When patching, the script obeys the rules and, upon success, leaves the files /etc/patchflag and /etc/patchlog named as they are. If the script fails to break the mirrors, it renames the log files and removes the /etc/patchflag.
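The detach decision comes down to parsing "metastat -p" output. The helper below is a testable slice of that idea, not the article's code; the sample line format (`d10 -m d11 d12 1` for a two-way mirror) is typical DiskSuite output, but verify it against your own metastat before relying on it:

```shell
#!/bin/sh
# submirrors_of MIRROR  (reads "metastat -p" output on stdin)
# A mirror line looks like "d10 -m d11 d12 1"; print the submirror names.
submirrors_of() {
    awk -v m="$1" '$1 == m && $2 == "-m" {
        for (i = 3; i < NF; i++) printf "%s ", $i
        print ""
    }' | sed 's/ $//'
}

# In the real script the surrounding steps would be, roughly:
#   subs=`metastat -p | submirrors_of d10`   # e.g. "d11 d12"
#   metadetach d10 d12                       # detach one side
#   echo d12 > /var/tmp/detached_submirror   # remember it for metattach later
```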

Patchit

Once any pre-processing tasks have completed successfully, patching occurs. The final script, patchit, starts as a cron job looking for the two flags, /etc/patchflag and /etc/patchlog. Patchit starts at run level 3 and, if all flags are set, begins the patch process. Patchit verifies the flags and the run level, then edits the /etc/inittab file so that the default run level is set to 1 and reboots the host. The host reboots to run level 1, where the patchit script is linked as /etc/rc1.d/S202patchit.

When the run level script is executed, it verifies the flags and, if patching, it will now remove the /etc/patchflag file, get the installation directory from the /etc/patchlog file, cd to the install directory, and install any clusters that exist in that directory while logging all installation activity. If successful, it moves /etc/patchlog to /etc/patchlog.`date +%y%m%d` and sets the /etc/patchdone flag. If the patch fails, the log is still moved to /etc/patchlog.`date +%y%m%d`, but the /etc/patchdone flag is not set, and success will not be emailed to the administrator. Finally, /etc/inittab is edited using sed so that run level 3 is again the default run level, and the system is rebooted.
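The inittab edit itself is a one-line sed substitution. The helper below operates on whatever file you hand it, so it can be rehearsed on a copy; patchit performs the equivalent edit on /etc/inittab directly:

```shell
#!/bin/sh
# set_initdefault INITTAB_FILE LEVEL
# Rewrite the initdefault entry (e.g. "is:3:initdefault:") so that LEVEL
# becomes the default run level on the next boot.
set_initdefault() {
    sed "s/^is:[0-9sS]:initdefault:/is:$2:initdefault:/" "$1" > "$1.new" &&
        mv "$1.new" "$1"
}

# patchit's flow, approximately:
#   set_initdefault /etc/inittab 1 && init 6   # reboot into run level 1
#   ... /etc/rc1.d/S202patchit installs the cluster ...
#   set_initdefault /etc/inittab 3 && init 6   # reboot back to run level 3
```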

When the system comes back up to run level 3, the patchit script is run as /etc/rc3.d/S202patchit. The script checks flags and, if successful, will see only the /etc/patchdone flag set, indicating that patching completed correctly. An email is sent to the administrator that patching was successful. The /etc/patchflag is removed, and the script exits. Please examine the code for patchit in Listing 4.
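Putting the three entry points together, patchit's decision logic reduces to a flags-plus-run-level dispatch. The sketch below condenses it; action names such as install_cluster are placeholders for the real code in Listing 4, and FLAGDIR exists only so the sketch can be exercised outside /etc:

```shell
#!/bin/sh
# patchit_dispatch RUNLEVEL
# Print the action patchit would take, given the current run level and
# which flag files exist. FLAGDIR defaults to /etc on a real host.
patchit_dispatch() {
    d=${FLAGDIR:-/etc}
    if [ "$1" = "3" ] && [ -f "$d/patchdone" ]; then
        echo mail_success           # patching finished on the previous boot
    elif [ "$1" = "3" ] && [ -f "$d/patchflag" ] && [ -f "$d/patchlog" ]; then
        echo reboot_to_runlevel_1   # cron-initiated start of patching
    elif [ "$1" = "1" ] && [ -f "$d/patchflag" ] && [ -f "$d/patchlog" ]; then
        echo install_cluster        # rc1.d invocation: do the install
    else
        echo nothing_to_do          # ordinary boot or init 1: leave the box alone
    fi
}
```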

Conclusion and Future Work

The system I've described is currently in place and working in our data center. There were some minor problems associated with NFS hard mounts, which have since been converted to soft mounts to eliminate NFS timeout problems during the boot process. Other minor issues stemmed from standardization problems that should have been corrected prior to installation. These problems have been corrected, but it is interesting to note that human interference and non-standard system configuration caused the automation problems. This will likely surprise no one, but it points out the need for consistency in system configuration, a problem that agents could eventually solve by themselves given the appropriate goals. This patch project demonstrates the feasibility of designing agents that cooperate via state and of having hosts and agents use their perceived environment to work together on a complex systems administration problem.

Future work will continue on the construction of a goal-based higher level abstraction that allows the administrator to place a goal and have the machines execute that goal via plans. This experiment will be implemented within that scope as a simple goal, and the systems will "learn" to complete that goal via installation of the appropriate software and editing the appropriate system files without the aid of a systems administrator.

The code and edits to the system have been incorporated into a package that can be installed on any SPARC-based Solaris system. This would potentially be the agent-based installation method of choice, because it can be tracked via pkginfo and other package tools. The code can be obtained from the Sys Admin Web site at: http://www.sysadminmag.com. Enjoy the code and please send in any comments or suggestions. As always, inspect the code and test before attempting to use it in your environment. Also, please remember that you assume any risks if you apply the code to your environment. Of course, I am available for comment at the address provided.

James Hartley is a systems administrator for a government project somewhere in Nevada. He holds a Master's degree in Computer Science and a Bachelor's degree in Applied Mathematics. He is active in the area of AI as applied to systems control and uses Unix as a test bed for research and development. He has been investigating AI techniques as applied to systems administration for the past couple of years. He can be reached for comment at james.hartley@gmail.com.