Dynamic Patching via State and Run-time Control
James Hartley
This project stemmed from my desire to approach Unix
administration from the standpoint of cooperating agents. I wanted to explore
how scripts could cooperate to accomplish complex tasks that systems
administrators face day to day. As the number of systems in the data center
increase, it becomes an almost impossible task to maintain systems by hand. By
designing agents that perform various tasks and by chaining these together via
plans and goals, systems can begin to learn how to maintain themselves.
I consider this experiment a first step in learning how
machines can perceive their environment, determine some identity information
about themselves, and act via the Unix API and Unix commands to perform the
complex day-to-day tasks themselves, without administration by humans.
In such as setup, humans become the orchestrators of
activity and not slaves to the servers. Humans set goals and desires and
determine plans, and the software agents attempt to execute these goals by
determining their environment and communicating with other agents required for
the process. This experiment tests the cooperation between agents to complete
the complex task of patching a system automatically once a month. The higher
level design would simply desire the goal of patching, the machines would then
figure out how to patch by combining a series of agents to do the job. In this
article, grabpatch, detach_attach_mirrors, and patchit are the agents that
patch the machine; each performs a task that could be used in different goals
to do different tasks.
Future work will concentrate on new agents and designing
the frame work to introduce goals and desires into the mix. Then, in the patch
case, all that would be required of an administrator would be to set the goal
"patch host on the third Saturday of the month with the current patchcluster",
and all systems would cooperate to load the appropriate agents and set up the
appropriate environment to patch the hosts.
Introduction
It is the intent of this article to present the automated
scripts and explore some modifications that might be useful in future revisions
of the code. It is also the intent of the article to present a novel method of
using state to control the operation of cooperating scripts, and demonstrate
high-level control of system operation by dynamically altering system files
during the execution of the patching scripts. In this case the /etc/inittab
file is dynamically altered to control the boot process.
Adding patch clusters in Solaris is primarily a manual
process. Currently patch clusters are manually pulled down from SunSolve's Web
site, distributed to the servers, and installed by running the install_cluster
script within the patch cluster. Attempts to automate the process can still
require some manual intervention to either install or reboot the server to
fully install the cluster.
Several problems arise from any manual process. For
patching, these include: the time to manually pull the correct cluster from
SunSolve, manually copying patch clusters to the appropriate host, pre-patch
preparation, and the actual installation. Not only is time the enemy, but
manual methods are prone to error; in a large shop it is possible to forget to
add the patch cluster to some hosts, install or attempt to install the wrong
patch cluster, and/or forget to reboot a host that has installed the patch
cluster.
The solution is to completely automate the patch process
and provide enough logging information so that the administrator can simply
orchestrate events. This is achieved by running several scripts that automate
each step of the patching process. Each step has its own script that can be
modified for various business models.
The Scripts
Automating installation of the patch clusters is
performed using four scripts. The four scripts that define the fully automated
procedure are pull_cluster, grabpatch, detach_attach_mirrors, and patchit.
The pull_cluster script sits on a patch server. It is
responsible for downloading the patches for each version of the operating
system and placing it in the appropriate directory.
The grabpatch script is located on each host, it is
responsible for downloading the correct patch for that host based upon the
operating system version the host is running. The patch is loaded into the
correct directory that patchit will use to install the patch cluster.
The detach_attach_mirrors script is a pre-processing
script which, in our business environment, breaks Disksuite mirrors on our root
partitions. Detach_attach_mirrors shows how to plug in your own pre-processing
scripts required by your environment before installation of the cluster. If you
use Disksuite, it might be of interest to study the code. Detach_attach_mirrors
lives on each host. It is smart enough to determine whether there are any
metadevices on the host to be patched and, if so, whether the mirrors can be
successfully detached.
Finally, patchit is the script that does the actual
installation. It dynamically modifies the /etc/inittab to control the run level
at with patching is done and reboots to run level 3 after patching is complete.
This script allows installation of the patch cluster at the appropriate run
level and then reboots the system after installation.
These scripts take advantage the crontab file, startup
scripts, and the ability with Unix to dynamically edit files during the
execution of scripts.
The scripts on the host to be patched must cooperate to
determine the success or failure of each previous step. If any previous step is
unsuccessful, then patching must stop. To control each script, state flags in
the form of files are used to define the state of the patching process. Before
I explain each script in detail, however, take a look at Figure 1, which shows
the architectural layout of the system. Note the existence of a patch server
and which scripts are loaded onto which hosts.
Run-Time Control
Note that pull_cluster runs only on the patch cluster
server. It is independent of the other scripts. Its job is to pull down the appropriate
scripts from SunSolve. It can be scheduled to run via the cron service once a
month to pull down the latest release of the cluster. It is desirable to patch
only if there is a change in the patch cluster. To accomplish this,
pull_cluster compares the md5 checksum of the new patch cluster it gets from
Sun with the last patch cluster; and, if different, the zip files are stored in
the appropriate OS version directory.
Patching will continue when the grabpatch script is run
from the host being patched. If the patch clusters are the same, the zip files
are deleted. The grabpatch script will fail if there are no patch cluster zip
files on the patch server. This will halt the patching process, and the other
scripts will not run. Synchronization between pull_cluster and grabpatch is
accomplished by scheduling pull_cluster on the patch server before running the
grabpatch as a cron service on the host to be patched. The code for
pull_cluster is shown in Listing 1.
Pull_cluster
Notice that the main routine is very straightforward. For
each OS type in the OS array, we clean up the old zip files, get the new patch
cluster zip files, construct an md5 digest of the zip files, then check these
against the old checksum values for the previous patch clusters. The check_production
function is used in our business environment to determine whether old patches
should be applied to our production machines. Our business model requires that
production machines get the patch cluster last given to the development
machines. The heart of the script uses "wget" to grab the cluster. It would be
very easy to model different patch models within this framework by adding the
appropriate functions. It is the responsibility of the check_development and
check_production functions to delete the zip files if the previously applied
cluster is the same as the new cluster.
Once the clusters are in place, control moves to the host
to be patched. Again cron is used as synchronization tool along with three flags
that track the state of the patch process. As mentioned before, the order of
execution is grabpatch, detach_attach_mirrors, and patchit. Each will be
discussed in turn along with the use of each flag.
On each host to be patched the scripts are installed in
the following locations. Grabpatch and detach_attach_mirrors are placed in
/usr/local/bin. Patchit is placed in /etc/init.d/patchit, and symlinks are
created to it in /etc/rc1.d/S202patchit and /etc/rc3.d/S202patchit. Crontab
entries are made into root's crontab file to first run grabpatch (in our model
the night before patching), then detach_attach_mirrors (pre-processing step to
break Disksuite mirrors, usually the day of patch installation), and finally to
run patchit.
Because each step of the patch process must be successful
before the next step can be taken, we must communicate state information to
each script. Also notice that patchit is both a cron script and a run control
script. Patchit requires additional flags to determine whether or not to patch
at run level 1 and at run level 3. Patchit has this complexity because it is
desirable to patch at run level 1 and then reboot after patching. However,
there are other reasons to reboot besides patching, and it is not desirable to
patch the box during every reboot or every time we execute init 1. The flow is
as follows.
When patching is desired, the administrator creates the
/etc/patchflag file on the host. This flag must be present when the crontab
entry for grabpatch executes. When grabpatch executes, it looks for the patch
cluster zip files within the patch cluster server directory. It creates another
flag file called /etc/patchlog. The /etc/patchlog file also doubles as a log
file for the patch process. We use these to flag files to communicate with the
other processes. In this case if grabpatch sees zip files for the patch
clusters, it will download the zip files and log the processes within the
/etc/patchlog file. Patching continues with the next script, and the
/etc/patchlog and /etc/patchflag remain.
If grabpatch does not see any zip files, it logs failure
into /etc/patchlog, removes the /etc/patchflag file, and then renames the
/etc/patchlog file to /etc/patchlog.`date +%y%m%d`. Because the other patch
scripts require the existence of both /etc/patchlog and /etc/patchflag to run,
once the files are renamed or removed, patching halts. This means that if
grabpatch finds a patch cluster, patching will continue; if it does not, the
other scripts in the cron service will exit without continuing the patch
process. The code for grabpatch is shown in Listing 2.
Grabpatch
Grabpatch does some interesting things. It starts by
reading a file on the patch cluster called "host_type". If it finds its
hostname, it continues; if not, it erases the /etc/patchflag and renames the
/etc/patchlog halting the patching process as before. If the host is identified
in the host_type file, the host reads where to place the cluster on the hard
drive for installation. This location is placed into the /etc/patchlog so that
patchit knows were to find the patch cluster when it reboots the box for
installation.
Grabpatch also determines its OS version from the uname command; it uses this to determine which cluster to pull from the patch cluster
server. This is particularly nice because the host "knows" which cluster to
pull and the sys admin does not have to keep track of this detail. Once the
patches are pulled to the host, the clusters are unzipped and any cluster in
the patch cluster directory on the host is logged into /etc/patchlog. Success
keeps the /etc/patchlog and /etc/patchflag files intact; any failure of this
script will rename the log file and remove the patchflag halting the next step
in the process.
After grabpatch, the business model requires that we
break any root mirrors before patching. This script can be considered a
pre-processing step to the patch procedure. In other words, any scripts that
are considered necessary to prepare an environment for patching can be placed
between grabpatch and patchit, as long as they obey the rules. The rules are:
if successful, leave /etc/patchflag and /etc/patchlog named as they are;
otherwise, remove /etc/patchflag and rename the log file.
Detach_attach_mirrors
The next script in the chain is detach_attach_mirrors.
This script is useful as system script in its own right. Because of this, a -p
switch is added to make it useful in the patch mode. When patching, the -p
switch must be present and is the only operation that the script allows. The
script can clean up its files in /var/tmp, it can detach root mirrors, and it
can attach root mirrors. During patching, the -p switch requires the existence
of the flag files in order to complete; otherwise it exits early, does not
break mirrors, and stops the patch process in the usual manner. It might be
useful to anyone using Disksuite to break root mirrors. Meta-device health is
checked before the mirrors are detached. Listing 3 displays the code for
detach_attach_mirrors.
This script uses temporary files in /var/tmp to keep
track of which submirror is broken from the mirror so that you can reattach the
appropriate submirror to the original mirror after patching is complete. The
"metastat_before_breaking" file is necessary to reattach mirrors; hence it is
checked by the script when the attach option is given. Remember, you can use
the script in patching mode or standalone mode. When patching, the script obeys
the rules and, upon success, leaves the files /etc/patchflag and /etc/patchlog
named as they are. If the script fails to break the mirrors, it renames the log
files and removes the /etc/patchflag.
Patchit
Once any processing tasks are completed successfully,
patching occurs. The final script, patchit, starts as a cron job looking for
the two flags, etc/patchflag and /etc/patchlog. Patchit starts at run level 3,
and if all flags are set, it will begin the patch process. Patchit verifies the
flags and the run level then edits the /etc/inittab file so that the default
run level is set to 1 and reboots the host. The host will reboot to run level 1
where the patchit script is linked to /etc/rc1.d/S202patchit.
When the run level script is executed, it verifies the
flags and, if patching, it will now remove the /etc/patchflag file, get the
installation directory from the /etc/patchlog file, cd to the install
directory, and install any clusters that exist in that directory while logging
all installation activity. If successful, it moves the /etc/patchlog to
/etc/patchlog.`date +%yymmdd` and sets the /etc/patchdone flag. If the patch
fails, the log is still moved to /etc/patchlog.`date +%yymmdd`. The etc/patchdone
flag is not set, and success will not be emailed to the administrator. Finally,
the /etc/inittab is edited using "SED" to reset the /etc/inittab file so that
run level 3 is again the default run level, and the system is rebooted.
When the system comes back up to run level 3, the patchit
script is run as /etc/rc3.d/S202patchit. The script checks flags and, if
successful, will see only the /etc/patchdone flag set, indicating that patching
completed correctly. An email is sent to the administrator that patching was
successful. The /etc/patchflag is removed, and the script exits. Please examine
the code for patchit in Listing 4.
Conclusion and Future Work
The system I've described is currently in place and
working in our data center. There were some minor problems associated with NFS
hard mounts, which have since been converted to soft mounts to eliminate NFS
timeout problems during the boot up process. Other minor issues stemmed from
standardization problems that should have been corrected prior to installation.
These problems have been corrected, but it is interesting to note that human
interference and non-standard system configuration caused the automation
problems. This is likely not to the surprise of anyone, but it points out the
need for consistency in system configuration, which is a problem that agents
could eventually solve by themselves given the appropriate goals. This patch
project demonstrates feasibility of designing agents to cooperate via state and
having the hosts and agents use the perceived environment to cooperate on a
complex systems administration problem.
Future work will continue on the construction of a
goal-based higher level abstraction that allows the administrator to place a
goal and have the machines execute that goal via plans. This experiment will be
implemented within that scope as a simple goal, and the systems will "learn" to
complete that goal via installation of the appropriate software and editing the
appropriate system files without the aid of a systems administrator.
The code and edits to the system have been incorporated
into a package that can be installed on any Solaris Sparc-based system. This
would potentially be the agent-based installation method of choice, because it
can be tracked via pkginfo and other package tools. The code can be obtained
from the Sys Admin Web site at: http://www.sysadminmag.com. Enjoy the code and
please send in any comments or suggestions. As always, inspect the code and
test before attempting to use in your environment. Also, please remember that
you assume any risks if you apply the code to your environment. Of course, I am
available for comment at the address provided.
James Hartley is a systems administrator for a government
project somewhere in Nevada. He holds a Master's Degree in Computer Science,
and Bachelors Degree in Applied Mathematics. He is active in the area of AI as
applied to systems control and uses Unix as a test bed for research and
development. He has been investigating AI techniques as applied to systems
administration for the past couple of years. He can be reached for comment at
james.hartley@gmail.com.
|