Implementing Cascading Shutdown Scripts
Owen Becker
Power issues are a fact of life in modern datacenters. Despite
people's best intentions, PDUs fail, backup generators have shorter
life spans than are needed, and electrical contractors don't always
follow the plans that have been so carefully laid out. Thus, systems
administrators face the possibility of a frantic call saying that
the main power just died and the systems will have to be shut down
before the generators fail. In this article, I'll describe the steps
we took to prepare for this eventuality.
Background
Our environment consists of six IBM p660s and one IBM p550 running
AIX. We also employ two generic Linux Web servers and two Data Direct
Networks SANs managing more than 100 TB of drive space. Each server
communicates with the SAN via a Fibre link to a Brocade switch.
In addition to a local file system running from the SAN, each production
box mounts a set of fairly large NFS exports from the central production
server. The application we support is responsible for archiving
and delivering data from NOAA's GOES and POES weather satellites.
We are also currently in the process of ingesting more than 180
TB of historical archives for eventual distribution via the Web.
At a weekly branch meeting, I was given notice that a major restructuring
of the electrical subsystems was about to occur. It was scheduled
to be done in several phases, the last being an upgrade to the generators
that would increase our operating time from around 1.5 to 3.0 hours.
Most of the work was planned for off-peak hours, which, for us,
meant Saturday mornings. Faced with the loss of my normal cartoon
intake and mild paranoia about power stability, I decided to make
my life easier and write a set of scripts to automate the shutdown
of the network.
My primary purpose in writing these scripts was to enable the
console operators to shut down the machines without needing specific
details regarding the infrastructure. Ultimately, I wanted them
to be able to handle a situation in which both of the systems administrators
were either unreachable or lacked adequate network access
to execute a shutdown. The operators should be able to do this
without any additional access.
I will attempt to keep these instructions as generic as possible.
For the sake of clarity, I've replaced actual hostnames with either
prod[n], test[n], or web[n]. By design the scripts are simple, and
although tested only on AIX and Linux, they should be easily portable
to other Unix-like systems.
The cascading shutdown procedure consists of two scripts. The
first is called shutdown_group.sh, which resides on a central server
known as the master console. The choice of the master console is
somewhat flexible. We chose prod1, since by virtue of being the
central NFS server, it should be shut down last. The script connects
to the client systems via SSH and initiates the cascading shutdown.
The shutdown_group.sh script also handles dependency ordering. The
second script, called sysdown.sh, acts as a basic wrapper to the
shutdown binary. The script also echoes the hostname for later redirection
to a log file.
Implementation
To begin, create a user named "sysdown" on each of the servers
that needs to be brought down. On default installations of AIX,
/usr/sbin/shutdown is owned by the shutdown group, and accordingly,
we need to add sysdown as a member. With the exception of the master
console, sysdown's shell is set to sysdown.sh. This provides the
console operators with a way to bring down individual machines without
taking down the entire group. On the master console, the sysdown
user runs /bin/sh. Here is the relevant entry from a client's /etc/passwd:
sysdown:!:204:1:Shutdown User:/home/sysdown:/usr/local/sbin/sysdown.sh
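Because a wrong login shell on a client would silently change what the account does, it can be worth checking the entry on each machine. The following is a minimal sketch, not part of the original scripts; the function names login_shell_of and check_sysdown_shell are hypothetical, and it assumes a standard colon-separated passwd file with the shell in the seventh field.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the article's scripts.

# login_shell_of FILE USER
# Print the shell (7th field) of USER's entry in a passwd-format FILE.
login_shell_of() {
    awk -F: -v user="$2" '$1 == user { print $7 }' "$1"
}

# check_sysdown_shell FILE EXPECTED
# Succeed only if the sysdown entry in FILE uses EXPECTED as its shell.
check_sysdown_shell() {
    actual=$(login_shell_of "$1" sysdown)
    if [ "$actual" = "$2" ]; then
        echo "ok: sysdown shell is $actual"
    else
        echo "MISMATCH: sysdown shell is '$actual', expected '$2'" >&2
        return 1
    fi
}
```

On a client, this might be run as: check_sysdown_shell /etc/passwd /usr/local/sbin/sysdown.sh. On the master console, the expected value would be /bin/sh instead.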
The sysdown.sh script is deliberately bare bones. It only contains
an echo announcing the hostname and a call to the shutdown binary.
Here is the sysdown.sh script:
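#!/bin/sh
# Announce which host is going down; the caller can redirect this to a log.
echo "`hostname` is shutting down at `date`"
echo "..."
echo
# Testing stub: this only prints the command. Remove the leading "echo" to go live.
echo shutdown -F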
As written, the script only echoes the shutdown command; remove that
echo once testing is complete. Users tend to get peevish when, during
deployment, servers go down unannounced. For each system to be shut
down, place sysdown.sh in /usr/local/sbin and change the group ownership
to shutdown. Next, set up the SSH keys. Log on to the master console
as sysdown and execute the following:
[prod1] [~] $ ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/sysdown/.ssh/id_dsa):
Enter passphrase (empty for no passphrase): (Just hit return)
Enter same passphrase again: (Again, hit return)
Your identification has been saved in /home/sysdown/.ssh/id_dsa.
Your public key has been saved in /home/sysdown/.ssh/id_dsa.pub.
The key fingerprint is:
XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX sysdown@prod1
Although I dislike using SSH keys sans passphrase, in this case it
is necessary. For each client machine, create a .ssh directory in
sysdown's home and set the mode to 0700. Copy the id_dsa.pub to ~sysdown/.ssh/authorized_keys.
From the master console, test that remote command execution works
for all the clients and accept each client's host key fingerprint when prompted.
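The per-client key installation described above can be sketched as a small function. This is an illustration under stated assumptions rather than the procedure we actually ran: the name install_pubkey is hypothetical, the 0600 mode on authorized_keys is my assumption (the text above specifies only 0700 for the directory), and the files must end up owned by sysdown, so an administrator running this as root would still need a chown afterward.

```shell
#!/bin/sh
# Hypothetical helper, not part of the article's scripts.

# install_pubkey PUBKEY_FILE HOME_DIR
# Create HOME_DIR/.ssh with mode 0700 and add the public key to
# authorized_keys, skipping the append if the key is already there.
install_pubkey() {
    pubkey_file=$1
    ssh_dir=$2/.ssh
    auth_file=$ssh_dir/authorized_keys

    mkdir -p "$ssh_dir" || return 1
    chmod 700 "$ssh_dir" || return 1
    touch "$auth_file" || return 1

    # -x: whole-line match, -F: fixed string, -f: read the pattern from the key file
    if ! grep -qxF -f "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file" || return 1
    fi

    # Assumption: restrict the key file to its owner.
    chmod 600 "$auth_file"
}
```

For example, after copying id_dsa.pub to a client: install_pubkey /tmp/id_dsa.pub /home/sysdown. Note that on a client, any remote command sent to the sysdown account runs sysdown.sh, because that script is the account's login shell, so the connection test is only harmless while the shutdown call in sysdown.sh is still echoed.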
The main script is called shutdown_group.sh and lives on the master
console. It runs sysdown.sh on each client system in turn, waiting
one minute between hosts, then sleeps for three more minutes and runs
the local copy. The shutdown_group.sh script is as follows:
#! /bin/sh
clear
echo "/********* Warning!!! This will shutdown all of the defined \
servers. **********/"
echo "/********* Hit CTRL-C within 10 seconds if you want to bail \
out **********/"
echo
echo
echo
echo
for i in 10 9 8 7 6 5 4 3 2 1
do
echo "Cascading shutdown will commence in t - $i seconds and counting"
sleep 1
done
echo "Okay, you asked for it... All defined servers are going down"
echo "shutdown_group.sh was executed by the console ops at \
`date`" >> $HOME/shutdown.log
# Because they run databases, prod5 and test2 need to go down last.
for i in prod2 prod3 test1 prod4 web1 web2 prod5 test2
do
echo "Shutting down $i"
ssh $i /usr/local/sbin/sysdown.sh >> $HOME/shutdown.log &
sleep 60
done
sleep 180
/usr/local/sbin/sysdown.sh >> $HOME/shutdown.log
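Before a first live run, it may help to rehearse the ordering without touching any host. The following is a sketch of the same staggered loop with a dry-run switch; it is not the script we deployed. The names HOSTS, STAGGER, DRY_RUN, SYSDOWN, and cascade are all hypothetical, the default path follows the /usr/local/sbin location given for sysdown.sh earlier, and the countdown and local shutdown are left out.

```shell
#!/bin/sh
# Hypothetical dry-run variant of the cascade loop.
# All variable and function names here are illustrative.

# Hosts in shutdown order; the database hosts (prod5, test2) stay last.
HOSTS=${HOSTS:-"prod2 prod3 test1 prod4 web1 web2 prod5 test2"}
# Seconds to wait between hosts.
STAGGER=${STAGGER:-60}
# 1 = only print what would be run; anything else = really run ssh.
DRY_RUN=${DRY_RUN:-1}
# Location of the wrapper script on the clients.
SYSDOWN=${SYSDOWN:-/usr/local/sbin/sysdown.sh}

cascade() {
    for host in $HOSTS
    do
        echo "Shutting down $host"
        if [ "$DRY_RUN" = 1 ]; then
            echo "DRY RUN: would run: ssh $host $SYSDOWN"
        else
            # Background the ssh so one hung host cannot stall the cascade.
            ssh "$host" "$SYSDOWN" >> "$HOME/shutdown.log" &
        fi
        sleep "$STAGGER"
    done
}
```

Setting STAGGER=0 and calling cascade prints the planned order immediately, which makes it easy to confirm that the database hosts come last after editing the list.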
Operation
At this point, we need to add shutdown_group.sh as the last line
of sysdown's .profile. Once this step is complete, logging into
the master console as sysdown should initiate the cascade and bring
down each of the defined machines.
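Adding the line by hand is fine, but if the step is scripted it should be safe to repeat. Here is a minimal sketch; the function name add_profile_line is hypothetical, and /path/to/shutdown_group.sh in the usage below is a placeholder, since the install location of shutdown_group.sh is up to you.

```shell
#!/bin/sh
# Hypothetical helper, not part of the article's scripts.

# add_profile_line PROFILE LINE
# Append LINE to PROFILE unless an identical line is already present,
# so that running the setup twice does not start the cascade twice.
add_profile_line() {
    profile=$1
    line=$2
    touch "$profile" || return 1
    # -x: whole-line match, -F: fixed string, -e: guard against a leading dash
    grep -qxF -e "$line" "$profile" || printf '%s\n' "$line" >> "$profile"
}
```

On the master console this might be invoked as: add_profile_line /home/sysdown/.profile /path/to/shutdown_group.sh. The helper only appends, so the line stays last only as long as nothing else is added to .profile afterward.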
We found it helpful to write down sysdown's password and place
it in a sealed envelope along with usage instructions and the home
phone numbers of the systems administrators. Once the account has
been used, the password can be reset and the reasons for the shutdown
logged.
The final and critical portion of the deployment is explaining,
and setting down in written policy, when the sysdown account should
be used. The operators must be made aware that the sysdown account
should be used only after being given a go-ahead by one
of the systems administrators or when a power catastrophe has occurred
and the administrators are unreachable.
Future Directions and Conclusions
After the successful deployment of the scripts, I worked with
one of the engineers from DDN to integrate our SAN into the shutdown
system. Because DDN SANs use a controller running Linux, we were
able to write a wrapper around their APIs that halts the LUNs and
saves state before calling a shutdown. However, because the drive
cabinets require someone to manually throw switches for a full power-off,
I have been hesitant to integrate the SANs into the cascade.
Although a power failure requiring the operators to use the cascading
shutdown procedure has not yet occurred, we have found the process
very helpful for our scheduled shutdowns. The scripts have added a layer
of convenience and safety to our regular maintenance, and we have
found that the automation they provide has reduced our downtime
significantly.
Owen Becker is a systems administrator currently working for
Global Science & Technology, Inc. He holds a B.S. in computer
science from Alderson-Broaddus College and is primarily focused
on Unix systems administration and security. Owen can be reached
at: owen.becker@gst.com.