Implementing Cascading Shutdown Scripts

Owen Becker

Power issues are a fact of life in modern datacenters. Despite people's best intentions, PDUs fail, backup generators have shorter life spans than are needed, and electrical contractors don't always follow the plans that have been so carefully laid out. Thus, systems administrators face the possibility of a frantic call saying that the main power just died and the systems will have to be shut down before the generators fail. In this article, I'll describe the steps taken to help address this eventuality.

Background

Our environment consists of six IBM p660s and one IBM p550 running AIX. We also employ two generic Linux Web servers and two Data Direct Networks SANs managing more than 100 TB of drive space. Each server communicates with the SAN via a Fibre link to a Brocade switch. In addition to a local file system running from the SAN, each production box mounts a set of fairly large NFS exports from the central production server. The application we support is responsible for archiving and delivering data from NOAA's GOES and POES weather satellites. We are also currently in the process of ingesting more than 180 TB of historical archives for eventual distribution via the Web.

At a weekly branch meeting, I was given notice that a major restructuring of the electrical subsystems was about to occur. It was scheduled to be done in several phases, the last being an upgrade to the generators that would increase our operating time from around 1.5 to 3.0 hours. Most of the work was planned for off-peak hours, which, for us, meant Saturday mornings. Faced with the loss of my normal cartoon intake and mild paranoia about power stability, I decided to make my life easier and write a set of scripts to automate the shutdown of the network.

My primary purpose in writing these scripts was to enable the console operators to shut down the machines without needing specific details regarding the infrastructure. Ultimately, I wanted them to be able to address a situation in which both of the systems administrators were either unreachable or did not have adequate network access to execute a shutdown; in addition, the operators should not need any further access.

I will attempt to keep these instructions as generic as possible. For sake of clarity, I've replaced actual hostnames with either prod[n], test[n], or web[n]. By design the scripts are simple, and although tested only on AIX and Linux, they should be easily portable to other Unix-like systems.

The cascading shutdown procedure consists of two scripts. The first is called shutdown_group.sh, which resides on a central server known as the master console. The choice of the master console is somewhat flexible. We chose prod1, since by virtue of being the central NFS server, it should be shut down last. The script connects to the client systems via SSH and initiates the cascading shutdown. The shutdown_group.sh script also handles dependency ordering. The second script, called sysdown.sh, acts as a basic wrapper to the shutdown binary. The script also echoes the hostname for later redirection to a log file.

Implementation

To begin, create a user named "sysdown" on each of the servers that needs to be brought down. On default installations of AIX, /usr/sbin/shutdown is owned by the shutdown group, and accordingly, we need to add sysdown as a member. With the exception of the master console, sysdown's shell is set to sysdown.sh. This provides the console operators with a way to bring down individual machines without taking down the entire group. On the master console, the sysdown user runs /bin/sh. Here is an extraction from /etc/passwd:

sysdown:!:204:1:Shutdown User:/home/sysdown:/usr/local/sbin/sysdown.sh

The sysdown.sh script is deliberately bare bones. It only contains an echo announcing the hostname and a call to the shutdown binary. Here is the sysdown.sh script:

#!/bin/sh

echo "$HOSTNAME is shutting down at `date`"
echo "..."  echo

echo shutdown -F

At this point, the shutdown command is echoed and the echo should be removed after testing. Users tend to get peevish when, during deployment, servers go down unannounced. For each system to be shut down, place sysdown.sh in /usr/local/sbin and change the group ownership to shutdown. Next, set up the SSH keys. Log on to the master console as sysdown and execute the following:

[prod1] [~] $ ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/sysdown/.ssh/id_dsa):
Enter passphrase (empty for no passphrase): (Just hit return)
Enter same passphrase again: (Again, hit return)
Your identification has been saved in /home/sysdown/.ssh/id_dsa.
Your public key has been saved in /home/sysdown/.ssh/id_dsa.pub.
The key fingerprint is:
XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX sysdown@prod1

Although I dislike using SSH keys sans passphrase, in this case it is necessary. For each client machine, create a .ssh directory in sysdown's home and set the mode to 0700. Copy the id_dsa.pub to ~sysdown/.ssh/authorized_keys. From the master console, test that remote command execution works for all the clients and add the DSA key fingerprint when prompted.

The main script is called shutdown_group.sh and lives on the master console. The shutdown_group.sh script will execute sysdown.sh on the client systems, sleep for three minutes, and then run the local copy. The shutdown_group.sh script is as follows:

#! /bin/sh

clear

echo "/********* Warning!!! This will shutdown all of the defined \
      servers. **********/"
echo "/********* Hit CTRL-C within 10 seconds if you want to bail \
      out      **********/"
echo
echo
echo
echo

for i in 10 9 8 7 6 5 4 3 2 1
do
echo "Cascading shutdown will commence in t - $i seconds and counting"
sleep 1
done

echo "Okay, you asked for it... All defined servers are going down"

echo "shutdown_group.sh was executed by the console ops at \
      `date`" >> $HOME/shutdown.log
# Because they run databases, prod5 and test2 need to go down last.
for i in prod2 prod3 test1 prod4 web1 web2 prod5 test2
        do
                echo "Shutting down $i"
                ssh $i /usr/local/bin/sysdown.sh >> $HOME/shutdown.log &
                sleep 60
        done

sleep 180
/usr/local/bin/sysdown.sh >> $HOME/shutdown.log

Operation

At this point, we need to add shutdown_group.sh as the last line of sysdown's .profile. Once this step is complete, logging into the master console as sysdown should initiate the cascade and bring down each of the defined machines.

We found it helpful to write down sysdown's password and place it in a sealed envelope along with usage instructions and the home phone numbers of the systems administrators. Once the account has been used, the password can be reset and the reasons for the shutdown logged.

The final and critical portion of the deployment is both explaining and giving written policy as to when the sysdown account should be used. The operators must be made aware that the only time the sysdown account should be used is after given a go-ahead from one of the systems administrators or when a power catastrophe has occurred and the administrators are unreachable.

Future Directions and Conclusions

After the successful deployment of the scripts, I worked with one of the engineers from DDN to integrate our SAN into the shutdown system. Because DDN SANs use a controller running Linux, we were able to write a wrapper around their APIs that halts the LUNs and saves state before calling a shutdown. However, because the drive cabinets require someone to manually throw switches for a full power-off, I have been hesitant to integrate the SANs into the cascade.

Although a power failure requiring the operators to use the cascading shutdown procedure has not yet occurred, we have found the process very helpful for our scheduled shutdowns. They have added a layer of convenience and safety to our regular maintenance, and we have found that automation provided by the scripts has reduced our downtime significantly.

Owen Becker is a systems administrator currently working for Global Science & Technology, Inc. He holds a B.S. in computer science from Alderson-Broaddus College and is primarily focused on Unix systems administration and security. Owen can be reached at: owen.becker@gst.com.