Article	Listing 1	Listing 2	Listing 3	Listing 4
Listing 5	Listing 6	Listing 7	Listing 8	Sidebar 1
Sidebar 2	jul2006.tar

Building a Bulletproof Penguin: DB2 and Linux Failover

Joe La Chapell and JT Vogt

Implementing a new enterprise resource planning (ERP) system that will impact every division of a company is an intriguing proposition. A lot is expected of such a system, and when we were tasked with this undertaking at our company, we did not want to neglect the foundation of it -- the database. As we researched this project, it became evident that we could satisfy all the necessary requirements with a combination of products: IBM DB2, Red Hat Linux, as well as Heartbeat, Mon, and ipfail. In this article, I will describe how we built a "bulletproof penguin" one piece at a time.

No Single Point of Failure

We needed a solution that would guarantee high availability and failover, so we decided to install two database servers implementing DB2's new High Availability Disaster Recovery (HADR). We also wanted to make certain we had no single point of failure in our hardware or software. We purchased a 14-disk storage array (SAN) and allocated one large RAID 5 partition for the database servers to use for storage. Our SAN controller has no single point of failure, so from a hardware perspective, we are on solid footing. Our HADR solution provides no single point of failure from a software perspective. Since we had committed to DB2, we had to use a shared nothing approach, which, simply stated, means that each node in the cluster has its own file system.

Our Disks

Once we had our disk array, we were able to install our operating system on each of the database servers. We wanted an operating system that was scalable and reliable, so selecting Red Hat Advanced Server as our operating system was a pretty easy choice. Getting Red Hat to talk to the storage array was a breeze using Qlogic's Host Bus Adapter (HBA) drivers and their installation utility called SANsurfer. As for the data storage that the operating system would now communicate with, we created two partitions on the disk array -- one partition for each database server. The following is a layout report of our disk array generated by the "fdisk" utility:

#fdisk -l /dev/sdc


Disk /dev/sdc: 1614.5 GB, 1614500761600 bytes
255 heads, 63 sectors/track, 196285 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/sdc1             1     79026 634776313+  83  Linux
/dev/sdc2         80000    159025 634776345   83  Linux

Then we formatted the partitions using the "ext3" file system:

#mkfs -t ext3 -b 4096 /dev/sdc1
#mkfs -t ext3 -b 4096 /dev/sdc2

Finally, we created a directory to mount the SAN to after the DB2 install completed:

#mkdir /db2_data

Installing DB2

At this point we were ready to install and configure DB2. We began with DB2 v8.1 FixPak 9 Enterprise Server Edition HADR, which came shipped as an option with DB2. Installing DB2 was easy with the tools delivered by IBM. Installing DB2 can be accomplished as directed by the following documentation:

http://www.redbooks.ibm.com/abstracts/sg246899.html

but we did deviate from the instructions by copying the installation files to the local disk and running the installer from there.

We were running Java version 1.4 instead of 1.3, which caused the installation script to fail, so we had to modify the installation scripts to look at the new Java Development Kit (JDK). To accomplish this, we opened the file named db2inst, looked for the line that read, "JAVA_PATH="/opt/IBMJava2-131/jre/bin" and changed it to read, "JAVA_PATH="/opt/IBMJava2-142/jre/bin". After we fixed the installation script, we reran the installation application. After DB2 was installed and we had our "db2inst1" user, we could assign the previously created file system to the instance owner (db2inst1) and its group (db2iadm1).

We changed the ownership of our mount point:

#chown db2inst1.db2iadm1 /db2_data

mounted the previously created file system:

#mount -t ext3 /dev/sdc1 /db2_data

and changed the DB2 configuration:

#db2 update dbm cfg using DFTDBPATH /db2_data/database

We decided to keep the logs on the local disks (six disks in a RAID 5 array with an online spare). We will set this path when we create the database.

Logged in as the user db2inst1, we issued a db2start command to start an instance of DB2. Note that a good DBA checks the /home/db2inst1/sqllib/log directory to look for any "db2start" error logs. If there are any errors, "db2start" will create a file named something like "db2start.20060218072215.errlog" in which to store them. The db2diag.log will look like the following:

2006-03-02-11.57.40.989011-360 I1G1319      LEVEL: Event
PID     : 4295           TID  : 3006257664  PROC : db2agent (PSFTPROD) 0
INSTANCE: db2inst1       NODE : 000         DB   : PSFTPROD
APPHDL  : 0-625          APPID: C0A87598.H30A.015902175536
FUNCTION: DB2 UDB, RAS/PD component, _pdlogInt, probe:120
START   : New db2diag.log file
DATA #1 : Build Level, 124 bytes
Instance "db2inst1" uses "32" bits and DB2 code release "SQL08022"
with level identifier "03030106".
Informational tokens are "DB2 v8.1.0.88", "s050422", "MI00105", FixPak "9".
DATA #2 : System Info, 364 bytes
System: Linux db02.example.com 4 2 i686
CPU: total:8 online:8
Physical Memory(MB): total:6042 free:91
Virtual  Memory(MB): total:9039 free:2935
Swap     Memory(MB): total:2997 free:2844
Kernel  Params:  msgMaxMessageSize:8192 msgMsgMap:16384
                       msgMaxQueueIDs:1024 msgNumberOfHeaders:16384
                       msgMaxQueueSize:16384 msgMaxSegmentSize:16
                               shmMax:268435456 shmMin:1 shmIDs:4096
                  shmSegments:4096 semMap:32000 semIDs:1024
                  semNum:32000    semUndo:32000 semNumPerID:250
                  semOps:32 semUndoSize:20 semMaxVal:32767
                  semAdjustOnExit:32767
Information in this record is only valid at the time when this 
file was created (see this record's time stamp)

When we saw this record, we knew things were going great! (Note that this information is very handy to have, and an administrator should know all of it to be successful.) At this point, we created a database (named WSI811) so we could begin testing our connection to the DB2 instance. To do so, we issued the command:

#db2 create database WSI811

Client Configuration

Having a database is mostly useless unless there are some applications that need to access it. We logged on to one of our application servers to test the ability to connect to DB2. The first thing we did was install the DB2 runtime client, which is, by default, a graphical Java installer. For HADR, the client configuration is what actually does the failover. As I will show later, the client is responsible for redirecting the request to the primary server once it receives a signal that the primary database is unavailable.

When prompted to select whether we wanted to create an instance now or wait until later, we selected "install now" and provided the instance information as requested. To make the application server aware of the DB2 instance and the database(s) provided by that instance, we had to catalog the database(s) on the application server, but we could have done that from other clients as well. We actually used a catalog script that is a bit more complex, but the following simpler catalog directive would have sufficed:

#db2 catalog tcpip node db02 remote db02.example.com server \ 
 60000 remote_instance db2inst1
#db2 catalog database WSI811 as WSI811 at node db02

Before attempting a connection to the database, we ran the following to make sure our client was cataloged correctly:

#db2 list db nodes

The output looked like this:

  Node Directory

 Number of entries in the directory = 1

Node 1 entry:

  Node name              = db02
  Comment                =
  Directory entry type   = LOCAL
  Protocol               = TCPIP
  Hostname               = db02.example.com
  Service name           = 60000

Next, we made sure we could see the database by running the following command:

#db2 list database directory

The output looked like this:

 System Database Directory


 Number of entries in the directory = 1


Database 1 entry:

 Database alias                       = WSI811
 Database name                        = WSI811
 Node name                            = DB02
 Database release level               = a.00
 Comment                              =
 Directory entry type                 = Remote
 Catalog database partition number    = -1
 Alternate server hostname            =
 Alternate server port number         = 60000

At this point, we tried to connect to the server. When prompted for the password, we had to use the password for the "db2inst1" user on the database server, not the client from which we were connecting:

#db2 connect to WSI811 user db2inst1
Enter current password for db2inst1:


   Database Connection Information


 Database server        = DB2/LINUX 8.2.2
 SQL authorization ID   = DB2INST1
 Local database alias   = WSI811

After we saw the message above, we knew we had a successful setup and connected to our DB2 instance.

Secondary Server

The next step was to set up the second server exactly the same as the first, with the exception of the file system. We mounted a separate file system for the second database server ("sdc2" instead of "sdc1"). We did this because of DB2's "shared nothing" architecture, a concept that is shared by Google and several other technology leaders. Our mount command for db02 changed to:

#mount -t ext3 /dev/sdc2 /db2_data

We put that mount directive into "/etc/fstab".

When all of the setup tests we ran on the first server worked on the second database server, we could begin our HADR setup. First we had to make sure these servers could communicate with each other. Because our policy was not to trust DNS servers for anything this important, we entered the IP address and hostname of both database servers into our "/etc/hosts" file. The "/etc/hosts" files on each server looked like the following:

127.0.0.1               localhost.localdomain localhost
10.100.1.70             db01.example.com      db01
10.100.1.69             db02.example.com      db02

By doing this, we eliminated the case where communication with the DNS servers interrupted our perfectly tuned database.

Next, we turned our attention to the "/etc/services" file that contained our port definitions. The last three lines of the "/etc/services" file read:

hadr1              55050/tcp
hadr2              55051/tcp
db2_db2inst1       60000/tcp

"hadr1" and "hadr2" ports were defined to transfer the logs between servers. The "db2_db2inst1" port was the port on which the database was listening.

To test network connectivity, we used simple tools such as "dig", "ping", and "nmap". After we found positive results, we continued.

Finally, we made an online backup from our primary database server. We used that backup to restore to our secondary server. We backed up and restored one database at a time to avoid disk contention.

HADR

Since HADR was ready to be applied, we needed to make our instance aware that we were running HADR. We accomplished this by updating the HADR configuration on each of the servers:

#db2 update db cfg for WSI811 using hadr_local_host db02.example.com
#db2 update db cfg for WSI811 using hadr_local_svc 55050
#db2 update db cfg for WSI811 using hadr_remote_host db01.example.com
#db2 update db cfg for WSI811 using hadr_remote_svc 55051
#db2 update db cfg for WSI811 using hadr_syncmode sync
#db2 update db cfg for WSI811 using hadr_remote_inst db2inst1
#db2 update db cfg for WSI811 using hadr_timeout 120
#db2 update alternate server for db WSI811 using hostname \ 
 db01.example.com port     55051

We started HADR on the primary database first. Note that I said, "primary database" instead of "primary database server". This is an important distinction because with our setup, we would be able to have one server be the primary server for a database, but a secondary for a different database -- this is called an "active-active" setup. Back to our setup... to start HADR, issue the command:

#db2 start hadr on database WSI811 as standby

HADR must be started on the standby database before starting it on the primary database.

Since we started HADR on our standby database, we started HADR on the primary:

#db2 start hadr on database WSI811 as primary

At this point, HADR was configured correctly, but we needed to notify our clients about it. This is where the catalog component gets a little more complicated; we wrote a script to accomplish what we needed:

#!/bin/bash
db2 uncatalog node db01
db2 uncatalog node db02


db2 catalog tcpip node db01 remote db01.example.com server \ 
 60000 remote_instance     db2inst1
db2 catalog tcpip node db02 remote db02.example.com server \ 
 60000 remote_instance     db2inst1


db2 uncatalog database WSI001
db2 catalog database WSI001 as WSI001 at node db01

Running the above script on the client should produce results similar to this:

#db2 list node directory


 Node Directory


 Number of entries in the directory = 2


Node 1 entry:


 Node name                      = DB01
 Comment                        =
 Directory entry type           = LOCAL
 Protocol                       = TCPIP
 Hostname                       = db01.example.com
 Service name                   = 60000


Node 2 entry:


 Node name                      = DB02
 Comment                        =
 Directory entry type           = LOCAL
 Protocol                       = TCPIP
 Hostname                       = db02.example.com
 Service name                   = 60000


#db2 list database directory


 System Database Directory


 Number of entries in the directory = 1


Database 1 entry:


 Database alias                       = WSI001
 Database name                        = WSI001
 Node name                            = DB01
 Database release level               = a.00
 Comment                              =
 Directory entry type                 = Remote
 Catalog database partition number    = -1
 Alternate server hostname            =
 Alternate server port number         = 60000

The "Alternate server hostname" and "Alternate server port number" values are blank. These values got populated once we connected to the database as the instance owner (db2inst1) from the client. After we connected to the database, the alternate server hostname and alternate server port number field read:

 Alternate server hostname            = db02.example.com
 Alternate server port number         = 55051

Once we got to this point, we could perform some testing from the application server. We pulled the network cable on the database server. Then we switched the secondary to be the primary using the following command:

#db2 takeover hadr on database wsi811

We checked to make sure we could still connect to the server. Also, we checked our running application(s). The main application that was running generated errors until we had cleared our connection pools.

Understanding Heartbeat and Mon

The core of the fault-tolerant system is a set of programs called "heartbeat" and "mon". While it's not essential to totally understand how these programs interact (one can simply configure them using any number of articles on the Internet), it's invaluable when attempting to set up an atypical installation.

What Is an Atypical Installation

Most step-by-step, "guide"-type documentation regarding heartbeat and mon on the Internet assumes that you will be running in an "active-passive" configuration. In this scenario, you have an "active" primary system that "owns" the sources and a "passive" backup system that basically just waits for the primary to fail. But let's take a step back. The key to the understanding of heartbeat and mon is the concept of a "resource" -- a program, service, etc. that you want to be fault tolerant. For example, in our environment, the DB2 engine is the "resource" that matters. So, in an "active-passive" configuration, we would have one DB2 server, holding all the databases, tables, etc. and the other a copy of it, basically just waiting for the first to fail.

One problem with such a configuration is that this leaves the secondary (passive) server metaphorically "twiddling its fingers". In other words, it's not being used to its capacity unless and until the primary (active) server fails. When "bang for the buck" counts, this is not an optimal way to deal with things. Instead, you might want to distribute databases across servers, having some primary on one and standby on the other, and others in the reverse roles. This is considered "active-active" and is a bit more difficult to handle, in that you must always know who's in charge of which resource at a given time.

Further, most documentation also ignores what to do when you finally get that primary server repaired. How, in the vernacular, does one "fail back"? While this is mentioned, it's rarely, if ever, discussed in detail.

Also, most documentation glosses over the difference between heartbeat, ipfail, and mon, treating them as parts of the same package. Although they most often are used together, they are separate programs with separate responsibilities. Until one understands those responsibilities and how the programs interact, it can be very confusing.

So, if we're going to do anything more than the basic "active-passive", floating IP, failover configuration, we need to answer the following questions:

What is heartbeat? What is mon? What is ipfail? How do they relate?
Now that we've failed over, how do we fail back (recover)?
How do we move from active-passive to active-active?

Heartbeat, Ipfail, and Mon

Heartbeat is, essentially, a tool that determines whether connectivity exists (connectivity to what? just wait) and, if it does not, what it needs to do to rectify the situation. The connectivity comes, basically, in two flavors -- connectivity to peers and basic connectivity ("aliveness").

Heartbeat works in a "cluster". That is, it has a set of peers with which it tests connectivity and, if that connectivity is lost, to whom it cedes resource responsibility. It also can monitor outside connectivity (outside the cluster) through the module called "ipfail" to determine whether it has lost network connectivity. This is a "Good Thing" from the standpoint that a system can do some of its own diagnostics -- it can determine, to some extent, if it can no longer talk to anyone or if, indeed, its peer is the one at fault. It can then take appropriate action.

So, as a thumbnail, all heartbeat does is monitor connectivity and, if the "primary" in the cluster goes away, it starts up a few programs.

What heartbeat does not do is to monitor services (other than the network) itself. For example, it doesn't look to see whether your database is running. For that, we need mon.

Mon is a generic service monitoring tool. Basically, it watches services (like a database engine or disk utilization) and, if appropriate, triggers alerts (via email, pager, or whatever you choose).

So, although you often hear heartbeat and mon referred to almost interchangeably, they are two separate entities. They connect only in that one (mon) is started by the other (heartbeat). Understanding this simple (albeit sometimes subtle) fact is the key to using these tools to create a fault-tolerant system.

To summarize, heartbeat waits for connectivity to be lost, through its own monitoring and through external facilities like ipfail, and mon watches other services. In practice, these three tools relate to each other through their configuration files.

We began configuring heartbeat by first configuring the "ha.cf" file (see Listing 1 for the essentials of ours). This file specifies the "timer values" (how long we go without a response before assuming things are dead, etc.), the members of our "cluster", the addresses to check for exterior connectivity, and what to do in case of lost connectivity. (See the "HADR and Heartbeat Timing Settings" sidebar for more information.)

In essence, heartbeat creates a cluster of nodes. Each system (node) in the cluster is watched by the heartbeat program through the use of a private communications channel. (I won't get into sockets and named pipes, here; suffice it to say it's a private channel.) Heartbeat then keeps tabs by talking through this channel to all members (nodes) of the cluster. If a node goes down, then a member of the cluster tries to take over its services.

It is easy to become confused by the relationship between heartbeat and ipfail. Though the two are most often discussed together, they are really two standalone programs. While heartbeat checks connectivity to other members of the cluster, ipfail checks connectivity with the rest of the world. To this end, you need to make sure you set the ping nodes to addresses that will help to determine whether network connectivity has been lost. So, instead of setting a peer as a ping node, you want to set something like your router, gateway, and other such server.

Think of it this way -- ipfail is a "true-false module" called by heartbeat that tells heartbeat whether it should trigger in the same way it would if heartbeat detected a primary down. It is just like the peer-to-peer portion of heartbeat, but it monitors the network instead.

Next, we need to look at the "haresources" file (see Listing 2). This file lists the resources that matter to us and states which node, when all is right with the world, is the primary for those resources. As discussed previously, we have configured our DB2 clients to know who the servers are and to contact the appropriate one if connectivity is lost.

In this way, we don't need to deal with a "floating IP"; so, in our "haresources" file, we do not list an IP address. Instead, we list the primary node and the resource names "hadr" and "mon" (note that all peers, primary and all others, have the same file -- that is, all have the same primary node entry, thus telling the peers who is ultimately responsible for stuff).

It's important to understand what those resource names are in relation to actual program files. These names are, in fact, the names of scripts kept in the "resource.d" directory. Those scripts follow a very specific format. Like scripts in /etc/init.d, those scripts must accept the following arguments: start, stop, and status. (Listing 3 contains a sample hadr script that allows start, stop, and status.)

It's also important to understand why and when these scripts will get called. They are not called by heartbeat when heartbeat starts on the primary node. Heartbeat assumes these are running on the primary node, so make sure you start them elsewhere (/etc/init.d would be a good place).

They are, however, called when a peer determines that it needs to take over (either because it has detected, through heartbeat, a failure of the primary or has been told by hb_standby to take over). So, in effect, these scripts are what cause the actual takeover of their respective resources when necessary.

The mon program is configured with the "mon.cf" file (see Listing 4 for the essentials of ours), which contains a set of stanzas stating which resources should be monitored and, if the resource is not working, when, how, and how often to send the appropriate alert. (See the "HADR and Mon Timing Settings" sidebar for more information.) Like the "haresources" resource names, above, the "monitor" and "alert" portions of the stanzas are the names of scripts. These reside in the "mon.d" directory. The "monitor" scripts are true-false returning programs that determine whether a resource is okay (true) or not (false). If the resource is not okay, then the corresponding alert script is run. While there is no real set of requirements like those for the haresources scripts, good coding rules should be observed.

With all of these scripts, both for haresources and mon, the responsibility is on you, the programmer, to handle all aspects of failover (haresources), monitoring (mon -- monitor), and alerting (mon -- alert). Heartbeat, ipfail, and mon do not do this for you. So, crack your knuckles, pop open another Jolt, and get down to script writing (see Listing 5 for an example of the "peer.monitor" script as called from the "mon.cf" file).

One thing that will help is if you use the library of shell functions called "shellfuncs" that's provided with heartbeat as an "initial values" library. All of our scripts (see Listings 5-8) begin with a source call to this file:

. /etc/ha.d/shellfuncs

This makes sure that certain values, like HOSTNAME, are shared through all programs. In our environment, this is particularly necessary because we need to know what host we are on in order to know how to correctly assume certain roles. And because we are executing the scripts in a safe environment (e.g., non-root), values like HOSTNAME may not be inherited (yes, we could just use /bin/hostname, but this is just an example, I'll show how we use other variables later). It also provides a handy place for any generic functions that the scripts might need.

These scripts, then, are the real meat of the heartbeat/mon system. It is where all the work necessary for failing over (and back) gets done, and it's totally up to you as to how it happens. Understanding this fact, as well as the relationships between heartbeat, mon, and ipfail, are what allow us to create atypical installations.

Your Server Is Fixed -- Now What?

So now you have successfully failed over to the backup server. And called hardware maintenance. They've fixed the primary server, so now how do you get back to where you were? This isn't always as easy as you might think.

While there is a facility in heartbeat called "auto_failback", you have to remember that it is anything but auto in terms of the resources. It will be up to you to handle failing back.

So, let's create a manual procedure for failing back. To do this gracefully, it is important to follow the order of events:

1. Shut down mon on the standby (the "normally backup") server.

2. Shut down heartbeat on the standby server.

3. Take over resources "manually" on the primary (fixed) server.

4. Start heartbeat on the primary server.

5. Start heartbeat on the standby server.

Step 3 is a bit of "magic happens here". We created a program called takeover that causes a HADR forced takeover of all the databases from the standby to the primary.

The trick to fully automate this procedure is simply to script the process, enshrining it in the scripts that we created for heartbeat (haresources) and mon (monitor and alert) and then turning on auto_failback in the "ha.cf" file. We have decided not to do that here since we want to very carefully monitor the reinstatement of the primary server and not rely on automatic services.

Where to Go from Here (Active-Active Configurations)

Once again, this is where our knowledge of the relationships and duties of the various parts is essential. To create a true active-active setup where each of the servers is "HADR primary" for part of the databases (instead of one being primary for all of them), all we need to do is to revisit the scripts.

This time, we modify them to understand that when they get called, for either starting or stopping, they must take over the correct databases, since they already are primary on the others. This is a place that the shellfuncs file can be particularly helpful. It can contain the list of what services (databases) are typically primary on which server. That way, the scripts on one system are "boilerplate" to all the others in the cluster.

To summarize, the most important part of understanding how to configure and then go beyond typical fault-tolerant installations of heartbeat and mon is understanding the elements of the system and how they relate. Once you understand that, the rest is scripting. And, as a systems administrator, you have done your fair share of that already, so you are almost home!

Conclusion

The setup for our DB2 database went very well. We have a solid foundation for our ERP system, which will provide maximum uptime and maximum reliability. For us, DB2, Linux, heartbeat, mon, and ipfail were invaluable in providing a "bulletproof penguin" for our ERP system.

Joseph La Chapell is an IT professional who enjoys using Linux to solve problems and has been doing so since the late 1990s. He thanks his family for their support in all he does. JT Vogt is a Unix professional with more than 20 years of Unix experience, and he believes that computers should manage themselves. Joseph and JT are both system administrators for Warehouse Specialists, Inc. located in Appleton, Wisconsin.