Building a Bulletproof Penguin: DB2 and Linux Failover
Joe La Chapell and JT Vogt
Implementing a new enterprise
resource planning (ERP) system that will impact every division of a company is an
intriguing proposition. A lot is expected of such a system, and when we were tasked
with this undertaking at our company, we did not want to neglect the foundation
of it -- the database. As we researched this project, it became evident that
we could satisfy all the necessary requirements with a combination of products:
IBM DB2, Red Hat Linux, and the open source tools Heartbeat, Mon, and ipfail. In
this article, we will describe how we built a "bulletproof penguin" one piece at a time.
No Single Point of Failure
We needed a solution that would guarantee high
availability and failover, so we decided to install two database servers implementing
DB2's new High Availability Disaster Recovery (HADR). We also wanted to make
certain we had no single point of failure in our hardware or software. We purchased
a 14-disk storage array (SAN) and allocated one large RAID 5 partition for the database
servers to use for storage. Our SAN controller has no single point of failure, so
from a hardware perspective, we are on solid footing. Our HADR solution provides
no single point of failure from a software perspective. Since we had committed to
DB2, we had to use a shared-nothing approach, which, simply stated, means that each
node in the cluster has its own file system.
Our Disks
Once we had our disk array, we were able to install
our operating system on each of the database servers. We wanted an operating system
that was scalable and reliable, so selecting Red Hat Advanced Server as our operating
system was a pretty easy choice. Getting Red Hat to talk to the storage array was
a breeze using Qlogic's Host Bus Adapter (HBA) drivers and their installation
utility called SANsurfer. As for the data storage that the operating system would
now communicate with, we created two partitions on the disk array -- one partition
for each database server. The following is a layout report of our disk array generated
by the "fdisk" utility:
#fdisk -l /dev/sdc

Disk /dev/sdc: 1614.5 GB, 1614500761600 bytes
255 heads, 63 sectors/track, 196285 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1       79026   634776313+  83  Linux
/dev/sdc2           80000      159025   634776345   83  Linux
Then we formatted the partitions using the "ext3"
file system:
#mkfs -t ext3 -b 4096 /dev/sdc1
#mkfs -t ext3 -b 4096 /dev/sdc2
Finally, we created a directory to mount the SAN
to after the DB2 install completed:
#mkdir /db2_data
Installing DB2
At this point we were ready to install and configure
DB2. We began with DB2 v8.1 Enterprise Server Edition at FixPak 9, which ships
with HADR as an option. Installing DB2 was easy with the tools delivered by IBM,
and the installation is documented at:
http://www.redbooks.ibm.com/abstracts/sg246899.html
We deviated from those instructions only by copying the installation files to the
local disk and running the installer from there.
We were running Java version 1.4 instead of
1.3, which caused the installation script to fail, so we had to modify the installation
scripts to point at the new Java Development Kit (JDK). To accomplish this, we opened
the file named db2inst, looked for the line that read JAVA_PATH="/opt/IBMJava2-131/jre/bin",
and changed it to read JAVA_PATH="/opt/IBMJava2-142/jre/bin".
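Rather than editing the installer by hand, the same change can be made in one pass with sed. Here is a small helper we could have used (a sketch of our own; the file name db2inst comes from the step above, and the .orig backup suffix is our own convention):

```shell
# Swap the 1.3.1 JDK path for the 1.4.2 one in a DB2 install script,
# keeping a backup copy of the original with an .orig suffix.
fix_jdk_path() {
    sed -i.orig 's|/opt/IBMJava2-131/jre/bin|/opt/IBMJava2-142/jre/bin|' "$1"
}
```

Run as "fix_jdk_path db2inst"; the untouched script is preserved as db2inst.orig.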
After we fixed the installation script, we reran the installation application. After
DB2 was installed and we had our "db2inst1" user, we could assign the
previously created file system to the instance owner (db2inst1) and its group (db2iadm1).
We changed the ownership of our mount
point:
#chown db2inst1:db2iadm1 /db2_data
mounted the previously created file system:
#mount -t ext3 /dev/sdc1 /db2_data
and changed the DB2 configuration:
#db2 update dbm cfg using DFTDBPATH /db2_data/database
We decided to keep the logs on the local disks
(six disks in a RAID 5 array with an online spare). We will set this path when we
create the database.
Logged in as the user db2inst1, we issued
a db2start command to start an instance of DB2. Note that a good DBA checks the
/home/db2inst1/sqllib/log directory to look for any "db2start" error logs.
If there are any errors, "db2start" will create a file named something
like "db2start.20060218072215.errlog" in which to store them. The db2diag.log
will look like the following:
2006-03-02-11.57.40.989011-360 I1G1319 LEVEL: Event
PID : 4295 TID : 3006257664 PROC : db2agent (PSFTPROD) 0
INSTANCE: db2inst1 NODE : 000 DB : PSFTPROD
APPHDL : 0-625 APPID: C0A87598.H30A.015902175536
FUNCTION: DB2 UDB, RAS/PD component, _pdlogInt, probe:120
START : New db2diag.log file
DATA #1 : Build Level, 124 bytes
Instance "db2inst1" uses "32" bits and DB2 code release "SQL08022"
with level identifier "03030106".
Informational tokens are "DB2 v8.1.0.88", "s050422", "MI00105", FixPak "9".
DATA #2 : System Info, 364 bytes
System: Linux db02.example.com 4 2 i686
CPU: total:8 online:8
Physical Memory(MB): total:6042 free:91
Virtual Memory(MB): total:9039 free:2935
Swap Memory(MB): total:2997 free:2844
Kernel Params: msgMaxMessageSize:8192 msgMsgMap:16384
msgMaxQueueIDs:1024 msgNumberOfHeaders:16384
msgMaxQueueSize:16384 msgMaxSegmentSize:16
shmMax:268435456 shmMin:1 shmIDs:4096
shmSegments:4096 semMap:32000 semIDs:1024
semNum:32000 semUndo:32000 semNumPerID:250
semOps:32 semUndoSize:20 semMaxVal:32767
semAdjustOnExit:32767
Information in this record is only valid at the time when this
file was created (see this record's time stamp)
When we saw this record, we knew things were going
great! (Note that this information is very handy to have, and an administrator should
know all of it to be successful.) At this point, we created a database (named WSI811)
so we could begin testing our connection to the DB2 instance. To do so, we issued
the command:
#db2 create database WSI811
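Had we wanted to place the data and the local log path explicitly at creation time, the commands would have looked roughly like this (a sketch; the /db2_logs directory is an assumption for illustration, and NEWLOGPATH only takes effect the next time the database is activated):

```shell
# Create the database on the SAN mount, then point its transaction
# logs at a directory on the local RAID array.
db2 create database WSI811 on /db2_data
db2 update db cfg for WSI811 using NEWLOGPATH /db2_logs/WSI811
```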
Client Configuration
Having a database is mostly useless unless there
are some applications that need to access it. We logged on to one of our application
servers to test the ability to connect to DB2. The first thing we did was install
the DB2 runtime client, which is, by default, a graphical Java installer. For HADR,
the client configuration is what actually performs the failover. As we will show
later, the client is responsible for redirecting requests to the alternate server
once it receives a signal that the primary database is unavailable.
When prompted to select whether we
wanted to create an instance now or wait until later, we selected "install
now" and provided the instance information as requested. To make the application
server aware of the DB2 instance and the database(s) provided by that instance,
we had to catalog the database(s) on the application server, but we could have done
that from other clients as well. We actually used a catalog script that is a bit
more complex, but the following simpler catalog directive would have sufficed:
#db2 catalog tcpip node db02 remote db02.example.com server \
60000 remote_instance db2inst1
#db2 catalog database WSI811 as WSI811 at node db02
Before attempting a connection to the database,
we ran the following to make sure our client was cataloged correctly:
#db2 list node directory
The output looked like this:
Node Directory
Number of entries in the directory = 1
Node 1 entry:
Node name = db02
Comment =
Directory entry type = LOCAL
Protocol = TCPIP
Hostname = db02.example.com
Service name = 60000
Next, we made sure we could see the database by
running the following command:
#db2 list database directory
The output looked like this:
System Database Directory
Number of entries in the directory = 1
Database 1 entry:
Database alias = WSI811
Database name = WSI811
Node name = DB02
Database release level = a.00
Comment =
Directory entry type = Remote
Catalog database partition number = -1
Alternate server hostname =
Alternate server port number = 60000
At this point, we tried to connect to the server.
When prompted for the password, we had to use the password for the "db2inst1"
user on the database server, not the client from which we were connecting:
#db2 connect to WSI811 user db2inst1
Enter current password for db2inst1:
Database Connection Information
Database server = DB2/LINUX 8.2.2
SQL authorization ID = DB2INST1
Local database alias = WSI811
After we saw the message above, we knew we had
a successful setup and connected to our DB2 instance.
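A trivial query makes a handy smoke test at this point; any catalog view will do, for example:

```shell
# Verify that the new connection can actually execute SQL,
# then cleanly close the connection.
db2 "select count(*) from syscat.tables"
db2 terminate
```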
Secondary Server
The next step was to set up the second server
exactly the same as the first, with the exception of the file system. We mounted
a separate file system for the second database server ("sdc2" instead
of "sdc1"). We did this because of DB2's "shared nothing"
architecture, a concept that is shared by Google and several other technology leaders.
Our mount command for db02 changed to:
#mount -t ext3 /dev/sdc2 /db2_data
We put that mount directive into "/etc/fstab".
When all of the setup tests we ran on the
first server worked on the second database server, we could begin our HADR setup.
First we had to make sure these servers could communicate with each other. Because
our policy was not to trust DNS servers for anything this important, we entered
the IP address and hostname of both database servers into our "/etc/hosts"
file. The "/etc/hosts" files on each server looked like the following:
127.0.0.1 localhost.localdomain localhost
10.100.1.70 db01.example.com db01
10.100.1.69 db02.example.com db02
By doing this, we eliminated the possibility that a DNS failure could interrupt
communication with our perfectly tuned database.
Next, we turned our attention to the "/etc/services"
file that contained our port definitions. The last three lines of the "/etc/services"
file read:
hadr1 55050/tcp
hadr2 55051/tcp
db2_db2inst1 60000/tcp
"hadr1" and "hadr2" ports
were defined to transfer the logs between servers. The "db2_db2inst1"
port was the port on which the database was listening.
To test network connectivity, we used simple
tools such as "dig", "ping", and "nmap". After we
found positive results, we continued.
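For reference, the checks were along these lines, run from each server against its peer (the port list matches our "/etc/services" entries):

```shell
# Name resolution, basic reachability, and the HADR/DB2 ports:
dig db02.example.com
ping -c 3 db02.example.com
nmap -p 55050,55051,60000 db02.example.com
```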
Finally, we made an online backup from
our primary database server. We used that backup to restore to our secondary server.
We backed up and restored one database at a time to avoid disk contention.
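The backup and restore pair looked roughly like the following (a sketch; the /db2_backup directory is our own choice). Note that the standby must be left in rollforward-pending state, so no rollforward is issued after the restore:

```shell
# On the primary: take an online backup, including the logs
# needed to make the image self-contained.
db2 backup database WSI811 online to /db2_backup include logs

# On the standby: restore the image; do NOT roll forward --
# HADR will ship and replay the logs once it is started.
db2 restore database WSI811 from /db2_backup
```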
HADR
Since HADR was ready to be applied, we needed
to make each instance aware that we were running HADR. We accomplished this by updating
the HADR configuration on both servers; the commands below are for db02, and on
db01 the local and remote host and service values are swapped:
#db2 update db cfg for WSI811 using hadr_local_host db02.example.com
#db2 update db cfg for WSI811 using hadr_local_svc 55050
#db2 update db cfg for WSI811 using hadr_remote_host db01.example.com
#db2 update db cfg for WSI811 using hadr_remote_svc 55051
#db2 update db cfg for WSI811 using hadr_syncmode sync
#db2 update db cfg for WSI811 using hadr_remote_inst db2inst1
#db2 update db cfg for WSI811 using hadr_timeout 120
#db2 update alternate server for db WSI811 using hostname \
db01.example.com port 60000
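Spelled out, the mirrored settings on db01 look like this (same values, with the local and remote pairs reversed):

```shell
# On db01, the local/remote pairs are the reverse of db02's.
db2 update db cfg for WSI811 using hadr_local_host db01.example.com
db2 update db cfg for WSI811 using hadr_local_svc 55051
db2 update db cfg for WSI811 using hadr_remote_host db02.example.com
db2 update db cfg for WSI811 using hadr_remote_svc 55050
db2 update db cfg for WSI811 using hadr_syncmode sync
db2 update db cfg for WSI811 using hadr_remote_inst db2inst1
db2 update db cfg for WSI811 using hadr_timeout 120
```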
We started HADR on the standby database first.
Note that we said, "standby database" instead of "standby database server". This
is an important distinction because, with our setup, one server can be the primary
for one database while acting as the standby for a different database -- this is
called an "active-active" setup.
Back to our setup... to start HADR on the standby, we issued the command:
#db2 start hadr on database WSI811 as standby
HADR must be started on the standby database before
starting it on the primary database.
Once HADR was running on our standby database,
we started HADR on the primary:
#db2 start hadr on database WSI811 as primary
At this point, HADR was configured correctly,
but we needed to notify our clients about it. This is where the catalog component
gets a little more complicated; we wrote a script to accomplish what we needed:
#!/bin/bash
db2 uncatalog node db01
db2 uncatalog node db02
db2 catalog tcpip node db01 remote db01.example.com server \
60000 remote_instance db2inst1
db2 catalog tcpip node db02 remote db02.example.com server \
60000 remote_instance db2inst1
db2 uncatalog database WSI811
db2 catalog database WSI811 as WSI811 at node db01
Running the above script on the client should
produce results similar to this:
#db2 list node directory
Node Directory
Number of entries in the directory = 2
Node 1 entry:
Node name = DB01
Comment =
Directory entry type = LOCAL
Protocol = TCPIP
Hostname = db01.example.com
Service name = 60000
Node 2 entry:
Node name = DB02
Comment =
Directory entry type = LOCAL
Protocol = TCPIP
Hostname = db02.example.com
Service name = 60000
#db2 list database directory
System Database Directory
Number of entries in the directory = 1
Database 1 entry:
Database alias = WSI811
Database name = WSI811
Node name = DB01
Database release level = a.00
Comment =
Directory entry type = Remote
Catalog database partition number = -1
Alternate server hostname =
Alternate server port number = 60000
The "Alternate server hostname" and
"Alternate server port number" values are blank. These values got populated
once we connected to the database as the instance owner (db2inst1) from the client.
After we connected to the database, the alternate server hostname and alternate
server port number field read:
Alternate server hostname = db02.example.com
Alternate server port number = 60000
Once we got to this point, we could perform some
failover testing from the application server. We pulled the network cable on the
primary database server, then switched the standby to be the primary with a forced
takeover (forced, because the old primary could no longer be reached):
#db2 takeover hadr on database wsi811 by force
We checked to make sure we could still connect
to the server. Also, we checked our running application(s). The main application
that was running generated errors until we had cleared our connection pools.
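A quick way to confirm which role each database currently holds is the HADR section of a database snapshot, for example:

```shell
# On either server: Role should read Primary or Standby, and State
# should return to Peer once both databases are connected again.
db2 get snapshot for database on WSI811 | grep -A 10 "HADR Status"
```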
Understanding Heartbeat and Mon
The core of the fault-tolerant system is a set
of programs called "heartbeat" and "mon". While it's not essential to fully
understand how these programs interact (one can simply configure them by following
any number of articles on the Internet), that understanding is invaluable when
attempting to set up an atypical installation.
What Is an Atypical Installation?
Most step-by-step, "guide"-type documentation
regarding heartbeat and mon on the Internet assumes that you will be running in
an "active-passive" configuration. In this scenario, you have an "active"
primary system that "owns" the resources and a "passive" backup
system that basically just waits for the primary to fail. But let's take a
step back. The key to the understanding of heartbeat and mon is the concept of a
"resource" -- a program, service, etc. that you want to be fault tolerant.
For example, in our environment, the DB2 engine is the "resource" that
matters. So, in an "active-passive" configuration, we would have one DB2
server, holding all the databases, tables, etc. and the other a copy of it, basically
just waiting for the first to fail.
One problem with such a configuration
is that this leaves the secondary (passive) server metaphorically "twiddling
its fingers". In other words, it's not being used to its capacity unless
and until the primary (active) server fails. When "bang for the buck"
counts, this is not an optimal way to deal with things. Instead, you might want
to distribute databases across servers, having some primary on one and standby on
the other, and others in the reverse roles. This is considered "active-active"
and is a bit more difficult to handle, in that you must always know who's in
charge of which resource at a given time.
Further, most documentation also ignores
what to do when you finally get that primary server repaired. How, in the vernacular,
does one "fail back"? While this is mentioned, it's rarely, if ever,
discussed in detail.
Also, most documentation glosses over
the difference between heartbeat, ipfail, and mon, treating them as parts of the
same package. Although they most often are used together, they are separate programs
with separate responsibilities. Until one understands those responsibilities and
how the programs interact, it can be very confusing.
So, if we're going to do anything
more than the basic "active-passive", floating IP, failover configuration,
we need to answer the following questions:
- What is heartbeat? What is mon? What is ipfail? How do they relate?
- Now that we've failed over, how do we fail back (recover)?
- How do we move from active-passive to active-active?
Heartbeat, Ipfail, and Mon
Heartbeat is, essentially, a tool that determines
whether connectivity exists (connectivity to what? just wait) and, if it does not,
what it needs to do to rectify the situation. The connectivity comes, basically,
in two flavors -- connectivity to peers and basic connectivity ("aliveness").
Heartbeat works in a "cluster".
That is, it has a set of peers with which it tests connectivity and, if that connectivity
is lost, to whom it cedes resource responsibility. It also can monitor outside connectivity
(outside the cluster) through the module called "ipfail" to determine
whether it has lost network connectivity. This is a "Good Thing" from
the standpoint that a system can do some of its own diagnostics -- it can determine,
to some extent, if it can no longer talk to anyone or if, indeed, its peer is the
one at fault. It can then take appropriate action.
So, as a thumbnail, all heartbeat does
is monitor connectivity and, if the "primary" in the cluster goes away,
it starts up a few programs.
What heartbeat does not do is to monitor
services (other than the network) itself. For example, it doesn't look to see
whether your database is running. For that, we need mon.
Mon is a generic service monitoring
tool. Basically, it watches services (like a database engine or disk utilization)
and, if appropriate, triggers alerts (via email, pager, or whatever you choose).
So, although you often hear heartbeat
and mon referred to almost interchangeably, they are two separate entities. They
connect only in that one (mon) is started by the other (heartbeat). Understanding
this simple (albeit sometimes subtle) fact is the key to using these tools to create
a fault-tolerant system.
To summarize, heartbeat waits for connectivity
to be lost, through its own monitoring and through external facilities like ipfail,
and mon watches other services. In practice, these three tools relate to each other
through their configuration files.
We began configuring heartbeat by first
configuring the "ha.cf" file (see Listing 1 for the essentials of ours).
This file specifies the "timer values" (how long we go without a response
before assuming things are dead, etc.), the members of our "cluster",
the addresses to check for exterior connectivity, and what to do in case of lost
connectivity. (See the "HADR and Heartbeat Timing Settings" sidebar for
more information.)
In essence, heartbeat creates a cluster
of nodes. Each system (node) in the cluster is watched by the heartbeat program
through the use of a private communications channel. (We won't get into sockets
and named pipes here; suffice it to say it's a private channel.) Heartbeat
then keeps tabs by talking through this channel to all members (nodes) of the cluster.
If a node goes down, then a member of the cluster tries to take over its services.
It is easy to become confused by the
relationship between heartbeat and ipfail. Though the two are most often discussed
together, they are really two standalone programs. While heartbeat checks connectivity
to other members of the cluster, ipfail checks connectivity with the rest of the
world. To this end, you need to make sure you set the ping nodes to addresses that
will help to determine whether network connectivity has been lost. So, instead of
setting a peer as a ping node, you want to set something like your router, gateway,
or another such always-reachable host.
Think of it this way -- ipfail
is a "true-false module" called by heartbeat that tells heartbeat whether
it should trigger in the same way it would if heartbeat detected a primary down.
It is just like the peer-to-peer portion of heartbeat, but it monitors the network
instead.
Next, we need to look at the "haresources"
file (see Listing 2). This file lists the resources that matter to us and states
which node, when all is right with the world, is the primary for those resources.
As discussed previously, we have configured our DB2 clients to know who the servers
are and to contact the appropriate one if connectivity is lost.
In this way, we don't need to
deal with a "floating IP"; so, in our "haresources" file, we
do not list an IP address. Instead, we list the primary node and the resource names
"hadr" and "mon" (note that all peers, primary and all others,
have the same file -- that is, all have the same primary node entry, thus telling
the peers who is ultimately responsible for stuff).
It's important to understand what
those resource names are in relation to actual program files. These names are, in
fact, the names of scripts kept in the "resource.d" directory. Those scripts
follow a very specific format. Like scripts in /etc/init.d, those scripts must accept
the following arguments: start, stop, and status. (Listing 3 contains a sample hadr
script that allows start, stop, and status.)
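To give the flavor of such a script (Listing 3 has our real one), a minimal skeleton might look like the following; the database name and the decision to force the takeover are illustrative assumptions, not our production code:

```shell
#!/bin/bash
# Minimal resource.d script skeleton: heartbeat invokes it with
# start, stop, or status, just like an /etc/init.d script.
. /etc/ha.d/shellfuncs

case "$1" in
    start)
        # Called on a peer when it must take over the resource:
        # force an HADR takeover of our database.
        su - db2inst1 -c "db2 takeover hadr on database WSI811 by force"
        ;;
    stop)
        # Called when the resource should be released on this node.
        # Nothing to tear down in our setup; the takeover happens
        # on the surviving peer.
        ;;
    status)
        # Heartbeat uses this to ask whether the resource is held here.
        echo running
        ;;
    *)
        echo "usage: $0 {start|stop|status}" >&2
        exit 1
        ;;
esac
```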
It's also important to understand
why and when these scripts will get called. They are not called by heartbeat when
heartbeat starts on the primary node. Heartbeat assumes these are running on the
primary node, so make sure you start them elsewhere (/etc/init.d would be a good
place).
They are, however, called when a peer
determines that it needs to take over (either because it has detected, through heartbeat,
a failure of the primary or has been told by hb_standby to take over). So, in
effect, these scripts are what cause the actual takeover of their respective resources
when necessary.
The mon program is configured with
the "mon.cf" file (see Listing 4 for the essentials of ours), which contains
a set of stanzas stating which resources should be monitored and, if the resource
is not working, when, how, and how often to send the appropriate alert. (See the
"HADR and Mon Timing Settings" sidebar for more information.) Like the
"haresources" resource names, above, the "monitor" and "alert"
portions of the stanzas are the names of scripts. These reside in the "mon.d"
directory. The "monitor" scripts are true-false returning programs that
determine whether a resource is okay (true) or not (false). If the resource is not
okay, then the corresponding alert script is run. While there is no real set of
requirements like those for the haresources scripts, good coding rules should be
observed.
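As an illustration of that true-false contract (our real monitor script appears in Listing 5), a monitor for the DB2 instance could be as simple as attempting a connection and reporting the result through the exit code; the database name here is an assumption:

```shell
#!/bin/bash
# mon "monitor" scripts signal success with exit 0 and failure
# with a non-zero exit -- that is the entire contract.
if su - db2inst1 -c "db2 connect to WSI811" >/dev/null 2>&1; then
    su - db2inst1 -c "db2 terminate" >/dev/null 2>&1
    exit 0    # resource is okay
else
    exit 1    # resource failed; mon will run the alert script
fi
```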
With all of these scripts, both for
haresources and mon, the responsibility is on you, the programmer, to handle all
aspects of failover (haresources), monitoring (mon -- monitor), and alerting
(mon -- alert). Heartbeat, ipfail, and mon do not do this for you. So, crack
your knuckles, pop open another Jolt, and get down to script writing (see Listing
5 for an example of the "peer.monitor" script as called from the "mon.cf"
file).
One thing that will help is if you
use the library of shell functions called "shellfuncs" that's provided
with heartbeat as an "initial values" library. All of our scripts (see
Listings 5-8) begin with a source call to this file:
. /etc/ha.d/shellfuncs
This makes sure that certain values, like HOSTNAME,
are shared through all programs. In our environment, this is particularly necessary
because we need to know what host we are on in order to know how to correctly assume
certain roles. And because we are executing the scripts in a safe environment (e.g.,
non-root), values like HOSTNAME may not be inherited. (Yes, we could just use
/bin/hostname, but this is just an example; we'll show how we use other variables
later.) It
also provides a handy place for any generic functions that the scripts might need.
These scripts, then, are the real meat of
the heartbeat/mon system. It is where all the work necessary for failing over (and
back) gets done, and it's totally up to you as to how it happens. Understanding
this fact, as well as the relationships between heartbeat, mon, and ipfail, is
what allows us to create atypical installations.
Your Server Is Fixed -- Now What?
So now you have successfully failed over to the
backup server. And called hardware maintenance. They've fixed the primary server,
so now how do you get back to where you were? This isn't always as easy as
you might think.
While there is a facility in heartbeat
called "auto_failback", you have to remember that it is anything but
auto in terms of the resources. It will be up to you to handle failing back.
So, let's create a manual procedure
for failing back. To do this gracefully, it is important to follow the order of
events:
1. Shut down mon on the standby (the
"normally backup") server.
2. Shut down heartbeat on the standby
server.
3. Take over resources "manually"
on the primary (fixed) server.
4. Start heartbeat on the primary server.
5. Start heartbeat on the standby server.
Step 3 is a bit of "magic happens
here". We created a program called takeover that causes a HADR forced takeover
of all the databases from the standby to the primary.
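Our takeover program amounts to looping over the databases and issuing a takeover for each. A sketch (the database list is an assumption; note that no "by force" is needed here, since at fail-back time both servers are up and in peer state):

```shell
#!/bin/bash
# Graceful fail-back: take each database back on the repaired
# primary, one at a time.
DATABASES="WSI811"

for db in $DATABASES; do
    su - db2inst1 -c "db2 takeover hadr on database $db"
done
```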
The trick to fully automate this procedure
is simply to script the process, enshrining it in the scripts that we created for
heartbeat (haresources) and mon (monitor and alert) and then turning on auto_failback
in the "ha.cf" file. We have decided not to do that here since we want
to very carefully monitor the reinstatement of the primary server and not rely on
automatic services.
Where to Go from Here (Active-Active Configurations)
Once again, this is where our knowledge of the
relationships and duties of the various parts is essential. To create a true active-active
setup where each of the servers is "HADR primary" for part of the databases
(instead of one being primary for all of them), all we need to do is to revisit
the scripts.
This time, we modify them to understand
that when they get called, for either starting or stopping, they must take over
the correct databases, since they already are primary on the others. This is a place
that the shellfuncs file can be particularly helpful. It can contain the list of
what services (databases) are typically primary on which server. That way, the scripts
on one system are "boilerplate" to all the others in the cluster.
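For example, shellfuncs could carry a small function mapping each host to the databases it normally owns. Everything below (the function name, the second database FINDB) is our own sketch used only to show the active-active split:

```shell
# Given a host name, echo the databases that are normally HADR
# primary on that host. WSI811 is our database; FINDB is a
# hypothetical second database.
primary_dbs_for() {
    case "$1" in
        db01) echo "WSI811" ;;
        db02) echo "FINDB" ;;
        *)    echo "" ;;
    esac
}
```

A resource script can then do `for db in $(primary_dbs_for $(hostname -s)); do ... done` and remain identical on every node in the cluster.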
To summarize, the most important part
of understanding how to configure and then go beyond typical fault-tolerant installations
of heartbeat and mon is understanding the elements of the system and how they relate.
Once you understand that, the rest is scripting. And, as a systems administrator,
you have done your fair share of that already, so you are almost home!
Conclusion
The setup for our DB2 database went very well.
We have a solid foundation for our ERP system, which will provide maximum uptime
and maximum reliability. For us, DB2, Linux, heartbeat, mon, and ipfail were invaluable
in providing a "bulletproof penguin" for our ERP system.
Joseph La Chapell is an IT professional who
enjoys using Linux to solve problems and has been doing so since the late 1990s.
He thanks his family for their support in all he does. JT Vogt is a Unix professional
with more than 20 years of Unix experience, and he believes that computers should
manage themselves. Joseph and JT are both system administrators for Warehouse Specialists,
Inc. located in Appleton, Wisconsin.