High-Availability
Clustering with Sun Cluster
Justin Buhler
The XX 2006 Olympic Winter Games are over, and chances are that
you or someone you know watched or listened to the broadcasts of
competition during the games. With several million people interested
in the competition results and the high visibility of the project,
service availability was one of the top priorities in the overall
technology design and implementation.
This article describes how we used Sun Cluster, Generic Data Service
Cluster Agent, and some Perl programming to enable some of our Java
services to operate within the Sun Cluster product and become more
reliable and available during the games by eliminating the manual operations
required for recovery outside the cluster framework. Before diving
into the topics related to clustering and agent development, see
Table 1 for some key concepts that help provide better context to
the article.
Cluster Framework 101
The following section is intended to be a crash course in Sun
Cluster 3.1; it is not a replacement for the necessary training,
which will fill in the gaps. There are many well-written books and
blogs that describe the Sun Cluster internals and product features.
See the references at the end of the article for more information.
Sun Cluster is clustering software developed by Sun to provide
high availability and rapid recovery to business applications. The
Sun Cluster product currently supports up to 16 nodes within
a single cluster implementation. Sun Cluster provides two clustering
methods -- failover and scalable -- but I will only discuss the
failover method as it is the most commonly deployed. The Sun Cluster
product provides agents for more than 30 different mainstream applications.
Sun Cluster is installed as a series of kernel modules and system
daemons on the Solaris platform, which collectively provide a non-cluster-aware
application with a highly available framework within which to execute.
Sun Cluster leverages features of the standard Solaris operating
system, including IP multipathing for network connectivity resilience,
MPxIO for physical storage path resilience, and Solaris Volume Manager
for file system data resilience, to implement some of its framework.
Table 2 shows a brief summary of the core processes and their roles.
Our standard Sun Cluster architecture is a two-node failover cluster
(shown in a simplified view in Figure 1), which consists of two servers
with Sun Cluster software installed and two transports configured via
crossover cables for internal cluster communication.
The application service configuration consists of everything that
makes the application a service, such as the following: all disks,
volumes, and file systems; mount points; network adapters with network identity;
start, stop, and monitor scripts (Listings 1-3); and the application.
All of these configuration details are captured in a customized
agent.
Since the application is treated as a service/agent, it can then
be transferred between the cluster hosts with minimal downtime.
The service/agent has one identity, which is known within the cluster
framework and to the rest of the world. Therefore, the service can
run on any host in the cluster configuration. However, it can only
reside on one system at a time, unlike traditional computing clusters.
Thus, in the event of a hardware, system, or network failure, the
application service moves to the available cluster host. See Figure
1.
Resource Group Manager
The key to making your applications work within the Sun Cluster
really begins with understanding the Resource Group Manager (RGM)
implementation. The Resource Group Manager is a system daemon that executes
in user space and can be seen in the process table as rgmd. The
RGM's role is to manage your application within the cluster framework
by simply starting, stopping, and monitoring the resources that
make up your application. Table 3 describes the RGM model, and Figure
2 is a diagram to visualize the relationships in the RGM model.
The RGM is notified of events by the cluster framework, which
causes the RGM to execute the correct callback method, such as restarting
the resource within the resource group. The RGM is also responsible
for monitoring each of the resources using the MONITOR_XXXX callback
methods defined for each resource type. In the event of a failure,
the RGM will stop and start each resource using the STOP and START
callback methods. The simple combination of START, MONITOR, and
STOP is how the application's availability is achieved within the
cluster. Another thing to note about resources and resource types
is that the timing intervals of each of the callback methods can be
tuned to suit the environment.
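To make the START, MONITOR, and STOP cycle more concrete, here is a
minimal Python sketch. It is only a toy model of the behavior described
above, not how rgmd is implemented: the Resource class and the supervise
function are hypothetical names, and real callback methods are separate
programs with their own tunable timeouts.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Resource:
    """A resource reduced to its three callbacks (hypothetical model)."""
    name: str
    start: Callable[[], None]
    stop: Callable[[], None]
    monitor: Callable[[], bool]          # True means the probe succeeded
    log: List[str] = field(default_factory=list)


def supervise(resource: Resource, probes: int) -> None:
    """Start the resource, probe it `probes` times, restart on a failed probe."""
    resource.start()
    resource.log.append("START")
    for _ in range(probes):
        if resource.monitor():
            resource.log.append("MONITOR ok")
            continue
        # A failed probe leads to STOP followed by START.
        resource.log.append("MONITOR failed")
        resource.stop()
        resource.log.append("STOP")
        resource.start()
        resource.log.append("START")
```

Feeding supervise a monitor callback that fails once shows the restart
in the log: the failed probe is followed by a STOP and a START entry
before monitoring resumes.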
Now that I have provided a rough overview of the Sun Cluster framework,
I will describe the use and customization of the SUNW.gds resource
type.
Custom Agent Creation Using Perl
As I mentioned, the Generic Data Service is a generic resource
type within Sun Cluster, which is used to enable our application
within the cluster framework without having to create and register
a custom resource type.
Our particular challenge was that we had to distinguish multiple
Java processes from each other in the system process table, validate
that the processes were online, and ensure that they were started
and stopped in the correct sequence considering the status of the
databases. The following shows how this was achieved.
1. Distinguish multiple Java processes and monitor that processes
are online -- We developed a custom monitoring script in Perl using
the Proc::ProcessTable Perl module to monitor several Java processes.
Proc::ProcessTable provides a generic interface to the process tables
of multiple platforms. The Perl script is really simple and only checks
that the processes are in the process table. This is similar
to the check that you might already do in Unix (such as ps
-ef | grep some Java process), but instead of looking only for a
name, the monitor considers the check successful only if the process
name, its arguments, and the username all match. (PMFD could also be
used to monitor the process and its child processes.) See Listing 4.
See the "Java Application Hack" sidebar for details regarding
how to distinguish Java processes from each other in the process
table. See the References section for information on the Proc::ProcessTable
module.
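The matching rule described above can be sketched in a few lines of
Python. This is not the Perl from Listing 4: it reads the process table
with ps instead of Proc::ProcessTable, and the function names and the
idea of a "marker" argument (one argument that identifies a particular
Java process) are assumptions made for illustration. The return values
0 and 100 follow the exit codes this article gives for success and
complete failure.

```python
import os
import subprocess
from typing import List, Sequence, Tuple

PROBE_OK = 0        # every expected process was found
PROBE_FAILED = 100  # at least one expected process is missing

ProcessRow = Tuple[str, str]     # (username, full command line)
Expected = Tuple[str, str, str]  # (username, command name, marker argument)


def read_process_table() -> List[ProcessRow]:
    """Return (user, args) rows using POSIX ps output options."""
    out = subprocess.run(["ps", "-e", "-o", "user=", "-o", "args="],
                         capture_output=True, text=True, check=True).stdout
    rows = []
    for line in out.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            rows.append((parts[0], parts[1]))
    return rows


def is_running(rows: Sequence[ProcessRow], expected: Expected) -> bool:
    """True only if user, command name, and marker argument all match one row."""
    user, command, marker = expected
    for row_user, args in rows:
        argv = args.split()
        if (argv and row_user == user
                and os.path.basename(argv[0]) == command
                and marker in argv[1:]):
            return True
    return False


def probe(rows: Sequence[ProcessRow], expected: Sequence[Expected]) -> int:
    """Return the monitor's exit code for the given process table."""
    ok = all(is_running(rows, item) for item in expected)
    return PROBE_OK if ok else PROBE_FAILED
```

A monitor script built on this would call probe(read_process_table(), ...)
with one (username, command, marker) tuple per Java process and exit with
the returned value. Keeping the matching separate from the ps call also
makes the rule easy to test against a hand-written list of rows.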
2. Ensure processes are started and stopped in the correct sequence
considering the status of the databases -- A particular obstacle
in the development of the start and stop scripts was an application
dependency on some database instances, which needed to be online
before starting the application. Therefore, we decided that automating
fault recovery was worth the effort of handling the database
dependency in the start and monitor scripts. Note that
this is not a recommended practice; however, I share it to show
the possibilities in the development of any custom scripts within
the cluster framework.
We again relied on Perl, this time on the DBI and DBD modules, to help
us tackle this problem. The DBI Perl module is a generic interface for
accessing a database via a database driver provided by a specific DBD
Perl module. The script connects and performs a simple select on a table,
which returns a specific result. If the result returned is correct,
the script continues to the next operation. If it is incorrect or the
database is down, the script exits with code 0 (see Listing 5).
The logic above does something that looks very strange if the database
is not online: it exits with code 0, which ends the startup sequence
and reports it as successful even though the application was never
launched. By comparison, exit code 100 is used for complete failure, and
the special value of 201 requests an immediate failover to another node
but does not start the application.
The same code is included in the monitoring script, which monitors
only the databases while they are offline. Once all the databases
are online, the script moves past the db_check() subroutine.
This logic works because we know beforehand that the application
will exit when the databases are offline. Again, I must stress that
this is only an example to promote a broader range of solutions
and to show that you must have a good understanding of the application
in order to make the right decisions about monitoring.
See the References section for information on the DBI and DBD modules.
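The start logic described in this section can be outlined in Python as
follows. This is a sketch under stated assumptions, not Listing 5: it
uses the standard sqlite3 module as a stand-in for DBI and a DBD driver,
and the function names and the launch callback are hypothetical. The
exit codes are the ones given in the text.

```python
import sqlite3
from typing import Callable

EXIT_OK = 0          # reported to the cluster as success
EXIT_FAILED = 100    # complete failure
EXIT_FAILOVER = 201  # immediate failover request (not used in this sketch)


def db_check(connect: Callable[[], sqlite3.Connection],
             query: str, expected: object) -> bool:
    """True only if the query runs and its first value equals `expected`."""
    try:
        conn = connect()
        try:
            row = conn.execute(query).fetchone()
        finally:
            conn.close()
    except sqlite3.Error:
        return False  # database down, or the query itself failed
    return row is not None and row[0] == expected


def start(connect: Callable[[], sqlite3.Connection], query: str,
          expected: object, launch: Callable[[], bool]) -> int:
    """Skip the launch, yet still return 0, when the database is not ready."""
    if not db_check(connect, query, expected):
        return EXIT_OK
    return EXIT_OK if launch() else EXIT_FAILED
```

The notable design choice is the first branch of start: a failed
database check returns 0 rather than 100, so the cluster does not treat
the missing database as an application start failure.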
Connecting the Dots
Now I'll connect all the dots and show you how to create the resource
group and resources for the application. Here are the requirements
of the script provided in Listing 6:
1. Define the service name, which you are enabling in the cluster,
in uppercase.
- The service IP address and DNS name will be "hostname-servicename.domainname".
- IPMP groups are named after the service name.
2. The application will use the SUNW.gds resource type.
3. The application is installed onto a shared diskset, which is
mountable at /opt/app1 and configured in /etc/vfstab.
4. The application is network-aware and is configured to listen on
a specific IP address; this IP address will be configured as a resource.
5. Edit the script and include a comma-separated list of the port
numbers (e.g., 4000/tcp, 4001/tcp).
6. IPMP groups must be preconfigured, and group names must be the
uppercase name of the servicename (e.g., JAVA).
7. Edit the script and change the AGENT_PATH to the directory that
holds your application's start, stop, and monitor scripts. They must
be installed locally on both nodes and not on any shared storage.
8. The GDS resource is configured to be dependent on the network and
the disk. If these services are not available, then the cluster will
not attempt to start the GDS resource.
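As a rough illustration of how requirements 2, 5, 7, and 8 come together,
the Python sketch below assembles an scrgadm command line for a SUNW.gds
resource. It is not Listing 6. The resource and group names ("-app",
"-rg") and the script file names under AGENT_PATH are hypothetical, and
the options are my recollection of standard Sun Cluster 3.1 scrgadm
usage, so check them against the scrgadm and SUNW.gds man pages before
relying on them.

```python
from typing import List, Sequence


def gds_resource_command(service: str, agent_path: str,
                         ports: Sequence[str], network_resource: str,
                         storage_resource: str) -> List[str]:
    """Build the argument list that would register the application resource."""
    name = service.upper()  # requirement 1: the service name is uppercase
    return [
        "scrgadm", "-a",
        "-j", f"{name}-app",  # hypothetical resource name
        "-g", f"{name}-rg",   # hypothetical resource group name
        "-t", "SUNW.gds",
        "-y", "Port_list=" + ",".join(ports),
        "-x", f"Start_command={agent_path}/start",    # hypothetical file name
        "-x", f"Stop_command={agent_path}/stop",      # hypothetical file name
        "-x", f"Probe_command={agent_path}/monitor",  # hypothetical file name
        # Requirement 8: depend on the network and the disk resources.
        "-y", f"Network_resources_used={network_resource}",
        "-y", f"Resource_dependencies={storage_resource}",
    ]
```

Returning a list rather than a single string keeps each option as one
argument, so paths or port lists are not split again by a shell when the
command is eventually executed.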
The initialization script must follow this sequence when initializing
the resource group within RGM:
1. Initialize resource group.
2. Initialize the storage resource.
3. Initialize the network resource.
4. Online the resource group.
5. Enable the storage resource.
6. Enable the network resource.
7. Bring the resource group and resources online.
8. Initialize the application resource.
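One way to keep a script honest about this ordering is to drive it from
a fixed list and stop at the first step that fails. The Python sketch
below shows that idea only. The step labels paraphrase the sequence
above, while the run_in_order function and the per-step callables are
hypothetical and stand in for whatever commands the real script executes.

```python
from typing import Callable, Dict, List

# The eight steps, in the order the sequence above requires.
STEP_ORDER = [
    "initialize resource group",
    "initialize storage resource",
    "initialize network resource",
    "online resource group",
    "enable storage resource",
    "enable network resource",
    "bring resource group and resources online",
    "initialize application resource",
]


def run_in_order(actions: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run one action per step in STEP_ORDER and stop at the first failure."""
    missing = [step for step in STEP_ORDER if step not in actions]
    if missing:
        raise KeyError(f"no action defined for: {missing}")
    completed = []
    for step in STEP_ORDER:
        if not actions[step]():  # each action returns True on success
            raise RuntimeError(f"step failed: {step}")
        completed.append(step)
    return completed
```

Because the order lives in one list, a later step such as creating the
application resource cannot run before the storage and network resources
it depends on have been set up.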
Here is an example of the command, executed as the root user of
the cluster framework:
# ./build_resource java host1 host2 /opt/app1
See Listing 6.
If all goes well, the resource group (with all the resources)
has been configured and started.
Execute the command scstat -g to verify the status of your
new resource group.
Conclusion
The Sun Cluster framework integrates nicely with the Solaris operating
system, taking advantage of existing processes and features and
thereby reducing the overall learning curve required to manage a
cluster system. In addition to good product integration, Sun Cluster
provides a nice framework called Generic Data Service, which allows
us to quickly deploy new applications within the Sun Cluster framework.
Like any deployment, it pays to standardize on naming conventions,
IP addressing schemas, and installation paths.
The goal of this article was to provide a snapshot of the Sun
Cluster product and a short overview of using the Generic Data Service
(GDS) to enable your application within the cluster framework.
References
Blueprints for High Availability, 2nd Edition, by Evan
Marcus, Hal Stern: http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471430269,descCd-description.html
DBI: http://search.cpan.org/~timb/DBI-1.50/ and http://search.cpan.org/~timb/DBD-Oracle-1.16/
Designing Enterprise Solutions with Sun Cluster 3.0, by
Richard Elling: http://www.phptr.com/articles/article.asp?p=29316&seqNum=8&rl=1
IPMP groups: http://docs.sun.com/app/docs/doc/816-4554/6maoq027u?l=ja&a=view
Planning for the Development of Sun Cluster Agents (excerpted
from Sun Cluster 3 Programming: Integrating Applications into
the SunPlex Environment, by Joseph Bianco, Peter Lees, and Kevin
Rabito): http://www.informit.com/articles/article.asp?p=389711&rl=1
Proc::ProcessTable: http://sourceforge.net/projects/proc-ptable
Solaris Clustering Forum: http://forum.sun.com/forum.jspa?forumID=1
Sun employee blog: http://blogs.sun.com/roller/page/kristien
Sun Cluster homepage: http://www.sun.com/software/cluster
System Administration Software (excerpted from Sun Fire V490
Server Administration Guide): http://www.sun.com/products-n-solutions/hardware/docs/html/817-3951-12/chp5.html
Justin Buhler was the UNIX and Oracle Team Lead for Major Events,
Atos Origin. Past projects included Torino 2006 Winter Olympics,
Athens 2004 Summer Olympics, and Salt Lake City 2002 Winter Olympics.
You can contact him at: justin.buhler@gmail.com.