High-Availability Clustering with Sun Cluster

Justin Buhler

The XX 2006 Olympic Winter Games are over, and chances are that you or someone you know watched or listened to the broadcasts of competition during the games. With several million people interested in the competition results and the high visibility of the project, service availability was one of the top priorities in the overall technology design and implementation.

This article describes how we used Sun Cluster, the Generic Data Service cluster agent, and some Perl programming to enable some of our Java services to operate within the Sun Cluster product and become more reliable and available during the games by eliminating the manual recovery operations required outside the cluster framework. Before diving into the topics related to clustering and agent development, see Table 1 for some key concepts that provide context for the article.

Cluster Framework 101

The following section is intended to be a crash course in Sun Cluster 3.1; it is not a replacement for the necessary training, which will fill in the gaps. There are many well-written books and blogs that describe the Sun Cluster internals and product features. See the references at the end of the article for more information.

Sun Cluster is clustering software developed by Sun to provide high availability and rapid recovery to business applications. The Sun Cluster product currently supports a total of 16 nodes within a single cluster implementation. Sun Cluster provides two clustering methods -- failover and scalable -- but I will only discuss the failover method as it is the most commonly deployed. The Sun Cluster product provides agents for more than 30 different mainstream applications.

Sun Cluster is installed as a series of kernel modules and system daemons on the Solaris platform, which collectively provide a non-cluster-aware application with a highly available framework within which to execute. Sun Cluster implements some of its framework by leveraging features of the standard Solaris operating system, including IP multipathing for network connectivity resilience, MPxIO for physical storage path resilience, and Solaris Volume Manager for file system data resilience. Table 2 shows a brief summary of the core processes and their roles.

Our standard Sun Cluster architecture is a two-node failover cluster (shown in a simplified view in Figure 1): two servers with Sun Cluster software installed and two transports configured via crossover cables for internal cluster communication.

The application service configuration consists of everything that makes the application a service, such as the following: all disks, volumes, and file systems; mount points; network adapters with network identity; start, stop, and monitor scripts (Listings 1-3); and the application. All of these configuration details are captured in a customized agent.

Since the application is treated as a service/agent, it can then be transferred between the cluster hosts with minimal downtime. The service/agent has one identity, which is known within the cluster framework and to the rest of the world. Therefore, the service can run on any host in the cluster configuration. However, it can only reside on one system at a time, unlike traditional computing clusters. Thus, in the event of a hardware, system, or network failure, the application service moves to the available cluster host. See Figure 1.

Resource Group Manager

The key to making your applications work within Sun Cluster really begins with understanding the Resource Group Manager (RGM) implementation. The Resource Group Manager is a system daemon executing in user space and can be seen in the process table as rgmd. The RGM's role is to manage your application within the cluster framework by starting, stopping, and monitoring the resources that make up your application. Table 3 describes the RGM model, and Figure 2 diagrams the relationships within it.

The RGM is notified of events by the cluster framework, which causes the RGM to execute the appropriate callback method, such as restarting the resource within the resource group. The RGM is also responsible for monitoring each of the resources using the MONITOR_XXXX callback methods defined for each resource type. In the event of a failure, the RGM will stop and start each resource using the STOP and START callback methods. This simple combination of START, MONITOR, and STOP is how the application's availability is achieved within the cluster. Note also that the timing intervals of each callback method can be tuned per resource and resource type to suit the environment.
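
To make the START, MONITOR, and STOP combination concrete, here is a minimal shell sketch of a single monitoring pass. It is only an illustration of the idea, not how rgmd is implemented: the function name monitor_cycle is hypothetical, the three commands are passed in as plain strings, and real resources also involve timeouts, retry counts, and failover decisions that are left out here.

```shell
#!/bin/sh
# Illustration only: one pass of "probe, and on failure stop then start".
# monitor_cycle is a hypothetical name; it is not part of Sun Cluster.
#
# usage: monitor_cycle PROBE_CMD STOP_CMD START_CMD
monitor_cycle() {
    probe_cmd="$1"
    stop_cmd="$2"
    start_cmd="$3"

    # A zero exit status from the probe means the resource looks healthy.
    if $probe_cmd; then
        echo "healthy"
        return 0
    fi

    # The probe failed: stop the resource cleanly, then start it again.
    echo "probe failed: restarting"
    $stop_cmd
    $start_cmd
}
```

For example, monitor_cycle false 'echo stop' 'echo start' simulates a failed probe and shows the stop command running before the start command.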

Now that I have provided a rough overview of the Sun Cluster framework, I will describe the use and customization of the SUNW.gds resource type.

Custom Agent Creation Using Perl

As I mentioned, the Generic Data Service is a generic resource type within Sun Cluster. We used it to enable our application within the cluster framework without having to create and register a custom resource type.

Our particular challenge was that we had to distinguish multiple Java processes from each other in the system process table, validate that the processes were online, and ensure that they were started and stopped in the correct sequence considering the status of the databases. The following shows how this was achieved.

1. Distinguish multiple Java processes and monitor that processes are online -- We developed a custom monitoring script in Perl using the Proc::ProcessTable Perl module to monitor several Java processes. Proc::ProcessTable provides a generic interface to the process tables of multiple platforms. The Perl script is really simple and only checks that the processes are in the process table. This is similar to the checks that you might already do in Unix (such as ps -ef | grep for some Java process), but instead of looking only for a name, it requires several attributes to match: the check succeeds only if the process, its arguments, and its username all match. (PMFD could also be used to monitor the process and its child processes.) See Listing 4.

See the "Java Application Hack" sidebar for details regarding how to distinguish Java processes from each other in the process table. See the References section for information on the Proc::ProcessTable module.
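
The matching rule described above (owner plus argument string, rather than just a process name) can be sketched in shell as well. This is not Listing 4, which uses Perl and Proc::ProcessTable; it is a rough equivalent built on ps and awk. The function names match_proc and probe_processes and the example tags are hypothetical.

```shell
#!/bin/sh
# Sketch only: approximates the owner-plus-arguments check with ps and awk.
# On Solaris, use nawk or /usr/xpg4/bin/awk, since the old awk lacks -v.

# match_proc OWNER PATTERN
# Reads "owner args..." lines on stdin; succeeds if at least one line is
# owned by OWNER and its argument string contains PATTERN (literal match).
match_proc() {
    awk -v user="$1" -v pat="$2" '
        {
            args = $0
            sub(/^[ ]*[^ ]+[ ]+/, "", args)    # drop the owner column
            if ($1 == user && index(args, pat) > 0) found = 1
        }
        END { exit !found }
    '
}

# probe_processes OWNER TAG...
# Succeeds only if every TAG is found in the arguments of some process
# owned by OWNER in the current process table.
probe_processes() {
    owner="$1"
    shift
    snapshot=$(ps -eo user,args) || return 1
    for tag in "$@"; do
        printf '%s\n' "$snapshot" | match_proc "$owner" "$tag" || return 1
    done
    return 0
}
```

A probe script could then call something like probe_processes appuser -Dapp.name=results -Dapp.name=feeds, where the owner and the two tags are made-up values, and exit non-zero if any expected process is missing.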

2. Ensure processes are started and stopped in the correct sequence considering the status of the databases -- A particular obstacle in the development of the start and stop scripts was an application dependency on some database instances, which needed to be online before starting the application. Therefore, we decided that the automation of fault recovery was worth the effort to consider the database dependency in the start and monitor scripts. Note that this is not a recommended practice; however, I share it to show the possibilities in the development of any custom scripts within the cluster framework.

We again relied on Perl modules, this time DBI and DBD, to help us tackle this problem. The DBI Perl module is a generic interface for accessing a database via a database driver provided by a specific DBD Perl module. The script connects and performs a simple select on a table, which returns a specific result. If the result returned is correct, the script continues to the next operation. If it is incorrect or the database is down, the script exits with code 0 (see Listing 5).

The logic above does something very strange if the database is not online: an exit code of 0 ends the startup sequence of the application as successful. By comparison, exit code 100 is used for complete failure, and the special value of 201 causes a request for immediate failover to another node but does not start it.

The same code is included in the monitoring script, which monitors only the databases while they are offline. Once all the databases are online, the script moves past the db_check() subroutine.

This logic works because we know beforehand that the application will exit when the databases are offline. Again, I must stress that this is only an example to promote a broader range of solutions and to show that you must have a good understanding of the application in order to make the right decisions about monitoring.
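
The exit-code decisions described above can be summarized in a small shell sketch. It is not Listing 5: the database check and the application start are passed in as arbitrary commands, the function names start_service and probe_service are hypothetical, and the use of 100 in the probe is my assumption based on the exit codes mentioned earlier.

```shell
#!/bin/sh
# Sketch of the exit-code logic only; the real scripts are written in Perl.
RC_OK=0          # reported as success to the cluster
RC_FAIL=100      # complete failure
RC_FAILOVER=201  # request immediate failover (listed for reference, unused)

# start_service DB_CHECK_CMD APP_START_CMD
start_service() {
    db_check="$1"
    app_start="$2"

    # Database offline: report success without starting the application.
    if ! $db_check; then
        echo "database offline: skipping application start"
        return "$RC_OK"
    fi

    if $app_start; then
        echo "application started"
        return "$RC_OK"
    fi

    echo "application failed to start"
    return "$RC_FAIL"
}

# probe_service DB_CHECK_CMD PROCESS_CHECK_CMD
# While the databases are offline, only they are watched and the probe
# stays green; once they are online, the process check decides.
probe_service() {
    db_check="$1"
    proc_check="$2"

    $db_check || return "$RC_OK"
    $proc_check && return "$RC_OK"
    return "$RC_FAIL"
}
```

Calling start_service false true simulates an offline database: the function prints the skip message and returns 0, which is exactly the surprising behavior the text warns about.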

See the References section for information on DBI and DBD modules.

Connecting the Dots

Now I'll connect all the dots and show you how to create the resource group and resources for the application. Here are the requirements of the script provided in Listing 6:

1. The service IP address and DNS name will be "hostname-servicename.domainname", and IPMP groups are named after the service name.
2. The application will use the SUNW.gds resource type.
3. The application is installed onto a shared diskset, which is mountable at /opt/app1 and configured in the /etc/vfstab.
4. The application is network-aware and is configured to listen on a specific IP address; this IP address will be configured as a resource.
5. Edit the script and include a comma-separated list of the port numbers (e.g., 4000/tcp, 4001/tcp).
6. IPMP groups must be preconfigured, and each group name must be the uppercase form of the servicename (e.g., JAVA).
7. Edit the script and change AGENT_PATH to the directory containing your application's start, stop, and monitor scripts. The scripts must be installed locally on both nodes, not on shared storage.
8. The GDS resource is configured to be dependent on the network and the disk. If these services are not available, then the cluster will not attempt to start the GDS resource.

The initialization script must follow this sequence when initializing the resource group within RGM:

1. Initialize the resource group.

2. Initialize the storage resource.

3. Initialize the network resource.

4. Online the resource group.

5. Enable the storage resource.

6. Enable the network resource.

7. Bring the resource group and resources online.

8. Initialize the application resource.
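
As a rough guide to what this sequence might look like on the command line, the following dry-run sketch only prints the commands; it does not execute them, and it is not Listing 6. The option letters are recalled from the Sun Cluster 3.1 scrgadm(1M) and scswitch(1M) man pages and should be verified there. The resource names, the logical hostname, the agent directory, and the mapping of "online the resource group" to scswitch -o are all my assumptions.

```shell
#!/bin/sh
# Dry run: prints a plausible Sun Cluster 3.1 command sequence, runs nothing.
# All names derived from the service name below are placeholders.
#
# usage: plan_resource_group SERVICE NODE1 NODE2 MOUNTPOINT
plan_resource_group() {
    svc="$1"; node1="$2"; node2="$3"; mnt="$4"

    rg="${svc}-rg"              # resource group (hypothetical name)
    hasp="${svc}-hasp-res"      # storage resource (hypothetical name)
    lh="${svc}-lh-res"          # network resource (hypothetical name)
    app="${svc}-app-res"        # GDS application resource (hypothetical name)
    lhname="${node1}-${svc}"    # placeholder for the service DNS name
    agent="/opt/agents/${svc}"  # hypothetical AGENT_PATH, local on each node

    cat <<EOF
scrgadm -a -t SUNW.HAStoragePlus
scrgadm -a -t SUNW.gds
scrgadm -a -g $rg -h $node1,$node2
scrgadm -a -j $hasp -g $rg -t SUNW.HAStoragePlus -x FilesystemMountPoints=$mnt
scrgadm -a -L -g $rg -j $lh -l $lhname
scswitch -o -g $rg
scswitch -e -j $hasp
scswitch -e -j $lh
scswitch -Z -g $rg
scrgadm -a -j $app -g $rg -t SUNW.gds -y Port_list=4000/tcp,4001/tcp -y Network_resources_used=$lh -y Resource_dependencies=$hasp,$lh -x Start_command=$agent/start -x Stop_command=$agent/stop -x Probe_command=$agent/monitor
EOF
}
```

Running plan_resource_group java host1 host2 /opt/app1 prints ten lines: two resource type registrations followed by one command for each of the eight steps. Printing the plan first makes it easy to review the names and dependencies before anything touches the cluster.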

Here is an example of the commands executed as the root user of the cluster framework:

# ./build_resource java host1 host2 /opt/app1

See Listing 6.

If all goes well, the resource group (with all the resources) has been configured and started.

Execute the command scstat -g to verify the status of your new resource group.

Conclusion

Sun Cluster framework nicely integrates with the Solaris operating system, taking advantage of existing processes and features and thereby reducing the overall learning curve required to manage a cluster system. In addition to good product integration, Sun Cluster provides a nice framework called Generic Data Service, which allows us to quickly deploy new applications within the Sun Cluster framework. Like any deployment, it pays to standardize on naming conventions, IP addressing schemas, and installation paths.

The goal of this article was to provide a snapshot of the Sun Cluster product and a short overview of using the GDS data service to enable your application within the cluster framework.

References

Blueprints for High Availability, 2nd Edition, by Evan Marcus, Hal Stern: http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471430269,descCd-description.html

DBI and DBD: http://search.cpan.org/~timb/DBI-1.50/ and http://search.cpan.org/~timb/DBD-Oracle-1.16/

Designing Enterprise Solutions with Sun Cluster 3.0, by Richard Elling: http://www.phptr.com/articles/article.asp?p=29316&seqNum=8&rl=1

IPMP groups: http://docs.sun.com/app/docs/doc/816-4554/6maoq027u?l=ja&a=view

Planning for the Development of Sun Cluster Agents (excerpted from Sun Cluster 3 Programming: Integrating Applications into the SunPlex Environment, by Joseph Bianco, Peter Lees, and Kevin Rabito): http://www.informit.com/articles/article.asp?p=389711&rl=1

Proc::ProcessTable: http://sourceforge.net/projects/proc-ptable

Solaris Clustering Forum: http://forum.sun.com/forum.jspa?forumID=1

Sun employee blog: http://blogs.sun.com/roller/page/kristien

Sun Cluster homepage: http://www.sun.com/software/cluster

System Administration Software (excerpted from Sun Fire V490 Server Administration Guide): http://www.sun.com/products-n-solutions/hardware/docs/html/817-3951-12/chp5.html

Justin Buhler was the UNIX and Oracle Team Lead for Major Events, Atos Origin. Past projects included the Torino 2006 Winter Olympics, Athens 2004 Summer Olympics, and Salt Lake City 2002 Winter Olympics. You can contact him at: justin.buhler@gmail.com.