Article

apr2007.tar

Oracle RAC Isn't Just for Databases

Chris Page

Depending upon how your day is going, you might think of Oracle Real Application Clusters (RAC) as being as sweet as a Tootsie Pop or as a technology that leaves you in tears like a freshly peeled onion. Regardless of your view, what you'll find at the center of Oracle's RAC database technology is Oracle's very own clustering software. However, you might be surprised to learn that this technology can be used, with Oracle's blessing no less, as general purpose clusterware -- cluster software to provide failover capabilities for your third-party applications. You don't even need to be running their database. In this article, I will review Oracle's clusterware technology at a high level and present a simple configuration for demonstration purposes.

There are different cluster technologies for delivering different types of clusters. Oracle's RAC database technology is known for providing scalability through its active-active database cluster technology. While some clusters help scale throughput by providing parallelism, Oracle's clusterware provides a failover technology. How Oracle databases scale using this technology in an active-active manner is a topic for a different article, but what is important here is that this same technology can support an application in an active-passive manner.

Not surprisingly, the hardware requirements are not all that different from other popular cluster technologies. You will need the software, which you can freely download from Oracle Technology Network (OTN). You will need to license at least one of the nodes in the cluster for Oracle database standard edition or better. This is the technology used by RAC, and although there must be some real limit to the number of nodes that you can have in a cluster, there are RAC clusters out there that are 64+ nodes. This software can support rather wide clusters of nodes by today's standards. Again, note that only one of the nodes in the cluster needs to be licensed. So, there is potentially a compelling financial proposition here.

We're all waiting for the nirvana of clusters with multiple operating systems and hardware. That's not here yet. All nodes in this cluster must be running the same OS and have the same architecture.

Requirements

So, what do you need to get started? You need at least a couple of nodes. They need to have a network interconnect that is on the same subnet and have shared disk. The interconnect is the mechanism for heartbeat exchange, so you need not worry too much about latency or bandwidth for a basic cluster implementation. If you will be using that network interface for anything more, then your needs might change significantly. For instance, if you were running Oracle databases on the cluster, then that interconnect would also be used for maintaining cache coherency so latency and bandwidth could become a more important issue. In any case, a simple crossover cable will not do. The reputed reason for this is that each node factors link status into its health check. In the case of a crossover cable, if a node is powered off, the other presumably good node will see the link go down and, from this information, conclude that it is no longer a member of the cluster and take itself offline.

If you are concerned with the availability of your application, you need to be concerned with the overall availability of your cluster. Depending upon the width of your cluster, you might consider failure of an individual node to be an acceptable event, because you have additional nodes to pick up the load.

In some cases, such as in a two-node cluster with each node running its own application near capacity, the failure of an individual node, while not resulting in an outage, might not meet other performance service levels. If, for whatever reason, having an additional node available to pick up the workload is not an option, it might make sense to build-in redundancy within the server itself. Enabling multi-pathing with redundant HBAs is one way you could introduce redundancy. As needs vary, so do cluster node configurations. Two potential points of failure that you should consider for redundancy are your network in general and your interconnect network specifically.

If your public network goes down, there are likely more issues to address than simply the application that you have running on your cluster. You probably already have some level of redundancy on your public network for this reason. For the interconnect network, you might be tempted to implement the interconnect via an inexpensive switch. I've seen many nodes racked on top of each another with a single cheap flickering switch sitting on top providing intra-cluster communication. Remember that if that switch goes down, so does your cluster.

Consider using the same level of protection for your interconnect as for your public network. Or, at least implement some redundancy by having two of those cheap switches. You'll need two interfaces on each node, but you'll be protected when one of those switches fails or someone kicks a cord. Oracle clusterware allows you to define multiple subnets for interconnect communication. Check out the oifcfg command if you want to learn more. One thing this doesn't do is allow you to use the public interface as a backup cluster interconnect interface.

Implementation

There is some overhead associated with implementing a cluster and with Oracle cluster technology; one way that this manifests itself is in the requirement for shared voting and registry storage devices. Oracle gives you the option to provide external redundancy (RAID) for these devices, or if you can't do that, it will require additional devices onto which it will provide a safeguard by replicating the changes itself. These are critically important to the health of the cluster and must be backed up with data consistency in mind. Oracle provides a set of tools, including ocrconfig and ocrdump, for these tasks.

Oracle allows multiple versions of its software to be installed on a system by storing each version of software in a separate directory. Each installation directory is referred to as an Oracle Home. The Oracle clusterware software has its own Oracle home and, by convention, this directory name is stored in the environment variable ORA_CRS_HOME.

Once you have Oracle clusterware running, you have nothing more than a framework, unless you can derive some value from having a system brought down if it loses connectivity with the cluster. The real value of the setup is derived when you provide the hooks into your application. Sometimes it's easiest to explain a concept with an example, so I'll provide one here. For this example, I'll set up failover cluster services for a reminder daemon, "noodge", which can run on either node, x4200-a or x4200-b, of the cluster.

Oracle clustering software needs to know how to do three things with any application that it is managing. It needs to know how to start, stop, and check the health of the application. The health check provides the information that the framework needs in order to decide whether a restart or failover is needed. The start and stop commands are the mechanism for controlling the application. We provide this information to the framework by providing an executable or a script, called the action script. This action script is required to take a single command-line argument of "start", "stop", or "check" and return an exit code representing either the success (0) or failure (1) of that operation. The operation must complete within a configurable, but predefined, period of time or the status will be unknown.

Whether your cluster successfully manages your application is related in part to the robustness and completeness of the script that you supply. To be complete, it must address all edge cases. It must test for any conceivable failure that you want to protect against, and if your needs are like mine, that means all possible failure conditions. The simplest script will get you going, and the "perfect" script is probably never attainable, so a very good script is a reasonable goal.

When you purchase a clustering solution, most vendors will offer scripts to you either as part of their solution or for an additional charge. Oracle clusterware is ready out of the box for Oracle RAC databases, but that is about it. A script for managing user VIP interfaces is a notable exception, and we will cover that in detail. In regard to a script that is well suited for your application, however, you are likely on your own. Your best bet is to leverage the work of others who have made their scripts for Oracle clusterware or some other framework freely available. That work is likely easily ported.

Operation

Now let's see the clusterware in operation. To begin, we need to tell the framework about the application that we want it to monitor. To do this, we will create a profile for the application and register it with the cluster framework. Next, we'll create the resource profile. This profile tells the clusterware about the application. In the profile, we provide the location of the script that the clusterware needs to use to start, stop, and check it. We also provide a name for the resource. It is a best practice to preface your profile names with a string prefix that identifies the profile as one of your own. As a point of reference, all Oracle resources are prefaced with the "ora." string.

The resource profile can either be created by supplying command-line options or by providing those values in a file. In any case, the creation of a profile is synonymous with creating a file with the name value pairs that are expected by the framework. We'll create the file and specify the values there so that we can easily recall the parameters specified. With this method, a configuration management tool could very easily be introduced when needed. Oracle will create a template file that can be used by executing the following:

$ crs_profile -template -t application -O me.remind.cap

The created file "me.remind.cap" contains a list of name value pairs. Many of the pairs reflect the Oracle database heritage and aren't needed for our application and can safely be ignored. Edit the me.remind.cap file and specify the following configuration:

NAME=me.remind
ACTION_SCRIPT=/export/home/oracle/noodge/noodge.scr
DESCRIPTION=My Reminder Daemon

These are just a few of the parameters that we can specify. There are many more parameters of interest, and they have reasonable default values. I'll examine some of the others, but if you are serious about using Oracle's clusterware, your best bet is to refer to the documentation. Once the profile file is ready, as the root user, we copy that file to the $ORA_CRS_HOME/crs/profile directory. This is the location where the clusterware utilities expect to find all user defined profiles that will be registered by the root user. Profiles registered by non-root users are located in the $ORA_CRS_HOME/crs/public directory. We then register the profile with the cluster registry by executing the following command:

# $ORA_CRS_HOME/bin/crs_register me.remind

Now that the application profile is created and registered, we can check on the status of the application as seen from the clusterware's perspective.

# $ORA_CRS_HOME/bin/crs_stat me.remind -v
NAME=me.remind
TYPE=application
RESTART_ATTEMPTS=1
RESTART_COUNT=0
FAILURE_THRESHOLD=0
FAILURE_COUNT=0
TARGET=OFFLINE
STATE=OFFLINE

Before bringing the application online, we must make sure that we haven't lied about the existence of the application. We must distribute the application and action script to each node in the cluster so that each node can run the application and have its own copy of the action script. Once ready, we can start up the application via the application framework:

# $ORA_CRS_HOME/bin/crs_start me.remind
Attempting to start `me.remind` on member `x4200-b`
Start of `me.remind` on member `x4200-b` succeeded.

The command reports that the command succeeded. We can verify this by checking the status directly:

# $ORA_CRS_HOME/bin/crs_stat me.remind -v
NAME=me.remind
TYPE=application
RESTART_ATTEMPTS=1
RESTART_COUNT=0
FAILURE_THRESHOLD=0
FAILURE_COUNT=0
TARGET=ONLINE
STATE=ONLINE on x4200-b

The -v command-line option puts the status command in verbose mode. Here it reports the name of our resource, that the TARGET state is currently online, and that the resource is online. So, all is as expected, and we haven't caught the application in transition.

Let's have a look at the status. The reported RESTART_ATTEMPTS shows how many times the clusterware will attempt to restart the application on the same node. The RESTART_COUNT is the number of times that it has actually restarted the application. A subsequent failure on the same node will result in the restart attempt being performed on another node because the value RESTART_ATTEMPTS would otherwise have been exceeded.

The FAILURE_THRESHOLD and FAILURE_COUNT parameters are related to an optional feature. With FAILURE_THRESHOLD greater than zero, oracle will continue to try to get the application running either via restart attempts or failovers until the FAILURE_THRESHOLD is hit. At that point, the cluster discontinues restart attempts. A FAILURE_THRESOLD of 0 indicates that this feature is disabled.

We can stop the application using the crs_stop command:

# $ORA_CRS_HOME/bin/crs_stop me.remind
Attempting to stop `me.remind` on member `x4200-b`
Stop of `me.remind` on member `x4200-b` succeeded.

We can then verify that the application is offline by checking its status using the short version of the crs_stat command:

# $ORA_CRS_HOME/bin/crs_stat me.remind
NAME=me.remind
TYPE=application
TARGET=OFFLINE
STATE=OFFLINE

If we switch back to the oracle user and attempt to start the application, we encounter an error:

$ $CRS_HOME/bin/crs_start me.remind
CRS-0254:  authorization failure

The application permissions stored in the registry do not permit the oracle user to run this resource because we registered the resource as the root user. As root user, we can use the crs_getperm command to see this and change the permissions so that the application runs as the oracle user.

# $ORA_CRS_HOME/bin/crs_getperm me.remind
Name: me.remind
owner:root:rwx,pgrp:root:r-x,other::r--,

As expected, the root user has ownership. Let's change the ownership to the oracle user and group:

# $ORA_CRS_HOME/bin/crs_setperm me.remind -o oracle
# $ORA_CRS_HOME/bin/crs_setperm me.remind -g dba

Now let's verify the change:

# $ORA_CRS_HOME/bin/crs_getperm me.remind
Name: me.remind
owner:oracle:rwx,pgrp:dba:r-x,other::r--,
#

Examples

With the application running, if we "kill" the process and query the status, we will probably see that the process is missing but that crs_stat reports the application is ONLINE. If we refer to our application profile, we can guess that this is because the default CHECK_INTERVAL is 60 seconds. The cluster will in time detect and restart the application. So, if we wait a bit, we see that the process once again is running and a check of crs_stat indicates the restart via a ticked RESTART_COUNT:

$ $CRS_HOME/bin/crs_stat  me.remind -v
NAME=me.remind
TYPE=application
RESTART_ATTEMPTS=1
RESTART_COUNT=1
FAILURE_THRESHOLD=0
FAILURE_COUNT=0
TARGET=ONLINE
STATE=ONLINE on x4200-b

Note that RESTART_COUNT is now "1" indicating that the application has been restarted.

Let's say that the 60 seconds is too long to wait. We can change the profile to check every 10 seconds by executing the following:

$ $ORA_CRS_HOME/bin/crs_profile -update me.remind -o ci=10
CRS-0181: Cannot access the resource profile \

  '/u01/app/oracle/product/10.2.0/crs/crs/public/me.remind.cap'.

Oops. While we changed the ownership of the resource when it runs, it is still considered a privileged profile because we created it as the root user, and it is in the profile directory not the public directory. If we run the same command as the root user, the profile will be updated, but the change won't be reflected until we update the registry by using the -u option:

# $ORA_CRS_HOME/bin/crs_register -u me.remind

We can verify that the check interval has been updated to 10 seconds by running the following:

# $ORA_CRS_HOME/bin/crs_stat -p me.remind | grep CHECK_INTERVAL
CHECK_INTERVAL=10

Now that the check interval is updated, we can test it. Let's kill the process one more time and check its status:

$ pkill noodge
$ $ORA_CRS_HOME/bin/crs_stat -v me.remind
NAME=me.remind
TYPE=application
RESTART_ATTEMPTS=1
RESTART_COUNT=1
FAILURE_THRESHOLD=0
FAILURE_COUNT=0
TARGET=ONLINE
STATE=ONLINE on x4200-b

$ $ORA_CRS_HOME/bin/crs_stat -v me.remind
NAME=me.remind
TYPE=application
RESTART_ATTEMPTS=1
RESTART_COUNT=0
FAILURE_THRESHOLD=0
FAILURE_COUNT=0
TARGET=ONLINE
STATE=ONLINE on x4200-a

What we see here is that on first query, the application's failure had not yet been detected and reflected in the status. The next status query only indirectly reflects the failure by reporting that the application is "ONLINE on x4200-a". The application failed over to the other node. Note that on failover, the RESTART_COUNT gets reset to zero. In theory, a faulty application could bounce between multiple nodes indefinitely. This is where the FAILOVER_THRESHOLD, discussed earlier, becomes a feature that you should consider implementing.

We can verify that the cluster software detected the failure and moved the service to the other by examining the cluster registry daemon (crsd) log file on the node where the application was running (x4200-b). This logfile is $ORA_CRS_HOME/log/x4200-b/crsd/crsd.log, and the contents of interest are:

2007-01-07 19:37:10.686: [  CRSAPP][8557] CheckResource \
  error for me.remind error code = 1
2007-01-07 19:37:10.690: [  CRSRES][8557] In stateChanged, \
  me.remind target is ONLINE
2007-01-07 19:37:10.690: [  CRSRES][8557] me.remind on x4200-b \
  went OFFLINE unexpectedly
2007-01-07 19:37:10.690: [  CRSRES][8557] StopResource: setting \
  CLI values
2007-01-07 19:37:10.700: [  CRSRES][8557] Attempting to stop \
  `me.remind` on member `x4200-b`
2007-01-07 19:37:11.823: [  CRSRES][8557] Stop of `me.remind` \
  on member `x4200-b` succeeded.
2007-01-07 19:37:11.824: [  CRSRES][8557] me.remind \
  RESTART_COUNT=1 RESTART_ATTEMPTS=1
2007-01-07 19:37:11.824: [  CRSRES][8557] me.remind Uptime \
  does not exceed uptime_threshold
2007-01-07 19:37:11.824: [  CRSRES][8557] me.remind ran out of \
  restarts on x4200-b
2007-01-07 19:37:11.837: [  CRSRES][8557] me.remind failed on \
  x4200-b relocating.
2007-01-07 19:37:12.162: [  CRSRES][8557] Attempting to start \
  `me.remind` on member `x4200-a`
2007-01-07 19:37:13.281: [  CRSRES][8557] Start of `me.remind` \
  on member `x4200-a` succeeded.

Sure enough, we can see that all is as expected, and the log file does a good job of explaining the process each step of the way.

If the failure of process on x4200-b was the result of a hardware failure that has now been corrected and we want to move processing back to that node, we can do so by relocating the application:

$ $ORA_CRS_HOME/bin/crs_relocate  me.remind -c x4200-b
Attempting to stop `me.remind` on member `x4200-a`
Stop of `me.remind` on member `x4200-a` succeeded.
Attempting to start `me.remind` on member `x4200-b`
Start of `me.remind` on member `x4200-b` succeeded.

We can verify that the application has relocated using the familiar crs_stat command, but this time with the -t option so that we can view the information in table form:

 $ $ORA_CRS_HOME/bin/crs_stat -t -v  me.remind
Name         Type         R/RA  F/FT   Target   State    Host
----------------------------------------------------------------
me.remind    application  0/1   0/0    ONLINE   ONLINE   x4200-b

There is one more key feature that Oracle clusterware provides and that is the ability to create dependencies between resources. In this example, we considered a simple application, but we didn't discuss how clusterware might help us with a server application that provides network services. The Oracle framework makes this very easy by providing an action script that we can use to set up and tear down virtual IP (VIP) addresses that our application might need.

To do this, we first create the resource by using Oracle's provided action script, providing a resource name, the IP address for the VIP, the interface name, and network mask:

$ $ORA_CRS_HOME/bin/crs_profile -create me.remind.vip \
      -t application \
      -a $ORA_CRS_HOME/bin/usrvip \
      -o oi=e1000g0,ov=10.15.137.201,on=255.255.0.0

We then register the VIP resource:

$ $ORA_CRS_HOME/bin/crs_register me.remind.vip

As root user, we then set the permissions that the resource must run as root, but that oracle user may execute it:

# $ORA_CRS_HOME/bin/crs_setperm me.remind.vip -o root
# $ORA_CRS_HOME/bin/crs_setperm me.remind.vip -u user:oracle:r-x

As the oracle user, we can then start the resource:

$ $ORA_CRS_HOME/bin/crs_start me.remind.vip
Attempting to start `me.remind.vip` on member `x4200-b`
Start of `me.remind.vip` on member `x4200-b` succeeded.

The final step is to establish the dependency on this VIP by the application. The dependency means that the Oracle cluster will make sure that the VIP resource is running on the node before starting the application. Making this change is as simple as updating the "me.remind" profile and setting "REQUIRED_RESOURCES=me.remind.vip" and updating the registry:

# $ORA_CRS_HOME/bin/crs_register -u me.remind

Now when we start the application, the VIP is seen as a prerequisite and started for us:

$ $CRS_HOME/bin/crs_start me.remind
Attempting to start `me.remind.vip` on member `x4200-b`
Start of `me.remind.vip` on member `x4200-b` succeeded.
Attempting to start `me.remind` on member `x4200-b`
Start of `me.remind` on member `x4200-b` succeeded.

This concludes our tour of Oracle's clusterware. There are other solutions available that can draw some great pictures for you. This platform doesn't offer pictures, but it does offer an easy-to-use framework that can be leveraged to provide failover capabilities across multiple nodes.

Chris Page is a consultant with Corporate Technologies, Inc. He has an interest in high availability and scalability in general and Oracle databases in particular. He has a BSEE and an MSCS. You can contact him at cpage@cptech.com.