Article

Oracle RAC Primer

Chris Page

If you are using Oracle databases, and you aren't already working with Oracle Real Application Cluster (RAC) databases, it's likely only a matter of time before you are. Oracle RAC databases can be used as a platform to provide scalability and high availability for Oracle databases, and this technology is being widely adopted as recent versions have demonstrated that Oracle delivers on these promises. In this article, I'll explain how a RAC database environment differs from a conventional Oracle database. The focus here is on the architectural features which set it apart, rather than an exhaustive review of the whole database architecture. A familiarity with Oracle databases is helpful but not required.

Review of Terms

I'll begin with an extremely quick review of the terms and architecture of a typical, non-RAC, Oracle database environment as shown in Figure 1. An Oracle "database" is the collection of files. These are the configuration files, data files, and log files needed for recovery. Database objects are typically tables, which are logically organized as rows and columns. Each object has a name, which along with its "schema" name uniquely identifies it. The data for any given object resides in one or more data files on disk. This collection of data files is called a "tablespace". Any given data file is in one and only one tablespace. There are multiple tablespaces in a database. Some are required and reserved for system use, while others store the application data. An Oracle "instance" is the collection of processes, including SMON, PMON, DBWR, LGWR, along with shared memory for caches, etc., called the system global area (SGA), which runs on the system and allows orderly access to the database. An Oracle System

Identifier (SID) is used to identify an instance. For each database, there is one instance. Clients using the database connect to an Oracle "listener" running on the server. In the connection request to the listener, the client specifies the SID it wishes to access. Once connected and authenticated as a user in the database, the client has access to database objects based upon its permissions. That's a lot of information to digest in one paragraph, but it covers the concepts that you need to know about Oracle architecture in order to understand how RAC is different.

RAC Architecture

The problem with the conventional architecture is that the database is on a single server. What happens when the server goes down? What happens when more processing power is needed but the server is fully configured? Oracle RAC provides a highly available and scalable platform for databases. When one server goes down, another one can take over. In fact, multiple servers can be concurrently providing services for the same database. This is how RAC solves the scalability problem. Need more processing power? Add another node to the cluster. What RAC does not do, is provide for disaster recovery. There is still only one copy of the database that could be lost to fire, flood, or Fred. There are other solutions to provide for disaster recovery, and while RAC may be part of a DR solution, it isn't the entire solution.

A RAC database is a single database with multiple instances as shown in Figure 2. Each instance runs on its own server, with the single database on shared disk. From a database user's perspective, the application remains unchanged. Nothing special needs to be done to an application for it to go from a non-RAC to a RAC environment. Behind the scenes, there are many architectural changes that make this possible. From an administrative standpoint, one very helpful feature is the addition of performance views specific to the RAC environment.

For example, while v$session is still valid for the instance that you are connected to, gv$session is also available. It aggregates v$session information from all instances and uses the same structure that you are familiar with, but adds a column identifying the instance from which the row values are derived. While it is true that a migrating application will run, for it to run well, the application architects must understand how RAC works and how to avoid potential bottlenecks. There are many similarities to the introduction of SMP and multithreading. While any application could run, in order for it to scale, you had to identify and remove the bottlenecks. RAC is a great technology to solve your scalability problems, but just as adding CPUs doesn't make a poorly designed application run faster, RAC will not work magic on poorly designed databases.

Scalability

Let's first consider how RAC provides for scalability. I've already mentioned that this is done by adding additional nodes with instances that can concurrently access the database, but what is happening behind the scenes? There are both structural and process changes that permit this. Because there is a single database, the instances must work hand-in-hand to coordinate access and changes.

While there is a single database in RAC, the structural items that are not required to be shared are duplicated to reduce contention. For example, while there is only one set of user data, each instance maintains its own set of log files for recovery purposes. Although all of the logs are shared and available to all the instances, only one of the instances is accessing a set of logs at a time. Each instance also has its own "undo" tablespace, which is a tablespace used by the system for recovery and to provide data consistency. In the event of a node failure, another instance will access these files to provide whatever recovery services might be required while continuing to provide database services. Likewise, even archived redo logs should be available to all instances because they would be required for media recovery. If you lose a node and a data file, you'd be out of luck if you needed archive logs that were stored only on that node.

In Oracle 10g RAC, a "service" is most easily described as a collection of instances. While historically, a client would specify an instance to connect to by specifying a SID, in a RAC environment, the connecting client instead specifies a service name. Oracle RAC can then perform server-side load balancing to connect the client to one of the instances associated with that service. Multiple service names can be defined for a given database.

One can easily imagine a mixed workload environment for a database consisting of both OLTP and DSS types of transactions. In such an environment, two services might exist, MYDB_OLTP and MYDB_DSS. Clients would connect to one or the other depending upon their workload type. It might be that the RAC cluster is made up of two types of servers. Some might have large memory footprints and be best suited for DSS, while those with less memory might be suitable only for OLTP workloads. In such a case, the MYDB_DSS service definition might consist of only those instances on the larger servers, while the MYDB_OLTP service definition might include only those on the smaller servers.

The concept of services offers flexibility from a resource management standpoint. There can be an overlap of instances in service definitions. It might be that the MYDB_OLTP service definition maps to all instances as shown in Figure 3. When it comes to service definitions, for each service name, each instance in a cluster is marked as "unavailable", meaning the instance cannot be used, "preferred", meaning that it will be used for the service, or "available", which means that the service should not use the instance, but in the event of a failure, the instance will be available for workload associated with that service.

A service is simply the list of instances that can be used when a client connects and specifies a given service name. It can be very useful for managing workload. If you need to do maintenance on a node, for example, you can change the service definitions to free it up. If a particular client base needs additional resources, you can redefine the service to include additional or higher powered instances.

Multiple RAC databases can run on the same RAC cluster. In a typical environment, two business functions might have dedicated hardware that can meet their peak resource needs, but needs often vary depending upon the calendar cycle. For example, a financial application that quickly needs to close the books might be on a different cycle than engineering. Using services with RAC, a single cluster can provide resources to be shared between both groups and provide redundancy (see Figure 4). In this example, each database has two dedicated instances for processing. The instances on node 5 are available in the event of node failure, but because the service is available and not preferred, no clients will connect until one of the preferred instances is no longer available for use.

Cache Fusion

Oracle's "Cache Fusion" is an architectural component that you'll only find in the RAC environment. There exists only one copy of any given block of the database on disk. This same block might be cached in memory at the same time in multiple instances, and Cache Fusion provides the cache coherency required between the caches of multiple instances. In a single instance architecture, having a good cache hit rate is often desired for good performance, because the cost of going to disk in terms of access time is much greater than for a memory access. In the case of multiple instances, if a given data block is needed, and not in local cache, Oracle will know if the block is resident in the cache of another instance. If it is, then Cache Fusion will transfer that block via network rather than having the instance read the block from disk. This is potentially a significant performance improvement. With the addition of each node, the total size of memory available across the cluster to perform caching increases.

This feature that permits applications to effectively scale across multiple nodes can also be the reason that a database application might perform worse in a RAC environment. Imagine that an application has an implementation that produces contention for a particular block. Such an application would not scale well as additional users and CPUs are added in a single instance Oracle database environment, because contention would increase.

In a RAC environment, there would be multiple instances contending for this single block. This single block would be transferred from instance to instance whenever it was needed. While it may be faster than disk access, this block transfer over the network is likely slower than a direct memory access. At least in a single instance database, that single hot block would be in local memory. In a RAC environment, it is bouncing between nodes. This is just one example of how a poorly designed application's bottleneck becomes amplified in a RAC environment. Properly designed database applications are critical if scalability across nodes is desired.

Two other areas must be addressed from an architectural perspective for best scalability. These are the interconnect network used by Cache Fusion and the shared storage system. With the addition of each instance presumably comes increased load on these two components. The data on the shared storage system can be managed via Oracle ASM, a cluster file system (including NFS), or raw devices. The performance and management of each of these options must also be considered when thinking about scalability.

High Availability

Let's consider high availability. In a conventional, clustered, single instance database, a client connects to an Oracle listener that is listening on a virtual IP (VIP) address. In the event of node failure, the VIP is relocated, and the disks are mounted on another node. The database listener is started, the instance starts up, and it performs a recovery. When client applications detect the loss of connection, the application reconnects to the database and begins anew. How long this takes depends upon how long it takes to detect the failure at the client, how long it takes to import the disks, and how long it takes to perform a recovery on the database. If the node were already running another application, the load on the node at that time would also impact the amount of time all this takes.
Let's now have a look at how the Oracle RAC platform provides for high availability. RAC depends upon clustering software to manage node membership in the cluster. Oracle provides their own clustering software, but it can leverage third-party clustering solutions as well. As discussed, in a RAC cluster, each instance can be actively accessing the same database. When one of these instances goes down, clients connected to other nodes in the cluster are for the most part unaffected and there is no application outage. Those clients that were connected to the failed instance will failover to other instances in the service to which they were connected. The impact that this failover has on the client application depends upon a number of factors that we will soon identify.
Oracle client applications that connect to a database start by connecting to an Oracle listener. To get to the database, the client needs a host address, port number, and service name. Whether thick or thin, it is a best practice for Oracle clients to perform their own load balancing across available nodes when establishing connections. This is done so that no one listener becomes overloaded in the cluster.
Oracle provides server-side load balancing as well. Once a connection is made to any given listener, the client might get redirected to a different instance via a redirect, but this would happen behind the scenes. Client-side load balancing is done either via application coding or via specifying load balancing in an Oracle TNS Names description. This load balancing comes into play at connection time only.
Like a conventional cluster, Oracle RAC leverages VIP technology. Each node has its own VIP and each node's Oracle listener listens for incoming connections using this VIP. In the event of a node failure, the VIP is relocated to a different node. This is done so that clients trying to connect to the listener on that node will very quickly get an error and be able to connect to one of the other listeners. The client would otherwise be subject to a TCP timeout, and depending upon the network configuration, it might be a while given that the default timeouts can be minutes to hours.
From an application client's perspective, what happens when the node it is connected to goes away is configurable. In the simplest and worst case, the client connection that is lost times out. The application gets an error message and either handles the error itself or presents the error back to the user, who will have lost the connection and any transactions in process. However, the user can otherwise reconnect to one of the other available instances, which are still available, and continue working. It looks like a network "blip".
Fast Connection Failover
Clients configured for Fast Connection Failover (FCF) will follow a path very similar to the worst case scenario described, but it will all happen much faster. In an FCF configuration, the Oracle driver used by the application must be configured as a consumer of Fast Application Notification (FAN) events. FAN events are generated by a RAC cluster. It is an event framework for communicating load advisories and cluster change events. When a node goes down, the FAN event notifies all subscribers of the event. A client subscriber connected to the failed node gives up on the connection right away rather than waiting for the TCP timeout. This is potentially a significant time savings. A loss of connection still occurs, but the client can more quickly respond.
Transparent Application Failover
Transparent Application Failover (TAF) can be achieved in a RAC environment with the right mixture of configuration and coding. TAF is a function of the Oracle Call Interface (OCI), so Oracle network client libraries need to be installed on the client. While TAF can be used even in non-RAC environments under certain configurations, it might be simplest to implement TAF with RAC databases. With TAF, when a connection is lost, the driver traps that error, and reestablishes a connection on behalf of the unsuspecting client application.
Before getting into the details of what needs to be in place for TAF to happen smoothly under most conditions, let's examine the mechanics. There are two methods, BASIC and PRECONNECT, one of which must be specified in the TAF configuration. Using the PRECONNECT method, the client establishes a connection to two different instances at the time of initial connection. Only one connection is used for processing, but even the idle connection consumes resources in the backup instance simply by being in existence. With the BASIC method, the failover connection isn't attempted until the time of failover.
There are three types of TAF, which are NONE, SESSION, and SELECT. Using a type of NONE disables TAF functionality. The other two types specify how TAF should handle a disconnect when the client is in the middle of executing a read-only query (e.g., a SQL SELECT statement). Under these circumstances, with TAF type SELECT specified, the driver will reissue the query on the client's behalf and return only those rows that it had not yet returned. Aside from the delay associated with re-execution, the client need not know the difference.
If a failure occurs while the client application is modifying data, the behavior associated with these two types is the same. From an application's perspective, the connection will remain, but the statement being executed will fail and an error message will be returned. A robust application will be prepared to catch this exception and reissue the command. This type of application behavior is not specific to TAF; as a robust application will already be designed to deal with such types of errors that would arise during dead-lock. Based upon my experience, most applications are not designed so robustly.
With TAF, the application will be reconnected transparently, but session state will be lost. TAF provides callbacks so that the application can reestablish session state. If there is no session state to restore, and if the failure occurs while the session is idle, then the whole loss of connection is essentially a non-event as far as both the application and end user are concerned. To get the most out of TAF, clients require current OCI drivers and can be configured at the client or set by policy at the database server.
Conclusion
Oracle RAC databases provides a sophisticated platform for achieving scalability and high availability for databases. To achieve the promise of RAC, databases need to be designed for concurrency, and applications need to be coded for robust operation.
Chris Page is a consultant with Corporate Technologies, Inc. He has an interest in high availability and scalability in general and Oracle databases in particular. He has a BSEE and an MSCS. You can contact him at cpage@cptech.com.