Sun Cluster 3.x Quorum Issue
Peter van der Weerd
Clustering software usually consists of a collection
of scripts and binaries that unconfigure an interface, bring down an
application, unmount some file systems, give away a group of disks, and
reverse this procedure on some other machine. This goes for all Unix
cluster solutions. Of course, there are differences at various levels:
vendors use different storage products and device-management software, and
each has its own ideas about establishing and maintaining membership
between the clustered machines.
So, if Unix clustering is so straightforward and
commonplace, why am I writing an article on it? Good question. In this
article, I will not list all pros and cons of all different cluster
products. I will describe one con of one Unix cluster product, elaborate on
it a bit, and come up with a script that could help. Specifically, I will
cover the quorum issue in Sun Microsystems' Sun Cluster 3.x product.
Even though Sun, to my mind, has one of the
most advanced cluster solutions in the field, there is a drawback. This
drawback is the quorum device issue, or to be a little more exact, the
ignored issue of losing the disk that is your quorum device.
Quorum Device
You may already know about the quorum device issue,
but just to make sure, here's a short recap. Unix clusters all have
some sort of heartbeat protocol that uses either a dedicated network or all
available networks to communicate and establish "membership".
Membership means that both nodes (or, with more than two nodes, all of
them) have to be aware of each other at all times. You cannot have a node
that is unable to communicate with its cluster buddies deciding to run
applications on its own.
A key concept in Unix clusters is high availability,
but you do not want your applications to be doubly available. Access and
changes to your data should at all times be controlled by one source, or at least monitored by one source to prevent
two individual instances from changing your data without knowing about each
other. Data integrity is considered more important
than availability.
If, at any time, membership can no longer be
guaranteed, an important decision must be made -- which part of the
"broken" cluster should continue, and which part should
instantly die to avoid corrupted data?
This is where the quorum device comes in. To avoid a
so-called partitioned cluster, both halves race for the quorum device, try
to reserve it, and count the votes they hold. For example, in a two-node
cluster, the number of votes present when membership is intact is three:
one vote for each node, plus one vote for the quorum device. When
membership is gone, one node will succeed in reserving the quorum device
and the other will not. This leaves the slower node with only one vote,
which is less than half of all possible votes. The faster node has two
votes, which is more than half. It's easy: if a node has more than half of
the votes, it will run all applications; if it has only half or even less
than half of the possible votes, it will panic.
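On a live cluster, you can check these counts at any time; the scstat
command (its full output appears later in this article) summarizes them:
node1#/usr/cluster/bin/scstat -q | grep "Quorum votes"
This prints the possible, needed, and present vote counts; "needed" is
simply the smallest number that is more than half of "possible".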
This reservation process differs from product to product.
Since this article deals with the Sun Microsystems solution, I will
concentrate on their method. Sun Cluster quorum disk reservation is a
two-step process that begins with a SCSI reservation. But this
SCSI reservation is only valid and ruling for as long as the node is up and
running. If the "surviving" node were to be reset, the
SCSI reservation would be gone for a short while. At this point, the other
node, being panicked and all, would be sitting there, waiting for the
opportunity to reserve the quorum disk. This is not what Sun
Cluster wants. Sun Cluster says "the last node to leave the cluster
should be the first one to join", meaning that Sun Cluster wants to
prevent a node that does not have the most recent configuration from
starting a cluster on its own. So, a SCSI reservation alone is not enough.
Reservation Key
Each node joining the cluster puts its reservation key
on the quorum disk. Putting reservation keys on disks is supported in
SCSI-3 and is called "Persistent Group Reservation". Once
membership is established, all nodes will have their keys on the quorum
disk. In the case of heartbeat/membership loss, the keys of the nodes that
panic are removed from the quorum disk. If, at any time, a node that
panicked out of the cluster tries to reserve the quorum disk, it will see
that its key is not on the disk and politely inform you that the disk is
owned by someone else. It will then do nothing, waiting until the other
nodes want to establish membership. This is how Sun Cluster prevents older
configurations from starting a cluster.
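If you are curious which keys are currently registered on the quorum disk,
Sun Cluster ships a low-level helper that can list them. The exact path and
options below are from my own notes and may vary per release, so verify
them against your documentation before relying on them:
node1#/usr/cluster/lib/sc/scsi -c inkeys -d /dev/did/rdsk/d4s2
Each registered node shows up as one key; a node that was thrown out of the
cluster will be missing from the list.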
More than Half of the Votes
What if, in a three-node cluster, two nodes die? Would
you want the surviving node to panic? Probably not. But in the scenario of
needing more than half of the possible votes, the node would definitely
die. In a three-node cluster, there are four possible votes: three nodes
and a quorum disk. When two of the three nodes decide to leave, that one
remaining node is left with only two out of four votes, which is not more
than half. The solution: the quorum disk carries a number of votes equal to
the number of nodes minus one. In a three-node cluster, the total number of
votes would then be five. Now, if two nodes fail, the surviving node would
have its own vote plus the two votes of the disk, which amounts to three,
more than half of five. Additionally, it removes the other nodes'
reservation keys from the disk to make sure that it is boss at all times.
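The arithmetic, spelled out as a tiny sh sketch (assuming a single quorum
device that is connected to all N nodes):
#!/bin/sh
# Vote arithmetic for an N-node cluster whose single quorum
# device carries N-1 votes (a sketch, not a Sun Cluster tool).
N=3
POSSIBLE=`expr $N + $N - 1`        # 3 + 2 = 5 possible votes
NEEDED=`expr $POSSIBLE / 2 + 1`    # majority: 5/2 + 1 = 3
echo "possible: $POSSIBLE, needed: $NEEDED"
# A lone survivor holds 1 node vote + (N-1) device votes = 3: enough.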
What If the Quorum Disk Dies?
Now, here we have an issue. Some call this a single
point of failure, but it is not. If the quorum disk fails, nothing happens.
If you set up your storage correctly, your data is mirrored, so losing a
single disk will not affect your data. Where vote counts are concerned, you
have no problem either: losing the quorum disk will not leave your nodes
with half or less than half of the votes. A single point of failure it is
definitely not. So what is the problem? The problem is that there is no
daemon that monitors your quorum disk and creates a new one when the old
one dies. If your quorum disk is dead and you reboot a node, the other node
will not have enough votes left and will panic. This is not what you want.
You want a new quorum device the minute the original one dies, because the
quorum functionality is not something that can be mirrored.
The Solution
Make sure that you do not need a quorum disk. Sun
Cluster has recently decided to let you create heartbeat networks the way
you want to. So you can use every interface on your cluster nodes to
transport heartbeat. It is not very likely that you will lose all network
interfaces at the same time. And if you do lose all network connections,
what's the availability then? No network means no clients.
Unfortunately, Sun Cluster will not let you build a two-node cluster
without a quorum disk at this time.
A Cluster
That was the theory bit. Now, let's have a look
at a real example. Assume we have a two-node cluster with one quorum disk.
But, which disk is the quorum disk? The average customer does not set up a
Sun Cluster; Sun sets it up for them. Imagine then that you are the average
customer and want to know which disk is your quorum disk. Do the following:
node1#/usr/cluster/bin/scstat -q
-- Quorum Summary --
Quorum votes possible: 3
Quorum votes needed: 2
Quorum votes present: 3
-- Quorum Votes by Node --
Node Name Present Possible Status
--------- ------- -------- ------
Node votes: node1 1 1 Online
Node votes: node2 1 1 Online
-- Quorum Votes by Device --
Device Name Present Possible Status
----------- ------- -------- ------
Device votes: /dev/did/rdsk/d4s2 1 1 Online
Don't let the odd name "d4s2" confuse you. This is not a Solaris Volume
Manager device name; it is a so-called "DID" (device ID) name. In a Sun
Cluster environment, all devices have a cluster-wide unique name, so device
d4 is the same device on node1 as it is on node2. It may well be that the
original device file name on node1 is c1t3d0 while it is c2t3d0 on node2.
This way, the controller information on both nodes becomes irrelevant.
It is advisable to have data on your quorum disk. Make
sure your quorum disk is part of a mirror that is used by a clustered
application. Sun Cluster will only detect that a disk is broken when it
cannot be accessed anymore. SCSI errors will not occur if you do not access
the disk. A quorum device that is part of an active mirror in a clustered
environment is a good quorum device. You will see when it is broken because
the scstat command
will tell you that it is offline.
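That makes the check easy to automate. A minimal sketch, relying only on
the scstat output format shown in this article:
/usr/cluster/bin/scstat -q | grep "Device votes" | grep Offline \
        && echo "WARNING: quorum device offline"
Because grep exits 0 only when it finds a match, the warning appears only
when a quorum device is actually offline.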
A quorum device that is offline is a failure, and the
vote count is prone to disaster. One more lost vote (a node that dies) and
your applications will not be available anymore, because the other node
will panic and will not be able to achieve membership for as long as the
dead node remains dead. You will have to create a new quorum disk before
that happens. To do so, you must first configure a new device and then
remove the old one. Obviously, you cannot remove the only quorum device you
have: that might compromise quorum, even though the device is broken. So,
just go with the flow and create a new quorum disk as soon as you see that
the present one is offline.
Broken Quorum Disk
We break the disk...it is broken:
node1#/usr/cluster/bin/scstat -q
(output skipped)
-- Quorum Votes by Device --
Device Name Present Possible Status
----------- ------- -------- ------
Device votes: /dev/did/rdsk/d4s2 0 1 Offline
As you can see, the disk is offline and obviously not
available to function as a tie breaker in the case of heartbeat loss. We
double-check by running the Sun Cluster diskpath monitor
"scdpm" and letting it collect all failed devices:
scdpm -p all | grep Fail
node1:/dev/did/rdsk/d4 Fail
node2:/dev/did/rdsk/d4 Fail
Select New Quorum Disk
Before you can select a new quorum disk, you must
determine which disks are available as quorum disks. The new quorum disk
must be a disk that is accessible by both cluster nodes. It should be a
shared disk:
scdidadm -L
1 node2:/dev/rdsk/c0t2d0 /dev/did/rdsk/d1
2 node2:/dev/rdsk/c0t0d0 /dev/did/rdsk/d2
3 node2:/dev/rdsk/c1t9d0 /dev/did/rdsk/d3
3 node1:/dev/rdsk/c1t9d0 /dev/did/rdsk/d3
4 node2:/dev/rdsk/c1t10d0 /dev/did/rdsk/d4
4 node1:/dev/rdsk/c1t10d0 /dev/did/rdsk/d4
5 node2:/dev/rdsk/c1t11d0 /dev/did/rdsk/d5
5 node1:/dev/rdsk/c1t11d0 /dev/did/rdsk/d5
6 node2:/dev/rdsk/c1t12d0 /dev/did/rdsk/d6
6 node1:/dev/rdsk/c1t12d0 /dev/did/rdsk/d6
7 node2:/dev/rdsk/c1t13d0 /dev/did/rdsk/d7
7 node1:/dev/rdsk/c1t13d0 /dev/did/rdsk/d7
8 node2:/dev/rdsk/c1t14d0 /dev/did/rdsk/d8
8 node1:/dev/rdsk/c1t14d0 /dev/did/rdsk/d8
9 node1:/dev/rdsk/c0t0d0 /dev/did/rdsk/d9
This list shows all device IDs as well as the official
device file names of all disks on both nodes. All drives that appear twice
in the list are shared devices. This means you can pick any of these disks,
as long as the diskpath monitor thinks they are ok.
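If you would rather not eyeball the list, a small sketch can pick the
shared drives out automatically (it assumes the scdidadm -L format shown
above, where the first column is the DID instance number):
scdidadm -L | awk '{print $1}' | sort -n | uniq -d
Every number this prints belongs to a disk that both nodes can see.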
Drive d4 is broken, so we pick the next disk in the list, provided it is
still ok:
scdpm -l all:all
(output skipped)
node1:/dev/did/rdsk/d4 Fail
node1:/dev/did/rdsk/d5 Ok
node1:/dev/did/rdsk/d6 Ok
(output skipped)
Obviously, drive d5 is ok. So, we select that one:
scconf -a -q globaldev=d5
We check whether d5 is now a valid quorum disk:
scstat -q
(output skipped)
-- Quorum Votes by Device --
Device Name Present Possible Status
----------- ------- -------- ------
Device votes: /dev/did/rdsk/d4s2 0 1 Offline
Device votes: /dev/did/rdsk/d5s2 1 1 Online
(output skipped)
Now, we can remove the old quorum device:
scconf -r -q globaldev=d4
At this point, we are back to where we were before
the disk broke. Until this new quorum disk breaks, we can rest assured
that, in the case of heartbeat loss, one node will survive by grabbing the
quorum disk and establishing a vote majority. Unfortunately, we cannot go
around checking the health of the quorum disk all the time. So, it might be
an idea to write a script to do this for us (see Listing 1).
This script checks every 5 minutes whether your quorum
disk is still okay. If it is not, the script promotes the first healthy
shared disk from a list, and it keeps doing so until all disks are gone.
And if all disks are gone, you will very likely have other things to worry
about than your quorum disk.
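For reference, here is a minimal sketch of the idea behind Listing 1. It
leans on the scstat, scdpm, scdidadm, and scconf invocations shown earlier,
and its parsing is tied to those output formats, so treat it as a starting
point and test it on a non-production cluster first:
#!/bin/sh
# Quorum watchdog (a sketch; see Listing 1 for the full script).
# Every 5 minutes: if the quorum device is offline, promote the
# first healthy shared disk and remove the dead device.
PATH=/usr/cluster/bin:/usr/bin:$PATH

while :
do
    # Current quorum device (e.g. "d4") and its status
    set -- `scstat -q | nawk '/Device votes:/ {
        n = split($3, p, "/"); sub("s2$", "", p[n])
        print p[n], $NF; exit }'`
    QUORUM=$1; STATUS=$2
    if [ "$STATUS" = "Offline" ]; then
        # Candidates: DID instances that appear twice (shared disks)
        for CAND in `scdidadm -L | awk '{print $1}' | sort -n | uniq -d`
        do
            [ "d$CAND" = "$QUORUM" ] && continue
            # Skip disks the diskpath monitor reports as failed
            scdpm -p all | nawk -v d="d$CAND" '
                { n = split($1, p, "/") }
                p[n] == d && $2 == "Fail" { bad = 1 }
                END { exit !bad }' && continue
            # Add the new quorum device first, then remove the old one
            scconf -a -q globaldev=d$CAND &&
                scconf -r -q globaldev=$QUORUM && break
        done
    fi
    sleep 300
done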
Conclusion
Since virtually all Unix cluster solutions use a
voting mechanism in combination with quorum disks, there is probably
something about quorum disks that makes them either popular or inevitable.
To avoid disappointments when such a disk fails, it is good to take
precautions. I hope this article will help a bit.
Peter van der Weerd works as a freelance Solaris, HP-UX, and Linux trainer in Europe.