HADR and Heartbeat Timing Settings

One of the hardest things to figure out was the correct timing settings. For most folks, setting these values is just a matter of preference (i.e., how soon you want things to fail over). But timing becomes critical when dealing with a resource like a HADR database, which takes some time to complete a failover. If you don't configure these numbers correctly, you can end up with a "split-brain" scenario, in which both the HADR primary and the HADR standby believe they are the primary for the database. They are then "independent," with neither one knowing what the other is doing and neither shipping logs to the other. After a lot of digging and some judicious translation of German texts, we found the following suggestions for avoiding split brain:

Initially, set deadtime to 60 seconds or more (we used 120 when we started) and set warntime to between 1/4 and 1/2 of the deadtime value.
Once things are running, recheck these figures by looking through the logs for any "late heartbeat" warnings. Then, set deadtime to 1 1/2 to 2 times the longest interval that went by without a heartbeat, and set warntime to twice the keepalive time.
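
The arithmetic behind these two suggestions can be sketched in a few lines of Python. This is only an illustration of the rules of thumb above, not a tool that ships with Heartbeat; the function names and the 10-second gap in the usage note are hypothetical, and all values are assumed to be in seconds.

```python
def initial_settings(deadtime=120.0):
    """Starting point: warntime between 1/4 and 1/2 of deadtime.

    Returns (deadtime, lowest warntime, highest warntime) in seconds.
    The default of 120 is the starting value mentioned in the text.
    """
    if deadtime < 60:
        raise ValueError("the suggestion is to start with a deadtime of 60 or more")
    return deadtime, deadtime / 4, deadtime / 2


def tuned_settings(longest_gap, keepalive):
    """Tuned values once the logs have been checked.

    longest_gap: longest interval seen without a heartbeat (hypothetical
                 input; you would read it from the "late heartbeat"
                 warnings yourself).
    keepalive:   the configured keepalive interval.

    Returns (lowest deadtime, highest deadtime, warntime) in seconds.
    """
    if longest_gap <= 0 or keepalive <= 0:
        raise ValueError("intervals must be positive")
    return 1.5 * longest_gap, 2.0 * longest_gap, 2 * keepalive
```

For example, with a hypothetical longest gap of 10 seconds and a keepalive of 1, tuned_settings(10, 1) gives a deadtime somewhere between 15 and 20 and a warntime of 2. Treat the result as a starting range to test against your own failover times, not as a guarantee against split brain.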

On our tuned systems at the lab, we use the following values, because the systems sit on the same network segment and are fairly "unloaded":

keepalive: 1
deadtime: 20
warntime: 5
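
In Heartbeat's configuration file, these directives are written without the colons, and bare numbers are read as seconds. A sketch of how the lab values above might appear, assuming the usual file location of /etc/ha.d/ha.cf and leaving out every other directive a working file needs:

```
# Timing directives only; values are in seconds
keepalive 1
deadtime 20
warntime 5
```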