For many businesses, eight minutes is not a meaningful amount of time. Earlier this year, one of our primary database servers was hit by a particularly nasty bug in the Linux kernel, resulting in approximately eight minutes of downtime for our payment gateway. To put this into perspective, Braintree will process over $5 billion this year. With this kind of volume, over $75,000 passes through our gateway in eight minutes. So, while eight minutes may not sound like much, it actually translates to a meaningful amount of money, which is unacceptable for our customers, and unacceptable for us.
In a recent blog post we mentioned that, as of July, we have automated database failover via Pacemaker to reduce downtime in the event of database problems. I wanted to share more detail on how we approach this topic at Braintree and why we think automated failover is the right choice for us.
There are several challenges and risks that come with this automation. When a master database fails, you need to decide which replica to promote, generally based on lag time. You want to decide quickly to shorten downtime, but a suboptimal choice can lead to data loss. There is a risk of flapping, where your clustering software passes the master role around the cluster, possibly causing even more problems. Further, manual interaction with automated systems can be problematic. It is important to have a way to halt the automation so manual work can take place, and a way to restore the automation after the manual work is complete. Possibly the most terrifying risk is that of a split brain scenario leading to data inconsistency or corruption. For us, this means the old master failing to shut down and relinquish its IP. There must always be a way to forcibly shut down a node that won't do as it is told, so the automation's STONITH tools need to be widely used, well tested, and well exercised.
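To illustrate the promotion decision described above, here is a hypothetical sketch of lag-based replica selection. The function names, the byte-based lag measure, and the zero-lag threshold are all illustrative assumptions, not anything from our actual setup; the point is that a conservative policy refuses to promote automatically when promotion would mean losing data.

```python
# Hypothetical sketch of lag-based replica promotion: pick the standby
# with the least replication lag, but refuse to promote automatically
# when even the best candidate is behind (which would risk data loss).

MAX_SAFE_LAG_BYTES = 0  # assume only a fully caught-up standby is safe


def choose_standby(standbys):
    """standbys: dict of standby name -> replication lag in bytes.

    Returns the name of the standby to promote, or None when a human
    should intervene instead.
    """
    if not standbys:
        return None
    name, lag = min(standbys.items(), key=lambda kv: kv[1])
    if lag > MAX_SAFE_LAG_BYTES:
        return None  # freeze and page a human rather than lose data
    return name


# Only a fully caught-up standby is ever chosen:
print(choose_standby({"sync1": 0, "async1": 4096}))   # sync1
print(choose_standby({"async1": 4096, "async2": 8192}))  # None
```

A real cluster manager measures lag from the replication stream rather than taking it as an argument, but the conservative "promote or freeze" shape of the decision is the same.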
We operate on a few principles that we feel greatly mitigate these risks, and help us cover as many failure cases as possible as conservatively as possible.
First, we take advantage of Postgres 9.1's synchronous replication. Each master has a synchronous standby server in the same datacenter, and an asynchronous standby in another datacenter. Note that there is no election or other decision making here. There is only one candidate for promotion to the master role.
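For readers unfamiliar with Postgres 9.1's synchronous replication, the relevant configuration is small. The host and standby names below are illustrative placeholders, not our production values:

```
# postgresql.conf on the master
wal_level = hot_standby
max_wal_senders = 3
synchronous_commit = on
synchronous_standby_names = 'sync_standby'  # must match the standby's application_name

# recovery.conf on the synchronous standby
standby_mode = 'on'
primary_conninfo = 'host=master.example.com port=5432 application_name=sync_standby'
```

With `synchronous_commit = on`, the master does not acknowledge a commit until the named standby has confirmed receipt of the WAL, which is what makes the standby a safe promotion target with no election required.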
Second, we forbid flapping. Failover is a one-way operation only. Pacemaker is allowed to promote a synchronous standby into a master role, but it is never allowed to touch a failed master.
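The "one-way only" rule can be expressed as a tiny transition table. This is a conceptual sketch in Python, not Pacemaker configuration; the role names are made up for illustration:

```python
# Sketch of a one-way failover policy: the automation may promote a
# synchronous standby, but it may never touch a failed master, so the
# master role cannot flap back and forth around the cluster.

ALLOWED_TRANSITIONS = {
    ("sync_standby", "master"),  # promotion: the only automated transition
}


def transition_allowed(current_role, target_role):
    """Return True only for transitions the automation may perform."""
    return (current_role, target_role) in ALLOWED_TRANSITIONS


print(transition_allowed("sync_standby", "master"))   # True
print(transition_allowed("failed_master", "master"))  # False: no flapping
print(transition_allowed("master", "sync_standby"))   # False: one-way only
```

Everything outside the single allowed transition is a job for a human, which is exactly the property that makes the automation easy to reason about.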
Third, we configure Pacemaker to freeze if it gets confused and has no idea what to do. At this point, a human must intervene to sort things out. We use Nagios and PagerDuty to notify the on-call developer of all issues in the Pacemaker cluster so someone is prepared to intervene if Pacemaker has problems failing over.
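A monitoring check for this kind of freeze can follow the standard Nagios plugin convention of exit code 0 for OK and 2 for CRITICAL. The sketch below is hypothetical and deliberately simplified: it string-matches on patterns like the "Failed actions:" section that cluster status output can contain, rather than doing real parsing.

```python
# Hypothetical Nagios-style check: go CRITICAL when cluster status
# output suggests the cluster has failed actions or stopped resources.
# The pattern matching here is a simplification for illustration.

def check_cluster(status_output):
    """Return (exit_code, message) per Nagios plugin conventions:
    0 = OK, 2 = CRITICAL."""
    if "Failed actions:" in status_output or "Stopped" in status_output:
        return 2, "CRITICAL: Pacemaker cluster needs attention"
    return 0, "OK: cluster healthy"


print(check_cluster("Online: [ db1 db2 ]\nms_pg Master db1"))
print(check_cluster("Failed actions:\n  pgsql_monitor on db1"))
```

Wired into Nagios and PagerDuty, a CRITICAL result from a check like this is what gets the on-call developer out of bed when Pacemaker freezes.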
All of this went through extensive testing before going into production, from STONITH-ing nodes to killall -9 postgres. And, now that it is in production, we regularly test and exercise the tools involved.
At the end of the day, we are very happy with the solution we ended up with, and have a thorough understanding of what it does and doesn't do. We've restricted the automation actions to those that are well understood. The point isn't to solve for every case; the point is to solve for the cases that we know how to solve for. Six months ago, I never would have imagined being in favor of any sort of automated database failover, but a lot of reading and extremely thorough testing got me there.