Data Availability Basics

Database high availability is all about managing the risk of database failures.

When you talk of high availability, you hear all kinds of terms and acronyms: SPOF, MTTR, MTTD, MTTF and “9s”.

I cut through all that and provide 3 basic principles for creating great database high availability!

Principle 1: Avoid risk by removing SPOFs

I mentioned that creating high availability for your databases is about managing the risk of failures.

One side of risk management is risk avoidance.

Risk avoidance is like choosing not to let your kid go skiing because he might break a leg.

While you could prevent applications from using the database, it’s not realistic.

So in the case of the database, you avoid risk by removing the database server as a single point of failure and creating a true database layer.

This database layer has multiple servers with the same data. It also has methods to move connections to a new primary database server when failure happens.
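To make the idea of moving connections to a new primary a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `Server` class, the `choose_primary` function and the health check are names I made up for illustration, and real failover tooling also has to deal with replication lag, fencing and split-brain, which this ignores.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Server:
    """Hypothetical description of one server in the database layer."""
    name: str
    is_primary: bool = False


def choose_primary(
    servers: Sequence[Server],
    is_healthy: Callable[[Server], bool],
) -> Optional[Server]:
    """Return the server that connections should point at.

    Keeps the current primary if it is healthy, otherwise falls back to
    the first healthy replica. Returns None when nothing is healthy.
    """
    for server in servers:
        if server.is_primary and is_healthy(server):
            return server
    for server in servers:
        if not server.is_primary and is_healthy(server):
            return server
    return None


if __name__ == "__main__":
    layer = [Server("db1", is_primary=True), Server("db2"), Server("db3")]
    down = {"db1"}  # pretend the primary just failed
    target = choose_primary(layer, lambda s: s.name not in down)
    print(target.name if target else "no healthy server")
```

With a single server in the list, a failure leaves nowhere to go, which is exactly what a single point of failure means.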

It’s important to note that implementing high availability cannot stop at the database layer.

And so your business must implement high availability across all layers of the stack for its most mission critical systems.

Principle 2: There’s no such thing as always available

There are diminishing returns with investing in high availability.

That brings us to the second principle: You can’t spend your way to 100% availability.

And this is why you see high availability Service Level Agreements (SLAs) in terms of 9s.

When you understand this, you start to invest in risk mitigation, the second side to Risk Management.

Risk mitigation is accepting the reality that failures will happen.

And you invest in ways to reduce the impact of those failures.

Setting appropriate expectations on availability Service Level Objectives (SLOs) is important here.

As an example, setting the expectation that your business should meet 99.999% availability, or about 26 seconds of downtime in a 30-day month, is accepting reality.
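If you want to check the arithmetic behind the 9s yourself, here is a small Python sketch. The function name and the 30-day month are my own assumptions; pick whatever period your SLO is actually measured over.

```python
def downtime_budget_seconds(availability_percent: float, period_days: float = 30.0) -> float:
    """Seconds of downtime allowed in the period at the given availability.

    Assumes a 30-day month by default; pass another period_days for a
    different measurement window.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    for nines in (99.9, 99.99, 99.999):
        budget = downtime_budget_seconds(nines)
        print(f"{nines}% availability allows {budget:.1f} seconds of downtime")
```

Over a 30-day month that works out to roughly 43 minutes at 99.9%, a little over 4 minutes at 99.99%, and about 26 seconds at 99.999%. Each extra 9 cuts the budget by a factor of ten.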

Another area of goals to establish is around the time to detect failures (MTTD) and the time to recover (MTTR) from failures.

These goals help steer your team in the direction of creating appropriate processes and procedures for risk mitigation.
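One rough way to sanity-check these goals against an availability target is to see how many failures the downtime budget can absorb. The sketch below assumes each failure costs the time to detect plus the time to recover, with the two counted separately. That is my simplification, and the function name and example numbers are made up for illustration.

```python
def incidents_within_budget(
    budget_seconds: float,
    mttd_seconds: float,
    mttr_seconds: float,
) -> int:
    """How many failures fit in the downtime budget.

    Assumes every failure costs mttd_seconds + mttr_seconds of downtime,
    with detection and recovery counted as separate, back-to-back phases.
    """
    per_incident = mttd_seconds + mttr_seconds
    if per_incident <= 0:
        raise ValueError("detection plus recovery time must be positive")
    return int(budget_seconds // per_incident)


if __name__ == "__main__":
    # Hypothetical goals: 5 seconds to detect, 15 seconds to recover,
    # checked against a 25.92 second monthly budget.
    print(incidents_within_budget(25.92, mttd_seconds=5, mttr_seconds=15))
```

With those made-up numbers, the budget covers a single failure per month. If the answer comes out as zero, either the detection and recovery goals or the availability target needs to change.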

The processes and procedures the teams develop can be a mix of manual and automated steps, as appropriate for the risk tolerance of each environment.

Principle 3: Continuously test in production

Your highly available system is only as strong as the weakest link in the chain.

And if your risk mitigation strategies are never tested, then those strategies are your weakest link.

So Principle 3 is to test them. Continuously. In production.

Wikipedia: Chaos Engineering

Netflix popularized the bold act of testing in production with their Chaos Engineering methodology.

And this has helped them develop a robust system that can withstand failures with minimal impact to their service.

Follow their lead by implementing Principle Number 3 to regularly test your full stack high availability processes in production.

Conclusion

Of course, the devil is in the details of how you do this for your environment.

Great Data Guardians know that these principles of removing single points of failure, establishing methods to reduce the impact of failures, and testing those methods in production will set their company up with the most available database system money can buy.

But they also know that the most available database system isn’t worth much if the full tech stack isn’t designed to follow these same principles.
