Database Performance Frameworks

Database Performance is a primary concern for Database Administrators today.

How can we make sense of all the data we collect to analyze database performance?

I want to highlight some frameworks to help you get started watching the right signals for Database Performance!

Overview of Database Performance Frameworks

If you are collecting all the right bits of Database telemetry, you are likely wondering how to begin making sense of all that data.

As a Database Administrator, can you relate?

If you are a developer who is interested in why your queries are running slowly, it’s easy to be overwhelmed.

Luckily, no matter what your role is, there are a few frameworks that can get you started.

In this post, I call out three frameworks:

  • USE
  • RED
  • CELT

I’ll discuss how to use (pun intended) them as signals of capacity issues and performance issues.

USE framework

Let’s start with USE.

USE was developed by Brendan Gregg as a framework to teach engineers what to look for in system resource bottlenecks.

USE stands for Utilization, Saturation, Errors.

My method for making this useful is to have a graph for each of the four system resources: CPU, Memory, Disk and Network.

The idea is to graph the actual utilization of these resources: for instance, how much bandwidth the network is using, or how many IOPS the disks are processing.

Saturation is basically the point at which utilization can’t increase, so work is queuing up. Ideally, I might inject annotations on the graph to highlight when saturation happens.

If I can’t inject annotations, then I look at the graph for plateaus in the resource’s utilization.
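Here is a minimal sketch of that plateau check in Python. It assumes you already know the resource's ceiling (for example, provisioned IOPS) and have utilization samples at a fixed interval. The function name `find_plateaus`, the 5% tolerance, and the sample values are all made up for illustration.

```python
from typing import List, Tuple


def find_plateaus(samples: List[float], ceiling: float,
                  tolerance: float = 0.05,
                  min_length: int = 3) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs where the samples stay
    within `tolerance` (a fraction of `ceiling`) of the ceiling for at
    least `min_length` consecutive points."""
    threshold = ceiling * (1.0 - tolerance)
    plateaus = []
    start = None  # index where the current run near the ceiling began
    for i, value in enumerate(samples):
        if value >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                plateaus.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the series.
    if start is not None and len(samples) - start >= min_length:
        plateaus.append((start, len(samples) - 1))
    return plateaus


# Hypothetical disk IOPS samples against an assumed 5000 IOPS limit.
disk_iops = [1200, 2500, 4100, 4950, 5000, 4980, 4990, 3000, 1500]
print(find_plateaus(disk_iops, ceiling=5000))  # [(3, 6)]
```

A plateau found this way is only a hint that work may be queuing. Confirming saturation still takes a queue-depth or wait-time metric for the same resource.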

Error rate is useful to know when resources might be degrading or failing, or doing more work due to temporary issues.

USE is great for quickly identifying bottlenecks on system resources.

That also makes it good for long-term capacity planning.

But the presence of bottlenecks won’t tell you the root cause. So you will still need to dig into the issue.

RED framework

Now, let’s take a look at RED.

RED was developed by Tom Wilkie to help describe service health.

This framework captures the signals for Rate, Errors, Duration.

It focuses on how well and efficiently an application is able to handle the workload asked of it.

In a database application, the workload is the queries.

So if you can measure how many queries per second (Rate), how many errors are being generated (Errors), and how long these requests take (Duration), you can quickly see patterns in the workload.
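To make the three signals concrete, here is a small Python sketch that summarizes one window of query samples. The `QuerySample` record, the `red_summary` function, and the choice of a nearest-rank 95th percentile for Duration are my own assumptions for illustration. Your monitoring tool may define these differently.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QuerySample:
    duration_ms: float  # how long the query took
    ok: bool            # False if the database returned an error


def red_summary(samples: List[QuerySample],
                window_seconds: float) -> Dict[str, float]:
    """Summarize Rate, Errors and Duration for one observation window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(samples)
    errors = sum(1 for s in samples if not s.ok)
    durations = sorted(s.duration_ms for s in samples)
    if durations:
        # Nearest-rank percentile: smallest value covering 95% of samples.
        rank = max(1, math.ceil(0.95 * total))
        p95 = durations[rank - 1]
    else:
        p95 = 0.0
    return {
        "rate_qps": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p95_duration_ms": p95,
    }


# Ten hypothetical queries observed over a 5 second window; one failed slowly.
durations_ms = [12, 15, 11, 240, 14, 13, 16, 12, 15, 13]
samples = [QuerySample(d, ok=(d != 240)) for d in durations_ms]
summary = red_summary(samples, window_seconds=5)
print(summary)
```

With these made-up samples the window shows 2 queries per second, a 10% error ratio, and a p95 of 240 ms, so a single slow failure is enough to move two of the three signals.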

Spikes in Rate or Errors could indicate a DDoS attack, a fresh launch of your product, or a failed deployment.

Spikes in Duration that coincide with dips in Rate might indicate a bottleneck on the system resources.

Tracking error rates signals how efficient your workload is.

If your application issues a query and gets an error in return, you’re doing more work on the database and potentially providing a poor user experience.

CELT framework

RED and USE can be used together to give you a full picture of your database performance.

RED provides insight into how well the database is handling the workload, and USE shows how well the database is utilizing the system resources available to it (on a dedicated database server).

CELT is similar to RED, as it tries to signal how well the database is reacting to the workload.

As best as I can tell, CELT was developed by Baron Schwartz and employed in the Database Monitoring solution VividCortex (now SolarWinds DPM).

It tries to solve one specific issue with RED: RED has no concept of concurrency.

It’s often the case that databases hit a plateau in Rate when you get too many concurrent requests.

By adding ‘concurrency’ to the framework, you can begin to see if your workload is scaling well.

That is, if you are making 10 times the number of requests at a time, are you getting 10 times the Rate without increasing Duration?
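One way to put a number on that question is to compare the growth in Rate with the growth in concurrency. This Python sketch is only an illustration: `scaling_efficiency` is a name I made up, and the load-test numbers are hypothetical.

```python
def scaling_efficiency(base_concurrency: float, base_rate: float,
                       new_concurrency: float, new_rate: float) -> float:
    """Ratio of Rate growth to concurrency growth.

    1.0 means Rate grew in step with concurrency (linear scaling);
    values well below 1.0 suggest the database is hitting a plateau."""
    if base_concurrency <= 0 or base_rate <= 0:
        raise ValueError("baseline values must be positive")
    concurrency_ratio = new_concurrency / base_concurrency
    rate_ratio = new_rate / base_rate
    return rate_ratio / concurrency_ratio


# Hypothetical load test: 10x the concurrent requests, only 5.5x the Rate.
efficiency = scaling_efficiency(base_concurrency=8, base_rate=4000,
                                new_concurrency=80, new_rate=22000)
print(f"{efficiency:.2f}")  # 0.55
```

In this example the workload gets only about half of the Rate that linear scaling would give. That is a cue to check whether Duration rose over the same interval.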

I suppose CRED would have done alright just by adding concurrency to the acronym, but for CELT there are different terms:

Concurrency, Errors, Latency (instead of Duration), Throughput (instead of Rate)

By adding this fourth signal, it’s a bit easier to get a sense of why the metrics are behaving the way they do.
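As a rough sketch of how the four signals relate, the Python below derives them from query start and end times. It assumes every query starts and finishes inside the window, and it measures concurrency as total query time divided by window length, which is one common definition and not necessarily the one any particular product uses. `QueryEvent` and `celt_summary` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QueryEvent:
    start_s: float  # query start, seconds from the start of the window
    end_s: float    # query end, seconds from the start of the window
    ok: bool        # False if the query returned an error


def celt_summary(events: List[QueryEvent],
                 window_seconds: float) -> Dict[str, float]:
    """Concurrency, Errors, Latency and Throughput for one window.

    Assumes each event lies entirely inside the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    count = len(events)
    busy_seconds = sum(e.end_s - e.start_s for e in events)
    errors = sum(1 for e in events if not e.ok)
    return {
        # Average number of queries in flight at any instant.
        "concurrency": busy_seconds / window_seconds,
        "error_ratio": errors / count if count else 0.0,
        # Mean time per query, in seconds.
        "latency_s": busy_seconds / count if count else 0.0,
        # Completed queries per second.
        "throughput_qps": count / window_seconds,
    }


# Four hypothetical queries in a 2 second window; the last one failed.
events = [
    QueryEvent(0.0, 0.5, True),
    QueryEvent(0.25, 0.75, True),
    QueryEvent(0.5, 1.5, True),
    QueryEvent(1.0, 1.75, False),
]
celt = celt_summary(events, window_seconds=2.0)
print(celt)
```

Defined this way, concurrency equals throughput multiplied by mean latency (Little's Law), so if Latency climbs while Throughput stays flat, Concurrency has to be rising.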

Usefulness of Visibility Frameworks

Alright, so we have USE, CELT and RED. So what?

Why are these frameworks useful?

Again, these frameworks are signals of the performance of your database.

These frameworks will not allow you to drill deeper into root cause, but they provide a good jumping-off point for knowing when you need to investigate more.

They allow you to ask questions for your investigation.

“Why am I spiking in Latency, yet my Throughput is lower?”

If you do nothing other than graph USE and CELT or RED, you can see at a glance whether things are stable.

As a bonus, the charts can be distributed to other teams who may not have any concept of database performance.

Developers might already be familiar with RED. SREs likely know about USE and RED or CELT.

And business execs know how to interpret graphs for spikes and dips.

So you’re speaking a common language when justifying additional spend, or even better, reducing spend!

Conclusion

It is often a good idea to have a simple set of metrics to understand the health of any system.

This helps when communicating with stakeholders who may not have any understanding of database performance.

A Data Guardian can leverage the USE and RED or CELT frameworks to provide these metrics for database performance health.

With this in place, you can quickly monitor for any anomalies that need deeper investigation.

These frameworks help you form the right initial questions to start your investigation.

And starting with the right questions can help you get to the right answer faster!