Required Database Telemetry
Database performance is critical to your role as a database administrator.
This shouldn’t be a surprise.
If the database is slow, the application is slow. If the application is slow, users leave.
Today, I'll talk about the database telemetry you need to collect to gain insight into how your database is performing.
What is Database Telemetry?
Ok, what is database telemetry?
Once again, I have to refer to Wikipedia’s definition:
Telemetry for software means collecting metrics on how the application is performing.
In our case, the application is the database.
The application must have instrumentation that exposes these metrics, of course.
So database telemetry is the information the database exposes about how it is performing.
It’s crucial to collect the right things so you can tweak and tune the database for optimal performance.
For databases, there are three main categories of telemetry you need to collect!
Metrics
The first is probably the most obvious. You need to collect metrics.
Databases expose various data points about how different parts of the system are being used.
This can include how well memory structures are used, how many connections are active versus idle, and how efficiently rows are accessed.
Databases expose hundreds of these metrics, and they are useful for understanding how the system is responding to the query workload.
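As a concrete illustration, here is a minimal sketch of what polling those counters might look like, assuming a MySQL-compatible database and the PyMySQL client; the connection details are placeholders, but the status variables are standard MySQL/InnoDB counters.

```python
import pymysql

# Placeholder connection details -- adjust for your environment.
conn = pymysql.connect(host="localhost", user="monitor",
                       password="secret", database="mysql")

def global_status(conn):
    """Fetch all server status counters as a name -> value dict."""
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS")
        return {name: value for name, value in cur.fetchall()}

status = global_status(conn)

# Connection usage: active versus idle threads.
print("Threads connected:", status["Threads_connected"])
print("Threads running:  ", status["Threads_running"])

# Memory structure efficiency: InnoDB buffer pool hit ratio.
reads = int(status["Innodb_buffer_pool_reads"])             # logical reads that went to disk
requests = int(status["Innodb_buffer_pool_read_requests"])  # all logical reads
if requests:
    print(f"Buffer pool hit ratio: {1 - reads / requests:.4f}")
```

In practice you would poll on an interval and ship the deltas to a time-series store, since most of these counters are cumulative and only make sense as rates.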
You should also collect statistics on how well overall system resources such as CPU, RAM, disk, and network are being used.
Without these metrics, you will not be able to spot bottlenecks in the system.
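A collector for that host-level side might look something like this sketch, using the psutil library and assuming it runs on the database host itself:

```python
import psutil

# Take one sample of host resource usage; in practice you would poll
# these on an interval and ship them alongside the database metrics.
cpu_pct = psutil.cpu_percent(interval=1)   # CPU utilization over one second
mem = psutil.virtual_memory()              # RAM usage
disk = psutil.disk_io_counters()           # cumulative disk I/O counters
net = psutil.net_io_counters()             # cumulative network I/O counters

print(f"CPU:  {cpu_pct}%")
print(f"RAM:  {mem.percent}% of {mem.total // 2**20} MiB used")
print(f"Disk: {disk.read_bytes} bytes read, {disk.write_bytes} bytes written")
print(f"Net:  {net.bytes_recv} bytes in, {net.bytes_sent} bytes out")
```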
But collecting these metrics alone is not enough.
It is easy to chase red herrings and tune the wrong things based on what you think you are seeing in the metrics.
Logs
The next bit of data to collect is logs.
Unfortunately, I find that collecting database logs is often overlooked by observability tools.
Logs can contain a lot of useful information that you can’t get strictly from the database metrics.
Basic logs can tell you about errors and corruption that the database detects during operation.
They can tell you the cause of crashes.
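As a rough sketch, a log scanner for those cases could be as simple as the following; the log path and alert patterns are assumptions you would tune for your own environment.

```python
import re

# Placeholder path; MySQL's error log location is controlled by the
# log_error server variable and varies between installs.
ERROR_LOG = "/var/log/mysql/error.log"

# Patterns worth surfacing: errors, corruption, and crash recovery.
ALERT = re.compile(r"\[ERROR\]|corrupt|crash|signal \d+", re.IGNORECASE)

with open(ERROR_LOG) as log:
    for line in log:
        if ALERT.search(line):
            print(line.rstrip())  # in practice: ship to your log pipeline
```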
Logs for database systems like Galera can also provide useful information about state transfers and flow control.
If you have enabled an audit log, you’ll want to ingest that data into a central location to look for intrusion and attacks on your database.
You can also get useful log-like information from database tables for things like transaction locking.
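On MySQL/InnoDB, for example, you can pull lock-wait information straight out of information_schema. A minimal sketch, with placeholder connection details as before:

```python
import pymysql

# Placeholder connection details, as in the metrics example.
conn = pymysql.connect(host="localhost", user="monitor",
                       password="secret", database="information_schema")

# List transactions currently stuck waiting on a lock. INNODB_TRX is a
# standard InnoDB view; reading it requires the PROCESS privilege.
LOCK_WAITS = """
    SELECT trx_id, trx_state, trx_started, trx_query
    FROM information_schema.INNODB_TRX
    WHERE trx_state = 'LOCK WAIT'
"""

with conn.cursor() as cur:
    cur.execute(LOCK_WAITS)
    for trx_id, state, started, query in cur.fetchall():
        print(f"trx {trx_id} ({state} since {started}): {query}")
```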
Queries
And last but not least, we have the queries.
How the database performs depends entirely on the questions asked of it.
Unfortunately, collecting query data is the hardest of all of these.
You generally suffer a performance hit if you capture every query.
And capturing only a sample of queries leaves gaps in the data.
Because of this trade-off, many choose not to collect query data at all.
However, without knowing which queries are running and being able to correlate them with what you see in the metrics, you can't be effective in your performance tuning efforts.
Without that query correlation, you are just guessing when you tune something.
“This might help, let’s try it.”
[Image: DBA performance recommendations without query data]
Guesswork isn’t how you want to manage your database, is it?
So collecting query data is critical!
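If you are on MySQL, one lower-overhead middle ground is worth knowing about: the performance_schema digest tables aggregate normalized query shapes rather than logging every statement. A minimal sketch, assuming performance_schema is enabled:

```python
import pymysql

# Placeholder connection details, as in the earlier examples.
conn = pymysql.connect(host="localhost", user="monitor",
                       password="secret", database="performance_schema")

# Top query shapes by total time. Timer columns are in picoseconds,
# so divide by 1e12 to get seconds.
TOP_DIGESTS = """
    SELECT digest_text,
           count_star,
           sum_timer_wait / 1e12 AS total_seconds
    FROM performance_schema.events_statements_summary_by_digest
    ORDER BY sum_timer_wait DESC
    LIMIT 10
"""

with conn.cursor() as cur:
    cur.execute(TOP_DIGESTS)
    for digest_text, calls, total_seconds in cur.fetchall():
        print(f"{total_seconds:10.2f}s  {calls:8d}x  {(digest_text or '')[:60]}")
```

You give up individual query executions this way, but you keep enough shape and timing data to correlate query load against the metrics above.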
Conclusion
Data Guardians take database performance seriously.
So you need to collect telemetry that tells you how the database is working.
Collecting metrics alone is not enough.
Remove the guesswork of your tuning efforts and collect the three main bits of telemetry: metrics, logs and queries.
Then you will be better equipped to pinpoint causes of bottlenecks and make more accurate tuning recommendations.