Blameless culture
Have you ever made a mistake on a production system?
Maybe you spent hours restoring the wrong backup.
Or maybe you dropped the wrong table.
What were you feeling when you realized you had to tell your boss?
Fear? Anxiety? Shame?
If that is what you expect to feel, your company is doing incidents wrong.
Today, I want to talk about companies that develop a blameless culture for incident management.
A story about an incident
I was once part of a consulting team that managed production MySQL environments for a pretty well-known company.
One night, I got an alert that there were discrepancies in the data between the replica and primary of one of this customer’s databases.
While investigating, I found that binary logging had been globally disabled on the primary earlier that day.
It was bad that MySQL allowed binary logging to be disabled globally, but that’s what we had.
You see, someone on our team had disabled it globally by accident while trying to resolve an earlier incident that had broken replication.
I now had to report to the client that our team had messed up again and set the recovery back by a day or more.
I’ll be honest. I considered quietly fixing it to avoid conflict with the customer.
Punishing mistakes
This reaction came out of fear that I’d lose my job.
And that is a common theme when you look back at the industry in 2014.
If you make a mistake on the database, you could easily cost your company millions of dollars in revenue.
Sometimes, a company may call an intern out publicly.
But some management types think that such a mistake is a firing offense.
They implement additional policies and procedures that are designed to prevent future mistakes.
And while being intentional and slowing down are good ways to mitigate errors, the fact is we are human. And humans make mistakes.
So you will never implement enough processes to remove all mistakes.
The trick is adjusting how you respond to those mistakes.
Friction slows down innovation
You see, when you create friction and red tape for every incident to try to avoid future incidents, you paralyze your team.
They start moving slower and slower.
They stop taking risks.
And eventually they’re not doing anything other than maintaining what exists.
And if all you are doing is maintaining, you are not innovating.
The main benefit of an error budget is that it provides a common incentive that allows both product development and SRE to focus on finding the right balance between innovation and reliability.
Google SRE book, Embracing Risk
Build a blameless culture
So what is the solution to this?
We need a response that prevents million-dollar mistakes. Otherwise, our company just might go out of business.
Put simply, you must adopt a blameless culture.
After all, chances are that the employee who made a million-dollar mistake feels incredibly guilty and will do whatever they can to not make the mistake again.
So in effect you just spent a million dollars training them.
If you end up firing them, you have lost that investment!
I’ll once more draw from Google’s SRE book:
If a culture of finger pointing and shaming individuals or teams for doing the “wrong” thing prevails, people will not bring issues to light for fear of punishment.
Google SRE Book, Postmortem Chapter
I was able to bring the missing binary log incident to my manager and the customer because I had trust with the client.
Had I hidden the issue, it would likely have been discovered anyway, and that trust would have been lost.
In practice, this means resolving the issue, reviewing what went wrong, and then implementing system-level solutions that make the same mistake less likely in the future.
I’m not talking about systems like two or three levels of approval. I’m talking about automation or code-level changes that prevent the cause of the mistake.
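To make "code-level changes" a little more concrete, here is a minimal sketch of a guardrail in Python. Everything in it is an assumption for illustration: the names (FORBIDDEN_PATTERNS, BlockedStatement, check_statement, is_allowed) are hypothetical, and I am assuming the statement behind my story looked something like SET GLOBAL sql_log_bin = 0. It is not how our team actually fixed it, just one way such a check could look.

```python
import re

# Hypothetical deny-list of statements that should never reach a production
# primary. The patterns and reasons are illustrative, not exhaustive.
FORBIDDEN_PATTERNS = [
    (
        # Assumed form of the statement that disables binary logging globally.
        re.compile(
            r"\bset\s+(global\s+|@@global\.)sql_log_bin\s*=\s*(0|off|false)\b",
            re.IGNORECASE,
        ),
        "disabling binary logging globally stops changes from reaching replicas",
    ),
    (
        re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE),
        "dropping tables or databases is not allowed from an ad hoc session",
    ),
]


class BlockedStatement(Exception):
    """Raised when a statement matches a forbidden pattern."""


def check_statement(sql: str) -> None:
    """Raise BlockedStatement if sql matches any forbidden pattern."""
    for pattern, reason in FORBIDDEN_PATTERNS:
        if pattern.search(sql):
            raise BlockedStatement(f"refused: {reason}")


def is_allowed(sql: str) -> bool:
    """Return True if the statement passes the guardrail, False otherwise."""
    try:
        check_statement(sql)
    except BlockedStatement:
        return False
    return True


if __name__ == "__main__":
    for statement in (
        "SELECT 1",
        "SET SESSION sql_log_bin = 0",
        "SET GLOBAL sql_log_bin = 0",
    ):
        verdict = "allowed" if is_allowed(statement) else "blocked"
        print(f"{verdict}: {statement}")
```

A check like this would only help if every production session goes through a wrapper that calls it, and a regex deny-list is easy to get around on purpose. The point is not to stop a determined person. It is to make the accidental version of the mistake fail loudly instead of silently.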
Conclusion
There are many benefits to a blameless culture.
One is psychological safety and trust among employees, which leads to increased morale and fewer incidents.
Another is that innovation does not stagnate for fear of making mistakes.
However, it only works if the company embraces it completely.
This means leaders must model the behavior.
If you are an individual contributor at a company that doesn’t have a blameless culture for incidents, you have two choices.
Try to foster one with your management. Or leave.
I’d love to know your stories of bad incidents and what the company reaction was!