Postmortem meetings: what they are, how to conduct them, and the benefits
Photo by Alexis Brown on Unsplash
Let's be honest: we work with software, and incidents will happen.
You can try to prevent them, or create chaos monkeys that get you ready for when they happen, but they are going to happen. Someone may be paged in the middle of the night with a production bug and all servers down.
Maybe it's just a small bug that doesn't affect even ten people, or maybe it's an outage that everyone talks about on Twitter.
Either way, it's going to happen, and your team will probably find a way to solve it. But what happens after that?
What is a postmortem meeting and how is this related to culture?
Postmortems are the subject of one chapter of the Google SRE book, and it's one of my favorites, to be honest.
In summary, that chapter is about learning.
The meeting is not held to find a culprit or to decide who we're going to blame this time. It's held to understand how we can improve and learn from our mistakes.
The main focus of this meeting is on finding the procedural or tooling problems that caused the incident. The point is that if you had an incident because someone made a manual error, perhaps the problem is the manual process and not the person's mistake.
"The cost of failure is education."
The person who conducts this ceremony needs to follow a few strategies to get the best out of the people involved: action points must be created with owners, and we must learn how to avoid repeating the mistake.
Avoid blame, and don't try to find who's guilty when something bad happens. We're human and we make mistakes; what matters more is that we learn from them to avoid them in the future. If we're afraid to be honest, and people point fingers saying things like "wow, so the downtime was your fault", how can we learn? How can we find the root cause?
Not blaming humans should be a mantra within an organization. Dave Zwieback, in “The human side of Postmortems”, points out: “Your organization must continually claim that individuals will never be the 'root cause' of disruptions”. This should be reinforced by team and organization leaders; otherwise postmortems will tend to blame human error.
Don't blame people; they make mistakes. We must learn from mistakes to make improvements.
I like to remember the prime directive of retrospectives at the beginning of every postmortem meeting.
"Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand."
Norm Kerth, Project Retrospectives: A Handbook for Team Review
There are a lot of template structures for postmortem meetings, and it's a good idea to follow one of them so you don't get lost in the middle. Here you can find a lot of them.
I like to have the following items:
- Summary - Short description of what happened
- Timeline - Detailed description of everything that happened, from what led to the problem to everything that was done to solve it. It's also good to include affected users, changes made, and any events that happened after the incident was solved
- Impact - How long was your software or product down? How many users were affected?
- Action Points - Anything that can prevent this incident, or any related one, from happening in the future.
You can add whatever else your team thinks is important to have in the report.
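If your team keeps postmortems next to the code, the four items above could be sketched as a small data structure. This is only an illustration, not a standard format: the names `PostmortemReport`, `TimelineEntry` and `to_text`, and the example incident, are all made up for this post.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    at: datetime  # when the event happened
    event: str    # what happened


@dataclass
class PostmortemReport:
    # Hypothetical structure mirroring the items listed above.
    summary: str
    timeline: list = field(default_factory=list)
    impact: str = ""
    action_points: list = field(default_factory=list)

    def to_text(self) -> str:
        """Render the report as plain text, with the timeline in time order."""
        lines = ["Summary", self.summary, "", "Timeline"]
        for entry in sorted(self.timeline, key=lambda e: e.at):
            lines.append(f"- {entry.at:%Y-%m-%d %H:%M} {entry.event}")
        lines += ["", "Impact", self.impact, "", "Action Points"]
        lines += [f"- {point}" for point in self.action_points]
        return "\n".join(lines)


# Made-up incident, only to show how the report is filled in.
report = PostmortemReport(
    summary="Checkout page returned errors after a manual config change.",
    impact="Product unavailable for 40 minutes.",
)
report.timeline.append(TimelineEntry(datetime(2020, 1, 1, 10, 30), "Change rolled back"))
report.timeline.append(TimelineEntry(datetime(2020, 1, 1, 9, 50), "First alert fired"))
report.action_points.append("Automate the config change")
print(report.to_text())
```

Entries can be added in any order during the meeting, since `to_text` sorts the timeline before printing. Extra sections your team cares about would just be extra fields.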
Root cause identification: The Five Whys
To analyze a problem and why it happened, you can use the Five Whys, a root cause identification technique. Here’s how you can use it:
- Begin with a description of the impact and ask why it occurred.
- Note the impact that it had.
- Ask why this happened, and why it had the resulting impact.
- Then, continue asking “why” until you arrive at a root cause.
List the "whys" in your postmortem documentation.
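The steps above can be sketched in a few lines of code, in case you want to record the chain of "whys" while the meeting is running. The class name `FiveWhys`, its methods, and the example answers are assumptions made for illustration; the technique itself is just asking "why" about the previous answer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FiveWhys:
    impact: str  # description of the impact, the starting point
    whys: list = field(default_factory=list)  # (question, answer) pairs

    def ask(self, answer: str) -> None:
        """Ask "why" about the previous answer (or about the impact) and record the answer."""
        previous = self.whys[-1][1] if self.whys else self.impact
        self.whys.append((f"Why? ({previous})", answer))

    @property
    def root_cause(self) -> Optional[str]:
        """The last answer recorded so far, or None if nothing was asked yet."""
        return self.whys[-1][1] if self.whys else None

    def to_lines(self) -> list:
        """The "whys" as lines ready to paste into the postmortem document."""
        lines = [f"Impact: {self.impact}"]
        for number, (question, answer) in enumerate(self.whys, start=1):
            lines.append(f"{number}. {question} {answer}")
        return lines


# Made-up example chain.
analysis = FiveWhys(impact="All servers were down for 40 minutes")
analysis.ask("A config change was applied to every server at once")
analysis.ask("The change is applied by hand, with no staged rollout")
analysis.ask("There is no tooling for config changes")
print("\n".join(analysis.to_lines()))
```

The code does not stop at five questions on purpose: keep calling `ask` until the team agrees the last answer is a root cause, which here ends up being a tooling problem and not a person.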
Every action point must have an owner
This is super important, and maybe we don't value it enough. If an action point doesn't have an owner, how can you know whether it is going to get done? This person doesn't need to do the task, but she needs to make sure that the action item keeps moving forward.
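One way to make this rule hard to forget is to check it before the postmortem is closed. Here is a minimal sketch, assuming action points are kept as simple records; `ActionPoint` and `unowned` are hypothetical names, and the example data is invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionPoint:
    description: str
    owner: Optional[str] = None  # who follows up, not necessarily who does the task
    done: bool = False


def unowned(points):
    """Return the action points that have no owner (missing or blank)."""
    return [p for p in points if not (p.owner and p.owner.strip())]


# Made-up action points from an imaginary postmortem.
points = [
    ActionPoint("Automate the config change", owner="Ana"),
    ActionPoint("Add an alert for failed rollouts"),
    ActionPoint("Document the rollback steps", owner="  "),
]

for point in unowned(points):
    print(f"No owner yet: {point.description}")
```

If the loop prints anything, the meeting is not over yet.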
SumUp Incident of the Month
Here at SumUp, we do something super nice to make sure we all understand that incidents happen and that we should be open about them.
Every month, we have a meeting called All Hands with the whole engineering team, where we present the biggest incident of the month. We talk about what the team learned and how we can keep avoiding new mistakes.
We also have an open calendar of postmortems, so you can join the ones you think you can learn something from.
Want to know more about it?
How do you run postmortems in your company? Do you like them? Have you been in a "find the guilty" meeting?
Also, I would like to thank Hernandes Sousa, our Brazil SRE Manager at SumUp, who taught me a lot about what DevOps is and how we can evolve through learning.
If you want to know more about DevOps and SRE, don't forget to read Google's SRE books or the amazing The DevOps Handbook.
We have jobs open for SRE/DevOps Engineer at SumUp, come talk to me or Hernandes if you're interested.
Are you doing postmortem meetings at your company? How is it going? Send me a tweet at @herberth3nrique!