Anatomy of an RCA

Siddharth Ram
4 min read · Mar 12, 2023


My previous post covered the why of an RCA. I am often surprised that translating the ‘why’ into the ‘how’ is a challenge for teams. So here is my take on how you should structure RCAs and the review process.

Before we get into this:

My blog exists because I value written communication deeply. The only way to get better at anything is to do it more often. Great written communication is such an undervalued skill in engineers! I am often surprised by how little thought goes into RCAs, and how immature the writing is.

RCAs are an opportunity to learn from our mistakes. Shortchanging RCAs is a sign of an immature engineering culture.

In addition, in an age where we work remotely, written communication is hugely important to avoid misunderstandings. I insist on written communication for key decisions, archived for posterity (see, for example, Architecture Decision Logging).

If you think you are a great engineer and are a good writer, contact me. I am hiring :)

Principles

1. Write from the perspective of the reader

You, as the author, understand the problem and the circumstances well and will be biased towards explaining it from your perspective. This will make for a poor read — the reader does not have your perspective. So take the time to provide a background which will help the reader understand the landscape before delving into the problem. If appropriate, add terminology and explain acronyms.

2. RCAs require leadership attention

If your P1 RCAs do not have leaders present, and by leaders I mean executives, then you do not have the right people in the room. At my current employer, Velocity Global, the COO, CPO and I (the CTO) attend all P1 RCAs. This is in addition to VPs and directors.

3. Provide a timeline

A timeline helps the reader understand the duration and the magnitude of the impact.

4. Describe the problem in detail

Share what happened and how you recovered from the problem.

5. Explain the impact

External Impact: It is important to share what the impact on the customer was. Was it a minor inconvenience? A major outage? Were we able to cover for it internally, and hide the impact from customers?

Internal Impact: Did internal teams perform heroics to recover from the problem? Did the SREs or other teams have to put in extra hours to fix the problem? Share the internal perspective.

The impact should not be measured in loose terms like ‘it impacted a lot of our customers’ or ‘it took a long time to recover’. These statements have little value. Instead, state ‘During the incident 251 customers were unable to access their accounts’ and ‘Problem detection took 23 minutes, which was X minutes more than our agreed-upon SLA’. If you are unable to make such crisp statements, you do not have proper monitoring/observability metrics about your system.
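As a rough illustration of where such crisp statements come from, here is a minimal Python sketch that derives them from incident timestamps. Everything in it is an assumption made for the example: the timestamps, the 15 minute detection SLA, and the `impact_statement` helper are hypothetical, and in practice the numbers would come from your monitoring system.

```python
from datetime import datetime, timedelta

# Hypothetical incident data; every value here is illustrative.
started_at = datetime(2023, 3, 1, 14, 0)     # when the problem began
detected_at = datetime(2023, 3, 1, 14, 23)   # when the problem was detected
detection_sla = timedelta(minutes=15)        # assumed agreed-upon detection SLA
affected_customers = 251


def minutes(delta: timedelta) -> int:
    """Return the number of whole minutes in a timedelta."""
    return int(delta.total_seconds() // 60)


def impact_statement(started_at, detected_at, detection_sla, affected_customers) -> str:
    """Build a measurable impact statement instead of a vague one."""
    time_to_detect = detected_at - started_at
    overrun = time_to_detect - detection_sla
    line = (
        f"During the incident {affected_customers} customers were unable to "
        f"access their accounts. Problem detection took "
        f"{minutes(time_to_detect)} minutes"
    )
    if overrun > timedelta(0):
        line += f", which was {minutes(overrun)} minutes more than our agreed-upon SLA."
    else:
        line += ", which was within our agreed-upon SLA."
    return line


statement = impact_statement(started_at, detected_at, detection_sla, affected_customers)
print(statement)
```

If you cannot fill in the inputs to a function like this, that gap is itself a finding worth recording in the RCA.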

6. Share the key people driving incident management

Who was the incident commander? Was this handled only by the Customer Support team? Were engineers, SRE, DevOps and other teams involved? Note that RCAs are always blameless, so feel free to be vocally self-critical.

7. Deeply understand the Root Cause

This is the most important section. The root cause is what the RCA is all about. It is important to have a deep understanding of the problem and to provide a detailed analysis of what went wrong. It is also important to share what measures have been taken to prevent the same problem from happening again.

We use the ‘5 Whys’ framework for understanding the Root Cause. Keep asking ‘Why’ until you get to a cultural, architectural or process root cause. 5 is a nominal number. To get to the root cause, you may need to ask Why more than 5 times. Do not stop at a shallow ‘there was a bug in the code that was not caught’. The point of an RCA is to wipe out a class of defects.
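One way to keep a 5 Whys chain honest is to record each answer along with the kind of cause it represents, and to treat the analysis as unfinished until the last answer is cultural, architectural or process. The Python sketch below is only an illustration of that idea: the `Why` and `FiveWhys` classes, the `kind` labels, and the example incident are all hypothetical.

```python
from dataclasses import dataclass, field

# The three kinds of root cause named in the text.
ROOT_CAUSE_KINDS = {"cultural", "architectural", "process"}


@dataclass
class Why:
    question: str
    answer: str
    kind: str = "symptom"  # hypothetical label: "symptom" until a root cause is reached


@dataclass
class FiveWhys:
    problem: str
    chain: list = field(default_factory=list)

    def ask(self, question: str, answer: str, kind: str = "symptom") -> "FiveWhys":
        """Append one question/answer pair to the chain."""
        self.chain.append(Why(question, answer, kind))
        return self

    def reached_root_cause(self) -> bool:
        """True only when the last answer is a cultural, architectural or process cause."""
        return bool(self.chain) and self.chain[-1].kind in ROOT_CAUSE_KINDS


# Hypothetical incident, for illustration only.
analysis = FiveWhys("Customers could not log in")
analysis.ask("Why could customers not log in?", "The auth service rejected valid tokens.")
analysis.ask("Why were valid tokens rejected?", "A bad configuration change was deployed.")
shallow = analysis.reached_root_cause()  # still at the 'there was a bug' level

analysis.ask(
    "Why was the bad configuration deployed?",
    "Configuration changes skip review and staged rollout.",
    kind="process",
)
print(shallow, analysis.reached_root_cause())
```

The chain has no fixed length, which matches the point that 5 is a nominal number.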

8. Use an Ishikawa (fishbone) diagram if appropriate

Many incidents follow the ‘Swiss cheese model’. The Swiss cheese model of accident causation illustrates that, although many layers of defense lie between hazards and accidents, there are flaws in each layer that, if aligned, can allow the accident to occur. You may have to write multiple 5 Whys to account for each of the problems that led to an incident. An Ishikawa diagram helps the reader understand how the multiple ‘holes’ in the cheese lined up to result in an incident.

9. Provide a conclusion

The conclusion should summarize the RCA and provide recommendations for future prevention. It should also acknowledge the people involved in incident management and thank them for their efforts.

10. Include a section for follow-up actions

It is crucial to have a section that outlines the follow-up actions that will be taken to prevent a similar incident from happening in the future. This section should detail any processes or changes that will be implemented and any training or communication that will be necessary. Capture each action in your trouble ticketing system, and ensure that each one has a date by which the fix will be in place.
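A small check like the following Python sketch can flag follow-up actions that have no due date or have slipped past it. It is a sketch under stated assumptions: the `FollowUp` record, the ticket IDs, and the `problems` helper are hypothetical and do not correspond to any particular ticketing system's API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class FollowUp:
    ticket: str                 # hypothetical ticket ID from your ticketing system
    description: str
    owner: str
    due: Optional[date] = None  # date by which the fix should be in place
    done: bool = False


def problems(actions: List[FollowUp], today: date) -> List[str]:
    """List follow-ups that lack a due date or are open past their due date."""
    issues = []
    for action in actions:
        if action.due is None:
            issues.append(f"{action.ticket}: no due date")
        elif not action.done and action.due < today:
            issues.append(f"{action.ticket}: overdue since {action.due.isoformat()}")
    return issues


# Illustrative data only.
actions = [
    FollowUp("OPS-101", "Add review step for config changes", "alice", date(2023, 3, 15)),
    FollowUp("OPS-102", "Add alert on login failure rate", "bob"),
    FollowUp("OPS-103", "Run incident response training", "carol", date(2023, 4, 1)),
]
report = problems(actions, today=date(2023, 3, 20))
for issue in report:
    print(issue)
```

Running a check like this in the RCA review makes it harder for follow-ups to quietly go stale.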
