When I got into the DevOps field, I was exposed to The Five Whys — a popular analytical method used in incident postmortems. The Five Whys is one type of root cause analysis (RCA): “The primary goal of the technique is to determine the root cause of a defect or problem by repeating the question ‘Why?’ Each answer forms the basis of the next question” (link).
In a body of research about how systems really fail, I discovered a powerful critique of root cause analysis, known as “the new view of human error.” Managing information technology systems has a lot in common with any field in which accidents or unwanted outages can occur. While most readers of this post will be in the software industry, much of the research comes from other fields such as industrial accidents, medicine, shipping, and aeronautics. After learning about this research, I came to the conclusion that root cause analysis is misleading, even harmful. In this piece, I will explain the researchers’ critique of RCA and present what I think is a better idea.
To illustrate the conventional thinking, I will start with media coverage of one IT outage. There is nothing special about the story that I have chosen — every company that operates software in production has had something similar happen. Business Insider ran the headline Amazon took down parts of the internet because an employee fat-fingered the wrong command. According to writer Kif Leswing, an employee made a mistake and, consequently, a part of Amazon’s cloud crashed. It sounds very simple. Every story like this turns out to be about how some careless — or perhaps incompetent — person spilled their coffee on a network switch and the internet went down. But in reality, it’s not that simple.
Writing about the same story for The New Stack, Lee Atchison leads with Don’t Write Off the AWS S3 Outage as a Fat-Finger Folly. In my opinion, Atchison had a better explanation than the other writer. He points out that AWS follows DevOps best practices, such as scripting instead of typing, validation of inputs, and audit trails. It is true that the ops engineer made a mistake: the engineer typed the wrong command. However, there was also a bug in the validation part of the script, so it did not reject the incorrect command. The command kicked off a cascade of failures that took down part of the system. The full story shows a reality more complex than one user making one mistake.
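To make Atchison’s point concrete, here is a minimal sketch of the kind of guard rail he describes: validating an operator’s input before a capacity-removal script acts on it. The function, thresholds, and pool sizes are my own invention for illustration, not AWS’s actual tooling.

```python
# Hypothetical guard rail: validate an operator's input before a
# capacity-removal script acts on it. Names and thresholds are invented.

MIN_POOL_SIZE = 100          # never shrink the pool below this
MAX_REMOVAL_FRACTION = 0.05  # never remove more than 5% in one command

def validate_removal(current_pool_size: int, requested_removal: int) -> None:
    """Reject obviously dangerous removal requests before executing them."""
    if requested_removal <= 0:
        raise ValueError("removal count must be positive")
    if requested_removal > current_pool_size * MAX_REMOVAL_FRACTION:
        raise ValueError(
            f"refusing to remove {requested_removal} hosts: more than "
            f"{MAX_REMOVAL_FRACTION:.0%} of the pool in a single step"
        )
    if current_pool_size - requested_removal < MIN_POOL_SIZE:
        raise ValueError("removal would leave the pool below its safe minimum")

# A bug in a check like this, or a threshold that is too permissive, lets a
# mistyped number through, and the typo becomes one contributing cause among many.
```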
An important distinction in the research about failures is the division between simple and complex systems. Simple systems are linear and sequential. Each part is connected to one adjacent part. Simple systems fail in simple ways — like a row of dominoes falling over. Researcher Dr. Eric Hollnagel has noted the pervasiveness of the “row of dominoes” metaphor in the coverage of all kinds of incidents. It’s the easiest metaphor to reach for. Hollnagel has a great slide on this point that incorporates pictures of dominoes from a variety of sources.
The computer systems that include our applications are complex, not simple. As Kevin Heselin says in Examining and Learning from Complex Systems Failures, “complex systems fail in complex ways.” Heselin continues, “the hallmarks of complex systems are a large number of interacting components, emergent properties, difficult to anticipate from the knowledge of single components, ability to absorb random disruptions and highly vulnerable to widespread failure under adverse conditions.” Some of the best thinking in this area is from Dr. Richard I. Cook, a medical doctor and author of the landmark paper How Complex Systems Fail. In it, Dr. Cook covers 18 points that characterize complex systems. I’ll go through some of them below, but I encourage everyone to read the entire paper for the rest.
John Allspaw, another one of the thought leaders in this field, characterizes complex systems as those built from components having certain properties: they are diverse, interdependent, adaptive, and connected. Allspaw emphasizes that complex systems exhibit highly nonlinear behaviors: a small action or small change can lead to what seems like a disproportionate and unpredictably large event.
In a complex system, it is difficult — or impossible — to understand the relationships between all of the parts, and therefore how the system as a whole will respond to one single small disturbance. One of the constant themes in this literature is that there is no single cause for an incident. Dr. Cook says: “small, apparently innocuous failures join to create [the] opportunity for a systemic accident.” He continues, “each small failure is necessary, but only the combination is sufficient to permit failure.” Multiple things must go wrong in order to produce a systemic outage.
Dr. Cook emphasizes that we — the designers and operators of systems — take proactive steps to protect our systems, and we are pretty good at it. A big part of our job consists of designing systems to be resilient. We have a good idea of many of the things that can go wrong, and we have a “Plan B” in place for the more common failure modes — and a good many of the less common ones too. Our efforts often succeed: the systems we manage don’t fall over at the slightest sign of trouble. If one thing goes wrong, we are pretty good at preventing it from turning into a complete outage. If a server, a database host, or even an entire datacenter fails, most production systems will keep humming along. According to Dr. Cook, “there are many more failure opportunities than overt system accidents.”
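As a deliberately simplified example of a “Plan B,” here is a sketch of a client that falls back to a read replica when the primary database host is unreachable. The hostnames and the bare TCP connect are placeholders, not any particular team’s setup.

```python
# Sketch of a fallback path: if the primary database host is unreachable,
# connect to a read replica instead. Hostnames are placeholders.

import socket

PRIMARY = ("db-primary.internal", 5432)
REPLICA = ("db-replica.internal", 5432)

def connect(address, timeout=2.0):
    """Open a plain TCP connection; a real client would speak the DB protocol."""
    return socket.create_connection(address, timeout=timeout)

def get_connection():
    try:
        return connect(PRIMARY)
    except OSError:
        # The failed primary is a partial failure the system absorbs; the
        # replica keeps things running, so no outage and no postmortem.
        return connect(REPLICA)
```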
I hope it is now evident why the idea of a single cause does not reflect reality. Mathias Lafeldt writes in The Myth of the Root Cause that “Single-point failures alone are not enough to trigger an incident. Instead, incidents require multiple contributors, each necessary but only jointly sufficient. It is the combination of these causes — often small and innocuous failures like a memory leak, a server replacement, and a bad DNS update — that is the prerequisite for an incident. We therefore can’t isolate a single root cause.”
For example, let’s look at a well-known accident in which a ship capsized in the port of Zeebrugge. Researcher Takafumi Nakamura looked at the entire system and created a diagram explaining how the failure occurred, with the failure at the far right. You can see the multiple causes and the relationships between them. All of the contributing causes and relationships in some way contributed to the failure. Take away even one of them, and the failure might have been avoided.
Systems contain what Dr. Cook refers to as latent failures. Because a realized failure requires multiple interacting causes, complex systems are always in a state in which some — but not all — of the contributing causes necessary for a complete failure have already occurred; there are just not enough of them yet for a systemic failure. It is as if a few threads in your sweater are broken, but you don’t see the hole yet. It is impossible to eliminate all of these partial failures.
According to Dr. Cook, “disaster can occur at any time and in nearly any place. The potential for catastrophic outcome is a hallmark of complex systems. It is impossible to eliminate the potential for such catastrophic failure; the potential for such failure is always present by the system’s own nature.”
Some examples of latent failures (a sketch of a simple audit for a couple of these follows the list):
- A DNS record resolves to an incorrect IP address
- Degraded hardware has not yet been removed from the cluster
- Configuration management failed to run, so a host has not received security patches or other changes
- A deploy did not complete in a consistent state but failed to report it
- A software upgrade contained an undetected bug
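As promised above, here is a minimal sketch of the kind of periodic audit that can surface a couple of these latent failures before they combine with others into an outage. The hostname, expected IP, stamp-file path, and thresholds are invented for illustration.

```python
# Toy latent-failure audit: check a DNS record and the age of the last
# configuration-management run. All names and thresholds are invented.

import socket
import time

EXPECTED_IP = "10.0.12.34"          # what api.internal.example should resolve to
MAX_CONFIG_AGE_SECONDS = 2 * 3600   # config management should run at least every two hours

def check_dns(hostname="api.internal.example"):
    """Latent failure: a DNS record resolves to an incorrect IP address."""
    try:
        actual = socket.gethostbyname(hostname)
    except socket.gaierror:
        return False, f"{hostname} does not resolve at all"
    return actual == EXPECTED_IP, f"{hostname} resolves to {actual}"

def check_config_management(stamp_file="/var/run/cfgmgmt.last_run"):
    """Latent failure: configuration management has not run recently."""
    try:
        with open(stamp_file) as f:
            age = time.time() - float(f.read().strip())
    except (OSError, ValueError):
        return False, "no record of a successful configuration management run"
    return age < MAX_CONFIG_AGE_SECONDS, f"config management last ran {age / 3600:.1f} hours ago"

if __name__ == "__main__":
    for ok, detail in (check_dns(), check_config_management()):
        print(("OK   " if ok else "WARN ") + detail)
```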
Latent failures may exist in a system you maintain without causing an outage. When an outage does occur, you may find that these latent failures had been present for quite some time and contributed to it, but were not discovered until after the fact. Latent failures exist because complexity prevents us from understanding and eliminating all of them. And even if we could, our changes to the system would introduce other, different latent failures.
We are constantly changing our systems. We must improve our products in order to stay competitive; no one can afford to sit still for too long. One of the major motivations for making changes is to deliver more business value to our customers. Another key reason is to avoid failure. We are constantly fixing things that are partially broken, improving the robustness of systems, adding more monitoring, tuning alert thresholds, and in other ways trying to reduce the chance of failure. And while we often succeed, avoiding one failure paradoxically introduces new and different failure modes into the system. Every change represents a new opportunity for misconfiguration, introduces a new failure mode, or creates different interactions between parts of the system in ways that we may not fully understand.
Point #2 in Dr. Cook’s paper is “Complex systems are heavily and successfully defended against failure.” Defending against failure is a big part of our job. We defend systems against failure with techniques like redundancy, auto-scaling, load balancing, load shedding, monitoring, health checks, and backups. One of the main elements of the defense against failure is ops and DevOps, the people who manage systems. Because we are creative, we can make decisions on the spot to mitigate potential failures and prevent them from turning into outages. According to Dr. Cook, “the effect of these measures is to provide a series of shields that normally divert operations away from accidents.” Because of the efforts of system operators, many accidents and outages do not occur.
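Two of those defenses, health checks and load shedding, can be sketched in a few lines. This is a toy single-process version with invented thresholds; in practice a load balancer or service mesh usually plays this role.

```python
# Toy HTTP service with a health check endpoint and crude load shedding.
# Thresholds and paths are invented; getloadavg() is Unix-only.

from http.server import BaseHTTPRequestHandler, HTTPServer
import os

MAX_LOAD_PER_CPU = 2.0  # shed traffic when load average climbs past this per core

def overloaded():
    one_minute_load, _, _ = os.getloadavg()
    return one_minute_load > MAX_LOAD_PER_CPU * (os.cpu_count() or 1)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Health check: a failing check lets the load balancer route around us.
            self.send_response(200 if not overloaded() else 503)
            self.end_headers()
        elif overloaded():
            # Load shedding: refuse work early rather than fall over completely.
            self.send_response(503)
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok\n")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```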
Dr. Cook also emphasizes that everything does not have to be perfect. Number 5 among his 18 points is “complex systems can run in a degraded mode.” Complex systems are partially broken — all the time. They are always somewhere in the grey area between succeeding and failing, between being improved and being replaced. This works well enough, more often than not, in part because the system operators are able to work around the flaws. Hollnagel, in How Not to Learn from Accidents, characterizes systems as moving somewhat randomly through a two-dimensional, color-coded space scattered with potholes representing possible outages. Sometimes you are lucky: you pass near a pothole but don’t step in it. Other times, not so much.
The problem with root cause analysis is that RCA assumes one originating cause for each failure, which then ripples down the row of dominoes, ending in an outage when the last domino falls. The Five Whys assumes that each cause has exactly one antecedent, and that when you step back through the chain five times, you find the root. Why five? Five is a completely arbitrary number. If we did not stop at five, we could have The Six Whys, or The Seven Whys — and we would get a different root cause each time. Real-world systems are not made out of five dominoes arranged in a line. The system probably could have survived any one of the five causes by itself. All of the five (or more) contributing causes were in some way part of the story. There is nothing that makes any one of them the root.
To understand the nature of a failure, you must understand how the jointly contributing causes combined. According to Dr. Cook, root cause analysis is fundamentally wrong because “overt failure requires multiple faults. There is no isolated cause of an accident. There are multiple contributors to accidents. Each of these is necessary, but insufficient in itself, to create an accident. Only jointly are these causes sufficient to create an accident.” According to John Allspaw, “these linear chain of events approaches are akin to viewing the past as a lineup of dominoes, and, in reality, complex systems simply don’t work like that. Looking at an accident this way ignores surrounding circumstances in favor of a cherry-picked list of events; it validates hindsight and outcome bias and focuses too much on components and not enough on the interconnectedness of components.”
Some of the researchers in this area emphasize the subjectivity of root cause analysis. RCA is not repeatable. When an event has multiple jointly cooperating causes, you are forced to pick one, and which one you pick is completely arbitrary. The one you pick may be a result of your personal view of things. If someone else examined the same incident, they could equally well pick a different one from among the multiple joint antecedents. Eric Hollnagel has coined the acronym WYLFIWYF: What You Look For Is What You Find. Hollnagel’s point is that an investigator who enters the investigation of an accident looking for a certain thing will tend to choose, from among the many possible contributors, the one that matches their preconceived biases.
I can illustrate this point with the following diagram. For the sake of a small diagram, the postmortem follows a shortened “Three Whys” version of the method. Causality runs from left to right; during the post-mortem, we work backwards to unearth the root cause, moving from right to left. The incident has three immediate jointly contributing causes [F1, F2, F3]. The investigation settles on F1. Proceeding in the same manner from F1, which has three causes [F4, F5, F6], the investigators arbitrarily and subjectively settle on F4, and then, in the third step, on F9. Is F9 the “root cause”? What about the three antecedents of F2, the three of F3, and so on for all the other nodes that I did not draw? With a fan-out of three, a tree three levels deep has 3 + 9 + 27 = 39 nodes (not to mention: why stop at three?). So out of quite a lot of nodes, you arbitrarily picked one and called it the root cause. But there was nothing special about that one. Every node in the graph participated to some extent, and the incident might not have occurred had even one of those nodes not been present.
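To drive home how arbitrary the choice is, here is a toy simulation of that diagram: a complete cause tree with fan-out three, walked from the incident by picking one antecedent at each “why?” step. Different, equally defensible picks yield different “root causes.” The node names and the random choice are purely illustrative.

```python
# Toy model of the cause tree above: fan-out 3, depth 3 (nodes F1..F39).
# Walking "why?" chains with arbitrary picks finds a different "root" each time.

import itertools
import random

FAN_OUT, DEPTH = 3, 3

def build_tree():
    """Return {cause: [antecedent causes]} for the complete tree."""
    counter = itertools.count(1)
    tree, frontier = {"incident": []}, ["incident"]
    for _ in range(DEPTH):
        next_frontier = []
        for node in frontier:
            children = [f"F{next(counter)}" for _ in range(FAN_OUT)]
            tree[node] = children
            tree.update({child: [] for child in children})
            next_frontier.extend(children)
        frontier = next_frontier
    return tree

def whys_walk(tree, rng):
    node = "incident"
    while tree[node]:                  # keep asking "why?" until we reach a leaf
        node = rng.choice(tree[node])  # the arbitrary, subjective pick
    return node

tree = build_tree()
roots = {whys_walk(tree, random.Random(seed)) for seed in range(100)}
print(f"{len(roots)} different 'root causes' found for the same incident")
```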
Let’s look at another example: the horrific Tenerife accident, in which two 747s collided, said to be the deadliest accident in aviation history. It was astonishingly bad luck, which had partly to do with the two planes being rerouted to the same airport, one that did not typically handle 747s. The air traffic controllers weren’t used to them. A lot went wrong, and all of it contributed to the result. Nearly everything had to happen as it did or the accident would not have occurred. I went through some accounts of the disaster and made my own graph showing the antecedents.
If you agree with me so far, then you might ask, “Why do we do RCA anyway?” Researchers have identified several reasons. One explanation comes from John Allspaw: “Engineers don’t like complexity. We want to think that things are simple and root cause does give us a simple answer. It might not be correct, but it is simple. So it does address that need. And if we find the root cause and we fix it, at least for a time, we can tell ourselves that we’ve prevented a recurrence of the accident.”
Other reasons for doing RCA are organizational and political. Some companies require an RCA. The customer may want to know what happened. A sense of finality helps us get past the incident and deal with the trauma. And it may create the illusion that we have found something we can fix.
Other researchers have pointed out different kinds of cognitive biases, such as hindsight bias. Hindsight bias makes things appear much simpler in retrospect than they were at the time. From Dr. Cook’s paper: “knowledge of the outcome makes it seem that events leading to the outcome should have appeared more salient to the practitioners at the time than was actually the case.” It always appears after the fact that somebody made a mistake — if you are predisposed to look at it that way. But at the time people made the decisions that are, in the post-mortem, identified as causes of the failure, it was much less clear that anyone was making a mistake. The operator did not have all the information that we now have in retrospect. The operator thought there was a good chance that the action would fix a problem and avoid an outage. Their actions were taken under a much greater degree of uncertainty.
Human error is often surfaced as a cause of failure during an RCA. Recall the Amazon engineer who “fat-fingered the wrong command” as one example of an outage being attributed to human error. This is yet another cognitive bias: we tend to fixate on the mistakes humans made when those are present among other contributing factors. The human decisions stand out to us as distinct from other, perhaps equally important, problems.
However, the attribution of human error as a cause of failure is overstated. We need to look at the awareness of the people operating systems and understand their actions at the time. Often, what we (after the fact) call human error was a reasonable decision that a person made under conditions of uncertainty and stress. John Allspaw stated in a presentation at Velocity Conf [link no longer available]: “if you list human error as a cause, you’re saying, ‘I can’t be bothered to dig any deeper.’ Human error consists of normal human behavior in real world situations, under conditions of adaptation and learning. What we do every day is make mistakes and adapt.”
Dr. Cook emphasizes that all operators have two roles. One is producing outputs and the other is avoiding errors. Producing outputs means keeping the system running so our customers can use it. Avoiding errors means preventing the system from falling over. Note that a straightforward way to avoid all errors would be turning off the entire system. However, this is not a realistic option because a business must deliver products in order to create value. These two conflicting objectives must always be balanced against each other: we must produce outputs, and some errors are a cost of that. But as Dr. Cook emphasizes, after an accident, we tend to place more emphasis on avoiding errors and — to some extent — forget (or underweight) the importance of producing outputs. The many times that operators produced business value and avoided errors are not counted because we don’t do post-mortems when things go well.
Hollnagel, whom I cited earlier, emphasizes that avoiding error is itself a cause of error. System operators cannot be told never to take any action at all. When issues are discovered, we must assess whether a corrective action is a risk worth taking. System operators take actions aimed at preserving output and avoiding error — and some of those changes, while succeeding in their own context, will cause a different set of errors, either immediately or by creating latent failures that later turn into actual failures.
Dr. Cook explains that every action an operator takes is a gamble: anything you change may destabilize the system. In a post-mortem, it looks as if the operator did something stupid and careless that caused the outage, but at the time it was an educated guess, a gamble, made with the aim of preserving the successful operation of the system. When your job is to take calculated risks, some of the time you lose — but that doesn’t mean that taking risks is a bad idea, nor that we should never change anything. And given our limited understanding of complex systems, it doesn’t mean that the operator was incompetent.
Human understanding of complex systems is incomplete for many reasons: other operators made changes without telling everyone; the situation is high-stress; or perhaps the operator is short on sleep after being alerted during off hours. Dr. Johan Bergström, one of the advocates of the new view, says, “human error is never a cause; instead, it is an attribution of other problems located deeper in or higher up the system, a symptom of those problems, not a cause.”
We have to step back and ask, “What is the point of doing post-mortems?” It is not to find the root cause. It is to find the most unstable areas of the system, where we can make improvements to increase stability or remove latent failures. Given the vulnerabilities that we identify, and the finite resources that we can devote to improvement, we should ask, “Where is the greatest return on our efforts?” Focusing on the two goals of maintaining stability and avoiding error, we should be asking, “How did the system fail?” We need to understand more about the complexity of our systems and how the parts are interrelated.
I have one last example. A vendor integration in a product stopped working, and an outage resulted. The simple explanation for what happened is that the vendor changed the behavior of their API and the system could not handle that. Here’s a diagram which shows the multiple jointly contributing causes. You can see that any one of these might be something the organization could improve upon. You might decide not to do all of them, but you can certainly pick the top three or more improvements that will make your system more stable.
My suggestion for ops teams who use the Five Whys is to instead ask, “How did the system fail?” Identify the factors that jointly contributed to the outage. Draw a diagram like the ones in this article. Then use the diagram to identify the points of maximum leverage for improving your system. Create tickets for those improvements and put them in your tracking system.
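One lightweight way to start, sketched below, is to record the edges of that diagram as plain data and do a rough first-pass ranking of contributing factors by how many downstream effects they feed. The factor names are invented (loosely echoing the vendor-API example above), and counting edges is only a conversation starter, not a substitute for judgment about leverage.

```python
# Rough first cut at leverage points: count how many downstream effects each
# contributing factor feeds. Edges and factor names are invented for illustration.

from collections import Counter

# (cause, effect) edges copied off the whiteboard diagram
edges = [
    ("vendor API changed response format", "parser raised unhandled exception"),
    ("no contract tests against vendor API", "parser raised unhandled exception"),
    ("parser raised unhandled exception", "worker pool crashed"),
    ("no circuit breaker around integration", "worker pool crashed"),
    ("missing alert on worker pool depth", "outage went unnoticed for 40 minutes"),
    ("worker pool crashed", "outage went unnoticed for 40 minutes"),
]

# Factors that feed more downstream effects are candidates for the first tickets.
leverage = Counter(cause for cause, _ in edges)

print("Candidate improvements, by rough leverage:")
for factor, count in leverage.most_common():
    print(f"  {count} downstream effect(s): {factor}")
```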
Robert Blumen is a DevOps Engineer at Salesforce