Root cause analysis lets us turn the lessons of past problems into better future products.
It can seem like everything is always failing. The hardest part of our jobs as engineers isn’t always the design process; it’s the endless redesign and fixing of problems that may or may not have been our fault. We constantly face challenges in our designs and workflows, and determining the root cause of a problem can be more cumbersome than it appears at first glance.
What first appears to be the root cause may turn out to be just another effect of the core problem. If we keep engineering on the assumption that a problem is mitigated when the root cause has not actually been found, failures in our designs can persist and our engineering ability comes into question.
Preventing failure and improving our engineering skills are necessary parts of our job, and proper root cause analysis can profoundly affect our ability in both areas. Root cause analysis (RCA) is commonly thought of as a post-failure tool, but that view limits its usefulness. We can also apply RCA tools when designing new products, making incremental updates, and speeding up our revision process.
Learning from root causes gives us as engineers a certain ability to anticipate and prevent possible failures, which improves our ability to innovate. The uses of root cause analysis can be boiled down to these three categories:
- Addressing Problems with Previous Products
- Designing Incremental Updates
- Leveraging Historical Lessons
In theory, root cause analysis is simple, but in practice, it takes skill to apply its methods well.
In practical terms, RCA is a problem-solving toolset that allows us as engineers to identify the root cause of a fault. More precisely, a root cause is one whose removal eliminates the problem-failure sequence entirely rather than only partially mitigating it. To understand the process further, we need to walk through the steps of RCA in a realistic engineering failure scenario.
The what of the failure
The first step of every RCA procedure is to identify the event or failure that we need to investigate. We will examine the failure of a manufacturing machine that stopped due to a blown fuse. A haphazard fix would be to replace the fuse and get the machine back up and running. This addresses a cause, but it fails to identify the root cause, which leaves a high potential for repeated failure. In this case, the ultimate failure we want to prevent is the machine stoppage.
For any problem, we can determine the final failure and lay the groundwork for RCA by asking What. First, we define the failure we need to prevent. Next, we determine the full set of negative effects, which clarifies the extent of the problem that needs to be solved. Finally, we identify failure modes to start the investigation. In the example of the broken machine, the first failure mode is the blown fuse.
Asking why to find the root cause
After we fully identify the failure and the initial failure mode through the What phase, we move into the Why phase. Asking why means investigating the causes of each failure mode and building a failure flow that tracks causes and effects back to the root. In this step, we need to collect and organize everything we know about the event. We also want to determine whether any other factors might have contributed to the failure.
We would want to determine whether the machine was overheating, whether it was making an unusual noise, and whether the operator was paying attention. Each of these questions gives us further insight into why something might have failed. They also help us direct our causation flow down a path that leads to the root cause. Knowing that the machine was hotter than usual, for example, may prompt us to investigate whether there was a problem with lubrication.
How is this the cause?
The What and Why steps build a framework filled with information that sets us up for finding the root cause of the broken machine. Asking How brings our knowledge together to determine the probable root cause.
In this step, we need to fully sequence the failures in the machine until we can no longer find a cause for the earliest problem in the chain. In the context of our broken machine, we would trace the failures from the blown fuse to insufficient lubrication, then to a broken pump, on to a worn shaft, and finally to the conclusion that scrap metal parts had gotten into the pump and worn the shaft.
To pin down the final root cause, we would determine that insufficient protection from scrap metal on the pump housing allowed scrap to damage the shaft, which in turn caused each later failure in the chain. The root cause of our machine failure, then, is just that: insufficient protection from scrap metal on the pump housing.
At the end of this step, we should be left with an assumed root cause that will be checked in the final step of our analysis.
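The sequencing described in this step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CAUSED_BY map and the trace_to_root helper are hypothetical names, and the entries simply restate the machine example, with each effect pointing to the cause found for it.

```python
# Hypothetical cause map for the machine example: each effect maps to the
# cause identified for it while asking Why.
CAUSED_BY = {
    "machine stoppage": "blown fuse",
    "blown fuse": "insufficient lubrication",
    "insufficient lubrication": "broken pump",
    "broken pump": "worn shaft",
    "worn shaft": "scrap metal in pump",
    "scrap metal in pump": "insufficient scrap protection on pump housing",
}


def trace_to_root(failure, caused_by):
    """Follow cause links from a failure until no further cause is recorded."""
    chain = [failure]
    seen = {failure}
    while chain[-1] in caused_by:
        cause = caused_by[chain[-1]]
        if cause in seen:
            # A loop means the recorded causes are inconsistent.
            raise ValueError(f"circular cause chain at {cause!r}")
        chain.append(cause)
        seen.add(cause)
    return chain


chain = trace_to_root("machine stoppage", CAUSED_BY)
root_cause = chain[-1]  # the assumed root cause, still to be validated
print(" <- ".join(chain))
```

The last entry in the chain is the point where no further cause has been recorded, which is why it is treated here as the assumed root cause rather than a confirmed one.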
Validation of the root cause
At this point, we feel great and think the problem is fixed. However, before we address the problem and start designing solutions, we need to expand our understanding of the root cause to encompass all factors. The root cause may be scrap getting into the pump, but we need to investigate whether this is a natural phenomenon or if it lies in human error, design error, or organizational error.
Perhaps another machine is being operated too close to the failed machine, or perhaps the human operator is using the wrong manufacturing technique. This step checks our “root cause” to help us determine how to address it. If we determine that another machine is operating too close, we can simply move that machine and avoid having to redesign the pump housing. The validation step is meant to check the original analysis and give us an understanding of how to fix the problem.
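One way to keep this check organized is to record each possible explanation with its category and the fix it would imply, then act only on what the investigation confirms. The sketch below is a hypothetical illustration: the Finding class, the choose_actions helper, the category assigned to each entry, and the confirmed flags are all assumptions made for the example, not results from a real investigation.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    category: str           # "human", "design", "organizational", or "natural"
    description: str
    confirmed: bool         # set after investigating on the shop floor
    corrective_action: str


def choose_actions(findings):
    """Return the corrective actions for confirmed findings only."""
    return [f.corrective_action for f in findings if f.confirmed]


# Assumed outcome for illustration: only the nearby machine is confirmed.
findings = [
    Finding("organizational", "neighboring machine operates too close",
            True, "move the neighboring machine"),
    Finding("human", "operator uses the wrong manufacturing technique",
            False, "retrain the operator"),
    Finding("design", "pump housing lacks scrap protection",
            False, "redesign the pump housing"),
]

actions = choose_actions(findings)
print(actions)
```

With these assumed flags, the only action selected is moving the neighboring machine, which mirrors the case above where a redesign of the pump housing is avoided.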
Systematic Improvement and Continuing Innovation
Understanding and being able to apply the steps of root cause analysis is essential to improving our designs and bettering ourselves as engineers. When we apply multiple root cause analyses to one design, each one makes the design more effective and brings it closer to its optimum characteristics.
There are other methods we can build into the design process to help us prevent failure in the first place, rather than using RCA to redesign after a failure. Methods like Abstraction Laddering allow us as engineers to fully define our design goals and create products that meet the intended outcome without complexities that may lend themselves to failure.
We can also use something like an Agile approach to product development, which helps us work well as a team and improves our collective output. Even with these techniques, not all failures can be prevented, and thus RCA is still an essential part of the engineer’s toolset.
Aside from the techniques used to find the root cause, there are also technologies that can give us more data and expand what we know. Internet of Things sensors, AI systems, and data management tools all provide more information, which makes finding the root cause that much easier.
By becoming experts at RCA and understanding failure, we can turn our current design failures into future success, both in our designs and in our engineering ability. Instead of relying on “intuition” alone to solve a problem, this analysis method channels our intuition into a proven methodology that can maximize our ability to solve problems.
Finally, proper tracking of previous RCA conclusions can compound our knowledge of failure, thus strengthening our ability to counteract it. If we ever hope to end the tediousness of fixing and finding problems, then we have to learn from failure. Otherwise, we’ll just be stuck solving the same problems over and over again.