Root Cause Analysis. RCA. Everyone has one, or should have one. In fact, many tools claim to have a system or process for this, but as far as we can tell, most have very little root and even less cause. How can this be?
In theory, RCA is simple. Closely and carefully examine a problem, failure, or alert and determine its one or many root causes, both inside (technical) and outside (usually non-technical) the system. Usually experts do this. Usually without enough data and often by the seat of their pants. They try to establish dependency and temporal sequence chains, causal relationships, and planetary alignment plus deep knowledge to figure out what went wrong.
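To make the dependency and temporal chain idea concrete, here is a minimal sketch in Python. The component names, the dependency map, and the alert times are all made up for illustration, and this is not how any particular product does it. It assumes that an alerting component is a likely root cause when nothing it depends on is also alerting, and that alert times give the rough order in which the failure spread.

```python
from datetime import datetime

# Hypothetical dependency map: each component lists what it depends on.
DEPENDS_ON = {
    "web": ["php-fpm"],
    "php-fpm": ["db", "disk"],
    "db": ["disk"],
    "disk": [],
}

# Hypothetical alerts: component -> time its first alert fired.
ALERTS = {
    "web": datetime(2024, 1, 1, 3, 2),
    "php-fpm": datetime(2024, 1, 1, 3, 1),
    "db": datetime(2024, 1, 1, 3, 0),
}


def has_alerting_dependency(component, depends_on, alerts, seen):
    """True if anything the component depends on, directly or not, is alerting."""
    for dep in depends_on.get(component, []):
        if dep in seen:
            continue  # already visited; also guards against dependency cycles
        seen.add(dep)
        if dep in alerts or has_alerting_dependency(dep, depends_on, alerts, seen):
            return True
    return False


def root_cause_candidates(depends_on, alerts):
    """Alerting components with no alerting dependency, earliest alert first."""
    candidates = [
        c for c in alerts
        if not has_alerting_dependency(c, depends_on, alerts, {c})
    ]
    return sorted(candidates, key=lambda c: alerts[c])


def alert_sequence(alerts):
    """Components in the order their alerts fired (the temporal chain)."""
    return sorted(alerts, key=lambda c: alerts[c])


print(root_cause_candidates(DEPENDS_ON, ALERTS))
print(alert_sequence(ALERTS))
```

With this sample data the only candidate is the database, since the web and PHP tiers both sit downstream of something that is already alerting. A real analysis would also need the domain knowledge that a dependency map alone does not capture.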
Today, more than a few systems offer “RCA” features, which we would expect to do some or all of the above analysis, but alas they appear not to. Instead, they strive to provide more information so someone else can make a better RCA determination. This is like saying we fix your car by making better wrenches. More data is better, but domain, dependency, and detailed knowledge is what really matters.
Our job is to design, build, and, most importantly, manage large-scale Internet systems, and we do RCA work every day, often in messy systems where things go wrong in a hurry, documentation is inadequate, and sometimes the hangover has not yet worn off. So we needed better tools to help us, often at 3am when things melt down.
Thus we have real RCA tools that do real analysis of the situation and understand the real relationships and typical event sequences for specific domains, such as web/app servers running PHP, which often run out of key resources in several somewhat predictable ways. We have similar RCA tools for disk space, DB, and IO that understand those domains.
These tools do real root cause analysis and offer their best estimate of the real cause and sequence, linked to the best-practice resolution or automated remediation to solve it. More advanced versions can also tell you how they reached these conclusions.
All of these are part of OpsStack, our Cloud Operations Platform that can design, build, manage, monitor, troubleshoot, tune, and secure large and complex modern on-line systems for our customers.
Real RCA. Get it today.