Root-Cause & Auto-Troubleshooting: Inside OpsStack's Expert System

Cutting-Edge Ops Technology

Root-Cause Analysis & Troubleshooting - The bane of our existence as Operations Engineers.

We are forever trying to fix broken stuff, usually clouds, servers, services, apps, and more. It’s never easy and we usually rely on intuition, some logs and monitoring, and occasionally a Ouija Board.

Having troubleshot all types of diverse systems for nearly 40 years (since I was very young), I firmly believe great troubleshooters are born, or bred very early. But good tools, training, and practice can help everyone find and fix problems in systems of any complexity.

Ouija Boards can help . . .

Our OpsStack Cloud Operations Platform now includes the first true Troubleshooting Expert System for cloud and IT systems. On a technical level, it’s a Cause-Oriented, Inference Engine-based, Production Rule System with Forward-Chaining, along with formal Test, Repair, and Bayesian Network Multi-Dimension Risk/Costing.

That’s a mouthful. But it’s seriously powerful and cool, and with this we’ve built a full IT Ops Troubleshooting Engine. Below is more about how we thought about and built this.

First, we looked at every possible system, technology, and toolset for this, and ended up having to build our own, as we couldn’t find any real troubleshooting engine or tool out there. A few monitoring systems claim to do root cause analysis but, in reality, they do nearly nothing except give you some more data.

Then we had to look at how to make this practical and useful for the working engineer, with rules they can understand and update as they see fit. It also has to fit into the information, flow, and thinking of how Sysadmins work. This means an expert system.

We looked at quite a few examples, mostly from the auto, medical, and, surprisingly, cell-tower industries. They all have many examples going back decades of how to improve field technician troubleshooting of increasingly complex systems. All are interesting, though many are too simple for our needs in larger systems with more moving parts, history, related systems, etc.

So how do we build this? We can’t use plain If-Then logic, as that fails miserably on even simple systems due to conflicting rules, complex logic, and the ease of making errors. It also can’t really chain or infer anything.

The much more common flow-chart-oriented systems are interesting, but still too simple for real use with a dozen or dozens of rules, conflicting rules, exclusion factors, and more. We did find one such system with good examples, including Tasks, Tests, Observations, etc. for basic troubleshooting.

Moving on, a simple rule engine that just layers AND/OR logic is not enough, as we need to score and sort options, have include/exclude rulesets, deal with unresolved facts, and be iterative in some ways.

If-Then, Flowcharts, and Decision Trees are all too simplistic for real Ops use on real Systems.

We also looked at Bayesian Networks (Probabilistic Directed Acyclic Graphs, DAGs) for core logic, but these only work well when there are well-defined probabilities and simple additive failure modes. They are useful for costing and for actions, or even scoring, but not for the core logic elements we need. They are also quite hard for non-experts to understand and manage.
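To make the "well-defined probabilities" point concrete, here is a minimal sketch of Bayes' rule over two candidate causes. The cause names, the symptom, and every number are hypothetical and chosen only for illustration; none of them come from OpsStack.

```python
# Hypothetical priors: how often each cause is behind a "site down" alert.
priors = {"disk_full": 0.2, "db_down": 0.8}

# Hypothetical likelihoods: P(symptom "write errors in log" | cause).
likelihoods = {"disk_full": 0.9, "db_down": 0.3}

def posterior(priors, likelihoods):
    """Bayes' rule: P(cause | symptom) is proportional to prior * likelihood."""
    unnormalized = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}

result = posterior(priors, likelihoods)
for cause, p in sorted(result.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cause}: {p:.3f}")
```

Every prior and likelihood has to come from somewhere, and in most Ops environments nobody has those numbers, which is the practical hurdle described above.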

So, what to use?

Cause-Oriented, Inference Engine-based, Production Rule System with Forward-Chaining, Scoring, and Exclusions! There is a lot in that, so let’s break it down.

Cause-Oriented means each Problem starts with a fixed list of Possible Causes, as determined by experts. Then our goal is to build a list of likely Causes, sorted by likelihood. The Rules are tied to their Parent Cause, i.e. the Rules are used to directly score the likelihood of this Cause being the real Cause. Our system allows various Scoring actions, such as +/- to Raise or Lower the Score, or absolute scoring.

A key part of troubleshooting is excluding Causes, which really helps the Engineer avoid blind alleys and wasted time for Causes that cannot be real. Thus, our system has special rule results for this, so we can categorically exclude a Cause. This removes that cause from the evaluation and results (though we show the Exclusion list).
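The scoring and exclusion ideas can be sketched in a few lines of Python. This is only an illustration under assumed names: the `Cause` class, the action names (`raise`, `lower`, `set`, `exclude`), and the example causes are all hypothetical, not OpsStack's actual data model.

```python
from dataclasses import dataclass

@dataclass
class Cause:
    name: str
    score: float = 0.0
    excluded: bool = False

def apply_action(cause, action, value=0.0):
    """Apply one rule result to a cause: raise, lower, set, or exclude."""
    if action == "raise":
        cause.score += value
    elif action == "lower":
        cause.score -= value
    elif action == "set":          # absolute scoring
        cause.score = value
    elif action == "exclude":      # categorically rule the cause out
        cause.excluded = True
    else:
        raise ValueError(f"unknown action: {action}")

def rank(causes):
    """Return (likely causes sorted by score, excluded causes)."""
    likely = sorted((c for c in causes if not c.excluded),
                    key=lambda c: c.score, reverse=True)
    excluded = [c for c in causes if c.excluded]
    return likely, excluded

# Hypothetical causes for a "site down" problem.
causes = {n: Cause(n) for n in ("disk_full", "db_down", "dns_failure")}
apply_action(causes["disk_full"], "raise", 30)
apply_action(causes["db_down"], "set", 50)
apply_action(causes["db_down"], "lower", 10)
apply_action(causes["dns_failure"], "exclude")

likely, excluded = rank(causes.values())
print([(c.name, c.score) for c in likely])
print([c.name for c in excluded])
```

Keeping excluded Causes in their own list, rather than dropping them, lets the engineer see what was ruled out.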

Inference Engines, a key part of AI, are complex tools that take inputs as knowledge or facts and produce outputs that are new knowledge. They do this by inferring things from the data and logic. New knowledge then produces more inference and then more knowledge by using Forward or Backward Chaining. We use Forward Chaining, which starts with facts and works towards results (Backward Chaining starts with results and works backwards to decide which facts it needs).

A Production Rule system actually implements and manages this, as it’s an AI tool that uses the Inference Engine and related logic to evaluate the Left-Hand-Side (LHS) of a rule (the IF part) and then execute the Right-Hand-Side (RHS) of the rule (the THEN part). It uses Forward Chaining to drive evaluation of the LHS, then executes the RHS, looping until there are no new facts or results.
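Here is a minimal sketch of forward chaining over LHS/RHS rules. It is a generic textbook-style loop, not OpsStack's engine, and the fact names are made up. Each rule is a pair: the set of facts its LHS requires, and the set of facts its RHS asserts.

```python
def forward_chain(facts, rules):
    """Fire rules until a full pass adds no new facts (a fixpoint)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            # LHS holds when all of its facts are known; the RHS then
            # asserts new facts, which may enable other rules next pass.
            if lhs <= facts and not rhs <= facts:
                facts |= rhs
                changed = True
    return facts

# Hypothetical rules, listed "backwards" on purpose so that the loop
# needs several passes to reach the final conclusion.
rules = [
    ({"db_cannot_write"}, {"app_errors_explained"}),
    ({"disk_nearly_full", "db_write_errors"}, {"db_cannot_write"}),
    ({"disk_used_over_95pct"}, {"disk_nearly_full"}),
]

derived = forward_chain({"disk_used_over_95pct", "db_write_errors"}, rules)
print(sorted(derived))
```

Starting from two observed facts, the loop derives three more, each one unlocked by the previous pass. That is the "facts towards results" direction described above.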

Putting it all together in practice, the entire Expert System is invoked for a given Alert or a Problem we are trying to solve. Each Alert/Problem has its own set of Possible Causes, Fact Requirements, and Rules.

All about Facts, Causes, Rules, Exclusions & Fixes

Then the system loops on the Cause list, and for each Cause, loops on the Ruleset until no more knowledge is generated. Each Rule outputs a score, new knowledge, exclusions, or combinations. Rules can have outputs when the LHS is true, false, or can’t be evaluated because of missing Facts/Knowledge. This set of logic and output combinations plus chaining is what makes the system so powerful and flexible.
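The loop described above, including the three possible rule outcomes (true, false, or not evaluable), could look roughly like this. It is a sketch under assumptions: the rule layout (`needs`, `test`, `on_true`, `on_false`, `on_unknown`) is invented for illustration, and giving each Cause its own copy of the facts is a simplification, not something the real system is known to do.

```python
UNKNOWN = None  # outcome when a rule's required facts are missing

def evaluate(rule, facts):
    """Return True or False, or UNKNOWN if any required fact is missing."""
    if any(name not in facts for name in rule["needs"]):
        return UNKNOWN
    return bool(rule["test"](facts))

def apply_output(out, state):
    """Fold one rule output (score change, exclusion, new facts) into state."""
    state["score"] += out.get("score", 0)
    state["excluded"] = state["excluded"] or out.get("exclude", False)
    state["facts"].update(out.get("facts", {}))

def run_cause(rules, facts):
    """Re-run one cause's rules until a pass resolves nothing new."""
    state = {"score": 0, "excluded": False, "facts": dict(facts)}
    pending, progress = list(rules), True
    while progress and pending:
        progress, still_pending = False, []
        for rule in pending:
            outcome = evaluate(rule, state["facts"])
            if outcome is UNKNOWN:
                still_pending.append(rule)  # retry once more facts exist
                continue
            apply_output(rule.get("on_true" if outcome else "on_false", {}), state)
            progress = True
        pending = still_pending
    for rule in pending:  # rules that never became evaluable
        apply_output(rule.get("on_unknown", {}), state)
    return state

def troubleshoot(rulesets, facts):
    """Outer loop over the cause list; each cause gets its own copy of facts."""
    return {cause: run_cause(rules, facts) for cause, rules in rulesets.items()}

# Hypothetical ruleset for a single cause.
rulesets = {
    "disk_full": [
        {"needs": ["disk_nearly_full", "write_errors"],
         "test": lambda f: f["disk_nearly_full"] and f["write_errors"],
         "on_true": {"score": 30}},
        {"needs": ["disk_used_pct"],
         "test": lambda f: f["disk_used_pct"] > 90,
         "on_true": {"score": 40, "facts": {"disk_nearly_full": True}},
         "on_false": {"exclude": True}},
        {"needs": ["inode_used_pct"],
         "test": lambda f: f["inode_used_pct"] > 90,
         "on_true": {"score": 20},
         "on_unknown": {"score": 5}},
    ],
}
results = troubleshoot(rulesets, {"disk_used_pct": 97, "write_errors": True})
print(results["disk_full"]["score"], results["disk_full"]["excluded"])
```

In this run the second rule produces the `disk_nearly_full` fact that the first rule was waiting for, and the third rule never gets its fact, so its "unknown" output is applied once at the end.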

The Fact Requirements tell the system what data it needs to get from our monitoring, metrics, alerting, log, history and other systems / sources. Building this initial Fact/Knowledge-base is the first step in the rule process. There is also a large subsystem dedicated to getting Facts, as metrics are easy, but histories, configuration, actions, etc. tie back to OpsStack’s models, CMDB, and other core elements.
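As a rough illustration of Fact Requirements, the sketch below maps each required fact to a named source and records which facts could not be fetched. The requirement format, the source names, and the dictionaries standing in for real metric and CMDB lookups are all hypothetical.

```python
def gather_facts(requirements, sources):
    """Fetch each required fact from its source; track facts we could not get."""
    facts, missing = {}, []
    for req in requirements:
        fetch = sources.get(req["source"])
        value = fetch(req["name"]) if fetch else None
        if value is None:
            missing.append(req["name"])
        else:
            facts[req["name"]] = value
    return facts, missing

# Hypothetical stand-ins for real metric and CMDB lookups.
metrics = {"disk_used_pct": 97}
cmdb = {"last_deploy_hours_ago": 2}
sources = {"metrics": metrics.get, "cmdb": cmdb.get}

requirements = [
    {"name": "disk_used_pct", "source": "metrics"},
    {"name": "last_deploy_hours_ago", "source": "cmdb"},
    {"name": "recent_error_count", "source": "logs"},  # no log source wired up
]

facts, missing = gather_facts(requirements, sources)
print(facts)
print(missing)
```

Tracking missing facts explicitly matters because rules can then produce a distinct output when they cannot be evaluated.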

Ah, and the whole thing is actually implemented in PHP using the powerful Hoa\Ruler library, which uses mostly English-like / SQL-like rules that can be easily built by experts and engineers. This improves Rule-writing efficiency, avoids mistakes, and aids debugging. Rules, Facts, and other things are specified in JSON.
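The actual JSON schema isn't shown here, so the following is purely a guess at what a JSON rule specification might look like, loaded and evaluated in Python rather than PHP. The field names are invented, and the structured `fact`/`op`/`value` condition is a stand-in for the text rules that the real system hands to its rule library.

```python
import json
import operator

# Hypothetical rule document; not OpsStack's real format.
RULES_JSON = """
{
  "problem": "site_down",
  "causes": [
    {
      "name": "disk_full",
      "rules": [
        {"fact": "disk_used_pct", "op": ">", "value": 90, "raise": 40}
      ]
    },
    {
      "name": "db_down",
      "rules": [
        {"fact": "db_ping_ok", "op": "=", "value": false, "raise": 60}
      ]
    }
  ]
}
"""

OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}

def score_causes(spec, facts):
    """Raise each cause's score for every rule whose condition holds."""
    scores = {}
    for cause in spec["causes"]:
        total = 0
        for rule in cause["rules"]:
            known = rule["fact"] in facts
            if known and OPS[rule["op"]](facts[rule["fact"]], rule["value"]):
                total += rule["raise"]
        scores[cause["name"]] = total
    return scores

spec = json.loads(RULES_JSON)
scores = score_causes(spec, {"disk_used_pct": 97, "db_ping_ok": True})
print(scores)
```

Keeping rules as data means an engineer can add or adjust a rule without touching engine code.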

We’ll write more about this soon, including how we use it, examples, and lessons learned, as I’m sure we’ll have to enhance and adjust it as we and our customers build increasingly complex rule sets and datasets.

Get more information on OpsStack at
