
Next-Gen Monitoring:

Anomalies, Clusters, Predictors & more

Monitoring is Monitoring is Monitoring. And Alerting. All the same, for many years now, right?

Well, not really. Today's advanced MSPs (Managed Service Providers) and Monitoring Tool Providers are offering much more powerful alerting systems based on a variety of analytics and statistics, now often called Machine Learning (and often overusing that term). The main goal is to reduce the noise of false alerts while increasing sensitivity to real problems.

OpsStack's predecessor systems at ChinaNetCloud have done deep and advanced monitoring for a long time, though we are continually upgrading our systems to provide better service to our customers. So we’d like to update you on what we’re doing now and seeing in the market.

The core challenge is that traditional threshold-oriented alerting is a very coarse-grained tool: it produces a lot of false alerts and misses a lot of significant issues that should be investigated. This is especially true for MSPs with diverse customer bases, as every system is different and thresholds are just not good enough across a large customer base, nor within the complex and dynamic server fleets of any single customer. Trust us: we've fought this battle for nearly a decade across hundreds of systems and hundreds of millions of users.

Monitoring is Monitoring. Or not . . .

Today we see people talking about advanced “Machine Learning” in monitoring, so let's look at what this means and what we are doing:


For simplicity's sake, we'll talk about traditional Metric (and Log-Related Metric) monitoring, which looks at all the usual suspects, such as CPU, IO, RAM, Queries, and more, plus the more advanced Latency, Throughput, and Error Rates. More advanced systems also look at App and Service-level alerting, i.e. don't alert unless my app or upper-level service is impacted: I don't care if a web server dies as long as I have others, so don't wake me up; I'll fix it tomorrow.
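As a rough illustration of that kind of service-level gating, here is a minimal sketch. The function name and the healthy-fraction threshold are purely illustrative assumptions, not OpsStack code: the point is to page a human only when the pool as a whole is impaired, not when a single member fails.

```python
# Minimal sketch of service-level alert gating (illustrative only).
# Page only when the service itself is impaired, not when one web server dies.

def should_page(healthy: int, total: int, min_healthy_fraction: float = 0.5) -> bool:
    """Return True only if too few pool members are healthy to carry the load."""
    if total == 0:
        return True  # nothing visible or serving at all: definitely page
    return (healthy / total) < min_healthy_fraction

# One dead web server out of four: open a ticket for tomorrow, don't wake anyone up.
print(should_page(healthy=3, total=4))  # False -> fix it tomorrow
print(should_page(healthy=1, total=4))  # True  -> the service is impacted, page now
```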

So, how can we do advanced alerting? The first way is Predictive Analysis, where we look to the future to see when we may have a problem, usually resource exhaustion such as disk space, CPU, or RAM. Essentially you look back in time to build a model, such as a linear regression, and project forward to see when or if you'll have problems.

For example, if you have 20% free disk space and it's dropping 5% per hour, you will have serious issues in a few hours. The same goes for CPU and RAM, though these may look further out, in days or weeks. This gets harder with things like Java heap garbage collection, log rotation, etc. We've done Predictive Alerting for many years, mostly with Linear Regression, but we are looking at more advanced models.
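Here is a minimal sketch of that linear-regression projection, assuming hourly free-disk-space samples; the function name and the 5% alerting floor are illustrative assumptions, not our production model.

```python
# Hedged sketch of Predictive Alerting: fit free disk space against time with a simple
# least-squares line, then project when the trend crosses an alerting floor.

from statistics import mean

def hours_until_exhaustion(samples, floor_pct=5.0):
    """samples: list of (hours_relative_to_now, free_pct).
    Returns projected hours until free space reaches floor_pct, or None if not trending down."""
    xs = [t for t, _ in samples]
    ys = [v for _, v in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
    intercept = y_bar - slope * x_bar
    if slope >= 0:
        return None  # space is flat or growing, nothing to predict
    # Solve floor_pct = slope * t + intercept for t (t = 0 is "now", the latest sample).
    return max((floor_pct - intercept) / slope, 0.0)

# The example from the text: 20% free and dropping ~5% per hour -> trouble in about 3 hours.
history = [(-3, 35.0), (-2, 30.0), (-1, 25.0), (0, 20.0)]
print(hours_until_exhaustion(history))  # ~3.0 hours until only 5% is left
```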

The second interesting method is basic anomaly detection, sometimes called cluster analysis. This is basically looking back in time to build a model of ‘normal’ behavior and levels, such as normal CPU load on your app servers.

One major challenge is seasonality, where the data varies predictably over a day or a week: high CPU from 3-7pm during peak shopping hours, for example, or high use during the business week but low use on weekends. There are various complex models for this, but the idea is to model expected behavior and metric values, such as expecting 67% CPU for the next hour.

Then we alert if the data is higher or lower than this by more than a few standard deviations. This works for both high and low deviations, so it finds problems that cause "low" metrics, such as code, data or network errors that result in unexpectedly low load.
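A minimal sketch of that idea, assuming an hour-of-day baseline built from history; the bucket granularity and the three-standard-deviation band are our illustrative choices here, not the exact models we run.

```python
# Minimal sketch of seasonal anomaly detection: build an "expected value per hour-of-day"
# baseline from history, then alert when the current reading is more than k standard
# deviations above OR below that expectation.

from collections import defaultdict
from statistics import mean, stdev

def build_baseline(history):
    """history: list of (hour_of_day, cpu_pct). Returns {hour: (mean, stdev)}."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    return {h: (mean(vals), stdev(vals)) for h, vals in buckets.items() if len(vals) > 1}

def is_anomalous(baseline, hour, value, k=3.0):
    if hour not in baseline:
        return False  # no model for this hour yet, stay quiet
    mu, sigma = baseline[hour]
    return abs(value - mu) > k * max(sigma, 1e-6)

# Toy data: 5pm is normally ~67% CPU; 45% at 5pm is a "low" anomaly worth a look.
history = [(17, v) for v in (65, 66, 68, 67, 69, 66, 68)]
baseline = build_baseline(history)
print(is_anomalous(baseline, hour=17, value=45))  # True  -> unexpectedly low load
print(is_anomalous(baseline, hour=17, value=67))  # False -> within the expected band
```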

We've done this for a while and it's quite challenging: there are some very complex, multi-factor, multi-season, trend-following models that are very powerful, but they are hard to tune well. When they do work, though, they provide great insight into real problems that are otherwise hard to see (such as stuck Java threads that eat 10-20% CPU forever).

Another anomaly detection method is Fleet Clustering, which looks across a group of servers to see if any single one is out of line with the others. For example, if you have four app servers, we'd expect their CPU to be quite similar. But if one is somewhat higher or lower than the others, something strange is happening. It may not be an emergency, but it needs looking at soon. This analysis is pretty easy to do as long as you smooth the data a bit, and it is of great value in larger systems.
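A minimal sketch of that fleet-level comparison follows, assuming a short moving average for smoothing and a fixed deviation tolerance; both are illustrative choices, and a real system would use longer windows and a more robust spread estimate.

```python
# Hedged sketch of Fleet Clustering: compare each server's smoothed metric against the fleet.
# A server whose CPU sits far from the fleet median is flagged for a look.

from statistics import mean, median

def smooth(series, window=5):
    """Simple moving average over the last `window` samples."""
    return mean(series[-window:])

def fleet_outliers(fleet, tolerance_pct=20.0):
    """fleet: {server_name: [cpu samples]}. Returns servers whose smoothed CPU deviates
    from the fleet median by more than tolerance_pct percentage points."""
    smoothed = {name: smooth(samples) for name, samples in fleet.items()}
    fleet_median = median(smoothed.values())
    return [name for name, value in smoothed.items()
            if abs(value - fleet_median) > tolerance_pct]

# Four app servers that should look alike; app-03 is clearly out of line.
fleet = {
    "app-01": [42, 45, 44, 43, 46],
    "app-02": [40, 41, 44, 42, 43],
    "app-03": [78, 81, 80, 83, 79],   # something strange is happening here
    "app-04": [44, 43, 45, 42, 44],
}
print(fleet_outliers(fleet))  # ['app-03']
```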

Finally, Cross-Metric Correlation is a powerful way to help reduce alerts and increase sensitivity, especially when looking at higher-level service issues. For example, unusual RAM use on an App server correlated with longer response times can raise the score or priority of an alert. This can then alert or remove that instance from the Load Balancer pool for a restart.

This can be even more powerful in larger DB systems, where long queries, slow response times, lock issues, and IO all correlate to serious issues that need attention, either via a call at 3am or automated load shedding, report query killing, etc.

Anti-Correlation can also be used when things should be related, such as requests per second and CPU use, but suddenly are not. This can indicate errors of various kinds, attacks, or strange behavior that needs some attention. Lots of resources correlate in this way, such as CPU, Bandwidth, and IO, though it's important to separate dependent vs. independent metrics.
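A minimal sketch of both ideas using a plain Pearson correlation over recent samples; the 0.7 threshold and the requests-vs-CPU pairing are illustrative assumptions, not our exact scoring.

```python
# Minimal sketch of Cross-Metric Correlation / Anti-Correlation checks (illustrative only).
# Compute the Pearson correlation between two metric series; metrics that should move
# together (e.g. requests/sec and CPU) but suddenly don't are worth an alert or a score bump.

from statistics import mean

def pearson(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x ** 0.5 * var_y ** 0.5)

def decoupled(requests, cpu, expected_min_corr=0.7):
    """Flag when normally-correlated metrics stop moving together."""
    return pearson(requests, cpu) < expected_min_corr

# Requests climb but CPU stays flat: possibly errors, an attack, or failing/cached responses.
requests = [100, 150, 220, 300, 380, 460]
cpu      = [35, 36, 34, 35, 36, 35]
print(round(pearson(requests, cpu), 2))  # near 0 -> the usual relationship has broken
print(decoupled(requests, cpu))          # True  -> needs some attention
```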

Three Methods: Prediction, Anomaly Detection, Correlation

These three concepts are all powerful tools and represent the state of the art in Machine Learning-based anomaly detection in modern monitoring systems. Our OpsStack Operations Platform includes all of these, as we continually work to improve our detection systems in service to our customers, especially in these days of increasingly dynamic systems, containers, clouds, and more.

Learn more about our Total Ops Platform at OpsStack.io.
