
AI for Monitoring

A patch referencing Galaxy Quest: "By Grabthar's Hammer... what a savings."

Cognitive computing approaches to the monitoring problem haven't worked in the past and still don't work now. The future might still make it work, but only if the change in AI technology is a change in per-process execution economics. For AI to be a better choice for monitoring than humans, an expensive mathematical process needs to become cheaper than a simple mathematical process. Let's walk through the details.

What is Monitoring

Monitoring in IT Operations or Security contexts is typically discussed in opposition to Observability… but rarely discussed as itself. So let's look at some definitions. I like what we see when leaving the IT field out of it… governments and non-governmental organizations (NGOs) talk about Monitoring and Evaluation instead. "M&E is a combined term for the processes set up by organizations… with the goal of improving their management of outputs, outcomes and impact. Monitoring includes the continuous assessment of programmes based on early detailed information on the progress or delay of the ongoing assessed activities." In Information Technology contexts, Wikipedia offers several more specific variations for Application Performance, Event, Business Transactions, Networks, Systems, User Activity, and Websites. That proliferation is a good setup for the Observability conversation… the idea being that each of those domains introduces so much contextual specificity and need for local knowledge that a generic monitoring approach won't do. In monitoring software products, the working definition looks more like "collect data and alert when important things are found in it". The a priori assumption is that the person (or algorithm) writing the monitors understands what's important in the data.

What is Observability

Since understanding what's important is obviously a challenge in the real world, IT has recently embraced the systems theory concept of Observability: "internal states of a system can be inferred from knowledge of its external outputs." Specifically in modern monitoring software, the goal is to enable discovery of problems without requiring an up-front investment in understanding the applications in use or the use cases they serve. That's why none of these observability vendors have extensive integrations pages… I'm sorry, actually they all do. Some level of effort is required to get data out of an existing system and understand it well enough to monitor, and that effort is expended before AI is even on the table. Again, there's an a priori assumption that the person writing the monitors can recognize what's important in the data.

How do humans do this task

Still, let’s be fair and skip ahead to where the observability monitoring solution is up and running, data is collected, and it’s time to write some alerting rules. If a human has to do this, they will balance the amount of time that they have for this task, their knowledge of the domain, and their willingness to be paged in the middle of the night. This balance can lead to four possible paths for a given service:

  • Do nothing. If the thing breaks and it matters, then something that’s important will also break and we’ll find out from that. Reactive ScreamOps™️ FTW.
  • Write a matching monitor from the thing’s logs. If the known bad state is found (say, a stack-trace from a crash that is not followed by a startup message in the next sixty seconds), then alert.
  • Write a threshold monitor from the thing's metrics. If the value of the thing's measured attribute goes over a known bad number (say, the CPU load is greater than 95% for ten minutes), then alert. There are two types of these… percentile thresholds and hard value thresholds. Each has its place.
  • Write a control chart monitor from the thing's metrics. If the value of the thing's measured attribute goes way out of range from what it has been in the recent past (say, we were using 30MB of outbound bandwidth per minute per thread in the last seven days but for the last hour we've been using ten times that), then alert. If Shewhart control charts are too fancy, this can be done even more cheaply with multiples, or orders of magnitude. (A minimal sketch of the threshold and control-chart checks follows after this list.)

Each of these monitors takes effort to build and manage over time, but they don't typically cost a huge amount to run (unless we look at security use cases, sorry). The argument for Observability over Monitoring is that you can save some time in making these monitors. Instead of learning how all the logs and metrics of each thing work, just determine what to do from observation: grab all the data, see what good looks like, alert when it stops acting like that. This can be handy when determining a threshold: if you're not sure whether to use 80% or 90%, just look at prior data to see how high the peaks that led to problems have been! It's also dependent on being able to determine that the system state is in fact "good", and that all "bad" outcomes will change the attributes of the data you're looking at… neither of those assumptions is perfectly true.
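
To make the cost argument concrete, here's a minimal sketch in Python of the checks described above: a hard threshold, a percentile threshold derived from prior data (the "observability" shortcut), and a Shewhart-style control chart. The metric names, limits, and history are invented for illustration; a real product wires these to its own data.

```python
import statistics

def hard_threshold_alert(cpu_load: float, limit: float = 0.95) -> bool:
    """Hard value threshold: alert when the measured value crosses a known bad number."""
    return cpu_load > limit

def percentile_threshold(history: list[float], pct: int = 99) -> float:
    """Derive a threshold from prior data instead of guessing between 80% and 90%."""
    return statistics.quantiles(history, n=100)[pct - 1]

def control_chart_alert(history: list[float], current: float, sigmas: float = 3.0) -> bool:
    """Shewhart-style check: alert when the current value lands far outside the
    recent mean. The even cheaper 'multiples' variant is simply
    current > k * statistics.mean(history)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(current - mean) > sigmas * stdev

# Hypothetical week of ~30 MB/min/thread of bandwidth, then a 10x spike right now.
history = [30.0 + (i % 3) for i in range(7 * 24 * 60)]
print(hard_threshold_alert(0.97))            # True
print(percentile_threshold(history))         # ~32.0
print(control_chart_alert(history, 300.0))   # True: ten times the recent norm
```

None of this needs a neural network; it's arithmetic over a rolling window, which is the point of the cost comparison later on.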

How does AI do it

So what does an AI model do? More Observability, less Monitoring. The main path is to create rules and models in some form of neural network, test them against a known dataset until they perform well, and then turn them loose on unknown datasets to see what happens. This can be quite functional, especially in use cases where the cost of failure is very high (fraud prevention in finance and insurance for instance). The rules fall into one of two modes: detect threshold breaches (as above), or detect cluster compatibility (even though the details are different, this set of transactions looks exceedingly similar to a set of transactions known to be bad). The benefit to this approach is that it can produce rules that the human wouldn’t have bothered with or thought about. Cluster detection is genuinely neat, and provokes comparisons to the human analyst’s intuition.
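
As a rough illustration of the cluster-compatibility mode (a generic distance-to-centroid check, not any particular vendor's model), the sketch below scores a new transaction by how close it sits to transactions a human already labeled bad. The feature vectors, labels, and cutoff are all invented for the example.

```python
import numpy as np

# Hypothetical feature vectors for transactions already labeled bad:
# (requests/min, error rate, payload bytes, latency seconds).
known_bad = np.array([
    [120.0, 0.40, 900.0, 2.5],
    [115.0, 0.38, 880.0, 2.7],
    [130.0, 0.45, 950.0, 2.4],
])

centroid = known_bad.mean(axis=0)
# Generous radius: the farthest labeled-bad point from the centroid, plus slack.
radius = np.linalg.norm(known_bad - centroid, axis=1).max() * 1.5

def looks_like_known_bad(transaction: np.ndarray) -> bool:
    """Alert when a new transaction clusters with the labeled-bad set,
    even though none of its individual values match a known example."""
    return float(np.linalg.norm(transaction - centroid)) <= radius

print(looks_like_known_bad(np.array([118.0, 0.41, 910.0, 2.6])))  # True
print(looks_like_known_bad(np.array([10.0, 0.01, 200.0, 0.2])))   # False
```

Note that the whole thing hinges on the labeled-bad set already existing, which is the data labeling problem this post comes back to below.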

There's an interesting gotcha in this though: the AI can't select "Do nothing" unless it's been provided with a service graph, or the resources to generate one (meaning some human understands how to identify all the entities and their service relationships to each other). Again, that means a need for significant human time investment to bootstrap and maintain the system — just like Monitoring, not so aligned with Observability. Notably, a sufficiently experienced and capable human doesn't actually have to make this investment, because they're flexible enough and responsible enough to make good guesses about when doing nothing is safe. Those guesses are also rapidly self-correcting when they're made by the SREs on call. The AI has no concerns about getting a three AM alert storm. Sufficiently experienced and capable humans are expensive, so monitoring and observability products in the real world do need that significant time investment to bootstrap and maintain them so that normal people can get value. Since adding AI doesn't save that time investment… it doesn't help with the main problem.

I cannot fail to mention one other path… we could use a large language model (LLM) approach to generate rules in the platform language of choice, or to translate rules from another language to your platform's language. That can be a handy time-saver for prompting a human analyst to finish the job, but LLMs by their nature cannot be relied on for accuracy. They are finding the most statistically probable string of tokens for the prompt they were given; they do not have any model of correct or incorrect.
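
For completeness, here's what that path might look like as a sketch: prompting a model to translate a Nagios-style check into a Prometheus alerting rule. `call_llm` is a hypothetical stand-in for whatever model client you actually use, and the output is a draft for a human (or at least `promtool check rules`) to verify, not something to ship as-is.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your model client of choice;
    wire this to whatever API or local model you actually run."""
    raise NotImplementedError

RULE_TRANSLATION_PROMPT = """\
Translate this Nagios-style check into a Prometheus alerting rule in YAML.
Output only the rule, with no commentary.

check_command: check_load -w 5,4,3 -c 10,8,6
"""

draft_rule = call_llm(RULE_TRANSLATION_PROMPT)
# The draft is the statistically probable string for the prompt, not a
# verified rule: review it and run it through promtool before deploying.
print(draft_rule)
```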

Why that’s not effective

I think that cognitive computing approaches fail to be better or cheaper in both Monitoring and Observability contexts. If they're testing for thresholds, they work by determining that behavior is consistent or inconsistent with the past, or with a generated ideal model (which is also built from the past). To the extent that this is true, it could be represented just as well, and more cheaply, with a simple control chart. In other words, the model is doing a reasonable thing; it's just doing it at a higher cost than is necessary.

If they're testing for cluster similarity, that's a different problem where I do think that cognitive computing approaches are valid; but I don't think they're attainable outside of the aforementioned high-value use cases. If we want to produce a low-cost product, then we need to automate the recognition of problem states, and thereby the production of rules that match those problem states. In other words, we have to solve data labeling, which as far as anyone can tell will require a real human intelligence for the foreseeable future.

One attempt to handle this is to do past-predicts-future for the cohort model (even more syllables than Observability!). The cognitive computing system will observe the complex system, build a complex model from it, build rules from the model, and thereby notify humans when facets of the data no longer match that model. This isn't cheap, so a simpler approach is to drive synthetic transactions through the complex system; then you can set up threshold monitors for metrics like how long the synthetic transactions take at each stage (a minimal sketch follows this paragraph). A basic problem with both of these approaches is that the produced alerts are impossible to interpret, because they're alerting on deviation from a norm that wasn't understood either. If you didn't know how the sessions between CockroachDB and those Fargate tasks were supposed to behave, how can you say whether the observed anomaly matters? However, there is an even more basic issue: abnormality isn't bad. Abnormality comes from the real world, and it is not in itself a thing to fire alerts over (unless you just want them to be suppressed). A well-thought-out system will use abnormality as a factor that increases the importance of an alert, not as the reason to generate one.
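
As referenced above, here is a minimal sketch of the synthetic-transaction variant, using only the Python standard library; the stages, URLs, and time budgets are invented for the example.

```python
import time
import urllib.request

# Hypothetical stages of one synthetic transaction, each with a time budget in seconds.
STAGES = [
    ("login",    "https://example.com/login",    0.5),
    ("checkout", "https://example.com/checkout", 1.0),
]

def run_synthetic_transaction() -> list[tuple[str, float, bool]]:
    """Drive the transaction through each stage and flag any stage that blows
    its budget. This is just a threshold monitor on timings: it says something
    got slower, not why, or whether the slowness matters."""
    results = []
    for name, url, budget in STAGES:
        start = time.monotonic()
        try:
            urllib.request.urlopen(url, timeout=budget * 4)
        except OSError:
            results.append((name, float("inf"), True))
            continue
        elapsed = time.monotonic() - start
        results.append((name, elapsed, elapsed > budget))
    return results

for stage, seconds, breached in run_synthetic_transaction():
    if breached:
        print(f"ALERT: synthetic stage '{stage}' took {seconds:.2f}s")
```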

