The system is down, and half a dozen teams join a Zoom call that stretches over a couple of days and a Slack channel that lasts for weeks. Vendors and integration partners are bridged in and out, executives pop by to see what’s going on, and the hours pass with various teams trying to prove or disprove theories about their systems. Each team has their own sources of data and ways to use it. Some are discussing metrics in Grafana or Datadog, some are reviewing logs in Splunk or Elastic, others are comparing traces in NewRelic or Honeycomb. Everyone knows this is terrible, so all of those vendors (and many more!) have pitches to be the single pain of glass where all the types of data are brought together and enriched with context. The demo scenario of “see a problem in metrics, click to locate problem in logs, click to root-cause in traces” is old enough to drive a car in most US states. So, why doesn’t it work yet?
My friends and I have been on the vendor side of this several times, and there are some technical challenges to overcome.
- Scale. Observability via metrics, logs, and traces means three individually massive data processing problems. Making one system that handles all of them is far from simple. The three also come with different latency and retention expectations, so that’s neat.
- Context. The really tough problem, though, is context. That click-through demo requires linking metrics to logs to traces, probably via some form of that dirty word in data circles, “tagging”. Tagging introduces cardinality explosions and human error, so that’s gross. But without it there’s no way to identify the services and entities involved in a problem, so everybody has to do something like it. Worse, it’s not one-and-done; particularly in an ephemeral microservices world, context tags have a valid lifetime. Adding temporal sensitivity to the relationships between metrics, logs, and traces greatly magnifies the scale problem, because queries become much more expensive to produce and execute (there’s a sketch of what that join looks like after this list). If they’re in a metrics-friendly language like PromQL that’s probably survivable, but good luck to SREs trying to do this in SQL.
- User Experience. Another challenge: the expectations for user interfaces are different for each of those teams. Log people expect to search logs with a domain-specific language that feels like SQL and Python had a baby (there are several that fit this description). Metrics people expect information-dense charts that use a lot of mathematical power to expose hidden facets of the data. Traces? We’re looking for flame charts. Innovative rethinking of the entry points is about as welcome as a triangular steering wheel… so the UX designer seeking to capture all three entry points has to find common ground between three disparate tools while also building smooth flows between them.
- Economics. Which brings us to the cost and pricing problem. Different teams have different budgets and different pricing-model expectations, which creates complexity for the vendor trying to solve all the observability problems at once. The simplest startup pricing page has three “starts at” sections, while the bigger companies have pricebooks that look like a conspiracy theorist’s wall chart.
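
To make the context problem concrete, here is a minimal sketch of the tag-plus-time-window join that the click-through depends on. It’s in Python for readability; the field names, tag keys, and in-memory lists are all hypothetical stand-ins for what would be billions of rows in a real system.

```python
# A minimal sketch of joining a metric anomaly to trace spans via shared tags
# plus a time-validity check. All names and shapes here are illustrative.
from dataclasses import dataclass

@dataclass
class MetricPoint:
    ts: float     # epoch seconds of the anomalous sample
    value: float
    tags: dict    # e.g. {"service": "checkout", "pod": "checkout-7f9c"}

@dataclass
class TraceSpan:
    trace_id: str
    start: float
    end: float
    tags: dict    # same tag vocabulary, if you're lucky

def spans_for_anomaly(point, spans, join_keys=("service", "pod")):
    """Return spans that share the anomaly's identity tags AND overlap it in time.

    The time predicate is what makes this expensive at scale: it is an interval
    join rather than a plain equality join, so a tag index alone does not save you.
    """
    matches = []
    for span in spans:
        same_entity = all(point.tags.get(k) == span.tags.get(k) for k in join_keys)
        overlaps = span.start <= point.ts <= span.end
        if same_entity and overlaps:
            matches.append(span)
    return matches
```

Every key in `join_keys` is another cardinality dimension somebody pays for, and the valid-lifetime problem means even this sketch is optimistic: in practice the tags themselves carry start and end times.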

None of that is completely insoluble, and there are at least half a dozen extremely similar-feeling products on the market today. It’s just work. But maybe there’s a better way? What if AI could help you explore the data? My suspicion is that there is absolutely a valuable user experience here: a copilot that helps the SRE or developer find their way around unfamiliar data. However, I don’t think there is an AI answer to deploying or maintaining the system in a real production environment. The blockers are the same ones that stopped The Semantic Web from being a thing: it’s not about tech, it’s about politics (read: the will to do the work must exist across all involved teams) and money (read: a cost-benefit analysis that successfully justifies completing the work). OSI Layers 8 and 9. In other words, some assembly is required in real life.
- get the data — politics, money. Getting, tagging, and using data takes work. Copilot assistance is possible in the modeling, but access, analysis, and storage still need hefty support.
- make the relationship graph — politics, money, tech. Somehow get a relationship graph of the services and entities and how they support each other. Maybe this is by hand, maybe it’s AI, maybe it’s Maybelline, but if you don’t know that A depends on B it’s pretty tough to say why A stopped. There have been attempts to automate this process for decades, using network observations and traces and source code analysis… I haven’t seen it succeed yet, but maybe the breakthrough is just around the corner. Until then, automated tools start the process, maybe make it to 60 or 70 percent at best, and then people finish it (a sketch of that automated starting point follows this list).
- tune detections and dashboards — money, tech. To be an effective tool, the o11y system must fit into the sociotechnical system around it: alerting only when needed, with sufficient context, without tautological resource waste.
- migrate teams to the new system — politics, money, tech. O11y is not new, and a new o11y system is almost never the first one introduced into a customer environment; there are people and processes that must be changed.
- maintain through change — politics, money, tech. And finally, the sociotechnical system must be maintained as services and entities and relationships and teams come and go.
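
Here is a sketch of that automated starting point for the relationship graph: parent-child trace spans that cross a service boundary imply a dependency edge. The span field names below are hypothetical, and everything traces never see (batch jobs, cron boxes, the database behind a vendor API) is exactly the part people still have to fill in by hand.

```python
# Bootstrapping a service dependency graph from trace data: the automated
# 60-to-70-percent starting point. Span field names are illustrative.
from collections import defaultdict

def dependency_graph(spans):
    """spans: iterable of dicts with "span_id", "parent_id", and "service" keys.

    Returns {service: set of services it calls}, inferred from parent -> child
    span relationships that cross a service boundary.
    """
    by_id = {s["span_id"]: s for s in spans}
    graph = defaultdict(set)
    for s in spans:
        parent = by_id.get(s.get("parent_id"))
        if parent and parent["service"] != s["service"]:
            graph[parent["service"]].add(s["service"])
    return dict(graph)

# Example: one checkout trace implies checkout -> payments and checkout -> inventory.
spans = [
    {"span_id": "a", "parent_id": None, "service": "checkout"},
    {"span_id": "b", "parent_id": "a", "service": "payments"},
    {"span_id": "c", "parent_id": "a", "service": "inventory"},
]
print(dependency_graph(spans))   # {'checkout': {'payments', 'inventory'}}
```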
Simple! So, where would AI fit into this? Again, as a copilot, helping the humans ask all the necessary questions and evaluate how far down the path they’ve gotten. It’s not so much removing people as assisting people. Sure, many companies would be a-okay with removing some people, but AI systems aren’t there. The theory is that the people getting assisted could then be less expensive, but so far that isn’t panning out. Rather, it seems that less expensive candidates aren’t getting offers.

The next question, though, is whether all this complexity is the norm or the worst-case scenario. After all, a lot of incidents are simple. Maybe “that last change was bad, revert it” or “increase resources and check back later” does the trick. An AI could certainly do that… but so could a simple script, sketched below. So how long does the more expensive solution keep getting used instead of the less expensive one?
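
For what it’s worth, here is roughly what that “simple script” looks like for a Kubernetes shop. The alert payload shape, and the assumption that the alert already names the deployment, are mine for illustration, which is rather the point: the easy half of incidents needs a runbook, not a model.

```python
# The cheap alternative: a runbook script that reverts a recent change or adds
# capacity. Alert payload fields ("deployment", "recent_deploy", "saturated",
# "replicas") are assumed for illustration.
import subprocess

def handle_alert(alert: dict) -> None:
    deploy = alert["deployment"]
    if alert.get("recent_deploy"):
        # "that last change was bad, revert it"
        subprocess.run(
            ["kubectl", "rollout", "undo", f"deployment/{deploy}"], check=True
        )
    elif alert.get("saturated"):
        # "increase resources and check back later"
        replicas = int(alert.get("replicas", 2)) + 2
        subprocess.run(
            ["kubectl", "scale", f"deployment/{deploy}", f"--replicas={replicas}"],
            check=True,
        )
```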

