Metrics and Observability

I wrote this as a Twitter thread in March of 2018, but the character constraints of Twitter at that time made it extremely cryptic. Also, it’s staged as a response to Splunk’s introduction of the metrics index… and to be honest, that’s no longer interesting to me. This is an expanded set of observations about metrics in observability, without the Splunk part.

Metrics are considered one of the big pillars of Observability (O11Y), even if there’s some disagreement about how many pillars there are. However you count pillars, Cindy Sridharan’s explanation of metrics is the best: summarized, you can predict the amount of traffic and you can do math, but you lose context.
The heart of Observability in my opinion is unifying all the data types so that you get the benefits of each without losing the context that the others can provide. Commercial interlude: If you’re hitting limits in doing that, I think we’ve got a great answer at Observe.

The thing about metrics is, they are quantitative — just numbers. That’s why they’re so great, because the math is so easy I can figure most of it out (with my socks off). This is pretty great for making dashboards: “Your volume has N bytes free”. Done! It gets a little more complex when you’re doing timecharts and reducing data, but you can pretty easily show state with metrics.

Metrics are kind of garbage for alerting though. Is the state we’re observing Good? Is it Bad? To be honest, I think quantitative metrics on their own are almost entirely useless for decisions. You can easily trigger an alert, but you can’t easily make a one-size-fits-all alert that won’t false positive on some systems (different resource allocation profiles), on some days (Holiday Problem), or after some external change (pandemic lockdowns? Welp, there go our behavioral models of network traffic). Besides, even if you do generate a perfect alert by monitoring a metric, such as “Image cache processor disk has been full for the last 10 measurement periods”, what you’ve produced is an event, not a metric.
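To make the "it's an event, not a metric" point concrete, here is a minimal sketch in Python. The `Event` type, the `sustained_breach_events` function, the source name, and the sample values are all hypothetical, invented for illustration. The input is a stream of bare numbers; the output is a discrete record saying that something happened.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Event:
    source: str
    message: str
    first_period: int  # index of the first breaching measurement in the run
    last_period: int   # index of the measurement that triggered the event


def sustained_breach_events(
    source: str, samples: Iterable[float], threshold: float, periods: int = 10
) -> Iterator[Event]:
    """Yield one Event per run of `periods` consecutive samples >= threshold."""
    run_start = None
    fired = False
    for i, value in enumerate(samples):
        if value < threshold:
            run_start = None  # the run is broken, start over
            continue
        if run_start is None:
            run_start, fired = i, False
        if not fired and i - run_start + 1 >= periods:
            fired = True  # only one event per uninterrupted run
            message = f"{source} at or above {threshold} for {periods} measurement periods"
            yield Event(source, message, run_start, i)


# Hypothetical percent-used readings: a long full stretch, a dip, then full again.
percent_used = [80.0] * 3 + [100.0] * 12 + [40.0] + [100.0] * 10
for event in sustained_breach_events("image-cache-disk", percent_used, threshold=100.0):
    print(event)
```

Note that the threshold and the period count are still fixed numbers chosen by a human, so this does nothing about the false positive problem described above.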

Decisions from metrics need qualitative context, whether it’s “we should generate an alert” or “we should engage some humans to make decisions”.

  • Do we need to allocate more space now or can it wait til later?
  • How much more space do we need? Is that a permanent boost or a temporary one?
  • What about our budget & schedule, can we wait until the next billing cycle starts?

Interestingly, those nuanced contextual problems are going to consolidate to a discrete decision. More storage purchased at this time: yes or no. Amount of storage: [small|medium|large]. Allocation timeframe: permanent or temporary. Quantitative data is mixed with qualitative context to make a quantitative decision.

There are some qualitative context elements that can be detected by or collected into computers to help them make these decisions, which is great. But if the context needed to make a decision is only held by the human operators, then those humans will need training before they can use the metrics data at all. Imagine this conversation, happening over and over for every panel of every dashboard: “Fellow human, I will now teach you our monitoring tool’s contextual framework. That line-chart panel on the left in the third row of the second dashboard? It is measuring X metric, which is measured in Y units, at Z interval. We’re measuring normalcy, so we think that the normal range is from A to S today. If the X line goes higher than T, follow the runbook!”

Sounds tedious… can we encode that into a KPI, set an SLO, and alert when it’s breached? Sure, but it hasn’t improved anything, because it still breaks when an outside change means that the observed normal is wrong. This is the Holiday Problem: if you’re alerting that traffic is unusual, you have to filter out normal reasons for it to be unusual. The first time I ran into this was measuring a large customer’s internet traffic, in which we alerted them every time that the workday and workweek started or stopped. Guess what, it’s Monday morning, you’ve got a big increase of traffic! So we made a lattice and compared normal to normal in four hour blocks aligned to each other… guess what, there are three day weekends.

Let’s say that you successfully get and maintain a machine-readable list of all holidays that affect your organization (a task that I’ve heard suggested many times, but that I’ve never seen done in an organization larger than about 500 people). What happens when HQ is closed by a blizzard on a Wednesday? Your anomalous behavior alerts go off, that’s what. Worse, what happens when you’re big enough that you have to think about international holidays? Oh and by the way, holidays affect your customer traffic as well as your employee traffic. The human operator has to know more organizational context than the machine will ever have access to in order to correctly state whether a rule is firing properly, generating a false positive, or generating a false negative.
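The lattice idea, and the way it fails, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the system described above: the function names, the three-standard-deviation tolerance, the floor of 1.0 on the spread, and all traffic numbers are invented. Each reading is compared only against history from the same weekday and the same four hour block.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def slot(ts: datetime) -> tuple:
    """Lattice cell: (weekday, four hour block of the day)."""
    return (ts.weekday(), ts.hour // 4)


def build_baseline(history):
    """history: iterable of (timestamp, value). Returns {slot: (mean, stdev)}."""
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[slot(ts)].append(value)
    return {key: (mean(vals), pstdev(vals)) for key, vals in buckets.items()}


def is_anomalous(baseline, ts: datetime, value: float, tolerance: float = 3.0) -> bool:
    """True when value is far from what the same lattice cell usually sees."""
    if slot(ts) not in baseline:
        return False  # no history for this cell, so no opinion
    avg, spread = baseline[slot(ts)]
    # Assumed floor of 1.0 so a perfectly flat history does not alert on noise.
    return abs(value - avg) > tolerance * max(spread, 1.0)


# Four weeks of made-up 08:00 traffic: busy Mondays, quiet Sundays.
history = []
for week, (mon, sun) in enumerate([(1000, 100), (1020, 110), (980, 90), (1000, 100)]):
    history.append((datetime(2024, 1, 1, 8) + timedelta(weeks=week), mon))
    history.append((datetime(2024, 1, 7, 8) + timedelta(weeks=week), sun))
baseline = build_baseline(history)

# An ordinary Monday morning no longer alerts...
print(is_anomalous(baseline, datetime(2024, 1, 29, 8), 1010))
# ...but a Monday holiday with Sunday-like traffic does.
print(is_anomalous(baseline, datetime(2024, 2, 19, 8), 120))
```

The second check fires because the code has no way to know the office is closed. Adding a holiday calendar only moves the problem to whoever has to maintain the calendar.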

What if you want to compare many KPIs with each other and answer a large and complex question, such as “how much storage increase do we project this service to need in the next six months”? This is not even feasible without qualitative metrics.

  • Finance, “Do we need more storage in the remainder of this fiscal year?”
  • Operators, looking at a 4-tier hybrid cloud hierarchy, “How much storage are we even using now?”

One of the things that makes this really hard to do is the tendency of metrics collection systems to ignore (or leave as optional attributes) the unit size, periodicity, and granularity of metrics. Without these, you’re potentially comparing apples to oranges whenever you look at multiple metrics sources, even if they’re supposed to be the same type of data from the same type of system. It would be ideal if those three were considered required attributes, but that requires a level of standardization that is tough to reach. Talk of changing standards also raises the specter of potentially breaking backward compatibility, which is yet another challenge to all that machine-learning that’s supposed to be happening.
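As a rough illustration of what "required attributes" could look like, here is a Python sketch. The `MetricSeries` type, its field names, and the reading of granularity as "what one value covers" are assumptions made for this example, not part of any existing metrics standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSeries:
    name: str
    unit: str            # unit size, e.g. "bytes" or "gibibytes"
    period_seconds: int  # periodicity: how often a value is emitted
    granularity: str     # assumed meaning: what one value covers, e.g. "volume"

    def __post_init__(self):
        # Refuse to exist without the three attributes, instead of
        # treating them as optional decoration.
        if not self.unit:
            raise ValueError(f"{self.name}: unit is required")
        if self.period_seconds <= 0:
            raise ValueError(f"{self.name}: period_seconds must be positive")
        if not self.granularity:
            raise ValueError(f"{self.name}: granularity is required")


def comparable(a: MetricSeries, b: MetricSeries) -> bool:
    """Apples to apples only: same unit, same period, same granularity."""
    return (a.unit, a.period_seconds, a.granularity) == (
        b.unit,
        b.period_seconds,
        b.granularity,
    )


on_prem = MetricSeries("disk_free", unit="bytes", period_seconds=60, granularity="volume")
cloud = MetricSeries("disk_free", unit="gibibytes", period_seconds=300, granularity="bucket")
print(comparable(on_prem, cloud))  # same metric name, but not the same kind of number
```

Both series here are called disk_free, which is exactly the situation where a dashboard would happily plot them on the same axis.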

That said, extending the required attributes of metrics should make it easier to see problems with the data collection systems, using the periodicity and granularity of metrics. “This dashboard panel expects 15 metrics/period, but it’s now getting 3 from 1/6 of the configured probes, & 1 OutOfCheese Error.” You can do that today by doing backward modeling of each metric, but it seems like using a hammer to drive screws.
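If the expected count per period and the list of configured probes were known up front, the collection health check becomes a simple comparison rather than a modeling exercise. A minimal sketch in Python, where `CollectionReport`, `check_collection`, the probe names, and the counts are all hypothetical:

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CollectionReport:
    expected: int
    received: int
    silent_probes: Tuple[str, ...]

    @property
    def healthy(self) -> bool:
        return self.received == self.expected and not self.silent_probes

    def describe(self) -> str:
        if self.healthy:
            return f"received all {self.expected} expected values"
        return (
            f"expected {self.expected} values this period, got {self.received}; "
            f"silent probes: {', '.join(self.silent_probes) or 'none'}"
        )


def check_collection(
    expected_per_period: int,
    configured_probes: Sequence[str],
    received: Sequence[Tuple[str, float]],
) -> CollectionReport:
    """received: (probe_id, value) pairs seen during one period."""
    reporting = {probe for probe, _ in received}
    silent = tuple(sorted(set(configured_probes) - reporting))
    return CollectionReport(expected_per_period, len(received), silent)


probes = ["p1", "p2", "p3", "p4", "p5", "p6"]
seen = [("p1", 0.4), ("p1", 0.5), ("p1", 0.4)]  # only one probe is talking
print(check_collection(18, probes, seen).describe())
```

The point is that the expected count is declared by the source, not inferred after the fact from whatever history happens to exist.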

So, why don’t the sources figure out the context on their own and send that along? Tools should compute useful values & compare metrics qualitatively, producing events that explain themselves. “Tier 3 storage is 95% full” is a useful alert when it’s coming from a system that can know 95% is very close to capacity because resource allocation is only 100GB. Even better would be an alert like “Tier 3 storage full estimated in four hours”.
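The "full estimated in four hours" alert is a small amount of arithmetic once the source knows its own capacity. A sketch in Python, assuming straight-line growth between the oldest and newest sample (real growth is rarely that polite); the function name and the sample numbers are made up for illustration:

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(
    samples: Sequence[Tuple[float, float]], capacity_gb: float
) -> Optional[float]:
    """Estimate hours until capacity is reached.

    samples: (hours_since_start, used_gb) pairs, oldest first.
    Returns None when usage is flat or shrinking.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples at increasing times")
    rate = (used1 - used0) / (t1 - t0)  # GB per hour, linear assumption
    if rate <= 0:
        return None
    return (capacity_gb - used1) / rate


# Hypothetical tier with a 100GB allocation, growing 1GB per hour.
eta = hours_until_full([(0.0, 92.0), (4.0, 96.0)], capacity_gb=100.0)
if eta is not None:
    print(f"Tier 3 storage full estimated in {eta:.0f} hours")
```

The capacity number is the piece of context the source has and the downstream dashboard usually does not, which is why this estimate is cheaper to compute at the source.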

Contextual decisions could then be made early, or even automated more safely. “Because several tiers are estimating they’ll run out of storage in the next week, we can safely say it’s likely that usage will exceed capacity during the signing manager’s vacation. I think we should buy more space now.”

The first answer to any “Why don’t all the sources just do this?” question is always “Why should those vendors and open source communities make that effort?” The answer is that determining the real world importance of a metric, listing the attributes needed for operators to make important decisions, and defining the KPIs that matter all need context that only those vendors have. “Disk full” is pitifully primitive. A service provider or vendor knows better KPIs for how to monitor the storage they provide.
