
Book Review: Learning OpenTelemetry

Learning OpenTelemetry book cover

A slim volume packed with good stuff; I enjoyed this book. Notably, Observability is defined more practically as the practice of knowing what’s happening, instead of the black-box “infer state from outputs” definition.

There’s a good, strong callout against MELT-style thinking: the idea that there are three (or four, or N) silos of Observability. MELT’s a handy acronym for Metrics, Logs, and Traces. Somebody noticed the E and stuffed Events in, which is sort of like taking “Earth, Air, Fire, and Water” and sticking “Atoms” into it so you can have an acronym. Observability is not a stack of tools unless you’re selling a stack of tools. I laughed out loud at “The Three Browser Tabs of Observability.”

The chapter-leading quotes referencing maps and history in the first half of the book are very apropos. Lack of standardization, no unifying story, no map of the territory… that is the status quo, and it’s a problem, because users are presented with context-free data. OTel divides context into hard context (explicit relationships) and soft context (implicit relationships). Context links resources and transactions. Hard context is useful for a service map; soft context is useful for dimensional explanations.
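
A minimal sketch of that split, assuming the OpenTelemetry Python SDK (the service and attribute names are my own illustrations, not the book’s): the parent/child span relationship is the hard context a service map is drawn from, while the shared resource attributes are the soft context you pivot on later.

```python
# Sketch: hard vs. soft context with the OpenTelemetry Python SDK.
# Assumes opentelemetry-api and opentelemetry-sdk are installed;
# the service and attribute names are made up for illustration.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Soft context: resource attributes shared by everything this service emits.
resource = Resource.create({
    "service.name": "checkout",
    "deployment.environment": "prod",
    "cloud.region": "us-east-1",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout.demo")

# Hard context: the child span carries an explicit parent link (same trace_id,
# parent span_id), which is the relationship a service map is built from.
with tracer.start_as_current_span("handle_order"):
    with tracer.start_as_current_span("charge_card") as child:
        child.set_attribute("payment.provider", "example")  # soft, dimensional
```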

Telemetry types are convertible (log to metric, for instance), but that’s expensive and can introduce error, so it’s better to layer the types together via context links at production time. The OTel stack does exactly that, which is a great thing to leverage when you can be a data source. It’s an opportunity to make and use semantic, portable telemetry, which is a game-changer. In the past this kind of contextual awareness has either lived in a narrow and expensive silo, or floated high above the data in “people talking and trying to explain what the dashboard says”.
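
As a rough illustration of that layering at production time (my sketch, not the book’s example; the logger and span names are assumptions): a log record emitted inside an active span can carry that span’s trace and span IDs, so the log-to-trace link exists the moment the data is produced instead of being reconstructed later by correlation.

```python
# Sketch: attaching the active span's identifiers to an ordinary log record,
# so the log is linked to the trace at production time rather than correlated
# afterwards. Assumes a tracer provider is already configured.
import logging

from opentelemetry import trace

logger = logging.getLogger("checkout")
tracer = trace.get_tracer("checkout.demo")

with tracer.start_as_current_span("handle_order"):
    ctx = trace.get_current_span().get_span_context()
    logger.warning(
        "payment retry needed",
        extra={
            "trace_id": f"{ctx.trace_id:032x}",  # hard-context link to the trace
            "span_id": f"{ctx.span_id:016x}",
        },
    )
```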

With or without semantic telemetry, you have to collect and store and process all this data, and it’s a cost optimization problem at every step of the way: doing what you can with what you can afford to have. There are days I question the wisdom of “everything we do is OpEx, leased from a vendor,” and those days are more common when looking at surprise bills from activities which don’t directly serve the customer. Semantic context coloring on telemetry increases the production cost, decreases the human analytics cost, and the jury’s out on what it does to storage and automatic analysis. It certainly increases portability of data across use cases and analysis techniques.

OTel tries to pack the semantics and portability in at production (well, at a Collector, which is more analogous to Logstash or a Splunk Heavy Forwarder than to Elastic Beats or a Universal Forwarder). That Collector is more likely to live in the source environment than in the bus or sink, which gets really important when you outsource storage and analysis instead of rolling your own stack all the way from source to sink. Edge computing pushes costs into the revenue-generating producers instead of the analysis cost centers. That makes a harder accounting problem, but there’s less data to process and often more resource available at the edge. If the overall solution is less expensive, that can cover some accounting cost.
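
For concreteness, a minimal sketch of that “Collector in the source environment” shape (the endpoint and deployment details are assumptions, and it needs the opentelemetry-exporter-otlp package): the SDK exports OTLP to a Collector running next to the workload, and that Collector, not the application, is the thing that talks to whatever outsourced storage and analysis you’re paying for.

```python
# Sketch: export OTLP to a Collector running in the source environment
# (say, a sidecar or node agent listening on localhost:4317); the Collector
# then forwards to the storage/analysis backend. The endpoint is an assumption.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```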

There’s a suggestion that the discipline of Quality Assurance has died because a QA engineer can’t hold the system in their head anymore. I can’t really agree with that reasoning; I think it’s rather a matter of pushing the harder-to-define problems onto the glue people in order to decrease costs.

Looking at the types of data that OTel handles:

  • Traces are defined as structured logs that come with the four golden signals (LETS: latency, errors, traffic, saturation). Traces are positioned as the most important signal type because they’re transaction-focused (see the sketch after this list). It is my opinion that this is rooted in two things: a business and project goal to replace proprietary APM, and a constraint of ephemeral cloud-native architecture. Which of those drivers came first? How many angels can dance on the head of a pin? Doesn’t matter, and it’s tough to argue against “the customer’s transaction is the most important thing we can look at”.
  • Metrics in OTel have semantics, so they suck less. There’s a lot of room for improvement, so that’s good.
  • Logs are ubiquitous but weakly coupled. Existing tools rely on correlations to make context. The book’s view is that logs are a fallback for when you haven’t adopted traces (e.g., systems that aren’t transactional or haven’t been rewritten). This strikes me as charmingly naïve when reviewing the vast mass of systems that continue to make money and problems without getting updated. Everyone knows the COBOL on mainframe world is still alive, but did you know there are still important systems running on Windows Servers and CentOS 5 too? Also, AIX didn’t die, and Oz isn’t real.
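
To make the traces point above concrete, here’s a minimal sketch (the function, span, and attribute names are my own assumptions): the span is effectively the structured record of the transaction, carrying latency as its duration, the error signal as its status, and the exception as an attached event.

```python
# Sketch: a trace span as the structured record of a transaction, carrying
# latency (its duration), an error status, and the exception as an event.
# Function and attribute names are illustrative assumptions.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("checkout.demo")

def charge_card(order_id: str) -> None:
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        try:
            raise TimeoutError("payment gateway timed out")  # stand-in failure
        except TimeoutError as exc:
            span.record_exception(exc)                 # the exception, as an event
            span.set_status(Status(StatusCode.ERROR))  # the "errors" golden signal
            raise
```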

Some grab bag items:

  • OTel has Semantic Conventions: exciting and dangerous stuff. We always want to cook explanations in, because it’s really hard to use numbers when you don’t know what they represent, but it’s really hard to make vendors and open source projects and customers agree on standards.
  • The OTLP network protocol is the heart of the standard, as IP is the heart of the TCP/IP protocol family. History suggests that’s a good idea. The OTel project has avoided the analysis front end on purpose, but I’m not convinced that it’s entirely safe from a Silicon Valley pirate approach. Speaking of TCP/IP, the love-hate engineering dance between that open protocol and Cisco Systems over the last few decades is quite a thing.
  • Chapter Four ends with a prediction (accurate, in my opinion) that OTel will win because Observability vendors are adopting it. Chapter Five gets detailed about implementation, but the upshot is that all this implementation work is the last time you’ll need to mess around with it, because it’s an abstraction that vendors plug into. Like IP.
  • Library instrumentation is a neat idea, but still pretty theoretical… something along the lines of “if all the open source libraries do the needful it’ll be like getting commercial quality APM for free” (sketched below).
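
A hedged sketch of that library-instrumentation idea (the library name and span names are hypothetical): the library depends only on the opentelemetry-api package, whose calls are no-ops until the application installs an SDK and exporter, which is exactly why libraries can instrument themselves without dragging a vendor along.

```python
# Sketch: what instrumentation looks like from the library author's side.
# The library depends only on opentelemetry-api; every call below is a no-op
# until the application installs an SDK. "shinylib" is a hypothetical library.
from opentelemetry import trace

tracer = trace.get_tracer("shinylib", "1.2.3")

def fetch_widget(widget_id: str) -> dict:
    with tracer.start_as_current_span("shinylib.fetch_widget") as span:
        span.set_attribute("shinylib.widget_id", widget_id)
        # ... the library's real work would happen here ...
        return {"id": widget_id}
```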

Monitoring infrastructure services covers a lot of sound theoretical ground, but it’s a bit light on practical advice. Just, you know, get the CloudTrail stuff and the K8S stuff into the OTel stuff, but be careful not to collect too much. It’s a conventional-wisdom approach to cardinality, which makes sense given the strong traces and metrics assumptions. I like that meta-monitoring isn’t ignored; who’s watching your watchdogs and all that. The Serverless or Function-as-a-Service conversation is neat, but again light on the central question of who pays for telemetry. If the FaaS platform provides the data from under the hood, they’re probably going to have to pay for it, and so that data is going to be pretty minimal. If they don’t provide it, or they provide an option that executes “over the line” where you can see and control it… then it’s almost certainly executing on your bill. An APM-ish library instrumentation approach is a reasonable way to handle that problem.
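
On the cardinality point, a sketch of the conventional-wisdom move in the Python SDK (the instrument and attribute names are assumptions): use a View to keep only the attribute keys you can afford, so a stray unbounded value like a user ID can’t quietly explode the number of series.

```python
# Sketch: bounding metric cardinality with a View that keeps only affordable
# attribute keys. Instrument and attribute names are assumptions; the point is
# that "http.method" is bounded while "user.id" is not.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.metrics.view import View

view = View(
    instrument_name="http.server.requests",
    attribute_keys={"http.method", "http.status_code"},  # everything else is dropped
)

reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader], views=[view]))
meter = metrics.get_meter("gateway.demo")

requests = meter.create_counter("http.server.requests", unit="1")
# "user.id" is stripped by the View before it can become a new series.
requests.add(1, attributes={"http.method": "GET",
                            "http.status_code": 200,
                            "user.id": "12345"})
```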

The pipelines chapter answers some of those K8S integration questions, which I like, especially as it introduces OpAMP. Coming from an endpoint management background, I have strong feelings about configurable platforms and the control planes that make them work. The fact that OpAMP and ECS aren’t ready definitely says something about my biggest issue with OpenTelemetry: the pace of development ain’t fast. This chapter of course ends with a cogent discussion of cost, with some reasonable advice for considering value and converting bulky things to smaller things. Again, thinking about who pays for the execution is interesting: not only is centralized execution more costly than distributed because of the compressed timeframe requirement of a bus or sink, but doing it at the source is also likely happening on someone else’s platform bill.
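
One small sketch of “converting bulky things to smaller things” at the source (the sampling ratio is an arbitrary assumption): head sampling in the SDK shrinks trace volume before anyone downstream has to pay to move, process, or store it.

```python
# Sketch: head sampling at the source, so the bulky thing gets smaller before
# it ever reaches a Collector or a vendor. The 10% ratio is arbitrary.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(0.10))  # keep roughly 10% of new traces
provider = TracerProvider(sampler=sampler)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
```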

Wrap up: there are three axes of Observability: deep vs. wide, rewrite the code or the collection, centralize or decentralize the team. Treading carefully here, because managing these cost/benefit curves is my current job, and those opinions are on that blog… these are all cost problems, which means pace layering leaves them open to technical disruption. If you can afford either position on these three scales, then you can pick the one that fits your culture better. If you can’t afford to go where you want, then you’re stuck with what you can do, or even doing nothing. That leaves tension needing to be resolved, which neatly explains the existence of an open source project like OpenTelemetry. It’s exciting to work with this technology.

Stewart Brand's Pace Layering diagram
