
Why is Getting Data In hard?

[Image: Blofeld the villain archly states that he is beating Bond the hero at hungry hippos]

Maybe first we should ask why people say it’s hard. After all, this shiny modern world is full of one-liners to install agents, hook in libraries, listen to your provider’s pub-sub, or just post stuff to an endpoint. It’s never been easier to get data in, and it’s not like writing to syslog was that hard.
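
To make that concrete, here’s roughly what raw ingest looks like in Python. The collector URL, the event shape, and the syslog socket path are all invented for illustration, but the shape of the work really is a POST or a couple of logging calls:

```python
import json
import logging
import logging.handlers

import requests  # pip install requests

# Shipping a raw event over HTTP: one POST to whatever your collector exposes.
# The endpoint below is made up; substitute your vendor's ingest URL.
event = {"host": "db01", "service": "postgres", "message": "checkpoint complete"}
requests.post("https://collector.example.com/ingest", json=event, timeout=5)

# Or the old way: writing to the local syslog daemon is a couple of lines.
# "/dev/log" assumes a Linux box; other platforms use a host/port tuple.
logger = logging.getLogger("myapp")
logger.addHandler(logging.handlers.SysLogHandler(address="/dev/log"))
logger.warning(json.dumps(event))
```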

The real challenge isn’t ingesting raw data; it’s refinement. Data Value and Volume are Inversely Proportional, and the raw data you’ve pumped into the thing isn’t very helpful to most people looking at it.

A common response to the problem is to drive at a use case: as Alice the Admin, I want to see if my databases are up so I can do something if they’re not. This is remarkably difficult to do well from a visibility platform, though: very few systems produce a log or metric that says “I am dead”, and the absence of data is costly to detect in a timely fashion, so it’s easy to have a green dashboard and a dead database at the same time. Next thing you know you need a heart-beating agent with a control plane, and that’s harder to pick up off the open source shelf.
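
Here’s a minimal sketch of what that check has to do, with invented source names and an assumed timeout; the point is that something has to track “last seen” and actively notice silence, because the data itself never will:

```python
import time

# Heartbeats are easy to emit; the hard part is a control plane that tracks
# "last seen" per source and flags silence, since no event arrives to say
# "I am dead". Names and threshold below are assumptions for illustration.

HEARTBEAT_TIMEOUT_SECONDS = 60  # assumed threshold; tune per source

last_seen = {}  # source name -> timestamp of the most recent heartbeat

def record_heartbeat(source, timestamp=None):
    """Called whenever a heartbeat event arrives from an agent."""
    last_seen[source] = time.time() if timestamp is None else timestamp

def silent_sources(now=None):
    """Sources we expected to hear from but haven't, within the timeout."""
    now = time.time() if now is None else now
    return [s for s, seen in last_seen.items()
            if now - seen > HEARTBEAT_TIMEOUT_SECONDS]

# db01 just checked in; db02 went quiet five minutes ago.
record_heartbeat("db01")
record_heartbeat("db02", timestamp=time.time() - 300)
print(silent_sources())  # ['db02'] -- a green dashboard would have missed it
```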

Similar pains can hide behind lots of other use cases: the data platform vendor is, after all, not an expert in all the things their customers might like to see. Shall we normalize your data for security and compliance? I’d love to give you some business insight; maybe there’s a generally applicable problem to solve there? The problem with modeling data too closely to a use case is that it narrows your total addressable market to customers with that need. That might still work if you have a content team supporting a land-and-expand motion on a platform product, especially if you can attract partners… but if you’re small, you probably want to maximize the number of potential customers for your first product.

So that starts looking like doing some data refinement, but not too much. Maybe unravel those nasty nested JSON blobs into neat key-value pairs describing disambiguated resources, but stop before you’ve rebuilt the source vendor’s own dashboards. It’s tempting to proactively choose or invent an information model, of course… but that way lies madness. Don’t model until a customer demands it, and don’t force a customer to model; those were arguably time-saving operations in on-prem land, but in the cloud they’re either increasing your COGS (cost of goods sold) or the customer’s bill.
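
As a rough illustration of “some refinement but not too much”, a few lines like the following will unravel nesting into dotted key-value pairs without committing you to any particular information model; the event shown is invented:

```python
def flatten(blob, prefix=""):
    """Unravel nested JSON into dotted key-value pairs: one step of refinement
    past raw ingest, well short of imposing a full information model."""
    flat = {}
    for key, value in blob.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

# A made-up audit event, just to show the shape of the output.
event = {
    "resource": {"type": "database", "id": "db01", "region": "us-east-1"},
    "action": {"name": "backup", "status": "failed"},
}
print(flatten(event))
# {'resource.type': 'database', 'resource.id': 'db01',
#  'resource.region': 'us-east-1', 'action.name': 'backup',
#  'action.status': 'failed'}
```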

Most importantly, those models don’t always work because they’re not always built from the proper level of expertise. Preemptively modeling from the use-case side or the data-ingestion side without deep knowledge of the other side has failed several times in my career, and, to be clear, I’ve been the owner of those mistakes more than once. Ideally data model authors have deep strength in all the places customers expect, but it usually doesn’t take long to find the weaknesses in the real world. Witness Splunk CIM’s handling of proxy and web server logs, OCSF’s modeling of cloud services, or UDM’s scaling pain. Those weaknesses are really painful to fix post-deployment.

And so the best data tools spend more effort on flexible schemas that let you change your mind later, data ingest that widens the pipe as much as possible, basic reports, and flexibility instead of certainty. In other words, they are general purpose rather than specific, which makes them harder to use but more rewarding of investment.
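
One hedged sketch of what “flexible schemas that let you change your mind later” can look like in practice, using the schema-on-read idea rather than any particular product’s design: keep the raw event verbatim and derive fields when you read, so tomorrow’s extraction doesn’t mean re-ingesting yesterday’s data. The events and extraction rules below are invented.

```python
import json
import re

# Store the raw event untouched; interpretation happens at read time.
events = [
    '{"msg": "login failed", "user": "alice", "src": "10.0.0.5"}',
    "plain text line from some appliance: user=bob action=login result=ok",
]

def extract(raw):
    """Today's interpretation of the data; swap it out tomorrow, raw stays put."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to a naive key=value scrape for non-JSON lines.
        return dict(re.findall(r"(\w+)=(\S+)", raw))

for raw in events:
    print({"_raw": raw, **extract(raw)})
```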

