I’ve been reviewing the APIs of a number of software vendors lately, looking at how you pull data that they don’t support pushing. It’s producing a bit of a flashback to working with ugly things from the old days. Here’s a fun fact, apropos of nothing specific to any current project I’m sure… humans might assume that time-stamped data comes back sorted in order of time, but databases don’t actually do that unless you ask them to.
That fact has produced lots of subtle crap bugs in my life, from Python or Java or Perl or VBScript code sucking logs out of a database. Not only do you get drops and dupes because of clock skew and daylight saving time and time zones (that’s a given when you rely on time)… but you can also miss chunks of data entirely by assuming sorting happened. Maybe just a few lines, but maybe really big chunks, like days of data.
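Here’s a minimal sketch of how that kind of miss happens, using Python’s built-in sqlite3 module. The `logs` table, the `read_all` helper and the made-up integer timestamps are all assumptions for illustration, not anyone’s real schema. The reader pages through the table in small batches and treats the last row’s timestamp as its checkpoint, which is only safe if the rows actually came back sorted.

```python
import sqlite3


def read_all(conn, order_clause, batch_size=3):
    """Page through logs, using the last row's timestamp as the checkpoint."""
    checkpoint, seen = 0, []
    while True:
        rows = conn.execute(
            f"SELECT ts, msg FROM logs WHERE ts > ? {order_clause} LIMIT ?",
            (checkpoint, batch_size),
        ).fetchall()
        if not rows:
            return seen
        seen.extend(msg for _, msg in rows)
        # This is the hidden assumption: the last row is the newest one.
        checkpoint = rows[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ts INTEGER, msg TEXT)")

# Hypothetical data: one emitter with a fast clock writes ts=900 early.
stamps = [100, 101, 900, 102, 103, 104]
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)", [(ts, f"event@{ts}") for ts in stamps]
)

unsorted_result = read_all(conn, "")            # row order is up to the engine
sorted_result = read_all(conn, "ORDER BY ts")   # row order is what we asked for

print(len(unsorted_result), "rows collected without ORDER BY")
print(len(sorted_result), "rows collected with ORDER BY")
```

Without ORDER BY the engine is free to return rows in any order it likes. SQLite will typically hand them back in insertion order here, so the checkpoint jumps to 900 after the first batch and the later rows never show up. Asking for the sort fixes this particular miss, though it does nothing for skew or duplicate timestamps.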
Friends don’t let friends put logs in a database. Using a database for logging used to be the kind of bizarre Maslow’s Hammer outcome that stuck out as really strange, but it is growing popular in these modern times because for containers or serverless systems the database is a managed service and the filesystem is ephemeral.
Given this sad state of affairs, those friends also help each other to use an automatically ascending row ID instead of a timestamp when the database path is the only place to write logs. You know the great thing about an ascending row ID set by the database engine? It goes up with every row and it doesn’t duplicate, no matter what time the event emitters think they have. That makes it pretty easy for an event reader asking the database for rows to say “WHERE ROWID > CHECKPOINT”. It’s still a good idea to sort your data, but an ascending row ID is easier to work with than a timestamp. When all you have is time, and there’s no sorting? The reader code will seem to work and it’ll probably pass QA and it’ll ship, until someone looks really closely (usually because they have an incident and can’t troubleshoot it because the logs got rotated into /dev/null instead of being collected into safe storage).
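For contrast, here’s a rough sketch of the row ID approach, again with Python’s sqlite3 module. The table layout and the `emit` and `poll` helpers are hypothetical names for illustration. The emitters supply whatever timestamp they believe in, the database engine assigns the ID, and the reader only ever checkpoints on the ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # assigned by the engine
    " ts INTEGER,"                            # whatever the emitter claims
    " msg TEXT)"
)


def emit(conn, ts, msg):
    """Write one log row; the emitter never chooses the id."""
    conn.execute("INSERT INTO logs (ts, msg) VALUES (?, ?)", (ts, msg))


def poll(conn, checkpoint, batch_size=100):
    """Return rows with id above the checkpoint, plus the new checkpoint."""
    rows = conn.execute(
        "SELECT id, ts, msg FROM logs WHERE id > ? ORDER BY id LIMIT ?",
        (checkpoint, batch_size),
    ).fetchall()
    new_checkpoint = rows[-1][0] if rows else checkpoint
    return rows, new_checkpoint


emit(conn, 100, "a")
emit(conn, 900, "b (skewed clock)")
emit(conn, 100, "c (duplicate timestamp)")
first, checkpoint = poll(conn, 0, batch_size=2)

emit(conn, 50, "d (clock went backwards)")
second, checkpoint = poll(conn, checkpoint, batch_size=2)
third, checkpoint = poll(conn, checkpoint)

collected = [msg for _, _, msg in first + second + third]
print(collected)
```

The skewed, duplicated and backwards timestamps are still in the data, and still annoying for whoever reads the logs later, but they no longer decide which rows the reader picks up.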
The software industry operates by creative destruction, forgetting everything and starting over from scratch every few years. Now lots and lots of products deliver their logs by putting a REST API in front of a database full of logs. Those APIs almost never give you decent control over the SQL they’re producing, and they almost never expose row IDs. Sometimes you can’t even ask for a sort order at all, and time is the only way to indicate that you want a specific set of records. Anyone ever wonder if APIs like that are fit for the purpose of extracting a complete log of events?
The vendor doesn’t notice if they don’t use their own API for this purpose… and why would they? They’re able to get logs directly from the source instead of going through the API after all. The customer will eventually notice and complain, but only when it’s too late. At that point that one vendor will of course jump on it and fix the problem within six months, yay. But where does that leave the next vendor down the line? The forces producing this issue are global in nature, but discovery and remediation are local. When fixing a global issue only happens one bug at a time, it’s never noticed at industry scale… it’s just part of the friction that means we can’t have nice things.

