Tag: Operations
-

Sorting Events by Time
I’ve been reviewing the APIs of a number of software vendors lately, looking at how you pull data that they don’t support pushing. It’s producing a bit of a flashback to working with ugly things from the old days. Here’s a fun fact, apropos of nothing specific to any current project…
-

Maintenance Windows and Breakage
Lorin Hochstein recently wrote about normal incidents, “a result of normal work, when everyone whose actions contributed to the incident was actually exercising reasonable judgment at the time they committed those actions.” Instead of an accident or an error, it is an incident which is the outcome of proper behavior.…
-

Who the Tech Is Meant For
I’m pretty fascinated by the effect of social code matching in product design. In order to market and sell products you have to fit them to the buyer: language, use cases, pricing, packaging, sales motion, and more. In large and small ways, a company’s go-to-market or an open…
-

How Do I Drive Remediation SLAs?
Question: I want to get my organization to patch things in a timely fashion. How? Can I just set an SLA (Service Level Agreement) of “patch the criticals in 30 days” and track that? Speaking as a vendor who’s worked with patching systems for everything from big banks and government…
-

Using a Data Lake for Business Insight
At a former employer some of us used to joke internally that we made the world’s cheapest business intelligence tool and the world’s most expensive log search tool. Business intelligence (BI) use cases are cheap from a data platform perspective, because value and volume are inversely proportional. All the work…
-

Testing Product in the Field
DevOps: there is no QA, there is no infra, testing and support are everyone’s job. This works okay for unit-test-level work, but end-to-end functionality involving multiple teams breaks all the time. You can ask DevOps to take that on too, but they’ll just laugh. You can…
-

Shewhart Control Charts
As a monitor writer, I want to alert when a value has changed a lot, quickly, in one direction or another, but I don’t want to set hard-coded thresholds because the value’s range is expected to slowly evolve. My goal is to get useful alerts and avoid false alarms. Examples:…
-

Uptime nines aren’t equally distributed
Once upon a time, I worked at a hosting company… sadly, after a hardware upgrade gone wrong, the database server behind a customer’s website was sitting open on a data center floor with a cracked motherboard during their launch event. We provided an overall yearly uptime better than three nines…
-
VMBlog Post on Decentralization
Linking to this piece I wrote for VMblog: Why Decentralized Work Calls for Decentralized Data
-

Metrics and Observability
I wrote this as a Twitter thread in March of 2018, but the character constraints of Twitter at that time made it extremely cryptic. Also, it’s staged as a response to Splunk’s introduction of the metrics index… and to be honest, that’s no longer interesting to me. This is an…
