
Maintenance Windows and Breakage

A meme cartoon in three panels. Panel 1: a fellow rides a bicycle with a stick in his hand; the caption says “I’m receiving updates”. Panel 2: he jams the stick between the spokes of his front wheel; the caption says “I’m blocking updates”. Panel 3: he has crashed and is lying on the ground; the caption says “the rest of my updates keep moving”. The cartoon mocks the false sense of safety that comes from blocking change.

Lorin Hochstein recently wrote about normal incidents, which are “a result of normal work, when everyone whose actions contributed to the incident was actually exercising reasonable judgment at the time they committed those actions.” A normal incident is not an accident or an error; it is the outcome of proper behavior.

An example of proper behavior that causes incidents is system updating. Regardless of whether you’re using servers-as-cattle, servers-as-pets, or serverless infrastructure, the services that you operate are a maze of dependencies. Understanding that maze of dependencies is hard enough, but understanding the way that it interweaves with the rest of the world makes the task seem impossible.
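To make the maze concrete: even one installed library fans out into a surprising number of transitive dependencies. Here is a minimal sketch, assuming a Python service; `requests` is just a stand-in package, and `transitive_deps` is an illustrative helper, not a real library function.

```python
# A rough way to see the dependency maze: walk the transitive
# dependencies of one installed package using only the standard
# library. Illustrative sketch; transitive_deps is our own name.
import re
from importlib.metadata import PackageNotFoundError, requires

def transitive_deps(package, seen=None):
    """Collect every distribution reachable from `package`."""
    seen = set() if seen is None else seen
    try:
        reqs = requires(package) or []
    except PackageNotFoundError:
        return seen  # not installed locally; stop descending
    for req in reqs:
        # Keep just the distribution name: drop version pins,
        # extras ("[socks]"), and environment markers ("; ...").
        name = re.split(r"[\s\[<>=!;~]", req, maxsplit=1)[0]
        if name and name.lower() not in seen:
            seen.add(name.lower())
            transitive_deps(name, seen)
    return seen

if __name__ == "__main__":
    # "requests" is a stand-in; try whatever your service imports.
    deps = transitive_deps("requests")
    print(f"requests reaches {len(deps)} other packages: {sorted(deps)}")
```

Each of those packages ships its own updates on its own schedule, and that is before you count the operating system, the runtime, and everything on the other side of the network.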

So it’s no wonder that huge amounts of effort have gone into complexity reduction, abstraction layers, and process mitigations. Unfortunately, that effort does not successfully reduce incidents; it just makes the ones that do happen feel abnormal.

You cannot block all change that affects a system, so when you block some change, you push the risk of unexpected interactions toward certainty. “Unexpected” is the key word here: allowing change will certainly cause changes, but those changes are expected. Preventing the changes that you can control just produces a false sense of safety from change.
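To put a rough number on that drift toward certainty: if each deferred change carries even a small, independent chance of an unexpected interaction, the odds that at least one bites you climb quickly with the size of the backlog. A toy calculation follows; the 2% per-change figure is invented for illustration, not measured.

```python
# Toy model: each deferred change has an independent probability p of
# triggering one unexpected interaction. Batching n of them into a
# single maintenance window gives
#     P(at least one surprise) = 1 - (1 - p) ** n
# The 2% per-change risk is an invented, illustrative number.
p = 0.02

for n in (1, 5, 10, 25, 50, 100):
    at_least_one = 1 - (1 - p) ** n
    print(f"{n:>3} batched changes -> {at_least_one:.0%} chance of a surprise")
```

By fifty batched changes a surprise is more likely than not, and when it arrives, isolating which of the fifty caused it is far harder than it would have been had the changes landed one at a time.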

“But wait, don’t hundreds or thousands of organizations practice change control and maintenance windows?” Yes, they do: the goal is to provide a predictable, stable experience to their users. I’m saying that this does effectively control the pace and occurrence of change, but it does not make the changes any safer or more predictable. Instead, it makes change slower and more dangerous, and it often forces those organizations down poorly tested paths like rollback or restoring from backups. I think the chief benefit these organizations get from change control is that people gather to monitor the change and watch for side effects. I also think the odds of seeing unpredicted side effects are quite high.

So why do organizations do this? Because their users demand that changes to IT systems never disrupt their workflows. They cannot make that demand of other infrastructure, such as roads, but they can make it of the IT teams that they employ.

