
Security Chaos Engineering review

[Image: cover of the book Security Chaos Engineering]

Excellent book, introduction helpfully posted here.

I’ve read a goodly number of information security books; there’s a weird (good weird) feeling to this one. Obviously some of that is from Kelly Shortridge’s (and Aaron Rinehart’s?) eclectic interests: a recipe for Mexican hot chocolate is used as a process mnemonic, for instance. If you don’t see the brilliance of this GEB-esque riff on the meaning of security to societies, then you may not enjoy this book either.

That said, I find it intriguing that the book does not assume its reader is equally at home with synergistically bounding across a holistic web of chaos. It advises lots of structure and process layered together to assist the linear thinker with handling a non-linear world. That kind and empathetic outreach is an ideal way to approach life.

I also enjoy the book’s emphasis on experimentation: while documents and changelogs might technically exist, staff continuity is often a fond memory, and an experimental attitude of discovery is a very wise way to approach a production network. That said, the phrase “security chaos” is probably a non-starter for many organizations. In Enterprise Roshambo, disrupting production on purpose is heavily frowned upon. I recently spoke with someone who watched a chaos engineering team get shut down after their tools and programs sat unused, and it’s not hard to imagine that happening all over.

It’s a pleasure to see the Antifragile approach applied in a more realistic, or at least applicable-to-reality, fashion, with less VHW (vigorous hand-waving). There are good nuggets of clear advice here. There are also some glossed-over challenges, like “compliance needs still have to be met even if you think they’re not directly important”. Reality is complicated and many-faceted; any time someone says they’re describing “the real world” in opposition to a straw man, some truth is going to be left by the wayside.

However, those are the failings of, well, all the tech and business (and business-in-tech) books I’ve ever read. People gotta eat, books gotta sell, and this one is a lot less dumb than most.

Some neat concepts:

  • Repeatability, Accessibility, Variability (RAVE design)
  • Effort Investment Portfolio
  • Four Failure Modes Resulting from System Design
  • Distributed, Immutable, Ephemeral (DIE pattern)

It takes a while for “Coupling” to be defined in a way that adequately translates to the practical. The Danger Zone is tight coupling plus complex interactions, and critical infrastructure shouldn’t be like that… and we move on. Eventually ESB (Enterprise Service Bus) architecture is given as an example of loose coupling… but it’s not that Kafka makes ingest more secure, it’s that it compartmentalizes the ingest problem. Instead of defending a monolith against all input, you have multiple components that each might fail or defend themselves differently. However, a few pages later it’s noted that quirky, hard-to-debug failures come from the boundaries between components. Fear not: as might be expected from a book with lots of James Scott quotes, it is eventually explained. Coupling is the degree of flexibility in a system. A tightly coupled system can only operate one way (but is easily understood and documented). Loose coupling is the Goldilocks zone: enough commonality for legibility, without destroying resilience. ClickOps is an example of tight coupling because the human operator is bound into the process (although the human might do something unexpected through accident or distraction the next time they execute the process).
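To make the compartmentalization point concrete, here is a minimal sketch of my own (not from the book) contrasting a queue-fronted ingest path with the defend-the-monolith alternative: the boundary only validates and enqueues, and each downstream consumer trusts only what it needs and fails on its own, with an in-process queue standing in for something like Kafka.

```python
# Toy illustration of compartmentalized ingest: the boundary only validates
# and enqueues; each consumer trusts only the fields it needs and fails alone.
# queue.Queue is a stand-in for a real broker such as Kafka.
import json
import queue

events: "queue.Queue[dict]" = queue.Queue()

def ingest(raw: bytes) -> bool:
    """Accept or reject input at the edge; don't interpret it here."""
    try:
        event = json.loads(raw)
    except ValueError:
        return False                        # malformed input stops at the boundary
    events.put(event)
    return True

def billing_consumer(event: dict) -> None:
    amount = float(event.get("amount", 0))  # only the field this component needs
    if amount < 0:
        raise ValueError("negative amount rejected by billing")
    print(f"billed {amount}")

def audit_consumer(event: dict) -> None:
    print(f"audit: {json.dumps(event)[:200]}")  # a failure here doesn't stop billing

if __name__ == "__main__":
    ingest(b'{"amount": "12.50", "user": "alice"}')
    while not events.empty():
        event = events.get()
        for consumer in (billing_consumer, audit_consumer):
            try:
                consumer(event)
            except Exception as err:        # each consumer fails independently
                print(f"{consumer.__name__} failed: {err}")
```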

I liked the concept of, and could have used more attention to, interfaces. Can’t find the quote now, but I recently read something on social media like: “as a new QA engineer i find out which teams don’t like each other because that’s where I’ll find lots of bugs”. Vigorously nodding my head over here. Bugs are where intention and implementation differ, and you’ll get a lot more of that where people aren’t communicating well.

There is a good amount of focus on security as a function of building and delivering, which I love. The book suggests eliminating the idea that security resilience is different from regular old quality. Similarly, security’s place in the stack of engineering needs is usefully compared to database administration (hint: no one is calling for a DevDBAOps approach). It is ridiculous to keep security concepts at arm’s length, in a separate team that has to post facto review everything for risk. Might as well ask your lawyers to do code reviews.

I strongly agreed with this advice: use Dependabot, automate dependency updates, and don’t assume that pinning your requirements decreases your risk. Systems operate in a world that changes, and we have to change with them to be resilient. Requirement pinning should be a temporary measure to buy time for a bigger upgrade or dependency change. It’s very depressing to find a production system with lots of libraries pinned to years-dead, unsupported projects… unless you’re attacking that system.
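As a small illustration of treating pins as temporary rather than permanent, here’s a sketch of my own (not from the book) that scans a requirements file and reports every exact pin so it gets reviewed periodically; the requirements.txt name is just the usual Python convention.

```python
# Report exact version pins (==) in a requirements file so they are reviewed
# periodically instead of fossilizing; ranges and unpinned names are left to
# normal dependency-update tooling (Dependabot or similar).
import re
import sys

PIN = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*==\s*([^\s;#]+)")

def find_pins(path: str = "requirements.txt") -> list[tuple[str, str]]:
    pins = []
    with open(path) as fh:
        for line in fh:
            match = PIN.match(line)
            if match:
                pins.append((match.group(1), match.group(2)))
    return pins

if __name__ == "__main__":
    for name, version in find_pins(*sys.argv[1:2]):
        print(f"pinned: {name}=={version} -- is this pin still buying time for something?")
```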

Other practical tidbits for evolving to CI/CD begin with considering system-level goals: what is truly critical, and what can be thrown out the airlock during a crisis? Small and quick improvements begin with writing down how things work, then looking for ways to quickly improve. Small projects might be scripting a common ClickOps path. Big projects might be replacing flaky components with more boring tech and memory-safe languages: the oxidation project at Mozilla sounds interesting. Polyglot projects are rare, but not prima facie bad. Look at game development, for instance, where a high-speed engine (say QUAKE2.EXE) is coupled with high-iteration gameplay code (say POWER1.WAD). Security (or otherwise!) teams can get a lot of leverage by making “paved road” infrastructure for their colleagues: processes, tools, and policies that enable quick self-service. It’s important to note that the self-served goal may look like it’s outside of “security chaos”; the example given here is a dev kit framework that allowed projects to spin up quickly without making their own account handling and login infrastructure. A self-serve chaos monkey is not likely to get a lot of take-up from a stressed on-call SRE.
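For the “script a common ClickOps path” idea, here’s a deliberately hypothetical sketch (the endpoint, payload shape, and token variable are all made up for illustration): the same change that used to be a console click-path becomes a small script that is repeatable, reviewable, and diffable.

```python
# Hypothetical sketch: replace a console click-path ("add a firewall
# exception") with a small, reviewable script. The endpoint, payload shape,
# and FW_API_TOKEN environment variable are invented for illustration.
import json
import os
import urllib.request

API = "https://internal.example.net/api/v1/firewall/exceptions"  # hypothetical

def add_exception(cidr: str, reason: str) -> None:
    payload = json.dumps({"cidr": cidr, "reason": reason}).encode()
    request = urllib.request.Request(
        API,
        data=payload,
        headers={
            "Authorization": f"Bearer {os.environ['FW_API_TOKEN']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        print(response.status, response.read().decode())

if __name__ == "__main__":
    add_exception("203.0.113.0/24", "office egress, ticket reference goes here")
```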

The book has a good vulnerability prioritization technique. Is the new vulnerability easy to scale and automate, like Log4j? Then worry, and take action now. Does it take the attacker lots of complicated steps to reach their goal, like Rowhammer? Then wait for normal patching processes to take care of it. A new vulnerability in a common CPU will get a lot of press because it’s widespread and hard to patch, but it might also be very difficult to capitalize on. There is always a risk of someone very skilled and non-rational doing some straight-up malice like Dark Avenger from the olden days, but we like to think those people all work for a government agency today and are pointed at specific targets. Clever stealth is important if I want to read Kamala Harris’ messages, but probably a waste of effort if I just want to do industrial espionage against a bunch of startups, and certainly wasted effort if my goal is to install a ton of zombie agents that I can sell.
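Here’s that heuristic as a tiny sketch of my own (my framing, not the book’s): score how easily exploitation scales and automates against how long the attacker’s chain of steps is, and let that split “act now” from “normal patch cycle”.

```python
# Toy triage heuristic paraphrasing the book's framing: easy-to-automate,
# short-chain vulnerabilities get out-of-band action; long, fiddly attack
# chains ride the normal patch cycle.
from dataclasses import dataclass

@dataclass
class Vuln:
    name: str
    easy_to_automate: bool  # can exploitation be scripted at scale?
    steps_to_goal: int      # complicated steps before real impact

def triage(vuln: Vuln) -> str:
    if vuln.easy_to_automate and vuln.steps_to_goal <= 2:
        return "act now: mitigate or patch out of band"
    if vuln.easy_to_automate:
        return "schedule soon: automatable, but the long chain buys time"
    return "normal patch cycle: hard to scale, keep monitoring"

if __name__ == "__main__":
    for vuln in (
        Vuln("Log4j-style RCE", easy_to_automate=True, steps_to_goal=1),
        Vuln("Rowhammer-style fault attack", easy_to_automate=False, steps_to_goal=6),
    ):
        print(f"{vuln.name}: {triage(vuln)}")
```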

I must admit cringing a bit at the book’s embrace of IaC principles, here renamed Configuration as Code (CaC). Having written “Infrastructure as Code Sucks”, “Security Observability and Detection as Code”, and “Maintenance Windows and Breakage”, I guess I’m not a fan. Hermetic reproduction of systems is an impossible-to-hit goal in a chaotic world full of entropy, and while it is of course A Good Thing to fight entropy and build anyway, I think the Stuff-as-Code approach does it poorly because it increases complexity and decreases legibility. Yes, gardens are good, but no, robotic hydroponic containers are not the answer to everything.

Using security resources for testing and documentation is another paved road that is called out in an interesting way. It is worth noting that quality assurance is both decimated and hard to assess in the modern organization, and documentation has similar patterns, while research shows that both are highly valuable to resilient organizations. It makes sense to bundle valuable but difficult to prioritize functions in order to ensure that at least some resources are allocated instead of none at all.

Security Observability. Not a lot new here… NIST (National Institute of Standards and Technology) CIA (Confidentiality, Integrity, and Availability) is important to the operational mission and compliance drivers as well as security. There’s a kind of hand-wavy path from “CIA could be OKRed” through “metrics are subject to perversion / Goodhart’s Law” to “DORA metrics are neat”. I’m not sold on the four DORA metrics (deployment frequency, lead time for changes, change failure rate, and time to restore service) having a lot of applicability to security, but I can buy Security Observability needing operational maturity and DORA being a good measure of that maturity. I suppose it’s better to say I’m not against SLOs and SLAs in security, I’m actually largely for them, but it’s a lot like the anti-entropy gardening discussed above. Gardening is not a set-and-forget activity, and neither is supporting a service level. You could easily end up watering beds full of oxalis and dandelions. I’m also a bit bummed to see what I consider the first dip into the usual security marketing, with the 2020 US Census breach: two years’ worth of logs didn’t make it into the SIEM, which means of course the SIEM wasn’t used at all. However, this chapter does have valuable reminders that scalability (the systemic ability to take a hit) and automation (scripting of tasks that humans might do inaccurately) have their place.
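Since the four DORA metrics carry the weight in that argument, here’s a small illustrative sketch of my own showing how they fall out of plain deploy and incident records; the record shapes and numbers are invented.

```python
# Illustrative-only: the four DORA metrics (deployment frequency, lead time
# for changes, change failure rate, time to restore service) computed from
# made-up deploy and incident records.
from datetime import datetime, timedelta
from statistics import mean

deploys = [  # (change committed, change deployed, deploy caused an incident?)
    (datetime(2024, 5, 1, 9), datetime(2024, 5, 1, 15), False),
    (datetime(2024, 5, 2, 10), datetime(2024, 5, 3, 11), True),
    (datetime(2024, 5, 6, 8), datetime(2024, 5, 6, 9), False),
]
restores = [timedelta(hours=3)]  # time to restore, one entry per incident
window_days = 7

deploy_frequency = len(deploys) / window_days
lead_time_hours = mean((done - start).total_seconds() for start, done, _ in deploys) / 3600
change_failure_rate = sum(failed for *_, failed in deploys) / len(deploys)
restore_hours = mean(r.total_seconds() for r in restores) / 3600

print(f"deploys per day:      {deploy_frequency:.2f}")
print(f"lead time (hours):    {lead_time_hours:.1f}")
print(f"change failure rate:  {change_failure_rate:.0%}")
print(f"restore time (hours): {restore_hours:.1f}")
```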

Recovery gets some attention, which I like, and an embrace of the blameless culture. Another magic ritual is offered here, and I’m endlessly amused by imagining some of the oil, insurance, and finance people I’ve worked with reading it. This chapter also spends some time on the factors going into a decision to act or wait, whether pre-attack or post-attack. That said, I think the emotional element of that decision is a little glossed over: pre-attack there is a massive bias to wait, during the attack there is a massive bias to act, and post-attack there is a massive bias to act. Local rationality is called out as a reason for not hanging the poor operator… but maybe it should be considered for the manager who didn’t prioritize re-architecture and the SRE who nuked the attacker’s entry point instead of saving its logs. There is a good review of runbook (or playbook) requirements here, noting that these should have more context and guidance than specific actions. But what’s really, really good is the RCA (root cause analysis) section, comparing healthcare and infosec. Healthcare’s incentives are if anything even more misaligned, and the root cause can’t ever point back to “change the fundamental model”, so the RCA recommends retraining individuals to behave slightly differently. Sounds very familiar, as I complete another compliance-mandated KnowBe4 session.

I’ve been waiting to see the Jens Rasmussen safe operating envelope diagram, and I am not disappointed! Pages 268 and 269 have two variations, plus Lynn Margulis! Once you accept the premise that the environment is chaotically complex, evolutionary biology and ecology become highly relevant design patterns. The organization is a species, the market is an environment, and the incentives of competition avoidance, resource exploitation, and self-perpetuation are inexorable. After a great deal of “security is SRE”, I am pleased to see some use of product management and product design fundamentals, like empowered teams, product vision, user problems, personas, and user journeys. As one would hope, this setup is completed with an application of the scientific method to security engineering: developing hypotheses, testing them against reality, and adapting mental models to reflect a greater understanding of reality.

The book concludes with experience reports of applying security chaos principles at several large organizations. The first of those is UHG, and since publication of this book, one of UHG’s subsidiaries was hit with ransomware. That said, almost all of these organizations have been hit since the book was published in late 2023, and two of the three exceptions were hit in the five years prior to publication. This certainly seems to highlight the message that attack is an environmental pressure in a chaotic environment, and some loss will occur over time.

Five out of five of your Internet stars or points or whatever: if you care about operational systems security, you should read this book. It’s really good.