Festina Lente

Published by Jack Coates on November 30, 2024

There’s been no lack of writing about development processes and engineering practices in software development shops (which arguably is everything now). The consensus of research and punditry stands firm: festina lente, or make haste, slowly. It’s like learning a complex riff on guitar: slow is smooth, smooth is accurate, accurate is fast. When care is devoted to ensuring smooth and frequent releases, errors are reduced in number and impact. And yet business leaders chafe at the resulting capacity restriction on their desires, and we read about “founder mode” micromanagement as a desirable state. I suspect there are many bored executives at companies that have become default alive. That yearning for excitement can explain innovation teams; perhaps it also explains table-flipping a product team. The problem with doing things The Right Way is that it only reduces error. It doesn’t eliminate human frailty or entropy or chaos. Therefore the disingenuous or ignorant can always argue against it. They’ll say the cost of The Right Way is misplaced and that magic or bootstraps or crowdsourcing will do just as well, and through Brandolini’s Law (refuting nonsense takes far more effort than producing it) they can win.

When slow-is-fast Right Way principles are not embraced in software engineering, the result is typically chaos. This immature-process-induced chaos has three causes: features are unpredictable in scope or cost, releases are unpredictable in timing or outcome, and recovery from error is unpredictable in time or cost.

Features: there are many books written about the difficulty of scoping feature effort in software, and yet professionals somehow get it done to an acceptable level in many organizations. Why does it fail in the organization that is using immature processes, even though all the engineers were hired fresh from their experiences in successful projects at successful organizations? Because enterprise development is a team sport and scoping is reliant on the team executing together. There’s more than one way to produce a working product from frontend, backend, and middleware components, but there are thousands of ways to produce a failing product from the same parts. If each developer is working to an individual idea of correct process and design, chaos and recrimination will ensue.
Since the features are chaotic, the release is too. Major features are going to be missed. The target date is more or less fictional. The target purpose of the release and team membership will be open to conversation long after coding has started. Expect frequent reprioritizations, calls to “get back to basics”, and don’t forget to “squeeze this one in”.
Consequently, problem identification and recovery are more challenging than they might otherwise be.

To recap: immature processes produce inaccurate plans and chaotic releases. That in turn makes on-call work brutal, and since our teams are small (they always are), there’s lots of on-call work. Everyone’s burning the candle at both ends. Engineers get slower and worse at doing their actual jobs of planning and delivering quality features, because they’re busy monkeypatching the shipped mess for another day of functionality. There’ll be lots of calls to start subsections over and do them right this time, but without rethinking how delivery is done, these restarts aren’t likely to go well.

The root of the root problem is a chaotic release process. Chaotic releases mean the product is a mess which means on-call is hell. And in the Software as a Service world, on-call now comes with most development jobs. Let’s unpack the chaos-to-on-call relationship a bit more.

While it’s not a requirement for chaos, a chaotic release process is frequently justified by a small team size (as in, “we couldn’t possibly release software smoothly, there’s only N of us!”). Of course, the size of the team allocated to a function is a choice as well… but anyway, the teams are always smaller than desired, and people are on-call a lot of their time. On-call for badly maintained software is absolutely terrible. There’s useless paging from no-longer-meaningful monitors interrupting sleep. There’s surprise outages that you learn about because a customer calls an executive. There’s all-night bridges ending with wildly lucky guesses. For every brilliant bug find, there’s five “scale up the resources and we’ll try to fix it better tomorrow” incidents. There’s rolling back the latest change to see if that works, then explaining to some other team that their feature or fix is now unshipped. All of this is going on against the backdrop of life: sick kids, unhappy spouses, crowded apartments, hours of commuting, kitchen remodels. And in the morning, your slate of regular meetings is most likely unchanged. Bad on-call is hard on people’s lives, and it causes engineers to struggle with making or completing commitments. That means a lack of successful commitment to full end-to-end ownership, a lack of communication, and a lack of investment in improvement. In other words, being in on-call hell blocks the door to escaping on-call hell. Continued chaos is the natural result.

To fix the problem, make releases more ordered and less chaotic by reducing WIP (work in progress) and slowing the system down. Prioritize work that reduces on-call pain over work that drives new revenue. While that seems counter-intuitive, you can think of it as revenue churn prevention. It takes at least two, or realistically many more, dollars of NARR (New Annual Recurring Revenue) to make up for a dollar of churn. Customers churn when the system is unreliable. Churned customers can’t be upsold, so there’s a lot of pipeline gone. The system is unreliable when the developers can’t do a good job. So, fix the on-call pain, let people sleep and think, and let sales focus on upselling instead of responding to outages. As on-call load gets lighter, engineers are able to do their day jobs of new feature work better.
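The churn arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample figures are hypothetical, and the default ratio of 2.0 is simply the floor stated in the text, not a measured value.

```python
def narr_needed_to_offset_churn(churned_arr: float,
                                replacement_ratio: float = 2.0) -> float:
    """Return the new ARR needed to make up for churned ARR.

    replacement_ratio is an assumption: the post puts it at two dollars of
    NARR per dollar of churn at the very least, and realistically higher.
    """
    if churned_arr < 0 or replacement_ratio < 0:
        raise ValueError("churned_arr and replacement_ratio must be non-negative")
    return churned_arr * replacement_ratio


if __name__ == "__main__":
    churned = 100_000.0  # hypothetical ARR lost to unreliability
    for ratio in (2.0, 5.0):  # the stated floor, plus one made-up higher ratio
        needed = narr_needed_to_offset_churn(churned, ratio)
        print(f"ratio {ratio}: {needed:,.0f} of NARR to offset {churned:,.0f} of churn")
```

Seen this way, every dollar of churn that reliability work prevents is worth at least two dollars of new sales effort.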

It’s not enough to just fix on-call, though; the day job itself has to be fixed. We’ve all seen the aspirational roadmap with several planning buckets and a 50% delivery rate at best. Throw that out. Take sprints down to one week and strictly limit WIP to one ticket per engineer. Get that working, then expand. What if things that went on the roadmap actually shipped when you said they would? How would that change your relationship with customers? When you can deliver 90% of planned and roadmapped features reliably, you can expand scope.
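As a rough sketch of what those two rules look like when checked mechanically, the Python below flags engineers over the WIP limit and computes the delivery rate for a sprint. The `Ticket` shape, the status strings, and the function names are all hypothetical; only the limit of one ticket per engineer and the 90% threshold come from the text.

```python
from collections import Counter
from dataclasses import dataclass

WIP_LIMIT = 1            # one in-progress ticket per engineer, per the text
EXPAND_THRESHOLD = 0.90  # deliver 90% of planned work before expanding scope


@dataclass
class Ticket:
    key: str
    assignee: str
    status: str  # hypothetical statuses: "todo", "in_progress", "done"


def wip_violations(tickets: list[Ticket]) -> dict[str, int]:
    """Map each engineer over the WIP limit to their in-progress count."""
    counts = Counter(t.assignee for t in tickets if t.status == "in_progress")
    return {who: n for who, n in counts.items() if n > WIP_LIMIT}


def delivery_rate(planned: list[Ticket]) -> float:
    """Fraction of the planned tickets that ended the sprint done."""
    if not planned:
        return 0.0
    return sum(t.status == "done" for t in planned) / len(planned)


def ready_to_expand(planned: list[Ticket]) -> bool:
    """True when nobody is over the WIP limit and delivery meets the threshold."""
    return not wip_violations(planned) and delivery_rate(planned) >= EXPAND_THRESHOLD
```

Run against a week's planned tickets, `ready_to_expand` stays false until both conditions hold, which is the "get that working, then expand" gate in miniature.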

This is a radical change for an organization to take on, and it will need support. You’ll need to clear the decks of useless cruft and thread the needle between the improvements you’re requesting and the excitement that leaders crave. Getting things done isn’t exactly hard, but it does not allow shortcuts. You can’t do it alone, but you can get support by calling shots and hitting them.
