I've referenced Chesterton's fence several times in this newsletter, his precautionary aphorism about the hidden purpose of complexity. Here’s my brief translation of the idea:
It's important to understand the purpose of an existing system before tearing it down: why it was built the way it was, and the network of other things the system interconnects with.
This principle provides a rich vein to mine for examples of it being disrespected. We've all seen them, and we all have biases that sometimes send us into the same trap.
A great example of a related idea comes from engineering: "second system syndrome". Fred Brooks described it (as the "second-system effect") back in his 1975 classic The Mythical Man-Month. Not a new concept! Obscured complexity in systems has been understood for a long time.
The first working version of any system comes through a process full of trade-offs, sacrifices, unfortunate decisions, omissions, and a host of other real-world necessities that the designers had no choice but to make. To ever get the thing into reality, the builder must work within the bounds of resources, available tooling, time, and money. Not to mention their own lack of expertise about certain things, and human fallibility. There's benefit to keeping things simple while you get the thing to market.
Say we've got a system built. "Version 1" is out in the wild. It works, but during the process of building it we made note of 100 other things we opted not to do, for reasons of cost, time, lack of knowledge, etc. One of the engineers says "when we come back around for v2, we'll do it right." The list of "v2" ideals is long and beautiful. It's going to be glorious.
But it's so rarely this clean.
The utopian mission of the "rebuilder"
Most of the time the motivation to start over on a product — to greenfield the whole thing with a clean plot of land and build the perfect architecture — doesn't mount by v2. In reality it's more like versions 1 through 100 are an ever-metastasizing nest of complexity. But each tendril of complexity was created for a purpose, even if done short-sightedly or imperfectly. It's a dense mass that's hard to pull apart and examine, but each little nasty, ugly detail was put there to solve a problem.
The utopian inclination to rebuild comes from two types of people:
Those who worked on the first system, are overconfident from that experience, and carry a long wishlist of things they had to sacrifice
New people who had nothing to do with the current system, who look at it with fresh eyes and, in their ignorance, think "eww, I can do it better"
Then, if either of them gets permission to raze the building and start over, they're going to suffer from two things they either didn't sign up for or didn't appreciate (typically both):
Because they're starting over, expectations of perfection from all stakeholders go through the roof; everyone assumes it'll be fresh and clean (and fully functional!)
Succeeding with the second system will require meticulous forensic understanding of all the hidden details, the obscure things the system is doing that — turns out — every user is relying on
Way back in 2000, Joel Spolsky wrote what I think is the canonical breakdown of this problem, in a post appropriately called "Things You Should Never Do". He uses an example every software person will recognize: the nasty-looking function that sprawls across two pages. "Why is it two pages? What a mess! We should rewrite it all."
Forgive the long quote, but this is just a fantastic explainer (emphasis mine):
Back to that two page function. Yes, I know, it’s just a simple function to display a window, but it has grown little hairs and stuff on it and nobody knows why. Well, I’ll tell you why: those are bug fixes. One of them fixes that bug that Nancy had when she tried to install the thing on a computer that didn’t have Internet Explorer. Another one fixes that bug that occurs in low memory conditions. Another one fixes that bug that occurred when the file is on a floppy disk and the user yanks out the disk in the middle. That LoadLibrary call is ugly but it makes the code work on old versions of Windows 95.
Each of these bugs took weeks of real-world usage before they were found. The programmer might have spent a couple of days reproducing the bug in the lab and fixing it. If it’s like a lot of bugs, the fix might be one line of code, or it might even be a couple of characters, but a lot of work and time went into those two characters.
When you throw away code and start from scratch, you are throwing away all that knowledge. All those collected bug fixes. Years of programming work.
You are throwing away your market leadership. You are giving a gift of two or three years to your competitors, and believe me, that is a long time in software years.
You are putting yourself in an extremely dangerous position where you will be shipping an old version of the code for several years, completely unable to make any strategic changes or react to new features that the market demands, because you don’t have shippable code. You might as well just close for business for the duration.
You are wasting an outlandish amount of money writing code that already exists.
Now Joel was talking about code, but the same principles apply to any complex system. Even though the code was a mess, he emphasizes that the mess is paradoxically valuable. The disorderly function embeds knowledge that's easy to toss out if we're not careful.
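To make that concrete, here's a hypothetical sketch of what such a function tends to look like after a few years in production. None of this is Joel's actual code; every name and scenario below is invented for illustration, but each odd-looking branch stands in for a real bug fix someone once paid for.

```python
# Hypothetical example: a "simple" config loader after a few years in production.
import os


def load_user_config(path):
    # Fix for a report where the path arrived with a trailing newline
    # when pasted from a shell script.
    path = path.strip()

    # Fix for first-run installs where the config file doesn't exist yet;
    # silently fall back to defaults.
    if not os.path.exists(path):
        return {}

    # Fix for a crash on files saved by an old version of the app that
    # wrote a UTF-8 BOM; utf-8-sig strips it if present.
    with open(path, encoding="utf-8-sig") as f:
        raw = f.read()

    # Fix for configs truncated mid-write when the machine lost power;
    # treat an empty or partial file as "no config".
    if not raw.strip():
        return {}

    config = {}
    for line in raw.splitlines():
        # Fix for users who added comments even though the format never
        # officially supported them.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config
```

Rewrite that from a blank file and every one of those little hairs has to be rediscovered the hard way.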
Is there a solution?
Refactoring, cleanup, and incremental redesign are almost always an option, even when we don't want them to be. The problem is that reworking a functioning complex system in sustainable ways is a tedious process: one slice at a time, old and new running side by side, as in the sketch below.
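Here's a minimal sketch of that incremental approach (my own illustration, not a prescription from anyone quoted here): keep the legacy code authoritative, run the rewritten slice alongside it, and log every disagreement before trusting the new path. All names and numbers are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def legacy_total(order):
    # The battle-tested original, quirks and all. Authoritative for now.
    subtotal = sum(item["qty"] * item["price"] for item in order["items"])
    # One of those "why is this here?" lines: a rounding nudge some
    # long-time customer depends on.
    return round(subtotal + 0.005, 2)


def new_total(order):
    # The clean rewrite of one small slice of the system.
    subtotal = sum(item["qty"] * item["price"] for item in order["items"])
    return round(subtotal, 2)


def total(order, use_new_path=False):
    old_result = legacy_total(order)
    try:
        new_result = new_total(order)
    except Exception:
        logger.exception("new path failed; keeping legacy result")
        return old_result

    if new_result != old_result:
        # Every mismatch is a hidden rule the rewrite hasn't learned yet.
        logger.warning("mismatch: legacy=%r new=%r", old_result, new_result)

    # Flip the flag only after mismatches stop showing up in production.
    return new_result if use_new_path else old_result
```

Tedious? Absolutely. But every logged mismatch is a piece of that embedded knowledge surfacing before it gets thrown away.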
The old veteran wants to start over because he knows about the years of messiness and "bad" decisions baked in. The new expert wants to skip the incremental rework because it's tedious, uninteresting, and has the air of "not invented here" about it. Engineers want to build, not edit.
If you're going to tackle the complete second system rebuild, then you have some issues to sort out. You have to make the time to examine the inner workings of the system. You have to do the homework to understand what works, what doesn't, what the system does, and why it was done a certain way (again, tedious). You have to temper stakeholders' expectations of perfection (really hard).
If you've taken to heart the lessons and failure stories of second system syndrome, yet you still think it needs doing, remember Gall's Law:
A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.
If you're starting over, start simple (which requires those tempered expectations...). After all, the system you're replacing — if it's a functional one — started that way.
If the code you're rewriting or the vehicle assembly process you're redesigning is 15 years old, there's 15 years of learning, bug fixes, and problem resolution baked into that messy rat's nest of a system you want to rebuild. To me it's frankly hubris to believe we can start over without incurring enormous unexpected cost.
Not that that cost isn't worth it. Sometimes it might be. But usually it's a mistake to tear down what's working without a disciplined process to understand it.