Why Big Rewrites Fail
Every few years, a team looks at their codebase and decides to start over. The reasoning always sounds solid: the architecture is holding us back, the tech is outdated, we'll move faster with a clean foundation. And every few years, that decision destroys a product, a company, or at minimum a year of engineering time that could have gone toward features users actually wanted.
The rewrite debate in software engineering is over 25 years old. Joel Spolsky wrote his famous "Things You Should Never Do" essay in 2000. Fred Brooks described the second-system trap in 1975. The evidence has been accumulating for decades - and companies keep making the same mistake.
I recently wrote about my own experience with this decision across ad-tech monoliths and govtech platforms. That was the personal view - war stories and a practitioner's framework. This is the industry view. What happens when you zoom out and look at every major rewrite and migration attempt over the past 25 years? The pattern is remarkably consistent. The failures share the same traits. The successes share different ones. And the line between them is clearer than most teams realize when they're standing at the crossroads.
The Graveyard
The failures share a pattern so consistent it's almost formulaic. Big-bang replacement, feature freeze on the old system, a timeline that doubles and then doubles again, and competitors gaining ground while your team rebuilds what already existed.
Netscape: The Canonical Disaster
In 1998, Netscape decided to rewrite their browser from scratch. Navigator 4.0 was messy, sure - years of accumulated code, rushed releases, the usual. But it worked. It had market share. And then they threw it away.
Version 5.0 never shipped. Three years of development produced a buggy, feature-incomplete version 6.0 while Internet Explorer went from challenger to near-total market dominance. Lou Montulli, one of Navigator's five original engineers, wrote to Joel Spolsky confirming: "I agree completely, it's one of the major reasons I resigned from Netscape." The rewrite didn't just fail technically. It killed the company.
Spolsky turned this into his 2000 essay calling a from-scratch rewrite "the single worst strategic mistake that any software company can make." Twenty-five years later, people keep proving him right.
Borland: The Same Mistake, Twice
Borland managed to repeat the pattern on two products simultaneously. Their dBase for Windows rewrite took so long that Microsoft Access launched first and ate the market. Their Quattro Pro rewrite shipped with significantly fewer features than the DOS version it replaced. Two products, same company, same mistake, same result - competitors shipped while Borland rebuilt.
Chandler: Six Years and Nothing to Show
Mitch Kapor - the guy who created Lotus 1-2-3 - poured millions of his own money into Chandler, a personal information manager that was going to be the next big thing. Development started in 2002. By 2008, after six years and an entire book documenting the project's dysfunction ("Dreaming in Code" by Scott Rosenberg), the team had produced a barely usable preview release. Meanwhile, Google Calendar launched for free and solved the same problem.
Kapor had the resources, the reputation, and the talent. What he didn't have was scope discipline. The project kept expanding - email, calendaring, task management, sharing, synchronization - and the clean-slate architecture meant every feature had to be built from zero.
The Pattern
Every one of these failures shares the same traits. A big-bang approach with no incremental value delivery along the way. Development frozen on the old system while the new one crawls toward feature parity. Timelines that blow past estimates by 2-3x. Institutional knowledge embedded in the old code - years of bug fixes, edge-case handling, adaptations to real-world conditions - quietly discarded and then painfully rediscovered. And competitors, unburdened by a rewrite, shipping features and gaining ground the entire time.
The pattern is so reliable that the interesting question isn't why rewrites fail. It's why companies keep attempting them despite the evidence.
The Survivors
The companies that successfully modernized their systems did something the failures didn't. They kept shipping. Every one of them found a way to replace pieces of the old system while the whole thing stayed running and users kept getting value.
Twitter: Five Years, One Component at a Time
Twitter's Ruby on Rails backend was buckling under scale by 2008. The Fail Whale - that cartoon error page - became a meme because it appeared so often. But Twitter didn't shut down and rebuild. They replaced components one at a time over five years.
Message queues first. Then storage. Then search - replacing the Ruby search frontend with a Java server cut search latencies to roughly a third of what they had been. Then the frontend. Each phase delivered measurable improvements while the service continued operating. The Fail Whale didn't disappear in a dramatic cutover. It faded out gradually as each bottleneck was addressed.
By 2013, the migration from Ruby to Scala/JVM was essentially complete. No feature freeze. No lost market share. No three-year gap where competitors could catch up.
Facebook: They Didn't Rewrite PHP - They Evolved the Language
Facebook's approach was even more radical. Instead of migrating away from PHP, they changed what PHP was.
In 2009, HipHop compiled PHP to C++ and doubled performance. In 2010-2013, HHVM replaced it with a virtual machine that compiled PHP on the fly for even better performance. In 2014, Hack introduced gradual typing that allowed file-by-file migration from PHP to a typed language. The entire codebase was converted to Hack "thanks to both organic adoption and a number of homegrown refactoring tools" - all while Facebook continued shipping features to billions of users.
Nobody on the team ever had to stop delivering value to rewrite infrastructure. The language evolved underneath them.
Shopify: The Modular Monolith
When Shopify's Ruby on Rails codebase hit 2.8 million lines with over 1,000 developers working on it, the conventional wisdom said: decompose into microservices. They went a different direction.
Instead of breaking the monolith into dozens of independently deployed services (with all the operational complexity that brings), they enforced boundaries within the monolith using their open-source tool Packwerk. Business domains got clear interfaces. Code stayed in one deployable unit. The result handles 32+ million requests per minute and hundreds of deployments daily - without the distributed systems headaches that microservices introduce.
Shopify's bet was that the monolith wasn't the problem. The lack of boundaries was the problem. Fix the boundaries, keep the monolith.
Stripe: Every API Version Still Works
Stripe has maintained backward compatibility with every API version since 2011. Think about that - over a decade of API versions, all simultaneously supported.
When they needed to fundamentally redesign their payments model, a small team built the PaymentIntents API from first principles between 2017 and 2019. Then migration bridges connected old and new abstractions. Per-request version pinning with compatibility modules meant no customer was forced to migrate on Stripe's schedule.
The compatibility layers are ugly. They're also what lets Stripe ship fundamental changes without breaking millions of integrations.
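Stripe hasn't published the internals, but the shape of per-request version pinning is easy to sketch: a caller declares the API version it integrated against, and a chain of per-version compatibility modules rewrites the current response backward until it matches that version. The dates, field renames, and function names below are invented for illustration, not Stripe's actual code.

```python
from typing import Callable

# Hypothetical sketch of per-request API version pinning. The response is
# rendered in the newest shape, then rewritten backward one version at a
# time until it matches the version the caller is pinned to. Version dates
# and field names are invented.

VERSIONS = ["2019-02-11", "2022-08-01", "2023-10-16"]  # oldest -> newest

# Each entry rewrites a response from this version's shape to the previous one.
DOWNGRADES: dict[str, Callable[[dict], dict]] = {
    # Pretend 2023-10-16 renamed `charges` to `payments`; older callers still see `charges`.
    "2023-10-16": lambda r: {**{k: v for k, v in r.items() if k != "payments"},
                             "charges": r.get("payments", [])},
    # Pretend 2022-08-01 dropped a deprecated field; restore it for older callers.
    "2022-08-01": lambda r: {**r, "legacy_status": r["status"]},
}

def render_for(pinned_version: str, latest: dict) -> dict:
    """Walk backward from the newest version to the caller's pinned version."""
    response = dict(latest)
    for version in reversed(VERSIONS):
        if version == pinned_version:
            break
        if version in DOWNGRADES:
            response = DOWNGRADES[version](response)
    return response

# A caller pinned to 2019-02-11 still gets the shape they integrated against.
print(render_for("2019-02-11", {"id": "pi_123", "status": "succeeded", "payments": []}))
```

Each new API version adds one small downgrade module rather than a fork of the API, which is what keeps a decade of versions maintainable.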
The Pattern
The contrast with the failures is stark. Where Netscape stopped everything for three years, Twitter replaced pieces over five. Where Chandler tried to build everything from scratch, Facebook evolved what they had. Where conventional wisdom said "decompose," Shopify said "add boundaries."
Every successful migration shares these traits: incremental replacement with continuous value delivery, preserved institutional knowledge (nobody threw away working code and started guessing), no competitive window for rivals, reversibility at each step, and compatibility layers enabling old and new to coexist.
The strategy has a name. Martin Fowler called it the Strangler Fig pattern in 2004, after watching rainforest vines in Queensland that germinate in a tree, grow around it, and eventually replace the host entirely. Grow the new alongside the old. Redirect behavior gradually. Decommission the legacy only when the new system has fully taken over.
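In code, the load-bearing piece is unglamorous: a routing facade in front of both systems that sends migrated paths to the new service and everything else to the legacy one. A minimal sketch, where the routes, URLs, and rollout percentages are placeholders rather than anyone's production setup:

```python
import random

# Illustrative strangler-fig routing facade. Migrated paths go to the new
# service, everything else still hits the legacy system. Routes, URLs, and
# rollout percentages are placeholders.

LEGACY_BASE = "https://legacy.internal.example.com"
NEW_BASE = "https://new.internal.example.com"

# Routes already strangled, with a rollout percentage for gradual cutover.
MIGRATED_ROUTES = {
    "/search": 100,   # fully served by the new system
    "/profile": 25,   # 25% of traffic; the rest stays on legacy for now
}

def choose_backend(path: str) -> str:
    """Pick which system serves this request."""
    for prefix, rollout_pct in MIGRATED_ROUTES.items():
        if path.startswith(prefix):
            return NEW_BASE if random.randrange(100) < rollout_pct else LEGACY_BASE
    return LEGACY_BASE

# Rollback is a one-line change (set a route back to 0); decommissioning the
# legacy system is the day every route reads 100 and LEGACY_BASE goes quiet.
print(choose_backend("/search?q=strangler+fig"))   # new system
print(choose_backend("/billing/invoices"))         # legacy system
```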
The Second-System Trap
Fred Brooks explained in 1975 why rewrites tend toward bloat, and nobody has improved on his explanation since.
During the lifetime of a first system, designers accumulate a wish list. Features they couldn't ship because of deadlines. Architectural choices they regret. Shortcuts they had to take. When the opportunity for a rewrite comes, they dump every deferred wish into the second system. "This second is the most dangerous system a man ever designs."
This is different from normal scope creep. A greenfield product can cut scope ruthlessly - you're building something new, and "we'll add that in v2" is a legitimate answer. A rewrite can't cut scope the same way because it has a floor: everything the old system already does. Users expect at least feature parity. Then the team adds "and while we're at it, let's also improve..." on top of that floor. The scope balloons in a way that's unique to rewrites.
The Chandler project from the graveyard section is the purest example of this mechanism. Each new capability was added because the team was already rebuilding everything - why not do it right this time? The scope expanded from a simple organizer to a full communication suite, and "doing it right" consumed the entire budget.
Brooks recommended making resource costs visible per feature, resisting what he called "functional ornamentation," and ensuring architectural leadership from people who had designed at least three comparable systems. Fifty years later, this advice remains underused.
The Playbook
Four people shaped how the industry thinks about this problem. They all arrived at the same conclusion from different directions.
Joel Spolsky made the business case in 2000. His core insight: "It's harder to read code than to write it." Developers consistently overestimate how messy old code is and underestimate the cost of reproducing its behavior. Those ugly lines - the weird edge cases, the inexplicable conditionals, the comments that say "don't touch this" - are usually hard-won solutions to real problems. Throwing them away means rediscovering every one of those problems the hard way.
Martin Fowler provided the architectural strategy. His Strangler Fig pattern - the approach every successful migration in this article used - gives teams a concrete method: build new functionality alongside the legacy system, incrementally redirect behavior from old to new, and eventually decommission the original. Fowler emphasizes the necessity of transitional architecture - temporary code that allows old and new to coexist. It looks wasteful. It dramatically reduces risk.
Michael Feathers wrote the practitioner's toolkit. His definition of legacy code - "code without tests" - reframes the entire problem. What makes code dangerous to change isn't age or ugliness. It's the absence of automated verification. His characterization tests capture what code actually does (not what it should do), creating a safety net for refactoring. His Legacy Code Change Algorithm gives teams a step-by-step process: identify change points, find test points, break dependencies, write tests, make changes. The whole approach is fundamentally anti-rewrite.
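The mechanics are simple enough to show. A characterization test doesn't assert what the code should do; it pins down what the code does today, so later changes have something to diff against. A minimal sketch, with a made-up legacy function standing in for whatever you actually need to change:

```python
import pytest

# Characterization-test sketch. `legacy_shipping_cost` is a stand-in for
# whatever tangled function you actually need to change.

def legacy_shipping_cost(weight_kg: float, country: str) -> float:
    # Imagine this is ten years old and full of "don't touch this" comments.
    cost = 4.99 + 1.25 * weight_kg
    if country == "US":
        cost *= 0.9        # nobody remembers why, but customers rely on it
    if weight_kg > 20:
        cost += 15.0       # oversized surcharge added after some long-ago incident
    return cost

def test_characterizes_current_shipping_behavior():
    # Expected values come from running the code, not from reading a spec.
    # They document what the system does today, quirks and all, so a refactor
    # has something concrete to diff against.
    assert legacy_shipping_cost(1.0, "US") == pytest.approx(5.616)
    assert legacy_shipping_cost(1.0, "DE") == pytest.approx(6.24)
    assert legacy_shipping_cost(25.0, "US") == pytest.approx(47.616)
```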
Kent Beck crystallized it in 2012: "For each desired change, make the change easy (warning: this may be hard), then make the easy change." Rather than fighting existing code structure, reshape it incrementally to accommodate what you need. His "two hats" rule - you're either adding functionality or restructuring code, never both simultaneously - prevents the scope confusion that turns refactoring into accidental rewrites.
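A tiny, hypothetical illustration of the two hats: to add SMS alerts to an email-only notifier, first restructure with no behavior change, then make the now-easy change.

```python
# "Two hats" sketch (hypothetical example): the goal is to add SMS alerts to a
# notifier that only knows about email.

# Hat 1 - restructure, no behavior change: the hard-coded email call becomes a
# dispatch table. Every existing caller gets exactly the same result as before.
def send_email(user: dict, message: str) -> None:
    print(f"email to {user['email']}: {message}")

CHANNELS = {"email": send_email}

def notify(user: dict, message: str) -> None:
    CHANNELS[user.get("channel", "email")](user, message)

# Hat 2 - the feature, now an easy change: one new entry, nothing existing touched.
def send_sms(user: dict, message: str) -> None:
    print(f"sms to {user['phone']}: {message}")

CHANNELS["sms"] = send_sms

notify({"email": "a@example.com"}, "invoice ready")                # unchanged path
notify({"phone": "+15550100", "channel": "sms"}, "invoice ready")  # new capability
```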
The common thread across all four: don't throw away working code. Make it incrementally better. The unglamorous path is almost always the safer one.
When Context Changes Everything
The blanket "never rewrite" advice ignores that risk varies dramatically depending on what you're rewriting. A frontend is not a database. A microservice is not a monolith. The same decision framework needs different weights in different contexts.
Frontend Rewrites Carry Lower Risk
UI frameworks evolve on 3-5 year cycles. Frontend code is typically decoupled from data stores. And the blast radius of a frontend failure is contained - you can roll back to the old UI without losing data or breaking integrations. Netlify rewrote their entire frontend from Angular to React because their JAMstack architecture (standalone UI talking to APIs) meant the rewrite risk was genuinely low. Smartly.io migrated AngularJS to React incrementally with a simple rule: "New code? React. Refactoring old code? React maybe."
Backend rewrites are far more dangerous because they involve APIs, data models, persistent state, and business logic that has accumulated edge-case handling for years. The consequences of getting it wrong include data loss, broken integrations, and violated SLAs.
Database Migrations: Almost Always Evolutionary
Database schema changes are widely considered the riskiest area in application development. The Expand and Contract pattern - deploy new schema alongside old, dual-write during transition, migrate data, switch reads, then remove old schema - exists specifically because big-bang database migrations fail in ways that are hard to recover from. Martin Fowler and Pramod Sadalage pioneered evolutionary database design at ThoughtWorks, treating schema changes as small, reversible, version-controlled migrations. The rule is straightforward: evolve the schema unless you're moving to a fundamentally different data model.
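A minimal sketch of how those steps line up, using SQLite and invented table and column names - the point is the ordering, not the SQL:

```python
import sqlite3

# Expand-and-contract sketch against SQLite (table and column names invented):
# split a single `name` column into `first_name`/`last_name` with no big-bang
# cutover and no moment where old code stops working.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# 1. Expand: add the new columns alongside the old one. Existing readers and
#    writers keep working untouched.
db.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
db.execute("ALTER TABLE users ADD COLUMN last_name TEXT")

# 2. Dual-write: during the transition, application code writes both shapes.
def create_user(name: str) -> None:
    first, _, last = name.partition(" ")
    db.execute(
        "INSERT INTO users (name, first_name, last_name) VALUES (?, ?, ?)",
        (name, first, last),
    )

# 3. Backfill existing rows into the new columns.
db.execute("""
    UPDATE users
    SET first_name = substr(name, 1, instr(name, ' ') - 1),
        last_name  = substr(name, instr(name, ' ') + 1)
    WHERE first_name IS NULL
""")

# 4. Switch reads to the new columns and verify. 5. Contract: drop `name` in a
#    later migration, once nothing references it. Every step is small and
#    reversible until that final drop.
create_user("Grace Hopper")
print(db.execute("SELECT first_name, last_name FROM users").fetchall())
```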
The Monolith Question
Martin Fowler's observation from 2015 holds up: "Almost all the successful microservice stories have started with a monolith that got too big and was broken up." The modular monolith - single deployment with enforced module boundaries - has emerged as a middle ground that Shopify, Stack Overflow, and Basecamp all chose over full decomposition. Start monolithic, add boundaries when complexity demands it, decompose only the pieces that genuinely need independent deployment.
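Packwerk is Ruby-specific, but the idea it enforces is portable: each domain declares which other domains it may depend on, and the build fails when code crosses an undeclared boundary. A rough Python sketch of that kind of check, with invented package names - not Packwerk, just the shape of the idea:

```python
import ast
import pathlib

# Each top-level package under app/ declares which other packages it may
# depend on; the check flags `from x import ...` statements that cross an
# undeclared boundary. Package names are invented.

ALLOWED_DEPENDENCIES = {
    "orders":   {"payments", "shared"},
    "payments": {"shared"},
    "shared":   set(),
}

def boundary_violations(root: str = "app") -> list[str]:
    violations = []
    for path in pathlib.Path(root).rglob("*.py"):
        package = path.relative_to(root).parts[0]
        for node in ast.walk(ast.parse(path.read_text())):
            if isinstance(node, ast.ImportFrom) and node.module:
                target = node.module.split(".")[0]
                if (target in ALLOWED_DEPENDENCIES
                        and target != package
                        and target not in ALLOWED_DEPENDENCIES.get(package, set())):
                    violations.append(f"{path}: {package} may not import from {target}")
    return violations

# Run in CI: a non-empty list fails the build, which is what turns "please
# respect the boundaries" from a convention into an enforced constraint.
if __name__ == "__main__":
    for violation in boundary_violations():
        print(violation)
```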
The AI Variable
The most significant recent development is AI's impact on migration timelines. Things that were previously too expensive to do incrementally - tedious file-by-file migrations, mechanical syntax transformations, boilerplate modernization - are now feasible at speeds that change the math entirely.
Airbnb migrated roughly 3,500 React test files from Enzyme to React Testing Library in six weeks. The old estimate was a year and a half. Their pipeline used AI-powered processing with state machines, configurable retry loops, and rich context injection - not a developer manually converting each file, but not a blind find-and-replace either. Salesforce compressed a two-year legacy code migration to four months using dependency-graph-driven leaf-to-root AI refactoring. Qonto's Ember-to-React migration with Claude reported output exceeding their 200-lines-per-day manual target "by orders of magnitude."
These results have a pattern worth noting: AI favors structured incremental migration over big-bang rewrites. The most successful approaches embed AI within disciplined pipelines - file-by-file processing with validation gates, automated test execution, and human review for the remaining edge cases. The tool handles the mechanical grinding. The humans handle the judgment calls about architecture and design.
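The loop is worth sketching, because the discipline lives in the pipeline rather than the model. In this sketch the model call and the test command are placeholders for whatever you actually use; the structure is the point.

```python
import subprocess

# Sketch of a file-by-file migration loop with validation gates. `llm_convert`
# is a placeholder for whatever model you call, and the test command is
# likewise illustrative. No converted file lands without passing its own
# tests, and anything the retries can't fix goes to a human instead of into
# the codebase.

MAX_RETRIES = 3

def llm_convert(source: str, extra_context: str = "") -> str:
    raise NotImplementedError("call your model of choice here")

def tests_pass(path: str) -> tuple[bool, str]:
    result = subprocess.run(
        ["npx", "jest", path],            # illustrative: run just this file's tests
        capture_output=True, text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr

def migrate_file(path: str) -> str:
    original = open(path).read()
    feedback = ""
    for _ in range(MAX_RETRIES):
        converted = llm_convert(original, extra_context=feedback)
        open(path, "w").write(converted)
        ok, output = tests_pass(path)
        if ok:
            return "migrated"
        feedback = output                  # retry with the failure output as added context
    open(path, "w").write(original)        # restore the original, hand off to a person
    return "needs_human_review"
```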
A critical caveat: up to 76% of initial LLM refactoring suggestions can be hallucinations when validation is absent. Speed without verification is just faster mistakes. I wrote about the practical side of maintaining code quality with AI tools if you're navigating that tradeoff.
AI makes the incremental path cheaper, which means the "no incremental path exists" exception to the "don't rewrite" rule triggers less often than it used to. But it doesn't eliminate the need for architectural judgment about what to migrate and in what order. The tools have changed. The thinking hasn't.
Decision Signals
The case studies and frameworks converge on a set of concrete signals. Not rules - signals. Each one shifts the probability, and you're looking for which direction the weight of evidence points.
Signals Favoring Refactor
The system is in production serving real users. The team that built it still maintains it. Test coverage is reasonable, or achievable through characterization tests without heroic effort. The problems are localized to specific modules rather than spread across the entire codebase. The technology stack is still supported and receiving security patches. Business constraints prohibit a feature freeze. And critically - the organization hasn't addressed the practices that created the mess in the first place. If the same team with the same habits rewrites the system, they'll reproduce the same architecture in slightly different syntax. Software structure mirrors org structure - if you rewrite without restructuring teams, you get the same design back.
Signals Favoring Rewrite
The technology stack is genuinely obsolete - not unfashionable, but unsupported. No security patches, no ecosystem, a shrinking talent pool that makes hiring painful. The architecture fundamentally cannot support required capabilities. Not "it's slow" (optimize the bottleneck) or "it's ugly" (refactor it), but "the data model is wrong for what this system needs to become." The system is small enough and well-understood enough that rebuild complexity is low. The original team is doing the rewrite, so institutional knowledge transfers with the people rather than being lost with the code.
Red Flags That a Rewrite Will Fail
Some patterns predict failure reliably. No articulated reason beyond "the code is ugly" or "we want to use a newer framework." No migration plan for data, integrations, or customer transition. The old system continues receiving features during the rewrite, creating a moving target that the new system can never catch. The rewrite is bundled with a product pivot or experimental launch. The team plans to switch languages, architectures, and infrastructure all at once. Product managers see only the visible 10% of the system's complexity and estimate accordingly.
And the most reliable red flag of all: the primary motivation is developer boredom or resume-building rather than a business problem that the current architecture genuinely cannot solve.
The Real Question
The rewrite-versus-refactor framing is a false binary. The real question is: how incrementally can you replace this system while continuing to deliver value?
The Strangler Fig pattern has become the dominant migration strategy not because it's elegant - it isn't. Compatibility layers are ugly. Transitional architecture feels wasteful. Running old and new side by side is operationally painful. But it converts a high-risk, all-or-nothing gamble into a series of small, reversible, value-delivering steps. Every successful case study in this article used some version of it. Every failure skipped it.
Facebook didn't rewrite their app - they evolved the language. Shopify didn't decompose into microservices - they enforced boundaries within their monolith. Twitter didn't rebuild all at once - they replaced components over five years. The few legitimate exceptions to the "never rewrite" rule involve small, well-understood systems with stable teams, obsolete technology with no migration path, or fundamentally mismatched architectures. And even then, the incremental approach usually outperforms a clean-slate rebuild.
Twenty-five years of evidence. Billions of dollars in lessons. The pattern is clear. If your team is standing at the crossroads, the burden of proof is on the rewrite - not on the incremental path.