The Weekend Rewrite Fantasy
There's a phase every developer goes through where the answer to every legacy codebase is "let's just rewrite it." When your portfolio is apps you can build in a couple of caffeinated weekends, rewriting feels natural, even elegant. Why untangle someone else's mess when you can start clean?
Then you face a production system serving live traffic around the clock, backed by thousands of tests written by people who left years ago. And your clean rewrite drowns in a thousand small problems you didn't anticipate: import mechanisms that changed between language versions, edge cases nobody documented but the code handles correctly, business logic that looks insane until you learn why it had to be that way.
I've made this rewrite-vs-refactor decision more times than I'd like to count - across ad-tech monoliths, govtech platforms, and crypto trading systems. Here's what I've learned from getting it wrong, and the framework I use now to get it right.
The Rewrite Graveyard
The Migration That Delivered Nothing
The worst was a Python 2 to 3 migration at an ad-tech company. Our codebase was a monolith deployed across hundreds of machines, processing millions of ad requests daily under sub-100ms response time requirements and a zero-downtime SLA. Hundreds of tenants pumped their advertising traffic through our system around the clock. We had thousands of unit tests and a team of four developers. Python 2 support was ending. Everyone knew it. We had to migrate.
We couldn't do it iteratively. The monolith had no seams - you couldn't run half of it on Python 3 and half on Python 2. So we committed to a full migration, and the cascade began. Python 3 crashed on the first entrypoint. Fix it, crashes on the next module import. Fix that, the next one. For a month, we released nothing. We just ground through import issues, then split the test suite four ways and started fixing module by module, test by test, committing and pushing as frequently as possible to avoid stepping on each other.
We finished. We also blew the deadline. The team was exhausted - squeezed dry, all of us. And the business was furious, because from their perspective we had spent months and delivered nothing. Zero features. Zero improvements. Zero value. We'd upgraded the engine while the car sat in the garage.
The next year, they shut down the project entirely. I still think the migration was technically the right call - Python 2 was dying. But we delivered no value. We didn't find a way to slice the work and mix it with features the business actually needed. That's the part that stings.
The Rewrite That Became a Stitch Job
Another project: a govtech platform with a creaky Bootstrap/jQuery admin portal. The plan was ambitious - a modern React frontend with a GraphQL data layer, designed to improve the user experience across the board - and it was scoped as a full rewrite.
But there was always something more urgent, something more valuable short-term. The rewrite got suspended before reaching feature parity. So we shipped what we had: a citizen-facing portal with modern SMS/email code-based authentication and the admin sections that were already working. Some admin pages embedded pieces of the new interface via iframes, stitching old and new on the fly.
Surprisingly, the feedback was great. Users loved the citizen portal. The company was happy. What started as a planned rewrite ended as a pragmatic partial migration, not what anyone designed on a whiteboard, but working and delivering real value. Shipping what we had stopped the budget bleed and gave us a stable platform to evolve from.
The Wall
Both of those projects hit the same wall, and it's the same wall Joel Spolsky described in 2000 when he called rewriting from scratch "the single worst strategic mistake that any software company can make." Netscape rewrote their browser, never shipped version 5, and handed the market to Internet Explorer over three wasted years. I dug into more case studies of why big rewrites fail, and the details differ but the pattern doesn't.
You stop delivering value. Every sprint goes toward rebuilding what already exists. Your team is busy, your commits are green, and your users see nothing. That Python 2 to 3 migration that delivered zero features? That's what a rewrite looks like from the outside.
Feature parity becomes a trap. Unlike greenfield work where you can cut scope ruthlessly, a rewrite has to match everything the old system does, as a minimum. Then someone adds "and while we're at it, let's also improve..." and the scope balloons. Fred Brooks identified this in 1975: designers dump every deferred wish into the second system. It's the most dangerous system you'll ever build.
The old system doesn't stop. Bugs still need fixing. Customers still need features. Your team oscillates between maintaining old and building new, making progress on neither. In the govtech project, there was always something more urgent than the rewrite, and that "something" was the actual work the business was paying for.
Institutional knowledge hides in ugly code. Those weird edge cases, those comments that say # don't remove this, it fixes the billing bug from March 2019 - they're the scar tissue of real production incidents. Spolsky's core insight holds: it's harder to read code than to write it. Developers overestimate how messy the old code is and underestimate the cost of reproducing its behavior from scratch.
The Framework I Use Now
My default answer is always refactor. When someone on my team says "let's just delete this and rewrite it," I ask one question: how much value does this bring to the business? Not to the engineering team. To the business. That question kills most rewrite proposals on the spot, because the honest answer is usually "none, it'll just be cleaner code."
When a Rewrite Is Actually Justified
A rewrite earns its cost only when all of these are true simultaneously:
- Small enough to hold in your head. Not too many files, not too many lines, not too many moving parts in the infrastructure. If you can't explain the entire system's behavior to a new hire in an afternoon, it's too big to rewrite safely.
- Completable in days, not months. The moment a rewrite crosses two weeks, you're maintaining two systems and delivering zero features. The business's patience is finite, and you're spending it.
- The team has the domain knowledge. There's a useful heuristic: if the knowledge is on the team, you can rewrite. If the knowledge is only in the code, you must refactor. When the original developers are gone, that ugly code is the only record of years of bug fixes and edge cases.
- The technology is genuinely obsolete. Not unfashionable, obsolete. No security patches, no ecosystem support, shrinking talent pool. "I want to use a newer framework" is not this.
- There's no incremental path. If you can replace pieces gradually while keeping everything running, that's almost always safer. The rewrite exception exists for cases where the architecture genuinely cannot be strangled, where the old and new can't coexist.
APIs Are Contracts
The clearest lesson came from govtech. Government clients sign SLAs, and they have no interest in upgrading their API integration just because you restructured your database. V1 must keep working. You release V2? Great. V2 is deprecated and V3 is out? That's your problem. V1 still needs to work.
So we built compatibility layers. We emulated database structures that no longer existed. We wrapped newer endpoints to accept old payload shapes and content types. Every API version ran simultaneously, all covered with tests. Ugly, yes. Also the most reliable system I've ever worked on.
A formative influence on how I think about this came from Android's approach to database upgrades: you define migration scripts that can take a database from any installed version to the most recent schema. At any moment, you have users still on version 1 and users who update twice a day. All of them must keep working. That constraint (all versions must work simultaneously) forces gradual thinking. No big bangs. No "everyone migrates on Tuesday."
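The pattern fits in a few lines of Python. This is an illustrative sketch, not Android's actual SQLiteOpenHelper API - the table, columns, and version numbers are all invented:

```python
import sqlite3

# Sketch of "migrate from any version to latest": each step upgrades exactly
# one schema version, and chaining the steps brings any installed version up
# to date. No version is stranded, no big-bang cutover is needed.
MIGRATIONS = {
    1: "ALTER TABLE users ADD COLUMN email TEXT",
    2: "CREATE INDEX idx_users_email ON users(email)",
    3: "ALTER TABLE users ADD COLUMN locale TEXT DEFAULT 'en'",
}
LATEST_VERSION = 4

def upgrade(db: sqlite3.Connection, current_version: int) -> int:
    """Run every migration step between current_version and LATEST_VERSION."""
    for version in range(current_version, LATEST_VERSION):
        db.execute(MIGRATIONS[version])
    return LATEST_VERSION

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")  # a version-1 schema
new_version = upgrade(db, 1)  # a user stuck on v1 catches up in one pass
```

A user already on version 3 would run only the last step. The discipline is that every historical version stays reachable, forever, because you never know who hasn't updated yet.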
The Half-Upgraded System
There's a dimension to this that only hits you in zero-downtime, high-load environments: deployment itself takes time. Database migrations can stall midway through. In a gradual rollout, half your machines are serving old code while the other half serve new. Your system might sit in this half-upgraded state for hours, and it must function the entire time.
This puts hard constraints on what counts as a "breaking change." In ad-tech, we could always add a database column, but deleting columns wasn't allowed because the old code still running on half the fleet expected them to be there. Cache keys got a prefix incremented with every release, so old app instances only read their own cache and didn't choke on new data shapes. API payloads, ORM models, queue message formats - all of it had to be backward-compatible across at least two consecutive versions.
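The cache-prefix trick is tiny but worth spelling out. A sketch with invented names - in practice the release counter would be baked in at build time, not hardcoded:

```python
# Every deploy bumps RELEASE. During a rolling deploy, instances still on
# release 41 read and write "r41:..." keys while release-42 instances use
# "r42:..." keys, so no instance ever deserializes a data shape written by
# a different code version.
RELEASE = 42  # illustrative; set by the build pipeline in practice

def cache_key(name: str) -> str:
    return f"r{RELEASE}:{name}"
```

The cost is a cold cache on every release; the payoff is that "half the fleet on old code" stops being a correctness problem for cached data.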
These aren't theoretical concerns. They're the reason "let's just swap it out" sounds simple in a meeting and turns into a month of compatibility work in practice.
What Actually Works
Most of the time, the answer isn't rewrite or refactor. It's a messy middle path that delivers value while you evolve the system underneath.
When the Cost Is Worth It
I don't want to pretend rewrites never work. They do, when the conditions are right and the team goes in clear-eyed about the cost.
On one project, maintaining legacy database structures had cornered us into relying on triggers, views, and unindexable SQL queries that grew hairier with every new customer. Those tables were core, used in every query across the system. The more customers we served, the slower everything got. There was no incremental path: the fundamental data model was wrong for what the system had become.
We committed to the rewrite. It meant an embargo on new features. The full team dedicated weeks to the transition. The release was stressful: a week of fighting fires and bugs that had slipped through the test suite. But the benefits justified the cost, because the old structure was actively strangling the product. Sometimes the table really can't be turned into a chair.
The key difference from the failures: we knew exactly what we were replacing and why, the team had full domain knowledge, and the pain of not rewriting was measurable in slow queries affecting every customer.
Modularity Is the Real Unlock
On a crypto trading project, the system was the opposite of the ad-tech monolith: highly modular, microservices each living its own life. Orchestrating a release across the whole system was a nightmare. But refactoring any individual service was easy and convenient, because each one was decoupled and independently testable.
That contrast taught me something: good test coverage and modularity are the two prerequisites that make refactoring safe. A legacy system with no tests that needs a substantial change? You don't jump straight to refactoring. First you cover the existing behavior with tests. Then you decouple it, making the piece you want to change swappable. Only then do you mess with refactoring or replacing it. Test, decouple, then change, in that order.
Practical Patterns for the Gradual Path
For everything that doesn't meet the rewrite bar, these patterns have saved me repeatedly:
Write characterization tests first. Before changing legacy code, capture what it actually does, not what it should do. Michael Feathers calls these the safety net for refactoring. They're fast to write (you're just asserting current behavior) and they catch the regressions you didn't expect.
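A minimal example of what that looks like, with an invented legacy_price function standing in for the real legacy code:

```python
def legacy_price(quantity: int) -> float:
    # Stand-in for the real thing: imagine 200 lines of accumulated
    # business rules you don't fully understand yet.
    price = quantity * 9.99
    if quantity > 10:
        price *= 0.9  # undocumented bulk discount
    return round(price, 2)

def test_characterize_legacy_price():
    # Assert current behavior verbatim - what the code DOES, not what the
    # spec says. If a refactor changes any of these, you find out before
    # production does.
    assert legacy_price(1) == 9.99
    assert legacy_price(10) == 99.9
    assert legacy_price(11) == 98.9  # discount kicks in - surprising, but real
```

Note that the discount assertion isn't a judgment about whether the discount is correct. It pins the behavior down so that any change to it is a decision, not an accident.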
Separate the refactor from the feature. Kent Beck's "two hats" rule: you're either adding functionality or restructuring code, never both at once. When I mix these, I lose track of what broke. Keeping them in separate commits makes rollback trivial and code review meaningful.
Release to subsets. Not every change needs to hit all users at once. Roll out to a subset of tenants, watch the metrics, expand gradually. If something breaks, the blast radius is contained.
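A deterministic bucketing function is often all the machinery this needs. A sketch with invented names, assuming tenants are identified by a string ID:

```python
import hashlib

def in_rollout(tenant_id: str, percent: int) -> bool:
    """True if this tenant falls inside the first `percent` buckets.

    Hashing makes the assignment deterministic: the same tenant lands in
    the same bucket on every request, and raising `percent` only ever adds
    tenants to the cohort, never reshuffles the ones already in it.
    """
    bucket = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16) % 100
    return bucket < percent
```

Start at 5, watch the metrics, then raise the number release by release until it hits 100 and the flag can be deleted.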
Build compatibility layers instead of breaking contracts. Emulate old interfaces on top of new internals. Yes, it's ugly. It's also what lets you ship the new system without waiting for every consumer to migrate on your schedule.
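In code, a compatibility layer is usually just a translation shim in front of the new implementation. A hedged sketch - the payload shapes and field names here are invented, not from any real system:

```python
def handle_v2(payload: dict) -> dict:
    # The new implementation understands only the v2 shape.
    return {"id": payload["user_id"], "name": payload["full_name"]}

def handle_v1(payload: dict) -> dict:
    # Old clients still send {"uid": ..., "first": ..., "last": ...}.
    # Translate to the v2 shape instead of breaking their integration.
    v2_payload = {
        "user_id": payload["uid"],
        "full_name": f"{payload['first']} {payload['last']}",
    }
    return handle_v2(v2_payload)
```

The v1 handler never gets new features, but it never breaks either, and every consumer migrates on their own schedule instead of yours.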
The AI Asterisk
I should be honest: things are shifting. AI coding assistants are genuinely changing what's possible in a weekend. Airbnb migrated 3,500 test files in six weeks using LLM-powered pipelines, a project estimated at a year and a half the old way. That's not hype. That's a real team with real numbers.
The decision framework doesn't change, but the execution speed does. Things that were too expensive to refactor incrementally (tedious file-by-file migrations, mechanical syntax transformations, boilerplate modernization) are now feasible. The "no incremental path" exception on my checklist triggers less often when AI can do the grinding.
But here's the catch: AI is a multiplier of whatever approach you choose, including bad ones. Without constraints, you can bloat a project with low-quality code 10-20x faster than a junior developer could. Speed without judgment is just faster mistakes. If the rewrite was a bad idea before AI, it's still a bad idea, and you'll just arrive at the failure sooner. I wrote more about maintaining code quality with AI tools if you're navigating that side of the equation.
The fundamentals haven't changed. The tools have.
The Question That Kills Most Rewrites
I like modern, elegant code. Everyone does. But real-life projects are full of legacy, and the job isn't to replace it. It's to evolve it without breaking the thing that's paying everyone's salary.
I started my career as a rewrite enthusiast and became a reluctant incrementalist. The turning point wasn't reading Spolsky or Fowler. It was watching a team spend months on a migration that delivered zero value to the business, and seeing the project get shut down the following year.
The next time someone on your team says "let's just rewrite it," ask them to estimate not just the build time, but the cost of zero feature delivery during the rewrite, the parallel maintenance tax, and the week of firefighting after release. If the math still works and the conditions are right (small scope, team has the knowledge, no incremental path), go for it. For everything else, there's a compatibility layer with your name on it.