February 2019. I'd just inherited a customer data onboarding system at a GovTech client. The package handled ingesting customer uploads (XLSX, CSV, ZIP files in whatever format they felt like sending), cleaning the corrupt data, and loading it into Postgres. Half a dozen contributors across two years, zero tests.
My first week was pure archaeology. Cultural layers of different engineering epochs stacked on top of each other. Scripts importing libraries that had been migrated to different packages months ago. Database tables referenced in code that no longer existed. Google Drive spreadsheets nobody had access to anymore. Each customer had their own Postgres schema, tied to whatever era their data was first loaded in. Every new customer had been someone's fresh attempt to standardize the process, which just meant inventing yet another standard on top of the previous three. Each of those scripts was a disposable program: days of engineering work to import one customer's data once.
My job was to turn this into a product where an ops team (not engineers) could pick from standard configs and presets, click "load," and have data flow through extraction, transformation, and storage into a standardized format. The moment I spotted the common thread across all those one-off scripts, I started building a generic framework. And immediately hit the wall: without tests, every refactoring step was a coin flip. Everything was flaky, brittle, dependent on import order and file existence.
That Saturday, I started writing the first test fixtures. Not because I had a testing strategy. Because I couldn't build a generic framework on top of code where I couldn't tell if my changes broke existing behavior. Two weeks later we had a CI pipeline with a --fail-under gate, not at 80%, not at 90%, but at whatever coverage existed, so the number could never go down.
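The gate itself doesn't need to be clever. coverage.py's --fail-under flag does the job if you pin the threshold at today's number; if you'd rather keep the floor in the repo than in a CI config, a small script over coverage.py's Python API works too. Here's a minimal sketch of that idea - the floor file is hypothetical, not what we actually ran, and it assumes the test run has already produced a .coverage data file:

# check_coverage_floor.py - hypothetical sketch of a "coverage never goes down" gate,
# not the original project's script. Run it in CI right after the test suite.
import sys
from pathlib import Path

import coverage

# The floor is committed to the repo; whenever coverage rises, bump this file in the
# same PR, so the gate only ever tightens.
FLOOR_FILE = Path("coverage_floor.txt")

cov = coverage.Coverage()
cov.load()                             # reads the .coverage data file from the test run
total = cov.report(file=sys.stdout)    # prints the usual report, returns total % as a float

floor = float(FLOOR_FILE.read_text())
if total + 0.01 < floor:               # small tolerance for rounding noise
    sys.exit(f"Coverage dropped: {total:.1f}% is below the floor of {floor:.1f}%")

print(f"Coverage {total:.1f}% meets the floor of {floor:.1f}%")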
Here's what I learned from that experience, and from adding tests to plenty more legacy codebases since.
Why Legacy Code Has No Tests
Nobody wakes up and decides to skip testing. It happens incrementally, and the reasons are always reasonable at the time.
The first version ships without tests because there's a demo next week. Then the second version ships without tests because there's a client onboarding. By the third version, nobody knows how to add tests to this codebase anymore. The architecture wasn't designed for testability, and retrofitting feels like a bigger project than anyone has time for.
There's a parallel path that ends in the same place. You built a prototype, maybe on a no-code platform, maybe with AI writing most of the code, maybe by hand on weekends. It was an experiment. It was never supposed to be the real product. Then it found users, revenue, traction, and now it IS the real product. You never wrote tests because there was nothing to test. It was a hack, a proof of concept. Now you have paying customers and a codebase you're scared to touch. You can't even blame the previous developer. You are the previous developer.
Then come the handoffs. Someone built it, someone else maintains it. The person who built it knew the implicit contracts. The person maintaining it doesn't, and there's no test suite to document them. I've seen this pattern across multiple companies and teams. The onboarding system I described above had been touched by at least six people over two years. Each one understood their slice. Nobody understood the whole thing.
Eventually you reach the most dangerous state: "it works, don't touch it." A system that works in production with zero tests is a system that nobody will refactor, nobody will upgrade, and nobody will touch when the bug reports start coming in. It works until it doesn't, and when it doesn't, you have no safety net.
The AI angle makes this worse. Claude and Copilot can generate plausible-looking test suites that hit 90% line coverage while testing almost nothing meaningful. Mocking the thing you're trying to test. Asserting that returned values are truthy without checking what they contain. I found a vacuously true assertion in our own codebase recently - a test that always passed because the filtering it tested never ran.
The Strategy: Where to Start
The biggest mistake is trying to retroactively test everything. That's a rewrite in disguise, and rewrites fail for predictable reasons. You'll spend weeks writing tests for code that works fine, burn out, and abandon the effort with 30% coverage and zero momentum.
Instead, four rules that build coverage where it matters.
Rule 1: Write a test before every bug fix. When a bug comes in, write a test that reproduces it first. The test fails, you fix the bug, the test passes. You now have a regression test that guarantees this specific bug never returns. Over time, your most bug-prone code paths get the most test coverage - exactly where you need it.
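Concretely, a regression test is just an ordinary test named after the bug, committed alongside the fix. A sketch with hypothetical names - say the bug report was that the address cleaner raised a KeyError on rows with no zip column, and the agreed fix is to return None instead:

# Hypothetical regression test: written first, watched fail, then the fix turns it green.
# It stays in the suite so this exact bug can never quietly come back.
from cleaning import clean_address  # hypothetical module path


def test_clean_address_tolerates_missing_zip():
    raw = {"street": "123 n. main st.", "city": "springfield", "state": "il"}

    result = clean_address(raw)

    assert result["zip"] is None      # the fix: a missing zip becomes None, no KeyError
    assert result["state"] == "IL"    # and the rest of the cleaning still happens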
Rule 2: Write a test before every refactor. Before you touch any function, capture what it currently does. Not what it should do - what it actually does. This is the characterization test approach (next section). If your refactor changes the behavior, the test catches it. If the behavior was already wrong, you'll discover that during the test-writing phase, not three deploys later.
Rule 3: Write integration tests for critical paths first. What are the three things that, if they break, wake someone up at 2 AM? Test those end-to-end. For the onboarding system, that was: loading a customer data file, running it through the transformation pipeline, and storing the standardized output in the database. Three integration tests covered the paths that caused 80% of our failures.
Rule 4: New code gets tests. No exceptions. Draw a line in the sand. Everything written after today ships with tests. This is the only rule that's non-negotiable. The other three are strategies for dealing with the past. This one prevents the problem from growing.
The mental model is an expanding safety net. You're not trying to cover the entire codebase at once. You're building coverage organically, concentrated around the code that changes most and breaks most. After six months of following these rules, I looked at our coverage report and the hot spots - the files with the most git activity - were all above 70%. The dusty corners that nobody touches were still at zero. That's fine. Those corners aren't breaking anything.
Characterization Tests: Your Secret Weapon
Michael Feathers coined this term in Working Effectively with Legacy Code, and it's the single most useful technique for testing code you didn't write.
The idea: test what the code does, not what it should do. You're not asserting correctness. You're pinning current behavior so you can refactor without accidentally changing it.
Here's the pattern. Say you have a function that cleans address data - standardizing state abbreviations, fixing zip codes, normalizing street names. You didn't write it. You don't fully understand the edge cases. But you need to refactor it because it's a 400-line monster.
def test_address_cleaning_characterization():
    raw = {
        "street": "123 n. main st.",
        "city": "springfield",
        "state": "il",
        "zip": "62704-1234",
    }
    result = clean_address(raw)
    assert result["street"] == "123 N Main St"
    assert result["city"] == "Springfield"
    assert result["state"] == "IL"
    assert result["zip"] == "62704"

You run the function with known input, capture the actual output, and write assertions against it. If the current behavior is wrong (maybe it strips the zip+4 when it shouldn't), that's a separate conversation. Right now, you're documenting what the code does so you can change how it does it without changing what it does.
The power of this approach: you can write characterization tests fast. You don't need to understand the business logic. You don't need to read the requirements doc (there probably isn't one). Feed in representative input, record the output, assert it stays the same. I wrote characterization tests for six modules in one Saturday afternoon using nothing but real data samples from our staging environment.
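The recording step itself can be a throwaway script. Something like this (hypothetical paths and module names) runs the function over a few real samples and prints the results so you can paste them straight into assertions:

# capture_behavior.py - hypothetical throwaway helper for the "record the output" step.
# Run it once, eyeball the printed dicts, paste them into characterization tests.
import json
import pprint

from cleaning import clean_address  # hypothetical module under test

# A handful of representative, anonymized rows pulled from staging.
with open("samples/addresses.json") as f:
    samples = json.load(f)

for raw in samples:
    print("# input:", raw)
    pprint.pprint(clean_address(raw))  # this output becomes the expected value in the test
    print()

Snapshot-testing plugins for pytest can automate the same loop, but print-and-paste is usually enough to get a first batch of characterization tests in place.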
One warning: characterization tests are scaffolding, not architecture. Once you understand the code well enough to write proper tests with meaningful assertions, replace the characterization tests. They're a safety net for refactoring, not a long-term testing strategy.
Integration Over Unit for Legacy Code
This is going to be controversial: for untested legacy code, integration tests give you more value per test than unit tests.
Here's why. Legacy code has unclear boundaries. Functions call other functions that call the database, which triggers a webhook, which updates a cache. If you unit test each function in isolation by mocking its dependencies, you're testing the mocks, not the code. The bugs live in the interactions between components - the same place where your missing tests hurt most.
For the onboarding system, our first useful tests were integration tests. Feed a real (anonymized) customer file through the full pipeline, check that the right records appear in the database with the right values. One test covered the loading path, the transformation logic, the address parser, the name standardizer, and the database writes. A unit test for each of those components would have been ten tests that each individually passed while missing every interaction bug.
This doesn't mean unit tests are useless. Unit tests are great for pure functions, complex algorithms, and edge cases in well-defined modules. But when you're starting from zero on a legacy codebase, write the integration test first. It's the smoke detector for the whole floor. You can add room-by-room sensors later.
The practical approach: test at the API boundary. If you have a REST API, call the endpoint with realistic data and check the response. If you have a Celery task, call it with realistic arguments and check the side effects. If you have a CLI tool, invoke it with realistic flags and check the output. These tests are easy to write, easy to understand, and they catch real bugs.
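For a pipeline like the onboarding system, the boundary is the function that ops would eventually trigger with a button. A sketch with hypothetical names, assuming a pytest fixture that hands you a DB-API connection to a disposable test database:

# Hypothetical integration test at the pipeline boundary: one anonymized customer
# file in, standardized rows out, every layer in between running for real.
from pipeline import load_customer_file  # hypothetical entry point


def test_customer_file_lands_in_standard_schema(db_connection):
    # db_connection: assumed fixture yielding a DB-API connection to a throwaway database.
    sample = "tests/fixtures/acme_customers_anonymized.csv"  # hypothetical sample file

    load_customer_file(sample, preset="standard_v2", db=db_connection)

    cur = db_connection.cursor()
    cur.execute(
        "SELECT street, city, state, zip FROM standardized.addresses WHERE customer = %s",
        ("acme",),
    )
    rows = cur.fetchall()

    assert len(rows) == 42    # the sample file contains 42 usable records
    assert ("123 N Main St", "Springfield", "IL", "62704") in rows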
What AI Gets Wrong About Testing
AI-generated tests are a specific category of problem worth calling out, because they look more correct than most hand-written tests while often testing less.
Testing implementation, not behavior. AI loves to assert on internal state. It checks that a method was called with specific arguments, that an internal variable has a specific value, that the code took a specific path. These tests break on every refactor even when the behavior is unchanged. They test how the code works instead of what it produces.
Mock everything. AI defaults to mocking every dependency, which means you're testing that your mocking framework works correctly. I've seen AI-generated tests that mock the database, mock the HTTP client, mock the cache, and then assert that calling the function returns the mocked value. Yes, it does. That's how mocks work. You've tested nothing.
Fake coverage. Studies suggest that 40-50% of AI-generated code in pull requests has quality issues, and test code is no exception. A test that asserts result is not None hits the line but doesn't verify the value. A test that catches all exceptions and passes anyway hits every branch but validates nothing. That vacuously true assertion I mentioned earlier? An AI-generated test that checked whether filtering worked, but the test data was set up in a way that the filter condition never triggered. Coverage report said 100%. Actual testing: zero.
Happy path only. Ask AI to write tests and you'll get thorough coverage of the success case with zero tests for error handling, edge cases, or concurrent access. The bugs that wake you up at 2 AM are never in the happy path.
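To make the fake-coverage pitfall concrete, here are two hypothetical versions of the same test. Both execute every line of the function; only one can fail when the behavior changes:

# Two hypothetical tests for the same name standardizer. Identical coverage numbers,
# very different value.
from cleaning import standardize_name  # hypothetical function


def test_standardize_name_vacuous():
    result = standardize_name("  o'brien,  MARY ")
    assert result is not None             # passes for any non-None garbage


def test_standardize_name_meaningful():
    result = standardize_name("  o'brien,  MARY ")
    assert result == "Mary O'Brien"       # fails the moment the behavior changes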
If you're using AI to help write tests (and you should - it's genuinely faster for boilerplate), review the tests with more skepticism than you'd review the production code. Check that assertions are specific. Check that the test actually fails when you break the code it's supposed to test. Check that mocks aren't replacing the thing you're trying to verify. The bar for AI-generated tests is higher, not lower, because they're so convincing. For more on reviewing AI-generated code in general, see my detailed workflow for Claude Code quality.
Making It Stick
Tests are worthless if nobody runs them. The moment you have a single useful test, put it in CI. Block merge if it fails. A GitHub Actions workflow that runs pytest on every pull request takes 30 minutes to set up and it changes everything. If you've never set up CI before, that's a separate topic - but the short version is: it's a YAML file, and the free tier is enough.
A coverage report on each PR is useful too, not as a gate but as information. "This PR drops coverage from 58% to 55%" is a conversation starter. "This PR adds 400 lines with zero tests" is a code review comment that writes itself.
The real danger is erosion. The first broken test someone ignores kills the testing culture. If a test is flaky, fix it or delete it the same day. Never mark it as @pytest.mark.skip and move on. I've seen skip markers accumulate until 30% of the test suite was disabled. At that point, you're back to zero confidence.
The hardest part is writing the first test in a codebase that has none. The setup is awkward. The fixtures don't exist. The code isn't designed for testability, so you'll fight with imports and global state. Write it anyway. One test. For the next bug that comes in. It doesn't have to be elegant. It has to exist.
Once one test exists, the second is easier. Once CI runs tests on every PR, the third is natural. Once a test catches a real bug before it hits production (and it will, faster than you think), the team will never go back.
Here's the flip side. At an ad-tech company I worked at, we had a Python 2 monolith processing millions of requests daily with a zero-downtime SLA. Thousands of unit tests, all written in a Given-When-Then pattern that made them easy to read and maintain even years after the original authors left. When Python 2 end-of-life forced a migration, those tests were the only reason we could attempt it. Every behavioral change surfaced immediately. We could refactor the import system, update syntax, swap out deprecated libraries, and know within minutes whether something broke. The migration was still painful for other reasons, but without that test suite, we wouldn't have even considered it. We would have been stuck on a dead runtime.
Two projects. One with zero tests where I couldn't refactor a single function without holding my breath. One with thorough coverage where a team of four could migrate an entire language version under live traffic. That's the difference tests make at scale.
I started with test fixtures on a Saturday because I couldn't trust my own refactoring. Six months later we had a coverage gate, a CI pipeline, and a generic data loading framework that replaced dozens of one-off scripts. A year after that, customer onboarding was handled by an ops team who weren't engineers, picking configs and clicking buttons instead of writing migration code. None of that would have been possible without the safety net those tests provided. Your starting point will look different. The trajectory is the same.
