Using Claude Code Without Technical Debt

Arseniy Potapov
16 min read

Battle-tested workflows for using Claude Code in production. Two review modes and context-priming techniques that catch AI code bugs before they ship.

Last month I spent three days debugging race conditions where overlapping transactions interleaved and corrupted shared state. The code passed every test. Linters were clean. Type checks passed. In production, under real load, two users hitting the same resource at the same time broke everything.

Some of that code was AI-assisted. Not all of it - race conditions don't need AI to exist - but the AI-generated parts had sailed through review because they looked impeccable. Syntactically perfect. Well-structured. Exactly the kind of code you glance at and think "looks good." That's the trap. Auto-approving AI output because it reads well is how you end up spending a week staring at transaction logs instead of shipping features.

That experience crystallized something I'd been feeling for months. AI tools write code that's correct in isolation - functions that do what you asked, following patterns that pass every static check. But production isn't isolation. Production is concurrent users, stale caches, network partitions, and data that doesn't look like your test fixtures. The gap between "works in dev" and "survives production" is where technical debt hides.

I use Claude Code every day for production AI/ML systems. I'm not here to tell you to stop using AI tools - I'd be a hypocrite. I'm here to share the workflows I've built after learning these lessons the hard way. Two modes of working with AI, a context-priming system that makes output dramatically more predictable, and the red flags I've trained myself to catch before they reach production.

Here's how I use Claude Code without losing sleep.

What AI Gets Wrong (And Why It's Hard to Spot)

AI-generated code has a dangerous property: it looks right. Clean variable names, correct syntax, reasonable structure. It passes lint, passes type checks, often passes tests. The problems are in the things you can't see by reading the code.

It Doesn't Understand Your Runtime

AI generates code in a vacuum. It doesn't know that your FastAPI endpoint gets hit by 200 concurrent users during batch processing. It doesn't know that your Celery workers share a database connection pool that saturates under load. It writes code that's correct for a single request and completely wrong for a thousand simultaneous ones.

My race condition? AI-generated async code where two transactions could hit the same resource within milliseconds. Each transaction was correct in isolation. Together, they interleave and corrupt shared state. Nothing in the code looks wrong - the bug is in the timing that only exists under production load.

It Doesn't Know Your Architecture

Every Claude Code conversation starts fresh. Without context, it invents patterns. Feature A gets a service layer. Feature B gets business logic inline in the route handler. Feature C introduces a repository pattern nobody asked for. Each one works individually. Together, your codebase becomes an archaeology dig where every layer is a different civilization. This kind of architecture drift is how teams end up in the rewrite-vs-refactor debate that costs months instead of weeks.

It Over-Engineers

AI loves abstractions. Ask for a simple data processor and you'll get an AbstractBaseProcessor with a StrategyFactory and a PluginRegistry. Code that should be 30 lines becomes 150. AI doesn't feel the maintenance cost of abstraction - it just reaches for the most "proper" solution it's seen in training data.
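
To make that concrete, here's the shape of the problem - a sketch with hypothetical names, not real Claude output. The first version is what was asked for; the second is the direction AI tends to drift.

# A sketch of the pattern - hypothetical names throughout.
from abc import ABC, abstractmethod

# What was asked for: drop blank rows, normalize keys. A handful of lines.
def clean_records(records: list[dict]) -> list[dict]:
    return [
        {k.strip().lower(): v for k, v in row.items()}
        for row in records
        if any(v not in (None, "") for v in row.values())
    ]

# What AI tends to produce: the same logic under three layers of indirection.
class AbstractBaseProcessor(ABC):
    @abstractmethod
    def process(self, records: list[dict]) -> list[dict]: ...

class RecordCleaningStrategy(AbstractBaseProcessor):
    def process(self, records: list[dict]) -> list[dict]:
        return clean_records(records)

class ProcessorFactory:
    _registry: dict[str, type[AbstractBaseProcessor]] = {"clean": RecordCleaningStrategy}

    @classmethod
    def create(cls, name: str) -> AbstractBaseProcessor:
        return cls._registry[name]()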

It Misses What's Between the Lines

Business rules that nobody documented. The constraint that user IDs can't change after first payment because three downstream systems cache them. The convention that all async tasks must be idempotent because your queue has at-least-once delivery. AI can't know what it was never told, and the stuff that isn't written down is usually the stuff that matters most.

The False Confidence Trap

This is the compounding problem. Because AI code is syntactically clean and well-structured, you trust it more than you should. You review it less carefully. You approve faster. And that's exactly when subtle bugs slip through - not because the AI is bad, but because the code looks too good to question.

These aren't reasons to stop using AI tools. They're reasons to use them with a system. The rest of this article is that system.

flowchart LR
    subgraph AI["What AI Generated"]
        A1["await db.get(sender)"]
        A2["await db.get(receiver)"]
        A3["if balance >= amount"]
        A4["sender.balance -= amount"]
        A5["await db.commit()"]
        A1 --> A2 --> A3 --> A4 --> A5
    end

    subgraph PROD["What Production Needed"]
        P1["select(...).with_for_update()"]
        P2["Lock rows in sorted order"]
        P3["if balance >= amount"]
        P4["sender.balance -= amount"]
        P5["await db.commit()"]
        P1 --> P2 --> P3 --> P4 --> P5
    end

    AI -. "Passes tests\nbut no row locking" .-> PROD
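
To ground the diagram in code: a minimal sketch of the production-safe version, assuming SQLAlchemy 2.x with an async session and a hypothetical Account model. The fix isn't clever logic - it's locking both rows, in a stable order, so concurrent transfers serialize instead of interleaving.

# Minimal sketch - assumes SQLAlchemy 2.x async and a hypothetical Account model.
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Account(Base):
    __tablename__ = "accounts"
    id: Mapped[int] = mapped_column(primary_key=True)
    balance: Mapped[int] = mapped_column(default=0)

async def transfer(db: AsyncSession, sender_id: int, receiver_id: int, amount: int) -> None:
    # Lock both rows, always in sorted ID order, so two concurrent transfers
    # can't acquire them in opposite order and deadlock.
    rows = (await db.execute(
        select(Account)
        .where(Account.id.in_(sorted((sender_id, receiver_id))))
        .order_by(Account.id)
        .with_for_update()  # row locks held until commit
    )).scalars().all()
    accounts = {a.id: a for a in rows}
    sender, receiver = accounts[sender_id], accounts[receiver_id]

    if sender.balance < amount:
        raise ValueError("insufficient balance")
    sender.balance -= amount
    receiver.balance += amount
    await db.commit()  # commit releases the locks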

Two Modes of Working With AI

I've developed two distinct approaches for working with Claude Code in production. Neither eliminates the need to review AI output - both just change when you invest your thinking time.

Mode 1: Inline Review (The Conversation)

This is the interactive approach. I work with Claude Code on changes one at a time, reviewing and giving feedback as we go.

The workflow looks like this: Claude suggests a change, I read it carefully, I either accept it, reject it, or ask for modifications. Then we move to the next change. It's slower per change, but I catch issues immediately when my mental context is still fresh.

I review code changes and immediately give feedback. This is much slower but doesn't force me to review everything in one gigantic PR.

The trade-off is speed. You're not going to ship a feature in an hour with this approach. But you also won't face a 2,000-line PR full of AI-generated code that you need to audit all at once. The review burden is distributed across the work, not concentrated at the end.

There's a hidden benefit: this mode is educational. I've learned new patterns, algorithms, and technologies through these conversations. When Claude suggests an approach I haven't seen before, I stop and understand it before accepting. Over time, that builds real expertise.

I use this mode when I'm in unfamiliar territory - exploring a new API, debugging complex business logic, or working on code where the architecture isn't obvious yet. When I don't fully understand the problem space, I want to think through each step rather than delegate execution to AI.

Mode 2: Documentation-First (The Blueprint)

This is the upfront investment approach. Before Claude writes a single line of code, I invest significant time creating detailed documentation.

I invest time in creating very detailed documentation for the feature and then let AI work through tickets. It's faster, but it requires a very clear understanding and a plan before the work starts.

The documentation includes: the architecture (which services, which databases, how they interact), data flow (what comes in, how it transforms, what goes out), edge cases (what happens when X fails?), and constraints (performance requirements, backwards compatibility, security boundaries).

Then I break the work into well-defined "tickets" and let Claude Code execute against that specification. For example, last week I needed a new API endpoint: I wrote a one-page spec covering the route, request/response schemas, database queries, error cases, and auth requirements. Claude produced a working implementation on the first pass - because it wasn't guessing, it was following a blueprint.
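
For a sense of scale, here's a trimmed, hypothetical version of one of those tickets - the paths and schema names below are illustrative, not from a real project:

## Ticket: GET /api/v1/orders/{order_id}
- Route: new handler in api/v1/orders.py, following the existing router pattern
- Response: OrderDetail schema in schemas/orders.py; 404 if missing,
  403 if the order belongs to another user
- Query: single select on orders joined to order_items - no N+1
- Auth: get_current_user dependency; owner or admin only
- Out of scope: pagination, caching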

This doesn't save me from all AI slop. I still review the results. But the code is more predictable, more consistent across features, and closer to what I actually wanted.

I use this mode for well-understood features: CRUD operations, database migrations, repetitive refactoring, test writing. When I know exactly what needs to happen, documentation-first is faster overall despite the upfront time investment.

When to Use Which

The decision is straightforward:

Use Mode 1 (Inline Review) when:

  • You're in new or unfamiliar territory
  • The problem requires complex business logic
  • You're debugging and don't know the root cause yet
  • The architecture isn't clear and you need to feel your way forward
  • You're doing DevOps or infrastructure work (shell commands, network rules, Docker configs)
  • The work unfolds as you go - you don't know what you'll find until you look

Use Mode 2 (Documentation-First) when:

  • You understand the requirements completely
  • The feature follows established patterns in your codebase
  • You're doing repetitive work (migrations, similar CRUD endpoints)
  • You're writing tests for well-understood behavior

I often alternate between them in the same day. Morning: documentation-first for a straightforward API endpoint. Afternoon: inline review for debugging a race condition where I don't know what's broken yet.

The Key Insight

Neither mode eliminates the need to review. The question isn't "should I review AI output?" The question is "when do I invest my thinking time?"

Mode 1: thinking happens during execution (distributed review). Mode 2: thinking happens before execution (upfront design, then review at the end).

Both require discipline. Both require saying no to AI suggestions that aren't right. The difference is timing.

flowchart TB
    subgraph MODE1["Mode 1: Inline Review"]
        direction LR
        M1A["AI suggests change"] --> M1B["You review"] --> M1C["Accept / Reject / Modify"] --> M1A
        style M1B fill:#ffd,stroke:#aa0
    end

    subgraph MODE2["Mode 2: Documentation-First"]
        direction LR
        M2A["You write spec"] --> M2B["AI executes tickets"] --> M2C["You review result"]
        style M2A fill:#ffd,stroke:#aa0
    end

    MODE1 --- T1["Thinking distributed across work"]
    MODE2 --- T2["Thinking upfront, review at end"]

Context Is Everything

The single biggest factor in AI code quality isn't the model, the prompt, or the temperature setting. It's context. The more your AI tool knows about your project, your conventions, and your constraints, the better its output will be.

I've found three layers of context that dramatically change Claude Code's output quality.

Layer 1: CLAUDE.md - Your Project's AI Constitution

Claude Code reads a CLAUDE.md file at the root of your project before every conversation. This is where you define the rules of engagement.

Mine includes things like: "Never wrap code in try-except by default - we handle errors globally." And: "Do not use inline imports, always put imports at module level." These are my coding conventions, and without them Claude would happily generate code that violates both.

CLAUDE.md should contain your coding style, architecture decisions, what NOT to do, and any project-specific constraints. Keep it concise - this isn't a wiki. It's a dense set of instructions that shapes every line of code Claude generates.

Layer 2: Subdocument References

CLAUDE.md gets long fast. The trick is keeping it lean and linking to deeper documentation for specific areas.

My CLAUDE.md references separate docs for database schemas, API patterns, deployment procedures, and testing conventions. Claude reads the relevant subdocuments when it needs context for a specific task. This layered approach means CLAUDE.md stays scannable while deeper context is always available.
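
To make the layering concrete, here's a trimmed, hypothetical excerpt - the rules echo ones mentioned elsewhere in this article, and the doc paths are illustrative:

# CLAUDE.md (hypothetical excerpt)

## Conventions
- Never wrap code in try-except by default; errors are handled globally.
- No inline imports - all imports at module level.
- All async tasks must be idempotent (the queue is at-least-once delivery).

## Architecture
- FastAPI routers in api/v1/, Pydantic schemas in schemas/, business logic in services/.

## Deeper context (read when relevant)
- Database patterns: docs/ai/database.md
- API conventions: docs/ai/api.md
- Testing guide: docs/ai/testing.md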

Layer 3: AI-Targeted Documentation

This is the layer most people miss. Traditional documentation is written for humans - it explains concepts, includes tutorials, gives background. AI-targeted documentation is different. It's dense, specific, and architectural.

Instead of "Our API uses REST principles," you write: "All API endpoints follow this pattern: FastAPI router in api/v1/, Pydantic models in schemas/, service layer in services/. Responses use StandardResponse wrapper. Auth via get_current_user dependency."

That one paragraph gives Claude more useful context than pages of explanation. The AI doesn't need to understand why - it needs to know what patterns to follow.

The hidden cost: maintaining this documentation takes real effort. Every architecture change, every new convention needs to be reflected in these docs. It's not free. But it's the highest-ROI investment I've found for AI-assisted development. Better context in means dramatically better code out.

The Principle

Garbage context in, garbage code out. If your AI tool doesn't know your patterns, it will invent its own. If it doesn't know your constraints, it will ignore them. If it doesn't know your architecture, every feature will look different.

Invest in documentation. Not for future developers - for your AI tools, right now. The payoff is immediate: more consistent code, fewer review cycles, less time fixing AI's guesses.

There's a whole topic around automating the maintenance of AI documentation - keeping it fresh as your codebase evolves. That's a deep dive for another article. For now, start with CLAUDE.md and build from there.

flowchart TB
    L1["CLAUDE.md\nConventions, architecture, constraints"]
    L2["Subdocuments\nDB patterns, API conventions, testing guide"]
    L3["AI-Targeted Feature Docs\nDense specs: routes, schemas, edge cases"]
    OUT["AI Output Quality"]

    L1 -->|"references"| L2
    L2 -->|"details"| L3
    L1 & L2 & L3 -->|"context in"| OUT

    style L1 fill:#e8f4fd,stroke:#1a73e8
    style L2 fill:#d4edda,stroke:#28a745
    style L3 fill:#fff3cd,stroke:#ffc107
    style OUT fill:#f8d7da,stroke:#dc3545

Red Flags in Your Diff

You already know why AI code goes wrong - runtime blindness, architecture drift, over-engineering. Here's what to actually look for when you're reviewing a diff.

Try-except wrapping everything. The most reliable AI tell. You'll see entire function bodies wrapped in try: ... except Exception: logger.error(...). Your global error handler never fires because every exception gets caught and buried. If you see a bare except Exception in a diff, reject it.

Magic numbers. Timeouts of 30, retry counts of 3, batch sizes of 100. AI picks numbers that look reasonable but aren't tied to anything real. Your timeout should be 5 seconds because the downstream SLA is 3. Check every numeric literal that isn't 0 or 1.

Comments explaining "what." A # Increment the counter comment sitting above counter += 1. If you see a comment describing what the next line does, delete it. The only comments worth keeping explain why - and AI can't write those because it doesn't know your intent.

Near-duplicate functions. Three functions that differ by one parameter. AI generates each independently, doesn't see the duplication across sessions. Search for similar function signatures in the same module.

Missing boundary validation. Internal functions that trust all inputs. AI treats each function as self-contained, skipping the validation that protects your system at entry points. Check: does new code that handles external input validate it?

Inline imports. Imports inside function bodies instead of at module level. AI does this because it's "convenient" for the snippet. If your project convention is module-level imports, this is an instant reject.

None of these require deep analysis. They're mechanical checks you can spot in seconds during review - which is exactly why they're worth having on a checklist.

# RED FLAG: Try-except wrapping everything
# BEFORE - global error handler never fires
def process_order(order_id: int):
    try:
        order = Order.get(order_id)
        order.validate()
        order.charge()
        return {"status": "completed"}
    except Exception:
        logger.error(f"Failed to process order {order_id}")
        return {"status": "failed"}

# AFTER - let exceptions propagate
def process_order(order_id: int):
    order = Order.get(order_id)
    order.validate()
    order.charge()
    return {"status": "completed"}

# RED FLAG: Magic numbers
# BEFORE
response = httpx.get(url, timeout=30)

# AFTER - tied to a real constraint
response = httpx.get(url, timeout=DOWNSTREAM_TIMEOUT_SEC)  # 5s, SLA is 3s
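
One more in the same before/after style - near-duplicate functions collapsed into one. Names here are hypothetical:

# RED FLAG: Near-duplicate functions
# BEFORE - three functions that differ by one filter
def get_active_users(db):
    return db.query(User).filter(User.active.is_(True)).all()

def get_inactive_users(db):
    return db.query(User).filter(User.active.is_(False)).all()

def get_all_users(db):
    return db.query(User).all()

# AFTER - one function, one parameter
def get_users(db, active: bool | None = None):
    query = db.query(User)
    if active is not None:
        query = query.filter(User.active.is_(active))
    return query.all()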

Automating What You Can

Every red flag from the previous section can be caught by a machine. Linters, type checkers, and pre-commit hooks handle the mechanical stuff - bare except Exception blocks, inline imports, magic numbers, functions over 100 lines. You already have ruff and mypy (or ESLint and TypeScript strict). The step most people skip is adding custom pre-commit checks for the AI-specific patterns: error swallowing, numeric literals outside constants, near-duplicate functions.

# .pre-commit-config.yaml - AI-specific checks
- repo: local
  hooks:
    - id: no-bare-except
      name: no bare except Exception
      language: pygrep
      entry: 'except\s+(Exception|BaseException)\s*:'
      types: [python]

    - id: no-inline-imports
      name: no inline imports
      language: pygrep
      entry: '^\s{4,}(import |from \S+ import )'
      types: [python]

The underrated CI trick: run your tests with concurrency. Most test suites run sequentially, so timing-dependent bugs pass locally. Run with pytest -n auto or your framework's parallel flag. Bugs that only surface under concurrent execution will start failing in CI - which is exactly where you want them caught.
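
As a sketch of what that catches - assuming pytest-asyncio, a hypothetical session_factory fixture, and the transfer()/Account names from the earlier locking example - here's a test that fires overlapping transfers and checks that the total balance is conserved:

# Sketch: fails only when transfers actually overlap. Assumes pytest-asyncio;
# session_factory and INITIAL_TOTAL are hypothetical, transfer/Account come from the earlier sketch.
import asyncio
import pytest
from sqlalchemy import select

@pytest.mark.asyncio
async def test_concurrent_transfers_conserve_total(session_factory):
    async def one_transfer():
        async with session_factory() as db:
            await transfer(db, sender_id=1, receiver_id=2, amount=1)

    # 50 overlapping transfers against the same two rows
    await asyncio.gather(*(one_transfer() for _ in range(50)))

    async with session_factory() as db:
        accounts = (await db.execute(select(Account))).scalars().all()
    # Invariant: money moves between accounts, it never appears or vanishes.
    assert sum(a.balance for a in accounts) == INITIAL_TOTAL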

But here's what matters more than any of this: what automation can't catch. No tool will tell you whether this feature follows the same patterns as the rest of your codebase. No linter knows your business rules. No type checker can tell you the code solves the wrong problem.

Automation handles the floor. Human review handles the ceiling. The mistake is confusing which problems belong to which layer.

The Honest Trade-offs

Time for the part most AI articles skip.

What AI Is Genuinely Great At

Boilerplate. CRUD endpoints, data models, serializers - anything where the pattern is established and you just need more of it. AI excels here because there's nothing to get wrong beyond following the template.

Tests, when you define scope. Tell Claude exactly which cases to cover and it writes solid tests fast. Let it decide what to test and you'll get 90% happy-path coverage with zero edge cases.

Mechanical refactoring. Renaming, moving code between modules, converting class patterns. Tedious work that humans mess up because our attention drifts. AI doesn't get bored.

Exploring unfamiliar APIs. Need to integrate a library you've never used? Claude reads the docs faster than you and produces a working first draft. You still need to understand what it wrote, but the exploration phase shrinks dramatically.

First-draft documentation. API docs, README updates, docstrings. AI produces a decent starting point that's faster to edit than to write from scratch.

What AI Is Genuinely Bad At

Everything that requires understanding beyond the code itself. Concurrency correctness, architecture decisions, implicit business rules, cross-codebase consistency - I covered these in detail earlier, and they remain the core risks.

But the one I keep coming back to: AI doesn't push back. It's eager to please. Tell it to build something wrong and it will do so enthusiastically. You need to be the one who decides "we shouldn't build this." AI has no judgment about whether the task itself makes sense.

The Hidden Costs

Maintaining AI-targeted documentation takes real effort. Every architecture change needs to be reflected in your CLAUDE.md and supporting docs. This is an ongoing tax, not a one-time setup.

Review time sometimes exceeds writing time. For complex logic, reading and verifying AI output takes longer than writing it yourself would have. You save nothing - you just shifted the work from writing to reading.

Context-switching between driving and reviewing is cognitively expensive. You're either thinking creatively or thinking critically. Switching between the two hundreds of times a day is draining in a way that pure coding isn't.

The speed gain is real but smaller than the hype. I'm not 10x faster. I'm not even 5x faster.

My Honest Assessment

AI tools make me maybe 1.5-2x faster overall. The biggest gains are in exploration and boilerplate - not in the hard parts that actually matter. Core logic, architecture, debugging - these take the same time they always did, sometimes longer because I'm reviewing AI's work on top of my own thinking.

Quality is maintained only because I refuse to skip review. If I stopped reviewing, I'd ship faster but sleep worse.

You're the Architect, AI Is the Contractor

AI coding tools are the most productive addition to my workflow in years. They're also the easiest way to accumulate technical debt I've ever seen. The difference between the two outcomes is discipline - not talent, not experience, just a system you follow consistently.

You wouldn't hand a contractor a plot of land and say "build something." You'd give them blueprints, constraints, materials specs, and you'd inspect the work at every stage. AI tools are the same. You design. AI executes. You review. Skip any step and you're gambling.

Here's what matters:

Invest in documentation - for your AI, not for future developers. CLAUDE.md, subdocuments, AI-targeted architecture docs. Better context in, better code out. The payoff is immediate.

Choose your review mode deliberately. Inline review for the unknown. Documentation-first for the predictable. Match the mode to the work, not your mood.

Automate the mechanical checks. Linters, type checkers, pre-commit hooks - let tools handle the floor so your review time goes to the ceiling.

Know the limits honestly. 1.5-2x faster, not 10x. Great at boilerplate, bad at concurrency. The speed is real. The hype isn't.

If you want a concrete starting point: write a CLAUDE.md for your main project this week. Just your coding conventions, your architecture patterns, and three things AI should never do in your codebase. That single file will change the quality of every AI interaction you have.

The race condition you don't catch today is the production incident you debug next week. And enough unchecked drift turns into the kind of big rewrite nobody wants. Build the system. Follow the system. Sleep well.
