Detecting AI Slop: A Code Review Checklist

The pull request looked clean. Tests green, linter quiet, the diff small enough to read in one sitting. I approved it, merged it, and went to lunch. Two days later production started dropping updates under load, and I spent the next three days tracing it back to eight lines an AI had written for me.

In isolation, the code was correct. Run single-threaded, it did exactly what I asked: read a counter, add to it, write it back. It just had no idea that two requests could hit the same row inside the same 50ms window. There was no atomic update and no row lock. Under real traffic the reads overlapped and the writes clobbered each other.

Here is what unsettled me afterward. Nothing in my normal review caught it. The code read well because a model trained on millions of correct examples writes code that reads well. The tells were structural, not in the logic I was scanning for.

That bug taught me to review AI-generated code differently. This article is the checklist I built out of it: twelve named patterns, sorted by how fast they will hurt you, so you can spot the slop before it ships.

Why AI Slop Isn't Like Human Bugs

Human bugs are personal. Every developer has their own blind spots, so the mistakes in a codebase come out scattered and idiosyncratic. One person forgets to close a file handle, another off-by-ones a loop. You review human code by hunting for those individual lapses.

AI slop works differently because it is not personal. The same model, trained on the same corpus, reaches for the same shapes no matter whose project it is touching. Ask three teams to generate error handling and you will see the same try/except wrapped around the same too-broad block. The mistakes repeat, so they are predictable, so you can build a checklist instead of guessing each time.

The measurements line up with that. CodeRabbit compared AI-assisted and human-written pull requests, 470 in total, and found about 1.7x more issues in the AI-assisted ones. Veracode flagged security flaws in 45% of the AI-generated samples it tested. CrowdStrike measured roughly double the rate of concurrency mistakes, the exact class of bug that bit me in the opening. Different studies, same signal: the failures cluster.

That clustering is the whole opportunity. Random bugs you find one at a time, on your knees with a debugger. Systematic ones you can name, sort by severity, and scan for in seconds, which is what the rest of this article sets up.

P1 - Fix Before Merge

These four ship quietly and then bite in production. Block the merge until they are gone.

The Try-Except Blanket. AI reaches for error handling that catches everything and does nothing with it.

try:
    charge_customer(order)
    ship(order)
except Exception:
    pass

The code never crashes, which feels like robustness. It is the opposite. A failed charge now ships the order anyway, and you hear about it from an angry customer instead of a stack trace. In review, grep every diff for except Exception and bare except:. Each one is a question: what failure is this swallowing, and should the program really keep going?
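
A narrower handler might look like the sketch below. PaymentDeclined, fulfil, and the dictionary-shaped order are hypothetical stand-ins I made up for illustration, not part of any real library. The point is only that the handler names the one failure it expects, leaves a trace, and stops before shipping, while anything unexpected still surfaces as a stack trace.

```python
import logging

logger = logging.getLogger("orders")


class PaymentDeclined(Exception):
    """Hypothetical error raised by the payment step when a charge fails."""


def charge_customer(order):
    # Stand-in for a real payment call; declines orders without enough funds.
    if order.get("funds", 0) < order["total"]:
        raise PaymentDeclined(f"order {order['id']} declined")


def ship(order):
    # Stand-in for the real shipping call.
    order["shipped"] = True


def fulfil(order):
    """Charge first; ship only if the charge went through."""
    try:
        charge_customer(order)
    except PaymentDeclined:
        # Catch only the failure we know how to handle, and log it.
        logger.warning("charge failed for order %s; not shipping", order["id"])
        return False
    # Any other exception propagates instead of being swallowed.
    ship(order)
    return True
```

Note that ship sits outside the try block on purpose: a shipping failure is a different problem from a declined card and should not be handled by the same branch.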

The Race Condition Blind Spot. This is the one from the opening. The model writes correct single-threaded logic and quietly assumes the world is single-threaded.

def add_credits(user_id, amount):
    user = db.get(user_id)
    user.credits += amount   # read here, write below
    db.save(user)

Two requests run this at once, both read the same starting value, and one update vanishes. The fix is an atomic write the database already gives you:

db.execute(
    "UPDATE users SET credits = credits + %s WHERE id = %s",
    (amount, user_id),
)

For any read-modify-write on shared state, ask: what happens if two of these run in the same millisecond? The AI almost never asks that on its own, so you have to.
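
To see the lost update without waiting for production traffic, here is a small sketch using the standard library's sqlite3 module and an in-memory table. The table layout and function names are assumptions for the demo, and the two "requests" are interleaved by hand rather than with threads so the outcome is the same every run. sqlite3 uses ? placeholders where the snippet above uses the %s style of some other drivers.

```python
import sqlite3


def make_db():
    # In-memory database with one user holding 100 credits.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, credits INTEGER)")
    conn.execute("INSERT INTO users (id, credits) VALUES (1, 100)")
    conn.commit()
    return conn


def read_credits(conn, user_id):
    row = conn.execute(
        "SELECT credits FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]


def write_credits(conn, user_id, value):
    conn.execute("UPDATE users SET credits = ? WHERE id = ?", (value, user_id))
    conn.commit()


def add_credits_atomic(conn, user_id, amount):
    # The database performs the read and the write as a single statement.
    conn.execute(
        "UPDATE users SET credits = credits + ? WHERE id = ?", (amount, user_id)
    )
    conn.commit()


def lost_update_demo():
    conn = make_db()
    # Both "requests" read before either one writes.
    seen_by_a = read_credits(conn, 1)
    seen_by_b = read_credits(conn, 1)
    write_credits(conn, 1, seen_by_a + 10)
    write_credits(conn, 1, seen_by_b + 25)  # overwrites request A's update
    return read_credits(conn, 1)


def atomic_demo():
    conn = make_db()
    add_credits_atomic(conn, 1, 10)
    add_credits_atomic(conn, 1, 25)
    return read_credits(conn, 1)
```

Starting from 100 credits, lost_update_demo() ends at 125 because the +10 is silently discarded, while atomic_demo() ends at 135.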

The Confident Hallucination. The model calls a method that sounds exactly right and does not exist, or passes a config key the library silently ignores.

# boto3 has no such method; it sounds plausible and isn't real
s3.download_bucket_to_folder(bucket, "/tmp/data")

Phantom APIs like this turn up most often with smaller libraries the model saw less of during training, though as the boto3 example shows, popular ones are not immune. The danger is tone: the call reads as authoritative, so reviewers skim past it and it fails at runtime instead of at review. Treat any unfamiliar method or parameter as guilty until you have seen it in the actual docs, not just sitting confidently in the diff. A thirty-second check against the real API beats a 2am page when the job finally runs.
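
When the docs are not to hand, a quick existence check can at least rule out names that do not resolve. The helper below is a rough sketch of my own (api_exists is not a standard function) and uses only importlib and hasattr. It only proves that a name exists on the installed version, not that it does what the diff assumes, and SDKs that build client methods at runtime need the check run against a live client object instead of the module.

```python
import importlib


def api_exists(module_name, dotted_attr):
    """Return True if module_name.dotted_attr resolves on the installed package."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    # Walk the dotted path one attribute at a time, e.g. "Path.read_text".
    for part in dotted_attr.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True
```

For example, api_exists("json", "dumps") is true, while a plausible-sounding invention such as api_exists("json", "dump_to_folder") is false.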

The Security Non-Check. Authorization logic that looks plausible and is subtly wrong: an or where it needed and, a check against the wrong field, a token trusted only because the client sent it.

if user.is_owner or user.is_member:
    grant_access(resource)

Members were never meant to reach owner-only resources, so the check should have been user.is_owner alone, but the happy-path test passes and the bug rides along. Read every auth check slowly and confirm the boolean logic matches the rule you actually want, then confirm nothing in it trusts a value the client controls.
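
One way to make that slow read mechanical is to write out the truth table. The sketch below assumes a hypothetical User with two boolean role flags and a helper named truth_table; neither comes from a real framework. It evaluates a check for every combination of flags so the member-only case is impossible to miss.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class User:
    # Hypothetical user model with two independent role flags.
    is_owner: bool = False
    is_member: bool = False


def can_access_owner_resource(user):
    # The rule as stated: owners only. Membership alone is not enough.
    return user.is_owner


def truth_table(check):
    """Evaluate a check for every (is_owner, is_member) combination."""
    return {
        (owner, member): check(User(is_owner=owner, is_member=member))
        for owner, member in product([False, True], repeat=2)
    }
```

Running truth_table on the original `is_owner or is_member` check shows True for the (False, True) row, which is exactly the member-only case the happy-path test never exercised.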

P2 - Fix This Sprint

These do not crash anything. They rot the codebase slowly, so clear them before they set into something every future change has to route around.

The Over-Abstraction. Ask for a function that formats a date and the model can hand back a small framework.

class AbstractFormatterFactory:
    def get_formatter(self, kind): ...

class DateFormatterStrategy(FormatterStrategy):
    def format(self, value): ...

What you needed was value.strftime("%Y-%m-%d"). The model has read a lot of enterprise Java and reaches for ceremony when nothing stops it. Flag any new abstraction with exactly one implementation. An interface wrapping a single concrete class is a maintenance tax you pay forever for a benefit that never arrives.
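
For contrast, the whole requirement fits in one small function. The name format_date is my own choice for the sketch; it accepts anything with a strftime method, such as date or datetime.

```python
from datetime import date


def format_date(value):
    """Format a date or datetime as YYYY-MM-DD."""
    return value.strftime("%Y-%m-%d")


# Example: a single-digit month and day are zero-padded.
formatted = format_date(date(2024, 3, 9))
```

If a second format ever shows up, that is the moment to consider a parameter, and still probably not a factory.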

The Config Cargo Cult. Constants get promoted into environment variables and feature flags for no reason: a RETRY_COUNT, an ENABLE_NEW_PARSER, a DATE_FORMAT that will never hold a second value. The model has seen production code where these are configurable and copies the shape without the reason. Each one adds a knob nobody turns, a branch nobody tests, and a .env entry the next developer has to reverse-engineer. Ask of every new setting whether it has a real alternate value it could take. If it does not, it is a constant wearing a costume, and inlining it makes the code shorter and clearer at once.

The Import Chaos. Imports buried inside functions, unused imports left at the top, deprecated module paths picked up from stale training data. None of it breaks the build, so it slips through review by being boring. ruff and pyflakes flag unused imports in under a second, and ruff's optional rule sets cover most of the rest, so let the linter own this category and spend your own attention on the patterns a tool cannot see.

The Fake Test. The most dangerous P2, because it lies with a straight face. Coverage is green, the suite passes, and the test asserts nothing real.

def test_charge_customer(mocker):
    order = {"id": 1, "total": 20}   # placeholder order
    charge = mocker.patch("billing.charge_customer", return_value=True)
    assert charge(order) is True   # asserts the mock, not the code

This mocks the very function it claims to verify, so it would still pass if the body of charge_customer were gutted. Read each test and ask what would have to break in real code for it to fail. If the answer is nothing, the test is decoration. Writing tests that exercise the hard cases is its own skill, one I dug into in my piece on testing legacy code.
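
A test that earns its green check fakes only the boundary and runs the real function. The sketch below assumes a hypothetical charge_customer that takes the payment gateway as an argument, plus a made-up GatewayError. Your billing code will be shaped differently, but the idea carries: mock the gateway, not the function under test.

```python
from unittest import mock


class GatewayError(Exception):
    """Hypothetical error raised by the payment provider."""


def charge_customer(order, gateway):
    """Code under test: rejects bad totals and reports gateway failures."""
    if order["total"] <= 0:
        raise ValueError("order total must be positive")
    try:
        gateway.charge(order["customer_id"], order["total"])
    except GatewayError:
        return False
    return True


def test_charge_sends_amount_to_gateway():
    gateway = mock.Mock()  # fake only the external boundary
    order = {"customer_id": "c-1", "total": 25}
    assert charge_customer(order, gateway) is True
    gateway.charge.assert_called_once_with("c-1", 25)


def test_charge_reports_gateway_failure():
    gateway = mock.Mock()
    gateway.charge.side_effect = GatewayError("card declined")
    order = {"customer_id": "c-1", "total": 25}
    assert charge_customer(order, gateway) is False


def test_charge_rejects_non_positive_total():
    gateway = mock.Mock()
    try:
        charge_customer({"customer_id": "c-1", "total": 0}, gateway)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
    gateway.charge.assert_not_called()
```

Each of these fails if the matching branch of charge_customer is removed or changed, which is the property the fake test lacks.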

P3 - Track and Fix

These will not page you at 2am. They accumulate as drag, so note them, clean them up when you are already in the file, and do not block a merge over one alone.

The Comment Novel. AI narrates. It explains what the next line does instead of why the line exists, which is the opposite of a comment worth keeping.

# Increment the counter by one
counter += 1
# Loop over each item in the list
for item in items:
    process(item)

Ox Security found this kind of padding in more than 90% of the AI code they examined. The comments are not wrong, just worthless, and they shove the real logic further apart on screen. Delete the ones that restate the code and keep only the ones that capture a decision the code cannot show on its own.

The Magic Number Garden. Hardcoded values bloom everywhere: a timeout=37, a range(0, 256), a retry that sleeps 1.5 seconds, none of them explained. Each was probably sensible the moment it was typed and is now a small mystery for whoever edits the function next. Lift the ones that carry meaning into named constants, and when you meet an oddly specific value in review, ask where it came from.

Ghost of Patterns Past. The model trained on years of code, so it sometimes writes in the wrong era: a blocking requests call inside a fully async service, a deprecated method, a class-based view in a codebase that moved to plain functions. It runs, but it fights the conventions around it. Catch it by checking that new code matches the file it lands in, not just that it passes.
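
For the blocking-call-in-async case, one common repair is to push the blocking function onto a worker thread. This sketch assumes Python 3.9 or later for asyncio.to_thread, and fetch_blocking is a stand-in that sleeps instead of making a real HTTP request. Switching to a natively async client is often the better long-term fix; this is the smallest change that stops the event loop from stalling.

```python
import asyncio
import time


def fetch_blocking(name):
    # Stand-in for a blocking call such as an HTTP request; sleeps briefly.
    time.sleep(0.05)
    return f"{name}: done"


async def fetch(name):
    # Run the blocking function in a worker thread so the event loop stays free.
    return await asyncio.to_thread(fetch_blocking, name)


async def fetch_all(names):
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(fetch(n) for n in names))
```

Calling asyncio.run(fetch_all(["a", "b"])) runs both fetches concurrently instead of one after the other.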

The Scope Creep. You asked for a function to parse a CSV. The diff also brings retry logic, a caching layer, and a logging wrapper nobody requested. Extra code is not a gift; it is more surface to review, test, and carry. When a diff runs wider than the request, the burden sits on the extra code to justify itself, and it usually cannot.

Turn the Patterns Into a Checklist

Twelve patterns only help if you actually run them. Lift the four P1 checks into your pull request template as required boxes: no bare except, every read-modify-write is atomic, every unfamiliar API verified against real docs, every auth check re-read for its boolean logic. That is the first rung, detection, and it costs a reviewer about a minute per diff.

The checks a machine can run, let the machine run. A linter clears the import chaos, a type checker catches a slice of the hallucinations, and your pipeline can gate on both so they never reach a human at all. If that is not wired up yet, setting up CI/CD from scratch walks through it.

Detecting slop in review is the floor, not the ceiling. The stronger move is to stop generating it. Most of these patterns trace back to the model not knowing your conventions, which is exactly what a CLAUDE.md is for: write the anti-patterns down once and it stops reaching for them. I went deep on that in the CLAUDE.md guide.

The last rung is enforcement: commit hooks that reject a banned pattern automatically, so the rule never depends on anyone remembering it. A piece on wiring those up is coming. Until then, a pre-commit hook running ruff plus a grep for except Exception covers most of the ground.
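
If grep proves too noisy, for instance by matching comments and strings, the same check can be done on the syntax tree. This is a sketch of the scanning half only, using the standard library's ast module in place of grep; find_blanket_excepts is a name I made up, and wiring it into a hook (read the staged files, exit non-zero on any finding) is left out.

```python
import ast

# Exception classes broad enough to hide almost any failure.
BROAD_NAMES = {"Exception", "BaseException"}


def find_blanket_excepts(source):
    """Return sorted (line, reason) pairs for bare or broad except handlers."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        if node.type is None:
            # "except:" with no exception class at all.
            findings.append((node.lineno, "bare except"))
        elif isinstance(node.type, ast.Name) and node.type.id in BROAD_NAMES:
            findings.append((node.lineno, f"except {node.type.id}"))
    return sorted(findings)
```

It deliberately ignores narrow handlers such as except KeyError, and it does not look inside tuples like except (Exception, ValueError), so treat it as a first pass, not a guarantee.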

The Reviewer's Mindset

Reviewing AI code is a different job than reviewing a teammate's. With a person, you hunt for the logic slip unique to them and that day. With a model, you hunt for the structural tells it repeats everywhere, because it never quite held your whole codebase in its head. The twelve patterns here are just those tells, named and sorted so you can spot them fast.

So do one thing on your next AI-generated pull request: run the four P1 checks before you read anything else. Bare excepts, unguarded shared state, invented APIs, and loose auth logic are where the real damage hides. Then go one step earlier and write your conventions into a CLAUDE.md, so the model stops producing the slop you would otherwise have to catch.