AI Data Cleaning in Production: From Toy Demos to Real Pipelines

Arseniy Potapov

Three years of building AI-powered data pipelines taught me one thing: don't pipe every row through an LLM. Here's the architecture that actually scales.

In December 2022, I wrote an article about using AI to clean data. I was excited. GPT-3 had just become accessible, text-davinci-003 could parse messy CSV rows, and I built a demo with 6 rows of sample data. Six rows. I showed how the model could normalize dates, fix capitalization, and extract structured fields from free text. It worked beautifully.

It also didn't scale past a demo.

The model cost $0.02 per 1,000 tokens. Rate limits capped throughput at a few hundred requests per minute. Every API call returned slightly different formatting. There was no validation, no error handling, no way to reproduce results. I was essentially asking the world's most expensive intern to hand-clean each row one at a time.

Three years later, I build data ingestion systems that process millions of records from dozens of sources, each arriving in its own special flavor of broken. County tax rolls with names like "SMITH JOHN A JR & MARY B SMITH-JONES TTEE OF THE SMITH FAMILY TRUST." Address files where column B switches meaning halfway through. CSV exports with mixed encodings, missing headers, and creative interpretations of what "null" means.

AI is central to how these systems work. But not the way I imagined in 2022. The LLM doesn't touch every row. It looks at a sample, figures out what's wrong, creates a transformation plan, and hands it off to pandas and Pydantic to execute and validate. The expensive, intelligent part happens once. The cheap, deterministic part runs on every row.

This is the article I should have written three years ago. Here's what actually works when you're cleaning data at production scale - not with 6 rows, but with millions.

Every Customer's Data Is Broken in Its Own Way

Clean datasets are all alike; every dirty dataset is dirty in its own way.

I've processed data from county governments, mortgage companies, title agencies, and tax assessors across 14 states. Not one source arrived clean. The problems fall into predictable categories, but the specific combination is unique every time.

Start with the structural problems. CSV files with no headers, mixed delimiters (tabs in one section, commas in another), multiline fields that break naive parsers. One county sends an Excel workbook with 6 tabs. Another sends a ZIP of CSVs named EXPORT_FINAL_v2_CORRECTED(1).csv.

Then there's encoding. Latin-1 data in a UTF-8 world. Windows line endings mixed with Unix. BOM markers that show up as invisible characters. I once spent a day debugging why 200 property addresses contained Ã© instead of é - classic UTF-8 double-encoding from an intermediate system that converted the file twice.

Semantic ambiguity is worse. "John Smith" in one source, "SMITH, JOHN A JR" in another, "SMITH JOHN & MARY TTEE" in a third. Same person, three representations. Dates arrive in 15+ formats: 01/02/2024, 2024-01-02, Jan 2, 2024, 20240102, and my favorite, 1/2/24 (is that January 2nd or February 1st?).

Missing data has personality. Empty string, "N/A", "n/a", "NA", "NULL", "None", "-", "0", and a single space character - I've seen all of these mean "this field has no value" within the same file. Then there are the blanket deletions where an entire column is empty because the export process silently dropped it.

And domain-specific landmines. Addresses like "123 MAIN ST APT 4B C/O J SMITH" that need to be split into street, unit, and care-of fields. Property class codes that mean different things in different counties ("01" is "Single Family" in Lee County, "Residential" in Broward, and "Vacant Land" in Marion).

The problems are predictable by category but unique in combination. You can't write one script that handles all sources. And you can't hire enough people to manually map every new dataset - not when each one arrives with its own encoding, its own column names, its own creative interpretation of how to represent a null value.

The question is what you do about it. People have tried a lot of things.

What People Try Before AI

Every approach to data cleaning exists for a reason, and most of them work fine within their limits. The trouble starts when you outgrow those limits.

Manual cleanup in Excel. Power Query, VLOOKUP, find-and-replace. This handles one-off tasks with small files perfectly well. If you're fixing 200 rows once a quarter, Excel is the right tool. It's not reproducible, though. Next quarter, when the same file arrives with a new encoding issue, you start from scratch.

One-off Python scripts. A developer writes a pandas pipeline per data source. I've done this dozens of times: read CSV, rename columns, parse dates, strip whitespace, export. Works great for 5-10 sources. At 50+, you're maintaining 50 scripts, each encoding tribal knowledge that walks out the door when the developer leaves.

ETL platforms like dbt, Fivetran, or Airbyte are excellent for structured sources: APIs, databases, well-defined schemas. They struggle with truly messy data. The CSV with no headers. The Excel file where column B is sometimes "Owner Name" and sometimes "Property Address" depending on which county clerk exported it.

Data quality frameworks (Great Expectations, Soda, dbt tests) are superb at detecting problems: "15% of addresses failed to geocode." They don't fix anything. You still need a human to figure out why and write the fix.

Pasting into ChatGPT is where I was in 2022. Send 50 rows, ask for clean output, marvel at the result. It genuinely works for small batches. It doesn't scale: Osmos measured $2,250 for 100K rows through GPT-4 before Microsoft acquired them. Inconsistent output between calls, no validation, and rate limits that kill any automated pipeline.

AI-assisted pipeline is what I'll spend the rest of this article on. AI analyzes samples, generates transformation configs, quality gates verify the output. It combines deterministic execution from scripts, config-driven structure from ETL, automated checks from quality frameworks, and LLMs' ability to understand messy, ambiguous data.

Here's how they compare:

Approach             | Scales to 1M rows | Reproducible | Handles ambiguity | Cost per run
Excel / manual       | No                | No           | Human judgment    | Free + hours
One-off scripts      | Yes               | Per-script   | No                | Dev time per source
ETL platforms        | Yes               | Yes          | No                | Platform fees
Quality frameworks   | Detection only    | Yes          | No                | Config time
Paste into ChatGPT   | No                | No           | Yes (LLM)         | $2-2,250 per batch
AI-assisted pipeline | Yes               | Yes          | Yes (LLM + rules) | Hours once, pennies per run

None of these approaches is universally wrong. Excel is perfect for one-off cleanup. Scripts are perfect for stable, well-understood sources. ETL platforms are perfect for structured pipelines. The AI-assisted approach fills the gap between them: understanding messy, ambiguous data and generating the transformation logic that traditional tools then execute.

AI as Planner, Not Processor

The pattern that works in production is simple to describe: AI looks at a sample of your data, creates a transformation plan, and the system executes that plan with traditional tools. The LLM touches 100 rows. pandas processes 10 million.

Here's the concrete version. Five steps, each doing one thing.

Step 1: Define the target shape. Before any AI gets involved, you need a contract - what should clean data look like? I use Pydantic models for this because they double as validation:

from pydantic import BaseModel, Field

class PropertyRecord(BaseModel):
    parcel_id: str = Field(description="Unique parcel identifier, e.g. '12-34-56-789'")
    owner_first: str
    owner_last: str
    address: str
    city: str
    state: str = Field(max_length=2)
    zip_code: str = Field(pattern=r"^\d{5}(-\d{4})?$")
    assessed_value: float = Field(ge=0)
    is_homestead: bool

This schema isn't just documentation. It's the success criterion. After cleaning, every row must validate against it.
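
Here's a minimal sketch of what "validate against it" looks like in practice - assuming rows arrive as plain dicts after parsing; the helper name and the failure-report shape are illustrative, not our exact production code:

from pydantic import ValidationError

def validate_rows(rows: list[dict]) -> tuple[list[PropertyRecord], list[dict]]:
    """Validate parsed rows against the target schema; return (valid records, failure reports)."""
    valid, failures = [], []
    for i, row in enumerate(rows):
        try:
            valid.append(PropertyRecord(**row))
        except ValidationError as e:
            # Keep enough context to say which rows broke which fields
            failures.append({"row": i, "errors": e.errors(), "raw": row})
    return valid, failures

The failure list is what later feeds the quality gates and, when a plan needs adjusting, the AI's retry prompt.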

Step 2: Describe expectations. Give the AI context about the data - not the schema (that's step 1), but what the data represents. "This is a property tax roll export from a county government. Each row is a parcel. Owners can be individuals, trusts, or corporations. Addresses include both mailing and property-site addresses."

Step 3: Feed AI sample data plus a toolkit. Send the LLM 50-100 representative rows, the target schema, and a list of available transformation operations: rename_column, parse_date, split_name, filter_rows, map_values, convert_type. The LLM doesn't need to implement these - it just needs to know they exist.
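
To make that concrete, here's roughly what the planning request can look like. It assumes an OpenAI-style chat client and a placeholder model name; the operation list mirrors the toolkit above, and the sample goes in as CSV (more on why later):

import json
from openai import OpenAI

OPS = ["rename_column", "parse_date", "split_name", "filter_rows", "map_values", "convert_type"]

def request_plan(sample_csv: str, schema_json: str, context: str, model: str = "gpt-4o") -> dict:
    """Ask the LLM for an ordered transformation plan based on a small sample."""
    prompt = (
        f"{context}\n\n"
        f"Target schema:\n{schema_json}\n\n"
        f"Available operations: {', '.join(OPS)}\n\n"
        f"Sample rows (CSV):\n{sample_csv}\n\n"
        "Return a JSON object with an ordered 'transforms' list using only the operations above."
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

With Pydantic v2, schema_json can be built from PropertyRecord.model_json_schema(), so the contract from step 1 is reused rather than restated.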

Step 4: AI generates the plan. The LLM responds with an ordered list of transformations. In our production system, this is a JSON config:

{
  "transforms": [
    {"op": "rename", "mapping": {"OWNERNAME1": "owner_raw", "SITUS_ADDR": "address"}},
    {"op": "map_values", "column": "prop_class",
     "mapping": {"01": "Single Family", "02": "Mobile Home", "03": "Condo"}},
    {"op": "split_name", "source": "owner_raw",
     "targets": ["owner_first", "owner_last", "trust_flag"]},
    {"op": "filter", "expr": "address_type == 'situs'"},
    {"op": "convert", "column": "assessed_value", "to": "float", "null_value": 0}
  ]
}

This config could be a Python script, a SQL query, or a dbt model. The format matters less than the pattern: it's declarative, versionable, and deterministic. Same input always produces same output.
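
A stripped-down executor for that config fits in a page. This is a sketch, not our production engine - it handles only the five ops shown above, and split_name stands in for the name parser I'll get to later:

import pandas as pd

def apply_config(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """Apply an ordered list of declarative transforms to a DataFrame."""
    for t in config["transforms"]:
        op = t["op"]
        if op == "rename":
            df = df.rename(columns=t["mapping"])
        elif op == "map_values":
            # Unmapped codes keep their original value instead of becoming NaN
            df[t["column"]] = df[t["column"]].map(t["mapping"]).fillna(df[t["column"]])
        elif op == "split_name":
            df[t["targets"]] = df[t["source"]].apply(lambda s: pd.Series(split_name(s)))
        elif op == "filter":
            df = df.query(t["expr"])
        elif op == "convert":
            df[t["column"]] = pd.to_numeric(df[t["column"]], errors="coerce").fillna(t.get("null_value", 0))
        else:
            raise ValueError(f"Unknown op: {op}")
    return df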

Step 5: Validate, execute, monitor. Run the plan on a subset. Check quality gates (more on those next). If gates fail, send the failures back to the AI and let it adjust the plan. Once the subset passes, run on the full dataset. Check gates again.
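
The whole loop is shorter than it sounds. run_gates and request_plan_fix are stand-ins for the gate suite (next section) and the LLM retry call:

def clean_source(df: pd.DataFrame, config: dict, max_retries: int = 3) -> pd.DataFrame:
    """Prove the plan on a sample, feed gate failures back to the AI, then run on everything."""
    sample = df.sample(n=min(len(df), 5000), random_state=42)
    for _ in range(max_retries):
        cleaned = apply_config(sample.copy(), config)
        report = run_gates(cleaned)                # schema, shape, distribution, business-logic gates
        if report.passed:
            break
        config = request_plan_fix(config, report)  # AI adjusts the plan based on the failure report
    else:
        raise RuntimeError("Plan still failing gates after retries - escalate to a human")
    full = apply_config(df, config)
    run_gates(full)                                # gates run again on the full dataset
    return full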

This is why it scales. The expensive part (LLM analysis) happens once per data source. The cheap part (pandas applying the config) runs on every row. When the same county sends next month's data in the same format, the config runs without any AI involvement at all. When a new county arrives with a completely different layout, AI generates a new config.

In our production system, we run 100+ county configurations. Each one was originally created by a human engineer studying the data. But the process that human follows - look at samples, understand the schema, write the mapping - is exactly what an AI agent can automate. The engineer's job shifts from "write the config" to "review the AI's config and approve it."

How Do You Know the Data Is Actually Clean?

You don't trust AI-generated code without tests. You shouldn't trust AI-generated transformation plans without quality gates either. The principle is the same: automate the work, verify the result.

In our production pipeline, every data load runs through four layers of checks before anything hits the database.

Schema gates are the baseline. Every field matches its expected type. IDs are unique. Required fields aren't null. Dates parse to ISO format. Foreign keys point at existing records. If your Pydantic model says zip_code must match ^\d{5}(-\d{4})?$ and 3% of rows have "N/A" in that field, the gate catches it before those rows propagate.

Shape gates catch structural problems. On a refresh (re-processing the same data source), parcel count shouldn't change by more than 10%. If it does, someone uploaded the wrong file, or the schema changed, or data got truncated. We flag anything above 5% as a warning and above 10% as an error that blocks ingestion.
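
In code, a shape gate is a single comparison against the previous load, with those thresholds baked in:

def shape_gate(current_count: int, previous_count: int) -> str:
    """Compare row counts between refreshes: warn above 5% drift, block above 10%."""
    if previous_count == 0:
        return "ok"  # first load, nothing to compare against
    drift = abs(current_count - previous_count) / previous_count
    if drift > 0.10:
        return "error"    # blocks ingestion
    if drift > 0.05:
        return "warning"
    return "ok"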

Distribution gates are where it gets interesting. After parsing 50,000 owner names, we check the top 20 most common first names. If "SMITH" appears as a first name, the parser swapped first and last name columns. If more than 5 of the top 20 first names are uncommon (not in a standard frequency list), the parser is splitting names wrong - probably treating "JOHN MICHAEL" as first="JOHN MICHAEL" instead of first="JOHN", middle="MICHAEL."
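
The name-distribution check is a few lines of pandas plus a reference frequency list. The name sets below are tiny stand-ins; production uses a proper first-name frequency list:

COMMON_FIRST_NAMES = {"JOHN", "MARY", "MICHAEL", "LINDA", "ROBERT", "PATRICIA"}  # stand-in
COMMON_SURNAMES = {"SMITH", "JOHNSON", "WILLIAMS", "BROWN", "JONES"}             # stand-in

def first_name_gate(df: pd.DataFrame) -> list[str]:
    """Flag systematic name-parsing problems from the distribution of parsed first names."""
    problems = []
    top20 = df["owner_first"].str.upper().value_counts().head(20)
    if any(name in COMMON_SURNAMES for name in top20.index):
        problems.append("surname showing up as a top first name - columns likely swapped")
    uncommon = sum(1 for name in top20.index if name not in COMMON_FIRST_NAMES)
    if uncommon > 5:
        problems.append("too many uncommon top first names - parser likely splitting names wrong")
    return problems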

We run similar checks on address geocoding: after standardizing addresses through an address validation API (we use SmartyStreets), at least 60% should validate. Below 40% means a systematic parsing problem - wrong column, encoding mismatch, or format the parser doesn't recognize. On exemption rates, 25-40% of residential parcels typically have a homestead exemption. Below 10% means exemption data is missing entirely. Above 60% means the parser is incorrectly flagging non-exempt parcels.

Business logic gates check relationships across tables. Every owner record must reference an existing parcel. Every transaction must reference an existing parcel. Exemption codes must exist in the reference table. These catch join errors and mapping mistakes that look fine in isolation but break when you query across tables.
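
These checks are plain set membership - here with pandas isin, column names assumed for illustration:

def reference_gate(owners: pd.DataFrame, parcels: pd.DataFrame, exemption_codes: set[str]) -> dict:
    """Count records whose references point at nothing."""
    orphan_owners = (~owners["parcel_id"].isin(parcels["parcel_id"])).sum()
    bad_exemptions = ((~owners["exemption_code"].isin(exemption_codes))
                      & owners["exemption_code"].notna()).sum()
    return {"orphan_owners": int(orphan_owners), "unknown_exemption_codes": int(bad_exemptions)}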

The thresholds (10%, 40%, 60%) come from processing 100+ data sources over several years. They encode what "normal" looks like. Data is never 100% accurate - there's always noise, always edge cases. The goal isn't perfection. It's catching systematic problems early, before bad data propagates through the system and produces wrong results downstream.

Here's the part that connects to the AI pipeline: when a gate fails, the system doesn't crash. It reports which gate failed, which rows caused the failure, and what looks wrong. That report feeds back to the AI agent, which can analyze the failures and adjust the transformation plan. Human reviews the adjustment, approves, and the pipeline reruns. This is the feedback loop that makes the system trustworthy: AI generates the plan, gates verify it, failures get fixed, and each iteration improves the result.

This isn't academic idealism. Research on production agent deployments shows that 68% need human intervention within 10 steps. The best AI systems aren't fully autonomous - they're semi-autonomous with clear checkpoints. Quality gates are those checkpoints for data.

Name Parsing - Where the Hybrid Pattern Clicks

Let me show you the "AI as planner" pattern with a concrete case study.

Our data has a full_name field. It contains everything from "John Smith" to "SMITH, JOHN A JR & MARY B SMITH-JONES TTEE OF THE SMITH FAMILY TRUST DATED 01/01/2020." We need to extract: first name, last name, trust flag, corporate flag. For every record.

The traditional approach is a regex-based parser. Ours handles the common patterns well: "JOHN SMITH," "SMITH, JOHN A JR," "DR JOHN MICHAEL SMITH," Hispanic compounds like "DE LA CRUZ." It scores each parse result with a confidence value based on how well the tokens match known name patterns.

About 80% of names parse cleanly with high confidence. That's 40,000 out of 50,000 records handled instantly by deterministic code. No AI involved, no API cost, no latency.

The remaining 20% is where regex breaks down. Trust names: "DEBBIE DAVIDSON REVOCABLE TRUST." Multiple owners in a trust: "WALSH JOSEPH T & NHUNG REVOCABLE TRUST." Corporate entities: "M & M MANAGEMENT LLC." Names where context matters: is "LE" a Vietnamese surname or a legal abbreviation?

These low-confidence results get routed to an LLM. The prompt includes the raw name string, the target schema (first_name, last_name, trust_flag, is_corporate), and a few examples of correct parsing. The LLM handles the ambiguity that regex can't.

Out of those 10,000 LLM-routed names, roughly 9,500 resolve cleanly. The remaining 500 get flagged for human review - a queue that a person can clear in a few hours instead of manually fixing all 10,000.

The cost math: the LLM portion costs about $2 for 10,000 names. The alternative is hiring someone to manually parse 10,000 complex names, which takes days.

The pattern generalizes beyond names. Any data cleaning task where 80% is predictable and 20% is ambiguous fits this model: let traditional tools handle the easy cases, route the hard cases to AI, validate everything against the schema, and flag what neither can handle. You're not replacing the regex parser. You're adding a second tier that handles what the regex parser can't.
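
The routing itself is almost boring - a confidence check and two queues. parse_name_regex, parse_name_llm, and is_valid_name stand in for the regex parser, the LLM call, and schema validation described above:

CONFIDENCE_THRESHOLD = 0.8  # illustrative cutoff

def parse_names(raw_names: list[str]) -> tuple[list[dict], list[str]]:
    """Tier 1: regex parser. Tier 2: LLM for low-confidence names. Tier 3: human review queue."""
    parsed, review_queue = [], []
    for raw in raw_names:
        result, confidence = parse_name_regex(raw)
        if confidence < CONFIDENCE_THRESHOLD:
            result = parse_name_llm(raw)        # prompt carries the target schema + few-shot examples
        if result is None or not is_valid_name(result):
            review_queue.append(raw)            # neither tier handled it - human review
        else:
            parsed.append(result)
    return parsed, review_queue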

Stop Copy-Pasting Into ChatGPT

There's a difference between using AI as a chat tool and using AI as a worker. Most "AI data cleaning" tutorials show the chat version: open ChatGPT, paste 50 rows, ask it to clean them, copy the result back. That works for one-off tasks the way manual Excel cleanup works - fine for small, non-recurring jobs.

The production version looks completely different from the user's perspective. Here's what actually happens when someone uploads data to our system:

A user receives a link. They drop their files - a ZIP of CSVs, an Excel workbook, a folder of PDFs. They don't know or care about our schema. They just know they need to get their data into the system.

Behind the scenes: the system unpacks the archive and identifies file formats. An AI agent examines samples from each file, detects what each column likely represents, and generates a transformation config. The system runs that config, quality gates verify the output, and if something fails, the agent adjusts and retries. If it still can't resolve the issue, a human gets a specific report: "45% of addresses didn't geocode. Likely cause: PO boxes mixed with street addresses. 1,247 rows affected."

The user who uploaded never sees any of this. They get "success" or "we need to clarify a few things."

The key difference from the chat approach: an AI agent uses tools. It doesn't try to hold a million rows in its context window. It calls pandas to read files, runs SQL to check distributions, executes validation functions, and writes configs. It iterates - try, check, adjust, retry - without a human in the loop for each step.

An agent processing 1,000 files doesn't pipe each file through the LLM. It categorizes files by format and problem type, creates a plan per category, and executes plans with traditional tools. The LLM might make 20 API calls total for 1,000 files. The rest is pandas, Pydantic, and SQL doing what they do best: processing data fast and deterministically.

This isn't a chatbot. It doesn't need a conversation UI. It doesn't need your attention. It reads messy files, figures out what's wrong, writes the fix, tests it, and leaves a report. When it can't fix something, it tells you exactly what's wrong and why. That's the kind of AI I'm excited about - not the one that talks to you, but the one that works for you.

Production Realities

Let me be honest about the parts that aren't elegant.

Sending every row through an LLM doesn't survive contact with the finance team. One million CSV rows at 500 tokens each through a budget API ($0.10/M tokens) costs $50 per run. Through GPT-4, it's closer to $5,000. Run that weekly across 50 data sources and you're looking at a six-figure annual line item for data cleaning alone. The config-generation approach costs a few dollars per data source (one-time LLM analysis of samples) plus pennies for pandas execution on every run. That's 3-4 orders of magnitude cheaper.

One practical tip: if you are sending data to an LLM, send it as CSV, not JSON. CSV uses 50-56% fewer tokens for the same data, and LLMs parse it just as accurately.
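
You can measure that difference on your own data before committing to a format - a rough sketch with tiktoken; exact savings depend on column names and values:

import tiktoken

def compare_token_counts(df: pd.DataFrame) -> dict:
    """Token cost of the same rows serialized as CSV vs. JSON records."""
    enc = tiktoken.get_encoding("cl100k_base")
    as_csv = df.to_csv(index=False)
    as_json = df.to_json(orient="records")
    return {"csv_tokens": len(enc.encode(as_csv)), "json_tokens": len(enc.encode(as_json))}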

Rate limits compound the cost problem. We've had API providers throttle us during burst processing. Fifty thousand records hitting an LLM endpoint floods the rate limit within minutes. Batch APIs help (50% discount, higher throughput), but the latency goes from seconds to hours. If you need data cleaned before a morning deadline, that's a problem.

LLMs also aren't deterministic. Same prompt, different run, slightly different output. For planning this is fine - the generated config is deterministic once it exists. For row-level processing it's a deal-breaker. I've seen the same name parsed as "John Smith Jr" in one call and "Smith, John Junior" in the next. Configs don't have this problem. They run the same way every time.

And there are things AI simply can't do yet. Deduplication across records is still hard - a 2025 evaluation found that LLMs excel at standardization and profiling but fail at non-exact deduplication. "J. Smith at 123 Main" vs "John Smith at 123 Main St Apt 4" requires comparing millions of pairs, and LLMs are both too slow and too inconsistent for this. Use dedupe or recordlinkage for fuzzy matching. AI can help define the matching rules, but the execution needs traditional algorithms.
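
The shape of that traditional approach is blocking plus pairwise scoring. Here's a bare-bones version with stdlib difflib just to show the structure - the dedicated libraries do the same thing with better comparators and much smarter blocking:

from difflib import SequenceMatcher
from itertools import combinations

def candidate_duplicates(df: pd.DataFrame, threshold: float = 0.85) -> list[tuple[int, int]]:
    """Block on zip_code so only plausible pairs get compared, then score name + address similarity."""
    pairs = []
    for _, block in df.groupby("zip_code"):
        for i, j in combinations(block.index, 2):  # quadratic, but only within a block
            a = f"{block.at[i, 'owner_last']} {block.at[i, 'address']}".upper()
            b = f"{block.at[j, 'owner_last']} {block.at[j, 'address']}".upper()
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((i, j))
    return pairs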

For high-volume or privacy-sensitive pipelines, self-hosting (Ollama for development, vLLM for production, cloud GPUs for burst) eliminates rate limits and API costs. The break-even is roughly 2 million tokens per day at 70%+ utilization. Below that, APIs are cheaper when you factor in infrastructure overhead.

The upside that makes all of this worthwhile is resilience. Traditional pipelines are brittle - one unexpected column name, one new encoding, one schema change, and the pipeline crashes at 3 AM. AI-assisted pipelines degrade gracefully. New format arrives? The agent analyzes the sample and generates a new config. Encoding changed? Detected and handled. Unseen edge case? Flagged for review instead of silently producing garbage. Each new data source makes the system smarter, not more fragile. The library of configs grows, the quality gate thresholds get refined, and the percentage of cases handled automatically increases over time.

The Silent Helper

Everyone is tired of chatbots. "Talk to AI" has become the default pitch for every product, every feature, every startup. But there's another kind of AI that doesn't get the hype and does most of the useful work.

It doesn't have a chat interface. It doesn't need your attention. It reads 500 messy files while you sleep, figures out what's wrong with each one, writes the transformation config, tests it against quality gates, and leaves a report on your desk in the morning. When it can't fix something, it tells you exactly what's wrong and why. When a new format arrives that it's never seen before, it adapts instead of crashing.

What previously required a week of an onboarding engineer's time - staring at spreadsheets, guessing at column meanings, writing one-off scripts, debugging encoding issues - now takes an afternoon of supervised automation. The human becomes the reviewer, not the laborer. That's a better use of judgment.

The three-year lesson is simple. AI is best as a planner and edge-case resolver, not as a row-level processor. Define your target schema. Sample the data. Let AI generate the transformation plan. Verify with quality gates. Execute at scale with traditional tools. Handle exceptions. That's the pattern, and it works.

This approach isn't revolutionary - the industry has converged on it independently. Every major data platform shipped some version of "AI suggests, human approves" in 2025. What's still missing is the practitioner knowledge: what thresholds to set, how to structure the feedback loop, when to route to AI versus when traditional tools are sufficient. That's what I tried to share here.

As agents get better at using tools, this pattern will become standard infrastructure. Today it's an engineering pattern you build. Soon it'll be a checkbox in your data platform. But the fundamentals - target schemas, quality gates, confidence-based routing - those don't change. They're the boring, essential foundation that makes the AI part trustworthy.
