Richard Batt
Your AI Is Only as Good as Your Data: Here Is the Cleanup Checklist
A manufacturing company spent £45,000 on an AI-powered demand forecasting system. After three months, the forecasts were worse than guessing. The system had learned from eight years of corrupted inventory data: duplicate orders, conflicting dates, and fields filled with test values that nobody had cleaned up. The AI wasn't broken. The data was.
Key Takeaways
- 70-85% of AI projects fail to meet objectives, with poor data quality as the #1 cause
- Bad data costs organizations 15% of revenue per year (Gartner)
- AI cleaning can reduce manual cleansing time by 70-90%, but you have to start with a baseline
- An 8-point data cleanup checklist catches the problems that sink AI projects before launch
- Use this checklist before any AI or automation project. It's the difference between working automation and expensive failure
Why Data Kills AI Projects (And Nobody Talks About It)
MIT Sloan research is blunt: "Automation does not fix bad data. It accelerates the impact of it."
You already know this intuitively. Feed garbage into a spreadsheet formula and you get garbage output. Scale that to an AI system learning from thousands of records, and the garbage multiplies. The system becomes more confident in its mistakes.
The painful part: most companies don't discover the data problem until after they've bought the tool and the project is already "failing." By then, you've spent months and thousands on something that was doomed from the start.
The 8-Point Data Cleanup Checklist
Before you deploy any AI or automation, run this checklist. It catches the silent data killers that sink projects.
1. Duplicate Records
Customer John Smith appears five times in your database. Once as "John Smith," once as "J Smith," once as "John Smyth" (typo), once from a merger that wasn't deduplicated, once from a test import that should have been deleted.
An AI system trained on this data will learn that John is five different people. Your churn analysis breaks. Your revenue attribution breaks. Your customer lifetime value calculations are nonsense.
Action: Run a deduplication tool on every customer list, product catalogue, and transaction table. Look for: identical records, spelling variations, test entries, and records from failed imports.
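If you want to see what a deduplication pass looks for before you buy a tool, here is a minimal sketch using only Python's standard library. The record layout (`id`, `name`, `email`), the sample customers, and the 0.85 similarity threshold are all assumptions for illustration. A real tool will use more signals and scale far better.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(name.lower().split())

def likely_duplicates(records, threshold=0.85):
    """Return pairs of record ids whose emails match or whose names look alike.

    `records` is a list of dicts with hypothetical keys: id, name, email.
    Every pair is compared, so this is only suitable for small lists.
    """
    pairs = []
    for a, b in combinations(records, 2):
        same_email = bool(a["email"]) and a["email"].lower() == b["email"].lower()
        score = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
        if same_email or score >= threshold:
            pairs.append((a["id"], b["id"]))
    return pairs

# Hypothetical sample data.
customers = [
    {"id": 1, "name": "John Smith", "email": "john@example.com"},
    {"id": 2, "name": "John Smyth", "email": ""},
    {"id": 3, "name": "J Smith", "email": "john@example.com"},
    {"id": 4, "name": "Priya Patel", "email": "priya@example.com"},
]
duplicate_pairs = likely_duplicates(customers)
```

Treat the output as a review queue, not a delete list. Name similarity alone will produce false positives, so a person should confirm each pair before merging.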
2. Inconsistent Formatting
Phone numbers in your CRM: sometimes with country code (+44), sometimes without. Sometimes formatted as (020) 1234 5678, sometimes as 02012345678, sometimes as 020-1234-5678. Dates: sometimes DD/MM/YYYY, sometimes MM/DD/YYYY, sometimes "5th March 2024."
Inconsistent formatting creates phantom duplicates and breaks matching logic. Your AI model will treat "John, Software Consultant" and "john, software consultant" as two different job titles.
Action: Standardize formatting across every critical field. Phone = one international format with country code (E.164, for example +442012345678). Dates = YYYY-MM-DD (ISO 8601). Names = title case. Titles/categories = one lookup table, one correct value only.
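A rough sketch of what standardizing two of those fields can look like in Python. It assumes every phone number is a UK number and every slash-separated date is day-first. Both are assumptions you must confirm against your own source systems, because a date like 05/03/2024 cannot be disambiguated from the value alone.

```python
import re
from datetime import datetime

def normalize_uk_phone(raw: str) -> str:
    """Strip punctuation and rewrite a UK number with a +44 prefix.

    Assumes every input is a UK number; other countries need their own rule.
    """
    digits = re.sub(r"\D", "", raw)
    if digits.startswith("44"):
        digits = digits[2:]
    digits = digits.lstrip("0")
    return "+44" + digits

def normalize_date(raw: str, formats=("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y")) -> str:
    """Return the date as YYYY-MM-DD, trying each known format in order.

    Ordinal suffixes ("5th") are removed first. The default format list is
    day-first; a source known to be month-first needs "%m/%d/%Y" instead.
    """
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", raw.strip())
    for fmt in formats:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")

phones = [normalize_uk_phone(p) for p in ["(020) 1234 5678", "02012345678", "+44 20 1234 5678"]]
dates = [normalize_date(d) for d in ["05/03/2024", "5th March 2024", "2024-03-05"]]
```

Values that match none of the formats raise an error instead of being guessed at. That is deliberate: a loud failure is easier to fix than a silently wrong date.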
3. Missing Fields (The Silent Killer)
Your customer database has a "Last Purchase Date" field. 40% of records are blank. Your sales team added the field last year but never backfilled historical data. An AI model trained on this will think 40% of your customers haven't bought anything. It will score them as dead leads.
Missing data isn't just incomplete. It's misleading. A blank field is worse than a wrong field because the system can't see it's wrong.
Action: For every critical field, know your missing-data rate. If it's over 5%, either backfill it (merge from another source, or calculate it) or remove the field entirely. Don't let AI train on incomplete information.
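Measuring the missing-data rate is a few lines of code. This sketch assumes your records are available as a list of dictionaries and that blank means `None` or an empty string. The field names and sample data are hypothetical.

```python
def missing_rates(records, fields):
    """Return {field: fraction of records where the field is blank or absent}."""
    total = len(records)
    rates = {}
    for field in fields:
        blank = sum(1 for r in records if r.get(field) in (None, ""))
        rates[field] = blank / total if total else 0.0
    return rates

def fields_over_threshold(rates, threshold=0.05):
    """List fields whose missing rate exceeds the threshold (5% in this checklist)."""
    return sorted(f for f, rate in rates.items() if rate > threshold)

# Hypothetical sample: 10 customers, 4 with no last purchase date.
customers = [
    {"email": f"c{i}@example.com", "last_purchase_date": "2024-03-05" if i >= 4 else ""}
    for i in range(10)
]
rates = missing_rates(customers, ["email", "last_purchase_date"])
flagged = fields_over_threshold(rates)
```

Here `flagged` contains only `last_purchase_date`, which is the field to backfill or drop before training.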
4. Outdated Entries
You have contact information for an employee who left three years ago. They're still in your company directory. An automation system pulls from this directory and sends them emails. The company looks careless. Or worse: a churn prediction model thinks this is a record of someone who left, and trains on all the bad patterns from their final months.
Action: For customer, employee, and contact records, set a "last active" date. Archive or mark as inactive anything with no updates in the last 12 months. Don't delete it, because you might need it for historical analysis, but take it out of the active dataset.
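One way to separate active from inactive records without deleting anything, sketched in Python. The `last_active` field name and the choice to treat a missing date as inactive are assumptions of this example, not a rule.

```python
from datetime import date, timedelta

def split_active(records, today, max_age_days=365):
    """Partition records into (active, inactive) by their last_active date.

    Nothing is deleted: inactive records are returned separately so they stay
    available for historical analysis. A missing last_active date counts as
    inactive, which is an assumption of this sketch.
    """
    cutoff = today - timedelta(days=max_age_days)
    active, inactive = [], []
    for r in records:
        last = r.get("last_active")
        if last is not None and last >= cutoff:
            active.append(r)
        else:
            inactive.append(r)
    return active, inactive

# Hypothetical directory entries.
directory = [
    {"name": "Current Employee", "last_active": date(2024, 6, 1)},
    {"name": "Left Three Years Ago", "last_active": date(2021, 12, 1)},
    {"name": "No Date Recorded", "last_active": None},
]
active, inactive = split_active(directory, today=date(2025, 1, 1))
```

Passing `today` in explicitly keeps the result reproducible, which matters when you rerun the split later and need to explain why a record moved.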
5. Conflicting Definitions Across Systems
Your accounting system defines "revenue" as invoiced amount. Your CRM defines "revenue" as cash received. Your forecasting spreadsheet defines "revenue" as projected annual spend based on contract terms. Three systems, three different definitions. When you pull data from all three to train an AI, it's learning from three conflicting realities.
Action: Create a single source-of-truth document that defines every critical metric. "Revenue = invoiced amount in GBP recorded in Xero on the invoice date." Make sure the AI system is trained on data from one system only, or clearly labeled data from each.
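The source-of-truth document can also live next to the data as a small lookup, so the training set is filtered by definition instead of by memory. A minimal sketch, assuming each row carries a `source` label; the system names and amounts are hypothetical.

```python
# One agreed definition per metric, naming the only system allowed to supply it.
METRIC_DEFINITIONS = {
    "revenue": {
        "source": "accounting",
        "definition": "Invoiced amount in GBP on the invoice date",
    },
}

def training_rows(rows, metric):
    """Keep only rows for `metric` that come from its designated source system."""
    source = METRIC_DEFINITIONS[metric]["source"]
    return [r for r in rows if r["metric"] == metric and r["source"] == source]

# Hypothetical rows pulled from three systems with three meanings of "revenue".
rows = [
    {"metric": "revenue", "source": "accounting", "amount_gbp": 1200},
    {"metric": "revenue", "source": "crm", "amount_gbp": 900},
    {"metric": "revenue", "source": "forecast_sheet", "amount_gbp": 15000},
]
selected = training_rows(rows, "revenue")
```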
6. Test Data Left in Production
Your team was testing the CRM. They created 50 test records with dummy data: names like "Test User," email "test@test.com," addresses like "123 Main Street." The test was supposed to be cleaned up. It wasn't. The data is still there. Your AI model is learning from it.
Action: Search for obvious test patterns (names like "Test," "Demo," "Sandbox"; emails like "test@," "demo@"; phone numbers like 00000000; addresses like "123 Main Street"). Delete or isolate them before running any AI.
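The pattern search above translates directly into a few regular expressions. This is a sketch: the field names are assumed, and the patterns cover only the examples listed here. A hit means "review this record", not "delete it", since a real customer can have an unusual name.

```python
import re

# Patterns taken from the examples in this section; extend them for your own data.
TEST_PATTERNS = {
    "name": re.compile(r"\b(test|demo|sandbox)\b", re.IGNORECASE),
    "email": re.compile(r"^(test|demo)@", re.IGNORECASE),
    "phone": re.compile(r"^0+$"),
    "address": re.compile(r"^123 main street$", re.IGNORECASE),
}

def looks_like_test_record(record):
    """Return the names of fields that match a known test pattern."""
    hits = []
    for field, pattern in TEST_PATTERNS.items():
        value = str(record.get(field, "")).strip()
        if value and pattern.search(value):
            hits.append(field)
    return hits

# Hypothetical records.
suspect = {"name": "Test User", "email": "test@test.com",
           "phone": "00000000", "address": "123 Main Street"}
genuine = {"name": "Priya Patel", "email": "priya@example.com",
           "phone": "+442012345678", "address": "4 High Street"}
suspect_hits = looks_like_test_record(suspect)
genuine_hits = looks_like_test_record(genuine)
```

Count the flagged records per field first. If one pattern accounts for thousands of hits, it is probably too broad and needs tightening before you isolate anything.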
7. No Audit Trail or Timestamps
You're trying to build a churn prediction model. You need to know when each customer was active, when they went quiet, when they left. But your database doesn't record "date updated" or "date created." You have customer snapshots but no timeline. You can't tell if a customer left last week or six months ago.
Action: Before feeding data to AI, ensure every record has: created_date, updated_date, and (for transaction data) a transaction_date. If your system doesn't track this, add it now. Backfill if possible with the oldest available information.
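A quick way to find records that fail the timestamp requirement, sketched in Python. It assumes the dates are stored as `date` objects under the field names used above and adds one sanity check (an update cannot precede creation) that is my own suggestion.

```python
from datetime import date

REQUIRED_TIMESTAMPS = ("created_date", "updated_date")

def timestamp_problems(record, is_transaction=False):
    """Return a list of human-readable problems with a record's timestamps."""
    required = REQUIRED_TIMESTAMPS + (("transaction_date",) if is_transaction else ())
    problems = [f"missing {name}" for name in required if record.get(name) is None]
    created, updated = record.get("created_date"), record.get("updated_date")
    if created is not None and updated is not None and updated < created:
        problems.append("updated_date is earlier than created_date")
    return problems

# Hypothetical records.
good = {"created_date": date(2023, 1, 10), "updated_date": date(2024, 2, 1)}
bad = {"created_date": None, "updated_date": date(2024, 2, 1)}
good_problems = timestamp_problems(good)
bad_problems = timestamp_problems(bad, is_transaction=True)
```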
8. No Data Quality Baseline (You Don't Know What You Don't Know)
You haven't measured your data quality. You think it's probably fine. You don't know that 15% of product SKUs are missing pricing. You don't know that 23% of customer addresses are incomplete. You don't know that an import job failed silently three months ago and didn't log any errors.
Action: Run a data quality audit on every dataset before AI. Count missing values (by field), duplicates (by key identifier), format violations (by field), and age of records. You should be able to say: "Our customer database is 94% complete, with 2% duplicates and records averaging 18 months old." If you can't say that, you're not ready for AI.
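The baseline sentence above can be produced by a short script. This sketch computes completeness, duplicate rate, and average record age for a non-empty list of records that all carry a `created_date`. Format violations are left out to keep it short, and the key, field names, and sample data are hypothetical.

```python
from collections import Counter
from datetime import date

def quality_baseline(records, key, critical_fields, today):
    """Summarise completeness, duplicate rate, and average record age in months."""
    total = len(records)
    if total == 0:
        raise ValueError("no records to audit")
    cells = total * len(critical_fields)
    filled = sum(1 for r in records for f in critical_fields
                 if r.get(f) not in (None, ""))
    # Every copy beyond the first for a given key counts as a duplicate.
    key_counts = Counter(r[key] for r in records)
    extra_copies = sum(n - 1 for n in key_counts.values())
    # 30.44 is the average number of days in a month.
    ages = [(today - r["created_date"]).days / 30.44 for r in records]
    return {
        "completeness_pct": round(100 * filled / cells, 1),
        "duplicate_pct": round(100 * extra_copies / total, 1),
        "avg_age_months": round(sum(ages) / total, 1),
    }

# Hypothetical sample: one duplicate email and one blank postcode.
customers = [
    {"email": "a@example.com", "postcode": "N1", "created_date": date(2024, 1, 1)},
    {"email": "a@example.com", "postcode": "", "created_date": date(2024, 1, 1)},
    {"email": "b@example.com", "postcode": "E2", "created_date": date(2024, 7, 1)},
    {"email": "c@example.com", "postcode": "W3", "created_date": date(2024, 7, 1)},
]
baseline = quality_baseline(customers, key="email",
                            critical_fields=["email", "postcode"],
                            today=date(2025, 1, 1))
```

Save the result with the date you ran it. The value of a baseline is in comparing it with the next run.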
The Real Cost of Skipping This
I've seen companies skip data cleanup because they wanted to move fast. The cost:
A logistics firm spent £18,000 on an invoice automation system. Three months in, they discovered it was creating duplicate payments because the data had invoice duplicates they hadn't caught. They spent another £6,000 fixing data, then another £4,000 rebuilding the automation. Total time: five months. They could have cleaned the data in two weeks for £2,000 before the project started.
The pattern is always the same: bad data hides until the system is live. By then, you're managing failures instead of building capability.
How to Use This Checklist
Before any AI or automation project, run through each of these eight checks. You're looking for the "killer" issues that will sink the project. You don't need perfection. You need to know what's wrong before you build on top of it.
For each item, ask:
Do we have duplicates? Run a tool. How many are there? Is the pattern clear enough to automate cleanup?
Are our formats consistent? Spot-check 20 records from your key fields. Are they in the same format every time?
What percentage of critical fields are blank? Over 5%? That field is a problem.
Are our records current? When was the oldest active record created? Is it older than it should be?
Do we have one definition of "revenue" (or customer, or product, or whatever matters)? Ask three different departments and see if they agree.
Is there test data mixed with real data? Search your database for test patterns and count them.
Do our records have timestamps? Can you tell when each record was created and last updated?
Have we measured any of this? If you can't answer the above questions with actual numbers, you have a baseline problem.
That's the checklist. It takes a few hours. It saves months of failures.
Frequently Asked Questions
How much does data cleanup cost?
Depends on your data volume and corruption level. For a 100K-customer database with moderate issues: £2,000-£5,000 in tooling and time. For a messy 500K+ database with multiple systems: £10,000-£25,000. Compare that to the cost of deploying AI on bad data (failed projects, rework, lost time). You'll always come out ahead cleaning first.
Can we clean data and train AI at the same time?
You can. It's slower and messier. You'll have to rebuild models once cleaning is done because the dataset changed under the model. Clean first, then train. It's faster in the end.
What if we don't have time to fix everything?
Prioritize the fields the AI will actually use. If you're building a churn prediction model, fix customer contact data, purchase history, and activity dates. Ignore fields the model won't see. Low-priority cleanup can wait.
How do we stop bad data from coming back?
Governance. Set rules before import: phone numbers must be valid format, dates must be YYYY-MM-DD, customer names can't be "Test." Add validation at the point of data entry. If data enters your system badly, the bad patterns will stay forever. Make it hard to enter garbage in the first place.
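Those rules can be enforced in code at the point of entry. A minimal sketch of a validator you might call before saving a record; the field names (`phone`, `signup_date`, `name`) and the exact rules are assumptions for illustration.

```python
import re
from datetime import datetime

def validate_customer(record):
    """Return a list of rule violations; an empty list means the record may be saved."""
    errors = []

    phone = str(record.get("phone") or "")
    if not re.fullmatch(r"\+\d{8,15}", phone):
        errors.append("phone must be in international format, e.g. +442012345678")

    signup = str(record.get("signup_date") or "")
    try:
        # Check the shape first, then let strptime reject impossible dates.
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", signup):
            raise ValueError("wrong shape")
        datetime.strptime(signup, "%Y-%m-%d")
    except ValueError:
        errors.append("signup_date must be a valid date in YYYY-MM-DD format")

    name = str(record.get("name") or "").strip().lower()
    if name in {"", "test"}:
        errors.append("name must not be blank or 'Test'")

    return errors

# Hypothetical records.
ok_errors = validate_customer(
    {"name": "Priya Patel", "phone": "+442012345678", "signup_date": "2024-03-05"})
bad_errors = validate_customer(
    {"name": "Test", "phone": "020 1234 5678", "signup_date": "05/03/2024"})
```

Returning every violation at once, instead of stopping at the first, lets the person entering the data fix the record in one pass.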
Should we hire a data engineer to do this?
Not necessarily. Most of this work is mechanical: run deduplication tools, standardize formats, validate completeness. A business analyst with basic SQL skills can do it. You need a data engineer if your data is distributed across five systems with no clear master, or if you're dealing with 10+ million records. For most companies, this is a "fix it yourself" job.
Richard Batt has delivered 120+ AI and automation projects across 15+ industries. He helps businesses deploy AI that actually works, with battle-tested tools, templates, and implementation roadmaps. Featured in InfoWorld and WSJ.