Richard Batt
Leading AI Models Produced Flawed Results on Real Work Tasks
Tags: AI Strategy, Operations
In January, a data analyst at a logistics firm used Claude to write SQL. The query ran. The results looked right. But the logic was wrong. Her manual check caught it before it fed into marketing spend, an error that could have cost the company thousands.
Key Takeaways
- What the Research Actually Shows
- The Quality Control Framework: Four Essential Layers (apply these before building anything)
- Real Examples Where Quality Controls Caught Expensive Mistakes
- The Conversation You Need to Have With Your Team
I raise this not to criticise AI models, but because researchers at the Center for AI Safety and Scale AI recently published findings that confirm what I am seeing across my consulting work: leading AI models, the ones marketed as enterprise-ready, produce flawed results on actual work tasks at rates most organisations are not prepared for.
The study was straightforward: they took real-world work assignments from various professional domains and asked leading AI models to complete them. The results? The models performed worse than organisations expect, and in some domains, the error rates were significant enough that deploying these models without proper quality controls would be genuinely risky.
This is not a reason to avoid AI. It is a reason to build quality controls that most organisations currently lack.
What the Research Actually Shows
The Center for AI Safety's research tested Claude, GPT-4, Gemini, and other leading models on real work tasks across multiple domains: data analysis, writing, coding, research synthesis, and customer service responses. All models produced flawed outputs at detectable rates; some were significantly better than others, but none were error-free.
What struck me about these findings, and what I have confirmed across 47 client implementations of AI systems, is that the error rates are not random. They are patterned. Models make certain types of mistakes consistently. GPT-4 tends to produce plausible-sounding but inaccurate data when synthesising research. Claude tends to over-summarise and occasionally drop nuances from source documents. Gemini sometimes hallucinates citations.
These are not deal-breakers. But they are real. And they require quality controls that most organisations deploying AI have not built.
The Quality Control Framework: Four Essential Layers
Over the past eighteen months, I have designed AI quality control systems for eight different client organisations. What I have learned is that organisations that succeed with AI are the ones that treat AI output with the same scrutiny they would apply to any other business-critical process. Here is the framework I now recommend to every client.
Layer One: Never Trust AI Output Without Human Review Above a Cost Threshold
Define a cost threshold for your organisation. For most organisations, this is somewhere between $500 and $5,000, depending on the domain. Any AI-generated output that would influence a decision worth more than this threshold must be reviewed by a human expert before it is acted upon. Below the threshold, you can automate. Above it, you must review.
A financial services client I advised set their threshold at $1,000. Any customer acquisition recommendation from their AI system worth more than $1,000 in marketing spend went to a human analyst before being executed. Cheaper recommendations ran automatically. This simple rule caught an average of 3-4 AI errors per month that would have cost the company money. Without this layer, those errors would have compounded.
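If you want to encode the rule rather than rely on habit, it can be a few lines of routing logic. Here is a minimal sketch; the `Recommendation` record and queue names are hypothetical, and the $1,000 threshold mirrors the client example above, so calibrate it to your own risk tolerance.

```python
from dataclasses import dataclass

# Assumed threshold; set this to your organisation's own figure ($500-$5,000).
REVIEW_THRESHOLD_USD = 1_000

@dataclass
class Recommendation:
    description: str
    decision_value_usd: float  # the spend this AI output would influence

def route(rec: Recommendation) -> str:
    """Send high-stakes AI outputs to a human expert; automate the rest."""
    if rec.decision_value_usd > REVIEW_THRESHOLD_USD:
        return "human_review_queue"
    return "auto_execute_queue"

print(route(Recommendation("Increase paid search budget", 4_200)))  # human_review_queue
print(route(Recommendation("Send follow-up email", 120.0)))         # auto_execute_queue
```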
Layer Two: Build Validation Checkpoints Into Every AI Workflow
Do not ask AI to do a task end-to-end. Break it into steps. At each step, build a validation checkpoint where the output is verified before moving to the next step.
I worked with a content generation team that was using AI to write blog posts. Initially, they had the AI write the entire post in one pass. Quality was inconsistent. Then we restructured the workflow: the AI writes an outline. A human reviews the outline. The AI writes the first draft. A human reviewer checks factual claims against sources. The AI refines based on feedback. A human editor does a final pass. This multi-step validation approach reduced their error rate from around 8% (posts with material errors) to less than 1%.
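Here is a minimal sketch of that checkpointed workflow, with a generic `generate` callable standing in for whichever model you use; the structure (each AI step gated by a human check) is the point, not the specific prompts.

```python
from typing import Callable

def human_approves(stage: str, output: str) -> bool:
    """Minimal human checkpoint; in production this would be a review queue or UI."""
    print(f"--- {stage} ---\n{output}\n")
    return input(f"Approve {stage}? [y/n] ").strip().lower() == "y"

def write_post(topic: str, generate: Callable[[str], str]) -> str:
    """Checkpointed content workflow: every AI step is gated by a human check."""
    outline = generate(f"Outline a blog post about {topic}")
    if not human_approves("outline review", outline):
        raise RuntimeError("Outline rejected; revise before drafting")

    draft = generate(f"Write a first draft from this outline:\n{outline}")
    if not human_approves("fact check against sources", draft):
        raise RuntimeError("Draft failed fact check")

    final = generate(f"Refine this draft for clarity:\n{draft}")
    if not human_approves("final editorial pass", final):
        raise RuntimeError("Final pass rejected")
    return final
```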
Layer Three: Track Error Rates Monthly Because They Change as Models Update
AI models are updated frequently, and when they are, their error rates change. Sometimes they improve; sometimes they regress. Most organisations deploy an AI tool and never measure whether its quality is changing over time.
I recommend a simple monthly audit: take a sample of 20-30 AI outputs from the previous month. Have a domain expert review them and mark any errors or quality issues. Track this over time. You will notice patterns. You will see when model updates improve performance. You will see when they regress. And you will catch when an AI tool is no longer performing as expected.
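A minimal sketch of that audit loop follows, assuming you log outputs month by month and a reviewer marks each sampled item as erroneous or not; the 3-percentage-point regression tolerance is an assumed value, not a standard.

```python
import random

def sample_for_audit(outputs: list[dict], n: int = 25) -> list[dict]:
    """Draw a random sample (20-30 items) of last month's AI outputs."""
    return random.sample(outputs, min(n, len(outputs)))

def error_rate(reviewed: list[dict]) -> float:
    """Share of sampled outputs the domain expert marked as erroneous."""
    return sum(1 for item in reviewed if item["has_error"]) / len(reviewed)

def regressed(history: list[float], latest: float, tolerance: float = 0.03) -> bool:
    """Flag when this month's error rate exceeds the trailing average by more
    than the tolerance (3 percentage points here, an assumed value)."""
    if not history:
        return False
    baseline = sum(history) / len(history)
    return latest > baseline + tolerance

# Example: three months of stable audits, then a jump after a model update.
print(regressed([0.04, 0.05, 0.04], 0.11))  # True: investigate the update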
One manufacturing client I worked with implemented this and discovered that after a model update in Q4 2025, their AI's accuracy on component quality assessment recommendations dropped by 7%. The drop was not catastrophic, but it was material. Had they not tracked it, they would not have known it was happening.
Layer Four: Create a Confidence Scoring System for AI Outputs
Ask the AI to rate its own confidence in its output. This is crude, but it works better than you might expect. When I ask Claude to complete a task and then ask "What is your confidence level in this output on a scale of 1-10, and why?" the model provides a self-assessment that correlates reasonably well with actual accuracy.
Use this confidence score to triage human review effort. High-confidence outputs get a lighter touch review. Low-confidence outputs get deep human scrutiny. This lets you focus human attention on the AI outputs most likely to be wrong.
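In practice this can be a follow-up prompt plus a triage rule. A minimal sketch, with a generic `ask_model` callable and an assumed cutoff of 7; parse the score defensively, because models do not always answer with a bare number.

```python
import re
from typing import Callable

def self_reported_confidence(ask_model: Callable[[str], str], output: str) -> int:
    """Ask the model to rate its own output 1-10 and parse the first number."""
    reply = ask_model(
        "On a scale of 1-10, what is your confidence in this output, and why?\n\n"
        + output
    )
    match = re.search(r"\b(10|[1-9])\b", reply)
    return int(match.group(1)) if match else 1  # unparseable: treat as low

def review_tier(confidence: int, cutoff: int = 7) -> str:
    """Triage human attention: deep scrutiny for low-confidence outputs."""
    return "light_review" if confidence >= cutoff else "deep_review"

print(review_tier(9))  # light_review
print(review_tier(4))  # deep_review
```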
Real Examples Where Quality Controls Caught Expensive Mistakes
A legal research team I worked with in 2024 was using AI to summarise case law and identify relevant precedents. The system was fast: it reduced research time by 70%. But one of their reviewers noticed that the AI occasionally missed critical distinctions between cases. One early error could have been material: the AI recommended a precedent that seemed relevant but had actually been distinguished in ways that would have been devastating to the case. A junior lawyer would have missed it. Their human reviewer caught it because she applied layer-two validation, requiring the AI to cite specific passages supporting its analysis before it moved to the next step.
An ecommerce company I advised was using AI to generate product descriptions. The system was effective: it produced a 20% increase in click-through rates compared to human-written descriptions. But our monthly audit (layer three) revealed that about 2% of AI-generated descriptions included subtle errors: wrong dimensions, incorrect material specifications, inaccurate care instructions. These errors increased product returns by 0.3%, which erased a significant chunk of the efficiency gains. Once we added a validation checkpoint requiring dimensional data to be checked against the product database before publication, the error rate dropped to near zero.
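That kind of checkpoint can be entirely mechanical. A minimal sketch, assuming the AI output is parsed into structured fields and the product database is keyed by SKU; all field names here are hypothetical.

```python
def validate_description(desc: dict, product_db: dict) -> list[str]:
    """Compare AI-generated fields against the product database record.
    An empty return value means the description may be published."""
    record = product_db[desc["sku"]]
    problems = []
    for field in ("dimensions_cm", "material", "care_instructions"):
        if desc.get(field) != record.get(field):
            problems.append(
                f"{field}: AI wrote {desc.get(field)!r}, database has {record.get(field)!r}"
            )
    return problems

db = {"SKU-1": {"dimensions_cm": "30x20x10", "material": "oak", "care_instructions": "wipe dry"}}
ai_desc = {"sku": "SKU-1", "dimensions_cm": "30x20x12", "material": "oak", "care_instructions": "wipe dry"}
print(validate_description(ai_desc, db))  # flags the wrong dimensions
```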
A customer service organisation I worked with deployed ChatGPT to handle tier-one support queries. Accuracy was high, about 94%. But they did not measure this consistently. Six months into deployment, I ran a sample audit and discovered that accuracy had drifted down to 87% after a model update. Much of the lost accuracy was in the handling of refund policies, which is material. They implemented layer three (monthly sampling and measurement) and immediately switched to a different model configuration that recovered their original accuracy.
The Conversation You Need to Have With Your Team
If you are currently using AI in your business without formal quality controls, here is what I would recommend: schedule a review meeting with the stakeholders who depend on that AI system. Ask them three questions:
First: What is the cost if this AI system produces an error that you do not catch? Second: How are we currently verifying that the AI system is producing accurate outputs? Third: If we discovered tomorrow that the system had been producing errors for the past month, what would we do?
Most organisations cannot answer these questions clearly. That absence of clarity is the real risk.
I am not suggesting that AI is too risky to deploy. I am saying that AI without quality controls is risky. And the organisations that will win with AI are the ones that build quality controls systematically, measure them, and refine them over time.
The core message of the Center for AI Safety research is not new. What is new is that organisations are finally taking it seriously. If you have not yet built quality controls into your AI workflows, this is the moment to do so.
Richard Batt has delivered 120+ AI and automation projects across 15+ industries. He helps businesses deploy AI that actually works, with battle-tested tools, templates, and implementation roadmaps. Featured in InfoWorld and WSJ.
Frequently Asked Questions
How long does it take to build AI automation in a small business?
Most single-process automations take 1-5 days to build and start delivering ROI within 30-90 days. Complex multi-system integrations take 2-8 weeks. The key is starting with one well-defined process, proving the value, then expanding.
Do I need technical skills to automate business processes?
Not for most automations. Tools like Zapier, Make.com, and N8N use visual builders that require no coding. About 80% of small business automation can be done without a developer. For the remaining 20%, you need someone comfortable with APIs and basic scripting.
Where should a business start with AI implementation?
Start with a process audit. Identify tasks that are high-volume, rule-based, and time-consuming. The best first automation is one that saves measurable time within 30 days. Across 120+ projects, the highest-ROI starting points are usually customer onboarding, invoice processing, and report generation.
How do I calculate ROI on an AI investment?
Measure the hours the process consumes before automation, multiply by your fully loaded hourly cost, then subtract the tool cost to get your net saving. ROI is that net saving divided by the tool cost. Most small business automations cost £50-500/month and save 5-20 hours per week. That typically means 300-1000% ROI in year one.
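A worked example of that calculation, using assumed numbers (£30/hour fully loaded cost, 10 hours saved per week, a £200/month tool):

```python
hours_saved_per_week = 10        # assumed
hourly_cost_gbp = 30             # fully loaded hourly cost (assumed)
tool_cost_per_month_gbp = 200    # assumed

monthly_saving = hours_saved_per_week * 4.33 * hourly_cost_gbp   # ~£1,299
net_monthly_saving = monthly_saving - tool_cost_per_month_gbp    # ~£1,099
roi = net_monthly_saving / tool_cost_per_month_gbp
print(f"Year-one ROI: roughly {roi:.0%}")  # about 550%, within the range above
```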
Which AI tools are best for business use in 2026?
It depends on the use case. For content and communication, Claude and ChatGPT lead. For data analysis, Gemini and GPT work well with spreadsheets. For automation, Zapier, Make.com, and N8N connect AI to your existing tools. The best tool is the one your team will actually use and maintain.
What Should You Do Next?
If you are not sure where AI fits in your business, start with a roadmap. I will assess your operations, identify the highest-ROI automation opportunities, and give you a step-by-step plan you can act on immediately. No jargon. No fluff. Just a clear path forward built from 120+ real implementations.
Book Your AI Roadmap: 60 minutes that will save you months of guessing.
Already know what you need to build? The AI Ops Vault has the templates, prompts, and workflows to get it done this week.