Richard Batt
How to Evaluate Any New AI Tool in 15 Minutes: My Framework
Tags: AI Tools, Productivity
The Problem: AI Tools, Everywhere
New AI tools launch almost every day. I'm not exaggerating. Last week I got emails about four new AI summarization tools, two new code generation frameworks, and one AI tool that claims to handle customer support. All of them claim to be the best at what they do. All of them are probably some combination of useful, overhyped, and not right for you.
Key Takeaways
- Limit setup to three minutes; a tool that takes longer to reach a working state is telling you something.
- Category 1: Core Capability. Test one real task from your own work, not a toy problem.
- Category 2: Integration. Check that it connects to the systems you already use.
- Category 3: Cost and Scaling. Model the price at your actual usage volume.
- Category 4: Security and Data Handling. Find out where your data goes and whether it trains the model.
- Category 5: Lock-in Risk. Confirm you can export your data and leave.
I've made a lot of bad bets on AI tools. Spent weeks setting up something that looked great and turned out to be useless for my actual workflow. Went through onboarding that took eight hours for something I could accomplish in two. Got excited about a feature that doesn't actually exist yet. You know this feeling.
So I built a framework. Fifteen minutes. Five categories. By the end, you know if a tool is worth deeper exploration or if you're wasting your time. This is what I use when I'm evaluating new tools, and it works.
The Five-Minute Rules
Before I walk through each category, the meta-rule: you get three minutes of setup before you can evaluate anything. If a tool requires more than three minutes to reach a working state (downloading something, creating an account, configuring an integration), you're already wasting time. Most good tools get you to something useful faster than that.
This rule filters out a lot of noise. A tool that requires 30 minutes of setup before you can test the core feature has already told you something about its UX philosophy: "We're comfortable asking you to invest significantly before you know if this is useful." That's a red flag.
Category 1: Core Capability (Three Minutes)
Does it actually do what it claims? Not in theory. Right now. For your use case.
Here's what I do: I pick one real task from my actual work. Not a toy problem. Something I actually need to accomplish. Then I try to use the tool to do it.
For example, last month I evaluated a new AI code review tool. Core task: take a GitHub PR and generate a detailed review. I didn't create a toy PR. I used a real PR from a client project. I ran the tool. Did it generate a useful review? Did it catch real issues or just generic stuff? Did it take 30 seconds or five minutes?
The review was technically accurate but shallow. It found one real issue and missed two others that were more subtle. It was like a junior developer reviewing code after looking at it for 10 seconds. Not worthless, but not useful enough to integrate into my workflow. That told me this tool wasn't ready for what I needed.
Red flags in this category:
- The tool does something related to what you need but not exactly. (We do code review, but it's optimized for JavaScript only and you use Python.)
- It works but takes longer than you expected (It analyzes your code, but give us 3-5 minutes per file.)
- The output is generic or template-based (Yes it finds issues, but every issue description is basically the same boilerplate.)
- It requires configuration before it does anything useful (Yes it works! First set up integrations with these five systems...)
Green flags:
- It does the specific thing you need, not a close relative.
- It works faster than doing it manually.
- The output is specific and non-generic.
- It works out of the box with your real data.
Category 2: Integration (Three Minutes)
Now assume the core capability is solid. Can you actually fit this into your existing workflow, or are you building a new workflow around the tool?
Specifically: does it connect to the systems you already use? If you use GitHub, does it have a GitHub integration? If you use Slack, can it send messages there? If your data lives in Notion, can it read Notion? If not, you're going to be manually ferrying information between systems, and that friction will kill adoption.
Here's my test: I try to integrate it with one system I use daily. If there's a native integration, I turn it on. If there isn't, I check if it has an API. If the API requires more than 10 minutes to understand, that's friction. I mark it as a negative signal.
Real example: a new AI writing assistant claimed it would transform my workflow. But it didn't integrate with Google Docs, where I actually write. It had a web editor. So using it meant: open the web editor, write there, copy to Google Docs. That extra step would kill it. I didn't bother evaluating further.
Red flags:
- No integrations with tools you use regularly.
- Integration exists but is one-directional (We can read your data but not write back.)
- Integration requires writing your own code.
- Integration is real-time in theory but has a 6-hour sync delay.
Green flags:
- Native integration with your main tools.
- Two-way sync (reads and writes).
- Works without configuration.
- Data syncs in near-real-time.
Category 3: Cost and Scaling (Three Minutes)
What does this tool actually cost at the volume you'd use it?
This is where a lot of evaluations go wrong. The tool looks free, so you think it's free. Then you actually use it heavily and hit a usage limit or the pricing model reveals itself. I've seen this a dozen times: tool is free for 100 API calls per month. Sounds fine. You actually need 10,000 per month. Price jumps to $500/month. You never expected that cost.
Here's what I check: I look at the pricing page and I mentally model my usage. I'd probably use this 10 times per day on business days. That's about 200 uses per month. Then I find where that puts me on the pricing tier. Do I hit a paywall? Does it get expensive? Is the cost reasonable relative to the time it saves me?
Example: I evaluated an AI expense tool. The free tier gave 10 sessions per month. I'd probably do one per business day, so about 20 per month. I'm immediately over quota. The next tier was $40/month, which is fine for that volume. But the pricing was obscured, and I only figured it out by trying to use it heavily. That's bad UX.
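That mental math is simple enough to script. A minimal sketch, with illustrative tier numbers (not any vendor's real pricing):

```python
def monthly_cost(uses_per_day, tiers, business_days=21):
    """Estimate monthly usage and the cheapest pricing tier that covers it.

    tiers: list of (monthly_quota, monthly_price) sorted by quota.
    Returns (estimated uses per month, price), or (uses, None) if usage
    exceeds every listed tier.
    """
    uses = uses_per_day * business_days
    for quota, price in tiers:
        if uses <= quota:
            return uses, price
    return uses, None

# Illustrative tiers: free up to 10 uses/month, then $40/month up to 500.
tiers = [(10, 0), (500, 40)]
uses, price = monthly_cost(1, tiers)  # one use per business day
print(uses, price)  # 21 None-free: 21 uses lands you on the $40 tier
```

Five lines of modeling like this, before you sign up, is exactly the check the obscured pricing page was hoping you'd skip.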
Red flags:
- Free tier limits are ambiguous (Why am I getting an error after 50 uses?)
- Pricing scales non-linearly (First 1000 tokens are cheap, then the price jumps 10x.)
- You don't actually know what you'll pay until you use it.
- The per-unit cost gets more expensive as you use more.
Green flags:
- Pricing is clear and predictable.
- Free tier is genuinely useful or the paid tier is reasonable.
- Pricing scales linearly or gets cheaper with volume.
- You can estimate your cost before you start.
Category 4: Security and Data Handling (Three Minutes)
Where does your data go? Who sees it? What's the actual privacy policy versus the marketing language?
This is the category where I'm most paranoid, and I should be. If you're feeding an AI tool your customer data, your code, your internal documents, you need to know where that data lives and who has access to it.
I check: does the tool have an explicit privacy policy? Not a marketing page, an actual policy. Does it say where data is stored? Does it say whether your data is used for training? Is there a data deletion policy? Do they have a security audit or SOC 2 certification? If it's a B2B tool handling sensitive data, do they have a contract option?
I also check the fine print for something specific: "Your data may be used to improve the model." That phrase means your data is being fed into their training pipeline. Depending on what your data is, that's either fine or a deal-breaker. Know which one it is.
Real example: an AI code review tool I looked at had a privacy policy that said code would be anonymized and potentially used to improve their model. That's a no-go for me because my code often contains client secrets and proprietary algorithms. I didn't pursue it further.
Red flags:
- No clear privacy policy or it's vague.
- Data is used for model training without explicit opt-out.
- No geographic data residency options if you need them.
- No security audit or certification for B2B tools.
- Data deletion policy is unclear or long-delayed.
Green flags:
- Explicit privacy policy that's readable.
- Clear statement that your data is not used for training.
- Data residency options if needed.
- Security certification for B2B tools.
- Data deletion on-demand.
Category 5: Lock-in Risk (Three Minutes)
Can you actually leave if you need to? Or are you now dependent on this vendor?
This is the question I ask at the end: if this company disappears, gets acquired, changes their pricing 10x, or otherwise does something bad, can I extract my data and move to something else?
For data-based tools, I check: can I export my data? In what format? Is it in a standard format or some proprietary format that only works with their tool? How hard is it to migrate to a competitor?
For tools that are more about workflow, I check: am I building critical dependencies on this tool? If it disappears, do I lose everything or is it just an efficiency play?
Real example: I use Claude Code a lot. Lock-in risk: medium. Anthropic is reliable, and I can export projects and move code to other editors. The knowledge I build (how to prompt effectively, architecture decisions) is portable. I'm comfortable with that level of lock-in. I do not use a particular competitor's tool that has no export feature and stores everything on their servers. Same capability, worse lock-in risk. I choose Claude.
Red flags:
- No export feature or export is cumbersome.
- Data is stored only on their servers with no local option.
- You're building critical workflows that depend entirely on this tool working exactly as it is now.
- Switching costs are very high.
Green flags:
- Easy data export in standard formats.
- Local storage option available.
- The tool improves your efficiency but isn't critical infrastructure.
- Switching costs are low.
The Decision Tree
After you've checked all five categories, here's how I decide:
If it fails either of the first two categories (core capability or integration), I don't use it. It doesn't matter how cheap it is or how good the privacy policy is if it doesn't actually do what I need or can't fit into my workflow.
If it fails category three (cost) severely-like it costs 10x what I expected-I pass unless the capability is truly exceptional.
If it fails category four (security), I don't use it with sensitive data, period.
If it fails category five (lock-in), I evaluate the risk. If it's low lock-in work, I might use it anyway. If it's high lock-in work, I look for alternatives.
If it passes all five with no major red flags, I'll integrate it and give it 30 days of real usage. That's when I really know if it works.
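The decision tree above can be sketched as a small function. Each argument is the boolean judgment you make after the 15-minute check; this is a sketch of the logic, not a scoring product:

```python
def decide(core, integration, cost_ok, cost_10x_over, exceptional,
           security_ok, sensitive_data, lock_in_high):
    """Apply the five-category decision tree and return a verdict string."""
    if not core or not integration:
        return "reject"                      # categories 1-2 are hard gates
    if not cost_ok and cost_10x_over and not exceptional:
        return "reject"                      # way over budget, nothing special
    if not security_ok and sensitive_data:
        return "reject for sensitive data"   # category 4 gate
    if lock_in_high:
        return "look for alternatives"       # category 5: weigh the risk
    return "adopt: 30-day trial"

verdict = decide(core=True, integration=True, cost_ok=True,
                 cost_10x_over=False, exceptional=False,
                 security_ok=True, sensitive_data=True, lock_in_high=False)
print(verdict)  # adopt: 30-day trial
```

Writing it out this way makes the ordering explicit: capability and integration are absolute gates, cost and security are conditional gates, and lock-in is a judgment call rather than an automatic rejection.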
Tools That Passed and Failed
Let me give you three real examples:
Claude Code (Passed): Core capability: exceptional. Writes production code I'm confident in. Integration: excellent, I can use it on any codebase. Cost: reasonable for what I use it for. Security: transparent, no training on my data. Lock-in: low, my code is portable. Verdict: used daily.
One popular AI writing assistant (Failed category 2): Core capability: good, generates decent copy. Integration: no Google Docs plugin, forces me to web editor. That integration friction is a blocker. Verdict: didn't adopt.
An AI expense tool (Failed category 3): Core capability: solid, reads receipts accurately. Integration: good, connects to accounting software. Cost: fine on the surface, but usage-based pricing that hits hard at scale. Security: excellent. But when I modeled my actual usage, it would cost $300/month. I'm solving a $50/month problem. Verdict: too expensive to adopt.
The Practical Application
Use this framework the next time you're evaluating an AI tool. Set a timer. Go through the five categories in 15 minutes. You won't have perfect information, but you'll have enough to know whether something is worth deeper exploration or you're chasing a mirage.
The framework cuts through marketing. It cuts through hype. It focuses you on what actually matters: does the tool work for my actual task, does it fit my workflow, what does it cost, where's my data, and can I leave if needed? Answer those five questions and you have 80% of the decision-making power.
This saves you weeks of wasted onboarding. Weeks of integration work on tools you'll abandon. It lets you find the 10% of new AI tools that are legitimately useful and ignore the 90% that are noise.
Richard Batt has delivered 120+ AI and automation projects across 15+ industries. He helps businesses deploy AI that actually works, with battle-tested tools, templates, and implementation roadmaps. Featured in InfoWorld and WSJ.
Frequently Asked Questions
How long does it take to implement AI automation in a small business?
Most single-process automations take 1-5 days to implement and start delivering ROI within 30-90 days. Complex multi-system integrations take 2-8 weeks. The key is starting with one well-defined process, proving the value, then expanding.
Do I need technical skills to automate business processes?
Not for most automations. Tools like Zapier, Make.com, and N8N use visual builders that require no coding. About 80% of small business automation can be done without a developer. For the remaining 20%, you need someone comfortable with APIs and basic scripting.
Where should a business start with AI implementation?
Start with a process audit. Identify tasks that are high-volume, rule-based, and time-consuming. The best first automation is one that saves measurable time within 30 days. Across 120+ projects, the highest-ROI starting points are usually customer onboarding, invoice processing, and report generation.
How do I calculate ROI on an AI investment?
Measure the hours spent on the process before automation, multiply by fully loaded hourly cost, then subtract the tool cost. Most small business automations cost £50-500/month and save 5-20 hours per week. That typically means 300-1000% ROI in year one.
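That calculation, as a quick sketch. The figures below are made-up placeholders; plug in your own numbers:

```python
def first_year_roi(hours_saved_per_week, hourly_cost, tool_cost_per_month):
    """Return (annual net saving, first-year ROI %) for an automation.

    Value recovered = hours saved per week x fully loaded hourly cost x 52;
    ROI is the net saving as a percentage of the annual tool cost.
    """
    annual_saving = hours_saved_per_week * hourly_cost * 52
    annual_cost = tool_cost_per_month * 12
    net = annual_saving - annual_cost
    roi_pct = 100 * net / annual_cost
    return net, roi_pct

# Placeholder figures: 10 hours/week saved at a £30/hour loaded cost,
# on a £200/month tool.
net, roi = first_year_roi(10, 30, 200)
print(round(net), round(roi))  # 13200 550 -> £13,200 net, 550% ROI
```

A 550% first-year ROI sits comfortably inside the 300-1000% range quoted above, which is why the "one well-defined process first" advice pays off so quickly.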
Which AI tools are best for business use in 2026?
For content and communication, Claude and ChatGPT lead. For data analysis, Gemini and GPT work well with spreadsheets. For automation, Zapier, Make.com, and N8N connect AI to your existing tools. The best tool is the one your team will actually use and maintain.
Put This Into Practice
I use versions of these approaches with my clients every week. The full templates, prompts, and implementation guides, covering the edge cases and variations you will hit in practice, are available inside the AI Ops Vault. It is your AI department for $97/month.
Want a personalised implementation plan first? Book your AI Roadmap session and I will map the fastest path from where you are now to working AI automation.