Richard Batt
How to Build AI Automation That Survives When You Are Not Looking
Tags: Automation, Operations
A logistics company built an automated invoice system four years ago. It worked flawlessly for two years. Then the person who built it left. Nobody else knew how it worked. When data formats changed (the supplier updated their system), the automation broke silently. For three months, invoices weren't being processed. Thousands of pounds in payments were stuck.
When they finally discovered the problem, the damage was done. And rebuilding the automation took longer than building it the first time because the original builder wasn't around to explain the decisions.
This is the "bus factor" problem. If one person dies or leaves, the automation dies with them.
Key Takeaways
- Most automation debt comes from undocumented workflows with no clear owner
- Silent failure is worse than loud failure: you don't notice the problem until it's expensive
- The "bus factor" problem: if one person leaves, automation dies
- Five principles of maintainable automation prevent the vast majority of automation failures
- A maintenance routine takes 30 minutes per automation per week and saves months of crisis management
Why Automation Breaks (And Why Nobody Notices)
Automations fail silently. The invoice system runs every night, but nobody's watching. If it breaks, it fails quietly. Days pass before anyone notices invoices aren't being processed.
Compare that to a person doing the work. If they're sick, you notice immediately. The work doesn't get done. Someone has to cover. The failure is loud and fast, which is actually better because you can react.
Automated systems need to be built to fail loudly. And they need documentation and ownership so someone can fix them when they break.
Principle 1: Ownership (Who Owns This?)
Every automation needs one person who owns it. Not the person who built it. The person who's accountable for it working.
That person needs to:
Know the automation exists (obvious, but many companies have "shadow automations" nobody knows about).
Understand what it does and why it does it.
Be able to debug it if it breaks (or know who to call if they can't).
Check it weekly (look at the logs, verify the output is correct, scan for errors).
Decide when to update or kill it.
The owner doesn't need to be a technical expert. It's the person who cares most about the process it automates. The head of finance owns the invoicing automation. The operations manager owns the report-generation automation. They understand the problem better than anyone else, so they're best positioned to know when something's wrong.
Make this explicit. Write it down: "Sarah owns the invoice automation. If it breaks, Sarah investigates and either fixes it or escalates to the developer. Sarah reviews the logs every Friday."
Principle 2: Documentation (Write Down What You Built)
You build an automation. It works. You never write down how it works. A year later, you need to modify it. You can't remember the logic. You spend three days reverse-engineering your own work.
Documentation should be one page per automation. It answers five questions:
What does it do? "This automation processes invoices from our suppliers. It extracts line items, validates amounts, and creates entries in our accounting system."
How does it work? Step-by-step: what triggers it, where the data comes from, what transformations happen, and where the output goes. Include a simple diagram if it helps.
When does it run? "Every night at 2am. Takes 10 minutes. Completes by 2:15am."
What can go wrong? "If the supplier changes their invoice format, the automation will fail. If the database connection times out, it will retry three times then alert us."
Who to call if it breaks? "Sarah (owner) first. If Sarah can't fix it, contact the developer. For urgent issues (no invoices processed in 24 hours), page the on-call engineer."
That's it. One page. Store it next to the automation (comments in the code, or a document linked from your automation tool). When Sarah leaves the company, the new owner has a starting point instead of nothing.
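If the automation lives in code, the one-pager can live at the top of the script itself, so documentation and implementation never drift apart. A sketch of that header as a module docstring, with hypothetical details standing in for your own:

```python
"""Invoice processing automation -- one-page documentation.

WHAT:  Processes supplier invoices: extracts line items, validates
       amounts, and creates entries in the accounting system.
HOW:   Triggered nightly by the scheduler. Reads invoices from the
       supplier inbox, transforms them, writes to the accounting system.
WHEN:  Every night at 2am. Normal runtime ~10 minutes, done by 2:15am.
RISKS: Fails if the supplier changes their invoice format. Database
       timeouts are retried three times, then an alert is sent.
OWNER: Sarah (reviews logs every Friday). Escalate to the developer;
       for urgent issues (no invoices in 24h), page the on-call engineer.
"""
```

The same five headings work just as well in a Confluence page or a note attached to a Zapier flow; the point is that the answers sit next to the thing they describe.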
Principle 3: Monitoring (You Can't Fix What You Don't See)
Automations run in the dark. You don't know if they're working unless you look.
Set up three kinds of monitoring:
Run monitoring: Did the automation execute? When? How long did it take? Look for jobs that didn't run, or ran but took 10x longer than normal.
Output monitoring: Is the output correct? For invoicing automation, check: did all invoices get processed? Are any amounts flagged as invalid? Are error rates where you expect them?
Data quality monitoring: Is the data the automation is working with correct? If input data quality drops (incomplete fields, formatting changes, missing values), the automation's output quality drops. Catch the input problems before they cascade.
For each type, define an alert:
"If a job doesn't complete, email the owner immediately."
"If error rate goes above 5%, page someone."
"If output count drops below expected, alert the owner within 2 hours."
These alerts turn silent failures into loud ones. You notice problems while they're small, not three months later when they're disasters.
Most automation tools have built-in monitoring. Zapier, Make, and custom systems (Python, Node, etc.) all have logging. Set up Slack or email alerts for when things break. It takes 10 minutes and saves months of crisis management.
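For a custom system, run monitoring can be as small as a wrapper around the job. A minimal sketch, assuming a hypothetical `alert` hook that you'd swap for your real email, Slack, or pager integration:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("invoice_automation")

def alert(message: str) -> None:
    # Hypothetical hook -- replace with your email/Slack/pager integration.
    log.error("ALERT: %s", message)

def monitored_run(job, name: str, max_seconds: float):
    """Run a job with basic run monitoring: did it execute, and how long?"""
    start = time.monotonic()
    try:
        result = job()
    except Exception as exc:
        alert(f"{name} failed: {exc!r}")  # loud failure, not silent
        raise
    elapsed = time.monotonic() - start
    log.info("%s completed in %.1fs", name, elapsed)
    if elapsed > max_seconds:
        # Job finished, but took far longer than normal -- worth a look.
        alert(f"{name} took {elapsed:.0f}s (limit {max_seconds:.0f}s)")
    return result
```

Output and data-quality checks follow the same pattern: compare a count or error rate against an expected threshold, and call the alert hook when it's out of range.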
Principle 4: Error Handling (Fail Gracefully, Not Silently)
Things go wrong. Your automation needs to handle errors without breaking silently.
Good error handling looks like this:
"Try to process the record. If it fails, log why. If it fails three times, move it to an 'error queue' for manual review. Alert the owner."
Bad error handling:
"Try to process the record. If it fails, do nothing and move to the next one." (The bad record disappears silently.)
For each automation, define:
What counts as an error? (Invalid data, missing field, API timeout, duplicate record)
How many retries before we give up? (Usually 3)
What do we do with records that fail? (Log them, quarantine them in an error table, alert the owner)
Who reviews the error queue? (The owner, every Friday)
This way, nothing disappears. Failed records are visible. The owner can decide: is it a bad record (reject it) or is it an automation problem (fix the automation)?
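The retry-then-quarantine pattern above is a few lines of code. A minimal sketch, using an in-memory list as the error queue (in practice you'd use a database table or a spreadsheet the owner reviews on Fridays):

```python
import logging

log = logging.getLogger("invoice_automation")
MAX_RETRIES = 3

# Failed records land here for the owner's weekly review -- nothing disappears.
error_queue = []

def process_with_retries(record, process):
    """Try a record up to MAX_RETRIES times; quarantine it on failure."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return process(record)
        except Exception as exc:
            log.warning("record %r failed attempt %d: %r", record, attempt, exc)
            last_error = exc
    # All retries exhausted: quarantine the record with the reason it failed.
    error_queue.append({"record": record, "error": repr(last_error)})
    return None
```

The quarantine step is where you'd also fire the owner alert, so a growing error queue never goes unnoticed.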
Principle 5: Versioning (You Need a Paper Trail)
You modify the automation. It breaks. You wish you could revert to the version from yesterday. But you didn't save a backup.
Version control applies to automations just like code:
Store your automation definition (if it's code, put it in git. If it's a Zapier flow, export it and store it somewhere).
When you change the automation, note what changed and why. "Updated to handle new supplier invoice format on March 15."
Keep the last three versions available for rollback. If the new version breaks, revert to yesterday's version in five minutes instead of rewriting it from scratch.
This is table stakes for professional automation. It takes 20 seconds per change (write it down, store it). It saves hours when things break.
The Maintainable Automation Checklist
Before you consider an automation "done," run through this checklist:
Ownership
- [ ] One person is identified as the owner
- [ ] The owner has been told they own this
- [ ] The owner understands what it does and why
- [ ] The owner has a weekly check-in on their calendar

Documentation
- [ ] One-page documentation exists (what, how, when, failure modes, escalation)
- [ ] Documentation is stored where the owner can find it
- [ ] A non-technical person can understand what the automation does from the docs
- [ ] Failure modes and error handling are documented

Monitoring
- [ ] Run logs exist and are retained for 30 days
- [ ] Alerts are set up for common failures (job didn't complete, error rate too high)
- [ ] Alerts go to the owner (not to a generic channel where they'll be missed)
- [ ] Success metrics are defined (how many records processed, error rate, completion time)

Error Handling
- [ ] The automation handles errors gracefully (doesn't delete bad records)
- [ ] Failed records are logged and reviewable
- [ ] The owner knows how to review and fix error records
- [ ] There's a process for escalating repeated errors

Versioning
- [ ] The automation definition is version-controlled or backed up
- [ ] The last three versions are available for rollback
- [ ] Change log exists (who changed what, when, why)
- [ ] Rollback process is documented and tested

Knowledge Transfer
- [ ] Someone other than the builder understands how it works
- [ ] If the builder leaves, someone else can maintain it
- [ ] Handoff process is documented

If you check all these boxes, your automation will survive. It'll be maintainable. When the builder leaves, it won't die with them.
Real Examples: Automations That Died vs. Ones That Survived
Automation That Died: The Reporting Bot
Built by: A data analyst named Tom.
What it did: Generated weekly reports from the database and emailed them to the finance team.
Why it died: Tom left the company. Nobody knew how he set it up. The database structure changed (fields were renamed). The report stopped working. For two months, the finance team didn't get reports. By the time someone investigated, the damage was done. The automation was so buried in Tom's old scripts that nobody could find it to fix it.
What went wrong: No owner. No documentation. No monitoring. When Tom left, the automation was orphaned.
Automation That Survived: The Invoice Processor
Built by: A developer. Owned by: Sarah.
What it does: Processes incoming invoices and creates records in the accounting system.
Why it survived: Sarah owns it. Documentation is in a Confluence page everyone can find. It runs every night and sends a summary email every morning showing how many invoices were processed, how many failed. If something breaks, Sarah gets an alert immediately. When the developer who built it left, Sarah could troubleshoot common issues. For complex problems, Sarah could escalate with full context (logs, documentation, what changed).
Four years later, the automation is still running. It's been updated twice (new supplier format, database migration). The updates took days, not weeks, because the documentation was there.
What went right: Clear ownership. One-page documentation. Daily monitoring. Defined error handling. Version control.
The Maintenance Habit
Once the automation is live, maintenance takes 30 minutes per week per automation.
Weekly (30 minutes per automation):
1. Check the logs. Any errors? Unexpected patterns?
2. Verify output is correct. Run a spot check.
3. Check the alert system is working (look for alert emails in the past week).
4. Note anything that needs fixing or improving.

Monthly (1 hour per automation):
1. Review the error queue. Fix systemic issues.
2. Check if the automation is still solving the original problem. Is the business process still the same, or has it evolved?
3. Plan any updates needed.

Quarterly (2-3 hours per automation):
1. Run a full health check. Is performance degrading? Is it still reliable?
2. Review the documentation. Is it still accurate?
3. Test rollback procedures. Can you revert to a previous version if needed?

That's it. Regular, predictable maintenance. It takes minimal time and prevents crises.
Frequently Asked Questions
What if the person who built the automation isn't available to train the owner?
Use the code and logs as the source of truth. The owner doesn't need to understand every detail. They need to understand: what it does (in business terms, not technical terms), what normal looks like, and what to do when something's wrong. You can get 80% of the way there with good documentation, even if the builder isn't around.
How complex does documentation need to be?
One page. Seriously. If your documentation is longer than one page, you're over-documenting. Trim it to the essentials: what, how, when, failure modes, escalation. Link to deeper technical docs if needed, but the one-pager should be standalone.
What if an automation breaks and we don't have a developer?
Good error handling catches most breaks before they're critical. The error queue tells you what's wrong. Often, it's a data format issue (something upstream changed) that you can fix by updating the automation to handle the new format. If you can't, call your developer. But good monitoring and error handling mean you catch issues while they're small.
How do we know if monitoring is working?
Test it. Intentionally break something small (change a field name, provide bad test data). Does the alert fire? If not, your monitoring isn't working. Fix it before you go to production. Once it's live, you need to trust it's working.
What if we have 20 automations running?
You can't maintain them all manually. You need a system: a spreadsheet or tool that tracks all automations, their owners, their status, when they were last checked. Zapier and Make have built-in dashboards for this. Use them. Set a weekly reminder: check the dashboard, review any alerts, make sure nothing's broken.
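That tracking system can start as something very small. A minimal sketch of an automation registry with a staleness check, using hypothetical entries; a spreadsheet with the same columns works just as well:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Automation:
    name: str
    owner: str
    last_checked: date

def overdue(registry: list[Automation], max_age_days: int = 7) -> list[str]:
    """Return automations nobody has checked within the review window."""
    cutoff = date.today() - timedelta(days=max_age_days)
    return [a.name for a in registry if a.last_checked < cutoff]
```

Run the staleness check from your weekly reminder, and "shadow automations" that nobody is reviewing surface by name instead of by outage.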
Richard Batt has delivered 120+ AI and automation projects across 15+ industries. He helps businesses deploy AI that actually works, with battle-tested tools, templates, and implementation roadmaps. Featured in InfoWorld and WSJ.
Put This Into Practice
I use versions of these approaches with my clients every week. The full templates, prompts, and implementation guides, covering the edge cases and variations you will hit in practice, are available inside the AI Ops Vault. It is your AI department for £97/month.
Want a personalised implementation plan first? Book your AI Roadmap session and I will map the fastest path from where you are now to working AI automation.