Richard Batt
The Developer's Guide to Building Maintainable Automation
Tags: Development, Automation
I walked into a new client's office to audit their automation infrastructure. They'd been running automation scripts for three years, and the entire system was held together by what I can only describe as digital duct tape. No tests. No documentation. Hardcoded credentials scattered through multiple files. And here's the worst part: only one person understood how it all worked, and they'd just resigned.
Key Takeaways
- Why Automation Code Gets Neglected, and What to Do About It.
- Start with Assertions, Not Just Happy Paths.
- Structure Your Code for Testing From Day One.
- Version Control and Secrets Management Are Non-Negotiable.
- Build Complete Logging, Not Just Debug Prints.
This isn't uncommon. Most automation projects I've encountered in my 120+ consulting engagements start with good intentions and gradually become unmaintainable nightmares. Teams build scripts that work, deploy them, and move on. Then six months later, something breaks, and nobody remembers why the script was written that way in the first place. The person who wrote it is gone. The deadline was two years ago, so there's no context document. The code is a mystery.
Here's what I've learned: the difference between automation that dies after two years and automation that runs reliably for a decade is engineering discipline applied from day one. Not complex architecture. Not expensive tools. Just basic software engineering practices that somehow get skipped in the rush to "just get it working."
Why Automation Code Gets Neglected
Before I explain what to do, let me explain why this happens so often. Automation scripts occupy a weird space in software development. They're not quite applications. They're not throwaway scripts either: they're business-critical infrastructure that runs automatically. But they're treated like both simultaneously.
A developer writes a Python script to process daily reports. It works. It goes into production. Then something else catches their attention, and the script gets orphaned. Nobody budgets for maintenance because it's "just a script." But that script is now running on a schedule, touching important data, and if it breaks, it breaks silently.
I consulted with a financial services company where their main revenue reconciliation ran on a Python script nobody had touched in 18 months. They had no idea whether it was working correctly. They had no alerts. When I ran it in debug mode, I found it was silently skipping records that didn't match a particular pattern: records that turned out to be worth £47,000 in misclassified transactions.
That's when they realised automation code deserves the same engineering discipline as any other business-critical software. It just had never received it.
Start with Assertions, Not Just Happy Paths
The first habit you need is to stop thinking about "happy path" execution. Yes, your script works when everything is fine. But what happens when it's not? What goes wrong? And critically, how will you know something went wrong?
I worked with a logistics company running an order shipment automation. The script would pull orders from the system, generate shipping labels, and send them to the warehouse. It worked beautifully 99.2% of the time. The other 0.8% of the time, the shipping label API timed out, but the script would just log "connection timeout" and move on. No exception. No failure signal. The order would sit in limbo, the customer wouldn't get notified, and nobody would know until three days later when customers started complaining.
The fix was straightforward: add assertions that check what should be true after each major operation. After generating a label, assert that the label file exists and has a reasonable size. After sending to the warehouse system, assert that you get a successful response code. If any assertion fails, fail loudly and immediately: don't continue with bad data.
This applies to data transformations too. If you're processing 1,000 records and expecting 990 to succeed with 10 errors, what happens if you only get 500 successes? Does the script just continue? Add a check: "I expected this operation to succeed for 95% of records, and it only succeeded for 50%. This is anomalous. Stop and alert."
These checks are cheap to add and invaluable when something goes wrong. They transform a silent failure into an immediate alarm.
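As a sketch of both kinds of check, here's what they might look like in Python. The 1 KB minimum file size and the 95% success threshold are illustrative assumptions, not values from any particular project:

```python
import os

def check_label_file(path, min_bytes=1024):
    """After generating a label, assert the file exists and has a plausible size."""
    assert os.path.exists(path), f"Label file missing: {path}"
    size = os.path.getsize(path)
    assert size >= min_bytes, f"Label file suspiciously small ({size} bytes): {path}"

def check_success_rate(succeeded, total, min_rate=0.95):
    """Fail loudly if far fewer records succeeded than expected."""
    rate = succeeded / total if total else 0.0
    if rate < min_rate:
        raise RuntimeError(
            f"Anomalous run: expected >= {min_rate:.0%} success, "
            f"got {rate:.0%} ({succeeded}/{total})"
        )
```

Call these right after the operation they guard. A failed assertion stops the run at the point of the problem, instead of letting bad data propagate downstream.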
Structure Your Code for Testing From Day One
Most automation scripts are written as a single block of procedural code. Read data, process it, write results. Everything is tangled together. If you want to test whether your data processing logic is correct, you have to run the entire script, which means hitting external APIs, reading real databases, and producing real output.
Here's what I recommend: structure even simple automation scripts with functions that can be tested independently. This doesn't require a testing framework or complex architecture. Just basic functions.
Instead of:
```
data = get_from_api()
for item in data:
    process_item(item)
    save_item(item)
```
Write:
```
def transform_item(raw_item):
    # Pure function: the same input always produces the same output
    ...
    return processed_item

def validate_item(item):
    # Raise ValueError if the item fails any check
    if not is_valid(item):
        raise ValueError(f"Invalid item: {item}")

def main():
    data = get_from_api()
    for item in data:
        try:
            processed = transform_item(item)
            validate_item(processed)
            save_item(processed)
        except ValueError as e:
            log_error(item, e)
```
Now you can test `transform_item()` and `validate_item()` with sample data in seconds, without hitting any external systems. You can prove your logic is correct before you run it against real data.
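For instance, suppose `transform_item()` normalised product records (this implementation is entirely hypothetical, just to make the test concrete). Plain assert-based test functions are enough; no framework required:

```python
def transform_item(raw_item):
    # Hypothetical implementation: normalise a product record
    return {
        "sku": raw_item["sku"].strip().upper(),
        "price_pence": round(float(raw_item["price"]) * 100),
    }

def test_transform_item_normalises_record():
    result = transform_item({"sku": " ab-123 ", "price": "9.99"})
    assert result == {"sku": "AB-123", "price_pence": 999}

def test_transform_item_rejects_missing_sku():
    try:
        transform_item({"price": "1.00"})
        assert False, "expected KeyError for a record with no SKU"
    except KeyError:
        pass
```

Running these takes milliseconds and touches nothing outside the process, which is exactly what makes fast iteration possible.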
I implemented this structure in a data pipeline for a retail client processing 500,000 product records daily. Before the refactor, every change required running the full pipeline and waiting 45 minutes to see if it worked. After factoring out the transformation logic, I could test individual functions in seconds. The team went from making one deployment per week to three deployments per day because testing became fast.
Version Control and Secrets Management Are Non-Negotiable
Every automation script you write should live in version control. Full stop. Not optional. Not "eventually." From the first line of code.
I once found a client running a critical customer data synchronisation script that wasn't in version control. It lived on a single developer's laptop and was copied to a server. When that developer left, they couldn't find the source code anymore: only the deployed version. When a bug appeared two months later, they couldn't trace what had changed or roll back to a previous version. They had to rebuild the script from memory.
Version control gives you history, the ability to understand what changed and why, and the ability to revert mistakes. You need all three for production automation.
But here's what's equally critical: never, ever store API keys, database passwords, or credentials in version control. Not in comments. Not in strings you think are protected. Not "just temporarily."
I consulted with a SaaS company who stored AWS credentials directly in their automation scripts. Six months later, those scripts were cloned across three different AWS accounts, the repositories were briefly public, and hackers gained access through the exposed credentials. They estimated the damage at £180,000 in fraudulent AWS charges.
Use environment variables, secrets management systems (AWS Secrets Manager, HashiCorp Vault, etc.), or configuration files that are explicitly excluded from version control. Never hardcode credentials.
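The environment-variable approach can be this simple. A sketch, with an illustrative variable name; the point is to fail fast at startup rather than halfway through a run:

```python
import os

def get_required_env(name):
    """Read a required secret from the environment; fail fast if it's missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Usage (the variable name is illustrative):
# api_key = get_required_env("SHIPPING_API_KEY")
```

Combine this with a `.gitignore` entry for any local config file that holds real values, and the credentials never touch version control.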
Build Complete Logging, Not Just Debug Prints
The difference between automation you can troubleshoot and automation that mystifies you is logging. Most scripts have either no logging or minimal debugging print statements. Both are inadequate for production.
You need structured logging that captures: what the script was trying to do, what it actually did, what went wrong, and when. You need to be able to search logs weeks later and understand exactly what happened during a particular run.
Here's what I recommend for a minimal production logging setup:
Log levels. Use INFO for important business events ("Started processing 1,000 orders", "Completed order 12345"), WARN for anomalies ("Retry count exceeded 3 times", "Unexpected response format"), and ERROR for failures. Separate the important from the noise.
Timestamps and context. Every log line should include the time, the context of what operation is happening, and any identifiers needed to trace the issue (order ID, customer ID, etc.).
Structured format. Use JSON logging so you can actually search and filter logs. Don't just concatenate strings.
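All three of those points can be covered with the standard library alone. A minimal sketch; the context field names (`order_id`, `customer_id`) are illustrative:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Attach context fields passed via logging's `extra` argument
        for key in ("order_id", "customer_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("orders")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One searchable JSON line per event, carrying the identifiers needed to trace it
logger.info("Completed order", extra={"order_id": "12345"})
```

Because every line is valid JSON with a timestamp and identifiers, you can filter weeks of logs by claim ID, order ID, or level instead of grepping free text.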
I worked with a healthcare organisation processing insurance claims through automation. Their first version had ad-hoc logging sprinkled throughout. When a claim got stuck, they'd have to manually search through thousands of lines of logs to piece together what happened. We implemented structured JSON logging with the claim ID in every entry. The time to identify what went wrong with a specific claim dropped from 20 minutes to 2.
Centralize your logs. Don't leave logs only on the server where the script runs. Send them to a centralized logging service. This way, if the server crashes or gets deleted, you still have the logs. If you need to search across multiple script runs, you can do it from one place.
Monitor and Alert, Don't Just Hope It Works
Here's the hard truth: even with all the testing and logging in the world, automation scripts fail. External APIs go down. Databases run out of space. Permissions change. Networks get flaky. Something will go wrong eventually.
You need monitoring and alerting so you know about failures before your customers do. Not after the system has been broken for six hours.
Minimum monitoring includes:
Execution monitoring. Did the script actually run? When? Did it complete or crash? I worked with a marketing team whose email automation script had crashed silently two weeks earlier. They had no idea 50,000 marketing emails hadn't been sent. A simple "did this task complete successfully" check would have alerted them immediately.
Data quality checks. After processing, is the output what you'd expect? If you're syncing customer data and you suddenly start syncing 10x more records than usual, that's not a feature: it's a bug. Set thresholds and alert when metrics go outside expected ranges.
Duration monitoring. Does the script usually take 5 minutes? If it's taking 30 minutes today, something's wrong: maybe a database query regressed, maybe an API is slow, maybe you have 10x more data than usual. Alert on anomalies.
Error rate monitoring. If your script processes 1,000 items and 50 fail, is that expected? Set an acceptable error rate and alert when you exceed it.
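The duration and error-rate checks above reduce to simple threshold functions that your script can run at the end of each execution. A sketch; the 5% defaults are illustrative, not recommendations for any particular workload:

```python
def check_error_rate(failed, total, max_rate=0.05):
    """Return an alert message if failures exceed the acceptable rate, else None."""
    rate = failed / total if total else 0.0
    if rate > max_rate:
        return f"Error rate {rate:.1%} exceeds threshold {max_rate:.1%} ({failed}/{total})"
    return None

def check_duration(seconds, baseline_seconds, max_slowdown=0.05):
    """Return an alert message if the run is too slow relative to its baseline."""
    limit = baseline_seconds * (1 + max_slowdown)
    if seconds > limit:
        return f"Run took {seconds:.0f}s, over the {limit:.0f}s limit"
    return None
```

Wire the returned messages into whatever alerting channel you already have (email, Slack, PagerDuty); the hard part is deciding the thresholds, not the code.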
I consulted on a data warehouse automation for a financial company. They process 2 million transactions daily. We set up alerting on: execution success/failure, error rate (alert if more than 0.1% of transactions fail), and duration (alert if more than 5% slower than normal). This early-warning system caught issues within minutes. When an upstream API changed format and broke the parser, they knew about it before any business reports were affected. Without monitoring, they wouldn't have known for 24 hours.
Documentation: The Gift to Future You
I've already mentioned that automation projects lack documentation. Let me be specific about what you need to document.
Operational runbook. How do you run this script? What dependencies does it need? What happens if it fails? How do you restart it? A junior engineer should be able to follow this without calling the original author.
Design decisions. Why did you choose this approach instead of that one? What constraints were you optimising for? Why does this script hit API X instead of Y? Six months from now, you won't remember. Write it down.
Known limitations. Every script has assumptions and limitations. It doesn't handle records with special characters. It can only process 100,000 items per run. It requires internet connectivity. Document these so the next person doesn't discover them the hard way.
Troubleshooting guide. What's the most common failure mode? How do you diagnose it? How do you fix it? I worked with a team running a daily report automation that occasionally produced incomplete results. It took them 4 hours to figure out it was a race condition with their database. After that, I made them document it: "If the report looks incomplete, check the database slow query log. If you see lock timeouts, restart the database refresh job." Now they can fix it in 15 minutes.
Testing Automation Code: Practical Approaches
You don't need a complex test framework. Here's what I recommend for real-world automation:
Unit tests for pure functions. Any function that transforms data should be testable with sample inputs. Write a few test cases: normal case, edge cases, error cases. This catches logic bugs before they hit production.
Integration tests with mocked external calls. Your script probably hits external APIs or databases. Test it with mocks so you can verify the integration logic without actual external calls. Test failure scenarios: what happens when the API returns an error?
Dry-run capability. Build your script so it can run in "dry run" mode: executing all the logic but not making any changes. This lets you verify it's correct before running against real data. I implemented this for a client's invoice automation: they could dry-run against 1,000 invoices, see the results, and then run against 1 million invoices with confidence.
Canary deployments. When you change the script, run it against a small subset of real data first. Process 100 orders, verify they're correct, then run against all 10,000. This catches logic errors on real data without the full risk.
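A dry-run capability can be as simple as gating every side effect behind one flag. This is a sketch; `save_item()` and the transformation stand in for whatever real work your script does:

```python
def save_item(item):
    """Illustrative side effect; a real script would write to a database or an API."""
    print(f"saved {item}")

def run(items, dry_run=False):
    """Process items; with dry_run=True, compute everything but change nothing."""
    results = []
    for item in items:
        processed = item.strip().upper()  # stand-in for the real transformation
        results.append(processed)
        if not dry_run:
            save_item(processed)  # the only side effect, gated behind the flag
    return results

# Preview with run(orders, dry_run=True), inspect the results, then run for real
```

The same flag also gives you canary deployments almost for free: dry-run the full dataset, then run for real on a small slice before committing to everything.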
The Long-Term View
All of this (testing, logging, monitoring, documentation) feels like overhead when you're building a script under deadline. "We just need to get it working." I understand that pressure. But I've seen the cost of skipping these steps.
The script that works but has no tests becomes the script nobody dares to change. The script with no logging becomes the script nobody can debug. The script with no documentation becomes the script nobody can support. The script with no monitoring becomes the script that breaks and nobody notices.
I worked with a manufacturing company running 47 different automation scripts. They'd been built over five years, all with minimal engineering discipline. When I audited them, I found: 12 had no error handling, 23 had no logging, 31 had no monitoring, and only 3 had version control. The team spent 40% of their time fighting fires in these scripts instead of building new value.
We spent two weeks adding proper structure, testing, and monitoring to the highest-risk scripts. Nothing changed in what the scripts did. But suddenly the team could sleep at night. They could deploy changes without fear. They could onboard new team members without tribal knowledge. They could troubleshoot issues in minutes instead of hours.
The engineering discipline you add to automation code upfront buys you months of stability and peace of mind later. It's not about perfection. It's about basic practices that separate production-ready code from scripts that happen to work.
If you're running critical automation that's not backed by solid engineering practices, or you're planning automation that needs to run reliably for years, let's talk about how to build it right from the start. I've helped 120+ organisations transform automation from a maintenance nightmare into a reliable asset. Reach out here to discuss your automation engineering needs.
Richard Batt has delivered 120+ AI and automation projects across 15+ industries. He helps businesses deploy AI that actually works, with battle-tested tools, templates, and implementation roadmaps. Featured in InfoWorld and WSJ.
Frequently Asked Questions
How long does it take to implement AI automation in a small business?
Most single-process automations take 1-5 days to implement and start delivering ROI within 30-90 days. Complex multi-system integrations take 2-8 weeks. The key is starting with one well-defined process, proving the value, then expanding.
Do I need technical skills to automate business processes?
Not for most automations. Tools like Zapier, Make.com, and N8N use visual builders that require no coding. About 80% of small business automation can be done without a developer. For the remaining 20%, you need someone comfortable with APIs and basic scripting.
Where should a business start with AI implementation?
Start with a process audit. Identify tasks that are high-volume, rule-based, and time-consuming. The best first automation is one that saves measurable time within 30 days. Across 120+ projects, the highest-ROI starting points are usually customer onboarding, invoice processing, and report generation.
How do I calculate ROI on an AI investment?
Measure the hours spent on the process before automation, multiply by fully loaded hourly cost, then subtract the tool cost. Most small business automations cost £50-500/month and save 5-20 hours per week. That typically means 300-1000% ROI in year one.
Which AI tools are best for business use in 2026?
It depends on the use case. For content and communication, Claude and ChatGPT lead. For data analysis, Gemini and GPT work well with spreadsheets. For automation, Zapier, Make.com, and N8N connect AI to your existing tools. The best tool is the one your team will actually use and maintain.
Put This Into Practice
I use versions of these approaches with my clients every week. The full templates, prompts, and implementation guides, covering the edge cases and variations you will hit in practice, are available inside the AI Ops Vault. It is your AI department for $97/month.
Want a personalised implementation plan first? Book your AI Roadmap session and I will map the fastest path from where you are now to working AI automation.