This is Part 3 of a series on implementing SRE practices in small startups. Part 1 covered embracing risk. Part 2 covered SLOs and error budgets. This post tackles the silent killer of startup engineering productivity: toil.
The SRE book defines toil in Chapter 5 as work that’s manual, repetitive, automatable, tactical, without enduring value, and that scales linearly with service growth. Operational work that keeps the lights on but doesn’t make the system better. And it has this property that makes it dangerous: it feels productive while you’re doing it.
You SSH into a server and restart a hung process. You manually provision a new environment for a customer. You update a config file, test it locally, deploy it. You respond to the same alert for the third time this week by running the same manual remediation.
Each of these tasks is necessary in the moment. Each gives you the satisfaction of having fixed something. But collectively, they’re stealing your future. Every hour spent on toil is an hour not spent on automation, architecture, or features. And at a startup, where engineering hours are the scarcest resource, toil that goes unchecked will slowly consume the team.
At Google, the SRE book recommends that SREs spend no more than 50% of their time on toil. At a startup, you don’t have dedicated SREs. Your product engineers are also your operations team. If toil consumes 50% of their time, you’ve effectively cut your feature development capacity in half.
The insidious thing is that toil creeps. No one decides to spend half their time on operational work. It happens gradually. You add a customer, and that means a manual provisioning step. You add a feature, and that means a new alert to respond to. Each increment is small. The total grows relentlessly.
I’ve been working with a team recently where a quick audit revealed that their two most senior engineers were spending roughly 60% of their time on tasks that met the SRE book’s definition of toil. Neither of them realized it because the tasks were spread across the week in small chunks – fifteen minutes here, thirty minutes there. But when we added it up, the number was staggering. These were the engineers who should’ve been building the product’s next major capability. Instead they were manually running database migrations and restarting services.
The first step is recognition. Most teams can’t quantify their toil because they’ve never tried to categorize it.
One approach that’s worked: for one week, have every engineer log their operational tasks in a shared doc. For each task, note what they did, how long it took, whether it was triggered by an event or a schedule, and whether a computer could’ve done it. At the end of the week, categorize and total the time. The results are always surprising. Teams consistently underestimate their toil by a factor of two to three.
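Totaling the audit shouldn't itself be manual. A minimal sketch, assuming the shared doc is exported as a CSV with hypothetical columns `engineer`, `task`, `minutes`, `trigger`, and `automatable` (the exact schema is up to you):

```python
import csv
from collections import defaultdict

def summarize_toil(path):
    """Total the minutes logged per task in a week-long toil audit.

    Assumes a CSV with columns: engineer, task, minutes, trigger,
    automatable -- one possible layout for the shared log, not a standard.
    """
    by_task = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            by_task[row["task"]] += int(row["minutes"])
    # Largest time sinks first: these are the automation targets.
    return sorted(by_task.items(), key=lambda kv: -kv[1])
```

Sorting by total time, rather than by count, is deliberate: the fifteen-minutes-here tasks from the audit only look serious once they're summed.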
The repeat test: look at your task history over the past month. Any task performed more than twice is a candidate for automation. More than five times is a strong candidate. More than ten times should be treated as an automation emergency.
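The thresholds are easy to apply mechanically. A sketch, assuming you can extract a flat list of task names from your ticket or chat history:

```python
from collections import Counter

def repeat_test(task_names):
    """Classify tasks by how often they recurred over the past month.

    Thresholds follow the rule of thumb above: more than twice is a
    candidate, more than five times a strong candidate, more than ten
    an automation emergency.
    """
    verdicts = {}
    for task, count in Counter(task_names).items():
        if count > 10:
            verdicts[task] = "automation emergency"
        elif count > 5:
            verdicts[task] = "strong candidate"
        elif count > 2:
            verdicts[task] = "candidate"
    return verdicts
```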
The scaling test: for each operational task, ask how much more time it will take if you have 10x the customers next year. If the answer is 10x more time, it's toil. If it's about the same, it's not.
You can’t automate everything at once. At a startup, you need to be strategic.
High frequency, low complexity – automate immediately. The tasks you do every day or every week that are straightforward to automate. Restarting a service after a known failure. Provisioning a standard environment. Rotating credentials on a schedule. Usually shell scripts or simple CI/CD additions. The time-to-automate is low and the time-saved compounds quickly.
High frequency, high complexity – invest in automation. Complex provisioning workflows, multi-step deployment processes, data migrations with validation. They require real engineering effort to automate properly, but the payoff is large because you’re spending significant time on them every week. Budget time for these like you would for a feature.
Low frequency, low complexity – document and template. Tasks you do quarterly or less often aren’t worth fully automating, but they should have clear runbooks with copy-pasteable commands. Reduce the cognitive overhead and error rate even if a human is still in the loop.
Low frequency, high complexity – defer. The annual disaster recovery test, the one-off data migration, the unusual customer request. Important but rare enough that the automation investment won’t pay off soon. Document them well and move on.
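The four quadrants reduce to a small lookup. A sketch, treating frequency and complexity as coarse high/low judgment calls rather than measured values:

```python
def triage(frequency, complexity):
    """Map a task to one of the four quadrants above.

    frequency and complexity are "high" or "low" -- judgment calls,
    not measurements. "High" frequency roughly means weekly or more.
    """
    actions = {
        ("high", "low"): "automate immediately",
        ("high", "high"): "invest in automation",
        ("low", "low"): "document and template",
        ("low", "high"): "defer",
    }
    return actions[(frequency, complexity)]
```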
The SRE book talks about automation as a spectrum. For a startup, here’s where I’d put energy:
Make provisioning self-service. If adding a customer, environment, or resource requires an engineer to do manual work, that's your highest-priority automation target. Build a script, a CLI tool, or a simple API that does the provisioning end-to-end. This is the single highest-ROI automation investment at most startups I've worked with.
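The end-to-end script can start as little more than a fixed sequence of steps. A sketch, where the helper scripts are hypothetical stand-ins for whatever your manual runbook currently does:

```python
import subprocess

# Hypothetical helper scripts standing in for the current manual runbook.
PROVISION_STEPS = [
    ["./create_db_schema.sh"],
    ["./deploy_customer_config.sh"],
    ["./run_smoke_tests.sh"],
]

def provision(customer, env="production", run=None):
    """Run every provisioning step end-to-end for one customer.

    By default each step is a shell command that fails loudly
    (check=True) rather than half-provisioning; `run` is injectable
    so the sequencing can be tested without real infrastructure.
    """
    run = run or (lambda cmd: subprocess.run(cmd, check=True))
    for step in PROVISION_STEPS:
        run(step + [customer, env])
```

The point isn't the mechanism, it's the interface: an engineer runs one command instead of five, and the failure mode is a loud error instead of a partially set-up customer.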
Automate deployment and rollback. If your deployment process involves SSH, manual verification steps, or ad-hoc commands, automate it. A deployment should be a single command or a button click, with automated health checks and one-click rollback. I wrote about this in 2019 in the context of infrastructure as code – deployment automation is the most impactful application of that principle.
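The deploy-verify-rollback loop itself is small. A sketch, with `rollout`, `healthy`, and `rollback` left as hypothetical callables into whatever your CI/CD system actually provides:

```python
import time

def deploy(version, rollout, healthy, rollback, checks=3, interval=30.0):
    """Roll out a version, verify health, and roll back automatically.

    rollout/healthy/rollback are stand-ins for real pipeline hooks;
    `checks` repeated probes guard against a service that comes up
    and then falls over.
    """
    rollout(version)
    for _ in range(checks):
        if not healthy():
            rollback()
            return "rolled back"
        time.sleep(interval)
    return "deployed"
```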
Automate remediation for known failure modes. If you have an alert that always requires the same manual response, automate the response. The service that needs to be restarted when memory exceeds a threshold? That’s a health check and an automated restart, not a page to a human at 3 AM.
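The memory-threshold example reduces to one check and two outcomes. A sketch, with `restart` and `page` as hypothetical hooks into your process manager and paging system:

```python
def remediate_high_memory(memory_mb, threshold_mb, restart, page):
    """Automated remediation for a known failure mode.

    Restarts the service when memory exceeds the threshold; pages a
    human only when the automated fix itself fails.
    """
    if memory_mb <= threshold_mb:
        return "healthy"
    try:
        restart()
        return "restarted"
    except Exception:
        # The automation hit something it can't handle -- fail loudly
        # and hand off to a human instead of retrying silently.
        page()
        return "paged"
```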
Build escape hatches, not perfect systems. Startup automation doesn’t need to handle every edge case. It needs to handle the common path well and fail loudly when it encounters something it can’t handle. An automation script that works for 90% of cases and alerts a human for the other 10% saves vastly more time than no automation while being practical to build.
Toil elimination requires a cultural commitment. In most startups, there’s a bias toward heroism – the engineer who stays late to manually remediate a problem is celebrated, while the engineer who spends two days building automation that prevents the problem is questioned about why the feature work is late.
This incentive structure produces more toil, not less. If heroism is rewarded, you’ll get heroes. But you won’t get systems that don’t need heroes.
The SRE book’s 50% rule is a cultural tool as much as a planning tool. It says: operational work is real work that deserves investment, but it should never consume the majority of engineering time. When it does, something is wrong, and the right response is to invest in reducing it, not to hire more people to absorb it.
At the startup I’ve been working with, we instituted a simple rule: for every incident that requires manual remediation, the team files a follow-up ticket for automating the remediation. Not every ticket gets picked up immediately, but the backlog makes the toil visible and creates natural opportunities to chip away at it during lighter weeks.
There’s a direct connection between toil and the error budget framework from Part 2. Toil often exists because the team is too busy fighting fires to invest in prevention. But fighting fires burns the error budget, which should trigger a shift toward reliability work – and that includes toil reduction.
When your error budget is healthy, spend time on features. When it’s depleted, spend time on reliability, including automating the operational work that consumes engineering time and contributes to errors. The two frameworks reinforce each other.
In Part 4, I’ll cover observability and alerting – how to build the instrumentation that powers SLOs, error budgets, and effective incident response without drowning in alert noise.