This is the final post in a series on implementing SRE practices in small startups. Part 1 covered embracing risk. Part 2 covered SLOs and error budgets. Part 3 covered eliminating toil. Part 4 covered observability and alerting. Part 5 covered incidents and postmortems. This post ties it all together by addressing the mechanism that sits at the center of the SRE lifecycle: releasing software.
Every post in this series has circled the same tension: startups need to move fast, and moving fast introduces risk. SLOs give you a way to measure that risk, and the error budget gives you a framework for managing it. Observability gives you visibility into it. Incident management and postmortems give you mechanisms for learning from it.
But the place where that tension hits hardest is the release process. Every deployment is a change, and every change is a potential source of failure. The natural response to deploy-related incidents is to add friction: more approvals, longer test cycles, less frequent releases, bigger batches. And every unit of friction slows your velocity.
The SRE book covers this in Chapter 8 (release engineering) and Chapter 9 (simplicity). The counterintuitive bit: the way to make releases safer isn’t to do them less frequently. It’s to do them more frequently with smaller changes and better automation. Reducing the friction of change is probably the single most impactful thing you can do for both velocity and reliability.
There’s a persistent belief that batching changes into large, infrequent releases reduces risk because each one gets more testing and scrutiny. The opposite is true.
Large changes are hard to understand. A release with fifty commits across fifteen files is nearly impossible to review holistically. The interactions between changes are complex and unpredictable. The reviewer – human or automated – can’t hold the entire change set in their head.
Large changes are hard to diagnose. When a big release causes a problem, the surface area for investigation is enormous. Which of the fifty changes caused it? Was it a single change, or an interaction between two? Debugging time scales with release size.
Large changes are hard to roll back. Rolling back a single small change is trivial. Rolling back fifty changes because one caused a problem means reverting forty-nine good changes with the bad one. That creates pressure to “fix forward” rather than roll back, which is slower and riskier during an active incident.
And large changes are infrequent. If you release once a week, you’ve got fifty-two opportunities per year to practice. If you release twice a day, more than seven hundred. The team that releases more often develops better release practices, catches problems earlier, builds more confidence. The release muscle atrophies without exercise.
The alternative is continuous delivery: small changes, shipped frequently, with automated safety checks at each stage. The goal is to make releasing software a non-event. So routine it doesn’t require special attention, approval chains, or anxiety.
For a startup, in practice that looks like this. One change, one deployment. Each PR should be independently deployable. No feature branches that live for weeks. No “release trains” batching unrelated changes. Each PR is small enough to review in ten minutes, understand completely, roll back trivially.
Trunk-based development. Engineers commit to main frequently – ideally daily. Long-lived branches drift, creating merge conflicts, integration surprises, and large diffs that are hard to review. Short-lived branches off trunk keep changes small and integrations frequent.

Feature flags. Trunk-based development doesn’t mean shipping incomplete features. Feature flags decouple deployment from release: deploy code behind a flag – it’s in production, being exercised, but invisible to users until you’re ready. That lets you deploy continuously while releasing features on your own schedule.
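The flag-gating described above can be sketched in a few lines. This is a minimal in-memory version – the flag names, store shape, and allowlist behavior are illustrative assumptions; a real setup would back this with a config service or a flag library.

```python
# Minimal feature-flag sketch. FLAGS is a stand-in for a config service;
# the "new_checkout" flag and allowlist are hypothetical examples.

FLAGS = {
    "new_checkout": {"enabled": False, "allow_users": {"internal-qa"}},
}

def is_enabled(flag: str, user: str) -> bool:
    """True if the flag is globally on, or this user is allowlisted."""
    cfg = FLAGS.get(flag)
    if cfg is None:
        return False  # unknown flags default to off
    return cfg["enabled"] or user in cfg["allow_users"]

def checkout(user: str) -> str:
    # Deployed code branches on the flag: the new path is in production
    # and exercised by internal users, but dark for everyone else.
    if is_enabled("new_checkout", user):
        return "new checkout flow"
    return "old checkout flow"
```

Flipping `enabled` to `True` is the release; the deploy already happened, possibly days earlier.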
Automated pipeline. The path from merged PR to production should be automated. At minimum: PR passes CI (lint, unit tests, integration tests), merge to trunk triggers a build and deploy to staging, automated smoke tests run against staging, promotion to production is a single action – a button click or automatic promotion after staging validation. At a startup, “staging” might be a second deployment on cheaper infrastructure. It doesn’t need to mirror production. It needs to catch the obvious stuff – service doesn’t start, migrations fail, health check returns errors – before users see it.
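A staging smoke test of the kind described doesn’t need to be elaborate. Here’s a sketch that checks a health endpoint and a deployed-version endpoint – the `/healthz` and `/version` paths are assumptions, and the HTTP call is injected so the check stays testable without a live environment.

```python
# Sketch of a post-deploy smoke test. Endpoint paths are hypothetical;
# adapt to whatever your service actually exposes.
from urllib.request import urlopen

def http_get(url: str, timeout: float = 5.0) -> tuple[int, str]:
    with urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode()

def smoke_test(base_url: str, expected_version: str, get=http_get) -> list[str]:
    """Return a list of failures; an empty list means the deploy looks healthy."""
    failures = []
    status, _ = get(f"{base_url}/healthz")
    if status != 200:
        failures.append(f"health check returned {status}")
    status, body = get(f"{base_url}/version")
    if status != 200 or expected_version not in body:
        failures.append(f"version mismatch: wanted {expected_version}, got {body!r}")
    return failures
```

Wire this into the pipeline after the staging deploy: a non-empty failure list blocks promotion.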
Even with small changes and automated testing, some problems only show up in production under real traffic. Progressive rollout limits the blast radius.
Canary deployments: put the new version on a small percentage of your infrastructure – one instance out of four, or 10% of traffic. Monitor the canary’s SLIs for a defined window. If error rate or latency deviates from baseline, abort the rollout automatically. If it’s stable, promote. For a startup, a manual canary is fine to start. Deploy to one instance. Watch the dashboard for five minutes. SLIs look good? Promote. As deployment frequency increases, automate the canary analysis.
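When you’re ready to automate the canary analysis, the core logic is a comparison of the canary’s SLIs against the baseline fleet over the watch window. The thresholds below are illustrative assumptions, not recommendations from the post.

```python
# Sketch of automated canary analysis: abort if the canary's error rate
# or p99 latency degrades too far relative to the baseline. Ratios are
# hypothetical starting points; tune them to your SLOs.
from dataclasses import dataclass

@dataclass
class Sample:
    requests: int
    errors: int
    p99_latency_ms: float

def canary_verdict(baseline: Sample, canary: Sample,
                   max_error_ratio: float = 2.0,
                   max_latency_ratio: float = 1.5) -> str:
    """Return 'promote' or 'abort' based on relative degradation."""
    base_err = baseline.errors / max(baseline.requests, 1)
    canary_err = canary.errors / max(canary.requests, 1)
    # Floor the baseline error rate so a near-zero baseline doesn't
    # make any canary error an automatic abort.
    if canary_err > max_error_ratio * max(base_err, 0.001):
        return "abort"
    if canary.p99_latency_ms > max_latency_ratio * baseline.p99_latency_ms:
        return "abort"
    return "promote"
```

This is the same judgment you make manually while watching the dashboard for five minutes, made explicit and repeatable.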
Database migrations are the highest-risk part of most deploys because they’re hard to roll back. A few principles that’ve saved me: separate migration deployment from code deployment. Run the migration first, verify it succeeded, then deploy the code that depends on it. Every migration must be backward-compatible with the currently running code. Never run destructive migrations (dropping columns, renaming tables) in the same deploy as the code change. Add the new column first, deploy code that uses both old and new, then drop the old column in a subsequent migration. And test migrations against a copy of production data, not just an empty database. Schema changes that work on an empty table can lock a million-row table for minutes.
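The expand-then-contract sequence above can be made concrete. This sketch uses sqlite purely for illustration (the table and a column rename are hypothetical); the point is the ordering – each step is a separate deploy, and the schema stays compatible with whatever code is currently running.

```python
# Expand/contract sketch for renaming users.fullname to display_name.
# sqlite is used only so the example is self-contained.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Deploy 1 (expand): add the new column. Currently running code still
# reads and writes fullname, and is unaffected.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Deploy 2: ship code that writes both columns, then backfill.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Deploy 3 (contract): only after no running code reads fullname,
# drop it in a later, separate migration (omitted here).

row = conn.execute("SELECT display_name FROM users").fetchone()
```

At no point is there a deploy where the running code and the schema disagree, which is exactly what makes each step individually rollback-safe.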
Rollback as a first-class capability. The question for any deployment: “How do I undo this in under five minutes?” If the answer is “I can’t,” revise the plan. In practice that means keeping the previous version’s artifacts available and having a tested rollback procedure. For containerized deployments it’s straightforward – revert to the previous image tag. For database migrations it means having rollback scripts written and tested before you run the migration.
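The “revert to the previous image tag” move works because the deploy system keeps history. A toy sketch of that idea – the class and method names are made up for illustration, not any real deploy tool’s API:

```python
# Sketch of rollback as a first-class operation: keep a deploy history
# so "undo" is just re-pointing at the previous artifact.

class Deployer:
    def __init__(self):
        self.history: list[str] = []  # image tags, newest last

    def deploy(self, image_tag: str) -> str:
        self.history.append(image_tag)
        return image_tag

    def rollback(self) -> str:
        """Revert to the previous known-good tag in a single step."""
        if len(self.history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self.history.pop()       # discard the bad release
        return self.history[-1]  # previous tag is current again
```

The design point: rollback touches no build system and compiles nothing, which is why it fits inside the five-minute budget.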
The goal isn’t to remove all safety checks. It’s to make them fast, automated, and reliable enough that they don’t impede flow.
Fast CI. If your pipeline takes forty-five minutes, engineers will batch changes to avoid running it. Which defeats the purpose of small changes. Invest in fast CI. Parallelize tests. Use caching. Prune flaky tests ruthlessly – a flaky test is worse than no test because it trains the team to ignore failures. Target under ten minutes for the fast path, under twenty for the full suite.
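Pruning flaky tests is easier if you detect them mechanically. One simple signal, sketched below: a test that both passes and fails on the same commit is flaky by definition. The run-record format here is an assumption for illustration; feed it whatever your CI history export looks like.

```python
# Sketch of flaky-test detection from CI history: mixed pass/fail
# results on the same commit mark a test for quarantine.
from collections import defaultdict

def find_flaky(runs: list[tuple[str, str, bool]]) -> set[str]:
    """runs: (test_name, commit_sha, passed). Returns flaky test names."""
    outcomes = defaultdict(set)
    for test, sha, passed in runs:
        outcomes[(test, sha)].add(passed)
    # A (test, commit) pair that saw both True and False is flaky.
    return {test for (test, _), seen in outcomes.items() if len(seen) == 2}
```

Run it weekly over CI history and quarantine anything it flags, rather than letting the team learn to re-run red builds.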
Automated code review for mechanical checks. Humans should review for design, logic, clarity. Automated tools should check formatting, linting, types, dependency vulnerabilities, test coverage. Every mechanical check you automate is one that doesn’t slow down human review and doesn’t get forgotten.
Eliminate manual approval gates for routine deploys. If a change passes CI, passes code review, and merges to trunk, it should flow to production without additional approval. Manual gates are toil – they rarely catch something CI and review missed, and they add latency and coordination overhead. Reserve manual approvals for exceptional cases: schema changes to high-traffic tables, auth logic changes, infrastructure modifications that affect availability.
Invest in test quality, not quantity. A small number of well-designed tests covering critical paths beats a large number that are slow, flaky, and don’t test user-visible behavior. Every test should justify its existence: what production failure would this have prevented? If the answer is “none that matters,” the test is overhead.
All six parts of this series connect into a reinforcing cycle.
You embrace risk (Part 1) by setting SLOs and managing an error budget (Part 2). The budget gives you permission to ship when it’s healthy and signals when to invest in reliability. When you invest, you prioritize eliminating toil (Part 3), which frees time for feature work and automation. You build observability (Part 4) that powers SLI measurement and incident detection. When incidents happen, you manage them and learn through postmortems (Part 5), which produce action items that improve the system. And you continuously reduce the friction of change through release engineering (Part 6), which lets you ship more often, which means smaller changes, which means lower risk per change, which means more budget for moving fast.
Each piece reinforces the others. SLOs without observability are aspirational. Observability without incident management is informational. Incident management without postmortems is reactive. Postmortems without follow-through are performative. And none of it matters if the release process is so cumbersome the team can only deploy once a week.
I started this series because I was doing this work and finding that most SRE literature assumes Google-scale resources. The principles in the SRE book are excellent, but they need translation for orgs where “the SRE team” is whoever’s on-call that week and “the infrastructure budget” is whatever the cloud bill comes to.
The translation isn’t about simplifying the principles – they’re already simple. It’s about right-sizing the implementation. You don’t need a dedicated SRE team. You need the SRE mindset embedded in your existing team. You don’t need a custom observability platform. You need structured logs and a managed monitoring service. You don’t need a complex incident management framework. You need a Slack channel, a timeline, and a thirty-minute postmortem.
The startup that gets these basics right – embracing risk, measuring reliability, eliminating toil, building observability, learning from incidents, reducing change friction – will operate with a level of discipline and velocity that most orgs ten times their size can’t match.
That’s the real promise of SRE at startup scale. Not Google-grade reliability. Appropriate reliability, managed deliberately, leaving maximum capacity for the work that matters most: building the product.