Rob Colantuoni

February 03, 2026

Tags: SRE, Engineering Culture, and Leadership

SRE for Startups, Part 5: When stuff breaks and what you learn from it


This is Part 5 of a series on implementing SRE practices in small startups. Part 1 covered embracing risk. Part 2 covered SLOs and error budgets. Part 3 covered eliminating toil. Part 4 covered observability and alerting. This post covers what happens when things go wrong – how to manage incidents effectively, how to learn from them, and how to run on-call sustainably with a small team.


Things will break. If you’ve internalized Part 1 – that you’re operating within an error budget, not pursuing zero downtime – then breakage isn’t a failure. It’s an expected event you’ve budgeted for. The question isn’t whether you’ll have incidents. It’s whether you’ll handle them well and learn from them.

The SRE book dedicates several chapters to this: effective troubleshooting, emergency response, managing incidents, postmortem culture. At Google’s scale that involves complex coordination across dozens of people. At a startup it’s usually two or three engineers in a Slack channel. But the principles scale down pretty cleanly.

The failure mode that frustrates me most: the silent incident

The SRE book describes an incident management framework with defined roles – Incident Commander, Operations Lead, Communications Lead. At a five-person startup you won't have three people filling defined roles during an incident. You might have one person doing everything.

That’s fine. What matters isn’t the formality. It’s the discipline. Even with a single responder, a few practices make a real difference.

The most common failure I’ve seen at startups is the “silent incident” – something breaks, someone notices, they quietly fix it, and nobody else knows it happened. No record, no analysis, no learning. Same issue three weeks later. Fix this by having a clear, low-friction way to declare an incident. At the team I’ve been working with, it’s a Slack shortcut that creates a timestamped incident channel and posts a notification. The bar is deliberately low: if you’re spending more than fifteen minutes investigating something that affects users, it’s an incident. Doesn’t need to be catastrophic to be worth tracking.
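To make "low-friction declaration" concrete, here's a minimal sketch of the logic such a shortcut might run behind the scenes. Everything here – the channel-naming convention, the function names, the announcement format – is hypothetical illustration, not the actual shortcut described above:

```python
from datetime import datetime

def incident_channel_name(slug: str, started_at: datetime) -> str:
    """Build a timestamped incident channel name, e.g. 'inc-20260203-1432-api-500s'."""
    stamp = started_at.strftime("%Y%m%d-%H%M")
    # Slack channel names must be lowercase, without spaces, and at most 80 chars
    safe_slug = slug.lower().replace(" ", "-")
    return f"inc-{stamp}-{safe_slug}"[:80]

def declare_incident(slug: str, reporter: str, started_at: datetime) -> dict:
    """Return the payload a chat-ops bot could use to open an incident channel
    and post the notification mentioned above. Hypothetical shape."""
    channel = incident_channel_name(slug, started_at)
    return {
        "channel": channel,
        "announcement": f"Incident declared by {reporter}. Join #{channel}.",
    }
```

The point of the sketch is the low bar: one command, one timestamped channel, one broadcast. The actual wiring to your chat tool's API is an afternoon of work.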

During an incident, post a status update every fifteen to thirty minutes. Even if the update is "still investigating, no change." It keeps stakeholders informed without them interrupting the responder, and it creates a timeline that's invaluable for the postmortem. Silence during an incident is more disruptive than bad news: it creates anxiety and triggers escalation.

One more: separate diagnosis from remediation. When you’re under pressure there’s a strong urge to start trying fixes before you understand the problem. Resist it. A misguided fix can make things worse and muddy the diagnostic signal. Spend the time to understand what’s happening before you act. (The exception: when you have an obvious, safe remediation – rolling back a recent deploy, say. Restore service first, then diagnose at leisure.)

Oh, and keep a timeline. Jot down what happened and when, as it happens. “14:32 - customer reported 500s on /api/jobs. 14:35 - confirmed elevated 5xx rate. 14:38 - checked recent deploys, found one at 14:15. 14:42 - initiated rollback.” That timeline becomes the raw material for the postmortem, and it’s way easier to create in real time than to reconstruct from memory afterward.
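If you want the timeline to be more than scattered Slack messages, a tiny append-only log is enough. A sketch, assuming the entry format shown above (the class and method names are mine, not a real tool):

```python
from datetime import datetime
from typing import List, Optional, Tuple

class IncidentTimeline:
    """Append-only log of (timestamp, note) entries kept during an incident."""

    def __init__(self) -> None:
        self.entries: List[Tuple[datetime, str]] = []

    def log(self, note: str, at: Optional[datetime] = None) -> None:
        # Record the note with an explicit timestamp, or "now" if none given
        self.entries.append((at or datetime.now(), note))

    def render(self) -> str:
        """Format entries like '14:32 - customer reported 500s on /api/jobs'."""
        return "\n".join(f"{ts:%H:%M} - {note}" for ts, note in self.entries)
```

The output of render() drops straight into the "what happened" section of the postmortem, which is the whole point of keeping it in real time.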

The postmortem: five questions that actually work

The SRE book is emphatic: postmortems must be blameless. Not performative “blame-free” where everyone says the right things but privately assigns fault. Genuinely blameless – focused on systemic causes, not individual mistakes.

This matters for a reason beyond morale. If people are afraid of being blamed, they'll hide information, minimize their role, and avoid reporting near-misses – and the org loses the information it needs to prevent future incidents. I've written about psychological safety in the context of engineering culture. Blameless postmortems are one of the most concrete expressions of that. They turn incidents – stressful, disruptive, expensive – into learning opportunities that make the system more resilient.

For a startup, the postmortem doesn’t need a formal document with a dozen sections. It needs to answer five questions.

What happened? A concise narrative with the timeline. What broke, when, what was the user impact? Ground it in data – error rates, duration, affected users – not just narrative.

Why did it happen? Root cause. Use the “five whys”: keep asking “why” until you reach a systemic cause, not an individual action. “The deployment caused errors” is a proximate cause. “We don’t have automated canary analysis that would’ve caught the error rate increase before full rollout” is a systemic cause. The former blames a person; the latter identifies a process gap.

How did we respond? What did the team do during the incident? What worked, what didn’t? Were there delays in detection, diagnosis, or remediation? You’re evaluating the incident response process, not your engineers.

What are we going to do about it? Concrete action items with owners and due dates. Not “improve monitoring” – that’s not actionable. “Add latency SLI alerting for the payment endpoint, owned by Alice, due Feb 15.” Action items should address root causes, not symptoms. If the postmortem doesn’t produce at least one concrete improvement, it was wasted.

How are we going to prevent recurrence? Related to but distinct from the action items. Sometimes the best prevention is architectural – redesigning so the failure mode is impossible, not just detectable. Sometimes it’s procedural – adding a checklist step to the deploy process. Sometimes it’s cultural – establishing that Friday afternoon deploys get extra scrutiny.
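The five questions above map naturally onto a lightweight data structure, which is one way to enforce the template. This is a sketch of how I'd model it, not a prescribed format – the field and class names are illustrative:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List

@dataclass
class ActionItem:
    description: str  # concrete, e.g. "Add latency SLI alerting for the payment endpoint"
    owner: str        # a named person, not a team
    due: date

@dataclass
class Postmortem:
    """Lightweight postmortem answering the five questions."""
    what_happened: str             # narrative plus timeline, grounded in data
    why: str                       # systemic root cause, not an individual action
    response: str                  # what worked and what didn't during the incident
    action_items: List[ActionItem] = field(default_factory=list)
    prevention: str = ""           # architectural / procedural / cultural change

    def is_complete(self) -> bool:
        # No owned action item means the postmortem was wasted effort
        return bool(self.action_items) and all(ai.owner for ai in self.action_items)
```

A check like is_complete() is the programmatic version of the rule above: a postmortem without at least one owned, dated action item isn't done.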

Making postmortems a habit when you’d rather move on

At a startup, the temptation after an incident is to fix the immediate problem and move on. The backlog is long, the pressure is high, and writing a postmortem feels like bureaucracy.

Fight it. Postmortems are how incidents become investments instead of losses. Without them you’re paying the cost – downtime, customer impact, stress – without getting the learning.

I’ve found the easiest way to make them a habit is to keep them lightweight. Thirty minutes max. Use a template. Fill it in collaboratively with everyone who was involved. Publish where the whole team can read it. Review action items in the next standup.

The most valuable postmortems I’ve been part of weren’t for the biggest incidents. They were for near-misses – small impact but the underlying cause could’ve been catastrophic. A five-minute outage from a config change that could’ve caused a five-hour outage if it went unnoticed? Worth a thorough postmortem, even though the actual impact was minor.

What sustainable on-call actually looks like for two or three people

On-call at a startup is different. You can’t build a deep rotation with multiple tiers. You might have two or three people who can carry the pager, and they’re the same people building the product.

The SRE book’s chapter on being on-call (Chapter 11) establishes principles that matter regardless of size: on-call should be compensated, incidents should be followed by adequate recovery time, and the on-call load should be quantified and managed.

At a startup I’d prioritize a few things. Two-person minimum rotation – never one person on-call with no backup. The burnout risk, the lack of redundancy, the mental toll of being the single point of responsibility: it’s not sustainable. If you’ve only got two engineers who can handle production, alternate weeks. If one’s unavailable, the other covers, and you owe them time off.

Limit the blast radius. Use the SLO-based alerting from Part 4 so on-call pages are rare and meaningful. If your on-call engineer is getting paged more than twice a week outside business hours, your alerting is too noisy or your system’s too unreliable. Fix one or both before the on-call burden causes attrition.
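That "more than twice a week" threshold is easy to check mechanically if your alerting tool can export page timestamps. A rough sketch – the nine-to-six business-hours window and the weekend rule are assumptions you'd adjust to your team:

```python
from datetime import datetime
from typing import List

def after_hours_pages_per_week(pages: List[datetime],
                               start_hour: int = 9, end_hour: int = 18) -> float:
    """Average weekly count of pages landing outside business hours or on weekends."""
    if not pages:
        return 0.0
    after_hours = [p for p in pages
                   if p.weekday() >= 5                       # Saturday or Sunday
                   or not (start_hour <= p.hour < end_hour)] # outside 9:00-18:00
    span_days = (max(pages) - min(pages)).days or 1
    return len(after_hours) / (span_days / 7)
```

If this number creeps past two, that's the signal to stop and fix the alerting or the system before the rotation burns someone out.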

Business-hours-first for non-critical stuff. Not everything needs a 3 AM response. Slow-burn error budget depletion, non-critical service degradation, informational alerts – those generate tickets, not pages. Reserve pages for situations where user impact is active and growing.

And compensate on-call explicitly. Whether it’s extra pay, comp time, or something else, make it explicit and fair. Implicit expectations – “we all pitch in” – lead to resentment, especially when the burden falls unevenly. Being transparent about the cost also forces the org to take the load seriously. If on-call is expensive, there’s a financial incentive to reduce page volume.

One more thing: incidents as a signal

Zoom out from individual incidents and look at the aggregate. How many per month? Trend increasing or decreasing? Common categories? Are postmortem action items actually getting completed?

That aggregate view connects back to the error budget framework from Part 2. Frequent incidents deplete the budget, which triggers a shift toward reliability work. Completed postmortem action items reduce future incident frequency. The system’s self-correcting – if you let it work.
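The aggregate view doesn't need tooling beyond a spreadsheet, but if your incident records live somewhere queryable, the rollup is a few lines. A minimal sketch, assuming each incident is recorded as a (date, category) pair:

```python
from collections import Counter
from datetime import date
from typing import Dict, List, Tuple

def incident_trends(incidents: List[Tuple[date, str]]) -> Dict[str, dict]:
    """Count incidents per month and per category from (date, category) records."""
    by_month = Counter(d.strftime("%Y-%m") for d, _ in incidents)
    by_category = Counter(cat for _, cat in incidents)
    return {"by_month": dict(by_month), "by_category": dict(by_category)}
```

Reviewing those two counters monthly next to the error budget dashboard is enough to see whether the trend is heading the right way and which failure category deserves the next action item.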

At the startup I’ve been working with, we review incident trends monthly alongside the error budget dashboard. The two tell a coherent story: months where we shipped aggressively had more incidents and faster budget burn. Months where we invested in reliability had fewer incidents and budget recovery. The pattern reinforces the Part 1 message – reliability is a managed resource, not a fixed property.

In Part 6, I’ll close the series with release engineering – how to make shipping software safe, fast, and boring, which is exactly what you want it to be.