Rob Colantuoni

January 13, 2026

Tags: SRE and Infrastructure

SRE for Startups, Part 2: SLIs, SLOs, and Actually Using Your Error Budget

Turning “reliable enough” into a number you can use

This is Part 2 of a series on implementing SRE practices in small startups. Part 1 covered the foundational mindset shift: treating reliability as a budget rather than a virtue. This post gets into the mechanics – how to define what “reliable enough” means concretely and how to use that definition to make better shipping decisions.


In Part 1, I argued that the most important SRE principle for startups is embracing risk – understanding that reliability is a resource to be managed, not a goal to be maximized. But that principle is useless without a mechanism. The mechanism is Service Level Objectives (SLOs) and the error budgets they create.

The SRE book dedicates Chapter 4 to SLOs, and it’s dense with Google-scale specifics. For a startup, most of that detail is unnecessary. What matters is the core loop: define what matters to users, measure it, set a target, and use the gap between the target and reality to make decisions.

The terminology that everyone mixes up

SLI, SLO, SLA – they get confused constantly. An SLI is a quantitative measure of something users care about. Latency, error rate, availability. CPU utilization is not an SLI because users don’t experience CPU utilization. Request latency is, because users experience slow pages.

An SLO is a target value for that SLI. “99.9% of requests will return a successful response within 500ms, measured over a rolling 28-day window.” That’s the line between “reliable enough” and “not reliable enough.”

The error budget is the amount of unreliability your SLO permits. If your availability SLO is 99.9%, your error budget is 0.1%. Over a 28-day window, that’s about 40 minutes of total downtime or roughly 1 in 1000 failed requests.
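The arithmetic is simple enough to sketch in a few lines (a hypothetical helper, just to make the numbers concrete):

```python
def error_budget_minutes(slo: float, window_days: int = 28) -> float:
    """Total downtime the SLO permits over the window, in minutes."""
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes

print(error_budget_minutes(0.999))   # ~40.3 minutes over 28 days
print(error_budget_minutes(0.9999))  # ~4 minutes -- an order of magnitude less room
```

Notice how each extra nine shrinks the budget by 10x; that's why the choice of target matters so much.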

An SLA is different – it’s a contractual obligation with consequences for violation. Refunds, credits, penalties. SLAs are a business construct. SLOs are an engineering construct. Your SLO should be tighter than your SLA, because you want to catch problems before they become contractual violations.

What to measure (and what to skip)

The most common mistake I see is measuring what’s easy instead of what matters. It’s easy to measure uptime – is the server responding to health checks? It’s harder to measure whether the user’s experience is actually good.

For a startup, I’d start with two things. Availability: the proportion of requests that return a successful (non-5xx) response. It’s the most basic metric and it captures the most egregious failures.

And latency: the proportion of requests that complete within an acceptable time. I usually frame this as a percentile target – “95% of requests complete within 300ms, and 99% within 1 second.” Averages hide outliers. A service with a 200ms average might have a 99th percentile of 10 seconds, which means 1 in 100 users is having a terrible experience.
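The gap between averages and percentiles is easy to demonstrate with a synthetic sample (the numbers below are made up to mirror the point above):

```python
import math
import statistics

# Hypothetical latency sample (ms): 98 fast requests and 2 pathological ones.
latencies_ms = [100] * 98 + [5_100] * 2

def percentile(data, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(data)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

print(statistics.mean(latencies_ms))  # 200.0 -- the average looks healthy
print(percentile(latencies_ms, 95))   # 100 -- p95 looks healthy too
print(percentile(latencies_ms, 99))   # 5100 -- p99 exposes the outliers
```

Same data, three very different stories; only the p99 tells you that 1 in 50 users here is waiting five seconds.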

As your product matures, you might add more nuanced SLIs. But availability and latency cover the vast majority of user-facing reliability for a startup.

This is where teams get stuck: picking the number

What should the SLO be? It depends on your product and users, but here are a few things I’ve learned:

Start lower than your instinct. Most engineers default to ambitious targets – 99.99% sounds professional. But an aggressive SLO you can’t meet is worse than a modest one you consistently hit. A target of 99.5% that you hit every month gives you a functioning error budget. A target of 99.99% that you violate every week gives you alarm fatigue and no useful signal.

Benchmark against reality first. Before setting an SLO, measure your current performance for two to four weeks. If you’re at 99.7% availability, setting 99.9% is aspirational – you’ll need to invest before you can meet it. Setting 99.5% gives you a budget to work with immediately.

And differentiate by criticality. Not all endpoints matter equally. Your auth flow probably needs a tighter SLO than your settings page. Payment processing needs a tighter SLO than search suggestions. Identify the critical user journeys and set SLOs for those specifically.

One practical detail: use a 28-day rolling window, not a calendar month. A calendar-month window creates perverse incentives at month-end. Rolling smooths that out.
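A rolling window is also easy to implement. A minimal sketch, assuming you can get daily (successful, total) request counts from your metrics store:

```python
from collections import deque

# Rolling 28-day window of (successful, total) daily request counts.
window = deque(maxlen=28)

def record_day(good: int, total: int) -> float:
    """Add one day's counts and return availability over the current window."""
    window.append((good, total))
    good_sum = sum(g for g, _ in window)
    total_sum = sum(t for _, t in window)
    return good_sum / total_sum

record_day(500, 1_000)      # one very bad day: availability drops to 0.5
for _ in range(27):
    record_day(1_000, 1_000)  # bad day still drags the window down
record_day(1_000, 1_000)    # day 29: the bad day rolls out gradually
```

With a calendar month, that bad day would vanish from the metric all at once on the 1st; with a rolling window, it ages out one day at a time.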

Using the error budget for real

Once you have an SLO, you have an error budget. And the error budget is where the real value lives.

Build a dashboard that shows current SLI performance and remaining error budget. Make it visible to the whole engineering team. This is the single most important artifact in startup SRE – a shared, real-time understanding of how much room you have to maneuver.

When the budget is healthy, deploy freely. Ship multiple times a day. Run experiments. When it’s depleted or nearly there, slow down. Require more thorough testing. Skip the risky refactoring. Focus on stabilization.
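That posture can even be codified. The thresholds below are illustrative assumptions, not a standard; tune them to your team's risk tolerance:

```python
def deploy_posture(budget_remaining: float) -> str:
    """budget_remaining is the fraction of the window's error budget left.
    Thresholds here are illustrative, not prescriptive."""
    if budget_remaining > 0.5:
        return "ship freely"
    if budget_remaining > 0.1:
        return "ship with extra testing and review"
    return "stabilize: reliability work first"

print(deploy_posture(0.8))   # ship freely
print(deploy_posture(0.05))  # stabilize: reliability work first
```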

This isn’t rigid – at a startup, sometimes you have to ship even when the budget is thin because the feature is business-critical. That’s fine. The point isn’t a bureaucratic gate. It’s making the cost of that decision visible. “We’re shipping with a depleted error budget, which means if this deployment causes problems, we’ll be below our SLO” enables an informed decision. “We’re shipping and hoping it’s fine” does not.

One more thing: review the budget weekly. Fifteen minutes, tops. Look at the trend. Are we burning it faster than expected? Is there a pattern – a specific endpoint, a specific time of day? This cadence catches slow degradation that incident-driven responses miss.
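The "faster than expected" question has a standard framing: burn rate, the ratio of your actual error rate to the rate the SLO allows. A sketch:

```python
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """Ratio of actual error rate to the SLO's allowed error rate.
    1.0 means the budget exhausts exactly at the end of the window."""
    return (errors / requests) / (1 - slo)

# 300 errors in 100k requests against a 99.9% SLO:
print(burn_rate(300, 100_000, 0.999))  # ~3.0 -- burning three times too fast
```

A sustained burn rate of 3 means a 28-day budget is gone in about 9 days, which is exactly the kind of trend a weekly review should catch.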

When you don’t have the data

A common objection: “We don’t have the instrumentation to measure SLIs accurately.” It's a valid concern, but it’s the wrong reason not to start.

Start with what you have. If you’re behind a load balancer, you have access logs with response codes and latencies. That’s enough for basic availability and latency SLIs. If you’re on a cloud provider, managed services emit metrics that can serve as proxy SLIs.
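As a sketch of how far plain access logs get you, suppose each line has been reduced to a status code and a latency in seconds (the field layout here is an assumption; adapt the parsing to your load balancer's actual format):

```python
# Hypothetical pre-extracted log lines: "<status> <latency_seconds>".
raw_lines = [
    "200 0.120",
    "200 0.450",
    "503 0.010",   # counts against availability, not latency
    "200 1.200",   # counts against latency, not availability
]

def slis_from_lines(lines, latency_threshold_s=0.5):
    """Return (availability SLI, latency SLI) from parsed log lines."""
    parsed = [(int(s), float(t)) for s, t in (line.split() for line in lines)]
    total = len(parsed)
    good = sum(1 for status, _ in parsed if status < 500)
    fast = sum(1 for _, secs in parsed if secs <= latency_threshold_s)
    return good / total, fast / total

print(slis_from_lines(raw_lines))  # (0.75, 0.75)
```

That's a complete availability SLI and latency SLI from nothing but logs you almost certainly already have.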

The first version will be imperfect. Some requests won’t be counted. Some latencies will be measured at the wrong layer. That’s fine. An imperfect SLI you actually use beats a perfect one you never build. (I’ve spent too much time chasing perfect instrumentation instead of shipping something usable.)

The goal is building understanding of your system’s behavior, not achieving measurement perfection. SLOs give that understanding a concrete frame.

What you get out of it

The combination of SLOs and error budgets gives a startup something most small teams lack: a principled, data-driven answer to “should we ship this now or fix reliability first?”

Without SLOs, this devolves into vibes and organizational politics. The cautious engineer argues for more testing. The ambitious PM argues for faster shipping. The debate is unresolvable because there’s no shared frame of reference.

With SLOs, the answer is legible. Check the budget. If you have room, ship. If you don’t, stabilize. It’s not magic, and there will be exceptions, but the default path is clear. And that clarity, more than any specific reliability improvement, is what changes how a startup team operates.

In Part 3, I’ll turn to toil – the operational work that grows silently and eats your engineering capacity if you don’t actively fight it.