Rob Colantuoni

January 27, 2026

Tags: SRE, Infrastructure, and DevOps

SRE for Startups, Part 4: What you actually need to see (and when to panic)

See what matters, alert on what’s actionable

This is Part 4 of a series on implementing SRE practices in small startups. Part 1 covered embracing risk. Part 2 covered SLOs and error budgets. Part 3 covered eliminating toil. This post covers how to build observability and alerting that actually helps a small team operate their systems without drowning in noise.


I wrote about observability back in 2021, arguing that the shift from predefined dashboards to queryable telemetry was one of the most important trends in operations. Everything I said then still holds. But applying that at a startup? Different ballgame. You’ve got less time, less budget, and way less tolerance for operational overhead in your tooling.

The SRE book covers monitoring in Chapter 6, and there’s one principle that really sticks with me: every alert should be actionable, and every action should be necessary. If an alert fires and the correct response is “ignore it,” kill the alert. If an alert fires and the correct response is always the same manual step, automate the step and kill the alert.

For a startup, that hits harder because your on-call engineer is also your feature developer. Every false alert doesn’t just waste a few minutes on acknowledgment – it shatters a block of focused work.

So what does an observability stack look like at a small shop?

You don’t need a bespoke platform. You need four things, and most of them can be off-the-shelf.

Structured logging first. Every meaningful operation in your app should emit a structured log event – JSON, not free-form text. Request ID, endpoint, response code, latency, user ID, whatever business context matters. That’s the raw material for everything else: SLI computation, debugging, the ad-hoc stuff you’ll inevitably need. At a startup I’d log to stdout in JSON and ship to a managed service – Datadog, Grafana Cloud, or even CloudWatch with Insights. The managed service costs money, but it costs far less than running your own log aggregation (which I’ve done, and it’s a time sink).
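As a concrete sketch (stdlib only; the field names, endpoint, and user ID here are illustrative, not a schema – your framework's logging middleware would do most of this for you):

```python
import json
import sys
import time
import uuid

def log_event(**fields):
    """Emit one structured JSON log line to stdout; return it for reuse."""
    event = {"ts": time.time(), **fields}
    sys.stdout.write(json.dumps(event) + "\n")
    return event

# Hypothetical request handler instrumentation:
start = time.monotonic()
# ... do the actual work of the request here ...
entry = log_event(
    request_id=str(uuid.uuid4()),
    endpoint="/api/checkout",   # example endpoint
    status=200,
    latency_ms=round((time.monotonic() - start) * 1000, 1),
    user_id="u_123",            # whatever business context matters
)
```

One JSON object per line on stdout is the lowest-friction format: every managed log service mentioned above can parse it without configuration.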

Then metrics for your SLIs. From Part 2, you need to measure availability and latency at minimum. You can derive these from your structured logs, or emit them as dedicated metrics with Prometheus or your cloud provider's metrics service. The real requirement: you can compute SLO compliance over a rolling window and see your error budget status on a dashboard. Keep it small. Ten to twenty key metrics, not thousands. Request rate, error rate, latency percentiles per critical endpoint. Queue depth for async stuff. Resource utilization. That's it. You can add more when your system grows.
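The budget math itself is simple enough to sketch directly. Assuming a 99.9% availability SLO over a rolling window (substitute your own target), compliance and budget remaining come straight from request counts:

```python
# Minimal error-budget math for an availability SLI.
# SLO_TARGET is an example; use the target you set in Part 2's exercise.
SLO_TARGET = 0.999

def error_budget_status(total_requests, failed_requests):
    """Return (compliance, budget_remaining_fraction) for the window."""
    if total_requests == 0:
        return 1.0, 1.0
    compliance = 1 - failed_requests / total_requests
    allowed_failures = (1 - SLO_TARGET) * total_requests
    remaining = 1 - failed_requests / allowed_failures
    return compliance, remaining

# 10M requests, 4,000 failures over the window:
# 99.96% compliant, with 60% of the error budget left.
compliance, remaining = error_budget_status(10_000_000, 4_000)
```

That `remaining` number is the one that belongs on the dashboard – it's the input to the burn-rate alerting described below.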

Distributed tracing? Only when you need it. If you’ve got more than two services talking synchronously, tracing gets valuable fast. A trace that follows a request from the API gateway through the app to the database and back – that tells you where latency piles up and where errors come from. OpenTelemetry has made this way easier than it was a few years ago. And if you’re running a monolith (which, at a startup, you probably should be), tracing is less critical. Application-level profiling and structured logs cover most of what you’ll need.
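To make the idea concrete without pulling in a full SDK, here's a toy stand-in for what a tracer records – nested, timed spans following one request through the layers. In practice you'd use OpenTelemetry's instrumentation rather than rolling your own; this just shows the shape of the data:

```python
import contextlib
import time

spans = []  # (name, duration_ms) pairs; a real tracer exports these with IDs

@contextlib.contextmanager
def span(name):
    """Toy stand-in for a tracing span: time one step of a request."""
    start = time.monotonic()
    try:
        yield
    finally:
        spans.append((name, (time.monotonic() - start) * 1000))

# Following one hypothetical request through the layers:
with span("request /api/order"):
    with span("app.validate"):
        pass  # ... validation work ...
    with span("db.insert_order"):
        pass  # ... database call ...
```

Inner spans close first, so the recorded durations tell you exactly which layer the latency piled up in – that's the whole value proposition, whether the spans come from this toy or from OpenTelemetry.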

One more thing: build one dashboard you actually look at. Not five, not ten – one. SLIs, error budget status, deployment frequency, current alerts. Make it the default tab for the team. A dashboard nobody looks at is worse than no dashboard. It gives you the illusion of observability.

The trap I fell into: alerting on everything

The SRE book talks about three categories of monitoring output: alerts (need immediate human action), tickets (need action but not right now), and logging (for diagnostics, no action). Most startups route way too much to the alert channel. The result is alert fatigue – engineers start ignoring alerts because most don’t require action, and then they miss the ones that do. I’ve seen this destroy on-call effectiveness at orgs of every size.

How I set things up now: alert on SLO burn rate, not raw metrics. Don't alert when error rate exceeds 1%. Alert when you're burning through your error budget at a rate that would exhaust it within a specific window. The SRE workbook calls this "multi-window, multi-burn-rate alerting" – the principle is sound even if you can simplify the implementation.

In practice I use two tiers. Fast burn: you're consuming error budget at a rate that'll exhaust it within an hour. That pages someone immediately. Slow burn: you're on track to exhaust it within three days. That creates a ticket for business-hours investigation. This cuts alert volume dramatically because transient spikes that resolve on their own don't page anyone. A two-minute burst that doesn't materially impact the budget? No page. Sustained elevation that threatens the budget? Page. The alerts you get are, by definition, the ones that matter.
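Here's the shape of that two-tier check, assuming a 99.9% SLO over a 30-day window (the thresholds are illustrative – plug in your own budget):

```python
# Burn rate 1.0 means spending the budget at exactly the pace that
# exhausts it at the end of the window. Higher = faster.
SLO_TARGET = 0.999
WINDOW_HOURS = 30 * 24          # 720-hour (30-day) SLO window

FAST_BURN = WINDOW_HOURS / 1          # budget gone within ~1 hour -> 720x
SLOW_BURN = WINDOW_HOURS / (3 * 24)   # budget gone within ~3 days -> 10x

def burn_rate(error_rate):
    """How fast the current error rate spends the budget."""
    return error_rate / (1 - SLO_TARGET)

def classify(error_rate):
    rate = burn_rate(error_rate)
    if rate >= FAST_BURN:
        return "page"    # wake someone up
    if rate >= SLOW_BURN:
        return "ticket"  # business-hours investigation
    return "ok"

# A 2% error rate burns budget at 20x pace: ticket-worthy, not page-worthy.
# A 75% error rate (hard outage) would exhaust the budget within the hour.
```

A production setup would also evaluate each rate over both a long and a short window so alerts clear quickly once the problem stops – that's the "multi-window" half of the workbook's recipe – but the two-tier classification above is the core of it.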

Eliminate alerts that aren’t actionable. When you create an alert, write down what the on-call engineer should do when it fires. Can’t articulate the action? No alert. The action is always the same? Automate it and remove the alert. The action is “check the dashboard and decide if it’s a problem”? Your threshold is wrong. Review alerts monthly. For each one that fired: did someone take action? Was it necessary? Could it have been automated? Prune aggressively. A small set of high-signal alerts beats a large set with noise every time.

And use separate channels for different urgencies. Pages go to PagerDuty and wake people up. Tickets go to Linear or Jira for business hours. Informational stuff goes to a Slack channel people check when they have time. Never mix urgencies in the same channel.

Making observability something you do, not something you have

At a startup you don’t have a dedicated observability team instrumenting your code for you. It has to be a development practice – something every engineer does as part of building features.

The practical version: when you write a new endpoint, emit a structured log event with the relevant context. When you write a new async job, emit events at start and end with the job ID and outcome. When you call an external service, log the latency and response status. Adds maybe five minutes per feature. In return you can debug that feature in production without deploying new instrumentation. The compound benefit over time is huge.
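A small decorator makes the async-job half of this nearly free. Everything here is hypothetical – the names are made up and you'd adapt it to whatever job framework you run – but it shows how little code "instrumentation as a practice" actually requires:

```python
import functools
import json
import sys
import time

events = []  # captured here for illustration; in production, ship to your log service

def log_event(**fields):
    """Emit one structured JSON line and keep a copy."""
    events.append(fields)
    sys.stdout.write(json.dumps(fields) + "\n")

def instrumented(job_name):
    """Hypothetical decorator: start/end events around any async job."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(job_id, *args, **kwargs):
            log_event(event="job.start", job=job_name, job_id=job_id)
            start = time.monotonic()
            try:
                result = fn(job_id, *args, **kwargs)
                outcome = "ok"
                return result
            except Exception:
                outcome = "error"
                raise
            finally:
                log_event(event="job.end", job=job_name, job_id=job_id,
                          outcome=outcome,
                          latency_ms=round((time.monotonic() - start) * 1000, 1))
        return wrapper
    return decorate

@instrumented("send_receipt_email")
def send_receipt_email(job_id, order_id):
    pass  # ... call the email provider; log its latency and status too ...
```

Write the decorator once, apply it in one line per job, and every job in the system gets start/end/outcome events for free.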

In code review, treat missing instrumentation like missing tests. Not a nice-to-have – a requirement for operating the feature in production.

What I’d skip if I were starting over

Don’t build your own monitoring stack. Running Prometheus, Grafana, Loki, and Tempo is a full-time job. Use managed services. The monthly cost of Datadog or Grafana Cloud is a fraction of what you’d spend maintaining a self-hosted stack. I’ve run observability infrastructure. It’s harder than it looks.

Don’t monitor everything. Comprehensive monitoring is a goal for mature orgs with dedicated platform teams. For a startup, monitor the critical paths and accept blind spots. Add instrumentation when those blind spots bite you. Trying to monitor everything from day one leads to alert fatigue, dashboard overload, and wasted time.

And don’t set up on-call before you have SLOs. On-call without SLOs is just “be available to respond to whatever happens.” Exhausting and unsustainable. SLOs give on-call a purpose: protect the error budget. Without them you’re asking engineers to sacrifice their evenings for a vaguely defined obligation.

In Part 5, I’ll cover what happens when things go wrong despite your best efforts: incident management, postmortem culture, and sustainable on-call for small teams.