Rob Colantuoni

August 09, 2021

Tags: Infrastructure and DevOps

Your Dashboards Aren't Going to Save You

Why I’ve been rethinking monitoring from scratch

I’ve spent a lot of time this year thinking about how we understand what’s happening in production. The traditional approach – build some dashboards, set threshold alerts, write runbooks – worked fine when systems were simpler. But our architectures have gotten complicated enough that I don’t think that model works anymore.

The word people are using is “observability,” and I resisted it for a while because it sounded like marketing. But there’s a real distinction here and it matters.

The problem with dashboards

Traditional monitoring answers known questions. You decide ahead of time what to measure, what thresholds to set, what graphs to display. Something goes wrong, you look at your dashboards, and you hope the answer is somewhere in a metric you thought to collect.

This is fine when your system fails in predictable ways. If it’s always the same three failure modes, a good set of dashboards catches most of it. But distributed systems don’t fail in predictable ways. You get emergent behavior – weird interactions between components nobody anticipated, cascading failures that hop service boundaries, performance issues that only hit specific customer segments.

When you’re dealing with a novel failure, staring at predefined dashboards is like looking for your keys under the streetlight. The answer is probably in the dark.

What observability actually means

The concept comes from control theory. A system is observable if you can figure out its internal state from its external outputs. For software, that means you can ask arbitrary questions about behavior – questions you didn’t plan for – and get answers without shipping new code.

The practical difference: with monitoring, you say “show me the dashboard for service X” and you look at predefined graphs of predefined metrics. With observability, you say “show me the p99 latency for requests to service X from customer Y on endpoint Z, broken down by database shard, for the last hour.” You’re querying high-cardinality data to test hypotheses in real time.

The enabler is structured event data. Instead of shipping pre-aggregated metrics, you emit rich, structured events for every meaningful operation. Each event carries all the context – request ID, customer ID, endpoint, status code, latency, cache hit/miss, active feature flags. From those events you can compute any metric you want on the fly.
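To make that concrete, here is a minimal sketch in Python (field names like `db_shard` and `cache_hit` are illustrative, not a prescribed schema): emit one structured event per request, then answer a question you didn't plan for, p99 latency broken down by shard, straight from the raw events.

```python
# Hedged sketch: one structured event per operation, carrying full context.
# Field names and values are illustrative, not a real production schema.
import json
import math

def emit_event(**fields) -> str:
    """Serialize one rich, structured event (in practice you'd ship this
    to your telemetry pipeline instead of returning it)."""
    return json.dumps(fields)

# Simulate 100 request events with high-cardinality context attached.
events = [
    json.loads(emit_event(
        request_id=f"r{i}", customer_id="cust-42", endpoint="/checkout",
        status=200, db_shard=f"shard-{i % 2}", cache_hit=(i % 3 == 0),
        latency_ms=20 + i,
    ))
    for i in range(100)
]

def p99(latencies):
    """Nearest-rank p99 over a list of latencies."""
    ordered = sorted(latencies)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]

# "Show me p99 latency per database shard" -- computed on the fly from
# raw events, no pre-aggregated metric required.
by_shard = {}
for e in events:
    by_shard.setdefault(e["db_shard"], []).append(e["latency_ms"])
result = {shard: p99(ls) for shard, ls in sorted(by_shard.items())}
```

The same event stream answers any other slicing you think of later (per customer, per endpoint, cache hits only), which is exactly what a pre-aggregated metric can't do.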

The “three pillars” thing and why it’s incomplete

The industry has settled on “metrics, logs, and traces” as the three pillars of observability. I think this framing is useful but a bit misleading.

Metrics are pre-aggregated time series. Cheap to store, fast to query, but low-cardinality. You can track requests per endpoint, but not per customer per endpoint per region. When you need to drill down, you run out of dimensions fast.

Logs are high-cardinality but expensive and slow at scale. And unstructured logs resist the kind of ad-hoc slicing that observability demands. Grepping through text is not analysis.

Traces show request paths through the system, which is fantastic for understanding latency and failure propagation. But trace data is huge and most systems sample aggressively, so you might not have the trace you need when you need it.

The real goal is unified, queryable telemetry where you can explore system behavior without jumping between three different tools with three different query languages.

Things I’m watching

Columnar storage for telemetry data – ClickHouse and similar – is enabling fast ad-hoc queries over high-cardinality event data at scale. That’s the storage backend observability actually needs.

OpenTelemetry is creating a vendor-neutral instrumentation standard. If it succeeds (and I think it will), you instrument once and you’re not locked into a specific vendor. Big deal.

SLO-based alerting. Instead of “CPU > 80%” or “error rate > 1%,” you alert on error budget burn rate. “We’re consuming error budget fast enough to exhaust it in 6 hours” is way more actionable than a raw metric threshold. Google’s SRE team has been pushing this approach and it makes a ton of sense.
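The arithmetic behind that is simple enough to sketch (illustrative numbers: a 99.9% SLO over a 30-day window; none of these figures come from the post):

```python
# Hedged sketch of SLO burn-rate alerting. Numbers are illustrative.
# SLO: 99.9% success over 30 days -> error budget = 0.1% of requests.
SLO_TARGET = 0.999
WINDOW_HOURS = 30 * 24  # 720

def burn_rate(observed_error_rate: float) -> float:
    """How many times faster than 'sustainable' we're spending budget.
    A burn rate of 1.0 exhausts the budget exactly at window end."""
    budget = 1.0 - SLO_TARGET
    return observed_error_rate / budget

def hours_to_exhaustion(observed_error_rate: float) -> float:
    """Time until the window's entire budget is gone at the current rate."""
    return WINDOW_HOURS / burn_rate(observed_error_rate)

# A 1% error rate burns a 0.1% budget ~10x too fast (~72 hours left).
rate = burn_rate(0.01)
hours = hours_to_exhaustion(0.01)

# Page when the budget would be gone within, say, 6 hours.
should_page = hours_to_exhaustion(0.15) < 6
```

Note how the alert condition is phrased in terms the business cares about (hours of budget left), not a raw infrastructure threshold.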

It’s a culture shift more than a tooling shift

You have to move from certainty to curiosity. Monitoring culture resolves incidents by following runbooks. Observability culture resolves incidents by asking questions and following the data wherever it leads. That requires deeper systems understanding and a willingness to explore.

Instrumentation has to become a first-class development concern. Treat missing instrumentation in a code review the same way you’d treat missing tests. If observability depends on rich event data, then emitting that data is part of building the feature, not something you bolt on later.

And it breaks down silos. When a service emits rich telemetry, anyone can investigate it, not just the team that owns it. Incident response gets faster. Cross-team debugging becomes possible.

I think observability is going to be as fundamental to running software as testing is to building it. The tools are getting there fast. The gap right now is cultural, not technical.