SRE Golden Signals: Latency, Traffic, Errors, and Saturation Explained

<h1 id="sre-golden-signals-latency-traffic-errors-and-saturation-explained">SRE Golden Signals: Latency, Traffic, Errors, and Saturation Explained</h1>
<p>Before you can monitor a system effectively, you need to know what to monitor. Most systems generate hundreds of metrics. Most of those metrics don't tell you whether users are having a good experience.</p>
<p>Google's SRE team solved this problem with a framework they call the four golden signals - four metrics that, together, give you a complete picture of whether a service is healthy. If you only have bandwidth to monitor four things, these are the four things.</p>
<p>This guide covers what each signal is, how to measure it, what thresholds make sense, and how to use all four together for meaningful alerting.</p>
<hr>
<h2 id="what-are-the-four-golden-signals-">What Are the Four Golden Signals?</h2>
<p>The four golden signals come from Google's SRE book (available free at sre.google). The core claim: for any user-facing service, monitoring these four metrics gives you sufficient signal to detect virtually every meaningful failure mode.</p>
<p>The four signals are:</p>
<ol>
<li><strong>Latency</strong> - how long requests take</li>
<li><strong>Traffic</strong> - how much demand the system is handling</li>
<li><strong>Errors</strong> - what fraction of requests fail</li>
<li><strong>Saturation</strong> - how "full" the system is</li>
</ol>
<p>They're not the only metrics worth tracking. But they're the ones worth alerting on - the metrics that, when something is wrong, will tell you something is wrong.</p>
<hr>
<h2 id="signal-1-latency">Signal 1: Latency</h2>
<p>Latency is how long it takes to service a request. Seems simple. There are a few important nuances.</p>
<h3 id="measure-p99-not-just-average">Measure p99, not just average</h3>
<p>Average latency hides problems. If 99% of your requests complete in 50ms but 1% take 30 seconds, the mean works out to about 350ms. That number might look acceptable while 1 in 100 users is having an awful experience.</p>
<p>Percentile measurements tell a more honest story:</p>
<ul>
<li><strong>p50</strong> (median): What the typical request experiences - half are faster, half are slower</li>
<li><strong>p95</strong>: 95% of requests are at least this fast - covers most users, but excludes the slowest 5%</li>
<li><strong>p99</strong>: 99% of requests are at least this fast - covers almost everyone</li>
<li><strong>p99.9</strong>: Only the slowest 0.1% exceed this - often where you find systemic issues</li>
</ul>
<p>For most SLOs, p99 is the right level to alert on. p99.9 is worth tracking but is often too noisy for on-call alerts.</p>
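<p>To make the gap between the mean and the tail concrete, here is a minimal sketch using synthetic data. The sample mix (985 fast requests, 15 slow ones) and the nearest-rank <code>percentile</code> helper are assumptions for illustration; a real metrics backend computes percentiles from histograms and may interpolate differently.</p>

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]

# Synthetic data (assumed): 985 requests at 50ms, 15 requests at 30 seconds.
latencies_ms = [50] * 985 + [30_000] * 15

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

print(f"mean={mean_ms}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```

<p>In this sample the mean sits just under 500ms and p50 and p95 both report 50ms. Only p99 exposes the 30-second requests.</p>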
<h3 id="separate-successful-and-failed-request-latency">Separate successful and failed request latency</h3>
<p>Failed requests are often fast. If a request immediately returns a 500 error, it might complete in 2ms - artificially pulling your average latency down even as your system degrades. Track latency for successful requests separately from failed ones.</p>
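<p>A small sketch of that split, assuming a hypothetical request log of <code>(status, latency_ms)</code> pairs and treating any 5xx status as a failure:</p>

```python
from statistics import mean

# Hypothetical request log: (HTTP status, latency in ms).
requests = [(200, 120), (200, 140), (200, 130), (500, 2), (500, 2), (500, 2)]

def is_success(status):
    # Assumption: only 5xx responses count as failures here.
    return status < 500

ok_ms = [ms for status, ms in requests if is_success(status)]
failed_ms = [ms for status, ms in requests if not is_success(status)]

overall_mean = mean(ms for _, ms in requests)
ok_mean = mean(ok_ms)
failed_mean = mean(failed_ms)

print(f"overall={overall_mean}ms successful={ok_mean}ms failed={failed_mean}ms")
```

<p>The combined mean is roughly half of what successful requests actually take, because the fast failures drag it down.</p>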
<h3 id="what-good-looks-like">What good looks like</h3>
<p>Latency thresholds vary widely by service type:</p>
<ul>
<li>User-facing synchronous APIs: p99 under 500ms is a reasonable starting point</li>
<li>Database queries: depends heavily on query complexity and data volume</li>
<li>Background jobs: depends on business requirements, not user experience</li>
<li>Real-time systems: often have sub-100ms requirements</li>
</ul>
<p>Set thresholds based on your actual SLO, not on what sounds good. If your SLO says 95% of requests must complete within 300ms, your latency alert should fire before you breach that target.</p>
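<p>One way to express "fire before you breach" is a warning band above the SLO target. The sketch below uses the 300ms / 95% example from above; the function names and the 2-point <code>warn_margin</code> are assumptions, not a recommended value.</p>

```python
def fraction_within(latencies_ms, threshold_ms):
    """Share of requests that completed within the threshold."""
    within = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return within / len(latencies_ms)

def latency_alert(latencies_ms, threshold_ms=300, slo_target=0.95, warn_margin=0.02):
    """Return 'ok', 'warn' (close to the SLO target), or 'breach' (below it)."""
    ratio = fraction_within(latencies_ms, threshold_ms)
    if ratio < slo_target:
        return "breach"
    if ratio < slo_target + warn_margin:
        return "warn"
    return "ok"

# 96% of requests within 300ms: still compliant, but inside the warning band.
print(latency_alert([100] * 96 + [900] * 4))
```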
<hr>
<h2 id="signal-2-traffic">Signal 2: Traffic</h2>
<p>Traffic measures the demand on your system - typically requests per second for a web service, but the right metric depends on what your service does.</p>
<h3 id="traffic-metrics-by-service-type">Traffic metrics by service type</h3>
<ul>
<li><strong>HTTP API</strong>: requests per second</li>
<li><strong>Database</strong>: queries per second, transactions per second</li>
<li><strong>Message queue</strong>: messages published per second, queue depth</li>
<li><strong>Streaming service</strong>: bytes transferred per second</li>
<li><strong>Batch jobs</strong>: jobs per hour, records processed per minute</li>
</ul>
<h3 id="why-traffic-is-a-golden-signal">Why traffic is a golden signal</h3>
<p>Traffic isn't usually what you alert on - high traffic is generally good. Its value is as context for the other signals.</p>
<p>An error rate of 2% at 100 requests/second means 2 errors per second. The same 2% error rate at 10,000 requests/second means 200 errors per second - a very different situation. Without traffic context, error rates are hard to interpret.</p>
<p>Traffic is also essential for capacity planning and saturation analysis. Sudden traffic spikes correlate with saturation; traffic drops during incidents can indicate that users are giving up rather than the service recovering.</p>
<h3 id="traffic-anomalies-worth-alerting-on">Traffic anomalies worth alerting on</h3>
<p>While high traffic is usually fine, traffic anomalies warrant attention:</p>
<ul>
<li><strong>Sudden spike</strong>: possible traffic surge, DDoS, or bot activity</li>
<li><strong>Sudden drop</strong>: possible upstream failure, routing problem, or users abandoning due to errors</li>
<li><strong>Unexpected pattern</strong>: traffic at 3am when you normally have none could be a scraper or an attack</li>
</ul>
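<p>A rough sketch of how those three cases could be told apart by comparing current traffic to a baseline for the same time of day. The <code>spike_factor</code> and <code>drop_factor</code> values are placeholder assumptions and would need tuning per service.</p>

```python
def classify_traffic(current_rps, baseline_rps, spike_factor=3.0, drop_factor=0.5):
    """Compare current requests/second to a baseline for the same time of day."""
    if baseline_rps == 0:
        # Any traffic when none is normally expected is treated as unexpected.
        return "unexpected" if current_rps > 0 else "normal"
    ratio = current_rps / baseline_rps
    if ratio >= spike_factor:
        return "spike"
    if ratio <= drop_factor:
        return "drop"
    return "normal"

print(classify_traffic(current_rps=1000, baseline_rps=250))
```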
<hr>
<h2 id="signal-3-errors">Signal 3: Errors</h2>
<p>Errors measure the rate of requests that fail - explicitly (HTTP 500s, timeouts) or implicitly (HTTP 200s that return wrong data).</p>
<h3 id="explicit-vs-implicit-errors">Explicit vs. implicit errors</h3>
<p><strong>Explicit errors</strong> are easy to measure: 5xx HTTP responses, connection timeouts, application-level exceptions. Your APM tool (Datadog, New Relic, Sentry) captures these automatically.</p>
<p><strong>Implicit errors</strong> are harder: requests that technically succeed (200 OK) but return wrong, incomplete, or stale data. These require understanding what "correct" means for your service and validating responses against that definition.</p>
<p>Most teams start by alerting on explicit errors and add implicit error tracking as they mature their reliability practice.</p>
<h3 id="error-rate-calculation">Error rate calculation</h3>
<pre><code>Error rate (%) = (Failed requests / Total requests) × 100
</code></pre><p>For SLO purposes, calculate this over a rolling window (5 minutes for alerting, 28 days for SLO tracking).</p>
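<p>A minimal in-memory sketch of the rolling-window calculation. The <code>RollingErrorRate</code> class is hypothetical; in practice your metrics backend usually does this for you with a windowed rate query.</p>

```python
from collections import deque

class RollingErrorRate:
    """Error rate in percent over the last `window_s` seconds."""

    def __init__(self, window_s=300):
        self.window_s = window_s
        self.events = deque()  # (timestamp_s, failed) pairs, oldest first

    def record(self, timestamp_s, failed):
        self.events.append((timestamp_s, failed))

    def rate(self, now_s):
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now_s - self.window_s:
            self.events.popleft()
        if not self.events:
            return 0.0
        failures = sum(1 for _, failed in self.events if failed)
        return 100 * failures / len(self.events)

tracker = RollingErrorRate(window_s=300)
for t in range(100):
    tracker.record(t, failed=(t % 10 == 0))  # every 10th request fails
print(tracker.rate(now_s=100))
```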
<h3 id="what-error-rate-thresholds-to-use">What error rate thresholds to use</h3>
<p>This depends entirely on your SLO. If your availability SLO is 99.9%, your error rate can be at most 0.1% on average before you're violating it.</p>
<p>For alerting:</p>
<ul>
<li>Alert at a rate that gives you warning before SLO violation, not after</li>
<li>A burn-rate-based alert (how many times faster than the sustainable rate you are consuming your error budget) is more useful than a fixed threshold</li>
<li>Separate alerts for error rate increasing rapidly vs. error rate stable at an elevated level - the former is more urgent</li>
</ul>
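<p>Burn rate is commonly defined as the observed error rate divided by the error rate your SLO allows. A short sketch under that definition, assuming a 28-day SLO window; the function names are illustrative.</p>

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than 'exactly on budget' errors are being spent."""
    error_budget = 1 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_rate / error_budget

def hours_to_exhaustion(rate, window_days=28):
    """Hours until the whole budget is gone if this burn rate holds."""
    return window_days * 24 / rate

rate = burn_rate(error_rate=0.01, slo_target=0.999)
print(f"burn rate ~{rate:.1f}x, budget gone in ~{hours_to_exhaustion(rate):.0f} hours")
```

<p>A burn rate of 1 means the budget lasts exactly the SLO window. A 1% error rate against a 99.9% SLO burns about ten times faster, so the 28-day budget would be gone in under three days.</p>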
<h3 id="don-t-aggregate-across-everything">Don't aggregate across everything</h3>
<p>A 0.5% overall error rate might look acceptable while the payment API is failing at 8%. Aggregate error rates hide the location of the problem. Track error rates per service or per endpoint for the services that matter most.</p>
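<p>The arithmetic behind that example, sketched with made-up request counts (the endpoint names and numbers are assumptions chosen to reproduce the 0.5% and 8% figures):</p>

```python
# Hypothetical per-endpoint counts: endpoint -> (total requests, failed requests).
counts = {
    "/search": (6000, 6),
    "/profile": (3500, 4),
    "/payments": (500, 40),
}

def error_rate_pct(total, failed):
    return 100 * failed / total

overall = error_rate_pct(
    sum(total for total, _ in counts.values()),
    sum(failed for _, failed in counts.values()),
)
per_endpoint = {name: error_rate_pct(t, f) for name, (t, f) in counts.items()}
worst = max(per_endpoint, key=per_endpoint.get)

print(f"overall={overall}% worst={worst} at {per_endpoint[worst]}%")
```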
<hr>
<h2 id="signal-4-saturation">Signal 4: Saturation</h2>
<p>Saturation measures how "full" your system is - how close it is to a resource limit. Unlike the other three signals, saturation is often a leading indicator: it tells you a problem is coming before errors or latency increase.</p>
<h3 id="what-to-measure-for-saturation">What to measure for saturation</h3>
<p>The right saturation metric is whichever resource limits your service first:</p>
<ul>
<li><strong>CPU-bound services</strong>: CPU utilization percentage</li>
<li><strong>Memory-bound services</strong>: memory utilization, heap usage</li>
<li><strong>I/O-bound services</strong>: disk I/O utilization, network bandwidth</li>
<li><strong>Database connections</strong>: connection pool utilization</li>
<li><strong>Queue-based services</strong>: queue depth, consumer lag</li>
</ul>
<p>Most services have multiple potential saturation points. Identify which resource is typically the constraint and start there.</p>
<h3 id="saturation-thresholds">Saturation thresholds</h3>
<p>Saturation thresholds depend heavily on the resource. A CPU-bound service operating at 80% utilization might be fine for months; a memory-bound service at 80% might be minutes from OOM.</p>
<p>The useful question isn't "what's the utilization" but "at current rate of increase, when will we hit the limit?" Saturation trend matters more than point-in-time saturation level.</p>
<p>Practical starting points:</p>
<ul>
<li>Alert at 70-80% for most resources (gives you time to respond before hitting the limit)</li>
<li>Track trend over the last 15 minutes, not just current value</li>
<li>Alert on rapid increase, even if absolute level is moderate</li>
</ul>
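<p>The "when will we hit the limit?" question can be approximated with a straight-line extrapolation. This sketch uses only the first and last sample in the window, which is an assumption made for brevity; a least-squares fit would be less sensitive to noise.</p>

```python
def minutes_until_limit(samples, limit=100.0):
    """Linear extrapolation from the first and last (minute, utilization %) samples.

    Returns None when utilization is flat or falling.
    """
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    slope = (u1 - u0) / (t1 - t0)  # percentage points per minute
    if slope <= 0:
        return None
    return (limit - u1) / slope

# Utilization rose from 40% to 55% over 15 minutes: moderate level, fast trend.
window = [(0, 40.0), (5, 46.0), (10, 50.0), (15, 55.0)]
print(minutes_until_limit(window))
```

<p>Here 55% utilization would pass a 70-80% threshold check, yet the trend suggests the limit is about 45 minutes away.</p>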
<h3 id="saturation-and-capacity-planning">Saturation and capacity planning</h3>
<p>Saturation metrics are the foundation of capacity planning. If your database connection pool is at 60% during normal traffic, traffic is growing 20% per month, and pool usage scales with traffic, you have roughly 3 months before you hit the limit. That's a capacity planning conversation you want to have in advance, not during an incident.</p>
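<p>The "roughly 3 months" figure can be checked with a compound-growth calculation. This sketch assumes utilization grows in proportion to traffic and that growth compounds monthly; <code>months_until_full</code> is an illustrative helper.</p>

```python
import math

def months_until_full(current_utilization, monthly_growth):
    """Solve current_utilization * (1 + monthly_growth) ** n = 1 for n."""
    return math.log(1 / current_utilization) / math.log(1 + monthly_growth)

months = months_until_full(current_utilization=0.60, monthly_growth=0.20)
print(f"{months:.1f} months")
```

<p>The result is a little under 3 months, which leaves less slack than the round number suggests.</p>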
<hr>
<h2 id="using-all-four-signals-together">Using All Four Signals Together</h2>
<p>The real value of the golden signals framework is how they interact. A single signal in isolation is often ambiguous. Two or three signals together tell a story.</p>
<h3 id="latency-up-errors-up-saturation-up-traffic-stable">Latency up, errors up, saturation up - traffic stable</h3>
<p>Classic overload situation - the system is being pushed beyond its capacity without a traffic increase. Likely cause: a slow query or inefficient code path consuming more resources than expected. Could also indicate a partial failure where some backend nodes are down, concentrating load on the survivors.</p>
<h3 id="errors-up-latency-down-traffic-and-saturation-normal">Errors up, latency down - traffic and saturation normal</h3>
<p>Failing fast. The service is returning errors quickly (low latency for failed requests) without consuming significant resources. Likely cause: upstream dependency is down, configuration error, or application-level exception before any substantial work is done.</p>
<h3 id="traffic-spike-saturation-up-latency-increasing-errors-stable">Traffic spike, saturation up, latency increasing, errors stable</h3>
<p>Normal overload from a traffic event - your system is handling load but degrading gracefully. Watch for errors to follow if saturation continues to increase.</p>
<h3 id="traffic-drop-errors-stable-latency-stable">Traffic drop, errors stable, latency stable</h3>
<p>Could be a problem with how traffic reaches your service (routing issue, upstream failure) rather than your service itself. Check your load balancer, CDN, or any upstream service.</p>
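<p>The four patterns above can be written down as a small lookup table, which is sometimes handy for annotating alerts. This is only a sketch: the trend labels (<code>"up"</code>, <code>"down"</code>, <code>"stable"</code>) and the <code>diagnose</code> helper are assumptions, and real incidents rarely match a pattern this cleanly.</p>

```python
# Each entry: (signal trends that must match, short diagnosis).
PATTERNS = [
    ({"latency": "up", "errors": "up", "saturation": "up", "traffic": "stable"},
     "overload without a traffic increase"),
    ({"errors": "up", "latency": "down", "traffic": "stable", "saturation": "stable"},
     "failing fast"),
    ({"traffic": "up", "saturation": "up", "latency": "up", "errors": "stable"},
     "load-driven degradation"),
    ({"traffic": "down", "errors": "stable", "latency": "stable"},
     "traffic not reaching the service"),
]

def diagnose(observed):
    """Return the first diagnosis whose required trends all match."""
    for pattern, label in PATTERNS:
        if all(observed.get(signal) == trend for signal, trend in pattern.items()):
            return label
    return "no known pattern"

print(diagnose({"latency": "down", "errors": "up",
                "traffic": "stable", "saturation": "stable"}))
```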
<hr>
<h2 id="golden-signals-and-on-call-context">Golden Signals and On-Call Context</h2>
<p>Four golden signals give you the foundation for meaningful alerting. But a signal firing at 3am is only useful if the on-call engineer can quickly understand what it means.</p>
<p>"API error rate above threshold" is a signal. "API error rate is 8% and rising, up from baseline 0.2% - this started 12 minutes ago, correlates with a deployment to the payments service 15 minutes ago, downstream services affected: checkout and order history" is context.</p>
<p>The gap between signal and context is where most MTTR lives. Gathering that context - which service, what changed, what's downstream - typically takes 20-30 minutes of manual investigation across Datadog, GitHub, PagerDuty, and Slack.</p>
<p>OpsBrief consolidates that context automatically when an incident fires. The dependency graph shows which services are affected. Recent deployments from GitHub are surfaced alongside the metrics. Runbooks are linked. The on-call engineer sees the full picture in 2-3 minutes instead of spending the first half of every incident gathering it.</p>
<p>Golden signals get you to the right alert. Operations intelligence gets you to the right answer.</p>
<hr>
<h2 id="beyond-the-four-golden-signals">Beyond the Four Golden Signals</h2>
<p>The golden signals are a starting point, not a ceiling. Once you have solid coverage of the four core signals, there are useful additions depending on your service type:</p>
<p><strong>Availability</strong> - direct measurement of request success rate, usually expressed as a percentage. This is often tracked separately from error rate for SLO purposes.</p>
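<p><strong>Apdex score</strong> - a standardized satisfaction score that buckets requests as satisfied, tolerating, or frustrated based on response time against a target threshold, then reduces them to a single number between 0 and 1. Some tools also count failed requests as frustrated. Useful for stakeholder reporting.</p>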
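<p>For reference, the standard Apdex calculation counts requests at or under a target time T as satisfied, requests up to 4T as tolerating (worth half), and everything slower as frustrated. A minimal sketch; the sample latencies and the 0.5-second target are assumptions.</p>

```python
def apdex(latencies_s, target_s):
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    satisfied = sum(1 for s in latencies_s if s <= target_s)
    tolerating = sum(1 for s in latencies_s if target_s < s <= 4 * target_s)
    return (satisfied + tolerating / 2) / len(latencies_s)

# 60 fast requests, 30 tolerable ones, 10 frustrating ones, with T = 0.5s.
samples = [0.2] * 60 + [1.0] * 30 + [5.0] * 10
print(apdex(samples, target_s=0.5))
```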
<p><strong>Dependency health</strong> - the golden signals of your dependencies, not just your own service. If a database you depend on is saturated, your service's latency will increase before your own saturation metrics show a problem.</p>
<p><strong>Business metrics</strong> - sometimes the most important signal is not a technical one. Checkout conversion rate, session completion rate, or active users can catch problems that technical metrics miss (especially implicit errors).</p>
<hr>
<h2 id="getting-started-with-golden-signals">Getting Started With Golden Signals</h2>
<p>If your team is starting from scratch:</p>
<p><strong>Week 1:</strong> Instrument latency and error rate for your two most critical user-facing services. These two signals catch the majority of incidents.</p>
<p><strong>Week 2:</strong> Add traffic tracking. Use it as context when investigating latency and error alerts.</p>
<p><strong>Week 3:</strong> Identify the saturation metric most relevant to each service. Add saturation dashboards before alerting.</p>
<p><strong>Month 2:</strong> Define alert thresholds tied to your SLOs, not arbitrary numbers. A latency alert should fire when you're trending toward SLO violation, not when it crosses some abstract threshold.</p>
<p><strong>Ongoing:</strong> Review alert quality monthly. What fired that required no action? What didn't fire when something was wrong? Tune accordingly.</p>
<p>The goal is not maximum coverage. It's maximum signal-to-noise ratio - every alert that fires should require action.</p>
<hr>
<p><em>If your team has golden signals instrumented but on-call engineers are still spending 20+ minutes gathering context before diagnosis, <a href="https://opsbrief.io">OpsBrief</a> pulls your Datadog metrics, GitHub deployments, and runbooks into a single incident view - so the signal gets turned into action faster.</em></p>
