SLA vs SLO vs SLI: The Complete Breakdown for Reliable Systems
Three acronyms used interchangeably, rarely defined precisely. SLIs are measurements. SLOs are targets. SLAs are contracts with consequences. Getting the hierarchy right changes how your team talks about reliability - and how you make deployment decisions at 2am.
Jake Davids

Three acronyms that show up in every SRE conversation, often used interchangeably, rarely defined precisely. That imprecision matters — organizations that confuse SLAs, SLOs, and SLIs end up with reliability targets that nobody believes, contracts that don't reflect reality, and alert thresholds set by guesswork.
This guide covers what each one actually means, how they relate to each other, and how to implement them in a way that changes how your team talks about reliability.
The Short Version
- SLI (Service Level Indicator) — a measurement. What you're measuring.
- SLO (Service Level Objective) — a target. What you're aiming for.
- SLA (Service Level Agreement) — a contract. What you've promised, with consequences.
They form a hierarchy: SLIs are the raw numbers, SLOs are the internal targets you set based on SLIs, and SLAs are the external commitments you make based on SLOs.
If your SLI says your API availability is 99.94%, your SLO might be 99.9%, and your SLA might promise 99.5%. There's intentional headroom at each layer.
SLI: Service Level Indicator
An SLI is a quantitative measurement of some aspect of your service's behavior. It's a number — a ratio, a percentile, a count — that tells you something meaningful about how well the service is performing.
What Makes a Good SLI
Good SLIs measure things that correlate with user experience, not just internal system metrics. CPU utilization is not an SLI. Request latency is. Memory pressure is not an SLI (usually). Error rate is.
Google's SRE book describes four golden signals for monitoring (latency, traffic, errors, and saturation), which make a good foundation for SLIs. A practical starting set derived from them:
- Availability: What fraction of requests were successful?
- Latency: How long did successful requests take?
- Error rate: What fraction of requests failed?
- Throughput/Saturation: How loaded is the system?
For each service, you want SLIs that answer: "Is this service working well for the people using it?"
SLI Examples
For a web API:
- Availability SLI: (successful requests / total requests) × 100
- Latency SLI: p99 response time in milliseconds
- Error SLI: (5xx responses / total responses) × 100
For a data pipeline:
- Freshness SLI: percentage of data processed within the target window
- Completeness SLI: percentage of expected records processed
For a batch job:
- Execution SLI: percentage of scheduled runs that completed successfully
- Duration SLI: percentage of runs that completed within the time window
The most useful SLIs are the ones that would tell you if a real user had a bad experience.
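The percentile and "percentage within a window" SLIs listed above can be sketched in a few lines of Python. This is a minimal illustration rather than a monitoring implementation: the function names, the choice of the nearest-rank percentile method, and the sample numbers are all assumptions made for the example.

```python
def percentile_nearest_rank(values: list[float], p: int) -> float:
    """Return the p-th percentile (1-100) using the nearest-rank method."""
    if not values:
        raise ValueError("no samples in window")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    # Integer form of ceil(p * n / 100), which avoids float rounding surprises.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def percent_within(values: list[float], limit: float) -> float:
    """Percentage of values at or below a limit (freshness or duration SLIs)."""
    if not values:
        raise ValueError("no samples in window")
    return sum(v <= limit for v in values) / len(values) * 100


# Hypothetical data: 100 response times of 1..100 ms.
latencies_ms = [float(ms) for ms in range(1, 101)]
p99_latency = percentile_nearest_rank(latencies_ms, 99)

# Hypothetical pipeline delays in minutes, with a 60-minute freshness target.
delays_min = [5.0, 12.0, 45.0, 70.0]
freshness = percent_within(delays_min, 60.0)

print(p99_latency)  # 99.0
print(freshness)    # 75.0
```

In practice these numbers usually come from your metrics backend rather than raw lists, but the definitions stay the same.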
Calculating SLIs
SLIs are usually calculated over a rolling window. 28 days is a common choice because it always contains the same number of each weekday. For availability:
Availability SLI = (Good requests / Total requests) × 100
Where "good" means the request succeeded within an acceptable latency threshold. A 200 response that took 10 seconds may not be "good" depending on your definition.
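As a rough sketch of that definition, the snippet below counts a request as good only if it avoided a server error and finished within a latency threshold. The `Request` type, the 1000 ms default threshold, and the choice to treat only 5xx responses as failures are assumptions for illustration; your own definition of "good" may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to respond, in milliseconds


def is_good(req: Request, latency_threshold_ms: float = 1000.0) -> bool:
    """Good = no server-side failure and fast enough (assumed definition)."""
    return req.status < 500 and req.latency_ms <= latency_threshold_ms


def availability_sli(requests: list[Request],
                     latency_threshold_ms: float = 1000.0) -> float:
    """Good requests as a percentage of all requests in the window."""
    if not requests:
        raise ValueError("no requests in window; SLI is undefined")
    good = sum(is_good(r, latency_threshold_ms) for r in requests)
    return good / len(requests) * 100


sample = [
    Request(200, 120.0),
    Request(200, 10_000.0),  # succeeded, but too slow to count as good
    Request(503, 40.0),      # server error
    Request(200, 300.0),
]
print(availability_sli(sample))  # 50.0
```

Note how the 10-second 200 response lowers the SLI even though the server reported success.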
SLO: Service Level Objective
An SLO is the internal target you set for an SLI. It answers: how well does this service need to perform?
SLO = target value for an SLI, measured over a time window.
If your availability SLI is 99.94% this month, and your SLO is 99.9% over a rolling 28-day window, you're meeting your SLO. If availability drops to 99.85%, you're violating it.
Why SLOs Matter
Before SLOs, reliability conversations are unanchored. "The system should be reliable" doesn't tell you when to deploy, when to slow down, when to escalate, or how to prioritize reliability work versus features.
SLOs change that:
Deployment decisions become quantitative. If most of the window's error budget is still unspent, deploy aggressively. If only a small fraction is left, slow down and stabilize.
Reliability investment gets objective justification. "We've violated our SLO 3 of the last 4 months" is a concrete argument for investment. "We feel like things aren't reliable enough" isn't.
On-call urgency is calibrated. Not every alert is a 3am emergency. SLOs help engineers understand when something genuinely threatens reliability and when it's noise.
Setting SLOs
The most common mistake: setting SLOs too high without data to support them.
If your system has been running at 99.6% availability for the past year, setting an SLO of 99.99% is not aspirational — it's fictional. Nobody believes it, nobody acts on it, and it makes the whole framework feel like theater.
Start by measuring your actual SLI performance over the last 90 days. Set your initial SLO slightly below that number. Then tighten it as you improve reliability.
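One way to turn that advice into a starting number is sketched below. The 0.1 percentage point margin is an arbitrary assumption, as are the function names. "Slightly below" is a judgment call for your team, not something a formula decides.

```python
def suggest_initial_slo(measured_percent: float,
                        margin_points: float = 0.1) -> float:
    """Propose an initial SLO a small margin below measured performance."""
    if not 0 <= measured_percent <= 100:
        raise ValueError("measured_percent must be between 0 and 100")
    return round(max(measured_percent - margin_points, 0.0), 3)


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli_percent >= slo_percent


# Hypothetical: the service measured 99.6% over the last 90 days.
initial_slo = suggest_initial_slo(99.6)
print(initial_slo)               # 99.5

# The figures used earlier in this guide, against a 99.9% target.
print(meets_slo(99.94, 99.9))    # True
print(meets_slo(99.85, 99.9))    # False
```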
Practical SLO ranges by service type:
| Service Type | Typical Availability SLO |
|---|---|
| Core user-facing API | 99.9% – 99.95% |
| Internal API (non-critical path) | 99.5% – 99.9% |
| Background jobs / batch | 99% – 99.5% |
| Best-effort / experimental | 95% – 99% |
Error Budgets
An error budget is the complement of your SLO (100% minus the target): the acceptable amount of unreliability over a time window.
A 99.9% availability SLO over 30 days = 0.1% of 30 days = 43.2 minutes of acceptable downtime.
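The same arithmetic works for any target and window. The helper below is a small sketch (the function name is made up for this example) that reproduces the 43.2-minute figure and shows how quickly the budget shrinks as the target tightens.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of full downtime allowed by an availability SLO over a window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


for slo in (99.5, 99.9, 99.99):
    # Rounded for display; the raw value carries small float error.
    print(slo, round(error_budget_minutes(slo, 30), 2))
# 99.5 216.0
# 99.9 43.2
# 99.99 4.32
```

Each extra nine cuts the budget by a factor of ten, which is why a jump from 99.9% to 99.99% is a much bigger commitment than it looks.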
Error budgets make the development-reliability tradeoff explicit:
- Error budget healthy: Ship features aggressively. Risk is acceptable.
- Error budget at 50%: Monitor closely. Be careful with deployments.
- Error budget nearly exhausted: Freeze non-critical deployments. Focus on reliability work.
- Error budget exceeded: Postmortem required. No new features until reliability is restored.
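A policy like this is easy to encode so that dashboards and deploy tooling apply it consistently. The sketch below is illustrative only: the 80% cut-off for "nearly exhausted" is an assumption (the list above does not fix a number), and the function names and action strings are hypothetical.

```python
def budget_consumed_percent(downtime_minutes: float,
                            budget_minutes: float) -> float:
    """Share of the error budget already spent, as a percentage."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return downtime_minutes / budget_minutes * 100


def policy_action(consumed_percent: float) -> str:
    """Map budget consumption to an action (thresholds are assumptions)."""
    if consumed_percent >= 100:
        return "exceeded: postmortem required, no new features"
    if consumed_percent >= 80:
        return "nearly exhausted: freeze non-critical deployments"
    if consumed_percent >= 50:
        return "caution: monitor closely, deploy carefully"
    return "healthy: ship features"


# Hypothetical month with a 43.2-minute budget (99.9% over 30 days).
for downtime in (10.0, 25.0, 40.0, 50.0):
    consumed = budget_consumed_percent(downtime, 43.2)
    print(f"{downtime:>4} min down -> {consumed:5.1f}% -> {policy_action(consumed)}")
```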
This is how mature SRE organizations govern the pace of development. Not "engineering wants to go fast, operations wants stability" — but "here's our shared budget for unreliability, here's how much we've spent, here's what we can afford."
SLA: Service Level Agreement
An SLA is a contract between you and your customers, defining the reliability you're promising and the consequences of not meeting it.
SLA = external commitment with consequences.
SLAs are almost always looser than SLOs. A common pattern: SLO is 99.9%, SLA promises 99.5%. The gap is your safety margin: it accounts for measurement variance and unexpected incidents, and it gives you room to violate your SLO without immediately breaching a customer contract.
SLA Consequences
SLAs typically include:
- Service credits: A percentage of monthly fees credited back when SLA is breached. Common range: 10-30% credit per percentage point of uptime below target.
- Termination rights: Customer can exit the contract if SLA is consistently missed.
- Exclusions: Planned maintenance windows, force majeure events, failures caused by customer misuse.
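Service credit terms can be computed mechanically once the contract wording is pinned down. The sketch below assumes a hypothetical schedule: a 99.5% monthly uptime target, a 10% credit for each started percentage point below it, and a 30% cap. Whether partial points count is exactly the kind of detail a real contract has to spell out, so treat the rounding rule here as an assumption.

```python
import math


def service_credit_percent(uptime_percent: float,
                           target_percent: float = 99.5,
                           credit_per_point: float = 10.0,
                           cap_percent: float = 30.0) -> float:
    """Credit owed as a percentage of monthly fees (assumed schedule)."""
    shortfall = target_percent - uptime_percent
    if shortfall <= 0:
        return 0.0  # target met, no credit
    # Assumption: every started percentage point below target earns credit.
    points = math.ceil(shortfall)
    return min(points * credit_per_point, cap_percent)


for uptime in (99.7, 99.2, 97.5, 95.0):
    print(uptime, service_credit_percent(uptime))
# 99.7 0.0
# 99.2 10.0
# 97.5 20.0
# 95.0 30.0
```

Excluded downtime, such as announced maintenance windows, would need to be removed from the uptime figure before calling this function.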
SLA Examples
Basic availability SLA:
"Provider will maintain 99.5% monthly uptime. For each 1% below 99.5%, Customer receives a 10% service credit, up to 30% of monthly fees. Excludes scheduled maintenance windows communicated 48 hours in advance."
API performance SLA:
"95% of API requests will return a response within 500ms, measured as a monthly rolling average."
How SLI, SLO, and SLA Work Together
The relationship is hierarchical and intentional:
SLI: Your API availability is 99.94% this month
↓ (measured against)
SLO: Internal target is 99.9% — you're meeting it
↓ (informs)
SLA: You promise customers 99.5% — significant headroom
The gaps at each level are intentional:
- SLO tighter than SLA gives early warning before breaching customer commitments
- SLI measurement better than SLO accounts for normal variance
When teams set these without gaps — SLA = SLO = SLI target — any real incident immediately breaches the customer contract. That's the wrong design.
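A simple sanity check can catch a missing gap before it reaches a contract. The function below is a hypothetical helper, not part of any tool: it takes the three percentages and reports which of the intended relationships do not hold.

```python
def headroom_problems(sli_percent: float,
                      slo_percent: float,
                      sla_percent: float) -> list[str]:
    """Return a list of layering problems; an empty list means gaps exist."""
    problems = []
    if sla_percent >= slo_percent:
        problems.append(
            "SLA is not looser than SLO: no early warning before a breach"
        )
    if sli_percent < slo_percent:
        problems.append("SLI is below SLO: the objective is being violated")
    return problems


# The figures from the diagram above: all relationships hold.
print(headroom_problems(99.94, 99.9, 99.5))  # []

# Hypothetical misconfiguration where the SLA equals the SLO.
print(headroom_problems(99.94, 99.9, 99.9))
```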
SLA vs KPI: The Difference
A KPI (Key Performance Indicator) is an internal business metric — revenue, growth rate, customer acquisition cost. KPIs measure business health.
An SLA is a contractual reliability commitment — uptime, response time, support turnaround. SLAs measure service quality promises.
They interact — SLA breaches affect customer satisfaction, which affects KPIs like churn and NPS. But they're fundamentally different in scope and consequences. Missing a KPI is a business problem. Breaching an SLA is a legal and contractual problem.
Some teams use "SLA" loosely to mean any internal performance target. If it's not a customer-facing contract with consequences, it's an SLO, not an SLA.
Common Mistakes in Implementation
Setting targets without data. Measure what you have before setting targets. Aspirational SLOs that don't reflect current performance breed cynicism.
Too many SLIs. Five well-chosen SLIs per service are more useful than fifty. If you can't act on the signal, you don't need the measurement.
No error budget policy. SLOs without an error budget policy are just numbers. Define what happens at 50%, 80%, and 100% consumption before you need to invoke it.
SLAs tighter than SLOs. This means you can breach a customer contract while still meeting your internal target, so the SLO gives no early warning. Build in headroom.
No review cadence. SLOs should be reviewed quarterly. Services change, traffic patterns change, user expectations change.
SLOs and On-Call
SLOs directly affect on-call experience in ways most teams don't fully account for.
Alert thresholds should be set relative to SLOs. If your SLO is 99.9% availability, an alert at 99.95% gives warning before violation. An alert at 99.0% fires only after a significant breach has already occurred.
Error budget burn rate determines urgency. Fast burn rate (consuming the month's budget in hours) is a P1. Slow burn (over weeks) is worth investigating but not a 3am page. Teams that implement SLO-based alerting page less and catch real problems more reliably.
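Burn rate is commonly defined as the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the window. The sketch below uses that definition. The 24-hour paging cut-off and the function names are assumptions for illustration, not recommended thresholds.

```python
import math


def burn_rate(error_rate_percent: float, slo_percent: float) -> float:
    """Observed error rate relative to the rate the SLO allows."""
    allowed_percent = 100 - slo_percent
    if allowed_percent <= 0:
        raise ValueError("slo_percent must be below 100")
    return error_rate_percent / allowed_percent


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is spent if the current rate continues."""
    if rate <= 0:
        return math.inf
    return window_days * 24 / rate


def urgency(rate: float, window_days: int = 30,
            page_within_hours: float = 24.0) -> str:
    """Page for fast burns, open a ticket for slow ones (assumed cut-off)."""
    if hours_to_exhaustion(rate, window_days) <= page_within_hours:
        return "page"
    return "ticket"


# Hypothetical error rates against a 99.9% SLO over 30 days.
for error_rate in (5.0, 0.2):
    rate = burn_rate(error_rate, 99.9)
    hours = hours_to_exhaustion(rate)
    print(f"{error_rate}% errors: burn rate {rate:.0f}, "
          f"budget gone in {hours:.1f} h -> {urgency(rate)}")
```

A 5% error rate burns the month's budget in roughly 14 hours and pages someone; a 0.2% error rate takes about two weeks and can wait for business hours.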
On-call context needs SLO data. When an engineer is paged at 2am, knowing that the incident has consumed 60% of the month's error budget in 20 minutes changes how urgently they escalate, what they communicate to stakeholders, and how aggressively they prioritize resolution.
This is where context matters. OpsBrief surfaces SLO status alongside metrics, deployments, and runbooks when an incident opens — so the on-call engineer has a complete picture of what the incident means, not just what the numbers show.
Getting Started
Week 1: Instrument availability and latency for your two most important services. This gives you SLI data.
Week 2: Calculate your actual availability over the last 90 days, or over whatever history you have if the instrumentation is new. Set initial SLOs slightly below actual performance.
Week 3: Define your error budget policy — what happens at 50%, 80%, 100% consumption?
Month 2: Review and calibrate. Are the SLOs meaningful? Are engineers using them to make decisions?
Quarter 2: If you have customer commitments, define SLAs based on SLOs with appropriate headroom.
The goal isn't a perfect framework from day one. It's a shared vocabulary for reliability that gets better with iteration.
Implementing SLOs and want to connect reliability data to incident context? OpsBrief integrates with Datadog and PagerDuty to surface SLO context automatically during incidents — so on-call engineers know what an incident means for reliability commitments, not just what the metrics show.


