Reliability Engineering

SRE

Site Reliability Engineering

Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to infrastructure and operations, with the goal of creating scalable and reliable systems.

Origins of SRE

SRE was pioneered at Google in 2003. Ben Treynor Sloss famously defined it as:

"SRE is what happens when you ask a software engineer to design an operations team."

Core SRE Principles

1. Embrace Risk: 100% reliability is effectively impossible and prohibitively expensive to pursue. Define the acceptable level of risk through Service Level Objectives (SLOs).

2. Eliminate Toil: Automate repetitive manual work. As a rule of thumb, if a human has to do it twice, automate it.

3. Error Budgets: An error budget is the amount of unreliability an SLO permits. While budget remains, ship faster. Once it is exhausted, prioritize reliability work.

4. Monitoring & Observability: You can't fix what you can't see. Invest heavily in visibility.

5. Simplicity: Complex systems fail in complex ways. Keep things simple.

6. Change Management: Most outages are caused by changes. Manage them carefully.
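
To make the error budget idea concrete, here is a minimal sketch that converts an availability SLO into allowed downtime over a window. The function name `error_budget`, the 99.9% target, and the 30-day window are illustrative assumptions, not values prescribed by this glossary.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO permits over `window`.

    `slo_target` is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the share of the window that is allowed to be "bad".
    return window * (1.0 - slo_target)


# Example: a 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
budget = error_budget(0.999, timedelta(days=30))
print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
```

Expressing the budget as a duration keeps the trade-off visible: a stricter target shrinks the time available for risky changes.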

SRE vs DevOps vs Ops

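| Aspect   | Traditional Ops  | DevOps         | SRE                             |
|----------|------------------|----------------|---------------------------------|
| Focus    | Stability        | Collaboration  | Reliability through engineering |
| Approach | Manual, reactive | Cultural shift | Engineering practices           |
| Toil     | Accepted         | Reduced        | Eliminated                      |
| Change   | Resisted         | Embraced       | Managed via error budgets       |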
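
One way to picture "managed via error budgets" is a release gate that looks at how much budget is left before allowing a change. The sketch below is an assumption-laden illustration: the function `release_policy`, the three policy names, and the 25% caution threshold are all hypothetical, and real teams choose their own rules.

```python
def release_policy(budget_total: float, budget_spent: float,
                   caution_threshold: float = 0.25) -> str:
    """Suggest a release posture from error budget usage.

    Both budget arguments use the same unit (for example minutes of downtime).
    `caution_threshold` is a hypothetical cutoff for the remaining fraction.
    """
    if budget_total <= 0:
        raise ValueError("budget_total must be positive")
    if budget_spent < 0:
        raise ValueError("budget_spent must not be negative")

    remaining = 1.0 - budget_spent / budget_total
    if remaining <= 0:
        return "freeze"   # budget exhausted: reliability work only
    if remaining < caution_threshold:
        return "slow"     # little budget left: ship smaller, safer changes
    return "ship"         # healthy budget: normal release velocity


print(release_policy(budget_total=43.2, budget_spent=10.0))
print(release_policy(budget_total=43.2, budget_spent=40.0))
print(release_policy(budget_total=43.2, budget_spent=50.0))
```

The point of the gate is that the decision to slow down comes from an agreed number rather than from a negotiation between development and operations.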

Getting Started with SRE

1. Define SLOs for critical services
2. Measure current reliability
3. Establish error budgets
4. Identify and reduce toil
5. Build observability
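
The first three steps can be sketched with request counts alone. The example below assumes a request-based availability SLI (successful requests divided by total requests). The names `SloReport` and `evaluate_slo` and the sample numbers are hypothetical and would be replaced by data from your own monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float               # measured fraction of successful requests
    allowed_failures: float  # failures the SLO permits for this traffic volume
    budget_consumed: float   # fraction of the error budget used (1.0 = exhausted)


def evaluate_slo(total_requests: int, failed_requests: int,
                 slo_target: float) -> SloReport:
    """Compare measured reliability against an SLO for one time window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = (total_requests - failed_requests) / total_requests
    allowed_failures = total_requests * (1.0 - slo_target)
    budget_consumed = failed_requests / allowed_failures
    return SloReport(sli, allowed_failures, budget_consumed)


# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% SLO.
report = evaluate_slo(1_000_000, 400, 0.999)
print(f"SLI: {report.sli:.4%}, budget consumed: {report.budget_consumed:.0%}")
```

Starting from a single critical service and one SLI keeps the first iteration small. Toil reduction and broader observability can build on the same measurements later.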
