SRE
Site Reliability Engineering
Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to infrastructure and operations, with the goal of creating scalable and reliable systems.
Origins of SRE
SRE was pioneered at Google in 2003. Ben Treynor Sloss famously defined it as:
"SRE is what happens when you ask a software engineer to design an operations team."
Core SRE Principles
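1. Embrace Risk: 100% reliability is unattainable in practice, and chasing it costs far more than users will notice. Define acceptable risk through Service Level Objectives (SLOs).
2. Eliminate Toil: Automate repetitive manual work. If a human has to do the same task twice, it is a candidate for automation.
3. Error Budgets: The error budget is the amount of unreliability an SLO permits (1 minus the SLO target). While budget remains, ship faster. Once it is spent, focus on reliability.
4. Monitoring & Observability: You can't fix what you can't see. Invest heavily in visibility.
5. Simplicity: Complex systems fail in complex ways. Keep things simple.
6. Change Management: Most outages are caused by changes. Manage them carefully, for example with gradual rollouts and fast rollbacks.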
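The error budget principle reduces to simple arithmetic, which a short sketch can make concrete. The example below assumes a time-based availability SLO; the function names `error_budget` and `release_decision`, the 30-day window, and the 99.9% target are illustrative assumptions, not values taken from any particular team's policy.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO allows within a window.

    `slo` is a fraction such as 0.999 (hypothetical target).
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    # The budget is the share of the window that may be unavailable.
    return window * (1.0 - slo)


def release_decision(slo: float, window: timedelta, downtime: timedelta) -> str:
    """Apply the rule of thumb: budget left means ship, budget spent means stop."""
    budget = error_budget(slo, window)
    return "ship" if downtime < budget else "focus on reliability"


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed rolling window
    print("budget:", error_budget(0.999, window))
    print("decision:", release_decision(0.999, window, timedelta(minutes=10)))
```

With a 99.9% target over 30 days, the budget works out to about 43 minutes of downtime. A real policy would likely be more graded than this two-way decision, for instance slowing releases as the budget runs low rather than stopping only at zero.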
SRE vs DevOps vs Ops
| Aspect | Traditional Ops | DevOps | SRE |
|---|---|---|---|
| Focus | Stability | Collaboration | Reliability through engineering |
| Approach | Manual, reactive | Cultural shift | Engineering practices |
| Toil | Accepted | Reduced | Eliminated |
| Change | Resisted | Embraced | Managed via error budgets |
Getting Started with SRE
1. Define SLOs for critical services
2. Measure current reliability
3. Establish error budgets
4. Identify and reduce toil
5. Build observability
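As a rough illustration of the first three steps, the sketch below measures a request-based availability indicator and compares it with an SLO. It assumes request counts are already available from some monitoring source; the `SloReport` type, the `evaluate` function, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of successful requests
    allowed_failures: float  # failures the SLO permits for this traffic volume
    budget_remaining: float  # fraction of the budget left; negative if overspent


def evaluate(total_requests: int, failed_requests: int, slo: float) -> SloReport:
    """Compare measured reliability with an SLO for one measurement window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    # Step 2: measure current reliability.
    sli = (total_requests - failed_requests) / total_requests
    # Step 3: the error budget expressed as a number of failed requests.
    allowed = total_requests * (1.0 - slo)
    remaining = 1.0 - failed_requests / allowed
    return SloReport(sli=sli, allowed_failures=allowed, budget_remaining=remaining)


if __name__ == "__main__":
    # Sample numbers are made up for illustration.
    report = evaluate(total_requests=1_000_000, failed_requests=250, slo=0.999)
    print(report)
```

Counting requests rather than minutes of downtime is one common choice because it weights failures by actual traffic. Which indicator fits best depends on the service.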