RCA
Root Cause Analysis
Root Cause Analysis (RCA) is a systematic process for identifying the underlying causes of an incident, rather than just addressing symptoms.
Why RCA Matters
Without understanding WHY incidents happen, you're doomed to repeat them.
Good RCA:
- Prevents similar incidents
- Improves system reliability
- Builds organizational knowledge
- Identifies systemic issues
Blameless RCA
The most important principle: RCA should be blameless.
- Focus on systems and processes, not individuals
- People make mistakes; ask why the system allowed the mistake
- Psychological safety enables honest analysis
- Blame leads to hiding information
RCA Techniques
Five Whys: Ask "why?" repeatedly until you reach root causes.
- Why did the site go down? → Database ran out of connections
- Why? → Connection pool too small for traffic
- Why? → Never tested at this scale
- Why? → No load testing in CI/CD
- Root cause: Missing load testing in deployment pipeline
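One way to keep a Five Whys chain reviewable is to record it as data rather than free text. The sketch below is only an illustration: the `FiveWhys` class and its method names are hypothetical, not part of any standard tool, and it treats the last recorded answer as the candidate root cause.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FiveWhys:
    """Hypothetical helper: one problem statement plus a chain of answers."""
    problem: str
    answers: List[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        # Each call records the answer to the next "why?" in the chain.
        self.answers.append(answer)

    def root_cause(self) -> str:
        # Assumption: the deepest answer recorded so far is the candidate root cause.
        if not self.answers:
            raise ValueError("no answers recorded yet")
        return self.answers[-1]

    def render(self) -> str:
        # Indent each level so the causal chain is easy to read in a report.
        lines = [f"Problem: {self.problem}"]
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"{'  ' * depth}Why? → {answer}")
        return "\n".join(lines)


# The chain from the example above.
analysis = FiveWhys("Site went down")
analysis.ask_why("Database ran out of connections")
analysis.ask_why("Connection pool too small for traffic")
analysis.ask_why("Never tested at this scale")
analysis.ask_why("No load testing in CI/CD")

print(analysis.render())
print("Candidate root cause:", analysis.root_cause())
```

Storing the chain this way makes it easy to attach to the incident record and to revisit later if the team decides another "why?" is warranted.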
Fishbone Diagram: Group potential causes into categories such as People, Process, Technology, and Environment.
Timeline Analysis: Reconstruct the incident chronologically to identify contributing factors.
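As a rough sketch of timeline reconstruction, the snippet below merges events from different sources, sorts them chronologically, and finds the longest gap between consecutive events, which is often a useful place to start asking questions. All timestamps, sources, and descriptions here are invented for illustration, and `build_timeline` and `largest_gap` are hypothetical helper names.

```python
from datetime import datetime

# Hypothetical events collected from different sources, deliberately out of order.
raw_events = [
    ("2024-01-15T14:12:00", "monitoring", "Error-rate alert fired"),
    ("2024-01-15T14:02:00", "deploy log", "New release deployed"),
    ("2024-01-15T14:40:00", "chat", "Connection pool size raised"),
    ("2024-01-15T14:05:00", "database", "Connection pool exhausted"),
    ("2024-01-15T14:31:00", "pager", "On-call engineer acknowledged"),
]


def build_timeline(events):
    """Parse ISO 8601 timestamps and return events in chronological order."""
    parsed = [(datetime.fromisoformat(ts), source, desc) for ts, source, desc in events]
    return sorted(parsed, key=lambda event: event[0])


def largest_gap(timeline):
    """Return (gap, earlier_event, later_event) for the longest quiet period."""
    gaps = [
        (later[0] - earlier[0], earlier, later)
        for earlier, later in zip(timeline, timeline[1:])
    ]
    return max(gaps, key=lambda gap: gap[0])


timeline = build_timeline(raw_events)
for when, source, description in timeline:
    print(f"{when:%H:%M}  [{source}] {description}")

gap, before, after = largest_gap(timeline)
print(f"Longest gap: {gap} between '{before[2]}' and '{after[2]}'")
```

In this made-up data the longest gap falls between the alert firing and the acknowledgement, which would point the discussion toward alert routing or paging rather than toward any individual responder.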
RCA Best Practices
1. Conduct within 48 hours - Memory fades quickly
2. Include all responders - Different perspectives reveal more
3. Document thoroughly - Create lasting organizational knowledge
4. Identify action items - RCA without follow-up is useless
5. Share widely - Help other teams learn too