Post-Mortems: Learning from Incidents
How to run blameless retrospectives that actually drive improvement.
What is a Post-Mortem?
A post-mortem (also called incident retrospective or incident review) is a structured meeting and document that analyzes an incident after it's resolved. The goal is to understand what happened, why it happened, and how to prevent similar incidents in the future.
Post-mortems are where organizations learn from failure. Without them, the same incidents tend to recur. With good post-mortems, each incident becomes a chance to make your systems more resilient.
The Blameless Culture
Blameless post-mortems are foundational to effective incident learning. The principle is simple: focus on systems and processes, not individuals.
The Blameless Mindset
- People made the best decisions they could with the information they had
- If someone made a "mistake," the system allowed that mistake to cause harm
- Our job is to make systems that are resilient to human error
- Blame creates fear. Fear hides information. Hidden information causes incidents.
Common Anti-Patterns
- "Who broke prod?": Focus on what, not who
- "More careful next time": Not an action item. What changes?
- "Human error" as root cause: Dig deeper. Why was the error possible?
- Skipping post-mortems for "small" incidents: Small incidents reveal big risks
When to Run Post-Mortems
Not every incident needs a full post-mortem meeting, but every significant incident should be documented. Common triggers:
- SEV0 or SEV1 incidents: Always
- SEV2 incidents: Usually, especially if novel
- Customer-impacting incidents: Always, regardless of severity
- Near-misses: Often the most valuable to analyze
- Repeat incidents: Why didn't the last fix work?
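The triggers above can be encoded as a small triage helper so the decision is applied consistently. This is a minimal sketch, not a standard: the `Incident` fields and the "required" / "recommended" / "optional" labels are assumptions made for illustration, and the mapping from each trigger to a label is one reasonable reading of the list.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    severity: int                    # 0 for SEV0, 1 for SEV1, 2 for SEV2, ...
    customer_impacting: bool = False
    near_miss: bool = False
    repeat: bool = False


def postmortem_recommendation(incident: Incident) -> str:
    """Return "required", "recommended", or "optional" for an incident."""
    # SEV0/SEV1 and customer-impacting incidents: always.
    if incident.severity <= 1 or incident.customer_impacting:
        return "required"
    # Near-misses and repeat incidents are often the most instructive.
    if incident.near_miss or incident.repeat:
        return "recommended"
    # SEV2: usually.
    if incident.severity == 2:
        return "recommended"
    return "optional"


print(postmortem_recommendation(Incident(severity=3, customer_impacting=True)))
```

Anything that comes back "optional" can still get a short written note rather than a full meeting.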
Post-Mortem Template
Use this template as a starting point. Adapt it to your organization's needs.
1. Incident Summary
Date/Time: [When did it happen?]
Duration: [How long until resolved?]
Severity: [SEV level]
Impact: [Who/what was affected?]
TL;DR: [1-2 sentence summary]
2. Timeline
Chronological list of events with timestamps:
14:32 - Monitoring alert fires for high error rate
14:35 - On-call acknowledges alert
14:42 - Root cause identified as bad deployment
14:45 - Rollback initiated
14:52 - Service restored
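A timeline in this shape is regular enough to compute response metrics from directly. The sketch below parses the example timeline above and derives time to acknowledge, time to diagnose, and time to restore. The function names are hypothetical, and it assumes `HH:MM - description` lines that all fall on the same day.

```python
from datetime import datetime

TIMELINE = """\
14:32 - Monitoring alert fires for high error rate
14:35 - On-call acknowledges alert
14:42 - Root cause identified as bad deployment
14:45 - Rollback initiated
14:52 - Service restored
"""


def parse_timeline(text):
    """Turn 'HH:MM - description' lines into (datetime, description) pairs."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stamp, _, description = line.partition(" - ")
        events.append((datetime.strptime(stamp, "%H:%M"), description))
    return events


def minutes_between(events, start_index, end_index):
    """Whole minutes elapsed between two events in the timeline."""
    delta = events[end_index][0] - events[start_index][0]
    return int(delta.total_seconds() // 60)


events = parse_timeline(TIMELINE)
print("Time to acknowledge:", minutes_between(events, 0, 1), "min")   # 3 min
print("Time to diagnose:", minutes_between(events, 0, 2), "min")      # 10 min
print("Time to restore:", minutes_between(events, 0, -1), "min")      # 20 min
```

Tracking these numbers across incidents shows whether detection, diagnosis, or remediation is the slowest stage.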
3. Root Cause Analysis
What caused this incident? Use the "5 Whys" technique:
- Why did users see errors? → The API returned 500s
- Why did the API return 500s? → Database connection failures
- Why did DB connections fail? → The connection pool was exhausted
- Why was the pool exhausted? → A new code path had a connection leak
- Why wasn't this caught? → No integration tests for this path
4. Contributing Factors
What made this incident worse or harder to resolve?
5. What Went Well
What worked during response? Celebrate the wins.
6. What Could Be Improved
Where did we struggle? What slowed us down?
7. Action Items
Specific, assignable tasks with owners and due dates:
☐ [P1] Add connection leak detection - @alice - Due 2/15
☐ [P1] Write integration tests for payment path - @bob - Due 2/20
☐ [P2] Add runbook for DB connection issues - @charlie - Due 2/28
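Action items written in the format above are easy to parse, which makes follow-up checks scriptable. The sketch below reads those lines into records and lists the ones that are past due. The `ActionItem` class, the regular expression, and the `year` parameter are assumptions for illustration; the example dates omit a year, so the caller has to supply one.

```python
import re
from dataclasses import dataclass
from datetime import date

# Matches lines such as "[P1] Add connection leak detection - @alice - Due 2/15".
ITEM_PATTERN = re.compile(
    r"^\[(?P<priority>P\d)\]\s+(?P<task>.+?)\s+-\s+@(?P<owner>\w+)"
    r"\s+-\s+Due\s+(?P<month>\d{1,2})/(?P<day>\d{1,2})$"
)


@dataclass
class ActionItem:
    priority: str
    task: str
    owner: str
    due: date
    done: bool = False


def parse_action_item(line, year):
    """Parse one action item line; `year` fills in the missing year."""
    text = line.strip().lstrip("☐ ").strip()
    match = ITEM_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unrecognized action item: {line!r}")
    due = date(year, int(match["month"]), int(match["day"]))
    return ActionItem(match["priority"], match["task"], match["owner"], due)


def overdue(items, today):
    """Return open items whose due date is before `today`."""
    return [item for item in items if not item.done and item.due < today]


lines = [
    "☐ [P1] Add connection leak detection - @alice - Due 2/15",
    "☐ [P1] Write integration tests for payment path - @bob - Due 2/20",
    "☐ [P2] Add runbook for DB connection issues - @charlie - Due 2/28",
]
items = [parse_action_item(line, year=2025) for line in lines]
for item in overdue(items, today=date(2025, 2, 21)):
    print(f"OVERDUE [{item.priority}] {item.task} (@{item.owner})")
```

In practice the items would usually live in an issue tracker; the point is only that each one carries a priority, a single owner, and a date that a script can check.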
Running the Post-Mortem Meeting
Before the Meeting
- Schedule within 2-5 days of incident resolution, while memories are still fresh
- Fill in the timeline and basic facts beforehand
- Invite all responders plus relevant stakeholders
- Assign a facilitator (often not the incident commander)
During the Meeting
- Set the tone: Remind everyone it's blameless
- Walk through timeline: Fill in gaps, correct errors
- Discuss root cause: Use "5 Whys" together
- Identify improvements: What would prevent recurrence?
- Assign action items: Specific owners and deadlines
After the Meeting
- Publish the post-mortem document (internally or publicly)
- Track action items to completion
- Share learnings with broader team
- Review action item completion in team meetings
Making Action Items Stick
The most common post-mortem failure is action items that never get done. To prevent this:
Do:
- Make action items specific and measurable
- Assign a single owner to each item
- Set realistic deadlines
- Prioritize ruthlessly (P1, P2, P3)
- Track completion in a shared system
- Review completion weekly
Don't:
- Create vague action items ("improve monitoring")
- Assign to "the team" (no one owns it)
- Create 20 action items (nothing gets done)
- Skip the deadline
- Let action items languish in a backlog
- Forget to follow up
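Several of the points above can be checked mechanically before a post-mortem is published. The following is a rough sketch of such a check, assuming each action item is a dict with `task`, `owner`, `due`, and `priority` keys. The item cap, the list of vague prefixes, and the set of "team" owners are hypothetical values, and the vagueness test is a crude heuristic rather than a real measure of specificity.

```python
from datetime import date

# Hypothetical thresholds and word lists; tune them for your team.
MAX_ITEMS = 10
VAGUE_PREFIXES = ("improve", "be more careful", "look into")
TEAM_OWNERS = {"team", "the team", "everyone"}
PRIORITIES = {"P1", "P2", "P3"}


def lint_action_items(items):
    """Return a list of human-readable problems found in `items`."""
    problems = []
    if len(items) > MAX_ITEMS:
        problems.append(f"{len(items)} action items; trim to {MAX_ITEMS} or fewer")
    for item in items:
        task = item.get("task") or ""
        owner = (item.get("owner") or "").strip().lower()
        # Crude vagueness check: a known vague opener or fewer than three words.
        if task.lower().startswith(VAGUE_PREFIXES) or len(task.split()) < 3:
            problems.append(f"{task!r}: too vague, name the concrete change")
        if not owner or owner in TEAM_OWNERS:
            problems.append(f"{task!r}: needs a single named owner")
        if not isinstance(item.get("due"), date):
            problems.append(f"{task!r}: missing a due date")
        if item.get("priority") not in PRIORITIES:
            problems.append(f"{task!r}: missing a priority (P1, P2, P3)")
    return problems


draft = [
    {"task": "Add connection leak detection", "owner": "alice",
     "due": date(2025, 2, 15), "priority": "P1"},
    {"task": "Improve monitoring", "owner": "the team",
     "due": None, "priority": None},
]
for problem in lint_action_items(draft):
    print(problem)
```

A check like this catches items with no owner or date; it cannot tell whether the work is the right work, which still needs the weekly review.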
Advanced Topics
Public Post-Mortems
Some organizations publish post-mortems externally. This builds trust with customers and contributes to industry learning. Examples include Cloudflare, GitHub, and Google.
Blameful Environments
If your organization isn't ready for blameless culture, start small. Run blameless post-mortems within your team. Demonstrate the value. Culture change takes time.
Post-Mortem Fatigue
Too many post-mortems can be as bad as too few. If you're running multiple per week, consider batching similar incidents or doing lightweight written reviews for minor issues.