INCIDENT RESPONSE RUNBOOKS
Learn how to write incident response runbooks that actually work. Includes templates, examples, common mistakes, and how to make runbooks your team will actually use.
Andrea Brown

How to Write Incident Response Runbooks That Actually Work
Most incident response runbooks are useless.
They sit in a wiki nobody reads. They're written by someone who left the company. They're so generic they don't actually help. Or they're so specific they break when anything changes.
When 2 AM hits and your service is down, engineers don't have time to read 10-page documents. They need clear steps that work right now.
This guide shows you how to write runbooks that your team will actually use during incidents.
Why Runbooks Matter
Well-written runbooks can reduce MTTR by roughly 20-40%. Here's why:
Without a runbook:
Service X fails at 2 AM
Engineer wakes up
Engineer thinks: "What do I do first?"
Engineer checks Slack history for similar incidents
Engineer checks logs to understand the problem
Engineer googles the error message
Engineer tries different things (wrong thing, then right thing)
Engineer eventually fixes it (after 45 minutes)
With a good runbook:
Service X fails at 2 AM
Engineer wakes up
Engineer opens runbook
Engineer follows steps 1-5 in order
Engineer identifies root cause in step 3
Engineer executes fix in step 4
Engineer verifies in step 5
Service is back up (15 minutes)
Time saved: 30 minutes per incident
For a team of 25 with 50 incidents per month:
- Time saved: 50 × 30 minutes = 1,500 minutes/month = 25 hours/month
- Annual time saved: 300 hours
- Annual cost savings: $24,000 (at $80/hour)
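The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function name `annual_savings` is made up for illustration, and the inputs (50 incidents per month, 30 minutes saved, $80/hour) are the figures quoted above, so plug in your own.

```python
def annual_savings(incidents_per_month: int,
                   minutes_saved_per_incident: float,
                   hourly_rate: float) -> dict:
    """Return hours saved per month, hours saved per year, and dollars saved per year."""
    hours_per_month = incidents_per_month * minutes_saved_per_incident / 60
    hours_per_year = hours_per_month * 12
    return {
        "hours_per_month": hours_per_month,
        "hours_per_year": hours_per_year,
        "dollars_per_year": hours_per_year * hourly_rate,
    }


# Figures from the example above: 50 incidents/month, 30 minutes saved, $80/hour.
savings = annual_savings(50, 30, 80)
```

Changing any one input (for example, 20 incidents per month instead of 50) shows quickly whether the savings still justify the writing effort for a smaller team.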
But here's the thing: Most runbooks don't get used because they're poorly written.
The Problem with Most Runbooks
Problem 1: Too Long
Bad runbook example:
Service X Runbook - Complete Operations Guide
Table of Contents
1. Architecture Overview (5 pages)
2. Service Dependencies (3 pages)
3. Monitoring (4 pages)
4. Common Issues (12 pages)
5. Troubleshooting (8 pages)
6. Recovery (6 pages)
Total: 38 pages
Reality: Engineer at 2 AM reads first page, gives up
Good runbook:
Service X Runbook - Emergency Response
IF: Service X returning 500 errors
THEN: Follow these 5 steps (2 minutes to read)
Step 1: Check service health page
Step 2: Check database connectivity
Step 3: Restart service
Step 4: Verify recovery
Step 5: Page database team if not recovered
Reality: Engineer reads entire runbook, fixes problem
Problem 2: Too Generic
Bad runbook:
Service Failing Runbook
1. Check if service is running
2. Look at logs
3. Check dependencies
4. Fix the problem
5. Verify service is working
(Applies to literally all services, useless for any specific service)
Good runbook:
Payment Service 500 Errors Runbook
THIS RUNBOOK IS FOR: "Payment Service returning 500 errors"
IF YOU SEE: Error rate > 1% on /api/payments endpoint
Step 1: Check Payment Service pod status
Command: kubectl get pods -n prod -l app=payment
Expected: 3 pods Running
If not: Restart pods (step 4)
Step 2: Check database connection pool
Command: SELECT count(*) FROM pg_stat_activity WHERE datname='payments'
Expected: < 500 connections
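If > 500: Terminate idle connections (SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname='payments' AND state='idle' AND pid <> pg_backend_pid()), then restart Payment Service (step 4)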
Step 3: Check Auth Service dependency
Check: Is Auth Service also experiencing issues?
Command: curl https://auth-service-internal.prod.svc/health
If not responding: This is root cause, page auth team
Step 4: Restart Payment Service
Command: kubectl rollout restart deployment payment -n prod
Wait: 30 seconds for pods to start
Check: Verify error rate drops below 0.1%
Step 5: If not recovered
Page: On-call payment team lead
Message: "Payment Service still broken, tried restart and DB check"
Problem 3: Written for Documentation, Not Emergency Use
Bad runbook:
When a service becomes unavailable, the on-call engineer should:
First, understand the service architecture and dependencies. Reference
the service dependency diagram (see Architecture section) to understand
what this service depends on and what depends on it. This context is
critical for...
Good runbook:
SERVICE DOWN? DO THIS NOW:
Step 1: Check service status
Step 2: Check dependencies
Step 3: Restart
Step 4: Verify recovery
Step 5: Escalate if still down
Problem 4: Not Maintained
Runbooks get out of date the moment they're written.
Scenario:
Runbook says: "Restart using: systemctl restart payment-service"
But: Service was migrated to Kubernetes 6 months ago
Result: Engineer tries command, it fails, they're confused
The Anatomy of a Good Runbook
Here's what a good runbook looks like:
1. Title (Clear and Specific)
Bad:
Service Issues Runbook
Database Problems Guide
Error Handling
Good:
Payment Service 500 Errors Runbook
Database Connection Pool Exhaustion Runbook
Memory Leak in Worker Service Runbook
Cache Invalidation Failure Runbook
2. What Triggers This Runbook (Be Specific)
THIS RUNBOOK APPLIES WHEN:
You see one of these:
✓ Alert: "Payment Service Error Rate > 1%"
✓ Alert: "Payment Service HTTP 500 Count > 10/min"
✓ Customer report: "I can't checkout"
NOT when:
✗ Alert: "Database CPU > 80%" (use Database runbook)
✗ Alert: "API Gateway 502 errors" (use API Gateway runbook)
3. Quick Decision Tree
IS THIS YOUR RUNBOOK?
[Start]
↓
Q: Is Payment Service returning 500 errors?
YES → Continue to Step 1
NO → Use different runbook
Q: Is error rate > 1%?
YES → This is critical, continue
NO → Monitor and escalate if worsens
4. Triage Steps (2-3 minutes)
These are the FIRST things to check:
STEP 1: VERIFY THE PROBLEM (30 seconds)
Check: Is the alert real or false positive?
Command: curl -i --max-time 10 https://api.prod.opsbr.io/api/payments/health
Expected responses:
✓ 200 OK → Service is actually healthy, likely false alert
✗ 500 Internal Server Error → Service is broken, continue
✗ 503 Service Unavailable → Service is down, continue
✗ Timeout (no response) → Service is not responding, continue
Time: 30 seconds
Next: If service is broken, go to Step 2
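The Step 1 check can also be scripted. The sketch below uses only the Python standard library; the function names `fetch_status` and `triage_health_check` are hypothetical, and the handling of status codes the runbook does not list is an assumption, not part of the runbook above.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(url: str, timeout_s: float = 10.0) -> Optional[int]:
    """Return the HTTP status code, or None if there was no response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions but still carry a status code.
        return err.code
    except OSError:
        # Covers connection failures and timeouts (URLError is an OSError).
        return None


def triage_health_check(status_code: Optional[int]) -> str:
    """Map a health-check result to the Step 1 outcomes listed above."""
    if status_code is None:
        return "not responding: continue to Step 2"
    if status_code == 200:
        return "healthy: likely false alert"
    if status_code == 500:
        return "broken: continue to Step 2"
    if status_code == 503:
        return "down: continue to Step 2"
    # Assumption: anything the runbook does not list is treated as broken.
    return f"unlisted status {status_code}: treat as broken, continue to Step 2"
```

During an incident you would call `triage_health_check(fetch_status(url))` with the health URL from the runbook. Keeping the network call separate from the decision makes the decision logic easy to test without a live service.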
5. Root Cause Steps (5-10 minutes)
STEP 2: IDENTIFY ROOT CAUSE (5-10 minutes)
Check: Is it database connectivity?
Command: kubectl logs deployment/payment -n prod | grep -iE "connection|timeout|authentication"
Look for patterns:
✓ "Connection refused" → Database is down or not accessible
✓ "Connection pool exhausted" → Too many connections
✓ "Query timeout" → Database is slow
✓ "Authentication failed" → Database credentials wrong
What to do for each:
IF: Connection refused
THEN: Go to Step 3 (Database Recovery)
IF: Connection pool exhausted
THEN: Go to Step 4 (Connection Pool)
IF: Database slow
THEN: Check database runbook, page database team
Time: 5-10 minutes
Next: Based on what you find, go to appropriate fix step
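The pattern table in Step 2 maps naturally onto a small lookup. Here is a minimal sketch, assuming the log lines contain the exact phrases listed above; `PATTERNS` and `diagnose` are illustrative names, and the runbook lists no action for "Authentication failed", so the sketch leaves that action empty rather than inventing one.

```python
import re

# (pattern, diagnosis, next action) copied from the Step 2 lists above.
PATTERNS = [
    (re.compile(r"connection refused", re.I),
     "Database is down or not accessible", "Go to Step 3"),
    (re.compile(r"connection pool exhausted", re.I),
     "Too many connections", "Go to Step 4"),
    (re.compile(r"query timeout", re.I),
     "Database is slow", "Check database runbook, page database team"),
    (re.compile(r"authentication failed", re.I),
     "Database credentials wrong", None),  # no action given in the runbook
]


def diagnose(log_lines):
    """Return unique (diagnosis, action) pairs in the order they first appear."""
    findings = []
    for line in log_lines:
        for pattern, diagnosis, action in PATTERNS:
            if pattern.search(line) and (diagnosis, action) not in findings:
                findings.append((diagnosis, action))
    return findings
```

Feeding it the output of the `kubectl logs` command (one line per list element) gives the on-call engineer a short list of likely causes instead of a wall of log text.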
6. Fix Steps (Ordered by Likelihood)
STEP 3: FIX #1 - RESTART SERVICE (Most likely to work)
Action: Restart Payment Service pods
Command: kubectl rollout restart deployment payment -n prod
Wait: 30 seconds for pods to start
Verify:
Command: kubectl get pods -n prod -l app=payment
Expected: All pods show "Running" and "Ready 1/1"
Command: curl -i --max-time 10 https://api.prod.opsbr.io/api/payments/health
Expected: 200 OK (should respond in < 1 second)
Time: 2-3 minutes
If service recovered:
✓ Send "RESOLVED" message to #incidents
✓ Continue to Post-Incident (Step 6)
If service NOT recovered:
→ Go to Step 4
7. Escalation (Who to Page)
STEP 5: ESCALATION (If still broken after 10 minutes)
If you've tried all steps and service is still down:
Page: @payment-team-lead
Message: "Payment Service still down after 10 minutes.
Tried: service restart, database check.
Last error: [paste error from logs]"
Time: < 1 minute
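A page is only useful if it tells the next person what has already been tried. One way to keep that message consistent is to generate it; the helper below is a sketch with a hypothetical name (`escalation_message`) that follows the message format shown above.

```python
def escalation_message(service: str, minutes_down: int,
                       steps_tried: list, last_error: str) -> str:
    """Build the escalation text in the same three-line shape as the example above."""
    tried = ", ".join(steps_tried) if steps_tried else "nothing yet"
    return (
        f"{service} still down after {minutes_down} minutes.\n"
        f"Tried: {tried}.\n"
        f"Last error: {last_error}"
    )
```

For example, `escalation_message("Payment Service", 10, ["service restart", "database check"], "Connection refused")` produces the three lines an on-call lead needs before opening a laptop.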
8. Post-Incident (What to Do When Fixed)
STEP 6: POST-INCIDENT
Once service is recovered:
1. Post to #incidents
Message: "Payment Service RESOLVED - was due to [root cause]"
2. Create incident ticket
Go to: PagerDuty or Incident.io
Include:
- When it started
- Root cause
- How you fixed it
- Time to fix
3. Schedule post-mortem
When: Within 24 hours
Who: Payment team + on-call who fixed it
Do NOT skip this step - post-mortems prevent recurrence
Runbook Template
Here's a template you can copy and fill in:
# [SERVICE NAME] - [ISSUE TYPE] RUNBOOK
## What Triggers This Runbook
You should use this runbook when:
✓ [Alert name or symptom 1]
✓ [Alert name or symptom 2]
✓ [Customer report description]
You should NOT use this runbook when:
✗ [Different issue]
✗ [Different issue]
## Quick Triage (1 minute)
Is this the right runbook?
Q: [Ask a yes/no question that identifies this specific issue]
YES → Continue
NO → Use different runbook
## Root Cause Diagnosis (5 minutes)
STEP 1: Check [specific system]
Command: [exact command to run]
Expected result: [what should you see]
If different: Go to Step [X]
STEP 2: Check [related system]
Command: [exact command to run]
Expected result: [what should you see]
If different: Go to Step [X]
## Recovery Steps (In order of likelihood)
STEP 3: FIX #1 - [Most likely fix]
Action: [Be specific]
Command: [exact command]
Verify: [how to check if it worked]
Time: [how long this takes]
If works: Go to Post-Incident
If doesn't work: Go to Step 4
STEP 4: FIX #2 - [Second most likely fix]
Action: [Be specific]
Command: [exact command]
Verify: [how to check if it worked]
Time: [how long this takes]
If works: Go to Post-Incident
If doesn't work: Go to Step 5
STEP 5: ESCALATION
If still broken after 10 minutes: Page [person/team]
Message: [what to tell them]
## Post-Incident
1. Post in #incidents: "RESOLVED - [root cause]"
2. Create ticket in PagerDuty
3. Schedule post-mortem within 24 hours
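If runbooks live in a repository, a small check can catch drafts that skip a section of the template. The sketch below is an assumption about how you might do that, not a standard tool: `REQUIRED_SECTIONS` mirrors the template headings above, and the placeholder check is a rough heuristic that flags any leftover bracketed text.

```python
import re

# Section names taken from the template above (matched case-insensitively).
REQUIRED_SECTIONS = [
    "What Triggers This Runbook",
    "Quick Triage",
    "Root Cause",
    "Recovery",
    "Escalation",
    "Post-Incident",
]


def missing_sections(runbook_text: str) -> list:
    """Return the template sections that do not appear anywhere in the text."""
    lowered = runbook_text.lower()
    return [name for name in REQUIRED_SECTIONS if name.lower() not in lowered]


def unfilled_placeholders(runbook_text: str) -> list:
    """Return bracketed template placeholders such as [exact command] still in the text."""
    return re.findall(r"\[[^\]\n]+\]", runbook_text)
```

Running both functions in a pre-merge check keeps half-finished runbooks out of the shared library. Note that the placeholder heuristic will also flag intentional brackets such as "[paste error from logs]", so treat its output as a prompt to review rather than a hard failure.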
Real-World Runbook Examples
Example 1: Database Connection Pool Exhaustion
# DATABASE CONNECTION POOL EXHAUSTION RUNBOOK
## What Triggers This Runbook
You see:
✓ Alert: "Database connections > 90% of max (4,500/5,000)"
✓ Multiple services getting "Connection refused" errors
✓ "No more connections" error in application logs
## Quick Triage (1 minute)
Q: Is database responding to direct connections?
Command: psql -h prod-db.internal.svc -U postgres -c "SELECT 1"
Expected: Returns "1" in < 1 second
If fails: Database is down, use Database Down runbook
Q: Can you connect to database from admin pod?
Command: kubectl exec -it admin-pod -- psql -h prod-db -U postgres -c "SELECT count(*) FROM pg_stat_activity"
Expected: Returns a number (your connection count)
If timeout: Connection pool is exhausted
## Root Cause (3-5 minutes)
STEP 1: Check which service is holding connections
Command: psql -h prod-db -U postgres -c "
SELECT datname, usename, application_name, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, application_name
ORDER BY count DESC"
Look for: Which application has the most connections?
STEP 2: Check if connections are stuck
Command: psql -h prod-db -U postgres -c "
SELECT pid, usename, state, state_change, query
FROM pg_stat_activity
WHERE state != 'idle' AND state_change < now() - interval '10 min'
ORDER BY state_change"
Look for: Queries that have been running for > 10 minutes (these are stuck)
## Recovery
STEP 3: Kill stuck connections (if any)
Command: psql -h prod-db -U postgres -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle' AND state_change < now() - interval '10 min'
AND pid <> pg_backend_pid()"
Time: < 1 minute
Verify: Check connection count again (should be lower)
STEP 4: Restart service that's holding connections
If Step 1 showed: Payment Service has 3,000+ connections
Command: kubectl rollout restart deployment payment -n prod
Wait: 30 seconds
Verify: Connections should drop significantly
STEP 5: Monitor recovery
Command: psql -h prod-db -U postgres -c "SELECT count(*) FROM pg_stat_activity"
Target: Should drop below 2,000 within 2 minutes
## If still broken after 5 minutes
Page: @database-team-lead
Message: "Database connection pool exhausted. Killed stuck queries
and restarted [service]. Still at [current connection count]."
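The decision in this runbook (how full is the pool, and which service should be restarted first) can be sketched in Python once the per-application counts from Step 1 are available. The function names are hypothetical, and the 90% threshold and 5,000 maximum come from the alert text in this example.

```python
def pool_usage_pct(current: int, max_connections: int) -> float:
    """Percentage of the connection limit currently in use."""
    return 100.0 * current / max_connections


def biggest_holder(rows):
    """rows: (application_name, connection_count) pairs from the Step 1 query."""
    return max(rows, key=lambda row: row[1])


def recovery_hint(rows, max_connections: int, alert_pct: float = 90.0) -> str:
    """Suggest which application to restart first when the pool is over the threshold."""
    total = sum(count for _, count in rows)
    usage = pool_usage_pct(total, max_connections)
    if usage <= alert_pct:
        return f"{usage:.0f}% of pool in use: at or below the alert threshold"
    app, count = biggest_holder(rows)
    return f"{usage:.0f}% of pool in use: {app} holds {count}, restart it first"
```

The sketch only ranks by connection count. It does not know whether those connections are stuck, so Step 2 is still needed before killing anything.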
Example 2: High CPU Causing Timeouts
# HIGH CPU - SERVICE TIMEOUT RUNBOOK
## What Triggers This Runbook
You see:
✓ Alert: "Service CPU > 90%"
✓ Alert: "Request latency > 5 seconds" (normal is < 100ms)
✓ Customers report: "Requests timing out"
## Quick Triage (1 minute)
Q: Is this sustained high CPU or temporary spike?
Command: kubectl top pods -n prod | grep [service-name]
Look at: Run the command 2-3 times, about 30 seconds apart. Is CPU staying high or dropping?
Sustained high → Likely resource leak or bad query
Temporary → Likely traffic spike or background job
## Root Cause (5 minutes)
STEP 1: Check what's using CPU
Command: kubectl logs deployment/[service] -n prod --since=15m | grep -i "slow\|cpu\|query"
Look for: Slow queries, background jobs, memory pressure
STEP 2: Check if it's a background job
Command: kubectl describe pod [pod-name] -n prod | grep -A 20 "Containers"
Look for: Are background jobs running right now?
STEP 3: Check recent deployments
Command: kubectl rollout history deployment/[service] -n prod
Question: Was a deployment made in last 30 minutes?
## Recovery
STEP 4: FIX #1 - Kill noisy background job
If you found background job using CPU:
Command: kubectl logs [pod] -n prod | grep -A 5 "[job-name]"
Kill the job or let it complete (usually 1-2 min)
Verify: CPU drops back to normal
STEP 5: FIX #2 - Rollback recent deployment
If deployment was made < 30 min ago:
Command: kubectl rollout undo deployment/[service] -n prod
Wait: 30 seconds
Verify: CPU drops, latency improves
STEP 6: FIX #3 - Scale up service horizontally
If neither of above works:
Command: kubectl scale deployment [service] --replicas=5 -n prod
Wait: 60 seconds for new pods to start
Verify: CPU per pod drops as load balances across more instances
## If still high after 5 minutes
Page: @[service]-team
Message: "High CPU causing timeouts. Tried: kill jobs, rollback deploy,
scale to 5 replicas. Still at 90% CPU."
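The "sustained or temporary" question from the triage section is easier to answer with a few readings than with one. A minimal sketch, assuming you collect several CPU percentages over a minute or two (for example by re-running `kubectl top pods`); the function name `classify_cpu` is made up, and the 90% default matches the alert in this example.

```python
def classify_cpu(samples_pct, high_threshold_pct: float = 90.0) -> str:
    """Classify CPU readings (oldest first) as sustained, temporary, or normal."""
    if not samples_pct:
        raise ValueError("need at least one CPU sample")
    high = [sample > high_threshold_pct for sample in samples_pct]
    if all(high):
        return "sustained"   # likely resource leak or bad query
    if any(high):
        return "temporary"   # likely traffic spike or background job
    return "normal"
```

Three or four samples are usually enough to tell the two cases apart. With a single sample the function can only ever answer "sustained" or "normal", which is why the runbook asks whether CPU is staying high or dropping.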
How to Make Runbooks Your Team Will Use
1. Make Them Discoverable
Problem: Nobody uses runbooks because they can't find them
Solution:
- Store in one place everyone knows about (Confluence, Notion, GitHub Wiki).
- Link from PagerDuty incident directly to runbook.
- Link from alert messages: "Got this alert? See runbook: [link]".
- Post in Slack #incidents with runbook links.
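Linking alerts to runbooks can be as simple as a lookup table kept next to the alert definitions. The sketch below is illustrative only: the alert names are the ones used earlier in this guide, while the `wiki.example.com` URLs and the function names are placeholders, not real locations or a real alerting API.

```python
# Hypothetical alert-name -> runbook-URL map; wiki.example.com is a placeholder host.
RUNBOOK_LINKS = {
    "Payment Service Error Rate > 1%": "https://wiki.example.com/runbooks/payment-500-errors",
    "Database CPU > 80%": "https://wiki.example.com/runbooks/database-high-cpu",
    "API Gateway 502 errors": "https://wiki.example.com/runbooks/api-gateway-502",
}


def alert_text(alert_name: str) -> str:
    """Append the runbook link to an alert, or say clearly that none exists yet."""
    link = RUNBOOK_LINKS.get(alert_name)
    if link is None:
        return f"{alert_name}\nNo runbook linked yet."
    return f"{alert_name}\nGot this alert? See runbook: {link}"


def alerts_without_runbooks(alert_names) -> list:
    """List alerts that would page someone with no runbook attached."""
    return [name for name in alert_names if name not in RUNBOOK_LINKS]
```

Running `alerts_without_runbooks` over your full alert list gives a ready-made backlog of runbooks still to write.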
2. Test Them Regularly
Problem: Runbook says "restart service" but that command was deprecated 6 months ago
Solution:
- Monthly, run a chaos engineering test (intentionally break something).
- Follow the runbook step-by-step.
- Update anything that doesn't work.
- Version your runbooks (v1.0, v1.1, etc.).
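Part of the monthly test can be automated by pulling every command out of a runbook and flagging ones that use tooling you have retired. This is a sketch under two assumptions: runbooks mark commands with a "Command:" prefix as in the examples in this guide, and you maintain your own list of retired command prefixes. Only the first line of a multi-line command is captured.

```python
import re

# Matches lines such as "Command: kubectl get pods -n prod".
COMMAND_LINE = re.compile(r"^[ \t]*Command:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def extract_commands(runbook_text: str) -> list:
    """Return every command that follows a 'Command:' label, in document order."""
    return COMMAND_LINE.findall(runbook_text)


def flag_stale(commands, retired_prefixes) -> list:
    """Return commands that start with a tool the team no longer uses."""
    prefixes = tuple(retired_prefixes)
    return [command for command in commands if command.startswith(prefixes)]
```

For a service that moved from systemd to Kubernetes, `flag_stale(extract_commands(text), ["systemctl"])` would surface the outdated restart command before an engineer hits it during an incident.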
3. Get Feedback After Every Incident
Problem: Runbooks get better or worse without anyone knowing
Solution:
- After every incident, ask "Was the runbook helpful?"
- If yes, note what worked.
- If no, ask what was missing.
- Update runbook based on feedback.
4. Keep Them Updated
Problem: Service architecture changes, runbooks don't
Solution:
- Assign ownership ("Payment team owns payment service runbooks").
- Review quarterly.
- Update when the service changes, new common issues are discovered, or tools/commands change.
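A quarterly review is easier to enforce when something lists the runbooks that are overdue. The sketch below assumes you record a last-reviewed date per runbook somewhere (front matter, a spreadsheet, a wiki field); the function name and the 90-day window as a stand-in for "quarterly" are assumptions.

```python
from datetime import date, timedelta


def overdue_runbooks(last_reviewed: dict, today: date, max_age_days: int = 90) -> list:
    """Return names of runbooks whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items() if reviewed < cutoff)
```

Posting the result to the owning team's channel once a month turns "review quarterly" from an intention into a visible to-do list.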
5. Make Them Easy to Read
Problem: Runbook full of technical jargon, engineer doesn't understand
Solution:
- Use exact commands (copy/paste ready).
- Explain what each command does.
- Test with someone who's NOT an expert.
Common Runbook Mistakes to Avoid
| Mistake | Why It's Bad | Fix |
|---|---|---|
| Too long (10+ pages) | Nobody reads it during incident | Keep to 2 pages max |
| Generic instructions | Doesn't actually help | Make specific to your system |
| Outdated commands | Commands don't work, engineer gives up | Test monthly |
| Missing escalation | Engineer doesn't know when to page for help | Always include escalation step |
| No decision tree | Engineer doesn't know if it's the right runbook | Start with "IS THIS YOUR RUNBOOK?" |
| Too technical | Engineers at 2 AM don't understand | Use simple language |
| No post-incident step | Root cause never fixed, incident repeats | Always include a post-incident step |
| No owner | Runbooks get worse over time | Assign team ownership |
Runbook Metrics to Track
If you want to improve your runbooks, track:
- How often is runbook used per month?
- Did incident get resolved (yes/no)?
- Time from engineer opening runbook to incident resolved?
- Did engineer skip to escalation? (means runbook didn't help)
Targets:
- 80%+ of incidents should have runbook (coverage)
- 70%+ of runbook-guided incidents should resolve (effectiveness)
- MTTR with runbook: < 15 minutes
- < 10% of runbook-guided incidents should skip straight to escalation
If metrics are bad:
- Runbook is too confusing (rewrite)
- Runbook is for wrong issue type (add decision tree)
- Runbook is outdated (test and update)
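These metrics are straightforward to compute from incident records. The sketch below assumes each incident is a dict with four hypothetical fields (`had_runbook`, `resolved_by_runbook`, `minutes_to_resolve`, `skipped_to_escalation`); your incident tool will name them differently. The target values are passed in by the caller rather than hard-coded.

```python
from statistics import fmean


def runbook_metrics(incidents) -> dict:
    """Compute coverage, effectiveness, MTTR, and escalation rate from incident records."""
    total = len(incidents)
    guided = [i for i in incidents if i["had_runbook"]]
    if total == 0 or not guided:
        raise ValueError("need at least one runbook-guided incident")
    resolved = [i for i in guided if i["resolved_by_runbook"]]
    escalated = [i for i in guided if i["skipped_to_escalation"]]
    return {
        "coverage_pct": 100 * len(guided) / total,
        "effectiveness_pct": 100 * len(resolved) / len(guided),
        "mttr_minutes": fmean(i["minutes_to_resolve"] for i in resolved) if resolved else None,
        "escalation_pct": 100 * len(escalated) / len(guided),
    }


def missed_targets(metrics: dict, min_coverage: float, min_effectiveness: float,
                   max_mttr: float, max_escalation: float) -> list:
    """Return the names of metrics that miss the targets supplied by the caller."""
    missed = []
    if metrics["coverage_pct"] < min_coverage:
        missed.append("coverage_pct")
    if metrics["effectiveness_pct"] < min_effectiveness:
        missed.append("effectiveness_pct")
    if metrics["mttr_minutes"] is None or metrics["mttr_minutes"] >= max_mttr:
        missed.append("mttr_minutes")
    if metrics["escalation_pct"] >= max_escalation:
        missed.append("escalation_pct")
    return missed
```

Calling `missed_targets(metrics, 80, 70, 15, 10)` applies the targets listed above and points you at the metric to investigate first.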
Tools for Managing Runbooks
Confluence
Pros: Great for collaboration, good search, version control
Cons: Can get cluttered, hard to find specific runbooks
Best for: Teams already using Atlassian stack
Notion
Pros: Beautiful formatting, easy to use, fast
Cons: Can be slow at scale, limited integrations
Best for: Small-to-medium teams wanting simplicity
GitHub Wiki
Pros: Version control built-in, easy to link, developers already know it
Cons: Can be hard to organize, limited search
Best for: Engineering-first organizations
PagerDuty + Runbooks
Pros: Integrated with incidents, easy to link alerts
Cons: Runbooks limited to PagerDuty platform
Best for: PagerDuty users
Recommendation
Start: GitHub Wiki or Notion (easy, free, good enough)
Scale: Confluence (better for large organizations)
Link: From PagerDuty alerts and incidents (so engineers find them)
Sample Runbook Library
Your organization should have runbooks for:
Tier 1 (Critical - first 3 months):
- [ ] Database down
- [ ] Payment service down
- [ ] API Gateway down
- [ ] Auth service down
- [ ] Cache service down
Tier 2 (High Priority - months 3-6):
- [ ] High database CPU
- [ ] High memory usage
- [ ] Connection pool exhausted
- [ ] Disk full
- [ ] Network connectivity issues
Tier 3 (Medium Priority - months 6+):
- [ ] Slow database queries
- [ ] High error rates
- [ ] Long deployment times
- [ ] Certificate expiration
- [ ] Dependency timeout
Total: 15-20 runbooks for a typical SaaS company
Runbook ROI
If you implement runbooks properly:
Cost:
- Time to write 10 runbooks: 40-60 hours = $3,200-$4,800
- Time to maintain (2.5 hours/month): $200/month = $2,400/year (at $80/hour)
Benefit:
- 50 incidents per month
- 30 minutes saved per incident (with good runbook)
- 1,500 minutes/month = 25 hours/month
- 300 hours/year = $24,000/year (at $80/hour)
ROI:
- Year 1: ($24,000 - $4,800 - $2,400) / $4,800 = 350% ROI
- Year 2+: $24,000 / $2,400 = 10x return
Result: For every $1 spent on runbooks, you save $10
Conclusion: Good Runbooks = Faster Incident Response
Good runbooks reduce MTTR by 20-40% and make engineers' lives easier. Bad runbooks waste time and confuse engineers.
Start with the critical 5 runbooks, test them, improve based on feedback, then add more.
Your team will thank you when they can resolve incidents 30% faster.
Start this week:
- Pick your most common incident type
- Write a 2-page runbook using the template above
- Test it with your team
- Publish it in your runbook location
- Link it from your alert/on-call tool
- Get feedback after next incident using this runbook
- Update based on feedback
In 3 months, you'll have 5-10 tested, proven runbooks. In 6 months, you'll have a complete library that your team actually uses.
Your MTTR should drop by 20-40%. Your on-call morale will improve. Your team will sleep better.


