INCIDENT RESPONSE RUNBOOKS

Learn how to write incident response runbooks that actually work. Includes templates, examples, common mistakes, and how to make runbooks your team will actually use.

Andrea Brown

February 27, 2026 · 1 min read

How to Write Incident Response Runbooks That Actually Work

Most incident response runbooks are useless.

They sit in a wiki nobody reads. They're written by someone who left the company. They're so generic they don't actually help. Or they're so specific they break when anything changes.

When 2 AM hits and your service is down, engineers don't have time to read 10-page documents. They need clear steps that work right now.

This guide shows you how to write runbooks that your team will actually use during incidents.


Why Runbooks Matter

Runbooks reduce MTTR by 20-40%. Here's why:

Without a runbook:

Service X fails at 2 AM
Engineer wakes up
Engineer thinks: "What do I do first?"
Engineer checks Slack history for similar incidents
Engineer checks logs to understand the problem
Engineer googles the error message
Engineer tries different things (wrong thing, then right thing)
Engineer eventually fixes it (after 45 minutes)

With a good runbook:

Service X fails at 2 AM
Engineer wakes up
Engineer opens runbook
Engineer follows steps 1-5 in order
Engineer identifies root cause in step 3
Engineer executes fix in step 4
Engineer verifies in step 5
Service is back up (15 minutes)

Time saved: 30 minutes per incident

For a team of 25 with 50 incidents per month:

  • Time saved: 50 × 30 minutes = 1,500 minutes/month = 25 hours/month
  • Annual time saved: 300 hours
  • Annual cost savings: $24,000 (at $80/hour)
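
The arithmetic above can be reproduced with a small shell sketch. The function name `annual_savings` is hypothetical, and the inputs are this article's example figures rather than measured data, so swap in your own incident counts and rates.

```shell
#!/usr/bin/env bash
# Sketch: yearly savings from minutes saved per incident (integer math).

# annual_savings INCIDENTS_PER_MONTH MINUTES_SAVED HOURLY_RATE
# Prints the yearly saving in whole dollars.
annual_savings() {
  local incidents=$1 minutes=$2 rate=$3
  # minutes per year * rate per hour / 60 minutes per hour
  echo $(( incidents * minutes * 12 * rate / 60 ))
}

annual_savings 50 30 80   # prints 24000
```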

But here's the thing: Most runbooks don't get used because they're poorly written.


The Problem with Most Runbooks

Problem 1: Too Long

Bad runbook example:

Service X Runbook - Complete Operations Guide

Table of Contents
  1. Architecture Overview (5 pages)
  2. Service Dependencies (3 pages)
  3. Monitoring (4 pages)
  4. Common Issues (12 pages)
  5. Troubleshooting (8 pages)
  6. Recovery (6 pages)

Total: 38 pages

Reality: Engineer at 2 AM reads first page, gives up

Good runbook:

Service X Runbook - Emergency Response

IF: Service X returning 500 errors
THEN: Follow these 5 steps (2 minutes to read)

Step 1: Check service health page
Step 2: Check database connectivity
Step 3: Restart service
Step 4: Verify recovery
Step 5: Page database team if not recovered

Reality: Engineer reads entire runbook, fixes problem

Problem 2: Too Generic

Bad runbook:

Service Failing Runbook

1. Check if service is running
2. Look at logs
3. Check dependencies
4. Fix the problem
5. Verify service is working

(Applies to literally all services, useless for any specific service)

Good runbook:

Payment Service 500 Errors Runbook

THIS RUNBOOK IS FOR: "Payment Service returning 500 errors"
IF YOU SEE: Error rate > 1% on /api/payments endpoint

Step 1: Check Payment Service pod status
  Command: kubectl get pods -n prod -l app=payment
  Expected: 3 pods Running
  If not: Restart pods (Step 4)

Step 2: Check database connection pool
  Command: SELECT count(*) FROM pg_stat_activity WHERE datname='payments';
  Expected: < 500 connections
  If > 500: Restart Payment Service (Step 4) to release its connections

Step 3: Check Auth Service dependency
  Check: Is Auth Service also experiencing issues?
  Command: curl https://auth-service-internal.prod.svc/health
  If not responding: This is root cause, page auth team

Step 4: Restart Payment Service
  Command: kubectl rollout restart deployment payment -n prod
  Wait: 30 seconds for pods to start
  Check: Verify error rate drops below 0.1%

Step 5: If not recovered
  Page: On-call payment team lead
  Message: "Payment Service still broken, tried restart and DB check"

Problem 3: Written for Documentation, Not Emergency Use

Bad runbook:

When a service becomes unavailable, the on-call engineer should:

First, understand the service architecture and dependencies. Reference
the service dependency diagram (see Architecture section) to understand
what this service depends on and what depends on it. This context is
critical for...

Good runbook:

SERVICE DOWN? DO THIS NOW:

Step 1: Check service status
Step 2: Check dependencies
Step 3: Restart
Step 4: Verify recovery
Step 5: Escalate if still down

Problem 4: Not Maintained

Runbooks get out of date the moment they're written.

Scenario:

Runbook says: "Restart using: systemctl restart payment-service"
But: Service was migrated to Kubernetes 6 months ago
Result: Engineer tries command, it fails, they're confused

The Anatomy of a Good Runbook

Here's what a good runbook looks like:

1. Title (Clear and Specific)

Bad:

Service Issues Runbook
Database Problems Guide
Error Handling

Good:

Payment Service 500 Errors Runbook
Database Connection Pool Exhaustion Runbook
Memory Leak in Worker Service Runbook
Cache Invalidation Failure Runbook

2. What Triggers This Runbook (Be Specific)

THIS RUNBOOK APPLIES WHEN:

You see one of these:
  ✓ Alert: "Payment Service Error Rate > 1%"
  ✓ Alert: "Payment Service HTTP 500 Count > 10/min"
  ✓ Customer report: "I can't checkout"

NOT when:
  ✗ Alert: "Database CPU > 80%" (use Database runbook)
  ✗ Alert: "API Gateway 502 errors" (use API Gateway runbook)

3. Quick Decision Tree

IS THIS YOUR RUNBOOK?

[Start]
  ↓
Q: Is Payment Service returning 500 errors?
  YES → Continue to Step 1
  NO → Use different runbook

Q: Is error rate > 1%?
  YES → This is critical, continue
  NO → Monitor and escalate if worsens

4. Triage Steps (2-3 minutes)

These are the FIRST things to check:

STEP 1: VERIFY THE PROBLEM (30 seconds)

Check: Is the alert real or false positive?

Command: curl https://api.prod.opsbr.io/api/payments/health

Expected responses:
  ✓ 200 OK → Service is actually healthy, likely false alert
  ✗ 500 Internal Server Error → Service is broken, continue
  ✗ 503 Service Unavailable → Service is down, continue
  ✗ Timeout (no response) → Service is not responding, continue

Time: 30 seconds
Next: If service is broken, go to Step 2
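
The status-to-action mapping in Step 1 could be scripted roughly as follows. The function name `next_action` is hypothetical. Treating any other status code as "continue" is an assumption, since the step above only lists 200, 500, 503, and timeouts.

```shell
#!/usr/bin/env bash
# Sketch: map the health-check status code to the runbook's next move.

# next_action HTTP_CODE
# curl reports 000 when it receives no HTTP response at all (for example a timeout).
next_action() {
  case "$1" in
    200)     echo "healthy: likely false alert" ;;
    500|503) echo "broken: continue to Step 2" ;;
    000)     echo "no response: continue to Step 2" ;;
    *)       echo "unexpected status $1: continue to Step 2" ;;  # assumption
  esac
}

# Real usage (needs network access to the service):
#   code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
#            https://api.prod.opsbr.io/api/payments/health)
#   next_action "$code"
```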

5. Root Cause Steps (5-10 minutes)

STEP 2: IDENTIFY ROOT CAUSE (5-10 minutes)

Check: Is it database connectivity?

Command: kubectl logs deployment/payment -n prod | grep -i "database"

Look for patterns:
  ✓ "Connection refused" → Database is down or not accessible
  ✓ "Connection pool exhausted" → Too many connections
  ✓ "Query timeout" → Database is slow
  ✓ "Authentication failed" → Database credentials wrong

What to do for each:

IF: Connection refused
THEN: Go to Step 3 (Database Recovery)

IF: Connection pool exhausted
THEN: Go to Step 4 (Connection Pool)

IF: Database slow
THEN: Check database runbook, page database team

Time: 5-10 minutes
Next: Based on what you find, go to appropriate fix step
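
If scanning logs by eye is too slow, the pattern matching in Step 2 can be sketched as a filter. The function name `classify_db_errors` is hypothetical, and the match strings are the ones listed above; your service's actual log wording may differ.

```shell
#!/usr/bin/env bash
# Sketch: read log text on stdin, print one line per known database pattern.

classify_db_errors() {
  local logs found=0
  logs=$(cat)
  if printf '%s\n' "$logs" | grep -qi "connection refused"; then
    echo "connection refused: database down or unreachable, go to Step 3"; found=1
  fi
  if printf '%s\n' "$logs" | grep -qi "connection pool exhausted"; then
    echo "pool exhausted: too many connections, go to Step 4"; found=1
  fi
  if printf '%s\n' "$logs" | grep -qi "query timeout"; then
    echo "query timeout: database slow, page database team"; found=1
  fi
  if printf '%s\n' "$logs" | grep -qi "authentication failed"; then
    echo "authentication failed: database credentials wrong"; found=1
  fi
  if [ "$found" -eq 0 ]; then
    echo "no known database pattern found"
  fi
}

# Real usage (needs cluster access):
#   kubectl logs deployment/payment -n prod | classify_db_errors
```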

6. Fix Steps (Ordered by Likelihood)

STEP 3: FIX #1 - RESTART SERVICE (Most likely to work)

Action: Restart Payment Service pods

Command: kubectl rollout restart deployment payment -n prod

Wait: 30 seconds for pods to start

Verify:
  Command: kubectl get pods -n prod -l app=payment
  Expected: All pods show "Running" and "Ready 1/1"

  Command: curl https://api.prod.opsbr.io/api/payments/health
  Expected: 200 OK (should respond in < 1 second)

Time: 2-3 minutes

If service recovered:
  ✓ Send "RESOLVED" message to #incidents
  ✓ Continue to Post-Incident (Step 6)

If service NOT recovered:
  → Go to Step 4
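
The "wait, then verify" part of this step is easy to get wrong under pressure, so one option is a small polling helper. This is a sketch: `wait_until` is a hypothetical name, and the attempt count and delay in the usage comment are assumptions to tune for your service.

```shell
#!/usr/bin/env bash
# Sketch: retry a verification command instead of sleeping a fixed 30 seconds.

# wait_until ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS is used up.
wait_until() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "recovered after $i attempt(s)"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$(( i + 1 ))
  done
  echo "not recovered after $attempts attempt(s)"
  return 1
}

# Real usage (needs cluster and network access):
#   kubectl rollout restart deployment payment -n prod
#   kubectl rollout status deployment/payment -n prod --timeout=120s
#   wait_until 6 5 curl -sf -o /dev/null --max-time 1 \
#     https://api.prod.opsbr.io/api/payments/health
```

A non-zero exit from `wait_until` is the signal to move on to the next fix rather than retrying the restart.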

7. Escalation (Who to Page)

STEP 5: ESCALATION (If still broken after 10 minutes)

If you've tried all steps and service is still down:

Page: @payment-team-lead
Message: "Payment Service still down after 10 minutes.
          Tried: service restart, database check.
          Last error: [paste error from logs]"

Time: < 1 minute

8. Post-Incident (What to Do When Fixed)

STEP 6: POST-INCIDENT

Once service is recovered:

1. Post to #incidents
   Message: "Payment Service RESOLVED - was due to [root cause]"

2. Create incident ticket
   Go to: PagerDuty or Incident.io
   Include:
     - When it started
     - Root cause
     - How you fixed it
     - Time to fix

3. Schedule post-mortem
   When: Within 24 hours
   Who: Payment team + on-call who fixed it

Do NOT skip this step - post-mortems prevent recurrence

Runbook Template

Here's a template you can copy and fill in:

# [SERVICE NAME] - [ISSUE TYPE] RUNBOOK

## What Triggers This Runbook

You should use this runbook when:
  ✓ [Alert name or symptom 1]
  ✓ [Alert name or symptom 2]
  ✓ [Customer report description]

You should NOT use this runbook when:
  ✗ [Different issue]
  ✗ [Different issue]

## Quick Triage (1 minute)

Is this the right runbook?

Q: [Ask a yes/no question that identifies this specific issue]
  YES → Continue
  NO → Use different runbook

## Root Cause Diagnosis (5 minutes)

STEP 1: Check [specific system]
  Command: [exact command to run]
  Expected result: [what should you see]
  If different: Go to Step [X]

STEP 2: Check [related system]
  Command: [exact command to run]
  Expected result: [what should you see]
  If different: Go to Step [X]

## Recovery Steps (In order of likelihood)

STEP 3: FIX #1 - [Most likely fix]
  Action: [Be specific]
  Command: [exact command]
  Verify: [how to check if it worked]
  Time: [how long this takes]
  If works: Go to Post-Incident
  If doesn't work: Go to Step 4

STEP 4: FIX #2 - [Second most likely fix]
  Action: [Be specific]
  Command: [exact command]
  Verify: [how to check if it worked]
  Time: [how long this takes]
  If works: Go to Post-Incident
  If doesn't work: Go to Step 5

STEP 5: ESCALATION
  If still broken after 10 minutes: Page [person/team]
  Message: [what to tell them]

## Post-Incident

1. Post in #incidents: "RESOLVED - [root cause]"
2. Create ticket in PagerDuty
3. Schedule post-mortem within 24 hours
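
To make starting a new runbook cheap, the template can be generated from a shell function. This sketch prints a condensed skeleton of the template above; `new_runbook` and the output file name in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print a runbook skeleton with the title filled in.

# new_runbook SERVICE ISSUE
new_runbook() {
  local service=$1 issue=$2
  cat <<EOF
# $service - $issue RUNBOOK

## What Triggers This Runbook
  ✓ [Alert name or symptom 1]
  ✗ [Different issue]

## Quick Triage (1 minute)
Q: [Yes/no question that identifies this specific issue]

## Root Cause Diagnosis (5 minutes)
STEP 1: Check [specific system]

## Recovery Steps (In order of likelihood)
STEP 3: FIX #1 - [Most likely fix]

STEP 5: ESCALATION
  If still broken after 10 minutes: Page [person/team]

## Post-Incident
1. Post in #incidents: "RESOLVED - [root cause]"
EOF
}

# Usage:
#   new_runbook "PAYMENT SERVICE" "500 ERRORS" > payment-500-errors.md
```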

Real-World Runbook Examples

Example 1: Database Connection Pool Exhaustion

# DATABASE CONNECTION POOL EXHAUSTION RUNBOOK

## What Triggers This Runbook

You see:
  ✓ Alert: "Database connections > 90% of max (4,500/5,000)"
  ✓ Multiple services getting "Connection refused" errors
  ✓ "No more connections" error in application logs

## Quick Triage (1 minute)

Q: Is database responding to direct connections?
  Command: psql -h prod-db.internal.svc -U postgres -c "SELECT 1"
  Expected: Returns "1" in < 1 second
  If fails: Database is down, use Database Down runbook

Q: Can you connect to database from admin pod?
  Command: kubectl exec -it admin-pod -- psql -h prod-db -U postgres -c "SELECT count(*) FROM pg_stat_activity"
  Expected: Returns a number (your connection count)
  If timeout: Connection pool is exhausted

## Root Cause (3-5 minutes)

STEP 1: Check which service is holding connections
  Command: psql -h prod-db -U postgres -c "
    SELECT datname, usename, application_name, count(*)
    FROM pg_stat_activity
    GROUP BY datname, usename, application_name
    ORDER BY count DESC"

  Look for: Which application has the most connections?

STEP 2: Check if connections are stuck
  Command: psql -h prod-db -U postgres -c "
    SELECT pid, usename, state, state_change, query
    FROM pg_stat_activity
    WHERE state != 'idle' AND state_change < now() - interval '10 min'
    ORDER BY state_change"

  Look for: Queries that have been running for > 10 minutes (these are stuck)

## Recovery

STEP 3: Kill stuck connections (if any)
  Command: psql -h prod-db -U postgres -c "
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE state_change < now() - interval '10 min'
    AND pid <> pg_backend_pid()"

  Time: < 1 minute
  Verify: Check connection count again (should be lower)

STEP 4: Restart service that's holding connections
  If Step 1 showed: Payment Service has 3,000+ connections
  Command: kubectl rollout restart deployment payment -n prod
  Wait: 30 seconds
  Verify: Connections should drop significantly

STEP 5: Monitor recovery
  Command: psql -h prod-db -U postgres -c "SELECT count(*) FROM pg_stat_activity"
  Target: Should drop below 2,000 within 2 minutes

## If still broken after 5 minutes
  Page: @database-team-lead
  Message: "Database connection pool exhausted. Killed stuck queries
            and restarted [service]. Still at [current connection count]."
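
The alert threshold and the Step 5 monitoring check both come down to comparing a connection count against the maximum. A minimal sketch, assuming hypothetical helper names `pool_usage_pct` and `pool_alert` and a default threshold of 90% taken from the alert above:

```shell
#!/usr/bin/env bash
# Sketch: express the connection count as a percentage of the maximum.

# pool_usage_pct CURRENT MAX: prints the integer percentage in use.
pool_usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# pool_alert CURRENT MAX [THRESHOLD_PCT]
# Returns 1 when usage is at or above the threshold (default 90).
pool_alert() {
  local pct threshold=${3:-90}
  pct=$(pool_usage_pct "$1" "$2")
  if [ "$pct" -ge "$threshold" ]; then
    echo "ALERT: $1/$2 connections (${pct}%)"
    return 1
  fi
  echo "OK: $1/$2 connections (${pct}%)"
}

# Real usage (needs database access):
#   current=$(psql -h prod-db -U postgres -At -c "SELECT count(*) FROM pg_stat_activity")
#   max=$(psql -h prod-db -U postgres -At -c "SHOW max_connections")
#   pool_alert "$current" "$max"
```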

Example 2: High CPU Causing Timeouts

# HIGH CPU - SERVICE TIMEOUT RUNBOOK

## What Triggers This Runbook

You see:
  ✓ Alert: "Service CPU > 90%"
  ✓ Alert: "Request latency > 5 seconds" (normal is < 100ms)
  ✓ Customers report: "Requests timing out"

## Quick Triage (1 minute)

Q: Is this sustained high CPU or temporary spike?
  Command: kubectl top pods -n prod | grep [service-name]
  Look at: Is CPU staying high or dropping?

  Sustained high → Likely resource leak or bad query
  Temporary → Likely traffic spike or background job

## Root Cause (5 minutes)

STEP 1: Check what's using CPU
  Command: kubectl logs deployment/[service] -n prod --since=10m | grep -iE "slow|cpu|query"
  Look for: Slow queries, background jobs, memory pressure

STEP 2: Check if it's a background job
  Command: kubectl describe pod [pod-name] -n prod | grep -A 20 "Containers"
  Look for: Are background jobs running right now?

STEP 3: Check recent deployments
  Command: kubectl rollout history deployment/[service] -n prod
  Question: Was a deployment made in last 30 minutes?

## Recovery

STEP 4: FIX #1 - Kill noisy background job
  If you found background job using CPU:
  Command: kubectl logs [pod] -n prod | grep -A 5 "[job-name]"
  Kill the job or let it complete (usually 1-2 min)
  Verify: CPU drops back to normal

STEP 5: FIX #2 - Rollback recent deployment
  If deployment was made < 30 min ago:
  Command: kubectl rollout undo deployment/[service] -n prod
  Wait: 30 seconds
  Verify: CPU drops, latency improves

STEP 6: FIX #3 - Scale up service horizontally
  If neither of above works:
  Command: kubectl scale deployment [service] --replicas=5 -n prod
  Wait: 60 seconds for new pods to start
  Verify: CPU per pod drops as load balances across more instances

## If still high after 5 minutes
  Page: @[service]-team
  Message: "High CPU causing timeouts. Tried: kill jobs, rollback deploy,
            scale to 5 replicas. Still at 90% CPU."
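
For the triage question in this runbook (sustained or spike?), it helps to filter `kubectl top` output down to the pods that are actually hot. A sketch, assuming a hypothetical helper `hot_pods` and a limit given in millicores; for example, 900 would correspond to 90% of a 1000m CPU limit, which is an assumption about your pod spec.

```shell
#!/usr/bin/env bash
# Sketch: list pods at or above a CPU limit from `kubectl top pods` output.

# hot_pods LIMIT_MILLICORES
# Reads `kubectl top pods --no-headers` lines (NAME CPU MEMORY) on stdin.
hot_pods() {
  awk -v limit="$1" '
    {
      cpu = $2
      if (cpu ~ /m$/) { sub(/m$/, "", cpu); cpu = cpu + 0 }  # "950m" -> 950
      else            { cpu = cpu * 1000 }                   # "2" cores -> 2000
      if (cpu >= limit) print $1, cpu "m"
    }'
}

# Real usage (needs cluster access). Run it a few times, 30 seconds apart:
# the same pod showing up every time suggests sustained load, not a spike.
#   kubectl top pods -n prod --no-headers | hot_pods 900
```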

How to Make Runbooks Your Team Will Use

1. Make Them Discoverable

Problem: Nobody uses runbooks because they can't find them

Solution: Store in one place everyone knows about (Confluence, Notion, GitHub Wiki). Link from PagerDuty incident directly to runbook. Link from alert messages: "Got this alert? See runbook: [link]". Post in Slack #incidents with runbook links.

2. Test Them Regularly

Problem: Runbook says "restart service" but that command was deprecated 6 months ago

Solution: Monthly, run a chaos engineering test (intentionally break something). Follow the runbook step-by-step. Update anything that doesn't work. Version your runbooks (v1.0, v1.1, etc.).

3. Get Feedback After Every Incident

Problem: Runbooks get better or worse without anyone knowing

Solution: After every incident, ask "Was the runbook helpful?" If yes, note what worked. If no, ask what was missing. Update runbook based on feedback.

4. Keep Them Updated

Problem: Service architecture changes, runbooks don't

Solution: Assign ownership ("Payment team owns payment service runbooks"). Review quarterly. Update when the service changes, new common issues are discovered, or tools/commands change.
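
The quarterly review can be enforced with a small staleness check. This is a sketch under assumptions: `is_stale` is a hypothetical helper, 90 days stands in for "quarterly", and the usage comment assumes runbooks live in a Git repository under a `runbooks/` directory.

```shell
#!/usr/bin/env bash
# Sketch: flag a runbook whose last review is older than the review window.

# is_stale LAST_REVIEWED_EPOCH NOW_EPOCH [MAX_AGE_DAYS]
# Returns 0 (stale) when the age in whole days exceeds MAX_AGE_DAYS (default 90).
is_stale() {
  local last=$1 now=$2 max_days=${3:-90}
  local age_days=$(( (now - last) / 86400 ))
  [ "$age_days" -gt "$max_days" ]
}

# Real usage (assumes Git-tracked runbooks):
#   for f in runbooks/*.md; do
#     last=$(git log -1 --format=%ct -- "$f")
#     if is_stale "$last" "$(date +%s)"; then echo "REVIEW NEEDED: $f"; fi
#   done
```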

5. Make Them Easy to Read

Problem: Runbook full of technical jargon, engineer doesn't understand

Solution: Use exact commands (copy/paste ready). Explain what each command does. Test with someone who's NOT an expert.


Common Runbook Mistakes to Avoid

Mistake | Why It's Bad | Fix
Too long (10+ pages) | Nobody reads it during incident | Keep to 2 pages max
Generic instructions | Doesn't actually help | Make specific to your system
Outdated commands | Commands don't work, engineer gives up | Test monthly
Missing escalation | Engineer doesn't know when to page for help | Always include escalation step
No decision tree | Engineer doesn't know if it's the right runbook | Start with "IS THIS YOUR RUNBOOK?"
Too technical | Engineers at 2 AM don't understand | Use simple language
No post-incident step | Root cause never fixed, incident repeats | Always include prevention step
No owner | Runbooks get worse over time | Assign team ownership

Runbook Metrics to Track

If you want to improve your runbooks, track:

  • How often is runbook used per month?
  • Did incident get resolved (yes/no)?
  • Time from engineer opening runbook to incident resolved?
  • Did engineer skip to escalation? (means runbook didn't help)

Targets:

  • 80%+ of incidents should have runbook (coverage)
  • 70%+ of runbook-guided incidents should resolve (effectiveness)
  • MTTR with runbook: < 15 minutes
  • < 10% of runbooks should require escalation

If metrics are bad:

  • Runbook is too confusing (rewrite)
  • Runbook is for wrong issue type (add decision tree)
  • Runbook is outdated (test and update)
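
If incidents are logged in a simple CSV, the coverage and effectiveness numbers above can be computed with awk. This is a sketch: the function name `runbook_metrics` and the column layout (`id,runbook_used,resolved,minutes`) are assumptions, not an export format of any particular tool.

```shell
#!/usr/bin/env bash
# Sketch: compute runbook coverage, effectiveness, and average minutes.

# runbook_metrics: reads CSV on stdin with a header row and the columns
#   id,runbook_used(yes/no),resolved(yes/no),minutes
runbook_metrics() {
  awk -F, '
    NR == 1 { next }                              # skip header row
    { total++ }
    $2 == "yes" { used++; minutes += $4 }         # incidents with a runbook
    $2 == "yes" && $3 == "yes" { resolved++ }     # resolved via the runbook
    END {
      if (total == 0 || used == 0) { print "no data"; exit }
      printf "coverage=%d%% effectiveness=%d%% avg_minutes=%d\n",
             used * 100 / total, resolved * 100 / used, minutes / used
    }'
}

# Usage:
#   runbook_metrics < incidents.csv
```

Percentages are truncated to whole numbers, which is precise enough for comparing against the targets above.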

Tools for Managing Runbooks

Confluence

Pros: Great for collaboration, good search, version control
Cons: Can get cluttered, hard to find specific runbooks
Best for: Teams already using Atlassian stack

Notion

Pros: Beautiful formatting, easy to use, fast
Cons: Can be slow at scale, limited integrations
Best for: Small-to-medium teams wanting simplicity

GitHub Wiki

Pros: Version control built-in, easy to link, developers already know it
Cons: Can be hard to organize, limited search
Best for: Engineering-first organizations

PagerDuty + Runbooks

Pros: Integrated with incidents, easy to link alerts
Cons: Runbooks limited to PagerDuty platform
Best for: PagerDuty users

Recommendation

Start: GitHub Wiki or Notion (easy, free, good enough)
Scale: Confluence (better for large organizations)
Link: From PagerDuty alerts and incidents (so engineers find them)


Sample Runbook Library

Your organization should have runbooks for:

Tier 1 (Critical - first 3 months):

  • [ ] Database down
  • [ ] Payment service down
  • [ ] API Gateway down
  • [ ] Auth service down
  • [ ] Cache service down

Tier 2 (High Priority - months 3-6):

  • [ ] High database CPU
  • [ ] High memory usage
  • [ ] Connection pool exhausted
  • [ ] Disk full
  • [ ] Network connectivity issues

Tier 3 (Medium Priority - months 6+):

  • [ ] Slow database queries
  • [ ] High error rates
  • [ ] Long deployment times
  • [ ] Certificate expiration
  • [ ] Dependency timeout

Total: 15-20 runbooks for a typical SaaS company


Runbook ROI

If you implement runbooks properly:

Cost:

  • Time to write 10 runbooks: 40-60 hours = $3,200-$4,800 (at $80/hour)
  • Time to maintain (2 hours/month): $160/month = $1,920/year

Benefit:

  • 50 incidents per month
  • 30 minutes saved per incident (with good runbook)
  • 1,500 minutes/month = 25 hours/month
  • 300 hours/year = $24,000/year (at $80/hour)

ROI:

  • Year 1 (upper-bound writing cost): ($24,000 - $4,800 - $1,920) / ($4,800 + $1,920) = 257% ROI
  • Year 2+: $24,000 / $1,920 = 12.5x return

Result: After the first year, every $1 spent on runbooks saves about $12


Conclusion: Good Runbooks = Faster Incident Response

Good runbooks reduce MTTR by 20-40% and make engineers' lives easier. Bad runbooks waste time and confuse engineers.

Start with the critical 5 runbooks, test them, improve based on feedback, then add more.

Your team will thank you when they can resolve incidents 30% faster.

Start this week:

  1. Pick your most common incident type
  2. Write a 2-page runbook using the template above
  3. Test it with your team
  4. Publish it in your runbook location
  5. Link it from your alert/on-call tool
  6. Get feedback after next incident using this runbook
  7. Update based on feedback

In 3 months, you'll have 5-10 tested, proven runbooks. In 6 months, you'll have a complete library that your team actually uses.

Your MTTR will drop 20-40%. Your on-call morale will improve. Your team will sleep better.
