Incident Severity Levels Explained
How to define SEV0 through SEV4 for consistent incident triage and response.
Why Severity Levels Matter
Severity levels create a shared language for incident urgency. Without them, every incident feels equally urgent (or equally ignorable). With them, your team knows exactly how to respond:
- Appropriate response: A SEV0 needs all hands. A SEV3 can wait until Monday.
- Clear communication: Stakeholders know what "SEV1" means without explanation.
- Consistent triage: Different on-call engineers make the same classification decisions.
- Resource planning: You can staff appropriately for different severity levels.
Severity Level Definitions
Most organizations use a four- or five-level system. Here's a common five-level framework, SEV0 (most severe) through SEV4 (least severe), that works for most engineering teams:
SEV0: Complete service outage or data loss affecting all users
Examples
- Website completely down
- Database corruption with data loss
- Security breach with active exploitation
- Payment processing completely broken
Response Expectations
SEV1: Major functionality broken, affecting most users
Examples
- Core feature completely unavailable
- Significant performance degradation (>10s latency)
- Authentication broken for all users
- Major revenue impact
Response Expectations
SEV2: Functionality degraded, affecting a subset of users
Examples
- Feature broken for a specific user segment
- Non-critical integrations failing
- Performance issues affecting some regions
- Workaround available but inconvenient
Response Expectations
SEV3: Minor issues with limited impact
Examples
- Cosmetic bugs
- Non-critical feature issues
- Minor performance degradation
- Issues affecting <1% of users
Response Expectations
SEV4: Tracking items, tech debt, or improvements
Examples
- Deprecation warnings
- Non-urgent technical debt
- Feature requests arising from incidents
- Documentation gaps discovered
Response Expectations
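If you want the scale available in tooling (alerts, ticket templates, dashboards), one option is to encode it as an ordered enum. The sketch below assumes the five definitions above map in order to SEV0 through SEV4; the `Severity` and `more_severe` names are illustrative, not part of any standard.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Five-level scale; a lower number means a more severe incident."""
    SEV0 = 0  # complete outage or data loss affecting all users
    SEV1 = 1  # major functionality broken, affecting most users
    SEV2 = 2  # degraded functionality for a subset of users
    SEV3 = 3  # minor issues with limited impact
    SEV4 = 4  # tracking items, tech debt, improvements


def more_severe(a: Severity, b: Severity) -> Severity:
    """Return whichever of the two levels is more severe (the lower number)."""
    return min(a, b)
```

Using an integer-backed enum keeps comparisons simple: `Severity.SEV1 < Severity.SEV3` is true because SEV1 is the more severe level.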
Triage Criteria
When an incident occurs, use these questions to determine severity:
Impact Assessment
- How many users are affected? All users → higher severity. Small subset → lower.
- Which functionality is broken? Core revenue path → higher. Nice-to-have feature → lower.
- Is there a workaround? No workaround → higher. Easy workaround → lower.
- Is data at risk? Data loss or a security breach → automatic SEV0/SEV1.
Urgency Assessment
- Is the situation getting worse? Spreading impact → higher severity.
- Are SLAs at risk? Burning error budget → higher severity.
- Is there external visibility? Customer complaints or press → higher severity.
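The impact and urgency questions above can be turned into a rough first-pass suggestion. This is a minimal sketch, not a definitive rubric: the `Signals` fields, the `suggest_severity` name, and every numeric cut-off (95%, 50%, 1%) are assumptions chosen for illustration, and you would replace them with your own thresholds. SEV4 is left out because tracking items are not usually triaged as live incidents.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    users_affected: float          # fraction of users affected, 0.0 to 1.0
    core_path_broken: bool         # is a core revenue path broken?
    workaround_available: bool     # is there an easy workaround?
    data_at_risk: bool             # data loss or a security issue
    getting_worse: bool = False    # is the impact spreading?
    sla_at_risk: bool = False      # is error budget burning?
    externally_visible: bool = False  # customer complaints or press


def suggest_severity(s: Signals) -> int:
    """Return a suggested SEV number from 0 (most severe) to 3."""
    # Impact: start from how many users are affected (assumed cut-offs).
    if s.users_affected >= 0.95:
        sev = 0
    elif s.users_affected >= 0.5:
        sev = 1
    elif s.users_affected >= 0.01:
        sev = 2
    else:
        sev = 3

    # A broken core path with no workaround raises severity one step.
    if s.core_path_broken and not s.workaround_available:
        sev -= 1

    # Urgency: any worsening, SLA, or visibility signal raises it one step.
    if s.getting_worse or s.sla_at_risk or s.externally_visible:
        sev -= 1

    # Data loss or security issues are never lower than SEV1.
    if s.data_at_risk:
        sev = min(sev, 1)

    # Clamp to the SEV0..SEV3 range.
    return max(0, min(sev, 3))
```

Treat the result as a starting point for the on-call engineer, who can still override it in either direction.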
Pro Tip: When in Doubt, Go Higher
It's easier to downgrade a SEV1 to a SEV2 than to catch up after treating a real SEV1 as a SEV3. Encourage your team to err on the side of higher severity. You can always adjust.
Customizing for Your Organization
The framework above is a starting point. Customize it for your context:
- Define "most users": Is that 50%? 80%? Be specific.
- List your critical paths: What constitutes "core functionality" for you?
- Set SLA thresholds: At what error rate do you escalate?
- Consider business context: An issue that is normally a SEV2 might be a SEV1 during Black Friday.
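One way to make these customizations explicit is to keep them in a small policy object that tooling and runbooks can share. The sketch below is illustrative only: the `SeverityPolicy` name, its fields, and the default values (80%, a 5% error rate, the `checkout` and `login` paths) are hypothetical placeholders, not recommendations.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SeverityPolicy:
    # What "most users" means for this organization (assumed value).
    most_users_fraction: float = 0.8
    # Hypothetical critical paths; list your own core functionality here.
    critical_paths: FrozenSet[str] = frozenset({"checkout", "login"})
    # Error rate at which an incident is escalated (assumed value).
    escalate_error_rate: float = 0.05
    # True during business-critical periods such as Black Friday.
    peak_period: bool = False


def is_most_users(policy: SeverityPolicy, affected_fraction: float) -> bool:
    return affected_fraction >= policy.most_users_fraction


def should_escalate(policy: SeverityPolicy, error_rate: float) -> bool:
    return error_rate >= policy.escalate_error_rate


def adjust_for_context(policy: SeverityPolicy, sev: int) -> int:
    """Raise severity one step during a peak period, never past SEV0."""
    return max(0, sev - 1) if policy.peak_period else sev
```

With `SeverityPolicy(peak_period=True)`, `adjust_for_context(policy, 2)` returns 1, which mirrors the Black Friday example above.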
Common Mistakes
- Severity inflation: If everything is a SEV1, nothing is. Reserve SEV0/SEV1 for real emergencies.
- Severity deflation: Downplaying incidents to avoid escalation. This hides real problems.
- Arguing about severity during incidents: Pick one, move on. Argue later in the post-mortem.
- Not adjusting mid-incident: Severity can change as you learn more. Update it.
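To make mid-incident adjustments cheap and reviewable, it can help to record each severity change with a reason, so the post-mortem can discuss the classification without relying on memory. This is a minimal sketch under assumed names (`Incident`, `set_severity`); a real incident tool would likely store this in its own timeline.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple


@dataclass
class Incident:
    title: str
    severity: int  # 0 (most severe) to 4
    # Each entry: (timestamp, old severity, new severity, reason)
    history: List[Tuple[datetime, int, int, str]] = field(default_factory=list)

    def set_severity(self, new_severity: int, reason: str) -> None:
        """Change severity and log the change; a no-op if unchanged."""
        if not 0 <= new_severity <= 4:
            raise ValueError("severity must be between 0 and 4")
        if new_severity == self.severity:
            return
        self.history.append(
            (datetime.now(timezone.utc), self.severity, new_severity, reason)
        )
        self.severity = new_severity
```

For example, an incident opened as SEV3 can be raised with `incident.set_severity(1, "impact spreading to all regions")`, and the reason is kept for later review.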
Document and Train
Your severity definitions are only useful if everyone knows them:
- Write them down in your runbooks or wiki
- Include examples specific to your systems
- Review classifications in post-mortems ("Was SEV1 appropriate?")
- Train new team members during onboarding