Incident Severity Levels: How to Define SEV0, SEV1, SEV2, and SEV3
Jasmine Decker

Two engineers look at the same production alert and disagree on whether it's a SEV1 or a SEV2. One wants to wake up the VP of Engineering. The other wants to handle it quietly and write it up in the morning.
Both are wrong - not because of their technical judgment, but because their organization hasn't defined what SEV1 means clearly enough for two people to reach the same answer from the same information.
Severity level definitions are foundational to incident management. Get them wrong and every downstream process - escalation, communication, postmortems, on-call burden - inherits the inconsistency.
Severity vs. Priority: The Distinction That Matters
Before building a severity framework, clarify what severity is and isn't.
Severity describes the technical impact of an incident - how bad is the damage to the system and its users? It's an objective assessment of current state.
Priority describes the urgency of response - how quickly should we act? It factors in severity but also considers business context: time of day, available resources, customer impact, SLO status.
A SEV2 incident on Black Friday at peak traffic might be prioritized P1 and get an all-hands response. The same SEV2 at 3am on a Sunday affecting a small subset of users might be prioritized P2 and wait for morning.
Keeping severity and priority distinct gives you cleaner frameworks for both. Severity is technical and objective. Priority is contextual and involves judgment.
This guide focuses on severity. Most organizations use four levels, labeled SEV0 through SEV3 (or P1 through P4 where a single scale covers both severity and priority). Some add a fifth. The number matters less than the precision of the definitions.
The Four Severity Levels
SEV0 - Critical / Catastrophic
SEV0 is reserved for the worst incidents - ones that require immediate all-hands response regardless of time, day, or competing priorities.
Defining characteristics:
- Complete service outage affecting all or most users
- Active data loss or corruption in production
- Security breach with customer data exposure
- System behavior that could cause irreversible harm to data integrity or customer trust
Examples:
- Payment processing completely down
- Database is corrupting writes
- Production credentials exposed in a public repository
- Full platform outage - users cannot log in
Response expectations:
- Immediate page, 24/7
- All engineering leadership notified within 10 minutes
- Dedicated incident commander assigned
- Customer communication within 15 minutes
- Status page updated immediately
- Executive team notified
SEV0 should be rare. If you have SEV0 incidents monthly, either your definitions are too broad or your systems have fundamental reliability problems. For most organizations, fewer than four per year is a reasonable target.
SEV1 - Major Incident
SEV1 is a significant incident requiring urgent response and active stakeholder communication, but not necessarily an all-hands situation.
Defining characteristics:
- Core functionality unavailable or severely degraded for a meaningful percentage of users
- Service degraded enough to affect revenue or SLO materially
- Significant feature unavailable with no workaround
- Error rates high and rising
Examples:
- API error rate above 10% and increasing
- Core feature unavailable for all users
- Service completely unavailable in one geographic region
- Checkout flow severely degraded but not completely down
- Database query latency 10x normal for 30+ minutes
Response expectations:
- Page immediately, 24/7
- Acknowledge within 5 minutes
- Incident commander assigned
- Stakeholder communication within 20 minutes
- Status page updated
- Regular updates every 30 minutes until resolved
- Postmortem required
SEV2 - Moderate Incident
SEV2 represents meaningful degradation that affects users but doesn't threaten the core function of the service or represent an immediate SLO risk.
Defining characteristics:
- Non-critical feature unavailable
- Performance degraded but service functional
- Subset of users affected
- Single integration failing with graceful degradation in place
- Issue stable (not spreading) with known workaround
Examples:
- Reporting feature unavailable for all users
- API latency elevated (2-3x normal) but stable
- One integration (e.g., Slack notifications) failing while core product works
- Mobile app affected while web app works normally
- <5% error rate, stable
Response expectations:
- Page during business hours; Slack notification at night
- Acknowledge within 15-20 minutes
- Assigned engineer
- Update stakeholders every hour
- Status page update if customer-facing impact
- Postmortem if learning opportunity exists
SEV3 - Minor Incident
SEV3 is a low-impact issue that doesn't require immediate response and can be handled in normal business hours.
Defining characteristics:
- Minimal or no user impact
- Internal tools or non-critical services affected
- Cosmetic or minor functional issues
- Affects a very small number of users
- Issue stable with existing workaround
Examples:
- Internal dashboard unavailable
- Non-critical scheduled job failed
- UI rendering bug that doesn't affect functionality
- Single user reporting an isolated issue
- Monitoring alert fired but system is healthy
Response expectations:
- No page - appears in queue
- Logged as a ticket
- Handled in normal sprint cycle
- No required customer communication
- Lightweight postmortem optional
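The response expectations for the four levels above can be kept next to the definitions as data, so tooling and people read the same table. The following is a minimal sketch, not a reference implementation: the `ResponsePolicy` type, its field names, and `should_page` are hypothetical, `None` marks a value the lists above do not state, and the SEV2 acknowledgment target uses the upper end of the 15-20 minute range.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResponsePolicy:
    paging: str                    # "always", "business_hours", or "never"
    ack_minutes: Optional[int]     # acknowledgment target; None if not stated above
    update_minutes: Optional[int]  # stakeholder update cadence; None if not stated
    postmortem: str


# Values copied from the response expectations listed above (assumed field names).
RESPONSE_POLICIES = {
    "SEV0": ResponsePolicy("always", None, None, "required"),
    "SEV1": ResponsePolicy("always", 5, 30, "required"),
    "SEV2": ResponsePolicy("business_hours", 20, 60, "if learning opportunity exists"),
    "SEV3": ResponsePolicy("never", None, None, "lightweight, optional"),
}


def should_page(severity: str, business_hours: bool) -> bool:
    """Return True if on-call should be paged for this severity right now."""
    paging = RESPONSE_POLICIES[severity].paging
    if paging == "always":
        return True
    if paging == "business_hours":
        return business_hours
    return False  # "never": the incident goes to the queue instead
```

With this table, `should_page("SEV2", business_hours=False)` returns `False`, which corresponds to the Slack notification at night described for SEV2.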
How to Write Definitions That Actually Work
The difference between severity frameworks that get used and ones that gather dust is specificity. Vague definitions produce inconsistent classification.
Bad definition: "SEV1: Major impact to users."
Good definition: "SEV1: Core product functionality unavailable or severely degraded for >10% of active users, OR API error rate above 10% for more than 5 consecutive minutes, OR any confirmed data loss, OR payment processing degraded by more than 20%."
The good definition passes a test: two engineers with the same monitoring data should reach the same severity classification independently. If they can't, the definition needs more specificity.
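One way to check that a definition is specific enough is to see whether it can be written as a predicate over monitoring data. Here is a sketch of the "good definition" above, assuming a hypothetical `Snapshot` structure whose field names are illustrative only; the thresholds are the ones quoted in that definition.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical monitoring snapshot; field names are assumptions."""
    core_users_affected_pct: float       # % of active users with core functionality unavailable or degraded
    minutes_api_errors_above_10pct: int  # consecutive minutes with API error rate above 10%
    confirmed_data_loss: bool
    payment_degradation_pct: float       # % degradation in payment processing


def meets_sev1(s: Snapshot) -> bool:
    """True if any one of the four SEV1 conditions in the definition holds."""
    return (
        s.core_users_affected_pct > 10
        or s.minutes_api_errors_above_10pct > 5
        or s.confirmed_data_loss
        or s.payment_degradation_pct > 20
    )


# Two engineers feeding in the same snapshot get the same answer.
quiet = Snapshot(2.0, 0, False, 0.0)
noisy = Snapshot(2.0, 7, False, 0.0)
print(meets_sev1(quiet), meets_sev1(noisy))
```

The bad definition, "major impact to users", cannot be written this way, which is a quick signal that it will be applied inconsistently.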
The impact dimensions to define explicitly
For each severity level, specify:
User reach: What percentage of users are affected? (All, >10%, <10%, <1%, internal only)
Functional impact: Is core functionality unavailable, degraded, or just non-critical features affected?
Data integrity: Is data being lost, corrupted, or just delayed?
Revenue exposure: Is the affected flow revenue-generating?
SLO status: What is the error budget burn rate? Fast burn = higher severity.
Trend: Is the situation stable, worsening, or improving?
A severity determination that checks all six dimensions is more consistent than one based on gut feel.
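A lightweight way to enforce the six-dimension check is to record each dimension explicitly and flag the ones nobody has assessed yet. The sketch below assumes a hypothetical `ImpactAssessment` record; the field names mirror the six dimensions above, and the example values are illustrative.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ImpactAssessment:
    """One field per dimension listed above; None means not yet assessed."""
    user_reach: Optional[str] = None         # e.g. "all", ">10%", "<1%", "internal only"
    functional_impact: Optional[str] = None  # e.g. "core unavailable", "degraded"
    data_integrity: Optional[str] = None     # e.g. "lost", "corrupted", "delayed", "intact"
    revenue_exposure: Optional[bool] = None  # is the affected flow revenue-generating?
    slo_status: Optional[str] = None         # e.g. "fast burn", "healthy"
    trend: Optional[str] = None              # "stable", "worsening", "improving"


def missing_dimensions(a: ImpactAssessment) -> List[str]:
    """Names of dimensions that still have no recorded answer."""
    return [f.name for f in fields(a) if getattr(a, f.name) is None]


partial = ImpactAssessment(user_reach=">10%", revenue_exposure=False, trend="worsening")
print(missing_dimensions(partial))
# → ['functional_impact', 'data_integrity', 'slo_status']
```

Note that `revenue_exposure=False` counts as assessed: "no revenue exposure" is an answer, while `None` means nobody looked.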
Severity and SLO Burn Rate
The most quantitative input to severity classification is SLO burn rate - how quickly the incident is consuming your error budget.
A framework based on burn rate:
- SEV0: Error budget will be exhausted within 1 hour at current burn rate
- SEV1: Error budget will be exhausted within 8 hours at current burn rate
- SEV2: Error budget will be exhausted within the current calendar month at current burn rate
- SEV3: Error budget remains healthy; incident is within acceptable parameters
Burn rate-based severity is more objective than impact estimates alone. It connects severity directly to your SLO commitments and makes the business consequence of each incident explicit.
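To make the arithmetic concrete, here is a small sketch of the burn-rate thresholds above. It assumes a 30-day (720-hour) SLO window and the common convention that a burn rate of 1 consumes the whole budget exactly over that window; both function names and all parameters are hypothetical.

```python
import math


def hours_to_exhaustion(remaining_fraction: float, burn_rate: float,
                        window_hours: float = 720.0) -> float:
    """Hours until the error budget hits zero at the current burn rate.

    remaining_fraction: share of the window's budget still unspent (0.0-1.0).
    burn_rate: multiple of the sustainable rate (1.0 spends the full budget
    over exactly one window). Assumed convention, not taken from the article.
    """
    if burn_rate <= 0:
        return math.inf  # nothing is being consumed
    return remaining_fraction * window_hours / burn_rate


def severity_from_burn(hours_left: float, hours_left_in_month: float) -> str:
    """Map time-to-exhaustion onto the four thresholds listed above."""
    if hours_left <= 1:
        return "SEV0"
    if hours_left <= 8:
        return "SEV1"
    if hours_left <= hours_left_in_month:
        return "SEV2"
    return "SEV3"


# Full budget remaining, burning at 100x the sustainable rate:
# 1.0 * 720 / 100 = 7.2 hours, which falls in the SEV1 band.
h = hours_to_exhaustion(1.0, 100.0)
print(h, severity_from_burn(h, hours_left_in_month=240.0))
```

The same incident with only 10% of the budget left would exhaust it in 0.72 hours and land in SEV0, which is why remaining budget matters as much as the burn rate itself.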
When OpsBrief surfaces an incident, it shows SLO status alongside the incident - so the engineer classifying severity has the burn rate context immediately, not after 10 minutes of checking Datadog. Classification becomes faster and more consistent.
SEV vs. P: When to Use Each
Many organizations use both severity (SEV) and priority (P) frameworks. The convention:
- SEV levels describe the technical incident (set by on-call engineer based on objective criteria)
- P levels describe the response urgency (informed by SEV but adjusted for business context)
The workflow: an engineer detects an incident, classifies it as SEV2 based on technical criteria, then assigns it P1 because it's during peak business hours and the affected feature is under heavy use. Another engineer might prioritize the same SEV2 as P2 at 3am on a Sunday.
For smaller organizations, a single framework is usually cleaner. Use P1-P4 with definitions that incorporate both technical severity and business context. The dual framework is valuable at larger organizations where technical severity and business priority genuinely need separate governance.
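The two-step workflow can be sketched as a baseline mapping plus a context adjustment. Everything in this example is an assumption made for illustration: the baseline table, the one-step bump, and the `peak_hours` and `heavy_use` flags are not a standard, and a real organization would choose its own rule.

```python
# Hypothetical baseline: the priority a severity gets with no special context.
BASELINE_PRIORITY = {"SEV0": 1, "SEV1": 1, "SEV2": 2, "SEV3": 3}


def assign_priority(severity: str, peak_hours: bool, heavy_use: bool) -> str:
    """Derive a P level from a SEV level, adjusted for business context.

    Severity is an input and is never modified here; only the response
    urgency changes with context.
    """
    p = BASELINE_PRIORITY[severity]
    if peak_hours and heavy_use:
        p = max(1, p - 1)  # raise urgency by one step, capped at P1
    return f"P{p}"


print(assign_priority("SEV2", peak_hours=True, heavy_use=True))    # → P1
print(assign_priority("SEV2", peak_hours=False, heavy_use=False))  # → P2
```

Keeping the adjustment in a separate function means the severity recorded on the incident stays an objective description, even when the response is escalated for business reasons.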
Severity and Postmortem Requirements
Severity levels should drive postmortem requirements:
| Severity | Postmortem Requirement |
|---|---|
| SEV0 | Always required, within 48 hours |
| SEV1 | Always required, within 72 hours |
| SEV2 | Required when there's a learning opportunity or recurrence |
| SEV3 | Lightweight retrospective optional |
Postmortem requirements tied to severity make the postmortem process predictable. Engineers know when they're expected to write one. Managers know when to expect them. The learning from incidents accumulates systematically rather than selectively.
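The table translates directly into a lookup that incident tooling could use to open a postmortem task. This is a sketch under one stated assumption: the table says "within 48 hours" and "within 72 hours" without naming the starting point, so the code counts from the resolution time. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

# Deadlines from the table above; SEV2 and SEV3 have no stated deadline.
DEADLINE_HOURS = {"SEV0": 48, "SEV1": 72}


def postmortem_requirement(
    severity: str,
    resolved_at: datetime,
    learning_or_recurrence: bool = False,
) -> Tuple[bool, Optional[datetime]]:
    """Return (required, due_by). due_by is None when no deadline is defined."""
    if severity in DEADLINE_HOURS:
        # Assumption: the clock starts at resolution.
        return True, resolved_at + timedelta(hours=DEADLINE_HOURS[severity])
    if severity == "SEV2":
        return learning_or_recurrence, None
    return False, None  # SEV3: lightweight retrospective is optional


resolved = datetime(2024, 1, 1, 12, 0)
print(postmortem_requirement("SEV0", resolved))
print(postmortem_requirement("SEV2", resolved, learning_or_recurrence=True))
```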
The barrier to postmortems is usually time - 90 minutes of reconstruction after an already-exhausting incident. OpsBrief's auto-timeline captures every decision and action during the incident, so the postmortem takes 10 minutes rather than 90. That makes the SEV1 postmortem requirement feel reasonable rather than punitive.
Common Severity Definition Mistakes
Too many levels. SEV0 through SEV5 creates classification paralysis. Four levels is usually the maximum. If engineers are uncertain whether something is SEV2 or SEV3, the definition overlap is the problem, not their judgment.
Severity as a measure of urgency rather than impact. "SEV1 because we need to fix it immediately" conflates severity and priority. Severity is objective technical state. Priority is the response decision.
No definition for escalation triggers. What happens when a SEV3 evolves into a SEV2? Severity frameworks need downgrade and upgrade criteria, not just initial classification rules.
Definitions not connected to SLOs. If your severity framework doesn't reference your SLO commitments, it's operating in isolation from the contracts you've made with customers.
Not reviewing classification accuracy. Run a monthly 10-minute review: were there incidents that were misclassified? Did SEV3s escalate to SEV1? Did anything get classified lower than the actual impact warranted? Calibration is an ongoing process, not a one-time definition exercise.
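The monthly review can start from a simple comparison of each incident's initial and final severity. The sketch below assumes incidents are available as dictionaries with hypothetical `id`, `initial`, and `final` keys; the sample data is invented for illustration.

```python
from typing import Dict, List


def sev_rank(severity: str) -> int:
    """'SEV0' -> 0 ... 'SEV3' -> 3. A lower number is more severe."""
    return int(severity[3:])


def calibration_report(incidents: List[Dict[str, str]]) -> Dict[str, object]:
    """Count incidents whose final severity differs from the initial call."""
    under = [i["id"] for i in incidents
             if sev_rank(i["final"]) < sev_rank(i["initial"])]  # ended up worse
    over = [i["id"] for i in incidents
            if sev_rank(i["final"]) > sev_rank(i["initial"])]   # ended up milder
    return {"total": len(incidents), "under_classified": under, "over_classified": over}


# Invented sample data for one month.
month = [
    {"id": "INC-101", "initial": "SEV3", "final": "SEV1"},
    {"id": "INC-102", "initial": "SEV2", "final": "SEV2"},
    {"id": "INC-103", "initial": "SEV1", "final": "SEV2"},
]
print(calibration_report(month))
```

Under-classified incidents are usually the ones worth discussing first, since they are the cases where the response started slower than the impact warranted.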
Implementing a Severity Framework
If you're starting from scratch or rebuilding an existing framework:
Step 1: Define impact dimensions. Before assigning levels, agree on what "impact" means: user reach, functionality, data integrity, revenue exposure, SLO burn rate, and trend.
Step 2: Draft level definitions. Start with SEV1 and SEV3 as your anchors, since they are the easiest to tell apart: an urgent major incident versus a minor issue that can wait. Then define SEV2 as the space between them, and SEV0 as the catastrophic tier above SEV1.
Step 3: Test with past incidents. Take your last 20 incidents and classify them with the new framework. Do the classifications feel right? Are there incidents that fall ambiguously between levels? Refine the definitions until classifications are consistent.
Step 4: Connect to response expectations. Each severity level should have defined response time, escalation, communication, and postmortem requirements. Without these, severity is just a label.
Step 5: Document and train. Put the definitions in a runbook, link them from your incident management tool, and walk new engineers through them during onboarding.
Step 6: Review quarterly. Systems change, team size changes, customer expectations change. Severity definitions should evolve with them.
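For Step 3, one practical check is to have two engineers classify the same past incidents independently and measure how often they agree. The sketch below uses plain percent agreement; the function name, the incident IDs, and the sample classifications are all hypothetical.

```python
from typing import Dict, List, Tuple


def agreement(rater_a: Dict[str, str], rater_b: Dict[str, str]) -> Tuple[float, List[str]]:
    """Return (share of incidents classified identically, IDs that differ).

    Only incidents classified by both raters are compared.
    """
    shared = rater_a.keys() & rater_b.keys()
    if not shared:
        return 0.0, []
    disagreements = sorted(i for i in shared if rater_a[i] != rater_b[i])
    return 1 - len(disagreements) / len(shared), disagreements


# Invented classifications of four past incidents.
alice = {"INC-1": "SEV1", "INC-2": "SEV2", "INC-3": "SEV3", "INC-4": "SEV2"}
bob = {"INC-1": "SEV1", "INC-2": "SEV2", "INC-3": "SEV3", "INC-4": "SEV3"}

rate, differing = agreement(alice, bob)
print(rate, differing)
# → 0.75 ['INC-4']
```

The list of differing incidents is the useful output: each one points at a boundary between two levels where the written definition needs tightening.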
If your team's severity classification is inconsistent or your on-call engineers are making severity decisions without full context, OpsBrief surfaces SLO burn rate, affected services, and recent deployments when an incident fires - so classification is faster and more accurate from the start.


