Incident Severity Levels: How to Define SEV0, SEV1, SEV2, and SEV3

Two engineers look at the same production alert and disagree on whether it's a SEV1 or a SEV2. One wants to wake up the VP of Engineering. The other wants to handle it quietly and write it up in the morning.

Both are wrong - not because of their technical judgment, but because their organization hasn't defined what SEV1 means clearly enough for two people to reach the same answer from the same information.

Severity level definitions are foundational to incident management. Get them wrong and every downstream process - escalation, communication, postmortems, on-call burden - inherits the inconsistency.

Severity vs. Priority: The Distinction That Matters

Before building a severity framework, clarify what severity is and isn't.

Severity describes the technical impact of an incident - how bad is the damage to the system and its users? It's an objective assessment of current state.

Priority describes the urgency of response - how quickly should we act? It factors in severity but also considers business context: time of day, available resources, customer impact, SLO status.

A SEV2 incident on Black Friday at peak traffic might be prioritized P1 and get an all-hands response. The same SEV2 at 3am on a Sunday affecting a small subset of users might be prioritized P2 and wait for morning.

Keeping severity and priority distinct gives you cleaner frameworks for both. Severity is technical and objective. Priority is contextual and involves judgment.

This guide focuses on severity. Most organizations use four levels (SEV0 through SEV3, or P1 through P4). Some add a fifth. The number matters less than the precision of the definitions.

The Four Severity Levels

SEV0 - Critical / Catastrophic

SEV0 is reserved for the worst incidents - ones that require immediate all-hands response regardless of time, day, or competing priorities.

Defining characteristics:

Complete service outage affecting all or most users
Active data loss or corruption in production
Security breach with customer data exposure
System behavior that could cause irreversible harm to data integrity or customer trust

Examples:

Payment processing completely down
Database is corrupting writes
Production credentials exposed in a public repository
Full platform outage - users cannot log in

Response expectations:

Immediate page, 24/7
All engineering leadership notified within 10 minutes
Dedicated incident commander assigned
Customer communication within 15 minutes
Status page updated immediately
Executive team notified

SEV0 should be rare. If you have SEV0 incidents monthly, either your definitions are too broad or your systems have fundamental reliability problems. For most organizations, fewer than four per year is a reasonable target.

SEV1 - Major Incident

SEV1 is a significant incident requiring urgent response and active stakeholder communication, but not necessarily an all-hands situation.

Defining characteristics:

Core functionality unavailable or severely degraded for a meaningful percentage of users (for example, more than 10% of active users)
Service degraded enough to affect revenue or SLO materially
Significant feature unavailable with no workaround
Error rates high and rising

Examples:

API error rate above 10% and increasing
Core feature unavailable for all users
One geographic region completely unavailable
Checkout flow severely degraded but not completely down
Database query latency 10x normal for 30+ minutes

Response expectations:

Page immediately, 24/7
Acknowledge within 5 minutes
Incident commander assigned
Stakeholder communication within 20 minutes
Status page updated
Regular updates every 30 minutes until resolved
Postmortem required

SEV2 - Moderate Incident

SEV2 represents meaningful degradation that affects users but doesn't threaten the core function of the service or represent an immediate SLO risk.

Defining characteristics:

Non-critical feature unavailable
Performance degraded but service functional
Subset of users affected
Single integration failing with graceful degradation in place
Issue stable (not spreading) with known workaround

Examples:

Reporting feature unavailable for all users
API latency elevated (2-3x normal) but stable
One integration (e.g., Slack notifications) failing while core product works
Mobile app affected while web app works normally
<5% error rate, stable

Response expectations:

Page during business hours; Slack notification at night
Acknowledge within 20 minutes
Assigned engineer
Update stakeholders every hour
Status page update if customer-facing impact
Postmortem if learning opportunity exists

SEV3 - Minor Incident

SEV3 is a low-impact issue that doesn't require immediate response and can be handled in normal business hours.

Defining characteristics:

Minimal or no user impact
Internal tools or non-critical services affected
Cosmetic or minor functional issues
Affects a very small number of users
Issue stable with existing workaround

Examples:

Internal dashboard unavailable
Non-critical scheduled job failed
UI rendering bug that doesn't affect functionality
Single user reporting an isolated issue
Monitoring alert fired but system is healthy

Response expectations:

No page - appears in queue
Logged as a ticket
Handled in normal sprint cycle
No required customer communication
Lightweight postmortem optional

How to Write Definitions That Actually Work

The difference between severity frameworks that get used and ones that gather dust is specificity. Vague definitions produce inconsistent classification.

Bad definition: "SEV1: Major impact to users."

Good definition: "SEV1: Core product functionality unavailable or severely degraded for >10% of active users, OR API error rate above 10% for more than 5 consecutive minutes, OR any confirmed data loss, OR payment processing degraded by more than 20%."

The good definition passes a test: two engineers with the same monitoring data should reach the same severity classification independently. If they can't, the definition needs more specificity.
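One way to apply that test is to write the definition as a predicate over monitoring data, so two engineers feeding in the same numbers cannot get different answers. The sketch below encodes the example SEV1 definition above. The `Snapshot` type and its field names are hypothetical, and the thresholds are the illustrative ones from the example, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Monitoring data at classification time (hypothetical field names)."""
    affected_user_fraction: float     # share of active users whose core functionality is down or severely degraded (0.0-1.0)
    api_error_rate: float             # current API error rate (0.0-1.0)
    error_rate_breach_minutes: float  # consecutive minutes the error rate has stayed above 10%
    confirmed_data_loss: bool         # any confirmed data loss
    payment_degradation: float        # fractional drop in payment processing (0.0-1.0)


def meets_sev1(s: Snapshot) -> bool:
    """Return True if any clause of the example SEV1 definition holds."""
    return (
        s.affected_user_fraction > 0.10
        or (s.api_error_rate > 0.10 and s.error_rate_breach_minutes > 5)
        or s.confirmed_data_loss
        or s.payment_degradation > 0.20
    )
```

A 15% error rate that has lasted only 3 minutes does not meet the bar on its own; the same rate sustained for 6 minutes does.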

The impact dimensions to define explicitly

For each severity level, specify:

User reach: What percentage of users are affected? (All, >10%, <10%, <1%, internal only)

Functional impact: Is core functionality unavailable, degraded, or just non-critical features affected?

Data integrity: Is data being lost, corrupted, or just delayed?

Revenue exposure: Is the affected flow revenue-generating?

SLO status: What is the error budget burn rate? Fast burn = higher severity.

Trend: Is the situation stable, worsening, or improving?

A severity determination that checks all six dimensions is more consistent than one based on gut feel.
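A lightweight way to enforce that check is to record the six dimensions in a structured form and refuse to accept a severity until none are blank. The following is a minimal sketch. The `ImpactAssessment` type, its field names, and the example values in the comments are assumptions for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ImpactAssessment:
    """The six impact dimensions; None means 'not yet assessed'."""
    user_reach: Optional[str] = None         # e.g. "all", ">10%", "<10%", "<1%", "internal only"
    functional_impact: Optional[str] = None  # e.g. "core unavailable", "core degraded", "non-critical only"
    data_integrity: Optional[str] = None     # e.g. "lost", "corrupted", "delayed", "intact"
    revenue_exposure: Optional[bool] = None  # is the affected flow revenue-generating?
    slo_burn: Optional[str] = None           # e.g. "fast", "slow", "none"
    trend: Optional[str] = None              # e.g. "stable", "worsening", "improving"


def missing_dimensions(assessment: ImpactAssessment) -> list[str]:
    """List the dimensions that have not been filled in yet, in declaration order."""
    return [f.name for f in fields(assessment) if getattr(assessment, f.name) is None]
```

An incident form could block submission while `missing_dimensions` returns a non-empty list. Note that `revenue_exposure=False` counts as assessed, because only `None` is treated as missing.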

Severity and SLO Burn Rate

The most quantitative input to severity classification is SLO burn rate - how quickly the incident is consuming your error budget.

A framework based on burn rate:

SEV0: Error budget will be exhausted within 1 hour at current burn rate
SEV1: Error budget will be exhausted within 8 hours at current burn rate
SEV2: Error budget will be exhausted within the current calendar month at current burn rate
SEV3: Error budget remains healthy; incident is within acceptable parameters

Burn rate-based severity is more objective than impact estimates alone. It connects severity directly to your SLO commitments and makes the business consequence of each incident explicit.
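The arithmetic behind those thresholds is simple: divide the remaining error budget by the rate at which it is being consumed, then compare the result to the cutoffs above. The sketch below assumes both quantities are measured in the same unit (for example, failed requests) and that the caller supplies the hours left in the calendar month. The function and parameter names are hypothetical.

```python
import math


def hours_to_exhaustion(budget_remaining: float, consumed_per_hour: float) -> float:
    """Hours until the error budget reaches zero at the current burn rate."""
    if consumed_per_hour <= 0:
        return math.inf  # nothing is burning, so the budget never runs out
    return budget_remaining / consumed_per_hour


def severity_from_burn(
    budget_remaining: float,
    consumed_per_hour: float,
    hours_left_in_month: float,
) -> str:
    """Map time-to-exhaustion onto the burn-rate thresholds listed above."""
    hours = hours_to_exhaustion(budget_remaining, consumed_per_hour)
    if hours <= 1:
        return "SEV0"
    if hours <= 8:
        return "SEV1"
    if hours <= hours_left_in_month:
        return "SEV2"
    return "SEV3"
```

For example, with 1,000 failed requests of budget left and 200 being consumed per hour, the budget lasts 5 hours, which falls in the SEV1 band.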

When OpsBrief surfaces an incident, it shows SLO status alongside the incident - so the engineer classifying severity has the burn rate context immediately, not after 10 minutes of checking Datadog. Classification becomes faster and more consistent.

SEV vs. P: When to Use Each

Many organizations use both severity (SEV) and priority (P) frameworks. The convention:

SEV levels describe the technical incident (set by on-call engineer based on objective criteria)
P levels describe the response urgency (informed by SEV but adjusted for business context)

The workflow: engineer detects incident, classifies it as SEV2 based on technical criteria, then assigns it P1 because it's during peak business hours and the affected feature is under heavy use. Another engineer might classify the same SEV2 as P2 at 3am on a Sunday.
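That two-step workflow can be sketched as a base mapping from severity to priority plus a contextual adjustment. Only the SEV2 behavior (P1 at peak, P2 off-peak) comes from the example above. The rest of the base mapping and the single "peak hours and heavy use" rule are assumptions, and a real policy would likely weigh more context.

```python
# Assumed base mapping; only the SEV2 row is taken from the example in the text.
BASE_PRIORITY = {"SEV0": 1, "SEV1": 1, "SEV2": 2, "SEV3": 3}


def assign_priority(severity: str, peak_hours: bool, feature_under_heavy_use: bool) -> str:
    """Start from the severity-derived priority, then adjust for business context."""
    level = BASE_PRIORITY[severity]
    if peak_hours and feature_under_heavy_use:
        level = max(1, level - 1)  # raise urgency by one step, never above P1
    return f"P{level}"
```

Severity stays fixed in this sketch. Only the priority moves with context, which keeps the technical record of the incident independent of when it happened.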

For smaller organizations, a single framework is usually cleaner. Use P1-P4 with definitions that incorporate both technical severity and business context. The dual framework is valuable at larger organizations where technical severity and business priority genuinely need separate governance.

Severity and Postmortem Requirements

Severity levels should drive postmortem requirements:

Severity    Postmortem Requirement
SEV0        Always required, within 48 hours
SEV1        Always required, within 72 hours
SEV2        Required when there's a learning opportunity or recurrence
SEV3        Lightweight retrospective optional

Postmortem requirements tied to severity make the postmortem process predictable. Engineers know when they're expected to write one. Managers know when to expect them. The learning from incidents accumulates systematically rather than selectively.
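If your incident tooling can run a hook when an incident closes, the table can be turned into a rule that computes the deadline automatically. The sketch below assumes the 48 and 72 hour windows start at resolution time, which the table does not specify. The function name and the `learning_or_recurrence` flag are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Deadlines from the table above; SEV2 and SEV3 have no fixed deadline there.
POSTMORTEM_DEADLINE_HOURS = {"SEV0": 48, "SEV1": 72}


def postmortem_requirement(
    severity: str,
    resolved_at: datetime,
    learning_or_recurrence: bool = False,
) -> tuple[bool, Optional[datetime]]:
    """Return (required, due_by); due_by is None when no deadline is defined."""
    hours = POSTMORTEM_DEADLINE_HOURS.get(severity)
    if hours is not None:
        return True, resolved_at + timedelta(hours=hours)
    if severity == "SEV2" and learning_or_recurrence:
        return True, None  # required, but the table sets no deadline for SEV2
    return False, None     # SEV3, or SEV2 with nothing new to learn: optional
```

A SEV0 resolved at noon on January 1 would come back as required, due by noon on January 3.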

The barrier to postmortems is usually time - 90 minutes of reconstruction after an already-exhausting incident. OpsBrief's auto-timeline captures every decision and action during the incident, so the postmortem takes 10 minutes rather than 90. That makes the SEV1 postmortem requirement feel reasonable rather than punitive.

Common Severity Definition Mistakes

Too many levels. SEV0 through SEV5 creates classification paralysis. Four levels is usually enough. If engineers are uncertain whether something is SEV2 or SEV3, the definition overlap is the problem, not their judgment.

Severity as a measure of urgency rather than impact. "SEV1 because we need to fix it immediately" conflates severity and priority. Severity is objective technical state. Priority is the response decision.

No definition for escalation triggers. What happens when a SEV3 evolves into a SEV2? Severity frameworks need downgrade and upgrade criteria, not just initial classification rules.

Definitions not connected to SLOs. If your severity framework doesn't reference your SLO commitments, it's operating in isolation from the contracts you've made with customers.

Not reviewing classification accuracy. Run a monthly 10-minute review: were there incidents that were misclassified? Did SEV3s escalate to SEV1? Did anything get classified lower than the actual impact warranted? Calibration is an ongoing process, not a one-time definition exercise.
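The monthly review is easier to run if the misclassified incidents are pulled out ahead of the meeting. A minimal sketch follows, assuming each incident record carries the severity it was first given and the severity it ended with. The tuple layout and the incident IDs in the usage note are made up for illustration.

```python
def severity_rank(label: str) -> int:
    """Numeric rank of a label such as 'SEV2'; lower numbers are more severe."""
    return int(label[3:])


def classification_review(
    incidents: list[tuple[str, str, str]],
) -> tuple[list[str], list[str]]:
    """Split incident IDs into under-classified and over-classified.

    Each incident is (incident_id, initial_severity, final_severity).
    """
    under, over = [], []
    for incident_id, initial, final in incidents:
        if severity_rank(initial) > severity_rank(final):
            under.append(incident_id)  # started too low, e.g. SEV3 that became SEV1
        elif severity_rank(initial) < severity_rank(final):
            over.append(incident_id)   # started too high and was later downgraded
    return under, over
```

Given `("INC-1", "SEV3", "SEV1")`, `("INC-2", "SEV2", "SEV2")`, and `("INC-3", "SEV1", "SEV2")`, the function returns INC-1 as under-classified and INC-3 as over-classified. Under-classified incidents are usually the ones worth discussing first.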

Implementing a Severity Framework

If you're starting from scratch or rebuilding an existing framework:

Step 1: Define impact dimensions. Before assigning levels, agree on what "impact" means: user reach, functional impact, data integrity, revenue exposure, SLO burn rate, and trend.

Step 2: Draft level definitions. Start with SEV1 and SEV3 as your anchors, since a major incident and a minor one are the easiest to tell apart. Then define SEV2 as the space between them, and SEV0 as the catastrophic tier above SEV1.

Step 3: Test with past incidents. Take your last 20 incidents and classify them with the new framework. Do the classifications feel right? Are there incidents that fall ambiguously between levels? Refine the definitions until classifications are consistent.
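One way to make "consistent" measurable in this step is to have two engineers classify the same past incidents independently and compare their answers. The sketch below computes a simple agreement rate and lists the disagreements. The function name and the incident IDs in the usage note are hypothetical.

```python
def agreement_report(
    rater_a: dict[str, str],
    rater_b: dict[str, str],
) -> tuple[float, list[tuple[str, str, str]]]:
    """Compare two independent classifications of the same incidents.

    Returns (agreement_rate, disagreements), where each disagreement is
    (incident_id, severity_from_a, severity_from_b).
    """
    shared = sorted(rater_a.keys() & rater_b.keys())  # incidents both raters classified
    disagreements = [
        (incident_id, rater_a[incident_id], rater_b[incident_id])
        for incident_id in shared
        if rater_a[incident_id] != rater_b[incident_id]
    ]
    rate = 1 - len(disagreements) / len(shared) if shared else 0.0
    return rate, disagreements
```

If two raters agree on three of four incidents and split on INC-2 (SEV2 versus SEV3), the report shows a 0.75 agreement rate and points to the SEV2/SEV3 boundary as the definition to tighten.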

Step 4: Connect to response expectations. Each severity level should have defined response time, escalation, communication, and postmortem requirements. Without these, severity is just a label.

Step 5: Document and train. Put the definitions in a runbook, link them from your incident management tool, and walk new engineers through them during onboarding.

Step 6: Review quarterly. Systems change, team size changes, customer expectations change. Severity definitions should evolve with them.

If your team's severity classification is inconsistent or your on-call engineers are making severity decisions without full context, OpsBrief surfaces SLO burn rate, affected services, and recent deployments when an incident fires - so classification is faster and more accurate from the start.
