Incident Day Checklist
Purpose
When to use this
Inputs
Checklist
Triage
Stabilize
Evidence capture
Mitigation
Comms
Postmortem stub
Escalation / stop-the-line criteria
Related links

Incident Day Checklist

Purpose

Provide a one-page response checklist to triage, stabilize, and document production incidents with audit-ready evidence.

When to use this

Customer-reported production failures
Data mismatches or dashboard regressions
Auth/access failures with user impact
Any event requiring on-call coordination

Inputs

Reported symptoms and expected vs actual behavior
Scope (users, roles, endpoints, metrics)
Tenant/business context
Time window of impact

Checklist

Triage

Confirm incident severity and owner
Reproduce symptom with exact tenant, period, and role
Identify impacted endpoint/query/view path

Stabilize

Apply immediate containment (rollback/feature guard/reroute) as needed
Protect critical customer workflows first
Confirm stabilization signal before deep investigation

Evidence capture

Capture logs, request IDs, timestamps, and environment details
Capture failing queries/endpoints and comparison baselines
Link relevant runbook/reference artifacts in incident thread

Mitigation

Implement smallest safe corrective action
Re-run targeted smoke checks on impacted flows
Confirm no additional regression introduced

Comms

Publish status updates at agreed cadence
Notify stakeholders of impact, mitigation, and ETA
Record decision points and approvers

Postmortem stub

Create incident summary with root-cause hypothesis
List permanent fix candidates and owners
Track follow-up tasks and due dates

Escalation / stop-the-line criteria

Cross-tenant or security-sensitive impact
Auth/RBAC breach or potential data exposure
Inability to stabilize within agreed incident window
Repeated failure after mitigation attempts

RELEASE DAY CHECKLIST BACKEND DEPLOYMENT METHOD