Incident Day Checklist
Purpose
Provide a one-page response checklist to triage, stabilize, and document production incidents with audit-ready evidence.
When to use this
- Customer-reported production failures
- Data mismatches or dashboard regressions
- Auth/access failures with user impact
- Any event requiring on-call coordination
Inputs
- Reported symptoms and expected vs actual behavior
- Scope (users, roles, endpoints, metrics)
- Tenant/business context
- Time window of impact
Checklist
Triage
- Confirm incident severity and owner
- Reproduce symptom with exact tenant, period, and role
- Identify impacted endpoint/query/view path
Stabilize
- Apply immediate containment (rollback/feature guard/reroute) as needed
- Protect critical customer workflows first
- Confirm stabilization signal before deep investigation
Evidence capture
- Capture logs, request IDs, timestamps, and environment details
- Capture failing queries/endpoints and comparison baselines
- Link relevant runbook/reference artifacts in incident thread
Mitigation
- Implement smallest safe corrective action
- Re-run targeted smoke checks on impacted flows
- Confirm no additional regression introduced
Comms
- Publish status updates at agreed cadence
- Notify stakeholders of impact, mitigation, and ETA
- Record decision points and approvers
Postmortem stub
- Create incident summary with root-cause hypothesis
- List permanent fix candidates and owners
- Track follow-up tasks and due dates
Escalation / stop-the-line criteria
- Cross-tenant or security-sensitive impact
- Auth/RBAC breach or potential data exposure
- Inability to stabilize within agreed incident window
- Repeated failure after mitigation attempts