Filled‑in example for Rootly’s Post-mortem template.

1. Summary & Impact

On 1 August 2025 between 14:03 UTC and 14:33 UTC, our User API experienced elevated latency. Average response times increased from ~200 ms to 2–5 seconds, causing slow page loads for approximately 25 % of requests. The trigger was a new deployment that introduced an unoptimised SQL query against the orders table. Rolling back the deployment and adding a missing index restored normal performance within 30 minutes.
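
To make the failure mode concrete, here is a minimal SQL sketch of the pattern involved, assuming the new feature filtered on orders.status as the timeline describes; the column list, the 'pending' literal, and the index name are illustrative, not taken from the actual codebase:

```sql
-- Illustrative query shape only (column names other than status are hypothetical):
-- without an index on orders.status, PostgreSQL falls back to a sequential scan
-- of the orders table, which is what drove the latency spike.
SELECT id, customer_id, total
FROM   orders
WHERE  status = 'pending'
ORDER  BY created_at DESC
LIMIT  50;

-- The class of fix applied during the incident: index the filtered column.
-- CONCURRENTLY builds the index without blocking writes (it cannot run inside
-- a transaction block). The index name is hypothetical.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_status
    ON orders (status);
```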

2. Key details

| Field | Value |
| --- | --- |
| Incident type & severity | SEV‑2 performance degradation |
| Services/components | User API, PostgreSQL database (orders table), ORM layer |
| Impact window | Start: 2025‑08‑01 14:03 UTC; End: 14:33 UTC |
| Duration & time‑to‑resolve | 30 min total; issue detected within 2 min; mitigated at 14:20 UTC; fully resolved by 14:33 UTC |
| Participants | Incident Lead: Alice R. (on‑call SRE); Responders: Bob M. (backend engineer), Charlie T. (team lead); Reporter: automated latency alert |
| Important links | Slack #incident-620-latency-spike; Git commit a7c3dbe; Grafana dashboard; status‑page entry |

3. Timeline & Response

| Time (UTC) | Event / action | Notes |
| --- | --- | --- |
| 14:03 | Grafana alert “User API p95 latency > 1 s” triggered | Baseline latency 200 ms; p95 spiked to 3 s |
| 14:05 | Rootly paged on‑call SRE (Alice) | Alice acknowledged within 1 min |
| 14:07 | Alice joined #ops‑alerts; confirmed elevated latency across multiple endpoints | Error rate unchanged |
| 14:09 | Alice paged backend engineer (Bob) and team lead (Charlie) | Suspected the recent deployment |
| 14:10 | Alice ran pg_stat_activity; saw a long‑running query on orders | Query came from the newly released feature (diagnostic sketch below) |
| 14:12 | Bob reviewed the recent deployment (a7c3dbe); identified an ORM query missing an index | Deployment added a new filter on the orders.status field |
| 14:15 | Decision made to roll back the deployment via blue‑green deploy | Traffic rolled back to the previous version |
| 14:18 | Latency began dropping; p95 back under 400 ms by 14:20 | Confirmed with Grafana dashboard |
| 14:20 | Incident mitigated; started root‑cause investigation | Connection pool recovered; no user errors |
| 14:25 | Bob added the missing index on orders.status in staging and tested the query plan | Index creation time ~1 s in production |
| 14:28 | Redeployed the feature to production with the index in place | DB CPU and I/O monitored; within normal range |
| 14:33 | Latency fully restored (< 250 ms p95); incident declared resolved | Status page updated |
| 14:45 | Alice posted preliminary internal summary and external status‑page update | Began drafting the post‑mortem |
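
For reference, a hedged sketch of the database-side diagnostics the timeline mentions at 14:10 and 14:25; the 5-second threshold and the verification query are illustrative choices, not the exact commands run during the incident:

```sql
-- 14:10-style check: list active statements running longer than a few seconds,
-- worst offenders first.
SELECT pid,
       now() - query_start AS runtime,
       state,
       left(query, 120)    AS query_snippet
FROM   pg_stat_activity
WHERE  state = 'active'
  AND  now() - query_start > interval '5 seconds'
ORDER  BY runtime DESC;

-- 14:25-style verification: once the index exists, the plan should show an
-- Index Scan or Bitmap Index Scan on orders instead of a Seq Scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE status = 'pending';
```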

4. Contributing Factors & Mitigations

The direct cause was an ORM‑generated query introduced in commit a7c3dbe that filtered on orders.status, a column with no index, forcing sequential scans of the orders table under production load. Contributing factors: the staging data set is far smaller than production, so the slow plan was not visible before release; CI performs no query‑plan analysis for new queries; and the deployment checklist includes no performance regression step. The immediate mitigation was a blue‑green rollback, followed by adding the index and redeploying; the follow‑up actions in section 7 address the systemic gaps.

5. Resolution & Technical Analysis

Rolling traffic back to the previous version via the blue‑green deploy brought p95 latency under 400 ms by 14:20 and mitigated the incident. Bob then created the missing index on orders.status, verified the query plan in staging, and redeployed the feature at 14:28; index creation took about a second and database CPU and I/O stayed within normal range. p95 latency was back under 250 ms by 14:33, when the incident was declared resolved. Error rates were unchanged throughout, so the user impact was limited to slow responses on roughly 25% of requests during the 30‑minute window.

6. Lessons & Risks

The core lesson is that query plans for new filters must be validated against production‑scale data before release: the sequential scan never surfaced in staging because the data set there does not mirror production size or distribution. Detection and response worked well, with the latency alert firing within two minutes and mitigation inside twenty. The residual risk is that other unindexed filters on large tables could trigger the same failure mode, which the CI query‑analysis and staging‑data follow‑ups are intended to address.
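
One way to act on the staging-data lesson, sketched under the assumption that row-count drift is the main gap (the schema filter and LIMIT are arbitrary): run the same size report in staging and production and compare the results.

```sql
-- Report estimated row counts and on-disk size for the largest tables;
-- running this in staging and in production makes any gap obvious.
SELECT relname                                      AS table_name,
       reltuples::bigint                            AS estimated_rows,
       pg_size_pretty(pg_total_relation_size(oid))  AS total_size
FROM   pg_class
WHERE  relkind = 'r'
  AND  relnamespace = 'public'::regnamespace
ORDER  BY reltuples DESC
LIMIT  20;
```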

7. Follow‑up Actions

| Action | Owner | Due date |
| --- | --- | --- |
| Add index on orders.status in all environments and verify via migration scripts | DB team – Priya L. | 2025‑08‑05 |
| Implement automated query analysis in CI to flag plans that rely on sequential scans (e.g., fail the build when EXPLAIN shows a Seq Scan node; sketch below) | DevOps – Bob M. | 2025‑08‑12 |
| Expand staging data set to mirror production size and distribution | SRE – Alice R. | 2025‑08‑19 |
| Review connection‑pool sizing and auto‑scaling policies; adjust defaults | Platform team – Charlie T. | 2025‑08‑08 |
| Update deployment checklist to include performance regression tests for new queries | Engineering management – Sarah K. | 2025‑08‑15 |
| Conduct brown‑bag session on indexing best practices for engineers | DB team – Priya L. | 2025‑08‑22 |
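
As a starting point for the CI query-analysis action above, a hedged sketch of the kind of gate it could add: EXPLAIN a candidate query and fail if the plan contains a sequential scan. The DO-block wrapper and the example query are hypothetical; nothing like this exists in the pipeline yet, and a real check would iterate over the queries introduced by a change.

```sql
-- Hypothetical CI gate: abort if the candidate query's plan falls back
-- to a sequential scan (run against a database with production-like data).
DO $$
DECLARE
    plan text;  -- JSON plan captured as text for a simple substring check
BEGIN
    EXECUTE $q$
        EXPLAIN (FORMAT JSON)
        SELECT * FROM orders WHERE status = 'pending'
    $q$ INTO plan;

    IF plan LIKE '%Seq Scan%' THEN
        RAISE EXCEPTION 'Sequential scan detected; add or verify an index before deploying';
    END IF;
END
$$;
```

Per the staging follow‑up above, a check like this is only meaningful against a data set whose size and distribution resemble production.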