Filled‑in example of Rootly's post‑mortem template.

On 1 August 2025 between 14:03 UTC and 14:33 UTC, our User API experienced elevated latency. Average response times increased from ~200 ms to 2–5 seconds, causing slow page loads for approximately 25 % of requests. The trigger was a new deployment that introduced an unoptimised SQL query against the orders table. Rolling back the deployment and adding a missing index restored normal performance within 30 minutes.
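For illustration, here is a minimal sketch of the problematic query pattern and the fix, assuming the ORM generated a simple filter on orders.status. The selected columns and the index name idx_orders_status are illustrative; the WHERE status = 'pending' predicate and the B‑tree index come from the root‑cause notes further down.

```sql
-- Approximate shape of the query introduced by deployment a7c3dbe
-- (column list is illustrative; the actual ORM-generated SQL was not captured).
SELECT id, user_id, total
FROM   orders
WHERE  status = 'pending';   -- unindexed column: forced a full scan of ~2 M rows

-- Fix applied during the incident: a B-tree index on the filtered column.
-- CONCURRENTLY avoids blocking writes while the index builds; the actual
-- migration may have used a plain CREATE INDEX instead.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_status
    ON orders (status);
```
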

| Field | Value |
|---|---|
| Incident type & severity | SEV‑2 performance degradation |
| Services/components | User API, PostgreSQL database (orders table), ORM layer |
| Impact window | Start: 2025‑08‑01 14:03 UTC; End: 14:33 UTC |
| Duration & time‑to‑resolve | 30 min total; issue detected within 2 min; mitigated at 14:20 UTC; fully resolved by 14:33 UTC |
| Participants | Incident Lead: Alice R. (on‑call SRE); Responders: Bob M. (backend engineer), Charlie T. (team lead); Reporter: automated latency alert |
| Important links | Slack #incident-620-latency-spike; Git commit a7c3dbe; Grafana dashboard; status‑page entry |

| Time (UTC) | Event / action | Notes |
|---|---|---|
| 14:03 | Grafana alert “User API p95 latency > 1 s” triggered | Baseline latency 200 ms; p95 spiked to 3 s |
| 14:05 | Rootly paged on‑call SRE (Alice) | Alice acknowledged within 1 min |
| 14:07 | Alice joined #ops‑alerts; confirmed elevated latency across multiple endpoints | error rate unchanged |
| 14:09 | Alice paged backend engineer (Bob) and team lead (Charlie) | suspect recent deployment |
| 14:10 | Alice ran pg_stat_activity; saw long‑running query on orders | query came from newly released feature (see the query sketch after this timeline) |
| 14:12 | Bob reviewed recent deployment (a7c3dbe); identified ORM query missing index | added new filter on orders.status field |
| 14:15 | Decision made to roll back deployment via blue‑green deploy | rolled traffic back to previous version |
| 14:18 | Latency began dropping; p95 back under 400 ms by 14:20 | confirmed with Grafana dashboard |
| 14:20 | Incident mitigated; started root‑cause investigation | connection pool recovered; no user errors |
| 14:25 | Bob added missing index on orders.status in staging and tested query plan | index creation time ~1 s in production (see the EXPLAIN sketch after this timeline) |
| 14:28 | Redeployed feature with index added to production | monitored DB CPU and IO – within normal |
| 14:33 | Latency fully restored (< 250 ms p95); declared incident resolved | status page updated |
| 14:45 | Alice posted preliminary internal summary & external status page update | begin drafting post‑mortem |
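
A minimal sketch of the kind of pg_stat_activity check run at 14:10 to spot the long‑running query. The exact statement Alice used was not recorded; the one‑second threshold and the column list are assumptions.

```sql
-- List active queries running longer than one second, longest first,
-- so the offending orders query surfaces at the top.
SELECT pid,
       now() - query_start AS runtime,
       state,
       left(query, 100)    AS query
FROM   pg_stat_activity
WHERE  state <> 'idle'
  AND  now() - query_start > interval '1 second'
ORDER  BY runtime DESC;
```
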
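And a sketch of the staging verification at 14:25, assuming Bob compared EXPLAIN output before and after creating the index; the exact session was not captured.

```sql
-- Before the index, the plan shows "Seq Scan on orders"; after
-- CREATE INDEX ... ON orders (status) it should show an Index Scan
-- or Bitmap Index Scan using the new index instead.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, user_id, total
FROM   orders
WHERE  status = 'pending';
```
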
The root cause was an ORM‑generated query from the new feature that filtered on orders.status without an index; the query scanned ~2 M rows, causing high I/O wait and saturating the connection pool. The offending predicate was WHERE status = 'pending' on an unindexed column. Impact was partly contained: the GET /users/:id endpoints limited the spread of latency beyond the affected queries, which kept the slowdown to roughly 25 % of requests. Adding a B‑tree index on orders.status and redeploying the feature removed the full table scan. pg_stat_statements showed that the slow query accounted for 98 % of database CPU during the incident (a sketch of this check appears after the action‑items table); after the index, query time dropped from ~2.5 s to < 10 ms. Pool saturation also showed that the combined per‑pod connection pools were sized too close to PostgreSQL's max_connections. We have reduced per‑pod connections to 10 and raised max_connections appropriately.

| Action | Owner | Due date |
|---|---|---|
| Add index on orders.status in all environments and verify via migration scripts | DB team – Priya L. | 2025‑08‑05 |
| Implement automated query‑analysis in CI to flag sequential scans (e.g., run EXPLAIN on new or changed queries and fail the build when the plan contains a Seq Scan node; see the sketch below this table) | DevOps – Bob M. | 2025‑08‑12 |
| Expand staging data set to mirror production size and distribution | SRE – Alice R. | 2025‑08‑19 |
| Review connection‑pool sizing and auto‑scaling policies; adjust defaults | Platform team – Charlie T. | 2025‑08‑08 |
| Update deployment checklist to include performance regression tests for new queries | Engineering management – Sarah K. | 2025‑08‑15 |
| Conduct brown‑bag session on indexing best practices for engineers | DB team – Priya L. | 2025‑08‑22 |
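
A sketch of the pg_stat_statements check behind the "98 % of database CPU" figure cited above, assuming the extension is installed and using cumulative execution time as a proxy for CPU. The exact query run during the incident was not recorded, and the column names assume PostgreSQL 13+ (older versions use total_time / mean_time).

```sql
-- Top statements by cumulative execution time; during the incident the
-- unindexed orders query dominated this list.
SELECT left(query, 80)                   AS query,
       calls,
       round(total_exec_time::numeric)   AS total_ms,
       round(mean_exec_time::numeric, 2) AS mean_ms
FROM   pg_stat_statements
ORDER  BY total_exec_time DESC
LIMIT  10;
```
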
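Finally, a hedged sketch of the CI sequential‑scan check from the action items, assuming it runs each new or changed query through EXPLAIN against a production‑sized copy of the schema and fails on Seq Scan plan nodes. The wiring into CI and the example query here are hypothetical; the final implementation is Bob's to decide.

```sql
-- Fail loudly if the plan for a candidate query contains a sequential scan.
-- In CI, a wrapper script would substitute each new or changed query here.
DO $$
DECLARE
    plan_line text;
BEGIN
    FOR plan_line IN
        EXECUTE 'EXPLAIN SELECT id FROM orders WHERE status = ''pending'''
    LOOP
        IF plan_line LIKE '%Seq Scan%' THEN
            RAISE EXCEPTION 'Sequential scan detected in plan: %', plan_line;
        END IF;
    END LOOP;
END $$;
```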