Filled‑in example for Rootly’s Post-mortem template.
On 1 August 2025 between 14:03 UTC and 14:33 UTC, our User API experienced elevated latency. Average response times increased from ~200 ms to 2–5 seconds, causing slow page loads for approximately 25 % of requests. The trigger was a new deployment that introduced an unoptimised SQL query against the orders
table. Rolling back the deployment and adding a missing index restored normal performance within 30 minutes.
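For illustration, a minimal sketch of the kind of query and fix involved; the exact statement, column list, and index name are assumptions, since the real query was generated by the ORM:

```sql
-- Hypothetical shape of the ORM-generated statement behind the regression
-- (the real query filtered the orders table on an unindexed status column).
SELECT id, user_id, status, created_at   -- column list is illustrative
FROM   orders
WHERE  status = 'pending';               -- no index on status forces a sequential scan over ~2 M rows

-- The fix: a B-tree index on the filtered column. CONCURRENTLY avoids
-- locking the table during creation; the index name is illustrative.
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);
```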
Field | Value |
---|---|
Incident type & severity | SEV‑2 performance degradation |
Services/components | User API, PostgreSQL database (orders table), ORM layer |
Impact window | Start: 2025‑08‑01 14:03 UTC; End: 14:33 UTC |
Duration & time‑to‑resolve | 30 min total; issue detected within 2 min; mitigated at 14:20 UTC; fully resolved by 14:33 UTC |
Participants | Incident Lead: Alice R. (on‑call SRE); Responders: Bob M. (backend engineer), Charlie T. (team lead); Reporter: automated latency alert |
Important links | Slack #incident-620-latency-spike; Git commit a7c3dbe; Grafana dashboard; status‑page entry |
Time (UTC) | Event / action | Notes |
---|---|---|
14:03 | Grafana alert “User API p95 latency > 1 s” triggered | Baseline latency 200 ms; p95 spiked to 3 s |
14:05 | Rootly paged on‑call SRE (Alice) | Alice acknowledged within 1 min |
14:07 | Alice joined #ops‑alerts; confirmed elevated latency across multiple endpoints | error rate unchanged |
14:09 | Alice paged backend engineer (Bob) and team lead (Charlie) | suspect recent deployment |
14:10 | Alice ran pg_stat_activity; saw long‑running query on orders | query came from newly released feature (see the diagnostic sketch below the timeline) |
14:12 | Bob reviewed recent deployment (a7c3dbe); identified ORM query missing index | added new filter on orders.status field |
14:15 | Decision made to roll back deployment via blue‑green deploy | rolled traffic back to previous version |
14:18 | Latency began dropping; p95 back under 400 ms by 14:20 | confirmed with Grafana dashboard |
14:20 | Incident mitigated; started root‑cause investigation | connection pool recovered; no user errors |
14:25 | Bob added missing index on orders.status in staging and tested query plan | index creation time ~1 s in production |
14:28 | Redeployed the feature to production with the new index | monitored DB CPU and I/O – remained within normal range |
14:33 | Latency fully restored (< 250 ms p95); declared incident resolved | status page updated |
14:45 | Alice posted preliminary internal summary & external status page update | began drafting post‑mortem |
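The checks at 14:10 and 14:25 correspond roughly to the diagnostic sketch below; the exact commands run during the incident were not captured, so the filters, thresholds, and column list here are illustrative:

```sql
-- 14:10 – list long-running active statements (the pg_stat_activity check).
SELECT pid,
       now() - query_start AS runtime,
       state,
       query
FROM   pg_stat_activity
WHERE  state = 'active'
  AND  now() - query_start > interval '1 second'   -- threshold is illustrative
ORDER  BY runtime DESC;

-- 14:25 – confirm in staging that the new index is actually used (the query-plan test).
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, user_id, status, created_at
FROM   orders
WHERE  status = 'pending';
-- Before the index the plan shows "Seq Scan on orders";
-- afterwards it should show an index scan on the new status index.
```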
The root cause was an ORM‑generated filter, WHERE status = 'pending', applied to orders.status without an index; the query scanned ~2 M rows, causing high I/O wait and saturating the connection pool. Endpoints that do not query the orders table, such as GET /users/:id, limited the spread of latency beyond the affected queries.

Adding a B‑tree index on orders.status and redeploying the feature removed the full table scan. pg_stat_statements showed that the slow query accounted for 98 % of database CPU during the incident; after the index, query time dropped from ~2.5 s to < 10 ms.

A contributing factor was connection‑pool sizing relative to the database's max_connections. We have reduced per‑pod connections to 10 and raised max_connections appropriately.
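A sketch of the kind of statements behind these findings and changes; the column names follow the standard pg_stat_statements view, and the max_connections value shown is illustrative, not the one we actually set:

```sql
-- Rank statements by total execution time to find the top CPU consumers
-- (requires the pg_stat_statements extension; total_exec_time/mean_exec_time
-- are the column names in PostgreSQL 13+, older versions use total_time/mean_time).
SELECT query,
       calls,
       total_exec_time,
       mean_exec_time
FROM   pg_stat_statements
ORDER  BY total_exec_time DESC
LIMIT  10;

-- Raise the server-side connection ceiling (takes effect after a restart);
-- the value is illustrative. The per-pod pool size of 10 is set in the
-- application's connection-pool configuration, not in SQL.
ALTER SYSTEM SET max_connections = 300;
```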
Action | Owner | Due date |
---|---|---|
Add index on orders.status in all environments and verify via migration scripts | DB team – Priya L. | 2025‑08‑05 |
Implement automated query‑analysis in CI to flag sequential scans (e.g., run EXPLAIN and fail on plans containing Seq Scan nodes; see the sketch after this table) | DevOps – Bob M. | 2025‑08‑12 |
Expand staging data set to mirror production size and distribution | SRE – Alice R. | 2025‑08‑19 |
Review connection‑pool sizing and auto‑scaling policies; adjust defaults | Platform team – Charlie T. | 2025‑08‑08 |
Update deployment checklist to include performance regression tests for new queries | Engineering management – Sarah K. | 2025‑08‑15 |
Conduct brown‑bag session on indexing best practices for engineers | DB team – Priya L. | 2025‑08‑22 |
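For the CI query‑analysis action item, one possible building block, shown here as a hypothetical sketch (the CI glue that parses the plan output is not included):

```sql
-- Produce a machine-readable plan that a CI step can inspect for sequential scans.
EXPLAIN (FORMAT JSON)
SELECT id, user_id, status, created_at   -- query under test; illustrative
FROM   orders
WHERE  status = 'pending';
-- A CI script can parse the JSON output and fail the build whenever a plan
-- contains a "Seq Scan" node on a table above a configured row threshold.
```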