Filled‑in example for Rootly’s Post-mortem template.
On 1 August 2025 between 14:03 UTC and 14:33 UTC, our User API experienced elevated latency. Average response times increased from ~200 ms to 2–5 seconds, causing slow page loads for approximately 25 % of requests. The trigger was a new deployment that introduced an unoptimised SQL query against the orders
table. Rolling back the deployment and adding a missing index restored normal performance within 30 minutes.
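For illustration, a minimal sketch of the kind of query and fix involved; the exact statement, column list, and index name are assumptions, since the real query was generated by the ORM:

```sql
-- Hypothetical shape of the ORM-generated statement behind the regression
-- (the real query filtered the orders table on an unindexed status column).
SELECT id, user_id, status, created_at   -- column list is illustrative
FROM   orders
WHERE  status = 'pending';               -- no index on status forces a sequential scan over ~2 M rows

-- The fix: a B-tree index on the filtered column. CONCURRENTLY avoids
-- locking the table during creation; the index name is illustrative.
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);
```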
Field | Value |
---|---|
Incident type & severity | SEV‑2 performance degradation |
Services/components | User API, PostgreSQL database (orders table), ORM layer |
Impact window | Start: 2025‑08‑01 14:03 UTC; End: 14:33 UTC |
Duration & time‑to‑resolve | 30 min total; issue detected within 2 min; mitigated at 14:20 UTC; fully resolved by 14:33 UTC |
Participants | Incident Lead: Alice R. (on‑call SRE); Responders: Bob M. (backend engineer), Charlie T. (team lead); Reporter: automated latency alert |
Important links | Slack #incident-620-latency-spike; Git commit a7c3dbe; Grafana dashboard; status‑page entry |
Time (UTC) | Event / action | Notes |
---|---|---|
14:03 | Grafana alert “User API p95 latency > 1 s” triggered | Baseline latency 200 ms; p95 spiked to 3 s |
14:05 | Rootly paged on‑call SRE (Alice) | Alice acknowledged within 1 min |
14:07 | Alice joined #ops‑alerts; confirmed elevated latency across multiple endpoints | error rate unchanged |
14:09 | Alice paged backend engineer (Bob) and team lead (Charlie) | suspect recent deployment |
14:10 | Alice ran pg_stat_activity; saw long‑running query on orders | query came from newly released feature (see the diagnostic sketch below the timeline) |
14:12 | Bob reviewed recent deployment (a7c3dbe); identified ORM query missing index | added new filter on orders.status field |
14:15 | Decision made to roll back deployment via blue‑green deploy | rolled traffic back to previous version |
14:18 | Latency began dropping; p95 back under 400 ms by 14:20 | confirmed with Grafana dashboard |
14:20 | Incident mitigated; started root‑cause investigation | connection pool recovered; no user errors |
14:25 | Bob added missing index on orders.status in staging and tested query plan | index creation time ~1 s in production |
14:28 | Redeployed the feature to production with the new index | monitored DB CPU and I/O – remained within normal range |
14:33 | Latency fully restored (< 250 ms p95); declared incident resolved | status page updated |
14:45 | Alice posted preliminary internal summary & external status page update | began drafting post‑mortem |
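The checks at 14:10 and 14:25 correspond roughly to the diagnostic sketch below; the exact commands run during the incident were not captured, so the filters, thresholds, and column list here are illustrative:

```sql
-- 14:10 – list long-running active statements (the pg_stat_activity check).
SELECT pid,
       now() - query_start AS runtime,
       state,
       query
FROM   pg_stat_activity
WHERE  state = 'active'
  AND  now() - query_start > interval '1 second'   -- threshold is illustrative
ORDER  BY runtime DESC;

-- 14:25 – confirm in staging that the new index is actually used (the query-plan test).
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, user_id, status, created_at
FROM   orders
WHERE  status = 'pending';
-- Before the index the plan shows "Seq Scan on orders";
-- afterwards it should show an index scan on the new status index.
```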
The root cause was an ORM‑generated filter, WHERE status = 'pending', applied to orders.status without an index; the query scanned ~2 M rows, causing high I/O wait and saturating the connection pool. Endpoints that do not query the orders table, such as GET /users/:id, limited the spread of latency beyond the affected queries.

Adding a B‑tree index on orders.status and redeploying the feature removed the full table scan. pg_stat_statements showed that the slow query accounted for 98 % of database CPU during the incident; after the index, query time dropped from ~2.5 s to < 10 ms.

A contributing factor was connection‑pool sizing relative to the database's max_connections. We have reduced per‑pod connections to 10 and raised max_connections appropriately.
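A sketch of the kind of statements behind these findings and changes; the column names follow the standard pg_stat_statements view, and the max_connections value shown is illustrative, not the one we actually set:

```sql
-- Rank statements by total execution time to find the top CPU consumers
-- (requires the pg_stat_statements extension; total_exec_time/mean_exec_time
-- are the column names in PostgreSQL 13+, older versions use total_time/mean_time).
SELECT query,
       calls,
       total_exec_time,
       mean_exec_time
FROM   pg_stat_statements
ORDER  BY total_exec_time DESC
LIMIT  10;

-- Raise the server-side connection ceiling (takes effect after a restart);
-- the value is illustrative. The per-pod pool size of 10 is set in the
-- application's connection-pool configuration, not in SQL.
ALTER SYSTEM SET max_connections = 300;
```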
Action | Owner | Due date |
---|---|---|
Add index on orders.status in all environments and verify via migration scripts | DB team – Priya L. | 2025‑08‑05 |
Implement automated query‑analysis in CI to flag sequential scans (e.g., run EXPLAIN and fail on plans containing Seq Scan nodes; see the sketch after this table) | DevOps – Bob M. | 2025‑08‑12 |
Expand staging data set to mirror production size and distribution | SRE – Alice R. | 2025‑08‑19 |
Review connection‑pool sizing and auto‑scaling policies; adjust defaults | Platform team – Charlie T. | 2025‑08‑08 |
Update deployment checklist to include performance regression tests for new queries | Engineering management – Sarah K. | 2025‑08‑15 |
Conduct brown‑bag session on indexing best practices for engineers | DB team – Priya L. | 2025‑08‑22 |
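For the CI query‑analysis action item, one possible building block, shown here as a hypothetical sketch (the CI glue that parses the plan output is not included):

```sql
-- Produce a machine-readable plan that a CI step can inspect for sequential scans.
EXPLAIN (FORMAT JSON)
SELECT id, user_id, status, created_at   -- query under test; illustrative
FROM   orders
WHERE  status = 'pending';
-- A CI script can parse the JSON output and fail the build whenever a plan
-- contains a "Seq Scan" node on a table above a configured row threshold.
```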