The Cascading Failure
It started with a slow database query. Within three minutes, the entire checkout flow was down, the product catalog was returning 500 errors, and the recommendation engine was consuming 100% CPU doing nothing useful. In this incident response challenge, you are paged at 2 AM with dashboards showing simultaneous failures across five services and no obvious single cause.

You must systematically trace the cascade back to its origin: database connection pool exhaustion caused by a long-running analytics query that a team member ran against the production read replica. The analytics query locked rows the replica needed for read traffic, spiking read latency from 5 ms to 8 seconds. The product catalog service, which reads from this replica, exhausted its connection pool waiting for responses. Its health check began failing, so the ALB removed its instances from the target group. The checkout service, which depends on the catalog, started timing out on product lookups and retrying aggressively, each retry spawning a new connection attempt to the struggling catalog service and amplifying the load. The recommendation engine saw the catalog errors and fell back to direct database queries, further overwhelming the database.

Your response requires you to:

- Identify the root cause using CloudWatch metrics, X-Ray traces, and RDS Performance Insights.
- Implement immediate mitigation by killing the runaway query and adjusting connection pool limits.
- Design short-term fixes: query timeout enforcement, circuit breakers between services, and read replica isolation for analytics.
- Write a post-incident review that names the systemic failures (missing circuit breakers, no query governance, a shared read replica) and proposes architectural improvements.

This challenge teaches systematic incident triage, cascade analysis, and the resilience patterns that keep a database failure from becoming a platform-wide outage.
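The checkout-to-catalog retry storm is exactly what a circuit breaker interrupts: after repeated failures, callers fail fast instead of piling new connections onto a struggling dependency. A minimal sketch of the pattern (the class name, thresholds, and half-open behavior here are illustrative assumptions, not part of the challenge scenario):

```python
import time

class CircuitBreaker:
    """Fail fast after repeated downstream failures.

    Trips open after `max_failures` consecutive failures; while open,
    calls raise immediately instead of contacting the dependency.
    After `reset_timeout` seconds, one trial call is allowed through
    (the "half-open" state); success closes the circuit again.
    """

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: permit one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip open
            raise
        self.failures = 0  # success resets the failure count
        return result
```

In the incident above, a breaker like this between checkout and the catalog service would have converted an 8-second timeout pile-up into immediate, cheap failures, giving the catalog's connection pool room to recover.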
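The other amplifier in the cascade was aggressive, synchronized retries. The standard counter-pattern is capped exponential backoff with jitter, so clients spread their retries out rather than hammering the dependency in lockstep. A small sketch (function name and default values are hypothetical):

```python
import random

def backoff_delays(attempts, base=0.1, cap=5.0):
    """Capped exponential backoff with full jitter.

    The delay before attempt `a` is drawn uniformly from
    [0, min(cap, base * 2**a)], so a fleet of clients that failed
    at the same moment does not retry at the same moment.
    """
    return [random.uniform(0.0, min(cap, base * 2 ** a)) for a in range(attempts)]
```

Pairing jittered backoff with a retry budget (give up after a few attempts) is what keeps a transient slowdown from turning into the self-sustaining retry storm described above.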
Challenge Details
- Path: Production Breakout: Incident Response
- Difficulty: Expert
- Duration: 90 min
- Plan: Pro
Why This Challenge?
Unlike whiteboard exercises or multiple-choice quizzes, this challenge requires you to design a real architecture with actual AWS services, evaluate trade-offs, and defend your decisions. Our automated validators check your design against production-grade criteria. Complete it and it shows up in your verified portfolio with your architecture diagram and design rationale.
More from Production Breakout: Incident Response
The DDoS Storm
A volumetric DDoS attack is overwhelming your infrastructure. Mitigate it without taking the site offline.
Expert · 80 min
The Certificate Cascade
An expired TLS certificate has broken service-to-service communication across your microservices platform.
Expert · 85 min
The Split Brain
A network partition has caused your distributed database to accept conflicting writes. Resolve the split brain.
Expert · 90 min
Ready to design this for real?
Get the full scenario, design your architecture using real AWS services, and validate against production-grade criteria. Your completed challenge shows up in your verified portfolio.
Start Challenge