The Cascading Failure
It started with a slow database query. Within three minutes, the entire checkout flow was down, the product catalog was returning 500 errors, and the recommendation engine was consuming 100% CPU doing nothing useful. In this incident response challenge, you are paged at 2 AM with dashboards showing simultaneous failures across five services and no obvious single cause.

You must systematically trace the cascade back to its origin: database connection pool exhaustion caused by a long-running analytics query that a team member ran against the production read replica. The analytics query locked rows the replica needed for read traffic, spiking read latency from 5 ms to 8 seconds. The product catalog service, which reads from this replica, exhausted its connection pool waiting for responses. Its health check began failing, so the ALB removed its instances from the target group. The checkout service, which depends on the catalog, started timing out on product lookups and retrying aggressively, each retry spawning a new connection attempt to the struggling catalog service and amplifying the load. The recommendation engine saw the catalog errors and fell back to direct database queries, further overwhelming the database.

Your response requires you to:

- Identify the root cause using CloudWatch metrics, X-Ray traces, and RDS Performance Insights.
- Implement immediate mitigation by killing the runaway query and adjusting connection pool limits.
- Design short-term fixes: query timeout enforcement, circuit breakers between services, and read replica isolation for analytics.
- Write a post-incident review that names the systemic failures (missing circuit breakers, no query governance, a shared read replica) and proposes architectural improvements.

This challenge teaches systematic incident triage, cascade analysis, and the resilience patterns that keep a database failure from becoming a platform-wide outage.
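The checkout-to-catalog retry storm is exactly what a circuit breaker interrupts: after repeated failures, callers fail fast instead of piling new connections onto a struggling dependency. A minimal sketch of the pattern (the class name, thresholds, and half-open behavior here are illustrative assumptions, not part of the challenge scenario):

```python
import time

class CircuitBreaker:
    """Fail fast after repeated downstream failures.

    Trips open after `max_failures` consecutive failures; while open,
    calls raise immediately instead of contacting the dependency.
    After `reset_timeout` seconds, one trial call is allowed through
    (the "half-open" state); success closes the circuit again.
    """

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: permit one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip open
            raise
        self.failures = 0  # success resets the failure count
        return result
```

In the incident above, a breaker like this between checkout and the catalog service would have converted an 8-second timeout pile-up into immediate, cheap failures, giving the catalog's connection pool room to recover.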
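The other amplifier in the cascade was aggressive, synchronized retries. The standard counter-pattern is capped exponential backoff with jitter, so clients spread their retries out rather than hammering the dependency in lockstep. A small sketch (function name and default values are hypothetical):

```python
import random

def backoff_delays(attempts, base=0.1, cap=5.0):
    """Capped exponential backoff with full jitter.

    The delay before attempt `a` is drawn uniformly from
    [0, min(cap, base * 2**a)], so a fleet of clients that failed
    at the same moment does not retry at the same moment.
    """
    return [random.uniform(0.0, min(cap, base * 2 ** a)) for a in range(attempts)]
```

Pairing jittered backoff with a retry budget (give up after a few attempts) is what keeps a transient slowdown from turning into the self-sustaining retry storm described above.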
Challenge Details
- Path: Production Breakout: Incident Response
- Difficulty: Expert
- Duration: 90 min
- Plan: Pro
Why This Challenge?
Unlike whiteboard exercises or multiple-choice quizzes, this challenge requires you to design a real architecture with actual AWS services, evaluate trade-offs, and defend your decisions. Our automated validators check your design against production-grade criteria. Complete it and it shows up in your verified portfolio with your architecture diagram and design rationale.
More from Production Breakout: Incident Response
The DDoS Storm
A volumetric DDoS attack is overwhelming your infrastructure. Mitigate it without taking the site offline.
Expert · 80 min
The Certificate Cascade
An expired TLS certificate has broken service-to-service communication across your microservices platform.
Expert · 85 min
The Split Brain
A network partition has caused your distributed database to accept conflicting writes. Resolve the split brain.
Expert · 90 min
Ready to design this for real?
Get the full scenario, design your architecture using real AWS services, and validate against production-grade criteria. Your completed challenge shows up in your verified portfolio.
Start Challenge