
Cascading Failure Prevention

The most devastating outages are not caused by a single service failing; they are caused by that failure cascading through dependent services until the entire system is down. Cascading failures happen when services are too tightly coupled, lack proper timeout and retry configurations, or cannot shed load gracefully under pressure. In this challenge, you will design a resilience architecture that prevents cascading failures across a microservices platform with complex service dependencies.

The architecture starts with a dependency mapping exercise: you will model the service call graph and identify critical paths, single points of failure, and amplification risks, where one service failure causes N dependent services to overload the failed service with retries.

The primary defense layer uses the circuit breaker pattern, implemented via AWS App Mesh (Envoy proxy outlier detection): when error rates from a downstream service exceed a threshold, the circuit opens and requests immediately return a cached or degraded response instead of adding load to the struggling service.

Timeout budgets are distributed across the call chain: if a request has a 3-second total budget, the first service call gets 1.5 seconds and nested calls get proportionally less, preventing timeout multiplication.

Bulkhead isolation uses ECS task-level resource limits and separate target groups per downstream dependency, ensuring that a slow dependency cannot consume all connection pool resources and starve other dependencies.

Load shedding implements priority-based request classification at the ALB: during overload, low-priority requests (analytics, prefetch) are dropped with 503 responses while critical requests (authentication, checkout) continue processing.

Retry policies use exponential backoff with jitter and a retry budget (at most 20% additional traffic from retries) to prevent retry storms.
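In the challenge itself the circuit breaker is configured declaratively through App Mesh, but the core state machine is easy to sketch. Here is a minimal, illustrative Python version (the class name, thresholds, and fallback hook are assumptions for illustration, not App Mesh or Envoy APIs): after a run of consecutive failures the circuit opens, and a cached or degraded fallback is served until a reset timeout elapses.

```python
import time

class CircuitBreaker:
    """Illustrative circuit breaker: opens after `max_failures`
    consecutive errors, then serves a fallback until `reset_timeout`
    elapses (hypothetical sketch, not an App Mesh API)."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, return the degraded/cached response immediately
        # instead of adding load to the struggling downstream service.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            # Half-open: the timeout elapsed, allow one trial request.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the circuit
            return fallback()
        self.failures = 0  # any success resets the failure streak
        return result
```

In production the same open/closed decision is made per-endpoint by the Envoy sidecar's outlier detection, based on observed error rates rather than an in-process counter.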
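The timeout-budget rule above (3-second total, 1.5 seconds for the first call, proportionally less for nested calls) reduces to a small helper that hands each downstream call a share of whatever budget remains. The function name and the 50% share are illustrative assumptions:

```python
def remaining_budget(total_budget, elapsed, share=0.5):
    """Give each downstream call a fixed share of the budget that is
    left, so nested calls get proportionally less and the chain can
    never exceed the caller's deadline (no timeout multiplication).
    Hypothetical sketch; `share` is an illustrative choice."""
    left = total_budget - elapsed
    if left <= 0:
        raise TimeoutError("budget exhausted before the call was made")
    return left * share

# A 3-second request budget: the first downstream call gets 1.5 s;
# a nested call made immediately afterwards gets at most 0.75 s.
first = remaining_budget(3.0, elapsed=0.0)     # 1.5
nested = remaining_budget(first, elapsed=0.0)  # 0.75
```

The point of deducting `elapsed` is that a slow upstream hop automatically shrinks every deadline below it, instead of each layer independently waiting its full configured timeout.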
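Priority-based load shedding at the ALB is configured with listener rules, but the decision it encodes is a simple classification check. A hypothetical sketch (the request classes and the 80% load threshold are assumptions, not ALB configuration):

```python
# Illustrative request classes; real systems would derive these from
# the request path or a priority header.
CRITICAL = {"auth", "checkout"}
LOW_PRIORITY = {"analytics", "prefetch"}

def admit(request_class, load_factor):
    """Priority-based load shedding: above an assumed 80% load,
    drop low-priority traffic with a 503 while critical requests
    keep processing."""
    if load_factor > 0.8 and request_class in LOW_PRIORITY:
        return 503  # shed; clients should back off, not hammer retries
    return 200

# Under overload, prefetch/analytics are dropped but checkout is not.
assert admit("analytics", load_factor=0.9) == 503
assert admit("checkout", load_factor=0.9) == 200
```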
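The retry budget (at most ~20% extra traffic from retries) is commonly implemented as a token bucket: each normal request deposits a fraction of a token, and each retry spends a whole one. A hedged Python sketch with illustrative parameter values, paired with "full jitter" exponential backoff:

```python
import random

class RetryBudget:
    """Token-bucket retry budget: every completed request deposits
    `ratio` tokens and each retry spends one, capping retries at
    roughly `ratio` (e.g. 20%) of normal traffic. Illustrative
    sketch; the cap of 100 banked tokens is an assumption."""

    def __init__(self, ratio=0.2):
        self.ratio = ratio
        self.tokens = 0.0

    def record_request(self):
        self.tokens = min(self.tokens + self.ratio, 100.0)

    def try_spend(self):
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # budget exhausted: fail fast, no retry storm

def backoff_with_jitter(attempt, base=0.1, cap=5.0):
    """Full-jitter exponential backoff: sleep a random duration in
    [0, min(cap, base * 2**attempt)] to de-synchronize clients."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Without the jitter, every client that saw the same failure retries at the same instant, which is exactly the amplification risk the dependency-mapping exercise is meant to surface.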
Health checks distinguish between liveness (is the process running?) and readiness (can it handle requests?): a service can be alive but not ready while its dependency recovers.

The architecture includes a chaos engineering framework using AWS Fault Injection Simulator that regularly tests these resilience patterns by injecting latency, errors, and instance failures. This challenge teaches distributed systems resilience, failure mode analysis, and the patterns that keep complex systems stable under partial failure.
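The liveness/readiness split typically maps to two separate health endpoints. A minimal sketch (the class, handler names, and dependency checks are hypothetical): liveness answers 200 whenever the process can respond at all, while readiness returns 503 so the load balancer stops routing traffic without restarting a healthy process.

```python
class Health:
    """Liveness says the process is up; readiness additionally checks
    that dependencies are reachable. A service can be alive but not
    ready while a downstream dependency recovers. Illustrative sketch."""

    def __init__(self, dependency_checks):
        # Mapping of dependency name -> zero-arg callable returning bool.
        self.dependency_checks = dependency_checks

    def liveness(self):
        # If we can answer at all, the process is running.
        return 200

    def readiness(self):
        # 503 tells the load balancer to drain traffic away while a
        # dependency recovers, without triggering a process restart.
        for name, check in self.dependency_checks.items():
            if not check():
                return 503
        return 200
```

Wiring liveness to restart policies and readiness to target-group health checks is what prevents a recovering dependency from triggering a wave of unnecessary restarts.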

AWS Services You'll Use

App Mesh, ECS, ALB, CloudWatch, Fault Injection Simulator, Lambda, SNS

Challenge Details

Path: Security & Resilience
Difficulty: Advanced
Duration: 75 min
Plan: Pro

Architecture Patterns You'll Learn

circuit breaker, bulkhead, load shedding, timeout budget, retry with jitter, chaos engineering

Why This Challenge?

Unlike whiteboard exercises or multiple-choice quizzes, this challenge requires you to design a real architecture with actual AWS services, evaluate trade-offs, and defend your decisions. Our automated validators check your design against production-grade criteria. Complete it, and it appears in your verified portfolio along with your architecture diagram and design rationale.

Ready to design this for real?

Get the full scenario, design your architecture using real AWS services, and validate against production-grade criteria. Your completed challenge shows up in your verified portfolio.

Start Challenge