The Memory Leak
For the past week, your application's P99 latency has been climbing steadily, from 200ms to 800ms, and your auto-scaling group keeps adding instances, but each new instance degrades at the same rate. Your monitoring shows a classic sawtooth pattern: memory usage on each instance climbs from 40% to 95% over 6 hours, at which point the OOM killer terminates the process, the health check fails, and the ASG replaces the instance. The new instance works perfectly for 6 hours, then the cycle repeats.

In this incident response challenge, you must find and fix the memory leak while it is happening in production, without the luxury of reproducing it locally. Your investigation starts with CloudWatch Container Insights metrics, which show the memory growth pattern and let you correlate it with application metrics. You then deploy a diagnostic sidecar container running a lightweight memory profiler; it connects to the application's debug port and takes heap snapshots at 30-minute intervals without impacting production traffic.

The heap diff between snapshots reveals the leak: a caching library is storing HTTP response objects, each of which includes a reference to the Express.js request object, which in turn holds a reference to the socket. The cache keeps this ever-growing chain of objects reachable, so the garbage collector can never reclaim them, and the chain grows with every request. The leak only manifests in production because the development environment uses HTTP/1.1, which closes connections after each request, while production uses HTTP/2 with persistent connections; the long-lived socket references keep the retained request objects pinned in memory.

Your immediate mitigation is a health check enhancement that monitors RSS memory growth rate and marks an instance unhealthy once memory exceeds 80%, shrinking the degradation window from 6 hours to 4 hours.
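The retention chain described above can be sketched with plain objects. This is an illustrative model, not the challenge's actual code: `persistentSocket`, `makeRequest`, and the cache shape are all hypothetical stand-ins for the real Express.js objects.

```javascript
// Hypothetical sketch of the leak class: a cache that stores full
// response-like objects. Each cached entry keeps its `req` alive,
// and `req` keeps its `socket` alive, so nothing in the chain can
// be garbage-collected while the cache entry exists.
const cache = new Map();

// Stand-in for a long-lived HTTP/2 socket with per-connection buffers.
const persistentSocket = { buffers: new Array(1024).fill('x') };

function makeRequest(id, socket) {
  // In Express, res.req and req.socket form exactly this chain.
  const req = { id, socket };
  const res = { statusCode: 200, body: `payload-${id}`, req };
  return res;
}

for (let i = 0; i < 3; i++) {
  const res = makeRequest(i, persistentSocket);
  cache.set(`route-${i}`, res); // BUG: caching the whole response object
}

// Every cached entry still reaches the socket through res.req.socket,
// so the GC must keep the entire chain resident.
console.log(cache.get('route-0').req.socket === persistentSocket); // true
```

Under HTTP/1.1 the socket (and everything hanging off it) dies with the connection, which is why the same cache looks harmless in development; under HTTP/2 the socket persists and the cache entries accumulate against it.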
The actual fix involves patching the caching library to store only the serialized response body instead of the full response object, deployed via a canary deployment to verify the fix before full rollout. Your post-incident actions include adding application-level memory metrics to CloudWatch (heap used, heap total, external memory, array buffers), creating an auto-scaling policy based on memory growth rate rather than absolute usage, and implementing a pre-production load test that runs for 24 hours to catch memory leaks before they reach production. This challenge teaches memory leak diagnosis in production, profiling techniques for containerized applications, and the monitoring patterns that detect gradual degradation before it becomes an outage.
AWS Services You'll Use
Challenge Details
- Path: Production Breakout: Incident Response
- Difficulty: Expert
- Duration: 75 min
- Plan: Pro
Architecture Patterns You'll Learn
Why This Challenge?
Unlike whiteboard exercises or multiple-choice quizzes, this challenge requires you to design a real architecture with actual AWS services, evaluate trade-offs, and defend your decisions. Our automated validators check your design against production-grade criteria. Complete it and it shows up in your verified portfolio with your architecture diagram and design rationale.
More from Production Breakout: Incident Response
The Cascading Failure
A single database timeout has taken down five services. Trace the cascade and restore service.
Expert · 90 min
The DDoS Storm
A volumetric DDoS attack is overwhelming your infrastructure. Mitigate it without taking the site offline.
Expert · 80 min
The Certificate Cascade
An expired TLS certificate has broken service-to-service communication across your microservices platform.
Expert · 85 min
Ready to design this for real?
Get the full scenario, design your architecture using real AWS services, and validate against production-grade criteria. Your completed challenge shows up in your verified portfolio.
Start Challenge