Expert · 75 min

The Memory Leak

For the past week, your application's P99 latency has been climbing steadily, from 200ms to 800ms, and your auto-scaling group keeps adding instances, but the new instances degrade at the same rate. Your monitoring shows a classic sawtooth pattern: memory usage on each instance climbs steadily from 40% to 95% over 6 hours, at which point the OOM killer terminates the process, the health check fails, and the ASG replaces the instance. The new instance works perfectly for 6 hours, then the cycle repeats. In this incident response challenge, you must find and fix the memory leak while it is happening in production, without the luxury of reproducing it locally.

Your investigation starts with CloudWatch Container Insights metrics showing the memory growth pattern and correlating it with application metrics. You deploy a diagnostic sidecar container that runs a lightweight memory profiler, connecting to the application's debug port to take heap snapshots at 30-minute intervals without impacting production traffic. The heap diff between snapshots reveals the leak: a caching library is storing HTTP response objects that include a reference to the Express.js request object, which in turn holds a reference to the socket. The result is a chain of objects that are no longer needed but remain reachable, so the garbage collector can never reclaim them, and the chain grows with every request.

The leak only manifests in production because the development environment uses HTTP/1.1, which closes connections after each request, while production uses HTTP/2 with persistent connections: the long-lived socket references keep the request objects alive. Your immediate mitigation deploys a health check enhancement that monitors RSS memory growth rate and marks instances as unhealthy when memory exceeds 80%, reducing the degradation window from 6 hours to 4 hours.
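The mitigation above can be sketched as a memory-aware health check. This is a minimal illustration, not the challenge's reference solution: the 2 GiB container limit, the 80% threshold constant, and the handler name are all assumed placeholders.

```javascript
// Sketch of an RSS-based health check: report unhealthy once resident
// memory crosses 80% of the container's memory limit, so the target group
// drains the instance before the OOM killer fires.

const MEMORY_LIMIT_BYTES = 2 * 1024 ** 3; // assumed 2 GiB task memory limit
const UNHEALTHY_THRESHOLD = 0.8;

function memoryHealthy() {
  // process.memoryUsage().rss is the process's resident set size in bytes.
  return process.memoryUsage().rss < MEMORY_LIMIT_BYTES * UNHEALTHY_THRESHOLD;
}

// Express-style handler; wire it to the ALB health check path.
function healthCheck(req, res) {
  if (memoryHealthy()) {
    res.status(200).send('ok');
  } else {
    res.status(503).send('memory pressure');
  }
}
```

Failing the health check deliberately is what shrinks the degradation window: the ASG replaces the instance at 80% RSS instead of waiting for the OOM killer at 95%.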
The actual fix involves patching the caching library to store only the serialized response body instead of the full response object, deployed via a canary deployment to verify the fix before full rollout. Your post-incident actions include adding application-level memory metrics to CloudWatch (heap used, heap total, external memory, array buffers), creating an auto-scaling policy based on memory growth rate rather than absolute usage, and implementing a pre-production load test that runs for 24 hours to catch memory leaks before they reach production. This challenge teaches memory leak diagnosis in production, profiling techniques for containerized applications, and the monitoring patterns that detect gradual degradation before it becomes an outage.
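The library patch can be illustrated with a small sketch. The function and field names here are hypothetical, not the real caching library's API; the point is that the fixed cache entry holds only plain serialized data, so no reference chain back to the request or its socket survives in the cache.

```javascript
// Hypothetical caching-library patch.
// Leaky version (for contrast): cache.set(key, { response }) retains
// response.req and, transitively, response.req.socket — under HTTP/2's
// persistent connections that socket lives for a long time, pinning
// every cached request in memory.

const cache = new Map();

// Patched version: copy scalars and serialize the body; keep no live refs.
function cacheResponse(key, response, body) {
  cache.set(key, {
    statusCode: response.statusCode,
    contentType: response.getHeader && response.getHeader('content-type'),
    body: JSON.stringify(body), // serialized body only, no object graph
  });
}

function getCached(key) {
  const entry = cache.get(key);
  return entry && { ...entry, body: JSON.parse(entry.body) };
}
```

Because the entry contains only primitives and a string, the request, response, and socket objects become unreachable as soon as the request finishes, and the garbage collector reclaims them normally.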

AWS Services You'll Use

ECS, CloudWatch, Container Insights, Auto Scaling, CodeDeploy, X-Ray

Challenge Details

Path: Production Breakout: Incident Response
Difficulty: Expert
Duration: 75 min
Plan: Pro

Architecture Patterns You'll Learn

heap profiling, canary deployment, health check enhancement, memory growth monitoring, diagnostic sidecar

Why This Challenge?

Unlike whiteboard exercises or multiple-choice quizzes, this challenge requires you to design a real architecture with actual AWS services, evaluate trade-offs, and defend your decisions. Our automated validators check your design against production-grade criteria. Complete it, and it appears in your verified portfolio with your architecture diagram and design rationale.

Ready to design this for real?

Get the full scenario, design your architecture using real AWS services, and validate against production-grade criteria. Your completed challenge shows up in your verified portfolio.
