The Certificate Cascade
At 6:15 AM on a Monday, your on-call engineer gets paged: the payment service is returning TLS handshake errors when calling the fraud detection service. By 6:20 AM, cascading alerts are firing across twelve services. The root cause: an internal CA certificate used for mutual TLS between microservices expired at midnight, and every service-to-service connection that relies on mTLS is failing. But this is not a simple fix — the certificate was issued by an intermediate CA in AWS Certificate Manager Private CA, and replacing it requires understanding the certificate chain, reissuing leaf certificates for every affected service, and restarting service mesh proxies without causing an outage during the fix. In this incident response challenge, you must navigate a complex certificate hierarchy under time pressure. The architecture uses a root CA (offline, in an HSM) that signed an intermediate CA in ACM Private CA, which issues short-lived certificates (90 days) for service mesh mTLS via App Mesh. The intermediate CA certificate was set to auto-renew, but the renewal Lambda function failed silently three months ago due to an IAM permission change during a security audit. Your immediate mitigation requires issuing a new intermediate CA certificate, but you discover that the root CA's signing profile has been updated and the old intermediate CA's subject name does not match the new template. You must create a new intermediate CA, update the trust stores in all service mesh proxies, and roll the change across the fleet without a full outage. The response includes: using ACM Private CA to issue a new intermediate CA, updating the App Mesh virtual node trust bundles to trust both the old and new intermediate CAs during the transition, performing a rolling restart of Envoy proxies with the new trust bundle, and then revoking the old intermediate CA after all services are verified. Your post-incident actions include implementing certificate expiration monitoring via CloudWatch alarms on ACM Private CA metrics, adding a monthly certificate rotation test to your chaos engineering suite, and designing an automated certificate rotation pipeline using EventBridge scheduled rules and Step Functions. This challenge teaches PKI architecture, certificate chain management, and the incident response procedures for cryptographic infrastructure failures.
AWS Services You'll Use
Challenge Details
- Path
- Production Breakout: Incident Response
- Difficulty
- Expert
- Duration
- 85 min
- Plan
- Pro
Architecture Patterns You'll Learn
Why This Challenge?
Unlike whiteboard exercises or multiple-choice quizzes, this challenge requires you to design a real architecture with actual AWS services, evaluate trade-offs, and defend your decisions. Our automated validators check your design against production-grade criteria. Complete it and it shows up in your verified portfolio with your architecture diagram and design rationale.
More from Production Breakout: Incident Response
The Cascading Failure
A single database timeout has taken down five services. Trace the cascade and restore service.
Expert · 90 minThe DDoS Storm
A volumetric DDoS attack is overwhelming your infrastructure. Mitigate it without taking the site offline.
Expert · 80 minThe Split Brain
A network partition has caused your distributed database to accept conflicting writes. Resolve the split brain.
Expert · 90 minReady to design this for real?
Get the full scenario, design your architecture using real AWS services, and validate against production-grade criteria. Your completed challenge shows up in your verified portfolio.
Start Challenge