Expert85 min

The Certificate Cascade

At 6:15 AM on a Monday, your on-call engineer gets paged: the payment service is returning TLS handshake errors when calling the fraud detection service. By 6:20 AM, cascading alerts are firing across twelve services. The root cause: an internal CA certificate used for mutual TLS between microservices expired at midnight, and every service-to-service connection that relies on mTLS is failing. But this is not a simple fix — the certificate was issued by an intermediate CA in AWS Certificate Manager Private CA, and replacing it requires understanding the certificate chain, reissuing leaf certificates for every affected service, and restarting service mesh proxies without causing an outage during the fix. In this incident response challenge, you must navigate a complex certificate hierarchy under time pressure. The architecture uses a root CA (offline, in an HSM) that signed an intermediate CA in ACM Private CA, which issues short-lived certificates (90 days) for service mesh mTLS via App Mesh. The intermediate CA certificate was set to auto-renew, but the renewal Lambda function failed silently three months ago due to an IAM permission change during a security audit. Your immediate mitigation requires issuing a new intermediate CA certificate, but you discover that the root CA's signing profile has been updated and the old intermediate CA's subject name does not match the new template. You must create a new intermediate CA, update the trust stores in all service mesh proxies, and roll the change across the fleet without a full outage. The response includes: using ACM Private CA to issue a new intermediate CA, updating the App Mesh virtual node trust bundles to trust both the old and new intermediate CAs during the transition, performing a rolling restart of Envoy proxies with the new trust bundle, and then revoking the old intermediate CA after all services are verified. Your post-incident actions include implementing certificate expiration monitoring via CloudWatch alarms on ACM Private CA metrics, adding a monthly certificate rotation test to your chaos engineering suite, and designing an automated certificate rotation pipeline using EventBridge scheduled rules and Step Functions. This challenge teaches PKI architecture, certificate chain management, and the incident response procedures for cryptographic infrastructure failures.

Start Challenge Back to Production Breakout: Incident Response

AWS Services You'll Use

ACM Private CAApp MeshECSLambdaEventBridgeCloudWatchStep Functions

Challenge Details

Path: Production Breakout: Incident Response
Difficulty: Expert
Duration: 85 min
Plan: Pro

Architecture Patterns You'll Learn

certificate chainmTLS rotationtrust bundle updaterolling restartsilent failure detection

Why This Challenge?

Unlike whiteboard exercises or multiple-choice quizzes, this challenge requires you to design a real architecture with actual AWS services, evaluate trade-offs, and defend your decisions. Our automated validators check your design against production-grade criteria. Complete it and it shows up in your verified portfolio with your architecture diagram and design rationale.

Ready to design this for real?

Get the full scenario, design your architecture using real AWS services, and validate against production-grade criteria. Your completed challenge shows up in your verified portfolio.

Start Challenge

The Certificate Cascade

AWS Services You'll Use

Challenge Details

Architecture Patterns You'll Learn

Why This Challenge?

More from Production Breakout: Incident Response

The Cascading Failure

The DDoS Storm

The Split Brain

Ready to design this for real?