The Split Brain
Your monitoring shows something terrifying: two regional replicas of your DynamoDB Global Tables deployment, in us-east-1 and eu-west-1, hold different values for the same records, and both regions are accepting writes. A transient network partition between the regions halted replication for 47 minutes, during which both regions continued serving reads and writes independently, creating a split-brain condition in which the same customer's order status shows as 'shipped' in one region and 'cancelled' in the other. In this incident response challenge, you must resolve the data conflicts, restore consistency, and design safeguards so the scenario cannot recur.

Your immediate investigation uses DynamoDB Streams to identify every record modified in both regions during the partition window. You build a Lambda function that reads the streams from both regions, finds conflicting writes (records modified in both regions between the partition start and end timestamps), and categorizes them by conflict type: non-overlapping field updates (automatically mergeable), overlapping field updates with clear precedence (for example, status transitions that must follow the defined state machine), and true conflicts requiring human review.

For automatic conflict resolution, you implement a last-writer-wins policy for non-critical fields, matching DynamoDB's built-in timestamp-based conflict resolution, and a custom resolution function for business-critical fields like order status that enforces state machine transitions (an order that was 'shipped' cannot transition to 'cancelled' without an explicit return flow). For records requiring human review, a Step Functions workflow creates a conflict resolution queue in SQS, backed by a dashboard showing the conflicting values from each region.
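The categorization and state-machine logic described above can be sketched in plain Python. This is a minimal illustration, not the challenge's reference solution: the field names, the `ORDER_STATE_MACHINE` transition table, and the assumption that a pre-partition base image of each record is available (e.g. from the stream's `OldImage`) are all illustrative.

```python
# Illustrative order-status state machine; the transitions are an assumption,
# not the scenario's actual business rules.
ORDER_STATE_MACHINE = {
    "pending":   {"paid", "cancelled"},
    "paid":      {"shipped", "cancelled"},
    "shipped":   {"delivered", "returned"},   # no direct shipped -> cancelled
    "delivered": {"returned"},
    "cancelled": set(),
    "returned":  set(),
}

def changed_fields(base: dict, version: dict) -> set:
    """Fields whose value differs from the pre-partition base image."""
    return {k for k in version if version.get(k) != base.get(k)}

def resolve_status(a: str, b: str):
    """Pick whichever status is legally reachable from the other, if any."""
    if a == b:
        return a
    if b in ORDER_STATE_MACHINE.get(a, set()):
        return b          # a -> b is a legal transition, so b is the later state
    if a in ORDER_STATE_MACHINE.get(b, set()):
        return a
    return None           # neither ordering is legal: needs human review

def categorize(base: dict, region_a: dict, region_b: dict) -> tuple:
    """Classify one conflicting record and, where possible, merge it."""
    overlap = changed_fields(base, region_a) & changed_fields(base, region_b)
    if not overlap:
        # Non-overlapping updates: apply both regions' changes to the base.
        merged = dict(base)
        merged.update({k: region_a[k] for k in changed_fields(base, region_a)})
        merged.update({k: region_b[k] for k in changed_fields(base, region_b)})
        return ("auto_merge", merged)
    if overlap == {"status"}:
        winner = resolve_status(region_a["status"], region_b["status"])
        if winner is not None:
            merged = {**base, **region_a, **region_b, "status": winner}
            return ("precedence", merged)
    return ("human_review", None)
```

A 'shipped' vs. 'cancelled' conflict falls through to `human_review` because neither status is reachable from the other, which is exactly the property the custom resolver is meant to enforce.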
Your investigation reveals that the network partition was caused by a misconfigured VPC peering route that was removed during a Terraform apply, so you implement an AWS Config rule that monitors cross-region VPC peering routes and alerts on any change.

Your post-incident architecture improvements include implementing a conflict-free data model using CRDTs for counters and sets, adding a replication lag monitor built on custom CloudWatch metrics derived from DynamoDB Streams processing timestamps, and designing a circuit breaker that switches to single-region write mode when replication lag exceeds a threshold, sacrificing availability in the secondary region to preserve consistency.

This challenge teaches distributed systems consistency, conflict resolution strategies, and the CAP theorem trade-offs of real multi-region deployments.
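The lag-monitor-plus-circuit-breaker idea can be sketched as a small state machine. The threshold and hysteresis values below are illustrative assumptions, and the class only shows the decision logic; in a real deployment the measured lag would also be published as a custom CloudWatch metric (for example via `put_metric_data`) so alarms and dashboards observe the same signal.

```python
class ReplicationCircuitBreaker:
    """Switches writes to primary-only mode when measured replication lag
    exceeds a trip threshold, and restores multi-region writes once lag
    drops below a lower reset threshold (hysteresis avoids flapping).
    Threshold values are illustrative, not recommendations."""

    def __init__(self, trip_after_s: float = 60.0, reset_below_s: float = 5.0):
        self.trip_after_s = trip_after_s
        self.reset_below_s = reset_below_s
        self.single_region_mode = False

    def record_lag(self, source_write_ts: float, replica_apply_ts: float) -> float:
        """Compute lag for one replicated record (timestamps would come from
        a DynamoDB Streams processor) and update the breaker state."""
        lag = max(0.0, replica_apply_ts - source_write_ts)
        if lag > self.trip_after_s:
            self.single_region_mode = True     # trip: consistency over availability
        elif lag < self.reset_below_s:
            self.single_region_mode = False    # recover: resume multi-region writes
        return lag

    def writable_regions(self, primary: str, secondary: str) -> list:
        """Regions the application is currently allowed to write to."""
        if self.single_region_mode:
            return [primary]
        return [primary, secondary]
```

Keeping the reset threshold well below the trip threshold is the deliberate design choice here: a breaker that trips and resets at the same value would oscillate as lag hovers near the limit.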
AWS Services You'll Use
Challenge Details
- Path: Production Breakout: Incident Response
- Difficulty: Expert
- Duration: 90 min
- Plan: Pro
Architecture Patterns You'll Learn
Why This Challenge?
Unlike whiteboard exercises or multiple-choice quizzes, this challenge requires you to design a real architecture with actual AWS services, evaluate trade-offs, and defend your decisions. Our automated validators check your design against production-grade criteria. Complete it and it shows up in your verified portfolio with your architecture diagram and design rationale.
More from Production Breakout: Incident Response
The Cascading Failure
A single database timeout has taken down five services. Trace the cascade and restore service.
Expert · 90 min
The DDoS Storm
A volumetric DDoS attack is overwhelming your infrastructure. Mitigate it without taking the site offline.
Expert · 80 min
The Certificate Cascade
An expired TLS certificate has broken service-to-service communication across your microservices platform.
Expert · 85 min
Ready to design this for real?
Get the full scenario, design your architecture using real AWS services, and validate against production-grade criteria. Your completed challenge shows up in your verified portfolio.
Start Challenge