
The Split Brain

Your monitoring shows something terrifying: two replicas of your DynamoDB Global Tables deployment, in us-east-1 and eu-west-1, are showing different values for the same records, and both regions are accepting writes. A transient network partition between the regions halted replication for 47 minutes, during which both regions continued serving reads and writes independently — a split-brain condition in which the same customer's order status shows as 'shipped' in one region and 'cancelled' in the other. In this incident response challenge, you must resolve the data conflicts, restore consistency, and design safeguards to prevent the scenario from recurring.

Your immediate investigation uses DynamoDB Streams to identify every record that was modified in both regions during the partition window. You build a Lambda function that reads the streams from both regions, identifies conflicting writes (records modified in both regions between the partition start and end timestamps), and categorizes them by conflict type: non-overlapping field updates (automatically mergeable), overlapping field updates with clear precedence (e.g., status transitions that follow the defined state machine), and true conflicts requiring human review.

For automatic conflict resolution, you rely on DynamoDB's built-in timestamp-based last-writer-wins policy for non-critical fields, and implement a custom resolution function for business-critical fields like order status that enforces state machine transitions (an order that was 'shipped' cannot transition to 'cancelled' without an explicit return flow). For records requiring human review, a Step Functions workflow creates a conflict resolution queue in SQS, backed by a dashboard showing the conflicting values from each region.
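The categorization step above can be sketched as pure Python. This is a minimal illustration, not the challenge's reference solution: the flat record shape, the VALID_TRANSITIONS table, and the three category names are all assumptions made for the example.

```python
# Hypothetical order-status state machine (an assumption for illustration;
# the real transitions would come from your order domain model).
VALID_TRANSITIONS = {
    "pending":   {"paid", "cancelled"},
    "paid":      {"shipped", "cancelled"},
    "shipped":   {"delivered", "returned"},
    "delivered": {"returned"},
}

def categorize_conflict(east_item: dict, west_item: dict, base_item: dict) -> str:
    """Classify one record that was modified in both regions during the partition.

    base_item is the last value replicated before the partition began.
    Returns 'auto_merge', 'lww', or 'human_review'.
    """
    # Which fields did each region change relative to the pre-partition value?
    east_changed = {k for k, v in east_item.items() if base_item.get(k) != v}
    west_changed = {k for k, v in west_item.items() if base_item.get(k) != v}
    overlap = east_changed & west_changed

    if not overlap:
        return "auto_merge"      # non-overlapping field updates: merge both sides
    if "status" not in overlap:
        return "lww"             # overlapping but non-critical: last-writer-wins

    # Business-critical field: accept last-writer-wins only when one region's
    # status is a legal state-machine transition from the other's.
    east_s, west_s = east_item["status"], west_item["status"]
    if west_s in VALID_TRANSITIONS.get(east_s, set()):
        return "lww"
    if east_s in VALID_TRANSITIONS.get(west_s, set()):
        return "lww"
    return "human_review"        # e.g. 'shipped' vs 'cancelled': a true conflict
```

With this shape, the incident's headline conflict ('shipped' in one region, 'cancelled' in the other, from a common 'paid' base) lands in the human-review queue, while disjoint field edits merge automatically.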
Your investigation reveals that the network partition was caused by a misconfigured VPC peering route that was removed during a Terraform apply, so you implement an AWS Config rule that monitors cross-region VPC peering routes and alerts on any change.

Your post-incident architecture improvements include a conflict-free data model using CRDTs for counters and sets; a replication lag monitor built on custom CloudWatch metrics derived from DynamoDB Streams processing timestamps; and a circuit breaker that switches to single-region write mode when replication lag exceeds a threshold, sacrificing availability in the secondary region to preserve consistency. This challenge teaches distributed systems consistency, conflict resolution strategies, and the CAP theorem trade-offs of real multi-region deployments.
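The circuit breaker described above can be sketched as a small stateful class. This is a hedged illustration only: the class name, thresholds, and hysteresis behavior are assumptions, and the CloudWatch metric publication and write-routing side effects are deliberately omitted. The lag input is assumed to come from comparing each stream record's ApproximateCreationDateTime against the time the replica processor handled it.

```python
import time
from typing import Optional

class ReplicationCircuitBreaker:
    """Trips into single-region write mode when replication lag exceeds a
    threshold, and resets only once lag falls back below a lower recovery
    bound (hysteresis avoids flapping near the threshold)."""

    def __init__(self, lag_threshold_s: float = 30.0, recovery_s: float = 5.0):
        self.lag_threshold_s = lag_threshold_s
        self.recovery_s = recovery_s
        self.single_region_mode = False

    def observe_lag(self, record_created_at: float,
                    processed_at: Optional[float] = None) -> bool:
        """Feed one stream record's timestamps; returns the current mode.

        record_created_at: the record's ApproximateCreationDateTime (epoch s).
        processed_at: when the replica processor saw it (defaults to now).
        """
        lag = (processed_at if processed_at is not None else time.time()) - record_created_at
        if lag > self.lag_threshold_s:
            self.single_region_mode = True    # trip: stop secondary-region writes
        elif lag <= self.recovery_s:
            self.single_region_mode = False   # replication caught up: resume
        return self.single_region_mode
```

In a real deployment the same lag value would also be published as a custom CloudWatch metric, so the circuit breaker state and an alarm share one source of truth.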

AWS Services You'll Use

DynamoDB Global Tables
DynamoDB Streams
Lambda
Step Functions
SQS
CloudWatch
Config

Challenge Details

Path
Production Breakout: Incident Response
Difficulty
Expert
Duration
90 min
Plan
Pro

Architecture Patterns You'll Learn

split-brain resolution
conflict detection
last-writer-wins
state machine validation
replication lag monitoring

Why This Challenge?

Unlike whiteboard exercises or multiple-choice quizzes, this challenge requires you to design a real architecture with actual AWS services, evaluate trade-offs, and defend your decisions. Our automated validators check your design against production-grade criteria. Complete it, and it appears in your verified portfolio alongside your architecture diagram and design rationale.

Ready to design this for real?

Get the full scenario, design your architecture using real AWS services, and validate against production-grade criteria. Your completed challenge shows up in your verified portfolio.

Start Challenge