Expert80 min

The Deployment Rollback

A deployment went out at 5 PM on Friday (of course) that included both a code change and a database migration. The code change updates the user service to read from a new 'email_verified' column, and the migration added this column and backfilled it from the legacy 'verification_status' field. Within an hour, customer support reports that 15% of users are being asked to re-verify their emails. Investigation reveals the bug: the backfill migration had a WHERE clause that excluded users created before 2024, meaning 150,000 long-time users have 'email_verified = false' instead of 'true'. The new code trusts this column and forces re-verification. In this incident response challenge, the straightforward fix — rolling back the deployment — is not an option because the forward migration added the new column and the rollback migration would drop it, destroying the correct data for the 85% of users who were backfilled properly. You must surgically fix the data without losing valid state. Your response follows a structured approach. First, you quantify the impact using Athena queries over the user database replica to identify exactly which users are affected (created before 2024 with email_verified = false but verification_status = 'verified'). Second, you implement an immediate code-level mitigation: deploy a hotfix that checks both the new column AND the legacy field, treating users as verified if either source confirms it. This stops the bleeding within 15 minutes. Third, you design and execute a corrective data migration using a Lambda function that processes affected users in batches, setting email_verified = true for users whose legacy verification_status confirms they were verified. The migration runs with DynamoDB conditional writes to ensure it does not overwrite any user who legitimately re-verified during the incident window. Fourth, you implement a verification dashboard that compares old and new column values across the entire user base, confirming 100% consistency before removing the legacy fallback code. Your post-incident review identifies the systemic failures: the backfill migration was not tested against a production-volume dataset (the bug only manifested for pre-2024 users because the date format changed), the deployment coupled a code change with a data migration (they should be deployed separately with a compatibility window), and there was no rollback plan that accounted for the data dependency. This challenge teaches safe database migration strategies, data-aware rollback procedures, and the deployment patterns that prevent data corruption incidents.

AWS Services You'll Use

RDSLambdaDynamoDBAthenaCodeDeployCloudWatchS3

Challenge Details

Path
Production Breakout: Incident Response
Difficulty
Expert
Duration
80 min
Plan
Pro

Architecture Patterns You'll Learn

forward-only migrationdual-read compatibilitybatch correctionconditional writesdeployment decoupling

Why This Challenge?

Unlike whiteboard exercises or multiple-choice quizzes, this challenge requires you to design a real architecture with actual AWS services, evaluate trade-offs, and defend your decisions. Our automated validators check your design against production-grade criteria. Complete it and it shows up in your verified portfolio with your architecture diagram and design rationale.

Ready to design this for real?

Get the full scenario, design your architecture using real AWS services, and validate against production-grade criteria. Your completed challenge shows up in your verified portfolio.

Start Challenge