ML Feature Store
Training-serving skew, where the features used during model training differ from those available at inference time, is the most insidious source of ML model degradation in production. Feature stores solve this by providing a single source of truth for feature computation, storage, and serving. In this challenge, you will design a feature store architecture using Amazon SageMaker Feature Store that supports both batch feature ingestion for training and low-latency online feature retrieval for real-time inference.

The offline store uses S3 with Parquet format and Glue Data Catalog integration, enabling Athena queries for training dataset construction with point-in-time correctness, which is crucial for preventing data leakage in time-series features. The online store uses SageMaker Feature Store's built-in low-latency storage backed by DynamoDB, providing single-digit-millisecond feature retrieval during inference.

Feature ingestion runs on two paths: batch pipelines using Glue jobs that compute features from raw data in S3 and ingest them on a schedule, and streaming pipelines using Kinesis Data Streams with Lambda consumers that compute real-time features (such as rolling averages and session counts) and ingest them immediately.

You will design the feature group schema strategy, organizing features by entity (user features, product features, interaction features) with a consistent naming convention and version tracking. Feature transformations use SageMaker Processing jobs for batch and Lambda for streaming, with shared transformation code packaged as a Lambda layer to ensure consistency.

The architecture includes a feature freshness monitoring system using CloudWatch metrics that track ingestion lag per feature group and alert when features become stale. Data quality validation runs on every batch ingestion using Great Expectations-style checks implemented in Lambda, blocking ingestion of features that fail schema or statistical distribution checks.
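To make the "shared transformation code" idea concrete, here is a minimal sketch of a rolling-average transform that could live in one package imported by both paths (installed into the SageMaker Processing container for batch, shipped as a Lambda layer for streaming). The function name and signature are hypothetical, not part of any AWS API:

```python
from collections import deque

def rolling_average(values, window):
    """Trailing rolling average over a fixed window.

    Keeping this in a single shared package (batch + streaming import
    the same function) is what prevents the two ingestion paths from
    drifting apart and reintroducing training-serving skew.
    """
    buf = deque(maxlen=window)  # oldest value drops out automatically
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out
```

The streaming Lambda would call this per event with a small state buffer, while the batch Glue/Processing job applies it over historical partitions; either way, the arithmetic is identical.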
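Point-in-time correctness is easiest to see in code. The sketch below is an in-memory stand-in for the Athena query the offline store would run: for each training label, it joins only the most recent feature value written at or before the label's timestamp, never after. All names here are illustrative:

```python
def point_in_time_join(labels, features):
    """Join each (entity_id, label_time) to the latest feature value
    with event_time <= label_time. Using any later value would leak
    future information into the training set.

    labels:   list of (entity_id, label_time)
    features: list of (entity_id, event_time, value)
    """
    by_entity = {}
    for eid, ts, val in features:
        by_entity.setdefault(eid, []).append((ts, val))
    for rows in by_entity.values():
        rows.sort()  # ascending by event_time

    joined = []
    for eid, label_ts in labels:
        # keep only feature rows that existed as of the label time
        past = [(ts, v) for ts, v in by_entity.get(eid, []) if ts <= label_ts]
        joined.append((eid, label_ts, past[-1][1] if past else None))
    return joined
```

In production this logic is expressed as a windowed SQL join in Athena over the Parquet offline store; the Python version just makes the leakage rule explicit.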
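On the streaming path, the Lambda consumer writes each computed feature row to the online store. The helper below builds the record payload in the shape the `sagemaker-featurestore-runtime` `put_record` API expects (every value serialized as a string); the feature group name and feature names in the commented call are placeholders:

```python
def to_feature_record(features):
    """Convert a plain dict into the Record list expected by
    sagemaker-featurestore-runtime put_record / returned by get_record:
    [{"FeatureName": ..., "ValueAsString": ...}, ...].
    """
    return [{"FeatureName": k, "ValueAsString": str(v)} for k, v in features.items()]

# A Kinesis-triggered Lambda consumer would then ingest roughly like this
# (untested sketch; "user-features-v1" is a hypothetical group name):
#
# import boto3
# runtime = boto3.client("sagemaker-featurestore-runtime")
# runtime.put_record(
#     FeatureGroupName="user-features-v1",
#     Record=to_feature_record({
#         "user_id": "u1",
#         "session_count_1h": 7,
#         "event_time": "2024-01-01T00:00:00Z",
#     }),
# )
```

At inference time the same runtime client's `get_record` call retrieves the latest row for a record identifier with single-digit-millisecond latency.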
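The ingestion-gating validator can be sketched as a pure function the Lambda runs before writing a batch. This is a simplified, hypothetical version of the Great Expectations-style checks described above (a real validator would also cover null rates, cardinality, and distribution drift):

```python
def validate_batch(rows, schema, bounds):
    """Schema + statistical gate run before batch ingestion.

    schema: {column: required_type} for every row
    bounds: {column: (lo, hi)} acceptable range for the column mean
    Returns (ok, problems); ingestion is blocked when ok is False.
    """
    problems = []
    for i, row in enumerate(rows):
        for col, typ in schema.items():
            if col not in row or not isinstance(row[col], typ):
                problems.append(f"row {i}: {col} missing or not {typ.__name__}")
    for col, (lo, hi) in bounds.items():
        vals = [r[col] for r in rows if col in r]
        if vals:
            mean = sum(vals) / len(vals)
            if not lo <= mean <= hi:
                problems.append(f"{col}: mean {mean} outside [{lo}, {hi}]")
    return (not problems, problems)
```

Failing batches are rejected (and typically routed to a quarantine prefix in S3 for inspection) rather than silently ingested.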
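Freshness monitoring reduces to one number per feature group: how long ago the newest ingested record's event time was. A minimal sketch, assuming that lag is published as a custom CloudWatch metric (e.g. via `put_metric_data`) and alarmed against a per-group staleness budget:

```python
import time

def freshness_lag_seconds(last_event_time, now=None):
    """Seconds between the newest ingested record's event time and now.

    This is the value a scheduled monitor would publish to CloudWatch
    per feature group; the staleness budget below is a hypothetical
    per-group threshold, not an AWS setting.
    """
    now = time.time() if now is None else now
    return max(0.0, now - last_event_time)

def is_stale(lag_seconds, budget_seconds):
    """True when a feature group has exceeded its staleness budget."""
    return lag_seconds > budget_seconds
```

Streaming groups might carry a budget of minutes while daily batch groups tolerate 24+ hours, so budgets belong in per-group configuration rather than a global constant.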
Access control uses IAM policies scoped to feature groups, so ML teams can only read features relevant to their models. This challenge teaches feature store architecture, training-serving consistency, and the data engineering patterns that make ML systems reliable.
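Scoping a team to its feature groups comes down to listing feature-group ARNs as the policy resource. The generator below sketches such a read-only policy; the account ID, region, and group names are placeholders, and the action names (`sagemaker:GetRecord`, `sagemaker:DescribeFeatureGroup`) follow the SageMaker IAM action list:

```python
def feature_group_read_policy(account_id, region, feature_groups):
    """Build a read-only IAM policy document limited to the given
    feature groups: GetRecord against the online store plus
    DescribeFeatureGroup for discovery, nothing write-side.
    """
    arns = [
        f"arn:aws:sagemaker:{region}:{account_id}:feature-group/{name}"
        for name in feature_groups
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["sagemaker:GetRecord", "sagemaker:DescribeFeatureGroup"],
            "Resource": arns,
        }],
    }
```

A consistent naming convention (e.g. `user-features-v1`) pays off here too: it lets resource ARNs use predictable prefixes per team or entity.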
AWS Services You'll Use
Challenge Details
- Path
- AI/ML Infrastructure
- Difficulty
- Advanced
- Duration
- 65 min
- Plan
- Pro
Architecture Patterns You'll Learn
Why This Challenge?
Unlike whiteboard exercises or multiple-choice quizzes, this challenge requires you to design a real architecture with actual AWS services, evaluate trade-offs, and defend your decisions. Our automated validators check your design against production-grade criteria. Complete it, and it appears in your verified portfolio with your architecture diagram and design rationale.
More from AI/ML Infrastructure
RAG Pipeline Architecture
Design a Retrieval-Augmented Generation pipeline that grounds LLM responses in enterprise knowledge bases.
Advanced · 70 min
Multi-Agent Orchestration
Design a multi-agent system where specialized AI agents collaborate to solve complex tasks.
Advanced · 75 min
ML Model Serving Platform
Design a model serving platform that delivers low-latency predictions with A/B testing and canary deployment.
Advanced · 70 min
Ready to design this for real?
Get the full scenario, design your architecture using real AWS services, and validate against production-grade criteria. Your completed challenge shows up in your verified portfolio.
Start Challenge