Order Data Quality Framework
Build a configurable data quality validation framework for e-commerce order records.
The Scenario
You are a junior data engineer at Shopwire, a multi-channel e-commerce platform that aggregates orders from its own website, Amazon, and Flipkart into a central data warehouse. Every night, a batch job pulls order data from all three channels into a staging area before loading it into the warehouse. Last week, a bad batch slipped through — orders with negative amounts, missing customer IDs, and duplicate order entries made it into the warehouse and corrupted downstream revenue reports. Your lead, Fatima, has assigned you the fix: "We need a proper data quality framework before the staging-to-warehouse load. I want a Python script that validates incoming order data against a defined set of rules, quarantines bad records, produces a validation report, and only lets clean records through. Use the sample dataset. Define your validation rules as configuration — not hardcoded if-statements."
The Task
Build a configurable data quality validation framework in Python that validates incoming e-commerce order records against defined rules, separates clean from quarantined records, and produces a structured validation report.
Requirements
Validation Rules — Implement All Eight
- NOT NULL — order_id, customer_id, order_date, total_amount, channel must never be null or empty.
- Positive Amount — total_amount must be greater than 0.
- Valid Channel — channel must be one of: website, amazon, flipkart (case-insensitive).
- Date Format — order_date must parse as valid YYYY-MM-DD.
- No Future Dates — order_date must not be in the future.
- Duplicate Order ID — order_id must be unique within the batch; only first occurrence retained.
- Valid Status — status must be one of: pending, confirmed, shipped, delivered, cancelled.
- Referential Integrity — customer_id must exist in the customers reference file.
Framework Behaviour
- Rules must be defined as configuration (YAML or Python dict) — the engine reads rules and applies them, not hardcoded conditionals.
- Each record evaluated against all applicable rules; a record can fail multiple rules simultaneously.
- Clean records written to a clean output file; quarantined records to a separate file with a failed_rules column.
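One way to meet the configuration requirement is to make each rule a data entry (a type plus parameters) and keep a small dispatch table of generic check functions. This is only a sketch under assumed names; the brief does not mandate this shape, and adding a rule then means adding a config entry, not editing engine code.

```python
# Declarative rule config: each rule is data, not a hardcoded branch.
# Rule names, types, and field names here are illustrative.
RULES_CONFIG = [
    {"name": "not_null_order_id", "type": "not_null", "field": "order_id"},
    {"name": "positive_amount", "type": "min_exclusive",
     "field": "total_amount", "value": 0},
    {"name": "valid_channel", "type": "allowed_values", "field": "channel",
     "values": ["website", "amazon", "flipkart"]},
]

# Generic check implementations, keyed by rule type; each takes the
# record and the rule's own config entry.
CHECKS = {
    "not_null": lambda rec, c: rec.get(c["field"]) not in (None, ""),
    "min_exclusive": lambda rec, c: float(rec.get(c["field"], 0) or 0) > c["value"],
    "allowed_values": lambda rec, c: str(rec.get(c["field"], "")).lower()
                                     in [v.lower() for v in c["values"]],
}

def passes(record, rule_cfg):
    """True when the record satisfies the given configured rule."""
    return CHECKS[rule_cfg["type"]](record, rule_cfg)
```

With this split, extending the framework to a new rule of an existing type touches only RULES_CONFIG; only a genuinely new kind of check requires a new CHECKS entry.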
Validation Report
- Produce JSON report: run timestamp, total/clean/quarantined counts, per-rule failure breakdown.
- Flag if quarantine rate exceeds 10% (alert threshold).
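The report could be assembled along these lines; the field names below are illustrative, not mandated by the brief:

```python
import json
from datetime import datetime, timezone

def build_report(total, clean, failures_by_rule, alert_threshold=0.10):
    """Assemble the validation report; failures_by_rule maps rule name -> count."""
    quarantined = total - clean
    rate = quarantined / total if total else 0.0
    return {
        "run_timestamp": datetime.now(timezone.utc).isoformat(),
        "total_records": total,
        "clean_records": clean,
        "quarantined_records": quarantined,
        "quarantine_rate": round(rate, 4),
        "alert": rate > alert_threshold,  # flag when over the 10% threshold
        "failures_by_rule": failures_by_rule,
    }

report = build_report(200, 170, {"positive_amount": 12, "valid_channel": 18})
print(json.dumps(report, indent=2))
```

Note that per-rule failure counts can sum to more than the quarantined count, since a record can fail several rules at once.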
Constraints
- Python only — no Airflow, Spark, or dbt. pandas allowed.
- Rules as configuration — extensible without modifying core logic.
- Input formats: CSV and JSON. No database connections.
Mock Data Schema
orders_batch.csv: order_id, customer_id, order_date, total_amount, channel, status
customers_reference.csv: customer_id, name
Generate 200+ records with intentional dirty data covering all 8 rule types.
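One way to guarantee coverage of all eight rule types is to generate clean rows and then corrupt specific ones deliberately, rather than relying on random noise. A sketch, assuming customer IDs C001–C050 exist in the reference file (all values illustrative):

```python
import csv
import random

def make_orders(path=None, n=200, seed=7):
    """Generate n orders, corrupt a handful (needs n >= 9), optionally write CSV."""
    rng = random.Random(seed)
    rows = [
        {"order_id": f"ORD{i:04d}",
         "customer_id": f"C{rng.randint(1, 50):03d}",
         "order_date": f"2024-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}",
         "total_amount": round(rng.uniform(5, 500), 2),
         "channel": rng.choice(["website", "amazon", "flipkart"]),
         "status": rng.choice(["pending", "confirmed", "shipped",
                               "delivered", "cancelled"])}
        for i in range(n)
    ]
    rows[0]["total_amount"] = -42.0        # positive-amount violation
    rows[1]["channel"] = "ebay"            # invalid channel
    rows[2]["order_date"] = "2024-13-45"   # unparseable date
    rows[3]["order_date"] = "2099-01-01"   # future date
    rows[4]["order_id"] = rows[5]["order_id"]  # duplicate order_id
    rows[6]["customer_id"] = ""            # null customer_id
    rows[7]["status"] = "teleported"       # invalid status
    rows[8]["customer_id"] = "C999"        # not in the reference file
    if path:
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    return rows

sample = make_orders(n=20)
```

Seeding the generator keeps the dataset reproducible, which makes the sample outputs in the submission verifiable.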
What to Submit
- Python validation framework (runs with python validate.py --input orders_batch.csv --customers customers_reference.csv)
- Sample dataset: orders_batch.csv and customers_reference.csv (200+ records)
- Sample outputs: clean_orders.csv, quarantine_orders.csv, validation_report.json
- README: setup, how to run, how to add a new rule
- DECISIONS.md (200–400 words): config-driven design, multi-rule failure handling, one trade-off
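The required invocation needs only standard-library argument parsing. A minimal sketch of the entry point; the --input and --customers flags come from the brief, while everything else is an assumption:

```python
import argparse

def parse_args(argv=None):
    """Parse the CLI described in the brief; extra flags are illustrative."""
    parser = argparse.ArgumentParser(description="Validate an order batch.")
    parser.add_argument("--input", required=True,
                        help="orders batch file (CSV or JSON)")
    parser.add_argument("--customers", required=True,
                        help="customer reference CSV")
    parser.add_argument("--outdir", default=".",
                        help="directory for output files (assumed flag)")
    return parser.parse_args(argv)

args = parse_args(["--input", "orders_batch.csv",
                   "--customers", "customers_reference.csv"])
```

Passing argv explicitly, as above, also makes the entry point easy to unit-test.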
Hints
Two non-prescriptive hints. Read only if you're stuck — don't skip the thinking.
Hint 1
Think of each rule as an object: name, function (record → True/False), error message. Store as a list; engine loops and calls each — never needs to know what the rule does.
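In code, that hint might look like the following; the names and rule set are illustrative, not a prescribed design:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[dict], bool]  # record -> True when the record passes
    message: str

def failed_rules(record, rules):
    """Names of every rule the record fails; the engine stays generic."""
    return [rule.name for rule in rules if not rule.check(record)]

rules = [
    Rule("positive_amount", lambda r: r.get("total_amount", 0) > 0,
         "total_amount must be > 0"),
    Rule("not_null_customer", lambda r: bool(r.get("customer_id")),
         "customer_id must be present"),
]
```

The engine only ever calls rule.check; what the predicate inspects is entirely the rule's business.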
Hint 2
Do not early-exit on first failure. For each record, run every rule and collect all failures before classifying.
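A sketch of that classification loop, assuming rules are (name, predicate) pairs; duplicate detection is shown inline because it is the one cross-record rule:

```python
def classify(records, rules):
    """Split records into clean and quarantined, collecting every failure."""
    clean, quarantined = [], []
    seen_ids = set()
    for record in records:
        # Run every rule -- no early exit on the first failure.
        failures = [name for name, check in rules if not check(record)]
        if record.get("order_id") in seen_ids:   # duplicate: keep first only
            failures.append("duplicate_order_id")
        else:
            seen_ids.add(record.get("order_id"))
        if failures:
            quarantined.append({**record, "failed_rules": ";".join(failures)})
        else:
            clean.append(record)
    return clean, quarantined

rules = [("positive_amount", lambda r: r.get("total_amount", 0) > 0)]
clean, bad = classify(
    [{"order_id": "A", "total_amount": 10},
     {"order_id": "A", "total_amount": -5}],
    rules,
)
```

Here the second record ends up in quarantine with both failures recorded, which is exactly what the per-rule breakdown in the report needs.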