Order Data Quality Framework
Build a configurable data quality validation framework for e-commerce order records.
The Scenario
You are a junior data engineer at Shopwire, a multi-channel e-commerce platform that aggregates orders from its own website, Amazon, and Flipkart into a central data warehouse. Every night, a batch job pulls order data from all three channels into a staging area before loading it into the warehouse. Last week, a bad batch slipped through — orders with negative amounts, missing customer IDs, and duplicate order entries made it into the warehouse and corrupted downstream revenue reports. Your lead, Fatima, has assigned you the fix: "We need a proper data quality framework before the staging-to-warehouse load. I want a Python script that validates incoming order data against a defined set of rules, quarantines bad records, produces a validation report, and only lets clean records through. Use the sample dataset. Define your validation rules as configuration — not hardcoded if-statements."
The Task
Build a configurable data quality validation framework in Python that validates incoming e-commerce order records against defined rules, separates clean from quarantined records, and produces a structured validation report.
Requirements
Validation Rules — Implement All Eight
- NOT NULL — order_id, customer_id, order_date, total_amount, channel must never be null or empty.
- Positive Amount — total_amount must be greater than 0.
- Valid Channel — channel must be one of: website, amazon, flipkart (case-insensitive).
- Date Format — order_date must parse as valid YYYY-MM-DD.
- No Future Dates — order_date must not be in the future.
- Duplicate Order ID — order_id must be unique within the batch; only first occurrence retained.
- Valid Status — status must be one of: pending, confirmed, shipped, delivered, cancelled.
- Referential Integrity — customer_id must exist in the customers reference file.
Framework Behaviour
- Rules must be defined as configuration (YAML or Python dict) — the engine reads rules and applies them, not hardcoded conditionals.
- Each record evaluated against all applicable rules; a record can fail multiple rules simultaneously.
- Clean records written to a clean output file; quarantined records to a separate file with a failed_rules column.
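One way to meet the configuration requirement is to make each rule a data entry (a type plus parameters) and keep a small dispatch table of generic check functions. This is only a sketch under assumed names; the brief does not mandate this shape, and adding a rule then means adding a config entry, not editing engine code.

```python
# Declarative rule config: each rule is data, not a hardcoded branch.
# Rule names, types, and field names here are illustrative.
RULES_CONFIG = [
    {"name": "not_null_order_id", "type": "not_null", "field": "order_id"},
    {"name": "positive_amount", "type": "min_exclusive",
     "field": "total_amount", "value": 0},
    {"name": "valid_channel", "type": "allowed_values", "field": "channel",
     "values": ["website", "amazon", "flipkart"]},
]

# Generic check implementations, keyed by rule type; each takes the
# record and the rule's own config entry.
CHECKS = {
    "not_null": lambda rec, c: rec.get(c["field"]) not in (None, ""),
    "min_exclusive": lambda rec, c: float(rec.get(c["field"], 0) or 0) > c["value"],
    "allowed_values": lambda rec, c: str(rec.get(c["field"], "")).lower()
                                     in [v.lower() for v in c["values"]],
}

def passes(record, rule_cfg):
    """True when the record satisfies the given configured rule."""
    return CHECKS[rule_cfg["type"]](record, rule_cfg)
```

With this split, extending the framework to a new rule of an existing type touches only RULES_CONFIG; only a genuinely new kind of check requires a new CHECKS entry.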
Validation Report
- Produce JSON report: run timestamp, total/clean/quarantined counts, per-rule failure breakdown.
- Flag if quarantine rate exceeds 10% (alert threshold).
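The report could be assembled along these lines; the field names below are illustrative, not mandated by the brief:

```python
import json
from datetime import datetime, timezone

def build_report(total, clean, failures_by_rule, alert_threshold=0.10):
    """Assemble the validation report; failures_by_rule maps rule name -> count."""
    quarantined = total - clean
    rate = quarantined / total if total else 0.0
    return {
        "run_timestamp": datetime.now(timezone.utc).isoformat(),
        "total_records": total,
        "clean_records": clean,
        "quarantined_records": quarantined,
        "quarantine_rate": round(rate, 4),
        "alert": rate > alert_threshold,  # flag when over the 10% threshold
        "failures_by_rule": failures_by_rule,
    }

report = build_report(200, 170, {"positive_amount": 12, "valid_channel": 18})
print(json.dumps(report, indent=2))
```

Note that per-rule failure counts can sum to more than the quarantined count, since a record can fail several rules at once.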
Constraints
- Python only — no Airflow, Spark, or dbt. pandas allowed.
- Rules as configuration — extensible without modifying core logic.
- Input formats: CSV and JSON. No database connections.
Mock Data Schema
orders_batch.csv: order_id, customer_id, order_date, total_amount, channel, status
customers_reference.csv: customer_id, name
Generate 200+ records with intentional dirty data covering all 8 rule types.
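One way to guarantee coverage of all eight rule types is to generate clean rows and then corrupt specific ones deliberately, rather than relying on random noise. A sketch, assuming customer IDs C001–C050 exist in the reference file (all values illustrative):

```python
import csv
import random

def make_orders(path=None, n=200, seed=7):
    """Generate n orders, corrupt a handful (needs n >= 9), optionally write CSV."""
    rng = random.Random(seed)
    rows = [
        {"order_id": f"ORD{i:04d}",
         "customer_id": f"C{rng.randint(1, 50):03d}",
         "order_date": f"2024-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}",
         "total_amount": round(rng.uniform(5, 500), 2),
         "channel": rng.choice(["website", "amazon", "flipkart"]),
         "status": rng.choice(["pending", "confirmed", "shipped",
                               "delivered", "cancelled"])}
        for i in range(n)
    ]
    rows[0]["total_amount"] = -42.0        # positive-amount violation
    rows[1]["channel"] = "ebay"            # invalid channel
    rows[2]["order_date"] = "2024-13-45"   # unparseable date
    rows[3]["order_date"] = "2099-01-01"   # future date
    rows[4]["order_id"] = rows[5]["order_id"]  # duplicate order_id
    rows[6]["customer_id"] = ""            # null customer_id
    rows[7]["status"] = "teleported"       # invalid status
    rows[8]["customer_id"] = "C999"        # not in the reference file
    if path:
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    return rows

sample = make_orders(n=20)
```

Seeding the generator keeps the dataset reproducible, which makes the sample outputs in the submission verifiable.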
What to Submit
- Python validation framework (runs with python validate.py --input orders_batch.csv --customers customers_reference.csv)
- Sample dataset: orders_batch.csv and customers_reference.csv (200+ records)
- Sample outputs: clean_orders.csv, quarantine_orders.csv, validation_report.json
- README: setup, how to run, how to add a new rule
- DECISIONS.md (200–400 words): config-driven design, multi-rule failure handling, one trade-off
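The required invocation needs only standard-library argument parsing. A minimal sketch of the entry point; the --input and --customers flags come from the brief, while everything else is an assumption:

```python
import argparse

def parse_args(argv=None):
    """Parse the CLI described in the brief; extra flags are illustrative."""
    parser = argparse.ArgumentParser(description="Validate an order batch.")
    parser.add_argument("--input", required=True,
                        help="orders batch file (CSV or JSON)")
    parser.add_argument("--customers", required=True,
                        help="customer reference CSV")
    parser.add_argument("--outdir", default=".",
                        help="directory for output files (assumed flag)")
    return parser.parse_args(argv)

args = parse_args(["--input", "orders_batch.csv",
                   "--customers", "customers_reference.csv"])
```

Passing argv explicitly, as above, also makes the entry point easy to unit-test.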
Hints
Two non-prescriptive hints. Read only if you're stuck — don't skip the thinking.
Hint 1
Think of each rule as an object: name, function (record → True/False), error message. Store as a list; engine loops and calls each — never needs to know what the rule does.
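In code, that hint might look like the following; the names and rule set are illustrative, not a prescribed design:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[dict], bool]  # record -> True when the record passes
    message: str

def failed_rules(record, rules):
    """Names of every rule the record fails; the engine stays generic."""
    return [rule.name for rule in rules if not rule.check(record)]

rules = [
    Rule("positive_amount", lambda r: r.get("total_amount", 0) > 0,
         "total_amount must be > 0"),
    Rule("not_null_customer", lambda r: bool(r.get("customer_id")),
         "customer_id must be present"),
]
```

The engine only ever calls rule.check; what the predicate inspects is entirely the rule's business.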
Hint 2
Do not early-exit on first failure. For each record, run every rule and collect all failures before classifying.
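A sketch of that classification loop, assuming rules are (name, predicate) pairs; duplicate detection is shown inline because it is the one cross-record rule:

```python
def classify(records, rules):
    """Split records into clean and quarantined, collecting every failure."""
    clean, quarantined = [], []
    seen_ids = set()
    for record in records:
        # Run every rule -- no early exit on the first failure.
        failures = [name for name, check in rules if not check(record)]
        if record.get("order_id") in seen_ids:   # duplicate: keep first only
            failures.append("duplicate_order_id")
        else:
            seen_ids.add(record.get("order_id"))
        if failures:
            quarantined.append({**record, "failed_rules": ";".join(failures)})
        else:
            clean.append(record)
    return clean, quarantined

rules = [("positive_amount", lambda r: r.get("total_amount", 0) > 0)]
clean, bad = classify(
    [{"order_id": "A", "total_amount": 10},
     {"order_id": "A", "total_amount": -5}],
    rules,
)
```

Here the second record ends up in quarantine with both failures recorded, which is exactly what the per-rule breakdown in the report needs.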