WP-DE-001 · Data Engineering · Beginner

Order Data Quality Framework

Build a configurable data quality validation framework for e-commerce order records.

1. The Scenario

You are a junior data engineer at Shopwire, a multi-channel e-commerce platform that aggregates orders from its own website, Amazon, and Flipkart into a central data warehouse. Every night, a batch job pulls order data from all three channels into a staging area before loading it into the warehouse. Last week, a bad batch slipped through — orders with negative amounts, missing customer IDs, and duplicate order entries made it into the warehouse and corrupted downstream revenue reports. Your lead, Fatima, has assigned you the fix: "We need a proper data quality framework before the staging-to-warehouse load. I want a Python script that validates incoming order data against a defined set of rules, quarantines bad records, produces a validation report, and only lets clean records through. Use the sample dataset. Define your validation rules as configuration — not hardcoded if-statements."

2. The Task

Build a configurable data quality validation framework in Python that validates incoming e-commerce order records against defined rules, separates clean from quarantined records, and produces a structured validation report.

3. Requirements

3A. Validation Rules — Implement All Eight

  • NOT NULL — order_id, customer_id, order_date, total_amount, channel must never be null or empty.
  • Positive Amount — total_amount must be greater than 0.
  • Valid Channel — channel must be one of: website, amazon, flipkart (case-insensitive).
  • Date Format — order_date must parse as valid YYYY-MM-DD.
  • No Future Dates — order_date must not be in the future.
  • Duplicate Order ID — order_id must be unique within the batch; only first occurrence retained.
  • Valid Status — status must be one of: pending, confirmed, shipped, delivered, cancelled.
  • Referential Integrity — customer_id must exist in the customers reference file.

3B. Framework Behaviour

  • Rules must be defined as configuration (YAML or Python dict) — the engine reads rules and applies them, not hardcoded conditionals.
  • Each record evaluated against all applicable rules; a record can fail multiple rules simultaneously.
  • Clean records written to clean output file; quarantined records to separate file with failed_rules column.
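
The behaviour above could be sketched like this (a minimal sketch; names such as RULES and classify are assumptions for illustration, not part of the spec):

```python
# Illustrative config-driven classification: rules live in a data structure,
# every rule runs for every record, and failures are joined into a
# failed_rules value on the quarantine side.

RULES = {
    "positive_amount": lambda r: float(r.get("total_amount", 0) or 0) > 0,
    "valid_channel": lambda r: str(r.get("channel", "")).lower()
                               in {"website", "amazon", "flipkart"},
}

def classify(records, rules):
    """Split records into clean and quarantined lists."""
    clean, quarantined = [], []
    for record in records:
        # Evaluate every rule; a record may fail several at once.
        failures = [name for name, check in rules.items() if not check(record)]
        if failures:
            quarantined.append({**record, "failed_rules": ";".join(failures)})
        else:
            clean.append(record)
    return clean, quarantined
```

Extending the framework then means adding an entry to the rule configuration; classify itself never changes.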

3C. Validation Report

  • Produce JSON report: run timestamp, total/clean/quarantined counts, per-rule failure breakdown.
  • Flag if quarantine rate exceeds 10% (alert threshold).
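
Under these requirements, validation_report.json might look like this (field names and numbers are invented for illustration):

```json
{
  "run_timestamp": "2024-03-01T02:15:00Z",
  "total_records": 200,
  "clean_records": 174,
  "quarantined_records": 26,
  "quarantine_rate": 0.13,
  "alert": true,
  "failures_by_rule": {
    "positive_amount": 9,
    "duplicate_order_id": 7,
    "referential_integrity": 6,
    "valid_channel": 4
  }
}
```

Here alert is true because the 13% quarantine rate exceeds the 10% threshold.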

4. Constraints

  • Python only — no Airflow, Spark, or dbt. pandas allowed.
  • Rules as configuration — extensible without modifying core logic.
  • Input formats: CSV and JSON. No database connections.

5. Mock Data Schema

orders_batch.csv: order_id, customer_id, order_date, total_amount, channel, status
customers_reference.csv: customer_id, name
Generate 200+ records with intentional dirty data covering all eight rule types.
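
A handful of intentionally dirty rows might look like this (all values invented for illustration):

```
order_id,customer_id,order_date,total_amount,channel,status
ORD-1001,CUST-042,2024-02-10,59.99,website,delivered
ORD-1002,,2024-02-10,120.00,amazon,shipped
ORD-1003,CUST-007,2024-02-11,-15.50,flipkart,pending
ORD-1001,CUST-042,2024-02-10,59.99,website,delivered
ORD-1004,CUST-999,2099-01-01,30.00,ebay,returned
```

ORD-1002 is missing its customer_id; ORD-1003 has a negative amount; the second ORD-1001 is a duplicate; ORD-1004 has a future date, an unknown channel, and an invalid status, and would also fail referential integrity if CUST-999 is absent from customers_reference.csv.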

6. What to Submit

  • Python validation framework (runs with python validate.py --input orders_batch.csv --customers customers_reference.csv)
  • Sample dataset: orders_batch.csv and customers_reference.csv (200+ records)
  • Sample outputs: clean_orders.csv, quarantine_orders.csv, validation_report.json
  • README: setup, how to run, how to add a new rule
  • DECISIONS.md (200–400 words): config-driven design, multi-rule failure handling, one trade-off

7. Hints

Two non-prescriptive hints. Read only if you're stuck — don't skip the thinking.

Hint 1

Think of each rule as an object: a name, a check function (record → True/False), and an error message. Store the rules as a list; the engine loops over them and calls each one, and never needs to know what any individual rule does.

Hint 2

Do not early-exit on first failure. For each record, run every rule and collect all failures before classifying.
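
Putting both hints together, a minimal rule-object engine might look like this (all names are illustrative, not prescribed):

```python
# Sketch combining both hints: each rule is a small object the engine can
# call without knowing what it validates, and every rule runs for every
# record; there is no early exit on the first failure.
from collections import namedtuple

Rule = namedtuple("Rule", ["name", "check", "message"])

RULES = [
    Rule("positive_amount",
         lambda r: float(r.get("total_amount", 0) or 0) > 0,
         "total_amount must be greater than 0"),
    Rule("valid_status",
         lambda r: str(r.get("status", "")).lower()
                   in {"pending", "confirmed", "shipped", "delivered", "cancelled"},
         "status is not a recognised value"),
]

def validate(record, rules):
    """Return (name, message) for every rule the record fails."""
    return [(rule.name, rule.message) for rule in rules
            if not rule.check(record)]
```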

Wooble  ·  Project #WP-DE-001  ·  Order Data Quality Framework  ·  Data Engineering Track  ·  v1.0