WP-DE-003 Data Engineering Beginner

Patient Outcomes Pipeline & Reporting

Build a validation layer and SQL analytics for health-tech clinical metrics.

The Scenario

You are a junior data engineer at Vitalink, a health-tech SaaS that aggregates patient visit records from hospitals and outpatient clinics into a central analytics platform. The data arrives daily as CSV exports — and the quality is unreliable. Records arrive with missing diagnoses, impossible dates, and duplicate visit IDs. Downstream SQL models produce inconsistent numbers. Your lead, Meera, has tasked you with fixing both layers: "We have two problems. First: the incoming patient visit data is dirty — we need a Python validation layer that quarantines bad records before they hit the database. Second: our SQL reporting models are built on raw tables and the numbers are all over the place. I want you to build a clean staging and reporting layer on top of the validated data. Validation + clean models + three key clinical metrics. Document every decision."

The Task

Build a two-part data pipeline for Vitalink: a Python data quality validation layer that screens incoming patient visit records, and a SQL analytics layer that models cleaned data into reliable clinical reporting metrics.

Requirements

Part 1 — Python Validation Layer

Implement 6 rules: NOT NULL (visit_id, patient_id, clinic_id, admission_date), Date Logic (discharge ≥ admission), Valid Outcome (recovered/referred/deceased/ongoing), Age Range (0–120), No Duplicate Visits, Length of Stay ≤ 365 days.
Output: clean_visits.csv, quarantine_visits.csv, validation_report.json (with 5% alert threshold).
Rules as configuration — same pattern as WP-DE-001.

Part 2 — SQL Analytics Layer

stg_visits: Stage clean data; derive length_of_stay_days.
dim_clinic: One row per clinic; surrogate key.
fact_visits: One row per visit; FK to dim_clinic; derive is_readmission in SQL (prior visit with outcome recovered/referred, discharge within 30 days of current admission).

Three Clinical Metric Queries

Average Length of Stay by Clinic — last 6 months.
30-Day Readmission Rate by Month — last 12 months.
Outcome Distribution by Clinic Type — hospital vs outpatient, current year.

Constraints

Python for validation, SQL for analytics. Do not mix.
Config-driven validation rules. is_readmission computed in SQL — not in Python.
No Airflow, no dbt. Standalone Python + raw SQL only.

Mock Data Schema

patient_visits_raw.csv: visit_id, patient_id, clinic_id, admission_date, discharge_date, patient_age, outcome, diagnosis_code
clinic_reference.csv: clinic_id, clinic_name, clinic_type, region
Generate 500+ synthetic records with dirty data covering all 6 rules.

What to Submit

validate_visits.py — produces all three output files
SQL models: stg_visits, dim_clinic, fact_visits + 3 metric queries
Sample data + outputs from a real run
README: run order, is_readmission logic explanation
DECISIONS.md (200–400 words): readmission window, pipeline structure, one trade-off

Hints

Two non-prescriptive hints. Read only if you're stuck — don't skip the thinking.

Hint 1

is_readmission: LEFT JOIN fact_visits to itself on patient_id; prior discharge_date between (admission - 30 days) and (admission - 1 day); prior outcome in (recovered, referred).

Hint 2

Length-of-stay rule needs full row — rule function must accept the full record to compute date difference.

Wooble · Project #WP-DE-003 · Patient Outcomes Pipeline & Reporting · Data Engineering Track · v1.0