Patient Outcomes Pipeline & Reporting
Build a validation layer and SQL analytics for health-tech clinical metrics.
The Scenario
You are a junior data engineer at Vitalink, a health-tech SaaS that aggregates patient visit records from hospitals and outpatient clinics into a central analytics platform. The data arrives daily as CSV exports — and the quality is unreliable. Records arrive with missing diagnoses, impossible dates, and duplicate visit IDs. Downstream SQL models produce inconsistent numbers. Your lead, Meera, has tasked you with fixing both layers: "We have two problems. First: the incoming patient visit data is dirty — we need a Python validation layer that quarantines bad records before they hit the database. Second: our SQL reporting models are built on raw tables and the numbers are all over the place. I want you to build a clean staging and reporting layer on top of the validated data. Validation + clean models + three key clinical metrics. Document every decision."
The Task
Build a two-part data pipeline for Vitalink: a Python data quality validation layer that screens incoming patient visit records, and a SQL analytics layer that models cleaned data into reliable clinical reporting metrics.
Requirements
Part 1 — Python Validation Layer
- Implement 6 rules: NOT NULL (visit_id, patient_id, clinic_id, admission_date), Date Logic (discharge ≥ admission), Valid Outcome (recovered/referred/deceased/ongoing), Age Range (0–120), No Duplicate Visits, Length of Stay ≤ 365 days.
- Output: clean_visits.csv, quarantine_visits.csv, validation_report.json (with 5% alert threshold).
- Rules as configuration — same pattern as WP-DE-001.
Part 2 — SQL Analytics Layer
- stg_visits: Stage clean data; derive length_of_stay_days.
- dim_clinic: One row per clinic; surrogate key.
- fact_visits: One row per visit; FK to dim_clinic; derive is_readmission in SQL (prior visit with outcome recovered/referred, discharge within 30 days of current admission).
Three Clinical Metric Queries
- Average Length of Stay by Clinic — last 6 months.
- 30-Day Readmission Rate by Month — last 12 months.
- Outcome Distribution by Clinic Type — hospital vs outpatient, current year.
Constraints
- Python for validation, SQL for analytics. Do not mix.
- Config-driven validation rules. is_readmission computed in SQL — not in Python.
- No Airflow, no dbt. Standalone Python + raw SQL only.
Mock Data Schema
patient_visits_raw.csv: visit_id, patient_id, clinic_id, admission_date, discharge_date, patient_age, outcome, diagnosis_code
clinic_reference.csv: clinic_id, clinic_name, clinic_type, region
Generate 500+ synthetic records with dirty data covering all 6 rules.
What to Submit
- validate_visits.py — produces all three output files
- SQL models: stg_visits, dim_clinic, fact_visits + 3 metric queries
- Sample data + outputs from a real run
- README: run order, is_readmission logic explanation
- DECISIONS.md (200–400 words): readmission window, pipeline structure, one trade-off
Hints
Two non-prescriptive hints. Read only if you're stuck — don't skip the thinking.
Hint 1
is_readmission: LEFT JOIN fact_visits to itself on patient_id; prior discharge_date between (admission - 30 days) and (admission - 1 day); prior outcome in (recovered, referred).
Hint 2
Length-of-stay rule needs full row — rule function must accept the full record to compute date difference.