AI4Health Synthetic Data — Professional Edition (1 Million Rows)

Build, test, and benchmark healthcare AI on day one.

One million rows of realistic, privacy-safe structured (tabular) health data, plus 4 starter benchmarks so your team can hit the ground running.

Who this is for

Data scientists, ML engineers, educators, and product teams who need credible, EHR-style data to prototype models, compare methods, and teach workflows, without months of approvals or high access fees.

The problem we solve

Real clinical datasets are powerful, but access can be slow, expensive, and complex. You need to try ideas, evaluate pipelines, and show value now. The Professional Edition gives you a large, ready-to-use dataset that behaves like routine health data, so you can move from “idea” to “evidence” fast.

What you get

1) The dataset (main product)
  • Size: 1,000,000 records
  • Format: CSV (compressed)
  • Schema (70+ variables): demographics, lifestyle, environment (PM2.5, green space), vitals/labs (SBP, eGFR, HbA1c, FEV1%), conditions (COPD, T2D, HF, CVD, CKD, depression, anxiety, cancer history), severities, medications (incl. statins), outcomes (COPD exacerbations, CVD event flag, HF hospitalisations, etc.) 
2) Bonus: 4 benchmark studies (with starter code + guides) 
  • COPD Detection: screening vs diagnostic settings; see the lift from adding spirometry
  • COPD Exacerbations: count models and gain curves for targeted intervention
  • Cardiovascular Events: calibrated logistic regression and risk stratification
  • Type 2 Diabetes Complications: work with class imbalance and sampling strategies
3) Clear documentation
  • Data dictionary, variable groups, value ranges, and quick-start instructions
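A million-row CSV is easiest to handle as a stream. The sketch below is a minimal illustration, not the shipped quick-start: it assumes gzip compression, and the file name and column names (age, sbp, copd) are hypothetical stand-ins, so check the data dictionary for the real ones. It writes a tiny compressed file first so it runs without the actual dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical column names; the real ones are listed in the data dictionary.
demo = pd.DataFrame({
    "age": [54, 67, 71, 45, 80, 62],
    "sbp": [128.0, 141.0, 150.0, 118.0, 160.0, 135.0],
    "copd": [0, 1, 1, 0, 1, 0],
})

# Write a tiny gzip-compressed stand-in so the sketch runs without the real file.
# Assumption: the dataset's compression is gzip; adjust if it is zip or similar.
path = os.path.join(tempfile.mkdtemp(), "demo_health.csv.gz")
demo.to_csv(path, index=False, compression="gzip")


def summarise_in_chunks(csv_path, chunksize=2):
    """Stream a compressed CSV and return (row count, COPD prevalence)."""
    n_rows = 0
    n_copd = 0
    # pandas infers gzip from the .gz suffix; chunksize keeps memory bounded.
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        n_rows += len(chunk)
        n_copd += int(chunk["copd"].sum())
    return n_rows, n_copd / n_rows


n_rows, prevalence = summarise_in_chunks(path)
print(n_rows, round(prevalence, 3))
```

For the full file, a chunksize in the range of 100,000 rows is a reasonable starting point on a laptop.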

Why it works

  • Realistic structure and behaviour: Generated from first principles using public evidence (prevalences, risk factors, effect sizes), not copied from real patients
  • Privacy-safe by design: Created from random seeds; no real patient data were used
  • Tool-agnostic: Works with Python, R, SQL, Spark, or your favourite BI tool

How you’ll use it (typical wins in week 1)

  • Stand up an end-to-end ML pipeline (ingest → features → train → evaluate → explain)
  • Compare algorithms on the same cohort with reproducible results
  • Produce credible demo plots (ROC/PR, calibration, gain curves, SHAP) for reviews and slide-decks
  • Teach workshops or labs without governance hurdles
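As a rough idea of what a first pipeline can look like, here is a minimal scikit-learn sketch covering features, training, and evaluation. It is not one of the bundled benchmarks: the data are simulated inline, and the feature names (age, sbp, smoker) and the effect sizes are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated stand-in cohort; swap in columns from the real CSV.
# Feature names and coefficients here are illustrative assumptions.
rng = np.random.default_rng(0)
n = 5000
age = rng.normal(60, 12, n)
sbp = rng.normal(135, 18, n)
smoker = rng.integers(0, 2, n)
logit = -3.0 + 0.04 * (age - 60) + 0.02 * (sbp - 135) + 0.8 * smoker
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
X = np.column_stack([age, sbp, smoker])

# Hold out a stratified test set so the event rate matches in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale features, then fit a plain logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate discrimination (ROC AUC) and probability accuracy (Brier score).
p_test = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, p_test)
brier = brier_score_loss(y_test, p_test)
print(f"ROC AUC: {auc:.3f}  Brier: {brier:.3f}")
```

Fixing random_state in the split is what makes algorithm comparisons on the same cohort repeatable.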

Compatibility

  • Python (pandas, scikit-learn, lifelines), R (tidyverse), SQL/DBT, Spark/Databricks, Snowflake/BigQuery imports
  • Runs locally or in your cloud. No sandbox required

Who succeeds with the Professional Edition

  • Startups/scale-ups: Validate modelling ideas and pipeline reliability before securing real data
  • Enterprises: Internal POCs, hiring assessments, vendor bake-offs
  • Universities/bootcamps: Hands-on teaching with realistic, safe data

FAQs

Is this de-identified real data?
No. It’s fully synthetic and generated from random seeds guided by public evidence; no real patient data are used.

Can I publish results using this dataset?
Yes, you can publish methodology and results using the synthetic data (cite AI4Health Synthetic Data). Avoid implying that results reflect any specific real-world population.

What problems can I model?
Binary classification (e.g., COPD, CVD event), count regression (exacerbations), and survival-style analyses. Beyond the bundled benchmarks, the data also suit GLMs, tree-based models such as XGBoost, calibration, SHAP explanations, and clustering.
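For the count-regression case, a Poisson GLM is one common starting point. The sketch below uses scikit-learn's PoissonRegressor on simulated exacerbation counts; it is an illustration rather than the benchmark's own model, and the predictors (severity, smoker) and their effect sizes are assumptions.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Simulated stand-in data; predictor names and effects are assumptions.
rng = np.random.default_rng(1)
n = 4000
severity = rng.integers(1, 5, n)  # hypothetical severity grade, 1 to 4
smoker = rng.integers(0, 2, n)
rate = np.exp(-1.5 + 0.4 * severity + 0.3 * smoker)
counts = rng.poisson(rate)  # yearly exacerbation counts
X = np.column_stack([severity, smoker])

# alpha=0.0 turns off the default L2 penalty, giving an unpenalised GLM.
model = PoissonRegressor(alpha=0.0, max_iter=300)
model.fit(X, counts)

# Predictions are expected counts per person, always positive (log link).
pred = model.predict(X)
print("severity coefficient:", round(float(model.coef_[0]), 2))
print("mean observed vs predicted:", counts.mean(), pred.mean())
```

If the observed counts are much more variable than the predicted means suggest, a negative binomial model is the usual next step.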

Do I get the exact same data as others?
Yes, the Professional Edition is a standard release to ensure benchmarks are comparable.

Credibility

Developed by Dr Syed Ahmar Shah and the AI4Health team, drawing on years of hands-on experience analysing large, routinely collected UK health datasets (CPRD, OPCRD, EAVE II, SAIL Databank, OpenSAFELY) and publishing extensively in peer-reviewed medical journals.