TY - JOUR
T1 - AI-driven synthetic data generation for accelerating hepatology research
T2 - A study of the United Network for Organ Sharing (UNOS) database
AU - Ahn, Joseph C.
AU - Noh, Yung Kyun
AU - Hu, Mingzhao
AU - Shen, Xiaotong
AU - Simonetto, Douglas A.
AU - Kamath, Patrick S.
AU - Loomba, Rohit
AU - Shah, Vijay H.
N1 - Publisher Copyright: © 2025 American Association for the Study of Liver Diseases. Published by Wolters Kluwer Health, Inc.
PY - 2025
Y1 - 2025
N2 - Background and Aims: Clinical hepatology research often faces limited data availability, underrepresentation of minority groups, and complex data-sharing regulations. Synthetic data - artificially generated patient records designed to mirror real-world distributions - offers a potential solution. We hypothesized that diffusion models, a state-of-the-art generative technique, could produce synthetic liver transplant waitlist data from the United Network for Organ Sharing (UNOS) database that maintains statistical fidelity, replicates clinical correlations and survival patterns, and ensures robust privacy protection. Methods: Diffusion models were used to generate synthetic patient cohorts mirroring the UNOS liver transplant waitlist database between years 2019 and 2023. Statistical fidelity was assessed using Maximum Mean Discrepancy (MMD) and Wasserstein distance, correlation analysis, and variable-level metrics. Clinical utility was evaluated by comparing transplant-free survival via Kaplan-Meier curves and the MELD score performance. Privacy was quantified using the Distance to Closest Record (DCR) and attribute disclosure risk assessments. Results: The synthetic dataset was nearly indistinguishable from the original dataset (MMD=0.002, standardized Wasserstein distance<1.0), preserving clinically relevant correlations and survival patterns as evidenced by similar median survival times (110 vs. 101 days) and 5-year survival rates (22.2% vs. 22.8%). MELD-based 90-day mortality prediction was maintained (original AUC=0.839 vs. synthetic AUC=0.844). Privacy metrics indicated no identifiable patient matches, and mean DCR values ensured that synthetic individuals were not direct replicas of real patients. Conclusion: AI-generated synthetic data derived from diffusion models can faithfully replicate complex hepatology datasets, maintain key clinical signals, and ensure strong privacy safeguards. This approach can help address data scarcity, enhance model generalizability, foster multi-institutional collaboration, and accelerate progress in hepatology research.
AB - Background and Aims: Clinical hepatology research often faces limited data availability, underrepresentation of minority groups, and complex data-sharing regulations. Synthetic data - artificially generated patient records designed to mirror real-world distributions - offers a potential solution. We hypothesized that diffusion models, a state-of-the-art generative technique, could produce synthetic liver transplant waitlist data from the United Network for Organ Sharing (UNOS) database that maintains statistical fidelity, replicates clinical correlations and survival patterns, and ensures robust privacy protection. Methods: Diffusion models were used to generate synthetic patient cohorts mirroring the UNOS liver transplant waitlist database between years 2019 and 2023. Statistical fidelity was assessed using Maximum Mean Discrepancy (MMD) and Wasserstein distance, correlation analysis, and variable-level metrics. Clinical utility was evaluated by comparing transplant-free survival via Kaplan-Meier curves and the MELD score performance. Privacy was quantified using the Distance to Closest Record (DCR) and attribute disclosure risk assessments. Results: The synthetic dataset was nearly indistinguishable from the original dataset (MMD=0.002, standardized Wasserstein distance<1.0), preserving clinically relevant correlations and survival patterns as evidenced by similar median survival times (110 vs. 101 days) and 5-year survival rates (22.2% vs. 22.8%). MELD-based 90-day mortality prediction was maintained (original AUC=0.839 vs. synthetic AUC=0.844). Privacy metrics indicated no identifiable patient matches, and mean DCR values ensured that synthetic individuals were not direct replicas of real patients. Conclusion: AI-generated synthetic data derived from diffusion models can faithfully replicate complex hepatology datasets, maintain key clinical signals, and ensure strong privacy safeguards. This approach can help address data scarcity, enhance model generalizability, foster multi-institutional collaboration, and accelerate progress in hepatology research.
KW - artificial intelligence
KW - diffusion models
KW - liver transplantation
KW - privacy-preserving healthcare data
KW - synthetic data
UR - http://www.scopus.com/inward/record.url?scp=105000370555&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105000370555&partnerID=8YFLogxK
U2 - 10.1097/hep.0000000000001299
DO - 10.1097/hep.0000000000001299
M3 - Article
C2 - 40067682
AN - SCOPUS:105000370555
SN - 0270-9139
JO - Hepatology
JF - Hepatology
M1 - 1299
ER -