TY - JOUR
T1 - Data mining for censored time-to-event data: a Bayesian network model for predicting cardiovascular risk from electronic health record data
AU - Bandyopadhyay, Sunayan
AU - Wolfson, Julian
AU - Vock, David M.
AU - Vazquez-Benitez, Gabriela
AU - Adomavicius, Gediminas
AU - Elidrisi, Mohamed
AU - Johnson, Paul E.
AU - O’Connor, Patrick J.
N1 - Funding Information: This work was partially supported by NHLBI Grant R01HL102144-01 and AHRQ Grant R21HS017622-01. Publisher Copyright: © 2014, The Author(s).
PY - 2015/7/8
Y1 - 2015/7/8
N2 - Models for predicting the risk of cardiovascular (CV) events based on individual patient characteristics are important tools for managing patient care. Most current and commonly used risk prediction models have been built from carefully selected epidemiological cohorts. However, the homogeneity and limited size of such cohorts restrict the predictive power and generalizability of these risk models to other populations. Electronic health data (EHD) from large health care systems provide access to data on large, heterogeneous, and contemporaneous patient populations. The unique features and challenges of EHD, including missing risk factor information, non-linear relationships between risk factors and CV event outcomes, and differing effects from different patient subgroups, demand novel machine learning approaches to risk model development. In this paper, we present a machine learning approach based on Bayesian networks trained on EHD to predict the probability of having a CV event within 5 years. In such data, event status may be unknown for some individuals, as the event time is right-censored due to disenrollment and incomplete follow-up. Since many traditional data mining methods are not well-suited for such data, we describe how to modify both modeling and assessment techniques to account for censored observation times. We show that our approach can lead to better predictive performance than the Cox proportional hazards model (i.e., a regression-based approach commonly used for censored, time-to-event data) or a Bayesian network with ad hoc approaches to right-censoring. Our techniques are motivated by and illustrated on data from a large US Midwestern health care system.
KW - Bayesian networks
KW - Electronic health data
KW - Inverse probability of censoring weights
KW - Medical decision support
KW - Mining censored data
KW - Risk prediction
KW - Survival analysis
UR - http://www.scopus.com/inward/record.url?scp=84930476972&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84930476972&partnerID=8YFLogxK
U2 - 10.1007/s10618-014-0386-6
DO - 10.1007/s10618-014-0386-6
M3 - Article
AN - SCOPUS:84930476972
SN - 1384-5810
VL - 29
SP - 1033
EP - 1069
JO - Data Mining and Knowledge Discovery
JF - Data Mining and Knowledge Discovery
IS - 4
ER -