Data mining for censored time-to-event data: a Bayesian network model for predicting cardiovascular risk from electronic health record data

Sunayan Bandyopadhyay, Julian Wolfson, David M Vock, Gabriela Vazquez-Benitez, Gediminas Adomavicius, Mohamed Elidrisi, Paul E Johnson, Patrick J. O’Connor

Research output: Contribution to journal › Article

39 Citations (Scopus)

Abstract

Models for predicting the risk of cardiovascular (CV) events based on individual patient characteristics are important tools for managing patient care. Most current and commonly used risk prediction models have been built from carefully selected epidemiological cohorts. However, the homogeneity and limited size of such cohorts restrict the predictive power and generalizability of these risk models to other populations. Electronic health data (EHD) from large health care systems provide access to data on large, heterogeneous, and contemporaneous patient populations. The unique features and challenges of EHD, including missing risk factor information, non-linear relationships between risk factors and CV event outcomes, and differing effects from different patient subgroups, demand novel machine learning approaches to risk model development. In this paper, we present a machine learning approach based on Bayesian networks trained on EHD to predict the probability of having a CV event within 5 years. In such data, event status may be unknown for some individuals, as the event time is right-censored due to disenrollment and incomplete follow-up. Since many traditional data mining methods are not well-suited for such data, we describe how to modify both modeling and assessment techniques to account for censored observation times. We show that our approach can lead to better predictive performance than the Cox proportional hazards model (i.e., a regression-based approach commonly used for censored, time-to-event data) or a Bayesian network with ad hoc approaches to right-censoring. Our techniques are motivated by and illustrated on data from a large US Midwestern health care system.
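The abstract does not spell out how censored observation times are handled, but the keywords name inverse probability of censoring weights (IPCW). As a rough illustration of that general idea, and not the authors' implementation, the sketch below assigns each patient a binary "event within the horizon" label and a weight equal to one over the estimated probability of remaining uncensored. The function names, the toy data, and the tie-free Kaplan-Meier shortcut are all assumptions made for this example.

```python
import numpy as np


def censoring_survival(times, events):
    """Kaplan-Meier estimate of G(t) = P(C > t), treating censoring as the 'event'.

    Simplification (assumption): no special handling of ties between event
    and censoring times; the toy data below has none.
    """
    times = np.asarray(times, dtype=float)
    censored = 1 - np.asarray(events, dtype=int)
    jump_times = np.unique(times[censored == 1])
    surv = []
    s = 1.0
    for t in jump_times:
        at_risk = np.sum(times >= t)
        n_cens = np.sum((times == t) & (censored == 1))
        s *= 1.0 - n_cens / at_risk
        surv.append(s)
    surv = np.array(surv)

    def G(t, left_limit=False):
        # Count censoring times <= t (or < t when the left limit G(t-) is wanted).
        side = "left" if left_limit else "right"
        k = np.searchsorted(jump_times, t, side=side)
        return 1.0 if k == 0 else float(surv[k - 1])

    return G


def ipcw_weights(times, events, horizon):
    """Return (labels, weights) for the outcome 'event occurred by `horizon`'.

    labels: 1 = event by horizon, 0 = event-free at horizon, -1 = unknown
            (censored before the horizon; these get weight 0).
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    G = censoring_survival(times, events)
    labels = np.full(len(times), -1)
    weights = np.zeros(len(times))
    for i, (t, e) in enumerate(zip(times, events)):
        if e == 1 and t <= horizon:
            labels[i] = 1
            weights[i] = 1.0 / G(t, left_limit=True)
        elif t > horizon:
            labels[i] = 0
            weights[i] = 1.0 / G(horizon)
    return labels, weights


# Hypothetical follow-up times in years; event = 1 means a CV event was observed.
times = [1, 2, 3, 4, 6, 7]
events = [1, 0, 1, 0, 0, 1]
labels, weights = ipcw_weights(times, events, horizon=5)

# Weighted estimate of the 5-year event probability using only known-status patients.
risk = np.sum(weights * (labels == 1)) / np.sum(weights)
print(labels, weights, risk)
```

Patients censored before the horizon contribute no label, and the remaining patients are up-weighted to stand in for them. In a setting like the one the abstract describes, such weights could be passed to whatever learner is being trained, provided it accepts per-observation weights.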

Original language: English (US)
Pages (from-to): 1033-1069
Number of pages: 37
Journal: Data Mining and Knowledge Discovery
Volume: 29
Issue number: 4
DOI: 10.1007/s10618-014-0386-6
State: Published - Jul 8 2015

Keywords

  • Bayesian networks
  • Electronic health data
  • Inverse probability of censoring weights
  • Medical decision support
  • Mining censored data
  • Risk prediction
  • Survival analysis

Cite this

Data mining for censored time-to-event data: a Bayesian network model for predicting cardiovascular risk from electronic health record data. / Bandyopadhyay, Sunayan; Wolfson, Julian; Vock, David M; Vazquez-Benitez, Gabriela; Adomavicius, Gediminas; Elidrisi, Mohamed; Johnson, Paul E; O’Connor, Patrick J.

In: Data Mining and Knowledge Discovery, Vol. 29, No. 4, 08.07.2015, p. 1033-1069.

@article{a74399116767442aaae8abb14a71698c,
title = "Data mining for censored time-to-event data: a Bayesian network model for predicting cardiovascular risk from electronic health record data",
keywords = "Bayesian networks, Electronic health data, Inverse probability of censoring weights, Medical decision support, Mining censored data, Risk prediction, Survival analysis",
author = "Sunayan Bandyopadhyay and Julian Wolfson and Vock, {David M} and Gabriela Vazquez-Benitez and Gediminas Adomavicius and Mohamed Elidrisi and Johnson, {Paul E} and O’Connor, {Patrick J.}",
year = "2015",
month = "7",
day = "8",
doi = "10.1007/s10618-014-0386-6",
language = "English (US)",
volume = "29",
pages = "1033--1069",
journal = "Data Mining and Knowledge Discovery",
issn = "1384-5810",
publisher = "Springer Netherlands",
number = "4",

}

TY - JOUR

T1 - Data mining for censored time-to-event data

T2 - a Bayesian network model for predicting cardiovascular risk from electronic health record data

AU - Bandyopadhyay, Sunayan

AU - Wolfson, Julian

AU - Vock, David M

AU - Vazquez-Benitez, Gabriela

AU - Adomavicius, Gediminas

AU - Elidrisi, Mohamed

AU - Johnson, Paul E

AU - O’Connor, Patrick J.

PY - 2015/7/8

Y1 - 2015/7/8

KW - Bayesian networks

KW - Electronic health data

KW - Inverse probability of censoring weights

KW - Medical decision support

KW - Mining censored data

KW - Risk prediction

KW - Survival analysis

UR - http://www.scopus.com/inward/record.url?scp=84930476972&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84930476972&partnerID=8YFLogxK

U2 - 10.1007/s10618-014-0386-6

DO - 10.1007/s10618-014-0386-6

M3 - Article

AN - SCOPUS:84930476972

VL - 29

SP - 1033

EP - 1069

JO - Data Mining and Knowledge Discovery

JF - Data Mining and Knowledge Discovery

SN - 1384-5810

IS - 4

ER -