TY - JOUR
T1 - Handling missing values in healthcare data
T2 - A systematic review of deep learning-based imputation techniques
AU - Liu, Mingxuan
AU - Li, Siqi
AU - Yuan, Han
AU - Ong, Marcus Eng Hock
AU - Ning, Yilin
AU - Xie, Feng
AU - Saffari, Seyed Ehsan
AU - Shang, Yuqing
AU - Volovici, Victor
AU - Chakraborty, Bibhas
AU - Liu, Nan
N1 - Publisher Copyright: © 2023 Elsevier B.V.
PY - 2023/8
Y1 - 2023/8
AB - Objective: The proper handling of missing values is critical to delivering reliable estimates and decisions, especially in high-stakes fields such as clinical research. In response to the increasing diversity and complexity of data, many researchers have developed deep learning (DL)-based imputation techniques. We conducted a systematic review to evaluate the use of these techniques, with a particular focus on the types of data, intending to assist healthcare researchers from various disciplines in dealing with missing data. Materials and methods: We searched five databases (MEDLINE, Web of Science, Embase, CINAHL, and Scopus) for articles published prior to February 8, 2023 that described the use of DL-based models for imputation. We examined selected articles from four perspectives: data types, model backbones (i.e., main architectures), imputation strategies, and comparisons with non-DL-based methods. Based on data types, we created an evidence map to illustrate the adoption of DL models. Results: Out of 1822 articles, a total of 111 were included, of which tabular static data (29 %, 32/111) and temporal data (40 %, 44/111) were the most frequently investigated. Our findings revealed a discernible pattern in the choice of model backbones and data types, for example, the dominance of autoencoder and recurrent neural networks for tabular temporal data. The discrepancy in imputation strategy usage among data types was also observed. The “integrated” imputation strategy, which solves the imputation task simultaneously with downstream tasks, was most popular for tabular temporal data (52 %, 23/44) and multi-modal data (56 %, 5/9). Moreover, DL-based imputation methods yielded a higher level of imputation accuracy than non-DL methods in most studies. Conclusion: The DL-based imputation models are a family of techniques, with diverse network structures. Their designation in healthcare is usually tailored to data types with different characteristics. Although DL-based imputation models may not be superior to conventional approaches across all datasets, it is highly possible for them to achieve satisfactory results for a particular data type or dataset. There are, however, still issues with regard to portability, interpretability, and fairness associated with current DL-based imputation models.
KW - Deep learning
KW - Healthcare
KW - Imputation
KW - Missing value
KW - Neural networks
UR - http://www.scopus.com/inward/record.url?scp=85160673489&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85160673489&partnerID=8YFLogxK
U2 - 10.1016/j.artmed.2023.102587
DO - 10.1016/j.artmed.2023.102587
M3 - Review article
C2 - 37316097
AN - SCOPUS:85160673489
SN - 0933-3657
VL - 142
JO - Artificial Intelligence in Medicine
JF - Artificial Intelligence in Medicine
M1 - 102587
ER -