Abstract
Conventional imputation techniques for categorical data, such as mode imputation, often suffer from overestimation. When a variable has too many categories, imputation by multinomial logistic regression may be computationally infeasible. To address these limitations, we propose a two-stage imputation method. In the first stage, we apply the Boruta variable selection method to the complete dataset to identify the variables that are important for the target categorical variable. In the second stage, we use these selected variables as predictors: logistic regression imputes missing values in binary variables, polytomous regression imputes missing values in categorical variables, and predictive mean matching imputes missing values in quantitative variables. Using simulated and real data that are asymmetric and non-normal, we demonstrate that the two-stage imputation method outperforms imputation methods without variable selection in terms of accuracy measures. In an analysis of real survey data, we also demonstrate that the proposed two-stage method is more accurate than the imputation approach currently in use.
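The two-stage idea can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation, and it makes several simplifying assumptions: the importance score is empirical mutual information rather than the random-forest importance that Boruta uses, the selection rule is a simple majority vote against permuted "shadow" copies rather than Boruta's statistical test, and the second stage uses a conditional-mode lookup in place of polytomous regression. All function names, variable names (`x_inf`, `x_noise`, `y`), and the toy data are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def select_variables(rows, target, predictors, rounds=20, seed=0):
    """Stage 1 (simplified, Boruta-like): keep predictors whose score beats
    the best permuted 'shadow' predictor in more than half of the rounds."""
    rng = random.Random(seed)
    y = [r[target] for r in rows]
    scores = {p: mutual_information([r[p] for r in rows], y) for p in predictors}
    hits = Counter()
    for _ in range(rounds):
        shadow_scores = []
        for p in predictors:
            shadow = [r[p] for r in rows]
            rng.shuffle(shadow)  # permuting breaks any link to the target
            shadow_scores.append(mutual_information(shadow, y))
        threshold = max(shadow_scores)
        for p in predictors:
            if scores[p] > threshold:
                hits[p] += 1
    return [p for p in predictors if hits[p] > rounds / 2]


def impute_target(rows, target, selected):
    """Stage 2 (stand-in for polytomous regression): fill a missing target with
    the most frequent category among complete cases sharing the same values of
    the selected predictors; fall back to the overall mode if none match."""
    complete = [r for r in rows if r[target] is not None]
    global_mode = Counter(r[target] for r in complete).most_common(1)[0][0]
    by_key = defaultdict(Counter)
    for r in complete:
        by_key[tuple(r[p] for p in selected)][r[target]] += 1
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left unchanged
        if r[target] is None:
            counts = by_key.get(tuple(r[p] for p in selected))
            r[target] = counts.most_common(1)[0][0] if counts else global_mode
        filled.append(r)
    return filled


# Toy data (assumed): y depends only on x_inf; x_noise is unrelated to y.
labels = ["a", "b", "c", "d"]
truth = [labels[i % 4] for i in range(200)]
rows = [
    {"x_inf": i % 4, "x_noise": (7 * i + 3) % 5,
     "y": None if i % 7 == 0 else truth[i]}
    for i in range(200)
]

complete_rows = [r for r in rows if r["y"] is not None]
selected = select_variables(complete_rows, "y", ["x_inf", "x_noise"])
filled = impute_target(rows, "y", selected)

missing_idx = [i for i, r in enumerate(rows) if r["y"] is None]
accuracy = sum(filled[i]["y"] == truth[i] for i in missing_idx) / len(missing_idx)
print(selected, accuracy)
```

Screening predictors first keeps the second-stage model small, which is what makes imputation feasible when the target has many categories. A practical analysis would likely use an established Boruta implementation and a regression-based imputation model rather than these stand-ins.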
Original language | English (US) |
---|---|
Pages (from-to) | 577-587 |
Number of pages | 11 |
Journal | Communications for Statistical Applications and Methods |
Volume | 30 |
Issue number | 6 |
State | Published - 2023 |
Bibliographical note
Publisher Copyright: © 2023 The Korean Statistical Society, and Korean International Statistical Society. All rights reserved.
Keywords
- categorical variable
- imputation
- missing data
- variable selection