Two-stage imputation method to handle missing data for categorical response variable

Jong Min Kim, Kee Jae Lee, Seung Joo Lee

Research output: Contribution to journal › Article › peer-review

Abstract

Conventional imputation techniques for categorical data, such as mode imputation, often suffer from overestimation. When a variable has too many categories, imputation by multinomial logistic regression may be computationally infeasible. To address these limitations, we propose a two-stage imputation method. In the first stage, we apply the Boruta variable selection method to the complete dataset to identify the variables that are important for the target categorical variable. In the second stage, we use these important variables to impute missing data: logistic regression for binary variables, polytomous regression for categorical variables, and predictive mean matching for quantitative variables. Through analysis of asymmetric and non-normal data, both simulated and real, we demonstrate that the two-stage imputation method outperforms imputation methods that lack variable selection, as evidenced by accuracy measures. In an analysis of real survey data, we also demonstrate that the proposed two-stage imputation method is more accurate than the current imputation approach.
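The two-stage idea can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on loud simplifications. Stage 1 borrows only the shadow-feature idea from Boruta: a simple majority-class accuracy score stands in for random-forest importance, and a "wins in most rounds" rule stands in for Boruta's statistical test. Stage 2 uses a conditional mode as a stand-in for polytomous regression. All function names, the toy data, and the parameters `n_rounds` and `seed` are hypothetical.

```python
import random
from collections import Counter, defaultdict

def majority_accuracy(xs, ys):
    """Share of rows predicted correctly when each level of x predicts its modal y."""
    by_level = defaultdict(Counter)
    for x, y in zip(xs, ys):
        by_level[x][y] += 1
    hits = sum(c.most_common(1)[0][1] for c in by_level.values())
    return hits / len(ys)

def select_predictors(rows, target, predictors, n_rounds=20, seed=0):
    """Stage 1 (simplified): keep predictors whose score beats the best
    permuted 'shadow' copy in more than half of the rounds."""
    rng = random.Random(seed)
    complete = [r for r in rows if r[target] is not None]
    ys = [r[target] for r in complete]
    real = {p: majority_accuracy([r[p] for r in complete], ys) for p in predictors}
    wins = Counter()
    for _ in range(n_rounds):
        best_shadow = 0.0
        for p in predictors:
            shadow = [r[p] for r in complete]
            rng.shuffle(shadow)  # permuting breaks any link to the target
            best_shadow = max(best_shadow, majority_accuracy(shadow, ys))
        for p in predictors:
            if real[p] > best_shadow:
                wins[p] += 1
    return [p for p in predictors if wins[p] > n_rounds / 2]

def impute_target(rows, target, selected):
    """Stage 2 (simplified): fill a missing target with the modal target among
    complete rows sharing the selected predictor values; else the overall mode."""
    complete = [r for r in rows if r[target] is not None]
    overall = Counter(r[target] for r in complete).most_common(1)[0][0]
    by_key = defaultdict(Counter)
    for r in complete:
        by_key[tuple(r[p] for p in selected)][r[target]] += 1
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        if r[target] is None:
            counts = by_key.get(tuple(r[p] for p in selected))
            r[target] = counts.most_common(1)[0][0] if counts else overall
        filled.append(r)
    return filled

def make_toy_data(n=300, seed=1):
    """Hypothetical data: 'y' depends on 'region' only; 'noise' is irrelevant."""
    rng = random.Random(seed)
    base = {"north": "low", "south": "mid", "east": "high"}
    rows = []
    for _ in range(n):
        region = rng.choice(list(base))
        noise = rng.choice(["a", "b", "c", "d"])
        y = base[region] if rng.random() < 0.9 else rng.choice(["low", "mid", "high"])
        rows.append({"region": region, "noise": noise, "y": y})
    truth = [r["y"] for r in rows]
    for r in rows:
        if rng.random() < 0.2:  # remove about 20% of target values at random
            r["y"] = None
    return rows, truth

if __name__ == "__main__":
    rows, truth = make_toy_data()
    selected = select_predictors(rows, "y", ["region", "noise"])
    filled = impute_target(rows, "y", selected)
    missing = [i for i, r in enumerate(rows) if r["y"] is None]
    accuracy = sum(filled[i]["y"] == truth[i] for i in missing) / len(missing)
    print("selected predictors:", selected)
    print("accuracy on imputed values:", round(accuracy, 3))
```

The sketch only shows the structure of the approach: screening predictors first keeps the imputation model small, which is the motivation given above for variables with many categories. A faithful reproduction would use an actual Boruta implementation and regression-based imputation models.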

Original language: English (US)
Pages (from-to): 577-587
Number of pages: 11
Journal: Communications for Statistical Applications and Methods
Volume: 30
Issue number: 6
State: Published - 2023

Bibliographical note

Publisher Copyright:
© 2023 The Korean Statistical Society, and Korean International Statistical Society. All rights reserved.

Keywords

  • categorical variable
  • imputation
  • missing data
  • variable selection
