The input variable selection problem has recently garnered much interest in the time series modeling community, especially within water resources applications, where studies have demonstrated that nonlinear, information-theoretic input variable selection algorithms such as partial mutual information (PMI) selection (PMIS) provide an improved representation of the modeled process compared with linear alternatives such as partial correlation input selection (PCIS). PMIS is a popular algorithm for nonlinear input variable selection in water resources modeling; however, it requires the specification of two nonlinear regression models, each with parametric settings that greatly influence the selected input variables. Other input variable selection methods based on conditional mutual information (CMI), an analog to PMI, have been formulated using different estimators, such as k-nearest-neighbor (KNN) statistics or kernel density estimates (KDE), that likewise depend on parametric settings. In this paper, we introduce a new input variable selection method based on CMI that uses a nonparametric multivariate continuous probability estimator based on Edgeworth approximations (EA). We extend the EA method to account for uncertainty in the input variable selection procedure by introducing a bootstrap resampling procedure that uses rank statistics to order the selected input sets; we name the proposed method bootstrap rank-ordered CMI (broCMI). We demonstrate the superior performance of broCMI compared with the CMI-based alternatives (EA, KDE, and KNN), PMIS, and PCIS on a set of seven synthetic test problems and a real-world urban water demand (UWD) forecasting experiment in Ottawa, Canada.
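As a rough illustration of the bootstrap rank-ordering idea described above, the following is a minimal sketch and not the authors' implementation. It assumes a joint-Gaussian CMI estimate in place of the Edgeworth-approximation estimator, a fixed number of selected inputs in place of a stopping criterion, and the mean rank across bootstrap resamples as the rank statistic. All function names (`gaussian_cmi`, `greedy_select`, `bootstrap_rank_order`) are hypothetical.

```python
import numpy as np


def gaussian_cmi(x, y, z=None):
    """Estimate I(x; y | z) in nats, ASSUMING joint Gaussianity.

    This is a stand-in for the Edgeworth-based estimator named in the
    abstract; under the Gaussian assumption CMI depends only on the
    partial correlation between x and y given z.
    """
    if z is not None and z.shape[1] > 0:
        # Remove the linear effect of z (plus an intercept) from x and y.
        Z = np.column_stack([np.ones(len(x)), z])
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r = np.corrcoef(x, y)[0, 1]
    r2 = min(r * r, 1.0 - 1e-12)  # guard against log(0)
    return -0.5 * np.log(1.0 - r2)


def greedy_select(X, y, n_select):
    """Forward selection: add the candidate with the largest CMI with y,
    conditioned on the inputs already selected."""
    selected = []
    for _ in range(n_select):
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        z = X[:, selected] if selected else None
        scores = [gaussian_cmi(X[:, j], y, z) for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected


def bootstrap_rank_order(X, y, n_select, n_boot=50, seed=0):
    """Repeat the selection on bootstrap resamples and order the
    candidates by their mean selection rank (assumed rank statistic)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    rank_sum = np.zeros(d)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        order = greedy_select(X[idx], y[idx], n_select)
        ranks = np.full(d, n_select + 1.0)  # unselected inputs share the worst rank
        for pos, j in enumerate(order, start=1):
            ranks[j] = pos
        rank_sum += ranks
    mean_rank = rank_sum / n_boot
    return np.argsort(mean_rank, kind="stable"), mean_rank
```

Calling `bootstrap_rank_order(X, y, n_select=2)` on an `(n_samples, n_candidates)` array `X` and a target vector `y` returns the candidate indices ordered from best to worst mean rank, together with the mean ranks themselves. Inputs that are selected early in most resamples rise to the top, which is one simple way to expose how stable a selection is under sampling variability.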
Bibliographical note
Funding Information:
The authors would like to extend their deepest gratitude to the three anonymous reviewers, the associate editor, and the editor for their invaluable feedback that has helped to greatly improve the presentation of this paper. The authors would also like to thank the City of Ottawa for providing data utilized in this study. Requests for data used in this study should be made through opendata@ottawa.ca. The authors acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC) for funding part of the research presented in this paper under an NSERC Engage grant, as well as an NSERC Discovery Grant held by J. Adamowski. Special thanks are extended to S. Galelli at Singapore University of Technology and Design along with R. Diduch, C. Rogers, J. Bougadis, and G. Reilly at the City of Ottawa for their time and helpful discussions.
- conditional mutual information
- input variable selection
- regression models