Abstract
We address the regression problem with a new form of data that arises from data privacy applications. Instead of point values, the observed explanatory variables are subsets containing each individual’s original value. In such cases, we cannot apply classical regression analyses, such as least squares, because the set-valued predictors carry only partial information about the original values. We propose a computationally efficient subset least squares method for performing regression on such data. We establish upper bounds on the prediction loss and risk in terms of the subset structure, model structure, and data dimension. The error rates are shown to be optimal in some common situations. Furthermore, we develop a model-selection method to identify the most appropriate model for prediction. Experimental results on both simulated and real-world data sets demonstrate the promising performance of the proposed method.
Original language | English (US) |
---|---|
Pages (from-to) | 2545-2560 |
Number of pages | 16 |
Journal | Statistica Sinica |
Volume | 33 |
Issue number | 4 |
State | Published - Oct 2023 |
Bibliographical note
Publisher Copyright: © 2023 All rights reserved.
Keywords
- Model selection
- Regression
- Set-valued data