TY - JOUR
T1 - Comparative assessment and novel strategy on methods for imputing proteomics data
AU - Shen, Minjie
AU - Chang, Yi Tan
AU - Wu, Chiung Ting
AU - Parker, Sarah J.
AU - Saylor, Georgia
AU - Wang, Yizhi
AU - Yu, Guoqiang
AU - Van Eyk, Jennifer E.
AU - Clarke, Robert
AU - Herrington, David M.
AU - Wang, Yue
N1 - Publisher Copyright:
© 2022, The Author(s).
PY - 2022/12
Y1 - 2022/12
N2 - Missing values are a major issue in quantitative proteomics analysis. While many methods have been developed for imputing missing values in high-throughput proteomics data, a comparative assessment of imputation accuracy remains inconclusive, mainly because mechanisms contributing to true missing values are complex and existing evaluation methodologies are imperfect. Moreover, few studies have provided an outlook of future methodological development. We first re-evaluate the performance of eight representative methods targeting three typical missing mechanisms. These methods are compared on both simulated and masked missing values embedded within real proteomics datasets, and performance is evaluated using three quantitative measures. We then introduce fused regularization matrix factorization, a low-rank global matrix factorization framework, capable of integrating local similarity derived from additional data types. We also explore a biologically-inspired latent variable modeling strategy—convex analysis of mixtures—for missing value imputation and present preliminary experimental results. While some winners emerged from our comparative assessment, the evaluation is intrinsically imperfect because performance is evaluated indirectly on artificial missing or masked values not authentic missing values. Nevertheless, we show that our fused regularization matrix factorization provides a novel incorporation of external and local information, and the exploratory implementation of convex analysis of mixtures presents a biologically plausible new approach.
AB - Missing values are a major issue in quantitative proteomics analysis. While many methods have been developed for imputing missing values in high-throughput proteomics data, a comparative assessment of imputation accuracy remains inconclusive, mainly because mechanisms contributing to true missing values are complex and existing evaluation methodologies are imperfect. Moreover, few studies have provided an outlook of future methodological development. We first re-evaluate the performance of eight representative methods targeting three typical missing mechanisms. These methods are compared on both simulated and masked missing values embedded within real proteomics datasets, and performance is evaluated using three quantitative measures. We then introduce fused regularization matrix factorization, a low-rank global matrix factorization framework, capable of integrating local similarity derived from additional data types. We also explore a biologically-inspired latent variable modeling strategy—convex analysis of mixtures—for missing value imputation and present preliminary experimental results. While some winners emerged from our comparative assessment, the evaluation is intrinsically imperfect because performance is evaluated indirectly on artificial missing or masked values not authentic missing values. Nevertheless, we show that our fused regularization matrix factorization provides a novel incorporation of external and local information, and the exploratory implementation of convex analysis of mixtures presents a biologically plausible new approach.
UR - http://www.scopus.com/inward/record.url?scp=85123239183&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123239183&partnerID=8YFLogxK
U2 - 10.1038/s41598-022-04938-0
DO - 10.1038/s41598-022-04938-0
M3 - Article
C2 - 35058491
AN - SCOPUS:85123239183
SN - 2045-2322
VL - 12
JO - Scientific reports
JF - Scientific reports
IS - 1
M1 - 1067
ER -