Statistical Methods in Proteomics

Weichuan Yu, Baolin Wu, Tao Huang, Xiaoye Li, Kenneth Williams, Hongyu Zhao

Research output: Chapter in Book/Report/Conference proceeding › Chapter

14 Scopus citations


Proteomics technologies are rapidly evolving and attracting great attention in the post-genome era. In this chapter, we review two key applications of proteomics techniques: disease biomarker discovery and protein/peptide identification. For each of the applications, we state the major issues related to statistical modeling and analysis, review related work, discuss their strengths and weaknesses, and point out unsolved problems for future research.

We organize this chapter as follows. Section 34.1 briefly introduces mass spectrometry (MS) and tandem MS/MS with a few sample plots showing the data format. Section 34.2 focuses on MS data preprocessing. We first review approaches in peak identification and then address the problem of peak alignment. After that, we point out unsolved problems and propose a few possible solutions. Section 34.3 addresses the issue of feature selection. We start with a simple example showing the effect of a large number of features. Then we address the interaction of different features and discuss methods of reducing the influence of noise. We finish this section with some discussion on the application of machine learning methods in feature selection. Section 34.4 addresses the problem of sample classification. We describe the random forest method in detail in Sect. 34.5. In Sect. 34.6 we address protein/peptide identification. We first review database searching methods in Sect. 34.6.1 and then focus on de novo MS/MS sequencing in Sect. 34.6.2. After reviewing major protein/peptide identification programs like SEQUEST and MASCOT in Sect. 34.6.3, we conclude the section by pointing out some major issues that need to be addressed in protein/peptide identification.

Proteomics technologies are considered the major player in the analysis and understanding of protein function and biological pathways. The development of statistical methods and software for proteomics data analysis will continue to be the focus of proteomics for years to come.
Original language: English (US)
Title of host publication: Statistical Methods in Proteomics
Editors: Hoang Pham
Place of Publication: London
Number of pages: 16
ISBN (Print): 978-1-84628-288-1
State: Published - 2006

Publication series

Name: Springer Handbooks
ISSN (Print): 2522-8692
ISSN (Electronic): 2522-8706


  • Feature Selection
  • Feature Selection Method
  • Mass Spectrometry Data
  • Random Forest
  • Sample Classification


