Abstract
Predicting functions of genes is an important issue in biology. Clustering gene expression profiles has been widely used for gene function prediction, but most clustering methods are unstable and sensitive to input parameters such as starting values and number of clusters. In this article, we develop a novel consensus clustering method to address the instability issue and thus improve the performance of clustering methods. The biological function of an unannotated gene is predicted based on the most enriched functional category in its consensus cluster. The MIPS gene annotations are used to evaluate the predictive performance. It is shown that the consensus clustering-based classification method has a significantly better predictive performance than a previously used clustering-based classification method while performing as well as support vector machines (SVMs). In addition to the obvious applicability of consensus clustering to unsupervised learning, the method's advantages in supervised learning include its being a multiclass classifier that can be trained much faster than SVMs, its generality to include any of the many existing clustering algorithms, and its flexibility to be integrated with other predictive models built with other types of data, suggesting its potential for further improved performance. As a concrete example, we consider its combined use with protein-protein interaction data for gene function prediction. It is shown that the combined analysis has a significantly higher predictive accuracy and a much broader functional coverage than using either data source alone.
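The abstract does not say how the consensus is formed, so the following is only a minimal sketch of one common approach, under stated assumptions. A base clusterer (here a plain k-means) is run repeatedly with different random starts and cluster numbers, and genes are grouped by how often they co-cluster. An unannotated gene is then assigned the most frequent annotation in its consensus cluster, a simplified stand-in for a formal enrichment test. All function names, the threshold, and the toy data are hypothetical and are not taken from the article.

```python
import random
from collections import Counter


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)  # random start: the source of instability
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; empty clusters keep their center.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def co_association(points, ks, n_runs, seed=0):
    """Fraction of all runs in which each pair of points lands in the same cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    total = 0
    for k in ks:
        for _ in range(n_runs):
            labels = kmeans(points, k, rng)
            total += 1
            for i in range(n):
                for j in range(n):
                    if labels[i] == labels[j]:
                        counts[i][j] += 1
    return [[c / total for c in row] for row in counts]


def consensus_clusters(coassoc, threshold=0.5):
    """Connected components of the graph linking pairs with co-association > threshold."""
    n = len(coassoc)
    cluster = [-1] * n
    next_id = 0
    for start in range(n):
        if cluster[start] != -1:
            continue
        cluster[start] = next_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if cluster[j] == -1 and coassoc[i][j] > threshold:
                    cluster[j] = next_id
                    stack.append(j)
        next_id += 1
    return cluster


def predict_function(gene, cluster, annotations):
    """Most frequent annotation among the other annotated genes in the same consensus cluster."""
    votes = Counter(
        annotations[g]
        for g in range(len(cluster))
        if g != gene and cluster[g] == cluster[gene] and g in annotations
    )
    return votes.most_common(1)[0][0] if votes else None


# Toy 1-D "expression profiles": two well-separated groups of three genes each.
profiles = [[0.0], [0.2], [0.4], [10.0], [10.2], [10.4]]
# Hypothetical annotations for four genes; genes 2 and 5 are unannotated.
known = {0: "metabolism", 1: "metabolism", 3: "transport", 4: "transport"}

coassoc = co_association(profiles, ks=[2, 3], n_runs=10)
clusters = consensus_clusters(coassoc, threshold=0.25)
prediction_for_gene_2 = predict_function(2, clusters, known)
prediction_for_gene_5 = predict_function(5, clusters, known)
```

Pooling runs over several values of `k` means no single choice of cluster number has to be trusted, which is the instability the abstract refers to. Any other clustering algorithm could be substituted for `kmeans` as long as it returns one label per gene.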
Original language | English (US) |
---|---|
Pages (from-to) | 733-751 |
Number of pages | 19 |
Journal | Journal of Computational and Graphical Statistics |
Volume | 16 |
Issue number | 3 |
State | Published - Sep 2007 |
Bibliographical note
Funding Information: The authors are grateful to the two reviewers, an AE, and the editor for helpful comments. GX was supported by a Merck Fellowship; WP was partially supported by NIH grant HL65462 and a UM AHC FRD grant.
Keywords
- Classification
- Cross-validation
- Gene annotation
- Integrative analysis
- Microarray
- Protein-protein interaction