Information content and analysis methods for multi-modal high-throughput biomedical data

Bisakha Ray, Mikael Henaff, Sisi Ma, Efstratios Efstathiadis, Eric R. Peskin, Marco Picone, Tito Poli, Constantin F. Aliferis, Alexander Statnikov

Research output: Contribution to journal › Article › peer-review

24 Scopus citations


The spectrum of modern molecular high-throughput assaying includes diverse technologies such as microarray gene expression, miRNA expression, proteomics, DNA methylation, among many others. Now that these technologies have matured and become increasingly accessible, the next frontier is to collect "multi-modal" data for the same set of subjects and conduct integrative, multi-level analyses. While multi-modal data does contain distinct biological information that can be useful for answering complex biology questions, its value for predicting clinical phenotypes and contributions of each type of input remain unknown. We obtained 47 datasets/predictive tasks that in total span over 9 data modalities and executed analytic experiments for predicting various clinical phenotypes and outcomes. First, we analyzed each modality separately using uni-modal approaches based on several state-of-the-art supervised classification and feature selection methods. Then, we applied integrative multi-modal classification techniques. We have found that gene expression is the most predictively informative modality. Other modalities such as protein expression, miRNA expression, and DNA methylation also provide highly predictive results, which are often statistically comparable but not superior to gene expression data. Integrative multi-modal analyses generally do not increase predictive signal compared to gene expression data.

Original language: English (US)
Article number: 4411
Journal: Scientific Reports
State: Published - Mar 21 2014

Bibliographical note

Funding Information:
This research was supported in part by grant 1UL1 RR029893 from the National Center for Research Resources, National Institutes of Health. This study uses the data generated by the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) project, funded by Cancer Research UK and the British Columbia Cancer Agency Branch. The study also uses the data generated by the European Union NeoMark project (EU-FP7-ICT-2007-2-22483-NeoMark), funded by the Seventh Framework Programme. The authors also thank Drs. Olivier Gevaert, Anneleen Daemen, and Yves Moreau for providing clarifications on previous multi-modal predictive analytics studies and the HIDIDIT software, and Dr. Guillaume Obozinski for the SKMsmo software.

