Induction of comprehensible models for gene expression datasets by subgroup discovery methodology

Dragan Gamberger, Nada Lavrač, Filip Železný, Jakub Tolar

Research output: Contribution to journal › Article › peer-review

39 Scopus citations


Finding disease markers (classifiers) from gene expression data by machine learning algorithms is characterized by a high risk of overfitting the data due to the abundance of attributes (simultaneously measured gene expression values) and the shortage of available examples (observations). To avoid this pitfall and achieve predictor robustness, state-of-the-art approaches construct complex classifiers that combine relatively weak contributions of up to thousands of genes (attributes) to classify a disease. The complexity of such classifiers limits their transparency and consequently the biological insights they can provide. The goal of this study is to apply to this domain the methodology of constructing simple yet robust logic-based classifiers amenable to direct expert interpretation. On two well-known, publicly available gene expression classification problems, the paper shows the feasibility of this approach, employing a recently developed subgroup discovery methodology. Some of the discovered classifiers allow for novel biological interpretations.

Original language: English (US)
Pages (from-to): 269-284
Number of pages: 16
Journal: Journal of Biomedical Informatics
Issue number: 4
State: Published - Aug 2004

Bibliographical note

Funding Information:
This work was supported by the Croatian Ministry of Science, Education and Sport, the Slovenian Ministry of Education, Science and Sport, and the Czech Ministry of Education through the project MSM 212300013.


Keywords

  • Comprehensible classification
  • Disease markers
  • Gene expression measurements
  • Machine learning
  • Subgroup discovery

