Frequent substructure-based approaches for classifying chemical compounds

Mukund Deshpande, Michihiro Kuramochi, Nikil Wale, George Karypis

Research output: Contribution to journal › Article › peer-review

301 Scopus citations


Computational techniques that build models to correctly assign chemical compounds to various classes of interest have many applications in pharmaceutical research and are used extensively at various phases of the drug development process. These techniques are used to solve a number of classification problems, such as predicting whether a chemical compound has the desired biological activity, predicting whether it is toxic or nontoxic, and filtering out drug-like compounds from large compound libraries. This paper presents a substructure-based classification algorithm that decouples the substructure discovery process from the classification model construction and uses frequent subgraph discovery algorithms to find all topological and geometric substructures present in the data set. The advantage of this approach is that during classification model construction, all relevant substructures are available, allowing the classifier to intelligently select the most discriminating ones. Computational scalability is ensured by the use of highly efficient frequent subgraph discovery algorithms coupled with aggressive feature selection. Experimental evaluation on eight different classification problems shows that our approach is computationally scalable and, on average, outperforms existing schemes by 7 percent to 35 percent.
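The decoupled pipeline described in the abstract (discover frequent substructures first, then select discriminating ones, then build feature vectors for a classifier) can be sketched roughly as follows. This is a minimal illustration under loud assumptions, not the authors' implementation: simple label paths stand in for the frequent subgraphs that a real discovery algorithm would find, the per-class support gap stands in for the paper's feature selection, and every function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import chain

# A molecule is a toy labeled graph: (atom labels, undirected bonds).
# ASSUMPTION: label paths approximate substructures; real frequent
# subgraph discovery handles arbitrary topological/geometric subgraphs.

def paths(atoms, bonds, max_len=3):
    """Enumerate label sequences of simple paths with up to max_len atoms."""
    adj = {i: set() for i in range(len(atoms))}
    for u, v in bonds:
        adj[u].add(v)
        adj[v].add(u)
    found = set()

    def walk(path):
        labels = tuple(atoms[i] for i in path)
        found.add(min(labels, labels[::-1]))  # direction-independent form
        if len(path) < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in adj:
        walk([start])
    return found


def frequent_substructures(molecules, min_support):
    """Step 1: keep substructures present in at least min_support of molecules."""
    counts = Counter(chain.from_iterable(paths(a, b) for a, b in molecules))
    return {s for s, c in counts.items() if c / len(molecules) >= min_support}


def select_features(molecules, labels, candidates, top_k):
    """Step 2 (hypothetical criterion): rank by gap between class supports."""
    pos = [paths(a, b) for (a, b), y in zip(molecules, labels) if y == 1]
    neg = [paths(a, b) for (a, b), y in zip(molecules, labels) if y == 0]

    def gap(s):
        return abs(sum(s in p for p in pos) / len(pos)
                   - sum(s in p for p in neg) / len(neg))

    return sorted(candidates, key=lambda s: (-gap(s), s))[:top_k]


def featurize(molecule, features):
    """Step 3: binary vector marking which selected substructures occur."""
    present = paths(*molecule)
    return [int(f in present) for f in features]


# Toy data: two "active" compounds containing C-O, two "inactive" with C-N.
molecules = [
    (["C", "O"], [(0, 1)]),
    (["C", "C", "O"], [(0, 1), (1, 2)]),
    (["C", "N"], [(0, 1)]),
    (["C", "C", "N"], [(0, 1), (1, 2)]),
]
labels = [1, 1, 0, 0]

frequent = frequent_substructures(molecules, min_support=0.5)
features = select_features(molecules, labels, frequent, top_k=2)
vectors = [featurize(m, features) for m in molecules]
print(features)
print(vectors)
```

The resulting binary vectors would then be handed to any standard classifier (the keywords list SVM). Keeping discovery separate from model construction means the selection step sees every frequent substructure before discarding any, which is the advantage the abstract emphasizes.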

Original language: English (US)
Pages (from-to): 1036-1050
Number of pages: 15
Journal: IEEE Transactions on Knowledge and Data Engineering
Issue number: 8
State: Published - Aug 2005

Bibliographical note

Funding Information:
The authors would like to thank Dr. Ian Watson from Lilly Research Laboratories and Dr. Peter Henstock from Pfizer Inc. for providing them with the various fingerprints used in the experimental evaluation and for the numerous discussions on the practical aspects of virtual screening. This work was supported by the US National Science Foundation under grants EIA-9986042, ACI-9982274, ACI-0133464, ACI-0312828, and IIS-0431135, by the Army High Performance Computing Research Center under contract number DAAD19-01-2-0014, and by the Digital Technology Center at the University of Minnesota.


  • Chemical compounds
  • Classification
  • Graphs
  • SVM
  • Virtual screening

