Beware of external validation! - A comparative study of several validation techniques used in QSAR modelling

Subhabrata Majumdar, Subhash C. Basak

Research output: Contribution to journal › Article › peer-review

22 Scopus citations


Background: Proper validation is an important aspect of QSAR modelling. External validation is one of the most widely used validation methods in QSAR: the model is built on a subset of the data and validated on the remaining samples. However, its effectiveness for datasets with a small number of samples but a large number of predictors remains suspect.

Objective: Calculating hundreds or thousands of molecular descriptors using currently available software has become the norm in QSAR research, owing to computational advances in the past few decades. Thus, for n chemical compounds and p descriptors calculated for each molecule, the typical chemometric dataset today has a high value of p but a small n (i.e., n ≪ p). Motivated by evidence in recent literature that external validation is inadequate for estimating the true predictive capability of a statistical model, this paper performs an extensive comparative study of this method against several other validation techniques.

Methodology: We compared four validation methods: leave-one-out (LOO), K-fold, external, and multi-split validation, using statistical models built with LASSO regression, which simultaneously performs variable selection and modelling. We used 300 simulated datasets and one real dataset of 95 congeneric amine mutagens for this evaluation.

Results: External validation metrics vary widely among different random splits of the data and are therefore not recommended for predictive QSAR models. LOO has the best overall performance among all validation methods applied in our scenario.

Conclusion: Results from external validation are too unstable for the datasets we analyzed. Based on our findings, we recommend using the LOO procedure for validating QSAR predictive models built on high-dimensional small-sample data.
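The comparison described in the Methodology can be illustrated with a small, self-contained sketch. It is not the authors' code: the simulated data (20 samples, 40 predictors, 3 active coefficients), the fixed penalty `lam`, the use of mean squared prediction error as the metric, and all function names are assumptions made for illustration. Multi-split validation is taken here to mean averaging external validation over several random splits, which is also an assumption about the paper's definition. The LASSO is fitted with a plain coordinate-descent loop so that only the Python standard library is needed.

```python
import random
import statistics


def soft_threshold(z, t):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_lasso(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)||y - b0 - Xb||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    cols = [[X[i][j] - x_mean[j] for i in range(n)] for j in range(p)]  # centred columns
    col_sq = [sum(v * v for v in c) / n for c in cols]
    resid = [yi - y_mean for yi in y]  # residuals when all coefficients are zero
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            c = cols[j]
            rho = sum(cv * r for cv, r in zip(c, resid)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta:
                resid = [r - delta * cv for r, cv in zip(resid, c)]
                beta[j] = new
    b0 = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return b0, beta


def predict(model, x):
    b0, beta = model
    return b0 + sum(b * v for b, v in zip(beta, x))


def heldout_sq_errors(X, y, train_idx, test_idx, lam):
    """Fit on the training rows, return squared errors on the held-out rows."""
    model = fit_lasso([X[i] for i in train_idx], [y[i] for i in train_idx], lam)
    return [(y[i] - predict(model, X[i])) ** 2 for i in test_idx]


def loo_mse(X, y, lam):
    n = len(y)
    errs = []
    for i in range(n):
        errs += heldout_sq_errors(X, y, [k for k in range(n) if k != i], [i], lam)
    return sum(errs) / n


def kfold_mse(X, y, lam, k, rng):
    idx = list(range(len(y)))
    rng.shuffle(idx)
    errs = []
    for f in range(k):
        test = idx[f::k]
        held = set(test)
        errs += heldout_sq_errors(X, y, [i for i in idx if i not in held], test, lam)
    return sum(errs) / len(errs)


def external_mse(X, y, lam, test_frac, rng):
    """One random train/test split (a single external validation)."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    n_test = max(1, round(test_frac * len(idx)))
    errs = heldout_sq_errors(X, y, idx[n_test:], idx[:n_test], lam)
    return sum(errs) / len(errs)


def simulate(n, p, n_active, rng):
    """Gaussian predictors; only the first n_active coefficients are non-zero."""
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [sum(row[:n_active]) + rng.gauss(0, 1) for row in X]
    return X, y


rng = random.Random(0)
X, y = simulate(n=20, p=40, n_active=3, rng=rng)  # n << p, as in the abstract
lam = 0.2                                         # assumed fixed penalty, not tuned

loo = loo_mse(X, y, lam)
kfold = kfold_mse(X, y, lam, k=5, rng=rng)
external = [external_mse(X, y, lam, 0.25, rng) for _ in range(20)]
multi_split = statistics.mean(external)           # average over the 20 random splits

print(f"LOO MSE:          {loo:.3f}")
print(f"5-fold MSE:       {kfold:.3f}")
print(f"multi-split MSE:  {multi_split:.3f}")
print(f"external MSE range over splits: {min(external):.3f} to {max(external):.3f}")
```

The last line of output is the quantity of interest: each entry of `external` is what a single external validation would have reported, so the spread of that list shows how much the estimate depends on the random split, whereas LOO involves no random split at all. A fuller study would tune `lam` inside each training set and repeat the exercise over many simulated datasets, as the abstract describes.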

Original language: English (US)
Pages (from-to): 284-291
Number of pages: 8
Journal: Current Computer-Aided Drug Design
Issue number: 4
State: Published - 2018

Bibliographical note

Funding Information:
The research of SM is supported by George Michailidis.

Publisher Copyright:
© 2018 Bentham Science Publishers.


Keywords

  • Aromatic and heteroaromatic amines
  • Chemical mutagens
  • Cross validation
  • External validation
  • K-fold cross validation
  • Lasso
  • Leave one out (LOO) cross validation

