Traditionally, QSAR and QSPR models have been fitted by splitting the available compounds into separate learning and validation sets. The model is then fitted to the learning set and assessed using the validation set. Cross-validation (CV) uses all available compounds for both purposes, so that the full body of available information is brought to bear on both the learning and the validation portions of the study. The price paid for this additional information is a substantially greater computational load. A common mistake in using CV is to omit some of the repetitive computations. This mistake leads to substantial bias in the assessment. A hydroxyl radical reaction rate dataset is used to illustrate the superiority of CV and the pitfalls from its improper execution when modeling using nearest neighbors, paralleling behavior in the well-studied linear model setting.
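The contrast between proper and improper CV can be sketched in a few lines. The example below is an illustration only, not the authors' procedure or data. It assumes that the omitted repetitive computation is descriptor selection, and it uses synthetic descriptors with a pure-noise response, so any apparent predictive ability is an artifact. The function names (`loo_q2`, `select_descriptors`, `knn_predict`), the correlation-based selection rule, and the settings `n_keep=5` and `k=3` are all hypothetical choices made for this sketch.

```python
import math
import random


def q2(y, preds):
    """Cross-validated q^2 = 1 - PRESS / total sum of squares about the mean."""
    mean_y = sum(y) / len(y)
    press = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_tot


def abs_corr(xs, ys):
    """Absolute Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return abs(sxy) / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def select_descriptors(X, y, rows, n_keep):
    """Indices of the n_keep descriptors most correlated with y on the given rows."""
    y_sub = [y[i] for i in rows]
    scores = [abs_corr([X[i][j] for i in rows], y_sub) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:n_keep]


def knn_predict(X, y, train_rows, target, cols, k):
    """Mean response of the k training compounds nearest to the target compound."""
    def dist(i):
        return sum((X[i][j] - X[target][j]) ** 2 for j in cols)
    nearest = sorted(train_rows, key=dist)[:k]
    return sum(y[i] for i in nearest) / k


def loo_q2(X, y, n_keep=5, k=3, reselect=True):
    """Leave-one-out q^2 for k-NN after descriptor selection.

    reselect=True repeats the selection inside every fold (proper CV);
    reselect=False selects once on all compounds (the shortcut).
    """
    all_rows = list(range(len(y)))
    fixed_cols = None if reselect else select_descriptors(X, y, all_rows, n_keep)
    preds = []
    for held_out in all_rows:
        train = [i for i in all_rows if i != held_out]
        cols = select_descriptors(X, y, train, n_keep) if reselect else fixed_cols
        preds.append(knn_predict(X, y, train, held_out, cols, k))
    return q2(y, preds)


# Synthetic data (an assumption of this sketch): the response is unrelated
# to every descriptor, so an honest assessment should show no predictive skill.
random.seed(0)
n_compounds, n_descriptors = 30, 200
X = [[random.gauss(0, 1) for _ in range(n_descriptors)] for _ in range(n_compounds)]
y = [random.gauss(0, 1) for _ in range(n_compounds)]

naive_q2 = loo_q2(X, y, reselect=False)
true_q2 = loo_q2(X, y, reselect=True)
print(f"naive q2 = {naive_q2:.3f}, true q2 = {true_q2:.3f}")
```

In the shortcut version, the held-out compound has already influenced which descriptors were kept, so the resulting q² would typically be expected to look more favorable than the version that repeats the selection in every fold. The extra cost of the proper version is one full selection pass per held-out compound.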
Bibliographical note
Funding Information:
The authors are grateful to Dr Tomas Oberg for his work in validating the PHYSPROP SMILES entries for the OH radical reaction rate constant data. This article represents contribution number 488 from the Center for Water and the Environment of the Natural Resources Research Institute. Research was supported by Grant F49620-02-1-0138 from the United States Air Force and Cooperative Agreement 572112 from the Agency for Toxic Substances and Disease Registry. The US Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official views, policies, or endorsements, either expressed or implied, of the Air Force Office of Scientific Research, the US Government, or ATSDR.
- Leave-one-out (LOO) cross-validation
- Molecular descriptor
- Naive q²
- OH radical reaction rate constant
- True q²
- m-fold cross-validation