TY - JOUR
T1 - Assessing synthetic accessibility of chemical compounds using machine learning methods
AU - Podolyan, Yevgeniy
AU - Walters, Michael A.
AU - Karypis, George
PY - 2010/6/28
Y1 - 2010/6/28
N2 - With de novo rational drug design, scientists can rapidly generate a very large number of potentially biologically active probes. However, many of them may be synthetically infeasible and, therefore, of limited value to drug developers. On the other hand, most of the tools for synthetic accessibility evaluation are very slow and can process only a few molecules per minute. In this study, we present two approaches to quickly predict the synthetic accessibility of chemical compounds by utilizing support vector machines operating on molecular descriptors. The first approach, RSsvm, is designed to identify the compounds that can be synthesized using a specific set of reactions and starting materials and builds its model by training on the compounds identified as synthetically accessible or not by retrosynthetic analysis. The second approach, DRsvm, is designed to provide a more general assessment of synthetic accessibility that is not tied to any set of reactions or starting materials. The training set compounds for this approach are selected from a diverse library based on the number of other similar compounds within the same library. Both approaches have been shown to perform very well in their corresponding areas of applicability with the RSsvm achieving a receiver operator characteristic score of 0.952 in cross-validation experiments and the DRsvm achieving a score of 0.888 on an independent set of compounds. Our implementations can successfully process thousands of compounds per minute.
UR - http://www.scopus.com/inward/record.url?scp=77954054048&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77954054048&partnerID=8YFLogxK
U2 - 10.1021/ci900301v
DO - 10.1021/ci900301v
M3 - Article
C2 - 20536191
AN - SCOPUS:77954054048
SN - 1549-9596
VL - 50
SP - 979
EP - 991
JO - Journal of Chemical Information and Modeling
JF - Journal of Chemical Information and Modeling
IS - 6
ER -