TY - JOUR
T1 - Efficient identification of Tanimoto nearest neighbors
T2 - All-pairs similarity search using the extended Jaccard coefficient
AU - Anastasiu, David C.
AU - Karypis, George
N1 - Publisher Copyright:
© 2017, Springer International Publishing AG.
PY - 2017/11/1
Y1 - 2017/11/1
N2 - Tanimoto, or extended Jaccard, is an important similarity measure which has seen prominent use in fields such as data mining and chemoinformatics. Many of the existing state-of-the-art methods for market basket analysis, plagiarism and anomaly detection, compound database search, and ligand-based virtual screening rely heavily on identifying Tanimoto nearest neighbors. Given the rapidly increasing size of data that must be analyzed, new algorithms are needed that can speed up nearest neighbor search, while at the same time providing reliable results. While many search algorithms address the complexity of the task by retrieving only some of the nearest neighbors, we propose a method that finds all of the exact nearest neighbors efficiently by leveraging recent advances in similarity search filtering. We provide tighter filtering bounds for the Tanimoto coefficient and show that our method, TAPNN, greatly outperforms existing baselines across a variety of real-world datasets and similarity thresholds.
AB - Tanimoto, or extended Jaccard, is an important similarity measure which has seen prominent use in fields such as data mining and chemoinformatics. Many of the existing state-of-the-art methods for market basket analysis, plagiarism and anomaly detection, compound database search, and ligand-based virtual screening rely heavily on identifying Tanimoto nearest neighbors. Given the rapidly increasing size of data that must be analyzed, new algorithms are needed that can speed up nearest neighbor search, while at the same time providing reliable results. While many search algorithms address the complexity of the task by retrieving only some of the nearest neighbors, we propose a method that finds all of the exact nearest neighbors efficiently by leveraging recent advances in similarity search filtering. We provide tighter filtering bounds for the Tanimoto coefficient and show that our method, TAPNN, greatly outperforms existing baselines across a variety of real-world datasets and similarity thresholds.
KW - All-pairs similarity search
KW - Extended Jaccard
KW - Graph construction
KW - Nearest neighbors
KW - TAPNN
KW - Tanimoto similarity
UR - http://www.scopus.com/inward/record.url?scp=85061544311&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85061544311&partnerID=8YFLogxK
U2 - 10.1007/s41060-017-0064-z
DO - 10.1007/s41060-017-0064-z
M3 - Article
AN - SCOPUS:85061544311
SN - 2364-415X
VL - 4
SP - 153
EP - 172
JO - International Journal of Data Science and Analytics
JF - International Journal of Data Science and Analytics
IS - 3
ER -