Abstract
The nearest neighbor graph is an important structure in many data mining methods for clustering, advertising, recommender systems, and outlier detection. Constructing the graph requires computing up to n 2 similarities for a set of n objects. This high complexity has led researchers to seek approximate methods, which find many but not all of the nearest neighbors. In contrast, we leverage shared memory parallelism and recent advances in similarity joins to solve the problem exactly. Our method considers all pairs of potential neighbors but quickly filters pairs that could not be a part of the nearest neighbor graph, based on similarity upper bound estimates. The filtering is data dependent and not easily predicted, which poses load balance challenges in parallel execution. We evaluated our methods on several real-world datasets and found they work up to two orders of magnitude faster than existing methods, display linear strong scaling characteristics, and incur less than 1% load imbalance during filtering.
Original language | English (US) |
---|---|
Pages (from-to) | 61-82 |
Number of pages | 22 |
Journal | Journal of Parallel and Distributed Computing |
Volume | 129 |
DOIs | |
State | Published - Jul 2019 |
Bibliographical note
Funding Information:This work was supported in partby National Science Foundation ( IIS-0905220 , OCI-Q3 461048018 , CNS-1162405 , IIS-1247632 , IIP-1414153 , IIS-1447788 ), Army Research Office ( W911NF-14-1-0316 ), Intel Software and Services Group , and the Digital Technology Center at the University of Minnesota . Access to research and computing facilities was provided by the Digital Technology Center (DTC) and the Minnesota Supercomputing Institute (MSI).
Funding Information:
This work was supported in partby National Science Foundation (IIS-0905220, OCI-Q3 461048018, CNS-1162405, IIS-1247632, IIP-1414153, IIS-1447788), Army Research Office (W911NF-14-1-0316), Intel Software and Services Group, and the Digital Technology Center at the University of Minnesota. Access to research and computing facilities was provided by the Digital Technology Center (DTC)and the Minnesota Supercomputing Institute (MSI).
Publisher Copyright:
© 2017 Elsevier Inc.
Keywords
- All-pairs
- Bounded similarity graph
- Cosine similarity
- Nearest neighbors
- Neighborhood graph construction
- Shared memory parallel
- Similarity search