TY - GEN
T1 - An Exploration of Optimization Algorithms for High Performance Tensor Completion
AU - Smith, Shaden
AU - Park, Jongsoo
AU - Karypis, George
PY - 2016/7/2
Y1 - 2016/7/2
N2 - Tensor completion is a powerful tool used to estimate or recover missing values in multi-way data. It has seen great success in domains such as product recommendation and healthcare. Tensor completion is most often accomplished via low-rank sparse tensor factorization, a computationally expensive non-convex optimization problem which has only recently been studied in the context of parallel computing. In this work, we study three optimization algorithms that have been successfully applied to tensor completion: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We explore opportunities for parallelism on shared- and distributed-memory systems and address challenges such as memory- and operation-efficiency, load balance, cache locality, and communication. Among our advancements are an SGD algorithm which combines stratification with asynchronous communication, an ALS algorithm rich in level-3 BLAS routines, and a communication-efficient CCD++ algorithm. We evaluate our optimizations on a variety of real datasets using a modern supercomputer and demonstrate speedups through 1024 cores. These improvements effectively reduce time-to-solution from hours to seconds on real-world datasets. We show that after our optimizations, ALS is advantageous on parallel systems of small-to-moderate scale, while both ALS and CCD++ will provide the lowest time-to-solution on large-scale distributed systems.
AB - Tensor completion is a powerful tool used to estimate or recover missing values in multi-way data. It has seen great success in domains such as product recommendation and healthcare. Tensor completion is most often accomplished via low-rank sparse tensor factorization, a computationally expensive non-convex optimization problem which has only recently been studied in the context of parallel computing. In this work, we study three optimization algorithms that have been successfully applied to tensor completion: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We explore opportunities for parallelism on shared- and distributed-memory systems and address challenges such as memory- and operation-efficiency, load balance, cache locality, and communication. Among our advancements are an SGD algorithm which combines stratification with asynchronous communication, an ALS algorithm rich in level-3 BLAS routines, and a communication-efficient CCD++ algorithm. We evaluate our optimizations on a variety of real datasets using a modern supercomputer and demonstrate speedups through 1024 cores. These improvements effectively reduce time-to-solution from hours to seconds on real-world datasets. We show that after our optimizations, ALS is advantageous on parallel systems of small-to-moderate scale, while both ALS and CCD++ will provide the lowest time-to-solution on large-scale distributed systems.
UR - http://www.scopus.com/inward/record.url?scp=85017277412&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85017277412&partnerID=8YFLogxK
U2 - 10.1109/SC.2016.30
DO - 10.1109/SC.2016.30
M3 - Conference contribution
AN - SCOPUS:85017277412
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
SP - 359
EP - 371
BT - Proceedings of SC 2016
PB - IEEE Computer Society
T2 - 2016 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016
Y2 - 13 November 2016 through 18 November 2016
ER -
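
Note: the abstract above names ALS, SGD, and CCD++ as solvers for low-rank sparse tensor completion. For orientation, here is a minimal, single-threaded sketch of the ALS formulation for third-order CP tensor completion over observed entries only. It is not the paper's parallel, level-3-BLAS-rich implementation; the function name `als_complete` and all parameters (`coords`, `vals`, `rank`, `reg`, `n_iters`) are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def als_complete(coords, vals, shape, rank=10, n_iters=20, reg=1e-2, seed=0):
    """ALS sketch for 3-way CP tensor completion over observed entries.

    coords : (nnz, 3) int array of observed indices (i, j, k)
    vals   : (nnz,) observed values
    shape  : (I, J, K) tensor dimensions
    """
    rng = np.random.default_rng(seed)
    # One factor matrix per mode, initialized with small random values.
    factors = [rng.standard_normal((dim, rank)) * 0.1 for dim in shape]

    for _ in range(n_iters):
        for mode in range(3):
            A = factors[mode]
            others = [m for m in range(3) if m != mode]
            # Row-wise Hadamard product of the other two factors at each
            # observed entry; these are the least-squares design rows.
            H = (factors[others[0]][coords[:, others[0]]]
                 * factors[others[1]][coords[:, others[1]]])
            # Each row of the current factor has an independent, regularized
            # normal-equations solve restricted to its observed entries.
            for i in range(shape[mode]):
                mask = coords[:, mode] == i
                if not mask.any():
                    continue
                Hi, yi = H[mask], vals[mask]
                G = Hi.T @ Hi + reg * np.eye(rank)
                A[i] = np.linalg.solve(G, Hi.T @ yi)
    return factors
```

Because the row subproblems within each mode are independent, they can be solved concurrently; this separability is the kind of shared- and distributed-memory parallelism the abstract alludes to, though the paper's actual algorithms and communication schemes are considerably more involved.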