TY - JOUR
T1 - Efficient mapping and implementation of matrix algorithms on a hypercube
AU - Cherkassky, Vladimir
AU - Smith, Ross
PY - 1988/9/1
Y1 - 1988/9/1
N2 - It is well known that parallelism by itself does not lead to higher speeds. This study shows how to put parallelism to best use, that is, how to find an optimal balance between communication and computation overheads for two parallel matrix algorithms. The problem graph for matrix algorithms analyzed in this paper is a two-dimensional grid (toroidal mesh) which is mapped onto a hypercube topology. To perform matrix operations on a hypercube, a matrix is partitioned into several submatrices which are stored and manipulated in the nodes. We seek to find an optimal matrix partitioning to minimize overall execution time. The NCUBE parallel machine is used for experimental performance evaluation. For matrix multiplication, we derive an exact analytical model to determine the optimal partitioning size and perform its experimental verification on the NCUBE parallel processor. For a parallel Gaussian elimination known as the balanced algorithm, we present performance measurements and an approximate analytical model for performance evaluation. Our analyses show that the optimal submatrix size is typically small and does not depend on the original matrix size.
AB - It is well known that parallelism by itself does not lead to higher speeds. This study shows how to put parallelism to best use, that is, how to find an optimal balance between communication and computation overheads for two parallel matrix algorithms. The problem graph for matrix algorithms analyzed in this paper is a two-dimensional grid (toroidal mesh) which is mapped onto a hypercube topology. To perform matrix operations on a hypercube, a matrix is partitioned into several submatrices which are stored and manipulated in the nodes. We seek to find an optimal matrix partitioning to minimize overall execution time. The NCUBE parallel machine is used for experimental performance evaluation. For matrix multiplication, we derive an exact analytical model to determine the optimal partitioning size and perform its experimental verification on the NCUBE parallel processor. For a parallel Gaussian elimination known as the balanced algorithm, we present performance measurements and an approximate analytical model for performance evaluation. Our analyses show that the optimal submatrix size is typically small and does not depend on the original matrix size.
UR - http://www.scopus.com/inward/record.url?scp=0037631791&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0037631791&partnerID=8YFLogxK
U2 - 10.1007/BF00127846
DO - 10.1007/BF00127846
M3 - Article
AN - SCOPUS:0037631791
VL - 2
SP - 7
EP - 27
JO - Journal of Supercomputing
JF - Journal of Supercomputing
SN - 0920-8542
IS - 1
ER -