DSP: Efficient GNN Training with Multiple GPUs

Zhenkun Cai, Qihui Zhou, Xiao Yan, Da Zheng, Xiang Song, Chenguang Zheng, James Cheng, George Karypis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Jointly utilizing multiple GPUs to train graph neural networks (GNNs) is crucial for handling large graphs and achieving high efficiency. However, we find that existing systems suffer from high communication costs and low GPU utilization due to improper data layout and training procedures. Thus, we propose a system dubbed Distributed Sampling and Pipelining (DSP) for multi-GPU GNN training. DSP adopts a tailored data layout that stores the graph topology and popular node features in GPU memory, exploiting the fast NVLink connections among the GPUs. For efficient graph sampling with multiple GPUs, we introduce a collective sampling primitive (CSP), which pushes sampling tasks to the data to reduce communication. We also design a producer-consumer-based pipeline, which allows tasks from different mini-batches to run concurrently to improve GPU utilization. We compare DSP with state-of-the-art GNN training frameworks, and the results show that DSP consistently outperforms the baselines across different datasets, GNN models, and GPU counts. The speedup of DSP can be up to 26x and exceeds 2x in most cases.
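
To make the data-layout idea concrete, the sketch below keeps the features of the most popular nodes (here approximated by degree) resident in GPU memory and falls back to pinned host memory for the rest. This is a minimal illustration under that assumption, not DSP's actual partitioned layout; FeatureCache, the degree-based admission policy, and all names are hypothetical.

```python
# Illustrative degree-based feature cache (assumption: "popular" nodes are
# the highest-degree ones; DSP's actual multi-GPU layout is more involved).
import torch

class FeatureCache:
    def __init__(self, cpu_feats, degrees, budget, device="cuda:0"):
        # Keep the `budget` highest-degree node features on the GPU.
        hot = torch.topk(degrees, budget).indices
        self.gpu_feats = cpu_feats[hot].to(device)
        # pos[v] = slot of node v in the GPU cache, or -1 on a miss.
        self.pos = torch.full((cpu_feats.shape[0],), -1, dtype=torch.long)
        self.pos[hot] = torch.arange(budget)
        self.cpu_feats = cpu_feats.pin_memory()  # pinned copy for host fallback
        self.device = device

    def gather(self, nids):
        """Collect features for `nids` (CPU LongTensor): hits from the GPU
        cache, misses gathered on the host and copied over."""
        pos = self.pos[nids]
        hit = pos >= 0
        out = torch.empty(len(nids), self.cpu_feats.shape[1], device=self.device)
        out[hit.to(self.device)] = self.gpu_feats[pos[hit].to(self.device)]
        out[(~hit).to(self.device)] = self.cpu_feats[nids[~hit]].to(self.device)
        return out
```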
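
The abstract describes CSP only at a high level; the following sketch shows one way to read "pushes sampling tasks to the data": each GPU ships frontier node IDs to the GPU that owns their adjacency lists, sampling happens at the owner, and only sampled neighbors travel back. The callbacks `owner_of` and `local_sample` are hypothetical placeholders, and the fixed fanout keeps return message sizes deterministic.

```python
# Hedged sketch of the push-to-data idea, using torch.distributed
# all-to-all collectives (NCCL backend, one process per GPU).
import torch
import torch.distributed as dist

def collective_sample(frontier, owner_of, local_sample, fanout):
    world_size = dist.get_world_size()
    dev = frontier.device

    # 1. Bucket the frontier by the GPU that owns each node's adjacency list.
    owners = owner_of(frontier)                       # (N,) partition id per node
    send = [frontier[owners == r].contiguous() for r in range(world_size)]

    # 2. Exchange bucket sizes so receive buffers can be allocated.
    send_sizes = torch.tensor([len(b) for b in send], device=dev)
    recv_sizes = torch.empty_like(send_sizes)
    dist.all_to_all_single(recv_sizes, send_sizes)

    # 3. Ship node IDs to their owners (tasks move; adjacency data does not).
    recv = [torch.empty(int(n), dtype=frontier.dtype, device=dev)
            for n in recv_sizes]
    dist.all_to_all(recv, send)

    # 4. Each GPU samples `fanout` neighbors per requested node locally.
    sampled = [local_sample(req, fanout) for req in recv]  # len(req) * fanout each

    # 5. Return the sampled neighbors to the GPUs that asked for them.
    back = [torch.empty(int(n) * fanout, dtype=frontier.dtype, device=dev)
            for n in send_sizes]
    dist.all_to_all(back, sampled)
    return torch.cat(back)
```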
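
Similarly, the producer-consumer pipeline can be approximated with a bounded queue: a producer thread prepares future mini-batches (sampling plus feature gathering) while the training loop consumes batches that are already materialized, so work from different mini-batches overlaps. This is a minimal single-queue sketch, not DSP's multi-stage pipeline; all names are illustrative.

```python
# Minimal producer-consumer pipeline: preparing batch i+1 overlaps
# training on batch i. The bounded queue caps memory use.
import threading
import queue

def pipelined_batches(sample, load_features, seed_batches, depth=2):
    q = queue.Queue(maxsize=depth)  # producer runs at most `depth` batches ahead

    def producer():
        for seeds in seed_batches:
            blocks = sample(seeds)           # graph sampling for a future batch
            feats = load_features(blocks)    # gather the batch's input features
            q.put((blocks, feats))           # blocks until the consumer catches up
        q.put(None)                          # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not None:
        yield item

# Hypothetical usage:
#   for blocks, feats in pipelined_batches(sample, load_features, batches):
#       loss = model(blocks, feats)
#       loss.backward(); optimizer.step(); optimizer.zero_grad()
```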

Original language: English (US)
Title of host publication: PPoPP 2023 - Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
Publisher: Association for Computing Machinery
Pages: 392-404
Number of pages: 13
ISBN (Electronic): 9798400700156
DOIs
State: Published - Feb 25, 2023
Externally published: Yes
Event: 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023 - Montreal, Canada
Duration: Feb 25, 2023 - Mar 1, 2023

Publication series

Name: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP

Conference

Conference: 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023
Country/Territory: Canada
City: Montreal
Period: 2/25/23 - 3/1/23

Bibliographical note

Publisher Copyright:
© 2023 ACM.

Keywords

  • GPU
  • graph neural networks
  • model training
