Accelerating Transformer-based Deep Learning Models on FPGAs using Column Balanced Block Pruning

Hongwu Peng, Shaoyi Huang, Tong Geng, Ang Li, Weiwen Jiang, Hang Liu, Shusen Wang, Caiwen Ding

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

62 Scopus citations

Abstract

Although Transformer-based language representations achieve state-of-the-art accuracy on various natural language processing (NLP) tasks, their large model size poses a challenge for resource-constrained computing platforms. Weight pruning, a popular and effective technique for reducing the number of weight parameters and accelerating the Transformer, has been investigated on GPUs. However, Transformer acceleration using weight pruning on field-programmable gate arrays (FPGAs) remains unexplored. This paper investigates column balanced block-wise pruning of the Transformer and designs an FPGA acceleration engine customized for balanced block-wise matrix multiplication. We implement the Transformer model with proper hardware scheduling, and the experiments show that Transformer inference on the FPGA achieves a latency of 10.35 ms with a batch size of 32, a $10.96\times$ speedup over the CPU platform and a $2.08\times$ speedup over the GPU platform.
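As context for the technique named in the abstract, the following is a minimal NumPy sketch of one plausible reading of column balanced block-wise pruning: the weight matrix is tiled into fixed-size blocks, and within each column of blocks the same number of blocks (those with the largest L2 norms) is kept while the rest are zeroed, so every block column retains an equal number of surviving blocks and thus a balanced workload. The function name, block size, and keep ratio are illustrative assumptions, not the paper's exact implementation.

    import numpy as np

    def column_balanced_block_prune(W, block_size=4, keep_ratio=0.5):
        """Hypothetical sketch: zero out low-magnitude blocks so every block
        column keeps the same number of surviving blocks."""
        rows, cols = W.shape
        assert rows % block_size == 0 and cols % block_size == 0
        n_block_rows = rows // block_size
        n_block_cols = cols // block_size
        keep = max(1, int(round(keep_ratio * n_block_rows)))

        mask = np.zeros_like(W)
        for bc in range(n_block_cols):
            col_slice = slice(bc * block_size, (bc + 1) * block_size)
            # L2 norm of every block within this block column
            norms = np.array([
                np.linalg.norm(W[br * block_size:(br + 1) * block_size, col_slice])
                for br in range(n_block_rows)
            ])
            # keep the 'keep' largest-norm blocks; prune the rest
            for br in np.argsort(norms)[-keep:]:
                mask[br * block_size:(br + 1) * block_size, col_slice] = 1.0
        return W * mask

    # usage: prune a random 16x16 weight matrix to 50% block sparsity per block column
    W = np.random.randn(16, 16)
    W_pruned = column_balanced_block_prune(W, block_size=4, keep_ratio=0.5)

Because each block column keeps an identical count of nonzero blocks, a hardware matrix-multiplication engine can assign one block column per compute lane without load imbalance, which is the general motivation for balanced pruning schemes on FPGAs.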

Original language: English (US)
Title of host publication: Proceedings of the 22nd International Symposium on Quality Electronic Design, ISQED 2021
Publisher: IEEE Computer Society
Pages: 142-148
Number of pages: 7
ISBN (Electronic): 9781728176413
DOIs
State: Published - Apr 7 2021
Externally published: Yes
Event: 22nd International Symposium on Quality Electronic Design, ISQED 2021 - Santa Clara, United States
Duration: Apr 7 2021 - Apr 9 2021

Publication series

Name: Proceedings - International Symposium on Quality Electronic Design, ISQED
Volume: 2021-April
ISSN (Print): 1948-3287
ISSN (Electronic): 1948-3295

Conference

Conference: 22nd International Symposium on Quality Electronic Design, ISQED 2021
Country/Territory: United States
City: Santa Clara
Period: 4/7/21 - 4/9/21

Bibliographical note

Publisher Copyright:
© 2021 IEEE.

Keywords

  • Acceleration
  • Deep learning
  • FPGA
  • Pruning
  • Transformer
