HMC-TRAN: A Tensor-core Inspired Hierarchical Model Compression for Transformer-based DNNs on GPU

Shaoyi Huang, Shiyang Chen, Hongwu Peng, Daniel Manu, Zhenglun Kong, Geng Yuan, Lei Yang, Shusen Wang, Hang Liu, Caiwen Ding

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Scopus citations

Abstract

Although Transformer-based deep learning models have been widely used in many natural language processing (NLP) tasks as well as computer vision, they suffer from gigantic model size and long latency. Network pruning can reduce the computational cost and model size. However, existing works mainly focus on irregular (sparse) pruning, which often causes irregular computations and requires an extra index per remaining weight. In this work, we propose a Tensor-core inspired hierarchical model compression method to push the performance limit on modern GPUs. We present two modes of the two-step process. In the first mode, we use a Tensor-core aware block-based weight pruning method to exploit model sparsity in a coarse-grained manner and then apply low-rank decomposition [33] to further reduce weight storage in a fine-grained manner. In the second mode, we first use irregular pruning to obtain a highly sparse model and then apply the Tensor-core aware weight constraint on the sparse model to decompose the sparse matrix into several smaller but Tensor-core friendly sub-matrices. Experiments on Transformer and BERT-BASE models show that the proposed method outperforms the state of the art.
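The abstract only sketches the first mode at a high level; the following is a minimal NumPy illustration of the two steps it names, namely block-based weight pruning (removing whole tiles so the surviving structure maps onto Tensor-core-sized blocks) followed by low-rank decomposition via truncated SVD. The block size, keep ratio, rank, and matrix shape below are illustrative assumptions, not values or code from the paper.

```python
# Hypothetical sketch of "mode one": coarse-grained block pruning, then
# fine-grained low-rank factorization. All hyperparameters are assumptions.
import numpy as np

def block_prune(weight, block=16, keep_ratio=0.5):
    """Zero out whole block x block tiles with the smallest Frobenius norms."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0, "pad weight to a block multiple"
    # View the matrix as a grid of tiles and score each tile by its norm.
    tiles = weight.reshape(rows // block, block, cols // block, block)
    norms = np.linalg.norm(tiles, axis=(1, 3))            # shape: (rows/block, cols/block)
    k = max(1, int(norms.size * keep_ratio))
    threshold = np.sort(norms.ravel())[::-1][k - 1]       # norm of the k-th largest tile
    mask = (norms >= threshold).astype(weight.dtype)      # 1 = keep tile, 0 = prune tile
    pruned = tiles * mask[:, None, :, None]
    return pruned.reshape(rows, cols), mask

def low_rank(weight, rank=64):
    """Truncated SVD: approximate the weight by a product of two thin factors."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]                            # (rows, rank)
    b = vt[:rank, :]                                      # (rank, cols)
    return a, b

if __name__ == "__main__":
    w = np.random.randn(768, 768).astype(np.float32)      # e.g. one BERT-base projection
    w_pruned, mask = block_prune(w, block=16, keep_ratio=0.5)
    a, b = low_rank(w_pruned, rank=64)
    err = np.linalg.norm(w_pruned - a @ b) / np.linalg.norm(w_pruned)
    print(f"kept tiles: {int(mask.sum())}/{mask.size}, relative low-rank error: {err:.3f}")
```

The second mode (irregular pruning followed by regrouping the surviving weights into Tensor-core friendly sub-matrices) is not shown here, since the abstract does not specify how the sparse matrix is partitioned.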

Original language: English (US)
Title of host publication: GLSVLSI 2021 - Proceedings of the 2021 Great Lakes Symposium on VLSI
Publisher: Association for Computing Machinery
Pages: 169-174
Number of pages: 6
ISBN (Electronic): 9781450383936
DOIs
State: Published - Jun 22 2021
Externally published: Yes
Event: 31st Great Lakes Symposium on VLSI, GLSVLSI 2021 - Virtual, Online, United States
Duration: Jun 22 2021 - Jun 25 2021

Publication series

Name: Proceedings of the ACM Great Lakes Symposium on VLSI, GLSVLSI

Conference

Conference: 31st Great Lakes Symposium on VLSI, GLSVLSI 2021
Country/Territory: United States
City: Virtual, Online
Period: 6/22/21 - 6/25/21

Bibliographical note

Publisher Copyright:
© 2021 ACM.

Keywords

  • bert
  • block weight pruning
  • low-rank
  • tensor-core
  • transformer
