E.T.: Re-Thinking Self-Attention for Transformer Models on GPUs

Shiyang Chen, Shaoyi Huang, Santosh Pandey, Bingbing Li, Guang R. Gao, Long Zheng, Caiwen Ding, Hang Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

28 Scopus citations

Abstract

Transformer-based deep learning models have become a ubiquitous vehicle for driving a variety of Natural Language Processing (NLP) tasks beyond their previous accuracy ceiling. However, these models also suffer from two pronounced challenges: gigantic model size and prolonged turnaround time. To this end, we introduce E.T., which re-thinks self-attention computation for Transformer models on GPUs with the following contributions. First, we introduce a novel self-attention architecture, which encompasses two tailored self-attention operators with corresponding sequence-length-aware optimizations, together with operation-reordering optimizations. Second, we present an attention-aware pruning design which judiciously uses various pruning algorithms to reduce more computation and hence achieve a significantly shorter turnaround time. For the pruning algorithms, we not only revamp existing pruning algorithms but also tailor new ones for Transformer models. Taken together, we evaluate E.T. across a variety of benchmarks for Transformer, BERTBASE, and DistilBERT, where E.T. delivers superior performance over mainstream projects, including the popular Nvidia enterprise solutions, i.e., TensorRT and FasterTransformer.
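For readers unfamiliar with the computation the paper targets, the sketch below shows standard single-head scaled dot-product self-attention in NumPy, with a simple padding mask standing in for sequence-length awareness and a generic magnitude-pruning helper standing in for the pruning ideas. It is a minimal illustration of the baseline operations only, not the authors' tailored operators, reordering scheme, or attention-aware pruning algorithms; all function names and shapes here are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, valid_len=None):
    """Scaled dot-product self-attention for a single head and sequence.

    X: (seq_len, d_model) token embeddings.
    Wq, Wk, Wv: (d_model, d_head) projection weights.
    valid_len: number of real (non-padding) tokens; keys beyond it are
               masked out, a simple stand-in for length-aware handling.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # (seq_len, d_head) each
    scores = (Q @ K.T) / np.sqrt(K.shape[-1])    # (seq_len, seq_len)
    if valid_len is not None:
        scores[:, valid_len:] = -1e9             # ignore padded keys
    return softmax(scores, axis=-1) @ V          # (seq_len, d_head)

def magnitude_prune(W, sparsity=0.5):
    """Generic magnitude pruning: zero out the smallest-magnitude weights.

    This is a textbook baseline, not the paper's attention-aware design.
    """
    k = int(W.size * sparsity)
    threshold = np.partition(np.abs(W).ravel(), k)[k]
    return np.where(np.abs(W) < threshold, 0.0, W)

# Toy example: an 8-token sequence where only the first 5 tokens are real.
rng = np.random.default_rng(0)
d_model, d_head, seq_len = 16, 8, 8
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
out = self_attention(X, magnitude_prune(Wq), Wk, Wv, valid_len=5)
print(out.shape)  # (8, 8)
```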

Original language: English (US)
Title of host publication: Proceedings of SC 2021
Subtitle of host publication: The International Conference for High Performance Computing, Networking, Storage and Analysis: Science and Beyond
Publisher: IEEE Computer Society
ISBN (Electronic): 9781450384421
DOIs
State: Published - Nov 14 2021
Externally published: Yes
Event: 33rd International Conference for High Performance Computing, Networking, Storage and Analysis: Science and Beyond, SC 2021 - Virtual, Online, United States
Duration: Nov 14 2021 - Nov 19 2021

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Conference

Conference: 33rd International Conference for High Performance Computing, Networking, Storage and Analysis: Science and Beyond, SC 2021
Country/Territory: United States
City: Virtual, Online
Period: 11/14/21 - 11/19/21

Bibliographical note

Publisher Copyright:
© 2021 IEEE Computer Society. All rights reserved.
