Abstract
Transformer-based deep learning models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks. In this paper, we propose a compression-compilation co-design framework that guarantees the identified model meets both the resource and real-time specifications of mobile devices. Our framework applies a compiler-aware neural architecture optimization method (CANAO), which generates an optimal compressed model that balances accuracy and latency. We achieve up to a 7.8× speedup over TensorFlow-Lite with only minor accuracy loss. We present two types of BERT applications on mobile devices: Question Answering (QA) and Text Generation. Both can be executed in real time with latency as low as 45 ms.
Original language | English (US) |
---|---|
Title of host publication | Proceedings of the 30th International Joint Conference on Artificial Intelligence, IJCAI 2021 |
Editors | Zhi-Hua Zhou |
Publisher | International Joint Conferences on Artificial Intelligence |
Pages | 5000-5003 |
Number of pages | 4 |
ISBN (Electronic) | 9780999241196 |
DOIs | |
State | Published - 2021 |
Externally published | Yes |
Event | 30th International Joint Conference on Artificial Intelligence, IJCAI 2021 - Virtual, Online, Canada |
Duration | Aug 19, 2021 → Aug 27, 2021 |
Publication series
Name | IJCAI International Joint Conference on Artificial Intelligence |
---|---|
ISSN (Print) | 1045-0823 |
Conference
Conference | 30th International Joint Conference on Artificial Intelligence, IJCAI 2021 |
---|---|
Country/Territory | Canada |
City | Virtual, Online |
Period | 8/19/21 → 8/27/21 |
Bibliographical note
Publisher Copyright: © 2021 International Joint Conferences on Artificial Intelligence. All rights reserved.