Abstract
Training convolutional neural networks (CNNs) requires intensive computation as well as a large amount of storage and memory access. While low-bandwidth off-chip memories have hindered system-level performance in prior FPGA works, modern FPGAs offer high-bandwidth memory (HBM2), which unlocks opportunities to improve the throughput and energy efficiency of FPGA-based CNN training. This paper presents an FPGA accelerator for CNN training that (1) uses HBM2 for efficient off-chip communication and (2) supports various training operations (e.g., residual connections, stride-2 convolutions) for modern CNNs. We analyze the impact of HBM2 on CNN training workloads, provide a comprehensive comparison with DDR3, and present strategies to efficiently use HBM2 features for enhanced CNN training performance. For training ResNet-20/VGG-like CNNs on the CIFAR-10 dataset with a low batch size of 2, the proposed CNN training accelerator on an Intel Stratix-10 MX FPGA demonstrates 1.4×/1.7× energy-efficiency improvement compared to a Stratix-10 GX FPGA with DDR3 memory, and 4.5×/9.7× energy-efficiency improvement compared to a Tesla V100 GPU.
Original language | English (US) |
---|---|
Article number | 9256704 |
Journal | IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD |
Volume | 2020-November |
State | Published - Nov 2 2020 |
Externally published | Yes |
Event | 39th IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2020 (Virtual, San Diego, United States), Nov 2 2020 → Nov 5 2020 |
Bibliographical note
Publisher Copyright: © 2020 Association for Computing Machinery.
Keywords
- backpropagation
- convolutional neural networks
- FPGA
- hardware accelerator
- neural network training