Cyclic convolution is a widely used operation in signal processing. In very large-scale integration (VLSI) design, it is usually implemented with systolic array and distributed arithmetic; however, these implementation designs may not be fast enough or use too much hardware cost when the convolution length is large. This paper presents a new fast cyclic convolution algorithm, which is hardware efficient and suitable for high-speed VLSI implementation, especially when the convolution length is large. For example, when the proposed fast cyclic convolution algorithm is applied to the implementation of prime length discrete cosine transform (DCT), the proposed high-throughput implementation of 1297-length DCT design saves 1216 (94%) multiplications, 282 (22%) additions, and 4792 (74%) delay elements compared with those of recently proposed systolic array based algorithms. Furthermore, the proposed algorithm can run at a speed that is 1.5 times that of previous designs and requires less I/O cost as long as the wordlength L is less than 20 bits.
Bibliographical noteFunding Information:
Manuscript received February 15, 2005; revised December 28, 2005. This work was supported in part by the National Science Foundation under Grant CCF-0429979. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Shuvra S. Bhattacharyya.
- Cyclic convolution
- Discrete cosine transforms
- Linear convolution
- Very large-scale integration