A wide variety of computer architectures have been proposed to exploit parallelism at different granularities. For example, pipelined and multiple-instruction-issue processors exploit the fine-grained parallelism available at the machine-instruction level, while shared-memory multiprocessors exploit the coarse-grained parallelism available at the loop level. Using a register-transfer-level simulation methodology, this paper examines the performance of a multiprocessor architecture that combines coarse-grained and fine-grained parallelism strategies to minimize the execution time of a single application program. The simulations indicate that the best system performance is obtained with a mix of the two: any number of processors can be used, but each processor should be pipelined to a degree of 2 to 4, or should be capable of issuing 2 to 4 instructions per cycle. These results suggest that current high-performance microprocessors, which typically can have 2 to 4 instructions executing simultaneously, may be excellent components from which to construct a multiprocessor system.
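The coarse-grained, loop-level strategy described above amounts to distributing a loop's iterations across processors while each processor exploits instruction-level parallelism internally in hardware. The following Python sketch illustrates only the iteration-distribution step; the function names (`body`, `parallel_loop`) and the cyclic distribution scheme are illustrative assumptions, not the paper's simulated machine, and real execution on separate processors is merely modeled by computing each chunk independently.

```python
def body(i):
    # One loop iteration; a hypothetical stand-in for real work.
    return i * i

def parallel_loop(n, num_procs):
    # Coarse-grained strategy: distribute iterations 0..n-1 cyclically
    # across num_procs processors; each processor runs its own chunk.
    chunks = [range(p, n, num_procs) for p in range(num_procs)]
    # Each inner list would execute on a separate processor, whose
    # pipeline or multiple-issue logic supplies the fine-grained parallelism.
    partial = [[body(i) for i in chunk] for chunk in chunks]
    # Merge the per-processor results back into iteration order.
    out = [0] * n
    for p, chunk in enumerate(chunks):
        for j, i in enumerate(chunk):
            out[i] = partial[p][j]
    return out

# The distributed loop produces the same result as the serial loop.
assert parallel_loop(8, 4) == [body(i) for i in range(8)]
```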
Bibliographical note and funding information:
Thanks to Carl Beckmann for his insights into multiprocessor synchronization, and to John Andrews and Pen-Chung Yew for their helpful comments on an early version of this paper. A preliminary version of this work was presented at the Twenty-Fourth Annual Hawaii International Conference on System Sciences [19]. Portions of this work were performed at the University of Illinois at Urbana-Champaign while the author was supported by the National Science Foundation under Grant No. MIP-8410110, with additional support from NASA Ames Research Center Grant No. NCC 2-559 (DARPA), National Science Foundation Grant No. MIP-88-07775, and Department of Energy Grant No. DE-FG02-85ER25001.
- Instruction-level parallelism
- Loop-level parallelism
- Performance comparisons