Abstract
Trace-driven simulations of numerical Fortran programs are used to study the impact of the parallel loop scheduling strategy on data prefetching in a shared memory multiprocessor with private data caches. The simulations indicate that to maximize memory performance, it is important to schedule blocks of consecutive iterations to execute on each processor, and then to adaptively prefetch single-word cache blocks to match the number of iterations scheduled. Prefetching multiple single-word cache blocks on a miss reduces the miss ratio by approximately 5% to 30% compared to a system with no prefetching. In addition, the proposed adaptive prefetching scheme further reduces the miss ratio while significantly reducing the false sharing among cache blocks compared to nonadaptive prefetching strategies. Reducing the false sharing causes fewer coherence invalidations to be generated, and thereby reduces the total network traffic. The impact of the prefetching and scheduling strategies on the temporal distribution of coherence invalidations is also examined. It is found that invalidations tend to be evenly distributed throughout the execution of parallel loops, but tend to be clustered when executing sequential program sections. The distribution of invalidations in both types of program sections is relatively insensitive to the prefetching and scheduling strategy.
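The core idea of the adaptive scheme can be illustrated with a minimal sketch (this is not the paper's simulator): a toy fully associative cache of single-word blocks where, on a miss, the prefetch degree is set to the number of consecutive loop iterations scheduled to the processor. All names here (`simulate`, `prefetch_degree`, the FIFO replacement policy, and the trace) are illustrative assumptions.

```python
def simulate(addresses, prefetch_degree, cache_size=64):
    """Count misses for a word-address trace with sequential prefetching.

    On each miss, the missing word plus (prefetch_degree - 1) consecutive
    single-word blocks are fetched, mimicking adaptive prefetching when
    prefetch_degree equals the scheduled iteration chunk size.
    """
    cache = set()
    order = []  # FIFO replacement order, for simplicity
    misses = 0
    for a in addresses:
        if a not in cache:
            misses += 1
            for w in range(a, a + prefetch_degree):
                if w not in cache:
                    cache.add(w)
                    order.append(w)
                    if len(cache) > cache_size:
                        cache.discard(order.pop(0))
    return misses

# A processor is scheduled a chunk of 8 consecutive iterations, each
# touching one word of an array.
chunk = list(range(100, 108))
print(simulate(chunk, 1))           # no prefetching: every word misses -> 8
print(simulate(chunk, len(chunk)))  # adaptive: first miss pulls in the rest -> 1
```

With single-word blocks, matching the prefetch degree to the chunk size captures the spatial locality of the scheduled iterations without fetching words owned by other processors, which is how the scheme avoids the false sharing that fixed multi-word blocks can introduce.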
Original language | English (US) |
---|---|
Pages (from-to) | 573-584 |
Number of pages | 12 |
Journal | IEEE Transactions on Parallel and Distributed Systems |
Volume | 5 |
Issue number | 6 |
DOIs | |
State | Published - Jun 1994 |
Bibliographical note
Funding Information: Manuscript received May 20, 1992; revised December 8, 1992, and May 5, 1993. This work was supported in part by the National Science Foundation under Grants CCR-9209458 and MIP-9221900. A preliminary version of this work [16] was presented at the First Midwest Electrotechnology Conference in April 1992.
Keywords
- Cache coherence
- cache pollution
- false sharing
- guided self-scheduling
- multiprocessor
- prefetching
- scheduling
- shared memory