Multi-stage coordinated prefetching for present-day processors

Sanyam Mehta, Zhenman Fang, Antonia Zhai, Pen Chung Yew

Research output: Chapter in Book/Report/Conference proceedingConference contribution

18 Scopus citations


Data prefetching is an important technique for hiding memory latency. Latest microarchitectures provide support for both hardware and software prefetching. However, the architectural features supporting either are different. In addition, these features can vary from one architecture to another. As a result, the choice of the right prefetching strategy is non-trivial for both the programmers and compiler-writers. In this paper, we study different prefetching techniques in the context of different architectural features that support prefetching on existing hardware platforms. These features include, the size of the line fill buffer or the Miss Status Handling Registers servicing prefetch requests at each level of cache, the aggressiveness and effectiveness of the hardware prefetchers, interaction between software prefetch requests and the hardware prefetcher, the nature of the instruction pipeline (in-order/out-of-order execution), etc. Our experiments with two widely different processors, a latest multi-core (SandyBridge) and a many-core (Xeon Phi) processor, show that these architectural features have a significant bearing on the prefetching choice in a given source program, so much so that the best prefetching technique on SandyBridge performs worst on Xeon Phi and vice-versa. Based on our study of the interaction between the host architecture and prefetching, we find that coordinated multi-stage prefetching that brings data closer to the core in stages, yields best performance. On SandyBridge, the mid-level cache hardware prefetcher and L1 software prefetching coordinate to achieve this end, whereas on Xeon Phi, pure software prefetching proves adequate. We implement our algorithm in the ROSE source-to-source compiler framework. Experimental results demonstrate that coordinated prefetching achieves a speed-up (geometric mean over benchmarks from the SPEC suite) of 1.55X and 1.3X against the hardware prefetcher and the Intel compiler, respectively, on Xeon Phi. On SandyBridge, a speed-up of 1.08X is obtained over its effective hardware prefetcher.

Original languageEnglish (US)
Title of host publicationICS 2014 - Proceedings of the 28th ACM International Conference on Supercomputing
PublisherAssociation for Computing Machinery
Number of pages10
ISBN (Print)9781450326421
StatePublished - 2014
Event28th ACM International Conference on Supercomputing, ICS 2014 - Munich, Germany
Duration: Jun 10 2014Jun 13 2014

Publication series

NameProceedings of the International Conference on Supercomputing


Other28th ACM International Conference on Supercomputing, ICS 2014


  • coordinated prefetching
  • sandybridge
  • xeonphi


Dive into the research topics of 'Multi-stage coordinated prefetching for present-day processors'. Together they form a unique fingerprint.

Cite this