Performance-energy considerations for shared cache management in a heterogeneous multicore processor

Anup Holey, Vineeth Mekkat, Pen Chung Yew, Antonia Zhai

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

Heterogeneous multicore processors that integrate CPU cores and data-parallel accelerators such as graphics processing unit (GPU) cores onto the same die raise several new issues for sharing various on-chip resources. The shared last-level cache (LLC) is one of the most important shared resources due to its impact on performance. Accesses to the shared LLC in heterogeneous multicore processors can be dominated by the GPU due to the significantly higher number of concurrent threads supported by the architecture. Under current cache management policies, the CPU applications' share of the LLC can be significantly reduced in the presence of competing GPU applications. For many CPU applications, a reduced share of the LLC can lead to significant performance degradation. In contrast, GPU applications can tolerate an increase in memory access latency when sufficient thread-level parallelism (TLP) is available. Beyond this performance challenge, the introduction of diverse cores onto the same die changes the energy consumption profile and, in turn, affects the energy efficiency of the processor. In this work, we propose heterogeneous LLC management (HeLM), a novel shared LLC management policy that takes advantage of the GPU's tolerance for memory access latency. HeLM throttles GPU LLC accesses and yields LLC space to cache-sensitive CPU applications. Throttling is achieved by allowing GPU accesses to bypass the LLC when an increase in memory access latency can be tolerated. The latency tolerance of a GPU application is determined by the availability of TLP, measured at runtime as the average number of threads available for issuing. For a baseline configuration with two CPU cores and four GPU cores, modeled after existing heterogeneous processor designs, HeLM outperforms the least recently used (LRU) policy by 10.4% and also outperforms competing policies.
Our evaluations show that HeLM sustains its performance across varying core mixes. In addition to the performance benefit, bypassing reduces total accesses to the LLC, lowering the energy consumption of the LLC module. However, LLC bypassing can increase off-chip bandwidth utilization and DRAM energy consumption. Our experiments show that HeLM improves energy efficiency, reducing the energy-delay-squared product (ED²) by 18% over LRU while incurring only a 7% increase in off-chip bandwidth utilization.
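The bypass mechanism the abstract describes can be sketched in a few lines: GPU accesses skip the shared LLC when runtime TLP is high enough to hide the extra DRAM latency. This is an illustrative sketch only; the function names, sampling scheme, and threshold value are assumptions, not the paper's actual parameters.

```python
# Hypothetical sketch of a HeLM-style bypass decision. The TLP estimate
# (mean number of threads ready to issue) and the threshold are
# illustrative assumptions, not values from the paper.

def average_available_threads(ready_thread_samples):
    """Runtime TLP estimate: mean number of GPU threads ready to issue,
    taken over a window of sampled cycles."""
    return sum(ready_thread_samples) / len(ready_thread_samples)

def should_bypass_llc(ready_thread_samples, tlp_threshold=16):
    """Let GPU accesses bypass the shared LLC when measured TLP is high
    enough that other ready threads can hide the added memory latency;
    otherwise keep using the LLC."""
    return average_available_threads(ready_thread_samples) >= tlp_threshold

# High TLP: latency can be hidden, so bypass and yield LLC space to the CPU.
print(should_bypass_llc([24, 30, 28]))
# Low TLP: too few ready threads to hide latency, so use the LLC.
print(should_bypass_llc([2, 1, 3]))
```

In the paper's actual design the decision is made in hardware at runtime; this sketch only captures the shape of the heuristic, where cache-sensitive CPU applications gain LLC capacity whenever the GPU can trade latency for throughput.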

Original language: English (US)
Article number: 3
Journal: ACM Transactions on Architecture and Code Optimization
Volume: 12
Issue number: 1
DOI: 10.1145/2710019
State: Published - Mar 1 2015


Cite this

Performance-energy considerations for shared cache management in a heterogeneous multicore processor. / Holey, Anup; Mekkat, Vineeth; Yew, Pen Chung; Zhai, Antonia.

In: ACM Transactions on Architecture and Code Optimization, Vol. 12, No. 1, Article 3, 01.03.2015.


@article{ff996b0c1a614ed0b0388c6018533b86,
title = "Performance-energy considerations for shared cache management in a heterogeneous multicore processor",
author = "Anup Holey and Vineeth Mekkat and Yew, {Pen Chung} and Antonia Zhai",
year = "2015",
month = "3",
day = "1",
doi = "10.1145/2710019",
language = "English (US)",
volume = "12",
journal = "ACM Transactions on Architecture and Code Optimization",
issn = "1544-3566",
publisher = "Association for Computing Machinery (ACM)",
number = "1",

}

TY - JOUR

T1 - Performance-energy considerations for shared cache management in a heterogeneous multicore processor

AU - Holey, Anup

AU - Mekkat, Vineeth

AU - Yew, Pen Chung

AU - Zhai, Antonia

PY - 2015/3/1

Y1 - 2015/3/1


UR - http://www.scopus.com/inward/record.url?scp=84924855916&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84924855916&partnerID=8YFLogxK

U2 - 10.1145/2710019

DO - 10.1145/2710019

M3 - Article

VL - 12

JO - ACM Transactions on Architecture and Code Optimization

JF - ACM Transactions on Architecture and Code Optimization

SN - 1544-3566

IS - 1

M1 - 3

ER -