LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors


Javier Lira ψ

Carlos Molina ψ,ф

Antonio González ψ,λ

λ Intel Barcelona Research Center

Intel Labs - UPC

Barcelona, Spain

[email protected]

ф Dept. Enginyeria Informàtica

Universitat Rovira i Virgili

Tarragona, Spain

[email protected]

ψ Dept. Arquitectura de Computadors

Universitat Politècnica de Catalunya

Barcelona, Spain

[email protected]

ICCD 2009, Lake Tahoe, CA (USA) - October 6, 2009


Outline

Introduction

Methodology

LRU-PEA

Results

Conclusions


Introduction

CMPs have emerged as a dominant paradigm in system design.

1. Sustain performance improvements while reducing power consumption.

2. Take advantage of thread-level parallelism.

Commercial CMPs are currently available.

CMPs incorporate larger and shared last-level caches.

Wire delay is a key constraint.


NUCA

Non-Uniform Cache Architecture (NUCA) was first proposed at ASPLOS 2002 by Kim et al. [1].

NUCA divides a large cache into smaller, faster banks.

Banks close to the cache controller have lower latencies than banks farther away.


[1] C. Kim, D. Burger and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS ‘02

NUCA Policies

Bank Placement Policy

Bank Access Policy

Bank Replacement Policy

Bank Migration Policy

Outline

Introduction

Methodology

LRU-PEA

Results

Conclusions


Methodology

Simulation tools: Simics + GEMS, CACTI v6.0

PARSEC Benchmark Suite

Number of cores: 8 (UltraSPARC IIIi)

Frequency: 1.5 GHz

Main memory size: 4 GBytes

Memory bandwidth: 512 Bytes/cycle

L1 cache latency: 3 cycles

NUCA bank latency: 4 cycles

Router delay: 1 cycle

On-chip wire delay: 1 cycle

Main memory latency: 250 cycles (from core)

Private L1 caches: 8 x 32 KBytes, 2-way

Shared L2 NUCA cache: 8 MBytes, 256 banks

NUCA bank: 32 KBytes, 8-way
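As a quick consistency check on the configuration above, the sketch below recomputes the total NUCA capacity from the bank count and bank size, and estimates how access latency grows with distance. The capacity arithmetic uses only values from the table. The latency formula is an assumption made for illustration (request and reply each traverse one router and one wire segment per hop, plus a single bank access); the slides do not state the network model, and the function names are hypothetical.

```python
# Values copied from the methodology table.
NUM_BANKS = 256
BANK_SIZE_KB = 32
NUCA_BANK_LATENCY = 4  # cycles
ROUTER_DELAY = 1       # cycles per hop
WIRE_DELAY = 1         # cycles per hop


def total_capacity_mb(num_banks: int = NUM_BANKS, bank_kb: int = BANK_SIZE_KB) -> float:
    """Aggregate NUCA capacity in MBytes (banks x bank size)."""
    return num_banks * bank_kb / 1024


def round_trip_latency(hops: int) -> int:
    """ASSUMED model, not taken from the slides: the request and the reply
    each cross `hops` routers and wire segments, plus one bank access."""
    return 2 * hops * (ROUTER_DELAY + WIRE_DELAY) + NUCA_BANK_LATENCY


if __name__ == "__main__":
    print(total_capacity_mb())    # 256 banks x 32 KBytes
    print(round_trip_latency(3))  # a bank three hops away under the assumed model
```

Under this assumed model, each extra hop adds four cycles to the round trip, which is why keeping frequently used data in nearby banks matters.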

Baseline NUCA cache architecture

CMP-DNUCA

8 cores

256 banks

Non-inclusive

[2] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. MICRO ‘04

Outline

Introduction

Methodology

LRU-PEA: Background, How does it work?

Results

Conclusions


Background

Entrance into the NUCA: off-chip memory, L1 cache replacements

Migration movements: promotion, demotion


Data categories

1. Off-chip

2. L1 cache replacements

3. Promoted data

4. Demoted data

LRU-PEA

LRU with Priority Eviction Approach.

Replacement policy for CMP-NUCA architectures.

Data Eviction Policy: Chooses data to evict from a NUCA bank.

Data Target Policy: Determines the destination bank of the evicted data. Globalizes replacement decisions to the whole NUCA.


Data Eviction Policy

Based on the LRU replacement policy. Static prioritisation of NUCA data categories. Lowest-category data is evicted from the NUCA bank.

Problem: data in the highest-priority category could monopolize the NUCA cache.

Solution: category comparison is restricted to the LRU and LRU-1 positions.

Category priority by bank type:

Priority      Local banks        Central banks
Highest (+)   L1 Replacements    Promoted
              Promoted           Off-chip
              Off-chip           Demoted
Lowest (-)    Demoted            L1 Replacements

Data Eviction Policy

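Example (NUCA bank, 4-way)**:

Position 0 (MRU): @A, Promoted

Position 1: @B, Demoted

Position 2 (LRU-1): @C, Off-chip

Position 3 (LRU): @D, Promoted

Priority, highest to lowest: L1 Replacement, Promoted, Off-chip, Demoted.

LRU-PEA compares @C (Off-chip) with @D (Promoted). @C belongs to the lower-priority category, so it is evicted: @D stays in the bank and the way that held @C becomes available.

** The set associativity assumed in this work for NUCA banks is 8-way.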
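The eviction rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `Category`, `PRIORITY`, and `choose_victim` are hypothetical, and sending ties to the LRU line is an assumption, since the slides do not say how equal categories are resolved.

```python
from enum import Enum


class Category(Enum):
    OFF_CHIP = "Off-chip"
    L1_REPLACEMENT = "L1 Replacement"
    PROMOTED = "Promoted"
    DEMOTED = "Demoted"


# Static priority per bank type, highest priority first (from the slides).
PRIORITY = {
    "local": [Category.L1_REPLACEMENT, Category.PROMOTED,
              Category.OFF_CHIP, Category.DEMOTED],
    "central": [Category.PROMOTED, Category.OFF_CHIP,
                Category.DEMOTED, Category.L1_REPLACEMENT],
}


def choose_victim(lru_stack, bank_type):
    """Return the index of the way to evict from one set.

    lru_stack is ordered MRU first, LRU last; each entry is a tuple
    (address, Category) and the set must hold at least two lines.
    Only the LRU and LRU-1 positions are compared.
    ASSUMPTION: when both hold the same category, the LRU line is evicted.
    """
    order = PRIORITY[bank_type]
    lru = len(lru_stack) - 1
    lru_1 = len(lru_stack) - 2

    def rank(index):
        # A larger rank means a lower priority.
        return order.index(lru_stack[index][1])

    return lru_1 if rank(lru_1) > rank(lru) else lru


if __name__ == "__main__":
    # The 4-way example from the slide.
    example = [
        ("@A", Category.PROMOTED),
        ("@B", Category.DEMOTED),
        ("@C", Category.OFF_CHIP),
        ("@D", Category.PROMOTED),
    ]
    victim = choose_victim(example, "local")
    print(example[victim][0])  # @C
```

Restricting the comparison to the last two positions keeps the policy close to plain LRU: a high-priority line still ages out once it reaches the LRU position behind another line of equal or higher priority.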

Data Target Policy

Migration movements provoke bank usage imbalance in the NUCA cache.

Replacements in the most accessed banks are unfair.

LRU-PEA globalizes replacement decisions to evict the most appropriate data from the NUCA cache.


Data Target Policy

Example (256 NUCA banks, 16 possible placements):

Current eviction: Off-chip data (P2) from a Central bank.

Step 1: vs. L1 Replacement (P1) in a Local bank.

Step 2: vs. Off-chip (P2) in a Central bank.

Step 3: vs. Demoted (P4) in a Local bank.

…

New current eviction: Demoted data (P4) from a Local bank.

Cascade mode
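The walk through candidate banks shown in the example can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: the function name `place_evicted` is hypothetical, priorities are plain integers (1 for P1, the highest, down to 4 for P4), the evicted data displaces a bank's candidate only when that candidate has strictly lower priority, and in cascade mode the displaced data simply continues through the remaining banks. The slides do not specify the search order, tie handling, or when the cascade stops.

```python
def place_evicted(evicted_priority, candidate_priorities, cascade=True):
    """Walk the candidate banks in order.

    evicted_priority: priority of the data being evicted (1 = P1, highest).
    candidate_priorities: priority of each visited bank's own eviction
        candidate, in visiting order.
    Returns (final_priority, steps), where final_priority is the priority
    of the data that ends up leaving at the end of the walk and steps is
    the number of banks visited.

    ASSUMPTIONS (not from the slides): strict comparison, and cascade
    continues with the displaced data through the remaining banks.
    """
    current = evicted_priority
    steps = 0
    for candidate in candidate_priorities:
        steps += 1
        if candidate > current:
            # The bank's candidate has lower priority: the current data
            # takes its place and the candidate becomes the new eviction.
            current = candidate
            if not cascade:
                break
    return current, steps


if __name__ == "__main__":
    # The example from the slide: P2 data compared against P1, P2, P4.
    print(place_evicted(2, [1, 2, 4]))  # (4, 3)
```

With cascade disabled the walk stops at the first successful displacement, which trades a possibly better global victim for fewer messages on the network.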

Outline

Introduction

Methodology

LRU-PEA

Results

Conclusions


Increasing network congestion

Steps      No Cascade   Cascade Enabled
                        Direct   Provoked
1 step     64           54       20
2 steps    12           7        7
3 steps    4            2        4
4 steps    3            2        4
5 steps    3            2        3
6 steps    2            1        4
7 steps    2            1        3
8 steps    2            1        4
9 steps    1            1        3
10 steps   1            1        4
11 steps   1            1        3
12 steps   1            1        6
13 steps   1            1        6
14 steps   1            1        30
15 steps   3            21       -

Values in percentage (%)


NUCA miss rate analysis


Performance analysis


Dynamic EPI analysis


Outline

Introduction

Methodology

LRU-PEA

Results

Conclusions


Conclusions

LRU-PEA is proposed as an alternative to the traditional LRU replacement policy in CMP-NUCA architectures.

It defines four novel NUCA data categories and prioritises them to find the most appropriate data to evict.

In a D-NUCA architecture, data movements provoke unfair replacements in the most accessed banks.

LRU-PEA globalizes replacement decisions taken in a single bank to the whole NUCA cache.

Compared to the traditional LRU policy, LRU-PEA reduces the miss rate, increases performance with parallel applications, and reduces the energy consumed per instruction.


LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors

Questions?
