ICCD 2009, Lake Tahoe, CA (USA) - October 6, 2009. LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors. Javier Lira ψ, Carlos Molina ψ,ф, Antonio González ψ,λ. ф Dept. Enginyeria Informàtica, Universitat Rovira i Virgili, Tarragona, Spain.
LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors
Javier Lira ψ
Carlos Molina ψ,ф
Antonio González ψ,λ
λ Intel Barcelona Research Center
Intel Labs - UPC
Barcelona, Spain
ф Dept. Enginyeria Informàtica
Universitat Rovira i Virgili
Tarragona, Spain
ψ Dept. Arquitectura de Computadors
Universitat Politècnica de Catalunya
Barcelona, Spain
ICCD 2009, Lake Tahoe, CA (USA) - October 6, 2009
Outline
Introduction
Methodology
LRU-PEA
Results
Conclusions
Introduction
CMPs have emerged as a dominant paradigm in system design.
1. Keep performance improvement while reducing power consumption.
2. Take advantage of Thread-level parallelism.
Commercial CMPs are currently available.
CMPs incorporate larger and shared last-level caches.
Wire delay is a key constraint.
NUCA
Non-Uniform Cache Architecture (NUCA) was first proposed at ASPLOS 2002 by Kim et al. [1].
NUCA divides a large cache into smaller, faster banks.
Banks close to the cache controller have lower latencies than banks farther away.
[1] C. Kim, D. Burger and S.W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS '02
NUCA Policies
Bank Placement Policy
Bank Access Policy
Bank Replacement Policy
Bank Migration Policy
Methodology
Simulation tools: Simics + GEMS, CACTI v6.0
PARSEC Benchmark Suite
Parameter               Value
Number of cores         8 (UltraSPARC IIIi)
Frequency               1.5 GHz
Main memory size        4 GBytes
Memory bandwidth        512 Bytes/cycle
L1 cache latency        3 cycles
NUCA bank latency       4 cycles
Router delay            1 cycle
On-chip wire delay      1 cycle
Main memory latency     250 cycles (from core)
Private L1 caches       8 x 32 KBytes, 2-way
Shared L2 NUCA cache    8 MBytes, 256 banks
NUCA bank               32 KBytes, 8-way
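The cache geometry in the table can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch: the variable names are made up for this example, and the 64-byte line size is an assumption, since the slides do not state it.

```python
# Consistency check of the NUCA geometry listed in the methodology table.
KB = 1024
MB = 1024 * KB

num_banks = 256          # from the table
bank_size = 32 * KB      # from the table
l2_size = 8 * MB         # from the table
bank_ways = 8            # from the table
line_size = 64           # ASSUMPTION: not given on the slides

# Total capacity obtained by adding up all banks.
total_capacity = num_banks * bank_size

# Sets per bank under the assumed line size.
sets_per_bank = bank_size // (bank_ways * line_size)

print(total_capacity == l2_size)  # the 256 banks add up to the 8 MBytes L2
print(sets_per_bank)
```

Under these numbers the 256 banks of 32 KBytes account exactly for the 8 MBytes of shared L2.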
Baseline NUCA cache architecture
CMP-DNUCA
8 cores
256 banks
Non-inclusive
[2] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. MICRO '04
Background
Entrance into the NUCA: off-chip memory, L1 cache replacements
Migration movements: promotion, demotion
Data categories
1. Off-chip
2. L1 cache replacements
3. Promoted data
4. Demoted data
LRU-PEA
LRU with Priority Eviction Approach.
Replacement policy for CMP-NUCA architectures.
Data Eviction Policy: Chooses data to evict from a NUCA bank.
Data Target Policy: Determines the destination bank of the evicted data. Globalizes replacement decisions to the whole NUCA.
LRU-PEA
Data Eviction Policy
Based on the LRU replacement policy.
Static prioritisation of NUCA data categories.
Lowest-category data is evicted from the NUCA bank.
PROBLEM: Highest-category data could monopolize the NUCA cache.
Category comparison is restricted to the LRU and the LRU-1 positions.
Priority of each data category, by bank type:
Priority       Local bank         Central bank
1 (highest)    L1 Replacements    Promoted
2              Promoted           Off-chip
3              Off-chip           Demoted
4 (lowest)     Demoted            L1 Replacements
Data Eviction Policy
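Example (NUCA bank, 4-way)**:
Position    0 (MRU)     1          2           3 (LRU)
Data        @A          @B         @C          @D
Category    Promoted    Demoted    Off-chip    Promoted
Priority (highest to lowest): L1 Replacement, Promoted, Off-chip, Demoted
LRU-PEA compares the LRU-1 entry (@C, Off-chip) with the LRU entry (@D, Promoted). @C has the lower category, so it is evicted and its way becomes available; @D stays in the bank.
** The set associativity assumed in this work for NUCA banks is 8-way.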
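The eviction choice in this example can be written down as a small sketch. It is not the authors' implementation: the names (`Category`, `PRIORITY`, `choose_victim`) are invented for illustration, the priority orders are taken from the Local/Central table in these slides, and breaking ties in favour of the plain LRU entry is an assumption.

```python
from enum import Enum
from typing import List, Tuple


class Category(Enum):
    L1_REPLACEMENT = "L1 replacement"
    PROMOTED = "promoted"
    OFF_CHIP = "off-chip"
    DEMOTED = "demoted"


# Highest priority first, per bank type (order taken from the slides).
PRIORITY = {
    "local": [Category.L1_REPLACEMENT, Category.PROMOTED,
              Category.OFF_CHIP, Category.DEMOTED],
    "central": [Category.PROMOTED, Category.OFF_CHIP,
                Category.DEMOTED, Category.L1_REPLACEMENT],
}


def rank(bank_kind: str, category: Category) -> int:
    """Return 1 for the highest-priority category, 4 for the lowest."""
    return PRIORITY[bank_kind].index(category) + 1


def choose_victim(lru_stack: List[Tuple[str, Category]], bank_kind: str) -> int:
    """Pick the way to evict from a set ordered MRU -> LRU.

    Only the LRU and LRU-1 positions are compared; the entry with the
    lower-priority category is evicted. ASSUMPTION: ties go to the LRU entry.
    """
    lru = len(lru_stack) - 1
    if lru == 0:
        return lru
    lru_1 = lru - 1
    if rank(bank_kind, lru_stack[lru_1][1]) > rank(bank_kind, lru_stack[lru][1]):
        return lru_1
    return lru


# The 4-way example from the slide, ordered MRU -> LRU.
example = [
    ("@A", Category.PROMOTED),
    ("@B", Category.DEMOTED),
    ("@C", Category.OFF_CHIP),
    ("@D", Category.PROMOTED),
]
victim = choose_victim(example, "local")
print(example[victim][0])  # address chosen for eviction
```

With the slide's data the sketch selects @C, whereas plain LRU would have evicted @D.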
Data Target Policy
Migration movements provoke bank usage imbalance in the NUCA cache.
Replacements in the most accessed banks are unfair.
LRU-PEA globalizes replacement decisions to evict the most appropriate data from the NUCA cache.
Data Target Policy
Example (256 NUCA Banks, 16 possible placements):
Current eviction: Off-chip (P2), Central bank
Step 1: vs. L1 Replac. (P1), Local bank
Step 2: vs. Off-chip (P2), Central bank
Step 3: vs. Demoted (P4), Local bank
…
Current eviction: Demoted (P4), Local bank
Cascade mode
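The walk over candidate banks in this example can be sketched as follows. This is a hedged illustration rather than the published algorithm: `Entry` and `data_target_walk` are hypothetical names, priorities are compared directly through the P numbers shown on the slide (1 = highest), and it is assumed that without cascade mode the search stops after the first displacement.

```python
from typing import List, Tuple

# (description, priority rank), where rank 1 is the highest priority.
Entry = Tuple[str, int]


def data_target_walk(evicted: Entry, candidates: List[Entry],
                     cascade: bool = True) -> Tuple[Entry, List[int]]:
    """Compare the evicted data with the eviction candidate of each other bank.

    Returns the data that finally leaves the NUCA and the 1-based steps at
    which a displacement happened. When a candidate has lower priority than
    the current eviction, the current eviction takes its place and the
    candidate becomes the new current eviction.
    """
    current = evicted
    displaced_at: List[int] = []
    for step, candidate in enumerate(candidates, start=1):
        if candidate[1] > current[1]:  # candidate has lower priority
            displaced_at.append(step)
            current = candidate
            if not cascade:
                break  # ASSUMPTION: no further search without cascade mode
    return current, displaced_at


# The three steps shown on the slide.
start = ("Off-chip, central bank", 2)
steps = [
    ("L1 replacement, local bank", 1),
    ("Off-chip, central bank", 2),
    ("Demoted, local bank", 4),
]
final, displaced_at = data_target_walk(start, steps)
print(final, displaced_at)
```

For the slide's three steps, the first two candidates are not displaced (higher and equal priority), and the Demoted (P4) data at step 3 becomes the current eviction. With 16 possible placements, such a walk would visit at most 15 other banks.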
Increasing network congestion
Steps    No Cascade    Cascade Enabled
                       Direct    Provoked
1        64            54        20
2        12            7         7
3        4             2         4
4        3             2         4
5        3             2         3
6        2             1         4
7        2             1         3
8        2             1         4
9        1             1         3
10       1             1         4
11       1             1         3
12       1             1         6
13       1             1         6
14       1             1         30
15       3             21        -
Values in percentage (%)
NUCA miss rate analysis
Performance analysis
Dynamic EPI analysis
Conclusions
LRU-PEA is proposed as an alternative to the traditional LRU replacement policy in CMP-NUCA architectures.
Defines four novel NUCA categories and prioritises them to find the most appropriate data to evict.
In a D-NUCA architecture, data movements provoke unfair replacements in the most accessed banks.
LRU-PEA globalizes replacement decisions taken in a single bank to the whole NUCA cache.
LRU-PEA reduces the miss rate, increases performance with parallel applications, and reduces the energy consumed per instruction compared to the traditional LRU policy.
Questions?