Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring

Lei Jin and Sangyeun Cho

Dept. of Computer Science, University of Pittsburgh

CMPMSI’07 02/11/07

Multicore distributed L2 caches

L2 caches typically sub-banked and distributed
• IBM Power4/5: 3 banks
• Sun Microsystems T1: 4 banks
• Intel Itanium2 (L3): many “sub-arrays”

(Distributed L2 caches + switched NoC) → NUCA

Hardware-based management schemes
• Private caching
• Shared caching
• Hybrid caching

[Figure: a tile consisting of a processor core, a local L2 cache, and a router]

Private and shared caching

Private caching:

short hit latency (always local)

high on-chip miss rate

long miss resolution time

complex coherence enforcement

Shared caching:

low on-chip miss rate

straightforward data location

simple coherence (no replication)

long average hit latency

Other approaches

Hybrid/flexible schemes
• “Core clustering” [Speight et al., ISCA2005]
• “Flexible CMP cache sharing” [Huh et al., ICS2004]
• “Flexible bank mapping” [Liu et al., HPCA2004]

Improving shared caching
• “Victim replication” [Zhang and Asanovic, ISCA2005]

Improving private caching
• “Cooperative caching” [Chang and Sohi, ISCA2006]
• “CMP-NuRAPID” [Chishti et al., ISCA2005]

Motivation

Miss rate

Hit latency

What is the optimal balance between miss rate and hit latency?

Talk roadmap

Data mapping, a key property [Cho and Jin, MICRO2006]

Two-dimensional (2D) page coloring algorithm

Evaluation and results

Conclusion and future work

Data mapping

Data mapping
• Memory data location in L2 cache

Private caching
• Data mapping determined by program location
• Mapping created at miss time
• No explicit control

Shared caching
• Data mapping determined by address

slice number = (block address) % (N slice)

• Mapping is static
• No explicit control

Change mapping granularity

Block granularity

slice number = (block address) % (N slice)

Page granularity

slice number = (page address) % (N slice)
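The two mapping formulas can be compared with a small sketch. The block size, page size, and helper names below are illustrative assumptions, not values from the slides; only the 16 slices follow the 16-tile configuration used later in the talk. With block granularity, the blocks of one page spread over every slice; with page granularity, the whole page lands on a single slice, which is what lets the OS steer placement by choosing the physical page.

```python
# Illustrative parameters (assumptions, not taken from the slides).
BLOCK_SIZE = 64    # bytes per cache block
PAGE_SIZE = 4096   # bytes per page
N_SLICE = 16       # number of L2 slices


def slice_by_block(addr: int) -> int:
    """Block granularity: slice number = (block address) % N_SLICE."""
    return (addr // BLOCK_SIZE) % N_SLICE


def slice_by_page(addr: int) -> int:
    """Page granularity: slice number = (page address) % N_SLICE."""
    return (addr // PAGE_SIZE) % N_SLICE


# Walk every block of physical page 5 and record which slices it touches.
base = 5 * PAGE_SIZE
block_addrs = range(base, base + PAGE_SIZE, BLOCK_SIZE)
block_slices = sorted({slice_by_block(a) for a in block_addrs})
page_slices = sorted({slice_by_page(a) for a in block_addrs})

print("block granularity:", block_slices)
print("page granularity: ", page_slices)
```

Under these assumed sizes, the page's 64 blocks cover all 16 slices in the first case and exactly one slice in the second.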

OS controlled page mapping

[Figure: OS page allocation maps the memory pages of Program 1 and Program 2 from the virtual address space to the physical address space]

2D page coloring: the problem

[Figure: pages annotated with (# access, # miss) counts: (500, 30), (500, 10), (500, 12)]

Network latency / hop = 3 cycles

Memory latency = 300 cycles

Cost(color #) = (# access x # hop x 3 cycles) + (# miss x 300 cycles)
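As a worked instance of the cost formula, the sketch below plugs in the (# access, # miss) pairs shown on the slide. It assumes the three pairs describe three candidate colors for the same page, and the hop counts are hypothetical, since the slide text does not give them.

```python
HOP_LATENCY = 3    # cycles per network hop (from the slide)
MEM_LATENCY = 300  # cycles per memory access (from the slide)


def cost(n_access: int, n_miss: int, n_hop: int) -> int:
    """Cost(color #) = (# access x # hop x 3 cycles) + (# miss x 300 cycles)."""
    return n_access * n_hop * HOP_LATENCY + n_miss * MEM_LATENCY


# color -> (# access, # miss, # hop); the hop counts are made up for illustration.
candidates = {
    0: (500, 30, 0),
    1: (500, 10, 2),
    2: (500, 12, 1),
}

costs = {color: cost(*counts) for color, counts in candidates.items()}
best = min(costs, key=costs.get)

print(costs)
print("cheapest color:", best)
```

With these assumed hop counts, the closest color is not the cheapest: its extra misses outweigh the saved network hops, which is the trade-off the slide poses.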

2D coloring algorithm

Collect L2 reference trace

Derive conflict information [Sherwood et al., ICS1999]

Reference 1: Page A
Reference 2: Page B
Reference 3: Page B
Reference 4: Page C

2D coloring algorithm (cont’d)

Derive conflict information

Reference 1: Page A

Reference Matrix
A 0 0 0
B 0 0 0
C 0 0 0

Conflict Matrix
A 0 0 0
B 0 0 0
C 0 0 0

Reference 1: Page A

Reference Matrix
A 0 0 0
B 1 0 0
C 1 0 0

Conflict Matrix
A 0 0 0
B 0 0 0
C 0 0 0

Reference 1: Page A
Reference 2: Page B

Reference Matrix
A 0 0 0
B 1 0 0
C 1 0 0

Conflict Matrix
A 0 0 0
B 0 0 0
C 0 0 0

Reference 1: Page A
Reference 2: Page B

Reference Matrix
A 0 1 0
B 1 0 0
C 1 1 0

Conflict Matrix
A 0 0 0
B 0 0 0
C 0 0 0

Reference 1: Page A
Reference 2: Page B
Reference 3: Page B

Reference Matrix
A 0 1 0
B 0 0 0
C 1 1 0

Conflict Matrix
A 0 0 0
B 1 0 0
C 0 0 0

Reference 1: Page A
Reference 2: Page B
Reference 3: Page B
Reference 4: Page C

Reference Matrix
A 0 1 0
B 0 0 0
C 1 1 0

Conflict Matrix
A 0 0 0
B 1 0 0
C 0 0 0

Reference 1: Page A
Reference 2: Page B
Reference 3: Page B
Reference 4: Page C

Reference Matrix
A 0 1 1
B 0 0 1
C 1 1 0

Conflict Matrix
A 0 0 0
B 1 0 0
C 0 0 0

2D Page coloring

Reference 1: Page A
Reference 2: Page B
Reference 3: Page B
Reference 4: Page C

Reference Matrix
A 0 1 1
B 0 0 1
C 0 0 0

Conflict Matrix
A 0 0 0
B 1 0 0
C 1 1 0

Access Counter
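The matrix updates walked through on the preceding slides can be sketched as follows. This is a reading of the slide states, not the authors' code: it assumes the matrix columns are ordered A, B, C like the rows, and that the access counter simply counts references per page. On each reference to a page, that page's row of the reference matrix is added to its row of the conflict matrix and then cleared, and the page's column is set to 1 in every other row.

```python
def derive_conflicts(trace, pages):
    """Replay a page reference trace and build the reference matrix,
    the conflict matrix, and a per-page access counter."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    reference = [[0] * n for _ in range(n)]
    conflict = [[0] * n for _ in range(n)]
    access = [0] * n

    for page in trace:
        p = index[page]
        access[p] += 1
        # Pages marked in row p were referenced since the last reference
        # to `page`: count them as conflicts, then clear the row.
        for q in range(n):
            conflict[p][q] += reference[p][q]
            reference[p][q] = 0
        # Mark `page` as referenced in the row of every other page.
        for q in range(n):
            if q != p:
                reference[q][p] = 1

    return reference, conflict, access


reference, conflict, access = derive_conflicts(["A", "B", "B", "C"], ["A", "B", "C"])

for name, matrix in (("Reference Matrix", reference), ("Conflict Matrix", conflict)):
    print(name)
    for label, row in zip("ABC", matrix):
        print(label, *row)
print("Access Counter", access)
```

Replaying the four-reference trace A, B, B, C ends in the same reference and conflict matrices as the final slide of the walkthrough.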

2D Page coloring

Conflict Matrix
A 0 0 0
B 1 0 0
C 1 1 0

Access Counter

Cost(color, page#) = (α x #Conflict(color) x mem latency) + ((1-α) x #Access x #hop(color) x hop delay)

Optimal color(page#) = {C | Cost(C) = MIN[Cost(color, page#)] for all colors}
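A minimal sketch of the color selection, assuming the cost weights the conflict term by α and the network term by (1-α). The latency constants reuse the 300-cycle memory latency and 3-cycle hop delay from the earlier slide; the conflict and hop numbers, the color names, and the function names are hypothetical. How #Conflict(color) is obtained from the conflict matrix is not spelled out on the slide, so it is passed in as an input here.

```python
MEM_LATENCY = 300  # cycles, from the earlier problem slide
HOP_DELAY = 3      # cycles per hop, from the earlier problem slide


def color_cost(alpha, n_conflict, n_access, n_hop):
    """Weighted cost of placing a page in one color."""
    miss_term = alpha * n_conflict * MEM_LATENCY
    hop_term = (1 - alpha) * n_access * n_hop * HOP_DELAY
    return miss_term + hop_term


def optimal_color(alpha, n_access, conflicts, hops):
    """Return the color with the minimum cost over all candidate colors."""
    costs = {c: color_cost(alpha, conflicts[c], n_access, hops[c]) for c in conflicts}
    return min(costs, key=costs.get)


# Hypothetical page: 500 accesses, two candidate colors.
conflicts = {"local": 64, "remote": 0}  # #Conflict(color), made-up values
hops = {"local": 0, "remote": 2}        # #hop(color), made-up values

best_small_alpha = optimal_color(1 / 64, 500, conflicts, hops)
best_large_alpha = optimal_color(1 / 2, 500, conflicts, hops)

print("alpha = 1/64:", best_small_alpha)
print("alpha = 1/2: ", best_large_alpha)
```

In this made-up case a small α favors the nearby color despite its conflicts, while a large α pushes the page to the conflict-free remote color, so α acts as the knob between hit latency and miss rate.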

Experimental setup

Experiments were carried out using a simulator derived from the SimpleScalar toolset.

The simulator models a 16-core tile-based CMP.

Each core has private 32KB I/D L1 caches and a 256KB slice of the globally shared L2 cache (4MB total).

[Figure: experiment flow with three stages (Profiling, 2D coloring, Timing simulation); other labels: Trace, Page mapping, Tuning α]

Optimal page mapping

[Figure: optimal page mapping for α = 1/64 and for α = 1/256]

Access distribution

[Figure: access distribution for α from 1/32 to 1/2048; legend: miss, hit, remote, local]

Relative performance

[Figure: relative performance of private, shared, line coloring, and 2D coloring]

Value of α

[Figure: axis labels: log(α), base 1/2]

Conclusions

With careful data placement, there is substantial room for performance improvement.

Dynamic mapping schemes, assisted by information from hardware, may achieve similar performance improvement.

This method can also be applied to other optimization targets.

Current and future work

Dynamic mapping schemes
• Performance
• Power

Multiprogrammed and parallel workloads

Thank you & Questions?

Private caching

1. L1 miss
2. L2 access
• Hit
• Miss
3. Access directory
• A copy on chip
• Global miss

[Figure: L1 miss followed by a local L2 access]

short hit latency (always local)

high on-chip miss rate

long miss resolution time

complex coherence enforcement

Shared caching

1. L1 miss
2. L2 access
• Hit
• Miss

[Figure: L1 miss]

low on-chip miss rate

straightforward data location

simple coherence (no replication)

long average hit latency

Performance

[Figure: performance of line placement and 2D coloring on ammp, art, crafty, gap, gcc, gzip, mcf, mgrid, twolf, vortex, and wupwise; labeled values: 141%, 150%]
