Cache Coherence for GPU Architectures


Inderpreet Singh¹, Arrvindh Shriraman², Wilson Fung¹, Mike O’Connor³, Tor Aamodt¹


¹ University of British Columbia
² Simon Fraser University
³ AMD Research


What is a GPU?

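[Figure: execution timeline in which the CPU spawns work on the GPU, the GPU signals done, and the CPU spawns again. The GPU consists of GPU cores, each with an L1 data cache (L1D), connected through an interconnect to L2 banks. Work is organized into workgroups and wavefronts.]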


Evolution of GPUs

• Graphics pipeline (OpenGL/DirectX)
  • Vertex shader
  • Pixel shader

• Compute (OpenCL, CUDA)
  • e.g. Matrix Multiplication


Evolution of GPUs

• Future: coherent memory space
  • Efficient critical sections
  • Load balancing
  • Stencil computation

[Figure: workgroups that each lock a shared structure … computation … unlock.]


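GPU Coherence Challenges

• Challenge 1: Coherence traffic

[Figure: interconnect traffic for applications that do not require coherence, comparing no coherence, MESI and GPU-VI. Axis ticks run from 0.5 to 1.5; the labels 2.2 and 1.3 mark individual bars.]

[Figure: recalls. Cores C1 to C4 each hold blocks A and B in their L1D, and the L2/directory tracks A and B. When C1 issues Load C, the directory sends recalls for A (rcl A) to the L1 caches and collects their acks before C1 gets C. Each core issues a long stream of loads (Load C, D, E, F…; Load G, H, I, J…; Load K, L, M, N…; Load O, P, Q, R…).]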


• Challenge 2: Tracking in-flight requests
  • Significant % of L2

[Figure: L2/directory with MSHRs. A block moving from S (Shared) to M (Modified) passes through the transient state S_M.]


GPU Coherence Challenges

• Challenge 3: Complexity

[Figure: state transition tables (states by events) for non-coherent L1, non-coherent L2, MESI L1 and MESI L2.]



All three challenges result from introducing coherence messages on a GPU

1. Traffic: transferring
2. Storage: tracking
3. Complexity: managing

GPU cache coherence without coherence messages?

• YES – using global time


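Temporal Coherence (TC)

• Global time
• L1 copy: local timestamp > global time means VALID
• L2 block: global timestamp < global time means NO L1 COPIES

[Figure: Core 1 and Core 2, each with an L1D, connected through the interconnect to an L2 bank. Block A (A=0) appears in an L1D with a local timestamp and in the L2 bank with a global timestamp.]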
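The two timestamp comparisons on this slide can be written down directly. The following is a minimal sketch, not the authors' implementation; the function names and the use of integer time are assumptions made for illustration.

```python
def l1_copy_valid(local_timestamp: int, global_time: int) -> bool:
    """An L1 copy may be used only while its local timestamp is ahead of global time."""
    return local_timestamp > global_time


def l2_has_no_l1_copies(global_timestamp: int, global_time: int) -> bool:
    """The L2 can conclude that no L1 holds a valid copy once its
    global timestamp for the block is behind global time."""
    return global_timestamp < global_time
```

With the strict comparisons taken from the slide, a block whose timestamp equals the current time is treated as neither valid in the L1 nor free of copies at the L2, which errs on the safe side in this sketch.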


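[Figure: TC example with Core 1, Core 2, the interconnect and an L2 bank at times T=0, T=11 and T=15. Core 1 issues Load A and caches A=0 with timestamp 10; the L2 bank records the same timestamp. Core 2 issues Store A=1, and the L2 bank then holds A=1. No coherence messages are exchanged.]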



What lifetime values should be requested on loads?

• Use a predictor to predict lifetime values

What about stores to unexpired blocks?

• Stall them at the L2?
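The stalling option can be sketched as a small L2 model. This is an illustration under stated assumptions, not the protocol from the paper: the names `L2Block`, `load` and `store_stalling` are hypothetical, time is an integer, and the lifetime is passed in by the caller rather than predicted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class L2Block:
    value: int = 0
    global_timestamp: int = 0  # latest expiry time handed out to any L1


def load(block: L2Block, now: int, lifetime: int) -> Tuple[int, int]:
    """Serve a load: return (value, expiry) and remember the latest expiry."""
    expiry = now + lifetime
    block.global_timestamp = max(block.global_timestamp, expiry)
    return block.value, expiry


def store_stalling(block: L2Block, now: int, value: int) -> int:
    """Apply a store only after every L1 copy has expired; return the completion time."""
    # Assumption: with integer time, the first safe tick is global_timestamp + 1.
    done = max(now, block.global_timestamp + 1)
    block.value = value
    return done
```

For example, a load at T=0 with lifetime 10 followed by a store arriving at T=5 completes at T=11 in this model: the store waits at the L2 for six ticks, which is the cost the stalling question raises.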


TC Stalling Issues

Stall?

Problem #1: Sensitive to mispredictions
Problem #2: Impedes other accesses
Problem #3: Hurts existing GPU applications

Solution: TC-Weak


TC-Weak

• Stores return Global Write Completion Time (GWCT)

[Figure: GPU Core 1 and GPU Core 2, each with an L1D and a GWCT table (entries W0, W1), connected through the interconnect to an L2 bank. Core 1 runs: 1. data=NEW, 2. FENCE, 3. flag=SET. Core 2's L1D holds data=OLD with timestamp 30 and flag=NULL. Core 1's store of data=NEW records 30 in its GWCT table; the store of flag=SET follows the fence. Times shown: T=0, T=1, T=31.]

No stalling at L2
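The GWCT mechanism can be sketched as follows. This is a simplified model under assumptions, not the authors' implementation: the class names are hypothetical, time is an integer, each wavefront keeps a single GWCT value, and a fence is assumed to complete once global time has passed that value.

```python
class L2BankWeak:
    """L2 bank that never stalls stores; it reports when they become globally visible."""

    def __init__(self):
        self.values = {}
        self.timestamps = {}  # address -> latest expiry handed to any L1

    def load(self, addr, now, lifetime):
        expiry = now + lifetime
        self.timestamps[addr] = max(self.timestamps.get(addr, 0), expiry)
        return self.values.get(addr), expiry

    def store(self, addr, value, now):
        self.values[addr] = value  # applied immediately, no stall at the L2
        # GWCT: the time by which every stale L1 copy will have expired
        return max(now, self.timestamps.get(addr, 0))


class Wavefront:
    def __init__(self):
        self.gwct = 0  # this wavefront's GWCT table entry

    def store(self, l2, addr, value, now):
        # Keep the latest GWCT over all stores issued so far
        self.gwct = max(self.gwct, l2.store(addr, value, now))

    def fence_done_time(self, now):
        # Assumption: the fence completes at the first tick after the GWCT
        return max(now, self.gwct + 1)
```

Replaying the slide's example in this model: another core loads data at T=0 with lifetime 30, the store of data=NEW at T=1 returns a GWCT of 30, and a fence issued at T=2 completes at T=31. The wait moves from the L2 bank to the fence in the issuing wavefront.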


TC-Weak

Stalling vs. TC-Weak, compared on:

• Misprediction sensitivity
• Doesn’t impede other accesses
• Good for existing GPU applications


Methodology

• GPGPU-Sim v3.1.2 for GPU core model
• GEMS Ruby v2.1.1 for memory system
• All protocols written in SLICC
• Model a generic NVIDIA Fermi-based GPU (see paper for details)
• Applications:
  • 6 do not require coherence
  • 6 require coherence
    • Barnes Hut
    • Cloth Physics
    • Versatile Place and Route
    • Max-Flow Min-Cut
    • 3D Wave Equation Solver
    • Octree Partitioning
  • Coherence is used for locks, stencil communication and load balancing


Interconnect Traffic

• Reduces traffic by 53% over MESI and 23% over GPU-VI for intra-workgroup applications

• Lower traffic than 16x-sized 32-way directory

[Figure: interconnect traffic for applications that do not require coherence under NO-COH, MESI, GPU-VI and TC-Weak. Axis ticks run from 0.00 to 1.50; one bar is labeled 2.3.]


Performance

• TC-Weak with simple predictor performs 85% better than disabling L1 caches

• Performs 28% better than TC with stalling

• Larger directory sizes do not improve performance

[Figure: speedup for applications that require coherence under NO-L1, MESI, GPU-VI and TC-Weak. Axis ticks run from 0.0 to 2.0.]


Complexity

[Figure: state transition tables for non-coherent L1, non-coherent L2, MESI L1, MESI L2, TC-Weak L1 and TC-Weak L2.]


Summary

• First work to characterize GPU coherence challenges

• Save traffic and energy by using global time

• Reduce protocol complexity

• 85% performance improvement over no coherence (L1 caches disabled)

Questions?


Backup Slides


Lifetime Predictor

• One prediction value per L2 bank

• Events local to L2 bank update prediction value

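Events and their effect on the prediction:

1. Expired load: ↑
2. Unexpired store: ↓
3. Unexpired eviction: ↓

[Figure: L2 bank with its prediction value at T = 0 and T = 20. Load A, with block A at timestamp 10, triggers prediction++. Store A, with block A at timestamp 30, triggers prediction--.]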
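The three update rules can be sketched as a per-bank counter. This is a minimal illustration, not the predictor from the paper: the class name, the initial value, the step size and the lower bound are all assumptions.

```python
class LifetimePredictor:
    """One prediction value per L2 bank, nudged by events local to that bank."""

    def __init__(self, initial: int = 10, step: int = 1, minimum: int = 0):
        # initial, step and minimum are illustrative assumptions
        self.prediction = initial
        self.step = step
        self.minimum = minimum

    def on_expired_load(self) -> None:
        # A load found its block already expired: lifetimes were too short
        self.prediction += self.step

    def _decrease(self) -> None:
        self.prediction = max(self.minimum, self.prediction - self.step)

    def on_unexpired_store(self) -> None:
        # A store hit a block that is still live in some L1: lifetimes were too long
        self._decrease()

    def on_unexpired_eviction(self) -> None:
        # A block was evicted from the L2 while still live in some L1
        self._decrease()
```

A bank would call one of the three `on_…` methods as the corresponding event occurs and read `prediction` when it assigns a lifetime to the next load.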


TC-Strong vs TC-Weak

[Figure: speedup over all applications in two panels, "Fixed lifetime for all applications" and "Best lifetime for each application". Configurations: TCSUO, TCS, TCSOO, TCW, and TCW w/ predictor.]


Interconnect Power and Energy

[Figure: normalized interconnect energy and normalized interconnect power, each broken down into Link (Dynamic), Router (Dynamic), Link (Static) and Router (Static). Inter-workgroup applications: NO-L1, MESI, GPU-VI, GPU-VIni, TCW. Intra-workgroup applications: NO-COH, MESI, GPU-VI, GPU-VIni, TCW. Axis ticks run from 0.0 to 1.6.]
