Upload
laird
View
60
Download
1
Embed Size (px)
DESCRIPTION
Cache Coherence for GPU Architectures. Inderpreet Singh 1 , Arrvindh Shriraman 2 , Wilson Fung 1 , Mike O’Connor 3 , Tor Aamodt 1. 1 University of British Columbia 2 Simon Fraser University 3 AMD Research. Image source: www.forces.gc.ca. What is a GPU?. Workgroups. CPU. Wavefronts. - PowerPoint PPT Presentation
Citation preview
Cache Coherence for GPU Architectures
Inderpreet Singh1, Arrvindh Shriraman2, Wilson Fung1, Mike O’Connor3, Tor Aamodt1
Image source: www.forces.gc.ca
1 University of British Columbia2 Simon Fraser University
3 AMD Research
Inderpreet Singh Cache Coherence for GPU Architectures 2
What is a GPU?
GPU
CPUspawn
doneCPU
CPU
GPU
spawn
time
GPU Core
L1D ▪▪▪
Interconnect
▪▪▪
L2 Bank
GPU Core
L1D
WorkgroupsWavefronts
Inderpreet Singh Cache Coherence for GPU Architectures 3
Evolution of GPUs
• Graphics pipeline
• Compute (OpenCL, CUDA)• e.g. Matrix Multiplication
VertexShader
PixelShaderOpenGL/
DirectX
Inderpreet Singh Cache Coherence for GPU Architectures 4
Evolution of GPUs
• Future: coherent memory space• Efficient critical sections• Load balancing
Stencil computation
Workgroups
lock shared structure…computation…
unlock
Inderpreet Singh Cache Coherence for GPU Architectures 5
C4
L1DA B
C3
L1DA B
C2
L1DA B
GPU Coherence Challenges
• Challenge 1: Coherence traffic
Do not requirecoherence
No coherence MESI
GPU-VI
0.5
1.0
1.5
2.2
Inte
rcon
nect
traf
fic 1.3 RecallsC1
L1DA B
Load C
gets C
rcl A rcl A rcl A
rcl Aack
ack ackack
Load CLoad DLoad ELoad F…
Load GLoad HLoad ILoad J…
Load KLoad LLoad MLoad N…
Load OLoad PLoad QLoad R…
A BL2/Directory
Inderpreet Singh Cache Coherence for GPU Architectures 6
L2 / Directory
MSHR
GPU Coherence Challenges
• Challenge 2: Tracking in-flight requests• Significant % of L2
SShared
MModified
S_M
Inderpreet Singh Cache Coherence for GPU Architectures 7
GPU Coherence Challenges• Challenge 3: Complexity
Non-coherent L1
Non-coherent L2
MESI L1 States
MESI L2 States
States
Events
Inderpreet Singh Cache Coherence for GPU Architectures 8
GPU Coherence Challenges
All three challenges result from introducing coherence messages on a GPU
1. Traffic: transferring2. Storage: tracking3. Complexity: managing
GPU cache coherence without coherence messages?
• YES – using global time
Inderpreet Singh Cache Coherence for GPU Architectures 9
Core 1
L1D ▪▪▪
Temporal Coherence (TC)
• Global time
Interconnect
▪▪▪
L2 Bank
A=00
A=00
Global Timestamp
< Global Time NO L1
COPIES
Core 2
L1D
Local Timestamp
> Global Time VALID
Inderpreet Singh Cache Coherence for GPU Architectures 10
T=0T=11T=15
Core 1
L1D
Interconnect
L2 Bank
Core 2
L1D
Temporal Coherence (TC)
▪▪▪A=00
Load A
T=10
A=010 A=010
A=010
Stor
e A=
1
A=1
A=010No coherence messages
Inderpreet Singh Cache Coherence for GPU Architectures 11
Temporal Coherence (TC)
What lifetime values should be requested on loads?
• Use a predictor to predict lifetime values
What about stores to unexpired blocks?
• Stall them at the L2?
Inderpreet Singh Cache Coherence for GPU Architectures 12
TC Stalling Issues
Stall?
Problem #1: Sensitive to mispredictionsProblem #2: Impedes other accessesProblem #3: Hurts existing GPU applications
Solution: TC-Weak
Inderpreet Singh Cache Coherence for GPU Architectures 13
L2 Bank
47
T=1T=31
TC-Weak
• Stores return Global Write Completion Time (GWCT)
GPU Core 2
L1D
Interconnect
GWCT Table W0: W1:
data=OLD30
30 data=OLDflag=NULL
GPU Core 1
L1DGWCT Table
W0: W1:
1 data=NEW2 FENCE3 flag=SET
Store
data=NEWStore
flag=SET
1 data=NEW2 FENCE3 flag=SET
30
1 data=NEW2 FENCE3 flag=SET
1 data=NEW2 FENCE3 flag=SET
data=NEWflag=SET
data=OLD30
T=0
47
No stalling at L2
Inderpreet Singh Cache Coherence for GPU Architectures 14
TC-Weak
Stalling TC-Weak
Misprediction sensitivity
Doesn’t impedes other accesses
Good for existing GPU applications
Inderpreet Singh Cache Coherence for GPU Architectures 15
Methodology
• GPGPU-Sim v3.1.2 for GPU core model• GEMS Ruby v2.1.1 for memory system• All protocols written in SLICC• Model a generic NVIDIA Fermi-based GPU (see paper for details)• Applications:
• 6 do not require coherence• 6 require coherence
• Barnes Hut• Cloth Physics• Versatile Place and Route• Max-Flow Min-Cut• 3D Wave Equation Solver• Octree Partitioning
Locks
Stencil communication
Load balancing
Inderpreet Singh Cache Coherence for GPU Architectures 16
0.00
0.25
0.50
0.75
1.00
1.25
1.50 2.3
Interconnect Traffic
• Reduces traffic by 53% over MESI and 23% over GPU-VI for intra-workgroup applications
• Lower traffic than 16x-sized 32-way directory
Inte
rcon
nect
Tra
ffic
NO-COHMESI GPU-VI TC-Weak
Do not require coherence
Inderpreet Singh Cache Coherence for GPU Architectures 17
Performance
• TC-Weak with simple predictor performs 85% better than disabling L1 caches
• Performs 28% better than TC with stalling
• Larger directory sizes do not improve performance
MESI GPU-VI TC-Weak
0.0
0.5
1.0
1.5
2.0
Require coherence
NO-L1
Spee
dup
Inderpreet Singh Cache Coherence for GPU Architectures 18
ComplexityNon-Coherent L1
Non-Coherent L2
MESI L1 States
MESI L2 StatesTC-Weak L1
TC-Weak L2
Inderpreet Singh Cache Coherence for GPU Architectures 19
Summary
• First work to characterize GPU coherence challenges
• Save traffic and energy by using global time
• Reduce protocol complexity
• 85% performance improvement over no coherence
Questions?
Inderpreet Singh Cache Coherence for GPU Architectures 20
Backup Slides
Inderpreet Singh Cache Coherence for GPU Architectures 21
Lifetime Predictor
• One prediction value per L2 bank
• Events local to L2 bank update prediction value
L2 BankT = 0
Prediction Value
Load A
A10
Events Prediction
1. Expired load: ↑
2. Unexpired store: ↓
3. Unexpired eviction: ↓prediction++
T = 20
Store A
A30prediction--
Inderpreet Singh Cache Coherence for GPU Architectures 22
TC-Strong vs TC-Weak
Fixed lifetime for all applications
0.6
0.8
1.0
1.2
1.4
All applications
Spee
dup
0.6
0.8
1.0
1.2
All applicationsSp
eedu
p
TCSUO TCS TCSOO
TCW TCW w/ predictor
Best lifetime for each application
Inderpreet Singh Cache Coherence for GPU Architectures 23
Interconnect Power and Energy
NO
-L1
MES
IG
PU-V
IG
PU-V
ini
TCW
NO
-CO
HM
ESI
GPU
-VI
GPU
-Vin
iTC
W
Inter-workgroup
Intra-workgroup
0.0
0.4
0.8
1.2
1.6
Link (Dynamic) Router (Dynamic) Link (Static) Router (Static)
Nor
mal
ized
Ene
rgy
NO
-L1
MES
IG
PU-V
IG
PU-V
ini
TCW
NO
-CO
HM
ESI
GPU
-VI
GPU
-Vin
iTC
WInter-
workgroupIntra-
workgroup
0.0
0.4
0.8
1.2
1.6
Nor
mal
ized
Pow
er