Diamonds are a Memory Controller’s Best Friend*. Dennis Abts (Google), Natalie Enright Jerger (University of Toronto), John Kim (KAIST), Dan Gibson (Univ of Wisconsin), Mikko Lipasti (Univ of Wisconsin).
Diamonds are a Memory Controller’s Best Friend*
*Also known as: Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs, from ISCA ’09. Those responsible for the original title have been sacked.
Dennis Abts, Google
Natalie Enright Jerger, University of Toronto
John Kim, KAIST
Dan Gibson, Univ of Wisconsin
Mikko Lipasti, Univ of Wisconsin
Executive Summary
• On what tiles should memory controllers reside?
– Three-tiered simulation approach
  • Heuristic-guided search
  • Detailed network simulation
  • Full-system simulation
• Diamond MC placement works well for on-chip meshes and tori
– Diamonds minimize maximum channel load
– Diamonds deliver lower and more predictable runtimes
Background
• Diverse on-chip communication
– Cache-to-cache
– LD/ST to Memory
– Off-chip traffic (e.g., I/O)
• Processors/chip on the rise
– Pins available for memory not rising as fast: Memory bandwidth becomes more precious
– Reality: Many Cores, Few Memory Controllers
• Tiled architectures gaining popularity
– Commonly employ on-chip meshes or tori
The Problem
• What Memory Controller placement is best overall?
– Flip-chip packaging allows flexible escape routes
– n tiles and m ports:
  • Don’t worry, there are only $\binom{n}{m}$ configurations!
– What are the characteristics of the best configuration?
  • Performance: Low runtime for a set of objective workloads
  • Throughput: Low latency as a function of offered load
  • Fairness: Similar (low) average memory latency across all nodes
  • Predictability: Low latency and runtime variance
Slight Simplification: Assume $n = k^2$ and $m = 2k$
Baseline Placement: row0_7
• Ports to MCs located at top and bottom of chip
• Conceptually similar to real parts:
– Tilera’s Tile64
  • 64 cores, 4 MCs (4 ports each, top/bottom of chip)
– Intel TeraFLOPs
  • 80 cores, 2 MCs (8 ports each, top/bottom of chip)
X-Dimension Traffic Encounters Congestion on Rows with Memory Controllers
Three-Tiered Approach
[Diagram: Link Contention Simulation, Detailed Network Simulation, Full System; arrows labeled "More Runs, Shorter Runtimes" and "More Detail"]
Tier 0.5: Exhaustive Search
• It turns out $\binom{k^2}{2k}$ is tractable for k<7
– (At least on the link contention simulator – only 3,268,760 possibilities for k=5)
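The size of the search space can be checked directly. This is a minimal sketch, assuming the simplification n = k^2 tiles and m = 2k ports stated earlier; the function name `num_placements` is illustrative and does not come from the slides.

```python
from math import comb

def num_placements(k: int) -> int:
    """Number of ways to choose 2k memory-controller tiles out of k*k tiles."""
    return comb(k * k, 2 * k)

# Print the search-space size for a few mesh radices.
for k in range(4, 9):
    print(k, num_placements(k))
```

For k=5 this reproduces the 3,268,760 possibilities quoted above, and the count grows quickly as k increases.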
Patterns Emerge!
Another Contender
Tier 1: Heuristic-Guided Search
• k>6: Intractable to search all configurations
– Use search heuristics and random search
• Genetic Algorithm:
– Represent designs as a population of strings (Bit Vectors)
– Generate new designs by combining members of the population via genetic crossover (Bit Selection)
– Occasionally, mutate new population members (Swap adjacent bits)
– Reduce population size by removing least-fit members – Survival of the Fittest
Genetic MC Placement
Parents: 0x00AA550000AA5500, 0x0000FF0000FF0000
Crossover: 0x00AAF00000F25100
Mutate: 0x00AAF00000F25080
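The two genetic operators named on the slides (bit selection and adjacent-bit swap) might be sketched on 64-bit placement vectors as follows. This is an illustration under assumptions, not the authors' implementation: the function names, the random selection mask, and the use of Python integers as bit vectors are all hypothetical. The plain bit-selection crossover here does not preserve the number of set bits, so a real search would need some repair or rejection step that the slides do not describe.

```python
import random

def crossover(a: int, b: int, rng: random.Random, bits: int = 64) -> int:
    """Bit selection: each child bit is copied from parent a or parent b."""
    mask = rng.getrandbits(bits)          # 1 = take bit from a, 0 = take bit from b
    return (a & mask) | (b & ~mask)

def swap_adjacent(v: int, i: int) -> int:
    """Mutation: swap bit i with bit i+1 (keeps the number of set bits)."""
    lo = (v >> i) & 1
    hi = (v >> (i + 1)) & 1
    if lo != hi:
        v ^= (1 << i) | (1 << (i + 1))    # flipping both bits swaps them
    return v

rng = random.Random(0)
child = crossover(0x00AA550000AA5500, 0x0000FF0000FF0000, rng)
mutated = swap_adjacent(0x00AAF00000F25100, 7)
print(hex(child), hex(mutated))
```

Swapping bits 7 and 8 of 0x00AAF00000F25100 yields 0x00AAF00000F25080, which matches the mutation step shown on the slide.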
Link Contention Results k=8
Config. | Max Channel Load (Mesh) | Max Channel Load (Torus)
row0_7 | 13.5 | 9.25
X | 8.93 | 7.72
Diamond | 8.90 | 7.72
• GA Selected Diamond as most fit solution for 8x8
– Minimizes MCs in a single row/column
– Spreads DOR load
Sanity Check: GA also prefers Diamond for 4x4, 5x5, and 6x6
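To make "maximum channel load" concrete, here is a simplified link-contention sketch for a k x k mesh. It is not the simulator behind the table and is not expected to reproduce those numbers. Assumptions: every tile sends one unit of request traffic spread evenly over all MCs, replies carry the same weight as requests, and both directions use XY dimension-order routing. The diamond coordinates below are an assumed layout with two MCs per row and per column; the slides do not list the exact tiles.

```python
from collections import defaultdict
from itertools import product

def max_channel_load(k, mcs):
    """Largest load on any directed mesh link under XY dimension-order routing."""
    load = defaultdict(float)

    def route(src, dst, weight):
        x, y = src
        while x != dst[0]:                 # travel along X first
            nxt = x + (1 if dst[0] > x else -1)
            load[(x, y, nxt, y)] += weight
            x = nxt
        while y != dst[1]:                 # then along Y
            nxt = y + (1 if dst[1] > y else -1)
            load[(x, y, x, nxt)] += weight
            y = nxt

    weight = 1.0 / len(mcs)                # uniform traffic over all MCs
    for tile in product(range(k), repeat=2):
        for mc in mcs:
            route(tile, mc, weight)        # request
            route(mc, tile, weight)        # reply (assumed equal weight)
    return max(load.values())

K = 8
row0_7 = [(x, y) for y in (0, K - 1) for x in range(K)]

# Assumed diamond layout: two MCs in every row and every column.
diamond = []
for y in range(K):
    a = 3 - y if y < 4 else y - 4
    diamond += [(a, y), (K - 1 - a, y)]

print("row0_7 :", max_channel_load(K, row0_7))
print("diamond:", max_channel_load(K, diamond))
```

Even in this reduced model, the diamond layout has a lower maximum channel load than row0_7, because reply traffic is no longer concentrated in the X-dimension links of two rows.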
Network Simulation: Open-Loop Evaluation
• Detailed simulation of all network events (buffers, links, etc.)
• Cores are Bernoulli injection processes, uniform random traffic
• Measure latency vs. offered load
Parameters | Values
Router latency | 1 cycle (aggressive)
Inter-router Delay | 1 cycle
Buffers | 32-flit sized per port
Packet size | Request: 1 flit; Reply: 4 flits
Virtual Channels | 4 (XY-YX routing)
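A Bernoulli injection process with uniform random destinations can be sketched as below. This is only a traffic generator under assumed names (`bernoulli_injections`, `rate`, `seed`); it does not model routers, buffers, or virtual channels, and it is not taken from the authors' simulator.

```python
import random

def bernoulli_injections(num_nodes, rate, cycles, seed=0):
    """Yield (cycle, src, dst) tuples.

    Each node injects a packet in each cycle with probability `rate`,
    addressed to a destination chosen uniformly among the other nodes.
    Requires num_nodes >= 2.
    """
    rng = random.Random(seed)
    for cycle in range(cycles):
        for src in range(num_nodes):
            if rng.random() < rate:
                dst = rng.randrange(num_nodes - 1)
                if dst >= src:             # skip over src to avoid self-traffic
                    dst += 1
                yield cycle, src, dst

packets = list(bernoulli_injections(num_nodes=64, rate=0.2, cycles=100))
print(len(packets), "packets injected")
```

Sweeping `rate` and recording the average packet latency at each value gives the latency vs. offered load curve that an open-loop evaluation measures.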
Open-Loop Results
[Figure: Latency (cycles) vs. offered load (flits/cycle) for row0_7, row2_5, Diamond, and X]
Closed-Loop Evaluation
• Each processor executes N memory operations
• Up to r operations outstanding at a time
– Models MSHRs
• Uniform Random requests, and real request streams with ‘hot spot’ behavior
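The closed-loop rule (N operations per processor, at most r in flight) can be illustrated with a small completion-time calculation. This is a sketch under assumptions: per-operation latencies are passed in as a plain list, whereas in the actual evaluation they would come from the network simulation, and the function name is hypothetical.

```python
import heapq

def completion_time(latencies, r):
    """Cycle at which the last operation finishes with at most r outstanding.

    A new operation issues as soon as a slot is free; when all r slots are
    busy, the processor waits for the earliest outstanding completion.
    """
    in_flight = []                         # min-heap of completion times
    now = 0
    for lat in latencies:
        if len(in_flight) == r:
            now = heapq.heappop(in_flight) # stall until a slot frees up
        heapq.heappush(in_flight, now + lat)
    return max(in_flight, default=0)

print(completion_time([10, 10, 10, 10], r=2))
print(completion_time([10, 10, 10, 10], r=4))
```

With a fixed window r, higher or more variable memory latency translates directly into a later completion time, which is why this setup exposes differences between placements.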
Closed-Loop Results
[Figure: Number of processors vs. completion time for Diamond and row0_7]
Full System Results
[Figure: Average network latency (cycles) for request to memory controller vs. standard deviation, for Row0_7 and Diamond; workloads JBB, WEB, TPC-W, TPC-H, and TPC-W+H]
Diamond placement yields lower latency and lower latency variance.
Conclusion
• MC Placement Matters!
– Diamond reduces contention, improves latency, and reduces latency/runtime variance
– X does fairly well