Programming Many-Core Systems with GRAMPS
Jeremy Sugerman, 14 May 2010
2
The single fast core era is over
• Trends:
– Changing metrics: ‘scale out’, not just ‘scale up’
– Increasing diversity: many different mixes of ‘cores’
• Today’s (and tomorrow’s) machines: commodity, heterogeneous, many-core
Problem: How does one program all this complexity?!
3
High-level programming models
• Two major advantages over threads & locks:
– Constructs to express/expose parallelism
– Scheduling support to help manage concurrency, communication, and synchronization
• Widespread in research and industry: OpenGL/Direct3D, SQL, Brook, Cilk, CUDA, OpenCL, StreamIt, TBB, …
4
My biases: workloads
• Interesting applications have irregularity
• Large bundles of coherent work are efficient
• Producer-consumer idiom is important
Goal: Rebuild coherence dynamically by aggregating related work as it is generated.
5
My target audience
• Highly informed, but (good) lazy
– Understands the hardware and best practices
– Dislikes rote; prefers power over constraints
Goal: Let systems-savvy developers efficiently develop programs that efficiently map onto their hardware.
6
Contributions: Design of GRAMPS
• Programs are graphs of stages and queues
• Queues:
– Maximum capacities, packet sizes
• Stages (a short sketch of these properties follows the figure below):
– No, limited, or total automatic parallelism
– Fixed, variable, or reduction (in-place) outputs
Simple Graphics Pipeline
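The stage and queue properties just listed can be pictured as plain declarations. A minimal C++ sketch, where every name (Parallelism, OutputMode, StageDecl, QueueDecl) is an illustrative assumption rather than an actual GRAMPS declaration:

// Sketch of the design-space attributes listed above, as plain enums and
// structs. All names here are illustrative, not GRAMPS declarations.
#include <cstdio>

enum class Parallelism { None, Limited, Total };      // automatic parallelism
enum class OutputMode { Fixed, Variable, Reduction }; // Reduction = in-place

struct StageDecl {
    Parallelism parallelism;
    OutputMode outputs;
};

struct QueueDecl {
    unsigned max_capacity;  // maximum number of packets in flight
    unsigned packet_size;   // elements per packet
};

int main() {
    StageDecl shader{Parallelism::Total, OutputMode::Variable};
    QueueDecl q{16, 8};
    std::printf("queue: cap=%u pkt=%u, stage parallelism=%d\n",
                q.max_capacity, q.packet_size, (int)shader.parallelism);
}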
7
Contributions: Implementation
• Broad application scope:
– Rendering, MapReduce, image processing, …
• Multi-platform applicability:
– GRAMPS runtimes for three architectures
• Performance:
– Scale-out parallelism, controlled data footprint
– Compares well to schedulers from other models
• (Also: Tunable)
8
Outline
• GRAMPS overview
• Study 1: Future graphics architectures
• Study 2: Current multi-core CPUs
• Comparison with schedulers from other parallel programming models
GRAMPS Overview
10
GRAMPS
• Programs are graphs of stages and queues
– Expose the program structure
– Leave the program internals unconstrained
11
Writing a GRAMPS program
• Design the application graph and queues
• Design the stages
• Instantiate and launch (a sketch follows the figure below)
Credit: http://www.foodnetwork.com/recipes/alton-brown/the-chewy-recipe/index.html
Cookie Dough Pipeline
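The three steps above can be pictured in code. A minimal C++ sketch of the workflow, where every name (Graph, Queue, Stage, launch) is a hypothetical stand-in rather than the real GRAMPS API:

// Hypothetical sketch of the slide's three steps: design the application
// graph and queues, design the stages, then instantiate and launch.
#include <cstdio>
#include <deque>
#include <string>
#include <vector>

struct Queue {
    std::string name;
    size_t max_packets;   // queues are bounded (see the Queues slide)
    size_t packet_size;   // queues operate at packet granularity
};

struct Stage {
    std::string name;
    std::vector<const Queue*> inputs, outputs;
};

struct Graph {
    std::deque<Queue> queues;   // deque keeps addresses stable as we add
    std::deque<Stage> stages;

    const Queue* add_queue(std::string n, size_t cap, size_t pkt) {
        queues.push_back({std::move(n), cap, pkt});
        return &queues.back();
    }
    Stage* add_stage(std::string n) {
        stages.push_back({std::move(n), {}, {}});
        return &stages.back();
    }
};

// Stand-in for handing the finished graph to the runtime's scheduler.
void launch(const Graph& g) {
    for (const Stage& s : g.stages)
        std::printf("stage %-8s %zu in, %zu out\n",
                    s.name.c_str(), s.inputs.size(), s.outputs.size());
}

int main() {
    Graph g;
    // 1. Design the application graph and queues.
    const Queue* dough   = g.add_queue("dough",   /*cap=*/16, /*pkt=*/8);
    const Queue* cookies = g.add_queue("cookies", /*cap=*/16, /*pkt=*/8);
    // 2. Design the stages (internals stay unconstrained; only the
    //    structure is exposed to the runtime).
    Stage* mixer = g.add_stage("Mixer");
    mixer->outputs.push_back(dough);
    Stage* scooper = g.add_stage("Scooper");
    scooper->inputs.push_back(dough);
    scooper->outputs.push_back(cookies);
    // 3. Instantiate and launch.
    launch(g);
}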
12
Queues
• Bounded size, operate at “packet” granularity
– “Opaque” and “Collection” packets
• GRAMPS can optionally preserve ordering
– Required for some workloads, adds overhead
13
Thread (and Fixed) stages
• Preemptible, long-lived, stateful
– Often merge, compare, or repack inputs
• Queue operations: Reserve/Commit (sketched below)
• (Fixed: Thread stages in custom hardware)
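A minimal sketch of the Reserve/Commit idiom, assuming a mutex-guarded packet queue; the two-phase shape (claim a window, work, then commit) is from the slide, while the class and its methods are illustrative, not the actual runtime:

// Sketch of the Reserve/Commit protocol a Thread stage uses on its input
// queue: reserve claims a window of packets without consuming them, and
// commit finalizes the window. The single-lock design is an assumption.
#include <cstdio>
#include <deque>
#include <mutex>
#include <vector>

class PacketQueue {
    std::deque<int> packets;   // an int stands in for an opaque packet
    size_t reserved = 0;       // claimed but not yet committed
    std::mutex m;
public:
    void push(int p) {
        std::lock_guard<std::mutex> l(m);
        packets.push_back(p);
    }
    // Reserve up to n input packets; returns a copy of the claimed window.
    std::vector<int> reserve(size_t n) {
        std::lock_guard<std::mutex> l(m);
        size_t avail = packets.size() - reserved;
        size_t take = n < avail ? n : avail;
        std::vector<int> window(packets.begin() + reserved,
                                packets.begin() + reserved + take);
        reserved += take;
        return window;
    }
    // Commit the oldest n reserved packets: only now are they consumed.
    void commit(size_t n) {
        std::lock_guard<std::mutex> l(m);
        for (size_t i = 0; i < n && reserved > 0; ++i, --reserved)
            packets.pop_front();
    }
};

int main() {
    PacketQueue q;
    for (int i = 0; i < 4; ++i) q.push(i);
    // Body of a long-lived Thread stage: reserve, work (e.g., merge or
    // repack), then commit. Preemption happens only at these calls.
    std::vector<int> window = q.reserve(2);
    for (int p : window) std::printf("repacking packet %d\n", p);
    q.commit(window.size());
}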
14
Shader stages
• Automatically parallelized:
– Horde of non-preemptible, stateless instances
– Pre-reserve/post-commit
• Push: variable/conditional output support (sketched below)
– GRAMPS coalesces elements into full packets
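A sketch of push-style output with coalescing, under the assumption of a fixed packet width; the Coalescer type and the toy kernel are illustrative, not GRAMPS machinery:

// Sketch of Shader-stage "push": each stateless shader instance may emit a
// variable number of elements, and the runtime coalesces them into full
// packets before they enter the output queue.
#include <cstdio>
#include <vector>

constexpr size_t kPacketSize = 4;   // assumed packet width

struct Coalescer {
    std::vector<int> partial;                   // elements awaiting a full packet
    std::vector<std::vector<int>> out_packets;  // completed, full packets
    void push(int element) {                    // conditional, per-element output
        partial.push_back(element);
        if (partial.size() == kPacketSize) {
            out_packets.push_back(partial);
            partial.clear();
        }
    }
};

// One stateless shader instance runs per element of the input packet.
void run_shader(const std::vector<int>& in_packet, Coalescer& out) {
    for (int x : in_packet)
        if (x % 2 == 0)        // variable/conditional output: keep evens only
            out.push(x * x);
}

int main() {
    Coalescer out;
    run_shader({1, 2, 3, 4, 5, 6}, out);
    run_shader({8, 10, 12, 14}, out);
    std::printf("%zu full packet(s), %zu element(s) still partial\n",
                out.out_packets.size(), out.partial.size());
}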
15
Queue sets: Mutual exclusion
• Independent exclusive (serial) subqueues
– Created statically or on first output
– Densely or sparsely indexed
• Bonus: automatically instanced Thread stages
Cookie Dough Pipeline
16
Queue sets: Mutual exclusion
• Independent exclusive (serial) subqueues
– Created statically or on first output
– Densely or sparsely indexed
• Bonus: automatically instanced Thread stages (a sketch follows the figure below)
Cookie Dough (with queue set)
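A sketch of the queue-set idea from the last two slides, assuming a sparsely indexed (hash map) representation; none of these names are the real GRAMPS API:

// Sketch of a queue set: one logical queue split into independent
// exclusive subqueues, so packets with the same key are serialized while
// different keys can be processed concurrently by instanced Thread stages.
#include <cstdio>
#include <deque>
#include <unordered_map>

struct QueueSet {
    // Each subqueue is drained by at most one stage instance at a time,
    // giving per-key mutual exclusion without locks in stage code.
    std::unordered_map<int, std::deque<int>> subqueues;  // created on first output

    void push(int key, int packet) { subqueues[key].push_back(packet); }
};

int main() {
    QueueSet qs;
    // Cookie-dough flavor: bin scoops by tray. Tray 7's packets stay
    // ordered and exclusive; trays 7 and 9 can be processed in parallel.
    qs.push(7, 100);
    qs.push(9, 200);
    qs.push(7, 101);
    for (const auto& [key, sub] : qs.subqueues)
        std::printf("subqueue %d holds %zu packet(s)\n", key, sub.size());
}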
17
A few other tidbits
• Instanced Thread stages
• Queues as barriers / read all-at-once
• In-place Shader stages / coalescing inputs
18
Formative influences
• The Graphics Pipeline, early GPGPU
• “Streaming”
• Work-queues and task-queues
Study 1: Future Graphics Architectures
(with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan; appeared in ACM Transactions on Graphics, January 2009)
20
Graphics is a natural first domain
• Table stakes for commodity parallelism
• GPUs are full of heterogeneity
• Poised to transition from fixed/configurable pipeline to programmable
• We have a lot of experience in it
21
The Graphics Pipeline in GRAMPS
• Graph and setup are (application) software
– Can be customized or completely replaced
• Like the transition to programmable shading
– Not (unthinkably) radical
• Fits current hw: FIFOs, cores, rasterizer, …
22
Reminder: Design goals
• Broad application scope
• Multi-platform applicability
• Performance: scale-out, footprint-aware
23
The Experiment
• Three renderers:
– Rasterization, Ray Tracer, Hybrid
• Two simulated future architectures
– Simple scheduler for each
24
Scope: Two(-plus) renderers
Ray Tracing Extension
Rasterization Pipeline (with ray tracing extension)
Ray Tracing Graph
25
Platforms: Two simulated systems
CPU-Like: 8 Fat Cores, Rast
GPU-Like: 1 Fat Core, 4 Micro Cores, Rast, Sched
26
Performance – Metrics
“Maximize machine utilization while keeping working sets small”
• Priority #1: Scale-out parallelism
– Parallel utilization
• Priority #2: ‘Reasonable’ bandwidth / storage
– Worst case total footprint of all queues
– Inherently a trade-off versus utilization
27
Performance – Scheduling
Simple prototype scheduler (both platforms):
• Static stage priorities, labeled from (Lowest) to (Highest) in the graph figure (a sketch follows below)
• Only preempt on Reserve and Commit
• No dynamic weighting of current queue sizes
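A sketch of this policy as a scheduling loop: fixed per-stage priorities, with stage switches only at Reserve/Commit boundaries. This is an illustrative reconstruction, not the simulators' actual code:

// Every stage has a static priority taken once from the graph; each
// iteration stands in for running the chosen stage until its next
// Reserve or Commit, the only points where preemption occurs.
#include <cstdio>
#include <vector>

struct StageState {
    const char* name;
    int priority;         // static, assigned once from the graph
    int pending_packets;  // runnable while it has input waiting
};

// Pick the runnable stage with the highest static priority.
StageState* pick(std::vector<StageState>& stages) {
    StageState* best = nullptr;
    for (auto& s : stages)
        if (s.pending_packets > 0 && (!best || s.priority > best->priority))
            best = &s;
    return best;
}

int main() {
    // A toy pipeline; priorities increase toward the end of the graph so
    // downstream stages drain queues first, keeping footprint small.
    std::vector<StageState> stages = {
        {"Vertex", 0, 5}, {"Rast", 1, 2}, {"Fragment", 2, 3}, {"Blend", 3, 1},
    };
    while (StageState* s = pick(stages)) {
        std::printf("run %s\n", s->name);
        --s->pending_packets;
    }
}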
28
Performance – Results
• Utilization: 95+% for all but rasterized fairy (~80%)
• Footprint: < 600KB CPU-like, < 1.5MB GPU-like
• Surprised how well the simple scheduler worked
• Maintaining order costs footprint
Study 2: Current Multi-core CPUs
(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)
30
Reminder: Design goals
• Broad application scope
• Multi-platform applicability
• Performance: scale-out, footprint-aware
31
The Experiment
• 9 applications, 13 configurations
• One (more) architecture: multi-core x86
– It’s real (no simulation here)
– Built with pthreads, locks, and atomics
• Per-pthread task-priority-queues with work-stealing (sketched below)
– More advanced scheduling
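A sketch of per-pthread task-priority-queues with stealing, using mutexes for clarity; the names and the locking scheme are assumptions, not the runtime's implementation:

// Each worker prefers its own highest-priority task and steals from a
// victim only when its own queue is empty.
#include <array>
#include <cstdio>
#include <mutex>
#include <queue>
#include <vector>

struct Task { int stage_priority; int id; };

struct ByPriority {
    bool operator()(const Task& a, const Task& b) const {
        return a.stage_priority < b.stage_priority;  // max-heap on priority
    }
};

struct Worker {
    std::priority_queue<Task, std::vector<Task>, ByPriority> tasks;
    std::mutex m;
    void push(Task t) { std::lock_guard<std::mutex> l(m); tasks.push(t); }
    bool pop(Task& out) {
        std::lock_guard<std::mutex> l(m);
        if (tasks.empty()) return false;
        out = tasks.top();
        tasks.pop();
        return true;
    }
};

// Take from our own queue first; otherwise try to steal from the others.
template <size_t N>
bool next_task(size_t self, std::array<Worker, N>& ws, Task& out) {
    if (ws[self].pop(out)) return true;
    for (size_t v = 0; v < N; ++v)
        if (v != self && ws[v].pop(out)) return true;
    return false;
}

int main() {
    std::array<Worker, 2> workers;
    workers[0].push({/*stage_priority=*/2, /*id=*/1});
    workers[0].push({/*stage_priority=*/5, /*id=*/2});
    // Worker 1 owns no tasks, so it steals; higher priority runs first.
    Task t;
    while (next_task(1, workers, t))
        std::printf("worker 1 ran task %d (priority %d)\n", t.id, t.stage_priority);
}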
32
Scope: Application bonanza
• GRAMPS: Ray tracer (0, 1 bounce), Spheres (no rasterization, though)
• MapReduce: Hist (reduce/combine), LR (reduce/combine), PCA
• Cilk(-like): Mergesort
• CUDA: Gaussian, SRAD
• StreamIt: FM, TDE
33
Scope: Many different idioms
[Stage-and-queue graphs for FM, Merge Sort, Ray Tracer, SRAD, and MapReduce]
34
Platform: 2xQuad-core Nehalem
• Queues: copy in/out, global (shared) buffer
• Threads: user-level scheduled contexts
• Shaders: create one task per input packet
Native: 8 HyperThreaded Core i7 cores
35
Performance – Metrics (Reminder)
“Maximize machine utilization while keeping working sets small”
• Priority #1: Scale-out parallelism
• Priority #2: ‘Reasonable’ bandwidth / storage
36
Performance – Scheduling
• Static per-stage priorities (still)
• Work-stealing task-priority-queues
• Eagerly create one task per packet (naïve)
• Keep running stages until a low watermark (sketched below)
– (Limited dynamic weighting of queue depths)
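A sketch of the low-watermark rule, assuming it means "keep draining the chosen stage's input until it falls below a threshold" rather than switching after every packet; the threshold value is invented for illustration:

#include <cstdio>

int main() {
    const unsigned low_watermark = 4;   // assumed threshold
    unsigned queued = 10;               // packets waiting on the input queue
    // Drain without rescheduling until we hit the watermark.
    while (queued > low_watermark)
        std::printf("process packet (%u left)\n", --queued);
    std::printf("below watermark: yield back to the scheduler\n");
}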
37
Performance – Good Scale-out
• (Footprint: good; detail a little later)
[Chart: Parallel Speedup versus Hardware Threads]
38
Performance – Low Overheads
• ‘App’ and ‘Queue’ time are both useful work.
[Chart: Execution Time Breakdown (8 cores / 16 hyperthreads), percentage of execution]
Comparison with Other Schedulers
(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)
40
Three archetypes
• Task-Stealing (Cilk, TBB):
– Low overhead with fine granularity tasks
– No producer-consumer, priorities, or data-parallel
• Breadth-First (CUDA, OpenCL):
– Simple scheduler (one stage at a time)
– No producer-consumer, no pipeline parallelism
• Static (StreamIt / Streaming):
– No runtime scheduler; complex schedules
– Cannot adapt to irregular workloads
41
GRAMPS is a natural framework
[Table comparing GRAMPS, Task-Stealing, Breadth-First, and Static on: Shader Support, Producer-Consumer, Structured ‘Work’, Adaptive]
42
The Experiment
• Re-use the exact same application code
• Modify the scheduler per archetype:
– Task-Stealing: Unbounded queues, no priority, (amortized) preempt to child tasks
– Breadth-First: Unbounded queues, stage at a time, top-to-bottom
– Static: Unbounded queues, offline per-thread schedule using SAS / SGMS
43
Seeing is believing (ray tracer)
[Execution traces: GRAMPS, Breadth-First, Static (SAS), Task-Stealing]
44
Comparison: Execution time
• Mostly similar: good parallelism, load balance
[Chart: Time Breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]
45
Comparison: Execution time
• Breadth-first can exhibit load-imbalance
[Chart: Time Breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]
46
Comparison: Execution time
• Task-stealing can ping-pong, cause contention
[Chart: Time Breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]
47
Comparison: Footprint
• Breadth-First is pathological (as expected)
[Chart: Relative Packet Footprint (log scale), size versus GRAMPS]
48
Footprint: GRAMPS & Task-Stealing
[Charts: Relative Packet Footprint; Relative Task Footprint]
49
Footprint: GRAMPS & Task-Stealing
GRAMPS gets insight from the graph:
• (Application-specified) queue bounds
• Group tasks by stage for priority, preemption
[Charts: MapReduce and Ray Tracer footprints]
50
Static scheduling is challenging
• Generating good Static schedules is *hard*.
• Static schedules are fragile:
– Small mismatches compound
– Hardware itself is dynamic (cache traffic, IRQs, …)
• Limited upside: dynamic scheduling is cheap!
[Charts: Execution Time; Packet Footprint]
51
Discussion (for multi-core CPUs)
• Adaptive scheduling is the obvious choice.
– Better load-balance / handling of irregularity
• Semantic insight (app graph) gives a big advantage in managing footprint.
• More cores, development maturity → more complex graphs and thus more advantage.
Conclusion
53
Contributions Revisited
• GRAMPS programming model design
– Graph of heterogeneous stages and queues
• Good results from actual implementation
– Broad scope: wide range of applications
– Multi-platform: three different architectures
– Performance: high parallelism, good footprint
54
Anecdotes and intuitions
• Structure helps: an explicit graph is handy.
• Simple (principled) dynamic scheduling works.
• Queues impedance-match heterogeneity.
• Graphs with cycles and push both paid off.
• (Also: paired instrumentation and visualization help enormously)
55
Conclusion: Future trends revisited
• Core counts are increasing
– Parallel programming models
• Memory and bandwidth are precious
– Working set, locality (i.e., footprint) management
• Power, performance driving heterogeneity
– All ‘cores’ need to communicate, interoperate
GRAMPS fits them well.
56
Thanks
• Eric, for agreeing to make this happen.
• Christos, for throwing helpers at me.
• Kurt, Mendel, and Pat, for, well, a lot.
• John Gerth, for tireless computer servitude.
• Melissa (and Heather and Ada before her)
57
Thanks
• My practice audiences
• My many collaborators
• Daniel, Kayvon, Mike, Tim
• Supporters at NVIDIA, ATI/AMD, Intel
• Supporters at VMware
• Everyone who entertained, informed, challenged me, and made me think
58
Thanks
• My funding agencies:
– Rambus Stanford Graduate Fellowship
– Department of the Army Research
– Stanford Pervasive Parallelism Laboratory
59
Q&A
• Thank you for listening!
• Questions?
Extra Material (Backup)
61
Data: CPU-Like & GPU-Like
62
Footprint Data: Native
63
Tunability
• Diagnosis:
– Raw counters, statistics, logs
– Grampsviz
• Optimize / Control (a sketch of these knobs follows below):
– Graph topology (e.g., sort-middle vs. sort-last)
– Queue watermarks (e.g., 10x win for ray tracing)
– Packet size: match SIMD widths, share data
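The knobs above, gathered into one hypothetical configuration struct; the struct, field names, and default values are assumptions for illustration, since the slide only names the knobs:

#include <cstdio>

enum class Topology { SortMiddle, SortLast };  // graph topology choice

struct TuningKnobs {
    Topology topology = Topology::SortMiddle;
    unsigned queue_watermark = 16;  // scheduler threshold; the slide reports
                                    // a 10x win from tuning it for ray tracing
    unsigned packet_size = 8;       // sized to match SIMD width / share data
};

int main() {
    TuningKnobs knobs;
    knobs.packet_size = 16;  // e.g., widen packets for a 16-wide SIMD target
    std::printf("topology=%d watermark=%u packet=%u\n",
                (int)knobs.topology, knobs.queue_watermark, knobs.packet_size);
}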
64
Tunability – Grampsviz (1)
• GPU-Like: Rasterization pipeline
65
Tunability – Grampsviz (2)
• CPU-Like: Histogram (MapReduce)
[Grampsviz screenshot; stage labels: Reduce, Combine]
66
Tunability – Knobs
• Graph topology/design:
[Figures: Sort-Middle vs. Sort-Last topologies]
• Sizing critical queues:
Alternatives
69
A few other tidbits
• In-place Shader stages / coalescing inputs
• Instanced Thread stages
• Queues as barriers / read all-at-once
Image Histogram Pipeline
70
Performance – Good Scale-out
• (Footprint: good; detail a little later)
[Chart: Parallel Speedup versus Hardware Threads]
71
Seeing is believing (ray tracer)
[Execution traces: GRAMPS, Static (SAS), Task-Stealing, Breadth-First]
72
Comparison: Execution time
• Small ‘Sched’ time, even with large graphs
[Chart: Time Breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]