Programming Many-Core Systems with GRAMPS
Jeremy Sugerman, 14 May 2010
2
The single fast core era is over
• Trends:
– Changing metrics: ‘scale out’, not just ‘scale up’
– Increasing diversity: many different mixes of ‘cores’
• Today’s (and tomorrow’s) machines: commodity, heterogeneous, many-core
Problem: How does one program all this complexity?!
3
High-level programming models
• Two major advantages over threads & locks:
– Constructs to express/expose parallelism
– Scheduling support to help manage concurrency, communication, and synchronization
• Widespread in research and industry: OpenGL/Direct3D, SQL, Brook, Cilk, CUDA, OpenCL, StreamIt, TBB, …
4
My biases: workloads
• Interesting applications have irregularity
• Large bundles of coherent work are efficient
• Producer-consumer idiom is important
Goal: Rebuild coherence dynamically by aggregating related work as it is generated.
5
My target audience
• Highly informed, but (good) lazy
– Understands the hardware and best practices
– Dislikes rote; prefers power over constraints
Goal: Let systems-savvy developers efficiently develop programs that efficiently map onto their hardware.
6
Contributions: Design of GRAMPS
• Programs are graphs of stages and queues
• Queues:
– Maximum capacities, packet sizes
• Stages (a short sketch of these properties follows the figure below):
– No, limited, or total automatic parallelism
– Fixed, variable, or reduction (in-place) outputs
Simple Graphics Pipeline
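The stage and queue properties just listed can be pictured as plain declarations. A minimal C++ sketch, where every name (Parallelism, OutputMode, StageDecl, QueueDecl) is an illustrative assumption rather than an actual GRAMPS declaration:

// Sketch of the design-space attributes listed above, as plain enums and
// structs. All names here are illustrative, not GRAMPS declarations.
#include <cstdio>

enum class Parallelism { None, Limited, Total };      // automatic parallelism
enum class OutputMode { Fixed, Variable, Reduction }; // Reduction = in-place

struct StageDecl {
    Parallelism parallelism;
    OutputMode outputs;
};

struct QueueDecl {
    unsigned max_capacity;  // maximum number of packets in flight
    unsigned packet_size;   // elements per packet
};

int main() {
    StageDecl shader{Parallelism::Total, OutputMode::Variable};
    QueueDecl q{16, 8};
    std::printf("queue: cap=%u pkt=%u, stage parallelism=%d\n",
                q.max_capacity, q.packet_size, (int)shader.parallelism);
}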
7
Contributions: Implementation
• Broad application scope:
– Rendering, MapReduce, image processing, …
• Multi-platform applicability:
– GRAMPS runtimes for three architectures
• Performance:
– Scale-out parallelism, controlled data footprint
– Compares well to schedulers from other models
• (Also: Tunable)
8
Outline
• GRAMPS overview
• Study 1: Future graphics architectures
• Study 2: Current multi-core CPUs
• Comparison with schedulers from other parallel programming models
GRAMPS Overview
10
GRAMPS
• Programs are graphs of stages and queues
– Expose the program structure
– Leave the program internals unconstrained
11
Writing a GRAMPS program
• Design the application graph and queues
• Design the stages
• Instantiate and launch (a sketch follows the figure below)
Credit: http://www.foodnetwork.com/recipes/alton-brown/the-chewy-recipe/index.html
Cookie Dough Pipeline
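The three steps above can be pictured in code. A minimal C++ sketch of the workflow, where every name (Graph, Queue, Stage, launch) is a hypothetical stand-in rather than the real GRAMPS API:

// Hypothetical sketch of the slide's three steps: design the application
// graph and queues, design the stages, then instantiate and launch.
#include <cstdio>
#include <deque>
#include <string>
#include <vector>

struct Queue {
    std::string name;
    size_t max_packets;   // queues are bounded (see the Queues slide)
    size_t packet_size;   // queues operate at packet granularity
};

struct Stage {
    std::string name;
    std::vector<const Queue*> inputs, outputs;
};

struct Graph {
    std::deque<Queue> queues;   // deque keeps addresses stable as we add
    std::deque<Stage> stages;

    const Queue* add_queue(std::string n, size_t cap, size_t pkt) {
        queues.push_back({std::move(n), cap, pkt});
        return &queues.back();
    }
    Stage* add_stage(std::string n) {
        stages.push_back({std::move(n), {}, {}});
        return &stages.back();
    }
};

// Stand-in for handing the finished graph to the runtime's scheduler.
void launch(const Graph& g) {
    for (const Stage& s : g.stages)
        std::printf("stage %-8s %zu in, %zu out\n",
                    s.name.c_str(), s.inputs.size(), s.outputs.size());
}

int main() {
    Graph g;
    // 1. Design the application graph and queues.
    const Queue* dough   = g.add_queue("dough",   /*cap=*/16, /*pkt=*/8);
    const Queue* cookies = g.add_queue("cookies", /*cap=*/16, /*pkt=*/8);
    // 2. Design the stages (internals stay unconstrained; only the
    //    structure is exposed to the runtime).
    Stage* mixer = g.add_stage("Mixer");
    mixer->outputs.push_back(dough);
    Stage* scooper = g.add_stage("Scooper");
    scooper->inputs.push_back(dough);
    scooper->outputs.push_back(cookies);
    // 3. Instantiate and launch.
    launch(g);
}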
12
Queues
• Bounded size, operate at “packet” granularity
– “Opaque” and “Collection” packets
• GRAMPS can optionally preserve ordering
– Required for some workloads, adds overhead
13
Thread (and Fixed) stages
• Preemptible, long-lived, stateful
– Often merge, compare, or repack inputs
• Queue operations: Reserve/Commit (sketched below)
• (Fixed: Thread stages in custom hardware)
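A minimal sketch of the Reserve/Commit idiom, assuming a mutex-guarded packet queue; the two-phase shape (claim a window, work, then commit) is from the slide, while the class and its methods are illustrative, not the actual runtime:

// Sketch of the Reserve/Commit protocol a Thread stage uses on its input
// queue: reserve claims a window of packets without consuming them, and
// commit finalizes the window. The single-lock design is an assumption.
#include <cstdio>
#include <deque>
#include <mutex>
#include <vector>

class PacketQueue {
    std::deque<int> packets;   // an int stands in for an opaque packet
    size_t reserved = 0;       // claimed but not yet committed
    std::mutex m;
public:
    void push(int p) {
        std::lock_guard<std::mutex> l(m);
        packets.push_back(p);
    }
    // Reserve up to n input packets; returns a copy of the claimed window.
    std::vector<int> reserve(size_t n) {
        std::lock_guard<std::mutex> l(m);
        size_t avail = packets.size() - reserved;
        size_t take = n < avail ? n : avail;
        std::vector<int> window(packets.begin() + reserved,
                                packets.begin() + reserved + take);
        reserved += take;
        return window;
    }
    // Commit the oldest n reserved packets: only now are they consumed.
    void commit(size_t n) {
        std::lock_guard<std::mutex> l(m);
        for (size_t i = 0; i < n && reserved > 0; ++i, --reserved)
            packets.pop_front();
    }
};

int main() {
    PacketQueue q;
    for (int i = 0; i < 4; ++i) q.push(i);
    // Body of a long-lived Thread stage: reserve, work (e.g., merge or
    // repack), then commit. Preemption happens only at these calls.
    std::vector<int> window = q.reserve(2);
    for (int p : window) std::printf("repacking packet %d\n", p);
    q.commit(window.size());
}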
14
Shader stages
• Automatically parallelized:
– Horde of non-preemptible, stateless instances
– Pre-reserve/post-commit
• Push: variable/conditional output support (sketched below)
– GRAMPS coalesces elements into full packets
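A sketch of push-style output with coalescing, under the assumption of a fixed packet width; the Coalescer type and the toy kernel are illustrative, not GRAMPS machinery:

// Sketch of Shader-stage "push": each stateless shader instance may emit a
// variable number of elements, and the runtime coalesces them into full
// packets before they enter the output queue.
#include <cstdio>
#include <vector>

constexpr size_t kPacketSize = 4;   // assumed packet width

struct Coalescer {
    std::vector<int> partial;                   // elements awaiting a full packet
    std::vector<std::vector<int>> out_packets;  // completed, full packets
    void push(int element) {                    // conditional, per-element output
        partial.push_back(element);
        if (partial.size() == kPacketSize) {
            out_packets.push_back(partial);
            partial.clear();
        }
    }
};

// One stateless shader instance runs per element of the input packet.
void run_shader(const std::vector<int>& in_packet, Coalescer& out) {
    for (int x : in_packet)
        if (x % 2 == 0)        // variable/conditional output: keep evens only
            out.push(x * x);
}

int main() {
    Coalescer out;
    run_shader({1, 2, 3, 4, 5, 6}, out);
    run_shader({8, 10, 12, 14}, out);
    std::printf("%zu full packet(s), %zu element(s) still partial\n",
                out.out_packets.size(), out.partial.size());
}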
15
Queue sets: Mutual exclusion
• Independent exclusive (serial) subqueues
– Created statically or on first output
– Densely or sparsely indexed
• Bonus: automatically instanced Thread stages
Cookie Dough Pipeline
16
Queue sets: Mutual exclusion
• Independent exclusive (serial) subqueues
– Created statically or on first output
– Densely or sparsely indexed
• Bonus: automatically instanced Thread stages (a sketch follows the figure below)
Cookie Dough (with queue set)
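A sketch of the queue-set idea from the last two slides, assuming a sparsely indexed (hash map) representation; none of these names are the real GRAMPS API:

// Sketch of a queue set: one logical queue split into independent
// exclusive subqueues, so packets with the same key are serialized while
// different keys can be processed concurrently by instanced Thread stages.
#include <cstdio>
#include <deque>
#include <unordered_map>

struct QueueSet {
    // Each subqueue is drained by at most one stage instance at a time,
    // giving per-key mutual exclusion without locks in stage code.
    std::unordered_map<int, std::deque<int>> subqueues;  // created on first output

    void push(int key, int packet) { subqueues[key].push_back(packet); }
};

int main() {
    QueueSet qs;
    // Cookie-dough flavor: bin scoops by tray. Tray 7's packets stay
    // ordered and exclusive; trays 7 and 9 can be processed in parallel.
    qs.push(7, 100);
    qs.push(9, 200);
    qs.push(7, 101);
    for (const auto& [key, sub] : qs.subqueues)
        std::printf("subqueue %d holds %zu packet(s)\n", key, sub.size());
}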
17
A few other tidbits
• Instanced Thread stages
• Queues as barriers / read all-at-once
• In-place Shader stages / coalescing inputs
18
Formative influences
• The Graphics Pipeline, early GPGPU
• “Streaming”
• Work-queues and task-queues
Study 1: Future Graphics Architectures
(with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan; appeared in ACM Transactions on Graphics, January 2009)
20
Graphics is a natural first domain
• Table stakes for commodity parallelism
• GPUs are full of heterogeneity
• Poised to transition from fixed/configurable pipeline to programmable
• We have a lot of experience in it
21
The Graphics Pipeline in GRAMPS
• Graph and setup are (application) software
– Can be customized or completely replaced
• Like the transition to programmable shading
– Not (unthinkably) radical
• Fits current hw: FIFOs, cores, rasterizer, …
22
Reminder: Design goals
• Broad application scope
• Multi-platform applicability
• Performance: scale-out, footprint-aware
23
The Experiment
• Three renderers:
– Rasterization, Ray Tracer, Hybrid
• Two simulated future architectures
– Simple scheduler for each
24
Scope: Two(-plus) renderers
Ray Tracing Extension
Rasterization Pipeline (with ray tracing extension)
Ray Tracing Graph
25
Platforms: Two simulated systems
CPU-Like: 8 Fat Cores, Rast
GPU-Like: 1 Fat Core, 4 Micro Cores, Rast, Sched
26
Performance – Metrics
“Maximize machine utilization while keeping working sets small”
• Priority #1: Scale-out parallelism
– Parallel utilization
• Priority #2: ‘Reasonable’ bandwidth / storage
– Worst case total footprint of all queues
– Inherently a trade-off versus utilization
27
Performance – Scheduling
Simple prototype scheduler (both platforms):
• Static stage priorities, labeled from (Lowest) to (Highest) in the graph figure (a sketch follows below)
• Only preempt on Reserve and Commit
• No dynamic weighting of current queue sizes
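A sketch of this policy as a scheduling loop: fixed per-stage priorities, with stage switches only at Reserve/Commit boundaries. This is an illustrative reconstruction, not the simulators' actual code:

// Every stage has a static priority taken once from the graph; each
// iteration stands in for running the chosen stage until its next
// Reserve or Commit, the only points where preemption occurs.
#include <cstdio>
#include <vector>

struct StageState {
    const char* name;
    int priority;         // static, assigned once from the graph
    int pending_packets;  // runnable while it has input waiting
};

// Pick the runnable stage with the highest static priority.
StageState* pick(std::vector<StageState>& stages) {
    StageState* best = nullptr;
    for (auto& s : stages)
        if (s.pending_packets > 0 && (!best || s.priority > best->priority))
            best = &s;
    return best;
}

int main() {
    // A toy pipeline; priorities increase toward the end of the graph so
    // downstream stages drain queues first, keeping footprint small.
    std::vector<StageState> stages = {
        {"Vertex", 0, 5}, {"Rast", 1, 2}, {"Fragment", 2, 3}, {"Blend", 3, 1},
    };
    while (StageState* s = pick(stages)) {
        std::printf("run %s\n", s->name);
        --s->pending_packets;
    }
}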
28
Performance – Results
• Utilization: 95+% for all but rasterized fairy (~80%)
• Footprint: < 600KB CPU-like, < 1.5MB GPU-like
• Surprised how well the simple scheduler worked
• Maintaining order costs footprint
Study 2: Current Multi-core CPUs
(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)
30
Reminder: Design goals
• Broad application scope
• Multi-platform applicability
• Performance: scale-out, footprint-aware
31
The Experiment
• 9 applications, 13 configurations
• One (more) architecture: multi-core x86
– It’s real (no simulation here)
– Built with pthreads, locks, and atomics
• Per-pthread task-priority-queues with work-stealing (sketched below)
– More advanced scheduling
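A sketch of per-pthread task-priority-queues with stealing, using mutexes for clarity; the names and the locking scheme are assumptions, not the runtime's implementation:

// Each worker prefers its own highest-priority task and steals from a
// victim only when its own queue is empty.
#include <array>
#include <cstdio>
#include <mutex>
#include <queue>
#include <vector>

struct Task { int stage_priority; int id; };

struct ByPriority {
    bool operator()(const Task& a, const Task& b) const {
        return a.stage_priority < b.stage_priority;  // max-heap on priority
    }
};

struct Worker {
    std::priority_queue<Task, std::vector<Task>, ByPriority> tasks;
    std::mutex m;
    void push(Task t) { std::lock_guard<std::mutex> l(m); tasks.push(t); }
    bool pop(Task& out) {
        std::lock_guard<std::mutex> l(m);
        if (tasks.empty()) return false;
        out = tasks.top();
        tasks.pop();
        return true;
    }
};

// Take from our own queue first; otherwise try to steal from the others.
template <size_t N>
bool next_task(size_t self, std::array<Worker, N>& ws, Task& out) {
    if (ws[self].pop(out)) return true;
    for (size_t v = 0; v < N; ++v)
        if (v != self && ws[v].pop(out)) return true;
    return false;
}

int main() {
    std::array<Worker, 2> workers;
    workers[0].push({/*stage_priority=*/2, /*id=*/1});
    workers[0].push({/*stage_priority=*/5, /*id=*/2});
    // Worker 1 owns no tasks, so it steals; higher priority runs first.
    Task t;
    while (next_task(1, workers, t))
        std::printf("worker 1 ran task %d (priority %d)\n", t.id, t.stage_priority);
}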
32
Scope: Application bonanza
• GRAMPS: Ray tracer (0, 1 bounce), Spheres (no rasterization, though)
• MapReduce: Hist (reduce/combine), LR (reduce/combine), PCA
• Cilk(-like): Mergesort
• CUDA: Gaussian, SRAD
• StreamIt: FM, TDE
33
Scope: Many different idioms
[Stage-and-queue graphs for FM, Merge Sort, Ray Tracer, SRAD, and MapReduce]
34
Platform: 2xQuad-core Nehalem
• Queues: copy in/out, global (shared) buffer
• Threads: user-level scheduled contexts
• Shaders: create one task per input packet
Native: 8 HyperThreaded Core i7 cores
35
Performance – Metrics (Reminder)
“Maximize machine utilization while keeping working sets small”
• Priority #1: Scale-out parallelism
• Priority #2: ‘Reasonable’ bandwidth / storage
36
Performance – Scheduling
• Static per-stage priorities (still)
• Work-stealing task-priority-queues
• Eagerly create one task per packet (naïve)
• Keep running stages until a low watermark (sketched below)
– (Limited dynamic weighting of queue depths)
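A sketch of the low-watermark rule, assuming it means "keep draining the chosen stage's input until it falls below a threshold" rather than switching after every packet; the threshold value is invented for illustration:

#include <cstdio>

int main() {
    const unsigned low_watermark = 4;   // assumed threshold
    unsigned queued = 10;               // packets waiting on the input queue
    // Drain without rescheduling until we hit the watermark.
    while (queued > low_watermark)
        std::printf("process packet (%u left)\n", --queued);
    std::printf("below watermark: yield back to the scheduler\n");
}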
37
Performance – Good Scale-out
• (Footprint: good; detail a little later)
[Chart: Parallel Speedup versus Hardware Threads]
38
Performance – Low Overheads
• ‘App’ and ‘Queue’ time are both useful work.
[Chart: Execution Time Breakdown (8 cores / 16 hyperthreads), percentage of execution]
Comparison with Other Schedulers
(with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)
40
Three archetypes
• Task-Stealing (Cilk, TBB):
– Low overhead with fine granularity tasks
– No producer-consumer, priorities, or data-parallel
• Breadth-First (CUDA, OpenCL):
– Simple scheduler (one stage at a time)
– No producer-consumer, no pipeline parallelism
• Static (StreamIt / Streaming):
– No runtime scheduler; complex schedules
– Cannot adapt to irregular workloads
41
GRAMPS is a natural framework
[Table comparing GRAMPS, Task-Stealing, Breadth-First, and Static on: Shader Support, Producer-Consumer, Structured ‘Work’, Adaptive]
42
The Experiment
• Re-use the exact same application code
• Modify the scheduler per archetype:
– Task-Stealing: Unbounded queues, no priority, (amortized) preempt to child tasks
– Breadth-First: Unbounded queues, stage at a time, top-to-bottom
– Static: Unbounded queues, offline per-thread schedule using SAS / SGMS
43
Seeing is believing (ray tracer)
[Execution traces: GRAMPS, Breadth-First, Static (SAS), Task-Stealing]
44
Comparison: Execution time
• Mostly similar: good parallelism, load balance
[Chart: Time Breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]
45
Comparison: Execution time
• Breadth-first can exhibit load-imbalance
[Chart: Time Breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]
46
Comparison: Execution time
• Task-stealing can ping-pong, cause contention
[Chart: Time Breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]
47
Comparison: Footprint
• Breadth-First is pathological (as expected)
[Chart: Relative Packet Footprint (log scale), size versus GRAMPS]
48
Footprint: GRAMPS & Task-Stealing
[Charts: Relative Packet Footprint; Relative Task Footprint]
49
Footprint: GRAMPS & Task-Stealing
GRAMPS gets insight from the graph:
• (Application-specified) queue bounds
• Group tasks by stage for priority, preemption
[Charts: MapReduce and Ray Tracer footprints]
50
Static scheduling is challenging
• Generating good Static schedules is *hard*.
• Static schedules are fragile:
– Small mismatches compound
– Hardware itself is dynamic (cache traffic, IRQs, …)
• Limited upside: dynamic scheduling is cheap!
[Charts: Execution Time; Packet Footprint]
51
Discussion (for multi-core CPUs)
• Adaptive scheduling is the obvious choice.
– Better load-balance / handling of irregularity
• Semantic insight (app graph) gives a big advantage in managing footprint.
• More cores, development maturity → more complex graphs and thus more advantage.
Conclusion
53
Contributions Revisited
• GRAMPS programming model design
– Graph of heterogeneous stages and queues
• Good results from actual implementation
– Broad scope: wide range of applications
– Multi-platform: three different architectures
– Performance: high parallelism, good footprint
54
Anecdotes and intuitions
• Structure helps: an explicit graph is handy.
• Simple (principled) dynamic scheduling works.
• Queues impedance-match heterogeneity.
• Graphs with cycles and push both paid off.
• (Also: paired instrumentation and visualization help enormously)
55
Conclusion: Future trends revisited
• Core counts are increasing
– Parallel programming models
• Memory and bandwidth are precious
– Working set, locality (i.e., footprint) management
• Power, performance driving heterogeneity
– All ‘cores’ need to communicate, interoperate
GRAMPS fits them well.
56
Thanks
• Eric, for agreeing to make this happen.
• Christos, for throwing helpers at me.
• Kurt, Mendel, and Pat, for, well, a lot.
• John Gerth, for tireless computer servitude.
• Melissa (and Heather and Ada before her)
57
Thanks
• My practice audiences
• My many collaborators
• Daniel, Kayvon, Mike, Tim
• Supporters at NVIDIA, ATI/AMD, Intel
• Supporters at VMware
• Everyone who entertained, informed, challenged me, and made me think
58
Thanks
• My funding agencies:
– Rambus Stanford Graduate Fellowship
– Department of the Army Research
– Stanford Pervasive Parallelism Laboratory
59
Q&A
• Thank you for listening!
• Questions?
Extra Material (Backup)
61
Data: CPU-Like & GPU-Like
62
Footprint Data: Native
63
Tunability
• Diagnosis:
– Raw counters, statistics, logs
– Grampsviz
• Optimize / Control (a sketch of these knobs follows below):
– Graph topology (e.g., sort-middle vs. sort-last)
– Queue watermarks (e.g., 10x win for ray tracing)
– Packet size: match SIMD widths, share data
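The knobs above, gathered into one hypothetical configuration struct; the struct, field names, and default values are assumptions for illustration, since the slide only names the knobs:

#include <cstdio>

enum class Topology { SortMiddle, SortLast };  // graph topology choice

struct TuningKnobs {
    Topology topology = Topology::SortMiddle;
    unsigned queue_watermark = 16;  // scheduler threshold; the slide reports
                                    // a 10x win from tuning it for ray tracing
    unsigned packet_size = 8;       // sized to match SIMD width / share data
};

int main() {
    TuningKnobs knobs;
    knobs.packet_size = 16;  // e.g., widen packets for a 16-wide SIMD target
    std::printf("topology=%d watermark=%u packet=%u\n",
                (int)knobs.topology, knobs.queue_watermark, knobs.packet_size);
}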
64
Tunability – Grampsviz (1)
• GPU-Like: Rasterization pipeline
65
Tunability – Grampsviz (2)
• CPU-Like: Histogram (MapReduce)
[Grampsviz screenshot; stage labels: Reduce, Combine]
66
Tunability – Knobs
• Graph topology/design:
[Figures: Sort-Middle vs. Sort-Last topologies]
• Sizing critical queues:
Alternatives
69
A few other tidbits
• In-place Shader stages / coalescing inputs
• Instanced Thread stages
• Queues as barriers / read all-at-once
Image Histogram Pipeline
70
Performance – Good Scale-out
• (Footprint: good; detail a little later)
[Chart: Parallel Speedup versus Hardware Threads]
71
Seeing is believing (ray tracer)
[Execution traces: GRAMPS, Static (SAS), Task-Stealing, Breadth-First]
72
Comparison: Execution time
• Small ‘Sched’ time, even with large graphs
[Chart: Time Breakdown (GRAMPS, Task-Stealing, Breadth-First), percentage of time]