Kunle Olukotun
EPFL – August 26, 2009
The end of uniprocessors: embedded, server, desktop
The new "Moore's Law" is 2X cores per chip every technology generation
Cores are the new GHz!
No other viable options for increasing performance
Major dislocation for the $200B software industry, which now needs parallel programming
Parallelism is not for the average programmer
Too difficult to find parallelism, debug, maintain, and get good performance for the masses
Need a solution for "Joe the programmer"
Goal: the parallel computing platform for 2015
Parallel application development practical for the masses (Joe the programmer…)
Parallel applications without parallel programming
PPL is a collaboration of
Leading Stanford researchers across multiple domains: applications, languages, software systems, architecture
Leading companies in computer systems and software: Sun, AMD, Nvidia, IBM, Intel, NEC, HP
PPL is open: any company can join; all results are in the public domain
Applications: Ron Fedkiw, Vladlen Koltun, Sebastian Thrun
Programming & software systems: Alex Aiken, Pat Hanrahan, John Ousterhout, Mendel Rosenblum
Architecture: Bill Dally, John Hennessy, Mark Horowitz, Christos Kozyrakis, Kunle Olukotun (director)
1. Finding independent tasks
2. Mapping tasks to execution units
3. Implementing synchronization (races, livelocks, deadlocks, …)
4. Composing parallel tasks
5. Recovering from HW & SW errors
6. Optimizing locality and communication
7. Predictable performance & scalability
8. … and all the sequential programming issues
Even with new tools, can Joe handle these issues?
Guiding observations
Must hide low-level issues from most programmers
No single discipline can solve all problems
Top-down research driven by applications
Core techniques
Domain specific languages (DSLs): simple & portable programs
Heterogeneous hardware: energy- and area-efficient computing
[PPL research stack diagram]
Applications: Virtual Worlds, Personal Robotics, Data Informatics, Scientific Engineering
Domain Specific Languages: Physics, Scripting, Probabilistic, Machine Learning, Rendering
DSL Infrastructure: Parallel Object Language; Common Parallel Runtime (Explicit/Static, Implicit/Dynamic)
Heterogeneous Hardware: OOO Cores, SIMD Cores, Threaded Cores, Programmable Hierarchies, Scalable Coherence, Isolation & Atomicity, Pervasive Monitoring
Leverage domain expertise at Stanford: national centers for scientific computing & CS research groups
[Diagram: PPL surrounded by existing Stanford research centers and CS research groups]
Research centers and domains: Media-X, NIH NCBC, PSAAP, environmental science, geophysics, seismic modeling
CS research groups: web/data mining, streaming DB, graphics/games, mobile HCI, AI/ML, robotics
Next-generation web platform: millions of players in vast landscapes, immersive collaboration, social gaming
Computing challenges
Client-side game engine: graphics rendering
Server-side world simulation: object scripting, geometric queries, AI, physics computation
Dynamic content, huge datasets
More at http://vw.stanford.edu/
[PPL research stack diagram (recap): Applications → Domain Specific Languages → DSL Infrastructure → Heterogeneous Hardware]
A DSL is a high-level language targeted at (and restricted to) a specific domain
E.g. SQL, Matlab, OpenGL, Ruby/Rails, …
Usually declarative and simpler than general-purpose languages
DSLs ⇒ higher productivity for the DSL user
High-level data types & operations (e.g. relations, triangles, …)
Express high-level intent without implementation artifacts
DSLs ⇒ scalable parallelism for the system (DSL developer)
Declarative description of parallelism & locality patterns
Can be ported or scaled to the available machine
Allows for domain specific optimization: automatically adjust data structures, mapping, and scheduling
Current code is in C, based on ten years of work
Limitations of the current code
Optimized for cluster/MPI; cannot run on other architectures
20% scientific code, 80% plumbing code
Minor changes in application-level code can require major architectural changes in the framework
Large learning curve for incoming students who are not parallel-computing aware
[Figure: simulation domain with regions labeled fuel injection, transition, thermal turbulence, turbulence, and combustion]
Liszt goal: simplify the code of mesh-based PDE solvers
Write once, run on any type of parallel machine, from multi-cores and GPUs to clusters
Language features
Built-in mesh data types: vertex, edge, face, cell
Collections of mesh elements: cell.faces(), face.edgesCCW()
Mesh-based data storage: fields, sparse matrices
val position = vertexProperty[double3]("pos")
val A = new SparseMatrix[Vertex,Vertex]

for (c <- mesh.cells) {
  val center = average position of c.vertices
  for (f <- c.faces) {
    val face_dx = average position of f.vertices - center
    for (e <- f.edges With c CounterClockwise) {
      val v0 = e.tail
      val v1 = e.head
      val v0_dx = position(v0) - center
      val v1_dx = position(v1) - center
      val face_normal = v0_dx cross v1_dx
      // calculate flux for face …
      A(v0,v1) += …
      A(v1,v0) -= …
    }
  }
}
High-level data types & operations
Explicit parallelism using map/reduce/forall
Implicit parallelism with help from the DSL & HW
No low-level code to manage parallelism
Liszt compiler & runtime manage parallel execution Data layout & access, domain decomposition, communication, …
Domain specific optimizations
Select mesh layout (grid, tetrahedral, unstructured, custom, …)
Select a decomposition that improves locality of access
Optimize the communication strategy across iterations
Optimizations are possible because
Mesh semantics are visible to the compiler & runtime
Programs are iterative, with data accesses based on mesh topology
Mesh topology is known to the runtime
[PPL research stack diagram (recap): Applications → Domain Specific Languages → DSL Infrastructure → Heterogeneous Hardware]
Provide a shared framework that simplifies DSL development
Features Common parallel language that retains DSL semantics
Mechanism to express domain specific optimizations
Static compilation + dynamic management environment For regular and unpredictable patterns respectively
Synthesize HW features into high-level solutions E.g. from HW messaging to fast runtime for fine-grain tasks
Exploit heterogeneous hardware to improve efficiency
Required features
Support for functional programming (FP)
Declarative programming style for portable parallelism
Higher-order functions allow parallel control structures
Support for object-oriented programming (OOP)
Familiar model for complex programs
Allows mutable data structures and domain-specific attributes
Managed execution environment
For runtime optimizations & automated memory management
Our approach: embed DSLs in the Scala language (a small sketch of what embedding can look like follows)
Supports both FP and OOP features
Supports embedding of higher-level abstractions
Compiles to Java bytecode
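For illustration only, here is a minimal sketch (not PPL code; the ToyVectorDSL name and Vec type are made up) of why Scala is a convenient host for embedded DSLs: operator overloading, implicits, and higher-order functions let domain syntax read naturally while remaining an ordinary library that compiles to Java bytecode.

object ToyVectorDSL {
  // Toy domain type: element-wise vector math expressed with math-like syntax.
  case class Vec(data: Array[Double]) {
    def +(that: Vec): Vec = Vec(data.zip(that.data).map { case (a, b) => a + b })
    def *(s: Double): Vec = Vec(data.map(_ * s))
    def sum: Double = data.sum
  }

  // Implicit class lets a plain scalar appear on the left: 2.0 * v
  implicit class ScalarOps(s: Double) {
    def *(v: Vec): Vec = v * s
  }

  def main(args: Array[String]): Unit = {
    val x = Vec(Array(1.0, 2.0, 3.0))
    val y = Vec(Array(4.0, 5.0, 6.0))
    // The user states intent; the DSL library decides how the operators execute.
    val z = 2.0 * x + y
    println(z.sum)  // 27.0
  }
}

The point of the sketch: the expression 2.0 * x + y carries only domain intent, so an embedded DSL is free to change the execution strategy behind those operators without changing user code.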
Delite: Domain Extracted Locality Informed Task Execution
An infrastructure for building implicitly parallel DSLs
Enables task-based and data-level parallelism
Allows for dataflow analysis and optimizations
Allows for domain specific optimizations
Targets heterogeneous compute environments
[Diagram: the application calls Matrix DSL methods on a matrix x and a vector y; the DSL defers OP execution to Delite as op nodes such as OP_add; Delite applies generic & domain transformations and generates the mapping to hardware]
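To make the deferral step concrete, here is a minimal sketch (hypothetical types, not the actual Delite interface) of a DSL whose + operator records an OP_add-style node instead of computing immediately, so a runtime can analyze and schedule the resulting op graph.

object DeferredOpsSketch {
  sealed trait Op { def run(): Array[Double] }

  case class Const(values: Array[Double]) extends Op {
    def run(): Array[Double] = values
  }

  // Stand-in for OP_add: records its inputs; work happens only when run() is called.
  case class OpAdd(a: Op, b: Op) extends Op {
    def run(): Array[Double] = {
      val (x, y) = (a.run(), b.run())
      x.zip(y).map { case (u, v) => u + v }
    }
  }

  // The "DSL" type: + defers execution by building an op node.
  class DslVector(val op: Op) {
    def +(that: DslVector): DslVector = new DslVector(OpAdd(this.op, that.op))
    def force(): Array[Double] = op.run()  // stand-in for the runtime's scheduler
  }

  def main(args: Array[String]): Unit = {
    val x = new DslVector(Const(Array(1.0, 2.0)))
    val y = new DslVector(Const(Array(3.0, 4.0)))
    val z = x + y                          // builds OpAdd; no work done yet
    println(z.force().mkString(", "))      // 4.0, 5.0
  }
}

In a real system the runtime would see the whole op graph before execution, which is what enables the generic and domain-specific transformations mentioned above.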
[Plot: speedup of Gaussian Discriminant Analysis vs. execution cores (1-128) for Original, + Domain Optimizations, and + Data Parallelism; the original version shows low speedup due to loop dependencies]
[Plot (same axes): domain information is used to refactor dependencies and to fuse operations]
[Plot (same axes): data parallelism is exploited within tasks]
Dealing with mutable state: histogram example
Delite currently serializes operations on mutable state
Optimistic concurrency techniques can allow operations to proceed in parallel (see the sketch below); especially useful for collection types

for (word <- words) {
  wordFrequencies += word
}
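A minimal sketch, assuming a ConcurrentHashMap-backed histogram (this is not Delite code), of one optimistic approach: per-key compare-and-swap lets concurrent word counts proceed in parallel instead of serializing every update to the collection.

import java.util.concurrent.ConcurrentHashMap

object HistogramSketch {
  val wordFrequencies = new ConcurrentHashMap[String, Integer]()

  // Optimistically read the current count and retry only if another
  // thread changed the same key in the meantime.
  def record(word: String): Unit = {
    var done = false
    while (!done) {
      val cur = wordFrequencies.get(word)
      done =
        if (cur == null) wordFrequencies.putIfAbsent(word, 1) == null
        else wordFrequencies.replace(word, cur, cur + 1)
    }
  }

  def main(args: Array[String]): Unit = {
    val words = Seq("to", "be", "or", "not", "to", "be")
    words.par.foreach(record)   // updates to different words never conflict
    println(wordFrequencies)    // e.g. {not=1, be=2, or=1, to=2}
  }
}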
Optimistic scalable concurrent collections
TM is a natural way to interact with concurrent collections
Many ad-hoc transactions are already provided: putIfAbsent (check, put), replace (delete, put), …
Conflict detection on collections should be semantic
Fewer conflicts ⇒ better scalability
Transactional collection classes (Carlstrom, PPoPP'07): the data structure doesn't have to be implemented with TM
Transactional boosting (Herlihy, PPoPP’08)
STMs that manage all reads and writes are not yet ready for widespread adoption
Simple semantics, scalability, performance: pick 1½
Give up strong isolation for performance
The Key Idea
Library-only STM that operates on explicitly transactional types (TVar)
Like Haskell, Clojure, and djspiewak's Scala STM
Extra hooks for semantic conflict detection: unrecorded reads, callbacks
"Pre-boosted" transactional collections
Internals based on concurrent collection classes
TMap, TSet, TSortedMap, TVector, …
Per-transaction views that implement Map, Set, …
Non-transactional view provides atomicity and isolation for single operations
TMap example: interface

trait TMap[A, B] {
  def get(key: A)(implicit txn: Txn): Option[B]
  ..
  def bind(implicit txn: Txn): Map[A, B]
  def nonTxn: Map[A, B]
}

TMap.get requires a txn, enforced by the compiler
TMap.bind returns a view bound to a txn
TMap.nonTxn returns a non-transactional view; single operations are still atomic
TMap example: usage

def foo(m: Map[String, Int], k: String) = m.get(k)

def bar(m: TMap[String, Int]) {
  for (implicit txn <- atomic) {              // proposed syntax for transactions
    val x = m.get("direct lookup")
    val y = foo(m.bind, "lookup via txn Map view")
  }
  val z = foo(m.nonTxn, "lookup outside transaction")
}

m.get and the call to foo are within the same transaction
foo's call to get is atomic and isolated with respect to other txns
Other Optimistic Concurrency Use Cases
Concurrent relaxed-balanced search tree with lazy copy-on-write
Linearizable clone() and size() in O(# concurrent writers)
TIntCounter: a specialized mutable integer for highly contended uses
Increments and other ops are delayed to reduce conflicts; internally striped (see the sketch below)
T*FieldUpdater, modeled on Java's Atomic*FieldUpdater
Transactional access to fields in user objects
Sugared with an annotation + a Scala compiler plugin?
Reduces the storage cost of the object
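A minimal sketch (not the actual TIntCounter, and without its transactional hooks) of the striping idea behind it: increments land on one of several independent slots so highly contended writers rarely touch the same location, and reads sum the slots.

import java.util.concurrent.atomic.AtomicLongArray
import java.util.concurrent.ThreadLocalRandom

class StripedCounter(stripes: Int = 16) {
  private val slots = new AtomicLongArray(stripes)

  // Each increment picks a stripe at random, so concurrent writers
  // usually update different slots and do not conflict.
  def increment(): Unit = {
    val i = ThreadLocalRandom.current().nextInt(stripes)
    slots.incrementAndGet(i)
  }

  // Reads pay the cost of summing the stripes; fine for counters that
  // are written far more often than they are read.
  def get: Long = (0 until slots.length).map(i => slots.get(i)).sum
}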
[PPL research stack diagram (recap): Applications → Domain Specific Languages → DSL Infrastructure → Heterogeneous Hardware]
Heterogeneous HW for energy & area efficiency
ILP, threads, data-parallel engines, custom engines
Q: what is the right balance of features in a chip?
Q: what tools can generate the best chip for an app/domain?
H.264 encode study
[Charts: H.264 encode study comparing a 4-core baseline with added ILP, SIMD, and custom instructions against an ASIC; one chart shows area (normalized, roughly 1.0-3.0) and the other shows performance and energy savings on a log scale (1-1000)]
Revisit architectural support for parallelism
Which are the basic HW primitives needed?
Challenges: semantics, implementation, scalability, virtualization, interactions, granularity (fine-grain & bulk), …
HW primitives: coherence & consistency, atomicity & isolation, memory partitioning, data and control messaging, event monitoring
Runtime synthesizes primitives into SW solutions
Streaming system: memory partitioning + bulk data messaging
TLS: isolation + fine-grain control communication
Transactional memory: atomicity + isolation + consistency
Security: memory partitioning + isolation
Fault tolerance: isolation + checkpoint + bulk data messaging
Parallel tasks with a few thousand instructions
Critical to exploit in large-scale chips
Tradeoff: load balance vs. overheads vs. locality
Software-only scheduling (sketched below)
Per-thread task queues + task stealing
Flexible algorithms, but high stealing overheads
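A minimal sketch (not the PPL runtime) of software-only scheduling with per-thread task queues and stealing: each worker pops from its own deque and, when it runs dry, steals from the opposite end of a random victim. The stealing path is where the software-only overheads come from.

import java.util.concurrent.{ConcurrentLinkedDeque, ThreadLocalRandom}

class WorkStealingPool(numWorkers: Int) {
  // One deque per worker: owners push/pop at the head, thieves take from the tail.
  private val queues = Array.fill(numWorkers)(new ConcurrentLinkedDeque[Runnable]())

  def submit(worker: Int, task: Runnable): Unit = queues(worker).addFirst(task)

  private def nextTask(self: Int): Runnable = {
    val own = queues(self).pollFirst()
    if (own != null) own
    else {
      val victim = ThreadLocalRandom.current().nextInt(numWorkers)
      if (victim == self) null else queues(victim).pollLast()   // the (expensive) steal
    }
  }

  // Simplified worker loop: a real runtime would back off and retry when idle.
  def runWorker(self: Int): Unit = {
    var task = nextTask(self)
    while (task != null) {
      task.run()
      task = nextTask(self)
    }
  }
}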
[Chart: execution time breakdown (running / overhead / idle) for SW-only, HW-only, and ADM scheduling of cg and gtfold at 32, 64, and 128 cores. With SW-only scheduling, overheads and scheduling get worse at larger scales.]
Hardware-only scheduling
HW task queues + a HW stealing protocol
Minimal overheads (bypass the coherence protocol), but the scheduling algorithm is fixed
The optimal approach varies across applications; impractical to support all options in HW
[Chart (same breakdown): the wrong scheduling algorithm makes HW-only scheduling slower than SW-only.]
Simple HW feature: asynchronous direct messages (ADM)
Register-to-register messages, received with a user-level interrupt
Fast messaging for SW schedulers with flexible algorithms (a sketch follows below)
E.g., the gtfold scheduler tracks domain-specific dependencies
Also useful for fast barriers, reductions, IPC, …
Better performance, simpler HW, more flexibility & uses
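A minimal sketch of how a software scheduler could use ADM-style messaging. AdmPort is a hypothetical stand-in for the proposed hardware primitive (asynchronous register-to-register sends, with arrival delivered to a user-level handler), not a real API; the point is that the stealing protocol stays in software, so the policy remains flexible.

import java.util.concurrent.ConcurrentLinkedDeque

// Hypothetical messaging primitive: not a real interface.
trait AdmPort {
  def send(dest: Int, msg: Long): Unit                 // asynchronous, register-to-register
  def onReceive(handler: (Int, Long) => Unit): Unit    // invoked as a user-level handler
}

class AdmStealScheduler(self: Int, port: AdmPort) {
  private val localQueue = new ConcurrentLinkedDeque[Runnable]()
  private val StealRequest = 1L

  // The reply policy is ordinary software, so it can consult
  // domain-specific information (as the gtfold scheduler does).
  port.onReceive { (from, msg) =>
    if (msg == StealRequest && !localQueue.isEmpty) {
      val task = localQueue.pollLast()
      if (task != null) port.send(from, taskId(task))  // hand a task id back to the thief
    }
  }

  def requestWorkFrom(victim: Int): Unit = port.send(victim, StealRequest)

  private def taskId(t: Runnable): Long = t.hashCode().toLong
}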
[Chart (same breakdown): ADM provides scalable scheduling without heavy-weight HW.]
Goal: make parallel computing practical for the masses
Technical approach
Domain specific languages (DSLs): simple & portable programs
Heterogeneous hardware: energy- and area-efficient computing
Working on the SW & HW techniques that bridge them
More info at: http://ppl.stanford.edu