Kunle Olukotun
EPFL – August 26, 2009
The end of uniprocessors: embedded, server, desktop
The new "Moore's Law" is 2X cores per chip every technology generation
Cores are the new GHz!
No other viable options for increasing performance
Major dislocation for the $200B software industry, which now needs parallel programming
Parallelism is not for the average programmer
Too difficult to find parallelism, debug, maintain, and get good performance for the masses
Need a solution for "Joe the programmer"
Goal: the parallel computing platform for 2015
Parallel application development practical for the masses (Joe the programmer…)
Parallel applications without parallel programming
PPL is a collaboration of
Leading Stanford researchers across multiple domains: applications, languages, software systems, architecture
Leading companies in computer systems and software: Sun, AMD, Nvidia, IBM, Intel, NEC, HP
PPL is open: any company can join; all results are in the public domain
Applications: Ron Fedkiw, Vladlen Koltun, Sebastian Thrun
Programming & software systems: Alex Aiken, Pat Hanrahan, John Ousterhout, Mendel Rosenblum
Architecture: Bill Dally, John Hennessy, Mark Horowitz, Christos Kozyrakis, Kunle Olukotun (director)
1. Finding independent tasks
2. Mapping tasks to execution units
3. Implementing synchronization (races, livelocks, deadlocks, …)
4. Composing parallel tasks
5. Recovering from HW & SW errors
6. Optimizing locality and communication
7. Predictable performance & scalability
8. … and all the sequential programming issues
Even with new tools, can Joe handle these issues?
Guiding observations
Must hide low-level issues from most programmers
No single discipline can solve all problems
Top-down research driven by applications
Core techniques
Domain specific languages (DSLs): simple & portable programs
Heterogeneous hardware: energy- and area-efficient computing
[PPL research stack diagram]
Applications: Virtual Worlds, Personal Robotics, Data Informatics, Scientific Engineering
Domain Specific Languages: Physics, Scripting, Probabilistic, Machine Learning, Rendering
DSL Infrastructure: Parallel Object Language; Common Parallel Runtime (Explicit/Static, Implicit/Dynamic)
Heterogeneous Hardware: OOO Cores, SIMD Cores, Threaded Cores, Programmable Hierarchies, Scalable Coherence, Isolation & Atomicity, Pervasive Monitoring
Leverage domain expertise at Stanford: national centers for scientific computing & CS research groups
[Diagram: PPL surrounded by existing Stanford research centers and CS research groups]
Research centers and domains: Media-X, NIH NCBC, PSAAP, environmental science, geophysics, seismic modeling
CS research groups: web/data mining, streaming DB, graphics/games, mobile HCI, AI/ML, robotics
Next-generation web platform: millions of players in vast landscapes, immersive collaboration, social gaming
Computing challenges
Client-side game engine: graphics rendering
Server-side world simulation: object scripting, geometric queries, AI, physics computation
Dynamic content, huge datasets
More at http://vw.stanford.edu/
[PPL research stack diagram (recap): Applications → Domain Specific Languages → DSL Infrastructure → Heterogeneous Hardware]
A DSL is a high-level language targeted at (and restricted to) a specific domain
E.g. SQL, Matlab, OpenGL, Ruby/Rails, …
Usually declarative and simpler than general-purpose languages
DSLs ⇒ higher productivity for the DSL user
High-level data types & operations (e.g. relations, triangles, …)
Express high-level intent without implementation artifacts
DSLs ⇒ scalable parallelism for the system (DSL developer)
Declarative description of parallelism & locality patterns
Can be ported or scaled to the available machine
Allows for domain specific optimization: automatically adjust data structures, mapping, and scheduling
Current code is in C, based on ten years of work
Limitations of the current code
Optimized for cluster/MPI; cannot run on other architectures
20% scientific code, 80% plumbing code
Minor changes in application-level code can require major architectural changes in the framework
Large learning curve for incoming students who are not parallel-computing aware
[Figure: simulation domain with regions labeled fuel injection, transition, thermal turbulence, turbulence, and combustion]
Liszt goal: simplify the code of mesh-based PDE solvers
Write once, run on any type of parallel machine, from multi-cores and GPUs to clusters
Language features
Built-in mesh data types: vertex, edge, face, cell
Collections of mesh elements: cell.faces(), face.edgesCCW()
Mesh-based data storage: fields, sparse matrices
val position = vertexProperty[double3]("pos")
val A = new SparseMatrix[Vertex,Vertex]

for (c <- mesh.cells) {
  val center = average position of c.vertices
  for (f <- c.faces) {
    val face_dx = average position of f.vertices - center
    for (e <- f.edges With c CounterClockwise) {
      val v0 = e.tail
      val v1 = e.head
      val v0_dx = position(v0) - center
      val v1_dx = position(v1) - center
      val face_normal = v0_dx cross v1_dx
      // calculate flux for face …
      A(v0,v1) += …
      A(v1,v0) -= …
    }
  }
}
High-level data types & operations
Explicit parallelism using map/reduce/forall
Implicit parallelism with help from the DSL & HW
No low-level code to manage parallelism
Liszt compiler & runtime manage parallel execution Data layout & access, domain decomposition, communication, …
Domain specific optimizations
Select mesh layout (grid, tetrahedral, unstructured, custom, …)
Select a decomposition that improves locality of access
Optimize the communication strategy across iterations
Optimizations are possible because
Mesh semantics are visible to the compiler & runtime
Programs are iterative, with data accesses based on mesh topology
Mesh topology is known to the runtime
[PPL research stack diagram (recap): Applications → Domain Specific Languages → DSL Infrastructure → Heterogeneous Hardware]
Provide a shared framework that simplifies DSL development
Features Common parallel language that retains DSL semantics
Mechanism to express domain specific optimizations
Static compilation + dynamic management environment For regular and unpredictable patterns respectively
Synthesize HW features into high-level solutions E.g. from HW messaging to fast runtime for fine-grain tasks
Exploit heterogeneous hardware to improve efficiency
Required features
Support for functional programming (FP)
Declarative programming style for portable parallelism
Higher-order functions allow parallel control structures
Support for object-oriented programming (OOP)
Familiar model for complex programs
Allows mutable data structures and domain-specific attributes
Managed execution environment
For runtime optimizations & automated memory management
Our approach: embed DSLs in the Scala language (a small sketch of what embedding can look like follows)
Supports both FP and OOP features
Supports embedding of higher-level abstractions
Compiles to Java bytecode
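For illustration only, here is a minimal sketch (not PPL code; the ToyVectorDSL name and Vec type are made up) of why Scala is a convenient host for embedded DSLs: operator overloading, implicits, and higher-order functions let domain syntax read naturally while remaining an ordinary library that compiles to Java bytecode.

object ToyVectorDSL {
  // Toy domain type: element-wise vector math expressed with math-like syntax.
  case class Vec(data: Array[Double]) {
    def +(that: Vec): Vec = Vec(data.zip(that.data).map { case (a, b) => a + b })
    def *(s: Double): Vec = Vec(data.map(_ * s))
    def sum: Double = data.sum
  }

  // Implicit class lets a plain scalar appear on the left: 2.0 * v
  implicit class ScalarOps(s: Double) {
    def *(v: Vec): Vec = v * s
  }

  def main(args: Array[String]): Unit = {
    val x = Vec(Array(1.0, 2.0, 3.0))
    val y = Vec(Array(4.0, 5.0, 6.0))
    // The user states intent; the DSL library decides how the operators execute.
    val z = 2.0 * x + y
    println(z.sum)  // 27.0
  }
}

The point of the sketch: the expression 2.0 * x + y carries only domain intent, so an embedded DSL is free to change the execution strategy behind those operators without changing user code.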
Delite: Domain Extracted Locality Informed Task Execution
An infrastructure for building implicitly parallel DSLs
Enables task-based and data-level parallelism
Allows for dataflow analysis and optimizations
Allows for domain specific optimizations
Targets heterogeneous compute environments
[Diagram: the application calls Matrix DSL methods on a matrix x and a vector y; the DSL defers OP execution to Delite as op nodes such as OP_add; Delite applies generic & domain transformations and generates the mapping to hardware]
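To make the deferral step concrete, here is a minimal sketch (hypothetical types, not the actual Delite interface) of a DSL whose + operator records an OP_add-style node instead of computing immediately, so a runtime can analyze and schedule the resulting op graph.

object DeferredOpsSketch {
  sealed trait Op { def run(): Array[Double] }

  case class Const(values: Array[Double]) extends Op {
    def run(): Array[Double] = values
  }

  // Stand-in for OP_add: records its inputs; work happens only when run() is called.
  case class OpAdd(a: Op, b: Op) extends Op {
    def run(): Array[Double] = {
      val (x, y) = (a.run(), b.run())
      x.zip(y).map { case (u, v) => u + v }
    }
  }

  // The "DSL" type: + defers execution by building an op node.
  class DslVector(val op: Op) {
    def +(that: DslVector): DslVector = new DslVector(OpAdd(this.op, that.op))
    def force(): Array[Double] = op.run()  // stand-in for the runtime's scheduler
  }

  def main(args: Array[String]): Unit = {
    val x = new DslVector(Const(Array(1.0, 2.0)))
    val y = new DslVector(Const(Array(3.0, 4.0)))
    val z = x + y                          // builds OpAdd; no work done yet
    println(z.force().mkString(", "))      // 4.0, 5.0
  }
}

In a real system the runtime would see the whole op graph before execution, which is what enables the generic and domain-specific transformations mentioned above.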
[Plot: speedup of Gaussian Discriminant Analysis vs. execution cores (1-128) for Original, + Domain Optimizations, and + Data Parallelism; the original version shows low speedup due to loop dependencies]
[Plot (same axes): domain information is used to refactor dependencies and to fuse operations]
[Plot (same axes): data parallelism is exploited within tasks]
Dealing with mutable state: histogram example
Delite currently serializes operations on mutable state
Optimistic concurrency techniques can allow operations to proceed in parallel (see the sketch below); especially useful for collection types

for (word <- words) {
  wordFrequencies += word
}
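A minimal sketch, assuming a ConcurrentHashMap-backed histogram (this is not Delite code), of one optimistic approach: per-key compare-and-swap lets concurrent word counts proceed in parallel instead of serializing every update to the collection.

import java.util.concurrent.ConcurrentHashMap

object HistogramSketch {
  val wordFrequencies = new ConcurrentHashMap[String, Integer]()

  // Optimistically read the current count and retry only if another
  // thread changed the same key in the meantime.
  def record(word: String): Unit = {
    var done = false
    while (!done) {
      val cur = wordFrequencies.get(word)
      done =
        if (cur == null) wordFrequencies.putIfAbsent(word, 1) == null
        else wordFrequencies.replace(word, cur, cur + 1)
    }
  }

  def main(args: Array[String]): Unit = {
    val words = Seq("to", "be", "or", "not", "to", "be")
    words.par.foreach(record)   // updates to different words never conflict
    println(wordFrequencies)    // e.g. {not=1, be=2, or=1, to=2}
  }
}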
Optimistic scalable concurrent collections
TM is a natural way to interact with concurrent collections
Many ad-hoc transactions are already provided: putIfAbsent (check, put), replace (delete, put), …
Conflict detection on collections should be semantic
Fewer conflicts ⇒ better scalability
Transactional collection classes (Carlstrom, PPoPP'07): the data structure doesn't have to be implemented with TM
Transactional boosting (Herlihy, PPoPP’08)
STMs that manage all reads and writes are not yet ready for widespread adoption
Simple semantics, scalability, performance: pick 1½
Give up strong isolation for performance
The Key Idea
Library-only STM that operates on explicitly transactional types (TVar)
Like Haskell, Clojure, and djspiewak's Scala STM
Extra hooks for semantic conflict detection: unrecorded reads, callbacks
"Pre-boosted" transactional collections
Internals based on concurrent collection classes
TMap, TSet, TSortedMap, TVector, …
Per-transaction views that implement Map, Set, …
Non-transactional view provides atomicity and isolation for single operations
TMap example: interface

trait TMap[A, B] {
  def get(key: A)(implicit txn: Txn): Option[B]
  ..
  def bind(implicit txn: Txn): Map[A, B]
  def nonTxn: Map[A, B]
}

TMap.get requires a txn, enforced by the compiler
TMap.bind returns a view bound to a txn
TMap.nonTxn returns a non-transactional view; single operations are still atomic
TMap example: usage

def foo(m: Map[String, Int], k: String) = m.get(k)

def bar(m: TMap[String, Int]) {
  for (implicit txn <- atomic) {              // proposed syntax for transactions
    val x = m.get("direct lookup")
    val y = foo(m.bind, "lookup via txn Map view")
  }
  val z = foo(m.nonTxn, "lookup outside transaction")
}

m.get and the call to foo are within the same transaction
foo's call to get is atomic and isolated with respect to other txns
Other Optimistic Concurrency Use Cases
Concurrent relaxed-balanced search tree with lazy copy-on-write
Linearizable clone() and size() in O(# concurrent writers)
TIntCounter: a specialized mutable integer for highly contended uses
Increments and other ops are delayed to reduce conflicts; internally striped (see the sketch below)
T*FieldUpdater, modeled on Java's Atomic*FieldUpdater
Transactional access to fields in user objects
Sugared with an annotation + a Scala compiler plugin?
Reduces the storage cost of the object
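A minimal sketch (not the actual TIntCounter, and without its transactional hooks) of the striping idea behind it: increments land on one of several independent slots so highly contended writers rarely touch the same location, and reads sum the slots.

import java.util.concurrent.atomic.AtomicLongArray
import java.util.concurrent.ThreadLocalRandom

class StripedCounter(stripes: Int = 16) {
  private val slots = new AtomicLongArray(stripes)

  // Each increment picks a stripe at random, so concurrent writers
  // usually update different slots and do not conflict.
  def increment(): Unit = {
    val i = ThreadLocalRandom.current().nextInt(stripes)
    slots.incrementAndGet(i)
  }

  // Reads pay the cost of summing the stripes; fine for counters that
  // are written far more often than they are read.
  def get: Long = (0 until slots.length).map(i => slots.get(i)).sum
}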
[PPL research stack diagram (recap): Applications → Domain Specific Languages → DSL Infrastructure → Heterogeneous Hardware]
Heterogeneous HW for energy & area efficiency
ILP, threads, data-parallel engines, custom engines
Q: what is the right balance of features in a chip?
Q: what tools can generate the best chip for an app/domain?
H.264 encode study
[Charts: H.264 encode study comparing a 4-core baseline with added ILP, SIMD, and custom instructions against an ASIC; one chart shows area (normalized, roughly 1.0-3.0) and the other shows performance and energy savings on a log scale (1-1000)]
Revisit architectural support for parallelism
Which are the basic HW primitives needed?
Challenges: semantics, implementation, scalability, virtualization, interactions, granularity (fine-grain & bulk), …
HW primitives: coherence & consistency, atomicity & isolation, memory partitioning, data and control messaging, event monitoring
Runtime synthesizes primitives into SW solutions
Streaming system: memory partitioning + bulk data messaging
TLS: isolation + fine-grain control communication
Transactional memory: atomicity + isolation + consistency
Security: memory partitioning + isolation
Fault tolerance: isolation + checkpoint + bulk data messaging
Parallel tasks with a few thousand instructions
Critical to exploit in large-scale chips
Tradeoff: load balance vs. overheads vs. locality
Software-only scheduling (sketched below)
Per-thread task queues + task stealing
Flexible algorithms, but high stealing overheads
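A minimal sketch (not the PPL runtime) of software-only scheduling with per-thread task queues and stealing: each worker pops from its own deque and, when it runs dry, steals from the opposite end of a random victim. The stealing path is where the software-only overheads come from.

import java.util.concurrent.{ConcurrentLinkedDeque, ThreadLocalRandom}

class WorkStealingPool(numWorkers: Int) {
  // One deque per worker: owners push/pop at the head, thieves take from the tail.
  private val queues = Array.fill(numWorkers)(new ConcurrentLinkedDeque[Runnable]())

  def submit(worker: Int, task: Runnable): Unit = queues(worker).addFirst(task)

  private def nextTask(self: Int): Runnable = {
    val own = queues(self).pollFirst()
    if (own != null) own
    else {
      val victim = ThreadLocalRandom.current().nextInt(numWorkers)
      if (victim == self) null else queues(victim).pollLast()   // the (expensive) steal
    }
  }

  // Simplified worker loop: a real runtime would back off and retry when idle.
  def runWorker(self: Int): Unit = {
    var task = nextTask(self)
    while (task != null) {
      task.run()
      task = nextTask(self)
    }
  }
}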
[Chart: execution time breakdown (running / overhead / idle) for SW-only, HW-only, and ADM scheduling of cg and gtfold at 32, 64, and 128 cores. With SW-only scheduling, overheads and scheduling get worse at larger scales.]
Hardware-only scheduling
HW task queues + a HW stealing protocol
Minimal overheads (bypass the coherence protocol), but the scheduling algorithm is fixed
The optimal approach varies across applications; impractical to support all options in HW
[Chart (same breakdown): the wrong scheduling algorithm makes HW-only scheduling slower than SW-only.]
Simple HW feature: asynchronous direct messages (ADM)
Register-to-register messages, received with a user-level interrupt
Fast messaging for SW schedulers with flexible algorithms (a sketch follows below)
E.g., the gtfold scheduler tracks domain-specific dependencies
Also useful for fast barriers, reductions, IPC, …
Better performance, simpler HW, more flexibility & uses
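A minimal sketch of how a software scheduler could use ADM-style messaging. AdmPort is a hypothetical stand-in for the proposed hardware primitive (asynchronous register-to-register sends, with arrival delivered to a user-level handler), not a real API; the point is that the stealing protocol stays in software, so the policy remains flexible.

import java.util.concurrent.ConcurrentLinkedDeque

// Hypothetical messaging primitive: not a real interface.
trait AdmPort {
  def send(dest: Int, msg: Long): Unit                 // asynchronous, register-to-register
  def onReceive(handler: (Int, Long) => Unit): Unit    // invoked as a user-level handler
}

class AdmStealScheduler(self: Int, port: AdmPort) {
  private val localQueue = new ConcurrentLinkedDeque[Runnable]()
  private val StealRequest = 1L

  // The reply policy is ordinary software, so it can consult
  // domain-specific information (as the gtfold scheduler does).
  port.onReceive { (from, msg) =>
    if (msg == StealRequest && !localQueue.isEmpty) {
      val task = localQueue.pollLast()
      if (task != null) port.send(from, taskId(task))  // hand a task id back to the thief
    }
  }

  def requestWorkFrom(victim: Int): Unit = port.send(victim, StealRequest)

  private def taskId(t: Runnable): Long = t.hashCode().toLong
}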
[Chart (same breakdown): ADM provides scalable scheduling without heavy-weight HW.]
Goal: make parallel computing practical for the masses
Technical approach
Domain specific languages (DSLs): simple & portable programs
Heterogeneous hardware: energy- and area-efficient computing
Working on the SW & HW techniques that bridge them
More info at: http://ppl.stanford.edu