BERKELEY PAR LAB
Efficiency Programming for the (Productive) Masses
Armando Fox, Bryan Catanzaro, Shoaib Kamil, Yunsup Lee, Ben Carpenter, Erin Carson,
Krste Asanovic, Dave Patterson, Kurt Keutzer
UC Berkeley Parallel Computing Lab/UPCRC
Make productivity programmers efficient, and efficiency programmers productive?
Productivity level language (PLL): Python, Ruby
  • high-level abstractions well-matched to application domain => 5x faster development and 3-10x fewer lines of code
  • >90% of programmers
Efficiency level language (ELL): C/C++, CUDA, OpenCL
  • >5x longer development time
  • potential 10x-100x performance by exposing HW model
  • <10% of programmers, yet their work is poorly reused
5x development time, 10x-100x performance!
Raise level of abstraction and get performance?
Capture patterns instead of “domains”?
Efficiency programmers know how to target computation patterns to hardware
  • stencil/SIMD codes => GPUs
  • sparse matrix => communication-avoiding algorithms on multicore
  • “Big finance” Monte Carlo sim => MapReduce
Libraries? Useful, but don’t raise abstraction level
How to make ELL work accessible to more PLL programmers?
“Stovepipes”: Connect Pattern to Platform
[Figure: two software stacks side by side. Traditional layers, top to bottom: app domains (virtual worlds, data viz., robotics, music); computation domains (rendering, probabilistic, physics, lin. alg.); common language substrate; thick runtime & OS; hardware (OOO, GPU, SIMD, FPGA, Cloud). “Stovepipes”, top to bottom: applications (virtual worlds, data viz., robotics, music); motifs/patterns (sparse matrix, dense matrix, stencil); thin runtime & OS; hardware (OOO, GPU, SIMD, FPGA, Cloud). Each stovepipe connects one pattern to one platform, e.g. dense to GPGPU, stencil to SIMD, stencil to FPGA, dense to OoO.]
Humans must produce these stovepipes
SEJITS: Selective, Embedded Just-in-Time Specialization
Productivity programmers write in general purpose, modern, high level PLL
SEJITS infrastructure specializes computation patterns selectively at runtime
Specialization uses runtime info to generate and JIT-compile ELL code targeted to hardware
Embedded because the PLL’s own machinery enables specialization (vs. extending the PLL interpreter)
Specifically...
When a “specializable” function is called:
  • determine if a specializer is available for the current platform
  • if no: continue executing normally in PLL
If a specializer is found, it can:
  • manipulate/traverse the AST of the function
  • emit & JIT-compile ELL source code
  • dynamically link compiled code to the PLL interpreter
Specializers written in PLL
Necessary features present in modern PLL’s, but absent from older widely-used PLL’s
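The call-time decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SPECIALIZERS` registry, `detect_platform`, the `specializable` decorator, and the stand-in `dot_specializer` are hypothetical names, not part of any SEJITS implementation, and the stand-in specializer returns another Python callable instead of emitting ELL code.

```python
import functools
import math
import operator

# Hypothetical registry: (function name, platform) -> specializer.
SPECIALIZERS = {}

def detect_platform():
    """Placeholder probe; a real system would inspect the hardware."""
    return "multicore"

def specializable(func):
    """Mark func as specializable; it remains ordinary PLL code."""
    compiled = {}  # per-platform cache: specialized callable, or None

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        platform = detect_platform()
        if platform not in compiled:
            specializer = SPECIALIZERS.get((func.__name__, platform))
            compiled[platform] = specializer(func) if specializer else None
        target = compiled[platform] or func  # no specializer: run in the PLL
        wrapper.specialized = compiled[platform] is not None
        return target(*args, **kwargs)

    wrapper.specialized = False
    return wrapper

def dot_specializer(func):
    """Stand-in: a real specializer would emit and JIT-compile ELL code."""
    return lambda xs, ys: math.fsum(map(operator.mul, xs, ys))

SPECIALIZERS[("dot", "multicore")] = dot_specializer

@specializable
def dot(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

@specializable
def total(xs):
    return sum(xs)

print(dot([1, 2, 3], [4, 5, 6]), dot.specialized)
print(total([1, 2, 3]), total.specialized)
```

Here `dot` has a registered specializer and takes the specialized path, while `total` has none and simply keeps running as plain Python.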
[Figure: a productivity app (.py) containing functions f(), @g(), and @h(). SEJITS routes decorated functions to a specializer, which emits .c source; cc/ld turn it into a .so that is cached ($) and run on the OS/HW.]
SEJITS makes tuning decisions per-function (not per-app)
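A rough Python sketch of the emit, compile, cache, and link pipeline follows. It is an assumption-laden illustration, not the SEJITS implementation: the C template, the `scaled_sum` function, and the cache directory are made up for this example, and it loads the shared object through `ctypes` rather than as an interpreter extension. If no `cc` is found or compilation fails, it falls back to plain Python.

```python
import ctypes
import hashlib
import os
import shutil
import subprocess
import tempfile

# Hypothetical C template; SCALE is filled in with a runtime value.
C_TEMPLATE = """
double scaled_sum(const double *xs, int n) {
    double acc = 0.0;
    int i;
    for (i = 0; i < n; i++) acc += SCALE * xs[i];
    return acc;
}
"""

# Hypothetical on-disk cache of compiled specializations.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "sejits_sketch_cache")

def compile_scaled_sum(scale):
    """Emit C for this scale, compile and load it; return None on failure."""
    compiler = shutil.which("cc")
    if compiler is None:
        return None
    source = C_TEMPLATE.replace("SCALE", repr(float(scale)))
    digest = hashlib.sha256(source.encode()).hexdigest()
    c_path = os.path.join(CACHE_DIR, digest + ".c")
    so_path = os.path.join(CACHE_DIR, digest + ".so")
    try:
        os.makedirs(CACHE_DIR, exist_ok=True)
        if not os.path.exists(so_path):  # cache miss: run the compiler
            with open(c_path, "w") as f:
                f.write(source)
            subprocess.run(
                [compiler, "-O2", "-shared", "-fPIC", c_path, "-o", so_path],
                check=True, capture_output=True)
        lib = ctypes.CDLL(so_path)  # dynamically link into this process
    except (OSError, subprocess.CalledProcessError):
        return None
    lib.scaled_sum.restype = ctypes.c_double
    lib.scaled_sum.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_int]

    def native(xs):
        array = (ctypes.c_double * len(xs))(*xs)
        return lib.scaled_sum(array, len(xs))
    return native

def scaled_sum(xs, scale):
    native = compile_scaled_sum(scale)
    if native is not None:
        return native(xs)
    return sum(scale * x for x in xs)  # fall back to the PLL

print(scaled_sum([1.0, 2.0, 3.0], 2.0))
```

Keying the cache on a hash of the generated source means a second call with the same runtime value skips the compiler entirely.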
[Figure: the productivity-app diagram again (.py app with f(), @g(), @h(); specializer; .c; cc/ld; .so; cache; OS/HW), annotated with the four parts of the name: Selective, Embedded, JIT, Specialization.]
SEJITS makes tuning decisions per-function (not per-app)
Example: Stencil Computation in Ruby
class LaplacianKernel < Kernel
  def kernel(in_grid, out_grid)
    in_grid.each_interior do |point|
      point.neighbors(1).each do |x|
        out_grid[point] += 0.2*x.val
      end
    end
  end
end
VALUE kern_par(int argc, VALUE* argv, VALUE self) {
  /* unpack_arrays into in_grid and out_grid */

  #pragma omp parallel for default(shared) private (t_6,t_7,t_8)
  for (t_8=1; t_8<256-1; t_8++) {
    for (t_7=1; t_7<256-1; t_7++) {
      for (t_6=1; t_6<256-1; t_6++) {
        int center = INDEX(t_6,t_7,t_8);
        out_grid[center] = (out_grid[center] + (0.2*in_grid[INDEX(t_6-1,t_7,t_8)]));
        ...
        out_grid[center] = (out_grid[center] + (0.2*in_grid[INDEX(t_6,t_7,t_8+1)]));
      }
    }
  }
  return Qtrue;
}
• Specializer emits OpenMP
• 1000x-2000x faster than Ruby
Use introspection to grab parameters, inspect AST of computation
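The introspection step can be illustrated in Python with the standard `ast` module. The sketch below is a Python analog written for this purpose, not the Ruby specializer from the talk: the kernel source, `interior_points`, and the one-dimensional loop emitter are assumptions made for brevity (the example above is three-dimensional).

```python
import ast

# Hypothetical Python analog of the Ruby kernel shown above.
KERNEL_SOURCE = """
def kernel(in_grid, out_grid):
    for point in in_grid.interior_points():
        for x in point.neighbors(1):
            out_grid[point] += 0.2 * x.val
"""

def extract_stencil_params(source):
    """Walk the kernel's AST; return (neighbor distance, weight)."""
    distance = None
    weight = None
    for node in ast.walk(ast.parse(source)):
        # Matches: <something>.neighbors(<constant>)
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "neighbors"
                and node.args
                and isinstance(node.args[0], ast.Constant)):
            distance = node.args[0].value
        # Matches: <target> += <constant> * <expr>
        if (isinstance(node, ast.AugAssign)
                and isinstance(node.op, ast.Add)
                and isinstance(node.value, ast.BinOp)
                and isinstance(node.value.op, ast.Mult)
                and isinstance(node.value.left, ast.Constant)):
            weight = node.value.left.value
    return distance, weight

def emit_c_loop(size, distance, weight):
    """Emit a 1-D OpenMP loop as text, using the extracted parameters."""
    return "\n".join([
        "#pragma omp parallel for",
        f"for (i = {distance}; i < {size} - {distance}; i++) {{",
        f"  out_grid[i] += {weight} * in_grid[i - {distance}];",
        f"  out_grid[i] += {weight} * in_grid[i + {distance}];",
        "}",
    ])

distance, weight = extract_stencil_params(KERNEL_SOURCE)
c_source = emit_c_loop(256, distance, weight)
print(c_source)
```

Because the grid size and stencil weights are known at call time, they can be baked into the generated loop as constants, as the emitted C above does with `256`.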
Example: Sparse Matrix-Vector Multiply in Python
# “Gather nonzero entries,
# multiply them by vector,
# do for each column”
Specializer outputs CUDA for nvcc:
SEJITS leverages downstream toolchains
B. Catanzaro et al., joint work with NVIDIA Research
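For readers who want to see the computation that the comments in the sparse matrix-vector example describe, here is a minimal plain-Python sketch. It is not the code from the talk: it assumes an ELLPACK-style layout in which `data[j][i]` and `idx[j][i]` hold the value and column index of the j-th stored entry of row i, and the function name `spmv_ell` is chosen only for illustration.

```python
def spmv_ell(data, idx, x):
    """Sparse matrix-vector product y = A*x for an ELLPACK-style matrix.

    data[j][i] is the j-th stored value of row i (0.0 where padded);
    idx[j][i] is the column index of that value.
    """
    num_rows = len(data[0])

    def row_result(i):
        # Gather x at the stored column indices, multiply by the nonzero
        # entries, and accumulate across the stored columns.
        return sum(vals[i] * x[cols[i]] for vals, cols in zip(data, idx))

    return [row_result(i) for i in range(num_rows)]

# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]   (at most two nonzeros per row)
data = [[1.0, 3.0, 4.0],
        [2.0, 0.0, 5.0]]
idx = [[0, 1, 0],
       [2, 0, 2]]
y = spmv_ell(data, idx, [1.0, 2.0, 3.0])
print(y)
```

The per-row function is independent of every other row, which is the property a specializer could exploit when mapping the outer comprehension to parallel threads.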
SEJITS in the Cloud
[Figure: a productivity app (.py) with functions f(), @g(), and @h(). SEJITS routes decorated functions to a specializer that emits .scala source; scalac compiles it, the result is cached ($), and it runs on a Spark worker on top of Nexus on Eucalyptus or EC2.]
Spark & Nexus
• Spark enables cloud-distributed, persistent, fault-tolerant shared parallel data structures
• Relies on Scala runtime and data-parallel abstractions
• Relies on Nexus (cloud resource management) layer
Example: Logistic regression using Spark/Scala (in progress)
M. Zaharia et al., Spark: Cluster Computing With Working Sets, HotCloud’09
B. Hindman et al., Nexus: A Common Substrate for Cluster Computing, HotCloud’09
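As a hedged illustration of what the productivity-level side of the logistic regression example could look like, the sketch below expresses gradient descent as a map over data points followed by a reduce. It is plain Python, not the Spark API and not the in-progress code from the talk; the function names, the toy data set, and the step size are all assumptions.

```python
import math
from functools import reduce

def logistic_gradient(w, point):
    """Gradient of the logistic loss for one (features, label) pair."""
    x, y = point  # label y is +1 or -1
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
    return [scale * xi for xi in x]

def add_vectors(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

def train(points, dims, iterations=20, step=0.5):
    w = [0.0] * dims
    for _ in range(iterations):
        # Map: per-point gradients. Reduce: sum them into one vector.
        gradient = reduce(add_vectors,
                          map(lambda p: logistic_gradient(w, p), points))
        w = [wi - step * gi for wi, gi in zip(w, gradient)]
    return w

# Toy, linearly separable data; the first feature is a constant bias term.
points = [([1.0, 2.0], 1), ([1.0, 3.0], 1),
          ([1.0, -2.0], -1), ([1.0, -3.0], -1)]
w = train(points, dims=2)
margins = [y * sum(wi * xi for wi, xi in zip(w, x)) for x, y in points]
print(w, margins)
```

The map and reduce steps are the parts a cloud specializer would be expected to distribute, while the outer loop stays in the productivity language.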
SEJITS in the Cloud
[Figure: a productivity app (.py) with functions f(), @g(), and @h(). SEJITS routes decorated functions to a specializer that emits .java source; javac compiles it, the result is cached ($), and it is run by a Hadoop master on top of Nexus on Cloud.]
SEJITS for Cloud Computing
Idea: same Python app runs on desktop, on manycore, and in cloud
Cloud/multicore synergy: specialize intra-node as well as generate cloud code
  • Cloud: Emit JIT-able code for Spark (Scala), Hadoop (Java), MPI (C), ...
  • Single node: Emit JIT-able code for OpenCL, CUDA, OpenMP, ...
Combine abstractions in one app
Remember...can always fall back to PLL
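The two levels of specialization can be pictured as two independent choices, each with the PLL as a fallback. The sketch below is hypothetical: the target names and languages come from the list above, but the dictionaries, their priority order, and `plan_specialization` are assumptions made for illustration.

```python
# Targets named on the slide; the ordering (used as priority) is an assumption.
CLOUD_TARGETS = {"spark": "Scala", "hadoop": "Java", "mpi": "C"}
NODE_TARGETS = {"cuda": "CUDA", "opencl": "OpenCL", "openmp": "C with OpenMP"}
PLL_FALLBACK = ("pll", "Python")

def first_available(targets, available):
    """Return the first (target, language) pair present in the environment."""
    for name, language in targets.items():
        if name in available:
            return name, language
    return None

def plan_specialization(available):
    """Choose a cloud-level and a node-level target independently."""
    return {
        "cloud": first_available(CLOUD_TARGETS, available) or PLL_FALLBACK,
        "node": first_available(NODE_TARGETS, available) or PLL_FALLBACK,
    }

print(plan_specialization({"hadoop", "openmp"}))  # cluster of multicore nodes
print(plan_specialization({"cuda"}))              # desktop with a GPU
print(plan_specialization(set()))                 # nothing available
```

The same application code yields a different plan in each environment, and an empty environment still yields a valid plan that runs everything in the PLL.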
Questions
Won’t we need lots & lots of specializers?
  • if ParLab “motifs” bet is correct, ~10s of specializers will go a long way
What about libraries, frameworks, etc.?
  • SEJITS is complementary to frameworks
  • Most libraries are written for ELLs, and ELLs lack features that promote code reuse and don’t raise the abstraction level
Why isn’t this just as hard as “magic compiler”?
  • Specializers written by human experts
  • SEJITS allows “crowdsourcing” them
Will programmers accustomed to Matlab/Fortran learn functional style, list comprehensions, etc.?
Conclusion
SEJITS enables code-generation strategy per-function, not per-app
Uniform approach to productive programming
  • same app on cloud, multicore, autotuned libraries
  • Combine multiple frameworks/abstractions in same app
Research enabler
  • Incrementally develop specializers for different motifs or prototype HW
  • Don’t need full compiler & toolchain just to get started
Questions