2
Introduction
- Clear trend towards multi-core heterogeneous systems
  - Problem: increased application-design complexity
- Different resources require different algorithms to execute efficiently
  - Compiler research attempts to compile code for different resources
  - Fundamentally limited, as compilers can't infer one algorithm from another
- Elastic Computing: an optimization framework with a knowledge base of implementations for different elastic functions
  - Designers call functions that automatically optimize for any system, i.e., designers specify "what" without specifying "how"
[Figure: a compiler can map a single algorithm (e.g., quick_sort) to both a uP and an FPGA, but the optimal algorithm differs per resource: quick sort performs best on the uP, while bitonic sort performs best on the FPGA.]
3
Overview
[Figure: an application calls sort(A, 100); the Elastic Computing Framework consults the Elastic Function Library's sorting implementations (insertion_sort, quick_sort, bitonic_sort), compares their predicted performance on the available system resources, and dispatches quick_sort.]
- Instead of specifying a specific implementation, applications use Elastic Functions
  - Elastic Functions contain a knowledge base of implementation and parallelization options
- At run-time, the Elastic Computing Framework determines the best execution decisions
  - Decisions are based on available system resources as well as function parameters
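The selection idea above can be sketched in C. This is a minimal illustration of the application-side view, not the framework's actual API: the application calls one function, and a framework-side chooser picks among interchangeable implementations based on the invocation parameters. All names and the size threshold are illustrative.

```c
#include <stdlib.h>

typedef void (*sort_impl)(int *a, int n);

/* implementation option 1: insertion sort (typically fast for small inputs) */
static void insertion_sort(int *a, int n) {
    for (int i = 1; i < n; i++) {
        int v = a[i], j = i - 1;
        while (j >= 0 && a[j] > v) { a[j + 1] = a[j]; j--; }
        a[j + 1] = v;
    }
}

static int cmp_int(const void *x, const void *y) {
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* implementation option 2: library quicksort (typically fast for large inputs) */
static void quick_sort_impl(int *a, int n) {
    qsort(a, (size_t)n, sizeof(int), cmp_int);
}

/* framework-side decision: same interface, different "how"
   (the illustrative threshold stands in for a performance-based lookup) */
void elastic_sort(int *a, int n) {
    sort_impl f = (n < 32) ? insertion_sort : quick_sort_impl;
    f(a, n);
}
```

The application only ever calls elastic_sort(); which algorithm actually runs is a framework decision.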
5
- If multiple resources are available, the Elastic Computing Framework will dynamically parallelize work across different resources
  - Automatically determines an efficient partitioning of work to resources
  - Also determines the most efficient implementation for each resource individually
[Figure: logical execution of the parallelized sort: the input is partitioned, quick_sort and bitonic_sort run in parallel on different resources, and their outputs are combined.]
7
Elastic Computing is transparent
- Applications treat Elastic Computing as a high-performance auto-tuning library of functions
- Elastic Computing determines how to efficiently execute the Elastic Functions on behalf of the application
8
Elastic Computing is transparent, portable
Elastic Computing automatically optimizes the Elastic Function execution to the available system resources, even if the application is moved to a different system
9
Elastic Computing is transparent, portable, and adaptive
[Figure: for an invocation sort(A, 5), the performance curves of quick sort, bitonic sort, and insertion sort differ at small input sizes, so the framework selects insertion_sort instead of quick_sort.]
Elastic Computing also automatically adapts the Elastic Function execution to the application’s input parameters (e.g., sorting 5 elements as opposed to 100)
10
Related Work
- Parallel cross-compiling programming languages
  - Examples: CUDA, OpenCL, DirectX, ImpulseC
  - Allow a single code file to describe parallel computation that can compile to numerous devices
- Single-domain adaptable software libraries
  - Examples: FFTW (for FFT) [Frigo 98], ATLAS (for linear algebra) [Whaley 98]
  - Measure the performance of execution alternatives and determine the best way to execute the function for the specific function call and system
- General-purpose adaptable software libraries
  - Examples: PetaBricks [Ansel 09], SPIRAL [Püschel 05]
  - Use custom languages to expose algorithmic/implementation choices to the compiler, and rely on measured performance and learning techniques to determine the best choice
- Heterogeneous work-partitioning frameworks
  - Example: Qilin [Luk 09]
  - Uses dynamic compilation to determine a data graph, and relies on measured performance to determine an efficient partitioning of work across heterogeneous resources
- Differentiating features of Elastic Computing:
  - Allows specification of multiple algorithms for different devices
  - Automatically determines efficient partitionings of work between heterogeneous devices
  - Supports both multi-core and heterogeneous devices and is not specific to any domain
  - Does not require custom programming languages or non-standard compilation
  - In most cases, previous work can be used in conjunction with Elastic Computing
11
Optimization Steps
- The Elastic Computing Framework performs two optimization steps to determine how to execute an Elastic Function efficiently
  - Implementation Assessment collects performance information about the different implementation options for an Elastic Function
  - Optimization Planning then analyzes the predicted performance to determine efficient execution decisions
- To reduce run-time overhead, both optimization steps execute at installation-time and save their results to a file
  - May require several minutes to an hour to complete
  - Only needs to occur once per Elastic Function per system
- At run-time, the Elastic Function Execution step looks up the optimization decisions to execute the Elastic Function on behalf of an application
[Figure: at installation-time, Implementation Assessment and Optimization Planning process the Elastic Function into Optimization Decisions; at run-time, Elastic Function Execution applies those decisions on behalf of the Application.]
12
Optimization Steps
- Elastic Functions inform the Elastic Computing Framework of how to execute and optimize a function
  - May be created for nearly any function (e.g., sort, FFT, matrix multiply)
- Elastic Functions contain numerous alternate implementations for executing the function
  - Implementations may be single-core, multi-core, and/or heterogeneous
  - All implementations adhere to the same input/output parameters, making them interchangeable
- Elastic Functions also contain:
  - Dependent implementations that specify how to parallelize the function
  - An adapter to abstract function-specific details from the analysis steps
  - Details discussed later!
[Figure: a Sort Elastic Function containing three implementations: Quick Sort (C code), Bitonic Sort (VHDL code), and Merge Sort (CUDA code).]
13
Optimization Steps
- Implementation Assessment creates performance predictors for the implementations of the Elastic Function
- Performance predictors are called Implementation Performance Graphs (IPGs), which are:
  - Created for each implementation individually
  - Return the estimated execution time of the implementation when given the implementation's invocation parameters
- Example: a quick sort implementation
[Figure: the IPG for Quick Sort maps input parameters to execution time; for a sample invocation QuickSort(array) on a 10,000-element array, the IPG returns an estimated execution time of 1.3 sec.]
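A sketch of how such a predictor could be evaluated at run-time: the IPG is stored as a small set of (work metric, time) sample points, and queries between points are answered by linear interpolation. The sample values and names below are illustrative, not measured data from the framework.

```c
/* one sample point of a piece-wise linear IPG */
typedef struct { double work; double time; } ipg_point;

/* estimate execution time for a given work value from an IPG with n
   points sorted by increasing work; clamp outside the sampled range */
double ipg_estimate(const ipg_point *g, int n, double work) {
    if (work <= g[0].work) return g[0].time;
    if (work >= g[n - 1].work) return g[n - 1].time;
    for (int i = 1; i < n; i++) {
        if (work <= g[i].work) {
            /* linear interpolation between the bracketing samples */
            double frac = (work - g[i - 1].work) / (g[i].work - g[i - 1].work);
            return g[i - 1].time + frac * (g[i].time - g[i - 1].time);
        }
    }
    return g[n - 1].time; /* not reached for sorted input */
}
```

A lookup is just a handful of comparisons, which is what keeps the run-time overhead of using the predictor small.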
14
Optimization Steps
- Optimization Planning then analyzes the IPGs to predetermine efficient Elastic Function execution decisions
  - Goal is to make decisions that minimize the estimated execution time
- Answers two main execution questions:
  - Which implementation is the most efficient for an invocation?
  - How to efficiently partition computation across multiple resources?
- Details discussed later!
[Figure: for a sample invocation Sort(array) on a 10,000-element array, the IPG for Quick Sort predicts 1.3 sec while the IPG for Bitonic Sort predicts 1.1 sec; Bitonic Sort is estimated to be most efficient at 1.1 sec!]
15
Optimization Steps
- The output of Implementation Assessment and Optimization Planning is saved to a file for lookup at run-time
- Applications execute normally until they invoke an Elastic Function
- When an Elastic Function is invoked, the Elastic Function Execution step starts, which then:
  - Looks up predetermined execution decisions based on the invocation parameters and availability of system resources
  - Executes the Elastic Function using the predetermined decisions
  - Returns control to the application once the Elastic Function completes
16
Design Flow
[Figure: design flow. Elastic Function design: hardware vendors, library designers, and open-source efforts create Elastic Functions against the Elastic Function Interface Specification; at Elastic Function installation, each function is compiled and processed by Implementation Assessment and Optimization Planning into installed Elastic Functions. Application design: the application developer writes application code containing Elastic Function invocations and compiles it into an application executable at application installation. System run-time: once the application is launched, Elastic Function Execution runs each invocation using the installed Elastic Functions.]
17
Design Flow
How does it work? Implementation Assessment and Optimization Planning are the main research challenges and the focus of on-going research.
Time for details!
18
Adapter
- Implementation Assessment creates Implementation Performance Graphs (IPGs) for each implementation to predict the execution time from the input parameters
  - An IPG is a piece-wise linear graph mapping the input parameters to estimated execution time
- Question: how do we map input parameters to the x-axis for every Elastic Function?
- Answer: the adapter
[Figure: the IPG for Quick Sort maps a 10,000-element invocation to an estimated 1.3 sec, but for an IPG for Convolution it is unclear how a sample invocation Convolve(a, b), with vectors of length 100 and 10,000, maps to a single x-axis value.]
19
Adapter
- The adapter maps the input/output parameters to a numeric value, called the work metric
  - Essentially provides an abstraction layer that allows Elastic Computing to analyze and, thereby, optimize any type of Elastic Function
  - The developer creates the adapter as part of the Elastic Function
- Rules for the adapter's mapping:
  1. Parameters that map to the same work metric value should require equal execution times
  2. As the work metric value increases, execution time should also increase
- Example: sorting Elastic Function
  - Adapter: set the work metric equal to the number of elements to sort
  - Adheres to Rule 1: sorting the same number of elements generally takes the same time
  - Adheres to Rule 2: sorting more elements generally takes longer
[Figure: with the adapter, the sample invocation QuickSort(array) on a 10,000-element array maps to work metric = 10,000, and the IPG for Quick Sort returns an estimated execution time of 1.3 sec.]
20
Adapter
- Any work metric mapping that (mostly) adheres to Rules 1 and 2 is a valid adapter
- One technique is to base the mapping on an asymptotic analysis of the function's performance
  - Asymptotic analysis yields an equation that is approximately proportional to execution time
  - Use that equation as the work metric mapping
- Example: convolution Elastic Function
  - Time-domain convolution has asymptotic performance equal to Θ(|a|*|b|)
  - Therefore, set the work metric equal to the product of the lengths of the two input vectors
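The two example adapters above are tiny functions. This sketch uses illustrative names of my own, not the framework's API; the point is only that each adapter collapses an invocation's parameters into one number that satisfies Rules 1 and 2.

```c
/* adapter for sort: the work metric is the element count,
   since sorting time grows with the number of elements */
long sort_work_metric(long num_elements) {
    return num_elements;
}

/* adapter for time-domain convolution: the work metric is the product
   of the vector lengths, matching the Theta(|a|*|b|) asymptotic cost */
long convolution_work_metric(long len_a, long len_b) {
    return len_a * len_b;
}
```

With these, the sample invocations map as on the slides: a 10,000-element sort has work metric 10,000, and convolving vectors of length 100 and 10,000 has work metric 1,000,000.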
[Figure: the sample invocation Convolve(a, b), with |a| = 100 and |b| = 10,000, maps to work metric = 100 * 10,000 = 1,000,000, and the IPG for Convolution returns an estimated execution time of 1.7 sec.]
21
Implementation Assessment
- Implementation Assessment relies on a heuristic to create IPGs, which:
  - Samples the execution time of the implementation at several work metrics to determine performance
  - Performs statistical analyses on sets of samples to find work metric intervals with linear trends
  - Adapts the sampling process to collect fewer samples in regions of linear trends
[Figure: collected samples of an implementation's execution time versus work metric, and the resulting IPG; the heuristic collected fewer samples in linear regions.]
22
Optimization Planning
- Optimization Planning analyzes the IPGs to predetermine efficient execution decisions, and performs two main optimizations:
  - Fastest Implementation Planning predetermines the most efficient implementation for different invocation situations
  - Work Parallelization Planning predetermines how to efficiently parallelize computation
- Fastest Implementation Planning (FIP) creates Function Performance Graphs (FPGs) that allow a single lookup to return the best implementation for an invocation
  - FIP creates an FPG by overlaying the IPGs corresponding to the possible implementation alternatives and saving only the lowest envelope
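A minimal sketch of the overlay step. The real FPG is piece-wise linear with exact crossover points; this simplification assumes both IPGs are sampled at the same work-metric points and just keeps the lower envelope at each point, recording which implementation wins. All values and names are illustrative.

```c
#define NUM_POINTS 3

typedef struct {
    double time[NUM_POINTS];   /* estimated time at each shared work metric */
} sampled_ipg;

typedef struct {
    double time[NUM_POINTS];   /* lowest-envelope time */
    int    winner[NUM_POINTS]; /* 0 = first implementation, 1 = second */
} sampled_fpg;

/* overlay two IPGs and keep the lowest envelope per sample point */
sampled_fpg build_fpg(const sampled_ipg *a, const sampled_ipg *b) {
    sampled_fpg f;
    for (int i = 0; i < NUM_POINTS; i++) {
        if (a->time[i] <= b->time[i]) { f.time[i] = a->time[i]; f.winner[i] = 0; }
        else                          { f.time[i] = b->time[i]; f.winner[i] = 1; }
    }
    return f;
}
```

A run-time lookup into the resulting FPG then returns both the expected time and the implementation to dispatch.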
[Figure: candidate implementations (Quick Sort in C, Bitonic Sort in VHDL) and their corresponding candidate IPGs; overlaying the IPGs and keeping the lowest envelope yields the resulting FPG.]
23
Optimization Planning
- Work Parallelization Planning (WPP) analyzes FPGs to determine partitionings of computation that minimize estimated execution time
- Dependent implementations are a type of implementation that uses WPP results to determine how to efficiently parallelize computation
  - Developers create dependent implementations based on divide-and-conquer algorithms
  - Divide-and-conquer algorithms divide a big instance of a problem into multiple smaller instances, and are common for many types of functions
- Example: merge sort algorithm (a divide-and-conquer algorithm that performs sort)
- Question: How to parallelize computation and resources to maximize performance?
- Answer: Determine partitionings that minimize the estimated execution time!
Merge Sort Algorithm
  Initial call:   Sort( [ 3, 5, 7, 1, 2, 8, 5, 2 ] )
  Partition:      nested calls Sort( [ 3, 5, 7, 1, 2 ] ) and Sort( [ 8, 5, 2 ] )
  Nested outputs: return [ 1, 2, 3, 5, 7 ] and return [ 2, 5, 8 ]
  Merge:          return [ 1, 2, 2, 3, 5, 5, 7, 8 ]
Merge Sort Dependent Implementation
  void MergeSortDepImp(input) {
      // Partition input
      [A_in, B_in] = Partition(input);
      // Perform recursive sorts
      In Parallel {
          A_out = sort(A_in);
          B_out = sort(B_in);
      }
      // Merge recursive outputs
      output = Merge(A_out, B_out);
      // Return output
      return output;
  }
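A runnable C sketch of the dependent-implementation pattern above. The two nested sort() calls are shown sequentially here; in the framework they would be dispatched in parallel to different resources, with the split point chosen by WPP. Names are illustrative, not the real API.

```c
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *x, const void *y) {
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* stand-in for an Elastic Function invocation on one resource */
static void elastic_sort(int *a, int n) {
    qsort(a, (size_t)n, sizeof(int), cmp_int);
}

/* merge two sorted runs a[0..na) and b[0..nb) into out */
static void merge(const int *a, int na, const int *b, int nb, int *out) {
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

/* dependent implementation: partition, nested sorts, merge */
void merge_sort_dep_impl(int *data, int n, int split) {
    elastic_sort(data, split);              /* nested call #1 (e.g., CPU) */
    elastic_sort(data + split, n - split);  /* nested call #2 (e.g., FPGA) */
    int *tmp = malloc((size_t)n * sizeof(int));
    merge(data, split, data + split, n - split, tmp);
    memcpy(data, tmp, (size_t)n * sizeof(int));
    free(tmp);
}
```

Because the nested calls go back through the generic sort() interface, each half can independently land on whatever implementation and resource the framework deems fastest.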
24
Optimization Planning
- WPP uses a sweep-line algorithm to analyze pairs of FPGs and determine an efficient partitioning of computation between them
  - Example: partitioning sort between two resources
  - The algorithm analyzes all pairs of FPGs to consider all possible resource partitionings
  - The result of the algorithm is optimal, assuming the estimated FPG performance is accurate
- Implementation Assessment and Optimization Planning iterate to consider repeated nesting of dependent implementations
  - Repeated nesting of dependent implementations allows for arbitrarily many partitions
- Proposed improvements to WPP consider more parallelization options to allow more efficient parallelization decisions
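The two-resource search can be sketched with the slide's numbers baked in as toy linear FPG models: the CPU sorts 1,000 elements in 1.2 sec and the FPGA sorts 5,000 in 1.2 sec. Since the resources run in parallel, a split's cost is the max of the two predicted times, and the sweep keeps the split minimizing that max. This brute-force sweep stands in for the actual sweep-line algorithm; the models and names are illustrative.

```c
/* toy linear FPG models (calibrated to the slide's example) */
double cpu_sort_time(double work)  { return 1.2e-3  * work; } /* 1.2 s at 1,000 */
double fpga_sort_time(double work) { return 0.24e-3 * work; } /* 1.2 s at 5,000 */

/* returns the minimal parallel time; *cpu_work gets the CPU's share */
double best_split(double total, double step, double *cpu_work) {
    double best_time = -1.0, best_w = 0.0;
    for (double w = 0.0; w <= total; w += step) {
        double t_cpu  = cpu_sort_time(w);
        double t_fpga = fpga_sort_time(total - w);
        /* resources execute in parallel, so cost is the slower side */
        double t = (t_cpu > t_fpga) ? t_cpu : t_fpga;
        if (best_time < 0.0 || t < best_time) { best_time = t; best_w = w; }
    }
    *cpu_work = best_w;
    return best_time;
}
```

For 6,000 elements this recovers the slide's decision: 1,000 to the CPU and 5,000 to the FPGA, with both sides finishing in about 1.2 sec.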
[Figure: a sweep line across the FPG for Sort using a CPU and the FPG for Sort using an FPGA finds the balanced split where both resources take 1.2 sec; when sorting 6,000 elements, partition 1,000 to the CPU and 5,000 to the FPGA.]
25
Status of Elastic Computing
- The Elastic Computing Framework is working!
  - Consists of over 200 files and 25k lines of code
  - 13 Elastic Functions (and 35 implementations) created:
    - Convolution: Circular Convolution, Convolution, 2D Convolution
    - Linear Algebra: Inner Product, Matrix Multiply
    - Image Processing: Mean Filter, Optical Flow, Prewitt Filter, Sum-of-Absolute-Differences
    - Others: Floyd-Warshall, Lattice-Boltzmann, Longest Common Subsequence, and Sort
  - Easy to add new Elastic Functions and implementations
- 5 processing resources supported:
  - Multi-threaded implementations support MPI communication/synchronization features
  - GPU support: any CUDA-supported GPUs
  - FPGA support: H101PCIXM, PROCeIII, and PROCStarIII
  - Adding support for new resources requires creating a wrapper for the driver's interface
- Elastic Computing Framework installed on: Alpha, Delta, Elastic, Marvel, Novo-G, and Warp
  - Easy to add new platforms
26
Experimental Results
[Figure: speedup of the Convolution Elastic Function as more resources are made available, from "Only CPUs" up to "1 FPGA & 3 GPUs" (roughly 0x to 90x), alongside the parallelization decisions for an invocation with work metric = 1,024,000: overlap-add partitioning recursively splits the work (1,024,000,000 total units, e.g., 553,512,960 and 470,487,040 at the first split) between FFT-based and time-domain convolution across resource groups such as (CPUs = 3, FPGAs = 1, GPUs = 2) and their nested subdivisions.]
- Results collected on the Elastic system
- The Convolution Elastic Function contains 5 implementations:
  - Single-threaded CPU implementation using the time-domain algorithm
  - Multi-threaded CPU implementation using the time-domain algorithm
  - GPU implementation using the time-domain algorithm
  - FPGA implementation using the frequency-domain algorithm
  - Dependent implementation using overlap-add partitioning
27
Experimental Results
[Figure: speedup of each Elastic Function (2DConv, CConv, Conv, FW, Inner, Mean, MM, Optical, Prewitt, SAD, Sort, and the average AVG) on Delta, Elastic, Marvel, and Novo-G; most speedups fall between 0x and 25x, with outliers of 49x, 80x, and 117x.]
- Results collected on Delta, Elastic, Marvel, and Novo-G for 11 Elastic Functions:
  - 2DConv = 2D convolution, CConv = circular convolution, Conv = 1D convolution, FW = Floyd-Warshall, Inner = inner product, Mean = mean image filter, MM = matrix multiply, Optical = optical flow, Prewitt = Prewitt edge detection, SAD = sum of absolute differences, Sort = sort
28
Publication List
Elastic Computing Publications:
- J. Wernsing and G. Stitt, "Elastic computing: a framework for transparent, portable, and adaptive multi-core heterogeneous computing," in LCTES'10: Proceedings of the ACM SIGPLAN/SIGBED 2010 Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 115-124, 2010.
- J. Wernsing and G. Stitt, "A scalable performance prediction heuristic for implementation planning on heterogeneous systems," in ESTIMedia'10: 8th IEEE Workshop on Embedded Systems for Real-Time Multimedia, pp. 71-80, 2010.
- J. Wernsing and G. Stitt, "RACECAR: A Heuristic for Automatic Function Specialization on Multi-core Heterogeneous Systems," under review in PPoPP'12: 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012.
- J. Wernsing and G. Stitt, "Elastic Computing: A Portable Optimization Framework for Hybrid Computers," under review in Parallel Computing Journal (ParCo) Special Issue on Application Accelerators in HPC.

Other Publications:
- J. Wernsing, J. Ling, G. Cieslewski, and A. George, "Lightweight Reliable Communications Library for High-Performance Embedded Space Applications," in DSN'07: 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Edinburgh, UK, June 25-28, 2007 (student forum).
- J. Coole, J. Wernsing, and G. Stitt, "A Traversal Cache Framework for FPGA Acceleration of Pointer Data Structures: A Case Study on Barnes-Hut N-body Simulation," in ReConFig'09: International Conference on Reconfigurable Computing and FPGAs, pp. 143-148, 2009.
- J. Fowers, G. Brown, J. Wernsing, and G. Stitt, "A Performance and Energy Comparison of Convolution on GPUs, FPGAs, and Multicore Processors," under review in ACM Transactions on Architecture and Code Optimization (TACO) Special Issue on High-Performance and Embedded Architectures and Compilers.
29
Conclusions
- Elastic Computing enables effective multi-core heterogeneous computing by:
  - Providing a framework for designing, reusing, and automatically optimizing computation on multi-core heterogeneous systems
  - Adapting execution decisions to execute efficiently based on the invocation's input parameters and the availability of system resources
  - Abstracting application developers from computation and optimization details
  - Enabling applications to be portable yet efficient across different systems
- Main research challenges:
  - Implementation Assessment, which creates performance predictors for implementations
  - Optimization Planning, which predetermines efficient execution decisions by analyzing the performance predictors
- Proposed improvements:
  - Improve Implementation Assessment to more intelligently sample an implementation when creating an IPG, reducing installation-time overhead without reducing accuracy
  - Improve Optimization Planning to consider more partitioning options, improving efficiency when parallelizing computation
Questions?