Accelerating Applications
James Spooner | Maxeler | VP of Acceleration
… the art of maximum performance computing
Introduction What do we mean by acceleration?
3
The Process How do we go about it?
The Tools What can we automate?
Case Studies How can we apply this for real?
Summary What do you think?
About Maxeler Technologies
• Maxeler offers complete hardware, software and application acceleration solutions for high performance computing
• Founded 2003, ~65 people, offices in London, UK and Palo Alto, CA
• Main clients in banking and oil and gas exploration

Hardware
• Card: PCI Express x16, compute, memory and local interconnect
• Node: 1U solutions with multiple cards
• Rack: 10U, 20U or 40U, balancing compute, storage & network

Software
• Resource management for Accelerated Computing
• Runtime support: memory management and data choreography
• Compilers and High Level Libraries

Consulting
• HPC System Performance Architecture
• Algorithms and Numerical Optimization
• Integration into business and technical processes

5
Application Acceleration
Deliberate, focused approach to improving application speed
• May involve using new or additional hardware
• May require (dramatic) changes to the code base
• Makes some of the program faster
• Will be programmed intentionally and be architecture specific
• May have multiple implementations
6
Maxeler is an acceleration specialist, delivering end-to-end performance for a range of clients in the banking and oil/gas exploration industries. This talk aims to present some of our methodology and experience across GPU and FPGA acceleration projects.
What always makes Acceleration hard?
• Messy code
• Complicated build dependencies
• Confused control flow
• Impenetrable data access
• Pointer-intensive data structures
• Premature optimization
7
[Figure: a collection of heap-allocated point objects (mostly (x, y, z) points plus one (r, θ) point) referenced through pointers p and q and updated through a virtual call:]

for (i = 0; i < N; ++i) {
    points[i]->incx();
}
Conflicting goals
Some well-motivated software structures have real value, but make acceleration harder.
Examples:
• Virtual method calls inside a loop
• Collections with non-uniform type
• Substructure sharing
8
[Figure: the same pointer-based point collection and virtual-call loop as on the previous slide, illustrating non-uniform types and shared substructure.]
What makes Acceleration easier?
• Self-evident data dependencies
• Computing on large collections of uniform data
• Appropriate representation hiding
• Getting the abstraction right
9
[Figure: the same point data laid out as uniform arrays: one contiguous array each for the x, y and z values. A sketch of this transformation follows below.]
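To make the contrast concrete, here is a minimal C++ sketch (not code from the slides; the type names are invented for illustration). The first version mirrors the pointer-and-virtual-call structure that makes acceleration hard; the second uses the uniform structure-of-arrays layout from the figure, which exposes the data dependencies directly and vectorises trivially.

#include <cstddef>
#include <vector>

// Hard to accelerate: pointer chasing plus a virtual call inside the loop.
struct Point {
    virtual void incx() = 0;
    virtual ~Point() = default;
};
struct XYZPoint : Point {
    double x = 0, y = 0, z = 0;
    void incx() override { x += 1.0; }
};

void inc_all(std::vector<Point*>& points) {
    for (std::size_t i = 0; i < points.size(); ++i)
        points[i]->incx();               // indirect branch, scattered memory
}

// Easier to accelerate: uniform structure-of-arrays layout.
struct PointsSoA {
    std::vector<double> x, y, z;         // one contiguous array per coordinate
};

void inc_all(PointsSoA& p) {
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] += 1.0;                   // unit-stride access, no control flow
}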
Maximum Performance Computing
Identify parallelism and take advantage of it
• Fully understand data dependencies
Minimize memory bandwidth
• Data reuse and representation
Regularize the computation and data
• Minimize control flow complexity
Find optimal balance for underlying architecture
• Memory hierarchy bandwidth(s), size(s) and latency(s)
• Communication bandwidth(s) and latency(s)
• Math performance
• Branch cost (control divergence)
• Axes of parallelism
10
Introduction What do we mean by acceleration?
11
The Process How do we go about it?
The Tools What can we automate?
Case Studies How can we apply this for real?
Summary What do you think?
Maxeler Acceleration Process
[Figure: the process runs from Code through Analysis, Transformation and Partitioning to Implementation and Result. Analysis, transformation and partitioning set theoretical performance bounds; implementation achieves that performance.]
12
• Run the code with profiling tools
• Understand data and loop structures and data access patterns
• Investigate transformation options for these structures and access patterns
• Decide which parts of the code need acceleration
• Implement and validate
Analysis
Understand the application (code + data)
Find parallelism
• At all levels
Understand data dependencies
• Size
• Frequency
• Distance
13
Transformation
Change the structure of the code and data
• Put all the computation that matters into one place
• Strive for regular code and data structures in the inner loop
• Split out irregular or complex pre/post processing

Partitioning
What should go where?
Storage
• Consider capacity, bandwidth, latency
• Data access patterns, reuse
Compute
• Consider input/output data volume, latency, computational throughput
Model the performance before implementation (a sketch of such a model follows below).
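A first-order model charges a kernel for moving its inputs and outputs across the link, plus whichever of local memory traffic or arithmetic saturates first. The sketch below is a hedged illustration (not from the slides); the machine figures and names are placeholder assumptions.

#include <algorithm>
#include <cstdio>

// Back-of-envelope partitioning model: time on an accelerator is the I/O
// transfer over the link plus the larger of the memory-bound and
// compute-bound times for the kernel itself.
double accel_seconds(double io_bytes, double link_Bps,
                     double mem_bytes, double mem_Bps,
                     double flops, double peak_flops) {
    return io_bytes / link_Bps +
           std::max(mem_bytes / mem_Bps, flops / peak_flops);
}

int main() {
    // Hypothetical kernel: 2 GB in/out, 16 GB of local memory traffic, 100 GFLOP.
    double t = accel_seconds(2e9, 8e9,      // PCIe-class link, ~8 GB/s
                             16e9, 60e9,    // device memory, ~60 GB/s
                             100e9, 500e9); // ~0.5 TFLOP/s peak arithmetic
    std::printf("Estimated accelerator time: %.2f s\n", t);
}

Comparing such an estimate against the measured CPU time for the same kernel bounds the achievable speedup before any accelerator code is written.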
14
[Figure: AMD APU chip block diagram: CPU cores, NB core and an embedded GPU on one chip, with a ~20 GB/s non-coherent path and a ~7 GB/s coherent path.]
Partitioning Options
[Figure: data access plans, code partitionings and transformations are explored to yield a set of Pareto-optimal options, plotted against runtime and development time.]
Try to minimise runtime and development time, while maximising flexibility and precision.
15
Implementation
Making it work
Visibility is a challenge
• Be systematic, don’t bite off too much at once
• Get it working first
Unit Testing
• Make sure everything works
• Have good models and keep them up to date
Evaluate implementation options and measure them all!
• Optimize quantitatively, not intuitively
16
Introduction What do we mean by acceleration?
17
The Process How do we go about it?
The Tools What can we automate?
Case Studies How can we apply this for real?
Summary What do you think?
Maxeler Parton
Internal tool suite consisting of the following tools:
• Multithreaded lightweight timing
• Arithmetic precision simulation
• Automatic control / data flow analysis
• Constraints-bound performance estimation
18
Parton in the Acceleration Process
[Figure: the acceleration process diagram (Code → Analysis → Transformation → Partitioning → Implementation → Result) annotated with where each Parton tool is applied: loop timing measurement, arithmetic precision simulation, automatic control / data flow analysis and constraints-bound performance estimation.]
19
Automatic control / data flow analysis
• Run-time analysis of compiled software
• Allows automatic analysis of impenetrable object-oriented C++ software
• Traces data dependencies between functions and libraries
• Three-stage process:
  – Trace
  – Analyse
  – Visualise
20
Automatic control / data flow analysis
[Figure: starting from the initial program, trace its control + memory flow, analyse the control flow and data dependencies, then visualise part or all of the program.]
21
Automatic control / data flow analysis
[Figure: example analysis output: control flow and data flow (size in bytes) between functions, dependencies through heap memory allocation, nested ‘for’ loop structures, and hot-spots identified by color.]
22
Maxeler Loop Graphs
• Boxes represent loops
• Ellipses represent computation
• Diamonds represent data

23
Introduction What do we mean by acceleration?
24
The Process How do we go about it?
The Tools What can we automate?
Case Studies How can we apply this for real?
Summary What do you think?
MySQL Internals: Performance Review
Current version of MySQL is optimized for:
• Minimizing disk I/O (B+tree)
• Caching queries & data (hash table, memory buffer pool)
• Finding the best way to execute a query (optimizer)
Current version of MySQL is not optimized for:
• Data-level parallelism: SIMD is not used
• Thread-level parallelism to process slow queries: a query is processed by a single thread
• Compute-intensive analytical queries
25
MySQL Indexes
Indexes in MySQL use B+Trees, which allow searches in logarithmic time.
Search starts at the root, traverses downwards, and performs binary search at each node.
Linked list (red) allows rapid in-order traversal.
[Figure: a small example B+tree with keys 3 and 5 at the root and leaf nodes holding keys 1–7 linked in order.]
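A minimal C++ sketch of this search (illustrative only, not MySQL/InnoDB code; the node layout and names are assumptions):

#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified B+tree node: keys are kept sorted; internal nodes hold child
// pointers, leaf nodes hold row identifiers.
struct Node {
    std::vector<int64_t> keys;
    std::vector<Node*>   children;   // empty for leaf nodes
    std::vector<int64_t> rows;       // valid only for leaf nodes
    bool leaf;
};

// Descend from the root, doing a binary search at each node: O(log n) overall.
int64_t find(const Node* n, int64_t key) {
    while (!n->leaf) {
        // First key greater than 'key' selects the child subtree to follow.
        auto it = std::upper_bound(n->keys.begin(), n->keys.end(), key);
        n = n->children[it - n->keys.begin()];
    }
    auto it = std::lower_bound(n->keys.begin(), n->keys.end(), key);
    return (it != n->keys.end() && *it == key)
               ? n->rows[it - n->keys.begin()]
               : -1;                 // key not present
}

The binary search at each node is the step that the two options below parallelise in different ways.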
26
Accelerating Index Search
Accelerating index search would accelerate many queries (insert, select, update, joins, ...).
Full-text search will also benefit.
Common queries involving a B+tree search:

SELECT * FROM table WHERE id = 25
SELECT title, text FROM page WHERE MATCH(title) AGAINST('name')

Two possible acceleration strategies:
Option 1: Use GPU to accelerate a single search query.
Option 2: Use GPU to process many search queries in parallel.
27
Option 1: Accelerating a Single Search
• Use the massively parallel architecture of the GPU to accelerate a single search.
• Replace the binary search in the B+tree by a K-ary search.
• Maximum performance improvement: log(K)/log(2)
• Needs more memory bandwidth for reading larger blocks
[Figure: a B+tree in which each node is searched with a K-ary search rather than a binary search.]
¹ Approach presented in “Parallel Search on Video Cards (2009)”, T. Kaldewey, J. Hagen, et al.
28
K-ary search: use K GPU cores to accelerate the binary search
• Binary search with one core done in log2(n) steps
• 4-ary search with 3 cores done in log4(n) steps
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
5 6 7 8
7
1 2 3 4 5 6 7 8
7 8
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
7
5 6 7 8
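A hedged, CPU-side C++ sketch of a single K-ary step (illustrative pseudocode rather than real GPU kernel code; on the GPU the K-1 pivot comparisons would be issued by separate cores in parallel):

#include <cstddef>
#include <vector>

// One K-ary step: K-1 evenly spaced pivots split the range [lo, hi) into K
// sub-ranges, so the range shrinks by a factor of ~K per step instead of 2.
template <int K>
void kary_step(const std::vector<int>& a, int key,
               std::size_t& lo, std::size_t& hi) {
    std::size_t next_lo = lo, next_hi = hi;
    for (int i = 1; i < K; ++i) {                 // K-1 pivot comparisons;
        std::size_t p = lo + i * (hi - lo) / K;   // on a GPU these run in parallel
        if (a[p] <= key) next_lo = p;             // key is at or right of this pivot
        else { next_hi = p; break; }              // key is left of this pivot
    }
    lo = next_lo;
    hi = next_hi;
}

A full search repeats this step until the range holds at most K elements and then scans the remainder, dropping the step count from log2(n) to logK(n) as in the worked example above.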
29
Option 2: GPUs can process hundreds of queries in parallel¹.
Key features:
• Maximize memory bandwidth (parallel reads).
• Prefetching to use the SIMD architecture.
• Maximize throughput (number of threads).
MySQL architecture implications:
• Need modification of the data layout of the binary tree.
• Need modification of the query optimizer to gather similar queries.
• Needs many search queries to achieve high throughput.
[Figure: many queries (Query 1, Query 2, Query 3, ...) traverse the tree simultaneously, each thread reading and comparing node keys in parallel.]
¹ Approach presented in “Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs (2010)”, C. Kim, J. Chhugani, et al.
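A hedged CPU-side C++ sketch of the throughput-oriented layout and batching (illustrative only; the flat array layout and names are assumptions, and a real implementation would map each query to a GPU thread):

#include <cstddef>
#include <vector>

// Complete binary search tree stored as a flat array (node i has children at
// 2i+1 and 2i+2): a GPU/SIMD-friendly layout with no pointers to chase.
struct FlatTree {
    std::vector<int> keys;   // laid out level by level from the sorted key set
};

// Process a whole batch of independent queries; on a GPU each iteration of
// the outer loop would be a separate thread, maximising throughput.
void search_batch(const FlatTree& t,
                  const std::vector<int>& queries,
                  std::vector<bool>& found) {
    found.assign(queries.size(), false);
    for (std::size_t q = 0; q < queries.size(); ++q) {
        std::size_t i = 0;
        while (i < t.keys.size()) {
            if (t.keys[i] == queries[q]) { found[q] = true; break; }
            i = (queries[q] < t.keys[i]) ? 2 * i + 1 : 2 * i + 2;
        }
    }
}

The regular array layout and the batching of similar queries correspond to the data-layout and query-optimizer changes listed above.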
30
• Option 1 optimizes for latency, so speedup is quoted for a single query running with a single thread.
• Option 2 optimizes for throughput, so speedup assumes sufficient queries to keep the GPU fully occupied and is compared to CPU using both cores.
• As Option 2 optimizes for throughput, latency may be increased.
• Speedup is quoted for the overall performance of a full-text keyword search on the Simple English Wikipedia database, and not just the index search (which accounts for ~71% of runtime).
Performance on Zacate

                             Speedup vs CPU alone¹   Bottleneck
Option 1 (projected)         ~2.2x                   GPU clock frequency
Option 2 (projected)         ~3.2x                   Number of GPU cores
Option 2 (initial results)   ~2.1x                   Number of GPU cores

¹ Speedup is quoted for a given system versus the same system not using the GPU.
31
Reverse Time Migration
• Geoscience algorithm for imaging the earth’s subsurface
• Runtime of weeks to months on thousands of CPU cores
• Core computational kernel is 3D finite difference wave propagation
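For concreteness, a minimal C++ sketch of one time step of such a propagator (a second-order stencil only; real RTM kernels use high-order stencils, absorbing boundaries and source injection, and the names here are invented):

#include <vector>

// One time step of 2nd-order acoustic wave propagation on an nx*ny*nz grid:
// p_next = 2*p - p_prev + vel^2 * dt^2/dx^2 * laplacian(p).
void step(std::vector<float>& p_next, const std::vector<float>& p,
          const std::vector<float>& p_prev, const std::vector<float>& vel2,
          int nx, int ny, int nz, float dt2_over_dx2) {
    auto at = [&](int x, int y, int z) { return (z * ny + y) * nx + x; };
    for (int z = 1; z < nz - 1; ++z)
        for (int y = 1; y < ny - 1; ++y)
            for (int x = 1; x < nx - 1; ++x) {
                int i = at(x, y, z);
                float lap = p[at(x-1,y,z)] + p[at(x+1,y,z)]
                          + p[at(x,y-1,z)] + p[at(x,y+1,z)]
                          + p[at(x,y,z-1)] + p[at(x,y,z+1)]
                          - 6.0f * p[i];
                p_next[i] = 2.0f * p[i] - p_prev[i]
                          + vel2[i] * dt2_over_dx2 * lap;
            }
}

The kernel is completely regular (unit-stride inner loop, no data-dependent control flow), which is what makes it such a strong acceleration target.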
32
Application Structure
33
[Figure: the ‘source’ wave is forward-extrapolated from t=0 to t=tmax and the ‘received’ wave is reverse-extrapolated from t=tmax back to t=0; the two pressure fields are cross-correlated at each time point.]
RTM Option 1 – Storing pressure fields
34
[Figure: three-step pipeline split between accelerator and CPU. Step 1: extrapolate the source wavefield, saving the wavefields to memory/disk. Step 2: extrapolate the receiver wavefield, loading the source wavefields from disk and imaging them with the receiver field. Step 3: post-process and save the output image.]
RTM Option 2 – Recompute pressure fields
35
[Figure: three-step pipeline split between accelerator and CPU. Step 1: extrapolate the source wavefield, saving its boundaries in memory. Step 2: extrapolate the receiver wavefield while re-extrapolating the source backwards from the saved boundary elements (fed back to the source propagator) and imaging. Step 3: post-process and save the output image.]
Modelling Performance Impact for Design Alternatives
• Option 1 – Store: higher performance at low speedups; limited by disk I/O
• Option 2 – Recompute: performance scales
• Which is better depends on problem size and relative disk I/O bandwidth
[Figure: modelled performance of Option 2 against Option 1 at full and partial problem sizes.]

36
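As a hedged sketch of how this trade-off can be modelled before committing to an implementation (not from the slides; all figures and names are placeholder assumptions):

#include <cstdio>

// Option 1 (store):     dominated by writing and reading back every source snapshot.
// Option 2 (recompute): pays roughly one extra source propagation instead.
double store_seconds(double snapshot_bytes, int timesteps, double disk_Bps) {
    return 2.0 * snapshot_bytes * timesteps / disk_Bps;   // write, then read back
}
double recompute_seconds(double flops_per_step, int timesteps,
                         double sustained_flops) {
    return flops_per_step * timesteps / sustained_flops;  // extra propagation
}

int main() {
    // Hypothetical problem: 4 GB per snapshot, 5000 timesteps.
    double snap = 4e9; int nt = 5000;
    std::printf("Option 1 (store, 500 MB/s disk): %.0f s\n",
                store_seconds(snap, nt, 5e8));
    std::printf("Option 2 (recompute, 100 GFLOP/step at 200 GFLOP/s): %.0f s\n",
                recompute_seconds(100e9, nt, 200e9));
}

With these placeholder numbers the recompute option wins comfortably, but a larger snapshot or a faster parallel file system shifts the crossover, which is why the comparison has to be made per problem size.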
Introduction What do we mean by acceleration?
37
The Process How do we go about it?
The Tools What can we automate?
Case Studies How can we apply this for real?
Summary What do you think?
Summary
• Have a clear goal about what needs to be faster
• Identify all the code that needs to be accelerated
• Understand the data and compute requirements
• Map the algorithm to the architecture
• Evaluate all options thoroughly
• Leave intuition at the door, measure quantitatively
• Implement systematically
38
39
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.
The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.