Accelerating Applications
James Spooner | Maxeler | VP of Acceleration
… the art of maximum performance computing
Introduction What do we mean by acceleration?
3
The Process How do we go about it?
The Tools What can we automate?
Case Studies How can we apply this for real?
Summary What do you think?
About Maxeler Technologies
• Maxeler offers complete hardware, software and application acceleration solutions for high performance computing
• Founded 2003, ~65 people, offices in London, UK and Palo Alto, CA
• Main clients in banking and oil and gas exploration

Hardware
• Card: PCI Express x16, compute, memory and local interconnect
• Node: 1U solutions with multiple cards
• Rack: 10U, 20U or 40U, balancing compute, storage & network

Software
• Resource management for Accelerated Computing
• Runtime support: memory management and data choreography
• Compilers and High Level Libraries

Consulting
• HPC System Performance Architecture
• Algorithms and Numerical Optimization
• Integration into business and technical processes

5
Application Acceleration
Deliberate, focused approach to improving application speed
• May involve using new or additional hardware
• May require (dramatic) changes to the code base
• Makes some of the program faster
• Will be programmed intentionally and be architecture specific
• May have multiple implementations
6
Maxeler is an acceleration specialist, delivering end-to-end performance for a range of clients in the banking and oil/gas exploration industries. This talk aims to present some of our methodology and experience across GPU and FPGA acceleration projects.
What always makes Acceleration hard?
• Messy code
• Complicated build dependencies
• Confused control flow
• Impenetrable data access
• Pointer-intensive data structures
• Premature optimization
7
[Figure: a collection of heap-allocated point objects (mostly (x, y, z) points plus one (r, θ) point) referenced through pointers p and q and updated through a virtual call:]

for (i = 0; i < N; ++i) {
    points[i]->incx();
}
Conflicting goals
Some well-motivated software structures have real value, but make acceleration harder.
Examples:
• Virtual method calls inside a loop
• Collections with non-uniform type
• Substructure sharing
8
[Figure: the same pointer-based point collection and virtual-call loop as on the previous slide, illustrating non-uniform types and shared substructure.]
What makes Acceleration easier?
• Self-evident data dependencies
• Computing on large collections of uniform data
• Appropriate representation hiding
• Getting the abstraction right
9
[Figure: the same point data laid out as uniform arrays: one contiguous array each for the x, y and z values. A sketch of this transformation follows below.]
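To make the contrast concrete, here is a minimal C++ sketch (not code from the slides; the type names are invented for illustration). The first version mirrors the pointer-and-virtual-call structure that makes acceleration hard; the second uses the uniform structure-of-arrays layout from the figure, which exposes the data dependencies directly and vectorises trivially.

#include <cstddef>
#include <vector>

// Hard to accelerate: pointer chasing plus a virtual call inside the loop.
struct Point {
    virtual void incx() = 0;
    virtual ~Point() = default;
};
struct XYZPoint : Point {
    double x = 0, y = 0, z = 0;
    void incx() override { x += 1.0; }
};

void inc_all(std::vector<Point*>& points) {
    for (std::size_t i = 0; i < points.size(); ++i)
        points[i]->incx();               // indirect branch, scattered memory
}

// Easier to accelerate: uniform structure-of-arrays layout.
struct PointsSoA {
    std::vector<double> x, y, z;         // one contiguous array per coordinate
};

void inc_all(PointsSoA& p) {
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] += 1.0;                   // unit-stride access, no control flow
}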
Maximum Performance Computing
Identify parallelism and take advantage of it
• Fully understand data dependencies
Minimize memory bandwidth
• Data reuse and representation
Regularize the computation and data
• Minimize control flow complexity
Find optimal balance for underlying architecture
• Memory hierarchy bandwidth(s), size(s) and latency(s)
• Communication bandwidth(s) and latency(s)
• Math performance
• Branch cost (control divergence)
• Axes of parallelism
10
Introduction What do we mean by acceleration?
11
The Process How do we go about it?
The Tools What can we automate?
Case Studies How can we apply this for real?
Summary What do you think?
Maxeler Acceleration Process
[Figure: the process runs from Code through Analysis, Transformation and Partitioning to Implementation and Result. Analysis, transformation and partitioning set theoretical performance bounds; implementation achieves that performance.]
12
• Run the code with profiling tools
• Understand data and loop structures and data access patterns
• Investigate transformation options for these structures and access patterns
• Decide which parts of the code need acceleration
• Implement and validate
Analysis
Understand the application (code + data)
Find parallelism
• At all levels
Understand data dependencies
• Size
• Frequency
• Distance
13
Transformation
Change the structure of the code and data
• Put all the computation that matters into one place
• Strive for regular code and data structures in the inner loop
• Split out irregular or complex pre/post processing

Partitioning
What should go where?
Storage
• Consider capacity, bandwidth, latency
• Data access patterns, reuse
Compute
• Consider input/output data volume, latency, computational throughput
Model the performance before implementation (a sketch of such a model follows below).
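A first-order model charges a kernel for moving its inputs and outputs across the link, plus whichever of local memory traffic or arithmetic saturates first. The sketch below is a hedged illustration (not from the slides); the machine figures and names are placeholder assumptions.

#include <algorithm>
#include <cstdio>

// Back-of-envelope partitioning model: time on an accelerator is the I/O
// transfer over the link plus the larger of the memory-bound and
// compute-bound times for the kernel itself.
double accel_seconds(double io_bytes, double link_Bps,
                     double mem_bytes, double mem_Bps,
                     double flops, double peak_flops) {
    return io_bytes / link_Bps +
           std::max(mem_bytes / mem_Bps, flops / peak_flops);
}

int main() {
    // Hypothetical kernel: 2 GB in/out, 16 GB of local memory traffic, 100 GFLOP.
    double t = accel_seconds(2e9, 8e9,      // PCIe-class link, ~8 GB/s
                             16e9, 60e9,    // device memory, ~60 GB/s
                             100e9, 500e9); // ~0.5 TFLOP/s peak arithmetic
    std::printf("Estimated accelerator time: %.2f s\n", t);
}

Comparing such an estimate against the measured CPU time for the same kernel bounds the achievable speedup before any accelerator code is written.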
14
[Figure: AMD APU chip block diagram: CPU cores, NB core and an embedded GPU on one chip, with a ~20 GB/s non-coherent path and a ~7 GB/s coherent path.]
Partitioning Options
[Figure: data access plans, code partitionings and transformations are explored to yield a set of Pareto-optimal options, plotted against runtime and development time.]
Try to minimise runtime and development time, while maximising flexibility and precision.
15
Implementation
Making it work
Visibility is a challenge
• Be systematic, don’t bite off too much at once
• Get it working first
Unit Testing
• Make sure everything works
• Have good models and keep them up to date
Evaluate implementation options and measure them all!
• Optimize quantitatively, not intuitively
16
Introduction What do we mean by acceleration?
17
The Process How do we go about it?
The Tools What can we automate?
Case Studies How can we apply this for real?
Summary What do you think?
Maxeler Parton
Internal tool suite consisting of the following tools:
• Multithreaded lightweight timing
• Arithmetic precision simulation
• Automatic control / data flow analysis
• Constraints-bound performance estimation
18
Parton in the Acceleration Process
[Figure: the acceleration process diagram (Code → Analysis → Transformation → Partitioning → Implementation → Result) annotated with where each Parton tool is applied: loop timing measurement, arithmetic precision simulation, automatic control / data flow analysis and constraints-bound performance estimation.]
19
Automatic control / data flow analysis
• Run-time analysis of compiled software
• Allows automatic analysis of impenetrable object-oriented C++ software
• Traces data dependencies between functions and libraries
• Three-stage process:
  – Trace
  – Analyse
  – Visualise
20
Automatic control / data flow analysis
[Figure: starting from the initial program, trace its control + memory flow, analyse the control flow and data dependencies, then visualise part or all of the program.]
21
Automatic control / data flow analysis
[Figure: example analysis output: control flow and data flow (size in bytes) between functions, dependencies through heap memory allocation, nested ‘for’ loop structures, and hot-spots identified by color.]
22
Maxeler Loop Graphs
• Boxes represent loops
• Ellipses represent computation
• Diamonds represent data

23
Introduction What do we mean by acceleration?
24
The Process How do we go about it?
The Tools What can we automate?
Case Studies How can we apply this for real?
Summary What do you think?
MySQL Internals: Performance Review
Current version of MySQL is optimized for:
• Minimizing disk I/O (B+tree)
• Caching queries & data (hash table, memory buffer pool)
• Finding the best way to execute a query (optimizer)
Current version of MySQL is not optimized for:
• Data-level parallelism: SIMD is not used
• Thread-level parallelism to process slow queries: a query is processed by a single thread
• Compute-intensive analytical queries
25
MySQL Indexes
Indexes in MySQL use B+Trees, which allow searches in logarithmic time.
Search starts at the root, traverses downwards, and performs binary search at each node.
Linked list (red) allows rapid in-order traversal.
[Figure: a small example B+tree with keys 3 and 5 at the root and leaf nodes holding keys 1–7 linked in order.]
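A minimal C++ sketch of this search (illustrative only, not MySQL/InnoDB code; the node layout and names are assumptions):

#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified B+tree node: keys are kept sorted; internal nodes hold child
// pointers, leaf nodes hold row identifiers.
struct Node {
    std::vector<int64_t> keys;
    std::vector<Node*>   children;   // empty for leaf nodes
    std::vector<int64_t> rows;       // valid only for leaf nodes
    bool leaf;
};

// Descend from the root, doing a binary search at each node: O(log n) overall.
int64_t find(const Node* n, int64_t key) {
    while (!n->leaf) {
        // First key greater than 'key' selects the child subtree to follow.
        auto it = std::upper_bound(n->keys.begin(), n->keys.end(), key);
        n = n->children[it - n->keys.begin()];
    }
    auto it = std::lower_bound(n->keys.begin(), n->keys.end(), key);
    return (it != n->keys.end() && *it == key)
               ? n->rows[it - n->keys.begin()]
               : -1;                 // key not present
}

The binary search at each node is the step that the two options below parallelise in different ways.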
26
Accelerating Index Search
Accelerating index search would accelerate many queries (insert, select, update, joins, ...).
Full-text search will also benefit.
Common queries involving a B+tree search:

SELECT * FROM table WHERE id = 25
SELECT title, text FROM page WHERE MATCH(title) AGAINST('name')

Two possible acceleration strategies:
Option 1: Use GPU to accelerate a single search query.
Option 2: Use GPU to process many search queries in parallel.
27
Option 1: Accelerating a Single Search
• Use the massively parallel architecture of the GPU to accelerate a single search.
• Replace the binary search in the B+tree by a K-ary search.
• Maximum performance improvement: log(K)/log(2)
• Needs more memory bandwidth for reading larger blocks
[Figure: a B+tree in which each node is searched with a K-ary search rather than a binary search.]
¹ Approach presented in “Parallel Search on Video Cards (2009)”, T. Kaldewey, J. Hagen, et al.
28
K-ary search: use K GPU cores to accelerate the binary search
• Binary search with one core done in log2(n) steps
• 4-ary search with 3 cores done in log4(n) steps
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
5 6 7 8
7
1 2 3 4 5 6 7 8
7 8
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
7
5 6 7 8
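A hedged, CPU-side C++ sketch of a single K-ary step (illustrative pseudocode rather than real GPU kernel code; on the GPU the K-1 pivot comparisons would be issued by separate cores in parallel):

#include <cstddef>
#include <vector>

// One K-ary step: K-1 evenly spaced pivots split the range [lo, hi) into K
// sub-ranges, so the range shrinks by a factor of ~K per step instead of 2.
template <int K>
void kary_step(const std::vector<int>& a, int key,
               std::size_t& lo, std::size_t& hi) {
    std::size_t next_lo = lo, next_hi = hi;
    for (int i = 1; i < K; ++i) {                 // K-1 pivot comparisons;
        std::size_t p = lo + i * (hi - lo) / K;   // on a GPU these run in parallel
        if (a[p] <= key) next_lo = p;             // key is at or right of this pivot
        else { next_hi = p; break; }              // key is left of this pivot
    }
    lo = next_lo;
    hi = next_hi;
}

A full search repeats this step until the range holds at most K elements and then scans the remainder, dropping the step count from log2(n) to logK(n) as in the worked example above.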
29
Option 2: GPUs can process hundreds of queries in parallel¹.
Key features:
• Maximize memory bandwidth (parallel reads).
• Prefetching to use the SIMD architecture.
• Maximize throughput (number of threads).
MySQL architecture implications:
• Need modification of the data layout of the binary tree.
• Need modification of the query optimizer to gather similar queries.
• Needs many search queries to achieve high throughput.
[Figure: many queries (Query 1, Query 2, Query 3, ...) traverse the tree simultaneously, each thread reading and comparing node keys in parallel.]
¹ Approach presented in “Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs (2010)”, C. Kim, J. Chhugani, et al.
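A hedged CPU-side C++ sketch of the throughput-oriented layout and batching (illustrative only; the flat array layout and names are assumptions, and a real implementation would map each query to a GPU thread):

#include <cstddef>
#include <vector>

// Complete binary search tree stored as a flat array (node i has children at
// 2i+1 and 2i+2): a GPU/SIMD-friendly layout with no pointers to chase.
struct FlatTree {
    std::vector<int> keys;   // laid out level by level from the sorted key set
};

// Process a whole batch of independent queries; on a GPU each iteration of
// the outer loop would be a separate thread, maximising throughput.
void search_batch(const FlatTree& t,
                  const std::vector<int>& queries,
                  std::vector<bool>& found) {
    found.assign(queries.size(), false);
    for (std::size_t q = 0; q < queries.size(); ++q) {
        std::size_t i = 0;
        while (i < t.keys.size()) {
            if (t.keys[i] == queries[q]) { found[q] = true; break; }
            i = (queries[q] < t.keys[i]) ? 2 * i + 1 : 2 * i + 2;
        }
    }
}

The regular array layout and the batching of similar queries correspond to the data-layout and query-optimizer changes listed above.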
30
• Option 1 optimizes for latency, so speedup is quoted for a single query running with a single thread.
• Option 2 optimizes for throughput, so speedup assumes sufficient queries to keep the GPU fully occupied and is compared to CPU using both cores.
• As Option 2 optimizes for throughput, latency may be increased.
• Speedup is quoted for the overall performance of a full-text keyword search on the Simple English Wikipedia database, and not just the index search (which accounts for ~71% of runtime).
Performance on Zacate

                             Speedup vs CPU alone¹   Bottleneck
Option 1 (projected)         ~2.2x                   GPU clock frequency
Option 2 (projected)         ~3.2x                   Number of GPU cores
Option 2 (initial results)   ~2.1x                   Number of GPU cores

¹ Speedup is quoted for a given system versus the same system not using the GPU.
31
Reverse Time Migration
• Geoscience algorithm for imaging the earth’s subsurface
• Runtime of weeks to months on thousands of CPU cores
• Core computational kernel is 3D finite difference wave propagation
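For concreteness, a minimal C++ sketch of one time step of such a propagator (a second-order stencil only; real RTM kernels use high-order stencils, absorbing boundaries and source injection, and the names here are invented):

#include <vector>

// One time step of 2nd-order acoustic wave propagation on an nx*ny*nz grid:
// p_next = 2*p - p_prev + vel^2 * dt^2/dx^2 * laplacian(p).
void step(std::vector<float>& p_next, const std::vector<float>& p,
          const std::vector<float>& p_prev, const std::vector<float>& vel2,
          int nx, int ny, int nz, float dt2_over_dx2) {
    auto at = [&](int x, int y, int z) { return (z * ny + y) * nx + x; };
    for (int z = 1; z < nz - 1; ++z)
        for (int y = 1; y < ny - 1; ++y)
            for (int x = 1; x < nx - 1; ++x) {
                int i = at(x, y, z);
                float lap = p[at(x-1,y,z)] + p[at(x+1,y,z)]
                          + p[at(x,y-1,z)] + p[at(x,y+1,z)]
                          + p[at(x,y,z-1)] + p[at(x,y,z+1)]
                          - 6.0f * p[i];
                p_next[i] = 2.0f * p[i] - p_prev[i]
                          + vel2[i] * dt2_over_dx2 * lap;
            }
}

The kernel is completely regular (unit-stride inner loop, no data-dependent control flow), which is what makes it such a strong acceleration target.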
32
Application Structure
33
[Figure: the ‘source’ wave is forward-extrapolated from t=0 to t=tmax and the ‘received’ wave is reverse-extrapolated from t=tmax back to t=0; the two pressure fields are cross-correlated at each time point.]
RTM Option 1 – Storing pressure fields
34
[Figure: three-step pipeline split between accelerator and CPU. Step 1: extrapolate the source wavefield, saving the wavefields to memory/disk. Step 2: extrapolate the receiver wavefield, loading the source wavefields from disk and imaging them with the receiver field. Step 3: post-process and save the output image.]
RTM Option 2 – Recompute pressure fields
35
[Figure: three-step pipeline split between accelerator and CPU. Step 1: extrapolate the source wavefield, saving its boundaries in memory. Step 2: extrapolate the receiver wavefield while re-extrapolating the source backwards from the saved boundary elements (fed back to the source propagator) and imaging. Step 3: post-process and save the output image.]
Modelling Performance Impact for Design Alternatives
• Option 1 – Store: higher performance at low speedups; limited by disk I/O
• Option 2 – Recompute: performance scales
• Which is better depends on problem size and relative disk I/O bandwidth
[Figure: modelled performance of Option 2 against Option 1 at full and partial problem sizes.]

36
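As a hedged sketch of how this trade-off can be modelled before committing to an implementation (not from the slides; all figures and names are placeholder assumptions):

#include <cstdio>

// Option 1 (store):     dominated by writing and reading back every source snapshot.
// Option 2 (recompute): pays roughly one extra source propagation instead.
double store_seconds(double snapshot_bytes, int timesteps, double disk_Bps) {
    return 2.0 * snapshot_bytes * timesteps / disk_Bps;   // write, then read back
}
double recompute_seconds(double flops_per_step, int timesteps,
                         double sustained_flops) {
    return flops_per_step * timesteps / sustained_flops;  // extra propagation
}

int main() {
    // Hypothetical problem: 4 GB per snapshot, 5000 timesteps.
    double snap = 4e9; int nt = 5000;
    std::printf("Option 1 (store, 500 MB/s disk): %.0f s\n",
                store_seconds(snap, nt, 5e8));
    std::printf("Option 2 (recompute, 100 GFLOP/step at 200 GFLOP/s): %.0f s\n",
                recompute_seconds(100e9, nt, 200e9));
}

With these placeholder numbers the recompute option wins comfortably, but a larger snapshot or a faster parallel file system shifts the crossover, which is why the comparison has to be made per problem size.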
Introduction What do we mean by acceleration?
37
The Process How do we go about it?
The Tools What can we automate?
Case Studies How can we apply this for real?
Summary What do you think?
Summary
• Have a clear goal about what needs to be faster
• Identify all the code that needs to be accelerated
• Understand the data and compute requirements
• Map the algorithm to the architecture
• Evaluate all options thoroughly
• Leave intuition at the door, measure quantitatively
• Implement systematically
38
39
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.
The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.