© 2015 Continuum Analytics- Confidential & Proprietary

Blowing the Doors Off Your Bottlenecks with Python on AMD APUs

Stan Seibert Continuum Analytics

December 8, 2015

My Background

• Trained in physics

• Using Python for data analysis for 10 years

• Using GPUs for data analysis for 7 years

• Currently lead the High Performance Analytics team at Continuum

OUR BEGINNING

• Travis Oliphant & Peter Wang co-founded in 2012

• Team includes OSS authors: NumPy, SciPy, PyTables, Pandas, Jupyter/IPython

• Vision: foundational tools for next generation data scientists

OUR HISTORY

• July 2012: Anaconda V1

• June 2013: 10K/month Anaconda downloads

• May 2014: Anaconda V2

• Sept 2014: 100K/month Anaconda downloads

• May 2015: 150K/month Anaconda downloads

OUR TEAM

[Chart: 2012 to 2015, vertical axis 0 to 140; series: Ops, Sales & Mktg, Devl & Eng]

• Global Community: 2M+

• Investors: General Catalyst | BuildGroup

• Global Presence: Americas | EMEA

• Enterprise Customers: 30+

• Industries: Financial Services, Government, Health & Life Sciences, Retail & CPG, Oil & Gas, High Tech

• OSS Contributors: 75+

Agenda

1. Numba: A Compiler for Python

2. HSA: Bringing the CPU and GPU together

3. Numba+HSA Examples

4. Conclusion

NUMBA: A POWERFUL & FAST PYTHON COMPILER

Designed specifically for math-intensive algorithms and NumPy arrays

Can accelerate Python functions by 2x to 200x, approaching the speeds of C or FORTRAN

Numba: A Powerful and Fast Python Compiler

How Does Numba Work?

[Diagram: Numba compilation pipeline. Stages shown: Python Function (bytecode), Function Arguments, Bytecode Analysis, Numba IR, Rewrite IR, Type Inference, Lowering, LLVM IR, LLVM JIT, Machine Code, Cache, Execute!]

    @jit
    def do_math(a, b):
        …

    >>> do_math(x, y)
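As a minimal sketch of the flow above, the snippet below decorates a function with `@jit` and calls it once. The body of `do_math` is a hypothetical stand-in (the slide elides it), and the `try`/`except` fallback is an assumption added so the example still runs, uncompiled, where Numba is not installed.

```python
import math

try:
    from numba import jit          # the real compiler, if Numba is installed
except ImportError:                # assumed fallback: run as plain Python
    def jit(*args, **kwargs):
        # Supports both the bare @jit form and the @jit(...) form.
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@jit(nopython=True)
def do_math(a, b):
    # Hypothetical body: a scalar loop of the kind Numba is designed for.
    total = 0.0
    for i in range(a):
        total += math.sqrt(i) * b
    return total


# With Numba, the first call infers the argument types (int, float),
# compiles machine code for them, and then executes it.
result = do_math(10, 2.0)
print(result)
```

Later calls with the same argument types reuse the compiled version; a call with new argument types triggers another compilation.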

Supported Platforms

OS:
• Windows (7 and later)
• OS X (10.7 and later)
• Linux (~RHEL 5 and later)

HW:
• 32 and 64-bit x86 CPUs
• Experimental support for ARMv7 (Raspberry Pi 2)
• AMD GPUs supporting HSA
• NVIDIA GPUs that support CUDA

SW:
• Python 2 and 3
• NumPy 1.6 through 1.9

Questions?

HSA: BRINGING THE CPU AND GPU TOGETHER

What is HSA?

Heterogeneous System Architecture (HSA)

HSA is a multi-vendor standard for creating chips with CPU and GPU cores that work together and share the same memory. This standard includes an API for loading compute kernels, launching tasks, and communicating between CPU and GPU. Compute kernels are written in HSAIL.

Why HSA?

• Manually moving data between CPU and GPU memory spaces adds code complexity and execution overhead

• Traditional GPU programming tends to force algorithms to fit into “all-CPU” or “all-GPU” categories

• HSA makes it easier to let each core do what it is good at:
    • CPU: low latency sequential calculations
    • GPU: high throughput data parallel calculations

The HSA Programming Model

[Diagram: a grid divided into work-groups, each made up of work-items]
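The relationship between the three levels can be sketched in plain Python. This is only an illustration of the index arithmetic for a one-dimensional grid, not an HSA or Numba API; the function name `work_item_ids` and the sizes used are hypothetical.

```python
def work_item_ids(global_size, group_size):
    """Yield (global_id, group_id, local_id) for every work-item in a 1-D grid."""
    for global_id in range(global_size):
        # A work-item's position in the grid splits into the work-group it
        # belongs to and its position inside that work-group.
        group_id, local_id = divmod(global_id, group_size)
        yield global_id, group_id, local_id


# A grid of 8 work-items organised into work-groups of 4.
ids = list(work_item_ids(global_size=8, group_size=4))
for global_id, group_id, local_id in ids:
    print(f"work-item {global_id}: work-group {group_id}, local id {local_id}")
```

On real hardware the work-items run in parallel rather than in a loop, and each one reads its own ids instead of receiving them from an iterator.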

NUMBA & HSA EXAMPLES

Hardware and Software Requirements

• Ubuntu Linux 14.04 64-bit

• Kaveri or Carrizo APU (Numba tested with A10-7850K, A10-7800P)

• At least 4 GB of system memory

• Example code on GitHub: https://github.com/ContinuumIO/Numba-HSA-Webinar/

Setup Instructions

• Install drivers from: https://github.com/HSAFoundation/HSA-Docs-AMD/wiki/HSA-Platforms-&-Installation

• Download and install 64-bit Linux Miniconda from: http://conda.pydata.org/miniconda.html

• Run the following commands:

    conda create -n hsa_webinar python=3.4 \
        numba libhlc pandas bokeh matplotlib basemap jupyter
    source activate hsa_webinar
    export LD_LIBRARY_PATH=/opt/hsa/lib:$LD_LIBRARY_PATH
    jupyter notebook

EXAMPLE #1: CREATING A UFUNC

Sample Data Set

• Geographic point data
    • Latitude, Longitude in degrees
    • Distance computations involve a lot of math

• Sample data comes from satellite-observed lightning strikes on Earth, but could easily be:
    • Geotagged social media posts
    • GPS tracking information for fleet vehicles
    • Geocoded customer addresses

Task: Geographic Locality

• Given a large collection of points, what is the distance of each from a target point?

• How many are within a given range?

What is a ufunc?

A universal function (ufunc) is a function that operates on NumPy arrays element by element and broadcasts its inputs to a common shape.
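A small illustration using two of NumPy's built-in ufuncs, `np.sqrt` and `np.add`; the array values are arbitrary.

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])

# A ufunc is applied to every element without an explicit Python loop.
roots = np.sqrt(a)

# Broadcasting: a (2, 1) column combined with a (3,) row gives a (2, 3) result.
col = np.array([[0.0], [10.0]])
grid = np.add(col, a)

print(roots)
print(grid)
```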

Parallelizing Ufuncs

• Ufunc computations are inherently parallel

• Numba can auto-parallelize a user-created ufunc for many platforms, including HSA

• Developer does not need to know any details about GPU scheduling

Computing Distance

http://en.wikipedia.org/wiki/Great-circle_distance#Computational_formulas
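The formula shown on the slide did not survive extraction. One of the formulas listed in the referenced section is the haversine formula, given here as an assumption about what the talk used. The symbols are ours: $\phi_1, \phi_2$ are the latitudes and $\lambda_1, \lambda_2$ the longitudes of the two points in radians, $r$ is the radius of the sphere, and $d$ is the distance along its surface.

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\phi_2 - \phi_1}{2}\right)
     + \cos\phi_1 \, \cos\phi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
d &= 2 r \arcsin\!\left(\sqrt{a}\right)
\end{aligned}
```

For input in degrees, each angle is first multiplied by $\pi / 180$.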

Computing Distance

[Code listing with callouts: type signature, device function, select ufunc target]
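The listing itself is not recoverable, so the following is a sketch of a ufunc with those three ingredients, not the code from the talk. Assumptions: the haversine formula, the names `hav` and `gc_distance`, an Earth radius of 6371 km, and `target="cpu"` in place of the HSA target so that the example runs on any machine. The fallback branch is also an assumption, added so the code runs (slowly) without Numba.

```python
import math

import numpy as np

try:
    from numba import jit, vectorize
except ImportError:
    # Assumed fallback so the sketch runs without Numba (pure Python, slow).
    def jit(**options):
        return lambda func: func

    def vectorize(signatures, target="cpu"):
        return lambda func: np.vectorize(func, otypes=[np.float32])

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


@jit(nopython=True)
def hav(angle):
    # Scalar helper called from inside the ufunc (the "device function" role).
    return math.sin(angle * 0.5) ** 2


# Type signature: four float32 inputs, one float32 result.
# target="cpu" keeps this portable; the slide selects the HSA target here.
@vectorize(["float32(float32, float32, float32, float32)"], target="cpu")
def gc_distance(lat1, lon1, lat2, lon2):
    to_rad = math.pi / 180.0
    phi1 = lat1 * to_rad
    phi2 = lat2 * to_rad
    dlam = (lon2 - lon1) * to_rad
    a = hav(phi2 - phi1) + math.cos(phi1) * math.cos(phi2) * hav(dlam)
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2.0 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


lat = np.array([0.0, 0.0, 45.0], dtype=np.float32)
lon = np.array([0.0, 90.0, 90.0], dtype=np.float32)
target = np.float32(0.0)

# Called like any NumPy function; the scalar target broadcasts over the arrays.
dist_km = gc_distance(lat, lon, target, target)
within_range = int(np.count_nonzero(dist_km < 500.0))
print(dist_km, within_range)
```

The last two lines answer both questions from the task: the distance of every point from the target, and how many points fall within a chosen range (500 km here, an arbitrary value).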

Calling the function

No special syntax to call a GPU ufunc!

Performance

Performance Tips and Tricks

• Prefer 32-bit over 64-bit data

• GPUs are fast at special math functions

• Don’t force it: If it is easier to do a calculation on the CPU, do it there!
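To make the first tip concrete, this sketch compares the same latitude values stored as 64-bit and 32-bit floats. The array size and value range are arbitrary; whether 32-bit precision is acceptable depends on the application.

```python
import numpy as np

lat64 = np.linspace(-90.0, 90.0, 1_000_000)   # NumPy defaults to float64
lat32 = lat64.astype(np.float32)              # same values, half the bytes

bytes64 = lat64.nbytes
bytes32 = lat32.nbytes

# Largest rounding error introduced by the conversion, in degrees.
max_rounding_error = float(np.max(np.abs(lat64 - lat32)))

print(bytes64, bytes32, max_rounding_error)
```

Halving the element size halves the memory traffic for the same number of points, and the rounding error for values of this magnitude stays at a few millionths of a degree.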

Pro-tip: Compiling a function for CPU and GPU targets

• Use numba.vectorize as a function:
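The code for this slide is missing from the extracted text. A sketch of the idea follows, assuming the targets `"cpu"` and `"parallel"` so that it runs without a GPU; the talk would pass the HSA target for the GPU version. The function `scale_and_shift` is a hypothetical example, and the fallback branch is an assumption for machines without Numba.

```python
import numpy as np

try:
    from numba import vectorize
except ImportError:
    # Assumed fallback so the sketch runs without Numba (pure Python, slow).
    def vectorize(signatures, target="cpu"):
        return lambda func: np.vectorize(func, otypes=[np.float32])


def scale_and_shift(x, a, b):
    # Plain, undecorated Python function (hypothetical example).
    return a * x + b


signature = ["float32(float32, float32, float32)"]

# Calling vectorize(...) as a function instead of using it as a decorator
# leaves the original function untouched, so it can be compiled per target.
compiled = {
    target: vectorize(signature, target=target)(scale_and_shift)
    for target in ("cpu", "parallel")
}

x = np.arange(5, dtype=np.float32)
a, b = np.float32(2.0), np.float32(1.0)
results = {name: func(x, a, b) for name, func in compiled.items()}
print(results)
```

Keeping both versions around makes it easy to compare results and timings, and to fall back to the CPU version where no suitable GPU is present.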

Questions on Example #1?

EXAMPLE #2: CREATING AN HSA KERNEL

Task: Compute Distance Matrix

• Compute the distance between all pairs of points

• Common first step in route planning, clustering, etc.

• Could do this with ufunc, but let’s write a kernel function instead

The HSA Programming Model

[Diagram: a grid divided into work-groups, each made up of work-items]

Mapping to GPU work-items

[Diagram: 6 × 6 distance matrix, indices 0 to 5 on each axis, zeros on the diagonal, with workitem 0 through workitem 5 each assigned to one slice of the matrix]

Note: There are more efficient ways to divide the work than this!
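The mapping can be written out in plain Python, assuming that each work-item owns one row of the matrix. The second function shows one possible finer-grained alternative, one work-item per cell; the talk does not say which alternatives it has in mind, so that part is purely illustrative, and both function names are hypothetical.

```python
def cells_for_work_item(work_item, n_points):
    """Row mapping: work-item i owns every cell (i, col) of row i."""
    return [(work_item, col) for col in range(n_points)]


def cell_for_flat_id(flat_id, n_points):
    """Cell mapping: one work-item per cell, from a flat id to (row, col)."""
    return divmod(flat_id, n_points)


n_points = 6

# Row mapping: 6 work-items, each looping over 6 cells.
assignment = {w: cells_for_work_item(w, n_points) for w in range(n_points)}
all_cells = [cell for cells in assignment.values() for cell in cells]

# Cell mapping: 36 work-items, each computing a single cell.
fine_cells = [cell_for_flat_id(i, n_points) for i in range(n_points * n_points)]

print(assignment[2])
print(fine_cells[:8])
```

Both mappings cover every cell exactly once; the second exposes more parallel work-items for the same matrix.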

Creating a Device Function

Creating a Kernel Function

Calling a Kernel
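The code for the device function, the kernel, and the kernel call is not in the extracted text. The sketch below emulates the three roles in plain Python so that the structure is visible without HSA hardware. It is not the Numba HSA API: the `launch` helper, the function names, the haversine formula, and the Earth radius are all assumptions, and the work-items run one after another here rather than in parallel.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def gc_distance(lat1, lon1, lat2, lon2):
    """Device-function role: scalar helper called from the kernel."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_matrix_kernel(global_id, lat, lon, out):
    """Kernel role: the body run by one work-item, which fills one row."""
    n = len(lat)
    if global_id >= n:      # guard for grids padded past the data size
        return
    for j in range(n):
        out[global_id][j] = gc_distance(lat[global_id], lon[global_id],
                                        lat[j], lon[j])


def launch(kernel, global_size, *args):
    """Stand-in for a kernel launch: run every work-item sequentially."""
    for global_id in range(global_size):
        kernel(global_id, *args)


lat = [0.0, 0.0, 90.0]
lon = [0.0, 90.0, 0.0]
out = [[0.0] * len(lat) for _ in lat]

# A grid of 4 work-items for 3 points, so the last one hits the guard.
launch(distance_matrix_kernel, 4, lat, lon, out)
print(out)
```

In the real version the kernel is compiled by Numba for the GPU, each work-item obtains its own index from the runtime instead of receiving it as an argument, and, because HSA shares memory between CPU and GPU, the output array needs no explicit copy.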

Performance

Performance Tips and Tricks

• Use lots of work-items

• Minimize branch divergence

• Learn from other GPU APIs: OpenCL and CUDA are very similar to HSA

Questions on Example #2?

CONCLUSION

Conclusion

• Create high-performing CPU or GPU code in Python with Numba!

• HSA lets you process data using the GPU and the CPU, without the overhead of memory copies

• Numba + HSA is a great combination

• The Jupyter notebook used in this demo can be downloaded here: https://github.com/ContinuumIO/Numba-HSA-Webinar

• For more documentation: http://numba.pydata.org/numba-doc/0.22.1/hsa/index.html

What’s Next?

• Boltzmann Initiative: HSA+ for FirePro GPU cards

• HSA code for APUs will be portable to FirePro cards with few changes

• Stay tuned for more updates!

Resources

AMD Developer Central
• Additional Developer Resources: developer.amd.com
• Follow AMD Developer Central: twitter.com/AMDDevCentral
• This and other webinars posted to YouTube: www.youtube.com/user/AMDDevCentral

Continuum Analytics
• Website: https://continuum.io
• Twitter: @ContinuumIO
• For more information on Numba: http://numba.pydata.org
• Get help optimizing your Python code! Contact sales@continuum.io for a code assessment

Q & A
