Blowing the Doors Off Your Bottlenecks with Python on AMD APUs

Stan Seibert, Continuum Analytics

December 8, 2015

© 2015 Continuum Analytics - Confidential & Proprietary


My Background

• Trained in physics

• Using Python for data analysis for 10 years

• Using GPUs for data analysis for 7 years

• Currently lead the High Performance Analytics team at Continuum


OUR HISTORY

[Chart: vertical axis from 0 to 140, horizontal axis 2012 to 2015; series: Ops, Sales & Mktg, Devl & Eng]

• July 2012: V1 | Anaconda

• June 2013: 10K/mon Anaconda downloads

• May 2014: V2 | Anaconda

• Sept 2014: 100K/mon Anaconda downloads

• May 2015: 150K/mon Anaconda downloads

OUR TEAM

• Global Community: 2M+

• Investors: General Catalyst | BuildGroup

• Global Presence: Americas | EMEA

• Enterprise Customers: 30+

• OSS Contributors: 75+

• Industries: Financial Services, Government, Health & Life Sciences, Retail & CPG, Oil & Gas, High Tech

OUR BEGINNING

• Travis Oliphant & Peter Wang co-founded in 2012

• Team includes OSS authors: NumPy, SciPy, PyTables, Pandas, Jupyter/IPython

• Vision: foundational tools for next generation data scientists


Agenda

1. Numba: A Compiler for Python

2. HSA: Bringing the CPU and GPU together

3. Numba+HSA Examples

4. Conclusion

NUMBA: A POWERFUL & FAST PYTHON COMPILER

• Designed specifically for math-intensive algorithms and NumPy arrays

• Can accelerate Python functions by 2x to 200x

• Approaching the speeds of C or FORTRAN

How Does Numba Work?

    @jit
    def do_math(a, b):
        …

    >>> do_math(x, y)

[Diagram: compilation pipeline. Python function (bytecode) and function arguments → bytecode analysis → Numba IR → type inference → rewrite IR → lowering → LLVM IR → LLVM JIT → machine code → cache → execute!]
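A minimal sketch of the `@jit` workflow named on this slide. The body of `do_math` is elided in the original, so the dot-product loop below is a made-up stand-in, and the `try`/`except` fallback is an assumption added so the snippet still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Fallback (assumption): a do-nothing decorator so the sketch runs without Numba.
    def jit(*args, **kwargs):
        return lambda func: func


@jit(nopython=True)
def do_math(a, b):
    """Hypothetical body: sum of element-wise products as an explicit loop."""
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total


x = np.arange(5, dtype=np.float64)
y = np.ones(5, dtype=np.float64)

# With Numba present, the first call compiles do_math for these argument
# types; later calls with the same types reuse the compiled machine code.
result = do_math(x, y)
```

The explicit loop is deliberate: loops over NumPy arrays are slow in plain Python and are the kind of code the pipeline above is meant to compile.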

Supported Platforms

OS
• Windows (7 and later)
• OS X (10.7 and later)
• Linux (~RHEL 5 and later)

HW
• 32 and 64-bit x86 CPUs
• Experimental support for ARMv7 (Raspberry Pi 2)
• AMD GPUs supporting HSA
• NVIDIA GPUs that support CUDA

SW
• Python 2 and 3
• NumPy 1.6 through 1.9


Questions?

HSA: BRINGING THE CPU AND GPU TOGETHER

What is HSA?

Heterogeneous System Architecture (HSA)

HSA is a multi-vendor standard for creating chips with CPU and GPU cores that work together and share the same memory. The standard includes an API for loading compute kernels, launching tasks, and communicating between CPU and GPU. Compute kernels are written in HSAIL (HSA Intermediate Language).

Why HSA?

• Manually moving data between CPU and GPU memory spaces adds code complexity and execution overhead

• Traditional GPU programming tends to force algorithms to fit into “all-CPU” or “all-GPU” categories

• HSA makes it easier to let each core do what it is good at:
    • CPU: low latency sequential calculations
    • GPU: high throughput data parallel calculations

The HSA Programming Model

[Diagram: a grid divided into work-groups, each made up of work-items]
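The relationship between grid, work-group, and work-item can be sketched with plain index arithmetic. This is an illustration for a one-dimensional grid only; the function name `locate_work_item` and the sizes are made up for the example and are not part of any HSA API.

```python
def locate_work_item(global_id, group_size):
    """Return (work-group index, local index) for a 1-D global work-item id."""
    return divmod(global_id, group_size)


grid_size = 12   # total number of work-items in the grid (assumed)
group_size = 4   # work-items per work-group (assumed)

# Every work-item in the grid belongs to exactly one work-group.
layout = [locate_work_item(i, group_size) for i in range(grid_size)]
num_groups = len({group for group, _ in layout})
```

Work-item 5, for example, is the second item (local index 1) of the second work-group (index 1).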

NUMBA & HSA EXAMPLES

Hardware and Software Requirements

• Ubuntu Linux 14.04 64-bit

• Kaveri or Carrizo APU (Numba tested with A10-7850K, A10-7800P)

• At least 4 GB of system memory

• Example code on GitHub: https://github.com/ContinuumIO/Numba-HSA-Webinar/

Setup Instructions

• Install drivers from: https://github.com/HSAFoundation/HSA-Docs-AMD/wiki/HSA-Platforms-&-Installation

• Download and install 64-bit Linux Miniconda from: http://conda.pydata.org/miniconda.html

• Run the following commands:

    conda create -n hsa_webinar python=3.4 \
        numba libhlc pandas bokeh matplotlib basemap jupyter
    source activate hsa_webinar
    export LD_LIBRARY_PATH=/opt/hsa/lib:$LD_LIBRARY_PATH
    jupyter notebook

EXAMPLE #1: CREATING A UFUNC

Sample Data Set

• Geographic point data
    • Latitude, Longitude in degrees

• Distance computations involve a lot of math

• Sample data comes from satellite-observed lightning strikes on Earth, but could easily be:
    • Geotagged social media posts
    • GPS tracking information for fleet vehicles
    • Geocoded customer addresses

Task: Geographic Locality

• Given a large collection of points, what is the distance of each from a target point?

• How many are within a given range?
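As a CPU baseline for the two questions above, here is a minimal NumPy sketch. It assumes the haversine form of the great-circle distance and a mean Earth radius of 6371 km. The function name `great_circle_km`, the sample points, and the 500 km range are made up for illustration and do not come from the webinar notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat, lon, lat0, lon0):
    """Haversine distance in km from (lat0, lon0); all inputs in degrees."""
    lat, lon, lat0, lon0 = map(np.radians, (lat, lon, lat0, lon0))
    a = (np.sin((lat - lat0) / 2) ** 2
         + np.cos(lat) * np.cos(lat0) * np.sin((lon - lon0) / 2) ** 2)
    # Clip guards against rounding pushing `a` slightly outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Made-up points: the target itself, one degree north, and the far side of the globe.
lats = np.array([0.0, 1.0, 0.0])
lons = np.array([0.0, 0.0, 180.0])

dist = great_circle_km(lats, lons, 0.0, 0.0)          # distance of each point
within_range = int(np.count_nonzero(dist < 500.0))    # how many within 500 km
```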

What is a ufunc?

A universal function (ufunc) is a function that operates on NumPy arrays element by element and broadcasts its inputs to a common shape.
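A short illustration using `np.add`, one of NumPy's built-in ufuncs. The array values are arbitrary.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])       # shape (3,)
b = np.array([[10.0], [20.0]])      # shape (2, 1)

# A ufunc applies its scalar operation to each element...
same_shape = np.add(a, a)           # shape (3,)
# ...and broadcasts scalars or differently shaped arrays to a common shape.
with_scalar = np.add(a, 1.0)        # shape (3,)
broadcast = np.add(a, b)            # shapes (3,) and (2, 1) give (2, 3)

is_ufunc = isinstance(np.add, np.ufunc)
```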

Parallelizing Ufuncs

• Ufunc computations are inherently parallel

• Numba can auto-parallelize a user-created ufunc for many platforms, including HSA

• Developer does not need to know any details about GPU scheduling

Computing Distance

[Slide content not preserved; its annotations label the type signature, the device function, and the selected ufunc target.]

Formula reference: http://en.wikipedia.org/wiki/Great-circle_distance#Computational_formulas
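Since the slide's listing is lost, the following is only a sketch of how the three annotated pieces could fit together, not the webinar's code. It assumes the haversine formula and uses `target="cpu"` so it runs without HSA hardware; the webinar selected an HSA target instead. The names `haversine` and `distance_km` are made up, and the `np.vectorize` fallback is an assumption for environments without Numba.

```python
import math
import numpy as np

try:
    from numba import jit, vectorize
    HAVE_NUMBA = True
except ImportError:
    HAVE_NUMBA = False

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def _haversine(lat1, lon1, lat2, lon2):
    """Helper called once per element; inputs in radians, result in radians."""
    s_lat = math.sin((lat2 - lat1) / 2.0)
    s_lon = math.sin((lon2 - lon1) / 2.0)
    a = s_lat * s_lat + math.cos(lat1) * math.cos(lat2) * s_lon * s_lon
    return 2.0 * math.asin(min(1.0, math.sqrt(a)))


def _distance_km(lat1, lon1, lat2, lon2):
    """Scalar function that becomes the ufunc; inputs in degrees."""
    r = math.pi / 180.0
    return EARTH_RADIUS_KM * haversine(lat1 * r, lon1 * r, lat2 * r, lon2 * r)


if HAVE_NUMBA:
    # Compiled helper: plays the role of the slide's "device function" on the CPU.
    haversine = jit(nopython=True)(_haversine)
    distance_km = vectorize(
        ["float32(float32, float32, float32, float32)"],  # type signature
        target="cpu",                                      # selected ufunc target
    )(_distance_km)
else:
    haversine = _haversine
    distance_km = np.vectorize(_distance_km, otypes=[np.float32])

lats = np.array([0.0, 1.0], dtype=np.float32)
lons = np.array([0.0, 0.0], dtype=np.float32)
dist = distance_km(lats, lons, np.float32(0.0), np.float32(0.0))
```

The call on the last line looks like any other NumPy ufunc call: arrays and scalars are broadcast together, and no target-specific syntax appears at the call site.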

Calling the function

No special syntax to call a GPU ufunc!

Performance

[Slide content not preserved]

Performance Tips and Tricks

• Prefer 32-bit over 64-bit data

• GPUs are fast at special math functions

• Don’t force it: If it is easier to do a calculation on the CPU, do it there!
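The first tip can be illustrated with a small NumPy sketch: casting latitude values from float64 to float32 halves the memory footprint at the cost of a small rounding error. The array contents are made up, and whether float32 precision is acceptable depends on the application.

```python
import numpy as np

lat64 = np.linspace(-90.0, 90.0, 1_000_000)   # float64 by default
lat32 = lat64.astype(np.float32)               # explicit down-cast

bytes64 = lat64.nbytes   # 8 bytes per element
bytes32 = lat32.nbytes   # 4 bytes per element

# Largest rounding error introduced by the cast, in degrees.
max_error_deg = float(np.max(np.abs(lat64 - lat32)))
```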

Pro-tip: Compiling a function for CPU and GPU targets

• Use numba.vectorize as a function:
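The slide's listing is not preserved, so this is a sketch of the pattern rather than the original code. `numba.vectorize(...)` returns a decorator, so it can be applied by hand to one undecorated function once per target. Here `"cpu"` and `"parallel"` stand in for the CPU and GPU targets of the webinar, because which other target names exist depends on the Numba version and hardware. The function `scaled_sine` is hypothetical, and the fallback is an assumption for environments without Numba.

```python
import math
import numpy as np

try:
    from numba import vectorize
except ImportError:
    # Fallback (assumption): ignore the target and use np.vectorize instead.
    def vectorize(signatures, target="cpu"):
        return lambda func: np.vectorize(func, otypes=[np.float32])


def scaled_sine(x, scale):
    """Plain Python scalar function; hypothetical example payload."""
    return scale * math.sin(x)


signatures = ["float32(float32, float32)"]

# Same source function, compiled once per target.
scaled_sine_cpu = vectorize(signatures, target="cpu")(scaled_sine)
scaled_sine_par = vectorize(signatures, target="parallel")(scaled_sine)

x = np.linspace(0.0, 1.0, 8).astype(np.float32)
out_cpu = scaled_sine_cpu(x, np.float32(2.0))
out_par = scaled_sine_par(x, np.float32(2.0))
```

Keeping the scalar function undecorated makes it possible to compare targets on the same input, or to choose one at run time.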

Questions on Example #1?

EXAMPLE #2: CREATING AN HSA KERNEL

Task: Compute Distance Matrix

• Compute the distance between all pairs of points

• Common first step in route planning, clustering, etc.

• Could do this with ufunc, but let’s write a kernel function instead
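For reference, the all-pairs computation can be written on the CPU with NumPy broadcasting. This is a baseline sketch, not the webinar's kernel; it assumes the haversine formula and a 6371 km Earth radius, and `distance_matrix_km` and the sample points are made up.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def distance_matrix_km(lat_deg, lon_deg):
    """Return the (n, n) matrix of great-circle distances between all pairs."""
    lat = np.radians(lat_deg)[:, np.newaxis]   # column vector, shape (n, 1)
    lon = np.radians(lon_deg)[:, np.newaxis]
    dlat = lat - lat.T                          # (n, n) pairwise differences
    dlon = lon - lon.T
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat) * np.cos(lat.T) * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


lats = np.array([0.0, 1.0, 0.0])
lons = np.array([0.0, 0.0, 1.0])
dist = distance_matrix_km(lats, lons)
```

The result is symmetric with zeros on the diagonal, and its size grows with the square of the number of points, which is what makes this task a natural fit for many parallel work-items.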

The HSA Programming Model

[Diagram: a grid divided into work-groups, each made up of work-items]

Mapping to GPU work-items

[Diagram: a distance matrix with rows and columns labeled 0 to 5 and a 0 entry shown in each row. Each row is assigned to one work-item: row 0 to workitem 0, row 1 to workitem 1, and so on through workitem 5.]

Note: There are more efficient ways to divide the work than this!
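The row-per-work-item mapping can be emulated in plain Python. In this sketch, `distance_row_kernel` is a hypothetical kernel body and the `for` loop stands in for the kernel launch; on a GPU the work-item index is supplied by the runtime and the work-items run in parallel. The haversine formula, the Earth radius, and the sample points are assumptions.

```python
import math
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in radians."""
    s_lat = math.sin((lat2 - lat1) / 2.0)
    s_lon = math.sin((lon2 - lon1) / 2.0)
    a = s_lat * s_lat + math.cos(lat1) * math.cos(lat2) * s_lon * s_lon
    return 2.0 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_row_kernel(workitem_id, lat, lon, out):
    """Work done by one work-item: fill row `workitem_id` of `out`."""
    n = lat.shape[0]
    if workitem_id >= n:   # guard in case the grid is larger than the data
        return
    for j in range(n):
        out[workitem_id, j] = haversine_km(
            lat[workitem_id], lon[workitem_id], lat[j], lon[j])


lat = np.radians(np.array([0.0, 1.0, 0.0]))
lon = np.radians(np.array([0.0, 0.0, 1.0]))
n = lat.shape[0]
out = np.zeros((n, n), dtype=np.float32)

# Sequential stand-in for launching one work-item per row.
for workitem_id in range(n):
    distance_row_kernel(workitem_id, lat, lon, out)
```

Each work-item writes only its own row, so no two work-items touch the same output element and no synchronization is needed.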

Creating a Device Function

[Slide content not preserved]

Creating a Kernel Function

[Slide content not preserved]

Calling a Kernel

[Slide content not preserved]

Performance

[Slide content not preserved]

Performance Tips and Tricks

• Use lots of work-items

• Minimize branch divergence

• Learn from other GPU APIs: OpenCL and CUDA are very similar to HSA

Questions on Example #2?

CONCLUSION

• Create high performing CPU or GPU code in Python with Numba!

• HSA lets you process data using the GPU and the CPU, without the overhead of memory copies

• Numba + HSA is a great combination

• The Jupyter notebook used in this demo can be downloaded here: https://github.com/ContinuumIO/Numba-HSA-Webinar

• For more documentation: http://numba.pydata.org/numba-doc/0.22.1/hsa/index.html

What’s Next?

• Boltzmann Initiative: HSA+ for FirePro GPU cards

• HSA code for APUs will be portable to FirePro cards with few changes

• Stay tuned for more updates!

Resources

AMD Developer Central

• Additional Developer Resources: developer.amd.com

• Follow AMD Developer Central: twitter.com/AMDDevCentral

• This and other webinars posted to YouTube: www.youtube.com/user/AMDDevCentral

Continuum Analytics

• Website: https://continuum.io

• Twitter: @ContinuumIO

• For more information on Numba: http://numba.pydata.org

• Get help optimizing your Python code! Contact [email protected] for a code assessment

Q & A