Blowing the Doors Off Your Bottlenecks with Python on AMD APUs

Stan Seibert, Continuum Analytics

December 8, 2015

My Background

• Trained in physics

• Using Python for data analysis for 10 years

• Using GPUs for data analysis for 7 years

• Currently lead the High Performance Analytics team at Continuum

OUR HISTORY

OUR BEGINNING

• Travis Oliphant & Peter Wang co-founded in 2012

• Team includes OSS authors: NumPy, SciPy, PyTables, Pandas, Jupyter/IPython

• Vision: foundational tools for next generation data scientists

Timeline

• July 2012: V1 | Anaconda

• June 2013: 10K/mon Anaconda downloads

• May 2014: V2 | Anaconda

• Sept 2014: 100K/mon Anaconda downloads

• May 2015: 150K/mon Anaconda downloads

OUR TEAM

• Ops | Sales & Mktg | Devl & Eng

• Global Community: 2M+

• Investors: General Catalyst | BuildGroup

• Global Presence: Americas | EMEA

• Enterprise Customers: 30+

• Industries: Financial Services, Government, Health & Life Sciences, Retail & CPG, Oil & Gas, High Tech

• OSS Contributors: 75+

Agenda

1. Numba: A Compiler for Python

2. HSA: Bringing the CPU and GPU together

3. Numba+HSA Examples

4. Conclusion

NUMBA: A POWERFUL & FAST PYTHON COMPILER

A Powerful and Fast Python Compiler

• Designed specifically for math-intensive algorithms and NumPy arrays

• Can accelerate Python functions by 2x to 200x

• Approaching the speeds of C or FORTRAN

How Does Numba Work?

@jit
def do_math(a, b): …

>>> do_math(x, y)

Compilation pipeline:

• Python Function (bytecode) → Bytecode Analysis → Numba IR

• Numba IR + Function Arguments → Type Inference → Rewrite IR

• Rewrite IR → Lowering → LLVM IR → LLVM JIT → Machine Code

• Execute!
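As a minimal sketch of this flow, the snippet below decorates a hypothetical function (`dot_sum`, not taken from the talk) with `@jit`. It assumes Numba is installed; if it is not, a stand-in no-op decorator keeps the sketch runnable as ordinary Python. Compilation happens lazily on the first call and is specialized to the argument types seen there.

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Assumption: without Numba, fall back to a no-op decorator so the
    # sketch still runs (as ordinary interpreted Python).
    def jit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


@jit(nopython=True)
def dot_sum(a, b):
    # Hypothetical math-heavy loop: sum of element-wise products.
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total


x = np.arange(4, dtype=np.float64)  # [0, 1, 2, 3]
y = np.ones(4, dtype=np.float64)

# With Numba, the first call triggers type inference and compilation for
# two float64 arrays; later calls with the same types reuse the machine code.
result = dot_sum(x, y)
print(result)
```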

Supported Platforms

OS

• Windows (7 and later)

• OS X (10.7 and later)

• Linux (~RHEL 5 and later)

HW

• 32 and 64-bit x86 CPUs

• Experimental support for ARMv7 (Raspberry Pi 2)

• AMD GPUs supporting HSA

• NVIDIA GPUs that support CUDA

SW

• Python 2 and 3

• NumPy 1.6 through 1.9

Questions?

HSA: BRINGING THE CPU AND GPU TOGETHER

What is HSA?

Heterogeneous System Architecture (HSA)

HSA is a multi-vendor standard for creating chips with CPU and GPU cores that work together and share the same memory. The standard includes an API for loading compute kernels, launching tasks, and communicating between CPU and GPU. Compute kernels are written in HSAIL (HSA Intermediate Language).

Why HSA?

• Manually moving data between CPU and GPU memory spaces adds code complexity and execution overhead

• Traditional GPU programming tends to force algorithms to fit into “all-CPU” or “all-GPU” categories

• HSA makes it easier to let each core do what it is good at:

    • CPU: low latency sequential calculations

    • GPU: high throughput data parallel calculations

The HSA Programming Model

[Figure labels: Grid, Work-group, Work-item]
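The relationship between the three levels can be sketched in plain Python. This assumes the usual OpenCL-style convention for a 1-D grid, where a work-item's global index is its work-group index times the work-group size plus its local index; the function and variable names are illustrative only.

```python
def global_id(group_id, local_id, group_size):
    """Global index of a work-item in a 1-D grid (OpenCL-style convention)."""
    return group_id * group_size + local_id


def decompose(gid, group_size):
    """Inverse mapping: global index -> (work-group index, local index)."""
    return divmod(gid, group_size)


grid_size = 12   # total number of work-items in the grid (illustrative)
group_size = 4   # work-items per work-group (illustrative)
n_groups = grid_size // group_size

# One row per work-group, one entry per work-item.
layout = [
    [global_id(g, l, group_size) for l in range(group_size)]
    for g in range(n_groups)
]
print(layout)
```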

NUMBA & HSA EXAMPLES

Hardware and Software Requirements

• Ubuntu Linux 14.04 64-bit

• Kaveri or Carrizo APU (Numba tested with A10-7850K, A10-7800P)

• At least 4 GB of system memory

• Example code on GitHub: https://github.com/ContinuumIO/Numba-HSA-Webinar/

Setup Instructions

• Install drivers from: https://github.com/HSAFoundation/HSA-Docs-AMD/wiki/HSA-Platforms-&-Installation

• Download and install 64-bit Linux Miniconda from: http://conda.pydata.org/miniconda.html

• Run the following commands:

    conda create -n hsa_webinar python=3.4 \
        numba libhlc pandas bokeh matplotlib basemap jupyter
    source activate hsa_webinar
    export LD_LIBRARY_PATH=/opt/hsa/lib:$LD_LIBRARY_PATH
    jupyter notebook

EXAMPLE #1: CREATING A UFUNC

Sample Data Set

• Geographic point data

• Latitude, Longitude in degrees

• Distance computations involve a lot of math

• Sample data comes from satellite-observed lightning strikes on Earth, but could easily be:

    • Geotagged social media posts

    • GPS tracking information for fleet vehicles

    • Geocoded customer addresses

Task: Geographic Locality

• Given a large collection of points, what is the distance of each from a target point?

• How many are within a given range?

What is a ufunc?

A Universal function (ufunc) is a special function that broadcasts over elements of a NumPy array.
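A short illustration using ufuncs that ship with NumPy (`np.radians` and `np.add`); the sample values are made up for the example.

```python
import numpy as np

lat_deg = np.array([0.0, 30.0, 90.0])

# np.radians is a built-in ufunc: it is applied to every element.
lat_rad = np.radians(lat_deg)
print(isinstance(np.radians, np.ufunc))

# Broadcasting: a (3,) array combined with a (2, 1) array gives a (2, 3) result.
offsets = np.array([[0.0], [10.0]])
shifted = np.add(lat_deg, offsets)
print(shifted.shape)
```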

Parallelizing Ufuncs

• Ufunc computations are inherently parallel

• Numba can auto-parallelize a user-created ufunc for many platforms, including HSA

• Developer does not need to know any details about GPU scheduling

Computing Distance

http://en.wikipedia.org/wiki/Great-circle_distance#Computational_formulas

Computing Distance

Type signature

Device function

Select ufunc target
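The slide's code did not survive extraction, so the following is only a sketch of what a distance ufunc with a type signature and a target can look like. It uses the haversine formula (one of the formulas on the Wikipedia page cited above, not necessarily the one used in the talk), a hypothetical function name, and the CPU target, because the name of the HSA target depends on the Numba release. If Numba is missing, `np.vectorize` stands in.

```python
import math
import numpy as np

try:
    from numba import vectorize
except ImportError:
    # Assumption: without Numba, np.vectorize stands in for the decorator.
    def vectorize(*args, **kwargs):
        def wrap(func):
            return np.vectorize(func)
        return wrap

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


# Type signature: four float64 inputs, one float64 output.
# target="cpu" is used here; the talk selects the HSA target at this point.
@vectorize(["float64(float64, float64, float64, float64)"], target="cpu")
def great_circle_km(lat1, lng1, lat2, lng2):
    # Haversine formula; inputs are in degrees.
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2.0 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Made-up sample points and a target at latitude 0, longitude 0.
lat = np.array([0.0, 0.0, 45.0])
lng = np.array([0.0, 90.0, 0.0])

# Called like any other ufunc: arrays and scalars broadcast together.
dist = great_circle_km(lat, lng, 0.0, 0.0)
within = np.count_nonzero(dist < 6000.0)
print(dist, within)
```

The last two lines correspond to the two questions in the task: the distance of each point from the target, and how many fall within a given range.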

Calling the function

No special syntax to call a GPU ufunc!

Performance

Performance Tips and Tricks

• Prefer 32-bit over 64-bit data

• GPUs are fast at special math functions

• Don’t force it: If it is easier to do a calculation on the CPU, do it there!
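The first tip can be checked with NumPy alone. The sketch below converts a made-up array of latitudes to 32-bit floats and compares memory use and rounding error.

```python
import numpy as np

lat64 = np.linspace(-90.0, 90.0, 1000000)  # float64 by default
lat32 = lat64.astype(np.float32)

# Half the bytes to store and to move through memory.
print(lat64.nbytes, lat32.nbytes)

# Rounding error introduced by the conversion, in degrees.
max_err = np.max(np.abs(lat64 - lat32.astype(np.float64)))
print(max_err)
```

Whether that rounding error is acceptable depends on the application, so it is worth measuring on real data before switching.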

Pro-tip: Compiling a function for CPU and GPU targets

• Use numba.vectorize as a function:
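The code for this slide is missing from the extracted text, so here is a hedged sketch of the pattern. `unit_x` is a hypothetical scalar function. Since the HSA target name depends on the Numba release, the second target shown is Numba's multi-threaded `"parallel"` CPU target as a stand-in for the GPU one. Without Numba, `np.vectorize` is used for both.

```python
import math
import numpy as np

try:
    from numba import vectorize
except ImportError:
    # Assumption: without Numba, np.vectorize stands in for every target.
    def vectorize(*args, **kwargs):
        return lambda func: np.vectorize(func)


def unit_x(lat, lng):
    # Hypothetical scalar function: x coordinate on the unit sphere.
    return math.cos(math.radians(lat)) * math.cos(math.radians(lng))


signature = ["float64(float64, float64)"]

# Calling vectorize as a function (not as a decorator) leaves unit_x
# untouched, so the same source can be compiled for several targets.
cpu_version = vectorize(signature, target="cpu")(unit_x)
parallel_version = vectorize(signature, target="parallel")(unit_x)

lat = np.array([0.0, 60.0, 90.0])
lng = np.array([0.0, 0.0, 45.0])
print(cpu_version(lat, lng))
print(parallel_version(lat, lng))
```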

Questions on Example #1?

EXAMPLE #2: CREATING AN HSA KERNEL

Task: Compute Distance Matrix

• Compute the distance between all pairs of points

• Common first step in route planning, clustering, etc.

• Could do this with ufunc, but let’s write a kernel function instead

The HSA Programming Model

[Figure labels: Grid, Work-group, Work-item]

Mapping to GPU work-items

[Figure: indices 0 1 2 3 4 5 mapped to workitem 0 through workitem 5]

Note: There are more efficient ways to divide the work than this!
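A CPU-only emulation of this mapping, assuming each work-item fills one row of the distance matrix (the figure does not say whether it is a row or a column). The function names, the haversine formula, and the sample points are illustrative. On a GPU the outer loop would disappear and each work-item would obtain its own index from the runtime.

```python
import math
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def great_circle_km(lat1, lng1, lat2, lng2):
    # Plays the role of the device function: haversine distance, inputs in degrees.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lng2 - lng1)
    a = math.sin((phi2 - phi1) / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2.0 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_row(workitem, lat, lng, out):
    # Plays the role of the kernel body: one work-item fills one row.
    for j in range(lat.shape[0]):
        out[workitem, j] = great_circle_km(lat[workitem], lng[workitem], lat[j], lng[j])


# Six made-up points, matching the six work-items in the figure.
lat = np.array([0.0, 0.0, 45.0, -45.0, 90.0, 10.0])
lng = np.array([0.0, 90.0, 0.0, 0.0, 0.0, 10.0])
n = lat.shape[0]
out = np.zeros((n, n))

# Emulated launch: on the GPU these iterations would run in parallel.
for workitem in range(n):
    distance_row(workitem, lat, lng, out)

print(out.round(1))
```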

Creating a Device Function

Creating a Kernel Function

Calling a Kernel

Performance

Performance Tips and Tricks

• Use lots of work-items

• Minimize branch divergence

• Learn from other GPU APIs: OpenCL and CUDA are very similar to HSA

Questions on Example #2?

CONCLUSION

Conclusion

• Create high performing CPU or GPU code in Python with Numba!

• HSA lets you process data using the GPU and the CPU, without the overhead of memory copies

• Numba + HSA is a great combination

• The Jupyter notebook used in this demo can be downloaded here: https://github.com/ContinuumIO/Numba-HSA-Webinar

• For more documentation: http://numba.pydata.org/numba-doc/0.22.1/hsa/index.html

What’s Next?

• Boltzmann Initiative: HSA+ for FirePro GPU cards

• HSA code for APUs will be portable to FirePro cards with few changes

• Stay tuned for more updates!

Resources

AMD Developer Central

• Additional Developer Resources: developer.amd.com

• Follow AMD Developer Central: twitter.com/AMDDevCentral

• This and other webinars posted to YouTube: www.youtube.com/user/AMDDevCentral

Continuum Analytics

• Website: https://continuum.io

• Twitter: @ContinuumIO

• For more information on Numba: http://numba.pydata.org

• Get help optimizing your Python code! Contact sales@continuum.io for a code assessment
