PyCUDA: How to harness the power of graphics cards in Python applications

Fabrizio Milo

PyCUDA: Harnessing the power of GPU with Python

PyCon 4 – Florence 2010 – Fabrizio Milo

Talk Structure

1. Why a GPU?
2. How does it work?
3. How do I program it?
4. Can I use Python?

WHY A GPU?

APPLICATIONS & DEMOS

Why GPU?

How does it work?

[Figure: CPU and GPU block diagrams. Labels on the CPU side: Control, Cache, four ALUs, DRAM; on the GPU side: DRAM]

CUDA: Compute Unified Device Architecture

A parallel computing architecture for NVIDIA GPUs

DirectX Compute

Execution Model

CUDA Device Model

EXECUTION MODEL

Thread

Smallest unit of logic

A Block

A group of threads. One block can have many threads.

A Grid

A group of blocks. One grid can have many blocks.

DEVICE MODEL: The hardware

Scalar Processor

Many Scalar Processors
+ Register File
+ Shared Memory

Multiprocessor

Device

Real Example: 10-Series Architecture

- 240 Scalar Processor (SP) cores execute kernel threads
- 30 Streaming Multiprocessors (SMs), each containing:
  - 8 scalar processors
  - 1 double-precision unit
  - shared memory

Software        Hardware

Thread          Scalar Processor
Thread Block    Multiprocessor
Grid            Device
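As a rough sanity check of this mapping, the 10-series figures quoted earlier in the talk (30 multiprocessors with 8 scalar processors each) can be multiplied out in Python. The variable names below are illustrative only and are not part of any API.

```python
# Figures taken from the "10-Series Architecture" slide.
multiprocessors = 30           # Streaming Multiprocessors (SMs) on the device
scalar_processors_per_sm = 8   # Scalar Processors (SPs) inside each SM

# Total number of SP cores available to execute kernel threads.
total_scalar_processors = multiprocessors * scalar_processors_per_sm
print(total_scalar_processors)  # 240
```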

Global Memory

[Figure: Host - Device. Labels: CPU, RAM, Global Memory]

[Figure: Host - Multi Device. Labels: CPU, RAM]

How do I program it?

Kernel: Thread, Block

__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;  // index of this thread within its block
    dest[i] = a[i] * b[i];
}

Kernel: Grid

__global__ void kernel( … )
{
    // global index of this thread across all blocks of the grid
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    …
}
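The index formula `blockIdx.x * blockDim.x + threadIdx.x` can be tried out in plain Python before touching a GPU. This is only a sequential simulation of a one-dimensional launch: on real hardware the threads run in parallel, and the helper names `global_indices` and `blocks_needed` are made up for this sketch.

```python
def global_indices(grid_dim, block_dim):
    """Enumerate blockIdx.x * blockDim.x + threadIdx.x for a 1-D launch."""
    return [
        block_idx * block_dim + thread_idx
        for block_idx in range(grid_dim)      # plays the role of blockIdx.x
        for thread_idx in range(block_dim)    # plays the role of threadIdx.x
    ]


def blocks_needed(n_elements, block_dim):
    """Smallest number of blocks that covers n_elements (ceiling division)."""
    return (n_elements + block_dim - 1) // block_dim


if __name__ == "__main__":
    # 3 blocks of 4 threads give every thread a unique index from 0 to 11.
    print(global_indices(grid_dim=3, block_dim=4))
    # 1000 elements with 256 threads per block need 4 blocks (1024 threads),
    # so a real kernel would also guard with `if (idx < n)`.
    print(blocks_needed(1000, 256))
```

The second helper shows why kernels usually carry a bounds check: the grid is rounded up to whole blocks, so some threads may fall past the end of the data.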

[Figure: How do I program it? The kernel is compiled by NVCC into a .cubin for the GPU; the main logic is compiled by GCC into a .bin for the CPU]

Allocate Memory

cudaMalloc( pointer, size )

Copy to device

cudaMemcpy( dest, src, size, direction )

Kernel Launch

Kernel<<< # blocks, # threads >>>( params )

Get Back the Results

cudaMemcpy( dest, src, size, direction )

Error Handling

if (cudaMalloc( pointer, size ) != cudaSuccess) { handle_error(); }

And soon it becomes …

if (cudaMalloc( pointer, size ) != cudaSuccess) { handle_error(); }

if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) { handle_error(); }

Kernel<<< # blocks, # threads >>>( params );
if (cudaGetLastError() != cudaSuccess) { handle_error(); }

if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) { handle_error(); }

… repeated for every allocation, copy, and kernel launch.

Can I use Python?

[Figure: Python + CUDA = PyCUDA]

& Andreas Klöckner

PyCUDA Philosophy

- Provide complete access
- Automatically manage resources
- Check and report errors
- Cross-platform
- Allow interactive use
- NumPy integration
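The "check and report errors" point can be pictured with a toy wrapper in plain Python: instead of comparing every return value against a success code, a failing call surfaces as an exception. Everything below (`CudaError`, `checked`, `fake_malloc`, the status codes) is hypothetical and only mimics the pattern; it is not PyCUDA's implementation.

```python
SUCCESS = 0          # stand-in for cudaSuccess
OUT_OF_MEMORY = 2    # stand-in for an allocation failure code


class CudaError(Exception):
    """Hypothetical exception type raised when a call reports failure."""


def checked(call):
    """Wrap a status-returning function so that failures raise instead."""
    def wrapper(*args):
        status, result = call(*args)
        if status != SUCCESS:
            raise CudaError(f"{call.__name__} failed with status {status}")
        return result
    return wrapper


@checked
def fake_malloc(size_bytes):
    """Pretend allocator: refuses requests larger than 1 KiB."""
    if size_bytes > 1024:
        return OUT_OF_MEMORY, None
    return SUCCESS, bytearray(size_bytes)


if __name__ == "__main__":
    buffer = fake_malloc(400)      # succeeds, no explicit status check needed
    print(len(buffer))
    try:
        fake_malloc(4096)          # fails, and the failure cannot be ignored
    except CudaError as error:
        print(error)
```

The design choice is that the check lives in one place (the wrapper) rather than at every call site, which is what removes the wall of `if` statements from user code.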

NUMPY - ARRAY

import numpy

my_array = numpy.array([1] * 100)   # 100 elements, indices 0 to 99

my_array[3] = 0                     # set the element at index 3
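One detail worth checking before moving data to the GPU: `numpy.array([1] * 100)` holds integers, whereas the kernels in this talk take `float *` arguments. A minimal sketch, assuming single-precision data is wanted, of building a float32 array and reading its size in bytes:

```python
import numpy

# A list of Python ints produces an integer array.
int_array = numpy.array([1] * 100)
print(int_array.dtype.kind)            # 'i' (signed integer)

# Request 32-bit floats explicitly to match `float *` on the CUDA side.
float_array = numpy.ones(100, dtype=numpy.float32)
print(float_array.dtype, float_array.nbytes)   # float32 400
```

The `nbytes` attribute is the figure a byte-oriented allocation call needs: 100 elements of 4 bytes each.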

PyCUDA: Workflow

Memory Allocation

import pycuda.autoinit
import pycuda.driver as cuda

gpu_mem = cuda.mem_alloc( size_bytes )

Memory Copy

cuda.memcpy_htod( gpu_mem, cpu_mem )

Kernel

from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

Kernel Launch

multiply_them = mod.get_function("multiply_them")
# 1-D block: the kernel indexes with threadIdx.x only, and a block of
# 30 x 64 threads would exceed the per-block thread limit
multiply_them( *args, block=(64, 1, 1) )
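Putting the four steps together, an end-to-end version might look like the sketch below. It assumes PyCUDA is installed and a CUDA-capable GPU is present; the function names `multiply_on_gpu` and `multiply_reference` are invented for this example, and the PyCUDA imports sit inside the function so the file can still be loaded on a machine without a GPU.

```python
import numpy

KERNEL_SOURCE = """
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
"""


def multiply_reference(a, b):
    """CPU reference result used to check the GPU output."""
    return a * b


def multiply_on_gpu(a, b):
    """Run multiply_them on the GPU for two equally sized float32 arrays.

    Assumes the arrays are small enough to fit in a single thread block.
    """
    # Imported lazily so this module still loads without CUDA available.
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context on import)
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    dest = numpy.empty_like(a)

    # Allocate device memory and copy the inputs host -> device.
    a_gpu = cuda.mem_alloc(a.nbytes)
    b_gpu = cuda.mem_alloc(b.nbytes)
    dest_gpu = cuda.mem_alloc(dest.nbytes)
    cuda.memcpy_htod(a_gpu, a)
    cuda.memcpy_htod(b_gpu, b)

    # Compile the kernel and launch one block with one thread per element.
    multiply_them = SourceModule(KERNEL_SOURCE).get_function("multiply_them")
    multiply_them(dest_gpu, a_gpu, b_gpu, block=(a.size, 1, 1), grid=(1, 1))

    # Copy the result device -> host.
    cuda.memcpy_dtoh(dest, dest_gpu)
    return dest
```

On a machine with a GPU, calling `multiply_on_gpu(a, b)` with two float32 arrays of, say, 64 elements should agree with `multiply_reference(a, b)` up to floating-point tolerance.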

DEMO: Hello GPU

GPUARRAY

PyCUDA: GPUArray

gpu_array = gpuarray.to_gpu(numpy_array)

numpy_array = gpu_array.get()

+, -, *, /, fill, sin, exp, rand, basic indexing, norm, inner product …

PyCUDA: GPUArray: Elementwise

import numpy.linalg as la
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

lincomb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]")

c_gpu = gpuarray.empty_like(a_gpu)
lincomb(5, a_gpu, 6, b_gpu, c_gpu)

assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5
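The snippet above leaves `a_gpu` and `b_gpu` undefined. A self-contained variant, again assuming PyCUDA and a CUDA-capable GPU, could wrap the whole round trip in one function. The names `lincomb_on_gpu` and `lincomb_reference` are made up for this sketch, and the PyCUDA imports are deferred so the file loads without CUDA.

```python
import numpy


def lincomb_reference(a, x, b, y):
    """CPU reference for z[i] = a*x[i] + b*y[i]."""
    return a * x + b * y


def lincomb_on_gpu(a, x, b, y):
    """Same computation with ElementwiseKernel; x and y must be float32."""
    # Lazy imports keep this module importable without CUDA.
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context on import)
    import pycuda.gpuarray as gpuarray
    from pycuda.elementwise import ElementwiseKernel

    lincomb = ElementwiseKernel(
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]",
        "lincomb",
    )
    x_gpu = gpuarray.to_gpu(x)          # host -> device
    y_gpu = gpuarray.to_gpu(y)
    z_gpu = gpuarray.empty_like(x_gpu)  # uninitialised output on the device
    lincomb(a, x_gpu, b, y_gpu, z_gpu)
    return z_gpu.get()                  # device -> host
```

Keeping a CPU reference next to the GPU path makes it easy to compare the two results whenever a GPU is available.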

Meta-Programming

__kernel_template__ = """
__global__ void kernel( args )
{
    for (int i = 0; i < {{ iterations }}; i++) {
        {{ operations }}
    }
}
"""

See for example jinja2
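To show what filling such a template involves, here is a small sketch that uses a regular expression from the standard library as a stand-in for jinja2. It only handles plain `{{ name }}` placeholders; the `render` helper and the concrete kernel signature (`float *data` in place of `args`) are assumptions made for the example.

```python
import re

KERNEL_TEMPLATE = """
__global__ void kernel(float *data)
{
    for (int i = 0; i < {{ iterations }}; i++) {
        {{ operations }}
    }
}
"""


def render(template, **values):
    """Replace every {{ name }} placeholder with str(values[name])."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(values[match.group(1)]),
        template,
    )


if __name__ == "__main__":
    source = render(
        KERNEL_TEMPLATE,
        iterations=4,
        operations="data[threadIdx.x] *= 2.0f;",
    )
    print(source)  # CUDA source text with both placeholders filled in
```

The rendered string is ordinary CUDA source, so loop counts and loop bodies can be chosen at run time from Python before the kernel is compiled.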

Generate source!

Performance?

DEMO: Mandelbrot

PyCUDA: Documentation

PyCUDA

Website: http://mathema.tician.de/software/pycuda

License: X Consortium License (no warranty, free for all use)

Dependencies: Python 2.4+, numpy, Boost

In the Future …

OPENCL

THANK YOU & HAVE FUN!

?
