Accelerate Framework
Fast and energy efficient computation
Session 713

Geoff Belter
Engineer, Vector and Numerics Group

These are confidential sessions. Please refrain from streaming, blogging, or taking pictures.


Accelerate Framework: What is it?

• Easy access to a lot of functionality
• Accurate
• Fast with low energy usage
• Works on both OS X and iOS
• Optimized for all generations of hardware

Accelerate Framework: What operations are available?

• Image processing (vImage)
• Digital signal processing (vDSP)
• Transcendental math functions (vForce, vMathLib)
• Linear algebra (LAPACK, BLAS)

Accelerate Framework: Session goals

• How Accelerate helps you
• Areas of your code likely to benefit from Accelerate
• How you use Accelerate

Accelerate Framework: Why is it fast?

• SIMD instructions
  ■ SSE, AVX and NEON
• Match the micro-architecture
  ■ Instruction selection and scheduling
  ■ Software pipelining
  ■ Loop unrolling
• Multi-threaded using GCD

Tips for Successful Use of Accelerate

• Prepare your data
  ■ Contiguous
  ■ 16-byte aligned
• Understand problem size
• Do setup once/destroy at the end
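The "prepare your data" tip can be illustrated with a small allocation helper. This is a minimal sketch, not part of the session: the function names `alloc_aligned_floats` and `is_16_byte_aligned` are hypothetical, and it assumes a POSIX system where `posix_memalign` is available.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate `count` floats in one contiguous block whose address is a
   multiple of 16 bytes. Returns NULL on failure; release with free(). */
float *alloc_aligned_floats(size_t count)
{
    void *block = NULL;
    assert(count > 0);
    /* posix_memalign requires a power-of-two alignment that is a
       multiple of sizeof(void *); 16 satisfies both. */
    if (posix_memalign(&block, 16, count * sizeof(float)) != 0) {
        return NULL;
    }
    return (float *)block;
}

/* Report whether a pointer is 16-byte aligned. */
int is_16_byte_aligned(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}
```

A buffer obtained this way is both contiguous and 16-byte aligned, so it meets the two data-layout conditions listed on the slide.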

Using Accelerate Framework: Xcode

[Xcode screenshots; image content not recoverable from the extracted text.]

Using Accelerate Framework: Command line

cc -framework Accelerate main.c


vImage: Vectorized image processing library

vImage: What's available?

[Slide graphics; content not recoverable from the extracted text.]

Additions and Improvements

• Improved conversion support
• vImage Buffer creation utilities
• Resampling of 16-bit images
• Streamlined Core Graphics interoperability

Core Graphics Interoperability

• How do I use vImage with my CGImageRef?
• New utility functions
  ■ vImageBuffer_InitWithCGImage
  ■ vImageCreateCGImageFromBuffer

Core Graphics Interoperability: From CGImageRef to vImageBuffer

#include <Accelerate/Accelerate.h>

// Create and prepare CGImageRef
CGImageRef inImg;

// Specify Format
vImage_CGImageFormat format = {
    .bitsPerComponent = 8,
    .bitsPerPixel = 32,
    .colorSpace = NULL,
    .bitmapInfo = kCGImageAlphaFirst,
    .version = 0,
    .decode = NULL,
    .renderingIntent = kCGRenderingIntentDefault,
};

// Create vImageBuffer
vImage_Buffer inBuffer;
vImageBuffer_InitWithCGImage(&inBuffer, &format, NULL, inImg, kvImageNoFlags);

Core Graphics Interoperability: From vImageBuffer to CGImageRef

#include <Accelerate/Accelerate.h>

// The output buffer
vImage_Buffer outBuffer;

// Create CGImageRef
vImage_Error error;
CGImageRef outImg = vImageCreateCGImageFromBuffer(&outBuffer, &format, NULL,
                                                  NULL, kvImageNoFlags, &error);

Core Graphics Interoperability: vImageConvert_AnyToAny

• Convert between vImage_CGImageFormat types
• Tips for use
  ■ Create converter once
  ■ Use many times

[Figure: Software JPEG Encode Performance on iPhone 5, in MPixel/s (axis 0 to 30, bigger is better). Bars compare the old method with AnyToAny for these source formats: ARGB, ARGB premul, BGRA premul, RGBA, RGBA premul, RGBA (float), RGBA (float) premul, RGBA (uint16), RGBA (uint16) premul. Bar values are not recoverable from the extracted text.]

Conversion Example: Scaling with a premultiplied PlanarF image

#include <Accelerate/Accelerate.h>

vImage_Buffer src, dst, alpha;
...
// Premultiplied data -> Non-premultiplied data, works in-place
vImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);

// Resize the image
vImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);

// Non-premultiplied data -> Premultiplied data, works in-place
vImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);
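For readers unfamiliar with the terms, the two conversions that bracket the scale can be written per sample. This is a scalar sketch only: the helper names `premultiply` and `unpremultiply` are hypothetical, and the handling of zero alpha is an assumption of the sketch, not a statement about what vImage does.

```c
#include <assert.h>

/* Premultiply one color sample by its alpha (both in [0, 1]). */
float premultiply(float color, float alpha)
{
    assert(alpha >= 0.0f && alpha <= 1.0f);
    return color * alpha;
}

/* Undo premultiplication. Returning 0 for a fully transparent sample is
   an assumption of this sketch, chosen to avoid dividing by zero. */
float unpremultiply(float premultiplied, float alpha)
{
    assert(alpha >= 0.0f && alpha <= 1.0f);
    return (alpha == 0.0f) ? 0.0f : premultiplied / alpha;
}
```

The vImage calls above apply the same kind of per-sample operation across a whole planar buffer, with the alpha values supplied as a separate plane.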

Conversions in Context: Scaling on a premultiplied PlanarF image

Percentage of time taken:

Step             Time taken
Unpremultiply    1.10%
Scale            98.26%
Premultiply      0.64%

vImage vs. OpenCV

OpenCV: The competition

• Open source computer vision library
• Image processing module

Points of Comparison

• Execution time
• Energy consumed

vImage Speedup over OpenCV (iPhone 5)

Above one is better.

Operation               Speedup over OpenCV
Histogram               1.62
Max                     23.19
Box Convolve            7.88
Affine Warp (lanczos)   6.41

Energy Consumption and Battery Life

• Fast code tends to
  ■ Decrease energy consumption
  ■ Increase battery life

Typical Energy Consumption Profile

[Figure: instantaneous power plotted against time, with idle power as the baseline. Axis marks: power levels p0, p1, p2 and times t0, t1. Energy is the area under the power curve. A second version of the plot compares an unoptimized run with an optimized run.]

vImage Energy Savings over OpenCV (iPhone 5)

Above one is better.

Operation               Times less energy than OpenCV
Histogram               0.75
Max                     6.71
Box Convolve            4.05
Affine Warp (lanczos)   6.96

“Using vImage from the Accelerate framework to dynamically pre-render my sprites. It’s the only way to make it fast. ;-)”

–Twitter User


vDSP: Vectorized digital signal processing library

vDSP: What's available?

• Basic operations on arrays
  ■ Add, subtract, multiply, conversion, accumulation, etc.
• Discrete Fourier Transform
• Convolution and correlation

New and Improved in vDSP

• Multi-channel IIR filter
• Improved power of 2 support
  ■ Discrete Fourier Transform (DFT)
  ■ Discrete Cosine Transform (DCT)

Discrete Fourier Transform (DFT)

• Same operation, two entries based on number of points

Number of points    Before    OS X 10.9, iOS 7
256                 FFT       DFT
384                 DFT       DFT

DFT Example

#include <Accelerate/Accelerate.h>

// Create and prepare data:
float *Ir, *Ii, *Or, *Oi;

// Once at start:
vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);
...
vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);
...
// Once at end:
vDSP_DFT_DestroySetup(setup);

vDSP vs. FFTW

FFTW: Fastest Fourier Transform in the West

• One and multi-dimensional transforms
• Real and complex data
• Parallel

vDSP Speedup over FFTW: DFT on iPhone 5

[Figure: speedup of vDSP over FFTW (axis 0 to 3, above one is better) against number of points: 240, 256, 320, 384, 480, 512, 640, 768, 960. Bar values are not recoverable from the extracted text.]

DFT In Use: AAC Enhanced Low Delay

• Used in FaceTime
• DFT one of many DSP routines
• Percentage of time spent in DFT

[Figure: percent time spent in DFT versus everything else, one pie chart per library. FFTW chart labels: 47% and 54%. vDSP chart labels: 70% and 30%.]

Data Types in vDSP

• Single and double precision
• Real and complex
• Support for strided data access
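Strided access means an operation can step through memory by a fixed number of elements instead of one at a time, for example to read one channel of interleaved audio. The reference loop below sketches the idea; the name `strided_add` and its parameter order are hypothetical and are not a vDSP signature.

```c
#include <assert.h>
#include <stddef.h>

/* c[i * stride_c] = a[i * stride_a] + b[i * stride_b] for i in [0, count).
   Strides are measured in elements, not bytes. */
void strided_add(const float *a, ptrdiff_t stride_a,
                 const float *b, ptrdiff_t stride_b,
                 float *c, ptrdiff_t stride_c, size_t count)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < count; ++i) {
        ptrdiff_t k = (ptrdiff_t)i;
        c[k * stride_c] = a[k * stride_a] + b[k * stride_b];
    }
}
```

With `stride_a` set to 2, the loop reads every other element of `a`, which picks out a single channel of a two-channel interleaved buffer without copying it first.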

“Wanna do FFT on iOS? Use the Accelerate.framework. Highly recommended. #ios”

–Twitter User


Fast Math

Luke Chang
Engineer, Vector and Numerics Group

Math for Every Data Length

• Libm for scalar data (float)
• vMathLib for SIMD vectors (vFloat)
• vForce for array data (float [])

Libm: Standard C math library

• Standard math library in C
• Collection of transcendental functions
• Operates on scalar data
  ■ exp[f]
  ■ log[f]
  ■ sin[f]
  ■ cos[f]
  ■ pow[f]
  ■ Etc.

What's New in Libm?

• Extensions to C11, prefixed with “__”
• Added in both iOS 7 and OS X 10.9
  ■ __exp10[f]
  ■ __sinpi[f], __cospi[f], __tanpi[f]
  ■ __sincos[f], __sincospi[f]

__exp10[f]: Power of 10

• Commonly used for decibel calculations
• Faster than pow(10.0, x)
• More accurate than exp(log(10) * x)
  ■ exp(log(10) * 5.0) = 100000.0000000002

__sinpi[f], __cospi[f], __tanpi[f]: Trigonometry in Terms of PI

• cospi(x) means cos(π*x)
• Faster because argument reduction is simpler
• More accurate when working with degrees
  ■ cos(M_PI*0.5) returns 6.123233995736766e-17
  ■ __cospi(0.5) returns exactly 0.0

__sincos[f], __sincospi[f]: Sine-Cosine Pairs

• Compute sine and cosine simultaneously
• Faster because argument reduction is only done once
• Clang will call __sincos[f] when possible

C11 Features

• Some complex values can’t be specified as literals
  ■ (0.0 + INFINITY * I)
• C11 adds CMPLX macro for this purpose
  ■ CMPLX(0.0, INFINITY)
• CMPLXF and CMPLXL are also available
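A short sketch of the two spellings follows. The function names are hypothetical, and the comment about the arithmetic form describes the usual behavior of compilers that define `I` as a complex rather than an imaginary constant, which is an assumption about the toolchain.

```c
#include <assert.h>
#include <complex.h>
#include <math.h>

/* Build 0 + i*infinity with the C11 CMPLX macro, which sets the real and
   imaginary parts directly instead of evaluating an arithmetic expression. */
double complex imaginary_infinity(void)
{
    double complex z = CMPLX(0.0, INFINITY);
    assert(isinf(cimag(z)));
    return z;
}

/* The arithmetic spelling from the slide, for comparison. With a complex
   (not imaginary) I, INFINITY * I multiplies INFINITY by a zero real part,
   so the real part of the result is typically NaN rather than 0. */
double complex imaginary_infinity_by_arithmetic(void)
{
    return 0.0 + INFINITY * I;
}
```

Comparing `creal` of the two results is a quick way to see whether a given compiler is affected.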

vMathLib: SIMD vector math library

• Collection of transcendental functions for SIMD vectors
• Operates on SIMD vectors
  ■ vexp[f]
  ■ vlog[f]
  ■ vsin[f]
  ■ vcos[f]
  ■ vpow[f]
  ■ Etc.

When to Use vMathLib? Writing your own vector algorithm

• Need transcendental functions in your vector code

vMathLib Example: Taking sine of a vector

• Using Libm

#include <math.h>

vFloat vx = { 1.f, 2.f, 3.f, 4.f };
vFloat vy;
...
float *px = (float *)&vx, *py = (float *)&vy;
for (i = 0; i < sizeof(vx)/sizeof(px[0]); ++i) {
    py[i] = sinf(px[i]);
}
...

• Using vMathLib

#include <Accelerate/Accelerate.h>

vFloat vx = { 1.f, 2.f, 3.f, 4.f };
vFloat vy;
...
vy = vsinf(vx);
...

vForce: Vectorized math library

• Collection of transcendental functions for arrays
• Operates on array data
  ■ vvexp[f]
  ■ vvlog[f]
  ■ vvsin[f]
  ■ vvcos[f]
  ■ vvpow[f]
  ■ Etc.

vForce Example

• Filling a buffer with sine wave using a for loop

#include <math.h>

float buffer[length];
float indices[length];

...

for (int i = 0; i < length; i++)
{
    buffer[i] = sinf(indices[i]);
}

• Filling a buffer with sine wave using vForce

#include <Accelerate/Accelerate.h>

float buffer[length];
float indices[length];

...

vvsinf(buffer, indices, &length);

Better Performance (measured on iPhone 5)

Sines computed per µs, bigger is better:

Method      Sines per µs
vForce      56
For loop    24

Less Energy (measured on iPhone 5)

nJ consumed per sine, smaller is better:

Method      nJ per sine
vForce      14.16
For loop    23.37
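The throughput and energy figures can be combined into an average power while computing, which ties back to the energy profile shown earlier. This is a back-of-the-envelope sketch: the function name is hypothetical, and it assumes the chart labels pair up as 56 sines/µs with 14.16 nJ for vForce and 24 sines/µs with 23.37 nJ for the for loop.

```c
#include <assert.h>

/* Average power in watts while computing, from energy per result (nJ)
   and throughput (results per microsecond):
   (nJ * 1e-9 J) * (results/us * 1e6 /s) = nJ * results/us * 1e-3 W. */
double average_power_watts(double nanojoules_per_result,
                           double results_per_microsecond)
{
    assert(nanojoules_per_result >= 0.0 && results_per_microsecond >= 0.0);
    return nanojoules_per_result * results_per_microsecond * 1e-3;
}
```

Under that pairing, `average_power_watts(14.16, 56.0)` is about 0.79 W and `average_power_watts(23.37, 24.0)` is about 0.56 W. The faster code draws more power while it runs but finishes sooner, so each sine still costs less energy.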

vForce Performance (measured on iPhone 5)

[Figure: results computed per µs (axis 0 to 200), vForce versus a for loop, for truncf, logf, expf, powf, sinf and sincosf. Extracted bar labels, in extraction order: 10.6, 24, 9.2, 29, 22, 161, 28.8, 56, 26.6, 120, 116, 826. The labels cannot be reliably matched to individual bars.]

vForce in Detail

• Supports both float and double
• Handles edge cases correctly
• Requires minimal data alignment
• Supports in-place operation
• Improves performance even with small arrays
  ■ Consider using vForce when more than 16 elements

Accelerate Framework
What operations are available?

• Image processing (vImage)
• Digital signal processing (vDSP)
• Transcendental math functions (vForce, vMathLib)
• Linear algebra (LAPACK, BLAS)

LAPACK and BLAS
Linear Algebra PACKage and Basic Linear Algebra Subprograms

Geoff Belter
Engineer, Vector and Numerics Group

LAPACK Operations

• High-level linear algebra
• Solve linear systems
• Matrix factorizations
• Eigenvalues and eigenvectors

LINPACK

• How fast can you solve a system of linear equations?
• 1000 x 1000 matrices
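For reference, a LINPACK score turns a solve time into Mflops using a nominal operation count of 2n³/3 + 2n² for an n x n system. The sketch below applies that standard formula; the function names are hypothetical and the session does not show how its figures were computed.

```c
// Nominal floating-point operation count that LINPACK credits to solving
// an n x n system: 2n^3/3 for the factorization plus 2n^2 for the solve.
double linpack_flops(int n) {
    double dn = (double)n;
    return 2.0 * dn * dn * dn / 3.0 + 2.0 * dn * dn;
}

// Mflops = operations / (seconds * 10^6).
double linpack_mflops(int n, double seconds) {
    return linpack_flops(n) / (seconds * 1.0e6);
}
```

For n = 1000 the count is about 6.69 × 10^8 operations, so a rate of 3446 Mflops (one of the figures charted in this section) corresponds to roughly 0.19 seconds per solve.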

LAPACK

[Bar chart, built up over several slides: LINPACK benchmark performance in Mflops; bigger is better. Bars are labeled “Brand A”, Accelerate on iPhone 4S, and Accelerate on iPhone 5. Values shown as the chart builds: 40 and then 788 beside “Brand A”, followed by 1202 and 3446.]

iPad with Retina Display vs. Power Mac G5

LAPACK
iPad with Retina display vs. Power Mac G5

• Triumphant return
• After 10 years
• All fans blazing

LAPACK

[Bar chart, built up over several slides: LINPACK benchmark performance in Mflops; bigger is better. Bars are labeled Accelerate on iPhone 4S, Power Mac G5, and iPad with Retina Display. Values shown as the chart builds: 3643, then 3686.]

LAPACK Example
Solve linear system

#include <Accelerate/Accelerate.h>

// Create and prepare input and output data
// (A is n x n in column-major order; B holds nrhs right-hand sides)
double *A, *B;
__CLPK_integer *ipiv;
__CLPK_integer n, nrhs, info;

// Solve A * X = B; on return B holds X and info is 0 on success
dgesv_(&n, &nrhs, A, &n, ipiv, B, &n, &info);
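The slide leaves out allocation and error handling. A minimal sketch of a complete call, for a single right-hand side, might look like the following. The helper `solve_in_place` is hypothetical, and the non-Apple branch is a plain Gaussian elimination with partial pivoting added only so the sketch builds without Accelerate; it is not how LAPACK itself is implemented.

```c
#include <math.h>
#include <stdlib.h>
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

// Hypothetical helper: solves A * x = b for one right-hand side.
// A is n x n in column-major order (element (i, j) is A[i + j*n]).
// Returns 0 on success and overwrites b with x; A is overwritten too.
// A nonzero return means the solve failed (for example, singular A).
int solve_in_place(double *A, double *b, int n) {
#ifdef __APPLE__
    __CLPK_integer order = n, nrhs = 1, info = 0;
    __CLPK_integer *ipiv = malloc((size_t)n * sizeof *ipiv);
    if (ipiv == NULL) return -1;
    dgesv_(&order, &nrhs, A, &order, ipiv, b, &order, &info);
    free(ipiv);
    return (int)info;
#else
    // Fallback: Gaussian elimination with partial pivoting.
    for (int k = 0; k < n; k++) {
        int p = k;
        for (int i = k + 1; i < n; i++) {
            if (fabs(A[i + k * n]) > fabs(A[p + k * n])) p = i;
        }
        if (A[p + k * n] == 0.0) return k + 1;  // exactly singular
        if (p != k) {
            for (int j = 0; j < n; j++) {
                double t = A[k + j * n];
                A[k + j * n] = A[p + j * n];
                A[p + j * n] = t;
            }
            double t = b[k]; b[k] = b[p]; b[p] = t;
        }
        for (int i = k + 1; i < n; i++) {
            double m = A[i + k * n] / A[k + k * n];
            for (int j = k; j < n; j++) A[i + j * n] -= m * A[k + j * n];
            b[i] -= m * b[k];
        }
    }
    // Back substitution on the upper-triangular system.
    for (int i = n - 1; i >= 0; i--) {
        double s = b[i];
        for (int j = i + 1; j < n; j++) s -= A[i + j * n] * b[j];
        b[i] = s / A[i + i * n];
    }
    return 0;
#endif
}
```

For the 2 x 2 system 2x + y = 3, x + 3y = 5, passing A = {2, 1, 1, 3} and b = {3, 5} leaves b holding x = 0.8 and y = 1.4, up to rounding.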


BLAS Operations

• Low-level linear algebra
• Vector
  ■ Dot product, scalar product, vector sum
• Matrix-vector
  ■ Matrix-vector product, outer product
• Matrix-matrix
  ■ Matrix multiply
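As a sketch of the vector and matrix-vector levels listed above, the helpers below wrap `cblas_ddot` and `cblas_dgemv`. The wrapper names and the `#ifdef __APPLE__` fallback loops are assumptions added for illustration, not code from the session.

```c
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

// Vector level: returns the dot product of x and y (n elements, unit stride).
double dot_product(const double *x, const double *y, int n) {
#ifdef __APPLE__
    return cblas_ddot(n, x, 1, y, 1);
#else
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += x[i] * y[i];
    return sum;
#endif
}

// Matrix-vector level: y = A * x for a row-major, tightly packed
// m x n matrix A.
void matrix_vector_product(const double *A, const double *x, double *y,
                           int m, int n) {
#ifdef __APPLE__
    // alpha = 1.0 and beta = 0.0, so y is overwritten with A * x.
    cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n,
                1.0, A, n, x, 1, 0.0, y, 1);
#else
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++) sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
#endif
}
```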

BLAS Example
Matrix Multiply

#include <Accelerate/Accelerate.h>

// Create and prepare data
double *A, *B, *C;

// C <-- A * B
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            100, 100, 100, 1.0, A, 100, B, 100, 0.0, C, 100);
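In the call above, the first three 100s are the dimensions M, N, and K, 1.0 and 0.0 are the scale factors alpha and beta in C = alpha·A·B + beta·C, and the 100 after each matrix is its leading dimension. A sketch for non-square matrices makes the roles easier to see; the wrapper `matmul` and the fallback loop are hypothetical additions.

```c
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

// Hypothetical helper: C = A * B with A (m x k), B (k x n), C (m x n),
// all row-major and tightly packed.
void matmul(const double *A, const double *B, double *C,
            int m, int n, int k) {
#ifdef __APPLE__
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,      // rows of C, columns of C, inner dimension
                1.0, A, k,    // alpha, A, leading dimension of A
                B, n,         // B, leading dimension of B
                0.0, C, n);   // beta, C, leading dimension of C
#else
    // Portable fallback: straightforward triple loop.
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int p = 0; p < k; p++) sum += A[i * k + p] * B[p * n + j];
            C[i * n + j] = sum;
        }
    }
#endif
}
```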


Data Types

• Single and double precision
• Real and complex
• Multiple data layouts
  ■ Dense, banded, triangular, etc.
  ■ Transpose, conjugate transpose
  ■ Row and column major
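The row-major versus column-major distinction comes down to how element (i, j) maps to an offset in the flat buffer. A small sketch, with hypothetical helper names, where `ld` is the leading dimension (the stride between consecutive rows or columns):

```c
#include <stddef.h>

// Row major: rows are stored one after another, so ld is the row stride
// (at least the number of columns).
size_t row_major_index(size_t i, size_t j, size_t ld) {
    return i * ld + j;
}

// Column major: columns are stored one after another, so ld is the
// column stride (at least the number of rows).
size_t col_major_index(size_t i, size_t j, size_t ld) {
    return i + j * ld;
}
```

One consequence is that a tightly packed row-major m x n buffer, read as column major, is the n x m transpose, which is why the layout and transpose options are closely related.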


“Playing with the Accelerate.framework today. Having a BLASt.”

–Twitter User


Summary

Accelerate Framework
Lots of functionality

• Image processing (vImage)
• Digital signal processing (vDSP)
• Transcendental math functions (vForce, vMathLib)
• Linear algebra (LAPACK, BLAS)

Accelerate Framework
Features and benefits

• Easy access to a lot of functionality
• Accurate
• Fast with low energy usage
• Works on both OS X and iOS
• Optimized for all generations of hardware

Accelerate Framework
To be successful

• Prepare your data
  ■ Contiguous
  ■ 16-byte aligned
• Understand problem size
• Do setup once/destroy at the end
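One way to get a contiguous, 16-byte aligned buffer is C11 `aligned_alloc`. The helper below is a hypothetical sketch; `posix_memalign` is a common alternative, and the session does not prescribe a particular allocator.

```c
#include <stdint.h>
#include <stdlib.h>

// Hypothetical helper: allocates count floats in one contiguous block
// whose address is a multiple of 16 bytes. Release it with free().
float *alloc_aligned_floats(size_t count) {
    size_t bytes = count * sizeof(float);
    // C11 aligned_alloc expects the size to be a multiple of the
    // alignment, so round up to the next multiple of 16.
    size_t rounded = (bytes + 15u) & ~(size_t)15u;
    if (rounded == 0) rounded = 16;
    return aligned_alloc(16, rounded);
}
```

Allocating such a buffer once and reusing it across calls also fits the "do setup once" advice.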

If You Need a Feature
Request It

“Discrete Cosine Transform was my feature request that made it into the Accelerate Framework. I feel so special!”

–Twitter User


“Thanks Apple for making the Accelerate Framework.”

–Twitter User

More Information

Paul Danbold
Core OS Technologies [email protected]

George Warner
DTS Sr. Support [email protected]

Documentation

vImage Programming Guide
http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/vImage/Introduction/Introduction.html

vDSP Programming Guide
http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/vDSP_Programming_Guide/Introduction/Introduction.html

Apple Developer Forums
http://devforums.apple.com

Labs

Accelerate Lab
Core OS Lab A
Thursday 4:30PM
