Using the OpenCL C Kernel Language for Embedded Vision Processors

A presentation from Synopsys
Seema Mirchandaney
May 3, 2016
Copyright © 2016 Synopsys Inc.




Agenda

• Synopsys Embedded Vision Processor and Tools Exploration
• OpenCL™ C Introduction
• Vectorization in OpenCL C
• Lessons learned


Synopsys Embedded Vision Processor and Tools Exploration


Synopsys EV5x Vision Processors


Embedded Vision SIMD Processor Trends

• Parallelism exploited at multiple levels
    • SIMD instructions
    • N-way VLIW
    • Multi-core
• Challenges for SIMD → irregularity
    • Non-contiguous memory reference patterns
    • Non-uniform data flow (control statements)
    • Multiple data types (char, short, int, float)
    • Data dependences


Programming in C for SIMD

• Data parallelism not implicit in the language
• Intrinsics used for SIMD operations → low level
• Vendor-specific extensions for vector data types → non-portable
• Compilers perform inner loop vectorization → limited success
    • Pointer aliasing, complex subscripts, limited data dependence analysis
    • Pragmas required to guide the compiler to be ‘smart’ (no dependence, SIMD width)
• New language extensions → Intel’s SPMD compiler
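The pointer-aliasing obstacle to inner loop vectorization can be illustrated in plain C. This is a minimal sketch, assuming a C99 compiler; the function names `scale_add` and `scale_add_restrict` are made up for illustration. The `restrict` qualifier is standard C99, but whether a particular compiler actually vectorizes either loop is not guaranteed.

```c
#include <stddef.h>

/* Without restrict, the compiler must assume dst may overlap src.
   Before using SIMD loads and stores it has to prove independence
   or emit a runtime overlap check. */
void scale_add(int *dst, const int *src, int k, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}

/* restrict is the programmer's promise that dst and src do not
   overlap, which removes the aliasing question for this loop.
   Calling it with overlapping buffers is undefined behavior. */
void scale_add_restrict(int *restrict dst, const int *restrict src,
                        int k, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}
```

Both functions compute the same result for non-overlapping buffers; the difference is only in what the compiler is allowed to assume.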


OpenCL C Introduction


Data Parallelism

• OpenCL C language derived from ISO C99
    • Disallows standard C99 headers, function pointers, recursion, variable length arrays, and bit fields
• Important additions to the language for parallelism
    • Work items and workgroups
    • Vector types up to 16 lanes
    • Synchronization
    • Address space qualifiers
• Large set of built-ins
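The work-item model can be emulated in plain C to show what "one kernel invocation per work item" means. This is a sketch under stated assumptions: `get_global_id0`, `add_one_kernel`, and `launch_1d` are hypothetical names, and the sequential loop stands in for a runtime that on a real device may run work items in any order or in parallel.

```c
#include <stddef.h>

/* Hypothetical stand-in for get_global_id(0): the emulated runtime
   sets this before each kernel invocation. */
static size_t current_global_id;

static size_t get_global_id0(void) { return current_global_id; }

/* Kernel body, written from the point of view of ONE work item. */
static void add_one_kernel(const int *in, int *out) {
    size_t tid = get_global_id0();
    out[tid] = in[tid] + 1;
}

/* Emulated 1-D launch: invokes the kernel once per work item.
   The work items here are independent, so any order gives the
   same result. */
void launch_1d(const int *in, int *out, size_t global_size) {
    for (size_t id = 0; id < global_size; ++id) {
        current_global_id = id;
        add_one_kernel(in, out);
    }
}
```

The data parallelism is implicit in the launch: the kernel contains no loop over elements, only the index lookup.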


Work Items and Workgroups


Performance is a challenge

• Explicit vectorization: managed by the programmer
• Implicit vectorization: automatically performed by the compiler
• Differences in execution of work items: GPU vs. CPU vs. SIMD Vision Processors
    • GPU: one work item simply maps to one hardware thread
    • CPU: one work item running on one CPU core and all CPU cores busy; libraries (pthreads, OpenMP or MPI) have to be employed to obtain the wanted effect
    • SIMD Vision Processors


Language Extensions and Optimizing Compiler

• Programming model: OpenCL kernels + OpenVX
• Advanced whole function vectorization module
• Extensions for explicit vectorization
    • Wider vectors ([u]short32, [u]char64) + operations
    • Built-ins for scatter/gather with predication

Multiple vector lane modes allow for maximizing performance:

• 16 lanes [int, short2, char4 data types optimized]
• 32 lanes [short, char2 data types optimized]
• 64 lanes [char data type optimized]
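The three lane modes are consistent with a fixed vector width of 512 bits (16 × 32, 32 × 16, 64 × 8). That width is an inference from the lane counts, not something the slide states, and `VECTOR_BITS` and `lanes_for` below are hypothetical names used only to make the arithmetic explicit.

```c
#include <stddef.h>

/* Assumption: 512 bits, inferred from 16 lanes of 32-bit int. */
enum { VECTOR_BITS = 512 };

/* Number of lanes available for an element of elem_bytes bytes. */
int lanes_for(size_t elem_bytes) {
    return (int)(VECTOR_BITS / (8 * elem_bytes));
}
```

Under this assumption, `lanes_for(4)` covers int, short2, and char4 (all 4 bytes per lane), `lanes_for(2)` covers short and char2, and `lanes_for(1)` covers char.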


Vectorization in OpenCL C


Explicit Vectorization

• What does OpenCL C provide for explicit vectorization?
    • Vector data types
    • Built-ins for vector data types
    • Relational built-ins that enable vector predication (any, all, select)
    • Basic control flow operations on vectors
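The semantics of the relational built-ins can be emulated lane by lane in plain C. In OpenCL C, vector comparisons yield -1 (all bits set) in true lanes, and `select`, `any`, and `all` test the most significant bit of each lane. The sketch below follows those rules; the type `int4_t` and the helpers `lt4`, `select4`, `any4`, `all4`, and `clamp_max4` are hypothetical stand-ins, not OpenCL API names.

```c
#include <stdint.h>

#define LANES 4

/* Hypothetical stand-in for the OpenCL C int4 type. */
typedef struct { int32_t s[LANES]; } int4_t;

/* Lane-wise a < b: true lanes become -1 (all bits set), false lanes 0. */
int4_t lt4(int4_t a, int4_t b) {
    int4_t r;
    for (int i = 0; i < LANES; ++i)
        r.s[i] = (a.s[i] < b.s[i]) ? -1 : 0;
    return r;
}

/* select(a, b, c): take b where the MSB of c is set, otherwise a. */
int4_t select4(int4_t a, int4_t b, int4_t c) {
    int4_t r;
    for (int i = 0; i < LANES; ++i)
        r.s[i] = (c.s[i] < 0) ? b.s[i] : a.s[i];
    return r;
}

/* any(c): 1 if the MSB is set in at least one lane. */
int any4(int4_t c) {
    for (int i = 0; i < LANES; ++i)
        if (c.s[i] < 0) return 1;
    return 0;
}

/* all(c): 1 if the MSB is set in every lane. */
int all4(int4_t c) {
    for (int i = 0; i < LANES; ++i)
        if (c.s[i] >= 0) return 0;
    return 1;
}

/* Example of predication: clamp each lane to at most limit. */
int4_t clamp_max4(int4_t v, int32_t limit) {
    int4_t lim = {{limit, limit, limit, limit}};
    int4_t over = lt4(lim, v);     /* lanes where v > limit */
    if (!any4(over)) return v;     /* uniform fast path: no lane needs work */
    return select4(v, lim, over);  /* per-lane choice without per-lane branches */
}
```

The `any4` check shows the typical use of `any`/`all`: skipping a block entirely when the condition is uniform across lanes.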


Kernel Example

// The kernel represents the parallelism: it is invoked once per work item.
kernel void X(global int *a, global int *b, int n, int cval) {
    int tid = get_global_id(0);
    int val = 0;
    for (int i = 0; i < n; ++i) {
        if (val < cval) {               // varying scalar (val)
            val += a[b[tid]];           // non-consecutive load (gather)
        }
        else if (val > cval * 2)        // divergent control flow
            val += b[tid];
    }
    a[tid] = val;
}


Explicit Vectorization in OpenCL C

kernel void X(global int *a, global int *b, int n, int cval) {
    int4 tid = {gid, …, gid+3};              // four consecutive work-item IDs
    int4 val = 0;
    for (int i = 0; i < n; ++i) {
        int4 mask = val < ((int4)cval);      // lanes taking the 'if' branch
        int4 valg = gather4(a, b, tid, …);   // a[b[tid]] per lane (gather built-in)
        int4 val1 = val + valg;
        val = mask ? val1 : val;
        int4 mval = (int4)cval * (int4)2;
        int4 maske = (val > mval);
        maske = maske & ~mask;               // 'else if' lanes only
        int4 bl = vload4(b…);
        val1 = val + bl;
        val = maske ? val1 : val;            // update only the 'else if' lanes
    }
    vstore4(a, …);
}
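The masking scheme on this slide can be checked against the scalar kernel with a plain-C emulation. This is a sketch under assumptions, not the presenter's code: `scalar_item`, `gather_masked`, and `vector_items` are hypothetical names, the predicated gather is emulated with a loop, and results go to a separate `out` array (instead of back into `a`) so that the sequential emulation does not depend on work-item order.

```c
#include <stdint.h>

#define W 4  /* lanes per vector, matching int4 */

/* Scalar reference: what a single work item computes. */
int32_t scalar_item(const int32_t *a, const int32_t *b,
                    int tid, int n, int32_t cval) {
    int32_t val = 0;
    for (int i = 0; i < n; ++i) {
        if (val < cval)
            val += a[b[tid]];
        else if (val > cval * 2)
            val += b[tid];
    }
    return val;
}

/* Hypothetical predicated gather: reads base[idx[l]] only in active
   lanes; inactive lanes get 0 and touch no memory. */
static void gather_masked(int32_t dst[W], const int32_t *base,
                          const int32_t idx[W], const int mask[W]) {
    for (int l = 0; l < W; ++l)
        dst[l] = mask[l] ? base[idx[l]] : 0;
}

/* Vectorized form: W consecutive work items starting at gid. */
void vector_items(const int32_t *a, const int32_t *b, int gid,
                  int n, int32_t cval, int32_t out[W]) {
    int32_t val[W] = {0};
    for (int i = 0; i < n; ++i) {
        int mask[W], maske[W];
        int32_t valg[W];
        for (int l = 0; l < W; ++l) {
            mask[l]  = val[l] < cval;                  /* 'if' lanes */
            maske[l] = !mask[l] && val[l] > cval * 2;  /* 'else if' lanes */
        }
        gather_masked(valg, a, &b[gid], mask);
        for (int l = 0; l < W; ++l) {
            /* both branches become selects; a lane is in at most one mask */
            val[l] = mask[l]  ? val[l] + valg[l]   : val[l];
            val[l] = maske[l] ? val[l] + b[gid + l] : val[l];
        }
    }
    for (int l = 0; l < W; ++l)
        out[l] = val[l];
}
```

Because `maske` excludes every lane already in `mask`, each lane is updated by at most one of the two selects per iteration, which is what the scalar if / else-if chain requires.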


Portable Programming with Performance

• OpenCL C kernels → parallelism expressed
• DSP-based SIMD architectures pose a challenge to balance portability and performance
• Explicit vectorization restricts portability
    • Existing DSP-based architectures with varying SIMD extensions
    • Detailed knowledge of hardware required to achieve performance
    • Extensions may be required to support the hardware features


Implicit / Whole Function Vectorization (WFV)

• Requires compiler support beyond traditional inner loop vectorization
• Main idea
    • Transform a kernel to a multi work item kernel (SIMD lanes)
    • Transform accesses to ‘thread id’ (ID of a work item) to return a vector of w (num lanes) consecutive values
    • Transform each operation into its vector counterpart
• Adapts well to DSP processors with extensive SIMD instruction sets
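The three transformation steps can be sketched by hand in plain C for a trivial kernel. This only illustrates the idea, with the lane loops standing in for SIMD instructions; it is not the output of any real compiler. `kernel_scalar`, `kernel_wfv`, and `launch_wfv` are hypothetical names, and the sketch assumes the global size is a multiple of the lane count.

```c
#include <stddef.h>

#define W 4  /* number of SIMD lanes (w on the slide) */

/* Original kernel body: one work item, one element. */
void kernel_scalar(const int *in, int *out, size_t tid) {
    int x = in[tid];
    out[tid] = x * 2 + 1;
}

/* After whole function vectorization: one invocation covers W work items. */
void kernel_wfv(const int *in, int *out, size_t first_tid) {
    size_t tid[W];
    int x[W];
    for (int l = 0; l < W; ++l)
        tid[l] = first_tid + (size_t)l;  /* 'thread id' is now W consecutive values */
    for (int l = 0; l < W; ++l)
        x[l] = in[tid[l]];               /* consecutive IDs: one contiguous vector load */
    for (int l = 0; l < W; ++l)
        x[l] = x[l] * 2 + 1;             /* each scalar op becomes its vector counterpart */
    for (int l = 0; l < W; ++l)
        out[tid[l]] = x[l];              /* contiguous vector store */
}

/* Launch over W work items at a time; global_size must be a multiple of W. */
void launch_wfv(const int *in, int *out, size_t global_size) {
    for (size_t g = 0; g < global_size; g += W)
        kernel_wfv(in, out, g);
}
```

Because the thread IDs in a vector are consecutive, `in[tid]` turns into a contiguous load; an access such as `a[b[tid]]` would instead need a gather.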


Lessons Learned


Experience with Kernels

• Experiments used Synopsys ARC® MetaWare Research compiler and simulator
    • Wide vectors with multiple data types
    • Predicated scatter/gather built-ins
    • Cross lane reductions/shuffles
    • SIMD based optimized built-ins library
• Explicit vectorization → output after WFV

Benchmark         OpenCL C with extensions: performance relative to optimized assembly versions
HoG linear SVM    1.12
Integral Image    1.11
Median filter     1.03
Histogram         1.02


Experience with Complex Kernels

• Whole Function Vectorization essential for
    • Complex kernels with control flow, non-contiguous memory references
• SIMD extensions for
    • Irregular memory references
    • Predicated execution
        • Predicate registers versus predicate stack
        • Re-use predicate registers across data types
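One way a predicate stack can track lane masks through nested control flow is sketched below in plain C. This is a generic illustration of the technique, not a description of the EV processor hardware or the MetaWare compiler; `pred_stack`, `ps_init`, `ps_push_if`, `ps_else`, `ps_pop`, and `classify` are hypothetical names, and the fixed depth is not bounds-checked.

```c
#include <stdint.h>

#define W 4          /* SIMD lanes */
#define MAX_DEPTH 8  /* assumption: nesting never exceeds this (unchecked) */

typedef struct {
    uint8_t mask[MAX_DEPTH][W];  /* mask[d][l] != 0: lane l active at depth d */
    int top;                     /* index of the current mask */
} pred_stack;

void ps_init(pred_stack *ps) {
    ps->top = 0;
    for (int l = 0; l < W; ++l) ps->mask[0][l] = 1;  /* all lanes active */
}

/* Entering 'if (cond)': new mask = enclosing mask AND cond. */
void ps_push_if(pred_stack *ps, const int cond[W]) {
    for (int l = 0; l < W; ++l)
        ps->mask[ps->top + 1][l] = ps->mask[ps->top][l] && cond[l];
    ps->top++;
}

/* Switching to 'else': lanes active in the enclosing mask but not in the 'if'. */
void ps_else(pred_stack *ps) {
    for (int l = 0; l < W; ++l)
        ps->mask[ps->top][l] =
            ps->mask[ps->top - 1][l] && !ps->mask[ps->top][l];
}

/* Leaving the if/else: the enclosing mask becomes current again. */
void ps_pop(pred_stack *ps) { ps->top--; }

/* Per lane: if (x > 0) { if (x is odd) y = 1; else y = 2; } else y = 3; */
void classify(const int x[W], int y[W]) {
    pred_stack ps;
    int c[W];
    ps_init(&ps);
    for (int l = 0; l < W; ++l) c[l] = x[l] > 0;
    ps_push_if(&ps, c);
    for (int l = 0; l < W; ++l) c[l] = x[l] % 2 != 0;
    ps_push_if(&ps, c);
    for (int l = 0; l < W; ++l) if (ps.mask[ps.top][l]) y[l] = 1;
    ps_else(&ps);
    for (int l = 0; l < W; ++l) if (ps.mask[ps.top][l]) y[l] = 2;
    ps_pop(&ps);
    ps_else(&ps);
    for (int l = 0; l < W; ++l) if (ps.mask[ps.top][l]) y[l] = 3;
    ps_pop(&ps);
}
```

With explicit predicate registers, the compiler would instead keep each of these masks in a named register and choose which ones to reuse; the stack form trades that flexibility for simpler bookkeeping.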