"Using the OpenCL C Kernel Language for Embedded Vision Processors," a Presentation from Synopsys

Copyright © 2016 Synopsys Inc.

Using the OpenCL C Kernel Language for Embedded Vision Processors

Seema Mirchandaney

May 3, 2016


Agenda

• Synopsys Embedded Vision Processor and Tools Exploration

• OpenCL™ C Introduction

• Vectorization in OpenCL C

• Lessons learned


Synopsys Embedded Vision Processor and Tools Exploration


Synopsys EV5x Vision Processors


Embedded Vision SIMD Processor Trends

• Parallelism exploited at multiple levels

• SIMD instructions

• N-way VLIW

• Multi-core

• Challenges for SIMD → irregularity

• Non-contiguous memory reference patterns

• Non-uniform data flow: control statements

• Multiple data types (char, short, int, float)

• Data dependences


Programming in C for SIMD

• Data parallelism not implicit in the language

• Intrinsics used for SIMD operations → low level

• Vendor-specific extensions for vector data types → non-portable

• Compilers perform inner loop vectorization → limited success

• Pointer aliasing, complex subscripts, limited data dependence analysis

• Pragmas required to guide the compiler to be 'smart' (no dependence, SIMD width)

• New language extensions → Intel's SPMD Program Compiler (ISPC)
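The pointer aliasing problem listed on this slide can be illustrated with a minimal plain C sketch. The function names are hypothetical, and whether a particular compiler actually vectorizes either loop depends on the toolchain and its flags. The first version leaves the compiler free to assume that dst and src overlap. The second uses the C99 restrict qualifier to promise that they do not.

```c
#include <assert.h>
#include <stddef.h>

/* May alias: the compiler has to assume dst and src can overlap, so it
   must prove independence, emit a runtime overlap check, or keep the
   loop scalar. */
void scale_may_alias(int *dst, const int *src, size_t n, int k) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}

/* C99 restrict: the caller promises that dst and src do not overlap,
   which removes the assumed dependence between iterations. */
void scale_no_alias(int *restrict dst, const int *restrict src,
                    size_t n, int k) {
    /* Cheap sanity check only; it does not detect partial overlap. */
    assert(n == 0 || dst != src);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```

Both functions compute the same result for non-overlapping buffers; the difference is only in what the compiler is allowed to assume.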


OpenCL C Introduction


Data Parallelism

• OpenCL C language derived from ISO C99

• Disallows standard C99 headers, function pointers, recursion, variable length arrays, and bit fields

• Important additions to the language for parallelism

• Work items and workgroups

• Vector types up to 16 lanes

• Synchronization

• Address space qualifiers

• Large set of built-ins


Work Items and Workgroups
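As a rough illustration of how work items and workgroups relate, the following plain C sketch runs a one-dimensional index space sequentially on the host. It is an emulation under stated assumptions, not OpenCL code: the struct, the function names and the use of a function pointer (which OpenCL C itself disallows) are hypothetical, and the global size is assumed to be a multiple of the workgroup size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical identifiers handed to each work item; in OpenCL C these
   come from get_global_id(0), get_local_id(0) and get_group_id(0). */
typedef struct {
    size_t global_id;
    size_t local_id;
    size_t group_id;
} work_item_ids;

typedef void (*kernel_fn)(work_item_ids ids, void *args);

/* Run global_size work items, split into workgroups of local_size
   items each, one after another. */
void run_ndrange_1d(kernel_fn kernel, size_t global_size,
                    size_t local_size, void *args) {
    assert(local_size > 0 && global_size % local_size == 0);
    for (size_t g = 0; g < global_size / local_size; ++g) {
        for (size_t l = 0; l < local_size; ++l) {
            work_item_ids ids = { g * local_size + l, l, g };
            kernel(ids, args);
        }
    }
}

/* Example kernel body: each work item records the group it ran in. */
void record_group(work_item_ids ids, void *args) {
    size_t *out = (size_t *)args;
    out[ids.global_id] = ids.group_id;
}
```

Calling run_ndrange_1d(record_group, 8, 4, out) visits eight work items in two workgroups; a real device is free to run them in parallel and in any order.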


Performance is a challenge

• Explicit vectorization: managed by the programmer

• Implicit vectorization: automatically performed by the compiler

• Differences in execution of work items

GPU vs. CPU vs. SIMD Vision Processors

• GPU: one work item simply maps to one hardware thread

• CPU: one work item running on one CPU core and all CPU cores busy; libraries (pthreads, OpenMP or MPI) have to be employed to obtain the desired effect

• SIMD Vision Processors


Language Extensions and Optimizing Compiler

• Programming model: OpenCL kernels + OpenVX

• Advanced whole function vectorization module

• Extensions for explicit vectorization

• Wider vectors ([u]short32, [u]char64) + operations

• Built-ins for scatter/gather with predication

Multiple vector lane modes allow for maximizing performance

16 lanes [int, short2, char4 data types optimized]

32 lanes [short, char2 data types optimized]

64 lanes [char data type optimized]
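The three lane modes are consistent with a fixed-width vector: 16 lanes of 4-byte elements, 32 lanes of 2-byte elements and 64 lanes of 1-byte elements all add up to 512 bits. The sketch below makes that arithmetic explicit. The 512-bit figure is inferred from the lane counts on the slide rather than stated on it, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: vector width in bits, inferred from 16 lanes x 32 bits. */
enum { VECTOR_BITS = 512 };

/* Number of lanes available for an element of elem_bytes bytes. */
size_t lanes_for(size_t elem_bytes) {
    assert(elem_bytes > 0 && VECTOR_BITS % (elem_bytes * 8) == 0);
    return VECTOR_BITS / (elem_bytes * 8);
}
```

Under this assumption a short2 or char4 element (4 bytes) lands in the 16-lane mode and a char2 element (2 bytes) in the 32-lane mode, which matches the grouping on the slide.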


Vectorization in OpenCL C


Explicit Vectorization

• What does OpenCL C offer for explicit vectorization?

• Vector data types

• Built-ins for vector data types

• Relational built-ins that enable vector predication (any, all, select)

• Basic control flow operations on vectors
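To show how the relational built-ins enable predication, here is a small plain C emulation of a 4-lane comparison together with select, any and all. It follows the OpenCL C convention that a true lane of a vector comparison is -1 (all bits set) and that select, any and all look at the most significant bit of each lane. The type and the v_ prefixed function names are hypothetical stand-ins, not OpenCL identifiers.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4
typedef struct { int32_t s[LANES]; } vec4i;   /* stand-in for int4 */

/* Lane-wise a < b; true lanes are -1, false lanes are 0. */
vec4i v_less(vec4i a, vec4i b) {
    vec4i m;
    for (int i = 0; i < LANES; ++i) m.s[i] = (a.s[i] < b.s[i]) ? -1 : 0;
    return m;
}

/* select(a, b, c): take b where the sign bit of c is set, else a. */
vec4i v_select(vec4i a, vec4i b, vec4i c) {
    vec4i r;
    for (int i = 0; i < LANES; ++i) r.s[i] = (c.s[i] < 0) ? b.s[i] : a.s[i];
    return r;
}

/* any(m): 1 if at least one lane has its sign bit set. */
int v_any(vec4i m) {
    for (int i = 0; i < LANES; ++i) if (m.s[i] < 0) return 1;
    return 0;
}

/* all(m): 1 if every lane has its sign bit set. */
int v_all(vec4i m) {
    for (int i = 0; i < LANES; ++i) if (m.s[i] >= 0) return 0;
    return 1;
}

/* Branch-free form of: if (x < limit) x += step; applied per lane. */
vec4i add_if_below(vec4i x, vec4i limit, vec4i step) {
    vec4i mask = v_less(x, limit);
    if (!v_any(mask)) return x;          /* no active lane: skip the work */
    vec4i sum;
    for (int i = 0; i < LANES; ++i) sum.s[i] = x.s[i] + step.s[i];
    return v_select(x, sum, mask);
}
```

The any test is optional; it only saves work when no lane needs the update.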


Kernel Example

kernel void X(global int *a, global int *b, int n, int cval) {
    int tid = get_global_id(0);
    int val = 0;
    for (int i = 0; i < n; ++i) {
        if (val < cval) {              /* varying scalar (val) */
            val += a[b[tid]];          /* non-consecutive load (gather) */
        }
        else if (val > cval * 2)       /* divergent control flow */
            val += b[tid];
    }
    a[tid] = val;                      /* the kernel represents the parallelism */
}


Explicit Vectorization in OpenCL C

kernel void X(global int *a, global int *b, int n, int cval) {
    int4 tid = {gid,….,gid+3};             /* four consecutive work item IDs */
    int4 val = 0;
    for (int i = 0; i < n; ++i) {
        int4 mask = val < ((int4)cval);    /* lanes that take the "if" branch */
        int4 valg = gather4(a, b, tid,…);
        int4 val1 = val + valg;
        val = mask ? val1 : val;
        int4 mval = (int4)cval * (int4)2;
        int4 maske = (val > mval);
        maske = maske & ~mask;             /* "else if" lanes only */
        int4 bl = vload4(b…);
        val1 = val + bl;
        val = maske ? val1 : val;          /* add b only in the "else if" lanes */
    }
    vstore4(a,….);
}
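The masked form can be checked against the scalar kernel with a small host-side emulation in plain C. This is a sketch under stated assumptions: the function names are hypothetical, the four lanes are simulated with an inner loop, and the gather is unpredicated, so every b[tid] must be a valid index into a even in lanes whose result is discarded.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4

/* Scalar reference: the body executed by one work item. */
int32_t kernel_scalar(const int32_t *a, const int32_t *b,
                      int tid, int n, int32_t cval) {
    int32_t val = 0;
    for (int i = 0; i < n; ++i) {
        if (val < cval)
            val += a[b[tid]];
        else if (val > cval * 2)
            val += b[tid];
    }
    return val;
}

/* Four consecutive work items at once, with masks instead of branches. */
void kernel_masked4(const int32_t *a, const int32_t *b, int gid,
                    int n, int32_t cval, int32_t out[LANES]) {
    assert(gid % LANES == 0);
    int32_t val[LANES] = {0, 0, 0, 0};
    for (int i = 0; i < n; ++i) {
        for (int l = 0; l < LANES; ++l) {      /* stands in for SIMD lanes */
            int tid = gid + l;
            int mask = val[l] < cval;          /* "if" lanes */
            int32_t valg = a[b[tid]];          /* unconditional gather */
            val[l] = mask ? val[l] + valg : val[l];
            int maske = (val[l] > cval * 2) && !mask;  /* "else if" lanes */
            int32_t bl = b[tid];               /* consecutive load */
            val[l] = maske ? val[l] + bl : val[l];
        }
    }
    for (int l = 0; l < LANES; ++l) out[l] = val[l];
}
```

Lanes that took the first branch are excluded from the second mask, which is what keeps the two updates mutually exclusive, as in the scalar if / else if.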


Portable Programming with Performance

• OpenCL C kernels → parallelism expressed

• DSP-based SIMD architectures pose a challenge to balance portability and performance

• Explicit vectorization restricts portability

• Existing DSP-based architectures with varying SIMD extensions

• Detailed knowledge of hardware required to achieve performance

• Extensions may be required to support the hardware features


Implicit / Whole Function Vectorization (WFV)

• Requires compiler support beyond traditional inner loop vectorization

• Main idea

• Transform a kernel to a multi work item kernel (SIMD lanes)

• Transform accesses to 'thread id' (ID of a work item) to return a vector of w (num lanes) consecutive values

• Transform each operation into its vector counterpart

• Adapts well to DSP processors with extensive SIMD instruction sets
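The main idea can be sketched by hand in plain C for a trivial kernel. This is only an illustration of the transformation, not the output of any compiler: the kernel, the function names and the lane count W = 4 are assumptions, and the per-lane loops stand in for single vector instructions.

```c
#include <assert.h>
#include <stddef.h>

enum { W = 4 };   /* assumed number of SIMD lanes */

/* Original kernel body: executed once per work item. */
void axpy_item(size_t tid, int k, const int *x, int *y) {
    y[tid] = k * x[tid] + y[tid];
}

/* After the transformation: one call covers W work items. */
void axpy_wfv(size_t first_tid, int k, const int *x, int *y) {
    size_t tid[W];
    int xv[W], yv[W];
    /* 'thread id' becomes a vector of W consecutive values */
    for (int l = 0; l < W; ++l) tid[l] = first_tid + (size_t)l;
    /* scalar loads become vector loads */
    for (int l = 0; l < W; ++l) xv[l] = x[tid[l]];
    for (int l = 0; l < W; ++l) yv[l] = y[tid[l]];
    /* each scalar operation becomes its vector counterpart */
    for (int l = 0; l < W; ++l) yv[l] = k * xv[l] + yv[l];
    /* scalar store becomes a vector store */
    for (int l = 0; l < W; ++l) y[tid[l]] = yv[l];
}

/* The launch now steps through the index space W work items at a time. */
void launch_wfv(size_t global_size, int k, const int *x, int *y) {
    assert(global_size % W == 0);
    for (size_t t = 0; t < global_size; t += W)
        axpy_wfv(t, k, x, y);
}
```

Because the work item IDs are consecutive, the loads and stores here are contiguous; indirect accesses and branches are where gathers and masks come in.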


Lessons Learned


Experience with Kernels

• Experiments used Synopsys ARC® MetaWare Research compiler and simulator

• Wide vectors with multiple data types

• Predicated scatter/gather built-ins

• Cross-lane reductions/shuffles

• SIMD-based optimized built-ins library

• Explicit vectorization → output after WFV

Benchmark         OpenCL C with extensions: performance relative to optimized assembly versions
HoG linear SVM    1.12
Integral Image    1.11
Median filter     1.03
Histogram         1.02


Experience with Complex Kernels

• Whole Function Vectorization essential for

• Complex kernels with control flow, non-contiguous memory references

• SIMD extensions for

• Irregular memory references

• Predicated execution

• Predicate registers versus predicate stack

• Re-use predicate registers across data types
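The combination of irregular memory references and predicated execution can be illustrated with a plain C emulation of a predicated gather and scatter. The names and signatures below are hypothetical and are not the built-ins of any Synopsys toolchain; the point is only that inactive lanes neither read nor write memory, so their indices may be invalid.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4

/* Predicated gather: active lanes load base[idx[l]]; inactive lanes
   keep passthru[l] and do not touch memory. */
void gather_pred(int32_t dst[LANES], const int32_t *base,
                 const int32_t idx[LANES], const uint8_t pred[LANES],
                 const int32_t passthru[LANES]) {
    for (int l = 0; l < LANES; ++l) {
        if (pred[l]) {
            assert(idx[l] >= 0);
            dst[l] = base[idx[l]];
        } else {
            dst[l] = passthru[l];
        }
    }
}

/* Predicated scatter: only active lanes store src[l] to base[idx[l]]. */
void scatter_pred(int32_t *base, const int32_t idx[LANES],
                  const int32_t src[LANES], const uint8_t pred[LANES]) {
    for (int l = 0; l < LANES; ++l) {
        if (pred[l]) {
            assert(idx[l] >= 0);
            base[idx[l]] = src[l];
        }
    }
}
```

With an unpredicated gather, the out-of-range indices in masked-off lanes would have to be clamped or guarded by the programmer instead.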


Resources

• Synopsys DesignWare® EV Family of Vision Processors

• https://www.synopsys.com/dw/ipdir.php?ds=ev52-ev54

• Whole Function Vectorization

• http://www.intel-vci.uni-saarland.de/uploads/tx_sibibtex/10_01.pdf

• OpenCL C Khronos specification

• https://www.khronos.org/registry/cl/specs/opencl-2.0-openclc.pdf

• Intel SPMD Program Compiler

• https://ispc.github.io/
