Copyright © 2016 Synopsys Inc. 1
Using the OpenCL C Kernel Language for Embedded Vision Processors
Seema Mirchandaney
May 3, 2016
Agenda
• Synopsys Embedded Vision Processor and Tools Exploration
• OpenCL™ C Introduction
• Vectorization in OpenCL C
• Lessons learned
Synopsys Embedded Vision Processor and Tools Exploration
Synopsys EV5x Vision Processors
Embedded Vision SIMD Processor Trends
• Parallelism exploited at multiple levels
  • SIMD instructions
  • N-way VLIW
  • Multi-core
• Challenges for SIMD → irregularity
  • Non-contiguous memory reference patterns
  • Non-uniform data flow (control statements)
  • Multiple data types (char, short, int, float)
  • Data dependences
Programming in C for SIMD
• Data parallelism is not implicit in the language
• Intrinsics used for SIMD operations → low level
• Vendor-specific extensions for vector data types → non-portable
• Compilers perform inner-loop vectorization → limited success
  • Pointer aliasing, complex subscripts, limited data dependence analysis
  • Pragmas required to guide the compiler to be ‘smart’ (no dependence, SIMD width)
• New language extensions → Intel SPMD Program Compiler (ispc)
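As a minimal sketch of the pointer-aliasing point on this slide: in plain C99, the `restrict` qualifier is one standard way to promise the compiler that pointers do not overlap, which removes one obstacle to inner-loop vectorization. The function name `add_arrays` is hypothetical, and whether a given compiler actually emits SIMD instructions for this loop depends on the compiler and its options.

```c
#include <stddef.h>

/* Element-wise sum of two arrays.
 * The restrict qualifiers promise that dst, a and b never overlap, so
 * every iteration is independent of the others and the loop becomes a
 * candidate for inner-loop vectorization. Without them, the compiler
 * must assume a store to dst[i] could change a later a[j] or b[j]. */
void add_arrays(int *restrict dst, const int *restrict a,
                const int *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Calling `add_arrays` with overlapping buffers would break the promise and is undefined behavior, which is part of why this style shifts the burden onto the programmer.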
OpenCL C Introduction
Data Parallelism
• OpenCL C language is derived from ISO C99
  • Disallows standard C99 headers, function pointers, recursion, variable length arrays, and bit fields
• Important additions to the language for parallelism
  • Work items and workgroups
  • Vector types up to 16 lanes
  • Synchronization
  • Address space qualifiers
  • Large set of built-ins
Work Items and Workgroups
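The relationship between work items and workgroups can be sketched with a sequential C model. This is not OpenCL code: `square_item` and `run_1d` are hypothetical names, and the model assumes a one-dimensional launch with a zero global offset, where the global ID equals the group ID times the workgroup size plus the local ID.

```c
#include <stddef.h>

/* Hypothetical body of one work item: squares one element. */
static void square_item(size_t global_id, const int *in, int *out)
{
    out[global_id] = in[global_id] * in[global_id];
}

/* Sequential model of a 1-D launch with num_groups workgroups of
 * local_size work items each. A real device is free to run the work
 * items in parallel and in any order; the loops here only enumerate
 * the IDs that each work item would observe. */
void run_1d(size_t num_groups, size_t local_size, const int *in, int *out)
{
    for (size_t group = 0; group < num_groups; ++group) {
        for (size_t local = 0; local < local_size; ++local) {
            /* Assumes a zero global offset. */
            size_t global_id = group * local_size + local;
            square_item(global_id, in, out);
        }
    }
}
```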
Performance is a challenge
• Explicit vectorization: managed by the programmer
• Implicit vectorization: automatically performed by the compiler
• Differences in execution of work items: GPU vs. CPU vs. SIMD Vision Processors
  • GPU: one work item simply maps to one hardware thread
  • CPU: one work item runs on one CPU core, with all CPU cores busy; libraries (pthreads, OpenMP or MPI) have to be employed to obtain the wanted effect
  • SIMD Vision Processors
Language Extensions and Optimizing Compiler
• Programming model: OpenCL kernels + OpenVX
• Advanced whole function vectorization module
• Extensions for explicit vectorization
  • Wider vectors ([u]short32, [u]char64) + operations
  • Built-ins for scatter/gather with predication
• Multiple vector lane modes allow for maximizing performance
  • 16 lanes [int, short2, char4 data types optimized]
  • 32 lanes [short, char2 data types optimized]
  • 64 lanes [char data type optimized]
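The three lane modes are consistent with a fixed vector width divided by the element size: 16 × 32 bits, 32 × 16 bits and 64 × 8 bits all come to 512 bits. The sketch below encodes that arithmetic. The 512-bit width is an inference from the lane counts, not a figure stated on the slide, and `lanes_for` is a hypothetical helper.

```c
/* Assumed vector register size in bytes (512 bits). This value is
 * inferred from the lane counts on the slide, not stated there. */
enum { VECTOR_BYTES = 64 };

/* Number of lanes available for elements of elem_bytes bytes.
 * 4-byte elements (int, short2, char4) -> 16 lanes
 * 2-byte elements (short, char2)       -> 32 lanes
 * 1-byte elements (char)               -> 64 lanes */
int lanes_for(int elem_bytes)
{
    return VECTOR_BYTES / elem_bytes;
}
```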
Vectorization in OpenCL C
Explicit Vectorization
• What’s there in OpenCL C for explicit vectorization?
  • Vector data types
  • Built-ins for vector data types
  • Relational built-ins that enable vector predication (any, all, select)
  • Basic control flow operations on vectors
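The predication built-ins named here can be modeled in plain C to show what they compute. This is a sketch, not OpenCL code: `int4_model`, `is_less`, `select4`, `any4` and `all4` are hypothetical stand-ins for the OpenCL C `int4` type, the vector `<` operator, and the `select`, `any` and `all` built-ins. It follows the OpenCL C convention that a true vector comparison lane is -1 (all bits set) and that the built-ins test the most significant bit of each lane.

```c
enum { LANES = 4 };

/* Stand-in for an OpenCL C int4 value. */
typedef struct { int s[LANES]; } int4_model;

/* Lane-wise a < b: -1 where true, 0 where false. */
int4_model is_less(int4_model a, int4_model b)
{
    int4_model r;
    for (int i = 0; i < LANES; ++i)
        r.s[i] = (a.s[i] < b.s[i]) ? -1 : 0;
    return r;
}

/* Model of select(a, b, c): take b where the most significant bit of
 * the mask lane is set (the lane is negative), otherwise take a. */
int4_model select4(int4_model a, int4_model b, int4_model c)
{
    int4_model r;
    for (int i = 0; i < LANES; ++i)
        r.s[i] = (c.s[i] < 0) ? b.s[i] : a.s[i];
    return r;
}

/* Model of any(c): 1 if the most significant bit is set in any lane. */
int any4(int4_model c)
{
    for (int i = 0; i < LANES; ++i)
        if (c.s[i] < 0)
            return 1;
    return 0;
}

/* Model of all(c): 1 if the most significant bit is set in every lane. */
int all4(int4_model c)
{
    for (int i = 0; i < LANES; ++i)
        if (c.s[i] >= 0)
            return 0;
    return 1;
}
```

A per-lane `if (a < b) a = b;` then becomes `select4(a, b, is_less(a, b))`, with `any4` and `all4` available to skip work when no lane, or every lane, takes the branch.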
Kernel Example
kernel void X(global int *a, global int *b, int n, int cval) {
    int tid = get_global_id(0);
    int val = 0;
    for (int i = 0; i < n; ++i) {
        if (val < cval) {             // varying scalar (val)
            val += a[b[tid]];         // non-consecutive load (gather)
        }
        else if (val > cval * 2)      // divergent control flow
            val += b[tid];
    }
    a[tid] = val;                     // the kernel represents the parallelism
}
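Explicit Vectorization in OpenCL C
kernel void X(global int *a, global int *b, int n, int cval) {
    int4 tid = {gid, …, gid+3};             // four consecutive ids (gid is not defined on the slide)
    int4 val = 0;
    for (int i = 0; i < n; ++i) {
        int4 mask = val < ((int4)cval);     // lanes that take the "if" branch
        int4 valg = gather4(a, b, tid, …);  // gather of a[b[tid]]
        int4 val1 = val + valg;
        val = mask ? val1 : val;
        int4 mval = (int4)cval * (int4)2;
        int4 maske = (val > mval);
        maske = maske & ~mask;              // "else if": exclude lanes that took the "if" branch
        int4 bl = vload4(b…);
        val1 = val + bl;
        val = maske ? val1 : val;           // update only the "else if" lanes
    }
    vstore4(a, …);
}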
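To check that the predicated form computes the same result as the scalar kernel, both can be modeled in plain C. This is a sketch under assumptions, not the OpenCL code from the slides: `kernel_scalar` and `kernel_vec4` are hypothetical names, the 4-lane vector is modeled as a small array, and the gather is written as an ordinary indexed load in every lane. The model also assumes the gathered elements of `a` are not among the elements being stored, because otherwise the scalar and lockstep versions could observe different values.

```c
enum { LANES = 4 };

/* Scalar reference: the body executed by one work item. */
void kernel_scalar(int *a, const int *b, int n, int cval, int tid)
{
    int val = 0;
    for (int i = 0; i < n; ++i) {
        if (val < cval)
            val += a[b[tid]];
        else if (val > cval * 2)
            val += b[tid];
    }
    a[tid] = val;
}

/* Predicated model: work items gid .. gid+LANES-1 advance in lockstep.
 * Both branches are computed for every lane and masks decide which
 * result each lane keeps. */
void kernel_vec4(int *a, const int *b, int n, int cval, int gid)
{
    int val[LANES] = {0};
    for (int i = 0; i < n; ++i) {
        int mask[LANES], maske[LANES], val1[LANES];
        for (int l = 0; l < LANES; ++l) {
            mask[l] = (val[l] < cval) ? -1 : 0;      /* "if" lanes */
            val1[l] = val[l] + a[b[gid + l]];        /* gather + add */
            val[l]  = mask[l] ? val1[l] : val[l];
        }
        for (int l = 0; l < LANES; ++l) {
            /* "else if" lanes: condition holds and "if" was not taken */
            maske[l] = ((val[l] > cval * 2) ? -1 : 0) & ~mask[l];
            val1[l]  = val[l] + b[gid + l];          /* contiguous load + add */
            val[l]   = maske[l] ? val1[l] : val[l];
        }
    }
    for (int l = 0; l < LANES; ++l)
        a[gid + l] = val[l];                         /* contiguous store */
}
```

Running `kernel_scalar` once per work item and `kernel_vec4` once per group of four on identical inputs should leave `a` in the same state, as long as the assumption about disjoint gather and store indices holds.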
Portable Programming with Performance
• OpenCL C kernels → parallelism expressed
• DSP-based SIMD architectures pose a challenge to balance portability and performance
• Explicit vectorization restricts portability
  • Existing DSP-based architectures with varying SIMD extensions
  • Detailed knowledge of hardware required to achieve performance
  • Extensions may be required to support the hardware features
Implicit / Whole Function Vectorization (WFV)
• Requires compiler support beyond traditional inner-loop vectorization
• Main idea
  • Transform a kernel to a multi-work-item kernel (SIMD lanes)
  • Transform accesses to ‘thread id’ (ID of a work item) to return a vector of w (num lanes) consecutive values
  • Transform each operation into its vector counterpart
• Adapts well to DSP processors with extensive SIMD instruction sets
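The main idea can be sketched in plain C by writing a kernel body once per work item and once in the form that whole function vectorization conceptually produces. This is an illustration only, not compiler output: `scale_item`, `scale_wfv` and the lane count `W` are hypothetical, and each inner loop over lanes stands for a single vector instruction.

```c
enum { W = 4 };  /* number of SIMD lanes (assumed for illustration) */

/* Original kernel body, written for a single work item. */
void scale_item(const int *in, int *out, int tid)
{
    out[tid] = in[tid] * 2 + 1;
}

/* Conceptual result of WFV: one invocation covers W work items. */
void scale_wfv(const int *in, int *out, int first_tid)
{
    int tid[W], x[W];

    /* 'thread id' becomes a vector of W consecutive values */
    for (int l = 0; l < W; ++l)
        tid[l] = first_tid + l;

    /* consecutive ids turn the scalar load into a contiguous vector load */
    for (int l = 0; l < W; ++l)
        x[l] = in[tid[l]];

    /* each scalar operation becomes its vector counterpart */
    for (int l = 0; l < W; ++l)
        x[l] = x[l] * 2 + 1;

    /* contiguous vector store */
    for (int l = 0; l < W; ++l)
        out[tid[l]] = x[l];
}
```

Eight work items then need eight calls to `scale_item` but only two calls to `scale_wfv`, with `first_tid` set to 0 and 4.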
Lessons Learned
Experience with Kernels
• Experiments used Synopsys ARC® MetaWare Research compiler and simulator
• Wide vectors with multiple data types
• Predicated scatter/gather built-ins
• Cross-lane reductions/shuffles
• SIMD-based optimized built-ins library
• Explicit vectorization → output after WFV

Benchmark         | OpenCL C with extensions: performance relative to optimized assembly versions
HoG linear SVM    | 1.12
Integral Image    | 1.11
Median filter     | 1.03
Histogram         | 1.02
Experience with Complex Kernels
• Whole Function Vectorization essential for
  • Complex kernels with control flow, non-contiguous memory references
• SIMD extensions for
  • Irregular memory references
  • Predicated execution
• Predicate registers versus predicate stack
• Re-use predicate registers across data types
Resources
• Synopsys DesignWare® EV Family of Vision Processors
  • https://www.synopsys.com/dw/ipdir.php?ds=ev52-ev54
• Whole Function Vectorization
  • http://www.intel-vci.uni-saarland.de/uploads/tx_sibibtex/10_01.pdf
• OpenCL C Khronos specification
  • https://www.khronos.org/registry/cl/specs/opencl-2.0-openclc.pdf
• Intel SPMD program compiler
  • https://ispc.github.io/