Upload
lamnguyet
View
217
Download
2
Embed Size (px)
Citation preview
© Copyright 2009 XilinxCopyright 2012 Xilinx
Using NEON for Parallel Data Processing
Zynq-7000
Hardware Architecture
Speaker: Leon QinTitle: Processor SpecialistDate: Oct, 2012
© Copyright 2009 XilinxCopyright 2011 Xilinx
Zynq-7000 Block Diagram
2x GigE
with DMA
2x USB
with DMA
2x SDIO
with DMA
Static Memory Controller
Quad-SPI, NAND, NOR
Dynamic Memory Controller
DDR3, DDR2, LPDDR2
AMBA® Switches
Programmable
Logic:System Gates,
DSP, RAM
XADC PCIe
Multi-Standards I/Os (3.3V & High Speed 1.8V)
Mu
lti-
Sta
nd
ard
s I/O
s (
3.3
V &
Hig
h S
peed
1.8
V)
Multi Gigabit Transceivers
I/O
MUXMIO
ARM® CoreSight™ Multi-core & Trace Debug
512 KB L2 Cache
NEON™/ FPU Engine
Cortex™-A9 MPCore™
32/32 KB I/D Caches
NEON™/ FPU Engine
Cortex™-A9 MPCore™
32/32 KB I/D Caches
Snoop Control Unit (SCU)
Timer Counters 256 KB On-Chip Memory
General Interrupt Controller DMA Configuration
2x SPI
2x I2C
2x CAN
2x UART
GPIO
Processing System
AMBA® Switches
AMBA® Switches
AMBA® Switches
S_AXI_HP0
S_AXI_HP1
S_AXI_HP2
S_AXI_HP3
S_AXI_ACP
M_AXI_GP0/1S_AXI_GP0/1EMIO
Page 2
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 3
ARM Architecture Evolution
VFPv3
Jazelle®
VFPv2
SIMD
Thumb®-2
NEON™Adv SIMD
TrustZone™
Thumb-EE
Thumb-2 Only
V5 V6 V7 A&R V7 M
Improved Media and
DSP
Key Technology
Additions by
Architecture Generation
Execution Environments:
Improvedmemory use
Key Technology
Additions by
Architecture Generation
ARM9
ARM10
ARM11
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
General purpose SIMD processing useful for many applications
Supports widest range multimedia codecs used for internet applications
– Many soft codec standards: MPEG-4, H.264, On2 VP6/7/8, Real, AVS, …
– Ideal solution for normal size „internet streaming‟ decode of various formats
Fewer cycles needed
– Neon will give 60-150% performance boost on complex video codecs
– Simple DSP algorithms can show larger performance boost (4x-8x)
– Balance of computation and memory access is required
– Processor can sleep sooner => overall dynamic power saving
Why NEON?
Page 4
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
NEON is a mature advanced SIMD technology.
– SIMD exist on many 32-bit arch
• PowerPC has AltiVec, while x86 has MMX/SSE/AVX
– Can significantly accelerate parallelable repetitive operations on large data sets.
Beneficial to many DSP or multimedia algorithms
– Clean orthogonal vector architecture, applicable to a wide range of data intensive computation
– audio, video, and image processing codecs.
– Not just for codecs – also applicable to 2D & 3D graphics etc
– Color-space conversion.
– Physics simulations.
– Error correction(such as Reed Solomon codecs, CRCs), elliptic curve
cryptography, etc.
Why NEON?
Page 5
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
NEON advantages
– Easy programming & debug
– Fully coherent with CPU, no cache maintenance operations
– Part of ARM arch - no hardware or software integration required
– Ecosystem support off-the-shelf, no porting required
DSP/FPGA advantages
– Runs parallel with CPU, few CPU cycles required
– More „realtime‟ - no OS/cache variability
– Fixed function or limited codec support
– Potentially higher performance (e.g. 1080p Full HD video)
NEON vs. DSP/FPGA offload
Page 6
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
NEON Agenda
Page 7
NEON Hardware overview
NEON Instruction set overview
NEON Software Support
NEON improves performance
Page 7
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 8
What is NEON?
NEON is a wide SIMD data processing architecture
– Extension of the ARM instruction set
– 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)
NEON Instructions perform “Packed SIMD” processing
– Registers are considered as vectors of elements of the same data type
– Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single prec. float
– Instructions perform the same operation in all lanes
Dn
Dm
Dd
Lane
Source RegistersSource
Registers
Operation
Destination Register
ElementsElementsElements
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
Example SIMD instruction – Vector ADD
Page 9
Larger register size
Register split into equal size any type elements
Operation performed on same element of each register
VADD.U16 D2, D1, D0
0x1001 0x1234 0x7 0xAB
0xFF0 0x5678 0xFFF8 0xCD
0x1FF1 0x68AC 0xFFFF 0x178
+
=
D0
D2
D1
+ ++
===
015314763
Page 9
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 10
Neon Data Types
NEON natively supports a set of common data types
– Integer and Fixed-Point: 8-bit, 16-bit, 32-bit and 64-bit
– 32-bit Single-precision Floating-point; 8 and 16-bit polynomial
Data types are represented using a bit-size and format letter
VADD.U16 D2, D1, D0
Not all data types available in all sizes
.S8
.U8
.P8
.I8.8
.S16
.U16
.P16
.I16.16
.S32
.U32
.F32
.I32.32
.S64
.U64.I64.64
Signed, Unsigned Integers;
Polynomials
32-bit Signed, Unsigned
Integers; Floats
8/16-bit Signed, Unsigned Integers;
Polynomials
64-bit Signed,
Unsigned Integers;
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 11
NEON Registers
NEON provides a 256-byte register file
– Distinct from the core registers
– Extension to the VFPv2 register file (VFPv3)
Two explicitly aliased views
– 32 x 64-bit registers (D0-D31)
– 16 x 128-bit registers (Q0-Q15)
Enables register trade-off
– Vector length
– Available registers
Also uses the summary flags in the VFP FPSCR
– Adds a QC integer saturation summary flag
– No per-lane flags, so „carry‟ handled using wider result (16bit+16bit -> 32-bit)
Q0
Q1
Q15
:
D0
D1
D2
D3
:
D30
D31
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 12
Vectors and Scalars
Registers hold one or more elements of the same data type
– Vn can be used to reference either a 64-bit Dn or 128-bit Qn register
– A register, data type combination describes a vector of elements
Some instructions can reference individual scalar elements
– Scalar elements are referenced using the array notation Vn[x]
Array ordering is always from the least significant bit
64-bit 128-bit
I64 D0
D7
Dn
63 0
S32 S32 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8
Qn
127 0
F32 F32 F32 F32 Q0
Q7
F32 F32 F32 F32 Q0
Q0[0]Q0[1]Q0[2]Q0[3]
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
NEON Agenda
Page 13
NEON Hardware overview
NEON Instruction set overview
NEON Software Support
NEON improves performance
Page 13
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 14
Instruction Syntax
V{<mod>}<op>{<shape>}{<cond>}{.<dt>}(<dest>}, src1, src2
<mod> - Instruction ModifiersQ indicates the operation uses saturating arithmetic (e.g. VQADD)
H indicates the operation halves the result (e.g. VHADD)
D indicates the operation doubles the result (e.g. VQDMUL)
R indicates the operation performs rounding (e.g. VRHADD)
<op> - Instruction Operation (e.g. ADD,MUL, MLA, MAX, SHR, SHL, MOV)
<shape> - ShapeL – The result is double the width of both operands
W – The result and first operand are double the width of the last operand
N – The result is half the width of both operands
<cond> - Conditional, used with IT instruction
<.dt> - Data type
<dest> - Destination, <src1> - Source operand 1, <src2>
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
Q Modifier
– Saturating instructions that saturate set the “Cumulative saturation” flag (QC bit) in the
Floating-point Status and Control Register (FPSCR)
– The QC flag is sticky
• Use VMRS and VMSR instructions to read and to clear the flag
H Modifier
– Halves the result
– Can only be used on addition and subtraction instructions
• VHADD, VHSUB and VRHADD
D Modifier
– Only available for saturating variants “long” and “high half” multiplies
– VQDMLAL, VQDMLSL, VQDMULH, VQDMULL, VQRDMULH
R Modifier
– Always “Round to Nearest”, as defined in the IEEE 754 standard
– Available on instructions that include a right shift
• Including Halving and “high half” instructionsPage 15
Instruction Modifiers and Shapes
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
Long and Wide – Input elements are promoted before operation
Narrow – Input elements are demoted before operation
Page 16
Instruction Shapes
Dn
Dm
Qd
L Shape
N Shape
Qn
Qd
Dm
W ShapeQn
Qm
Dd
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 17
Multiple 1-Element Structure Access
VLD1, VST1 provide standard array access
– An array of structures containing a single component is a basic array
– List can contain 1, 2, 3 or 4 consecutive registers
– Transfer multiple consecutive 8, 16, 32 or 64-bit elements
x4x5x6x7
x0x1x2x3 D3D4
x7
x5
x4
x3
x2
x1
x0
x6
[R1]
+2
+4
+6
+8
:
+10
+12
+14
VST1.16 {D3,D4}, [R1]
x0x1x2x3 D7
x3
x2
x1
x0[R4]
+2
+4
+6
:
VLD1.16 {D7}, [R4], R3
+R3
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 18
Addition: Basic
NEON supports various useful forms of basic addition
Normal Addition - VADD, VSUB
– Floating-point
– Integer (8-bit to 64-bit elements)
– 64-bit and 128-bit registers
Long Addition - VADDL, VSUBL
– Promotes both inputs before operation
– Signed/unsigned (8-bit to 32-bit source elements)
Wide Addition - VADDW, VSUBW
– Promotes one input before operation
– Signed/unsigned (8-bit 32-bit source elements)
VADD.I16 D0, D1, D2
VSUB.F32 Q7, Q1, Q4
VADD.I8 Q15, Q14, Q15
VSUB.I64 D0, D30, D5
VADDW.U8 Q1, Q7, D8
VSUBW.S16 Q8, Q1, D5
VADDL.U16 Q1, D7, D8
VSUBL.S32 Q8, D1, D5
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 19
Example – adding all lanes
Input in Q0 (D0 and D1)
u16 input values
Now Q0 contains 4x u32 values
(with 15 headroom bits)
Reducing/folding operation
needs 1 bit of headroom
Result is u64 in D0VPADDL.U32 D0, D0
VPADD.U32 D0, D0, D1
VPADDL.U16 Q0, Q0
DO
DO
DO
DO
DO
DO
D1
D1
D1
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 20
Summing a vector
+
+
+
+
+
+
+
+
+
+
DO
DO
DO
D1
+
+
+
+
+
+
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
Some NEON clever features
Page 21
Some NEON clever features
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 22
Data Movement: Table Lookup
Uses byte indexes to control byte look up in a table
– Table is a list of 1,2,3 or 4 adjacent registers
VTBL : out of range indexes generate 0 result
VTBX : out of range indexes leave destination unchanged
{D1,D2}b
D30826138411
D0
0 acghjkmop deiln
dai0niel
f
3
VTBL.8 D0, {D1, D2}, D3
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 23
Element Load Store Instructions
All treat memory as an array of structures (AoS)
– SIMD registers are treated as structure of arrays (SoA)
– Enables interleaving/de-interleaving for efficient SIMD processing
– Transfer up to 256-bits in a single instruction
Three forms of Element Load Store instructions are provided
Forms distinguished by type of register list provided
– Multiple Structure Access e.g. {D0, D1}
– Single Structure Access e.g. {D0[2], D1[2]}
– Single Structure Load to all lanes e.g. {D0[], D1[]}
x0y0z0x1y1z1x2y2z2x3
3-element structureelement
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 24
Multiple 2-Element Structure Access
VLD2, VST2 provide access to multiple 2-element structures
– List can contain 2 or 4 registers
– Transfer multiple consecutive 8, 16, or 32-bit 2-element structures
VLD2.16 {D2,D3}, [R1]
y0y1y2y3
x0x1x2x3 D2D3
x3
x2
x1
x0
y0
y1
y2
y3
[R1]
+2
+4
+6
+8
:
+10
+12
+14
VLD2.16 {D0,D1,D2,D3}, [R3]!
x3
x2
x1
x0
y0
y1
y2
[R3]
+2
+4
+6
+8
:
+10
+12
x7
y7
+28
+30
!
:x4x5x6x7
x0x1x2x3 D0D1
y4y5y6y7
y0y1y2y3 D2D3
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 25
Multiple 3/4-Element Structure Access
VLD3/4, VST3/4 provide access to 3 or 4-element structures
– Lists contain 3/4 registers; optional space for building 128-bit vectors
– Transfer multiple consecutive 8, 16, or 32-bit 3/4-element structures
x2
x1
x0
y0
z0
y1
y3
z1
z3
D3D4
[R1]
+2
+4
+6
+8
:
+10
+12
+20
VST3.16 {D3,D4,D5}, [R1]
y0y1y2y3
x0x1x2x3
z0z1z2z3 D5
+22
:x2
x1
x0
y0
z0
y1
y3
z1
z3
D3D4
[R1]
+2
+4
+6
+8
:
+10
+12
+20
VLD3.16 {D0,D2,D4}, [R1]!
y0y1y2y3
+22
:
D0D1
x0x1x2x3
z0z1z2z3
D2
!
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 26
Logical
NEON supports bitwise logical operations
VAND, VBIC, VEORR, VORN, VORR
– Bitwise logical operation
– Independent of data type
– 64-bit and 128-bit registers
VBIT, VBIF, VBSL
– Bitwise multiplex operations
– Insert True, Insert False, Select
– 3 versions overwrite different registers
– 64-bit and 128-bit registers
– Used with masks to provide selection
VAND D0, D0, D1
VORR Q0, Q1, Q15
VEOR Q7, Q1, Q15
VORN D15, D14, D1
VBIC D0, D30, D2
0 1 0 1 1 0
D0
D1
D2
D1
VBIT D1, D0, D2
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 27
NEON instruction summary
A comprehensive set of data prcoessing instructions
Form a general purpose SIMD instruction set suitable for compilers
NEON operations fall in to the following categories
Addition / Subtraction ( Saturating, Halving, Rounding)
MIN, MAX, NEG, MOV, ABS, ABD, …
Multiplication (MUL, MLA, MLS, …)
Comparison and Selection
Logic (AND , ORR, EOR, BIC, ORN, …)
Bitfield
Reciprocal Estimate/Step, Reciprocal Square Root Estimate/Step
Miscellaneous (DUP, EXT, CLZ, CLS, TBL, REV, ZIP. TRN, …)
Many more…
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 28
NEON instruction reference
Official NEON instruction Set reference is “Advanced SIMD” in
ARM Architecture Reference Manual v7 A & R edition
Available to partners on www.arm.com
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 29
Further Reading
Documentation
– ARM “ARM” v7 A&R
– “NEON Support in the Realview compiler” white paper
– “NEON optimizations in Android” white paper
– Realview Compiler Guide (for intrinsics)
– ARM Cortex-A Programmers‟ Guide (from www.arm.com downloads)
– Software blogs on www.arm.com
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
NEON Agenda
Page 30
NEON Hardware overview
NEON Instruction set overview
NEON Software Support
NEON improves performance
Page 30
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 31
How to use NEON
NOEN optimized Open source Libraries
– OpenMAX DL (Development Layer): APIs contain a comprehensive set of
audio, video and imaging functions that can be used for a wide range of
accelerated codec functionality such as MPEG-4, H.264, MP3, AAC and
JPEG.
– Broad open source support for NEON
Vectorizing Compilers– Exploits NEON SIMD automatically with existing C source code
NEON Intrinsics– C function call interface to NEON operations
– Supports all data types and operations supported by NEON
Assembler Code– For those who really want to optimize at the lowest level
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 32
OpenMAX DL v1.0 Library Summary
Audio Domain
MP3
AAC
Signal Processing Domain
FIR
IIR
FFT
Dot Product
Video Domain
– MPEG-4 simple profile
– H.264 baseline
Still Image Domain
– JPEG
Image Processing Domain
– Colorspace conversion
– De-blocking / de-ringing
– Rotation, scaling, compositing
Spec from: http://www.khronos.org/openmax/
Opensource implementation for ARM11 & NEON available from:
http://www.arm.com/zh/community/multimedia/standards-apis.php
NOTE: OpenMax DL provides low level data processing functions, not the complete codecs
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
Google – WebM – 11,000 lines NEON assembler!
Bluez – official Linux Bluetooth protocol stack
– NEON sbc audio encoder
Pixman (part of cairo 2D graphics library)
– Compositing/alpha blending
ffmpeg – libavcodec
– LGPL media player used in many Linux distros
– NEON Video: MPEG-2, MPEG-4 ASP, H.264 (AVC), VC-1, VP3, Theora
– NEON Audio: AAC, Vorbis, WMA
x264 – Google Summer Of Code 2009
– GPL H.264 encoder – e.g. for video conferencing
Android – NEON optimizations
– Skia library, S32A_D565_Opaque 5x faster using NEON
Eigen2 – C++ vector math / linear algebra template library
Theorarm – libtheora NEON version (optimized by Google)
libjpeg – optimized JPEG decode (IJG library)
FFTW – NEON enabled FFT library
LLVM – code generation backend used by Android Renderscript
NEON in opensource
Page 33
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 34
Automatic Vectorization: ARM Compiler
Automatic vectorization can generate code targeted for NEON from
ordinary C source code
– Less effort to produce efficient code
– Portable - no compiler-specific source code features need to be used
To enable automatic vectorization on armcc, use these options together:
--vectorize - enable vectorization
--cpu Cortex-A9 - provide a CPU option with NEON support
-O2 or -O3 - select high optimization level.
-Otime - optimize for speed over space
--fpmode fast - the precision of vectorized floating-point operations is the same
as VFP RunFast mode
--diag_warning=optimizations to obtain useful diagnostics from the
compiler on what it could or could not optimize/vectorize
Note:
– When you specify --vectorize, automatic vectorization is enabled only if you also specify -
Otime and an optimization level of -O2 or -O3.
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 35
Automatic Vectorization: GNU tools
To enable automatic vectorization on GCC/g++, use these options
together:
-mcpu=cortex-a9 -Specify a suitable ARMv7-A processor
-mfpu=neon - enable NEON support
-ftree-vectorize - support SIMD on many arch
- O3 implies -ftree-vectorize
-mvectorize-with-neon-quad - By default, GCC 4.4 only vectorize for
doubleword
-mfloat-abi=softfp -Can use “hard” for more efficient floating
point parameter passing, but all code must
be compiled with this option
Understand more with -ftree-vectorizer-verbose
Takes an integer value specifying the level of detail to provide, where 1 enables
additional printouts and higher values add even more information.
What vectorization the compiler is performing, or what is unable to perform because of
possible dependencies
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
Automatic Vectorization - how it works
void add_int(int * __restrict pa,
int * __restrict pb,
unsigned int n, int x)
{
unsigned int i;
for(i = 0; i < (n & ~3); i++)
pa[i] = pb[i] + x;
}
1. Analyze each loop:
Are pointer accesses safe for
vectorization?
What data types are being used?
How do they map onto NEON
vector registers?
How many loop iterations are
there?
void add_int(int *pa, int *pb,
unsigned n, int x)
{
unsigned int i;
for (i = ((n & ~3) >> 2); i; i--)
{
*(pa + 0) = *(pb + 0) + x;
*(pa + 1) = *(pb + 1) + x;
*(pa + 2) = *(pb + 2) + x;
*(pa + 3) = *(pb + 3) + x;
pa += 4; pb += 4;
}
}
2. Unroll the loop to the appropriate number of
iterations, and perform other transformations
like pointerization
3. Map each unrolled operation onto a
NEON vector lane, and generate
corresponding NEON instructions
Page 36
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
ARM RVDS & gcc vectorising compiler
|L1.16|
VLD1.32 {d0,d1},[r0]!
SUBS r3,r3,#1
VLD1.32 {d2,d3},[r1]!
VADD.I32 q0,q0,q1
VST1.32 {d0,d1},[r2]!
BNE |L1.16|
armcc -S --cpu cortex-a8
-O3 -Otime --vectorize test.cint a[256], b[256], c[256];
foo () {
int i;
for (i=0; i<256; i++){
a[i] = b[i] + c[i];
}
} gcc -S -O3 -mcpu=cortex-a8
-mfpu=neon -ftree-vectorize
-ftree-vectorizer-verbose=6
test.c
.L2:
add r1, r0, ip
add r3, r0, lr
add r2, r0, r4
add r0, r0, #8
cmp r0, #1024
fldd d7, [r3, #0]
fldd d6, [r2, #0]
vadd.i32 d7, d7, d6
fstd d7, [r1, #0]
bne .L2
Page 37
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
The goal is to try to make the code simple, straight forward, and parallel, so that the
compiler can easily convert the code to NEON assembly
Loops can be modified for better vectorizing:
– Short, simple loops work the best (even if it means multiple loops in your code)
– Avoid breaks / loop-carried dependencies / conditions inside loops
– Try to make the number of iteration a power of 2
– Try to make sure the number of iteration is known to the compiler
– Functions called inside a lop should be inlined
Pointer issues:
– Using arrays with indexing vectorizes better than using pointer
– Indirect addressing (multiple indexing or de-reference) doesn‟t vectorize
– Use __restricet key word to tell the compiler that pointers does not reference
overlapping areas of memory
Use suitable data types
– For best performance, always use the smallest data type that can hold the required values
Page 38
Tuning C/C++ Code for Vectorizing
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 39
NEON Intrinsics
Available in armcc, GCC/g++, and llvm. Syntax is the same, so source
code that uses intrinsics can be compiled by any of these compilers.
Advantage:
– Provide low-level access to NEON instructions. Compiler do hard work like:
Register allocation.
Code scheduling, or re-ordering instructions.
The C compilers can reorder code to ensure the minimum number of stalls
according to a specific processor.
Disadvantage:
– Possibly the compiler output is not exactly the code you want, so there is still
some possibility of improvement when moving to NEON assembler code.
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 40
NEON Intrinsics - Example
Include intrinsics header file
#include <arm_neon.h>
Use special NEON data types which correspond to D and Q registers, e.g.
int8x8_t D-register containing 8x 8-bit elements
int16x4_t D-register containing 4x 16-bit elements
int32x4_t Q-register containing 4x 32-bit elements
Use special intrinsics versions of NEON instructions
vin1 = vld1q_s32(ptr);
vout = vaddq_s32(vin1, vin2);
vst1q_s32(vout, ptr);
Strongly typed!
– Use vreinterpret_s16_s32( ) to change the type
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 41
NEON Intrinsics - Example
NEON intrinsic
Command line with GCC
arm-none-linux-gnueabi-gcc -mfpu=neon intrinsic.c
Command line with RVCT
armcc --cpu=Cortex-A9 intrinsic.c
#include <arm_neon.h>
uint32x4_t double_elements(uint32x4_t input)
{
return(vaddq_u32(input, input));
}
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 42
NEON Intrinsics : reference
For information about the intrinsic functions and vector data types, see the:
RealView Compilation Tools Compiler Reference Guide, available from
– http://infocenter.arm.com
GCC documentation, available from
– http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 43
NEON Assembler Code
Advantage:
– Careful hand-scheduling is recommended to get the best out of any NEON
assembler code you write, especially for performance-critical applications
Disadvantage:
– Need to be aware of some underlying hardware features, like pipelining and
scheduling issues, memory access behavior and scheduling hazards.
– Optimization is processor dependent.
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 XilinxPage 44
NEON Assembler Code - Example
Gas
Command: arm-none-linux-gnueabi-as -mfpu=neon asm.s
RVCT
Command: armasm --cpu=Cortex-A9 asm.s
.text
.arm
.global double_elements
double_elements:
vadd.i32 q0,q0,q0
bx lr
.end
AREA RO, CODE, READONLY
ARM
EXPORT double_elements
double_elements
VADD.I32 Q0, Q0, Q0
BX LR
END
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
NEON data engine is disabled at reset
– Needs to be enabled in software before use
Enabling NEON requires 2 steps
1.Enable access to coprocessors 10 and 11 and allow Neon instructions
MRC p15, 0x0, r0, c1, c0, 2 ; Read CP15 CPACR
ORR r0, r0, #(0x0f << 20) ; Full access rights
MCR p15, 0x0, r0, c1, c0, 2 ; Write CP15 CPACR
ISB
2.Enable NEON and VFP
MOV r0, #0x40000000 ; set bit 30
VMSR FPEXC, r0 ; write r0 to Floating Point Exception
; Register
Enabling NEON before using
Page 45
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
NEON Agenda
Page 46
NEON Hardware overview
NEON Instruction set overview
NEON Software Support
NEON improves performance
Page 46
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
FFT time No NEON
(v6 SIMD asm)
With NEON
(v7 NEON asm)
Cortex-A8 500MHz
Actual silicon
15.2 us 3.8 us
(x 4.0 performance)
NEON in Audio
FFT: 256-point, 16-bit signed complex numbers
– FFT is a key component of AAC, Voice/pattern recognition etc.
– Hand optimized assembler in both cases
Extreme example: FFT in ffmpeg: 12x faster
– C code -> handwitten asm
– Scalar -> vector processing
– Single-precision floating point on Cortex-A8 (VFPlite -> NEON)
Page 47
© Copyright 2009 XilinxCopyright 2011 Xilinx
ARM RVDS vectorizing compiler
• RVDS 4.0 professional includes auto-vectorizing armcc
– armcc --vectorize --cpu=Cortex-A8 x.c
• Up to 4x performance increase for benchmarks, with no source code changes(no source code changes are permitted for benchmarking)
• Simple source code changes can yield significant improvements above this
– Use C „__restrict‟ keyword to work around C pointer aliasing issues
– Make loops clearly multiple of 2n (e.g. use 4*n as loop end) to aid vectorization
ARM vs NEON (Vectorize) on Cortex-A8
100% 100%
135%
169%
20%
70%
120%
170%
Telecom Consumer
ARM NEON
Improved
vectorization in
latest RVDS 4.0
Page 48
© Copyright 2009 XilinxCopyright 2011 Xilinx
FFmpeg/libav performance
git.ffmpeg.org
snapshot 21-Sep-09
YouTube HQ video decode
480x270, 30fps
Including AAC audio
Real silicon measurements
– OMAP3 Beagleboard
– ARM A9TC
NEON ~2x overall
performance
ffmpeg performance (relative to realtime)
0
0.5
1
1.5
2
2.5
3
Cortex-A8 256KB L2 500MHz Cortex-A9 512KB L2 400MHz
v7vfp
v7neon
Page 49
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
Using ffmpeg (LGPL opensource codec) with extensive NEON optimizations
Dolby reference code also available optimized from Dolby and other vendors
AC3 decode MHz
MHz required
(lower is better)
ffmpeg from git.libav.org
Checkout from 14-Nov-2011
./configure --extra-cflags=“-mcpu=cortex-a9
-mfpu=neon -mfloat-abi=softfp”
Benchmarked on 500MHz Cortex-A9
(ARM Versatile Express)
Samples from:
samples.mplayerhq.hu/A-codecs
0.0
10.0
20.0
30.0
40.0
50.0
60.0
v7neonv7fpu
v7fpu --disable-asm
8.2
16.5 17.8
22.4
46.450.7
blade_runner (48KHz 192Kbit/s stereo)
Broadway-5.1-48KHz-448kbit.ac3
Page 50
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
android_external_jpeg optimized for v6 and v7neon
16Mpixel digital camera test image - takes only 1.7s to decode on 500MHz
A9+NEON (improved from 4.5s unoptimized)
JPEG decode
0
20
40
60
80
100
120
140
160
djpeg djpeg.v6.opt djpeg.v7neon.opt
140
83
52
cycles/pixel - lower is better
Code from:
https://github.com/mansr
ARM Versatile Express 500MHz Cortex-A9
Ubuntu 11.04 Linux
djpeg -outfile /dev/null testfile.jpg
Also libjpeg-turbo via Linaro
http://libjpeg-turbo.virtualgl.org/
Page 51
© Copyright 2009 XilinxCopyright 2011 XilinxCopyright 2011 Xilinx
NEON will become standard on general purpose apps & media-
centric devices.
– NEON option across the Cortex-A roadmap
– NEON ideal for use by open OS systems with downloaded apps
Full enabling technology to support Neon
– Compilers, profilers, debuggers, libraries all available now
– Key differentiator: easy to program, popular with software engineers
Strong ARM NEON ecosystem
Complementary to DSP/FPGA
NEON Summary
Page 52
© Copyright 2009 XilinxCopyright 2012 Xilinx
Zynq-7000
Hardware Architecture