Utilizing NEON for Accelerated Computer Vision Processing in Augmented Reality Scenarios

Martin Lechner, CTO

Wikitude ARM workshop


Page 1: Wikitude ARM workshop

Martin Lechner, CTO

Utilizing NEON for Accelerated Computer Vision

Processing in Augmented Reality Scenarios

Page 2: Wikitude ARM workshop

Who is Wikitude?


Wikitude is the world's leading Augmented Reality ecosystem

● World-class team & technology

● A large and active developer community

● Leading developer and editorial tools for implementing AR applications

● High-profile monetization and distribution network

● Makers of the AR-Standard “ARML 2.0”

45,000+ registered AR developers

1,500+ AR apps

100+ countries

Page 3: Wikitude ARM workshop

Wikitude’s Main Products


Wikitude SDK

Studio

Cloud Recognition

Targets API

Publishing App

Page 4: Wikitude ARM workshop

Powered by World-Class AR Technology

World-class in-house IP bundled into a well-managed and proven product suite, plus AR content creation

Page 5: Wikitude ARM workshop

Wikitude Computer Vision

● 2D Natural Feature Tracking

● Tracking in 6 Degrees of Freedom

● 3D scene and 3D object recognition and tracking

● Fully integrated in the existing Wikitude SDK and product suite

● Focus on both indoor and outdoor scenarios

● Improved robustness for

- Changing lighting conditions

- Moving objects

- Low-textured environments

Page 6: Wikitude ARM workshop

Wikitude Computer Vision


● Optimized for mobile computing

- Mobile CPU Architectures (ARMv6, ARMv7, ARMv8)

- Vector Processing/SIMD (ARM NEON™)

- OpenGL ES (ARM Mali™)

- GPU Compute/OpenCL (ARM Mali)


Page 8: Wikitude ARM workshop

Why utilize NEON in Image Processing?

• Development time well spent!

- Most state-of-the-art mobile devices run on chips based on the ARMv7 or ARMv8 architecture

- Most of them include the NEON instruction set

• Image processing: a perfect match for SIMD

- Computationally expensive on the CPU

- Can run in parallel

- Simple operations

- The same operation is applied to multiple data sets (pixels or pixel ranges)
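As a concrete illustration of this pattern, consider greyscale conversion (one of the pipeline steps mentioned later in this deck). The sketch below is my own scalar C example, not Wikitude SDK code; the interleaved-RGB layout and the BT.601-style fixed-point weights are assumptions. Every pixel runs the identical small operation on independent data, which is exactly what SIMD accelerates.

```c
#include <stdint.h>

/* Scalar greyscale conversion: y = (77*r + 150*g + 29*b) >> 8.
 * The weights sum to 256, so white maps to 255 and black to 0.
 * The same operation is applied to every pixel independently,
 * the SIMD-friendly pattern described above. */
void rgb_to_grey(const uint8_t *rgb, uint8_t *grey, int numPixels)
{
    for (int i = 0; i < numPixels; i++) {
        int r = rgb[3 * i + 0];
        int g = rgb[3 * i + 1];
        int b = rgb[3 * i + 2];
        grey[i] = (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);
    }
}
```

A NEON version would process 8 or 16 pixels per iteration with the same arithmetic in vector lanes.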

Page 9: Wikitude ARM workshop

How to code for NEON

Intrinsics

• C library containing vector data types and functions (intrinsics)

• Code is converted to NEON code by the compiler

• Easier to write and read

• May result in less highly optimized code

Assembler

• Assembler code as you would expect it …

• A bit harder to maintain

• Full control over the optimizations
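A minimal sketch of the intrinsics approach (my own example, with a function name and operation not taken from the slides): a saturating per-byte add that uses NEON when the compiler defines `__ARM_NEON` and falls back to plain C elsewhere, so the same file builds on any target.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Saturating per-byte add: out[i] = min(a[i] + b[i], 255). */
void add_saturate_u8(const uint8_t *a, const uint8_t *b, uint8_t *out, int n)
{
    int i = 0;
#if defined(__ARM_NEON)
    /* NEON path: 16 bytes per iteration. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(out + i, vqaddq_u8(va, vb)); /* saturating vector add */
    }
#endif
    /* Scalar tail, and full fallback on non-NEON targets. */
    for (; i < n; i++) {
        int sum = a[i] + b[i];
        out[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}
```

The intrinsics read almost like C, which is the readability advantage listed above; the price is that register allocation and scheduling are left to the compiler.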

Page 10: Wikitude ARM workshop

Why utilize NEON?

The Computer Vision process is a pipeline containing many functions that can be SIMD-optimized:

1. Recognition

- Convert the camera image to greyscale

- Downsampling

- Analyze every pixel (range) in the image and perform operations (e.g. gradient image)

2. Tracking

- Calculate image similarities, e.g. Sum of Squared Differences (SSD)

- Matrix operations (pose calculation)
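For the tracking step, a scalar reference for the SSD mentioned above can look like the sketch below. This is my own illustration, not the SDK's actual interface; the row-major layout, stride parameters, and fixed 8×8 patch size are assumptions.

```c
#include <stdint.h>

/* Sum of Squared Differences between two 8x8 greyscale patches,
 * each stored row-major with its own row stride (in bytes). */
int ssd_8x8(const uint8_t *p1, int stride1, const uint8_t *p2, int stride2)
{
    int ssd = 0;
    for (int row = 0; row < 8; row++) {
        for (int col = 0; col < 8; col++) {
            int d = (int)p1[row * stride1 + col] - (int)p2[row * stride2 + col];
            ssd += d * d;
        }
    }
    return ssd;
}
```

Like the cross-correlation on the next slide, this is a short loop of independent multiply-accumulates over byte data, which maps directly onto NEON lanes.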

Page 11: Wikitude ARM workshop

Example: Calculate Patch Cross Correlation

Two 8×8 patches:

a[0] b[0] c[0] d[0] e[0] f[0] g[0] h[0]        a’[0] b’[0] c’[0] d’[0] e’[0] f’[0] g’[0] h’[0]
…    …    …    …    …    …    …    …           …     …     …     …     …     …     …     …
a[7] b[7] c[7] d[7] e[7] f[7] g[7] h[7]        a’[7] b’[7] c’[7] d’[7] e’[7] f’[7] g’[7] h’[7]

One step: calculate the squared sum of the patches:

sqrSum = a[0]*a’[0] + … + h[0]*h’[0]
       + …
       + a[7]*a’[7] + … + h[7]*h’[7]

Page 12: Wikitude ARM workshop

Wrapper Logic

int calculateSqrSum (…) {

    int sqrSum;

#if defined(NEON_AVAILABLE)

    if (!(size % 8)) {

        // too complex with assembler

        sqrSum = calculateSqrSum_neon_intrinsics(…);

    } else {

        sqrSum = calculateSqrSum_neon_assembly(…);

    }

#else

    sqrSum = calculateSqrSum_impl(…);

#endif

    return sqrSum;

}

Page 13: Wikitude ARM workshop

C++ Implementation

int sqrSum = 0;

// row base offsets into the two images
int rowPtrBase1 = 0;
int rowPtrBase2 = 0;

// running indices within a row
int rowPtr1 = 0;
int rowPtr2 = 0;

for (int rowIdx = 0; rowIdx < 8; rowIdx++) {

    rowPtr1 = rowPtrBase1;
    rowPtr2 = rowPtrBase2;

    // manually unrolled: 8 multiply-accumulates per row
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];

    rowPtrBase1 += strideWindow;
    rowPtrBase2 += strideTemplate;
}

return sqrSum;

Page 14: Wikitude ARM workshop

Intrinsics

uint8x8_t a_loaded;
uint8x8_t b_loaded;
uint16x8_t res_loaded;
uint32x4_t allSum = vdupq_n_u32(0);

for (int rowIdx = 0; rowIdx < size; rowIdx++) {

    for (uint32_t i = 0; i < size; i += 8) {

        // load row into NEON registers (rows stored contiguously)
        a_loaded = vld1_u8(&(image1[rowIdx * size + i]));
        b_loaded = vld1_u8(&(image2[rowIdx * size + i]));

        // widening multiply: 8-bit × 8-bit → 16-bit lanes
        res_loaded = vmull_u8(a_loaded, b_loaded);

        // pairwise add and accumulate into 32-bit lanes
        allSum = vpadalq_u16(allSum, res_loaded);
    }
}

return vgetq_lane_u32(allSum, 0) + vgetq_lane_u32(allSum, 1) + vgetq_lane_u32(allSum, 2) + vgetq_lane_u32(allSum, 3);

1 row of pixels (8×8 bits)

Pair-wise multiplied vector (8×16 bits)

Pair-wise added and accumulated vector (4×32 bits)
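To make the lane arithmetic concrete, here is a scalar model (my own sketch, not NEON code) of what one `vmull_u8` + `vpadalq_u16` step computes for an 8-byte row: eight widening 16-bit products, then four pairwise 32-bit accumulations.

```c
#include <stdint.h>

/* Scalar model of one vmull_u8 + vpadalq_u16 step:
 *   products[k] = a[k] * b[k]                     (8 x 16-bit lanes)
 *   acc[j]     += products[2j] + products[2j+1]   (4 x 32-bit lanes)
 * The final sum over acc[0..3] equals the plain dot product. */
void mull_padal_model(const uint8_t a[8], const uint8_t b[8], uint32_t acc[4])
{
    uint16_t products[8];
    for (int k = 0; k < 8; k++)
        products[k] = (uint16_t)((uint16_t)a[k] * (uint16_t)b[k]);
    for (int j = 0; j < 4; j++)
        acc[j] += (uint32_t)products[2 * j] + (uint32_t)products[2 * j + 1];
}
```

Because the 16-bit products are immediately widened into 32-bit accumulators, no overflow can occur until far more rows have been accumulated than an 8×8 patch contains.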

Page 15: Wikitude ARM workshop

Intrinsics - Algorithm

Load rows of both patches:

a[0] b[0] c[0] d[0] e[0] f[0] g[0] h[0]        a’[0] b’[0] c’[0] d’[0] e’[0] f’[0] g’[0] h’[0]
a[1] b[1] c[1] d[1] e[1] f[1] g[1] h[1]        a’[1] b’[1] c’[1] d’[1] e’[1] f’[1] g’[1] h’[1]
…    …    …    …    …    …    …    …           …     …     …     …     …     …     …     …

Element-wise widening multiply (vmull_u8):

a’’[0] = a[0]*a’[0]   b’’[0] = b[0]*b’[0]   c’’[0] = c[0]*c’[0]   d’’[0] = d[0]*d’[0]   e’’[0] = e[0]*e’[0]   f’’[0] = f[0]*f’[0]   g’’[0] = g[0]*g’[0]   h’’[0] = h[0]*h’[0]
a’’[1] = a[1]*a’[1]   b’’[1] = b[1]*b’[1]   c’’[1] = c[1]*c’[1]   d’’[1] = d[1]*d’[1]   e’’[1] = e[1]*e’[1]   f’’[1] = f[1]*f’[1]   g’’[1] = g[1]*g’[1]   h’’[1] = h[1]*h’[1]
…

Pairwise add and accumulate (vpadalq_u16):

a’’’ = 0 + a’’[0] + b’’[0]      b’’’ = 0 + c’’[0] + d’’[0]      c’’’ = 0 + e’’[0] + f’’[0]      d’’’ = 0 + g’’[0] + h’’[0]
…
a’’’ = a’’’ + a’’[7] + b’’[7]   b’’’ = b’’’ + c’’[7] + d’’[7]   c’’’ = c’’’ + e’’[7] + f’’[7]   d’’’ = d’’’ + g’’[7] + h’’[7]

sqrSum = a’’’ + b’’’ + c’’’ + d’’’

Page 16: Wikitude ARM workshop

Assembly


#ifdef __aarch64__

#ifdef __APPLE__

#define IMAGE_LINE_0 v16

#else

#define IMAGE_LINE_0 V16.8B

#endif

#else

#define IMAGE_LINE_0 d16

#endif

LOAD_LINE IMAGE_LINE_0, 0

LOAD_LINE IMAGE_LINE_1, 1

CALC_LINE IMAGE_LINE_0, PATCH_LINE_0, 0

CALC_LINE IMAGE_LINE_1, PATCH_LINE_1, 1

LOAD_LINE IMAGE_LINE_2, 2

LOAD_LINE IMAGE_LINE_3, 3

CALC_LINE IMAGE_LINE_2, PATCH_LINE_2, 2

CALC_LINE IMAGE_LINE_3, PATCH_LINE_3, 3

Page 17: Wikitude ARM workshop

Assembly Macros (Essential Parts)


.macro LOAD_LINE IMAGE_LINE line

#ifdef __aarch64__

#ifdef __APPLE__

LD1.8B { \IMAGE_LINE }, [IMAGE_PTR], STRIDE

#else

LD1 { \IMAGE_LINE }, [IMAGE_PTR], STRIDE

#endif

#else

vld1.u8 { \IMAGE_LINE }, [IMAGE_PTR], STRIDE

#endif

.endm

Page 18: Wikitude ARM workshop

Runtimes

Test set:

• Nexus 4

• 1000 patches, 8×8 pixels each

• 4 test runs, calculate average runtime

                   C++        Intrinsics   Assembly
Absolute runtime   15.49 ms   12.84 ms     7.89 ms
Relative runtime   100%       82.89%       50.94%
Speedup            0%         17.11%       49.06%
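A sketch of how such averages can be obtained. This is an illustrative harness only: the workload function stands in for the patch correlation, and the harness uses standard C `clock()` rather than whatever timing mechanism the benchmark above actually used.

```c
#include <time.h>

/* Illustrative workload standing in for the patch correlation. */
static volatile int sink;
static void workload(void)
{
    int s = 0;
    for (int i = 0; i < 100000; i++)
        s += i * i;
    sink = s;
}

/* Run the workload `runs` times and return the average runtime in ms. */
double average_runtime_ms(int runs)
{
    clock_t start = clock();
    for (int r = 0; r < runs; r++)
        workload();
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / runs;
}
```

Averaging over several runs, as the test set above does, smooths out scheduler noise and thermal variation on a mobile device.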

Page 19: Wikitude ARM workshop

What should run on NEON?

Just because you can run an algorithm on NEON doesn’t mean you should …

1. Analyze your bottlenecks

- Use profiling!

- Does it make sense to optimize the bottlenecks?

2. Analyze which bottlenecks can be optimized

3. Is the current implementation already optimized?

- Check for flaws in the code, e.g. copying too much data

4. Build prototypes with NEON intrinsics

5. If still not fast enough, use assembler

Page 20: Wikitude ARM workshop

Other ways to optimize

OpenCL

• Run code on the GPU

• Low-level API framework standardized by the Khronos Group

• Similar considerations: analyze your code and optimization potential first!

• Not (widely) supported on mobile platforms yet

Page 21: Wikitude ARM workshop

Martin Lechner, CTO

[email protected]

Thank you!
