Utilizing NEON for Accelerated Computer Vision Processing in Augmented Reality Scenarios

Martin Lechner, CTO

Wikitude ARM workshop


Page 1: Wikitude ARM workshop

Martin Lechner, CTO

Utilizing NEON for Accelerated Computer Vision

Processing in Augmented Reality Scenarios

Page 2: Wikitude ARM workshop

Who is Wikitude?


Wikitude is the world's leading Augmented Reality ecosystem

● World-class team & technology

● A large and active developer community

● Leading developer and editorial tools for implementing AR applications

● High-profile monetization and distribution network

● Makers of the AR-Standard “ARML 2.0”

45,000+ registered AR developers

1,500+ AR apps

100+ countries

Page 3: Wikitude ARM workshop

Wikitude’s Main Products


Wikitude SDK

Studio

Cloud Recognition

Targets API

Publishing App

Page 4: Wikitude ARM workshop

Powered by World-Class AR Technology

World-class in-house IP bundled into a well-managed and proven product suite, plus AR content creation

Page 5: Wikitude ARM workshop

Wikitude Computer Vision

● 2D Natural Feature Tracking

● Tracking in 6 Degrees of Freedom

● 3D scene and 3D object recognition and tracking

● Fully integrated in the existing Wikitude SDK and product suite

● Focus on both indoor and outdoor scenarios

● Improved robustness for

- Changing lighting conditions

- Moving objects

- Low-textured environments

Page 6: Wikitude ARM workshop

Wikitude Computer Vision


● Optimized for mobile computing

- Mobile CPU Architectures (ARMv6, ARMv7, ARMv8)

- Vector Processing/SIMD (ARM NEON™)

- OpenGL ES (ARM Mali™)

- GPU Compute/OpenCL (ARM Mali)


Page 8: Wikitude ARM workshop

Why utilize NEON in Image Processing?

• Development time well spent!

- Most state-of-the-art mobile devices run on chips based on the ARMv7 or ARMv8 architecture

- Most of them include the NEON instruction set

• Image processing: a perfect match for SIMD

- Computationally expensive on the CPU

- Can run in parallel

- Simple operations

- The same operation is applied to multiple data sets (pixels or pixel ranges)
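As a concrete illustration of this pattern, consider greyscale conversion (one of the pipeline steps mentioned later in this deck). The sketch below is my own scalar C example, not Wikitude SDK code; the interleaved-RGB layout and the BT.601-style fixed-point weights are assumptions. Every pixel runs the identical small operation on independent data, which is exactly what SIMD accelerates.

```c
#include <stdint.h>

/* Scalar greyscale conversion: y = (77*r + 150*g + 29*b) >> 8.
 * The weights sum to 256, so white maps to 255 and black to 0.
 * The same operation is applied to every pixel independently,
 * the SIMD-friendly pattern described above. */
void rgb_to_grey(const uint8_t *rgb, uint8_t *grey, int numPixels)
{
    for (int i = 0; i < numPixels; i++) {
        int r = rgb[3 * i + 0];
        int g = rgb[3 * i + 1];
        int b = rgb[3 * i + 2];
        grey[i] = (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);
    }
}
```

A NEON version would process 8 or 16 pixels per iteration with the same arithmetic in vector lanes.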

Page 9: Wikitude ARM workshop

How to code for NEON

Intrinsics

• C library containing vector data types and functions (intrinsics)

• Code is converted to NEON code by the compiler

• Easier to write and read

• May result in less highly optimized code

Assembler

• Assembler code as you would expect it …

• A bit harder to maintain

• Full control over the optimizations
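A minimal sketch of the intrinsics approach (my own example, with a function name and operation not taken from the slides): a saturating per-byte add that uses NEON when the compiler defines `__ARM_NEON` and falls back to plain C elsewhere, so the same file builds on any target.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Saturating per-byte add: out[i] = min(a[i] + b[i], 255). */
void add_saturate_u8(const uint8_t *a, const uint8_t *b, uint8_t *out, int n)
{
    int i = 0;
#if defined(__ARM_NEON)
    /* NEON path: 16 bytes per iteration. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(out + i, vqaddq_u8(va, vb)); /* saturating vector add */
    }
#endif
    /* Scalar tail, and full fallback on non-NEON targets. */
    for (; i < n; i++) {
        int sum = a[i] + b[i];
        out[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}
```

The intrinsics read almost like C, which is the readability advantage listed above; the price is that register allocation and scheduling are left to the compiler.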

Page 10: Wikitude ARM workshop

Why utilize NEON?

The Computer Vision process is a pipeline containing many functions that can be SIMD-optimized:

1. Recognition

- Convert the camera image to greyscale

- Downsampling

- Analyze every pixel (range) in the image and perform operations (e.g. gradient image)

2. Tracking

- Calculate image similarities, e.g. Sum of Squared Differences (SSD)

- Matrix operations (pose calculation)
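For the tracking step, a scalar reference for the SSD mentioned above can look like the sketch below. This is my own illustration, not the SDK's actual interface; the row-major layout, stride parameters, and fixed 8×8 patch size are assumptions.

```c
#include <stdint.h>

/* Sum of Squared Differences between two 8x8 greyscale patches,
 * each stored row-major with its own row stride (in bytes). */
int ssd_8x8(const uint8_t *p1, int stride1, const uint8_t *p2, int stride2)
{
    int ssd = 0;
    for (int row = 0; row < 8; row++) {
        for (int col = 0; col < 8; col++) {
            int d = (int)p1[row * stride1 + col] - (int)p2[row * stride2 + col];
            ssd += d * d;
        }
    }
    return ssd;
}
```

Like the cross-correlation on the next slide, this is a short loop of independent multiply-accumulates over byte data, which maps directly onto NEON lanes.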

Page 11: Wikitude ARM workshop

Example: Calculate Patch Cross Correlation

Two 8×8 patches:

a[0] b[0] c[0] d[0] e[0] f[0] g[0] h[0]        a’[0] b’[0] c’[0] d’[0] e’[0] f’[0] g’[0] h’[0]
…    …    …    …    …    …    …    …           …     …     …     …     …     …     …     …
a[7] b[7] c[7] d[7] e[7] f[7] g[7] h[7]        a’[7] b’[7] c’[7] d’[7] e’[7] f’[7] g’[7] h’[7]

One step: calculate the squared sum of the patches:

sqrSum = a[0]*a’[0] + … + h[0]*h’[0]
       + …
       + a[7]*a’[7] + … + h[7]*h’[7]

Page 12: Wikitude ARM workshop

Wrapper Logic

int calculateSqrSum (…) {

    int sqrSum;

#if defined(NEON_AVAILABLE)

    if (!(size % 8)) {

        // too complex with assembler

        sqrSum = calculateSqrSum_neon_intrinsics(…);

    } else {

        sqrSum = calculateSqrSum_neon_assembly(…);

    }

#else

    sqrSum = calculateSqrSum_impl(…);

#endif

    return sqrSum;

}

Page 13: Wikitude ARM workshop

C++ Implementation

int sqrSum = 0;

// row base offsets into the two images
int rowPtrBase1 = 0;
int rowPtrBase2 = 0;

// running indices within a row
int rowPtr1 = 0;
int rowPtr2 = 0;

for (int rowIdx = 0; rowIdx < 8; rowIdx++) {

    rowPtr1 = rowPtrBase1;
    rowPtr2 = rowPtrBase2;

    // manually unrolled: 8 multiply-accumulates per row
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];
    sqrSum += img1[rowPtr1++] * img2[rowPtr2++];

    rowPtrBase1 += strideWindow;
    rowPtrBase2 += strideTemplate;
}

return sqrSum;

Page 14: Wikitude ARM workshop

Intrinsics

uint8x8_t a_loaded;
uint8x8_t b_loaded;
uint16x8_t res_loaded;
uint32x4_t allSum = vdupq_n_u32(0);

for (int rowIdx = 0; rowIdx < size; rowIdx++) {

    for (uint32_t i = 0; i < size; i += 8) {

        // load row into NEON registers (rows stored contiguously)
        a_loaded = vld1_u8(&(image1[rowIdx * size + i]));
        b_loaded = vld1_u8(&(image2[rowIdx * size + i]));

        // widening multiply: 8-bit × 8-bit → 16-bit lanes
        res_loaded = vmull_u8(a_loaded, b_loaded);

        // pairwise add and accumulate into 32-bit lanes
        allSum = vpadalq_u16(allSum, res_loaded);
    }
}

return vgetq_lane_u32(allSum, 0) + vgetq_lane_u32(allSum, 1) + vgetq_lane_u32(allSum, 2) + vgetq_lane_u32(allSum, 3);

1 row of pixels (8×8 bits)

Pair-wise multiplied vector (8×16 bits)

Pair-wise added and accumulated vector (4×32 bits)
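To make the lane arithmetic concrete, here is a scalar model (my own sketch, not NEON code) of what one `vmull_u8` + `vpadalq_u16` step computes for an 8-byte row: eight widening 16-bit products, then four pairwise 32-bit accumulations.

```c
#include <stdint.h>

/* Scalar model of one vmull_u8 + vpadalq_u16 step:
 *   products[k] = a[k] * b[k]                     (8 x 16-bit lanes)
 *   acc[j]     += products[2j] + products[2j+1]   (4 x 32-bit lanes)
 * The final sum over acc[0..3] equals the plain dot product. */
void mull_padal_model(const uint8_t a[8], const uint8_t b[8], uint32_t acc[4])
{
    uint16_t products[8];
    for (int k = 0; k < 8; k++)
        products[k] = (uint16_t)((uint16_t)a[k] * (uint16_t)b[k]);
    for (int j = 0; j < 4; j++)
        acc[j] += (uint32_t)products[2 * j] + (uint32_t)products[2 * j + 1];
}
```

Because the 16-bit products are immediately widened into 32-bit accumulators, no overflow can occur until far more rows have been accumulated than an 8×8 patch contains.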

Page 15: Wikitude ARM workshop

Intrinsics - Algorithm

Load rows of both patches:

a[0] b[0] c[0] d[0] e[0] f[0] g[0] h[0]        a’[0] b’[0] c’[0] d’[0] e’[0] f’[0] g’[0] h’[0]
a[1] b[1] c[1] d[1] e[1] f[1] g[1] h[1]        a’[1] b’[1] c’[1] d’[1] e’[1] f’[1] g’[1] h’[1]
…    …    …    …    …    …    …    …           …     …     …     …     …     …     …     …

Element-wise widening multiply (vmull_u8):

a’’[0] = a[0]*a’[0]   b’’[0] = b[0]*b’[0]   c’’[0] = c[0]*c’[0]   d’’[0] = d[0]*d’[0]   e’’[0] = e[0]*e’[0]   f’’[0] = f[0]*f’[0]   g’’[0] = g[0]*g’[0]   h’’[0] = h[0]*h’[0]
a’’[1] = a[1]*a’[1]   b’’[1] = b[1]*b’[1]   c’’[1] = c[1]*c’[1]   d’’[1] = d[1]*d’[1]   e’’[1] = e[1]*e’[1]   f’’[1] = f[1]*f’[1]   g’’[1] = g[1]*g’[1]   h’’[1] = h[1]*h’[1]
…

Pairwise add and accumulate (vpadalq_u16):

a’’’ = 0 + a’’[0] + b’’[0]      b’’’ = 0 + c’’[0] + d’’[0]      c’’’ = 0 + e’’[0] + f’’[0]      d’’’ = 0 + g’’[0] + h’’[0]
…
a’’’ = a’’’ + a’’[7] + b’’[7]   b’’’ = b’’’ + c’’[7] + d’’[7]   c’’’ = c’’’ + e’’[7] + f’’[7]   d’’’ = d’’’ + g’’[7] + h’’[7]

sqrSum = a’’’ + b’’’ + c’’’ + d’’’

Page 16: Wikitude ARM workshop

Assembly


#ifdef __aarch64__

#ifdef __APPLE__

#define IMAGE_LINE_0 v16

#else

#define IMAGE_LINE_0 V16.8B

#endif

#else

#define IMAGE_LINE_0 d16

#endif

LOAD_LINE IMAGE_LINE_0, 0

LOAD_LINE IMAGE_LINE_1, 1

CALC_LINE IMAGE_LINE_0, PATCH_LINE_0, 0

CALC_LINE IMAGE_LINE_1, PATCH_LINE_1, 1

LOAD_LINE IMAGE_LINE_2, 2

LOAD_LINE IMAGE_LINE_3, 3

CALC_LINE IMAGE_LINE_2, PATCH_LINE_2, 2

CALC_LINE IMAGE_LINE_3, PATCH_LINE_3, 3

Page 17: Wikitude ARM workshop

Assembly Macros (Essential Parts)


.macro LOAD_LINE IMAGE_LINE line

#ifdef __aarch64__

#ifdef __APPLE__

LD1.8B { \IMAGE_LINE }, [IMAGE_PTR], STRIDE

#else

LD1 { \IMAGE_LINE }, [IMAGE_PTR], STRIDE

#endif

#else

vld1.u8 { \IMAGE_LINE }, [IMAGE_PTR], STRIDE

#endif

.endm

Page 18: Wikitude ARM workshop

Runtimes

Test set:

• Nexus 4

• 1000 patches, 8×8 pixels each

• 4 test runs, calculate average runtime

                   C++        Intrinsics   Assembly
Absolute runtime   15.49 ms   12.84 ms     7.89 ms
Relative runtime   100%       82.89%       50.94%
Speedup            0%         17.11%       49.06%
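A sketch of how such averages can be obtained. This is an illustrative harness only: the workload function stands in for the patch correlation, and the harness uses standard C `clock()` rather than whatever timing mechanism the benchmark above actually used.

```c
#include <time.h>

/* Illustrative workload standing in for the patch correlation. */
static volatile int sink;
static void workload(void)
{
    int s = 0;
    for (int i = 0; i < 100000; i++)
        s += i * i;
    sink = s;
}

/* Run the workload `runs` times and return the average runtime in ms. */
double average_runtime_ms(int runs)
{
    clock_t start = clock();
    for (int r = 0; r < runs; r++)
        workload();
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / runs;
}
```

Averaging over several runs, as the test set above does, smooths out scheduler noise and thermal variation on a mobile device.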

Page 19: Wikitude ARM workshop

What should run on NEON?

Just because you can run an algorithm on NEON doesn’t mean you should …

1. Analyze your bottlenecks

- Use profiling!

- Does it make sense to optimize the bottlenecks?

2. Analyze which bottlenecks can be optimized

3. Is the current implementation already optimized?

- Check for flaws in the code, e.g. copying too much data

4. Build prototypes with NEON intrinsics

5. If still not fast enough, use assembler

Page 20: Wikitude ARM workshop

Other ways to optimize

OpenCL

• Run code on the GPU

• Low-level API framework standardized by the Khronos Group

• Similar considerations: analyze your code and optimization potential first!

• Not (widely) supported on mobile platforms yet

Page 21: Wikitude ARM workshop

Martin Lechner, CTO

[email protected]

Thank you!
