"Designing and Selecting Instruction Sets for Vision," a Presentation From Cadence

Copyright © 2015 Cadence Design Systems

Chris Rowen

12 May 2015

Designing and Selecting Instruction Sets for Vision


Cadence in a Nutshell

• A top design automation supplier:
  • analog, digital and system verification tools
  • interface, processor and protocol verification IP
• Leading supplier of DSP and other data-rich embedded processing cores and software: Xtensa® Innovation Platform℠
• IVP family: advanced imaging and vision DSP cores with almost 1000 library functions and applications
• Massively parallel SIMD/VLIW processors with automated configurability and extensibility of ISA, memory, and interface
• One of Fortune Magazine’s “Top 100 Places to Work”


Outline

• The Vision Performance Challenge
• The Vision Instruction Set Puzzle
• Application Diversity Drives ISA Flexibility
• The Hardwired Accelerator Problem
• Examples:
  • Pedestrian Detection
  • Lane Departure Warning
  • Convolutional Neural Network
• Wrap-up


The Vision Performance Challenge

ADAS processing requirements are high
VGA: approaching 100 GOPS

Source: SoC for car navigation systems with a 53.3 GOPS image recognition engine, Hot Chips 21 (2009)


The Vision Performance Challenge

1080p60 ADAS is a teraOp problem

• Complexity grows an order of magnitude for full HD processing
• Accelerating algorithmic sophistication
• Scaling best addressed by
  • more parallelism
  • application-specific optimizations
  • architectural enhancements
  • move to advanced process nodes
• A good architecture
  • accelerates core functions
  • supports a wide range of application-specific optimizations

[Figure: Computation increase with resolution (brute force approach). X-axis: QVGA, VGA, HD, Full HD. Y-axis: unlabeled scale, 0 to 30.]
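As a rough illustration of the brute-force scaling above, the sketch below compares pixel counts across the four resolutions and scales the "approaching 100 GOPS at VGA" figure linearly with pixel count. The frame sizes in `RESOLUTIONS`, the linear-in-pixels cost model, and the fixed frame rate are assumptions made for illustration; they are not data from the presentation.

```python
# Assumed frame sizes (width, height) for the resolutions named on the slide.
RESOLUTIONS = {
    "QVGA": (320, 240),
    "VGA": (640, 480),
    "HD": (1280, 720),
    "Full HD": (1920, 1080),
}


def pixels(name):
    """Number of pixels in one frame of the named resolution."""
    width, height = RESOLUTIONS[name]
    return width * height


def relative_load(name, baseline="QVGA"):
    """Pixel count relative to the baseline (assumes cost is linear in pixels)."""
    return pixels(name) / pixels(baseline)


def scaled_gops(gops_at_vga, name):
    """Scale a VGA compute budget to another resolution at the same frame rate."""
    return gops_at_vga * relative_load(name, baseline="VGA")


if __name__ == "__main__":
    for name in RESOLUTIONS:
        print(f"{name:8s} {relative_load(name):5.2f}x QVGA, "
              f"about {scaled_gops(100.0, name):6.1f} GOPS")
```

Under these assumptions pixel count alone grows 27 times from QVGA to Full HD. Higher frame rates and more sophisticated algorithms add to that, which is the direction of the slide's teraOp estimate for 1080p60.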


The Vision Instruction Set Puzzle

Key dimensions:
• Local memory bandwidth
• Memory hierarchy for data streaming
• Data types
• SIMD/vector organization
• Scalar operation bandwidth
• Instruction issue parallelism (VLIW)
• Vision-specific operations
• Multi-processor support

What to look for:
1. High local memory bandwidth
2. Effective latency hiding for DDR access
3. Data types: 8b, 16b, 32b fixed-point, floating point
4. Sustained ops/cycle from combination of VLIW and SIMD
5. Vision-specific operations: 2D data access, histogram, convolution, search, non-linear functions
6. Automatic compiler inference of vectors, complex operations
7. Scale-up with custom operations
8. Scale-up with parallel cores
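Items 3 and 4 interact: the number of SIMD lanes depends on the element width, and the peak operation rate is the lane count multiplied by the number of vector issue slots. The sketch below shows that arithmetic for a hypothetical machine. The 512-bit vector width and the three vector slots are illustrative assumptions, not figures for any Cadence core.

```python
def simd_lanes(vector_bits, element_bits):
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def peak_ops_per_cycle(vector_bits, element_bits, vector_issue_slots):
    """Upper bound on element operations per cycle: lanes times VLIW vector slots."""
    return simd_lanes(vector_bits, element_bits) * vector_issue_slots


if __name__ == "__main__":
    # Hypothetical machine: 512-bit vectors, 3 vector issue slots per instruction.
    for element_bits in (8, 16, 32):
        peak = peak_ops_per_cycle(512, element_bits, 3)
        print(f"{element_bits:2d}-bit elements: peak {peak} ops/cycle")
```

This is only a peak figure. The sustained rate the list asks for also depends on how often the compiler can fill every slot and every lane.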


Application Diversity Drives ISA Flexibility

• Real design is full of trade-offs:
  • Memory reference vs. ALU ops
  • Multiplies vs. other ALU ops
  • Mix of scalar vs. vector ops
  • Vector computation vs. data reorganization
• Measured a set of 45 major kernels and applications in vision and imaging
• Look at key ratios to assess trends

Functions include:
• Face detection
• Fast9
• SURF
• Oriented FAST and Rotated BRIEF feature detector (ORB)
• Harris Corners
• H.265 Motion Compensation
• Haar Cascade and Classifiers
• Optical Flow
• Affine transform
• Perspective Warp
• Various Filters: bilateral, denoising
• High Dynamic Range
• Color Space and format conversions
• Histogram equalization


• Typically several ALU ops per load operation
• Wide range of ALU : Load/store ratio (1:2 to 5:1)
• Many important functions don’t do multiplies
• A fraction have very heavy multiply usage, e.g. convolutions
• ISA should handle wide range of ratios efficiently


[Figure: Memory ops vs ALU ops. X-axis: Vector ALU Ops per Instruction (0 to 2.5). Y-axis: Vector Load/Store Ops per Instruction (0 to 1.6).]

[Figure: Multiply ops vs Other ALU ops. X-axis: Multiply Vector Ops per Instruction (0 to 1). Y-axis: Other ALU Vector Ops per Instruction (0 to 2.5).]
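To see why a wide spread of ALU : load/store ratios stresses a fixed instruction format, the sketch below uses a deliberately simple issue model: each slot issues one operation per cycle, and a loop iteration takes as many cycles as its most oversubscribed slot type. The two slot configurations are hypothetical, and the model ignores dependencies, latency, and memory stalls.

```python
import math


def cycles_per_iteration(alu_ops, mem_ops, alu_slots, mem_slots):
    """Cycles for one loop iteration when each slot issues one op per cycle."""
    return max(math.ceil(alu_ops / alu_slots), math.ceil(mem_ops / mem_slots))


def slot_utilization(alu_ops, mem_ops, alu_slots, mem_slots):
    """Fraction of issue slots doing useful work over one iteration."""
    cycles = cycles_per_iteration(alu_ops, mem_ops, alu_slots, mem_slots)
    return (alu_ops + mem_ops) / (cycles * (alu_slots + mem_slots))


if __name__ == "__main__":
    # Kernels at the two extremes of the ratio range quoted above.
    kernels = {"ALU:mem = 1:2": (1, 2), "ALU:mem = 5:1": (5, 1)}
    # Hypothetical slot mixes: (ALU slots, load/store slots).
    machines = {"3 ALU + 1 mem": (3, 1), "2 ALU + 2 mem": (2, 2)}
    for machine, (alu_slots, mem_slots) in machines.items():
        for kernel, (alu_ops, mem_ops) in kernels.items():
            used = slot_utilization(alu_ops, mem_ops, alu_slots, mem_slots)
            print(f"{machine}, {kernel}: {used:.0%} of slots used")
```

In this toy model neither slot mix is best for both kernels, which is one way to read the requirement that the ISA handle a wide range of ratios efficiently.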


• A successful architecture maximizes the fraction of kernels that can be vectorized
• A small number of functions may still use scalar ops heavily
• On-the-fly data reorganization may be important in a few kernels
• ALU : Reorg ratio varies from 10:1 to 1:1
• Efficient data reorganization boosts benefit of vectorization


[Figure: Scalar ops vs Vector ops. X-axis: Scalar Ops per Instruction (0 to 3.5). Y-axis: Vector Ops per Instruction (0 to 4).]

[Figure: ALU ops vs Data Reorganization ops. X-axis: ALU Vector Ops per Instruction (0 to 2.5). Y-axis: Data Reorg Vector Ops per Instruction (0 to 1).]
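One common kind of data reorganization is converting interleaved pixel data into per-channel planes, so that each channel can be processed as a contiguous vector. The sketch below shows the transformation in plain Python. The function names are hypothetical, and a vector ISA would typically do this with shuffle or select operations rather than strided slicing.

```python
def deinterleave(samples, channels):
    """Split interleaved samples (c0, c1, ..., c0, c1, ...) into one list per channel."""
    if len(samples) % channels != 0:
        raise ValueError("sample count must be a multiple of the channel count")
    return [samples[c::channels] for c in range(channels)]


def interleave(planes):
    """Inverse operation: merge per-channel lists back into interleaved order."""
    return [value for group in zip(*planes) for value in group]


if __name__ == "__main__":
    rgb = [10, 20, 30, 11, 21, 31]  # two RGB pixels, interleaved
    red, green, blue = deinterleave(rgb, 3)
    print(red, green, blue)
    print(interleave([red, green, blue]))
```

When this step is cheap, more of the kernel can run on full vectors. That is the sense in which efficient reorganization boosts the benefit of vectorization.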


The Hardwired Accelerator Problem

• Certain tasks beg for immense performance, so hardwired functions are tempting
• Issue 1: Changes in algorithms are hard to anticipate. Hardwired functions are often under-used on deployed systems
• Issue 2: Hardwired functions are difficult to control from software: operation start/stop, memory management, context switching
• Techniques to improve hardwired functions:
  • Flexible chaining of interface to hardwired blocks
  • More reusable primitives
  • Direct incorporation into processor ISA
  • Instruction-mapped instead of memory-mapped

[Figure: block diagram with Processor and Accelerator.]


Pedestrian Detection Application Example

Key Functions                                     % of Processing
Pyramid generation                                10%
Gradient magnitude and orientation calculation    25%
Histogram of Gradients calculation                25%
Histogram normalization                           5%
SVM Classifier                                    35%

Operations and precisions involved:
• Fractional co-ordinate calculations (16b co-ordinates)
• Pixel interpolations (8b values)
• Finite differences or Sobel (8b pixels)
• Sum of squares (8/16b gradients)
• Square root (16/32b values)
• Divide (8/16b values)
• Arctan (8/16b values)
• Magnitude projection on bins (16b values)
• Weighted histograms (16b values)
• L1 (sum) or L2 (sum of squares) (16b values)
• Square root (32b values)
• Divide (16b values)
• Multiply accumulate (16b values)

A good architecture supports a wide variety of operations and precisions
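To make the operation mix concrete, here is a minimal sketch of the gradient, histogram, and normalization steps for one cell of pixels. It covers finite differences, sum of squares, square root, arctan, magnitude projection onto bins, and L2 normalization. It uses floating point for readability, whereas the slide lists 8/16/32-bit fixed-point precisions. The 9-bin unsigned-orientation layout and the function names are assumptions, not details from the presentation.

```python
import math


def hog_cell_histogram(cell, bins=9):
    """Orientation histogram over the interior pixels of one cell (list of rows)."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins  # unsigned orientation, 0 to 180 degrees (assumed)
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # finite differences
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.sqrt(gx * gx + gy * gy)  # sum of squares, square root
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # arctan
            position = angle / bin_width
            low = math.floor(position)
            frac = position - low
            # Project the magnitude onto the two nearest bins (weighted histogram).
            hist[low % bins] += magnitude * (1.0 - frac)
            hist[(low + 1) % bins] += magnitude * frac
    return hist


def l2_normalize(hist, eps=1e-6):
    """Divide by the L2 norm; eps guards against an all-zero histogram."""
    norm = math.sqrt(sum(v * v for v in hist) + eps * eps)
    return [v / norm for v in hist]


if __name__ == "__main__":
    # 4x4 cell with a horizontal intensity ramp: every gradient points along x.
    cell = [[0, 10, 20, 30] for _ in range(4)]
    print(l2_normalize(hog_cell_histogram(cell)))
```

A fixed-point version would replace `math.sqrt`, `math.atan2`, and the divide with table lookups or iterative approximations, which is where the variety of operations and precisions comes in.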


Pedestrian Detection Application Example

• Camera system parameters (resolution, field of view, focal length) determine person height vs. distance
• Dynamically trade off detection latency for far-away pedestrians based on vehicle speed: higher resolution levels may not need a high frame rate!

Ref: “Pedestrian Detection: An Evaluation of the State of the Art”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume: 34, Issue: 4

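[Figure: Detection resolution. Pinhole camera geometry with labels h, f, D, H and fov.]

Using pinhole camera model:

$$h_{pixels} = \frac{H}{D}\, f_{pixels} = \frac{H}{D} \cdot \frac{ImageHeight}{\tan\left(\frac{fov}{2}\right)}$$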
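The first form of the pinhole relation is easy to evaluate directly. The sketch below computes the pixel height of a person at a given distance, and inverts the relation to find the farthest distance at which a person still spans a minimum number of pixels. The focal length in pixels is taken as an input, and the example values (1.8 m person, 1000-pixel focal length, 50-pixel minimum) are illustrative assumptions, not values from the presentation.

```python
def pedestrian_height_pixels(person_height_m, distance_m, focal_length_pixels):
    """h_pixels = (H / D) * f_pixels for a pinhole camera."""
    return person_height_m / distance_m * focal_length_pixels


def max_detection_distance_m(person_height_m, focal_length_pixels, min_height_pixels):
    """Largest distance D at which the person is still min_height_pixels tall."""
    return person_height_m * focal_length_pixels / min_height_pixels


if __name__ == "__main__":
    focal = 1000.0  # assumed focal length in pixels
    for distance in (10.0, 30.0, 60.0):
        height = pedestrian_height_pixels(1.8, distance, focal)
        print(f"{distance:5.1f} m -> {height:6.1f} pixels tall")
    print(f"50-pixel detector reaches {max_detection_distance_m(1.8, focal, 50.0):.1f} m")
```

Because pixel height falls off as 1/D, distant pedestrians are found only at the finest resolution levels, and those levels are the ones where a lower frame rate may be acceptable.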


Lane Departure Warning Processing Functions

[Figure: processing pipeline. Stages: Camera Input, Preprocessing, Feature extraction, Postprocessing, Tracking. Other blocks: Road and vehicle model (Constant curvature, Parabolic); Vehicle data (speed, steering). Diagram label: High computations.]

Functions listed under the stages, in slide order:
• Color conversion
• Noise removal
• Contrast enhancement
• Steerable/Gabor filters
• Image segmentation (Intensity, color)
• Pyramid generation
• Perspective warp
• Edge detection (Sobel, Canny)
• Edge magnitude and orientation
• Edge directional response
• Thresholding
• Morphology
• Corner detection (Harris, Fast, ..)
• Hough transform
• Neural network
• Template matching and updating
• Road model fitting
• Outlier removal
• Connected components
• Kalman filter

A wide variety of functions are used and must be well supported


Cascade Classifier Application Example

• Parallelism drops quickly in traditional SIMD implementation after early stages of cascade
• Need architectural approach to exploit available parallelism:
  • Distributed detection windows
  • Distributed features within a detection window
• Switch type of parallelism as you progress through the cascade:
  • parallelize over pixels in window
  • parallelize over windows
  • parallelize over features
• A good architecture supports many types of parallelism

[Figure: Parallelism in conventional SIMD processor in different cascade stages. X-axis: Cascade Stages (1 to 22). Y-axis: Conventional SIMD parallelism (0 to 1).]
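The idea of switching the type of parallelism can be sketched with a toy utilization model. When a fixed group of windows is processed in lock step, rejected windows leave idle lanes. When the features of one surviving window are spread across lanes instead, utilization depends only on how well the feature count fills whole vectors. The lane count, survivor counts, and per-stage feature counts below are made-up illustrative numbers, and the model is an assumption rather than a description of any particular processor.

```python
import math


def window_parallel_utilization(surviving_windows, lanes):
    """Lock-step SIMD over a fixed group of windows: rejected windows idle their lanes."""
    return surviving_windows / lanes


def feature_parallel_utilization(features_in_stage, lanes):
    """SIMD over the features of one surviving window, packed into whole vectors."""
    vectors = math.ceil(features_in_stage / lanes)
    return features_in_stage / (vectors * lanes)


def best_strategy(surviving_windows, features_in_stage, lanes):
    """Pick whichever mapping keeps more lanes busy in this stage."""
    by_windows = window_parallel_utilization(surviving_windows, lanes)
    by_features = feature_parallel_utilization(features_in_stage, lanes)
    if by_windows >= by_features:
        return ("windows", by_windows)
    return ("features", by_features)


if __name__ == "__main__":
    lanes = 32
    # Hypothetical cascade: (windows of the group still alive, features in the stage).
    stages = [(32, 3), (16, 9), (6, 16), (2, 50), (1, 100)]
    for index, (alive, features) in enumerate(stages, start=1):
        strategy, used = best_strategy(alive, features, lanes)
        print(f"stage {index}: parallelize over {strategy}, {used:.0%} of lanes busy")
```

With these numbers the better mapping flips from windows to features partway through the cascade, which mirrors the slide's point that no single form of parallelism serves every stage.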


Convolutional Neural Network (CNN) Example

• Key computational kernels in CNN are
  • Convolution (Highest cost)
  • Subsampling (box filter, max pooling)
  • Non-linear function (Tanh, Sigmoid)
• For practical implementations a range of tradeoffs are possible for convolutions
  • Precision tradeoffs
  • Separable kernels
  • Symmetric kernels

A good architecture supports a range of options for fast convolutions

Processing flow: Input → Convolution → Non-linearity → Sub-sampling → Repeat … → Classifier → Result (face identified)

• Convolution: models locally receptive visual cortex cells by sampling a small region and generating features
• Non-linearity: a function such as tanh models the on-off behavior of neurons
• Subsampling: models cells with larger receptive fields (provides local invariance)
• Repeat: the previous steps are repeated for each neural network layer
• Classifier: final classifier stage
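The separable-kernel tradeoff can be shown with a small sketch: when a 2D kernel is the outer product of a column vector and a row vector, filtering rows and then columns gives the same result as the direct 2D filter, with roughly kh + kw multiply-accumulates per output instead of kh × kw. The code computes correlation (no kernel flip), as is usual in CNNs, over the valid region only. The function names and the Sobel-like test kernel are illustrative assumptions.

```python
def outer(col, row):
    """Build the 2D kernel col * row^T."""
    return [[c * r for r in row] for c in col]


def conv2d_valid(image, kernel):
    """Direct 2D correlation over the valid region; image and kernel are lists of rows."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def conv2d_separable(image, col, row):
    """Same result for kernel = outer(col, row): filter along rows, then along columns."""
    kh, kw = len(col), len(row)
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    horizontal = [[sum(line[x + j] * row[j] for j in range(kw))
                   for x in range(out_w)]
                  for line in image]
    return [[sum(horizontal[y + i][x] * col[i] for i in range(kh))
             for x in range(out_w)]
            for y in range(out_h)]


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    col, row = [1, 2, 1], [1, 0, -1]  # Sobel-like separable kernel
    direct = conv2d_valid(image, outer(col, row))
    separable = conv2d_separable(image, col, row)
    print(direct == separable, direct)
```

For a 3×3 kernel the saving is modest (9 versus about 6 multiply-accumulates per output), but it grows with kernel size. Most learned CNN kernels are not exactly separable, so in practice this is an approximation or design-time tradeoff, not a free optimization.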


Wrap-Up

How to Choose a Vision Processor ISA:
• Measure on your real application; don’t just look at paper feeds-and-speeds
• Expect massive parallelism
• Look for balance and versatility in available operations
• Consider not just raw ops rate, but also ability to handle complex data organization and on-the-fly reorganization
• The compiler is part of the ISA: look at efficiency, robustness and analysis tools
• Judge hardwired accelerators by reusability on possible future applications
• Look for multi-processor support in hardware and software


Wish List

• More readily-available imaging/vision source code, including OpenVX graphs
• Open reference video streams for testing vision apps
• More substance and less hype around CNN and ADAS
• Standard input data sets
• Standard description of neural networks
• Reference trained parameters


Resources

• Cadence Imaging/Vision Products: http://ip.cadence.com/ipportfolio/tensilica-ip/image-video-processing
• Some Cadence Vision Partners
  • Morpho: http://www.morphoinc.com/en/
  • Almalence: http://www.almalence.com/
  • Irida Labs: http://www.iridalabs.gr/
  • Ittiam: http://www.ittiam.com/
  • Dream Chip: http://www.dreamchip.de/
• OpenVX: https://www.khronos.org/openvx/

Cadence, Xtensa and Tensilica are registered trademarks of Cadence Design Systems, Inc. All other trademarks and logos are the property of their respective holders.
