SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)


Markus Willems

Product Manager, Synopsys

SoC Design

• What to do when the performance of your main processor is insufficient?
  – Go multicore?
    • Application mapping difficult, resource utilisation unbalanced
  – Add hardwired accelerators?
    • Balanced but inflexible

ASIPs: application-specific processors

• Anything between a general-purpose µP and a hardwired datapath
• Deploys classic hardware tricks (parallelism and customized datapaths) while retaining programmability
  – Hardware efficiency with software programmability

Agenda

• ASIPs as accelerators in SoCs
• How to design ASIPs
• Examples
• Conclusions

Architectural Optimization Space

The ASIP architectural optimization space has two dimensions: parallelism and specialization.

Parallelism

• Instruction-level parallelism (ILP): orthogonal instruction set (VLIW), encoded instruction set
• Data-level parallelism: vector processing (SIMD)
• Task-level parallelism: multicore, multithreading

Specialization

• App.-specific data types: integer, fractional, floating-point, bits, complex, vector…
• App.-specific instructions
  – App.-spec. data processing: any exotic operator, single or multi-cycle
  – App.-spec. memory addressing: direct, indirect, post-modification, indexed, stack indirect…
  – App.-spec. control processing: jumps, subroutines, interrupts, HW do-loops, residual control, predication…; relative or absolute, address range, delay slots…
• Connectivity & storage matching application’s data-flow: distributed regs, sub-ranges; multiple mem’s, sub-ranges
• Pipeline

IP Designer: ASIP Design and Programming


Synopsys - Full Spectrum Processor Technology Provider

32-bit ARC HS Processors: High-Performance for Embedded Applications

[Block diagram (some blocks are marked optional): 10-stage pipeline; ARCv2 ISA / DSP; multiplier, ALU, divider, late ALU; MAC & SIMD; instruction CCM, instruction cache, data CCM, data cache; user defined extensions; ARC floating point unit; real-time trace; memory protection unit; JTAG]

• Over 3100 DMIPS @ 1.6 GHz*
• 53 mW* of power; 0.12 mm² area in 28-nm process*
• HS Family products
  – HS34 CCM, HS36 CCM plus I&D cache
  – HS234, HS236 dual-core
  – HS434, HS436 quad-core
• Configurable so each instance can be optimized for performance and power
• Custom instructions enable integration of proprietary hardware

*Worst case 28-nm silicon and conditions

Pedestrian Detection and HOG

• Pedestrian detection
  – Standard feature in luxury vehicles
  – Moving to mid-size and compact vehicles in the next 5-10 years, also due to legislation efforts
• Implementation requirements
  – Low cost
  – Low power (small form factor, and/or battery powered)
  – Programmable (to allow for in-field SW upgrades)
• Most popular algorithm for pedestrian detection is Histogram of Oriented Gradients (HOG)

Histogram Of Oriented Gradients

Processing pipeline: grey scale conversion → scale to multiple resolutions → gradient computation → histogram computation per block → normalization of the histograms → SVM per window position → non-max suppression

Gradient Computation

Apply Sobel operators (the two kernels appeared as images on the slide and are not reproduced here).
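The gradient step can be sketched as follows. This is a minimal illustration, not the implementation profiled in the talk: it assumes the standard 3x3 Sobel kernels (the slide's kernel images are missing), and the function name `sobel_gradients` and the unsigned 0..180 degree orientation range are choices made for this example.

```python
import math

# Standard 3x3 Sobel kernels (assumed; the slide's kernel images are missing).
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]


def sobel_gradients(img):
    """Return (magnitude, orientation) maps for a grey-scale image.

    img is a list of rows of numbers. Border pixels are left at zero.
    Orientation is unsigned, in degrees within [0, 180).
    """
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    p = img[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * p
                    gy += SOBEL_Y[dy][dx] * p
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang


# A vertical edge produces a purely horizontal gradient.
edge = [[0, 0, 10, 10] for _ in range(4)]
magnitude, orientation = sobel_gradients(edge)
```

The nested loops are what an ASIP with a vector slot would replace with a few SIMD multiply-accumulate instructions per pixel row.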

Scale to Multiple Resolutions

Use a fixed 64x128-pixel detection window. Apply this detection window to scaled frames.
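To give a feel for the workload this creates, the sketch below enumerates the scaled frame sizes and counts window positions per frame. Only the 64x128 window comes from the slide; the scale step of 1.2 and the 8-pixel window stride are assumptions for illustration, and both function names are hypothetical.

```python
def pyramid_sizes(width, height, scale_step=1.2, win_w=64, win_h=128):
    """List the (w, h) of each downscaled frame that still fits one window.

    scale_step is an assumed value; the slide does not state it.
    """
    sizes = []
    scale = 1.0
    while True:
        w, h = int(width / scale), int(height / scale)
        if w < win_w or h < win_h:
            break
        sizes.append((w, h))
        scale *= scale_step
    return sizes


def window_positions(w, h, stride=8, win_w=64, win_h=128):
    """Number of detection-window positions in a w x h frame (assumed stride)."""
    return ((w - win_w) // stride + 1) * ((h - win_h) // stride + 1)


# Total window positions over all scales of a 640 x 480 frame.
total = sum(window_positions(w, h) for w, h in pyramid_sizes(640, 480))
```

Because the window size stays fixed, all scale handling is confined to the rescaling stage, and the later stages run the same code on every scaled frame.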

Histogram Computation

The image is divided into 8x8-pixel cells. For every block of 2x2 cells, apply Gaussian weights and compute 4 histograms of orientation of gradients.
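A simplified sketch of the per-block histogram step follows. The 8x8 cells and 2x2-cell blocks are from the slide; the 9 orientation bins over 0..180 degrees, the Gaussian sigma, and the function name are assumptions, and the bin and cell interpolation used by full HOG implementations is left out to keep the example short.

```python
import math

CELL = 8   # cell size in pixels (from the slide)
BINS = 9   # assumed number of orientation bins over 0..180 degrees


def block_histograms(mag, ang, x0, y0, sigma=8.0):
    """Return 4 orientation histograms, one per 8x8 cell of the 16x16 block
    whose top-left pixel is (x0, y0).

    mag and ang are per-pixel gradient magnitude and orientation (degrees).
    Each vote is the magnitude scaled by a Gaussian centred on the block.
    """
    size = 2 * CELL
    centre = (size - 1) / 2.0
    hists = [[0.0] * BINS for _ in range(4)]
    for dy in range(size):
        for dx in range(size):
            weight = math.exp(-((dx - centre) ** 2 + (dy - centre) ** 2)
                              / (2.0 * sigma ** 2))
            cell = (dy // CELL) * 2 + (dx // CELL)
            b = int(ang[y0 + dy][x0 + dx] / (180.0 / BINS)) % BINS
            hists[cell][b] += weight * mag[y0 + dy][x0 + dx]
    return hists
```

The inner update is a data-dependent scatter (`hists[cell][b] += ...`), which vectorizes poorly on a general-purpose core and is the kind of operation that benefits from dedicated histogram registers and instructions.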

Normalization of the Histograms

(1) L2 normalization (2) clipping (saturation) (3) L2 normalization
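The three steps can be written down directly. In this sketch the clipping threshold of 0.2 and the small `eps` guard against division by zero are assumed values (the slide gives neither), and `l2_hys` is a name chosen for the example.

```python
import math


def l2_hys(v, clip=0.2, eps=1e-6):
    """Normalize a block descriptor: L2 normalize, clip, L2 normalize again.

    clip and eps are assumed values, not taken from the slide.
    """
    def l2(u):
        norm = math.sqrt(sum(x * x for x in u) + eps * eps)
        return [x / norm for x in u]

    v = l2(v)                           # (1) L2 normalization
    v = [min(x, clip) for x in v]       # (2) clipping (saturation)
    return l2(v)                        # (3) L2 normalization


normalized = l2_hys([3.0, 4.0, 0.0])
```

Clipping limits the influence of a few very strong gradients, and the second normalization restores unit length afterwards.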

Support Vector Machine

Linear classification of histograms for every 64x128 window position.
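A linear SVM reduces to one dot product per window position. The sketch below shows that decision function together with the resulting descriptor length; the 9 bins per histogram and the one-cell block stride are assumptions (the slide states only the window, cell and block sizes), and all function names are hypothetical.

```python
def descriptor_length(win_w=64, win_h=128, cell=8, bins=9):
    """Descriptor length for one window, assuming 2x2-cell blocks that
    advance by one cell and `bins` orientation bins per histogram."""
    cells_x, cells_y = win_w // cell, win_h // cell
    blocks = (cells_x - 1) * (cells_y - 1)
    return blocks * 4 * bins


def svm_score(weights, bias, descriptor):
    """Linear SVM decision value: dot(weights, descriptor) + bias."""
    return sum(w * x for w, x in zip(weights, descriptor)) + bias


def is_pedestrian(weights, bias, descriptor, threshold=0.0):
    """Classify one window position (threshold is an assumed parameter)."""
    return svm_score(weights, bias, descriptor) > threshold
```

Under these assumptions each window costs `descriptor_length()` multiply-accumulates, repeated for every window position at every scale, which is why this stage maps well onto SIMD hardware.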

Non-Max Suppression

Cluster multi-scale dense scan of detection windows and select unique…
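The slide does not say which clustering method is used. As an illustration only, the sketch below uses the common greedy overlap-based variant of non-max suppression, which may differ from the method in the talk; the box format, the 0.5 overlap threshold and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over (score, box) pairs, with boxes in frame coordinates.

    Keeps the highest-scoring detection of each overlapping group.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, k[1]) <= iou_threshold for k in kept):
            kept.append((score, box))
    return kept


detections = [(0.9, (0, 0, 64, 128)),
              (0.8, (4, 4, 64, 128)),
              (0.7, (200, 0, 64, 128))]
unique = non_max_suppression(detections)
```

This stage touches only the few windows that passed the SVM, which is consistent with it being the cheapest stage in the profile.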


HOG Functional Validation on ARC HS (640 x 480 pixels)

[Block diagram: HS with DCCM and subsystem control (Subs. ctrl), ASIP1, ASIP2 … ASIPn, AXI local interconnect, DMA, sync & I/O, dedicated streaming interconnect (FIFOs). Pipeline stages shown: grey scale conversion, rescaling, gradient, histogram, normalization, SVM, non-max suppression]

• OpenCV float profiling results: 2.6 G cycles per frame
• Fixed point profiling results: 2.4 G cycles per frame

Profiling (640 x 480 pixels, at 30 FPS)

Stage                              ARC HS G cycles    %        # ARC HS equivalent
Grey scale conversion              0.1                0.2%     0.07
Scale to multiple resolutions      1.6                2.3%     1.0
Gradient computation               17.3               26%      10.8
Histogram computation per block    31.9               47%      20.0
Normalization of the histograms    1.2                1.8%     0.8
SVM per window position            15.7               23%      9.8
Non-max suppression                0.004              0.01%    0.002
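The "# ARC HS equivalent" figures appear to be the cycle counts divided by the core clock. The sketch below reproduces that arithmetic under two assumptions: that the G-cycle figures are per second of video at 30 FPS, and that the clock is the 1.6 GHz quoted for the ARC HS. The assignment of figures to stages assumes the rows follow pipeline order.

```python
HS_CLOCK_GHZ = 1.6  # assumed: the 1.6 GHz figure quoted for the ARC HS core

# Stage -> G cycles, from the profiling figures (row order assumed to
# follow the pipeline order).
G_CYCLES = {
    "Grey scale conversion": 0.1,
    "Scale to multiple resolutions": 1.6,
    "Gradient computation": 17.3,
    "Histogram computation per block": 31.9,
    "Normalization of the histograms": 1.2,
    "SVM per window position": 15.7,
    "Non-max suppression": 0.004,
}


def hs_equivalents(g_cycles, clock_ghz=HS_CLOCK_GHZ):
    """Number of cores at clock_ghz needed to deliver g_cycles per second."""
    return g_cycles / clock_ghz


total = sum(G_CYCLES.values())
for stage, g in G_CYCLES.items():
    share = 100.0 * g / total
    print(f"{stage:32s} {share:6.2f}%  {hs_equivalents(g):7.3f} cores")
print(f"{'Total':32s} {100.0:6.2f}%  {hs_equivalents(total):7.3f} cores")
```

The total comes out at roughly 42 core equivalents, in line with the "~40" HS cores listed for the HS-only configuration, and gradient, histogram and SVM together account for about 96% of the load.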

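Task Assignment #2

[Block diagram, step 2: HS with DCCM and subsystem control (Subs. ctrl); ASIP1, ASIP2 and ASIP4, each with a block labelled D; AXI local interconnect; DMA, sync & I/O; dedicated streaming interconnect (FIFOs); L3 external DRAM. Pipeline stages shown: grey scale conversion, rescaling, gradient, histogram, normalization, SVM, non-max suppression]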

ASIP Example: HISTOGRAM

• Vector-slot next to existing scalar instructions (VLIW)
• 16x(8/16)-bit vector register files
• 16x8-bit SRAM interface
• 16x8-bit FIFO interfaces
• Vector arithmetic instructions
• Special registers and instructions to compute histograms

4x size increase & 200x speedup (relative to RISC template)

Implemented in less than 1 week

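Task Assignment #3

[Block diagram, step 3: HS with DCCM and subsystem control (Subs. ctrl); ASIP1, ASIP2, ASIP3 and ASIP4, each with a block labelled D; AXI local interconnect; DMA, sync & I/O; dedicated streaming interconnect (FIFOs); L3 external DRAM. Pipeline stages shown: grey scale conversion, rescaling, gradient, histogram, normalization, SVM, non-max suppression]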

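Task Assignment #4

[Block diagram, step 4: HS with DCCM and subsystem control (Subs. ctrl); ASIP1’, ASIP2, ASIP3 and ASIP4, each with a block labelled D; AXI local interconnect; DMA, sync & I/O; dedicated streaming interconnect (FIFOs); L3 external DRAM. Pipeline stages shown: grey scale conversion, rescaling, gradient, histogram, normalization, SVM, non-max suppression]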

Task Assignment #4

[Block diagram, step 4’: HS with DCCM; ASIP1’, ASIP2, ASIP3 and ASIP4, each with a block labelled D; AXI local interconnect; DMA, sync & I/O; dedicated streaming interconnect (FIFOs); L2 SRAM; L3 external DRAM. Pipeline stages shown: grey scale conversion, rescaling, gradient, histogram, normalization, SVM, non-max suppression]

Comparison

#   Platform configuration   #HS (MHz)   #ASIP (MHz)   ARC functions                                                  ASIP functions
1   HS                       ~40         0             All                                                            None
2   HS + ASIPs               2 (1600)    2.5 (500)     Greyscale, Rescaling, Normalization, Non-max suppr., Display   Gradient, Histogram, SVM
3   HS + ASIPs               1 (1600)    3.5 (500)     Greyscale, Rescaling, Non-max suppr., Display                  Gradient, Histogram, Normalization, SVM
4   HS + ASIPs               1 (500)     4 (500)       Greyscale, Non-max suppr., Display                             Rescaling, Gradient, Histogram, Normalization, SVM

Final Results

• 1 ARC HS, 4 ASIPs, AXI interconnect, private SRAM, L2 SRAM
• 30 frames/second at 500 MHz
• Functionally identical to OpenCV reference
• TSMC 28nm
• ASIP gate count: 330k gates
• ASIP power consumption: ~130mW
• Power/performance/area via ASIPs
  – Scaling due to multi-core, specialization and SIMD usage
  – Performance gains and power efficiency due to tailored instruction sets and dedicated memory architecture

Scenario: Need for Flexible FEC Core

• Existing and emerging standards use advanced FEC schemes like turbo coding, LDPC and Viterbi

• Instead of duplication of FEC cores, need for re-configurable architecture at minimum power and area

[Diagram: separate FEC cores per standard (DVB-X? LDPC-A; UMTS Turbo-B; .11n LDPC-C; .16e LDPC-D; 3GPP-LTE turbo-A; .11n Vit) versus a single FlexFEC core (turbo/LDPC/Vit)]

Architecture Refinement to Increase Throughput: Increased ILP from 2 to 6

• ILP: 2 FU (scalar + vector unit)
• ILP: 6 FU (1 scalar + 5 vector units)
  – No duplication of arithmetic functionality
  – For exploiting ILP to increase throughput
  – 2 FUs for local memory access

Fast Area/Performance Trade-off (40nm logical synthesis, processor only)

[Plot: cycle count (0 to 100) versus total number of processor functional units (2 to 6). Series: ldpc - layer 6, ldpc - layer 8, turbo - beta, turbo - output. Area annotations: 0.177 sqmm and 0.189 sqmm]

Architectural Exploration: FU Utilization, 2 to 5

[Bar charts: FU utilization (0 to 100) for layer6, layer7, layer8, alpha, beta, output. First chart series: scalar, vector. Second chart series: scalar, vector alu, vector spec, vector vmem, vector bg vmem]

• Vector slot separated in different FUs without overlapping functionality
• Local memory access congestion

Architectural Exploration: More Balanced FU Utilization, 5 to 6

[Bar chart: FU utilization (0 to 90) for ldpc - layer6, ldpc - layer7, ldpc - layer8, turbo - alpha, turbo - beta, turbo - output. Series: scalar, vector alu, vector spec, vector vmem, vector vmem2, vector bg vmem]

Latest IP Available from IMEC

Blox-LDPC ASIP

adInstances available


Conclusion

• ASIPs enable programmable accelerators

• IP Designer enables efficient design and programming of ASIPs

• “Programmable datapath” ASIPs offer performance, area and power comparable to hardwired accelerators

• ASIPs enable balanced multicore SoC architectures
