SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Markus Willems

Product Manager, Synopsys

• What to do when the performance of your main processor is insufficient?
– Go multicore?
  • Application mapping difficult, resource utilisation unbalanced
– Add hardwired accelerators?
  • Balanced but inflexible

SoC Design


ASIPs: application-specific processors
• Anything between general-purpose µP and hardwired data-path
• Deploys classic hardware tricks (parallelism and customized datapaths) while retaining programmability
– Hardware efficiency with software programmability

Agenda
• ASIPs as accelerators in SoCs
• How to design ASIPs
• Examples
• Conclusions

Architectural Optimization Space

ASIP architectural optimization space

• Parallelism
– Instruction-level parallelism (ILP): orthogonal instruction set (VLIW), encoded instruction set
– Data-level parallelism: vector processing (SIMD)
– Task-level parallelism: multicore, multi-threading
• Specialization
– App.-specific data types: integer, fractional, floating-point, bits, complex, vector…
– App.-specific instructions
  • App.-spec. data processing: any exotic operator; single or multi-cycle
  • App.-spec. memory addressing: direct, indirect, post-modification, indexed, stack indirect…
  • App.-spec. control processing: jumps, subroutines, interrupts, HW do-loops, residual control, predication…; relative or absolute, address range, delay slots…
– Connectivity & storage matching application’s data-…: distributed regs, sub-ranges; multiple mem’s, sub-ranges
– Pipeline

IP Designer: ASIP Design and Programming

Synopsys - Full Spectrum Processor Technology Provider

32-bit ARC HS Processors

High-Performance for Embedded Applications

[Block diagram: 10-stage pipeline; ARCv2 ISA / DSP; Multiplier, ALU, Divider, Late ALU; MAC & SIMD; ARC Floating Point Unit; User Defined Extensions; Instruction CCM, Instruction Cache, Data CCM, Data Cache; Real-Time Trace; Memory Protection Unit. Legend entry: Optional.]

• Over 3100 DMIPS @ 1.6 GHz*
• 53 mW* of power; 0.12 mm² area in 28-nm process*
• HS Family products
– HS34 CCM, HS36 CCM plus I&D cache
– HS234, HS236 dual-core
– HS434, HS436 quad-core

• Configurable so each instance can be optimized for performance and power

• Custom instructions enable integration of proprietary hardware

*Worst case 28-nm silicon and conditions

Pedestrian Detection and HOG

• Pedestrian detection
– Standard feature in luxury vehicles
– Moving to mid-size and compact vehicles in the next 5-10 years, also due to legislation efforts
• Implementation requirements
– Low cost
– Low power (small form factor, and/or battery powered)
– Programmable (to allow for in-field SW upgrades)
• The most popular algorithm for pedestrian detection is Histogram of Oriented Gradients (HOG)

Histogram Of Oriented Gradients

• Grey scale conversion
• Scale to multiple resolutions
• Gradient computation
• Histogram computation per …
• Normalization of the histograms
• SVM per window position
• Non-max suppression

Gradient Computation

Apply horizontal and vertical Sobel operators.
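As an illustration of the gradient computation step, the sketch below applies the textbook 3x3 Sobel kernels to a grey-scale image held as a list of rows. The kernel values, the function name `sobel_gradients` and the choice to leave border pixels at zero are assumptions made for this example; the slides do not show the kernels actually used.

```python
# Textbook 3x3 Sobel kernels (assumed; the slide's own kernels are not shown).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_gradients(img):
    """Return (gx, gy) for a 2-D list of grey values.

    Border pixels are left at 0 because the 3x3 window does not fit there.
    """
    h, w = len(img), len(img[0])
    gx = [[0] * w for _ in range(h)]
    gy = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            sx = sy = 0
            for dy in range(3):
                for dx in range(3):
                    p = img[y + dy - 1][x + dx - 1]
                    sx += SOBEL_X[dy][dx] * p
                    sy += SOBEL_Y[dy][dx] * p
            gx[y][x] = sx
            gy[y][x] = sy
    return gx, gy


# Vertical edge: left half dark, right half bright.
img = [[0, 0, 10, 10] for _ in range(4)]
gx, gy = sobel_gradients(img)
```

On this test image only the horizontal gradient is non-zero, which is what a purely vertical edge should produce.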

Scale to Multiple Resolutions

Use a fixed 64x128-pixel detection window. Apply this detection window to scaled frames.
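A rough sketch of what scanning scaled frames with a fixed window implies in terms of workload. The 64x128 window comes from the slide; the scale step between pyramid levels and the 8-pixel window stride are hypothetical parameters chosen for illustration only.

```python
WIN_W, WIN_H = 64, 128  # fixed detection window (from the slide)


def pyramid_sizes(width, height, scale_step=1.2):
    """List frame sizes from full resolution down to the last one that
    still fits a single detection window. scale_step is an assumption."""
    sizes = []
    scale = 1.0
    while True:
        w, h = int(width / scale), int(height / scale)
        if w < WIN_W or h < WIN_H:
            break
        sizes.append((w, h))
        scale *= scale_step
    return sizes


def window_positions(w, h, stride=8):
    """Number of window positions in one frame for a given stride
    (stride of 8 pixels is an assumption)."""
    return ((w - WIN_W) // stride + 1) * ((h - WIN_H) // stride + 1)


levels = pyramid_sizes(640, 480)
total_windows = sum(window_positions(w, h) for w, h in levels)
```

Under these assumptions the full-resolution 640x480 frame alone contributes 73 x 45 = 3285 window positions, which gives a feel for why the per-window stages dominate the cycle budget.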

Histogram Computation

The image is divided into 8x8-pixel cells. For every block of 2x2 cells, apply Gaussian weights and compute 4 histograms of the orientation of gradients.
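A simplified sketch of the per-cell histogram computation. It assumes 9 unsigned orientation bins over 0 to 180 degrees (a common HOG choice, not stated on the slide) and votes with the gradient magnitude. The Gaussian block weighting mentioned above and any interpolation between bins are deliberately left out to keep the example short; `cell_histograms` is a hypothetical helper name.

```python
import math


def cell_histograms(gx, gy, cell=8, bins=9):
    """Accumulate one orientation histogram per cell x cell pixel block.

    gx, gy: 2-D lists of horizontal and vertical gradients.
    Returns hist[cell_row][cell_col][bin], weighted by gradient magnitude.
    """
    h, w = len(gx), len(gx[0])
    cells_y, cells_x = h // cell, w // cell
    hist = [[[0.0] * bins for _ in range(cells_x)] for _ in range(cells_y)]
    bin_width = 180.0 / bins
    for y in range(cells_y * cell):
        for x in range(cells_x * cell):
            mag = math.hypot(gx[y][x], gy[y][x])
            # Unsigned orientation: fold the angle into [0, 180).
            angle = math.degrees(math.atan2(gy[y][x], gx[y][x])) % 180.0
            b = min(int(angle / bin_width), bins - 1)
            hist[y // cell][x // cell][b] += mag
    return hist


# 8x16 test image with a constant horizontal gradient: two cells.
gx = [[3] * 16 for _ in range(8)]
gy = [[0] * 16 for _ in range(8)]
hist = cell_histograms(gx, gy)
```

The inner loop is a read-modify-write on a small table indexed by a computed bin, which is the kind of access pattern that dedicated histogram registers and instructions in an ASIP are meant to speed up.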

Normalization of the Histograms

(1) L2 normalization
(2) Clipping (saturation)
(3) L2 normalization
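The three normalization steps can be sketched as follows. The clipping threshold of 0.2 and the small `eps` guard against division by zero are assumptions (0.2 is the value commonly used for this scheme); the slide does not give either number.

```python
import math


def l2_normalize(v, eps=1e-6):
    """Scale v to unit L2 norm; eps avoids division by zero (assumed value)."""
    norm = math.sqrt(sum(x * x for x in v) + eps * eps)
    return [x / norm for x in v]


def l2_hys(v, clip=0.2):
    """(1) L2 normalization, (2) clipping, (3) L2 normalization again."""
    v = l2_normalize(v)
    v = [min(x, clip) for x in v]  # saturate large components
    return l2_normalize(v)


block = [3.0, 4.0, 0.0, 0.0]
normalized = l2_hys(block)
```

After the first pass the vector is [0.6, 0.8, 0, 0]; both non-zero entries saturate at the clip value, so the final result has two equal components.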

Support Vector Machine

Linear classification of histograms for every 64x128 window position.
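A linear SVM reduces to one dot product per window position. The sketch below shows that computation and, under assumed HOG parameters (8-pixel cells, 2x2-cell blocks stepping by one cell, 9 bins), the resulting descriptor length for a 64x128 window. The function names, the block step, the bin count and the zero decision threshold are illustrative assumptions.

```python
def descriptor_length(win_w=64, win_h=128, cell=8, block_cells=2, bins=9):
    """Descriptor size for one window, assuming blocks step by one cell."""
    blocks_x = win_w // cell - block_cells + 1
    blocks_y = win_h // cell - block_cells + 1
    return blocks_x * blocks_y * block_cells * block_cells * bins


def svm_score(weights, bias, descriptor):
    """Linear SVM decision value: w . x + b."""
    return sum(w * x for w, x in zip(weights, descriptor)) + bias


def is_pedestrian(weights, bias, descriptor, threshold=0.0):
    """Classify one window; threshold 0.0 is an assumed default."""
    return svm_score(weights, bias, descriptor) > threshold


n = descriptor_length()
score = svm_score([1.0, -2.0], 0.5, [3.0, 1.0])
```

With these assumptions each window needs a 3780-element multiply-accumulate, repeated for every window position at every scale, which maps naturally onto SIMD MAC hardware.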

Non-Max Suppression

Cluster multi-scale dense scan of detection windows and select unique …
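The slides do not say which clustering method is used. As a stand-in, the sketch below uses greedy overlap-based suppression, a widely used alternative: keep the highest-scoring window and drop any window that overlaps an already kept one too strongly. The 0.5 overlap threshold and the `(score, (x1, y1, x2, y2))` detection format are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (score, box); returns the kept detections."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, k[1]) <= iou_threshold for k in kept):
            kept.append((score, box))
    return kept


detections = [
    (0.9, (0, 0, 64, 128)),
    (0.8, (4, 4, 68, 132)),      # heavy overlap with the first window
    (0.7, (200, 0, 264, 128)),   # separate pedestrian
]
kept = non_max_suppression(detections)
```

This step only touches the few windows that passed the classifier, so it is control-heavy and cheap, a natural fit for the general-purpose core rather than an ASIP.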

HOG Functional Validation on ARC HS (640 x 480 pixels)

[Figure: subsystem block diagram. ARC HS subsystem controller with DCCM on an AXI local interconnect with DMA, sync & I/O; ASIP1, ASIP2, …, ASIPn linked by a dedicated streaming interconnect (FIFOs). Pipeline stages shown: grey scale conversion, rescaling, gradient, histogram, normalization, SVM, non-max suppression.]

• OpenCV float profiling results: 2.6 G cycles per frame
• Fixed point profiling results: 2.4 G cycles per frame

Profiling (640 x 480 pixels, at 30 FPS)

Step                     ARC HS G cycles   %       # ARC HS equivalent
Grey scale conversion    0.1               0.2%    0.07
Rescaling                1.6               2.3%    1.0
Gradient                 17.3              26%     10.8
Histogram                31.9              47%     20.0
Normalization            1.2               1.8%    0.8
SVM                      15.7              23%     9.8
Non-max suppression      0.004             0.01%   0.002
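The "# ARC HS equivalent" column appears to be the cycle count divided by the core clock. The sketch below reproduces that arithmetic, assuming the G-cycle figures are per second at 30 FPS and that one ARC HS core runs at the 1.6 GHz quoted earlier in the deck. Small differences from the printed figures are expected because the inputs are already rounded.

```python
HS_CLOCK_GHZ = 1.6  # assumed: the 1.6 GHz ARC HS figure quoted earlier

# G cycles per second per pipeline step, in the order given in the profile.
g_cycles = [0.1, 1.6, 17.3, 31.9, 1.2, 15.7, 0.004]


def hs_equivalents(g):
    """Number of 1.6 GHz ARC HS cores needed to sustain g G cycles/second."""
    return g / HS_CLOCK_GHZ


total = sum(g_cycles)
shares = [100.0 * g / total for g in g_cycles]       # percent of total load
cores = [hs_equivalents(g) for g in g_cycles]        # per-step core count
total_cores = hs_equivalents(total)                  # about 42 cores overall

for g, s, c in zip(g_cycles, shares, cores):
    print(f"{g:8.3f} G cycles  {s:6.2f}%  {c:6.2f} cores")
```

The total of roughly 42 core-equivalents is consistent with the "~40" ARC HS cores listed for the HS-only configuration in the comparison.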

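Task Assignment #2

[Figure: ARC HS subsystem controller with HS DCCM, AXI local interconnect with DMA, sync & I/O, and L3 external DRAM; ASIP1 and ASIP2, with blocks labelled D; non-max suppression shown on the task map.]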

ASIP Example: HISTOGRAM

• Vector-slot next to existing scalar instructions (VLIW)
• 16x(8/16)-bit vector register files
• 16x8-bit SRAM interface
• 16x8-bit FIFO interfaces
• Vector arithmetic instructions
• Special registers and instructions to compute histograms

4x size increase & 200x speedup (relative to RISC template)

Implemented in less than 1 week

Task Assignment #3

[Figure: ARC HS subsystem controller and L3 external DRAM; ASIP1, ASIP2, ASIP3 and ASIP4, with blocks labelled D; non-max suppression shown on the task map.]

Task Assignment #4

[Figure: ARC HS subsystem controller and L3 external DRAM; ASIP1’, ASIP2, ASIP3 and ASIP4, with blocks labelled D; non-max suppression shown on the task map.]

Task Assignment #4

[Figure: AXI local interconnect with DMA, sync & I/O; HS DCCM, L2 SRAM and L3 external DRAM; ASIP1’, ASIP2, ASIP3 and ASIP4, with blocks labelled D; non-max suppression shown on the task map.]

Comparison

Platform configuration   #HS (MHz)   #ASIP (MHz)   ARC functions                                                    ASIP functions
HS                       ~40         0             All                                                              None
HS + ASIPs               2 (1600)    2.5 (500)     Greyscale, Rescaling, Normalization, Non-max suppr., Display     Gradient, Histogram, SVM
HS + ASIPs               1 (1600)    3.5 (500)     Greyscale, Rescaling, Non-max suppr., Display                    Gradient, Histogram, Normalization, SVM
HS + ASIPs               1 (500)     4 (500)       Greyscale, Non-max suppr., Display                               Rescaling, Gradient, Histogram, Normalization, SVM

Final Results

• 1 ARC HS, 4 ASIPs, AXI interconnect, private SRAM, L2 SRAM
• 30 frames/second at 500 MHz
• Functionally identical to OpenCV reference
• TSMC 28nm
• ASIP gate count: 330k gates
• ASIP power consumption: ~130 mW
• Power/performance/area via ASIPs
– Scaling due to multi-core, specialization and SIMD usage
– Performance gains and power efficiency due to tailored instruction sets and dedicated memory architecture

Scenario: Need for Flexible FEC Core

• Existing and emerging standards use advanced FEC schemes like turbo coding, LDPC and Viterbi

• Instead of duplication of FEC cores, need for re-configurable architecture at minimum power and area

[Figure: standards and their FEC schemes mapped onto a single FlexFEC (turbo/LDPC/Vit) core: DVB-X? LDPC-A, UMTS Turbo-B, .11n LDPC-C, .16e LDPC-D, 3GPP-LTE turbo-A, .11n Vit.]

Architecture Refinement to Increase Throughput: Increased ILP from 2 to 6

• ILP: 2 FU (scalar + vector unit)
• ILP: 6 FU (1 scalar + 5 vector units)
– No duplication of arithmetic functionality
– For exploiting ILP to increase throughput
– 2 FUs for local memory access

Fast Area/Performance Trade-off (40nm logical synthesis, processor only)

[Plot: x-axis is the total number of processor functional units (2 to 6); series: ldpc - layer 6, ldpc - layer 8, turbo - beta, turbo - output; area annotations: 0.177 sqmm, 0.189 sqmm.]

Architectural Exploration

FU Utilization: 2 to 5 FUs

[Two bar charts of FU utilization per kernel (layer6, layer7, layer8, alpha, beta, output). First chart: scalar, vector. Second chart: scalar, vector alu, vector spec, vector vmem, vector bg vmem.]

• Vector slot separated in different FUs without overlapping functionality
• Local memory access congestion

Architectural Exploration

More Balanced FU Utilization: 5 to 6 FUs

[Bar chart of FU utilization per kernel (ldpc - layer6, ldpc - layer7, ldpc - layer8, turbo - alpha, turbo - beta, turbo - output) for: scalar, vector alu, vector spec, vector vmem, vector vmem2, vector bg vmem.]

Highly Efficient C-compilation

Vast Majority of 6 FU Used

Latest IP Available from IMEC

Blox-LDPC ASIP

adInstances available

Conclusion

• ASIPs enable programmable accelerators

• IP Designer enables efficient design and programming of ASIPs

• “Programmable datapath” ASIPs offer performance, area and power comparable to hardwired accelerators

• ASIPs enable balanced multicore SoC architectures
