Platform-Based Behavior-Level and System-Level Synthesis

Prof. Jason Cong
[email protected]
UCLA Computer Science Department

Source: cadlab.cs.ucla.edu/soc/docs/edp_april_2006.pdf


Page 2: Outline

Motivation

xPilot system framework

Behavior-level synthesis in xPilot
  • Advantages of behavioral synthesis
  • Scheduling
  • Resource binding

System-level synthesis in xPilot
  • Synthesis for ASIP platforms
  • Design exploration for heterogeneous MPSoCs

Conclusions

Page 3: ASICs SOC Example: Philips Nexperia

Philips Nexperia SoC platform for high-end digital video

[Figure: Nexperia DVP system silicon block diagram. Labeled blocks: MIPS CPU (PRxxxx, I$, D$), TriMedia CPU (TM-xxxx, I$, D$), device IP blocks, ACCESS CTL., MPEG, VLIW, VIDEO, MSP, MMI, SDRAM, two PI buses, and the DVP memory bus. Courtesy Philips.]

MIPS™: general-purpose scalable RISC processor
  • 50 to 300+ MHz
  • 32-bit or 64-bit

TriMedia™: scalable VLIW media processor
  • 100 to 300+ MHz
  • 32-bit or 64-bit

Library of device IP blocks
  • Image coprocessors
  • DSPs
  • UART
  • 1394
  • USB
  • …

Nexperia™ system buses
  • 32-128 bit

Page 4: Field-Programmable SOC Example: Xilinx Virtex-4 FPGA

[Figure: Xilinx Virtex-4 FPGA platform with annotated blocks. Courtesy Xilinx.]

  • PowerPC 405 (PPC405) core: 450 MHz, 700+ DMIPS RISC core (32-bit Harvard architecture)
  • MicroBlaze soft-core µProc: 180 MHz, < ~1300 LUTs, 166 DMIPS
  • IBM CoreConnect™ bus
  • IP blocks
  • H.264/AVC hardware blocks

Page 5: Needs for Electronic System-Level (ESL) Design Automation

  • Need executable models for system-level specification
  • Need common specification for SW/HW co-design
  • Need better complexity management

Page 6: ESL Landscape

Modeling
  • SystemC (open source)
  • SystemVerilog

Simulation and Verification
  • Behavior-level simulation & verification
  • System-level simulation & verification
  • SystemC provides behavior-level and system-level synthesis capabilities for free; rapidly gaining popularity

Synthesis
  • Behavior-level synthesis: from behavior specification (e.g. C, SystemC, or Matlab) to RTL or netlists
  • System-level synthesis: from system specification to system implementation

Page 7: xPilot: Platform-Based Synthesis System

[Figure: xPilot system flow. Inputs: SystemC/C and platform description & constraints, read by the xPilot front end into the SSDM (System-Level Synthesis Data Model). Labeled xPilot stages: profiling, analysis, mapping, processor & architecture synthesis, interface synthesis, behavioral synthesis. Outputs for the embedded SoC: processor cores + executables, drivers + glue logic, custom logic.]

Uniqueness of xPilot
  • Platform-based synthesis and optimization
  • Communication-centric synthesis with interconnect optimization


Page 9: xPilot: Behavioral-to-RTL Synthesis Flow

[Figure: flow from a behavioral spec. in C/SystemC and a platform description, through the front-end compiler into SSDM, to RTL + constraints for FPGAs/ASICs.]

Presynthesis optimizations
  • Loop unrolling/shifting
  • Strength reduction / tree height reduction
  • Bitwidth analysis
  • Memory analysis
  • …

Core synthesis optimizations
  • Scheduling
  • Resource binding, e.g., functional unit binding, register/port binding

µArch-generation & RTL/constraints generation
  • Verilog/VHDL/SystemC
  • FPGAs: Altera, Xilinx
  • ASICs: Magma, Synopsys, …

Page 10: xPilot Advantages

Advanced algorithms for platform-based, communication-centric optimization
  • E.g. a versatile scheduling engine based on solving a system of difference constraints (SDC)

Platform-based behavior and system synthesis
  • E.g. resource binding based on distributed register architecture

Communication/interconnect-centric approach
  • E.g. behavior and communication co-optimization

Complete validation through final P&R on FPGAs

Page 11: Advanced Behavior System Algorithms: Example: Versatile Scheduling Algorithm Based on SDC

Scheduling problem in behavioral synthesis is NP-complete under general design constraints

ILP-based solutions are versatile but very inefficient
  • Exponential time complexity

Our solution: an efficient and versatile scheduler based on SDC (system of difference constraints)
  • Applicable to a broad spectrum of applications
      • Computation/data-intensive, control-intensive, memory-intensive, partially timed
      • Scalable to large-size designs (finishes in a few seconds)
  • Amenable to a rich set of scheduling constraints
      • Resource constraints, latency constraints, frequency constraints, relative IO timing constraints
  • Capable of a variety of synthesis optimizations
      • Operation chaining, pipelining, multi-cycle communication, incremental scheduling, etc.

[Figure: operations +3, *1, *5, +2, +4 assigned to control steps CS0 and CS1 on a multiplier (*) and an adder (+).]

Page 12: Scheduling: Our Approach

Overall approach
  • Current objective: high performance
  • Use a system of integer difference constraints to express all kinds of scheduling constraints
  • Represent the design objective in a linear function

[Figure: data-flow graph over operations v1 to v5 (additions and multiplications).]

Platform characterization:
  • adder (+/-): 2 ns
  • multiplier (*): 5 ns
Target cycle time: 10 ns
Resource constraint: only ONE multiplier is available

Constraints, where x_i is the control step of operation v_i:

Dependency constraints
  • v1 → v3: x3 - x1 ≥ 0
  • v2 → v3: x3 - x2 ≥ 0
  • v3 → v4: x4 - x3 ≥ 0
  • v4 → v5: x5 - x4 ≥ 0

Frequency constraint
  • <v2, v5>: x5 - x2 ≥ 1

Resource constraint
  • <v2, v3>: x3 - x2 ≥ 1

In matrix form (the second row combines the v2 → v3 dependency with the resource constraint):

\[
A x \le b:\quad
\begin{bmatrix}
1 & 0 & -1 & 0 & 0 \\
0 & 1 & -1 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & 1 & -1 \\
0 & 1 & 0 & 0 & -1
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}
\le
\begin{bmatrix} 0 \\ -1 \\ 0 \\ 0 \\ -1 \end{bmatrix}
\]

Totally unimodular matrix: guarantees integral solutions
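The constraint system on this slide can be checked with a small sketch. The slides do not say how xPilot solves the SDC; the code below swaps in a plain Bellman-Ford style longest-path relaxation, which yields the earliest (ASAP) control step for each operation. The function name `solve_sdc` and the 0-based indexing are assumptions made for illustration, and the third dependency is read as x4 - x3 ≥ 0, matching the matrix.

```python
def solve_sdc(num_vars, constraints):
    """Solve x[j] - x[i] >= c for every (i, j, c) in constraints.

    Returns the smallest non-negative integer solution (ASAP schedule),
    or None when the constraints contain a positive cycle (infeasible).
    """
    x = [0] * num_vars
    for _ in range(num_vars):
        changed = False
        for i, j, c in constraints:
            if x[j] < x[i] + c:
                x[j] = x[i] + c  # push v_j late enough to honor the constraint
                changed = True
        if not changed:
            return x
    return None  # still changing after num_vars passes: infeasible


# Variables x1..x5 are stored at indices 0..4.
constraints = [
    (0, 2, 0),  # dependency v1 -> v3: x3 - x1 >= 0
    (1, 2, 0),  # dependency v2 -> v3: x3 - x2 >= 0
    (2, 3, 0),  # dependency v3 -> v4: x4 - x3 >= 0
    (3, 4, 0),  # dependency v4 -> v5: x5 - x4 >= 0
    (1, 4, 1),  # frequency <v2, v5>:  x5 - x2 >= 1
    (1, 2, 1),  # resource  <v2, v3>:  x3 - x2 >= 1
]
schedule = solve_sdc(5, constraints)
infeasible = solve_sdc(2, [(0, 1, 1), (1, 0, 0)])
```

With these constraints v1 and v2 land in control step 0 and v3, v4, v5 in control step 1. Because every constraint involves only two variables with coefficients +1 and -1, the relaxation stays in integers throughout, which mirrors the integrality property the slide attributes to the totally unimodular matrix.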

Page 13: Platform Modeling & Characterization

Target platform specification
  • High-level resource library with delay/latency/area/power curves for various input/bitwidth configurations
      • Functional units: adders, ALUs, multipliers, comparators, etc.
      • Connectors: mux, demux, etc.
      • Memories: registers, synchronous memories, etc.
  • Chip layout description
      • On-chip resource distributions
      • On-chip interconnect delay/power estimation

3X3 delay matrix for Stratix-EP1S40:

  4.7   3.8   2.8
  3.7   2.9   2.0
  2.8   1.8   0.58

Two binding solutions for the same behavior: which one is better?
  • The answer is platform-dependent: how large/fast are the MUX and ALU?

[Figure: two datapaths for the same behavior, one built from a MUX and one ALU, the other from two ALUs.]

Page 14: Communication- and Interconnect-Centric Synthesis: Example: Use of Distributed Register-File Architectures

[Figure: a scheduled DFG with register binding indicated on each variable (assume one-functional-unit constraint), shown with two bindings.]
  • Binding using discrete registers
  • Binding using a register file: more efficient design!

Distributed register-file micro-architecture:
  • Efficiently use on-chip embedded memories
  • Fully explore operation and data-transfer parallelism

[Figure: distributed register-file micro-architecture with Islands A, B, and C. Island A shows input buffers, data-routing logic, local register files, a FUP MUX, and a functional unit pool (MUL, ALU, ALU').]

Page 15: Distributed Register-File Microarchitecture

[Figure: Islands A, B, and C mapped onto an FP-SoC with on-chip memory blocks. Island A shows input buffers, data-routing logic, local register files, a FUP MUX, and a functional unit pool (MUL, ALU, ALU').]

On-chip RAM resource on Virtex II

  Xilinx XC-2V       2000    3000    4000    6000    8000
  #18Kb BRAM           56      96     120     144     168
  Dist. RAM (Kb)      336     448     720   1,056   1,456

Page 16: Resource Binding for DRF-Microarchitecture

Facts under simplified assumptions
  • Operations bound onto an island form a chain in the given scheduled DFG
  • Inter-chain data transfers may share a physical inter-island connection

The number of inter-island connections (IIC) is crucial to the QoR of a DRFM instance

[Figure: scheduled DFG of operations v1 to v10 over control steps 1 to 4, partitioned into island chains A, B, C, and D, with intra-island and inter-island transfers marked.]

Inter-island connections = 5
  • (A,B) = (A,D) = 1
  • (A,C) = 1, two data transfers share one connection
  • (C,D) = 2
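The connection count above can be reproduced with a rough sketch. The slide does not give the sharing rule or the individual transfers, so both are assumptions here: two transfers between the same pair of islands are taken to share a connection only when they occur in different control steps, and the `transfers` list is a hypothetical set chosen to match the per-pair numbers on the slide.

```python
from collections import Counter, defaultdict


def count_inter_island_connections(transfers):
    """transfers: iterable of (src_island, dst_island, control_step).

    Assumed rule: transfers between the same island pair can share one
    physical connection unless they happen in the same control step.
    """
    per_step = Counter()
    for src, dst, step in transfers:
        if src != dst:  # intra-island transfers need no inter-island wire
            per_step[(frozenset((src, dst)), step)] += 1
    needed = defaultdict(int)
    for (pair, _step), count in per_step.items():
        needed[pair] = max(needed[pair], count)
    return sum(needed.values())


# Hypothetical transfers consistent with the slide's per-pair counts.
transfers = [
    ("A", "B", 1),
    ("A", "D", 2),
    ("A", "C", 1), ("A", "C", 3),  # different steps: share one connection
    ("C", "D", 2), ("C", "D", 2),  # same step: two connections
    ("B", "B", 2),                 # intra-island: not counted
]
total_iic = count_inter_island_connections(transfers)
```

Under this rule the binding problem amounts to choosing chains so that concurrent inter-island transfers are rare, which is one way to read the slide's claim that IIC drives QoR.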

Page 17: Example: Behavior and Communication Co-Optimization in Platform-Based Interface Synthesis

Focus on sequential communication media (SCM)
  • FIFOs (e.g., Xilinx FSLs), buses (e.g., Xilinx CoreConnect, Altera Avalon, etc.)
  • Order may have dramatic impact on performance
      • Best order should guarantee that no data transmissions on the critical path are delayed by non-critical transmissions

Interface synthesis for SCM
  • Consider both behavior and communication to determine the optimal transmission order

DCT example

[Figure: process P1 on PE1 (custom logic 1) sends data[8] over channel C, a FIFO, to process P2 on PE2 (custom logic 2).]

Custom logic 1:

    for (int i = 0; i < 8; i++) {
    S1: data[i] = …;
    }

Custom logic 2:

    int s07 = data[0] + data[7];
    int s16 = data[1] + data[6];
    …..
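A small sketch can show why the transmission order matters in the DCT example. It assumes a hypothetical FIFO that delivers one element per cycle and only models the two sums shown on the slide; `ready_times` is an illustrative helper, not part of xPilot.

```python
def ready_times(send_order, pairs):
    """Cycle at which each consumer sum has both operands available.

    send_order: indices of data[] in the order they enter the FIFO,
                one element per cycle (assumed), first arrival at cycle 1.
    pairs:      (i, j) index pairs, each consumed as data[i] + data[j].
    """
    arrival = {idx: cycle for cycle, idx in enumerate(send_order, start=1)}
    return {pair: max(arrival[pair[0]], arrival[pair[1]]) for pair in pairs}


pairs = [(0, 7), (1, 6)]                  # s07 and s16 from the slide
natural = ready_times(list(range(8)), pairs)
reordered = ready_times([0, 7, 1, 6, 2, 3, 4, 5], pairs)
```

In the producer's natural loop order, s07 cannot start until the last element arrives at cycle 8. Sending data[0] and data[7] first makes it ready at cycle 2, which is the kind of reordering the slide describes as co-optimizing behavior and communication.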

Page 18: Proposed SCM Co-Optimization Design Flow

[Figure: SCOOP (SCM CO-Optimization) flow. Inputs: process network, process behavior, and platform description & constraints, read by the front end into the System-Level Synthesis Data Model. SCOOP steps: communication order detection, indices compression for loop reordering, code transformation and interface generation. Output: drivers + glue logic.]

Page 19: Initial Results of Interface Synthesis

Target for sequential communication channels
  • In particular, FSL in VirtexII
  • Consider two communicating processes

             Total latency (Cycle#)            RAs Compress
  Designs    Trad.    SCOOP    Reduction       Before    After
  DCT1         325      290      10.77%             0        0
  Haar         142      134       5.63%             0        0
  DWT          689      617      10.45%             0        0
  Mat_mul      408      339      16.91%         (22096)
  DCT2         483      419      13.25%         (6480)
  Masking      620      420      32.26%           192        0
  Dot         1903     1084      43.04%           300        0

For Mat_mul and DCT2 the After and Before digits run together in the source ("22096" and "6480", After first) and are left unsplit.

An average of 26% improvement in total latency can be achieved.
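The Reduction column follows directly from the two latency columns. As a quick check, the sketch below recomputes it from the cycle counts on this slide; the dictionary layout and the helper name `reduction_percent` are illustrative choices.

```python
# design: (traditional latency, SCOOP latency), both in cycles, from the slide
LATENCY = {
    "DCT1": (325, 290),
    "Haar": (142, 134),
    "DWT": (689, 617),
    "Mat_mul": (408, 339),
    "DCT2": (483, 419),
    "Masking": (620, 420),
    "Dot": (1903, 1084),
}


def reduction_percent(trad, scoop):
    """Latency reduction of SCOOP relative to the traditional flow, in percent."""
    return 100.0 * (trad - scoop) / trad


reductions = {
    design: round(reduction_percent(trad, scoop), 2)
    for design, (trad, scoop) in LATENCY.items()
}
```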

Page 20: SystemC/C-to-RTL Design Flow

[Figure: design flow. SystemC/C specification and platform description & constraints → front-end compiler → SSDM (System-Level Synthesis Data Model) → xPilot behavioral synthesis (SSDM/CDFG → behavioral synthesis → SSDM/FSMD → RTL generation) → FSM with datapath in VHDL, plus floorplan and/or multi-cycle path constraints → RTL synthesis → ASICs/FPGAs platform.]

Page 21: Preliminary Results of xPilot: Shorter Simulation/Verification Cycle

From other projects:
  • Simulation speed on behavior model 100X faster than RTL-based method [NEC, ASPDAC04]

Our experience:
  • Motion-compensation module in an MPEG-4 decoder
      • Behavior-level (in C language) simulation: less than 1 second per frame
      • RTL SystemC simulation: about 310 seconds per frame

Page 22: Preliminary Results of xPilot: Better Complexity Management

Significant code size reduction
  • RTL design → behavioral design: 10x code size reduction

[Figure: VHDL code generated by UCLA xPilot targeting Altera Stratix platform.]

Page 23: Preliminary Results of xPilot: Rapid System Exploration

Quick evaluation of different hardware/software boundaries

Example: Motion-JPEG implementation
  • All-HW implementation
  • All-SW implementation (using embedded processors)
  • SW/HW co-design: optimal partitioning?
  • Repeated manual RTL coding is not a solution!

Page 24: Preliminary Results on Motion-JPEG Example

[Figure: Motion-JPEG encoder on a Xilinx XUP board, turning RAW images into encoded JPEG images. Blocks: Preprocess, DCT (HW-DCT in the second model), Quant, Huffman, Table Modification.]

  • Model #1: 5 MicroBlazes
  • Model #2: 4 MicroBlazes + DCT on FPGA fabric
  • FSL-based communication

  System      Area (Slice#)    Cycle#           Fmax (MHz)    Exe Time (ms)
  Model #1    4306             23812            126           0.189
  Model #2    6345             14800 (-38%)     126           0.117
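The execution times in the table follow from cycle count divided by clock frequency. A minimal check, using only the numbers on this slide (the helper name `exe_time_ms` is illustrative):

```python
def exe_time_ms(cycles, fmax_mhz):
    """Execution time in milliseconds for a cycle count at a clock given in MHz."""
    return cycles / (fmax_mhz * 1e6) * 1e3


model1_ms = exe_time_ms(23812, 126)   # Model #1: 5 MicroBlazes
model2_ms = exe_time_ms(14800, 126)   # Model #2: 4 MicroBlazes + HW DCT
cycle_change_pct = 100.0 * (14800 - 23812) / 23812
```

Both models run at the same Fmax, so the 38% cycle reduction carries over unchanged to execution time; the cost is the larger slice count of Model #2.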

Page 25: Preliminary Results of xPilot: Better QoR (Comparison with UCI/UCSD SPARK)

Device setting: Xilinx Virtex-II pro (xc2v4000 -6)
Target frequency: 200 MHz

              SPARK                                     xPilot                                    Delay Ratio
              Resource Usage              Fmax          Resource Usage              Fmax          xPilot/SPARK
  Designs     Slice    LUT     FF   DSP   (MHz)         Slice    LUT     FF   DSP   (MHz)
  PR            588    981    247     0    92.85          331    416    564    16   146.84        1.58
  WANG          660   1157    265     0   109.29          357    464    588    15   133.51        1.22
  LEE           574    996    220     0   109.17          356    484    659    19   131.93        1.21
  MCM          1062   1857    479     0    99.40          887   1207   1282    30   110.38        1.11
  DIR          1323   2256    494     3    79.30          979   1002   1732    56    98.81        1.25
  Ave Ratio       1      1      1     1     1.00         0.66   0.48   2.74   n/a     1.27        1.27
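The ratio rows of this comparison can be recomputed from the per-design numbers. The sketch below is only a consistency check on the slide's data: the values in the last column coincide with xPilot Fmax divided by SPARK Fmax, and the Ave Ratio row matches the mean of the per-design xPilot/SPARK ratios. The column names Slice, LUT, and FF are a reading of the garbled header and should be treated as an assumption.

```python
# (design, SPARK (slice, lut, ff, fmax_mhz), xPilot (slice, lut, ff, fmax_mhz))
ROWS = [
    ("PR",   (588,  981,  247,  92.85), (331,  416,  564,  146.84)),
    ("WANG", (660,  1157, 265, 109.29), (357,  464,  588,  133.51)),
    ("LEE",  (574,  996,  220, 109.17), (356,  484,  659,  131.93)),
    ("MCM",  (1062, 1857, 479,  99.40), (887,  1207, 1282, 110.38)),
    ("DIR",  (1323, 2256, 494,  79.30), (979,  1002, 1732,  98.81)),
]


def mean_ratio(column):
    """Mean over designs of the xPilot/SPARK ratio for one column index."""
    ratios = [xpilot[column] / spark[column] for _, spark, xpilot in ROWS]
    return sum(ratios) / len(ratios)


fmax_ratios = {name: round(xpilot[3] / spark[3], 2) for name, spark, xpilot in ROWS}
averages = [round(mean_ratio(col), 2) for col in range(4)]  # slice, LUT, FF, Fmax
```

Read this way, xPilot uses about a third fewer slices and half the LUTs on average while running about 27% faster, at the price of more flip-flops and DSP blocks.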


Page 27: Design Exploration for Heterogeneous MPSoC Platforms

Heterogeneous MPSoCs exploration
  • Processors
      • Heterogeneous vs. homogeneous
      • General-purpose vs. application-specific
  • On-chip communication architecture (OCA)
      • Bus (e.g. AMBA, CoreConnect), packet switching network (e.g. Alpha 21364)
  • Memory hierarchy

[Figure: MPSoC with µP, DSP, IP, and FPGA blocks (µPs running OS, drivers, and tasks), each attached through a network interface to a communication network.]

Page 28: Configurable SoC Platforms

General-purpose processor cores + programmable fabric
  • Tight integration using extended instructions (ASIPs)
      • Example: Altera Nios / Nios II
  • Loose integration using FIFOs/busses for communications
      • Example: Xilinx MicroBlaze, etc.

[Figure: custom instruction logic for Nios II. Source: www.altera.com]

[Figure: Xilinx MicroBlaze. Source: www.xilinx.com]

Page 29: ASIP Compilation: Problem Statement

Given:
  • CDFG G(V, E)
  • The basic instruction set I
  • Pattern constraints:
      • Number of inputs: |PI(p_i)| ≤ N_in
      • Number of outputs: |PO(p_i)| = 1
      • Total area: \( \sum_{1 \le i \le N} \mathrm{area}(p_i) < A \)

Objective:
  • Generate a pattern library P
  • Map G to the extended instruction set I ∪ P, so that the total execution time is minimized

Example (* takes 2 clock cycles, + takes 1 clock cycle):

    t1 = a * b;
    t2 = b * c;
    t3 = d * e;
    t4 = t1 + t2;
    t5 = t2 + t3;
    t6 = t5 + t4;

With extended instructions ext-inst1 (MAC1: 2 cycles) and ext-inst2 (MAC2: 2 cycles):

    t4 = ext-inst1(a, b, c);
    t5 = ext-inst2(b, c, d, e);
    t6 = t4 + t5;

Performance speedup = 9 / 5 = 1.8X
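The 9 / 5 figure can be reproduced by summing instruction latencies. The sketch assumes, as the slide's arithmetic implies, that instructions execute one after another on a single-issue core; the `LATENCY` table and `total_cycles` helper are illustrative names.

```python
# Latencies in clock cycles, taken from the slide.
LATENCY = {"*": 2, "+": 1, "ext-inst1": 2, "ext-inst2": 2}

original = ["*", "*", "*", "+", "+", "+"]    # t1 .. t6 with the basic instruction set
extended = ["ext-inst1", "ext-inst2", "+"]   # t4, t5, t6 with the extended instructions


def total_cycles(ops):
    """Total cycles when the instructions issue strictly one after another."""
    return sum(LATENCY[op] for op in ops)


speedup = total_cycles(original) / total_cycles(extended)
```

Each MAC pattern covers two multiplies and an add (5 cycles of basic instructions) in 2 cycles, and the shared product b * c is recomputed inside both patterns rather than passed between them.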

Page 30: Target Core Processor Model

[Figure: core processor pipeline with PC, adder (+4), inst cache, reg file (RS1, RS2), MUX, ALU (OP1, OP2), memory, result, the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, and a custom logic block.]

Core processor model
  • Classic single-issue pipelined RISC core (fetch / decode / execute / mem / write-back)
      • The number of input and output operands of an instruction is pre-determined
      • An instruction reads the core register file during the execute stage, and commits the result during the write-back stage

Page 31: ASIP Compilation Flow

[Figure: C code and µArch constraint → front-end compilation → CDFG → pattern generation and pattern selection → pattern library → application mapping & graph covering → optimized CDFG → backend compilation → optimized assembly.]

1. Pattern generation: satisfying input/output constraints
2. Pattern selection: select a subset to maximize the potential speedup while satisfying the resource constraint
3. Application mapping & graph covering: graph covering to minimize the total execution time

Page 32: Experimental Results on Altera Nios

  • Altera Nios is used for ASIP implementation
      • 5 extended instruction formats
      • Up to 2048 instructions for each format
  • Small DSP applications are taken as benchmarks

             Extended         Speedup                Resource Overhead
  Design     Instruction#     Estimation    Nios     LE              Memory             DSP Block
  fft_br     9                3.28          2.65     408 (6.06%)     65,536 (9.79%)     16
  iir        7                3.18          3.73     255 (3.79%)     4,736 (0.71%)      40
  fir        2                2.40          2.14     51 (0.76%)      1,024 (0.15%)      8
  pr         2                1.57          1.75     71 (1.05%)      0 (0.00%)          14
  dir        2                3.28          3.02     54 (0.80%)      0 (0.00%)          16
  mcm        4                4.75          3.22     186 (2.76%)     0 (0.00%)          56
  Average                     3.08          2.75     - (2.54%)       - (1.77%)          -

Page 33: Architecture Extension for ASIPs

Data bandwidth problem
  • Limited register file bandwidth (two read ports, one write port)
  • ~40% of the ideal performance speedup will be lost

Shadow-register-based architectural extension
  • Core registers are augmented by an extra set of shadow registers
      • Conditionally written during write-back stage
      • Low power/area overhead
  • Novel shadow-register binding algorithms are developed

[Figure: core processor pipeline (IF/ID, ID/EX, EX/MEM, MEM/WB) with PC, inst cache, reg file, ALU, and memory, extended with shadow registers SR1 … SRK, hashing units (k = hash(j)), and the custom logic block.]


Ongoing Work: Mapping for Heterogeneous Integration with Multiple Processing Cores

Given:
• A library of processing cores P and communication library C
• Task graph G(V, E)
  • For each v in V, execution time t(v, p_i) on p_i
  • For each (u, v) in E, communication data size s(u, v)
• Throughput constraint

Problem:
• Select and instantiate the processing elements and communication channels from P and C, respectively
• Map the tasks onto the processing elements and communications to the channels so that
  • The optimal latency is achieved subject to the throughput constraint
  • The implementation cost is minimized
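To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not xPilot's algorithm, which the slides do not describe. It assumes one instance per selected core type, a single channel type shared by all inter-core edges, a throughput constraint modeled as a bound on the total load per core, latency computed as the longest path while ignoring contention between tasks on the same core, and a lexicographic objective (latency first, then cost). All task names, core names, and numbers are hypothetical.

```python
from itertools import product


def explore(tasks, edges, exec_time, core_cost, channels, period):
    """Exhaustively search task-to-core mappings and channel choices.

    tasks     : task names in topological order
    edges     : {(u, v): data size s(u, v)}
    exec_time : {(v, p): execution time t(v, p)}
    core_cost : {p: cost of instantiating core p}
    channels  : {c: (bandwidth, cost)}
    period    : throughput constraint, as the largest allowed load per core
    Returns ((latency, cost), mapping, channel), or None if infeasible.
    """
    best = None
    for assignment in product(list(core_cost), repeat=len(tasks)):
        mapping = dict(zip(tasks, assignment))
        # Throughput check: total work mapped to each core per iteration.
        load = {}
        for v, p in mapping.items():
            load[p] = load.get(p, 0) + exec_time[v, p]
        if max(load.values()) > period:
            continue
        needs_channel = any(mapping[u] != mapping[v] for u, v in edges)
        for c, (bandwidth, channel_cost) in channels.items():
            # Longest path through the task graph, with transfer delays
            # only on edges that cross between different cores.
            finish = {}
            for v in tasks:
                start = 0.0
                for (u, w), size in edges.items():
                    if w == v:
                        crosses = mapping[u] != mapping[v]
                        comm = size / bandwidth if crosses else 0.0
                        start = max(start, finish[u] + comm)
                finish[v] = start + exec_time[v, mapping[v]]
            latency = max(finish.values())
            cost = sum(core_cost[p] for p in load)
            cost += channel_cost if needs_channel else 0
            candidate = ((latency, cost), mapping, c if needs_channel else None)
            if best is None or candidate[0] < best[0]:
                best = candidate
    return best


# Hypothetical three-task example.
tasks = ["vld", "idct", "mc"]
edges = {("vld", "idct"): 64, ("vld", "mc"): 64}
exec_time = {
    ("vld", "cpu"): 6, ("vld", "acc"): 12,
    ("idct", "cpu"): 8, ("idct", "acc"): 2,
    ("mc", "cpu"): 7, ("mc", "acc"): 3,
}
core_cost = {"cpu": 10, "acc": 5}
channels = {"bus": (32, 1), "p2p": (64, 3)}

best = explore(tasks, edges, exec_time, core_cost, channels, period=10)
print(best)
```

With these numbers only one mapping meets the load bound (`vld` on `cpu`, the other two tasks on `acc`), and the faster `p2p` channel wins on latency despite its higher cost. Exhaustive search grows exponentially with the number of tasks, so it is useful only as a reference for small instances.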


MPEG-4 Simple Profile Decoder: Architecture Profiling

• C specification overview

Module Name          Orig. C Source File      Orig. C line #
Copy Controller      copyControl.c            287
Display Controller   displayControl.c         358
Motion Comp.         Motion-Compensation.c    312
Parser/VLD           parser.c                 1092
                     texture_vld.c            508
Texture/IDCT         texture_idct.c           1901
Texture Update       textureUpdate.c          220

• Runtime Profiling (PowerPC/XUP board)

Parser/VLD           59.0%
Texture/IDCT         18.1%
Motion Comp.         15.7%
Copy Controller      3.6%


MPEG-4 Simple Profile Decoder: Hybrid HW/SW Implementation

Software blocks running on PowerPC

HW block integrated with PowerPC single-process design:
15% speed improvement
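One way to put the 15% figure in context is Amdahl's law. The sketch below assumes that the runtime profile from the profiling slide applies here (Motion Comp. at 15.7% of runtime on the PowerPC) and that only that block moves to hardware. The use of Amdahl's law, the function name, and the block-level speedups tried in the loop are illustrative additions, not something stated on the slides.

```python
def amdahl_speedup(fraction, block_speedup):
    """Overall speedup when `fraction` of the runtime is made
    `block_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / block_speedup)


MOTION_COMP_SHARE = 0.157  # runtime share of Motion Comp. in the profile

# Upper bound: the accelerated block takes no time at all.
upper_bound = 1.0 / (1.0 - MOTION_COMP_SHARE)
print(f"upper bound on overall speedup: {upper_bound:.3f}x")

for s in (2, 5, 10):  # hypothetical block-level speedups
    overall = amdahl_speedup(MOTION_COMP_SHARE, s)
    print(f"block {s}x faster -> overall {overall:.3f}x")
```

Under these assumptions the overall speedup cannot exceed roughly 1.19x, so an improvement of about 15% is close to what offloading this one block can deliver.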


MPEG-4 Simple Profile Decoder: Alternate Implementations

                                 Single uBlaze   7-uBlaze   Single PowerPC   Single PowerPC w/ HW Motion Comp.
Throughput (Frames per Second)   0.59            1.18       3.06             3.53
Improvement                      -               +209%      +68.4%           +15.3%

• xPilot Synthesis Report of HW blocks

                 Line counts                      Slices (FFs, LUTs)   MUL   Clock period (ns)   Latency (Cycles)
                 C     RTL SystemC   RTL VHDL
Motion Comp.     210   9903          5655         986 (1111, 1017)     2     7.97                505
Block IDCT       200   9534          2731         1877 (2376, 2438)    26    7.963               280
Texture Update   160   8227          4475         1551 (1696, 1931)    4     7.913               335
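The latency in cycles and the clock period in the synthesis report can be combined into a wall-clock latency per block. The Python sketch below assumes the reported cycle count is the latency of one invocation of the block and that the block runs at the reported clock period; the helper names are illustrative.

```python
# (latency in cycles, clock period in ns), copied from the synthesis report.
hw_blocks = {
    "Motion Comp.": (505, 7.97),
    "Block IDCT": (280, 7.963),
    "Texture Update": (335, 7.913),
}


def latency_ns(cycles, period_ns):
    """Wall-clock latency of one invocation, in nanoseconds."""
    return cycles * period_ns


def max_clock_mhz(period_ns):
    """Highest clock frequency implied by the clock period, in MHz."""
    return 1000.0 / period_ns


for name, (cycles, period) in hw_blocks.items():
    print(
        f"{name}: {latency_ns(cycles, period):.1f} ns "
        f"at up to {max_clock_mhz(period):.1f} MHz"
    )
```

All three blocks close timing just under 8 ns, that is, a little above 125 MHz.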


Conclusions

xPilot has fairly mature and advanced behavior synthesis capability, from C or SystemC to RTL code with the necessary design constraints

xPilot advantages include
• Platform-based behavior and system synthesis
• Communication/interconnect-centric approach
• Advanced algorithms for platform-based, communication-centric optimization
• Promising results demonstrated on available FPGAs

xPilot system synthesis capabilities
• Performance simulation of multi-processor systems
• Exploration of the efficient use of (multiple) on-chip processors
• Compilation and optimization for reconfigurable processors


Acknowledgements

We would like to acknowledge support from
• Gigascale Systems Research Center (GSRC)
• National Science Foundation (NSF)
• Semiconductor Research Corporation (SRC)
• Industrial sponsors under the California MICRO programs (Altera, Xilinx)

Team members:
• Yiping Fan
• Zhiru Zhang
• Wei Jiang
• Guoling Han