MPSoC Design Flow: Case Study for H.264 †

Kai Huang
[email protected]
Institute of VLSI Design, Zhejiang University, China
August, ICDFN 2007

† "Simulink-Based MPSoC Design Flow: Case Study of MJPEG and H.264", published at DAC 2007



Motivation

System level design: key solution to complex MPSoC

Simulink
- The prevailing environment for modeling and simulating complex systems at algorithm level
- An open issue: algorithm/architecture mapping for MPSoC

SystemC
- A preferred HW/SW codesign language within a wide range of abstraction levels
- Intrinsically a low-level language: NOT easy to specify a complex system at algorithm level

[Figure: the system design gap between the application algorithm (Simulink) and the architecture (SystemC). Abstraction levels from high to low:]
- Algorithm/architecture mapping (System Level)
- Protocol accurate (Virtual Architecture Level)
- Synchronization accurate (Transaction Level)
- Cycle accurate (Virtual Prototype Level)
- Physical implementation

Objective

1) System level MPSoC design flow
- Combine Simulink with SystemC
  - Simulink for high-level algorithm modeling
  - SystemC for low-level HW/SW design
- Concurrent HW/SW design
  - Seamless refinement across models at different abstraction levels
  - Systematic and automated HW/SW code generation

2) Case study for multimedia applications
- Feasibility and efficiency of the proposed design flow
  - System functional validation
  - HW/SW co-design/co-verification
  - Performance analysis for architecture exploration

Content

- Motivation & Objective
- Simulink-Based MPSoC Design Flow
  - HW and SW Mixed Models
  - Design Steps
- Case Study
  - H.264 Baseline Decoder
  - Experiment
- Conclusions & Future Works

Overall Design Flow

[Figure: overall design flow, with the labels "Step i" and "Mixed HW and SW Model".]

HW and SW Mixed Models

Seamless refinement at four abstraction levels, from high to low:
- System level model (Simulink CAAM: Simulink Combined Algorithm and Architecture Model)
- Virtual architecture model (SystemC)
- Transaction accurate model (SystemC)
- Virtual prototype model (SystemC)

[Figure: three CPU subsystems (CPU SS1, CPU SS2, CPU SS3). Each runs application threads (Thread Main 1, Thread Main 2) on top of HdS and connects through ports to abstract channels and an interconnecting bus. Side labels: Hardware Design, Software Design, Target CPU Design, OS and HW codesign, HW-SW Codesign.]

Step 1: Simulink Modeling

The application C/C++ code is turned into a set of modular functions:
- User-defined Simulink blocks (e.g. S-function)
- Pre-defined Simulink blocks (e.g. mathematical operation)

Step 2: Application/Architecture Mapping

[Figure: design flow.
 Application -> (1) Simulink modeling -> Simulink application model
 -> (2) Application/Architecture mapping -> Simulink CAAM
 -> (3) Simulink parsing -> Colif CAAM
 -> (4) HW Architecture Gen. and (5) Multithread Code Gen.
 -> Virtual Architecture Model, Transaction Accurate Model, Virtual Prototype Model
 HW Library: Comp. Subsystems, Comm. Channels
 SW Library: Thread Library, HdS Library]

[Figure: Simulink application model (blocks F0 to F11, Fsw, delays Z-1 and Z-k, IAS0, IAS1) mapped onto a HW architecture template with CPU 1 SS, CPU 2 SS and CPU 3 SS, connected by GFIFO inter-SS communication.]

Simulink CAAM layers:
1) Architecture layer: CPU SSs, Inter-SS comm.
2) Subsystem layer: Threads, Intra-SS comm.
3) Thread layer: Simulink blocks, links
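The three CAAM layers can be pictured as a nested data structure. The following is a minimal sketch in Python, not the tool's actual representation: the class and field names (`Caam`, `CpuSubsystem`, `Thread`, `Channel`) are hypothetical, and the assignment of blocks F0 to F5 to threads is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Thread:
    """Thread layer: a thread groups Simulink blocks (links omitted here)."""
    name: str
    blocks: List[str] = field(default_factory=list)


@dataclass
class Channel:
    """A communication channel between two endpoints, tagged with a protocol."""
    src: str
    dst: str
    protocol: str  # e.g. "GFIFO" for inter-SS, "SWFIFO" for intra-SS


@dataclass
class CpuSubsystem:
    """Subsystem layer: threads plus intra-subsystem channels."""
    name: str
    threads: List[Thread] = field(default_factory=list)
    intra_channels: List[Channel] = field(default_factory=list)


@dataclass
class Caam:
    """Architecture layer: CPU subsystems plus inter-subsystem channels."""
    subsystems: List[CpuSubsystem] = field(default_factory=list)
    inter_channels: List[Channel] = field(default_factory=list)

    def block_count(self) -> int:
        """Total number of Simulink blocks over all threads."""
        return sum(len(t.blocks) for ss in self.subsystems for t in ss.threads)


# Hypothetical mapping of blocks F0..F5 onto two CPU subsystems.
caam = Caam(
    subsystems=[
        CpuSubsystem("CPU1_SS", threads=[Thread("T1", ["F0", "F1", "F2"])]),
        CpuSubsystem(
            "CPU2_SS",
            threads=[Thread("T2", ["F3", "F4"]), Thread("T3", ["F5"])],
            intra_channels=[Channel("T2", "T3", "SWFIFO")],
        ),
    ],
    inter_channels=[Channel("CPU1_SS", "CPU2_SS", "GFIFO")],
)
print(caam.block_count())
```

Keeping inter-SS channels at the top level and intra-SS channels inside each subsystem mirrors the layer split on the slide.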

Step 3: Simulink Parsing

[Figure: design flow diagram repeated from Step 2.]

Colif is an XML-based meta-model:
- Well-defined data structures
- Modules, channels and ports

Simulink CAAM -> Colif CAAM:
- One-to-one correspondence
- Simulink port to Send/Receive block
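To make the idea of an XML meta-model built from modules, channels and ports concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. It does not reproduce the real Colif schema: the element and attribute names (`caam`, `module`, `port`, `channel`, `src`, `dst`) and the helper `build_colif_like` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def build_colif_like(modules, channels):
    """Serialize modules, ports and channels to an XML string.

    modules:  dict mapping module name -> list of port names
    channels: list of (channel name, "module.port" source, "module.port" sink)
    Element names are illustrative only, not the actual Colif schema.
    """
    root = ET.Element("caam")
    for mod_name, ports in modules.items():
        mod = ET.SubElement(root, "module", name=mod_name)
        for port in ports:
            ET.SubElement(mod, "port", name=port)
    for ch_name, src, dst in channels:
        ET.SubElement(root, "channel", name=ch_name, src=src, dst=dst)
    return ET.tostring(root, encoding="unicode")


xml_text = build_colif_like(
    modules={"CPU1_SS": ["out0"], "CPU2_SS": ["in0"]},
    channels=[("ch0", "CPU1_SS.out0", "CPU2_SS.in0")],
)
print(xml_text)
```

Each module maps to one element and each port to one child element, so the one-to-one correspondence with the source model is easy to check.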

Step 4: HW Architecture Generation

[Figure: design flow diagram repeated from Step 2.]

[Figure: VP HW platform with a global shared bus.]

Step 5: Multithreaded Code Generation

[Figure: design flow diagram repeated from Step 2.]

*** Copy removal and buffer sharing are used for memory optimization
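The slide names buffer sharing but does not describe how it is done. As an illustration only, the sketch below uses a generic greedy lifetime-based scheme: two buffers may occupy the same memory slot when their lifetimes in the thread's static schedule do not overlap. The function `share_buffers` and the example buffers are hypothetical and are not taken from the code generator.

```python
def share_buffers(buffers):
    """Greedy lifetime-based buffer sharing.

    buffers: list of (name, size_bytes, first_use, last_use), where the use
    indices are positions in the thread's static schedule (inclusive).
    Returns (assignment, total_bytes): assignment maps each buffer name to a
    slot index; two buffers share a slot only if their lifetimes are disjoint.
    """
    slots = []       # each slot: [size_bytes, last_use of current occupant]
    assignment = {}
    for name, size, first, last in sorted(buffers, key=lambda b: b[2]):
        for idx, slot in enumerate(slots):
            if slot[1] < first:          # previous occupant is no longer live
                slot[0] = max(slot[0], size)
                slot[1] = last
                assignment[name] = idx
                break
        else:
            slots.append([size, last])   # no reusable slot: allocate a new one
            assignment[name] = len(slots) - 1
    return assignment, sum(s[0] for s in slots)


# Hypothetical buffers of one thread: (name, bytes, first use, last use).
example = [("a", 256, 0, 1), ("b", 256, 1, 2), ("c", 128, 2, 3), ("d", 64, 3, 4)]
assignment, total = share_buffers(example)
unshared = sum(b[1] for b in example)
print(assignment, total, unshared)
```

In this toy case the four buffers fit into two slots, so the shared footprint is smaller than the sum of the individual sizes.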


H.264 Baseline Decoder

Receives an encoded video bit stream and performs iterative executions of macroblock-level functions:
VLD: Variable Length Decoder
IQ: Inverse Quantization
IT: Inverse Transform
SC: Spatial Compensation
MC: Motion Compensation
REC: Reconstruction
DF: Deblock Filter

[Figure: block diagram with the stages VLD, MC/SC and REC & DF under a Global ctrl block.
 Macroblock VLD feeds Luminance VLD, Chroma U VLD and Chroma V VLD.
 Luma decoding (16x16 macroblock, 8x8 blocks, 4 times): IQ/IT, MC/SC and REC for the 1st to 4th 8x8 luma blocks, followed by Luminance DF.
 Chroma decoding (8x8 blocks, 2 times): IQ/IT, MC/SC, REC and DF for Chroma U and Chroma V.]
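The block sizes on this slide translate into a small amount of arithmetic. The sketch below counts macroblocks per frame and 8x8 blocks per macroblock; it assumes the standard QCIF resolution of 176x144 pixels and 4:2:0 chroma subsampling, neither of which is stated on the slide, and the function names are hypothetical.

```python
QCIF_WIDTH, QCIF_HEIGHT = 176, 144   # assumed QCIF luma resolution in pixels
MB_SIZE = 16                         # a macroblock covers 16x16 luma samples
BLOCK_SIZE = 8                       # the decoder model works on 8x8 blocks


def macroblocks_per_frame(width, height, mb_size=MB_SIZE):
    """Number of whole macroblocks in one frame."""
    return (width // mb_size) * (height // mb_size)


def blocks_per_macroblock(mb_size=MB_SIZE, block_size=BLOCK_SIZE):
    """Return (luma blocks, chroma blocks) per macroblock, assuming 4:2:0."""
    luma = (mb_size // block_size) ** 2   # four 8x8 luma blocks
    chroma = 2                            # one 8x8 block each for U and V
    return luma, chroma


mbs = macroblocks_per_frame(QCIF_WIDTH, QCIF_HEIGHT)
luma, chroma = blocks_per_macroblock()
print(mbs, luma, chroma, 10 * mbs)
```

The luma and chroma counts match the "4 times" and "2 times" loop annotations in the figure.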

Simulink Model

[Figure: Simulink model of the decoder with the subsystems VLD, Luma first/second/third/fourth 8x8 block SC/MC, Luma REC, Chroma U Decoding and Chroma V Decoding.]

The model contains:
- 83 S-Functions
- 24 delays
- 286 data links
- 43 if-action-subsystems
- 5 for-iteration subsystems
- 101 pre-defined Simulink blocks

A Simulink CAAM Example

- 4 CPU SS
- 4 Threads
- Inter-SS COM: GFIFO
- Processor: ARM7

[Figure: four CPU subsystems (Global CTRL & VLD, Luma MC/SC/REC, Luma DF, Chroma Decoding) connected by GFIFO channels.]

Experiment: Simulation Time

An experiment for decoding 10 frames of QCIF FOREMAN:

Model      Simulation time   Speed
VP         11066s            6.5K/s
TA         325s              224K/s
VA         2.8s              26.0M/s
Simulink   20.0s             3.6M/s
RTW        0.5s              146M/s

- Four ARM7 processors (VP)
- GFIFO (TA, VP)
- RTW (a sequential program running on the host machine)

VP simulation takes too long for debugging the whole system. It is necessary to make use of TA for HW/SW co-simulation.
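The gap between the models is easier to judge as a ratio. The sketch below takes the simulation times from this slide and computes how much faster each model runs than the VP model; the helper name `speedup_over` is hypothetical.

```python
# Simulation time in seconds for each model, taken from the slide.
sim_time_s = {"VP": 11066.0, "TA": 325.0, "VA": 2.8, "Simulink": 20.0, "RTW": 0.5}


def speedup_over(baseline, times):
    """Return, for every model, the baseline time divided by that model's time."""
    base = times[baseline]
    return {name: base / t for name, t in times.items()}


speedups = speedup_over("VP", sim_time_s)
for name, s in speedups.items():
    print(f"{name:9s} {s:10.1f}x faster than VP")
```

The TA model comes out roughly 34 times faster than VP, which is the basis for the slide's recommendation to use TA for co-simulation.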

Experiment: Performance Optimization with Different Architectures

A popular and simple task partition strategy was used in this experiment:
- Computation-based
- Step by step

[Figure: functions F1 to F8 and subsystem SS1.]
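The slide does not spell out the partition strategy beyond "computation-based". One common heuristic of that kind is to place the heaviest task first and always onto the least-loaded CPU. The sketch below implements that heuristic as an assumption, not as the authors' method; the per-function costs for F1 to F8 are invented.

```python
import heapq


def partition_by_computation(costs, n_cpus):
    """Assign tasks to CPUs, heaviest first, always onto the least-loaded CPU."""
    heap = [(0, cpu) for cpu in range(n_cpus)]   # (current load, cpu index)
    heapq.heapify(heap)
    mapping = {cpu: [] for cpu in range(n_cpus)}
    for task, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, cpu = heapq.heappop(heap)          # least-loaded CPU so far
        mapping[cpu].append(task)
        heapq.heappush(heap, (load + cost, cpu))
    loads = {cpu: sum(costs[t] for t in tasks) for cpu, tasks in mapping.items()}
    return mapping, loads


# Hypothetical computation costs for the functions F1..F8.
costs = {"F1": 8, "F2": 7, "F3": 6, "F4": 5, "F5": 4, "F6": 3, "F7": 2, "F8": 1}
mapping, loads = partition_by_computation(costs, n_cpus=2)
print(mapping, loads)
```

This balances computation only. Communication cost between functions placed on different CPUs is ignored, which is one reason such a simple strategy may stop scaling as CPUs are added.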

Experiment Result of H.264 Baseline Decoder with Three Architectures

Function Block               6ARM   4ARM   2ARM
Luma MC/SC and Luma Rec      CPU6   CPU4   CPU2
Luma IQ/IT                   CPU5   CPU4   CPU2
Luma DF                      CPU4   CPU3   CPU2
Luma VLD                     CPU3   CPU4   CPU2
Chroma decoding              CPU2   CPU2   CPU2
Global control and MB VLD    CPU1   CPU1   CPU1

(a) Execution time (total execution cycles): 2ARM 150, 4ARM 102, 6ARM 100

[Figure (b): H264_GFIFO_2ARM, per-CPU bar chart.]
[Figure (c): H264_GFIFO_4ARM, per-CPU bar chart for CPU1 to CPU4.]

Tradeoff between performance, cost and flexibility:
- High-performance dedicated processor?
- Fine-granularity task partition?
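The execution-cycle figures from chart (a) (150, 102 and 100 for 2, 4 and 6 ARM processors) can be turned into relative speedup and efficiency. This is a small sketch with hypothetical helper names; the unit of the cycle counts is whatever the chart plots and does not affect the ratios.

```python
# Total execution cycles per architecture, as read from chart (a).
cycles = {"2ARM": 150, "4ARM": 102, "6ARM": 100}
cpus = {"2ARM": 2, "4ARM": 4, "6ARM": 6}


def relative_speedup(cycles, baseline="2ARM"):
    """Speedup of each architecture relative to the baseline architecture."""
    return {k: cycles[baseline] / v for k, v in cycles.items()}


def relative_efficiency(cycles, cpus, baseline="2ARM"):
    """Speedup divided by the growth in CPU count relative to the baseline."""
    sp = relative_speedup(cycles, baseline)
    return {k: sp[k] / (cpus[k] / cpus[baseline]) for k in cycles}


speedup = relative_speedup(cycles)
efficiency = relative_efficiency(cycles, cpus)
for arch in cycles:
    print(f"{arch}: speedup {speedup[arch]:.2f}, efficiency {efficiency[arch]:.2f}")
```

Going from four to six processors barely changes the cycle count, which is what motivates the tradeoff questions on the slide.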


Conclusions & Future Works

Proposed a Simulink-based MPSoC design flow
- For automated concurrent hardware-software design and verification
- Refines the Simulink CAAM to three different abstraction level models (VA, TA, VP)

The case study of the H.264 decoder shows the feasibility and efficiency of the proposed design flow
- Functional evaluation
- HW/SW codesign
- Performance analysis for architecture exploration

Plan to improve the current design flow:
- Dedicated instruction sets
- Communication protocol with DMA
- Automatic design space exploration

Acknowledgements

Thanks to:
- Prof. Ahmed Jerraya (CEA-LETI, MINATEC, France)
- Sang-Il Han (Seoul National University, Korea)
- Katalin Popovici (TIMA Laboratory, France)
- Lisane Brisolara (Federal University of Rio Grande do Sul, Brazil)

Thank you very much for your attention!

APPENDIX

Simulink CAAM of M-JPEG Decoder

- Four threads and three CPUs
- ARM processor and GFIFO/SWFIFO
- 7 S-Functions
- 7 Delays
- 26 Links
- 4 IASs

[Figure: the M-JPEG decoder CAAM shown at the architecture layer, subsystem layer and thread layer.]

Inter/Intra-Subsystem Communication

Protocol   Data buffer      Sender sync. addr   Receiver sync. addr   Description
SWFIFO     Local memory     @ local memory      @ local memory        Software FIFO
SHM        Local memory     @ local memory      @ local memory        Shared memory
HWFIFO     Hardware queue   @ hardware FIFO     @ hardware FIFO       Hardware FIFO
GFIFO      Shared memory    @ mailbox           @ mailbox             Software FIFO via global shared memory
LFIFO      Local memories   @ mailbox           @ mailbox             Software FIFO via local memories
DMS        Local memories   @ MSAP              @ MSAP                Distributed memory server with DMA
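Several of the protocols in this table are software FIFOs over a memory buffer. As a rough illustration of that idea, the sketch below shows a fixed-capacity ring buffer for one producer and one consumer. It is not the HdS library code: the class `RingFifo` is hypothetical, and the points where `put` and `get` report "full" or "empty" stand in for the sender and receiver synchronization that the table places at a mailbox or local memory address.

```python
class RingFifo:
    """Fixed-capacity FIFO over a preallocated buffer (single producer/consumer)."""

    def __init__(self, capacity):
        self._buf = [None] * capacity
        self._head = 0      # index of the next slot to read
        self._count = 0     # number of stored items

    def put(self, item):
        """Store item; return False when the FIFO is full."""
        if self._count == len(self._buf):
            return False    # a real sender would wait on its sync address here
        tail = (self._head + self._count) % len(self._buf)
        self._buf[tail] = item
        self._count += 1
        return True

    def get(self):
        """Return (True, item), or (False, None) when the FIFO is empty."""
        if self._count == 0:
            return False, None   # a real receiver would wait on its sync address here
        item = self._buf[self._head]
        self._head = (self._head + 1) % len(self._buf)
        self._count -= 1
        return True, item


fifo = RingFifo(2)
results = [fifo.put(1), fifo.put(2), fifo.put(3)]   # third put hits a full FIFO
first = fifo.get()
print(results, first)
```

Whether the buffer lives in local or global shared memory changes where it is allocated, not the indexing logic shown here.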

Code and Data Memory Size of H.264 Decoder

Configurations:
S1   Single-thread without optimization options
S2   Single-thread with copy removal
S3   Single-thread with copy removal and buffer sharing
M1   Multi-thread without optimization options
M2   Multi-thread with copy removal
M3   Multi-thread with copy removal and buffer sharing

Relative size (RTW = 100), with absolute size in Kbyte in parentheses:

Config   Data memory     Code memory
RTW      100 (27.0K)     100 (79.0K)
S1       98.8 (26.6K)    97.7 (77.2K)
S2       58.1 (15.7K)    79.0 (62.4K)
S3       29.1 (7.9K)     78.7 (62.1K)
M1       110.3 (29.7K)   125.9 (99.5K)
M2       74.4 (20.0K)    105.3 (83.2K)
M3       35.3 (9.5K)     105.9 (83.6K)

[Figure: data memory size (Kbyte), stacked by Constant, Channel and Buffer.]
[Figure: code memory size (Kbyte), stacked by App. library, HdS library and Thread+main.]
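The effect of the two optimizations on data memory can be read off the absolute sizes. The sketch below computes the saving from the unoptimized to the fully optimized configuration for the single-thread and multi-thread cases, using the Kbyte values from the data memory chart; the helper `saving` is hypothetical.

```python
# Data memory size in Kbyte per configuration, taken from the chart.
data_kb = {"RTW": 27.0, "S1": 26.6, "S2": 15.7, "S3": 7.9,
           "M1": 29.7, "M2": 20.0, "M3": 9.5}


def saving(before, after, sizes):
    """Fraction of memory saved when moving from `before` to `after`."""
    return 1.0 - sizes[after] / sizes[before]


single = saving("S1", "S3", data_kb)   # copy removal + buffer sharing, one thread
multi = saving("M1", "M3", data_kb)    # same options, multi-thread code
print(f"single-thread saving: {single:.1%}, multi-thread saving: {multi:.1%}")
```

Both cases save on the order of two thirds of the data memory, so the optimizations carry over from single-thread to multi-thread code.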