MPSoC Design Flow: Case Study for H.264 †
Kai [email protected]
Institute of VLSI Design, Zhejiang University, China
August, ICDFN 2007
† "Simulink-Based MPSoC Design Flow: Case Study of MJPEG and H.264", published at DAC 2007

Motivation
System level design: Key solution to complex MPSoC
SystemC
- A preferred HW/SW codesign language
- Within a wide range of abstraction levels
- Intrinsic low level language: NOT easy to specify the complex system at algorithm level
Simulink
- The prevailing environment: modeling and simulating complex systems at algorithm level
An open issue: Algorithm/Architecture mapping for MPSoC
[Figure: System Design GAP between the Application (Algorithm and Architecture, Simulink) and the Physical implementation, covered by SystemC models at the Virtual Architecture Level, Transaction Accurate Level and Virtual Prototype Level]
Abstraction levels:
- Algorithm/Architecture mapping (System Level)
- Protocol Accurate (Virtual Architecture Level)
- Synchronization Accurate (Transaction Level)
- Cycle Accurate (Virtual Prototype Level)

Objective
1) System level MPSoC design flow
- Combine Simulink with SystemC
  - Simulink for high-level algorithm modeling
  - SystemC for low-level HW/SW design
- Concurrent HW/SW design
  - Seamless refinement at different level abstraction models
  - Systematic and automated HW/SW code generation
2) Case study for multimedia applications
- Feasibility and efficiency of proposed design flow
  - System functional validation
  - HW/SW co-design/co-verification
  - Performance analysis for architecture exploration

Content
Motivation & Objective
Simulink-Based MPSoC Design Flow
- HW and SW Mixed Models
- Design Steps
Case Study
- H.264 Baseline Decoder
- Experiment
Conclusions & Future Works

HW and SW Mixed Models
Seamless refinement at four abstraction levels, from high to low:
- System level model: Simulink Combined Algorithm and Architecture Model (Simulink CAAM)
- Virtual architecture model (SystemC)
- Transaction accurate model (SystemC)
- Virtual prototype model (SystemC)
[Figure: three CPU subsystems (CPU SS1, CPU SS2, CPU SS3), each with ports, an App (Thread Main 1, Thread Main 2) and HdS, connected through Abstract Channels or an Interconnecting Bus. Design activities: Hardware Design, Software Design, HW-SW Codesign, OS and HW codesign, Target CPU Design]

Step 1: Simulink Modeling
Application C/C++ into a set of modular functions:
- User-defined Simulink blocks (e.g. S-function)
- Pre-defined Simulink blocks (e.g. mathematical operation)
[Figure: Simulink Modeling]

Step 2: Application/Architecture mapping
[Figure: design flow. Application -> (1) Simulink modeling -> Simulink application model -> (2) Application/Architecture mapping -> Simulink CAAM -> (3) Simulink parsing -> Colif CAAM -> (4) HW Architecture Gen. and (5) Multithread Code Gen. -> Virtual Architecture Model, Transaction Accurate Model, Virtual Prototype Model. Inputs: HW Library (Comp. Subsystems, Comm. Channels) and SW Library (Thread Library, HdS Library)]
[Figure: Simulink application model (blocks F0 to F11, Fsw, IAS0, IAS1, delays Z-1 and Z-k) mapped onto the HW architecture template (CPU 1 SS, CPU 2 SS, CPU 3 SS, Inter-SS Comm, GFIFO)]
Simulink CAAM:
1) Architecture layer: CPU SSs, Inter-SS comm.
2) Subsystem layer: Threads, Intra-SS comm.
3) Thread layer: Simulink blocks, links
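
The three CAAM layers can be pictured as a nested data structure. The following is a minimal sketch, not the authors' tool: the class names, block names and the `classify_links` helper are hypothetical and only illustrate how a data link becomes intra-thread, intra-SS or inter-SS communication depending on where its two endpoint blocks are mapped.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Thread:
    """Thread layer: a thread groups Simulink blocks (names only here)."""
    name: str
    blocks: List[str] = field(default_factory=list)


@dataclass
class CpuSubsystem:
    """Subsystem layer: one CPU subsystem (CPU SS) holding threads."""
    name: str
    threads: List[Thread] = field(default_factory=list)


@dataclass
class Caam:
    """Architecture layer: CPU subsystems plus block-to-block data links."""
    subsystems: List[CpuSubsystem] = field(default_factory=list)
    links: List[Tuple[str, str]] = field(default_factory=list)

    def placement(self) -> Dict[str, Tuple[str, str]]:
        """Map each block name to its (subsystem, thread) pair."""
        return {
            block: (ss.name, th.name)
            for ss in self.subsystems
            for th in ss.threads
            for block in th.blocks
        }

    def classify_links(self) -> Dict[str, List[Tuple[str, str]]]:
        """Split links into intra-thread, intra-SS and inter-SS groups."""
        where = self.placement()
        groups = {"intra_thread": [], "intra_ss": [], "inter_ss": []}
        for src, dst in self.links:
            (ss_a, th_a), (ss_b, th_b) = where[src], where[dst]
            if ss_a != ss_b:
                groups["inter_ss"].append((src, dst))
            elif th_a != th_b:
                groups["intra_ss"].append((src, dst))
            else:
                groups["intra_thread"].append((src, dst))
        return groups


# Hypothetical mapping of four blocks onto two CPU subsystems.
caam = Caam(
    subsystems=[
        CpuSubsystem("CPU1_SS", [Thread("T1", ["F1", "F2"]), Thread("T2", ["F3"])]),
        CpuSubsystem("CPU2_SS", [Thread("T3", ["F4"])]),
    ],
    links=[("F1", "F2"), ("F2", "F3"), ("F3", "F4")],
)
link_groups = caam.classify_links()
```

In this toy mapping only the F3 to F4 link crosses subsystems, so it is the only one that would need an inter-SS channel such as GFIFO.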

Step 3: Simulink Parsing
Simulink CAAM -> Colif CAAM
- Colif is an XML-based meta-model with well-defined data structures: modules, channels and ports
- One-to-one correspondence
- Simulink port to Send/Receive block
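
To illustrate what an XML-based description of modules, channels and ports might look like, here is a small sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`caam`, `module`, `port`, `channel`, `protocol`, `from`, `to`) are assumptions made for this example; the actual Colif schema is not shown on the slides.

```python
import xml.etree.ElementTree as ET


def build_module(name, ports):
    """Create a <module> element with one <port> child per (name, direction)."""
    module = ET.Element("module", {"name": name})
    for port_name, direction in ports:
        ET.SubElement(module, "port", {"name": port_name, "dir": direction})
    return module


def build_channel(name, protocol, src, dst):
    """Create a <channel> element connecting two 'module.port' endpoints."""
    return ET.Element(
        "channel", {"name": name, "protocol": protocol, "from": src, "to": dst}
    )


# Hypothetical two-subsystem model connected by one GFIFO channel.
root = ET.Element("caam")
root.append(build_module("CPU1_SS", [("out0", "out")]))
root.append(build_module("CPU2_SS", [("in0", "in")]))
root.append(build_channel("ch0", "GFIFO", "CPU1_SS.out0", "CPU2_SS.in0"))

xml_text = ET.tostring(root, encoding="unicode")

# Round-trip check: parse the text back and list the declared ports.
parsed = ET.fromstring(xml_text)
port_names = [p.get("name") for p in parsed.iter("port")]
```

Keeping ports as children of modules and channels as separate elements mirrors the "modules, channels and ports" structure named above, so each Simulink port can map to exactly one entry.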

Step 4: HW Architecture Generation
[Figure: VP HW platform with a global shared bus]

Step 5: Multithreaded Code Generation
Copy removal and buffer sharing are used for memory optimization.
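
The slides do not say how buffer sharing is implemented. One common way to realize the idea, shown here only as a sketch under that assumption, is to let buffers whose live ranges in a static schedule do not overlap occupy the same memory slot. The buffer names, sizes and step numbers below are hypothetical.

```python
def share_buffers(buffers):
    """Greedily assign buffers to shared slots.

    buffers: list of (name, size, first_step, last_step), where the step
    numbers give the buffer's live range in a static schedule.
    Returns (assignment, slot_sizes).
    """
    slots = []        # each slot: {"size": int, "free_at": int}
    assignment = {}
    for name, size, first, last in sorted(buffers, key=lambda b: b[2]):
        for index, slot in enumerate(slots):
            if slot["free_at"] < first:       # previous user is no longer live
                slot["size"] = max(slot["size"], size)
                slot["free_at"] = last
                assignment[name] = index
                break
        else:
            # No reusable slot: allocate a new one.
            slots.append({"size": size, "free_at": last})
            assignment[name] = len(slots) - 1
    return assignment, [slot["size"] for slot in slots]


# Hypothetical 64-byte block buffers passed along a four-stage chain.
demo = [
    ("coeff", 64, 0, 1),
    ("residual", 64, 1, 2),
    ("pred", 64, 2, 3),
    ("recon", 64, 3, 4),
]
assignment, slot_sizes = share_buffers(demo)
unshared_bytes = sum(size for _, size, _, _ in demo)
shared_bytes = sum(slot_sizes)
```

In this toy chain the four buffers fit into two slots, halving the buffer memory. Copy removal is a separate optimization (avoiding a block's output being copied into the next block's input) and is not modeled here.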

H.264 Baseline Decoder
Receives an encoded video bit stream and performs iterative executions of macroblock level functions:
- VLD: Variable Length Decoder
- IQ: Inverse Quantization
- IT: Inverse Transform
- SC: Spatial Compensation
- MC: Motion Compensation
- REC: Reconstruction
- DF: Deblock Filter
[Figure: decoder block diagram. Global ctrl and Macroblock VLD feed Luma Decoding and Chroma Decoding. Luma Decoding: Luminance VLD, and for each of the 1st to 4th 8x8 luma blocks IQ/IT, MC/SC, REC and Luminance DF. Chroma Decoding: VLD, IQ/IT, MC/SC, REC and DF for Chroma U and Chroma V. Stage labels: VLD, MC/SC, REC & DF. Annotations: 2 times, 4 times, 8x8, 16x16]

Simulink Model
[Figure: Simulink model with subsystems VLD, Chroma U Decoding, Chroma V Decoding, Luma first to fourth 8x8 block SC/MC, and Luma REC]
- 83 S-Functions
- 24 delays
- 286 data links
- 43 if-action-subsystems
- 5 for-iteration subsystems
- 101 pre-defined Simulink blocks

A Simulink CAAM Example
- 4 CPU SS
- 4 Threads
- Inter-SS COM: GFIFO
- Processor: ARM7
[Figure: four CPU subsystems running Global CTRL & VLD, Luma MC/SC/REC, Luma DF and Chroma Decoding, connected through GFIFO channels]

Experiment: Simulation Time
An experiment for decoding 10 frames of QCIF FOREMAN:
- Four ARM7 processors (VP)
- GFIFO (TA, VP)
- RTW (a sequential program running on host machine)

        VP      TA      VA       Simulink  RTW
Time    11066s  325s    2.8s     20.0s     0.5s
Speed   6.5K/s  224K/s  26.0M/s  3.6M/s    146M/s

VP simulation takes too long for debugging the whole system. It is necessary to make use of TA for HW/SW co-simulation.
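
The gap between the abstraction levels is easier to see as ratios. The sketch below recomputes them from the figures on this slide; the pairing of each time and speed with a model follows the order in which the values appear, and the unit behind "K/s" and "M/s" is not stated on the slide, so the time-times-speed product is only a consistency check.

```python
# (simulation time in seconds, simulation speed in units per second)
results = {
    "VP": (11066.0, 6.5e3),
    "TA": (325.0, 224e3),
    "VA": (2.8, 26.0e6),
    "Simulink": (20.0, 3.6e6),
    "RTW": (0.5, 146e6),
}

rtw_time = results["RTW"][0]
# How many times slower each model runs than the native RTW program.
slowdown_vs_rtw = {name: t / rtw_time for name, (t, _) in results.items()}
# Time multiplied by speed: roughly the same amount of work in every row.
work_simulated = {name: t * speed for name, (t, speed) in results.items()}
# Gain from debugging at TA instead of VP.
vp_over_ta = results["VP"][0] / results["TA"][0]
```

With these numbers TA is about 34 times faster than VP, and every row multiplies out to roughly 72 to 73 million units, which suggests the five runs simulate the same workload.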

Experiment: Performance Optimization with Different Architectures
A popular and simple task partition strategy was used in this experiment:
- Computation-based
- Step by step
[Figure: functions F1 to F8 partitioned onto subsystem SS1]

Experiment Result of H.264 Baseline Decoder with Three Architectures

Function Block             6ARM  4ARM  2ARM
Global control and MB VLD  CPU1  CPU1  CPU1
Chroma decoding            CPU2  CPU2  CPU2
Luma VLD                   CPU3  CPU4
Luma DF                    CPU4  CPU3
Luma IQ/IT                 CPU5  CPU4
Luma MC/SC and Luma Rec    CPU6

[Figure (a) Execution Time: total execution cycles of 150 (2ARM), 102 (4ARM) and 100 (6ARM)]
[Figure (b) H264_GFIFO_2ARM: per-CPU breakdown on a 0 to 100 axis; printed values 9, 88 / 5, 5 / 86, 7]
[Figure (c) H264_GFIFO_4ARM: per-CPU breakdown for CPU1 to CPU4 on a 0 to 90 axis; printed values 14, 34, 22, 80 / 8, 4, 4, 4 / 78, 62, 74, 16]

Tradeoff between performance, cost and flexibility:
- High-performance dedicated processor?
- Fine-granularity task partition?
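
The tradeoff can be made concrete with the execution-time chart. This is a small worked sketch that assumes the three bar values (150, 102, 100) belong to 2ARM, 4ARM and 6ARM in axis order; the unit of the cycle counts is not given on the slide.

```python
# Total execution cycles as printed on chart (a); the unit is not stated.
exec_cycles = {"2ARM": 150, "4ARM": 102, "6ARM": 100}
cpu_count = {"2ARM": 2, "4ARM": 4, "6ARM": 6}

base = exec_cycles["2ARM"]
# Speedup of each architecture relative to the 2-CPU baseline.
speedup = {arch: base / cycles for arch, cycles in exec_cycles.items()}
# Speedup divided by the growth in processor count (1.0 = ideal scaling).
scaling_efficiency = {
    arch: speedup[arch] / (cpu_count[arch] / cpu_count["2ARM"])
    for arch in exec_cycles
}
```

Under that reading, going from two to four processors gives about a 1.47x speedup, while six processors reach only 1.5x, so the extra two CPUs add cost with almost no performance gain for this partition.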

Conclusions & Future Works
Proposed a Simulink-based MPSoC design flow
- For automated concurrent hardware-software design and verification
- Refine Simulink CAAM to three different abstraction level models (VA, TA, VP)
In the case study of H.264 decoder
- The feasibility and efficiency of proposed design flow
  - Functional evaluation
  - HW/SW codesign
  - Performance analysis for architecture exploration
Plan to improve the current design flow:
- Dedicated instruction sets
- Communication protocol with DMA
- Automatic design space exploration

Acknowledgements
Thanks to:
- Prof. Ahmed Jerraya (CEA-LETI, MINATEC, France)
- Sang-Il Han (Seoul National University, Korea)
- Katalin Popovici (TIMA Laboratory, France)
- Lisane Brisolara (Federal University of Rio Grande do Sul, Brazil)

Simulink CAAM of M-JPEG Decoder
[Figure: Simulink CAAM shown at the Architecture Layer, Subsystem Layer and Thread Layer]
- Four threads and three CPUs
- ARM processor and GFIFO/SWFIFO
- 7 S-Functions
- 7 Delays
- 26 Links
- 4 IASs

Inter/Intra-Subsystem Communication

Protocol  Data buffer     Sender sync. addr  Receiver sync. addr  Description
SWFIFO    Local memory    @ local memory     @ local memory       Software FIFO
SHM       Local memory    @ local memory     @ local memory       Shared memory
HWFIFO    Hardware queue  @ hardware FIFO    @ hardware FIFO      Hardware FIFO
GFIFO     Shared memory   @ mailbox          @ mailbox            Software FIFO via global shared memory
LFIFO     Local memories  @ mailbox          @ mailbox            Software FIFO via local memories
DMS       Local memories  @ MSAP             @ MSAP               Distributed memory server with DMA

Code and Data memory size of H.264 decoder

Configurations:
S1  Single-thread without optimization options
S2  Single-thread with copy removal
S3  Single-thread with copy removal and buffer sharing
M1  Multi-thread without optimization options
M2  Multi-thread with copy removal
M3  Multi-thread with copy removal and buffer sharing

Data memory size (Kbyte); chart series: Constant, Channel, Buffer
Config  Relative (RTW = 100)  Size
RTW     100                   27.0K
S1      98.8                  26.6K
S2      58.1                  15.7K
S3      29.1                  7.9K
M1      110.3                 29.7K
M2      74.4                  20.0K
M3      35.3                  9.5K

Code memory size (Kbyte); chart series: App. library, HdS library, Thread+main
Config  Relative (RTW = 100)  Size
RTW     100                   79.0K
S1      97.7                  77.2K
S2      79.0                  62.4K
S3      78.7                  62.1K
M1      125.9                 99.5K
M2      105.3                 83.2K
M3      105.9                 83.6K
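
The effect of the two optimizations can be read off as percentage savings. The sketch below recomputes them from the Kbyte values on this slide, assuming the seven values of each chart follow the axis order RTW, S1, S2, S3, M1, M2, M3; the `reduction` helper is introduced only for this example.

```python
# Sizes in Kbyte, in chart order RTW, S1, S2, S3, M1, M2, M3.
data_kb = {"RTW": 27.0, "S1": 26.6, "S2": 15.7, "S3": 7.9,
           "M1": 29.7, "M2": 20.0, "M3": 9.5}
code_kb = {"RTW": 79.0, "S1": 77.2, "S2": 62.4, "S3": 62.1,
           "M1": 99.5, "M2": 83.2, "M3": 83.6}


def reduction(sizes, before, after):
    """Percentage of memory saved when going from `before` to `after`."""
    return 100.0 * (sizes[before] - sizes[after]) / sizes[before]


# Copy removal plus buffer sharing versus no optimization.
single_thread_data_saving = reduction(data_kb, "S1", "S3")
multi_thread_data_saving = reduction(data_kb, "M1", "M3")
multi_thread_code_saving = reduction(code_kb, "M1", "M3")
```

On these figures the two optimizations together save roughly 70% of data memory in the single-thread case and about 68% in the multi-thread case, while the multi-thread code size shrinks by about 16%.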