Copyright © 2010 Akash Kumar
Trends in Multimedia Systems
Increasing number of features i.e. applications
Simultaneously active applications
Power increasingly becoming more important
Short time-to-market, new devices released every few months
Multiple standards to be supported
Multiprocessors being used increasingly
Challenges in Multimedia System Design
Ensuring all applications can meet their performance requirements
Handling the huge number of use-cases, i.e. combinations of applications
  Each possible set of applications leads to a new use-case
  For 10 applications there are over a thousand use-cases!
Limiting the design time
  Late launch of products directly hurts profits
  Increased design time implies higher design costs
Dealing with dynamism in the applications
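The use-case count quoted above can be checked with a short sketch. It assumes that a use-case is any non-empty set of simultaneously active applications; the function name `count_use_cases` is made up for illustration.

```python
from itertools import combinations


def count_use_cases(num_apps: int) -> int:
    """Count the non-empty subsets of applications by explicit enumeration."""
    return sum(
        1
        for size in range(1, num_apps + 1)
        for _ in combinations(range(num_apps), size)
    )


use_cases_for_10 = count_use_cases(10)
print(use_cases_for_10)  # 1023, i.e. 2**10 - 1
```

Each additional application doubles the number of subsets, which is why exhaustive per-use-case design quickly becomes impractical.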
Contributions
Analysis
  Accurately predict performance of multiple applications executing concurrently
  Basic and iterative probabilistic techniques
Design
  Synthesizing MPSoC for multiple applications
  Synthesizing MPSoC for multiple use-cases
Management
  Resource manager for MPSoC systems
  Admission control and budget enforcement
Assumptions
Heterogeneous MPSoCs are increasingly used
  Different levels of parallelism in applications
  uProc – better for control flow
  DSP – better for signal processing
  Dedicated hardware blocks needed for certain parts
  Improves efficiency and saves power
Applications modeled as SDF graphs
First-come-first-serve arbiter at cores
Non-preemptive system – tasks cannot be interrupted once started
Non-Preemptive Systems
State-space needed is smaller
Lower implementation cost
Less overhead at run-time
Cache pollution, memory size
Design Flow
[Figure: design-flow diagram. Labels: Applications Specifications (applications A, B and C with actors a0 to a3, b0 to b2 and c0 to c2); Use-cases 1 to 3; Hardware Specification; Performance Analysis (Chapter 3) with throughput Analysis Results; System Design and Synthesis (Chapters 5 & 6); Admission Control and Budget Enforcement (Chapter 4); a platform of processors with arbiters and a resource manager (RM).]
Outline
Introduction – Multimedia Multiproc Systems
Introduction to SDF
Analysis
  Basic Probabilistic Performance Prediction
  Iterative Probabilistic Performance Prediction
Design
  Synthesizing MPSoC for multiple applications
  Synthesizing MPSoC for multiple use-cases
Management
  Resource Management for MPSoC systems
Synchronous Dataflow Graphs
First proposed in 1987 by Edward Lee
SDF graphs (SDFGs) are used extensively
  DSP applications
  Multimedia applications
Similar to task graphs with dependencies
[Figure: example SDF graph with actors A, B and C connected by channels α and β. Labels: actor, channel, rate, token, execution time. The slides animate the tokens moving when A fires and then when B fires.]
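The firing rule illustrated by these slides can be sketched in a few lines. The rates below (A produces 2 and B consumes 3 on α, B produces 1 and C consumes 2 on β) are illustrative assumptions, not values read from the figure, and the `Channel`, `can_fire` and `fire` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    src: str          # producing actor
    dst: str          # consuming actor
    prod: int         # tokens produced per firing of src
    cons: int         # tokens consumed per firing of dst
    tokens: int = 0   # tokens currently stored on the channel


def can_fire(actor, channels):
    """An actor may fire when every input channel holds enough tokens."""
    return all(c.tokens >= c.cons for c in channels if c.dst == actor)


def fire(actor, channels):
    """Consume tokens from all inputs and produce tokens on all outputs."""
    if not can_fire(actor, channels):
        raise ValueError(f"actor {actor} is not ready to fire")
    for c in channels:
        if c.dst == actor:
            c.tokens -= c.cons
        if c.src == actor:
            c.tokens += c.prod


# Assumed example rates (hypothetical, for illustration only).
alpha = Channel("A", "B", prod=2, cons=3)
beta = Channel("B", "C", prod=1, cons=2)
graph = [alpha, beta]

fire("A", graph)                    # alpha now holds 2 tokens, B is not ready
b_ready_after_one = can_fire("B", graph)
fire("A", graph)                    # alpha now holds 4 tokens, B is ready
fire("B", graph)                    # B consumes 3 from alpha, produces 1 on beta
print(alpha.tokens, beta.tokens, can_fire("C", graph))
```

Because the rates are fixed per firing, the number of firings of each actor per graph iteration can be determined at design time, which is what makes SDF graphs analyzable.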
Example – H263 Decoder
[Figure: SDF graph of an H263 decoder with actors VLD, IQ, IDCT and Reconstruction. Values recoverable from the slide: rates 1, 2, 1188 and 2376; other annotations 28,800, 96,000, 120,000 and 30,000.]
Synchronous Dataflow Graphs
Advantages
  Easily allows performance analysis of single applications
  Communication buffers can be easily modeled
Disadvantages
  Sharing of resources is hard to model
  Only static resource arbitration can be modeled: infinite possibilities with multiple applications
  Difficult to analyze performance of multiple applications executing concurrently
  Unable to handle dynamism in the application
Problem: Predicting Multiple Application Performance
Two applications – each with three actors
Mapped on a heterogeneous platform
Non-preemptive scheduler
[Figure: applications A and B, each a graph of three actors annotated with 50, mapped and scheduled onto processors P1, P2 and P3.]
Considering Only Actors on a Processor
Iteration count for each task for 3,000 cycles

Task    Only Actors   Individual Graph   Worst Case
A       30            20                 10
B       30            20                 10
Total   60            40                 20

(The Static and Priority Based columns of this table are filled in on later slides.)
[Figure: graphs of applications A and B, three actors each, annotated with 50.]
Considering Only Applications
[Iteration-count table and application graphs repeated from the previous slide. Individual Graph column: A 20, B 20, Total 40 iterations in 3,000 cycles.]
Worst Case Waiting Time
Calculate the waiting time
[Figure: applications A and B (three actors each, annotated with 50) mapped on processors P1, P2 and P3; a waiting time is added to the actors of A.]
Worst Case Waiting Time
Worst Case column of the iteration-count table: A 10, B 10, Total 20 iterations in 3,000 cycles
Unrealistic!
The worst-case estimate is only a lower bound
[Figure: application graphs annotated with 50, alongside a graph whose actors are annotated with 100.]
Static Order Arbitration
Add ordering dependencies (edges)
[Figure: schedule of applications A and B on processors P1, P2 and P3 over times t0 to t3, reaching a steady state.]
Problem: Predicting Performance
Static column of the iteration-count table: A 15, B 15, Total 30 iterations in 3,000 cycles
Problem: Predicting Performance – Priority Based
[Figure: priority-based schedule of applications A and B on processors P1, P2 and P3 over times t0 to t3, reaching a steady state.]
Problem: Predicting Performance
Iteration count for each task for 3,000 cycles

Task    Only Actors   Individual Graph   Worst Case   Static   Priority Based
                                                               A pref.   B pref.
A       30            20                 10           15       20        10
B       30            20                 10           15       10        20
Total   60            40                 20           30       30        30
Problem
No good techniques exist to analyze the performance of multiple applications on non-preemptive heterogeneous systems
Use a probabilistic approach to estimate the performance of multiple applications running on an MPSoC platform
Analyzing Multiple Applications Performance
When resources need to be shared, the actor execution may be delayed
Determining this waiting time is the key
$t_{resp} = t_{exec} + t_{wait}$
[Figure: two three-actor graphs annotated with 50; the waiting times of one graph's actors are unknown (marked "?").]
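Probability Distribution
$E(x) = \int_0^{50} x \, P(x) \, dx = \int_0^{50} x \cdot \frac{1}{150} \, dx = \frac{1}{150} \left[ \frac{x^2}{2} \right]_0^{50} = \frac{25}{3} \approx 8.3$
[Figure: application A (three actors annotated with 50) and the distribution P(x) versus x: probability 2/3 at x = 0, and a density of 1/150 up to x = 50, which accounts for the remaining probability of 1/3.]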
x denotes the time other actors have to wait for respective resources to be free from actors of A
E(x) provides the expected time an actor will need to wait when sharing resources with actors of A
Compute the probability distribution of a resource being blocked by an actor
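As a quick numeric check of the expected waiting time, the sketch below assumes that the blocking actor occupies the resource for `t_exec` out of every `period` time units and that the other actor arrives at a uniformly random moment. The function name is hypothetical.

```python
from fractions import Fraction


def expected_wait_uniform(t_exec: int, period: int) -> Fraction:
    """Expected blocking time caused by one actor under a uniform arrival."""
    p_block = Fraction(t_exec, period)     # probability the resource is busy
    mean_remaining = Fraction(t_exec, 2)   # mean remaining execution time
    return p_block * mean_remaining


e_x = expected_wait_uniform(50, 150)
print(e_x, float(e_x))  # 25/3, roughly 8.33
```

This reproduces the integral for the example: an actor that executes for 50 out of 150 time units blocks others for about 8.3 time units on average.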
Basic P3 Algorithm
Compute throughput of all applications
Compute the probability of blocking a resource
Estimate the waiting time for all actors
Update the response time for all actors
  Response time = execution time + waiting time
Re-compute the application throughput
Basic P3 Algorithm – Exponential Complexity
So if actor ai and bi are mapped on the same resource, bi on average will need to wait for
Complexity Reduction
Overall complexity is $O(n^n)$, where n is the number of actors mapped on a processing resource
Higher-order probability products
  Limit the equation to only second or fourth order
Complexity reduces significantly

Algorithm      Complexity
Original       $O(n^n)$
Second-order   $O(n^2)$
Fourth-order   $O(n^4)$
Probabilistic Performance Prediction (P3)
Basic P3 technique
  Looks at all possible combinations of other actors blocking a particular actor
  Results in exponential possibilities
Iterative P3 technique
  Looks at how an actor can contribute to the waiting time of other actors
  Results in linear complexity
  Iterating over the algorithm while updating throughput improves the estimate further
Determining the Waiting Time
Three states of an actor
  Not ready – data not present
    Actors arriving in this state are not affected by this actor
  Ready and waiting – data present, but resource is busy
    Actors arriving in this state have to wait for the full execution of this actor
  Ready and executing – data and resource available
    Waiting time for other actors depends on where the actor is in its execution
    Uniform distribution assumed
A’s Waiting Time Due to B
[Figure: actors A, B, C and D queue at an arbiter in front of a processor. Three cases are shown: B not in queue, B waiting in queue, B being served.]
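Updated Probability Distribution
[Figure: P(x) versus x. Probability $1 - P_w - P_e$ at x = 0 (when the actor is not ready), probability $P_e$ spread over $0 < x < t_{exec}$ (when the actor is executing), and probability $P_w$ at $x = t_{exec}$ (when the actor is in queue).]
$E(x) = P_w \cdot t_{exec} + P_e \cdot \frac{t_{exec}}{2}$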
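Updated Probability Distribution – Conservative
[Figure: P(x) versus x. Probability $1 - P_w - P_e$ at x = 0 (when the actor is not ready); the probabilities $P_e$ (when the actor is executing) and $P_w$ (when the actor is in queue) are both placed at $x = t_{exec}$.]
$E(x) = P_w \cdot t_{exec} + P_e \cdot t_{exec} = (P_w + P_e) \cdot t_{exec}$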
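The two waiting-time estimates can be compared side by side. This is a minimal sketch of the two formulas on these slides; the probability values in the example are made up for illustration.

```python
def expected_wait(p_wait: float, p_exec: float, t_exec: float) -> float:
    """E(x) = Pw * t_exec + Pe * t_exec / 2 (uniform position while executing)."""
    return p_wait * t_exec + p_exec * t_exec / 2


def conservative_wait(p_wait: float, p_exec: float, t_exec: float) -> float:
    """E(x) = (Pw + Pe) * t_exec (always assume a full execution remains)."""
    return (p_wait + p_exec) * t_exec


# Hypothetical numbers: the actor waits 25% and executes 50% of the time.
normal = expected_wait(0.25, 0.5, 50)
conservative = conservative_wait(0.25, 0.5, 50)
print(normal, conservative)  # 25.0 37.5
```

The conservative variant never predicts a smaller waiting time than the normal one, since it charges the full execution time whenever the blocking actor is executing.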
Iterative Probability
Iterate until the analysis estimate stabilizes
Updating the throughput in one iteration
  Compute throughput of all applications
  Compute the probability of blocking a resource – both while waiting and executing
  Estimate the waiting time for all actors
  Update the response time for all actors
    Response time = execution time + waiting time
  Re-compute the application throughput
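The loop described above can be sketched as a fixed-point iteration. This is a simplified illustration under loudly stated assumptions, not the authors' implementation: each application is modeled as a single cycle of actors, so its period is just the sum of its actors' response times (a stand-in for real SDF throughput analysis); $P_e$ and $P_w$ are taken as the fractions of the period an actor spends executing and waiting; and only actors of other applications on the same processor cause waiting. The name `iterative_p3` and the data layout are hypothetical.

```python
from collections import defaultdict


def iterative_p3(apps, max_iters=100, tol=1e-9):
    """apps maps an application name to a list of (processor, t_exec) actors.

    Returns the estimated period of every application.
    """
    wait = {(app, i): 0.0 for app, actors in apps.items() for i in range(len(actors))}
    # Initial throughput: every application running in isolation.
    period = {app: sum(t for _proc, t in actors) for app, actors in apps.items()}
    for _ in range(max_iters):
        # Expected delay each actor imposes on others sharing its processor.
        imposed = defaultdict(list)
        for app, actors in apps.items():
            for i, (proc, t_exec) in enumerate(actors):
                p_exec = t_exec / period[app]
                p_wait = wait[(app, i)] / period[app]
                imposed[proc].append((app, p_wait * t_exec + p_exec * t_exec / 2))
        # Waiting time: sum of the delays imposed by other applications' actors.
        for app, actors in apps.items():
            for i, (proc, _t) in enumerate(actors):
                wait[(app, i)] = sum(d for other, d in imposed[proc] if other != app)
        # Response time = execution time + waiting time; re-compute the period.
        new_period = {
            app: sum(t + wait[(app, i)] for i, (_proc, t) in enumerate(actors))
            for app, actors in apps.items()
        }
        converged = all(abs(new_period[a] - period[a]) < tol for a in apps)
        period = new_period
        if converged:
            break
    return period


# The running example: two applications, three 50-cycle actors each, on P1..P3.
example = {
    "A": [("P1", 50), ("P2", 50), ("P3", 50)],
    "B": [("P1", 50), ("P2", 50), ("P3", 50)],
}
periods = iterative_p3(example)
print(periods)
```

Under these assumptions the estimate for the running example settles at a period of roughly 179 cycles, between the isolated period of 150 and the worst-case period of 300.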
Experimental Results
SDF3 tool used to generate random graphs
  Ten graphs generated
  Each had 8-10 actors
  Over 1000 use-cases generated
Simulations performed using POOSL (Parallel Object Oriented Specification Language)
28 hours for simulation
10 min for analysis using all approaches
Iterative Analysis – all applications together
[Figure: chart of application period (normalized to original) for applications A to J, with a y-axis from 0 to 14. Series: Original, Simulation, Worst case, WCSim, Basic, Iterative.]
Iterative Analysis – all applications together
[Figure: chart of application period (normalized to simulated) for applications A to J, with a y-axis from 0.7 to 1.3. Series: Simulation, Basic, Iterative, Conservative.]
Case-study with Mobile Phone Applications
[Figure: chart of the period of applications (normalized to original period) for H263 Decoder, H263 Encoder, JPEG Decoder, Modem and Voice Call. Series: Simulation, Iterative Analysis, Conservative Analysis, Worst Case, Basic - Fourth Order.]
FPGA Implementation Results
Algorithm/Stage              Clock cycles   Error (%) average   Error (%) max   Complexity
Load from CF Card            1903500        -                   -               O(N.n.k)
Throughput Computation       12688          -                   -               O(N.n.k)
Worst Case                   2090           72.6                83.1            O(m.M)
Second Order                 45697          22.3                44.5            O(m^2.M)
Fourth Order                 1740232        9.9                 28.9            O(m^4.M)
Iterative - 1 Iteration      15258          12.6                36              O(m.M)
Iterative - 1 Iteration*     27946          12.6                36              O(m.M+N.n.k)
Iterative - 5 Iterations*    139730         2.2                 3.4             O(m.M+N.n.k)
Iterative - 10 Iterations*   279460         1.9                 3.0             O(m.M+N.n.k)

N: number of applications
n: number of actors in an application
k: number of throughput equations for an application
m: number of actors mapped on a processor
M: number of processors

19 ms at 100 MHz (Load from CF Card); 2.8 ms at 100 MHz (Iterative - 10 Iterations*)
Problem
Current design practice for multiple applications is manual or semi-automated
  Error-prone
  Time-consuming
Current Tools - Example
Xilinx
  Automatic tool chain limited to single processors
  No support for multiple applications
  Design space exploration is manual
Solution
Multi-Application Multi-Processor Synthesis (MAMPS)
  A design flow that takes in application specification(s)
  Generates the entire MPSoC hardware
  Creates the software models for it
  Real C programs can also be run
Provides two main benefits
  Fast design space exploration
  Support for multiple applications
MAMPS
Example – H263 Decoder
[Figure: the H263 decoder SDF graph (actors VLD, IQ, IDCT and Reconstruction) shown earlier, here as the example application for MAMPS.]
MAMPS
Example – H263 Decoder
[Figure: platform for the H263 decoder. Four processors, Pro 0 (VLD), Pro 1 (IQ), Pro 2 (IDCT) and Pro 3 (Recon), with FIFO links, a bus, and Timer, UART, CF Card and DDR RAM blocks.]
DSE Case Study
Design Time
                          Manual Design   Generating Single Design   Complete DSE
Hardware Generation       ~2 days         40 ms                      40 ms
Software Generation       ~3 days         60 ms                      60 ms
Hardware Synthesis        35:40 min       35:40 min                  35:40 min
Software Synthesis        0:25 min        0:25 min                   10:00 min
Total time                ~5 days         36:05 min                  45:40 min
Iterations                1               1                          24
Average time/iteration    ~5 days         36:05 min                  1:54 min
Speed-Up                  -               1x                         19x
MAMPS
Used by the following people
  Ahsan Shabbir – TUe
  Michiel Rooijakkers – TUe
  Thom Gielen – TUe and NUS, Singapore
  Abhinav Krishna – NUS, Singapore
  Priyantha Desilva – NUS, Singapore
  Shakith Fernando – NUS, Singapore
  Zhonglei – TU Munchen, Germany
  James Young – Brigham Young University
  Amit Kumar Singh – Nanyang Technical University, Singapore
  Guan Yu – IMEC, Belgium
Handling Multiple Use-cases
For rapid prototyping, hardware synthesis time is the bottleneck
  Limits the design space exploration
For a real system, more use-cases imply
  More memory to store the configuration
  Increased switching
Use-case merging and partitioning
  Reduces the number of partitions
  Reduces the synthesis time
  Better for DSE and run-time memory
Use-case Merging
[Figure: Use-case A uses Proc 0 to Proc 2, Use-case B uses Proc 0 to Proc 3, and the Merged Design contains Proc 0 to Proc 3.]
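One way to picture merging combined with first-fit partitioning is sketched below. The slides do not spell out the feasibility criterion, so this sketch assumes that a use-case is a set of hardware components, that a merged design is the union of those sets, and that a partition is feasible while the union stays within a component budget. All names (`first_fit_partition`, `max_components`, the component labels) are hypothetical.

```python
def first_fit_partition(use_cases, max_components):
    """Place each use-case in the first partition whose merged design still fits."""
    partitions = []  # each entry: {"hardware": set of components, "use_cases": [names]}
    for name, hardware in use_cases.items():
        if len(hardware) > max_components:
            raise ValueError(f"use-case {name} does not fit on the platform")
        for part in partitions:
            merged = part["hardware"] | hardware   # merged design = union
            if len(merged) <= max_components:
                part["hardware"] = merged
                part["use_cases"].append(name)
                break
        else:
            # No existing partition can absorb this use-case: open a new one.
            partitions.append({"hardware": set(hardware), "use_cases": [name]})
    return partitions


# Hypothetical use-cases: processors plus point-to-point links.
use_cases = {
    "A": {"proc0", "proc1", "proc2", "link0-1", "link1-2"},
    "B": {"proc0", "proc1", "proc2", "proc3", "link0-1", "link2-3"},
    "C": {"proc0", "proc1", "proc2", "proc3", "link0-2", "link1-3", "link0-3"},
}
parts = first_fit_partition(use_cases, max_components=8)
print([p["use_cases"] for p in parts])  # [['A', 'B'], ['C']]
```

Each partition corresponds to one hardware design that has to be synthesized, so fewer partitions mean less synthesis time and fewer configurations to store.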
Use-case Merging and Partitioning Results
                      Random Graphs               Mobile Phone
                      # Partitions   Time (ms)    # Partitions   Time (ms)
Without Reduction
  Without Merging     853            -            23             -
  Greedy              Out of Memory               Out of Memory
  First-Fit           126            400          2              200
With Reduction
  Without Merging     178            100          3              40
  Greedy              112            3,300        2              180
  First-Fit           116            300          2              180
Optimal Partitions    >110           -            2              -
Reduction Factor      7              -            11             -
Dynamism in Applications
Multimedia applications are often dynamic
SDF assumes worst-case execution times – not realistic
Analysis results may be pessimistic – leading to a waste of resources and energy
Dynamic execution times may lead to unpredictable application performance
Unpredictability – Variation in Execution Time
[Figure: two schedules of applications A and B on processors P1, P2 and P3 over times t0 to t3, each reaching a steady state: one for graphs whose actors are annotated with 50, and one for graphs whose actors are annotated with 49.]
Resource Manager
Budget enforcement
  When running, each application signals the RM when it completes an iteration
  RM keeps track of each application’s progress
  Operation modes
    ‘Polling’ mode
    ‘Interrupt’ mode
  Suspends application if needed
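A toy version of polling-mode budget enforcement is sketched below. It is a hypothetical illustration rather than the resource manager from the thesis: it assumes each application has a required iteration rate, and that the RM suspends an application at a poll whenever it has completed more iterations than that rate requires so far, resuming it once it is no longer ahead.

```python
class ResourceManager:
    """Toy polling-mode budget enforcer (hypothetical sketch)."""

    def __init__(self, required_rate):
        # required_rate maps an application to its iterations per time unit.
        self.required_rate = dict(required_rate)
        self.completed = {app: 0 for app in required_rate}
        self.suspended = {app: False for app in required_rate}

    def signal_iteration(self, app):
        """Called by an application each time it completes an iteration."""
        self.completed[app] += 1

    def poll(self, now):
        """Suspend applications running ahead of their budget, resume the rest."""
        for app, rate in self.required_rate.items():
            budget = rate * now
            self.suspended[app] = self.completed[app] > budget
        return dict(self.suspended)


rm = ResourceManager({"A": 0.01, "B": 0.01})
for _ in range(3):
    rm.signal_iteration("A")       # A runs faster than required
rm.signal_iteration("B")           # B is behind its budget
state_at_200 = rm.poll(now=200)    # budget is 2 iterations each
state_at_400 = rm.poll(now=400)    # budget is 4 iterations each
print(state_at_200, state_at_400)
```

Suspending an application that performs better than required frees the shared processors for applications that are falling behind.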
Budget Enforcement (Polling)
[Figure: application performance over time under polling-mode budget enforcement by the Resource Manager. Callouts: New job enters! Performance goes down! Job suspended! Better than required! Job resumed!]
Conclusions
Modern multimedia systems support a number of applications executing concurrently
A number of challenges remain for designers
Probabilistic performance prediction presented for multiple applications executing concurrently
The approach is fast, yet accurate: ideal for DSE
A design methodology is proposed that takes application specification(s) and generates the MPSoC platform
Multiple use-cases are handled by merging and partitioning
Resource manager presented: admission control and budget enforcement
Future Work
Support for hard real-time applications: both analysis and design-flow
Provide soft real-time guarantee: analysis
Mixing hard and soft real-time tasks
Extend MAMPS to CSDF, SADF models
Achieving predictability in suspension
Considering how often each use-case is used when partitioning the use-cases
Relevant Publications – Journals (first author)
Akash Kumar et al. Multi-processor Systems Synthesis for Multiple Use-Cases of Multiple Applications on FPGA. Transactions on Design Automation in Electronic Systems (ToDAES), 2008. ACM.
Akash Kumar et al. Analyzing Composability of Applications on MPSoC Platforms, Journal of Systems Architecture (JSA), 2008. Elsevier.
Akash Kumar et al. Iterative Probabilistic Performance Prediction for Multi-Application Multi-Processor Systems, Transactions on Computer Aided Design (TCAD), 2010. IEEE.
Relevant Publications – Conferences (first author)
Akash Kumar et al. Global Analysis of Resource Arbitration for MPSoC. Digital Systems Design (DSD), 2006. IEEE.
Akash Kumar et al. Resource Manager for Non-preemptive Heterogeneous Multiprocessor System-on-chip. Embedded Systems for Real-Time Multimedia (Estimedia) 2006. IEEE.
Akash Kumar et al. An FPGA Design Flow for Reconfigurable Network-Based Multi-Processor Systems-on-Chip. Design Automation and Test in Europe (DATE), 2007. IEEE.
Akash Kumar et al. A Probabilistic Approach to Model Resource Contention for Performance Estimation of Multi-featured Media Devices, Design Automation Conference (DAC), 2007. ACM/IEEE.
Akash Kumar et al. Multi-processor System-level Synthesis for Multiple Applications on Platform FPGA, Field Programmable Logic (FPL), 2007. IEEE.