Upload
ivie
View
17
Download
0
Embed Size (px)
DESCRIPTION
Slack Analysis in the System Design Loop. Girish VenkataramaniCarnegie Mellon University, The MathWorks Seth C. Goldstein Carnegie Mellon University. Typical System Design Flow. Spec. Scalability Issues: Simulation takes minutes to hours Synthesis takes many hours. - PowerPoint PPT Presentation
Citation preview
Slack Analysis in the System Design Loop
Girish Venkataramani Carnegie Mellon University,The MathWorks
Seth C. Goldstein Carnegie Mellon University
2
Typical System Design Flow
Spec.
Mapping + Allocation
Physical Synthesis
System Partitioning
RTL (*.v, *.vhd)Simulation
Code Emission
SystemDesign Loop
Too slow!
Scalability Issues:Simulation takes minutes to hoursSynthesis takes many hours
IR
3
Proposed Design Flow
Spec.
Mapping + Allocation
Physical Synthesis
System Partitioning
RTL (*.v, *.vhd)Simulation
Code Emission
IRTraditionalSystemDesign Loop
Timing Analysis
Optimize Design
Update timing in IR
New Optimization
Loopreplace with
4
Key Contributions
Spec.
Mapping + Allocation
Physical Synthesis
System Partitioning
RTL (*.v, *.vhd)Simulation
Code Emission
IR
Timing Analysis
Optimize Design
Update timing in IR
1. Slack is a distributed representation of system timing2. Linear-time update algorithm based on slack3. > 100x reduction in total design time:
• Optimization Loop runs in seconds/minutes while • System Design Loop runs in hours/days
TraditionalSystemDesign Loop
replace withNew
OptimizationLoop
5
Outline
• Motivation
• Timing Metrics– Cycle time– Slack: An alternative view of cycle time
• Slack Update Algorithm• Experimental Evaluation• Conclusions
6
The Intermediate Representation (IR)
• Models a dynamic system
• Concurrent sub-systems– PE, FSM, S/W, Memory
• Communication between sub-systems based on pre-defined protocols– FIFOs, NoC, shared bus
Sub-System
Sub-System
Sub-System
Sub-System
Network
… …
… …
Transaction-Level Modeling (TLM), [Cai, ISSS 03]Adopted by System-C, Bluespec, Balsa, Tangram
7
Marked Graphs
• Model dynamic system interactions
• Events and transitions– Event: An edge acquires a
token– Transition: Node consumes
inputs and generates outputs
• Encode the communication protocols
S1 S2
S3
token
8
Timing Analysis of IR
• Time Separation between Event (TSEs)– TSE between consecutive firings of same event in steady state is
the mean cycle time
• Mean Cycle Time, CT
• Computing CT is about O(|E|3) complexity [Dasdan 04]
cycleon tokensofnumber is N
cycle,graph a oflatency isC :
)(
i
iwhere
CTi
i
i
NC
CMax
9
Slack as a Timing Metric
• Distributed representation of cycle time– Different type of TSE– Defined on each (input edge, node) pair– How early this input arrives
Slack(S1, S3) = 3Slack(S2, S3) = 0Slack(S3, S1) = 0Slack(S3, S2) = 0
• Zero-slack input is locally critical
• Longest chain of zero-slack events yields the critical cycle or the Global Critical Path (GCP)– Cycle time = Latency of the GCP– Given slack values, computing cycle time has linear complexity
*2 5
8
S1 S2
S3
locallycritical
10
Slack is an Annotation on the IR
Sub-System
Sub-System
Sub-System
Sub-System
Network
… …
… …
SlackSlack
Slack
Slack
Helps in discovering hotspots and applying optimizations
11
Outline
• Motivation
• Timing Metrics
• Slack Update Algorithm
• Experimental Evaluation
• Conclusions
12
Optimizations Change the IR
Spec.
Mapping + Allocation
Physical Synthesis
System Partitioning
RTL (*.v, *.vhd)Simulation
Code Emission
IR+
slack
Optimize Design
Update timing in IR
New Optimization
Loop
Causes changes in component delays,in turn, changing in slack values
Need to update slack on each change
13
Problem Description
• Given a graph model and its current slack values, compute new values of slack when latency of a node changes by a given Δ
2 5
8
S1 S2
S3 6, Δ = -2
14
Insight behind Update Algorithm
• Slack is also latency difference of two branches of a re-convergent fork-join
• If delay of node, sc, increases by Δ– Update slack in surrounding
re-convergent fork-joins– Propagate change globally
sfork
sjoin
e1 e2
P1
P2
Assume δ(P1) > δ(P2), Di = δ(P1) - δ(P2)Slack(e1, sjoin) = 0Slack(e2, sjoin) = Di
Update (let Δ ≤ Di):Slack(e2, sjoin) = Di + Δ
+Δ sc
15
Insight behind Update Algorithm
1. If there is a path from change point to every input of sjoin, then no change in slack
2. If not, then slack changes occur
sjoin
e1 e2
+Δ sc
sfork+Δ
Easy for acyclic graphs, but what about scc graphs?
16
Insight for Cyclic Graphs
• Use token knowledge– Count tokens from change point to every input– If value is equal for all inputs, then no change in slack values
2 5
8
S1 S2
S3
2, Δ = -3
# toks = 1# toks = 2
Change in slack exists
17
Insight for Cyclic Graphs
• Use token knowledge– Count tokens from change point to every input– If value is equal for all inputs, then no change in slack values
2 5
8
S1 S2
S3 6, Δ = -2
# toks = 0 # toks = 0
No change in slack
18
Algorithm Summary
• Initially, find # tokens between every pair of nodes in the graph– Problem formulated as a flow lattice
– Invoked once, complexity is O(|M0| |V|)
• After inducing every change in graph– Compute slack change at each node– Propagate new change to neighboring outputs– Overall complexity is O(|V|)
19
Outline
• Motivation
• Timing Metrics
• Slack Update Algorithm
• Experimental Evaluation
• Conclusions
20
Experimental Setup
• Slack Update loop incorporated into CASH compiler [ASPLOS 04, DAC 07]
– Synthesizes asynchronous circuits from ANSI-C programs
• Applied three different optimizations– SM: Slack Matching [ICCAD 06]
– OC: Operation Chaining [ICCAD 07]– ASU: Heterogeneous Pipeline Synthesis [Async 08]
• Benchmarks: Fifteen frequently executed kernels from Mediabench suite, [Lee 97]
• All results are post-synthesis mapped to ST Micro 180nm library
21
Absolute Accuracy
• After SM, compare computed values of slack against actual values of slack for adpcm_d
0
10
20
30
40
50
60
70
80
90
100
-100 -50 0 50 100
% Slack Difference = [(Actual - Estimate)*100 / Actual]
% o
f T
ota
l Eve
nts
• Close to 100 changes applied (algo invoked for each change)
• 1.2x performance speedup
• Update inaccuracy due to unknown latency values during circuit transformation
22
Design Loop Experiments
Spec.
Mapping + Allocation
Physical Synthesis
System Partitioning
RTL (*.v, *.vhd)Simulation
Code Emission
IRSystemDesign Loop
Timing Analysis
Optimize Design
Update timing in IR
OptimizationLoop
Run the same N optimizations in both loops and compare:1. Overall Performance change2. Overall design time
23
Design Loop Experiments
• Three optimization sequences1. SM-ASU: Slack matching followed by
Heterogenous latch insertion
2. ASU-SM
3. SM-ASU-OC
• Compare final circuit timing and loop traversal (design) times between the two methodologies
24
Design Loop Experiments
• About ~500 circuit changes, on average• Design Loop time: 0.5 – 4 hours• Optimization loop: 10 - 100 seconds
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
K1.ad
pcm
_d
K2.ad
pcm
_e
K3.g7
21_d
K4.g7
21_e
K5.gs
m_d
K6.gs
m_d
K7.gs
m_e
K8.gs
m_e
K9.jpe
g_d
K10.jp
eg_e
K11.m
peg2
_d
K12.m
peg2
_d
K13.m
peg2
_e
K14.p
gp_d
K15.p
gp_e
Sp
ee
d-u
p o
ve
r B
as
e
Design Loop
Optimization Loop
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
K1.ad
pcm
_d
K2.ad
pcm
_e
K3.g7
21_d
K4.g7
21_e
K5.gs
m_d
K6.gs
m_d
K7.gs
m_e
K8.gs
m_e
K9.jpe
g_d
K10.jp
eg_e
K11.m
peg2
_d
K12.m
peg2
_d
K13.m
peg2
_e
K14.p
gp_d
K15.p
gp_e
Sp
ee
du
p o
ve
r B
as
e
Design Loop
Optimization Loop
SM-ASU ASU-SM
25
Design Loop Experiments: SM-ASU-OC
• About ~1000-3000 circuit changes, on average• Design Loop time: 1 – 10 hours• Optimization loop: 20 - 200 seconds
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
K1.ad
pcm
_d
K2.ad
pcm
_e
K3.g7
21_d
K4.g7
21_e
K5.gs
m_d
K6.gs
m_d
K7.gs
m_e
K8.gs
m_e
K9.jpe
g_d
K10.jp
eg_e
K11.m
peg2
_d
K12.m
peg2
_d
K13.m
peg2
_e
K14.p
gp_d
K15.p
gp_e
Sp
ee
du
p o
ve
r B
as
e
Design Loop
Optimization Loop
d
300x Speedup
26
Outline
• Motivation
• Timing Metrics
• Slack Update Algorithm
• Experimental Evaluation
• Conclusions
27
Conclusions
• Slack update algorithm to speed up design loop in TLM-based workflows
• Use slack to track system-level timing changes
• Orders of magnitude reduction in design time at negligible loss in accuracy
• New optimizations loop enables scalability in iterative system design flows