Slack Analysis in the System Design Loop

Slack Analysis in the System Design Loop

Girish Venkataramani Carnegie Mellon University,The MathWorks

Seth C. Goldstein Carnegie Mellon University

2

Typical System Design Flow

Spec.

Mapping + Allocation

Physical Synthesis

System Partitioning

RTL (*.v, *.vhd)Simulation

Code Emission

SystemDesign Loop

Too slow!

Scalability Issues:Simulation takes minutes to hoursSynthesis takes many hours

IR

3

Proposed Design Flow

Spec.


Physical Synthesis

System Partitioning


Code Emission

IRTraditionalSystemDesign Loop

Timing Analysis

Optimize Design

Update timing in IR

New Optimization

Loopreplace with

4

Key Contributions

Spec.


Physical Synthesis

System Partitioning


Code Emission

IR

Timing Analysis

Optimize Design

Update timing in IR

1. Slack is a distributed representation of system timing2. Linear-time update algorithm based on slack3. > 100x reduction in total design time:

• Optimization Loop runs in seconds/minutes while • System Design Loop runs in hours/days

TraditionalSystemDesign Loop

replace withNew

OptimizationLoop

5

Outline

• Motivation

• Timing Metrics– Cycle time– Slack: An alternative view of cycle time

• Slack Update Algorithm• Experimental Evaluation• Conclusions

6

The Intermediate Representation (IR)

• Models a dynamic system

• Concurrent sub-systems– PE, FSM, S/W, Memory

• Communication between sub-systems based on pre-defined protocols– FIFOs, NoC, shared bus

Sub-System

Sub-System

Sub-System

Sub-System

Network

… …

… …

Transaction-Level Modeling (TLM), [Cai, ISSS 03]Adopted by System-C, Bluespec, Balsa, Tangram

7

Marked Graphs

• Model dynamic system interactions

• Events and transitions– Event: An edge acquires a

token– Transition: Node consumes

inputs and generates outputs

• Encode the communication protocols

S1 S2

S3

token

8

Timing Analysis of IR

• Time Separation between Event (TSEs)– TSE between consecutive firings of same event in steady state is

the mean cycle time

• Mean Cycle Time, CT

• Computing CT is about O(|E|3) complexity [Dasdan 04]

cycleon tokensofnumber is N

cycle,graph a oflatency isC :

)(

i

iwhere

CTi

i

i

NC

CMax

9

Slack as a Timing Metric

• Distributed representation of cycle time– Different type of TSE– Defined on each (input edge, node) pair– How early this input arrives

Slack(S1, S3) = 3Slack(S2, S3) = 0Slack(S3, S1) = 0Slack(S3, S2) = 0

• Zero-slack input is locally critical

• Longest chain of zero-slack events yields the critical cycle or the Global Critical Path (GCP)– Cycle time = Latency of the GCP– Given slack values, computing cycle time has linear complexity

*2 5

8

S1 S2

S3

locallycritical

10

Slack is an Annotation on the IR

Sub-System

Sub-System

Sub-System

Sub-System

Network

… …

… …

SlackSlack

Slack

Slack

Helps in discovering hotspots and applying optimizations

11

Outline

• Motivation

• Timing Metrics

• Slack Update Algorithm

• Experimental Evaluation

• Conclusions

12

Optimizations Change the IR

Spec.


Physical Synthesis

System Partitioning


Code Emission

IR+

slack

Optimize Design

Update timing in IR

New Optimization

Loop

Causes changes in component delays,in turn, changing in slack values

Need to update slack on each change

13

Problem Description

• Given a graph model and its current slack values, compute new values of slack when latency of a node changes by a given Δ

2 5

8

S1 S2

S3 6, Δ = -2

14

Insight behind Update Algorithm

• Slack is also latency difference of two branches of a re-convergent fork-join

• If delay of node, sc, increases by Δ– Update slack in surrounding

re-convergent fork-joins– Propagate change globally

sfork

sjoin

e1 e2

P1

P2

Assume δ(P1) > δ(P2), Di = δ(P1) - δ(P2)Slack(e1, sjoin) = 0Slack(e2, sjoin) = Di

Update (let Δ ≤ Di):Slack(e2, sjoin) = Di + Δ

+Δ sc

15

Insight behind Update Algorithm

1. If there is a path from change point to every input of sjoin, then no change in slack

2. If not, then slack changes occur

sjoin

e1 e2

+Δ sc

sfork+Δ

Easy for acyclic graphs, but what about scc graphs?

16

Insight for Cyclic Graphs

• Use token knowledge– Count tokens from change point to every input– If value is equal for all inputs, then no change in slack values

2 5

8

S1 S2

S3

2, Δ = -3

# toks = 1# toks = 2

Change in slack exists

17

Insight for Cyclic Graphs

• Use token knowledge– Count tokens from change point to every input– If value is equal for all inputs, then no change in slack values

2 5

8

S1 S2

S3 6, Δ = -2

# toks = 0 # toks = 0

No change in slack

18

Algorithm Summary

• Initially, find # tokens between every pair of nodes in the graph– Problem formulated as a flow lattice

– Invoked once, complexity is O(|M0| |V|)

• After inducing every change in graph– Compute slack change at each node– Propagate new change to neighboring outputs– Overall complexity is O(|V|)

19

Outline

• Motivation

• Timing Metrics



• Conclusions

20

Experimental Setup

• Slack Update loop incorporated into CASH compiler [ASPLOS 04, DAC 07]

– Synthesizes asynchronous circuits from ANSI-C programs

• Applied three different optimizations– SM: Slack Matching [ICCAD 06]

– OC: Operation Chaining [ICCAD 07]– ASU: Heterogeneous Pipeline Synthesis [Async 08]

• Benchmarks: Fifteen frequently executed kernels from Mediabench suite, [Lee 97]

• All results are post-synthesis mapped to ST Micro 180nm library

21

Absolute Accuracy

• After SM, compare computed values of slack against actual values of slack for adpcm_d

0

10

20

30

40

50

60

70

80

90

100

-100 -50 0 50 100

% Slack Difference = [(Actual - Estimate)*100 / Actual]

% o

f T

ota

l Eve

nts

• Close to 100 changes applied (algo invoked for each change)

• 1.2x performance speedup

• Update inaccuracy due to unknown latency values during circuit transformation

22

Design Loop Experiments

Spec.


Physical Synthesis

System Partitioning


Code Emission

IRSystemDesign Loop

Timing Analysis

Optimize Design

Update timing in IR

OptimizationLoop

Run the same N optimizations in both loops and compare:1. Overall Performance change2. Overall design time

23


• Three optimization sequences1. SM-ASU: Slack matching followed by

Heterogenous latch insertion

2. ASU-SM

3. SM-ASU-OC

• Compare final circuit timing and loop traversal (design) times between the two methodologies

24


• About ~500 circuit changes, on average• Design Loop time: 0.5 – 4 hours• Optimization loop: 10 - 100 seconds

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

K1.ad

pcm

_d

K2.ad

pcm

_e

K3.g7

21_d

K4.g7

21_e

K5.gs

m_d

K6.gs

m_d

K7.gs

m_e

K8.gs

m_e

K9.jpe

g_d

K10.jp

eg_e

K11.m

peg2

_d

K12.m

peg2

_d

K13.m

peg2

_e

K14.p

gp_d

K15.p

gp_e

Sp

ee

d-u

p o

ve

r B

as

e

Design Loop

Optimization Loop

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

K1.ad

pcm

_d

K2.ad

pcm

_e

K3.g7

21_d

K4.g7

21_e

K5.gs

m_d

K6.gs

m_d

K7.gs

m_e

K8.gs

m_e

K9.jpe

g_d

K10.jp

eg_e

K11.m

peg2

_d

K12.m

peg2

_d

K13.m

peg2

_e

K14.p

gp_d

K15.p

gp_e

Sp

ee

du

p o

ve

r B

as

e

Design Loop

Optimization Loop

SM-ASU ASU-SM

25

Design Loop Experiments: SM-ASU-OC

• About ~1000-3000 circuit changes, on average• Design Loop time: 1 – 10 hours• Optimization loop: 20 - 200 seconds

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

K1.ad

pcm

_d

K2.ad

pcm

_e

K3.g7

21_d

K4.g7

21_e

K5.gs

m_d

K6.gs

m_d

K7.gs

m_e

K8.gs

m_e

K9.jpe

g_d

K10.jp

eg_e

K11.m

peg2

_d

K12.m

peg2

_d

K13.m

peg2

_e

K14.p

gp_d

K15.p

gp_e

Sp

ee

du

p o

ve

r B

as

e

Design Loop

Optimization Loop

d

300x Speedup

26

Outline

• Motivation

• Timing Metrics



• Conclusions

27

Conclusions

• Slack update algorithm to speed up design loop in TLM-based workflows

• Use slack to track system-level timing changes

• Orders of magnitude reduction in design time at negligible loss in accuracy

• New optimizations loop enables scalability in iterative system design flows

Documents

Slack Analysis in the System Design Loop