ADSP Lecture2 - Unfolding ([email protected])

VLSI Signal Processing
Lecture 2: Unfolding Transformation


Multiple-Data Processing
• Create a program that handles more than one iteration at a time, e.g. by unrolling J loop iterations
• Example: loop unrolling + software pipelining

[Figure: schedules of operations 1-3 over clock cycles 1-8, illustrating loop unrolling with software pipelining]

Basic Ideas
• Parallel processing
• Pipelined processing

[Figure: two timing charts, one per scheme, showing how processors P1-P4 handle tasks a1-a4, b1-b4, c1-c4 and d1-d4 over time]

Data Dependence
• Parallel processing requires NO data dependence between processors
• Pipelined processing involves inter-processor communication

[Figure: processors P1-P4 versus time, for the parallel case and for the pipelined case]

Parallel Processing
• In a J-unfolded system, each delay is J-slow. That is, if the input to a delay element is x(kJ+m), then its output is x((k-1)J+m) = x(kJ+m-J).

Parallel Processing
• Block processing
– The number of inputs processed in a clock cycle is referred to as the block size.
– At the k-th clock cycle, three inputs x(3k), x(3k+1), and x(3k+2) are processed simultaneously to generate y(3k), y(3k+1), and y(3k+2).

[Figure: a SISO system x(n) -> y(n), and its 3-parallel equivalent: a serial-to-parallel converter feeds x(3k), x(3k+1), x(3k+2) into a MIMO system whose outputs y(3k), y(3k+1), y(3k+2) pass through a parallel-to-serial converter to give y(n)]

I/O Conversion
• Serial-to-parallel converter
• Parallel-to-serial converter

[Figure: serial-to-parallel converter (x(n) through two delays D to x(3k), x(3k+1), x(3k+2)) and parallel-to-serial converter (y(3k), y(3k+1), y(3k+2) through two delays D to y(n)); the sampling period is T/3]
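In software, the two converters amount to regrouping a sample stream. The following is a minimal sketch, assuming a block size of 3 and a stream whose length is a multiple of the block size; the function names are illustrative and do not come from the slides.

```python
def serial_to_parallel(x, block=3):
    """Group a serial stream x(n) into blocks [x(Lk), x(Lk+1), ..., x(Lk+L-1)]."""
    if len(x) % block != 0:
        raise ValueError("stream length must be a multiple of the block size")
    return [x[k * block:(k + 1) * block] for k in range(len(x) // block)]


def parallel_to_serial(blocks):
    """Flatten blocks [y(Lk), ..., y(Lk+L-1)] back into the serial stream y(n)."""
    return [sample for blk in blocks for sample in blk]


x = list(range(9))                      # stands for x(0), x(1), ..., x(8)
blocks = serial_to_parallel(x, block=3)
print(blocks)                           # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(parallel_to_serial(blocks) == x)  # True
```

Each inner list is what the MIMO system would see in one clock cycle, so the processing clock can run at one third of the sample rate.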


General approach for block processing

Mathematical Formulation
• e.g. y(n) = ay(n-9) + x(n)
• 2-parallel:
y(2k) = ay(2k-9) + x(2k)
y(2k+1) = ay(2k-8) + x(2k+1)
• In a 2-parallel SDFG, one active clock edge processes two samples:
y(2k) = ay(2(k-5)+1) + x(2k)
y(2k+1) = ay(2(k-4)+0) + x(2k+1)
• A dependency that spans fewer sample delays than the level of parallelism can be implemented with internal routing
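The 2-parallel rewrite can be checked numerically against the serial recurrence. This is a small sketch, assuming zero initial conditions (y(n) = 0 for n < 0) and an even number of samples; the function names are made up for illustration.

```python
def serial_iir(x, a, delay=9):
    """Reference: y(n) = a*y(n-delay) + x(n), with y(n) = 0 for n < 0."""
    y = []
    for n, xn in enumerate(x):
        prev = y[n - delay] if n >= delay else 0.0
        y.append(a * prev + xn)
    return y


def two_parallel_iir(x, a):
    """2-parallel form: iteration k produces both y(2k) and y(2k+1).

    y(2k)   = a*y(2(k-5)+1) + x(2k)
    y(2k+1) = a*y(2(k-4)+0) + x(2k+1)
    """
    assert len(x) % 2 == 0, "sketch assumes an even number of samples"
    y_even, y_odd = [], []            # y_even[k] = y(2k), y_odd[k] = y(2k+1)
    for k in range(len(x) // 2):
        from_odd = y_odd[k - 5] if k >= 5 else 0.0    # y(2(k-5)+1) = y(2k-9)
        from_even = y_even[k - 4] if k >= 4 else 0.0  # y(2(k-4))   = y(2k-8)
        y_even.append(a * from_odd + x[2 * k])
        y_odd.append(a * from_even + x[2 * k + 1])
    # Interleave the two lanes back into serial order.
    return [v for pair in zip(y_even, y_odd) for v in pair]


x = [float(n % 7) for n in range(40)]
print(two_parallel_iir(x, 0.5) == serial_iir(x, 0.5))  # True
```

Note that the even lane reads from the odd lane 5 iterations back and the odd lane reads from the even lane 4 iterations back, which is exactly the cross-coupling the rewritten equations describe.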

Unfolding the DFG

[Figure: a DFG before unfolding (T = Ts) and after J-unfolding (T = J Ts)]

Not trivial, even for a simple graph

Block Processing for FIR Filter
• One form of vectorized parallel processing of DSP algorithms (not parallel processing in the most general sense)
• Block vector: [x(3k) x(3k+1) x(3k+2)]
• Clock cycle: can be 3 times longer
• Original (FIR filter):
$$y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$$
• Rewrite 3 equations at a time:
$$\begin{aligned}
y(3k) &= a\,x(3k) + b\,x(3k-1) + c\,x(3k-2) \\
y(3k+1) &= a\,x(3k+1) + b\,x(3k) + c\,x(3k-1) \\
y(3k+2) &= a\,x(3k+2) + b\,x(3k+1) + c\,x(3k)
\end{aligned}$$
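The three block equations for the FIR filter can be exercised directly. The sketch below assumes x(n) = 0 for n < 0 and an input length that is a multiple of 3; the helper names are illustrative.

```python
def fir_serial(x, a, b, c):
    """y(n) = a*x(n) + b*x(n-1) + c*x(n-2), with x(n) = 0 for n < 0."""
    def xs(n):
        return x[n] if n >= 0 else 0
    return [a * xs(n) + b * xs(n - 1) + c * xs(n - 2) for n in range(len(x))]


def fir_block3(x, a, b, c):
    """3-parallel block FIR: iteration k yields y(3k), y(3k+1), y(3k+2)."""
    if len(x) % 3 != 0:
        raise ValueError("sketch assumes the input length is a multiple of 3")

    def xs(n):
        return x[n] if n >= 0 else 0

    y = []
    for k in range(len(x) // 3):
        y.append(a * xs(3 * k) + b * xs(3 * k - 1) + c * xs(3 * k - 2))
        y.append(a * xs(3 * k + 1) + b * xs(3 * k) + c * xs(3 * k - 1))
        y.append(a * xs(3 * k + 2) + b * xs(3 * k + 1) + c * xs(3 * k))
    return y


x = [1, 4, 2, 8, 5, 7, 3, 6, 9]
print(fir_block3(x, 2, 3, 5) == fir_serial(x, 2, 3, 5))  # True
```

The three `append` calls inside one loop iteration have no dependence on each other, which is why hardware can evaluate them in the same (three times longer) clock cycle.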


Block Processing

Block Processing for IIR Digital Filter
• Original formulation:
$$y(n) = a\,y(n-2) + x(n)$$
• Rewrite:
$$\begin{aligned}
y(2k) &= a\,y(2k-2) + x(2k) \\
y(2k+1) &= a\,y(2k-1) + x(2k+1)
\end{aligned}$$
• Vector formulation:
$$\mathbf{y}(k) = a\,\mathbf{y}(k-1) + \mathbf{x}(k), \qquad \mathbf{x}(k) = \begin{bmatrix} x(2k) \\ x(2k+1) \end{bmatrix}, \qquad \mathbf{y}(k) = \begin{bmatrix} y(2k) \\ y(2k+1) \end{bmatrix}$$
n: sample period
k: processor period
Tsample ≠ Tclk

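Block IIR Filter

[Figure: block IIR filter. An S/P converter splits x(n) into x(2k) and x(2k+1); two adders produce y(2k) and y(2k+1), which are fed back through delays D as y(2(k-1)) and y(2(k-1)+1); a P/S converter merges the outputs into y(n)]

The clock period is not equal to the sampling period.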
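A software model of the block IIR filter keeps one two-element state vector that plays the role of the two delay elements. This is a sketch under the assumption of zero initial state and an even number of samples; the names are illustrative.

```python
def iir_serial(x, a):
    """y(n) = a*y(n-2) + x(n), with zero initial conditions."""
    y = []
    for n, xn in enumerate(x):
        y.append(a * (y[n - 2] if n >= 2 else 0) + xn)
    return y


def iir_block2(x, a):
    """Vector form y(k) = a*y(k-1) + x(k), with x(k) = [x(2k), x(2k+1)]."""
    if len(x) % 2 != 0:
        raise ValueError("sketch assumes an even number of samples")
    state = [0, 0]                   # holds y(k-1) = [y(2(k-1)), y(2(k-1)+1)]
    out = []
    for k in range(len(x) // 2):     # one pass = one processor clock period
        xk = x[2 * k:2 * k + 2]
        state = [a * state[0] + xk[0], a * state[1] + xk[1]]
        out.extend(state)
    return out


x = [3, 1, 4, 1, 5, 9, 2, 6]
print(iir_block2(x, 2) == iir_serial(x, 2))  # True
```

Each pass through the loop consumes two samples, so the loop index k advances at half the sample rate.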

Timing Comparison
• Pipelining
• Block processing

[Figure: timing charts comparing, per clock cycle, the input samples x(n), the MAC, Mul and Add operations, and the outputs y(n) for pipelining and for 2-parallel block processing, where one lane handles x(1), x(3), x(5), x(7) and the other x(2), x(4), x(6), x(8)]

Definitions
• Unfolding transforms a loop so that several iterations are unrolled into the same iteration.
• Also known as (a.k.a.)
– Loop unrolling (in compilers for parallel programs)
– Block processing
• Applications
– Reducing the sampling period to achieve the iteration bound (desired throughput rate) T∞
– Parallel (block) processing to execute several iterations concurrently
– Digit-serial or bit-serial processing

Unfolding the DFG
• y(n) = ay(n-9) + x(n)
• Rewrite the algorithm formulation:
y(2k) = ay(2k-9) + x(2k)
y(2k+1) = ay(2k-8) + x(2k+1)
that is,
y(2k) = ay(2(k-5)+1) + x(2k)
y(2k+1) = ay(2(k-4)) + x(2k+1)
• After J-fold unfolding, the clock period is T = J Ts, where Ts is the data sampling period.

Timing Diagram
• The timing diagram is obtained assuming that the sampling period Ts remains unchanged. Thus, the clock period T is increased J-fold.
• Since 9/2 is not an integer, the outputs (y(0), y(1)) are needed by two different future iterations, 4T and 5T later.

[Figure: timing diagram. With T = Ts, the outputs y(0), y(1), ..., y(13) are produced one per clock cycle; with T = 2Ts, y(0), y(2), ..., y(12) and y(1), y(3), ..., y(13) are produced in pairs. The 9-sample dependence becomes 4T and 5T in the unfolded schedule]
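The "4T and 5T later" figures follow from integer division of the consumer's sample index by J. A short sketch, with a hypothetical helper name:

```python
def consumers(J, w):
    """For each lane i, return (consumer lane, iterations later) for an edge with w delays."""
    return [((i + w) % J, (i + w) // J) for i in range(J)]


# y(n) = a*y(n-9) + x(n) unfolded by J = 2:
# y(0) (lane 0) is read by y(9),  which sits in lane 1 of iteration 4;
# y(1) (lane 1) is read by y(10), which sits in lane 0 of iteration 5.
print(consumers(2, 9))  # [(1, 4), (0, 5)]
```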

Another DFG Unfolding Example

[Figure: original DFG with nodes Q, R, S, T and edges carrying 2D and 3D, next to the node copies Q0, R0, S0, T0, Q1, R1, S1, T1 of the unfolded DFG; J=2, T=3]

i   w   (i+w)%J   floor[(i+w)/J]
0   0   0         0
0   2   0         1
0   3   1         1
1   0   1         0
1   2   1         1
1   3   0         2

Step 1. Create J copies of each node.

Step 2. Add all edges with 0 delays on them.

Step 3. Use the table of (i+w)%J and floor[(i+w)/J] values to determine the edges with delays.

[Figure: 2-unfolded DFG with the delayed edges added, carrying D, D, D and 2D; T=3 for the original DFG and T=6 for the unfolded DFG]

Unfolding Transformation
• For each node U in the original DFG, draw J nodes U_0, U_1, …, U_{J-1}
• For each edge U→V with w delays in the original DFG, draw the J edges U_i → V_{(i+w)%J} with floor[(i+w)/J] delays, for i = 0, 1, …, J-1
• Unfolding an edge with w delays in the original DFG produces J-w edges with no delays and w edges with 1 delay in the J-unfolded DFG, for w < J
• Unfolding preserves the precedence constraints of a DSP algorithm
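The two rules translate almost line for line into code. The sketch below assumes the DFG is given as a list of (U, V, w) edge tuples and represents the copy U_i as the pair (U, i); this representation and the single-edge test graph are illustrative choices, not taken from the slides.

```python
def unfold(edges, J):
    """J-unfold a DFG given as a list of (U, V, w) edges.

    Each original edge U -> V with w delays yields the J edges
    U_i -> V_{(i+w) % J} carrying (i+w) // J delays, for i = 0..J-1.
    """
    unfolded = []
    for (u, v, w) in edges:
        for i in range(J):
            unfolded.append(((u, i), (v, (i + w) % J), (i + w) // J))
    return unfolded


# Hypothetical single edge U -> V with 37 delays, unfolded by J = 4.
for src, dst, delays in unfold([("U", "V", 37)], 4):
    print(src, "->", dst, "delays:", delays)

# An edge with w < J: expect J-w edges with 0 delays and w edges with 1 delay.
print(sorted(d for _, _, d in unfold([("U", "V", 2)], 5)))  # [0, 0, 0, 1, 1]
```

Nodes never need to be created explicitly in this representation: every copy (U, i) appears as an endpoint of some unfolded edge as long as U has at least one edge.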


Precedence Preservation

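Delay Preservation
• Unfolding preserves the number of delays in a DFG:
$$\left\lfloor \frac{w}{J} \right\rfloor + \left\lfloor \frac{w+1}{J} \right\rfloor + \cdots + \left\lfloor \frac{w+J-1}{J} \right\rfloor = w$$
• Let $w = mJ + n$, where $m, n \in \mathbb{N}_0$ and $0 \le n \le J-1$. Then
$$\begin{aligned}
\left\lfloor \frac{w+i}{J} \right\rfloor &= \left\lfloor \frac{mJ+n+i}{J} \right\rfloor = m, && 0 \le i \le J-n-1 \\
\left\lfloor \frac{w+i}{J} \right\rfloor &= \left\lfloor \frac{mJ+n+i}{J} \right\rfloor = m+1, && J-n \le i \le J-1
\end{aligned}$$
so that
$$\sum_{i=0}^{J-1} \left\lfloor \frac{w+i}{J} \right\rfloor = (J-n)\,m + n\,(m+1) = mJ + n = w$$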
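The identity is easy to confirm by brute force over a range of w and J. A minimal sketch (the helper name is illustrative):

```python
def unfolded_delays(w, J):
    """Delays on the J edges produced by unfolding one edge with w delays."""
    return [(i + w) // J for i in range(J)]


# Exhaustive check of delay preservation over a small range.
for J in range(1, 8):
    for w in range(0, 40):
        assert sum(unfolded_delays(w, J)) == w

print(unfolded_delays(7, 2))  # [3, 4]
print(unfolded_delays(7, 5))  # [1, 1, 1, 2, 2]
```

With w = 7 = 1*5 + 2, the J = 5 case shows the split used in the proof: J-n = 3 edges carry m = 1 delay and n = 2 edges carry m+1 = 2 delays.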

Example
• Unfold the following DFG using unfolding factors 2 and 5

[Figure: original DFG with nodes A, B, C, D, E and edge delays D, 2D, 3D and 7D; its 2-unfolded DFG with nodes A0-E0 and A1-E1; its 5-unfolded DFG with nodes A0-E0 through A4-E4]

Properties of Unfolding
• Unfolding preserves the number of registers (delays) in a DFG
• A loop with w delays in a DFG that is unfolded J times leads to gcd(w, J) loops in the unfolded DFG, each containing w/gcd(w, J) delays and J/gcd(w, J) copies of each node that appears in the original loop.
• Unfolding a DFG with iteration bound T results in a J-unfolded DFG with iteration bound JT.
• A path with w (< J) delays in a DFG leads to J-w paths with no delays and w paths with 1 delay each in the J-unfolded DFG.
• Any clock period that can be achieved by retiming a J-unfolded DFG can also be achieved by retiming the original DFG and then J-unfolding it.

When a Loop is Unfolded
• Consider a loop ℓ with w delays in a DFG
• Traversing the loop A~>A p times also gives a loop, with pw delays
• In the J-unfolded DFG, consider the path A_i ~> A_{(i+pw)%J}. It is a loop if i = (i+pw)%J, which implies that J | pw
• The smallest such p is J/gcd(J, w). That is, in the J-unfolded DFG, one travels the loop A~>A J/gcd(J, w) times before returning to the starting copy.
• Recall that there are J copies of node A in total. Hence, there are J/(J/gcd(J, w)) = gcd(J, w) loops, and each loop contains w/gcd(J, w) delays.
• The iteration bound of the J-unfolded DFG is then
$$T'_\infty = \max_{\ell} \left\{ \frac{\dfrac{J}{\gcd(w_\ell, J)}\, t_\ell}{\dfrac{w_\ell}{\gcd(w_\ell, J)}} \right\} = J \max_{\ell} \left\{ \frac{t_\ell}{w_\ell} \right\} = J\,T_\infty$$
where t_ℓ and w_ℓ are the computation time and the number of delays of loop ℓ.
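The gcd argument can be checked by literally walking the copies of A. The sketch below collapses the loop to a single self-edge A -> A with w ≥ 1 delays, which is a simplifying assumption; the function name is illustrative.

```python
from math import gcd


def unfolded_loops(w, J):
    """Follow A_i -> A_{(i+w) % J} to enumerate the loops after J-unfolding.

    Returns one (copies_of_A, delays) pair per loop in the unfolded graph.
    """
    seen, loops = set(), []
    for start in range(J):
        if start in seen:
            continue
        i, copies, delays = start, 0, 0
        while True:
            seen.add(i)
            copies += 1
            delays += (i + w) // J     # delays on the edge leaving A_i
            i = (i + w) % J            # next copy of A on the loop
            if i == start:
                break
        loops.append((copies, delays))
    return loops


w, J = 6, 4
loops = unfolded_loops(w, J)
g = gcd(w, J)
print(loops)                                               # [(2, 3), (2, 3)]
print(len(loops) == g)                                     # True
print(all(c == J // g and d == w // g for c, d in loops))  # True
```

For w = 6 and J = 4 the walk finds gcd(6, 4) = 2 loops, each visiting 4/2 = 2 copies of A and carrying 6/2 = 3 delays, in line with the counts derived above.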

When a Path is Unfolded
• If w < J, then a path containing w delays in a DFG leads to (J-w) paths with no delays and w paths with 1 delay in the J-unfolded DFG.
• If w ≥ J, then the path leads to J paths with one or more delays in the J-unfolded DFG. This implies that these paths are not critical.
• Assume that the critical path of the J-unfolded DFG is c. If D(U,V) ≥ c, then Wr(U,V) = W(U,V) + r(V) - r(U) ≥ J
• Any feasible clock cycle period that can be obtained by retiming the J-unfolded DFG can also be achieved by retiming the original DFG directly and then J-unfolding it.

When a Path is Unfolded
• Suppose r' is a legal retiming for the J-unfolded DFG, G_J, which leads to critical path c.
• Let $r(U) = \sum_{i=0}^{J-1} r'(U_i)$.
– r is a feasible retiming for the original DFG, G.
– The retiming leads to a critical path c.
• Consider an edge U→V with w delays in G. Since r' is a legal retiming for G_J and leads to a critical path c, then
$$\begin{aligned}
(1)\quad & r'(U_i) - r'(V_{(i+w)\%J}) \le \left\lfloor \frac{i+w}{J} \right\rfloor && \text{(feasibility constraint)} \\
(2)\quad & r'(U_i) - r'(V_{(i+w)\%J}) \le \left\lfloor \frac{i+w}{J} \right\rfloor - 1, \ \text{if } D(U_i, V_{(i+w)\%J}) \ge c && \text{(critical path constraint)}
\end{aligned}$$
• Summing over $0 \le i \le J-1$:
$$\begin{aligned}
(1)\quad & r(U) - r(V) \le w \\
(2)\quad & r(U) - r(V) \le W(U,V) - J
\end{aligned}$$

Sample Period Reduction
• Case 1: A node in the DFG has a computation time greater than T∞
• Case 2: The iteration bound is not an integer
• Case 3: The longest node computation time is larger than the iteration bound T∞, and T∞ is not an integer
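All three cases can be handled by one search for the unfolding factor. The sketch below assumes the criterion that the unfolded iteration bound J·T∞ must be an integer and at least as large as the longest node computation time; this criterion, the function name and the numeric inputs are assumptions made for illustration and are not stated on the slides.

```python
from fractions import Fraction


def min_unfolding_factor(t_max, t_inf, j_limit=1000):
    """Smallest J such that J*T_inf is an integer and J*T_inf >= t_max.

    t_max: longest node computation time (integer time units)
    t_inf: iteration bound of the original DFG, as an exact Fraction
    """
    t_inf = Fraction(t_inf)
    for J in range(1, j_limit + 1):
        bound = J * t_inf              # iteration bound of the J-unfolded DFG
        if bound.denominator == 1 and bound >= t_max:
            return J
    raise ValueError("no unfolding factor found within j_limit")


# Hypothetical inputs, one per case:
print(min_unfolding_factor(4, Fraction(3)))     # case 1 -> 2
print(min_unfolding_factor(1, Fraction(4, 3)))  # case 2 -> 3
print(min_unfolding_factor(6, Fraction(4, 3)))  # case 3 -> 6
```

Using `Fraction` rather than floating point keeps the integrality test exact, which matters because a bound such as 4/3 has no exact binary representation.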

Case 1
• The critical path dominates, since a node computation time is greater than the iteration bound
• Retiming cannot be used to reduce the sample period

Sample Period Reduction
• Rule of thumb: if a node U has computation time $t_U > T_\infty$, then $\lceil t_U / T_\infty \rceil$-unfolding should be used

T∞ = 6, Tcritical = 6

Case 2
• The iteration period cannot achieve the iteration bound when the bound is not an integer


Sample Period Reduction


Case 3

Parallel Processing
• Parallel processing can be performed by unfolding


Bit-Level Parallel Processing


Bit-Serial Adder


Unfolding of Switches


Example


Example


Example


Example


Switches with Delays


Switch with Delays


If Wordlength is not a Multiple of J