L7: Pipelining and Parallel Processing VADA Lab


Introduction (1)

Pipelining transformation leads to a reduction in the critical path, which can be exploited either to increase the clock speed (sample speed) or to reduce power consumption at the same speed.

In parallel processing, multiple outputs are computed in parallel in a clock period. Therefore, the effective sampling speed is increased by the level of parallelism.

Introduction (2)

3-tap FIR digital filter

y(n) = ax(n)+bx(n-1)+cx(n-2)

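Sample period: $T_{sample} \geq T_M + 2T_A$

Sampling frequency: $f_{sample} \leq \dfrac{1}{T_M + 2T_A}$

(Here $T_M$ is the time taken for a multiplication and $T_A$ is the time taken for an addition.)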

Pipelining of FIR digital filter

Pipelined implementation of the 3-tap FIR filter is obtained by placing 2 additional latches.

The critical path is reduced from $T_M + 2T_A$ to $T_M + T_A$.

The two main drawbacks of pipelining are an increase in the number of latches and an increase in system latency.

<Pipelined FIR filter>
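The effect of the two latches can be checked with a small behavioural simulation. The Python sketch below is illustrative only: the function names, the coefficient values, and the particular cutset (the adder-to-adder edge plus the delay-line edge feeding the c multiplier) are assumptions made for this example, not read off the figure. It suggests that the pipelined filter produces the same samples as the direct form, one clock cycle later.

```python
def fir_direct(x, a, b, c):
    """Direct-form 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    x1 = x2 = 0                      # delay line: x(n-1), x(n-2)
    y = []
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x2, x1 = x1, xn
    return y


def fir_pipelined(x, a, b, c):
    """Same filter with 2 extra latches on an assumed feed-forward cutset."""
    x1 = x2 = 0                      # original delay line
    lat_sum = 0                      # latch on the adder-to-adder edge
    lat_x = 0                        # latch on the edge feeding the c multiplier
    y = []
    for xn in x:
        # Stage 2 works on values latched in the previous cycle: T_M + T_A.
        y.append(lat_sum + c * lat_x)
        # Stage 1 computes the first partial sum: T_M + T_A.
        lat_sum = a * xn + b * x1
        lat_x = x2
        x2, x1 = x1, xn
    return y


x = [1, 2, 3, 4, 5]                  # hypothetical input samples
direct = fir_direct(x, 2, 3, 5)      # [2, 7, 17, 27, 37]
piped = fir_pipelined(x, 2, 3, 5)    # [0, 2, 7, 17, 27]
```

The one-sample shift between `direct` and `piped` is the added system latency; each stage now contains only one multiplication and one addition.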

Pipelining of FIR digital filter (2)

The critical path (longest path) can be reduced by suitably placing the pipelining latches in the architecture.

The pipelining latches can only be placed across a feed-forward cutset of the graph.

Two graph definitions are needed for pipelining:

Cutset: a set of edges of a graph such that if these edges are removed from the graph, the graph becomes disjoint.

Feed-forward cutset: a cutset in which the data move in the forward direction on all the edges of the cutset.

To obtain a correctly pipelined circuit, pipelining latches must be inserted on all the edges of the feed-forward cutset.
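One simple way to make the two definitions concrete is to describe a cut by the set of nodes on its input side and then inspect the edges that cross it. The Python sketch below does this; the node names, the edge list for the 3-tap FIR filter, and the helper names are hypothetical and chosen only for illustration.

```python
def crossing_edges(edges, input_side):
    """Edges with exactly one endpoint in input_side: the cutset of this partition."""
    return [(u, v) for (u, v) in edges if (u in input_side) != (v in input_side)]


def is_feed_forward_cutset(edges, input_side):
    """True if every crossing edge runs from the input side to the output side."""
    cut = crossing_edges(edges, input_side)
    return bool(cut) and all(u in input_side for (u, _) in cut)


# Assumed graph of the 3-tap FIR filter (d1, d2 are the delay elements).
fir_edges = [
    ("x", "mul_a"), ("x", "d1"), ("d1", "mul_b"), ("d1", "d2"),
    ("d2", "mul_c"), ("mul_a", "add1"), ("mul_b", "add1"),
    ("add1", "add2"), ("mul_c", "add2"), ("add2", "y"),
]
left = {"x", "d1", "mul_a", "mul_b", "add1"}
fir_cut = crossing_edges(fir_edges, left)        # two edges, both forward
fir_ok = is_feed_forward_cutset(fir_edges, left)

# A two-node feedback loop: any cut has one forward and one backward edge.
loop_edges = [("p", "q"), ("q", "p")]
loop_ok = is_feed_forward_cutset(loop_edges, {"p"})
```

In this model the FIR cut contains two edges, matching the two latches of the pipelined filter, while the loop has no feed-forward cutset.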


Signal-flow graph example


Data-Broadcast Structures

The critical path of the original 3-tap FIR filter can be reduced without introducing any pipelining latches by transposing the structure.

Transposition theorem

“Reversing the direction of all the edges in a given SFG (signal-flow graph) and interchanging the input and output ports preserves the functionality of the system.”


<SFG representation of the FIR filter>

<Transposed SFG representation of the FIR filter>


Transposed SFG representation leads to the data-broadcast structure where data are not stored but are broadcast to all the multipliers simultaneously.

<Data-broadcast structure of the FIR filter>
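The equivalence of the two structures can be checked numerically. The Python sketch below is an assumed behavioural model, not a description of the figure: `w1` and `w2` are hypothetical names for the two delay registers in the adder chain of the data-broadcast structure, and the coefficients are arbitrary.

```python
def fir_direct(x, a, b, c):
    """Direct form: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    x1 = x2 = 0
    y = []
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x2, x1 = x1, xn
    return y


def fir_broadcast(x, a, b, c):
    """Data-broadcast form: x(n) feeds all three multipliers at once."""
    w1 = w2 = 0                      # registers hold partial sums, not inputs
    y = []
    for xn in x:
        y.append(a * xn + w1)        # longest path: one multiply, one add
        w1 = b * xn + w2             # becomes b*x(n-1) + c*x(n-2) next cycle
        w2 = c * xn                  # becomes c*x(n-1) next cycle
    return y


x = [1, 2, 3, 4, 5]                  # hypothetical input samples
same = fir_broadcast(x, 2, 3, 5) == fir_direct(x, 2, 3, 5)
```

Every register-to-register path in `fir_broadcast` contains at most one multiplication and one addition, which is why the critical path drops to $T_M + T_A$ without any extra latches.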


Fine-Grain Pipelining

Let $T_M = 10$ units and $T_A = 2$ units, and let the desired clock period be $(T_M + T_A)/2 = 6$ units.

In this case the multiplier is broken into 2 smaller units with processing times of 6 units and 4 units, respectively.

By placing the latches on the horizontal cutset across the multiplier, the desired clock speed can be achieved.
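The arithmetic behind this split can be written out explicitly. The Python sketch below assumes $T_A = 2$ units (the value implied by $(T_M + T_A)/2 = 6$ with $T_M = 10$) and assumes that the second multiplier piece shares its pipeline stage with the adder; the function name is hypothetical.

```python
def stage_delays(t_mul_first, t_mul_second, t_add):
    """Delays of the two stages after cutting the multiplier in two."""
    stage1 = t_mul_first                 # first part of the multiplier
    stage2 = t_mul_second + t_add        # rest of the multiplier, then the adder
    return stage1, stage2


T_M, T_A = 10, 2                         # T_A = 2 is inferred, see lead-in
delays = stage_delays(6, 4, T_A)         # (6, 6)
clock_period = max(delays)               # 6 units
unpipelined = T_M + T_A                  # 12 units before the split
```

Both stages take 6 units under these assumptions, so the clock period is halved relative to the 12-unit path of one full multiplication followed by one addition.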

Parallel Processing (1)

Designing a Parallel FIR System

To obtain a parallel processing structure, the SISO (single-input single-output) system must be converted into a MIMO (multiple-input multiple-output) system.

y(3k) = ax(3k)+bx(3k-1)+cx(3k-2)

y(3k+1) = ax(3k+1)+bx(3k)+cx(3k-1)

y(3k+2) = ax(3k+2)+bx(3k+1)+cx(3k)

Parallel processing systems are also referred to as block processing systems.
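The three block equations can be exercised directly. In the Python sketch below, one loop iteration stands for one clock cycle of a 3-parallel system that consumes x(3k), x(3k+1), x(3k+2) and produces y(3k), y(3k+1), y(3k+2). The function names and test values are assumptions made for illustration.

```python
def fir_direct(x, a, b, c):
    """Serial reference: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    x1 = x2 = 0
    y = []
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x2, x1 = x1, xn
    return y


def fir_parallel3(x, a, b, c):
    """Block size 3: three outputs per clock cycle (len(x) must be a multiple of 3)."""
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of the block size 3")
    xm1 = xm2 = 0                        # x(3k-1) and x(3k-2) from the previous block
    y = []
    for k in range(0, len(x), 3):
        x0, x1, x2 = x[k:k + 3]          # x(3k), x(3k+1), x(3k+2)
        y.append(a * x0 + b * xm1 + c * xm2)   # y(3k)
        y.append(a * x1 + b * x0 + c * xm1)    # y(3k+1)
        y.append(a * x2 + b * x1 + c * x0)     # y(3k+2)
        xm1, xm2 = x2, x1                # become x(3k-1), x(3k-2) of the next block
    return y


x = list(range(1, 10))                   # nine hypothetical samples, three blocks
same = fir_parallel3(x, 2, 3, 5) == fir_direct(x, 2, 3, 5)
```

The serial loop needs nine iterations for these samples, while the block version needs three, with the same amount of work per output.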

Parallel Processing (2)

Parallel processing architecture for a 3-tap FIR filter

(with block size 3)


The critical path of the parallel processing system remains unchanged, and the clock period ($T_{clk}$) must satisfy:

$T_{clk} \geq T_M + 2T_A$

But since 3 samples are processed in 1 clock cycle instead of 3, the iteration period is given by:

$T_{iter} = T_{sample} = \dfrac{1}{L} T_{clk} \geq \dfrac{1}{3}(T_M + 2T_A)$

In a pipelined system, by contrast, $T_{clk} = T_{sample}$.


Complete parallel processing system with block size 4


Why do we use parallel processing when we can use pipelining?

Because there is a fundamental limit to pipelining imposed by the I/O bottlenecks.

Pipelining can be combined with parallel processing to further increase the speed of the architecture.

By combining parallel processing and pipelining, the sample period has been reduced to:

$T_{iter} = T_{sample} = \dfrac{1}{LM} T_{clk} = \dfrac{1}{6}(T_M + 2T_A)$

Parallel processing is also used for reduction of power consumption while using slow clocks.
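The sample-period expressions for the serial, parallel, and combined systems differ only in the divisor applied to the original critical path $T_M + 2T_A$. The Python sketch below tabulates them under two assumptions that are not stated on the slide: the illustrative delays $T_M = 10$ and $T_A = 2$ units, and an ideal M-level pipeline that splits the critical path evenly. The function name is hypothetical.

```python
from fractions import Fraction


def sample_period(t_m, t_a, L=1, M=1):
    """Minimum sample period of the 3-tap FIR with L-parallel, M-level pipelined hardware."""
    critical_path = t_m + 2 * t_a        # T_M + 2*T_A of the original filter
    return Fraction(critical_path, L * M)


T_M, T_A = 10, 2                         # assumed delays, in arbitrary time units
serial = sample_period(T_M, T_A)                 # 14
parallel = sample_period(T_M, T_A, L=3)          # 14/3
combined = sample_period(T_M, T_A, L=3, M=2)     # 7/3
```

Using `Fraction` keeps the results exact, so the factor of 6 between `serial` and `combined` is visible without rounding.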


<A chip set>


<Combined fine-grain pipelining and parallel processing for 3-tap FIR filter>

Pipelining and Parallel processing for Low power

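There are two main advantages of using pipelining and parallel processing:

Higher speed

Lower power

For a CMOS circuit, the propagation delay can be written as:

$T_{pd} = \dfrac{C_{charge} V_0}{k (V_0 - V_t)^2}$

Power consumption of a CMOS circuit can be estimated as:

$P = C_{total} V_0^2 f$

(Here $C_{charge}$ is the capacitance charged/discharged in a single clock cycle, $V_0$ is the supply voltage, $V_t$ is the threshold voltage, $k$ depends on the technology parameters, $C_{total}$ is the total capacitance of the circuit, and $f$ is the clock frequency.)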
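The trade-off these two expressions describe (lower supply voltage means quadratically lower power but longer delay) can be seen with a few numbers. In the Python sketch below the function names are hypothetical and the values of capacitance, k, supply voltage, and threshold voltage are arbitrary illustration values, not data from the lecture.

```python
def propagation_delay(c_charge, v0, vt, k):
    """T_pd = C_charge * V0 / (k * (V0 - Vt)^2)."""
    return c_charge * v0 / (k * (v0 - vt) ** 2)


def dynamic_power(c_total, v0, f):
    """P = C_total * V0^2 * f."""
    return c_total * v0 ** 2 * f


# Arbitrary illustration values (normalised units).
C, K, VT, F = 1.0, 1.0, 0.6, 1.0

power_ratio = dynamic_power(C, 2.5, F) / dynamic_power(C, 5.0, F)             # 0.25
delay_ratio = propagation_delay(C, 2.5, VT, K) / propagation_delay(C, 5.0, VT, K)
```

Halving the supply voltage cuts the power estimate to a quarter, while `delay_ratio` comes out greater than 1: the circuit gets slower, which is the slack that pipelining or parallel processing has to recover.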

Pipelining for Low power (1)

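Let $P_{seq} = C_{total} V_0^2 f$ represent the power consumed in the original filter, where $f = 1/T_{seq}$ and $T_{seq}$ is the clock period of the original sequential filter.

In the M-level pipelined system, the critical path is reduced to 1/M of its original length, and the capacitance to be charged/discharged in a single clock cycle is reduced to $C_{charge}/M$.

If the same clock speed is maintained, only a fraction of the original capacitance has to be charged in the same amount of time, so the supply voltage can be reduced to $\beta V_0$, where $\beta$ is a positive constant less than 1.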

Pipelining for Low power (2)

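The power consumption factor, $\beta$, can be determined by examining the relationship between the propagation delay of the original filter and that of the pipelined filter:

$T_{seq} = \dfrac{C_{charge} V_0}{k (V_0 - V_t)^2}$

$T_{pipe} = \dfrac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$

The power consumption of the pipelined filter is then:

$P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$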
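Setting the pipelined delay equal to the sequential delay (same clock period) gives $M(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$, a quadratic in the supply-voltage scaling factor $\beta$. The Python sketch below solves it; the function name is hypothetical, and the example values $V_0 = 5$ V, $V_t = 0.6$ V, and $M = 3$ are assumed purely for illustration.

```python
import math


def solve_beta(levels, v0, vt):
    """Solve levels*(beta*v0 - vt)^2 = beta*(v0 - vt)^2 for the usable root.

    Expanded: levels*v0^2*beta^2 - (2*levels*v0*vt + (v0 - vt)^2)*beta + levels*vt^2 = 0.
    The larger root is returned; it is the one with beta*v0 above the threshold vt.
    """
    qa = levels * v0 ** 2
    qb = -(2 * levels * v0 * vt + (v0 - vt) ** 2)
    qc = levels * vt ** 2
    disc = math.sqrt(qb * qb - 4 * qa * qc)
    return (-qb + disc) / (2 * qa)


V0, VT, M = 5.0, 0.6, 3                  # assumed example values
beta = solve_beta(M, V0, VT)             # approximately 0.467
power_ratio = beta ** 2                  # P_pip / P_seq, approximately 0.218
```

Under these assumed values the pipelined filter could run at roughly 2.3 V and consume a little over a fifth of the original power at the same sample rate. With `levels=1` the function returns 1, meaning no voltage reduction.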

Parallel processing for Low power (1)

Parallel processing, like pipelining, can reduce the power consumption of a system by allowing the supply voltage to be reduced.

In an L-parallel system, the charging capacitance does not change, while the total capacitance is increased L times.

In order to maintain the same sample rate, the clock period of the L-parallel circuit must be increased to $LT_{seq}$, where $T_{seq}$ is the propagation delay of the sequential circuit.

There is more time to charge the same capacitance, so the supply voltage can be reduced to $\beta V_0$.

Parallel processing for Low power (2)

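The propagation delay of the original sequential system is:

$T_{seq} = \dfrac{C_{charge} V_0}{k (V_0 - V_t)^2}$

The propagation delay of the L-parallel system is given by:

$LT_{seq} = \dfrac{C_{charge}\, \beta V_0}{k (\beta V_0 - V_t)^2}$

The power consumption of the L-parallel system is:

$P_{par} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$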
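Dividing the L-parallel delay by the sequential delay removes $C_{charge}$ and $k$ and leaves a single equation for the supply-voltage scaling factor $\beta$. The derivation below is a worked step under the delay model used in this section and introduces no new symbols.

```latex
\begin{align*}
\frac{L\,T_{seq}}{T_{seq}}
  &= \frac{C_{charge}\,\beta V_0 \,/\, \bigl(k(\beta V_0 - V_t)^2\bigr)}
          {C_{charge}\,V_0 \,/\, \bigl(k(V_0 - V_t)^2\bigr)} \\[4pt]
L &= \frac{\beta\,(V_0 - V_t)^2}{(\beta V_0 - V_t)^2} \\[4pt]
L\,(\beta V_0 - V_t)^2 &= \beta\,(V_0 - V_t)^2
\end{align*}
```

This is a quadratic in $\beta$. Only the root with $\beta V_0 > V_t$ is physically usable, since the reduced supply must stay above the threshold voltage. Once $\beta$ is known, the power of the L-parallel system is $\beta^2$ times that of the sequential system.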

Conclusions

Pipelining

Pipelining latches are placed across the feed-forward cutsets in the SFG, and the computation time of the critical path is reduced.

The clock frequency can be increased, and hence the sampling rate is increased.

Parallel processing

The hardware for the original serial system is duplicated, and the resulting system is a MIMO parallel system.

The clock frequency stays the same, and the sampling frequency is increased.

Both schemes can be used for higher-speed and lower-power design (using a lower supply voltage).
