03 Nonlinear Pipeline Summary

Nonlinear Pipeline Processors

May 2, 2011

A dynamic pipeline can be reconfigured to perform variable functions at different times. The traditional pipelines presented in the previous tutorial perform fixed functions (such as fetch, decode, execute, and write back).

1 Reservation and Latency Analysis

The previous tutorial introduced the reservation table as a time-space view of the flow through a pipeline.

[Figure: three function units, ROR, SHL, and INV, with one input (Input) and two outputs (Output X and Output Y).]

Figure 1: A dataflow circuit.

Consider the example in Figure 1. The first function unit performs a rotation to the right (ROR), the second function unit performs a logical shift to the left (SHL), and the last function unit performs the one's complement or inverse (INV). This dataflow circuit can be used to produce the function Output X as follows:

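OutputX = INV(INV(INV((Input >>> 1) << 1) << 1) >>> 1) >>> 1

Here >>> 1 is ROR, << 1 is SHL, and INV(...) is the complement. The eight operations follow the stage order S1, S2, S3, S2, S3, S1, S3, S1.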

The same dataflow circuit can be used to perform another function, for example:

OutputY = INV(INV(INV(Input >>> 1) << 1) >>> 1)

This can be represented in the form of a reservation table, as in Table 1 and Table 2. The checkmarks in each row of the reservation table correspond to the time instants (cycles) in which a particular stage is used.

There may be multiple checkmarks in a row, which means repeated usage of the same stage in different cycles. Contiguous checkmarks in a row simply imply the extended usage of a stage over more than one cycle. Multiple checkmarks in a column mean that multiple stages are used in parallel during a particular clock cycle.

→ Time (clock cycles)

          1   2   3   4   5   6   7   8
S1 (ROR)  X                   X       X
S2 (SHL)      X       X
S3 (INV)          X       X       X

Table 1: Reservation table for function X.


          1   2   3   4   5   6
S1 (ROR)  Y               Y
S2 (SHL)          Y
S3 (INV)      Y       Y       Y

Table 2: Reservation table for function Y.

In conclusion, different functions may follow different paths through the reservation table. The function units in Figure 1 can be converted to pipeline stages by inserting a buffer at their input.

1.1 Some Important Definitions

Evaluation Time The number of columns in a reservation table is called the evaluation time of a given function. For example, function X requires eight clock cycles to evaluate, and function Y requires six clock cycles, as shown in Table 1 and Table 2, respectively.
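As a minimal sketch of this definition, a reservation table can be held as a mapping from each stage to the set of cycles in which it is used. The cycle positions below are our reading of Tables 1 and 2 (an assumption, chosen to agree with the distances quoted later in the text), and the names TABLE_X, TABLE_Y, and evaluation_time are illustrative only.

```python
# Reservation tables as {stage: set of clock cycles in which the stage is used}.
# Assumption: the cycle positions are our reading of Tables 1 and 2.
TABLE_X = {"S1": {1, 6, 8}, "S2": {2, 4}, "S3": {3, 5, 7}}
TABLE_Y = {"S1": {1, 5}, "S2": {3}, "S3": {2, 4, 6}}


def evaluation_time(table):
    """Return the number of columns: the last cycle in which any stage is used."""
    return max(max(cycles) for cycles in table.values())


print(evaluation_time(TABLE_X))  # 8
print(evaluation_time(TABLE_Y))  # 6
```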

Latency The number of time units (clock cycles) between two initiations of a pipeline is the latency between them. A latency of k means that two initiations are separated by k clock cycles.

Collision Any attempt by two or more initiations to use the same pipeline stage at the same time will cause a collision. A collision implies a resource conflict between two initiations in the pipeline. Therefore, all collisions must be avoided in scheduling a sequence of pipeline initiations.

Forbidden Latencies Some latencies will cause collisions, and some will not. Latencies that cause collisions are called forbidden latencies. For example, when evaluating function X, latencies 2 and 5 are forbidden, as Table 3 and Table 4 illustrate.


          1        2        3        4        5        6        7        8        9        10       11
S1 (ROR)  X1                X2                X3       X1       X4       X1,X2             X2,X3    ...
S2 (SHL)           X1                X1,X2             X2,X3             X3,X4             X4       ...
S3 (INV)                    X1                X1,X2             X1,X2,X3          X2,X3,X4          ...

Table 3: Collision with scheduling latency 2.


          1      2      3      4      5      6      7      8      9      10     11
S1 (ROR)  X1                                 X1,X2         X1                   ...
S2 (SHL)         X1            X1                   X2            X2            ...
S3 (INV)                X1            X1            X1     X2            X2     ...

Table 4: Collision with scheduling latency 5.
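The collisions shown in Tables 3 and 4 can be reproduced with a short simulation. This is only a sketch: TABLE_X encodes our reading of Table 1 (an assumption), and find_collisions is a hypothetical helper that overlays several initiations and reports every stage/cycle cell claimed by more than one task.

```python
from collections import defaultdict

# Assumption: cycle positions are our reading of Table 1.
TABLE_X = {"S1": {1, 6, 8}, "S2": {2, 4}, "S3": {3, 5, 7}}


def find_collisions(table, starts):
    """Map (stage, cycle) -> task numbers, for cells used by more than one task."""
    usage = defaultdict(list)
    for task, start in enumerate(starts, start=1):
        for stage, cycles in table.items():
            for c in cycles:
                # A task started in cycle `start` uses column c at start + c - 1.
                usage[(stage, start + c - 1)].append(task)
    return {cell: tasks for cell, tasks in usage.items() if len(tasks) > 1}


# Two initiations separated by latency 2, 5, and 3 respectively.
print(sorted(find_collisions(TABLE_X, [1, 3])))  # [('S1', 8), ('S2', 4), ('S3', 5), ('S3', 7)]
print(find_collisions(TABLE_X, [1, 6]))          # {('S1', 6): [1, 2]}
print(find_collisions(TABLE_X, [1, 4]))          # {}
```

With latency 3 no cell is shared, which is why 3 is a permissible latency for function X.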

To detect a forbidden latency, one simply checks the distance between any two checkmarks in the same row of the reservation table. For example, the distance between the first mark and the second mark in row 1 of Table 1 is 5, implying that 5 is a forbidden latency.

Similarly, latencies 2, 4, 5, and 7 are all seen to be forbidden by inspecting the same reservation table. From the reservation table in Table 2, we find the forbidden latencies 2 and 4 for function Y. A latency sequence is a sequence of permissible (nonforbidden) latencies between successive task initiations.
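The row-distance rule can be sketched in a few lines. The tables below are our reading of Tables 1 and 2 (an assumption), and forbidden_latencies is an illustrative name, not part of the original material.

```python
from itertools import combinations

# Assumption: cycle positions are our reading of Tables 1 and 2.
TABLE_X = {"S1": {1, 6, 8}, "S2": {2, 4}, "S3": {3, 5, 7}}
TABLE_Y = {"S1": {1, 5}, "S2": {3}, "S3": {2, 4, 6}}


def forbidden_latencies(table):
    """Collect the distance between every pair of checkmarks in the same row."""
    return sorted({abs(a - b)
                   for cycles in table.values()
                   for a, b in combinations(cycles, 2)})


print(forbidden_latencies(TABLE_X))  # [2, 4, 5, 7]
print(forbidden_latencies(TABLE_Y))  # [2, 4]
```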

A latency cycle is a latency sequence which repeats the same subsequence (cycle) indefinitely. For example, the latency cycle (1, 8) represents the infinite latency sequence 1, 8, 1, 8, 1, 8, . . . This implies that successive initiations of new tasks are separated by one cycle and eight cycles alternately.

Average Latency The average latency of a latency cycle is obtained by dividing the sum of all latencies by the number of latencies along the cycle. The latency cycle (1, 8) thus has an average latency of (1 + 8)/2 = 4.5. A constant cycle is a latency cycle which contains only one latency value.
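The arithmetic is simple enough to check directly. In this small sketch a latency cycle is written as a tuple; the function names are illustrative assumptions.

```python
from fractions import Fraction


def average_latency(cycle):
    """Sum of the latencies divided by how many there are, kept exact."""
    return Fraction(sum(cycle), len(cycle))


def is_constant_cycle(cycle):
    """A constant cycle contains only one latency value."""
    return len(set(cycle)) == 1


print(float(average_latency((1, 8))))  # 4.5
print(is_constant_cycle((3,)))         # True
print(is_constant_cycle((1, 8)))       # False
```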

2 Collision-Free Scheduling

When scheduling events in a pipeline, the main objective is to obtain the shortest average latency between initiations without causing collisions.

2.1 Collision Vectors

A permissible latency of p = 1 corresponds to the ideal case. In theory, a latency of 1 can always be achieved in a static linear pipeline.

The combined set of permissible and forbidden latencies can be easily displayed by a collision vector, which is an m-bit binary vector C = (C_m C_{m-1} ... C_2 C_1). The value of C_i = 1 if latency i causes a collision, and C_i = 0 if latency i is permissible. Note that it is always true that C_m = 1, corresponding to the maximum forbidden latency, where m ≤ n − 1 and n is the number of columns in the reservation table.

For the two reservation tables (Table 1 and Table 2), the collision vectors are C_X = (1011010) and C_Y = (1010). The distance between the first two checkmarks on the first row in Table 1 is 5, and the distance between the first checkmark and the third is 7. The distance between the second checkmark and the third checkmark on the first row is 2. The distance between the first checkmark and the third checkmark on the third row is 4. Hence, 7, 5, 4, and 2 are all forbidden latencies and are denoted by a "one" in the collision vector. The remaining latencies, namely 6, 3, and 1, are permissible.
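A collision vector can be written down mechanically from the set of forbidden latencies. The sketch below assumes the forbidden sets quoted in the text; collision_vector is an illustrative helper that returns the vector as a bit string with C_m on the left.

```python
def collision_vector(forbidden):
    """Return C_m ... C_1 as a string: '1' for a forbidden latency, '0' otherwise."""
    m = max(forbidden)  # the maximum forbidden latency fixes the vector length
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))


print(collision_vector({2, 4, 5, 7}))  # 1011010  (function X)
print(collision_vector({2, 4}))        # 1010     (function Y)
```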

2.2 State Diagram

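From the above collision vector, one can construct a state diagram specifying the permissible state transitions among successive initiations. The collision vector, like C_X above, corresponds to the initial state of the pipeline at time 1 and thus is called an initial collision vector.

The next state of the pipeline at time t + p, where p is a permissible latency, is obtained with the assistance of an m-bit right shift register as in Figure 2. The current state is shifted right by p positions, with 0s entering from the left, and the result is bitwise ORed with the initial collision vector. For example, shifting (1011010) right by one position gives (0101101), and ORing this with (1011010) gives the next state (1111111).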

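[Figure: a right shift register loaded with the initial collision vector (C_m C_{m-1} ... C_1). A "0" enters from the left on each shift. A "0" leaving on the right means the latency is safe; a "1" means a collision.]

Figure 2: State transition using an m-bit right shift register, where m is the maximum forbidden latency.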
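The shift register step can be sketched as follows. We assume the usual construction in which the shifted state is bitwise ORed with the initial collision vector; next_state is a hypothetical helper, and states are written as bit strings with C_m on the left.

```python
def next_state(state, initial, p):
    """State reached from `state` after a permissible latency of p cycles."""
    m = len(initial)
    if p > m:
        # After more than m shifts the register holds only zeros,
        # so the pipeline is back in its initial state.
        return initial
    current = int(state, 2)
    if (current >> (p - 1)) & 1:
        # Bit C_p is 1: a "1" leaves the register, which means a collision.
        raise ValueError(f"latency {p} is forbidden in state {state}")
    return format((current >> p) | int(initial, 2), f"0{m}b")


CX = "1011010"
print(next_state(CX, CX, 1))  # 1111111
print(next_state(CX, CX, 3))  # 1011011
print(next_state(CX, CX, 8))  # 1011010
```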

A state diagram for function X is shown in Figure 3. From the initial state (1011010), only three outgoing transitions of at most seven shifts are possible, corresponding to the three permissible latencies 6, 3, and 1 in the initial collision vector. Similarly, from state (1011011), one reaches the same state after either three shifts or six shifts. When the number of shifts is m + 1 or greater, all transitions are redirected back to the initial state. For example, after eight or more (denoted as 8+) shifts, the next state must be the initial state, regardless of which state the transition starts from.

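[Figure: state diagram with three states and the following transitions.]
  1011010 (initial): latency 1 to 1111111; latency 3 or 6 to 1011011; latency 8+ to 1011010
  1111111: latency 8+ to 1011010
  1011011: latency 3 or 6 to 1011011; latency 8+ to 1011010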

Figure 3: State diagram for function X.
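The whole diagram can be generated by repeating the shift-and-OR step from every state that is reached. The following is a sketch under the same assumption as before (next state = shifted state ORed with the initial collision vector); state_diagram is an illustrative name, and latencies greater than m are left implicit because they always return to the initial state.

```python
def state_diagram(initial):
    """Map each reachable state to {permissible latency: next state}."""
    m, init = len(initial), int(initial, 2)
    diagram, todo = {}, [init]
    while todo:
        s = todo.pop()
        if s in diagram:
            continue
        # Latency p is permissible when bit C_p of the state is 0.
        edges = {p: (s >> p) | init
                 for p in range(1, m + 1) if not (s >> (p - 1)) & 1}
        diagram[s] = edges
        todo.extend(edges.values())

    def bits(x):
        return format(x, f"0{m}b")

    return {bits(s): {p: bits(t) for p, t in edges.items()}
            for s, edges in diagram.items()}


for state, edges in sorted(state_diagram("1011010").items()):
    print(state, edges)
```

For the collision vector of function X this yields the three states of Figure 3; the state 1111111 has no outgoing edge below 8+.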

2.3 Greedy Cycles

From the state diagram, we can determine optimal latency cycles which result in the MAL (minimal average latency). There are infinitely many latency cycles one can trace from the state diagram. For example, (1, 8), (1, 8, 6, 8), (3), (6), (3, 8), (3, 6, 3), ..., are legitimate cycles traced from the state diagram in Figure 3. Among these cycles, only simple cycles are of interest. A simple cycle is a latency cycle in which each state appears only once. In the state diagram in Figure 3, only (3), (6), (8), (1, 8), (3, 8), and (6, 8) are simple cycles.

Some of the simple cycles are greedy cycles. A greedy cycle is one whose edges are all made with minimum latencies from their respective starting states. For example, in Figure 3, (1, 8) and (3) are greedy cycles, with average latencies of 4.5 and 3. The greedy cycles in Figure 4 are (1, 5) and (3), both with an average latency of 3. Their average latencies are lower than those of the other simple cycles. The MAL in both figures is 3, achieved by the constant cycle (3).

The greedy cycle yielding the MAL is the final choice.

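[Figure: state diagram with three states and the following transitions.]
  1010 (initial): latency 1 to 1111; latency 3 to 1011; latency 5+ to 1010
  1111: latency 5+ to 1010
  1011: latency 3 to 1011; latency 5+ to 1010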

Figure 4: State diagram for function Y.
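Greedy cycles can be found by always following the smallest permissible latency out of each state. The sketch below assumes the shift-and-OR transition rule and, following the text, takes the smallest average latency among the greedy cycles as the MAL; greedy_cycles is an illustrative helper, not an established API.

```python
from fractions import Fraction


def greedy_cycles(initial):
    """Return the greedy cycles reachable from the initial collision vector."""
    m, init = len(initial), int(initial, 2)

    def permissible(s):
        return [p for p in range(1, m + 1) if not (s >> (p - 1)) & 1]

    def greedy_step(s):
        latencies = permissible(s)
        if latencies:
            return latencies[0], (s >> latencies[0]) | init
        return m + 1, init  # no latency up to m is free: wait m + 1 cycles

    # Collect every state reachable from the initial state.
    states, todo = set(), [init]
    while todo:
        s = todo.pop()
        if s not in states:
            states.add(s)
            todo.extend((s >> p) | init for p in permissible(s))

    cycles = set()
    for start in states:
        path, latencies, s = [], [], start
        while s not in path:
            path.append(s)
            p, s = greedy_step(s)
            latencies.append(p)
        cyc = latencies[path.index(s):]
        # Store one canonical rotation so (8, 1) and (1, 8) count once.
        cycles.add(min(tuple(cyc[i:] + cyc[:i]) for i in range(len(cyc))))
    return sorted(cycles)


for cv in ("1011010", "1010"):
    cycles = greedy_cycles(cv)
    print(cv, cycles, min(Fraction(sum(c), len(c)) for c in cycles))
```

For both collision vectors the smallest average among the greedy cycles is 3, in agreement with the MAL stated above.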


3 Pipeline Schedule Optimization

The goal is to obtain an optimal latency cycle, that is, the collision-free cycle with the shortest average latency.

3.1 Bounds on the MAL

The MAL is lower-bounded by the maximum number of checkmarks in any row of the reservation table. A latency cycle that meets this lower bound is therefore optimal. The MAL = 3 for both functions X and Y meets the lower bound of 3 from their respective reservation tables (Table 1 and Table 2). When the MAL is above the lower bound, one tries to reach the bound by modifying the reservation table, inserting noncompute delay stages.
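The lower bound is a one-line computation on the reservation table. The tables are our reading of Tables 1 and 2 (an assumption), and mal_lower_bound is an illustrative name.

```python
# Assumption: cycle positions are our reading of Tables 1 and 2.
TABLE_X = {"S1": {1, 6, 8}, "S2": {2, 4}, "S3": {3, 5, 7}}
TABLE_Y = {"S1": {1, 5}, "S2": {3}, "S3": {2, 4, 6}}


def mal_lower_bound(table):
    """Maximum number of checkmarks in any single row."""
    return max(len(cycles) for cycles in table.values())


print(mal_lower_bound(TABLE_X))  # 3
print(mal_lower_bound(TABLE_Y))  # 3
```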

3.2 Example

Consider the pipeline in Figure 5. Its reservation table is given in Table 5, and its state transition diagram in Figure 6. The lower bound on the MAL is 2; however, the only greedy cycle is (3), which gives MAL = 3. The efficiency is 6/9 = 0.67, as deduced from Table 6. With a clock period τ, the throughput is 1/(MAL · τ) = 1/(3τ).

[Figure: three stages S1, S2, and S3 between Input and Output X.]

Figure 5: A three-stage pipeline.


      1   2   3   4   5
S1    Y               Y
S2        Y       Y
S3            Y   Y

Table 5: Reservation table for function Y.

It is therefore required to insert noncompute delay stages to change the reservation table and thereby the MAL. The checkmark in the fourth cycle of the third stage stands in the way of issuing a new instruction after a latency of 1. Therefore, we insert a delay stage between S3 and itself, as shown in Figure 7. The modified reservation table is shown in Table 7.

[Figure: a single state 1011 with two self-loops, labeled 3 and 5+.]

Figure 6: New state transition diagram with MAL=3.


      1    2    3    4    5    6    7    8    9    10   11
S1    Y1             Y2   Y1        Y3   Y2        Y4   Y3
S2         Y1        Y1   Y2        Y2   Y3        Y3   Y4
S3              Y1   Y1        Y2   Y2        Y3   Y3

(The cycle repeats every three clock cycles.)

Table 6: Reservation table for function Y extended to calculate Efficiency.
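The figure of 6/9 can be checked by overlaying initiations three cycles apart and counting the busy cells in one steady-state period. This is only a sketch: TABLE_5 is our reading of Table 5 (an assumption), and efficiency is a hypothetical helper that counts used cells in a window of cycles.

```python
from fractions import Fraction

# Assumption: cycle positions are our reading of Table 5.
TABLE_5 = {"S1": {1, 5}, "S2": {2, 4}, "S3": {3, 4}}


def efficiency(table, starts, first, last):
    """Fraction of stage/cycle cells in cycles first..last that are in use."""
    busy = {(stage, start + c - 1)
            for start in starts
            for stage, cycles in table.items()
            for c in cycles}
    used = sum(1 for _, t in busy if first <= t <= last)
    return Fraction(used, len(table) * (last - first + 1))


starts = range(1, 30, 3)                  # constant latency cycle (3)
print(efficiency(TABLE_5, starts, 5, 7))  # 2/3
```

Any window of three consecutive cycles in the steady state gives the same value, since the schedule repeats with period 3.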

[Figure: stages S1, S2, and S3 between Input and Output X, with one delay stage D.]

Figure 7: Insertion of one noncompute delay stage.


      1   2   3   4   5   6
S1    Y                   Y
S2        Y       Y
S3            Y       Y
D                 D

Table 7: Modified reservation table for function Y.
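One way to see what the inserted delay buys is to recompute the collision vector before and after. The tables below are our reading of Tables 5 and 7 (an assumption, including the row name D for the delay stage), and collision_vector is an illustrative helper.

```python
# Assumption: cycle positions are our reading of Tables 5 and 7.
TABLE_5 = {"S1": {1, 5}, "S2": {2, 4}, "S3": {3, 4}}
TABLE_7 = {"S1": {1, 6}, "S2": {2, 4}, "S3": {3, 5}, "D": {4}}


def collision_vector(table):
    """Bit string C_m ... C_1 built from the row distances of the table."""
    forbidden = {abs(a - b)
                 for cycles in table.values()
                 for a in cycles for b in cycles if a != b}
    m = max(forbidden)
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))


print(collision_vector(TABLE_5))  # 1011
print(collision_vector(TABLE_7))  # 10010
```

After the insertion, bit C_1 is 0, so a latency of 1 has become permissible.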

The resulting state transition diagram is shown in Figure 8. The greedy cycle from this state diagram is (1, 3, 3), and thereby the MAL = (1 + 3 + 3)/3 = 2.33. If you draw more initiations in the reservation table, you will notice that the pattern repeats from cycle 9 to cycle 15. The efficiency is accordingly 0.75, which is higher than the previous efficiency of 0.67.

However, the achieved MAL = 2.33 is still greater than the lower bound. This reservation table allows us to issue a second instruction one cycle after the first, and a third instruction three cycles after the second. If we could issue the fourth instruction one cycle after the third, we would get the cycle (1, 3), which would give MAL = 2. The highlighted cell in Table 8, the last checkmark in the first row, should be moved to the right to allow a cycle of (1, 3). Hence we introduce another delay stage between S3 and S1 to delay the last checkmark in the first row.


      1   2   3   4   5   6
S1    Y                   Y*
S2        Y       Y
S3            Y       Y
D                 D

Table 8: Modified reservation table for function Y with the checkmark that stands in the way of the optimal cycle highlighted (marked with an asterisk).

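[Figure: state diagram with three states and the following transitions.]
  10010 (initial): latency 1 to 11011; latency 3 to 10010; latency 4 to 10011; latency 6+ to 10010
  11011: latency 3 to 10011; latency 6+ to 10010
  10011: latency 3 to 10010; latency 4 to 10011; latency 6+ to 10010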

Figure 8: Modified state diagram with a reduced MAL=(1+3+3)/3=2.33.

The new pipeline is shown in Figure 9, and the new reservation table is shown in Table 9. The generated state diagram is shown in Figure 10. It can be seen that the greedy cycle from this state diagram is (1, 3), and the corresponding MAL is 2, which is the lower bound on the MAL that we were seeking. If you complete the reservation table for subsequent instructions, you will see the pattern repeat from cycle 6 to cycle 9. This gives an efficiency of 0.8, which is higher than the previous two efficiencies. In this example, therefore, the lower the MAL, the higher the efficiency.

[Figure: stages S1, S2, and S3 between Input and Output X, with two delay stages D1 and D2.]

Figure 9: Insertion of two delay stages.


      1   2   3   4   5   6   7
S1    Y                       Y
S2        Y       Y
S3            Y       Y
D1                D
D2                        D

Table 9: Modified reservation table for function Y.

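[Figure: state diagram with four states and the following transitions.]
  100010 (initial): latency 1 to 110011; latency 3 to 100110; latency 5 to 100011; latency 4 or 7+ to 100010
  110011: latency 3 to 100110; latency 4 to 100011; latency 7+ to 100010
  100110: latency 1 to 110011; latency 5 to 100011; latency 4 or 7+ to 100010
  100011: latency 3 to 100110; latency 5 to 100011; latency 4 or 7+ to 100010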

Figure 10: Modified state diagram with a reduced MAL=(1+3)/2=2.
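The greedy cycles of the two modified tables can also be checked without drawing the state diagram, by starting each new task at the earliest cycle that collides with none of the earlier ones. This is a sketch under our reading of Tables 7 and 9 (an assumption); greedy_latencies is a hypothetical helper.

```python
# Assumption: cycle positions are our reading of Tables 7 and 9.
TABLE_7 = {"S1": {1, 6}, "S2": {2, 4}, "S3": {3, 5}, "D": {4}}
TABLE_9 = {"S1": {1, 7}, "S2": {2, 4}, "S3": {3, 5}, "D1": {4}, "D2": {6}}


def greedy_latencies(table, count):
    """Latencies between `count` tasks, each started as early as possible."""
    forbidden = {abs(a - b)
                 for cycles in table.values()
                 for a in cycles for b in cycles if a != b}
    starts = [1]
    while len(starts) < count:
        t = starts[-1] + 1
        # Skip any start cycle at a forbidden distance from an earlier task.
        while any(t - s in forbidden for s in starts):
            t += 1
        starts.append(t)
    return [b - a for a, b in zip(starts, starts[1:])]


print(greedy_latencies(TABLE_7, 6))  # [1, 3, 3, 1, 3]
print(greedy_latencies(TABLE_9, 6))  # [1, 3, 1, 3, 1]
```

The first sequence settles into the cycle (1, 3, 3) and the second into (1, 3), matching the average latencies 2.33 and 2.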


4 Nonlinear Pipeline Throughput

The throughput of a nonlinear pipeline is essentially the initiation rate, that is, the average number of task initiations per clock cycle. The initiation rate, or pipeline throughput, is the inverse of the MAL adopted. Therefore, the scheduling strategy does affect the pipeline performance.

5 Nonlinear Pipeline Efficiency

The percentage of time that each pipeline stage is used over a sufficiently long series of task initiations is the stage utilization. The accumulated rate of all stage utilizations determines the pipeline efficiency.

At least one stage of the pipeline should be fully (100%) utilized at the steady state in any acceptable initiation cycle; otherwise, the pipeline capability has not been fully explored.
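In the steady state, a stage with k checkmarks is busy k times for every task, so its utilization under a collision-free latency cycle follows directly. The sketch below uses our reading of Tables 5 and 9 (an assumption) and takes the efficiency as the average over all rows, delay stages included, which reproduces the 0.67 and 0.8 quoted in the example; the function names are illustrative.

```python
from fractions import Fraction

# Assumption: cycle positions are our reading of Tables 5 and 9.
TABLE_5 = {"S1": {1, 5}, "S2": {2, 4}, "S3": {3, 4}}
TABLE_9 = {"S1": {1, 7}, "S2": {2, 4}, "S3": {3, 5}, "D1": {4}, "D2": {6}}


def stage_utilization(table, cycle):
    """Steady-state utilization of each stage for a collision-free latency cycle."""
    period, tasks = sum(cycle), len(cycle)  # `tasks` initiations every `period` cycles
    return {stage: Fraction(len(cycles) * tasks, period)
            for stage, cycles in table.items()}


def efficiency(utilization):
    """Average of the stage utilizations."""
    return sum(utilization.values()) / len(utilization)


u5 = stage_utilization(TABLE_5, (3,))
u9 = stage_utilization(TABLE_9, (1, 3))
print(efficiency(u5), max(u5.values()))  # 2/3 2/3
print(efficiency(u9), max(u9.values()))  # 4/5 1
```

With the cycle (1, 3), stages S1, S2, and S3 are fully utilized, whereas with the cycle (3) on the original table no stage exceeds 2/3.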
