DATAFLOW ARCHITECTURES

JURIJ SILC, BORUT ROBIC, THEO UNGERER


Literature

Jurij Silc, Borut Robic, and Theo Ungerer: Processor Architecture: From Dataflow to Superscalar and Beyond (Springer-Verlag, Berlin, New York, 1999).

Jurij Silc, Borut Robic, and Theo Ungerer: "Asynchrony in parallel computing: From dataflow to multithreading", Parallel and Distributed Computing Practices, 1(1):3-30, 1998.

Borut Robic, Jurij Silc, and Theo Ungerer: "Beyond dataflow", J. Computing and Information Technology, 8(2):89-101, 2000.


Dataflow Processors - Motivation

In basic processor pipelining, hazards limit performance:
- Structural hazards
- Data hazards due to true dependences or name (false) dependences (anti and output dependences)
- Control hazards

Name dependences can be removed by:
- compiler (register) renaming,
- renaming hardware (advanced superscalars),
- the single-assignment rule (dataflow computers).

Data hazards due to true dependences and control hazards can be avoided if succeeding instructions in the pipeline stem from different contexts (dataflow computers, multithreaded processors).


Dataflow vs. Control-Flow

von Neumann or control-flow computing model:
- A program is a series of addressable instructions, each of which either specifies an operation along with the memory locations of its operands, or specifies an (un)conditional transfer of control to some other instruction.
- Essentially: the next instruction to be executed depends on what happened during the execution of the current instruction.
- The next instruction to be executed is pointed to and triggered by the PC.
- The instruction is executed even if some of its operands are not yet available (e.g. uninitialized).

Dataflow model: execution is driven only by the availability of operands!
- There is no PC and no global updateable store.
- The two features of the von Neumann model that become bottlenecks in exploiting parallelism are missing.


Dataflow model of computation

Enabling rule:
- An instruction is enabled (i.e. executable) if all operands are available.
- Note that in the von Neumann model, an instruction is enabled if it is pointed to by the PC.

The computational rule, or firing rule, specifies when an enabled instruction is actually executed.

Basic instruction firing rule:
- An instruction is fired (i.e. executed) when it becomes enabled.
- The effect of firing an instruction is the consumption of its input data (operands) and the generation of output data (results).

Where are the structural hazards? Answer: ignored!!


Dataflow languages

Main characteristic: the single-assignment rule
- A variable may appear on the left side of an assignment only once within the area of the program in which it is active.

Examples: VAL, Id, LUCID

A dataflow program is compiled into a dataflow graph which is a directed graph consisting of named nodes, which represent instructions, and arcs, which represent data dependences among instructions.

The dataflow graph is similar to a dependence graph used in intermediate representations of compilers.

During the execution of the program, data propagate along the arcs in data packets, called tokens.

This flow of tokens enables some of the nodes (instructions) and fires them.


Example in VAL

function Stats: computes the mean and standard deviation of three input values.

    function Stats(x,y,z: real returns real,real);
      let
        Mean := (x + y + z)/3;
        StDev := SQRT((x**2 + y**2 + z**2)/3 - Mean**2);
      in
        Mean, StDev
      endlet
    endfun

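[Figure: dataflow graph of Stats. The inputs x, y, and z feed addition, squaring, and division-by-3 nodes, followed by a subtraction node and a sqrt node, producing the outputs Mean and StDev.]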
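The enabling and firing rules can be tried out on the Stats example with a small token-driven interpreter. The following is only a minimal sketch in Python, not the VAL compiler's output: the graph encoding, the node names (`sum`, `mean_sq`, and so on), and the function `run_dataflow` are assumptions made for illustration, and tokens are kept rather than consumed so that one result can feed several successors.

```python
import math

# Hypothetical encoding: node name -> (operation, names of the arcs it reads).
# A node writes the arc that carries its own name, so every arc has exactly
# one producer (single assignment).
STATS_GRAPH = {
    "sum":      (lambda x, y, z: x + y + z,             ("x", "y", "z")),
    "mean":     (lambda s: s / 3,                       ("sum",)),
    "sq_sum":   (lambda x, y, z: x * x + y * y + z * z, ("x", "y", "z")),
    "mean_sq":  (lambda q: q / 3,                       ("sq_sum",)),
    "variance": (lambda m2, m: m2 - m * m,              ("mean_sq", "mean")),
    "stdev":    (math.sqrt,                             ("variance",)),
}


def run_dataflow(graph, inputs):
    """Fire nodes purely by operand availability; there is no program counter."""
    tokens = dict(inputs)      # arc name -> value that has been produced
    pending = set(graph)       # nodes that have not fired yet
    firing_order = []
    while pending:
        # Enabling rule: a node is enabled when all its operands are available.
        enabled = sorted(
            name for name in pending
            if all(arc in tokens for arc in graph[name][1])
        )
        if not enabled:
            raise ValueError("deadlock: some operands never arrive")
        # Firing rule: an enabled node fires; all enabled nodes fire in one step
        # because unlimited resources are assumed (structural hazards ignored).
        for name in enabled:
            operation, arcs = graph[name]
            tokens[name] = operation(*(tokens[arc] for arc in arcs))
            firing_order.append(name)
        pending -= set(enabled)
    return tokens, firing_order


results, order = run_dataflow(STATS_GRAPH, {"x": 1.0, "y": 2.0, "z": 3.0})
```

In this sketch `sum` and `sq_sum` fire in the same step because both are enabled by the input tokens alone, which is the parallelism the dataflow graph exposes.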


Example in Id

Id program segment computes factorial n! of integer n

    ( initial j <- n; k <- 1
      while j > 1 do
        new j <- j - 1;
        new k <- k * j;
      return k )

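[Figure: dataflow graph of the factorial loop, built from SWITCH and CHOOSE nodes with T and F outputs, a decrement (-1) node, and a multiplication node; the input is n and the output is n!.]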
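For readers unfamiliar with Id, the loop above can be mirrored in Python. This is a rough sketch under the assumption that both `new` bindings read the values of j and k from the current iteration; the function name `factorial_single_assignment` is made up for illustration.

```python
def factorial_single_assignment(n):
    """Mirror the Id loop: each iteration binds fresh values for j and k."""
    j, k = n, 1                    # initial j <- n; k <- 1
    while j > 1:
        # Both right-hand sides are evaluated with the old j and k before the
        # new bindings take effect, so no variable is updated in place.
        j, k = j - 1, k * j
    return k
```

The simultaneous tuple assignment is what keeps the sketch close to the single-assignment rule: within one iteration, each name is bound exactly once.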


Two important characteristics of dataflow graphs

Functionality: The evaluation of a dataflow graph is equivalent to evaluation of the corresponding mathematical function on the same input data.

Composability: Dataflow graphs can be combined to form new graphs.


Dataflow architectures

Pure dataflow computers: static, dynamic, and the explicit token store architecture.

Hybrid dataflow computers: augmenting the dataflow computation model with control-flow mechanisms, such as the RISC approach, complex machine operations, multithreading, large-grain computation, etc.

Pure Dataflow

A dataflow computer executes a program by receiving, processing and sending out tokens, each containing some data and a tag.

Dependences between instructions are translated into tag matching and tag transformation.

Processing starts when a set of matched tokens arrives at the execution unit.

The instruction that has to be fetched from the instruction store (according to the tag information) contains information about what to do with the data and how to transform the tags.

The matching unit and the execution unit are connected by an asynchronous pipeline, with queues added between the stages.

Some form of associative memory is required to support token matching:
- a real memory with associative access,
- a simulated memory based on hashing, or
- a direct matched memory.

Static Dataflow

A dataflow graph is represented as a collection of activity templates, each containing:
- the opcode of the represented instruction,
- operand slots for holding operand values, and
- destination address fields, referring to the operand slots in subsequent activity templates that need to receive the result value.

Each token consists only of a value and a destination address.

Dataflow graph and Activity template

[Figure: a dataflow graph (nodes * and sqrt at addresses n_i and n_j, operands x, y, and z) and its activity templates, showing data tokens on data arcs and acknowledge signals on acknowledgement arcs.]

Acknowledgement signals

Note that different tokens destined for the same destination cannot be distinguished.

The static dataflow approach therefore allows at most one token on any one arc. The basic firing rule is extended as follows:
- An enabled node is fired if there is no token on any of its output arcs.

This restriction is implemented by acknowledge signals (additional tokens) traveling along additional arcs from consuming to producing nodes.

With acknowledge signals in place, the firing rule can be changed back to its original form:
- A node is fired at the moment when it becomes enabled.

Again: structural hazards are ignored, assuming unlimited resources!

MIT Static Dataflow Machine

[Figure: MIT Static Dataflow Machine. Processing elements (PEs) are connected by a communication network. Each processing element contains an Activity Store, an Instruction Queue, a Fetch Unit, an Update Unit, one or more Operation Units, and the units SU and RU for local communication and for traffic to/from the communication network.]

Deficiencies of static dataflow

- Consecutive iterations of a loop can only be pipelined.
- Due to acknowledgment tokens, the token traffic is doubled.
- Lack of support for programming constructs that are essential to modern programming languages: no procedure calls, no recursion.

Advantage: simple model

Dynamic Dataflow

Each loop iteration or subprogram invocation should be able to execute in parallel as a separate instance of a reentrant subgraph.

The replication is only conceptual.

Each token has a tag:
- the address of the instruction for which the particular data value is destined, and
- context information.

Each arc can be viewed as a bag that may contain an arbitrary number of tokens with different tags.

The enabling and firing rule is now:
- A node is enabled and fired as soon as tokens with identical tags are present on all input arcs.

Structural hazards ignored!
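The tagged firing rule amounts to a lookup keyed by the tag. The sketch below models a waiting-matching store for dyadic instructions with a Python dictionary, i.e. a hashed store rather than a true associative memory. The names `Tag`, `MatchingStore`, and `receive` are hypothetical, and the tag layout (context, iteration, instruction address) is an assumption for illustration.

```python
from collections import namedtuple

# Assumed tag layout: context, iteration number, destination instruction address.
Tag = namedtuple("Tag", ["context", "iteration", "address"])


class MatchingStore:
    """Hash-based waiting-matching store for dyadic instructions."""

    def __init__(self):
        self.waiting = {}   # Tag -> (port, value) of the operand that came first

    def receive(self, tag, port, value):
        """Return (left, right) once both operands are present, else None."""
        partner = self.waiting.pop(tag, None)
        if partner is None:
            # First operand for this tag: it has to wait for its partner.
            self.waiting[tag] = (port, value)
            return None
        # Tokens with identical tags are now present on both input ports
        # (ports are numbered 1 and 2): the instruction is enabled and fires.
        operands = dict([partner, (port, value)])
        return operands[1], operands[2]


store = MatchingStore()
add_in_iteration_1 = Tag("c", 1, 40)
add_in_iteration_2 = Tag("c", 2, 40)
first = store.receive(add_in_iteration_1, 1, 10)    # waits
second = store.receive(add_in_iteration_2, 1, 20)   # waits, different tag
third = store.receive(add_in_iteration_2, 2, 5)     # matches iteration 2
fourth = store.receive(add_in_iteration_1, 2, 7)    # matches iteration 1
```

Tokens of two loop iterations share the same instruction address here, yet they never mix, because the iteration number is part of the tag.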


The U-interpreter (U = unraveling)

Each token consists of an activity name and data; the activity name comprises the tag.

The tag has
- an instruction address n,
- a context field c that uniquely identifies the context in which the instruction is to be invoked, and
- an initiation number i that identifies the loop iteration in which this activity occurs.

Note that c is itself an activity name.

Since the destination instruction may require more than one input, each token also carries the number of its destination port p.

We represent a token by $\langle c.i.n, \mathit{data} \rangle_p$.

The U-interpreter

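If the node $n_i$ performs a dyadic function $f$, and if port $p$ of $n_j$ is the destination of $n_i$, then executing $n_i$ transforms the tokens as follows:

in: $\{\langle c.i.n_i, x \rangle_1, \langle c.i.n_i, y \rangle_2\}$

out: $\{\langle c.i.n_j, f(x,y) \rangle_p\}$

[Figure: node $n_i$ with operand tokens on its two input ports before execution, and the result token on the arc to port $p$ of node $n_j$ after execution.]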

MERGE and SWITCH nodes

[Figure: token movement through a MERGE node and a SWITCH node before and after execution, shown for the control values T and F.]

Branch Implementations

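[Figure: two branch implementations. Branch: a SWITCH node $n_i$, steered by the predicate P, routes x either to f ($n_j$) on T or to g ($n_k$) on F. Speculative branch evaluation: x is copied (COPY) to both f and g, and a CHOOSE node steered by P selects one of the results.]

Branch:

in: $\langle c.i.n_i, x \rangle_{\mathit{data}}$; $\langle c.i.n_i, b \rangle_{\mathit{control}}$

out: $\langle c.i.n_j, x \rangle$ if $b = T$; $\langle c.i.n_k, x \rangle$ if $b = F$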

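L, L-1, D, and D-1 Operators for Loop Implementation

- L: initiation, new loop context
- D: increments loop iteration number
- D-1: resets loop iteration number to 1
- L-1: restores original context

$L$: in: $\langle c.i.n_i, x \rangle$; out: $\langle c'.1.n_k, x \rangle$, where $c' = c.i.n_i$

$D$: in: $\langle c'.j.n_j, x \rangle$; out: $\langle c'.(j+1).n_k, x \rangle$

$D^{-1}$: in: $\langle c'.k.n_l, x \rangle$; out: $\langle c'.1.n_m, x \rangle$

$L^{-1}$: in: $\langle c'.1.n_m, x \rangle$; out: $\langle c.i.n_n, x \rangle$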
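The four loop operators only rewrite tags; the data value passes through unchanged. A small sketch of these tag transformations follows, assuming a tag of the form (context, iteration, address) and assuming that L saves the entering activity name as the new context so that L-1 can restore it. The function names `op_L`, `op_D`, `op_D_inv`, and `op_L_inv` are hypothetical.

```python
from collections import namedtuple

# Assumed tag layout: context c, iteration number i, instruction address n.
Tag = namedtuple("Tag", ["context", "iteration", "address"])


def op_L(tag, dest):
    """L: open a new loop context; the old activity name becomes the context."""
    return Tag(context=tag, iteration=1, address=dest)


def op_D(tag, dest):
    """D: increment the loop iteration number."""
    return Tag(tag.context, tag.iteration + 1, dest)


def op_D_inv(tag, dest):
    """D-1: reset the loop iteration number to 1."""
    return Tag(tag.context, 1, dest)


def op_L_inv(tag, dest):
    """L-1: restore the original context and iteration saved by L."""
    outer = tag.context
    return Tag(outer.context, outer.iteration, dest)


entering = Tag("c", 7, "n_i")
inside = op_L(entering, "n_k")                      # first iteration
third_pass = op_D(op_D(inside, "n_k"), "n_k")       # third iteration
leaving = op_L_inv(op_D_inv(third_pass, "n_m"), "n_n")
```

Because the context field holds a whole activity name, nested loops simply nest tags, which is why c can itself be an activity name.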


Basic Loop Implementation

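[Figure: basic loop implementation with the operators L ($n_i$), D ($n_j$), D-1 ($n_l$), and L-1 ($n_m$), a SWITCH node ($n_k$) steered by the predicate P, and the loop body f producing the new x.]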

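A, A-1, BEGIN, and END Operators for Function Calls

- A: creates a new context
- BEGIN: replicates tokens for each fork
- END: returns results, unstacks the return address
- A-1: replicates output for successors

$A$: in: $\langle c.i.n_i, q \rangle_{\mathit{func}}$, $\langle c.i.n_i, a \rangle_{\mathit{arg}}$; out: $\langle c'.1.n_{\mathit{begin}}, a \rangle$, where $c' = c.i.n_j$ and $n_j$ is the address of the A-1 operator

END: in: $\langle c'.i.n_{\mathit{end}}, q(a) \rangle$; out: $\langle c.i.n_j, q(a) \rangle$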

Function application

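[Figure: function application. An APPLY node with the inputs q (function) and a (argument) is expanded into an A operator ($n_i$), the BEGIN ($n_{\mathit{begin}}$) and END ($n_{\mathit{end}}$) operators of the called function q, and an A-1 operator ($n_j$).]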

I-structures (I = incremental)

Problem: the single-assignment rule and complex data structures
- Each update of a data structure consumes the structure and the value, producing a new data structure.
- This is awkward or even impossible to implement.

Solution: the concept of an I-structure:
- a data repository obeying the single-assignment rule;
- each element of the I-structure may be written only once, but it may be read any number of times.

The basic idea is to associate with each element status bits and a queue of deferred reads.

I-structures

The status of each element of the I-structure can be:
- present: the element can be read but not written,
- absent: a read request has to be deferred, but a write operation into this element is allowed,
- waiting: at least one read request of the element has been deferred.

After an element of the data structure has become defined (initialized, value assigned; this can happen exactly once), all deferred reads, which are kept in the associated queue, are immediately satisfied.

An I-structure makes it possible to use a data structure before it is fully defined.

It allows defining complex data structures from existing, though partially defined, data structures.



I-structure

The following three elementary operations are defined on I-structures:
- allocate: reserves a specified number of elements for a new I-structure,
- I-fetch: retrieves the contents of the specified I-structure element (if the element has not yet been written, then this operation is automatically deferred),
- I-store: writes a value into the specified I-structure element (if that element is not empty, an error condition is reported).

These elementary operations are used to construct the nodes SELECT and ASSIGN.

The I-fetch instruction is implemented as a split-phase memory operation: a read request issued to an I-structure is independent in time from the response received and thus does not cause a wait by the issuing PE.
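The three operations and the element states can be sketched as a small Python class. This is an illustrative model only: the class name `IStructure`, the method names, and the use of a callback to stand in for the second phase of a split-phase read are assumptions, not the interface of any real machine.

```python
class IStructure:
    """Write-once storage with deferred reads (illustrative sketch)."""

    ABSENT, WAITING, PRESENT = "absent", "waiting", "present"

    def __init__(self, size):
        # allocate: reserve `size` elements, all initially absent.
        self.status = [self.ABSENT] * size
        self.values = [None] * size
        self.deferred = [[] for _ in range(size)]

    def i_fetch(self, index, reply):
        """Split-phase read: `reply` is called with the value once it exists."""
        if self.status[index] == self.PRESENT:
            reply(self.values[index])
        else:
            # The read is deferred; the issuing side does not block.
            self.deferred[index].append(reply)
            self.status[index] = self.WAITING

    def i_store(self, index, value):
        """Write once; a second write to the same element is an error."""
        if self.status[index] == self.PRESENT:
            raise RuntimeError(f"element {index} written twice")
        self.values[index] = value
        self.status[index] = self.PRESENT
        # Satisfy every read that was queued while the element was undefined.
        readers, self.deferred[index] = self.deferred[index], []
        for reply in readers:
            reply(value)


array = IStructure(4)
received = []
array.i_fetch(2, received.append)    # deferred: element 2 is still absent
status_while_waiting = array.status[2]
array.i_store(2, 42)                 # releases the deferred read
array.i_fetch(2, received.append)    # served immediately
```

The consumer can issue its read before the producer has written, which is how an I-structure lets a data structure be used before it is fully defined.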


I-structure select and assign

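[Figure: the SELECT node ($n_i$, inputs A and j) is implemented by an I-fetch that sends the address of element j of A to the I-structure storage, which delivers the value to $n_j$; the ASSIGN node ($n_i$, inputs A, j, and x) is implemented by an I-store that sends the address and x to the I-structure storage and passes a signal on to $n_j$.]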

MIT Tagged-Token Dataflow Architecture

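[Figure: MIT Tagged-Token Dataflow Architecture. Processing elements (PEs) and I-structure storage units are connected by a communication network. Each processing element contains a Wait-Match Unit & Waiting Token Store, an Instruction Fetch Unit with Program Store & Constant Store, an ALU & Form Tag unit, a Form Token Unit, a Token Queue, and the units SU and RU for local communication and for traffic to/from the communication network.]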

Manchester Dataflow Machine

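[Figure: Manchester Dataflow Machine. Processing elements (PEs) and structure storage units are connected through switches, with input and output to a host. Each processing element contains a Token Queue, a Matching Unit, an Instruction Store, and a Processing Unit with several ALUs.]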

Advantages and Deficiencies of Dynamic Dataflow

Major advantage: better performance (compared with static) as it allows multiple tokens on each arc, thereby unfolding more parallelism.

Problems:
- Efficient implementation of the matching unit that collects tokens with matching tags.
- Associative memory would be ideal. Unfortunately, it is not cost-effective, since the amount of memory needed to store tokens waiting for a match tends to be very large.
- As a result, all existing machines use some form of hashing technique, which is typically not as fast as associative memory.

Explicit Token Store Approach

Target: efficient implementation of token matching.

Basic idea: allocate a separate frame in the frame memory for each active loop iteration or subprogram invocation.

A frame consists of slots where each slot holds an operand that is used in the corresponding activity.

Since access to slots is direct (i.e. through offsets relative to the frame pointer), no associative search is needed.
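The direct, offset-based matching can be illustrated with a frame whose slots carry a presence bit. The sketch below is a simplified model and not Monsoon's actual data path: the names `Frame` and `match_dyadic` are hypothetical, and storing the port next to the value in the slot is an assumption made to keep the example short.

```python
class Frame:
    """One activation frame: each slot has a presence bit and a value."""

    def __init__(self, n_slots):
        self.present = [False] * n_slots
        self.value = [None] * n_slots


def match_dyadic(frame, offset, port, value):
    """Match a token against the slot at `offset`; no associative search.

    Returns (left, right) when the partner operand was already waiting,
    otherwise stores the token in the slot and returns None.
    """
    if not frame.present[offset]:
        # First operand: set the presence bit and park the token in the slot.
        frame.present[offset] = True
        frame.value[offset] = (port, value)
        return None
    # Second operand: reset the presence bit and retrieve the partner.
    stored_port, stored_value = frame.value[offset]
    frame.present[offset] = False
    frame.value[offset] = None
    if stored_port == "left":
        return stored_value, value
    return value, stored_value


frame = Frame(4)
first_arrival = match_dyadic(frame, 2, "right", 2.34)
bit_after_first = frame.present[2]
second_arrival = match_dyadic(frame, 2, "left", 3.01)
```

The slot address is computed from the frame pointer and the instruction's offset, so matching costs one indexed memory access instead of a search through all waiting tokens.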


Explicit Token Store

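[Figure: explicit token store. A token <FP, IP, 3.01> selects an instruction at address IP in instruction memory (fields: op-code, offset in the activation frame, left and right destinations) and an activation frame at address FP in frame memory; the slot at FP + 2 holds a presence bit and the value 2.34. The example graph contains the nodes *, +, -, and sqrt.]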

Explicit Token Store Matching Scheme

[Figure: explicit token store matching scheme at times t=0, t=1, and t=2, showing the presence bits and values in frame memory (2.34, 3.01, 7.4) as the tokens <FP, IP, 3.01>, <FP, IP, 2.0>, and <FP, IP, 7.4> arrive and the result tokens <FP, IP+1, 6.02> and <FP, IP+2, 6.02> are produced for the nodes *, +, and sqrt.]

k-bounded Loop Scheme

[Figure: k-bounded loop scheme. A loop prelude, a gate operator G supplied with k-1 tokens, SWITCH nodes steered by the predicate p, the loop body with D_k operators, D_reset operators, and a synchronization tree that produces the result signal. Frame memory holds k activation frames that are reused across iterations, e.g. iterations 1, 1+k, and 1+2k share one frame, and iterations k and 2k share another.]

k-bounded Loop Scheme - new Operators

New: gate operator G, operator D_k, and a synchronization tree.
- The D_k operator performs the modulo-k increment of the iteration number of tokens circulating through the loop.
- The G operator has two inputs (for control and data) and functions as a loop throttle by passing one token from its data input to the output for each token on its control input.
- At the end of each iteration, a new control token is generated by combining the outputs of all D_k operators into a single value using the synchronization tree.

Loop throttling: at most k consecutive loop iterations are active at the same time and, as a result, only k frames are needed.
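The throttling effect can be seen in a toy simulation. The sketch below is a loose model rather than the operator-level scheme: it assumes the gate starts with k control tokens, that every iteration takes the same positive number of time steps, and that iteration number modulo k selects the frame. The function name `run_k_bounded_loop` and its parameters are hypothetical.

```python
def run_k_bounded_loop(n_iterations, k, body_latency=3):
    """Simulate a loop in which at most k iterations are in flight."""
    credits = k            # control tokens at the gate (assumed start value)
    active = {}            # frame index -> [iteration number, steps remaining]
    next_iteration = 0
    completed = []
    max_active = 0
    while len(completed) < n_iterations:
        # Gate: admit one iteration per available control token.
        while credits > 0 and next_iteration < n_iterations:
            frame = next_iteration % k        # modulo-k iteration numbering
            assert frame not in active, "frame reused while still occupied"
            active[frame] = [next_iteration, body_latency]
            credits -= 1
            next_iteration += 1
        max_active = max(max_active, len(active))
        # One time step: every active iteration makes progress.
        for frame in sorted(active):
            active[frame][1] -= 1
            if active[frame][1] == 0:
                completed.append(active.pop(frame)[0])
                credits += 1                  # finished iteration frees a frame
    return completed, max_active


finished, peak = run_k_bounded_loop(10, 3)
```

With k = 3 the ten iterations complete in three-wide waves, and no more than three frames are ever occupied at once.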


Monsoon, an Explicit Token Store Machine

Each PE uses an eight-stage pipeline:
- instruction fetch: precedes token matching (in contrast to dynamic dataflow processors with associative matching units)!
- token matching 1, effective address generation: the explicit token address is computed from the frame address and the operand offset.
- token matching 2, presence bit operation: a presence bit is accessed to find out if the first operand of a dyadic operation has already arrived.
  - not arrived: the presence bit is set and the current token is stored into the frame slot of the frame memory;
  - arrived: the presence bit is reset and the operand can be retrieved from the slot of the frame memory in the next stage.
- token matching 3, frame operation stage: operand storing or retrieving.


Each PE uses an eight-stage pipeline (continued):
- next three stages: execution stages, in the course of which the next tag is also computed concurrently.
- eighth stage, form-token: forms one or two new tokens that are sent to the network, stored in a user token queue or a system token queue, or directly recirculated to the instruction fetch stage of the pipeline.

[Figure: Monsoon. Processing elements (PEs) and I-structure storage units are connected by a multistage packet-switching network. The pipeline of each processing element comprises Instruction Fetch (with Instruction Memory), Effective Address Generation, Presence Bit Operation, Frame Operation (with Frame Memory), ALU, and Form Token, together with a User Queue and a System Queue.]

Monsoon Prototype

16 prototypes at the beginning of the 1990s!

Processing element:
- 10 MHz clock
- 56 kW Instruction Memory (32 bit wide)
- 256 kW Frame Memory (word + 3 presence bits; word size: 64 bit data + 8 bit tag)
- Two 32 ktoken queues (system, user)

I-structure storage:
- 4 MW (word + 3 presence bits)
- 5 M requests/s

Network:
- Multistage, pipelined
- Packet Routing Chips (PaRC, 4 x 4 crossbar)
- 4 M tokens/s/link (100 MB/s)

Advantages and Deficiencies of Dynamic Dataflow

Major advantage: better performance (compared with static) because it allows multiple tokens on each arc, thereby unfolding more parallelism.

Problems:
- Efficient implementation of the matching unit that collects tokens with matching tags.
  - Associative memory would be ideal.
  - Unfortunately, it is not cost-effective, since the amount of memory needed to store tokens waiting for a match tends to be very large.
  - All existing machines use some form of hashing technique.
- Bad single-thread performance (when not enough workload is present).
- Dyadic instructions lead to pipeline bubbles when first operand tokens arrive.
- No instruction locality, no use of registers.

Augmenting Dataflow with Control-Flow

Poor sequential code performance by dynamic dataflow computers:
- An instruction of the same thread can only be issued to the dataflow pipeline after the completion of its predecessor instruction.
- In the case of an 8-stage pipeline, instructions of the same thread can be issued at most every eight cycles.
- Low workload: the utilization of the dataflow processor drops to one eighth of its maximum performance.

Another drawback: the overhead associated with token matching.
- Before a dyadic instruction is issued to the execution stage, two result tokens have to be present.
- The first token is stored in the waiting-matching store, thereby introducing a bubble in the execution stage(s) of the dataflow processor pipeline.
- Measured pipeline bubbles on Monsoon: up to 28.75 %.

No use of registers possible!

Augmenting Dataflow with Control-Flow

Solution: combine dataflow with control-flow mechanisms.

Several techniques for combining control-flow and dataflow emerged:
- hybrid dataflow,
- RISC dataflow,
- dataflow with complex machine operations,
- threaded dataflow,
- large-grain dataflow.

Threaded Dataflow

Threaded dataflow: the dataflow principle is modified so that instructions of certain instruction streams are processed in succeeding machine cycles.

In a dataflow graph a subgraph that exhibits a low degree of parallelism is transformed into a sequential thread.

The thread of instructions is issued consecutively by the matching unit without matching further tokens except for the first instruction of the thread.

Threaded dataflow covers
- the repeat-on-input technique used in the Epsilon-1 and Epsilon-2 processors,
- the strongly connected arc model of EM-4, and
- the direct recycling of tokens in Monsoon.

Threaded Dataflow (continued)

- Data passed between instructions of the same thread is stored in registers instead of being written back to memory.
- These registers may be referenced by any succeeding instruction in the thread.
- Thereby single-thread performance is improved.
- The total number of tokens needed to schedule program instructions is reduced, which in turn saves hardware resources.
- Pipeline bubbles are avoided for dyadic instructions within a thread.

Two threaded dataflow execution techniques can be distinguished:
- direct token recycling (Monsoon),
- consecutive execution of the instructions of a single thread (Epsilon & EM).

Direct token recycling of Monsoon

- Cycle-by-cycle instruction interleaving of threads, similar to multithreaded von Neumann computers!
- 8 register sets can be used by 8 different threads.
- Dyadic instructions within a thread (except for the start instruction!) refer to at least one register, i.e. they need only a single token to be enabled.
- A result token of a particular thread is recycled ASAP in the 8-stage pipeline, i.e. every 8th cycle the next instruction of a thread is fired and executed.
- This implies that at least 8 threads must be active for full pipeline utilization.
- Threads and fine-grain dataflow instructions can be mixed in the pipeline.
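The relation between active threads and pipeline utilization can be written down directly. The sketch below assumes an idealized pipeline in which each thread contributes exactly one instruction per pass and nothing else fills the empty slots; the function name `pipeline_utilization` is made up for illustration.

```python
def pipeline_utilization(active_threads, pipeline_stages=8):
    """Fraction of pipeline slots filled when each thread issues once per pass."""
    if active_threads < 0 or pipeline_stages <= 0:
        raise ValueError("thread count must be >= 0 and stage count > 0")
    # A thread's next instruction can only enter after its result token has
    # been recycled, so one thread fills one slot out of `pipeline_stages`.
    return min(active_threads, pipeline_stages) / pipeline_stages


single_thread = pipeline_utilization(1)    # one eighth of peak
half_loaded = pipeline_utilization(4)
fully_loaded = pipeline_utilization(8)
```

Under these assumptions a single thread reaches one eighth of the peak rate, and eight or more threads saturate the eight-stage pipeline.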


Epsilon and EM-4

Epsilon and EM-4 execute instructions from a thread consecutively.
- The circular pipeline of fine-grain dataflow is retained.
- However, the matching unit has to be enhanced with a mechanism that, after firing the first instruction of a thread, delays matching of further tokens in favor of consecutive issuing of all instructions of the started thread.

Example: strongly connected arc model (EM-4):
- Each arc of the dataflow graph is classified as either a normal arc or a strongly connected arc.
- The set of nodes that are connected by strongly connected arcs is called a strongly connected block.
- Such a block is fired if its source nodes are enabled; the instructions in the block are then executed successively.

Problem: implementation of an efficient synchronization mechanism

Strongly connected blocks in EM-4

[Figure: strongly connected blocks in EM-4. A dataflow graph with nodes 1 to 14, joined by normal arcs and strongly connected arcs, is partitioned into the strongly connected blocks A and B.]

Direct matching: Instruction Memory and Operand Memory

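[Figure: direct matching. Instruction memory holds template segments with the op-codes (*, +, -, sqrt); operand memory holds operand segments whose slots carry a presence bit and a value (e.g. 2.34). The segments are labeled OSN and TSN, and the arriving token is <SF, OSN, DPL, 6.02> with SF = dyadic right.]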

EM-4

[Figure: EM-4. Processing elements (PEs) are connected by an Omega network. Each processing element consists of an EMC-R processor and memory. The EMC-R contains a Switching Unit, an Input Buffer Unit, a Fetch Matching Unit, an Execution Unit (instruction fetch, execute and emit tokens), Register Files, and a Memory Control Unit; the memory holds operand segments, template segments, and a heap.]

Large-Grain (coarse-grain) Dataflow

A dataflow graph is enhanced to contain fine-grain (pure) dataflow nodes and macro dataflow nodes.

A macro dataflow node contains a sequential block of instructions.

A macro dataflow node is activated in the dataflow manner; its instruction sequence is executed in the von Neumann style!

Off-the-shelf microprocessors can be used to support the execution stage.

Large-grain dataflow machines typically decouple the matching stage (sometimes called signal stage, synchronization stage, etc.) from the execution stage by use of FIFO-buffers.

Pipeline bubbles are avoided by the decoupling and FIFO-buffering.


Large-Grain Dataflow: StarT

[Figure: StarT node. A data processor (dP), a synchronization processor (sP), and a remote memory request processor (Rmem) share the node memory and communicate through a continuation queue and message queues, with local links and links to/from the communication network.]

Dataflow with Complex Machine Operations

Use of complex machine instructions, e.g. vector instructions:
- ability to exploit parallelism at the subinstruction level;
- instructions can be implemented by pipeline techniques as in vector computers;
- the use of a complex machine operation may spare several nested loops.

Structured data is referenced in blocks rather than element-wise and can be supplied in a burst mode.

Problem: in the I-structure scheme, each data element within a complex data structure is fetched individually from a structure store.

The structure-flow technique (SIGMA-1) defines structure load/store instructions that can move whole vectors to/from the structure store.

Dataflow with Complex Machine Operations and combined with LGDF

Often: use of FIFO buffers to decouple the firing stage and the execution stage; this bridges different execution times within a mixed stream of simple and complex instructions.

Major difference to pure dataflow: tokens do not carry data (except for the values true or false).

Data is only moved and transformed within the execution stage.

Applied in: Decoupled Graph/Computation Architecture, the Stollmann Dataflow Machine, and the ASTOR architecture.

These architectures combine complex machine instructions with large-grain dataflow.


Augsburg Structure-Oriented Architecture (ASTOR)

[Figure: ASTOR. Processing elements (PEs) are connected by an instruction communication network and a data communication network. Each processing element has a program flow control part (Static Code Storage, Static Code Access Manager, Dynamic Code Storage, Dynamic Code Access Manager, Control Construct Managers, I/O Manager) and a data object processing part (Data Storage, Data Access Managers, Computational Structure Manager, Data Transformation Units, I/O Managers).]

RISC and Dataflow

Another dataflow/von Neumann hybrid of Arvind:
- Support the execution of existing software written for conventional processors.
- Using such a machine as a bridge between existing systems and new dataflow supercomputers should have made the transition from imperative von Neumann languages to dataflow languages easier!?

The basic philosophy underlying the development of the RISC dataflow architecture can be summarized as follows:
- use a RISC-like instruction set,
- change the architecture to support multithreaded computation,
- add (explicit!!) fork and join instructions to manage multiple threads,
- implement all global storage as I-structure storage, and
- implement load/store instructions to execute in split-phase mode.

RISCifying dataflow

[Figure: RISCifying dataflow. A dataflow graph with fork and join nodes around the operations mul, add, sub, div, and sqrt is translated into a code listing (instruction addresses 109 to 160, columns op-code and operands) that uses fork, join 2, and jump instructions; each token carries a frame pointer and a destination instruction address.]

P-RISC Architecture

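[Figure: P-RISC architecture. Processing elements (PEs) and global memory are connected by a communication network. Each processing element contains a Token Queue, an Instruction Fetch Unit with Instruction Memory, an Operand Fetch Unit with Frame Memory, an ALU, an Operand Store Unit, and a Communication Queue towards the global (I-structure) memory.]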

P-RISC Characteristics

- Load/store instructions are the only instructions accessing GM (implemented as I-structure storage).
- Arithmetic/logical instructions operate on local memory (registers).
- Fixed instruction length.
- One-cycle instruction execution (except for load/store instructions).
- No explicit matching unit: all operands associated with a sequential thread of computation are kept in a frame in local Program Memory (PM).
- Continuation: an (IP, FP) pair
  - IP serves to fetch the next instruction;
  - FP serves as the base for fetching and storing operands.
- To make P-RISC multithreaded, the stack of frames is arranged as a tree of frames, and a separate continuation is associated with each thread.
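How explicit fork and join instructions replace token matching can be shown with a tiny interpreter over continuations. This is a hedged sketch, not the P-RISC instruction set: the tuple encoding of instructions, the names `ActivationFrame`, `join`, and `run_threads`, and the modelling of join as a counter in a frame slot are all assumptions for illustration.

```python
class ActivationFrame:
    """Frame holding the synchronization slots of one activation."""

    def __init__(self, n_slots):
        self.slots = [0] * n_slots


def join(frame, slot, expected=2):
    """Count an arriving thread; only the last arrival is allowed to continue."""
    frame.slots[slot] += 1
    if frame.slots[slot] < expected:
        return False           # this thread ends; its partner is still missing
    frame.slots[slot] = 0      # reset so the slot can be reused
    return True


def run_threads(program, start_ip, frame):
    """Execute continuations (IP, FP) from a queue until none are left."""
    ready = [(start_ip, frame)]
    trace = []
    while ready:
        ip, fp = ready.pop(0)
        op, *args = program[ip]
        if op == "fork":
            ready.append((args[0], fp))     # new thread at the fork target
            ready.append((ip + 1, fp))      # current thread continues
        elif op == "join":
            if join(fp, args[0]):
                ready.append((ip + 1, fp))
        elif op == "jump":
            ready.append((args[0], fp))
        elif op != "halt":
            trace.append(op)                # stand-in for an ALU operation
            ready.append((ip + 1, fp))
    return trace


program = {
    0: ("fork", 10),
    1: ("mul",), 2: ("jump", 20),
    10: ("sqrt",), 11: ("jump", 20),
    20: ("join", 0),
    21: ("add",), 22: ("halt",),
}
trace = run_threads(program, 0, ActivationFrame(1))
```

The `add` runs exactly once, after both the `mul` thread and the `sqrt` thread have reached the join, which is the synchronization a dyadic node would otherwise get from token matching.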


Other Hybrids

Various blends of operating principles are possible:
- Treleaven 1982: dataflow and reduction principle. What is reduction? Dataflow is data-driven, reduction is demand-driven.
- MUSE 1985: dataflow, tree machine, and graph reduction
- MADAME (Silc and Robic 1989): synchronous dataflow principle
- DTN Dataflow Computer 1990: dataflow workstation, based upon NEC Image Pipelined Processor chips

Lessons Learned from Dataflow

The latest generation of superscalar microprocessors displays an out-of-order dynamic execution that is referred to as local dataflow or micro dataflow.

Colwell and Steck 1995, in the first paper on the PentiumPro: The flow of the Intel Architecture instructions is predicted and these instructions are decoded into micro-operations (µops), or series of µops, and these µops are register-renamed, placed into an out-of-order speculative pool of pending operations, executed in dataflow order (when operands are ready), and retired to permanent machine state in source program order.

State-of-the-art microprocessors typically provide 32 (MIPS R10000), 40 (Intel PentiumPro) or 56 (HP PA-8000) instruction slots in the instruction window or reorder buffer.

Each instruction is ready to be executed as soon as all operands are available.


Comparing dataflow computers with superscalar microprocessors

Superscalar microprocessors are von Neumann based:
- a (sequential) thread of instructions as input;
- not enough fine-grained parallelism to feed the multiple functional units, hence speculation.

The dataflow approach resolves any threads of control into separate instructions that are ready to execute as soon as all required operands become available.

The fine-grained parallelism generated by the dataflow principle is far larger than the parallelism available for microprocessors.

However, locality is lost: no caching, no registers.

Lessons Learned from Dataflow (Pipeline Issues)

Microprocessors: Data and control dependences potentially cause pipeline hazards that are handled by complex forwarding logic.

Dataflow: Due to the continuous context switches, pipeline hazards are avoided; disadvantage: poor single thread performance.

Microprocessors: Antidependences and output dependences are removed by register renaming that maps the architectural registers to the physical registers.

Thereby the microprocessor internally generates an instruction stream that satisfies the single assignment rule of dataflow.
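A short sketch makes the link between register renaming and the single-assignment rule concrete. It assumes an unlimited supply of physical registers that are never freed and a three-operand instruction format (dest, src1, src2); the function name `rename` and this encoding are hypothetical.

```python
def rename(instructions, n_arch_regs):
    """Map each write to a fresh physical register (unbounded pool assumed).

    instructions: list of (dest, src1, src2) architectural register numbers.
    Returns the same instructions expressed in physical register numbers.
    """
    mapping = {reg: reg for reg in range(n_arch_regs)}   # arch -> physical
    next_physical = n_arch_regs
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources are looked up before the destination is remapped, so an
        # instruction like r1 = r1 + r2 still reads the old value of r1.
        phys1, phys2 = mapping[src1], mapping[src2]
        mapping[dest] = next_physical
        renamed.append((next_physical, phys1, phys2))
        next_physical += 1
    return renamed


code = [
    (1, 2, 3),   # r1 = r2 op r3
    (4, 1, 2),   # r4 = r1 op r2   (true dependence on r1)
    (1, 5, 6),   # r1 = r5 op r6   (output and anti dependence on r1)
    (2, 1, 1),   # r2 = r1 op r1   (anti dependence on r2)
]
renamed = rename(code, n_arch_regs=8)
```

After renaming, every physical destination is written exactly once, so only the true dependences remain, as in a dataflow graph.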

The main difference between the dependence graphs of dataflow and the code sequence in an instruction window of a microprocessor: branch prediction and speculative execution.

Microprocessors: rerolling execution in case of a wrongly predicted path is costly in terms of processor cycles.


Lessons Learned from Dataflow (Continued)

Dataflow: The idea of branch prediction and speculative execution has never been evaluated in the dataflow environment.

Dataflow was considered to produce an abundance of parallelism while speculation leads to speculative parallelism which is inferior to real parallelism.

Microprocessors: Due to the single thread of control, a high degree of data and instruction locality is present in the machine code.

Microprocessors: The locality makes it possible to employ a storage hierarchy that stores the instructions and data potentially executed in the next cycles close to the executing processor.

Dataflow: Due to the lack of locality in a dataflow graph, a storage hierarchy is difficult to apply.



Microprocessors: The operand matching of executable instructions in the instruction window is restricted to a part of the instruction sequence.

Because of the serial program order, the instructions in this window are likely to become executable soon. The matching hardware can be restricted to a small number of slots.

Dataflow: the number of tokens waiting for a match can be very high. A large waiting-matching store is required.

Dataflow: Due to the lack of locality, the likelihood of the arrival of a matching token is difficult to estimate, so caching of tokens to be matched soon is difficult.

Lessons Learned from Dataflow (Memory Latency)

Microprocessors: An unsolved problem is the memory latency caused by cache misses.

Example: SGI Origin 2000:
- Latencies are 11 processor cycles for an L1 cache miss, 60 cycles for an L2 cache miss, and can be up to 180 cycles for a remote memory access.
- In principle, latencies should be multiplied by the degree of superscalar.

Microprocessors: Only a small part of the memory latency can be hidden by out-of-order execution, write buffer, cache preload hardware, lockup free caches, and a pipelined system bus.

Microprocessors often idle and are unable to exploit the high degree of internal parallelism provided by a wide superscalar approach.

Dataflow: The rapid context switching avoids idling by switching execution to another context.
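The remark that latencies should be multiplied by the degree of superscalar can be made concrete with a few lines of arithmetic. The latencies are the Origin 2000 figures quoted above; the issue width of 4 is purely an assumed example value, and the names `lost_issue_slots` and `ORIGIN_2000_LATENCIES` are hypothetical.

```python
# Latencies in processor cycles, as quoted for the SGI Origin 2000.
ORIGIN_2000_LATENCIES = {
    "L1 cache miss": 11,
    "L2 cache miss": 60,
    "remote memory access": 180,
}


def lost_issue_slots(latency_cycles, issue_width):
    """Issue slots that go unused if the processor stalls for the full latency."""
    return latency_cycles * issue_width


ASSUMED_ISSUE_WIDTH = 4   # example value only, not taken from the source
slots_lost = {
    event: lost_issue_slots(cycles, ASSUMED_ISSUE_WIDTH)
    for event, cycles in ORIGIN_2000_LATENCIES.items()
}
```

For the assumed 4-way machine, a fully exposed remote access of 180 cycles corresponds to 720 unused issue slots, which is the gap that rapid context switching tries to fill.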



Microprocessors: Finding enough fine-grain parallelism to fully exploit the processor will be the main problem for future superscalars.

Solution: enlarge the instruction window to several hundred instruction slots; two drawbacks:
- Most of the instructions in the window will be speculatively assigned with a very deep speculation level (today's depth is normally four at maximum), so most of the instruction execution will be speculative. The principal problem here arises from the single instruction stream that feeds the instruction window.
- If the instruction window is enlarged, the updating of the instruction states in the slots and the matching of executable instructions lead to more complex hardware logic in the issue stage of the pipeline, thus limiting the cycle rate.


Solutions:
- the decoupling of the instruction window with respect to different instruction classes,
- the partitioning of the issue stage into several pipeline stages, and
- alternative instruction window organizations.

Alternative instruction window organization: the dependence-based microprocessor:
- The instruction window is organized as multiple FIFOs.
- Only the instructions at the heads of a number of FIFO buffers can be issued to the execution units in the next cycle.
- The total parallelism in the instruction window is restricted in favor of a less costly issue that does not slow down the processor cycle rate.
- Thereby the potential fine-grained parallelism is limited, somewhat similar to the threaded dataflow approach.

Lessons Learned from Dataflow (alternative instruction window organizations)

- Look at dataflow matching store implementations.
- Look into dataflow solutions like threaded dataflow (e.g. the repeat-on-input technique or the strongly connected arc model).
- The repeat-on-input strategy issues compiler-generated code sequences serially (in an otherwise fine-grained dataflow computer).

Transferred to the local dataflow in an instruction window:
- an issue string might be used;
- a series of data-dependent instructions is generated by a compiler and issued serially after the issue of the leading instruction.

However, the high number of speculative instructions in the instruction window remains.
