15
[email protected] Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de Reconfigurable HPC Reconfigurable HPC part 2 Data-Stream-based Computing Reiner Hartenstein TU Kaiserslautern May 14, 2004 , TU Tallinn, Estonia © 2004, [email protected] http://hartenstein.de TU Kaiserslautern 2 Preface Response to the Computenik Shock: the completely wrong roadmap to HPC blinders to ignore the impact of morphware ... continue to bang their heads against the memory wall instead of © 2004, [email protected] http://hartenstein.de TU Kaiserslautern 3 Crusty Computing Sciences [David Padua, John Hennessy] 98.5% vN-only this monopoly is the problem went morphware dead went morphware exhausted memory wall © 2004, [email protected] http://hartenstein.de TU Kaiserslautern 4 Exhausted: Dead Supercomputer Society ACRI Alliant American Supercomputer Ametek Applied Dynamics Astronautics BBN CDC Convex Cray Computer Cray Research Culler-Harris Culler Scientific Cydrome Dana/Ardent/ Stellar/Stardent DAPP Denelcor Elexsi ETA Systems Evans and Sutherland Computer Floating Point Systems Galaxy YH-1 Goodyear Aerospace MPP Gould NPL Guiltech ICL Intel Scientific Computers International Parallel Machines Kendall Square Research Key Computer Laboratories [Gordon Bell, keynote at ISCA 2000]. MasPar Meiko Multiflow Myrias Numerix Prisma Tera Thinking Machines Saxpy Scientific Computer Systems (SCS) Soviet Supercomputers Supertek Supercomputer Systems Suprenum Vitesse Electronics © 2004, [email protected] http://hartenstein.de TU Kaiserslautern 5 HPC going configware http://fpl.org International Conference on Field-Programmable Logic and Applications (FPL) Aug. 20 Sept 1, 2004, Antwerp, Belgium 288 submissions ! accel. μProc. ... going into every type of application they all work on high performance © 2004, [email protected] http://hartenstein.de TU Kaiserslautern 6 >> Machine vs. Anti Machine << Machine vs. Anti Machine Terminology Morphware Platforms coarse-grained Platforms Dual Machine Paradigms Final Remarks http://www.uni-kl.de

Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

  • Upload
    vukhanh

  • View
    229

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

[email protected]

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

Reconfigurable HPC

Reconfigurable HPC

part 2

Data-Stream-based Computing

Reiner Hartenstein

TU Kaiserslautern

May 14, 2004 , TU Tallinn, Estonia

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

2

Preface

Response to the Computenik Shock:

the completely wrong roadmap to HPC

blinders to ignore the impact of morphware

... continue to bang their heads against the memory wall

instead of

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

3

Crusty Computing Sciences

[David Padua, John Hennessy]

98.5% vN-only

this monopoly is the problem

went morphware

dead

went morphware

exhausted

memory wall

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

4

Exhausted: Dead Supercomputer Society

•ACRI •Alliant •American Supercomputer

•Ametek •Applied Dynamics •Astronautics •BBN •CDC •Convex •Cray Computer •Cray Research •Culler-Harris •Culler Scientific •Cydrome •Dana/Ardent/ Stellar/Stardent

•DAPP •Denelcor •Elexsi •ETA Systems •Evans and Sutherland •Computer •Floating Point Systems •Galaxy YH-1 •Goodyear Aerospace MPP •Gould NPL •Guiltech •ICL •Intel Scientific Computers •International Parallel Machines

•Kendall Square Research •Key Computer Laboratories

[Gordon Bell, keynote at ISCA 2000].

•MasPar •Meiko •Multiflow •Myrias •Numerix •Prisma •Tera •Thinking Machines •Saxpy •Scientific Computer •Systems (SCS) •Soviet Supercomputers •Supertek •Supercomputer Systems •Suprenum •Vitesse Electronics

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

5

HPC going configware

http://fpl.org

International Conference on Field-Programmable Logic

and Applications (FPL)

Aug. 20 – Sept 1, 2004, Antwerp, Belgium

288 submissions ! accel. µProc.

... going into every type of application

they all work on high performance © 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

6

>> Machine vs. Anti Machine <<

• Machine vs. Anti Machine

• Terminology

• Morphware Platforms

• coarse-grained Platforms

• Dual Machine Paradigms

• Final Remarks http://www.uni-kl.de

Page 2: Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

[email protected]

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

7

Reconfigurable Computing: a second programming domain

Migration of programming to the structural domain

The opportunity to introduce the structural domain to programmers ...

The structural domain has become RAM-based

... to bridge the gap by clever abstraction mechanisms using a simple new machine paradigm

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

8

machine vs. anti machine

progra

m

counter

DPU CPU

RAM memory

(r)DPA

without sequencer

asMA: auto-sequencing Memory Array

........

asM

asM

asM

asM

asM

(r)DPA

........

data stream machine (anti machine)

data

counter

memory bank

asM

asM: auto-sequencing Memory

instruction stream machine (von Neumann etc.)

matter vs. anti matter

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

9

Data-stream-based Computing

The Anti Universe of Computing

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

10

The anti universe

•Paul Dirac predicted a complete anti universe consisting of antimatter

•“There are regions in the universe, which consist of antimatter .....

•We are not aware, that there is a new area in computing sciences , which consists of antimatter of computing

• .... But there are asymmetries”

•Reconfigurable Computing is made from this antimatter: data-stream-based computing

•when a particle hits its antiparticle, both are converted into energy: Annihilation

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

11

anti particles

• 1956: anti neutron created on Bevatron

• 1928: Paul Dirac: „there should be an anti electron having positive charge“ (Nobel price 1933)

• 1932: Carl David Anderson detected this „positron“ in cosmic radiation (Nobel price 1936)

• 1955 Owen Chamberlain et al. create anti proton on Bevatron

• 1954: new accelerators: cyclotron, like Berkeley‘s Bevatron

• 1965: creation of a deuterium anti nucleus at CERN

hydrogen anti hydrogen

• 1995: hydrogen anti atom created at CERN – by forcing positron and anti proton to merge by very low energy.

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

12

Matter & Antimatter: Atom and Anti Atom

The World of Matter -

machine paradigm: the Atom

Anti Matter -

machine paradigm: Anti Atom

+ + -

- - +

Page 3: Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

[email protected]

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

13

Matter & Antimatter of Informatics : Machine and Anti Machine

+

CPU

- 1936 1st electronic computer (Konrad Zuse)

Machine paradigm: „von Neumann“

1946 v. N. machine paradigm

1971 1st microprocessor (Ted Hoff)

1979 „data streams“ (systolic array: Kung / Leiserson)

- DPU

+

Anti Machine paradigm

1990 anti machine paradigm published

1995 rDPA / DPSS (supersystolic: Rainer Kress)

novel

compilation

techniques

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

14

- DPU

(r)DPU

+

CPU

instruction stream

Matter vs. antimatter: CPU vs. DPU

-

progra

m

counter

DPU CPU

without sequencer

+

dat

a st

ream

(r)DPA

dat

a st

ream

s

+

+

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

15

heavy anti atoms: DPA = DPU array

- DPA

- DPU

- DPU

- DPU

- DPU

- DPU

- DPU

- DPU

- DPU

- DPU -

DPA

+

+

+

+

+

+

+

+

+

cohere

nt d

ata

stre

ams

spin

ning

aro

und

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

16

Terminology: DPU versus CPU ...

• DPU: data path unit • DPA: DPU array • GA: gate array • rDPU: reconfigurable DPU • rDPA: reconfigurable DPA • rGA: reconfigurable GA

• DPU is no CPU: there is nothing central - like in a DPA

DPU DPU

DPU instruction sequencer

CPU

DPA (r)

(r)

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

17

Parallelism by Concurrency

+ -

+

- -

+

- +

+

-

- +

- +

independent instruction streams difficult ...

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

18

Annihilation?

- +

-

+ -

+

avoidable by careful

methodology

Page 4: Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

[email protected]

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

19

CS education .....

software person

procedural

structural

hardware person

Configware / Software Co-Design? Hardware / Software Co-Design?

Annihilation avoidable by CS curricular revision

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

20

>> Terminology <<

• Machine vs. Anti Machine

• Terminology

• Morphware Platforms

• coarse-grained Platforms

• Dual Machine Paradigms

• Final Remarks http://www.uni-kl.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

21

„Re-configurable Hardware“ ??

„Re-configurable Hardware“ ??

this „Hardware“ is not hard !

We need a concise terminology: a consensus is on the way

it‘s Morphware

Terminology has been highly confusing

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

22

Configware and Flowware ... ... are the sources for programming morphware.

Software is the source for programming traditional hardwired processors (instruction-stream-driven: von Neumann machine paradigm and its derivatives)

For Configware and Flowware we prefer the anti machine paradigm – conterpart of von Neumann

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

23

de facto Duality of RAM-based platforms

traditional new

RAM-based platform CPU morphware (FPGA, rDPA ..)

„running“ on it: software configware

machine paradigm von Neumann etc.: instruction-stream-based

anti machine: data-stream-based

2nd paradigm

We now have 2 types of programmable platforms

hardware viewed as frozen configware:

Just earlier binding

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

24

Terminology: Digital System Platforms clearly distinguished

platform source running on it

machine paradigm

hardware (not running on it)

none

morphware

fine grain rGA (FPGA) configware

coarse grain

rDPU, rDPA reconfigurable data stream processor

flowware & configware anti machine

data stream processor (hardwired) flowware

instruction stream processor software von Neumann machine

Page 5: Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

[email protected]

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

25

Terminology: Digital System Platforms clearly distinguished

platform source running on it

machine paradigm

hardware (not programmable)

none

morphware

fine grain rGA (FPGA) configware

coarse grain

rDPU, rDPA reconfigurable data stream processor

flowware & configware anti machine

data stream processor (hardwired) flowware

instruction stream processor software von Neumann machine

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

26

Importance of binding time

Configuration: like a kind of pre-packed frozen-in „super instruction fetch“

load time

compile time

Reconfigurable Computing

not all switching is done by Configware

run time Microprocessor Parallel Computer

time of “instruction

fetch” read new

instruction read new instruction

c 0 1

for a pipe network

1

0

c

1

0

c

configure datapaths

fabrication time

Full custom oder ASIC

fabricate a

datapath

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

27

Software vs Flowware and Configware

Programming source for instruction-stream-based computing (von Neumann etc.):

The programming source for data-stream-based computing operations (the anti machine paradigm):

Programming sources for Reconfigurable Computing (morphware):

Software

Flowware

Flowware and Configware

Sources for Embedded Systems:

Flowware, Configware & Software

µProc.

compile

rec.anti machine

compile

rec.anti machine µProc.

partit. compiler

hdw.anti machine

d.schedule

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

28

control-procedural vs. data-procedural

The structural domain is primarily data-stream-based:

..... mostly not yet modelled that way: most flowware is hidden by its indirect

instruction-stream-based implementation

Flowware converts „procedural vs. structural“ into „control-procedural vs. data-procedural“ ...

Flowware

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

29

Flowware Languages

MoPL: fully supporting the anti machine paradigm – the counterpart of the von Neumann paradigm

general purpose:

Brook: for modern graphics hardware

Streams-C: defines 1-D streams; generates VHDL (LANL Open-source C compiler targets FPGAs)

DSP-C: allows to describe key features of DSPs www.dsp-c.org

(Data Streaming Language examples)

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

30

domain procedural structural

computing in ... time only* space and time

program source software*

hardwired reconfigurable

currently emerging

(hardware +) software**

(hardware +) flowware

configware + flowware

„instruction“ fetch at runtime

before fabrication at loading time

data „fetch“ at run time **) software „simulates“ flowware

algorithms variable

resources variable

reconfigurable:

http://hartenstein.de

programming: procedural vs. structural

algorithms fixed

resources fixed

fully hardwired: not programmable

*) only one source needed

algorithms variable

resources fixed

CPU:

embedded systems: data-stream-based instruction-stream-based

Page 6: Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

[email protected]

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

31

Glossary

DPU data path unit rDPU reconfigurable DPU DPA data path array (DPU array) rDPA reconfigurable DPA ISP instruction set processor AM anti machine AMP data stream processor* rAMP reconfigurable AMP

*) no “dataflow machine”

platform category

source „running“ on platform

machine paradigm

hardware (not programmable) none

ISP** software von Neumann

• morphware configware FPGA: none

data stream processor (AMP*)

flowware

anti machine reconfigurable

AMP (rAMP)

flowware &

configware

digital system platforms:

morphware use granularity (path width) (re)configurable blocks

reconfigurable logic • fine grain (FPGA) (~1 bit) CLBs

reconfigurable computing coarse grain (e.g. 32 bits) rDPUs (e.g. ALU-like)

multi granular: by slice bundling rDPU slices (e.g. 4 bits)

categories of morphware:

approaching consensus

**) Von Neumann etc. *) data stream processor © 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

32

data streams*: not new

1980: data streams (Kung, Leiserson: systolic arrays)

1989: data-stream-based Xputer architecture

1990: rDPU (Rabaey)

1994: Flowware Language MoPL (Becker et al.)

1995: super systolic array (rDPA) + DPSS tool (Kress)

1996+: Streams-C language, SCCC (Los Alamos), SCORE, ASPRC, Bee (UC Berkeley), DSP-C, Brook, ...

1996: configware / software partitioning compiler (Becker)

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

33

URLs: Software vs Flowware and Configware

µProc.

compile

rec.anti machine

compile

rec.anti machine µProc.

partit. compiler

hdw.anti machine

d.schedule

http://data-streams.org/

http://anti-machine.org/

http://kressarray.de/

http://flowware.net/

http://configware.org/

http://morphware.net/

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

34

>> Embedded Computing <<

• Machine vs. Anti Machine

• Terminology

• Morphware Platforms

• coarse-grained Platforms

• Dual Machine Paradigms

• Final Remarks http://www.uni-kl.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

35

Fine-grain Morphware: Drawbacks

• FPGA Architectures – SRAM-based Look-up Tables (LUTs)

– Problems:

• Routing: reduces Performance

• Bad Ratio: active / passive Elements

reconfigurable Interconnect (Switching Boxes)

Configurable Logic

Block (CLB)

Source: R. Hartenstein

LUT

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

36

Reconfigurability Overhead

S S

S S

resources needed for reconfigurability

partly for configuration code storage

L

L L

L L

L

L L L

area used by application

“hidden RAM” not shown

Page 7: Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

[email protected]

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

37

Throughput vs. Efficiency

1000

100

10

1

0.1

0.01

0.001 2 1 0.5 0.25 0.13 0.1 0,07

MOPS / mW

µ feature size

S S

S S

resources needed for

reconfigurability

L

L L

L L

L

L L L

area used by application

1 Bit CLB

T. Claasen et al.: ISSCC 1999

Wiring by abutment: 32 Bit example

*) R. Hartenstein: ISIS 1997

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

38

One more argument for coarse grain

100

1000

10

1

0.1

0.01

0.001 2 1 0.5 0.25 0.13 0.1 0,07

MOPS / mW

µ feature size

T. Claasen et al.: ISSCC 1999

Wiring by abutment: a 32 Bit KressArray example

if coarse grain cells are full custom and

mesh-connected, and 2nd level interconnect

ressources layouted over the cells

*) R. Hartenstein: ISIS 1997

the array is almost as

area-efficient as hardwired

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

39

It’s a Paradigm Shift !

• Using FPGAs (fine grain reconfigurable) just mainly has been classical Logic Synthesis on a “strange hardware” platform

• Coarse Grain Reconfigurable Arrays (rDPAs) (Reconfigurable Computing), however, mean a really fundamental Paradigm Shift

• This is still ignored by CS and EE Curricula and almost all R&D scenes

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

40

rDPA (coarse grain) vs. FPGA (fine grain)

roughly: area efficiency

(transistors/chip, orders of magnitude)

hardwired 4

FPGA 2 µProc 0

rDPA 4

roughly: performance (MOPS/mW,

orders of magnitude)

hardwired 3

FPGA 2

µProc 0

rDPA 3

DSP 1

Status: ~1998

commodity commodity

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

41

Mega-rGAs 10 000 000

1 000 000

100 000

10 000

1 000

1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004

planned

Virtex II

XC 40250XV

Virtex

XC 4085XL

100

System gates per rGA chip

Jahr

[Xilinx Data]

200

500

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

42

• FPGA Fabric-based on Virtex-II Architecture

Source: Ivo Bolsens, Xilinx

On Chip Memory

Controller

Power PC Core

Embeded RAM

Rocket IO

entire system on a single chip

all you need on board all you need on board

• Xilinx Virtex-II Pro FPGA Architecture

• PowerPC 405 RISC CPU (PPC405) cores

Page 8: Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

[email protected]

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

43

Mask & NRE cost [ST microelectronics]

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

44

>> coarse-grained Platforms <<

• Machine vs. Anti Machine

• Terminology

• Morphware Platforms

• coarse-grained Platforms

• Dual Machine Paradigms

• Final Remarks http://www.uni-kl.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

45

computing in space

Computing in space and time

data

streams

y 1 0 ( )

y 2 0 ( )

y 3 0 ( )

-

-

-

y 1

y 2

y 3

-

-

- x

1

x 2

x 3

-

- -

computing in time

a 12

a 11 a

21

a 32

a 31

a 23

a 33

a 22

a 13

placement

systolic arrays etc.

and other transformations migration by re-timing

this dichotomy is completely ignored

by our CS curricula

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

46

DPA

x x x

x x x

x x x

|

| |

x x

x

x

x

x

x x

x

- -

-

input data streams

x x

x

x

x

x

x x

x

- -

-

-

-

-

-

-

-

-

-

-

x x x

x x x

x x x

|

|

|

|

|

|

|

|

|

|

|

| output data streams

time

port #

time

time

port # time

port #

Flowware defines: ... which data item at which time at which port

Flowware programs data streams

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

47

Mathematicians X-ing

Mathematicians like to wax rhapsodic about the elegance, beauty and depth of their proofs.

John Horgan: „The End of Science Revisited“

Systolic

Synthesis

Mathematicians like the beauty and elegance of Systolic Arrays. Due to a lack of depth in understanding, their efforts yielded poor synthesis algorithms.

Reiner Hartenstein

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

48

2

Generalized Stream-based Computing System heterogenous Array of rDPUs (reconf. data path units)

Scheduler

Mapper

expression tree

DPU architectures

y

+ *

x

a

1

simultaneous

placement

& routing

3

+

+ +

+

*

*

* sh

* sh

sh sh

xf

xf

-

- data

streams

4

The same mapper for both: Reconfigurable, or hardwired

Kress DPSS [1995]

Configware

Compiler

Page 9: Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

[email protected]

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

49

Supersystolic Array Principles

• take systolic array principles

• replace classical synthesis by simulated annealing

• yields the supersystolic array

• a generalization of the systolic array

• no more restricted to regular data dependencies

• now reconfigurability makes sense: use morphware

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

50

flowware history

time

port #

time

DPA

x x x

x x x

x x x

|

| |

x x

x

x

x

x

x x

x

- -

-

input data streams

x x

x

x

x

x

x x

x

- -

-

-

-

-

-

-

-

-

-

-

x x x

x x x

x x x

|

|

|

|

|

|

|

|

|

|

|

| output data streams

time

port # time

port #

1980: data streams

(Kung, Leiserson) 1995: super systolic

rDPA (Kress) 1996+: SCCC (LANL),

SCORE, ASPRC, Bee (UCB), ...

(tutorials and courses available on all this)

flowware history:

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

51

computing paradigms and methodologies

1946: machine paradigm (von Neumann)

1980: data streams (Kung, Leiserson)

1989: anti machine paradigm

1990: rDPU (Rabaey)

1994: anti machine high level programming language

1995: super systolic array (rDPA)

1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...

1997+: discipline of distributed memory architecture

1997: configware / software partitioning compiler

flow

war

e*

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

52

Super Pipe Networks

pipeline properties array applications

shape resources

mapping scheduling

(data stream formation)

systolic array

regular data

dependencies only

linear only

uniform only

linear projection or algebraic synthesis

super-systolic rDPA

no restrictions simulated

annealing or P&R algorithm

(e.g. force-directed) scheduling algorithm

*

*) KressArray [1995]

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

53

Programming Language Paradigms

language category Von Neumann Languages Anti Machine Languages

both deterministic procedural sequencing: traceable, checkpointable

operation sequence driven by:

read next instruction, goto (instr. addr.),

jump (to instr. addr.), instr. loop, loop nesting

no parallel loops, escapes, instruction stream branching

read next data item, goto (data addr.),

jump (to data addr.), data loop, loop nesting, parallel loops, escapes, data stream branching

state register program counter data counter(s)

address computation

massive memory cycle overhead overhead avoided

Instruction fetch memory cycle overhead overhead avoided

parallel memory bank access interleaving only no restrictions

language features control flow + data manipulation

data streams only (no data manipulation)

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

54

Similar Programming Language Paradigms

language category Computer Languages Xputer Languages

both deterministic procedural sequencing: traceable, checkpointable

sequencingdriven by:

read next instruction, goto (instruction addr.), jump (to instruction addr.), instruction loop, instruction loop nesting no parallel loops, instruction loop escapes, instruction stream branching

read next data object, goto (data addr.), jump (to data addr.), data loop, data loop nesting, parallel data loops, data loop escapes, data stream branching

Page 10: Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

[email protected]

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

55

Machine Paradigms

machine category Computer (the Machine:

“v. Neumann”) The Anti Machine

driven by: Instruction streams data streams (no “dataflow”)

engine principles instruction sequencing sequencing data streams

state register single program counter (multiple) data counter(s)

Communication path set-up .

at run time at load time

resource DPU (e.g. single ALU) DPU or DPA (DPU array) etc. data path

operation sequential parallel pipe network etc.

( “instruction fetch” )

also hardwired implementations* *) e g. Bee project Prof. Broderson

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

56

rDPU not used used for routing only operator and routing port location markerLegend: backbus connect

array size: 10 x 16 = 160 rDPUs

mapping algorithms efficently onto rDPA

rout thru only

not used backbus connect

SNN filter on KressArray

by the way: example of scalability / relocatability by EDA support

also FPGA scalability (avoid routing congestion) by EDA solution

rDPA

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

57

rDPU not used used for routing only operator and routing port location markerLegend: backbus connect

route-thru-only rDPU

3 vert. NNports, 32 bit

http://kressarray.de

Xplorer Plot: SNN Filter Example

+ [13]

2 hor. NNports, 32 bit

operator

result

operand

operand

route thru

backbus connect

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

58

>>> distributed memory

(application-specific)

distributed memory

The new discipline of

[] Herz et al.: proc. IEEE ICECS 2002

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

59

Synthesizable distributed memory architecture...

Memory (data memory)

memory bank

memory bank

memory bank

memory bank

memory bank

...

...

Scheduler

for a Stream-based Soft Machine

rDPA “instructions”

Compiler

Sequencers (data stream

generator)

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

60

http://kressarray.de

Efficient Memory Communication should be directly supported by the Mapper Tools

sequencers

memory ports

application

not used

Legend: Optimized Parallel Memory Controller

Synthesizable Distributed Memory

An example by Nageldinger’s KressArray Xplorer

Page 11: Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

[email protected]

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

61

Coarse Grain Architectures

style project first

publ.

source architecture granularity fabrics mapping intended target application

DP-FPGA 1994 [4] 2-D array 1 & 4 bit multi-granular Inhomog. routing channels switchbox routing regular datapaths

KressArray 1995 [5,11] 2-D mesh family: sel. pathwidth multiple NN & bus segments (co-)compilation (adaptable)

Colt 1996 [12] 2-D array 1 & 16 bit inhomogenous run time reconfiguration highly dynamic reconfig.

Matrix 1996 [15] 2-D mesh 8 bit, multi-granular 8NN, length 4 & global lines multi-length general purpose

RAW 1997 [17] 2-D mesh 8 bit, multi-granular 8NN switched connections switchbox rout experimental

Garp 1997 [16] 2-D mesh 2 bit global & semi-global lines heuristic routing loop acceleration

REMARC 1998 [18] 2-D mesh 16 bit NN & full length buses (info not available) multimedia

MorphoSys 1999 [19] 2-D mesh 16 bit NN, length 2 & 3 global lines manual P&R (not disclosed)

CHESS 1999 [20] hexagon 4 bit, multi-granular 8NN and buses JHDL compilation multimedia

DReAM 2000 [21] 2-D array 8 &16 bit NN, segmented buses co-compilation next generation wireless

CS2000 family 2000 [23] 2-D array 16 & 32 bit inhomogenous array (not disclosed) communication

MECA family 2000 [24] 2-D array multi-granular (not disclosed) (not disclosed) tele- & datacommunication

CALISTO 2000 [25] 2-D array 16 bit multi-granular (not disclosed) (not disclosed) tele- & datacommunication

mesh

FIPSOC 2000 [26] 2-D array 4 bit multi-granular (not disclosed) (not disclosed) tele- & datacommunication

RaPID 1996 [27] 1-D array 16 bit segmented buses channel routing pipelining linear

PipeRench 1998 [29] 1-D array 128 bit (sophisticated) scheduling pipelining

PADDI 1990 [30] crossbar 16 bit central crossbar routing DSP

PADDI-2 1993 [32] crossbar 16 bit multiple crossbar routing DSP and others Cross bar

Pleiades 1997 [33] mesh+crossbar multi-granular multiple segmented crossbar switchbox routing multimedia

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

62

Primarily Mesh-based ….

market project bits granularity source

KressArray variable U. Kaiserslautern

Garp 2 UC Berkeley

CHESS 4 Hewlett Packard

Matrix

RAW8 M.I.T.

Colt 1 & 16 Virginia Tech

DReAM 8 &16 TU Darmstadt

REMARC Stanford

research

MorphoSys UC Irvine

CALISTO Slicon Spice

MECA family

16

Malleable

CS2000 family 16 & 32 Chameleon Systems

FIPSOC 16 & analog SIDSA

commercial

XPP XPU128 32 PACT Corp.

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

63

UC Berkeley (Jan Rabaey)

market project bits granularity source

PADDI

PADDI-2research

Pleiades

16 UC Berkeley

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

64

commercial rDPA example:

PACT XPP - XPU128

XPP128 rDPA

• Evaluation Board available, and • XDS Development Tool with Simulator

buses not

shown

rDPU

CF

G

PAE

core

ALU CtrlALU

CF

GC

FG

PAE

core

CF

GC

FG

PAE

core

PAE

core

ALU CtrlALUALU CtrlALU

CF

GC

FG

CF

GC

FG

• Full 32 or 24 Bit Design working silicon • 2 Configuration Hierarchies

© PACT AG, Munich http://pactcorp.com

rDPA

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

65

XPP64A: Platform Development Board - SDR Board In Debug Phase -> XPP64A Chips from STMicro Fab

- Assembly & Test / Available March 2003

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

66

PACT Corp

• Xtreme Processor Platform (XPP) family of IP cores, high-speed data-stream-capable, scalable, reconfigurable clusters of arrays of 32-bit DPUs with embedded memories, and high-speed I/O ports -

• Application development support software featuring a flow graph-style algorithm mapping language - to minimize training requirements.

• XPP's fabrics, featuring automatic DataFlow synchronization and flagged Event Network to dynamically configure the execution flow,

• Supports dynamic RTR: hierarchical configuration managers free the designer from chip-level details and ensure that configurations are independently loaded in exactly the intended order.

• Automatic event-based task swapping along with data streams: released resources automatically reconfigured immediately

Page 12: Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

[email protected]

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

67

Sequential Processor Model

Conventional processors use the sequential model:

Each operation takes one clock cycle.

Multiple operations are computed consecutively.

Register Operation 1

Time

Operation 2

Operation 3

Operation 4

Operation 5

© 2003, PACT AG

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

68

A New Parallel Processor Paradigm

Multiple computations are configured as code sections onto

a two dimensional array.

Time

x

y

Data Buffer

© 2003, PACT AG

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

69

Parallel Processor Model

Multiple code sections are computed sequentially.

Time

y

x Operation 2

Section 1

Section 2

Section 3

© 2003, PACT AG

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

70

Dataflow Performance

Traditional Microprocessor XPP Architecture

SHIFT

Instruction

Memory and cache

Configuration

Memory and cache

MULT

One operation

per cycle

Many

operations

per cycle FFT

Viterbi

Basic machine operations

performed on

single words

Complex Functions

performed on

data streams

ADD

Register

One word

ALU Buffer

Stream

of words

Array of ALUs

Filter

© 2003, PACT AG

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

71

instruction stream-based Compilation Principles

scheduler

parser

source text

library

link/load instruction call placement

1-D memory space

execution order by location

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

72

Datastream-based Compilation Principles

library

data stream assembly

scheduler

mapper placement & routing

Page 13: Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

[email protected]

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

73

>> Dual Machine Paradigms <<

• Machine vs. Anti Machine

• Terminology

• Morphware Platforms

• coarse-grained Platforms

• Dual Machine Paradigms

• Final Remarks http://www.uni-kl.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

74

3rd machine model became mainstream

1957

1967

1977

1987

1997

2007

computer age (PC age)

accel.

design

µProc.

compile

(Makimtos wave)

mainframe age

main frame

compile

instruction- stream-based

DPA r r µProc.

morphware age

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

75

benefit from RAM-based & 2nd paradigm

RAM-based platform needed for:

• flexibility, programmability • avoiding the need of specific silicon

simple 2nd machine paradigm needed as a common model:

• to avoid the need of circuit expertize • needed to to educate zillions of programmers

what programming source language ?

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

76

McKinsey Curve: dynamics of R&D disciplines

maturity of a discipline

year

fundmental issues

consolidation

saturation: limitations met

new discipline on top of it by ....

... by innovation

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

77

EDA Industry Revolutions

1978

Transistor entry: Applicon, Calma, CV ...

1992

Synthesis: Cadence, Synopsys ... 1985

Schematics entry: Daisy, Mentor, Valid ...

courtesy [Keutzer / Newton]

EDA industry paradigm switching every 7 years

1999 HLLs, (Co-) Compilation

Data-Stream-based DPU arrays

2006 coming closer to programmers‘ mind set

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

78

How to achieve acceptance

No hardware description languages

Courses tailored for students not being hardware-savvy

Tools usable by users not being hardware designers

EDA tools based on term rewriting [Arvind] [Mauricio Ayala]

[Courtesy Richard Newton]

Your name here: your proposals

how to hide the ugliness from the user [Herman Schmit]

Page 14: Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

[email protected]

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

79

configware compiler

anti machine

configware compiler

source „program“

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

80

anti machine

Configware Compilation

data

counter

memory bank

asM

asM

asM

asM

asM

asM

........

(r)DPA configware compiler

source „program“

mapper

Placement & routing

data scheduler

flowware code

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

81

symbiosis of machine models

µProc.

configware compiler

software compiler

anti machine

source „program“

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

82

symbiosis of machine models

DPA r r µProc.

partitioning compiler

source „program“

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

83

Software / Configware Co-Compilation

Analyzer / Profiler

SW code

SW compiler

para d igm “vN" machine

CW Code

CW compiler

anti machine paradigm

Partitioner

Resource Parameters

supporting different platforms

Juergen Becker’s CoDe-X, 1996

High level PL source

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

84

Loop Transformation Examples

loop 1-8 body body endloop

loop 1-8 body endloop

loop 9-16 body endloop

fork

join

strip mining

loop 1-4 trigger endloop

loop 1-2 trigger endloop

loop 1-8 trigger endloop

reconf.array: host: loop 1-16 body endloop

sequential processes: resource parameter driven Co-Compilation

loop unrolling

Page 15: Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

[email protected]

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

85

>> final remarks <<

• Machine vs. Anti Machine

• Terminology

• Morphware Platforms

• coarse-grained Platforms

• Dual Machine Paradigms

• Final Remarks http://www.uni-kl.de

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

86

Jürgen Becker

Dissertation

Jürgen Becker:

• ... Automatically partitioning Co-compiler

• (configware / software co-compilation)

• Resource-parameter-driven retargettable

• Profiler-driven optimization

• Accepts HLL „ALE-X“ (extended C subset)

• (subset: pointers not supported)

Professor at Univ. Karlsruhe

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

87

Rainer Kress

Dissertation

Rainer Kress:

• ... on mapping applications onto his* KessArray

• DPSS datapath synthesis system

• Including a data scheduler

• (data stream scheduler)

• Generalization of the Systolic Array

• (KressArray is a super systolic array)

• 32 bit design via Eurochip support

infineon technologies, Munich

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

88

More Presentations / Literature

• http://hartenstein.de/keynotes.html

• http://hartenstein.de/publications.html

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

89

Antimatter Search ?

Antimatter Search

in EE & CS we do not need to search

© 2004, [email protected] http://hartenstein.de

TU Kaiserslautern

90

END