CAPES / DFG Project: Universidade de Brasília, Universität Kaiserslautern, Universität Karlsruhe
Reiner Hartenstein*
University of Kaiserslautern
November 14, 2003, Brasilia, Brazil
Present and Future of Reconfigurable Systems
*) IEEE fellow
© 2003, [email protected] http://hartenstein.de
University of Kaiserslautern, Xputer Lab
Literature (also downloads)
http://hartenstein.de
Also click „recent talks“. This page also links to available Ph.D. theses:
Becker, Herz, Kress, Nageldinger,
Reconfigurable Computing:
a second programming domain
Migration of programming to the structural domain
The opportunity to introduce the structural domain to programmers ...
The structural domain has become RAM-based
... to bridge the gap by clever abstraction mechanisms using a simple new machine paradigm
IT ages
[Figure: timeline 1957, 1967, 1977, 1987, 1997, 2007: mainframe age, computer age (PC age), morphware age (data streams, flowware); a marker asks "here?"]
von Neumann does not support morphware
>> outline <<
• fine grain reconfigurable
• Placement and routing
• coarse grain reconfigurable
• Flowware
• Datastream-based Computing
• The Anti Machine Paradigm
• Final Remarks
http://www.uni-kl.de
fine grain
• Fine Grain morphware platforms
already mainstream: reconfigurable logic
just logic design on a strange platform ?
speed-up of up to 3 orders of magnitude
NRE and mask cost [Dataquest]; mask set cost [eASIC]
[Figure: cost in million $ (axis 1 to 4) versus feature size (0.8, 0.6, 0.35, 0.25, 0.18, 0.15, 0.13, 0.1, 0.07) and no. of masks (12, 12, 16, 20, 26, 28, 30, >30)]
[Pie chart: PC 25%, communication 22%, consumer 16%, automotive 6%, others 31%]
Top 4 PLD Manufacturers 2000 (total: $3.7 billion): Xilinx 42%, Altera 37%, Lattice 15%, Actel 6%
• [Dataquest] > $7 billion by 2003.
• FPGAs going into every type of application, also SoC
• fastest growing segment of semiconductor market
you don‘t need specific silicon !
rGAs
rGA with island architecture (detail)
[Figure labels: switch, connect]
switch box
• Reconfigurable
[Figure labels: switch box, switch point]
connect box
• Reconfigurable
[Figure labels: connect box, connect point, part of configuration memory]
Connect point (enlarged)
• Reconfigurable
[Illustration: connect point, reconfigurable logic box]
connection activated
The wiring for the function selection of the rLB is not shown.
[Illustration: reconfigurable logic box]
connect point activated
• Routing
• Routing
3 switch points activated; the 4th switch point; the 5th switch point
[Figure labels: switch box, switch point]
Routing continued
• Routing
• Routing
Routing completed for 1 net
[Figure labels: A, B]
Placement and routing software has been known for 25 years.
Such network problems can be handled manually or with the help of graph theory.
1979: Silvar-Lisco (Silicon Valley Research Corp.) offers CALM-P
20 Transistors + 20 Flipflops
Routing: long distance nets
[Figure labels: A, B]
Passing through: long distance wiring from rLBs outside this region
A path can be used only once at a time .....
routing congestion
[Figure labels: A, B, C, D]
C and D are not reachable.
A bridge can be passed only once (bridges of Königsberg)
C cannot be connected with D.
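The congestion effect can be reproduced with a minimal sketch, assuming a plain grid model in which every cell may be used by only one net. The grid size, the coordinates chosen for A, B, C and D, and the function name `route` are hypothetical; the breadth-first (Lee-style) search is a common textbook router, not necessarily the one used by any rGA tool.

```python
from collections import deque

def route(width, height, src, dst, used):
    """Breadth-first search for a shortest path of free cells from src to dst.

    `used` holds cells already occupied by earlier nets; a cell can be
    used only once. Returns the path as a list of cells, or None.
    """
    prev = {src: None}
    queue = deque([src])
    while queue:
        cell = queue.popleft()
        if cell == dst:
            path = []
            while cell is not None:      # walk back to the source
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and nxt not in used and nxt not in prev:
                prev[nxt] = cell
                queue.append(nxt)
    return None                          # congestion: no free path left

# Hypothetical 3 x 3 region: net A-B is routed first and occupies the middle row.
used = set()
net_ab = route(3, 3, (0, 1), (2, 1), used)
used.update(net_ab)
# Net C-D would have to cross the middle row, which is now fully occupied.
net_cd = route(3, 3, (1, 0), (1, 2), used)
```

In this toy case `net_cd` is `None`: once A-B has taken the only crossing, C cannot be connected with D without a detour resource.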
Leonhard Euler
Euler‘s problem of the bridges of Königsberg is such a network problem (1736):
Find a way, which passes each bridge exactly once .....
... also an optimization: none of the bridges remains unused.
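Euler's answer rests on counting how many bridges meet at each land mass. The sketch below checks that parity condition; it assumes the usual seven-bridge layout, and the node names and the function `has_euler_walk` are hypothetical labels chosen for illustration.

```python
from collections import Counter

def has_euler_walk(edges):
    """True if a connected multigraph has a walk using every edge exactly once.

    Euler's condition: the number of nodes with odd degree must be 0 or 2.
    Connectivity is assumed here, not checked.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    odd_nodes = sum(1 for d in degree.values() if d % 2 == 1)
    return odd_nodes in (0, 2)

# Seven bridges between the two banks and the two islands (assumed layout).
KOENIGSBERG = [
    ("Left Bank", "Kneiphof"), ("Left Bank", "Kneiphof"),
    ("Right Bank", "Kneiphof"), ("Right Bank", "Kneiphof"),
    ("Left Bank", "Other Island"), ("Right Bank", "Other Island"),
    ("Kneiphof", "Other Island"),
]
koenigsberg_ok = has_euler_walk(KOENIGSBERG)
```

All four nodes have odd degree (5, 3, 3, 3), so `koenigsberg_ok` is `False`: no way passes each bridge exactly once.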
L. Euler: Solutio Problematis ad Geometriam Situs Pertinentis; Commentarii Academiae Scientiarum Imperialis Petropolitanae 8 (1736), pp. 128-140
[Figure: the bridges drawn as a graph (nodes and edges): Left Bank, Right Bank, Kneiphof Island, Other Island]
Data structures for Graphs
[Figure: a directed graph and an undirected graph on nodes 1 to 4, each shown as Graph, adjacency matrix (rows: from, columns: to) and List]
J. E. Hopcroft, R. E. Tarjan: Efficient algorithms for graph manipulation; Comm. ACM, 1973
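The two data structures named on this slide can be sketched as follows. The edge list is a made-up example on nodes 1 to 4, not the graph drawn on the slide, and the function names are hypothetical.

```python
def adjacency_matrix(n, edges, directed=True):
    """n x n matrix; entry [a-1][b-1] is 1 if there is an edge from a to b."""
    matrix = [[0] * n for _ in range(n)]
    for a, b in edges:
        matrix[a - 1][b - 1] = 1
        if not directed:
            matrix[b - 1][a - 1] = 1     # undirected: matrix is symmetric
    return matrix

def adjacency_list(n, edges, directed=True):
    """Dictionary mapping each node to the list of its successors."""
    adj = {v: [] for v in range(1, n + 1)}
    for a, b in edges:
        adj[a].append(b)
        if not directed:
            adj[b].append(a)
    return adj

EDGES = [(1, 2), (2, 3), (2, 4), (3, 4)]   # hypothetical example graph
directed_matrix = adjacency_matrix(4, EDGES)
directed_list = adjacency_list(4, EDGES)
undirected_list = adjacency_list(4, EDGES, directed=False)
```

The matrix needs n x n entries regardless of the number of edges, while the list grows with the number of edges, which is why sparse netlists are usually kept in list form.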
Large Scale Routing
ENIAC, completed 1945
• Partitioning over racks in the hall
• Partitioning over card cages in the rack
• Partitioning over boards (cards) in card cages
• Partitioning over chips etc. on the card (e. g. SBC)
• Partitioning over blocks on the chip (e. g. microprocessor)
PCBs (printed circuit boards) for 40 years
MULTEC at Böblingen has produced printed circuit boards since 1963
planar „wiring“
no. of pins is limited
Integrated Circuit (Chip)
limited number of pins
„wiring“ on a planar surface
hierarchy
[Figure: hierarchy levels: rack, card cage, card, chip, macro cell, basic cell, more levels; example labels: Kaiserslautern 1, KL2, KL3, KL4, FTI1, FTI2, JWGU, IMS1, IMS2, IMS3, IMS]
wiring hierarchy
• cables in the rack connect the card cages
• card cage wiring connects the cards
• card wiring connects the chips
• on-chip wiring connects the cells (macro cell, cell)
*) 1930s: telephone switching (without chips; crossbar / two-motion selectors instead of cards); 1940s: first computers (without chips)
An obsolete Application Area
before fabrication ?
after fabrication ?
Emulators
[Pictures: Celaro Pro (Mentor), Quickturn, Dini Group, PCI bus extender]
Crossbar
full crossbar: no. of crossbar chips = n x n/2 (crossbar chips in a row: n)
n = 8: 32 chips
n = 100: 5000 chips
partial crossbar: no. of crossbar chips = n (crossbar chips in a row: n)
n = 8: 8 chips
n = 100: 100 chips
[Figure labels: 324 x 4, n=8, 64, 64, 14, 32]
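The chip counts can be checked with a few lines, assuming the slide's table means n x n/2 chips for the full crossbar and n chips for the partial crossbar, which matches the values listed for n = 8 and n = 100. The function names are hypothetical.

```python
def full_crossbar_chips(n):
    """Assumed reading of the slide: n x n/2 crossbar chips."""
    return n * n // 2

def partial_crossbar_chips(n):
    """Assumed reading of the slide: n crossbar chips."""
    return n

# Chip counts for the two array sizes shown on the slide.
table = {n: (full_crossbar_chips(n), partial_crossbar_chips(n)) for n in (8, 100)}
```

Under this reading the full crossbar grows quadratically while the partial crossbar grows linearly, which is the scaling argument the slide makes.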
Routing
• 14 Logic Chips (Lchip) with 128 pins (occasionally for route-through)
• 32 Crossbar Chips (Xchip) with 72 I/O pins (for route-through only)
• each Xchip: 4 pins connected to each Lchip
• 8 Logic cards per card cage
• 8 Ychip cards per card cage
• 8 card cages per rack
• Backplane: 8 Zboard cards per rack
[Figure labels: logic card, card cage, rack]
Crossbar ?
1913 J. N. Reynolds‘ crossbar switch
1915 patent granted
1919 Betulander‘s crossbar switch
1926 first public telephone switching application in Sweden
1964 NASA telemetrics crossbar array
RWC Real World Computing, Japan, 40 TFLOPS
Crossbar weight: 220 tons, 3000 km cable, 5120 processors with 5000 pins each
Routing Congestion
Example: direct connection impossible
[Figure: 2 x 4 array of rGAs with a route-through detour connection]
Routing-only configuration (2 examples)
• Routing
rLB: identity function configured
Graphs, Partitioning, Algorithms
T. Uehara, W. M. van Cleemput: Optimal Layout of CMOS Functional Arrays; IEEE Trans. C-30, pp. 305-312, May 1981
B. Kernighan, S. Lin: An Efficient Heuristic Procedure for Partitioning Graphs; BSTJ 49, 1970
C. Alpert, A. Kahng: Recent Directions in Netlist Partitioning: A Survey; Integration, vol 19 (1-2), pp. 1-81, 1995
T. Cormen, et al.: Introduction to Algorithms; MIT Press / McGraw-Hill, 1991
why emulators are obsolete
[Figure: system gates per rGA chip (logarithmic axis, 100 to 10 000 000) versus year (1984 to 2004) [Xilinx Data]; data points: XC 4085XL, XC 40250XV, Virtex, Virtex II, planned]
why declining ASIC business?
ASICs lost the battle. rGAs are the winners
More and more, the prototyping platform of rGA-based systems will be delivered directly to the customer as the product: fully configured
ASIC emulators have been a transient solution, now with declining commercial significance.
[Figure: number of design starts (0 to 50,000) per year (2001 to 2004), rGA-based] [N. Tredennick, Gilder Technology Report, 2003]
you don‘t need specific silicon !
Xilinx: full hierarchy on chip
from rack to chip
• Xilinx Virtex-II Pro FPGA Architecture
• FPGA Fabric based on Virtex-II Architecture
• PowerPC 405 RISC CPU (PPC405) cores
[Figure labels: On-Chip Memory Controller, PowerPC Core, Embedded RAM, RocketIO]
Source: Ivo Bolsens, Xilinx
focusing on coarse grain
• Fine Grain morphware platforms
already mainstream: reconfigurable logic
just logic design on a strange platform
• Coarse Grain platforms:
Reconfigurable Computing: not that new, but shaking the fundamentals of CS curricula
an order of magnitude more MIPS/mW than fine grain
why coarse grain
[Figure: MOPS / mW (logarithmic axis, 0.001 to 1000) versus feature size in µm (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07); curves: hardwired, rDPAs (reconfigurable computing)*, FPGAs (reconfigurable logic), DSP, standard microprocessor (instruction set processors)]
T. Claasen et al.: ISSCC 1999; *) R. Hartenstein: ISIS 1997
[Figure: flexibility versus throughput: hard-wired, von Neumann, FPGAs, coarse grain]
coarse grain goes far beyond bridging the gap
rDPA (Reconfigurable Datapath Array)
[Figure: array of rDPUs with a separate routing area for the Reconfigurable Interconnect Fabric (RIF)]
[Figure: RIF laid out over the rDPUs: rDPA wired by abutment]
CMOS interconnect resources
Foundries offer up to 9 metal layers and up to 3 poly layers
reconfigurable interconnect fabric laid out over the rDPU cell
Commercial rDPAs
XPU family (IP cores): PACT Corp., Munich
XPU128
mapping algorithms efficiently onto rDPA
SNN filter on KressArray
array size: 10 x 16 = 160 rDPUs
Legend: rDPU not used; used for routing only (route-through only); operator and routing; port location marker; backbus connect
by the way: example of scalability / relocatability by EDA support
„Structured Configware Design“ [R. H.]
• Routing
badly scalable
Hundreds of rGAs or very large rGAs
Routing congestion growing exponentially
Communication Resource Requirements
... often Functional Resources are not the Throughput Bottleneck
In some Application Areas, such as e. g. Wireless Communication, Reconfigurable Computing Arrays need extraordinarily rich and powerful Communication Resources
The Solution: Generators for Domain-specific RA Platforms
KressArray Family generic Fabrics: a few examples
http://kressarray.de
• Select Function Repertory
• Select Nearest Neighbour (NN) Interconnect (an example): select mode, number, width of NN ports
• more NN ports: rich routing resources
• route-through and function; route-through only
• Examples of 2nd Level Interconnect: laid out over the rDPU cell, no separate routing areas !
[Figure labels: 16, 32, 8, 24, 4, 2, rDPU]
Super Pipe Networks
systolic array: applications with regular data dependencies only; pipeline shape: linear only; pipeline resources: uniform only; mapping and scheduling (data stream formation): linear projection or algebraic synthesis
super-systolic RA**: applications, pipeline shape and resources: no restrictions; mapping: simulated annealing or P&R algorithm; scheduling (data stream formation): (e.g. force-directed) scheduling algorithm
The key is mapping, rather than architecture
**) KressArray [ASP-DAC-1995]
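Mapping by simulated annealing can be illustrated with a toy placer. This is a generic sketch under stated assumptions, not the KressArray mapper: operators are placed on a small grid, the cost is the total Manhattan wire length of two-point nets, and the cooling schedule, step count and all names are hypothetical.

```python
import math
import random

def wire_length(placement, nets):
    """Total Manhattan distance over all two-point nets."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)

def anneal(nodes, nets, width, height, steps=2000, seed=0):
    """Place nodes on a width x height grid; returns (best placement, best cost).

    Assumes len(nodes) <= width * height. A move swaps the cells of two nodes.
    """
    rng = random.Random(seed)
    cells = [(x, y) for x in range(width) for y in range(height)]
    rng.shuffle(cells)
    placement = dict(zip(nodes, cells))          # random initial placement
    cost = wire_length(placement, nets)
    best, best_cost = dict(placement), cost
    temperature = 2.0
    for _ in range(steps):
        a, b = rng.sample(nodes, 2)
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wire_length(placement, nets)
        # Accept improvements always, deteriorations with a probability
        # that shrinks as the temperature drops.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temperature):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(placement), cost
        else:
            placement[a], placement[b] = placement[b], placement[a]  # undo swap
        temperature = max(0.01, temperature * 0.995)
    return best, best_cost

# Hypothetical pipeline of four operators on a 2 x 2 array.
NODES = ["op1", "op2", "op3", "op4"]
NETS = [("op1", "op2"), ("op2", "op3"), ("op3", "op4")]
start_placement, start_cost = anneal(NODES, NETS, 2, 2, steps=0)
final_placement, final_cost = anneal(NODES, NETS, 2, 2, steps=2000)
```

Because the cost function is arbitrary, the same loop accepts irregular data dependencies that a linear projection could not handle, which is the point of the comparison above.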
Morphware machines vs. hardwired machines
platform | program source running on it
hardware | (not programmable)
morphware, fine grain: rGA (FPGA) | configware
morphware, coarse grain: rDPU, rDPA | configware
machine | program source running on it
reconfigurable data stream processor | flowware & configware
hardwired data stream processor | flowware
instruction stream processor (v. N.) | software
A clear terminology helps a lot
Flowware defines:
... which data item at which time at which port
[Figure: a DPA with input data streams and output data streams, each drawn as data items over time and port #]
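A flowware description can be thought of as a schedule of (time, port, data item) triples. The sketch below builds such a schedule for skewed input streams, where port p starts p time steps late; the skew is an assumption borrowed from systolic arrays, and the function name and data items are hypothetical.

```python
def skewed_streams(rows):
    """Build a flowware-style schedule of (time, port, item) triples.

    rows[p] is the sequence of data items entering at port p.
    Assumption: port p starts p time steps after port 0.
    """
    schedule = []
    for port, items in enumerate(rows):
        for index, item in enumerate(items):
            schedule.append((port + index, port, item))
    return sorted(schedule)              # ordered by time, then port

schedule = skewed_streams([["a0", "a1"], ["b0", "b1"]])
```

Each triple answers exactly the three questions on the slide: which data item, at which time, at which port.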
Paradigm Shifts: Nick Tredennick‘s view
instruction-stream-based computing:
resources fixed
algorithms variable: programmable by Software (instruction-stream)
data-stream-based reconfigurable computing:
resources variable: programmable by Configware
algorithms variable: programmable by Flowware (data-stream)
why 2 program sources ?
Flowware heading toward mainstream
• Data-stream-based Computing is heading for mainstream
– 1997 SCCC (LANL) Streams-C Configurable Computing
– SCORE (UCB) Stream Computations Organized for Reconfigurable Execution
– ASPRC (UCB) Adapting Software Pipelining for Reconfigurable Computing
– 2000 Bee (UCB), ...
– Most stream-based multimedia systems, etc.
– Many other areas ....
Flowware ..... mostly not yet modelled that way: most flowware is hidden by its indirect instruction-stream-based implementation
Flowware: managing data streams
Software: managing instruction streams
control-procedural vs. data-procedural
The structural domain is primarily data-stream-based:
Flowware provides a (data-)procedural abstraction of the (data-stream-based) structural domain
Flowware converts „procedural vs. structural“ into „control-procedural vs. data-procedural“ ...
... a Trojan horse to introduce the structural domain to the procedural mindset of programmers
Configware / Flowware Compilation
[Figure labels: high level source, wrapper, rDPA intermediate, configware mapper, flowware scheduler, r. DataPath Array, data streams, data sequencer (address generator), memory banks (M), asM distributed memory architecture]
„instruction“ fetch before runtime
extremely high efficiency: flowware-based computing
1. avoiding address computation memory cycle overhead
2. avoiding instruction fetch and interpretation overhead
3. high parallelism, massively multiple deep pipelines
4. much less configuration memory
5. interconnect laid out over the cell: no extra routing areas
6. methodologies readily available
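Point 1 can be illustrated with a software model of a data sequencer. This is a sketch only: in the anti machine the address generator is dedicated hardware running beside the datapath, whereas here a Python generator stands in for it; the scan pattern, parameters and names are hypothetical.

```python
def scan_addresses(base, row_stride, rows, cols, step=1):
    """Model of an address generator: yields the addresses of a 2-D scan.

    The addresses depend only on parameters fixed before run time, so the
    datapath never spends cycles computing them.
    """
    for r in range(rows):
        for c in range(0, cols, step):
            yield base + r * row_stride + c

def data_stream(memory, addresses):
    """Turn an address sequence into the data stream fed to the datapath."""
    for address in addresses:
        yield memory[address]

# Hypothetical memory bank holding a 4-wide image; scan a 2 x 3 window.
memory = list(range(100, 116))
window_addresses = list(scan_addresses(base=0, row_stride=4, rows=2, cols=3))
window_stream = list(data_stream(memory, window_addresses))
```

The consumer of `window_stream` sees only data items in order, which is the flowware view: no address arithmetic and no instruction fetch on the data path.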
Programming Language Paradigms
language category | Software Languages | Languages f. Anti Machine
both: deterministic procedural sequencing: traceable, checkpointable
operation sequence driven by | read next instruction, goto (instr. addr.), jump (to instr. addr.), instr. loop, loop nesting; no parallel loops, escapes, instruction stream branching | read next data item, goto (data addr.), jump (to data addr.), data loop, loop nesting, parallel loops, escapes, data stream branching
state register | program counter | data counter(s)
address computation | massive memory cycle overhead | overhead avoided
instruction fetch | memory cycle overhead | overhead avoided
parallel memory bank access | interleaving only | no restrictions
language features | control flow + data manipulation | data streams only (no data manipulation)
[Callouts: flowware languages: much simpler, much more powerful, very easy to learn; multiple GAGs]
Machine Paradigms
machine category | Computer (the Machine: “v. Neumann”) | The Anti Machine
driven by | instruction streams | data streams (no “dataflow”)
engine principles | instruction sequencing | sequencing data streams
state register | single program counter | (multiple) data counter(s)
communication path set-up (“instruction fetch”) | at run time | at load time
data path resource | DPU (e.g. single ALU) | DPU or DPA (DPU array) etc.
data path operation | sequential | parallel pipe network etc.
*) also hardwired implementations, e. g. Bee project, Prof. Brodersen
computing paradigms and methodologies
1946: machine paradigm (von Neumann)
1980: data streams (Kung, Leiserson)
1989: anti machine paradigm
1990: 1st rDPU* (Rabaey)
1994: anti machine high level programming language
1995: super systolic rDPA (Kress)
1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...
1997+: discipline of distributed memory architecture
1997: 1st configware / software partitioning compiler
[Bracket label over the data-stream entries: flowware]
*) rDPU = reconfigurable Data Path Unit
The Secret of Success: Co-Compilation
[Figure labels: High level PL source; Analyzer / Profiler; Partitioner; Resource Parameters; SW compiler producing SW code (“vN" machine paradigm); CW compiler producing CW Code (anti machine paradigm)]
supporting different platforms
supporting platform-based design
Machine paradigms
von Neumann instruction stream machine: CPU (instruction sequencer + DPU), memory M, I/O; program source: Software (instruction stream)
data-stream machine (anti machine): DPU or rDPU, or (r)DPA, with data address generators (data sequencers), memory M (asM**, distributed memory architecture*), I/O; program sources: Flowware (data stream) and Configware (reconf.)
*) the new discipline came just in time: see Herz et al.: Proc. IEEE ICECS, 2002
also see books by Francky Catthoor et al.
Synthesizable distributed memory architecture ...
... for a Stream-based Soft Machine
[Figure labels: Compiler, Scheduler, Sequencers (data stream generator), Memory (data memory) with memory banks, rDPA, “instructions”]
PC replaced by PS (personal supercomputer)
[Figure: timeline 1957, 1967, 1977, 1987, 1997, 2007: mainframe age, computer age (PC age), morphware age (data streams, flowware); co-compiler for µProc (von Neumann) and rDPA (anti machine)]
all methodologies available
[Figure: timeline 1957, 1967, 1977, 1987, 1997, 2007: morphware age (data streams, flowware); co-compiler for µProc and rDPA]
free know-how for personal super computer
.... and all other methodologies available from literature
We have an education problem
... we need a second machine paradigm
The typical programmer has problems understanding function evaluation without machine mechanisms ....
Traditional CS: programming is (control-)procedural, instruction-stream-based; sources: software
It‘s the gap between procedural and structural mindset
[Figure labels: µprocessor, accelerators]
Crossing the Hardware / Software Chasm [Mike Butts]
Ubiquitous Embedded Systems
... and the main focus in system design
embedded software and configware have become the main vehicle for product differentiation ...
(Performance and) Flexibility are key issues
current CS curricula do not qualify our students
misqualified: jobless CS graduates ?
[Figure: growth factor (1 to 2) versus months (0, 10, 12, 18): embedded software [DTI* law] compared with [Moore’s law] (1.4/year)]
*) Department of Trade and Industry, London
90% of all code is written for embedded systems
The real labor market: 10 times more programmers will write embedded applications than computer software by 2010
EDA Industry Revolution every 7 years
EDA industry paradigm switching every 7 years [Keutzer / Newton] (McKinsey Curves):
1978 Transistor entry: Applicon, Calma, CV ...
1985 Schematics entry: Daisy, Mentor, Valid ...
1992 Synthesis (HDLs): Cadence, Synopsys ...
1999
EDA the main bottleneck
[Figure courtesy of Richard Newton]
math formula ? TRS ?
Biggest Mistake of EDA
guess it !
The next EDA Industry Revolution
EDA industry paradigm switching every 7 years [Keutzer / Newton] (McKinsey Curves):
1978 Transistor entry: Applicon, Calma, CV ...
1985 Schematics entry: Daisy, Mentor, Valid ...
1992 Synthesis (HDLs): Cadence, Synopsys ...
1999 (Co-) Compilation: data-stream-based DPAs
Von Neumann does not support Morphware
higher abstraction level: System-C; math formula: TRS*
*) Term Rewriting Systems
Algorithmic cleverness needed
Example, migration from signal processor to rGA: very high throughput on slow low-power FPGAs is obtained only by algorithmic cleverness:
We need an all-embracing taxonomy of algorithms and a survey of algorithm transformations ....
loop transformations ....
optimization, partitioning, signal processing, (de-)coding algorithms (wireless communication), image processing, sorting, .... and many more areas .....
algorithmic cleverness needed for CS graduates in embedded systems
the hardware / configware / software partitioning problem: current CS curricula do not qualify our students
software / configware migration: current CS curricula do not qualify our students
extending software engineering into software / flowware engineering: the anti machine paradigm and reconfigurable computing are the curricular enablers
thank you
Appendix for discussion
Processor Memory Performance Gap
[Figure: performance (logarithmic axis, 1 to 1000) versus year (1980 to 2000): CPU (µProc) 60%/yr, DRAM 7%/yr]
Processor-Memory Performance Gap: grows 50% / year
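The 50% figure follows from the two growth rates on the slide: the gap is the ratio of the two curves, so it grows by the ratio of the yearly factors. A short check, with a hypothetical function name:

```python
def gap_growth(cpu_rate=0.60, dram_rate=0.07):
    """Yearly growth of the CPU/DRAM performance ratio.

    If CPU performance grows by cpu_rate and DRAM by dram_rate per year,
    their ratio grows by (1 + cpu_rate) / (1 + dram_rate) - 1 per year.
    """
    return (1 + cpu_rate) / (1 + dram_rate) - 1

yearly_gap_growth = gap_growth()          # 1.60 / 1.07 - 1, about 0.495
gap_after_10_years = (1 + yearly_gap_growth) ** 10
```

With 60% and 7% the ratio grows by roughly 49.5% per year, which the slide rounds to 50%.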
Why a dichotomy of machine paradigms?
vN: unbalanced (CPU, caches, ..., vN bottleneck) [figure stolen from Bob Colwell]
data stream machine:
• bad news: caches do not help
• good news: no vN bottleneck
• caches not needed
The anti machine has no von Neumann bottleneck
„Pollack‘s Law“ (simplified) [intel]
[Figure: growth factor of performance and of area efficiency versus feature size in µm]
Loop Transformation Examples
sequential processes: resource parameter driven Co-Compilation
original: loop 1-16 body endloop
loop unrolling: loop 1-8 body body endloop
strip mining: fork; loop 1-8 body endloop; loop 9-16 body endloop; join
[Figure labels: host:, reconf. array:, loop 1-8 trigger endloop, loop 1-4 trigger endloop, loop 1-2 trigger endloop]
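The two transformations named on this slide can be written out as plain functions. This is a sequential sketch: the fork/join of strip mining is only indicated by the outer loop over strips, and all names are hypothetical.

```python
def original(body, n=16):
    """loop 1-16: body: endloop"""
    for i in range(1, n + 1):
        body(i)

def unrolled(body, n=16):
    """Loop unrolling by 2: half as many iterations, two bodies each."""
    for i in range(1, n + 1, 2):
        body(i)
        body(i + 1)

def strip_mined(body, n=16, strips=2):
    """Strip mining: split the index range into strips (1-8 and 9-16).

    Each strip could be forked onto its own resource and joined afterwards;
    here the strips simply run one after the other.
    """
    size = n // strips
    for s in range(strips):
        for i in range(s * size + 1, (s + 1) * size + 1):
            body(i)

def trace(transformation):
    """Record the order in which loop indices reach the body."""
    seen = []
    transformation(seen.append)
    return seen

traces = [trace(original), trace(unrolled), trace(strip_mined)]
```

All three variants visit the indices 1 to 16 in the same order; what changes is how the work is grouped, and which grouping is chosen depends on the available resources.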
The Design Crisis
The long turnaround times of ASIC fabrication are becoming increasingly unaffordable
[Figure: design cost and product life cycle versus year]
Rising demand: fast patches and upgrades, preferably at the customer's site, promoting the longevity of the product
Summary of the Anti Machine Paradigm
• anti language primitives are almost the same (slightly extended)
• anti machine execution potential is dramatically more powerful
• provides drastically more flexibility
• not always replacing von Neumann
Reconfigurable Computing:
a second programming domain
Migration of programming to the structural domain
Currently running: the next fundamental revolution after introduction of the microprocessor
The structural domain has become RAM-based
However, CS curricula ignore this impact of Reconfigurable Computing – key issue in embedded systems ...
... causing the coming disaster by unqualified CS graduates pushing up the unemployment rate ?
All enabling technologies are available
•anti machine and all its architectural resources
•parallel memory IP cores and generators
•anything else needed
•languages & (co-)compilation techniques
•morphware vendors like PACT ....
•literature from last 30 years
New horizons
• A new RAM-based platform going mainstream
• Configware industry
• New machine paradigm
• New theory needed
• New architectures, without v. N. bottleneck
• New compilation techniques
• More effective parallelism provided
• Rich material is already available in many areas
• Lots of similarities with the classical v.N. world
• But a few asymmetries: a challenge
evangelist‘s material + lobby space
Evangelist‘s material:
• http://hartenstein.de, click „recent talks“
Lobby space:
• http://morphware.net
• http://configware.org
• http://data-streams.org
• http://flowware.net
Trailblazer group:
• you are welcome to improve, rewrite, post links ...
• You are welcome to join the trailblazer group
The genius of von Neumann
• enormous impact of the von Neumann paradigm
• even stronger impact by a dichotomy of paradigms:
• von Neumann of matter
• von Neumann of anti matter
• von Neumann machine vs. anti machine
• does not mean toppling v. N.‘s monument
• it multiplies the glory of von Neumann
MPU performance stalled
Moore’s law will stall soon for MPUs
Bill Gates’ law: relative computation time needed doubles every 2 years
had been compensated by Moore’s law
Basics of Binding Time
time of “Instruction Fetch”
[Figure: binding time axis (run time, loading time, compile time) with the positions of microprocessor, parallel computer, and Reconfigurable Computing]
Time to Market
• Morphware brings a new dimension to digital system development and has a strong impact on SoC design.
• Flexibility supports turn-around times of minutes instead of months for real time in-system debugging, profiling, verification, tuning, field-maintenance, and field upgrades
• A New Business Model (in-field debugging and upgrading ... )
• A Fundamental Paradigm Shift in Silicon Application
[Figure: revenue / month versus time / months (1, 10, 20, 30): ASIC product compared with reconfigurable product with download (Product, Update 1, Update 2)] [Tom Kean]
KressArray principles
• take systolic array principles
• replace classical synthesis by simulated annealing
• yields the super systolic array
• a generalization of the systolic array
• no longer restricted to regular data dependencies
• now reconfigurability makes sense
Significance of Address Generators
• Address generators have the potential to reduce computation time significantly.
• In a grid-based design rule check a speed-up of more than 2000 has been achieved, compared to a VAX-11/750
• Dedicated address generators contributed a factor of 10 - avoiding memory cycles for address computation overhead
Acceleration Mechanisms
• parallelism by multi bank memory architecture
• auxiliary hardware for address calculation
• address calculation before run time
• avoiding multiple accesses to the same data
• avoiding memory cycles for address computation
• improve parallelism by storage scheme transformations
• improve parallelism by memory architecture transformations
• alleviate interconnect overhead (delay, power and area)
Why coarse grain ?
[Figure: transistors/chip and normalized processor speed (logarithmic axis, 1 to 100 000 000) versus year (1960 to 2010): algorithmic complexity (Shannon’s Law) of wireless generations 1G, 2G, 3G, 4G; microprocessor / DSP; memory; battery performance; computational efficiency in mA/MIP (0.001 to 100) with StrongARM and SH7752]
Sources: Proc ISSCC, ICSPAT, DAC, DSPWorld