© 2006, [email protected]
http://hartenstein.de
Reconfigurable Computing
Reiner Hartenstein
Computing Meeting EU, ESU, Brussels, May 18, 2006
The Pervasiveness of RC
[Figure: two charts of Google hit counts (# of hits by Google) for “FPGA and ….” queries. One chart shows 162,000; 127,000; 158,000; 113,000; 171,000; 194,000 hits, the other 1,620,000; 915,000; 398,000; 272,000; 647,000; 1,490,000 hits.]
ECE-savvy scene (mainstream for many years)
Math/SW-savvy scene (more recently: 2-3 years)
and many more areas
The dominance of Configware
Most compute power is coming from Configware
More MIPS migrated to Configware than running as Software
Reconfigurable Supercomputing (VHPC) going commercial
Cray XD1
silicon graphics RASC
… and other vendors
>> Outline <<
•Reconfigurable Computing Paradox
•The Supercomputing Paradox
•We are using the wrong model
•Coarse-grained Reconfigurable Devices
•Super Pentium for Desktop Supercomputer
http://www.uni-kl.de
The Reconfigurable Computing Paradox
poor FPGA technology: area-inefficient, slow, power-hungry, expensive
poor tools: tools and languages unacceptable to most users
even most hardware experts (86%**) hate their tools
poor education: RC education is extremely poor, if offered at all
- ignored by CS curricula
- CS is taught as if for a 50-year-old mainframe …
**) DeHon ‘98
FPGA integration density
the effective integration density of plain FPGAs is behind Moore’s law by more than 4 orders of magnitude
However, brilliant results everywhere
What paradox?
Speed-up factors published
[Chart: published FPGA speed-up factors over the years 1980 to 2010, on a logarithmic axis of relative performance from 10^0 to 10^9. Reference curves: microprocessor (8080 to Pentium 4), memory, FPGA. Growth-rate annotations: x2/yr, 50%/yr, 7%/yr, x1.25/yr (Moore). Speed-up bands are marked >1 OoM, >2 OoM, >3 OoM, <4 OoM.]
http://xputers.informatik.uni-kl.de/faq-pages/fqa.html
Application areas named in the chart: DSP and wireless; image processing, pattern matching, multimedia; bioinformatics; astrophysics; crypto
Speed-up factors shown:
Reed-Solomon decoding: 2400
Viterbi decoding: 400
FFT: 100
MAC: 1000
crypto: 1000
real-time face detection: 6000
video-rate stereo vision: 900
pattern recognition: 730
SPIHT wavelet-based image compression: 457
Smith-Waterman pattern matching: 288
BLAST: 52
protein identification: 40
molecular dynamics simulation: 88
GRAPE: 20
Los Alamos traffic simulation: 47
2-D FIR filter [TU-KL]: 39.4
Lee Routing (by TU-KL): 160
Pre-FPGA era (no FPGA: DPLA on MoM Xputer architecture, by TU-KL):
Grid-based DRC: 2000
Grid-based DRC („fair comparison“): 15000
platform FPGAs: better area efficiency
DSP platform FPGA [courtesy Xilinx Corp.]:
500 MHz Flexible Soft Logic Architecture
200K Logic Cells
500 MHz Programmable DSP Execution Units
0.6-11.1 Gbps Serial Transceivers
500 MHz PowerPC™ Processors (680 DMIPS) with Auxiliary Processor Unit
1 Gbps Differential I/O
500 MHz multi-port Distributed 10 Mb SRAM
500 MHz DCM Digital Clock Management
DeHon‘s 1st Law (1996) was for plain FPGAs
Pre-FPGA era: why DPLA* was so good
Large arrays of canonical Boolean expressions
Classical PLA layout is highly area-efficient: close to Moore’s law
Mid-’80s: at first only very tiny FPGAs were available: 1 DPLA replaced 256 of them
ASM: Auto-Sequencing Memory
GAG: Generic Address Generator**, a generalization of the DMA**, to avoid address computation overhead
Reducing memory cycles is the key issue
Speed-up factor of 20 by
*) fabricated 1984 by E.I.S. multi university project
**) for a survey by IMEC & TU-KL see: [M. Herz et al.: ICECS 2003, Dubrovnik]
taxonomy of algorithms, better tools and better education
[Chart: the published FPGA speed-up factors by application area, 1980 to 2010, repeated from the earlier "speed-up factors published" slide, here annotated with two open questions: "even higher speed-up?" and "consolidation?"]
New dimensions of low power
Application migration [from supercomputer] results not only in massive speed-ups:
electricity bills are reduced by an order of magnitude, and even more you may get for free …. up to millions of dollars per year
(also a matter of national energy policy)
[Image labels: Google, Amsterdam, NY]
„Saves more than $10,000 in electricity bills per year (7¢ / kWh) - .... per 64-processor 19" rack“ [Herb Riley, R. Associates]
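As a rough plausibility check of the quoted figure, the sketch below converts the stated saving ($10,000 per year at 7¢ / kWh) into energy and average power per rack. The function name and the assumption of continuous operation (8760 hours per year) are illustrative and are not taken from the slide.

```python
# Assumption (not from the slide): the rack runs around the clock all year.
HOURS_PER_YEAR = 24 * 365


def saved_energy_kwh(saved_dollars_per_year, dollars_per_kwh):
    """Energy that corresponds to a given yearly saving at a given tariff."""
    return saved_dollars_per_year / dollars_per_kwh


kwh = saved_energy_kwh(10_000, 0.07)   # figures quoted on the slide
avg_kw = kwh / HOURS_PER_YEAR          # average power if spread over the year

print(f"{kwh:.0f} kWh per year, about {avg_kw:.1f} kW on average per rack")
```

Under these assumptions the quoted saving corresponds to roughly 143,000 kWh per year, i.e. an average power reduction of about 16 kW per rack.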
>> Outline <<
•Reconfigurable Computing Paradox
•The Supercomputing Paradox
•We are using the wrong model
•Coarse-grained Reconfigurable Devices
•Super Pentium for Desktop Supercomputer
http://www.uni-kl.de
The Supercomputing Paradox
Growing listed Teraflops
Increasing number of processors running in parallel
COTS processor decreasing cost
promising technology
HPC by classic supercomputing methodology
Extreme shortage of affordable capacity
Lack of scalability: progress only by innovation
More parallelism absorbs programmer productivity
Program ready: hardware obsolete
The law of More
Not for high performance embedded computing
poor results
>> Outline <<
•Reconfigurable Computing Paradox
•The Supercomputing Paradox
•We are using the wrong model
•Coarse-grained Reconfigurable Devices
•Super Pentium for Desktop Supercomputer
http://www.uni-kl.de
Why traditional supercomputing / HPC failed
instruction-stream-based: memory-cycle-hungry
the wrong way, how the data are moved around
because of the wrong multi-core interconnect architecture
[Figure: CPU, extremely unbalanced (stolen from Bob Colwell)]
Moving data around inside the Earth Simulator
Crossbar weight: 220 t, 3000 km of thick cable
discarding the wrong road map
with a paradigm shift the same performance is feasible
on a single 19” rack
Bringing together data and processor
Moving data to the processor by Software:
moving the grand piano
Key issues in very High Performance Computing (vHPC)
reducing memory cycles is the key issue
this needs a paradigm shift
away from the dominance of instruction streams
Here is the common model
instruction-stream-based: CPU, software code
data-stream-based: accelerator reconfigurable, configware code
accelerator hardwired
it’s not von Neumann
Von Neumann: the tail is wagging the dog
the vN monopoly in our curricula is severely harmful
we need dual paradigm education
very high performance & electricity bill issues
legacy issues
symbiotic
The wrong basic mind set
our IT expert labor force lacks the right basic mind set
we need a dual paradigm approach
this is a severe educational challenge
For high school and undergraduate education
we need an archetype simple common model
instead of a wide variety of sophisticated architectures
this is a severe educational challenge
>> Outline <<
•Reconfigurable Computing Paradox
•The Supercomputing Paradox
•We are using the wrong model
•Coarse-grained Reconfigurable Devices
•Super Pentium for Desktop Supercomputer
http://www.uni-kl.de
integration density
the effective integration density of plain FPGAs is behind Moore’s law by more than 4 orders of magnitude
the effective integration density of rDPAs* may come close to Moore’s law
*) reconfigurable DataPath Arrays (coarse-grained reconfigurability)
Coarse grain is about computing, not logic
[Figure: SNN filter on KressArray (mainly a pipe network), no CPU; array size: 10 x 16 = 160 rDPUs. Legend: rDPU not used; rDPU used for routing only (route-through only); operator and routing port location marker; backbus connect.] [Ulrich Nageldinger]
rDPU: reconfigurable Data Path Unit, e. g. 32 bits wide
SW 2 coarse-grained CW migration example
[Figure: array of rDPUs with an adder (+) producing the result S]
Compare it to software solution on CPU
S = R + (if C then A else B endif); C = 1
simple conservative CPU example:
step                              memory cycles   nanoseconds
if C then read A
  read instruction                      1             100
  instruction decoding
  read operand*                         1             100
  operate & reg. transfers
if not C then read B
  read instruction                      1             100
  instruction decoding
add & store
  read instruction                      1             100
  instruction decoding
  operate & reg. transfers
  store result                          1             100
total                                   5             500
hypothetical branching example to illustrate software-to-configware migration
S = R + (if C then A else B endif); C = 1
simple conservative CPU example: 5 memory cycles, 500 nanoseconds in total
*) if no intermediate storage in register file
[Figure: configware datapath with inputs A, B, R, C and an adder (+) producing S; clock 200 MHz (5 nanosec); C = 1]
no memory cycles: speed-up factor = 100
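The arithmetic behind the speed-up factor can be reproduced in a few lines. This is only a sketch of the slide's hypothetical example: the constants are the slide's figures, while the function `mux_add` and the assumption that the datapath delivers one result per clock period are illustrative.

```python
def mux_add(r, a, b, c):
    """Datapath view of S = R + (if C then A else B): a multiplexer feeding an adder."""
    return r + (a if c else b)


MEMORY_CYCLE_NS = 100    # slide: each memory cycle costs 100 ns
CPU_MEMORY_CYCLES = 5    # slide: 3 instruction reads, 1 operand read, 1 store (C = 1)
CLOCK_MHZ = 200          # slide: datapath clocked at 200 MHz

cpu_ns = CPU_MEMORY_CYCLES * MEMORY_CYCLE_NS   # time of the CPU solution
datapath_ns = 1000 / CLOCK_MHZ                 # one clock period in ns (assumed: one result per clock)
speedup = cpu_ns / datapath_ns

print(mux_add(r=10, a=1, b=2, c=1), cpu_ns, datapath_ns, speedup)
```

The factor of 100 follows directly from 500 ns of memory cycles on the CPU versus one 5 ns clock period in the datapath, which needs no memory cycles.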
Why the speed-up? What's the difference?
moving the locality of operation into the route of the data stream by P&R
instead of moving data by instruction streams
Bringing together data and processor
Move the stool by Configware
Place the location of execution into the data pipe
Data-stream-based
execution should be transport-triggered instead of instruction-triggered
transport should be done within compiled pipelines, not by move engines*
*) which are instruction-stream-based!
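As a software analogy only, the following sketch chains Python generators into a small pipe: each stage produces a result whenever one data item is available on each of its input ports, and no instruction stream selects the operation at run time. The stage names and sample values are hypothetical, and Python generators are pull-driven, so this illustrates the idea rather than modeling real transport-triggered hardware.

```python
def source(values):
    """An input data stream: delivers one item per step."""
    for v in values:
        yield v


def mux_add_stage(r_stream, a_stream, b_stream, c_stream):
    """Fires whenever one item has arrived on each of the four ports."""
    for r, a, b, c in zip(r_stream, a_stream, b_stream, c_stream):
        yield r + (a if c else b)


def scale_stage(stream, factor):
    """A second operator placed further down the same pipe."""
    for x in stream:
        yield x * factor


# The "configuration": stages are wired together once, then data streams through.
pipe = scale_stage(
    mux_add_stage(
        source([1, 2, 3]),        # R
        source([10, 20, 30]),     # A
        source([100, 200, 300]),  # B
        source([1, 0, 1]),        # C
    ),
    factor=2,
)
results = list(pipe)
print(results)
```

The wiring of the stages plays the role of the compiled pipeline; changing the application means rewiring the pipe, not fetching different instructions.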
For high school and undergraduate education
we should send CTOs and professors back to school
this is a severe educational challenge
The wrong model
[Figure: SNN filter on KressArray (mainly a pipe network), no CPU; array size: 10 x 16 = 160 rDPUs. Legend: rDPU not used; rDPU used for routing only (route-through only); operator and routing port location marker; backbus connect.] [Ulrich Nageldinger]
rDPU: reconfigurable Data Path Unit, e. g. 32 bits wide
upon this schematic … question by a Japanese Corporate vVIP
The wrong mind set ....
„but you can‘t implement decisions!“ (Question by a Japanese Corporate vVIP: [RAW’99])
[Figure: configware datapath with inputs A, B, R, C and an adder (+) producing S; clock 200 MHz (5 nanosec); C = 1]
not knowing this solution: symptom of the hardware / software chasm
and the configware / software chasm
We need Reconfigurable Computing Education
>> Outline <<
• Reconfigurable Computing Paradox
• The Supercomputing Paradox
• We are using the wrong model
• Coarse-grained Reconfigurable Devices
• Super Pentium for Desktop Supercomputer
http://www.uni-kl.de
Some goals
Universal HPC co-architecture for:
embedded vHPC (nomadic, automotive, ...)
desktop vHPC (scientific computing ...)
Application co-development environment for:
hardware non-experts, ....
acceptability by software-type users, ...
Meet product lifetime >> embedded syst. life:
FPGA emulation logistics from development down to maintenance and repair stations
examples: automotive, aerospace, industrial, ..
Architecture: A potential Pentium successor
Discard most caches
have 64* cores, 0.5 - 1 GHz
with clever interconnect for:
▪ concurrent processes and
▪ for multithreading,
▪ and, for Kung-Kress pipe network
The Desk-top Supercomputer!
*) CPU mode / DPU mode capability
“Super Pentium” configuration example
[Figure: array of cores, most of them configured as rDPUs and some as CPUs]
twin paradigm machine
World TV & game console & multi media center
e. g.: ~ 8 x 8 rDPA: all feasible under 500 MHz
[Figure: an rDPA (SMeXPP) connected to games, music, videos, camera, baseband processor, radio interface, audio interface, SD/MMC cards, and LCD display]
• Variable resolutions and refresh rates
• Variable scan mode characteristics
• Noise Reduction and Artifact Removal
• High performance requirements
• Variable file encoding formats
• Variable content security formats
• Variable Displays
• Luminance processing
• Detail enhancement
• Color processing
• Sharpness Enhancement
• Shadow Enhancement
• Differentiation
• Programmable de-interlacing heuristics
• Frame rate detection and conversion
• Motion detection & estimation & compensation
• Different standards (MPEG2/4, H.264)
• A single device handles all modes
http://pactcorp.com
feasible under 500 MHz
means low electricity cost and allows very high integration density
Dual Paradigm Application Development Support
instruction-stream-based: CPU, software code
data-stream-based: accelerator reconfigurable, configware code
accelerator hardwired
software/configware co-compiler
high level language
placement & routing in the compiler optimizes interconnect bandwidth by preferring nearest neighbor connect
Software / Configware Co-Compilation
Juergen Becker’s CoDe-X, 1996
[Figure: C language source enters a Partitioner, which feeds a SW compiler targeting the CPU and a CW compiler targeting an array of rDPUs.]
CW compiler: Placement & Routing (Move the Locality of Operation)
Resource Parameters: supporting different platforms
Software / Configware very high level Synthesis
instruction-stream-based: CPU, software code
data-stream-based: accelerator reconfigurable, configware code
accelerator hardwired
term-rewriting-based vhl synthesis system
Math formula .... [Arvind, or, Mauricio Ayala]
>> Conclusions <<
•Reconfigurable Computing Paradox
•The Supercomputing Paradox
•We are using the wrong model
•Coarse-grained Reconfigurable Devices
•Super Pentium for Desktop Supercomputer
•Conclusions
http://www.uni-kl.de
Objectives
for every area which needs:
cheap, compact vHPC
flexibility (for accelerators)
avoiding specific silicon
rapid prototyping, field-patching, emulation
Conclusion (1)
Reconfigurable Computing opens many spectacular new horizons:
Cheap vHPC without needing specific silicon, no mask ....
Cheap embedded vHPC
Cheap desktop supercomputer (a new market)
Massive reduction of the electricity bill: locally and nationally
Fast and cheap prototyping
Replacing expensive hardwired accelerators
Supporting fault tolerance, self-repair and self-organization
Flexibility for systems with unstable multiple standards by dynamic reconfigurability
Emulation logistics for very long term spare-part provision and part type count reduction (automotive, aerospace …)
Conclusion (2)
Needed:
Universal vHPC co-architecture demonstrator
For widely spreading its use successfully: select killer applications for demo
The compilation tool problem to be solved
Language selection problem to be solved
Education backlog problems to be solved
Use this to develop a very good high school and undergraduate lab course
A motivator: preparing for the top 500 contest
Compilation: Software vs. Configware
Software Engineering: source program (C, FORTRAN, MATLAB) -> software compiler -> software code
Configware Engineering: source „program“ -> configware compiler (mapper: placement & routing; scheduler) -> configware code and flowware code (data)
Nick Tredennick’s Paradigm Shifts explain the differences
Software Engineering (CPU):
resources: fixed
algorithm: variable (software)
1 programming source needed
Configware Engineering:
resources: variable (configware)
algorithm: variable (flowware)
2 programming sources needed
Co-Compilation
[Figure: Software / Configware Co-Compiler. Source: C, FORTRAN, MATLAB. An automatic SW / CW partitioner feeds a software compiler, which produces software code, and a configware compiler (mapper, scheduler), which produces configware code and flowware code (data). Several stages are annotated "simulated annealing".]
Co-Compiler for Hardwired Kress/Kung Machine [e. g. Brodersen]
[Figure: Software / Flowware Co-Compiler. From the source, an automatic SW / CW partitioner feeds a software compiler, which produces software code, and a flowware compiler (scheduler), which produces flowware code (data).]
The first archetype machine model
“von Neumann”: mainframe, CPU
simple basic Machine Paradigm
instruction-stream-based mind set
procedural personalization: compile or assemble
personalization: RAM-based
Software Industry’s Secret of Success
The 2nd archetype machine model
“Kress-Kung”: accelerator reconfigurable
simple basic Machine Paradigm
data-stream-based mind set
structural personalization: compile
personalization: RAM-based
Configware Industry’s Secret of Success
Co-Compiler Enabling Technology
is available from academia
only a small team needed for commercial re-implementation
on the road map to the Personal Supercomputer
Data streams
[Figure: a DPA (pipe network) with input data streams entering and output data streams leaving through its ports, each drawn as a time vs. port # diagram; the DPA is surrounded by ASM blocks.]
„data streams“: define ... which data item at which time at which port
H. T. Kung paradigm (systolic array)
implemented by distributed memory
ASM: Auto-Sequencing Memory (data counter, GAG, RAM)
50 & more on-chip ASM are feasible
The Generalization of the Systolic Array
(classical systolic array:) only for applications with regular data dependencies
remedy?
[R. Kress]: discard algebraic synthesis methods, use optimization algorithms, e. g.: simulated annealing
Achievement: also non-linear and non-uniform pipes, and even wilder pipe structures possible
reconfigurability makes sense
Kress-Kung paradigm: super systolic array
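To give an impression of placement by simulated annealing, here is a generic sketch that places a handful of operators on a small grid and minimizes total Manhattan wire length, which favors nearest-neighbor connections. It is not R. Kress's actual mapper; the cost function, the cooling schedule, and all names (`anneal_placement`, `wire_length`, the operator names) are assumptions made for illustration.

```python
import math
import random


def wire_length(positions, nets):
    """Total Manhattan distance over all connected operator pairs."""
    return sum(abs(positions[u][0] - positions[v][0]) +
               abs(positions[u][1] - positions[v][1]) for u, v in nets)


def anneal_placement(ops, nets, rows, cols, steps=5000, t0=5.0, seed=0):
    """Place ops on a rows x cols grid (needs len(ops) <= rows * cols)."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    slots = list(ops) + [None] * (len(cells) - len(ops))  # None = unused cell
    rng.shuffle(slots)

    def positions():
        return {op: cells[i] for i, op in enumerate(slots) if op is not None}

    cost = initial_cost = wire_length(positions(), nets)
    best, best_cost = positions(), cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9      # linear cooling (assumed)
        i, j = rng.randrange(len(slots)), rng.randrange(len(slots))
        slots[i], slots[j] = slots[j], slots[i]      # propose: swap two cells
        new_cost = wire_length(positions(), nets)
        accept = new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp)
        if accept:
            cost = new_cost
            if cost < best_cost:
                best, best_cost = positions(), cost
        else:
            slots[i], slots[j] = slots[j], slots[i]  # reject: undo the swap
    return best, best_cost, initial_cost


# Hypothetical data path for S = R + (if C then A else B)
ops = ["A", "B", "R", "mux", "add", "S"]
nets = [("A", "mux"), ("B", "mux"), ("mux", "add"), ("R", "add"), ("add", "S")]
best, best_cost, initial_cost = anneal_placement(ops, nets, rows=3, cols=3)
print(initial_cost, "->", best_cost, best)
```

Because the search accepts occasional uphill moves early on, it is not tied to regular data dependencies, which is the point made on the slide; the best placement seen so far is kept, so the returned cost never exceeds the initial one.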
(Kress-Kung machine paradigm) drastically reducing memory cycles
Data Counter instead of Program Counter
Generalization of the DMA
ASM: Auto-Sequencing Memory (data counter, GAG, RAM)
GAG & enabling technology: multiple publications 1989 …
Survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik]
Storage Scheme optimization methodology, etc.*
*) IMEC, Leuven & TU-KL
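A software sketch may help to picture what a data counter does: addresses for a regular scan pattern are generated by a small dedicated sequencer, so no instructions are spent on address computation. The real GAG is a hardware unit; the parameter names (`base`, `x_count`, `y_count`, `x_stride`, `y_stride`) and the two function names below are hypothetical.

```python
def generic_address_generator(base, x_count, y_count, x_stride=1, y_stride=None):
    """Yield the address sequence of a 2-D scan window, row by row."""
    if y_stride is None:
        y_stride = x_count * x_stride   # default: rows stored back to back
    for y in range(y_count):
        for x in range(x_count):
            yield base + y * y_stride + x * x_stride


def auto_sequencing_read(ram, addresses):
    """ASM-style read: the memory delivers a data stream driven by the address sequence."""
    for address in addresses:
        yield ram[address]


ram = list(range(100, 116))   # a 4 x 4 block stored row by row
gag = generic_address_generator(base=5, x_count=2, y_count=2, y_stride=4)
window = list(auto_sequencing_read(ram, gag))
print(window)
```

Here the 2 x 2 window starting at address 5 is streamed out of a 4-wide block without any per-item address arithmetic in the consuming code.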
fine-grained RC: DeHon‘s 1st Law [1996: Ph. D, MIT]
[Chart: density in transistors / microchip, 1980 to 2010, logarithmic axis from 10^0 to 10^12. Curves: microprocessor (Gordon Moore curve), FPGA physical, FPGA logical, FPGA routed. The gap down to FPGA routed is marked >> 10 000.]
Technology: immense area inefficiency
overhead: reconfigurability overhead, wiring overhead, routing congestion
coarse-grained RC: Hartenstein‘s amendment of DeHon‘s 1st Law [1996: ISIS, Austin, TX]
[Chart: transistors / microchip, 1980 to 2010, logarithmic axis from 10^0 to 10^12. Curves: Gordon Moore curve, rDPA physical, rDPA logical, FPGA routed. The gap down to FPGA routed is marked >> 10 000.]
rDPA (e.g. KressArray family): area efficiency very close to Moore‘s law
More compute power by Configware than Software
Conclusion: most compute power from Configware
75% of all (micro)processors are embedded 4 : 1
average acceleration factor > 2 -> rMIPS* : MIPS > 2
*) rMIPS: MIPS replaced by FPGA compute power
25% embedded µProc. accelerated by FPGA(s)
1 : 4
(a very cautious estimation**)
**) Dataquest interaction pending
-> 1 : 1-> Every 2nd µProc accelerated by FPGA(s)
(difference probably an order of magnitude)
Conclusion (3)
Self-Repair and Self-Organization methodology
Embedded r-emulation logistics methodology
Universal vHPC co-architecture demonstrator
For widely spreading its use successfully: select a killer application for demo
Dual Paradigm Application Development Support
instruction-stream-based: CPU, software code
data-stream-based: accelerator reconfigurable, configware code
accelerator hardwired
software/configware co-compiler
high level language: MATLAB adapter
other example