
© 2006, [email protected]

http://hartenstein.de

Reconfigurable Computing

Reiner Hartenstein

Computing Meeting EU, ESU, Brussels, May 18, 2006



The Pervasiveness of RC

[Figure: # of hits by Google for “FPGA and ….” queries, in two groups of application areas: the ECE-savvy scene (mainstream for many years) and the Math/SW-savvy scene (more recently: 2-3 years), and many more areas. The hit counts shown range from 113,000 to 1,620,000.]



The dominance of Configware

Most compute power is coming from Configware

More MIPS migrated to Configware than running as Software



Reconfigurable Supercomputing (VHPC) going commercial

Cray XD1

Silicon Graphics RASC

… and other vendors



>> Outline <<

•Reconfigurable Computing Paradox

•The Supercomputing Paradox

•We are using the wrong model

•Coarse-grained Reconfigurable Devices

•Super Pentium for Desktop Supercomputer

http://www.uni-kl.de



The Reconfigurable Computing Paradox

poor FPGA technology: area-inefficient, slow, power-hungry, expensive

poor tools: tools and languages unacceptable to most users; even most hardware experts (86%**) hate their tools

**) DeHon ‘98

poor education: RC education is extremely poor, if offered at all; ignored by CS curricula; CS is taught as if for a 50-year-old mainframe …



FPGA integration density

the effective integration density of plain FPGAs is behind Moore’s law by more than 4 orders of magnitude

However, brilliant results everywhere: what paradox?



speed-up factors published

[Figure: relative performance (log scale, 10^0 to 10^9) over the years 1980 to 2010, with curves for Microprocessor (8080 to Pentium 4), Memory and FPGA, and growth-rate annotations X 2/yr, 7%/yr, 50%/yr and x1.25 / yr (Moore). Published speed-up factors, ranging from about 20 to 15,000, are marked for: Los Alamos traffic simulation, real-time face detection, video-rate stereo vision, pattern recognition, SPIHT wavelet-based image compression, Smith-Waterman pattern matching, BLAST, protein identification, molecular dynamics simulation, Reed-Solomon Decoding, Viterbi Decoding, FFT, MAC, GRAPE and crypto. Application areas: DSP and wireless; Image processing, Pattern matching, Multimedia; Bioinformatics; Astrophysics. Pre-FPGA era (DPLA on the MoM Xputer architecture, by TU-KL): Grid-based DRC (no FPGA, including a „fair comparison“), 2-D FIR filter, Lee Routing. Gap annotations: >1 OoM, >2 OoM, >3 OoM, <4 OoM.]

http://xputers.informatik.uni-kl.de/faq-pages/fqa.html



platform FPGAs: better area efficiency

DSP platform FPGA [courtesy Xilinx Corp.]:

500 MHz Flexible Soft Logic Architecture

200K Logic Cells

500 MHz Programmable DSP Execution Units

0.6-11.1 Gbps Serial Transceivers

500 MHz PowerPC™ Processors (680 DMIPS) with Auxiliary Processor Unit

1 Gbps Differential I/O

500 MHz multi-port Distributed 10 Mb SRAM

500 MHz DCM Digital Clock Management

DeHon‘s 1st Law (1996) was for plain FPGAs



pre-FPGA era: why DPLA* was so good

Large arrays of canonical Boolean expressions: classical PLA layout is highly area-efficient, close to Moore’s law

*) fabricated 1984 by the E.I.S. multi-university project

Mid-’80s: at first only very tiny FPGAs were available: 1 DPLA replaced 256 of them

ASM: Auto-Sequencing Memory, a generalization of the DMA**

GAG: Generic Address Generator**, to avoid address computation overhead

**) for a survey by IMEC & TU-KL see: [M. Herz et al.: ICECS 2003, Dubrovnik]

reducing memory cycles which is the

key issue

Speed-up factor of 20 by

Reiner Hartenstein

ASM means: no instruction streams needed for address computation. Generalization of DMA [M. Herz et al.: ICECS 2003, Dubrovnik]
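The idea of generating addresses without an instruction stream can be illustrated in software. The following is only a minimal sketch, assuming a simple 2-D block scan; the function name `generic_address_generator` and its parameters are hypothetical and are not taken from the MoM or GAG publications.

```python
from typing import Iterator


def generic_address_generator(base: int, row_stride: int,
                              rows: int, cols: int) -> Iterator[int]:
    """Yield the linear addresses of a rows x cols block scan.

    Hypothetical software model: a hardware address generator would be
    configured once with these parameters and would then emit the address
    sequence by itself, so no instructions are fetched to compute addresses.
    """
    for r in range(rows):
        for c in range(cols):
            yield base + r * row_stride + c


if __name__ == "__main__":
    # Scan a 2 x 3 window inside a memory whose rows are 8 words long.
    for address in generic_address_generator(base=16, row_stride=8, rows=2, cols=3):
        print(address)
```

In this model the scan pattern is pure configuration data (base, stride, block size); the loop body only stands in for what dedicated counters would do in hardware.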



taxonomy of algorithms, better tools and better education

[Figure: the chart of published speed-up factors (relative performance, log scale 10^0 to 10^9, over 1980 to 2010, with the Microprocessor, Memory and FPGA curves and the same application data points), here with two added annotations: “even higher speed-up?” and “consolidation?”]



New dimensions of low power: application migration [from supercomputer] results not only in massive speed-ups: electricity bills are reduced by an order of magnitude, and even more you may get for free …. up to millions of dollars per year

(also a matter of national energy policy)

Google, Amsterdam, NY

„Saves more than $10,000 in electricity bills per year (7¢ / kWh) - .... per 64-processor 19" rack“ [Herb Riley, R. Associates]
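To get a feeling for the quoted figure, the savings can be converted into an average electrical power. This is a rough sketch under the assumption that the rack runs continuously all year; the function name is made up for illustration, and since the quote says "more than $10,000" the result is a lower bound.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, leap years ignored


def average_power_kw(annual_savings_usd: float, price_usd_per_kwh: float) -> float:
    """Average continuous power whose yearly energy costs annual_savings_usd."""
    energy_kwh = annual_savings_usd / price_usd_per_kwh
    return energy_kwh / HOURS_PER_YEAR


if __name__ == "__main__":
    # $10,000 per year at 7 cents per kWh, as in the quote above.
    kw = average_power_kw(10_000, 0.07)
    print(f"about {kw:.1f} kW saved on average per rack")
```

At 7¢ / kWh, $10,000 buys roughly 143,000 kWh, which corresponds to about 16 kW drawn around the clock.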



The Supercomputing Paradox

Growing listed Teraflops

Increasing number of processors running in parallel

COTS processor decreasing cost

promising technology

Reiner Hartenstein

programmer productivity shrinking with growing number of processors



HPC by classic supercomputing methodology

Extreme shortage of affordable capacity

Lack of scalability: progress only by innovation

More parallelism absorbs programmer productivity

Program ready: hardware obsolete

The law of More

Not for high performance embedded computing

poor results



Why traditional supercomputing / HPC failed

instruction-stream-based: memory-cycle-hungry

the wrong way of moving the data around

because of the wrong multi-core interconnect architecture

extremely unbalanced

stolen from Bob Colwell

CPU



moving data around inside the Earth Simulator

Crossbar weight: 220 t, 3000 km of thick cable



discarding the wrong road map

with a paradigm shift the same performance is feasible

on a single 19” rack



Bringing together data and processor

by Software: moving data to the processor: moving the grand piano



Key issues in very High Performance Computing (vHPC)

this needs a paradigm shift

reducing memory cycles is the key issue

away from the dominance of instruction streams



Here is the common model

software code: instruction-stream-based (CPU)

configware code: data-stream-based (accelerator, reconfigurable)

accelerator, hardwired

it’s not von Neumann

the vN monopoly in our curricula is severely harmful

Von Neumann: the tail is wagging the dog

we need dual paradigm education

very high performance & electricity bill issues

legacy issues

symbiotic



The wrong basic mind set

we need a dual paradigm approach

this is a severe educational challenge

our IT expert labor force lacks the right basic mind set



For high school and undergraduate education

we need a simple archetypal common model


instead of a wide variety of sophisticated architectures



integration density

the effective integration density of plain FPGAs is behind Moore’s law by more than 4 orders of magnitude

the effective integration density of rDPAs* may come close to Moore’s law

*) reconfigurable DataPath Arrays (coarse-grained reconfigurability)



Coarse grain is about computing, not logic

SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]

array size: 10 x 16 = 160 rDPUs

rDPU: reconfigurable Data Path Unit, e. g. 32 bits wide

no CPU

Legend: rDPU not used; used for routing only (route-through only); operator and routing; port location marker; backbus connect



SW to coarse-grained CW migration example

[Figure: array of rDPUs; the mapped example produces S through an adder (+)]











Compare it to software solution on CPU

S = R + (if C then A else B endif); C = 1 (simple conservative CPU example)

step                           memory cycles   nanoseconds
if C then read A
    read instruction                 1             100
    instruction decoding
    read operand*                    1             100
    operate & reg. transfers
if not C then read B
add & store
    operate & reg. transfers
    store result                     1             100
total                                5             500

[Figure: data path producing S through an adder (+); labels: S, +, Clock200]



hypothetical branching example to illustrate software-to-configware migration

S = R + (if C then A else B endif); C = 1 (simple conservative CPU example)

step                           memory cycles   nanoseconds
if C then read A
    read operand*                    1             100
    operate & reg. transfers
if not C then read B
add & store
    operate & reg. transfers
    store result                     1             100
total                                5             500

*) if no intermediate storage in register file

[Figure: configware data path with inputs A, B, R, C and output S, containing an adder (+) and an element labeled =1; clock 200 MHz (5 nanosec)]

no memory cycles: speed-up factor = 100
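The speed-up factor on this slide follows from simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes that the configware data path delivers one result per clock period, which is what the figure (200 MHz, 5 nanosec, no memory cycles) suggests.

```python
def software_time_ns(memory_cycles: int, ns_per_memory_cycle: float) -> float:
    """Time of the CPU version, counting only memory cycles as on the slide."""
    return memory_cycles * ns_per_memory_cycle


def configware_time_ns(clock_mhz: float) -> float:
    """One clock period of the data path in nanoseconds (assumed: one result per clock)."""
    return 1_000.0 / clock_mhz


def datapath(a: int, b: int, r: int, c: bool) -> int:
    """Behaviour of the mapped expression S = R + (if C then A else B endif)."""
    return r + (a if c else b)


if __name__ == "__main__":
    sw = software_time_ns(memory_cycles=5, ns_per_memory_cycle=100)
    cw = configware_time_ns(clock_mhz=200)
    print(f"software: {sw} ns, configware: {cw} ns, speed-up: {sw / cw}")
    print("S =", datapath(a=3, b=4, r=10, c=True))
```

With 5 memory cycles of 100 ns each against a single 5 ns clock period, the ratio is 500 / 5 = 100.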



Why the speed-up? What‘s the difference?

moving the locality of operation into the route of the data stream by placement & routing (P&R)

instead of moving data by instruction streams



Bringing together data and processor

by Configware: move the stool

Place the location of execution into the data pipe



Data-stream-based

execution should be transport-triggered instead of instruction-triggered

transport should be done within compiled pipelines, not by move engines*

*) which are instruction-stream-based!



For high school and undergraduate education

we should send CTOs and professors back to school




The wrong model

SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]

array size: 10 x 16 = 160 rDPUs

rDPU: reconfigurable Data Path Unit, e. g. 32 bits wide

no CPU

Legend: rDPU not used; used for routing only (route-through only); operator and routing; port location marker; backbus connect

upon this schematic …… a question by a Japanese Corporate vVIP



The wrong mind set ....

„but you can‘t implement decisions!“ (Question by a Japanese Corporate vVIP: [RAW’99])

not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm

[Figure: data path with inputs A, B, R, C and output S, containing an adder (+) and an element labeled =1; clock 200 MHz (5 nanosec)]

We need Reconfigurable Computing Education







some Goals

Universal HPC co-architecture for:

▪ embedded vHPC (nomadic, automotive, ...)

▪ desktop vHPC (scientific computing ...)

Application co-development environment for:

▪ Hardware non-experts, ....

▪ Acceptability by software-type users, ...

Meet product lifetime >> embedded syst. life:

▪ FPGA emulation logistics from development down to maintenance and repair stations

▪ examples: automotive, aerospace, industrial, ..



Architecture: a potential Pentium successor

Discard most caches

have 64* cores, 0.5 - 1 GHz

with clever interconnect for:

▪ concurrent processes,

▪ multithreading,

▪ and a Kung-Kress pipe network

The Desk-top Supercomputer!

*) CPU mode / DPU mode capability



“Super Pentium” configuration example: twin paradigm machine

[Figure: array of cores, some configured as rDPUs and some as CPUs]



e. g.: ~ 8 x 8 rDPA: all feasible under 500 MHz

[Figure labels: Games, Music, Videos, SMeXPP, Camera, Baseband Processor, Radio Interface, Audio Interface, SD/MMC Cards, LCD display, rDPA]

• Variable resolutions and refresh rates
• Variable scan mode characteristics
• Noise Reduction and Artifact Removal
• High performance requirements
• Variable file encoding formats
• Variable content security formats
• Variable Displays
• Luminance processing
• Detail enhancement
• Color processing
• Sharpness Enhancement
• Shadow Enhancement
• Differentiation
• Programmable de-interlacing heuristics
• Frame rate detection and conversion
• Motion detection & estimation & compensation
• Different standards (MPEG2/4, H.264)
• A single device handles all modes

World TV & game console & multi media center

http://pactcorp.com



feasible under 500 MHz

means low electricity cost and allows very high integration density



pipeline

apropos compiled pipeline …



Dual Paradigm Application Development Support

high level language

software/configware co-compiler

software code: instruction-stream-based (CPU)

configware code: data-stream-based

placement & routing in the compiler optimizes interconnect bandwidth by preferring nearest neighbor connect



Software / Configware Co-Compilation

Juergen Becker’s CoDe-X, 1996

C language source

Partitioner

SW compiler (CPU)

CW compiler: Placement & Routing (Move the Locality of Operation)

Resource Parameters: supporting different platforms



Software / Configware very high level Synthesis

Math formula ....

term-rewriting-based vhl synthesis system [Arvind, or, Mauricio Ayala]

software code: instruction-stream-based (CPU)



>> Conclusions <<






•Conclusions http://www.uni-kl.de



Objectives

for every area which needs:

cheap, compact vHPC

flexibility (for accelerators)

avoiding specific silicon

rapid prototyping, field-patching, emulation



Conclusion (1)

Reconfigurable Computing opens many spectacular new horizons:

Cheap vHPC without needing specific silicon, no mask ....

Massive reduction of the electricity bill: locally and nationally

Cheap embedded vHPC

Cheap desktop supercomputer (a new market)

Fast and cheap prototyping

Replacing expensive hardwired accelerators

Supporting fault tolerance, self-repair and self-organization

Flexibility for systems with unstable multiple standards by dynamic reconfigurability

Emulation logistics for very long term spare part provision and part type count reduction (automotive, aerospace …)



Conclusion (2)

Needed:

Universal vHPC co-architecture demonstrator

The compilation tool problem to be solved

Language selection problem to be solved

Education backlog problems to be solved

Use this to develop a very good high school and undergraduate lab course

A motivator: preparing for the top 500 contest

For widely spreading its use successfully: select killer applications for demo



thank you



END



backup



Compilation: Software vs. Configware

Software Engineering:

source program (C, FORTRAN, MATLAB)

software compiler

software code

Configware Engineering:

source „program“

configware compiler: mapper (placement & routing), scheduler

configware code

flowware code (data)



Nick Tredennick’s Paradigm Shifts explain the differences

Software Engineering (CPU):

resources: fixed

algorithm: variable (software)

1 programming source needed

Configware Engineering:

resources: variable (configware)

algorithm: variable (flowware)

2 programming sources needed



Co-Compilation

Software / Configware Co-Compiler

source: C, FORTRAN, MATLAB

automatic SW / CW partitioner: simulated annealing

software compiler: software code

configware compiler: mapper, scheduler

simulated annealing

configware code

flowware code (data)



Co-Compiler for Hardwired Kress/Kung Machine [e. g. Brodersen]

Software / Flowware Co-Compiler

source

automatic SW / CW partitioner

software compiler: software code

flowware compiler: scheduler

flowware code (data)



The first archetype machine model

mainframe, CPU

Software Industry’s Secret of Success: simple basic Machine Paradigm

compile or assemble: procedural personalization

personalization: RAM-based

instruction-stream-based mind set

“von Neumann”



The 2nd archetype machine model

Configware Industry’s Secret of Success: simple basic Machine Paradigm

compile: structural personalization

personalization: RAM-based

data-stream-based mind set

“Kress-Kung”



Co-Compiler Enabling Technology

is available from academia

only a small team needed for commercial re-implementation

on the road map to the Personal Supercomputer



Data streams (pipe network)

H. T. Kung paradigm (systolic array)

„data streams“ define: ... which data item at which time at which port

[Figure: a DPA with an input data stream and output data streams entering and leaving at its edges; each stream is drawn as data items over the axes time and port #. The DPA is surrounded by ASM blocks.]

implemented by distributed memory

ASM: Auto-Sequencing Memory (data counter, GAG, RAM)

50 & more on-chip ASM are feasible
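The definition of a data stream as "which data item at which time at which port" can be written down as a small schedule table. The sketch below is only an illustration: the function name is hypothetical, and the skewing rule (row i enters port i, delayed by i time steps) is a convention commonly used for systolic arrays and is assumed here, not taken from the slide.

```python
from typing import Dict, List, Tuple

Schedule = Dict[Tuple[int, int], int]  # (time step, port number) -> data item


def skewed_streams(matrix: List[List[int]]) -> Schedule:
    """Build a schedule that feeds row i of the matrix into port i.

    Assumed convention: each row starts i time steps later than row 0,
    so item j of row i arrives at time i + j.
    """
    schedule: Schedule = {}
    for port, row in enumerate(matrix):
        for j, item in enumerate(row):
            schedule[(port + j, port)] = item
    return schedule


if __name__ == "__main__":
    streams = skewed_streams([[1, 2], [3, 4]])
    for (time, port), item in sorted(streams.items()):
        print(f"time {time}, port {port}: item {item}")
```

Such a table is what the scheduler has to produce; in the machine itself it would be realized by the address sequences of the memories feeding the ports rather than by a dictionary.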



The Generalization of the Systolic Array

systolic array: only for applications with regular data dependencies

remedy? discard algebraic synthesis methods

[R. Kress]: use optimization algorithms, e. g.: simulated annealing

Achievement: also non-linear and non-uniform pipes, and even wilder pipe structures possible

reconfigurability makes sense

Kress-Kung paradigm: super systolic array
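How an optimization algorithm can replace algebraic synthesis may be easier to see in a toy example. The following is a minimal sketch of simulated annealing for placing operators on a small array so that connected operators end up close together; it is not the KressArray mapper. All names, the cost function (sum of Manhattan distances) and the linear cooling schedule are assumptions made for illustration.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

Pos = Tuple[int, int]


def wire_length(placement: Dict[str, Pos], nets: List[Tuple[str, str]]) -> int:
    """Sum of Manhattan distances over all two-point connections."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)


def anneal_placement(ops: List[str], nets: List[Tuple[str, str]],
                     width: int, height: int, steps: int = 2000,
                     t_start: float = 2.0, seed: int = 0):
    """Return (best placement found, its wire length)."""
    if len(ops) > width * height:
        raise ValueError("array too small for the given operators")
    rng = random.Random(seed)
    cells = [(x, y) for y in range(height) for x in range(width)]
    occupant: List[Optional[str]] = list(ops) + [None] * (len(cells) - len(ops))
    rng.shuffle(occupant)

    def placement_of(occ: List[Optional[str]]) -> Dict[str, Pos]:
        return {op: cells[i] for i, op in enumerate(occ) if op is not None}

    cost = wire_length(placement_of(occupant), nets)
    best, best_cost = placement_of(occupant), cost
    for step in range(steps):
        temperature = t_start * (1.0 - step / steps) + 1e-9  # linear cooling
        i, j = rng.sample(range(len(cells)), 2)
        occupant[i], occupant[j] = occupant[j], occupant[i]   # trial swap
        new_cost = wire_length(placement_of(occupant), nets)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            cost = new_cost                                   # accept the swap
            if cost < best_cost:
                best, best_cost = placement_of(occupant), cost
        else:
            occupant[i], occupant[j] = occupant[j], occupant[i]  # undo the swap
    return best, best_cost


if __name__ == "__main__":
    operators = ["a", "b", "c", "d"]
    pipe = [("a", "b"), ("b", "c"), ("c", "d")]  # a simple linear pipe
    placement, cost = anneal_placement(operators, pipe, width=2, height=2)
    print(placement, cost)
```

Because the cost function rewards short connections, the search tends toward nearest-neighbor placements without requiring the data dependencies to be regular.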



(Kress-Kung machine paradigm) drastically reducing memory cycles

Data Counter instead of Program Counter

Generalization of the DMA

ASM: Auto-Sequencing Memory (data counter, GAG, RAM)

ASM means: no instruction streams needed for address computation

GAG & enabling technology: multiple publications 1989 …

Survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik] *) IMEC, Leuven & TU-KL

Storage scheme optimization methodology, etc.*



fine-grained RC: DeHon‘s 1st Law [1996: Ph. D, MIT]

Technology: immense area inefficiency

overhead: >> 10 000 (reconfigurability overhead, routing congestion, wiring overhead)

[Figure: density in transistors / microchip (log scale, 10^0 to 10^12) over the years 1980 to 2010: the Gordon Moore curve (microprocessor) compared with the curves FPGA physical, FPGA logical and FPGA routed]



coarse-grained RC: Hartenstein‘s amendment of DeHon‘s 1st Law [1996: ISIS, Austin, TX]

rDPA (e.g. KressArray family): area efficiency very close to Moore‘s law

[Figure: transistors / microchip (log scale, 10^0 to 10^12) over the years 1980 to 2010: the Gordon Moore curve compared with the curves rDPA physical, rDPA logical and FPGA routed (>> 10 000)]



More compute power by Configware than Software

75% of all (micro)processors are embedded (4 : 1)

25% of embedded µProc. accelerated by FPGA(s) (1 : 4)

(a very cautious estimation**)

-> 1 : 1 -> Every 2nd µProc accelerated by FPGA(s)

average acceleration factor > 2 -> rMIPS* : MIPS > 2

(difference probably an order of magnitude)

Conclusion: most compute power from Configware

*) rMIPS: MIPS replaced by FPGA compute power

**) Dataquest interaction pending



Conclusion (3)

Self-Repair and Self-Organization methodology

Embedded r-emulation logistics methodology

Universal vHPC co-architecture demonstrator

select a killer application for demo

For widely spreading its use successfully:



Dual Paradigm Application Development Support

high level language: MATLAB (other example)

adapter

software/configware co-compiler

software code: instruction-stream-based (CPU)