Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis Greg Stitt Department of Electrical and Computer Engineering University of Florida


Page 1: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis

Greg Stitt, Department of Electrical and Computer Engineering

University of Florida

Page 2: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Introduction

Improved performance enables new applications.

Past decade: MP3 players, portable game consoles, cell phones, etc.

Future architectures: speech/image recognition, self-guiding cars, computational biology, etc.

Page 3: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Introduction

FPGAs (Field Programmable Gate Arrays) implement custom circuits.

Speedups of 10x, 100x, even 1000x for scientific and embedded apps [Najjar 04][He, Lu, Sun 05][Levine, Schmit 03][Prasanna 06][Stitt, Vahid 05].

But FPGAs are not mainstream.

Warp processing goal: bring FPGAs into the mainstream by making FPGAs “invisible”.

[Figure: performance of uP vs. FPGA. FPGAs are capable of large performance improvements.]

Page 4: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Introduction – Hardware/Software Partitioning

C code for FIR filter:

    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i]
    ...
    for (i=0; i < 16; i++)
        y[i] += c[i] * x[i]
    ...

Software only: the compiler maps the loop to the processor, where it takes ~1000 cycles.

Hardware/software partitioning selects performance-critical regions for hardware implementation [Ernst, Henkel 93] [Gupta, DeMicheli 97] [Vahid, Gajski 94] [Eles et al. 97] [Sangiovanni-Vincentelli 94].

Processor + FPGA: the designer creates custom hardware for the loop using a hardware description language (HDL).

[Figure: hardware for the loop, a row of parallel multipliers (*) feeding a tree of adders (+).]

The hardware version takes ~10 cycles. Speedup = 1000 cycles / 10 cycles = 100x.

[Figure: time and energy on a 0 to 100 scale, software-only (Sw) vs. hardware/software (Hw/Sw).]
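The gap between ~1000 and ~10 cycles comes from doing the multiplications in parallel and summing them in a tree rather than one after another. A minimal C sketch of that idea follows. It assumes the products are accumulated into a single sum, as the adder tree in the figure suggests (the loop on the slide updates y[i] element by element). The names TAPS, dot_sequential, and dot_tree are hypothetical and are not from the talk.

```c
#include <stddef.h>

#define TAPS 16  /* assumed tap count, borrowed from the 16-iteration loop */

/* Sequential version: one multiply-accumulate per iteration, as a processor runs it. */
long dot_sequential(const int *c, const int *x) {
    long sum = 0;
    for (size_t i = 0; i < TAPS; i++)
        sum += (long)c[i] * x[i];
    return sum;
}

/* Tree version: form every product first (independent, so hardware can do them
   in parallel), then add neighbours pairwise, halving the width at each level. */
long dot_tree(const int *c, const int *x, int *levels) {
    long p[TAPS];
    for (size_t i = 0; i < TAPS; i++)
        p[i] = (long)c[i] * x[i];
    *levels = 0;
    for (size_t width = TAPS; width > 1; width /= 2) {
        for (size_t i = 0; i < width / 2; i++)
            p[i] = p[2 * i] + p[2 * i + 1];
        (*levels)++;
    }
    return p[0];
}
```

Both functions return the same sum. The tree version needs only log2(16) = 4 addition levels, which is why a circuit built this way can finish in a handful of cycles.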

Page 5: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Introduction – High-level Synthesis

Problem: Describing a circuit using an HDL is time-consuming and difficult.

Solution: High-level synthesis creates the circuit from high-level code [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92].

Allows developers to use a higher-level specification.

Potentially enables synthesis for software developers.

[Figure: tool flow. High-level code goes through hw/sw partitioning. The software side is compiled and linked with libraries/object code into a binary for the uP. The hardware side goes through high-level synthesis to a bitstream for the FPGA.]

Page 7: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Introduction – High-level Synthesis

Example: high-level synthesis turns the loop

    for (i=0; i < 16; i++)
        y[i] += c[i] * x[i]

into a custom circuit.

[Figure: a row of parallel multipliers (*) feeding a tree of adders (+).]

Page 8: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Outline

Introduction

Warp Processing Overview

Enabling Technology – Binary Synthesis
    Key techniques for synthesis from binaries
    Decompilation

Current and Future Directions
    Multi-threaded Warp Processing
    Custom Communication

Page 9: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Problems with High-Level Synthesis

Problem: High-level synthesis is unattractive to software developers.

Requires a specialized language: SystemC, NapaC, HandelC, …

Requires a specialized compiler: Spark, ROCCC, CatapultC, …

Limited commercial success: software developers are reluctant to change tools.

[Figure: non-standard software tool flow. A specialized language is processed by a specialized compiler and synthesis, producing a binary (linked with libraries/object code) for the uP and a bitstream for the FPGA.]

Page 10: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing – “Invisible” Synthesis

Solution: Make synthesis “invisible”. Two requirements:

Standard software tool flow: perform compilation before synthesis.

Hide the synthesis tool: move synthesis on chip. This is similar to dynamic binary translation [Transmeta], but translates to hardware.

[Figure: standard software tool flow. High-level code is compiled and linked with libraries/object code into a software binary. Compilation is moved before synthesis, so synthesis starts from the binary and produces the hardware bitstream for the FPGA and the software for the uP.]

Page 11: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing – “Invisible” Synthesis

A warp processor looks like a standard uP but invisibly synthesizes hardware.

[Figure: the standard software tool flow, with synthesis hidden on chip.]

Page 12: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing – “Invisible” Synthesis

Advantages:

Supports all languages, compilers, and IDEs (for example C, C++, Java, Matlab with gcc, g++, javac, keil).

Supports synthesis of assembly code.

Supports synthesis of library code.

Also enables dynamic optimizations.

Page 13: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing Background: Basic Idea

[Figure: warp processor containing a µP, instruction memory (I Mem), data cache (D$), profiler, FPGA, and on-chip CAD.]

Step 1: Initially, the software binary is loaded into instruction memory.

Software binary:

    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

Page 14: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing Background: Basic Idea

Step 2: The microprocessor executes the instructions in the software binary.

[Figure: time and energy of execution on the µP.]

Page 15: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing Background: Basic Idea

Step 3: The profiler monitors instructions and detects critical regions in the binary.

[Figure: the profiler observes the repeated add and beq instructions of the loop and reports “Critical Loop Detected”.]
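One simple way to picture this step is a small table that counts how often each backward branch is taken and flags a loop once its count passes a threshold. The C sketch below illustrates only that idea. It is not the profiler described in the talk, and Profiler, profiler_observe, TABLE_SIZE, and HOT_THRESHOLD are hypothetical names and values.

```c
#include <stddef.h>

#define TABLE_SIZE 16     /* hypothetical number of loops tracked at once */
#define HOT_THRESHOLD 8   /* hypothetical count that marks a loop as critical */

typedef struct {
    unsigned target[TABLE_SIZE];  /* loop-start address of each tracked branch */
    unsigned count[TABLE_SIZE];   /* times that backward branch was taken */
    size_t used;                  /* number of table entries in use */
} Profiler;

/* Record one taken branch from pc to target.
   Returns 1 once the loop starting at target has become hot, otherwise 0. */
int profiler_observe(Profiler *p, unsigned pc, unsigned target) {
    if (target >= pc)
        return 0;                       /* forward branch: not a loop back edge */
    for (size_t i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return p->count[i] >= HOT_THRESHOLD;
        }
    }
    if (p->used < TABLE_SIZE) {         /* first sighting: start tracking */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
    return 0;
}
```

Fed the taken branches of the example loop, the function returns 1 on the eighth observation, which corresponds to the “Critical Loop Detected” event in the figure.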

Page 16: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing Background: Basic Idea

Step 4: The on-chip CAD reads in the critical region.

Page 17: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing Background: Basic Idea

Step 5: The on-chip CAD (Dynamic Part. Module, DPM) converts the critical region into a control/data flow graph (CDFG).

    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Page 18: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing Background: Basic Idea

Step 6: The on-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit.

[Figure: the CDFG next to the synthesized circuit, a tree of adders (+).]

Page 19: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing Background: Basic Idea

Step 7: The on-chip CAD maps the circuit onto the FPGA.

[Figure: the adder circuit placed onto an FPGA fabric of CLB and SM blocks.]

Page 20: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing Background: Basic Idea

Step 8: The on-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to “warp” by an order of magnitude or more.

Updated binary:

    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4

[Figure: time and energy, software-only vs. “warped”.]

Page 21: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Expandable Logic

[Figure: a µP with cache, profiler, warp tools, and DMA, attached to RAM and to a variable number of FPGAs. Performance is plotted for the uP alone and with added FPGAs.]

Expandable RAM: the system detects RAM during start and improves performance invisibly.

Expandable Logic: warp tools detect the amount of FPGA and invisibly adapt the application to use less or more hardware.

Page 22: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Expandable Logic

Allows for customization of platforms: the user can select FPGAs based on the applications used.

[Figure: performance for a portable gaming application, labeled “Unacceptable Performance”.]

Page 23: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Expandable Logic

Allows for customization of platforms: the user can select FPGAs based on the applications used.

[Figure: performance for a portable gaming application as FPGAs are added.]

• User can customize FPGAs to the desired amount of performance

• Performance improvement is invisible – doesn’t require new binary from the developer

Page 24: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Expandable Logic

Allows for customization of platforms: the user can select FPGAs based on the applications used.

[Figure: performance for a web browser application with no FPGA, labeled “Acceptable Performance”.]

• Platform designer doesn’t have to decide on fixed amount of FPGA.

• User doesn’t have to pay for FPGA that isn’t needed

Page 25: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing Background: Basic Technology

Challenge: CAD tools normally require powerful workstations.

Develop extremely efficient on-chip CAD tools. This requires efficient synthesis, and it requires a specialized FPGA and physical design tools (JIT FPGA compilation) [Lysecky FCCM05/DAC04], University of Arizona.

[Figure: on-chip CAD flow from the binary through synthesis, logic optimization, technology mapping, and placement & routing, producing the hardware (HW) and the updated binary. Part of the flow is labeled JIT FPGA compilation.]

Page 26: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing Background: On-Chip CAD

Xilinx ISE: 60 MB, 9.1 s.

On-chip CAD: 3.6 MB, 0.2 s. On a 75 MHz ARM7: only 1.4 s.

46x improvement, with a 30% performance penalty.

[Figure: tool stages Log. Opt., Tech. Map, Place, Route, RT Syn., and Synthesis, with one part labeled “Manually performed”.]

Page 27: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing: Initial Results - Embedded Applications

Average speedup of 6.3x, achieved completely transparently.

Also, energy savings of 66%.

[Figure: speedup (axis 0 to 15) per benchmark: brev, g3fax, url, rocm, pktflow, canrdr, bitmnp, tblook, ttsprk, matrix, idct, g721, mpeg2, fir, matmul, plus average and median.]

Page 29: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Binary Synthesis

Warp processors perform synthesis from a software binary: “binary synthesis”.

Problem: No high-level information. Synthesis needs high-level constructs, and without them the result is a > 10x slowdown.

Can we recover high-level information for synthesis? The goal is to make binary synthesis (and warp processing) competitive with high-level synthesis.

Example: the compiler turns

    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i]
    ....

into a binary with no high-level constructs (arrays, loops, etc.):

    Addi r1, r0, 0
    Ld r3, 256(r1)
    Ld r4, 512(r1)
    Subi r2, r1, 128
    Jnz r2, -5

[Figure: binary synthesis maps the binary onto processor + FPGA. Hardware can be > 10x to 100x.]

Page 30: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Decompilation

We realized decompilation recovers high-level information.

But it is generally used for binary translation or source-code recovery, so it may not be suitable for synthesis.

We studied existing approaches [Cifuentes 94, 99, 01][Mycroft 99, 01]: DisC, dcc, Boomerang, Mocha, SourceAgain.

We determined the relevant techniques and adapted existing techniques for synthesis.

Page 31: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Decompilation – Control/Data Flow Graph Recovery

Recovery of the control/data flow graph (CDFG), the format used by synthesis.

Difficult because of indirect jumps: control flow cannot always be statically analyzed. But heuristics are over 99% successful on standard benchmarks [Cifuentes 99, 00].

Original C code:

    long f( short a[10] ) {
        long accum = 0;
        for (int i=0; i < 10; i++) {
            accum += a[i];
        }
        return accum;
    }

Corresponding assembly:

    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

Control/data flow graph creation:

    reg3 := 0
    reg4 := 0
    loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Page 32: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Decompilation – Data Flow Analysis

Original purpose: remove temporary registers.

Area overhead: 130%. New techniques are needed for binary synthesis.

After data flow analysis, the temporaries reg1, reg5, and reg6 of the example are folded into one expression:

    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Page 33: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

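Decompilation – Data Flow Analysis

Strength reduction of compare-with-zero instructions:

    Sub reg3, reg4, reg5
    Bz reg3, -5

Direct DFG: reg4 and reg5 feed a subtractor producing reg3, which is compared with 0 to decide the branch. The subtractor is not needed and wastes area.

Optimized DFG: reg4 is compared directly with reg5 to decide the branch.

Operator size reduction:

    Lb reg4, 0(reg1)
    Mvi reg5, 16
    Add reg3, reg4, reg5

Direct DFG: 32-bit reg4 and 32-bit reg5 feed a 32-bit adder producing a 32-bit reg3.

Optimized DFG: the load byte yields an 8-bit reg4 and the constant 16 fits in a 5-bit reg5, so only an 8-bit adder is needed, producing an 8-bit reg3.

Area overhead reduced to 10%.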
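The operator size reduction above amounts to tracking how many bits each value can actually occupy. The C sketch below shows that bookkeeping for the slide's example. The function names are hypothetical, and taking the adder width as the wider operand follows the slide's 8-bit result rather than any rule stated in the talk.

```c
/* Minimum number of bits that can hold an unsigned constant (at least 1). */
unsigned bits_for_constant(unsigned value) {
    unsigned bits = 1;
    while (value >>= 1)
        bits++;
    return bits;
}

/* Adder width following the slide's example: the wider of the two operands.
   A design that must keep the carry-out would need one extra bit. */
unsigned adder_width(unsigned width_a, unsigned width_b) {
    return width_a > width_b ? width_a : width_b;
}
```

For the example, bits_for_constant(16) gives 5 and adder_width(8, 5) gives 8, matching the 5-bit reg5 and the 8-bit adder on the slide.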

Page 34: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Decompilation – Function Recovery

Recover parameters and return values using def-use analysis of the prologue/epilogue. 100% success rate.

After function recovery, the example becomes:

    long f( long reg2 ) {
        int reg3 = 0;
        int reg4 = 0;
        loop:
        reg4 = reg4 + mem[reg2 + (reg3 << 1)];
        reg3 = reg3 + 1;
        if (reg3 < 10) goto loop;
        return reg4;
    }

Page 35: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Decompilation – Control Structure Recovery

Recover loops and if statements using interval analysis techniques [Cifuentes 94]. 100% success rate.

After control structure recovery, the example becomes:

    long f( long reg2 ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += mem[reg2 + (reg3 << 1)];
        }
        return reg4;
    }

Page 36: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Decompilation – Array Recovery

Detect linear memory patterns and row-major ordering calculations. ~95% success rate [Stitt, Guo, Najjar, Vahid 05] [Cifuentes 00].

After array recovery, the example becomes:

    long f( short array[10] ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += array[reg3];
        }
        return reg4;
    }

Page 37: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Comparison of Decompiled Code and Original Code

Decompiled code is almost identical to the original code. The only difference is variable names.

Binary synthesis is competitive with high-level synthesis.

Original C code:

    long f( short a[10] ) {
        long accum = 0;
        for (int i=0; i < 10; i++) {
            accum += a[i];
        }
        return accum;
    }

Decompiled code:

    long f( short array[10] ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += array[reg3];
        }
        return reg4;
    }

Almost identical representations.

Page 38: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Binary Synthesis Tool Flow

Initially, high-level source is compiled and linked (with libraries/object code) to form a binary.

Decompilation recovers the high-level information needed for synthesis.

The binary updater modifies the binary to use the synthesized hardware.

The binary synthesis tool is ~30,000 lines of C code.

[Figure: tool flow. High-level source, compiler, and binary feed binary synthesis, whose stages are labeled profiling, decompilation, hw/sw estimation, hw/sw partitioning, synthesis, and binary updater. The outputs are hardware netlists and a bitstream for the FPGA, and an updated binary for the uP.]

Page 39: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Binary Synthesis is Competitive with High-Level Synthesis

Binary speedup: 8x. High-level speedup: 8.2x. High-level synthesis is only 2.5% better, a small difference in speedup.

Commercial products are beginning to appear: Critical Blue, Binachip.

[Figure: speedup (axis 0 to 15), high-level vs. binary-level, for FIR Filter, Beamformer, Viterbi, Brev, Url, BITMNP01, IDCTRN01, PNTRCH01, and the average.]

Page 40: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Binary Synthesis with Software Compiler Optimizations

But the binaries were generated with few optimizations. Optimizations for software may hurt hardware, so new decompilation techniques are needed.

[Figure: C code, SW compiler, optimized binary, binary synthesis, uP + FPGA. The binary is optimized for software, so hardware synthesized from the optimized binary may be inefficient.]

Page 41: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Loop Rerolling

Problem: Loop unrolling may cause inefficient hardware.

Longer synthesis times: synthesis heuristics are super-linear, so unrolling 100 times can make synthesis time 100^2 times longer.

Larger area requirements.

Unrolling by the compiler is unlikely to match unrolling by synthesis.

Loop structure is needed for advanced synthesis techniques.

Solution: We introduce loop rerolling to undo loop unrolling.

[Figure: synthesis execution times, non-unrolled vs. unrolled.]

Non-unrolled loop:

    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5

Unrolled loop:

    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1

Page 42: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Loop Rerolling – Identifying Unrolled Loops

Idea: identify consecutively repeating instruction sequences.

Original C code:

    x = x + 1;
    for (i=0; i < 2; i++)
        a[i]=b[i]+1;
    y=x;

Unrolled loop:

    x = x + 1;
    a[0] = b[0]+1;
    a[1] = b[1]+1;
    y = x;

Binary, mapped to a string with one letter per instruction type:

    Add r3, r3, 1   => B
    Ld r0, b(0)     => A
    Add r1, r0, 1   => B
    St a(0), r1     => C
    Ld r0, b(1)     => A
    Add r1, r0, 1   => B
    St a(1), r1     => C
    Mov r4, r3      => D

String representation: BABCABCD

Build a suffix tree of the string [Ukkonen 95] and find consecutive repeating substrings: adjacent nodes with the same substring indicate an unrolled loop. Here there are 2 unrolled iterations, and each iteration = ABC (Ld, Add, St).

[Figure: suffix tree of the string.]
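To make the string-matching step concrete, the C sketch below finds back-to-back repeats with a plain brute-force scan. This is a deliberately simpler stand-in for the suffix tree named on the slide, chosen only because it fits in a few lines. It is much slower on long strings, and Repeat and find_unrolled_loop are hypothetical names.

```c
#include <string.h>

typedef struct {
    size_t start;   /* index of the first repeated instruction */
    size_t length;  /* instructions per iteration */
    size_t count;   /* number of back-to-back copies (0 if none found) */
} Repeat;

/* Returns the back-to-back repeat that covers the most instructions. */
Repeat find_unrolled_loop(const char *s) {
    Repeat best = {0, 0, 0};
    size_t n = strlen(s);
    for (size_t len = 1; len <= n / 2; len++) {
        for (size_t start = 0; start + 2 * len <= n; start++) {
            size_t count = 1;
            /* Extend while the next block of len letters equals the first block. */
            while (start + (count + 1) * len <= n &&
                   memcmp(s + start, s + start + count * len, len) == 0)
                count++;
            if (count >= 2 && count * len > best.count * best.length) {
                best.start = start;
                best.length = len;
                best.count = count;
            }
        }
    }
    return best;
}
```

On the string BABCABCD this reports a repeat starting at index 1 with length 3 and count 2, that is, the two ABC iterations identified on the slide.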

Page 43: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Loop Rerolling

Original C code:

    x = x + 1;
    for (i=0; i < 2; i++)
        a[i]=b[i]+1;
    y=x;

Binary after unrolled loop identification:

    Add r3, r3, 1
    Ld r0, b(0)
    Add r1, r0, 1
    St a(0), r1
    Ld r0, b(1)
    Add r1, r0, 1
    St a(1), r1
    Mov r4, r3

1) Determine relationship of constants

2) Replace constants with induction variable expression

    Add r3, r3, 1
    i=0
    loop:
    Ld r0, b(i)
    Add r1, r0, 1
    St a(i), r1
    Bne i, 2, loop
    Mov r4, r3

3) Rerolled, decompiled code

    reg3 = reg3 + 1;
    for (i=0; i < 2; i++)
        array1[i]=array2[i]+1;
    reg4=reg3;

Average speedup of 1.6x.

Page 44: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Strength Promotion

Problem: Strength reduction may cause inefficient hardware. However, some of the strength reduction was beneficial.

Strength promotion lets synthesis, not the software compiler, decide on strength reduction:

Identify strength-reduced subgraphs, such as (B[i] << 3) + (B[i] << 1).

Replace each subgraph with a multiplication: B[i] * 10, B[i+1] * 18, B[i+2] * 34, B[i+3] * 66.

Synthesis reapplies strength reduction to get the optimal DFG.

Average speedup of 1.5x.

[Figure: DFG computing A[i] as the sum of (B[i] << 3) + (B[i] << 1), (B[i+1] << 4) + (B[i+1] << 1), (B[i+2] << 5) + (B[i+2] << 1), and (B[i+3] << 6) + (B[i+3] << 1). The shift/add pairs are replaced one at a time by multiplications by 10, 18, 34, and 66, after which synthesis strength-reduces only where it helps.]
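The constants in the figure follow directly from the shift amounts: 2^3 + 2^1 = 10, 2^4 + 2^1 = 18, 2^5 + 2^1 = 34, and 2^6 + 2^1 = 66. The C sketch below writes the A[i] computation both ways so the equivalence can be checked. The function names are hypothetical, and unsigned values are assumed so that the shifts are well defined.

```c
/* Strength-reduced form, as a software compiler might emit it:
   every multiplication has been replaced by two shifts and an add. */
unsigned long a_reduced(const unsigned long *b, int i) {
    return ((b[i]     << 3) + (b[i]     << 1))
         + ((b[i + 1] << 4) + (b[i + 1] << 1))
         + ((b[i + 2] << 5) + (b[i + 2] << 1))
         + ((b[i + 3] << 6) + (b[i + 3] << 1));
}

/* Promoted form: each shift/add pair is turned back into one multiplication,
   leaving the choice of implementation to the synthesis tool. */
unsigned long a_promoted(const unsigned long *b, int i) {
    return b[i] * 10 + b[i + 1] * 18 + b[i + 2] * 34 + b[i + 3] * 66;
}
```

Both functions compute the same value for any input, so promotion changes only the form of the graph handed to synthesis, not the result.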

Page 45: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Multiple ISA/Optimization Results

What about aggressive software compiler optimizations? They may obscure the binary, making decompilation impossible.

What about different instruction sets? Side effects may degrade hardware performance.

Speedups are similar on MIPS for -O1 and -O3 optimizations.

Speedups are similar on ARM for -O1 and -O3 optimizations.

Speedups are similar between ARM and MIPS: the complex instructions of ARM didn’t hurt synthesis.

MicroBlaze speedups are much larger: MicroBlaze is a slower microprocessor, and -O3 optimizations were very beneficial to hardware.

[Figure: speedup (axis 0 to 30) for MIPS -O1, MIPS -O3, ARM -O1, ARM -O3, MicroBlaze -O1, and MicroBlaze -O3.]

Page 46: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

High-level vs. Binary Synthesis: Proprietary H.264 Decoder

High-level synthesis vs. binary synthesis, in collaboration with Freescale Semiconductor.

H.264 decoder: MPEG-4 Part 10 Advanced Video Coding (AVC). 3x smaller than MPEG-2, with better quality.

[Figure: MPEG2 vs. H.264.]

Page 47: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

High-level vs. Binary Synthesis: Proprietary H.264 Decoder

Binary synthesis was competitive with high-level synthesis: high-level speedup 6.56x, binary speedup 6.55x.

[Figure: speedup (axis 0 to 10) vs. number of functions in hardware (1 to 51), for high-level and binary synthesis.]

Page 49: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Thread Warping - Overview

Architectural trend: include more cores on chip. Result: more multi-threaded applications.

Example: function a( ) creates 10 threads of b( ):

    for (i=0; i < 10; i++)
        createThread( b );

The OS can only schedule 2 threads. The remaining 8 threads are placed in the thread queue.

Warp tools create custom accelerators for b( ) on the warp FPGA.

The OS schedules 4 threads to the custom accelerators: 3x more thread parallelism (2 threads on µPs plus 4 on accelerators, versus 2).

[Figure: multi-core chip with µPs, OS, profiler, warp tools, and warp FPGA, showing a( ), the running b( ) threads, and the thread queue.]

Page 50: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Thread Warping - Overview

The profiler detects a performance-critical loop in b( ).

Warp tools create larger/faster accelerators for b( ).

Potentially > 100x speedup.

[Figure: the same multi-core chip, with larger b( ) accelerators on the warp FPGA.]

Page 51: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Thread Warping - Results

Comparison of thread warping (TW) and multi-core. Simulated multi-cores range from 4 to 64 cores; thread warping uses 4 cores + FPGA.

Thread warping is 120x faster than the 4-uP (ARM) system.

[Figure: speedup (axis 0 to 50, with off-scale bar labels 130 502 63 130 38308) for Fir, Prewitt, Linear, Moravec, Wavelet, Maxfilter, 3DTrans, N-body, Avg., and Geo. Mean, comparing 4-uP, TW, 8-uP, 16-uP, 32-uP, and 64-uP.]

Page 52: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing – Custom Communication

NoC (Network on a Chip) provides communication between multiple cores [Benini, DeMicheli][Hemani][Kumar].

Problem: The best topology is application dependent.

[Figure: four µPs connected by a NoC, with performance of bus vs. mesh topologies for App1 and App2.]

Page 53: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Warp Processing – Custom Communication

NoC (Network on a Chip) provides communication between multiple cores [Benini, DeMicheli][Hemani][Kumar].

Problem: The best topology is application dependent.

Warp processing can dynamically choose the topology: 2x to 100x improvement.

Collaboration with Rakesh Kumar, University of Illinois, Urbana-Champaign: “Amoebic Computing”.

[Figure: four µPs connected through an FPGA, with performance of bus vs. mesh topologies for App1 and App2.]

Page 54: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

Summary

Warp processing: any language, any compiler, standard binary. The developer is unaware of FPGA/synthesis.

Binary synthesis: decompilation makes it possible.

JIT FPGA compilation.

Expandable logic.

Warp processing invisibly achieves > 100x speedups.

[Figure: warp processor (uP, I$, D$, FPGA, profiler, on-chip CAD), the binary-to-hardware flow, and performance of uP vs. warp processing.]

Page 55: Warp Processing:  Making FPGAs Ubiquitous via Invisible Synthesis

References

Patent: Warp Processor for Dynamic Hardware/Software Partitioning. F. Vahid, R. Lysecky, G. Stitt. Patent Pending, 2004.

1. Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2002, pp. 164-170.

2. Warp Processors. R. Lysecky, G. Stitt, and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006, Volume 11, Number 3, pp. 659-681.

3. Binary Synthesis. G. Stitt and F. Vahid. Accepted for publication in ACM Transactions on Design Automation of Electronic Systems (TODAES).

4. Expandable Logic. G. Stitt and F. Vahid. Submitted to IEEE/ACM Conference on Design Automation (DAC), 2007.

5. New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2005, pp. 547-554.

6. Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, and B. Einloth. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp. 285-290.

7. A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. IEEE/ACM Design Automation and Test in Europe (DATE), 2005, pp. 396-397.

8. Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky, and F. Vahid. IEEE/ACM Conference on Design Automation (DAC), 2003, pp. 250-255.

Supported by NSF, SRC, Intel, IBM, Xilinx