33
A Framework for Run - Time Reconfigurable Computing D ANIEL L LAMOCCA Electrical and Computer Engineering Department, Oakland University October, 24 th , 2014

A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Embed Size (px)

Citation preview

Page 1: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

A Framework for Run-Time Reconfigurable Computing

DANIEL LLAMOCCA

Electrical and Computer Engineering Department,

Oakland University

October, 24th, 2014

Page 2: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

OutlineMotivation

General Approach

Implementation Details

Application Examples:o Pixel Processor

o 2D FIR Filter

Research Plans

Teaching Plans

Conclusions

Page 3: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Motivation

Dynamic Reconfigurable Computing Management will enable us to deliver:

A dynamically self-adaptive system (by dynamic allocation of computational resources and dynamic frequency control) that satisfies time-varying Multi-variable requirements (or constraints).

Optimal hardware realizations: We want to investigate optimal solutions that can meet time-varying Multi-variable requirements .For example, if the variables were Energy, Performance, and Accuracy, then the system should minimize energy consumption, and at the same time maximize performance and precision, whilesatisfying the given multi-variable requirements.

Digital systems can be characterized by a series of properties:

Energy, Performance, Precision, Resource Usage, Bandwidth, etc.

The controlling of these variables at run-time is defined as Dynamic Reconfigurable Computing Management.

Digital

signal/image/video

DIGITALSYSTEM

Digital

signal/image/videoor features

Control

Energy

PerformancePrecision

Resources

...

... Parameters

IN OUT

Page 4: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Motivation

Examples:

Task 1: A video processing system is asked to deliver real time performance at 30 frames per second (fps) on limited battery life that will also need to operate for at least 10 hours. This is a multi-objective optimization problem. If solutions are found, pick the system realization that delivers the highest accuracy.

Task 2: Now, suppose that we are asked to deliver performance at 100 frames per second (fps) at some minimum level of accuracy (60dB). In this case, we select the hardware realization with the lowest energy requirements while meeting the performance and accuracy constraints.

The system can then carry out independent tasks in time:

t1 t2 t3 time...

DIGITALSYSTEM

Set 1 of

Parameters

Performance 30 fps

Energy available for at least 10 hours

Performance 100 fps

Accuracy 60 dB

DIGITALSYSTEM

Set 2 of

Parameters

TASK 1 TASK 2

Multi-variable Space:

Energy-Performance-Accuracy

Time-Varyingconstraints

Page 5: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

MotivationDynamic Reconfigurable Computing Management can rely on: Dynamic Partial Reconfiguration and Dynamic Frequency Control on FPGAs.

Dynamic Partial Reconfiguration (DPR)

DPR technology enables the adaptation of hardware resources by modifying or switching off portions of the FPGA while the rest remains intact, continuing its operation.

A Partial Reconfiguration Region (PRR) is a region whose hardware configuration can be modified at run-time.

Xilinx® devices: the PRR is dynamically reconfigured via the internal c0nfiguration access port (ICAP).

Dynamic Frequency Control

Digital Clock Managers (DCMs) inside FPGAs provide a wide range of clock management features.

The Dynamic Reconfiguration Port (DRP) of the DCM enables dynamic control of the frequency and phase. We can dynamically adjust the frequency without reloading a new bitstream to the FPGA.

clkfx

clkin

rst

clkfb clk0

daddr[6:0]

di[15:0]

dwe

dclk

den

do[15..0]

drdy

DRP

DCM_ADV

DC

R s

lave

inte

rface

processor

DCR MASTER

BUFG

BUFG

reference clock

DC

R B

US

DCR SLAVE

dcr_a_bus

dcr_sl_dbus

dcr_read

dcr_write

sl_dcrack

sl_dcr_dbus

clkfx

Dynamic FrequencyControl on Virtex-4 FPGAs

VariableClock

FPGA

Module 1

Module 2

Module n

t1

t2

tn

PRR

static region

dynamicregion

Page 6: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

General approach (1/7)For a given system, Dynamic Reconfigurable Computing

Management is carried in the following manner:

1) Definition of Objective Functions

2) Development of efficient cores

3) Parameterization of Hardware Cores

4) Multi-objective Pareto Optimization in the Multi-Variable Space

5) Dynamic management based on real-time multi-variable constraints

Page 7: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

General approach (2/7)1) Definition of objective functions:

A wide range of quantities (e.g., energy, performance, precision, hardware usage) can considered as the objective functions of system parameters. These properties may have a slightly different definition depending on the application:

Energy can be measured as the total energy spent during the system operation, or the energy spent during an operation (e.g., energy per video frame). In some instances, measuring Power is more useful.

Performance can be measured by: Megasamples per second, frames per second, Megabytes per second, etc.

Precision can be measured by: numerical representation, or accuracy with respect to an idealized result (e.g., PSNR).

There can be objective functions that pertain to an specific application. For example, in image compression, the bitrate metric evaluates the compression efficiency of a hardware architecture (e.g. JPEG processor). Or bandwidth for communication networks.

Page 8: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

General approach (3/7)2) Development of efficient cores: The hardware architectures should

use techniques that:i) Minimize the amount of computational resources (e.g. LUT-based

approaches, Distributed Arithmetic),ii) Exploit parallelism and pipelining so as to obtain high performance

architectures.iii)Make intensive use of DPR and/or take advantage of DPR.

The cores should be described using Hardware Description Language (HDL) at the Register Transfer Level (RTL). The best effort must be made so that these cores remain portable across devices and vendors.

+++

++

+

X(0) X(1) X(2) X(3) X(4) X(5) X(6)

Yo

PIPELINED ADDER TREE

x[M

-1]

s[0]s[M-2]s[M-1]

X_in B

x[N-1] x[N-2] x[0]x[1]

X_in

x[0]

x[N-1] x[M+1]

x[M]x[M-1]x[1]

x[N-2]B

only whenN is evenB+1

E E E E

E

E E E

E E E

E

B B B B

x[M]...

...

...

...for anti-

symmetry

B+1 B+1

E

+

E E E E E E...

E_COF

COFNH ...

E_COF

COFNH

NH+B NH+B NH+B NH+B

Y

Addertree +

NH+B+1 NH+B+1 NH+B+1

Y

Addertree

NON-SYMMETRIC FILTER SYMMETRIC/ANTI-SYMMETRIC FILTER

Page 9: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

General approach (4/7)3) Parameterization of hardware cores:

Fine control of the objective functions (e.g., energy, performance, accuracy) is greatly helped by realistic parameterization of the hardware cores (e.g., I/O bit-width, number of parallel cores).

Parameterized HDL code allows us to quickly create a set of hardware realizations by varying the parameters. Each realization comes with different values for the objective functions, which we ultimately control by varying the hardware parameters.

Example: Parameterization of the ‘Pixel processor’ architecture:

NC (number of cores), NI (number of input bits per pixel),

NO (number of output bits per pixel), F (function to be implemented),

LUT values (text file with LUT values)

PIXEL

PROCESSOR

NC LUT values(from text file)

NI NO F

NINC NONC

Page 10: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

General approach (5/7)4) Multi-objective Optimization in the Multi-Variable Space:

The Energy-Performance-Accuracy space is shown along with the Pareto- optimalpoints.In some cases, we may want to explore a space of just 2 variables, e.g., the Energy-Accuracy space.

Multi-variable space: Represented by a set of hardware realizations along with their objective function values. We create it by varying the system parameters.Optimality: A hardware realization is defined to be optimal in the multi-objective (Pareto) sense if it is not possible to improve on all criteria without deteriorating in at least one of them. Objective 1

Obje

ctive

2

SET OF HARDWARE

REALIZATIONS

Parameters Objective FunctionsSet 1 Values 1Set 2 Values 2Set 3 Values 3... ....

Multi-variable

Space

Example 3-variable space: Energy-Performance-Accuracy Space: An optimal hardware realization is defined as one that minimizes energy, while maximizing performance and accuracy.

En

erg

y

-Accuracy

Pareto front

-Accuracy -Performance

En

erg

y

Pareto front

Page 11: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

General approach (6/7)5) Dynamic management based on real-time multi-variable constraints:

Once the Pareto front has been extracted, we can cast optimization problems based on multi-variable constraints. We will use the Energy-Performance-Accuracy Space to explain this idea.Example: We are given constraints on the 3 variables. The feasible set is then represented by the golden points. We prioritize energy consumption, so we select the realization from the feasible set that also minimizes energy consumption. This can be cast as the following optimization problem:

𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑅𝑖

𝐸𝑛𝑒𝑟𝑔𝑦(𝑅𝑖) 𝑠𝑢𝑏𝑗𝑒𝑐𝑡 𝑡𝑜:

𝐸𝑛𝑒𝑟𝑔𝑦(𝑅𝑖) ≤ 2𝑚𝐽𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦(𝑅𝑖) ≥ 50𝑑𝐵

𝑃𝑒𝑟𝑓𝑜𝑟𝑚𝑎𝑛𝑐𝑒 𝑅𝑖 ≥ 30𝑓𝑝𝑠

-Performance-Accuracy

constr

ain

ts

Energ

y

Energ

y

-Accuracy

constraints

Circled point: Realization from the feasible set that minimizes energy consumption.

If we ignore one variable (say, performance), we have a 2D optimization problem.

Page 12: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

General approach (7/7)5) Dynamic management based on real-time multi-variable

constraints: Now, we are given time-varying simultaneous constraints: different set of constraints are applied at different moments in time.The system receives stimuli in the form of multi-variable constraints and reconfigures itself via DPR and/or Dynamic Frequency Control to meet the multi-variable constraints. The figure shows examples with 3 and 2 simultaneous constraints.

EnergyAccuracy (psnr)

Performance (fps)

min50dB

200

30 mJmax

300

----

max

t1 t2 t3

En

erg

y

-Accuracy

PO 1

Pareto front

PO 3

PO m

PO 4

PO 2

'm' Pareto points

-Performance

-Accuracy

Pareto front

En

erg

y PO 1

PO 3

PO m

PO 4

PO 2

'm' Pareto points

EnergyAccuracy (psnr)

min--

30 mJmax

--max

t1 t2 t3

30mJ

Page 13: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Implementation Details (1/2)Embedded FPGA system that supports Dynamic Partial Reconfiguration and Dynamic Frequency Control:Pareto-optimal point: Represented by <bitstream, frequency of operation>- Hardware realization that becomes active in the FPGA via Dynamic Partial Reconfiguration (DPR) and/or Dynamic Frequency Control. If the system receives a multi-variable constraint: It looks for a solution in the Pareto-optimal set: <bitstream*, freq*> It reconfigures the FPGA dynamic region(s) and /or frequency of operation, so as to meet the multi-variable constraints.Example (one Dynamic Region): The PRM (Partial Reconfigurable Module) is a hardware core that performs an specific task and that can be modified at run-time.

PPC/

mBlaze/ARM

PLB/AXI

System ACE

memoryEthernet

MAC

PLB/AXI Peripheral

PRM

PRR

ICA

Pport

'n' bitstreamsin memory

DPRDMAcore

M S

ICAPcore

frequency& PR ctrl

iFIFOin

terf

ace

CFcard

oFIFO

clkfx

PR_done

Module

1

Module

2

Module

n

...n m n ≤ m

Energ

y

-Accuracy

PO 1

Pareto front

PO 3

PO m

PO 4

PO 2

'm' Pareto pointsModule 1

'n' uniquebitstreams

freq.

f1

f2

f2

f3

f1

-Performance

Module 2

Module 1

Module 1

Module n

...

...

Mu

lti-

vari

ab

leco

ns

train

ts

*PRR: Partial Reconfigurable Region

Page 14: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Implementation Details (2/2) A Pareto point is represented by ‘k’ bistreams and the frequency of operation if: The system has ‘k’ dynamic regions. The requested operation requires the

dynamic regions to have a unique combination of bitstreams. The system has one dynamic region, but the requested operation requires

reconfiguring the dynamic region ‘k’ times.Generalization: A Pareto point is represented by:

<bitstream1, bitstream2, …, bitstreamk, frequency of operation>

n ≤ m, p ≤ m

f1

f2

f1

f3

fp

...

'n' unique sets of partial bitstreams

prm 11 prm 21 prm k1...

prm 1n prm 2n prm kn...

prm 12 prm 22 prm k2...

prm 13 prm 23 prm k3

prm 12 prm 22 prm k2

...

partial bitstreams

Hardware core

prm 1 prm 2 prm k

Static Region

...

...

...

Design Parameters

clockfrequency

...

freq.Objective 1

Pareto front

Obje

ctive

2

PO 1

PO 3

PO m

PO 4

PO 2

'm' Pareto points

'k' Dynamic Regions

In some cases, adifferent Pareto pointcan exist where only

the frequenqychanges

n ≤ m, p ≤ m

f1

f2

f1

f3

fp

...

'n' unique sets of partial bitstreams

prm 11 prm 21 prm k1...

prm 1n prm 2n prm kn...

prm 12 prm 22 prm k2...

prm 13 prm 23 prm k3

prm 12 prm 22 prm k2

...

partial bitstreams

Hardware core

prm

Static Region

...

...

Design Parameters

clockfrequency

...

freq.Objective 1

Pareto front

Obje

ctive

2PO 1

PO 3

PO m

PO 4

PO 2

'm' Pareto points

One Dynamic Region

In some cases, adifferent Pareto pointcan exist where only

the frequenqychanges

Page 15: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Digital signal, image, and video processing applications

The following systems are discussed:

Pixel Processor and Dynamic PPA/EPA Management

2D Separable FIR Filter Dynamic EPA Management.

Page 16: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Pixel Processor (1/4) LUT-based architecture.

Single-pixel operations (e.g., gamma correction, Huffman encoding, histogram equalization, contrast stretching) can be dynamically swapped. Parameter F modifies the function.

In addition to dynamically modifying the input-output function, we allow for the modification of:

Input pixel bitwidth (NI)

Output pixel bitwidth (NO),

Number of parallel processing elements (NC)

t1t2t3

Input Frame

Inv. Gamma correction

gamma = 2

Contrast Stretching

alpha = 2, m = 0.5

Gamma correction

gamma = 0.5

Module 3 Module 2 Module 1

FPGA

PRR

PIXEL

PROCESSOR

NC LUT values

(from text f ile)

NI NO F

NINC NONC

* VHDL IP core available at:

dllamocca.org

Page 17: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Pixel Processor (2/4) Embedded System: One Dynamic Region. Pareto point represented by: <bitstream, frequency>. Pixel processor interface: PLB (Processor Local Bus) slave burst interface. The

figure shows a PRR with NC=4, NI=NO=8. The system dynamically reconfigures: NC, NI, NO, FUNCTION, under the

following constraints: NINC32, and NONC 32 (because of the 32-bit PLB) Five ‘clkfx’ frequencies allowed: 100.00, 66.66, 50.0, 40.00, and 33.33 MHz. FIFOs required to properly isolate different clock regions (PLB clk= 100 MHz and

‘clkfx’).

ord

en

ird

en

LUT8-to-8

LUT8-to-8

LUT8-to-8

32 32

upix_ip

LUT8-to-8

Bus2IP

_D

ata

SlaveReg 0

oFIFO

FWFT

DOrden

IP2B

us_D

ata

32 IP

2B

us_W

rack

IP2B

us_R

dack

Bus2IP

_W

rCE

(0)

Bus2IP

_R

dC

E(1

)

FSM

Bus2IP

_C

lkDIwren

PLB Bus

PLB Interface

512x32

Bus2IP

_B

E

rdwr

IP2B

us_A

dd

rAck

PLB46 Slave Burst IP

Bus2IP

_R

dR

eq

Bus2IP

_W

rReq

full

em

pty

ow

ren

ofull oempty

iFIFO

FWFT

DOrden

DIwren

full

em

pty

iwre

nifull iempty

FSM

clkfx

iwren

SlaveReg 1

iwre

n

512x32

PowerPC

PLB

System ACE

memoryEthernet

MACICAPcore

ICAPport

CFcard

freq. ctrlDCR Bus clkfx

EP

Aco

nstr

ain

ts

'n' bitstreamsin memory

Mo

dule

n

Mo

dule

2

Mo

dule

1

DMAcore

M S

Hardware core

PixelProcessor

Inte

rface

PRRPRR

Implemented on a ML405 Development BoardDevice: XC4VFX20-11FF672

Page 18: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Pixel Processor (3/4)

Multi-objective optimization of the Power-Performance-Accuracy space:

Function: log(1 + I) + Full Histogram stretching. Image: Lena (VGA: 640x480).

8-bit input image (NI=8 fixed). Pareto points are clustered as a function of NO (# of output bits). A similar trend occurs with NC (# of cores) (not shown)

Left side shows how power and performance depend on frequency.

0

2000

400055 60 65 70 75 80 85

0

20

40

60

80

100

fps

EPP Results -Function 7

psnr(dB)

Pow

er(

mW

)

02000

400060 80100

0

50

100

fps

EPP Results -Function 7

psnr(dB)

Pow

er(

mW

)

NO = 8

NO = 9

NO = 10

NO = 11

NO = 12

0

2000

400055 60 65 70 75 80 85

0

20

40

60

80

100

fps

PPA Results -Function 7

psnr(dB)

Pow

er(

mW

)

100

66.66

50

40

33.33

NC

=4

NC

=6

NC=8

NC=10

NO=12

NC

=2

NO=8

HA

HP

LPw

NO Accuracy, Power

NCPower, fps

Page 19: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Pixel Processor (4/4) Dynamic Energy-Performance-Accuracy management: We show two

examples on 2D (ignoring performance) with time-varying constraints:

Video sequence:

foreman (CIF: 352x288)

Function:

Gamma correction:

I, = 1/0.45

Video sequence:

missamerica(QCIF: 176x144)

Function:

Nonlinear contrast

Stretching: 1/(1 + m/I),

m = 0.5, =2

* Published in IEEE Transactions on Circuits

and Systems for Video Technology

0 50 100 150 200 250 30045

50

55

60

65

70

75

80

85

Frame #

PS

NR

(dB

)

frame #30

avg=59.07std=0.05

diff. w/ lena=0.25%

avg=70.48std=0.06

diff. w/ lena=0.21%

frame #120

avg=82.58std=0.21

diff. w/lena=0.47%

frame #200

avg=48.53std=0.06

diff. w/lena=3.23%

frame #270

304050607080902

3

4

5

6

7

8

9

10

psnr(dB)

Energ

y p

er

fram

e(u

J)

EPA Results -Function 6

Energy per frameAccuracy (psnr)

7 uJ55dB

min70 dB

--max

min50dB

NI=8, NO=10,NC=6

NI=NO=8, NC=8

NI=8, NO=12, NC=10

NI=7, NO=7, NC=10

0 50 100 15045

50

55

60

65

70

75

80

85

Frame #

PS

NR

(dB

)

avg=59.43std=0.045

diff. w/lena=0.78%

frame #15

avg=70.89std=0.035

diff. w/ lena=0.42%

frame #55

avg=82.53std=0.030

diff. w/ lena=1.67%

frame #100

avg=49.73std=0.032

diff. w/ lena=1.39%

frame #130

304050607080900.5

1

1.5

2

2.5

psnr(dB)

Energ

y p

er

fram

e(u

J)

EPA Results -Function 5

Energy per frameAccuracy (psnr)

2 uJ55dB

min70 dB

--max

min50dB

NI=7, NO=8, NC=10

NI=NO=8, NC=8

NI=8, NO=12, NC=10

NI=8, NO=10,NC=6

Page 20: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

2D Separable FIR Filter (1/7)

EX_in E E E

y

c[0] c[1] c[2 c[3

EX_in E E E

y

shifts and adds

LUT array

Multiply and add Filter Distributed Arithmethic Filter

TWO TYPES OF IMPLEMENTATIONS

FIR_DAX_in B

E

NO

N NH

[NO

NQ

]

OP

SY

MM

ET

RY

COEFFICIENTS

Y

L

sclr

BNumber of taps

Input bit-w idth

Coeff icient bit-w idth

Output f ixed point format

LUT input size

Output truncation scheme Efficient implementation of a 1D FIR Filter via DPR: Dynamic Partial Reconfiguration turns the fixed-coefficient DA filter into a variable-coefficient DA filter, at the expense of partial reconfiguration time overhead.

Parameterization of the VHDL-coded FIR filter core:

Distributed Arithmetic (DA) approach is more efficient since it is a LUT-based approach that turns the multiplications into shifts and adds. But it requires the coefficients to be constant.

1D FIR Filter implementation

* VHDL IP core available at:

dllamocca.org

Page 21: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

2D Separable FIR Filter (2/7)

2D Separable Filter Implementation:

Separable FIR filters allow for efficient implementations by means of two 1D FIR Filters.

The reconfiguration rate is constant (twice per frame).

Cyclic Dynamic reconfiguration of two 1-D filters (usually full-filter reconfiguration):

- Implement row filter

- Replace by column filter

- Implement column filter

- Replace by row filter

DPR

tr1 tc1FPGA

ROW COL

COL

tr2 tc2

ROW COL

1 Frame is streamed

Processed row -by-row frame is streamed

2

3

4

Replacing row filter by column filter via DPR

Replacing column filter by row filter via DPR

ROW

DPRROWCOL

Processing

frame 1

Processing

frame 2

5 Go to Step 1 to process a new frame

ROW PRR

row filter

col f ilter

COL

* A comparison of this 2D FIR Filter and a GPU implementation for different number of coefficients was published 2011 IEEE Field Programmable Logic Conference (FPL’2011)

Page 22: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

2D Separable FIR Filter (3/7) Embedded System: PLB Interface: The interface is inside the PRR (Partial Reconfigurable

Region), so that we can dynamically modify the I/O bitwidth. Each 2D filter realization is represented by 2 bitstreams (1 dynamic region)

Rowfilter

Colfilter

Row i

Col iDPR

DPR

DYNAMIC MANAGERGenerates Pareto Optimal point POi and

loads it into the 2D FIR filter

Row i Col i

PO i, 1 ≤ i ≤ n

B: User constraints

Consider new constraints based on image type, user input, or output

EPA constraintgenerated?

Look for PO point that satifies theEPA constraints

PO point exists?

Load <row i,col i> bitstreams

into the 2D FIR filter via DPR

no

no

yes

yes

ParametersN,NH,OB,NXr,NXc

Energy

Performance

Accuracy

Pointer tobitstream row i

Pointer tobitstream col i

Pareto Optimal PointPOi in memory:

2D FIR Filter

A: im

age type

C: outp

ut fr

am

es

mBlaze

PLB

System ACE

memoryEthernet

MAC

Real Filter peripheral

1D FIR IP core

PRR

ICA

Pport

EP

Aco

ns

train

ts

'2n' bitstreamsin memory

Filter 1 Filter 2 Filter n

Row

1

Col 1

Row

2

Col 2

Row

n

Col n

DPRDMAcore

M S

ICAPcore

frequency& PR ctrl

iFIFO

inte

rface

CFcard

oFIFO

clkfx

PR_done

...

Energ

y p

er

fram

e

-Accuracy

PO 1

Pareto front

PO 3

PO m

PO 4

PO 2

'm' Pareto points

-Performancen ≤ m

Row 1 Col 1

Row 2 Col 2

Row 1 Col 1

Row 3 Col 3

Row n Col n

bitstream pair'n' unique pairs fr

eq.

f1

f2

f3

f2

f4

...

...

Implemented on a ML605 Development Board

Page 23: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

325

330

33550 60 70 80 90 100 110

0.2

0.25

0.3

0.35

fps

psnr(dB)

Results according to NH (coefficient bitw idth)

Energ

y p

er

fram

e(m

J)

325

330

335

50 60 70 80 90 100 110

0.2

0.25

0.3

0.35

fps

Frame size:352x288

psnr(dB)

Energ

y p

er

fram

e(m

J)

325

330

335

50

100

150

0.2

0.25

0.3

0.35

fps

Frame size:352x288

psnr(dB)

Epf(

mJ)

N = 24

N = 20

N = 16

N = 12

N = 8 325330

335

50

100

150

0

0.2

0.4

fps

Results according to NH (coefficient bitw idth)

psnr(dB)

Energ

y p

er

fram

e(m

J)

NH = 16

NH = 12

NH = 10

LE, HP1, LA

HA, HE, LP

OB=8

U1

HP2

HP3

HP4

HP5U2,U3U4

U5 U6

HA, HE, LP

LE, HP1, LA

2D Separable FIR Filter (4/7)Multi-objective optimization of the Energy-Performance-Accuracy space:

2D Filter Parameters: N (# of coefficients),NH (# of bits per coefficients), OB (# of bits per output pixel),

Results: Low-pass Gaussian Filter, x=y=1.5 (symmetric coefficients).Image: Lena (CIF: 352x288).HA: highest-accuracy. HP: highest-performance,LE: lowest energy -pi

0

pi

-pi

0

pi0

0.5

1

1.5

x

Gaussian f ilter - x=

y=1.5

y

Magnitu

de

Page 24: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

2D Separable FIR Filter (5/7)Multi-objective optimization of the EPA space

320

325

330

335

20

40

60

80

100

0.2

0.25

0.3

0.35

0.4

fps

Results according to NH (coefficient bitw idth)

psnr(dB)

Energ

y p

er

fram

e(m

J)

320

325

330

335

20

40

60

80

100

0.2

0.25

0.3

0.35

0.4

fps

Frame size:352x288

psnr(dB)

Energ

y p

er

fram

e(m

J)

320

330

340

0

50

100

0.1

0.2

0.3

0.4

0.5

fps

Frame size:352x288

psnr(dB)

Energ

y p

er

fram

e(m

J)

N = 32

N = 24

N = 20

N = 16

N = 12

N = 8

OB=8

LE, HP, LA

HA, HE

(a) (b)

325330

335

50

100

150

0

0.2

0.4

fps

Results according to NH (coefficient bitw idth)

psnr(dB)

Energ

y p

er

fram

e(m

J)

NH = 16

NH = 12

NH = 10

HP

HP

LP

LP

332fps, 23.6dB, 0.155mJ

332fps, 23.6dB, 0.155mJ HA, HE

LE, HP, LA

324

326

328

330

332

334

2030

4050

6070

0.2

0.25

0.3

0.35

fps

Frame size:352x288

psnr(dB)

Energ

y p

er

fram

e(m

J)

324

326

328

330

332

334

2030

4050

6070

0.2

0.25

0.3

0.35

fps

Results according to NH (coefficient bitw idth)

psnr(dB)

Energ

y p

er

fram

e(m

J)

325

330

335

50

100

150

0.2

0.25

0.3

0.35

fps

Frame size:352x288

psnr(dB)

Epf(

mJ)

N = 24

N = 20

N = 16

N = 12

N = 8

OB=8332fps, 29.6dB, 0.156mJLE, HP, LA

326fps, 67dB, 0.323mJHA, HE, LP

325330

335

50

100

150

0

0.2

0.4

fps

Results according to NH (coefficient bitw idth)

psnr(dB)

Energ

y p

er

fram

e(m

J)

NH = 16

NH = 12

NH = 10

HP

HP

LP

LP

HA, HE, LP

LE, HP, LA

Low-pass Gaussian Filter: x=4,y=2

Diff. of Gaussians (DoG) filter: 1=2,2=4

-pi

0

pi

-pi

0

pi0

0.5

1

x

Gaussian f ilter - x=4,

y=2

y

Magnitu

de

-pi

0

pi

-pi

0

pi

0

0.5

x

DoG - 1=2,

2=4

y

Magnitu

de

Page 25: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

2D Separable FIR Filter (6/7)Dynamic EPA Management (1st example): Applied on the Pareto front of the DoG filter .Video: foreman’(CIF: 352x288)

Dynamic management basedon user constraints:Time-Varying constraints are Provided at run-time by the user

Dynamic management basedon output frames:Metric: Percentage change (T) ofthe filter output (betweenconsecutive frames). Large scenesare associated with large values of T

Right side: State diagram.Low accuracy: no significantvariation in T.Medium accuracy: some variation in T.High accuracy: large variation in T.

Left side: Detection of Scene changes, especiallyFrames 185 to 191.

* to appear in ACM Transactions onReconfigurable Technologyand Systems

0 50 100 150 200 250 3000

0.01

0.02

0.03

0.04

0.05

0.06

Diff

ere

nce o

f sum

of absolu

te v

alu

es

Frame #

0 50 100 150 200 250 30040

50

60

70

PS

NR

(dB

)

0 50 100 150 200 250 30020

30

40

50

60

70

80

90

Frame #

PS

NR

(dB

)

2030405060708090

0.2

0.25

0.3

0.35

0.4

psnr(dB)

Energ

y p

er

fram

e(m

J)

N=20, NH=10, OB=8

N=32, NH=16, OB=16

N=24, NH=16, OB=16

N=8, NH=12, OB=8

N=24, NH=10OB=16

Energy per frameAccuracy (psnr)

0.3 mJ45dB

0.3mJmax

min--

min65dB

--max

avg=49.65

std=0.0977

avg=61.14

std=0.1425

avg=23.56std=0.18

avg=66.18std=0.5568

avg=88.45std=0.0475

frame # 30

frame # 90

frame # 150

frame # 215

frame # 270

frame # 31

frame # 191

frame # 260

frame # 245

Threshold

fram

e #

185

fra

me

# 1

91

30 framesbelow T=0.01

10 framesabove T=0.01

15 framesbelow T=0.01

MEDIUMACCURACY

62.33 dB

0.281 mJ

LOWACCURACY

49.58 dB

0.214 mJ

HIGH ACCURACY

65.82 dB

0.346 mJ

START15 frames

above T=0.01

If the conditions do not hold,

stay in the current state

Page 26: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

2D Separable Complex FIR Filter (7/7)Here, the input image is complex and the filter coefficients are complex. These types of filters are very useful in AM-FM decompositions for image analysis applications.

Xr B

E

N NH

[NO

NQ

]

OP

SYMMETRYCOEFFICIENTS

(from text file)

L

sclr

B

imag

v

COEFFICIENTS (real)SYMMETRY (real)

E sclr v

1D FIR core

1D FIR core

1D FIR core

1D FIR core

YrNO

NO Yi

-

+

Xi BCOEFFICIENTS (imag)SYMMETRY (imag)

E sclr v

COEFFICIENTS (imag)SYMMETRY (imag)

E sclr v

COEFFICIENTS (real)SYMMETRY (real)

E sclr v

realreal imag

205

210

215

4060

80100

0.2

0.25

0.3

0.35

fps

Results according to "N" (coefficient bitw idth)

psnr(dB)

Energ

y p

er

fram

e(m

J)

205

210

215

0

50

100

0

0.2

0.4

fps

Results according to "N" (coeff icient bitw idth)

psnr(dB)

Energ

y p

er

fram

e(m

J)

N=31

N=29

N=23

N=19

N=15

N=11

N=9

N=7

205

210

215

0

50

100

0

0.2

0.4

fps

Results according to "N" (coeff icient bitw idth)

psnr(dB)

Energ

y p

er

fram

e(m

J)

N=31

N=29

N=23

N=19

N=15

N=11

N=9

N=7

OB=8209.6fps, 98dB, 0.289mJ

N=31,NH=16, OB=16 HA, HE

LE, HP

213.3fps, 47dB, 0.173mJN=9, NH=10, OB=8

Frequency: 100 MHz

205

210

21540

6080

100

0.2

0.25

0.3

0.35

fps

psnr(dB)

Energ

y p

er

fram

e(m

J)

Energy per frameAccuracy (psnr)

min50dB

0.25mJ60dB

0.25mJmax

--max

N=31, NH=16, OB=1698dB, 0.289mJ

N=29,NH=12,OB=1690.68dB, 0.2463mJ

N=15, NH=10,OB=1662.11dB, 0.2093mJ

N=7, NH=10, OB=1651.47dB, 0.1821mJ

0 50 100 150 200 250 30050

60

70

80

90

100

Frame #

PS

NR

(dB

)

avg=58.78

std=0.3348

avg=66.56

std=0.4168

avg=93.47

std=0.0631 avg=98.22std=0.0578

frame # 190

frame # 280frame # 120

frame # 30

Results: Gabor complex filter, video sequence: news (CIF: 352x288),.* Published in 2012 IEEE International Conference on Field Programmable Logic and Applications

2D Gabor at 45

Page 27: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Research Areas (1/5)RECONFIGURABLE COMPUTING:

Run-Time Reconfigurable Architectures under Time-Varying Constraints

Automatic generation of time-varying constraints. Self-aware Computing and Self-adaptive techniques. Design Space Exploration for large multi-objective spaces.

Advanced Topics on Computer Architecture Fully-pipelined architectures for: Signal, Image, and Video

Processing: Discrete Cosine Transforms, 1D/2D Filterbanks. Non-standard numerical representations: Trade-offs between double

floating point precision and fixed-point precision. Specialized architectures for CORDIC (trigonometric, linear, and

hyperbolic functions), square root, fast division, multiplication. Use of non-standard numerical representations.

LUT-based design

Page 28: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Research Areas (2/5)Example: Radon Transform:

• 7x7 Radon Transform core. Presented in SSIAI’2014 and ICIP’2014 conferences

RO(6,4)

RI(6,5)

RI(5,5)

RI(4,5)

RI(3,5)

RI(2,5)

RI(1,5)

RI(0,3)

RO(0,0)RO(0,0)

RO(0,0)

E

RI(0,0) RI(0,1) RI(0,2) RI(0,4) RI(0,5) RI(0,6)

X(0) X(1) X(2) X(3) X(4) X(5) X(6)

E E E E E E

RO(0,1)RO(1,1)

RO(0,2)RO(2,2)

RO(0,3)RO(3,3)

RO(0,4)RO(4,4)

RO(0,5)RO(5,5)

RO(0,6)RO(6,6)

RO(1,1)RO(0,1)

E

RI(1,0) RI(1,1) RI(1,2) RI(1,3) RI(1,4) RI(1,6)

E E E E E E

RO(1,2)RO(1,2)

RO(1,3)RO(2,3)

RO(1,4)RO(3,4)

RO(1,5)RO(4,5)

RO(1,6)RO(5,6)

RO(1,0)RO(6,0)

RO(2,2)RO(0,2)

E

RI(2,0) RI(2,1) RI(2,2) RI(2,3) RI(2,4) RI(2,6)

E E E E E E

RO(2,3)RO(1,3)

RO(2,4)RO(2,4)

RO(2,5)RO(3,5)

RO(2,6)RO(4,6)

RO(2,0)RO(5,0)

RO(2,1)RO(6,1)

RO(3,3)RO(0,3)

E

RI(3,0) RI(3,1) RI(3,2) RI(3,3) RI(3,4) RI(3,6)

E E E E E E

RO(3,4)RO(1,4)

RO(3,5)RO(2,5)

RO(3,6)RO(3,6)

RO(3,0)RO(4,0)

RO(3,1)RO(5,1)

RO(3,2)RO(6,2)

RO(4,4)RO(0,4)

E

RI(4,0) RI(4,1) RI(4,2) RI(4,3) RI(4,4) RI(4,6)

E E E E E E

RO(4,5)RO(1,5)

RO(4,6)RO(2,6)

RO(4,0)RO(3,0)

RO(4,1)RO(4,1)

RO(4,2)RO(5,2)

RO(4,3)RO(6,3)

RO(5,5)RO(0,5)

E

RI(5,0) RI(5,1) RI(5,2) RI(5,3) RI(5,4) RI(5,6)

E E E E E E

RO(5,6)RO(1,6)

RO(5,0)RO(2,0)

RO(5,1)RO(3,1)

RO(5,2)RO(4,2)

RO(5,3)RO(5,3)

RO(5,4)RO(6,4)

RO(6,6)RO(0,6)

E

RI(6,0) RI(6,1) RI(6,2) RI(6,3) RI(6,4) RI(6,6)

E E E E E E

RO(6,0)RO(1,0)

RO(6,1)RO(2,1)

RO(6,2)RO(3,2)

RO(6,3)RO(4,3)

RO(6,4)RO(5,4)

RO(6,5)RO(6,5)

+++

++

+

+

Addertree

Z(6

)

+

Addertree

Z(5

)

+

Addertree

Z(4

)

+

Addertree

Z(3

)

+

Addertree

Z(2

)

+

Addertree

Z(0

)

+

Addertree

Z(1

)

RO(0,1) RO(0,2) RO(0,3) RO(0,4) RO(0,5) RO(0,6)

RO(1,0) RO(1,1) RO(1,2) RO(1,3) RO(1,4) RO(1,5) RO(1,6)

RO(2,0) RO(2,1) RO(2,2) RO(2,3) RO(2,4) RO(2,5) RO(2,6)

RO(3,0) RO(3,1) RO(3,2) RO(3,3) RO(3,4) RO(3,5) RO(3,6)

RO(4,0) RO(4,1) RO(4,2) RO(4,3) RO(4,4) RO(4,5) RO(4,6)

RO(5,0) RO(5,1) RO(5,2) RO(5,3) RO(5,4) RO(5,5) RO(5,6)

RO(6,0) RO(6,1) RO(6,2) RO(6,3) RO(6,5) RO(6,6)

s

E

Z(0)

Z(1)

Z(N-1)

v

RADON

CORE

FSMEn

E s

2

...

...

...

...

(b)

(a)

(c)

B N

BOX(0)

X(1)

X(N-1)

BO = B + log2N

BO

BO

B

B

B

RO(0,i) RO(1,i) RO(2,i) RO(3,i) RO(4,i) RO(5,i) RO(6,i)

Z(i)

Page 29: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Research Areas (3/5) Example: HEVC (High Efficiency Video Coding Standard) HEVC Intra Prediction Core: Presented in the SSIAI’2014 conference. The HEVC Intra prediction core computes the following modes: Angular (33),

planar (1), and DC (1). Current work: Implementation of HEVC Transform, Scaling, and

Quantization. Future work: Self-reconfigurable hardware implementation of HEVC Encoder

Sullivan, G., et al, "Overview of the High Efficiency Video Coding (HEVC) Standard", IEEE TCSVT, V.22, No. 12, Dec. 2012

Page 30: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Research Areas (4/5)FPGA-based Embedded Systems Applications: automotive, networking, bioengineering, software-

defined radio, communications, video compression, video filtering. Software-Hardware Co-Design. Dealing with different processors: ARM, MicroBlaze (Xilinx), Nios

(Altera), OpenRISC (Open-source). Development of embedded interfaces for a variety of buses:

Advanced eXtensible Interface (AXI, Xilinx), Avalon switch fabric (ALTERA), Wishbone (open source), etc.

External communication interfaces (SpaceWire, CAN, Ethernet). Current work: Open-Source embedded system that supports DPR

via Wishbone.

Run-Time Reconfiguration on FPGAs Objective: Develop high-speed dynamic reconfiguration controllers: Support for standard buses: AXI, Wishbone. Hardware and software techniques the provide dramatic increases

in reconfiguration speed. Dynamic Frequency Control on FGPAs

Page 31: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Research Areas (5/5)Specialized techniques on FPGAs Low Power techniques Advanced coding mechanism for efficient parameterization using

VHDL and Verilog HDL. Test-bench generation Crossing clock domains

GPU PROGRAMMING:High performance implementation of Digital Signal, Image, and

Video Processing Algorithms.Integration with multi-threaded implementations on CPUsComparisons with FPGA implementations.

EMBEDDED SYSTEMSMicrocontroller-based SystemEfficient architectures for embedded interfacesEmbedded design using System-on-Chip that incorporates analog,

programmable logic, memory, and microcontroller (e.g. Cypress devices)

Page 32: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Teaching PlansECE378: Digital Logic and Microprocessor Design

(Winter 2015) Digital System Design Microprocessor Design in VHDL Digital Synthesis with VHDL Parameterized VHDL coding

ECE495/595: Reconfigurable Computing (Fall 2015) Hardware/Software co-design on FPGAs Self-Reconfigurable systems: Partial Reconfiguration Advanced topics in Computer Arithmetic Applications in: Digital Signal, Image, and Video Processing Communication interfaces (SpaceWire, CAN,

Ethernet)

Reconfigurable Systems

Static Dynamic

Embedded Systems

Applications: DSP,

automotive,

communicationsInte

rfaci

ng

Digital Logic Design

Page 33: A Framework for Run-Time Reconfigurable Computingllamocca/Research/... · A Framework for Run-Time Reconfigurable Computing DANIEL LLAMOCCA Electrical and Computer Engineering Department,

Conclusions A framework was presented for Dynamic Management of

Optimization of Run-Time Reconfigurable Architectures. Theexamples presented (Pixel Processor, 2D FIR Filter) weresuccessfully tested on several standard video sequences.

The results suggest that the general framework can beapplied to a variety of digital systems. This framework willlead to exciting new methods. As an example, consider theautomatic generation of time-varying constraints. Forexample: detection of a scene triggers a requirement forincreased accuracy, a scene remaining still triggers arequirement for a decrease in energy consumption.

The presented work opens up new excitinginterdisciplinary research opportunities (e.g.:automotive applications, design space exploration,automatic constraints generation, pervasive healthcareapplications, low power applications).