Energy Efficient Coarse-Grain Reconfigurable Array for Accelerating Digital Signal Processing

Pasquale Corsonello, Fabio Frustaci, Marco Lanuzza, Stefania Perri, Paolo Zicari
Department of Electronics, Computer Science and Systems (DEIS), University of Calabria, Rende (CS)
Outline

- Motivation
- The proposed Coarse-Grain Reconfigurable Array (CGRA)
  - Architectural overview
  - Computational model
- Post-layout results
- Comparison
- Conclusion
The Challenge

Nowadays, Digital Signal Processing (DSP) is extensively used in several application areas:
- Multimedia
- Image analysis and processing
- Speech processing
- Wireless communication

These applications impose strict hardware requirements:
- High performance
  - Real-time operation
  - High computational load
  - Intensive arithmetic operations (add, sub, shift, mult, mult-acc)
- Energy efficiency
  - Portable devices
- Flexibility
  - Support for multiple applications
  - Match the rapid evolution of the algorithms
Executing DSP on Various Architectures

Reconfigurable computing architectures provide an intermediate trade-off between flexibility and performance.

[Figure: architecture spectrum, from increasing flexibility to increasing performance: General Purpose Processors & Programmable Digital Signal Processors, then Reconfigurable Computing (FPGA and CGRA), then Full Custom Solutions]
Reconfigurable Computing

- FPGAs are very flexible ...
  - Gate-level functions
  - General routing
- ... but the flexibility is very expensive
  - FPGAs are slower than ASICs, have lower logic density, and are inefficient for word-level operations
  - Long reconfiguration time
- CGRAs use multiple-bit-wide PEs and more speed-, area-, and power-efficient routing structures
  - A compromise between programmability and fixed functionality
  - Flexible and efficient within an application domain
Architectural Overview
[Figure: 2D array of Reconfigurable Cells, each pairing a RAM with a PE and connected through latched programmable switches; a central controller handles I/O data and configuration via a Host Interface and an External Memory Interface, exchanging configuration and elaboration data, addresses, and RAM accesses with the cells]
- Distributed small RAMs and a purpose-designed interconnection scheme to achieve high performance
- Run-time reconfigurable cells to achieve high flexibility within the target application domain
- Distributed control logic to reduce control complexity and enhance data parallelism
The Reconfigurable Cell
[Figure: Reconfigurable Cell block diagram: a Dual-Port SRAM (256*8-bit) and an 8-bit PE connected through an Input Stage, an Output Stage, and a RAM Interface; a Control Unit with a Configuration Memory drives the control signals; external ports Data_InA/B_ext, Data_OutA/B_ext, AddrA/B_ext, and Addr_Out_ext]
- I/O interface similar to a conventional RAM
  - 2 input/output data ports
  - 2 input address ports
  - 1 output address port
  - I/O control signals
- Dual-Port SRAM (256*8-bit) data memory
- Reconfigurable 8-bit PE
- Internal Control Unit
- Two operative states
  - Loading
  - Executing
Functionality of the RC in the Executing State

[Figure: four RAM/PE configurations: (a) feed-forward mode; (b) feed-back mode; (c) route-through mode; (d) route-through mode (double throughput)]
The Processing Element
[Figure: PE datapath: A- and B-Registers (8-bit each) feed two 8x4-bit multipliers (MULT1, MULT2), an HA-based 4-bit compressor, a 3:2 FA-based 8-bit compressor, and three adders (4-, 8-, and 4-bit) with pipeline registers; multiplexers driven by configuration signals S0-S7 (S7 = cin; carries co1, co2) select the operation; the result is produced on O[3:0], O[11:4], O[15:12]]

- Single-clock-cycle operations
  - ADD, SUB, ACC, INC, DEC, MUL, MUL-ACC, SHIFT
- Fast and low-cost
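The two 8x4-bit multipliers reflect the standard decomposition of an 8x8 product into two 8x4 partial products, with the high partial product shifted left by four bits before summation (the job of the compressor/adder tree). A minimal Python sketch of that decomposition; the function names are illustrative, not from the design:

```python
def mult_8x4(a: int, b4: int) -> int:
    """8-bit by 4-bit unsigned multiply (what MULT1/MULT2 each compute)."""
    assert 0 <= a < 256 and 0 <= b4 < 16
    return a * b4

def mult_8x8(a: int, b: int) -> int:
    """8x8 unsigned multiply built from two 8x4 partial products.

    The low nibble of b feeds one multiplier, the high nibble the other;
    the high partial product is shifted left by 4 before the final sum
    (performed in the PE by the compressor/adder tree).
    """
    lo = mult_8x4(a, b & 0xF)          # a * b[3:0]
    hi = mult_8x4(a, (b >> 4) & 0xF)   # a * b[7:4]
    return lo + (hi << 4)              # up to 16-bit product

assert mult_8x8(0xAB, 0xCD) == 0xAB * 0xCD
```

Splitting the multiplier this way keeps each partial-product array small and fast, which is consistent with the single-cycle MUL and MUL-ACC operations listed above.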
The Control Unit
[Figure: Control Unit block diagram: Configuration Data fills the Config. Memory; an Instruction Counter and an Instruction Decoder extract the op_code, #ops, and address descriptor fields; the Addresses Generator produces AddrA_int, AddrB_int, and Addr_ext; a Handshake & Elaboration Control block exchanges handshake signals and drives the PE & I/O control signals]
- Instructions define the execution of vector/block operations on a large data stream
- Each instruction consists of several fields:
  - op_code specifies the operation code
  - #ops specifies the number of operations to be performed in the current instruction
  - address descriptors specify the data organization in the memory
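The instruction fields can be pictured as a small record. The sketch below is an illustrative Python model only; the field names, widths, and the shape of the address descriptors are assumptions, not the actual encoding:

```python
from dataclasses import dataclass

@dataclass
class Instruction:
    """Illustrative model of one Control Unit instruction (assumed layout)."""
    op_code: int        # operation to perform (e.g. ADD, MUL, MUL-ACC)
    num_ops: int        # '#ops': number of operations in this instruction
    # One descriptor per data stream: (base address, step, subset, skip),
    # matching the address generator parameters.
    addr_descriptors: list

# Hypothetical instruction: 64 element-wise operations over two
# continuously scanned vectors starting at addresses 0 and 64.
instr = Instruction(op_code=0x3, num_ops=64,
                    addr_descriptors=[(0, 1, 8, 0), (64, 1, 8, 0)])
assert instr.num_ops == 64
```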
The Address Generator
Supported scan modes (n denotes the row stride):
- Continuous vector forward scan (Step=1, Subset=8, Skip=0)
- Sparse vector forward scan (Step=2, Subset=4, Skip=0)
- Rotating vector forward scan (Step=1, Subset=8, Skip=-7)
- Continuous vector reverse scan (Step=-1, Subset=8, Skip=0)
- Sparse vector reverse scan (Step=-2, Subset=4, Skip=0)
- Rotating vector reverse scan (Step=-1, Subset=8, Skip=+7)
- Continuous vector (column mode) forward/reverse scan (Step=n/-n, Subset=8, Skip=0)
- Sparse vector (column mode) forward/reverse scan (Step=2n/-2n, Subset=4, Skip=0)
- Block scan (forward/reverse mode) (Step=1/-1, Subset=3, Skip=n-3/-n+3)
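All of these scan modes can be produced by one simple recurrence: the address advances by Step on every access, and after each group of Subset accesses an extra Skip is added. A Python sketch under that assumption (the exact boundary behavior is inferred from the listed modes, not stated explicitly in the slides):

```python
def address_stream(base: int, step: int, subset: int, skip: int, count: int):
    """Yield 'count' addresses: advance by 'step' on each access; after every
    'subset' accesses additionally add 'skip' (tracked by a down counter)."""
    addr = base
    remaining = subset              # the down counter in the datapath
    for _ in range(count):
        yield addr
        remaining -= 1
        if remaining == 0:          # end_subset: apply skip, reload counter
            addr += step + skip
            remaining = subset
        else:
            addr += step

# Continuous forward scan (Step=1, Subset=8, Skip=0): plain linear addresses.
assert list(address_stream(0, 1, 8, 0, 10)) == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Rotating forward scan (Step=1, Subset=8, Skip=-7): re-reads the vector
# shifted by one element -- useful for sliding-window (FIR-like) kernels.
assert list(address_stream(0, 1, 8, -7, 10)) == [0, 1, 2, 3, 4, 5, 6, 7, 1, 2]
# Block scan (Step=1, Subset=3, Skip=n-3) with row stride n=8: 3-wide blocks.
assert list(address_stream(0, 1, 3, 5, 6)) == [0, 1, 2, 8, 9, 10]
```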
[Figure: address generator datapath: an address_calculation_adder updates the current_address held in addr_register, starting from base_address and adding the contents of step_register or skip_register as selected by a control_signal; a down counter loaded with subset asserts end_subset when it reaches 0]
The Interconnection Topology
[Figure: each cell connects to its eight neighbors (N, NE, E, SE, S, SW, W, NW) through N-bit neighbor interconnections and 2N-bit interleaved interconnections, routed via programmable latched switches]
Application Mapping: Block-Level Pipelining

[Figure: block-level pipelining timeline: the Load and Execute phases of consecutive cells RC(i-1), RC(i), RC(i+1) overlap, so each RAM(i)/PE(i) pair loads new data while the previous cell produces results]
- The computation is organized in concurrently executing kernels
  - Each kernel is implemented by an RC
  - A kernel consumes a set of input data, performs one or more computations, and produces a set of output data
- RCs communicate by sending addressed packets of data
  - Memory data loading of each cell is overlapped with data production in the previous cell
- An execution is performed as soon as all necessary input data are available
  - The data synchronization mechanism is realized by handshake signals
  - No explicit temporal scheduling of execution is required
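This handshake-driven, self-scheduling execution has a simple software analogue: each kernel blocks until its input packet arrives, processes it, and forwards the result, with no global schedule. A toy Python sketch, where blocking queues stand in for the handshake signals and the two stage functions are purely illustrative:

```python
import queue
import threading

def kernel(stage_fn, inbox: queue.Queue, outbox: queue.Queue):
    """One RC: wait for a data packet (handshake), process it, forward it."""
    while True:
        packet = inbox.get()           # blocks until the producer signals data
        if packet is None:             # end-of-stream marker
            outbox.put(None)
            return
        outbox.put(stage_fn(packet))   # downstream load overlaps our next get

# A 2-stage pipeline of illustrative kernels: scale, then offset.
q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=kernel, args=(lambda x: x * 2, q0, q1)),
    threading.Thread(target=kernel, args=(lambda x: x + 1, q1, q2)),
]
for t in threads:
    t.start()
for item in [1, 2, 3, None]:           # feed the stream, then terminate
    q0.put(item)

results = []
while (r := q2.get()) is not None:
    results.append(r)
for t in threads:
    t.join()
print(results)  # [3, 5, 7]
```

As in the array, no stage needs to know the timing of any other: availability of data is the only trigger.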
Application Mapping: Flexible Computational Load Balancing
[Figure: the same four-kernel computation mapped two ways: function-parallel, with kernels PE(1)..PE(4) chained in a pipeline across cells, and data-parallel, with copies of a kernel replicated across cells]
Parallelism is exploited in both the vertical/temporal and horizontal/spatial directions:
- Horizontal computational load balancing is achieved via data parallelism
- Vertical computational load balancing is achieved by increasing the number of pipeline stages
Architecture Evaluation

- Hardware-assisted simulation environment developed on a Xilinx XC4VLX200 device
  - The implemented system includes 64 RCs organized in 4x4 quadrants
  - The number of required clock cycles was precisely evaluated for different DSP benchmarks (YCbCr-to-RGB, 2D-DCT, 2D-FIR)
- Physical evaluation for the ST 90nm CMOS technology
  - Reconfigurable Cell
    - Synthesis done with Synopsys Design Compiler
    - Physical design done with Cadence SoC Encounter, also considering manufacturing (such as DRCs and antennas) and Signal Integrity (SI) issues
  - Interconnections
    - Preliminary electrical simulations were performed
- Obtained results were compared to a 90nm CMOS Virtex-4 FPGA
RC Layout

[Figure: RC layout showing the Dual-Port SRAM (256*8-bit), PE, Configuration Memory, Control Unit, RAM Interface, and Input/Output Stages]

- Technology: CMOS 90nm
- Supply voltage: 1.0 V
- Frequency: 1 GHz
- Core area: 79.52 um2
- Avg. dynamic power @ 1 GHz: 20 mW
- Leakage power: 627.6 uW
Resource Usage/Energy/Performance Trade-off: Proposed Array vs. Xilinx Virtex-4

- Speedups ranging from 4.8X to 8X
- Energy efficiency improvement ranging from 24% to 58%
- Area saving up to 40%

Throughput [MOPS] and energy efficiency [MOPS/W] refer to an 8x8-image block:

Color Space Conversion
  Proposed array: 13 RCs / 1.034 mm2; 13.3 MOPS; 45.9 MOPS/W
  Virtex-4 FPGA (CORE Generator): 436 Slices + 2 BRAM / 1.572 mm2; 1.7 MOPS; 29.1 MOPS/W

2D separable 4x4 FIR
  Proposed array: 20 RCs / 1.590 mm2; 10.5 MOPS; 23.9 MOPS/W
  Virtex-4 FPGA (CORE Generator): 440 Slices + 2 BRAM / 1.657 mm2; 1.3 MOPS; 18.4 MOPS/W

2D-DCT (8x8)
  Proposed array: 22 RCs / 1.749 mm2; 10.2 MOPS; 20.8 MOPS/W
  Virtex-4 FPGA (CORE Generator): 786 Slices + 3 BRAM / 2.919 mm2; 2.1 MOPS; 14.2 MOPS/W
Conclusion

- Presented the VLSI implementation of a new coarse-grain reconfigurable architecture optimized for high-throughput DSP applications
- Performance improvement at low cost
  - Exploits spatial and temporal parallelism
  - High arithmetic processing capability
  - High-bandwidth and low-latency memory access
- Performance/energy/area evaluations for representative tasks belonging to the target application domain
- Obtained results demonstrate significant advantages with respect to a conventional FPGA
  - Speedups ranging from 4.8X to 8X
  - Energy efficiency improvement ranging from 24% to 58%
  - Area saving up to 40%