Energy Efficient Coarse-Grain Reconfigurable Array for Accelerating Digital Signal Processing

Pasquale Corsonello, Fabio Frustaci, Marco Lanuzza, Stefania Perri, Paolo Zicari
Department of Electronics, Computer Science and Systems (DEIS), University of Calabria, Rende (CS)
Outline

- Motivation
- The proposed Coarse-Grain Reconfigurable Array (CGRA)
  - Architectural overview
  - Computational model
- Post-layout results
- Comparison
- Conclusion
The Challenge

Nowadays, Digital Signal Processing (DSP) is extensively used in several application areas:
- Multimedia
- Image analysis and processing
- Speech processing
- Wireless communication

These applications impose strict hardware requirements:
- High performance
  - Real-time operation
  - High computational load
  - Intensive arithmetic operations (add, sub, shift, mult, mult-acc)
- Energy efficiency
  - Portable devices
- Flexibility
  - Support for multiple applications
  - Match the rapid evolution of the algorithms
Executing DSP on Various Architectures

Reconfigurable computing architectures provide an intermediate trade-off between flexibility and performance.

[Figure: architecture spectrum, from increasing flexibility to increasing performance: General Purpose Processors & Programmable Digital Signal Processors, then Reconfigurable Computing (FPGA and CGRA), then Full Custom Solutions]
Reconfigurable Computing

- FPGAs are very flexible ...
  - Gate-level functions
  - General routing
- ... but the flexibility is very expensive
  - FPGAs are slower than ASICs, have lower logic density, and are inefficient for word-level operations
  - Long reconfiguration time
- CGRAs use multiple-bit-wide PEs and more speed-, area-, and power-efficient routing structures
  - A compromise between programmability and fixed functionality
  - Flexible and efficient within an application domain
Architectural Overview
[Figure: 2D array of Reconfigurable Cells, each pairing a RAM with a PE and connected through latched programmable switches; a central controller handles I/O data and configuration via a Host Interface and an External Memory Interface, exchanging configuration and elaboration data, addresses, and RAM accesses with the cells]
- Distributed small RAMs and a purpose-designed interconnection scheme to achieve high performance
- Run-time reconfigurable cells to achieve high flexibility within the target application domain
- Distributed control logic to reduce control complexity and enhance data parallelism
The Reconfigurable Cell
[Figure: Reconfigurable Cell block diagram: a Dual-Port SRAM (256*8-bit) and an 8-bit PE connected through an Input Stage, an Output Stage, and a RAM Interface; a Control Unit with a Configuration Memory drives the control signals; external ports Data_InA/B_ext, Data_OutA/B_ext, AddrA/B_ext, and Addr_Out_ext]
- I/O interface similar to a conventional RAM
  - 2 input/output data ports
  - 2 input address ports
  - 1 output address port
  - I/O control signals
- Dual-Port SRAM (256*8-bit) data memory
- Reconfigurable 8-bit PE
- Internal Control Unit
- Two operative states
  - Loading
  - Executing
Functionality of the RC in the Executing State

[Figure: four RAM/PE configurations: (a) feed-forward mode; (b) feed-back mode; (c) route-through mode; (d) route-through mode (double throughput)]
The Processing Element
[Figure: PE datapath: A- and B-Registers (8-bit each) feed two 8x4-bit multipliers (MULT1, MULT2), an HA-based 4-bit compressor, a 3:2 FA-based 8-bit compressor, and three adders (4-, 8-, and 4-bit) with pipeline registers; multiplexers driven by configuration signals S0-S7 (S7 = cin; carries co1, co2) select the operation; the result is produced on O[3:0], O[11:4], O[15:12]]

- Single-clock-cycle operations
  - ADD, SUB, ACC, INC, DEC, MUL, MUL-ACC, SHIFT
- Fast and low-cost
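The two 8x4-bit multipliers reflect the standard decomposition of an 8x8 product into two 8x4 partial products, with the high partial product shifted left by four bits before summation (the job of the compressor/adder tree). A minimal Python sketch of that decomposition; the function names are illustrative, not from the design:

```python
def mult_8x4(a: int, b4: int) -> int:
    """8-bit by 4-bit unsigned multiply (what MULT1/MULT2 each compute)."""
    assert 0 <= a < 256 and 0 <= b4 < 16
    return a * b4

def mult_8x8(a: int, b: int) -> int:
    """8x8 unsigned multiply built from two 8x4 partial products.

    The low nibble of b feeds one multiplier, the high nibble the other;
    the high partial product is shifted left by 4 before the final sum
    (performed in the PE by the compressor/adder tree).
    """
    lo = mult_8x4(a, b & 0xF)          # a * b[3:0]
    hi = mult_8x4(a, (b >> 4) & 0xF)   # a * b[7:4]
    return lo + (hi << 4)              # up to 16-bit product

assert mult_8x8(0xAB, 0xCD) == 0xAB * 0xCD
```

Splitting the multiplier this way keeps each partial-product array small and fast, which is consistent with the single-cycle MUL and MUL-ACC operations listed above.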
The Control Unit
[Figure: Control Unit block diagram: Configuration Data fills the Config. Memory; an Instruction Counter and an Instruction Decoder extract the op_code, #ops, and address descriptor fields; the Addresses Generator produces AddrA_int, AddrB_int, and Addr_ext; a Handshake & Elaboration Control block exchanges handshake signals and drives the PE & I/O control signals]
- Instructions define the execution of vector/block operations on a large data stream
- Each instruction consists of several fields:
  - op_code specifies the operation code
  - #ops specifies the number of operations to be performed in the current instruction
  - address descriptors specify the data organization in the memory
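The instruction fields can be pictured as a small record. The sketch below is an illustrative Python model only; the field names, widths, and the shape of the address descriptors are assumptions, not the actual encoding:

```python
from dataclasses import dataclass

@dataclass
class Instruction:
    """Illustrative model of one Control Unit instruction (assumed layout)."""
    op_code: int        # operation to perform (e.g. ADD, MUL, MUL-ACC)
    num_ops: int        # '#ops': number of operations in this instruction
    # One descriptor per data stream: (base address, step, subset, skip),
    # matching the address generator parameters.
    addr_descriptors: list

# Hypothetical instruction: 64 element-wise operations over two
# continuously scanned vectors starting at addresses 0 and 64.
instr = Instruction(op_code=0x3, num_ops=64,
                    addr_descriptors=[(0, 1, 8, 0), (64, 1, 8, 0)])
assert instr.num_ops == 64
```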
The Address Generator
Supported scan modes (n denotes the row stride):
- Continuous vector forward scan (Step=1, Subset=8, Skip=0)
- Sparse vector forward scan (Step=2, Subset=4, Skip=0)
- Rotating vector forward scan (Step=1, Subset=8, Skip=-7)
- Continuous vector reverse scan (Step=-1, Subset=8, Skip=0)
- Sparse vector reverse scan (Step=-2, Subset=4, Skip=0)
- Rotating vector reverse scan (Step=-1, Subset=8, Skip=+7)
- Continuous vector (column mode) forward/reverse scan (Step=n/-n, Subset=8, Skip=0)
- Sparse vector (column mode) forward/reverse scan (Step=2n/-2n, Subset=4, Skip=0)
- Block scan (forward/reverse mode) (Step=1/-1, Subset=3, Skip=n-3/-n+3)
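All of these scan modes can be produced by one simple recurrence: the address advances by Step on every access, and after each group of Subset accesses an extra Skip is added. A Python sketch under that assumption (the exact boundary behavior is inferred from the listed modes, not stated explicitly in the slides):

```python
def address_stream(base: int, step: int, subset: int, skip: int, count: int):
    """Yield 'count' addresses: advance by 'step' on each access; after every
    'subset' accesses additionally add 'skip' (tracked by a down counter)."""
    addr = base
    remaining = subset              # the down counter in the datapath
    for _ in range(count):
        yield addr
        remaining -= 1
        if remaining == 0:          # end_subset: apply skip, reload counter
            addr += step + skip
            remaining = subset
        else:
            addr += step

# Continuous forward scan (Step=1, Subset=8, Skip=0): plain linear addresses.
assert list(address_stream(0, 1, 8, 0, 10)) == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Rotating forward scan (Step=1, Subset=8, Skip=-7): re-reads the vector
# shifted by one element -- useful for sliding-window (FIR-like) kernels.
assert list(address_stream(0, 1, 8, -7, 10)) == [0, 1, 2, 3, 4, 5, 6, 7, 1, 2]
# Block scan (Step=1, Subset=3, Skip=n-3) with row stride n=8: 3-wide blocks.
assert list(address_stream(0, 1, 3, 5, 6)) == [0, 1, 2, 8, 9, 10]
```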
[Figure: address generator datapath: an address_calculation_adder updates the current_address held in addr_register, starting from base_address and adding the contents of step_register or skip_register as selected by a control_signal; a down counter loaded with subset asserts end_subset when it reaches 0]
The Interconnection Topology
[Figure: each cell connects to its eight neighbors (N, NE, E, SE, S, SW, W, NW) through N-bit neighbor interconnections and 2N-bit interleaved interconnections, routed via programmable latched switches]
Application Mapping: Block-Level Pipelining

[Figure: block-level pipelining timeline: the Load and Execute phases of consecutive cells RC(i-1), RC(i), RC(i+1) overlap, so each RAM(i)/PE(i) pair loads new data while the previous cell produces results]
- The computation is organized in concurrently executing kernels
  - Each kernel is implemented by an RC
  - A kernel consumes a set of input data, performs one or more computations, and produces a set of output data
- RCs communicate by sending addressed packets of data
  - Memory data loading of each cell is overlapped with data production in the previous cell
- An execution is performed as soon as all necessary input data are available
  - The data synchronization mechanism is realized by handshake signals
  - No explicit temporal scheduling of execution is required
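This handshake-driven, self-scheduling execution has a simple software analogue: each kernel blocks until its input packet arrives, processes it, and forwards the result, with no global schedule. A toy Python sketch, where blocking queues stand in for the handshake signals and the two stage functions are purely illustrative:

```python
import queue
import threading

def kernel(stage_fn, inbox: queue.Queue, outbox: queue.Queue):
    """One RC: wait for a data packet (handshake), process it, forward it."""
    while True:
        packet = inbox.get()           # blocks until the producer signals data
        if packet is None:             # end-of-stream marker
            outbox.put(None)
            return
        outbox.put(stage_fn(packet))   # downstream load overlaps our next get

# A 2-stage pipeline of illustrative kernels: scale, then offset.
q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=kernel, args=(lambda x: x * 2, q0, q1)),
    threading.Thread(target=kernel, args=(lambda x: x + 1, q1, q2)),
]
for t in threads:
    t.start()
for item in [1, 2, 3, None]:           # feed the stream, then terminate
    q0.put(item)

results = []
while (r := q2.get()) is not None:
    results.append(r)
for t in threads:
    t.join()
print(results)  # [3, 5, 7]
```

As in the array, no stage needs to know the timing of any other: availability of data is the only trigger.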
Application Mapping: Flexible Computational Load Balancing
[Figure: the same four-kernel computation mapped two ways: function-parallel, with kernels PE(1)..PE(4) chained in a pipeline across cells, and data-parallel, with copies of a kernel replicated across cells]
Parallelism is exploited in both the vertical/temporal and horizontal/spatial directions:
- Horizontal computational load balancing is achieved via data parallelism
- Vertical computational load balancing is achieved by increasing the number of pipeline stages
Architecture Evaluation

- Hardware-assisted simulation environment developed on a Xilinx XC4VLX200 device
  - The implemented system includes 64 RCs organized in 4x4 quadrants
  - The number of required clock cycles was precisely evaluated for different DSP benchmarks (YCbCr-to-RGB, 2D-DCT, 2D-FIR)
- Physical evaluation for the ST 90nm CMOS technology
  - Reconfigurable Cell
    - Synthesis done with Synopsys Design Compiler
    - Physical design done with Cadence SoC Encounter, also considering manufacturing (such as DRCs and antennas) and Signal Integrity (SI) issues
  - Interconnections
    - Preliminary electrical simulations were performed
- Obtained results were compared to a 90nm CMOS Virtex-4 FPGA
RC Layout

[Figure: RC layout showing the Dual-Port SRAM (256*8-bit), PE, Configuration Memory, Control Unit, RAM Interface, and Input/Output Stages]

- Technology: CMOS 90nm
- Supply voltage: 1.0 V
- Frequency: 1 GHz
- Core area: 79.52 um2
- Avg. dynamic power @ 1 GHz: 20 mW
- Leakage power: 627.6 uW
Resource Usage/Energy/Performance Trade-off: Proposed Array vs. Xilinx Virtex-4

- Speedups ranging from 4.8X to 8X
- Energy efficiency improvement ranging from 24% to 58%
- Area saving up to 40%

Throughput [MOPS] and energy efficiency [MOPS/W] refer to an 8x8-image block:

Color Space Conversion
  Proposed array: 13 RCs / 1.034 mm2; 13.3 MOPS; 45.9 MOPS/W
  Virtex-4 FPGA (CORE Generator): 436 Slices + 2 BRAM / 1.572 mm2; 1.7 MOPS; 29.1 MOPS/W

2D separable 4x4 FIR
  Proposed array: 20 RCs / 1.590 mm2; 10.5 MOPS; 23.9 MOPS/W
  Virtex-4 FPGA (CORE Generator): 440 Slices + 2 BRAM / 1.657 mm2; 1.3 MOPS; 18.4 MOPS/W

2D-DCT (8x8)
  Proposed array: 22 RCs / 1.749 mm2; 10.2 MOPS; 20.8 MOPS/W
  Virtex-4 FPGA (CORE Generator): 786 Slices + 3 BRAM / 2.919 mm2; 2.1 MOPS; 14.2 MOPS/W
Conclusion

- Presented the VLSI implementation of a new coarse-grain reconfigurable architecture optimized for high-throughput DSP applications
- Performance improvement at low cost
  - Exploits spatial and temporal parallelism
  - High arithmetic processing capability
  - High-bandwidth and low-latency memory access
- Performance/energy/area evaluations for representative tasks belonging to the target application domain
- Obtained results demonstrate significant advantages with respect to a conventional FPGA
  - Speedups ranging from 4.8X to 8X
  - Energy efficiency improvement ranging from 24% to 58%
  - Area saving up to 40%