COARSE GRAINED RECONFIGURABLE ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION
03/26/2012
OUTLINE
• Introduction
• Motivation
• Network-on-Chip (NoC)
• ASIC based approaches
• Coarse grain architectures
• Proposed Architecture
• Results
INTRODUCTION
• Goal: application specific hybrid coarse grained reconfigurable architecture using NoC
• Purpose: support Variable Block Size Motion Estimation (VBSME)
• First approach: no ASIC and other coarse grained reconfigurable architectures
• Difference: use of intelligent NoC routers; support for full and fast search algorithms
MOTIVATION
H.264
Motion Estimation
Ө(f)=
MOTION ESTIMATION
Previous Frame
Current Frame
Current 16x16 Block
Motion Vector
Search Window
Sum of Absolute Difference (SAD)
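The labels above (current block, search window, motion vector, SAD) can be tied together with a minimal sketch. It assumes frames are plain lists of pixel rows and that the block is square; the function names `sad` and `best_motion_vector` are illustrative and do not come from the slides. The search shown is a plain exhaustive one over the window, not the hardware schedule described later.

```python
def sad(current, reference, ref_row, ref_col):
    """Sum of absolute differences between `current` and the same-sized
    window of `reference` whose top-left corner is (ref_row, ref_col)."""
    return sum(
        abs(cur_px - reference[ref_row + r][ref_col + c])
        for r, row in enumerate(current)
        for c, cur_px in enumerate(row)
    )


def best_motion_vector(current, reference, block_row, block_col, search_range):
    """Test every displacement within +/- search_range of the block position
    and return (lowest SAD, (dy, dx)), or None if no candidate fits.
    Candidates that fall outside the reference frame are skipped.
    Assumes a square block (hypothetical simplification)."""
    size = len(current)
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            row, col = block_row + dy, block_col + dx
            if row < 0 or col < 0 or row + size > height or col + size > width:
                continue
            cost = sad(current, reference, row, col)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best
```

The returned displacement plays the role of the motion vector: the offset inside the search window at which the previous frame best matches the current block.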
SYSTEM-ON-CHIP (SOC)
• Single chip systems
• Common components: microprocessor, memory, co-processor, other blocks
• Increased processing power and data intensive applications
• Facilitating communication between individual blocks has become a challenge
TECHNOLOGY ADVANCEMENT
DELAY VS. PROCESS TECHNOLOGY
NETWORK-ON-CHIP (NOC)
• Efficient communication via use of transfer protocols
• Need to take into consideration the strict constraints of the SoC environment
• Types of communication structure: bus, point-to-point, network
COMMUNICATION STRUCTURES
BUS VS. NETWORK
Bus Pros & Cons | Network Pros & Cons
x Every unit attached adds parasitic capacitance | ✓ Local performance not degraded with scaling
x Bus timing is difficult | ✓ Network wires can be pipelined
x Bus arbitration can become a bottleneck | ✓ Routing decisions are distributed
x Bus testability problematic and slow | ✓ Locally placed BIST is fast and easy
x Bandwidth is limited and shared by all | ✓ Bandwidth scales with network size
✓ Bus latency is wire speed once granted | x Network contention may cause latency
✓ Very compatible | x IPs need smart wrappers
✓ Simple to understand | x Relatively complicated
EXAMPLE
EXAMPLE OF NOC
ROUTER ARCHITECTURE
BACKGROUND
• ME: general purpose processors, ASIC, FPGA and coarse grain; only FBSME; VBSME with redundant hardware
• General purpose processors: can exploit parallelism; limited by the inherent sequential nature and data access via registers
CONTINUED…
• ASIC: no support for all block sizes of H.264; support provided at the cost of high area overhead
• Coarse grained: overcome the drawbacks of LUT based FPGAs; elements with coarser granularity; fewer configuration bits; under utilization of resources
ASIC Approaches
[Chart: ASIC approaches classified by topology (1D systolic array, 2D systolic array) and by SAD accumulation (partial sum, parallel sum).]
• Large number of registers; store partial SADs; area overhead; high latency
• Mesh based architecture; store partial SADs; area overhead; high latency; no VBSME
• Reference pixels broadcasted; SAD computation for each 4x4 block pipelined; each processing element computes pixel difference, accumulates it to the previous partial SAD and sends the computed partial SAD to the next processing element; large number of registers
• All pixel differences of a 4x4 block computed in parallel; reference pixels are reused; direction of data transfer depends on search pattern
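The two SAD accumulation styles named in the chart can be contrasted with a small behavioural sketch. It is only a software model under stated assumptions: `partial_sum_chain` mimics processing elements that each add one pixel difference to the partial SAD received from the previous element, and `parallel_sum` forms all differences at once and reduces them pairwise. The pairwise adder tree is an assumption for illustration; the slides do not specify how the parallel differences are summed.

```python
def partial_sum_chain(current_pixels, reference_pixels):
    """Model a chain of processing elements, one per pixel pair.
    Each element adds its absolute pixel difference to the partial SAD it
    received and forwards the result. Returns the partial SAD leaving
    each element; the last entry is the full SAD."""
    partial = 0
    forwarded = []
    for cur_px, ref_px in zip(current_pixels, reference_pixels):
        partial += abs(cur_px - ref_px)
        forwarded.append(partial)
    return forwarded


def parallel_sum(current_pixels, reference_pixels):
    """Form every pixel difference first, then reduce them level by level
    with pairwise additions (an assumed adder tree)."""
    level = [abs(c - r) for c, r in zip(current_pixels, reference_pixels)]
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd levels so every value has a partner
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0
```

Both functions give the same SAD; the difference is that the chain needs one step per pixel and a stored partial value at each element, while the tree needs about log2(n) addition levels.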
OU’S APPROACH
• 16 SAD modules to process 16 4x4 motion vectors
• VBSME processor: chain of adders and comparators to compute larger SADs
• PE array: basic computational element of SAD module; cascade of 4 1D arrays
• 1D array: 1D systolic array of 4 PEs; each PE computes a 1 pixel SAD
[Figure: SAD Modules. Modules 0 to 15; module i takes current_block_data_i and search_block_data_i and outputs SAD_i and MV_i. Control signals: strip_sel, read_addr_A, read_addr_B, write_addr.]
[Figure: PE Array. 1D arrays 0 to 3 fed by block_strip_A, block_strip_B and current_block_data_i through D elements, with a MUX for SAD; outputs SAD_i and MV_i. Labelled widths: 4 bits, 1 bit, 32 bits.]
[Figure: 1D Array. PEs with D elements and an ACCM block; 32 bit inputs.]
PUTTING IT TOGETHER
• Clock cycle: columns of current 4x4 sub-block scheduled using a delay line; two sets of search block columns broadcasted
• 4 block matching operations executed concurrently per SAD module
• 4x4 SADs -> 4x4 motion vectors: chain of adders and comparators
• 4x4 SADs -> 4x8 SADs -> … 16x16 SADs: chain of adders and comparators
• Drawbacks: no reuse of search data between modules; resource wastage
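The step from 4x4 SADs to the larger block SADs can be sketched as follows. This is a software illustration only: it assumes the sixteen 4x4 SADs of one 16x16 macroblock, for one candidate position, are given as a 4x4 grid, and it uses the seven H.264 partition sizes written as (width, height). The name `merge_sads` is hypothetical. Each larger SAD is summed directly from its 4x4 cells here, whereas a hardware adder chain would reuse intermediate sums.

```python
def merge_sads(sad4x4):
    """sad4x4: 4x4 grid (list of rows) holding the SADs of the sixteen 4x4
    sub-blocks of one macroblock for a single candidate position.
    Returns {(width, height): [SADs in raster order]}."""
    shapes = [(4, 4), (8, 4), (4, 8), (8, 8), (16, 8), (8, 16), (16, 16)]
    merged = {}
    for width, height in shapes:
        w_cells, h_cells = width // 4, height // 4  # size in 4x4 cells
        merged[(width, height)] = [
            sum(
                sad4x4[r + dr][c + dc]
                for dr in range(h_cells)
                for dc in range(w_cells)
            )
            for r in range(0, 4, h_cells)
            for c in range(0, 4, w_cells)
        ]
    return merged
```

For one candidate position this yields 41 SADs in total (16 + 8 + 8 + 4 + 2 + 2 + 1). A comparator per partition would then keep the minimum across candidate positions.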
ALTERNATIVE SOLUTION: COARSE GRAIN ARCHITECTURES
Performance (clock cycles) for frame size M x 0.8M:
ChESS | (M x 0.8M)/256 x 17 x 17
MATRIX | (M x 0.8M)/256 x 17 x 17
RaPiD | 272 + 32M + 14.45M^2
• Resource utilization
• Generic interconnect
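To get a feel for the cycle counts, the expressions can be evaluated for a concrete frame width. The sketch below assumes M is the frame width in pixels; M = 320 is a hypothetical value chosen only because 0.8M is then a whole number (256). One plausible reading, not stated on the slide, is that (M x 0.8M)/256 counts 16x16 macroblocks and 17 x 17 counts candidate positions per macroblock.

```python
def cycles_chess_or_matrix(m):
    """Clock cycles for ChESS and MATRIX: (M x 0.8M) / 256 x 17 x 17."""
    return m * 0.8 * m / 256 * 17 * 17


def cycles_rapid(m):
    """Clock cycles for RaPiD: 272 + 32M + 14.45M^2."""
    return 272 + 32 * m + 14.45 * m ** 2


frame_width = 320  # hypothetical example value
print(round(cycles_chess_or_matrix(frame_width)))
print(round(cycles_rapid(frame_width)))
```

All three expressions grow with M^2, so the comparison between architectures is dominated by the constant in front of the quadratic term.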
PROPOSED ARCHITECTURE
• 2D architecture: 16 CPEs, 4 PE2s, 1 PE3, Main Memory, Memory Interface
• CPE (Configurable Processing Element): PE1, NoC router, Network Interface; current and reference block from main memory
[Figure: 4x4 array of CPE(1,1) to CPE(4,4), each with c_d and r_d inputs; PE 2(1) to PE 2(4); PE 3; Main Memory and Memory Interface (MI). Labelled signals: data_load_control (16 bits), reference_block_id (5 bits), c_d_(x,y) (32 bits), r_d_(x,y) (32 bits). Other labelled widths: 32 bits, 14 bits, 12 bits.]
[Figure: PE1 datapath. Sixteen 8 bit subtractors (numbered 1 to 16), each with a CPR and an RPR, followed by 10 bit adders, 12 bit adders, a COMP block and a REG block. Inputs r_d and c_d; ports To/From NI, To/From East, To/From South; output 4x4 mv.]
NETWORK INTERFACE
[Figure: Network Interface with a control unit, a packetization unit and a depacketization unit; reference_block_id and data_load_control go to the MI.]
[Figure: NoC router with ports PE 1, East, West, North and South; input and output controllers exchanging request and ack signals; a ring buffer (slots 0 to 5, First Index and Last Index); a header decoder.]
• Input controller: receives packets from NI/adjacent router
• Ring buffer: stores packets
• Header decoder: XY routing protocol; extracts direction of data transfer from header packet; updates number of hops
• Output controller: sends packets to NI or adjacent router
• Input/Output control signals: request, ack
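The XY routing decision made by the header decoder can be sketched in a few lines. This is a generic XY routing model, not the deck's exact header format: router coordinates as (x, y) pairs, y growing toward South, and the port name "Local" for delivery to the attached NI are all assumptions made for the example.

```python
# Coordinate change for each outgoing port (assumes y grows toward South).
STEP = {"East": (1, 0), "West": (-1, 0), "South": (0, 1), "North": (0, -1)}


def xy_next_port(current, destination):
    """Pick the output port at router `current` for a packet headed to
    `destination`: correct the X coordinate first, then the Y coordinate."""
    cx, cy = current
    dx, dy = destination
    if dx > cx:
        return "East"
    if dx < cx:
        return "West"
    if dy > cy:
        return "South"
    if dy < cy:
        return "North"
    return "Local"  # arrived: hand the packet to the network interface


def route(source, destination):
    """Follow a packet hop by hop; returns (ports taken, hop count)."""
    position, hops, ports = source, 0, []
    while True:
        port = xy_next_port(position, destination)
        ports.append(port)
        if port == "Local":
            return ports, hops
        step_x, step_y = STEP[port]
        position = (position[0] + step_x, position[1] + step_y)
        hops += 1
```

Because every router applies the same rule to the destination in the header, no global routing table is needed, and the hop count can be updated as the packet moves.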
NOC ROUTER
[Figure: Router 1 and Router 2, each with an input controller and an output controller, connected by req (1 bit), ack (1 bit) and packet (32 bit) lines. Router 2 checks: Busy? Buffer space available?]
Step 1: Send a message from Router 1 to Router 2
Step 2: Send a 1 bit request signal to Router 2
Step 3: Router 2 first checks if it is busy. If not, it checks for available buffer space
Step 4: Send ack if space available
Step 5: Send the packet
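The request/ack steps above can be modelled as a minimal sketch. The class and function names are hypothetical, the default of six buffer slots is an assumption, and a `deque` stands in for the ring buffer; the real design is hardware, so this only shows the order of checks.

```python
from collections import deque


class Router:
    """Receiving side of the handshake (illustrative model only)."""

    def __init__(self, name, buffer_slots=6):
        self.name = name
        self.busy = False
        self.buffer = deque()  # stands in for the ring buffer
        self.buffer_slots = buffer_slots

    def request(self):
        """Steps 3 and 4: answer a request with ack (True) only if the
        router is not busy and has a free buffer slot."""
        if self.busy:
            return False
        return len(self.buffer) < self.buffer_slots

    def accept(self, packet):
        """Step 5, receiving end: store the packet."""
        self.buffer.append(packet)


def send(receiver, packet):
    """Steps 2 to 5 from the sender's side. Returns True if delivered."""
    if not receiver.request():  # no ack: sender must retry later
        return False
    receiver.accept(packet & 0xFFFFFFFF)  # packets are 32 bits wide
    return True
```

A sender that gets no ack keeps the packet and retries, which is how contention at a busy or full router turns into added latency rather than lost data.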
PE2 AND PE3
• Adders
• Muxes
• De-muxes
• Comparators
• Registers
FAST SEARCH ALGORITHM
Diamond Search
• 9 candidate search points
• Numbers represent order of processing the reference frames
• Directed edges labeled with data transmission equations derived based on data dependencies
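As a software reference for the 9 point pattern, here is a minimal sketch of the textbook diamond search: a large 9 point diamond is repeated until its centre is the best point, then a small 5 point diamond refines the result. The offsets, the refinement step and the `cost` callback (for example a SAD at displacement (dy, dx)) are assumptions of this sketch; the deck's own processing order and data transmission equations are in the figure and are not reproduced here.

```python
LARGE_DIAMOND = [(0, 0), (0, -2), (0, 2), (-2, 0), (2, 0),
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]  # 9 candidate points
SMALL_DIAMOND = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]


def diamond_search(cost, search_range):
    """cost(dy, dx) returns the matching cost of displacement (dy, dx).
    Returns the displacement the search settles on."""

    def best_around(center, pattern):
        candidates = [(center[0] + py, center[1] + px) for py, px in pattern]
        candidates = [
            c for c in candidates
            if max(abs(c[0]), abs(c[1])) <= search_range
        ]
        # The centre is listed first, so it wins ties and the loop
        # below only moves when the cost strictly decreases.
        return min(candidates, key=lambda c: cost(*c))

    center = (0, 0)
    while True:
        best = best_around(center, LARGE_DIAMOND)
        if best == center:
            break
        center = best
    return best_around(center, SMALL_DIAMOND)
```

Compared with testing every displacement in the window, only a handful of candidates is evaluated per step, at the risk of stopping in a local minimum.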
EXAMPLE
Frame
Macro-block
SAD
CONTINUED…
DATA TRANSFER
Data Transfer between PE1(1,1) and PE1(1,3)
Individual Points
Intersecting Points
DATA LOAD SCHEDULE
OTHER FAST SEARCH ALGORITHMS
Hexagon
Big Hexagon Spiral
FULL SEARCH
CONTINUED…
RESULTS
CONTINUED…