A Programmable Single Chip Digital Signal Processing Engine
MAPLD 2005
Paul Chiang, MathStar Inc.
Pius Ng, Apache Design Solutions
2 MAPLD 2005/206Chiang
Presentation Outline
• Space-borne signal processing tasks
• FPOA architecture highlights
• Programmability and expandability
• System partition on FPOA device
• Spatial processing - 5x5 filter solution
• Temporal processing - motion estimation
• Internal bus and I/O throughput
• Resource utilization and future expansion
A System of Digital Signal Processing
Input Data -> Data Extraction -> Spatial or Temporal Processing -> Frequency or Time Domain Processing -> Feature Extraction -> Characterization
Data Extraction
• mux/de-mux
• Average filter
• min/max select
Spatial or Temporal Processing
• spatial edge filter
• temporal difference filter
Frequency or Time Domain Processing
• time domain low/high/bandpass filter
• frequency transformation
• frequency domain low/high/bandpass filter
Feature Extraction
• apply equation that defines feature
• checking threshold
Characterization
• analyze and characterize signals
Processing Requirements
• High computation requirement for the basic operations add/sub and mul/mac
• Mixed control functions such as loop control and decision making
• High I/O bandwidth to balance processing against data input/output
• Large and fast temporary memory space to facilitate real-time processing
• Fast, programmable, direct data transfer to enable massively parallel processing
FPOA Architecture Summary
• Heterogeneous Array of 16-bit Silicon Objects
  - MAC, ALU, Truth Tables, Register File, Internal RAM
  - Single Clock Cycle Execution for All Objects
• Homogeneous 2-Layer Programmable Interconnect Mesh
• Tightly Integrated Data and Control Flow
• Integrated DDRII RLDRAM & SRAM Controllers
• High Speed I/O at Device Boundaries: SerDes, LVDS, HSTL
Reconfigurable Interconnect Network
• Each link consists of 16 data bits, 1 valid bit, and 4 separate control bits
• Nearest Neighbors
  - Range = 1 (N/E/S/W + diagonal)
• Party Lines
  - Single cycle range = hop to 3 (skip 2) @ 1 GHz
  - Extra clock cycles for digital retiming
  - 1 extra: 25-object neighborhood
  - More clock cycles: entire chip
FPOA Solution
• Four GPIO ports with 44-bit I/O at 100 MHz, that is, 17.6 gigabits per second
• Two 250 MHz DDR 32-bit external memories with 32 gigabits per second of bandwidth
• 400 Silicon Objects running at 1 GHz
  - ALU: add/sub and combinational logic
  - MAC: mul/mac
  - Register File (RF): fast distributed data storage
  - Internal RAM (IRAM): intermediate data storage
• Party lines and muxes to support a flexible internal bus as well as dedicated connections
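The two bandwidth figures above follow from ports x width x clock x transfers per clock. The short sketch below reproduces that arithmetic; the helper name `aggregate_gbps` is hypothetical, and it assumes "DDR" means two transfers per clock and that the quoted numbers are peak aggregate rates.

```python
def aggregate_gbps(ports, width_bits, clock_mhz, transfers_per_clock=1):
    """Peak aggregate bandwidth in gigabits per second (hypothetical helper)."""
    return ports * width_bits * clock_mhz * 1e6 * transfers_per_clock / 1e9


# Four 44-bit GPIO ports at 100 MHz, one transfer per clock (assumed).
gpio_gbps = aggregate_gbps(ports=4, width_bits=44, clock_mhz=100)

# Two 32-bit DDR memories at 250 MHz, two transfers per clock (assumed).
xram_gbps = aggregate_gbps(ports=2, width_bits=32, clock_mhz=250,
                           transfers_per_clock=2)

print(f"GPIO: {gpio_gbps:.1f} Gbit/s")
print(f"External memory: {xram_gbps:.1f} Gbit/s")
```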
Example FPOA Partition
[Block diagram of the example partition. Blocks shown: XRAM I/F, INT BUS Controller, Local Bus I/F, Host Local Bus, A/D IF, A/D, TempIRAM_0, TempIRAM_1, Spatial Processor, Temporal Processor, Data Selection Logic, Data Realignment, Feature Extraction, Spatial/Temporal Controller]
5x5 Convolution Filter
• Apply the filter operation to a 2D data array, D[0:m-1, 0:n-1], with a 5x5 2D mask, W[0:4, 0:4]
for i = 2; i < m - 2; i++
  for j = 2; j < n - 2; j++
    temp = 0;
    for k = -2; k < 3; k++
      for l = -2; l < 3; l++
        temp = D[i+k, j+l] * W[k+2, l+2] + temp
      end_of_l
    end_of_k
    Y[i, j] = temp;
  end_of_j
end_of_i
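A minimal runnable sketch of the same filter in Python, using plain nested lists. The function name `convolve_5x5` is hypothetical. The loops cover every sample that has a full 5x5 neighborhood, and leaving the two-sample border at zero is an assumption rather than something the slide specifies.

```python
def convolve_5x5(D, W):
    """Apply the 5x5 mask W to the 2D array D.

    Samples without a full 5x5 neighborhood (a two-sample border)
    are left at 0, which is an assumption of this sketch.
    """
    m, n = len(D), len(D[0])
    Y = [[0] * n for _ in range(m)]
    for i in range(2, m - 2):
        for j in range(2, n - 2):
            acc = 0
            for k in range(-2, 3):
                for l in range(-2, 3):
                    acc += D[i + k][j + l] * W[k + 2][l + 2]
            Y[i][j] = acc
    return Y


# Usage: an identity mask (only the center tap is 1) copies interior samples.
D = [[r * 8 + c for c in range(8)] for r in range(8)]
W = [[0] * 5 for _ in range(5)]
W[2][2] = 1
Y = convolve_5x5(D, W)
print(Y[3][4] == D[3][4])
```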
Computation Requirements
• Assuming an m by n 2D data array and a 5x5 mask, there are 25 multiply-and-add (MAC) operations for each filtered sample
• The whole convolution filter operation requires 25 * m * n MAC operations
• With standard 720x480 image data at 30 frames per second, the convolution filter operation requires 259 MMAC per second
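The MAC rate can be checked with a few lines of arithmetic. The sketch below uses a hypothetical helper, `mmac_per_second`, and, like the estimate above, counts 25 MACs for every sample without subtracting the skipped border.

```python
MASK_TAPS = 5 * 5  # MAC operations per filtered sample for a 5x5 mask


def mmac_per_second(width, height, fps, taps=MASK_TAPS):
    """Millions of MAC operations per second for a full-frame filter."""
    return taps * width * height * fps / 1e6


# 720x480 frames at 30 frames per second.
print(mmac_per_second(720, 480, 30))
```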
Data Storage
• 2D data storage in a 1D linear memory where four 16-bit words can be accessed concurrently
• Example of an 8x8 2D matrix stored in a 1D memory
8x8 matrix:
D00 D10 D20 D30 D40 D50 D60 D70
D01 D11 D21 D31 D41 D51 D61 D71
D02 D12 D22 D32 D42 D52 D62 D72
D03 D13 D23 D33 D43 D53 D63 D73
D04 D14 D24 D34 D44 D54 D64 D74
D05 D15 D25 D35 D45 D55 D65 D75
D06 D16 D26 D36 D46 D56 D66 D76
D07 D17 D27 D37 D47 D57 D67 D77
1D memory:
Address (hex)  Data
0x0001         D00 D01 D02 D03
0x0002         D04 D05 D06 D07
0x0003         D10 D11 D12 D13
0x0004         D14 D15 D16 D17
• • •
0x000E         D70 D71 D72 D73
0x000F         D74 D75 D76 D77
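One way to express this layout in code is to map each element to a memory address and a lane within the four-word access. This is a sketch under assumptions: the helper names are hypothetical, `i` is the first digit and `j` the second digit of the `Dij` labels, and the starting address is left as the parameter `base` because it is an assumption here. With `base=0`, D70 to D73 land at 0x000E, as in the last rows shown.

```python
WORDS_PER_ACCESS = 4  # four 16-bit words are read per memory access
N = 8                 # the example matrix is 8x8


def locate(i, j, base=0):
    """Return (address, lane) of element D[i][j] in the 1D memory."""
    linear = i * N + j
    return base + linear // WORDS_PER_ACCESS, linear % WORDS_PER_ACCESS


def addresses_for_run(i, j_start, count, base=0):
    """Distinct addresses touched when reading D[i][j_start : j_start + count]."""
    return sorted({locate(i, j, base)[0] for j in range(j_start, j_start + count)})


print(hex(locate(7, 0)[0]))        # address holding D70..D73 when base is 0
print(addresses_for_run(1, 0, 5))  # five consecutive samples span two accesses
```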
Data Access Analysis
• Samples are stored in external memory, which has slower access speed
• Maximize data bandwidth by accessing 4 words at a time
• Use Register Files to store weights and sample data so that they can be reused repeatedly without going out to external memory
• Perform the calculation on 4 pixels concurrently, rotating coefficients and samples so that the products accumulate into the convolution sums
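The idea of computing four outputs at once while reusing each fetched coefficient can be sketched in software with four accumulators. This is only an illustration of the reuse pattern, not the FPOA's actual rotation schedule. The function names are hypothetical, and the choice to step the four outputs along the first index is an assumption.

```python
def conv_single(D, W, i, j):
    """Reference: one 5x5 convolution output centered at D[i][j]."""
    return sum(D[i + k][j + l] * W[k + 2][l + 2]
               for k in range(-2, 3) for l in range(-2, 3))


def conv4_concurrent(D, W, i0, j):
    """Compute the outputs centered at D[i0..i0+3][j] with four accumulators."""
    acc = [0, 0, 0, 0]
    for k in range(-2, 3):
        for l in range(-2, 3):
            w = W[k + 2][l + 2]        # one coefficient read, shared by all lanes
            for lane in range(4):      # the four lanes stand in for four MACs
                acc[lane] += D[i0 + lane + k][j + l] * w
    return acc


# Usage: the four-lane result matches four independent reference outputs.
D = [[(3 * r + 5 * c) % 11 for c in range(10)] for r in range(10)]
W = [[(r + 2 * c) % 4 for c in range(5)] for r in range(5)]
print(conv4_concurrent(D, W, 2, 2) == [conv_single(D, W, 2 + t, 2) for t in range(4)])
```

Each coefficient is read once per tap and used four times, which is the kind of reuse the Register Files make possible.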
Data Processing Analysis
[Figure: four multipliers (MUL), each fed by a sequence of samples (D00 to D33) and coefficients (W00 to W33), drive an adder tree that produces Y22, Y32, Y42, and Y52 within the output array Y00 to Y59]
Note 1: With a 5x5 filter, the first two rows and columns are skipped.
Note 2: The sequence pattern of samples and coefficients is for the concurrent calculation of Y22, Y32, Y42, and Y52.
FPOA Solution
• Temporary data storage: 5 RFs, 3 ALUs
• Data access control: 3 ALUs
• Multiplier: 4 MACs
• Adder Tree: 9 ALUs
• Temporary Results: 2 RFs, 1 IRAM, 2 ALUs
[Block diagram. Blocks shown: Input Samples, Sample RF, Coef. RF, four MACs, Adder Tree, Temporary Results, Results, Control Logic, Data Access Control]
5x5 Convolution Filter Performance
• FPOA Resources
  - ALU: 17
  - RF: 7
  - MAC: 4
  - IRAM: 1
  - Total: 28 SOs + 1 IRAM
• Data throughput
  - 20 results every 125 cycles
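The throughput figure can be related to the MAC count with a little arithmetic: 20 results at 25 MACs each is 500 MACs, which is what 4 MAC objects can issue in 125 cycles. The sketch below works this out; the helper names are hypothetical, and the 1 GHz clock is an assumption taken from the Silicon Object clock rate quoted earlier in the deck.

```python
def mac_occupancy(results, cycles, taps=25, mac_units=4):
    """Fraction of available MAC issue slots used by the filter."""
    return results * taps / (cycles * mac_units)


def results_per_second(results, cycles, clock_hz):
    """Filtered samples produced per second at the given clock rate."""
    return results * clock_hz / cycles


print(mac_occupancy(20, 125))                   # fraction of MAC slots in use
print(results_per_second(20, 125, 1e9) / 1e6)   # million results/s at an assumed 1 GHz
print(720 * 480 * 30 / 1e6)                     # million results/s needed for 720x480 at 30 fps
```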
Motion Estimation
• Identify the movement of a similar pattern over time
• The main computation involves calculating the sum of absolute differences (SAD) between two 8x8 blocks, i.e., X[0:7, 0:7] and Y[0:7, 0:7]
sum = 0;
for i = 0 to 7
  for j = 0 to 7
    temp = X[i, j] - Y[i, j]
    sum = sum + abs(temp)
  end_of_j
end_of_i
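The same computation as a minimal runnable Python sketch, using nested lists for the two 8x8 blocks. The function name `sad_8x8` is hypothetical.

```python
def sad_8x8(X, Y):
    """Sum of absolute differences between two 8x8 blocks."""
    total = 0
    for i in range(8):
        for j in range(8):
            total += abs(X[i][j] - Y[i][j])
    return total


# Usage: identical blocks give 0; a constant offset of 2 gives 2 * 64.
A = [[3] * 8 for _ in range(8)]
B = [[1] * 8 for _ in range(8)]
print(sad_8x8(A, A), sad_8x8(A, B))
```

In block-matching motion estimation, the candidate block with the smallest SAD is usually taken as the best match.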
SAD Computation Dataflow
• 3 cycles throughput
• Generates two partial sums of positive differences
[Dataflow diagram: samples of block X (X00 to X77) and block Y enter a Compare stage; where X > Y the "Sub Y & Add" path is taken, and where Y > X the "Sub X & Add" path; four C_S_A units and an adder tree combine the partial sums into the SAD output]
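The "two partial sums of positive differences" idea can be sketched in software: rather than taking an absolute value, compare each pair and add the positive difference to one of two running sums. This models only the arithmetic, not the hardware pipeline or its timing, and the function name is hypothetical.

```python
def sad_two_partial_sums(X, Y):
    """Return (sum of X - Y where X > Y, sum of Y - X where Y > X)."""
    sum_x_minus_y = 0
    sum_y_minus_x = 0
    for x_row, y_row in zip(X, Y):
        for x, y in zip(x_row, y_row):
            if x > y:
                sum_x_minus_y += x - y   # "Sub Y & Add" path
            elif y > x:
                sum_y_minus_x += y - x   # "Sub X & Add" path
    return sum_x_minus_y, sum_y_minus_x


# Usage: adding the two partial sums gives the SAD.
X = [[(r * c) % 7 for c in range(8)] for r in range(8)]
Y = [[(r + c) % 5 for c in range(8)] for r in range(8)]
pos, neg = sad_two_partial_sums(X, Y)
reference = sum(abs(x - y) for xr, yr in zip(X, Y) for x, y in zip(xr, yr))
print(pos + neg == reference)
```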
SAD Performance
• FPOA Resources
  - ALU: 35
  - RF: 1
  - Total: 36 SOs
• Data throughput
  - 24 cycles per 8x8 block
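A rough sketch of what 24 cycles per block means as a block rate. The helper names are hypothetical and the clock rate is a parameter. Reading the 24 cycles as 8 rows at 3 cycles each is an assumption based on the 3-cycle throughput quoted for the dataflow, not something the deck states.

```python
def cycles_per_block(rows=8, cycles_per_row=3):
    """Cycles per 8x8 block, assuming each row occupies one 3-cycle slot."""
    return rows * cycles_per_row


def sad_blocks_per_second(clock_hz, cycles=24):
    """8x8 SAD blocks evaluated per second at the given clock rate."""
    return clock_hz / cycles


print(cycles_per_block())
print(sad_blocks_per_second(1e9) / 1e6)  # million blocks/s at an assumed 1 GHz
```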
Internal System Bus
• Links all processing modules and the external host to the external system memory for data accesses
• Host-controlled round-robin access from module to module
• User-defined package format to utilize the 16-bit party line and minimize the access overhead
System Bus Implementation
[Block diagram. Blocks shown: Memory Controller, XRAM RLDRAM, Processing Element #1, Processing Element #2, Processing Element #3]
System Bus Performance
• FPOA Resources
  - ALU: 20
• Cycles
  - XRAM read: 4 cycles
  - XRAM write: 4 cycles
  - Module switch: 10 cycles
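The cycle costs above can be combined into a simple model of one round-robin pass over the modules. This is a sketch under stated assumptions, not the bus controller's actual behavior: it assumes a module switch is charged on every visit, that reads and writes are not overlapped, and that each access costs the quoted cycle count. The function name and the request format are hypothetical.

```python
READ_CYCLES = 4     # XRAM read
WRITE_CYCLES = 4    # XRAM write
SWITCH_CYCLES = 10  # module switch


def round_robin_cycles(requests):
    """Total bus cycles for one pass over the modules.

    requests: list of (reads, writes) tuples, one per module, in visiting order.
    """
    total = 0
    for reads, writes in requests:
        total += SWITCH_CYCLES + reads * READ_CYCLES + writes * WRITE_CYCLES
    return total


# Usage: three modules with 1 read, 1 write, and 2 reads plus 2 writes.
print(round_robin_cycles([(1, 0), (0, 1), (2, 2)]))
```

Under this model the switch cost dominates when each module makes only one or two accesses per visit, so batching accesses per visit reduces overhead.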
Performance of an Example Space Satellite Application
• Processing Throughput
  - About 10 million samples per second
• FPOA Resources (% of a device with 400 SOs running at 400 MHz)
  - Cycle utilization: 21%
  - SO utilization: 51%
  - IRAM utilization: 25%
  - XRAM b/w: 49% (100 MHz DDR RLDRAM)