Xilinx DSP 1
Xilinx Core Solutions Group
DSP
Traditional DSP: DSP Processors
Single MAC (Multiply, Add): Sequential Processing
– One MAC (Multiply Accumulate)
– Time-shared
– Performance ceiling
+ Programmable
+ Off-the-shelf, standard part
+ Hardware multiplier
Xilinx DSP
High Performance Alternative: Parallel Processing
Multiple MACs, Parallel Processing (Multiply/Add units repeated side by side)
+ Programmable
+ Off-the-shelf, standard part
+ Many multiplies in one clock cycle!
+ Extend the performance of DSP processors
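The contrast between one time-shared MAC and many parallel MACs can be put into numbers with a minimal sketch. The filter length and clock rate below are hypothetical values chosen for illustration, not figures from the slides.

```python
def cycles_per_output(num_taps: int, num_macs: int) -> int:
    """Clock cycles to form one FIR output when num_taps products share num_macs MAC units."""
    return -(-num_taps // num_macs)  # ceiling division


def max_sample_rate_hz(clock_hz: float, num_taps: int, num_macs: int) -> float:
    """Highest sustainable sample rate for the given clock and MAC count."""
    return clock_hz / cycles_per_output(num_taps, num_macs)


TAPS = 32        # hypothetical filter length
CLOCK_HZ = 80e6  # hypothetical clock rate

# One time-shared MAC: every tap costs a clock cycle.
sequential_rate = max_sample_rate_hz(CLOCK_HZ, TAPS, num_macs=1)
# One MAC per tap: all products are formed in the same clock cycle.
parallel_rate = max_sample_rate_hz(CLOCK_HZ, TAPS, num_macs=TAPS)
```

With these assumed numbers the single MAC is the performance ceiling: the sample rate falls by the number of taps, while the parallel structure keeps pace with the clock.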
Xilinx DSP Solution
• CORE Generator
System-Level Tools
• DSP LogiCOREs
• Tools Integration
Existing Xilinx DSP Design Methodology
[Diagram: CORE Generator, M1, XC4000X/Spartan/Virtex]
CORE Generator
• Parameterize DSP LogiCOREs
• Connect the cores with HDL or schematic
Addition of DSP System Level Tool
DSP system level tools
– Used by all DSP systems engineers
– 100,000 copy installed base
Fit into existing DSP environment
Connect through the CORE Generator SystemLINX interface
[Diagram: System Level Tools, CORE Generator, M1]
Performance
XC4085XL > 10x Faster than 320C6x
[Chart: billions of MACs per second (axis 1 to 5) for 4005XL, 4013XL, 4036XL, 4062XL, 4085XL and 320C6x]
16-bit FIR Filter Benchmark
120 Million Samples per Second, 512-Tap Decimating FIR
[Diagram: 10-bit input registers (REG) feeding banks of 32-tap FIR filters numbered 1, 2, ... 8, an adder tree, and an 18-bit output register]
• 3.8 Billion MACs, > 10 DSP uPs
• 5,120 flip-flops just for the data buffer
• XC4085XL, 150,000 gates
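The benchmark figures can be cross-checked with a few lines of arithmetic. This is only one reading that reproduces the slide's numbers: it assumes the data buffer holds one 10-bit word per tap, and that each input sample costs 32 MACs (the length of one 32-tap section). Neither assumption is stated on the slide.

```python
SAMPLE_RATE_HZ = 120e6       # from the slide: 120 million samples per second
TOTAL_TAPS = 512             # from the slide: 512-tap decimating FIR
DATA_BITS = 10               # from the diagram: 10-bit registers
MACS_PER_INPUT_SAMPLE = 32   # assumption: one 32-tap section's worth of work per input sample

# One flip-flop per stored data bit in the tap delay line.
buffer_flip_flops = TOTAL_TAPS * DATA_BITS

# Multiply-accumulates performed each second.
macs_per_second = SAMPLE_RATE_HZ * MACS_PER_INPUT_SAMPLE

print(f"data buffer: {buffer_flip_flops} flip-flops")
print(f"throughput:  {macs_per_second / 1e9:.2f} billion MACs per second")
```

Under these assumptions the buffer comes to 5,120 flip-flops and the throughput to about 3.8 billion MACs per second, matching the slide.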
Price
[Chart: price per million MACs per second (axis $0.05 to $0.25), lowest cost C6x versus Xilinx XC4000XL]
DSP LogiCOREs Exploit FPGA Architecture
[Diagram: 16-word RAM, F/F]
Matrix of 16 by 1 RAM primitives
– Look-up-table logic
– FIFOs, shift-registers, …
– Multiple small memories
10,000 RAM primitives on a chip
Regular, monolithic, scalable structure
Efficient: 1 - 3 Million MACs per CLB
Distributed RAM & Distributed Arithmetic (DA): Perfect Match
Basic DA Structure Matches XC4000 Architecture
[Diagram: N-bit inputs into 4-input LUTs, followed by ADD or ACC.]
DA Algorithms:
• 4-input look-up tables (LUTs) scaled with adders
• For higher performance, use more LUTs = more parallelism
• Efficiency similar to custom solution
  – Achievable with LUT logic
  – More ASIC gate equivalents
  – More cost effective
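As an illustration of the idea, here is a minimal software sketch of distributed arithmetic for one group of four taps. It is a model of the general technique, not the LogiCORE implementation; the function names are hypothetical, and samples are assumed to be two's-complement integers that fit in the stated bit width.

```python
def build_lut(coeffs):
    """Precompute every partial sum of the coefficients (16 entries for a 4-input LUT)."""
    k = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << k)]


def da_dot(coeffs, samples, bits):
    """Dot product of coeffs and signed samples, processed one bit plane at a time."""
    lut = build_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Bit b of every sample forms the LUT address.
        addr = 0
        for i, x in enumerate(samples):
            addr |= ((x >> b) & 1) << i
        term = lut[addr] << b  # scale the looked-up partial sum by the bit weight
        # The sign bit of a two's-complement word carries negative weight.
        acc += -term if b == bits - 1 else term
    return acc


coeffs = [3, -1, 4, 2]
samples = [5, -7, 100, -128]
result = da_dot(coeffs, samples, bits=8)
```

No multiplier appears anywhere: each step is a table look-up, a shift, and an add, which is why the structure maps onto LUTs followed by an adder or accumulator.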
Common DSP Functions
Filters
– FIR
– IIR
Transforms
– FFT
– DCT
Modulation
– Multipliers
– SIN tables
Basics
– Multiply / add
– Storage
FIR Filter
[Diagram: sample data, N bits wide, enters a delay line K taps long; samples X0, X1, X2, ... are multiplied by coefficients C0, C1, C2, ... and the K products are summed to form the output data]
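The FIR structure can be written as a short reference model. This is a plain direct-form sketch for checking results, with a hypothetical function name; it says nothing about how the cores are built internally.

```python
from collections import deque


def fir_filter(coeffs, samples):
    """Direct-form FIR: each output is the sum of K coefficient-by-sample products."""
    k = len(coeffs)
    delay_line = deque([0] * k, maxlen=k)  # holds the K most recent samples
    outputs = []
    for x in samples:
        delay_line.appendleft(x)  # newest sample at index 0, oldest drops off the end
        outputs.append(sum(c * d for c, d in zip(coeffs, delay_line)))
    return outputs


# An impulse at the input reads the coefficients back out, one per sample.
impulse_response = fir_filter([1, 2, 3], [1, 0, 0, 0])
```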
FIR Filter LogiCOREs
Two Basic Types:
1. Serial Distributed Arithmetic FIR
– SDA FIR - Single Channel
– SDA FIR - Dual Channel
2. Parallel Distributed Arithmetic FIR
Combine basic PDA or SDA FIR cores to solve many problems
SDA FIR Filters
Serial Distributed Arithmetic
• Parallel in, parallel out, bit-serial internally
• All taps processed in parallel
• Full precision through entire core
• One clock cycle required for each data bit
• One additional clock cycle for symmetric filters
EXAMPLE: 10-bit data, 80 taps, symmetrical FIR:
• For a bit level clock = 90 MHz
• Max sample rate = 90 MHz / 11 clks = 8.2 Million samples/sec.
• Process 80 taps every 122 nsec.
• 656 Million MACs, 257 CLBs, 2.55 Million MACs / CLB
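The example's arithmetic follows from the two rules above: one clock per data bit, plus one more for a symmetric filter. A minimal sketch, with a hypothetical function name, that recomputes the figures:

```python
def sda_performance(bit_clock_hz, data_bits, taps, symmetric, clbs):
    """Return (sample rate in Hz, MACs per second, MACs per second per CLB)."""
    clocks_per_sample = data_bits + (1 if symmetric else 0)
    sample_rate = bit_clock_hz / clocks_per_sample
    macs_per_second = sample_rate * taps  # every tap contributes one MAC per sample
    return sample_rate, macs_per_second, macs_per_second / clbs


rate, macs, macs_per_clb = sda_performance(
    bit_clock_hz=90e6, data_bits=10, taps=80, symmetric=True, clbs=257)

print(f"sample rate:   {rate / 1e6:.2f} Msamples/s")
print(f"sample period: {1e9 / rate:.0f} ns")
print(f"throughput:    {macs / 1e6:.0f} million MACs/s")
print(f"efficiency:    {macs_per_clb / 1e6:.2f} million MACs/s per CLB")
```

The unrounded throughput is slightly below the slide's 656 million MACs, which comes from multiplying the rounded 8.2 Msamples/s by 80 taps.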
SDA FIR Properties
For a Given # of Taps:
• Coefficient bit-width determines size: # CLBs = function of D.A. LUT width
• Data bit-width determines max sample rate: one serial clock per bit
• Output data width does not affect CLB count
What to Ask
• Data sample rate
• Number of taps
• Data word width
• Coefficient width
• Coefficient symmetry
• Same input & output sample rate?
Number of CLBs
Serial Distributed Arithmetic FIR Filters, Data Word = Coefficient Size

Taps  Type     5 bit   8 bit  10 bit  12 bit  14 bit  16 bit  18 bit  20 bit
8     Symm         -      33      36      39      42      45      52      55
8     Non          -      46      54      59      64      69      77      85
16    Symm         -      61      69      71      76      81      96     102
16    Non          -      80      95     104     112     123     138     142
24    Symm         -      89     101     108     116     127     146     154
24    Non          -     101     114     127     140     153     174     187
32    Symm         -     107     118     126     137     148     175     182
32    Non          -       -       -       -       -       -       -       -
40    Symm         -       -       -       -       -       -       -       -
40    Non          -       -       -       -       -       -       -       -
48    Symm         -     158     173     187     202     217     246     261
64    Symm         -     197     215     233     250     268     305     323
80    Symm         -     236     257     278     299     320     364     385

Values whose row cannot be identified in this copy: 53; 80; 93; 116 138 154 165 179 191 226 239

Sample rate (MHz), XC4000E-1
      Symm      13.3     8.9     7.3     6.2     5.3     4.7     4.2     3.8
      Non       16.0    10.0     8.0     6.7     5.7     5.0     4.4     4.0
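The sample-rate rows of the table follow the one-clock-per-bit rule. The sketch below regenerates them under the assumption of an 80 MHz bit clock for the XC4000E-1; that clock figure is inferred from the table values and is not stated on the slide.

```python
BIT_CLOCK_MHZ = 80.0  # assumption: inferred from the table, not given in the source
WIDTHS = [5, 8, 10, 12, 14, 16, 18, 20]


def sample_rate_mhz(data_bits, symmetric):
    """One serial clock per data bit, plus one extra clock for a symmetric filter."""
    return BIT_CLOCK_MHZ / (data_bits + (1 if symmetric else 0))


symm_row = [round(sample_rate_mhz(w, True), 1) for w in WIDTHS]
non_row = [round(sample_rate_mhz(w, False), 1) for w in WIDTHS]
```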
Distributed RAM is More Efficient
Build the Time-Skew Buffer with Distributed RAM, not Flip-Flops
[Diagram: a 16 x 1 shift register built from flip-flops (FF FF FF ... FF) takes 16 logic cells; a 16 x 1 shift register built from one 16x1 RAM cell primitive and one FF takes 1 logic cell]
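One way a small RAM can stand in for a chain of flip-flops is to read the oldest bit and overwrite it with the newest at a rotating address. The sketch below models that behaviour. The address-counter scheme and the class name are assumptions for illustration; the slide does not describe the internal addressing.

```python
class RamShiftRegister:
    """Behavioural model of a 16 x 1 shift register built on a 16-word, 1-bit RAM."""

    def __init__(self, depth=16):
        self.ram = [0] * depth  # the 16 x 1 RAM primitive
        self.addr = 0           # assumed address counter, wraps at depth

    def clock(self, bit_in):
        """One clock: read the oldest bit, then overwrite it with the new bit."""
        bit_out = self.ram[self.addr]
        self.ram[self.addr] = bit_in
        self.addr = (self.addr + 1) % len(self.ram)
        return bit_out


reg = RamShiftRegister(16)
bits_in = [(i * 5 + 1) % 3 % 2 for i in range(40)]
bits_out = [reg.clock(b) for b in bits_in]  # bits_in delayed by 16 clocks
```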
For SDA FIR Filters: Xilinx Distributed RAM Uses One Third the Area
[Chart: device size in logic cells (axis 0 to 1600), Xilinx distributed RAM versus block RAM, for SDA FIR filters of 16 taps/16 bits, 16 taps/8 bits, 64 taps/9 bits and 64 taps/16 bits]
Best Device Utilization
Distributed RAM well suited to DSP
PDA FIR Filter Core
Parallel Distributed Arithmetic FIR Filters
• Fully parallel implementation
• All taps processed in parallel (same as SDA)
• All bits processed in parallel
• Up to 100 million samples per second
• 2 billion MACs per 20-tap core
[Diagram: PDA FIR block. Inputs: Data_IN, Clock (CK), Cascade Mid_In (C_M_IN). Outputs: DATA_OUT, Cascade Data_Out (C_D_OUT), Cascade Mid_Out (C_M_OUT).]
PDA FIR Filters
• Parameterized
• Input data: 4 to 24 bits
• Coefficients: 4 to 24 bits
• Symmetric, non-symmetric, negative symmetry
• Output data: 2 to 31 bits
• Taps: 2 to 20 per core
• Automatically trims unused coefficient ROMs
• Supports cascading multiple filter cores
The high data sample rate solution
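Cascading lets a filter longer than 20 taps be split across several cores. The sketch below checks the underlying arithmetic in software: each section filters a suitably delayed copy of the data and the partial sums are added. It is a behavioural model under that assumption, with hypothetical function names, and does not describe the cores' actual cascade ports.

```python
def fir(coeffs, samples):
    """Reference FIR: y[n] = sum of coeffs[k] * samples[n - k]."""
    return [sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(samples))]


def cascaded_fir(coeffs, samples, taps_per_core=20):
    """Split a long filter into sections of at most taps_per_core taps and sum the results."""
    total = [0] * len(samples)
    for start in range(0, len(coeffs), taps_per_core):
        section = coeffs[start:start + taps_per_core]
        # Each later section sees the data delayed by the taps that precede it.
        delayed = ([0] * start + list(samples))[:len(samples)]
        partial = fir(section, delayed)
        total = [t + p for t, p in zip(total, partial)]
    return total


coeffs = list(range(1, 41))                        # hypothetical 40-tap filter
samples = [((7 * i) % 11) - 5 for i in range(100)]
cascaded = cascaded_fir(coeffs, samples)           # two 20-tap sections
```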
CORE Generator Software
LogiCORE:
AllianceCORE: Data Sheets
Web mechanism to download new cores
SystemLINX: ability to call CORE Generator from third-party tools
On-line documentation
CORE Generator Methodology
1. Select a CORE
2. Enter parameters
3. Generate Core
[Diagram: a filter design package supplies the FIR filter recipe and parameters to the DSP CORE Generator, which produces the LogiCORE SDA filter. Example: 32-tap FIR filter, 20 rows by 9 columns, 160 CLBs used. "160 CLB. HOW?"]
DSP CORE Generator Outputs
• Schematic symbol
• VHDL or Verilog HDL instantiation code
• Simulation model
• Design netlist with constraints
Predictable performance regardless of the number of cores
Predictable Size & Performance
• Built for system performance, not benchmarks.
• Generated with RPM (Relationally Placed Macro).
RPM Macro Level and System Level Advantages
• Predictable size
• Close proximity of communicating elements
• Alignment of critical paths
• Accessible I/O signals
• Improves density
• Rapid progress for automatic and manual design methods (1 macro, NOT 100’s of elements!)
• Consistent performance anywhere on the die
• Packing density very high
• Adequate set-up times
Filling a device with Xilinx Cores does not reduce performance
Performance Independent of Core Location
[Diagram: the same core installed in different locations, 80 MHz in each]
Xilinx LogiCOREs deliver the same performance for any placement.
Non-segmented routing FPGAs can’t do this.
Performance Independent of Device Utilization
[Diagram: four cores on one device, 80 MHz each]
Xilinx has performance independent of the number of cores added.
Non-segmented routing FPGAs can’t do this.
Best FPGA Performance: Xilinx is More Predictable
[Chart: speed in MHz (axis 40 to 80) versus number of instances (1, 2, 3, 4, 8) of a 12x12 area-efficient multiplier, Xilinx segmented versus non-segmented routing]
Segmented = More Predictable and Repeatable
Performance Independent of Device Size
[Diagram: 80 MHz on each of three device sizes]
Same performance for a 4005 or 4085.
Non-segmented routing FPGAs can’t do this.
Design Flow
[Diagram: base-band processor. Blocks shown: complex demod (mixer, 4 multipliers, COS and SIN inputs, I and Q paths), low-pass 32-tap decimating FIR (4:1), 48-tap FIR, 4K x 16 RAM. Rates shown: 20 MHz and 5 MHz.]
• Generate each module.
• Use schematic or HDL at a system level.
Implementing the Mixer
This mixer supports sample rates in excess of 85 MHz. It even supports sample rates up to 45.6 MHz using the slowest Xilinx device (E-4).
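A complex mixer built from four multipliers can be modelled in a few lines. The sketch below multiplies the input by (cos - j sin); that sign convention, and the function name, are assumptions for illustration, since the slides show only the COS and SIN inputs, the I and Q paths, and the multiplier count.

```python
import math


def complex_mix(i_in, q_in, cos_lo, sin_lo):
    """Multiply (i_in + j*q_in) by (cos_lo - j*sin_lo) using four real multiplies."""
    i_out = i_in * cos_lo + q_in * sin_lo  # multipliers 1 and 2
    q_out = q_in * cos_lo - i_in * sin_lo  # multipliers 3 and 4
    return i_out, q_out


theta = 0.7  # hypothetical local-oscillator phase in radians
i_out, q_out = complex_mix(3.0, -2.0, math.cos(theta), math.sin(theta))
```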
Joining the Cores
Here VHDL is used to link the cores into a system. Schematic symbols may also be used.
-- The integrator for skipping through the sine table, with forcing constant
skip_value: skip_val
  port map (cb => skip_constant);

skip_integrater: skip_int
  port map (b  => skip_constant,
            s  => skip_integrate,
            l  => GND,
            ce => VCC,
            c  => clk);

form_sine_address:
for i in 0 to 6 generate
  -- Extract the 7 bits required to address the look-up table.
  -- MSB is not used as this represents overflow.
  -- Lower bits are internal precision for the integrator.
  skip_address(i) <= skip_integrate(i+10);
end generate form_sine_address;

sine_table: sine_lut  -- sine wave look-up table
  port map (theta  => skip_address,
            output => sine_wave,
            ctrl   => VCC,  -- select SINE output when high
            c      => clk);
All component declaration and port map code is provided by Coregen.
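The behaviour of the integrator and sine table can be sketched as a software model. The bit slicing follows the VHDL (7 address bits taken from bits 10 to 16 of the integrator, overflow discarded); the table amplitude of 127 and all Python names are assumptions for illustration.

```python
import math

TABLE_BITS = 7      # 7 address bits, as in the generate loop
FRACTION_BITS = 10  # lower bits kept as internal precision
PHASE_MASK = (1 << (TABLE_BITS + FRACTION_BITS)) - 1  # drop the overflow bit

# Assumed table contents: one sine period, scaled to a signed 8-bit range.
SINE_TABLE = [round(127 * math.sin(2 * math.pi * k / 2**TABLE_BITS))
              for k in range(2**TABLE_BITS)]


def nco(skip_constant, num_samples):
    """Step through the sine table by adding skip_constant to the phase each clock."""
    phase = 0
    out = []
    for _ in range(num_samples):
        address = (phase >> FRACTION_BITS) & (2**TABLE_BITS - 1)
        out.append(SINE_TABLE[address])
        phase = (phase + skip_constant) & PHASE_MASK
    return out


one_entry_per_clock = nco(1 << FRACTION_BITS, 2**TABLE_BITS)
quarter_steps = nco(32 << FRACTION_BITS, 4)  # addresses 0, 32, 64, 96
```

In this model a larger skip constant walks the table faster and so raises the output frequency, which is the role the forcing constant plays in the VHDL.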
Power Dissipation Advantage
Often the Limiting Factor in DSP
Xilinx advantage over competitive FPGAs
– Segmented routing is essential in DSP applications
– Altera runs 3X HOTTER than Xilinx!
Xilinx advantage over DSP processors:
– TI 320c6 runs 2X HOTTER
– Independent study by Stanford
Segmented Interconnect Yields Lower Power
[Chart: power in W (axis 0 to 10) versus clock frequency in MHz (0 to 100), Xilinx segmented versus non-segmented routing, with ceramic and plastic package thermal limits marked "STOP: Too Much Heat"; annotation: FIR filter CORE, 100 million samples / sec.]
Segmented = Lower Power, Faster Operation
Where to find opportunities
Look for high performance applications
– Multiple DSP processors
– Fixed function DSP parts
– Gate array / custom DSP
Data rates typically above 1 MHz
Multiple channels required
DSP Applications
Image & Video Processing: medical imaging, copiers, cameras, security systems, video editors, inspection sys., fingerprint ID
Communications: wireless comm, cellular / PCS, modems, satellite, cable, ADSL, telephone test
Industrial, Military: motor control, numerical control, test equipment, vibration analysis, power supplies, radar, secure comm.
Where FPGA Solutions Fit
FPGAs ideal for high sample rates and computational intensity
[Chart: kHz sample rates, single channel (audio): processors, fixed-point or floating-point arithmetic. MHz sample rates (RF, video, multiple channels): FPGAs, fixed-point arithmetic.]