Xilinx DSP

Xilinx Core Solutions Group

DSP

Traditional DSP: DSP Processors

[Figure: Multiply, Add (Single MAC), Sequential Processing]

– One MAC (Multiply Accumulate)
– Time-Shared
– Performance ceiling

+ Programmable
+ Off-the-shelf, standard part
+ Hardware multiplier

Xilinx DSP
High Performance Alternative - Parallel Processing

[Figure: four Multiply / Add pairs side by side, continuing (• • •): Multiple MACs, Parallel Processing]

+ Programmable
+ Off-the-shelf, standard part
+ Many Multiplies in one clock cycle!
+ Extend the performance of DSP Processors

Xilinx DSP Solution

• CORE Generator

System-Level Tools

• DSP LogiCOREs

• Tools Integration

Existing Xilinx DSP Design Methodology

[Figure: CORE Generator, M1, XC4000X/Spartan/Virtex]

CORE Generator

Parameterize DSP LogiCOREs

Connect the cores with HDL or schematic

Addition of DSP System Level Tool

DSP System level tools
– Used by all DSP systems engineers
– 100,000 copy installed base

Fit into existing DSP environment

Connect through the CORE Generator SystemLINX interface

[Figure: System-Level Tools, CORE Generator, M1]

Performance

XC4085XL > 10x Faster than 320C6x

[Chart: Billions of MACs per Second, axis 1 to 5, for 320C6x, 4005XL, 4013XL, 4036XL, 4062XL, and 4085XL]

16-bit FIR Filter Benchmark

[Figure: block diagram. 10-bit REG inputs, two banks of 32-Tap FIR blocks numbered 1, 2, ... 8, Adder Tree, 18-bit REG output]

120 Million Samples per Second
512-Tap Decimating FIR

3.8 Billion MACs
>10 DSP uPs

5,120 Flip-Flops
– Just for data buffer

XC4085XL
150,000 Gates
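The headline numbers of this benchmark can be cross-checked with a few lines of arithmetic. The sketch below is one reading of the slide, not a calculation from Xilinx: the decimation factor of 16 is an assumption (the slide does not state it), chosen because it reproduces the quoted 3.8 billion MACs per second. The function name and its parameters are illustrative only.

```python
def benchmark_summary(input_rate_hz=120e6,   # from the slide: 120 million samples/s
                      total_taps=512,        # from the slide: 512-tap decimating FIR
                      taps_per_core=32,      # from the slide: 32-tap FIR building block
                      data_bits=10,          # from the slide: 10-bit input data
                      decimation=16):        # ASSUMPTION: not stated on the slide
    """Recompute the benchmark figures from the slide's parameters."""
    cores = total_taps // taps_per_core
    # A decimating filter only has to produce one output per `decimation` inputs.
    output_rate_hz = input_rate_hz / decimation
    macs_per_second = output_rate_hz * total_taps
    # The data buffer holds one data word per tap.
    buffer_flip_flops = total_taps * data_bits
    return {
        "cores": cores,
        "output_rate_hz": output_rate_hz,
        "macs_per_second": macs_per_second,
        "buffer_flip_flops": buffer_flip_flops,
    }


if __name__ == "__main__":
    for name, value in benchmark_summary().items():
        print(f"{name}: {value:,.0f}")
```

Under that assumption the 512 taps map onto sixteen 32-tap cores, and the buffer of 512 ten-bit words accounts for the 5,120 flip-flops quoted on the slide.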

Price

[Chart: Price per Million MACs per Second, axis $0.05 to $0.25, Lowest Cost C6x vs. Xilinx XC4000XL]

DSP LogiCOREs Exploit FPGA Architecture

[Figure: 16-word RAM, F/F]

Matrix of 16 by 1 RAM primitives
– Look-up-table logic
– FIFOs, shift-registers, …
– Multiple small memories

10,000 RAM primitives on a chip
Regular, monolithic, scalable structure
Efficient: 1 - 3 Million MACs per CLB

Distributed RAM & Distributed Arithmetic (DA): Perfect Match

[Figure: N-bit inputs, 4-Input LUTs, ADD or ACC. Basic DA Structure Matches XC4000 Architecture]

DA Algorithms:

• 4-Input Look-Up-Tables (LUT) scaled with adders

• For higher performance, use more LUTs = more parallelism

• Efficiency similar to custom solution
– Achievable with LUT logic
– More ASIC gate equivalents
– More cost effective
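To make the look-up-table idea concrete, here is a minimal software sketch of distributed arithmetic for a 4-input sum of products. It is an illustration of the general DA technique under stated assumptions, not the LogiCORE implementation: the names `build_da_lut` and `da_dot_product` are hypothetical, the data is assumed to be two's-complement, and the bits are processed one plane at a time (the serial form).

```python
def build_da_lut(coeffs):
    """Entry `addr` holds the sum of the coefficients whose address bit is 1.

    With 4 coefficients this is a 16-entry table, i.e. one 4-input LUT
    address per combination of data bits.
    """
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_dot_product(coeffs, samples, bits):
    """Compute sum(coeffs[k] * samples[k]) without any multiplier.

    `samples` are signed integers that fit in `bits`-wide two's complement.
    """
    lut = build_da_lut(coeffs)
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]      # two's-complement bit patterns
    acc = 0
    for b in range(bits):
        # Gather bit b of every sample into one LUT address.
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> b) & 1) << k
        partial = lut[addr] << b             # scale by the bit weight 2**b
        if b == bits - 1:
            acc -= partial                   # the sign bit has weight -2**(bits-1)
        else:
            acc += partial
    return acc


if __name__ == "__main__":
    coeffs = [3, -5, 7, 2]
    samples = [12, -7, 0, -128]
    print(da_dot_product(coeffs, samples, bits=8))
    print(sum(c * x for c, x in zip(coeffs, samples)))   # direct reference
```

Processing several bit planes at once with extra LUTs and adders trades area for speed, which is the "more LUTs = more parallelism" point in the list above.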

Common DSP Functions

Filters
– FIR
– IIR

Transforms
– FFT
– DCT

Modulation
– Multipliers
– SIN tables

Basics
– Multiply / add
– Storage

FIR Filter

[Figure: FIR FILTER. Sample data, N bits wide, passes through K taps (X0, X1, X2, ...); each tap is multiplied by its coefficient (C0, C1, C2, ...) and the K products are summed (SUM) to form the output data]
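The multiply-and-sum structure of an FIR filter can be sketched in a few lines. This is a plain direct-form reference model, assuming integer samples and a delay line that starts out filled with zeros; the function name `fir_filter` is illustrative and not part of any Xilinx tool.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * x[n - k].

    Samples before the start of the input are treated as zero.
    """
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]    # one multiply-accumulate (MAC)
        out.append(acc)
    return out


if __name__ == "__main__":
    # An impulse at the input reads the coefficients back out.
    print(fir_filter([1, 0, 0, 0], [1, 2, 3]))
    # A 2-tap moving sum.
    print(fir_filter([1, 2, 3, 4], [1, 1]))
```

Each output sample costs one MAC per tap, which is why the number of taps times the sample rate is used as the MACs-per-second figure throughout these slides.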

FIR Filter LogiCOREs

Two Basic Types:

1. Serial Distributed Arithmetic FIR
– SDA FIR - Single Channel
– SDA FIR - Dual Channel

2. Parallel Distributed Arithmetic FIR

Combine basic PDA or SDA FIR cores to solve many problems

SDA FIR Filters
Serial Distributed Arithmetic

• Parallel In, Parallel Out, Bit-Serial Internally
• All taps processed in parallel
• Full precision through entire core
• One clock cycle required for each data bit
• One additional clock cycle for symmetric filters

EXAMPLE: 10-bit data, 80 taps, symmetrical FIR:

• For a bit level clock = 90 MHz
• Max sample rate = 90 MHz / 11 clks = 8.2 Million samples/sec.
• Process 80 taps every 122 nsec.
• 656 Million MACs per second, 257 CLBs, 2.55 Million MACs per second / CLB
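The arithmetic in the example follows directly from the "one clock per data bit, plus one for symmetric filters" rule. A short sketch, with a hypothetical helper name and parameters taken from the example:

```python
def sda_fir_throughput(bit_clock_hz, data_bits, taps, symmetric, clbs):
    """Throughput of a bit-serial (SDA) FIR using the rule quoted on the slide."""
    # One clock per data bit, plus one extra clock for symmetric filters.
    clocks_per_sample = data_bits + (1 if symmetric else 0)
    sample_rate_hz = bit_clock_hz / clocks_per_sample
    macs_per_second = taps * sample_rate_hz   # all taps are processed in parallel
    return {
        "clocks_per_sample": clocks_per_sample,
        "sample_rate_hz": sample_rate_hz,
        "sample_period_ns": 1e9 / sample_rate_hz,
        "macs_per_second": macs_per_second,
        "macs_per_second_per_clb": macs_per_second / clbs,
    }


if __name__ == "__main__":
    result = sda_fir_throughput(bit_clock_hz=90e6, data_bits=10, taps=80,
                                symmetric=True, clbs=257)
    for name, value in result.items():
        print(f"{name}: {value:,.2f}")
```

The unrounded result is about 654.5 million MACs per second; the slide's 656 million comes from multiplying 80 taps by the already rounded 8.2 million samples per second.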

SDA FIR Properties

For a Given # of Taps:

• Coefficient bit-width determines size
  # CLBs = function of D.A. LUT width

• Data bit-width determines max sample rate
  One serial clock per bit

• Output data width does not affect CLB count

What to Ask

• Data sample rate
• Number of taps
• Data word width
• Coefficient width
• Coefficient Symmetry
• Same input & output sample rate?

Serial Distributed Arithmetic FIR Filters

Number of CLBs
Serial Distributed Arithmetic, Data Word = Coefficient Size:

# CLBs           5 bit   8 bit  10 bit  12 bit  14 bit  16 bit  18 bit  20 bit
8 tap   Symm                33      36      39      42      45      52      55
8 tap   Non                 46      54      59      64      69      77      85
16 tap  Symm                61      69      71      76      81      96     102
16 tap  Non                 80      95     104     112     123     138     142
24 tap  Symm                89     101     108     116     127     146     154
24 tap  Non                101     114     127     140     153     174     187
32 tap  Symm               107     118     126     137     148     175     182
32 tap  Non
40 tap  Symm
40 tap  Non
48 tap  Symm               158     173     187     202     217     246     261
64 tap  Symm               197     215     233     250     268     305     323
80 tap  Symm               236     257     278     299     320     364     385

Sample rate, XC4000E-1 (MHz)
Symm              13.3     8.9     7.3     6.2     5.3     4.7     4.2     3.8
Non               16.0    10.0     8.0     6.7     5.7     5.0     4.4     4.0

Values on the slide whose row is not recoverable: 53; 80; 93; 116 138 154 165 179 191 226 239

Distributed RAM is More Efficient
Build the Time-Skew Buffer with Distributed RAM, not Flip-Flops

[Figure: a 16 x 1 Shift Register built from flip-flops (FF FF FF ... FF) uses 16 Logic Cells; a 16 x 1 Shift Register built from One 16x1 RAM Cell Primitive and an FF uses 1 Logic Cell]

Best Device Utilization
Distributed RAM well suited to DSP

For SDA FIR Filters: Xilinx Distributed RAM - Uses One Third the Area

[Chart: Device Size (LCs), axis 0 to 1600, for SDA FIR Filters of 16-Taps 16-Bits, 16-Taps 8-Bits, 64-Taps 9-Bits, and 64-Taps 16-Bits; series: Xilinx Distributed RAM, Block RAM]

PDA FIR Filter Core
Parallel Distributed Arithmetic FIR Filters

• Fully parallel implementation

• All taps processed in parallel (same as SDA)

• All bits processed in parallel

• Up to 100 million samples per second

• 2 billion MACs per 20-tap core

[Figure: PDA FIR core symbol. Inputs: Data_IN, Clock (CK), CascadeMid_In (C_M_IN). Outputs: DATA_OUT, CascadeData_Out (C_D_OUT), CascadeMid_Out (C_M_OUT)]

PDA FIR Filters

• Parameterized

• Input data: 4 to 24 bits

• Coefficients: 4 to 24 bits

• Symmetric, non-symmetric, negative symmetry

• Output data: 2 to 31 bits

• Taps: 2 to 20 per core

• Automatically trims unused coefficient ROMs

• Supports cascading multiple filter cores

The high data sample rate solution

CORE Generator Software

LogiCORE:

AllianceCORE: Data Sheets

Web Mechanism to download new cores

SystemLINX: Ability to call CORE Generator from Third Party Tools

One line Documentation

CORE Generator Methodology

1. Select a CORE

2. Enter parameters

3. Generate Core

[Figure: Filter Design Package, LogiCORE - SDA Filter, 160 CLB. HOW?]

DSP CORE Generator Outputs

Schematic symbol

VHDL or Verilog HDL instantiation code

Simulation model

Design netlist with constraints

[Figure: FIR Filter Recipe, Parameters, DSP CORE Generator, 32 Tap FIR Filter (20 rows by 9 columns, 160 CLBs used)]

Predictable Performance regardless of the number of cores

Predictable Size & Performance

• Built for System Performance - Not Benchmarks.

• Generated with RPM (Relationally Placed Macro).

RPM Macro Level Advantages, RPM System Level Advantages

• Predictable size.

• Close proximity of communicating elements

• Alignment of Critical paths

• Accessible I/O signals

• Improves Density

• Rapid progress for automatic and manual design methods (1 macro, NOT 100’s of elements!)

• Consistent performance anywhere on the die.

• Packing density very high

• Adequate set-up times

Filling a device with Xilinx Cores does not reduce performance

Performance Independent of core location

Same core installed in different locations

[Figure: the same core in two locations, 80 MHz and 80 MHz]

Xilinx LogiCOREs deliver the same performance for any placement

Non-segmented routing FPGAs can’t do this

Performance Independent of Device Utilization

[Figure: four cores, 80 MHz each]

Xilinx has performance independent of the number of cores added

Non-segmented routing FPGAs can’t do this

Best FPGA Performance
Xilinx is more Predictable

[Chart: 12x12 Area Efficient Multiplier. Speed (MHz), axis 40 to 80, vs. Number of Instances (1, 2, 3, 4, 8); series: Xilinx Segmented, Non Segmented]

Segmented = More Predictable and Repeatable

Performance Independent of Device Size

[Figure: 80 MHz, 80 MHz, 80 MHz]

Same performance for a 4005 or 4085

Non-segmented routing FPGAs can’t do this

Design Flow

[Figure: block diagram. 20 MHz input; Mixer (Complex Demod, 4 multipliers, COS and SIN) producing I and Q; Low Pass 32-TAP FIR with 4:1 Decimate on each path; 5 MHz; 48-TAP FIR; 4K x 16 RAM; Base-band processor]

• Generate each module.
• Use Schematic or HDL at a system level.

Implementing the Mixer

This mixer supports sample rates in excess of 85 MHz. It even supports sample rates up to 45.6 MHz using the slowest Xilinx device (E-4).

Joining the Cores

Here VHDL is used to link the cores into a system. Schematic symbols may also be used.

    skip_value: skip_val    --The integrator for skipping through the Sine table with forcing constant
      port map (cb => skip_constant);

    skip_integrater: skip_int
      port map (b  => skip_constant,
                s  => skip_integrate,
                l  => GND,
                ce => VCC,
                c  => clk);

    form_sine_address:
    for i in 0 to 6 generate
      --extract 7 bits required to address look-up table
      --MSB is not used as this represents overflow.
      --Lower bits are internal precision for integrator.
      skip_address(i) <= skip_integrate(i+10);
    end generate form_sine_address;

    sine_table : sine_lut    -- sine wave look-up table
      port map (theta  => skip_address,
                output => sine_wave,
                ctrl   => VCC,    --select SINE output when high
                c      => clk);

All component declaration and port map code provided by Coregen
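The behavior of the VHDL fragment (a constant added into an integrator, with bits 10 to 16 of the integrator addressing a sine table) can be modeled in software as a phase accumulator. The sketch below is a behavioral guess, not the Coregen cores themselves: the 18-bit accumulator width, the table amplitude of 127, reading the table after each accumulate, and the name `sine_generator` are all assumptions. Only the 7-bit address taken from bit 10 upward comes from the VHDL.

```python
import math

ACC_BITS = 18      # ASSUMPTION: bits 0-9 precision, 10-16 address, 17 unused MSB
ADDR_SHIFT = 10    # from the VHDL: skip_address(i) <= skip_integrate(i+10)
ADDR_BITS = 7      # from the VHDL: i in 0 to 6
AMPLITUDE = 127    # ASSUMPTION: output scaling of the sine table

# One full sine period stored in 2**ADDR_BITS = 128 entries.
SINE_LUT = [round(AMPLITUDE * math.sin(2 * math.pi * n / (1 << ADDR_BITS)))
            for n in range(1 << ADDR_BITS)]


def sine_generator(skip_constant, num_samples):
    """Step through the sine table by adding `skip_constant` every clock."""
    acc = 0
    out = []
    for _ in range(num_samples):
        # Integrator: wraps around at the accumulator width.
        acc = (acc + skip_constant) & ((1 << ACC_BITS) - 1)
        # Drop the low precision bits and the unused MSB to form the address.
        addr = (acc >> ADDR_SHIFT) & ((1 << ADDR_BITS) - 1)
        out.append(SINE_LUT[addr])
    return out


if __name__ == "__main__":
    # A skip constant of 2**10 advances the table address by one per clock.
    print(sine_generator(1 << ADDR_SHIFT, 8))
```

In this model the output frequency is the clock frequency times `skip_constant / 2**17`, so a larger skip constant steps through the table faster and produces a higher-frequency sine wave.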

Power Dissipation Advantage
Often the Limiting Factor In DSP

Xilinx advantage over competitive FPGAs:
– Segmented routing is essential in DSP applications
– Altera Runs 3X HOTTER than Xilinx!

Xilinx advantage over DSP processors:
– TI 320c6 Runs 2X HOTTER
– Independent study by Stanford

[Graphic: STOP, Too Much Heat]

Segmented Interconnect Yields Lower Power

[Chart: Power (W), axis 0 to 10, vs. Clock Frequency (MHz), axis 0 to 100; series: Non-Segmented, Xilinx Segmented; Package Thermal Limit shown for Ceramic and Plastic; annotation: FIR Filter CORE, 100 Million Samples / sec.]

Segmented = Lower Power, Faster Operation

Where to find opportunities

Look for high performance applications
– Multiple DSP processors
– Fixed function DSP parts
– Gate array / custom DSP

Data rates typically above 1 MHz

Multiple channels required

DSP Applications

Image & Video Processing
– Medical Imaging
– Copiers
– Cameras
– Security Systems
– Video editors
– Inspection Sys
– Fingerprint ID

Communications
– Wireless Comm
– Cellular / PCS
– Modems
– Satellite
– Cable
– ADSL
– Telephone Test

Industrial, Military
– Motor control
– Numerical control
– Test equipment
– Vibration analysis
– Power supplies
– Radar
– Secure comm.

Where FPGA Solutions Fit

FPGAs ideal for high sample rates and computational intensity

[Figure: application range from Audio to RF, Video, Multiple Channels]

– FPGAs: MHz sample rates, fixed-point arithmetic
– Processors: kHz sample rates, single channel, fixed-point arithmetic
– Processors: floating-point arithmetic
