Digital Signal Processing with Xilinx FPGAs

Yin-Tsung Hwang

The materials are largely based on the Xilinx Seminar Notes presented by Bruce Newgard

Configurable Hardware DSP Solutions
    Introduction to digital filters
    Distributed Arithmetic (DA)
    DA FIR filter example
    8 Tap Slice
    High speed FIR filter
    Low speed FIR filter
    IIR bi-quad filter
    Correlator
    Summary




  • Digital Filter Basics

    Introduction to Digital Filters

    Key component in many DSP applications: channel equalization, echo cancellation

    Digital vs. analog filters: programmability, better frequency response

    Classifications

    Finite Impulse Response (FIR) filter

    $$y(n) = \sum_{k=0}^{M-1} c_k \, x(n-k)$$

    Infinite Impulse Response (IIR) filter

    $$y(n) = \sum_{k=0}^{M-1} c_k \, x(n-k) + \sum_{p=1}^{N} b_p \, y(n-p)$$

  • Finite Impulse Response Filter

    $$y(n) = \sum_{k=0}^{M-1} c_k \, x(n-k) = c_0 \, x(n) + c_1 \, x(n-1) + \cdots + c_{M-1} \, x(n-M+1)$$

    c_k: filter coefficients (constants)
    x(n): input at time instance n
    y(n): output at time instance n
    M: filter tap order

    M could be as large as 1000
    A series of multiply and accumulate (MAC) operations
    No. of MAC operations/sec = sampling frequency × filter tap order
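The FIR sum and the MAC-rate relation can be sketched in a few lines of Python. This is only an illustrative model, not code from the seminar notes; the names `fir_filter` and `mac_rate` are made up for this example, and samples before time 0 are assumed to be zero.

```python
def fir_filter(coeffs, samples):
    """Direct-form FIR: y(n) = sum over k of c_k * x(n-k), with x(n) = 0 for n < 0."""
    out = []
    for n in range(len(samples)):
        acc = 0  # accumulator, cleared for every output sample
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]  # one multiply-accumulate (MAC)
        out.append(acc)
    return out


def mac_rate(sampling_hz, taps):
    """MAC operations per second = sampling frequency x filter tap order."""
    return sampling_hz * taps
```

Feeding a unit impulse through `fir_filter` returns the coefficients themselves, which is a quick sanity check. For the 67-tap, 5 kHz examples in this deck, `mac_rate(5000, 67)` gives 335,000 MAC operations per second.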

    High Pass Filter Example

    [Figure, three panels: coefficients, |h(n)| versus n; pole-zero plot, imaginary part versus real part; magnitude response in dB, decibels versus frequency]

    Ws = 1500 Hz, Wp = 4000 Hz; sampling freq.: 5 kHz; tap order: 67; filter type: HPF (FIR)

  • Low Pass Filter Example

    Wp = 1500 Hz, Ws = 3000 Hz; sampling freq.: 5 kHz; tap order: 67; filter type: LPF (FIR)

    [Figure, three panels: coefficients, |h(n)| versus n; pole-zero plot, imaginary part versus real part; magnitude response in dB, decibels versus frequency]

    Basic FIR Filter Block Diagram

  • FIR Implementation Using Programmable DSP Processor

    Software solution
    1 parallel multiplier, accumulator
    Time sharing through micro-coding
    Relatively low sample rate
    Multiple chip solution
    No migration path
    Complex real-time programming

    For each sample data word
        For each tap
            Multiply c(i) times x(i)
            Add result to accumulator

    Distributed Arithmetic Basics

  • 2’s Complement Multiplication

    A Series of Multiply & Add

    [Figure labels: coefficient, input sample, weighted partial product, multiply result, final result, parallel multiplier, accumulator]

  • Distributed Arithmetic Approach (1)

    [Figure labels: bit columns LSB, LSB+1, ..., MSB; partial sum per bit column; accumulator + shifter; final result]

    The partial sum can be implemented by a look-up table
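The regrouping behind this slide can be written out as a short derivation. The notation is an assumption made for this note, not taken from the slides: x_k stands for the tap input x(n-k), stored as an N-bit 2's complement word with bits b_{k,j}, and F_j is the partial sum for bit column j.

```latex
x_k = -\,b_{k,N-1}\,2^{N-1} + \sum_{j=0}^{N-2} b_{k,j}\,2^{j}

y = \sum_{k=0}^{M-1} c_k\,x_k
  = -\,2^{N-1} F_{N-1} + \sum_{j=0}^{N-2} 2^{j} F_j ,
\qquad
F_j = \sum_{k=0}^{M-1} c_k\,b_{k,j}
```

Each F_j depends only on the M bits b_{0,j}, ..., b_{M-1,j}, so it can be read from a table with 2^M entries. The powers of two are supplied by the shifter and accumulator, and the MSB column is subtracted because of its negative weight.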

    Distributed Arithmetic Approach (2)

    [Figure: bit-level partial-product array built from terms x(i,j)·a(k,l); the terms in each bit column are summed (together with the "+1" and "1" entries shown in the array) to form the result bits P0 to P9]

    Need a 4-operand parallel adder
    Need a scaling accumulator

  • DA One-Tap FIR Filter

    Reduces to multiplying a variable x(n) by a constant c0

    DA Two-Tap FIR Filter

  • DA Three-Tap FIR Filter

    A look-up table implementation can be both faster and more area efficient than a multi-operand adder
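To make the look-up-table idea concrete, here is a small software model of a bit-serial DA dot product. It is a sketch under stated assumptions rather than the Xilinx design: coefficients and samples are plain integers, the function names are invented for this example, and the per-bit weighting is done by shifting the LUT output left, where the hardware would instead shift the accumulator.

```python
def build_da_lut(coeffs):
    """LUT[addr] = sum of the coefficients whose address bit is 1 (2**taps entries)."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_dot(coeffs, samples, nbits):
    """sum(c_k * x_k) for signed nbits-wide samples, one bit plane per step."""
    lut = build_da_lut(coeffs)
    mask = (1 << nbits) - 1
    words = [x & mask for x in samples]  # 2's complement bit patterns
    acc = 0
    for j in range(nbits):  # LSB first
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> j) & 1) << k  # bit j of every tap forms the address
        partial = lut[addr]
        if j == nbits - 1:
            acc -= partial << j  # sign bit carries negative weight
        else:
            acc += partial << j
    return acc
```

With three taps the table has only 8 entries, and `da_dot([3, -5, 7], [4, -3, 2], 4)` returns 41, the same value as the ordinary multiply-and-add.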

    Recall XC4000X Family

    Device                4028EX*  4036EX*  4044EX   4052XL   4062XL   4085XL   40125XV
    Typ Logic Gates       28,000   36,000   44,000   52,000   62,000   85,000   125,000
    Typ System Gates      56,000   72,000   90,000   110,000  130,000  175,000  250,000
      (Logic + Select-RAM)
    Avail RAM bits        32,768   41,472   51,200   61,952   73,728   100,352  157,968
    Number CLBs           1,024    1,296    1,600    1,936    2,304    3,136    4,624
    Flip-Flops            2,560    3,168    3,840    4,576    5,376    7,168    10,336
    I/O                   256      288      320      352      384      448      544
    Supply Voltage (V)    5/3      5/3      5/3      3        3        3        2.5
    PG package            PG299    PG411    PG411    PG411    PG475    PG559    PG599

    Other packages listed: HQ208, HQ240, HQ304, BG352, BG432, BG560

    * 30% of CLBs as RAM

  • The Development of a Distributed Arithmetic FIR Filter

    10 bit 10 tap XC4000E Family example

    DA FIR Filter Design in XC4000E: 10-Tap 10-Bit Example

    • N clocks per sample word
    • Fast clock
    • No multiplier required
    • Embedded hardware solution
    • LUT holds coefficients & Mult.

  • LUT Size in DA FIR Design

    • Look-up table scales exponentially
    • 10-tap 10-bit needs 2^10 × 10 bits
    • Need to reduce the LUT size
    • Take advantage of the linear-phase symmetrical FIR filter

    10-Tap 10-Bit Symmetrical FIR Filter
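The saving from symmetry can be checked with a short sketch. It assumes a symmetrical filter with c_k = c_(M-1-k), so each pair of samples sharing a coefficient can be added before the look-up; the function names are illustrative only.

```python
def lut_bits(address_inputs, word_width):
    """Storage for a LUT with 2**address_inputs words of word_width bits."""
    return (1 << address_inputs) * word_width


def fir_output_full(coeffs, window):
    """One output sample from all taps: sum of c_k * x(n-k)."""
    return sum(c * x for c, x in zip(coeffs, window))


def fir_output_folded(half_coeffs, window):
    """Same output for a symmetrical filter, pre-adding the paired samples."""
    taps = len(window)
    folded = [window[k] + window[taps - 1 - k] for k in range(len(half_coeffs))]
    return sum(c * s for c, s in zip(half_coeffs, folded))
```

For the 10-tap, 10-bit case, `lut_bits(10, 10)` is 10,240 bits, while the folded form needs only five address inputs: `lut_bits(5, 10)` is 320 bits, matching a 32×10 memory.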

  • Look Up Table Implementation

    Holds all partial products
    LUT is as wide as the coefficients
    Use MEMGEN to generate the LUT

    [Figure: look-up table as a 32×10 memory (320 bits); address inputs A0 to A4; 10-bit DATA output]

    Serial Time Skew Buffer

    Sample data word size = N, filter tap size = k

    • One N-bit shift register per tap
    • Use XC4000E RAM to build the shift register
    • One 16-bit shift register per 1/2 CLB

    Using FFs, 10-bit 10-tap: 50 CLBs
    Using RAMs, 10-bit 10-tap: 10 CLBs

    Shift register implemented in RAM

  • Bit Serial Adder

    Distributed arithmetic lookup table

  • 1’s Complementer

    MSB has negative weighting
    Inverts data on the last cycle
    2 bits per CLB
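The slide only states that the data is inverted on the last cycle. One common way to complete the subtraction, assumed here rather than taken from the slides, is to add a carry-in of 1 to the inverted word, since inverting all bits and adding 1 negates a 2's complement value. The helper names are hypothetical.

```python
def to_signed(word, width):
    """Interpret a width-bit pattern as a 2's complement integer."""
    return word - (1 << width) if word & (1 << (width - 1)) else word


def negate_with_ones_complementer(value, width):
    """Negate by inverting every bit (1's complement), then adding a carry-in of 1."""
    mask = (1 << width) - 1
    inverted = ~value & mask       # the 1's complementer stage
    return (inverted + 1) & mask   # assumed carry-in of 1 at the accumulator
```

For example, `to_signed(negate_with_ones_complementer(5, 8), 8)` evaluates to -5, which is how the sign-bit partial sum ends up subtracted rather than added.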

    Scaling Accumulator

    Adds data to 1/2*(SUMOUT)
    2 bits per CLB
    Needs N+1 bits
    Double precision with an extra shift register
    Can use LogiBlox for RPM
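A rough model of the "add data to half the previous sum" recurrence follows. It uses exact fractions so that the halving loses no bits, which is an idealization of the hardware; `scaling_accumulate` is a name chosen for this sketch, and the last input is assumed to be the sign-bit partial sum.

```python
from fractions import Fraction


def scaling_accumulate(partials):
    """partials: LUT outputs, LSB bit plane first; the last (sign) plane is subtracted."""
    acc = Fraction(0)
    last = len(partials) - 1
    for j, p in enumerate(partials):
        term = -p if j == last else p
        acc = term + acc / 2  # new data plus 1/2 * SUMOUT
    return acc  # the weighted sum, scaled by 2**-(N-1)
```

With partial sums 3, 0, 5, 2 the weighted total is 3·1 + 0·2 + 5·4 - 2·8 = 7, and the function returns 7/8, the same value scaled by 2^-3. Halving the running sum each cycle has the same effect as giving later bit planes a larger weight.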

  • 10-bit 10-tap linear phase FIR filter

    Implementation Block Diagram

    Total of 44 CLBs
    Fits in a 4003E (with an extra 56 CLBs for system use)
    About 1,300 equivalent gates
    Little interconnect between blocks

  • Performance

    No. of 10-tap 10-bit sym. FIR per 4000E device

    XC4000 part          4003E  4005E  4006E  4008E  4010E  4013E  4020E  4025E
    Number of instances  2      4      5      7      9      11     15     22

    FIR 10B10T macro can be clocked at 70 MHz
    10-bit word requires 11 clocks
    10-bit sample word rate is 6.4 MHz

    Word size (bits)     6      8      10     12     14     16
    Sample rate (MHz)    10.0   7.8    6.4    5.4    4.7    4.1
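The sample-rate row follows from the 70 MHz clock and the "word size plus one" clock count. The snippet below reproduces it; the function name is made up, and the clock rate is simply the figure quoted on the slide.

```python
CLOCK_MHZ = 70.0  # macro clock rate quoted on the slide


def sample_rate_mhz(word_bits, clock_mhz=CLOCK_MHZ):
    """A bit-serial DA filter spends word_bits + 1 clocks on each sample word."""
    return clock_mhz / (word_bits + 1)


for bits in (6, 8, 10, 12, 14, 16):
    print(bits, round(sample_rate_mhz(bits), 1))
```

Rounded to one decimal place, the results are 10.0, 7.8, 6.4, 5.4, 4.7 and 4.1 MHz, in agreement with the table.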

    Double Rate DA FIR Filter (1)

    Process 2 bits per clock
    # of clocks = (N/2)+1

  • Double Rate DA FIR Filter (2)

    Two taps require a 4-input LUT without symmetry
    Four taps require a 4-input LUT with a symmetrical FIR
    Time skew buffer uses twice as many CLBs
    Twice the data word sample rate
    Both LUTs are the same
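A software sketch of the two-bits-per-clock idea: the even and odd bit planes address two copies of the same table, and the odd-plane result is weighted by 2 before accumulation. This is a model under assumptions (integer data, even word length, invented function names), not the actual macro.

```python
def build_da_lut(coeffs):
    """LUT[addr] = sum of the coefficients whose address bit is 1."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_dot_double_rate(coeffs, samples, nbits):
    """sum(c_k * x_k), consuming two bit planes per step from identical LUTs."""
    if nbits % 2:
        raise ValueError("nbits must be even for this sketch")
    lut = build_da_lut(coeffs)  # both look-ups use the same contents
    mask = (1 << nbits) - 1
    words = [x & mask for x in samples]

    def address(j):
        return sum(((w >> j) & 1) << k for k, w in enumerate(words))

    acc = 0
    for j in range(0, nbits, 2):
        even = lut[address(j)]
        odd = lut[address(j + 1)]
        if j + 1 == nbits - 1:
            odd = -odd  # the sign plane has negative weight
        acc += (even + 2 * odd) << j
    return acc


def clocks_per_word(nbits, bits_per_clock=1):
    """Clock count from the slides: N + 1 at single rate, N/2 + 1 at double rate."""
    return nbits // bits_per_clock + 1
```

For a 10-bit word, `clocks_per_word(10)` is 11 and `clocks_per_word(10, 2)` is 6, so the sample rate comes out a little under twice the single-rate figure.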

    Designing large multi-tap filter
    Xilinx 8-tap FIR filter SLICE building blocks

  • Issue: LUT scales exponentially

    32-tap FIR filter using 8-tap slices
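The motivation for slicing can be illustrated numerically. The sketch below compares table sizes and shows that per-slice look-ups, added together, give the same partial sum as one large table would. It assumes the tap count is a multiple of the slice size; all names are illustrative.

```python
def lut_words(total_taps, slice_taps):
    """Total LUT words when the filter is split into slices of slice_taps taps."""
    if total_taps % slice_taps:
        raise ValueError("total_taps must be a multiple of slice_taps")
    return (total_taps // slice_taps) * (1 << slice_taps)


def build_lut(coeffs):
    """LUT[addr] = sum of the coefficients whose address bit is 1."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def sliced_partial_sum(coeffs, plane_bits, slice_taps):
    """Partial sum for one bit plane: one look-up per slice, results added."""
    total = 0
    for start in range(0, len(coeffs), slice_taps):
        lut = build_lut(coeffs[start:start + slice_taps])
        addr = 0
        for k, bit in enumerate(plane_bits[start:start + slice_taps]):
            addr |= bit << k
        total += lut[addr]
    return total
```

A single table for 32 taps would need 2^32 words, whereas four 8-tap slices need 4 × 256 = 1,024 words plus an adder to combine the slice outputs.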

  • 8-tap FIR filter slice building blocks

    8-tap FIR filter slice

  • 8-tap FIR filter slice

    Very high speed sampling rates
    Multiple parallel multipliers

  • Multiply variable with a constant

    Multiply variable with a constant (1)

  • Multiply variable with a constant (2)

    High speed parallel FIR filter

  • Fully parallel distributed arithmetic

    8-tap parallel DA slice (1)

  • 8-tap parallel DA slice (2)

    Supports sampling rates of 50 to 70 Msps
    Data and coefficient sizes are independent of each other
    8-bit data, 8-bit coefficients require 122 CLBs per 8-tap slice
    16-tap, 8-bit filter requires 250 CLBs
    32-tap, 8-bit filter requires 508 CLBs

    CLB count for 8-tap PDA slice

    [Figure: approximate number of XC4000 CLBs]

  • Serial sequential architecture

    • Efficient CLB counts
    • Large number of taps
    • Moderate sampling rates
    • Non-symmetric filter OK

    Lower sampling rate applications
    Serial sequential architecture

  • Serial sequential FIR filter (1)

    Serial sequential FIR filter (2)

  • Serial sequential FIR filter (3)

    64-tap serial sequential FIR filter

  • Serial sequential FIR filter designs: size estimate

    Serial sequential FIR filter designs: speed estimate

  • 8-bit word FIR filter structures

    FIR filter implementation options: 8-bit word example

  • 12-bit word FIR filter structures

    FIR filter implementation options: 12-bit word example

  • IIR Filter Designs

    Bi-quad IIR filter – direct form (lowest quantization noise)

  • IIR filter – bi-quad implementation

    Requires a 32-deep LUT
    2 parallel-to-serial converters
    60 CLBs for a 16-bit word

    IIR filter – bi-quad implementation
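A bi-quad section combines three input terms and two feedback terms, five in all, which is consistent with a 32-deep (2^5) distributed-arithmetic LUT. The sketch below is a plain floating-point model for illustration only. Its sign convention, with the feedback terms added, and the names `c`, `b` and `biquad` are assumptions of this example, not details confirmed by the slides.

```python
def biquad(c, b, x):
    """y(n) = c0*x(n) + c1*x(n-1) + c2*x(n-2) + b1*y(n-1) + b2*y(n-2)."""
    x1 = x2 = y1 = y2 = 0.0  # delayed inputs and outputs, initially zero
    out = []
    for xn in x:
        yn = c[0] * xn + c[1] * x1 + c[2] * x2 + b[0] * y1 + b[1] * y2
        x2, x1 = x1, xn  # shift the input delay line
        y2, y1 = y1, yn  # shift the output delay line
        out.append(yn)
    return out


DA_LUT_DEPTH = 2 ** 5  # five terms feed the look-up address: 32 entries
```

As a quick check, `biquad((1, 0, 0), (0.5, 0), [1, 0, 0, 0])` yields the decaying impulse response 1, 0.5, 0.25, 0.125.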

  • Correlator design

    Using LUTs for correlator design

    Any n-stage correlator can be decomposed into (n/4) 4-stage correlators
    LUTs contain all possible outputs for each 4-stage correlation
    Example: correlation pattern = 1011
        Store 4h at address Bh in LUT (4 bit matches)
        Store 3h at addresses 3h, Fh, 9h, Ah in LUT (1 bit error)
    Bit rate can exceed 120 MHz (XC3100A)
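The LUT contents for a 4-stage correlator can be generated mechanically: each of the 16 addresses stores the number of bit positions that agree with the pattern. The sketch below also sums four such look-ups for a 16-stage correlation. Function names are invented for this example.

```python
def correlator_lut(pattern):
    """16-entry LUT: entry[addr] = number of bits of addr that match the 4-bit pattern."""
    return [4 - bin(addr ^ pattern).count("1") for addr in range(16)]


def correlate(word, pattern, stages):
    """Score of an n-stage correlation (n a multiple of 4) as a sum of 4-stage LUT outputs."""
    score = 0
    for shift in range(0, stages, 4):
        lut = correlator_lut((pattern >> shift) & 0xF)
        score += lut[(word >> shift) & 0xF]
    return score
```

For pattern 1011 the table holds 4 at address 1011 (Bh, 11 decimal) and 3 at the four addresses one bit away: 0011 (3h), 1111 (Fh), 1001 (9h) and 1010 (Ah).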

  • Correlator LUT example

    Input search pattern = 1101

    16-stage correlator using LUT

  • Summary

    Xilinx vs. DSP Processor

    When does it make sense to use FPGAs?

    High to medium sample rate systems
    Small word lengths
    Lots of taps
    Fast correlators
    Single chip solution required
    Low cost migration path (HardWire)
    Incremental cost of DSP chip
    DSP application specific chips

    Design Once!

  • XDSP FPGA Applications

    Signal Synthesis
    Modulation / Demodulation
    Fast Fourier Transforms
    Neural Networks
    Video Signal Processing (2D, 3D Filters)
    and more ...

    Possibilities

    An alternative to the software DSP processor solution
    Existing 4000E/EX are efficient at signal processing
    System-level, application-specific solution on a single chip
    Standard product, configurable solution
    Automatic migration path to a lower cost, high volume solution