Numerical Error Minimizing Floating-Point to Fixed-Point ANSI C Compilation


Tor Aamodt and Paul Chow

University of Toronto


Presentation Outline

Background / Motivation

Floating-to-Fixed-Point Conversion

Architectural Support

Experimental Results

Summary / Future Directions



Background: University of Toronto DSP Project

Motivation: DSP Compiler/Architecture Co-design

First Generation Silicon (Sean Peng’s M.A.Sc. Thesis), taped out Sept. 30, 1999:

  108 pin PGA / 0.35 µm CMOS / 63 MHz
  16-bit Fixed-Point VLIW with Two-Level Instruction Fetching
  Harvard Memory Architecture
  5 stage pipeline: IF1 IF2 ID EX WB
  7 function units:
    2 integer units: 16.0 multiply & 1.15 multiply operations
    2 address units: modulo addressing
    2 memory units: each tied to one data memory bank
    1 control unit



Background: Fixed-Point versus Floating-Point

32 bit Floating-Point (IEEE): sign bit | 8 bit exponent (excess 127) | 23+1 bit normalized mantissa

Fixed-Point: sign bit | integer part (IWL bits) | fractional part



Background: Fixed-Point versus Floating-Point

Property                        WL-bit Fixed-Point             32 bit Floating-Point
Dynamic Range of |x|            $[0, 2^{IWL})$                 $(2^{-126}, 2^{127})$
Precision of x: $|\delta x / x|$   $x^{-1} \, 2^{(1 + IWL - WL)}$    $2^{-23}$
Function Unit Cost              significantly less

This factor motivates us to find ways of coping with the shortcomings of fixed-point representations.



Motivation

Why convert floating-point code to fixed-point code? Saves area and power.

Why automate the process? Manual conversion is time-consuming and error-prone.

What qualities are we looking for in an automated conversion system? Good signal quality*. Fast code.



Background: Fixed-point Numerical Representations in Signal Processing

Consider a program P with associated inputs x(k) ∈ SP. Example: P is an IIR filter and SP is the set of all human speech samples x(k).

Signal Scaling: Integer Word Length (IWL)

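definition: $IWL(x) \overset{\mathrm{def}}{=} \left\lceil \log_2 \left( \max |x| + \varepsilon \right) \right\rceil$

x: an input, program variable, intermediate result, or output.
max: taken over all definitions of x, and all inputs in SP.
ε: an infinitesimally small number. Why? e.g. $\log_2 2 = 1$, yet the value 2 lies outside the range $[0, 2^{1})$ and needs IWL = 2.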
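The definition above can be sketched in C as follows. This is only an illustration: the helper name `iwl_from_max_abs` is hypothetical, and returning 0 when no nonzero value has been observed is an assumption, not something the slides specify. The loop form avoids the math library and gives the same result as taking the ceiling of the base-2 logarithm of a value just above the observed maximum.

```c
/* Hypothetical helper (not from the slides): smallest IWL such that
 * max_abs < 2^IWL, found by repeated halving/doubling so that no
 * math library is needed. */
int iwl_from_max_abs(double max_abs)
{
    int iwl = 0;

    if (max_abs <= 0.0)
        return 0;              /* assumption: no range observed yet */

    while (max_abs >= 1.0) {   /* value needs another integer bit */
        max_abs /= 2.0;
        iwl++;
    }
    while (max_abs < 0.5) {    /* leading fraction bits would go unused */
        max_abs *= 2.0;
        iwl--;
    }
    return iwl;
}
```

With this sketch a maximum magnitude of 2.0 yields IWL = 2 and a maximum of 1.5 yields IWL = 1; signals that stay below 0.5 get a zero or negative IWL, which leaves more bits for the fraction.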



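Background: Fixed-Point Arithmetic Operations

Addition / Subtraction: operands A and B; ">> n" (binary point alignment); ">> 1"; Overflow Guard Bits.

Multiplication: operand A with IWL_A and operand B with IWL_B; the product A*B has IWL_A + IWL_B.

[Figure: bit-field diagrams of A, B, and the result for each operation.]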









Conversion Process:

Previous Work

‘Worst-Case Evaluation’: Markus Willems et al. FRIDGE: An Interactive Code Generation Environment for HW/SW CoDesign. ICASSP, April 1997.

A ‘Statistical’ Approach: Ki-Il Kum, Jiyang Kang, and Wonyong Sung. A Floating-Point to Fixed-Point C Converter for Fixed-Point Digital Signal Processors. In Proc. 2nd SUIF Compiler Workshop, August 1997.



Conversion Process: Overview

Input C File
  -> SUIF Front End
  -> Math Library Replacement
  -> Alias Analysis & ID Assignment
  -> Instrument Code
  -> Profile to obtain Dynamic Ranges
  -> Generate Scaling Operations
  -> Code Generation / Detect & Generate FMLS operations
  -> UofT DSP Simulator

Math library replacement: “sin(x)” becomes “utdsp_sin(x)”.

Example input code:

    float *p, x, y, A[N], B[N];

    for( int i=0; i < N; i++ ){
        p = (condition) ? A : B;
        y += x*p[i];
    }

    float fubar( float *p ){
        float sum = 0.0;
        for( int i=0; i < N; i++)
            sum += p[i];
        return sum;
    }



Conversion Process: Collecting Dynamic Range Information

Consider the ANSI C code:

    float a, b, x[N];
    y = a*x[i] + b*x[i+1];

Equivalent Expression Tree:

              y = (+)
                 /   \
              (*)     (*)
             /   \   /   \
            a  x[i] b  x[i+1]

ID Assignment:

    “0” : y
    “1” : tmp_1
    “2” : tmp_2

Code Instrumentation:

    tmp_1 = a*x[i];
    profile(tmp_1,1);
    tmp_2 = b*x[i+1];
    profile(tmp_2,2);
    y = tmp_1 + tmp_2;
    profile(y,0);
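A minimal sketch of what a `profile` hook could look like, assuming it only needs to record the largest magnitude seen for each ID. The slides do not show the body of `profile`, so the storage array `observed_max`, the constant `NUM_IDS`, and the wrapper `instrumented_step` are hypothetical names introduced for illustration.

```c
#define NUM_IDS 3                 /* IDs "0", "1", "2" from the example */

float observed_max[NUM_IDS];      /* largest |value| seen per ID, starts at 0 */

/* Hypothetical profiling hook: remember the largest magnitude per ID. */
void profile(float value, int id)
{
    float mag = (value < 0.0f) ? -value : value;
    if (mag > observed_max[id])
        observed_max[id] = mag;
}

/* Instrumented form of: y = a*x[i] + b*x[i+1]; */
float instrumented_step(float a, float b, const float *x, int i)
{
    float tmp_1, tmp_2, y;

    tmp_1 = a * x[i];
    profile(tmp_1, 1);
    tmp_2 = b * x[i + 1];
    profile(tmp_2, 2);
    y = tmp_1 + tmp_2;
    profile(y, 0);
    return y;
}
```

After running the instrumented program over representative inputs, each entry of `observed_max` gives the dynamic range from which a measured IWL for that ID can be derived.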



Conversion Process:

Desired Result

Continuation of Previous Example:

    float a, b, x[N];
    y = a*x[i] + b*x[i+1];

becomes

    int a, b, x[N];
    y = (a•x[i] >> 2) + b•x[i+1];

1. Type Conversion (float to int)

2. Scaling Operations (“>> 2”)

3. Fractional Fixed-Point Operations (“•”)



Conversion Process:

Type Conversion / Scaling Operation Generation

Type conversion: {float, double} → int

Scaling operations are added to expression trees using a post-order traversal...

Two previous algorithms from the literature generate scaling operations...

Neither uses Intermediate Result Profile data; instead, they combine range information from leaf nodes in a bottom-up fashion.

Is Useful Information Lost?



Conversion Process:

IRP: Using Intermediate Result Profile Data

‘Worst-Case Evaluation’: Markus Willems et al. FRIDGE: An Interactive Code Generation Environment for HW/SW CoDesign. ICASSP, April 1997.

A ‘Statistical’ Approach: Ki-Il Kum, Jiyang Kang, and Wonyong Sung. A Floating-Point to Fixed-Point C Converter for Fixed-Point Digital Signal Processors. In Proc. 2nd SUIF Compiler Workshop, August 1997.

UTDSP Algorithms: IRP, IRP-SA

Each node has a measured IWL and a current IWL.
Measured: IWL as determined by profiling
Current: IWL due to scaling operations within



Scaling Operation Generation

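Example: “A op B”, where A and B are converted sub-expressions.

[Figure: expression tree with node “op” over children A and B. A and B each carry a measured IWL and a current IWL; the node “A op B” has a measured IWL, and its current IWL is marked “?”.]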



IRP: Additive Operations

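“A ± B” → “(A << n_A) ± (B >> [n - n_B])”

where: $n_A = IWL_{A}^{current} - IWL_{A}^{measured}$

$n_B = IWL_{B}^{current} - IWL_{B}^{measured}$

$n = IWL_{A}^{measured} - IWL_{B}^{measured}$

$IWL_{A \pm B}^{current} = IWL_{A}^{measured}$

For example, assume |A| > |B|, and $IWL_{A \pm B}^{measured} \le IWL_{A}^{measured}$.

[Figure: bit fields of A and B; B is shifted right by n (“>> n”) so that its binary point lines up with that of A.]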
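The additive rule can be traced with a small C sketch. The names `iwl_t`, `shift`, and `irp_add` are hypothetical, and the code assumes that n_B means the current minus the measured IWL of B, that IWL of A measured is at least IWL of B measured, and that right shifts of negative integers are arithmetic.

```c
#include <stdint.h>

typedef struct {
    int measured;   /* IWL found by profiling */
    int current;    /* IWL implied by the scaling applied so far */
} iwl_t;

/* Shift by n bits: n > 0 shifts left, n < 0 shifts right. Right shifts of
 * negative values are assumed to be arithmetic (true on mainstream compilers). */
static int32_t shift(int32_t v, int n)
{
    if (n >= 0)
        return (int32_t)((uint32_t)v << n);
    return v >> -n;
}

/* Sketch of the IRP rule for A + B. */
int32_t irp_add(int32_t a, iwl_t ia, int32_t b, iwl_t ib, int *result_current)
{
    int nA = ia.current - ia.measured;
    int nB = ib.current - ib.measured;
    int n  = ia.measured - ib.measured;

    int32_t a_scaled = shift(a, nA);          /* A << nA       */
    int32_t b_scaled = shift(b, -(n - nB));   /* B >> (n - nB) */

    *result_current = ia.measured;            /* IWL of the sum */
    return a_scaled + b_scaled;
}
```

As a worked case with a 16-bit word: A = 3.0 stored with current IWL 3 (12 fraction bits) is 12288, and B = 0.75 stored with current IWL 1 (14 fraction bits) is also 12288. With measured IWLs of 2 and 0, the rule gives n_A = 1, n_B = 1, n = 2, so A is shifted left by 1 and B right by 1. The sum is 30720, which at IWL 2 (13 fraction bits) represents 3.75.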



IRP: Multiplication

“A • B” → “(A << n_A) • (B << n_B)”

where: $n_A = IWL_{A}^{current} - IWL_{A}^{measured}$

$n_B = IWL_{B}^{current} - IWL_{B}^{measured}$

$IWL_{A \cdot B}^{current} = IWL_{A}^{measured} + IWL_{B}^{measured}$

Note: Typo in Notes! The notes give $IWL_{A \cdot B}^{current} = n_A + n_B$.
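A hedged sketch of the multiplication rule for a 16-bit word follows. The fractional multiply is modelled here as keeping the upper bits of the 32-bit product after dropping the redundant sign bit; that model, and the names `iwl_t`, `frac_mul`, and `irp_mul`, are assumptions for illustration. Both n_A and n_B are assumed non-negative.

```c
#include <stdint.h>

#define WL 16   /* word length assumed for this sketch */

typedef struct {
    int measured;   /* IWL found by profiling */
    int current;    /* IWL implied by the scaling applied so far */
} iwl_t;

/* Fractional multiply: keep the upper WL bits of (a*b) << 1. */
static int16_t frac_mul(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * (int32_t)b) >> (WL - 1));
}

/* Sketch of the IRP rule for A * B: normalize both operands, then multiply. */
int16_t irp_mul(int16_t a, iwl_t ia, int16_t b, iwl_t ib, int *result_current)
{
    int nA = ia.current - ia.measured;   /* assumed >= 0 */
    int nB = ib.current - ib.measured;   /* assumed >= 0 */

    int16_t a_scaled = (int16_t)((uint16_t)a << nA);   /* A << nA */
    int16_t b_scaled = (int16_t)((uint16_t)b << nB);   /* B << nB */

    *result_current = ia.measured + ib.measured;       /* IWL of the product */
    return frac_mul(a_scaled, b_scaled);
}
```

Using A = 3.0 (raw 12288, current IWL 3, measured IWL 2) and B = 0.75 (raw 12288, current IWL 1, measured IWL 0), both operands are shifted left by one bit and the fractional product is 18432. At the resulting IWL of 2 (13 fraction bits) that value represents 2.25.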



IRP-SA: Using ‘Shift Absorption’

Problem:

    y = (a*x[i] + (b*x[i+1] >> 1)) << 1

Question: Is information discarded unnecessarily here?

Answer: Yes! Consider the following alternative:

    y = (a*x[i] << 1) + b*x[i+1]

Assuming 2’s-complement arithmetic, this expression results in a more precise answer.
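The effect can be checked numerically. In the sketch below, `p` stands for the product a*x[i] and `q` for b*x[i+1], both already reduced to 16-bit words; the function names and test values are hypothetical. All arithmetic is done modulo 2^16 to mimic 2's-complement wraparound, and right shifts of negative values are assumed to be arithmetic.

```c
#include <stdint.h>

/* Reduce a value modulo 2^16 and reinterpret it as a signed 16-bit word. */
static int16_t wrap16(uint32_t v)
{
    return (int16_t)(uint16_t)v;
}

/* First form: (p + (q >> 1)) << 1, which drops the least significant bit of q. */
int16_t sum_irp(int16_t p, int16_t q)
{
    int16_t t = wrap16((uint32_t)p + (uint32_t)(q >> 1));
    return wrap16((uint32_t)t << 1);
}

/* Alternative form: (p << 1) + q. The shift of p may overflow a
 * 16-bit word, but the final sum is still correct modulo 2^16. */
int16_t sum_irp_sa(int16_t p, int16_t q)
{
    return wrap16(((uint32_t)p << 1) + (uint32_t)q);
}
```

For p = 20000 and q = -30001 the exact result 2p + q is 9999. The first form returns 9998 because the low bit of q is shifted out, while the alternative returns 9999 even though p << 1 on its own does not fit in 16 bits.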











Architectural Support

Common occurrence (using IRP-SA):

    A•B << n

Fractional Multiplication with integrated Left Shift (FMLS):

[Figure: bit-field diagram of operands A and B and of the product A*B with a left shift applied.]
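One plausible reading of the FMLS operation, sketched for a 16-bit word: the left shift is applied to the full 32-bit product before the upper bits are kept, so low-order product bits that a separate shift would replace with zeros are retained. The exact hardware semantics are not given in the slides, so `fmls`, `mul_then_shift`, and the bit selection below are assumptions.

```c
#include <stdint.h>

#define WL 16   /* word length assumed for this sketch */

/* Fractional multiply followed by a separate left shift by n:
 * the n low bits of the result are always zero. */
int16_t mul_then_shift(int16_t a, int16_t b, int n)
{
    int16_t p = (int16_t)(((int32_t)a * (int32_t)b) >> (WL - 1));
    return (int16_t)((uint32_t)p << n);
}

/* Assumed FMLS behaviour: shift the 32-bit product first, then keep
 * the upper bits, so n more product bits survive. Requires n < WL. */
int16_t fmls(int16_t a, int16_t b, int n)
{
    return (int16_t)(((int32_t)a * (int32_t)b) >> (WL - 1 - n));
}
```

With a = 1000, b = 3000, and n = 3, the 32-bit product is 3000000. The separate shift yields 728, while the integrated form yields 732, which is closer to the unrounded value of about 732.4.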












Four test cases presented in the paper:

(1) 4th Order IIR Filter

(2) 1024 Point Radix 2 Decimation in Time FFT

(3) Nonlinear Feedback Control System

(4) 16th Order Lattice Filter

Look at (1) in detail, summarize results for others.

Explore some interesting properties exhibited in (4) that are indicative of possible future improvements.



Experimental Results:

4th Order IIR Filter

4th Order Chebyshev Type II Low-Pass Filter
Designed using MATLAB’s cheby2 command

Transfer Function:
[Figure: magnitude (dB) and phase (degrees) versus normalized frequency (×π rad/sample).]




4th Order IIR Filter (cont’d)

Filter Realization:
MATLAB’s tf2sos command (pole-zero pairing)
2 Cascaded Direct-Form IIR filters

Algorithm    14 Bit                   16 Bit
             w/o FMLS    w/ FMLS      w/o FMLS    w/ FMLS
SNU-4        44.7 dB     44.7 dB      56.4 dB     56.4 dB
WC           45.6 dB     45.6 dB      57.1 dB     57.1 dB
IRP          49.2 dB     49.3 dB      60.9 dB     62.0 dB
IRP-SA       48.8 dB     53.5 dB      61.0 dB     66.9 dB




4th Order IIR Filter (cont’d)

IRP:

    (A2[0]*t2 - A2[1]*D2[0] << 1) + (A2[2]*D2[1] << 1 ) << 2

IRP-SA:

    (A2[0]*t2 << 3) - (A2[1]*D2[0] << 3) + (A2[2]*D2[1] << 3)




1024-Point Radix-2 FFT



SNU-4        28.7 dB     28.7 dB      36.7 dB     36.7 dB
WC           28.7 dB     28.7 dB      36.7 dB     36.7 dB
IRP          28.7 dB     34.9 dB      36.7 dB     44.6 dB
IRP-SA       28.7 dB     34.9 dB      36.7 dB     44.6 dB




Rotational Inverted Pendulum

U of T System Control Group
Non-linear Testbench




Rotational Inverted Pendulum



SNU-4        42.7 dB     4.0 dB       30.7 dB     54.9 dB
WC           47.3 dB     54.3 dB      66.1 dB     59.2 dB
IRP          53.1 dB     58.4 dB      65.8 dB     71.8 dB
IRP-SA       52.8 dB     59.4 dB      64.4 dB     72.0 dB




Rotational Inverted Pendulum - 12-bit Controller Comparison

WC: 32.8 dB
IRP-SA: 41.1 dB
IRP-SA w/ FMLS: 48.0 dB




16th Order Lattice Filter

16th Order Elliptic Bandpass Filter Transfer Function

[Figure: magnitude (dB) and phase (degrees) plots.]




Lattice Filter

Algorithm    32 Bit w/o Loop Unrolling    16 Bit w/ Loop Unrolling
SNU-4        22.8 dB     22.8 dB          47.1 dB     47.0 dB
WC           28.1 dB     28.1 dB          48.3 dB     48.3 dB
IRP          36.1 dB     36.2 dB          51.3 dB     51.3 dB
IRP-SA       36.1 dB     36.2 dB          51.3 dB     50.9 dB




Lattice Filter

    #define N 16
    double state[N+1], K[N], V[N+1];

    double lattice( double x )
    {
        double y = 0.0;
        for( int i=0; i < N; i++ ) {
            /* walk the lattice stages from N-1 down to 0 */
            x = x - K[N-i-1] * state[N-i-1];
            state[N-i] = state[N-i-1] + K[N-i-1]*x;
            /* accumulate the weighted state into the output */
            y = y + V[N-i]*state[N-i];
        }
        state[0] = x;
        return y + V[0]*state[0];
    }




Lattice Filter

Observation: The wide dynamic ranges of “state”, “V”, “x”, and “y” are due to ‘Name Dependencies’ of array elements and accumulators when assigning integer word lengths.

Loop unrolling plus renaming can break these dependencies and achieve far better results (iteration-dependent analysis is mentioned in the FRIDGE paper; however, no experimental results are reported there).
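The cost of a name dependency can be illustrated with a small sketch. When every element of an array shares one name, a single IWL has to cover the largest element; once unrolling and renaming give each element its own name, each can get its own IWL. The helpers `shared_iwl` and `per_element_iwl` are hypothetical, and any input maxima passed to them here are made-up values rather than measurements from the lattice filter.

```c
/* Smallest IWL such that max_abs < 2^IWL (hypothetical helper). */
static int iwl_from_max_abs(double max_abs)
{
    int iwl = 0;

    if (max_abs <= 0.0)
        return 0;              /* assumption: no range observed yet */
    while (max_abs >= 1.0) {
        max_abs /= 2.0;
        iwl++;
    }
    while (max_abs < 0.5) {
        max_abs *= 2.0;
        iwl--;
    }
    return iwl;
}

/* One IWL for the whole array: driven by the largest element. */
int shared_iwl(const double *max_abs, int count)
{
    double largest = 0.0;
    for (int i = 0; i < count; i++)
        if (max_abs[i] > largest)
            largest = max_abs[i];
    return iwl_from_max_abs(largest);
}

/* One IWL per element, as renaming after unrolling would allow. */
void per_element_iwl(const double *max_abs, int count, int *iwl_out)
{
    for (int i = 0; i < count; i++)
        iwl_out[i] = iwl_from_max_abs(max_abs[i]);
}
```

For illustrative maxima of 0.3, 5.0, and 100.0, the shared IWL is 7, whereas the per-element IWLs are -1, 3, and 7. In a fixed word length, the first element gains eight bits of fractional precision and the second gains four.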











Summary

Intermediate result profile data can be used to reduce the numerical error of fixed-point code.

A fractional multiply with integrated left shift operation can improve the results, especially when combined with the IRP-SA algorithm.

Improvements between 3.0 dB and 12.8 dB have been observed so far.



Future Directions

Structural Transformations

Extended Precision Arithmetic

Overflows due to accumulated rounding error: use two profiling phases to estimate the effect of ‘second-order’ interactions.
