
What is the WHT anyway, and why are there so many ways to compute it?

Jeremy Johnson

1, 2, 6, 24, 112, 568, 3032, 16768,…

Walsh-Hadamard Transform

• y = WHT_N x,  N = 2^n

• WHT_N = WHT_2 ⊗ WHT_2 ⊗ ··· ⊗ WHT_2  (n factors)

  WHT_2 = [ 1  1 ]
          [ 1 -1 ]

  WHT_4 = WHT_2 ⊗ WHT_2 = [ 1  1  1  1 ]
                          [ 1 -1  1 -1 ]
                          [ 1  1 -1 -1 ]
                          [ 1 -1 -1  1 ]
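As a quick sanity check (not part of the original slides), the Kronecker-product definition can be evaluated in plain Python; `kron` and `wht` are ad-hoc helper names:

```python
# Sketch: build WHT_N as the n-fold Kronecker product of WHT_2,
# using plain Python lists of rows.

WHT2 = [[1, 1],
        [1, -1]]

def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def wht(n):
    """WHT_N for N = 2**n, built as WHT_2 (x) ... (x) WHT_2 (n factors)."""
    M = WHT2
    for _ in range(n - 1):
        M = kron(M, WHT2)
    return M
```

For example, `wht(2)` reproduces the 4×4 matrix shown above.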

WHT Algorithms

• Factor WHT_N into a product of sparse structured matrices

• Compute: y = (M_1 M_2 ··· M_t) x

  y_t = M_t x
  …
  y_2 = M_2 y_3
  y   = M_1 y_2
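A minimal sketch of this evaluation scheme (hypothetical helper names; matrices represented as lists of rows, though the real factors would be applied as sparse operators):

```python
def matvec(M, x):
    """Dense matrix-vector product (for illustration only)."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def apply_factors(factors, x):
    """y = (M1 M2 ... Mt) x, computed right to left:
    y_t = Mt x, ..., y_2 = M2 y_3, y = M1 y_2."""
    y = x
    for M in reversed(factors):
        y = matvec(M, y)
    return y
```

With the single factor WHT_2, `apply_factors([[[1, 1], [1, -1]]], [3, 5])` gives `[8, -2]`.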

Factoring the WHT Matrix

• (A ⊗ B)(C ⊗ D) = AC ⊗ BD

• A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C

• I_m ⊗ I_n = I_{mn}

  WHT_4 = WHT_2 ⊗ WHT_2 = (WHT_2 ⊗ I_2)(I_2 ⊗ WHT_2):

  WHT_2 ⊗ I_2 = [ 1  0  1  0 ]     I_2 ⊗ WHT_2 = [ 1  1  0  0 ]
                [ 0  1  0  1 ]                   [ 1 -1  0  0 ]
                [ 1  0 -1  0 ]                   [ 0  0  1  1 ]
                [ 0  1  0 -1 ]                   [ 0  0  1 -1 ]
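The identity WHT_4 = (WHT_2 ⊗ I_2)(I_2 ⊗ WHT_2) is easy to check numerically; this sketch uses ad-hoc `kron`/`matmul` helpers over plain lists:

```python
def kron(A, B):
    """Kronecker product of two matrices (lists of rows)."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def matmul(A, B):
    """Ordinary matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

WHT2 = [[1, 1], [1, -1]]
I2 = [[1, 0], [0, 1]]

left = kron(WHT2, I2)    # the stride-2 butterfly factor
right = kron(I2, WHT2)   # the block-diagonal factor
```

Multiplying `left` by `right` recovers `kron(WHT2, WHT2)`, i.e. WHT_4.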

Recursive and Iterative Factorization

WHT_8 = WHT_2 ⊗ WHT_4
      = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_4)
      = (WHT_2 ⊗ I_4)(I_2 ⊗ (WHT_2 ⊗ I_2)(I_2 ⊗ WHT_2))      [recursive]
      = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_2 ⊗ I_2)(I_2 ⊗ I_2 ⊗ WHT_2)
      = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_2 ⊗ I_2)(I_4 ⊗ WHT_2)        [iterative]

Recursive Algorithm

[Figure: the sparse factor matrices and the dense 8×8 matrix WHT_8, written out entrywise.]

WHT_8 = (WHT_2 ⊗ I_4)(I_2 ⊗ (WHT_2 ⊗ I_2)(I_2 ⊗ WHT_2))

Iterative Algorithm

[Figure: the three sparse factors (WHT_2 ⊗ I_4), (I_2 ⊗ WHT_2 ⊗ I_2), (I_4 ⊗ WHT_2) and the dense 8×8 matrix WHT_8, written out entrywise.]

WHT_8 = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_2 ⊗ I_2)(I_4 ⊗ WHT_2)
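Both factorizations turn into short programs. The following sketch (plain Python, not the WHT package's C code) implements the recursive and the iterative scheme side by side:

```python
def wht_recursive(x):
    """WHT_N x via WHT_N = (WHT_2 (x) I_{N/2})(I_2 (x) WHT_{N/2})."""
    N = len(x)
    if N == 1:
        return list(x)
    a = wht_recursive(x[:N // 2])   # I_2 (x) WHT_{N/2}: transform each half
    b = wht_recursive(x[N // 2:])
    # WHT_2 (x) I_{N/2}: butterfly the two transformed halves
    return [p + q for p, q in zip(a, b)] + [p - q for p, q in zip(a, b)]

def wht_iterative(x):
    """WHT_N x via the factors I (x) WHT_2 (x) I, smallest stride first."""
    y = list(x)
    N, s = len(y), 1
    while s < N:
        for b in range(0, N, 2 * s):
            for k in range(b, b + s):
                u, v = y[k], y[k + s]
                y[k], y[k + s] = u + v, u - v
        s *= 2
    return y
```

Both produce the same result, e.g. `[10, -2, -4, 0]` for input `[1, 2, 3, 4]`.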

WHT Algorithms

• Recursive

  WHT_N = (WHT_2 ⊗ I_{N/2})(I_2 ⊗ WHT_{N/2})

• Iterative

  WHT_{2^n} = Π_{i=1}^{n} (I_{2^{i-1}} ⊗ WHT_2 ⊗ I_{2^{n-i}})

• General

  WHT_{2^n} = Π_{i=1}^{t} (I_{2^{n_1+···+n_{i-1}}} ⊗ WHT_{2^{n_i}} ⊗ I_{2^{n_{i+1}+···+n_t}}),

  where n = n_1 + ··· + n_t

WHT Implementation

• Definition/formula

  – N = N_1 N_2 ··· N_t,  N_i = 2^{n_i}
  – x = WHT_N · x  (computed in place)
  – x_M^{b,s} = (x(b), x(b+s), …, x(b+(M-1)s)), the stride-s subvector of length M starting at b

  WHT_{2^n} = Π_{i=1}^{t} (I_{2^{n_1+···+n_{i-1}}} ⊗ WHT_{2^{n_i}} ⊗ I_{2^{n_{i+1}+···+n_t}})

• Implementation (nested loop)

  R = N; S = 1;
  for i = t, …, 1
    R = R / N_i
    for j = 0, …, R-1
      for k = 0, …, S-1
        x_{N_i}^{jN_iS+k, S} = WHT_{N_i} · x_{N_i}^{jN_iS+k, S}
    S = S · N_i
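The R/S loop can be rendered directly in Python (a sketch, not the package's actual C implementation; `wht_base` stands in for the unrolled small-code leaves):

```python
def wht_base(x):
    """Direct WHT of a small vector (stand-in for the unrolled base cases)."""
    if len(x) == 1:
        return list(x)
    h = len(x) // 2
    a, b = wht_base(x[:h]), wht_base(x[h:])
    return [p + q for p, q in zip(a, b)] + [p - q for p, q in zip(a, b)]

def wht_general(x, ns):
    """x <- WHT_N x with N = 2**(n1+...+nt), following the nested R/S loop."""
    x = list(x)
    N = len(x)
    R, S = N, 1
    for ni in reversed(ns):               # i = t, ..., 1
        Ni = 2 ** ni
        R //= Ni
        for j in range(R):
            for k in range(S):
                b = j * Ni * S + k        # base of the stride-S subvector
                sub = wht_base([x[b + m * S] for m in range(Ni)])
                for m in range(Ni):
                    x[b + m * S] = sub[m]
        S *= Ni
    return x
```

Any split of n gives the same transform: for a length-8 input, `wht_general(x, [1, 2])`, `wht_general(x, [1, 1, 1])`, and `wht_general(x, [3])` all agree.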

i

Partition Trees

[Figure: example partition trees.]

• Right recursive (n = 4): 4 → (1, 3), 3 → (1, 2), 2 → (1, 1)

• Iterative (n = 4): 4 → (1, 1, 1, 1)

• Left recursive (n = 4): 4 → (3, 1), 3 → (2, 1), 2 → (1, 1)

• Balanced (n = 4): 4 → (2, 2), each 2 → (1, 1)

• A larger example (n = 9): 9 → (3, 4, 2), with the subtrees split further

Ordered Partitions

• There is a 1-1 mapping from ordered partitions of n onto (n-1)-bit binary numbers, so there are 2^{n-1} ordered partitions of n.

  Example (n = 9): the partition 1+2+4+2 = 9 corresponds to placing bars in

  1|1 1|1 1 1 1|1 1   ↔   1 0 1 0 0 0 1 0 = 162
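The bijection is easy to make executable (a sketch; `partition_from_bits` is a hypothetical name). Reading the n-1 gap bits from most significant to least, a 1-bit ends the current part:

```python
def partition_from_bits(n, bits):
    """Decode an (n-1)-bit integer into an ordered partition of n.
    Bit positions are the n-1 gaps between n ones; a 1 means 'bar here'."""
    parts, run = [], 1
    for pos in range(n - 2, -1, -1):   # most significant gap first
        if (bits >> pos) & 1:
            parts.append(run)
            run = 1
        else:
            run += 1
    parts.append(run)
    return parts
```

`partition_from_bits(9, 162)` recovers `[1, 2, 4, 2]`, matching the example above, and the 2^{n-1} integers 0 … 2^{n-1}-1 decode to all distinct partitions of n.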

Enumerating Partition Trees

[Figure: enumeration of the partition trees for n = 3, each tree labeled with the 2-bit encoding of its root partition: 00 ↔ 3; 01 ↔ 1+2; 10 ↔ 2+1; 11 ↔ 1+1+1; subtrees are enumerated recursively.]

Search Space

• Optimization of the WHT becomes a search, over the space of partition trees, for the fastest algorithm.

• The number of trees:

  T_1 = 1,  T_n = 1 + Σ_{t=2}^{n} Σ_{n_1+···+n_t = n} T_{n_1} T_{n_2} ··· T_{n_t}

  n:   1  2  3   4    5    6     7      8
  T_n: 1  2  6  24  112  568  3032  16768
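The recurrence can be evaluated with a single convolution if one also counts ordered forests (sequences of one or more trees); this sketch reproduces the table:

```python
def count_trees(nmax):
    """T[n] = number of partition trees of size n.
    G[n] counts ordered forests (sequences of >= 1 trees) of total size n,
    so T[n] = 1 (leaf) + sum_k T[k]*G[n-k] (root with >= 2 subtrees)."""
    T = [0] * (nmax + 1)
    G = [0] * (nmax + 1)
    for n in range(1, nmax + 1):
        split = sum(T[k] * G[n - k] for k in range(1, n))
        T[n] = 1 + split          # a leaf, or a split root
        G[n] = T[n] + split       # one tree, or first tree + rest of forest
    return T
```

`count_trees(8)[1:]` yields `[1, 2, 6, 24, 112, 568, 3032, 16768]`, the sequence from the title slide.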

Size of Search Space

• Let T(z) be the generating function for T_n:

  T(z) = z/(1-z) + T(z)²/(1-T(z))

  T_n = Θ(αⁿ/n^{3/2}),  where α = 4 + √8 ≈ 6.8284

• Restricting to binary trees:

  B(z) = z/(1-z) + B(z)²

  B_n = Θ(5ⁿ/n^{3/2})

WHT Package (Püschel & Johnson, ICASSP ’00)

• Allows easy implementation of any of the possible WHT algorithms

• Partition tree representation: W(n) = small[n] | split[W(n1), …, W(nt)]

• Tools

– Measure runtime of any algorithm

– Measure hardware events

– Search for good implementation

• Dynamic programming

• Evolutionary algorithm

Histogram (n = 16, 10,000 samples)

• Wide range in performance despite an equal number of arithmetic operations (n·2^n flops)

• Pentium III consumes more run time (more pipeline stages)

• UltraSPARC II spans a larger range

Operation Count

Theorem. Let W_N be a WHT algorithm of size N = 2^n. Then the number of floating point operations (flops) used by W_N is N·lg(N).

Proof. By induction on n. The base case WHT_2 uses 2 = 2·lg(2) flops. For a split

  W_{2^n} = Π_{i=1}^{t} (I_{2^{n_1+···+n_{i-1}}} ⊗ W_{2^{n_i}} ⊗ I_{2^{n_{i+1}+···+n_t}}),

factor i applies W_{2^{n_i}} to 2^{n-n_i} subvectors, so

  flops(W_{2^n}) = Σ_{i=1}^{t} 2^{n-n_i} · flops(W_{2^{n_i}})
                 = Σ_{i=1}^{t} 2^{n-n_i} · n_i·2^{n_i}
                 = (n_1 + ··· + n_t)·2^n = n·2^n = N·lg(N).  ∎
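The inductive step can be spot-checked numerically: assuming each leaf of size 2^{n_i} costs n_i·2^{n_i} flops, every split of n gives the same total (a sketch with a hypothetical `flops` helper):

```python
def flops(ns):
    """Flop count of the general algorithm with leaf sizes 2**ni:
    factor i applies WHT_{2^{n_i}} to 2**(n - ni) subvectors,
    each costing ni * 2**ni flops."""
    n = sum(ns)
    return sum(2 ** (n - ni) * (ni * 2 ** ni) for ni in ns)
```

Every split of n = 5 yields 5·2^5 = 160 flops, as the theorem predicts.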

Instruction Count Model

  IC(n) = A(n)·α + Σ_{l=1}^{3} A_l(n)·α_l + Σ_{i=1}^{3} L_i(n)·β_i

A(n) = number of calls to the WHT procedure; α = number of instructions outside the loops

A_l(n) = number of calls to the base case of size l; α_l = number of instructions in the base case of size l

L_i(n) = number of iterations of the outer (i = 1), middle (i = 2), and inner (i = 3) loop; β_i = number of instructions in the corresponding loop body

Small[1]

.file "s_1.c"

.version "01.01"

gcc2_compiled.:

.text

.align 4

.globl apply_small1

.type apply_small1,@function

apply_small1:

movl 8(%esp),%edx //load stride S to EDX

movl 12(%esp),%eax //load x array's base address to EAX

fldl (%eax) // st(0)=R7=x[0]

fldl (%eax,%edx,8) //st(0)=R6=x[S]

fld %st(1) //st(0)=R5=x[0]

fadd %st(1),%st // R5=x[0]+x[S]

fxch %st(2) //st(0)=R5=x[0],s(2)=R7=x[0]+x[S]

fsubp %st,%st(1) //st(0)=R6=x[S]-x[0] ?????

fxch %st(1) //st(0)=R6=x[0]+x[S],st(1)=R7=x[S]-x[0]

fstpl (%eax) //store x[0]=x[0]+x[S]

fstpl (%eax,%edx,8) //store x[S]=x[S]-x[0]

ret

Recurrences

A(n) = 0, if the tree is a leaf

A(n) = 1 + A(n_1) + ··· + A(n_t), if the root splits n = n_1 + ··· + n_t

A_l(n) = number of leaves of size l in the tree:

A_l(l) = 1 for a leaf of size l (0 for a leaf of another size)

A_l(n) = A_l(n_1) + ··· + A_l(n_t) for a split n = n_1 + ··· + n_t

Recurrences

[Slide: recurrences for the loop counts. L_i(n) = 0 for a leaf; for a split n = n_1 + ··· + n_t, each L_i(n) sums the iterations of loop i executed for the t factors plus 2^{n-n_i} times the subtree counts L_i(n_i), since the subtree of size n_i is invoked 2^{n-n_i} times.]

Histogram using Instruction Model (P3)

α_1 = 12, α_2 = 34, and α_3 = 106; α = 27; β_3 = 18, β_2 = 18, and β_1 = 20

Algorithm Comparison

Recursive/Iterative Runtime

[Chart: ratio r1/i1 vs. WHT size 2^n, n = 1…22.]

ratio r1/i1

Rec &Bal/It Instruction Count

[Chart: instruction-count ratios rr1/i1, lr1/i1, bal1/i1 vs. n = 1…19.]

Rec&It/Best Runtime

[Chart: runtime ratios r1/b, r3/b, i1/b, i3/b, b/b vs. WHT size 2^n, n = 1…22.]

Small/It Runtime

[Chart: runtime ratios i_1/rt, r_1/rt vs. WHT size 2^n, n = 1…8.]

Dynamic Programming

  Cost(T_n) = min_{n_1+···+n_t = n} Cost(split[T_{n_1}, …, T_{n_t}]),

where T_n is the optimal tree of size n.

This depends on the assumption that Cost depends only on the size of a tree and not on where it is located (true for IC, but false for runtime).

For IC, the optimal tree is iterative with appropriate leaves. For runtime, DP is a good heuristic (used with binary trees).
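For binary trees the dynamic program is only a few lines. This sketch works under the stated assumption that cost depends only on subtree size; `leaf_cost` is a hypothetical cost model, and the split cost follows the flop-count pattern (each factor is applied to that many subvectors):

```python
def dp(n, leaf_cost, max_leaf=5):
    """Best binary partition tree of size n, assuming a split n = k + (n-k)
    costs 2**(n-k) * cost(k) + 2**k * cost(n-k), and leaves up to
    max_leaf cost leaf_cost(m). Returns (cost, tree)."""
    cost, tree = {}, {}
    for m in range(1, n + 1):
        # option 1: a single leaf, if small enough to have unrolled code
        best = (leaf_cost(m), m) if m <= max_leaf else (float('inf'), None)
        # option 2: the best binary split, reusing optimal subtrees
        for k in range(1, m):
            c = 2 ** (m - k) * cost[k] + 2 ** k * cost[m - k]
            if c < best[0]:
                best = (c, (tree[k], tree[m - k]))
        cost[m], tree[m] = best
    return cost[n], tree[n]
```

With the pure flop count `leaf_cost = lambda m: m * 2**m`, every tree ties at n·2^n, so the DP returns 8·2^8 = 2048 for n = 8; a cost model that penalizes particular leaf sizes drives the search toward different tree shapes.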

Optimal Formulas

Pentium

[1], [2], [3], [4], [5], [6]

[7]

[[4],[4]]

[[5],[4]]

[[5],[5]]

[[5],[6]]

[[2],[[5],[5]]]

[[2],[[5],[6]]]

[[2],[[2],[[5],[5]]]]

[[2],[[2],[[5],[6]]]]

[[2],[[2],[[2],[[5],[5]]]]]

[[2],[[2],[[2],[[5],[6]]]]]

[[2],[[2],[[2],[[2],[[5],[5]]]]]]

[[2],[[2],[[2],[[2],[[5],[6]]]]]]

[[2],[[2],[[2],[[2],[[2],[[5],[5]]]]]]]

UltraSPARC

[1], [2], [3], [4], [5], [6]

[[3],[4]]

[[4],[4]]

[[4],[5]]

[[5],[5]]

[[5],[6]]

[[4],[[4],[4]]]

[[[4],[5]],[4]]

[[4],[[5],[5]]]

[[[5],[5]],[5]]

[[[5],[5]],[6]]

[[4],[[[4],[5]],[4]]]

[[4],[[4],[[5],[5]]]]

[[4],[[[5],[5]],[5]]]

[[5],[[[5],[5]],[5]]]

Different Strides

• The dynamic programming assumption does not hold exactly: execution time depends on the stride at which a subtree is applied, not only on its size.

[Chart: runtime relative to stride 1 (time/time for stride 1, ranging 0.5–4.5) vs. stride (0–20), for W(1)–W(5).]