Upload
elwin-singleton
View
218
Download
2
Embed Size (px)
Citation preview
What is the WHT anyway, and why are there so many ways to
compute it?
Jeremy Johnson
1, 2, 6, 24, 112, 568, 3032, 16768,…
Walsh-Hadamard Transform
• y = WHTN x, N = 2n
n
N WHTWHTWHT 22...
11
11WHT2
1111
1111
1111
1111
11
11
11
11
WHTWHTWHT 224
WHT Algorithms
• Factor WHTN into a product of sparse structured matrices
• Compute: y = (M1 M2 … Mt)xyt = Mtx…
y2 = M2 y3
y = M1 y2
Factoring the WHT Matrix
• AC DCD• A • A C) = (A C
• Im nmn
1100
1100
0011
0011
1010
0101
1010
0101
1111
1111
1111
1111
4WHT
WHT2 WHT2WHT2WHT2
Recursive and Iterative Factorization
WHT8 WHT2WHT4
WHT2WHT2 I2WHT
WHT2WHT2 I2I2WHT
WHT2WHT2 I2I2WHT
WHT2WHT2 I2I2WHT
WHT2WHT2 I4WHT
WHT8 =WHT2WHT2
I4WHT
11
11
11
11
11
11
11
11
11
11
1
11
11
11
11
11
11
11
11
11
11
11
11
11
11111111
11111111
11111111
11111111
11111111
11111111
11111111
11111111
WHT Algorithms
• Recursive
• Iterative
• General
n
i
ini
N1
IWHTIWHT 2221
WHTIIWHTWHT 2 2/2/2 NNN
nnn t
t
i
nnnnnn tiii
1
1
where
,2222 IWHTIWHT 111
WHT Implementation
• Definition/formula– N=N1* N2Nt Ni=2ni – x=WHTN*x x =(x(b),x(b+s),…x(b+(M-1)s))
• Implementation(nested loop) R=N; S=1; for i=t,…,1 R=R/Ni
for j=0,…,R-1 for k=0,…,S-1 S=S* Ni;
Mb,s
t
i 1
)
nn WHTWHT 222 II( n1+ ··· + ni-1 2ni+1+ ··· + nt
i
ii
i
i
NSkSjNN
NSkSjN xWHTx ,,
i
Partition Trees
4
1 3
1 2
1 1
4
1 1 11
Right Recursive
Iterative
9
3 4 2
1 2 1
1 1
4
13
12
11
Left Recursive
4
2 2
1 1 1 1
Balanced
Ordered Partitions
• There is a 1-1 mapping from ordered partitions of n onto (n-1)-bit binary numbers.
There are 2n-1 ordered partitions of n.
1|1 1|1 1 1 1|1 1 1+2+4+2 = 9
162 = 1 0 1 0 0 0 1 0
Enumerating Partition Trees
3 3
12
00 01
3
12
11
01
3
21
10
3
1 2
1 1
10
3
11
11
1
Counting Partition Trees
1,1
1,1 TTT 1
1
n
nn
nnnn t
tn
8.6),/(
)22(2
8811
))T(1(
)T()1(
2462
2/3
2
432
0
T
)T(
T)T(
n
z
zzzzzz
n
n
n
nn
z
zz
z
zzz
WHT PackagePüschel & Johnson (ICASSP ’00)• Allows easy implementation of any of the possible WHT algorithms• Partition tree representation W(n)=small[n] | split[W(n1),…W(nt)]• Tools
– Measure runtime of any algorithm
– Measure hardware events (coupled with PCL)
– Search for good implementation
• Dynamic programming
• Evolutionary algorithm
Histogram (n = 16, 10,000 samples)
•Wide range in performance despite equal number of arithmetic operations (n2n flops)
•Pentium III consumes more run time (more pipeline stages)
•Ultra SPARC II spans a larger range
Operation Count
Theorem. Let WN be a WHT algorithm of size N. Then the number of floating point operations (flops) used by WN is Nlg(N).
Proof. By induction.
2222
)(2)(
11
1
nninnn
Nflopsnn
Nflops
nnn
WWt
ii
it
i
i
i
t
i
i
Instruction Count Model
)()()A()IC(3
1
3
1AL nlninn
ll
ii
A(n) = number of calls to WHT procedure= number of instructions outside loopsAl(n) = Number of calls to base case of size l l = number of instructions in base case of size l
Li = number of iterations of outer (i=1), middle (i=2), and outer (i=3) loop i = number of instructions in outer (i=1), middle (i=2), and outer (i=3) loop body
Small[1].file "s_1.c"
.version "01.01"
gcc2_compiled.:
.text
.align 4
.globl apply_small1
.type apply_small1,@function
apply_small1:
movl 8(%esp),%edx //load stride S to EDX
movl 12(%esp),%eax //load x array's base address to EAX
fldl (%eax) // st(0)=R7=x[0]
fldl (%eax,%edx,8) //st(0)=R6=x[S]
fld %st(1) //st(0)=R5=x[0]
fadd %st(1),%st // R5=x[0]+x[S]
fxch %st(2) //st(0)=R5=x[0],s(2)=R7=x[0]+x[S]
fsubp %st,%st(1) //st(0)=R6=x[S]-x[0] ?????
fxch %st(1) //st(0)=R6=x[0]+x[S],st(1)=R7=x[S]-x[0]
fstpl (%eax) //store x[0]=x[0]+x[S]
fstpl (%eax,%edx,8) //store x[0]=x[0]-x[S]
ret
Recurrences
leaf a ,0)A(
... ),A(1)A(1
12
n
nnnn
n
nnnti
t
i
i
lnl
ln
llA leaves ofnumber where,)( 2
Recurrences
leaf a ,0)(
... ,)()(
... ,)()(
... ),()(
L
2L2L
2L2L
L2L
11
23
1
...
122
11
11
11
n
nnnn
nnnn
nnnn
n
nnnnn
nnnnn
nntn
i
ti
t
i
ti
t
i
ti
t
i
ii
ii
i
Histogram using Instruction Model (P3)
l = 12, l = 34, and l = 106 = 271 = 18, 2 = 18, and 1 = 20
Algorithm Comparison
Recursive/Iterative Runtime
0.00E+002.00E-014.00E-016.00E-018.00E-011.00E+00
1.20E+001.40E+001.60E+001.80E+002.00E+00
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
WHT size(2̂ n)
ratio r1/i1
Rec &Bal/It Instruction Count
0
0.5
1
1.5
2
2.5
1 3 5 7 9 11 13 15 17 19
rr1/i1
lr1/i1
bal1/i1
Rec&It/Best Runtime
0.00E+00
2.00E+00
4.00E+00
6.00E+00
8.00E+00
1.00E+01
1.20E+01
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
WHT size(2̂ n)
ratio
r1/b
r3/b
i1/b
i3/b
b/b
Small/It Runtime
0.00E+002.00E+00
4.00E+006.00E+00
8.00E+001.00E+01
1.20E+01
1 2 3 4 5 6 7 8
WHT size(2̂ n)
ratio I_1/rt
r_1/rt
Dynamic Programming
minn1+… + nt= n
Cost(T T…
n1 nt
n),
where Tn is the optimial tree of size n.
This depends on the assumption that Cost only depends onthe size of a tree and not where it is located.(true for IC, but false for runtime).
For IC, the optimal tree is iterative with appropriate leaves.For runtime, DP is a good heuristic (used with binary trees).