Neural Universal Discrete Denoiser
Taesup Moon
Daegu Gyeongbuk Institute of Science and Technology (DGIST), Daegu, Korea
April 22, 2016, Information Theory Forum, Stanford University
1 / 38
Outline
Introduction to discrete denoising
Deep neural networks
Neural DUDE
Experimental results
2 / 38
Discrete denoising - an estimation problem
- X_i, Z_i, X̂_i take values in finite alphabets (e.g., binary, DNA)
- Goal: choose X̂^n as close as possible to X^n, based on Z^n
3 / 38
E.g., Correcting noisy bit stream
X^n : 00000011111110000000000111111111100000001111111
Z^n : 00100011101110010001000111110111100000011110111
X̂^n : 00000011101110000000000011111111100000011111111

bit error rate (BER) = (number of errors) / n = 3/47 ≈ 0.064

- X^n : clean bit stream
- Z^n : noisy bit stream (suppose bits are flipped w.p. δ = 0.1)
- X̂^n : estimate of X^n based on observing Z^n

"Given Z^n, how would you choose X̂^n to minimize BER?"
4 / 38
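As a small aside (my own illustration, not part of the original slides), the BER here is just a symbol-wise comparison of the clean and reconstructed streams; a few lines of Python reproduce the 3/47 figure:

```python
# Minimal sketch: BER of the reconstruction against the clean stream above.
x_clean = "00000011111110000000000111111111100000001111111"   # X^n
x_hat   = "00000011101110000000000011111111100000011111111"   # X-hat^n

n = len(x_clean)
errors = sum(a != b for a, b in zip(x_clean, x_hat))
print(f"BER = {errors}/{n} = {errors / n:.3f}")   # 3/47 ≈ 0.064
```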
Life is easy if the source and noise are known
[Bayesian setting], e.g.,
- Source : BSMC(p), a binary symmetric Markov chain that keeps its current state w.p. 1 − p and flips it w.p. p
- Noise : BSC(δ), a binary symmetric channel that flips each bit w.p. δ and leaves it unchanged w.p. 1 − δ
⇒ Z^n is a hidden Markov process (HMP) → the Forward-Backward recursion is optimal
5 / 38
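To make this Bayesian setting concrete, here is a short numpy sketch (my own, not from the talk; all helper names and parameter choices are mine) that simulates a BSMC(p) source through a BSC(δ) and denoises it with the symbol-wise Forward-Backward recursion:

```python
# Sketch: simulate BSMC(p) -> BSC(delta), then denoise with Forward-Backward.
import numpy as np

def simulate(n, p, delta, rng):
    x = np.zeros(n, dtype=int)
    x[0] = rng.integers(2)
    flips = rng.random(n) < p                  # Markov transitions: flip state w.p. p
    for i in range(1, n):
        x[i] = x[i-1] ^ flips[i]
    z = x ^ (rng.random(n) < delta)            # BSC: flip each bit w.p. delta
    return x, z.astype(int)

def forward_backward(z, p, delta):
    A = np.array([[1-p, p], [p, 1-p]])                    # source transition matrix
    B = np.array([[1-delta, delta], [delta, 1-delta]])    # channel emission matrix
    n = len(z)
    alpha = np.zeros((n, 2)); beta = np.ones((n, 2))
    alpha[0] = 0.5 * B[:, z[0]]
    alpha[0] /= alpha[0].sum()
    for i in range(1, n):                                 # forward pass
        alpha[i] = (alpha[i-1] @ A) * B[:, z[i]]
        alpha[i] /= alpha[i].sum()
    for i in range(n-2, -1, -1):                          # backward pass
        beta[i] = A @ (B[:, z[i+1]] * beta[i+1])
        beta[i] /= beta[i].sum()
    post = alpha * beta
    return (post[:, 1] > post[:, 0]).astype(int)          # symbol-wise MAP minimizes BER

rng = np.random.default_rng(0)
x, z = simulate(10**5, p=0.1, delta=0.1, rng=rng)
x_hat = forward_backward(z, 0.1, 0.1)
print("noisy BER:", np.mean(z != x), " FB BER:", np.mean(x_hat != x))
```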
In real life, the source is often not known
Correcting noisy bit stream
- Source : 0000011110000101011100000 . . . (?)
- Noise : BSC(δ), flips each bit w.p. δ
⇒ ?
6 / 38
In real life, the source is often not known
Image denoising
- Source : example binary images (figures not shown)
- Noise : BSC(δ), flips each bit w.p. δ
⇒ ?
6 / 38
Can we estimate an unknown source?
[Universal setting]
- Unknown source
- Known memoryless channel Π (e.g., BSC): Π(x, z) = Prob(Z = z | X = x)
- Assume Π has a right inverse (a mild assumption)
- Loss measured by Λ(x, x̂) (e.g., Hamming loss):

    L_{X̂^n}(x^n, z^n) = (1/n) Σ_{i=1}^n Λ(x_i, X̂_i(z^n))
Can we still denoise as well as if we knew the source?
7 / 38
DUDE (Discrete Universal DEnoiser)’s attempt
A two-pass algorithm [Weissman et al. 05]
- Fix window (context) size k
- First pass: for each z_i
  1. Identify the double-sided context c_i = (z_{i-k}^{i-1}, z_{i+1}^{i+k}) = (ℓ^k, r^k):
       · · · ℓ_1 ℓ_2 · · · ℓ_k  z_i  r_1 r_2 · · · r_k · · ·
  2. Update the count vector m[z^n, c_i] ∈ R^{|Z|}:
       m[z^n, c_i][z_i] ← m[z^n, c_i][z_i] + 1
- Second pass: denoise z_i with
    X̂_i(z^n) = Simple rule(Π, Λ, m[z^n, c_i], z_i)
8 / 38
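The first pass is plain context counting. A minimal Python sketch (my own illustration; the "Simple rule" of the second pass is spelled out as eq. (2) later in the talk):

```python
# Sketch of DUDE's first pass: count, for every double-sided context
# c = (left k symbols, right k symbols), how often each noisy symbol appears.
from collections import defaultdict
import numpy as np

def count_contexts(z, k, alphabet_size):
    """Return a dict mapping context tuples to count vectors m[z^n, c] in R^{|Z|}."""
    m = defaultdict(lambda: np.zeros(alphabet_size, dtype=int))
    for i in range(k, len(z) - k):
        c = (tuple(z[i-k:i]), tuple(z[i+1:i+k+1]))
        m[c][z[i]] += 1
    return m

# Example on a short binary sequence:
z = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
m = count_contexts(z, k=1, alphabet_size=2)
print(m[((0,), (0,))])   # counts of the middle symbol whenever the context is 0 _ 0
```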
Noiseless text
We might place the restriction on allowable sequences that no spaces
follow each other. · · · effect of statistical knowledge about the source
in reducing the required capacity of the channel · · · the relative
frequency of the digram i j. The letter frequencies p(i), the transition
probabilities · · · The resemblance to ordinary English text increases
quite noticeably at each of the above steps. · · · This theorem, and the
assumptions required for its proof, are in no way necessary for the
present theory. · · · The real justification of these definitions,
however, will reside in their implications. · · · H is then, for example,
the H in Boltzmann’s famous H theorem. We shall call H = −∑ p_i log p_i
the entropy of the set of probabilities p_1, . . . , p_n. · · · The theorem says
that for large N this will be independent of q and equal to H. · · · The
next two theorems show that H and H′ can be determined by limiting
operations directly from the statistics of the message sequences,
without reference to the states and transition probabilities between
states. · · · The Fundamental Theorem for a Noiseless Channel · · · The
converse part of the theorem, that C/H cannot be exceeded, may be proved
by · · ·
9 / 38
DUDE observes noisy text
Wz right peace the rest iction on alksoable sequbole thgt wo spices
fokiow eadh otxer. · · · egfbct of sraaistfcal keowleuge apolt tje souwce
in recucilg the requihed clpagity ofythe clabbel · · · the relatrte
pweqiency ofpthe digram i j. The setter freqbwncles p(i), ghe rrahsibion
probtbilities · · · The resemglahca to ordwnard Engdish tzxt ircreakes
quitq noliceabcy at vach oftthe hbove steps. · · · Thus theorev, andlthe
aszumptjona requiyed ffr its croof, arv il no wsy necqssrry forptfe
prwwent theorz. · · · jhe reap juptifocation of dhese defikjtmons,
doweyer, bill rehide inytheir imjlycajijes. · · · H is them, fol eskmqle,
tle H in Bolgnmann’s falous H themreg. We vhall cbll H = −∑
pi log pithe wntgopz rf thb set jf prwbabjlities p1, . . . , pn. · · · The theorem sahs
tyat fsr lawge N mhis gill we hndeypensdest of q aed vqunl tj H. · · ·The neht txo theiremf scow tyat H and H′ can be degereined jy likitkng
operatiofs digectlt fgom the stgtissics of thk mfssagj siqufnves,
bithout referenge ty the htates and trankituon krobabilitnes bejwekn
ltates. · · · The Fundkmendal Theorem kor a Soiselesd Chjnnen · · · Lhe
ronvegse jaht jf tketheorem, thlt CH
calnot be excweded, may ke xroved
ey · · ·
9 / 38
To denoise m at i
(the same noisy text as above, with the corrupted symbol to be denoised at position i, an ‘m’, highlighted)
9 / 38
DUDE with window size 2 searches for h e • r e
(the same noisy text as above, with every occurrence of the double-sided context “he • re” highlighted)
9 / 38
DUDE with window size 2 counts h e • r e
- The count vector
    m[Noisy text, (he, re)] = [0 0 0 0 0 0 0 0 1 0 0 0 1 0 4 0 0 0 0 0 0 0 0 0 0 0 5]^T
  (the nonzero entries are the counts for the letters i, m, o and the space symbol)
- The reconstruction of m at i:
    X̂_i = Simple rule(Π, Λ, m[Noisy text, (he, re)], m)
- Wherever DUDE sees “he m re”, it makes the same decision for m
  → DUDE is a sliding window denoiser
9 / 38
How good is DUDE? - Asymptotics
Denote S_k as the class of k-th order sliding window denoisers
- s_k ∈ S_k is a mapping Z^{2k+1} → X
- X̂_i(z^n) = s_k(z_{i-k}^{i+k})

DUDE is as good as the best s_k ∈ S_k for any source!

Theorem (Weissman et al. 05)
For all x^∞ ∈ X^∞, if k|Z|^{2k} = o(n / log n), then

    lim_{n→∞} ( L_{X̂_DUDE}(x^n, Z^n) − D_k(x^n, Z^n) ) = 0   w.p. 1,

where D_k(x^n, Z^n) is the best loss attainable in S_k.
10 / 38
How good is DUDE? - Finite n
Binary example with n = 10^6
- Source : BSMC(p) with p = 0.1
- Noise : BSC(δ) with δ = 0.1

[Figure: (Bit Error Rate)/δ vs. window size k, with curves for DUDE and the FB Recursion baseline and marked levels 0.563δ and 0.558δ]
DUDE gets close to the optimum (FB recursion) for some k
11 / 38
Limitations of DUDE in practice
The k|Z|^{2k} = o(n / log n) condition:
- n needs to grow exponentially with the context size k
- Counts for similar contexts are not shared
- For larger |Z|, things get worse (e.g., text, grayscale images)
- No concrete way of choosing the right k for given n and |Z|
Can we overcome the above limitations?
12 / 38
Outline
Introduction to discrete denoising
Deep neural networks
Neural DUDE
Experimental results
13 / 38
Deep neural networks (DNN) are eating the world!
E.g., Order of magnitude improvement in image classification
- ImageNet Top-5 error rates
14 / 38
Deep neural networks (DNN) are eating the world!
E.g., AlphaGo beats Lee Sedol by 4-1
- ConvNet-based policy/value networks for tree search
15 / 38
Supervised Softmax Regression
Multi-class classification: x ∈ R^d, y ∈ {1, . . . , K}
- Model:
    o_k = w_k^T x
    p_k(w, x) = P(y = k | x; w) = exp(o_k) / Σ_{j=1}^K exp(o_j)
- Given D = {(x_i, y_i)}_{i=1}^m, minimize
    L(D, w) = − Σ_{i=1}^m Σ_{k=1}^K t_{ik} log p_k(w, x_i),   where t_{ik} = 1{y_i = k}
16 / 38
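A tiny numpy sketch of this model and objective (my own illustration, not from the talk):

```python
# Sketch: softmax probabilities and the cross-entropy objective for one-hot targets.
import numpy as np

def softmax(o):
    o = o - o.max(axis=1, keepdims=True)        # subtract max for numerical stability
    e = np.exp(o)
    return e / e.sum(axis=1, keepdims=True)

def loss(W, X, T):
    """W: (d, K) weights, X: (m, d) inputs, T: (m, K) one-hot labels t_{ik}."""
    P = softmax(X @ W)                           # p_k(w, x_i)
    return -np.sum(T * np.log(P))                # - sum_i sum_k t_{ik} log p_k(w, x_i)
```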
Supervised Deep Neural Network (DNN)
Multi-class classification: x ∈ R^d, y ∈ {1, . . . , K}
- Model:
    o_k = w_k^T h_L
    h_{k,ℓ} = max(0, w_{k,ℓ}^T h_{ℓ−1}),  ℓ = 1, . . . , L   (with h_0 = x)
    p_k(w, x) = P(y = k | x; w) = exp(o_k) / Σ_{j=1}^K exp(o_j)
- Given D = {(x_i, y_i)}_{i=1}^m, minimize
    L(D, w) = − Σ_{i=1}^m Σ_{k=1}^K t_{ik} log p_k(w, x_i),   where t_{ik} = 1{y_i = k}
17 / 38
Supervised Deep Neural Network (DNN)
Multi-class classification: x ∈ R^d, y ∈ {1, . . . , K}
- After training, we obtain a mapping
    p(w*, ·) : R^d → Δ^K
- For a new data point x, the DNN predicts with
    ŷ = arg max_k p_k(w*, x)
18 / 38
Outline
Introduction to discrete denoising
Deep neural networks
Neural DUDE
Experimental results
19 / 38
DNN for discrete denoising?
Recall s_k : Z^{2k+1} → X.
Question: Can we learn a sliding window denoiser with a DNN?
- Namely, can we train a DNN to obtain a mapping
    p(w*, ·) : Z^{2k+1} → Δ^{|X|}
  as in multi-class classification?
- A parametric model could overcome the limitations of DUDE
Not straightforward!
- Training a DNN typically requires supervised training data
- The ground-truth label is the clean source, which is not available!
An alternative view of discrete denoising is necessary
20 / 38
Two equivalent views of a sliding window denoiser
Recall c_i = (z_{i-k}^{i-1}, z_{i+1}^{i+k}). Then, z_{i-k}^{i+k} = (c_i, z_i).

X̂_i(z^n) = s_k(z_{i-k}^{i+k})
- s_k : Z^{2k+1} → X

X̂_i(z^n) = s_k(c_i, z_i)
- Let S = {s : Z → X}
- Then, s_k(c_i, ·) ∈ S
- Note |S| = |X|^{|Z|}

Maybe we can learn a mapping p(w*, ·) : Z^{2k} → Δ^{|S|} with a DNN?
21 / 38
Unbiased estimated loss for a single-symbol denoiser
Consider a single-letter case:
    x → Π → Z → s(Z) = x̂
- The true loss Λ(x, s(Z)) cannot be evaluated
- Instead, we can devise L = Π^{−1} ρ ∈ R^{|Z|×|S|}
  - Π^{−1} ∈ R^{|Z|×|X|}
  - ρ ∈ R^{|X|×|S|} with ρ(u, s) = E_u Λ(u, s(Z))
- Then, L(Z, s) is an unbiased estimate of E_x Λ(x, s(Z)), i.e.,
    E_x L(Z, s) = E_x Λ(x, s(Z))    (1)
- First devised in [Weissman et al. 07]
22 / 38
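For the binary BSC/Hamming-loss setting used in the experiments later, the estimated-loss matrix is tiny and can be built directly; a sketch under my own naming, with the three single-symbol denoisers {always-0, always-1, say-what-you-see}:

```python
# Sketch: build L = Pi^{-1} rho for BSC(delta), Hamming loss, and three denoisers.
import numpy as np

delta = 0.1
Pi  = np.array([[1 - delta, delta], [delta, 1 - delta]])    # Pi(x, z) for BSC(delta)
Lam = np.array([[0.0, 1.0], [1.0, 0.0]])                    # Hamming loss Lambda(x, x_hat)
denoisers = np.array([[0, 0],    # s(z): always-0
                      [1, 1],    #       always-1
                      [0, 1]])   #       say-what-you-see

# rho(u, s) = E_u Lambda(u, s(Z)) = sum_z Pi(u, z) * Lambda(u, s(z))
rho = np.array([[sum(Pi[u, z] * Lam[u, s[z]] for z in (0, 1)) for s in denoisers]
                for u in (0, 1)])
L = np.linalg.inv(Pi) @ rho                                  # L = Pi^{-1} rho in R^{|Z| x |S|}

# Unbiasedness check, eq. (1): E_x L(Z, s) = sum_z Pi(x, z) L(z, s) = rho(x, s)
assert np.allclose(Pi @ L, rho)

L_new = -L + L.max() * np.ones_like(L)   # -L + L_max 1 1^T, used by the lemma and Neural DUDE later
```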
A closer look at DUDE
The concrete DUDE rule with c_i = (Z_{i-k}^{i-1}, Z_{i+1}^{i+k}):

    X̂_{i,DUDE}(Z^n) = arg min_{x̂ ∈ X}  m[Z^n, c_i]^T Π^{−1} [Λ_{x̂} ⊙ Π_{Z_i}]    (2)

- X̂_{i,DUDE}(Z^n) = s_{k,DUDE}(c_i, Z_i), and we can show

    s_{k,DUDE}(c, ·) = arg min_{s ∈ S} Σ_{i ∈ {i : c_i = c}} L(Z_i, s)

- If |{i : c_i = c}| is large enough, then s_{k,DUDE}(c, ·) gets close to

    arg min_{s ∈ S} Σ_{i ∈ {i : c_i = c}} Λ(x_i, s(Z_i))

“DUDE learns s ∈ S for each c by minimizing the estimated loss!”
23 / 38
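Putting eq. (2) together with the first-pass counts gives a compact DUDE second pass; a sketch reusing Pi, Lam, and count_contexts from the earlier snippets (all of them my own illustrative helpers, not the talk's code):

```python
# Sketch of DUDE's second pass via eq. (2), for a binary sequence z.
import numpy as np

def dude_denoise(z, k, Pi, Lam):
    m = count_contexts(z, k, alphabet_size=Pi.shape[0])      # first pass
    Pi_inv = np.linalg.inv(Pi)
    x_hat = list(z)
    for i in range(k, len(z) - k):                           # second pass
        c = (tuple(z[i-k:i]), tuple(z[i+1:i+k+1]))
        scores = [m[c] @ Pi_inv @ (Lam[:, xh] * Pi[:, z[i]])  # eq. (2) for each xh
                  for xh in range(Lam.shape[1])]
        x_hat[i] = int(np.argmin(scores))
    return x_hat
```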
An alternative expression of DUDE
The following gives an alternative expression of

    s_{k,DUDE}(c, ·) = arg min_{s ∈ S} Σ_{i ∈ {i : c_i = c}} L(z_i, s)

- Define
    p(c) = arg min_{p ∈ Δ^{|S|}} ( Σ_{i ∈ {i : c_i = c}} 1_{z_i}^T L ) p,
  which will be on one of the vertices of Δ^{|S|} (∵ a simple LP)
- Then, s_{k,DUDE}(c, ·) = arg max_{s ∈ S} p_s(c)
24 / 38
Another way of obtaining DUDE
We can also obtain s_{k,DUDE}(c, ·) via the following lemma.

Lemma
Define L_new ≜ −L + L_max 1 1^T, where L_max = max_{z,s} L(z, s). Denote

    p*(c) = arg min_{p ∈ Δ^{|S|}} Σ_{i ∈ {i : c_i = c}} C(L_new^T 1_{z_i}, p),

in which C(g, p) ≜ −Σ_{k=1}^{|S|} g_k log p_k for any g ∈ R_+^{|S|} and p ∈ Δ^{|S|}.

Then, s_{k,DUDE}(c, ·) = arg max_{s ∈ S} p*_s(c).
25 / 38
Proof of the lemma
Proof. Recall s_{k,DUDE}(c, ·) = arg max_{s ∈ S} p_s(c), where

    p(c) = arg min_{p ∈ Δ^{|S|}} ( Σ_{i ∈ {i : c_i = c}} 1_{z_i}^T L ) p
         = arg max_{p ∈ Δ^{|S|}} ( Σ_{i ∈ {i : c_i = c}} 1_{z_i}^T (−L + L_max 1 1^T) ) p
         = arg max_{p ∈ Δ^{|S|}} ( Σ_{i ∈ {i : c_i = c}} L_new^T 1_{z_i} )^T p

    p*(c) = arg min_{p ∈ Δ^{|S|}} Σ_{i ∈ {i : c_i = c}} C(L_new^T 1_{z_i}, p)
          = arg min_{p ∈ Δ^{|S|}} C( Σ_{i ∈ {i : c_i = c}} L_new^T 1_{z_i}, p )

Since the minimizer of C(g, ·) over Δ^{|S|} is p ∝ g, p*(c) puts the maximum probability mass on the vertex of p(c)!
26 / 38
Motivation for Neural DUDE
From the lemma, DUDE can be obtained by solving

    p*(c) = arg min_{p ∈ Δ^{|S|}} Σ_{i ∈ {i : c_i = c}} C(L_new^T 1_{z_i}, p)

for each context c.

⇒ We may instead define p(w, ·) : Z^{2k} → Δ^{|S|} and solve

    w* = arg min_w Σ_{i=1}^n C(L_new^T 1_{z_i}, p(w, c_i))

to obtain a single model that works for all c’s.
27 / 38
Neural DUDE algorithm
1. Define a DNN model
     p(w, ·) : Z^{2k} → Δ^{|S|}
2. Given z^n, obtain w* that minimizes
     L(z^n, w) ≜ (1/n) Σ_{i=1}^n C(L_new^T 1_{z_i}, p(w, c_i))
   - Results in a single network p(w*, ·) that works for all c
   - Interpret c_i as the “input” and L_new^T 1_{z_i} ∈ R^{|S|} as a “pseudo-label”
3. Compute
     s_{k,Neural DUDE}(c, ·) = arg max_{s ∈ S} p_s(w*, c)
4. Obtain X̂_{i,Neural DUDE}(z^n) = s_{k,Neural DUDE}(c_i, z_i)
28 / 38
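A minimal training sketch of steps 1-4 for the binary case, written with PyTorch purely for illustration (the talk does not prescribe a framework; the one-hot context encoding, layer sizes, full-batch Adam updates, and all helper names below are my own choices, and L_new is the matrix built in the earlier sketch):

```python
# Sketch of Neural DUDE for binary data: train on pseudo-labels, then denoise.
import numpy as np
import torch
import torch.nn as nn

def make_contexts(z, k):
    """One-hot encode the double-sided contexts c_i for the interior positions."""
    n = len(z)
    idx = [np.r_[z[i-k:i], z[i+1:i+k+1]] for i in range(k, n - k)]
    return np.eye(2)[np.array(idx)].reshape(n - 2*k, -1)         # shape (n-2k, 4k)

def train_neural_dude(z, L_new, k, epochs=200):
    ctx = torch.tensor(make_contexts(z, k), dtype=torch.float32)
    lab = torch.tensor(L_new[z[k:len(z)-k]], dtype=torch.float32)  # pseudo-labels L_new^T 1_{z_i}
    model = nn.Sequential(nn.Linear(4*k, 20), nn.ReLU(),
                          nn.Linear(20, 20), nn.ReLU(),
                          nn.Linear(20, 20), nn.ReLU(),
                          nn.Linear(20, L_new.shape[1]))           # a "4L"-style network
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):                                        # full-batch updates for brevity
        opt.zero_grad()
        logp = torch.log_softmax(model(ctx), dim=1)
        loss = -(lab * logp).sum(dim=1).mean()                     # C(L_new^T 1_{z_i}, p(w, c_i))
        loss.backward()
        opt.step()
    return model

def neural_dude_denoise(model, z, k, denoisers):
    """Step 4: apply s_{k,Neural DUDE}(c_i, z_i); denoisers[s][z] is rule s applied to z."""
    ctx = torch.tensor(make_contexts(z, k), dtype=torch.float32)
    with torch.no_grad():
        s_hat = model(ctx).argmax(dim=1).numpy()                   # step 3: argmax_s p_s(w*, c)
    x_hat = z.copy()
    x_hat[k:len(z)-k] = denoisers[s_hat, z[k:len(z)-k]]
    return x_hat
```

With a noisy integer array z and the Pi/Lam/L_new/denoisers objects from the earlier sketch, calling train_neural_dude(z, L_new, k) followed by neural_dude_denoise(...) mirrors steps 1-4 above.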
Properties of Neural DUDE
- A sliding window denoiser
- A single, parametric model
- Information from similar contexts is combined (⇔ the non-parametric DUDE)
- Robust to the choice of k (as shown in the experiments)
29 / 38
Outline
Introduction to discrete denoising
Deep neural networks
Neural DUDE
Experimental results
30 / 38
Experimental setup
For binary data over a BSC (with δ < 0.5)
- |S| = 3: {always-0, always-1, say-what-you-see}
Neural DUDE models (with increasing numbers of layers)
- 1L: 2k−3 (Softmax Regression)
- 2L: 2k−20−3
- 3L: 2k−20−20−3
- 4L: 2k−20−20−20−3
- Mini-batch SGD (with the Adam and momentum methods) used for optimization
31 / 38
How good is Neural DUDE? - Finite n
Binary example with n = 10^6
- Source : BSMC(p) with p = 0.1
- Noise : BSC(δ) with δ = 0.1

[Figure: (Bit Error Rate)/δ vs. window size k, with curves for DUDE and the FB Recursion baseline and marked levels 0.563δ and 0.558δ]

DUDE is sensitive to k
32 / 38
[The same figure is then overlaid, slide by slide, with Neural DUDE curves of increasing depth: 1 Layer, 2 Layer, 3 Layer, and finally 4 Layer, which is robust to k!]
32 / 38
Concentration of the estimated loss
How close is (1/n) Σ_{i=1}^n L(z_i, s_k(c_i, ·)) to (1/n) Σ_{i=1}^n Λ(x_i, s_k(c_i, z_i))?

[Figure: two panels of (Bit Error Rate)/δ vs. window size k, comparing the true BER with the estimated BER against the FB Recursion baseline, for DUDE (left) and Neural DUDE (4L) (right)]

The estimated loss of Neural DUDE concentrates on the true loss
- Can pick k based on (1/n) Σ_{i=1}^n L(z_i, s_k(c_i, ·)) for Neural DUDE!
33 / 38
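Since the estimated loss is computable from z^n alone, one can in principle select k by minimizing it; a rough sketch (my own, reusing the hypothetical helpers make_contexts, train_neural_dude, and the matrix L from the earlier snippets):

```python
# Sketch: pick the window size k that minimizes the estimated loss on the noisy data.
import numpy as np
import torch

def estimated_ber(model, z, k, L):
    """(1/n) sum_i L(z_i, s_k(c_i, .)) over the interior positions."""
    ctx = torch.tensor(make_contexts(z, k), dtype=torch.float32)
    with torch.no_grad():
        s_hat = model(ctx).argmax(dim=1).numpy()
    return L[z[k:len(z)-k], s_hat].mean()

def pick_k(z, L, L_new, candidate_ks):
    scores = {k: estimated_ber(train_neural_dude(z, L_new, k), z, k, L)
              for k in candidate_ks}
    return min(scores, key=scores.get)      # k with the smallest estimated loss
```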
Binary image denoising
Setting
- Source: example binary images (figures not shown)
- Noise: BSC(δ) with δ = 0.1
Comparison between DUDE and Neural DUDE?
- Raster scan of the data
- Use a 1-D context
34 / 38
Binary image denoising - Einstein 256× 256
Neural DUDE significantly outperforms DUDE! (a relative 25.7% improvement)

[Figure: (Bit Error Rate)/δ vs. window size k; Neural DUDE (4L) reaches about 0.417δ while DUDE's best is about 0.561δ]

Note: for k = 27, there are 2^54 different contexts!! (≫ n = 2^16)
35 / 38
Binary image denoising - Einstein 256× 256
Estimated loss of Neural DUDE still concentrates on the true loss!
[Figure: two panels of (Bit Error Rate)/δ vs. window size k; left: DUDE BER vs. DUDE estimated BER, right: Neural DUDE (4L) BER vs. Neural DUDE estimated BER]
35 / 38
Binary image denoising results
Relative error reduction of Neural DUDE over DUDE
[Figure: bar chart of the relative error reduction (%) of Neural DUDE over DUDE, on a 0-30% scale, for Barbara, Mandrill, Lena, Cameraman, Boat, Peppers, Einstein, Halftone, and the Shannon text]
36 / 38
Future/ongoing research directions
Analysis
- Analyzing the concentration of the estimated loss
- Connection to SURE (Stein’s Unbiased Risk Estimator)
Algorithmic extensions
- Extending to larger-alphabet data beyond binary (e.g., DNA)
- Extending to continuous-valued data (e.g., grayscale images)
- Applying Recurrent Neural Networks (RNNs)
- Trying other general function approximation methods (e.g., random forests)
37 / 38