
Neural Universal Discrete Denoiser

Taesup Moon

Daegu Gyeongbuk Institute of Science and Technology (DGIST), Daegu, Korea

April 22, 2016, Information Theory Forum, Stanford University

1 / 38

Outline

Introduction to discrete denoising

Deep neural networks

Neural DUDE

Experimental results

2 / 38

Discrete denoising - an estimation problem

- $X_i$, $Z_i$, $\hat{X}_i$ take values in finite alphabets (e.g., binary, DNA)

- Goal: Choose $\hat{X}^n$ as close as possible to $X^n$, based on $Z^n$

3 / 38

E.g., Correcting noisy bit stream

$X^n$ : 00000011111110000000000111111111100000001111111

$Z^n$ : 00100011101110010001000111110111100000011110111

$\hat{X}^n$ : 00000011101110000000000011111111100000011111111

bit error rate (BER) = (number of errors) / $n$ = 3/47 ≈ 0.064

- $X^n$ : clean bit stream
- $Z^n$ : noisy bit stream (suppose bits are flipped w.p. $\delta = 0.1$)
- $\hat{X}^n$ : estimate of $X^n$ based on observing $Z^n$

"given $Z^n$, how would you choose $\hat{X}^n$ to minimize BER?"

4 / 38
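As a small worked companion to the BER arithmetic above, here is a minimal Python sketch (not part of the slides; the function name is mine):

```python
def bit_error_rate(x, x_hat):
    """Fraction of positions where the estimate differs from the clean stream."""
    assert len(x) == len(x_hat)
    return sum(a != b for a, b in zip(x, x_hat)) / len(x)

# For the streams on this slide: 3 mismatches out of n = 47, so BER = 3/47 ~ 0.064.
```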


Life is easy if the source and noise are known

[Bayesian setting], e.g.,

- Source : BSMC($p$) - binary symmetric Markov chain with transition probability $p$

  [State diagram: states 0 and 1; stays w.p. $1-p$, switches w.p. $p$]

- Noise : BSC($\delta$) - flips bits with probability $\delta$

  [Channel diagram: input bit passes through w.p. $1-\delta$, is flipped w.p. $\delta$]

⇒ $Z^n$ is a hidden Markov process (HMP) → the Forward-Backward recursion is optimal

5 / 38
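The slides do not spell out the Forward-Backward recursion, so here is a minimal numpy sketch for this BSMC($p$) + BSC($\delta$) setting; it is my own illustration (function name, normalization, and the per-symbol MAP decision rule are assumptions consistent with minimizing BER), not the talk's code.

```python
import numpy as np

def fb_denoise(z, p, delta):
    """Per-symbol posterior (Forward-Backward) denoising of a BSMC(p) source
    observed through a BSC(delta); returns the BER-minimizing estimate."""
    n = len(z)
    T = np.array([[1 - p, p], [p, 1 - p]])                   # Markov transitions
    E = np.array([[1 - delta, delta], [delta, 1 - delta]])   # BSC emissions
    alpha = np.zeros((n, 2))
    beta = np.zeros((n, 2))
    alpha[0] = 0.5 * E[:, z[0]]                  # stationary prior is (1/2, 1/2)
    alpha[0] /= alpha[0].sum()
    for i in range(1, n):                        # forward pass (normalized)
        alpha[i] = (alpha[i - 1] @ T) * E[:, z[i]]
        alpha[i] /= alpha[i].sum()
    beta[-1] = 1.0
    for i in range(n - 2, -1, -1):               # backward pass (normalized)
        beta[i] = T @ (E[:, z[i + 1]] * beta[i + 1])
        beta[i] /= beta[i].sum()
    post = alpha * beta
    post /= post.sum(axis=1, keepdims=True)      # P(X_i = x | Z^n)
    return post.argmax(axis=1)                   # per-symbol MAP = min expected BER

# Example (hypothetical): x_hat = fb_denoise(np.array(noisy_bits), p=0.1, delta=0.1)
```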

In real life, the source is often not known

Correcting noisy bit stream

- Source : 0000011110000101011100000 . . . (?)

- Noise : BSC($\delta$)

  [Channel diagram: BSC($\delta$), as above]

⇒ ?

6 / 38

In real life, the source is often not known

Image denoising

- Source : [example binary images] or · · ·

- Noise : BSC($\delta$)

  [Channel diagram: BSC($\delta$), as above]

⇒ ?

6 / 38

Can we estimate an unknown source?

[Universal setting]

- Unknown source
- Known memoryless channel $\Pi$ (e.g., BSC)

  $\Pi(x, z) = \mathrm{Prob}(Z = z \mid X = x)$

- Assume $\Pi$ has a right inverse (a mild assumption)
- Loss measured by $\Lambda(x, \hat{x})$ (e.g., Hamming loss)

  $L_{\hat{X}^n}(x^n, z^n) = \frac{1}{n}\sum_{i=1}^{n} \Lambda\big(x_i, \hat{X}_i(z^n)\big)$

Can we still denoise as well as if we knew the source?

7 / 38

DUDE (Discrete Universal DEnoiser)’s attempt

A two-pass algorithm [Weissman et al. 05]

- Fix window (context) size $k$
- First pass: for each $z_i$

  1. Identify the double-sided context $c_i = (z_{i-k}^{i-1}, z_{i+1}^{i+k}) = (\ell^k, r^k)$

     $\cdots\ \ell_1\ \ell_2\ \cdots\ \ell_k\ \ z_i\ \ r_1\ r_2\ \cdots\ r_k\ \cdots$

  2. Update the count vector $m[z^n, c_i] \in \mathbb{R}^{|\mathcal{Z}|}$:

     $m[z^n, c_i][z_i] \leftarrow m[z^n, c_i][z_i] + 1$

- Second pass: denoise $z_i$ with

  $\hat{X}_i(z^n) = \text{Simple rule}\big(\Pi, \Lambda, m[z^n, c_i], z_i\big)$

8 / 38
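To make the two passes concrete, here is a minimal Python sketch of DUDE as just described (my own variable names; boundary positions are simply copied from the noisy sequence; `Pi` is the channel matrix and `Lam[x, xhat]` the loss matrix).

```python
import numpy as np
from collections import defaultdict

def dude(z, k, Pi, Lam):
    """Two-pass DUDE sketch over a finite alphabet {0, ..., A-1}."""
    A = Pi.shape[0]
    Pi_inv = np.linalg.inv(Pi)
    # First pass: count noisy symbols per double-sided context c = (l^k, r^k)
    m = defaultdict(lambda: np.zeros(A))
    for i in range(k, len(z) - k):
        c = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        m[c][z[i]] += 1
    # Second pass: apply the simple rule (made explicit as Eq. (2) later in the talk)
    xhat = np.array(z).copy()
    for i in range(k, len(z) - k):
        c = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        scores = [m[c] @ Pi_inv @ (Lam[:, x] * Pi[:, z[i]]) for x in range(A)]
        xhat[i] = int(np.argmin(scores))
    return xhat
```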

Noiseless text

We might place the restriction on allowable sequences that no spaces

follow each other. · · · effect of statistical knowledge about the source

in reducing the required capacity of the channel · · · the relative

frequency of the digram i j. The letter frequencies p(i), the transition

probabilities · · · The resemblance to ordinary English text increases

quite noticeably at each of the above steps. · · · This theorem, and the

assumptions required for its proof, are in no way necessary for the

present theory. · · · The real justification of these definitions,

however, will reside in their implications. · · · H is then, for example,

the H in Boltzmann's famous H theorem. We shall call $H = -\sum p_i \log p_i$

the entropy of the set of probabilities $p_1, \ldots, p_n$. · · · The theorem says

that for large N this will be independent of q and equal to H. · · · The

next two theorems show that H and H′ can be determined by limiting

operations directly from the statistics of the message sequences,

without reference to the states and transition probabilities between

states. · · · The Fundamental Theorem for a Noiseless Channel · · · The

converse part of the theorem, that $C/H$ cannot be exceeded, may be proved by · · ·

9 / 38

DUDE observes noisy text

Wz right peace the rest iction on alksoable sequbole thgt wo spices

fokiow eadh otxer. · · · egfbct of sraaistfcal keowleuge apolt tje souwce

in recucilg the requihed clpagity ofythe clabbel · · · the relatrte

pweqiency ofpthe digram i j. The setter freqbwncles p(i), ghe rrahsibion

probtbilities · · · The resemglahca to ordwnard Engdish tzxt ircreakes

quitq noliceabcy at vach oftthe hbove steps. · · · Thus theorev, andlthe

aszumptjona requiyed ffr its croof, arv il no wsy necqssrry forptfe

prwwent theorz. · · · jhe reap juptifocation of dhese defikjtmons,

doweyer, bill rehide inytheir imjlycajijes. · · · H is them, fol eskmqle,

tle H in Bolgnmann’s falous H themreg. We vhall cbll H = −∑

pi log pithe wntgopz rf thb set jf prwbabjlities p1, . . . , pn. · · · The theorem sahs

tyat fsr lawge N mhis gill we hndeypensdest of q aed vqunl tj H. · · ·The neht txo theiremf scow tyat H and H′ can be degereined jy likitkng

operatiofs digectlt fgom the stgtissics of thk mfssagj siqufnves,

bithout referenge ty the htates and trankituon krobabilitnes bejwekn

ltates. · · · The Fundkmendal Theorem kor a Soiselesd Chjnnen · · · Lhe

ronvegse jaht jf tketheorem, thlt CH

calnot be excweded, may ke xroved

ey · · ·

9 / 38

To denoise m at i

[Same noisy text as on the previous slide; the symbol to denoise is the m in "themreg", whose double-sided context of size 2 is (he, re).]

9 / 38

DUDE with window size 2 searches for h e • r e

[Same noisy text; DUDE scans the whole sequence for every occurrence of the two-sided context h e _ r e.]

9 / 38

DUDE with window size 2 counts h e • r e

- The count vector

  $m[\text{Noisy text}, (he, re)] = [0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 1\ 0\ 0\ 0\ 1\ 0\ 4\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 5]^\top$

  (nonzero entries: i → 1, m → 1, o → 4, space → 5)

- The reconstruction of m at $i$

  $\hat{X}_i = \text{Simple rule}\big(\Pi, \Lambda, m[\text{Noisy text}, (he, re)], m\big)$

- Wherever DUDE sees "hemre", it makes the same decision for m

  → DUDE is a sliding window denoiser

9 / 38


How good is DUDE? - Asymptotics

Denote $\mathcal{S}_k$ as the class of $k$-th order sliding window denoisers

- $s_k \in \mathcal{S}_k$ is a mapping $\mathcal{Z}^{2k+1} \to \mathcal{X}$
- $\hat{X}_i(z^n) = s_k(z_{i-k}^{i+k})$

DUDE is as good as the best $s_k \in \mathcal{S}_k$ for any source!

Theorem (Weissman et al. 05)
For all $x^\infty \in \mathcal{X}^\infty$, if $k|\mathcal{Z}|^{2k} = o\big(\frac{n}{\log n}\big)$, then

$\lim_{n \to \infty} \Big( L_{\hat{X}_{\mathrm{DUDE}}}(x^n, Z^n) - \underbrace{D_k(x^n, Z^n)}_{\text{best loss in } \mathcal{S}_k} \Big) = 0 \quad \text{w.p. } 1$

10 / 38

How good is DUDE? - Finite n

Binary example with $n = 10^6$

- Source : BSMC($p$) with $p = 0.1$
- Noise : BSC($\delta$) with $\delta = 0.1$

[Plot: (Bit Error Rate)/$\delta$ versus window size $k$. DUDE reaches $0.563\delta$ at its best $k$; the FB recursion achieves $0.558\delta$.]

DUDE gets close to the optimum (FB recursion) for some k

11 / 38

Limitations of DUDE in practice

The $k|\mathcal{Z}|^{2k} = o\big(\frac{n}{\log n}\big)$ condition

- $n$ needs to grow exponentially with the context size $k$
- Counts for similar contexts are not shared
- For larger $|\mathcal{Z}|$, things get worse (e.g., text, grayscale images)
- No concrete way of choosing the right $k$ for given $n$ and $|\mathcal{Z}|$

Can we overcome these limitations?

12 / 38


Outline

Introduction to discrete denoising

Deep neural networks

Neural DUDE

Experimental results

13 / 38

Deep neural networks (DNN) are eating the world!

E.g., Order of magnitude improvement in image classification

- ImageNet Top-5 error rates

14 / 38

Deep neural networks (DNN) are eating the world!

E.g., AlphaGo beats Lee Sedol by 4-1

- ConvNet-based policy/value networks for tree search

15 / 38

Supervised Softmax Regression

Multi-class classification: $x \in \mathbb{R}^d$, $y \in \{1, \ldots, K\}$

- Model:

  $o_k = w_k^\top x, \qquad p_k(\mathbf{w}, x) = P(y = k \mid x; \mathbf{w}) = \frac{\exp(o_k)}{\sum_{j=1}^{K} \exp(o_j)}$

- Given $D = \{(x_i, y_i)\}_{i=1}^{m}$, minimize

  $\mathcal{L}(D, \mathbf{w}) = -\sum_{i=1}^{m} \sum_{k=1}^{K} t_{ik} \log p_k(\mathbf{w}, x_i)$

  where $t_{ik} = 1\{y_i = k\}$

16 / 38
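For concreteness, a small numpy sketch of this loss and its gradient (my own helper functions, not from the slides):

```python
import numpy as np

def softmax(o):
    o = o - o.max(axis=1, keepdims=True)        # subtract max for stability
    e = np.exp(o)
    return e / e.sum(axis=1, keepdims=True)

def softmax_regression_loss_and_grad(W, X, T):
    """X: m x d inputs, T: m x K one-hot labels (t_ik), W: d x K weights."""
    P = softmax(X @ W)                          # p_k(w, x_i) for every example
    loss = -(T * np.log(P)).sum()               # L(D, w)
    grad = X.T @ (P - T)                        # gradient of L(D, w) w.r.t. W
    return loss, grad
```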

Supervised Deep Neural Network (DNN)

Multi-class classification: $x \in \mathbb{R}^d$, $y \in \{1, \ldots, K\}$

- Model:

  $o_k = w_k^\top h_L, \qquad h_{k,\ell} = \max(0,\, w_{k,\ell}^\top h_{\ell-1}), \quad \ell = 1, \ldots, L \ \ (h_0 = x)$

  $p_k(\mathbf{w}, x) = P(y = k \mid x; \mathbf{w}) = \frac{\exp(o_k)}{\sum_{j=1}^{K} \exp(o_j)}$

- Given $D = \{(x_i, y_i)\}_{i=1}^{m}$, minimize

  $\mathcal{L}(D, \mathbf{w}) = -\sum_{i=1}^{m} \sum_{k=1}^{K} t_{ik} \log p_k(\mathbf{w}, x_i)$

  where $t_{ik} = 1\{y_i = k\}$

17 / 38

Supervised Deep Neural Network (DNN)

Multi-class classification: $x \in \mathbb{R}^d$, $y \in \{1, \ldots, K\}$

- After training, we obtain a mapping

  $p(\mathbf{w}^*, \cdot) : \mathbb{R}^d \to \Delta^K$

- For a new data point $x$, the DNN predicts with

  $\hat{y} = \arg\max_k \, p_k(\mathbf{w}^*, x)$

18 / 38
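A minimal numpy sketch of this forward pass and prediction rule (weight shapes and names are my own; biases omitted as on the slide):

```python
import numpy as np

def dnn_predict(x, Ws):
    """ReLU network: hidden layers for Ws[:-1], softmax over the final logits."""
    h = x
    for W in Ws[:-1]:
        h = np.maximum(0.0, h @ W)              # h_l = max(0, W_l^T h_{l-1})
    o = h @ Ws[-1]                              # logits o_k
    p = np.exp(o - o.max())
    p /= p.sum()                                # p_k(w, x)
    return int(np.argmax(p))                    # y_hat = argmax_k p_k(w*, x)
```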

Outline

Introduction to discrete denoising

Deep neural networks

Neural DUDE

Experimental results

19 / 38

DNN for discrete denoising?

Recall $s_k : \mathcal{Z}^{2k+1} \to \mathcal{X}$.

Question: Can we learn a sliding window denoiser with a DNN?

- Namely, can we train a DNN to obtain a mapping

  $p(\mathbf{w}^*, \cdot) : \mathcal{Z}^{2k+1} \to \Delta^{|\mathcal{X}|}$

  as in multi-class classification?

- A parametric model could overcome the limitations of DUDE

Not straightforward!

- Training a DNN typically requires supervised training data
- The ground-truth label is the clean source, which is not available!

An alternative view on discrete denoising is necessary

20 / 38


Two equivalent views of a sliding window denoiser

Recall $c_i = (z_{i-k}^{i-1}, z_{i+1}^{i+k})$. Then $z_{i-k}^{i+k} = (c_i, z_i)$.

$\hat{X}_i(z^n) = s_k(z_{i-k}^{i+k})$

- $s_k : \mathcal{Z}^{2k+1} \to \mathcal{X}$

$\hat{X}_i(z^n) = s_k(c_i, z_i)$

- Let $\mathcal{S} = \{s : \mathcal{Z} \to \mathcal{X}\}$
- Then $s_k(c_i, \cdot) \in \mathcal{S}$
- Note $|\mathcal{S}| = |\mathcal{X}|^{|\mathcal{Z}|}$

Maybe we can learn a mapping $p(\mathbf{w}^*, \cdot) : \mathcal{Z}^{2k} \to \Delta^{|\mathcal{S}|}$ with a DNN?

21 / 38
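To make $|\mathcal{S}| = |\mathcal{X}|^{|\mathcal{Z}|}$ concrete for the binary case, a tiny sketch (the later experimental-setup slide is where only 3 of these 4 maps are kept for the BSC with $\delta < 0.5$):

```python
from itertools import product

# Enumerate the single-symbol denoisers S = {s : Z -> X} for binary alphabets.
Z_alpha, X_alpha = (0, 1), (0, 1)
S = [dict(zip(Z_alpha, outs)) for outs in product(X_alpha, repeat=len(Z_alpha))]
print(len(S), S)   # |S| = |X|^|Z| = 4: always-0, say-what-you-see, flip, always-1
```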

Unbiased estimated loss for single-symbol denoiser

Consider a single-letter case

$x \to \Pi \to Z \to s(Z) = \hat{x}$

- The true loss $\Lambda(x, s(Z))$ cannot be evaluated
- Instead, we can devise $\mathbf{L} = \Pi^{-1}\rho \in \mathbb{R}^{|\mathcal{Z}| \times |\mathcal{S}|}$
  - $\Pi^{-1} \in \mathbb{R}^{|\mathcal{Z}| \times |\mathcal{X}|}$
  - $\rho \in \mathbb{R}^{|\mathcal{X}| \times |\mathcal{S}|}$ with $\rho(u, s) = \mathbb{E}_u\, \Lambda(u, s(Z))$
- Then $\mathbf{L}(Z, s)$ is an unbiased estimate of $\mathbb{E}_x \Lambda(x, s(Z))$, i.e.,

  $\mathbb{E}_x \mathbf{L}(Z, s) = \mathbb{E}_x \Lambda(x, s(Z)) \qquad (1)$

- First devised in [Weissman et al. 07]

22 / 38
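A small numpy sketch of this construction for the BSC($\delta$)/Hamming-loss case, using the three single-symbol denoisers that appear later in the experiments; the sanity check at the end verifies Eq. (1). This is my own illustration, not the talk's code.

```python
import numpy as np

delta = 0.1
Pi = np.array([[1 - delta, delta],
               [delta, 1 - delta]])            # BSC(delta)
Lam = np.array([[0.0, 1.0],
                [1.0, 0.0]])                   # Hamming loss Lam[x, xhat]

# Single-symbol denoisers s : Z -> X (always-0, always-1, say-what-you-see)
S = [lambda z: 0, lambda z: 1, lambda z: z]

# rho[x, s] = E_x Lam(x, s(Z)) = sum_z Pi[x, z] * Lam[x, s(z)]
rho = np.array([[sum(Pi[x, z] * Lam[x, s(z)] for z in (0, 1)) for s in S]
                for x in (0, 1)])
L = np.linalg.inv(Pi) @ rho                    # estimated loss L(z, s)

# Eq. (1): E_x L(Z, s) = (Pi @ L)[x, s] should equal E_x Lam(x, s(Z)) = rho[x, s]
assert np.allclose(Pi @ L, rho)
```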

A closer look at DUDE

The concrete DUDE rule with $c_i = (Z_{i-k}^{i-1}, Z_{i+1}^{i+k})$:

$\hat{X}_{i,\mathrm{DUDE}}(Z^n) = \arg\min_{\hat{x} \in \mathcal{X}} \; m[Z^n, c_i]^\top \Pi^{-1} \big[ \lambda_{\hat{x}} \odot \pi_{Z_i} \big] \qquad (2)$

- $\hat{X}_{i,\mathrm{DUDE}}(Z^n) = s_{k,\mathrm{DUDE}}(c_i, Z_i)$, and we can show

  $s_{k,\mathrm{DUDE}}(c, \cdot) = \arg\min_{s \in \mathcal{S}} \sum_{i \in \{i : c_i = c\}} \mathbf{L}(Z_i, s)$

- If $|\{i : c_i = c\}|$ is large enough, then $s_{k,\mathrm{DUDE}}(c, \cdot)$ gets close to

  $\arg\min_{s \in \mathcal{S}} \sum_{i \in \{i : c_i = c\}} \Lambda(x_i, s(Z_i))$

"DUDE learns $s \in \mathcal{S}$ for each $c$ by minimizing the estimated loss!"

23 / 38
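In code, the per-context minimization on this slide is a single matrix-vector product followed by an argmin; a sketch under my own naming (`m_c` is the count vector of context $c$ and `L` the estimated-loss matrix from the previous slide):

```python
import numpy as np

def dude_map_for_context(m_c, L):
    """Index in S of arg min_s sum_{i: c_i = c} L(z_i, s) = (m_c)^T L[:, s]."""
    return int(np.argmin(m_c @ L))
```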


An alternative expression of DUDE

The following gives an alternative expression of

$s_{k,\mathrm{DUDE}}(c, \cdot) = \arg\min_{s \in \mathcal{S}} \sum_{i \in \{i : c_i = c\}} \mathbf{L}(z_i, s)$

- Define

  $\hat{p}(c) = \arg\min_{p \in \Delta^{|\mathcal{S}|}} \Big( \sum_{i \in \{i : c_i = c\}} \mathbf{1}_{z_i}^\top \mathbf{L} \Big)\, p,$

  which will be on one of the vertices of $\Delta^{|\mathcal{S}|}$ ($\because$ a simple LP)

- Then $s_{k,\mathrm{DUDE}}(c, \cdot) = \arg\max_{s \in \mathcal{S}} \hat{p}_s(c)$

24 / 38

Another way of obtaining DUDE

We can also obtain $s_{k,\mathrm{DUDE}}(c, \cdot)$ via the following lemma.

Lemma
Define $\mathbf{L}_{\mathrm{new}} \triangleq -\mathbf{L} + L_{\max}\mathbf{1}\mathbf{1}^\top$, where $L_{\max} = \max_{z,s} \mathbf{L}(z, s)$. Denote

$p^*(c) = \arg\min_{p \in \Delta^{|\mathcal{S}|}} \sum_{i \in \{i : c_i = c\}} \mathcal{C}\big(\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i},\, p\big),$

in which $\mathcal{C}(\mathbf{g}, p) \triangleq -\sum_{k=1}^{|\mathcal{S}|} g_k \log p_k$ for any $\mathbf{g} \in \mathbb{R}_+^{|\mathcal{S}|}$ and $p \in \Delta^{|\mathcal{S}|}$.

Then $s_{k,\mathrm{DUDE}}(c, \cdot) = \arg\max_{s \in \mathcal{S}} p^*_s(c)$.

25 / 38

Proof of the lemma

Proof.
Recall $s_{k,\mathrm{DUDE}}(c, \cdot) = \arg\max_{s \in \mathcal{S}} \hat{p}_s(c)$, with

$\hat{p}(c) = \arg\min_{p \in \Delta^{|\mathcal{S}|}} \Big( \sum_{i \in \{i : c_i = c\}} \mathbf{1}_{z_i}^\top \mathbf{L} \Big) p = \arg\max_{p \in \Delta^{|\mathcal{S}|}} \Big( \sum_{i \in \{i : c_i = c\}} \mathbf{1}_{z_i}^\top \big( -\mathbf{L} + L_{\max}\mathbf{1}\mathbf{1}^\top \big) \Big) p = \arg\max_{p \in \Delta^{|\mathcal{S}|}} \Big( \sum_{i \in \{i : c_i = c\}} \mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i} \Big)^{\top} p$

$p^*(c) = \arg\min_{p \in \Delta^{|\mathcal{S}|}} \sum_{i \in \{i : c_i = c\}} \mathcal{C}\big(\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i},\, p\big) = \arg\min_{p \in \Delta^{|\mathcal{S}|}} \mathcal{C}\Big( \sum_{i \in \{i : c_i = c\}} \mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i},\, p \Big)$

Since $\arg\min_{p \in \Delta^{|\mathcal{S}|}} \mathcal{C}(\mathbf{g}, p) = \mathbf{g}/\|\mathbf{g}\|_1$, $p^*(c)$ is the normalization of the linear objective above, so $p^*(c)$ puts the maximum probability mass on the vertex of $\hat{p}(c)$!

26 / 38

Motivation for Neural DUDE

From the lemma, DUDE can be obtained by solving

$p^*(c) = \arg\min_{p \in \Delta^{|\mathcal{S}|}} \sum_{i \in \{i : c_i = c\}} \mathcal{C}\big(\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i},\, p\big),$

for each context $c$ separately.

⇒ We may instead define a single model $p(\mathbf{w}, \cdot) : \mathcal{Z}^{2k} \to \Delta^{|\mathcal{S}|}$ and solve

$\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} \mathcal{C}\big(\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i},\, p(\mathbf{w}, c_i)\big)$

to obtain a model that works for all $c$'s.

27 / 38

Neural DUDE algorithm

1. Define a DNN model

   $p(\mathbf{w}, \cdot) : \mathcal{Z}^{2k} \to \Delta^{|\mathcal{S}|}$

2. Given $z^n$, obtain $\mathbf{w}^*$ that minimizes

   $\mathcal{L}(z^n, \mathbf{w}) \triangleq \frac{1}{n} \sum_{i=1}^{n} \mathcal{C}\big(\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i},\, p(\mathbf{w}, c_i)\big)$

   - Results in a single network $p(\mathbf{w}^*, \cdot)$ that works for all $c$
   - Interpret $c_i$ as the "input" and $\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i} \in \mathbb{R}^{|\mathcal{S}|}$ as a "pseudo-label"

3. Compute

   $s_{k,\mathrm{Neural\,DUDE}}(c, \cdot) = \arg\max_{s \in \mathcal{S}} p_s(\mathbf{w}^*, c)$

4. Obtain $\hat{X}_{i,\mathrm{Neural\,DUDE}}(z^n) = s_{k,\mathrm{Neural\,DUDE}}(c_i, z_i)$

28 / 38
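A minimal end-to-end sketch of these four steps for the binary case, assuming the `tensorflow.keras` API (the slides do not specify a framework); the one-hot context encoding, the 20-20-20 hidden widths taken from the 4L architecture on the experimental-setup slide later in the talk, and the mapping from the chosen $s$ back to a reconstruction are my own choices.

```python
import numpy as np
from tensorflow import keras

def neural_dude(z, k, L_new, epochs=10):
    """Sketch of Neural DUDE for binary z (0/1 numpy array).
    L_new: |Z| x |S| matrix with S = (always-0, always-1, say-what-you-see)."""
    n = len(z)
    idx = np.arange(k, n - k)
    # Contexts c_i = (z_{i-k}^{i-1}, z_{i+1}^{i+k}), one-hot encoded (2k symbols)
    ctx = np.stack([np.concatenate([z[i - k:i], z[i + 1:i + k + 1]]) for i in idx])
    C = np.eye(2)[ctx].reshape(len(idx), 4 * k)
    # Pseudo-labels L_new^T 1_{z_i}: the z_i-th row of L_new
    G = L_new[z[idx]]
    # p(w, .) : Z^{2k} -> Delta^{|S|}  (4L: 20-20-20 hidden units, softmax output)
    model = keras.Sequential([
        keras.Input(shape=(4 * k,)),
        keras.layers.Dense(20, activation="relu"),
        keras.layers.Dense(20, activation="relu"),
        keras.layers.Dense(20, activation="relu"),
        keras.layers.Dense(L_new.shape[1], activation="softmax"),
    ])
    # categorical cross-entropy of (pseudo-label, prediction) equals C(g, p)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    model.fit(C, G, batch_size=128, epochs=epochs, verbose=0)
    # Steps 3-4: s_k(c_i, .) = argmax_s p_s(w*, c_i), then apply it to z_i
    s_idx = model.predict(C, verbose=0).argmax(axis=1)
    apply_s = [lambda zi: 0, lambda zi: 1, lambda zi: zi]
    xhat = z.copy()
    for j, i in enumerate(idx):
        xhat[i] = apply_s[s_idx[j]](z[i])
    return xhat
```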

Property of Neural DUDE

- A sliding window denoiser
- A single, parametric model
  - Information from similar contexts is combined (cf. the non-parametric DUDE)
- Robust to the choice of $k$ (as shown in the experiments)

29 / 38

Outline

Introduction to discrete denoising

Deep neural networks

Neural DUDE

Experimental results

30 / 38

Experimental setup

For binary data over a BSC (with $\delta < 0.5$)

- $|\mathcal{S}| = 3$: {always-0, always-1, say-what-you-see}

Neural DUDE models (with an increasing number of layers)

- 1L: $2k$–$3$ (softmax regression)
- 2L: $2k$–$20$–$3$
- 3L: $2k$–$20$–$20$–$3$
- 4L: $2k$–$20$–$20$–$20$–$3$
- Mini-batch SGD (Adam, momentum) used for optimization

31 / 38

How good is Neural DUDE? - Finite n

Binary example with $n = 10^6$

- Source : BSMC($p$) with $p = 0.1$
- Noise : BSC($\delta$) with $\delta = 0.1$

[Plot: (Bit Error Rate)/$\delta$ versus window size $k$, comparing DUDE (best point $0.563\delta$), Neural DUDE with 1, 2, 3, and 4 layers, and the FB recursion ($0.558\delta$).]

DUDE is sensitive to $k$; Neural DUDE (4L) is robust to $k$!

32 / 38

Concentration of the estimated loss

How close is $\frac{1}{n}\sum_{i=1}^{n} \mathbf{L}(z_i, s_k(c_i, \cdot))$ to $\frac{1}{n}\sum_{i=1}^{n} \Lambda(x_i, s_k(c_i, z_i))$?

[Two plots of (Bit Error Rate)/$\delta$ versus window size $k$: true BER vs. estimated BER for DUDE (left) and for Neural DUDE (4L) (right), with the FB recursion as reference.]

The estimated loss of Neural DUDE concentrates on the true loss

- Can pick $k$ based on $\frac{1}{n}\sum_{i=1}^{n} \mathbf{L}(z_i, s_k(c_i, \cdot))$ for Neural DUDE!

33 / 38
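A sketch of how this model-selection rule might look in code, under my own assumed interfaces (`s_choice[i]` holds the index in $\mathcal{S}$ of the map the trained network picked at position $i$; `candidate_ks` and `choices` are hypothetical):

```python
import numpy as np

def estimated_loss(z, k, L, s_choice):
    """Average of L(z_i, s_k(c_i, .)) over interior positions; needs no clean data."""
    idx = np.arange(k, len(z) - k)
    return L[z[idx], s_choice[idx]].mean()

# Hypothetical selection loop: pick the k with the smallest estimated loss.
# best_k = min(candidate_ks, key=lambda k: estimated_loss(z, k, L, choices[k]))
```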

Binary image denoising

Setting

- Source: [example binary images] or · · ·

- Noise: BSC($\delta$) with $\delta = 0.1$

Comparison between DUDE and Neural DUDE?

- Raster scan of the data
- Use 1-D contexts

34 / 38

Binary image denoising - Einstein 256× 256

Einstein (clean)

35 / 38

Binary image denoising - Einstein 256× 256

Einstein (noisy, δ = 0.1)

35 / 38

Binary image denoising - Einstein 256× 256

DUDE (k = 5, BER=0.561δ)

35 / 38

Binary image denoising - Einstein 256× 256

Neural DUDE (k = 27, BER=0.417δ)

35 / 38

Binary image denoising - Einstein 256× 256

Neural DUDE significantly outperforms DUDE! (a relative 25.7% improvement)

[Plot: (Bit Error Rate)/$\delta$ versus window size $k$ for DUDE (best $0.561\delta$) and Neural DUDE (4L), which reaches $0.417\delta$.]

Note that for $k = 27$ there are $2^{54}$ different contexts!! ($\gg n = 2^{16}$)

35 / 38

Binary image denoising - Einstein 256× 256

Estimated loss of Neural DUDE still concentrates on the true loss!

[Two plots of (Bit Error Rate)/$\delta$ versus window size $k$: true BER vs. estimated BER for DUDE (left) and for Neural DUDE (4L) (right).]

35 / 38

Binary image denoising results

Relative error reduction of Neural DUDE over DUDE

[Bar chart: relative error reduction (%) of Neural DUDE over DUDE for the Barbara, Mandrill, Lena, Camera Man, Boat, Peppers, Einstein, Halftone, and Shannon Text images; axis range 0-30%.]

36 / 38

Future/On-going research directions

Analysis

- Analyzing the concentration of the estimated loss
- Connection to SURE (Stein's Unbiased Risk Estimator)

Algorithmic extension

- Extending to larger-alphabet data beyond binary (e.g., DNA)
- Extending to continuous-valued data (e.g., grayscale images)
- Applying Recurrent Neural Networks (RNNs)
- Trying other general function-approximation methods (e.g., random forests)

37 / 38

Thank you very much!

E-mail: [email protected]

38 / 38