
Neural Universal Discrete Denoiser

Taesup Moon

Daegu Gyeongbuk Institute of Science and Technology (DGIST), Daegu, Korea

April 22, 2016, Information Theory Forum, Stanford University

1 / 38

Outline

Introduction to discrete denoising

Deep neural networks

Neural DUDE

Experimental results

2 / 38

Discrete denoising - an estimation problem

- $X_i$, $Z_i$, $\hat{X}_i$ take values in finite alphabets (e.g., binary, DNA)

- Goal: Choose $\hat{X}^n$ as close as possible to $X^n$, based on $Z^n$

3 / 38

E.g., Correcting noisy bit stream

$X^n$ : 00000011111110000000000111111111100000001111111

$Z^n$ : 00100011101110010001000111110111100000011110111

$\hat{X}^n$ : 00000011101110000000000011111111100000011111111

bit error rate (BER) = (number of errors) / $n$ = 3/47 ≈ 0.064

- $X^n$ : clean bit stream
- $Z^n$ : noisy bit stream (suppose bits are flipped w.p. $\delta = 0.1$)
- $\hat{X}^n$ : estimate of $X^n$ based on observing $Z^n$

"given $Z^n$, how would you choose $\hat{X}^n$ to minimize BER?"

4 / 38
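As a small worked companion to the BER arithmetic above, here is a minimal Python sketch (not part of the slides; the function name is mine):

```python
def bit_error_rate(x, x_hat):
    """Fraction of positions where the estimate differs from the clean stream."""
    assert len(x) == len(x_hat)
    return sum(a != b for a, b in zip(x, x_hat)) / len(x)

# For the streams on this slide: 3 mismatches out of n = 47, so BER = 3/47 ~ 0.064.
```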


Life is easy if the source and noise are known

[Bayesian setting], e.g.,

- Source : BSMC($p$) - binary symmetric Markov chain with transition probability $p$

  [State diagram: states 0 and 1; stays w.p. $1-p$, switches w.p. $p$]

- Noise : BSC($\delta$) - flips bits with probability $\delta$

  [Channel diagram: input bit passes through w.p. $1-\delta$, is flipped w.p. $\delta$]

⇒ $Z^n$ is a hidden Markov process (HMP) → the Forward-Backward recursion is optimal

5 / 38
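The slides do not spell out the Forward-Backward recursion, so here is a minimal numpy sketch for this BSMC($p$) + BSC($\delta$) setting; it is my own illustration (function name, normalization, and the per-symbol MAP decision rule are assumptions consistent with minimizing BER), not the talk's code.

```python
import numpy as np

def fb_denoise(z, p, delta):
    """Per-symbol posterior (Forward-Backward) denoising of a BSMC(p) source
    observed through a BSC(delta); returns the BER-minimizing estimate."""
    n = len(z)
    T = np.array([[1 - p, p], [p, 1 - p]])                   # Markov transitions
    E = np.array([[1 - delta, delta], [delta, 1 - delta]])   # BSC emissions
    alpha = np.zeros((n, 2))
    beta = np.zeros((n, 2))
    alpha[0] = 0.5 * E[:, z[0]]                  # stationary prior is (1/2, 1/2)
    alpha[0] /= alpha[0].sum()
    for i in range(1, n):                        # forward pass (normalized)
        alpha[i] = (alpha[i - 1] @ T) * E[:, z[i]]
        alpha[i] /= alpha[i].sum()
    beta[-1] = 1.0
    for i in range(n - 2, -1, -1):               # backward pass (normalized)
        beta[i] = T @ (E[:, z[i + 1]] * beta[i + 1])
        beta[i] /= beta[i].sum()
    post = alpha * beta
    post /= post.sum(axis=1, keepdims=True)      # P(X_i = x | Z^n)
    return post.argmax(axis=1)                   # per-symbol MAP = min expected BER

# Example (hypothetical): x_hat = fb_denoise(np.array(noisy_bits), p=0.1, delta=0.1)
```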

In real life, the source is often not known

Correcting noisy bit stream

- Source : 0000011110000101011100000 . . . (?)

- Noise : BSC($\delta$)

  [Channel diagram: BSC($\delta$), as above]

⇒ ?

6 / 38

In real life, the source is often not known

Image denoising

- Source : [example binary images] or · · ·

- Noise : BSC($\delta$)

  [Channel diagram: BSC($\delta$), as above]

⇒ ?

6 / 38

Can we estimate an unknown source?

[Universal setting]

- Unknown source
- Known memoryless channel $\Pi$ (e.g., BSC)

  $\Pi(x, z) = \mathrm{Prob}(Z = z \mid X = x)$

- Assume $\Pi$ has a right inverse (a mild assumption)
- Loss measured by $\Lambda(x, \hat{x})$ (e.g., Hamming loss)

  $L_{\hat{X}^n}(x^n, z^n) = \frac{1}{n}\sum_{i=1}^{n} \Lambda\big(x_i, \hat{X}_i(z^n)\big)$

Can we still denoise as well as if we knew the source?

7 / 38

DUDE (Discrete Universal DEnoiser)’s attempt

A two-pass algorithm [Weissman et al. 05]

- Fix window (context) size $k$
- First pass: for each $z_i$

  1. Identify the double-sided context $c_i = (z_{i-k}^{i-1}, z_{i+1}^{i+k}) = (\ell^k, r^k)$

     $\cdots\ \ell_1\ \ell_2\ \cdots\ \ell_k\ \ z_i\ \ r_1\ r_2\ \cdots\ r_k\ \cdots$

  2. Update the count vector $m[z^n, c_i] \in \mathbb{R}^{|\mathcal{Z}|}$:

     $m[z^n, c_i][z_i] \leftarrow m[z^n, c_i][z_i] + 1$

- Second pass: denoise $z_i$ with

  $\hat{X}_i(z^n) = \text{Simple rule}\big(\Pi, \Lambda, m[z^n, c_i], z_i\big)$

8 / 38
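To make the two passes concrete, here is a minimal Python sketch of DUDE as just described (my own variable names; boundary positions are simply copied from the noisy sequence; `Pi` is the channel matrix and `Lam[x, xhat]` the loss matrix).

```python
import numpy as np
from collections import defaultdict

def dude(z, k, Pi, Lam):
    """Two-pass DUDE sketch over a finite alphabet {0, ..., A-1}."""
    A = Pi.shape[0]
    Pi_inv = np.linalg.inv(Pi)
    # First pass: count noisy symbols per double-sided context c = (l^k, r^k)
    m = defaultdict(lambda: np.zeros(A))
    for i in range(k, len(z) - k):
        c = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        m[c][z[i]] += 1
    # Second pass: apply the simple rule (made explicit as Eq. (2) later in the talk)
    xhat = np.array(z).copy()
    for i in range(k, len(z) - k):
        c = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        scores = [m[c] @ Pi_inv @ (Lam[:, x] * Pi[:, z[i]]) for x in range(A)]
        xhat[i] = int(np.argmin(scores))
    return xhat
```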

Noiseless text

We might place the restriction on allowable sequences that no spaces

follow each other. · · · effect of statistical knowledge about the source

in reducing the required capacity of the channel · · · the relative

frequency of the digram i j. The letter frequencies p(i), the transition

probabilities · · · The resemblance to ordinary English text increases

quite noticeably at each of the above steps. · · · This theorem, and the

assumptions required for its proof, are in no way necessary for the

present theory. · · · The real justification of these definitions,

however, will reside in their implications. · · · H is then, for example,

the H in Boltzmann's famous H theorem. We shall call $H = -\sum p_i \log p_i$

the entropy of the set of probabilities $p_1, \ldots, p_n$. · · · The theorem says

that for large N this will be independent of q and equal to H. · · · The

next two theorems show that H and H′ can be determined by limiting

operations directly from the statistics of the message sequences,

without reference to the states and transition probabilities between

states. · · · The Fundamental Theorem for a Noiseless Channel · · · The

converse part of the theorem, that $C/H$ cannot be exceeded, may be proved by · · ·

9 / 38

DUDE observes noisy text

Wz right peace the rest iction on alksoable sequbole thgt wo spices

fokiow eadh otxer. · · · egfbct of sraaistfcal keowleuge apolt tje souwce

in recucilg the requihed clpagity ofythe clabbel · · · the relatrte

pweqiency ofpthe digram i j. The setter freqbwncles p(i), ghe rrahsibion

probtbilities · · · The resemglahca to ordwnard Engdish tzxt ircreakes

quitq noliceabcy at vach oftthe hbove steps. · · · Thus theorev, andlthe

aszumptjona requiyed ffr its croof, arv il no wsy necqssrry forptfe

prwwent theorz. · · · jhe reap juptifocation of dhese defikjtmons,

doweyer, bill rehide inytheir imjlycajijes. · · · H is them, fol eskmqle,

tle H in Bolgnmann’s falous H themreg. We vhall cbll H = −∑

pi log pithe wntgopz rf thb set jf prwbabjlities p1, . . . , pn. · · · The theorem sahs

tyat fsr lawge N mhis gill we hndeypensdest of q aed vqunl tj H. · · ·The neht txo theiremf scow tyat H and H′ can be degereined jy likitkng

operatiofs digectlt fgom the stgtissics of thk mfssagj siqufnves,

bithout referenge ty the htates and trankituon krobabilitnes bejwekn

ltates. · · · The Fundkmendal Theorem kor a Soiselesd Chjnnen · · · Lhe

ronvegse jaht jf tketheorem, thlt CH

calnot be excweded, may ke xroved

ey · · ·

9 / 38

To denoise m at i

[Same noisy text as on the previous slide; the symbol to denoise is the m in "themreg", whose double-sided context of size 2 is (he, re).]

9 / 38

DUDE with window size 2 searches for h e • r e

[Same noisy text; DUDE scans the whole sequence for every occurrence of the two-sided context h e _ r e.]

9 / 38

DUDE with window size 2 counts h e • r e

- The count vector

  $m[\text{Noisy text}, (he, re)] = [0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 1\ 0\ 0\ 0\ 1\ 0\ 4\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 5]^\top$

  (nonzero entries: i → 1, m → 1, o → 4, space → 5)

- The reconstruction of m at $i$

  $\hat{X}_i = \text{Simple rule}\big(\Pi, \Lambda, m[\text{Noisy text}, (he, re)], m\big)$

- Wherever DUDE sees "hemre", it makes the same decision for m

  → DUDE is a sliding window denoiser

9 / 38


How good is DUDE? - Asymptotics

Denote $\mathcal{S}_k$ as the class of $k$-th order sliding window denoisers

- $s_k \in \mathcal{S}_k$ is a mapping $\mathcal{Z}^{2k+1} \to \mathcal{X}$
- $\hat{X}_i(z^n) = s_k(z_{i-k}^{i+k})$

DUDE is as good as the best $s_k \in \mathcal{S}_k$ for any source!

Theorem (Weissman et al. 05)
For all $x^\infty \in \mathcal{X}^\infty$, if $k|\mathcal{Z}|^{2k} = o\big(\frac{n}{\log n}\big)$, then

$\lim_{n \to \infty} \Big( L_{\hat{X}_{\mathrm{DUDE}}}(x^n, Z^n) - \underbrace{D_k(x^n, Z^n)}_{\text{best loss in } \mathcal{S}_k} \Big) = 0 \quad \text{w.p. } 1$

10 / 38

How good is DUDE? - Finite n

Binary example with $n = 10^6$

- Source : BSMC($p$) with $p = 0.1$
- Noise : BSC($\delta$) with $\delta = 0.1$

[Plot: (Bit Error Rate)/$\delta$ versus window size $k$. DUDE reaches $0.563\delta$ at its best $k$; the FB recursion achieves $0.558\delta$.]

DUDE gets close to the optimum (FB recursion) for some k

11 / 38

Limitations of DUDE in practice

The $k|\mathcal{Z}|^{2k} = o\big(\frac{n}{\log n}\big)$ condition

- $n$ needs to grow exponentially with the context size $k$
- Counts for similar contexts are not shared
- For larger $|\mathcal{Z}|$, things get worse (e.g., text, grayscale images)
- No concrete way of choosing the right $k$ for given $n$ and $|\mathcal{Z}|$

Can we overcome these limitations?

12 / 38


Outline

Introduction to discrete denoising

Deep neural networks

Neural DUDE

Experimental results

13 / 38

Deep neural networks (DNN) are eating the world!

E.g., Order of magnitude improvement in image classification

- ImageNet Top-5 error rates

14 / 38

Deep neural networks (DNN) are eating the world!

E.g., AlphaGo beats Lee Sedol by 4-1

- ConvNet-based policy/value networks for tree search

15 / 38

Supervised Softmax Regression

Multi-class classification: $x \in \mathbb{R}^d$, $y \in \{1, \ldots, K\}$

- Model:

  $o_k = w_k^\top x, \qquad p_k(\mathbf{w}, x) = P(y = k \mid x; \mathbf{w}) = \frac{\exp(o_k)}{\sum_{j=1}^{K} \exp(o_j)}$

- Given $D = \{(x_i, y_i)\}_{i=1}^{m}$, minimize

  $\mathcal{L}(D, \mathbf{w}) = -\sum_{i=1}^{m} \sum_{k=1}^{K} t_{ik} \log p_k(\mathbf{w}, x_i)$

  where $t_{ik} = 1\{y_i = k\}$

16 / 38
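For concreteness, a small numpy sketch of this loss and its gradient (my own helper functions, not from the slides):

```python
import numpy as np

def softmax(o):
    o = o - o.max(axis=1, keepdims=True)        # subtract max for stability
    e = np.exp(o)
    return e / e.sum(axis=1, keepdims=True)

def softmax_regression_loss_and_grad(W, X, T):
    """X: m x d inputs, T: m x K one-hot labels (t_ik), W: d x K weights."""
    P = softmax(X @ W)                          # p_k(w, x_i) for every example
    loss = -(T * np.log(P)).sum()               # L(D, w)
    grad = X.T @ (P - T)                        # gradient of L(D, w) w.r.t. W
    return loss, grad
```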

Supervised Deep Neural Network (DNN)

Multi-class classification: $x \in \mathbb{R}^d$, $y \in \{1, \ldots, K\}$

- Model:

  $o_k = w_k^\top h_L, \qquad h_{k,\ell} = \max(0,\, w_{k,\ell}^\top h_{\ell-1}), \quad \ell = 1, \ldots, L \ \ (h_0 = x)$

  $p_k(\mathbf{w}, x) = P(y = k \mid x; \mathbf{w}) = \frac{\exp(o_k)}{\sum_{j=1}^{K} \exp(o_j)}$

- Given $D = \{(x_i, y_i)\}_{i=1}^{m}$, minimize

  $\mathcal{L}(D, \mathbf{w}) = -\sum_{i=1}^{m} \sum_{k=1}^{K} t_{ik} \log p_k(\mathbf{w}, x_i)$

  where $t_{ik} = 1\{y_i = k\}$

17 / 38

Supervised Deep Neural Network (DNN)

Multi-class classification: $x \in \mathbb{R}^d$, $y \in \{1, \ldots, K\}$

- After training, we obtain a mapping

  $p(\mathbf{w}^*, \cdot) : \mathbb{R}^d \to \Delta^K$

- For a new data point $x$, the DNN predicts with

  $\hat{y} = \arg\max_k \, p_k(\mathbf{w}^*, x)$

18 / 38
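A minimal numpy sketch of this forward pass and prediction rule (weight shapes and names are my own; biases omitted as on the slide):

```python
import numpy as np

def dnn_predict(x, Ws):
    """ReLU network: hidden layers for Ws[:-1], softmax over the final logits."""
    h = x
    for W in Ws[:-1]:
        h = np.maximum(0.0, h @ W)              # h_l = max(0, W_l^T h_{l-1})
    o = h @ Ws[-1]                              # logits o_k
    p = np.exp(o - o.max())
    p /= p.sum()                                # p_k(w, x)
    return int(np.argmax(p))                    # y_hat = argmax_k p_k(w*, x)
```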

Outline

Introduction to discrete denoising

Deep neural networks

Neural DUDE

Experimental results

19 / 38

DNN for discrete denoising?

Recall $s_k : \mathcal{Z}^{2k+1} \to \mathcal{X}$.

Question: Can we learn a sliding window denoiser with a DNN?

- Namely, can we train a DNN to obtain a mapping

  $p(\mathbf{w}^*, \cdot) : \mathcal{Z}^{2k+1} \to \Delta^{|\mathcal{X}|}$

  as in multi-class classification?

- A parametric model could overcome the limitations of DUDE

Not straightforward!

- Training a DNN typically requires supervised training data
- The ground-truth label is the clean source, which is not available!

An alternative view on discrete denoising is necessary

20 / 38


Two equivalent views of a sliding window denoiser

Recall $c_i = (z_{i-k}^{i-1}, z_{i+1}^{i+k})$. Then $z_{i-k}^{i+k} = (c_i, z_i)$.

$\hat{X}_i(z^n) = s_k(z_{i-k}^{i+k})$

- $s_k : \mathcal{Z}^{2k+1} \to \mathcal{X}$

$\hat{X}_i(z^n) = s_k(c_i, z_i)$

- Let $\mathcal{S} = \{s : \mathcal{Z} \to \mathcal{X}\}$
- Then $s_k(c_i, \cdot) \in \mathcal{S}$
- Note $|\mathcal{S}| = |\mathcal{X}|^{|\mathcal{Z}|}$

Maybe we can learn a mapping $p(\mathbf{w}^*, \cdot) : \mathcal{Z}^{2k} \to \Delta^{|\mathcal{S}|}$ with a DNN?

21 / 38
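To make $|\mathcal{S}| = |\mathcal{X}|^{|\mathcal{Z}|}$ concrete for the binary case, a tiny sketch (the later experimental-setup slide is where only 3 of these 4 maps are kept for the BSC with $\delta < 0.5$):

```python
from itertools import product

# Enumerate the single-symbol denoisers S = {s : Z -> X} for binary alphabets.
Z_alpha, X_alpha = (0, 1), (0, 1)
S = [dict(zip(Z_alpha, outs)) for outs in product(X_alpha, repeat=len(Z_alpha))]
print(len(S), S)   # |S| = |X|^|Z| = 4: always-0, say-what-you-see, flip, always-1
```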

Unbiased estimated loss for single-symbol denoiser

Consider a single-letter case

$x \to \Pi \to Z \to s(Z) = \hat{x}$

- The true loss $\Lambda(x, s(Z))$ cannot be evaluated
- Instead, we can devise $\mathbf{L} = \Pi^{-1}\rho \in \mathbb{R}^{|\mathcal{Z}| \times |\mathcal{S}|}$
  - $\Pi^{-1} \in \mathbb{R}^{|\mathcal{Z}| \times |\mathcal{X}|}$
  - $\rho \in \mathbb{R}^{|\mathcal{X}| \times |\mathcal{S}|}$ with $\rho(u, s) = \mathbb{E}_u\, \Lambda(u, s(Z))$
- Then $\mathbf{L}(Z, s)$ is an unbiased estimate of $\mathbb{E}_x \Lambda(x, s(Z))$, i.e.,

  $\mathbb{E}_x \mathbf{L}(Z, s) = \mathbb{E}_x \Lambda(x, s(Z)) \qquad (1)$

- First devised in [Weissman et al. 07]

22 / 38
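A small numpy sketch of this construction for the BSC($\delta$)/Hamming-loss case, using the three single-symbol denoisers that appear later in the experiments; the sanity check at the end verifies Eq. (1). This is my own illustration, not the talk's code.

```python
import numpy as np

delta = 0.1
Pi = np.array([[1 - delta, delta],
               [delta, 1 - delta]])            # BSC(delta)
Lam = np.array([[0.0, 1.0],
                [1.0, 0.0]])                   # Hamming loss Lam[x, xhat]

# Single-symbol denoisers s : Z -> X (always-0, always-1, say-what-you-see)
S = [lambda z: 0, lambda z: 1, lambda z: z]

# rho[x, s] = E_x Lam(x, s(Z)) = sum_z Pi[x, z] * Lam[x, s(z)]
rho = np.array([[sum(Pi[x, z] * Lam[x, s(z)] for z in (0, 1)) for s in S]
                for x in (0, 1)])
L = np.linalg.inv(Pi) @ rho                    # estimated loss L(z, s)

# Eq. (1): E_x L(Z, s) = (Pi @ L)[x, s] should equal E_x Lam(x, s(Z)) = rho[x, s]
assert np.allclose(Pi @ L, rho)
```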

A closer look at DUDE

The concrete DUDE rule with $c_i = (Z_{i-k}^{i-1}, Z_{i+1}^{i+k})$:

$\hat{X}_{i,\mathrm{DUDE}}(Z^n) = \arg\min_{\hat{x} \in \mathcal{X}} \; m[Z^n, c_i]^\top \Pi^{-1} \big[ \lambda_{\hat{x}} \odot \pi_{Z_i} \big] \qquad (2)$

- $\hat{X}_{i,\mathrm{DUDE}}(Z^n) = s_{k,\mathrm{DUDE}}(c_i, Z_i)$, and we can show

  $s_{k,\mathrm{DUDE}}(c, \cdot) = \arg\min_{s \in \mathcal{S}} \sum_{i \in \{i : c_i = c\}} \mathbf{L}(Z_i, s)$

- If $|\{i : c_i = c\}|$ is large enough, then $s_{k,\mathrm{DUDE}}(c, \cdot)$ gets close to

  $\arg\min_{s \in \mathcal{S}} \sum_{i \in \{i : c_i = c\}} \Lambda(x_i, s(Z_i))$

"DUDE learns $s \in \mathcal{S}$ for each $c$ by minimizing the estimated loss!"

23 / 38
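In code, the per-context minimization on this slide is a single matrix-vector product followed by an argmin; a sketch under my own naming (`m_c` is the count vector of context $c$ and `L` the estimated-loss matrix from the previous slide):

```python
import numpy as np

def dude_map_for_context(m_c, L):
    """Index in S of arg min_s sum_{i: c_i = c} L(z_i, s) = (m_c)^T L[:, s]."""
    return int(np.argmin(m_c @ L))
```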


An alternative expression of DUDE

The following gives an alternative expression of

$s_{k,\mathrm{DUDE}}(c, \cdot) = \arg\min_{s \in \mathcal{S}} \sum_{i \in \{i : c_i = c\}} \mathbf{L}(z_i, s)$

- Define

  $\hat{p}(c) = \arg\min_{p \in \Delta^{|\mathcal{S}|}} \Big( \sum_{i \in \{i : c_i = c\}} \mathbf{1}_{z_i}^\top \mathbf{L} \Big)\, p,$

  which will be on one of the vertices of $\Delta^{|\mathcal{S}|}$ ($\because$ a simple LP)

- Then $s_{k,\mathrm{DUDE}}(c, \cdot) = \arg\max_{s \in \mathcal{S}} \hat{p}_s(c)$

24 / 38

Another way of obtaining DUDE

We can also obtain $s_{k,\mathrm{DUDE}}(c, \cdot)$ via the following lemma.

Lemma
Define $\mathbf{L}_{\mathrm{new}} \triangleq -\mathbf{L} + L_{\max}\mathbf{1}\mathbf{1}^\top$, where $L_{\max} = \max_{z,s} \mathbf{L}(z, s)$. Denote

$p^*(c) = \arg\min_{p \in \Delta^{|\mathcal{S}|}} \sum_{i \in \{i : c_i = c\}} \mathcal{C}\big(\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i},\, p\big),$

in which $\mathcal{C}(\mathbf{g}, p) \triangleq -\sum_{k=1}^{|\mathcal{S}|} g_k \log p_k$ for any $\mathbf{g} \in \mathbb{R}_+^{|\mathcal{S}|}$ and $p \in \Delta^{|\mathcal{S}|}$.

Then $s_{k,\mathrm{DUDE}}(c, \cdot) = \arg\max_{s \in \mathcal{S}} p^*_s(c)$.

25 / 38

Proof of the lemma

Proof.
Recall $s_{k,\mathrm{DUDE}}(c, \cdot) = \arg\max_{s \in \mathcal{S}} \hat{p}_s(c)$, with

$\hat{p}(c) = \arg\min_{p \in \Delta^{|\mathcal{S}|}} \Big( \sum_{i \in \{i : c_i = c\}} \mathbf{1}_{z_i}^\top \mathbf{L} \Big) p = \arg\max_{p \in \Delta^{|\mathcal{S}|}} \Big( \sum_{i \in \{i : c_i = c\}} \mathbf{1}_{z_i}^\top \big( -\mathbf{L} + L_{\max}\mathbf{1}\mathbf{1}^\top \big) \Big) p = \arg\max_{p \in \Delta^{|\mathcal{S}|}} \Big( \sum_{i \in \{i : c_i = c\}} \mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i} \Big)^{\top} p$

$p^*(c) = \arg\min_{p \in \Delta^{|\mathcal{S}|}} \sum_{i \in \{i : c_i = c\}} \mathcal{C}\big(\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i},\, p\big) = \arg\min_{p \in \Delta^{|\mathcal{S}|}} \mathcal{C}\Big( \sum_{i \in \{i : c_i = c\}} \mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i},\, p \Big)$

Since $\arg\min_{p \in \Delta^{|\mathcal{S}|}} \mathcal{C}(\mathbf{g}, p) = \mathbf{g}/\|\mathbf{g}\|_1$, $p^*(c)$ is the normalization of the linear objective above, so $p^*(c)$ puts the maximum probability mass on the vertex of $\hat{p}(c)$!

26 / 38

Motivation for Neural DUDE

From the lemma, DUDE can be obtained by solving

$p^*(c) = \arg\min_{p \in \Delta^{|\mathcal{S}|}} \sum_{i \in \{i : c_i = c\}} \mathcal{C}\big(\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i},\, p\big),$

for each context $c$ separately.

⇒ We may instead define a single model $p(\mathbf{w}, \cdot) : \mathcal{Z}^{2k} \to \Delta^{|\mathcal{S}|}$ and solve

$\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} \mathcal{C}\big(\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i},\, p(\mathbf{w}, c_i)\big)$

to obtain a model that works for all $c$'s.

27 / 38

Neural DUDE algorithm

1. Define a DNN model

   $p(\mathbf{w}, \cdot) : \mathcal{Z}^{2k} \to \Delta^{|\mathcal{S}|}$

2. Given $z^n$, obtain $\mathbf{w}^*$ that minimizes

   $\mathcal{L}(z^n, \mathbf{w}) \triangleq \frac{1}{n} \sum_{i=1}^{n} \mathcal{C}\big(\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i},\, p(\mathbf{w}, c_i)\big)$

   - Results in a single network $p(\mathbf{w}^*, \cdot)$ that works for all $c$
   - Interpret $c_i$ as the "input" and $\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i} \in \mathbb{R}^{|\mathcal{S}|}$ as a "pseudo-label"

3. Compute

   $s_{k,\mathrm{Neural\,DUDE}}(c, \cdot) = \arg\max_{s \in \mathcal{S}} p_s(\mathbf{w}^*, c)$

4. Obtain $\hat{X}_{i,\mathrm{Neural\,DUDE}}(z^n) = s_{k,\mathrm{Neural\,DUDE}}(c_i, z_i)$

28 / 38
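A minimal end-to-end sketch of these four steps for the binary case, assuming the `tensorflow.keras` API (the slides do not specify a framework); the one-hot context encoding, the 20-20-20 hidden widths taken from the 4L architecture on the experimental-setup slide later in the talk, and the mapping from the chosen $s$ back to a reconstruction are my own choices.

```python
import numpy as np
from tensorflow import keras

def neural_dude(z, k, L_new, epochs=10):
    """Sketch of Neural DUDE for binary z (0/1 numpy array).
    L_new: |Z| x |S| matrix with S = (always-0, always-1, say-what-you-see)."""
    n = len(z)
    idx = np.arange(k, n - k)
    # Contexts c_i = (z_{i-k}^{i-1}, z_{i+1}^{i+k}), one-hot encoded (2k symbols)
    ctx = np.stack([np.concatenate([z[i - k:i], z[i + 1:i + k + 1]]) for i in idx])
    C = np.eye(2)[ctx].reshape(len(idx), 4 * k)
    # Pseudo-labels L_new^T 1_{z_i}: the z_i-th row of L_new
    G = L_new[z[idx]]
    # p(w, .) : Z^{2k} -> Delta^{|S|}  (4L: 20-20-20 hidden units, softmax output)
    model = keras.Sequential([
        keras.Input(shape=(4 * k,)),
        keras.layers.Dense(20, activation="relu"),
        keras.layers.Dense(20, activation="relu"),
        keras.layers.Dense(20, activation="relu"),
        keras.layers.Dense(L_new.shape[1], activation="softmax"),
    ])
    # categorical cross-entropy of (pseudo-label, prediction) equals C(g, p)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    model.fit(C, G, batch_size=128, epochs=epochs, verbose=0)
    # Steps 3-4: s_k(c_i, .) = argmax_s p_s(w*, c_i), then apply it to z_i
    s_idx = model.predict(C, verbose=0).argmax(axis=1)
    apply_s = [lambda zi: 0, lambda zi: 1, lambda zi: zi]
    xhat = z.copy()
    for j, i in enumerate(idx):
        xhat[i] = apply_s[s_idx[j]](z[i])
    return xhat
```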

Property of Neural DUDE

- A sliding window denoiser
- A single, parametric model
  - Information from similar contexts is combined (cf. the non-parametric DUDE)
- Robust to the choice of $k$ (as shown in the experiments)

29 / 38

Outline

Introduction to discrete denoising

Deep neural networks

Neural DUDE

Experimental results

30 / 38

Experimental setup

For binary data over a BSC (with $\delta < 0.5$)

- $|\mathcal{S}| = 3$: {always-0, always-1, say-what-you-see}

Neural DUDE models (with an increasing number of layers)

- 1L: $2k$–$3$ (softmax regression)
- 2L: $2k$–$20$–$3$
- 3L: $2k$–$20$–$20$–$3$
- 4L: $2k$–$20$–$20$–$20$–$3$
- Mini-batch SGD (Adam, momentum) used for optimization

31 / 38

How good is Neural DUDE? - Finite n

Binary example with $n = 10^6$

- Source : BSMC($p$) with $p = 0.1$
- Noise : BSC($\delta$) with $\delta = 0.1$

[Plot: (Bit Error Rate)/$\delta$ versus window size $k$, comparing DUDE (best point $0.563\delta$), Neural DUDE with 1, 2, 3, and 4 layers, and the FB recursion ($0.558\delta$).]

DUDE is sensitive to $k$; Neural DUDE (4L) is robust to $k$!

32 / 38

Concentration of the estimated loss

How close is $\frac{1}{n}\sum_{i=1}^{n} \mathbf{L}(z_i, s_k(c_i, \cdot))$ to $\frac{1}{n}\sum_{i=1}^{n} \Lambda(x_i, s_k(c_i, z_i))$?

[Two plots of (Bit Error Rate)/$\delta$ versus window size $k$: true BER vs. estimated BER for DUDE (left) and for Neural DUDE (4L) (right), with the FB recursion as reference.]

The estimated loss of Neural DUDE concentrates on the true loss

- Can pick $k$ based on $\frac{1}{n}\sum_{i=1}^{n} \mathbf{L}(z_i, s_k(c_i, \cdot))$ for Neural DUDE!

33 / 38
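A sketch of how this model-selection rule might look in code, under my own assumed interfaces (`s_choice[i]` holds the index in $\mathcal{S}$ of the map the trained network picked at position $i$; `candidate_ks` and `choices` are hypothetical):

```python
import numpy as np

def estimated_loss(z, k, L, s_choice):
    """Average of L(z_i, s_k(c_i, .)) over interior positions; needs no clean data."""
    idx = np.arange(k, len(z) - k)
    return L[z[idx], s_choice[idx]].mean()

# Hypothetical selection loop: pick the k with the smallest estimated loss.
# best_k = min(candidate_ks, key=lambda k: estimated_loss(z, k, L, choices[k]))
```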

Binary image denoising

Setting

- Source: [example binary images] or · · ·

- Noise: BSC($\delta$) with $\delta = 0.1$

Comparison between DUDE and Neural DUDE?

- Raster scan of the data
- Use 1-D contexts

34 / 38

Binary image denoising - Einstein 256× 256

Einstein (clean)

35 / 38

Binary image denoising - Einstein 256× 256

Einstein (noisy, δ = 0.1)

35 / 38

Binary image denoising - Einstein 256× 256

DUDE (k = 5, BER=0.561δ)

35 / 38

Binary image denoising - Einstein 256× 256

Neural DUDE (k = 27, BER=0.417δ)

35 / 38

Binary image denoising - Einstein 256× 256

Neural DUDE significantly outperforms DUDE! (a relative 25.7% improvement)

[Plot: (Bit Error Rate)/$\delta$ versus window size $k$ for DUDE (best $0.561\delta$) and Neural DUDE (4L), which reaches $0.417\delta$.]

Note that for $k = 27$ there are $2^{54}$ different contexts!! ($\gg n = 2^{16}$)

35 / 38

Binary image denoising - Einstein 256× 256

Estimated loss of Neural DUDE still concentrates on the true loss!

[Two plots of (Bit Error Rate)/$\delta$ versus window size $k$: true BER vs. estimated BER for DUDE (left) and for Neural DUDE (4L) (right).]

35 / 38

Binary image denoising results

Relative error reduction of Neural DUDE over DUDE

[Bar chart: relative error reduction (%) of Neural DUDE over DUDE for the Barbara, Mandrill, Lena, Camera Man, Boat, Peppers, Einstein, Halftone, and Shannon Text images; axis range 0-30%.]

36 / 38

Future/On-going research directions

Analysis

- Analyzing the concentration of the estimated loss
- Connection to SURE (Stein's Unbiased Risk Estimator)

Algorithmic extension

- Extending to larger-alphabet data beyond binary (e.g., DNA)
- Extending to continuous-valued data (e.g., grayscale images)
- Applying Recurrent Neural Networks (RNNs)
- Trying other general function-approximation methods (e.g., random forests)

37 / 38

Thank you very much!

E-mail: [email protected]

38 / 38