
DUDE:

An Algorithm for Discrete Universal Denoising

Tsachy Weissman HP Labs and Stanford

Erik Ordentlich HP Labs

Gadiel Seroussi HP Labs

Sergio Verdú Princeton

Marcelo Weinberger HP Labs

Discrete Denoising

X_1, . . . , X_n → [channel] → Z_1, . . . , Z_n → [denoiser] → X̂_1, . . . , X̂_n

• X_i, Z_i, X̂_i take values in finite alphabets.

• Goal: Choose X̂_1, . . . , X̂_n on the basis of Z_1, . . . , Z_n to minimize a fidelity criterion (which we might not be able to measure!).


Applications

• Text Correction

• Image Denoising

• Reception of Uncoded Data

• DNA Sequence Analysis and Processing

• Hidden Markov Model State Estimation

• · · ·


Example

Source: Binary Markov Chain Channel: BSC

source:        . . . 0001111100001111100 . . .
error pattern: . . . 0001000001000001010 . . .
⇒ noisy:       . . . 0000111101001110110 . . .

• Objective: Minimize Bit Error Rate given the observation of n-block.

• Solution: Backward-Forward Dynamic Programming

• Fundamental Limit: lim_{n→∞} min error probability = f(δ, p), still open.
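The known-statistics benchmark above can be made concrete. Below is a minimal sketch (not from the slides) of bitwise MAP denoising via the forward-backward recursions, assuming a symmetric binary Markov source with transition probability p observed through a BSC(delta); the function name and numpy usage are illustrative assumptions.

import numpy as np

def forward_backward_denoise(z, p, delta):
    # Bitwise MAP denoising of a symmetric binary Markov chain
    # (transition probability p) observed through a BSC(delta).
    n = len(z)
    T = np.array([[1 - p, p], [p, 1 - p]])                    # source transitions
    lik = np.array([[1 - delta, delta], [delta, 1 - delta]])  # lik[x, z] = P(z | x)
    alpha = np.zeros((n, 2))                  # forward: alpha[i, x] ∝ P(x_i = x, z_1..z_i)
    alpha[0] = 0.5 * lik[:, z[0]]
    for i in range(1, n):
        alpha[i] = (alpha[i - 1] @ T) * lik[:, z[i]]
        alpha[i] /= alpha[i].sum()            # normalize for numerical stability
    beta = np.ones((n, 2))                    # backward: beta[i, x] ∝ P(z_{i+1}..z_n | x_i = x)
    for i in range(n - 2, -1, -1):
        beta[i] = T @ (lik[:, z[i + 1]] * beta[i + 1])
        beta[i] /= beta[i].sum()
    post = alpha * beta                       # ∝ P(x_i | z^n), coordinate-wise
    return (post[:, 1] > post[:, 0]).astype(int)   # per-bit MAP minimizes BER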


Universal Example

• Source: Binary; unknown statistics

• Channel: BSC

• Objective: Minimize Bit Error Rate given the observation of n-block.

• Solution: ???

• Fundamental Limit: limn→∞ Min Error Probability = ???


Universal Setting: Basic Assumptions

• Unknown Source Statistics

• Discrete Memoryless Channel (DMC) with known transition probability matrix

Prob(j received | i sent)


Previous Approaches to Universal Discrete Denoising

Occam filter: (Natarajan, 1993,1995)

• Lossy compression of the noisy signal, tuning the desired SNR to the expected noise level of the channel.

• Experiments with specific lossy data compression algorithms

• Shortcoming: No implementable universal optimal lossy compression is known.


Previous Approaches to Universal Discrete Denoising

Kolmogorov Sampler: (Donoho, 2002)

• For all typical noise realizations, list the corresponding source realizations that explain the data. Then, do lossless compression of the source realizations and select the shortest one.

• Shortcoming: Not implementable.

• It does not attain the theoretically optimum distribution-dependent performance. (It can be off by a factor of 2.)

• Example: Bernoulli(p) source corrupted by a BSC(δ). Trivial schemes “say what you see” (optimal for p ≥ δ) and “say all zeros” (optimal for p ≤ δ) each outperform the Kolmogorov sampler on more than half of the parameter space.

• Does this mean it is suboptimal in the universal setting? Is the distribution-dependent performance attainable at all in this setting?


Notation

Alphabet: Same finite alphabet for clean and noisy signals: A, |A| = M .

Channel: Nonsingular transition probability matrix:

Π = {Π(i, j)}_{i,j∈A} = [π_1 | · · · | π_M]

(both assumptions above can be relaxed)

n-block denoiser: X̂^n : A^n → A^n.

Loss Function: Λ : A² → [0, ∞), represented by the matrix

Λ = {Λ(i, j)}_{i,j∈A} = [λ_1 | · · · | λ_M]

where Λ(i, j) is the cost of guessing j when the clean signal is i.

Normalized cumulative loss of the denoiser X̂^n when the observed sequence is z^n ∈ A^n and the channel input block is x^n ∈ A^n:

L_{X̂^n}(x^n, z^n) = (1/n) Σ_{i=1}^{n} Λ(x_i, X̂^n_i(z^n))


Performance Benchmark

Optimum distribution-dependent performance for a denoiser that knows the input statistics:

lim_{n→∞} min_{X̂^n ∈ D_n} E L_{X̂^n}(X^n, Z^n)

where D_n is the class of all (including non-universal) n-block denoisers.


The DUDE Algorithm: General Idea

• Fix context length k. For each letter to be denoised, do:

• Find left k-context (ℓ_1, . . . , ℓ_k) and right k-context (r_1, . . . , r_k):

ℓ_1 ℓ_2 · · · ℓ_k • r_1 r_2 · · · r_k

• Count all occurrences of letters with left k-context (ℓ_1, . . . , ℓ_k) and right k-context (r_1, . . . , r_k). This gives a conditional empirical distribution of the noisy symbol given the contexts (ℓ_1, . . . , ℓ_k) and (r_1, . . . , r_k).

• Use the channel transition probabilities to estimate the conditional empirical distribution of the noiseless symbol given (ℓ_1, . . . , ℓ_k) and (r_1, . . . , r_k).

• Make the decision using:
  – the loss function,
  – the channel transition probabilities,
  – the conditional empirical distribution,
  – the observed letter to be denoised.


Noiseless Text

We might place the restriction on allowable sequences that no spaces follow each other. · · · effect of statistical knowledge about the source in reducing the required capacity of the channel · · · the relative frequency of the digram i j. The letter frequencies p(i), the transition probabilities · · · The resemblance to ordinary English text increases quite noticeably at each of the above steps. · · · This theorem, and the assumptions required for its proof, are in no way necessary for the present theory. · · · The real justification of these definitions, however, will reside in their implications. · · · H is then, for example, the H in Boltzmann’s famous H theorem. We shall call H = −Σ p_i log p_i the entropy of the set of probabilities p_1, . . . , p_n. · · · The theorem says that for large N this will be independent of q and equal to H. · · · The next two theorems show that H and H′ can be determined by limiting operations directly from the statistics of the message sequences, without reference to the states and transition probabilities between states. · · · The Fundamental Theorem for a Noiseless Channel · · · The converse part of the theorem, that C/H cannot be exceeded, may be proved by noting that the entropy · · · The first part of the theorem will be proved in two different ways. · · · Another method of performing this coding and thereby proving the theorem can be described as follows: · · · The content of Theorem 9 is that, although an exact match is · · · With a good code the logarithm of the reciprocal probability of a long message must be proportional to the duration of the corresponding signal · · ·


Noisy text

Wz right peace the rest iction on alksoable sequbole thgt wo spices fokiow eadh otxer. · · ·egfbct of sraaistfcal keowleuge apolt tje souwce in recucilg the requihed clpagity ofytheclabbel · · · the relatrte pweqiency ofpthe digram i j. The setter freqbwncles p(i), gherrahsibion probtbilities · · · The resemglahca to ordwnard Engdish tzxt ircreakes quitqnoliceabcy at vach oftthe hbove steps. · · · Thus theorev, andlthe aszumptjona requiyed ffrits croof, arv il no wsy necqssrry forptfe prwwent theorz. · · · jhe reap juptifocation ofdhese defikjtmons, doweyer, bill rehide inytheir imjlycajijes. · · · H is them, fol eskmqle, tleH in Bolgnmann’s falous H themreg. We vhall cbll H = −

∑pi log pi the wntgopz rf thb

set jf prwbabjlities p1, . . . , pn. · · · The theorem sahs tyat fsr lawge N mhis gill wehndeypensdest of q aed vqunl tj H. · · · The neht txo theiremf scow tyat H and H ′ can bedegereined jy likitkng operatiofs digectlt fgom the stgtissics of thk mfssagj siqufnves,bithout referenge ty the htates and trankituon krobabilitnes bejwekn ltates. · · · TheFundkmendal Theorem kor a Soiselesd Chjnnen · · · Lhe ronvegse jaht jf tketheorem, thltCH calnot be excweded, may ke xroved ey hotijg tyat the enyropy · · · The first pajt if thetheqrem will be ptoved in two kifferent wjys. · · · Another methjd of plrfolming shis godingald thmreby proking toe oheorem can bexdescrined as folfows: · · · The contemt ovThe rem 9 if thst, ajthorgh an ezacr mawwh is · · · Wotf a goul code therlogaretym of therehitrocpl prossbilfly of a lylg mwgsage lust be priporyiopal to tha rurafirn of · · ·


Noisy text: denoising an “m”

[The same noisy text is repeated, with the symbol to be denoised highlighted in the original slide.]


Context search k = 1 e • r

[The same noisy text is repeated, with all occurrences of the context e • r highlighted in the original slide.]


Context search k = 1 e • r counts

• e r : 8

• eor : 6

• eir : 2

• emr : 1

• eqr : 1


Context search k = 2 h e • r e

[The same noisy text is repeated, with all occurrences of the context he • re highlighted in the original slide.]


Context search k = 2 h e • r e counts

• he re : 7

• heore : 5

• heire : 1

• hemre : 1

• heqre : 1


Notation for Context Counts

• a ∈ An, whole data sequence

• b ∈ Ak, left k-context string

• c ∈ Ak, right k-context string

• Let m[a, b, c] denote the M-vector whose α-th component equals the number of occurrences of the pattern b α c in a:

m[a, b, c](α) = |{ k+1 ≤ i ≤ n−k : a_{i−k}^{i−1} = b, a_i = α, a_{i+1}^{i+k} = c }|.

• Example (the 27 vector positions correspond to a, . . . , z and space):

m[Shannon text, he, re] = [0 0 0 0 0 0 0 0 1 0 0 0 1 0 5 0 1 0 0 0 0 0 0 0 0 0 7]^T

with the nonzero entries at the positions of i, m, o, q, and space, matching the counts on the previous slide.
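As a concrete illustration of gathering these count vectors (a sketch, not from the slides; symbols are assumed mapped to integers 0, . . . , M−1, and the helper name is hypothetical):

from collections import defaultdict
import numpy as np

def context_counts(z, k, M):
    # Returns a dict mapping each (left context b, right context c) pair
    # to the M-vector m[z, b, c] of center-symbol counts.
    m = defaultdict(lambda: np.zeros(M, dtype=np.int64))
    for i in range(k, len(z) - k):            # 0-based analogue of k+1 <= i <= n-k
        b = tuple(z[i - k:i])                 # left k-context
        c = tuple(z[i + 1:i + k + 1])         # right k-context
        m[(b, c)][z[i]] += 1                  # count the center symbol
    return m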


The Discrete Universal Denoiser

Fix k.

Pass 1: For every k+1 ≤ i ≤ n−k, compute the count vectors m(z^n, z_{i−k}^{i−1}, z_{i+1}^{i+k}).

Pass 2: Correct according to:

X̂^{n,k}_i(z^n) = g^k_{z^n}(z_{i−k}^{i−1}, z_i, z_{i+1}^{i+k}),   k+1 ≤ i ≤ n−k,

where

g^k_a(b, α, c) = arg min_{x∈A} m^T[a, b, c] Π^{−1} (λ_x ⊙ π_α),

with

Π = {Π(i, j)}_{i,j∈A} = [π_1 | · · · | π_M],
Λ = {Λ(i, j)}_{i,j∈A} = [λ_1 | · · · | λ_M],
(v ⊙ w)_i = v_i w_i.
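The two passes translate directly into code. A minimal sketch (reusing the context_counts helper sketched earlier; Pi and Lam are the M×M channel and loss matrices as numpy arrays; the first and last k symbols are simply copied, a boundary convention assumed here):

import numpy as np

def dude(z, k, Pi, Lam):
    M = Pi.shape[0]
    Pi_inv = np.linalg.inv(Pi)
    m = context_counts(z, k, M)               # Pass 1: gather count vectors
    x_hat = list(z)                           # boundary symbols left unchanged
    for i in range(k, len(z) - k):            # Pass 2: correct
        b, c = tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1])
        mv, alpha = m[(b, c)], z[i]
        # g(b, alpha, c) = argmin_x  m^T Pi^{-1} (lambda_x ⊙ pi_alpha)
        scores = [mv @ Pi_inv @ (Lam[:, x] * Pi[:, alpha]) for x in range(M)]
        x_hat[i] = int(np.argmin(scores))
    return x_hat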


Motivation

• Given a distribution P_Y on an unknown random variable Y, the minimum expected distortion guess is

Ŷ = arg min_{β∈A} λ_β^T P_Y

• Here,

arg min_{x∈A} m^T[a, b, c] Π^{−1} (λ_x ⊙ π_α) = arg min_{x∈A} λ_x^T { π_α ⊙ Π^{−T} m[a, b, c] }

• Key: { π_{z_i} ⊙ Π^{−T} m[z^n, z_{i−k}^{i−1}, z_{i+1}^{i+k}] } = unnormalized empirical estimate of P_{X_i | Z^n = z^n}


Motivation

{ π_{z_i} ⊙ Π^{−T} m[z^n, z_{i−k}^{i−1}, z_{i+1}^{i+k}] } is an unnormalized empirical estimate of P_{X_i | Z^n = z^n}.

If n = 1, no context:

P_Z = Π^T P_X

P_{X|z}(a) = P(Z = z | X = a) P_X(a) / P_Z(z) = Π(a, z) [Π^{−T} P_Z](a) / P_Z(z),

P_{X|z} = (1 / P_Z(z)) π_z ⊙ [Π^{−T} P_Z].
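The n = 1 identity is easy to verify numerically. A small check with illustrative numbers not from the slides (a BSC with δ = 0.2 and an assumed input distribution P_X = [0.7, 0.3]):

import numpy as np

delta = 0.2
Pi = np.array([[1 - delta, delta], [delta, 1 - delta]])
PX = np.array([0.7, 0.3])
PZ = Pi.T @ PX                                   # P_Z = Pi^T P_X

z = 1
bayes = Pi[:, z] * PX / PZ[z]                    # P_{X|z}(a) = Pi(a, z) P_X(a) / P_Z(z)
inverted = Pi[:, z] * (np.linalg.inv(Pi.T) @ PZ) / PZ[z]   # pi_z ⊙ [Pi^{-T} P_Z] / P_Z(z)
assert np.allclose(bayes, inverted)              # both recover P_{X|z}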


Motivation

Because the channel is memoryless:

P_{X_t | Z(T) = z(T)} = (1 / P_{Z_t | z(T\t)}(z_t)) π_{z_t} ⊙ [Π^{−T} P_{Z_t | z(T\t)}].


Example: BSC + Error Probability

For each bit b, count how many bits that have the same left and right k-contexts are equal to b and how many are equal to its complement b̄. If the ratio of the first count to the second is below

2δ(1 − δ) / ((1 − δ)² + δ²)

then b is deemed to be an error introduced by the BSC.
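In code, the rule reduces to a single threshold comparison (a sketch; count_b and count_flip are assumed to be the counts of bits equal to b and to its complement within the same two-sided k-context):

def bsc_rule(b, count_b, count_flip, delta):
    # Deem b a channel error, and flip it, when the count ratio
    # count_b / count_flip falls below 2d(1-d) / ((1-d)^2 + d^2).
    threshold = 2 * delta * (1 - delta) / ((1 - delta) ** 2 + delta ** 2)
    return 1 - b if count_b < threshold * count_flip else b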


Example: M-ary erasure channel + Error Probability

With e denoting the erasure symbol:

g^k_{z^n}(u_{−k}^{−1}, e, u_1^k)
  = arg min_{x ∈ {1,...,M}}  (δ / (1 − δ)) Σ_{a ∈ {1,...,M}, a ≠ x} m[z^n, u_{−k}^{−1}, u_1^k](a)
  = arg min_{x ∈ {1,...,M}}  [ Σ_{a ∈ {1,...,M}} m[z^n, u_{−k}^{−1}, u_1^k](a) − m[z^n, u_{−k}^{−1}, u_1^k](x) ]
  = arg max_{x ∈ {1,...,M}}  m[z^n, u_{−k}^{−1}, u_1^k](x).

Correct every erasure with the most frequent symbol for its context.
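In code the erasure rule is a one-line argmax over the count vector (a sketch; mv is assumed to be the count vector m[z^n, b, c] over the M non-erasure symbols for the erased position's context, and E the erasure symbol):

import numpy as np

def erasure_rule(z_i, mv, E):
    # Replace an erasure with the most frequent symbol in its context;
    # non-erased symbols pass through unchanged (the channel never lies).
    return int(np.argmax(mv)) if z_i == E else z_i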


Choosing the Context Length k

• Tradeoff:

too short ↦ suboptimum performance;
too long (⇔ n too small) ↦ counts are unreliable

• k = k_n such that k_n M^{2k_n} = o(n / log n) guarantees asymptotic optimality (e.g., k_n = ⌈c log_M n⌉, c < 1/2).

• “Best k” for a given sequence: open problem

  – Output compressibility heuristic
  – Dynamic, asymmetric context lengths


Computational Complexity

Time: O(n) register-level operations

Space: o(n) working storage (linear if storage for buffering the sequence is counted)

• Preprocessing: O(M³) operations

• Computation of counts: O(n) operations (finite-state automaton with M^{2k} states)

• Pre-computations for the second pass: O(M^{2k}) operations

• Denoising: O(n) operations

With k < (1/2) log_M n, we have M^{2k} = o(n).


Optimality Result: Stochastic Setting

Define

X̂^n_univ = X̂^{n,k_n}, with k_n → ∞ and k_n M^{2k_n} = o(n / log n)

Theorem. (No asymptotic penalty for universality)

For every stationary ergodic input process,

lim_{n→∞} E L_{X̂^n_univ}(X^n, Z^n) = lim_{n→∞} min_{X̂^n ∈ D_n} E L_{X̂^n}(X^n, Z^n)

where D_n is the class of all (including non-universal) n-block denoisers.


Optimality Result: Semi-Stochastic Setting

X̂^n_univ = X̂^{n,k_n}, with k_n → ∞ and k_n M^{2k_n} = o(n / log n)

Minimum k-sliding-window loss of (x^n, z^n):

D_k(x^n, z^n) = min_{f : A^{2k+1} → A}  (1 / (n − 2k)) Σ_{i=k+1}^{n−k} Λ(x_i, f(z_{i−k}^{i+k}))

Theorem. For any input sequence, almost surely,

lim sup_{n→∞} [ L_{X̂^n_univ}(x^n, Z^n) − D_{k_n}(x^n, Z^n) ] ≤ 0
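Given both sequences, the genie benchmark D_k is directly computable: for each (2k+1)-window value, the best rule f outputs whichever guess minimizes the loss accumulated over that window's occurrences. A sketch under the same 0-based conventions as the earlier snippets (function name assumed):

from collections import defaultdict
import numpy as np

def sliding_window_loss(x, z, k, Lam):
    # D_k(x^n, z^n): normalized loss of the best k-sliding-window denoiser,
    # chosen with full knowledge of both the clean x and the noisy z.
    M = Lam.shape[0]
    acc = defaultdict(lambda: np.zeros(M))    # window -> loss accumulated per guess
    for i in range(k, len(z) - k):
        w = tuple(z[i - k:i + k + 1])         # the (2k+1)-window that f sees
        acc[w] += Lam[x[i], :]                # Lam[x_i, g]: loss of guessing g
    total = sum(v.min() for v in acc.values())  # optimal f picks argmin per window
    return total / (len(z) - 2 * k)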


Proof Sketch

• P_{X_i | Z_{i−k}^{i+k} = z_{i−k}^{i+k}} is not available because the source statistics are unknown.

• A genie that sees both x^n and z^n can give an empirical (unnormalized) estimate q[Z^n, x^n, u_{−k}^k] of P_{X_i | Z_{i−k}^{i+k} = u_{−k}^k}.

• Even without the genie, DUDE can get very close to that estimate: the L1 distance

‖ q[Z^n, x^n, u_{−k}^k] − π_{u_0} ⊙ [Π^{−T} m[Z^n, u_{−k}^{−1}, u_1^k]] ‖_1

(genie estimate on the left, DUDE estimate on the right) can be bounded tightly enough that:

• Pr( L_{X̂^{n,k}}(x_{k+1}^{n−k}, Z^n) − D_k(x^n, Z^n) > ε ) ≤ K_1 (k+1) M^{2k+1} exp( −(n−2k) ε² / (4 (k+1) M^{2k} F_Π C²_{Λ,Π}) )

• The RHS is summable when k_n M^{2k_n} = o(n / log n) ⇒ the almost-sure result follows by the Borel–Cantelli Lemma.


Nonsquare/Nonsingular Π

• If Π has full row rank, then replace Π^{−1} by a pseudo-inverse, e.g. the Moore-Penrose pseudo-inverse Π^T (Π Π^T)^{−1}. The same optimality results hold.

• If Π does not have full row rank, then, in general, optimum distribution-dependent performance cannot be obtained by a universal algorithm.


Experiment: Binary Markov Chain (p) → BSC(δ); n = 10^6

            δ = 0.01                 δ = 0.10                 δ = 0.20
  p      DUDE         ForwBack    DUDE         ForwBack    DUDE         ForwBack
  0.01   0.000723 [3] 0.000721    0.006648 [5] 0.005746    0.025301 [6] 0.016447
  0.05   0.004223 [3] 0.004203    0.030084 [5] 0.029725    0.074936 [5] 0.071511
  0.10   0.010213 [8] 0.010020    0.055976 [3] 0.055741    0.120420 [4] 0.118661
  0.15   0.010169 [8] 0.010050    0.075474 [5] 0.075234    0.153182 [4] 0.152903
  0.20   0.009994 [8] 0.009940    0.092304 [3] 0.092304    0.176354 [4] 0.176135

Table 1: Bit Error Rates; context length [k] chosen according to LZ heuristic


Image Denoising: δ=0.05, k=12 (2D)


Image Denoising: δ=0.02, k=14 (1D)


Image Denoising: Comparison with existing algorithms

Channel parameter δ

  Image                 Scheme     0.01      0.02      0.05      0.10
  Shannon (1800×2160)   DUDE       0.00096   0.0018    0.0041    0.0091
                                   K=11      K=12      K=12      K=12
                        median     0.00483   0.0057    0.0082    0.0141
                        morpho.    0.00270   0.0039    0.0081    0.0161
  Einstein (896×1160)   DUDE       0.0035    0.0075    0.0181    0.0391
                                   K=18      K=14†     K=12†     K=12†
                        median     0.156     0.158     0.164     0.180
                        morpho.    0.149     0.151     0.163     0.193

Table 2: Denoising results for binary images


Text Denoising: Don Quixote de La Mancha

Noisy Text (21 errors, 5% error rate):

“Whar giants?” said Sancho Panza. “Those thou seest theee,” snswered yis master, “with the long arms, and spne have tgem ndarly two leagues long.” “Look, ylur worship,” sair Sancho; “what we see there zre not gianrs but windmills, and what seem to be their arms are the sails that turned by the wind make rhe millstpne go.” “Kt is easy to see,” replied Don Quixote, “that thou art not used to this business of adventures; fhose are giantz; and if thou arf wfraod, away with thee out of this and betake thysepf to prayer while I engage them in fierce and unequal combat.”

DUDE output, k = 2 (7 errors):

“What giants?” said Sancho Panza. “Those thou seest there,” answered his master, “with the long arms, and spne have them nearly two leagues long.” “Look, your worship,” said Sancho; “what we see there are not giants but windmills, and what seem to be their arms are the sails that turned by the wind make the millstone go.” “It is easy to see,” replied Don Quixote, “that thou art not used to this business of adventures; fhose are giantz; and if thou arf wfraod, away with thee out of this and betake thyself to prayer while I engage them in fierce and unequal combat.”


Discrete Denoising

“If the source already has a certain redundancy and no attempt is made toeliminate it... a sizable fraction of the letters can be received incorrectlyand still reconstructed by the context.” (Claude Shannon, 1948)
