
A Tour of Modern Image Processing

Peyman Milanfar∗

Abstract

Recent developments in computational imaging and restoration have heralded the arrival and convergence of several powerful methods for adaptive processing of multidimensional data. Examples include Moving Least Squares (from Graphics), the Bilateral Filter and Anisotropic Diffusion (from Machine Vision), Boosting and Spectral Methods (from Machine Learning), Non-local Means (from Signal Processing), Bregman Iterations (from Applied Math), and Kernel Regression and Iterative Scaling (from Statistics). While these approaches found their inspirations in diverse fields of nascence, they are deeply connected. In this paper

• I present a practical and unified framework to understand some of the basic underpinnings of these methods, with the intention of leading the reader to a broad understanding of how they interrelate. I also illustrate connections between these techniques and Bayesian approaches.

• The proposed framework is used to arrive at new insights, methods, and both practical and theoretical results. In particular, several novel optimality properties of algorithms in wide use such as BM3D, and methods for their iterative improvement (or non-existence thereof), are discussed.

• Several theoretical results are discussed which will enable the performance analysis and subsequent improvement of any existing restoration algorithm.

While much of the material discussed is applicable to a wider class of linear degradation models (e.g., noise, blur, etc.), in order to keep matters focused, we consider the problem of denoising here.

∗Electrical Engineering Department, University of California, Santa Cruz, CA 95064, USA. E-mail: [email protected]; Phone: (650) 322-2311; Fax: (410) 322-2312. This work was supported in part by the US Air Force Grant FA9550-07-1-0365 and the National Science Foundation Grant CCF-1016018.


I. INTRODUCTION

Consider the measurement model

$y_i = z_i + e_i, \quad \text{for } i = 1, \cdots, n, \qquad (1)$

where $z_i = z(x_i)$ is the underlying latent signal of interest at a position $x_i = [x_{i,1}, x_{i,2}]^T$, $y_i$ is the noisy measured signal (pixel value), and $e_i$ is the corrupting zero-mean, i.i.d. white noise with variance $\sigma^2$. The problem of interest then is to recover the complete set of samples of $z(x)$, which we denote vectorially as $z = [z(x_1), z(x_2), \cdots, z(x_n)]^T$, from the corresponding data set $y$. To restate the problem more concisely, the complete measurement model in vector notation is given by$^1$

$y = z + e. \qquad (2)$

It has been realized for some time now that effective restoration of signals will require methods which either model the signal a priori (i.e., Bayesian methods) or learn the underlying characteristics of the signal from the given data (i.e., learning, or non-parametric methods). Most recently, the latter category of approaches has become exceedingly popular. Perhaps the most striking recent example is the popularity of patch-based methods [1], [2], [3], [4]. This new generation of algorithms exploits both local and non-local redundancies or "self-similarities" in the signals being treated. Earlier on, the bilateral filter [2] was developed with very much the same idea in mind, as were its spiritually close predecessors the SUSAN filter [5] and the filters of Yaroslavsky [6]. The common philosophy among these and related techniques is the notion of measuring and making use of affinities between a given data point (or more generally, patch or region) of interest and others in the given measured signal y. These similarities are then used in a filtering context to give higher weights to contributions from more similar data values, and to properly discount data points that are less similar.

Despite the large amount of recent literature on techniques based on these ideas, simply put, the key differences between the methods have been relatively minor, yet somewhat poorly understood. In particular, the underlying framework for each of these methods is distinct only to the extent that the weights assigned to different data points are decided upon differently. To be more concrete and mathematically precise, let us consider the denoising problem (2) again. The estimate of the signal $z(x)$ at the position $x$ is found using a (non-parametric) point estimation framework; namely, the weighted least squares problem

$\hat{z}(x_j) = \arg\min_{z(x_j)} \sum_{i=1}^{n} \left[ y_i - z(x_j) \right]^2 K(x_i, x_j, y_i, y_j), \qquad (3)$

where the weight (or kernel) function $K(\cdot)$ is a positive, unimodal function which measures the "similarity" between the samples $y_i$ and $y_j$, at respective positions $x_i$ and $x_j$. If the kernel function is restricted to be only a function of the spatial locations $x_i$ and $x_j$, then the resulting formulation is what is known as kernel regression in the classical non-parametric statistics literature [7], [8].

$^1$Surely a similar analysis to what follows can and should be carried out for more general inverse problems such as deconvolution, interpolation, etc., where $z$ would generally be replaced by $Az$, but these will require more detailed analysis and will be left for future work.

Interestingly, in the early 1980’s the essentially identical concept of moving least-squares emerged

independently [9], [10]. This idea has since been widely adopted in the computer graphics community

[11] as a very effective tool for smoothing and interpolating data in 3-D. Surprisingly, despite the obvious

connections between moving least-squares and the adaptive filters based on similarity, their kinship has

remained largely hidden so far.

II. EXISTING ALGORITHMS

Over the years, the measure of similarity $K(x_i, x_j, y_i, y_j)$ has been defined in a number of different ways, leading to a cacophony of filters, including some of the most well-known recent approaches to image denoising [2], [3], [1]. Figure 1 gives a graphical illustration of how different choices of similarity kernels lead to different classes of filters, some of which we discuss next.

A. Classical Regression Filters [7], [8]:

Naturally, the most naive way to measure the "distance" between two pixels is to simply consider their spatial Euclidean distance; namely, using a Gaussian kernel,

$K(x_i, x_j, y_i, y_j) = \exp\left( \frac{-\| x_i - x_j \|^2}{h_x^2} \right).$

Such filters essentially lead to (possibly space-varying) Gaussian filters which are quite familiar from

traditional image processing [12]. While it is possible to adapt the variance parameter hx to the local

image statistics to obtain a relatively modest improvement in performance, the lack of stronger adaptivity

to the underlying structure of the signal of interest is a major drawback of these classical approaches.
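To make this concrete, below is a minimal NumPy sketch of such a classical, spatial-only kernel regression denoiser in 1-D. The function names, the grid, the bandwidth value, and the toy sinusoidal signal are all illustrative assumptions, not quantities from the text.

```python
import numpy as np

def spatial_kernel(X, xj, hx):
    """Classical kernel K(x_i, x_j) = exp(-||x_i - x_j||^2 / hx^2): spatial distance only."""
    d2 = np.sum((X - xj) ** 2, axis=1)
    return np.exp(-d2 / hx ** 2)

# Toy 1-D example: latent signal, noisy samples, and the normalized weighted average.
x = np.arange(100, dtype=float)[:, None]   # sample positions x_i
z = np.sin(x[:, 0] / 10.0)                 # latent signal z(x_i)
y = z + 0.3 * np.random.randn(100)         # noisy measurements y_i

z_hat = np.empty_like(y)
for j in range(len(y)):
    K = spatial_kernel(x, x[j], hx=3.0)    # weights depend only on |x_i - x_j|
    z_hat[j] = K @ y / K.sum()             # weighted least-squares solution for z(x_j)
```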


B. The Bilateral Filter (BL) [2]:

Another manifestation of the formulation in (3) is the bilateral filter, where the spatial and photometric distances between two pixels are taken into account in separable fashion as follows:

$K(x_i, x_j, y_i, y_j) = \exp\left( \frac{-\| x_i - x_j \|^2}{h_x^2} \right) \exp\left( \frac{-(y_i - y_j)^2}{h_y^2} \right) = \exp\left\{ \frac{-\| x_i - x_j \|^2}{h_x^2} + \frac{-(y_i - y_j)^2}{h_y^2} \right\} \qquad (4)$

As can be observed in the exponent on the right-hand side, and in Figure 1, the similarity metric here is a weighted Euclidean distance between the vectors $(x_i, y_i)$ and $(x_j, y_j)$. This approach has several advantages: the kernel is easy to construct and computationally simple to calculate, yet it yields useful local adaptivity to the given data. In addition, it has only two control parameters $(h_x, h_y)$, which makes it very convenient to use. Unfortunately, as is well known, this filter does not provide effective performance in low signal-to-noise scenarios [13].
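As an illustration, here is a small sketch that evaluates the separable bilateral weights of (4) for a single pixel of a float-valued 2-D image and returns the corresponding normalized weighted average. The window radius and the parameters h_x, h_y are arbitrary choices for the example, not values from the text.

```python
import numpy as np

def bilateral_denoise_pixel(y, j1, j2, hx, hy, radius=5):
    """Estimate pixel (j1, j2) using the bilateral kernel of eq. (4) over a local window."""
    r0, r1 = max(j1 - radius, 0), min(j1 + radius + 1, y.shape[0])
    c0, c1 = max(j2 - radius, 0), min(j2 + radius + 1, y.shape[1])
    i1, i2 = np.mgrid[r0:r1, c0:c1]                        # positions x_i in the window
    spatial = np.exp(-((i1 - j1) ** 2 + (i2 - j2) ** 2) / hx ** 2)
    photometric = np.exp(-(y[i1, i2] - y[j1, j2]) ** 2 / hy ** 2)
    K = spatial * photometric                              # K(x_i, x_j, y_i, y_j)
    return np.sum(K * y[i1, i2]) / np.sum(K)               # normalized weighted average
```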

C. Non-Local Means (NLM) [1], [14], [15]:

The non-local means algorithm has stirred a great deal of interest in the community in recent years. At its core, however, it is a generalization of the bilateral filter in the sense that the photometric term in the bilateral similarity kernel, which is measured point-wise, is simply replaced with one that is patch-wise instead. A second difference is that (at least in theory) the geometric distance between the patches (corresponding to the first term in the bilateral similarity kernel) is essentially ignored, leading to strong contributions from patches that may not be physically near the pixel of interest (hence the name non-local). To summarize, the NLM kernel is

$K(x_i, x_j, y_i, y_j) = \exp\left( \frac{-\| x_i - x_j \|^2}{h_x^2} \right) \exp\left( \frac{-\| \mathbf{y}_i - \mathbf{y}_j \|^2}{h_y^2} \right) \quad \text{with } h_x \longrightarrow \infty, \qquad (5)$

where $\mathbf{y}_i$ and $\mathbf{y}_j$ now refer to patches of pixels centered at pixels $y_i$ and $y_j$, respectively. In practice, two implementation details should be observed. First, the patch-wise photometric distance $\| \mathbf{y}_i - \mathbf{y}_j \|^2$ in the above is in fact measured as $(\mathbf{y}_i - \mathbf{y}_j)^T G\, (\mathbf{y}_i - \mathbf{y}_j)$, where $G$ is a fixed diagonal matrix containing Gaussian weights which give higher importance to the center of the respective patches. Second, it is rather impractical to consider all the patches $\mathbf{y}_i$ within an image as possibly close to $\mathbf{y}_j$, so typically the search is limited to a reasonable spatial neighborhood of $\mathbf{y}_j$, which means that in effect the NLM filter too works more or less locally; said another way, $h_x$ is never infinite in practice.

Despite its popularity, the performance of the NLM filter leaves much to be desired. The true potential of this filtering scheme was demonstrated only later with the Optimal Spatial Adaptation (OSA) approach of Boulanger and Kervrann [14]. In their approach, the photometric distance was refined to include estimates of the local noise variances within each patch. Namely, they computed a local diagonal covariance matrix, and defined the locally adaptive photometric distance as $(\mathbf{y}_i - \mathbf{y}_j)^T V_j^{-1} (\mathbf{y}_i - \mathbf{y}_j)$, in such a way as to minimize an estimate of the local mean-squared error. Furthermore, they considered iterative applications of the filter, as discussed in Section V.
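A hedged sketch of the NLM weights of (5) for a 1-D signal follows; for brevity the patch-weighting matrix G is replaced by the identity and the search is explicitly restricted to a finite neighborhood, as discussed above. The function name, patch half-width, and search radius are illustrative assumptions.

```python
import numpy as np

def nlm_denoise_pixel(y, j, hy, half=3, search=10):
    """Estimate sample j of a 1-D signal y with patch-wise (NLM) weights in the spirit
    of eq. (5). Assumes half <= j < len(y) - half; G is taken to be the identity."""
    yj = y[j - half:j + half + 1]                          # reference patch around j
    lo = max(j - search, half)
    hi = min(j + search, len(y) - half - 1)
    num, den = 0.0, 0.0
    for i in range(lo, hi + 1):
        yi = y[i - half:i + half + 1]                      # candidate patch around i
        w = np.exp(-np.sum((yi - yj) ** 2) / hy ** 2)      # photometric (patch) similarity
        num += w * y[i]
        den += w
    return num / den
```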

[Fig. 1. Similarity Metrics and the Resulting Filters: a diagram relating the spatial distance (classic kernel regression), the Euclidean distance (bilateral filter), the photometric distance (non-local means), and the geodesic distance (LARK).]

D. Locally Adaptive Regression (Steering) Kernels (LARK) [3]:

The key idea behind this measure of similarity, originally proposed in [3], is to robustly obtain the local structure of images by analyzing the photometric (pixel value) differences based on estimated gradients, and to use this structure information to adapt the shape and size of a canonical kernel. The LARK kernel is defined as follows:

$K(x_i, x_j, y_i, y_j) = \exp\left\{ -(x_i - x_j)^T C_i (x_i - x_j) \right\}, \qquad (6)$

where the matrix $C_i = C(y_i, y_j)$ is estimated from the given data as

$C_i = \sum_j \begin{bmatrix} z_{x_{i,1}}^2(x_j) & z_{x_{i,1}}(x_j)\, z_{x_{i,2}}(x_j) \\ z_{x_{i,1}}(x_j)\, z_{x_{i,2}}(x_j) & z_{x_{i,2}}^2(x_j) \end{bmatrix}.$


Specifically, $z_{x_{i,*}}(x_j)$ are the estimated gradients of the underlying signal at point $x_i$, computed from the given measurements $y_j$ in a patch around the point of interest. The reader may recognize the above as the well-studied "structure tensor" [16]. The advantage of the LARK descriptor is that it is exceedingly robust to noise and perturbations of the data. The formulation is also theoretically well-motivated, since the quadratic exponent in (6) essentially encodes the local geodesic distance between the points $(x_i, y_i)$ and $(x_j, y_j)$ on the graph of the function $z(x, y)$, thought of as a manifold embedded in 3-D. The geodesic distance was also used in the context of the Beltrami-flow kernel in [17] in an analogous fashion.
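The following sketch builds a plain (unregularized) structure tensor from finite-difference gradients of a patch and evaluates the LARK-style similarity of (6). The steering refinements and robust gradient estimation of [3] are deliberately omitted; function and variable names are illustrative.

```python
import numpy as np

def lark_kernel(y_patch, xi, xj):
    """LARK-style similarity exp(-(x_i - x_j)^T C_i (x_i - x_j)) with C_i taken as a
    simple structure tensor summed over the patch (a stand-in for the estimate in [3])."""
    g1, g2 = np.gradient(y_patch.astype(float))            # gradient estimates over the patch
    Ci = np.array([[np.sum(g1 * g1), np.sum(g1 * g2)],
                   [np.sum(g1 * g2), np.sum(g2 * g2)]])    # summed outer products of gradients
    d = np.asarray(xi, float) - np.asarray(xj, float)
    return np.exp(-d @ Ci @ d)
```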

III. GENERALIZATIONS AND PROPERTIES

The above discussion can be naturally generalized by defining the augmented data variable $t = [x^T, y^T]^T$ and a general Gaussian kernel as follows:

$K(t_i, t_j) = \exp\left\{ -(t_i - t_j)^T Q\, (t_i - t_j) \right\}, \qquad (7)$

$Q = \begin{bmatrix} Q_x & 0 \\ 0 & Q_y \end{bmatrix}, \qquad (8)$

where $Q$ is symmetric positive definite (SPD).

With $Q_x = \frac{1}{h_x^2} I$ and $Q_y = 0$ we have classical kernel regression, whereas one obtains the bilateral filter framework when $Q_x = \frac{1}{h_x^2} I$ and $Q_y = \frac{1}{h_y^2} \mathrm{diag}[0, 0, \cdots, 1, \cdots, 0, 0]$, with the latter diagonal matrix picking out the center pixel in the patch defined by $t_i - t_j$. When $Q_x = 0$ and $Q_y = \frac{1}{h_y^2} G$, we have the non-local means filter and its variants. Finally, the LARK kernel in (6) is obtained when $Q_x = C_i$ and $Q_y = 0$. More generally, the matrix $Q$ can be selected so that it has non-zero off-diagonal block elements. However, no practical algorithms with this choice have been proposed so far. As detailed below, with a symmetric positive definite $Q$, this general approach results in valid symmetric positive definite kernels, a property that is used throughout the rest of our discussion$^2$.

Next, we present a more general framework for selecting the similarity functions, which will help us identify the wider class of admissible kernels. This would also allow us to define similarity kernels more formally, and to understand how to produce new kernels from ones already defined [18]. Formally, a scalar-valued function $K(t, s)$ over a compact region of its domain $\mathbb{R}^n$ is called an admissible kernel if

• $K$ is symmetric: $K(t, s) = K(s, t)$;

• $K$ is positive definite; that is, for any collection of points $t_i$, $i = 1, \cdots, n$, the Gram matrix $K_{i,j} = K(t_i, t_j)$ is positive definite.

Such kernels satisfy some useful properties such as positivity, $K(t, t) \geq 0$, and the Cauchy-Schwartz inequality $K(t, s) \leq \sqrt{K(t, t)\, K(s, s)}$.

$^2$The definition of $t$ given here is only one of many possible choices. Our treatment in this paper is equally valid when, for instance, $t = T(x, y)$ is any feature derived from a convenient linear or nonlinear transformation of the original data [12].

Another way to characterize SPD kernels is using Mercer's theorem [18]. Namely, suppose that $K(t, s)$ is an SPD kernel, and consider the integral operator

$\mathcal{K} f = \int K(t, s)\, f(s)\, ds. \qquad (9)$

This operator should be understood as simply a filter applied to the signal $f$ (without normalization), or equivalently, in the discrete case, as the numerator of the first term on the right-hand side of (12) below. If for any square-integrable function $f(t)$ we have that

$\int K(t, s)\, f(t)\, f(s)\, dt\, ds \geq 0,$

then Mercer's theorem [18] states that there is a spectral representation (i.e., eigen-decomposition)

$K(t, s) = \sum_{i=0}^{\infty} \mu_i\, \psi_i(t)\, \psi_i(s), \qquad (10)$

where $\{\mu_i\}$ are the (non-negative) eigenvalues corresponding to an orthonormal basis $\{\psi_i\}$ of $L_2$, which consists of the eigenfunctions of the operator $\mathcal{K} f$. In essence, $K(t, s)$ is a continuous functional equivalent of a symmetric positive-definite matrix.

With the above definitions in place, there are numerous ways to construct new valid kernels from existing ones. We list some of the most useful ones below, without proof [18]. Given two valid kernels $K_1(t, s)$ and $K_2(t, s)$, the following constructions yield admissible kernels:

1) $K(t, s) = \alpha K_1(t, s) + \beta K_2(t, s)$ for any pair $\alpha, \beta \geq 0$

2) $K(t, s) = K_1(t, s)\, K_2(t, s)$

3) $K(t, s) = k(t)\, k(s)$, where $k(\cdot)$ is a scalar-valued function

4) $K(t, s) = p\left(K_1(t, s)\right)$, where $p(\cdot)$ is a polynomial with positive coefficients

5) $K(t, s) = \exp\left(K_1(t, s)\right)$

Furthermore, $t$ may be considered to be any feature vector $t = T(x, y)$ constructed from the data (e.g., $t = [x^T, y^T]^T$ being a rather trivial example).
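These closure rules are easy to verify numerically. The short check below, which is not part of the paper and uses arbitrary sample sizes and bandwidths, builds Gram matrices from a Gaussian kernel (itself admissible) and confirms that three of the constructed kernels remain positive semi-definite up to round-off.

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((50, 3))                 # 50 feature vectors t_i

def gaussian_gram(T, h):
    """Gram matrix of a Gaussian kernel, a known admissible (SPD) kernel."""
    d2 = np.sum((T[:, None, :] - T[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / h ** 2)

K1, K2 = gaussian_gram(T, 1.0), gaussian_gram(T, 2.0)
for K in (0.5 * K1 + 2.0 * K2,                   # rule 1: conic combination
          K1 * K2,                               # rule 2: elementwise (Schur) product
          np.exp(K1)):                           # rule 5: elementwise exponential
    assert np.linalg.eigvalsh(K).min() > -1e-8   # Gram matrix stays (numerically) PSD
```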

Whatever the choice of the kernel function, the weighted least-squares optimization problem (3) has a simple solution. In matrix notation we can write

$\hat{z}(x_j) = \arg\min_{z(x_j)} \left[ y - z(x_j) \mathbf{1}_n \right]^T K_j \left[ y - z(x_j) \mathbf{1}_n \right], \qquad (11)$

where $\mathbf{1}_n = [1, \cdots, 1]^T$ and $K_j = \mathrm{diag}\left[ K(x_1, x_j, y_1, y_j),\; K(x_2, x_j, y_2, y_j),\; \cdots,\; K(x_n, x_j, y_n, y_j) \right]$. The closed-form solution to the above is

$\hat{z}(x_j) = \left( \mathbf{1}_n^T K_j \mathbf{1}_n \right)^{-1} \mathbf{1}_n^T K_j\, y \qquad (12)$
$\quad\;\; = \left( \sum_i K(x_i, x_j, y_i, y_j) \right)^{-1} \left( \sum_i K(x_i, x_j, y_i, y_j)\, y_i \right)$
$\quad\;\; = \sum_i \frac{K(x_i, x_j, y_i, y_j)}{\sum_i K(x_i, x_j, y_i, y_j)}\, y_i$
$\quad\;\; = \sum_i W(x_i, x_j, y_i, y_j)\, y_i$
$\quad\;\; = w_j^T y. \qquad (13)$

So in general, the estimate $\hat{z}(x_j)$ of the signal at position $x_j$ is given by a weighted sum of all the given data points $y(x_i)$, each contributing a weight commensurate with its similarity, as indicated by $K(\cdot)$, with the measurement $y(x_j)$ at the position of interest. Furthermore, as should be apparent in (12), the weights sum to one. To control computational complexity, or to design local versus non-local filters, we may choose to set the weight for "sufficiently far-away" pixels to zero or a small value, leading to a weighted sum involving a relatively small neighborhood of data points in a properly defined vicinity of the sample of interest. This is essentially the only distinction between locally adaptive processing methods (such as BL and LARK) and so-called non-local methods such as NLM. It is worth noting that in the formulation above, despite the simple form of (12), in general we have a nonlinear estimator, since the weights $W(x_i, x_j, y_i, y_j)$ depend on the noisy data$^3$.
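In matrix form, the normalization in (12)-(13) amounts to a simple row scaling of the Gram matrix; a minimal sketch (anticipating the notation W = D^{-1} K used in Section IV) is shown below. Function and variable names are illustrative.

```python
import numpy as np

def kernel_filter(K, y):
    """Given a kernel (Gram) matrix K_ij = K(x_i, x_j, y_i, y_j), form the row-stochastic
    weight matrix W = D^{-1} K and return the estimate z_hat = W y of eq. (13)."""
    D = K.sum(axis=1)                  # normalizing factors (row sums)
    W = K / D[:, None]                 # each row of W sums to one
    return W @ y, W
```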

To summarize the discussion so far, we have presented a general framework which absorbs many

existing algorithms as special cases. This was done in several ways, including a general description of

the set of admissible similarity kernels, which allows the construction of a wide variety of new kernels not

considered before in the image processing literature. Next, we turn our attention to the matrix formulation

of the non-parametric filtering approach. As we shall see, this provides a framework for more in-depth

and intuitive understanding of the resulting filters, their subsequent improvement, and for their respective

asymptotic and numerical properties.

$^3$It is worth noting that the non-parametric approach in (3) can be further extended to include a more general model of the signal $z(x)$ in a desired basis. We briefly discuss this case in the appendix, but leave its full treatment for future research.

IV. THE MATRIX FORMULATION AND ITS PROPERTIES

In this section, we analyze the filtering problems posed earlier in the language of linear algebra and

make several theoretical and practical observations. In particular, we are able to not only study the

numerical/algebraic properties of the resulting filters, but also to analyze some of their fundamental

statistical properties.

To begin, recall the convenient vector form of the filters:

$\hat{z}(x_j) = w_j^T y, \qquad (14)$

where $w_j^T = \left[ W(x_1, x_j, y_1, y_j),\; W(x_2, x_j, y_2, y_j),\; \cdots,\; W(x_n, x_j, y_n, y_j) \right]$ is a vector of weights for each $j$. Writing the above at once for all $j$, we have

$\hat{z} = \begin{bmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_n^T \end{bmatrix} y = W y. \qquad (15)$

As such, the filters defined by the above process can be analyzed as the product of a matrix of weights

W with the vector of the given data y. First, a notational matter: W is in general a function of the data,

so strictly speaking, the notation W(y) would be more descriptive; but we will suppress this dependence

in what follows, in the interest of a simpler presentation.

Next, we highlight some important properties of the matrix $W$, which lend much insight to our analysis. Referring to (12), $W$ can be written as the product

$W = D^{-1} K, \qquad (16)$

where $K_{ij} = K(x_i, x_j, y_i, y_j)$ and $D$ is a positive definite diagonal matrix with the normalizing factors $D_{jj} = \sum_i K(x_i, x_j, y_i, y_j)$ along its diagonal. We can write

$W = D^{-1} K = D^{-1/2} \underbrace{D^{-1/2} K D^{-1/2}}_{L} D^{1/2}. \qquad (17)$

It is evident that since $K$ and $D$ are symmetric positive definite, then so is $L$. Meanwhile, $W$ and $L$ are related through a diagonal similarity transformation, and therefore have the same eigenvalues. Hence, interestingly, $W$ has real, non-negative eigenvalues (even though it is not symmetric!)$^4$ An alternative, though not necessarily more intuitive, description of the action of $W$ is possible in terms of graphical models. Namely, if we consider the data $(x_i, y_i)$ to be nodes on a weighted graph with weights between nodes $i$ and $j$ given by the kernel values $K(x_i, x_j, y_i, y_j)$, the matrix $\mathcal{L} = L - I$ is precisely the "graph Laplacian" [19], [20]$^{5,6}$.

$^4$That is, $W$ is positive definite but not symmetric.

Next, we note that $W$ is a positive row-stochastic matrix; namely, by definition, the sum of each of its rows is equal to one, as should be clear from (12). This implies [21], [22] that $W$ has spectral radius $0 \leq \lambda(W) \leq 1$. Indeed, the largest eigenvalue of $W$ is simple (unrepeated) and is exactly $\lambda_1 = 1$, with corresponding eigenvector $v_1 = (1/\sqrt{n})[1, 1, \cdots, 1]^T = (1/\sqrt{n})\mathbf{1}_n$. Intuitively, this means that filtering by $W$ will leave a constant signal (i.e., a "flat" image) unchanged. In fact, with its spectrum inside the unit disk, $W$ is an ergodic matrix [23], [24], which is to say that its powers converge to a matrix of rank one, with identical rows. More explicitly, using the eigenvalue decomposition of $W$ we can write

$W^k = V S^k V^{-1} = V S^k U^T = \sum_{i=1}^{n} \lambda_i^k\, v_i u_i^T,$

where we have defined $U^T = V^{-1}$, while denoting by $u_i^T$ the left eigenvectors of $W$, which are the columns of $U$. Therefore,

$\lim_{k \to \infty} W^k = \mathbf{1}_n u_1^T.$

So u1 summarizes the asymptotic effect of applying the filter W many times.

As we made clear earlier, $W = D^{-1}K$ is generically not a symmetric matrix, though it always has real, positive eigenvalues. As such, it should not come as a surprise that $W$ is quite close to a symmetric positive definite matrix. As we illustrate in the Appendix, this is indeed the case, and we shall find it useful to approximate $W$ with such a symmetric positive definite matrix. To effect this approximation requires a bit of care, as the resulting matrix must still be row-stochastic. Fortunately, this is not difficult to do. Sinkhorn's algorithm [25], [26] for matrix scaling provides a provably optimal [27] way to accomplish this task$^7$. The obtained matrix is a close approximation of a given $W$ which is both symmetric positive definite, and now doubly (i.e., row- and column-) stochastic. Working with a symmetric (or rather "symmetrized") $W$ makes possible much of the analysis that follows in the remainder of this paper$^8$.

$^5$The name is not surprising. Consider the canonical case in which $W$ is a low-pass filter (though possibly a nonlinear and space-varying one). Since $W$ and $L$ share the same spectrum, $L$ is also a low-pass filter. Therefore, $\mathcal{L} = L - I$ is generically a high-pass filter, justifying the Laplacian label from the point of view of filtering.

$^6$Another interpretation of the matrix $W$ is in terms of Markov chains. Namely, $W$ is a probability transition matrix for a Markov chain defined on the entries of the data vector $y$ [19], [20].

$^7$We refer the reader to the appendix and citations therein, where we illustrate and justify both the approximation procedure and its fidelity in detail.
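A minimal sketch of such a scaling step is given below; it alternates row and column normalizations in the spirit of Sinkhorn's algorithm and then averages to enforce exact symmetry at a finite iteration count. This is only a toy version under stated assumptions (symmetric, strictly positive K; arbitrary iteration count); the actual procedure and its optimality properties are those of the appendix and [25]-[27].

```python
import numpy as np

def sinkhorn_symmetrize(K, iters=100):
    """Scale a symmetric positive kernel matrix toward a doubly-stochastic, symmetric
    approximation of W by alternating row and column normalizations."""
    W = np.asarray(K, dtype=float).copy()
    for _ in range(iters):
        W /= W.sum(axis=1, keepdims=True)     # make rows sum to one
        W /= W.sum(axis=0, keepdims=True)     # make columns sum to one
    return (W + W.T) / 2                      # symmetrize the (near-converged) result
```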

From this point forward, therefore, we shall consider $W$ to be a symmetric positive definite and (doubly-) stochastic matrix. The spectrum of $W$ determines the effect of the filter on the noisy signal $y$. We write the eigenvalue decomposition

$W = V S V^T, \qquad (18)$

where $S = \mathrm{diag}[\lambda_1, \cdots, \lambda_n]$ contains the eigenvalues in decreasing order $0 \leq \lambda_n \leq \cdots \leq \lambda_1 = 1$.

As an example, we illustrate the spectrum of the LARK [3] kernel in Figure 2 for different types of

patches. As can be observed, the spectrum of W decays rather quickly for ”flat” patches, indicating

that the filter is very aggressive in denoising these types of patches. As we consider increasingly more

complex anisotropic patches containing edges, corners, and various textures, the spectrum of the filter is

(automatically) adjusted to be less aggressive in particular directions. To take the specific case of a patch

containing a simple edge, the filter is adjusted in unsupervised fashion to perform strong denoising along

the edge, and to preserve and enhance the structure of the patch across the edge. This type of behavior

is typical of the adaptive non-parametric filters we have discussed so far, but it is particularly striking

and stable in the case of LARK kernels shown in Figure 2.

Of course, the matrix W is a function of the noisy data y and hence it is strictly speaking a stochastic

variable. But another interesting property of W is that when the similarity kernel K(·) is of the general

Gaussian type as in (7), as described in the Appendix, the resulting filter coefficients W are quite stable

to perturbations of the data by noise of modest size [28]. From a practical point of view, and insofar as

the computation of the matrix W is concerned, it is always reasonable to assume that the noise variance

is relatively small, because in practice we typically compute W on a “pre-processed” version of the noisy

image y anyway9. Going forward, therefore, we consider W as fixed. Some experiments in Section V

confirm the validity of this assumption.

A. Statistical Analysis of the Filters’ Performance

[Fig. 2. Spectrum of the LARK filter on different types of patches.]

$^8$Interestingly, there is an inherent and practical advantage in using symmetrization. As we will see in the experimental results, given any $W$, employing its symmetrized version generally improves performance in the mean-squared error sense. We do not present a proof of this observation here, and leave its analysis for a forthcoming submission.

$^9$This "small noise" assumption is made only for the analysis of the filter coefficients, and is not invoked in the rest of the paper.

We are now in a position to carry out an analysis of the performance of the filters defined earlier. In particular, let us compute an approximation to the bias and variance of the estimator $\hat{z} = W y$ first. The bias in the estimate is

$\mathrm{bias} = E(\hat{z}) - z = E(W y) - z \approx W z - z = (W - I)\, z.$

Recalling that the matrix $W$ has a unit eigenvalue corresponding to a (constant) vector, we note that $\hat{z}$ is an unbiased estimate if $z$ is a constant image, but is biased for all other underlying images. The squared magnitude of the bias is given by

$\| \mathrm{bias} \|^2 = \| (W - I) z \|^2. \qquad (19)$

Writing the image $z$ in the column space of the orthogonal matrix $V$ as $z = V b_0$, we can rewrite the squared bias magnitude as

$\| \mathrm{bias} \|^2 = \| (W - I) z \|^2 = \| V (S - I) b_0 \|^2 = \| (S - I) b_0 \|^2 = \sum_{i=1}^{n} (\lambda_i - 1)^2 b_{0i}^2. \qquad (20)$

We note that the last sum can be expressed over the indices $i = 2, \cdots, n$, because the first term in fact vanishes, since $\lambda_1 = 1$. Next, we consider the variance of the estimate. We have

$\mathrm{cov}(\hat{z}) = \mathrm{cov}(W y) \approx \mathrm{cov}(W e) = \sigma^2 W W^T \;\Longrightarrow\; \mathrm{var}(\hat{z}) = \mathrm{tr}\left(\mathrm{cov}(\hat{z})\right) = \sigma^2 \sum_{i=1}^{n} \lambda_i^2.$

Overall, the mean-squared-error (MSE) is given by

$\mathrm{MSE} = \| \mathrm{bias} \|^2 + \mathrm{var}(\hat{z}) = \sum_{i=1}^{n} (\lambda_i - 1)^2 b_{0i}^2 + \sigma^2 \lambda_i^2. \qquad (21)$
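Given a symmetrized W and a known test signal, the decomposition (21) can be evaluated directly from the spectrum. A small sketch follows; the clean signal z is assumed available, as in the oracle analysis above, and the function name is illustrative.

```python
import numpy as np

def predicted_mse(W_sym, z, sigma2):
    """Evaluate the decomposition of eq. (21) for a symmetric, doubly-stochastic filter W
    and a known latent signal z: returns (MSE, squared bias, variance)."""
    lam, V = np.linalg.eigh(W_sym)        # W = V S V^T, eq. (18)
    b0 = V.T @ z                          # signal coordinates in the eigenbasis, z = V b0
    bias2 = np.sum((lam - 1.0) ** 2 * b0 ** 2)
    var = sigma2 * np.sum(lam ** 2)
    return bias2 + var, bias2, var
```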

a) BM3D [29] Explained: Let us pause for a moment and imagine that we can optimally "design" the matrix $W$ from scratch in such a way as to minimize the mean-squared error above. That is, consider the set of eigenvalues for which (21) is minimized. Differentiating the expression for the MSE and setting it equal to zero leads to a familiar expression:

$\lambda_i^* = \frac{b_{0i}^2}{b_{0i}^2 + \sigma^2} = \frac{1}{1 + \mathrm{snr}_i^{-1}}. \qquad (22)$

This is of course nothing but a manifestation of the Wiener filter, with $\mathrm{snr}_i = \frac{b_{0i}^2}{\sigma^2}$ denoting the signal-to-noise ratio at each component $i$ of the signal. This observation naturally raises the question of whether any existing filters in fact get close to this optimal performance [30]. An example filter which is designed to approximately achieve this goal is BM3D [29], which is considered at present to be the state of the art in denoising. The BM3D algorithm can be briefly summarized by the following three steps:

1) Patches from an input image are classified according to their similarity, and are grouped into 3-D

clusters accordingly.

2) A so-called ”3D collaborative Wiener filtering” is implemented to process each cluster.

3) Filtered patches inside all the clusters are aggregated to form an output.

The core process of the BM3D algorithm is the collaborative Wiener filtering that transforms a patch cluster through a fixed orthonormal basis $V$ (namely, the DCT) [29], where the coefficients of each component are then scaled by "shrinkage" factors $\{\lambda_i\}$, which are precisely chosen according to (22). Following this shrinkage step, the data are transformed back to the spatial domain to generate denoised patches for the aggregation process. In practice, of course, one does not have access to $\{b_{0i}^2\}$, so they are roughly estimated from the given noisy image $y$, as is done also in [29]. Though BM3D is not directly based upon a regression framework as other patch-based methods (such as NLM and LARK) are, the diagonal Wiener shrinkage matrix $S = \mathrm{diag}[\lambda_1, \cdots, \lambda_n]$ and the overall 3-D collaborative Wiener filtering process can now be understood using a symmetric, positive-definite filter matrix $W$ with orthonormal eigen-decomposition $V S V^T$. If the orthonormal basis $V$ contains a constant vector $v_{i'} = \frac{1}{\sqrt{n}} \mathbf{1}_n$, one can easily make $W$ doubly-stochastic by setting its corresponding shrinkage factor $\lambda_{i'} = 1$.
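For illustration, an oracle version of this shrinkage design is sketched below: it assumes access to the clean signal z in order to compute the coefficient energies b_0i^2, whereas BM3D must estimate these from the noisy data. The function name and interface are hypothetical.

```python
import numpy as np

def oracle_wiener_filter(V, z, sigma2):
    """Build W = V S* V^T with the oracle shrinkage factors of eq. (22),
    lambda_i = b_0i^2 / (b_0i^2 + sigma^2), given an orthonormal basis V."""
    b0 = V.T @ z                            # transform coefficients of the clean signal
    lam = b0 ** 2 / (b0 ** 2 + sigma2)      # per-mode Wiener shrinkage factors
    return (V * lam) @ V.T                  # V diag(lam) V^T without forming diag()
```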

To summarize this section, we have presented a widely applicable matrix formulation $\hat{z} = W y$ for the denoising problem, where $W$ is henceforth considered symmetric, positive-definite, and doubly-stochastic. This formulation has so far allowed us to characterize the performance of the resulting filters in terms of their spectra, and to obtain insights as to the relative statistical efficiency of existing filters such as BM3D. One final observation worth making here is that the Birkhoff-von Neumann Theorem [24] (p. 527) notes that $W$ is doubly stochastic if and only if it is a convex combination of permutation matrices. Namely, $W = \sum_l \alpha_l P_l$, where the $P_l$ are permutation matrices, the $\alpha_l$ are non-negative scalars with $\sum_l \alpha_l = 1$, and the number of terms is at most $(n-1)^2 + 1$. This means that regardless of the choice of an admissible kernel, each pixel of the estimated image

$\hat{z} = W y = \sum_l \alpha_l\, (P_l\, y)$

is always a convex linear combination of at most $(n-1)^2 + 1$ (not necessarily nearby) pixels of the noisy input.

Going forward, we first use the insights gained thus far to study further improvements of the non-

parametric approach. Then, we describe the relationship between the non-parametric approaches and

classical Bayesian methods.

V. IMPROVING THE ESTIMATE: DIFFUSION, TWICING, AND BREGMAN ITERATIONS

As we described above, the optimal spectrum for the filter matrix W is dictated by the Wiener condition

(22), which requires clairvoyant knowledge of the signal or at least careful estimation of the SNR at each

spatial frequency. In practice, however, nearly all similarity-based methods we described so far (with

the notable exception of the BM3D) design a matrix W based on a choice of the kernel function, and

without regard to the ultimate consequences of this choice on the spectrum of the resulting filter. Hence,

at least in the mean-squared error sense, such choices of W are always deficient. So one might naturally

wonder if such estimates can be improved through some post-facto iterative mechanism. The answer is

an emphatic yes, and this is the subject of this section.

To put the problem in more practical terms, the estimate derived from the application of a non-

parametric denoising filter W will not only remove some noise, but invariably some of the underlying

signal as well. A number of different approaches have been proposed to reintroduce this ”lost” component

back to the estimate. Interestingly, as we describe later in Section VI, similar to the steepest descent

methods employed in the solution of more classical Bayesian optimization approaches, the iterative

method employed here involve multiple applications of a filter (or sequence of filters) on the data, and

somehow pooling the results. Next we discuss the two most well-known and frequently used approaches:

anisotropic diffusion [31]; and the recently popularized Bregman iterations [32], which is closely related

14

Page 15: 1 A Tour of Modern Image Processingvision/courses/2010_2/... · This new generation of algorithms exploit both local and non-local redundancies or ”self-similarities” in the signals

to L2-boosting [33]. The latter is a generalization of twicing introduced by Tukey [34] more than thirty

years ago.

A. Diffusion

So far, we have described a general formulation for denoising as a spatially adaptive, data-dependent filtering procedure (3) amounting to

$\hat{z} = W y.$

Now consider applying the filter multiple times. That is, we define $\hat{z}_0 = y$ and the iteration

$\hat{z}_k = W \hat{z}_{k-1} = W^k y. \qquad (23)$

The net effect of each application of $W$ is essentially a step of anisotropic diffusion [31], [19]. This can be seen as follows. From the iteration (23) we have

$\hat{z}_k = W \hat{z}_{k-1} \qquad (24)$
$\quad\;\; = \hat{z}_{k-1} - \hat{z}_{k-1} + W \hat{z}_{k-1} \qquad (25)$
$\quad\;\; = \hat{z}_{k-1} + (W - I)\, \hat{z}_{k-1}, \qquad (26)$

which we can rewrite as

$\hat{z}_k - \hat{z}_{k-1} = (W - I)\, \hat{z}_{k-1}. \qquad (27)$

Recall that $W = D^{-1/2} L D^{1/2}$. Hence we have

$W - I = D^{-1/2} (L - I) D^{1/2} = D^{-1/2} \mathcal{L}\, D^{1/2},$

where $\mathcal{L}$ is the graph Laplacian operator mentioned earlier [20], [19]. Defining the normalized variable $\tilde{z}_k = D^{1/2} \hat{z}_k$, (27) embodies a discrete version of anisotropic diffusion:

$\tilde{z}_k - \tilde{z}_{k-1} = \mathcal{L}\, \tilde{z}_{k-1} \quad \longleftrightarrow \quad \frac{\partial z(t)}{\partial t} = \nabla^2 z(t) \quad \text{(Diffusion Eqn.)} \qquad (28)$

where the left-hand side of the above is a discretization of the derivative operator $\frac{\partial z(t)}{\partial t}$.

Returning to the diffusion estimator (23), the bias in the estimate after $k$ iterations is

$\mathrm{bias}_k = E(\hat{z}_k) - z = E(W^k y) - z = W^k z - z = (W^k - I)\, z.$

Recalling that the matrix $W$ has a unit eigenvalue corresponding to a (constant) vector, we note that since $W^k - I$ has the same eigenvectors as $W$, the above sequence of estimators produces unbiased estimates of constant images, but is biased for all other underlying images $z$.


The squared magnitude of the bias is given by

$\| \mathrm{bias}_k \|^2 = \| (W^k - I) z \|^2. \qquad (29)$

Writing the image $z$ in the column space of $V$ as $z = V b_0$, as before we have

$\| \mathrm{bias}_k \|^2 = \| (W^k - I) z \|^2 = \| V (S^k - I) b_0 \|^2 = \sum_{i=1}^{n} (\lambda_i^k - 1)^2 b_{0i}^2. \qquad (30)$

As $k$ grows, so does the magnitude of the bias, ultimately converging to $\| b_0 \|^2 - b_{01}^2$. This behavior is consistent with what is observed in practice; namely, increasing iterations of diffusion produce more and more blurry (biased) results. Next, we derive the variance of the diffusion estimator:

$\mathrm{cov}(\hat{z}_k) = \mathrm{cov}(W^k y) = \mathrm{cov}(W^k e) = \sigma^2 W^k (W^k)^T,$

which gives

$\mathrm{var}(\hat{z}_k) = \mathrm{tr}\left(\mathrm{cov}(\hat{z}_k)\right) = \sigma^2 \sum_{i=1}^{n} \lambda_i^{2k}.$

As $k$ grows, the variance decays, tending to the constant value $\sigma^2$ (only the first mode, with $\lambda_1 = 1$, survives). Overall, the mean-squared-error (MSE) is given by

$\mathrm{MSE}_k = \| \mathrm{bias}_k \|^2 + \mathrm{var}(\hat{z}_k) = \sum_{i=1}^{n} (\lambda_i^k - 1)^2 b_{0i}^2 + \sigma^2 \lambda_i^{2k}. \qquad (31)$
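For reference, (31) can be evaluated directly from the spectrum of W; a sketch computing the predicted squared bias, variance, and MSE_k over the diffusion iterations is given below. The filter W, the clean patch z, and the noise variance are assumed given, as in the oracle setting of these experiments, and the iteration count is arbitrary.

```python
import numpy as np

def diffusion_mse_curve(W_sym, z, sigma2, k_max=10):
    """Predicted squared bias, variance, and MSE_k of eq. (31) for the diffusion
    iterates z_k = W^k y, computed from the spectrum of a symmetric filter W."""
    lam, V = np.linalg.eigh(W_sym)
    b0 = V.T @ z
    ks = np.arange(1, k_max + 1)
    bias2 = np.array([np.sum((lam ** k - 1.0) ** 2 * b0 ** 2) for k in ks])
    var = np.array([sigma2 * np.sum(lam ** (2 * k)) for k in ks])
    return bias2, var, bias2 + var
```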

Diffusion experiments for denoising three types of patches (shown in Fig. 3) are carried out to illustrate the evolution of the MSE. The variance of the input Gaussian noise is set to $\sigma^2 = 25$. Filters based on both LARK [3] and NLM [1], estimated from the latent noise-free patches, are tested, and the Sinkhorn algorithm (see Appendix B) is used to make the filter matrices doubly-stochastic. The predicted $\mathrm{MSE}_k$, $\mathrm{var}(\hat{z}_k)$, and $\|\mathrm{bias}_k\|^2$ according to (31) through the diffusion iterations are shown in Fig. 4 (LARK) and Fig. 6 (NLM), where we can see that diffusion monotonically reduces the estimation variance and increases the bias. In some cases (such as (a) and (b) in Fig. 4), diffusion further suppresses the MSE and improves the estimation performance. True MSEs for the standard (asymmetric) filters and their symmetrized versions, computed by Monte-Carlo simulations, are also shown, where in each simulation 100 independent noise realizations are averaged. We see that the estimated MSEs of the symmetrized filters match quite well with the predicted ones (see Figs. 5 and 7). The asymmetric filters, meanwhile, generate slightly higher MSE, but behave closely to the symmetrized ones, especially in regions around the optimal MSEs.

In the next set of Monte-Carlo simulations, we illustrate the effect of W being computed from clean vs. noisy images (see Figs. 8 and 9). The noise variance is $\sigma^2 = 0.5$, which is relatively small, simulating the situation where "pre-processing" has been applied to suppress estimation variance. It can be observed that the MSEs estimated from Monte-Carlo simulations are quite close to the ones predicted from the ideal filters, which confirms the assumption in Section IV that the filter matrix W can be treated as deterministic under most circumstances.

[Fig. 3. Example patches: (a) flat, (b) edge, (c) texture.]

[Fig. 4. Plots of predicted $\mathrm{MSE}_k$, $\mathrm{var}(\hat{z}_k)$, and $\|\mathrm{bias}_k\|^2$ using (31) for LARK [3] filters in the diffusion process; panels: (a) flat, (b) edge, (c) texture.]

[Fig. 5. Plots of predicted MSE and Monte-Carlo estimated MSE in the diffusion process using LARK [3] filters. In the Monte-Carlo simulations, row-stochastic (asymmetric) filters and their symmetrized versions are all tested; panels: (a) flat, (b) edge, (c) texture.]

[Fig. 6. Plots of predicted $\mathrm{MSE}_k$, $\mathrm{var}(\hat{z}_k)$, and $\|\mathrm{bias}_k\|^2$ using (31) for NLM [1] filters in the diffusion process; panels: (a) flat, (b) edge, (c) texture.]

[Fig. 7. Plots of predicted MSE and Monte-Carlo estimated MSE in the diffusion process using NLM [1] filters. In the Monte-Carlo simulations, row-stochastic (asymmetric) filters and their symmetrized versions are all tested; panels: (a) flat, (b) edge, (c) texture.]

To further analyze the change of the MSE in the diffusion process, let us consider the contribution of the MSE in each component (or "mode") separately:

$\mathrm{MSE}_k = \sum_{i=1}^{n} \mathrm{MSE}_k^{(i)}, \qquad (32)$

where $\mathrm{MSE}_k^{(i)}$ denotes the MSE of the $i$-th mode in the $k$-th diffusion iteration, which is given by

$\mathrm{MSE}_k^{(i)} = (\lambda_i^k - 1)^2 b_{0i}^2 + \sigma^2 \lambda_i^{2k}. \qquad (33)$

[Fig. 8. Plots of predicted MSE in diffusion based on ideally estimated symmetric SKR filters, and MSE of noisy symmetric/asymmetric SKR filters through Monte-Carlo simulations. The input noise variance is $\sigma^2 = 0.5$, and for each simulation 100 independent noise realizations are implemented; panels: (a) flat, (b) edge, (c) texture.]

[Fig. 9. Plots of predicted MSE in diffusion based on ideally estimated symmetric NLM filters, and MSE of noisy symmetric/asymmetric NLM filters through Monte-Carlo simulations. The input noise variance is $\sigma^2 = 0.5$, and for each simulation 100 independent noise realizations are implemented; panels: (a) flat, (b) edge, (c) texture.]

Equivalently, we can write

$\frac{\mathrm{MSE}_k^{(i)}}{\sigma^2} = \mathrm{snr}_i\, (\lambda_i^k - 1)^2 + \lambda_i^{2k}, \qquad (34)$

where, as before, $\mathrm{snr}_i = \frac{b_{0i}^2}{\sigma^2}$ is defined as the signal-to-noise ratio in the $i$-th mode.

One may be interested to know at which iteration $\mathrm{MSE}_k^{(i)}$ is minimized. As we have seen, with an arbitrary filter $W$, there is no guarantee that even the very first iteration of diffusion will improve the estimate in the MSE sense. The derivative of (34) with respect to $k$ is

$\frac{\partial}{\partial k}\left[ \frac{\mathrm{MSE}_k^{(i)}}{\sigma^2} \right] = 2 \lambda_i^k \log \lambda_i \left[ (\mathrm{snr}_i + 1)\, \lambda_i^k - \mathrm{snr}_i \right]. \qquad (35)$

For $i \geq 2$, since $\lambda_i < 1$, we note that the sign of the derivative depends on the term inside the brackets, namely, $(\mathrm{snr}_i + 1)\lambda_i^k - \mathrm{snr}_i$. To guarantee that in the $k$-th iteration the MSE of the $i$-th mode decreases (negative derivative), we must have

$(\mathrm{snr}_i + 1)\, \lambda_i^q - \mathrm{snr}_i > 0 \quad \text{for any } 0 \leq q < k. \qquad (36)$

In fact, the condition

$(\mathrm{snr}_i + 1)\, \lambda_i^k - \mathrm{snr}_i \geq 0 \qquad (37)$

guarantees that the derivative is always negative for $q \in (0, k)$, because given any scalar $t = \frac{q}{k} \in (0, 1)$:

$(\mathrm{snr}_i + 1)\lambda_i^k - \mathrm{snr}_i \geq 0 \;\Rightarrow\; \lambda_i^{tk} \geq \left( \frac{\mathrm{snr}_i}{\mathrm{snr}_i + 1} \right)^t \;\Rightarrow\; (\mathrm{snr}_i + 1)\lambda_i^{tk} \geq \mathrm{snr}_i \left( \frac{\mathrm{snr}_i}{\mathrm{snr}_i + 1} \right)^{t-1} > \mathrm{snr}_i. \qquad (38)$

The condition (37) has an interesting interpretation. Rewriting (37), we have

$\log(1 + \mathrm{snr}_i) \leq \underbrace{\log\left( \frac{1}{1 - \lambda_i'} \right)}_{\epsilon_i}, \qquad (39)$

where $\lambda_i' = \lambda_i^k$ denotes the $i$-th eigenvalue of $W^k$, which is the equivalent filter matrix for the $k$-th iteration. The left-hand side of the inequality is Shannon's channel capacity [35] of the $i$-th mode for the problem $y = z + e$. We name the right-hand side expression $\epsilon_i$ the entropy of the $i$-th mode of the filter $W^k$. The larger $\epsilon_i$, the more variability can be expected in the output image produced by the $i$-th mode of the denoising filter. The above condition implies that if the filter entropy exceeds or equals the channel capacity of the $i$-th mode (i.e., the mode is sufficiently "strong"), then the $k$-th iteration of diffusion will in fact produce an improvement in the corresponding mode. One may also be interested in identifying an approximate value of $k$ for which the overall $\mathrm{MSE}_k$ is minimized. This depends on the signal energy distribution $\{b_{0i}^2\}$ over all the modes. The study of this topic, which we consider outside the scope of the present paper, will help to design a strategy capable of enhancing the performance of many existing denoising filters.

Another interesting question is this: given a particular $W$, how much further MSE reduction can be achieved by implementing diffusion? For example, can we use diffusion to improve the 3-D collaborative Wiener filter in BM3D [29]? Assume that this filter has already achieved the ideal Wiener filter condition in each mode, namely

$\lambda_i^* = \frac{\mathrm{snr}_i}{\mathrm{snr}_i + 1}. \qquad (40)$

Then we can see that

$(\mathrm{snr}_i + 1)(\lambda_i^*)^k - \mathrm{snr}_i < 0 \quad \text{for any } k > 1, \qquad (41)$

which means that for all the modes, diffusion will definitely worsen (increase) the MSE. In other words, multi-iteration diffusion is not useful for filters where the Wiener condition has been (approximately) achieved. Put another way, suppose that the minimum $\mathrm{MSE}^{(i)}$ is achieved in the $k^*$-th iteration. Hence,

$k^* = \log\left( \frac{\mathrm{snr}_i}{\mathrm{snr}_i + 1} \right) \Big/ \log \lambda_i. \qquad (42)$

Then the $i$-th eigenvalue of the equivalent filter matrix $W^{k^*}$ becomes $\lambda_i' = \mathrm{snr}_i / (\mathrm{snr}_i + 1)$, which is in fact the Wiener filter. Of course, in practice we can seldom find a single iteration number $k^*$ minimizing $\mathrm{MSE}^{(i)}$ for all the modes simultaneously; but in general, the take-away lesson is that optimizing a given filter through diffusion can in some cases (39) make its corresponding eigenvalues closer to those of an ideal Wiener filter that minimizes the MSE.
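For completeness, a small sketch of the per-mode optimum (42) is given below; it simply evaluates the formula for each mode and is purely illustrative. Modes with λ_i = 1 or b_0i = 0 yield inf/nan and would be handled separately in practice; the function name is hypothetical.

```python
import numpy as np

def mode_optimal_iterations(lam, b0, sigma2):
    """Per-mode minimizer k*_i of eq. (42), returned as a real number for each mode."""
    snr = b0 ** 2 / sigma2
    with np.errstate(divide="ignore", invalid="ignore"):
        k_star = np.log(snr / (snr + 1.0)) / np.log(lam)
    return k_star
```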

B. Twicing, L2-Boosting, and Bregman Iterations

An alternative to repeated applications of the filter W is to consider the residual signals, defined as the

difference between the estimated signal and the measured signal. The use of the residuals in improving

estimates has a rather long history, dating at least back to the work of John Tukey [34] who termed the

idea ”twicing”. More recently, the idea has been rediscovered in the applied mathematics community

under the rubric of Bregman iterations [32], and in the machine learning and statistics literature [33] as

L2-boosting. In whatever guise, the basic notion, to paraphrase Tukey, is to use the filtered residuals to enhance the estimate by adding some "roughness" to it. Put another way, if the residuals contain some of

the underlying signal, filtering them should recover this left-over signal at least in part. These estimates

too trade off bias against variance with increasing iterations, though in a fundamentally different way

than the diffusion estimator we discussed earlier.

Formally, the residuals are defined as the difference between the estimated signal and the measured signal: $r_k = y - \hat{z}_{k-1}$, where here we define the initialization$^{10}$ $\hat{z}_0 = W y$. With this definition, we write the iterated estimates as

$\hat{z}_k = \hat{z}_{k-1} + W r_k = \hat{z}_{k-1} + W (y - \hat{z}_{k-1}). \qquad (43)$

Specifically, for $k = 1$,

$\hat{z}_1 = \hat{z}_0 + W (y - \hat{z}_0) = W y + W (y - W y) = \left( 2W - W^2 \right) y.$

This first iterate $\hat{z}_1$ is precisely the "twicing" estimate of Tukey [34]. More recently, this particular

filter has been suggested (i.e., rediscovered) in other contexts. In [19], [20], for instance, Coifman et al. suggested it as an ad-hoc way to enhance the effectiveness of diffusion. The intuitive justification is that $\left( 2W - W^2 \right) = (2I - W)\, W$ can be considered as a two-step process; namely blurring, or diffusion (as embodied by $W$), followed by an additional step of inverse diffusion, or sharpening (as embodied by $2I - W$). As can be seen, the ad-hoc suggestion has a clear interpretation here. Furthermore, we see that the estimate $\hat{z}_1$ can also be thought of as a kind of nonlinear, adaptive unsharp masking process, further clarifying its effect. In [36] this procedure and its relationship to the Bregman iterations were studied in detail$^{11}$.

$^{10}$Note that due to the use of residuals, this is a different initialization than the one used in the diffusion iterations.

To study the statistical behavior of the estimates, we rewrite the iterative process in (43) more explicitly in terms of the data $y$. We have [23], [33]

$\hat{z}_k = \sum_{j=0}^{k} W (I - W)^j\, y = \left( I - (I - W)^{k+1} \right) y. \qquad (44)$
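A minimal sketch of the iteration (43) follows; it is written directly from the recursion and can be checked numerically against the closed form (44). The function name and iteration count are illustrative.

```python
import numpy as np

def twicing_iterates(W, y, k_max=5):
    """Residual-driven iteration of eq. (43): z_0 = W y, z_k = z_{k-1} + W (y - z_{k-1}).
    The returned iterates agree with the closed form z_k = (I - (I - W)^(k+1)) y of eq. (44)."""
    z = W @ y                       # initialization z_0 = W y
    out = [z]
    for _ in range(k_max):
        z = z + W @ (y - z)         # filter the residual and add it back
        out.append(z)
    return out
```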

It is instructive to contrast the above with the diffusion process. Clearly, the above iteration does not monotonically blur the data; rather, a more interesting behavior of the bias-variance tradeoff is observed. Analogous to our earlier derivation, the bias in the estimate after $k$ iterations is

$\mathrm{bias}_k = E(\hat{z}_k) - z = \left( I - (I - W)^{k+1} \right) E(y) - z = -(I - W)^{k+1}\, z.$

The squared magnitude of the bias is given by

$\| \mathrm{bias}_k \|^2 = \| (I - W)^{k+1} z \|^2 = \sum_{i=1}^{n} (1 - \lambda_i)^{2k+2}\, b_{0i}^2. \qquad (45)$

The behavior of the bias in this setting is in stark contrast to the increasing bias of the diffusion process with iterations. Namely, here the bias in fact decays with iterations.

The variance of the estimator and the corresponding mean-squared error are

$\mathrm{cov}(\hat{z}_k) = \mathrm{cov}\left[ \left( I - (I - W)^{k+1} \right) y \right] = \sigma^2 \left( I - (I - W)^{k+1} \right) \left( I - (I - W)^{k+1} \right)^T,$

which gives

$\mathrm{var}(\hat{z}_k) = \mathrm{tr}\left(\mathrm{cov}(\hat{z}_k)\right) = \sigma^2 \sum_{i=1}^{n} \left( 1 - (1 - \lambda_i)^{k+1} \right)^2.$

As $k$ grows, the variance tends to a constant value of $n \sigma^2$. Overall, the mean-squared-error (MSE) is

$\mathrm{MSE}_k = \| \mathrm{bias}_k \|^2 + \mathrm{var}(\hat{z}_k) = \sum_{i=1}^{n} (1 - \lambda_i)^{2k+2}\, b_{0i}^2 + \sigma^2 \left( 1 - (1 - \lambda_i)^{k+1} \right)^2. \qquad (46)$

$^{11}$For the sake of completeness, we also mention in passing that in the non-stochastic setting, the residual-based iterations (43) are known as matching pursuit [37], an approach that has found many applications and extensions for fitting over-complete dictionaries.

[Fig. 10. Plots of predicted $\mathrm{MSE}_k$, $\mathrm{var}(\hat{z}_k)$, and $\|\mathrm{bias}_k\|^2$ using (46) for LARK [3] filters in the twicing process; panels: (a) flat, (b) edge, (c) texture.]

[Fig. 11. Plots of predicted MSE and Monte-Carlo estimated MSE in the twicing process using LARK [3] filters. In the Monte-Carlo simulations, row-stochastic (asymmetric) filters and their symmetrized versions are all tested; panels: (a) flat, (b) edge, (c) texture.]

Experiments for denoising the patches given in Fig. 3 using the twicing approach are carried out, where the same doubly-stochastic LARK and NLM filters used for the diffusion tests in Figs. 4 and 6 are employed. Plots of the predicted $\mathrm{MSE}_k$, $\mathrm{var}(\hat{z}_k)$, and $\|\mathrm{bias}_k\|^2$ with respect to iteration are illustrated in Fig. 10 (LARK) and Fig. 12 (NLM). It can be observed that, in contrast to diffusion, twicing monotonically reduces the estimation bias and increases the variance. So twicing may in fact reduce the MSE in cases where diffusion fails to do so (such as (c) in Fig. 4). True MSEs for the standard asymmetric filters and their symmetrized versions are estimated through Monte-Carlo simulations, and they all match very well with the predicted ones (see Figs. 11 and 13).

Similar to the diffusion analysis before, we plot the MSE resulting from the use of W filters directly estimated from noisy patches through Monte-Carlo simulations with noise variance $\sigma^2 = 0.5$, and compare them with the predicted MSEs from the ideally estimated filters. Again, the MSEs estimated from Monte-Carlo simulations are quite close to the predicted data (see Figs. 14 and 15).

[Fig. 12. Plots of predicted $\mathrm{MSE}_k$, $\mathrm{var}(\hat{z}_k)$, and $\|\mathrm{bias}_k\|^2$ using (46) for NLM [1] filters in the twicing process; panels: (a) flat, (b) edge, (c) texture.]

[Fig. 13. Plots of predicted MSE and Monte-Carlo estimated MSE in the twicing process using NLM [1] filters. In the Monte-Carlo simulations, row-stochastic (asymmetric) filters and their symmetrized versions are all tested; panels: (a) flat, (b) edge, (c) texture.]

Here again, the contribution to $\mathrm{MSE}_k$ of the $i$-th mode can be written as

$\mathrm{MSE}_k^{(i)} = (1 - \lambda_i)^{2k+2}\, b_{0i}^2 + \sigma^2 \left( 1 - (1 - \lambda_i)^{k+1} \right)^2. \qquad (47)$

Proceeding in a fashion similar to the diffusion case, we can analyze the derivative of MSE_k^{(i)} with respect to k to see whether the iterations improve the estimate. We have:
\[
\frac{\partial\, \mathrm{MSE}_k^{(i)}}{\partial k} = 2\,(1-\lambda_i)^{k+1}\log(1-\lambda_i)\left[(\mathrm{snr}_i + 1)(1-\lambda_i)^{k+1} - 1\right]. \tag{48}
\]
Reasoning along parallel lines to the earlier discussion, the following condition guarantees that the k-th iteration improves the estimate:
\[
(\mathrm{snr}_i + 1)(1-\lambda_i)^{k+1} \;\ge\; 1. \tag{49}
\]
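As a quick numerical illustration of (47)-(49), the following sketch tabulates the per-mode MSE over the twicing iterations and evaluates condition (49) at each step; the values of λ_i, b_{0i}, and σ are hypothetical and chosen only for illustration, not taken from the experiments above.

```python
import numpy as np

# Hypothetical per-mode quantities (illustrative only): eigenvalue of W,
# signal energy of the mode, and noise standard deviation.
lam, b0, sigma = 0.6, 1.5, 1.0
snr = b0**2 / sigma**2

for k in range(6):
    shrink = (1.0 - lam) ** (k + 1)          # residual shrinkage after k twicing steps
    bias2 = shrink**2 * b0**2                # squared-bias term of (47)
    var = sigma**2 * (1.0 - shrink) ** 2     # variance term of (47)
    improves = (snr + 1.0) * shrink >= 1.0   # condition (49) for the k-th step to help
    print(f"k={k}: MSE={bias2 + var:.4f}  bias^2={bias2:.4f}  var={var:.4f}  step improves: {improves}")
```

Once (49) fails for a given mode, the derivative in (48) becomes positive, and further twicing only increases that mode's MSE.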



(a) Flat (b) Edge (c) Texture

Fig. 14. Plots of predicted MSE in twicing based on ideally estimated symmetric SKR filters, and MSE of noisy symmetric/asymmetric SKR filters estimated through Monte-Carlo simulations. The input noise variance is σ² = 0.5, and each simulation uses 100 independent noise realizations.


(a) Flat (b) Edge (c) Texture

Fig. 15. Plots of predicted MSE in twicing based on ideally estimated symmetric NLM filters, and MSE of noisy symmetric/asymmetric NLM filters estimated through Monte-Carlo simulations. The input noise variance is σ² = 0.5, and each simulation uses 100 independent noise realizations.

Rewriting the above, we have
\[
\log(1+\mathrm{snr}_i) \;\ge\; \underbrace{\log\!\left(\frac{1}{1-\lambda_i'}\right)}_{\epsilon_i} \tag{50}
\]

where now λ'_i = 1 − (1 − λ_i)^{k+1} is the i-th eigenvalue of I − (I − W)^{k+1}, which is the equivalent filter matrix for the k-th twicing iteration. This is analogous to the condition we derived earlier. But here we observe that if the entropy does not exceed the channel capacity (i.e., the i-th mode of the filter is sufficiently "weak" or "ineffective"), then iterating with the residuals will indeed produce an improvement. This makes sense because a "weak" filter removes too much detail from the denoised estimate. This "lost" detail is contained in the residuals, which the iterations will attempt


to return to the estimate. Again, if we assume the minimum of MSE_k^{(i)} is achieved at the k*-th iteration:
\[
k^{*} = \log\!\left(\frac{1}{\mathrm{snr}_i + 1}\right) \Big/ \log(1-\lambda_i) \;-\; 1, \tag{51}
\]

then the i-th eigenvalue of the equivalent filter matrix I − (I − W)^{k*+1} becomes snr_i/(snr_i + 1), which is the Wiener condition. In most cases we cannot optimize all the modes with a single, uniform number of iterations, but globally optimizing the MSE through twicing can push the eigenvalues closer to those of the ideal Wiener filter that minimizes the MSE.
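To make this stopping rule concrete, here is a small sketch (with made-up eigenvalues and per-mode SNRs, purely for illustration) that evaluates (51), rounds k* to an integer, and compares the resulting eigenvalue of the equivalent twicing filter I − (I − W)^{k*+1} to the ideal Wiener value snr_i/(snr_i + 1).

```python
import numpy as np

# Hypothetical eigenvalues of W and per-mode SNRs (illustrative only).
lams = np.array([0.95, 0.7, 0.4, 0.1])
snrs = np.array([20.0, 5.0, 1.0, 0.2])

# Continuous-k optimum from (51); in practice k must be a non-negative integer.
k_star = np.log(1.0 / (snrs + 1.0)) / np.log(1.0 - lams) - 1.0
k_int = np.maximum(np.round(k_star), 0).astype(int)

# Eigenvalue of the equivalent twicing filter I - (I - W)^(k+1) at the chosen k.
eig_equiv = 1.0 - (1.0 - lams) ** (k_int + 1)
wiener = snrs / (snrs + 1.0)

for l, s, k, e, w in zip(lams, snrs, k_int, eig_equiv, wiener):
    print(f"lambda={l:.2f} snr={s:5.1f}  k*={k}  equivalent eig={e:.3f}  Wiener={w:.3f}")
```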

From the above analysis, we can see that it is possible to improve the performance of many existing denoising algorithms in the MSE sense by iterative filtering (see the LARK filtering examples in Fig. 16). However, choosing between diffusion and twicing, and determining the optimal number of iterations, requires prior knowledge (or estimation) of the latent signal energies $\{b_{0i}^2\}$, and this remains an ongoing challenge. Estimating the SNR of input image or video data without a reference "ground truth" has recently attracted broad interest in the community [38]. In particular, this problem points to potentially important connections with ongoing work in no-reference image quality assessment [39] as well.

VI. RELATIONSHIP TO BAYESIAN REGULARIZATION APPROACHES

It is useful to contrast the non-parametric framework described so far with the more classical Bayesian

estimation approach. Informally, to have a Bayesian interpretation of the estimator in (13), we could treat

both yi and xi as random variables with joint probability density function p(x, y). Then z(x) can be

thought of as the minimum mean-squared estimate of z based on the data y, conditioned on x [40]:

\[
z(x) = E(y \,|\, x) = \int y\, p(y|x)\, dy = \int y\, \frac{p(x,y)}{p(x)}\, dy = \frac{\int y\, p(x,y)\, dy}{\int p(x,y)\, dy}.
\]

Comparing this expression to the filtering formulation (13):

\[
z(x_j) = \sum_i \frac{K(x_i, x_j, y_i, y_j)}{\sum_i K(x_i, x_j, y_i, y_j)}\; y_i \;=\; \sum_i W(x_i, x_j, y_i, y_j)\; y_i,
\]

we observe that the form and function of the kernel K(·) is to locally capture the essence of the (unknown)

joint density function p(x, y).
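As a concrete (and deliberately simple) illustration of such normalized, data-dependent weights, the sketch below builds a generic bilateral-type kernel on a toy 1-D signal; the kernel form and the bandwidths h_x and h_y are illustrative assumptions, not the specific kernels analyzed in this paper.

```python
import numpy as np

def adaptive_weights(x, y, j, hx=2.0, hy=0.3):
    """Normalized data-adaptive weights W(x_i, x_j, y_i, y_j) at sample j.

    A generic bilateral-type kernel is used for illustration:
    K = exp(-(x_i - x_j)^2 / (2 hx^2)) * exp(-(y_i - y_j)^2 / (2 hy^2)).
    """
    K = np.exp(-(x - x[j])**2 / (2 * hx**2)) * np.exp(-(y - y[j])**2 / (2 * hy**2))
    return K / K.sum()          # one row of W: non-negative weights summing to one

# Toy 1-D signal: a noisy step edge.
rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)
z = (x > 25).astype(float)
y = z + 0.1 * rng.standard_normal(x.size)

j = 25
w = adaptive_weights(x, y, j)
z_hat_j = w @ y                 # z(x_j) = sum_i W(...) y_i
print(f"denoised value at j={j}: {z_hat_j:.3f} (clean value {z[j]:.1f})")
```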

A more common Bayesian approach is to consider a prior for the unknown z or, equivalently, to use a regularization approach. More specifically:
\[
\text{Maximum a-posteriori (MAP):}\qquad z = \arg\min_{z}\ \frac{1}{2}\,\|y - z\|^2 + \frac{\lambda}{2}\, R(z) \tag{52}
\]

where the first term on the right-hand side is the data-fidelity term, and the second (regularization) term essentially enforces a soft constraint on the global structure and smoothness of the signal being restored.



Fig. 16. Denoising example using stationary iterative LARK filtering: (a)-(c) input noisy patches, noise variance σ² = 25; (d)-(f) output patches after filtering once; (g)-(i) MSE-optimized outputs: (g) is the 6th diffusion output, (h) is the 4th diffusion output, and (i) is the 3rd twicing output.


In the MAP approach, the regularization term is a functional R(z) which is typically (but not always) convex, so as to yield a unique minimum for the overall cost. Particularly popular recent examples include R(z) = ‖∇z‖ and R(z) = ‖z‖₁. The above approach implicitly contains a global (Gaussian) model of the noise (captured by the quadratic data-fidelity term) and an explicit model of the signal (captured by the regularization term).

Whatever choice we make for the regularization functional, the regularization approach is global in the sense that the implicit assumptions about the noise and the signal constrain the degrees of freedom in the resulting filters W_k, and therefore in the solution, hence limiting the global behavior of the estimate. This typically results in well-behaved (though not always desirable) solutions. This is precisely (even if only implicitly) the main motivation for using (local/non-local) adaptive non-parametric techniques in place of global regularization methods. Indeed, though the effect of regularization is not explicitly present in the non-parametric framework, its work is implicitly done by the design of the kernel function K(·), which affords us local control, and therefore more freedom and often better adaptation to the given data, resulting in more powerful techniques with broader applicability. Though we do not discuss them here, hybrid methods are of course also possible, where the first term in (52) is replaced by the corresponding weighted term (3), with regularization also applied explicitly [41]. As we shall see below, the iterative methods for improving the non-parametric estimates (such as diffusion, twicing, etc.) mirror the behavior of steepest descent iterations in the Bayesian case, but with a fundamental difference. Namely, the corresponding implicit regularization functionals deployed by the non-parametric methods are significantly different from, and more complex than, their typical Bayesian counterparts.

To connect MAP to the non-parametric setting, we proceed along lines similar to Elad [13] by considering the simplest iterative approach: the steepest descent (SD) method. The SD iteration for MAP, with fixed step size µ, is
\[
\text{MAP:}\qquad z_k = z_{k-1} - \mu\left[(z_{k-1} - y) + \lambda\,\nabla R(z_{k-1})\right]. \tag{53}
\]
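For concreteness, the sketch below runs the SD iteration (53) on a toy 1-D signal with a simple quadratic (Tikhonov-type) regularizer R(z) = ½‖Dz‖², chosen here only because its gradient D^T D z is trivial to write down; the regularizer, step size, and λ are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

def sd_map_denoise(y, lam=2.0, mu=0.1, iters=50):
    """Steepest-descent MAP iteration (53) with a Tikhonov-type regularizer.

    R(z) = 0.5 * ||D z||^2, where D is a first-difference operator, so
    grad R(z) = D.T @ D @ z.  This choice of R is only for illustration.
    """
    n = y.size
    D = np.eye(n, k=1) - np.eye(n)           # first differences (simple boundary handling)
    DtD = D.T @ D
    z = y.copy()
    for _ in range(iters):
        grad_R = DtD @ z
        z = z - mu * ((z - y) + lam * grad_R)
    return z

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 100)
clean = np.sin(2 * np.pi * t)
noisy = clean + 0.3 * rng.standard_normal(t.size)
print("MSE before:", np.mean((noisy - clean) ** 2))
print("MSE after :", np.mean((sd_map_denoise(noisy) - clean) ** 2))
```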

A. Bayesian Interpretation of Diffusion

To describe a Bayesian interpretation for the diffusion process, we can compare (27) to the steepest descent MAP iterations in (53). More specifically, we equate the right-hand sides of
\[
z_{k+1} = z_k - \mu\left[(z_k - y) + \lambda\,\nabla R(z_k)\right] \tag{54}
\]
\[
z_{k+1} = z_k + (W - I)\, z_k \tag{55}
\]


which gives
\[
-\mu\left[(z_k - y) + \lambda\,\nabla R(z_k)\right] = (W - I)\, z_k.
\]
Solving for ∇R(z_k) we obtain
\[
\nabla R(z_k) = \frac{1}{\mu\lambda}\left(W - (1-\mu)I\right)(y - z_k) + \frac{1}{\mu\lambda}\,(I - W)\, y. \tag{56}
\]

When W is symmetric¹², integrating both sides we have (to within an additive constant):
\[
R_{\mathrm{diff}}(z_k) = \frac{1}{2\mu\lambda}\,(y - z_k)^T\left((1-\mu)I - W\right)(y - z_k) + \frac{1}{\mu\lambda}\, y^T (I - W)\, z_k. \tag{57}
\]

We observe therefore that the (implicit) regularization term is a sum of two terms, the first being a

quadratic function of the residuals, and the second a sort of correlation between the data y and the k-th

iterate. As such, it is not a standard Bayesian prior – it is not simply a functional modeling the underlying

image; instead, it depends both on the data, and also evolves as a function of k.
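As a sanity check on this derivation, the short sketch below (using a random symmetric matrix W generated only for illustration; symmetry is all that the integration step requires) verifies numerically that the gradient of (57) with respect to z_k matches the expression in (56).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.random((n, n))
W = 0.5 * (A + A.T)                 # a random symmetric W, used only for this check
y, z = rng.random(n), rng.random(n)
mu, lam = 0.2, 1.5
I = np.eye(n)

def R_diff(z):
    """Implicit diffusion regularizer (57), up to an additive constant."""
    r = y - z
    return (r @ (((1 - mu) * I - W) @ r)) / (2 * mu * lam) + (y @ ((I - W) @ z)) / (mu * lam)

# Analytic gradient from (56).
grad_analytic = ((W - (1 - mu) * I) @ (y - z) + (I - W) @ y) / (mu * lam)

# Central-difference gradient of (57).
eps = 1e-6
grad_numeric = np.array([(R_diff(z + eps * e) - R_diff(z - eps * e)) / (2 * eps) for e in I])

print("max |analytic - numeric|:", np.max(np.abs(grad_analytic - grad_numeric)))
```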

The Bayesian interpretation of the residual-based methods is, somewhat surprisingly, more straightfor-

ward but still non-standard, as we illustrate below.

B. Bayesian Interpretation of the Residual Iterations

The residual-based iterations (43) can have a Bayesian interpretation too. We arrive at this in a manner similar to what we did above with diffusion. In particular, comparing (43) to the steepest descent MAP iterations in (53) we have
\[
z_{k+1} = z_k - \mu\left[(z_k - y) + \lambda\,\nabla R(z_k)\right] \tag{58}
\]
\[
z_{k+1} = z_k + W\,(y - z_k) \tag{59}
\]

Again, equating the right-hand sides we get
\[
-\mu\left[(z_k - y) + \lambda\,\nabla R(z_k)\right] = W\,(y - z_k).
\]
Solving for ∇R(z_k) we obtain
\[
\nabla R(z_k) = \frac{1}{\mu\lambda}\,(W - \mu I)\,(z_k - y). \tag{60}
\]

Again, with W symmetric, we have (to within an additive constant):
\[
R_{\mathrm{resid}}(z_k) = \frac{1}{2\mu\lambda}\,(y - z_k)^T\,(W - \mu I)\,(y - z_k). \tag{61}
\]

¹² Since the left-hand side of (56) is the gradient of a scalar-valued function, the right-hand side must be curl-free. This is possible if and only if W is symmetric.


Therefore the (implicit) regularization term is an adaptive, quadratic function of the residuals. As such,

it is not a standard Bayesian prior.

VII. CONCLUDING REMARKS

It has been said that in both literature and film there are only seven basic plots (comedy, tragedy, etc.), and that all stories are combinations or variations on these basic themes. Perhaps it is taking the analogy too far to say that an exact parallel exists in our field as well. But it is fair to argue that the basic tools of our trade in recent years have revolved around a small number of key concepts, which I have tried to highlight in this paper.

• The most successful modern approaches in image and video processing are non-parametric. We

have drifted away from model-based methods which have dominated signal processing for decades.

• In one-dimensional signal processing, there is a long history of design and analysis of adaptive filter-

ing algorithms. A corresponding line of thought has only recently become the dominant paradigm in

processing higher-dimensional data. Indeed, adaptivity to data is a central theme of all the algorithms

and techniques discussed here, which represent a snapshot of the state of the art.

• The traditional approach that every student of our field learns in school is to carefully design a

filter, apply it to the given set of data, and call it a day. In contrast, many, if not most, of the recent approaches have instead involved repeated applications of a filter or sequence of filters to a data

set and combining the results. Such ideas have been around for some time in statistics, machine

learning, and elsewhere, but we have just begun to make careful use of sophisticated iteration and

bootstrapping mechanisms.

As I have highlighted here, deep connections exist between techniques commonly used in computer vision, image processing, and machine learning. The practitioners of these fields have been using each other's techniques, either implicitly or explicitly, for a while. The pace of this convergence has quickened, and this is not a coincidence. It has come about through many years of scientific iteration in what my late friend and colleague Gene Golub called the "serendipity of science": an aptitude we have all developed for making desirable discoveries by happy accident.

VIII. ACKNOWLEDGEMENTS

I would like to thank all my students in the MDSP research group at UC Santa Cruz for their feedback

and insights on earlier versions of this manuscript. In particular, I acknowledge the diligent help of Xiang


Zhu with many of the simulations contained here. I also thank my friend and colleague Prof. Michael

Elad of the Technion for his feedback and discussions.

APPENDIX A

GENERALIZATION OF THE NON-PARAMETRIC FRAMEWORK TO ARBITRARY BASES

The non-parametric approach in (3) can be further extended to include a more general model of the

signal z(x) in some appropriate basis. Namely, expanding the regression function z(x) in a desired basis,

we can formulate the following optimization problem:

\[
z(x_j) = \arg\min_{\{\beta_l(x_j)\}}\ \sum_{i=1}^{n} \left[\, y_i - \sum_{l=0}^{N} \beta_l(x_j)\,\phi_l(x_i, x_j) \right]^2 K(y_i, y_j, x_i, x_j), \tag{62}
\]

where N is the model (or regression) order. For instance, with the basis set φ_l(x_i, x_j) = (x_i − x_j)^l, we have the Taylor series expansion, leading to the local polynomial approaches of classical kernel regression theory [7], [40], [3]. Alternatively, the basis vectors could be learned from the given image using a method such as K-SVD [42]. In matrix notation we have
\[
\beta(x_j) = \arg\min_{\beta(x_j)}\ \left[\, y - \Phi_j\,\beta(x_j) \right]^T K_j \left[\, y - \Phi_j\,\beta(x_j) \right], \tag{63}
\]

where
\[
\Phi_j = \begin{bmatrix}
\phi_0(x_1, x_j) & \phi_1(x_1, x_j) & \cdots & \phi_N(x_1, x_j) \\
\phi_0(x_2, x_j) & \phi_1(x_2, x_j) & \cdots & \phi_N(x_2, x_j) \\
\vdots & \vdots & & \vdots \\
\phi_0(x_n, x_j) & \phi_1(x_n, x_j) & \cdots & \phi_N(x_n, x_j)
\end{bmatrix} \tag{64}
\]

is a matrix containing the basis vectors in its columns, and β(x_j) = [β_0(x_j), β_1(x_j), · · · , β_N(x_j)]^T is the coefficient vector of the local signal representation in this basis. Meanwhile, K_j is the diagonal matrix of weights as defined earlier. In order to maintain the ability to represent the "DC" pixel values, we can insist that the matrix Φ_j contain the vector 1_n = [1, 1, · · · , 1]^T as one of its columns. Without loss of generality, we assume this to be the first column, so that by definition φ_0(x_i, x_j) = 1 for all i and j.

Denoting the j-th row of Φ_j as φ_j^T = [φ_0(x_j, x_j), φ_1(x_j, x_j), · · · , φ_N(x_j, x_j)], we have the closed-form solution
\[
z(x_j) \;=\; \phi_j^T\,\beta(x_j) \tag{65}
\]
\[
\;=\; \underbrace{\phi_j^T\left(\Phi_j^T K_j \Phi_j\right)^{-1}\Phi_j^T K_j}_{w_j^T}\; y \;=\; w_j^T\, y. \tag{66}
\]


This, again, is a weighted combination of the pixels, though a rather more complicated one than the earlier formulation. Interestingly, the filter vector w_j still has elements that sum to 1. This can be seen as follows. We have
\[
w_j^T\,\Phi_j = \phi_j^T\left(\Phi_j^T K_j \Phi_j\right)^{-1}\Phi_j^T K_j\,\Phi_j = \phi_j^T.
\]

Writing this explicitly, we observe
\[
w_j^T \begin{bmatrix}
1 & \phi_1(x_1, x_j) & \cdots & \phi_N(x_1, x_j) \\
1 & \phi_1(x_2, x_j) & \cdots & \phi_N(x_2, x_j) \\
\vdots & \vdots & & \vdots \\
1 & \phi_1(x_n, x_j) & \cdots & \phi_N(x_n, x_j)
\end{bmatrix}
= \left[\, 1,\ \phi_1(x_j, x_j),\ \cdots,\ \phi_N(x_j, x_j) \,\right].
\]
Considering the inner product of w_j^T with the first column of Φ_j, we have w_j^T 1_n = 1, as claimed.
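The following sketch implements (63)-(66) for a local polynomial (Taylor-type) basis with a simple Gaussian spatial kernel (both illustrative assumptions rather than choices made in the paper), and checks numerically that the equivalent weights w_j sum to one.

```python
import numpy as np

def local_basis_weights(x, j, N=2, h=3.0):
    """Equivalent filter weights w_j from (66) for a local polynomial basis.

    Basis: phi_l(x_i, x_j) = (x_i - x_j)^l for l = 0..N, so the first column of Phi_j
    is all ones.  The kernel here is a plain Gaussian spatial kernel, used only for
    illustration; a data-adaptive kernel K(y_i, y_j, x_i, x_j) would enter the same way.
    """
    d = x - x[j]
    Phi = np.vander(d, N + 1, increasing=True)       # columns: 1, d, d^2, ..., d^N
    K = np.diag(np.exp(-d**2 / (2 * h**2)))          # diagonal weight matrix K_j
    phi_j = Phi[j]                                   # j-th row: [1, 0, ..., 0]
    # w_j = K_j Phi_j (Phi_j^T K_j Phi_j)^{-1} phi_j  (the transpose of (66)).
    return K @ Phi @ np.linalg.solve(Phi.T @ K @ Phi, phi_j)

rng = np.random.default_rng(3)
x = np.arange(21, dtype=float)
y = 0.05 * (x - 10.0) ** 2 + 0.2 * rng.standard_normal(x.size)

w = local_basis_weights(x, j=10)
print("weights sum to:", w.sum())       # 1, up to numerical precision
print("estimate z(x_j):", w @ y)        # w_j^T y
```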

The use of a general basis as described above carries one disadvantage, however. Namely, since the

coefficient vectors wj are influenced now not only by the choice of the kernel, but also by the choice of

the basis, the elements of wj can no longer be guaranteed to be positive.

APPENDIX B

SYMMETRIC APPROXIMATION OF W AND ITS PROPERTIES

We approximate the matrix W = D^{-1}K with a doubly-stochastic (symmetric) positive definite matrix. The algorithm we use to effect this approximation is due to Sinkhorn [25], [26]. He proved that, given a matrix with strictly positive elements (such as W), there exist diagonal matrices R = diag(r) and C = diag(c) such that
\[
\widehat{W} = R\,W\,C
\]
is doubly stochastic. That is,
\[
\widehat{W}\,1_n = 1_n \qquad \text{and} \qquad 1_n^T\,\widehat{W} = 1_n^T. \tag{67}
\]

Furthermore, the vectors r and c are unique to within a scalar (i.e., $\alpha r$, $c/\alpha$). Sinkhorn's algorithm for obtaining r and c in effect involves repeated normalization of the rows and columns so that they sum to one (see Algorithm 1 for details), and is provably convergent and optimal in the cross-entropy sense [27]. To see that the resulting matrix $\widehat{W}$ is symmetric positive definite, we note a corollary of Sinkhorn's result: when a symmetric matrix is considered, the unique diagonal scalings are in fact identical [25], [26]. Applied to K, we see that the symmetric diagonal scaling $\widehat{K} = \Delta K \Delta$ yields a symmetric doubly-stochastic matrix. But we have W = D^{-1}K, so
\[
\widehat{W} = R\,D^{-1}K\,C.
\]


Since $\widehat{W}$ and the diagonal scalings are unique (to within a scalar), we must have $R\,D^{-1} = C = \Delta$, and therefore $\widehat{W} = \widehat{K}$. That is to say, applying Sinkhorn's algorithm to either K or its row-sum normalized version W yields the very same (symmetric positive definite, doubly stochastic) result, $\widehat{W}$.

Algorithm 1 Scaling a matrix A to a nearby doubly-stochastic matrix Â (Sinkhorn)

Given a matrix A, let (n, n) = size(A) and initialize r = ones(n, 1);
for k = 1 : iter
    c = 1 ./ (A' * r);
    r = 1 ./ (A * c);
end
C = diag(c); R = diag(r);
Â = R * A * C
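For reference, here is a NumPy rendering of Algorithm 1 (an illustrative sketch, not the code used for the experiments reported here), applied to a small synthetic kernel matrix; it also compares the spectra of W and its symmetrized version Ŵ, in the spirit of Fig. 17.

```python
import numpy as np

def sinkhorn(A, iters=200):
    """Scale a strictly positive matrix A to a nearby doubly-stochastic matrix R A C."""
    n = A.shape[0]
    r = np.ones(n)
    for _ in range(iters):
        c = 1.0 / (A.T @ r)      # column normalization
        r = 1.0 / (A @ c)        # row normalization
    return np.diag(r) @ A @ np.diag(c)

# Small synthetic symmetric kernel K (illustrative; not computed from a real image patch).
rng = np.random.default_rng(4)
pts = rng.random((30, 2))
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-d2 / 0.1)
W = K / K.sum(axis=1, keepdims=True)        # row-stochastic (asymmetric) filter W = D^{-1} K

W_hat = sinkhorn(W)
print("max |row sum - 1|:", np.abs(W_hat.sum(axis=1) - 1.0).max())
print("max |col sum - 1|:", np.abs(W_hat.sum(axis=0) - 1.0).max())
print("asymmetry ||W_hat - W_hat^T||_F:", np.linalg.norm(W_hat - W_hat.T))

# Compare the spectra of W and W_hat (cf. Fig. 17); both are real since W ~ D^{-1/2} K D^{-1/2}.
eig_W = np.sort(np.real(np.linalg.eigvals(W)))[::-1]
eig_Wh = np.sort(np.linalg.eigvalsh(W_hat))[::-1]
print("largest eigenvalue difference:", np.max(np.abs(eig_W - eig_Wh)))
```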

The convergence of this iterative algorithm is known to be linear [26], with rate given by the subdominant eigenvalue $\lambda_2$ of W. It is of interest to know how much the diagonal scalings will perturb the eigenvalues of W. This will indicate how accurate our approximate analysis in Section V will be, since it uses the eigenvalues of $\widehat{W}$ instead of those of W. Fortunately, Kahan [43], [22] provides a nice result. Suppose that A is a non-Hermitian matrix with eigenvalues $|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_n|$, and let $\widehat{A}$ be a perturbation of A that is Hermitian, with eigenvalues $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_n$. Then
\[
\sum_{i=1}^{n} |\lambda_i - \hat{\lambda}_i|^2 \;\le\; 2\,\|A - \widehat{A}\|_F^2.
\]

Indeed, if A and $\widehat{A}$ are both positive definite (as is the case for both K and W), the scalar factor 2 on the right-hand side of the inequality may be discarded. Specializing the result to W, defining $\Delta = \widehat{W} - W$, and normalizing by n, we have
\[
\frac{1}{n}\sum_{i=1}^{n} |\lambda_i - \hat{\lambda}_i|^2 \;\le\; \frac{1}{n}\,\|\Delta\|_F^2.
\]

The right-hand side can be further bounded as described in [44]. Hence, for patches of reasonable size (say n ≥ 11), the bound on the right-hand side is small, and the eigenvalues of the respective matrices are quite close, as demonstrated in Figure 17. Bounds may be established on the distance between the eigenvectors of W and $\widehat{W}$ as well [44]. Of course, since both matrices are row-stochastic, they share their first right eigenvector $v_1 = \hat{v}_1$, corresponding to a unit eigenvalue. The second eigenvectors are arguably more important, as they tell us about the dominant structure of the region being filtered, and the corresponding effect of the filters.



Fig. 17. Eigenvalues of row-stochastic (asymmetric) filter matrices and the corresponding doubly-stochastic (symmetric) filter matrices symmetrized using Sinkhorn's algorithm: (a) an NLM filter example; (b) a LARK filter example.

Specifically, as indicated in [28], applying Theorem 5.2.8 from [22] gives a bound on the perturbation of the second eigenvectors:
\[
\|v_2 - \hat{v}_2\| \;\le\; \frac{4\,\|\Delta\|_F}{\nu - \sqrt{2}\,\|\Delta\|_F}.
\]
It is worth noting that the bound essentially does not depend on the dimension n of the data y. This is encouraging, as it means that by approximating W with $\widehat{W}$, we get a uniformly bounded perturbation of the second eigenvector regardless of the filter window size. That is to say, the approximation is quite stable.

As a footnote, we mention the interesting fact that, when applied to a symmetric squared-distance matrix, Sinkhorn's algorithm has a rather nice geometric interpretation [45], [46]. Namely, if P is the set of points that generated the symmetric distance matrix A, then the elements of its Sinkhorn-scaled version B = ∆A∆ correspond to the squared distances between the corresponding points in a set Q, where the points in Q are obtained by a stereographic projection of the points in P. The points in Q are confined to a hyper-sphere of dimension d embedded in $\mathbb{R}^{d+1}$, where d is the dimension of the subspace spanned by the points in P.

APPENDIX C

STABILITY OF FILTER COEFFICIENTS TO PERTURBATIONS DUE TO NOISE

The statistical analysis of the non-parametric filters W is quite complicated when W is assumed stochastic. Fortunately, however, the stability results in [28] allow us to treat the resulting filters as approximately deterministic. More specifically, we assume that the noise e corrupting the data has i.i.d. samples from a symmetric (but not necessarily Gaussian) distribution with zero mean and finite variance σ². The results in [28] imply that if the noise variance σ² is small relative to the clean data z (i.e., when the signal-to-noise ratio is high), the weight matrix W(y) computed from the noisy data y = z + e is near the latent weight matrix W. That is, with the number of samples n sufficiently large [28],
\[
\|W(y) - W\|_F^2 \;\le_p\; c_1\,\sigma_e^{(2)} + c_2\,\sigma_e^{(4)}, \tag{68}
\]
for some constants $c_1$ and $c_2$, where $\sigma_e^{(2)}$ and $\sigma_e^{(4)}$ denote the second and fourth order moments of ‖e‖, respectively, and "≤_p" indicates that the inequality holds in probability. When the noise variance is sufficiently small, the fourth order moment is even smaller; hence, the change in the resulting coefficient matrix is bounded by a constant multiple of the small noise variance.
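A quick Monte-Carlo sketch of this effect is given below, using a generic NLM-style weight construction on a toy 1-D signal (the kernel, bandwidth, and signal are illustrative assumptions, not the filters or data used in this paper); the average of $\|W(y) - W(z)\|_F^2$ over noise realizations shrinks with the noise variance, as (68) suggests.

```python
import numpy as np

def nlm_like_weights(sig, h=0.5):
    """Generic NLM-style row-stochastic weights built from pairwise sample differences."""
    d2 = (sig[:, None] - sig[None, :]) ** 2
    K = np.exp(-d2 / (2 * h**2))
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(5)
z = np.sin(np.linspace(0, 3 * np.pi, 64))        # latent 1-D signal
W_latent = nlm_like_weights(z)                   # weights computed from the clean data

for sigma in (0.05, 0.1, 0.2):
    # Average squared Frobenius distance over independent noise realizations.
    dists = []
    for _ in range(50):
        y = z + sigma * rng.standard_normal(z.size)
        dists.append(np.linalg.norm(nlm_like_weights(y) - W_latent, "fro") ** 2)
    print(f"sigma^2 = {sigma**2:.4f}:  E||W(y) - W(z)||_F^2 ≈ {np.mean(dists):.4f}")
```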

Approximations to the moments of the perturbation dW = W(y) − W are also given in [28], [47]:
\[
E[dW] \;\approx\; E[dD]\,D^{-2}K \;-\; D^{-1}E[dK],
\]
\[
E[dW^2] \;\approx\; E[dD^2]\,D^{-4}K^2 \;+\; D^{-2}E[dK^2] \;-\; E[dD\,dK]\,D^{-3}\circ K,
\]
where dD = D(y) − D, dK = K(y) − K, and ◦ denotes the element-wise product. A thorough description of the perturbation analysis for the class of Gaussian data-adaptive weights is given in [28], [47]. To

summarize, for a sufficiently small noise variance, the matrix of weights can be treated, with confidence,

as a deterministic quantity which depends only on the content of the underlying latent image z.

REFERENCES

[1] A. Buades, B. Coll, and J. M. Morel, “A review of image denoising algorithms, with a new one,” Multiscale Modeling

and Simulation (SIAM interdisciplinary journal), vol. 4, no. 2, pp. 490–530, 2005.

[2] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” Proceedings of the 1998 IEEE International Conference on Computer Vision, Bombay, India, pp. 836–846, January 1998.

[3] H. Takeda, S. Farsiu, and P. Milanfar, “Kernel regression for image processing and reconstruction,” IEEE Transactions on

Image Processing, vol. 16, no. 2, pp. 349–366, February 2007.

[4] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3D transform-domain collaborative filtering,” IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2080–2095, August 2007.

[5] S. M. Smith and J. M. Brady, “SUSAN – A new approach to low level image processing,” International Journal of Computer Vision, vol. 23, no. 1, pp. 45–78, 1997.

[6] L. Yaroslavsky, Digital Picture Processing. Springer Verlag, 1987.

[7] M. P. Wand and M. C. Jones, Kernel Smoothing, ser. Monographs on Statistics and Applied Probability. London; New

York: Chapman and Hall, 1995.


[8] W. Hardle, Applied Nonparametric Regression. Cambridge, England; New York: Cambridge University Press, 1990.

[9] P. Lancaster and K. Salkauskas, “Surfaces generated by moving least squares methods,” Mathematics of Computation, vol. 37, pp. 141–158, 1981.

[10] D. Levin, “The approximation power of moving least-squares,” Mathematics of Computation, vol. 67, no. 224, pp. 1517–

1531, 1998.

[11] M. Alexa, J. Behr, D. Cohen-or, S. Fleishman, and D. Levin, “Computing and rendering point set surfaces,” IEEE

Transactions on Visualization and Computer Graphics, vol. 9, no. 1, pp. 3–15, 2003.

[12] R. Gonzalez and R. Woods, Digital Image Processing, Third Edition. Prentice Hall, 2008.

[13] M. Elad, “On the origin of the bilateral filter and ways to improve it,” IEEE Transactions on Image Processing, vol. 11,

no. 10, pp. 1141–1150, October 2002.

[14] C. Kervrann and J. Boulanger, “Optimal spatial adaptation for patch-based image denoising,” IEEE Transactions on Image Processing, vol. 15, no. 10, pp. 2866–2878, October 2006.

[15] S. P. Awate and R. T. Whitaker, “Unsupervised, information-theoretic, adaptive image filtering for image restoration,” IEEE

Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 364–376, March 2006.

[16] H. Knutsson and C.-F. Westin, “Normalized and differential convolution: Methods for interpolation and filtering of

incomplete and uncertain data,” Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern

Recognition, pp. 515–523, June 16-19 1993.

[17] A. Spira, R. Kimmel, and N. Sochen, “A short time Beltrami kernel for smoothing images and manifolds,” IEEE Transactions on Image Processing, vol. 16, no. 6, pp. 1628–1636, 2007.

[18] T. Hofmann, B. Scholkopf, and A. J. Smola, “Kernel methods in machine learning,” The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.

[19] A. Singer, Y. Shkolnisky, and B. Nadler, “Diffusion interpretation of nonlocal neighborhood filters for signal denoising,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 118–139, 2009.

[20] A. Szlam, M. Maggioni, and R. Coifman, “Regularization on graphs with function-adapted processes,” Journal of Machine Learning Research, vol. 9, pp. 1711–1739, 2008.

[21] H. J. Landau and A. M. Odlyzko, “Bounds for eigenvalues of certain stochastic matrices,” Linear Algebra and Its

Applications, vol. 38, pp. 5–15, June 1981.

[22] W. J. Stewart, Introduction to the Numerical Solution of Markov Chains. Princeton University Press, 1994.

[23] E. Seneta, Non-negative Matrices and Markov Chains, ser. Springer Series in Statistics. Springer, 1981.

[24] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1991.

[25] P. A. Knight, “The Sinkhorn-Knopp algorithm: Convergence and applications,” SIAM J. Matrix Anal. Appl., vol. 30, no. 1,

pp. 261–275, 2008.

[26] R. Sinkhorn, “A relationship between arbitrary positive matrices and doubly stochastic matrices,” The Annals of Mathematical Statistics, vol. 35, no. 2, pp. 876–879, 1964.

[27] J. Darroch and D. Ratcliff, “Generalized iterative scaling for log-linear models,” Annals of Mathematical Statistics, vol. 43,

pp. 1470–1480, 1972.

[28] L. Huang, D. Yan, M. I. Jordan, and N. Taft, “Spectral clustering with perturbed data,” Neural Information Processing

Systems (NIPS), 2008.

[29] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative

filtering,” IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2080–2095, August 2007.


[30] P. Chatterjee and P. Milanfar, “Is denoising dead?” IEEE Trans. on Image Proc., vol. 19, no. 4, pp. 895–911, April 2010.

[31] P. Perona and J. Malik, “Scale-space and edge detection using anisotropic diffusion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 9, pp. 629–639, July 1990.

[32] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin, “An iterative regularization method for total variation-based image

restoration,” Multiscale Modeling and Simulation, vol. 4, no. 2, pp. 460–489, 2005.

[33] P. Buhlmann and B. Yu, “Boosting with the l2 loss: Regression and classification,” Journal of the American Statistical

Association, vol. 98, no. 462, pp. 324–339, 2003.

[34] J. W. Tukey, Exploratory Data Analysis. Addison Wesley, 1977.

[35] D. P. Palomar and S. Verdu, “Gradient of mutual information in linear vector Gaussian channels,” IEEE Transactions on Information Theory, vol. 52, no. 1, pp. 141–154, 2006.

[36] M. Charest and P. Milanfar, “On iterative regularization and its application,” IEEE Trans. on Circuits and Systems for

Video Technology, vol. 18, no. 3, pp. 406–411, March 2008.

[37] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Transactions on Signal Processing,

vol. 41, no. 12, pp. 3397 – 3415, Dec. 1993.

[38] F. Luisier, T. Blu, and M. Unser, “A new SURE approach to image denoising: Interscale orthonormal wavelet thresholding,”

IEEE Transactions on Image Processing, vol. 16, no. 3, pp. 593–606, March 2007.

[39] Z. Wang and A. Bovik, “Mean squared error: Love it or leave it? A new look at signal fidelity measures,” IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, January 2009.

[40] L. Wasserman, All of Nonparametric Statistics. Springer, 2005.

[41] A. Bruckstein, “On globally optimal local modeling: From moving least square to overparametrization.” SIAM, 2010,

Conference on Imaging Science, Chicago, IL.

[42] M. Aharon, M. Elad, and A. Bruckstein, “The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, November 2006.

[43] W. Kahan, “Spectra of nearly Hermitian matrices,” Proceedings of the American Mathematical Society, vol. 48, pp. 11–17, 1975.

[44] P. Milanfar, “Asymptotic nearness of stochastic and doubly-stochastic matrices,” submitted.

[45] H. Jegou, C. Schmid, H. Harzallah, and J. Verbeek, “Accurate image search using the contextual dissimilarity measure,”

IEEE Trans. on Pattern Analysis and Machine Intell., vol. 32, no. 1, pp. 2–11, January 2010.

[46] C. Johnson, R. Masson, and M. Trosset, “On the diagonal scaling of Euclidean distance matrices to doubly stochastic matrices,” Linear Algebra and its Applications, vol. 397, no. 1, pp. 253–264, 2005.

[47] D. Yan, L. Huang, and M. I. Jordan, “Fast approximate spectral clustering,” 15th ACM Conference on Knowledge Discovery and Data Mining (SIGKDD), 2009.
