A Tour of Modern Image Processing
Peyman Milanfar∗
Abstract
Recent developments in computational imaging and restoration have heralded the arrival and con-
vergence of several powerful methods for adaptive processing of multidimensional data. Examples
include Moving Least Square (from Graphics), the Bilateral Filter and Anisotropic Diffusion (from
Machine Vision), Boosting and Spectral Methods (from Machine Learning), Non-local Means (from
Signal Processing), Bregman Iterations (from Applied Math), Kernel Regression and Iterative Scaling
(from Statistics). While these approaches found their inspiration in diverse fields, they are deeply connected. In this paper
• I present a practical and unified framework to understand some of the basic underpinnings of these
methods, with the intention of leading the reader to a broad understanding of how they interrelate.
I also illustrate connections between these techniques and Bayesian approaches.
• The proposed framework is used to arrive at new insights, methods, and both practical and theoretical
results. In particular, several novel optimality properties of algorithms in wide use such as BM3D,
and methods for their iterative improvement (or non-existence thereof) are discussed.
• Several theoretical results are discussed which will enable the performance analysis and subsequent
improvement of any existing restoration algorithm.
While much of the material discussed is applicable to a wider class of linear degradation models (e.g. noise, blur, etc.), in order to keep matters focused we consider the problem of denoising here.
∗Electrical Engineering Department, University of California, Santa Cruz CA, 95064 USA.
e-mail: [email protected], Phone:(650) 322-2311, Fax: (410) 322-2312
This work was supported in part by the US Air Force Grant FA9550-07-1-0365 and the National Science Foundation Grant
CCF-1016018
I. INTRODUCTION
Consider the measurement model
$$y_i = z_i + e_i, \quad \text{for } i = 1, \cdots, n, \qquad (1)$$
where $z_i = z(x_i)$ is the underlying latent signal of interest at a position $x_i = [x_{i,1}, x_{i,2}]^T$, $y_i$ is the noisy measured signal (pixel value), and $e_i$ is the corrupting zero-mean, i.i.d. white noise with variance $\sigma^2$. The problem of interest then is to recover the complete set of samples of $z(x)$, which we denote vectorially as $z = [z(x_1), z(x_2), \cdots, z(x_n)]^T$, from the corresponding data set $y$. To restate the problem more concisely, the complete measurement model in vector notation is given by1
$$y = z + e. \qquad (2)$$
It has been realized for some time now that effective restoration of signals will require methods which
either model the signal a priori (i.e. Bayesian) or learn the underlying characteristics of the signal from the
given data (i.e. learning, or non-parametric methods.) Most recently, the latter category of approaches has
become exceedingly popular. Perhaps the most striking recent example is the popularity of patch-based
methods [1], [2], [3], [4]. This new generation of algorithms exploits both local and non-local redundancies or "self-similarities" in the signals being treated. Earlier on, the bilateral filter [2] was developed with
very much the same idea in mind, as were its spiritually close predecessors the Susan Filter [5] and the
filters of Yaroslavsky [6]. The common philosophy among these and related techniques is the notion of
measuring and making use of affinities between a given data point (or more generally patch or region)
of interest, and others in the given measured signal y. These similarities are then used in a filtering
context to give higher weights to contributions from more similar data values, and to properly discount
data points that are less similar.
Despite the large amount of recent literature on techniques based on these ideas, simply put, the
key differences between the methods have been relatively minor, yet somewhat poorly understood.
In particular, the underlying framework for each of these methods is distinct only to the extent that
the weights assigned to different data points are decided upon differently. To be more concrete and
mathematically precise, let us consider the denoising problem (2) again. The estimate of the signal z(x)
1Surely a similar analysis to what follows can and should be carried out for more general inverse problems such as
deconvolution, interpolation, etc. where z would be generally replaced by Az, but these will require more detailed analysis and
will be left for future work.
at the position x is found using a (non-parametric) point estimation framework; namely, the weighted
least squares problem
$$\hat{z}(x_j) = \arg\min_{z(x_j)} \sum_{i=1}^{n} \left[y_i - z(x_j)\right]^2 K(x_i, x_j, y_i, y_j), \qquad (3)$$
where the weight (or kernel) function $K(\cdot)$ is a positive, unimodal function which measures the "similarity" between the samples $y_i$ and $y_j$, at respective positions $x_i$ and $x_j$. If the kernel function is restricted
to be only a function of the spatial locations xi and xj , then the resulting formulation is what is known
as kernel regression in the classical non-parametric statistics literature [7], [8].
Interestingly, in the early 1980’s the essentially identical concept of moving least-squares emerged
independently [9], [10]. This idea has since been widely adopted in the computer graphics community
[11] as a very effective tool for smoothing and interpolating data in 3-D. Surprisingly, despite the obvious
connections between moving least-squares and the adaptive filters based on similarity, their kinship has
remained largely hidden so far.
II. EXISTING ALGORITHMS
Over the years, the measure of similarity K(xi, xj , yi, yj) has been defined in a number of different
ways, leading to a cacophony of filters, including some of the most well-known recent approaches to
image denoising [2], [3], [1]. Figure 1 gives a graphical illustration of how different choices of similarity
kernels lead to different classes of filters, some of which we discuss next.
A. Classical Regression Filters [7], [8]:
Naturally, the most naive way to measure the ”distance” between two pixels is to simply consider their
spatial Euclidean distance; namely, using a Gaussian kernel,
$$K(x_i, x_j, y_i, y_j) = \exp\left(\frac{-\|x_i - x_j\|^2}{h_x^2}\right).$$
Such filters essentially lead to (possibly space-varying) Gaussian filters which are quite familiar from
traditional image processing [12]. While it is possible to adapt the variance parameter hx to the local
image statistics to obtain a relatively modest improvement in performance, the lack of stronger adaptivity
to the underlying structure of the signal of interest is a major drawback of these classical approaches.
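For concreteness, here is a minimal Python/NumPy sketch of this classical, space-only kernel regression filter applied to a toy noisy patch; the function name, the step-edge test signal, and the bandwidth value are arbitrary choices made only for illustration.

```python
import numpy as np

def classical_kernel_regression(y, coords, h_x=1.5):
    """Zeroth-order kernel regression (3) with the space-only Gaussian kernel above:
    each sample is replaced by a normalized, spatially weighted average of all samples."""
    z_hat = np.empty_like(y, dtype=float)
    for j in range(len(y)):
        d2 = np.sum((coords - coords[j])**2, axis=1)   # squared spatial distances ||x_i - x_j||^2
        k = np.exp(-d2 / h_x**2)                       # kernel weights K(x_i, x_j)
        z_hat[j] = k @ y / k.sum()                     # normalized weights sum to one
    return z_hat

# toy example: a noisy step edge on an 8x8 grid
rng = np.random.default_rng(0)
n = 8
xx, yy = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
coords = np.stack([xx.ravel(), yy.ravel()], axis=1).astype(float)
z = (xx.ravel() >= n // 2).astype(float)               # latent signal: a simple edge
y = z + 0.3 * rng.standard_normal(z.shape)             # noisy measurements per (1)
z_hat = classical_kernel_regression(y, coords)
```

Because the weights here ignore the pixel values entirely, the same averaging is applied across the edge as along it, which is exactly the lack of adaptivity criticized above.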
B. The Bilateral Filter (BL) [2]:
Another manifestation of the formulation in (3) is the bilateral filter where the spatial and photometric
distances between two pixels are taken into account in separable fashion as follows:
$$K(x_i, x_j, y_i, y_j) = \exp\left(\frac{-\|x_i - x_j\|^2}{h_x^2}\right)\exp\left(\frac{-(y_i - y_j)^2}{h_y^2}\right) = \exp\left\{\frac{-\|x_i - x_j\|^2}{h_x^2} - \frac{(y_i - y_j)^2}{h_y^2}\right\} \qquad (4)$$
As can be observed in the exponent on the right-hand side, and in Figure 1, the similarity metric here
is a weighted Euclidean distance between the vectors (xi, yi) and (xj , yj). This approach has several
advantages. Namely, while the kernel is easy to construct, and computationally simple to calculate, it
yields useful local adaptivity to the given data. In addition, it has only two control parameters (hx, hy),
which make it very convenient to use. Unfortunately, as is well-known, this filter does not provide effective
performance in low signal-to-noise scenarios [13].
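A minimal sketch of the bilateral filter in this form is given below, reusing the flattened-image representation (a vector y with a coordinate array coords) of the previous sketch; the bandwidths $h_x$ and $h_y$ are illustrative values only.

```python
import numpy as np

def bilateral_filter(y, coords, h_x=2.0, h_y=0.2):
    """Bilateral filtering via (3) with the separable kernel (4): a spatial Gaussian
    multiplied by a point-wise photometric Gaussian."""
    z_hat = np.empty_like(y, dtype=float)
    for j in range(len(y)):
        spatial = np.exp(-np.sum((coords - coords[j])**2, axis=1) / h_x**2)
        photometric = np.exp(-(y - y[j])**2 / h_y**2)
        k = spatial * photometric              # K(x_i, x_j, y_i, y_j) of (4)
        z_hat[j] = k @ y / k.sum()
    return z_hat
```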
C. Non-Local Means (NLM) [1], [14], [15]:
The non-local means algorithm has stirred a great deal of interest in the community in recent years.
At its core, however, it is a generalization of the bilateral filter in the sense that the photometric term in
the bilateral similarity kernel, which is measured point-wise, is simply replaced with one that is patch-
wise instead. A second difference is that (at least in theory) the geometric distance between the patches
(corresponding to the first term in the bilateral similarity kernel), is essentially ignored, leading to strong
contribution from patches that may not be physically near the pixel of interest (hence the name non-local).
To summarize, the NLM kernel is
$$K(x_i, x_j, y_i, y_j) = \exp\left(\frac{-\|x_i - x_j\|^2}{h_x^2}\right)\exp\left(\frac{-\|\mathbf{y}_i - \mathbf{y}_j\|^2}{h_y^2}\right) \quad \text{with } h_x \longrightarrow \infty, \qquad (5)$$
where $\mathbf{y}_i$ and $\mathbf{y}_j$ now refer to patches of pixels centered at pixels $y_i$ and $y_j$, respectively. In practice, two implementation details should be observed. First, the patch-wise photometric distance $\|\mathbf{y}_i - \mathbf{y}_j\|^2$ in the above is in fact measured as $(\mathbf{y}_i - \mathbf{y}_j)^T G\,(\mathbf{y}_i - \mathbf{y}_j)$, where $G$ is a fixed diagonal matrix containing
Gaussian weights which give higher importance to the center of the respective patches. Second, it is
rather impractical to consider all the patches yi within an image as possibly close to yj , so typically the
search is limited to a reasonable spatial neighborhood of yj , which means that in effect the NLM filter
too works more or less locally; said another way, hx is never infinite in practice.
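The following sketch reflects this practical form of NLM: a Gaussian-masked patch distance, with the search confined to a local window so that $h_x$ is effectively finite. The patch radius, search radius, and $h_y$ are illustrative values for an image with intensities in [0, 1].

```python
import numpy as np

def nlm_denoise(img, patch_radius=3, search_radius=7, h_y=0.15):
    """Minimal NLM sketch per (5): patch-wise photometric distances weighted by a Gaussian
    mask G, with the search limited to a local window around each pixel."""
    H, W = img.shape
    p = patch_radius
    pad = np.pad(img, p, mode="reflect")
    ax = np.arange(-p, p + 1)
    G = np.exp(-(ax[:, None]**2 + ax[None, :]**2) / (2.0 * p**2))   # center-weighted mask G
    G /= G.sum()
    out = np.zeros_like(img, dtype=float)
    for i in range(H):
        for j in range(W):
            ref = pad[i:i + 2*p + 1, j:j + 2*p + 1]
            num, den = 0.0, 0.0
            for ii in range(max(0, i - search_radius), min(H, i + search_radius + 1)):
                for jj in range(max(0, j - search_radius), min(W, j + search_radius + 1)):
                    cand = pad[ii:ii + 2*p + 1, jj:jj + 2*p + 1]
                    d2 = np.sum(G * (ref - cand)**2)    # (y_i - y_j)^T G (y_i - y_j)
                    w = np.exp(-d2 / h_y**2)
                    num += w * img[ii, jj]
                    den += w
            out[i, j] = num / den
    return out
```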
Despite its popularity, the performance of the NLM filter leaves much to be desired. The true potential of
this filtering scheme was demonstrated only later with the Optimal Spatial Adaptation (OSA) approach of
Boulanger and Kervrann [14]. In their approach, the photometric distance was refined to include estimates
of the local noise variances within each patch. Namely, they computed a local diagonal covariance matrix,
and defined the locally adaptive photometric distance as $(\mathbf{y}_i - \mathbf{y}_j)^T V_j^{-1} (\mathbf{y}_i - \mathbf{y}_j)$ in such a way as to
minimize an estimate of the local mean-squared error. Furthermore, they considered iterative applications
of the filter as discussed in Section V.
Fig. 1. Similarity metrics and the resulting filters: the spatial distance (classic kernel regression), the Euclidean distance (bilateral filter), the photometric distance (non-local means), and the geodesic distance (LARK).
D. Locally Adaptive Regression (Steering) Kernels (LARK) [3]:
The key idea behind this measure of similarity, originally proposed in [3] is to robustly obtain the local
structure of images by analyzing the photometric (pixel value) differences based on estimated gradients,
and to use this structure information to adapt the shape and size of a canonical kernel. The LARK kernel
is defined as follows:
$$K(x_i, x_j, y_i, y_j) = \exp\left\{-(x_i - x_j)^T C_i\,(x_i - x_j)\right\}, \qquad (6)$$
where the matrix $C_i = C(y_i, y_j)$ is estimated from the given data as
$$C_i = \sum_j \begin{bmatrix} z_{x_{i,1}}^2(x_j) & z_{x_{i,1}}(x_j)\, z_{x_{i,2}}(x_j) \\ z_{x_{i,1}}(x_j)\, z_{x_{i,2}}(x_j) & z_{x_{i,2}}^2(x_j) \end{bmatrix}.$$
Specifically, $z_{x_{i,*}}(x_j)$ are the estimated gradients of the underlying signal at point $x_i$, computed from the given measurements $y_j$ in a patch around the point of interest. The reader may recognize the above as
the well-studied “structure tensor” [16]. The advantage of the LARK descriptor is that it is exceedingly
robust to noise and perturbations of the data. The formulation is also theoretically well-motivated since
the quadratic exponent in (6) essentially encodes the local geodesic distance between the points (xi, yi)
and (xj, yj) on the graph of the function z(x, y) thought of as a manifold embedded in 3-D. The geodesic
distance was also used in the context of the Beltrami-Flow kernel in [17] in an analogous fashion.
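To give a sense of how the structure tensor enters the computation, here is a minimal sketch that estimates gradients over a patch, accumulates a regularized version of $C_i$, and evaluates the steering kernel (6) relative to the patch center. The regularization constant eps and the extra smoothing parameter h are additions of mine for numerical convenience, not part of the formulation above.

```python
import numpy as np

def lark_kernel(patch, center, h=1.0, eps=1e-4):
    """Sketch of the LARK (steering) kernel (6): build the structure tensor C_i from
    estimated gradients, then evaluate exp{-(x_i - x_j)^T C_i (x_i - x_j)} for every
    pixel x_i relative to the patch center x_j."""
    gy, gx = np.gradient(patch.astype(float))          # gradient estimates over the patch
    C = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]]) / patch.size + eps * np.eye(2)
    H_, W_ = patch.shape
    yy, xx = np.meshgrid(np.arange(H_), np.arange(W_), indexing="ij")
    dx = np.stack([yy - center[0], xx - center[1]], axis=-1).astype(float)
    quad = np.einsum("...i,ij,...j->...", dx, C, dx)    # (x_i - x_j)^T C_i (x_i - x_j)
    return np.exp(-quad / h**2)

# usage (illustrative): K = lark_kernel(noisy_patch, center=(7, 7))
```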
III. GENERALIZATIONS AND PROPERTIES
The above discussion can be naturally generalized by defining the augmented data variable $t = [x^T, y^T]^T$ and a general Gaussian kernel as follows:
$$K(t_i, t_j) = \exp\left\{-(t_i - t_j)^T Q\,(t_i - t_j)\right\}, \qquad (7)$$
$$Q = \begin{bmatrix} Q_x & 0 \\ 0 & Q_y \end{bmatrix}, \qquad (8)$$
where Q is symmetric positive definite (SPD).
With $Q_x = \frac{1}{h_x^2} I$ and $Q_y = 0$ we have classical kernel regression, whereas one obtains the bilateral filter framework when $Q_x = \frac{1}{h_x^2} I$ and $Q_y = \frac{1}{h_y^2}\,\mathrm{diag}[0, 0, \cdots, 1, \cdots, 0, 0]$, with the latter diagonal matrix picking out the center pixel in the patch defined by $t_i - t_j$. When $Q_x = 0$ and $Q_y = \frac{1}{h_y^2} G$, we have the non-local means filter and its variants. Finally, the LARK kernel in (6) is obtained when $Q_x = C_i$ and $Q_y = 0$. More generally, the matrix $Q$ can be selected so that it has non-zero off-diagonal block
elements. However, no practical algorithms with this choice have been proposed so far. As detailed below,
with a symmetric positive definite Q, this general approach results in valid symmetric positive definite
kernels, a property that is used throughout the rest of our discussion2.
Next, we present a more general framework for selecting the similarity functions, which will help us
identify the wider class of admissible kernels. This would also allow us to define similarity kernels more
formally, and to understand how to produce new kernels from ones already defined [18]. Formally, a
scalar-valued function K(t, s) over a compact region of its domain Rn, is called an admissible kernel if
• K is symmetric: K(t, s) = K(s, t)
2The definition of t given here is only one of many possible choices. Our treatment in this paper is equally valid when, for
instance, t = T (x,y) is any feature derived from a convenient linear or nonlinear transformation of the original data [12].
• K is positive definite. That is, for any collection of points ti, i = 1, · · · , n, the Gram matrix
Ki,j = K(ti, tj) is positive definite.
Such kernels satisfy some useful properties such as positivity, $K(t, t) \geq 0$, and the Cauchy-Schwarz inequality $K(t, s) \leq \sqrt{K(t, t)\, K(s, s)}$.
Another way to characterize SPD kernels is using Mercer’s Theorem ([18]). Namely, suppose that
K(t, s) is a SPD kernel, and consider the integral operator
$$\mathcal{K} f = \int K(t, s)\, f(s)\, ds. \qquad (9)$$
This operator should be understood as simply a filter applied to the signal $f$ (without normalization), or equivalently in the discrete case as the numerator of the first term on the right-hand side of (12) below. If for any square-integrable function $f(t)$ we have that
$$\int K(t, s)\, f(t)\, f(s)\, dt\, ds \geq 0,$$
then Mercer's theorem [18] states that there is a spectral representation (i.e. eigen-decomposition)
$$K(t, s) = \sum_{i=0}^{\infty} \mu_i\, \psi_i(t)\, \psi_i(s), \qquad (10)$$
where $\{\mu_i\}$ are the (non-negative) eigenvalues corresponding to an orthonormal basis $\{\psi_i\}$ of $L^2$, which consists of the eigenfunctions of the operator $\mathcal{K}f$. In essence, $K(t, s)$ is a continuous functional equivalent of a symmetric positive-definite matrix.
With the above definitions in place, there are numerous ways to construct new valid kernels from
existing ones. We list some of the most useful ones below, without proof [18]. Given two valid kernels
K1(t, s) and K2(t, s), the following constructions yield admissible kernels:
1) K(t, s) = αK1(t, s) + βK2(t, s) for any pair α, β ≥ 0
2) K(t, s) = K1(t, s)K2(t, s)
3) K(t, s) = k(t) k(s), where k(·) is a scalar-valued function.
4) K(t, s) = p (K1(t, s)), where p(·) is a polynomial with positive coefficients.
5) K(t, s) = exp(K1(t, s))
Furthermore, $t$ may be considered to be any feature vector $t = T(x, y)$ constructed from the data (e.g. $t = [x^T, y^T]^T$ being a rather trivial example).
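These construction rules are easy to check numerically: the sketch below builds a few new kernels from two standard admissible ones and verifies that the resulting Gram matrices have (numerically) non-negative eigenvalues. The specific base kernels and the random feature vectors are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T = rng.standard_normal((20, 4))          # 20 feature vectors t_i (e.g. t = [x^T, y^T]^T)

def gram(kernel, T):
    """Gram matrix K_ij = K(t_i, t_j) for a candidate kernel."""
    return np.array([[kernel(ti, tj) for tj in T] for ti in T])

k1 = lambda t, s: np.exp(-np.sum((t - s)**2))     # Gaussian kernel (admissible)
k2 = lambda t, s: (1.0 + t @ s)**2                # polynomial kernel (admissible)

# New kernels built with the construction rules above
k_sum  = lambda t, s: 0.5 * k1(t, s) + 2.0 * k2(t, s)   # rule 1: positive combination
k_prod = lambda t, s: k1(t, s) * k2(t, s)               # rule 2: product
k_exp  = lambda t, s: np.exp(k1(t, s))                  # rule 5: exponential

for name, k in [("sum", k_sum), ("product", k_prod), ("exp", k_exp)]:
    eigmin = np.linalg.eigvalsh(gram(k, T)).min()
    print(f"{name:8s} kernel: smallest Gram eigenvalue = {eigmin:.2e} (non-negative up to round-off)")
```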
Whatever the choice of the kernel function, the weighted least-square optimization problem (3) has a
simple solution. In matrix notation we can write
$$\hat{z}(x_j) = \arg\min_{z(x_j)} \left[y - z(x_j)\mathbf{1}_n\right]^T K_j \left[y - z(x_j)\mathbf{1}_n\right] \qquad (11)$$
where $\mathbf{1}_n = [1, \cdots, 1]^T$, and $K_j = \mathrm{diag}\left[K(x_1, x_j, y_1, y_j),\ K(x_2, x_j, y_2, y_j),\ \cdots,\ K(x_n, x_j, y_n, y_j)\right]$. The closed form solution to the above is
$$
\begin{aligned}
\hat{z}(x_j) &= \left(\mathbf{1}_n^T K_j \mathbf{1}_n\right)^{-1} \mathbf{1}_n^T K_j\, y \qquad (12)\\
&= \left(\sum_i K(x_i, x_j, y_i, y_j)\right)^{-1} \left(\sum_i K(x_i, x_j, y_i, y_j)\, y_i\right)\\
&= \sum_i \frac{K(x_i, x_j, y_i, y_j)}{\sum_i K(x_i, x_j, y_i, y_j)}\, y_i\\
&= \sum_i W(x_i, x_j, y_i, y_j)\, y_i\\
&= w_j^T y. \qquad (13)
\end{aligned}
$$
So in general, the estimate $\hat{z}(x_j)$ of the signal at position $x_j$ is given by a weighted sum of all the given data points $y(x_i)$, each contributing a weight commensurate with its similarity, as indicated by $K(\cdot)$, with the measurement $y(x_j)$ at the position of interest. Furthermore, as should be apparent in (12), the
weights sum to one. To control computational complexity, or to design local versus non-local filters,
we may choose to set the weight for ”sufficiently far-away” pixels to zero or a small value, leading to
a weighted sum involving a relatively small neighborhood of data points in a properly defined vicinity
of the sample of interest. This is essentially the only distinction between locally adaptive processing
methods (such as BL, and LARK) and so-called non-local methods such as NLM. It is worth noting that
in the formulation above, despite the simple form of (12), in general we have a nonlinear estimator since
the weights W (xi, xj, yi, yj) depend on the noisy data3.
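In code, the entire family of filters discussed so far reduces to assembling a weight matrix from a chosen kernel and applying it to the data. The sketch below does this naively (it is O(n²) and intended only for small patches); the example bilateral kernel and its parameter values are illustrative choices, not a reference implementation.

```python
import numpy as np

def build_filter_matrix(y, coords, kernel):
    """Assemble the matrix of weights whose j-th row is the vector w_j^T of (13), for any
    admissible similarity kernel K(x_i, x_j, y_i, y_j) (bilateral, NLM, LARK, ...)."""
    n = len(y)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(coords[i], coords[j], y[i], y[j])
    K = 0.5 * (K + K.T)                          # the kernels above are symmetric; enforce it numerically
    W = K / K.sum(axis=1, keepdims=True)         # normalize so each row of weights sums to one
    return W

# example kernel and usage (illustrative parameter values):
bilateral = lambda xi, xj, yi, yj: np.exp(-np.sum((xi - xj)**2) / 4.0) * np.exp(-(yi - yj)**2 / 0.04)
# W = build_filter_matrix(y, coords, bilateral);  z_hat = W @ y   # the estimate (13) for every pixel
```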
To summarize the discussion so far, we have presented a general framework which absorbs many
existing algorithms as special cases. This was done in several ways, including a general description of
the set of admissible similarity kernels, which allows the construction of a wide variety of new kernels not
considered before in the image processing literature. Next, we turn our attention to the matrix formulation
of the non-parametric filtering approach. As we shall see, this provides a framework for more in-depth
and intuitive understanding of the resulting filters, their subsequent improvement, and for their respective
asymptotic and numerical properties.
3It is worth noting that the non-parametric approach in (3) can be further extended to include a more general model of the
signal z(x) in a desired basis. We briefly discuss this case in the appendix, but leave its full treatment for future research.
IV. THE MATRIX FORMULATION AND ITS PROPERTIES
In this section, we analyze the filtering problems posed earlier in the language of linear algebra and
make several theoretical and practical observations. In particular, we are able to not only study the
numerical/algebraic properties of the resulting filters, but also to analyze some of their fundamental
statistical properties.
To begin, recall the convenient vector form of the filters:
$$\hat{z}(x_j) = w_j^T y, \qquad (14)$$
where $w_j^T = [W(x_1, x_j, y_1, y_j),\ W(x_2, x_j, y_2, y_j),\ \cdots,\ W(x_n, x_j, y_n, y_j)]$ is a vector of weights for each $j$. Writing the above at once for all $j$ we have
$$\hat{z} = \begin{bmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_n^T \end{bmatrix} y = W\, y. \qquad (15)$$
As such, the filters defined by the above process can be analyzed as the product of a matrix of weights
W with the vector of the given data y. First, a notational matter: W is in general a function of the data,
so strictly speaking, the notation W(y) would be more descriptive; but we will suppress this dependence
in what follows, in the interest of a simpler presentation.
Next, we highlight some important properties of the matrix W, which lend much insight to our analysis.
Referring to (12), W can be written as the product
$$W = D^{-1} K, \qquad (16)$$
where $K_{ij} = K(x_i, x_j, y_i, y_j)$ and $D$ is a positive definite diagonal matrix with the normalizing factors $D_{jj} = \sum_i K(x_i, x_j, y_i, y_j)$ along its diagonal. We can write
$$W = D^{-1} K = D^{-1/2} \underbrace{D^{-1/2} K D^{-1/2}}_{L}\, D^{1/2}. \qquad (17)$$
It is evident that since K and D are symmetric positive definite, then so is L. Meanwhile, W and L
are related through a diagonal similarity transformation, and therefore have the same eigenvalues. Hence,
interestingly, W has real, non-negative eigenvalues (even though it is not symmetric!)4 An alternative,
though not necessarily more intuitive, description of the action of W is possible in terms of graphical
4That is, W is positive definite but not symmetric.
models. Namely, if we consider the data (xi, yi) to be nodes on a weighted graph with weights between
nodes $i$ and $j$ given by the kernel values $K(x_i, x_j, y_i, y_j)$, the matrix $\mathcal{L} = L - I$ is precisely the "graph Laplacian" [19], [20]5,6.
Next, we note that W is a positive row-stochastic matrix – namely, by definition, the sum of each of
its rows is equal to one, as should be clear from (12). This implies [21], [22] that W has spectral radius $0 \leq \lambda(W) \leq 1$. Indeed, the largest eigenvalue of $W$ is simple (unrepeated) and is exactly $\lambda_1 = 1$, with corresponding eigenvector $v_1 = (1/\sqrt{n})[1, 1, \cdots, 1]^T = (1/\sqrt{n})\,\mathbf{1}_n$. Intuitively, this means that filtering by $W$ will leave a constant signal (i.e., a "flat" image) unchanged. In fact, with its spectrum inside the unit disk, $W$ is an ergodic matrix [23], [24], which is to say that its powers converge to a matrix of rank one, with identical rows. More explicitly, using the eigenvalue decomposition of $W$ we can write
$$W^k = V S^k V^{-1} = V S^k U^T = \sum_{i=1}^{n} \lambda_i^k\, v_i u_i^T,$$
where we have defined $U^T = V^{-1}$, while denoting by $u_i^T$ the left eigenvectors of $W$, which are the columns of $U$. Therefore,
$$\lim_{k \to \infty} W^k = \mathbf{1}_n u_1^T.$$
So u1 summarizes the asymptotic effect of applying the filter W many times.
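These properties are easy to verify numerically; the short sketch below does so for a small synthetic Gaussian kernel matrix built from random points, an arbitrary choice made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
pts = rng.standard_normal((30, 2))
d2 = np.sum((pts[:, None, :] - pts[None, :, :])**2, axis=-1)
K = np.exp(-d2 / 10.0)                              # symmetric positive definite, positive entries
D = np.diag(K.sum(axis=1))
W = np.linalg.solve(D, K)                           # W = D^{-1} K: row-stochastic, not symmetric

ev = np.linalg.eigvals(W)
print(np.allclose(ev.imag, 0.0))                    # eigenvalues are real ...
print(ev.real.min() > -1e-12, np.isclose(ev.real.max(), 1.0))   # ... non-negative, and the largest is 1
print(np.allclose(W.sum(axis=1), 1.0))              # W is row-stochastic
Wk = np.linalg.matrix_power(W, 500)
print(np.abs(Wk - Wk[0]).max())                     # ~0: W^k approaches the rank-one matrix 1_n u_1^T
```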
As we made clear earlier, W = D−1K is generically not a symmetric matrix, though it always
has real, positive eigenvalues. As such, it should not come as a surprise that W is quite close to a
symmetric positive definite matrix. As we illustrate in the Appendix, this is indeed the case. As such,
we shall find it useful to approximate W with such a symmetric positive definite matrix. To effect this
approximation requires a bit of care, as the resulting matrix must still be row-stochastic. Fortunately,
this is not difficult to do. Sinkhorn’s algorithm [25], [26] for matrix scaling provides a provably optimal
[27] way to accomplish this task7. The obtained matrix is a close approximation of a given W which
is both symmetric positive definite, and now doubly (i.e., row- and column-) stochastic. Working with
5The name is not surprising. Consider the canonical case in which W is a low-pass filter (though possibly nonlinear and
space-varying) filter. Since W and L share the same spectrum, L is also a low-pass filter. Therefore, L = L− I is generically
a high-pass filter, justifying the Laplacian label from the point of view of filtering.
6Another interpretation of the matrix W is in terms of Markov chains. Namely W is a probability transition matrix for a
Markov chain defined on the entries of the data vector y [19], [20].
7We refer the reader to the appendix and citations therein, where we illustrate and justify both the approximation procedure,
and its fidelity in detail.
a symmetric (or rather ”symmetrized”) W makes possible much of the analysis that follows in the
remainder of this paper8.
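For readers who want to experiment, the following is an illustrative Python/NumPy stand-in for this symmetrization step. It is only a sketch of the alternating row/column normalization idea behind Sinkhorn-type scaling; the actual procedure used here, and its optimality properties, are those described in the Appendix and in [25]-[27].

```python
import numpy as np

def sinkhorn_symmetrize(W, iters=200):
    """Illustrative Sinkhorn-style scaling: starting from the row-stochastic W = D^{-1} K
    (with K symmetric), alternately normalize columns and rows until the matrix is
    (approximately) doubly stochastic, then enforce symmetry."""
    A = W.copy()
    for _ in range(iters):
        A /= A.sum(axis=0, keepdims=True)     # normalize columns
        A /= A.sum(axis=1, keepdims=True)     # normalize rows
    return 0.5 * (A + A.T)                    # symmetrize; row/column sums remain (approximately) one

# usage: W_sym = sinkhorn_symmetrize(W)   # symmetric and doubly stochastic to numerical precision
```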
From this point forward therefore, we shall consider W to be a symmetric positive definite, and
(doubly-) stochastic matrix. The spectrum of W determines the effect of the filter on the noisy signal y.
We write the eigenvalue decomposition:
$$W = V S V^T \qquad (18)$$
where $S = \mathrm{diag}[\lambda_1, \cdots, \lambda_n]$ contains the eigenvalues in decreasing order $0 \leq \lambda_n \leq \cdots \leq \lambda_1 = 1$.
As an example, we illustrate the spectrum of the LARK [3] kernel in Figure 2 for different types of
patches. As can be observed, the spectrum of W decays rather quickly for ”flat” patches, indicating
that the filter is very aggressive in denoising these types of patches. As we consider increasingly more
complex anisotropic patches containing edges, corners, and various textures, the spectrum of the filter is
(automatically) adjusted to be less aggressive in particular directions. To take the specific case of a patch
containing a simple edge, the filter is adjusted in unsupervised fashion to perform strong denoising along
the edge, and to preserve and enhance the structure of the patch across the edge. This type of behavior
is typical of the adaptive non-parametric filters we have discussed so far, but it is particularly striking
and stable in the case of LARK kernels shown in Figure 2.
Of course, the matrix W is a function of the noisy data y and hence it is strictly speaking a stochastic
variable. But another interesting property of W is that when the similarity kernel K(·) is of the general
Gaussian type as in (7), as described in the Appendix, the resulting filter coefficients W are quite stable
to perturbations of the data by noise of modest size [28]. From a practical point of view, and insofar as
the computation of the matrix W is concerned, it is always reasonable to assume that the noise variance
is relatively small, because in practice we typically compute W on a “pre-processed” version of the noisy
image y anyway9. Going forward, therefore, we consider W as fixed. Some experiments in Section V
confirm the validity of this assumption.
A. Statistical Analysis of the Filters’ Performance
We are now in a position to carry out an analysis of the performance of filters defined earlier. In
particular, let us compute an approximation to the bias and variance of the estimator z = Wy first. The
8Interestingly, there is an inherent and practical advantage in using symmetrization. As we will see in the experimental results,
given any W, employing its the symmetrized version generally improves performance in the mean-squared error sense. We do
not present a proof of this observation here, and leave its analysis for a forthcoming submission.
9This ”small noise” assumption is made only for the analysis of the filter coefficients, and is not invoked in rest of the paper.
Fig. 2. Spectrum of the LARK filter on different types of patches
bias in the estimate is
$$\mathrm{bias} = E(\hat{z}) - z = E(Wy) - z \approx Wz - z = (W - I)\, z.$$
Recalling that the matrix $W$ has a unit eigenvalue corresponding to a (constant) vector, we note that $\hat{z}$ is an unbiased estimate if $z$ is a constant image, but is biased for all other underlying images. The squared magnitude of the bias is given by
$$\|\mathrm{bias}\|^2 = \|(W - I)\, z\|^2. \qquad (19)$$
Writing the image $z$ in the column space of the orthogonal matrix $V$ as $z = V b_0$, we can rewrite the squared bias magnitude as
$$\|\mathrm{bias}\|^2 = \|(W - I) z\|^2 = \|V(S - I) b_0\|^2 = \|(S - I) b_0\|^2 = \sum_{i=1}^{n} (\lambda_i - 1)^2 b_{0i}^2. \qquad (20)$$
We note that the last sum can be taken over the indices $i = 2, \cdots, n$, since the first term vanishes because $\lambda_1 = 1$. Next, we consider the variance of the estimate. We have
$$\mathrm{cov}(\hat{z}) = \mathrm{cov}(Wy) \approx \mathrm{cov}(We) = \sigma^2 W W^T \;\Longrightarrow\; \mathrm{var}(\hat{z}) = \mathrm{tr}(\mathrm{cov}(\hat{z})) = \sigma^2 \sum_{i=1}^{n} \lambda_i^2.$$
Overall, the mean-squared-error (MSE) is given by
$$\mathrm{MSE} = \|\mathrm{bias}\|^2 + \mathrm{var}(\hat{z}) = \sum_{i=1}^{n} (\lambda_i - 1)^2 b_{0i}^2 + \sigma^2 \lambda_i^2. \qquad (21)$$
a) BM3D [29] Explained : Let us pause for a moment and imagine that we can optimally ”design”
the matrix W from scratch in such a way as to minimize the mean-squared error above. That is, consider
the set of eigenvalues for which (21) is minimized. Differentiating the expression for MSE and setting
it equal to zero leads to a familiar expression:
$$\lambda_i^* = \frac{b_{0i}^2}{b_{0i}^2 + \sigma^2} = \frac{1}{1 + \mathrm{snr}_i^{-1}}. \qquad (22)$$
This is of course nothing but a manifestation of the Wiener filter, with $\mathrm{snr}_i = \frac{b_{0i}^2}{\sigma^2}$ denoting the signal-to-noise ratio at each component $i$ of the signal. This observation naturally raises the question of whether
any existing filters in fact get close to this optimal performance [30]. An example filter which is designed
to approximately achieve this goal is BM3D [29], which is considered at present to be the state of the
art in denoising. The BM3D algorithm can be briefly summarized by the following three steps:
1) Patches from an input image are classified according to their similarity, and are grouped into 3-D
clusters accordingly.
2) A so-called ”3D collaborative Wiener filtering” is implemented to process each cluster.
3) Filtered patches inside all the clusters are aggregated to form an output.
The core process of the BM3D algorithm is the collaborative Wiener filtering that transforms a patch
cluster through a fixed orthonormal basis V (namely DCT) [29], where the coefficients of each component
are then scaled by some ”shrinkage” factor {λi}, which are precisely chosen according to (22). Following
this shrinkage step, the data are transformed back to the spatial domain to generate denoised patches for
the aggregation process. In practice, of course, one does not have access to {b20i}, so they are roughly
estimated from the given noisy image y, as is done also in [29]. Though BM3D is not directly based upon
a regression framework as other patch-based methods (such as NLM, LARK) are, the diagonal Wiener
shrinkage matrix $S = \mathrm{diag}[\lambda_1, \cdots, \lambda_n]$ and the overall 3-D collaborative Wiener filtering process can now be understood using a symmetric, positive-definite filter matrix $W$ with orthonormal eigen-decomposition $V S V^T$. If the orthonormal basis $V$ contains a constant vector $v_{i'} = \frac{1}{\sqrt{n}}\mathbf{1}_n$, one can easily make $W$ doubly-stochastic by setting its corresponding shrinkage factor $\lambda_{i'} = 1$.
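The shrinkage view of (22) can be sketched in a few lines: transform with a fixed orthonormal basis (a DCT, as in BM3D), scale each coefficient by an estimated Wiener factor, and transform back. The crude plug-in estimate of $b_{0i}^2$ below is only an illustration of the idea, not the actual BM3D estimator, and the 1-D DCT stands in for the 3-D transform used on patch clusters.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis (rows are basis vectors); note the first row is the
    constant vector (1/sqrt(n)) 1_n mentioned above."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    V = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    V[0] *= 1.0 / np.sqrt(2.0)
    return V

def wiener_shrink(y_vec, sigma):
    """Sketch of the shrinkage view of (22): project onto an orthonormal basis V, scale each
    coefficient by lambda_i = b_i^2 / (b_i^2 + sigma^2), and project back. The clean energies
    b_{0i}^2 are unknown and are crudely estimated from the noisy coefficients here."""
    V = dct_matrix(len(y_vec))
    b = V @ y_vec
    b2 = np.maximum(b**2 - sigma**2, 0.0)     # rough plug-in estimate of b_{0i}^2
    lam = b2 / (b2 + sigma**2)                # Wiener shrinkage factors of (22)
    return V.T @ (lam * b)
```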
To summarize this section, we have presented a widely applicable matrix formulation z = Wy for the
denoising problem, where W is henceforth considered symmetric, positive-definite, and doubly-stochastic.
This formulation has so far allowed us to characterize the performance of the resulting filters in terms
of their spectra, and to obtain insights as to the relative statistical efficiency of existing filters such as
BM3D. One final observation worth making here is that the Birkhoff – von Neumann Theorem [24]
(p. 527) notes that W is doubly stochastic if and only if it is a convex combination of permutation
matrices. Namely, $W = \sum_{l=1}^{N} \alpha_l P_l$, where $P_l$ are permutation matrices, $\alpha_l$ are non-negative scalars with $\sum \alpha_l = 1$, and the number of terms $N$ is at most $(n-1)^2 + 1$. This means that regardless of the choice of an admissible kernel, each pixel of the estimated image
$$\hat{z} = W\, y = \sum_{l=1}^{N} \alpha_l (P_l\, y)$$
is always a convex linear combination of at most $N = (n-1)^2 + 1$ (not necessarily nearby) pixels of the noisy input.
Going forward, we first use the insights gained thus far to study further improvements of the non-
parametric approach. Then, we describe the relationship between the non-parametric approaches and
classical Bayesian methods.
V. IMPROVING THE ESTIMATE: DIFFUSION, TWICING, AND BREGMAN ITERATIONS
As we described above, the optimal spectrum for the filter matrix W is dictated by the Wiener condition
(22), which requires clairvoyant knowledge of the signal or at least careful estimation of the SNR at each
spatial frequency. In practice, however, nearly all similarity-based methods we described so far (with
the notable exception of the BM3D) design a matrix W based on a choice of the kernel function, and
without regard to the ultimate consequences of this choice on the spectrum of the resulting filter. Hence,
at least in the mean-squared error sense, such choices of W are always deficient. So one might naturally
wonder if such estimates can be improved through some post-facto iterative mechanism. The answer is
an emphatic yes, and this is the subject of this section.
To put the problem in more practical terms, the estimate derived from the application of a non-
parametric denoising filter W will not only remove some noise, but invariably some of the underlying
signal as well. A number of different approaches have been proposed to reintroduce this ”lost” component
back to the estimate. Interestingly, as we describe later in Section VI, similar to the steepest descent
methods employed in the solution of more classical Bayesian optimization approaches, the iterative
methods employed here involve multiple applications of a filter (or sequence of filters) on the data, and
somehow pooling the results. Next we discuss the two most well-known and frequently used approaches:
anisotropic diffusion [31]; and the recently popularized Bregman iterations [32], which is closely related
to L2-boosting [33]. The latter is a generalization of twicing introduced by Tukey [34] more than thirty
years ago.
A. Diffusion
So far, we have described a general formulation for denoising as a spatially adaptive, data-dependent
filtering procedure (3) amounting to
z = Wy.
Now consider applying the filter multiple times. That is, we define $\hat{z}_0 = y$, and the iteration
$$\hat{z}_k = W \hat{z}_{k-1} = W^k y. \qquad (23)$$
The net effect of each application of $W$ is essentially a step of anisotropic diffusion [31], [19]. This can be seen as follows. From the iteration (23) we have
$$
\begin{aligned}
\hat{z}_k &= W \hat{z}_{k-1} \qquad &(24)\\
&= \hat{z}_{k-1} - \hat{z}_{k-1} + W \hat{z}_{k-1} \qquad &(25)\\
&= \hat{z}_{k-1} + (W - I)\, \hat{z}_{k-1}, \qquad &(26)
\end{aligned}
$$
which we can rewrite as
$$\hat{z}_k - \hat{z}_{k-1} = (W - I)\, \hat{z}_{k-1}. \qquad (27)$$
Recall that $W = D^{-1/2} L D^{1/2}$. Hence we have
$$W - I = D^{-1/2} (L - I)\, D^{1/2} = D^{-1/2} \mathcal{L}\, D^{1/2},$$
where $\mathcal{L}$ is the graph Laplacian operator mentioned earlier [20], [19]. Defining the normalized variable $\tilde{z}_k = D^{1/2} \hat{z}_k$, (27) embodies a discrete version of anisotropic diffusion:
$$\tilde{z}_k - \tilde{z}_{k-1} = \mathcal{L}\, \tilde{z}_{k-1} \;\longleftrightarrow\; \frac{\partial z(t)}{\partial t} = \nabla^2 z(t) \quad \text{(Diffusion Eqn.)} \qquad (28)$$
where the left-hand side of the above is a discretization of the derivative operator $\frac{\partial z(t)}{\partial t}$.
Returning to the diffusion estimator (23), the bias in the estimate after k iterations is
$$\mathrm{bias}_k = E(\hat{z}_k) - z = E(W^k y) - z = W^k z - z = (W^k - I)\, z.$$
Recalling that the matrix W has a unit eigenvalue corresponding to a (constant) vector, we note that since
Wk − I has the same eigenvectors as W, the above sequence of estimators produce unbiased estimates
of constant images, but are biased for all other underlying images z.
The squared magnitude of the bias is given by
$$\|\mathrm{bias}_k\|^2 = \|(W^k - I)\, z\|^2. \qquad (29)$$
Writing the image $z$ in the column space of $V$ as $z = V b_0$, as before we have
$$\|\mathrm{bias}_k\|^2 = \|(W^k - I) z\|^2 = \|V(S^k - I) b_0\|^2 = \sum_{i=1}^{n} (\lambda_i^k - 1)^2 b_{0i}^2. \qquad (30)$$
As $k$ grows, so does the magnitude of the bias, ultimately converging to $\|b_0\|^2 - b_{01}^2$. This behavior is consistent with what is observed in practice; namely, increasing iterations of diffusion produce more and more blurry (biased) results. Next, we derive the variance of the diffusion estimator:
$$\mathrm{cov}(\hat{z}_k) = \mathrm{cov}(W^k y) = \mathrm{cov}(W^k e) = \sigma^2 W^k (W^k)^T,$$
which gives
$$\mathrm{var}(\hat{z}_k) = \mathrm{tr}(\mathrm{cov}(\hat{z}_k)) = \sigma^2 \sum_{i=1}^{n} \lambda_i^{2k}.$$
As $k$ grows, the variance shrinks, tending to the constant value $\sigma^2$ contributed by the leading (constant) mode. Overall, the mean-squared error (MSE) is given by
$$\mathrm{MSE}_k = \|\mathrm{bias}_k\|^2 + \mathrm{var}(\hat{z}_k) = \sum_{i=1}^{n} (\lambda_i^k - 1)^2 b_{0i}^2 + \sigma^2 \lambda_i^{2k}. \qquad (31)$$
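As a concrete illustration of how (31) is used to predict the behavior of the iterations, the following sketch evaluates the predicted squared bias, variance, and MSE of the diffusion estimate as a function of k. The spectrum and signal-energy distribution below are synthetic placeholders, not values taken from the experiments reported here.

```python
import numpy as np

def diffusion_mse(lam, b0_sq, sigma2, k):
    """Predicted squared bias, variance, and MSE of the k-step diffusion estimate W^k y,
    per (30)-(31), given the filter eigenvalues lam, signal energies b_{0i}^2, and sigma^2."""
    bias2 = np.sum((lam**k - 1.0)**2 * b0_sq)
    var = sigma2 * np.sum(lam**(2 * k))
    return bias2, var, bias2 + var

# toy example with a made-up spectrum and signal-energy distribution
lam = np.linspace(1.0, 0.05, 64)              # eigenvalues, with lambda_1 = 1
b0_sq = 100.0 / (1.0 + np.arange(64))**2      # hypothetical signal energies b_{0i}^2
sigma2 = 25.0
mse = [diffusion_mse(lam, b0_sq, sigma2, k)[2] for k in range(1, 11)]
k_best = 1 + int(np.argmin(mse))              # diffusion iteration with the smallest predicted MSE
```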
Diffusion experiments for denoising three types of patches (shown in Fig. 3) are carried out to illustrate
the evolution of MSE. The variance of input Gaussian noise is set as σ2 = 25. Filters based on both
LARK [3] and NLM [1] estimated from the latent noise-free patches are tested, and Sinkhorn algorithm
(see Appendix B) is implemented to make the filter matrices doubly-stochastic. Predicted MSEk, var(zk)
and ‖biask‖2 according to (31) through the diffusion iteration are shown in Fig. 4 (LARK) and Fig. 6
(NLM), where we can see that diffusion monotonically reduces the estimation variance and increases the
bias. In some cases (such as (a) and (b) in Fig. 4) diffusion further suppresses MSE and improves the
estimation performance. True MSEs for the standard (asymmetric) filters and their symmetrized versions
computed by Monte-Carlo simulations are also shown, where in each simulation 100 independent noise
realizations are averaged. We see that the estimated MSEs of the symmetrized filters match quite well
with the predicted ones (see Figs. 5 and 7). The asymmetric filters, meanwhile, generate slightly higher MSE, but behave similarly to the symmetrized ones, especially in regions around the optimal MSEs.
In the next set of Monte-Carlo simulations, we illustrate the effect of W being computed from clean vs. noisy images (see Figs. 8 and 9). The noise variance is $\sigma^2 = 0.5$, which is relatively small, simulating the situation where "pre-processing" has been applied to suppress estimation variance. It can be observed that
Fig. 3. Example patches: (a) flat, (b) edge, (c) texture.
Fig. 4. Plots of predicted $\mathrm{MSE}_k$, $\mathrm{var}(\hat{z}_k)$ and $\|\mathrm{bias}_k\|^2$ using (31) for LARK [3] filters in the diffusion process: (a) flat, (b) edge, (c) texture.
Fig. 5. Plots of predicted MSE and Monte-Carlo estimated MSE in the diffusion process using LARK [3] filters, for (a) flat, (b) edge, and (c) texture patches. In the Monte-Carlo simulations, row-stochastic (asymmetric) filters and their symmetrized versions are all tested.
Fig. 6. Plots of predicted $\mathrm{MSE}_k$, $\mathrm{var}(\hat{z}_k)$ and $\|\mathrm{bias}_k\|^2$ using (31) for NLM [1] filters in the diffusion process: (a) flat, (b) edge, (c) texture.
Fig. 7. Plots of predicted MSE and Monte-Carlo estimated MSE in the diffusion process using NLM [1] filters, for (a) flat, (b) edge, and (c) texture patches. In the Monte-Carlo simulations, row-stochastic (asymmetric) filters and their symmetrized versions are all tested.
the MSEs estimated from Monte-Carlo simulations are quite close to the ones predicted from the ideal
filters, which confirms the assumption in Section IV that filter matrix W can be treated as deterministic
under most circumstances.
To further analyze the change of MSE in the diffusion process, let us consider the contribution of MSE
in each component (or ”mode”) separately:
$$\mathrm{MSE}_k = \sum_{i=1}^{n} \mathrm{MSE}_k^{(i)} \qquad (32)$$
where $\mathrm{MSE}_k^{(i)}$ denotes the MSE of the $i$-th mode in the $k$-th diffusion iteration, which is given by
$$\mathrm{MSE}_k^{(i)} = (\lambda_i^k - 1)^2 b_{0i}^2 + \sigma^2 \lambda_i^{2k}. \qquad (33)$$
Fig. 8. Plots of predicted MSE in diffusion based on ideally estimated symmetric SKR filters, and MSE of noisy symmetric/asymmetric SKR filters through Monte-Carlo simulations, for (a) flat, (b) edge, and (c) texture patches. The input noise variance is $\sigma^2 = 0.5$, and for each simulation 100 independent noise realizations are implemented.
Fig. 9. Plots of predicted MSE in diffusion based on ideally estimated symmetric NLM filters, and MSE of noisy symmetric/asymmetric NLM filters through Monte-Carlo simulations, for (a) flat, (b) edge, and (c) texture patches. The input noise variance is $\sigma^2 = 0.5$, and for each simulation 100 independent noise realizations are implemented.
Equivalently, we can write
$$\overline{\mathrm{MSE}}_k^{(i)} = \frac{\mathrm{MSE}_k^{(i)}}{\sigma^2} = \mathrm{snr}_i\,(\lambda_i^k - 1)^2 + \lambda_i^{2k}, \qquad (34)$$
where as before, $\mathrm{snr}_i = \frac{b_{0i}^2}{\sigma^2}$ is defined as the signal-to-noise ratio in the $i$-th mode.
One may be interested to know at which iteration $\mathrm{MSE}_k^{(i)}$ is minimized. As we have seen, with an arbitrary filter $W$, there is no guarantee that even the very first iteration of diffusion will improve the estimate in the MSE sense. The derivative of (34) with respect to $k$ is
$$\frac{\partial \overline{\mathrm{MSE}}_k^{(i)}}{\partial k} = 2\lambda_i^k \log \lambda_i \left[(\mathrm{snr}_i + 1)\lambda_i^k - \mathrm{snr}_i\right]. \qquad (35)$$
For $i \geq 2$, since $\lambda_i < 1$, we note that the sign of the derivative depends on the term inside the brackets, namely $(\mathrm{snr}_i + 1)\lambda_i^k - \mathrm{snr}_i$. To guarantee that in the $k$-th iteration the MSE of the $i$-th mode decreases (negative derivative), we must have
$$(\mathrm{snr}_i + 1)\lambda_i^q - \mathrm{snr}_i > 0 \quad \text{for any } 0 \leq q < k. \qquad (36)$$
In fact the condition
$$(\mathrm{snr}_i + 1)\lambda_i^k - \mathrm{snr}_i \geq 0 \qquad (37)$$
guarantees that the derivative is always negative for $q \in (0, k)$, because given any scalar $t = q/k \in (0, 1)$:
$$(\mathrm{snr}_i + 1)\lambda_i^k - \mathrm{snr}_i \geq 0 \;\Rightarrow\; \lambda_i^{tk} \geq \left(\frac{\mathrm{snr}_i}{\mathrm{snr}_i + 1}\right)^t \;\Rightarrow\; (\mathrm{snr}_i + 1)\lambda_i^{tk} \geq \mathrm{snr}_i \left(\frac{\mathrm{snr}_i}{\mathrm{snr}_i + 1}\right)^{t-1} > \mathrm{snr}_i \qquad (38)$$
The condition (37) has an interesting interpretation. Rewriting (37) we have
$$\log(1 + \mathrm{snr}_i) \leq \underbrace{\log\left(\frac{1}{1 - \lambda_i'}\right)}_{\epsilon_i}, \qquad (39)$$
where $\lambda_i' = \lambda_i^k$ denotes the $i$-th eigenvalue of $W^k$, which is the equivalent filter matrix for the $k$-th iteration. The left-hand side of the inequality is Shannon's channel capacity [35] of the $i$-th mode for the problem $y = z + e$. We name the right-hand side expression $\epsilon_i$ the entropy of the $i$-th mode of the filter $W^k$. The larger $\epsilon_i$, the more variability can be expected in the output image produced by the $i$-th mode of the denoising filter. The above condition implies that if the filter entropy exceeds or equals the channel capacity of the $i$-th mode (i.e. the mode is sufficiently "strong"), then the $k$-th iteration of diffusion will
in fact produce an improvement in the corresponding mode. One may also be interested in identifying an
approximate value of k for which the overall MSEk is minimized. This depends on the signal energy
distribution ({b20i}) over all the modes. The study of this topic, which we consider outside the scope of
the present paper, will help to design a strategy capable of enhancing the performance of many existing
denoising filters.
Another interesting question is this: given a particular W, how much further MSE reduction can be
achieved by implementing diffusion? For example, can we use diffusion to improve the 3D collaborative
Wiener filter in BM3D [29]? Assume that this filter has already achieved the ideal Wiener filter condition
in each mode, namely
$$\lambda_i^* = \frac{\mathrm{snr}_i}{\mathrm{snr}_i + 1}. \qquad (40)$$
Then we can see that
$$(\mathrm{snr}_i + 1)(\lambda_i^*)^k - \mathrm{snr}_i < 0 \quad \text{for any } k > 1, \qquad (41)$$
which means that for all the modes, diffusion will definitely worsen (increase) the MSE. In other words, multi-iteration diffusion is not useful for filters where the Wiener condition has been (approximately) achieved. Put another way, suppose that the minimum $\mathrm{MSE}^{(i)}$ is achieved in the $k^*$-th iteration. Hence
$$k^* = \log\left(\frac{\mathrm{snr}_i}{\mathrm{snr}_i + 1}\right) \Big/ \log \lambda_i. \qquad (42)$$
Then the $i$-th eigenvalue of the equivalent filter matrix $W^{k^*}$ becomes $\lambda_i' = \mathrm{snr}_i/(\mathrm{snr}_i + 1)$, which is in fact the Wiener filter. Of course, in practice we can seldom find a single iteration number $k^*$ minimizing $\mathrm{MSE}^{(i)}$ for all the modes simultaneously; but in general, the take-away lesson is that optimizing a given filter through diffusion can in some cases (39) make its corresponding eigenvalues closer to those of an ideal Wiener filter that minimizes MSE.
B. Twicing, L2-Boosting, and Bregman Iterations
An alternative to repeated applications of the filter W is to consider the residual signals, defined as the
difference between the estimated signal and the measured signal. The use of the residuals in improving
estimates has a rather long history, dating at least back to the work of John Tukey [34] who termed the
idea ”twicing”. More recently, the idea has been rediscovered in the applied mathematics community
under the rubric of Bregman iterations [32], and in the machine learning and statistics literature [33] as
L2-boosting. In whatever guise, the basic notion – to paraphrase Tukey, is to use the filtered residuals to
enhance the estimate by adding some ”roughness” to it. Put another way, if the residuals contain some of
the underlying signal, filtering them should recover this left-over signal at least in part. These estimates
too trade off bias against variance with increasing iterations, though in a fundamentally different way
than the diffusion estimator we discussed earlier.
Formally, the residuals are defined as the difference between the estimated signal and the measured
signal: rk = y− zk−1, where here we define the initialization10 z0 = Wy. With this definition, we write
the iterated estimates as
zk = zk−1 + Wrk = zk−1 + W(y − zk−1). (43)
Specifically, for $k = 1$,
$$\hat{z}_1 = \hat{z}_0 + W(y - \hat{z}_0) = Wy + W(y - Wy) = \left(2W - W^2\right) y.$$
This first iterate $\hat{z}_1$ is precisely the "twicing" estimate of Tukey [34]. More recently, this particular
filter has been suggested (i.e. rediscovered) in other contexts. In [19], [20], for instance, Coifman et al.
10Note that due to the use of residuals, this is a different initialization than the one used in the diffusion iterations.
suggested it as an ad-hoc way to enhance the effectiveness of diffusion. The intuitive justification is that $(2W - W^2) = (2I - W)\,W$ can be considered as a two-step process; namely blurring, or diffusion (as embodied by $W$), followed by an additional step of inverse diffusion or sharpening (as embodied by $2I - W$). As can be seen, the ad-hoc suggestion has a clear interpretation here. Furthermore, we see that
the estimate z1 can also be thought of as a kind of nonlinear, adaptive unsharp masking process, further
clarifying its effect. In [36] this procedure and its relationship to the Bregman iterations were studied in
detail11.
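The recursion (43) and its closed form derived in (44) below are straightforward to implement; the following is a minimal sketch (assuming a precomputed filter matrix W and data vector y), included only to make the relationship between the iterated and closed-form expressions explicit.

```python
import numpy as np

def twicing_iterations(W, y, k):
    """Residual ("twicing"/L2-boosting) iterations of (43): z_0 = W y, z_k = z_{k-1} + W (y - z_{k-1})."""
    z = W @ y
    for _ in range(k):
        z = z + W @ (y - z)
    return z

def twicing_closed_form(W, y, k):
    """Equivalent closed form of (44): z_k = (I - (I - W)^{k+1}) y."""
    n = len(y)
    return (np.eye(n) - np.linalg.matrix_power(np.eye(n) - W, k + 1)) @ y

# the two agree: np.allclose(twicing_iterations(W, y, 5), twicing_closed_form(W, y, 5))
```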
To study the statistical behavior of the estimates, we rewrite the iterative process in (43) more explicitly
in terms of the data y. We have [23], [33]:
$$\hat{z}_k = \sum_{j=0}^{k} W(I - W)^j\, y = \left(I - (I - W)^{k+1}\right) y. \qquad (44)$$
It is instructive to contrast the above with the diffusion process. Clearly the above iteration does not
monotonically blur the data; but a rather more interesting behavior for the bias-variance tradeoff is
observed. Analogous to our earlier derivation, the bias in the estimate after k iterations is
$$\mathrm{bias}_k = E(\hat{z}_k) - z = \left(I - (I - W)^{k+1}\right) E(y) - z = -(I - W)^{k+1}\, z.$$
The squared magnitude of the bias is given by
$$\|\mathrm{bias}_k\|^2 = \|(I - W)^{k+1} z\|^2 = \sum_{i=1}^{n} (1 - \lambda_i)^{2k+2} b_{0i}^2. \qquad (45)$$
The behavior of the bias in this setting is in stark contrast to the increasing bias of the diffusion process
with iterations. Namely, here the bias in fact decays with iterations.
The variance of the estimator and the corresponding mean-squared error are:
$$\mathrm{cov}(\hat{z}_k) = \mathrm{cov}\left[\left(I - (I - W)^{k+1}\right) y\right] = \sigma^2 \left(I - (I - W)^{k+1}\right)\left(I - (I - W)^{k+1}\right)^T,$$
which gives
$$\mathrm{var}(\hat{z}_k) = \mathrm{tr}(\mathrm{cov}(\hat{z}_k)) = \sigma^2 \sum_{i=1}^{n} \left(1 - (1 - \lambda_i)^{k+1}\right)^2.$$
As $k$ grows, the variance tends to a constant value of $n\sigma^2$. Overall, the mean-squared error (MSE) is
$$\mathrm{MSE}_k = \|\mathrm{bias}_k\|^2 + \mathrm{var}(\hat{z}_k) = \sum_{i=1}^{n} (1 - \lambda_i)^{2k+2} b_{0i}^2 + \sigma^2 \left(1 - (1 - \lambda_i)^{k+1}\right)^2. \qquad (46)$$
11For the sake of completeness, we also mention in passing that in the non-stochastic setting, the residual-based iterations
(43) are known as matching pursuit [37], an approach that has found many applications and extensions for fitting over-complete
dictionaries.
Fig. 10. Plots of predicted $\mathrm{MSE}_k$, $\mathrm{var}(\hat{z}_k)$ and $\|\mathrm{bias}_k\|^2$ using (46) for LARK [3] filters in the twicing process: (a) flat, (b) edge, (c) texture.
Fig. 11. Plots of predicted MSE and Monte-Carlo estimated MSE in the twicing process using LARK [3] filters, for (a) flat, (b) edge, and (c) texture patches. In the Monte-Carlo simulations, row-stochastic (asymmetric) filters and their symmetrized versions are all tested.
Experiments for denoising the patches given in Fig. 3 using the twicing approach are carried out, where
the same doubly-stochastic LARK and NLM filters used for the diffusion tests in Fig. 4 and 6 are
employed. Plots of predicted MSEk, var(zk) and ‖biask‖2 with respect to iteration are illustrated in
Fig. 10 (LARK) and Fig. 12 (NLM). It can be observed that in contrast to diffusion, twicing monotonically
reduces the estimation bias and increases the variance. So twicing may in fact reduce the MSE in cases
where diffusion fails to do so (such as (c) in Fig. 4). True MSEs for the normal asymmetric filters and
their symmetrized versions are estimated through Monte-Carlo simulations, and they all match very well
with the predicted ones (see Fig. 11, 13).
Similar to the diffusion analysis before, we plot MSE resulting from the use of W filters directly
estimated from noisy patches through Monte-Carlo simulations with noise variance σ2 = 0.5, and compare
them with the predicted MSEs from the ideally estimated filters. Again, the MSEs estimated from Monte-
Fig. 12. Plots of predicted $\mathrm{MSE}_k$, $\mathrm{var}(\hat{z}_k)$ and $\|\mathrm{bias}_k\|^2$ using (46) for NLM [1] filters in the twicing process: (a) flat, (b) edge, (c) texture.
Fig. 13. Plots of predicted MSE and Monte-Carlo estimated MSE in the twicing process using NLM [1] filters, for (a) flat, (b) edge, and (c) texture patches. In the Monte-Carlo simulations, row-stochastic (asymmetric) filters and their symmetrized versions are all tested.
Carlo simulations are quite close to the predicted data. (See Figs. 14 and 15.)
Here again, the contribution to $\mathrm{MSE}_k$ of the $i$-th mode can be written as
$$\mathrm{MSE}_k^{(i)} = (1 - \lambda_i)^{2k+2} b_{0i}^2 + \sigma^2 \left(1 - (1 - \lambda_i)^{k+1}\right)^2. \qquad (47)$$
Proceeding in a fashion similar to the diffusion case, we can analyze the derivative of $\mathrm{MSE}_k^{(i)}$ (normalized by $\sigma^2$, as in (34)) with respect to $k$ to see whether the iterations improve the estimate. We have
$$\frac{\partial \overline{\mathrm{MSE}}_k^{(i)}}{\partial k} = 2(1 - \lambda_i)^{k+1} \log(1 - \lambda_i)\left[(\mathrm{snr}_i + 1)(1 - \lambda_i)^{k+1} - 1\right]. \qquad (48)$$
Reasoning along parallel lines to the earlier discussion, the following condition guarantees that the k-th
iteration improves the estimate:
$$(\mathrm{snr}_i + 1)(1 - \lambda_i)^{k+1} \geq 1. \qquad (49)$$
Fig. 14. Plots of predicted MSE in twicing based on ideally estimated symmetric SKR filters, and MSE of noisy symmetric/asymmetric SKR filters through Monte-Carlo simulations, for (a) flat, (b) edge, and (c) texture patches. The input noise variance is $\sigma^2 = 0.5$, and for each simulation 100 independent noise realizations are implemented.
Fig. 15. Plots of predicted MSE in twicing based on ideally estimated symmetric NLM filters, and MSE of noisy symmetric/asymmetric NLM filters through Monte-Carlo simulations, for (a) flat, (b) edge, and (c) texture patches. The input noise variance is $\sigma^2 = 0.5$, and for each simulation 100 independent noise realizations are implemented.
Rewriting the above we have
$$\log(1 + \mathrm{snr}_i) \geq \underbrace{\log\left(\frac{1}{1 - \lambda_i'}\right)}_{\epsilon_i}, \qquad (50)$$
where now $\lambda_i' = 1 - (1 - \lambda_i)^{k+1}$ is the $i$-th eigenvalue of $I - (I - W)^{k+1}$, which is the equivalent filter matrix for the $k$-th twicing iteration. This is analogous to the condition we derived earlier. But here we observe that if the entropy does not exceed the channel capacity (i.e. the $i$-th mode of the filter is sufficiently "weak" or "ineffective"), then iterating with the residuals will indeed produce an improvement. This makes reasonable sense because a "weak" filter will remove too much detail from the denoised estimate. This "lost" detail is contained in the residuals, which the iterations will attempt to return to the estimate. Again, if we assume the minimum $\mathrm{MSE}_k^{(i)}$ is achieved in the $k^*$-th iteration, then
$$k^* = \log\left(\frac{1}{\mathrm{snr}_i + 1}\right) \Big/ \log(1 - \lambda_i) - 1. \qquad (51)$$
The $i$-th eigenvalue of the equivalent filter matrix $I - (I - W)^{k^*+1}$ then becomes $\mathrm{snr}_i/(\mathrm{snr}_i + 1)$, which is the Wiener condition. In most cases we cannot optimize all the modes with a uniform number of iterations, but globally optimizing the MSE through twicing can make the eigenvalues closer to the ones of the ideal Wiener filter that minimizes the MSE.
From the above analysis, we can see that it is possible to improve the performance of many existing
denoising algorithms in the MSE sense by implementing iterative filtering (see LARK filtering examples
in Fig. 16). However, to choose either diffusion or twicing and to determine the optimal iteration number
require prior knowledge (or estimation) of the latent signal energy {b20i} and this is an ongoing challenge.
Research on estimating the SNR of input image or video data without a reference ”ground truth” is of
broad interest in the community as of late [38]. In particular, this problem points to potentially important
connections with ongoing work in no-reference image quality assessment [39] as well.
VI. RELATIONSHIP TO BAYESIAN REGULARIZATION APPROACHES
It is useful to contrast the non-parametric framework described so far with the more classical Bayesian
estimation approach. Informally, to have a Bayesian interpretation of the estimator in (13), we could treat
both yi and xi as random variables with joint probability density function p(x, y). Then z(x) can be
thought of as the minimum mean-squared estimate of z based on the data y, conditioned on x [40]:
$$\hat{z}(x) = E(y|x) = \int y\, p(y|x)\, dy = \int y\, \frac{p(x, y)}{p(x)}\, dy = \frac{\int y\, p(x, y)\, dy}{\int p(x, y)\, dy}.$$
Comparing this expression to the filtering formulation (13):
$$\hat{z}(x_j) = \sum_i \frac{K(x_i, x_j, y_i, y_j)}{\sum_i K(x_i, x_j, y_i, y_j)}\, y_i = \sum_i W(x_i, x_j, y_i, y_j)\, y_i,$$
we observe that the form and function of the kernel K(·) is to locally capture the essence of the (unknown)
joint density function p(x, y).
A more common Bayesian approach is to consider a prior for the unknown z, or to equivalently use
a regularization approach. More specifically:
$$\text{Maximum a posteriori (MAP):}\quad \hat{z} = \arg\min_{z}\ \frac{1}{2}\|y - z\|^2 + \frac{\lambda}{2} R(z) \qquad (52)$$
where the first term on the right-hand side is the data-fidelity term, and the second (regularization) term essentially enforces a soft constraint on the global structure and smoothness of the signal being restored.
Fig. 16. Denoising example using stationary iterative LARK filtering: (a)-(c) input noisy patches with noise variance $\sigma^2 = 25$; (d)-(f) output patches after filtering once; (g)-(i) MSE-optimized outputs: (g) is the 6th diffusion output, (h) is the 4th diffusion output, and (i) is the 3rd twicing output.
In the MAP approach, the regularization term is a functional R(z) which is typically (but not always)
convex, to yield a unique minimum for the overall cost. Particularly popular recent examples include
R(z) = ‖∇z‖, and R(z) = ‖z‖1. The above approach implicitly contains a global (Gaussian) model of
the noise (captured by the quadratic data-fidelity term) and an explicit model of the signal (captured by
the regularization term.)
Whatever choices we make for the regularization functional, the regularization approach is global in the
sense that the implicit assumptions about the noise and the signal constrain the degrees of freedom in the
resulting filters Wk and therefore in the solution, hence limiting the global behavior of the estimate. This
typically results in well-behaved (though not always desirable) solutions. This is precisely (and at least
implicitly) the main motivation for using (local/non-local) adaptive non-parametric techniques in place of
global regularization methods. Indeed, though the effect of regularization is not explicitly present in the
non-parametric framework, its work is implicitly done by the design of the kernel function K(·), which
affords us local control, and therefore more freedom and often better adaptation to the given data, resulting
in more powerful techniques with broader applicability. Though we do not discuss them here, of course
hybrid methods are also possible, where the first term in (52) is replaced by the corresponding weighted
term (3), with regularization also applied explicitly [41]. As we shall see below, the iterative methods for
improving the non-parametric estimates (such as diffusion, twicing, etc.) mirror the behavior of steepest
descent iterations in the Bayesian case, but with a fundamental difference. Namely, the corresponding
implicit regularization functionals deployed by the non-parametric methods are significantly different and
more complex than their typical Bayesian counterpart.
To connect MAP to the non-parametric setting, we proceed in similar lines as Elad [13] by considering
the simplest iterative approach: the steepest descent (SD) method. The SD iteration for MAP, with fixed
step size µ is
$$\text{MAP:}\quad \hat{z}_k = \hat{z}_{k-1} - \mu \left[(\hat{z}_{k-1} - y) + \lambda \nabla R(\hat{z}_{k-1})\right]. \qquad (53)$$
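To make the comparison with the non-parametric iterations concrete, here is a minimal sketch of the steepest-descent recursion (53). The quadratic smoothness regularizer used for grad_R is purely an illustrative stand-in; the popular choices mentioned above, such as $\|\nabla z\|$ or $\|z\|_1$, have different (sub)gradients.

```python
import numpy as np

def map_steepest_descent(y, grad_R, lam=0.5, mu=0.2, iters=50):
    """Steepest-descent iterations (53) for the MAP cost (52): each step moves against the
    gradient of the data-fidelity term plus lam times the gradient of the regularizer R."""
    z = y.copy()
    for _ in range(iters):
        z = z - mu * ((z - y) + lam * grad_R(z))
    return z

# Illustrative choice only: a quadratic smoothness regularizer R(z) = 0.5 * sum_i (z_i - z_{i-1})^2
# on a 1-D signal, whose gradient is a discrete negative Laplacian of z.
def grad_quadratic_smoothness(z):
    g = np.zeros_like(z)
    d = z[1:] - z[:-1]
    g[1:] += d
    g[:-1] -= d
    return g

# usage: z_hat = map_steepest_descent(y, grad_quadratic_smoothness, lam=0.5, mu=0.2)
```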
A. Bayesian Interpretation of Diffusion
To describe a Bayesian interpretation for the diffusion process, we can compare (27) to the steepest
descent MAP iterations in (53). More specifically, we equate the right-hand sides of
$$\hat{z}_{k+1} = \hat{z}_k - \mu \left[(\hat{z}_k - y) + \lambda \nabla R(\hat{z}_k)\right] \qquad (54)$$
$$\hat{z}_{k+1} = \hat{z}_k + (W - I)\, \hat{z}_k \qquad (55)$$
which gives

-\mu\left[(\widehat{z}_k - y) + \lambda\,\nabla R(\widehat{z}_k)\right] = (\mathbf{W} - \mathbf{I})\,\widehat{z}_k.

Solving for \nabla R(\widehat{z}_k) we obtain

\nabla R(\widehat{z}_k) = \frac{1}{\mu\lambda}\left(\mathbf{W} - (1-\mu)\mathbf{I}\right)(y - \widehat{z}_k) + \frac{1}{\mu\lambda}\,(\mathbf{I} - \mathbf{W})\, y \qquad (56)
When W is symmetric¹², integrating both sides we have (to within an additive constant):

R_{\mathrm{diff}}(\widehat{z}_k) = \frac{1}{2\mu\lambda}\,(y - \widehat{z}_k)^T\left((1-\mu)\mathbf{I} - \mathbf{W}\right)(y - \widehat{z}_k) + \frac{1}{\mu\lambda}\, y^T(\mathbf{I} - \mathbf{W})\,\widehat{z}_k \qquad (57)
We therefore observe that the (implicit) regularization term is a sum of two terms: the first is a
quadratic function of the residuals, and the second is a correlation between the data y and the k-th
iterate. As such, it is not a standard Bayesian prior – it is not simply a functional modeling the underlying
image; instead, it depends on the data and evolves with the iteration index k.
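For reference, the diffusion update (55) itself is a one-line iteration; the sketch below is an illustrative rendering (not code from the paper), which assumes a precomputed filter matrix W and data y as NumPy arrays, with the iteration count k playing the role of the implicit, iteration-dependent regularization just described.

    import numpy as np

    def diffuse(W, y, k=3):
        # Diffusion iteration (55): z_{k+1} = z_k + (W - I) z_k = W z_k, started at y.
        z = np.asarray(y, dtype=float).copy()
        for _ in range(k):
            z = W @ z
        return z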
The Bayesian interpretation of the residual-based methods is, somewhat surprisingly, more straightfor-
ward but still non-standard, as we illustrate below.
B. Bayesian Interpretation of the Residual Iterations
The residual-based iterations (43) can have a Bayesian interpretation too. We arrive at this in a manner
similar to what we did above with diffusion. In particular, comparing (43) to the steepest descent MAP
iterations in (53), we have
\widehat{z}_{k+1} = \widehat{z}_k - \mu\left[(\widehat{z}_k - y) + \lambda\,\nabla R(\widehat{z}_k)\right] \qquad (58)

\widehat{z}_{k+1} = \widehat{z}_k + \mathbf{W}\,(y - \widehat{z}_k) \qquad (59)
Again, equating the right-hand sides we get:

-\mu\left[(\widehat{z}_k - y) + \lambda\,\nabla R(\widehat{z}_k)\right] = \mathbf{W}\,(y - \widehat{z}_k).

Solving for \nabla R(\widehat{z}_k) we obtain

\nabla R(\widehat{z}_k) = -\frac{1}{\mu\lambda}\left(\mathbf{W} - \mu\mathbf{I}\right)(y - \widehat{z}_k), \qquad (60)
Again, with W symmetric, we have (to within an additive constant):

R_{\mathrm{resid}}(\widehat{z}_k) = \frac{1}{2\mu\lambda}\,(y - \widehat{z}_k)^T\left(\mathbf{W} - \mu\mathbf{I}\right)(y - \widehat{z}_k) \qquad (61)
¹²Since the left-hand side of (56) is the gradient of a scalar-valued function, the right-hand side must be curl-free. This is
possible if and only if W is symmetric.
Therefore the (implicit) regularization term is an adaptive, quadratic function of the residuals. As such,
it is not a standard Bayesian prior.
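For comparison with the diffusion sketch above, here is the residual-based update (59) in the same illustrative form; the choice of starting point z₀ = W y is an assumption made only for this example.

    import numpy as np

    def residual_iterate(W, y, k=3):
        # Residual-based iteration (59): z_{k+1} = z_k + W (y - z_k).
        # Each step filters the current residual and adds it back to the estimate.
        z = W @ np.asarray(y, dtype=float)   # pilot estimate (an assumed initialization)
        for _ in range(k):
            z = z + W @ (y - z)
        return z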
VII. CONCLUDING REMARKS
It has been said that in both literature and in film, there are only seven basic plots (comedy, tragedy,
etc.) – that all stories are combinations or variations on these basic themes. Perhaps it is taking the
analogy too far to say that an exact parallel exists in our field. But it is fair to argue that the basic
tools of our trade in recent years have likewise revolved around a small number of key concepts, which I
have tried to highlight in this paper.
• The most successful modern approaches in image and video processing are non-parametric. We
have drifted away from model-based methods which have dominated signal processing for decades.
• In one-dimensional signal processing, there is a long history of design and analysis of adaptive filter-
ing algorithms. A corresponding line of thought has only recently become the dominant paradigm in
processing higher-dimensional data. Indeed, adaptivity to data is a central theme of all the algorithms
and techniques discussed here, which represent a snapshot of the state of the art.
• The traditional approach that every student of our field learns in school is to carefully design a
filter, apply it to the given set of data, and call it a day. In contrast, many, if not most, of the recent
approaches have instead involved repeated applications of a filter or sequence of filters to a data
set and combining the results. Such ideas have been around for some time in statistics, machine
learning, and elsewhere, but we have just begun to make careful use of sophisticated iteration and
bootstrapping mechanisms.
As I highlighted here, deep connections between techniques used commonly in computer vision, image
processing, and machine learning exist. The practitioners of these fields have been using each other’s
techniques either implicitly or explicitly for a while. The pace of this convergence has quickened, and
this is not a coincidence. It has come about through many years of scientific iteration in what my late
friend and colleague Gene Golub called the “serendipity of science” – an aptitude we have all developed
for making desirable discoveries by happy accident.
VIII. ACKNOWLEDGEMENTS
I would like to thank all my students in the MDSP research group at UC Santa Cruz for their feedback
and insights on earlier versions of this manuscript. In particular, I acknowledge the diligent help of Xiang
Zhu with many of the simulations contained here. I also thank my friend and colleague Prof. Michael
Elad of the Technion for his feedback and discussions.
APPENDIX A
GENERALIZATION OF THE NON-PARAMETRIC FRAMEWORK TO ARBITRARY BASES
The non-parametric approach in (3) can be further extended to include a more general model of the
signal z(x) in some appropriate basis. Namely, expanding the regression function z(x) in a desired basis,
we can formulate the following optimization problem:
\widehat{z}(x_j) = \arg\min_{\{\beta_l(x_j)\}}\; \sum_{i=1}^{n}\left[\, y_i - \sum_{l=0}^{N}\beta_l(x_j)\,\phi_l(x_i, x_j)\right]^2 K(x_i, x_j, y_i, y_j), \qquad (62)
where N is the model (or regression) order. For instance, with the basis set φl(xi, xj) = (xi − xj)l, we
have the Taylor series expansion, leading to the local polynomial approaches of classical kernel regression
theory [7], [40], [3]. Alternatively, the basis vectors could be learned from the given image using a method
such as K-SVD [42]. In matrix notation we have
\widehat{\beta}(x_j) = \arg\min_{\beta(x_j)}\; \left[y - \Phi_j\,\beta(x_j)\right]^T \mathbf{K}_j \left[y - \Phi_j\,\beta(x_j)\right], \qquad (63)
where

\Phi_j = \begin{bmatrix}
\phi_0(x_1, x_j) & \phi_1(x_1, x_j) & \cdots & \phi_N(x_1, x_j) \\
\phi_0(x_2, x_j) & \phi_1(x_2, x_j) & \cdots & \phi_N(x_2, x_j) \\
\vdots & \vdots & & \vdots \\
\phi_0(x_n, x_j) & \phi_1(x_n, x_j) & \cdots & \phi_N(x_n, x_j)
\end{bmatrix} \qquad (64)
is a matrix containing the basis vectors in its columns; and where β(x_j) = [β_0(x_j), β_1(x_j), · · · , β_N(x_j)]^T is the
coefficient vector of the local signal representation in this basis. Meanwhile, K_j is the diagonal matrix
of weights as defined earlier. In order to maintain the ability to represent the “DC” pixel values, we can
insist that the matrix Φ_j contain the vector 1_n = [1, 1, · · · , 1]^T as one of its columns. Without loss of
generality, we assume this to be the first column, so that by definition φ_0(x_i, x_j) = 1 for all i and j.
Denoting the j-th row of Φ_j as φ_j^T = [φ_0(x_j, x_j), φ_1(x_j, x_j), · · · , φ_N(x_j, x_j)], we have the closed-form
solution

\widehat{z}(x_j) = \phi_j^T\,\widehat{\beta}(x_j) \qquad (65)

\phantom{\widehat{z}(x_j)} = \underbrace{\phi_j^T\left(\Phi_j^T \mathbf{K}_j \Phi_j\right)^{-1}\Phi_j^T \mathbf{K}_j}_{w_j^T}\; y \;=\; w_j^T\, y \qquad (66)
This, again, is a weighted combination of the pixels, though a rather more complicated one than the
earlier formulation. Interestingly, the filter vector w_j still has elements that sum to 1. This can be seen
as follows. We have

w_j^T\,\Phi_j = \phi_j^T\left(\Phi_j^T \mathbf{K}_j \Phi_j\right)^{-1}\Phi_j^T \mathbf{K}_j \Phi_j = \phi_j^T.
Writing this explicitly, we observe

w_j^T \begin{bmatrix}
1 & \phi_1(x_1, x_j) & \cdots & \phi_N(x_1, x_j) \\
1 & \phi_1(x_2, x_j) & \cdots & \phi_N(x_2, x_j) \\
\vdots & \vdots & & \vdots \\
1 & \phi_1(x_n, x_j) & \cdots & \phi_N(x_n, x_j)
\end{bmatrix} = \left[\, 1,\; \phi_1(x_j, x_j),\; \cdots,\; \phi_N(x_j, x_j)\,\right]
Considering the inner product of w_j^T with the first column of Φ_j, we have w_j^T 1_n = 1, as claimed.
The use of a general basis as described above carries one disadvantage, however. Namely, since the
filter vectors w_j are now influenced not only by the choice of the kernel, but also by the choice of
the basis, the elements of w_j can no longer be guaranteed to be positive.
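As an illustration of (63)–(66), the sketch below (not from the original text) computes the filter vector w_j for a 1-D signal with the local polynomial basis φ_l(x_i, x_j) = (x_i − x_j)^l and a bilateral-type kernel on the diagonal of K_j, then checks numerically that the weights sum to one while not necessarily being positive; the bandwidths and the regression order are illustrative choices.

    import numpy as np

    def local_poly_weights(x, y, j, N=1, h_x=2.0, h_y=0.5):
        # Equation (66): w_j^T = phi_j^T (Phi_j^T K_j Phi_j)^{-1} Phi_j^T K_j,
        # with phi_l(x_i, x_j) = (x_i - x_j)**l, so that z_hat(x_j) = w_j^T y.
        d = x - x[j]
        Phi = np.vander(d, N + 1, increasing=True)            # columns d**0, ..., d**N
        k = np.exp(-d**2 / (2 * h_x**2)) * np.exp(-(y - y[j])**2 / (2 * h_y**2))
        K = np.diag(k)
        A = Phi.T @ K @ Phi
        phi_j = Phi[j]                                        # j-th row of Phi: [1, 0, ..., 0]
        w = (np.linalg.solve(A, Phi.T @ K)).T @ phi_j
        return w

    # Sanity check of the property derived above: the elements of w_j sum to one.
    rng = np.random.default_rng(0)
    x = np.arange(31, dtype=float)
    y = np.sin(x / 5.0) + 0.05 * rng.standard_normal(x.size)
    w = local_poly_weights(x, y, j=15, N=2)
    print(w.sum())        # approximately 1
    print((w < 0).any())  # may be True: the weights need not be positive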
APPENDIX B
SYMMETRIC APPROXIMATION OF W AND ITS PROPERTIES
We approximate the matrix W = D⁻¹K with a doubly-stochastic (symmetric) positive definite matrix.
The algorithm we use to effect this approximation is due to Sinkhorn [25], [26]. He proved that given
a matrix with strictly positive elements (such as W), there exist diagonal matrices R = diag(r) and
C = diag(c) such that

\widehat{\mathbf{W}} = \mathbf{R}\,\mathbf{W}\,\mathbf{C}

is doubly stochastic. That is,

\widehat{\mathbf{W}}\,1_n = 1_n \qquad \text{and} \qquad 1_n^T\,\widehat{\mathbf{W}} = 1_n^T \qquad (67)
Furthermore, the vectors r and c are unique to within a scalar (i.e., αr and c/α). Sinkhorn's algorithm for
obtaining r and c in effect involves repeated normalization of the rows and columns so that they sum to
one (see Algorithm 1 for details), and is provably convergent and optimal in the cross-entropy sense
[27]. To see that the resulting matrix Ŵ is symmetric positive definite, we note a corollary of Sinkhorn's
result: when a symmetric matrix is considered, the unique diagonal scalings are in fact identical [25],
[26]. Applied to K, we see that the symmetric diagonal scaling K̂ = ∆K∆ yields a symmetric doubly-
stochastic matrix. But we have W = D⁻¹K, so

\widehat{\mathbf{W}} = \mathbf{R}\,\mathbf{D}^{-1}\mathbf{K}\,\mathbf{C}.
Since Ŵ and the diagonal scalings are unique (to within a scalar), we must have RD⁻¹ = C = ∆,
and therefore Ŵ = K̂. That is to say, applying Sinkhorn's algorithm to either K or its row-sum normalized
version W yields the very same (symmetric, positive definite, doubly-stochastic) result, Ŵ.
Algorithm 1 Algorithm for scaling a matrix A to a nearby doubly-stochastic matrix Â
    Given a matrix A, let (n, n) = size(A) and initialize r = ones(n, 1);
    for k = 1 : iter
        c = 1 ./ (A' * r);
        r = 1 ./ (A * c);
    end
    C = diag(c); R = diag(r);
    Â = R * A * C
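A direct NumPy transcription of Algorithm 1 is given below (an illustrative sketch, not code from the paper); applied to a row-stochastic W built from a symmetric, strictly positive kernel K, the output is, to numerical accuracy, symmetric and doubly stochastic, in line with the argument above.

    import numpy as np

    def sinkhorn_scale(A, iters=200):
        # Algorithm 1: alternately rescale columns and rows so that
        # diag(r) A diag(c) becomes (approximately) doubly stochastic.
        r = np.ones(A.shape[0])
        for _ in range(iters):
            c = 1.0 / (A.T @ r)
            r = 1.0 / (A @ c)
        return np.diag(r) @ A @ np.diag(c)

    # Toy example: symmetric, strictly positive kernel matrix K and W = D^{-1} K.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 2))
    d2 = ((X[:, None, :] - X[None, :, :])**2).sum(axis=-1)
    K = np.exp(-d2)
    W = K / K.sum(axis=1, keepdims=True)

    W_hat = sinkhorn_scale(W)
    print(np.abs(W_hat.sum(axis=1) - 1).max())   # row sums close to 1
    print(np.abs(W_hat.sum(axis=0) - 1).max())   # column sums close to 1
    print(np.abs(W_hat - W_hat.T).max())         # symmetric up to convergence error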
The convergence of this iterative algorithm is known to be linear [26], with rate given by the subdominant
eigenvalue λ₂ of W. It is of interest to know how much the diagonal scalings will perturb the eigenvalues
of W. This will indicate how accurate our approximate analysis in Section V will be, which uses the
eigenvalues of Ŵ instead of W. Fortunately, Kahan [43], [22] provides a nice result: suppose that A is
a non-Hermitian matrix with eigenvalues |λ₁| ≥ |λ₂| ≥ · · · ≥ |λₙ|, and let Â be a perturbation of A that is
Hermitian with eigenvalues λ̂₁ ≥ λ̂₂ ≥ · · · ≥ λ̂ₙ. Then

\sum_{i=1}^{n} |\widehat{\lambda}_i - \lambda_i|^2 \;\le\; 2\,\|\widehat{\mathbf{A}} - \mathbf{A}\|_F^2.
Indeed, if A and Â are both positive definite (as is the case for both K and W), the scalar factor 2 on the
right-hand side of the inequality may be discarded. Specializing the result for W, defining ∆ = Ŵ − W
and normalizing by n, we have

\frac{1}{n}\sum_{i=1}^{n} |\widehat{\lambda}_i - \lambda_i|^2 \;\le\; \frac{1}{n}\,\|\boldsymbol{\Delta}\|_F^2.
The right-hand side can be further bounded as described in [44].
Hence, for patches of reasonable size (say n ≥ 11), the bound on the right-hand side is small, and the eigenvalues
of the respective matrices are quite close, as demonstrated in Figure 17. Bounds may be established on
the distance between the eigenvectors of Ŵ and W as well [44]. Of course, since both matrices are row-
stochastic, they share their first right eigenvector v̂₁ = v₁, corresponding to a unit eigenvalue. The second
eigenvectors are arguably more important, as they tell us about the dominant structure of the region being
filtered, and the corresponding effect of the filters.
[Fig. 17. Eigenvalues of row-stochastic (asymmetric) filter matrices and the corresponding doubly-stochastic (symmetric) filter
matrices symmetrized using Sinkhorn's algorithm, plotted against eigenvalue index: (a) an NLM filter example; (b) a LARK filter example.]
Specifically, as indicated in [28], applying Theorem
5.2.8 from [22] gives a bound on the perturbation of the second eigenvectors:

\|\widehat{v}_2 - v_2\| \;\le\; \frac{4\,\|\boldsymbol{\Delta}\|_F}{\nu - \sqrt{2}\,\|\boldsymbol{\Delta}\|_F}.

It is worth noting that the bound essentially does not depend on the dimension n of the data y. This is
encouraging, as it means that by approximating W with Ŵ, we get a uniformly bounded perturbation
to the second eigenvector regardless of the filter window size. That is to say, the approximation is quite
stable.
As a footnote, we mention the interesting fact that, when applied to a symmetric squared distance
matrix, Sinkhorn’s algorithm has a rather nice geometric interpretation [45], [46]. Namely, if P is the set
of points that generated the symmetric distance matrix A, then elements of its Sinkhorn scaled version
B = ∆A∆ correspond to the square distances between the corresponding points in a set Q, where the
points in Q are obtained by a stereographic projection of the points in P . The points in Q are confined
to a hyper-sphere of dimension d embedded in R^{d+1}, where d is the dimension of the subspace spanned
by the points in P .
APPENDIX C
STABILITY OF FILTER COEFFICIENTS TO PERTURBATIONS DUE TO NOISE
The statistical analysis of the non-parametric filters W is quite complicated when W is assumed
stochastic. Fortunately, however, the stability results in [28] allow us to treat the resulting filters
as approximately deterministic. More specifically, we assume that the noise e corrupting the data has
i.i.d. samples from a symmetric (but not necessarily Gaussian) distribution, with zero mean and finite
variance σ². The results in [28] imply that if the noise variance σ² is small relative to the clean data z
(i.e., when the signal-to-noise ratio is high), the weight matrix W = W(y) computed from the noisy data
y = z + e is near the latent weight matrix W(z). That is, with the number of samples n sufficiently large
[28],

\|\mathbf{W}(y) - \mathbf{W}(z)\|_F^2 \;\le_p\; c_1\,\sigma_e^{(2)} + c_2\,\sigma_e^{(4)}, \qquad (68)
for some constants c₁ and c₂, where σ_e^{(2)} and σ_e^{(4)} denote the second and fourth order moments of
‖e‖, respectively, and “≤_p” indicates that the inequality holds in probability. When the noise variance is
sufficiently small, the fourth order moment is even smaller; hence, the change in the resulting coefficient
matrix is bounded by a constant multiple of the small noise variance.
Approximations to the moments of the perturbation dW = W(y) − W(z) are also given in [28], [47]:

E[dW] \approx E[dD]\,\mathbf{D}^{-2}\mathbf{K} - \mathbf{D}^{-1}E[dK],

E[dW^2] \approx E[dD^2]\,\mathbf{D}^{-4}\mathbf{K}^2 + \mathbf{D}^{-2}E[dK^2] - E[dD\,dK]\,\mathbf{D}^{-3}\circ\mathbf{K},

where dD = D(y) − D(z), dK = K(y) − K(z), and ◦ denotes the element-wise product. A thorough description
of the perturbation analysis for the class of Gaussian data-adaptive weights is given in [28], [47]. To
summarize, for a sufficiently small noise variance, the matrix of weights can be treated, with confidence,
as a deterministic quantity which depends only on the content of the underlying latent image z.
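The following sketch gives a small numerical illustration of the stability claim in (68); the bilateral-type weight construction and the parameter values are illustrative assumptions, not the specific weights analyzed in [28].

    import numpy as np

    def weight_matrix(y, x, h_x=2.0, h_y=0.3):
        # Row-stochastic data-adaptive weights computed from the (possibly noisy) data.
        dx = x[:, None] - x[None, :]
        dy = y[:, None] - y[None, :]
        K = np.exp(-dx**2 / (2 * h_x**2)) * np.exp(-dy**2 / (2 * h_y**2))
        return K / K.sum(axis=1, keepdims=True)

    rng = np.random.default_rng(1)
    x = np.arange(200, dtype=float)
    z = np.sin(x / 20.0)                  # latent signal
    W_z = weight_matrix(z, x)             # latent weight matrix W(z)
    for sigma in (0.01, 0.05, 0.1):
        y = z + sigma * rng.standard_normal(x.size)
        W_y = weight_matrix(y, x)         # weights computed from the noisy data
        err = np.linalg.norm(W_y - W_z, 'fro')**2
        print(sigma**2, err)              # ||W(y) - W(z)||_F^2 shrinks with sigma^2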
REFERENCES
[1] A. Buades, B. Coll, and J. M. Morel, “A review of image denoising algorithms, with a new one,” Multiscale Modeling
and Simulation (SIAM interdisciplinary journal), vol. 4, no. 2, pp. 490–530, 2005.
[2] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” Proceedings of the 1998 IEEE International
Conference on Computer Vision, Bombay, India, pp. 836–846, January 1998.
[3] H. Takeda, S. Farsiu, and P. Milanfar, “Kernel regression for image processing and reconstruction,” IEEE Transactions on
Image Processing, vol. 16, no. 2, pp. 349–366, February 2007.
[4] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3D transform-domain collaborative
filtering,” IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2080–2095, August 2007.
[5] S. M. Smith and J. M. Brady, “Susan – a new approach to low level image processing,” International Journal of Computer
Vision, vol. 23, no. 1, pp. 45–78, 1997.
[6] L. Yaroslavsky, Digital Picture Processing. Springer Verlag, 1987.
[7] M. P. Wand and M. C. Jones, Kernel Smoothing, ser. Monographs on Statistics and Applied Probability. London; New
York: Chapman and Hall, 1995.
[8] W. Hardle, Applied Nonparametric Regression. Cambridge [England]; New York: Cambridge University Press, 1990.
[9] P. Lancaster and K. Salkauskas, “Surfaces generated by moving least squares methods,” Mathematics of Computation,
vol. 37, pp. 141–158, 1981.
[10] D. Levin, “The approximation power of moving least-squares,” Mathematics of Computation, vol. 67, no. 224, pp. 1517–
1531, 1998.
[11] M. Alexa, J. Behr, D. Cohen-or, S. Fleishman, and D. Levin, “Computing and rendering point set surfaces,” IEEE
Transactions on Visualization and Computer Graphics, vol. 9, no. 1, pp. 3–15, 2003.
[12] R. Gonzalez and R. Woods, Digital Image Processing, Third Edition. Prentice Hall, 2008.
[13] M. Elad, “On the origin of the bilateral filter and ways to improve it,” IEEE Transactions on Image Processing, vol. 11,
no. 10, pp. 1141–1150, October 2002.
[14] C. Kervrann and J. Boulanger, “Optimal spatial adaptation for patch-based image denoising,” Image Processing, IEEE
Transactions on, vol. 15, no. 10, pp. 2866–2878, Oct. 2006.
[15] S. P. Awate and R. T. Whitaker, “Unsupervised, information-theoretic, adaptive image filtering for image restoration,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 364–376, March 2006.
[16] H. Knutsson and C.-F. Westin, “Normalized and differential convolution: Methods for interpolation and filtering of
incomplete and uncertain data,” Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, pp. 515–523, June 16-19 1993.
[17] A. Spira, R. Kimmel, and N. Sochen, “A short time beltrami kernel for smoothing images and manifolds,” IEEE Trans.
Image Processing, vol. 16, no. 6, pp. 1628–1636, 2007.
[18] T. Hofmann, B. Scholkopf, and A. J. Smola, “Kernel methods in machine learning,” The Annals of Statistics, vol. 36,
no. 3, pp. 1171–1220, 2008.
[19] A. Singer, Y. Shkolnisky, and B. Nadler, “Diffusion interpretation of nonlocal neighborhood filters for signal denoising,”
SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 118–139, 2009.
[20] A. Szlam, M. Maggioni, and R. Coifman, “Regularization on graphs with function-adapted processes,” Journal of Machine
Learning Research, vol. 9, pp. 1711–1739, 2008.
[21] H. J. Landau and A. M. Odlyzko, “Bounds for eigenvalues of certain stochastic matrices,” Linear Algebra and Its
Applications, vol. 38, pp. 5–15, June 1981.
[22] W. J. Stewart, Introduction to the Numerical Solution of Markov Chains. Princeton University Press, 1994.
[23] E. Seneta, Non-negative Matrices and Markov Chains, ser. Springer Series in Statistics. Springer, 1981.
[24] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1991.
[25] P. A. Knight, “The Sinkhorn-Knopp algorithm: Convergence and applications,” SIAM J. Matrix Anal. Appl., vol. 30, no. 1,
pp. 261–275, 2008.
[26] R. Sinkhorn, “A relationship between arbitrary positive matrices and doubly stochastic matrices,” The Annals of
Mathematical Statistics, vol. 35, no. 2, pp. 876–879, 1964.
[27] J. Darroch and D. Ratcliff, “Generalized iterative scaling for log-linear models,” Annals of Mathematical Statistics, vol. 43,
pp. 1470–1480, 1972.
[28] L. Huang, D. Yan, M. I. Jordan, and N. Taft, “Spectral clustering with perturbed data,” Neural Information Processing
Systems (NIPS), 2008.
[29] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative
filtering,” IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2080–2095, August 2007.
[30] P. Chatterjee and P. Milanfar, “Is denoising dead?” IEEE Trans. on Image Proc., vol. 19, no. 4, pp. 895–911, April 2010.
[31] P. Perona and J. Malik, “Scale-space and edge detection using anisotropic diffusion,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 12, no. 9, pp. 629–639, July 1990.
[32] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin, “An iterative regularization method for total variation-based image
restoration,” Multiscale Modeling and Simulation, vol. 4, no. 2, pp. 460–489, 2005.
[33] P. Buhlmann and B. Yu, “Boosting with the l2 loss: Regression and classification,” Journal of the American Statistical
Association, vol. 98, no. 462, pp. 324–339, 2003.
[34] J. W. Tukey, Exploratory Data Analysis. Addison Wesley, 1977.
[35] D. P. Palomar and S. Verdu, “Gradient of mutual information in linear vector gaussian channels,” IEEE Trans. on Information
Theory, vol. 52, no. 1, pp. 141–154, 2006.
[36] M. Charest and P. Milanfar, “On iterative regularization and its application,” IEEE Trans. on Circuits and Systems for
Video Technology, vol. 18, no. 3, pp. 406–411, March 2008.
[37] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Transactions on Signal Processing,
vol. 41, no. 12, pp. 3397 – 3415, Dec. 1993.
[38] F. Luisier, T. Blu, and M. Unser, “A new SURE approach to image denoising: Interscale orthonormal wavelet thresholding,”
IEEE Transactions on Image Processing, vol. 16, no. 3, pp. 593–606, March 2007.
[39] Z. Wang and A. Bovik, “Mean squared error: Love it or leave it? a new look at signal fidelity measures,” IEEE Signal
Processing Magazine, vol. 26, no. 1, pp. 98–117, Jan 2009.
[40] L. Wasserman, All of Nonparametric Statistics. Springer, 2005.
[41] A. Bruckstein, “On globally optimal local modeling: From moving least square to overparametrization.” SIAM, 2010,
Conference on Imaging Science, Chicago, IL.
[42] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse
representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, November 2006.
[43] W. Kahan, “Spectra of nearly hermitian matrices,” Proceedings of the American Mathematical Society, vol. 48, pp. 11–17,
1975.
[44] P. Milanfar, “Asymptotic nearness of stochastic and doubly-stochastic matrices,” submitted.
[45] H. Jegou, C. Schmid, H. Harzallah, and J. Verbeek, “Accurate image search using the contextual dissimilarity measure,”
IEEE Trans. on Pattern Analysis and Machine Intell., vol. 32, no. 1, pp. 2–11, January 2010.
[46] C. Johnson, R. Masson, and M. Trosset, “On the diagonal scaling of euclidean distance matrices to doubly stochastic
matrices,” Linear Algebra and its Applications, vol. 397, no. 1, pp. 253–264, 2005.
[47] D. Yan, L. Huang, and M. I. Jordan, “Fast approximate spectral clustering,” 15th ACM Conference on Knowledge Discovery
and Data Mining (SIGKDD), 2009.