Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012



(General) To retrieve a clean dataset by deleting outliers. (Computer Vision) the recovery of a digital image that has been contaminated by additive white Gaussian noise.



Computer Vision Project Manifold Blurring Mean Shift algorithms for manifold denoising Kévin Adda, Florent Renucci

Table of contents

Introduction
I – Description of the algorithm
II – Setting the parameters
    II.1 – σ
    II.2 – L = dimension of the subspace
    II.3 – Number of iterations and number of neighbours
III – Denoising and blurring of a manifold
    III.1 – Code
    III.2 – Results
IV – MNIST letters labelling
    IV.1 – Code
    IV.2 – Results
V – Conclusion
Appendix 1: Code spiral
Appendix 2: Code letters




Introduction

Denoising a dataset is the pre-processing operation that aims to isolate the points that are consistent with the global pattern of the set. There are many ways of describing the data referred to as noise. In computer vision, image denoising is the recovery of a digital image that has been contaminated by additive white Gaussian noise, whereas video denoising removes noise by spatial methods (for instance, image denoising on each frame of the video), temporal methods, or both.

Manifold denoising consists in finding the numerically distant observations of a dataset, usually referred to as outliers. Unlike with images, the source and thus the nature of the noise might be unknown. Hence, methods should not make any assumption about the noise structure.

The objective of this algorithm is to denoise data by blurring it. The blurring step moves each point toward its nearest neighbours, so that points aggregate: an outlier, whose nearest neighbours are close to each other but far from it, moves closer to that existing group. The move is computed with a projection based on Principal Component Analysis (PCA). The fewer assumptions we make about the data, the closer to reality the results are, which is why we use a nonparametric method.

This algorithm can be used for denoising alone, or as a pre-processing step before classification. We will thus present the results of its application to some manifolds, and then to the classification of MNIST letters. We will use a neural network and a structured SVM to classify the letters, and then we will pre-process the data, classify it again, and observe a decrease in the error rate.



I – Description of the algorithm

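The steps of the Manifold Blurring Mean Shift (MBMS) algorithm are the following:

- Blurring mean-shift update: we build a Gaussian kernel centred on a given point $x_n$, evaluate it on its nearest neighbours $\mathcal{N}_n$, and update the point with the resulting weights:

$$K(x_m) = \frac{\exp\left(-\dfrac{\|x_n - x_m\|^2}{2\sigma^2}\right)}{\sum_{m' \in \mathcal{N}_n} \exp\left(-\dfrac{\|x_n - x_{m'}\|^2}{2\sigma^2}\right)}, \qquad x_n \leftarrow \sum_{m \in \mathcal{N}_n} K(x_m)\, x_m$$

- Projection on a lower-dimensional subspace with PCA: once the data has been centred, the best linear L-dimensional manifold in terms of reconstruction error is computed, and the point is projected orthogonally onto it:

$$\min_{\mu,\, U} \sum_{m \in \mathcal{N}_n} \left\| x_m - \left( U U^T (x_m - \mu) + \mu \right) \right\|^2$$

such that:

$$U^T U = I_L$$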
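As an illustration of the first step, here is a minimal Python sketch of the kernel-weighted update of a single point (the report's own code is in MATLAB; the function names `gaussian_weights` and `mean_shift_update` are ours and purely illustrative). It assumes that the neighbour set does not contain the point itself and that σ is not so small that every weight underflows to zero.

```python
import math

def gaussian_weights(x, neighbours, sigma):
    """Normalised Gaussian kernel weights of the neighbours, as seen from x."""
    raw = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, m)) / (2.0 * sigma ** 2))
           for m in neighbours]
    total = sum(raw)  # assumed non-zero: sigma must not be tiny compared with the distances
    return [w / total for w in raw]

def mean_shift_update(x, neighbours, sigma):
    """New position of x: the kernel-weighted average of its neighbours."""
    weights = gaussian_weights(x, neighbours, sigma)
    return [sum(w * m[d] for w, m in zip(weights, neighbours)) for d in range(len(x))]

# An outlier at (3, 0) whose three nearest neighbours form a tight group near the origin.
outlier = [3.0, 0.0]
group = [[0.0, 0.0], [0.2, 0.1], [0.1, -0.1]]
moved = mean_shift_update(outlier, group, sigma=1.5)
print(moved)  # a point inside the group, far from the starting position
```

Because the weights are positive and sum to one, the updated point always lies inside the convex hull of its neighbours, which is why an isolated outlier is pulled into the nearest group.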

The first step moves the point closer to its neighbours; the second step moves it closer to the manifold. Note that, in theory, a single step could already give good results; the benefit of using both is explained further on. Intuitively, using only the second step amounts to projecting the point onto the subspace obtained by PCA of its nearest neighbours centred on their mean, which is mathematically equivalent to setting σ = ∞. Using only the first step amounts to dropping the projection, that is, taking a projection space of dimension zero (L = 0). These are particular cases of the general MBMS algorithm. At the other extreme, if the dimension of the subspace equals the dimension of the ambient space, nothing happens: there is no motion at all.

Another interesting point is that the algorithm can take all N points as neighbours, which means that the entire graph is considered at every step. In that case, the algorithm is called MBMSf (f for full). Provided σ is small enough for distant points to receive a negligible weight, the difference between MBMS and MBMSf is small: beyond a certain k, taking more neighbours neither improves nor degrades the result significantly. However, the choice of the parameters is an important trade-off that we will discuss below, as it determines the strength of the denoising effect: too strong a denoising effect might damage the dataset.
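The particular cases can be checked numerically on a single 2-D point. The Python sketch below is one possible reading of how the two steps combine, assuming the formulation in which the mean-shift motion is kept only in the directions orthogonal to the local PCA subspace. Under this reading, L = 0 gives the mean-shift step alone, σ = ∞ gives the PCA projection alone, and L = 2 (the ambient dimension) gives no motion. The helper names are hypothetical, and the code handles 2-D data only, where the first principal direction has a closed form.

```python
import math

def mean_shift_target(x, nbrs, sigma):
    """Gaussian-weighted average of the neighbours (sigma = inf gives their plain mean)."""
    if math.isinf(sigma):
        w = [1.0] * len(nbrs)
    else:
        w = [math.exp(-((x[0] - m[0]) ** 2 + (x[1] - m[1]) ** 2) / (2.0 * sigma ** 2))
             for m in nbrs]
    s = sum(w)
    return (sum(wi * m[0] for wi, m in zip(w, nbrs)) / s,
            sum(wi * m[1] for wi, m in zip(w, nbrs)) / s)

def principal_direction(points):
    """Unit vector along the first principal component of 2-D points (closed-form 2x2 PCA)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))

def mbms_point_update(x, nbrs, sigma, L):
    """Mean-shift motion of x, minus its component inside the local L-dimensional PCA subspace."""
    tx, ty = mean_shift_target(x, nbrs, sigma)
    dx, dy = tx - x[0], ty - x[1]
    if L == 1:
        ux, uy = principal_direction(nbrs)
        along = dx * ux + dy * uy
        dx, dy = dx - along * ux, dy - along * uy
    elif L == 2:
        dx, dy = 0.0, 0.0  # the subspace is the whole plane: no motion is left
    elif L != 0:
        raise ValueError("L must be 0, 1 or 2 for 2-D data")
    return (x[0] + dx, x[1] + dy)

nbrs = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]   # neighbours lying on the x-axis
x = (1.2, 1.0)
projection_only = mbms_point_update(x, nbrs, math.inf, 1)  # sigma = inf: PCA projection of x
mean_shift_only = mbms_point_update(x, nbrs, math.inf, 0)  # L = 0: plain mean-shift step
no_motion = mbms_point_update(x, nbrs, 1.0, 2)             # L = D: x does not move
```

With these inputs, `projection_only` is approximately (1.2, 0.0), the orthogonal projection of x onto the line carrying its neighbours; `mean_shift_only` is approximately (0.5, 0.0), the mean of the neighbours; and `no_motion` is x itself.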

II – Setting the parameters

II.1 – The kernel variance 𝝈

As explained earlier, the data motion is largest when σ = ∞. More generally, the greater σ is, the stronger the denoising effect. Hence, a good choice for σ is a value large enough for the algorithm to remove the noise, yet small enough not to damage the dataset by distorting the manifold.

II.2 – The dimension of the subspace L (intrinsic dimension)

The greater L is, the more the projection respects the general pattern of the manifold. At the same time, the motion is smaller: at the limit, L = D (the dimension of the ambient space), and there is no motion at all.

II.3 – Number of iterations and number of neighbours

Experience shows that 5 iterations are enough for good results. The number of neighbours must be at least about 10; beyond that, all the results are equivalent.

                                   | L increases | σ increases | k increases (above 10) | More iterations (above 5)
Movement                           | Decreases   | Increases   | No big change          | No big change
Denoising effect                   | Decreases   | Increases   | No big change          | No big change
Risk of distortion of the manifold | Decreases   | Increases   | No big change          | No big change



III – Denoising and blurring of a manifold  

Following the paper, our first experiment was to apply the algorithm to datasets presenting an obvious structural pattern. The paper uses a noisy spiral. Looking for such datasets, we chose a pinwheel set generator, which we found on the website of the Harvard Intelligent Probabilistic Systems group.

III.1 – Code

We set the parameters, import a dataset representing a spiral, then:

- compute the k nearest neighbours of each point,
- run step 1: local clustering,
- run step 2: Principal Component Analysis on the nearest neighbours,
- project the point onto the lower-dimensional subspace.
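The steps listed above can be sketched as follows in Python. This is not the `MBMS` function called in the appendix code, whose source is not reproduced in this report, but a hypothetical reimplementation that only mirrors its argument order. It assumes 2-D points, L equal to 0 or 1, a brute-force neighbour search, and a combination of the two steps in which only the motion orthogonal to the local PCA subspace is kept.

```python
import math

def k_nearest(points, i, k):
    """Brute-force k nearest neighbours of points[i] (the point itself is excluded)."""
    xi, yi = points[i]
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: (points[j][0] - xi) ** 2 + (points[j][1] - yi) ** 2)
    return [points[j] for j in others[:k]]

def principal_direction(points):
    """Unit vector along the first principal component of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))

def mbms(L, k, sigma, points, n_iteration):
    """Illustrative MBMS for 2-D points with L in {0, 1}; all points are updated together."""
    for _ in range(n_iteration):
        updated = []
        for i, (x, y) in enumerate(points):
            nbrs = k_nearest(points, i, k)
            # Step 1: local clustering (Gaussian-weighted mean of the neighbours).
            w = [math.exp(-((x - a) ** 2 + (y - b) ** 2) / (2.0 * sigma ** 2))
                 for a, b in nbrs]
            s = sum(w)
            if s == 0.0:  # neighbours too far away for this sigma: leave the point in place
                updated.append((x, y))
                continue
            dx = sum(wi * a for wi, (a, b) in zip(w, nbrs)) / s - x
            dy = sum(wi * b for wi, (a, b) in zip(w, nbrs)) / s - y
            # Step 2: local PCA; keep only the motion orthogonal to the principal direction.
            if L == 1:
                ux, uy = principal_direction(nbrs)
                along = dx * ux + dy * uy
                dx, dy = dx - along * ux, dy - along * uy
            updated.append((x + dx, y + dy))
        points = updated
    return points

# A slightly jittered horizontal line plus one outlier above it.
line = [(0.1 * i, 0.01 * (-1) ** i) for i in range(11)]
data = line + [(0.5, 0.6)]
denoised = mbms(1, 4, 0.5, data, 3)
```

On this toy set the outlier, the last point, is brought back close to the line, while the points of the line barely move along it because the tangential part of the motion is removed. The brute-force search makes each iteration quadratic in the number of points.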

See the spiral code in Appendix 1, also attached in the .rar archive.

We use a spiral because it is one of the most challenging two-dimensional shapes for machine learning algorithms.

We first built our own dataset ("Spiral Built Dataset"), and then chose to use a spiral dataset found on the Internet, generated by pinwheel.m.

To the noisy pinwheel set, we add uniformly distributed outliers.
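A comparable test set can be generated without pinwheel.m. The Python sketch below is only a stand-in: it draws a single Archimedean spiral with Gaussian noise (not the pinwheel shape) and adds outliers drawn uniformly in the square [-2, 2] × [-2, 2], the same range as the `-2 + 4*rand(...)` expression of Appendix 1. The noise level, the number of spiral points and the spiral parametrisation are arbitrary choices of ours.

```python
import math
import random

def noisy_spiral(n_points, noise_std, rng):
    """Points on the spiral r = t / (4*pi), t in [0, 4*pi], with Gaussian noise added."""
    points = []
    for i in range(n_points):
        t = 4.0 * math.pi * i / (n_points - 1)
        r = t / (4.0 * math.pi)  # the radius grows from 0 to 1 over two turns
        points.append((r * math.cos(t) + rng.gauss(0.0, noise_std),
                       r * math.sin(t) + rng.gauss(0.0, noise_std)))
    return points

def uniform_outliers(n_outliers, low, high, rng):
    """Outliers drawn uniformly in the square [low, high] x [low, high]."""
    return [(rng.uniform(low, high), rng.uniform(low, high)) for _ in range(n_outliers)]

rng = random.Random(0)                       # fixed seed so that the set is reproducible
spiral = noisy_spiral(200, 0.02, rng)
outliers = uniform_outliers(80, -2.0, 2.0, rng)
dataset = spiral + outliers                  # outliers are appended after the spiral points
```

Keeping the outliers at the end of the list makes it easy to track where they end up after denoising.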

III.2 – Results

At each iteration, the most numerically distant points are translated and projected toward the manifold, and eventually merge with another point. Thus, most of the outliers are removed from the dataset, but the manifold might be damaged.

Here are the results of the following algorithms, each given with its parameters (L, k, σ). We chose parameters that are close to the ones chosen in the paper, and that give good results on our datasets:

• MBMSk: (1, 15, 1.1);
• MBMSf: MBMSk on the full graph (1, ·, 1.1);
• LTP: MBMSk with infinite σ (1, 15, ∞);
• GBMS: MBMSf with a zero-dimensional projection space (0, ·, 1.1).



As we can see, both MBMSf and GBMS, which are full-graph algorithms, damage the manifold from the first iteration, and eventually reduce it to a single point or group of points.

More generally, one must be careful when choosing the parameters, which are interdependent: for instance, it is possible to work on the full graph (i.e. to skip the nearest-neighbours step), but only if the data motion is limited by choosing a relatively small variance.

Here is another example of the GBMS algorithm (L=0, on full graph), with a more appropriate kernel variance.



While the algorithm reduced the dataset to a single point with a kernel variance of 1.1, it produces good results with a variance divided by ten. This shows how sensitive the algorithm is to each of its parameters, which are interdependent.

Thus, there is a trade-off within the whole set of parameters, or between certain pairs of parameters. We can set them by hand by looking at the results, or define a way of selecting them. If we apply machine learning to the blurred data, we can use the error rate as an indicator of well-chosen parameters.

IV – MNIST letters labelling

IV.1 – Code

First we import the MNIST letters dataset. A function displays the data: each letter is a 16×8 matrix filled with 0s and 1s (1 for white and 0 for black).

Then we pre-process the data with the function ImageLabellingFormatting:

- From each matrix representing a letter, we extract the elements equal to 1. For example, if $m_{1,3} = 1$, we extract the point (1, 3). Doing this for every matrix gives, for each one, a vector containing the coordinates of the white points.
- We can then apply the previous denoising algorithm.
- If the result is not an integer, for example if a pixel is to be moved to the coordinates (12.54, 14.1), we round it to (13, 14).
- The vector obtained is transformed back into a matrix of 0s and 1s: this is the exact opposite of the first task.
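The conversion between a binary image and a list of white-pixel coordinates can be sketched in Python as below. The function names are ours (they are not the `PositiveLabellingFormatting` and `ImageLabellingFormatting` functions of the appendix), coordinates are 1-based as in the text, and we assume that points rounded outside the image are simply dropped.

```python
import math

def image_to_points(image):
    """1-based (row, column) coordinates of every pixel equal to 1."""
    return [(r + 1, c + 1)
            for r, row in enumerate(image)
            for c, value in enumerate(row)
            if value == 1]

def nearest_pixel(point):
    """Round a real-valued (row, column) position to the nearest pixel, halves rounded up."""
    return (int(math.floor(point[0] + 0.5)), int(math.floor(point[1] + 0.5)))

def points_to_image(points, n_rows, n_cols):
    """Inverse conversion: set to 1 the pixel nearest to each point, drop points that fall outside."""
    image = [[0] * n_cols for _ in range(n_rows)]
    for point in points:
        r, c = nearest_pixel(point)
        if 1 <= r <= n_rows and 1 <= c <= n_cols:
            image[r - 1][c - 1] = 1
    return image

image = [[0, 0, 1],
         [0, 1, 0]]
points = image_to_points(image)            # [(1, 3), (2, 2)]
restored = points_to_image(points, 2, 3)   # identical to the original image
rounded = nearest_pixel((12.54, 14.1))     # (13, 14), the example given in the text
```

Two points that round to the same pixel produce a single white pixel, which is how blurred points merge in the image representation.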

IV.2 – Results

We can compare the initial image and the blurred one. The paper does not explain how to blur an image represented the way it is here, only an image that has the same representation as the spiral of section III. This is why we decided to use the same approach, without shades of grey.

It is important to take an even number of neighbours, otherwise a line would not remain a line after blurring. Suppose 5 pixels are aligned horizontally and we move the 3rd one using 3 neighbours: we rely on 2 neighbours on one side (for example on the left) and 1 neighbour on the other side (for example on the right). In that case, the pixel is pulled toward the side with more neighbours and merges with another pixel, so the line ends up with holes.
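The argument can be illustrated with a deliberately simplified Python sketch: one dimension, an unweighted mean of the neighbours (the limit of a very large σ), no PCA step, and rounding to the nearest pixel. These simplifications are ours; ties between equally distant neighbours are broken in favour of the left-hand pixel.

```python
def blurred_position(pixels, index, k):
    """Rounded mean of the k nearest neighbours of pixels[index] on a 1-D line."""
    x = pixels[index]
    others = [p for j, p in enumerate(pixels) if j != index]
    others.sort(key=lambda p: abs(p - x))  # stable sort: the left pixel wins a tie
    nbrs = others[:k]
    return int(sum(nbrs) / len(nbrs) + 0.5)

line = [1, 2, 3, 4, 5]                      # five aligned pixels; we move the 3rd one
even_k = blurred_position(line, 2, 2)       # neighbours 2 and 4: the pixel stays at 3
odd_k = blurred_position(line, 2, 3)        # neighbours 2, 4 and 1: the pixel moves to 2
print(even_k, odd_k)                        # 3 2
```

With k = 3 the middle pixel lands on position 2, which is already occupied, so position 3 becomes a hole in the line; with k = 2 or k = 4 the neighbours are balanced and the pixel stays where it is.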

Using a neural network with one layer, we label the images. The good labelling rate is 51%, which means that this classifier is not very effective on this problem.

Then we blur the images and repeat the study. The good labelling rate is 53%: pre-processing the data allows better labelling.

This result also appears when we consider a random subset of the dataset. We decrease the error rate by 2-4%.

After that, we separate the learning dataset from the test dataset. The good labelling rate increases by 3 to 4 percentage points, from 35% to 39%, which means that, with the neural network, the algorithm improves labelling by more than 10% in relative terms.
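The two ways of reading the improvement (percentage points versus relative gain) can be checked with a few lines of Python, using the rounded good labelling rates quoted in this section; the function name is ours.

```python
def gains(rate_before, rate_after):
    """Absolute gain in percentage points, and relative gain as a fraction of the initial rate."""
    absolute = rate_after - rate_before
    return absolute, absolute / rate_before

whole_abs, whole_rel = gains(51, 53)   # training and testing on the whole dataset
split_abs, split_rel = gains(35, 39)   # separate learning and test datasets
print(split_abs, round(100 * split_rel, 1))  # 4 11.4
```

On the separated datasets the gain is 4 points out of 35, that is about 11% in relative terms; on the whole dataset it is 2 points out of 51, or about 4%.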

See the letters code in Appendix 2, also attached in the .rar archive.



Good labelling rates

            | Whole dataset | Training/test dataset
No blurring | 51%           | 35%
Blurring    | 53%           | 39%

V – Conclusion

The Manifold Blurring Mean Shift algorithm blurs an image in order to:

- erase some outliers by merging them into the "real" image;
- merge outliers together, decreasing their number.

The complexity of the algorithm, which is polynomial in the number of points, makes it difficult to use on bigger sets such as noisy images. However, it is useful for smaller datasets and small images such as those of the MNIST dataset. The advantage of this algorithm is that it makes no hypothesis on the distribution of the noise and outliers, but computes the associated motion from the position of each point relative to the rest of the dataset.

Finally, the algorithm allowed us to decrease the error rate of a multi-label classification method: we can assume that this pre-processing method improves classification performance on noisy datasets.

The classification results are still quite poor: the one-layer neural network does not perform as well as hoped on this task. We actually got much better performance with a multilabel structured SVM on the initial dataset, but could not apply it to the pre-processed data because of technical issues.



Appendix 1: Code spiral

%% Harvard spirals
% MBMS parameters: assumed here to match the plot title below;
% n_iteration follows the value suggested in section II.3
L = 0; k = 3; sigma = 1.5; n_iteration = 5;

% Generate noisy spiral
[h_spiral_set, b] = pinwheel(0.1, 0.3, 5, 200, .5);
figure, scatter(h_spiral_set(:,1), h_spiral_set(:,2))
denoised_h_spiral = h_spiral_set;

% Add outliers, uniformly distributed in [-2, 2] x [-2, 2]
n_outliers = 80;
outliers = -2 + 4*rand(n_outliers, 2);
spiral_with_outliers = [denoised_h_spiral; outliers];
denoised_h_spiral = spiral_with_outliers;
h_spiral_set = spiral_with_outliers;
figure, scatter(h_spiral_set(:,1), h_spiral_set(:,2))

% Apply MBMS algorithm
denoised_h_spiral = MBMS(L, k, sigma, denoised_h_spiral, n_iteration);
figure, scatter(denoised_h_spiral(:,1), denoised_h_spiral(:,2))
axis([-2,2,-2,2])

% Original and denoised data side by side
figure,
subplot(1,2,1)
scatter(h_spiral_set(:,1), h_spiral_set(:,2))
title('Harvard pinwheel - original data')
subplot(1,2,2)
scatter(denoised_h_spiral(:,1), denoised_h_spiral(:,2))
title('GBMS denoised data (L = 0, k = 3, sigma = 1.5)')

% DEMO - Plotting dataset at each iteration
n_iter = 5;
denoised_h_spiral = h_spiral_set;
figure, scatter(denoised_h_spiral(:,1), denoised_h_spiral(:,2))
axis([-2,2,-2,2])
title('Initial dataset')
for it = 1:n_iter,
    denoised_h_spiral = MBMS(L, k, sigma, denoised_h_spiral, 1);
    figure, scatter(denoised_h_spiral(:,1), denoised_h_spiral(:,2))
    axis([-2,2,-2,2])
end

Appendix 2: Code letters  


%% MNIST dataset classification
% Import data
MNIST_data = importdata('');
% Format data: deletes useless features
% (the .data field name is assumed from the struct returned by importdata)
pixel_data = MNIST_data.data(:, 5:end);
textdata = MNIST_data.textdata;
% Size of the image
nrow = 16; ncol = 8;

% Plot a data subset with labelling
figure, subplot(7, 7, 1)
for i = 1:49,
    subplot(7, 7, i)
    imshow(vec2mat(pixel_data(i,:), ncol))
    title(sprintf('MNIST label: %c', cell2mat(textdata(i, 2))))
end

%% Preprocessing the data
letters = {'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'};
n = 10; eta = 0.000025;
L = 1; sigma = 1; k = 4; n_iteration = 1;
row = size(textdata, 1);   % number of samples
pixels = zeros(row, nrow*ncol);
for i = 1:row
    if mod(i, 50) == 0
        disp(i)   % progress indicator, every 50 samples
    end
    tt = PositiveLabellingFormatting(pixel_data(i,:), 'MNIST');   % image -> coordinates of the white points
    ttt = MBMS(L, k, sigma, tt{1}, n_iteration);                  % denoise the points
    ll = ImageLabellingFormatting(ttt, 'MNIST');                  % points -> image
    pixels(i,:) = vec(ll);                                        % flatten the image into one row
end

%% Error rates
% Labelling before preprocessing
mlp1 = MLP(textdata(1:row,2), pixel_data(1:row,:), letters, n, eta);
rate1 = ResultMLP(mlp1, pixel_data(1:row,:), textdata(1:row,2), letters) % error rate
% Labelling after preprocessing
mlp2 = MLP(textdata(1:row,2), pixels(1:row,:), letters, n, eta);
rate2 = ResultMLP(mlp2, pixels(1:row,:), textdata(1:row,2), letters) % error rate
% Error rate reduction:
% 20 000 : 0.5327 then 0.5294
% 30 000 : 0.5123 then 0.5103
% 40 000 : 0.5878 then 0.5659
% 50 000 : 0.4886 then 0.4719

%% Separating learning dataset and test dataset
number = 30000; % size of the learning dataset
% Labelling before preprocessing (tested on the samples not used for learning)
mlp1 = MLP(textdata(1:number,2), pixel_data(1:number,:), letters, n, eta);
rate1 = ResultMLP(mlp1, pixel_data(number+1:row,:), textdata(number+1:row,2), letters) % error rate
% Labelling after preprocessing
mlp2 = MLP(textdata(1:number,2), pixels(1:number,:), letters, n, eta);
rate2 = ResultMLP(mlp2, pixels(number+1:row,:), textdata(number+1:row,2), letters) % error rate
% Error rate reduction:
% 40 000 : 0.6528 then 0.6160
% 30 000 : 0.6397 then 0.6145
