Mixtures of Gaussians and Advanced Feature Encoding. Computer Vision, CS 143, Brown. James Hays. Many slides from Derek Hoiem, Florent Perronnin, and Hervé Jégou.


  • Mixtures of Gaussians and Advanced Feature Encoding. Computer Vision, CS 143, Brown.

    James Hays. Many slides from Derek Hoiem, Florent Perronnin, and Hervé Jégou.

  • Why do good recognition systems go bad? E.g. why isn't our Bag of Words classifier at 90% instead of 70%?
    Training data: a huge issue, but not necessarily a variable you can manipulate.
    Learning method: probably not such a big issue, unless you're learning the representation (e.g. deep learning).
    Representation: are the local features themselves lossy? The guest lecture on Nov 8th will address this. What about feature quantization? That's VERY lossy.

  • Standard k-means Bag of Words. http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

  • Today's Class: more advanced quantization / encoding methods that represent the state of the art in image classification and image retrieval.
    Soft assignment (a.k.a. Kernel Codebook)
    VLAD
    Fisher Vector

    Mixtures of Gaussians

  • Motivation: Bag of Visual Words is only about counting the number of local descriptors assigned to each Voronoi region.

    Why not include other statistics? http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

  • We already looked at the Spatial Pyramid (level 0, level 1), but today we're not talking about ways to preserve spatial information.

  • Motivation: Bag of Visual Words is only about counting the number of local descriptors assigned to each Voronoi region.

    Why not include other statistics? For instance: the mean of the local descriptors. http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

  • Motivation: Bag of Visual Words is only about counting the number of local descriptors assigned to each Voronoi region.

    Why not include other statistics? For instance: the mean of the local descriptors, the (co)variance of the local descriptors. http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

  • Simple case: Soft Assignment. Called Kernel Codebook encoding by Chatfield et al. 2011. Cast a weighted vote into the most similar clusters.

  • Simple case: Soft Assignment. Called Kernel Codebook encoding by Chatfield et al. 2011. Cast a weighted vote into the most similar clusters. This is fast and easy to implement (try it for Project 3!), but it does have some downsides for image retrieval: the inverted file index becomes less sparse.
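    A minimal MATLAB sketch of this idea (not the exact Chatfield et al. formulation): `features` is an N x D matrix of local descriptors, `vocab` is a K x D matrix of k-means centers, and `sigma` is a hand-chosen Gaussian kernel width; all three names are placeholders for this example, and `pdist2` assumes the Statistics Toolbox.

    % Kernel codebook (soft assignment) encoding -- a sketch, not the exact
    % formulation from Chatfield et al. 2011.
    function hist_soft = kernel_codebook_encode(features, vocab, sigma)
        % Squared distances between every descriptor and every cluster center
        d2 = pdist2(features, vocab).^2;                % N x K
        % Gaussian kernel weights: nearby clusters get larger votes
        w  = exp(-d2 / (2 * sigma^2));                  % N x K
        % Normalize each descriptor's votes to sum to 1, then accumulate
        w  = bsxfun(@rdivide, w, sum(w, 2) + eps);
        hist_soft = sum(w, 1);                          % 1 x K soft histogram
        % L1-normalize the image-level histogram
        hist_soft = hist_soft / (sum(hist_soft) + eps);
    end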

  • A first example: the VLAD. Given a codebook {c_1, ..., c_K}, e.g. learned with k-means, and a set of local descriptors {x}:

    assign: NN(x) = argmin_i ||x - c_i||

    compute: v_i = sum of (x - c_i) over all x such that NN(x) = i

    concatenate the v_i's and normalize

    Jégou, Douze, Schmid and Pérez, Aggregating local descriptors into a compact image representation, CVPR'10. [Figure: descriptors hard-assigned to cells 1-5; v_i accumulates the residuals x - c_i for cell i.]
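    A compact MATLAB sketch of this encoding (placeholder names `features` and `vocab`; `pdist2` again assumes the Statistics Toolbox):

    % VLAD encoding -- a minimal sketch. features: N x D local descriptors,
    % vocab: K x D codebook of k-means centers. Output is a 1 x (K*D) vector.
    function v = vlad_encode(features, vocab)
        [K, D] = size(vocab);
        % Hard-assign each descriptor to its nearest center
        [~, nn] = min(pdist2(features, vocab), [], 2);   % N x 1 indices
        v = zeros(K, D);
        for i = 1:K
            members = features(nn == i, :);
            if ~isempty(members)
                % Accumulate residuals x - c_i for all descriptors in cell i
                v(i, :) = sum(bsxfun(@minus, members, vocab(i, :)), 1);
            end
        end
        % Concatenate and L2-normalize
        v = v(:)';
        v = v / (norm(v) + eps);
    end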

  • A first example: the VLAD. A graphical representation of the aggregated vectors v_i.

    Jégou, Douze, Schmid and Pérez, Aggregating local descriptors into a compact image representation, CVPR'10.

  • The Fisher vector: score function

    Given a likelihood function p(X|λ) with parameters λ, the score function of a given sample X is the gradient of the log-likelihood with respect to λ (see below).

    It is a fixed-length vector whose dimensionality depends only on the number of parameters.

    Intuition: the direction in which the parameters of the model should be modified to better fit the data.
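    Written out (the slide's own equation did not survive conversion; this is the standard definition from the Fisher kernel literature):

    \[
      G^{X}_{\lambda} \;=\; \nabla_{\lambda} \log p(X \mid \lambda) .
    \]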

  • Aside: Mixture of Gaussians (GMM). For Fisher Vector image representations, p(X|λ) is a GMM. A GMM can be thought of as soft k-means.

    Each component has a mean and a standard deviation along each direction (or a full covariance).
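    For reference (this equation is not in the slides as extracted; it is the standard GMM density used with the Fisher vector):

    \[
      p(x \mid \lambda) = \sum_{i=1}^{K} w_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i),
      \qquad \sum_{i=1}^{K} w_i = 1,
    \]

    with parameters \lambda = \{w_i, \mu_i, \Sigma_i\}_{i=1}^{K}.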

  • Missing Data Problems: Outliers. You want to train an algorithm to predict whether a photograph is attractive. You collect annotations from Mechanical Turk. Some annotators try to give accurate ratings, but others answer randomly.

    Challenge: determine which people to trust and the average rating given by the accurate annotators. Photo: Jam343 (Flickr). Annotator ratings: 10, 8, 9, 2, 8.

  • Missing Data Problems: Object Discovery. You have a collection of images and have extracted regions from them. Each region is represented by a histogram of visual words. Challenge: discover frequently occurring object categories, without pre-trained appearance models. http://www.robots.ox.ac.uk/~vgg/publications/papers/russell06.pdf

  • Missing Data Problems: Segmentation. You are given an image and want to assign foreground/background labels to its pixels.

    Challenge: segment the image into figure and ground without knowing what the foreground looks like in advance. [Figure: foreground / background.] Slide: Derek Hoiem

  • Missing Data Problems: Segmentation. Challenge: segment the image into figure and ground without knowing what the foreground looks like in advance.

    Three steps:
    1. If we had labels, how could we model the appearance of foreground and background?
    2. Once we have modeled the fg/bg appearance, how do we compute the likelihood that a pixel is foreground?
    3. How can we get both labels and appearance models at once?

    [Figure: foreground / background.] Slide: Derek Hoiem

  • Maximum Likelihood Estimation

    1. If we had labels, how could we model the appearance of foreground and background?

    [Figure: foreground / background.] Slide: Derek Hoiem

  • Maximum Likelihood Estimation: given data x (the observations) and parameters θ, choose the parameters that maximize the likelihood of the data, θ* = argmax_θ p(x | θ). Slide: Derek Hoiem

  • Maximum Likelihood Estimation: Gaussian distribution. Slide: Derek Hoiem
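    The MLE formulas implied here (and used verbatim in the MATLAB example on the next slide) are the standard Gaussian ones:

    \[
      \hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad
      \hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n - \hat{\mu})^2 .
    \]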

  • Example: MLE

    >> mu_fg = mean(im(labels))
    mu_fg = 0.6012
    >> sigma_fg = sqrt(mean((im(labels)-mu_fg).^2))
    sigma_fg = 0.1007
    >> mu_bg = mean(im(~labels))
    mu_bg = 0.4007
    >> sigma_bg = sqrt(mean((im(~labels)-mu_bg).^2))
    sigma_bg = 0.1007
    >> pfg = mean(labels(:));

    Parameters used to generate the synthetic image: fg: mu = 0.6, sigma = 0.1; bg: mu = 0.4, sigma = 0.1. [Figure: labels and im.] Slide: Derek Hoiem

  • Probabilistic Inference. 2. Once we have modeled the fg/bg appearance, how do we compute the likelihood that a pixel is foreground?

    [Figure: foreground / background.] Slide: Derek Hoiem

  • Probabilistic Inference: compute the likelihood that a particular model (component or label) generated a sample. Slide: Derek Hoiem

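    The derivation these build slides step through is presumably the usual application of Bayes' rule, which is exactly what the MATLAB example below computes:

    \[
      p(z = m \mid x, \theta)
      = \frac{p(x \mid z = m, \theta_m)\, p(z = m \mid \theta)}
             {\sum_{k} p(x \mid z = k, \theta_k)\, p(z = k \mid \theta)} .
    \]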

  • Example: Inference

    >> pfg = 0.5;
    >> px_fg = normpdf(im, mu_fg, sigma_fg);
    >> px_bg = normpdf(im, mu_bg, sigma_bg);
    >> pfg_x = px_fg*pfg ./ (px_fg*pfg + px_bg*(1-pfg));

    Learned parameters: fg: mu = 0.6, sigma = 0.1; bg: mu = 0.4, sigma = 0.1. [Figure: im and p(fg | im).] Slide: Derek Hoiem

  • Mixture of Gaussians* example: Matting. Figure from Bayesian Matting, Chuang et al. 2001.

  • Mixture of Gaussians* example: Matting. Result from Bayesian Matting, Chuang et al. 2001.

  • Dealing with Hidden Variables

    3. How can we get both labels and appearance models at once?

    [Figure: foreground / background.] Slide: Derek Hoiem

  • Mixture of Gaussians: p(x | θ) = Σ_m π_m p(x | z = m, θ_m), where z = m indexes the mixture component, π_m is the component prior, and θ_m are the component model parameters. Slide: Derek Hoiem

  • Mixture of Gaussians

    With enough components, it can approximate any probability density function, and it is widely used as a general-purpose pdf estimator.

    Slide: Derek Hoiem

  • Segmentation with Mixture of Gaussians: pixels come from one of several Gaussian components. We don't know which pixels come from which components, and we don't know the parameters of the components. Slide: Derek Hoiem

  • Simple solution:
    1. Initialize parameters.
    2. Compute the probability of each hidden variable given the current parameters.
    3. Compute new parameters for each model, weighted by the likelihood of the hidden variables.
    4. Repeat 2-3 until convergence.
    Slide: Derek Hoiem

  • Mixture of Gaussians: Simple Solution
    1. Initialize parameters.
    2. Compute the likelihood of the hidden variables for the current parameters.
    3. Estimate new parameters for each model, weighted by that likelihood.
    Slide: Derek Hoiem
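    To make the alternation concrete, here is a minimal MATLAB sketch of this loop for a 1-D, two-component mixture (the variable names, the fixed iteration count, and the use of normpdf mirror the earlier examples but are otherwise placeholders):

    x = im(:);                                           % e.g. grayscale pixel values
    mu = [0.3; 0.7]; sigma = [0.2; 0.2]; w = [0.5; 0.5]; % 1. initialize
    for iter = 1:50
        % 2. E-step: responsibility of each component for each sample
        p1 = w(1) * normpdf(x, mu(1), sigma(1));
        p2 = w(2) * normpdf(x, mu(2), sigma(2));
        r1 = p1 ./ (p1 + p2 + eps);                      % p(z = 1 | x, current params)
        r2 = 1 - r1;
        % 3. M-step: re-estimate parameters, weighted by the responsibilities
        mu(1) = sum(r1 .* x) / sum(r1);    mu(2) = sum(r2 .* x) / sum(r2);
        sigma(1) = sqrt(sum(r1 .* (x - mu(1)).^2) / sum(r1));
        sigma(2) = sqrt(sum(r2 .* (x - mu(2)).^2) / sum(r2));
        w(1) = mean(r1);                   w(2) = 1 - w(1);
    end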

  • Expectation Maximization (EM) Algorithm. Goal: maximize the data likelihood. The log of a sum is intractable to maximize directly, so we use Jensen's inequality, which holds for concave functions such as f(x) = log(x). See here for a proof: www.stanford.edu/class/cs229/notes/cs229-notes8.ps. Slide: Derek Hoiem

  • Expectation Maximization (EM) Algorithm

    E-step: compute the expected values of the hidden variables under the current parameters.

    M-step: solve for the parameters that maximize the resulting expected log-likelihood. Goal: maximize the data likelihood. Slide: Derek Hoiem

  • This looks like a soft version of kmeans!


  • EM for Mixture of Gaussians (on board)

    E-step:

    M-step:

    Slide: Derek Hoiem
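    The standard GMM updates that this board derivation arrives at (not shown on the extracted slides) are:

    E-step (responsibilities):
    \[
      r_{nm} = p(z_n = m \mid x_n, \theta)
             = \frac{\pi_m \, \mathcal{N}(x_n;\, \mu_m, \sigma_m^2)}
                    {\sum_{k} \pi_k \, \mathcal{N}(x_n;\, \mu_k, \sigma_k^2)} .
    \]

    M-step (weighted re-estimation):
    \[
      \mu_m = \frac{\sum_n r_{nm}\, x_n}{\sum_n r_{nm}}, \qquad
      \sigma_m^2 = \frac{\sum_n r_{nm}\, (x_n - \mu_m)^2}{\sum_n r_{nm}}, \qquad
      \pi_m = \frac{1}{N} \sum_n r_{nm} .
    \]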

  • EM Algorithm

    Maximizes a lower bound on the data likelihood at each iteration

    Each step increases the data likelihood, and the algorithm converges to a local maximum.

    Common tricks in the derivation: find terms that sum or integrate to 1; use a Lagrange multiplier to deal with constraints.

    Slide: Derek Hoiem

  • Mixture of Gaussians demos:
    http://www.cs.cmu.edu/~alad/em/
    http://lcn.epfl.ch/tutorial/english/gaussian/html/
    Slide: Derek Hoiem

  • Hard EM: same as EM, except that z* is computed as the most likely values of the hidden variables.

    K-means is an example.

    Advantages: simpler; can be applied when you cannot derive EM; sometimes works better if you want to make hard predictions at the end. But: in general, the pdf parameters are not as accurate as with EM. Slide: Derek Hoiem

  • Missing Data Problems: Outliers. You want to train an algorithm to predict whether a photograph is attractive. You collect annotations from Mechanical Turk. Some annotators try to give accurate ratings, but others answer randomly.

    Challenge: determine which people to trust and the average rating given by the accurate annotators. Photo: Jam343 (Flickr). Annotator ratings: 10, 8, 9, 2, 8.

  • Missing Data Problems: Object Discovery. You have a collection of images and have extracted regions from them. Each region is represented by a histogram of visual words. Challenge: discover frequently occurring object categories, without pre-trained appearance models. http://www.robots.ox.ac.uk/~vgg/publications/papers/russell06.pdf

  • What's wrong with this prediction? [Figure: P(foreground | image).] Slide: Derek Hoiem

  • Next class: MRFs and Graph-cut Segmentation. Slide: Derek Hoiem

  • Missing data problems

    Outlier detection: you expect that some of the data are valid points and some are outliers, but you don't know which are inliers, and you don't know the parameters of the distribution of inliers.

    Modeling probability densities: you expect that most of the colors within a region come from one of a few normally distributed sources, but you don't know which source each pixel comes from, and you don't know the Gaussian means or covariances.

    Discovering patterns or topics: you expect that there are some recurring topics of codewords within an image collection, and you want to figure out the frequency of each codeword within each topic and which topic each image belongs to. Slide: Derek Hoiem

  • Running Example: Segmentation. Estimate distribution parameters and perform inference given: (a) labeled examples of foreground and background; (b) only the knowledge that foreground and background are normally distributed; or (c) some initial estimate, but no good knowledge of foreground and background.

  • FV formulas:

    The Fisher vector: relationship with the BOV. Perronnin and Dance, Fisher kernels on visual categories for image categorization, CVPR'07.

  • FV formulas: gradient w.r.t. the mixture weights w

    soft BOV

    γ_t(i) = soft assignment of patch t to Gaussian i

    The Fisher vector: relationship with the BOV. Perronnin and Dance, Fisher kernels on visual categories for image categorization, CVPR'07.

  • gradient w.r.t. the means μ and standard deviations σ

    FV formulas: gradient w.r.t. the mixture weights w

    soft BOV

    γ_t(i) = soft assignment of patch t to Gaussian i

    Compared to the BOV, this includes higher-order statistics (up to order 2).

    Let us denote: D = feature dim, N = # Gaussians. BOV = N-dim; FV = 2DN-dim.

    The Fisher vector: relationship with the BOV. Perronnin and Dance, Fisher kernels on visual categories for image categorization, CVPR'07.

  • gradient w.r.t. the means μ and standard deviations σ

    FV formulas: gradient w.r.t. the mixture weights w

    soft BOV

    γ_t(i) = soft assignment of patch t to Gaussian i

    Compared to the BOV, this includes higher-order statistics (up to order 2).

    The FV is much higher-dimensional than the BOV for a given visual vocabulary size, yet much faster to compute than a BOV of a given feature dimension. The Fisher vector: relationship with the BOV. Perronnin and Dance, Fisher kernels on visual categories for image categorization, CVPR'07.
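    The per-Gaussian formulas these slides refer to (the equation images did not survive extraction; these are the standard diagonal-covariance expressions from the Fisher vector literature, with normalization conventions varying slightly across papers) are:

    \[
      \gamma_t(i) = \frac{w_i\, u_i(x_t)}{\sum_j w_j\, u_j(x_t)}
      \quad \text{(soft assignment of patch } x_t \text{ to Gaussian } i\text{)},
    \]
    \[
      \mathcal{G}^X_{\mu,i} = \frac{1}{T\sqrt{w_i}} \sum_{t=1}^{T} \gamma_t(i)\,
          \frac{x_t - \mu_i}{\sigma_i}, \qquad
      \mathcal{G}^X_{\sigma,i} = \frac{1}{T\sqrt{2 w_i}} \sum_{t=1}^{T} \gamma_t(i)
          \left[ \frac{(x_t - \mu_i)^2}{\sigma_i^2} - 1 \right],
    \]

    where u_i is the i-th Gaussian density and the divisions are element-wise over the D feature dimensions.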

  • The Fisher vector: dimensionality reduction on local descriptors. Perform PCA on the local descriptors: uncorrelated features are more consistent with the diagonal-covariance assumption of the GMM, and the FK performs whitening and enhances low-energy (possibly noisy) dimensions.

  • The Fisher vector: normalization (variance stabilization)

    Variance-stabilizing transforms of the form z ← sign(z) |z|^α (with α = 0.5 by default) can be used on the FV (or the VLAD).

    This reduces the impact of bursty visual elements. Jégou, Douze, Schmid, On the burstiness of visual elements, CVPR'09.
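    In MATLAB this power ("signed square root") normalization followed by L2 normalization is just a couple of lines; a sketch, assuming v holds the FV or VLAD vector:

    alpha = 0.5;
    v = sign(v) .* abs(v).^alpha;   % variance-stabilizing (power) transform
    v = v / (norm(v) + eps);        % L2-normalize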

  • Datasets for image retrieval. INRIA Holidays dataset: 1,491 personal holiday photos; 500 queries, each associated with a small number of relevant results (1-11); 1 million distractor images (with some 'false' false positives).

    Hervé Jégou, Matthijs Douze and Cordelia Schmid, Hamming Embedding and Weak Geometric consistency for large-scale image search, ECCV'08.

  • Examples: Retrieval. Example on Holidays:

    From: Jégou, Perronnin, Douze, Sánchez, Pérez and Schmid, Aggregating local descriptors into compact codes, TPAMI'11.

  • Examples: Retrieval. Example on Holidays:

    Second-order statistics are not essential for retrieval. From: Jégou, Perronnin, Douze, Sánchez, Pérez and Schmid, Aggregating local descriptors into compact codes, TPAMI'11.

  • Examples: Retrieval. Example on Holidays:

    Second-order statistics are not essential for retrieval. Even for the same feature dimension, the FV/VLAD can beat the BOV. From: Jégou, Perronnin, Douze, Sánchez, Pérez and Schmid, Aggregating local descriptors into compact codes, TPAMI'11.

  • Examples: Retrieval. Example on Holidays:

    Second-order statistics are not essential for retrieval. Even for the same feature dimension, the FV/VLAD can beat the BOV. Soft assignment + whitening of the FV helps as the number of Gaussians grows. From: Jégou, Perronnin, Douze, Sánchez, Pérez and Schmid, Aggregating local descriptors into compact codes, TPAMI'11.

  • Examples: Retrieval. Example on Holidays:

    Second-order statistics are not essential for retrieval. Even for the same feature dimension, the FV/VLAD can beat the BOV. Soft assignment + whitening of the FV helps as the number of Gaussians grows. After dimensionality reduction, however, the FV and VLAD perform similarly. From: Jégou, Perronnin, Douze, Sánchez, Pérez and Schmid, Aggregating local descriptors into compact codes, TPAMI'11.

  • Examples: Classification. Example on PASCAL VOC 2007:

    From: Chatfield, Lempitsky, Vedaldi and Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, BMVC'11.

    Method   Feature dim   mAP
    VQ       25K           55.30
    KCB      25K           56.26
    LLC      25K           57.27
    SV       41K           58.16
    FV       132K          61.69

  • Examples: Classification. Example on PASCAL VOC 2007:

    From: Chatfield, Lempitsky, Vedaldi and Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, BMVC'11.

    FV outperforms BOV-based techniques including:
    VQ: plain vanilla BOV
    KCB: BOV with soft assignment
    LLC: BOV with sparse coding

    Method   Feature dim   mAP
    VQ       25K           55.30
    KCB      25K           56.26
    LLC      25K           57.27
    SV       41K           58.16
    FV       132K          61.69

  • Examples: Classification. Example on PASCAL VOC 2007:

    From: Chatfield, Lempitsky, Vedaldi and Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, BMVC'11.

    FV outperforms BOV-based techniques including:
    VQ: plain vanilla BOV
    KCB: BOV with soft assignment
    LLC: BOV with sparse coding

    Including 2nd-order information is important for classification.

    Method   Feature dim   mAP
    VQ       25K           55.30
    KCB      25K           56.26
    LLC      25K           57.27
    SV       41K           58.16
    FV       132K          61.69

  • Packages

    The INRIA package: http://lear.inrialpes.fr/src/inria_fisher/

    The Oxford package: http://www.robots.ox.ac.uk/~vgg/research/encoding_eval/

    VLFeat does it too! http://www.vlfeat.org

  • Summary: we've looked at methods to better characterize the distribution of visual words in an image: soft assignment (a.k.a. Kernel Codebook), VLAD, and the Fisher Vector.

    A Mixture of Gaussians is conceptually a soft form of k-means which can better model the data distribution.

    Speaker notes:
    * Intuition: you're encoding the updates to each centroid if you were running the k-means algorithm.
    * K = 16 centroids / clusters. Blue is positive, red is negative. SIFT layout.
    * But how did we find the Gaussian Mixture Model for our feature encoding? Very similarly to k-means.
    * Assume the data is i.i.d. and solve using partial derivatives.
    * Bayesian matting is not exactly a mixture of Gaussians: there are multiple Gaussians, but each unknown pixel is expected to be a combination of exactly two of them.
    * 64 feature dimensions