Mixtures of Gaussians and Advanced Feature Encoding. Computer Vision, CS 143, Brown. James Hays. Many slides from Derek Hoiem, Florent Perronnin, and Hervé Jégou.


  • Mixtures of Gaussians and Advanced Feature Encoding. Computer Vision, CS 143, Brown.

    James Hays. Many slides from Derek Hoiem, Florent Perronnin, and Hervé Jégou.

  • Why do good recognition systems go bad? E.g. why isn't our Bag of Words classifier at 90% instead of 70%?
    Training data: a huge issue, but not necessarily a variable you can manipulate.
    Learning method: probably not such a big issue, unless you're learning the representation (e.g. deep learning).
    Representation: are the local features themselves lossy? The guest lecture on Nov 8th will address this. What about feature quantization? That's VERY lossy.

  • Standard k-means Bag of Words. http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

  • Today's Class: more advanced quantization / encoding methods that represent the state of the art in image classification and image retrieval.
    Soft assignment (a.k.a. Kernel Codebook)
    VLAD
    Fisher Vector

    Mixtures of Gaussians

  • Motivation: Bag of Visual Words is only about counting the number of local descriptors assigned to each Voronoi region.

    Why not include other statistics? http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

  • We already looked at the Spatial Pyramid (level 0, level 1), but today we're not talking about ways to preserve spatial information.

  • Motivation: Bag of Visual Words is only about counting the number of local descriptors assigned to each Voronoi region.

    Why not include other statistics? For instance: the mean of the local descriptors. http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

  • Motivation: Bag of Visual Words is only about counting the number of local descriptors assigned to each Voronoi region.

    Why not include other statistics? For instance: the mean of the local descriptors, the (co)variance of the local descriptors. http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

  • Simple case: Soft Assignment. Called Kernel Codebook encoding by Chatfield et al. 2011. Cast a weighted vote into the most similar clusters.

  • Simple case: Soft Assignment. Called Kernel Codebook encoding by Chatfield et al. 2011. Cast a weighted vote into the most similar clusters. This is fast and easy to implement (try it for Project 3!), but it does have some downsides for image retrieval: the inverted file index becomes less sparse.
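    A minimal MATLAB sketch of this idea (not the exact Chatfield et al. formulation): `features` is an N x D matrix of local descriptors, `vocab` is a K x D matrix of k-means centers, and `sigma` is a hand-chosen Gaussian kernel width; all three names are placeholders for this example, and `pdist2` assumes the Statistics Toolbox.

    % Kernel codebook (soft assignment) encoding -- a sketch, not the exact
    % formulation from Chatfield et al. 2011.
    function hist_soft = kernel_codebook_encode(features, vocab, sigma)
        % Squared distances between every descriptor and every cluster center
        d2 = pdist2(features, vocab).^2;                % N x K
        % Gaussian kernel weights: nearby clusters get larger votes
        w  = exp(-d2 / (2 * sigma^2));                  % N x K
        % Normalize each descriptor's votes to sum to 1, then accumulate
        w  = bsxfun(@rdivide, w, sum(w, 2) + eps);
        hist_soft = sum(w, 1);                          % 1 x K soft histogram
        % L1-normalize the image-level histogram
        hist_soft = hist_soft / (sum(hist_soft) + eps);
    end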

  • A first example: the VLAD. Given a codebook {c_1, ..., c_K}, e.g. learned with k-means, and a set of local descriptors {x}:

    assign: NN(x) = argmin_i ||x - c_i||

    compute: v_i = sum of (x - c_i) over all x such that NN(x) = i

    concatenate the v_i's and normalize

    Jégou, Douze, Schmid and Pérez, Aggregating local descriptors into a compact image representation, CVPR'10. [Figure: descriptors hard-assigned to cells 1-5; v_i accumulates the residuals x - c_i for cell i.]
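    A compact MATLAB sketch of this encoding (placeholder names `features` and `vocab`; `pdist2` again assumes the Statistics Toolbox):

    % VLAD encoding -- a minimal sketch. features: N x D local descriptors,
    % vocab: K x D codebook of k-means centers. Output is a 1 x (K*D) vector.
    function v = vlad_encode(features, vocab)
        [K, D] = size(vocab);
        % Hard-assign each descriptor to its nearest center
        [~, nn] = min(pdist2(features, vocab), [], 2);   % N x 1 indices
        v = zeros(K, D);
        for i = 1:K
            members = features(nn == i, :);
            if ~isempty(members)
                % Accumulate residuals x - c_i for all descriptors in cell i
                v(i, :) = sum(bsxfun(@minus, members, vocab(i, :)), 1);
            end
        end
        % Concatenate and L2-normalize
        v = v(:)';
        v = v / (norm(v) + eps);
    end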

  • A first example: the VLAD. A graphical representation of the aggregated vectors v_i.

    Jégou, Douze, Schmid and Pérez, Aggregating local descriptors into a compact image representation, CVPR'10.

  • The Fisher vector: score function

    Given a likelihood function p(X|λ) with parameters λ, the score function of a given sample X is the gradient of the log-likelihood with respect to λ (see below).

    It is a fixed-length vector whose dimensionality depends only on the number of parameters.

    Intuition: the direction in which the parameters of the model should be modified to better fit the data.
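    Written out (the slide's own equation did not survive conversion; this is the standard definition from the Fisher kernel literature):

    \[
      G^{X}_{\lambda} \;=\; \nabla_{\lambda} \log p(X \mid \lambda) .
    \]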

  • Aside: Mixture of Gaussians (GMM). For Fisher Vector image representations, p(X|λ) is a GMM. A GMM can be thought of as soft k-means.

    Each component has a mean and a standard deviation along each direction (or a full covariance).
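    For reference (this equation is not in the slides as extracted; it is the standard GMM density used with the Fisher vector):

    \[
      p(x \mid \lambda) = \sum_{i=1}^{K} w_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i),
      \qquad \sum_{i=1}^{K} w_i = 1,
    \]

    with parameters \lambda = \{w_i, \mu_i, \Sigma_i\}_{i=1}^{K}.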

  • Missing Data Problems: Outliers. You want to train an algorithm to predict whether a photograph is attractive. You collect annotations from Mechanical Turk. Some annotators try to give accurate ratings, but others answer randomly.

    Challenge: determine which people to trust and the average rating given by the accurate annotators. Photo: Jam343 (Flickr). Annotator ratings: 10, 8, 9, 2, 8.

  • Missing Data Problems: Object Discovery. You have a collection of images and have extracted regions from them. Each region is represented by a histogram of visual words. Challenge: discover frequently occurring object categories, without pre-trained appearance models. http://www.robots.ox.ac.uk/~vgg/publications/papers/russell06.pdf

  • Missing Data Problems: Segmentation. You are given an image and want to assign foreground/background labels to its pixels.

    Challenge: segment the image into figure and ground without knowing what the foreground looks like in advance. [Figure: foreground / background.] Slide: Derek Hoiem

  • Missing Data Problems: Segmentation. Challenge: segment the image into figure and ground without knowing what the foreground looks like in advance.

    Three steps:
    1. If we had labels, how could we model the appearance of foreground and background?
    2. Once we have modeled the fg/bg appearance, how do we compute the likelihood that a pixel is foreground?
    3. How can we get both labels and appearance models at once?

    [Figure: foreground / background.] Slide: Derek Hoiem

  • Maximum Likelihood Estimation

    1. If we had labels, how could we model the appearance of foreground and background?

    [Figure: foreground / background.] Slide: Derek Hoiem

  • Maximum Likelihood Estimation: given data x (the observations) and parameters θ, choose the parameters that maximize the likelihood of the data, θ* = argmax_θ p(x | θ). Slide: Derek Hoiem

  • Maximum Likelihood Estimation: Gaussian distribution. Slide: Derek Hoiem
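    The MLE formulas implied here (and used verbatim in the MATLAB example on the next slide) are the standard Gaussian ones:

    \[
      \hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad
      \hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n - \hat{\mu})^2 .
    \]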

  • Example: MLE

    >> mu_fg = mean(im(labels))
    mu_fg = 0.6012
    >> sigma_fg = sqrt(mean((im(labels)-mu_fg).^2))
    sigma_fg = 0.1007
    >> mu_bg = mean(im(~labels))
    mu_bg = 0.4007
    >> sigma_bg = sqrt(mean((im(~labels)-mu_bg).^2))
    sigma_bg = 0.1007
    >> pfg = mean(labels(:));

    Parameters used to generate the synthetic image: fg: mu = 0.6, sigma = 0.1; bg: mu = 0.4, sigma = 0.1. [Figure: labels and im.] Slide: Derek Hoiem

  • Probabilistic Inference. 2. Once we have modeled the fg/bg appearance, how do we compute the likelihood that a pixel is foreground?

    [Figure: foreground / background.] Slide: Derek Hoiem

  • Probabilistic Inference: compute the likelihood that a particular model (component or label) generated a sample. Slide: Derek Hoiem

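    The derivation these build slides step through is presumably the usual application of Bayes' rule, which is exactly what the MATLAB example below computes:

    \[
      p(z = m \mid x, \theta)
      = \frac{p(x \mid z = m, \theta_m)\, p(z = m \mid \theta)}
             {\sum_{k} p(x \mid z = k, \theta_k)\, p(z = k \mid \theta)} .
    \]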

  • Example: Inference

    >> pfg = 0.5;
    >> px_fg = normpdf(im, mu_fg, sigma_fg);
    >> px_bg = normpdf(im, mu_bg, sigma_bg);
    >> pfg_x = px_fg*pfg ./ (px_fg*pfg + px_bg*(1-pfg));

    Learned parameters: fg: mu = 0.6, sigma = 0.1; bg: mu = 0.4, sigma = 0.1. [Figure: im and p(fg | im).] Slide: Derek Hoiem

  • Mixture of Gaussians* example: Matting. Figure from Bayesian Matting, Chuang et al. 2001.

  • Mixture of Gaussians* example: Matting. Result from Bayesian Matting, Chuang et al. 2001.

  • Dealing with Hidden Variables

    3. How can we get both labels and appearance models at once?

    [Figure: foreground / background.] Slide: Derek Hoiem

  • Mixture of Gaussians: p(x | θ) = Σ_m π_m p(x | z = m, θ_m), where z = m indexes the mixture component, π_m is the component prior, and θ_m are the component model parameters. Slide: Derek Hoiem

  • Mixture of Gaussians

    With enough components, it can approximate any probability density function, and it is widely used as a general-purpose pdf estimator.

    Slide: Derek Hoiem

  • Segmentation with Mixture of Gaussians: pixels come from one of several Gaussian components. We don't know which pixels come from which components, and we don't know the parameters of the components. Slide: Derek Hoiem

  • Simple solution:
    1. Initialize parameters.
    2. Compute the probability of each hidden variable given the current parameters.
    3. Compute new parameters for each model, weighted by the likelihood of the hidden variables.
    4. Repeat 2-3 until convergence.
    Slide: Derek Hoiem

  • Mixture of Gaussians: Simple Solution
    1. Initialize parameters.
    2. Compute the likelihood of the hidden variables for the current parameters.
    3. Estimate new parameters for each model, weighted by that likelihood.
    Slide: Derek Hoiem
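    To make the alternation concrete, here is a minimal MATLAB sketch of this loop for a 1-D, two-component mixture (the variable names, the fixed iteration count, and the use of normpdf mirror the earlier examples but are otherwise placeholders):

    x = im(:);                                           % e.g. grayscale pixel values
    mu = [0.3; 0.7]; sigma = [0.2; 0.2]; w = [0.5; 0.5]; % 1. initialize
    for iter = 1:50
        % 2. E-step: responsibility of each component for each sample
        p1 = w(1) * normpdf(x, mu(1), sigma(1));
        p2 = w(2) * normpdf(x, mu(2), sigma(2));
        r1 = p1 ./ (p1 + p2 + eps);                      % p(z = 1 | x, current params)
        r2 = 1 - r1;
        % 3. M-step: re-estimate parameters, weighted by the responsibilities
        mu(1) = sum(r1 .* x) / sum(r1);    mu(2) = sum(r2 .* x) / sum(r2);
        sigma(1) = sqrt(sum(r1 .* (x - mu(1)).^2) / sum(r1));
        sigma(2) = sqrt(sum(r2 .* (x - mu(2)).^2) / sum(r2));
        w(1) = mean(r1);                   w(2) = 1 - w(1);
    end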

  • Expectation Maximization (EM) Algorithm. Goal: maximize the data likelihood. The log of a sum is intractable to maximize directly, so we use Jensen's inequality, which holds for concave functions such as f(x) = log(x). See here for a proof: www.stanford.edu/class/cs229/notes/cs229-notes8.ps. Slide: Derek Hoiem

  • Expectation Maximization (EM) Algorithm

    E-step: compute the expected values of the hidden variables under the current parameters.

    M-step: solve for the parameters that maximize the resulting expected log-likelihood. Goal: maximize the data likelihood. Slide: Derek Hoiem

  • This looks like a soft version of kmeans!


  • EM for Mixture of Gaussians (on board)

    E-step:

    M-step:

    Slide: Derek Hoiem
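    The standard GMM updates that this board derivation arrives at (not shown on the extracted slides) are:

    E-step (responsibilities):
    \[
      r_{nm} = p(z_n = m \mid x_n, \theta)
             = \frac{\pi_m \, \mathcal{N}(x_n;\, \mu_m, \sigma_m^2)}
                    {\sum_{k} \pi_k \, \mathcal{N}(x_n;\, \mu_k, \sigma_k^2)} .
    \]

    M-step (weighted re-estimation):
    \[
      \mu_m = \frac{\sum_n r_{nm}\, x_n}{\sum_n r_{nm}}, \qquad
      \sigma_m^2 = \frac{\sum_n r_{nm}\, (x_n - \mu_m)^2}{\sum_n r_{nm}}, \qquad
      \pi_m = \frac{1}{N} \sum_n r_{nm} .
    \]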

  • EM Algorithm

    Maximizes a lower bound on the data likelihood at each iteration

    Each step increases the data likelihood, and the algorithm converges to a local maximum.

    Common tricks in the derivation: find terms that sum or integrate to 1; use a Lagrange multiplier to deal with constraints.

    Slide: Derek Hoiem

  • Mixture of Gaussians demos:
    http://www.cs.cmu.edu/~alad/em/
    http://lcn.epfl.ch/tutorial/english/gaussian/html/
    Slide: Derek Hoiem

  • Hard EM: same as EM, except that z* is computed as the most likely values of the hidden variables.

    K-means is an example.

    Advantages: simpler; can be applied when you cannot derive EM; sometimes works better if you want to make hard predictions at the end. But: in general, the pdf parameters are not as accurate as with EM. Slide: Derek Hoiem

  • Missing Data Problems: Outliers. You want to train an algorithm to predict whether a photograph is attractive. You collect annotations from Mechanical Turk. Some annotators try to give accurate ratings, but others answer randomly.

    Challenge: determine which people to trust and the average rating given by the accurate annotators. Photo: Jam343 (Flickr). Annotator ratings: 10, 8, 9, 2, 8.

  • Missing Data Problems: Object Discovery. You have a collection of images and have extracted regions from them. Each region is represented by a histogram of visual words. Challenge: discover frequently occurring object categories, without pre-trained appearance models. http://www.robots.ox.ac.uk/~vgg/publications/papers/russell06.pdf

  • What's wrong with this prediction? [Figure: P(foreground | image).] Slide: Derek Hoiem

  • Next class: MRFs and Graph-cut Segmentation. Slide: Derek Hoiem

  • Missing data problems

    Outlier detection: you expect that some of the data are valid points and some are outliers, but you don't know which are inliers, and you don't know the parameters of the distribution of inliers.

    Modeling probability densities: you expect that most of the colors within a region come from one of a few normally distributed sources, but you don't know which source each pixel comes from, and you don't know the Gaussian means or covariances.

    Discovering patterns or topics: you expect that there are some recurring topics of codewords within an image collection, and you want to figure out the frequency of each codeword within each topic and which topic each image belongs to. Slide: Derek Hoiem

  • Running Example: Segmentation. Estimate distribution parameters and perform inference given: (a) labeled examples of foreground and background; (b) only the knowledge that foreground and background are normally distributed; or (c) some initial estimate, but no good knowledge of foreground and background.

  • FV formulas:

    The Fisher vector: relationship with the BOV. Perronnin and Dance, Fisher kernels on visual categories for image categorization, CVPR'07.

  • FV formulas: gradient w.r.t. the mixture weights w

    soft BOV

    γ_t(i) = soft assignment of patch t to Gaussian i

    The Fisher vector: relationship with the BOV. Perronnin and Dance, Fisher kernels on visual categories for image categorization, CVPR'07.

  • gradient w.r.t. the means μ and standard deviations σ

    FV formulas: gradient w.r.t. the mixture weights w

    soft BOV

    γ_t(i) = soft assignment of patch t to Gaussian i

    Compared to the BOV, this includes higher-order statistics (up to order 2).

    Let us denote: D = feature dim, N = # Gaussians. BOV = N-dim; FV = 2DN-dim.

    The Fisher vector: relationship with the BOV. Perronnin and Dance, Fisher kernels on visual categories for image categorization, CVPR'07.

  • gradient w.r.t. the means μ and standard deviations σ

    FV formulas: gradient w.r.t. the mixture weights w

    soft BOV

    γ_t(i) = soft assignment of patch t to Gaussian i

    Compared to the BOV, this includes higher-order statistics (up to order 2).

    The FV is much higher-dimensional than the BOV for a given visual vocabulary size, yet much faster to compute than a BOV of a given feature dimension. The Fisher vector: relationship with the BOV. Perronnin and Dance, Fisher kernels on visual categories for image categorization, CVPR'07.
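    The per-Gaussian formulas these slides refer to (the equation images did not survive extraction; these are the standard diagonal-covariance expressions from the Fisher vector literature, with normalization conventions varying slightly across papers) are:

    \[
      \gamma_t(i) = \frac{w_i\, u_i(x_t)}{\sum_j w_j\, u_j(x_t)}
      \quad \text{(soft assignment of patch } x_t \text{ to Gaussian } i\text{)},
    \]
    \[
      \mathcal{G}^X_{\mu,i} = \frac{1}{T\sqrt{w_i}} \sum_{t=1}^{T} \gamma_t(i)\,
          \frac{x_t - \mu_i}{\sigma_i}, \qquad
      \mathcal{G}^X_{\sigma,i} = \frac{1}{T\sqrt{2 w_i}} \sum_{t=1}^{T} \gamma_t(i)
          \left[ \frac{(x_t - \mu_i)^2}{\sigma_i^2} - 1 \right],
    \]

    where u_i is the i-th Gaussian density and the divisions are element-wise over the D feature dimensions.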

  • The Fisher vector: dimensionality reduction on local descriptors. Perform PCA on the local descriptors: uncorrelated features are more consistent with the diagonal-covariance assumption of the GMM, and the FK performs whitening and enhances low-energy (possibly noisy) dimensions.

  • The Fisher vector: normalization (variance stabilization)

    Variance-stabilizing transforms of the form z ← sign(z) |z|^α (with α = 0.5 by default) can be used on the FV (or the VLAD).

    This reduces the impact of bursty visual elements. Jégou, Douze, Schmid, On the burstiness of visual elements, CVPR'09.
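    In MATLAB this power ("signed square root") normalization followed by L2 normalization is just a couple of lines; a sketch, assuming v holds the FV or VLAD vector:

    alpha = 0.5;
    v = sign(v) .* abs(v).^alpha;   % variance-stabilizing (power) transform
    v = v / (norm(v) + eps);        % L2-normalize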

  • Datasets for image retrieval. INRIA Holidays dataset: 1,491 personal holiday photos; 500 queries, each associated with a small number of relevant results (1-11); 1 million distractor images (with some 'false' false positives).

    Hervé Jégou, Matthijs Douze and Cordelia Schmid, Hamming Embedding and Weak Geometric consistency for large-scale image search, ECCV'08.

  • Examples: Retrieval. Example on Holidays:

    From: Jégou, Perronnin, Douze, Sánchez, Pérez and Schmid, Aggregating local descriptors into compact codes, TPAMI'11.

  • Examples: Retrieval. Example on Holidays:

    Second-order statistics are not essential for retrieval. From: Jégou, Perronnin, Douze, Sánchez, Pérez and Schmid, Aggregating local descriptors into compact codes, TPAMI'11.

  • Examples: Retrieval. Example on Holidays:

    Second-order statistics are not essential for retrieval. Even for the same feature dimension, the FV/VLAD can beat the BOV. From: Jégou, Perronnin, Douze, Sánchez, Pérez and Schmid, Aggregating local descriptors into compact codes, TPAMI'11.

  • Examples: Retrieval. Example on Holidays:

    Second-order statistics are not essential for retrieval. Even for the same feature dimension, the FV/VLAD can beat the BOV. Soft assignment + whitening of the FV helps as the number of Gaussians grows. From: Jégou, Perronnin, Douze, Sánchez, Pérez and Schmid, Aggregating local descriptors into compact codes, TPAMI'11.

  • Examples: Retrieval. Example on Holidays:

    Second-order statistics are not essential for retrieval. Even for the same feature dimension, the FV/VLAD can beat the BOV. Soft assignment + whitening of the FV helps as the number of Gaussians grows. After dimensionality reduction, however, the FV and VLAD perform similarly. From: Jégou, Perronnin, Douze, Sánchez, Pérez and Schmid, Aggregating local descriptors into compact codes, TPAMI'11.

  • Examples: Classification. Example on PASCAL VOC 2007:

    From: Chatfield, Lempitsky, Vedaldi and Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, BMVC'11.

    Method   Feature dim   mAP
    VQ       25K           55.30
    KCB      25K           56.26
    LLC      25K           57.27
    SV       41K           58.16
    FV       132K          61.69

  • Examples: Classification. Example on PASCAL VOC 2007:

    From: Chatfield, Lempitsky, Vedaldi and Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, BMVC'11.

    FV outperforms BOV-based techniques including:
    VQ: plain vanilla BOV
    KCB: BOV with soft assignment
    LLC: BOV with sparse coding

    Method   Feature dim   mAP
    VQ       25K           55.30
    KCB      25K           56.26
    LLC      25K           57.27
    SV       41K           58.16
    FV       132K          61.69

  • Examples: Classification. Example on PASCAL VOC 2007:

    From: Chatfield, Lempitsky, Vedaldi and Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, BMVC'11.

    FV outperforms BOV-based techniques including:
    VQ: plain vanilla BOV
    KCB: BOV with soft assignment
    LLC: BOV with sparse coding

    Including 2nd-order information is important for classification.

    Method   Feature dim   mAP
    VQ       25K           55.30
    KCB      25K           56.26
    LLC      25K           57.27
    SV       41K           58.16
    FV       132K          61.69

  • Packages

    The INRIA package: http://lear.inrialpes.fr/src/inria_fisher/

    The Oxford package: http://www.robots.ox.ac.uk/~vgg/research/encoding_eval/

    VLFeat does it too! http://www.vlfeat.org

  • Summary: we've looked at methods to better characterize the distribution of visual words in an image: soft assignment (a.k.a. Kernel Codebook), VLAD, and the Fisher Vector.

    A Mixture of Gaussians is conceptually a soft form of k-means which can better model the data distribution.

    Speaker notes:
    * Intuition: you're encoding the updates to each centroid if you were running the k-means algorithm.
    * K = 16 centroids / clusters. Blue is positive, red is negative. SIFT layout.
    * But how did we find the Gaussian Mixture Model for our feature encoding? Very similarly to k-means.
    * Assume the data is i.i.d. and solve using partial derivatives.
    * Bayesian matting is not exactly a mixture of Gaussians: there are multiple Gaussians, but each unknown pixel is expected to be a combination of exactly two of them.
    * 64 feature dimensions