Similarity Search in Visual Data
Ph.D. Thesis Defense
Anoop Cherian*
Department of Computer Science and Engineering, University of Minnesota, Twin Cities
Adviser: Prof. Nikolaos Papanikolopoulos
*Contact: [email protected]
Slide 2
Talk Outline
- Introduction
- Problem Statement
- Algorithms for Similarity Search in Matrix Valued Data
- Algorithms for Similarity Search in High Dimensional Vector Data
- Conclusion
- Future Work
Slide 3
Thesis Related Publications

Journals
1. A. Cherian, S. Sra, A. Banerjee, and N. Papanikolopoulos. Jensen-Bregman LogDet Divergence with Application to Efficient Similarity Search for Covariance Matrices. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), [accepted with minor revisions]. (Chapter 3)
2. A. Cherian, V. Morellas, and N. Papanikolopoulos. Efficient Nearest Neighbor Retrieval via Sparse Coding. Pattern Recognition Journal, [being submitted]. (Chapters 5, 7)

Conference Publications
1. A. Cherian, S. Sra, A. Banerjee, and N. Papanikolopoulos. Efficient Similarity Search for Covariance Matrices via the Jensen-Bregman LogDet Divergence. Intl. Conf. on Computer Vision (ICCV), 2011. (Chapter 3)
2. A. Cherian, V. Morellas, N. Papanikolopoulos, and S. Bedros. Dirichlet Process Mixture Models on Symmetric Positive Definite Matrices for Appearance Clustering in Video Surveillance Applications. Computer Vision and Pattern Recognition (CVPR), 2011. (Chapter 4)
Slide 4
Thesis Related Publications (continued)

3. A. Cherian, J. Andersh, V. Morellas, N. Papanikolopoulos, and B. Mettler. Motion Estimation of a Miniature Helicopter using a Single Onboard Camera. American Control Conference (ACC), 2010. (Chapter 5)
4. A. Cherian, S. Sra, and N. Papanikolopoulos. Denoising Sparse Noise via Online Dictionary Learning. Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2011. (Chapter 6)
5. A. Cherian, V. Morellas, and N. Papanikolopoulos. Robust Sparse Hashing. Intl. Conf. on Image Processing (ICIP), 2012. (Chapter 6) [Best Student Paper Award]
6. A. Cherian, V. Morellas, and N. Papanikolopoulos. Approximate Nearest Neighbors via Dictionary Learning. Proceedings of SPIE, 2011. (Chapters 5, 6, 7)
7. S. Sra and A. Cherian. Generalized Dictionary Learning for Symmetric Positive Definite Matrices with Application to Nearest Neighbor Retrieval. European Conference on Machine Learning (ECML), 2011. (Chapter 8)
8. A. Cherian and N. Papanikolopoulos. Large Scale Image Search via Sparse Coding. Minnesota Supercomputing Institute (MSI) Poster Presentation, 2012. [Best Poster Award]
Slide 5
Talk Outline
- Introduction
- Motivation
- Problem Statement
- Algorithms for Similarity Search in Matrix Valued Data
- Algorithms for Similarity Search in High Dimensional Vector Data
- Conclusion
- Future Work
Slide 6
[Figure courtesy of Intel]
Slide 7
Big-Data Challenge
How to connect the information seeker to the right content?

Solution: similarity search.

Three fundamental steps in similarity search:
1. Represent the data
2. Describe the query
3. Retrieve the data most similar to the query
Slide 8
Visual Data Challenges
[Art courtesy of Thomas Kinkade, Pastoral House]

"Never express yourself more clearly than you are able to think." (Niels Bohr)

It is sometimes difficult to describe precisely, in words, what data is to be retrieved. This is especially true in visual content retrieval, where similarity is judged by an unconscious process. Characterizing what we see is therefore hard, and teaching a machine visual similarity is harder still.
Slide 9
A Few Applications using Visual Similarity Search
- Content-based image retrieval
- Medical image analysis
- 3D reconstruction
- Visual surveillance
- Human-machine interaction
Slide 10
3D Scene Reconstruction: Technical Analysis
[Images courtesy: Google Street View]

Goal: 3D street view
Input: a set of images

Algorithm (a two-view sketch follows below):
1. Find point correspondences between pairs of images
2. Estimate camera parameters
3. Estimate camera motion
4. Estimate 3D point locations
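As an illustration only (not the pipeline used in the thesis), here is a minimal two-view version of these four steps using OpenCV; the images `img1`, `img2` and the 3x3 camera intrinsics matrix `K` are assumed to be given:

```python
import cv2
import numpy as np

# 1. Find point correspondences via SIFT + ratio-test matching
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# 2-3. Estimate camera geometry and relative motion
E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

# 4. Triangulate 3D point locations from the two camera poses
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)   # 4xN homogeneous
X = (X_h[:3] / X_h[3]).T                              # Nx3 Euclidean points
```

Step 1 is exactly where similarity search enters: matching descriptors between images is a nearest neighbor problem, and its cost dominates at scale, as the next slide quantifies.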
Slide 11
3D Scene Reconstruction: Technical Analysis
[Images courtesy: Google Street View]

- SIFT descriptors (128-D) are typically used as the point descriptors.
- Each image produces several thousand SIFT descriptors (say 10K SIFTs/image).
- A reliable reconstruction requires several thousand images (assume 1K images).
- Thus there are approximately 10K x 1K = 10^7 SIFTs, and pairwise matching requires on the order of (10^7)^2 = 10^14 comparisons!
- And this is for only one scene; think of millions of scenes in a Street View application.

Computational bottleneck: efficient similarity computation!
Slide 12
Talk Outline
- Introduction
- Motivation
- Problem Statement
- Algorithms for Similarity Search in Matrix Valued Data
- Algorithms for Similarity Search in High Dimensional Vector Data
- Conclusion
- Future Work
Slide 13
Problem Statement: Approximate Nearest Neighbor (ANN) retrieval (stated below).
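For reference, the standard (1+epsilon)-ANN formulation:

```latex
% Given a database X = \{x_1, \dots, x_N\} in a metric space (M, d)
% and a query q, return a point x \in X such that
\[
  d(q, x) \;\le\; (1+\epsilon)\, \min_{x' \in X} d(q, x'), \qquad \epsilon > 0.
\]
```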
Slide 14
Problem Challenges
- High dimensional data
  - Poses the curse of dimensionality: it becomes difficult to distinguish near from far points
  - Examples: SIFT (128-D), GIST (960-D)
- Large scale datasets
  - A needle in the haystack: petabytes of visual data and billions of data descriptors

Desired Similarity Search Algorithm Properties
- High retrieval accuracy
- Fast retrieval
- Low memory footprint
- Scalability to large datasets
- Scalability to high dimensional data
- Robustness to data perturbations
- Generalizability to various data descriptors

[Figure: unit ball inside a unit hypercube (see the numerical check below)]
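The figure's point can be checked numerically: the volume of the ball inscribed in the unit hypercube vanishes as the dimension grows, so almost all of the cube's volume sits in its corners. A quick sketch:

```python
import numpy as np
from scipy.special import gammaln

def inscribed_ball_volume(d, r=0.5):
    """Volume of the radius-r ball in d dimensions:
    V = pi^(d/2) / Gamma(d/2 + 1) * r^d  (computed in log-space)."""
    return np.exp(0.5 * d * np.log(np.pi) - gammaln(d / 2 + 1) + d * np.log(r))

for d in [2, 10, 128, 960]:        # e.g. SIFT is 128-D, GIST is 960-D
    print(d, inscribed_ball_volume(d))   # fraction of the unit cube filled
# already ~2.5e-3 at d=10, and astronomically small at d=128
```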
Slide 15
Thesis Contributions
We propose NN retrieval algorithms for two different data modalities:

Matrix valued data (as symmetric positive definite matrices)
- A new similarity distance: the Jensen-Bregman LogDet Divergence
- An unsupervised clustering algorithm

High dimensional vector valued data
- A novel connection between sparse coding and hashing
- A fast and accurate hashing algorithm for NN retrieval

We also provide theoretical analysis of our algorithms, and experimental validation against the state-of-the-art techniques in NN retrieval on several computer vision datasets.
Slide 16
Talk Outline
- Introduction
- Motivation
- Problem Statement
- Algorithms for Similarity Search in Matrix Valued Data
- Algorithms for Similarity Search in High Dimensional Vector Data
- Conclusion
- Future Work
Slide 17
Matrix (Covariance) Valued Data
Pipeline: appearance silhouette -> per-pixel features (color + gradient + curvature) -> covariance of features -> covariance descriptor. (A small sketch of computing such a descriptor follows below.)

Advantages:
- Multi-feature fusion
- Compact
- Computable in real time
- Robust to static noise
- Robust to illumination
- Robust to affine transforms
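As a sketch (the feature set here is illustrative, not the exact one used in the thesis), a region covariance descriptor is simply the covariance matrix of per-pixel feature vectors collected over a region:

```python
import numpy as np

def covariance_descriptor(region):
    """region: HxWx3 float RGB patch. Build a per-pixel feature vector
    (x, y, R, G, B, |dI/dx|, |dI/dy|) and return its covariance matrix."""
    H, W, _ = region.shape
    ys, xs = np.mgrid[0:H, 0:W]
    gray = region.mean(axis=2)
    gy, gx = np.gradient(gray)              # first-order image gradients
    feats = np.stack([xs, ys,
                      region[..., 0], region[..., 1], region[..., 2],
                      np.abs(gx), np.abs(gy)], axis=-1).reshape(-1, 7)
    return np.cov(feats, rowvar=False)      # 7x7 SPD descriptor
```

With 8 features one gets the 8x8 descriptors of the texture experiments later in the talk; the descriptor size depends only on the number of features, not on the region size, which is why it is so compact.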
Slide 18
Importance of Covariance Valued Data in Vision
- Diffusion Tensor Imaging (DT-MRI) (3x3)
- Object tracking (5x5), Tuzel et al., 2006
- Activity recognition (12x12), Guo et al., 2009
- Emotion recognition (30x30), Zheng et al., 2010
- Face recognition (40x40), Pang et al., 2008
- 3D object recognition (8x8), Fehr et al., 2012
Slide 19
Geometry of Covariances
- By their positive definiteness, p x p covariance matrices form a Riemannian manifold, the open cone $\mathcal{S}_{++}^p$ inside Euclidean space.
- Distances between two points X and Y on this manifold are not straight lines, but curves (geodesics)!
- Incorporating this curvature makes distance computation expensive.

[Figure: geodesic between X and Y on the cone $\mathcal{S}_{++}^p$]
Slide 20
Similarity Metrics on Covariances
- Affine Invariant Riemannian Metric (AIRM): the natural metric induced by the Riemannian geometry
- Log-Euclidean Riemannian Metric (LERM): induced by approximating the manifold of covariances by a flat geometry
- Kullback-Leibler Divergence Metric (KLDM): treats each covariance as the covariance of an associated zero-mean Gaussian distribution
- Matrix Frobenius Distance (FROB): treats covariances as vectors in Euclidean space

(The standard formulas are given below.)
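For reference, the standard definitions of these four measures (stated here for completeness):

```latex
\begin{align*}
d_{\mathrm{AIRM}}(X,Y) &= \big\| \log\!\big(X^{-1/2}\, Y\, X^{-1/2}\big) \big\|_F \\
d_{\mathrm{LERM}}(X,Y) &= \big\| \log X - \log Y \big\|_F \\
d_{\mathrm{KLDM}}(X,Y) &= \sqrt{\tfrac{1}{2}\,\mathrm{tr}\!\big(X^{-1}Y + Y^{-1}X - 2I\big)} \\
d_{\mathrm{FROB}}(X,Y) &= \| X - Y \|_F
\end{align*}
```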
Slide 21
Our Distance: Jensen-Bregman LogDet Divergence (JBLD)

Let f be a strictly convex function. The Bregman divergence
\[ d_f(X, Y) = f(X) - f(Y) - \langle \nabla f(Y),\, X - Y \rangle \]
measures the deviation of f at X from the tangent to f at Y (see figure). The Jensen-Bregman divergence is the average deviation from the midpoint of X and Y:
\[ J_f(X, Y) = \tfrac{1}{2}\, d_f\!\Big(X, \tfrac{X+Y}{2}\Big) + \tfrac{1}{2}\, d_f\!\Big(Y, \tfrac{X+Y}{2}\Big) = \tfrac{f(X) + f(Y)}{2} - f\!\Big(\tfrac{X+Y}{2}\Big). \]
Our new measure is obtained by substituting f = -log det(.), the negative logdet function, where X and Y are covariances:
\[ J_{\ell d}(X, Y) = \log\det\!\Big(\tfrac{X+Y}{2}\Big) - \tfrac{1}{2}\log\det(XY). \]
A short code sketch of computing JBLD and AIRM follows.
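A minimal NumPy/SciPy sketch of the two measures (illustrative, not an optimized implementation):

```python
import numpy as np
from scipy.linalg import eigh

def jbld(X, Y):
    """J_ld(X, Y) = logdet((X+Y)/2) - 0.5 * logdet(X Y): only
    determinants are needed, no eigendecompositions or matrix logs."""
    _, ld_mid = np.linalg.slogdet((X + Y) / 2.0)
    _, ld_x = np.linalg.slogdet(X)
    _, ld_y = np.linalg.slogdet(Y)
    return ld_mid - 0.5 * (ld_x + ld_y)

def airm(X, Y):
    """AIRM needs the generalized eigenvalues of (X, Y), i.e. the
    eigenvalues of Y^{-1} X, which is substantially more expensive."""
    lam = eigh(X, Y, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam) ** 2))
```

The contrast previews the FLOPS column on the next slide: JBLD costs a few determinant (Cholesky-style) factorizations, while AIRM needs a full generalized eigendecomposition.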
Slide 22
Properties of JBLD

Notation for the properties compared:
- M: Does it satisfy all metric properties?
- G: Are gradient computations fast?
- NI: Is the measure invariant to inversion?
- AI: Affine invariance?
- NE: Are negative eigenvalues at infinity?
- O: Will it not overestimate AIRM?
- FLOPS: computational complexity

Metric  | FLOPS
FROB    | d(d+1)/2
AIRM    | 4d^3
LERM    | (8/3) d^3
KLDM    | (8/3) d^3
JBLD    | d^3
Slide 23
Computational Speedup using JBLD
[Plot: speedup in computing AIRM vs. JBLD for increasing matrix dimensionality]
[Plot: speedup in computing gradients of AIRM vs. JBLD for increasing matrix dimensionality]
Nearest Neighbors using JBLD
Desiderata for NN retrieval on any metric space: scalability, ease of exact NN retrieval, and ease of approximate NN retrieval.

We decided to use a Metric Tree (MT) on JBLD for NN retrieval (a build sketch follows below):
- The square root of JBLD is a metric.
- The construction is essentially hierarchical k-means: starting from the root (the entire dataset), the data is recursively bipartitioned.
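A toy sketch of the recursive bipartitioning (illustrative only: real metric trees refine the two pivots k-means-style and store ball radii for branch-and-bound pruning at query time). Here `metric` would be the square root of JBLD:

```python
import numpy as np

def build_metric_tree(items, metric, leaf_size=8, rng=None):
    """Recursively bipartition `items` (e.g. a list of SPD matrices)
    into a binary metric tree using two pivots under `metric`."""
    rng = rng or np.random.default_rng(0)
    if len(items) <= leaf_size:
        return {"leaf": items}
    i, j = rng.choice(len(items), size=2, replace=False)  # two pivots
    left, right = [], []
    for x in items:        # assign each item to its nearer pivot
        (left if metric(x, items[i]) <= metric(x, items[j]) else right).append(x)
    if not left or not right:          # degenerate split: stop recursing
        return {"leaf": items}
    return {"pivots": (items[i], items[j]),
            "left": build_metric_tree(left, metric, leaf_size, rng),
            "right": build_metric_tree(right, metric, leaf_size, rng)}
```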
Slide 26
Experimental Results using JBLD
Slide 27
Experiments: Evaluation Datasets
- Weizmann Actions dataset
- ETH Tracking dataset
- Brodatz Texture dataset
- Faces in the Wild dataset

Dataset  | Covariance size | Dataset size | Ground truth
Actions  | 12x12           | 65K          | Available
Textures | 8x8             | 27K          | Available
Faces    | 40x40           | 31K          | Available
Tracking | 8x8             | 10K          | AIRM
Slide 28
Experimental Results using JBLD
[Plot: metric tree creation time]
[Plot: NN via metric tree]
[Plot: ANN via metric tree]
Slide 29
Unsupervised Clustering of Covariances
- Clustering is an important step in NN retrieval.
- K-means-type clustering needs the number of clusters (K) to be known, and finding K is non-trivial in practice.
- We therefore propose an unsupervised clustering algorithm on covariances:
  - An extension of the Dirichlet Process Mixture Model (DPMM)
  - Uses the Wishart-Inverse-Wishart (WIW) conjugate pair
- We also investigate other DPMM models, such as:
  - Gaussian on log-Euclidean covariance vectors
  - Gaussian on vectorized covariances
Slide 30
Experimental Results
Purity is synonymous with accuracy. Legend: le: LERM, f: FROB, l: KLDM, g: AIRM.
[Plot: Faces, 40x40, 900 matrices, 110 clusters]
[Plot: simulation results for increasing true number of clusters]
[Plot: DPMM computational expense against k-means (using AIRM) and EM (using MoW)]
[Plot: Appearances, 5x5, 758 matrices, 31 clusters]
Slide 31
Talk Outline
- Introduction
- Motivation
- Problem Statement
- Algorithms for Similarity Search in Matrix Valued Data
- Algorithms for Similarity Search in High Dimensional Vector Data
- Conclusion
- Future Work
Slide 32
Importance of Vector Valued Data in Vision
A fundamental data type in several applications:
- As histogram-based descriptors. Examples: SIFT, Spin Images, etc.
- As feature descriptors. Example: image patches
- As filter outputs. Example: GIST descriptor

[Figures: GIST, texture patches, SIFT]
Slide 33
Related Work
- KD-Trees: partition space along fixed hyperplanes
- Locality Sensitive Hashing (LSH), Indyk et al., 2008: generates hash codes by projecting data onto random hyperplanes
- Spectral Hashing, Torralba et al., 2008: projection planes derived from orthogonal subspaces of PCA
- Kernelized Hashing, Kulis et al., 2010: projection planes derived from PCA over a kernel matrix learned from data
- Shift Invariant Kernel Hashing, Lazebnik et al., 2009: spectral hashing with a cosine-based kernel
- Product Quantization, Jegou et al., 2011: k-means sub-vector clustering followed by standard LSH
- FLANN, Lowe et al., 2009: not a hashing algorithm, but a hybrid of hierarchical k-means and KD-trees

[Figure: KD-tree space partitions; LSH hash code: 11010]
Slide 34
Our Approach
Based on Dictionary Learning (DL) and Sparse Coding (SC). Algorithm steps (a sketch follows below):

Indexing: for each data vector v,
1. Represent v as a sparse vector w using a dictionary B
2. Encode w as a hash code T
3. Store w at H(T), where H is a hash table indexed by T

Querying: given a query vector q,
1. Generate the sparse vector w_q and hash code T_q
2. Find ANN(q) in H(T_q)
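A minimal sketch of this scheme (names and data here are synthetic, and orthogonal matching pursuit stands in for whichever sparse coder the thesis actually uses). The hash code T is taken to be the tuple of active-atom indices:

```python
import numpy as np
from collections import defaultdict
from sklearn.linear_model import orthogonal_mp

rng = np.random.default_rng(0)
d, n, k = 64, 256, 4                       # signal dim, atoms, sparsity
B = rng.standard_normal((d, n))
B /= np.linalg.norm(B, axis=0)             # unit-norm dictionary atoms
data = [rng.standard_normal(d) for _ in range(1000)]

def hash_key(v):
    """Sparse-code v over B; the sorted support (the indices of the
    active atoms) serves as the hash code T."""
    w = orthogonal_mp(B, v, n_nonzero_coefs=k)
    return tuple(sorted(np.flatnonzero(w)))

H = defaultdict(list)                      # hash table indexed by T
for v in data:
    H[hash_key(v)].append(v)

q = data[17] + 1e-3 * rng.standard_normal(d)   # slightly perturbed query
bucket = H[hash_key(q)]                        # candidates sharing T_q
nn = min(bucket, key=lambda v: np.linalg.norm(v - q)) if bucket else None
```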
Slide 35
Dictionary Learning and Sparse Coding
- Dictionary learning: an algorithm to learn atoms from data.
- Sparse coding: an algorithm to represent data in terms of a few atoms of the dictionary.

An Analogy (dictionary learning): data -> dictionary of basic atoms.
Slide 36
Dictionary Learning and Sparse Coding (continued)
An Analogy (dictionary learning): image data -> dictionary of basic atoms.
Slide 37
Dictionary Learning and Sparse Coding (continued)
An Analogy (sparse coding): a molecule (2 x H + 1 x O, i.e., water) expressed over the dictionary of chemical elements: 0 x Na, 0 x Li, 0 x Be, ..., 2 x H, 1 x O, ..., 0 x Xe, 0 x Rn. Sparse atom selection turns the data vector into a sparse representation (lots of zeros).
Slide 38
Dictionary Learning and Sparse Coding (continued)
An Analogy (sparse coding, on images): an image is expressed as a weighted combination of a few dictionary atoms (0.0 x ..., 1.2 x ..., 0.4 x ..., 0.0 x ...), yielding a sparse representation (lots of zeros). In code, the two operations might look like the sketch below.
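A scikit-learn sketch of the pair of operations (illustrative; the thesis's own learning algorithm and parameters may differ):

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 128))        # e.g. 1000 SIFT-like vectors

# dictionary learning: fit an overcomplete dictionary (256 atoms, 128-D)
dl = MiniBatchDictionaryLearning(n_components=256, random_state=0)
dl.fit(X)

# sparse coding: represent a vector with at most 5 active atoms
w = sparse_encode(X[:1], dl.components_, algorithm="omp", n_nonzero_coefs=5)
print(np.flatnonzero(w))                    # indices of the active atoms
```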
Sparse Coding & NN Retrieval Connection
[Figure: with high probability, a new data point falls into the same partition as its nearest neighbors, and hence selects the same dictionary atoms]
Slide 41
Sparse Coding & NN Retrieval Connection
Slide 42
Advantages of Sparse Coding for NN Retrieval
- Hashing efficiency: a large number of possible hash codes, $2^k \binom{n}{k}$ for k-sparse codes, against the $2^k$ codes of LSH
- Storage efficiency: only the sparse coefficients need to be stored, against entire data vectors as in LSH
- Query efficiency: linear search on low dimensional sparse vectors; no curse of dimensionality
- Sparse coding complexity: O(ndk) for a dictionary of n atoms, each of dimension d, generating k-sparse codes

[Figure: 1-sparse and 2-sparse partitions of the data space]
Slide 43
Disadvantage: Sensitivity to Data Perturbation!
- Sparse coding fits hyperplanes to dense regions of the data.
- There are $2^k \binom{n}{k}$ partitions for a k-sparse code and an n-atom dictionary.
- Example: for n = 1024 and k = 10, that is on the order of 10^30 partitions, so individual data partitions can be too small!
- Small data perturbations can push a data point into a different partition; different partitions imply different hash codes, and hashing fails!
Slide 44
Robust NN Retrieval
Two ingredients: Robust Dictionary Learning and Robust Sparse Coding.

Robust Dictionary Learning: align the dictionary atoms to compensate for data perturbations. Approaches:
- Treat the perturbations as noise and develop a denoising model
- Make the data immune to the worst-case perturbation

Robust Sparse Coding: hierarchical data space partitioning.
- Larger partitions subsume smaller partitions
- Generate multiple hash codes, one for each partition
Slide 45
Robust Dictionary Learning

Denoising approach:
- Data has both large and small perturbations.
- Assume Gaussian noise for the small perturbations, and a Laplacian for the large but sparse perturbations.
- Denoise for Gaussian + Laplacian noise: subtract off the Laplacian noise and the Gaussian noise, then learn the basis. (A sketch follows below.)
- The resulting denoised data should produce the same SCT (hash code)!

Robust optimization approach:
- No assumptions on the noise distribution.
- Learn the worst-case perturbation from a training set.
- Project every data point as if perturbed by the worst-case noise, then learn the basis on the perturbed data.
- The resulting immunized data should produce the same SCT!
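One way to realize the Gaussian + Laplacian model is to give the sparse noise its own identity dictionary and solve a single lasso. A sketch under that assumption (the thesis's exact formulation and regularization weights may differ; `B` and `x` are assumed given):

```python
import numpy as np
from sklearn.linear_model import Lasso

def robust_sparse_code(B, x, lam=0.05):
    """min_{w,s} ||x - B w - s||^2 + lam (||w||_1 + ||s||_1):
    stacking [B | I] turns the problem into one lasso. The variable s
    absorbs the large-but-sparse (Laplacian) noise, while the residual
    of the least-squares term accounts for the small Gaussian noise."""
    d, n = B.shape
    A = np.hstack([B, np.eye(d)])
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    model.fit(A, x)
    return model.coef_[:n], model.coef_[n:]   # sparse code w, sparse noise s
```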
Slide 46
Robust Dictionary Learning: Experimental Results
[Plots: denoising approach vs. robust optimization, on the INRIA Copydays dataset and the Graf, Bike, Bark, Boat, Wall, Leuven, UBC, and Trees sequences]
Slide 47
Robust Sparse Coding
- Based on the regularization path of sparse coding: similar data points have similar regularization paths.
- [Figure: similar data points share basis activations; dissimilar data points do not]
- Main idea: generate multiple SCTs, one for each of a sequence of increasing regularizations: the Multi-Regularization Sparse Coding (MRSC) algorithm. (A sketch follows below.)
- Increasing regularization means bigger data partitions and hence more robustness.
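A sketch of the idea using the lasso path (illustrative; how MRSC selects its regularization levels in the thesis may differ). Each emitted code is the support set at one point on the path, from coarse (few atoms) to fine:

```python
import numpy as np
from sklearn.linear_model import lars_path

def mrsc_codes(B, x, n_codes=3):
    """Walk the lasso regularization path of x over dictionary B and
    emit one hash code (the active-atom support) at several levels.
    Larger regularization -> fewer atoms -> coarser, more robust
    partitions that subsume the finer ones."""
    alphas, _, coefs = lars_path(B, x, method="lasso")
    idx = np.linspace(1, len(alphas) - 1, n_codes, dtype=int)
    return [tuple(sorted(np.flatnonzero(coefs[:, i]))) for i in idx]
```

At query time the same multi-level codes are generated for the query, so a perturbed point that changes its finest partition can still collide with its neighbors at a coarser level.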
Talk Outline
- Introduction
- Motivation
- Problem Statement
- Algorithms for Similarity Search in Matrix Valued Data
- Algorithms for Similarity Search in High Dimensional Vector Data
- Conclusion
- Future Work
Slide 53
Conclusion
We considered NN problems on two different data types: covariance data and high dimensional vector data.

For covariance data, we:
- proposed an efficient similarity metric, the Jensen-Bregman LogDet Divergence;
- proposed novel unsupervised clustering algorithms with high clustering accuracy.

For vector data, we:
- established a connection between LSH and sparse coding;
- proposed efficient algorithms for robust NN retrieval.

We also proposed a framework for sparse coding covariances: Generalized Dictionary Learning.
Slide 54
Talk Outline
- Introduction
- Motivation
- Problem Statement
- Algorithms for Similarity Search in Matrix Valued Data
- Algorithms for Similarity Search in High Dimensional Vector Data
- Conclusion
- Future Work
Slide 55
Future Work
Covariance data:
- Application of JBLD to DT-MRI
- Semi-supervised Dirichlet process mixture models
- Metric learning on covariance manifolds
- Locality sensitive hashing on covariances

High dimensional vector data:
- Hamming embedding via dictionary learning
- Dictionary learning under constraints
- Bulk sparse coding
- Large scale dictionary learning