

Kernel Descriptors for Visual Recognition by L. Bo, X. Ren and D. Fox

A Term Paper Report by Priyatham Bollimpalli (10010148)

Summary of the Paper

Popular computer vision algorithms like SIFT and HOG compute feature descriptors for an image. A descriptor is, in simple terms, a concise representation of image properties that enables many practical applications such as object recognition, scene classification and image matching. Inspired by the orientation-histogram approach used in SIFT and HOG, this paper gives a kernel view of orientation histograms and then designs kernel descriptors for gradient, colour and local binary pattern (shape) cues using match kernels. These kernels lift low-level pixel attributes to patch-level features and provide a principled notion of similarity between image patches.

To make the kernels computationally feasible, the match kernels are first approximated by finite-dimensional features, obtained by projecting onto a set of basis vectors sampled from normalized gradient vectors. Then, to reduce redundancy and produce compact features, kernel principal component analysis (KPCA) is applied. It is shown experimentally that the error introduced in these two stages is very small. The gradient, colour and shape kernel descriptors can then be computed efficiently and in a simple, straightforward way over images.

Experiments are carried out on four publicly available image classification datasets: Scene-15, Caltech101, CIFAR10 and CIFAR10-ImageNet, with Laplacian kernel SVMs used as the classifier. The gradient kernel descriptor performs best among the proposed kernel descriptors, and all of them outperform the SIFT descriptor as well as other sophisticated feature learning methods.

The main novelty of the paper is that it is the first work to cast low-level visual feature learning in terms of kernels and to show better performance than well-known methods that are the default choice for many applications. Limitations of the proposed scheme include its high computational cost (even after optimization) compared to other methods, and the difficulty of learning pixel attributes from large image collections to approximate the kernels. Since this line of research is new, alternative kernel functions, or combinations of the existing ones with other kernel methods, may get around these limitations, further improving performance or enabling use in other areas where SIFT is applied, such as object tracking and multi-view matching.


Details and Explanation of the Paper

The gradient orientation at a pixel plays an important role in describing image features, and this concept has been used extensively in many image descriptors. For example, the SIFT descriptor assigns orientations to 8 bins within each cell of a 4 x 4 block. The feature vector of a pixel $z$ is defined as $F(z) = m(z)\,\delta(z)$, where $m(z)$ is the gradient magnitude and the $i$-th component of $\delta(z)$ is 1 if the gradient orientation falls in the $i$-th bin and 0 otherwise. A soft-binning formulation can also be used, $\delta_i(z) = \max(\cos(\theta(z) - a_i)^9, 0)$, where $\theta(z)$ is the gradient orientation and $a_i$ is the $i$-th bin center. Over a patch $P$, the histogram of gradients is obtained as

$F_h(P) = \sum_{z \in P} \tilde{m}(z)\,\delta(z)$, where $\tilde{m}(z) = m(z)/\sqrt{\sum_{z \in P} m(z)^2 + \epsilon}$ (normalized magnitudes).
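To make the soft-binned histogram concrete, here is a minimal Python sketch of $F_h(P)$ for a single grayscale patch; the number of bins, the value of $\epsilon$ and the use of np.gradient are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def orientation_histogram(patch, n_bins=8, eps=1e-6):
    """Soft-binned gradient orientation histogram of a grayscale patch.
    A minimal sketch of F_h(P) = sum_z m~(z) * delta(z); bin centers and the
    exponent 9 follow the soft-binning formula quoted above."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.sqrt(gx**2 + gy**2)                       # m(z)
    theta = np.arctan2(gy, gx)                         # theta(z)
    mag_norm = mag / np.sqrt((mag**2).sum() + eps)     # m~(z)

    centers = np.linspace(0, 2 * np.pi, n_bins, endpoint=False)     # a_i
    # delta_i(z) = max(cos(theta(z) - a_i)^9, 0)  (soft binning)
    delta = np.maximum(np.cos(theta[..., None] - centers) ** 9, 0.0)
    return (mag_norm[..., None] * delta).sum(axis=(0, 1))           # F_h(P)
```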

Intuitively, the similarity between two patches $P$ and $Q$ from different images is then

$K_h(P, Q) = F_h(P)^\top F_h(Q) = \sum_{z \in P} \sum_{z' \in Q} \tilde{m}(z)\,\tilde{m}(z')\,\delta(z)^\top \delta(z')$

Since the right-hand side involves only inner products, kernel functions can be defined between pixels, giving a kernelized notion of similarity between two patches (as in HOG). However, defining the kernel in this way introduces quantization errors and leads to poor performance.

So, to capture image variations properly, the gradient match kernel is defined as follows:

$K_{grad}(P, Q) = \sum_{z \in P} \sum_{z' \in Q} \tilde{m}(z)\,\tilde{m}(z')\,k_o(\tilde{\theta}(z), \tilde{\theta}(z'))\,k_p(z, z')$

Here $k_p$ and $k_o$ are Gaussian kernels over pixel positions and gradient orientations, respectively. For better accuracy and a uniform definition, the pixel positions and orientations are normalized; in particular, the orientation is represented by the unit vector $\tilde{\theta}(z) = [\sin(\theta(z)), \cos(\theta(z))]$.
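As a concrete illustration of the double-sum structure, here is a naive Python sketch of $K_{grad}$; the per-pixel patch representation and the Gaussian bandwidths (gamma values) are assumptions, not the paper's parameters.

```python
import numpy as np

def gaussian_kernel(x, y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2) for numpy vectors x, y."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def gradient_match_kernel(P, Q, gamma_o=5.0, gamma_p=3.0):
    """Naive O(|P|*|Q|) evaluation of K_grad(P, Q). Each patch is a list of
    (pos, orient, mag_norm) tuples: pos is the normalized pixel position,
    orient = [sin(theta), cos(theta)], and mag_norm is m~(z)."""
    k = 0.0
    for pos_z, ori_z, m_z in P:
        for pos_zp, ori_zp, m_zp in Q:
            k += (m_z * m_zp
                  * gaussian_kernel(ori_z, ori_zp, gamma_o)    # k_o
                  * gaussian_kernel(pos_z, pos_zp, gamma_p))   # k_p
    return k
```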

The motivation for defining the gradient match kernel $K_{grad}$ as a product of three kernels is as follows. First, the contribution of each pixel's gradient magnitude has to be weighted, and the normalized linear kernel $\tilde{m}(z)\tilde{m}(z')$ is used for this. Then a measure of similarity of gradient orientations, $k_o$, is included, and the last Gaussian kernel $k_p$ measures how close two pixels are spatially. With a similar motivation, the colour match kernel is defined as

$K_{col}(P, Q) = \sum_{z \in P} \sum_{z' \in Q} k_c(c(z), c(z'))\,k_p(z, z')$

where $c(z)$ is the colour at pixel $z$ and $k_c$ is a Gaussian kernel over colour values.

The shape match kernel is defined over local binary patterns:

$K_{shape}(P, Q) = \sum_{z \in P} \sum_{z' \in Q} \tilde{s}(z)\,\tilde{s}(z')\,k_b(b(z), b(z'))\,k_p(z, z')$

Here $s(z)$ is the standard deviation of pixel values in the 3 x 3 neighbourhood of $z$, $\tilde{s}(z) = s(z)/\sqrt{\sum_{z \in P} s(z)^2 + \epsilon}$ is its normalized version, and $b(z)$ is a binary column vector that binarizes the pixel value differences in a local window around $z$. Thus, in the shape kernel descriptor, the contribution of each local binary pattern is weighted by $\tilde{s}(z)$, and shape similarity is measured through the local binary patterns $b(z)$ via the Gaussian kernel $k_b$.
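A minimal sketch of how $s(z)$ and $b(z)$ could be computed for one pixel, assuming a 3 x 3 window and simple thresholding against the centre pixel (the exact windowing and binarization details are assumptions here):

```python
import numpy as np

def local_binary_pattern(patch, r, c):
    """s(z) and b(z) for the pixel at (r, c), using a 3x3 window.
    b(z) binarizes the differences between the 8 neighbours and the centre
    pixel; s(z) is the standard deviation of the 3x3 neighbourhood."""
    window = patch[r - 1:r + 2, c - 1:c + 2].astype(float)
    s = window.std()                                # s(z)
    centre = window[1, 1]
    diffs = np.delete(window.ravel(), 4)            # the 8 neighbours
    b = (diffs >= centre).astype(float)             # b(z), binarized differences
    return s, b
```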


Features over image patches can then be expressed as

$F_{grad}(P) = \sum_{z \in P} \tilde{m}(z)\,\phi_o(\tilde{\theta}(z)) \otimes \phi_p(z)$

where $\phi_o$ and $\phi_p$ are the feature maps of $k_o$ and $k_p$. Since Gaussian kernels are used, $F_{grad}(P)$ is infinite-dimensional, and directly applying KPCA may be computationally infeasible when the number of patches is very large. So the match kernels are first approximated by learning finite-dimensional features, obtained by projecting $F_{grad}(P)$ onto a set of basis vectors. For example, the Gaussian kernel over gradient orientations can be approximated with $d$ dimensions as

$k_o(\tilde{\theta}(z), \tilde{\theta}(z')) \approx \mathbf{k}_B(\tilde{\theta}(z))^\top K_{BB}^{-1}\,\mathbf{k}_B(\tilde{\theta}(z'))$

where the basis vectors $\{x_i\}_{i=1}^{d}$ are sampled normalized gradient vectors, $\mathbf{k}_B(x) = [k_o(x, x_1), \ldots, k_o(x, x_d)]^\top$, and $K_{BB}$ is the $d \times d$ kernel matrix over the basis vectors.
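A minimal sketch of this finite-dimensional approximation (a Nystrom-style projection onto the span of the basis vectors; the Gaussian bandwidth and the small ridge term are assumptions added for numerical stability):

```python
import numpy as np

def gaussian_gram(X, Y, gamma):
    """Gram matrix of k(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def finite_dim_features(X, basis, gamma=5.0):
    """Project the points X onto the span of the basis vectors so that
    features(x) . features(y) approximates the Gaussian kernel k(x, y).
    The basis rows stand in for the sampled normalized gradient vectors x_i."""
    K_BB = gaussian_gram(basis, basis, gamma)          # d x d kernel matrix
    k_B = gaussian_gram(X, basis, gamma)               # n x d, k_B(x)
    # Features K_BB^{-1/2} k_B(x): their inner products give k_B(x)^T K_BB^{-1} k_B(y).
    eigval, eigvec = np.linalg.eigh(K_BB + 1e-8 * np.eye(len(basis)))
    K_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return k_B @ K_inv_sqrt                            # n x d finite features
```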

Note that the Kronecker product ⊗ is used to combine the orientation and position features, which still results in a large number of dimensions. To obtain more compact features, KPCA is then applied, which makes the evaluation time practical. The $t$-th kernel principal component can be written as

$PC^t = \sum_{i=1}^{d_o} \sum_{j=1}^{d_p} \alpha^t_{ij}\,\phi_o(x_i) \otimes \phi_p(y_j)$

where $\{x_i\}$ and $\{y_j\}$ are the basis vectors for orientations and positions, and the coefficients $\alpha^t$ are learned by KPCA. Finally, the gradient kernel descriptor is expressed as

$\bar{F}^t_{grad}(P) = \sum_{i=1}^{d_o} \sum_{j=1}^{d_p} \alpha^t_{ij} \left\{ \sum_{z \in P} \tilde{m}(z)\,k_o(\tilde{\theta}(z), x_i)\,k_p(z, y_j) \right\}$

It is shown that the error incurred in approximating the match kernels in this way is very small.
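The expression above can be evaluated directly per patch. Here is a minimal Python sketch of that evaluation; the basis vectors, KPCA coefficients and Gaussian bandwidths are placeholders supplied by the caller, not values from the paper.

```python
import numpy as np

def gradient_kernel_descriptor(pixels, X_o, Y_p, alpha, gamma_o=5.0, gamma_p=3.0):
    """Sketch of the gradient kernel descriptor formula quoted above.
    pixels: list of (pos, orient, mag_norm) for one patch; X_o (d_o x 2) and
    Y_p (d_p x 2) are orientation/position basis vectors; alpha has shape
    (T, d_o, d_p) and holds the KPCA coefficients."""
    d_o, d_p = len(X_o), len(Y_p)
    inner = np.zeros((d_o, d_p))
    for pos, ori, m in pixels:
        k_o = np.exp(-gamma_o * ((X_o - ori) ** 2).sum(1))   # k_o(theta~(z), x_i)
        k_p = np.exp(-gamma_p * ((Y_p - pos) ** 2).sum(1))   # k_p(z, y_j)
        inner += m * np.outer(k_o, k_p)                      # inner sum over z
    # F^t = sum_ij alpha[t, i, j] * inner[i, j] for each component t
    return np.tensordot(alpha, inner, axes=([1, 2], [0, 1]))
```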

The gradient (KDES-G), colour (KDES-C) and shape (KDES-S) kernel descriptors are compared to SIFT and several other state-of-the-art object recognition algorithms on four publicly available datasets: Scene-15, Caltech101, CIFAR10 and CIFAR10-ImageNet. Except on CIFAR10, Laplacian kernel SVMs are used in the experiments. The results are summarized below. Combining the three kernel descriptors is observed to boost performance by about 2%. The proposed kernel descriptors thus outperform all the other methods.

Scene-15:    KDES 86.7%   SIFT 82.2%
Caltech-101: KDES 76.4%   SPM [1] 64.4%   CDBN [2] 65.5%   LCC [4] 73.4%
CIFAR10:     KDES 76.0%   LCC [4] 74.5%   mcRBM-DBN [3] 71.0%   TCNN [5] 73.1%
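For context on the classification setup, here is a minimal sketch of a Laplacian-kernel SVM over precomputed Gram matrices, using scikit-learn as an assumed implementation (the paper does not specify one); the features and the gamma and C values are placeholders.

```python
import numpy as np
from sklearn.metrics.pairwise import laplacian_kernel
from sklearn.svm import SVC

def laplacian_svm_predict(X_train, y_train, X_test, gamma=1e-3, C=10.0):
    """Laplacian-kernel SVM on precomputed Gram matrices. X_train / X_test are
    image-level feature vectors (e.g. aggregated kernel descriptors); gamma
    and C are placeholder hyperparameters, not the paper's values."""
    K_train = laplacian_kernel(X_train, X_train, gamma=gamma)  # n_train x n_train
    K_test = laplacian_kernel(X_test, X_train, gamma=gamma)    # n_test x n_train
    clf = SVC(kernel="precomputed", C=C).fit(K_train, y_train)
    return clf.predict(K_test)
```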

[1] Lazebnik, Schmid, Ponce, CVPR '06
[2] Lee, Grosse, Ranganath, Ng, ICML '09
[3] Ranzato, Hinton, CVPR '10
[4] Yu, Zhang, ICML '10
[5] Le, Ngiam, Chen, Chia, Koh, Ng, NIPS '10