
Page 1

A Model for Learning the Semantics of Pictures

V. Lavrenko, R. Manmatha, J. Jeon
Center for Intelligent Information Retrieval
Computer Science Department, University of Massachusetts Amherst

Neural Information Processing Systems Conference poster

Page 2

Abstract

• Goal: automatically annotate an image with keywords and retrieve images based on text queries.

• We assume that every image is divided into regions, each described by a continuous-valued feature vector.

• Given a training set of images with annotations, we compute a joint probabilistic model of image features and words.

• Experiments show that our model significantly outperforms the best of the previously reported results on the tasks of automatic image annotation and retrieval.

Page 3

Introduction

• We propose a model which looks at the probability of associating words with image regions.

• The surrounding context often simplifies the interpretation of regions as specific objects.
– The association of a region with the word "tiger" is increased by the fact that there is a grass region and a water region in the same image, and should be decreased if instead there is a region corresponding to the interior of an aircraft.

Page 4

Introduction

• Thus the association of different regions provides context, while the association of words with image regions provides meaning.

• Continuous-space Relevance Model (CRM)
– a statistical generative model related to relevance models in information retrieval
– directly associates continuous features with words and does not require an intermediate clustering stage

Page 5

Related Work

• Mori et al. (1999)
– Co-occurrence Model
– regular grid

• Duygulu et al. (2002)
– Translation Model
– segmentation, blobs

• Jeon et al. (2003)
– Cross-Media Relevance Model (CMRM)
– analogous to the cross-lingual retrieval problem

Page 6

Related Work

• Blei and Jordan (2003)
– Correspondence Latent Dirichlet Allocation (LDA) Model
– This model assumes that a Dirichlet distribution can be used to generate a mixture of latent factors. This mixture of latent factors is then used to generate words and regions.

Page 7

Related Work

• Differences between CRM and CMRM
– CMRM is a discrete model and cannot take advantage of continuous features
– CMRM relies on clustering of the feature vectors into blobs. Annotation quality of the CMRM is very sensitive to clustering errors, and depends on being able to select the right cluster granularity a priori

Page 8

A Model of Annotated Images

• Learning a joint probability distribution P(w,r) over the regions r of some image and the words w in its annotation.

• Image Annotation
– compute a conditional likelihood P(w|r), which can then be used to guess the most likely annotation w for the image

• Image Retrieval
– compute the query likelihood P(w_qry|r_J) for every image J in the dataset; we can then rank images in the collection according to their likelihood of having the query as annotation

Page 9

Representation of Images and Annotations

• Let C denote the finite set of all possible pixel colors
– c0: a transparent color, which will be handy when we have to layer image regions
• Assume that all images are of a fixed size W×H
• Represent any image as an element of a finite set R = C^(W×H)
– each image contains several distinct regions {r1…rn}
– each region is itself an element of R and contains the pixels of some prominent object in the image
– all pixels around the object are set to be transparent

Page 10

Representation of Images and Annotations

Page 11

Representation of Images and Annotations

• function g which maps image regions r ∈ R to real-valued vectors

• The value g(r) represents a set of features, or characteristics of an image region.

• an annotation for a given image is a set of words {w1…wm} drawn from some finite vocabulary V

• modeling a joint probability for observing a set of image regions {r1…rn} together with the set of annotation words {w1…wm}

Page 12

A Model for Generating Annotated Images

• Generated image J is based on three distinct probability distributions:
– words in w_J are an i.i.d. random sample from some underlying multinomial distribution P_V(·|J)
– regions r_J are produced from a corresponding set of generator vectors g1…gn according to a process P_R(r_i|g_i) which is independent of J
– the generator vectors g1…gn are themselves an i.i.d. random sample from some underlying multi-variate density function P_g(·|J)

Page 13

A Model for Generating Annotated Images

• Let r_A = {r1…r_nA} denote the regions of some image A, which is not in the training set T
• Let w_B = {w1…w_nB} be some arbitrary sequence of words
• We want P(r_A, w_B), the joint probability of observing an image denoted by r_A together with annotation words w_B

Page 14

A Model for Generating Annotated Images

• The overall process for jointly generating w_B and r_A is as follows (a minimal sampling sketch appears after this list):
– Pick a training image J ∈ T with some probability P_T(J)
– For b = 1…n_B:
• pick the annotation word w_b from the multinomial distribution P_V(·|J)
– For a = 1…n_A:
• sample a generator vector g_a from the probability density P_g(·|J)
• pick the image region r_a according to the probability P_R(r_a|g_a)
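As an illustration, here is a minimal Python sketch of this sampling process. The data layout is a hypothetical one (the `p_words` and `features` keys are my naming, not the paper's): each training image J carries its multinomial P_V(·|J) and its region feature vectors, and P_g(·|J) is the Gaussian kernel density with Σ = βI described later in the deck.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_annotated_image(train_images, vocab, beta, n_words, n_regions):
    """Sample (annotation words, generator vectors) by the process above.

    Hypothetical layout: train_images is a list of dicts with
      'p_words'  : length-|vocab| array, the multinomial P_V(.|J)
      'features' : (n_J, k) array of region feature vectors g(r_i)
    """
    # Step 1: pick a training image J with uniform probability P_T(J) = 1/N_T.
    J = train_images[rng.integers(len(train_images))]

    # Step 2: draw annotation words i.i.d. from the multinomial P_V(.|J).
    words = list(rng.choice(vocab, size=n_words, p=J['p_words']))

    # Step 3: draw generator vectors i.i.d. from P_g(.|J), a kernel density
    # with one Gaussian (covariance beta * I) per training region: choose a
    # kernel center uniformly, then add isotropic Gaussian noise.
    k = J['features'].shape[1]
    centers = J['features'][rng.integers(len(J['features']), size=n_regions)]
    gs = centers + rng.normal(scale=np.sqrt(beta), size=(n_regions, k))

    # The regions r_a would then be picked via P_R(r_a | g_a); the generator
    # vectors already carry the region features, so we return them directly.
    return words, gs
```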

Page 15

A Model for Generating Annotated Images

• The probability of a joint observation {r_A, w_B} is given by:
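The formula survives only as an image on the original slide; reconstructed from the three-step generative process above (using the P_T, P_V, P_g, P_R defined earlier), equation (1) reads:

$$ P(r_A, w_B) \;=\; \sum_{J \in T} P_T(J) \prod_{b=1}^{n_B} P_V(w_b \mid J) \prod_{a=1}^{n_A} \int P_R(r_a \mid g)\, P_g(g \mid J)\, dg $$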

Page 16

Estimating parameters on the model

• P_T(J) = 1/N_T, where N_T is the size of the training set
• P_R(r|g) = 1/N_g if g(r) = g, and 0 otherwise
– where N_g is the number of all regions r' in R such that g(r') = g

Page 17

Estimating parameters on the model

– Let r_J = {r1…rn} be the set of regions of image J
– P_g(·|J) is estimated by placing a Gaussian kernel over the feature vector g(r_i) of every region of image J
– Each kernel is parameterized by the feature covariance matrix Σ; we assumed Σ = β·I, where I is the identity matrix. β plays the role of kernel bandwidth: it determines the smoothness of P_g around the support point g(r_i). (The resulting estimate is written out below.)
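Under these assumptions the density estimate takes the standard Gaussian kernel form (a reconstruction consistent with the description above; the formula is an image on the original slide, and its normalization convention may differ slightly):

$$ P_g(g \mid J) \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{\exp\!\left(-\tfrac{1}{2}\,(g - g(r_i))^{\top} \Sigma^{-1} (g - g(r_i))\right)}{\sqrt{(2\pi)^{k}\,|\Sigma|}}, \qquad \Sigma = \beta I $$

where k is the dimensionality of the feature space.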

Page 18

Estimating parameters on the model

• Consider the simplex of all multinomial distributions over V
• Assume a Dirichlet prior over this simplex that has parameters {μ p_v : v ∈ V}
• μ is a constant; p_v is the relative frequency of observing the word v in the training set
• Introducing the observation w_J results in a Dirichlet posterior over this simplex with parameters {μ p_v + N_{v,J} : v ∈ V}
• N_{v,J} is the number of times v occurs in the observation w_J (the resulting estimate for P_V is written out below)
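Taking the expectation of this Dirichlet posterior gives the smoothed word probability; this is the standard posterior-mean computation from the parameters above (the p_v sum to one, so the normalizer is μ plus the annotation length):

$$ P_V(v \mid J) \;=\; \frac{\mu\, p_v + N_{v,J}}{\mu + \sum_{v' \in V} N_{v',J}} $$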

Page 19

Experimental Results

• Dataset: provided by Duygulu et al.
– http://www.cs.arizona.edu/people/kobus/research/data/eccv_2002
– 5,000 images from 50 Corel Stock Photo CDs
– each image contains an annotation of 1-5 keywords; overall there are 371 words
– every image in the dataset is pre-segmented into regions using general-purpose algorithms, such as normalized cuts
• pre-computed feature vector for every segmented region
• feature set consists of 36 features: 18 color features, 12 texture features and 6 shape features

Page 20

Experimental Results

– divided the dataset into 3 parts, with 4,000 training set images, 500 evaluation set images and 500 images in the test set
• The evaluation set is used to find system parameters. After fixing the parameters, we merged the 4,000 training set and 500 evaluation set images to make a new training set. This corresponds to the training set of 4,500 images and the test set of 500 images used by Duygulu et al.

Page 21

Results: Automatic Image Annotation

• Given a set of image regions r_J, we use equation (1) to arrive at the conditional distribution P(w|r_J)
• Take the top 5 words from that distribution and call them the automatic annotation of the image in question (see the sketch after this list)
• Compute annotation recall and precision for every word in the testing set
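A minimal sketch of this annotation step, under the same hypothetical data layout as the sampling sketch earlier (per-image word multinomials `p_words` and region feature arrays assumed precomputed; all names are illustrative, not from the paper):

```python
import numpy as np

def log_kde(g, centers, beta):
    """log P_g(g|J): mean of isotropic Gaussian kernels, Sigma = beta*I."""
    k = centers.shape[1]
    sq = np.sum((centers - g) ** 2, axis=1) / beta
    log_norm = 0.5 * k * np.log(2 * np.pi * beta)
    return np.logaddexp.reduce(-0.5 * sq - log_norm) - np.log(len(centers))

def annotate(test_features, train_images, vocab, beta, top_k=5):
    """Score every vocabulary word for one test image and return the
    top_k words as its automatic annotation (equation (1), uniform P_T)."""
    scores = np.zeros(len(vocab))
    for J in train_images:
        # log of prod_a P_g(g_a|J) over the test image's regions
        log_img = sum(log_kde(g, J['features'], beta) for g in test_features)
        # accumulate sum_J P_V(w|J) * prod_a P_g(g_a|J); a real implementation
        # would keep this in log space to avoid underflow
        scores += J['p_words'] * np.exp(log_img)
    order = np.argsort(scores)[::-1][:top_k]
    return [vocab[i] for i in order]
```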

Page 22

Results: Automatic Image Annotation

– Recall is the number of images correctly annotated with a given word, divided by the number of images that have that word in the human annotation.
– Precision is the number of correctly annotated images divided by the total number of images annotated with that particular word.
– Recall and precision values are averaged over the set of testing words (a direct computation of these per-word metrics is sketched below).
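These definitions translate directly into code. A minimal sketch, assuming predictions and ground truth are given as parallel lists of word sets, one per test image (names hypothetical):

```python
def word_recall_precision(word, predicted, truth):
    """Per-word recall and precision over a test collection."""
    correct = sum(1 for p, t in zip(predicted, truth) if word in p and word in t)
    has_word = sum(1 for t in truth if word in t)        # images with word in human annotation
    annotated = sum(1 for p in predicted if word in p)   # images we annotated with the word
    recall = correct / has_word if has_word else 0.0
    precision = correct / annotated if annotated else 0.0
    return recall, precision
```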

Page 23

Results: Automatic Image Annotation

Page 24

Results: Automatic Image Annotation

Page 25

Results: Ranked Retrieval of Images

• Given a text query w_qry and a testing collection of un-annotated images, for each testing image we use equation (1) to get the conditional probability P(w_qry|r_J), and rank images by it (a sketch follows this list)
• We use four sets of queries, constructed from all 1-, 2-, 3- and 4-word combinations of words that occur at least twice in the testing set
• An image is considered relevant to a given query if its manual annotation contains all of the query words
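A minimal retrieval sketch on top of the annotation code above, reusing the hypothetical `log_kde` and data layout (the `name` key is likewise assumed). The query likelihood conditions the joint of equation (1) on the observed regions: P(w_qry|r_J) = P(w_qry, r_J) / P(r_J).

```python
import numpy as np

def rank_images(query_words, test_images, train_images, vocab, beta):
    """Rank test images by P(w_qry | r_J), with uniform P_T(J)."""
    idx = [vocab.index(w) for w in query_words]
    ranked = []
    for J_test in test_images:
        joint = 0.0   # sum_J P_V(query|J) * prod_a P_g(g_a|J)
        norm = 0.0    # same sum without the query term, i.e. P(r_J)
        for J in train_images:
            img_lik = np.exp(sum(log_kde(g, J['features'], beta)
                                 for g in J_test['features']))
            joint += img_lik * np.prod(J['p_words'][idx])
            norm += img_lik
        ranked.append((joint / norm if norm else 0.0, J_test['name']))
    return sorted(ranked, key=lambda t: t[0], reverse=True)
```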

Page 26

Results: Ranked Retrieval of Images

• Evaluation metrics (both sketched below):
– precision at 5 retrieved images
– non-interpolated average precision
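For reference, minimal implementations of the two metrics under their standard definitions (not code from the paper), given a ranked list of image ids and the set of relevant ids:

```python
def precision_at_5(ranked_ids, relevant):
    """Fraction of the top 5 retrieved images that are relevant."""
    return sum(1 for i in ranked_ids[:5] if i in relevant) / 5.0

def average_precision(ranked_ids, relevant):
    """Non-interpolated AP: average of precision at each relevant hit,
    taken over the total number of relevant images."""
    hits, total = 0, 0.0
    for rank, i in enumerate(ranked_ids, start=1):
        if i in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```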

Page 27

Results: Ranked Retrieval of Images

Page 28

Conclusions and Future Work

• Proposed a new statistical generative model for learning the semantics of images.
– This model works significantly better than a number of other models for image annotation and retrieval.

– Our model works directly on the continuous features.

• Future work will include extending this work to larger datasets.