Object Recognition Part 2

Page 1: Object Recognition Part 2

Object Recognition Part 2

Authors: Kobus Barnard, Pinar Duygulu, Nando de Freitas, and David Forsyth

Slides by Rong Zhang

CSE 595 – Words and Pictures. Presentation Day: 2-24-11

Page 2: Object Recognition Part 2

What is Object Recognition?

• The process of attaching words to images and to image regions (objects).

• Essentially, translating visual representations such as pictures or illustrations into language.

IMG Source: http://www.featurepics.com/online/Many-Objects-Isolated-1432542.aspx

Page 3: Object Recognition Part 2

Why Use This Method?

• This method is concrete and testable.

• There is no need to specify in advance which objects and/or scene semantics are to be considered.

• It can exploit large datasets such as:
  – Indexed image databases (online photo collections).
  – Web images with captions or other associated text.
  – Or even video data along with speech recognition.

• Can be used to develop vision tools which are applicable to the general recognition problem.

Page 4: Object Recognition Part 2

How Do We Do This?

• Use clustering methods (from the last paper presentation) to “learn” the joint statistics of image words and segments (see the sketch after this list).
  – This method has been applied to traditional image database applications.
  – It is suggested that these methods can produce words associated with images outside the training set.
  – The produced words are indicative of the scene semantics.

• There is a strong connection between the produced words and the scene.
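A minimal sketch of this idea, using a flat k-means over concatenated caption and segment features in place of the paper’s hierarchical model (the feature layout, vocabulary, and function names are illustrative assumptions, not the authors’ implementation):

    import numpy as np
    from sklearn.cluster import KMeans

    def fit_joint_clusters(word_counts, segment_features, n_clusters=8):
        # word_counts: (n_images, vocab_size) bag-of-words from the captions
        # segment_features: (n_images, feature_dim), e.g. mean colour/texture of the segments
        X = np.hstack([word_counts / (word_counts.sum(axis=1, keepdims=True) + 1e-9),
                       segment_features])
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
        # Per-cluster word distributions: the "joint statistics" we can reuse to
        # suggest words for a new image assigned to the same cluster.
        word_probs = np.array([word_counts[km.labels_ == c].sum(axis=0)
                               for c in range(n_clusters)])
        word_probs = word_probs / (word_probs.sum(axis=1, keepdims=True) + 1e-9)
        return km, word_probs

For a new image we would assign it to the nearest cluster using only the segment-feature part of each centroid, and read off that cluster’s word distribution.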

Page 5: Object Recognition Part 2

Scene-Based Word Prediction

• The previous presentation showed us how we can generate words for images.
  – The model uses a tree structure:
    • The tree can be built top-down (general to specific) or bottom-up (specific to general).
    • A leaf-to-root path defines a cluster.

• Notation: D = set of observations, d = a specific document, c = clusters, i = indexed items, l = indexed levels.

• The model relies on document-specific quantities from the training set, so images outside the training set require more complex handling (a hedged sketch of the word distribution follows).
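In this notation, a plausible form of the document-conditional word distribution for this kind of hierarchical model (a hedged sketch, not necessarily the authors’ exact equation) is:

    P(w | d) = Σ_c P(c | d) Σ_l P(w | l, c) P(l | c, d)

i.e., a word is generated by choosing the document’s cluster c, choosing a level l on that cluster’s leaf-to-root path (the “vertical” mixing weights P(l | c, d)), and emitting the word from the node at that level.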

Page 6: Object Recognition Part 2

What if we want to consider a document not in the training set?

• We need to either estimate the vertical mixing weights or marginalize out the training data (not good for large datasets).

• We can use a cluster-specific average computed during training.

• Or we can re-fit the model with the document under consideration.

• For some applications this works well, which indicates that the selection of vertical nodes is what matters most, and that further modeling of their distribution doesn’t matter as much.
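A hedged sketch of the cluster-specific-average option, using the notation from the previous slide: average the document-specific vertical weights over the training documents D_c assigned to cluster c,

    P(l | c) ≈ (1 / |D_c|) Σ_{d ∈ D_c} P(l | c, d)

and use this P(l | c) in place of P(l | c, d) when the document under consideration is not in the training set.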

Page 7: Object Recognition Part 2

Word Prediction

• To predict words from images, we assume that we have a new document with a set of observed image segments, S, and we use the formula sketched after this list.

• Here we don’t really take the document index, d, into account in P(l).
  – We are interested in applying the model to documents outside the training set.
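A hedged reconstruction of that prediction rule, consistent with the description above but not necessarily the authors’ exact equation: weight each cluster by how well it explains the observed segments S, and use a document-independent P(l),

    P(w | S) ∝ Σ_c P(c) [ Π_{s ∈ S} Σ_l P(s | l, c) P(l) ] Σ_l P(w | l, c) P(l)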

Page 8: Object Recognition Part 2

Region-Based Word Prediction (Correspondence)

• The previous models implicitly learn some correspondence through co-occurrence, because there was a fitting advantage to having “topics” collected at the nodes.
  – For example, tigers and orange stripy areas.

• The following word prediction method (sketched below) can be used to measure how well the nodes bind words and regions:
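A hedged sketch of a per-region scoring rule of this kind (not necessarily the authors’ exact equation): for a single segment s,

    P(w | s) ∝ Σ_c P(c) Σ_l P(w | l, c) P(s | l, c) P(l)

so a word scores highly for a region when both tend to be emitted from the same nodes.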

Page 9: Object Recognition Part 2

Integrating Correspondence

• Integrating the correspondence method into the model and learning it during training is a much better way to do it.

• So now we assume that observed words and regions (image segments) are emitted in pairs, so that D = {(w, s)_i}.
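Under this pairing assumption, a hedged form of the emission model is that each pair is generated jointly from a single node:

    P(w, s | l, c) = P(w | l, c) P(s | l, c)
    P(w, s | d) = Σ_c P(c | d) Σ_l P(l | c, d) P(w | l, c) P(s | l, c)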

Page 10: Object Recognition Part 2

Getting Better

• Previous methods would produce a “likely” cluster from a query on a segment; the predicted words had more freedom.
  – Not exactly what we’re looking for.

• The new model brings stronger ties between the emitted words and regions.
  – But this increases the complexity of the learning algorithm (as expected).

Page 11: Object Recognition Part 2

Learning During Training

• We sample likely correspondences using estimates of the probabilities that each word is emitted with each segment.

• This is done before the E step in the model-fitting process.
  – The resulting word–segment assignment is then used in the estimation of the expectation of the indicator variables.

• We now assume that the probability that a word and a segment are tied together can be estimated by the probability that they are emitted from the same node (see the sketch after this list).
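A minimal sketch of this sampling step, assuming we already have a matrix same_node_prob[i, j] holding the estimated (unnormalized) probability that word i and segment j are emitted from the same node; the matrix name and layout are illustrative, not from the paper:

    import numpy as np

    def sample_correspondence(same_node_prob, rng=None):
        # same_node_prob: (n_words, n_segments) unnormalized probabilities
        if rng is None:
            rng = np.random.default_rng(0)
        n_words, n_segments = same_node_prob.shape
        assignment = np.empty(n_segments, dtype=int)
        for j in range(n_segments):
            p = same_node_prob[:, j]
            p = p / p.sum()
            # sample which word this segment is paired with for this iteration
            assignment[j] = rng.choice(n_words, p=p)
        return assignment  # (word, segment) pairs fed into the E step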

Page 12: Object Recognition Part 2

The Final Probability

• We consider each segment in turn and choose words using the selection rule sketched after this list.

• But we do so restricted to words not yet accounted for, or without restriction once all words have been paired.
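A hedged sketch of that selection, assuming we already have P(word | segment) scores from the region-based predictor above (array and variable names are illustrative):

    import numpy as np

    def choose_words(word_given_segment, vocab):
        # word_given_segment: (n_segments, vocab_size) scores P(word | segment)
        used = set()
        chosen = []
        for probs in word_given_segment:
            order = np.argsort(probs)[::-1]           # best-scoring words first
            fresh = [i for i in order if i not in used]
            best = fresh[0] if fresh else order[0]    # unrestricted once all words are used
            used.add(best)
            chosen.append(vocab[best])
        return chosen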

Page 13: Object Recognition Part 2

Measuring Performance

• We want to be able to point out results that are far below human abilities. (really bad ones…)

• We can observe word predictions on held-out data.
  – The held-out data already has associated words or text, which we can use for our comparisons.

Page 14: Object Recognition Part 2

The Difficulties…

• Well, our vocabulary is fairly large…

• How do we tell how close, or accurate, a word is to another word?
  – Is there a scale for measuring how “close” one word is to another?

• A general-purpose loss function should capture the observation that certain errors aren’t as bad or critical as others.
  – For example, “cat” for “tiger” is nowhere near as critical an error as “car” for “vegetable”.
  – Again, the issue still remains: the closeness of one term to another is very subjective.

Page 15: Object Recognition Part 2

Solution?

• We can count the words.

• Even still, the task is hard (the vocabulary is large).
  – The incorrect terms far outnumber the appropriate ones.

• We also can’t just subtract the bad predictions from the good ones; the raw difference is not a meaningful score.

Page 16: Object Recognition Part 2

~ Solution

• But we can normalize the correct and incorrect classifications:

• score = r/n − w/(N − n), where r + w = total number of predicted words, and:
  N = vocabulary size
  n = number of actual words for the image
  r = number of words predicted correctly
  w = number of words predicted incorrectly

  – Returns 0 if everything or nothing is predicted.
  – Returns 1 if exactly the actual word set is predicted.
  – Returns −1 if the complement of the actual word set is predicted.
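A small example computing this normalized score, assuming the r/n − w/(N − n) form above:

    def annotation_score(predicted, actual, vocab_size):
        # predicted, actual: sets of words; vocab_size: N
        r = len(predicted & actual)          # words predicted correctly
        w = len(predicted - actual)          # words predicted incorrectly
        n = len(actual)                      # actual words for the image
        return r / n - w / (vocab_size - n)

    # annotation_score({"tiger", "grass"}, {"tiger", "grass", "water"}, 155)
    # -> 2/3 - 0 = 0.667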

Page 17: Object Recognition Part 2

Measuring Performance of Region + Word prediction

• Measuring region/segment-oriented word prediction performance is, however, much harder than straight annotation.
  – We don’t have the correspondence information.

• But we can still use the annotation task as a proxy, since the performance is correlated.
  – We can report results with image-word-based prediction methods such as ave-vert and doc-vert.
  – We can also sum over the words emitted by the regions (the pair-cluster and pair-only methods).

Page 18: Object Recognition Part 2

Humans can get involved too

• We can further complement the previously mentioned methods with human judgment.

• We can count the number of times the word with the highest probability for a region was acceptable as an indexed term, had any ties to the region, and/or had some sort of visual connection to it.

• But things get a bit fuzzy if the region crosses multiple natural boundaries.
  – We tend to count the most prevalent one, i.e., the object covering more than 50% of the region.

Page 19: Object Recognition Part 2

Experiment

• Used 160 CDs from the Corel image data set.
  – Each disk had a relatively specific topic such as “aircraft”.
  – 80 of the CDs were used as the sample.

• Of the sample, 75% was used for training and 25% for testing.

• Results indicated that learning correspondence was helpful for the annotation task.
  – We can apply this to tasks that don’t require correspondence, such as auto-annotation.

• Training with correspondence improved performance from 14.2% to 17.3%.

Page 20: Object Recognition Part 2

• The top row is fairly good.

• As you can see, the tiger example has slight problems.

• The race car example isn’t all that great either.

Page 21: Object Recognition Part 2

• As you can see, using the held-out data drops performance.

• But when correspondence is added, performance improves.

Page 22: Object Recognition Part 2

• This shows that learning correspondence is generally helpful for word prediction on held-out data.