Watch, Listen and Learn

Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney

-Pratiksha Shah

What’s the problem

• We want to automatically annotate and index images, given their ever-growing number

• Visual cues alone can be ambiguous: appearance depends on lighting, and even objects of the same kind vary widely

Previous methods

• Previous work was focused on learning the association between visual and textual information

• Many researchers have worked on activity recognition in videos using only visual cues

• Some used co-training, but in a different flavour (co-SVM, two visual views, two text views)

How we propose to solve it

• The two major factors are the features and approach.

• We want to use {visual information + linguistic information + unlabeled multi-modal data} for the learning process, using a co-training approach.

• Visual and linguistic information are used as separate cues, and we expect them to complement each other during training.

What is co-training

• Training using 2 different (conditionally independent and sufficient) views

• First learn a separate classifier for each view

• Most confident predictions of these classifiers are then used to iteratively construct labeled training data

• Change made: an unlabeled example is only labeled if a pre-specified confidence threshold for that view is exceeded

• Used for: classifying webpages (based on content and hyperlink views), classifying email (based on header and body), object detection models

Co-training Algorithm

– Train a classifier for each view using the labeled data with just the features for that view.

– Loop over the following steps until there are no more unused unlabeled instances:

1. Compute predictions and confidences of both classifiers for all of the unlabeled instances.

2. For each view, choose the m unlabeled instances for which its classifier has the highest confidence. For each such instance, if the confidence value is less than the threshold for this view, then ignore the instance and stop labeling instances with this view; else label the instance and add it to the supervised training set.

3. Retrain the classifiers for both views using the augmented labeled data.
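The loop above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the nearest-centroid `CentroidClassifier` is a toy stand-in for the per-view classifier, and the names `co_train`, `m` and `thresholds`, the softmax-style confidence, and the extra "stop when neither view labels anything" exit are all assumptions made for the example.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


class CentroidClassifier:
    """Toy per-view classifier (assumption: stands in for the real one)."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_with_confidence(self, x):
        # Confidence: softmax over negative distances to the class centroids.
        weights = {lab: math.exp(-distance(x, c)) for lab, c in self.centroids.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())


def co_train(labeled, unlabeled, m=1, thresholds=(0.6, 0.6)):
    """labeled: list of ((view0, view1), label); unlabeled: list of (view0, view1)."""
    labeled = list(labeled)
    pool = list(unlabeled)

    def train():
        # One classifier per view, each seeing only that view's features.
        return [
            CentroidClassifier().fit([x[v] for x, _ in labeled], [y for _, y in labeled])
            for v in (0, 1)
        ]

    clfs = train()
    while pool:
        newly_labeled = {}  # pool index -> predicted label
        for v in (0, 1):
            # Step 1: predictions and confidences for every unlabeled instance.
            scored = [(clfs[v].predict_with_confidence(x[v]), i) for i, x in enumerate(pool)]
            # Step 2: consider the m most confident instances for this view.
            scored.sort(key=lambda s: s[0][1], reverse=True)
            for (label, conf), i in scored[:m]:
                if conf < thresholds[v]:
                    break  # below threshold: stop labeling with this view
                newly_labeled.setdefault(i, label)
        if not newly_labeled:
            break  # assumption: exit if no view is confident enough
        labeled += [(pool[i], lab) for i, lab in newly_labeled.items()]
        pool = [x for i, x in enumerate(pool) if i not in newly_labeled]
        clfs = train()  # Step 3: retrain both views on the augmented data.
    return clfs, labeled
```

Here "stop labeling instances with this view" is read as applying to the current round only; a different reading would disable the view for all later rounds.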

Text Feature

• Preprocess text by removing stop-words

• Stem the remaining words using Porter stemming

• Frequencies of the resulting word stems comprise the final textual features (“bag of words” representation)
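As a rough illustration of this pipeline, the sketch below removes stop-words, stems, and counts. The stop-word list is a tiny made-up sample, and `crude_stem` is a placeholder suffix stripper, not the Porter algorithm; in practice a real Porter stemmer would be passed in through the `stem` parameter. All function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "on", "to", "with"}


def crude_stem(word):
    """Placeholder suffix stripper. This is NOT Porter stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(caption, stem=crude_stem):
    """Map a caption to stem frequencies after stop-word removal."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    return Counter(stem(t) for t in tokens if t not in STOP_WORDS)
```

For example, `bag_of_words("The player kicks, kicked and is kicking the ball")` counts the stem `kick` three times, because the three inflected forms collapse to one feature.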

Captioned images

Features used:

• We want to capture overall texture and color distributions in local regions

• Texture – Gabor filter with 3 scales and 4 orientations

• Color – mean, standard deviation and skewness of per-channel RGB and Lab color pixel values
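The three color statistics can be sketched as follows. This is a minimal example under stated assumptions: it uses population (biased) moments, returns zero skewness for a constant channel, and the function names are hypothetical. Converting RGB to Lab is not shown.

```python
import math


def color_moments(values):
    """Return (mean, standard deviation, skewness) of one channel's pixel values."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return mean, 0.0, 0.0  # constant channel: skewness is undefined, use 0
    skew = sum((v - mean) ** 3 for v in values) / n / std ** 3
    return mean, std, skew


def cell_color_features(pixels):
    """pixels: list of per-pixel channel tuples, e.g. (R, G, B).

    Returns three moments per channel, concatenated channel by channel.
    """
    features = []
    for channel in zip(*pixels):
        features.extend(color_moments(channel))
    return features
```

With six channels (RGB plus Lab) this gives 6 × 3 = 18 color values. If each of the 3 × 4 = 12 Gabor filters contributes one value, the total would be 30 per cell, though that breakdown is an inference rather than something the slides state.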

Method (for captioned images)

• Divide each image into a 4 by 6 grid of cells

• Compute texture features using Gabor filters for each cell

• The resulting feature vectors for the regions are then clustered using k-means

• Each region is then assigned to one of the k clusters based on its closeness to the cluster centroids

• The final “bag of visual words” represents every image with a vector of k values, each denoting the number of regions of the image assigned to that cluster.
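The clustering and histogram steps can be sketched in plain Python. This is a toy k-means written for illustration only, with a fixed iteration count and seeded random initialization; the function names, the `iterations` and `seed` parameters, and the handling of empty clusters are assumptions, not details from the slides.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda j: distance(point, centroids[j]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd-style k-means; returns k centroids (the visual words)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids


def visual_word_histogram(cell_features, centroids):
    """Count how many of an image's cells fall closest to each visual word."""
    hist = [0] * len(centroids)
    for f in cell_features:
        hist[nearest(f, centroids)] += 1
    return hist
```

The centroids would be learned once from the cell features of all training images. Each image is then described by the histogram over its own cells, so every image gets a vector of length k.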

The University of Texas at Austin 10

Image Features

Divide images into 4×6 grid

Capture texture and color distributions of each cell into 30-dim vector

Cluster the vectors using k-Means to quantize the features into a dictionary of visual words

Represent each image as histogram of visual words

[Fei-Fei et al. ‘05, Bekkerman & Jeon ‘07]


Example dataset

Results for captioned images

• Compare co-training to supervised SVM

• Compare co-training to semi-supervised EM

• Compare co-training to transductive SVM

Commented videos

Features used:

• We use features that describe both salient spatial changes and interesting movements.

• Maximize a normalized spatio-temporal Laplacian operator over spatial and temporal scales

• HOG – 3x3x2 spatio-temporal blocks, 4-bin HOG descriptor for every block => 72-element descriptor

Method (for commented videos)

• Use a spatio-temporal motion descriptor (Laptev)

• To detect events, use significant local changes in image values in both space and time.

• Estimate the spatio-temporal extent of the detected events by maximizing a normalized spatio-temporal Laplacian operator over both spatial and temporal scales

• A HOG (histogram of oriented gradients) is calculated at each interest point. The patch is partitioned into a grid with 3x3x2 spatio-temporal blocks, and four-bin HOG descriptors are then computed for all blocks and concatenated into a 72-element descriptor

• These descriptors are clustered to form a vocabulary
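To make the 3 × 3 × 2 × 4 = 72 arithmetic concrete, here is a simplified sketch of such a descriptor. It is not Laptev's implementation: it uses only central-difference spatial gradients, unsigned orientations, magnitude-weighted votes and no normalization, and the function name and the `patch[t][y][x]` layout are assumptions for the example.

```python
import math


def hog_3x3x2(patch, bins=4):
    """patch[t][y][x] holds intensities; returns a list of 3*3*2*bins values."""
    T, H, W = len(patch), len(patch[0]), len(patch[0][0])
    descriptor = [0.0] * (3 * 3 * 2 * bins)
    for t in range(T):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                # Central-difference spatial gradient at this voxel.
                gx = patch[t][y][x + 1] - patch[t][y][x - 1]
                gy = patch[t][y + 1][x] - patch[t][y - 1][x]
                magnitude = math.hypot(gx, gy)
                if magnitude == 0:
                    continue
                # Unsigned orientation in [0, pi), quantized into `bins` bins.
                angle = math.atan2(gy, gx) % math.pi
                b = min(int(angle / math.pi * bins), bins - 1)
                # Which of the 3 x 3 x 2 blocks this voxel falls into.
                block = ((x * 3 // W) * 3 + (y * 3 // H)) * 2 + (t * 2 // T)
                descriptor[block * bins + b] += magnitude
    return descriptor
```

Each of the 18 blocks owns 4 consecutive entries, so concatenating the per-block histograms gives the 72-element vector that is later quantized into visual words.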


Video Features

Detect Interest Points: Harris-Förstner corner detector for both spatial and temporal space

Describe Interest Points: Histogram of Oriented Gradients (HoG)

Create Spatio-Temporal Vocabulary: quantize interest points to create 200 visual words dictionary

Represent each video as histogram of visual words

[Laptev, IJCV ‘05]


Example dataset

Results for commented video

• Compare co-training with supervised SVM for commented video dataset

• Compare co-training with supervised SVM when commentary is not available during testing

What does the future look like?

• Larger dataset + more categories for testing

• Labeled data versus associated text

• The results already show that co-training performs better than the supervised and semi-supervised methods it was compared against.

Questions
