Watch, Listen & Learn: Co-training on Captioned Images and Videos. Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney.




Page 1

Watch, Listen & Learn: Co-training on Captioned Images and Videos

Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney

Page 2

Their endeavour:

• Recognize natural scene categories from images with captions

• Recognize human actions in sports videos accompanied by commentary

Page 3

How do they go about it?

• Their model learns to classify images and videos from labelled and unlabelled multi-modal examples.

• A semi-supervised approach.

• Use image or video content together with its textual annotation (captions or commentary) to learn scene and action categories.

Page 4

Features:

• Visual features
  o Static image features
  o Motion descriptors from videos

• Textual features

Page 5

Histogram of Oriented Gradients (HOG)

• Divide the image into small connected regions (cells).

• For each cell, compile a histogram of gradient directions or edge orientations (this is the descriptor).
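A minimal sketch (assuming scikit-image; the cell and block sizes here are illustrative, not the paper's settings) of computing a HOG descriptor:

```python
import numpy as np
from skimage.feature import hog
from skimage import data, color

image = color.rgb2gray(data.astronaut())          # any grayscale image
descriptor = hog(image,
                 orientations=8,                  # bins of gradient direction
                 pixels_per_cell=(16, 16),        # the "small connected regions" (cells)
                 cells_per_block=(2, 2),          # local normalization blocks
                 feature_vector=True)
print(descriptor.shape)                           # one long concatenated orientation histogram
```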

Page 6

Gabor Filter

• A sinusoid modulated by a Gaussian envelope

• Used for texture detection
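A minimal sketch of a Gabor kernel built directly from its definition (a sinusoidal carrier multiplied by a Gaussian envelope); the parameter values and the stand-in image are illustrative only:

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size=21, sigma=4.0, theta=0.0, wavelength=8.0, psi=0.0):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)               # rotate coordinates
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + y_t**2) / (2 * sigma**2))    # Gaussian window
    carrier = np.cos(2 * np.pi * x_t / wavelength + psi)      # sinusoid
    return envelope * carrier

image = np.random.rand(64, 64)                                # stand-in grayscale image
response = convolve2d(image, gabor_kernel(theta=np.pi / 4), mode='same')
texture_energy = np.mean(np.abs(response))                    # one simple texture statistic
```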

Page 7

LAB color space

• L is lightness; a and b are the color-opponent dimensions

• Includes all perceivable colors (its gamut exceeds that of RGB and CMYK)

• Device independent

• Designed to approximate human vision
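A minimal sketch, assuming scikit-image, of converting an RGB image to LAB so that per-channel statistics can be taken in that space:

```python
import numpy as np
from skimage.color import rgb2lab

rgb = np.random.rand(64, 64, 3)        # stand-in RGB image with values in [0, 1]
lab = rgb2lab(rgb)                     # channel 0 = L (lightness), 1 = a, 2 = b
print(lab[..., 0].mean(), lab[..., 1].mean(), lab[..., 2].mean())
```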

Page 8

Static Image Features

• Each image is broken into a 4x6 grid (24 regions).

• For each region, record the mean, standard deviation and skewness of the per-channel RGB and LAB color pixel values, plus texture features computed with Gabor filters.

• This yields a 30-dimensional feature vector for each of the 24 regions.

• K-means clustering: each region of each image is assigned to the closest cluster centroid.
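A minimal sketch of the region-feature and clustering pipeline above, assuming scikit-learn and SciPy; the exact composition of the 30 dimensions and the number of clusters are illustrative, not the paper's:

```python
import numpy as np
from scipy.stats import skew
from sklearn.cluster import KMeans

def region_features(image):
    """Split an HxWxC image into a 4x6 grid and return one feature row per region."""
    feats = []
    for row in np.array_split(image, 4, axis=0):
        for region in np.array_split(row, 6, axis=1):
            pixels = region.reshape(-1, region.shape[-1])       # (n_pixels, channels)
            stats = np.concatenate([pixels.mean(axis=0),        # mean per channel
                                    pixels.std(axis=0),         # std deviation per channel
                                    skew(pixels, axis=0)])      # skewness per channel
            feats.append(stats)
    return np.vstack(feats)                                     # 24 regions x feature dims

images = [np.random.rand(120, 180, 3) for _ in range(10)]       # stand-in images
all_regions = np.vstack([region_features(im) for im in images])

kmeans = KMeans(n_clusters=25, n_init=10, random_state=0).fit(all_regions)
region_ids = kmeans.predict(region_features(images[0]))         # closest centroid per region
```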

Page 9

[Diagram: feature matrix of size N x 30 (N regions, each with a 30-dimensional feature vector)]

Page 10

Bag of visual words

The final bag of visual words for each image consists of:

• A vector of k values: the ith element represents the number of regions in the image that belong to the ith cluster.
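A minimal sketch of forming the bag-of-visual-words vector from a list of per-region cluster assignments (the example ids are made up):

```python
import numpy as np

def bag_of_visual_words(region_ids, k):
    # i-th element = number of regions in the image assigned to cluster i
    return np.bincount(region_ids, minlength=k)

region_ids = np.array([3, 7, 7, 0, 12, 3])     # example cluster ids for one image's regions
print(bag_of_visual_words(region_ids, k=25))
```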

Page 11

Motion descriptors

• Laptev spatio-temporal motion descriptors.

• At each feature point, the patch is divided into 3x3x2 spatio-temporal blocks.

• A 4-bin HOG descriptor is calculated for each block, giving a 72-element feature vector (3 x 3 x 2 x 4 = 72).

• Motion descriptors from all videos in the training set are clustered to form a vocabulary.

• A video clip is represented as a histogram over this vocabulary.
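A minimal sketch of turning per-interest-point motion descriptors into a clip-level histogram over a learned vocabulary, assuming scikit-learn; random arrays stand in for real Laptev descriptors, and the vocabulary size is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

DESCR_DIM = 3 * 3 * 2 * 4                                   # 3x3x2 blocks x 4-bin HOG = 72
train_descriptors = np.random.rand(5000, DESCR_DIM)         # from all training videos
vocabulary = KMeans(n_clusters=200, n_init=10, random_state=0).fit(train_descriptors)

clip_descriptors = np.random.rand(130, DESCR_DIM)           # one clip's interest points
words = vocabulary.predict(clip_descriptors)
clip_histogram = np.bincount(words, minlength=200)          # the clip's representation
```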

Page 12

[Diagram: feature matrix of size N x 72 (N motion descriptors, each 72-dimensional)]

Page 13

Textual features

• Image captions or transcribed video commentary.

• Preprocessing: remove stop words and stem the remaining words.

• The frequencies of the resulting word stems comprise the feature set.
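A minimal sketch of the text-view preprocessing, assuming NLTK's PorterStemmer; the small stop-word list is illustrative, and the paper's exact preprocessing tools are not specified in these slides:

```python
from collections import Counter
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "an", "in", "of", "to", "and", "his", "that"}   # illustrative subset
stemmer = PorterStemmer()

def text_features(caption):
    # Lower-case, drop stop words, stem, then count stem frequencies.
    tokens = [t for t in caption.lower().split() if t.isalpha() and t not in STOP_WORDS]
    return Counter(stemmer.stem(t) for t in tokens)

print(text_features("Bedouin leads his donkey that carries load of straw"))
# e.g. Counter({'bedouin': 1, 'lead': 1, 'donkey': 1, 'carri': 1, 'load': 1, 'straw': 1})
```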

Page 14

Co-training

What is it? A semi-supervised learning paradigm that exploits two mutually independent views.

Independent views in our case:

• Text view (text classifier)

• Visual view (visual classifier)

Page 15

Initially labelled instances

[Diagram: each labelled instance has a text view and a visual view together with its +/- label; the text views feed the text classifier and the visual views feed the visual classifier]

Page 16

Supervised learning

Classifiers learn from the labelled instances.

[Diagram: the text classifier is trained on the text views (+/-) and the visual classifier on the visual views (+/-) of the initially labelled instances]

Page 17

Co-train

The trained classifiers from the previous step are used to label unlabelled instances.

[Diagram: unlabelled instances, each with a text view and a visual view, are presented to the text classifier and the visual classifier]

Page 18

Supervised learning

Classify the most confident instances.

[Diagram: partially labelled instances, each with a text view and a visual view, are scored by the text classifier and the visual classifier]

Page 19

Supervised learning

Label all views of the instances.

[Diagram: classifier-labelled instances; each instance's text view and visual view carry the newly assigned +/- label]

Page 20

Retrain Classifier

Retrain based on the new labels.

[Diagram: the text classifier and the visual classifier are retrained on the newly labelled instances (text and visual views with +/- labels)]

Page 21

Classify a new instance

[Diagram: a new instance's text view goes to the text classifier and its visual view to the visual classifier; their predictions are combined]

Page 22

Ready for Co-training

• Input: a set of labelled and unlabelled examples, each with two sets of features (one for each view).

• Output: two classifiers whose predictions can be combined to classify new test instances.
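A minimal sketch of the co-training loop pictured in the preceding slides, using scikit-learn SVMs in place of Weka's SMO; the confidence measure, threshold, batch size and selection schedule are illustrative, not the paper's exact procedure:

```python
import numpy as np
from sklearn.svm import SVC

def co_train(X_text_l, X_vis_l, y_l, X_text_u, X_vis_u,
             iterations=20, batch=5, threshold=0.65):
    """Grow the labelled set by letting each view's classifier label the
    unlabelled pool, keeping only its most confident predictions."""
    X_text_l, X_vis_l, y_l = X_text_l.copy(), X_vis_l.copy(), y_l.copy()
    pool = list(range(len(X_text_u)))                      # indices still unlabelled
    text_clf = SVC(kernel="rbf", gamma=0.01, probability=True)
    vis_clf = SVC(kernel="rbf", gamma=0.01, probability=True)
    for _ in range(iterations):
        if not pool:
            break
        # Train one classifier per view on the currently labelled data.
        text_clf.fit(X_text_l, y_l)
        vis_clf.fit(X_vis_l, y_l)
        # Each classifier labels the pool; keep its most confident predictions.
        chosen = []
        for clf, X_u in ((text_clf, X_text_u), (vis_clf, X_vis_u)):
            probs = clf.predict_proba(X_u[pool])
            conf = probs.max(axis=1)
            for j in np.argsort(-conf)[:batch]:
                if conf[j] >= threshold:
                    chosen.append((pool[j], clf.classes_[probs[j].argmax()]))
        # Add the confidently labelled instances (both views) to the labelled set.
        for idx, label in chosen:
            if idx in pool:
                X_text_l = np.vstack([X_text_l, X_text_u[idx]])
                X_vis_l = np.vstack([X_vis_l, X_vis_u[idx]])
                y_l = np.append(y_l, label)
                pool.remove(idx)
    return text_clf, vis_clf

def co_predict(text_clf, vis_clf, x_text, x_vis):
    # Combine the two views' posteriors (simple average) to label a new instance.
    p = (text_clf.predict_proba([x_text]) + vis_clf.predict_proba([x_vis])) / 2
    return text_clf.classes_[p.argmax()]
```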

Page 23

Early and Late Fusion

Early Fusion

[Diagram: the visual features and textual features are concatenated into a fused vector and fed to a single classifier; the combined result is used for labelling test instances]

Page 24

Late Fusion

[Diagram: the visual features go to a visual classifier and the textual features to a text classifier; their combined result is used for labelling test instances]
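A minimal sketch contrasting early fusion (one classifier on the concatenated features) with late fusion (one classifier per view, outputs combined); scikit-learn SVMs and probability averaging are illustrative choices, not necessarily the paper's:

```python
import numpy as np
from sklearn.svm import SVC

X_vis = np.random.rand(40, 25)             # stand-in visual bag-of-words features
X_text = np.random.rand(40, 300)           # stand-in textual features
y = np.random.randint(0, 2, size=40)

# Early fusion: a single classifier trained on the fused vector.
early_clf = SVC(probability=True).fit(np.hstack([X_vis, X_text]), y)

# Late fusion: separate classifiers whose outputs are combined at prediction time.
vis_clf = SVC(probability=True).fit(X_vis, y)
text_clf = SVC(probability=True).fit(X_text, y)

def late_fusion_predict(x_vis, x_text):
    p = (vis_clf.predict_proba([x_vis]) + text_clf.predict_proba([x_text])) / 2
    return vis_clf.classes_[p.argmax()]
```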

Page 25

Semi-supervised EM and Transductive SVMs

Semi-supervised Expectation Maximization

• Learns a probabilistic classifier from the labeled training data

• Performs EM iterations:

• E-step: uses the currently trained classifier to probabilistically label the unlabeled training examples

• M-step: retrains the classifier on the union of the labeled data and the probabilistically labeled unsupervised examples
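A minimal sketch of semi-supervised EM, with Gaussian Naive Bayes standing in for the probabilistic classifier; applying the soft labels through sample weights is one common way to realize the M-step and is an assumption here:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def semi_supervised_em(X_l, y_l, X_u, n_iter=10):
    clf = GaussianNB().fit(X_l, y_l)                  # start from the labeled data
    for _ in range(n_iter):
        # E-step: probabilistically label the unlabeled examples.
        probs = clf.predict_proba(X_u)                # shape (n_unlabeled, n_classes)
        # M-step: retrain on labeled + soft-labeled data, one weighted copy
        # of the unlabeled set per class, weighted by its class posterior.
        X_all = np.vstack([X_l] + [X_u] * probs.shape[1])
        y_all = np.concatenate([y_l] + [np.full(len(X_u), c) for c in clf.classes_])
        w_all = np.concatenate([np.ones(len(X_l))] +
                               [probs[:, k] for k in range(probs.shape[1])])
        clf = GaussianNB().fit(X_all, y_all, sample_weight=w_all)
    return clf
```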

Page 26

Transductive SVMs

• A method of improving the generalization accuracy of SVMs by using unlabeled data.

• Finds the labeling of the test examples that results in the maximum-margin hyperplane separating the positive and negative examples of both the training and the test data.

Page 27

Transductive SVMs (contd.)

• Transductive SVMs are typically designed to improve performance on the test data by exploiting its availability during training.

• They can also be used directly in the semi-supervised setting, where unlabeled data from the same distribution as the test data is available during training.
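For reference, one common transductive SVM formulation (in the style of Joachims' TSVM; not necessarily the exact variant used here) jointly optimizes the separating hyperplane and the unknown labels $y^*_j$ of the unlabeled/test examples:

$$\min_{w,\; b,\; y^*_1,\dots,y^*_m}\;\; \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i + C^*\sum_{j=1}^{m}\xi^*_j$$

subject to $y_i(w \cdot x_i + b) \ge 1 - \xi_i$, $y^*_j(w \cdot x^*_j + b) \ge 1 - \xi^*_j$, and $\xi_i, \xi^*_j \ge 0$, so that the chosen labeling yields the maximum-margin hyperplane over both training and test data.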

Page 28

Methodology

• For co-training, a Support Vector Machine (SVM) is the base classifier for both the image and text views.

• We use the Weka implementation of sequential minimal optimization (SMO).

• SMO is an algorithm for efficiently solving the optimization problem that arises during the training of SVMs.
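A minimal sketch of the base classifier, using scikit-learn's SVC as a stand-in for Weka's SMO (the next slide specifies an RBF kernel with γ = 0.01; the data here is a random placeholder):

```python
import numpy as np
from sklearn.svm import SVC

X_train = np.random.rand(50, 25)                  # stand-in bag-of-words features
y_train = np.random.randint(0, 2, size=50)

base_clf = SVC(kernel="rbf", gamma=0.01, probability=True)   # probability outputs are
base_clf.fit(X_train, y_train)                                # needed for confidence thresholds
print(base_clf.predict_proba(np.random.rand(1, 25)))
```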

Page 29

Methodology (Continued)

Parameters used in SMO:
• RBF kernel (γ = 0.01)
• Batch size: 5

Confidence thresholds:

                    Static Images   Videos
Image/Video view    0.65            0.6
Text view           0.98            0.9

Page 30

Methodology (Continued)

• Ten iterations of ten-fold cross-validation are performed to get smoother, more reliable results.

• The test set is disjoint from both the labeled and unlabeled training data.

• Learning curves are used for evaluating accuracy.

Page 31

Learning Curves

• A learning curve shows, at a glance, the initial difficulty of learning something and, to an extent, how much there is to learn after initial familiarity.

• The curves are generated so that at each point some fraction of the training data is labeled and the remainder is used as unlabeled training data.
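A minimal sketch of producing learning-curve points by varying the labelled fraction of the training split, with a plain supervised SVM as a stand-in learner and random placeholder data:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X = np.random.rand(200, 25)
y = np.random.randint(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

curve = []
for frac in (0.1, 0.2, 0.4, 0.6, 0.8, 1.0):
    n = max(2, int(frac * len(X_train)))               # labelled portion of the training split
    clf = SVC(kernel="rbf", gamma=0.01).fit(X_train[:n], y_train[:n])
    curve.append((frac, clf.score(X_test, y_test)))    # (fraction labelled, test accuracy)
print(curve)
```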

Page 32

Results

Classifying captioned static images:

• Image dataset: the Israel dataset

• Images have short text captions

• Two classes: 'Desert' and 'Trees'

Page 33

Examples of images

DESERT: "Ibex in Judean Desert", "Ibex Eating In The Nature"

TREES: "Bedouin Leads His Donkey That Carries Load Of Straw", "Entrance To Mikveh Israel Agricultural School"

Page 34

Co-training Vs Supervised Classifiers

Page 35

Co-training vs. Semi-Supervised EM; Co-training vs. Transductive SVM

Page 36

Results (contd.)

Recognizing actions from commented videos:

• Video clips of soccer and ice skating

• Resized to 240x360 resolution and then divided manually into short clips

• Clip length varies from 20 to 120 frames

• Four categories: kicking, dribbling, spinning, dancing

Page 37

Examples of Videos

Page 38

Examples of Videos(contd.)

Page 39

Co-training Vs. supervised learning on commented video dataset

Page 40

Co-training Vs. supervised learning when text commentary is not available

Page 41

Limitations in the approach

• The data set used is small and requires only binary classification.

• Only images that have explicit captions are used.

Page 42

QUESTIONS ?????

Page 43

THANK YOU

Joydeep Sinha
Anuhya Koripella
Akshitha Muthireddy