Learning realistic human actions from movies

by Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld

PRESENTATION BY KERRY SEITZ


The Problem

Recognize natural human actions

Realistic videos

Getting out of a car

Answering a phone

Performing CPR

Kissing

[LAPTEV ET AL. 2008]

Challenges

Lack of datasets

Variations in:

◦ Expression, posture, motion, and clothing

◦ Camera motion and perspective

◦ Illumination

◦ Occlusion and surroundings

[LAPTEV ET AL. 2008]

Automatic Annotation of Human Actions

Use movie scripts

Problems

◦ No time information

◦ Script and movie don’t always match

◦ Variations in phrasing


Script-to-Video Alignment

[LAPTEV ET AL. 2008]

Script-to-Video Alignment

Alignment score (a) for each scene

◦ Script-subtitle misalignment

◦ a = (# matched words) / (# all words)

Types of errors when a = 1

◦ Misaligned in time (10%)

◦ Outside the field of view (10%)

◦ Missing in the video (10%)

[LAPTEV ET AL. 2008]
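The alignment score above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes "all words" means the words of the script text, matches words in order with Python's `difflib.SequenceMatcher`, and uses hypothetical names (`tokenize`, `alignment_score`).

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def alignment_score(script_text, subtitle_text):
    """Return a = (# matched words) / (# all script words).

    Assumption: words count as matched when they appear in the same
    order in both texts (in-order matching via SequenceMatcher).
    """
    script_words = tokenize(script_text)
    subtitle_words = tokenize(subtitle_text)
    if not script_words:
        return 0.0
    matcher = SequenceMatcher(None, script_words, subtitle_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(script_words)


score = alignment_score("I will be right back", "I'll be right back")
print(score)
```

Here three of the five script words ("be", "right", "back") are matched, so the scene scores below 1 even though the spoken line is nearly the same.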

Text Retrieval of Human Actions

Phrasing variations

◦ “Will gets out of the Chevrolet.”

◦ “A black car pulls up. Two army officers get out.”

◦ “Erin exits her new truck.”

False positives

◦ “About to sit down, he freezes.”

Keyword search is insufficient!


Text Retrieval of Human Actions

Train classifier for each action (bag of features model)

◦ Words

◦ Adjacent pairs of words

◦ Pairs of words within a window of N words (2 ≤ N ≤ 8)
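The three feature types above could be extracted roughly as follows. This is a sketch under assumptions: a "pair within a window of N words" is read here as two words that are 2 to N positions apart, and the function name `text_features` and the tuple layout are hypothetical.

```python
import re


def text_features(sentence, window=4):
    """Return a set of bag-of-features items for one scene description.

    Feature types: single words, adjacent word pairs, and non-adjacent
    pairs whose positions differ by at most `window` (an assumed
    reading of "within a window of N words").
    """
    words = re.findall(r"[a-z']+", sentence.lower())
    features = set()
    for i, word in enumerate(words):
        features.add(("word", word))
        if i + 1 < len(words):
            features.add(("adjacent", word, words[i + 1]))
        # Non-adjacent pairs: gap of 2 up to `window` positions.
        for j in range(i + 2, min(i + window, len(words) - 1) + 1):
            features.add(("window", word, words[j]))
    return features


feats = text_features("Will gets out of the Chevrolet.", window=3)
print(("adjacent", "gets", "out") in feats)
```

Pair features let a classifier pick up "gets ... out" even when other words sit in between, which a plain keyword search would miss.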

Regularized perceptron

◦ Equivalent to SVM

◦ Trained on manually labeled scene descriptions

◦ Tuned using validation set
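A regularized perceptron over sparse text features might look like the sketch below. The weight-decay form of regularization, the hyperparameter values, and the toy training data are all assumptions for illustration; the slides do not specify them.

```python
def train_perceptron(samples, epochs=10, lr=1.0, decay=0.01):
    """Train a binary perceptron on (feature_set, label) pairs, label in {+1, -1}.

    Regularization (assumed form): on every mistake, all weights are
    shrunk by a factor (1 - decay) before the usual perceptron update.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in samples:
            score = bias + sum(weights.get(f, 0.0) for f in features)
            if label * score <= 0:  # misclassified (or on the boundary)
                for f in weights:
                    weights[f] *= 1.0 - decay
                for f in features:
                    weights[f] = weights.get(f, 0.0) + lr * label
                bias += lr * label
    return weights, bias


def predict(weights, bias, features):
    """Return +1 if the description is classified as the action, else -1."""
    score = bias + sum(weights.get(f, 0.0) for f in features)
    return 1 if score > 0 else -1


# Toy, hand-made training data for a hypothetical "GetOutCar" classifier.
train = [
    ({"gets", "out", "of", "car"}, 1),
    ({"exits", "truck"}, 1),
    ({"sits", "down"}, -1),
    ({"car", "pulls", "up"}, -1),
]
weights, bias = train_perceptron(train)
print(predict(weights, bias, {"gets", "out", "of", "truck"}))
```

In practice the feature sets would come from the word and word-pair features, and `decay` would be one of the values tuned on the validation set.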


Text Retrieval of Human Actions

[LAPTEV ET AL. 2008]

The Datasets

Manual and Test Sets

◦ Manually annotated scripts

◦ Manually selected visually-correct action samples

Automatic Set

◦ Automatically annotated scripts

◦ Automatically selected action samples

◦ a > 0.5

◦ Length < 1,000 frames

[LAPTEV ET AL. 2008]
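The two selection rules for the Automatic Set amount to a simple filter. The sketch below assumes each candidate clip is a dictionary with hypothetical keys `"a"` (alignment score) and `"frames"` (clip length).

```python
def select_automatic_samples(samples, min_score=0.5, max_frames=1000):
    """Keep clips with alignment score a > 0.5 and length < 1,000 frames."""
    return [
        s for s in samples
        if s["a"] > min_score and s["frames"] < max_frames
    ]


# Made-up candidates, only to show the filter.
candidates = [
    {"id": "clip-1", "a": 0.9, "frames": 240},
    {"id": "clip-2", "a": 0.4, "frames": 300},   # poorly aligned
    {"id": "clip-3", "a": 1.0, "frames": 1500},  # too long
]
kept = select_automatic_samples(candidates)
print([s["id"] for s in kept])
```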

KTH Dataset

[LAPTEV ET AL. 2008]

Action Recognition

Sparse space-time features

◦ Compact representation

◦ Tolerant to background clutter, occlusions, and scale changes

Interest point detection – Harris operator

◦ Multiple levels of spatio-temporal scales


Interest Point Detection

[LAPTEV ET AL. 2008]

Features at the Interest points

Histogram of descriptors of space-time volumes

◦ Volumes divided into (nx, ny, nt) grid of cuboids

◦ Compute histogram of oriented gradients (HoG)

◦ Compute histogram of optic flow (HoF)

[IKIZLER ET AL. 2008]
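A gridded HoG descriptor for one space-time volume can be sketched in plain Python. The number of orientation bins, the central-difference gradients, and the L2 normalization are assumptions made for this illustration; a HoF descriptor would follow the same layout with optic-flow vectors in place of image gradients.

```python
import math


def hog_descriptor(volume, nx=3, ny=3, nt=2, bins=4):
    """Histogram of oriented gradients over an (nx, ny, nt) grid of cuboids.

    `volume[t][y][x]` holds grayscale intensities. Each cuboid gets one
    orientation histogram; the histograms are concatenated and
    L2-normalized.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    hists = [[0.0] * bins for _ in range(nx * ny * nt)]
    for t in range(T):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                gx = volume[t][y][x + 1] - volume[t][y][x - 1]
                gy = volume[t][y + 1][x] - volume[t][y - 1][x]
                magnitude = math.hypot(gx, gy)
                if magnitude == 0:
                    continue
                # Unsigned orientation in [0, pi), quantized into `bins`.
                angle = math.atan2(gy, gx) % math.pi
                b = min(int(angle / math.pi * bins), bins - 1)
                # Index of the cuboid that contains this pixel.
                cell = ((t * nt // T) * ny + (y * ny // H)) * nx + (x * nx // W)
                hists[cell][b] += magnitude
    flat = [v for h in hists for v in h]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm else flat


# Toy volume: intensity grows with x, so every gradient is horizontal.
volume = [[[float(x) for x in range(6)] for _ in range(6)] for _ in range(2)]
descriptor = hog_descriptor(volume, nx=2, ny=2, nt=1)
print(len(descriptor))
```

The descriptor length is nx × ny × nt × bins, so finer grids give longer, more position-sensitive descriptors.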

Spatio-Temporal Bag-of-Features

k-means with 4,000 clusters

Different grid sizes

Classify with non-linear SVM

[LAPTEV ET AL. 2008]
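The bag-of-features step can be illustrated as follows: each local descriptor is assigned to its nearest cluster centre (visual word), and one word histogram is built per cell of a spatio-temporal grid. This sketch assumes the codebook has already been learned with k-means, uses a tiny two-word codebook instead of 4,000 clusters, and expects interest-point positions normalized to [0, 1); the function name and data layout are hypothetical.

```python
def bof_histogram(points, codebook, grid=(1, 1, 1)):
    """Build a normalized visual-word histogram per spatio-temporal grid cell.

    `points` is a list of ((x, y, t), descriptor) with x, y, t in [0, 1).
    `codebook` is a list of cluster centres (e.g. from k-means).
    `grid` is (cells in x, cells in y, cells in t).
    """
    gx, gy, gt = grid
    k = len(codebook)
    hist = [0.0] * (gx * gy * gt * k)
    for (x, y, t), desc in points:
        # Nearest visual word by squared Euclidean distance.
        word = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(desc, codebook[c])),
        )
        cell = (int(t * gt) * gy + int(y * gy)) * gx + int(x * gx)
        hist[cell * k + word] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


codebook = [[0.0, 0.0], [1.0, 1.0]]
points = [
    ((0.1, 0.1, 0.1), [0.1, 0.0]),
    ((0.9, 0.2, 0.6), [0.9, 1.0]),
    ((0.2, 0.8, 0.9), [0.8, 0.9]),
]
whole_video = bof_histogram(points, codebook, grid=(1, 1, 1))
two_halves = bof_histogram(points, codebook, grid=(1, 1, 2))
print(whole_video, two_halves)
```

Each grid choice yields a separate histogram (a "channel"); the resulting histograms would then be fed to the non-linear SVM.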

Evaluation of Spatio-Temporal Grids

[LAPTEV ET AL. 2008]

Evaluation of Spatio-Temporal Grids

[LAPTEV ET AL. 2008]

Comparison to the State-of-the-Art

KTH dataset divided into:

◦ Training/validation set (8+8 people)

◦ Test set (9 people)

Use best performing channel combination

[LAPTEV ET AL. 2008]

Confusion Matrix

[LAPTEV ET AL. 2008]

Noise in Training Data

[LAPTEV ET AL. 2008]

Results for Real-World Videos

[LAPTEV ET AL. 2008]

Examples

[LAPTEV ET AL. 2008]

Summary

Automatic annotation using movie scripts

Action recognition performs better than state-of-the-art

System tolerant to errors in training data


Future Work

Improve script-to-video alignment

Improve tolerance of classifier

◦ Iterative learning

Experiment with other space-time low-level features


Questions?

[LAPTEV ET AL. 2008]

References

Learning Realistic Human Actions from Movies. I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. CVPR 2008.

Human Action Recognition with Line and Flow Histograms. N. Ikizler, G. Cinbis, and P. Duygulu. ICPR 2008.
