Learning realistic human actions from movies

by Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld

PRESENTATION BY KERRY SEITZ

The Problem

Recognize natural human actions

Realistic videos

Getting out of a car

Answering a phone

Performing CPR

Kissing

[LAPTEV ET AL. 2008]

Challenges

Lack of datasets

Variations in:

◦ Expression, posture, motion, and clothing

◦ Camera motion and perspective

◦ Illumination

◦ Occlusion and surroundings


Automatic Annotation of Human Actions

Use movie scripts

Problems

◦ No time information

◦ Script and movie don’t always match

◦ Variations in phrasing

Script-to-Video Alignment


Alignment score (a) for each scene

◦ Script-subtitle misalignment

◦ a = (# matched words) / (# all words)
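
The score above can be sketched in a few lines. This is a minimal illustration, not the authors' matching procedure: it assumes words are compared case-insensitively and that a script word counts as matched when the subtitle text still has an unused occurrence of it. The names `tokenize` and `alignment_score` are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenization)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def alignment_score(script_dialogue, subtitle_text):
    """Return a = (# matched words) / (# all words) for one scene.

    Assumption: a script word is matched if the subtitle text still has an
    unused occurrence of the same word (multiset overlap).
    """
    script_words = tokenize(script_dialogue)
    if not script_words:
        return 0.0
    available = Counter(tokenize(subtitle_text))
    matched = 0
    for word in script_words:
        if available[word] > 0:
            available[word] -= 1
            matched += 1
    return matched / len(script_words)
```

A scene whose script dialogue appears word for word in the subtitles gets a = 1; every script word missing from the subtitles lowers the score.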

Types of errors when a = 1

◦ Misaligned in time (10%)

◦ Outside the field of view (10%)

◦ Missing in the video (10%)


Text Retrieval of Human Actions

Phrasing variations

◦ “Will gets out of the Chevrolet.”

◦ “A black car pulls up. Two army officers get out.”

◦ “Erin exits her new truck.”

False positives

◦ “About to sit down, he freezes.”

Keyword search is insufficient!


Train classifier for each action (bag-of-features model)

◦ Words

◦ Adjacent pairs of words

◦ Pairs of words within a window of N words (2 ≤ N ≤ 8)
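
The three feature types listed above can be sketched as follows. This is an illustrative reading of the slide, with assumptions: the function name `text_features` is hypothetical, punctuation handling is deliberately crude, and all window sizes are collapsed into a single "window" feature type rather than kept separate per N.

```python
def text_features(sentence, max_window=8):
    """Return the set of bag-of-features terms for one scene description.

    Features: single words, adjacent word pairs, and non-adjacent word pairs
    that fit inside a window of N words for some 2 <= N <= max_window.
    """
    # Crude normalization (assumption): lowercase, drop periods and commas.
    words = sentence.lower().replace(".", " ").replace(",", " ").split()
    features = set()
    for i, word in enumerate(words):
        features.add(("word", word))
        # Two words fit in a window of N words when they are at most N - 1 apart.
        for j in range(i + 1, min(i + max_window, len(words))):
            kind = "adjacent" if j == i + 1 else "window"
            features.add((kind, word, words[j]))
    return features
```

For "Will gets out of the Chevrolet." this yields, among others, the word feature `gets`, the adjacent pair (`gets`, `out`), and the windowed pair (`will`, `chevrolet`).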

Regularized perceptron

◦ Equivalent to SVM

◦ Trained on manually labeled scene descriptions

◦ Tuned using validation set
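
One way to picture a regularized perceptron is a linear classifier whose weights are shrunk a little at every step and pushed toward a sample whenever its margin is too small. The sketch below is an assumption about the general technique, not the authors' implementation; the function names, learning rate, regularization strength, and epoch count are all hypothetical choices.

```python
def train_regularized_perceptron(samples, labels, epochs=20, lr=0.1, reg=0.01):
    """Train a linear classifier on sparse binary features.

    samples: list of feature sets; labels: list of +1 / -1.
    Each step shrinks the weights (L2 regularization) and, when the margin
    y * score is below 1, moves the weights toward the sample.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, y in zip(samples, labels):
            score = bias + sum(weights.get(f, 0.0) for f in features)
            # Weight decay: the regularization part of the update.
            for f in weights:
                weights[f] *= 1.0 - lr * reg
            # Margin violation: the perceptron part of the update.
            if y * score < 1.0:
                for f in features:
                    weights[f] = weights.get(f, 0.0) + lr * y
                bias += lr * y
    return weights, bias

def predict(weights, bias, features):
    """Return +1 if the description is classified as the action, else -1."""
    score = bias + sum(weights.get(f, 0.0) for f in features)
    return 1 if score > 0 else -1
```

In this setting each sample would be the feature set of one manually labeled scene description, and `lr`, `reg`, and `epochs` would be the quantities tuned on the validation set.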



The Datasets

Manual and Test Sets

◦ Manually annotated scripts

◦ Manually selected visually-correct action samples

Automatic Set

◦ Automatically annotated scripts

◦ Automatically selected action samples

◦ a > 0.5

◦ Length < 1,000 frames
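
The two selection criteria for the Automatic Set amount to a simple filter. The sketch below assumes each retrieved clip carries its alignment score and its length in frames; the `Clip` record and the function name are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    action: str        # label retrieved from the script text
    alignment: float   # script-to-video alignment score a
    num_frames: int    # clip length in frames

def select_automatic_samples(clips, min_alignment=0.5, max_frames=1000):
    """Keep clips with a > min_alignment and length < max_frames."""
    return [
        c for c in clips
        if c.alignment > min_alignment and c.num_frames < max_frames
    ]
```

Both comparisons are strict, matching the slide: a clip with a = 0.5 or with exactly 1,000 frames is rejected.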


KTH Dataset


Action Recognition

Sparse space-time features

◦ Compact representation

◦ Tolerant to background clutter, occlusions, and scale changes

Interest point detection – Harris operator

◦ Multiple levels of spatio-temporal scales

Interest Point Detection


Features at the Interest points

Histogram of descriptors of space-time volumes

◦ Volumes divided into (nx, ny, nt) grid of cuboids

◦ Compute histogram of oriented gradients (HoG)

◦ Compute histogram of optic flow (HoF)

[IKIZLER ET AL. 2008]
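
The HoG part of the descriptor can be sketched in plain Python. The example below is an assumption-laden illustration: the function name, the default grid (3, 3, 2), the 4 orientation bins, and the central-difference gradients are all choices made here, not details given on the slides. HoF would bin optical-flow vectors in the same way; computing the flow itself is omitted.

```python
import math

def hog_descriptor(volume, nx=3, ny=3, nt=2, n_bins=4):
    """Concatenate per-cuboid gradient-orientation histograms.

    volume: list of T frames, each a list of H rows with W gray values.
    Returns an L2-normalized list of nx * ny * nt * n_bins values.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    hist = [[0.0] * n_bins for _ in range(nx * ny * nt)]
    for t in range(T):
        frame = volume[t]
        for y in range(H):
            for x in range(W):
                # Central differences, clamped at the image border.
                gx = frame[y][min(x + 1, W - 1)] - frame[y][max(x - 1, 0)]
                gy = frame[min(y + 1, H - 1)][x] - frame[max(y - 1, 0)][x]
                magnitude = math.hypot(gx, gy)
                if magnitude == 0.0:
                    continue
                # Unsigned orientation in [0, pi), quantized into n_bins.
                angle = math.atan2(gy, gx) % math.pi
                b = min(int(angle / math.pi * n_bins), n_bins - 1)
                # Index of the cuboid this pixel falls into.
                cell = ((t * nt // T) * ny + (y * ny // H)) * nx + (x * nx // W)
                hist[cell][b] += magnitude
    descriptor = [v for cell_hist in hist for v in cell_hist]
    norm = math.sqrt(sum(v * v for v in descriptor)) or 1.0
    return [v / norm for v in descriptor]
```

Each cuboid contributes one small histogram, so the descriptor records roughly where in the space-time volume each edge orientation occurs.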

Spatio-Temporal Bag-of-Features

k-means with 4,000 clusters

Different grid sizes

Classify with non-linear SVM
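
The pipeline above can be sketched in two parts: quantizing descriptors against a visual vocabulary, and comparing the resulting histograms with a non-linear kernel. The code below assumes the vocabulary (the k-means cluster centers) is already computed. The slides say only "non-linear SVM"; the chi-square kernel used here is an assumed choice that is common for histogram features. All function names are hypothetical.

```python
import math

def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster center) closest to the descriptor."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(descriptor, center))
        for center in vocabulary
    ]
    return distances.index(min(distances))

def bof_histogram(descriptors, vocabulary):
    """L1-normalized histogram of visual-word counts for one video or grid cell."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def chi2_distance(h1, h2):
    """Chi-square distance between two histograms (empty bins are skipped)."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def chi2_kernel(h1, h2, scale=1.0):
    """Kernel value exp(-distance / scale); scale is an assumed free parameter."""
    return math.exp(-chi2_distance(h1, h2) / scale)
```

With a spatio-temporal grid, one histogram would be built per grid cell and the cell histograms concatenated. The kernel values between all pairs of training videos could then be passed to any SVM solver that accepts a precomputed kernel.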


Evaluation of Spatio-Temporal Grids


Comparison to the State-of-the-Art

KTH dataset divided into:

◦ Training/validation set (8+8 people)

◦ Test set (9 people)

Use best performing channel combination


Confusion Matrix


Noise in Training Data


Results for Real-World Videos


Examples


Summary

Automatic annotation using movie scripts

Action recognition outperforms the state of the art on the KTH dataset

System tolerant to errors in training data

Future Work

Improve script-to-video alignment

Improve tolerance of classifier

◦ Iterative learning

Experiment with other space-time low-level features

Questions?


References

Learning Realistic Human Actions from Movies. I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. CVPR 2008.

Human Action Recognition with Line and Flow Histograms. N. Ikizler, G. Cinbis, and P. Duygulu. ICPR 2008.
