Learning realistic human actions from movies
by Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld
PRESENTATION BY KERRY SEITZ
The Problem
Recognize natural human actions
Realistic videos
Getting out of a car
Answering a phone
Performing CPR
Kissing
[LAPTEV ET AL. 2008]
Challenges
Lack of datasets
Variations in:
◦ Expression, posture, motion, and clothing
◦ Camera motion and perspective
◦ Illumination
◦ Occlusion and surroundings
[LAPTEV ET AL. 2008]
Automatic Annotation of Human Actions
Use movie scripts
Problems
◦ No time information
◦ Script and movie don’t always match
◦ Variations in phrasing
Script-to-Video Alignment
[LAPTEV ET AL. 2008]
Script-to-Video Alignment
Alignment score (a) for each scene
◦ Measures script-subtitle misalignment
◦ a = (# matched words) / (# all words)
Types of errors when a = 1
◦ Misaligned in time (10%)
◦ Outside the field of view (10%)
◦ Missing in the video (10%)
[LAPTEV ET AL. 2008]
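The alignment score above can be sketched in a few lines. The slides do not say how words are matched, so this sketch assumes a simple order-insensitive match in which each subtitle word can be used once; `tokenize` and `alignment_score` are hypothetical names, not part of the original system.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def alignment_score(script_text, subtitle_text):
    """Return a = (# matched words) / (# all words), counted over script words.

    Assumption: a script word counts as matched if an unused copy of the
    same word is still available in the subtitle text.
    """
    script_words = tokenize(script_text)
    if not script_words:
        return 0.0
    available = Counter(tokenize(subtitle_text))
    matched = 0
    for word in script_words:
        if available[word] > 0:
            available[word] -= 1
            matched += 1
    return matched / len(script_words)
```

A scene whose script dialogue matches the subtitles word for word gets a = 1, while paraphrased dialogue scores lower.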
Text Retrieval of Human Actions
Phrasing variations
◦ “Will gets out of the Chevrolet.”
◦ “A black car pulls up. Two army officers get out.”
◦ “Erin exits her new truck.”
False positives
◦ “About to sit down, he freezes.”
Keyword search is insufficient!
Text Retrieval of Human Actions
Train classifier for each action (bag of features model)
◦ Words
◦ Adjacent pairs of words
◦ Pairs of words within a window of N words (2 ≤ N ≤ 8)
Regularized perceptron
◦ Equivalent to SVM
◦ Trained on manually labeled scene descriptions
◦ Tuned using validation set
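The three feature types listed above might be extracted as follows. This is a minimal sketch, not the authors' implementation: it assumes that "within a window of N words" means two words that both fit inside a span of N consecutive words, and the names `tokenize` and `text_features` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def text_features(sentence, window=4):
    """Return the set of bag-of-features entries for one scene description.

    Features: single words, adjacent word pairs, and ordered word pairs that
    fit inside a span of `window` consecutive words (assumed definition).
    """
    if not 2 <= window <= 8:
        raise ValueError("window must satisfy 2 <= N <= 8")
    words = tokenize(sentence)
    features = set()
    for i, first in enumerate(words):
        features.add(("word", first))
        for j in range(i + 1, min(i + window, len(words))):
            if j == i + 1:
                features.add(("adjacent", first, words[j]))
            features.add(("window", first, words[j]))
    return features
```

For "Erin exits her new truck." with `window=3`, the result includes `("window", "erin", "her")` but not `("window", "erin", "new")`. A per-action classifier would then be trained on these sets.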
Text Retrieval of Human Actions
[LAPTEV ET AL. 2008]
The Datasets
Manual and Test Sets
◦ Manually annotated scripts
◦ Manually selected visually-correct action samples
Automatic Set
◦ Automatically annotated scripts
◦ Automatically selected action samples
◦ Alignment score a > 0.5
◦ Clip length < 1,000 frames
[LAPTEV ET AL. 2008]
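The two selection criteria for the automatic set amount to a simple filter. The sketch below assumes each candidate clip carries its alignment score and frame count; the `Clip` record and `select_automatic_samples` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    """Hypothetical record for one automatically retrieved action sample."""
    label: str
    alignment: float  # script-to-subtitle alignment score a
    num_frames: int   # clip length in frames


def select_automatic_samples(clips, min_alignment=0.5, max_frames=1000):
    """Keep clips with a > min_alignment and length < max_frames."""
    return [
        clip for clip in clips
        if clip.alignment > min_alignment and clip.num_frames < max_frames
    ]
```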
KTH Dataset
[LAPTEV ET AL. 2008]
Action Recognition
Sparse space-time features
◦ Compact representation
◦ Tolerant to background clutter, occlusions, and scale changes
Interest point detection – Harris operator
◦ Multiple levels of spatio-temporal scales
Interest Point Detection
[LAPTEV ET AL. 2008]
Features at the Interest Points
Histogram descriptors of space-time volumes around each interest point
◦ Volumes divided into (nx, ny, nt) grid of cuboids
◦ Compute histogram of oriented gradients (HoG) for each cuboid
◦ Compute histogram of optic flow (HoF) for each cuboid
[IKIZLER ET AL. 2008]
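The grid-of-cuboids descriptor can be illustrated with a small sketch. It assumes the gradient (or flow) vectors inside a volume have already been computed and are given as `(x, y, t, angle, magnitude)` tuples with coordinates normalized to [0, 1]; that input format, the default grid, and the bin count are assumptions for illustration, and `grid_histogram` is a hypothetical name.

```python
import math


def grid_histogram(samples, grid=(3, 3, 2), num_bins=4):
    """Build a concatenated orientation histogram over an (nx, ny, nt) grid.

    samples: iterable of (x, y, t, angle, magnitude), with x, y, t in [0, 1]
    and angle in radians. Each sample votes with its magnitude into the
    orientation bin of the cuboid it falls in. The result is L1-normalized.
    """
    nx, ny, nt = grid
    hist = [0.0] * (nx * ny * nt * num_bins)
    for x, y, t, angle, magnitude in samples:
        # Clamp so that a coordinate of exactly 1.0 lands in the last cuboid.
        cx = min(int(x * nx), nx - 1)
        cy = min(int(y * ny), ny - 1)
        ct = min(int(t * nt), nt - 1)
        fraction = (angle % (2 * math.pi)) / (2 * math.pi)
        orientation = int(fraction * num_bins) % num_bins
        cell = (ct * ny + cy) * nx + cx
        hist[cell * num_bins + orientation] += magnitude
    total = sum(hist)
    return [value / total for value in hist] if total > 0 else hist
```

The same routine serves for HoG (angles from image gradients) and HoF (angles from optic flow vectors); only the input samples differ.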
Spatio-Temporal Bag-of-Features
k-means with 4,000 clusters
Different grid sizes
Classify with non-linear SVM
[LAPTEV ET AL. 2008]
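The bag-of-features step can be sketched as follows: each descriptor is assigned to its nearest cluster center (visual word), and a video is represented by the normalized histogram of word counts. The slides only say "non-linear SVM"; the chi-square kernel shown here is an assumption (a common choice for histogram features), and all function names are hypothetical. The vocabulary would come from k-means, which is not shown.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Return the index of the closest cluster center (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for index, center in enumerate(vocabulary):
        dist = sum((d - c) ** 2 for d, c in zip(descriptor, center))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bof_histogram(descriptors, vocabulary):
    """L1-normalized histogram of visual-word occurrences for one video."""
    hist = [0.0] * len(vocabulary)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, vocabulary)] += 1.0
    total = sum(hist)
    return [value / total for value in hist] if total > 0 else hist


def chi2_distance(h1, h2):
    """Chi-square distance between two histograms of equal length."""
    return 0.5 * sum(
        (a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0
    )


def chi2_kernel(h1, h2, mean_distance=1.0):
    """Assumed kernel: exp(-D(h1, h2) / A), with A a normalization constant."""
    return math.exp(-chi2_distance(h1, h2) / mean_distance)
```

With a precomputed kernel matrix of `chi2_kernel` values, any SVM implementation that accepts custom kernels could be trained on the histograms.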
Evaluation of Spatio-Temporal Grids
[LAPTEV ET AL. 2008]
Evaluation of Spatio-Temporal Grids
[LAPTEV ET AL. 2008]
Comparison to the State-of-the-Art
KTH dataset divided into:
◦ Training/validation set (8+8 people)
◦ Test set (9 people)
Use best performing channel combination
[LAPTEV ET AL. 2008]
Confusion Matrix
[LAPTEV ET AL. 2008]
Noise in Training Data
[LAPTEV ET AL. 2008]
Results for Real-World Videos
[LAPTEV ET AL. 2008]
Examples
[LAPTEV ET AL. 2008]
Summary
Automatic annotation using movie scripts
Action recognition outperforms the state of the art on the KTH dataset
System tolerant to errors in training data
Future Work
Improve script-to-video alignment
Improve tolerance of classifier
◦ Iterative learning
Experiment with other space-time low-level features
Questions?
[LAPTEV ET AL. 2008]
References
Learning Realistic Human Actions from Movies. I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. CVPR 2008.
Human Action Recognition with Line and Flow Histograms. N. Ikizler, G. Cinbis, and P. Duygulu. ICPR 2008.