1
Recognizing Human-Object Interaction Activities
Bangpeng Yao, Aditya Khosla and Li Fei-Fei
Computer Science Department, Stanford University
{bangpeng,feifeili}@cs.stanford.edu
• Action Classification
• Action Retrieval
2
B. Yao and L. Fei-Fei. “Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities.” CVPR 2010.
B. Yao, A. Khosla, and L. Fei-Fei. “Classifying Actions and Measuring Action Similarity by Modeling the Mutual Context of Objects and Human Poses.” ICML 2011.
Visual Recognition
3
Visual Recognition
Focus on Humans
6
Human images are everywhere:
Why are humans important?
7
Why are humans important?
Top 3 most popular synsets in ImageNet:
Deng et al, 2009
http://www.image-net.org/
8
Human Action Recognition
9
Robots interact with objects
Automatic sports commentary
Security – Drunk people detection
Human Action Recognition
Human-Object Interaction
B. Yao and L. Fei-Fei. “Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities.” CVPR 2010.
B. Yao, A. Khosla, and L. Fei-Fei. “Classifying Actions and Measuring Action Similarity by Modeling the Mutual Context of Objects and Human Poses.” ICML 2011.
10
11
Outline
• Mutual context model for Action Recognition
  – Motivation
  – Model representation
  – Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion
12
Outline
• Mutual context model for Action Recognition
  – Motivation
  – Model representation
  – Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion
• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009
Difficult part appearance
Self-occlusion
Image region looks like a body part
Human pose estimation & Object detection
13
Human pose estimation is challenging.
Human pose estimation & Object detection
14
Human pose estimation is challenging.
• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009
Human pose estimation & Object detection
15
Object detection can facilitate human pose estimation, given that the object is detected.
• Viola & Jones, 2001
• Lampert et al, 2008
• Divvala et al, 2009
• Vedaldi et al, 2009
Small, low-resolution, partially occluded
Image region similar to detection target
Human pose estimation & Object detection
16
Object detection is challenging
Human pose estimation & Object detection
17
Object detection is challenging
• Viola & Jones, 2001
• Lampert et al, 2008
• Divvala et al, 2009
• Vedaldi et al, 2009
Human pose estimation & Object detection
18
Human pose estimation can facilitate object detection, given that the pose is estimated.
Human pose estimation & Object detection
19
Mutual Context
20
Outline
• Mutual context model for Action Recognition
  – Motivation
  – Model representation
  – Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion
Mutual Context Model Representation
21
Croquet shot
Volleyball smash
Tennis forehand
Activity classes:
[Graphical model: activity node A over image evidence I]
[Yao et al, 2011]
Mutual Context Model Representation
22
Human pose as layout of body parts.
[Graphical model: activity A, human pose H, body parts P1 … PL, image evidence I]
Mutual Context Model Representation
23
Volleyball smashing
Cricket bowling
Tennis forehand
Human pose as layout of body parts.
Atomic poses – pose dictionary.
Mutual Context Model Representation
24
List of objects:
Humans interact with any number of objects:
[Graphical model: activity A, objects O1 … OM, human pose H, body parts P1 … PL, image evidence I]
Mutual Context Model Representation
25
Mutual Context Model Representation
26
Conditional Random Field:
\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)
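As a rough illustration of how the five potentials combine, here is a minimal Python sketch (not the authors' code): the overall score of a candidate configuration is simply the sum of the five potential values. The potential functions themselves are passed in as hypothetical stand-ins.

```python
# Illustrative sketch (not the paper's implementation): score one candidate
# configuration (action a, object windows, atomic pose h) under the
# mutual-context CRF by summing its five potentials.

def mutual_context_score(a, objects, h, image, potentials):
    psi1, psi2, psi3, psi4, psi5 = potentials
    return (psi1(a, objects, h)     # action / object / pose compatibility
            + psi2(a, image)        # action-classifier evidence
            + psi3(objects, image)  # object-detector evidence and object layout
            + psi4(h, image)        # body-part detectors and pose prior
            + psi5(h, objects))     # pose / object spatial relations
```

Inference then amounts to searching for the configuration that maximizes this sum.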
Mutual Context Model Representation
27
Conditional Random Field:
\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)

Compatibility between actions, objects, and human poses:
\psi_1(A, O, H) = \sum_{m=1}^{M} \sum_{i=1}^{N_h} \sum_{j=1}^{N_o} \sum_{k=1}^{N_a} \gamma_{i,j,k}\, \mathbf{1}(H = h_i)\, \mathbf{1}(O_m = o_j)\, \mathbf{1}(A = a_k)
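Because the compatibility potential is built from indicator functions, evaluating it for concrete labels reduces to weight-table lookups summed over object windows. A hedged sketch, assuming a hypothetical N_h x N_o x N_a table `gamma` of learned compatibility weights:

```python
# Sketch: psi_1 for concrete labels. gamma[pose][obj][action] is an assumed
# table of learned weights; the indicators in the formula select exactly one
# entry per object window.

def psi_1(action, object_labels, pose, gamma):
    """Sum gamma[pose][o][action] over every detected object label o."""
    return sum(gamma[pose][o][action] for o in object_labels)
```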
Mutual Context Model Representation
28
Conditional Random Field:
\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)

Modeling actions:
\psi_2(A, I) = \sum_{k=1}^{N_a} \mathbf{1}(A = a_k)\, \alpha_k^{T} s(I)

where s(I) is the N_a-dimensional output of an action classifier.
Mutual Context Model Representation
29
Conditional Random Field:
\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)

Modeling objects:
\psi_3(O, I) = \sum_{m=1}^{M} \sum_{j=1}^{N_o} \mathbf{1}(O_m = o_j)\, \beta_j^{T} g(O_m) + \sum_{m=1}^{M} \sum_{m'=1}^{M} \sum_{j=1}^{N_o} \sum_{j'=1}^{N_o} \mathbf{1}(O_m = o_j)\, \mathbf{1}(O_{m'} = o_{j'})\, \beta_{j,j'}^{T}\, b(O_m, O_{m'})

where g(O_m) collects the object detection scores for window O_m, and b(O_m, O_{m'}) encodes the spatial relationship between two object windows.
Mutual Context Model Representation
30
Conditional Random Field:
\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)

Modeling human poses:
\psi_4(H, I) = \sum_{i=1}^{N_h} \mathbf{1}(H = h_i) \sum_{l=1}^{L} \left[ \alpha_{i,l}^{T}\, p(\mathbf{x}^{l} \mid h_i) + \eta_{i,l}^{T} f^{l}(I) \right]

where f^{l}(I) is the detection score of the l-th body part, and p(\mathbf{x}^{l} \mid h_i) is the location of the l-th body part under the prior of atomic pose h_i.
Mutual Context Model Representation
31
Conditional Random Field:
\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)

Modeling spatial relationships between human pose and objects:
\psi_5(H, O) = \sum_{m=1}^{M} \sum_{i=1}^{N_h} \sum_{j=1}^{N_o} \sum_{l=1}^{L} \mathbf{1}(H = h_i)\, \mathbf{1}(O_m = o_j)\, \eta_{i,j,l}^{T}\, b(\mathbf{x}^{l}, O_m)

where b(\mathbf{x}^{l}, O_m) encodes the spatial relationship between the l-th body part and the m-th object window.

[Yao et al, 2011]
32
Outline
• Mutual context model for Action Recognition
  – Motivation
  – Model representation
  – Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion
Mutual Context Model Learning
33
• Obtaining atomic poses
  – Annotating
  – Clustering
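The annotate-then-cluster step above can be sketched with a plain k-means over normalized body-part layout vectors. This is an illustrative stand-in: the paper's exact clustering procedure (distance measure, initialization, number of atomic poses) may differ.

```python
import random

def atomic_poses(pose_vectors, k, iters=20, seed=0):
    """Cluster annotated pose-layout vectors into k atomic poses
    (a minimal k-means sketch, not the paper's exact procedure)."""
    rng = random.Random(seed)
    centers = [tuple(v) for v in rng.sample(pose_vectors, k)]
    for _ in range(iters):
        # assign each pose vector to its nearest center (squared Euclidean)
        groups = [[] for _ in range(k)]
        for v in pose_vectors:
            j = min(range(k),
                    key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centers[c])))
            groups[j].append(v)
        # recompute each center as the mean of its assigned vectors
        for j, g in enumerate(groups):
            if g:
                centers[j] = tuple(sum(xs) / len(g) for xs in zip(*g))
    return centers
```

Each resulting center plays the role of one atomic pose in the dictionary.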
Mutual Context Model Learning
34
• Obtaining atomic poses
• Potentials
– Object & body part detection
One detector for each object or body part
Deformable part model [Felzenszwalb et al, 2008]
Mutual Context Model Learning
35
• Obtaining atomic poses
• Potentials
  – Object & body part detection
  – Action classification
Spatial pyramid model [Lazebnik et al, 2006]
Mutual Context Model Learning
36
• Obtaining atomic poses
• Potentials
  – Object & body part detection
  – Action classification
  – Spatial relationships
Bin function [Desai et al, 2009]
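A Desai-style bin function can be sketched as a sparse indicator vector over coarse spatial relations between two window centers. The particular bin set and the `near` threshold below are illustrative choices, not values taken from the paper.

```python
# Sketch of a spatial-relation bin function: map the displacement between two
# window centers (ax, ay) and (bx, by) to 0/1 indicators over coarse bins.
# The bins ("near/far", "above/below", "left/right") and the `near` radius
# are assumptions for illustration.

def spatial_bin(ax, ay, bx, by, near=50.0):
    dx, dy = bx - ax, by - ay
    v = {b: 0 for b in ("near", "far", "above", "below", "left", "right")}
    v["near" if (dx * dx + dy * dy) ** 0.5 <= near else "far"] = 1
    v["above" if dy < 0 else "below"] = 1
    v["left" if dx < 0 else "right"] = 1
    return v
```

In the model, such an indicator vector is dotted with learned weights to score a spatial configuration.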
Mutual Context Model Learning
37
• Obtaining atomic poses
• Potentials
  – Object & body part detection
  – Action classification
  – Spatial relationships
• Model parameter estimation
Standard conditional random field learning with belief propagation [Pearl, 1988]; the parameters are the weight vectors of the five potentials \psi_1, \ldots, \psi_5.
Model Learning Result
38
Activity classes:
Atomic poses:
Objects:
Model Learning Result
39
Activity classes:
Atomic poses:
Objects:
Tennis Serving
Model Learning Result
40
Activity classes:
Atomic poses:
Objects:
Tennis Serving
Volleyball Smash
41
Outline
• Mutual context model for Action Recognition
  – Motivation
  – Model representation
  – Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion
42
Model Inference for Pose Estimation, Object Detection, and Action Classification
• Initialization
• Iteratively optimize \Psi(A, O, H, I):
  – Updating the layout of human body parts
  – Updating the object detections
  – Updating the action and atomic pose labels
[Figure: initialization with a separate model for each action a_1, …, a_{N_a}]
43
Model Inference for Pose Estimation, Object Detection, and Action Classification
• Initialization
• Iteratively optimize \Psi(A, O, H, I):
  – Updating the layout of human body parts
Example atomic-pose weights: p(H = h_1) = 0.51, p(H = h_2) = 0.06, p(H = h_3) = 0.04
Mixture model
Re-estimate human pose
[Felzenszwalb et al, 2005][Sapp et al, 2010]
44
Model Inference for Pose Estimation, Object Detection, and Action Classification
• Initialization
• Iteratively optimize \Psi(A, O, H, I):
  – Updating the layout of human body parts
  – Updating the object detections
Start from no objects in the image; then evaluate, for each detection window separately, its contribution to increasing \Psi(A, O, H, I).
45
Model Inference for Pose Estimation, Object Detection, and Action Classification
• Initialization
• Iteratively optimize \Psi(A, O, H, I):
  – Updating the layout of human body parts
  – Updating the object detections
  – Updating the action and atomic pose labels
Enumerate all possible A and H values to maximize \Psi(A, O, H, I).
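The three alternating updates above amount to coordinate ascent on the model score. A minimal sketch, with hypothetical `update_*` callbacks standing in for the pose, object, and label updates (the paper's actual updates are the ones described on these slides):

```python
# Coordinate-ascent sketch of the inference loop. `score` evaluates
# Psi(A, O, H, I); the update_* callbacks are assumed stand-ins for the
# three updates described above.

def infer(image, score, update_pose, update_objects, update_labels,
          init, max_iters=10, tol=1e-6):
    a, objs, h = init
    best = score(a, objs, h, image)
    for _ in range(max_iters):
        h = update_pose(a, objs, image)       # re-estimate body-part layout
        objs = update_objects(a, h, image)    # keep windows that raise the score
        a, h = update_labels(objs, h, image)  # enumerate action / atomic-pose labels
        cur = score(a, objs, h, image)
        if cur - best <= tol:                 # stop when the score no longer improves
            break
        best = cur
    return a, objs, h, best
```

Each sub-update only needs to improve the overall score, so the loop terminates once no update helps.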
46
[Gupta et al, 2009]
Cricket batting Cricket bowling Croquet shot
Tennis forehand Tennis serve Volleyball smash
Sport data set: 6 classes, 180 training (supervised with object and body part locations) & 120 testing images
Action Classification Experiment
47
Action Classification Results
[Bar chart: classification accuracy (y-axis 0.5 to 1.0) per class for cricket batting, cricket bowling, croquet shot, tennis forehand, tennis serving, and volleyball smash, plus overall, comparing Lazebnik et al. (2006), Yao & Fei-Fei (2010), and our method (Yao et al., 2011). Overall accuracy: 83% for Yao & Fei-Fei (2010) vs. 87% for our method.]
48
[Gupta et al, 2009]
Cricket batting Cricket bowling Croquet shot
Tennis forehand Tennis serve Volleyball smash
Object Detection and Pose Estimation
Sport data set: 6 classes, 180 training (supervised with object and body part locations) & 120 testing images
49
Object Detection Results
                 Felzenszwalb     Desai et al.   Yao et al.
                 et al. (2010)    (2009)         (2011)
cricket bat      .17              .18            .20
cricket ball     .24              .27            .32
cricket stump    .77              .78            .77
croquet mallet   .29              .32            .34
croquet ball     .50              .52            .58
croquet hoop     .15              .17            .22
tennis racket    .33              .31            .37
tennis ball      .42              .46            .49
volleyball       .64              .65            .67
volleyball net   .04              .06            .09
overall          .36              .37            .41
52
Human Pose Estimation Results
                        Yao & Fei-Fei   Andriluka       Yao et al.
                        (2010)          et al. (2009)   (2011)
head                    .58             .71             .76
torso                   .66             .69             .77
left/right upper arms   .44             .44             .52
                        .40             .40             .45
left/right lower arms   .27             .35             .39
                        .29             .36             .37
left/right upper legs   .43             .58             .63
                        .39             .63             .61
left/right lower legs   .44             .59             .60
                        .34             .71             .77
overall                 .42             .55             .59
54
Outline
• Mutual context model for Action Recognition
  – Motivation
  – Model representation
  – Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion
Action Recognition as Classification
Cricket batting
Tennis Forehand
Volleyball Smashing
Playing Bassoon
Playing Guitar
Playing Erhu
Running
Gupta et al (2009)
Yao & Fei-Fei (2010)
PASCAL VOC (2010)
Reading
Ikizler-Cinbis et al, 2009
Desai et al, 2010
Yang et al, 2010
Delaitre et al, 2011
Maji et al, 2011
55
Is Classification the End?
stand run
Actions in a continuous space
56
Is Classification the End?
Same action, different meanings
57
Is Classification the End?
More than one action at the same time
Shopping
Calling
58
59
Retrieval Instead of Classification
Retrieval as Similarity Ranking
60
Ref.
Retrieval as Similarity Ranking
Decreasing similarity value
61
Retrieval as Similarity Ranking
Ref.
Decreasing similarity value
62
Ref.
Retrieval as Similarity Ranking
• Challenges:
  – How to obtain the ground truth?
  – How to perform automatic retrieval?
  – How to evaluate a retrieval system?
Decreasing similarity value
63
Action Retrieval: Obtaining Ground Truth
• Human annotation experiment:
  – Eight human subjects, the same set of 252 trials.
One trial:
Comparison images
Reference image
64
• Human annotation experiment:
  – Eight human subjects, the same set of 252 trials.
One trial:
Reference image
Comparison images
Reference image
Action Retrieval: Obtaining Ground Truth
65
• Human annotation experiment:
  – Eight human subjects, the same set of 252 trials.
One trial:
Reference image
Comparison images
Action Retrieval: Obtaining Ground Truth
66
• Human annotation experiment:
  – Eight human subjects, the same set of 252 trials.
One trial:
Reference image
Comparison images
Action Retrieval: Obtaining Ground Truth
67
• Human annotation experiment:
  – Eight human subjects, the same set of 252 trials.
[Histogram: percentage of trials (0 to 0.6) vs. degree of consistency of the eight annotators, from unanimous (8:0) to evenly split (4:4).]
Action Retrieval: Obtaining Ground Truth
68
• From pairwise annotation to overall similarity:
From the pairwise human annotations for a reference image (per-trial votes such as -1 0 1 0 0 0 1 0 0 -1), estimate a similarity vector \mathbf{s} = (s_1, \ldots, s_N) with entries \mathrm{Sim}(\text{Ref}, \cdot), subject to \mathbf{s} \geq 0 and \|\mathbf{s}\|_2 = 1.
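One simple way to turn such pairwise votes into a single similarity vector is win counting followed by L2 normalization, which keeps s >= 0 and unit norm. This Borda-style scheme is a deliberate simplification of the constrained estimation sketched on the slide:

```python
# Sketch: aggregate pairwise "image i was judged more similar than image j"
# votes into one similarity vector over n comparison images. Win counting is
# an assumed simplification of the slide's constrained estimation.

def similarity_from_pairwise(n, votes):
    wins = [0.0] * n
    for i, j in votes:          # each vote: image i beat image j
        wins[i] += 1.0
    norm = sum(w * w for w in wins) ** 0.5
    return [w / norm for w in wins] if norm else wins
```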
70
Ref.
Resulting similarity values (decreasing): 0.260, 0.227, 0.145, 0.135, 0.112, 0.085, 0.075, 0.041, 0.012, 0.006, 0.002, 0.000, 0.000, 0.000, 0.000
• From pairwise annotation to overall similarity:
Action Retrieval: Obtaining Ground Truth
Action Retrieval: Our Approach
[Example rankings by each cue: action class, human pose, object]
71
• Distance between two images I and I', for each cue:
D\big(p(A \mid I), p(A \mid I')\big), \quad D\big(p(H \mid I), p(H \mid I')\big), \quad D\big(p(O \mid I), p(O \mid I')\big)
where D is either the total variation or the chi-square distance:
Total variation: D_T(p, q) = \sum_i |p_i - q_i|
Chi-square: D_{\chi^2}(p, q) = \sum_i \frac{(p_i - q_i)^2}{p_i + q_i}
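The two distances, as defined on this slide, can be written directly; a tiny guard term is added here (my addition, not the slide's) so empty histogram bins do not divide by zero:

```python
# The slide's two histogram distances over distributions p and q.

def total_variation(p, q):
    """D_T(p, q) = sum_i |p_i - q_i|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def chi_square(p, q, eps=1e-12):
    """D_chi2(p, q) = sum_i (p_i - q_i)^2 / (p_i + q_i); eps guards empty bins."""
    return sum((pi - qi) ** 2 / (pi + qi + eps) for pi, qi in zip(p, q))
```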
72
Action Retrieval: Our Approach
73
Action Retrieval: Evaluation Metric
Ref.
• Ranking from an algorithm: I^{re}_1, I^{re}_2, \ldots, I^{re}_n
• Ranking by ground-truth similarity: I^{gt}_1, I^{gt}_2, \ldots, I^{gt}_n
Accuracy at n: \mathrm{Acc}(n) = \frac{\sum_{i=1}^{n} s(I^{re}_i, I^{ref})}{\sum_{i=1}^{n} s(I^{gt}_i, I^{ref})}, where s is the ground-truth similarity and I^{ref} is the reference image.
[Plot: accuracy vs. number of neighborhoods]
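The evaluation metric above can be sketched as the ground-truth similarity mass captured by the algorithm's top-n retrievals, normalized by the best achievable mass under the ground-truth ranking. The names `ranked` and `gt_sim` below are hypothetical:

```python
# Sketch of the retrieval accuracy-at-n metric: `ranked` is the algorithm's
# ordering of image ids, `gt_sim` maps image id -> ground-truth similarity
# to the reference image.

def retrieval_precision(ranked, gt_sim, n):
    ideal = sorted(gt_sim.values(), reverse=True)   # ground-truth ranking
    got = sum(gt_sim[i] for i in ranked[:n])        # mass the algorithm captured
    best = sum(ideal[:n])                           # best achievable mass
    return got / best if best else 0.0
```

The metric is 1.0 whenever the algorithm's top n contains the same similarity mass as the ground-truth top n.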
74
Action Retrieval: Result
MC: Mutual Context
[Plot: average precision (0.5 to 0.8) vs. number of retrieved images for MC overall, MC action only, MC object only, and MC pose only, each with chi-square and total-variation distances, against the SPM baseline.]
75
Action Retrieval: Result
MC: Mutual Context
SPM: spatial pyramid matching (Lazebnik et al, 2006)
• Use the confidence scores of SPM output to evaluate the action similarity.
79
Outline
• Mutual context model for Action Recognition
  – Motivation
  – Model representation
  – Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion
80
Conclusion
Human action as human-object interaction:
• Action classification:
• Matching action similarity:
Croquet shot
Tennis forehand
Cricket bowling
81
Acknowledgment
• Stanford Vision Lab reviewers:
  – Jia Deng
  – Jia Li