1
Recognizing Human-Object Interaction Activities
Bangpeng Yao, Aditya Khosla and Li Fei-Fei
Computer Science Department, Stanford University
{bangpeng,feifeili}@cs.stanford.edu
• Action Classification
• Action Retrieval
2
B. Yao and L. Fei-Fei. “Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities.” CVPR 2010.
B. Yao, A. Khosla, and L. Fei-Fei. “Classifying Actions and Measuring Action Similarity by Modeling the Mutual Context of Objects and Human Poses.” ICML 2011.
Visual Recognition
3
Visual Recognition
Focus on Humans
6
Human images are everywhere:
Why are humans important?
7
Why are humans important?
Top 3 most popular synsets in ImageNet:
Deng et al, 2009
http://www.image-net.org/
8
Human Action Recognition
9
Robots interact with objects
Automatic sports commentary
Security – Drunk people detection
Human Action Recognition
Human-Object Interaction
B. Yao and L. Fei-Fei. “Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities.” CVPR 2010.
B. Yao, A. Khosla, and L. Fei-Fei. “Classifying Actions and Measuring Action Similarity by Modeling the Mutual Context of Objects and Human Poses.” ICML 2011.
10
11
Outline
• Mutual context model for Action Recognition
  – Motivation
  – Model representation
  – Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion
12
Outline
• Mutual context model for Action Recognition
  – Motivation
  – Model representation
  – Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion
• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009
Difficult part appearance
Self-occlusion
Image region looks like a body part
Human pose estimation & Object detection
13
Human pose estimation is challenging.
Human pose estimation & Object detection
14
Human pose estimation is challenging.
• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009
Human pose estimation & Object detection
15
Object detection can facilitate human pose estimation, given that the object is detected.
• Viola & Jones, 2001
• Lampert et al, 2008
• Divvala et al, 2009
• Vedaldi et al, 2009
Small, low-resolution, partially occluded
Image region similar to detection target
Human pose estimation & Object detection
16
Object detection is challenging
Human pose estimation & Object detection
17
Object detection is challenging
• Viola & Jones, 2001
• Lampert et al, 2008
• Divvala et al, 2009
• Vedaldi et al, 2009
Human pose estimation & Object detection
18
Human pose estimation can facilitate object detection, given that the pose is estimated.
Human pose estimation & Object detection
19
Mutual Context
20
Outline
• Mutual context model for Action Recognition
  – Motivation
  – Model representation
  – Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion
Mutual Context Model Representation
21
Croquet shot
Volleyball smash
Tennis forehand
Activity classes:
[Graphical model: activity node A over image evidence I]
[Yao et al, 2011]
Mutual Context Model Representation
22
Human pose as layout of body parts.
[Graphical model: activity A, human pose H, body parts P1 … PL, image evidence I]
Mutual Context Model Representation
23
Volleyball smashing
Cricket bowling
Tennis forehand
Human pose as layout of body parts.
Atomic poses – pose dictionary.
Mutual Context Model Representation
24
List of objects:
Humans interact with any number of objects:
[Graphical model: activity A, objects O1 … OM, human pose H, body parts P1 … PL, image evidence I]
Mutual Context Model Representation
25
Mutual Context Model Representation
26
Conditional Random Field:
\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)
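As a rough illustration of how the five potentials combine, here is a minimal Python sketch (not the authors' code): the overall score of a candidate configuration is simply the sum of the five potential values. The potential functions themselves are passed in as hypothetical stand-ins.

```python
# Illustrative sketch (not the paper's implementation): score one candidate
# configuration (action a, object windows, atomic pose h) under the
# mutual-context CRF by summing its five potentials.

def mutual_context_score(a, objects, h, image, potentials):
    psi1, psi2, psi3, psi4, psi5 = potentials
    return (psi1(a, objects, h)     # action / object / pose compatibility
            + psi2(a, image)        # action-classifier evidence
            + psi3(objects, image)  # object-detector evidence and object layout
            + psi4(h, image)        # body-part detectors and pose prior
            + psi5(h, objects))     # pose / object spatial relations
```

Inference then amounts to searching for the configuration that maximizes this sum.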
Mutual Context Model Representation
27
Conditional Random Field:
\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)

Compatibility between actions, objects, and human poses:
\psi_1(A, O, H) = \sum_{m=1}^{M} \sum_{i=1}^{N_h} \sum_{j=1}^{N_o} \sum_{k=1}^{N_a} \gamma_{i,j,k}\, \mathbf{1}(H = h_i)\, \mathbf{1}(O_m = o_j)\, \mathbf{1}(A = a_k)
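Because the compatibility potential is built from indicator functions, evaluating it for concrete labels reduces to weight-table lookups summed over object windows. A hedged sketch, assuming a hypothetical N_h x N_o x N_a table `gamma` of learned compatibility weights:

```python
# Sketch: psi_1 for concrete labels. gamma[pose][obj][action] is an assumed
# table of learned weights; the indicators in the formula select exactly one
# entry per object window.

def psi_1(action, object_labels, pose, gamma):
    """Sum gamma[pose][o][action] over every detected object label o."""
    return sum(gamma[pose][o][action] for o in object_labels)
```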
Mutual Context Model Representation
28
Conditional Random Field:
\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)

Modeling actions:
\psi_2(A, I) = \sum_{k=1}^{N_a} \mathbf{1}(A = a_k)\, \alpha_k^{T} s(I)

where s(I) is the N_a-dimensional output of an action classifier.
Mutual Context Model Representation
29
Conditional Random Field:
\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)

Modeling objects:
\psi_3(O, I) = \sum_{m=1}^{M} \sum_{j=1}^{N_o} \mathbf{1}(O_m = o_j)\, \beta_j^{T} g(O_m) + \sum_{m=1}^{M} \sum_{m'=1}^{M} \sum_{j=1}^{N_o} \sum_{j'=1}^{N_o} \mathbf{1}(O_m = o_j)\, \mathbf{1}(O_{m'} = o_{j'})\, \beta_{j,j'}^{T}\, b(O_m, O_{m'})

where g(O_m) collects the object detection scores for window O_m, and b(O_m, O_{m'}) encodes the spatial relationship between two object windows.
Mutual Context Model Representation
30
Conditional Random Field:
\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)

Modeling human poses:
\psi_4(H, I) = \sum_{i=1}^{N_h} \mathbf{1}(H = h_i) \sum_{l=1}^{L} \left[ \alpha_{i,l}^{T}\, p(\mathbf{x}^{l} \mid h_i) + \eta_{i,l}^{T} f^{l}(I) \right]

where f^{l}(I) is the detection score of the l-th body part, and p(\mathbf{x}^{l} \mid h_i) is the location of the l-th body part under the prior of atomic pose h_i.
Mutual Context Model Representation
31
Conditional Random Field:
\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)

Modeling spatial relationships between human pose and objects:
\psi_5(H, O) = \sum_{m=1}^{M} \sum_{i=1}^{N_h} \sum_{j=1}^{N_o} \sum_{l=1}^{L} \mathbf{1}(H = h_i)\, \mathbf{1}(O_m = o_j)\, \eta_{i,j,l}^{T}\, b(\mathbf{x}^{l}, O_m)

where b(\mathbf{x}^{l}, O_m) encodes the spatial relationship between the l-th body part and the m-th object window.

[Yao et al, 2011]
32
Outline
• Mutual context model for Action Recognition
  – Motivation
  – Model representation
  – Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion
Mutual Context Model Learning
33
• Obtaining atomic poses
  – Annotating
  – Clustering
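The annotate-then-cluster step above can be sketched with a plain k-means over normalized body-part layout vectors. This is an illustrative stand-in: the paper's exact clustering procedure (distance measure, initialization, number of atomic poses) may differ.

```python
import random

def atomic_poses(pose_vectors, k, iters=20, seed=0):
    """Cluster annotated pose-layout vectors into k atomic poses
    (a minimal k-means sketch, not the paper's exact procedure)."""
    rng = random.Random(seed)
    centers = [tuple(v) for v in rng.sample(pose_vectors, k)]
    for _ in range(iters):
        # assign each pose vector to its nearest center (squared Euclidean)
        groups = [[] for _ in range(k)]
        for v in pose_vectors:
            j = min(range(k),
                    key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centers[c])))
            groups[j].append(v)
        # recompute each center as the mean of its assigned vectors
        for j, g in enumerate(groups):
            if g:
                centers[j] = tuple(sum(xs) / len(g) for xs in zip(*g))
    return centers
```

Each resulting center plays the role of one atomic pose in the dictionary.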
Mutual Context Model Learning
34
• Obtaining atomic poses
• Potentials
– Object & body part detection
One detector for each object or body part
Deformable part model [Felzenszwalb et al, 2008]
Mutual Context Model Learning
35
• Obtaining atomic poses
• Potentials
  – Object & body part detection
  – Action classification
Spatial pyramid model [Lazebnik et al, 2006]
Mutual Context Model Learning
36
• Obtaining atomic poses
• Potentials
  – Object & body part detection
  – Action classification
  – Spatial relationships
Bin function [Desai et al, 2009]
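A Desai-style bin function can be sketched as a sparse indicator vector over coarse spatial relations between two window centers. The particular bin set and the `near` threshold below are illustrative choices, not values taken from the paper.

```python
# Sketch of a spatial-relation bin function: map the displacement between two
# window centers (ax, ay) and (bx, by) to 0/1 indicators over coarse bins.
# The bins ("near/far", "above/below", "left/right") and the `near` radius
# are assumptions for illustration.

def spatial_bin(ax, ay, bx, by, near=50.0):
    dx, dy = bx - ax, by - ay
    v = {b: 0 for b in ("near", "far", "above", "below", "left", "right")}
    v["near" if (dx * dx + dy * dy) ** 0.5 <= near else "far"] = 1
    v["above" if dy < 0 else "below"] = 1
    v["left" if dx < 0 else "right"] = 1
    return v
```

In the model, such an indicator vector is dotted with learned weights to score a spatial configuration.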
Mutual Context Model Learning
37
• Obtaining atomic poses
• Potentials
  – Object & body part detection
  – Action classification
  – Spatial relationships
• Model parameter estimation
Standard conditional random field learning with belief propagation [Pearl, 1988]; the parameters are the weight vectors of the five potentials \psi_1, \ldots, \psi_5.
Model Learning Result
38
Activity classes:
Atomic poses:
Objects:
Model Learning Result
39
Activity classes:
Atomic poses:
Objects:
Tennis Serving
Model Learning Result
40
Activity classes:
Atomic poses:
Objects:
Tennis Serving
Volleyball Smash
41
Outline
• Mutual context model for Action Recognition
  – Motivation
  – Model representation
  – Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion
42
Model Inference for Pose Estimation, Object Detection, and Action Classification
• Initialization
• Iteratively optimize \Psi(A, O, H, I):
  – Updating the layout of human body parts
  – Updating the object detections
  – Updating the action and atomic pose labels
[Figure: initialization with a separate model for each action a_1, …, a_{N_a}]
43
Model Inference for Pose Estimation, Object Detection, and Action Classification
• Initialization
• Iteratively optimize \Psi(A, O, H, I):
  – Updating the layout of human body parts
Example atomic-pose weights: p(H = h_1) = 0.51, p(H = h_2) = 0.06, p(H = h_3) = 0.04
Mixture model
Re-estimate human pose
[Felzenszwalb et al, 2005][Sapp et al, 2010]
44
Model Inference for Pose Estimation, Object Detection, and Action Classification
• Initialization
• Iteratively optimize \Psi(A, O, H, I):
  – Updating the layout of human body parts
  – Updating the object detections
Start from no objects in the image; then evaluate, for each detection window separately, its contribution to increasing \Psi(A, O, H, I).
45
Model Inference for Pose Estimation, Object Detection, and Action Classification
• Initialization
• Iteratively optimize \Psi(A, O, H, I):
  – Updating the layout of human body parts
  – Updating the object detections
  – Updating the action and atomic pose labels
Enumerate all possible A and H values to maximize \Psi(A, O, H, I).
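The three alternating updates above amount to coordinate ascent on the model score. A minimal sketch, with hypothetical `update_*` callbacks standing in for the pose, object, and label updates (the paper's actual updates are the ones described on these slides):

```python
# Coordinate-ascent sketch of the inference loop. `score` evaluates
# Psi(A, O, H, I); the update_* callbacks are assumed stand-ins for the
# three updates described above.

def infer(image, score, update_pose, update_objects, update_labels,
          init, max_iters=10, tol=1e-6):
    a, objs, h = init
    best = score(a, objs, h, image)
    for _ in range(max_iters):
        h = update_pose(a, objs, image)       # re-estimate body-part layout
        objs = update_objects(a, h, image)    # keep windows that raise the score
        a, h = update_labels(objs, h, image)  # enumerate action / atomic-pose labels
        cur = score(a, objs, h, image)
        if cur - best <= tol:                 # stop when the score no longer improves
            break
        best = cur
    return a, objs, h, best
```

Each sub-update only needs to improve the overall score, so the loop terminates once no update helps.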
46
[Gupta et al, 2009]
Cricket batting Cricket bowling Croquet shot
Tennis forehand Tennis serve Volleyball smash
Sport data set: 6 classes, 180 training (supervised with object and body part locations) & 120 testing images
Action Classification Experiment
47
Action Classification Results
[Bar chart: classification accuracy (y-axis 0.5 to 1.0) per class for cricket batting, cricket bowling, croquet shot, tennis forehand, tennis serving, and volleyball smash, plus overall, comparing Lazebnik et al. (2006), Yao & Fei-Fei (2010), and our method (Yao et al., 2011). Overall accuracy: 83% for Yao & Fei-Fei (2010) vs. 87% for our method.]
48
[Gupta et al, 2009]
Cricket batting Cricket bowling Croquet shot
Tennis forehand Tennis serve Volleyball smash
Object Detection and Pose Estimation
Sport data set: 6 classes, 180 training (supervised with object and body part locations) & 120 testing images
49
Object Detection Results
                 Felzenszwalb     Desai et al.   Yao et al.
                 et al. (2010)    (2009)         (2011)
cricket bat      .17              .18            .20
cricket ball     .24              .27            .32
cricket stump    .77              .78            .77
croquet mallet   .29              .32            .34
croquet ball     .50              .52            .58
croquet hoop     .15              .17            .22
tennis racket    .33              .31            .37
tennis ball      .42              .46            .49
volleyball       .64              .65            .67
volleyball net   .04              .06            .09
overall          .36              .37            .41
52
Human Pose Estimation Results
                        Yao & Fei-Fei   Andriluka       Yao et al.
                        (2010)          et al. (2009)   (2011)
head                    .58             .71             .76
torso                   .66             .69             .77
left/right upper arms   .44             .44             .52
                        .40             .40             .45
left/right lower arms   .27             .35             .39
                        .29             .36             .37
left/right upper legs   .43             .58             .63
                        .39             .63             .61
left/right lower legs   .44             .59             .60
                        .34             .71             .77
overall                 .42             .55             .59
54
Outline
• Mutual context model for Action Recognition
  – Motivation
  – Model representation
  – Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion
Action Recognition as Classification
Cricket batting
Tennis Forehand
Volleyball Smashing
Playing Bassoon
Playing Guitar
Playing Erhu
Running
Gupta et al (2009)
Yao & Fei-Fei (2010)
PASCAL VOC (2010)
Reading
Ikizler-Cinbis et al, 2009
Desai et al, 2010
Yang et al, 2010
Delaitre et al, 2011
Maji et al, 2011
55
Is Classification the End?
stand run
Actions in a continuous space
56
Is Classification the End?
Same action, different meanings
57
Is Classification the End?
More than one action at the same time
Shopping
Calling
58
59
Retrieval Instead of Classification
Retrieval as Similarity Ranking
60
Ref.
Retrieval as Similarity Ranking
Decreasing similarity value
61
Retrieval as Similarity Ranking
Ref.
Decreasing similarity value
62
Ref.
Retrieval as Similarity Ranking
• Challenges:
  – How to obtain the ground truth?
  – How to perform automatic retrieval?
  – How to evaluate a retrieval system?
Decreasing similarity value
63
Action Retrieval: Obtaining Ground Truth
• Human annotation experiment:
  – Eight human subjects, the same set of 252 trials.
One trial:
Comparison images
Reference image
64
• Human annotation experiment:
  – Eight human subjects, the same set of 252 trials.
One trial:
Reference image
Comparison images
Reference image
Action Retrieval: Obtaining Ground Truth
65
• Human annotation experiment:
  – Eight human subjects, the same set of 252 trials.
One trial:
Reference image
Comparison images
Action Retrieval: Obtaining Ground Truth
66
• Human annotation experiment:
  – Eight human subjects, the same set of 252 trials.
One trial:
Reference image
Comparison images
Action Retrieval: Obtaining Ground Truth
67
• Human annotation experiment:
  – Eight human subjects, the same set of 252 trials.
[Histogram: percentage of trials (0 to 0.6) vs. degree of consistency of the eight annotators, from unanimous (8:0) to evenly split (4:4).]
Action Retrieval: Obtaining Ground Truth
68
• From pairwise annotation to overall similarity:
From the pairwise human annotations for a reference image (per-trial votes such as -1 0 1 0 0 0 1 0 0 -1), estimate a similarity vector \mathbf{s} = (s_1, \ldots, s_N) with entries \mathrm{Sim}(\text{Ref}, \cdot), subject to \mathbf{s} \geq 0 and \|\mathbf{s}\|_2 = 1.
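One simple way to turn such pairwise votes into a single similarity vector is win counting followed by L2 normalization, which keeps s >= 0 and unit norm. This Borda-style scheme is a deliberate simplification of the constrained estimation sketched on the slide:

```python
# Sketch: aggregate pairwise "image i was judged more similar than image j"
# votes into one similarity vector over n comparison images. Win counting is
# an assumed simplification of the slide's constrained estimation.

def similarity_from_pairwise(n, votes):
    wins = [0.0] * n
    for i, j in votes:          # each vote: image i beat image j
        wins[i] += 1.0
    norm = sum(w * w for w in wins) ** 0.5
    return [w / norm for w in wins] if norm else wins
```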
70
Ref.
Resulting similarity values (decreasing): 0.260, 0.227, 0.145, 0.135, 0.112, 0.085, 0.075, 0.041, 0.012, 0.006, 0.002, 0.000, 0.000, 0.000, 0.000
• From pairwise annotation to overall similarity:
Action Retrieval: Obtaining Ground Truth
Action Retrieval: Our Approach
[Example rankings by each cue: action class, human pose, object]
71
• Distance between two images I and I', for each cue:
D\big(p(A \mid I), p(A \mid I')\big), \quad D\big(p(H \mid I), p(H \mid I')\big), \quad D\big(p(O \mid I), p(O \mid I')\big)
where D is either the total variation or the chi-square distance:
Total variation: D_T(p, q) = \sum_i |p_i - q_i|
Chi-square: D_{\chi^2}(p, q) = \sum_i \frac{(p_i - q_i)^2}{p_i + q_i}
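The two distances, as defined on this slide, can be written directly; a tiny guard term is added here (my addition, not the slide's) so empty histogram bins do not divide by zero:

```python
# The slide's two histogram distances over distributions p and q.

def total_variation(p, q):
    """D_T(p, q) = sum_i |p_i - q_i|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def chi_square(p, q, eps=1e-12):
    """D_chi2(p, q) = sum_i (p_i - q_i)^2 / (p_i + q_i); eps guards empty bins."""
    return sum((pi - qi) ** 2 / (pi + qi + eps) for pi, qi in zip(p, q))
```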
72
Action Retrieval: Our Approach
73
Action Retrieval: Evaluation Metric
Ref.
• Ranking from an algorithm: I^{re}_1, I^{re}_2, \ldots, I^{re}_n
• Ranking by ground-truth similarity: I^{gt}_1, I^{gt}_2, \ldots, I^{gt}_n
Accuracy at n: \mathrm{Acc}(n) = \frac{\sum_{i=1}^{n} s(I^{re}_i, I^{ref})}{\sum_{i=1}^{n} s(I^{gt}_i, I^{ref})}, where s is the ground-truth similarity and I^{ref} is the reference image.
[Plot: accuracy vs. number of neighborhoods]
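The evaluation metric above can be sketched as the ground-truth similarity mass captured by the algorithm's top-n retrievals, normalized by the best achievable mass under the ground-truth ranking. The names `ranked` and `gt_sim` below are hypothetical:

```python
# Sketch of the retrieval accuracy-at-n metric: `ranked` is the algorithm's
# ordering of image ids, `gt_sim` maps image id -> ground-truth similarity
# to the reference image.

def retrieval_precision(ranked, gt_sim, n):
    ideal = sorted(gt_sim.values(), reverse=True)   # ground-truth ranking
    got = sum(gt_sim[i] for i in ranked[:n])        # mass the algorithm captured
    best = sum(ideal[:n])                           # best achievable mass
    return got / best if best else 0.0
```

The metric is 1.0 whenever the algorithm's top n contains the same similarity mass as the ground-truth top n.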
74
Action Retrieval: Result
MC: Mutual Context
[Plot: average precision (0.5 to 0.8) vs. number of retrieved images for MC overall, MC action only, MC object only, and MC pose only, each with chi-square and total-variation distances, against the SPM baseline.]
75
Action Retrieval: Result
MC: Mutual Context
SPM: spatial pyramid matching (Lazebnik et al, 2006)
• Use the confidence scores of SPM output to evaluate the action similarity.
79
Outline
• Mutual context model for Action Recognition
  – Motivation
  – Model representation
  – Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion
80
Conclusion
Human action as human-object interaction:
• Action classification:
• Matching action similarity:
Croquet shot
Tennis forehand
Cricket bowling
81
Acknowledgment
• Stanford Vision Lab reviewers:
  – Jia Deng
  – Jia Li