Source: clopinet.com/isabelle/Projects/CVPR2011/slides/YiannisPatras.pdf

CVPR 2011, Ioannis Patras

Localisation and Recognition of Human Actions

Ioannis Patras

School of Electronic Engineering and Computer Science

Queen Mary University of London

In collaboration with A. Oikonomopoulos and M. Pantic, Imperial College London,

and I. Kotsia and Guo Weiwei, Queen Mary University of London


Related research in QMUL

• Scene analysis (Izquierdo, Diplaros)

Object Detection/ Semantic segmentation

• Motion Analysis (Lagendijk, Hendriks, Hancock)

Motion estimation / segmentation

Object Tracking

• Facial (Expression) Analysis (Pantic, Koelstra, Rudovic)

Head tracking/Facial Feature Tracking

Facial expression recognition

• Action / Gesture Recognition (Kotsia, Guo, Kumar, Pantic)

Spatio-temporal representations for action recognition

Pose estimation

• Brain Computer Interfaces

[Diagram labels: Static Analysis, Dynamic Vision, Looking at / sensing people]

URL: www.eecs.qmul.ac.uk/~ioannisp/


Looking at/sensing people

• Facial (Expression) Analysis

Head tracking/Facial Feature Tracking

Facial expression recognition

• Action / Gesture Recognition

Action recognition and localisation

Pose estimation

Tensor-based space-time analysis

• Brain Computer Interfaces


Localisation of Human Actions

Oikonomopoulos, Patras, Pantic, IEEE Transactions on Image Processing, Mar. 2011.

Goal:

Recognize categories of actions

Localize them in terms of their bounding box (space + time)

Challenges:

Occlusions, clutter, variations, …

Hypothesis: Analysis can be restricted to a set of spatiotemporally 'interesting'/salient events


Information theoretical spatial saliency

T. Kadir and M. Brady, IJCV, Nov. 2001

Proposal: Use signal unpredictability as an indicator of saliency

[Figure: two example image regions, with entropy H_D = 3.866 and H_D = 7.201]

Spatial Saliency: Unpredictability in a single frame
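The idea of using unpredictability as a saliency cue can be illustrated with a minimal sketch: the Shannon entropy of the grey-level histogram inside a circular region. The function name `local_entropy`, the 16-bin quantisation and the toy images are assumptions made for illustration, not details from the slides.

```python
import math
from collections import Counter

def local_entropy(image, cx, cy, radius, bins=16, max_value=256):
    """Shannon entropy (in bits) of the grey-level histogram inside a disc.

    image is a list of rows of integer grey levels in [0, max_value).
    """
    counts = Counter()
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                counts[value * bins // max_value] += 1  # quantise to a bin
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Toy data (hypothetical): a uniform patch and a highly varied patch.
flat = [[100] * 9 for _ in range(9)]
busy = [[(37 * x + 91 * y) % 256 for x in range(9)] for y in range(9)]

print(local_entropy(flat, 4, 4, 3))   # predictable region: zero entropy
print(local_entropy(busy, 4, 4, 3))   # unpredictable region: higher entropy
```

A uniform region is fully predictable and scores zero, while a region with many different grey levels scores higher, which is the behaviour the saliency measure relies on.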


Towards scale invariance

[Plot: entropy (H_D) versus scale (circle radius, 0 to 80); scale values 29 and 59 are marked]

The entropy maxima reveal the spatial scale(s) of a salient region

[Figure: detected salient points in a single frame]
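Scale selection from entropy maxima can be sketched as a search for interior local maxima of the entropy-versus-scale curve. The function name and the entropy values below are made up for illustration; they are not the values plotted on the slide.

```python
def entropy_peak_scales(scales, entropies):
    """Return the scales at which entropy has a strict interior local maximum."""
    peaks = []
    for k in range(1, len(entropies) - 1):
        if entropies[k - 1] < entropies[k] > entropies[k + 1]:
            peaks.append(scales[k])
    return peaks

# Hypothetical entropy values measured at increasing circle radii.
scales = [5, 10, 15, 20, 25, 30]
entropies = [2.1, 3.4, 3.0, 3.2, 3.9, 3.5]

print(entropy_peak_scales(scales, entropies))  # → [10, 25]
```

A region can yield more than one peak, which matches the slide's remark that the maxima reveal the scale or scales of a salient region. Plateaus are ignored by the strict comparison; a practical detector would likely smooth the curve first.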

Spatial and spatiotemporal saliency

Oikonomopoulos, Patras, Pantic, IEEE Transactions SMC, Part B, 2006

Spatiotemporal Saliency:

Driven by signal unpredictability in a spatiotemporal volume (cylinder / sphere)

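Examine entropy:

$$ Y(v_k) = H(v_k)\, w(v_k) $$

where $H$ is the entropy's 'height' and $w$ is the entropy's 'peakness':

$$ w_D(s, d, u) = s \int_q \left| \frac{\partial}{\partial s} p_D(q, s, d, u) \right| dq + d \int_q \left| \frac{\partial}{\partial d} p_D(q, s, d, u) \right| dq $$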


Descriptor extraction – codebook creation

[Diagram: processing pipeline]

• Input sequence (over time t)

• Optical flow, after median subtraction

• Spatiotemporal salient point detection

• Descriptors: optical flow + spatial gradient, binned in histograms and concatenated

• Feature ensembles (O. Boiman & M. Irani, ICCV'05)

• Feature selection

• Ensemble codewords c1, c2, …, cN forming a class-specific codebook
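The "bin in histograms and concatenate" step can be sketched as follows. The magnitude-weighted orientation binning, the bin count and the function names are assumptions chosen for illustration; the slides do not specify how the histograms are built.

```python
import math

def orientation_histogram(vectors, bins=8):
    """Magnitude-weighted histogram of vector orientations, L1-normalised."""
    hist = [0.0] * bins
    for dx, dy in vectors:
        angle = math.atan2(dy, dx) % (2 * math.pi)              # in [0, 2*pi]
        index = min(int(angle / (2 * math.pi) * bins), bins - 1)
        hist[index] += math.hypot(dx, dy)                        # weight by magnitude
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist

def describe(flow_vectors, gradient_vectors, bins=8):
    """Concatenate the optical-flow histogram and the spatial-gradient histogram."""
    return (orientation_histogram(flow_vectors, bins)
            + orientation_histogram(gradient_vectors, bins))

# Hypothetical vectors sampled around one salient point.
flow = [(1, 0), (0, 1), (0, 2)]
gradients = [(-1, 0)]
descriptor = describe(flow, gradients, bins=4)
print(descriptor)  # 4 flow bins followed by 4 gradient bins
```

Normalising each histogram separately keeps the flow and gradient parts on a comparable scale before concatenation, which is one common design choice among several.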


Class-dependent Spatio-temporal probabilistic voting

[Figure: current frame at time t within an action of duration T, with offsets -t and T-t marked]

• Parameters stored for each ensemble in the training set:

- average spatial position of the ensemble with respect to the subject centre and lower bound

- distance in frames of the activated ensemble from the start/end of the action

- average spatiotemporal scale of the ensemble

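• Localisation model learned for each codeword/cluster c_i:

[Equations not recoverable from the extracted slide text: pdfs conditioned on the codeword c_i and the ensemble e_d, with separate terms for spatial position (X), time (T) and scale (S)]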


Discriminative learning

• Higher weights for pdfs with low localisation entropy:

$$ w_i = \exp\left( \int p(\theta \mid c_i) \log p(\theta \mid c_i)\, d\theta \right) $$

• Class dictionary comprises discriminative codewords

• AdaBoost on the codeword similarities
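A minimal sketch of entropy-based weighting, assuming the weight of a codeword is the exponential of the negative entropy of a discretised localisation pdf. The function name and the example histograms are hypothetical.

```python
import math

def entropy_weight(pdf):
    """Return exp(-H) for a discrete pdf, where H is its natural-log entropy.

    The input need not be normalised; zero-probability bins contribute nothing.
    """
    total = sum(pdf)
    probs = [p / total for p in pdf if p > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(-entropy)

# Hypothetical vote histograms for two codewords.
peaked = [0, 0, 1, 0]     # always votes for the same location: low entropy
spread = [1, 1, 1, 1]     # votes everywhere: high entropy

print(entropy_weight(peaked))
print(entropy_weight(spread))
```

Under this assumption a codeword that always votes for the same location gets weight 1, and a codeword that votes uniformly over n bins gets weight 1/n, so uninformative codewords contribute little to the vote.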


Discriminative learning

Higher weights for pdfs with low temporal localisation entropy


Spatio-temporal probabilistic voting

Extension to the space-time domain of the 'Implicit Shape Model', Leibe et al., ECCV'04


Hypothesis verification with Relevance Vector Machine classification

• Mean-shift responses used as features in RVM-based classification

• Two class classification problem (one-vs-all)

• Select class l that maximizes the posterior probability

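$$ F = \{ f_1, \ldots, f_i, \ldots \} $$

$$ K(F, F') = e^{-D_C^2(F, F') / (2\sigma^2)} $$

$$ c^l(F; w) = \sum_{j=1}^{N} w_j^l\, K(F, F_j) + w_0^l $$

$$ p(l \mid F) = \frac{1}{1 + e^{-c^l(F; w)}} $$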
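The one-vs-all decision can be sketched as a kernel expansion passed through a sigmoid, followed by picking the class with the highest posterior. This is only the prediction step: training the Relevance Vector Machine (sparse Bayesian learning of the weights) is not shown, the distances to the training examples are assumed to be precomputed, and all names and numbers are hypothetical.

```python
import math

def kernel(distance, sigma):
    """Gaussian kernel on a precomputed distance between two feature sets."""
    return math.exp(-distance ** 2 / (2 * sigma ** 2))

def posterior(distances, weights, bias, sigma=1.0):
    """Sigmoid of the kernel expansion; distances[j] is the distance to example j."""
    score = bias + sum(w * kernel(d, sigma) for w, d in zip(weights, distances))
    return 1.0 / (1.0 + math.exp(-score))

def classify(distances, models, sigma=1.0):
    """models maps each label to (weights, bias); return the most probable label."""
    return max(models, key=lambda label: posterior(distances, *models[label], sigma))

# Hypothetical one-vs-all models over two training examples.
models = {
    "walk": ([1.0, 0.0], 0.0),
    "run": ([0.0, 1.0], 0.0),
}
distances = [0.1, 2.0]   # the test sample is close to example 0, far from example 1

print(classify(distances, models))
```

Here the test sample lies close to the example that carries the weight of the "walk" model, so that class receives the larger posterior.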


Localisation of single actions


Localisation accuracy (KTH)


Localisation accuracy (KTH)

[SS-PE] Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. CVPR 2007


Action recognition

• KTH dataset, average: 88%

• HoHA dataset, average: 37%


Localisation under artificial occlusions (KTH)


Localisation under clutter (KTH)


Conclusions

• Voting schemes based on local descriptors are robust to occlusions

• Good localisation and recognition accuracy

• Relies on annotation in terms of action localisation

• More suitable for gestures than for less 'structured' actions


Support Tensor Learning

I. Kotsia and I. Patras, “Support Tucker Machines” CVPR 2011, Thursday afternoon

I. Kotsia and I. Patras, "Relative Margin Support Tensor Machines for gait and action recognition," in CIVR 2010.


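Motivation

$$ \min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{j=1}^{N} \xi_j \quad \text{s.t.} \quad y_j \left( w^T \phi(g_j) + b \right) \ge 1 - \xi_j, \quad \xi_j \ge 0, \quad j = 1, \ldots, N $$

Vector-based methods ignore the space (time) structure of the visual data

Large dimensionality in the case of linear SVMs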


Tensor Machines

Variants of linear SVMs, where constraints are imposed on the separating tensorplane

Smaller dimensionality, structural constraints

$$ \min_{W, b, \xi} \; f(W) + C \sum_{j=1}^{N} \xi_j, \quad \text{where } f(W) \text{ is a regularisation term, e.g. } f(W) = \langle W, W \rangle, $$

$$ \text{s.t.} \quad y_j \left( \langle X_j, W \rangle + b \right) \ge 1 - \xi_j, \quad \xi_j \ge 0, \quad j = 1, \ldots, N $$

• Support Tensor Machines: [16] D. Tao et al., KAIS, 13(1):1–42, 2007

• Support Tucker Machines: I. Kotsia, I. Patras, CVPR 2011

• Σ/Σw Support Tucker Machines: I. Kotsia, I. Patras, CVPR 2011


Non-convex optimization problem w.r.t. A, B, C and core tensor G.

But: convex w.r.t. A or B or C or G alone

Block coordinate optimization:

- e.g. optimization w.r.t. G keeping A, B, C fixed

Each step can be reduced to a vector-based SVM-like constrained optimization problem, e.g.

$$ \min_{G_{(1)}, b, \xi} \; \frac{1}{2} \left( A\, \mathrm{vec}(G_{(1)}) \right)^T \left( A\, \mathrm{vec}(G_{(1)}) \right) + C \sum_{i=1}^{M} \xi_i $$

$$ \text{s.t.} \quad y_i \left[ \left( A\, \mathrm{vec}(G_{(1)}) \right)^T \mathrm{vec}(X_{i(1)}) + b \right] \ge 1 - \xi_i, \quad \xi_i \ge 0 $$
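Why the step over G reduces to a vector-based SVM-like problem can be checked numerically. The sketch below assumes the standard three-way Tucker form W = G ×1 A ×2 B ×3 C, which the surviving slide text names but does not spell out. It shows that, with A, B, C fixed, the inner product ⟨W, X⟩ is linear in vec(G) and the regulariser ⟨W, W⟩ is quadratic in vec(G). All sizes and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 4, 5, 6          # dimensions of one sample tensor (illustrative)
r1, r2, r3 = 2, 3, 2       # Tucker ranks (illustrative)

G = rng.standard_normal((r1, r2, r3))   # core tensor
A = rng.standard_normal((I, r1))        # factor matrices, held fixed
B = rng.standard_normal((J, r2))
C = rng.standard_normal((K, r3))
X = rng.standard_normal((I, J, K))      # one training sample

# Separating tensor in Tucker form: W = G x1 A x2 B x3 C.
W = np.einsum("abc,ia,jb,kc->ijk", G, A, B, C)

# Project the sample onto the fixed factors: Z = X x1 A^T x2 B^T x3 C^T.
Z = np.einsum("ijk,ia,jb,kc->abc", X, A, B, C)

g = G.ravel()
lhs = np.sum(W * X)                     # <W, X>
rhs = g @ Z.ravel()                     # the same value, linear in vec(G)

# The regulariser <W, W> is a quadratic form in vec(G).
M = np.kron(np.kron(A.T @ A, B.T @ B), C.T @ C)
reg = g @ M @ g

print(np.isclose(lhs, rhs), np.isclose(reg, np.sum(W * W)))
```

With a linear constraint term and a quadratic regulariser in vec(G), the subproblem has the same shape as a linear SVM on the projected samples Z, which is what makes block coordinate optimisation practical. The same argument applies to each factor matrix in turn.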

Supervised learning


Gait Recognition (USF dataset)

Probe set | State of the art (five methods) | SVMs  | STMs [16] (w vector) | STMs (W tensor) | RMSTMs (W tensor) | STuMs (W tensor) | Σw-STuMs (W tensor)
A         | 100/100                         | 80/97 | 92/100               | 99/100          | 100/100           | 99/100           | 100/100
B         | 89/90                           | 79/93 | 81/90                | 85/93           | 89/97             | 85/93            | 87/95
C         | 83/88                           | 68/85 | 73/88                | 79/93           | 83/95             | 79/90            | 81/91
D         | 39/55                           | 30/54 | 47/67                | 53/72           | 56/75             | 53/71            | 55/74
E         | 33/55                           | 23/46 | 48/79                | 62/88           | 65/91             | 63/86            | 65/90
F         | 30/46                           | 24/49 | 29/49                | 41/71           | 44/74             | 42/63            | 44/66
G         | 29/48                           | 12/37 | 31/71                | 50/88           | 53/90             | 52/87            | 54/90
Average   | -                               | 45/62 | 57/68                | 67/86           | 70/89             | 68/84            | 69/87

• Significant improvements in comparison to state of the art


KTH recognition

[7] T.-K. Kim and R. Cipolla, "Canonical correlation analysis of video volume tensors for action categorization and detection," IEEE PAMI, vol. 31, no. 8, pp. 1415-1428, August 2009

Input features: Dense oriented gradients (at each pixel)

Results comparable to state of the art, using very simple features


Conclusions

• Tensors exploit topology of data better than vectors

• The proposed algorithms (STuMs and Σ/Σw-STuMs) consistently outperform previous approaches, producing state of the art results

Limitations:

• Requires good alignment of the input data

• More suitable for gestures than for less 'structured' actions


References

• A. Oikonomopoulos, I. Patras and M. Pantic, "Spatiotemporal Localization and Categorization of Human Actions in Unsegmented Image Sequences," IEEE Trans. Image Processing, vol. 20, no. 4, pp. 1126-1140, Mar. 2011.

• I. Kotsia and I. Patras, "Support Tucker Machines," Int'l Conf. Computer Vision and Pattern Recognition, Jun. 2011, Colorado, USA.

• I. Kotsia and I. Patras, "Relative Margin Support Tensor Machines for gait and action recognition," Int'l Conf. Image and Video Retrieval, 5-7 July 2010, Xi'an, China.

• S. Koelstra, M. Pantic and I. Patras, "A Dynamic Texture based Approach to Recognition of Facial Actions and their Temporal Models," IEEE Trans. Pattern Analysis and Machine Intelligence, Nov. 2010.

• O. Rudovic, I. Patras and M. Pantic, "Coupled Gaussian Process Regression for pose-invariant facial expression recognition," European Conf. Computer Vision (ECCV'10), pp. 350-363, Heraklion, Crete, Greece, Sept. 2010.
