Video Action Detection with Relational Dynamic-Poselets

Limin Wang¹,², Yu Qiao², and Xiaoou Tang¹,²

¹Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong
²Shenzhen Institutes of Advanced Technology, CAS, China

Introduction

•Problem: We aim not only to recognize the on-going action class (action recognition), but also to localize its spatiotemporal extent (action detection), and even to estimate the pose of the actor (pose estimation).

•Key insights:
◮ An action can be temporally decomposed into a sequence of key poses.
◮ Each key pose can be decomposed into a spatial arrangement of mixtures of action parts.

Figure 1: Illustration for motivation.

•Main contributions:
◮ We propose a new pose and motion descriptor to cluster cuboids into dynamic-poselets.
◮ We design a sequential skeleton model to jointly capture spatiotemporal relations among body parts, co-occurrences of mixture types, and local part templates.

Dynamic-Poselets

Figure 2: Construction of dynamic-poselets.

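•A pose and motion descriptor:
◮ $f(p^{v}_{i,j}) = [\Delta p^{v,1}_{i,j}, \Delta p^{v,2}_{i,j}, \Delta p^{v,3}_{i,j}]$.
◮ $\Delta p^{v,1}_{i,j} = p^{v}_{i,j} - p^{v}_{\mathrm{par}(i),j}$, $\Delta p^{v,2}_{i,j} = p^{v}_{i,j} - p^{v}_{i,j-1}$, $\Delta p^{v,3}_{i,j} = p^{v}_{i,j} - p^{v}_{i,j+1}$.
◮ $\bar{f}(p^{v}_{i,j}) = [\overline{\Delta p}^{v,1}_{i,j}, \overline{\Delta p}^{v,2}_{i,j}, \overline{\Delta p}^{v,3}_{i,j}]$.
◮ $\overline{\Delta p}^{v,k}_{i,j} = [\Delta x^{v,k}_{i,j} / s^{v}_{i,j}, \Delta y^{v,k}_{i,j} / s^{v}_{i,j}]$ $(k = 1, 2, 3)$.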

•Using this descriptor, we run the k-means algorithm to cluster cuboids into dynamic-poselets.
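The descriptor and the clustering step can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the data layout (`joints[j][i]` as the position of part i in frame j), the reading of s as a per-part scale estimate, the restriction to interior frames, and the deterministic k-means initialization are all assumptions made for illustration.

```python
from math import dist  # Python 3.8+


def pose_motion_descriptor(joints, parent, scales, i, j):
    """Scale-normalized 6-D descriptor for part i in frame j (sketch).

    joints[j][i] -> (x, y) position of part i in frame j
    parent[i]    -> index of the parent part par(i)
    scales[j][i] -> scale s_{i,j} (assumed to be given)
    """
    if not 1 <= j <= len(joints) - 2:
        raise ValueError("frame j needs both a previous and a next frame")
    x, y = joints[j][i]
    s = scales[j][i]
    references = (
        joints[j][parent[i]],  # k = 1: offset from the parent part
        joints[j - 1][i],      # k = 2: offset from the previous frame
        joints[j + 1][i],      # k = 3: offset from the next frame
    )
    descriptor = []
    for rx, ry in references:
        descriptor += [(x - rx) / s, (y - ry) / s]
    return descriptor


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    # Assumption: deterministic init from k evenly spaced data points.
    step = len(points) // k
    centers = [list(points[c * step]) for c in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

In this sketch, each cluster index returned by `kmeans` plays the role of one dynamic-poselet type for the cuboids whose descriptors were clustered.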

Action Detection with SSM

Figure 3: Overview of our approach. Training: positive videos with annotations and negative videos without annotations; dynamic-poselet construction; sequential skeleton model learning (part models, spatiotemporal relations, mixture type relations). Testing: test video without annotations; temporal sliding window over the (x, y, t) volume; model inference; post-processing.

•Sequential Skeleton Model (SSM):

$$S(v, p, t) = \underbrace{b(t)}_{\text{Mixture Type Relations}} + \underbrace{\Psi(p, t)}_{\text{Spatiotemporal Relations}} + \underbrace{\Phi(v, p, t)}_{\text{Action Part Models}}$$

$v$ is a video clip; $p$ and $t$ are the pixel positions and the mixture types of dynamic-poselets, respectively.

◮ Mixture Type Relations:

$$b(t) = \sum_{j=1}^{N} \sum_{i=1}^{K} b_{i,j}^{t_{i,j}} + \sum_{(i,j)\sim(m,n)} b_{(i,j),(m,n)}^{t_{i,j} t_{m,n}}$$

$b_{i,j}^{t_{i,j}}$ encodes the mixture prior; $b_{(i,j),(m,n)}^{t_{i,j} t_{m,n}}$ captures the compatibility of mixture types.

◮ Spatiotemporal Relations:

$$\Psi(p, t) = \sum_{(i,j)\sim(m,n)} \beta_{(i,j),(m,n)}^{t_{i,j} t_{m,n}} \psi(p_{i,j}, p_{m,n}), \qquad \psi(p_{i,j}, p_{m,n}) = [dx, dy, dz, dx^2, dy^2, dz^2]$$

$\beta_{(i,j),(m,n)}^{t_{i,j} t_{m,n}}$ denotes the parameters of the quadratic spring model.

◮ Action Part Models:

$$\Phi(v, p, t) = \sum_{j=1}^{N} \sum_{i=1}^{K} \alpha_{i}^{t_{i,j}} \phi(v, p_{i,j})$$

$\phi(v, p_{i,j})$ is the feature vector; $\alpha_{i}^{t_{i,j}}$ denotes the feature template.

Note that the body part template $\alpha_{i}^{t_{i,j}}$ is shared among different key poses.
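To make the three terms concrete, the following sketch evaluates S(v, p, t) for one fixed configuration of positions and mixture types. It is only an illustration under stated assumptions: the dictionary-based layout, all names (`ssm_score`, `bias1`, `bias2`, `feat`), the sign convention of the offsets in ψ, and the treatment of each position as a 3-tuple are hypothetical and not taken from the authors' code. Searching for the best configuration (inference) is not shown.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def deformation(p, q):
    """psi(p, q) = [dx, dy, dz, dx^2, dy^2, dz^2].

    Assumption: offsets are taken as p - q, component by component.
    """
    dx, dy, dz = (a - b for a, b in zip(p, q))
    return [dx, dy, dz, dx * dx, dy * dy, dz * dz]


def ssm_score(pos, types, edges, bias1, bias2, beta, alpha, feat):
    """Score of one configuration (hypothetical data layout).

    pos[(i, j)]       -> 3-tuple position of part i in key pose j
    types[(i, j)]     -> mixture type t_{i,j}
    edges             -> list of connected node pairs ((i, j), (m, n))
    bias1[node][t]    -> mixture prior
    bias2[edge][(ta, tb)], beta[edge][(ta, tb)] -> pairwise parameters
    alpha[i][t]       -> part template, shared across key poses j
    feat(node, p)     -> feature vector phi(v, p_{i,j})
    """
    score = 0.0
    for node, t in types.items():
        i, _ = node
        score += bias1[node][t]                           # mixture prior in b(t)
        score += dot(alpha[i][t], feat(node, pos[node]))  # part model Phi
    for a, b in edges:
        pair = (types[a], types[b])
        score += bias2[(a, b)][pair]                      # type compatibility in b(t)
        score += dot(beta[(a, b)][pair], deformation(pos[a], pos[b]))  # spring term Psi
    return score
```

Indexing `alpha` by part i only, and not by key pose j, mirrors the note that part templates are shared among key poses.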

•Action detection pipeline:
◮ temporal sliding window → model inference → non-maximum suppression.
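The pipeline can be outlined in a few lines of Python. This is a simplified sketch, not the authors' procedure: `score_fn` is a hypothetical stand-in for model inference on one window, the window length, stride and thresholds are free parameters, and suppression here uses temporal overlap only, whereas a full detector would also compare spatial extents.

```python
def sliding_windows(num_frames, length, stride):
    """All (start, end) windows of a fixed length, end exclusive."""
    return [(s, s + length) for s in range(0, num_frames - length + 1, stride)]


def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_thresh=0.5):
    """Greedy non-maximum suppression over (score, window) pairs."""
    kept = []
    for score, win in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep a window only if it does not overlap a better one too much.
        if all(temporal_iou(win, k[1]) <= iou_thresh for k in kept):
            kept.append((score, win))
    return kept


def detect(num_frames, length, stride, score_fn, score_thresh, iou_thresh=0.5):
    """Sliding window -> scoring (stand-in for inference) -> NMS."""
    scored = [(score_fn(w), w) for w in sliding_windows(num_frames, length, stride)]
    candidates = [d for d in scored if d[0] >= score_thresh]
    return nms(candidates, iou_thresh)
```

A lower `iou_thresh` suppresses more aggressively, which reduces duplicate detections of the same action instance at the cost of possibly merging nearby instances.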

References
1. Cao, L., Liu, Z., Huang, T.S.: Cross-dataset action detection. In: CVPR (2010).
2. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: CVPR (2011).
3. Lan, T., et al.: Discriminative figure-centric models for joint action localization and recognition. In: ICCV (2011).
4. Tian, Y., Sukthankar, R., Shah, M.: Spatiotemporal deformable part models for action detection. In: CVPR (2013).
5. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV (2013).

Experiments

•Quantitative results on MSR-II, UCF Sports, J-HMDB:

[Plots: MSR-II. Precision vs. recall for Boxing, Hand Clapping, and Hand Waving. Legend: SDPM w/o parts, SDPM, Model before adaptation, Model after adaptation, Model adapted with groundtruth, Our result.]

[Plots: UCF Sports. AUC vs. overlap threshold per class (Diving, Golf, Kicking, Lifting, HorseRiding, Running, Skating, SwingBench, SwingSide, Walking); true positive rate vs. false positive rate; average AUC vs. overlap threshold. Legend: SDPM, Figure-Centric Model, Our result.]

[Plots: J-HMDB. AUC vs. overlap threshold per class (Catch, ClimbStairs, Golf, Jump, Kick, Pick, Pullup, Push, Run, ShootBall, Swing, Walk); true positive rate vs. false positive rate; average AUC vs. overlap threshold. Legend: iDT+FV, Our result.]

•Detection examples on MSR-II, UCF Sports, J-HMDB:

Limin Wang, Yu Qiao, and Xiaoou Tang European Conference on Computer Vision (ECCV), Zurich, Switzerland, 2014.
