Skeleton Based Action Recognition with Convolutional Neural Network

Yong Du†, Yun Fu‡, Liang Wang†

Nov. 6, 2015

†Nat’l Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences

‡College of Engineering, College of Computer and Information Science, Northeastern University, USA

Outline

Background & Motivation

Our Proposed Model

Experimental Results

Conclusions & Future Work

Action Recognition

Applications: Automatic Drive; Content-Based Video Search; Game control; Robot Vision; HC Interaction; Intelligent Surveillance.

Two main branches of action recognition: RGB video based action recognition; RGBD video (skeleton) based action recognition.

Objective of this work – skeleton based action recognition

Related Work

Leveraging Hierarchical Parametric Networks for Skeletal Joints Based Action Segmentation and Recognition (CVPR 2014)

Mining actionlet ensemble for action recognition with depth cameras (CVPR 2012)

An approach to pose-based action recognition (CVPR 2013)

Bio-inspired Dynamic 3D Discriminative Skeletal Features for Human Action Recognition (CVPRW 2013)

[Figure: network diagram. Labels: BRNN blocks; Layer1 to Layer9; Fully Connected Layer; Softmax Layer]

Hierarchical recurrent neural network for skeleton based action recognition (CVPR 2015)

Related Work

Limitations of most existing methods:

Dictionary learning based approaches (BoW), temporal pyramids and their variants -> utilize only limited contextual information;

Time series models – mainly DTW & HMMs – need pre-segmentation & pre-alignment, and the emission distribution is difficult to obtain;

Hand-crafted features.

Motivation & Contributions

Motivation: Information in an action sequence – postures over time and their evolution; Temporal dynamics -> static structure: transform sequences into images – preserve dynamic & static information; Representation – structural information (I) <-> dynamic & static information (S).

Contributions: Propose a novel representation for skeleton sequences; Propose a simple, end-to-end, high-efficiency and high-precision solution for skeleton based action recognition; The framework can easily be transferred to other time series problems.



Outline

Our Proposed Model


Skeleton Based Action Recognition with CNN

[Figure: framework overview. The joint coordinates x_tn, y_tn, z_tn (frames t = 1..T, joints n = 1..N) fill the R, G and B channels of an image whose axes are Temporal Dynamics and Spatial Structure; this Spatial-Temporal Representation passes through filter banks and Spatial-Temporal Synchronous Pooling.]

Stages: Data Transformation; Image Representation; Feature extraction & classification

From Skeleton Sequences to Images

Data Transformation: Spatial postures – align joints according to human physical structure; Temporal dynamics – arrange in chronological order; Three components of each joint ↔ three components of each pixel.
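The arrangement step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the body-part order (left arm, right arm, trunk, right leg, left leg) is taken from the figure labels in these slides, while the joint indices in `TOY_PARTS`, the function name `skeleton_to_grid`, and the choice of rows for joints and columns for time are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) of one joint
Frame = Sequence[Joint]              # all joints of one frame

# Body-part order as labelled in the slides' figures.
PART_ORDER = ["left_arm", "right_arm", "trunk", "right_leg", "left_leg"]

# Hypothetical joint indices per part; a real dataset has its own layout.
TOY_PARTS: Dict[str, List[int]] = {
    "trunk": [0, 1],
    "left_arm": [2, 3],
    "right_arm": [4, 5],
    "left_leg": [6],
    "right_leg": [7],
}


def skeleton_to_grid(frames: Sequence[Frame],
                     parts: Dict[str, List[int]] = TOY_PARTS) -> List[List[Joint]]:
    """Return grid[row][col]: rows follow body-part order, columns follow time.

    Each cell keeps the (x, y, z) triple, which later becomes one 3-channel pixel.
    """
    row_joints = [j for part in PART_ORDER for j in parts[part]]
    return [[frame[j] for frame in frames] for j in row_joints]


# Toy sequence: 3 frames, 8 joints; joint j at time t is (t, j, 0).
frames = [[(float(t), float(j), 0.0) for j in range(8)] for t in range(3)]
grid = skeleton_to_grid(frames)
```

In the toy example the grid has 8 rows (joints, grouped by body part) and 3 columns (frames in chronological order).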

Image representations obtained on the Berkeley MHAD dataset

Actions shown: Jumping in place; Jumping jacks; Bending (hands up all the way down); Punching (boxing); Waving two hands; Waving right hand; Clapping hands; Throwing a ball; Sit down then stand up; Sit down; Stand up.

[Figure labels: axes Temporal Dynamics and Spatial Structure; body parts Left arm, Right arm, Trunk, Right leg, Left leg]

Hierarchical Architecture

Convolution: all filter sizes are 3 x 3 and all convolutional strides are 1; Spatial-temporal synchronous pooling: max-pooling; The number of weights: about 75,000; Tested by voting.
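The slide states only that testing is done by voting, without saying what is voted over. A minimal sketch, assuming that several class predictions are available for one test sequence (for example from different crops of its image, which is an assumption here) and that they are combined by majority vote; the function name `vote` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def vote(predictions: Sequence[int]) -> int:
    """Return the most frequent class label among the predictions."""
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


# Five hypothetical predictions for one test sequence.
label = vote([3, 5, 3, 1, 3])
```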

Adaptive filter banks for feature representation learning:

[Figure: the image (axes: Temporal Dynamics, Spatial Structure) passes through four Adaptive Filter Banks with Spatial-Temporal Synchronous Pooling, followed by a Forward Neural Network]

Adaptive Filter Banks: 32, 32, 64 and 64 filters; size: 3x3

Max-pooling size: 3x3, stride: 2
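As a rough check on the stated model size of about 75,000 weights, the convolutional part can be counted by hand. The sketch assumes that the four filter banks are plain 3x3 convolutions with biases, applied in sequence to a 3-channel input with 32, 32, 64 and 64 filters; the slides do not confirm these details, and the helper name `conv_params` is hypothetical.

```python
def conv_params(in_channels: int, out_channels: int,
                kernel: int = 3, bias: bool = True) -> int:
    """Number of weights (and biases) of one 2D convolution layer."""
    weights = in_channels * out_channels * kernel * kernel
    return weights + (out_channels if bias else 0)


# Assumed channel progression: 3-channel image, then 32, 32, 64, 64 filters.
channels = [3, 32, 32, 64, 64]
per_layer = [conv_params(i, o) for i, o in zip(channels, channels[1:])]
conv_total = sum(per_layer)
```

Under these assumptions the four layers hold 896, 9,248, 18,496 and 36,928 parameters, 65,568 in total, which would leave roughly 9,000 of the stated 75,000 for the feed-forward part.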



From Skeleton Sequences to Images

$$p = \mathrm{floor}\left(255 \times \frac{c - c_{\min}}{c_{\max} - c_{\min}}\right)$$

[Figure: body parts Left arm, Right arm, Trunk, Right leg, Left leg arranged along the Spatial Structure axis; Time runs along the Temporal Dynamics axis]

Concatenated by their physical connection order
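The pixel normalization on this slide, read here as p = floor(255 (c - c_min) / (c_max - c_min)), maps each coordinate c to an 8-bit pixel value. A minimal sketch follows; how c_min and c_max are chosen (per sequence or over a whole training set) is not stated on the slide, so they are passed in as arguments, and the function names are hypothetical.

```python
import math
from typing import Tuple


def to_pixel(c: float, c_min: float, c_max: float) -> int:
    """Map one coordinate to an integer in [0, 255]."""
    if c_max == c_min:          # degenerate range: avoid division by zero
        return 0
    return math.floor(255 * (c - c_min) / (c_max - c_min))


def joint_to_rgb(joint: Tuple[float, float, float],
                 c_min: float, c_max: float) -> Tuple[int, int, int]:
    """Turn the (x, y, z) of one joint into one (R, G, B) pixel."""
    x, y, z = joint
    return (to_pixel(x, c_min, c_max),
            to_pixel(y, c_min, c_max),
            to_pixel(z, c_min, c_max))


# Toy coordinate range of [-1, 1].
pixel = joint_to_rgb((-1.0, 0.0, 1.0), c_min=-1.0, c_max=1.0)
```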

Problem & Solution

Variable frequency problem – different subjects, different sequences, resize; Solution - Spatial-Temporal Synchronous Pooling.
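The pooling operation itself can be sketched as ordinary max-pooling that uses the same window on the temporal and the spatial axis, with the 3x3 window and stride 2 given on the architecture slide. Reading "synchronous" as "the same pooling on both axes" is an assumption, and the slide gives no detail on how pooling removes frequency differences, so the sketch shows only the operation on one channel.

```python
from typing import List


def max_pool2d(img: List[List[float]], size: int = 3,
               stride: int = 2) -> List[List[float]]:
    """Max-pool a 2D grid with a square window, no padding."""
    rows, cols = len(img), len(img[0])
    out = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            window = [img[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            out_row.append(max(window))
        out.append(out_row)
    return out


# Toy 5 x 7 channel: 5 spatial rows, 7 time steps, value = 10 * row + column.
channel = [[10.0 * r + c for c in range(7)] for r in range(5)]
pooled = max_pool2d(channel)
```

With this toy input the output shrinks from 5 x 7 to 2 x 3, so both the spatial and the temporal extent are reduced by the same rule.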



Outline

Experimental Results


Datasets

Berkeley Multimodal Human Action Dataset (Berkeley MHAD): 12 subjects, 11 actions, 659 valid samples, 35 joints; 480 FPS, ≈3602 frames/sequence; Training on the 384 samples of the first 7 subjects, testing on the remaining 275 samples.

Berkeley MHAD

Motion capture dataset: High sample frequency; Long sequences; High precision.

Datasets

ChaLearn Gesture Recognition Dataset

ChaLearn Gesture Recognition Dataset: 27 persons, 20 Italian gestures, 6850 training samples, 3454 validation samples, 3579 test samples; 20 FPS, ≈39 frames/sequence; Provides RGB, depth, foreground segmentation and Kinect skeletons; Only skeleton data is used, training on the training set and testing on the validation set.

Kinect dataset: Low sample frequency; Short sequences; Low precision.
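The contrast between the two datasets can be made concrete by converting the quoted frame rates and average lengths into durations. This is plain arithmetic on the figures from the two dataset slides; the helper name `duration_seconds` is hypothetical.

```python
def duration_seconds(frames: float, fps: float) -> float:
    """Average sequence duration from frame count and frame rate."""
    return frames / fps


mhad_seconds = duration_seconds(3602, 480)      # Berkeley MHAD
chalearn_seconds = duration_seconds(39, 20)     # ChaLearn
length_ratio = 3602 / 39                        # frames per sequence, MHAD vs ChaLearn

# The Berkeley MHAD split adds up to the stated number of valid samples.
mhad_total = 384 + 275
```

An average Berkeley MHAD sequence lasts about 7.5 s and an average ChaLearn sequence about 1.95 s, while the number of frames per sequence differs by a factor of roughly 92.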


Method                    Accuracy (%)
Ofli et al., 2014         95.37
Vantigodi et al., 2013    96.06
Vantigodi et al., 2014    97.58
Kapsouras et al., 2014    98.18
Du et al., 2015           100
Ours                      100

Experimental results on the Berkeley MHAD

Analysis: Temporal dynamics → static structure → final sequence representation: successful; Spatial-temporal synchronous pooling overcomes the variable frequency problem, and the model handles it very well.


Method                        Precision   Recall   F1-score
Yao et al., CVPR 2014         -           -        56.0
Wu et al., ACM-ICMI 2013      59.9        59.3     59.6
Pfister et al., ECCV 2014     61.2        62.3     61.7
Fernando et al., CVPR 2015    75.3        75.1     75.2
Our Hierarchical RNN          91.93       92.01    91.97
Ours                          91.16       91.25    91.21

Experimental results on the ChaLearn Gesture Recognition Dataset

Analysis: Excellent performance and good robustness; Processes temporal information better than traditional methods; Skeleton data can represent human motions well.
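The F1-scores in the ChaLearn table can be related to the precision and recall columns through the usual harmonic mean, F1 = 2PR / (P + R). The sketch below recomputes them from the rounded table values; the reported scores may have been computed from unrounded or per-class values, so differences in the last digit are expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the table.
rows = {
    "Wu et al., ACM-ICMI 2013": (59.9, 59.3, 59.6),
    "Pfister et al., ECCV 2014": (61.2, 62.3, 61.7),
    "Fernando et al., CVPR 2015": (75.3, 75.1, 75.2),
    "Our Hierarchical RNN": (91.93, 92.01, 91.97),
    "Ours": (91.16, 91.25, 91.21),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in rows.items()}
```

Each recomputed value agrees with the reported F1-score to within 0.05.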


Filters and convergence curves on the ChaLearn Gesture Dataset

Computational efficiency (NVIDIA Titan GK110): training with 1.95 ms/sequence, testing with 2.27 ms/sequence



Outline

Conclusions & Future Work


Conclusions

Advantages & disadvantages

Advantages: simple, end-to-end, high-precision and high-efficiency; no need for temporal alignment and pre-segmentation; handles variable-length/frequency sequences.

Disadvantages: sensitive to fragments missing in sequences.

[Figure: a Time axis with two segments marked Missing]

Future work: use appearance features as an aid in solving depth video based action recognition.

THANK YOU

Suggestions Questions

E-mail: {Yong.du, wangliang}@nlpr.ia.ac.cn, [email protected]
