Skeleton Based Action Recognition with Convolutional Neural Network
Yong Du†, Yun Fu‡, Liang Wang†
Nov. 6, 2015
†Nat’l Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
‡College of Engineering, College of Computer and Information Science, Northeastern University, USA
Outline
Background & Motivation
Our Proposed Model
Experimental Results
Conclusions & Future Work
Action Recognition
Applications: Automatic driving; Content-based video search; Game control; Robot vision; Human-computer interaction; Intelligent surveillance.
Two main branches of action recognition: RGB video based action recognition; RGBD video (skeleton) based action recognition.
Objective of this work – skeleton based action recognition
Related Work
Leveraging Hierarchical Parametric Networks for Skeletal Joints Based Action Segmentation and Recognition (CVPR 2014)
Mining actionlet ensemble for action recognition with depth cameras (CVPR 2012)
An approach to pose-based action recognition (CVPR 2013)
Bio-inspired Dynamic 3D Discriminative Skeletal Features for Human Action Recognition (CVPRW 2013)
[Figure: hierarchical network of bidirectional RNN (BRNN) blocks, Layer1 to Layer9, ending in a fully connected layer and a softmax layer]
Hierarchical recurrent neural network for skeleton based action recognition (CVPR 2015)
Related Work
Limitations of most existing methods:
Dictionary learning based approaches (BoW), temporal pyramids and their variants: utilize only limited contextual information;
Time series models, mainly DTW and HMMs: need pre-segmentation and pre-alignment, and the emission distribution is difficult to obtain;
Hand-crafted features.
Motivation & Contributions
Motivation: Information in an action sequence – postures over time and their evolution; Temporal dynamics -> static structure; Transform sequences into images – preserve dynamic & static information; Representation – structural information (I) <-> dynamic & static information (S).
Contributions: Propose a novel representation for skeleton sequences; Propose a simple, end-to-end, high-efficiency and high-precision solution for skeleton based action recognition; The framework can easily be transferred to other time series problems.
Our Proposed Model
Skeleton Based Action Recognition with CNN
[Figure: overall pipeline in three stages: data transformation, image representation, feature extraction & classification. The joint coordinates x_ij, y_ij, z_ij (i = 1..T, j = 1..N) fill the three RGB channels of an image whose two axes are temporal dynamics and spatial structure (the spatial-temporal representation); the image then passes through filter banks with spatial-temporal synchronous pooling.]
From Skeleton Sequences to Images
Data Transformation: Spatial postures – align joints according to human physical structure; Temporal dynamics – arrange in chronological order; Three components of each joint ↔ three components of each pixel.
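The data transformation above can be sketched as follows. This is a minimal illustration, not the authors' code: the joint indices per body part are hypothetical (a toy 10-joint skeleton), the body-part order follows the labels in the figures (left arm, right arm, trunk, right leg, left leg), and putting joints on the rows and frames on the columns is an assumption.

```python
import numpy as np

# Hypothetical joint indices for a toy 10-joint skeleton.
# Real index lists depend on the dataset's joint layout.
BODY_PARTS = {
    "left_arm": [4, 5],
    "right_arm": [6, 7],
    "trunk": [0, 1],
    "right_leg": [8, 9],
    "left_leg": [2, 3],
}
# Order taken from the figure labels in the slides.
PART_ORDER = ["left_arm", "right_arm", "trunk", "right_leg", "left_leg"]


def skeleton_to_array(seq):
    """Arrange a (T, N, 3) skeleton sequence as an (N, T, 3) array.

    Rows group the joints by body part, columns follow the frames in
    chronological order, and x, y, z fill the three channels.
    """
    seq = np.asarray(seq, dtype=np.float64)
    order = [j for part in PART_ORDER for j in BODY_PARTS[part]]
    reordered = seq[:, order, :]           # group joints by body part
    return reordered.transpose(1, 0, 2)    # joints -> rows, frames -> columns
```

The result still holds raw coordinates; quantising them to pixel values is a separate step.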
Image representations obtained on the Berkeley MHAD dataset
[Figure: one image per action: Jumping in place; Jumping jacks; Bending (hands up all the way down); Punching (boxing); Waving two hands; Waving right hand; Clapping hands; Throwing a ball; Sit down then stand up; Sit down; Stand up. Axes: temporal dynamics and spatial structure. Body-part bands: left arm, right arm, trunk, right leg, left leg.]
Hierarchical Architecture
Convolution: all filter sizes are 3 x 3 and all convolutional strides are 1; Spatial-temporal synchronous pooling: max-pooling; The number of weights: about 75,000; Tested by voting.
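The slide does not say what is voted over at test time. As a rough sketch, assume the network produces several class predictions per sequence (for example from different crops of its image, which is an assumption) and the final label is the most frequent one. The function name `vote` is hypothetical.

```python
from collections import Counter


def vote(predictions):
    """Return the most frequent label among per-sequence predictions.

    Ties are resolved in favour of the label seen first.
    """
    if not predictions:
        raise ValueError("need at least one prediction to vote")
    return Counter(predictions).most_common(1)[0][0]
```

For instance, `vote([3, 1, 3, 2, 3])` picks class 3.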
Adaptive filter banks for feature representation learning:
[Figure: four adaptive filter banks (32, 32, 64 and 64 filters, all of size 3x3) applied to the spatial-temporal image, three spatial-temporal synchronous pooling stages (max-pooling, size 3x3, stride 2), and a feed-forward neural network at the end.]
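As a rough check on the "about 75,000 weights" figure, the convolutional parameters can be counted from the filter-bank sizes on this slide. The sketch assumes a 3-channel input image, the four banks chained in the order 32, 32, 64, 64, and one bias per filter; none of these details are stated explicitly in the slides.

```python
def conv_params(in_channels, out_channels, kernel=3, bias=True):
    """Number of learnable parameters in one 2-D convolutional layer."""
    weights = in_channels * out_channels * kernel * kernel
    return weights + (out_channels if bias else 0)


# Assumed channel progression: RGB input, then 32, 32, 64, 64 filters.
CHANNELS = [3, 32, 32, 64, 64]
PER_LAYER = [conv_params(a, b) for a, b in zip(CHANNELS[:-1], CHANNELS[1:])]
CONV_TOTAL = sum(PER_LAYER)
```

Under these assumptions the four banks account for 65,568 parameters, which would leave roughly 9,000 to 10,000 for the feed-forward network, whose size the slides do not give.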
From Skeleton Sequences to Images
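$p = \mathrm{floor}\left(255 \times \frac{p - c_{\min}}{c_{\max} - c_{\min}}\right)$
[Figure: image layout. The body parts (left arm, right arm, trunk, right leg, left leg) are concatenated by their physical connection order along the spatial-structure axis; the other axis is time (temporal dynamics).]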
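The normalisation on this slide maps each coordinate to a pixel value as floor(255 × (c − c_min) / (c_max − c_min)). A minimal sketch follows, assuming c_min and c_max are the smallest and largest coordinate values (the slide does not define them; here they default to the extremes of the input array). Clipping is added so that values outside the range stay valid pixels, which is also an assumption.

```python
import numpy as np


def to_pixels(coords, c_min=None, c_max=None):
    """Quantise coordinates to integers in 0..255 by min-max scaling."""
    coords = np.asarray(coords, dtype=np.float64)
    c_min = coords.min() if c_min is None else float(c_min)
    c_max = coords.max() if c_max is None else float(c_max)
    if c_max <= c_min:
        raise ValueError("c_max must be greater than c_min")
    scaled = 255.0 * (coords - c_min) / (c_max - c_min)
    # Clip in case explicit bounds do not cover every input value.
    return np.floor(np.clip(scaled, 0.0, 255.0)).astype(np.uint8)
```

Passing fixed `c_min` and `c_max` for every sequence keeps pixel values comparable across sequences.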
Problem & Solution
Variable frequency problem – different subjects, different sequences, resize; Solution - Spatial-Temporal Synchronous Pooling.
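A minimal sketch of max-pooling applied over both image axes at once, using the 3x3 window and stride 2 given on the architecture slide. Reading "spatial-temporal synchronous pooling" as ordinary 2-D max-pooling over the joint and time axes together is an assumption; the explicit loop is for illustration, not speed.

```python
import numpy as np


def max_pool2d(feature_map, size=3, stride=2):
    """Max-pool a 2-D feature map over both axes at once (no padding)."""
    x = np.asarray(feature_map)
    h, w = x.shape
    if h < size or w < size:
        raise ValueError("feature map is smaller than the pooling window")
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out
```

Each output value summarises a small neighbourhood in both space and time, so small shifts along the time axis change the pooled features less than they change the raw image.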
Experimental Results
Datasets
Berkeley Multimodal Human Action Dataset (Berkeley MHAD): 12 subjects, 11 actions, 659 valid samples, 35 joints; 480 FPS, ≈3602 frames/sequence; Training on the 384 samples of the first 7 subjects, testing on the remaining 275 samples.
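The subject-wise protocol can be sketched as below. The sample records and the `"subject"` key are hypothetical, and numbering the subjects 1 to 12 is an assumption; only the split rule (first 7 subjects for training, the rest for testing) comes from the slide.

```python
def split_by_subject(samples, train_subjects=range(1, 8)):
    """Split samples into train/test lists by the subject who performed them."""
    train_ids = set(train_subjects)
    train = [s for s in samples if s["subject"] in train_ids]
    test = [s for s in samples if s["subject"] not in train_ids]
    return train, test


# Toy data: 12 subjects with two recordings each (not the real dataset).
TOY_SAMPLES = [
    {"subject": subject, "take": take}
    for subject in range(1, 13)
    for take in range(2)
]
TOY_TRAIN, TOY_TEST = split_by_subject(TOY_SAMPLES)
```

Splitting by subject means no person appears in both sets, and the reported counts add up: 384 + 275 = 659.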
Berkeley MHAD
Motion capture dataset: High sample frequency; Long sequences; High precision.
Datasets
ChaLearn Gesture Recognition Dataset: 27 persons, 20 Italian gestures, 6850 training samples, 3454 validation samples, 3579 test samples; 20 FPS, ≈39 frames/sequence; Provides RGB, depth, foreground segmentation and Kinect skeletons; Only the skeleton data are used, training on the training set and testing on the validation set.
Kinect dataset: Low sample frequency; Short sequences; Low precision.
Experimental Results
Method                      Accuracy (%)
Ofli et al., 2014           95.37
Vantigodi et al., 2013      96.06
Vantigodi et al., 2014      97.58
Kapsouras et al., 2014      98.18
Du et al., 2015             100
Ours                        100
Experimental results on the Berkeley MHAD
Analysis: Temporal dynamics → static structure → final sequence representation: successful; Spatial-temporal synchronous pooling overcomes the variable frequency problem; The model handles this problem very well.
Experimental Results
Method                        Precision  Recall  F1-score
Yao et al., CVPR 2014         -          -       56.0
Wu et al., ACM-ICMI 2013      59.9       59.3    59.6
Pfister et al., ECCV 2014     61.2       62.3    61.7
Fernando et al., CVPR 2015    75.3       75.1    75.2
Our Hierarchical RNN          91.93      92.01   91.97
Ours                          91.16      91.25   91.21
Experimental results on the ChaLearn Gesture Recognition Dataset
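For reference, the F1-score is the harmonic mean of precision and recall. The short check below recomputes it from the precision and recall columns of the table; the reported values agree with the harmonic mean to within rounding (under 0.05). How the original scores were averaged across classes is not stated, so exact agreement is not expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
REPORTED = {
    "Wu et al., ACM-ICMI 2013": (59.9, 59.3, 59.6),
    "Pfister et al., ECCV 2014": (61.2, 62.3, 61.7),
    "Fernando et al., CVPR 2015": (75.3, 75.1, 75.2),
    "Our Hierarchical RNN": (91.93, 92.01, 91.97),
    "Ours": (91.16, 91.25, 91.21),
}


def max_deviation(rows=REPORTED):
    """Largest gap between the recomputed and the reported F1-score."""
    return max(abs(f1_score(p, r) - f1) for p, r, f1 in rows.values())
```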
Analysis: Excellent performance and good robustness; Processes temporal information better than traditional methods; Skeleton data represent human motions well.
Experimental Results
Filters and convergence curves on the ChaLearn Gesture Dataset
Computational efficiency (NVIDIA Titan GK110): training takes 1.95 ms/sequence, testing takes 2.27 ms/sequence
Conclusions & Future Work
Conclusions
Advantages & disadvantages
Advantages: simple, end-to-end, high-precision and high-efficiency; no need for temporal alignment and pre-segmentation; handles variable-length/frequency sequences.
Disadvantages: sensitive to fragments missing in sequences.
[Figure: a sequence along the time axis with two missing fragments]
Future work: use appearance features as an aid for depth video based action recognition.