
Evaluation of Vision-based Human Activity Recognition in Dense Trajectory Framework

Hirokatsu Kataoka, Yoshimitsu Aoki†, Kenji Iwata, Yutaka Satoh

National Institute of Advanced Industrial Science and Technology (AIST) † Keio University

Background
Computer vision for human sensing

-  Detection, Tracking, Trajectory Analysis
-  Posture Estimation, Activity Recognition
-  Action recognition can extend human sensing applications

[Figure: "Look at people". Sensing tasks: detection, gaze estimation, face recognition, posture estimation, trajectory extraction, action recognition. Analysis targets: mental state, body situation, activity analysis.]


Activity Recognition

“Activity” is a low-level primitive with semantic meaning, e.g. walking, running, sitting

Example: this image contains a man walking

-  Activity recognition: classification (location is given)
-  Activity detection: classification and localization


Dense Trajectories (DT) [Wang+, IJCV2013]
• State-of-the-art space-time recognition approach
  – State-of-the-art: DT + Deep Learning [THUMOS2015]
  – Usable motion analyzer
  – Simply, (i) flow tracker (ii) feature vectorization

Large amount of opt. flows


History of keypoint/traj.-based approach
• Space-time interest points (STIP) – DT

Dense Trajectories [Wang et al., CVPR2011]

Feature Mining for Activity Recognition

Cuboid Features

STR: Spatio-Temporal Relationship Match

STIP & DT: Sampling
• Space-time interest points (STIP) – DT

Co-occurrence features in DT
• Extended co-occurrence feature (ECoHOG)
  – Feature
    • CoHOG [Watanabe, PSIVT2009] (pair-count), ECoHOG (edge-magnitude accum.)
    • PCA for codeword
    • DT + co-occurrence features (62.4%) > DT (59.2%) on MPII cooking



Need for more features!

Pose-based approach

Holistic approach

Proposal
• Feature evaluation for better performance
  – Evaluation of 13 features under fair settings
  – 5 categories
    • Trajectory: traj. feature (originally in DT)
    • Shape: HOG, SIFT
    • Motion: HOF, MBHx, MBHy, MIP
    • Texture: HLAC, LBP, iLBP, LTP
    • Co-occurrence: CoHOG, ECoHOG
  – 4 different datasets
    • NTSEL (traffic)
    • INRIA surgery (surgery)
    • MSR daily activity 3D (daily living)
    • UCF50 (sports)

Simple algorithm
• (i) Flow tracking
  – Pyramidal images & sampling
  – Farneback optical flow & flow tracking
• (ii) Feature vectorization
  – HOG, HOF, MBH, Trajectory, SIFT, LBP, ...
  – Bag-of-words (BoW) representation
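
The bag-of-words step in (ii) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the codebook is assumed to be precomputed (for example by k-means), hard nearest-codeword assignment and L1 normalization are assumed, and the function name `bow_histogram` is hypothetical.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors against a codebook and return an
    L1-normalized bag-of-words histogram.

    descriptors: (n, d) array of local features (e.g. HOG along trajectories)
    codebook:    (k, d) array of codewords (assumed precomputed)
    """
    descriptors = np.asarray(descriptors, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance between every descriptor and every codeword.
    diff = descriptors[:, None, :] - codebook[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    # Hard assignment: each descriptor votes for its nearest codeword.
    nearest = np.argmin(dist2, axis=1)
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

One histogram of this kind would be computed per feature type, so that features can later be compared or concatenated on equal terms.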

Pyramidal images & sampling
• Scaling and dense sampling
  – Pyramidal images
    • Scale *= 1/√2 per pyramid level
  – Sampling at each scale
    • Grid: 5x5 [pxls] (experimentally decided)
    • Corner detection (T: threshold, λ: eigenvalue)
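
The scaling and sampling steps above can be sketched in NumPy. Several details are assumptions, because the slide does not give them: the threshold is taken as T = `ratio` × (largest min-eigenvalue in the frame) with `ratio = 0.001`, the structure tensor uses a 3x3 box window, and all function names are hypothetical.

```python
import numpy as np

def pyramid_scales(n_levels, factor=1.0 / np.sqrt(2.0)):
    """Scale of each pyramid level: 1, 1/sqrt(2), 1/2, ..."""
    return [factor ** i for i in range(n_levels)]

def min_eigenvalue_map(gray):
    """Smaller eigenvalue of the 2x2 structure tensor at each pixel.
    The 3x3 box window is an assumption, not taken from the slides."""
    gy, gx = np.gradient(np.asarray(gray, dtype=float))

    def box3(a):
        p = np.pad(a, 1, mode="edge")
        h, w = a.shape
        return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

    a, b, c = box3(gx * gx), box3(gx * gy), box3(gy * gy)
    # Closed-form eigenvalues of [[a, b], [b, c]]; keep the smaller one.
    return 0.5 * ((a + c) - np.sqrt((a - c) ** 2 + 4.0 * b ** 2))

def dense_sample(gray, step=5, ratio=0.001):
    """Grid points (x, y), spaced `step` pixels apart, whose min eigenvalue
    exceeds T = ratio * max (the form of T is an assumption)."""
    lam = min_eigenvalue_map(gray)
    T = ratio * lam.max()
    h, w = lam.shape
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    keep = lam[ys, xs] > T
    return np.stack([xs[keep], ys[keep]], axis=1)
```

The eigenvalue test discards grid points in homogeneous regions, where optical flow cannot be tracked reliably.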

Scale invariant Detailed description

Farneback Optical Flow
• Dense Optical Flow + ST-patch
  – Farneback Optical Flow is included in OpenCV
  – Comparison of KLT tracker and SIFT
  – Local space-time patch around tracked sampling points
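
Flow tracking of the sampling points can be sketched as below. The flow field is assumed to be an (H, W, 2) array such as the one returned by OpenCV's `cv2.calcOpticalFlowFarneback`; OpenCV is not called here so that the sketch stays dependency-free. The nearest-pixel lookup and the function name `track_points` are assumptions for illustration.

```python
import numpy as np

def track_points(points, flow):
    """Advance points one frame along a dense flow field.

    points: (n, 2) array of (x, y) positions at frame t
    flow:   (H, W, 2) array; flow[y, x] = (dx, dy) from frame t to t+1
    Returns the new positions and a mask of points still inside the image.
    """
    points = np.asarray(points, dtype=float)
    h, w = flow.shape[:2]
    # Read the flow at the nearest pixel. The original DT paper filters the
    # flow field with a median filter first; that smoothing is omitted here.
    xi = np.clip(np.rint(points[:, 0]).astype(int), 0, w - 1)
    yi = np.clip(np.rint(points[:, 1]).astype(int), 0, h - 1)
    moved = points + flow[yi, xi]
    inside = ((moved[:, 0] >= 0) & (moved[:, 0] <= w - 1) &
              (moved[:, 1] >= 0) & (moved[:, 1] <= h - 1))
    return moved, inside
```

Calling this once per frame and dropping points where the mask is false yields the tracked positions along which the ST-patch is placed.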



Trajectory-based feature
• Trajectory shape
  – Calculating flow between frames
  – Scale normalization

$\Delta P_t = (P_{t+1} - P_t) = (x_{t+1} - x_t,\; y_{t+1} - y_t)$
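
The trajectory shape feature built from these frame-to-frame displacements can be sketched as follows. The slide only says "scale normalization"; dividing by the sum of displacement magnitudes, as in the original DT paper [Wang+, IJCV2013], is assumed here, and the function name is hypothetical.

```python
import numpy as np

def trajectory_shape(points):
    """points: (L+1, 2) array of tracked (x, y) positions.
    Returns the 2L-dimensional normalized displacement vector."""
    points = np.asarray(points, dtype=float)
    deltas = np.diff(points, axis=0)              # P_{t+1} - P_t for each t
    total = np.linalg.norm(deltas, axis=1).sum()  # sum of displacement magnitudes
    if total == 0:
        return np.zeros(deltas.size)              # static trajectory
    return (deltas / total).ravel()
```

With this normalization, the same motion pattern observed at different scales maps to the same descriptor.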

Shape-based feature
• HOG, SIFT

Edge-orient., mag. from block representation with overlapping and normalization

Edge-shape from background

Simply divided 4x4 blocks

Motion
• HOF, MBHx, MBHy, MIP

Block optical flow extraction


Trinary (-1, 0, +1) coding from block flow direction [Kliper-Gross+, ECCV2012]
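
Of the motion features listed above, HOF and MBH can be sketched with one shared orientation histogram. This is a simplified illustration under assumptions: 8 orientation bins, magnitude weighting, no block division or normalization, and no extra zero-motion bin for HOF. MIP is not covered. All function names are hypothetical.

```python
import numpy as np

def orientation_histogram(dx, dy, n_bins=8):
    """Magnitude-weighted histogram of the orientation of the field (dx, dy)."""
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx) % (2.0 * np.pi)  # angle in [0, 2*pi)
    bins = np.floor(ang / (2.0 * np.pi) * n_bins).astype(int) % n_bins
    return np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)

def hof(flow, n_bins=8):
    """Histogram of optical flow: orientations of the flow vectors."""
    return orientation_histogram(flow[..., 0], flow[..., 1], n_bins)

def mbh(flow, n_bins=8):
    """Motion boundary histograms: orientations of the spatial gradient of
    each flow component, returned as (MBHx, MBHy)."""
    out = []
    for c in (0, 1):
        gy, gx = np.gradient(flow[..., c].astype(float))
        out.append(orientation_histogram(gx, gy, n_bins))
    return out[0], out[1]
```

Because MBH works on the gradient of the flow, uniform motion such as a camera pan contributes nothing to it, while HOF still responds to that motion.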

Texture
• HLAC, LBP, iLBP, LTP

Higher-order local auto-correlation: 0th-, 1st-, 2nd-order patterns
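
Of the texture features listed above, the basic LBP is the simplest to sketch. The bit ordering, the `>=` comparison, and the function names are assumptions for illustration; HLAC, iLBP, and LTP are not shown.

```python
import numpy as np

def lbp_codes(gray):
    """8-neighbor LBP code for every interior pixel of a 2-D array."""
    g = np.asarray(gray, dtype=float)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    # Neighbor offsets (dy, dx) in clockwise order starting at the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit where the neighbor is at least as bright as the center.
        codes += (neigh >= center) * (1 << bit)
    return codes

def lbp_histogram(gray):
    """256-bin L1-normalized histogram of LBP codes."""
    hist = np.bincount(lbp_codes(gray).ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```

In the DT setting, such a histogram would be computed inside the patch around each tracked point rather than over the whole frame.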

Co-occurrence
• Extended co-occurrence feature (ECoHOG)
  – Feature
    • CoHOG [Watanabe, PSIVT2009] (pair-count), ECoHOG (edge-magnitude accum.)
    • PCA for codeword
    • DT + co-occurrence features (62.4%) > DT (59.2%) on MPII cooking
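
The difference between pair counting (CoHOG) and edge-magnitude accumulation (ECoHOG) can be sketched for a single pixel offset. This is a simplified illustration: accumulating the sum of the two edge magnitudes for ECoHOG is an assumption about the exact weighting, the 8-bin quantization is arbitrary, and block division, multiple offsets, and the PCA step are omitted. The function name is hypothetical.

```python
import numpy as np

def cooccurrence_histogram(gray, offset=(0, 1), n_bins=8, use_magnitude=False):
    """Co-occurrence matrix of quantized gradient orientations for one offset.

    use_magnitude=False: CoHOG-style pair counting (+1 per pixel pair).
    use_magnitude=True:  ECoHOG-style accumulation of the two edge magnitudes.
    """
    gy, gx = np.gradient(np.asarray(gray, dtype=float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2.0 * np.pi)
    ori = np.floor(ang / (2.0 * np.pi) * n_bins).astype(int) % n_bins
    dy, dx = offset
    h, w = ori.shape
    # Overlapping views: 'a' is the reference pixel, 'b' its offset partner.
    ys, ye = max(0, -dy), min(h, h - dy)
    xs, xe = max(0, -dx), min(w, w - dx)
    a_ori, b_ori = ori[ys:ye, xs:xe], ori[ys + dy:ye + dy, xs + dx:xe + dx]
    a_mag, b_mag = mag[ys:ye, xs:xe], mag[ys + dy:ye + dy, xs + dx:xe + dx]
    weights = (a_mag + b_mag) if use_magnitude else np.ones_like(a_mag)
    hist = np.zeros((n_bins, n_bins))
    # np.add.at accumulates correctly when index pairs repeat.
    np.add.at(hist, (a_ori.ravel(), b_ori.ravel()), weights.ravel())
    return hist
```

With magnitude weighting, strong edges dominate the histogram, whereas pair counting treats weak and strong edges alike.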


Experiments
• Evaluation of 13 features in dense trajectory framework
  – 4 different datasets
    • Traffic scene (NTSEL dataset): 4 classes
    • Surgery (INRIA surgery): 4 classes
    • Daily living (MSR daily activity 3D): 12 classes
    • Sports (UCF50): 50 classes

Results on the 4 datasets
• High-performance features
  – Top three features at each dataset
  – 4 different scenes

Results on the 4 datasets
• High-performance features
  – CoHOG, SIFT, MBH
  – CoHOG achieves stable accuracy across all datasets

Detailed performance rate
• The best features depend on the recognition task!
  – We need to experimentally concatenate several features
  – Feature concatenation on the NTSEL and INRIA surgery

Rate of feature concatenation
• Baseline, 5 categories and concatenated vector
  – Baseline: DT + BoW model
  – Motion and co-occurrence feature
  – No need to apply all features
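
Feature concatenation can be sketched as joining the per-feature BoW histograms of a chosen subset. The per-feature L1 normalization, the dictionary layout, and the function name are assumptions for illustration; how the subset is chosen is left to the experiment.

```python
import numpy as np

def concatenate_features(histograms, selected):
    """Concatenate the L1-normalized BoW histograms of the selected features.

    histograms: dict mapping feature name -> 1-D histogram
    selected:   feature names to include, in order
    """
    parts = []
    for name in selected:
        h = np.asarray(histograms[name], dtype=float)
        s = h.sum()
        # Normalize each feature separately so no single feature dominates.
        parts.append(h / s if s > 0 else h)
    return np.concatenate(parts)
```

Selecting only some of the names reflects the point above that not all features need to be applied.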

Conclusion
• We evaluated 13 features in the framework of DT
  – For more effective activity recognition
  – 4 different scenes, one per dataset
  – Detailed evaluation and concatenated vectors
  – Top-N ranked concatenation is needed for activity recognition

Feature extraction
Around trajectories

  – Extraction of 13 features in ST-patch
  – 2 (x dir.) x 2 (y dir.) x 3 (t dir.) region
  – Calculating features with bag-of-words (BoW)
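
The 2 x 2 x 3 division of the ST-patch can be sketched as a mapping from a sample position inside the patch to one of 12 cells. The patch size, the trajectory length, the cell ordering, and the function name are assumptions for illustration; the slides only give the 2 x 2 x 3 layout.

```python
def cell_index(x, y, t, size, length, nx=2, ny=2, nt=3):
    """Index (0 .. nx*ny*nt - 1) of the cell containing a sample.

    x, y: pixel offsets inside the patch, in [0, size)
    t:    frame offset along the trajectory, in [0, length)
    """
    cx = min(int(x * nx / size), nx - 1)
    cy = min(int(y * ny / size), ny - 1)
    ct = min(int(t * nt / length), nt - 1)
    # Cells are ordered by time block first, then row, then column.
    return (ct * ny + cy) * nx + cx
```

A feature histogram would be accumulated per cell and the 12 cell histograms concatenated to describe the patch.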

ST-patch and xyt block extraction

Extraction of 13 features

