M. S. Ryoo and J. K. Aggarwal ICCV2009

Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of Complex Human Activities


Introduction

• Human activity recognition, the automated detection of ongoing activities from video, is an important problem.

• This technology can be used in surveillance systems, robots, and human-computer interfaces.

• In surveillance systems, automatically detecting violent activities is especially important.

Introduction

• Spatio-temporal feature-based approaches have been proposed by many researchers.

• These methods have been successful on short videos containing simple actions such as “walking” and “waving”.

• In real-world applications, actions and activities are seldom this simple.

Related works

• Methods that focus on tracking persons and their bodies have been developed [4, 11], but their results rely on background subtraction.

• Approaches that analyze a 3-D XYT volume have gained particular interest in the past few years [3, 5, 6, 9, 13, 16]; they extract relationships among features and train a model.

• [3] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In IEEE International Workshop on VS-PETS, pages 65–72, 2005.

• [4] S. Hongeng, R. Nevatia, and F. Bremond. Video-based event recognition: activity representation and probabilistic recognition methods. CVIU, 96(2):129–162, 2004.

• [5] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for action recognition. In ICCV, 2007.

• [6] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.

• [9] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 79(3), Sep 2008.

• [11] M. S. Ryoo and J. K. Aggarwal. Semantic representation and recognition of continued and recursive human activities. IJCV, 82(1):1–24, April 2009.

• [13] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In ICPR, 2004.

• [16] S.-F. Wong, T.-K. Kim, and R. Cipolla. Learning motion categories using both semantic and structural information. In CVPR, 2007.

Related works

• In this paper, we propose a new spatio-temporal feature-based methodology.

• Kernel functions are built on the relationships between features.

• After training, the match function is used to match test data against the training data.

Example matching result

Spatio-temporal relationship match

• The method is based on matching two videos and outputting a real number as the result.

• K : V × V → R
• V: input videos, R: result (a real number)

Features and their relations

• A spatio-temporal feature extractor [3, 14] detects interest points, each located at a salient change in the video.


• f = (f_des, f_loc)
• f_des: descriptor, f_loc: 3-D coordinate
• The features are clustered into k types using k-means on f_des.


• Each f_loc has n elements, f_loc^1, …, f_loc^n.
• A set of relation types describes the temporal relations:


• Spatial relations are described below:
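Pairwise relations of this kind can be illustrated with simple predicates. The sketch below is not the slide's missing figure, and it rests on assumptions the slides do not state: each feature carries a temporal interval (start, end) and an (x, y) position, and `TOL` and `NEAR_DIST` are hypothetical thresholds. It shows only the two temporal relations these slides name elsewhere ("before" and "equals"), plus a hypothetical `near` spatial relation, and counts how often each (type, type, relation) triple occurs in a video.

```python
from collections import Counter, namedtuple
from itertools import permutations
from math import hypot

TOL = 2            # hypothetical tolerance (frames) for "equals"
NEAR_DIST = 20.0   # hypothetical distance threshold (pixels) for "near"

# kind is the cluster index from k-means; the other fields are assumed.
Feature = namedtuple("Feature", "kind start end x y")

def before(a, b):
    """a's interval ends before b's interval starts."""
    return a.end < b.start

def equals(a, b):
    """a and b start and end at roughly the same time."""
    return abs(a.start - b.start) <= TOL and abs(a.end - b.end) <= TOL

def near(a, b):
    """a and b are spatially close (hypothetical spatial relation)."""
    return hypot(a.x - b.x, a.y - b.y) <= NEAR_DIST

RELATIONS = {"before": before, "equals": equals, "near": near}

def relation_histogram(features):
    """Count (kind_a, kind_b, relation) triples over all ordered pairs."""
    hist = Counter()
    for a, b in permutations(features, 2):
        for name, holds in RELATIONS.items():
            if holds(a, b):
                hist[(a.kind, b.kind, name)] += 1
    return hist
```

A histogram like this gives each video a fixed-form summary of its feature relationships, which is one way two videos of different lengths could be compared.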


Human activity recognition

• Our system maintains one training dataset D_α per activity α.

• Let D_α^m be the data extracted from the m-th training video in the set D_α; the matching function is then applied between it and the test video.
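One way to read this slide is as a best-match classifier: the test video is compared with every training video of every activity, and the best-scoring activity is reported. The sketch below assumes each video has already been summarized as a histogram of counts, and it uses histogram intersection as a stand-in for the matching function. The names `match` and `recognize`, the `threshold` parameter, and the choice of taking the maximum over training videos are all hypothetical and not taken from the slides.

```python
from collections import Counter

def match(h1, h2):
    """Histogram intersection: sum of bin-wise minima (an assumed kernel)."""
    return sum(min(count, h2.get(key, 0)) for key, count in h1.items())

def recognize(test_hist, training_sets, threshold=1):
    """Return the activity whose training videos best match the test video.

    training_sets maps an activity name to a list of histograms, one per
    training video. Returns None if no score reaches the threshold.
    """
    best_activity, best_score = None, float("-inf")
    for activity, histograms in training_sets.items():
        # Score an activity by its best-matching training video.
        score = max((match(test_hist, h) for h in histograms),
                    default=float("-inf"))
        if score > best_score:
            best_activity, best_score = activity, score
    return best_activity if best_score >= threshold else None
```

The threshold lets the classifier answer "no known activity" when nothing matches well, which matters for continuous video where most frames contain none of the trained activities.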

Localization

Hierarchical recognition

• Low-level actions can be combined into high-level actions.

• For instance, a hand-shake includes two sub-actions, “arm stretching” and “arm withdrawing”.

• A hand-shake may be detected with rules such as: st1 before wd1, st2 before wd2, st1 equals st2, wd1 equals wd2 (st = arm stretching, wd = arm withdrawing; the indices 1 and 2 denote the two persons).
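The hand-shake rule above can be written as a conjunction of interval predicates. This is a minimal sketch, assuming each detected sub-action is a (start, end) pair of frame numbers and that "equals" tolerates a small offset; `TOL` and the function names are hypothetical.

```python
TOL = 5  # hypothetical tolerance, in frames, for "equals"

def before(a, b):
    """Interval a ends before interval b starts."""
    return a[1] < b[0]

def equals(a, b):
    """Intervals a and b start and end at roughly the same time."""
    return abs(a[0] - b[0]) <= TOL and abs(a[1] - b[1]) <= TOL

def is_hand_shake(st1, wd1, st2, wd2):
    """Check the four rules; st = arm stretching, wd = arm withdrawing."""
    return (before(st1, wd1) and before(st2, wd2)
            and equals(st1, st2) and equals(wd1, wd2))
```

For example, `is_hand_shake((0, 10), (30, 40), (2, 11), (31, 42))` holds because both persons stretch and withdraw at nearly the same times, while shifting the second person's intervals far later breaks the "equals" rules.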

Experiments

• The experiments use the UT-Interaction dataset.
• The actions are performed by actors; each video contains shaking hands, pointing, hugging, pushing, kicking, and punching.

Conclusion

• This method relies on the extracted features and the spatio-temporal relationships among them.

• It can hierarchically detect high-level actions.
• Detection can fail on unusual feature combinations.
