Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of
Complex Human Activities
M. S. Ryoo and J. K. Aggarwal, ICCV 2009
Introduction
• Human activity recognition, the automated detection of ongoing activities from video, is an important problem.
• This technology can be used in surveillance systems, robots, and human-computer interfaces.
• In surveillance systems, automatically detecting violent activities is especially important.
Introduction
• Spatio-temporal feature-based approaches have been proposed by many researchers.
• These methods have been successful on short videos containing simple actions such as “walking” and “waving”.
• In real-world applications, actions and activities are seldom this simple.
Related works
• Methods focused on tracking persons and bodies have been developed [4,11], but their results rely on background subtraction.
• Approaches that analyze a 3-D XYT volume have gained particular attention in the past few years [3,5,6,9,13,16]; they extract relationships among features and train a model.
• [3] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In IEEE International Workshop on VS-PETS, pages 65–72, 2005.
• [4] S. Hongeng, R. Nevatia, and F. Bremond. Video-based event recognition: activity representation and probabilistic recognition methods. CVIU, 96(2):129–162, 2004.
• [5] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for action recognition. In ICCV, 2007.
• [6] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
• [9] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 79(3), Sep 2008.
• [11] M. S. Ryoo and J. K. Aggarwal. Semantic representation and recognition of continued and recursive human activities. IJCV, 82(1):1–24, April 2009.
• [13] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In ICPR, 2004.
• [16] S.-F. Wong, T.-K. Kim, and R. Cipolla. Learning motion categories using both semantic and structural information. In CVPR, 2007.
Related works
• In this paper, we propose a new spatio-temporal feature-based methodology.
• Kernel functions are built on the relationships between features.
• After features are extracted from the training videos, the match function is used to match test data against them.
Example matching result
Spatio-temporal relationship match
• The method matches two videos and outputs a real number as the result.
• K : V × V → R
• V: input video; R: real-valued result
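The shape of such a function K can be sketched as follows. This is only an illustration under stated assumptions: each video is assumed to be already summarized as a histogram keyed by a hypothetical (feature type, feature type, relation) triple, and the score is taken to be a histogram intersection. The slide itself states only that K maps a pair of videos to a real number.

```python
from collections import Counter


def match_score(hist_a, hist_b):
    """Histogram intersection of two relation histograms (assumed representation).

    Each histogram maps a hypothetical (type_i, type_j, relation) key to a count.
    The result is a real number that grows with the number of shared relations.
    """
    return float(sum(min(count, hist_b.get(key, 0)) for key, count in hist_a.items()))


# Toy histograms; the keys and counts are made up for illustration.
video_a = Counter({("t1", "t2", "before"): 3, ("t1", "t1", "near"): 1})
video_b = Counter({("t1", "t2", "before"): 2, ("t2", "t2", "near"): 4})
score = match_score(video_a, video_b)  # only the shared "before" relation contributes
```

An intersection-style score is symmetric in its two arguments, which is a convenient property for a function used as a kernel.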
Features and their relations
• A spatio-temporal feature extractor [3,14] detects interest points, each locating a salient change in the video.
Features and their relations
• f = (f_des, f_loc)
• f_des: descriptor; f_loc: 3-D coordinate
• The features are clustered into k types using k-means on f_des.
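The clustering step can be sketched with a plain k-means loop over descriptor vectors. This is a minimal, dependency-free illustration, not the authors' implementation: the function name `kmeans`, the squared Euclidean distance, the random initialization, and the fixed iteration count are all assumptions.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Cluster descriptor vectors into k types; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct input descriptors (assumed strategy).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each descriptor takes the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Toy 2-D descriptors forming two well-separated groups (made-up values).
descriptors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, feature_types = kmeans(descriptors, k=2)
```

The returned label of each descriptor plays the role of the feature's type; real descriptors would be much higher-dimensional than the toy vectors used here.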
Features and their relations
• Each video yields n feature locations, f_loc^1, …, f_loc^n.
• Several types of predicates describe temporal relations:
Features and their relations
• Spatial relations are described below:
Human activity recognition
• Our system maintains one training dataset D_α per activity α.
• Let D_α^m be the data extracted from the m-th training video in the set D_α; the matching function is then used to compare it with the test video.
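The recognition loop described above can be sketched as follows. The decision rule is an assumption made for illustration: the test video is scored against every training video of an activity, the best score is kept, and the activity is reported when that score reaches a threshold. The names `recognize`, `match`, and `threshold` are hypothetical, and the matching function is passed in as a parameter rather than fixed.

```python
def recognize(test_video, datasets, match, threshold):
    """Return {activity: best score} for activities whose best match reaches threshold.

    datasets maps each activity name to its list of training videos (one set per activity).
    match is any function taking two videos and returning a real number.
    """
    detected = {}
    for activity, training_videos in datasets.items():
        # Best match between the test video and any training video of this activity.
        best = max((match(test_video, v) for v in training_videos), default=0.0)
        if best >= threshold:
            detected[activity] = best
    return detected


# Toy stand-ins: a "video" is a set of labels, and the match counts shared labels.
def toy_match(a, b):
    return float(len(set(a) & set(b)))


toy_datasets = {"push": [{"x", "y"}, {"x"}], "hug": [{"z"}]}
result = recognize({"x", "y", "w"}, toy_datasets, toy_match, threshold=2.0)
```

Taking the maximum over training videos is only one possible aggregation; summing or averaging the scores would fit the same interface.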
Localization
Hierarchical recognition
• We can combine low-level actions into high-level actions.
• For instance, a hand-shake includes two sub-actions, “arm stretching” and “arm withdrawing”.
• Detecting a hand-shake may look like: st1 before wd1, st2 before wd2, st1 equals st2, wd1 equals wd2.
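The hand-shake rule above can be written as a small check over time intervals. This is a sketch under assumptions: each detected sub-action is represented as a (start, end) frame interval, "before" and "equals" are given the usual interval meanings, and the optional tolerance `tol` is a hypothetical addition to absorb small timing differences.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Time span of a detected sub-action, in frames (assumed representation)."""
    start: int
    end: int


def before(a, b):
    """True if a ends strictly before b starts."""
    return a.end < b.start


def equals(a, b, tol=0):
    """True if a and b share start and end times, within tol frames."""
    return abs(a.start - b.start) <= tol and abs(a.end - b.end) <= tol


def is_hand_shake(st1, wd1, st2, wd2, tol=0):
    """st = arm stretching, wd = arm withdrawing; suffix 1 or 2 is the person."""
    return (
        before(st1, wd1)
        and before(st2, wd2)
        and equals(st1, st2, tol)
        and equals(wd1, wd2, tol)
    )


# Both people stretch during frames 0-10 and withdraw during frames 15-25.
shake = is_hand_shake(Interval(0, 10), Interval(15, 25), Interval(0, 10), Interval(15, 25))
```

Other high-level activities could be described the same way by combining different sub-actions and predicates.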
Experiments
• The dataset is the UT-Interaction dataset.
• The actions are performed by actors; each video contains shake hands, point, hug, push, kick, and punch.
Conclusion
• This method relies on the extracted features and the spatio-temporal relationships among them.
• It can hierarchically detect high-level actions.
• It mis-detects on unusual feature combinations.