Page 1:

Activity Recognition

Ram Nevatia
Presents work of

F. Lv, P. Natarajan and V. Singh
Institute of Robotics and Intelligent Systems

Computer Science Department
Viterbi School of Engineering

University of Southern California

Page 2:

Activity Recognition: Motivation

• Activities are the key content of a video (along with scene description)

• Useful for:
  • Monitoring (alerts)
  • Indexing (forensic, deep analysis, entertainment…)
  • HCI
  • …

Page 3:

Issues in Activity Recognition

• Inherent ambiguities of 2-D videos
• Variations in image/video appearance due to changes in viewpoint, illumination, clothing (texture)…

• Variations in style: different actors or even the same actor at different times

• Reliable detection and tracking of objects, especially those directly involved in activities

• Temporal segmentation
  • Most work assumes a single activity in a given clip

• “Recognition” of novel events

Page 4:

Possible Approaches

• Match video signals directly
  • Dynamic time warping

• Extract spatio-temporal features and classify based on them
  • Bag of words, histograms, "clouds"…
  • Work of Laptev et al.

• Most earlier work assumes action segmentation (detection vs classification)

• Andrew’s talk on use of localization and tracking

• Structural approach
  • Based on detection of objects, their tracks and relationships
  • Requires the ability to perform the above operations

Page 5:

Event Hierarchy

• Composite events
  • Compositions of other, simpler events
  • Composition is usually, but not necessarily, a sequence operation, e.g. getting out of a car, opening a door and entering a building
  • Form a natural hierarchy (or lattice)

• Primitive events: those we choose not to decompose, e.g. walking
  • Recognized directly from observations

• Graphical models, such as HMMs and CRFs, are natural tools for recognition of composite events.

Page 6:

Key Ideas

– Only need a few primitive actions in any domain:
  – Sign Language: Moves and Holds
  – Human Pose Articulation: Rotate, Flex and Pause
  – Rigid Objects (cars, people): Translate, Rotate, Scale
– Can be represented symbolically using formal rules
  – Composite actions can be represented as combinations of the primitive actions
– Handle uncertainty and error in video by mapping rules to graphical models:
  – HMM
  – DBN
  – CRF

Page 7:

Graphical Models

• A network, normally used to represent temporal evolution of a state

• The next state depends only on the previous state; the observation depends only on the current state (a single state variable)

• A typical task is to estimate the most likely state sequence given an observation sequence (the Viterbi algorithm)

[Figure: an HMM and a CRF]
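The Viterbi algorithm mentioned above fits in a few lines. Below is a minimal log-space sketch of Viterbi decoding for a discrete HMM; the variable names and setup are illustrative, not from the talk.

```python
# Minimal Viterbi decoding for a discrete HMM (illustrative sketch).
import numpy as np

def viterbi(pi, A, B, obs):
    """pi: initial state probs (S,); A: transitions (S,S);
    B: emissions (S,V); obs: list of observation indices."""
    S, T = len(pi), len(obs)
    delta = np.zeros((T, S))            # best log-prob of a path ending in s at t
    psi = np.zeros((T, S), dtype=int)   # backpointers
    logA, logB = np.log(A), np.log(B)
    delta[0] = np.log(pi) + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA        # scores[i, j]: state i -> j
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(S)] + logB[:, obs[t]]
    path = [int(np.argmax(delta[-1]))]               # backtrack best sequence
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```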

Page 8:

Mid vs Near Range

• Mid-range
  • Limbs of the human body, particularly the arms, are not distinguishable

  • A common approach is to detect and track moving objects and make inferences based on trajectories

• Near-range
  • Hands/arms are visible; activities are defined by pose transitions, not just position transitions

  • Pose tracking is difficult; top-down methods are commonly used

Page 9:

Mid-Range Example

• Example of abandoned luggage detection
  • Based on trajectory analysis and simple object detection/recognition
  • Uses a simple Bayesian classifier and logical reasoning about the order of sub-events
  • Tested on PETS, ETISEO and TRECVID data
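As a concrete illustration of this kind of logical reasoning over sub-events, here is a hedged sketch of an abandoned-luggage rule over trajectories. The rule, thresholds, and names are hypothetical; the talk's Bayesian formulation may differ.

```python
# Hypothetical abandoned-luggage rule over trajectories (illustrative only).
def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def abandoned(luggage, owner, t, dist_m=3.0, dwell_s=30.0, fps=25):
    """Flag the luggage at frame t if, for the last dwell_s seconds, it has
    stayed (nearly) still while its owner stayed more than dist_m away.
    luggage/owner: per-frame ground-plane positions (metric coordinates)."""
    frames = int(dwell_s * fps)
    if t < frames:
        return False
    recent = range(t - frames, t)
    still = all(dist(luggage[i], luggage[t]) < 0.2 for i in recent)
    apart = all(dist(luggage[i], owner[i]) > dist_m for i in recent)
    return still and apart
```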

Page 10:

Top-Down Approaches

• Bottom-up methods remain slow and are not robust; many methods rely on multiple video streams

• An alternative is a top-down approach, where processing is driven by event models

• Simultaneous Tracking and Action Recognition (STAR)
  • In analogy with SLAM in robotics

  • Provides action segmentation in addition to recognition
  • Closed-world assumption
  • Current work is limited to single-actor actions

Page 11:

Activity Recognition w/o Tracking

[Figure: input sequence → 3D body poses → action segments labeled check watch, punch, kick, pick up, throw]

Page 12:

Difficulties

• Viewpoint change & pose ambiguity (with a single camera view)
• Spatial and temporal variations (style, speed)

Page 13:

Key Poses and Action Nets

• Key poses are determined from MoCap data by an automatic method that computes large changes in energy; key poses may be shared among different actions
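A minimal sketch of how such energy-based key-pose selection might work, assuming motion energy is the summed magnitude of joint velocities (the talk's exact energy definition is not given here):

```python
# Hedged sketch: select key-pose frames from MoCap where motion energy
# changes sharply. The energy definition here is an assumption.
import numpy as np

def key_pose_frames(joints, thresh):
    """joints: (T, J, 3) array of joint positions over T frames.
    Returns frame indices where the energy change exceeds thresh."""
    vel = np.diff(joints, axis=0)                      # per-frame joint velocities
    energy = np.linalg.norm(vel, axis=2).sum(axis=1)   # scalar energy per frame
    change = np.abs(np.diff(energy))                   # frame-to-frame energy change
    return np.nonzero(change > thresh)[0] + 1
```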

Page 14:

Experiments: Training Set

• 15 action models
• 177 key poses
• 6372 nodes in the Action Net

Page 15:

Action Net: Apply constraints

[Figure: Action Net with constraints applied (0°, 10°)]

Page 16:

Experiments: Test Set

• 50 clips, average length 1165 frames
• 5 viewpoints
• 10 actors (5 men, 5 women)

Page 17:

A Video Result

[Video result: original frame; extracted blob & ground truth; with action net; without action net]

Page 18:

Working with Natural Environments

• Reduce reliance on good foreground segmentation
  • Key poses may not be discriminative enough without accurate segmentation; include models for motion between key poses

• More general graphical models that include:
  • Hierarchy
  • Transition probabilities that may depend on observations
  • Observations that may depend on multiple states
  • Duration models (HMMs imply an exponential duration decay; see the sketch after this list)

• Remove need for MoCap data to acquire models
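The duration point deserves one line of arithmetic: a state with self-transition probability a stays exactly d frames with probability a^(d-1)(1-a), which decays geometrically. A tiny sketch:

```python
# Why HMM state durations decay exponentially (geometrically):
# with self-transition probability a, P(stay exactly d frames) = a**(d-1) * (1 - a).
a = 0.9
print([round(a ** (d - 1) * (1 - a), 5) for d in range(1, 6)])
# -> [0.1, 0.09, 0.081, 0.0729, 0.06561], strictly decreasing;
# an explicit duration model can instead peak at a typical action length.
```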

Page 19:

Composite Event Representation

P1: Rotate(Right Arm, 90°, z-axis)
P2: Rotate(Right Arm, 90°, −z-axis)

CE: Sequence(P1,P2)
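A minimal sketch of how this symbolic representation might be encoded; the class and field names are hypothetical, not from the talk.

```python
# Hypothetical encoding of primitive and composite events.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Rotate:                 # primitive event
    part: str                 # e.g. "right_arm"
    angle_deg: float          # rotation magnitude
    axis: str                 # e.g. "+z" or "-z"

@dataclass
class Sequence:               # composite event: ordered sub-events
    events: Tuple

p1 = Rotate("right_arm", 90.0, "+z")
p2 = Rotate("right_arm", 90.0, "-z")
ce = Sequence((p1, p2))       # raise the arm, then lower it
```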

Page 20:

Learning Event Models

Composite Event = Sequence(P1, P2)

[Figure: primitive event models for P1 and P2]

Page 21:

Dynamic Bayesian Action Network

– Map action models to a Dynamic Bayesian Network

– Decompose a composite action into a sequence of primitive actions

– Each primitive is expressed in functional form fpe(s, s′, N), which maps the current state s to the next state s′ given parameters N.

– Assume a known, finite set of functions f for primitives.
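A hypothetical sketch of one such function, with s′ realized as the return value; the state representation and parameters are illustrative.

```python
# Hypothetical primitive fpe(s, s', N) for a Rotate primitive: advances a
# joint angle by a fixed rate per frame, so s' = fpe_rotate(s, N).
def fpe_rotate(state, params):
    axis, degrees_per_frame = params     # N = (axis, rate)
    next_state = dict(state)
    next_state[axis] = state[axis] + degrees_per_frame
    return next_state

pose = {"right_arm_z": 0.0}
pose = fpe_rotate(pose, ("right_arm_z", 3.0))   # -> {"right_arm_z": 3.0}
```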

Page 22:

Inference Overview

• Given a video, obtain the initial state distribution using the start key poses of all composite actions

• For each current state:
  • Predict the primitive based on the current duration
  • Predict a 3D pose given the primitive and current duration

  • Collect the observation potential of the pose using foreground overlap and the difference image

• Obtain the best state sequence using dynamic programming (Viterbi Algorithm)

• Features used to match models with observations
  • If "foreground" can be extracted reliably, we can use blob shape properties; otherwise, use edge and motion-flow matching
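An illustrative sketch of the foreground-overlap observation potential, here taken as intersection-over-union between the extracted foreground and the silhouette rendered from the predicted pose (the paper's exact potential may differ):

```python
# Illustrative observation potential from foreground overlap.
import numpy as np

def overlap_potential(fg_mask, pose_mask):
    """fg_mask, pose_mask: boolean arrays of the same shape."""
    inter = np.logical_and(fg_mask, pose_mask).sum()
    union = np.logical_or(fg_mask, pose_mask).sum()
    return inter / union if union > 0 else 0.0
```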

Page 23:

Pose Tracking & Action Recognition: Inference Algorithm

• Obtain state distributions by matching poses sampled from action models
• Infer the action by finding the maximum-likelihood state sequence

Page 24:

Observations

• Foreground overlap with the full-body model
• Difference-image overlap with the body parts in action

• Grid-of-centroids to match foreground blob with pose
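A hedged reading of the grid-of-centroids idea: split each mask into a coarse grid, compute per-cell foreground centroids, and compare the two centroid sets. The talk's exact formulation may differ.

```python
# Hedged sketch of grid-of-centroids matching between blob and pose masks.
import numpy as np

def grid_centroids(mask, rows=4, cols=4):
    h, w = mask.shape
    cents = []
    for i in range(rows):
        for j in range(cols):
            cell = mask[i * h // rows:(i + 1) * h // rows,
                        j * w // cols:(j + 1) * w // cols]
            ys, xs = np.nonzero(cell)
            # crude handling: empty cells contribute a zero centroid
            cents.append((ys.mean(), xs.mean()) if len(ys) else (0.0, 0.0))
    return np.array(cents)

def centroid_distance(blob_mask, pose_mask):
    return np.linalg.norm(grid_centroids(blob_mask) - grid_centroids(pose_mask))
```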

Page 25:

Results

• From the CVPR 2008 paper

Page 26:

Action Learning

• Involves two problems:
  • Model learning: learning the parameters N in the primitive event definition fpe(s, s′, N)
    – Key pose annotation and lifting
    – Pose interpolation
  • Feature weight learning: learning the weights w_k of the different potentials

Page 27:

Key Pose Annotation and 3D Lifting

Page 28:

Pose Interpolation

• All limb motions can be expressed in terms of Rotate(part, axis, θ).
• We need to learn axis and θ.
• This is simple to do given the start and end joints of part.
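Given the parent joint and the free joint's positions before and after the motion, axis and θ follow from a cross and a dot product. A minimal sketch (representing joints as 3D NumPy vectors is an assumption):

```python
# Recover (axis, theta) for a limb rotation from start/end joint positions.
import numpy as np

def limb_rotation(parent, free_start, free_end):
    """parent: fixed joint (e.g. shoulder); free_start/free_end: the moving
    joint (e.g. wrist) before and after the motion."""
    u = free_start - parent
    v = free_end - parent
    u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
    axis = np.cross(u, v)
    s = np.linalg.norm(axis)                 # |axis| = sin(theta)
    c = np.clip(np.dot(u, v), -1.0, 1.0)     # cos(theta)
    theta = np.arctan2(s, c)
    if s < 1e-8:                             # (near-)collinear: axis undefined
        return np.array([0.0, 0.0, 1.0]), theta
    return axis / s, theta
```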

Page 29:

Feature Weight Learning

• Feature weight estimation is posed as minimization of a log-likelihood error function.

• Learn the weights using the Voted Perceptron algorithm.

• Requires fully labeled training data, which is not available.

• We propose an extension to deal with partial annotations.

Latent State Voted Perceptron
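For reference, a minimal sketch of the structured-perceptron update at the core of the voted perceptron; `decode` and `features` are placeholder functions, not the talk's, and the voted variant additionally averages the weight vectors over all updates.

```python
# Minimal structured-perceptron epoch (core of the voted perceptron);
# `decode` and `features` are placeholders, not the talk's definitions.
import numpy as np

def perceptron_epoch(w, data, features, decode):
    """data: list of (obs, gold) pairs; features(obs, y) -> np.ndarray;
    decode(w, obs) -> highest-scoring label sequence under weights w."""
    for obs, gold in data:
        pred = decode(w, obs)
        if pred != gold:                       # mistake-driven update
            w = w + features(obs, gold) - features(obs, pred)
    return w
# The voted/averaged variant keeps a running average of w over all
# updates and uses that average at test time.
```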

Page 30:

Experiments

• Tested the method on 3 datasets:
  • Weizmann dataset
  • Gesture set with arm gestures
  • Grocery Store set with full-body actions

Dataset       | Train:Test Ratio | Action Recognition (% accuracy) | 2D Tracking (% error) | Speed (fps)
------------- | ---------------- | ------------------------------- | --------------------- | -----------
Weizmann      | 3:6              | 99.5                            | --                    | --
Gesture       | 3:5              | 90.18                           | 5.25                  | 8
Grocery Store | 1:7              | 100.0                           | 11.88                 | 1.6

Page 31:

Weizmann Dataset

• Popular dataset for action recognition

• 10 full body actions from 9 actors

• Each video has multiple instances of one action

Method                | Train:Test | Recognition Accuracy (%)
--------------------- | ---------- | ------------------------
Jhuang et al. [9]     | 6:3        | 98.8
Space-Time Shapes [6] | 8:1        | 100.0
Fathi et al. [5]      | 8:1        | 100.0
Sun et al. [20]       | 3:6        | 87.3
DBAN                  | 1:8        | 96.7
DBAN                  | 3:6        | 99.5

Page 32:

Gesture Dataset

• 5 instances of 12 gestures from 8 actors.

• Indoor lab setting
• 500 instances of all actions
• 852×480 pixel resolution; person height 200–250 pixels

Page 33:

Grocery Store Dataset

• Videos of 3 actions collected from a static camera
• 16 videos from 8 actors, performed at varying pan angles
• Actor height varies from 200–375 pixels, in 852×480-resolution videos

Page 34:

Incorporating Better Descriptors

• Previous work is based on weak lower-level analysis

• We can also evaluate 2D part models

[Figure: Dynamic Bayesian Action Network with a part model]

Page 35:

Experiments

• Hand gesture dataset in an indoor lab

• 5 instances of 12 gestures from 8 actors, total of 500 action segments

• Evaluation metrics:
  • Recognition rate over all action segments
  • 2D pose tracking as average 2D part accuracy over 48 randomly selected instances

Method     | Train:Test Ratio | Recognition (% accuracy) | 2D Tracking (% accuracy)
---------- | ---------------- | ------------------------ | ------------------------
DBAN-FGM   | 1:7              | 78.6                     | 75.67 (89.94)
DBAN-Parts | 1:7              | 84.52                    | 91.76 (92.66)

Page 36:

Summary and Conclusions

• The structural approach to activity recognition offers many attractions and challenges
  • Results are descriptive, but detecting and tracking objects is challenging

• Hierarchical representation is natural and can be used to reduce complexity

• Good bottom-up analysis remains a key to improved robustness

• Concept of “novel” or “anomalous” events remains difficult to formalize