
Spatio-Temporal Context of Mid-level Features for Activity Recognition

Fei Yuan (a,b), Gui-Song Xia (b), Hichem Sahbi (b), Veronique Prinet (a)

(a) NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, P.R.C.
(b) LTCI CNRS, Telecom ParisTech, Paris, France

Abstract

Local spatio-temporal features have been shown to be efficient and robust for representing simple actions. However, for complicated human activities with long-range motion or multiple interacting body parts and persons, the limitations of low-level features become apparent because of their local nature and their lack of context. This paper addresses the problem with a framework that both computes mid-level features and takes into account the context information between them.

First, we propose to represent human activities by a set of mid-level components, called activity components, which have consistent structure in the spatial domain and consistent motion in the temporal domain. The activity components are extracted hierarchically from videos: extracting key-points, grouping key-points into trajectories, and finally clustering trajectories into components. Second, to further exploit the interdependencies of the activity components, we introduce a spatio-temporal context kernel (STCK), which not only takes into account the local properties of features but also considers their spatial and temporal context information. Experimental results on two challenging activity datasets show that the proposed activity components outperform local spatio-temporal features, and that the STCK considerably improves activity recognition performance by taking into account the context between activity components.

Keywords: Activity recognition, mid-level features, activity components, spatio-temporal context kernels

Preprint submitted to Pattern Recognition July 21, 2011


1. Introduction

Human activity recognition is one of the most challenging problems in computer vision. By "activity", we refer to a higher-level combination of primitive actions with certain spatial and temporal relationships, e.g., hand shaking, hugging, eating food with silverware, etc. Compared with simple actions, such as walking and drinking, activity recognition is more complicated. Besides the general difficulties of action recognition, such as camera motion, illumination changes, occlusion, low intra-class similarity and large inter-class variability, the challenge of activity recognition stems from its structured nature: the complicated spatio-temporal interactions between a set of body parts or multiple persons.

Features based on local interest points, both 2-dimensional corner-like features [1, 2] and 3-dimensional space-time features [3, 4, 5], have been widely employed for action recognition, because they require no pre-processing (such as background subtraction, body modeling or motion estimation) during extraction and are robust to camera motion and illumination changes. They can be used to form sparse representations of actions and can be effectively integrated into machine learning frameworks. Impressive results have been reported in both synthetic and realistic scenarios, see [1, 2, 3, 4, 5, 6, 7, 8]. However, the limitations of low-level interest-point features become apparent when they are used to represent complicated activities with long-range motions or multiple interacting body parts, since they describe only the local information in a spatio-temporal volume, and representations based on them, e.g., bag-of-features (BOF), usually discard the geometrical and temporal relationships among features.

Much effort has been devoted to overcoming the limitations of local interest-point features [1, 8, 9, 10]. Bregonzio et al. [8] suggested using the global spatio-temporal distribution of the interest points, achieved by extracting holistic features from clouds of interest points accumulated over multiple temporal scales. Gilbert et al. [1] proposed to hierarchically group dense 2-dimensional corners in a neighbourhood in both space and time to produce an overcomplete compound feature set; frequently reoccurring patterns of features are then found by a data mining technique. As the level of the hierarchy increases, the mined compound features become more and more complex, sparse and discriminative. In [10], the authors introduced a spatio-temporal relationship matching kernel, designed to measure the structural similarity between two sets of features extracted from different


videos. The match explicitly compares temporal relationships (e.g., before and during) as well as spatial relationships (e.g., near and far) between the extracted features in the 3-dimensional XY-t space. Although these approaches improve the performance of local features to a certain extent, the problem is still far from being solved.

Instead of using local features, a second popular approach to action/activity recognition uses the global information of the image sequence, see [11, 12, 13, 14]. In these methods, a set of global templates is first built and activities are then categorized by matching these templates with test video sequences. However, it has been reported that it is not easy to obtain flexible global templates that capture the huge intra-class variations of human activities. To improve the flexibility of templates, [13] suggested a deformable action template model that learns a set of weighted primitives. In [14], the authors proposed to split the entire template into a set of parts, which are then matched individually.

At an intermediate level between local and global features, several mid-level feature-based approaches have been proposed. The first category is the pose-based approach, in which action recognition is driven by pose estimation. A representative human body model for pose estimation is the pictorial structures model [15], in which a pictorial structure is a collection of parts arranged in a deformable configuration: each part is represented by an appearance model and the deformable configuration is represented by spring-like connections between pairs of parts. The second category is the part-based approach. [16] introduced a discriminative part-based approach, in which a human action is modelled by a flexible constellation of parts. In [17], the authors proposed a hierarchical action model, in which the lower layer corresponds to local features and the higher layer describes a constellation of P parts; each part is associated with a bag of features, and the relative positions of the parts are taken into account.

Recently, trajectory-based methods, which extract long-term motion information of actions, have been employed in activity recognition [18, 19, 20, 2, 21]. For instance, in [2], trajectories were first extracted from each video clip by tracking 2-dimensional interest points and then used as the basic feature unit; finally, a bag-of-words model was formed to represent the holistic trajectories.

Context has also been considered an important cue for activity recognition. [9] proposed to learn the shapes of neighborhoods of space-time features that are discriminative for a given action category, and to recursively map the descriptors of the variable-sized neighborhoods to higher-level vocabularies; the process produces a hierarchy of space-time configurations. In [22], objects and human body parts were considered as mutual context, and a random field model was proposed to encode the mutual context of objects and human poses in "human-object interaction" activities. The model learning task was cast as a structure learning problem, by which the structural connectivity between objects, overall human poses, and different body parts was estimated.

In this paper, we propose to represent complex human activities by a set of mid-level features, called activity components. We define an activity component as a connected spatio-temporal part having consistent spatial structure and consistent motion in the temporal domain. As we shall see, most of these activity components have physical meanings; for instance, the activity components extracted from the "Shaking Hands" activity correspond to the extending arms and the shaking hands. We compute the activity components by grouping similar trajectories. First, key-points are extracted and tracked frame by frame to form a set of trajectories. Then, according to appearance and motion attributes, the set of trajectories is partitioned into several trajectory clusters to obtain activity components. Observe that shape is an important property of activity components, one that goes beyond the trajectories themselves. Finally, a hierarchical descriptor is proposed for each activity component to encode its appearance, shape and motion information.

It is worth noticing the difference between our activity components and part-based models, such as the hidden part model [16] and the constellation model [17]: in those models, parts are hidden or abstract, without physical meaning, whereas the components proposed in this paper are designed to correspond to certain moving physical body parts. Furthermore, we can automatically compute consistent activity components without relying on any part-based body model or other complicated models.

The second contribution of this paper is a spatio-temporal context kernel (STCK) that exploits the structural and dynamic properties of activity components. We argue that an activity is composed of several activity components, which depend on each other both in the spatial domain and in the temporal domain. Our preliminary work [23] investigated the pairwise relationships between activity components. With the STCK, one activity component is considered related to all other components in its neighbourhood, so the similarity of two activities is computed by relying on both the local properties of the components and their spatio-temporal context. Moreover, the STCK has a promising generalization property and can be plugged into SVMs for activity recognition.

The rest of this paper is organized as follows. Section 2 gives a detailed description of our approach for extracting and representing mid-level components. In Section 3, we present the spatio-temporal context kernels. We illustrate and interpret the experimental results in Section 4, and finally conclude the paper in Section 5.

2. Activity-Components: a mid-level activity representation

2.1. From trajectories to activity-components

For a given video $V = \{I_t\}_{t=1}^{T}$ of $T$ frames, denote by $\mathcal{P}^{(t)} = \{p_k^{(t)}\}_{k=1}^{K}$, with $p_k^{(t)} = (x_k^{(t)}, y_k^{(t)})$, the set of key-points in frame $I_t$. Let $\tau = (p_i^{(t)}, p_j^{(t+1)}, \ldots, p_k^{(t+\Delta)})$ be a trajectory of length $\Delta$, i.e., a list of matched key-points over $\Delta$ consecutive frames, and let $\mathcal{T} = \{\tau_n\}_{n=1}^{M}$ be the collection of all trajectories in $V$. Thus, the video $V$ can be represented in a natural and hierarchical way as

$$\{\mathcal{P}^{(t)}\}_{t=1}^{T} \longrightarrow \mathcal{T} \longrightarrow V. \tag{1}$$

Trajectory-based approaches start from this video representation [18, 19, 20, 21], where the trajectories are supposed to be independent and are used as the basic elements of activities. Each trajectory is described by its motion and the appearance information along it, and the videos are finally represented by a bag-of-words model (bag-of-trajectories).

This section presents a novel mid-level representation of activities that relies on the trajectories of a video. We argue that the trajectories $\mathcal{T}$ of a video $V$ can be partitioned into several different clusters, each of which corresponds to a certain meaningful part of the objects, such as a moving arm, and which can be taken as the basic elements of activities. In what follows, we call these physically meaningful clusters activity-components. An activity-component $c_l$ can be represented as a subset of trajectories

$$c_l = \{\tau_{n_1}, \tau_{n_2}, \ldots, \tau_{n_m}\}.$$

Moreover, different activity-components do not overlap, i.e., $c_l \cap c_q = \emptyset$ for $l \neq q$. Let $\mathcal{C} = \{c_l\}_{l=1}^{L}$ be the set of activity-components. The new


hierarchical representation of a video $V$ is

$$\{\mathcal{P}^{(t)}\}_{t=1}^{T} \longrightarrow \mathcal{T} \longrightarrow \mathcal{C} \longrightarrow V. \tag{2}$$

Notice that the representation in Equation (2) encodes mid-level information about activities, compared with that in Equation (1).

2.2. Computation of activity-components

Given a video, we take a bottom-up strategy to compute the activity-components. More precisely, trajectories are first extracted by linking densely sampled key-points, and are then grouped into different activity-components according to their properties.

2.2.1. Trajectory extraction

Similar to [2, 18, 19], trajectories are generated by tracking points frame by frame. However, in order to provide sufficient points to group similar motions into meaningful body parts, we propose to use densely sampled points from each frame rather than the spatially salient points, such as corners and SIFT key-points, used in previous trajectory-based methods [2, 18, 19]. The correspondence between key-points in successive frames is established by nearest-neighbor distance ratio matching [24].

To obtain reliable trajectories, two heuristics are used in the extraction step. (1) For a key-point $p = (x_k^{(t)}, y_k^{(t)})$ in frame $I_t$, we only match it with key-points located within a spatial window of size $N \times N$ around $p' = (x_k^{(t+1)}, y_k^{(t+1)})$ in the next frame $I_{t+1}$. The window size $N$ depends on the maximal expected velocity and is empirically set to 15 in our case. (2) We discard near-static and short trajectories, since still objects are of no interest in activity recognition and meaningful objects do not disappear quickly in natural activities. More precisely, for a trajectory $\tau = (p_i^{(t)}, p_j^{(t+1)}, \ldots, p_k^{(t+\Delta)})$ of length $\Delta$, define its average displacement as $\sigma_\tau = \langle \|p\|_2^2 \rangle_{p \in \tau} - \langle \|p\|_2 \rangle_{p \in \tau}^2$, with $\langle \cdot \rangle$ the averaging operator. The trajectory $\tau$ is kept only if its length $\Delta \geq 5$ and $\sigma_\tau$ is larger than a given threshold.
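As an illustration, the following Python sketch implements the two heuristics under stated assumptions: key-points come with generic descriptor vectors (the paper fixes only the ratio test of [24], not the descriptor used for matching), and the function names, the ratio value of 0.8 and the displacement threshold are ours, not the authors'.

```python
import numpy as np

def match_frame_pair(pts0, desc0, pts1, desc1, window=15, ratio=0.8):
    """Match key-points of frame t to frame t+1.

    pts0, pts1: (K, 2) arrays of (x, y) locations; desc0, desc1: (K, d)
    descriptor arrays. Candidates are restricted to an N x N spatial
    window (N = `window`) and accepted with the nearest-neighbor
    distance ratio test [24]. Returns {index in frame t: index in t+1}.
    """
    matches = {}
    for i in range(len(pts0)):
        # candidates inside the N x N window around the point
        near = np.where(np.all(np.abs(pts1 - pts0[i]) <= window / 2.0,
                               axis=1))[0]
        if len(near) == 0:
            continue
        d = np.linalg.norm(desc1[near] - desc0[i], axis=1)
        order = np.argsort(d)
        if len(near) == 1 or d[order[0]] < ratio * d[order[1]]:
            matches[i] = int(near[order[0]])
    return matches

def filter_trajectories(trajs, min_len=5, disp_thresh=1.0):
    """Keep trajectories that are long enough (Delta >= 5) and not
    near-static: sigma_tau = <||p||^2> - <||p||>^2 must exceed the
    threshold, as in Section 2.2.1."""
    kept = []
    for tr in trajs:
        pts = np.asarray(tr, dtype=float)
        if len(pts) < min_len:
            continue
        norms = np.linalg.norm(pts, axis=1)
        sigma = (norms ** 2).mean() - norms.mean() ** 2
        if sigma > disp_thresh:
            kept.append(pts)
    return kept
```

Chaining `match_frame_pair` over consecutive frames grows the trajectory set; `filter_trajectories` then removes the short and near-static ones.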

2.2.2. Trajectory grouping

Grouping the extracted trajectories $\mathcal{T} = \{\tau_n\}_{n=1}^{M}$ into consistent and descriptive motion parts, namely activity-components, is similar to motion segmentation from trajectories. However, current motion segmentation methods cannot provide satisfactory results in complex scenes, e.g., scenes with low motion contrast or cluttered backgrounds, since only the 2D or 3D location information of the points on the trajectories is taken into account [25]. In our case, we propose to use more discriminative features to describe the trajectories. Specifically, for a trajectory $\tau$, we consider the following three types of information:

- Location $\{p_\tau^{(t)} = (x_\tau^{(t)}, y_\tau^{(t)})\}_{t=t_\tau^s}^{t_\tau^e}$, where $t_\tau^s$ and $t_\tau^e$ are the starting and ending times of the trajectory;

- Displacement $\{\partial p_\tau^{(t)} = p_\tau^{(t)} - p_\tau^{(t+1)}\}_{t=t_\tau^s}^{t_\tau^e - 1}$, the location difference between two connected points on the trajectory;

- Brightness profiles along the trajectory: for a point $p_\tau^{(t)}$, the intensities inside a small patch of $I_t$ around $p_\tau^{(t)}$ are encoded by an $N$-bin histogram $\{h_\tau^{(t)}[n]\}_{n=1}^{N}$.

Here, brightness information, rather than HOG or SIFT, is used to reinforce the clustering of trajectories. The reason is that points belonging to the same component usually do not share similar gradient information, especially points around the edge of a component compared with points within the component, whereas the brightness within a component is usually consistent. We confirm in our experiments that the brightness histogram is better than HOG or SIFT for clustering the trajectories.

Thus, the distance between trajectories $\tau$ and $\upsilon$ is defined as

$$\mathrm{dist}(\tau, \upsilon) = \sum_{t=t_1}^{t_2} \|p_\tau^{(t)} - p_\upsilon^{(t)}\|_2 + \alpha_1 \sum_{t=t_1}^{t_2} \|\partial p_\tau^{(t)} - \partial p_\upsilon^{(t)}\|_2 + \alpha_2 \sum_{t=t_1}^{t_2} \exp\Big\{-\sum_{n=1}^{N} \min\big(h_\tau^{(t)}[n],\, h_\upsilon^{(t)}[n]\big)\Big\}, \tag{3}$$

where $t_1 = \max(t_\tau^s, t_\upsilon^s)$ and $t_2 = \min(t_\tau^e, t_\upsilon^e)$ are the start and end frames of the overlap between the two trajectories, $\|\cdot\|_2$ is the L2-norm, and $\alpha_1, \alpha_2$ are weights that balance the contributions of the different features. The first term of Equation (3) is the spatial distance between the two trajectories, which favors trajectories that are spatially close to each other; the second term measures the motion dissimilarity of the trajectories; and the third term is the intensity dissimilarity between the two trajectories. Moreover, in order to compare trajectories of different lengths, we finally normalize the distance in Equation (3) by the overlapping length $|t_1 - t_2|$:

$$\mathrm{dist}'(\tau, \upsilon) = \frac{\mathrm{dist}(\tau, \upsilon)}{|t_1 - t_2|}. \tag{4}$$
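A minimal NumPy sketch of Equations (3)-(4) follows. The trajectory container (a dict with start frame 'ts', point array 'pts' and per-point brightness histograms 'hist') and the default weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def trajectory_distance(tau, ups, alpha1=1.0, alpha2=1.0):
    """Normalized trajectory distance of Equations (3)-(4).

    A trajectory is a dict with 'ts' (start frame), 'pts' ((L, 2) array
    of locations) and 'hist' ((L, N) array of per-point brightness
    histograms). Returns +inf when the trajectories do not overlap.
    """
    t1 = max(tau['ts'], ups['ts'])                            # overlap start
    t2 = min(tau['ts'] + len(tau['pts']),
             ups['ts'] + len(ups['pts'])) - 1                 # overlap end
    if t2 <= t1:
        return np.inf
    a = slice(t1 - tau['ts'], t2 - tau['ts'] + 1)
    b = slice(t1 - ups['ts'], t2 - ups['ts'] + 1)
    p, q = tau['pts'][a], ups['pts'][b]
    # spatial term: per-frame location differences
    d_loc = np.linalg.norm(p - q, axis=1).sum()
    # motion term: differences of frame-to-frame displacements
    d_mot = np.linalg.norm(np.diff(p, axis=0) - np.diff(q, axis=0),
                           axis=1).sum()
    # appearance term: exp(-histogram intersection) per frame
    d_app = np.exp(-np.minimum(tau['hist'][a],
                               ups['hist'][b]).sum(axis=1)).sum()
    dist = d_loc + alpha1 * d_mot + alpha2 * d_app
    return dist / (t2 - t1)                                   # Equation (4)
```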

Once the pairwise distance matrix of the trajectories is computed, the activity-components can be obtained by a clustering algorithm, such as the graph-based clustering method of [26], Normalized Cuts [27] or other spectral clustering techniques. In our work, we use the efficient graph-based clustering algorithm of [26], which generates components with accuracy similar to other clustering methods, such as Normalized Cuts [27], while running extremely fast. Specifically, each trajectory is taken as a node of a graph, and every two nodes are linked by an edge whose weight is the distance defined in Equation (3). The main principle of this algorithm is that edges between two vertices in the same component should have relatively low weights, while edges between vertices in different components should have higher weights. The approach relies only on a threshold to cut the edges, which is chosen empirically to give the best performance.
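The sketch below shows the flavor of the merging criterion of [26] applied to a precomputed trajectory distance matrix; the granularity parameter `k` and the fully connected graph formulation are our simplifications, not the exact implementation used in the paper.

```python
import numpy as np

def graph_based_clustering(dist, k=1.0):
    """Felzenszwalb-Huttenlocher-style merging [26] on a pairwise
    trajectory distance matrix; returns one cluster label per trajectory.
    `k` controls the granularity and plays the role of the empirical
    threshold mentioned above."""
    n = dist.shape[0]
    parent = list(range(n))
    size = [1] * n
    internal = [0.0] * n       # largest internal edge weight per component

    def find(x):               # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    iu, iv = np.triu_indices(n, 1)
    for e in np.argsort(dist[iu, iv]):       # edges by increasing weight
        u, v = find(iu[e]), find(iv[e])
        w = dist[iu[e], iv[e]]
        if u == v:
            continue
        # merge only if the edge is weak relative to the internal
        # variation of both components (plus the k / size term)
        if w <= min(internal[u] + k / size[u], internal[v] + k / size[v]):
            parent[v] = u
            size[u] += size[v]
            internal[u] = max(internal[u], internal[v], w)
    return np.array([find(i) for i in range(n)])
```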

Figure 1 gives several graphical illustrations of the activity-components computed from image sequences of the 6 activity classes in the UT-Interaction dataset [28]. Different activity-components are shown in different colors, and the key-points in the same activity-component are displayed in the same color. Observe that each extracted activity-component contains a collection of trajectories with consistent motion and appearance. The extracted activity-components have physical meanings; for instance, in Figure 1(a) the most meaningful extracted activity-components are the two shaking hands (in red and green) and the moving arms (in blue and light blue), which is consistent with human observation.

[Figure 1: Examples of extracted activity-components from the UT-Interaction dataset [28]: (a) Shaking Hands, (b) Hugging, (c) Kicking, (d) Pointing, (e) Pushing, (f) Punching. Only the largest components are shown. The left part of each subfigure shows the extracted components in the X-Y plane; the right part displays them in the 3-dimensional X-Y-t space, where the time information is included. Each activity-component contains a collection of trajectories with consistent motion and appearance; different colors correspond to different activity-components, and points in the same activity-component are displayed in the same color. Note the physical meaning of the extracted activity-components: in Figure 1(a), for instance, the most meaningful components are the two shaking hands (in red and green) and the moving arms (in blue and light blue).]

2.3. Description of activity-components

As demonstrated in Figure 1, the extracted activity-components correspond to certain physically moving objects or parts of activities, such as moving arms and shaking hands. In order to capture different aspects of an activity-component, we use three kinds of information: the appearance, the shape and the motion of the activity-component. The first two cues are encoded by bag-of-words models and the last one is described by a transition matrix. These three descriptors are finally concatenated to form our activity-component descriptor.

Appearance descriptor. As shown in Equation (2), the basic elements of our activity-components are key-points. For a given activity-component, each of its key-points is described by a 128-dimensional SIFT descriptor [29]. The appearance information of the activity-component is then encoded by a bag-of-words model. More precisely, a global codebook is first generated by clustering all the SIFT descriptors extracted at the key-points of all the activity-components in the training sequences (in our experiments, the k-means algorithm is used for clustering and the codebook size is empirically set to 300). Each SIFT descriptor is then quantized with the codebook, and an activity-component $c$ is finally represented by the empirical distribution of the quantized values, denoted $h_{\mathrm{appe}}(c)$.

Shape descriptor. The shape of an activity-component can be considered as a sequence of deforming 2-dimensional shapes in the spatial domain. In our case, the 2-dimensional spatial shape of an activity-component is represented by a cloud of key-points. For an activity-component $c$, we describe its shape information in frame $I_t$ in a way similar to Shape Context [30]. Specifically, as in Figure 2, the spatial centroid $p_c^{(t)} = (x_c^{(t)}, y_c^{(t)})$ of the key-point cloud in frame $I_t$ and the average radius $r_c^{(t)}$ around $p_c^{(t)}$ are computed. The X-Y plane is then divided into $5 \times 12$ log-polar partitions (5 intervals in the radial direction and 12 intervals in orientation) centered at the gravity center of the point cloud; the size of the log-polar zone is proportional to the average radius $r_c^{(t)}$. Finally, the 2-D spatial shape of the activity-component in frame $I_t$ is described by the quantized distribution of all key-points over the log-polar partitions.

Similar to the appearance descriptor, a global shape codebook of size 300 is built with the k-means algorithm on all the spatial shapes of the activity-components extracted from the training sequences. The 2-D spatial shape of an activity-component $c$ in each frame is quantized with the shape codebook, and the shape information of $c$ is finally encoded by the empirical distribution of quantized shapes over all frames, denoted $h_{\mathrm{shape}}(c)$.
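A sketch of the per-frame log-polar shape histogram is given below; the exact radial bin boundaries are not specified in the paper, so the logarithmic edges used here (from $r_c/2$ to $2 r_c$) are an assumption.

```python
import numpy as np

def log_polar_shape(points, n_rad=5, n_ang=12):
    """Per-frame 5 x 12 log-polar shape histogram of a key-point cloud
    (an (M, 2) array of (x, y) locations)."""
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)                    # spatial centroid p_c
    rel = pts - center
    rad = np.linalg.norm(rel, axis=1)
    r_mean = rad.mean() + 1e-8                   # average radius r_c
    # log-spaced radial edges scaled by r_c (assumed: r_c/2 .. 2*r_c)
    edges = r_mean * np.logspace(-1.0, 1.0, n_rad + 1, base=2.0)
    r_bin = np.clip(np.digitize(rad, edges) - 1, 0, n_rad - 1)
    ang = np.arctan2(rel[:, 1], rel[:, 0])       # orientation in [-pi, pi)
    a_bin = (((ang + np.pi) / (2 * np.pi)) * n_ang).astype(int) % n_ang
    hist = np.zeros((n_rad, n_ang))
    np.add.at(hist, (r_bin, a_bin), 1)
    return (hist / hist.sum()).ravel()           # 60-dim per-frame shape
```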

Motion descriptor. We use a transition matrix of trajectories [2, 18] to describe the motion information of an extracted activity-component. Suppose


[Figure 2: Illustration of the shape descriptor of an activity-component. The shape descriptor encodes the shape information of an activity-component in a frame $I_t$. The blue point stands for the spatial centroid $p_c^{(t)} = (x_c^{(t)}, y_c^{(t)})$ of the key-point cloud in frame $I_t$, and the size of the log-polar zone is proportional to the average radius $r_c^{(t)}$. The 2-D spatial shape of the activity-component in frame $I_t$ is described by the quantized distribution of all key-points over the log-polar partitions.]

[Figure 3: Illustration of the motion descriptor of an activity-component. For each trajectory in an activity-component, we compute its transition matrix based on its line segments; the final motion descriptor aggregates the matrices of all trajectories in the activity-component.]


that each trajectory lies on a curve in the 3-D spatio-temporal space. We first smooth the trajectory curve by anisotropic diffusion [31] and then compute its curvature. The extrema of the spatio-temporal curvature capture both changes of speed and changes of direction. According to the curvature extrema, and following the approach in [32], a trajectory is divided into several line segments. The orientations of the line segments are then quantized into $S$ states, including a "no-motion" state. Thereby, an $S \times S$ transition matrix (in our work, $S = 9$) is computed for each trajectory. The transition matrices of all the trajectories contained in the activity-component $c$ are added together to form one matrix, which is then vectorized to give the motion descriptor $\psi_{\mathrm{motion}}(c)$. Figure 3 shows a visualization of our motion descriptor.
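For illustration, the sketch below computes a transition-matrix descriptor of this kind; note that it substitutes frame-to-frame displacement vectors for the curvature-based line segments of [31, 32], so it is a simplified stand-in, and the still-motion threshold is an assumed value.

```python
import numpy as np

def motion_descriptor(trajectories, n_states=9, still_thresh=0.5):
    """Transition-matrix motion descriptor of an activity-component.

    Each trajectory is an (L, 2) array of points. Displacement
    orientations are quantized into 8 direction states plus a
    'no-motion' state 0 (S = 9), and the S x S transition counts of
    all trajectories are summed and flattened into psi_motion(c).
    """
    S = n_states
    M = np.zeros((S, S))
    for pts in trajectories:
        d = np.diff(np.asarray(pts, dtype=float), axis=0)
        speed = np.linalg.norm(d, axis=1)
        ang = np.arctan2(d[:, 1], d[:, 0])            # in [-pi, pi)
        states = 1 + ((((ang + np.pi) / (2 * np.pi)) * (S - 1)).astype(int)
                      % (S - 1))
        states[speed < still_thresh] = 0              # no-motion state
        if len(states) > 1:
            np.add.at(M, (states[:-1], states[1:]), 1)
    return M.ravel()
```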

So far, for an activity-component $c$, the three descriptors above are first normalized and then concatenated to form the combined feature $f(c) = \langle h_{\mathrm{appe}}(c), h_{\mathrm{shape}}(c), \psi_{\mathrm{motion}}(c) \rangle$.

2.4. Bag-of-components activity representation

After describing each activity-component by a feature $f(c)$, a straightforward way to represent an activity is to use the bag-of-words model, where the "words" are, in our case, activity-components.
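For concreteness, a minimal bag-of-components sketch follows; the use of scikit-learn's KMeans, the function names and the codebook size (borrowed from the appearance and shape codebooks of Section 2.3) are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def bag_of_components(train_feats, video_feats, n_words=300):
    """Bag-of-components histogram for one video.

    train_feats: stacked f(c) descriptors of all training components;
    video_feats: descriptors of one video's components.
    """
    codebook = KMeans(n_clusters=n_words, n_init=10).fit(train_feats)
    words = codebook.predict(video_feats)
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / max(hist.sum(), 1.0)   # L1-normalized histogram
```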

3. Activity representation by spatio-temporal context of activity-components

The bag-of-components activity representation proposed in the previous section makes the important assumption that the activity-components are independent. Observe that this assumption is also widely adopted by other activity representations, such as the trajectory-based ones [2, 19, 20, 21], and few works consider the relationships between activity elements, with the exception of [10]. However, a basic fact ignored in activity recognition is that an activity is composed of several activity-components, such as the shaking-hands activity consisting of two shaking hands and two moving arms, which depend on each other not only in the spatial domain but also in the temporal domain. Thus, in this section, we introduce the spatio-temporal context of activity-components to take these relationships into account.


3.1. Spatio-temporal context of activity-components

Let $V$ denote an activity represented by a collection of activity-components, $V = \{c_n\}_{n=1}^{N}$. In our case, each activity-component $c$ is described by a vector $\langle f(c), \ell(c), s(c) \rangle$, where $f(c)$ stands for the feature of the activity-component as detailed in Section 2.3, and $\ell(c) = (x(c), y(c), t(c))$ corresponds to the 3-dimensional centroid location of $c$, computed from all the key-points in the component. Let $s(c) = (s_{xy}(c), s_t(c))$ be the spatial and temporal scales of $c$, where $s_{xy}(c)$ and $s_t(c)$ are the spatial and temporal average radii of the component, respectively.

3.1.1. Spatio-temporal neighborhood

Given a component $c$ with centroid location $\ell(c) = (x(c), y(c), t(c))$, write $\ell_{xy}(c) := (x(c), y(c))$ and $\ell_t(c) := t(c)$, and define the spatial and temporal distances between two components $c$ and $c'$ as

$$d_{xy}(c, c') := \|\ell_{xy}(c) - \ell_{xy}(c')\|_2, \qquad d_t(c, c') := \|\ell_t(c) - \ell_t(c')\|_2,$$

where $\|\cdot\|_2$ is the L2-norm. The spatio-temporal neighborhood of $c$ is then defined proportionally to the spatial and temporal scales of the component:

$$\mathcal{N}(c) := \big\{c' : c' \in V,\; d_{xy}(c, c') < \alpha_{xy} \cdot s_{xy}(c),\; d_t(c, c') < \alpha_t \cdot s_t(c)\big\}, \tag{5}$$

where $\alpha_{xy}$ and $\alpha_t$ are neighborhood factors. Obviously, $\mathcal{N}(c)$ is a circular cylinder with radius $\alpha_{xy} \cdot s_{xy}(c)$ and half-length $\alpha_t \cdot s_t(c)$.

3.1.2. Spatio-temporal relationships

We describe the spatial and temporal relationships between components separately and quantize them into discrete states. (1) To capture the spatial relationship between two components $c$ and $c'$, we first compute the spatial distance $d_{xy}(c, c')$ and then quantize the relationship into 5 different states, as illustrated in Figure 4(a). More precisely, if $d_{xy}(c, c')$ is smaller than a given value $\lambda$, the relationship is quantized as state 0; otherwise, the angle between the spatial locations of the two components (i.e., the direction from $\ell_{xy}(c)$ to $\ell_{xy}(c')$) is computed and quantized into one of the other four states. (2) For the temporal relationships, similar to [10], we define 5 types of relationship based on the temporal locations $\ell_t(c)$, $\ell_t(c')$ and the temporal scales $s_t(c)$, $s_t(c')$ of the components $c$ and $c'$: before, meeting, overlapping, equaling and inclusion. See Figure 4(b) for a graphical illustration. Thus, there are $5 \times 5 = 25$ spatio-temporal relationships in total.

[Figure 4: Illustration of spatio-temporal relationships. In (a), the spatial relationships are quantized into 5 states. In (b), the positions of the lines stand for the temporal locations of the components, and the lengths of the lines stand for the temporal ranges the components cover.]

These spatio-temporal relationships reflect the spatial structural arrangement and the temporal dynamics of human activities, which can be used as an important cue for understanding human motions and recognizing activities.
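The sketch below quantizes a component pair into one of the 25 relationship labels. The component container, the value of $\lambda$, the tolerance used for "meeting" and "equaling", and the pairing of spatial and temporal states into a single label are all assumptions, since the paper fixes only the numbers of states.

```python
import numpy as np

def spatial_state(c, cp, lam=10.0):
    """5 spatial states of Figure 4(a): 0 if the centroids are closer
    than lam, otherwise 1-4 from the quantized direction c -> c'."""
    dx = np.asarray(cp['loc'][:2]) - np.asarray(c['loc'][:2])
    if np.linalg.norm(dx) < lam:
        return 0
    ang = np.arctan2(dx[1], dx[0]) % (2 * np.pi)
    return 1 + int(((ang + np.pi / 4) // (np.pi / 2)) % 4)

def temporal_state(c, cp, tol=1.0):
    """5 temporal states of Figure 4(b), compared on the intervals
    [t - s_t, t + s_t] spanned by the two components."""
    a0, a1 = c['loc'][2] - c['st'], c['loc'][2] + c['st']
    b0, b1 = cp['loc'][2] - cp['st'], cp['loc'][2] + cp['st']
    if abs(a0 - b0) < tol and abs(a1 - b1) < tol:
        return 3                                  # equaling
    if a1 < b0 - tol or b1 < a0 - tol:
        return 0                                  # before
    if abs(a1 - b0) < tol or abs(b1 - a0) < tol:
        return 1                                  # meeting
    if (a0 <= b0 and b1 <= a1) or (b0 <= a0 and a1 <= b1):
        return 4                                  # inclusion
    return 2                                      # overlapping

def relation_label(c, cp, lam=10.0):
    """r in {0, ..., 24}: one label per (spatial, temporal) state pair."""
    return 5 * spatial_state(c, cp, lam) + temporal_state(c, cp)
```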

3.1.3. Context with spatio-temporal relationships

According to the spatio-temporal relationships defined above, we define the context of an activity-component $c$ as

$$\mathcal{N}_r(c) := \big\{c' : c' \in \mathcal{N}(c),\; c' \diamond c = r\big\}, \quad r \in \{0, 1, \ldots, 24\}, \tag{6}$$

where $r$ is the label of the spatio-temporal relationship, $c' \diamond c$ stands for the quantized spatio-temporal relationship between $c'$ and $c$, and $\mathcal{N}(c)$ is the neighborhood of the component $c$.

3.2. Design of spatio-temporal context-dependent kernels

In order to model the spatio-temporal context of activity-components, we rely on the context-dependent kernel, first proposed by Sahbi et al. [33] for capturing the geometric structure of objects. In this section, we extend the context-dependent kernel to the spatio-temporal domain in order to exploit the structural and dynamic properties of activities.

Let $\mathcal{C}$ be the union of the possible components over the activities $\{V_1, \ldots, V_N\}$. We consider $K : \mathcal{C} \times \mathcal{C} \to \mathbb{R}^+$ as a kernel which provides a similarity measure between any two activities $V_p = \{c_i^p\}_{i=1}^{m}$ and $V_q = \{c_j^q\}_{j=1}^{n}$ as

$$K(V_p, V_q) = \sum_{i=1}^{m} \sum_{j=1}^{n} k(c_i^p, c_j^q),$$

where $k(c_i^p, c_j^q)$ is a kernel measuring the similarity between two components $c_i^p$ and $c_j^q$. Let $\{P_r(c_i^p, c_j^q) = g_r(c_i^p, c_j^q)\}_{c_i^p, c_j^q}$ be the elements of the intrinsic adjacency matrices for spatio-temporal relationship label $r$, where $g_r(c_i^p, c_j^q)$ is a decreasing function of any (pseudo-)distance involving the components $c_i^p$ and $c_j^q$. Let $D(c_i^p, c_j^q) = d_f(c_i^p, c_j^q)$, where $d_f(c_i^p, c_j^q)$ is a dissimilarity metric between the components $c_i^p$ and $c_j^q$ in the feature space $\{f\}$. Following [33], the kernel $K$ on $\mathcal{C} \times \mathcal{C}$ can be defined by minimizing

$$\min_{K \geq 0,\, \|K\|_1 = 1} \; \mathrm{Tr}\big(K D^T\big) + \beta\, \mathrm{Tr}\big(K \log K^T\big) - \alpha \sum_{r=0}^{24} \mathrm{Tr}\big(K P_r K^T P_r^T\big), \tag{7}$$

where $D^T$ is the transpose of $D$, $\alpha, \beta \geq 0$ are parameters, the operations "log" (natural) and "$\geq$" are applied entry-wise, $\|\cdot\|_1$ is the entry-wise L1-norm and Tr denotes the matrix trace.

The first term in the above constrained minimization problem is a fidelity term which measures how well the features of the matched components agree. The second term is a regularization term which keeps the probability distribution $\{k(c_i^p, c_j^q)\}$ flat in the absence of prior knowledge about the aligned components. The third term captures the spatio-temporal structure: a high value of $k(c_i^p, c_j^q)$ should imply high kernel values in the neighborhoods $\mathcal{N}_r(c_i^p)$ and $\mathcal{N}_r(c_j^q)$.

In practice, we use the Euclidean distance to compute the dissimilarity of features,

$$d_f(c_i^p, c_j^q) = \|f(c_i^p) - f(c_j^q)\|_2,$$

and take the distance function

$$g_r(c_i^p, c_j^q) = \sum_{c_k^p \in \mathcal{N}_r(c_i^p),\; c_l^q \in \mathcal{N}_r(c_j^q)} g_r(c_i^p, c_k^p) \cdot g_r(c_l^q, c_j^q),$$

where $g_r(c_i^p, c_k^p) = \exp\{-\frac{1}{\sigma_d} d_\ell(c_i^p, c_k^p)\}$, with $d_\ell(c_i^p, c_k^p) = \|\vec{\omega} \cdot (\ell(c_i^p) - \ell(c_k^p))\|_2$ a weighted spatial and temporal distance between two neighbors and $\vec{\omega} = (\omega_x, \omega_y, \omega_t)$.

Kernel solution. The optimization problem in Equation (7) admits a solution $\tilde{K}$, which is the limit of the sequence of context-dependent kernels

$$K^{(\eta)} = \frac{G\big(K^{(\eta-1)}\big)}{\sum_{i,j} G\big(K^{(\eta-1)}\big)_{ij}},$$

with

$$G\big(K^{(\eta-1)}\big) = \exp\Big\{-\frac{D}{\beta} + \frac{\alpha}{\beta} \sum_{r=0}^{24} \big(P_r K^{(\eta-1)} P_r^T + P_r^T K^{(\eta-1)} P_r\big)\Big\}, \qquad K^{(0)} = \frac{\exp\big(-\frac{D}{\beta}\big)}{\sum_{i,j} \exp\big(-\frac{D}{\beta}\big)_{ij}}.$$

Refer to [33] for a detailed proof of this solution and of its convergence to a positive definite fixed point. In practice, the iteration usually stops after 3 iterations, i.e., $\eta = 2$.
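A minimal NumPy sketch of this fixed-point iteration for a pair of activities follows; the function names and the factorization of the adjacency matrices into one list per activity are our assumptions, with the parameter defaults taken from the experimental settings of Section 4.

```python
import numpy as np

def stck(D, Pp, Pq, alpha=20.0, beta=0.5, n_iter=3):
    """Fixed-point iteration solving Equation (7), following [33].

    D  : (m, n) matrix of feature dissimilarities d_f between the
         components of two activities V_p and V_q;
    Pp : list of 25 (m, m) adjacency matrices P_r among the components
         of V_p, one per relationship r; Pq: the same for V_q ((n, n)).
    Returns the component-level kernel matrix k(c_i^p, c_j^q).
    """
    K = np.exp(-D / beta)
    K /= K.sum()                                    # K^(0)
    for _ in range(n_iter):                         # 3 iterations suffice
        ctx = sum(P @ K @ Q.T + P.T @ K @ Q for P, Q in zip(Pp, Pq))
        G = np.exp(-D / beta + (alpha / beta) * ctx)
        K = G / G.sum()                             # enforce ||K||_1 = 1
    return K

def activity_similarity(K):
    """Activity-level kernel K(V_p, V_q): sum of the component kernels."""
    return K.sum()
```

Setting `alpha=0` recovers the context-free kernel used as a baseline in Section 4.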

So far, two activities $V_p$ and $V_q$ can be compared within their spatio-temporal context through the designed kernel $K(V_p, V_q)$. It is worth noticing that this kernel can be plugged into classifiers such as SVMs, thus combining the benefits of the context-dependent kernel with the well-established generalization power of SVMs.

4. Experimental analysis

This section experimentally evaluates the proposed method in two respects: we first compare the activity-components with local space-time features [3, 4] and trajectory-based features [19]; we then demonstrate that taking into account the spatio-temporal context of these activity-components through the STCK can largely improve the performance of activity recognition.

4.1. Experimental Setups

All the comparisons are carried out on two recent activity datasets: the UT-Interaction dataset [28], containing complex interactions, and the Rochester Activities Dataset [19], containing complicated daily activities, both of which are reported to be challenging for activity recognition.


[Figure 5: Snapshot examples of video sequences from the two activity datasets: (a) the UT-Interaction dataset [28], containing 6 classes of activities with 20 videos per class; (b) the Rochester Activities Dataset [19], containing 10 classes of activities with 15 videos per class.]


UT-Interaction dataset. It contains 6 classes of human-human interactions: shake-hands, point, hug, push, kick and punch. Each class contains 20 video sequences divided into two sets: (1) the first 10 video sequences, Set 1, are taken in a parking lot with slightly different zoom rates, and the backgrounds are mostly static with little camera jitter; (2) the other 10 video sequences, Set 2, are taken on a lawn on a windy day; the backgrounds move slightly (e.g., trees sway) and the videos contain more camera jitter. Observe that the background, scale and illumination differ across the videos of each set. Figure 5(a) shows one snapshot for each class.

Rochester Activities Dataset. This dataset contains 10 classes of daily living activities: answering a phone, chopping a banana, dialing a phone, drinking water, eating a banana, eating snack chips, looking up a phone number in a telephone book, peeling a banana, eating food with silverware and writing on a whiteboard. Each activity contains 15 sample videos, performed three times by five people of different shapes, sizes, genders and ethnicities. It has been reported that motion information alone is not sufficient to distinguish these activities and that other information, such as appearance descriptions, should be taken into account [19]. Figure 5(b) shows one snapshot for each class of the dataset.

In order to evaluate and compare the performances, we use the same experimental settings as in [19] and [28]. Specifically, for the Rochester Activities Dataset, the 12 video sequences performed by four of the five subjects are used as the training set, the remaining 3 video sequences are used for testing, and the experiments are repeated five times. For the UT-Interaction dataset, 10-fold leave-one-out cross-validation (each time, 9 samples are used for training and the remaining one for testing) is carried out on each set. The different methods are compared by their average recognition rates.

4.2. Comparative Evaluations

To evaluate the effect of the proposed mid-level activity-components, we model and classify activities with 3 representations:

- Bag-of-Component + SVM: we first generate a codebook of activity-components using k-means, then represent each activity with a bag-of-features model, and finally train the model of each class by SVMs with a radial basis function kernel. As in [28], codebooks are generated 10 times using the k-means algorithm, and the performances are averaged over the 10 codebooks.

- Component + STCK + SVM: the model of each class is trained by SVMs with the STCK. When computing the STCK, the following parameters are used: $\alpha_{xy} = 4$, $\alpha_t = 6$, $\alpha = 20$ and $\beta = 0.5$.

- Component + context-free kernel + SVM: the modeling and classification are the same as in Component + STCK + SVM, but with $\alpha = 0$, which implies that no context is taken into account when computing the kernels.

The above parameters are chosen to achieve the best performance on the databases.

We first investigate the contributions of the appearance, shape and motion features of the activity-components, using the Bag-of-Component + SVM method to evaluate classification accuracy. See Figure 6 for the classification accuracies of these features and their combination on UT-Interaction Set 1 and Set 2. Each feature contributes positively to the classification accuracy. The average classification accuracies of the appearance, shape and motion features are 22%, 27.8% and 67.1% (over Set 1 and Set 2), respectively. Compared with the appearance and shape features, the contribution of the motion feature is by far the most significant, confirming that motion is the most critical feature for recognizing activities [23]. The combined feature, Component(ASM), achieves an average classification accuracy of 73.3%, improving on our original motion feature Component(M) in [23] by 6.2%.

[Figure 6: Contribution of the appearance, shape and motion features on the UT-Interaction dataset: (a) Set 1; (b) Set 2. Per-class and average classification accuracies are compared for Component(A) (appearance), Component(S) (shape), Component(M) (motion) and Component(ASM) (appearance + shape + motion).]

[Figure 7: Comparison of classification accuracies of different features on the UT-Interaction dataset: (a) Set 1; (b) Set 2. Per-class and average classification accuracies are compared for Bag-of-STIPs + SVM, Bag-of-Cuboid + SVM, Bag-of-Component(ASM) + SVM, Component + STCK + SVM and Variations of Hough-Voting.]

We then compare the proposed mid-level activity-components with local spatio-temporal features, more precisely the STIPs feature [3] and the Cuboid feature [4], and with a state-of-the-art method, variations of Hough-Voting [34]. The baseline results of Bag-of-STIPs + SVM and Bag-of-Cuboid + SVM are taken from the ICPR 2010 Contest on Semantic Description of Human Activities [35], where they were implemented by the contest organizers. Figure 7 shows the comparison on the UT-Interaction dataset. We can see that Component(ASM) outperforms STIPs and Cuboid on both Set 1 and Set 2. Specifically, on Set 1, the average classification accuracies of STIPs and Cuboid are 64.2% and 75.5%, respectively, whereas the proposed mid-level activity-components with appearance, shape and motion features, Component(ASM), reach an average accuracy of 78.3%, improving on the STIPs results by 14.1% and on the Cuboid results by 2.8%. Similar improvements are achieved on Set 2, where Component(ASM) outperforms STIPs (59.7%) by 8.5% and Cuboid (62.7%) by 5.5%.

As shown in Figure 7, the proposed mid-level activity-components are more discriminative than the representative local STIPs and Cuboid features. Activity-components contain spatio-temporal information over a larger area, where appearance, shape and motion information contribute together to their discriminative power. It is also worth noticing that, compared with the dense Cuboid feature and the sparse STIPs feature, the proposed mid-level activity-components are much sparser; for instance, there are about 30-60 components per sequence in the hand-shaking class. As we will see, this sparseness is a huge advantage when exploiting the spatio-temporal relationships and context information for activity recognition.

When the STCK is used with the activity-components, Component + STCK + SVM achieves the best average classification accuracy of 82.5% (averaged over Set 1 and Set 2). Even though the variations of Hough-Voting method [34] attains the same average classification accuracy, our method is more stable: for the latter method, the classification accuracies vary widely across classes (e.g., the accuracy on the Push activity in Set 2 is as low as 40%), whereas for our method the classification accuracies of all classes are above 70%.

The second comparison is between the proposed activity-components and the trajectory-based features of [19]. Figure 8 shows the classification accuracies of the different methods on the Rochester Activities Dataset. Overall, our mid-level activity-components with the STCK outperform the average classification accuracy of the velocity-histories feature (63%) by 28%, and also outperform the augmented velocity-histories features (89%) by 2%. Note, moreover, that the augmented velocity-histories features enrich the velocity history of tracked key-points with a set of additional information, such as the absolute and relative positions of the human face, appearance and color. Furthermore, our results are as good as or better than the augmented velocity-histories features on 8 of the activity classes, all except Dial Phone and Use Silverware, which further demonstrates the stability of the proposed method.

[Figure 8: Comparison of classification accuracies of different features on the Rochester Activities Dataset. Per-class and average accuracies are compared for Velocity Histories, Augmented Velocity Histories and Component + STCK + SVM.]

Finally, we show the importance of the STCK experimentally. Specifically, the STCK is compared with two context-free models, namely the bag-of-components model and the context-free kernel, as well as with the bag-of-co-components method [23]. As mentioned before, the context-free kernel is the special case of the STCK with $\alpha = 0$. The bag-of-co-components method proposed in our previous work [23] uses co-components in the joint feature space as the basic feature unit, where pairwise components are considered as mutual context. Figure 9 and Figure 10 show the classification accuracies on the UT-Interaction dataset and the Rochester Activities Dataset, respectively. The performance of the STCK is consistently better than that of the above three methods on the two datasets.

Among them, the STCK shows the largest improvement over the context-free kernel: the average classification accuracy of the STCK exceeds that of the context-free kernel (66.7%) by 18.3% on UT-Interaction Set 1. The STCK also outperforms the bag-of-components model, improving the average classification accuracy by 8.75% and 8% on the UT-Interaction dataset and the Rochester Activities Dataset, respectively. It is also interesting that the STCK performs better than the bag-of-co-components method [23], since the STCK takes into account all the context features in the neighborhood of a feature together, rather than only pairwise context information.


[Figure 9: Comparison of classification accuracies of different context models on the UT-Interaction dataset: (a) Set 1; (b) Set 2. Per-class and average results are compared for Component(ASM) + Context-free + SVM, Bag-of-Component(ASM) + SVM, Bag-of-Co-Component(ASM) + SVM and Component(ASM) + STCK + SVM.]

[Figure 10: Comparison of classification accuracies of different context models on the Rochester Activities Dataset. Per-class and average results are compared for Component(ASM) + Context-free + SVM, Bag-of-Component(ASM) + SVM, Bag-of-Co-Component(ASM) + SVM and Component(ASM) + STCK + SVM.]


5. Conclusion

In this paper, we have first presented novel mid-level activity-components, obtained by clustering motion-consistent trajectories, to represent human activities in image sequences. Compared with local space-time features, the proposed mid-level activity-components are more discriminative, as they contain spatio-temporal information over a larger area and take richer information (appearance, shape and motion) into account. In contrast with trajectory-based methods, the mid-level activity-components not only represent the motion information of activities but also describe the appearance and shape of the components. Furthermore, most of the extracted mid-level activity-components have physical meanings. Moreover, we have introduced a spatio-temporal context kernel to model the relationships between these mid-level activity-components, which not only takes into account the local properties of each activity-component but also considers their interdependencies. Experimental results on two challenging activity databases confirmed the advantages of the proposed mid-level components over state-of-the-art local features, and demonstrated that the spatio-temporal context kernel effectively exploits the relationships between mid-level activity-components and improves recognition performance.

However, it is worth noticing that, because of the clustering step, not all mid-level activity-components correspond to perfect physical parts. For instance, a moving arm might be divided into two components instead of one, due to the difference in color and velocity between the skin and the clothes. More robust clustering algorithms should therefore be investigated in future work, and the motion of the background should also be taken into account.

References

[1] A. Gilbert, J. Illingworth, R. Bowden, Fast realistic multi-action recognition using mined dense spatio-temporal features, in: Proc. Int. Conference on Computer Vision (ICCV), 2009.

[2] P. Siva, T. Xiang, Action detection in crowd, in: Proc. British Machine Vision Conference (BMVC), 2010.

[3] I. Laptev, T. Lindeberg, On space-time interest points, in: Proc. Int. Conference on Computer Vision (ICCV), 2003.

[4] P. Dollar, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in: Proc. Int. Conference on VS-PETS, 2005.

[5] G. Willems, T. Tuytelaars, L. V. Gool, An efficient dense and scale-invariant spatio-temporal interest point detector, in: Proc. European Conference on Computer Vision (ECCV), 2008.

[6] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: A local SVM approach, in: Proc. Int. Conference on Pattern Recognition (ICPR), 2004.

[7] I. Laptev, M. Marszałek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: Proc. Int. Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

[8] M. Bregonzio, S. Gong, T. Xiang, Recognising action as clouds of space-time interest points, in: Proc. Int. Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[9] A. Kovashka, K. Grauman, Learning a hierarchy of discriminative space-time neighborhood features for human action recognition, in: Proc. Int. Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[10] M. S. Ryoo, J. K. Aggarwal, Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities, in: Proc. Int. Conference on Computer Vision (ICCV), 2009.

[11] A. F. Bobick, J. W. Davis, The recognition of human movement using temporal templates, IEEE Trans. Pattern Anal. Mach. Intell. 23 (3) (2001) 257-267.

[12] L. Gorelick, M. Blank, E. Shechtman, M. Irani, R. Basri, Actions as space-time shapes, IEEE Trans. Pattern Anal. Mach. Intell. 29 (12) (2007) 2247-2253.

[13] B. Yao, S. C. Zhu, Learning deformable action templates from cluttered videos, in: Proc. Int. Conference on Computer Vision (ICCV), 2009.

[14] Y. Ke, R. Sukthankar, M. Hebert, Event detection in crowded videos, in: Proc. Int. Conference on Computer Vision (ICCV), 2007.

[15] P. F. Felzenszwalb, D. P. Huttenlocher, Efficient matching of pictorial structures, in: Proc. Int. Conference on Computer Vision and Pattern Recognition (CVPR), 2000.

[16] Y. Wang, G. Mori, Learning a discriminative hidden part model for human action recognition, in: Advances in Neural Information Processing Systems (NIPS), 2008.

[17] J. Niebles, L. Fei-Fei, A hierarchical model of shape and appearance for human action classification, in: Proc. Int. Conference on Computer Vision and Pattern Recognition (CVPR), 2007.

[18] J. Sun, X. Wu, S. Yan, L. F. Cheong, T.-S. Chua, J. Li, Hierarchical spatio-temporal context modeling for action recognition, in: Proc. Int. Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[19] R. Messing, C. Pal, H. Kautz, Activity recognition using the velocity histories of tracked keypoints, in: Proc. Int. Conference on Computer Vision (ICCV), 2009.

[20] P. Turaga, R. Chellappa, Locally time-invariant models of human activities using trajectories on the grassmannian, in: Proc. Int. Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[21] H. Wang, A. Kläser, C. Schmid, C.-L. Liu, Action recognition by dense trajectories, in: Proc. Int. Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

[22] B. Yao, L. Fei-Fei, Modeling mutual context of object and human pose in human-object interaction activities, in: Proc. Int. Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[23] F. Yuan, J. Yuan, V. Prinet, Middle-level representation for human activities recognition: the role of spatio-temporal relationships, in: ECCV'10 Workshop on Human Motion: Understanding, Modeling, Capture and Animation, 2010.

[24] K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors, IEEE Trans. Pattern Anal. Mach. Intell. 27 (10) (2005) 1615-1630.

[25] S. Rao, R. Tron, R. Vidal, Y. Ma, Motion segmentation in the presence of outlying, incomplete, or corrupted trajectories, IEEE Trans. Pattern Anal. Mach. Intell. 32 (10) (2010) 1832-1845.

[26] P. F. Felzenszwalb, D. P. Huttenlocher, Efficient graph-based image segmentation, Int. J. Comput. Vis. 59 (2) (2004) 167-181.

[27] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 888-905.

[28] M. S. Ryoo, J. K. Aggarwal, UT-Interaction Dataset, ICPR contest on Semantic Description of Human Activities (SDHA), http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html (2010).

[29] D. G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91-110.

[30] S. Belongie, G. Mori, J. Malik, Matching with shape contexts, in: Proc. IEEE Workshop on Content-based Access of Image and Video Libraries, 2000.

[31] P. Perona, J. Malik, Scale-space and edge detection using anisotropic diffusion, IEEE Trans. Pattern Anal. Mach. Intell. 12 (7) (1990) 629-639.

[32] C. Rao, A. Yilmaz, M. Shah, View-invariant representation and recognition of actions, Int. J. Comput. Vis. 50 (2) (2002) 203-226.

[33] H. Sahbi, J.-Y. Audibert, J. Rabarisoa, R. Keriven, Context-dependent kernel design for object matching and recognition, in: Proc. Int. Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

[34] D. Waltisberg, A. Yao, J. Gall, L. Van Gool, Variations of a Hough-voting action recognition system, in: Proc. Int. Conference on Pattern Recognition (ICPR) contest on Semantic Description of Human Activities (SDHA), 2010.

[35] M. S. Ryoo, C.-C. Chen, J. K. Aggarwal, A. Roy-Chowdhury, An overview of contest on semantic description of human activities (SDHA) 2010, http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html (2010).