

DISCRETE EVENT DYNAMIC SYSTEMS - FINAL PROJECT 1

Modeling Human Behaviour: from Activities to Scene Understanding

Pedro Canotilho Ribeiro

Abstract — This paper concerns human behaviour understanding. The aim is to jointly use information from the environment's spatial affordances and the observed activities performed by people. Starting from a recognizable set of four basic activities, {Active, Inactive, Walking, Running}, spatial regions of interest are identified. These regions are defined by a mixture of Gaussians, estimated using the Expectation-Maximization algorithm. The defined regions end up containing specific transitions between the basic activities and are grouped into three different classes: {Entry/Exit, Target, Neutral}. The environment can then be modeled as a Finite-State Automaton where each state (region) offers different affordances. These spatial affordances are further modeled as Stochastic Timed Automata. A hierarchical model with two levels is thus obtained, and two possible applications are suggested, in order to improve the activity classification and generate scene descriptions.

Keywords — Discrete Event Systems, Stochastic Systems, Spatial affordances, Human behaviour modeling, Scene description.

I. Introduction

In recent years there has been a growing interest in the analysis of human activities, with emphasis on the movement of body parts [1], human actions [6], [4] and even interactions between people [2]. This is due to the improving accuracy of algorithms that perform people's detection and tracking [6]. Some works model the spatial organization of the environment: some use a grid to divide the scene and then use the person's path to identify activities, while others use the tracking results to automatically define zones of interest.

With newer tracking systems one can accurately compute features that provide some abstraction from the tracker's pixel measurements and work towards a higher-level behaviour description. Our previous work [3] focused on human activity recognition, and serves as a bridge between the tracker's lower level and the work presented here. There, we demonstrated that it is possible to accurately classify, based only on the observation of people's actions, a set of five activities: {Active, Inactive, Walking, Running, Fighting}.

Considering that {Fighting} can be included in {Active}, the present work starts from a set of four basic activities and uses it to define regions of interest, since the environment is considered to play an essential role in people's behaviour. This role comes as affordances offered by the environment. An affordance is viewed as a resource that the environment offers a person, who must also have the ability to use it. An affordance thus exists whether it is perceived and used or not. Using an affordance implies a reciprocal relationship between perception and action. Perception provides the information for action, and action generates consequences that inform perception. This information may be proprioceptive, letting the animal know how its body is performing; but information is also exteroceptive, reflecting the way the animal changes the environmental context with respect to the affordances.

From this evidence it is clear that the place where people move has a restrictive influence on the activities performed. This work jointly uses information from the environment's spatial affordances and the observed activities performed by people. Using a hierarchical approach that automatically identifies three classes of regions of interest in the scene, and then modeling each one to work as a spatial affordance, it becomes possible, through the observation of people's actions, to describe human behaviour.

The paper is organized as follows: Section II describes the information needed to identify the regions of interest and the learning algorithm that geometrically defines them; Section III introduces the Discrete Event Systems used to model the scene in a hierarchical approach; Section IV explores some possible applications of the model; and Section V presents the final remarks.

II. Defining regions of interest

In order to use spatial information for human behaviour recognition, it is essential to define the regions in which particular activities are performed. Considering that people are mainly walking around, and that they must change this activity to perform specific tasks when in regions that allow it, the useful information involved is given by the transitions between activities. This simple remark implies that, to automatically identify the spatial regions of interest, one needs information about the activities performed by the tracked people.

This work uses the ground-truth information available in the CAVIAR¹ project, where four basic activities are defined: {Active, Inactive, Walking, Running}. This information could also have been obtained from our previous work [3], which shows that it is possible to automatically recognize these activities with an error of less than 5%. That recognizer could be used in different scenarios without new training and with the same level of performance, provided the ground-plane homographic transformation is available. The removal of the perspective distortion is an essential step for the recognizer to work, as it provides an orthographic view of the ground plane. Figure 1 illustrates the observed (perspectively distorted) image and the obtained

¹ CAVIAR: Context Aware Vision using Image-based Active Recognition was funded by the EC's Information Society Technology's programme, project IST 2001 37540. Home page: http://homepages.inf.ed.ac.uk/rbf/CAVIAR/caviar.htm


view with the estimated homography. From now on, all information about coordinates in the image refers to this orthographic view.

Fig. 1. Original (left) and resulting transformed (right) images of the INRIA - Caviar scenario. Perspective distortion is removed by mapping the ground plane using the estimated homography.

The transitions used to identify the regions of interest are based on three classes of activities, {Translation, Body, Null}, that contain the previously identified basic activities, as shown in Table I:

Activity class    Activities            Description

(T)ranslation     {Walking, Running}    translation in space
(B)ody            {Inactive, Active}    body movements
(N)ull            no target             no person in the space

TABLE I
Definition of the activity classes used to identify the regions of interest in the scene.

From the activity classes defined in Table I, two types of transitions will be considered: Null↔Translation (N↔T) and Translation↔Body (T↔B). The transitions related to the appearance or disappearance of a target (i.e. N↔T transitions) will be used to identify the scene's entry/exit regions. Changes in the person's walking activity can be identified by the T↔B transitions, meaning that the person stopped walking in order, for instance, to perform a specific task, such as sitting on the sofa or reading the information available in the information sheets or panel.

These transitions will thus define regions of interest, which can be of three types: (i) Entry/Exit regions; (ii) Target regions, where the person has physical objects to interact with (like an information panel to read or a reception desk); and (iii) a Neutral region, corresponding to the remaining space. Since the Target regions necessarily contain physical objects, generally associated with the presence of edges in the image, this edge information is used to identify these regions, which must contain more than 10 edge pixels.

The edge information is computed using the well-known Canny edge detector, with the threshold set to 0.1 and the sigma of the Gaussian filter to 1 pixel. Figure 2 shows the resulting detected edges (left image) and an example of the Translation↔Body transitions plotted in red (right image).
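Given any boolean edge map (such as the output of a Canny detector), the more-than-10-edge-pixels rule above is easy to apply per candidate region. The sketch below is illustrative only; the function names are ours, not the paper's:

```python
import numpy as np

def count_edges_in_region(edge_map, region_mask):
    """Count edge pixels falling inside a candidate region.

    edge_map    : 2-D boolean array, True where an edge was detected.
    region_mask : 2-D boolean array of the same shape, True inside
                  the candidate region.
    """
    return int(np.logical_and(edge_map, region_mask).sum())

def is_target_region(edge_map, region_mask, min_edge_pixels=10):
    """Apply the rule from the text: a Target region must contain
    more than `min_edge_pixels` edge pixels."""
    return count_edges_in_region(edge_map, region_mask) > min_edge_pixels
```

Note that the quoted threshold of 0.1 and sigma of 1 pixel follow the parameterization of the detector used by the authors; other Canny implementations parameterize their thresholds differently.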

Table II summarizes the considered regions, the associated transitions and the edge information used.

Fig. 2. Resulting edge detection (left) and identified T↔B transitions (right). The edges were detected using the Canny edge detector; on the right, the transitions are shown in red against a sample image.

Region type    Transitions          Edge pixels

Entry/Exit     Null↔Translation     –
Target         Translation↔Body     > 10
Neutral        remaining area       –

TABLE II
Definition of the regions of interest considered and the types of transitions used to identify them.

A. Learning the regions of interest

Having defined the information needed to identify the regions, a learning method is then required to compute them automatically. Another question is the shape of the regions, which obviously depends on the learning method.

The problem was viewed here as a clustering problem. Having a set of two-dimensional transition points, the aim is to group them into subsets such that those within each cluster are more similar to each other than to the ones assigned to different clusters. In this case the points represent spatial coordinates, and the clusters must be based on their proximity.

The use of the Expectation-Maximization algorithm to estimate a Gaussian mixture model is a possible solution, having the advantage of making probabilistic assignments of points to cluster centers and of allowing the definition of elliptic regions based on the estimated Gaussian covariance matrices.
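A minimal EM sketch for a weighted 2-D Gaussian mixture is given below (Python/NumPy). It is a simplification of what the paper uses: the number of components is fixed here, the kurtosis-based split/merge of [5] is omitted, and the deterministic farthest-point initialisation is our choice:

```python
import numpy as np

def em_gmm(points, weights, n_components, n_iter=100):
    """Minimal EM for a 2-D Gaussian mixture over weighted points.
    points: (n, 2) array; weights: (n,) array summing to 1.
    Returns mixture weights pi, means mu and covariances sigma."""
    n = len(points)
    # Deterministic farthest-point initialisation of the means.
    idx = [0]
    while len(idx) < n_components:
        d = np.min([((points - points[i]) ** 2).sum(axis=1) for i in idx],
                   axis=0)
        idx.append(int(np.argmax(d)))
    mu = points[idx].astype(float)
    sigma = np.stack([np.eye(2)] * n_components)
    pi = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibilities r[k, j] = p(component j | point k).
        r = np.empty((n, n_components))
        for j in range(n_components):
            diff = points - mu[j]
            inv = np.linalg.inv(sigma[j])
            det = np.linalg.det(sigma[j])
            quad = np.einsum('ni,ij,nj->n', diff, inv, diff)
            r[:, j] = pi[j] * np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(det))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: updates weighted by responsibilities AND point weights.
        rw = r * np.asarray(weights)[:, None]
        nk = rw.sum(axis=0)
        pi = nk / nk.sum()
        for j in range(n_components):
            mu[j] = (rw[:, j:j + 1] * points).sum(axis=0) / nk[j]
            diff = points - mu[j]
            sigma[j] = np.einsum('n,ni,nj->ij', rw[:, j], diff, diff) / nk[j]
            sigma[j] += 1e-6 * np.eye(2)  # keep covariances invertible
    return pi, mu, sigma
```

The estimated means and covariances then feed the elliptical region definition of Section II-A.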

As explained in the previous section, there are two different types of transitions, N↔T and T↔B, that define two sets of points, T_i = {(x_1, y_1, w_1), ..., (x_{n_i}, y_{n_i}, w_{n_i})}, i = 1, 2, representing the locations where the transitions occur, with w the weight of each point. The weight of a point is computed according to the number of transitions occurring at exactly the same location, renormalized so that ∑_{k=1}^{n_i} w_k = 1. The N↔T transitions correspond to i = 1 and the T↔B to i = 2. The EM algorithm is then applied to estimate a Gaussian mixture for each of the two sets, where each component of the mixture clusters a subset of points. The likelihood function is thus approximated by:

f_i(x, y) ≈ ∑_{j=1}^{N_i} π_j N(x, y; µ_j, Σ_j),   i = 1, 2.      (1)

where N(x, y; µ_j, Σ_j) denotes a Normal distribution, π_j represents the weight of that Gaussian in the mixture, i indexes the two types of regions and N_i is the number of Gaussians in mixture i.


The Expectation-Maximization algorithm [5] estimates the unknown parameters of the mixture, (µ_j, Σ_j, π_j), and also allows the number of Gaussians, N_i, to vary by using a distance, based on the kurtosis, that measures how close a distribution is to a Gaussian. It is used to split a distribution if it is too different from a Gaussian. Similarly, it can be used as a closeness measure to merge distributions, if this distance becomes too small.

After applying the learning algorithm, two likelihood functions are obtained, f_1(x, y) and f_2(x, y), each defining a Gaussian mixture that does not correspond directly to regions. Defining the regions implies the use of some distance metric. The Mahalanobis distance is a natural choice that, given the Gaussian parameters (µ_j, Σ_j), defines elliptical regions at some desired distance.

Each region is thus obtained by imposing, for all the pixels in the image, that their squared Mahalanobis distance is less than 1:

R_j = {P : (µ_j − P) Σ_j⁻¹ (µ_j − P)ᵀ < 1 ∧ P ∈ image}      (2)

where P = (x, y) represents the coordinates of a pixel in the image and R_j is the set of pixels that belong to region j.
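Equation (2) translates almost directly into code. The sketch below (NumPy; the function name is ours) builds the boolean pixel mask of an elliptical region R_j from estimated parameters (µ_j, Σ_j):

```python
import numpy as np

def region_mask(mu, sigma, shape, threshold=1.0):
    """Pixels whose squared Mahalanobis distance to (mu, sigma) is
    below `threshold`, i.e. the elliptical region R_j of equation (2).
    mu is (x, y); the returned mask is indexed as mask[y, x]."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    diff = np.stack([xs - mu[0], ys - mu[1]], axis=-1)  # P - mu, per pixel
    inv = np.linalg.inv(sigma)
    d2 = np.einsum('...i,ij,...j->...', diff, inv, diff)
    return d2 < threshold
```

With a diagonal covariance diag(4, 1) the region is an axis-aligned ellipse twice as wide in x as it is tall in y, as the test below verifies on a small grid.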

The weight of each Gaussian in the mixture measures the importance of that Gaussian and is directly related to the number of transitions that the corresponding region contains. Accordingly, in Gaussian mixture i, Gaussian j is eliminated if π_j < (1/N_i) × 0.45.
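The pruning rule can be sketched as follows (plain Python; renormalising the surviving weights afterwards is our addition, not stated in the paper):

```python
def prune_components(pi, mu, sigma, factor=0.45):
    """Drop Gaussians whose mixture weight falls below (1/N_i) * factor,
    the elimination rule of Section II-A, then renormalise the
    remaining weights (our addition)."""
    n = len(pi)
    keep = [j for j in range(n) if pi[j] >= factor / n]
    pi_kept = [pi[j] for j in keep]
    total = sum(pi_kept)
    return ([p / total for p in pi_kept],
            [mu[j] for j in keep],
            [sigma[j] for j in keep])
```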

Figure 3 shows the identified regions. The ellipses numbered 1 to 8 correspond to Entry/Exit regions (in blue) and those numbered 9 to 14 to Target regions (in green). Notice that the system has accurately identified all the possible Entry/Exit regions, while the Target regions have a more ambiguous meaning that makes the results more difficult to evaluate.

Fig. 3. Regions of interest identified in the scenario. The regions are of three different types: Entry/Exit (blue), Target (green) and Neutral (remaining area).

Nevertheless, most of the identified regions are interesting: regions T9 and T12 are browsing areas where there is some information for people to read, region T11 is the reception desk, and in region T13 there are lounges. Region T10 may not be evidently interesting, but it corresponds to the place where the reception clerk usually stands. Region T14 is harder to interpret; it was detected because some people stopped there for a considerable period of time.

III. Modeling the scene as a Finite-State Automaton

The previous section describes how to identify regions in a given scene, based on observable human activities. The detected regions can now be modeled to allow recognizing useful human behaviours. If the regions are considered as states, and the transitions between regions as events, it is easy to model our scene as a deterministic automaton.

From the analysis of Figure 3 it is clear that the automaton that models the scene should be a Finite-State Automaton (FSA), with the interesting characteristic of marking a regular language. The FSA used here is a six-tuple G = (X, E, f, Γ, X_0, X_m), defined by the directed graph of Figure 4, and has some important characteristics: a transition between regions is always to or from region N; consequently, the only state with more than one active event is state N, while all the other states have a single active event, which triggers a transition to N.

Fig. 4. Resulting Finite-State Automaton that models the scenario.

The events are named by concatenating the two states involved; for instance, E1N is a transition from E1 to N, and NE1 from N to E1.

The following parameters remain to be defined:
X is the set of states: {E1, ..., E8, T9, ..., T14, N}.
E is the set of events: {E1N, ..., E8N, NE1, ..., NE8, T9N, ..., T14N, NT9, ..., NT14}.
f : X × E → X is the transition function.
Γ : X → 2^E is the active event function.
X_0 is the set of initial states: {E1, ..., E8}.
X_m ⊆ X is the set of marked states, and depends on the problem's purpose.
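The star topology of this automaton is easy to make explicit in code. The sketch below is hypothetical (the region names follow Figure 4; the `run` and `active_events` helpers are ours):

```python
# Scene FSA sketch: every event either enters or leaves the Neutral region N.
REGIONS = [f'E{i}' for i in range(1, 9)] + [f'T{i}' for i in range(9, 15)]

# Transition function f: (state, event) -> next state.
f = {}
for r in REGIONS:
    f[(r, r + 'N')] = 'N'      # leave region r into N, e.g. event 'E1N'
    f[('N', 'N' + r)] = r      # enter region r from N, e.g. event 'NE1'

def active_events(state):
    """Gamma(x): events enabled in a state. Only N enables more than one."""
    return sorted(e for (s, e) in f if s == state)

def run(initial, events):
    """Feed an event sequence to the automaton, returning the state path."""
    state, path = initial, [initial]
    for e in events:
        state = f[(state, e)]
        path.append(state)
    return path
```

A marked language can then be recognized simply by checking whether the final state of `run` belongs to a chosen set X_m.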

When tracking a person, the output is a position Pt = (xt, yt) that belongs to a specific region, according to equation (2). The events are therefore obtained when a person goes from one region to another. The language generated by the automaton describes the person's path through the defined regions, and specific states can also be marked in order to recognize a desired marked language.

The human behaviour depends not only on the physical presence of the person in a region but also on the actions performed in that region. Therefore, it is necessary to restrict the event detection by imposing an affordable behaviour, that is, people's actions have to follow a learned pattern inherent to each region. For instance, in order to consider the state T11 (going to the reception desk), a person has to remain (active or inactive) in that region for a minimum predefined period of time.
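The T11 example above suggests a simple dwell-time gate. The following sketch is our illustration only; the log format (a list of activity/duration pairs observed inside a region) and the threshold are assumptions:

```python
def gate_event(activity_log, min_duration):
    """Accept a region event only if the person performed a Body
    activity (Active or Inactive) in that region for at least
    `min_duration` seconds.

    activity_log : list of (activity, duration_in_seconds) pairs
                   observed while the person was inside the region.
    """
    body_time = sum(d for a, d in activity_log if a in ('Active', 'Inactive'))
    return body_time >= min_duration
```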

A. Learned regions as affordances

Until now only deterministic systems have been used, which means that the event sequence that triggers the FSA only specifies the chronological order in which the events take place; it does not provide the time instants associated with them. This approach is suitable for modeling processes like the transitions between regions, because these can be observed without ambiguity. Here, however, it is necessary to include the temporal stochastic process that models the normal duration of a performed activity.

The idea is, for each state of the Finite-State Automaton, to estimate a Stochastic Timed Automaton that models the behaviour a person is afforded in that region. This automaton has four possible states, {Active, Inactive, Walking, Running}, with the generic form represented in Figure 5, where each state is represented by the first letter of the activity.

Fig. 5. Standard form of the Stochastic Timed Automaton.

The event set considered has only one possible event: the transition between activities. The Stochastic Timed Automaton is thus a six-tuple (ε, χ, Γ, p, X_0, G), where ε = e is the only event, χ the state space, Γ(x) = e the enabled event for all states, p(x′; x, e) the state transition probability, X_0 = {W, R} the set of initial states and G the stochastic clock structure. In this way a Generalized Semi-Markov Process (GSMP) is generated. The stochastic clock structure is a distribution that characterizes the interevent stochastic clock sequence {V_k} = {V_1, V_2, ...}, V_k ∈ R⁺, k = 1, 2, ....

Modeling the event as a Poisson process has several properties that allow simplifying GSMP models into Markov chains, and turns out to be accurate for many practical problems of interest. The interevent process {V_k} is therefore exponentially distributed, with cdf G(t) = P[V_k ≤ t] = 1 − e^(−λt) and pdf obtained by differentiating G(t): g(t) = λe^(−λt).

From the images, a sample x = {x_1, x_2, ..., x_n} was obtained in each region by measuring the interevent times. Notice that the interevent time represents the interval during which a person keeps performing the same activity (the event models a change between activities). An estimate of the rate parameter, λ̂, can be obtained by maximum likelihood, constructing the likelihood function:

L(λ) = ∏_{i=1}^{n} λe^(−λx_i) = λ^n e^(−λ ∑_{i=1}^{n} x_i) = λ^n e^(−λnx̄)      (3)

where x̄ = (1/n) ∑_{i=1}^{n} x_i is the sample mean.

The derivative of the logarithm of the likelihood function is:

d/dλ ln L(λ) = d/dλ (n ln λ − λnx̄) = n/λ − nx̄      (4)

and imposing d/dλ ln L(λ) = 0 gives the estimate λ̂ = 1/x̄ for the rate parameter.
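The closed-form estimator of equations (3)-(4) is a one-liner in code (NumPy; the function name is ours):

```python
import numpy as np

def estimate_rate(interevent_times):
    """Maximum-likelihood rate of an exponential distribution:
    lambda_hat = 1 / sample_mean, per equations (3)-(4)."""
    x = np.asarray(interevent_times, dtype=float)
    return 1.0 / x.mean()
```

On a large synthetic sample drawn from an exponential distribution, the estimate concentrates around the true rate, as the check below illustrates.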

The rate parameters obtained for each of the Stochastic Timed Automata, one learned per region, are shown in Figure 6.

Fig. 6. Values of the estimated rate parameters for the 14 regions, each one corresponding to a different Stochastic Timed Automaton.

The state transition probabilities, p(x′; x, e), are also learned from the data, by counting the number of transitions between each state and the others and normalizing so that, for each state, ∑_{x′} p(x′; x, e) = 1. The states with zero transition probability are then eliminated. In this way a Stochastic Timed Automaton is specified for each region, that is, for each state of the Finite-State Automaton.

The example shown in Figure 7 represents the resulting Stochastic Timed Automata for regions T9, T11 and T12, together with the Neutral region N.

Using this hierarchical approach it is then possible to eliminate the transitions of the Finite-State Automaton that do not follow the Stochastic Timed Automaton learned for that region.


Fig. 7. Resulting Stochastic Timed Automata for regions T9, T11 and T12, and the Neutral region N.

IV. Exploring some possible results

Figure 8 shows a person walking through the scenario.

Fig. 8. Four sample images of a sequence corresponding to a person walking in the scenario. Image frames 310, 390, 480 and 550.

The language generated by the Finite-State Automaton for this example is: E8N NT13 T13N NE3. Although the person enters region T13, he does not use its affordances (that is, he does not sit on the sofas). Using the Stochastic Timed Automaton it is possible to eliminate events NT13 and T13N, because there is no change in the person's walking activity. The obtained description of the scene is therefore E8N NE3, which accurately explains the person's behaviour: the person enters the space through region E8, passes through the neutral region and exits at E3.
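This filtering step can be sketched as follows (plain Python; the event-name parsing and the `used_affordance` map, which stands in for the Stochastic Timed Automaton check, are our assumptions about how the two levels would be wired together):

```python
def describe_path(events, used_affordance):
    """Drop enter/leave event pairs for Target regions whose affordances
    were not used, as in the E8N NT13 T13N NE3 -> E8N NE3 example.

    used_affordance : maps a region name to True when the person's
                      activity pattern matched that region's model.
    Entry/Exit events are always kept.
    """
    kept = []
    for e in events:
        region = e[:-1] if e.endswith('N') else e[1:]  # e.g. 'NT13' -> 'T13'
        if region.startswith('E') or used_affordance.get(region, False):
            kept.append(e)
    return kept
```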

Using this kind of approach it is then possible to obtain a description of the tracked person's behaviour in terms of the regions they enter and the affordances they use.

A. Improving the activity classification

This approach can also be used to help the activity classification procedure improve its results. The estimation of the posterior distribution over the state space s_t (the possible activities), conditioned on the available data d_{0...t}, p(s_t | d_{0...t}, M_t), where M_t is the model associated with the region R_t occupied at time t, can be improved by the models previously constructed. The classification can be done under a Markov assumption, i.e. that the past is independent of the future given the current state:

p(s_t | d_{0...t}, M_t) = p(s_t | o_1, e_1, ..., o_{t−1}, e_{t−1}, o_t, M_t)

(Bayes)      = α_t p(o_t | o_1, e_1, ..., o_{t−1}, e_{t−1}, s_t, M_t) p(s_t | o_1, e_1, ..., o_{t−1}, e_{t−1}, M_t)

(Markov)     = α_t p(o_t | s_t) p(s_t | o_1, e_1, ..., o_{t−1}, e_{t−1}, M_t)

(Tot. prob.) = α_t p(o_t | s_t) ∑_{s_{t−1}} p(s_t | o_1, e_1, ..., o_{t−1}, e_{t−1}, s_{t−1}, M_t) p(s_{t−1} | o_1, e_1, ..., o_{t−1}, e_{t−1}, M_t)

(Markov)     = α_t p(o_t | s_t) ∑_{s_{t−1}} p(s_t | e_{t−1}, s_{t−1}, M_t) p(s_{t−1} | d_{0...t−1}, M_{t−1})

where p(o_t | s_t) is the perceptual model and p(s_t | e_{t−1}, s_{t−1}, M_t) the transition model. The perceptual model probability is obtained by the activity recognizer and the transition model by the Stochastic Timed Automata defined for each region.
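One step of this recursion is easy to sketch (NumPy; we assume a fixed ordering of the activity states so that the transition model becomes a matrix, and the function name is ours):

```python
import numpy as np

def bayes_update(prior, transition, perceptual):
    """One step of the recursive classifier of Section IV-A:
    posterior(s_t) ∝ p(o_t | s_t) * sum_{s_{t-1}} p(s_t | e, s_{t-1}, M) * prior(s_{t-1}).

    transition[i, j] is p(s_t = j | s_{t-1} = i), given the last event
    and the current region's model.
    """
    predicted = prior @ transition       # total probability step
    posterior = perceptual * predicted   # weight by the perceptual model
    return posterior / posterior.sum()   # alpha_t normalisation
```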

V. Conclusion

A hierarchical approach that models the observed scene has been proposed. First, the regions of interest are automatically defined using information from people's trajectories, activities and scene edges; then, again using information from the performed activities, each region is modeled based on its affordances. Using the Expectation-Maximization algorithm, the regions are elliptically defined. This information is then transformed into a Finite-State Automaton. The model is completed by adding a second level of Stochastic Timed Automata, one for each state of the FSA.

Two examples of possible uses for the model are suggested: (i) describing the human behaviour in terms of regions visited and/or used, and (ii) improving a system that classifies human activities.

References

[1] Aaron F. Bobick, James W. Davis, "The Recognition of Human Movement Using Temporal Templates", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 3, March 2001.

[2] Somboon Hongeng, Ram Nevatia, Francois Bremond, "Video-based event recognition: activity representation and probabilistic recognition methods", Computer Vision and Image Understanding, Vol. 96, 2004, pp. 129-162.

[3] Pedro Canotilho Ribeiro, Jose Santos-Victor, "Human Activities Recognition from Video: modeling, feature selection and classification architecture", HAREM, September 2005.

[4] O. Masoud, N. Papanikolopoulos, "Recognizing human activities", IEEE Conf. on Advanced Video and Signal Surveillance, 2003.

[5] N. Vlassis and A. Likas, "The kurtosis-EM algorithm for Gaussian mixture modelling", IEEE Trans. SMC, 1999.

[6] Tao Zhao, Ram Nevatia, "Tracking Multiple Humans in Complex Situations", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, No. 9, September 2004.

[7] “CAVIAR PROJECT”, http://homepages.inf.ed.ac.uk/rbf/CAVIAR/.