Research Article A Low-Dimensional Radial Silhouette-Based ...downloads.hindawi.com/archive/2014/547069.pdfnd silhouette-based features which rely either on the whole shape of the

Research ArticleA Low-Dimensional Radial Silhouette-Based Feature forFast Human Action Recognition Fusing Multiple Views

Alexandros Andre Chaaraoui1 and Francisco Floacuterez-Revuelta2

1 Department of Computer Technology University of Alicante PO Box 99 03080 Alicante Spain2 Faculty of Science Engineering and Computing Kingston University Penrhyn Road Kingston uponThames KT1 2EE UK

Correspondence should be addressed to Alexandros Andre Chaaraoui alexandrosuaes

Received 30 April 2014 Accepted 6 July 2014 Published 29 October 2014

Academic Editor Antonios Gasteratos

Copyright copy 2014 A A Chaaraoui and F Florez-Revuelta This is an open access article distributed under the Creative CommonsAttribution License which permits unrestricted use distribution and reproduction in any medium provided the original work isproperly cited

This paper presents a novel silhouette-based feature for vision-based human action recognition which relies on the contour ofthe silhouette and a radial scheme Its low-dimensionality and ease of extraction result in an outstanding proficiency for real-timescenarios This feature is used in a learning algorithm that by means of model fusion of multiple camera streams builds a bag of keyposes which serves as a dictionary of known poses and allows converting the training sequences into sequences of key posesTheseare used in order to perform action recognition by means of a sequence matching algorithm Experimentation on three differentdatasets returns high and stable recognition rates To the best of our knowledge this paper presents the highest results so far onthe MuHAVi-MAS dataset Real-time suitability is given since the method easily performs above video frequency Therefore therelated requirements that applications as ambient-assisted living services impose are successfully fulfilled

1 Introduction

Human action recognition has been in great demand in thefield of pattern recognition given its direct relation to videosurveillance human-computer interaction and ambient-assisted living (AAL) among other application scenariosEspecially in the latter human behavior analysis (HBA)in which human action recognition plays a fundamentalrole can endow smart home services with the requiredldquosmartnessrdquo needed to perform fall detection intelligentsafety services (closing a door or an open tap) or activities ofdaily living (ADL) recognition Upon these detection stagesAAL services may learn subjectsrsquo routines diets and evenpersonal hygiene habits which allow providing useful andproactive services For this reason human action recognitiontechniques are essential in order to develop AAL services thatsupport safety at home health assistance and aging in place

Great advances have been made in vision-based motioncapture and analysis [1] and at the same time the society isstarting to demand sophisticated and accurate HBA systemsThis is shown for example in the recent rise of interestin devices like the Microsoft Kinect sensor Current efforts

focus on achieving admissible recognition speeds [2] whichis essential for real-time and online systems Another goalis the successful handling of multiview scenarios so as toadd robustness to occlusions and improve the quality of therecognition [3] One of the main drawbacks of multiviewtechniques is that rich and detailed 3D scene reconstructionsare normally incompatible with real-time recognition On theother hand simpler 2D-based methods fail to achieve thesame recognition robustness [4]

The current proposal builds upon earlier work where wehave presented a human action recognition method basedon silhouettes and sequences of key poses which showsto be suitable for real-time scenarios and specially robustto actor variances [5] In [6] a study is included compar-ing different approaches of fusing multiple views using anapproach based on a bag of key poses which is extendedin the present contribution One method stage in whichsubstantial computational cost is added and success in laterstages depends upon is feature extraction In this paper thisspecific stage is especially targeted A low-dimensional radialsilhouette-based feature is combined with a simple learningapproach based on multiple video streams Working with

Hindawi Publishing CorporationInternational Scholarly Research NoticesVolume 2014 Article ID 547069 11 pageshttpdxdoiorg1011552014547069

2 International Scholarly Research Notices

silhouette contour points radial bins are computed usingthe centroid as the origin and a summary representation isobtained for each bin This pose representation is used inorder to obtain the per-view key poses which are involvedin each action performance Therefore a model fusion ofmultiple visual sensors is applied From the obtained bag ofkey poses the sequences of key poses of each action classare computed which are used later on for sequence matchingand recognition Experimentation performed on two publiclyavailable datasets (Weizmann [7] and MuHAVi [8]) and aself-recorded one shows that the proposed technique not onlyobtains very high and stable recognition rates but also provesto be suitable for real-time applications Note that by ldquorealtimerdquo we mean that recognition can be performed at videofrequency or above as is common in the field

The remainder of this paper is organized as followsSection 2 summarizes the most recent and relevant works inhuman action recognition focusing on the type of featuresused and howmultiview scenarios are managed In Section 3our proposal is detailed offering a low-dimensional featurebased on silhouette contours and a radial scheme Sections4 and 5 specify the applied multiview learning approachbased on a bag of key poses and action recognition throughsequence matching Section 6 analyzes the obtained resultsand compares them with the state of the art in terms ofrecognition rate and speed providing also an analysis ofthe behaviour of the proposed method with respect to itsparameters Finally we present conclusions and discussion inSection 7

2 Related Work21 Feature Extraction Regarding the feature extractionstage of human action recognition methods based on visionthese can be differentiated by the either static or dynamicnature of the feature Whereas static features consider onlythe current frame (extracting diverse types of characteristicsbased on shape gradients key points etc) dynamic featuresconsider a sequence of several frames and apply techniqueslike image differencing optical flow and spatial-temporalinterest points (STIP)

Among the former we find silhouette-based featureswhich rely either on the whole shape of the silhouette oronly on the contour points In [19] action primitives areextracted reducing the dimensionality of the binary imageswith principal component analysis (PCA) Polar coordinatesare considered in [20] where three radial histograms aredefined for the upper part the lower part and the wholehuman body Each polar coordinate system has several binswith different radii and angles and the concatenated nor-malized histograms are used to describe the human postureSimilarly in [21] a log-polar histogram is computed choosingthe different radii of the bins based on logarithmic scaleSilhouette contours are employed in [10] with the purpose ofcreating a distance signal based on the pointwise Euclideandistances between each contour point and the centroid of thesilhouette Conversely in [22] the pairwise distances betweencontour points are computed to build a histogramof distancesresulting in a rotation scale and translation invariant feature

In [9] the whole silhouette is used for gait recognition Anangular transform based on the average distance betweenthe silhouette points and the centroid is obtained for eachcircular sectorThis shows robustness to segmentation errorsSimilarly in [23] the shape of the silhouette contour isprojected on a line based on the R transform which is thenmade invariant to translation Silhouettes can also be used toobtain stick figures for instance by means of skeletonizationChen et al [24] applied star skeletonization to obtain a five-dimensional vector in star fashion considering the headthe arms and the legs as local maxima Pointwise distancesbetween contour points and the centroid of the silhouetteare used to find the five local maxima In the work ofIkizler and Duygulu [11] a different approach based on aldquobag-of-rectanglesrdquo is presented In their proposal orientedrectangular patches are extracted over the human silhouetteand the human pose is represented with a histogram ofcircular bins of 15∘ each

A very popular dynamic feature in pattern recognitionbased on computer vision is optical flow Fathi and Mori[13] rely on low-level features based on optical flow In theirwork weighted combinations of midlevel motion featuresare built covering small spatiotemporal cuboids from whichthe low-level features are chosen In [25] motion over asequence of frames is considered defining motion historyand energy images These encode the temporal evolutionand the location of the motion respectively over a numberof frames This work has been extended by [26] so as toobtain a free-viewpoint representation from multiple viewsA similar objective is pursued in [7] where time is consideredas the third dimension building space-time volumes basedon sequences of binary silhouettes Action recognition isperformed with global space-time features composed of theweighted moments of local space-time saliency and orienta-tion Cherla et al [27] combine eigenprojections of the widthprofile of the actor with the centroid of the silhouette and thestandard deviation in the 119883 and 119884 axes in a single featurevector Robustness to occlusions and viewpoint changes istargeted in [28] A 3D histogram of oriented gradients(3DHOG) is computed for densely distributed regions andcombined with temporal embedding to represent an entirevideo sequence Tran and Sorokin [12] merge both silhouetteshape and optical flow in a 286-dimensional feature whichalso includes the context of 15 surrounding frames reducedby means of PCA This feature has been used successfully inother works as for instance recently in [29] Rahman et al[30] take an interesting approach proposing a novel featureextraction techniquewhich relies on the surrounding regionsof the subjects These negative spaces present advantagesrelated to robustness to boundary variations caused by partialocclusions shadows and nonrigid deformations

RGB-D data that is RGB color information along pixel-wise depthmeasurement is increasingly being used since theMicrosoft Kinect device has been released Using the depthdata and relying on an intermediate body part recognitionprocess a markerless body pose estimation in form of3D skeletal information can be obtained in real time [2]This kind of data results proficient for gesture and actionrecognition required by applications such as gaming and

International Scholarly Research Notices 3

natural user interfaces (NUI) [31] In [31 32] more detailedsurveys about these recently appeared depth-based methodscan be found

Naturally the usage of static features does not mean thatthe temporal aspect cannot be considered Temporal cuesare commonly reflected in the change between successiveelements of a sequence of features or in the learning algorithmitself For further details about the state of the art we refer to[1 33]

22 Multiview Recognition Another relevant area for thiswork is how human action recognition is handled whendealing with multiple camera views Multiview recognitionmethods can be classified for example by the level at whichthe fusion of information happens Initially when dealingwith 2D data from multiple sources these can be usedin order to create a 3D representation [34 35] This datafusion allows applying a single feature extraction processwhich minimizes information loss Nevertheless 3D rep-resentations usually imply a higher computational cost asappropriate 3D features need to be obtained Feature fusionplaces the fusion process one step further by obtaining single-view features for each of the camera views and generatinga common representation for all the features afterwardsThe fusion process depends on the type of data Featurevectors are commonly combined by aggregation functions orconcatenation of vectors [36 37] or also more sophisticatedtechniques as canonical correlation analysis [29] The appealof this type of fusion is the resulting simplicity of transitionfrom single- to multiview recognition methods since multi-viewdata is only handled implicitly A learningmethodwhichin fact learns and extracts information from actions or posesfrom multiple views requires considerations at the learningscheme Through model fusion multiple views are learnedeither as other possible instances of the same class [36] orby explicitly modelling each possible view [38] These 2Dor 3D models may support a limited or unlimited numberof points of view (POV) Last but not least informationfusion can be applied at the decision level In this case foreach of the views a single-view recognition method is usedindependently and a decision is taken based on the single-view recognition results The best view is chosen based onone or multiple criteria like closest distance to the learnedpattern highest scoreprobability of feature matching ormetrics which try to estimate the quality of the receivedinput pattern However the main difficulty is to establishthis decision rule because it depends strongly on the type ofactions to recognize and on the camera setup For examplein [39] a local segment similarity voting scheme is employedto fusemultiple views and superior results are obtainedwhencomparedwith feature fusion based on feature concatenationFinally feature extraction and fusion of multiple views donot necessarily have to be considered two separate processingstages For instance in [40 41] lattice computing is proposedfor low-dimensional representation of 2D shapes and datafusion

In our case model fusion has been chosen because of tworeasons (1) in comparison with fusion at the decision levelonly a single learning process is required in order to perform

multiview recognition and (2) it allows explicit modeling ofthe poses from each view that are involved in a performanceof an action As a consequence multiple benefits can beobtained as follows

(1) Once the learning process has been finished furtherviews and action classes can be learned withoutrestarting the whole process This leads to sup-porting incremental learning and eliminating thewidely accepted limitation of batch-mode training forhuman action recognition [42]

(2) The camera setups do not need to match betweentraining and testing stages More camera views mayimprove the result of the recognition though it is notrequired to have all the cameras available

(3) Each camera view is processed separately andmatched with the corresponding view withoutrequiring to know specifically at which angle it isinstalled

These considerations are important requirements in orderto apply the proposed method to the development of AALservices Model fusion enabled us to fulfil these constraintsas will be seen in the following sections

3 Pose Representation Feature

As has been previously introduced our goal is to performhuman action recognition in real time and to do so evenin scenarios with multiple cameras Therefore the compu-tational cost of feature extraction needs to be minimal Thisleads us to the usage of silhouette contours Human silhou-ettes contain rich shape information and can be extracted rel-atively easily for example through background subtractionor human body detection In addition silhouettes and theircontours show certain robustness to lighting changes andsmall viewpoint variations compared to other techniquesas optical flow [43] Using only the contour points of thesilhouette results in a significant dimensionality reduction bygetting rid of the redundant interior points

The following variables are used along this section

(1) the number of contour points 119899(2) the number of radial bins 119878(3) the indices 119894 119895 119896 and 119897 for all 119894 119896 119897 isin 1 119899 and

for all119895 isin 1 119878

We use the border following algorithm from [44] toextract the 119899 contour points P = 119901

1 1199012 119901

119899 where

119901119894= (119909119894 119910119894) Our proposal consists in dividing the silhouette

contour in 119878 radial bins of the same angle Taking the centroidof the silhouette as the origin the specific bin of each contourpoint can be assigned Then in difference to [20 21] whereradial or log-polar histograms are used as spatial descriptorsor [24] where star skeletonization is applied in our approachan efficient summary representation is obtained for eachof the bins whose concatenation returns the final feature(Figure 1 shows an overview of the process)


pk

pl

C

1

2

3

4

5

6

7

8

910

11

12

13

14

15

16

17

18

f(pk pk+1 pl) = 1

Figure 1 Overview of the feature extraction process (1) All thecontour points are assigned to the corresponding radial bin (2) foreach bin a summary representation is obtained (Example with 18bins)

The motivation behind using a radial scheme is two-foldOn one hand it relies on the fact that when using a directcomparison of contours even after length normalization as in[10] spatial alignment between feature patterns is still miss-ing Each silhouette has a distinct shape depending on theactor and the action class and therefore a specific part of thecontour can have more or less points in each sample Usingan element-wise comparison of the radial bins of differentcontours we ignore howmany points each sample has in eachbin This avoids an element-wise comparison of the contourpoints which would imply the erroneous assumption thatthese are correlated On the other hand this radial schemeallows us to apply an even further dimensionality reductionby obtaining a representative summary value for each radialbin

The following steps are taken to compute the feature

(1) The centroid of the contour points 119862 = (119909119888 119910119888) is

calculated as

119909119888=sum119899

119894=1

119909119894

119899 119910

119888=sum119899

119894=1

119910119894

119899 (1)

(2) The pointwise Euclidean distances between each con-tour point and the centroid D = 119889

1 1198892 119889

119899 are

obtained as in [10] Consider

119889119894=1003817100381710038171003817119862119898 minus 119901119894

1003817100381710038171003817 forall119894 isin 1 119899 (2)

(3) Considering a clockwise order the corresponding bin119904119894of each contour point 119901

119894is assigned as follows

(for the sake of simplicity 120572119894= 0 is considered as

120572119894= 360)

120572119894=

arccos(119910119894minus 119910119888

119889119894

) sdot180

120587 if 119909i ge 0

180 + arccos(119910119894minus 119910119888

119889119894

) sdot180

120587 otherwise

119904119894= lceil

119878 sdot 120572119894

360rceil forall119894 isin 1 119899

(3)

(4) Finally a summary representation is obtained for thepoints of each bin The final feature V results of theconcatenation of summary representations These arenormalized to unit sum in order to achieve scaleinvariance

V119895=119891 (119901119896 119901119896+1 119901

119897)

119904119896 119904

119897

= 119895 and 119896 119897 isin 1 119899

forall119895 isin 1 119878

V119895=

V119895

sum119878

0=1

V0

forall119895 isin 1 119878

V = V1 V2 sdot sdot sdot V

119878

(4)

The function 119891 could be any type of function whichreturns a significant value or property of the input pointsWe tested three types of summaries (variance max valueand range) based on the previously obtained distances to thecentroid whose results will be analyzed in Section 6

The following definitions of 119891 are used

119891var (119901119896 119901119896+1 119901119897) =119897

sum

119894=119896

(119889119894minus 120583)2

(5)

where 120583 is the average distance of the contour points of eachbin Consider

119891max (119901119896 119901119896+1 119901119897) = max (119889119896 119889119896+1 119889

119897)

119891range (119901119896 119901119896+1 119901119897) = max (119889119896 119889119896+1 119889

119897)

minusmin (119889119896 119889119896+1 119889

119897)

(6)

Figure 2 shows an example of the result of the 119891max summaryfunction

4 Multiview Learning Algorithm

Considering that multiple views of the same field of view areavailable our method learns from these views at the modellevel relying therefore on model fusion K-means clusteringis used in order to identify the per-view representativeinstances the so-called key poses of each action class Theresulting bag of key poses serves as a dictionary of knownposes and can be used to simplify the training sequences ofpose representations to sequences of key poses


Obtain119898119886119905119888ℎ119890119904 and 119886119904119904119894119892119899119898119890119899119905119904for each 119886119888119905119894119900119899 119888119897119886119904119904 isin 119905119903119886119894119899119894119899119892 119904119890119905 do

for each 119891119903119886119898119890 isin 119886119888119905119894119900119899 119888119897119886119904119904 doV = 119891119890119886119905119906119903119890 119890119909119905119903119886119888119905119894119900119899(119891119903119886119898119890)

119896119901 119896119901 119888119897119886119904119904 = 119899119890119886119903119890119904119905 119899119890119894119892ℎ119887119900119903(V bag-of -key-poses)if kp 119888119897119886119904119904 = 119886119888119905119894119900119899 119888119897119886119904119904 then119898119886119905119888ℎ119890119904kp = 119898119886119905119888ℎ119890119904kp + 1

end if119886119904119904119894119892119899119898119890119899119905119904kp = 119886119904119904119894119892119899119898119890119899119905119904kp + 1

end forend for

Obtain key pose weightsfor each kp isin bag-of -key-poses do

if 119886119904119904119894119892119899119898119890119899119905119904kp gt 0 then

119908kp =119898119886119905119888ℎ119890119904kp

119886119904119904119894119892119899119898119890119899119905119904kpelse119908kp = 0

end ifend for

Algorithm 1 Pseudocode for obtaining the key pose weights 119908

Figure 2 Example of the result of applying the 119891max summaryfunction

First all the training video sequences need to be pro-cessed to obtain their pose representations Supposing that119872views are available and 119877 action classes need to be learnedK-means clustering with Euclidean distance is applied forthe pose representations of each combination of view andaction class separatelyHence119870 clusters are obtained for eachof the 119872 times 119877 groups of data The center of each cluster istaken as a key pose and a bag of key poses of 119870 times 119872 times

119877 class representatives is generated In this way an equalrepresentation of each of the action classes and fused viewscan be assured in the bag of key poses (Figure 3 shows anoverview of the process)

At this point the training data has been reduced to arepresentative model of the key poses that are involved ineach view of each action class Nevertheless not all the keyposes are equally important Very common poses such asstanding still are not able to distinguish between actionswhereas a bend pose can most certainly be only found inits own action class For this reason a weight 119908 whichindicates the capacity of discrimination of each key pose 119896119901 isobtained For this purpose all available pose representationsare matched with their nearest neighbor among the bag ofkey poses (using Euclidean distance) so as to obtain the ratioof within-class matches 119908

119896119901= 119898119886119905119888ℎ119890119904

119896119901119886119904119904119894119892119899119898119890119899119905119904

119896119901 In

this mannermatches is defined as the number of within-classassignments that is the number of cases in which a poserepresentation is matched with a key pose from the sameclass whereas assignments denotes the total number of timesthat key pose got chosen Please see Algorithm 1 for greaterdetail

Video recognition presents a clear advantage over imagerecognition which relies on the temporal dimension Theavailable training sequences present valuable informationabout the duration and the temporal evolution of actionperformances In order to model the temporal relationshipbetween key poses the training sequences of pose represen-tations are converted into sequences of key poses For eachsequence the corresponding sequence of key poses Seq =

1198961199011 1198961199012 119896119901

119905 is obtained by interchanging each pose

representation with its nearest neighbor key pose among thebag of key poses This allows us to capture the long-termtemporal evolution of key poses and at the same time tosignificantly improve the quality of the training sequences asnoise and outliers are filtered


Action1

ActionR

View1

View1

ViewM

ViewM

K-means

K-means

K-means

K-means

K key poses

K key poses

K key poses

K key poses

Bag of key poses

Figure 3Overviewof the generation process of the bag of key posesFor each action the per-view key poses are obtained through K-means clustering taking the cluster centers as representatives

5 Action Recognition

During the recognition stage the goal is to assign an actionclass label to an unknown sequence For this purpose thevideo sequence is processed in the same way as the trainingsequences were (1) The corresponding pose representationof each video frame is generated and (2) the pose repre-sentations are replaced with the nearest neighbor key posesamong the bag of key posesThis way a sequence of key posesis obtained and recognition can be performed by means ofsequence matching

Since action performances can nonuniformly vary inspeed depending on the actor and hisher conditionsequences need to be aligned properly Dynamic time warp-ing (DTW) [45] shows proficiency in temporal alignment ofsequences with inconsistent lengths accelerations or decel-erations We use DTW in order to find the nearest neighbortraining sequence based on the lowest DTW distance

Given two sequences Seq = 1198961199011 1198961199012 119896119901

119905 and Seq1015840 =

1198961199011015840

1

1198961199011015840

2

1198961199011015840

119906

the DTW distance 119889DTW(Seq Seq1015840

) canbe obtained as follows

119889DTW (Seq Seq1015840) = dtw (119905 119906)

dtw (119894 119895) = min

dtw (119894 minus 1 119895) dtw (119894 119895 minus 1)

dtw (119894 minus 1 119895 minus 1)

+ 119889 (119896119901119894

1198961199011015840

119895

)

(7)

where the distance between two key poses 119889(119896119901119894

1198961199011015840

119895

) isobtained based on both the Euclidean distance between theirfeatures and the relevance of the match of key poses As seenbefore not all the key poses are as relevant for the purposeof identifying the corresponding action class Hence it can

Table 1 Value of 119911 based on the pairing of key poses and the signeddeviation Ambiguous stands for 119908 lt 01 and discriminative standsfor 119908 gt 09 (These values have been chosen empirically)

Signed deviation Pairing 119911

dev(119894 119895) lt 0 Discriminative minus1

dev(119894 119895) gt 0 Discriminative +1

Any Ambiguous minus1

Any Discriminative and ambiguous +1

be determined how relevant a specific match of key poses isbased on their weights 119908

119894and 1199081015840

119895

In this sense the distance between key poses is obtained

as

119889 (119896119901119894

1198961199011015840

119895

) =10038161003816100381610038161003816119896119901119894

minus 1198961199011015840

119895

10038161003816100381610038161003816+ 119911 rel (119894 119895)

rel (119894 119895) = 10038161003816100381610038161003816dev (119894 119895) lowast 119908119894 lowast 1199081015840

119895

10038161003816100381610038161003816

dev (119894 119895) = 10038161003816100381610038161003816119896119901119894 minus 1198961199011015840

119895

10038161003816100381610038161003816minus average distance

(8)

where average distance corresponds to the average distancebetween key poses computed throughout the training stageAs it can be seen the relevance rel(119894 119895) of the match isdetermined based on the weights of the key poses thatis the capacity of discrimination and the deviation of thefeature distance Consequently matches of key poses whichare very similar or very different are consideredmore relevantthan those that present an average similarity The value of119911 depends upon the desired behavior Table 1 shows thechosen value for each case In pairings of discriminative keyposes which are similar to each other a negative value ischosen in order to reduce the feature distance If the distanceamong them is higher than average this indicates that theseimportant key poses do notmatchwell together and thereforethe final distance is increased For ambiguous key poses thatis key poses with low discriminative value pairings are not asimportant for the distance between sequences On the otherhand a pairing of a discriminative and an ambiguous keypose should be disfavored as these key poses should matchwith instances with similar weights Otherwise the operatoris based on the sign of dev(119894 119895) which means that low featuredistances are favored (119911 = minus1) and high feature distancesare penalized (119911 = +1) This way not only the shape-basedsimilarity between key poses but also the relevance of thespecific match is taken into account in sequence matching

Once the nearest neighbor sequence of key poses is foundits label is retrieved This is done for all the views that areavailable during the recognition stage The label of the matchwith the lowest distance is chosen as the final result of therecognition that is the result is based on the best viewThis means that only a single view is required in order toperform the recognition even though better viewing anglesmay improve the result Note that this process is similar todecision-level fusion but in this case recognition relies onthe same multiview learning model that is the bag of keyposes


6 Experimental Results

In this section the presented method is tested on threedatasets which serve as benchmarks On this single- andmul-tiview data our learning algorithm is used with the proposedfeature and the results of the three chosen summary repre-sentations (variance max value and range) are comparedIn addition the distance-signal feature from Dedeoglu et al[10] and the silhouette-based feature from Boulgouris et al[9] which have been summarized in Section 2 are used asa reference so as to make a comparison between featurespossible Lastly our approach is compared with the state ofthe art in terms of recognition rates and speed

61 Benchmarks The Weizmann dataset [7] is very popularin the field of human action recognition It includes videosequences from nine actors performing ten different actionsoutdoors (bending jumping jack jumping forward jumping inplace running galloping sideways skipping walking wavingone hand andwaving two hands) and has been recorded witha static front-side camera providing RGB images of a resolu-tion of 180 times 144 px We use the supplied binary silhouetteswithout postalignmentThese silhouettes have been obtainedautomatically through background subtraction techniquestherefore they present noise and incompleteness It is worthto mention that we do include the skip action which isexcluded in several other works because it usually has anegative impact on the overall recognition accuracy

The MuHAVi dataset [8] targets multiview humanaction recognition since it includes 17 different actionsrecorded from eight views with a resolution of 720 times

576 px MuHAVi-MAS provides manually annotated silhou-ettes for a subset of two views from 14 (MuHAVi-14 Col-lapseLeft CollapseRight GuardToKick GuardToPunch Kick-Right PunchRight RunLeftToRight RunRightToLeft Standu-pLeft StandupRightTurnBackLeft TurnBackRightWalkLeft-ToRight and WalkRightToLeft) or 8 (MuHAVi-8 CollapseGuard KickRight PunchRight Run Standup TurnBack andWalk) actions performed by two actors

Finally our self-recorded DAI RGBD dataset has beenacquired using amultiview setup ofMicrosoftKinect devicesTwo cameras have captured a front and a 135∘ backsideview This dataset includes 12 actions classes (Bend Car-ryBall CheckWatch Jump PunchLeft PunchRight SitDownStandingStill StandupWaveBothWaveLeft andWaveRight)performed by three different actors Using depth-basedsegmentation the silhouettes of the so-called users of aresolution of 320 times 240 px are obtained In future works weintend to expand this dataset with more subjects and samplesand make it publicly available

We chose two tests to be performed on these datasets asfollows

(1) Leave-one-sequence-out cross validation (LOSO)Thesystem is trained with all but one sequence which isused as test sequence This procedure is repeated forall available sequences and the accuracy scores areaveraged In the case of multiview sequences each

Table 2 Comparison of recognition results with different summaryvalues (variance max value and range) and the features fromBoulgouris et al [9] and Dedeoglu et al [10] Best results have beenobtained with 119870 isin 5 130 and 119878 isin 8 46 (Bold indicates highestsuccess rate)

Dataset Test [9] [10] 119891var 119891max 119891range

Weizmann LOSO 656 785 903 935 935Weizmann LOAO 785 806 925 946 957MuHAVi-14 LOSO 618 941 956 912 956MuHAVi-14 LOAO 529 868 706 912 882MuHAVi-8 LOSO 691 985 100 100 100MuHAVi-8 LOAO 676 956 838 985 971DAI RGBD LOSO 500 556 500 528 694DAI RGBD LOAO 556 611 528 694 750

video sequence is considered as the combination ofits views

(2) Leave-one-actor-out cross validation (LOAO) Thistest verifies the robustness to actor-variance In thissense the sequences from all but one actor are usedfor training while the sequences from the remainingactor unknown to the system are used for testingThis test is performed for each actor and the obtainedaccuracy scores are averaged

62 Results The feature from Boulgouris et al [9] which hasbeen originally designed for gait recognition presents advan-tages regarding for instance robustness to segmentationerrors since it relies on the average distance to the centroid ofall the silhouette points of each circular sector Neverthelesson the tested action recognition datasets it returned lowsuccess rates which are significantly outperformed by theother four contour-based approaches Both the feature fromDedeoglu et al [10] and ours are based on the pointwisedistances between the contour points and the centroid of thesilhouette Our proposal distinguishes itself in that a radialscheme is applied in order to spatially align contour partsFurther dimensionality reduction is also provided by summa-rizing each radial bin in a single characteristic value Table 2shows the performance we obtained by applying this existingfeature to our learning algorithmWhereas on theWeizmanndataset the results are significantly behind the state of the artand the rates obtained on the DAI RGBD dataset are ratherlow the results for the MuHAVi dataset are promising Thedifference of performance can be explained with the differentqualities of the binary silhouettes The silhouettes from theMuHAVi-MAS subset have been manually annotated inorder to separate the problem of silhouette-based humanaction recognition from the difficulties which arise from thesilhouette extraction taskThis stands in contrast to the otherdatasets whose silhouettes have been obtained automaticallyrespectively through background subtraction or depth-basedsegmentation presenting therefore segmentation errorsThisleads us to the conclusion that the visual feature from [10] isstrongly dependant on the quality of the silhouettes


Table 2 also shows the results that have been obtainedwith the different summary functions from our proposalThevariance summary representation which only encodes thelocal dispersion but not reflects the actual distance to thecentroid achieves an improvement in some tests at the costof obtaining poor results on the MuHAVi actor-invariancetests (LOAO) and the DAI RGBD dataset The max valuesummary representation solves this problem and returnsacceptable rates for all tests Finally with 119891range the rangesummary representation obtains the best overall recognitionrates achieving our highest rates for the Weizmann datasetthe MuHAVi LOSO tests and the DAI RGBD dataset

In conclusion the proposed radial silhouette-based fea-ture not only achieves to substantially improve the resultsobtained with similar features as [9 10] but its low-dimensionality also offers an additional advantage in compu-tational cost (feature size is reduced from sim300 points in [10]to sim20 radial bins in our approach)

63 Parameterization The presented method uses twoparameters which are not given by the constraints of thedataset and the action classes which have to be recognizedand therefore have to be established by design The firstone is found at the feature extraction stage that is thenumber of radial bins 119878 A lower value of 119878 leads to a lowerdimensionality which reduces the computational cost andmay also improve noise filtering but at the same time it willreduce the amount of characteristic data This data is neededin order to differentiate action classes The second parameteris the number of key poses per action class and view 119870 Inthis case the appropriate amount of representatives needs tobe found to capture the most relevant characteristics of thesample distribution in the feature space discarding outlierand nonrelevant areas Again higher values will lead to anincrease of the computational cost of the classificationThere-fore a compromise needs to be reached between classificationtime and accuracy

In order to analyse the behavior of the proposed algo-rithmwith respect to these two parameters a statistic analysishas been performed Due to the nondeterministic behavior ofthe K-means algorithm classification rates vary among exe-cutions We executed ten repetitions of each test (MuHAVi-8 LOAO cross validation) and obtained the median value(see Figure 4) It can be observed that a high value of keyposes that is feature space representatives only leads to agood classification rate if the feature dimensionality is nottoo low otherwise a few key poses are enough to capturethe relevant areas of the feature space Note also that a higherfeature dimensionality does not necessarily require a highernumber of key poses since it does not imply a broader sampledistribution of the feature space Finally with the purposeof obtaining high and reproducible results the parametervalues have been chosen based on the highest median successrate (926) which has been obtained with 119878 = 12 and119870 = 5 in this case Since lower values are preferred for bothparameters the lowest parameter values are used if severalcombinations reach the same median success rate

Table 3 Comparison of recognition rates and speeds obtained onthe Weizmann dataset with other state-of-the-art approaches

Approach Number of actions Test Rate FPSIkizler and Duygulu [11] 9 LOSO 100 NATran and Sorokin [12] 10 LOSO 100 NAFathi and Mori [13] 10 LOSO 100 NAHernandez et al [14]a 10 LOAO 903 98Cheema et al [15] 9 LOSO 916 56Chaaraoui et al [5] 9 LOSO 928 124Sadek et al [16]a 10 LOAO 978 18Our approach 10 LOSO 935 263Our approach 10 LOAO 957 263Our approacha 10 LOAO 978 263aUsing 90 out of 93 sequences (repeated samples are excluded)

525456585105125

03040506070809

1

8121620242832364044

K

Succ

ess r

ate

S

5

Figure 4Median value of the obtained success rates for119870 isin 5 130

and 119878 isin 8 46 (MuHAVi-8 LOAO test) Note that outlier valuesabove or below 15 times IQR are not predominant

64 Comparison with the State of the Art Comparisonbetween different approaches can be difficult due to thediverse goals human action recognitionmethodsmay pursuethe different types of input data and the chosen evaluationmethods In our case multiview human action recognitionis aimed at an indoor scenario related to AAL servicesTherefore the system is required to perform in real timeas other services will rely on the action recognition outputA comparison of the obtained classification and recogni-tion speed rates for the publicly available Weizmann andMuHAVi-MAS datasets is provided in this section

The presented approach has been implemented withtheNET Framework using the OpenCV library [46] Perfor-mance has been tested on a standard PC with an Intel Core 2DuoCPU at 3GHz and 4GB of RAMwithWindows 7 64-bitAll tests have been performed using binary silhouette imagesas input and no further hardware optimizations have beenperformed

Table 3 compares our approach with the state of the art Itcan be seen that while perfect recognition has been achievedfor the Weizmann dataset our method places itself well interms of both recognition accuracy and recognition speed


Table 4 Comparison of recognition rates and speeds obtained onthe MuHAVi-14 dataset with other state-of-the-art approaches

Approach LOSO LOAO FPSSingh et al [8] 824 618 NAEweiwi et al [17] 919 779 NACheema et al [15] 860 735 56Chaaraoui et al [5] 912 824 72Chaaraoui et al [6] 941 868 51Our approach 956 882 93


Approach LOSO LOAO FPSSingh et al [8] 978 764 NAMartınez-Contreras et al [18] 984 mdash NAEweiwi et al [17] 985 853 NACheema et al [15] 956 831 56Chaaraoui et al [5] 971 882 81Chaaraoui et al [6] 985 956 66Our approach 100 971 94

when comparing it to methods that target fast human actionrecognition

On the MuHAVi-14 and MuHAVi-8 datasets ourapproach achieves to significantly outperform the knownrecognition rates of the state of the art (see Tables 4 and 5)To the best of our knowledge this is the first work toreport a perfect recognition on the MuHAVi-8 datasetperforming the leave-one-sequence-out cross validation testThe equivalent test on the MuHAVi-14 dataset returned animprovement of 96 in comparison with the work fromCheema et al [15] which also shows real-time suitabilityFurthermore our approach presents very high robustnessto actor-variance as the leave-one-actor-out cross validationtests show and it achieves to perform at over 90 FPS with thehigher resolution images from the MuHAVi dataset It is alsoworth mentioning that the training stage of the presentedapproach runs at similar rates between 92 and 221 FPS

With these results proficiency has been shown in han-dling both low and high quality silhouettes It is knownthat silhouette extraction with admissible quality can beperformed in real time through background subtractiontechniques [47 48] Furthermore recent advances in depthsensors make it possible to obtain human poses of substantialhigher quality by means of real-time depth based segmenta-tion [2] In addition depth infrared or laser sensors allowpreserving privacy as RGB information is not essential forsilhouette-based human action recognition

7 Conclusion

In this work a low-dimensional radial silhouette-basedfeature has been proposed which in combination with asimple yet effective multiview learning approach based ona bag of key poses and sequence matching shows to be a very

robust and efficient technique for human action recognitionin real time By means of a radial scheme contour partsare spatially aligned and through the summary functiondimensionality is drastically reduced This proposal achievesto significantly improve recognition accuracy and speed andis proficient with both single- and multiview scenarios Incomparison with the state of the art our approach presentshigh results on the Weizmann dataset and to the best of ourknowledge the best rates achieved so far on the MuHAVidataset Real-time suitability is confirmed since performancetests returned results clearly above video frequency

Future works include finding an optimal summary rep-resentation or the appropriate combination of summaryrepresentations based on a multiclassifier system Tests witha greater number of visual sensors need to be performed soas to see how many views can be handled by the learningapproach based onmodel fusion and towhich limitmultiviewdata improves the recognition For this purpose multiviewdatasets such as IXMAS [26] and i3DPost [49] can beemployed The proposed approach does not require thateach viewing angle matches with a specific orientation ofthe subject because different orientations can be modelled ifseen at the training stage Nevertheless since the method isnot explicitly addressing view-invariance it cannot deal withcross-view scenarios

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work has been partially supported by the SpanishMinistry of Science and Innovation under Project ldquoSistema devision para la monitorizacion de la actividad de la vida diariaen el hogarrdquo (TIN2010-20510-C04-02) and by the EuropeanCommission under Project ldquocaring4UmdashA study on peopleactivity in private spaces towards a multisensor networkthat meets privacy requirementsrdquo (PIEF-GA-2010-274649)Alexandros Andre Chaaraoui acknowledges financial sup-port by the Conselleria drsquoEducacio Formacio i Ocupacio ofthe Generalitat Valenciana (Fellowship ACIF2011160) Thefunders had no role in study design data collection andanalysis decision to publish or preparation of the paper Theauthors sincerely thank the reviewers for their constructiveand insightful suggestions that have helped to improve thequality of this paper

References

[1] T B Moeslund A Hilton and V Kruger ldquoA survey of advancesin vision-based humanmotion capture and analysisrdquo ComputerVision and Image Understanding vol 104 no 2-3 pp 90ndash1262006

[2] J Shotton A Fitzgibbon M Cook et al ldquoReal-time humanpose recognition in parts from single depth imagesrdquo in Pro-ceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo11) pp 1297ndash1304 June 2011


[3] M B Holte C Tran M M Trivedi and T B MoeslundldquoHuman action recognition usingmultiple views a comparativeperspective on recent developmentsrdquo in Proceedings of the JointACMWorkshop onHumanGesture and Behavior Understanding(J-HGBU 11) pp 47ndash52 New York NY USA December 2011

[4] J-C Nebel M Lewandowski J Thevenon F Martınez andS Velastin ldquoAre current monocular computer vision systemsfor human action recognition suitable for visual surveillanceapplicationsrdquo in Advances in Visual Computing G Bebis RBoyle B Parvin et al Eds vol 6939 of Lecture Notes inComputer Science pp 290ndash299 Springer Berlin Germany 2011

[5] A A Chaaraoui P Climent-Perez and F Florez-RevueltaldquoSilhouette-based human action recognition using sequences ofkey posesrdquo Pattern Recognition Letters vol 34 no 15 pp 1799ndash1807 2013

[6] A A Chaaraoui P Climent Perez and F Florez-Revuelta ldquoAnefficient approach for multi-view human action recognitionbased on bag-of-key-posesrdquo inHumanBehavior UnderstandingA A Salah J Ruiz-del Solar CMericli and P-Y Oudeyer Edsvol 7559 pp 29ndash40 Springer Berlin Germany 2012

[7] M Blank L Gorelick E Shechtman M Irani and R BasrildquoActions as space-time shapesrdquo in Proceedings of the 10th IEEEInternational Conference on Computer Vision (ICCV rsquo05) vol 2pp 1395ndash1402 October 2005

[8] S Singh S A Velastin and H Ragheb ldquoMuHAVi a mul-ticamera human action video dataset for the evaluation ofaction recognition methodsrdquo in Proceedings of the 7th IEEEInternational Conference on Advanced Video and Signal Based(AVSS rsquo10) pp 48ndash55 September 2010

[9] N V Boulgouris K N Plataniotis and D Hatzinakos ldquoGaitrecognition using linear time normalizationrdquo Pattern Recogni-tion vol 39 no 5 pp 969ndash979 2006

[10] Y Dedeoglu B Toreyin U Gudukbay and A CetinldquoSilhouette-based method for object classification andhuman action recognition in videordquo in Computer Vision inHuman-Computer Interaction T Huang N Sebe M Lew et alEds vol 3979 of Lecture Notes in Computer Science pp 64ndash77Springer Berlin Germany 2006

[11] N Ikizler and P Duygulu ldquoHuman action recognition usingdistribution of oriented rectangular patchesrdquo inHumanMotionUnderstanding Modeling Capture and Animation A Elgam-mal B Rosenhahn andRKlette Eds vol 4814 ofLectureNotesin Computer Science pp 271ndash284 Springer Berlin Germany2007

[12] D Tran and A Sorokin ldquoHuman activity recogn ition withmetric learningrdquo in Computer VisionmdashECCV 2008 D ForsythP Torr and A Zisserman Eds vol 5302 of Lecture Notesin Computer Science pp 548ndash561 Springer Berlin Germany2008

[13] A Fathi and G Mori ldquoAction recognition by learning mid-level motion featuresrdquo in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition (CVPR rsquo08) pp 1ndash8Anchorage Alaska USA June 2008

[14] J Hernandez A S Montemayor J Jose Pantrigo and ASanchez ldquoHuman action recognition based on tracking fea-turesrdquo in Foundations on Natural and Artificial Computation JM Ferrandez J R Alvarez-Sanchez F de la Paz andF J ToledoEds vol 6686 of Lecture Notes in Computer Science pp 471ndash480 Springer Berlin Germany 2011

[15] S Cheema A Eweiwi C Thurau and C Bauckhage ldquoActionrecognition by learning discriminative key posesrdquo in Proceeding

of the IEEE International Conference on Computer VisionWork-shops (ICCV 11) pp 1302ndash1309 Barcelona Spain November2011

[16] S Sadek A Al-Hamadi B Michaelis and U Sayed ldquoA fast sta-tistical approach for human activity recognitionrdquo InternationalJournal of Intelligence Science vol 2 no 1 pp 9ndash15 2012

[17] A Eweiwi S Cheema CThurau andC Bauckhage ldquoTemporalkey poses for human action recognitionrdquo in Proceedings of theIEEE International Conference on Computer Vision Workshops(ICCV rsquo11) pp 1310ndash1317 November 2011

[18] F Martınez-Contreras C Orrite-Urunuela E Herrero-JarabaH Ragheb and S A Velastin ldquoRecognizing human actionsusing silhouette-based HMMrdquo in Proceedings of the 6th IEEEInternational Conference on Advanced Video and Signal BasedSurveillance (AVSS rsquo09) pp 43ndash48 Genova Italy September2009

[19] C Thurau and V Hlavac ldquon-grams of action primitives forrecognizing human behaviorrdquo in Computer Analysis of Imagesand Patterns W Kropatsch M Kampel and A Hanbury Edsvol 4673 of Lecture Notes in Computer Science pp 93ndash100Springer Berlin Germany 2007

[20] C Hsieh P S Huang and M Tang ldquoHuman action recog-nition using silhouette histogramrdquo in Proceedings of the 34thAustralasian Computer Science Conference (ACSC rsquo11) pp 11ndash15Darlinghurst Australia January 2011

[21] F Lv and R Nevatia ldquoSingle view human action recognitionusing key pose matching and viterbi path searchingrdquo in Pro-ceedings of the IEEE Computer Society Conference on ComputerVision and Pattern Recognition (CVPR rsquo07) pp 1ndash8 June 2007

[22] Z Z Htike S Egerton and K Y Chow ldquoModel-free viewpointinvariant human activity recognitionrdquo in International Multi-Conference of Engineers and Computer Scientists (IMECS rsquo11)vol 2188 of Lecture Notes in Engineering and Computer Sciencepp 154ndash158 March 2011

[23] Y Wang K Huang and T Tan ldquoHuman activity recognitionbased on R transformrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern Recognition(CVPR rsquo07) pp 1ndash8 June 2007

[24] H-S Chen H-T Chen Y-W Chen and S-Y Lee ldquoHumanaction recognition using star skeletonrdquo in Proceedings of the 4thACM International Workshop on Video Surveillance and SensorNetworks (VSSN rsquo06) pp 171ndash178 New York NY USA 2006

[25] A F Bobick and J W Davis ldquoThe recognition of humanmovement using temporal templatesrdquo IEEE Transactions onPattern Analysis andMachine Intelligence vol 23 no 3 pp 257ndash267 2001

[26] D Weinland R Ronfard and E Boyer ldquoFree viewpoint actionrecognition using motion history volumesrdquo Computer Visionand Image Understanding vol 104 no 2-3 pp 249ndash257 2006

[27] S Cherla K Kulkarni A Kale and V RamasubramanianldquoTowards fast view-invariant human action recognitionrdquo inIEEE Computer Society Conference on Computer Vision andPattern Recognition Workshops (CVPR rsquo08) pp 1ndash8 June 2008

[28] D Weinland M Ozuysal and P Fua ldquoMaking action recogni-tion robust to occlusions and viewpoint changesrdquo in ComputerVision (ECCV rsquo10) K Daniilidis P Maragos and N ParagiosEds vol 6313 of Lecture Notes in Computer Science pp 635ndash648 Springer Berlin Germany 2010

[29] R Cilla M A Patricio A Berlanga and J M Molina ldquoHumanaction recognition with sparse classification and multiple-viewlearningrdquo Expert Systems 2013


[30] S A Rahman I Song M K H Leung I Lee and K LeeldquoFast action recognition using negative space featuresrdquo ExpertSystems with Applications vol 41 no 2 pp 574ndash587 2014

[31] L Chen H Wei and J Ferryman ldquoA survey of human motionanalysis using depth imageryrdquo Pattern Recognition Letters vol34 no 15 pp 1995ndash2006 2013 Smart Approaches for HumanAction Recognition

[32] J Han L Shao D Xu and J Shotton ldquoEnhanced computervision with microsoft kinect sensor a reviewrdquo IEEE Transac-tions on Cybernetics vol 43 no 5 pp 1318ndash1334 2013

[33] J Aggarwal and M Ryoo ldquoHuman activity analysis a reviewrdquoACM Computing Surveys vol 43 pp 161ndash1643 2011

[34] P Yan S M Khan and M Shah ldquoLearning 4D action featuremodels for arbitrary view action recognitionrdquo in Proceedingsof the 26th IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo08) usa June 2008

[35] C Canton-Ferrer J R Casas and M Pardas ldquoHuman modeland motion based 3D action recognition in multiple viewscenariosrdquo in Proceedings of the 14th European Signal ProcessingConference pp 1ndash5 September 2006

[36] C Wu A H Khalili and H Aghajan ldquoMultiview activityrecognition in smart homes with spatio-temporal featuresrdquo inProceeding of the 4th ACMIEEE International Conference onDistributed Smart Cameras (ICDSC 10) pp 142ndash149 New YorkNY USA September 2010

[37] T Maatta A Harma and H Aghajan ldquoOn efficient use ofmulti-view data for activity recognitionrdquo in Proceedings of the4th ACMIEEE International Conference on Distributed SmartCameras (ICDSC rsquo10) pp 158ndash165 ACM New York NY USASeptember 2010

[38] R Cilla M A Patricio A Berlanga and J M Molina ldquoAprobabilistic discriminative and distributed system for therecognition of human actions from multiple viewsrdquo Neurocom-puting vol 75 pp 78ndash87 2012

[39] F Zhu L Shao and M Lin ldquoMulti-view action recognitionusing local similarity random forests and sensor fusionrdquoPatternRecognition Letters vol 34 no 1 pp 20ndash24 2013

[40] V G Kaburlasos S E Papadakis and A Amanatiadis ldquoBinaryimage 2D shape learning and recognition based on lattice-computing (LC) techniquesrdquo Journal of Mathematical Imagingand Vision vol 42 no 2-3 pp 118ndash133 2012

[41] V G Kaburlasos and T Pachidis ldquoA Lattice-computing ensem-ble for reasoning based on formal fusion of disparate data typesand an industrial dispensing applicationrdquo Information Fusionvol 16 pp 68ndash83 2014 Special Issue on Information Fusion inHybrid Intelligent Fusion Systems

[42] R Minhas A Mohammed and Q Wu ldquoIncremental learningin human action recognition based on snippetsrdquo IEEE Transac-tions on Circuits and Systems for Video Technology vol 22 pp1529ndash1541 2012

[43] M Angeles Mendoza and N P de la Blanca ldquoHMM-basedaction recognition using contour histogramsrdquo in Pattern Recog-nition and Image Analysis J Martı J M Benedı A MMendonca and J Serrat Eds vol 4477 of Lecture Notes inComputer Science pp 394ndash401 Springer BerlinGermany 2007

[44] S Suzuki and K Abe ldquoTopological structural analysis ofdigitized binary images by border followingrdquo Computer VisionGraphics and Image Processing vol 30 no 1 pp 32ndash46 1985

[45] H Sakoe and S Chiba ldquoDynamic programming algorit hmoptimization for spoken word recognitionrdquo IEEE Transactionson Acoustics Speech and Signal Processing vol 26 no 1 pp 43ndash49 1978

[46] G Bradski ldquoTheOpenCV libraryrdquoDrDobbrsquos Journal of SoftwareTools 2000

[47] THorprasert DHarwood and L Davis ldquoA statistical approachfor real-time robust background subtraction and shadow detec-tionrdquo in Proceedings of the IEEE International Conference onComputer Vision Frame-Rate Workshop (ICCV rsquo99) pp 256ndash261 1999

[48] K Kim T H Chalidabhongse D Harwood and L DavisldquoReal-time foreground-background segmentation using code-bookmodelrdquoReal-Time Imaging vol 11 no 3 pp 172ndash185 2005

[49] N Gkalelis H Kim A Hilton N Nikolaidis and I PitasldquoThe i3DPost multi-view and 3D human actioninteractiondatabaserdquo in Proceeding of the 6th European Conference forVisualMedia Production (CVMP 09) pp 159ndash168 LondonUKNovember 2009

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Distributed Sensor Networks


Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014


ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014


Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications


Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia


Biomedical Imaging


ArtificialNeural Systems

Advances in


RoboticsJournal of



Computational Intelligence and Neuroscience

Industrial EngineeringJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in



silhouette contour points radial bins are computed usingthe centroid as the origin and a summary representation isobtained for each bin This pose representation is used inorder to obtain the per-view key poses which are involvedin each action performance Therefore a model fusion ofmultiple visual sensors is applied From the obtained bag ofkey poses the sequences of key poses of each action classare computed which are used later on for sequence matchingand recognition Experimentation performed on two publiclyavailable datasets (Weizmann [7] and MuHAVi [8]) and aself-recorded one shows that the proposed technique not onlyobtains very high and stable recognition rates but also provesto be suitable for real-time applications Note that by ldquorealtimerdquo we mean that recognition can be performed at videofrequency or above as is common in the field

The remainder of this paper is organized as followsSection 2 summarizes the most recent and relevant works inhuman action recognition focusing on the type of featuresused and howmultiview scenarios are managed In Section 3our proposal is detailed offering a low-dimensional featurebased on silhouette contours and a radial scheme Sections4 and 5 specify the applied multiview learning approachbased on a bag of key poses and action recognition throughsequence matching Section 6 analyzes the obtained resultsand compares them with the state of the art in terms ofrecognition rate and speed providing also an analysis ofthe behaviour of the proposed method with respect to itsparameters Finally we present conclusions and discussion inSection 7

2 Related Work21 Feature Extraction Regarding the feature extractionstage of human action recognition methods based on visionthese can be differentiated by the either static or dynamicnature of the feature Whereas static features consider onlythe current frame (extracting diverse types of characteristicsbased on shape gradients key points etc) dynamic featuresconsider a sequence of several frames and apply techniqueslike image differencing optical flow and spatial-temporalinterest points (STIP)

Among the former we find silhouette-based featureswhich rely either on the whole shape of the silhouette oronly on the contour points In [19] action primitives areextracted reducing the dimensionality of the binary imageswith principal component analysis (PCA) Polar coordinatesare considered in [20] where three radial histograms aredefined for the upper part the lower part and the wholehuman body Each polar coordinate system has several binswith different radii and angles and the concatenated nor-malized histograms are used to describe the human postureSimilarly in [21] a log-polar histogram is computed choosingthe different radii of the bins based on logarithmic scaleSilhouette contours are employed in [10] with the purpose ofcreating a distance signal based on the pointwise Euclideandistances between each contour point and the centroid of thesilhouette Conversely in [22] the pairwise distances betweencontour points are computed to build a histogramof distancesresulting in a rotation scale and translation invariant feature

In [9] the whole silhouette is used for gait recognition Anangular transform based on the average distance betweenthe silhouette points and the centroid is obtained for eachcircular sectorThis shows robustness to segmentation errorsSimilarly in [23] the shape of the silhouette contour isprojected on a line based on the R transform which is thenmade invariant to translation Silhouettes can also be used toobtain stick figures for instance by means of skeletonizationChen et al [24] applied star skeletonization to obtain a five-dimensional vector in star fashion considering the headthe arms and the legs as local maxima Pointwise distancesbetween contour points and the centroid of the silhouetteare used to find the five local maxima In the work ofIkizler and Duygulu [11] a different approach based on aldquobag-of-rectanglesrdquo is presented In their proposal orientedrectangular patches are extracted over the human silhouetteand the human pose is represented with a histogram ofcircular bins of 15∘ each

A very popular dynamic feature in pattern recognitionbased on computer vision is optical flow Fathi and Mori[13] rely on low-level features based on optical flow In theirwork weighted combinations of midlevel motion featuresare built covering small spatiotemporal cuboids from whichthe low-level features are chosen In [25] motion over asequence of frames is considered defining motion historyand energy images These encode the temporal evolutionand the location of the motion respectively over a numberof frames This work has been extended by [26] so as toobtain a free-viewpoint representation from multiple viewsA similar objective is pursued in [7] where time is consideredas the third dimension building space-time volumes basedon sequences of binary silhouettes Action recognition isperformed with global space-time features composed of theweighted moments of local space-time saliency and orienta-tion Cherla et al [27] combine eigenprojections of the widthprofile of the actor with the centroid of the silhouette and thestandard deviation in the 119883 and 119884 axes in a single featurevector Robustness to occlusions and viewpoint changes istargeted in [28] A 3D histogram of oriented gradients(3DHOG) is computed for densely distributed regions andcombined with temporal embedding to represent an entirevideo sequence Tran and Sorokin [12] merge both silhouetteshape and optical flow in a 286-dimensional feature whichalso includes the context of 15 surrounding frames reducedby means of PCA This feature has been used successfully inother works as for instance recently in [29] Rahman et al[30] take an interesting approach proposing a novel featureextraction techniquewhich relies on the surrounding regionsof the subjects These negative spaces present advantagesrelated to robustness to boundary variations caused by partialocclusions shadows and nonrigid deformations

RGB-D data that is RGB color information along pixel-wise depthmeasurement is increasingly being used since theMicrosoft Kinect device has been released Using the depthdata and relying on an intermediate body part recognitionprocess a markerless body pose estimation in form of3D skeletal information can be obtained in real time [2]This kind of data results proficient for gesture and actionrecognition required by applications such as gaming and















for all119895 isin 1 119878


1 1199012 119901

119899 where




pk

pl

C

1

2

3

4

5

6

7

8

910

11

12

13

14

15

16

17

18

f(pk pk+1 pl) = 1





calculated as

119909119888=sum119899

119894=1

119909119894

119899 119910

119888=sum119899

119894=1

119910119894

119899 (1)


1 1198892 119889

119899 are


119889119894=1003817100381710038171003817119862119898 minus 119901119894

1003817100381710038171003817 forall119894 isin 1 119899 (2)




120572119894= 360)

120572119894=

arccos(119910119894minus 119910119888

119889119894

) sdot180

120587 if 119909i ge 0

180 + arccos(119910119894minus 119910119888

119889119894

) sdot180

120587 otherwise

119904119894= lceil

119878 sdot 120572119894


(3)


V119895=119891 (119901119896 119901119896+1 119901

119897)

119904119896 119904

119897

= 119895 and 119896 119897 isin 1 119899

forall119895 isin 1 119878

V119895=

V119895

sum119878

0=1

V0

forall119895 isin 1 119878


119878

(4)



119891var (119901119896 119901119896+1 119901119897) =119897

sum

119894=119896

(119889119894minus 120583)2

(5)


119891max (119901119896 119901119896+1 119901119897) = max (119889119896 119889119896+1 119889

119897)

119891range (119901119896 119901119896+1 119901119897) = max (119889119896 119889119896+1 119889

119897)

minusmin (119889119896 119889119896+1 119889

119897)

(6)






for each 119891119903119886119898119890 isin 119886119888119905119894119900119899 119888119897119886119904119904 doV = 119891119890119886119905119906119903119890 119890119909119905119903119886119888119905119894119900119899(119891119903119886119898119890)


end if119886119904119904119894119892119899119898119890119899119905119904kp = 119886119904119904119894119892119899119898119890119899119905119904kp + 1

end forend for


if 119886119904119904119894119892119899119898119890119899119905119904kp gt 0 then

119908kp =119898119886119905119888ℎ119890119904kp

119886119904119904119894119892119899119898119890119899119905119904kpelse119908kp = 0

end ifend for






119896119901= 119898119886119905119888ℎ119890119904

119896119901119886119904119904119894119892119899119898119890119899119905119904

119896119901 In



1198961199011 1198961199012 119896119901




Action1

ActionR

View1

View1

ViewM

ViewM

K-means

K-means

K-means

K-means

K key poses

K key poses

K key poses

K key poses

Bag of key poses






119905 and Seq1015840 =

1198961199011015840

1

1198961199011015840

2

1198961199011015840

119906



119889DTW (Seq Seq1015840) = dtw (119905 119906)

dtw (119894 119895) = min


dtw (119894 minus 1 119895 minus 1)

+ 119889 (119896119901119894

1198961199011015840

119895

)

(7)


1198961199011015840

119895









119894and 1199081015840

119895


as

119889 (119896119901119894

1198961199011015840

119895

) =10038161003816100381610038161003816119896119901119894

minus 1198961199011015840

119895

10038161003816100381610038161003816+ 119911 rel (119894 119895)


119895

10038161003816100381610038161003816

dev (119894 119895) = 10038161003816100381610038161003816119896119901119894 minus 1198961199011015840

119895


(8)

























525456585105125

03040506070809

1

8121620242832364044

K

Succ

ess r

ate

S

5














7 Conclusion






Acknowledgments


References




























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in

















for all119895 isin 1 119878


1 1199012 119901

119899 where




pk

pl

C

1

2

3

4

5

6

7

8

910

11

12

13

14

15

16

17

18

f(pk pk+1 pl) = 1





calculated as

119909119888=sum119899

119894=1

119909119894

119899 119910

119888=sum119899

119894=1

119910119894

119899 (1)


1 1198892 119889

119899 are


119889119894=1003817100381710038171003817119862119898 minus 119901119894

1003817100381710038171003817 forall119894 isin 1 119899 (2)




120572119894= 360)

120572119894=

arccos(119910119894minus 119910119888

119889119894

) sdot180

120587 if 119909i ge 0

180 + arccos(119910119894minus 119910119888

119889119894

) sdot180

120587 otherwise

119904119894= lceil

119878 sdot 120572119894


(3)


V119895=119891 (119901119896 119901119896+1 119901

119897)

119904119896 119904

119897

= 119895 and 119896 119897 isin 1 119899

forall119895 isin 1 119878

V119895=

V119895

sum119878

0=1

V0

forall119895 isin 1 119878


119878

(4)



119891var (119901119896 119901119896+1 119901119897) =119897

sum

119894=119896

(119889119894minus 120583)2

(5)


119891max (119901119896 119901119896+1 119901119897) = max (119889119896 119889119896+1 119889

119897)

119891range (119901119896 119901119896+1 119901119897) = max (119889119896 119889119896+1 119889

119897)

minusmin (119889119896 119889119896+1 119889

119897)

(6)






for each 119891119903119886119898119890 isin 119886119888119905119894119900119899 119888119897119886119904119904 doV = 119891119890119886119905119906119903119890 119890119909119905119903119886119888119905119894119900119899(119891119903119886119898119890)


end if119886119904119904119894119892119899119898119890119899119905119904kp = 119886119904119904119894119892119899119898119890119899119905119904kp + 1

end forend for


if 119886119904119904119894119892119899119898119890119899119905119904kp gt 0 then

119908kp =119898119886119905119888ℎ119890119904kp

119886119904119904119894119892119899119898119890119899119905119904kpelse119908kp = 0

end ifend for






119896119901= 119898119886119905119888ℎ119890119904

119896119901119886119904119904119894119892119899119898119890119899119905119904

119896119901 In



1198961199011 1198961199012 119896119901




Action1

ActionR

View1

View1

ViewM

ViewM

K-means

K-means

K-means

K-means

K key poses

K key poses

K key poses

K key poses

Bag of key poses






119905 and Seq1015840 =

1198961199011015840

1

1198961199011015840

2

1198961199011015840

119906



119889DTW (Seq Seq1015840) = dtw (119905 119906)

dtw (119894 119895) = min


dtw (119894 minus 1 119895 minus 1)

+ 119889 (119896119901119894

1198961199011015840

119895

)

(7)


1198961199011015840

119895









119894and 1199081015840

119895


as

119889 (119896119901119894

1198961199011015840

119895

) =10038161003816100381610038161003816119896119901119894

minus 1198961199011015840

119895

10038161003816100381610038161003816+ 119911 rel (119894 119895)


119895

10038161003816100381610038161003816

dev (119894 119895) = 10038161003816100381610038161003816119896119901119894 minus 1198961199011015840

119895


(8)

























525456585105125

03040506070809

1

8121620242832364044

K

Succ

ess r

ate

S

5














7 Conclusion






Acknowledgments


References




























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in




pk

pl

C

1

2

3

4

5

6

7

8

910

11

12

13

14

15

16

17

18

f(pk pk+1 pl) = 1





calculated as

119909119888=sum119899

119894=1

119909119894

119899 119910

119888=sum119899

119894=1

119910119894

119899 (1)


1 1198892 119889

119899 are


119889119894=1003817100381710038171003817119862119898 minus 119901119894

1003817100381710038171003817 forall119894 isin 1 119899 (2)




120572119894= 360)

120572119894=

arccos(119910119894minus 119910119888

119889119894

) sdot180

120587 if 119909i ge 0

180 + arccos(119910119894minus 119910119888

119889119894

) sdot180

120587 otherwise

119904119894= lceil

119878 sdot 120572119894


(3)


V119895=119891 (119901119896 119901119896+1 119901

119897)

119904119896 119904

119897

= 119895 and 119896 119897 isin 1 119899

forall119895 isin 1 119878

V119895=

V119895

sum119878

0=1

V0

forall119895 isin 1 119878


119878

(4)



119891var (119901119896 119901119896+1 119901119897) =119897

sum

119894=119896

(119889119894minus 120583)2

(5)


119891max (119901119896 119901119896+1 119901119897) = max (119889119896 119889119896+1 119889

119897)

119891range (119901119896 119901119896+1 119901119897) = max (119889119896 119889119896+1 119889

119897)

minusmin (119889119896 119889119896+1 119889

119897)

(6)






for each 119891119903119886119898119890 isin 119886119888119905119894119900119899 119888119897119886119904119904 doV = 119891119890119886119905119906119903119890 119890119909119905119903119886119888119905119894119900119899(119891119903119886119898119890)


end if119886119904119904119894119892119899119898119890119899119905119904kp = 119886119904119904119894119892119899119898119890119899119905119904kp + 1

end forend for


if 119886119904119904119894119892119899119898119890119899119905119904kp gt 0 then

119908kp =119898119886119905119888ℎ119890119904kp

119886119904119904119894119892119899119898119890119899119905119904kpelse119908kp = 0

end ifend for






119896119901= 119898119886119905119888ℎ119890119904

119896119901119886119904119904119894119892119899119898119890119899119905119904

119896119901 In



1198961199011 1198961199012 119896119901




Action1

ActionR

View1

View1

ViewM

ViewM

K-means

K-means

K-means

K-means

K key poses

K key poses

K key poses

K key poses

Bag of key poses






119905 and Seq1015840 =

1198961199011015840

1

1198961199011015840

2

1198961199011015840

119906



119889DTW (Seq Seq1015840) = dtw (119905 119906)

dtw (119894 119895) = min


dtw (119894 minus 1 119895 minus 1)

+ 119889 (119896119901119894

1198961199011015840

119895

)

(7)


1198961199011015840

119895









119894and 1199081015840

119895


as

119889 (119896119901119894

1198961199011015840

119895

) =10038161003816100381610038161003816119896119901119894

minus 1198961199011015840

119895

10038161003816100381610038161003816+ 119911 rel (119894 119895)


119895

10038161003816100381610038161003816

dev (119894 119895) = 10038161003816100381610038161003816119896119901119894 minus 1198961199011015840

119895


(8)

























525456585105125

03040506070809

1

8121620242832364044

K

Succ

ess r

ate

S

5














7 Conclusion






Acknowledgments


References




























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in





for each 119891119903119886119898119890 isin 119886119888119905119894119900119899 119888119897119886119904119904 doV = 119891119890119886119905119906119903119890 119890119909119905119903119886119888119905119894119900119899(119891119903119886119898119890)


end if119886119904119904119894119892119899119898119890119899119905119904kp = 119886119904119904119894119892119899119898119890119899119905119904kp + 1

end forend for


if 119886119904119904119894119892119899119898119890119899119905119904kp gt 0 then

119908kp =119898119886119905119888ℎ119890119904kp

119886119904119904119894119892119899119898119890119899119905119904kpelse119908kp = 0

end ifend for






119896119901= 119898119886119905119888ℎ119890119904

119896119901119886119904119904119894119892119899119898119890119899119905119904

119896119901 In



1198961199011 1198961199012 119896119901




Action1

ActionR

View1

View1

ViewM

ViewM

K-means

K-means

K-means

K-means

K key poses

K key poses

K key poses

K key poses

Bag of key poses






119905 and Seq1015840 =

1198961199011015840

1

1198961199011015840

2

1198961199011015840

119906



119889DTW (Seq Seq1015840) = dtw (119905 119906)

dtw (119894 119895) = min


dtw (119894 minus 1 119895 minus 1)

+ 119889 (119896119901119894

1198961199011015840

119895

)

(7)


1198961199011015840

119895









119894and 1199081015840

119895


as

119889 (119896119901119894

1198961199011015840

119895

) =10038161003816100381610038161003816119896119901119894

minus 1198961199011015840

119895

10038161003816100381610038161003816+ 119911 rel (119894 119895)


119895

10038161003816100381610038161003816

dev (119894 119895) = 10038161003816100381610038161003816119896119901119894 minus 1198961199011015840

119895


(8)

























525456585105125

03040506070809

1

8121620242832364044

K

Succ

ess r

ate

S

5














7 Conclusion






Acknowledgments


References




























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in




Action1

ActionR

View1

View1

ViewM

ViewM

K-means

K-means

K-means

K-means

K key poses

K key poses

K key poses

K key poses

Bag of key poses






119905 and Seq1015840 =

1198961199011015840

1

1198961199011015840

2

1198961199011015840

119906



119889DTW (Seq Seq1015840) = dtw (119905 119906)

dtw (119894 119895) = min


dtw (119894 minus 1 119895 minus 1)

+ 119889 (119896119901119894

1198961199011015840

119895

)

(7)


1198961199011015840

119895









119894and 1199081015840

119895


as

119889 (119896119901119894

1198961199011015840

119895

) =10038161003816100381610038161003816119896119901119894

minus 1198961199011015840

119895

10038161003816100381610038161003816+ 119911 rel (119894 119895)


119895

10038161003816100381610038161003816

dev (119894 119895) = 10038161003816100381610038161003816119896119901119894 minus 1198961199011015840

119895


(8)

























525456585105125

03040506070809

1

8121620242832364044

K

Succ

ess r

ate

S

5














7 Conclusion






Acknowledgments


References




























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in

























525456585105125

03040506070809

1

8121620242832364044

K

Succ

ess r

ate

S

5














7 Conclusion






Acknowledgments


References




























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in










525456585105125

03040506070809

1

8121620242832364044

K

Succ

ess r

ate

S

5














7 Conclusion






Acknowledgments


References




























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in











7 Conclusion






Acknowledgments


References




























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in




























































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in










Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in



Documents

Research Article A Low-Dimensional Radial Silhouette-Based ...downloads.hindawi.com/archive/2014/547069.pdfnd silhouette-based features which rely either on the whole shape of the