Multimed Tools Appl
DOI 10.1007/s11042-011-0763-8

Extracting representative motion flows for effective video retrieval

Zhe Zhao · Bin Cui · Gao Cong · Zi Huang · Heng Tao Shen

© Springer Science+Business Media, LLC 2011

Abstract In this paper, we propose a novel motion-based video retrieval approach to find desired videos from video databases through trajectory matching. The main component of our approach is to extract representative motion features from the video, which can be broken down into the following three steps. First, we extract the motion vectors from each frame of a video and utilize Harris corner points to compensate for the effect of camera motion. Second, we find interesting motion flows in frames using a sliding window mechanism and a clustering algorithm. Third, we merge the generated motion flows and select representative ones to capture the motion features of the video. Furthermore, we design a symbolic trajectory matching method for effective video retrieval. The experimental results show that our algorithm is capable of effectively extracting motion flows with high accuracy, and that it outperforms existing approaches for video retrieval.

Z. Zhao · B. Cui (B)
State Key Laboratory of Software Development Environment & Department of Computer Science, Peking University, Beijing, China
e-mail: [email protected], [email protected]

Z. Zhao
e-mail: [email protected]

G. Cong
Nanyang Technological University, Nanyang, Singapore
e-mail: [email protected]

Z. Huang · H. T. Shen
The University of Queensland, Queensland, Australia

Z. Huang
e-mail: [email protected]

H. T. Shen
e-mail: [email protected]


Keywords Video retrieval · Content feature · Motion flow · Trajectory matching

1 Introduction

The continuing rapid growth of videos has created new challenges for managing, exploring and retrieving video data effectively and efficiently. Commercial search engines, e.g., Google and Yahoo!, commonly provide online video search services that enable users to find relevant videos via keywords. However, videos are not always annotated with proper keywords, and text alone never suffices to characterize videos. Hence, there is an essential need for content-based video retrieval (CBVR) systems. In the last two decades, CBVR has attracted increasing attention, and many approaches have been proposed to extract and organize videos' content features for better retrieval performance, e.g., QBIC [11], VideoQ [4], Virage [12] and Netra-V [9].

In CBVR systems, extracting content features that precisely characterize the video content is one of the key components. On one hand, CBVR systems usually exploit the low-level static features extracted from the keyframes of videos, such as colors, textures and shapes. On the other hand, it is also essential to take account of motion features. Motion features provide the global (spatial and temporal) information of a video, which makes them more important than static features in representing semantic information [26]. Intuitively, when watching a video, people are typically more interested in the movement of objects, e.g., the actions of an actor, than in static objects. By exploiting motion features, CBVR systems often achieve better performance in searching large video databases. Indeed, many existing CBVR systems have utilized motion features for video retrieval, e.g., [4, 8, 10, 11, 19, 20, 23]. The methods of extracting and representing motion features in existing CBVR systems can be divided into two categories.

(1) Statistics-based motion features. This line of work, e.g., [10, 19], uses statistics to analyze the tendency and distribution of local motion. The statistics-based approach is usually computationally efficient, but cannot extract comprehensive information (e.g., objects' motions).

(2) Trajectory-based motion features. Some researchers have proposed to extract objects' motion trajectories by object or local interesting point detection and tracking [8, 18, 24, 27–29]. Such approaches are often more effective in retrieving relevant videos, but they have several drawbacks. First, although several video object segmentation algorithms exist [7, 25], it remains challenging to "automatically" compute the centroids of video objects across consecutive frames, due to the complexity and ill-posed nature of the problem. Additionally, the tracking stage incurs heavy computational cost and accumulates noise as the trajectory grows. Instead of segmenting and extracting objects, another stream of work extracts motion features from the motion vector sequences embedded in videos, such as MPEG bitstreams [30]. It has been shown in [23] that CBVR systems using motion flows (trajectories) generated from motion vectors perform better than approaches using object trajectories. However, the existing motion-flow-based approaches generate motion flows from all the motion vectors in each frame, which may introduce noisy motion flows. It is challenging to extract representative motion flows for effective video retrieval.


Considering all the aforementioned problems, in this paper we propose a novel approach to extract motion flows, and design a novel trajectory matching method for video retrieval. Our motion flow extraction and similarity measurement can be adopted in CBVR systems to find similar videos based on their motion features. Note that although we focus on motion features in this work, they can easily be combined with static content features in CBVR systems.

The proposed motion trajectory extraction method aims to extract accurate and representative motion features, and operates in the following three main steps:

– First, we extract local motion using the motion vectors with camera motion compensation; the local motion vectors represent the motions of objects in each frame.

– Second, we generate motion flows by "matching" and "linking" local motion vectors across a series of frames of video shots. More specifically, we divide the frames of video shots into windows and develop a sliding window framework for the tracking process. To filter noisy motion flows, we group flows into clusters, and only select interesting clusters in which the motion flows have high potential to represent the motion features of the video.

– Third, after processing the whole sequence of a video shot, we cluster the generated interesting motion flows and only keep those in the interesting clusters. Note that here we select the "globally interesting" motion flows, i.e., the most meaningful trajectories in the whole video clip, while in step two we choose the "locally interesting" flows, i.e., the most important clusters in the sliding windows. Finally, the "globally interesting" motion flows with high similarity are selected and merged, and only the representative motion trajectories are used as the motion features for video retrieval.

The proposed motion trajectory matching approach works as follows. After extracting the representative motion flows, we adopt a bottom-up segmentation strategy [15] to segment a motion flow into multiple segments (subsections), each represented by its regression parameters. We then map the sequence of segments into a symbolic representation using a robust symbol table, and use it to evaluate the similarity between motion flows.

We conduct extensive experiments to evaluate the performance of our approach against existing trajectory matching and motion-based video retrieval methods. The experimental results show that our method performs better in both motion feature extraction and video retrieval.

The rest of this paper is organized as follows. In Section 2, we review related work. Section 3 presents our method of motion flow extraction. In Section 4, we introduce our approach for trajectory-based video search. Section 5 describes a detailed performance study. Finally, we conclude the paper.

2 Related work

The techniques used in CBVR systems cover various topics, such as video coding, image and video processing, video indexing, searching and mining. Generally, two typical kinds of content features are used for CBVR: static image features, such as color, texture and shape, extracted from the keyframes of a video; and motion features, extracted from the differences between adjacent frames in a sequence of video frames. Since a video is not just a collection of images, static features are insufficient to describe its rich visual content, whereas motion features, which capture the object motions in videos, i.e., their spatial and temporal characteristics, play a key role in video search and indexing. We now briefly introduce the work closely related to our research.

2.1 Motion feature extraction

A number of approaches have been proposed to extract motion features from videos, e.g., [8, 10, 19, 26, 29]. The commonly used motion features include motion vectors, global motion, object trajectories, etc. In order to extract motion features from a video stream, we not only need global motion features such as camera motion, but also local motion features such as object movement. The common method of generating local motion is to estimate the camera motion with a regression model, e.g., the 4-parameter regression model [21], and to compensate the motion vectors in the whole frame for the camera motion at their locations. The local motion information is widely used by existing approaches to extract and represent video motion features, which basically fall into two categories.

First, statistics-based methods have been proposed to estimate the distribution and tendency of local motions. Statistics-based motion feature extraction can only capture the overall picture, not comprehensive and detailed motion information. Causal Gibbs models were used in [10] to represent the spatio-temporal distribution of the dynamic content in a video shot. A multi-dimensional vector was generated by measuring the energy distribution of a motion vector field [19]. More recently, [26] proposed a statistical motion feature, the Expanded Relative Motion Histogram of Bag-of-Visual-Words (ERMH-BoW), which fuses local feature histograms into one high-dimensional vector and uses such motion features for event detection.

Methods in the second category make use of trajectory-based motion features. Most methods in this category are object-based, computing trajectories by object detection and tracking. The performance of such approaches highly depends on the object detection and tracking algorithms, e.g., Harris corner point detection [13], the SIFT algorithm [18], the KLT tracking algorithm [24], or the Monte Carlo based particle filter [29]. Dagtas et al. proposed a combination of trajectory- and trail-based models to characterize the motion of a video object [8]. While object-based approaches can provide detailed object motion trajectories, prior knowledge is required to correctly derive the video objects, rendering them often infeasible in practice. Recently, the motion flow trajectory [23] was proposed to represent motion features. The motion flow trajectory is extracted from the local motion vectors in video frames, and has been shown to be more effective in video retrieval than other motion features.

2.2 Trajectory based video retrieval

Trajectory matching is important for trajectory-based video retrieval, and methods for time series matching have been widely adopted [2, 3, 5, 6, 14, 17, 22, 23]. These methods sample the trajectories at some significant points, or segment trajectories into several subsections, and then represent the trajectory by symbolic or numeric methods. Le et al. [17] extracted control points of trajectories, represented trajectories by the sequences of movement direction and distance ratio, and evaluated the similarity with edit distance. Hsieh et al. [14] proposed a similar trajectory matching method that transforms control points into different symbols. Su et al. [23] adopted control points in the representation of motion flows, and proposed a coarse-to-fine trajectory comparison mechanism for video retrieval. Bashir et al. [2, 3] segmented trajectories using a distribution-based hypothesis test, mapped subsections to letters via a clustering method, and measured the similarity of trajectories by edit distance. The symbolic trajectory representation of [5] maps a real sequence to a string of symbols according to a specific symbol table generated from moving direction and distance; edit distance on the symbolic trajectories is then used to measure their similarity.

3 Motion feature extraction

In this section we focus on how to extract motion features from video shots. The feature extraction process consists of the following steps: local motion vector generation by camera motion compensation, motion flow extraction, and filtering.

Figure 1 shows the framework of motion feature extraction in our system. In this work, we get the forward motion vectors from the B frames and P frames in MPEG files. There are three types of frames in MPEG files, i.e., I frames, B frames and P frames; B frames and P frames refer to adjacent I frames and store motion vectors in the compressed domain according to the MPEG standard. We use the subtraction of the forward motion vectors to get the motion vector between adjacent frames.

The reason we use the MPEG stream is that motion vectors are easy to obtain from the compressed domain, a practice widely used in related work [23, 32]. Note that our work does not focus on how to extract motion vectors; the MPEG stream with motion vectors is adopted as one possible way of doing the extraction. Besides the motion vectors in MPEG files, we can also adopt other methods, independent of the video coding technology, to obtain motion vectors, e.g., optical flow extraction.

We then extract Harris corner points and generate statistical features to recognize camera motion with an SVM classifier. Whenever camera motion is detected, a 4-parameter estimation model is adopted to estimate the camera motion parameters and compute local motion vectors by camera motion compensation. After that, we match the motion vectors of blocks containing corner points across two adjacent frames. The video shot is partitioned into shorter windows to process the flows with smaller granularity and better accuracy. In each window, we normalize the motion flows to unit length and extract one multi-dimensional vector to represent each flow. We then cluster the vectors of motion flows, and filter out clusters in which the motion flows are likely to be noise. For each succeeding window, we first match and connect motion flows with those in the remaining clusters of the preceding window, and then cluster the remaining motion flows. After the whole video shot is processed, we merge the remaining similar motion flows to generate the representative motion trajectories, which are later used to represent the motion features of the video.

Fig. 1 The processing of motion feature extraction (video frames of a shot → motion vector extraction from MPEG files and corner point detection → camera motion parameter estimation and compensation → local motion vectors → sliding-window-based interesting motion flow extraction with clustering → clustering and merging → representative motion flows)

3.1 Local motion extraction

In this subsection, we present the proposed method of extracting local motion using motion vectors from MPEG files. The motion vectors in MPEG files describe both camera motion (or global motion) and local motion. Hence, we need to compensate for the camera motion to calculate the local motion vectors. Several approaches exist to estimate camera motion [21, 23, 31]; one of them is based on the four-parameter global motion model [21, 23]. However, this original model has two problems.

First, the motion vectors may not precisely capture the characteristics of motions in a frame if the texture of video blocks is fuzzy or simple, e.g., a large blue region representing sky or sea. As shown in Fig. 2, the buildings, the sky and the green playground have only faint texture features, so their motion vectors are 0 regardless of the movement of the camera. Second, computing global motion using the 4-parameter model may deteriorate performance when the camera is fixed: the global motion compensation may introduce errors into the local motion vectors, because it adds needless and disturbing camera motion as noise to every generated local motion vector.

Fig. 2 Motion vectors in a video frame

To address the first problem, i.e., to reduce noise and improve the accuracy of local motion vectors, we use the Harris corner point detector to find blocks containing local interesting points (LIPs) with clear and distinctive texture features. Harris corner points usually lie at the crossing of two areas with different colors, and can be easily located when there is motion. We use the location and distribution of the blocks containing LIPs, i.e., the detected Harris corner points as well as their motion vectors, to conduct camera motion detection and local motion extraction. Figure 3 shows the Harris corner points of the frame; most of the LIPs are located in areas with distinctive texture features.

We adopt the following 4-parameter regression model [21] for camera motion compensation:

\[
\overrightarrow{MV}_{cam} =
\begin{pmatrix} C_z & C_r \\ -C_r & C_z \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
+
\begin{pmatrix} C_p \\ C_t \end{pmatrix}
\tag{1}
\]

where C_z, C_r, C_p and C_t are the four parameters describing the camera motions of zoom, rotation, pan and tilt respectively.

In contrast to the approach of [23], which uses motion vectors from all the blocks in a frame to compute the four regression parameters, we use only the blocks that contain Harris corner points. After estimating the four parameters, we compute the camera motion in each block and use it to calculate the local motion. Because motion vectors in areas of fuzzy texture are noise-prone and meaningless, we only retain the local motions of the blocks that contain LIPs for the motion flow extraction, achieving better efficiency and effectiveness.

Fig. 3 Local motion vectors of Harris corner points
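To make this step concrete, the sketch below (Python with NumPy; a minimal illustration under our own naming and array-layout assumptions, not the authors' Matlab implementation) fits the four parameters of Eq. (1) by least squares over the corner-point blocks and then subtracts the predicted camera motion:

```python
import numpy as np

def estimate_camera_motion(points, vectors):
    """Least-squares fit of the 4-parameter model of Eq. (1).

    points  : (n, 2) array of block centers (x, y) that contain LIPs
    vectors : (n, 2) array of their MPEG motion vectors (u, v)
    Returns theta = [Cz, Cr, Cp, Ct].
    """
    x, y = points[:, 0], points[:, 1]
    # Eq. (1) componentwise: u = Cz*x + Cr*y + Cp ;  v = -Cr*x + Cz*y + Ct
    rows_u = np.stack([x, y, np.ones_like(x), np.zeros_like(x)], axis=1)
    rows_v = np.stack([y, -x, np.zeros_like(x), np.ones_like(x)], axis=1)
    A = np.concatenate([rows_u, rows_v])
    b = np.concatenate([vectors[:, 0], vectors[:, 1]])
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta

def local_motion(points, vectors, theta):
    """Subtract the predicted camera motion to obtain local motion vectors."""
    cz, cr, cp, ct = theta
    x, y = points[:, 0], points[:, 1]
    cam = np.stack([cz * x + cr * y + cp, -cr * x + cz * y + ct], axis=1)
    return vectors - cam
```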

For the second problem, caused by a fixed camera, we build a classifier to detect possible camera motion in a video frame; if the camera is fixed, camera motion compensation is not applied. Comparing frames with a moving camera and those with a fixed camera, we make the following observations:

– When there is camera motion, the background area is blurred and keeps changing, which decreases the number of LIPs. Thus, frames with camera motion generally have fewer LIPs than frames of the same scene with a fixed camera.

– When there is camera motion, the number of blocks with notable motion vectors is usually high.

– The average motion vectors of frames with camera motion have notable lengths, while the lengths of the average motion vectors with a fixed camera are close to 0.

Based on the above observations, we build a classifier with an SVM using the following three features to identify frames with camera motion:

\[
\left( \frac{|C_{Harris}|}{|C|},\ \frac{|C_{Motion}|}{|C|},\ \frac{\sum_{i \in C_{Motion}} (X_i, Y_i)}{|C_{Motion}|} \right)
\tag{2}
\]

where C is the set of all blocks; C_Motion is the set of motion vectors whose motion is larger than a predefined threshold ξ, one fourth of the block length in our experiments; C_Harris is the set of blocks containing LIPs; and (X_i, Y_i) is the value of motion vector i in C_Motion.

Figure 2 shows a running basketball player with the camera tracking him. The file is in MPEG-2 format and the block size is 16 × 16. The motion vectors inside the player are nearly 0 because the camera is tracking him. Figure 3 shows the local motion of the blocks containing LIPs that we extracted. Comparing the two figures, we can see that more than half of the frame is background with a fuzzy, plain texture whose motion should be ignored, and that the local motion vectors extracted from blocks by our method are highly accurate.
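A sketch of the per-frame feature computation of Eq. (2) follows, assuming the third feature is the magnitude of the mean motion vector over C_Motion (the observation about "notable lengths" above suggests this reading); the names and layout are our own:

```python
import numpy as np

def camera_motion_features(vectors, harris_mask, block_len=16):
    """Per-frame features of Eq. (2) for the camera-motion SVM.

    vectors     : (n, 2) motion vectors of all n blocks in the frame
    harris_mask : (n,) boolean, True where the block contains a LIP
    Returns (|C_Harris|/|C|, |C_Motion|/|C|, ||mean vector of C_Motion||).
    """
    n = len(vectors)
    xi = block_len / 4.0                       # threshold from the paper
    moving = np.linalg.norm(vectors, axis=1) > xi
    mean_motion = (vectors[moving].mean(axis=0)
                   if moving.any() else np.zeros(2))
    return (harris_mask.sum() / n,
            moving.sum() / n,
            float(np.linalg.norm(mean_motion)))
```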

3.2 Representative motion flow extraction

Having introduced the approach to generating the motion vectors, we are ready to present the proposed mechanism of matching and linking motion vector sequences to form motion flows.

Because the motion vectors are stored in MPEG format block by block, a motion vector that begins at the center of a block may end at an arbitrary location in the next frame. Hence, when we move to the next frame, we need to find a block that matches the end of the motion vector. Let L_cur denote the local motion vector extracted from the current frame. We denote the block in the next frame that matches the end of L_cur as B_anc, which is computed as follows:

\[
B_{anc} = \arg\min_{m \in C} \left\| \left( \alpha\theta_m,\ \beta\phi_m,\ (1-\alpha-\beta)\varphi_m \right) \right\|
\tag{3}
\]

where C is the set of blocks from the next frame that may match L_cur; m is a block in C; θ_m is the angle between L_cur and the motion vector of block m; φ_m is the similarity between the color histogram of block m and that of the block of L_cur; ϕ_m is m's overlapping area in the next frame; and α, β are the weights of the components. In our experiments, we give the three factors equal weights.
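A hypothetical sketch of this matching follows. Since Eq. (3) takes an arg min, the sketch converts the histogram similarity φ_m and overlap ϕ_m into dissimilarities (1 − value) so that better matches receive lower cost; this inversion, the candidate-dictionary layout and the function name are our assumptions:

```python
import numpy as np

def match_block(l_cur, candidates, alpha=1/3, beta=1/3):
    """Sketch of the block matching of Eq. (3).

    l_cur      : (dx, dy) of the current local motion vector
    candidates : list of dicts with keys 'mv' (the block's motion vector),
                 'hist_sim' (color-histogram similarity in [0, 1]) and
                 'overlap' (overlap ratio with the end of l_cur, in [0, 1])
    Returns the index of the best-matching block.
    """
    def angle(u, v):
        nu, nv = np.linalg.norm(u), np.linalg.norm(v)
        if nu == 0 or nv == 0:
            return np.pi                        # no direction: worst angle
        c = np.clip(np.dot(u, v) / (nu * nv), -1.0, 1.0)
        return np.arccos(c)

    costs = []
    for m in candidates:
        theta = angle(l_cur, m['mv'])           # direction disagreement
        phi = 1.0 - m['hist_sim']               # appearance disagreement
        var = 1.0 - m['overlap']                # position disagreement
        costs.append(np.linalg.norm([alpha * theta, beta * phi,
                                     (1 - alpha - beta) * var]))
    return int(np.argmin(costs))
```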

In Fig. 4, we show an example of the motion flows extracted from a video shot in which a basketball player is running with the camera tracking him. We only use the local motion vectors from blocks containing LIPs.

By matching the local motion vectors of each frame, we can obtain the motion flows of video shots. However, motion flows obtained in this way may still contain a lot of noise, because erroneous motion vectors in some frames lead to mismatching, and the number of motion flows keeps growing when no existing motion flow matches the motion vectors in the frame being processed.

To address this problem, we would ideally like to eliminate noise and keep only the representative motion flows, i.e., keep the important motion flows and drop the insignificant ones. We propose to extract interesting motion flows by clustering motion flows with similar tendency and location, and then filtering out the noisy flows. Since the motion flows over a whole shot may have different lengths and are more complicated, we adopt a sliding window framework, for better granularity, to extract interesting motion flows and connect the matched flows. The processing of representative motion flow extraction is outlined in Fig. 5:

– For each window, which is a specific number of adjacent frames, we extract and normalize the motion flows.

– We apply the Discrete Cosine Transform (DCT) to the normalized motion flows and extract a multi-dimensional vector representing each flow's feature.

– In each window, we first compare the motion flows with the interesting motion flows extracted from the preceding window, using the similarity measure we define, and connect them if they match. Then, we cluster the remaining flows, select "good" clusters, and reserve the motion flows in these clusters as interesting motion flows.

– After processing the whole video shot, we filter out noise and redundancy and merge the similar motion flows. The reserved representative motion flows are used to represent the video's motion feature in retrieval.

Fig. 4 Motion flows extracted from a video shot

Fig. 5 The processing of representative motion flow extraction (corner point information and local motion vectors → motion flow extraction in a window → matching against the interesting flows of the preceding window → clustering into clusters 1..N → interesting motion flows; the motion information of the next window is generated until the end of the shot, after which clustering and merging produce the representative motion flows)

3.2.1 Motion flow description for clustering

We split the video sequence into windows to achieve better granularity, and then employ a clustering technique to find interesting flows in each window. Within each window, we normalize the motion flows for clustering. Consider a flow in the form {(X_i, Y_i, T_i), i = 1, ..., n}, where n is the length of the motion flow and (X_i, Y_i, T_i) is the i-th node of the motion flow. It is normalized by the following equation, for each (X_i, Y_i, T_i):

\[
\left( X'_i,\ Y'_i,\ T'_i \right) = \frac{\left( X_i - X_1,\ Y_i - Y_1,\ T_i \right)}{\sum_{k=2}^{n} \left\| \left( X_k - X_{k-1},\ Y_k - Y_{k-1} \right) \right\|}
\tag{4}
\]

After normalization, each motion flow starts at (0, 0) and its length is 1. We perform a DCT on each normalized motion flow, on its x-dimension and y-dimension respectively. After the DCT, we retain the first several dimensions on both x and y, and append the start location of the motion flow, to form a multi-dimensional vector as the motion flow's feature for clustering. A specific distance measure in this feature space, such as Euclidean distance, is then used to evaluate the similarity between motion flows.

The number of DCT dimensions retained is a tuning parameter. We perform clustering algorithms on motion flows based on their multi-dimensional vectors. If the number of retained dimensions is large, the effectiveness of clustering is low due to the curse of dimensionality. On the contrary, if the number is small, clustering is not effective either, because the retained information may not be sufficient to identify the similarity between motion flows. In our implementation, we retain the first 10 dimensions on x and y respectively as a trade-off between the two factors. Hence, a 22-dimensional vector is used to represent the features of a motion flow in this work.
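A sketch of this 22-D descriptor follows (normalization of Eq. (4) followed by a per-axis DCT). The paper does not state the DCT type or normalization, so the standard orthonormal type-II DCT is assumed, and the temporal component T_i is omitted for brevity:

```python
import numpy as np
from scipy.fft import dct

def flow_feature(flow, n_coeff=10):
    """22-D clustering feature of a motion flow: Eq. (4) normalization,
    DCT per axis, first 10 coefficients on x and y, plus the start point.

    flow : (n, 2) array of (X_i, Y_i) node positions
    """
    flow = np.asarray(flow, dtype=float)
    start = flow[0].copy()
    shifted = flow - start                       # flow now starts at (0, 0)
    length = np.linalg.norm(np.diff(flow, axis=0), axis=1).sum()
    if length > 0:
        shifted /= length                        # total path length = 1
    cx = dct(shifted[:, 0], norm='ortho')[:n_coeff]
    cy = dct(shifted[:, 1], norm='ortho')[:n_coeff]
    cx = np.pad(cx, (0, n_coeff - len(cx)))      # pad very short flows
    cy = np.pad(cy, (0, n_coeff - len(cy)))
    return np.concatenate([cx, cy, start])       # 10 + 10 + 2 = 22 dims
```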


3.2.2 Clustering and cluster selection

Generally, the interesting motion flows come from the significant moving objects that correspond to human perception. It is difficult to determine whether an individual motion flow is interesting; instead, we extract interesting motion flows by clustering and then filter the noisy flows at the cluster level. We use the cluster size and the sum of intra-cluster distances to determine whether a cluster contains interesting motion flows. We use the K-means algorithm to group the flows in each window into clusters.

– The size of a cluster, i.e., the number of motion flows in the cluster. Clusters of a reasonable size are more likely to contain important moving information, while a cluster containing only a few motion flows is more likely to be noise.

– The sum of intra-cluster distances, i.e., the sum of distances between the motion flows and the cluster center:

\[
\sum_{i \in Cluster_m} \left\| F_{mi} - Center_m \right\|
\]

where F_mi and Center_m are the multi-dimensional features of motion flow i in cluster m and of the centroid motion flow of cluster m, respectively. If this value is low, the motion flows in the cluster have high similarity and are more likely to be interesting motion flows.

We measure the importance of a cluster by the ratio between the above two values, and keep the motion flows in good clusters as interesting motion flows. After clustering and extracting motion flows from the selected clusters, we obtain the interesting flows shown in Fig. 6. Compared with the original flows in Fig. 4, noisy or unimportant motions that are isolated and far away from the moving object are filtered out effectively.
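A sketch of this selection follows, assuming the quality score is the cluster size divided by the intra-cluster distance sum (the paper states the ratio but not its exact form) and using scikit-learn's K-means; k = 20 and keep = 5 follow the settings reported in Section 5.1:

```python
import numpy as np
from sklearn.cluster import KMeans

def interesting_clusters(features, k=20, keep=5):
    """K-means on the 22-D flow features, then rank clusters by
    size / intra-cluster distance and keep the top `keep` as interesting.
    Returns, for each kept cluster, the indices of its member flows.
    """
    features = np.asarray(features, dtype=float)
    k = min(k, len(features))
    km = KMeans(n_clusters=k, n_init=10).fit(features)
    scores = []
    for m in range(k):
        members = features[km.labels_ == m]
        spread = np.linalg.norm(members - km.cluster_centers_[m],
                                axis=1).sum()
        # larger clusters with tighter spread score higher
        scores.append(len(members) / (spread + 1e-9))
    best = np.argsort(scores)[::-1][:keep]
    return [np.where(km.labels_ == m)[0] for m in best]
```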

3.2.3 Generating representative motion flows by sliding window

Fig. 6 Interesting motion flows in a window

We have discussed how to extract motion flows and select interesting motion flows within a window. We next discuss how to match and connect the interesting motion flows in the sliding window framework to generate motion flows that represent the whole video shot. Our aim is to extract representative and distinctive motion flows from the video shot.

To extract the interesting motion flows, we first match the flows in the current window against the set of interesting flow candidates extracted from the preceding window, and connect the matched motion flows to their preceding flows. That is, for each motion flow CFlow in the set of reserved interesting flows of the processed windows, we connect flow_i, a motion flow in the current window, with CFlow if the similarity between flow_i and CFlow is the maximum over all the motion flows in the current window and larger than a certain threshold, which is a tuning parameter. Additionally, we cluster the remaining motion flows and identify new local interesting motion flow candidates in the window. Both the connected flows and the local interesting flows will be used for the succeeding windows. The similarity for flow matching is defined as:

\[
Sim = \alpha \cos(CFlow, flow_i) \;-\; \beta\, \frac{\left| CFlow_{f_t} - flow_{i,f_0} \right|}{BLK} \;+\; \delta\, \frac{\left| flow_i \right|}{\max_{flow}}
\tag{5}
\]

where α, β, δ are positive weights; cos(CFlow, flow_i) is the cosine of the angle between the last motion vector in CFlow and the first motion vector in flow_i (a higher value means a greater chance that the two flows can be connected, since they have the same moving direction in nearby frames); |CFlow_{f_t} − flow_{i,f_0}| is the distance between the flow location in the last frame of CFlow and that in the first frame of flow_i; BLK is the size of each block, 16 × 16 in our approach; and |flow_i| is the length of flow_i, normalized by the maximum flow length in the selected window, since a longer flow is less likely to be noise. In our experiments, we set α, β, and δ equal.
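A sketch of the matching score of Eq. (5) with the equal weights used in the paper; the node-array representation of flows (each with at least two nodes) is our own:

```python
import numpy as np

def flow_similarity(cflow, flow_i, max_len, blk=16,
                    alpha=1/3, beta=1/3, delta=1/3):
    """Sketch of the flow-matching score of Eq. (5).

    cflow, flow_i : (n, 2) node arrays of the candidate and current flows
    max_len       : length of the longest flow in the current window
    """
    def unit(v):
        n = np.linalg.norm(v)
        return v / n if n > 0 else v

    last_vec = cflow[-1] - cflow[-2]           # last motion vector of CFlow
    first_vec = flow_i[1] - flow_i[0]          # first motion vector of flow_i
    direction = np.dot(unit(last_vec), unit(first_vec))   # cosine term
    gap = np.linalg.norm(cflow[-1] - flow_i[0]) / blk     # spatial gap
    length = len(flow_i) / max_len                        # length prior
    return alpha * direction - beta * gap + delta * length
```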

After the whole shot is processed with the sliding window mechanism, we analyze the extracted interesting motion flows and finalize the motion flows that represent the motion feature of the video shot. Owing to the sliding window mechanism, the interesting motion flows generated by the above strategy have different lengths and may span multiple windows. The method of selecting the final representative motion flows is outlined as follows:

– For each window, we first select the longest interesting motion flow starting from the corresponding time slot in the window. We also cluster the remaining interesting motion flows, keeping the clusters with good properties and dropping the others according to the criteria given in the last subsection. We thus generate two sets of representative motion flows that keep the most meaningful trajectories.

– We next merge similar motion flows by computing their similarity with the 22-D vector features proposed previously. In a set of similar flows, we only select the longest one as the representative trajectory. We thus obtain the representative motion flows that represent the video shot.

Figure 7a shows the motion flows extracted from the whole video shot after processing all the windows; Fig. 7b shows the representative motion flow that represents the motion feature of the video shot using the aforementioned strategy, which differs from the interesting flows shown in Fig. 6. Note that we can select a single trajectory or multiple trajectories to represent the motion feature of a video for video retrieval.

Fig. 7 Representative motion flows of a video shot: (a) motion flows after processing all windows; (b) the representative motion flow

4 Motion feature based video retrieval

In the motion feature extraction stage, we extract the representative motion flows to represent the motion features of a video. An extracted trajectory can be considered as two-dimensional time series data, described as S = [(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)], where (x_i, y_i) is the i-th node of the trajectory.

Although we could apply existing trajectory matching algorithms [3, 5, 6] to this task, similarity computation on this raw representation may yield poor performance, since the raw data may contain noise and redundant points. Instead, we develop a new approach for trajectory matching that works in three steps: (1) segment the trajectories into subsections using a bottom-up segmentation algorithm [15, 16]; (2) generate the symbolic representation of a trajectory by a new method that maps each subsection to one or more letters as its signature; (3) retrieve trajectories based on their symbolic representation using the Edit Distance on Movement Pattern String (EDM) approach [5].

4.1 Trajectory segmentation

Given a raw trajectory in the form [(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)], we first segment it to achieve a compact representation for efficient trajectory matching. We adopt the bottom-up segmentation algorithm [15, 16] to segment trajectories and capture their characteristics. The algorithm begins by creating the n/2 two-point segments of the time series, the finest possible approximation. It then calculates the merging cost for each adjacent segment pair and merges the adjacent segments with the lowest merging cost. After each merge, the merging costs of the affected segments (the neighbors of the merged segments) are recalculated.

In this approach, the approximation error is used as the merging cost of adjacent segments, defined by the trajectory residual sum of squares:

\[
Error_{whole} = \sum_{s=1}^{N} \sum_{o=1}^{l_s} \left( t_{s_1} \cdot x_{s_o} + t_{s_2} - y_{s_o} \right)^2
\tag{6}
\]

where N is the number of segments, l_s is the length of the s-th segment, (x_{s_o}, y_{s_o}) is the o-th point of the s-th segment, and t_{s_1} and t_{s_2} are the regression parameters.
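A sketch of the bottom-up procedure under the cost of Eq. (6) follows. For simplicity it recomputes all pairwise merge costs in every iteration, whereas [15, 16] update only the neighbors of the merged pair; the stopping threshold max_error plays the role of the bound on Error_whole:

```python
import numpy as np

def fit_cost(x, y):
    """Residual sum of squares of a least-squares line fit (an Eq. (6) term)."""
    if len(x) < 2:
        return 0.0
    t1, t2 = np.polyfit(x, y, 1)
    return float(((t1 * x + t2 - y) ** 2).sum())

def bottom_up_segment(x, y, max_error):
    """Bottom-up segmentation: start from the finest 2-point segments and
    greedily merge the cheapest adjacent pair until every possible merge
    would exceed `max_error`.

    x, y : 1-D numpy arrays of trajectory coordinates
    Returns a list of (start, end) index ranges (end exclusive).
    """
    segs = [(i, min(i + 2, len(x))) for i in range(0, len(x), 2)]
    while len(segs) > 1:
        costs = [fit_cost(x[a[0]:b[1]], y[a[0]:b[1]])
                 for a, b in zip(segs, segs[1:])]
        j = int(np.argmin(costs))
        if costs[j] > max_error:
            break
        segs[j] = (segs[j][0], segs[j + 1][1])  # merge pair (j, j+1)
        del segs[j + 1]
    return segs
```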


4.2 Symbolic representation of trajectories

We use the linear regression parameters of segments to calculate their moving direction and distance. For each segment T_s of a trajectory, the first and last nodes (x_{s_1}, y_{s_1}) and (x_{s_{l_s}}, y_{s_{l_s}}) are extracted in addition to its linear regression parameters t_{s_1} and t_{s_2}, where l_s is the length of T_s and t_{s_1}, t_{s_2} satisfy:

\[
\min_{t_{s_1}, t_{s_2}} \sum_{o=1}^{l_s} \left( t_{s_1} \cdot x_{s_o} + t_{s_2} - y_{s_o} \right)^2.
\tag{7}
\]

The moving direction α of T_s is computed by:

\[
\alpha =
\begin{cases}
\arctan(t_{s_1}), & x_{s_{l_s}} \ge x_{s_1}, \\
\arctan(t_{s_1}) + \pi, & x_{s_{l_s}} < x_{s_1},\ y_{s_{l_s}} \ge y_{s_1}, \\
\arctan(t_{s_1}) - \pi, & x_{s_{l_s}} < x_{s_1},\ y_{s_{l_s}} < y_{s_1}.
\end{cases}
\tag{8}
\]

And the distance of segment T_s is computed by:

\[
dis = \sqrt{1 + t_{s_1}^2}\ \left| x_{s_{l_s}} - x_{s_1} \right|,
\tag{9}
\]

where \(\sqrt{1 + t_{s_1}^2} = 1/|\cos(\alpha)|\).

Trajectories of moving objects often contain noise because of uncertain factors such as video capturing techniques or compression algorithms. An appropriate representation can reduce noise while keeping useful information. In [5], the moving direction and distance between adjacent nodes of a trajectory are calculated and mapped into words with a symbol table, i.e., Edit Distance on Movement (EDM) patterns. Unlike quantized real sequences, a symbolic trajectory representation maps the real sequence to a string of symbols according to a specific symbol table, and is often smoother and more robust to noise. Thus, symbolic representations of trajectories are expected to yield better performance for video retrieval.

In our approach, we design a novel mapping mechanism that maps every unit distance of movement in a given moving direction to a specific character. We can thus use fewer distinct characters than the proposal in [5] to represent segments with different moving directions and distances. More specifically, we build a symbol table according to the objects' moving direction only. Unlike the symbol table of [5] shown in Fig. 8a, for each segment T_s with direction α_s and distance dis_s, we map each unit distance U_dis of the segment to the symbol s' assigned to α_s in the symbol table; T_s is thus mapped to dis_s / U_dis copies of the symbol s'. Figure 8b shows our symbol table with 8 directions. Compared to the 64 symbols used in [5], our approach uses far fewer symbols, which makes it robust in pruning noise. Retrieval accuracy is not sacrificed by the smaller alphabet, because the different moving distances of segments are represented by different string lengths. For example, a trajectory represented as "ZC" in [5] can be represented as "ggggf" in our approach.
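A sketch of this mapping follows. The direction circle is split into 8 equal sectors per Fig. 8b; the concrete alphabet 'a'–'h' and the sector boundaries are hypothetical stand-ins for the paper's symbol table:

```python
import math

def symbolize(segments, u_dis, n_dirs=8):
    """Map segments to a symbol string: each segment becomes roughly
    dis / U_dis copies of the character assigned to its direction sector.

    segments : list of (alpha, dis) pairs, alpha in (-pi, pi] from Eq. (8),
               dis from Eq. (9)
    u_dis    : the unit distance U_dis
    """
    out = []
    for alpha, dis in segments:
        sector = int(((alpha + math.pi) / (2 * math.pi)) * n_dirs) % n_dirs
        symbol = chr(ord('a') + sector)
        out.append(symbol * max(1, round(dis / u_dis)))
    return ''.join(out)
```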

Fig. 8 Examples of symbol tables: (a) EDM [5], indexed by movement direction and movement distance ratio; (b) our approach, indexed by movement direction only

4.3 Trajectory matching for video retrieval

After representing trajectories as symbol strings, we use the EDM approach to measure their similarity, since EDM [5] is reported to outperform other distance functions such as edit distance, LCSS and DTW. Compared with edit distance on plain strings, EDM eliminates the substitution cost when two different symbols are adjacent in the symbol table; with plain edit distance, two such symbols incur a cost even when their moving directions are similar but lie on opposite sides of a boundary in the symbol table. The major difference of our approach from EDM [5] for trajectory retrieval is that we design a novel mapping mechanism that maps every unit distance of movement in a given moving direction to a specific character, so that fewer distinct characters suffice to represent segments with different moving directions and distances.
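A sketch of an EDM-style distance: standard edit-distance dynamic programming, except that substituting two symbols adjacent in the symbol table costs nothing. The `adjacent` predicate is left to the caller; for the hypothetical 8-letter alphabet above it could be `lambda a, b: (ord(a) - ord(b)) % 8 in (1, 7)`:

```python
def edm_distance(s, t, adjacent):
    """Edit distance over symbol strings with free substitution of
    table-adjacent symbols, in the spirit of EDM [5].
    """
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if (s[i-1] == t[j-1] or adjacent(s[i-1], t[j-1])) else 1
            d[i][j] = min(d[i-1][j] + 1,        # deletion
                          d[i][j-1] + 1,        # insertion
                          d[i-1][j-1] + sub)    # match / substitution
    return d[m][n]
```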

The overall process of video retrieval based on motion trajectory matching works as follows. We extract the representative motion flows from videos and store them in a database as motion features. To conduct video retrieval, we extract the motion trajectory of the query video and use the trajectory matching method to find similar videos in the database. The proposed method effectively generates the representative trajectories as motion features, and retrieves similar videos from the database efficiently.

5 Experimental results

We conduct extensive experiments to evaluate the performance of the proposed approaches, including the motion feature extraction method and the trajectory retrieval method, by comparing them with some recent and related techniques. All experiments are conducted in Matlab 7.1 on a PC with a Core 2 CPU (2.2 GHz) running Windows XP.

We used two different datasets for evaluation.

– To evaluate the performance of motion flow extraction and video retrieval, we use 54 hours of video clips downloaded from YouTube [1]. We segment the video clips into 29,604 shots by simply comparing the difference between adjacent frames against a predefined threshold, and extract interesting motion flows from the video shots. This dataset contains video clips from different scenes, and the quality of the clips varies.

– To evaluate our trajectory representation and retrieval method, we use the ASL and NHL datasets [5]. The ASL dataset consists of samples of Australian Sign Language signs and contains 699 trajectories; the NHL dataset consists of 1,000 National Hockey League players' trajectories.

In Section 5.1, we use some examples to intuitively illustrate the effectiveness of our approach in extracting motion features, and we evaluate the usefulness of the extracted motion features for video retrieval in Section 5.2.2.

5.1 Motion feature extraction

Several parameters need to be set in the implementation of our motion feature extraction approach. We empirically set the window size to 30 frames and the size of each block to 16 × 16, and for each window we only reserve the top 5 clusters with the highest cluster quality (defined in Section 3.2.2) out of 20 K-means clusters. We set the threshold on Eq. 5, used when merging similar motion flows into the representative motion flows, to 10.

5.1.1 Local motion extraction

We first use an example to illustrate the effectiveness of the proposed approach in extracting local motion vectors. To determine camera motion, we use 100 annotated frames, with or without camera motion, as the training set to build an SVM classifier.

Fig. 9 Motion vectors generated by different approaches: (a) original motion vectors; (b) motion vectors of our method; (c) plain 4-parameter model

Figure 9a shows the original motion vectors of a frame in which a basketball player jumps and the camera tracking the player also moves up. Figure 9b shows the local motion vectors extracted by our method, and Fig. 9c shows the motion vectors calculated with the plain 4-parameter regression model [21]. Figure 9a shows that although there is camera motion, nearly half of the background blocks have a null motion vector. This is because background blocks such as sky and playground have nearly plain textures, making it hard to generate motion vectors for them. Our approach estimates camera motion from the blocks that contain LIPs, i.e., blocks with distinctive texture features, while the plain 4-parameter regression model uses all blocks, including those with null motion vectors, to estimate camera motion. Comparing Fig. 9b with c, we can see that our approach compensates the camera motion exactly and extracts the local motion precisely; the plain 4-parameter regression model generates noise, as shown in the sky and playground blocks of Fig. 9c, and cannot extract the player's local motion vectors well, since the camera motion has not been eliminated sufficiently.

5.1.2 Representative motion flow extraction

We use a video of traffic flow to compare our motion feature extraction method with [23]. The video shows vehicles on a road, some of which are changing lanes.

Fig. 10 Motion flows in a video shot and the representative flows: (a) our approach; (b) motion flow [23]

Figure 10a shows the result of our method using sliding-window and cluster-based motion feature extraction. The left of sub-figure (a) shows all the interesting motion flows after filtering the noisy motion flows in the video shot. Because of the cluster selection in each window, we only reserve and match the important motion flows and drop the noisy ones, so the motion flows extracted by our approach are more representative. The right sub-figure in Fig. 10a shows the representative flows of the video, computed by merging the similar motion flows. From Fig. 10b, we can see that the motion flow extraction algorithm of [23] cannot eliminate the noise effectively, because of the effect of camera motion and the diversity of object movements in the video shot. The merged motion flows in the right sub-figure show that motion flow extraction using all the motion vectors incurs many errors.

Figure 11 shows three more examples of motion feature extraction. In each row, the left three images are representative frames of a video shot, and the two trajectories shown on the right are generated by our method and by the method of [23], respectively. Note that we only show one representative trajectory for clarity of presentation. Our method extracts motion flows with higher accuracy than the approach of [23], because we only utilize the meaningful motion vectors in video frames, and the sliding-window-based clustering mechanism captures the most representative motion flows. The first example shows that the horse is walking in the court although its position on the screen does not change much; on the second video shot, our approach precisely models the motion of the basketball player, i.e., run, jump and drop.

5.2 Trajectory and video retrieval

In this set of experiments, we first evaluate the performance of our approach for trajectory retrieval, and then apply our technique to video retrieval. For all the approaches in our experiments, the parameter Error_whole in Eq. 6 is set to 0.1 for the ASL dataset, and to 0.02 for the NHL dataset and our 54-hour video dataset. The unit distance U_dis in our symbolic method is set to the minimum segment length in the dataset.

Fig. 11 More examples of motion feature extraction

5.2.1 Trajectory retrieval

We compare our method with the coarse-to-fine method (denoted Motion Flow) proposed in [23], as well as two relevant approaches, EDM [5] and PCA String [3]. The method in [23] segments the trajectories and transforms them into 6-d time series. EDM [5] adopts a symbolic representation for trajectory matching. PCA String [3] uses principal components to represent sub-trajectories and maps them into symbols, on which edit distance computes the similarity. Retrieval performance is measured by precision and recall, which are widely used in evaluating video retrieval. Precision measures how precise the search results are (the number of correct results divided by the number of all returned results), and recall measures the completeness of the available relevant results (the number of correct results divided by the number of results that should have been returned).

Figure 12 shows the performance of the different approaches on the ASL and NHL trajectory datasets. On the ASL data, the PCA String algorithm [3] performs better than EDM [5], consistent with what is reported in [3]. On the NHL data, the PCA String algorithm performs worse than EDM; the reason is that the patterns of NHL trajectories are simple and long. On both datasets, our approach yields the best performance in most cases. In contrast with the trajectory matching algorithm in [23], we symbolize the motion direction into a small set of symbols and quantize the distance as the length of the symbol string. By doing so, we measure only the moving direction, as with the moving patterns introduced in [5], where moving patterns can be approximated when their symbols are adjacent. Additionally, we do not treat moving distances as moving patterns, but quantize them by a sequence of symbols. Thus, the edit distance between two segments can be large if their distances and directions differ a lot, whereas the distance can only be 0 or 1 in [5]. This makes our approach more effective and robust across different kinds of datasets.

Fig. 12 Performance on trajectory retrieval: precision vs. recall (a) on ASL data and (b) on NHL data, for Motion Flow [23], PCA String [3], EDM [5] and our approach


5.2.2 Video retrieval

The video retrieval experiment is conducted on the 54 hours of YouTube video data. We choose 10 different video shots containing different types of motions, i.e., fast or slow motion with or without camera motion, as the query set. For each query, the top-N returned videos with the highest similarity, e.g., N = 3, 5, 7 and 10, are presented to three evaluators, who determine whether the top-N returned videos are relevant to the query video. P@N, a popular metric in information retrieval, measures the fraction of the top N retrieved videos that are relevant to the user's interest. In this section, we first select one representative trajectory to represent the motion feature, as in [14, 23], and then use multiple representative motion flows to evaluate the video retrieval performance.

We first evaluate the effect of the window size in motion flow extraction on retrieval performance by varying the window size from 15 to 60 frames; the results are shown in Fig. 13. When the window size is too small, the flows extracted over such a short period may not exhibit significant movement patterns; when the window size is too large, the extracted motion flows are too complicated and introduce more noise into the interesting flow clusters. In our experiments, we observe that 30 frames is the best granularity for sliding-window-based flow extraction.

We next proceed to compare our method with the method of [23]. To evaluate the effectiveness of the extracted motion flows independently of the trajectory matching method, we also compare with a variant of [23] that uses our trajectory retrieval method. Figure 14 shows the comparison results. We can see that our method yields higher precision; overall, the average improvement is more than 30%. Our approach has two advantages over the algorithm in [23]. First, we extract more representative trajectories from the video data: our sliding-window-based clustering and filtering method keeps meaningful motion flows and prunes noise. Second, the proposed trajectory matching method is more effective and robust for trajectory comparison.

We only select the most representative motion flow for video retrieval in the experiment shown in Fig. 14. Both our approach and [23] can generate multiple representative motion flows; the number of generated trajectories for different videos varies from 1 to 10 under the parameter settings of Section 5.1.

Fig. 13 Effect of window size: precision at P@3, P@5, P@7 and P@10 for window sizes W = 15, 30 and 60

Fig. 14 Performance of video retrieval: precision at P@3, P@5, P@7 and P@10 for our approach, Motion Flow [23] with our matching method, and Motion Flow [23]

In this experiment, we evaluate the retrieval performance of using multiple representative motion flows; the results are presented in Fig. 15. We conduct all possible matches between the multiple trajectories of two videos, calculate the trajectory similarity for each matching, and take the maximum similarity as the similarity between the two shots. With our approach, multiple representative flows perform better than a single flow, by about 5%, while the improvement for [23] is less significant. Multiple representative flows can capture more motion features, but also introduce more noise. Our approach extracts more meaningful and accurate motion flows, and hence yields better performance than [23]. Note that although multiple trajectory matching is more effective than single trajectory matching, its time cost is about 5 times higher, as the average number of representative trajectories extracted in our video dataset is around 3.

Fig. 15 Performance of multiple representative flows: precision at P@3, P@5, P@7 and P@10 for our approach and Motion Flow [23], each with single and multiple flows


6 Conclusion

In this paper, we have proposed a novel motion feature extraction strategy and used this feature for content-based video retrieval. We used local interesting points detected by the Harris detector to compensate for the effect of camera motion and generated motion flows. We extracted the representative motion flows by clustering the motion flows within sliding windows. We also proposed a segmentation-based matching algorithm to compute the similarity between trajectories and retrieve videos by trajectory matching. Experimental results show that our approach performs better than the state-of-the-art techniques.

Acknowledgements This research was supported by the National Natural Science Foundation of China under Grants No. 60933004, 60811120098 and 61073019, and by Grant SKLSDE-2010KF-03.

References

1. Youtube. http://www.youtube.com/
2. Bashir FI, Khokhar AA, Schonfeld D (2007) Object trajectory-based activity classification and recognition using hidden Markov models. IEEE Trans Image Process 16(7):1912–1919
3. Bashir FI, Khokhar AA, Schonfeld D (2007) Real-time motion trajectory-based indexing and retrieval of video sequences. IEEE Trans Multimedia 9(1):58–65
4. Chang SF, Chen W, Meng HJ, Sundaram H, Zhong D (1998) A fully automated content-based video search engine supporting spatiotemporal queries. IEEE Trans Circuits Syst Video Technol 8(3):602–615
5. Chen L, Özsu MT, Oria V (2004) Symbolic representation and retrieval of moving object trajectories. In: Proc. of the 6th ACM multimedia workshop on MIR, pp 227–234
6. Chen L, Özsu MT, Oria V (2005) Robust and fast similarity search for moving object trajectories. In: Proc. of ACM SIGMOD conference, pp 491–502
7. Cucchiara R, Grana C, Piccardi M, Prati A (2003) Detecting moving objects, ghosts, and shadows in video streams. IEEE Trans Pattern Anal Mach Intell 25(10):1337–1342
8. Dagtas S, Al-Khatib W, Ghafoor A, Kashyap RL (2000) Models for motion-based video indexing and retrieval. IEEE Trans Image Process 9(1):88–101
9. Deng Y, Manjunath BS (1998) NeTra-V: toward an object-based video representation. IEEE Trans Circuits Syst Video Technol 8(5):616–627
10. Fablet R, Bouthemy P, Perez P (2002) Nonparametric motion characterization using causal probabilistic models for video indexing and retrieval. IEEE Trans Image Process 11(4):393–407
11. Flickner M, Niblack HW, Ashley J, Huang Q, Dom B, Gorkani M, Hafner J, Lee D, Petkovic D, Steele D, Yanker P (1995) Query by image and video content: the QBIC system. Computer 28(9):23–32
12. Hampapur A, Gupta A, Horowitz B, Shu CF, Fuller C, Bach J, Gorkani M, Jain R (1997) Virage video engine. In: SPIE proc. storage and retrieval for image and video databases V, pp 188–197
13. Harris CG, Stephens MJ (1988) A combined corner and edge detector. In: Proc. of 4th Alvey vision conference, pp 147–151
14. Hsieh J-W, Yu S-L, Chen Y-S (2006) Motion-based video retrieval by trajectory matching. IEEE Trans Circuits Syst Video Technol 16:396–409
15. Keogh E, Chu S, Hart D, Pazzani M (2004) Segmenting time series: a survey and novel approach. In: Data mining in time series databases. World Scientific
16. Keogh EJ, Chu S, Hart D, Pazzani MJ (2001) An online algorithm for segmenting time series. In: Proc. of ICDM conference, pp 289–296
17. Le T-L, Boucher A, Thonnat M (2007) Subtrajectory-based video indexing and retrieval. In: Proc. of MMM conference, pp 418–427
18. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
19. Ma Y-F, Zhang H-J (2002) Motion texture: a new motion based video representation. In: Proc. of ICPR conference, pp 548–551
20. Manjunath BS, Salembier P, Sikora T (eds) (2002) Introduction to MPEG-7: multimedia content description interface. Wiley, New York
21. Rath GB, Makur A (1999) Iterative least squares and compression based estimations for a four-parameter linear global motion model and global motion compensation. IEEE Trans Circuits Syst Video Technol 9:1075–1099
22. Sivic J, Schaffalitzky F, Zisserman A (2004) Object level grouping for video shots. In: Proc. of ECCV conference, pp 85–98
23. Su C-W, Liao H-Y, Tyan H-R, Lin C-W, Chen D-Y, Fan K-C (2007) Motion flow-based video retrieval. IEEE Trans Multimedia 9(6):1193–1201
24. Tomasi C, Kanade T (1991) Detection and tracking of point features. Carnegie Mellon University Technical Report CMU-CS-91-132
25. Tsaig Y, Averbuch A (2002) Automatic segmentation of moving objects in video sequences: a region labeling approach. IEEE Trans Circuits Syst Video Technol 12(7):597–612
26. Wang F, Jiang Y-G, Ngo C-W (2008) Video event detection using motion relativity and visual relatedness. In: Proc. of ACM MM conference, pp 239–248
27. Wu X, Takimoto M, Satoh S, Adachi J (2008) Scene duplicate detection based on the pattern of discontinuities in feature point trajectories. In: Proc. of ACM MM conference, pp 51–60
28. Yilmaz A, Javed O, Shah M (2006) Object tracking: a survey. ACM Comput Surv 38(4):13
29. Zhu G, Liang D, Liu Y, Huang Q, Gao W (2005) Improving particle filter with support vector regression for efficient visual tracking. In: Proc. of ICIP conference, pp 422–425
30. Avrithis YS, Doulamis AD, Doulamis ND, Kollias SD (1999) A stochastic framework for optimal key frame extraction from MPEG video databases. Comput Vis Image Underst 75:3–24
31. Lertrusdachakul T, Aoki T, Yasuda H (2005) Camera motion characterization through image feature analysis. In: Proc. of ICCIMA conference
32. Yeo B-L, Liu B (1995) Rapid scene analysis on compressed video. IEEE Trans Circuits Syst Video Technol 5:533–544

Zhe Zhao received the BS degree from the Department of Computer Science, Peking University, in 2008. He is currently a Master's student in the Database Lab, Department of Computer Science, Peking University, China. His research interests include time series databases, multimedia databases and web mining. He is a student member of the ACM.


Bin Cui received the PhD degree in computer science from the National University of Singapore in 2004. Currently, he is a Professor in the Department of Computer Science, Peking University. His major research interests include database performance issues, query and index techniques, multimedia databases and Web data management. He has published over 60 research papers in international journals and conferences including TKDE, TOIS, SIGMOD, SIGIR and ACM MM. He has served on the Technical Program Committees of various international conferences including SIGMOD, VLDB and ICDE, and is currently an associate editor of IEEE Transactions on Knowledge and Data Engineering. Dr. Cui is a member of the ACM and a senior member of the IEEE.

Gao Cong is currently an Assistant Professor in the School of Computer Engineering, Nanyang Technological University (NTU). Before joining NTU, he was an Assistant Professor at Aalborg University, Denmark (2008–2009); before that, he worked as a researcher at Microsoft Research Asia, China. From 2004 to 2006, he was a postdoctoral research fellow in the database group at the University of Edinburgh. He earned his PhD in Computer Science from the National University of Singapore in 2004. His current research interests include databases, data mining, text mining, and information retrieval. His work has been published in premier database, data mining, information retrieval and natural language processing conferences, such as ACM SIGMOD, VLDB, ICDE and ACM KDD. He has also served as a PC member for the aforementioned conferences.


Zi Huang is an Australian Postdoctoral Fellow in the School of ITEE, The University of Queensland. She received her BSc degree from the Department of Computer Science, Tsinghua University, China, and her PhD in Computer Science from the School of ITEE, The University of Queensland. Dr. Huang's research interests include multimedia search, knowledge discovery, and bioinformatics.

Heng Tao Shen is an Associate Professor (Reader) in the School of Information Technology & Electrical Engineering, The University of Queensland. He obtained his BSc (with 1st class Honours) and PhD from the Department of Computer Science, National University of Singapore in 2000 and 2004 respectively, then joined The University of Queensland as a Lecturer in June 2004 and became a Senior Lecturer in March 2007. His research interests include multimedia/mobile/Web search, database management, and P2P/cloud computing. Heng Tao has published in and served on program committees for most of the prestigious international publication venues in his areas of interest. He is also the winner of the Chris Wallace Award for Outstanding Research Contribution in 2010 from CORE Australasia.