
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 27, NO. 12, DECEMBER 2017 2527

Saliency Detection for Unconstrained Videos Using Superpixel-Level Graph and Spatiotemporal Propagation

Zhi Liu, Senior Member, IEEE, Junhao Li, Linwei Ye, Guangling Sun, and Liquan Shen

Abstract— This paper proposes an effective spatiotemporal saliency model for unconstrained videos with complicated motion and complex scenes. First, superpixel-level motion and color histograms as well as a global motion histogram are extracted as the features for saliency measurement. Then a superpixel-level graph with the addition of a virtual background node representing the global motion is constructed, and an iterative motion saliency (MS) measurement method that utilizes the shortest path algorithm on the graph is exploited to reasonably generate MS maps. Temporal propagation of saliency in both forward and backward directions is performed using efficient operations on inter-frame similarity matrices to obtain integrated temporal saliency maps with better coherence. Finally, spatial propagation of saliency both locally and globally is performed via intra-frame similarity matrices to obtain spatiotemporal saliency maps with even better quality. The experimental results on two video data sets with various unconstrained videos demonstrate that the proposed model consistently outperforms the state-of-the-art spatiotemporal saliency models on saliency detection performance.

Index Terms— Motion saliency (MS), saliency model, spatial propagation, spatiotemporal saliency detection, superpixel-level graph, temporal propagation, unconstrained video.

I. INTRODUCTION

Visual attention mechanism enables human observers to capture the visually salient objects in a complex scene effortlessly. The research on computational models for saliency detection was motivated by simulating the visual attention mechanism based on the biologically plausible architecture [1] and the feature integration theory [2], and was originally used for human fixation prediction [3]. In the past decade, saliency detection has been a booming research topic, and has benefited a broad range of applications such as salient object detection [4]–[6], salient object segmentation [7]–[11], content-aware image/video retargeting [12]–[14], content-based image/video compression [15]–[17], image/video quality assessment [18], [19], and visual scanpath prediction [20].

Manuscript received January 27, 2016; revised May 22, 2016; accepted July 15, 2016. Date of publication July 27, 2016; date of current version December 13, 2017. This work was supported in part by the National Natural Science Foundation of China under Grant 61471230, Grant 61171144, and Grant 61422111 and in part by the Program for Professor of Special Appointment (Eastern Scholar) within the Shanghai Institutions of Higher Learning. This paper was recommended by Associate Editor S. Erturk. (Corresponding author: Zhi Liu.)

The authors are with the School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2016.2595324

Recently, there have been many research efforts on saliency models for images, and some recent benchmarks have been reported in [21] and [22]. In contrast, the research on spatiotemporal saliency models for videos is not as abundant as that for images, but it has received increasing attention from the research community. Saliency detection in videos is different from that in static images due to the introduction of the temporal dimension, and the motion information in video is definitely important for measuring saliency. This paper focuses on saliency detection in videos, and the following will mainly review the related spatiotemporal saliency models for videos.

Due to its clear interpretation of the human visual attention mechanism and its concise computation form, the center–surround scheme has been widely implemented with different features and formulations in a number of spatiotemporal saliency models, which measure the saliency of a pixel/patch/volume as its feature difference with its surroundings. For example, the surprise model [23] exploits the difference on a set of features including luminance, color, orientation, flicker, and motion energy to generate the spatiotemporal saliency map. The feature difference between each patch/volume and its spatiotemporal surroundings can be measured using the local regression kernel based self-resemblance (SR) [24] or the earth mover's distance between weighted histograms [25], and is then assigned as the spatiotemporal saliency of the center pixel in the patch/volume. Based on the discriminant center–surround hypothesis [26], the Kullback–Leibler divergence on dynamic texture features is used to measure spatiotemporal saliency [27]. In [28], the contrast of directional coherence, which is calculated on the distribution of spatiotemporal gradients, is evaluated under a multiscale framework to measure spatiotemporal saliency. From the view of feature change in the temporal direction, salient object pixels significantly differ from most pixels in the video frames, and thus a multiscale background (MB) model [19] represented using a Gaussian pyramid is exploited for saliency measurement.

Besides the center–surround scheme, recent spatiotemporal saliency models are also formulated based on a number of theories, methods, and schemes including information theory, control theory, frequency-domain analysis method, machine learning method, sparse representation method, and the fusion scheme on spatial saliency and temporal saliency.

1051-8215 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Some concepts in information theory can be naturally adapted to measure spatiotemporal saliency in video. For example, self-information [29], minimum conditional entropy (CE) [30], and incremental coding length [31], [32] are calculated on the basis of patches/volumes and used as spatiotemporal saliency measures. From the view of modern control theory, the video sequence can be represented using the state-space model of a linear system, and then either the observability of outputs [33] or the controllability of states [34] is exploited to measure spatiotemporal saliency in order to discriminate salient object motion from background motion.

The frequency-domain analysis method was first introduced into image saliency detection [35], in which the spectral residual of the amplitude spectrum of the Fourier transform is used as the saliency measure. Then, as an extension, the spectral residual along the temporal slices in the horizontal/vertical direction is exploited to generate spatiotemporal saliency maps for video [36]. Besides, the phase spectrum of the quaternion Fourier transform (QFT), which incorporates color, orientation, and intensity in each frame and the difference between frames in parallel, is exploited to efficiently perform multiresolution spatiotemporal saliency detection in [15].

Machine learning methods such as probabilistic multitask learning [37], support vector machine [38], and support vector regression [39] have been exploited to build spatiotemporal saliency models using eye tracking data as the training set. In [4], conditional random field learning is used to integrate multiscale contrast, center–surround histogram, and spatial distribution of color and motion vector field for saliency measurement. In [40], with the aim of dominant camera motion removal (DCMR), the one-class support vector machine is used to classify the trajectories, and the saliency of each trajectory is diffused to its surrounding regions for generating the spatiotemporal saliency map.

The sparse representation method, which has been effectively utilized for saliency detection in images [41], has also been introduced into spatiotemporal saliency models. Sparse feature selection [42] and the patch-level reconstruction error with a sparse regularization term [43] are utilized to measure spatiotemporal saliency. The matrix composed of temporally aligned video frames [44] or temporal slices [45] can be decomposed into a low-rank matrix for the background and a sparse matrix for salient objects to measure spatiotemporal saliency.

The fusion scheme has become a commonly used paradigm in a number of spatiotemporal saliency models. Considering the difference in nature between the temporal domain and the spatial domain, temporal saliency and spatial saliency are measured separately and then combined using different fusion schemes to generate the spatiotemporal saliency map. Since motion in most videos plays an important role in drawing human visual attention, the motion vector field is generally used as the most important cue for temporal saliency measurement. The directional consistency of motion [46], temporal coherence of motion amplitude and orientation [47], distribution of motion amplitude and orientation [48], amplitudes of residual motion vectors after global motion compensation [49], [50], global motion parameters [51], and temporal gradient difference [52] are exploited to measure temporal saliency. In general, spatial saliency measurement is performed on the basis of each video frame. For example, the center–surround difference on spatial features [48], [51], [52], contrast sensitivity function [49], interactions between cortical-like filters [50], and entropy of histogram-of-gradients features [47] are utilized to measure spatial saliency. Besides, some models [53], [54] directly use block-level motion vectors and discrete cosine transform coefficients decompressed from the video bitstream to calculate the spatial saliency map and temporal saliency map, respectively. To integrate the temporal saliency map with the spatial saliency map, fusion schemes such as normalization, linear addition, max, multiplication, and their combinations are widely adopted in the above-mentioned models to generate the spatiotemporal saliency map.

Recently, saliency models that are built on the basis of segmented regions/superpixels, such as [7]–[9], have significantly elevated the saliency detection performance on images. Inspired by this improvement on image saliency models, a superpixel (SP) based spatiotemporal saliency model is proposed in [55], which exploits superpixel-level motion distinctiveness, global contrast, and spatial sparsity to measure temporal saliency and spatial saliency, respectively, and adaptively fuses them based on their interaction and selection to generate the spatiotemporal saliency map. Furthermore, a superpixel-level trajectory (SLT) based spatiotemporal saliency model is proposed in [56], which utilizes the long-term motion cues induced by superpixel-level trajectories to effectively enhance the coherence of temporal saliency through the video, and exploits a quality-guided fusion scheme to integrate it with spatial saliency for spatiotemporal saliency generation. In [11], the geodesic distance (GD) from each superpixel to the frame boundaries is adopted in an intra-frame graph for computing the object probability of each superpixel, and an inter-frame graph is constructed to calculate the GD from each superpixel to the estimated background regions, which are obtained by thresholding the object probabilities of superpixels, for generating superpixel-level spatiotemporal saliency maps. The joint utilization of the intra-frame graph and the inter-frame graph makes [11] an effective spatiotemporal saliency model that is distinct from previous models. Compared with other models, the above three models [11], [55], [56] can better highlight salient objects with well-defined boundaries and suppress background regions.

However, the existing spatiotemporal saliency models are still insufficient to effectively handle a variety of unconstrained videos, in which the unconstrained capture conditions make it difficult to highlight salient objects and suppress the background. There are no constraints on the motion patterns of salient objects and background regions in unconstrained videos. Specifically, salient objects may exhibit rigid motion or deformation, consistent motion or intermittent motion, and background regions may exhibit complicated motion due to mixed camera movements including pan, tilt, zoom, and camera jitter. Besides, the cluttered background and low contrast between salient objects and background in some unconstrained videos further increase the difficulty of saliency detection.


Fig. 1. Illustration of the proposed SGSP model. (a) Original video frames. (b) Motion vector fields. (c) Superpixel segmentation results. (d) Superpixel-level graph. (e) MS map. (f) FTS map. (g) BTS map. (h) ITS map. (i) IStS map. (j) FStS map.

Therefore, in order to effectively improve the saliency detection performance on unconstrained videos with complicated motion and complex scenes, this paper proposes a superpixel-level graph and spatiotemporal propagation (SGSP) based saliency model for unconstrained videos. Specifically, our main contributions are fourfold. First, inspired by [11], [55], and [56] as well as graph-based image saliency models [57]–[59], we propose a superpixel-level graph-based motion saliency (MS) measurement method, which can effectively estimate and refine the MS of each superpixel in an iterative manner. Second, we propose an effective method for temporal propagation of saliency in both forward and backward directions using efficient operations on inter-frame similarity matrices. Third, we propose a two-phase spatial propagation method to locally and globally propagate saliency via the use of intra-frame similarity matrices. With the introduction of bidirectional temporal propagation, two-phase spatial propagation, and the unified operations on inter-frame/intra-frame similarity matrices, the second and third contributions make the spatiotemporal saliency propagation over video frames more systematic and more effective for better saliency detection performance. Finally, we created a data set containing 18 unconstrained videos for saliency detection (UVSD), to introduce more challenging videos with complicated motion and complex scenes for a comprehensive evaluation of spatiotemporal saliency detection performance.

Note that although recent spatiotemporal saliency models use a superpixel-level graph in [11] and temporal propagation of saliency in [55] and [56], the related components in the proposed SGSP model are considerably different from those in the above three models. Specifically, the proposed MS measurement method realizes an iteratively updated superpixel-level graph with the introduction of a virtual background node, which is associated with the iteratively refined global motion histogram for each frame. Both the construction and the update nature of the superpixel-level graph in our method are considerably different from the superpixel-level intra-frame graph and inter-frame graph used in [11]. Our model enables an effective bidirectional temporal propagation of saliency, while in [55], saliency is propagated only from the previous frame to the current frame, and in [56], saliency propagation is performed only along each SLT in the temporal direction.

Thanks to the above-mentioned main contributions, the proposed SGSP model shows new prospects of spatiotemporal saliency modeling, and the extensive experimental results on UVSD and another video data set, SegTrackV2 [60], demonstrate that the proposed SGSP model consistently outperforms the state-of-the-art spatiotemporal saliency models including [11], [55], and [56]. The UVSD data set and the code of the proposed SGSP model are publicly available at http://www.ivp.shu.edu.cn/.

The rest of this paper is organized as follows. The proposed SGSP model is described in Section II, the experimental results and analysis are presented in Section III, and the conclusion is given in Section IV.

II. PROPOSED SPATIOTEMPORAL SALIENCY MODEL

The proposed SGSP model is illustrated in Fig. 1, which consists of four components marked using the dashed-dotted lines, and will be detailed from Section II-A to Section II-D in turn. Feature extraction is first performed on the superpixel segmentation results of video frames to obtain motion histograms and color histograms in Section II-A, and then a superpixel-level graph is built for measuring the MS of superpixels in Section II-B. The spatiotemporal propagation of saliency is actually performed in two consecutive steps, i.e., temporal propagation in Section II-C is exploited to generate temporal saliency maps, and then spatial propagation in Section II-D is exploited to generate spatiotemporal saliency maps. It should be noted that only motion features are utilized in MS measurement (Section II-B), while motion features and color features are jointly utilized in temporal propagation (Section II-C) and spatial propagation (Section II-D).

A. Feature Extraction

For each input video frame F_t, its pixel-level motion vector field MVF_{t,t−1} with respect to the previous frame F_{t−1} is calculated using the large displacement optical flow (LDOF) method in [61], which allows capturing large displacements between frames.


Each frame F_t is transformed into the Lab color space, in which the luminance channel and two chrominance channels are separated, and then the simple linear iterative clustering algorithm [62] is used to segment F_t into a set of perceptually homogeneous superpixels, sp_{t,i} (i = 1, ..., n_t), where n_t is the number of generated superpixels and should be set to preserve the different boundaries between objects and background in the superpixel segmentation results. The analysis of this parameter on saliency detection performance is presented in Section III-B. Note that the notation sp_t, which omits the subscript i representing the spatial index of the superpixel, denotes all superpixels in F_t. For the example video in Fig. 1(a), the motion vector fields are visualized in Fig. 1(b) and the superpixel segmentation results are shown in Fig. 1(c), which delineates the boundaries between adjacent superpixels using red lines.

Superpixels are the primitive processing units in the proposed SGSP model. Specifically, superpixel-level motion/color histograms are extracted as the features for saliency measurement. A superpixel-level graph is built for measuring MS in Section II-B, and temporal propagation in Section II-C and spatial propagation in Section II-D are also performed on the basis of superpixels.

Based on the superpixel segmentation result of each frame F_t and the corresponding motion vector field MVF_{t,t−1}, motion histograms are extracted at the superpixel level to represent the local motion pattern and also at the frame level to represent the global motion pattern. The orientation space of motion vectors in the complete range of [−π, π] is uniformly quantized into b_M = 8 intervals, each of which covers π/4 radians. For each superpixel sp_{t,i} (i = 1, ..., n_t), its superpixel-level motion histogram H^M_{t,i} with b_M bins is calculated using the motion vectors of all pixels in sp_{t,i}. For the kth bin of H^M_{t,i}, those motion vectors that fall into the kth orientation interval are accumulated to obtain H^M_{t,i}(k). To represent the local motion pattern, H^M_{t,i} is normalized to have Σ_{k=1}^{b_M} H^M_{t,i}(k) = 1.

Similarly, the global motion histogram H^M_{t,0}, which is exploited to represent the motion pattern of the background in F_t, is initialized using the motion vectors of all pixels in F_t and normalized to have Σ_{k=1}^{b_M} H^M_{t,0}(k) = 1. Please note that the subscript '0' in H^M_{t,0} is not actually an index. Since the subscript i in the superpixel-level motion histogram H^M_{t,i} represents the index of the superpixel and starts from 1, the subscript '0' is used here to denote the global motion histogram as H^M_{t,0}, so as to keep a consistent notation for motion histograms.

A color histogram is also extracted for each superpixel to represent its color distribution. For each frame F_t in the Lab color space, each color channel is uniformly quantized into b_C = 16 bins for a moderate color quantization of video frames. For each superpixel sp_{t,i}, the color histogram of each color channel, i.e., H^{C,L}_{t,i}, H^{C,a}_{t,i}, and H^{C,b}_{t,i}, respectively, is calculated with the same normalization operation as H^M_{t,i}. Then the three color histograms are concatenated to form the superpixel-level color histogram H^C_{t,i} = [H^{C,L}_{t,i}, H^{C,a}_{t,i}, H^{C,b}_{t,i}] with 3b_C bins for sp_{t,i}.
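To make the feature-extraction step concrete, the following Python/NumPy sketch illustrates one way to compute these histograms. It is an illustration rather than the authors' released code, and it assumes that the optical flow field `flow` (H × W × 2) and the SLIC superpixel label map `labels` (values 1, ..., n_t) have already been computed, with standard Lab channel ranges.

```python
import numpy as np

B_M, B_C = 8, 16  # orientation bins and bins per Lab channel, as in the paper

def motion_histogram(vectors):
    """Normalized orientation histogram over [-pi, pi] with B_M bins."""
    angles = np.arctan2(vectors[:, 1], vectors[:, 0])
    hist, _ = np.histogram(angles, bins=B_M, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)

def color_histogram(lab_pixels):
    """Concatenation of per-channel normalized Lab histograms (3 * B_C bins)."""
    parts = []
    for c, (lo, hi) in enumerate([(0.0, 100.0), (-128.0, 127.0), (-128.0, 127.0)]):
        h, _ = np.histogram(lab_pixels[:, c], bins=B_C, range=(lo, hi))
        parts.append(h / max(h.sum(), 1))
    return np.concatenate(parts)

def extract_features(flow, lab_frame, labels):
    """Superpixel-level motion/color histograms; row 0 of H_M holds the
    initial global motion histogram H^M_{t,0} computed from all pixels."""
    n_t = int(labels.max())
    flow_flat = flow.reshape(-1, 2)
    lab_flat = lab_frame.reshape(-1, 3)
    lbl_flat = labels.reshape(-1)
    H_M = np.zeros((n_t + 1, B_M))
    H_C = np.zeros((n_t + 1, 3 * B_C))
    H_M[0] = motion_histogram(flow_flat)          # global motion histogram
    for i in range(1, n_t + 1):
        mask = lbl_flat == i
        H_M[i] = motion_histogram(flow_flat[mask])
        H_C[i] = color_histogram(lab_flat[mask])
    return H_M, H_C
```

Keeping the global histogram in row 0 simply mirrors the paper's convention of indexing it as H^M_{t,0}.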

Fig. 2. Schematic example of superpixel-level graph.

B. Superpixel-Level Graph-Based Motion Saliency

Based on the superpixel segmentation result of each frame F_t, an undirected weighted graph G_t = (V_t, E_t) is built at the superpixel level, and a schematic example is shown in Fig. 2. The node set V_t contains regular nodes {v_{t,i}}_{i=1}^{n_t}, which correspond to all superpixels {sp_{t,i}}_{i=1}^{n_t} in F_t, and an additional virtual background node B_t, which represents the motion of background regions in F_t. As shown in Fig. 2, the edge set E_t contains three types of edges: 1) blue edge between each pair of adjacent regular nodes; 2) red edge between any two regular nodes, which correspond to two superpixels at the border of the video frame; and 3) green edge between the virtual background node B_t and each regular node, which corresponds to a superpixel at the border of the video frame.

The weight for the blue/red edge, which connects two regular nodes, v_{t,i} and v_{t,j}, is defined as the motion difference between the corresponding superpixels, sp_{t,i} and sp_{t,j}, with the following form:

ω_a(v_{t,i}, v_{t,j}) = exp[λ · χ²(H^M_{t,i}, H^M_{t,j})]    (1)

where χ²(·) is the chi-squared distance between the two motion histograms, and the coefficient λ is set to 0.1 for a moderate exponential effect.

The weight for the green edge, which connects the virtual background node B_t with a regular node v_{t,i}, is similarly defined as

ω_b(v_{t,i}, B_t) = exp[λ · χ²(H^M_{t,i}, H^M_{t,0})]    (2)

where B_t is associated with the global motion histogram H^M_{t,0}.
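As an illustration of how G_t can be represented, the sketch below builds the symmetric weight matrix of the superpixel-level graph as a sparse matrix, with node 0 standing for the virtual background node B_t and nodes 1, ..., n_t for the superpixels. The lists `adjacent_pairs` and `border_ids` are assumed to be derived from the superpixel label map, and the chi-squared distance helper uses one common definition; this is a sketch, not the released implementation.

```python
from itertools import combinations
import numpy as np
from scipy import sparse

LAMBDA = 0.1  # coefficient lambda in (1) and (2)

def chi2(h1, h2, eps=1e-10):
    """Chi-squared distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def build_graph(H_M, adjacent_pairs, border_ids):
    """Weighted superpixel-level graph G_t as a sparse matrix.
    Node 0 is the virtual background node B_t (green edges, Eq. (2));
    nodes 1..n_t are regular nodes (blue/red edges, Eq. (1))."""
    n = H_M.shape[0]                      # n_t + 1, row 0 holds H^M_{t,0}
    W = sparse.lil_matrix((n, n))
    # blue edges: pairs of spatially adjacent superpixels
    edges = set(map(tuple, adjacent_pairs))
    # red edges: every pair of superpixels on the frame border
    edges |= set(combinations(sorted(border_ids), 2))
    for i, j in edges:
        W[i, j] = W[j, i] = np.exp(LAMBDA * chi2(H_M[i], H_M[j]))
    # green edges: border superpixels connected to the background node
    for i in border_ids:
        W[0, i] = W[i, 0] = np.exp(LAMBDA * chi2(H_M[i], H_M[0]))
    return W.tocsr()
```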

The MS of each superpixel sp_{t,i} can be measured as the difference between its motion histogram H^M_{t,i} and the global motion histogram H^M_{t,0}. However, H^M_{t,0} needs to be refined by excluding those superpixels possibly belonging to salient objects for a more accurate estimation. For each node v_{t,i} on the superpixel-level graph G_t, the sum of edge weights along the shortest path from v_{t,i} to B_t can be naturally used to measure the difference between H^M_{t,i} and H^M_{t,0}. Therefore, based on the above considerations, an iterative MS measurement method that utilizes the shortest path algorithm on the graph is proposed as follows.

1) Calculate the superpixel-level and global motion histograms {H^M_{t,i}}_{i=0}^{n_t} and initialize G_t using (1) and (2).

2) Find the shortest path from each regular node v_{t,i} to the virtual background node B_t using Dijkstra's algorithm; the summation of edge weights along the shortest path is defined as the MS of sp_{t,i} with the following form:

M_t(i) = min_{u_1=v_{t,i}, u_2, ..., u_{m−1}, u_m=B_t} [ Σ_{k=1}^{m−2} ω_a(u_k, u_{k+1}) + ω_b(u_{m−1}, u_m) ]    (3)

where M_t(i) denotes the MS of sp_{t,i} compared with the global motion associated with B_t. A higher value of M_t(i) indicates that the motion of sp_{t,i} is more distinctive from the global motion.

3) In order to more accurately estimate the global motion histogram H^M_{t,0}, the superpixel-level MS map M_t is thresholded using Otsu's method [63] to estimate the background regions, which have relatively lower MS. Then H^M_{t,0} is updated using the motion vectors in the estimated background regions.

4) Calculate the L1-norm distance between the two versions of H^M_{t,0} (after update and before update). If this distance is smaller than a predetermined value, which is set to a rather small value, 0.01, for a stable estimation of H^M_{t,0}, the whole iteration process is terminated. Otherwise, based on the current H^M_{t,0}, the graph G_t is updated by recalculating the weights of the green edges using (2), and then go to step 2) for the next round of iteration.

Fig. 3. Examples of the iterative MS measurement process. From left to right: original video frames, MS maps after the first, second, third, and fourth iterations, respectively.

Some examples of the iterative superpixel-level MS measurement process are shown in Fig. 3, which contains video frames with various motion activities. It can be seen from Fig. 3 that the visual quality of the MS maps improves with the iteration process, i.e., background regions are more effectively suppressed and more accurate object regions are highlighted as the number of iterations increases. Besides, Fig. 3 also shows that the whole iteration process for each video frame terminates adaptively once the MS map becomes visually stable.
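A compact way to realize the four steps above is to pair Dijkstra's algorithm on the sparse graph with Otsu thresholding of the resulting MS values. The sketch below illustrates this under the same assumptions as the earlier snippets: it reuses `build_graph` and `motion_histogram` from above and uses SciPy's Dijkstra routine and scikit-image's Otsu threshold as stand-ins for the paper's own implementation.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from skimage.filters import threshold_otsu

def iterative_motion_saliency(flow, labels, H_M, adjacent_pairs, border_ids,
                              tol=0.01, max_iter=10):
    """Iterative MS measurement of Section II-B, Eqs. (1)-(3)."""
    flow_flat = flow.reshape(-1, 2)
    lbl_flat = labels.reshape(-1)
    H_M = H_M.copy()
    for _ in range(max_iter):
        W = build_graph(H_M, adjacent_pairs, border_ids)
        # Eq. (3): shortest-path cost from every regular node to B_t (node 0)
        dist = dijkstra(W, directed=False, indices=0)
        M = dist[1:]                                   # MS of sp_{t,1..n_t}
        # step 3: background = superpixels with relatively lower MS (Otsu)
        bg_ids = np.flatnonzero(M < threshold_otsu(M)) + 1
        bg_mask = np.isin(lbl_flat, bg_ids)
        H_global_new = motion_histogram(flow_flat[bg_mask])
        # step 4: stop once the global motion histogram stabilizes (L1 < tol)
        if np.abs(H_global_new - H_M[0]).sum() < tol:
            break
        H_M[0] = H_global_new    # green-edge weights are recomputed next round
    return M
```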

C. Temporal Propagation

It should be noted that the MS maps utilize only the motion information between adjacent frames. In order to enhance the temporal coherence of saliency measurement through consecutive frames, the MS measures of superpixels are temporally propagated over video frames to obtain temporal saliency measures for superpixels, which are more temporally coherent and more reliable for saliency measurement on video. The following describes the temporal propagation of saliency in detail.

Fig. 4. Pictorial illustration of the projection operation on a superpixel.

First, the similarity between any pair of superpixels in the adjacent frames, F_t and F_{t+1}, is evaluated based on their appearance similarity and overlap degree. All similarity measures between each superpixel in F_t and each superpixel in F_{t+1} form an inter-frame similarity matrix A_{t,t+1} with a dimension of n_t × n_{t+1}. Each element A_{t,t+1}(i, j) that represents the similarity between sp_{t,i} and sp_{t+1,j} is defined as

A_{t,t+1}(i, j) = exp[−(1/γ) · χ²(H^C_{t,i}, H^C_{t+1,j})] · |Pr_{t+1}(sp_{t,i}) ∩ sp_{t+1,j}| / |sp_{t+1,j}|.    (4)

The former term in (4) measures the appearance similarity between sp_{t+1,j} and sp_{t,i} using the chi-squared distance between their color histograms. The coefficient γ is set to the mean of all chi-squared distances calculated for superpixel pairs, to serve as the normalization factor for calculating the appearance similarity. The overlap degree between sp_{t+1,j} and the projected region of sp_{t,i} in the frame F_{t+1} is measured by the latter term in (4), where |·| denotes the number of pixels in the superpixel or the overlapped region. As shown in Fig. 4, the projection operation on the superpixel sp_{t,i} projects each pixel in sp_{t,i} with a displacement of the pixel's motion vector in MVF_{t,t+1} into the adjacent frame F_{t+1} to obtain the projected region Pr_{t+1}(sp_{t,i}).
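Equation (4) can be evaluated with a handful of array operations. The following illustrative sketch (again our own, reusing the `chi2` helper above) takes the label maps of F_t and F_{t+1}, the forward flow MVF_{t,t+1}, and the superpixel color histograms, and returns A_{t,t+1}.

```python
import numpy as np

def interframe_similarity(labels_t, labels_t1, flow_fw, H_C_t, H_C_t1):
    """Inter-frame similarity matrix A_{t,t+1} of Eq. (4), shape n_t x n_{t+1}."""
    n_t, n_t1 = int(labels_t.max()), int(labels_t1.max())
    # appearance term: chi-squared distances normalized by their mean (gamma)
    D = np.array([[chi2(H_C_t[i], H_C_t1[j]) for j in range(1, n_t1 + 1)]
                  for i in range(1, n_t + 1)])
    appearance = np.exp(-D / D.mean())
    # overlap term: project every pixel of sp_{t,i} along its flow vector
    h, w = labels_t.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xp = np.clip(np.round(xs + flow_fw[..., 0]).astype(int), 0, w - 1)
    yp = np.clip(np.round(ys + flow_fw[..., 1]).astype(int), 0, h - 1)
    proj = labels_t1[yp, xp]           # label in F_{t+1} hit by each projected pixel
    overlap = np.zeros((n_t, n_t1))
    np.add.at(overlap, (labels_t.ravel() - 1, proj.ravel() - 1), 1.0)
    sizes_t1 = np.bincount(labels_t1.ravel(), minlength=n_t1 + 1)[1:]
    return appearance * (overlap / sizes_t1)
```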

Using the similarity matrices between adjacent frames such as A_{t,t+1} or A_{t,t−1}, the similarity matrix between any two frames, F_t and F_r, can be calculated based on matrix multiplication operations in the forward or backward direction as

A_{t,r} = A_{t,t+1} · A_{t+1,t+2} · ... · A_{r−1,r},  r > t
A_{t,r} = A_{t,t−1} · A_{t−1,t−2} · ... · A_{r+1,r},  r < t.    (5)

With the similarity matrix A_{t,r} for any pair of frames, the temporal saliency of each superpixel sp_{t,i} is measured through the temporal propagation of MS measures in its preceding frames as well as its subsequent frames. Specifically, the temporal propagation in the forward direction is performed to obtain the forward temporal saliency (FTS) measure of sp_{t,i} as

T^F_t(i) = M_t(i) + [ Σ_{r=t+1}^{t+τ} Σ_{j=1}^{n_r} A_{t,r}(i, j) · M_r(j) ] / [ Σ_{r=t+1}^{t+τ} Σ_{j=1}^{n_r} A_{t,r}(i, j) ]    (6)

where the number of frames involved in the propagation, τ, is set to five, a moderate value for a coherent saliency estimation. Similarly, the temporal propagation in the backward direction is performed to obtain the backward temporal saliency (BTS) measure of sp_{t,i} as

T^B_t(i) = M_t(i) + [ Σ_{r=t−τ}^{t−1} Σ_{j=1}^{n_r} A_{t,r}(i, j) · M_r(j) ] / [ Σ_{r=t−τ}^{t−1} Σ_{j=1}^{n_r} A_{t,r}(i, j) ].    (7)

Fig. 5. Illustration of temporal propagation in both forward and backward directions.

The temporal propagation in the forward and backward directions is illustrated in Fig. 5. The temporal propagation in both directions actually exploits the MS measures of those superpixels in the preceding and subsequent frames that have a higher similarity with the current superpixel sp_{t,i}, to obtain a more reliable and temporally coherent estimation of saliency for sp_{t,i}. Using a combination of forward and backward propagations, the integrated temporal saliency (ITS) measure of sp_{t,i} is defined as

T_t(i) = T^F_t(i) + T^B_t(i).    (8)
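Given the per-frame MS vectors and the adjacent-frame similarity matrices, the propagation of Eqs. (5)–(8) reduces to repeated matrix products. A minimal illustrative sketch is given below, assuming `A_fw[t]` stores A_{t,t+1} and `A_bw[t]` stores A_{t,t−1}.

```python
import numpy as np

TAU = 5  # number of frames involved in each propagation direction

def propagate_temporal(M_list, A_fw, A_bw):
    """Forward, backward, and integrated temporal saliency, Eqs. (5)-(8).
    M_list[t]: MS vector of frame t; A_fw[t] = A_{t,t+1}; A_bw[t] = A_{t,t-1}."""
    n_frames = len(M_list)
    T = []
    for t in range(n_frames):
        fw_num = np.zeros_like(M_list[t])
        fw_den = np.zeros_like(M_list[t])
        bw_num = np.zeros_like(M_list[t])
        bw_den = np.zeros_like(M_list[t])
        A = None
        for r in range(t + 1, min(t + TAU, n_frames - 1) + 1):   # forward chain
            A = A_fw[t] if A is None else A @ A_fw[r - 1]        # A_{t,r}, Eq. (5)
            fw_num += A @ M_list[r]
            fw_den += A.sum(axis=1)
        A = None
        for r in range(t - 1, max(t - TAU, 0) - 1, -1):          # backward chain
            A = A_bw[t] if A is None else A @ A_bw[r + 1]        # A_{t,r}, Eq. (5)
            bw_num += A @ M_list[r]
            bw_den += A.sum(axis=1)
        T_F = M_list[t] + fw_num / np.maximum(fw_den, 1e-10)     # FTS, Eq. (6)
        T_B = M_list[t] + bw_num / np.maximum(bw_den, 1e-10)     # BTS, Eq. (7)
        T.append(T_F + T_B)                                      # ITS, Eq. (8)
    return T
```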

As shown in Fig. 1, both FTS and BTS maps in Fig. 1(f) and (g) can highlight some salient object regions more effectively than the MS map in Fig. 1(e), and their combination, i.e., the ITS map in Fig. 1(h), obviously highlights the salient object more completely than the previous three saliency maps in Fig. 1(e)–(g). Besides, the performance analysis of different saliency maps generated using the proposed model will be presented in Section III-B with Fig. 7, which demonstrates that the saliency detection performance is progressively improved from MS maps to FTS/BTS maps and to ITS maps. Therefore, the proposed temporal propagation of saliency in this section effectively enhances the reliability and coherence of saliency measurement, and substantially improves the saliency detection performance.

D. Spatial Propagation

Based on the ITS map, whose temporal coherence has been enhanced, we also need to enhance the spatial coherence of the saliency estimation in each frame for even better saliency detection performance. Therefore, the spatial propagation of saliency, which is performed on the basis of each frame, is proposed for this purpose.

For any pair of superpixels in each frame F_t, their similarity is evaluated based on the difference between their color histograms. All similarity measures constitute an intra-frame similarity matrix A_{t,t} with a dimension of n_t × n_t. Following (4), each element A_{t,t}(i, j) that represents the similarity between sp_{t,i} and sp_{t,j} is defined as

A_{t,t}(i, j) = exp[−(1/γ) · χ²(H^C_{t,i}, H^C_{t,j})].    (9)

Based on the ITS map T_t, the local spatial propagation of saliency for each superpixel sp_{t,i} is performed to obtain its initial spatiotemporal saliency (IStS) measure as

ST^L_t(i) = T_t(i) + [ Σ_{sp_{t,j}∈N(sp_{t,i})} A_{t,t}(i, j) · T_t(j) ] / [ Σ_{sp_{t,j}∈N(sp_{t,i})} A_{t,t}(i, j) ]    (10)

where the local neighborhood N(sp_{t,i}) contains all superpixels spatially adjacent to sp_{t,i}.

Then the global spatial propagation of saliency exploits those superpixels, which may lie outside the local neighborhood of sp_{t,i} but have a higher similarity with sp_{t,i} over the whole frame F_t, to propagate their saliency measures to sp_{t,i} and obtain the final spatiotemporal saliency (FStS) measure of sp_{t,i} as

ST^G_t(i) = ST^L_t(i) + [ Σ_{sp_{t,j}∈K(sp_{t,i})} A_{t,t}(i, j) · D_t(i, j) · ST^L_t(j) ] / [ Σ_{sp_{t,j}∈K(sp_{t,i})} A_{t,t}(i, j) · D_t(i, j) ]    (11)

where K(sp_{t,i}) is the global neighborhood of sp_{t,i} and D_t(i, j) is the spatial distance factor between sp_{t,i} and any superpixel sp_{t,j} in K(sp_{t,i}). Specifically, K(sp_{t,i}) contains the superpixels that are most similar to sp_{t,i}, found in the frame F_t using a KD tree. D_t(i, j) is defined as

D_t(i, j) = exp[−d_t(i, j)]    (12)

where d_t(i, j) is the Euclidean distance between the centroids of sp_{t,i} and sp_{t,j}, normalized by the diagonal length of the video frame to the range [0, 1].

Using the above spatial propagation, for each superpixel sp_{t,i}, the saliency measures of superpixels that have a similar appearance to sp_{t,i} and a short distance to sp_{t,i} contribute more to the spatiotemporal saliency of sp_{t,i}. As shown in Fig. 1(i) and (j), the local and global spatial propagation of saliency can enhance the spatial coherence of saliency and highlight the complete salient object more accurately than the ITS map in Fig. 1(h). Besides, the performance analysis presented in Section III-B with Fig. 7 will demonstrate the effectiveness of spatial propagation, which further improves the saliency detection performance compared with ITS maps.
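The two-phase spatial propagation of Eqs. (9)–(12) admits an equally compact form. The sketch below is illustrative: `adjacency` is a boolean n_t × n_t matrix of spatially adjacent superpixels, `centroids` holds superpixel centers normalized by the frame diagonal, the `chi2` helper is reused from the earlier sketches, the global neighborhood size K is an assumed value, and the KD tree is queried in color-histogram space, which is one plausible reading of the paper's use of a KD tree.

```python
import numpy as np
from scipy.spatial import cKDTree

K_GLOBAL = 10  # size of the global neighborhood K(sp_{t,i}); value assumed here

def spatial_propagation(T, H_C, adjacency, centroids):
    """Local then global spatial propagation, Eqs. (9)-(12).
    T: ITS vector (n_t,); H_C: color histograms (n_t, 3*b_C);
    adjacency: boolean (n_t, n_t); centroids: (n_t, 2), normalized to [0, 1]."""
    n_t = T.shape[0]
    # intra-frame similarity matrix, Eq. (9)
    D = np.array([[chi2(H_C[i], H_C[j]) for j in range(n_t)] for i in range(n_t)])
    A = np.exp(-D / D.mean())
    # local propagation over spatially adjacent superpixels, Eq. (10)
    A_local = A * adjacency
    ST_L = T + (A_local @ T) / np.maximum(A_local.sum(axis=1), 1e-10)
    # global propagation over the K most similar superpixels, Eqs. (11)-(12)
    tree = cKDTree(H_C)
    _, knn = tree.query(H_C, k=K_GLOBAL + 1)   # first neighbor is the superpixel itself
    ST_G = np.empty_like(ST_L)
    for i in range(n_t):
        j = knn[i, 1:]
        D_t = np.exp(-np.linalg.norm(centroids[j] - centroids[i], axis=1))
        w = A[i, j] * D_t
        ST_G[i] = ST_L[i] + (w @ ST_L[j]) / max(w.sum(), 1e-10)
    return ST_G
```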


Fig. 6. (Better viewed in color) PR curves of FStS maps generated using our SGSP model with a different number of superpixels on (a) SegTrackV2 and (b) UVSD.

III. EXPERIMENTAL RESULTS

We performed a comprehensive performance evaluation of the proposed SGSP model based on extensive experimental results on two video data sets. The video data sets and experimental settings are introduced in Section III-A, and the performance analysis of our SGSP model is shown in Section III-B. Then objective evaluation and subjective evaluation, including comparisons with eight state-of-the-art spatiotemporal saliency models, are presented in Sections III-C and III-D, respectively, and some failure cases are analyzed in Section III-E. Finally, the computation issue of spatiotemporal saliency models is discussed in Section III-F, and the effect of the motion vector field on performance and computational efficiency is discussed in Section III-G.

A. Data Sets and Experimental Settings

We performed extensive experiments on two video data sets with manually annotated binary ground truths for salient objects. The first data set, SegTrackV2 [60], contains 14 videos with various scenes and diverse motion activities. Note that for videos containing multiple salient objects, we combined the original ground truth maps for individual objects into a unified ground truth map. Nonetheless, for the video Penguins (see the bottom example in Fig. 12), since only several penguins in the center are annotated as salient objects in the original ground truths, we annotated all penguins as salient objects to generate more suitable ground truths for the purpose of evaluating saliency detection performance. The second data set, UVSD, was created by us to introduce more unconstrained videos with complicated motion and complex scenes. The UVSD data set contains a total of 18 videos, and we accurately segmented salient objects in each video frame using the Adobe Photoshop CC software to generate the binary ground truths.

We compared our SGSP model with eight state-of-the-art spatiotemporal saliency models including SR [24], CE [30], QFT [15], MB [19], DCMR [40], SP [55], SLT [56], and GD [11]. For all the eight models, the source codes with default parameter settings or the executables provided by the authors were used for all videos in the two data sets. For a fair comparison, all saliency maps generated using different models are normalized into the same range of [0, 255] with the full resolution of the original videos.
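For reference, the normalization used for this comparison amounts to a per-map min–max rescaling to [0, 255] at the original frame resolution; a minimal sketch (using scikit-image's resize as one possible upsampling choice, not necessarily the one used in the experiments) is shown below.

```python
import numpy as np
from skimage.transform import resize

def normalize_saliency(sal, frame_shape):
    """Min-max normalize a saliency map to [0, 255] at full frame resolution."""
    sal = resize(sal.astype(np.float64), frame_shape, order=1)  # bilinear upsampling
    lo, hi = sal.min(), sal.max()
    sal = (sal - lo) / (hi - lo) if hi > lo else np.zeros_like(sal)
    return np.round(255 * sal).astype(np.uint8)
```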

B. Performance Analysis

Our SGSP model first generates the MS map in Section II-B, then generates the FTS map, BTS map, and ITS map in Section II-C, and finally generates the IStS map and FStS map in Section II-D. Since our SGSP model is built on the basis of superpixels, the number of the generated superpixels for each frame, i.e., n_t, is important to the saliency detection performance. Therefore, we first analyzed the effect of n_t with a series of values from 50 to 500, and we adopted the commonly used precision-recall (PR) curve, which plots the precision measure against the recall measure, to characterize the saliency detection performance.

Specifically, thresholding operations using a series of fixed integers from 0 to 255 are performed on each saliency map, and thus a total of 256 binary salient object masks are obtained for each saliency map and a set of precision and recall values is calculated by comparing them with the corresponding binary ground truth. Then, for each class of saliency maps (for example, all the FStS maps generated for all video frames in a data set using our SGSP model with a specific value of n_t), the precision and recall values of all saliency maps are averaged at each threshold, and the 256 average precision values are plotted against the 256 average recall values to generate the PR curve.
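The PR-curve protocol just described can be summarized in a few lines; the sketch below is an illustrative implementation that sweeps the 256 integer thresholds over each saliency map and averages per-threshold precision and recall over all frames of a data set.

```python
import numpy as np

def pr_curve(saliency_maps, ground_truths):
    """Average precision/recall over all maps at thresholds 0..255."""
    n_maps = len(saliency_maps)
    precisions = np.zeros((n_maps, 256))
    recalls = np.zeros((n_maps, 256))
    for n, (sal, gt) in enumerate(zip(saliency_maps, ground_truths)):
        gt = gt.astype(bool)
        for th in range(256):
            mask = sal >= th                       # binary salient object mask
            tp = np.logical_and(mask, gt).sum()
            precisions[n, th] = tp / max(mask.sum(), 1)
            recalls[n, th] = tp / max(gt.sum(), 1)
    # one (precision, recall) point per threshold, averaged over all maps
    return precisions.mean(axis=0), recalls.mean(axis=0)
```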

The PR curves of our FStS maps generated with different values of n_t are shown in Fig. 6. We can observe from the PR curves on the two data sets in Fig. 6 that the saliency detection performance is similar when n_t is not less than 200, while the performance obviously degrades as n_t decreases below 200. Therefore, we can conclude that the performance of our SGSP model is rather stable under moderate segmentation and oversegmentation of superpixels, while the performance degrades under undersegmentation of superpixels. Based on Fig. 6, n_t is set to 400, which achieves the overall better performance on both data sets, and is fixed for all the following experiments including the comparisons with all the other spatiotemporal saliency models.

Fig. 7. (Better viewed in color) PR curves of different saliency maps generated using our SGSP model, i.e., MS, FTS, BTS, ITS, IStS, and FStS, respectively, on (a) SegTrackV2 and (b) UVSD.

Fig. 8. (Better viewed in color) PR curves of different saliency models on (a) SegTrackV2 and (b) UVSD.

To objectively evaluate the contribution of the different parts of our SGSP model to the saliency detection performance, the PR curves for the different classes of saliency maps, i.e., MS, FTS, BTS, ITS, IStS, and FStS, are shown in Fig. 7. From the PR curves in Fig. 7, we can observe the better saliency detection performance of ITS over FTS/BTS and of FTS/BTS over MS on both data sets. This clearly demonstrates the effectiveness of the proposed temporal propagation of saliency on the reasonable estimate of MS. Furthermore, we can also observe from Fig. 7 the better saliency detection performance of FStS over IStS and of IStS over ITS on both data sets. This clearly shows that the proposed spatial propagation of saliency can further improve the saliency detection performance. In summary, the PR curves in Fig. 7 objectively demonstrate the contribution of each part of our SGSP model to the saliency detection performance.

C. Objective Evaluation

1) Performance Comparison With Other Models: In order to objectively evaluate the saliency detection performance of different models, the PR curves of all the other eight models and our SGSP model on the two video data sets are plotted in Fig. 8 for a comparison. It can be seen from Fig. 8 that our SGSP model consistently outperforms all the other eight models on both data sets. As shown in Fig. 8(b), the performance gain of our SGSP model over all the other eight models is more significant on the UVSD data set. This demonstrates that our SGSP model effectively improves the saliency detection performance on more unconstrained videos with complicated motion and complex scenes. Besides, the PR curves of the four models including SP, SLT, GD, and SGSP are substantially higher than those of the other five models. The common characteristic of SP, SLT, GD, and SGSP is that they all use superpixel segmentation and the optical flow estimation method LDOF, which serve as the base for the relatively higher performance. In this sense, the better performance of our SGSP model over SP, SLT, and GD is persuasive.

Fig. 9. (Better viewed in color) Comparison with PR curves of saliency maps generated using the belief propagation algorithm and the proposed temporal propagation method on (a) SegTrackV2 and (b) UVSD.

2) Comparison With the Global Optimization Method on Motion Saliency Measurement: On the basis of the superpixel-level graph G_t initialized in Section II-B, some global optimization methods such as belief propagation [64], [65] can be used to replace the proposed iterative MS measurement method for estimating the superpixel-level MS. Inspired by [66], which uses the belief propagation algorithm to optimize the saliency measures of superpixels, the optimal MS M_t(i) for each superpixel sp_{t,i} is obtained by minimizing the energy function

Σ_i E_d(M_t(i)) + η Σ_i Σ_{sp_{t,j}∈N(sp_{t,i})} E_s(M_t(i), M_t(j))    (13)

where the data term is defined as

E_d(M_t(i)) = ‖M_t(i) − M^O_t(i)‖²    (14)

the smoothness term is defined as

E_s(M_t(i), M_t(j)) = ‖M_t(i) − M_t(j)‖² / ω_a(v_{t,i}, v_{t,j})    (15)

and η is the weight for balancing the two terms. Using a similar formulation as (1) and (2), the observed MS for each superpixel sp_{t,i} is defined as

M^O_t(i) = exp[λ · χ²(H^M_{t,i}, H^M_{t,0})].    (16)

The belief propagation algorithm is used to minimize the above energy function and to obtain the optimal MS for each superpixel, and the corresponding MS maps are abbreviated as BP_MS maps. The PR curves of BP_MS maps with the weight η equal to 0.1, which achieves the overall better saliency detection performance on both data sets, are shown in Fig. 9 for a comparison with our MS maps. It can be seen from Fig. 9 that the PR curves of our MS maps are consistently higher than those of BP_MS maps on both data sets. Therefore, compared with the above belief propagation based method, the proposed iterative MS measurement method achieves better saliency detection performance. The advantage of the proposed method is mainly due to the reasonable use of the shortest path algorithm for calculating MS and the update of the superpixel-level graph with the iteratively refined global motion histogram.
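For completeness, the baseline of Eqs. (13)–(16) can be reproduced approximately without a full belief-propagation implementation: since the energy is a convex quadratic in the continuous MS values, simple Gauss-Seidel coordinate updates reach the same minimum that belief propagation targets. The sketch below is our substitute solver under that observation, not the comparison code used in the paper; it reuses the `chi2` helper and the layout of `H_M` from the earlier sketches.

```python
import numpy as np

ETA = 0.1      # weight eta balancing data and smoothness terms, as in the paper
LAMBDA = 0.1   # lambda of Eqs. (1), (2), and (16)

def bp_ms_baseline(H_M, adjacency, n_iter=50):
    """Minimize the energy of Eq. (13) by Gauss-Seidel coordinate updates.
    H_M[0]: global motion histogram; H_M[1:]: superpixel motion histograms;
    adjacency: boolean (n_t, n_t) matrix of spatially adjacent superpixels."""
    n_t = H_M.shape[0] - 1
    # observed MS, Eq. (16)
    M_obs = np.array([np.exp(LAMBDA * chi2(H_M[i], H_M[0])) for i in range(1, n_t + 1)])
    # reciprocal pairwise weights 1/omega_a of Eq. (15): smoothing is
    # weaker across superpixels with very different motion histograms
    inv_w = np.zeros((n_t, n_t))
    for i in range(n_t):
        for j in np.flatnonzero(adjacency[i]):
            inv_w[i, j] = np.exp(-LAMBDA * chi2(H_M[i + 1], H_M[j + 1]))
    M = M_obs.copy()
    for _ in range(n_iter):
        for i in range(n_t):
            nbrs = np.flatnonzero(adjacency[i])
            s = inv_w[i, nbrs]
            # exact minimizer of Eq. (13) w.r.t. M(i) with all other values fixed
            # (each unordered pair appears twice in the double sum, hence 2*ETA)
            M[i] = (M_obs[i] + 2 * ETA * s @ M[nbrs]) / (1.0 + 2 * ETA * s.sum())
    return M
```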

3) Evaluation of the Proposed Temporal Propagation Method: Our SGSP model enables an effective bidirectional temporal propagation of saliency using the method proposed in Section II-C, which exploits efficient operations on inter-frame similarity matrices to propagate saliency in both the forward and backward directions. To further evaluate the effectiveness of bidirectional temporal propagation, we used another class of MS maps, for example, the above BP_MS maps, as the input, to correspondingly generate the FTS (BP_FTS) maps, BTS (BP_BTS) maps, and ITS (BP_ITS) maps in turn using the proposed temporal propagation method in Section II-C. The PR curves of these classes of saliency maps are also shown in Fig. 9. With different classes of MS maps, i.e., either MS maps or BP_MS maps, as the input, we can observe from the PR curves in Fig. 9 the better saliency detection performance of ITS (BP_ITS) maps over FTS/BTS (BP_FTS/BP_BTS) maps and of FTS/BTS (BP_FTS/BP_BTS) maps over MS (BP_MS) maps on both data sets. This clearly demonstrates the robustness of the proposed temporal propagation method to different input saliency maps and the advantage of bidirectional temporal propagation over unidirectional temporal propagation. Besides, by comparing each pair of PR curves for MS and BP_MS, FTS and BP_FTS, BTS and BP_BTS, and ITS and BP_ITS, respectively, it can be seen from Fig. 9 that with the better MS maps, the correspondingly generated forward/backward/integrated saliency maps achieve better saliency detection performance.

D. Subjective Evaluation

Spatiotemporal saliency maps generated using our SGSP model and all the other eight spatiotemporal models for some videos in SegTrackV2 and UVSD are shown in Figs. 10 and 11, respectively, for a subjective comparison. These videos exhibit various motions of salient objects such as fast motion in a cluttered scene (the top example in Fig. 10), fast change of motion (the middle example in Fig. 10), complicated nonrigid motion (the bottom example in Fig. 10), multiple object motions (the top and middle examples in Fig. 11), and fast motion of a tiny object (the bottom example in Fig. 11). Besides, these videos also show various global motions induced by mixed camera operations of pan, tilt, and zoom (all videos in Figs. 10 and 11) and camera jitter (the top example in Fig. 11).

Fig. 10. Examples of spatiotemporal saliency maps for some videos in SegTrackV2 (shown with an interval of six, ten, and five frames, respectively, from top to bottom). (a) Video frames. (b) Binary ground truths. Spatiotemporal saliency maps generated using (c) SR, (d) CE, (e) QFT, (f) MB, (g) DCMR, (h) SP, (i) SLT, (j) GD, and (k) our SGSP model.

Compared with all the other eight models, we can observe from Figs. 10 and 11 that the saliency maps generated using our SGSP model can better highlight the complete salient objects and suppress background regions more effectively. SR, CE, and QFT only highlight regions around salient object boundaries or a part of the salient object regions, but fail to highlight the complete salient object regions and/or falsely highlight some background regions with high contrast. The reason why the above three models generate low-quality saliency maps is that they exploit center–surround differences of features on the basis of patches/volumes, which in essence cannot highlight salient object regions completely and cannot adapt well to complicated motions in videos. MB employs background modeling for detecting salient regions, but cannot effectively handle videos with global motion such as the videos in Figs. 10 and 11. DCMR exploits a key point tracking method, which is not robust enough to form reliable point-level trajectories on salient objects with fast motion and deformation, and the complicated motion further decreases the reliability of the point-level trajectories. Therefore, DCMR fails to highlight the complete salient objects in these example videos or even totally misses the salient objects in some video frames.

SP, SLT, and GD are built on the basis of superpixels, and generally can highlight salient objects and suppress the background better than the former five models. However, SP, SLT, and GD cannot effectively handle videos with camera jitter and fast object motion (see the top and bottom examples in Fig. 11). In contrast, our SGSP model can better handle such unconstrained videos and improve the quality of the saliency maps, because the iterative refinement of the global motion histogram and the update of the superpixel-level graph in Section II-B enhance the reliability of MS measurement, whereas both SP and SLT exploit only the frame-level motion histogram as the representation of global motion without any refinement, and GD exploits the superpixel-level graphs for each frame without an update scheme. Besides, the use of temporal propagation and spatial propagation in Sections II-C and II-D, respectively, also effectively enhances the spatiotemporal coherence and improves the quality of our saliency maps.

Fig. 11. Examples of spatiotemporal saliency maps for some videos in UVSD (shown with an interval of 10, 15, and 40 frames, respectively, from top to bottom). (a) Video frames. (b) Binary ground truths. Spatiotemporal saliency maps generated using (c) SR, (d) CE, (e) QFT, (f) MB, (g) DCMR, (h) SP, (i) SLT, (j) GD, and (k) our SGSP model.

E. Failure Cases and Analysis

As shown in Sections III-C and III-D, our SGSP model outperforms the state-of-the-art spatiotemporal saliency models in both objective and subjective evaluations. However, it is still difficult for our SGSP model to effectively handle some challenging videos such as the examples shown in Fig. 12. In the top example, besides the salient object (the singer) with body motion, the colorful screen content is changing and the audience is clapping. Therefore, the regions covering the screen and the audience also show distinctive motions in contrast with the remaining large background regions, and with the shifting light, such regions are falsely highlighted in our spatiotemporal saliency maps. In the middle example, the salient object (the cyclist) is captured by a camera mounted on the rear bicycle (see the fourth row), so the camera, with severe jitter, tries to capture an inward motion of the salient object. Although the salient object is highlighted in some frames, it is difficult for our SGSP model to effectively discriminate such an unstable inward motion of the object from the cluttered motion of the background regions. In the bottom example, the area of the salient objects (a flock of slowly moving penguins) is rather large, so the global motion histogram is insufficient to effectively represent the motion of the background regions, which occupy only a small area of the scene. Therefore, the complete salient object regions cannot be consistently highlighted through the video using our SGSP model.

Fig. 12. Failure examples of spatiotemporal saliency maps for some challenging videos (shown with an interval of 30, 15, and 5 frames, respectively, from top to bottom). (a) Video frames. (b) Binary ground truths. Spatiotemporal saliency maps generated using (c) SR, (d) CE, (e) QFT, (f) MB, (g) DCMR, (h) SP, (i) SLT, (j) GD, and (k) our SGSP model.


TABLE I

COMPARISON OF AVERAGE PROCESSING TIME PER FRAME TAKEN BY DIFFERENT SPATIOTEMPORAL SALIENCY MODELS

TABLE II

AVERAGE PROCESSING TIME PER FRAME AND PERCENTAGE TAKEN BY EACH COMPONENT OF OUR SGSP MODEL

As shown in Fig. 12, none of the other eight saliency models can handle such challenging videos either. Nonetheless, due to the effectiveness of the superpixel-level graph-based MS measurement and the bidirectional temporal propagation, our SGSP model still handles complicated motions and highlights salient objects better than the other saliency models.

F. Computation Cost

We performed all experiments on a PC with an Intel Core i7-2600 3.4-GHz CPU and 8-GB RAM, and compared the computation cost of all saliency models. Table I reports for each model the average processing time per frame for videos with a resolution of 320 × 240. The average processing time per frame taken by the MATLAB implementation of our SGSP model is 8.766 s. It can be seen from Table I that the four models SGSP, GD, SLT, and SP have a higher computation cost than the other five models, since these four models use the time-consuming LDOF method for optical flow estimation to obtain motion vector fields. The average processing time taken by each component of our SGSP model is shown in Table II; the optical flow estimation process occupies 91.6% of the total processing time, while the saliency measurement process of Sections II-B–II-D (the latter three components in Table II) takes only 4.6% of the total processing time.

Therefore, to make our SGSP model more practical for applications with runtime requirements, the computational efficiency of the LDOF method, which is the runtime bottleneck, should be improved with the highest priority. Fortunately, as reported in [67], with the GPU-accelerated implementation of the LDOF method, the processing time drops from 143 s for the serial implementation on a CPU to 1.84 s for the parallel implementation on a GPU, i.e., a speedup of 78×, for the resolution of 640 × 480 on an Intel Core 2 Quad Q9550 processor running at 2.83 GHz in conjunction with an NVIDIA GTX 480 GPU. Therefore, with the GPU-accelerated implementation of the LDOF method and an optimized C/C++ implementation of the other components, we believe that the computation of our SGSP model can be substantially accelerated.

G. Effect of Motion Vector Field

The quality of the motion vector field is important to the saliency detection performance of our SGSP model. We performed the following experiments to analyze the effect of the motion vector field on the saliency detection performance. Using the block matching method [68], which is commonly used in the motion estimation module of video encoders, the motion vector of each block in each frame F_t is estimated with the previous frame F_{t-1} as the reference frame. We used different block sizes of 4 × 4, 8 × 8, and 16 × 16, and set the search range to 64 × 64, which is large enough to cover the range of motions between adjacent frames. The obtained block-level motion vector field is finally upsampled to generate the pixel-level motion vector field MVF_{t,t-1} for use in our SGSP model. Therefore, for each data set, we generated three groups of FStS maps with the three block sizes, denoted by BS = 4, BS = 8, and BS = 16, respectively, and their PR curves, together with the PR curve of our FStS maps, are shown in Fig. 13 for comparison.
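As a concrete reference for the block matching setup described above, the following is a minimal sketch of SAD-based block-level motion estimation followed by nearest-neighbor upsampling to a pixel-level field. For clarity it uses exhaustive search over a ±32-pixel window (roughly the 64 × 64 range used in our experiments), whereas [68] uses the much faster adaptive rood pattern search, which is not reproduced here.

```python
# Hedged sketch of block-level motion estimation (exhaustive SAD search)
# and nearest-neighbor upsampling to a pixel-level motion vector field.
import numpy as np

def block_motion_field(frame_t, frame_prev, bs=8, search=32):
    """frame_t, frame_prev: H x W grayscale arrays.
    Returns a pixel-level motion vector field of shape H x W x 2 (dy, dx)."""
    frame_t = frame_t.astype(float)        # avoid uint8 wrap-around in SAD
    frame_prev = frame_prev.astype(float)
    H, W = frame_t.shape
    mvf = np.zeros((H, W, 2))
    for y in range(0, H - bs + 1, bs):
        for x in range(0, W - bs + 1, bs):
            block = frame_t[y:y + bs, x:x + bs]
            best, best_dy, best_dx = np.inf, 0, 0
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + bs > H or xx + bs > W:
                        continue
                    sad = np.abs(block - frame_prev[yy:yy + bs, xx:xx + bs]).sum()
                    if sad < best:
                        best, best_dy, best_dx = sad, dy, dx
            # Nearest-neighbor upsampling: every pixel in the block inherits
            # the block's motion vector.
            mvf[y:y + bs, x:x + bs] = (best_dy, best_dx)
    return mvf
```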

Fig. 13. (Better viewed in color) Comparison of PR curves of FStS maps generated with block-level motion vector fields on (a) SegTrackV2 and (b) UVSD.

We can see from Fig. 13 that the saliency detection performance of our SGSP model degrades on both data sets when block-level motion vector fields are used. Note that the LDOF method is designed to better capture the large and complicated motions between frames, while the block matching method is designed to find the matched block with the lowest error, such as the minimum sum of absolute differences between blocks. The block-level motion vector fields represent the complicated motions of object and background regions with lower accuracy, and thus degrade the saliency detection performance. Compared with the SegTrackV2 data set, on the UVSD data set, which contains more challenging videos, the block-level motion vector fields have even lower accuracy, and thus the saliency detection performance decreases more significantly. Using the MATLAB implementation, the average processing time for estimating the block-level motion vector field and upsampling it to the pixel-level motion vector field, per frame with a resolution of 320 × 240, is 1.256, 0.399, and 0.136 s for the block sizes of 4 × 4, 8 × 8, and 16 × 16, respectively. It can be seen that the use of block-level motion estimation can substantially reduce the computation cost of our SGSP model, at the cost of degraded saliency detection performance.

In summary, the quality of the motion vector field clearly affects the saliency detection performance, and thus optical flow estimation methods such as the LDOF method, which better handle large and complicated motions, are the better choice for spatiotemporal saliency models. Indeed, the spatiotemporal saliency models with higher performance, including SP, SLT, GD, and our SGSP model, all use the LDOF method for estimating motion vector fields. However, as shown by the challenging videos in Fig. 12, even with the motion vector fields estimated by the LDOF method, it is still difficult for SP, SLT, GD, and our SGSP model to effectively highlight the complete salient objects. In such cases, where motion vector fields are insufficient for effective saliency detection, we need to resort to other sources of high-performing saliency maps if available. In [69], an adaptive fusion method that learns a quality prediction model for different saliency maps has been shown to be effective in improving the saliency detection performance on images. The quality prediction model is learned using various quality measures [70] for saliency maps, and is used to generate a quality score for a given saliency map, which serves as the weight for adaptive fusion. Therefore, with a tailored quality prediction model, we believe that the adaptive fusion of our spatiotemporal saliency maps with saliency maps generated using some high-performing saliency models for images may further improve the saliency detection performance.
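To illustrate the fusion step only (the quality prediction model of [69], learned from the measures in [70], is out of scope), the following is a minimal sketch in which predicted quality scores are normalized into weights for a convex combination of saliency maps; the predict_quality callable is a hypothetical placeholder that the reader would supply.

```python
# Hedged sketch of quality-weighted adaptive fusion of saliency maps,
# following the idea of [69]. predict_quality is a hypothetical placeholder
# for a learned quality prediction model.
import numpy as np

def fuse_saliency_maps(maps, predict_quality):
    """maps: list of H x W saliency maps in [0, 1];
    predict_quality: callable returning a scalar quality score per map."""
    scores = np.array([predict_quality(m) for m in maps], dtype=float)
    weights = scores / (scores.sum() + 1e-12)       # normalize scores to weights
    fused = sum(w * m for w, m in zip(weights, maps))
    # Rescale the fused map to [0, 1] for comparison and thresholding.
    return (fused - fused.min()) / (np.ptp(fused) + 1e-12)
```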

IV. CONCLUSION

In this paper, we have presented our SGSP model, which utilizes a superpixel-level graph and spatiotemporal propagation to effectively improve the saliency detection performance on unconstrained videos. Our SGSP model enables the iterative MS measurement based on the superpixel-level graph as well as the bidirectional (both forward and backward) temporal propagation of saliency and the glocal (both global and local) spatial propagation of saliency via the efficient use of inter-frame and intra-frame similarity matrices, respectively. Extensive experimental results on two video data sets with various unconstrained videos demonstrate the consistently better saliency detection performance of our SGSP model compared with other state-of-the-art spatiotemporal saliency models. However, it is still difficult for our SGSP model to effectively handle some challenging videos, in which the area of salient objects is rather large, some background regions show distinctive motions in contrast with the remaining large background regions, or the motion vector fields are of low quality. Considering the above limitations of our SGSP model, in our future work we will investigate an adaptive saliency fusion method to further improve the saliency detection performance on unconstrained videos. In addition, as a meaningful application of the spatiotemporal saliency model, we will explore an effective salient object segmentation framework for unconstrained videos.

ACKNOWLEDGMENT

We would like to thank the anonymous reviewers and the Associate Editor for their valuable comments, which have greatly helped us to make improvements.

REFERENCES

[1] C. Koch and S. Ullman, “Shifts in selective visual attention: Towards the underlying neural circuitry,” Human Neurobiol., vol. 4, no. 4, pp. 219–227, 1985.
[2] A. M. Treisman and G. Gelade, “A feature-integration theory of attention,” Cognit. Psychol., vol. 12, no. 1, pp. 97–136, Jan. 1980.
[3] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, Nov. 1998.
[4] T. Liu et al., “Learning to detect a salient object,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 2, pp. 353–367, Feb. 2011.
[5] R. Shi, Z. Liu, H. Du, X. Zhang, and L. Shen, “Region diversity maximization for salient object detection,” IEEE Signal Process. Lett., vol. 19, no. 4, pp. 215–218, Apr. 2012.


[6] Y. Luo, J. Yuan, P. Xue, and Q. Tian, “Saliency density maximization for efficient visual objects discovery,” IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 12, pp. 1822–1834, Dec. 2011.
[7] Z. Liu, R. Shi, L. Shen, Y. Xue, K. N. Ngan, and Z. Zhang, “Unsupervised salient object segmentation based on kernel density estimation and two-phase graph cut,” IEEE Trans. Multimedia, vol. 14, no. 4, pp. 1275–1289, Aug. 2012.
[8] Z. Liu, W. Zou, and O. Le Meur, “Saliency tree: A novel saliency detection framework,” IEEE Trans. Image Process., vol. 23, no. 5, pp. 1937–1952, May 2014.
[9] M. M. Cheng, N. J. Mitra, X. L. Huang, P. H. S. Torr, and S. M. Hu, “Global contrast based salient region detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, Mar. 2015.
[10] K. Fukuchi, K. Miyazato, A. Kimura, S. Takagi, and J. Yamato, “Saliency-based video segmentation with graph cuts and sequentially updated priors,” in Proc. IEEE ICME, Jun./Jul. 2009, pp. 638–641.
[11] W. Wang, J. Shen, and F. Porikli, “Saliency-aware geodesic video object segmentation,” in Proc. IEEE CVPR, Jun. 2015, pp. 3395–3402.
[12] A. Shamir and S. Avidan, “Seam carving for media retargeting,” Commun. ACM, vol. 52, no. 1, pp. 77–85, 2009.
[13] Z. Yuan, T. Lu, T. S. Huang, D. Wu, and H. Yu, “Addressing visual consistency in video retargeting: A refined homogeneous approach,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 6, pp. 890–903, Jun. 2012.
[14] H. Du, Z. Liu, J. Jiang, and L. Shen, “Stretchability-aware block scaling for image retargeting,” J. Vis. Commun. Image Represent., vol. 24, no. 4, pp. 499–508, 2013.
[15] C. Guo and L. Zhang, “A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression,” IEEE Trans. Image Process., vol. 19, no. 1, pp. 185–198, Jan. 2010.
[16] Z. Li, S. Qin, and L. Itti, “Visual attention guided bit allocation in video compression,” Image Vis. Comput., vol. 29, no. 1, pp. 1–14, Jan. 2011.
[17] L. Shen, Z. Liu, and Z. Zhang, “A novel H.264 rate control algorithm with consideration of visual attention,” Multimedia Tools Appl., vol. 63, no. 3, pp. 709–727, 2013.
[18] H. Liu and I. Heynderickx, “Visual attention in objective image quality assessment: Based on eye-tracking data,” IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 7, pp. 971–982, Jul. 2011.
[19] D. Culibrk, M. Mirkovic, V. Zlokolica, M. Pokric, V. Crnojevic, and D. Kukolj, “Salient motion features for video quality assessment,” IEEE Trans. Image Process., vol. 20, no. 4, pp. 948–958, Apr. 2011.
[20] O. Le Meur and Z. Liu, “Saccadic model of eye movements for free-viewing condition,” Vis. Res., vol. 116, pp. 152–164, Nov. 2015.
[21] A. Borji, D. N. Sihite, and L. Itti, “Salient object detection: A benchmark,” in Proc. 12th ECCV, Oct. 2012, pp. 414–429.
[22] A. Borji, D. N. Sihite, and L. Itti, “Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study,” IEEE Trans. Image Process., vol. 22, no. 1, pp. 55–69, Jan. 2013.
[23] L. Itti and P. Baldi, “A principled approach to detecting surprising events in video,” in Proc. IEEE CVPR, Jun. 2005, pp. 631–637.
[24] H. J. Seo and P. Milanfar, “Static and space-time visual saliency detection by self-resemblance,” J. Vis., vol. 9, no. 12, Nov. 2009, Art. no. 15.
[25] Y. Lin, Y. Y. Tang, B. Fang, Z. Shang, Y. Huang, and S. Wang, “A visual-attention model using earth mover’s distance-based saliency measurement and nonlinear feature combination,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 2, pp. 314–328, Feb. 2013.
[26] D. Gao, V. Mahadevan, and N. Vasconcelos, “On the plausibility of the discriminant center-surround hypothesis for visual saliency,” J. Vis., vol. 8, no. 7, Jun. 2008, Art. no. 13.
[27] V. Mahadevan and N. Vasconcelos, “Spatiotemporal saliency in dynamic scenes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 171–177, Jan. 2010.
[28] W. Kim and C. Kim, “Spatiotemporal saliency detection using textural contrast and its applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 4, pp. 646–659, Apr. 2014.
[29] C. Liu, P. C. Yuen, and G. Qiu, “Object motion detection using information theoretic spatio-temporal saliency,” Pattern Recognit., vol. 42, no. 11, pp. 2897–2906, Nov. 2009.
[30] Y. Li, Y. Zhou, J. Yan, Z. Niu, and J. Yang, “Visual saliency based on conditional entropy,” in Proc. 9th ACCV, Sep. 2009, pp. 246–257.
[31] X. Hou and L. Zhang, “Dynamic visual attention: Searching for coding length increments,” in Proc. NIPS, Dec. 2008, pp. 681–688.
[32] Y. Li, Y. Zhou, L. Xu, X. Yang, and J. Yang, “Incremental sparse saliency detection,” in Proc. 16th IEEE ICIP, Nov. 2009, pp. 3093–3096.
[33] V. Gopalakrishnan, D. Rajan, and Y. Hu, “A linear dynamical system framework for salient motion detection,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 5, pp. 683–692, May 2012.
[34] K. Muthuswamy and D. Rajan, “Salient motion detection through state controllability,” in Proc. IEEE ICASSP, Mar. 2012, pp. 1465–1468.
[35] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” in Proc. IEEE CVPR, Jun. 2007, pp. 1–8.
[36] X. Cui, Q. Liu, and D. Metaxas, “Temporal spectral residual: Fast motion saliency detection,” in Proc. 17th ACM MM, Oct. 2009, pp. 617–620.
[37] J. Li, Y. Tian, T. Huang, and W. Gao, “Probabilistic multi-task learning for visual saliency estimation in video,” Int. J. Comput. Vis., vol. 90, no. 2, pp. 150–165, 2010.
[38] E. Vig, M. Dorr, T. Martinetz, and E. Barth, “Intrinsic dimensionality predicts the saliency of natural dynamic scenes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 6, pp. 1080–1091, Jun. 2012.
[39] W.-F. Lee, T.-H. Huang, S.-L. Yeh, and H. H. Chen, “Learning-based prediction of visual attention for video signals,” IEEE Trans. Image Process., vol. 20, no. 11, pp. 3028–3038, Nov. 2011.
[40] C.-R. Huang, Y.-J. Chang, Z.-X. Yang, and Y.-Y. Lin, “Video saliency map detection by dominant camera motion removal,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 8, pp. 1336–1349, Aug. 2014.
[41] J. Han, S. He, X. Qian, D. Wang, L. Guo, and T. Liu, “An object-oriented visual saliency detection framework based on sparse coding representations,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 12, pp. 2009–2021, Dec. 2013.
[42] Y. Luo and Q. Tian, “Spatio-temporal enhanced sparse feature selection for video saliency estimation,” in Proc. IEEE CVPR Workshops, Jun. 2012, pp. 33–38.
[43] Z. Ren, S. Gao, L.-T. Chia, and D. Rajan, “Regularized feature reconstruction for spatio-temporal saliency detection,” IEEE Trans. Image Process., vol. 22, no. 8, pp. 3120–3132, Aug. 2013.
[44] Z. Ren, L.-T. Chia, and D. Rajan, “Video saliency detection with robust temporal alignment and local-global spatial contrast,” in Proc. 2nd ACM ICMR, Jun. 2012, Art. no. 47.
[45] Y. Xue, X. Guo, and X. Cao, “Motion saliency detection using low-rank and sparse decomposition,” in Proc. IEEE ICASSP, Mar. 2012, pp. 1485–1488.
[46] L. Wixson, “Detecting salient motion by accumulating directionally-consistent flow,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 774–780, Aug. 2000.
[47] D. Mahapatra, S. O. Gilani, and M. K. Saini, “Coherency based spatio-temporal saliency detection for video object segmentation,” IEEE J. Sel. Topics Signal Process., vol. 8, no. 3, pp. 454–462, Jun. 2014.
[48] Y. Tong, F. A. Cheikh, F. F. E. Guraya, H. Konik, and A. Trémeau, “A spatiotemporal saliency model for video surveillance,” Cognit. Comput., vol. 3, no. 1, pp. 241–263, Mar. 2011.
[49] O. Le Meur, P. Le Callet, and D. Barba, “Predicting visual fixations on video based on low-level visual features,” Vis. Res., vol. 47, no. 19, pp. 2483–2498, 2007.
[50] S. Marat, T. H. Phuoc, L. Granjon, N. Guyader, D. Pellerin, and A. Guérin-Dugué, “Modelling spatio-temporal saliency to predict gaze direction for short videos,” Int. J. Comput. Vis., vol. 82, no. 3, pp. 231–243, 2009.
[51] G. Abdollahian, C. M. Taskiran, Z. Pizlo, and E. J. Delp, “Camera motion-based analysis of user generated video,” IEEE Trans. Multimedia, vol. 12, no. 1, pp. 28–41, Jan. 2010.
[52] W. Kim, C. Jung, and C. Kim, “Spatiotemporal saliency detection and its applications in static and dynamic scenes,” IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 4, pp. 446–456, Apr. 2011.
[53] K. Muthuswamy and D. Rajan, “Salient motion detection in compressed domain,” IEEE Signal Process. Lett., vol. 20, no. 10, pp. 996–999, Oct. 2013.
[54] Y. Fang, W. Lin, Z. Chen, C.-M. Tsai, and C.-W. Lin, “A video saliency detection model in compressed domain,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 1, pp. 27–38, Jan. 2014.
[55] Z. Liu, X. Zhang, S. Luo, and O. Le Meur, “Superpixel-based spatiotemporal saliency detection,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 9, pp. 1522–1540, Sep. 2014.
[56] J. Li, Z. Liu, X. Zhang, O. Le Meur, and L. Shen, “Spatiotemporal saliency detection based on superpixel-level trajectory,” Signal Process., Image Commun., vol. 38, pp. 100–114, Oct. 2015.
[57] J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,” in Proc. NIPS, Dec. 2006, pp. 545–552.
[58] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in Proc. IEEE CVPR, Jun. 2013, pp. 3166–3173.


[59] Y. Li, K. Fu, L. Zhou, Y. Qiao, J. Yang, and B. Li, “Saliency detection based on extended boundary prior with foci of attention,” in Proc. IEEE ICASSP, May 2014, pp. 2798–2802.
[60] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg, “Video segmentation by tracking many figure-ground segments,” in Proc. IEEE ICCV, Dec. 2013, pp. 2192–2199.
[61] T. Brox and J. Malik, “Large displacement optical flow: Descriptor matching in variational motion estimation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 3, pp. 500–513, Mar. 2011.
[62] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012.
[63] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Trans. Syst., Man, Cybern., vol. 9, no. 1, pp. 62–66, Jan. 1979.
[64] R. Szeliski et al., “A comparative study of energy minimization methods for Markov random fields with smoothness-based priors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 6, pp. 1068–1080, Jun. 2008.
[65] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, “Learning low-level vision,” Int. J. Comput. Vis., vol. 40, no. 1, pp. 25–47, 2000.
[66] S.-C. Pei, W.-W. Chang, and C.-T. Shen, “Saliency detection using superpixel belief propagation,” in Proc. IEEE ICIP, Oct. 2014, pp. 1135–1139.
[67] N. Sundaram, T. Brox, and K. Keutzer, “Dense point trajectories by GPU-accelerated large displacement optical flow,” in Proc. 11th ECCV, Sep. 2010, pp. 438–451.
[68] Y. Nie and K.-K. Ma, “Adaptive rood pattern search for fast block-matching motion estimation,” IEEE Trans. Image Process., vol. 11, no. 12, pp. 1442–1448, Dec. 2002.
[69] X. Zhou, Z. Liu, G. Sun, L. Ye, and X. Wang, “Improving saliency detection via multiple kernel boosting and adaptive fusion,” IEEE Signal Process. Lett., vol. 23, no. 4, pp. 517–521, Apr. 2016.
[70] L. Mai and F. Liu, “Comparing salient object detection results without ground truth,” in Proc. 13th ECCV, Sep. 2014, pp. 76–91.

Zhi Liu (M’07–SM’15) received the B.E. and M.E. degrees from Tianjin University, Tianjin, China, in 1999 and 2002, respectively, and the Ph.D. degree from the Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai, China, in 2005.

He is currently a Professor with the School of Communication and Information Engineering, Shanghai University, Shanghai, China. From 2012 to 2014, he was a Visiting Researcher with the SIROCCO Team, IRISA/INRIA-Rennes, Rennes, France, with the support of EU FP7 Marie Curie Actions. He has authored or co-authored over 120 refereed technical papers in international journals and conferences. His research interests include saliency models, image/video segmentation, image/video retargeting, video coding, and multimedia communication.

Prof. Liu was a TPC Member of ICME 2014, WIAMIS 2013, IWVP 2011, PCM 2010, and ISPACS 2010. He co-organized special sessions on visual attention, saliency models, and applications at WIAMIS 2013 and ICME 2014. He is an Area Editor of Signal Processing: Image Communication and served as a Guest Editor of the Special Issue on Recent Advances in Saliency Models, Applications and Evaluations of Signal Processing: Image Communication.

Junhao Li received the B.E. degree from Hubei Normal University, Huangshi, China, in 2013, and the M.E. degree from Shanghai University, Shanghai, China, in 2016.

His research interests include saliency models and salient object detection.

Linwei Ye received the B.E. degree from Hangzhou Dianzi University, Hangzhou, China, in 2013, and the M.E. degree from Shanghai University, Shanghai, China, in 2016.

His research interests include saliency models and salient object segmentation.

Guangling Sun received the B.S. degree in electronic engineering from Northeast Forestry University, Harbin, China, in 1996, and the M.E. and Ph.D. degrees in computer application technology from the Harbin Institute of Technology, Harbin, in 1998 and 2003, respectively.

She has been an Associate Professor with the School of Communication and Information Engineering, Shanghai University, Shanghai, China, since 2006. She was a Visiting Scholar with the University of Maryland, College Park, MD, USA, from 2013 to 2014. Her research interests include saliency detection, face recognition, and image/video processing.

Liquan Shen received the B.S. degree in automation control from Henan Polytechnic University, Jiaozuo, China, in 2001, and the M.E. and Ph.D. degrees in communication and information systems from Shanghai University, Shanghai, China, in 2005 and 2008, respectively.

He was with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA, as a Visiting Professor from 2013 to 2014. He is currently a Professor with the School of Communication and Information Engineering, Shanghai University. He has authored or co-authored over 100 refereed technical papers in international journals and conferences in the field of video coding and image processing. He holds ten patents in the areas of image/video coding and communications. His research interests include High Efficiency Video Coding, perceptual coding, video codec optimization, 3DTV, and video quality assessment.