
Static Saliency vs. Dynamic Saliency: A Comparative Study

Tam V. Nguyen, National University of Singapore, [email protected]
Mengdi Xu∗, National University of Singapore, [email protected]
Guangyu Gao, Beijing Institute of Technology, [email protected]
Mohan Kankanhalli, National University of Singapore, [email protected]
Qi Tian, University of Texas at San Antonio, [email protected]
Shuicheng Yan, National University of Singapore, [email protected]

ABSTRACT

Recently, visual saliency has attracted wide attention from researchers in the computer vision and multimedia fields. However, most visual saliency-related research has been conducted on still images for studying static saliency. In this paper, we give the first comprehensive comparative study of dynamic saliency (video shots) and static saliency (key frames of the corresponding video shots), and obtain two key observations: 1) video saliency is often different from, yet closely related to, image saliency, and 2) camera motions, such as tilting, panning or zooming, affect dynamic saliency significantly. Motivated by these observations, we propose a novel camera motion and image saliency aware model for dynamic saliency prediction. Extensive experiments on two static-vs-dynamic saliency datasets collected by us show that our proposed method outperforms the state-of-the-art methods for dynamic saliency prediction. Finally, we introduce an application of dynamic saliency prediction to dynamic video captioning, assisting people with hearing impairments to better enjoy videos with only off-screen voices, e.g., documentary films, news videos and sports videos.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous

General Terms

Algorithms, Experimentation, Human Factors

Keywords

Static Saliency, Dynamic Saliency, Cinematography, Camera Motion

∗ indicates equal contribution

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MM'13, October 21–25, 2013, Barcelona, Spain.
Copyright 2013 ACM 978-1-4503-2404-5/13/10 ...$15.00.
http://dx.doi.org/10.1145/2502081.2502128

1. INTRODUCTION

Visual saliency refers to the preferential attention paid to conspicuous or meaningful regions in a scene. Such a visual attention scheme is naturally built into complex biological visual systems to rapidly detect potential prey, predators, or mates in the real world. The process of visual saliency has been the subject of numerous studies in psychology, neuroscience, computer vision and multimedia. Correspondingly, several computational models of saliency have been proposed in recent years [24, 9, 21, 32], and many applications of automatic saliency detection have been proposed as well, such as image resizing [1], automatic image collage creation [30] and advertisement design [20]. Recently, other factors related to human attention, such as depth information and attractiveness, have also been explored [13, 22, 23].

Although visual saliency has attracted the attention of researchers in the computer vision and multimedia fields for quite a long time, most visual saliency-related work has been conducted on still images. Video saliency receives much less research attention, even though it is becoming increasingly important with the rapidly growing demand for intelligent video processing. Moreover, in existing works on video saliency [15, 2, 31], camera motions such as tilting, panning or zooming are disregarded during saliency estimation. However, these camera motions exist ubiquitously in videos and may have a great impact on the saliency distribution, as experimentally validated in this work. Motivated by these two considerations, we conduct a comprehensive comparison between static saliency in still images and dynamic saliency in videos. Inspired by the observations from this comparison, we propose to utilize static saliency as prior information to improve the performance of dynamic saliency estimation in videos. We also investigate the role of camera motions in video saliency and integrate the estimated camera motion information into the saliency estimation procedure. Extensive experiments on two challenging benchmark datasets clearly show that our proposed saliency detection method outperforms the state-of-the-art. Apart from proposing a novel method for saliency estimation, we introduce an interesting application of video saliency detection, i.e., adaptive video subtitle insertion for assisting people with hearing impairments.



Figure 1: The comparative study of static saliency vs. dynamic saliency. We collect eye-tracking data in both static and dynamic viewing settings, with each stimulus viewed by at least 10 observers. The CMASS framework is proposed to improve dynamic saliency detection.

To facilitate the comparative study of static saliency vs. dynamic saliency, we first collect two video datasets for dynamic saliency estimation, namely Hollywood and Camera Motion (CAMO). Both datasets contain videos with camera motions. Volunteers are then invited to participate in the eye fixation collection for these videos. Afterwards, the raw fixation data are converted into human fixation maps, which serve as the groundtruth for saliency estimation. As mentioned above, in this work we consider both the prior information from static saliency and the camera motions in video saliency. We present a novel learning framework, called Camera Motion And Static Saliency (CMASS), to integrate this information into video saliency estimation. In particular, we train two neural networks that take the camera motion parameters and patch position as inputs and output the optimal weights for the static saliency map and the dynamic saliency map. In this way, the two available saliency maps can be adaptively fused to produce an improved dynamic saliency estimate.

The proposed framework is shown in Figure 1, which includes the static and dynamic saliency detection for the same video and the fusion of the two detected saliency maps. The major contributions of this work can be summarized as follows:

1. To the best of our knowledge, this is the first comprehensive comparative study of static saliency vs. dynamic saliency detection.

2. This is the first work to investigate the effects of camera motions in dynamic saliency detection.

3. Inspired by the observed relationship between static and dynamic saliency, we propose a novel learning framework, i.e., the CMASS method, for automatically fusing these two kinds of saliency maps to improve the performance of dynamic saliency detection.

4. We introduce a new and useful application of dynamic saliency detection, namely adaptive video subtitle insertion for assisting people with hearing impairments.

The rest of the paper is structured as follows. In Section 2, we discuss related work. We then introduce our database construction and key observations in Section 3 and Section 4, respectively. The proposed CMASS method is described in Section 5, and the experiments are presented in Section 6. The new application of inserting captions into videos is shown in Section 7. Finally, we conclude in Section 8.

2. RELATED WORK

2.1 Learning to Predict Saliency

Preliminary studies [3, 28] show that, at the early stages of free viewing, mainly bottom-up factors (e.g., color, intensity, or orientation) attract human attention, and later on, top-down factors (e.g., humans, objects and interactions) guide eye movements. Some top-down factors in free viewing are related to semantics. Elazary et al. suggested that interesting objects (annotations from the LabelMe dataset [27]) direct human attention [7]. Einhauser et al. observed that objects are better predictors of fixations than bottom-up saliency [6]. Cerf et al. discovered that meaningful objects such as faces and text attract human attention [4]. Judd et al. further showed that humans, faces, cars, text, and animals attract human gaze [11]. These interesting objects convey more information in a scene. While collecting the NUSEF eye fixation dataset [26], Subramanian et al. found that fixations are focused on emotional and action stimuli.

Therefore, combining bottom-up and top-down factors may boost existing models to better predict where humans look [12, 11, 25, 33]. The basic idea is that a weighted combination of features, where the weights are learned from a large repository of eye movements over natural images, can enhance saliency detection compared with an unadjusted combination of feature maps. [12], [11] and [25] used image patches, a vector of several features at each pixel, and scene gist, respectively, for learning saliency. Zhao et al. learned optimal weights for saliency channel combination separately for each eye-tracking dataset [33].

2.2 Saliency Prediction Models for Static and Dynamic Scenes

Visual attention analysis in static scenes has long been studied, while there is not much work on dynamic scenes. In reality, we absorb rich visual information that constantly changes due to the dynamics of the world. Because of this large amount of information, visual selection is performed based on both the current scene saliency and the knowledge accumulated from chronological events. In early works, a few researchers extended spatial attention from static images to video sequences, where motion plays an important role. Cheng et al. incorporated motion information into the attention model [5]; their motion attention model analyzes the magnitudes of image pixel motion in the horizontal and vertical directions. Boiman and Irani proposed spatiotemporal irregularity detection in videos [2]; in that work, instead of computing motion features, textures of 2D and 3D video patches are compared with a training database to detect abnormal actions in the video. Le Meur et al. proposed a spatiotemporal model for visual attention detection [14], in which affine parameters are analyzed to produce the motion saliency map.

Recently, several researchers have studied modeling temporal effects on bottom-up saliency. Some methods fuse static and dynamic saliency maps to produce the final visual saliency maps (e.g., Li et al. [15] and Marat et al. [17]). A spatio-temporal attention modeling approach for videos combines motion contrast derived from the homography between two images with spatial contrast calculated from color histograms. Zhai et al. introduced a dynamic fusion technique to combine the temporal and spatial models into a spatiotemporal attention model [31], where the dynamic weights of the two individual models are controlled by the pseudo-variance of the temporal saliency values.

3. FIXATION DATA COLLECTION

3.1 Data Collection

There are many datasets of still images (for studying static saliency) and videos (for studying dynamic saliency) [4, 11, 26]. However, none of them can be used to study the saliency of still images and videos simultaneously. Thus, in this work, we first construct two new datasets for studying these two kinds of saliency maps together.

3.1.1 Dataset Construction

To study the effects of camera motion on video saliency, we collect a new dataset named CAMO (Camera Motion), which consists of 120 videos covering 6 fundamental camera motions in cinematography: dolly, zoom, trucking, tilt, pan, and pedestal. Each video contains one single camera motion. Similar to the Hollywood dataset, we also randomly select one frame from each video for static saliency collection. Each camera motion is described below.

• Tilting: the camera is stationary and rotates in a vertical plane.

• Panning: the camera is stationary and rotates in a horizontal plane.

• Dolly: the camera is mounted on a dolly, and the camera operator and the focus puller or camera assistant usually ride on the dolly to operate the camera.

• Trucking: roughly synonymous with the dolly shot, but often defined more specifically as movement that stays a constant distance from the action, especially side-to-side movement.

• Pedestal: moving the camera position vertically with respect to the subject.

Figure 2: The fundamental camera motions in cinematography. Six basic types of motion are shown.

• Zooming: technically this is not a camera move but a change in the lens focal length, which gives the illusion of moving the camera closer or further away.

Figure 2 illustrates the six aforementioned camera motions. In the real world, many camera moves combine several of these techniques simultaneously. Therefore, we also collect another dataset named Hollywood: we select 500 random videos from the Hollywood 2 dataset [18], which consists of videos with natural human actions in diverse and realistic settings. We select Hollywood 2 because it contains realistic movies. Since a dataset of eye fixations on these movies already exists [19], we only collect fixation data on static images from this dataset. For each video, we extract one random frame that is not a shot boundary and is close to the center frame of the video.

3.1.2 Human Fixation Data Collection Design

We invite 30 participants (students and staff members of a university), aged from 21 to 36 years (μ = 26.9, σ = 3.1), with normal or corrected-to-normal vision, to participate in the fixation map collection. All participants are naive to the purpose of this study and have no prior exposure to vision experiments. The participants are split equally into three groups, and each group freely views only one of the following three categories: Hollywood static images, CAMO static images, or CAMO videos.

We use a block-based design and a free-viewing paradigm; each subject views one of four designed blocks. To record eye gaze data, we use an infra-red based remote eye-tracker from SensoMotoric Instruments GmbH, which gives less than 1° error after successful calibration. The eye tracker is calibrated for each participant using a 9-point calibration and validation method. Images are then presented in random order for 6 seconds each, followed by a gray mask for 3 seconds.

Human fixation maps are constructed from the fixations of the viewers to globally represent the spatial distribution of human fixations. Similar to [29], in order to produce a continuous fixation map of an image, we convolve a Gaussian filter across all corresponding viewers' fixation locations. Some examples of fixation maps from the two newly constructed datasets are shown in Figure 3, where brighter pixels on the fixation maps denote higher salience values. These two datasets, CAMO and Hollywood, together with the stimuli, fixation data and fixation maps, will be released for public use.

Figure 3: Exemplar images and their corresponding saliency maps and heat maps in the CAMO and Hollywood datasets.
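For concreteness, the following is a minimal sketch of how such a continuous fixation map can be produced from the aggregated fixation points of all viewers. The function name, the σ value and the use of SciPy are illustrative choices of ours, not details specified in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map(fixations, height, width, sigma=20):
    """Build a continuous fixation map from (x, y) fixation points.

    fixations: iterable of (x, y) pixel coordinates aggregated over viewers.
    sigma: Gaussian smoothing bandwidth in pixels (illustrative value).
    """
    fmap = np.zeros((height, width), dtype=np.float64)
    for x, y in fixations:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            fmap[yi, xi] += 1.0          # accumulate fixation counts
    fmap = gaussian_filter(fmap, sigma)   # convolve with a Gaussian kernel
    if fmap.max() > 0:
        fmap /= fmap.max()                # normalize to [0, 1] for visualization
    return fmap
```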

4. OBSERVATIONS

4.1 Camera Motion Effects

Using the recorded eye tracker data, we mainly investigate whether the spatial distributions of fixations differ between the static and dynamic settings. The key observations are summarized as follows.

1. When there is only a single person or object, the fixations for each video form a subset of those for the corresponding image. This observation indicates a close relationship between the static and dynamic saliency maps: we can use the static saliency map as a good prior to guide dynamic saliency estimation.

2. In some cases, for example pedestal camera movement, the fixations lie in the anticipated direction of motion rather than on the objects. This observation shows the effect of camera motion on dynamic saliency.

3. When there are multiple persons or objects in the video, the fixations on the video are not the same as those on the image. In other words, the fixations on videos and images focus on different people or objects.

Some examples of the camera motion effects mentioned in Observation 2 are shown in Figure 4. The discrepancies for each camera motion are summarized as follows.

• Pan: the fixations may be either on the object of interest (e.g., the face of a walking person) or in the anticipated direction of the motion.

• Pedestal: the subject often tends to fixate in the anticipated direction of motion.

• Tilt: in a tilt shot, the subject also tends to fixate in the anticipated direction of motion.

• Trucking: fixations in the video are either a subset of the fixations on the static image or they lie in the anticipated direction of motion.

• Dolly shot: in a dolly shot, the camera is moving closer to the center or the object of focus, so the anticipated direction of motion can be considered to be that center or object. The subject does fixate on the object of interest, but the dolly movement itself does not drive fixations toward an anticipated direction of motion the way pan, pedestal, or tilt movements do.

• Zoom: we notice that the fixations are either on the object of interest or follow the peripheral motion induced by the camera.

Figure 4: Observations of fixation data on images (top row) and videos (bottom row). Note the difference in human fixations from column (c) to (f).

4.2 Central Bias Investigation

We compute the average fixation maps to investigate the central bias of the fixations. Due to the different sizes of the test images, the average fixation maps have a cross-like shape. As can be seen in Figure 5, the center bias remains strong in the average fixation map of the original images in the static part of both datasets, which agrees with the finding in [11]. The average map for the video part of the CAMO dataset is not center-biased, due to the strong effects of the camera motions. Meanwhile, the center bias in the average fixation map of the Hollywood videos is weaker than in the static version. In summary, the central bias is not significantly observed in video fixations.

Figure 5: The average static and dynamic fixation maps from the two datasets. Warmer color indicates stronger fixation.

5. THE PROPOSED FRAMEWORK

In this section, we first explain the feature extraction applied to a given image or a certain frame of the video, followed by a novel framework that learns the mapping between image saliency and video saliency simultaneously.

5.1 Features

5.1.1 Static Features

To describe the content of the images well, we extract multiple static features and combine them. Together, the extracted features describe both low-level appearance and high-level semantics. In particular, we use the following low-level features: 13 local energies of the steerable pyramid filters in 4 orientations and 3 scales; the 3 intensity, orientation, and color contrast channels (Red/Green and Blue/Yellow) as calculated by Itti and Koch's saliency method; the 3 values of the red, green, and blue color channels, as well as 3 features corresponding to the probabilities of these color channels; 5 probabilities of the above color channels computed from 3D color histograms of the image filtered with a median filter at 6 different scales; and 4 saliency maps from the Torralba, SIM, SUN, and GBVS bottom-up saliency models. We also extract the following high-level features: the horizon line, owing to the tendency of photographers to frame images and objects horizontally; person and car detectors implemented with Felzenszwalb's Deformable Part Model (DPM); and a face detector using Viola and Jones' code.
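As an illustration of one of the low-level channels listed above, the sketch below computes per-pixel color probabilities from a 3D color histogram of a median-filtered image. The bin count, the filter sizes and the function name are our own assumptions; the paper uses 6 scales and further channels from existing saliency models.

```python
import numpy as np
from scipy.ndimage import median_filter

def color_probability_features(image, bins=8, scales=(3, 5, 7)):
    """Per-pixel probabilities of each pixel's color under a 3D color histogram,
    computed on median-filtered versions of the image (a simplified stand-in for
    the color-probability channels described above)."""
    h, w, _ = image.shape
    feats = []
    for k in scales:
        img = median_filter(image, size=(k, k, 1))
        idx = (img.astype(np.int64) * bins) // 256          # quantize each channel
        hist = np.zeros((bins, bins, bins), dtype=np.float64)
        np.add.at(hist, (idx[..., 0].ravel(), idx[..., 1].ravel(), idx[..., 2].ravel()), 1.0)
        hist /= hist.sum()
        feats.append(hist[idx[..., 0], idx[..., 1], idx[..., 2]])  # look up per-pixel probability
    return np.stack(feats, axis=-1)                          # h x w x len(scales)
```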

5.1.2 Dynamic Features

In temporal attention detection, saliency maps are often constructed by computing the motion contrast between image pixels. In this work, we generate dense saliency maps based on pixel-wise computations, mostly from dense optical flow fields. We first resize each image/video frame to 200 × 200 pixels and then extract the aforementioned set of features for every pixel.

Figure 6: The learning framework. The upper panel shows the learning process, including the learning of the neural network parameters. The bottom panel shows the testing phase.
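The paper does not name a specific optical flow implementation; the following sketch uses OpenCV's Farneback dense flow as one plausible choice for producing per-pixel motion features on the resized 200 × 200 frames. The function name and parameter values are illustrative.

```python
import cv2
import numpy as np

def motion_features(prev_frame, frame, size=200):
    """Per-pixel motion magnitude and orientation from dense optical flow."""
    prev_g = cv2.cvtColor(cv2.resize(prev_frame, (size, size)), cv2.COLOR_BGR2GRAY)
    curr_g = cv2.cvtColor(cv2.resize(frame, (size, size)), cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_g, curr_g, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return np.stack([mag, ang], axis=-1)  # 200 x 200 x 2 dynamic feature map
```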

5.2 CMASS for Dynamic Saliency Detection

5.2.1 Learning to Predict Image/Video Saliency

In this subsection, we describe a simple linear-regression-based saliency estimation method. In the training phase, each training sample contains the features at one pixel along with a +1 (salient) or −1 (non-salient) label. Positive samples are taken from the top p percent most salient pixels of the human fixation map (smoothed by convolving with a Gaussian filter with σ = 0.1), and negative samples are taken from the bottom q percent. We choose samples from the top 20% and bottom 40% in order to obtain samples that are strongly positive and strongly negative. Training feature vectors are normalized to have zero mean and unit standard deviation. Assuming a linear relationship between the feature vector f and the saliency value s, we solve the following optimization problem to obtain the linear model W:

min_W ‖FW − S‖² + λ‖W‖²,

where F and S are matrices formed by column-wise stacking the vectors f and s of the training data. W is obtained in closed form as W = (FᵀF + λI)⁻¹FᵀS. For a testing image, features are first extracted, the learned mapping is applied to generate a vector of saliency values, and the result is reshaped into a 200 × 200 saliency map.
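The closed-form solution above translates directly into a few lines of code. The sketch below is a minimal NumPy illustration; the function names and the regularization value are ours.

```python
import numpy as np

def train_linear_saliency(F, S, lam=1.0):
    """Closed-form ridge regression: W = (F^T F + lam*I)^(-1) F^T S.

    F: (n_samples, n_features) matrix of per-pixel feature vectors.
    S: (n_samples,) vector of +1 (salient) / -1 (non-salient) labels.
    """
    d = F.shape[1]
    return np.linalg.solve(F.T @ F + lam * np.eye(d), F.T @ S)

def predict_saliency(features, W, out_size=(200, 200)):
    """Apply the learned linear mapping and reshape to a saliency map."""
    s = features @ W                      # linear prediction per pixel
    return s.reshape(out_size)
```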

5.2.2 CMASS Video Saliency Prediction

Inspired by the observations in Section 4, we propose a novel learning-based method, Camera Motion And Static Saliency (CMASS), to improve dynamic saliency prediction by utilizing the information from camera motion and static saliency. Each frame in the videos is divided regularly into patches of 9 × 9 pixels. For the jth patch in the training samples, let p_i^j denote the saliency map vector obtained from the image and p_v^j denote the saliency map vector obtained from the video. The groundtruth saliency map for the jth patch is denoted as p^j. The camera motion parameters are denoted as CM^j, and the position of the patch in the image is denoted as (x^j, y^j). According to Observation 1, the estimated saliency map is a weighted combination of the static and dynamic saliency maps. According to Observation 2, camera motion has a great impact on the saliency map. Since the camera motion and spatial position differ across patches, the weights for the two kinds of saliency maps should also differ. Based on these two considerations, CMASS constructs two neural networks that weight the static saliency map and the dynamic saliency map, respectively, in the final saliency estimate. The input to each neural network is the camera motion parameters and the position of the patch, and the output is the weight for the corresponding saliency map. The functions of the two neural networks are denoted as φ_i(CM^j, x^j, y^j) and φ_v(CM^j, x^j, y^j), and they are learned by minimizing the following loss function:

L(φ_i, φ_v) = Σ_j ‖φ_i(CM^j, x^j, y^j) p_i^j + φ_v(CM^j, x^j, y^j) p_v^j − p^j‖².

After learning the functions of the neural networks φ_i and φ_v, we can directly obtain the saliency map for each new sample through

p = φ_i(CM, x, y) p_i + φ_v(CM, x, y) p_v,

where CM, x, y are the motion and position parameters of the input patch, and p_i and p_v are its two saliency maps.

However, directly training the neural networks involves a rather complicated optimization procedure, which hurts the efficiency of the proposed method. In this work, we introduce two auxiliary variables w_i^j and w_v^j for φ_i^j and φ_v^j to simplify the optimization. The loss function is then formulated as:

L = Σ_j ‖w_i^j p_i^j + w_v^j p_v^j − p^j‖² + λ{(φ_i^j − w_i^j)² + (φ_v^j − w_v^j)²}.   (1)

The above optimization problem can be solved by various methods; our solver, used for efficiency, iteratively alternates between two phases within the same objective function and is outlined in Algorithm 1.

992

Page 7: [ACM Press the 21st ACM international conference - Barcelona, Spain (2013.10.21-2013.10.25)] Proceedings of the 21st ACM international conference on Multimedia - MM '13 - Static saliency

Algorithm 1 Solving Problem (1)

Input: saliency map vectors p_i^j, p_v^j, p^j; parameters λ_0, ρ = 1.5.
Initialize: t = 0, w_i^j(0) = 0.5, w_v^j(0) = 0.5, λ^(0) = λ_0.
while not converged do
  1. t ← t + 1
  2. f_i^(t) ← φ_i(CM^j, x^j, y^j),  f_v^(t) ← φ_v(CM^j, x^j, y^j)
  3. Update the auxiliary variables:
     w_i^j(t+1) = ((p_i^j)ᵀ p_i^j + λI)⁻¹ ( p_i^j (p^j)ᵀ − p_i^j (p_v^j)ᵀ w_v^j + λ(f_i^(t) + f_v^(t) − w_v^j) )
     w_v^j(t+1) = ((p_v^j)ᵀ p_v^j + λI)⁻¹ ( p_v^j (p^j)ᵀ − p_v^j (p_i^j)ᵀ w_i^j + λ(f_i^(t) + f_v^(t) − w_i^j) )
  4. Train φ_i, φ_v.
  5. Update the parameters w_{φ_i} of φ_i and w_{φ_v} of φ_v.
  6. λ^(t+1) ← ρ λ^(t)
end while
Output: the learned neural networks φ_i and φ_v.

One phase, the auxiliary variable update in Step 3, has a closed-form solution; the other, Steps 4–5, is solved by training the neural networks. To ensure that the auxiliary variables approximate the original variables, the trade-off parameter λ is increased in each iteration.

For the camera motions, we extract the homography matrix over 15 frames, with the selected frame as the middle one. To avoid being affected by the motion of humans or objects, we use the robust registration method of [8]. The 8 values of the homography matrix (all elements except the last one on the diagonal) then represent the camera motion.
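The sketch below illustrates one way to obtain such an 8-dimensional camera motion descriptor between two frames. It uses ORB features with RANSAC-based homography fitting as a simplified stand-in for the robust registration method of [8]; the function name and parameters are illustrative.

```python
import cv2
import numpy as np

def camera_motion_params(frame_a, frame_b):
    """Estimate a homography between two frames and return 8 motion parameters
    (all entries except the bottom-right one, which is normalized to 1)."""
    orb = cv2.ORB_create(1000)
    kp_a, des_a = orb.detectAndCompute(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), None)
    kp_b, des_b = orb.detectAndCompute(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)[:200]
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])
    H, _ = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, 3.0)  # RANSAC discards outliers such as moving objects
    H = H / H[2, 2]                      # normalize so the last diagonal element is 1
    return H.flatten()[:8]               # 8-dimensional camera motion descriptor
```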

For the implementation, we utilize 2 hidden layers, with transfer functions 'tansig' and 'purelin', respectively; the backpropagation training function is Levenberg-Marquardt. Note that we initialize the neural network of the current step with the weights from the previous step. λ is set to 0.1, which controls the convergence speed; 40 to 50 iterations are required for convergence.
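To make the fusion step concrete, the following sketch shows an illustrative forward pass of the two weighting networks and the resulting per-patch fusion p = φ_i(CM, x, y) p_i + φ_v(CM, x, y) p_v. The exact layer configuration, parameter layout and the Levenberg-Marquardt training procedure used in the paper are not reproduced here; all names are our own.

```python
import numpy as np

def mlp_weight(params, cm, x, y):
    """Forward pass of a small weighting network phi(CM, x, y) -> scalar weight.

    params: dict of weight matrices/biases (W1, b1, W2, b2, W3, b3), assumed to be
    learned beforehand. tanh stands in for 'tansig'; the linear output for 'purelin'.
    """
    z = np.concatenate([cm, [x, y]])            # 8 motion params + patch position
    h1 = np.tanh(params["W1"] @ z + params["b1"])
    h2 = np.tanh(params["W2"] @ h1 + params["b2"])
    return float(params["W3"] @ h2 + params["b3"])

def fuse_patch(params_i, params_v, cm, x, y, p_i, p_v):
    """CMASS fusion for one patch: p = phi_i(CM,x,y) * p_i + phi_v(CM,x,y) * p_v."""
    w_i = mlp_weight(params_i, cm, x, y)
    w_v = mlp_weight(params_v, cm, x, y)
    return w_i * p_i + w_v * p_v
```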

6. EVALUATION

In this section, we describe the extensive experiments conducted on the newly collected datasets to better understand the performance of the proposed learning framework.

6.1 Learning to Predict Saliency

To quantitatively measure how well an individual saliency map predicts fixations on a given frame, we compute the area under the receiver operating characteristic (ROC) curve (AUC) and the linear correlation coefficient (CC). As the most popular measure in the community, ROC is used for the evaluation of a binary classifier system with a variable threshold (typically used to discriminate between two alternatives, e.g., saliency vs. random). Under this measure, the model is treated as a binary classifier on every pixel of the image: pixels with saliency values larger than a threshold are classified as fixated, while the remaining pixels are classified as non-fixated; human fixations are then used as ground truth. By varying the threshold, the ROC curve is drawn as the false positive rate vs. the true positive rate, and the area under this curve indicates how well the saliency map predicts actual human eye fixations. Meanwhile, CC measures the strength of the linear relationship between the human fixation map and the predicted saliency map.
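A minimal sketch of how these two measures can be computed for a single frame is given below. Treating pixels above a fraction of the fixation map's maximum as positives is a simplification of ours; the function name and threshold are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_saliency(saliency_map, fixation_map, fix_thresh=0.5):
    """AUC and linear correlation coefficient (CC) between a predicted saliency
    map and a human fixation map (both 2D arrays of the same size)."""
    s = saliency_map.ravel().astype(np.float64)
    f = fixation_map.ravel().astype(np.float64)
    # AUC: pixels above the threshold in the fixation map are treated as positives
    # (requires both classes to be present in the labels).
    labels = (f >= fix_thresh * f.max()).astype(int)
    auc = roc_auc_score(labels, s)
    # CC: Pearson correlation between the two maps.
    cc = np.corrcoef(s, f)[0, 1]
    return auc, cc
```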

Table 1: AUC and CC of saliency detection on the two datasets.

  Dataset               AUC    CC
  CAMO - images         0.74   0.52
  CAMO - videos         0.64   0.20
  Hollywood - images    0.71   0.45
  Hollywood - videos    0.75   0.30

Table 1 shows the prediction results on our collected datasets, Hollywood and CAMO. We use static features to predict static saliency, and both static and dynamic features to predict dynamic saliency. The performance of dynamic saliency prediction is worse than that of static saliency prediction based on static features only, which shows the need to improve dynamic saliency prediction.

6.2 Dynamic Saliency Evaluation

We compare the performance of the CMASS framework with the following four baseline methods:

1. Video saliency prediction from visual features [11].

2. Video saliency prediction from visual and motion features.

3. Fixed mapping weight to fuse static saliency and video saliency.

4. Adaptive mapping weight to fuse static saliency and video saliency [31].

Their performance comparison in terms of AUC and CC is shown in Table 2. As can be seen, dynamic saliency prediction from static information only is the worst in all cases. Combining the visual and motion features generally improves the performance across the two datasets. The fixed mapping weight is learned by a simple linear regressor, and the performance is further improved incrementally. The adaptive weighting method of Zhai et al. achieves better performance than the rest of the baselines, improving over the fixed mapping weight by around 2 to 3 percentage points. Our proposed CMASS achieves the best performance in terms of both AUC and CC on both datasets, Hollywood and CAMO, and generally outperforms Zhai et al. by 4 to 6 percentage points. This improvement is rather significant, and it shows the advantage of combining static saliency and camera motion for boosting dynamic saliency prediction.



Table 2: Performance of CMASS on video saliency prediction on the CAMO and Hollywood datasets.

  Method                      Hollywood        CAMO
                              AUC    CC        AUC    CC
  Judd et al. [11]            0.72   0.25      0.61   0.18
  Visual and motion feat.     0.75   0.30      0.64   0.20
  Fixed mapping weight        0.74   0.28      0.63   0.19
  Zhai et al. [31]            0.76   0.31      0.64   0.22
  CMASS                       0.80   0.37      0.69   0.28

7. APPLICATION TO VIDEO CAPTIONING

Assisting disabled persons with computer vision and multimedia techniques has consistently attracted the attention of many researchers. Recently, a technique for assisting people with hearing impairments in watching videos was developed [10]; it automatically inserts the dialogue near the talking person to help viewers understand who is talking and the content of the dialogue. However, there is often a need to insert subtitles into videos without human appearance (i.e., with only narration), such as documentary and introductory films. In this section, we introduce a new application that automatically and intelligently inserts subtitles into such videos based on the video saliency map, in order to help viewers understand the content of the narration.

The basic criteria for subtitle insertion are twofold. First, the selected position in the frame should have a low saliency score; otherwise, the inserted subtitle will overlap with the salient objects and degrade the viewing experience. Second, the selected position should be near a high-saliency region, so that the inserted subtitle does not pull the audience's attention away from the salient content.

The technique for detecting a suitable position based on the saliency map is as follows. The predicted saliency map is first split into blocks of 10 × 10 pixels, and each block i has a mean saliency value s_i. We then transform the saliency map into a response map used to determine the position of the inserted subtitles. The response value of a pixel k in block i is computed as

r_k = α_1 Σ_{j ∈ N(i)} |s_i − s_j| − α_2 s_i,   (2)

where N(i) denotes the neighboring blocks of block i, and s_i, s_j are the mean saliency values of blocks i and j, respectively; r_k is the response value of the kth pixel. The weights α_1 and α_2 are both empirically set to 0.5 throughout our implementation. The first term of the response characterizes the saliency contrast, while the second term encourages positions with low saliency. The size of the inserted text is calculated based on its length. We then perform an exhaustive search over the response map to find the most suitable position, i.e., the one with the largest response value. Figures 7 and 10 illustrate examples of inserting subtitles into documentary videos without human appearance.

Figure 10: Examples of inserting subtitles into a documentary video. The original frames, the detected saliency maps, and the calculated response maps are shown from top to bottom. The top panel shows the result from dynamic saliency detection, and the bottom panel shows the result from static saliency detection.
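The block-level response and the search for the insertion position can be sketched as follows. The 4-connected neighborhood and the return of a single block position are simplifications of ours; the paper searches the pixel-level response map and sizes the text region according to its length.

```python
import numpy as np

def best_caption_block(saliency_map, block=10, alpha1=0.5, alpha2=0.5):
    """Compute the block-level response r = alpha1 * sum_j |s_i - s_j| - alpha2 * s_i
    over neighboring blocks and return the block with the largest response."""
    h, w = saliency_map.shape
    gh, gw = h // block, w // block
    # mean saliency per block
    s = saliency_map[:gh * block, :gw * block].reshape(gh, block, gw, block).mean(axis=(1, 3))
    best, best_pos = -np.inf, (0, 0)
    for i in range(gh):
        for j in range(gw):
            neighbors = [(i + di, j + dj) for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                         if 0 <= i + di < gh and 0 <= j + dj < gw]
            contrast = sum(abs(s[i, j] - s[ni, nj]) for ni, nj in neighbors)
            r = alpha1 * contrast - alpha2 * s[i, j]
            if r > best:
                best, best_pos = r, (i * block, j * block)
    return best_pos  # top-left pixel coordinates of the selected block
```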

To evaluate the quality of the inserted subtitles and whether the viewing experience is improved, we conduct a user study on both content comprehension and user impression. 24 users participate in the study, with ages ranging from 22 to 30 years. We prepare 5 video clips with embedded captions for the evaluation.

Figure 7: The use of the response map for inserting subtitles. The first row shows the frames of the video, the second row shows the saliency maps from different saliency detection methods, the third row shows the positions found for inserting the subtitles, and the last row shows the final results.

For the content comprehension study, we randomly divide the participants into four groups (each with 6 participants) to avoid repeated viewing of a video, which would cause knowledge accumulation. Each group therefore evaluates only one of the four paradigms for each video clip. We designed five questions related to caption content comprehension; these questions are carefully designed to broadly cover the content of the video clips. The participants watch the clips in a task-free setting, and their results are converted to the percentage of correct answers. We compare the proposed high-contrast (HC) driven method with the following three methods. In the first, the position of the subtitle is fixed at the bottom of the frame. In the second, the subtitles are inserted at a position with a low saliency value, which we call the low-saliency (LS) driven method. The third is also based on high contrast, but the saliency map is estimated by static saliency detection (Static). As shown in Figure 8, our method outperforms all the other saliency detection methods for subtitle insertion. This demonstrates that the video saliency estimation method proposed in this work can find the most suitable position to insert the subtitle, where the subtitle is informative to the audience. In contrast, the image saliency based method performs worse since the inserted subtitle is not close enough to the salient regions, and the fixed caption performs worst since the audience cannot focus on the subtitle and the video content at the same time.

Figure 8: Evaluation results of the four methods in terms of content comprehension. The compared methods are Fixed, Low Saliency driven (LS), Static image saliency based (Static) and High Contrast driven (HC). The vertical axis represents the sum of the scores obtained by each group of participants; a higher score indicates better performance.

We further compare the four subtitle insertion schemes, i.e., Fixed, LS, Static (image saliency based) and HC, in terms of user impression. We invite another 15 evaluators, who are asked to indicate their satisfaction with respect to the following aspects: 1) Enjoyment: how enjoyable is the video? 2) Convenience: how convenient is the visual appearance of the subtitle? 3) Preference: how much do you prefer this captioning method? 4) Experience: how much does the caption help you experience the video? For each sample, the participant rates each method on a 5-point scale from best (5) to worst (1) [16], and the video order is randomized. Figure 9 depicts the results of the user impression evaluation. Our method generally outperforms the others in all aspects since it optimizes the allocated position, while low-saliency driven captioning yields relatively low scores due to its uncommon placement within the video frames.

Figure 9: Evaluation results on user impression. Four methods are compared, namely Fixed, Low Saliency driven (LS), Static image saliency based (Static) and High Contrast driven (HC), in terms of four criteria: Enjoyment, Convenience, Experience and Preference. Each user assigns a score between 1 (most unsatisfactory) and 5 (most satisfactory) for each criterion.

8. DISCUSSIONS AND FUTURE WORK

In this work, we have conducted a comparative study between static saliency and dynamic saliency. To the best of our knowledge, this is the first research attempt to investigate this problem in depth. We first build datasets of human fixations on both images and videos for comparison purposes, and then report several important observations on the relationship between static and dynamic saliency. Inspired by these observations, we propose the novel CMASS learning framework, which fuses static saliency into dynamic saliency estimation to improve video saliency prediction. Extensive experimental evaluations on the constructed datasets demonstrate the effectiveness of the proposed method for video saliency prediction. We also apply the video saliency prediction method to the application of helping people with hearing impairments watch videos with narration. Future work includes extensive user studies to explore the potential of our approach under different conditions and in different application domains.

9. ACKNOWLEDGEMENT

This work is partially supported by the Singapore Ministry of Education under research grant MOE2010-T2-1-087. This work is also partially funded by the Singapore A*STAR SERC grant "Characterizing and Exploiting Human Visual Attention for Automated Image Understanding and Description". Dr. Qi Tian is supported by ARO grant W911BF-12-1-0057, NSF IIS 1052851, Faculty Research Awards from Google, FXPAL, and NEC Laboratories of America, and a 2012 UTSA START-R Research Award. This work is supported in part by NSFC 61128007. The authors would like to thank Jiashi Feng and Prerna Chikersal for valuable discussions.

10. REFERENCES

[1] S. Avidan and A. Shamir. Seam carving for content-aware image resizing. TOG, 2007.

[2] O. Boiman and M. Irani. Detecting irregularities in images and in video. IJCV, 2007.

[3] R. Carmi and L. Itti. Visual causes versus correlates of attentional selection in dynamic scenes. Vision Research, 2006.

[4] M. Cerf, J. Harel, W. Einhauser, and C. Koch. Predicting human gaze using low-level saliency combined with face detection. In NIPS, 2007.



[5] W. Cheng, W. Chu, J. Kuo, and J. Wu. Automatic video region-of-interest determination based on user attention model. In ISCAS, 2005.

[6] W. Einhauser, M. Spain, and P. Perona. Objects predict fixations better than early saliency. Journal of Vision, 2008.

[7] L. Elazary and L. Itti. A Bayesian model for efficient visual search and recognition. Vision Research, 2010.

[8] B. Ghanem, T. Zhang, and N. Ahuja. Robust video registration applied to field-sports video analysis. In ICASSP, 2012.

[9] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In NIPS, 2006.

[10] R. Hong, M. Wang, M. Xu, S. Yan, and T. Chua. Dynamic captioning: video accessibility enhancement for hearing impairment. In ACM Multimedia, 2010.

[11] T. Judd, K. A. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In ICCV, 2009.

[12] W. Kienzle, M. Franz, B. Scholkopf, and F. Wichmann. Center-surround patterns emerge as optimal predictors for human saccade targets. Journal of Vision, 2009.

[13] C. Lang, T. Nguyen, H. Katti, K. Yadati, M. Kankanhalli, and S. Yan. Depth matters: Influence of depth cues on visual saliency. In ECCV, 2012.

[14] O. Le Meur, P. Le Callet, and D. Barba. Predicting visual fixations on video based on low-level visual features. Vision Research, 2007.

[15] J. Li, Y. Tian, T. Huang, and W. Gao. Probabilistic multi-task learning for visual saliency estimation in video. IJCV, 2010.

[16] R. Likert. A technique for the measurement of attitudes. Archives of Psychology, 1932.

[17] S. Marat, M. Guironnet, and D. Pellerin. Video summarization using a visual attention model. ESPC, 2007.

[18] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.

[19] S. Mathe and C. Sminchisescu. Dynamic eye movement datasets and learnt saliency models for visual action recognition. In ECCV, 2012.

[20] T. Mei, L. Li, X. Hua, and S. Li. ImageSense: Towards contextual image advertising. TOMCCAP, 2012.

[21] N. Murray, M. Vanrell, X. Otazu, and C. A. Parraga. Saliency estimation using a non-parametric low-level vision model. In CVPR, 2011.

[22] T. Nguyen, S. Liu, B. Ni, J. Tan, Y. Rui, and S. Yan. Sense beauty via face, dressing, and/or voice. In ACM Multimedia, 2012.

[23] T. Nguyen, S. Liu, B. Ni, J. Tan, Y. Rui, and S. Yan. Towards decrypting attractiveness via multi-modality cues. TOMCCAP, 2013.

[24] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.

[25] R. Peters and L. Itti. Beyond bottom-up: Incorporating task-dependent influences into a computational model of spatial attention. In CVPR, 2007.

[26] S. Ramanathan, H. Katti, N. Sebe, M. Kankanhalli, and T. Chua. An eye fixation database for saliency detection in images. In ECCV, 2010.

[27] B. Russell, A. Torralba, K. Murphy, and W. Freeman. LabelMe: A database and web-based tool for image annotation. IJCV, 2008.

[28] B. Tatler. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision, 2007.

[29] B. Velichkovsky, M. Pomplun, J. Rieser, and H. Ritter. Eye-movement-based research paradigms. In Visual Attention and Cognition, 2009.

[30] J. Wang, L. Quan, J. Sun, X. Tang, and H. Shum. Picture collage. In CVPR, 2006.

[31] Y. Zhai and M. Shah. Visual attention detection in video sequences using spatiotemporal cues. In ACM Multimedia, 2006.

[32] L. Zhang, M. Tong, T. Marks, H. Shan, and G. Cottrell. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 2008.

[33] Q. Zhao and C. Koch. Learning visual saliency by combining feature maps in a nonlinear manner using AdaBoost. Journal of Vision, 2012.
