
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

MoVieUp: Automatic Mobile Video Mashup

Yue Wu∗, Tao Mei†, Senior Member, IEEE, Ying-Qing Xu‡, Senior Member, IEEE, Nenghai Yu∗, Member, IEEE, and Shipeng Li†, Fellow, IEEE

Abstract—With the proliferation of mobile devices, people are taking videos of the same events anytime and anywhere. Even though these crowdsourced videos are uploaded to the cloud and shared, the viewing experience is very limited due to monotonous viewing, visual redundancy, and bad audio-video quality. In this paper, we present a fully automatic mobile video mashup system that works in the cloud to combine recordings captured by multiple devices from different view angles and at different time slots into a single yet enriched and professional looking video-audio stream. We summarize a set of computational filming principles for multi-camera settings from a formal focus study. Based on these principles, given a set of recordings of the same event, our system is able to synchronize these recordings with audio fingerprints, assess audio and video quality, detect video cut points, and generate video and audio mashups. The audio mashup is the maximization of audio quality under the less switching principle, while the video mashup is formalized as maximizing video quality and content diversity, constrained by the summarized filming principles. Our system is different from any existing work in this field in three ways: 1) our system is fully automatic, 2) the system incorporates a set of computational domain-specific filming principles summarized from a formal focus study, and 3) in addition to video, we also consider audio mashup which is a key factor of user experience yet often overlooked in existing research. Evaluations show that our system achieves performance results that are superior to state-of-the-art video mashup techniques, thus providing a better user experience.

Index Terms—Mobile video, video-audio mashup, filming principle, cloud media computing

I. INTRODUCTION

We are now witnessing a rapid proliferation and renovation of cloud computing techniques. In this cloud computing world, people are taking and sharing more and more videos with few to no restrictions on mobility and high-speed connection to the cloud [1]. Among them, there exist multi-camera recordings which are captured simultaneously at the same event and partially overlap in time [2]. The viewing experience of such multi-camera recordings is very limited. First, it is time consuming to watch all of them one-by-one to get an overview of the event. Viewers will lose interest due to the monotonous view of a mobile video. Second, the contents of videos captured by different people can be similar, making them bad choices either for collecting or sharing.

Copyright (c) 2014 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected].

Y. Wu and N. Yu are with the Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China (e-mail: [email protected]; [email protected]).

T. Mei and S. Li are with Microsoft Research, Beijing 100080, China (e-mail: [email protected]; [email protected]).

Y.-Q. Xu is with the Department of Information Art & Design, Tsinghua University, Beijing 100084, China (e-mail: [email protected]).


Fig. 1. An illustration of a video mashup. Given a set of recordings of the same event, our system is able to generate a mashup video based on a set of summarized filming principles. Selected parts of the source video recordings are shown in deep colors. The automatically generated mashup video provides a richer and much more professional view of the event than any single recording, thus significantly improving user experience.

Third, the quality of these videos is not guaranteed as they are often captured under poor conditions by amateurs using handheld devices. To deal with these challenges, mobile video mashup, which synchronizes and combines multi-camera recordings into a single yet enriched and professional looking video-audio stream, has become an emerging research topic. Fig. 1 shows an illustration of a typical video mashup, where the mashup is based only on the video signal.

Numerous challenges have created barriers to producing a successful mobile video mashup. The first challenge comes from the quality of the input videos. Though imaging techniques have taken an enormous leap forward, video quality is affected significantly by shakiness, blurriness, and many other factors occurring during capturing. Besides, audio quality is limited by both the surroundings and the microphone itself. To the best of our knowledge, there has been limited research on non-intrusive audio quality assessment. Another challenge is that we do not know how human editors would create the video mashup. Different editors may have different editing styles, making it hard to learn a general model from existing videos. Even if we knew the process of human editing, we would still need to transfer it to computable formulations. Such a transfer is the foundation for overcoming the two basic challenges in video mashup: when is the appropriate time to switch to another video/audio source, and how does the mashup system select the best video/audio source among all the candidates?

Some works addressing these challenges have been published previously. These approaches can be summarized into three categories: rule-based [3], optimization-based [2], and learning-based [4], [5]. Rule-based methods imitate the process of human editors. However, the requirements in mobile video mashup are more like user preferences than strict conditions. Shrestha et al. propose conducting video mashup by optimization, but only visual quality and diversity are actually optimized in their system [2]. Additionally, the optimization is only a local approximation through greedy search. Jiku Director is proposed to learn a transition matrix (shooting angle, distance to the target, and shot length) to select video shots [4], [5]. The learned editing rule is not video content-based. We argue that automatic video editing is an organic combination of film grammar and computable elements, while none of these methods solve the mashup problem in such a comprehensive way. Besides, they pay little attention to audio [2], [4], [5], which plays an important role in both the visual and aural experience. For video mashup, audio can be utilized to determine the switching frequency of shots (video cut point detection). The reason to do audio mashup is three-fold: 1) the final output is comprised of both visual and aural signals, so the audio source with the best quality should be selected as the final output; 2) almost none of the audio sources record the whole event in real cases; 3) good video quality does not imply that the corresponding audio is also good, so we cannot simply use the corresponding audio stream when a video is selected.

In this paper, we propose a new mobile video mashup system—MoVieUp, which integrates film grammar into an optimization problem with rule-based constraints. We summarize a set of computational filming principles from a focus study. Based on these principles, we design a system that generates a mashup of both audio and video. Specifically, source videos are synchronized with audio fingerprints. For audio mashup, we select the best quality parts of different audio sources and stitch them together to generate a full aural recording of the whole event under the less switching principle. For video mashup, we first detect cut points by measuring tempo suitability and semantic suitability on audio. Given these cut points, we formalize video mashup as the optimization of visual quality and diversity under the constraints of camera motion consistency. A post-processing step is performed for semantic completeness of videos. The final output is the combination of the video and audio mashup.

Our main contributions in this work are as follows:

1) We propose a fully automatic system for mobile video mashup, which is significantly different from previous works requiring various kinds of manual intervention.

2) The system incorporates a set of computational domain-specific filming principles summarized from a formal focus study. These computational principles provide the guidance for switching cameras and selecting the best cameras in the multi-camera setting.

3) In addition to video, we also consider audio mashup, which is a key factor of user experience yet often overlooked in existing research. This has significantly improved the overall watching experience of mobile video.

The rest of the paper is organized as follows. We discuss related works in Section II. Section III presents the focus study. Section IV describes our mobile video mashup system. We conduct experiments to evaluate the performance of the proposed system in Section V, followed by conclusions and discussions in Section VI.

II. RELATED WORK

Mobile video mashup has emerged as a popular research topic in recent years. Besides research on video mashup itself, there are also connections to video editing processes.

TABLE I
COMPARISON OF VIDEO MASHUP SYSTEMS

                 MoVieUp (this paper)   VD [2]^1      Jiku [5]
  diversity      Yes                    Yes           Yes
  shakiness      Yes                    Yes           Yes
  tilt           Yes                    No            Yes
  occlusion      Yes                    No            Yes
  audio mashup   Yes                    No            No
  cut point      Audio+Video            Manual        Learning^2

^1 VD is short for Virtual Director [2].
^2 The transition matrix for cut points learnt by Jiku Director is the same for all videos, thus not content-based.

A. Video mashup

There have been a few works on video mashup in recent years. Shrestha et al. propose an automatic mashup generation system from multiple-concert recordings [2]. They formulate mashup generation as an optimization of many factors like video quality, diversity, and cut point suitability. However, the authors do not present a practical method for measuring cut point suitability, making the system not fully automatic. Some video quality factors which are common in mobile videos, like tilt and occlusion, are not considered either. Saini et al. propose the Jiku Director to do online mobile video mashups [4], [5]. They learn a Hidden Markov Model (HMM) for shot selection and shot length determination. The problem with this approach is that there are many kinds of editing styles, both linear and non-linear. Shot selection and shot length should be content based, influenced by motion intensity, audio tempo, and many other factors [6]. Besides, the system is not fully automatic in that the accuracy of the automatic classification of camera locations is limited. Manual intervention is preferred, especially in the learning phase. Neither of the above methods considers audio mashup, nor do they pay enough attention to the principles of video editing. Arev et al. have recently proposed an automatic editing system for footage from multiple social cameras [7]. However, the system is highly dependent on 3D reconstruction of scenes, which often fails for the mobile videos we consider. There are a few other systems called mashup [8], [9]. However, they deal with the selection of video clips from different movies, which is quite different from the multi-camera setting of mobile video mashup.

In Table I, we provide a comparison of our proposed system with the two existing mashup systems from the perspective of both audio and video mashup.

B. Video editing

Mobile video mashup is also related to video editing, including video summarization, camera selection, and home/music video editing.

Video summarization. Video summarization shares a goal with video mashup in that both of them aim to maximize information content. Sundaram et al. propose a utility framework for automatic skim generation from computable shots [10]. They apply visual film syntax to the arrangement of shots (selection, scale, duration, order, etc.) [11], which serves as a reference for our mobile video mashup. Detection of computable shots is often based on a model that mimics human memory [12]. Similar models can be used to diversify the mashup video.

Camera selection. Camera selection has been widely explored in some specific scenarios like lectures and meetings. Cameras are often selected by recognizing speakers or detecting faces [13], [14]. Tracking and audio based localization are used in related systems [15], [16]. All the above mentioned methods can be classified as speaker-based. However, for video-audio mashup, we consider more general cases like a concert where speakers are not the only focus. A noisy environment and bad visual quality make it hard to do audio localization or face detection.

Home/music video editing. There have been many works that focus on home/music video editing. Hua et al. present AVE—an automated home video editing system, to extract a set of highlight shots from a set of home videos [17]. They propose two sets of rules to ensure representativeness of the original video, as well as the coordination between video and audio. A similar approach is extended to automatic music video generation, in which temporal structures of videos and music are analyzed for matching [6]. However, these systems cannot be applied to multi-camera recordings since mobile video mashup requires synchronization and video diversity.

III. FOCUS STUDY

Mobile video mashup is highly related to film editing conducted by human beings. In film editing, editors use film grammar to grasp the attention and emotions of an audience, using fragments of recorded time, arranged shadows, and sounds to convey a story. "Metaphorically, the 'grammar' of the film refers to theories that describe visual forms and sound combinations and their functions as they appear and are heard in a significant relationship during the projection of a film. Thus, film grammar includes the elements of motion, sound, pictures, color, film punctuation, editing, and montage." [18]. Basic elements of film grammar include shots, movement, and distances (full, medium, and close-up) [11]. Editors use various shot lengths, shooting angles, and arrangements of distances to express different meanings of the shots. For example, local object motion can be expressed with a medium shot, while a close-up is often used for static shots or shots with moderate motion.

Mobile video mashup mimics the process of film editing to convey the event originally expressed by the multi-camera recordings. Cut points in mobile video mashup correspond to the boundaries of different shots. Selection of video shots corresponds to the selection of camera positions. To create rich and professional looking videos, it is required to know the grammar of the film language and how human editors apply the grammar to video editing.

A major barrier to applying the film language to mobile video mashup is that the film language is not a set of strict rules. Though there have been many works discussing film grammar, and some of the observations in this section are not new, it is still essential to summarize existing filming principles and explore new guidelines that are both specifically related to mobile video mashup and feasible to interpret as computational rules, which lays the foundation for the proposed system. We survey a previous user study [2] and the literature on video editing [11], [19]. Further, we conduct a formal focus study centered on the two basic problems of video mashup:

• Switching of shots: when should the mashup switch to another video source?

• Selection of cameras/recorders: which video/audio source should be chosen next?

A. Participants and procedure

We invited a professor of Arts & Design and a graduate student in cinematography to attend the focus study. The professor has been working on video-related research for more than 20 years. The graduate student has much experience in video editing, particularly in collaboration with TV stations.

The focus study is loosely structured and conducted in a discussion format. We first show the typical capturing scenarios (like concerts and competitions) and major drawbacks (lighting, shakiness, occlusion, etc.) of mobile videos. Then we explain what mobile video mashup is and ask the two of them some questions about it. The questions are structured as follows:

• Video mashup discussion: This discussion is to investigate video switching frequency and video shot selection. We ask questions such as: Are there any requirements for the duration of a video shot? What factors will affect the duration and how do they affect it? Can we draw a certain connection between these factors and the duration? Are there any requirements for the switching? What will affect video quality? How does one avoid the monotony problem of mobile videos? How does one switch between video shots smoothly? Do you have any suggestions on how to improve the viewing experience?

• Audio mashup discussion: The discussion concerns the selection of audio with questions such as: When does one need to switch from one audio source to another? What kind of audio do you think is better? Are there any differences between video mashup and audio mashup? How can we concatenate audio fragments from different sources together?

B. Results of video mashup

Switching of shots. The two editors said that there should be a lower and an upper bound on the duration of shots. Too short a shot is incomprehensible, while too long a shot is boring. The bounds are not constant. Shrestha et al. choose a minimum of 3 seconds and a maximum of 7 seconds for concert videos [2]. However, the bounds may not be so strict. Longer shots can be used for moving shots or establishing shots. Moving shots bring new content to viewers, thus avoiding boredom. In establishing shots, a complete view with rich content is shown and boredom is not a concern either.


[Fig. 2 block diagram: components include demultiplexing of the camera recordings, synchronization with audio fingerprints, video structuring, audio quality assessment, audio analysis (tempo, semantic suitability), video analysis (quality, motion, and diversity), optimization-based audio mashup, cut point detection, optimization-based video selection, audio stitching, and multiplexing into the mashup video.]

Fig. 2. Framework of the proposed mobile video mashup system. The system consists of two main parts: video mashup and audio mashup.

The switching frequency of videos depends on both audio and video. Frequent switching is preferred for fast-paced audio, intense object motion, and rapid change in brightness. Smooth shots should be paired with less switching. There is no definite relationship between switching frequency and these factors. Whether it is a linear or non-linear correlation is up to the directing style.

Suitable switching should occur near speaking/singing intervals: the beginning of speaking/singing captures viewers' attention, and the ending releases it. To avoid interrupting the viewers' attention, switching near speaking/singing intervals is often a good choice.

Selection of cameras. Computational factors related to video shot selection include: video quality, diversity, camera motion, and semantic completeness.

• Video quality. Video quality ensures clarity and a pleasant viewing experience. Observations include: too dark or too bright frames should be excluded; blurred frames should be avoided; occluded frames will damage viewers' interest; tilting makes viewers uncomfortable, though it can bring some special effects in certain cases. The editors especially made a point of reminding us that bad effects caused by erratic and unprofessional camera motions (hand-shaking, rapid movement of the camerawork, etc.) must be excluded.

• Diversity. Diversity means an enriched and entertaining view of an event. According to the study, there are no definite principles to guide the selection of camera positions. Different editors may have different editing styles and thus different choices if many candidate camera positions are available. However, there exist some rules that should not be violated. For example, a core guideline is to avoid "Jump Cuts", which means that two sequential shots of the same subject are taken from camera positions that vary only slightly. The 30-degree rule indicates that there should be at least 30 degrees of difference between shooting angles to avoid a noticeable portion of overlap between adjacent shots. The editors recommend that if camera positions are unknown, the frame difference should be large enough to present some fresh content to viewers.

• Camera motion. The change of camera motion between neighboring shots should be smooth for a comfortable viewing experience. Unexpected camera motion can incur annoying visual impact. Some common principles include: (1) static shots should be connected with static shots; (2) moving shots should be placed near each other less often; (3) visual impact due to the connection of static and moving shots can be alleviated by slowing down the camera motion, by object motion in static shots, or by bridging shots.

• Semantic completeness. The two editors mentioned some semantic concerns about video mashup. Each recording is comprised of many self-contained semantic units (subshots, as we call them later). These self-contained semantic units should not be interrupted before viewers get a basic knowledge of them. Otherwise, the switching will be quite obtrusive.

C. Results of audio mashup

Unlike video mashup, switching between audio recordings should be as minimal as possible, which we call the Less Switching Principle. There is no monotony problem if there is little or no switching between different audio sources. Conversely, stitching audio shots from different sources will create more inconsistency problems due to variances in volume and tone, even when the shots are synchronized exactly. The inconsistency can be caused either by the microphones or by a recorder's surroundings, degrading the overall quality of the mashup audio. A high quality audio generated from one or a few sources is preferred. Considering the mobile scenario, clamorous audio should be avoided. Selected audio or audio fragments are expected to be clear and clean. Audio mashup should distinguish good audio fragments from noisy ones.

D. Computational filming principles

We summarize some computational filming principles from the focus study. These principles will be the foundations of the proposed mobile video mashup system.

• For video cut point detection, the following standards should be met: 1) there should be a lower and upper bound on shot duration; 2) the switching frequency of videos should be consistent with the audio tempo; 3) switching should take place near speaking/singing intervals.

• Selected video shots should fit the following criteria: 1) selected video shots should be clear and stable (without blurriness, occlusion, shakiness, etc.); 2) the frame difference of adjacent shots should be large to avoid Jump Cuts; 3) camera motion around cut points should be smooth and natural; 4) each selected shot should be complete in terms of semantics.


• Selected audio fragments should be: 1) clear and clean, and 2) chosen under the Less Switching Principle. That is, switching between audio fragments should be as minimal as possible.

IV. MOVIEUP SYSTEM

In this section, we describe the framework of the MoVieUp system based on the filming principles summarized above. Note that in later sections, video and audio denote the visual and aural content respectively. We use recording to denote both of them.

A. Overview

As analyzed before, the MoVieUp system consists of the generation of both audio mashup and video mashup. To clearly present the system framework, the following terms are clarified for video mashup:

• Shot: a shot is a fragment of a video that is selected as part of the mashup video.

• Subshot: a subshot is the basic unit of video. It contains consistent camera motion and self-contained semantics [20].

• Cut point: a cut point is the time point where the mashup video switches from one source to another. Note that at each cut point, there can be multiple candidate video sources.

1) Framework: Fig. 2 shows the framework of the proposed system. Given a list of recordings, MoVieUp is able to demultiplex them into audio and video, synchronize the recordings, generate the mashup video and audio according to the summarized filming principles, and multiplex them into the final video-audio stream. For audio mashup, we select audio fragments based on quality under the less switching principle. Video mashup is comprised of two steps: cut point detection and optimization-based video shot selection. The system detects cut points by matching switching frequency with audio tempo and avoiding speaking/singing interruption [6]. Video analysis is performed at the granularity of subshots. Given the detected cut points, video shot selection is formalized as maximizing video quality and diversity under constraints of camera motion consistency. Results of video mashup are fine-tuned for semantic completeness. Video stabilization is presented as an optional step to further improve the viewing experience. After the system finishes video and audio mashup, the two separate results are multiplexed into the final output.

2) Notations: Suppose there are $N$ source recordings $R = \{r_1, r_2, \ldots, r_N\}$. Each recording $r_i$ is demultiplexed into audio $a_i$ and video $v_i$. $r_i$ starts at time $t^{(s)}_i$ and ends at time $t^{(e)}_i$ (and so do $v_i$ and $a_i$). Suppose there are $M_a$ and $M_v$ shots in the mashup audio and video, respectively. We denote the $j$-th selection by $s^a_j$ and $s^v_j$ (the $j$-th shot). The superscript $a$ or $v$ distinguishes audio and video. The mashup audio and video are described as:

$\mathcal{M}^a = (s^a_1, s^a_2, \ldots, s^a_{M_a})$,
$\mathcal{M}^v = (s^v_1, s^v_2, \ldots, s^v_{M_v})$.


Fig. 3. Illustration of notations. Note that we use $t^a_{c_j}$ to denote the switching time of audio for convenience. $s^v_j$ denotes the $j$-th shot.

The duration of the $j$-th selection $s_j$ is denoted by $d(s_j)$. For video mashup, there are an upper bound $d_{max}$ and a lower bound $d_{min}$ on the duration of each selection: $d_{min} \le d(s^v_j) \le d_{max}$. A cut point $c_j$ is the time point where the video switches from $s^v_{j-1}$ to $s^v_j$. We denote this time point by $t_{c_j}$. The beginning of the event is regarded as a cut point for convenience, thus $t_{c_1} = 0$. Figure 3 illustrates the meaning of these symbols.
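To make the notation concrete, a minimal sketch of the corresponding data structures is given below. The class names, field types, and the duration property are our own illustration, not part of the paper.

```python
from dataclasses import dataclass

@dataclass
class Recording:
    """A synchronized source recording r_i with its span on the event timeline."""
    audio: object          # a_i, e.g. a waveform array
    video: object          # v_i, e.g. a frame sequence or file handle
    t_start: float         # t_i^(s), seconds on the common event timeline
    t_end: float           # t_i^(e)

@dataclass
class Shot:
    """One selection s_j of the mashup, taken from a single source between cut points."""
    source: int            # index of the recording it comes from
    t_start: float         # cut point t_{c_j}
    t_end: float           # next cut point t_{c_{j+1}}

    @property
    def duration(self) -> float:
        return self.t_end - self.t_start   # d(s_j), bounded by d_min and d_max for video
```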

B. Pre-processing

Before processing, the audio streams are resampled to 8 kHz for synchronization and quality assessment. The video frame rate is normalized to 25 frames/second. Frames are resized to the same resolution (640×360, for example).

Synchronization. Input recordings cover different periods of an event. We need to synchronize them for further processing, that is, to determine the start time $t^{(s)}_i$ and end time $t^{(e)}_i$, $\forall r_i \in R$. The basic assumption is that there exists at least one candidate recording at any time during the whole event. In our system, we adopt the synchronization method with audio fingerprints presented in [21], [22]. We first extract audio fingerprints and compare them to calculate the time offsets for each pair of audios. A voting scheme is then performed to mutually determine the time offsets of all the recordings.

Both video and audio mashup must follow a synchronization constraint: the starting time of the selected item must be earlier than the current cut point, and its end time later than the next cut point:

$t^{(s)}_{s_j} \le t_{c_j} \le t_{c_{j+1}} \le t^{(e)}_{s_j}$    (1)

If a candidate audio/video does not satisfy this constraint at a switching time point, it will be excluded from consideration.
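To make the synchronization step concrete, the sketch below estimates a pairwise time offset by cross-correlating coarse audio-energy envelopes and checks the constraint of equation (1). This is a simplified stand-in for the fingerprint matching and voting of [21], [22]; the function names and the envelope-based matching are our own illustration, not the paper's implementation.

```python
import numpy as np

def pairwise_offset(env_a, env_b, hop_s):
    """Estimate how much later recording b starts than recording a (in seconds)
    by cross-correlating coarse audio-energy envelopes sampled every hop_s seconds."""
    a = (env_a - env_a.mean()) / (env_a.std() + 1e-9)
    b = (env_b - env_b.mean()) / (env_b.std() + 1e-9)
    corr = np.correlate(a, b, mode="full")       # lag axis runs from -(len(b)-1) to len(a)-1
    lag = int(np.argmax(corr)) - (len(b) - 1)
    return lag * hop_s                           # > 0 means b starts later than a

def satisfies_sync_constraint(t_start, t_end, t_cut, t_next_cut):
    """Eq. (1): a candidate must cover the whole segment between two cut points."""
    return t_start <= t_cut and t_next_cut <= t_end
```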

Video Structuring. Operations on videos can be performed at three temporal layers: frame, subshot, and shot. As discussed in [23], frame-level operation is not only time-consuming but also difficult for further content analysis, since the frame is not the most informative semantic unit of videos. A shot is a physical video structure resulting from the user's start and stop operations. It usually lasts a relatively long period of time and contains inconsistent content. In our mashup system, the subshot is the basic unit of video mashup, as it contains consistent camera motion and self-contained semantics [20]. We apply the color and motion threshold based algorithm described in [20] to structure the videos.
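The sketch below illustrates the flavor of such threshold-based structuring: subshot boundaries are placed where the color content or the estimated camera motion changes noticeably. The thresholds, the minimum subshot length, and the feature definitions are illustrative assumptions; the exact detector of [20] is more elaborate.

```python
import numpy as np

def segment_subshots(hist_diff, motion_mag, color_th=0.5, motion_th=0.3, min_len=25):
    """Split a video into subshots at large color or camera-motion changes (sketch).

    hist_diff[t]  - color-histogram distance between frame t and frame t-1, in [0, 1]
    motion_mag[t] - estimated global (camera) motion magnitude at frame t
    min_len       - minimum subshot length in frames (25 frames = 1 s at 25 fps)
    """
    boundaries = [0]
    for t in range(1, len(hist_diff)):
        changed = (hist_diff[t] > color_th or
                   abs(motion_mag[t] - motion_mag[t - 1]) > motion_th)
        if changed and t - boundaries[-1] >= min_len:
            boundaries.append(t)
    boundaries.append(len(hist_diff))
    return list(zip(boundaries[:-1], boundaries[1:]))   # (start, end) frame ranges
```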


C. Audio mashup

Audio mashup is the process of combining all the audio recordings into a single but complete sound recording covering the whole event. As stated before, each audio may only record part of the whole event. Audio mashup works by assessing audio quality and selecting the highest-quality source under the less switching principle.

According to the results of our focus study, the main goal of audio mashup is to maximize the overall quality of the selected audio fragments $Q(\mathcal{M}^a)$. Recall that $Q(\mathcal{M}^a)$ does not simply equal the sum of the qualities of all selected audio fragments, due to the fact that switching audio will degrade the overall quality. We studied mobile audio carefully and found that the quality does not fluctuate dramatically and frequently within each audio recording. Based on this observation, we embody the less switching principle as a hard constraint on switching to another audio source. Let $q_{s^a_j}(t)$ represent the quality score of $s^a_j$ at time $t$. Switching is only performed when the quality of some other audio is much better than the current one:

$q_{s^a_{j+1}}(t) > \gamma \cdot q_{s^a_j}(t)$,    (2)

where $\gamma$ penalizes the switching and is set to 1.2 in our experiments.

We adopt a greedy search strategy to generate the mashup audio by checking the quality of the candidate audios every second. Switching to a new audio source happens when the current one ends or another audio is much better than the current one (equation (1) and equation (2)).
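A minimal sketch of this greedy strategy is given below, assuming the per-source quality curves and the synchronized start/end times are already available; the function and parameter names are ours, not the paper's.

```python
def greedy_audio_mashup(quality, starts, ends, event_len, gamma=1.2, step=1.0):
    """Greedy audio selection under the less-switching principle (sketch of eq. 2).

    quality[i](t) - quality score of audio i at event time t (e.g., a P.563-style MOS)
    starts/ends   - synchronized start and end times of each audio recording
    Every `step` seconds the current source is kept unless it has ended or another
    source beats it by the factor gamma.  Returns [source, seg_start, seg_end] triples.
    """
    segments, current, t = [], None, 0.0
    while t < event_len:
        candidates = [i for i in range(len(starts)) if starts[i] <= t < ends[i]]
        if candidates:
            best = max(candidates, key=lambda i: quality[i](t))
            must_switch = current is None or current not in candidates
            if must_switch or quality[best](t) > gamma * quality[current](t):
                current = best
                segments.append([current, t, t])     # open a new segment
            segments[-1][2] = t + step               # extend the active segment
        t += step
    return segments
```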

1) Quality assessment: In the above solution, we need to evaluate audio quality. As far as we know, few works have focused on non-intrusive audio quality assessment, except for some on speech signals [24]. Li et al. model non-reference audio quality assessment as a learning-to-rank problem [25]. As they claim, it seems to be the first approach for music audio. This approach is not applicable in our scenario, as we expect meaningful quality scores in equation (2).

In our system, we use P.563, a non-intrusive speech quality assessment algorithm [26], to assess audio quality. Such an approximation is based on the assumption that mobile multi-camera recordings are often captured at events where speaking or singing is the major signal. We assess audio quality on a five second sliding window with a time step of one second.
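The windowing itself is straightforward; the sketch below shows it, with `score_fn` standing in for a P.563 implementation (we do not assume any particular library binding).

```python
import numpy as np

def windowed_quality(samples, sr, score_fn, win_s=5.0, hop_s=1.0):
    """Score audio on a 5 s sliding window with a 1 s hop, as described above.

    score_fn is assumed to wrap a P.563-style estimator that returns a MOS-like
    value for a chunk of 8 kHz samples; it is a placeholder, not a real API.
    """
    win, hop = int(win_s * sr), int(hop_s * sr)
    scores = [score_fn(samples[start:start + win])
              for start in range(0, max(1, len(samples) - win + 1), hop)]
    return np.asarray(scores)      # scores[k] covers roughly [k, k + win_s) seconds
```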

To further verify whether P.563 works in our setting, we randomly select four mobile concert audio recordings and download the corresponding music, which we call the reference audio. We evaluate the quality of these two types of audio with P.563 respectively. The quality scores are shown in Fig. 4. In audio mashup, we want to select audio fragments that have good quality and last long, like the beginning of Audio 2 in Fig. 4.

2) Audio stitching: In the final step, we need to stitch the audio fragments into the mashup audio. Audio is different from video in that people are sensitive to sudden changes. As the input audio recordings are recorded in different locations and surroundings, directly concatenating these audio fragments will result in sudden changes due to varying volume and tone. To overcome this problem, we first apply DC offset correction to adjust the audio volume gain.


Fig. 4. An example of audio quality assessment. We randomly select four mobile concert audio recordings and download the corresponding music, which we call the reference audio. We evaluate the quality of these two types of audio with P.563 respectively. The left plot shows the quality of the four mobile audio recordings; the right plot shows the quality of the reference audios. The horizontal axis is the timeline and the vertical axis is the quality. MOS is short for Mean Opinion Score. The scores of the mobile audios are much lower than those of the reference audios, which verifies the efficacy of P.563 on mobile audio.

To alleviate the seam effect when concatenating audio, an image blending algorithm with a Laplacian pyramid is adopted to stitch the audio fragments together.
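As a rough illustration of this stitching step, the sketch below applies DC offset correction, matches loudness, and joins two synchronized fragments with an equal-power crossfade. This is a simplified stand-in: the paper's Laplacian-pyramid blending operates on multiple frequency bands, which we do not reproduce here.

```python
import numpy as np

def stitch(prev, nxt, sr, fade_s=0.5):
    """Join two synchronized audio fragments at a switch point (simplified sketch)."""
    prev = prev - prev.mean()                      # DC offset correction
    nxt = nxt - nxt.mean()
    rms_p = np.sqrt(np.mean(prev ** 2)) + 1e-9     # match loudness of the incoming source
    rms_n = np.sqrt(np.mean(nxt ** 2)) + 1e-9
    nxt = nxt * (rms_p / rms_n)

    n = min(int(fade_s * sr), len(prev), len(nxt))
    t = np.linspace(0.0, np.pi / 2, n)
    overlap = prev[-n:] * np.cos(t) + nxt[:n] * np.sin(t)   # equal-power crossfade
    return np.concatenate([prev[:-n], overlap, nxt[n:]])
```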

D. Video cut point detection

Video mashup is to select shots from the input videos and combine them together into a single yet enriched and professional looking video stream. It is divided into two steps: cut point detection and video shot selection. In this section, we discuss cut point detection, which determines when the mashup video switches from one video source to another.

1) Formulation: According to the filming principles, video cut point detection is a comprehensive concern of both audio (tempo, speaking/singing intervals) and video (camera motion, subshot integrity). The system first detects candidate cut points from the mashup audio. As to video, we require motion consistency and semantic completeness at the candidate cut points later. We propose two suitability scores for cut point detection: tempo suitability $S_T(t)$ and semantic suitability $S_S(t)$. Tempo suitability controls the switching frequency with respect to the audio tempo. Semantic suitability avoids interruption of speaking/singing. A new cut point is detected on the assumption that all cut points earlier than it have already been determined. The problem is formulated as:

$t_{c_j} = \arg\min_t \left\{ S_T(t \mid t_{c_{j-1}}) + S_S(t \mid t_{c_{j-1}}) \right\}$, s.t. $d_{min} \le t - t_{c_{j-1}} \le d_{max}$.    (3)

2) Tempo suitability $S_T(t \mid t_{c_{j-1}})$: The video switching frequency should be consistent with the audio tempo. Fast-paced audio should be paired with frequent switching. Less switching is preferred in smooth events. We use the interval between audio onsets to approximate audio tempo [17], [27]. As we discussed in Section III, there is no definite relationship between audio tempo and switching frequency. We map the tempo $b(t)$ at time $t$ to the expected duration $d(b(t))$ linearly as:

$d(b(t)) = d_{max} - \dfrac{d_{max} - d_{min}}{b_{max} - b_{min}} \left( b(t) - b_{min} \right)$,    (4)

where $b_{max}$ and $b_{min}$ are the maximum and minimum tempo of the audio.



Fig. 5. Comparison of video cut point detection with different values of δ on two randomly selected audios. The values are in seconds. It can be observed that the detected cut points are similar when δ is less than 0.2.

The tempo suitability at time $t$ is then measured by

$S_T(t \mid t_{c_{j-1}}) = \left| \int_{t_{c_{j-1}}}^{t} \frac{1}{d(b(t))} \, dt - 1 \right|$.    (5)

3) Semantic suitability $S_S(t)$: Semantic suitability $S_S(t)$ means that we should avoid switching that attracts user attention. We achieve this goal by selecting points where the audio energy is low, like singing or speaking intervals. Semantic suitability is thus measured by the audio energy $e(t)$, normalized to $[0, 1]$:

$S_S(t) = e(t)$.    (6)

4) Problem solution: It is hard to give a closed-form solution to the continuous objective function in (3), as the tempo suitability and semantic suitability are not smooth functions. Instead, the timeline is discretized with a step of δ seconds. The system enumerates all candidate time points in the range $[d_{min}, d_{max}]$ after the previous cut point. The solution is then:

$t_{c_j} = t_{c_{j-1}} + K^* \delta_t$, where

$K^* = \arg\min_K \left\{ S_T(K) + S_S(K) \right\}$,
$S_T(K) = \left| \sum_{k=1}^{K} \dfrac{\delta_t}{d(b(k))} - 1 \right|$,
$S_S(K) = e(K) = e(t_{c_{j-1}} + K \delta_t)$,
$b(k) = b(t_{c_{j-1}} + k \delta_t)$.    (7)

In the ideal case, the value of δ should be small enough to approach the continuous objective function. To select an appropriate value for δ, we randomly select some audios and detect cut points with different values of δ. Fig. 5 shows the example of two randomly selected audios. We find that when δ is less than 0.2 seconds, the detected cut points are very similar. As a result, we set δ to 0.1 seconds in our experiments. Let $d(e)$ denote the duration of the event; the number of candidate time points to enumerate in the above function is $d(e)/\delta$, which is typically on the order of thousands for a few minutes of video and costs little time in the cloud.
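The sketch below puts equations (4)-(7) together for a single cut point. `tempo` and `energy` are assumed to be precomputed callables over the mashup audio; for self-containment, $b_{min}$ and $b_{max}$ are taken over the current search window rather than over the whole audio as in the text.

```python
import numpy as np

def next_cut_point(t_prev, tempo, energy, d_min=3.0, d_max=7.0, delta=0.1):
    """Pick the next cut point after t_prev by minimizing S_T + S_S (sketch of eqs. 4-7).

    tempo(t)  - onset-based tempo b(t) of the mashup audio
    energy(t) - audio energy e(t), normalized to [0, 1]
    """
    b_samples = [tempo(t_prev + k * delta) for k in range(1, int(d_max / delta) + 1)]
    b_min, b_max = min(b_samples), max(b_samples)

    def expected_duration(b):                      # eq. (4): linear tempo-to-duration map
        if b_max == b_min:
            return 0.5 * (d_min + d_max)
        return d_max - (d_max - d_min) * (b - b_min) / (b_max - b_min)

    best_t, best_cost, acc = None, np.inf, 0.0
    for k, b in enumerate(b_samples, start=1):
        acc += delta / expected_duration(b)        # running sum inside S_T(K) of eq. (7)
        t = t_prev + k * delta
        if t - t_prev < d_min:                     # enforce the lower duration bound
            continue
        cost = abs(acc - 1.0) + energy(t)          # S_T(K) + S_S(K)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```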

E. Video shot selection

Video shot selection determines which video source to switch to at each cut point, given the cut points detected above. In this section, we formulate video shot selection as an optimization problem constrained by camera motion consistency.

1) Formulation: As observed in the focus study, video shot selection is a comprehensive consideration of video quality, diversity, camera motion, and semantic completeness. As semantic completeness depends on the selected shots, we consider it in post-processing. Video shot selection is formalized as maximizing video quality and diversity under the constraints of camera motion consistency. Specifically, we do not allow camera motion around cut points, in order to simplify the camera motion consistency constraint. Such a setting can: (a) alleviate the visual impact caused by connecting moving shots; (b) smoothen the change of camera motion, as the constraints do not cause any interruption in camera motion.

We denote the camera motion vectors at the left and right sides of the $j$-th selected shot $s^v_j$ by $m^-(s^v_j)$ and $m^+(s^v_j)$. The motion consistency constraint is represented by:

$m^-(s^v_j) = m^+(s^v_j) = 0$.    (8)

Let $Q(\mathcal{M}^v)$ represent the quality of the mashup video, and $D(\mathcal{M}^v)$ the diversity. The formulation of video shot selection is:

$\mathcal{M}^v = \arg\max_{(s^v_1, \ldots, s^v_{M_v})} \left\{ Q(\mathcal{M}^v) + D(\mathcal{M}^v) \right\}$, s.t. $m^-(s^v_j) = m^+(s^v_{j-1}) = 0, \; \forall j \in [2, M_v]$.    (9)

2) Video quality assessment: Unlike audio, video quality does not degrade with switching. The overall video quality is calculated as:

$Q(\mathcal{M}^v) = \sum_{j=1}^{M_v} Q(s^v_j)$,    (10)

where $Q(s^v_j)$ is the quality score of shot $s^v_j$ over the period $[t_{c_j}, t_{c_{j+1}}]$.

We employ a non-reference video quality assessment technique to measure mobile video quality. Video quality is measured based on six aspects in two categories: temporal and spatial [23]. Temporal factors, including unstableness and jerkiness, are caused by erratic camera motion, whereas spatial factors (infidelity, brightness, blurring, and tilting) are due to a poor capturing environment. The six factors of video shot $s^v_j$, denoted by $u^v_{j,i} \in [0, 1], i \in [1, 6]$, are evaluated as unsuitability scores at the granularity of subshots. The overall unsuitability score $U$ and quality score $Q$ of a subshot are combined in two ways:

• Pre-filter bad quality candidate subshots. Any subshot with a quality score on any factor lower than its threshold is regarded as unacceptable, with its unsuitability score set to a large value (1,000 in our experiment) to ensure that the shot is not selected, except in some cases where the video is still the best choice.

• An objective term in the optimization formulation. The overall score is calculated with a rule-based method [23]:

$U(s^v_j) = E(u^v_j) + \dfrac{1}{10 + 6\gamma} \sum_{i=1}^{6} \left( u^v_{j,i} - E(u^v_j) \right)$,
$Q(s^v_j) = 1 - U(s^v_j)$,    (11)

where $E(u^v_j)$ is the average of the six quality scores of shot $s^v_j$, and $\gamma$ is a predefined constant which controls the amount of difference between $u^v_{j,i}$ and $E(u^v_j)$ and is set to an empirical value of 0.20, the same as in [23]. The quality of a video shot $Q(s^v_j)$ is the minimum subshot quality in the shot.

3) Diversity: The overall diversity is calculated as the cumulative diversity of all the selections:

$D(\mathcal{M}^v) = \sum_{j=1}^{M_v} D(s^v_j)$,    (12)

where $D(s^v_j)$ is the diversity score of the $j$-th selection. It is mostly related to the content that viewers have seen and that is still fresh in their memory. We measure diversity by how much a candidate selection will recall such content in memory. $D(s^v_j)$ is calculated as:

$D(s^v_j) = D(s^v_j, s^v_{j-1})$.    (13)

Similar to the memory model in [12], the recall and diversity between two shots $a$ and $b$ are calculated as:

$R(a, b) = s(a, b) \cdot f_a \cdot f_b \cdot \left( 1 - \dfrac{\Delta t}{T_m} \right)$,
$D(a, b) = 1 - R(a, b)$,    (14)

where $T_m$ is the memory size, $s(a, b)$ is the similarity of the two shots, which we measure by the SSIM [28] distance of keyframes of the two shots, $f_a$ and $f_b$ are the ratios of the shot lengths to the memory size $T_m$, and $\Delta t$ is the time difference between the two subshots.
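A direct reading of equations (13)-(14) is sketched below. The memory size of 60 seconds and the clamp that keeps the recall non-negative once the gap exceeds the memory size are our own assumptions, not values from the paper.

```python
def diversity(sim, len_a, len_b, dt, memory_s=60.0):
    """Recall-based diversity between two shots a and b (sketch of eqs. 13-14).

    sim          - SSIM-style keyframe similarity of the two shots, in [0, 1]
    len_a, len_b - shot lengths in seconds
    dt           - time difference between the two shots in seconds
    memory_s     - memory size T_m (assumed value)
    """
    fa, fb = len_a / memory_s, len_b / memory_s
    recall = sim * fa * fb * max(0.0, 1.0 - dt / memory_s)
    return 1.0 - recall
```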

4) Problem solution: To solve the optimization problem in (9), the constraints are represented by an indicator function $I(s^v_j)$, which is zero when the constraints are satisfied. Otherwise, if the camera motion constraint is not satisfied, it takes a large penalty value (1,000 in our experiment) to ensure the shot is not selected, except when it is the only choice. The problem is thus formulated as:

$\mathcal{M}^v = \arg\min_{(s^v_1, \ldots, s^v_{M_v})} \left\{ \sum_{j=1}^{M_v} \left( U(s^v_j) + I(s^v_j) \right) + \sum_{j=2}^{M_v} R(s^v_j, s^v_{j-1}) \right\}$.    (15)

The above optimization problem can be defined recursively. Let $f(s^v_m : s^v_n)$ denote the optimal value of the above objective function from cut point $m$ to $n$ ($m$ is not included), as equation (16) shows:

$f(s^v_m : s^v_n) = \min_{(s^v_{m+1}, \ldots, s^v_n)} \left\{ R(s^v_m, s^v_{m+1}) + \sum_{j=m+1}^{n} \left( U(s^v_j) + I(s^v_j) \right) + \sum_{j=m+2}^{n} R(s^v_{j-1}, s^v_j) \right\}$
$= \min_{s^v_{m+1}} \left\{ U(s^v_{m+1}) + I(s^v_{m+1}) + R(s^v_m, s^v_{m+1}) + f(s^v_{m+1} : s^v_n) \right\}$,

where $1 \le m \le n \le M_v$.    (16)

The recursive equation (16) has the optimal substructure property. In other words, to optimize $f(s^v_m : s^v_n)$, we have to optimize $f(s^v_{m+1} : s^v_n)$ for every possible choice of $s^v_{m+1}$. Taking advantage of the optimal substructure property and the recursive function, we can solve the optimization problem with dynamic programming to approach the global optimum.

We can assume there exists a virtual $s^v_0$ which is unique and satisfies all the constraints. The original optimization problem (15) is then solved by optimizing equation (17) with dynamic programming and the recursive equation (16):

$\mathcal{M}^v = \arg\min_{(s^v_1, \ldots, s^v_{M_v})} f(s^v_0 : s^v_{M_v})$, with $U(s^v_0) = I(s^v_0) = 0$ and $R(s^v_0, s^v_1) = 0$.    (17)
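A compact dynamic-programming sketch of equations (15)-(17) is shown below. It assumes the per-candidate costs $U + I$ and the pairwise recall values have already been computed for every cut point; the data layout and function name are illustrative.

```python
import numpy as np

def select_shots(cost, recall):
    """Dynamic-programming shot selection over cut points (sketch of eqs. 15-17).

    cost[j][c]      - U + I of candidate source c at cut point j
    recall[j][p][c] - R between candidate p at cut point j-1 and candidate c at cut point j
    Returns the index of the chosen candidate at every cut point.
    """
    M = len(cost)
    f = [np.asarray(cost[0], dtype=float)]            # virtual s0: no recall term at j = 0
    back = [np.zeros(len(cost[0]), dtype=int)]

    for j in range(1, M):
        fj = np.empty(len(cost[j]))
        bj = np.empty(len(cost[j]), dtype=int)
        for c in range(len(cost[j])):
            totals = [f[j - 1][p] + recall[j][p][c] for p in range(len(cost[j - 1]))]
            bj[c] = int(np.argmin(totals))
            fj[c] = cost[j][c] + totals[bj[c]]
        f.append(fj)
        back.append(bj)

    choice = [int(np.argmin(f[-1]))]                  # trace back the optimal selections
    for j in range(M - 1, 0, -1):
        choice.append(int(back[j][choice[-1]]))
    return list(reversed(choice))
```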

5) Post-processing: As we observed in the focus study, each recording is comprised of many self-contained semantic units—subshots. The previous procedures do not incorporate the constraint of such semantic completeness. Selected subshots next to the cut points may be too short to be comprehensible. Besides, shakiness caused by erratic camera motion should be further reduced for a better viewing experience.

As a result, we apply two post-processing steps, semantic completeness and video stabilization, to the output of video shot selection. For semantic completeness, we require another duration constraint on the subshots next to cut points. Specifically, if a subshot lasts less than one second, we tune the corresponding cut point slightly to satisfy this duration constraint. For stability, we apply video stabilization to alleviate the shakiness in mobile videos [29].

V. EXPERIMENTS

In this section, we evaluate the performance of the proposed mobile video mashup system. Fig. 6 shows an example of mobile video mashup. Along the timeline are the candidate videos captured by mobile devices. The system selects different shots of these candidates, as the green borders show.

The main goal of the experiments is to evaluate the system based on three criteria. 1) The first criterion is whether audio mashup improves audio quality; a comparison is conducted with the approach in the Virtual Director system [2], which selects the audio fragment of each selected video shot. 2) The second is to evaluate whether the proposed cut point detection works compared with manual labelling and the learning algorithm of [5]. 3) The third is whether the proposed video mashup algorithm achieves a better viewing experience; the baselines are the two previous mashup systems, Virtual Director [2] and Jiku Director [5]. Evaluation focuses on video quality, diversity, stability, and overall viewing experience.

A. Dataset

We collect 46 mobile recordings of six events from Youtube^1, which is the largest dataset in existing works. Each recording contains both audio and video streams. These recordings were all captured by non-professionals during concerts using mobile devices. The quality issues mentioned previously are common in the dataset. 14 recordings of the first three events are the same as those used in Virtual Director. We use the output videos provided by the authors^2 for comparison. The remaining 32 recordings are submitted to Jiku Director for the mashup videos.

^1 http://www.youtube.com
^2 Test videos: http://www.youtube.com/AutomaticMashup



Fig. 6. An example of mobile video mashup. Recordings are captured by mobile devices. Above the timeline is the audio mashup. The waves represent the audio energy. Only Audio 2 and Audio 1 are selected for the audio mashup. The video mashup is shown below the timeline. Each recording is represented by frames from time 1'00" to 4'30" sampled every 30 seconds. The selection of cameras at these sampling points is represented by the green borders and lines.

TABLE II
DETAILS OF THE DATASET AND THE OPTIMIZATION TIME

  Event   #Recordings   Duration   Sync.    Audio Mashup   Cut Point Detection   Video Mashup
  E1      5             4'37"      1'44"    0.11"          1.58"                 26.19"
  E2      5             7'01"      2'38"    0.14"          2.26"                 36.26"
  E3      4             5'15"      27"      0.11"          1.90"                 10.26"
  E4      8             6'25"      2'55"    0.10"          2.11"                 2'49"
  E5      11            3'32"      6'07"    0.11"          1.36"                 6'26"
  E6      27            5'22"      33'27"   0.19"          1.90"                 19'24"

Table II shows the details of the dataset. We process the recordings on a Windows server with 8 CPU cores at 3.40 GHz and 16 GB RAM. On average, for one minute of audio, it takes 45 seconds to evaluate audio quality. For one minute of video, it takes 240 seconds for video structuring, 65 seconds for motion analysis, and 13 seconds for quality assessment. The optimization times for the six events are reported in Table II.

B. Experimental settings

For audio mashup evaluation, we ask users to listen to pairs of audio recordings generated by the proposed method and by the Virtual Director approach, which stitches the audio fragments of the selected video shots. We follow the MOS (Mean Opinion Score) evaluation scheme [30]. The score choices are: 1 (Bad), 2 (Poor), 3 (Fair), 4 (Good), 5 (Excellent). The methods used to create the audio mashups are not disclosed to users. The order of the two audio mashups in each pair is randomized for different users.

To evaluate cut point detection, we invite another two professional video editors (different from those in the focus study) to assess cut point suitability, as average users are less sensitive to the quality of cut points. We also ask the editors to give their comments on the cut points. The assessment covers two aspects:


Fig. 7. Video mashup evaluation. Two videos created by two methods are placed on the left and right side respectively.

• Is the switching frequency of the video appropriate?
• Do the cut points appear at the right time?

Mashup videos created by Virtual Director, Jiku Director, and ours (12 videos in total) are shuffled. The editors are asked to watch them and give scores for the above two questions one by one. The score choices and their meanings are the same as those of the audio mashup evaluation, ranging from 1 to 5.

For video mashup, we conduct online user studies to compare the viewing experience. Before the study, an instruction page is shown to guide the evaluation. The page shows four major factors users should consider: diversity, image quality, stability, and overall rating. We ask users a question for each aspect as follows:

• Diversity. Does the video give a rich overview of the event? This question determines whether users will be bored with the video due to the monotonous view.

• Visual Quality. Is the visual quality of the video good? Visual quality here is mainly about the spatial factors mentioned before.

• Stability. Is the video shown stable? We highlight stability since shakiness is a major quality issue in mobile videos, especially in videos captured by amateurs.

• Overall Rating. Do you think the video is well edited?

Users are asked to answer the above four questions for each video, with scores ranging from 1 (No, I do not agree at all) to 7 (Yes, I completely agree).

TABLE III
NUMBER OF SWITCHING TIMES OF MASHUP AUDIO

                   Setting 1                   Setting 2
Method    Audio 1  Audio 2  Audio 3    Audio 4  Audio 5  Audio 6
Mashup    1        2        1          2        1        4
VD        5        5        4          86       23       76

We ask a user experience (UX) designer to help design the layout of the evaluation web pages. Unlike previous experiments where videos are shown one after another [2], [5], the designer suggests that we show two mashup videos simultaneously. One mashup video is placed on the left side and the other on the right, as Fig. 7 shows. Placement of the videos is random. Users give scores to both the left and right videos, without knowing which methods created them. The benefits of such a setting include:

• It is easy for users to notice the difference between two videos. In previous settings, users were expected to keep all the information in memory.

• Users will not feel confused when watching two videos simultaneously. Too many videos playing simultaneously would distract users' attention.

According to the above setting, the proposed system is compared with Virtual Director on the first three events and with Jiku Director on the last three events, respectively.

C. Evaluation of audio mashup

In evaluating the audio mashup, we conducted two experiments with different settings. In the first setting, we evaluate whether the proposed stitching method works. Due to the less-switching principle, the mashup audio generated by our system switches only a few times (see Table III). Hence we select 30 seconds from each of the three mashup audio tracks in which switching between audio sources happens often. The corresponding fragments generated by the Virtual Director approach are selected as counterparts.
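The selection of excerpts with frequent switching can be illustrated as a sliding-window count over switch instants. The sketch below is an illustration under the assumption that switch times (in seconds) are already extracted from the mashup; it is not the system's actual selection code.

```python
def densest_window(switch_times, window=30.0):
    """Return the start of the `window`-second span containing the most
    source switches. `switch_times` is a sorted list of switch instants
    in seconds (hypothetical input)."""
    best_start, best_count = 0.0, 0
    for i, t in enumerate(switch_times):
        # count switches falling inside [t, t + window)
        count = sum(1 for s in switch_times[i:] if s < t + window)
        if count > best_count:
            best_start, best_count = t, count
    return best_start, best_count

# Example: a mashup that switches at these instants (seconds).
print(densest_window([12.0, 41.5, 55.0, 61.2, 70.3, 140.0]))  # (41.5, 4)
```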

We invited 18 users, including 14 males and 4 females, to take part in the user study. Users' ages range from 22 to 26 years old. Their frequency of watching video recordings (concerts, competitions, etc.) is distributed as follows: 2 rarely, 3 monthly, 6 weekly, and 7 daily. The results are shown in Table IV (Setting 1). The proposed stitching method is able to alleviate the aural impact caused by switching. The benefit is even more obvious in cases where the source audio fragments differ greatly from each other in volume and tone (Audio 2).

In the second setting, we present the whole mashup audio recordings to participants. The purpose of this setting is to find out whether the proposed audio mashup scheme can improve the quality of the mashup audio recording. We invited another 15 users, with ages ranging from 20 to 33 years old, to evaluate another three pairs of mashup audios. Among these users, there are 5 females and 11 males. The frequency with which they watch video recordings is distributed as follows: 2 rarely, 1 monthly, 10 weekly, and 3 daily. As we can see from Setting 2 in Table IV, the audio fragments generated by our mashup system are much better than those generated by Virtual Director.

According to the results of the above two settings, our system generates better mashup audio.

TABLE IV
SUBJECTIVE EVALUATION (MOS) OF AUDIO MASHUP

                   Setting 1                   Setting 2
Method    Audio 1  Audio 2  Audio 3    Audio 4  Audio 5  Audio 6
Mashup    2.56     2.28     3.56       3.00     3.81     3.00
VD        2.33     1.28     3.50       1.75     2.81     2.38

Fig. 8. Evaluation of switching frequency. E1 to E6 are the six events in Table II. (Bar charts comparing MoVieUp with Virtual Director on E1-E3 and with Jiku Director on E4-E6; scores are on the 1-5 scale.)

The improvement is three-fold: 1) the proposed algorithm selects better audio fragments at each switch point; 2) it reduces the switching frequency significantly; 3) it stitches shots from different sources smoothly to reduce the obtrusive aural effect.
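To make the less-switching behavior concrete, the sketch below selects one audio source per segment by maximizing quality minus a fixed switching penalty with Viterbi-style dynamic programming. The quality scores and penalty value are hypothetical, and the sketch is an illustration of the principle rather than the exact formulation used in our system.

```python
def select_audio_sources(quality, switch_penalty=0.5):
    """quality[t][k]: quality score of source k in segment t (hypothetical
    values). Returns one source index per segment, maximizing total quality
    minus a penalty for every switch (Viterbi-style dynamic programming)."""
    n_seg, n_src = len(quality), len(quality[0])
    score = list(quality[0])                       # best score ending in source k
    back = [[0] * n_src for _ in range(n_seg)]     # back-pointers for trace-back
    for t in range(1, n_seg):
        new_score = [0.0] * n_src
        for k in range(n_src):
            cands = [score[j] - (switch_penalty if j != k else 0.0)
                     for j in range(n_src)]
            j_best = max(range(n_src), key=lambda j: cands[j])
            new_score[k] = cands[j_best] + quality[t][k]
            back[t][k] = j_best
        score = new_score
    k = max(range(n_src), key=lambda j: score[j])  # best final source
    path = [k]
    for t in range(n_seg - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    return list(reversed(path))

# Two sources, three segments: the small quality gain of source 1 in the
# middle segment does not justify paying the switching penalty twice.
print(select_audio_sources([[0.9, 0.3], [0.4, 0.5], [0.8, 0.2]]))  # [0, 0, 0]
```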

D. Evaluation of Cut Point Detection

In this section, we compare the proposed cut point detection method with Virtual Director and Jiku Director. Note that the cut points in Virtual Director are manually annotated by listening to the highest-quality audio among all recordings, while Jiku Director and MoVieUp select cut points automatically.

Fig. 8 shows the comparison results of the three systems on switching frequency. We can easily observe that MoVieUp is much better than Virtual Director. This is a little surprising, as the cut points in Virtual Director are manually annotated. To investigate the reasons, we talked with the editors and found that there are too many meaningless switches in the videos generated by Virtual Director. Such meaningless switching is often due to defects in the video (such as occlusion and shakiness). Though our system also selects candidate cut points from audio, it handles video quality better and avoids such meaningless switching. The editors also remind us that no switching at all is better when no appropriate video is available. Jiku Director gets a Fair score on all three events. This is an average score and accords with its non-content-based learning approach. MoVieUp, which considers tempo suitability, semantic suitability, shot quality, and motion consistency at cut points, achieves the best performance.
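As a rough illustration of how such per-cut-point factors might be combined, the sketch below scores candidate cut points with a weighted sum; the factor names mirror the text, but the weights and values are made up and do not reproduce our actual scoring.

```python
def score_cut_point(factors, weights=None):
    """Combine per-cut-point factor scores (each assumed in [0, 1]) into a
    single suitability score. Factor names and weights are illustrative."""
    weights = weights or {"tempo": 0.3, "semantic": 0.3,
                          "shot_quality": 0.2, "motion_consistency": 0.2}
    return sum(weights[name] * factors.get(name, 0.0) for name in weights)

# Two hypothetical candidate cut points (keyed by time in seconds).
candidates = {
    10.4: {"tempo": 0.9, "semantic": 0.7, "shot_quality": 0.8, "motion_consistency": 0.6},
    17.8: {"tempo": 0.4, "semantic": 0.9, "shot_quality": 0.5, "motion_consistency": 0.9},
}
best = max(candidates, key=lambda t: score_cut_point(candidates[t]))
print("best candidate cut point at", best, "s")
```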

As to the cut point suitability comparison shown in Fig. 9, Virtual Director gets poor scores due to the same meaningless-switching problem analyzed above. The results verify that cut point selection depends on both audio and video. According to the scores, the proposed MoVieUp is slightly better than Jiku Director. In our focus study, editors suggest switching videos at speaking/singing intervals; we also find literature that switches at music beats [17]. Even the editors themselves cannot give a definite rule for selecting cut points.



Fig. 9. Evaluation of cut point suitability. E1 to E6 are the six events in Table II. (Bar charts comparing MoVieUp with Virtual Director on E1-E3 and with Jiku Director on E4-E6; scores are on the 1-5 scale.)

Even so, the experimental results show that MoVieUp provides a practical way to detect cut points, as it performs better than Fair and even reaches the Good level.

E. Evaluation of Video Mashup

We compare the viewing experience of our proposed system, MoVieUp, with that of Virtual Director [2] and Jiku Director [4]. Since Jiku Director is an online system, we mainly focus on the viewing experience of the videos rather than on their application scenarios. Note that we do not apply video stabilization in MoVieUp during the experiments, to make a fair comparison with the two existing systems.

1) Comparison with Virtual Director: The 18 users in the first setting of the audio mashup evaluation attended this study. Results are shown in Fig. 10.

We find that our method is better than Virtual Director on diversity, but the advantage is not as significant as for the other three factors. This is reasonable since both methods select different video sources at each cut point to avoid monotony. The difference is that we consider temporal issues, whereas Virtual Director measures diversity using only adjacent frames. Additionally, we analyze these videos and find that each event has only four to five source videos, captured from different viewing angles and distances; this also alleviates the monotony problem.

Our method performs much better on visual quality. Virtual Director considers four quality factors: blockiness, blurriness, brightness, and shakiness. We employ more spatial and temporal video quality factors (tilting, infidelity, and jerkiness), and the pre-filtering step helps to filter out very bad video shots. To further study the reasons, we invited three professional editors to label the shot quality of the mashup videos as “Good” or “Bad” with respect to shakiness, darkness, and infidelity (including shots affected by occlusion or polluted by strong lighting). Tilting rarely appears in our experiments, and blurriness is not labeled individually as it is closely related to shakiness. The results are shown in Table V. Among the 46 shots selected by Virtual Director in the first event, 24 shots are too dark (more than half of the screen is black); correspondingly, there are only seven dark shots among the 27 shots selected by our system. In the second event, 14 of the 55 shots selected by Virtual Director suffer from irregular camera motion, resulting in the jerkiness observed in the mashup video, while the camera motion of only two shots is erratic in our system.

Fig. 10. Comparison of the proposed method with Virtual Director. E1, E2, E3 are the three events in Table II. These videos are the same as those used by Virtual Director [2]. (Bar charts of Diversity, Visual Quality, Stability, and Overall Score for MoVieUp and VD.)

TABLE V
VISUAL QUALITY COMPARISON OF MOVIEUP AND VIRTUAL DIRECTOR

                    MoVieUp                       VD
Factors    Event 1  Event 2  Event 3    Event 1  Event 2  Event 3
Shot       27       27       29         46       55       43
Shaky      0        2        1          2        14       8
Dark       7        0        0          24       1        0
Occluded   0        2        5          0        5        8

We take a further look at the dataset and find that four out of the five source videos suffer from shakiness; the seven unstable shots were possibly selected for diversity. Besides, infidelity is also a big concern: five shots selected by Virtual Director are polluted by strong lighting (covering half of the screen), compared with only two in our system. In the third event, eight of the 43 shots selected by Virtual Director are unstable, while only one shot is unstable in our system. According to the resulting mashup videos and user feedback, we conclude that our system is able to select high-quality shots from both spatial and temporal aspects.

The proposed mashup system outperforms Virtual Director on all three specific factors in all three events. This verifies the efficacy of the employed video diversity and quality assessment methods. It also shows that the proposed algorithm achieves a better optimization of diversity and image quality, and thus a better viewing experience; this is why users give higher overall scores to all three mashup videos generated by our system.

2) Comparison with Jiku Director: The 15 participants in the second setting of the audio mashup evaluation attended the comparison study with Jiku Director. Evaluation results are shown in Fig. 11.

Our system generates better results than Jiku Director on diversity but, as in the comparison with Virtual Director, not by a significant margin. Both MoVieUp and Jiku Director employ key frame similarity.



Fig. 11. Comparison of the proposed method with Jiku Director. E4, E5, E6 are the three events in Table II. Mashup videos are generated by Jiku Director and MoVieUp, respectively. (Bar charts of Diversity, Visual Quality, Stability, and Overall Score for MoVieUp and Jiku.)

The level of interest in the content decreases over time for both systems. The difference is that Jiku Director selects a viewpoint (shooting distance and angle) first, and only videos from the selected viewpoint are considered further. Such a strategy may reduce diversity when consecutive viewpoints are selected close to each other. In contrast, our method selects shots with a memory model.
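The memory model itself is not detailed in this section; one plausible form, shown purely as an assumption for illustration, is an exponential-decay penalty on cameras that were shown recently.

```python
import math

def diversity_weight(camera, history, now, tau=20.0):
    """Down-weight a camera that appeared recently in the mashup.
    `history` maps camera id -> time (s) it was last shown; `tau` controls
    how quickly the penalty fades. All values here are hypothetical."""
    last = history.get(camera)
    if last is None:
        return 1.0                        # never shown: full diversity credit
    return 1.0 - math.exp(-(now - last) / tau)

history = {"cam1": 90.0, "cam2": 40.0}    # cam3 has not been shown yet
for cam in ("cam1", "cam2", "cam3"):
    print(cam, round(diversity_weight(cam, history, now=100.0), 2))
```

Under this toy model, the camera shown most recently (cam1) receives the smallest diversity weight, which discourages immediately returning to it.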

For visual quality, the two systems are similar in Event 4 and Event 6, as both consider spatial and temporal quality factors. We also evaluate shot quality as described in the comparison with Virtual Director; the results are shown in Table VI. Though our system selects fewer unstable shots than Jiku Director in Event 4, the occluded shots may be the reason for the comparable scores in Fig. 11. In Event 5, MoVieUp performs much better: we find that the video generated by Jiku Director is affected by occlusion, strong lighting, and erratic camera motion. As for Event 6, the editors report that both videos suffer from blurriness. According to Fig. 11 and Table VI, we can still conclude that MoVieUp performs better than Jiku Director in terms of visual quality.

The overall rating again shows that the proposed system is able to provide a better viewing experience. It achieves better diversity and video quality, and the stability improves by an especially large margin over previous methods.

F. Video Mashup Example

Because the stability of mobile videos is not guaranteed, we apply video stabilization as a post-processing step to the mashup video for better quality. Some examples of the final output of our mashup system are shown online3. Generally, the viewing experience is further improved compared with the non-stabilized videos.

3 http://www.youtube.com/user/AutoMoVieUp/videos

TABLE VI
VISUAL QUALITY COMPARISON OF MOVIEUP WITH JIKU DIRECTOR

                    MoVieUp                       Jiku Director
Factors    Event 4  Event 5  Event 6    Event 4  Event 5  Event 6
Shot       49       21       37         49       26       49
Shaky      3        0        0          11       3        3
Dark       1        0        0          0        0        1
Occluded   5        1        1          7        3        4

VI. CONCLUSION AND DISCUSSION

In this paper, we present a fully automatic mobile video-audio mashup system that works in the cloud to generate both mashup audio and mashup video from multi-camera recordings uploaded from various handheld clients. The system achieves a viewing experience superior to state-of-the-art video mashup techniques. The system is based on the filming principles summarized from a focus study. To generate high-quality mashup audio, we evaluate the quality of the sources and select the best one under the less-switching principle. We detect video cut points by measuring tempo and semantic suitability from audio, and motion consistency is considered to make switching smooth. To ensure video quality, we consider both spatial and temporal quality factors. To enrich video content, we use a memory model to resolve the problem of monotony. Video mashup is formulated as a constrained optimization problem.
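As a toy rendition of such a constrained formulation, the greedy sketch below picks, at each segment, the camera with the highest combined quality and diversity score while enforcing a minimum shot length; the scores, the constraint value, and the greedy strategy are assumptions for illustration and do not reproduce our optimizer.

```python
def greedy_mashup(segments, min_shot_len=2):
    """segments: list of dicts mapping camera id -> combined quality and
    diversity score for that segment (hypothetical values). A chosen camera
    is kept for at least `min_shot_len` segments before switching."""
    chosen, run = [], 0
    for scores in segments:
        if chosen and run < min_shot_len:
            cam = chosen[-1]                # constraint: too early to switch
        else:
            cam = max(scores, key=scores.get)
        run = run + 1 if chosen and cam == chosen[-1] else 1
        chosen.append(cam)
    return chosen

segs = [{"c1": 0.9, "c2": 0.5}, {"c1": 0.4, "c2": 0.8},
        {"c1": 0.3, "c2": 0.9}, {"c1": 0.7, "c2": 0.6}]
print(greedy_mashup(segs))                  # ['c1', 'c1', 'c2', 'c2']
```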

One limitation of this work is the assumption that the quality of mobile recordings can be assessed by speech quality assessment algorithms, which may limit its application in some mobile scenarios. As far as we know, little work has been done on non-intrusive quality assessment of general audio. Another limitation is that the system cannot reconstruct the 3D position and orientation of the cameras, which hinders the use of film grammars such as the 30-degree rule, avoiding jump cuts, and many other rules for selecting video shots. 3D reconstruction is time consuming and often fails on mobile videos. Sensor-based analysis is a potential solution but requires encoding such 3D information along with the video frames, which is currently unavailable.

There are a number of possible improvements for mobile video mashup. Localization of mobile cameras within the captured event [31] would bring in more computational filming principles and help improve video diversity. Besides, non-intrusive objective audio quality assessment is still a rarely explored research topic. Visual factors can be incorporated into cut point detection to achieve more natural switching. Another possibility for future work is to explore more semantic information in videos, so that we can take advantage of object motion, camera motion, and many other elements of film language to improve the viewing experience. Furthermore, users are now browsing and searching the internet with multimodal queries [32]; a more interactive user experience between clients and clouds is expected by extending these video editing techniques to more general videos from the internet.



ACKNOWLEDGMENT

The work was partially supported by the National Natural Science Foundation of China (No. 61371192). Ying-Qing Xu was supported by the National Basic Research Program of China (Grant No. 2012CB725300) and the National Natural Science Foundation of China (Grant No. 61373072).

REFERENCES

[1] J. Sang, T. Mei, Y. Xu, C. Zhao, C. Xu, and S. Li, “Interaction design for mobile visual search,” IEEE Transactions on Multimedia, vol. 15, no. 7, pp. 1665–1676, 2013.

[2] P. Shrestha, P. H. N. de With, H. Weda, M. Barbieri, and E. H. L. Aarts, “Automatic mashup generation from multiple-camera concert recordings,” in ACM Multimedia, 2010, pp. 541–550.

[3] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Pearson Education, 2010.

[4] D.-T.-D. Nguyen, M. Saini, V.-T. Nguyen, and W. T. Ooi, “Jiku director: A mobile video mashup system,” in ACM Multimedia, 2013, pp. 477–478.

[5] M. K. Saini, R. Gadde, S. Yan, and W. T. Ooi, “MoViMash: Online mobile video mashup,” in ACM Multimedia, 2012, pp. 139–148.

[6] X.-S. Hua, L. Lu, and H. Zhang, “Automatic music video generation based on temporal pattern analysis,” in ACM Multimedia, 2004, pp. 472–475.

[7] I. Arev, H. S. Park, Y. Sheikh, J. K. Hodgins, and A. Shamir, “Automatic editing of footage from multiple social cameras,” ACM Trans. Graph., vol. 33, no. 4, p. 81, 2014.

[8] L. Canini, S. Benini, and R. Leonardi, “Interactive video mashup based on emotional identity,” in Proceedings of the European Signal Processing Conference, 2010, pp. 1499–1503.

[9] D. Cardillo, A. Rapp, S. Benini, L. Console, R. Simeoni, E. Guercio, and R. Leonardi, “The art of video mashup: Supporting creative users with an innovative and smart application,” Multimedia Tools and Applications, vol. 53, no. 1, pp. 1–23, 2011.

[10] H. Sundaram, L. Xie, and S.-F. Chang, “A utility framework for the automatic generation of audio-visual skims,” in ACM Multimedia, 2002, pp. 189–198.

[11] S. Sharff, The Elements of Cinema: Toward a Theory of Cinesthetic Impact. Columbia University Press, 1982.

[12] H. Sundaram and S.-F. Chang, “Computable scenes and structures in films,” IEEE Transactions on Multimedia, vol. 4, no. 4, pp. 482–491, 2002.

[13] F. Lampi, S. Kopf, M. Benz, and W. Effelsberg, “A virtual camera team for lecture recording,” IEEE MultiMedia, vol. 15, no. 3, pp. 58–61, 2008.

[14] S. Sumec, “Multi camera automatic video editing,” in Computer Vision and Graphics, 2006, pp. 935–945.

[15] A. Ranjan, R. Henrikson, J. P. Birnholtz, R. Balakrishnan, and D. Lee, “Automatic camera control using unobtrusive vision and audio tracking,” in Graphics Interface, 2010, pp. 47–54.

[16] C. Zhang, Y. Rui, J. Crawford, and L.-W. He, “An automated end-to-end lecture capture and broadcasting system,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), vol. 4, no. 1, p. 6, 2008.

[17] X.-S. Hua, L. Lu, and H.-J. Zhang, “Optimization-based automated home video editing system,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 572–583, 2004.

[18] F. Manchel, Film Study: An Analytical Bibliography. Fairleigh Dickinson Univ Press, 1990, vol. 1.

[19] D. Arijon, Grammar of the Film Language. Focal Press London, 1976.

[20] J.-G. Kim, H. S. Chang, J. Kim, and H.-M. Kim, “Efficient camera motion characterization for MPEG video indexing,” in IEEE International Conference on Multimedia and Expo, 2000, pp. 1171–1174.

[21] P. Shrestha, M. Barbieri, H. Weda, and D. Sekulovski, “Synchronization of multiple camera videos using audio-visual features,” IEEE Transactions on Multimedia, vol. 12, no. 1, pp. 79–92, 2010.

[22] W. Liu, T. Mei, and Y. Zhang, “Instant mobile video search with layered audio-video indexing and progressive transmission,” IEEE Transactions on Multimedia, vol. 16, no. 8, pp. 2242–2255, 2014.

[23] T. Mei, X.-S. Hua, C.-Z. Zhu, H.-Q. Zhou, and S. Li, “Home video visual quality assessment with spatiotemporal factors,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 6, pp. 699–706, 2007.

[24] D. Campbell, E. Jones, and M. Glavin, “Audio quality assessment techniques — a review, and recent developments,” Signal Processing, vol. 89, no. 8, pp. 1489–1500, 2009.

[25] Z. Li, J.-C. Wang, J. Cai, Z. Duan, H.-M. Wang, and Y. Wang, “Non-reference audio quality assessment for online live music recordings,” in ACM Multimedia, 2013, pp. 63–72.

[26] A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, and O. Ghitza, “Objective assessment of speech and audio quality — technology and applications,” IEEE Transactions on Audio, Speech & Language Processing, vol. 14, no. 6, pp. 1890–1901, 2006.

[27] D. Stowell and M. Plumbley, “Adaptive whitening for improved real-time audio onset detection,” in Proceedings of the International Computer Music Conference, 2007.

[28] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.

[29] S. Liu, L. Yuan, P. Tan, and J. Sun, “Bundled camera paths for video stabilization,” ACM Transactions on Graphics (TOG), vol. 32, no. 4, p. 78, 2013.

[30] ITU-R Recommendation BS.1284-1, “General methods for the subjective assessment of sound quality,” International Telecommunication Union, Geneva, Switzerland, Tech. Rep., Dec. 2003.

[31] H. Liu, T. Mei, J. Luo, H. Li, and S. Li, “Finding perfect rendezvous on the go: Accurate mobile visual localization and its applications to routing,” in Proceedings of the 20th ACM International Conference on Multimedia, 2012, pp. 9–18.

[32] Y. Wang, T. Mei, J. Wang, H. Li, and S. Li, “JIGSAW: Interactive mobile visual search with multimodal queries,” in Proceedings of the 19th ACM International Conference on Multimedia, 2011, pp. 73–82.

Yue Wu received his B.E. degree in 2012 from the University of Science and Technology of China (USTC), Hefei, China. He is currently a Ph.D. candidate in the Department of Electronic Engineering and Information Science (EEIS), USTC. He worked as an intern at Microsoft Research, Beijing, China, from July 2012 to June 2012 and from February 2014 to March 2015, respectively. His research interests include multimedia, computer vision, machine learning, and data mining.

Tao Mei (M’07-SM’11) is a Lead Researcher with Microsoft Research, Beijing, China. He received the B.E. degree in automation and the Ph.D. degree in pattern recognition and intelligent systems from the University of Science and Technology of China, Hefei, China, in 2001 and 2006, respectively. His current research interests include multimedia information retrieval and computer vision. He has authored or co-authored over 100 papers in journals and conferences and 10 book chapters, and has edited three books. He holds 11 granted U.S. patents and more than 20 pending.

Dr. Mei was the recipient of several paper awards from prestigious multimedia journals and conferences, including the IEEE Circuits and Systems Society Circuits and Systems for Video Technology Best Paper Award in 2014, the IEEE Trans. on Multimedia Prize Paper Award in 2013, the Best Student Paper Award at IEEE VCIP in 2012, and the Best Paper Awards at ACM Multimedia in 2009 and 2007, among others. He received the Microsoft Gold Star Award in 2010 and Microsoft Technology Transfer Awards in 2010 and 2012. He is an Associate Editor of IEEE Trans. on Multimedia, Multimedia Systems, and Neurocomputing, and a Guest Editor of six international journals. He is the General Co-chair of ICIMCS 2013, the Program Co-chair of IEEE MMSP 2015 and MMM 2013, and the Workshop Co-chair of ICME 2012 and 2014. He is a Senior Member of the IEEE and the ACM.



Ying-Qing Xu is a Cheung Kong Scholar Chair Professor at Tsinghua University, Beijing, China. He received the B.S. degree in mathematics from Jilin University (1982) and the Ph.D. in computer graphics from the Chinese Academy of Sciences (1997). His current research interests include human-computer interaction, computer graphics, and e-heritage. Dr. Xu has authored or co-authored over 80 research papers and holds over 20 granted US patents, with more pending. He is a member of ACM, ACM SIGGRAPH, and CAA (Chinese Artists Association), and a senior member of IEEE and CCF (China Computer Federation).

Nenghai Yu received the B.S. degree from Nanjing University of Posts and Telecommunications, Nanjing, China, in 1987, the M.E. degree from Tsinghua University, Beijing, China, in 1992, and the Ph.D. degree from the University of Science and Technology of China, Hefei, China, in 2004.

He has been on the faculty of the Department of Electronic Engineering and Information Science, University of Science and Technology of China (USTC) since 1992, where he is currently a professor. He is the executive director of the Department of Electronic Engineering and Information Science, and the director of the Information Processing Center at USTC. His research interests include multimedia security, multimedia information retrieval, video processing, and information hiding. He has authored or co-authored over 130 papers in journals and international conferences. He has been responsible for many national research projects.

Prof. Yu and his research group won the Excellent Person Award and the Excellent Collectivity Award simultaneously from the National Hi-tech Development Project of China in 2004. He was a co-author of the Best Paper Candidate at ACM Multimedia 2008.

Dr. Shipeng Li joined and helped to found Microsoft Research's Beijing lab in May 1999. His research interests include multimedia processing, analysis, coding, streaming, networking, and communications. From Oct. 1996 to May 1999, Dr. Li was with Sarnoff Corporation. Dr. Li has been actively involved in research and development in broad multimedia areas and international standards. He has authored and co-authored 6 books/book chapters and 280+ refereed journal and conference papers. He holds 140+ granted US patents.

Dr. Li received his B.S. and M.S. in Electrical Engineering (EE) from the University of Science and Technology of China (USTC) in 1988 and 1991, respectively. He received his Ph.D. in EE from Lehigh University in 1996. He was a faculty member at USTC in 1991-1992. Dr. Li is a Fellow of IEEE.