Noboru Babaguchi Osaka University Joint Work with Prof. N. Nitta ICME2013 Co-located WS MMIX13, Keynote San Jose, July 18, 2013

Example-Based Remixing of Multimedia Contents

Page 1: Example-Based Remixing of Multimedia Contents

Noboru Babaguchi

Osaka University

Joint Work with Prof. N. Nitta

ICME2013 Co-located WS MMIX13, Keynote San Jose, July 18, 2013

Page 2: Example-Based Remixing of Multimedia Contents

Definitions, Background, Related Work

Multimedia Remixing Support System

Video Clip Sequence Creation

Music Clip Selection

Shot Extraction

Conclusion and Future Work

Page 3: Example-Based Remixing of Multimedia Contents

From Wikipedia… A remix is a song that has been edited to sound different from the original version. The person who remixed it might have changed the pitch of the singers' voice, changed the tempo and speed and has made the song shorter or longer, or instead of hearing just one person singing they might have duplicated the voice to make it sound like two people are singing, or make the voice echo.

Remixes should not be confused with edits, which usually involve shortening a final stereo master for marketing or broadcasting purposes. … A remix song recombines audio pieces from a recording to create an altered version of the song.

In recent years the concept of the remix has been applied analogously to other media. …. Scary Movie series is famous for its comic remix of various well-known horror movies such as Ring, Scream, and Saw.

Page 4: Example-Based Remixing of Multimedia Contents

Video Remix: a video clip made by recombining various media components to create an altered version of the original videos.

Video transition effects (cut, fade-in/out, dissolve, etc.)

Audio clips (music, sound effects, voices, etc.)

Original video clips

Video remixes (e.g. movie trailers)

Video clip selection & arrangement

Multimedia stream Combination

How can we create video remixes of good quality?

from “The School of Rock” (2003)

Page 5: Example-Based Remixing of Multimedia Contents

Two aspects in video remixing

Semantic Aspect: What should we present? (Semantic Content)

Highlights of Sports Games, etc. (Video Summarization)

Affective Aspect: How should we present the video content? (Aesthetic Compatibility, Film Syntax)

Commercial Films, Movie Trailers, etc.

How should video clips be arranged, and which music clip should be added, to enhance the expressive quality?

Page 6: Example-Based Remixing of Multimedia Contents

Problem of Video Remixing

Video Remix = A sequence of D video scenes

Shot-Scene Relation: relates a sequence of L video shots to a video scene. Each video shot is an excerpt from a video clip.

Scene-Music Relation: relates a video scene to a music clip (a sequence of D music clips in total), to maintain the feeling of continuity in a scene.
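The decomposition on this slide (remix, scenes, shots, one music clip per scene) can be written down as a small data model. This is only an illustrative sketch: the class names, field names, and the sample identifiers such as `clip1` and `music_a` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shot:
    clip_id: str   # source video clip (hypothetical identifier)
    start: float   # start of the excerpt, in seconds
    end: float     # end of the excerpt, in seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class Scene:
    shots: List[Shot]   # shot-scene relation: the shots that make up this scene
    music_clip: str     # scene-music relation: the music clip mixed with this scene

    @property
    def duration(self) -> float:
        return sum(shot.duration for shot in self.shots)


@dataclass
class VideoRemix:
    scenes: List[Scene]  # the sequence of D video scenes

    @property
    def music_sequence(self) -> List[str]:
        # The sequence of D music clips, one per scene.
        return [scene.music_clip for scene in self.scenes]


remix = VideoRemix(scenes=[
    Scene([Shot("clip1", 0.0, 2.0), Shot("clip2", 5.0, 6.5)], "music_a"),
    Scene([Shot("clip3", 1.0, 4.0)], "music_b"),
])
```

Keeping the music clip on the scene rather than on the shot mirrors the slide's point that music is chosen per scene to preserve continuity.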

Page 7: Example-Based Remixing of Multimedia Contents

Hitchcock[Girgensohn2001]

Template-based Editing[Davis2003]

Lazycut[Hua2005]

Emotion-based[Canini2010]

Video-Music Mixing [Mulhem2003][Hua2004][Wang2005][Yoon2009][Cristani2010]

Page 8: Example-Based Remixing of Multimedia Contents

Video clip selection and arrangement

Focused on how various types of video clips are arranged in sequence.

Film Syntax

For example…

• A scene has to have at least three video clips [Sundaram01].

• Two video shots of extremely different shot sizes should not be connected [Kumano02].

• The duration of a shot recorded with the camera fixed is up to 15 seconds [Kumano02].

Aesthetic Compatibility

• Shots with similar emotional impact should be connected [Canini10].

[Sundaram01] H. Sundaram, et al., “Condensing computable scenes using visual complexity and film syntax analysis,” Proc. ICME, pp.389-392, 2001.

[Kumano02] M. Kumano, et al., “Video editing support system based on video content analysis,” Proc. ACCV, pp.628-633, 2002.

[Canini10] L. Canini, et al., “Interactive video mashup based on emotional identity,” Proc. European Signal Processing Conf., pp.1499-1503, 2010.
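The three film-syntax rules quoted above are the kind of hand-written constraint that earlier systems encoded directly. A minimal sketch of such a rule checker follows. The three-level shot-size scale, the `max_size_jump` threshold used to decide what counts as "extremely different", and all names are assumptions made for illustration; only the minimum of three clips and the 15-second limit come from the slide.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical ordinal scale of shot sizes; the cited papers may use a finer one.
SHOT_SIZES = ["close-up", "medium", "long"]


@dataclass
class Shot:
    size: str           # one of SHOT_SIZES
    duration: float     # seconds
    fixed_camera: bool  # True if recorded with the camera fixed


def check_scene(shots: List[Shot],
                max_fixed_sec: float = 15.0,
                max_size_jump: int = 1) -> List[str]:
    """Return a list of film-syntax violations for one scene (empty if none)."""
    problems = []
    # Rule 1: a scene has to have at least three video clips.
    if len(shots) < 3:
        problems.append("scene has fewer than three video clips")
    # Rule 3: a fixed-camera shot lasts up to 15 seconds.
    for i, shot in enumerate(shots):
        if shot.fixed_camera and shot.duration > max_fixed_sec:
            problems.append(f"shot {i}: fixed-camera shot longer than {max_fixed_sec:g} s")
    # Rule 2: do not connect shots of extremely different sizes
    # (assumed here to mean more than max_size_jump steps on the scale).
    for i in range(len(shots) - 1):
        jump = abs(SHOT_SIZES.index(shots[i].size) - SHOT_SIZES.index(shots[i + 1].size))
        if jump > max_size_jump:
            problems.append(f"shots {i}-{i + 1}: extremely different shot sizes")
    return problems


good = [Shot("long", 4.0, True), Shot("medium", 3.0, False), Shot("close-up", 2.0, False)]
bad = [Shot("long", 20.0, True), Shot("close-up", 2.0, False)]
```

Here `check_scene(good)` reports nothing, while `check_scene(bad)` reports one violation per rule.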

Page 9: Example-Based Remixing of Multimedia Contents

Music clip selection

Focused on which types of music clips are mixed with video shots.

Aesthetic Compatibility

For example…

Determined heuristically

• Dynamics, motion, and pitch of the image and audio streams coincide with each other [Mulhem03].

• Novelty, velocity, and brightness of the image and audio streams coincide with each other [Yoon09].

Determined statistically

• Brightness of the image and audio streams, and rhythm of the audio stream and optical flow in the image stream, coincide with each other [Cristani10].

[Mulhem03] P. Mulhem, et al., “Pivot vector space approach for audio-video mixing,” IEEE Multimedia, 10(2), pp.28-40, 2003.

[Yoon09] J.-C. Yoon, et al., “Automated music video generation using multi-level feature-based segmentation,” MTAP, 41(2), pp.197-214, 2009.

[Cristani10] M. Cristani, et al., “Toward an automatically generated soundtrack from low-level cross-modal correlations for automotive scenarios,” Proc. ACM Multimedia, pp.551-559, 2010.

Page 10: Example-Based Remixing of Multimedia Contents

Multimedia Remixing Support System

Page 11: Example-Based Remixing of Multimedia Contents

It is difficult to explicitly define the rules and know-how about how the video and music clips should be arranged, considering the aesthetic compatibility.

The rules and structures commonly used in professionally created examples can be modeled by standard machine learning techniques.

Non-professional users can be supported through an interface based on the models, which implicitly describe shot-scene and scene-music relations considering aesthetic compatibility.

Page 12: Example-Based Remixing of Multimedia Contents

A Set of Video Remix Examples

Professionally Created Video Remixes

Page 13: Example-Based Remixing of Multimedia Contents

A Set of Video Remix Examples

Target: Remixing original video clips based on Examples

A Set of Music Clips

A Set of Original Video Clips

video remix

Page 14: Example-Based Remixing of Multimedia Contents

System overview (figure)

Inputs: a set of video clips, a set of music clips, and a set of video remix examples

From the examples: Video Remix Template (shots grouped into scenes) and Video Clip Suggestions presented to the user

I) Video Clip Sequence Creation Interface

II) Music Clip Selection

III) Shot Extraction (Video and Music Synchronization)

Output: video remix

Page 15: Example-Based Remixing of Multimedia Contents

N. Nitta and N. Babaguchi, “Example-based video remixing,” Multimedia Tools and Applications, 51(2), pp.649-673, 2011

N. Nitta and N. Babaguchi, “Example-based home video remixing,” Proc. ICME, 2011

Page 16: Example-Based Remixing of Multimedia Contents

Overview of Procedure I)

Template Generation: Video Remix Examples → Symbol Sequence → Template (e.g., B A B C G E)

Home (Personal) Videos → Segmentation → Video Clips

Each video clip is evaluated by its Suitability to Template [Nitta2011] and its Perceived Quality [Mei2007], and presented in the Interface.

[Mei2007] T. Mei, et al., "Home Video Visual Quality Assessment With Spatiotemporal Factors," IEEE Trans. Circuits and Systems for Video Technology, vol.17, no.6, pp.699-706, 2007.

Page 17: Example-Based Remixing of Multimedia Contents

Example-based Template Generation

Video Remix Examples → Sequences of video shots

Feature Extraction (low-level features of each shot): Shot Length, Brightness, Motion Intensity, w/wo Camera Work, w/wo Human Objects

Symbolization: each shot is mapped to a symbol (a, b, c, d, e, f, g, h, i), giving a Symbol Sequence

HMM: learned from the symbol sequences; its states correspond to scene types (e.g., Slow Scene, Active Scene)

GA: generates the Video Remix Template (New Symbol Sequence & State Sequence), i.e., A Sequence of L Shots and A Sequence of D Scenes
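To make the link between a symbol sequence (shots) and a state sequence (scenes) concrete, here is a small Viterbi decoder for a discrete HMM. It is a sketch under stated assumptions: the two states, the two-symbol alphabet, and every probability below are invented for illustration, and the GA step that generates new templates is not covered.

```python
import math
from itertools import groupby


def viterbi(symbols, states, start_p, trans_p, emit_p):
    """Most likely state sequence for a discrete HMM, computed in the log domain."""
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][symbols[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(symbols)):
        score.append({})
        back.append({})
        for s in states:
            # Best predecessor state for being in state s at time t.
            prev = max(states, key=lambda p: score[t - 1][p] + math.log(trans_p[p][s]))
            score[t][s] = (score[t - 1][prev] + math.log(trans_p[prev][s])
                           + math.log(emit_p[s][symbols[t]]))
            back[t][s] = prev
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(symbols) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


def scenes_from_states(path):
    """Group consecutive identical states into (scene type, number of shots)."""
    return [(state, len(list(run))) for state, run in groupby(path)]


# Hypothetical model: symbol "a" = long, calm shot; symbol "b" = short, high-motion shot.
STATES = ["slow", "active"]
START = {"slow": 0.5, "active": 0.5}
TRANS = {"slow": {"slow": 0.8, "active": 0.2},
         "active": {"slow": 0.2, "active": 0.8}}
EMIT = {"slow": {"a": 0.9, "b": 0.1},
        "active": {"a": 0.1, "b": 0.9}}

path = viterbi(list("aaabbb"), STATES, START, TRANS, EMIT)
scenes = scenes_from_states(path)
```

With these toy parameters the six shots decode into a slow scene of three shots followed by an active scene of three shots, which is the shot-to-scene grouping a template records.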

Page 18: Example-Based Remixing of Multimedia Contents

From Shot to Video Clip

Shots in the target video (a home video) are divided into video clips based on the camerawork.

                          Video Clip 1   Video Clip 2   Video Clip 3
Suitability to Template   0.3            0.2            0.7
Perceived Quality         0.7            0.5            0.6

Page 19: Example-Based Remixing of Multimedia Contents

Video clip selection

Video Remix Template

Interface

3D book-style video clip presentation (spine, fore edge)

Timeline Presentation

Shown for each video clip: Suitability to Template, Perceived Quality

Page 20: Example-Based Remixing of Multimedia Contents

Interface

Page 21: Example-Based Remixing of Multimedia Contents

Video remix examples: 61 action movie trailers

Video clips: 265 home (personal) video clips recording a sports field day held by a kindergarten

Subjective evaluation by 8 subjects

Compared with a video clip sequence created by considering only the perceived quality of video clips

Page 22: Example-Based Remixing of Multimedia Contents

Created Video Clip Sequence

With Template*: Subjective Score 3

Without Template: Subjective Score 3.5

* Selected video clips are shortened according to the template

Using action movie trailers as examples resulted in creating a sequence of many short video clips

Page 23: Example-Based Remixing of Multimedia Contents

N. Nitta and N. Babaguchi, “Example-based video remixing support system,” Proc. ACM Multimedia, pp.563-572, 2011

Page 24: Example-Based Remixing of Multimedia Contents

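Overview of Procedure II)

Input: Video Clip Sequence (Scene)

1. In a Set of Video Remix Examples (Scenes), find the visually similar video remix examples.

2. In a Set of Music Clips, find the music clips similar to the music mixed with those examples (similar music clips).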
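The two lookups in this procedure can be sketched as two nearest-neighbour searches. This is an illustration only: plain Euclidean distance stands in for whatever similarity the system uses, the learned music feature space transformation is left out, and the function names, feature vectors, and clip names are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_music(scene_feat, examples, music_library, n_examples=1, n_music=1):
    """Pick music clips for a scene via visually similar remix examples.

    examples:      list of (scene_features, music_features) taken from remix examples
    music_library: dict mapping music clip name -> music_features
    """
    # Step 1: the example scenes visually closest to the input scene.
    nearest = sorted(examples, key=lambda ex: dist(scene_feat, ex[0]))[:n_examples]
    # Step 2: rank library clips by distance to the music mixed with those examples.
    ranked = sorted(
        music_library,
        key=lambda name: min(dist(music_library[name], ex[1]) for ex in nearest),
    )
    return ranked[:n_music]


# Toy 2-D features (hypothetical values).
examples = [((0.9, 0.8), (0.7, 0.9)),   # active scene mixed with energetic music
            ((0.1, 0.2), (0.2, 0.1))]   # slow scene mixed with calm music
library = {"rock": (0.8, 0.8), "ballad": (0.1, 0.2), "ambient": (0.4, 0.5)}

top1 = select_music((0.8, 0.9), examples, library)
top2 = select_music((0.8, 0.9), examples, library, n_music=2)
```

For the active-looking input scene, the nearest example is the first one, so the clip closest to its music ("rock") is returned first.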

Page 25: Example-Based Remixing of Multimedia Contents

Evaluate the compatibility between video scenes and music clips by their distances in the video scene and music feature spaces.

Learn a non-linear mapping of the music feature space so that the distances among the video scenes and the distances among the music clips mixed with them are correlated [Suzuki07].

Figure: video scene feature space (example video scenes); music clip feature space (music clips mixed with the example video scenes); expected music clip feature space after the mapping (music clips mixed with the example video scenes).

[Suzuki07] K. Suzuki, et al., “A similarity-based neural network for facial expression analysis,” Pattern Recognition Letters, 28(9), pp.1104-1111, 2007

Page 26: Example-Based Remixing of Multimedia Contents

Music Clip Selection

Video Scenes: Visual Features

Music Clips: Audio Features

[Zettl99]

Emotion-based Music Classification

[Zettl99] H. Zettl, “Sight Sound Motion: Applied Media Aesthetics,” Wadsworth Publishing, 1999

Page 27: Example-Based Remixing of Multimedia Contents

Consists of 2 Neural Networks (Neural Network A and Neural Network B)

Input: Audio features $x_i^A$ and $x_i^B$ of music clips A and B

Output: Transformed audio features $y_j^A$ and $y_j^B$ of music clips A and B

Learn the weights $w_{l,m}$ of the neural network so that the difference between $d_{AB}$, the distance between $y_j^A$ and $y_j^B$, and the teacher signal $T_{AB}$, the distance between the video scenes mixed with music clips A and B, is minimized.

$w_{l,m}$: weight for the edge between nodes $l$ and $m$.

Figure: Input A ($x_i^A$) → Neural Network A → $y_j^A$; Input B ($x_i^B$) → Neural Network B → $y_j^B$; distance calculation gives $d_{AB}$, which is compared with the teacher $T_{AB}$ (distances of video scenes mixed with music clips A and B).
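The training objective on this slide can be illustrated with a very small pure-Python network. This is a sketch, not the authors' implementation: it assumes networks A and B share one weight matrix, uses a single tanh layer, takes the squared gap between the transformed-audio distance and the teacher distance as the loss, and trains on made-up feature vectors.

```python
import math
import random


def init_weights(n_in, n_out, seed=0):
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def forward(W, x):
    """One tanh layer: y_j = tanh(sum_i W[j][i] * x[i])."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def distance(ya, yb):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ya, yb)))


def loss(W, pairs):
    """Mean squared gap between d_AB (audio side) and T_AB (video side)."""
    return sum((distance(forward(W, xa), forward(W, xb)) - t) ** 2
               for xa, xb, t in pairs) / len(pairs)


def train(W, pairs, epochs=300, lr=0.05):
    """Stochastic gradient descent on (d_AB - T_AB)^2; returns a new weight matrix."""
    W = [row[:] for row in W]
    for _ in range(epochs):
        for xa, xb, t in pairs:
            ya, yb = forward(W, xa), forward(W, xb)
            d = distance(ya, yb)
            if d < 1e-12:
                continue  # gradient of the distance is undefined at d = 0
            err = d - t
            for j in range(len(W)):
                g = 2.0 * err * (ya[j] - yb[j]) / d   # dLoss/dy^A_j (negated for y^B_j)
                ga = g * (1.0 - ya[j] ** 2)           # back through tanh of network A
                gb = -g * (1.0 - yb[j] ** 2)          # back through tanh of network B
                for i in range(len(W[j])):
                    W[j][i] -= lr * (ga * xa[i] + gb * xb[i])
    return W


# Each item: (audio features of clip A, audio features of clip B,
#             teacher distance T_AB between the scenes they were mixed with).
pairs = [
    ([1.0, 0.0], [0.0, 1.0], 1.0),
    ([1.0, 0.0], [0.9, 0.1], 0.1),
    ([0.0, 1.0], [0.1, 0.9], 0.1),
]
W0 = init_weights(2, 2)
W1 = train(W0, pairs)
```

After training, `loss(W1, pairs)` is lower than `loss(W0, pairs)`: distances in the transformed music space have moved toward the distances between the corresponding video scenes.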

Page 28: Example-Based Remixing of Multimedia Contents

Interface

Page 29: Example-Based Remixing of Multimedia Contents

Video Remix Examples: 61 Action Movie Trailers

Video Scene Examples: 45 Scenes

Music Clips: 180 Music Clips of Various Genres (Movie Soundtracks, Classical Music, Japanese-pop, Western-pop, etc.)

Video Clips: Shots extracted from Original Movies; 265 Home Video Clips recording a sports field day held by a kindergarten

Video Clip Sequence: Made by Procedure I)

Page 30: Example-Based Remixing of Multimedia Contents

Input: 10 video scenes randomly extracted from movie trailers (without audio stream)

10 subjects rated (1: very bad, 10: very good) the 10 video scenes mixed with each of the following:

Video1) 3 music clips selected by the proposed approach

Video2) Music clips most similar to the music excerpts mixed with the 3 least similar video scenes

Video3) Music clip mixed with the video scenes in the movie trailers (baseline: professional)

Video4) 3 music clips selected in the same way as for Video1) without music feature space transformation

Video5) 3 music clips selected in the same way as for Video2) without music feature space transformation

Average Subjective Scores

Video1   Video2   Video3   Video4   Video5
6.1      4.4      7.2      5.0      4.5

Video1 − Video2 = 1.72 ± 0.34 (95% confidence interval) ⇒ indicates the effectiveness of similarity-based music clip selection

Video1 − Video4 = 1.11 ± 0.35 ⇒ indicates the effectiveness of music feature space transformation

Video1 is closest to Video3 ⇒ the selected music clips are subjectively closest to professionally selected ones
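A difference reported as "mean ± half-width (95% confidence interval)" can be computed as below. The talk does not say how its intervals were obtained, so this is one plausible reading: paired ratings (the same subject and scene under both conditions) and a normal approximation. The ratings in the example are invented.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_diff_ci(scores_a, scores_b, level=0.95):
    """Mean paired difference and the half-width of its confidence interval.

    Uses the normal approximation; for small samples a t-quantile would be wider.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    z = NormalDist().inv_cdf(0.5 + level / 2)          # about 1.96 for 95%
    half_width = z * stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs), half_width


# Hypothetical ratings of the same six (subject, scene) pairs under two conditions.
video1 = [7, 6, 8, 5, 6, 7]
video2 = [5, 5, 6, 4, 4, 5]
diff, half = paired_diff_ci(video1, video2)
```

An interval that excludes zero, as both intervals on the slide do, is what supports reading the difference as a real effect.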

Page 31: Example-Based Remixing of Multimedia Contents

Score: Video1 6.8, Video2 2.7, Video3 8.3, Video4 6.4, Video5 3.0

Page 32: Example-Based Remixing of Multimedia Contents

Score: Video1 8.5, Video2 2.1, Video3 5.5, Video4 5.5, Video5 2.3

Page 33: Example-Based Remixing of Multimedia Contents

Video Clip Sequence after Music Mixing

With Template*: Subjective Score 5.3

Without Template: Subjective Score 3.8

* Selected video clips are shortened according to the template

The subjective score improved substantially after music mixing.

The created video clip sequence and the selected music clips are synergetic in improving the expressive quality.

Page 34: Example-Based Remixing of Multimedia Contents

Y. Kurihara, N. Nitta, and N. Babaguchi, “Automatic appropriate segment extraction from shots based on learning from example videos,” Proc. PSIVT, pp.1082-1093, 2009

Y. Kurihara, N. Nitta, and N. Babaguchi, “Appropriate segment extraction from shots based on temporal patterns of example videos,” Proc. MMM, pp.253-264, 2008

Page 35: Example-Based Remixing of Multimedia Contents

Video Clip Sequence

Video Remix

Video Clip 1

Shot 1 Shot 2 Shot 3

Video Clip 3 Video Clip 2

A video clip needs to be shortened. A video clip contains redundant parts.

Which part of a video clip should be extracted as a shot?

Shot Extraction from Selected Video Clip

Page 36: Example-Based Remixing of Multimedia Contents

Overview of Procedure III)

Learning from examples: Example Video Clip → Feature Extraction → Shot Symbolization → Symbol Sequence → Shot HMM (from the selected part, i.e., the shot) and Non-shot HMM (from the discarded part, i.e., the non-shot)

Shot Extraction: Video Clip → Feature Extraction → Pattern Scan for the k frames which best match the shot HMM
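The pattern scan can be sketched as a sliding k-frame window scored by two discrete HMMs. This is a hedged illustration: scoring a window by the difference of log-likelihoods under the shot and non-shot HMMs is an assumption (suggested by the f(n) − g(n) curves shown later in the deck), and the two-symbol alphabet and all model parameters are invented.

```python
import math


def log_likelihood(symbols, hmm):
    """log P(symbols | hmm) for a discrete HMM, via the scaled forward algorithm."""
    start, trans, emit = hmm["start"], hmm["trans"], hmm["emit"]
    n = len(start)
    alpha = [start[i] * emit[i][symbols[0]] for i in range(n)]
    logp = 0.0
    for t in range(len(symbols)):
        if t > 0:
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][symbols[t]]
                     for j in range(n)]
        scale = sum(alpha)
        logp += math.log(scale)             # product of the scales equals P(symbols)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return logp


def extract_shot(symbols, k, shot_hmm, nonshot_hmm):
    """Start index of the k-frame window that looks most like a shot."""
    best_start, best_score = 0, -math.inf
    for s in range(len(symbols) - k + 1):
        window = symbols[s:s + k]
        score = log_likelihood(window, shot_hmm) - log_likelihood(window, nonshot_hmm)
        if score > best_score:
            best_start, best_score = s, score
    return best_start


# Hypothetical frame symbols: 0 = low motion, 1 = high motion.
shot_hmm = {"start": [0.5, 0.5],
            "trans": [[0.9, 0.1], [0.1, 0.9]],
            "emit": [[0.3, 0.7], [0.1, 0.9]]}     # both states favour high motion
nonshot_hmm = {"start": [0.5, 0.5],
               "trans": [[0.9, 0.1], [0.1, 0.9]],
               "emit": [[0.7, 0.3], [0.9, 0.1]]}  # both states favour low motion

frames = [0, 0, 1, 1, 1, 0, 0, 0]
start = extract_shot(frames, 3, shot_hmm, nonshot_hmm)
```

Here the scan selects the window starting at frame 2, the only one made up entirely of high-motion frames.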

Page 37: Example-Based Remixing of Multimedia Contents

• Shot classification: action and conversation (shot types: Action, Conversation, Scenery, ...)

• Feature extraction

Each type of shot is characterized by different features

※ VSTD: Volume Standard Deviation, LVFR: Low Volume Signal Ratio, ERSB: Energy Ratio of Frequency SubBand, ZCR: Zero Crossing Ratio

Page 38: Example-Based Remixing of Multimedia Contents

Examples: Movies + Trailers

Video Clips: Shots in Movies

Shots: Shots in Trailers

Shot extraction from 69 video clips (shots in movies)

Shot Length (k) = Length of corresponding shots in trailers (32.3% of the video clip length on average)

22 47 Test

12 10 Training

Conversation Action

Experiments

Page 39: Example-Based Remixing of Multimedia Contents

Objective Evaluation

• Compare Extracted Shot with Ground Truth (Shot in Trailer)

• 1 frame = 0.1 sec

Video Clip (Action): 82 frames, k = 9 frames

Difference between Extracted Shot and Ground Truth: 3 frames (0.3 sec)

Page 40: Example-Based Remixing of Multimedia Contents

Video Clip (Conversation): 107 frames, k = 17 frames

Difference between Extracted Shot and Ground Truth: 3 frames (0.3 sec)

Page 41: Example-Based Remixing of Multimedia Contents

Correctly Extracted Shot

Video Clip (Action): 47 frames, k = 5 frames, Difference: 3 frames

Figure: log P versus frame number (0 to 50) for f(n), the Shot HMM (edited-segment model); g(n), the Non-Shot HMM (non-edited-segment model); and their difference f(n) − g(n), labeled hk(f) − gk(f) on the slide. The positions of the Extracted Shot and the Ground Truth are marked.

Page 42: Example-Based Remixing of Multimedia Contents

Incorrectly Extracted Shot

Video Clip: 35 frames, k = 7 frames, Difference: 26 frames

Figure: log P versus frame number (0 to 35) for f(n), the Shot HMM (edited-segment model); g(n), the Non-Shot HMM (non-edited-segment model); and their difference f(n) − g(n), labeled hk(f) − gk(f) on the slide. The positions of the Extracted Shot and the Ground Truth are marked.

Page 43: Example-Based Remixing of Multimedia Contents

Objective Evaluation

$$\text{accuracy} = \frac{\#\ \text{of correct extractions}}{\#\ \text{of video clips}}$$

※ Correct extraction: the shot was extracted within a T-frame difference (1 frame = 0.1 sec)

        Action         Conversation
T=2     50% (11/22)    53% (25/47)
T=3     64% (14/22)    60% (28/47)
T=5     73% (16/22)    72% (34/47)

Correct shots were extracted from 72.5% (50/69) of video clips when T=5
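The accuracy measure, the number of correct extractions divided by the number of video clips, can be computed as follows. This sketch assumes the difference is measured between the start frames of the extracted shot and the ground truth, and that "within T frames" means a difference of at most T; the frame positions in the example are made up.

```python
def extraction_accuracy(extracted_starts, truth_starts, tolerance):
    """Fraction of clips whose extracted shot is within `tolerance` frames of the truth."""
    correct = sum(1 for e, g in zip(extracted_starts, truth_starts)
                  if abs(e - g) <= tolerance)
    return correct / len(truth_starts)


# Hypothetical start frames for four video clips (differences: 2, 6, 3, 0 frames).
extracted = [10, 20, 33, 50]
truth = [12, 26, 30, 50]

acc_t5 = extraction_accuracy(extracted, truth, tolerance=5)
acc_t2 = extraction_accuracy(extracted, truth, tolerance=2)
```

Loosening T can only raise the accuracy, which is the trend in the reported results (36/69, 42/69, and 50/69 correct at T = 2, 3, and 5).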

Page 44: Example-Based Remixing of Multimedia Contents

14 subjects watched the original long video clips and then three kinds of short extracted shots, presented in random order, and ranked them (ties allowed):

① Ground Truth

② Extracted Shot

③ Random Shot

Page 45: Example-Based Remixing of Multimedia Contents

Ground Truth:③ Extracted Shot:② Random Shot:①

Video Clip (36 frames) ①

③ k = 15 frames

Page 46: Example-Based Remixing of Multimedia Contents

Subjective Evaluation

Action: 18 video clips; Conversation: 13 video clips

                  Rank 1    Rank 2    Rank 3
Ground Truth      69.1%     26.9%     4.0%
Extracted Shot    53.9%     38.9%     7.2%
Random Shot       7.1%      12.9%     80.0%

Extracted Shot ≒ Ground Truth >> Random Shot

Page 47: Example-Based Remixing of Multimedia Contents

Created Video Remix

With Template: Subjective Score 6.2

Without Template: Subjective Score 3.9

Subjective scores after each procedure:

          Proposed            Comparative
          I     II    III     I     II    III
Score     3     5.3   6.2     3.5   3.8   3.9

Length (min:sec): 0:36 0:43 10:56 10:59

Page 48: Example-Based Remixing of Multimedia Contents

Introduced an example-based approach for video remixing

Video Clip Sequence Creation

Music Clip Selection

Shot Extraction

Interface

Experiments using movie trailers as remix examples and movies and home videos as video clips

Verified the effectiveness of using remix examples

With Support(6.2), Without Support(3.9)

Conclusion

Page 49: Example-Based Remixing of Multimedia Contents

Improvement of Interface

More investigations using various types/genres of video remix examples

How many examples do we need?

Good examples can reduce the number of examples.

Page 50: Example-Based Remixing of Multimedia Contents