Joint Summarization of Large-scale Collections of Web Images and Videos for Storyline Reconstruction. Gunhee Kim, Leonid Sigal, Eric P. Xing. June 16, 2014.
1
Joint Summarization of Large-scale Collections of Web Images and Videos for Storyline Reconstruction
Gunhee Kim, Leonid Sigal, Eric P. Xing
June 16, 2014
2
Outline
• Problem Statement
• Algorithm
  Video summarization
  Storyline reconstruction
• Experiments
• Conclusion
3
Background
Online photo and video sharing has become very popular
Information overload problem in visual data
On average, 3,000 pictures are uploaded per minute
100 hours of video are uploaded per minute
Is there any efficient and comprehensive summary?
4
Our Objective
Jointly summarize large sets of online images and videos
• The characteristics of the two media are complementary
Videos: Much redundant and noisy information (backlit subjects, full of trivial background, overexposure)
Images: More carefully taken from canonical viewpoints
[Figure: a user video and a set of photo streams]
Collections of images → Video summarization
5
Our Objective
Jointly summarize large sets of online images and videos
• The characteristics of the two media are complementary
Images: Sequential structure is often missing
Videos: Motion pictures
[Figure: a set of user videos and a photo stream]
Collections of videos → Image summarization
6
Problem Statement
(Input) A set of photo streams and user videos for a topic of interest
(Output1) Video summary: keyframe-based summarization
(Output2) Image summary as a storyline graph
• Vertices: dominant image clusters
• Edges: chronological or causal relations (i.e., those that recur in many photo streams)
7
Flickr and YouTube Dataset
20 outdoor recreational classes:
Surfing Beach, Horse Riding, Rafting, Yacht, Air Ballooning, Rowing, Scuba Diving, Formula One, Snowboarding, Safari Park, Mountain Camping, Rock Climbing, Tour de France, London Marathon, Fly Fishing, Independence Day, Chinese New Year, Memorial Day, St. Patrick Day, Wimbledon
• # videos: 15,912
• # images / photo streams: 2,769,504 / 35,545
8
Outline
• Problem Statement
• Algorithm
  Video summarization
  Storyline reconstruction
• Experiments
• Conclusion
9
Algorithm for Video Summarization
1. For each video, find the K-nearest photo streams
• Extreme diversity even with the same keywords
• Use the Naïve-Bayes Nearest Neighbor method
2. Build a similarity graph between video frames and images
[Figure: a user video and a set of photo streams]
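Step 1 can be sketched in a few lines. This is a minimal illustration of Naïve-Bayes Nearest Neighbor matching, assuming each video and each photo stream is represented by a matrix of local descriptors. The function name `nbnn_nearest_streams`, the descriptor layout, and the brute-force distance computation are hypothetical choices for illustration, not details taken from the slides.

```python
import numpy as np

def nbnn_nearest_streams(video_desc, stream_descs, K):
    """Rank photo streams by NBNN distance to one video.

    video_desc:   (n, d) array of descriptors from the video's frames.
    stream_descs: list of (m_j, d) arrays, one per photo stream.
    Returns the indices of the K photo streams with the smallest distance.
    """
    scores = []
    for desc in stream_descs:
        # Squared Euclidean distance from every video descriptor
        # to every descriptor of this photo stream (brute force).
        diff = video_desc[:, None, :] - desc[None, :, :]
        d2 = np.sum(diff ** 2, axis=2)
        # NBNN score: sum over video descriptors of the distance
        # to their nearest neighbor in the photo stream.
        scores.append(d2.min(axis=1).sum())
    return np.argsort(scores)[:K]
```

With large collections, the brute-force search would normally be replaced by an approximate nearest-neighbor index; that choice is left open here.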
10
Algorithm for Video Summarization
1. For each video, find the K-nearest photo streams
• Extreme diversity even with the same keywords
• Use the Naïve-Bayes Nearest Neighbor method
2. Build a similarity graph between video frames and images
• k-th order Markov chain between frames
• Each image casts m similarity votes
[Figure: a user video and a set of photo streams]
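The graph construction in step 2 might look like the sketch below. It assumes frames and images are already encoded as feature vectors, that graph nodes are the frames followed by the images, and that similarity is a Gaussian of the squared feature distance. The function name `build_similarity_graph` and the kernel choice are assumptions for illustration only.

```python
import numpy as np

def build_similarity_graph(frames, images, k=2, m=3):
    """Return a symmetric weight matrix over frame and image nodes.

    frames: (n, d) frame features in temporal order (nodes 0..n-1).
    images: (p, d) image features (nodes n..n+p-1).
    """
    n, p = len(frames), len(images)
    W = np.zeros((n + p, n + p))

    def sim(a, b):
        # Gaussian similarity; the kernel is an assumed choice.
        return float(np.exp(-np.sum((a - b) ** 2)))

    # k-th order Markov chain: link each frame to its next k frames.
    for i in range(n):
        for j in range(i + 1, min(i + k, n - 1) + 1):
            W[i, j] = W[j, i] = sim(frames[i], frames[j])

    # Each image casts m votes: link it to its m most similar frames.
    for q in range(p):
        sims = np.array([sim(images[q], f) for f in frames])
        for i in np.argsort(-sims)[:m]:
            W[n + q, i] = W[i, n + q] = sims[i]
    return W
```

Frames that receive many image votes end up more densely connected, which is what lets the carefully taken images influence which frames are later chosen as keyframes.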
11
Algorithm for Video Summarization
3. Solve the following optimization problem of diversity ranking
• Choose the nodes at which to place heat sources so as to maximize the temperature
• Sources should be (i) densely connected nodes and (ii) distant from one another
Submodular [Kim et al. ICCV 2011]
A simple greedy algorithm achieves a constant-factor approximation
[Figure: a user video and a set of photo streams]
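A greedy selection of this kind can be sketched as follows. Note the swap: the temperature objective from the slide is not reproduced here. The code uses a stand-in coverage score, f(S) = sum over nodes v of the maximum similarity between v and any source in S, which is also monotone submodular and likewise favors sources that are well connected yet far from each other. The function name `greedy_select` is hypothetical.

```python
import numpy as np

def greedy_select(W, K):
    """Greedily pick K source nodes that maximize a coverage score.

    W: (n, n) nonnegative similarity matrix.
    Stand-in objective (an assumption, not the temperature function
    from the slides): f(S) = sum_v max_{s in S} W[v, s].
    """
    n = W.shape[0]
    covered = np.zeros(n)  # best similarity of each node to the chosen set
    selected = []
    for _ in range(min(K, n)):
        # Marginal gain of adding each candidate s to the current set.
        gains = np.maximum(W, covered[:, None]).sum(axis=0) - covered.sum()
        if selected:
            gains[selected] = -np.inf  # never pick a node twice
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, W[:, best])
    return selected
```

For monotone submodular objectives under a cardinality constraint, the classic guarantee for this greedy scheme is a factor of 1 - 1/e of the optimum.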
12
Outline
• Problem Statement
• Algorithm
  Video summarization
  Image summarization (Storyline reconstruction)
• Experiments
• Conclusion
13
Definition of Storyline Graphs
A storyline graph
• Vertex set: the set of codewords (i.e., image clusters)
  There are too many images, and many of them are largely redundant
• Edge set: popular transitions recurring across many photo streams
Edges should be sparse and time-varying [Song et al. 09, Kolar et al. 10]
Sparsity: only a small number of branching stories per node
• A few nonzero elements in the adjacency matrix
14
Definition of Storyline Graphs
A storyline graph
• Vertex set: the set of codewords (i.e., image clusters)
  There are too many images, and many of them are largely redundant
• Edge set: popular transitions recurring across many photo streams
Edges should be sparse and time-varying [Song et al. 09, Kolar et al. 10]
Time-varying: popular transitions change over time
[Figure: timeline with t = 10AM, t = 12PM, t = 2PM; clusters 10, 25, 44; panels "At 1PM" and "At 7PM"]
15
Directed Tree Derived from Photo Stream
1. For each photo stream, find the K-nearest videos
• Use the Naïve-Bayes Nearest Neighbor method
2. k-th order Markov chain between images in a photo stream
3. Keyframe detection for each neighbor video
4. Additional links are connected based on one-to-one correspondences
Directed Tree Derived from Photo Stream
5. Replace the vee structure (an impractical artifact) by two parallel edges
• Two images are each followed by the same third image.
✗ Both images must occur in order for the third to appear.
Inferring Photo Storyline Graphs (1/3)
Input: A set of photo streams
Output: A set of adjacency matrices A_t, one for each time t
Objective: Derive the likelihood of an observed set of photo streams under reasonable assumptions
(A1) All photo streams are taken independently
Likelihood of a single photo stream
(A2) k-th order Markovian assumption between consecutive images in a photo stream (e.g., k = 1)
(A3) The codewords of x_i^l are conditionally independent of one another given x_{i-1}^l
Transition model
18
Inferring Photo Storyline Graphs (2/3)
Objective: Derive the likelihood of an observed set of photo streams under reasonable assumptions
For the transition model, use a linear dynamic model, where the noise term is Gaussian
• 1st order Markovian assumption
• k-th order Markovian assumption
A transition from x to y is very unlikely!
Transition model
19
Inferring Photo Storyline Graphs (3/3)
Objective: Derive the likelihood of an observed set of photo streams under reasonable assumptions
For the transition model, use a linear dynamic model, where the noise term is Gaussian
• 1st order Markovian assumption
• The transition model can be written per dimension (d-th row)
The log-likelihood
Transition model
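One plausible way to write the model described above, as a hedged sketch under (A1)-(A3) with a first-order linear dynamic model. All notation here is assumed for illustration: x^l_i is the codeword vector of the i-th image in photo stream l, A is the adjacency matrix with d-th row A_{d*}, D is the number of codewords, and σ is the noise scale. The term for the first image of each stream is ignored.

```latex
\begin{align}
x^l_i &= A\, x^l_{i-1} + \varepsilon,
  \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I) \\
p\!\left(x^l_{i,d} \mid x^l_{i-1}\right)
  &= \mathcal{N}\!\left(x^l_{i,d};\; A_{d*}\, x^l_{i-1},\; \sigma^2\right) \\
\log p(\text{all photo streams})
  &= \sum_{l} \sum_{i} \sum_{d=1}^{D}
     \log p\!\left(x^l_{i,d} \mid x^l_{i-1}\right) \\
  &= -\frac{1}{2\sigma^2} \sum_{l} \sum_{i} \sum_{d=1}^{D}
     \left(x^l_{i,d} - A_{d*}\, x^l_{i-1}\right)^2 + \text{const}
\end{align}
```

Under this form, maximizing the log-likelihood decomposes into one least-squares problem per row A_{d*}, which is consistent with the per-dimension treatment on the slide.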
20
Optimization (1/2)
• (A4) Graphs vary smoothly over time.
For each t, estimate A_t by maximizing the log-likelihood
Optimization: Gaussian kernel weighting of the data (i.e., images) along the timeline
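The kernel weighting can be illustrated with a short sketch: observations taken close to the target time t receive large weights, and the weights decay smoothly with temporal distance. The function name `gaussian_kernel_weights`, the `bandwidth` parameter, and the normalization to sum to one are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel_weights(timestamps, t, bandwidth):
    """Normalized Gaussian weights of observations around time t.

    timestamps: observation times (e.g., hours of the day).
    bandwidth:  kernel width; larger values give smoother graphs over time.
    """
    ts = np.asarray(timestamps, dtype=float)
    w = np.exp(-0.5 * ((ts - t) / bandwidth) ** 2)
    return w / w.sum()
```

For example, `gaussian_kernel_weights([10, 12, 14], 12, 1.0)` gives the 12 o'clock observation the largest weight and equal, smaller weights to the other two.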
21
Optimization (2/2)
In summary, the graph inference is:
Iteratively solve a weighted L1-regularized least-squares problem, where the L1 penalty enforces sparsity
• Trivially parallelizable (for each d)
• Linear-time algorithm (e.g., coordinate descent)
• Important in our problem (i.e., handling millions of images)
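A weighted L1-regularized least-squares problem for a single row d can be solved by coordinate descent roughly as follows. This is a minimal sketch, not the authors' implementation: the objective scaling, the fixed iteration count, and the names `soft_threshold` and `weighted_lasso_cd` are assumptions. Here `X` holds the predecessor vectors, `y` the d-th coordinate of the successors, and `w` the per-observation weights.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam (the proximal step for the L1 penalty)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def weighted_lasso_cd(X, y, w, lam, n_iter=100):
    """Minimize 0.5 * sum_i w_i * (y_i - X_i @ a)^2 + lam * ||a||_1."""
    n, D = X.shape
    a = np.zeros(D)
    r = y - X @ a  # current residual
    for _ in range(n_iter):
        for j in range(D):
            xj = X[:, j]
            denom = np.sum(w * xj * xj)
            if denom == 0.0:
                continue  # an all-zero column cannot change the fit
            # Weighted correlation of column j with the residual
            # computed as if coefficient j were removed.
            rho = np.sum(w * xj * (r + xj * a[j]))
            new = soft_threshold(rho, lam) / denom
            r += xj * (a[j] - new)  # keep the residual in sync
            a[j] = new
    return a
```

Because each row d is an independent problem, the calls can be distributed across workers, which matches the "trivially parallelizable" point above.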
22
Outline
• Problem Statement
• Algorithm
  Video summarization
  Storyline reconstruction
• Experiments
• Conclusion
23
Evaluation of Video Summarization via AMT
Methods compared:
• (OursV): our method with videos only
• (OursIV): our method with videos and images
• (Unif): uniform sampling
• (Spect), (Kmeans): spectral clustering / K-means
• (RankT): keyframe extraction using the rank-tracing technique
Ground truth for video summarization via Amazon Mechanical Turk
• (1) For each of 100 test videos, each algorithm selects K keyframes
• (2) At least five turkers are asked to choose GT keyframes
• (3) Compare the GT keyframes with the ones chosen by each algorithm
24
Comparison of Video Summarization
[Figure: keyframes for air+ballooning and fly+fishing; rows AMT, (OursIV), (OursV), (Kmeans), (Unif)]
(OursIV): gets help from the votes cast by more carefully taken images
(OursV): suffers from the limitations of using low-level features only
(Kmeans): hard to know the best K
(Unif): cannot correctly handle subshots of different lengths
25
Evaluation on Storyline Graphs via AMT
Main difficulty of quantitative evaluation
• No ground truth is available.
• For a human subject, the images are too many and the graphs are too big
Crowdsourcing-based evaluation via Amazon Mechanical Turk
Ex) fly+fishing: Which is better?
26
Evaluation on Storyline Graphs via AMT
1. Each algorithm creates a storyline per topic.
2. Sample 100 important images as test images
3. Each algorithm predicts the most likely next image after the test image
4. A pairwise preference test
• Given the test image, which of A and B is more likely to come next?
• Get responses from at least 3 turkers per test image
A crowd of human subjects evaluates only a basic unit (i.e., important edges of the storyline).
[Figure: a test image with candidate next images A and B; labels: ✔ Our method, Baseline 2]
27
Quantitative Evaluation of Storyline Graphs via AMT
Results of pairwise preference tests
• The numbers indicate the percentage of responses judging that our prediction is more likely to occur next.
• The number should be higher than 50% to validate the superiority of our algorithm.
Methods compared:
• (OursV): our method with videos only
• (OursIV): our method with videos and images
• NET: network-based topic models [Kim et al. 2008]
• HMM: hidden Markov models
• Page: PageRank-based image retrieval (no structural info)
28
Qualitative Evaluation on Storyline Graphs
Given a pair of images in a novel photo stream, predict 10 images that are likely to occur between them using its storyline graph
• (HMM) retrieves reasonably good but highly redundant images, with no branching structure.
• (PageRank) retrieves high-quality images, but with no sequential structure.
[Figure: retrieved images in rows GT, Ours, (HMM), (PageRank)]
29
Qualitative Evaluation on Storyline Graphs
Given a pair of images in a novel photo stream, predict 10 images that are likely to occur between them using its storyline graph
[Figure: rows GT and Ours; a downsized storyline graph]
30
Outline
• Problem Statement
• Algorithm
  Video summarization
  Storyline reconstruction
• Experiments
• Conclusion
31
Conclusion
Joint summarization of Flickr images and YouTube videos
• 2.7M Flickr images and 17K YouTube videos for 20 classes
• The characteristics of the two media are complementary
  Images: More carefully taken from canonical viewpoints
  Videos: Motion pictures
Inference algorithm for sparse time-varying directed graphs
• Global optimality, linear complexity, and easy parallelization
Structural summary with branching narratives
Semantic summary even with simple feature similarity
32
Thank you!