Semantic and Diverse Summarization of Egocentric Photo Events
Aniol Lidon Baulida
Master in Computer Vision (UAB, UPC, UPF, UOC)
Advisors:
Xavier Giró Nieto, Image Processing Group, Universitat Politècnica de Catalunya
Petia Radeva, Barcelona Perceptual Computing Lab, Universitat de Barcelona
1
Collaboration
Barcelona Perceptual Computing Laboratory:
Marc Bolaños, Petia Radeva
Image Processing Group:
Xavier Giró
Grup de Recerca Cervell, Cognició i Conducta:
Maite Garolera
Institute of Creative Media Technologies:
Matthias Zeppelzauer
2
Motivation
• In 2013, 44.4 million people with dementia worldwide.
• “Cognitive Stimulation Therapy”
3
Motivation
• Lifelogging with the Narrative Clip.
• Up to 2,000–3,000 images per day!
• Summarization is needed.
4
Goal
5
Automatically summarize events:
• Sorting by priority.
• Trade-off between RELEVANCE and DIVERSITY.
• Obtaining sorted ranks.
State of the art
• This project continues the work started by Ricard Mestre:
– Event segmentation and selection of the most repeated image from an event.
• Off-the-shelf algorithms used:
– Informativeness network: provided by Marc Bolaños (to be published).
– Blur detection: Crete et al., “The blur effect: perception and estimation with a new no-reference perceptual blur metric.”
– Saliency maps: provided by Kevin McGuinness (to be published).
– Face detection: Zhu et al., “Face detection, pose estimation, and landmark localization in the wild.”
– Object candidates: Arbelaez et al., “Multiscale Combinatorial Grouping.”
– Object detector: Hoffman et al., “Large Scale Detection through Adaptation.”
– Affective: Campos et al., “Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction.”
8
Pipeline
9
Prefiltering
11
Aim: Remove uninformative images.
Informativeness network
Fine-tuned with human annotations.
Filtering out: discard clearly uninformative frames.
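The prefiltering step can be sketched as a simple threshold on the informativeness scores produced by the CNN. This is an illustrative sketch: the function name and the 0.1 threshold are assumptions, not values from the thesis.

```python
def prefilter(frames, scores, threshold=0.1):
    """Keep only frames whose informativeness score passes the threshold.

    `scores` would come from the informativeness CNN (here just floats);
    the 0.1 threshold is illustrative, not the value used in the thesis.
    """
    return [f for f, s in zip(frames, scores) if s >= threshold]
```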
Pipeline
12
Relevance
14
What is relevance? Frame-level:
• Repeated.
• Unusual.
• WHAT? Representative of an activity.
• WHO? Social interactions.
• WHERE? Environment.
• WHEN the event occurred.
• HOW the activity occurred.
Relevance
15
What is relevance? Frame-level:
• WHAT? Representative of an activity → Saliency maps, Object detection.
• WHO? Social interactions → Face detection, Sentiment analysis (affectivity).
Relevance Ranking: pipeline
16
Prefiltering
Diversity re-ranking
Relevance ranking: Saliency maps
SalNet CNN
Aim: Determining interesting zones.
Scoring for relevance: Averaging all saliency-map values.
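The saliency scoring rule above is a plain average over the map. A minimal sketch (the function name is illustrative; a real SalNet map would be a 2-D array of per-pixel saliencies):

```python
def saliency_score(saliency_map):
    """Relevance score of a frame = average of its saliency-map values.

    `saliency_map` is a 2-D list of per-pixel saliencies in [0, 1],
    such as the output of a saliency model like SalNet.
    """
    total = sum(sum(row) for row in saliency_map)
    count = sum(len(row) for row in saliency_map)
    return total / count
```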
17
Relevance ranking
18
Objects
LSDA Large Scale Detection through Adaptation
Object Detector
Aim: Finding well-defined objects.
Scoring for relevance: summing the scores of all detected objects.
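The object-based score is simply the sum of detector confidences. A sketch, assuming detections arrive as (label, confidence) pairs from a detector such as LSDA:

```python
def object_score(detections):
    """Relevance score = sum of the confidence scores of all detected objects.

    `detections` is a list of (label, confidence) pairs, e.g. from LSDA.
    """
    return sum(conf for _, conf in detections)
```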
Relevance ranking
19
Faces
Face detection, pose estimation, and landmark localization in the wild.
Aim: Finding well-defined faces.
Scoring for relevance: summing the exponentials of all face confidences.
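One way to read “summing exponentially” is a sum of exponentials of the detector confidences, so one highly confident face outweighs several weak detections. The exact weighting is an assumption here, since the slides do not spell it out:

```python
import math

def face_score(confidences):
    """Face-based relevance: sum of exponentials of detector confidences.

    The exponential form is an assumption; the slides only say the face
    confidences are summed "exponentially".
    """
    return sum(math.exp(c) for c in confidences)
```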
Relevance Ranking: pipeline
20
Prefiltering
Diversity re-ranking
Pipeline
21
Diversity re-ranking
Re-ranking by Soft Max Diversity Fusion
23
Color similarity
Faces similarity
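The diversity re-ranking described above can be sketched as a greedy loop that discounts each candidate's relevance by its similarity to the images already selected, taking the maximum over the similarity channels (colour, faces) as a simple fusion. This is a sketch of the idea, not the exact Soft Max Diversity Fusion formulation from the thesis:

```python
def diversity_rerank(relevance, sims):
    """Greedy diversity re-ranking sketch (illustrative, not the exact method).

    `relevance[i]` is the relevance score of image i; `sims` is a list of
    similarity matrices (e.g. colour similarity, face similarity), each with
    sims[k][i][j] in [0, 1]. At every step we pick the image whose relevance,
    discounted by its maximum similarity to any already-selected image
    (max over channels and selected images), is highest.
    """
    n = len(relevance)
    remaining = set(range(n))
    ranking = []
    while remaining:
        def score(i):
            if not ranking:
                return relevance[i]
            penalty = max(sims[k][i][j]
                          for k in range(len(sims)) for j in ranking)
            return relevance[i] * (1.0 - penalty)
        best = max(remaining, key=score)
        ranking.append(best)
        remaining.remove(best)
    return ranking
```

With one channel where images 0 and 1 are near-duplicates, the re-ranker promotes the dissimilar image 2 ahead of image 1 despite its lower relevance.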
Similarity measure
26
ImageNet
Euclidean distance between features (L2 norm).
CNN trained with ImageNet DB (1000 classes) using CaffeNet Architecture.
Fully connected layer 8 removed.
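The similarity measure reduces to an L2 distance between CNN feature vectors (e.g. the activations left after removing fc8 from CaffeNet). A minimal sketch:

```python
import math

def l2_distance(feat_a, feat_b):
    """Euclidean (L2) distance between two CNN feature vectors,
    e.g. fc7 activations of a CaffeNet trained on ImageNet (fc8 removed)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)))
```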
Pipeline
27
Assessment
29
Validation of automatic approach
Manually annotated summaries
• 7 datasets with labelled ground-truth
• 2 online questionnaires
• Mean Opinion Score
Psychologists' feedback:
INTERMEDIATE VALIDATION FINAL EVALUATION
Subjective problem
30
[Figure: precision of the SELECTED images with respect to the GROUND-TRUTH set.]
Metric
31
Mean Normalized Sum of Max Similarities (MNSMS)
[Figure: MNSMS_n (%) curve, built by comparing the ground-truth set against the top-n images of the sorted result list for n = 1 … N.]
Normalization in both axes:
• Y: divide by the number of ground-truth samples.
• X: reshape the samples to N bins.
For each n, the similarity sum accumulates, for every ground-truth image, its maximum similarity to the top-n results; the area under the resulting curve (AUC) is the final score.
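The MNSMS curve described above can be sketched as follows; names are illustrative, and the AUC is approximated here by the mean of the curve values rather than a particular integration rule:

```python
def mnsms_curve(gt_feats, result_feats, similarity):
    """Sketch of the Mean Normalized Sum of Max Similarities (MNSMS).

    For each cut-off n over the sorted result list, sum for every
    ground-truth image its maximum similarity to the top-n results, and
    normalize by the number of ground-truth images. `similarity(a, b)`
    returns a value in [0, 1]. The final quality score is the area under
    the curve, approximated here by the mean of the curve values.
    """
    curve = []
    for n in range(1, len(result_feats) + 1):
        top_n = result_feats[:n]
        total = sum(max(similarity(g, r) for r in top_n) for g in gt_feats)
        curve.append(total / len(gt_feats))
    auc = sum(curve) / len(curve)
    return curve, auc
```

With a 0/1 similarity and ground truth {a, b}, ranking [a, c, b] reaches full coverage only at n = 3, which the curve reflects.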
Assessment
36
Validation of automatic approach
Manually annotated summaries
• 7 datasets with labelled ground-truth
• MNSMS (ImageNet) AUC
• 2 online questionnaires
• Mean Opinion Score
Psychologists' feedback:
INTERMEDIATE VALIDATION FINAL EVALUATION
Intermediate validation
37
Prefiltering:
• Informativeness Network
• Hand-crafted estimators
• No prefiltering
Intermediate validation
38
Saliency Relevance
• SalNet
• SalNet + Gaussian
Objects Relevance
• LSDA (object detector)
• MCG (object candidates)
[Charts: Saliency Relevance AUC (SalNet vs. SalNet + Gauss) and Objects Relevance AUC (LSDA vs. MCG), with values in the 0.7–0.9 range.]
Intermediate validation
Affective Relevance:
• Positive
• Negative
• Extremum
• Random
Sentiment analysis CNN, 2 classes: positive / negative.
39
Assessment
40
Validation of automatic approach
Manually annotated summaries
• 7 datasets with labelled ground-truth
• MNSMS (ImageNet) AUC
• 2 rounds of online questionnaires
• Mean Opinion Score
Psychologists' feedback:
INTERMEDIATE VALIDATION FINAL EVALUATION
Final evaluation
41
SIMILARITY• ImageNet CNN (fc8 removed)
• Places CNN (fc8 removed)
• LSDA (only spatial NMS)
• Fusion (ImageNet + Places + LSDA)
(Diversity re-ranking + Weight fusion in MNSMS)
Final evaluation
43
MEAN OPINION SCORE• ImageNet configuration
• Uniform Sampling
• Ground-truth (previous manual annotation)
Final results
Representativeness of summaries:
Preferred summary:
Mean Opinion Score (1 worst – 5 best)
45
Generalization: MediaEval diverse task
• APPLICATION: Finding more information about a place to visit.
• GOAL: Provide a ranked list of Flickr photos for a predefined set of queries. The refined list should be both relevant to the query and also diverse.
46
A. Lidon, M. Bolaños, M. Seidl, X. Giro-i Nieto, P. Radeva, and M. Zeppelzauer, “UPC-UB-STP @ MediaEval 2015 diversity task: Iterative reranking of relevant images,” in MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[Chart: Run 1 F1@20 (Visual), roughly in the 0.40–0.56 range.]
Conclusions
• Contributions:
– Mean Normalized Sum of Max Similarities.
– New criterion for semantic diversity (based on LSDA).
– New method for diversity fusion.
– Online evaluation questionnaires.
47
Conclusions
• Tested in two applications:
– Memory reinforcement for mild dementia.
– Diverse Social Images Task from the scientific MediaEval benchmark.
• Mean Opinion Score of 4.6 out of 5.00.
• Publications:
– Working-notes paper in the MediaEval challenge.
– Paper in the special issue “Wearable and Ego-vision Systems for Augmented Experience” of the journal IEEE Transactions on Human-Machine Systems.
• Code available: https://imatge.upc.edu/web/resources/semantic-and-diverse-summarization-egocentric-photo-events-software
48
Future work
• Explore other relevance criteria.
• Higher level of semantics.
• Determine the summary length automatically.
49
Thanks for your attention!
50
Prefiltering
51
Hand-crafted estimators: Blur (Crete et al.), Black, Burned, Color mean.
Informativeness network
• CNN trained with ImageNet + Places.
• Fine-tuned with human annotations: relevant / irrelevant.
By Marc Bolaños (UB).
Relevance ranking
52
Affective
• VitorNet CNN (2-class sentiment predictions)
by Victor Campos (UPC)
Relevance ranking
53
Late fusion
• Score normalization:
– By rank
– By score
• Aggregate scores.
Weights will be learned using MNSMS.
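The late-fusion step above can be sketched as rank-based normalization of each score list followed by a weighted sum. The weights are illustrative inputs here; in the thesis they would be learned by maximizing MNSMS:

```python
def rank_normalize(scores):
    """Normalize scores by rank: best item -> 1.0, worst -> close to 0."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(scores)
    normed = [0.0] * n
    for rank, i in enumerate(order):
        normed[i] = 1.0 - rank / n
    return normed

def late_fusion(score_lists, weights):
    """Weighted aggregation of several rank-normalized score lists.

    `weights` are illustrative; in the thesis they would be learned
    using the MNSMS metric.
    """
    normed = [rank_normalize(s) for s in score_lists]
    return [sum(w * ns[i] for w, ns in zip(weights, normed))
            for i in range(len(score_lists[0]))]
```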
Similarity measure
54
ImageNet
Places
LSDA
CNN trained with ImageNet DB (1000 classes) using CaffeNet Architecture.
Fully connected layer 8 removed.
CNN trained with Places (476 classes) DB using CaffeNet Architecture.
Fully connected layer 8 removed.
Object detector: Large Scale Detection through Adaptation (7,500 classes).
Knowledge transfer: turns classifiers trained without bounding-box annotations into detectors.
Two post-processing steps of non-maxima suppression.
Result: MediaEval diverse task
• APPLICATION: Finding more information about a place to visit.
• GOAL: Provide a ranked list of Flickr photos for a predefined set of queries. The refined list should be both relevant to the query and also diverse.
Pipeline stages:
• Ranking for relevance: Informativeness network, Textual.
• Filtering: keep N% top results.
• Distance computation: ImageNet, Places, Textual.
• Diversity: diverse top results.
Result: MediaEval diverse task
[Chart: results per run — Visual, Textual, Multi, Crediv. Multi.]