Semantic and Diverse Summarization of Egocentric Photo Events
Aniol Lidon Baulida
Master in Computer Vision (UAB, UPC, UPF, UOC)
Advisors:
Xavier Giró Nieto, Image Processing Group, Universitat Politècnica de Catalunya
Petia Radeva, Barcelona Perceptual Computing Lab, Universitat de Barcelona
1
Collaboration
Barcelona Perceptual Computing Laboratory:
Marc Bolaños, Petia Radeva
Image Processing Group:
Xavier Giró
Grup de Recerca Cervell, Cognició i Conducta:
Maite Garolera
Institute of Creative Media Technologies:
Matthias Zeppelzauer
2
Motivation
• In 2013, 44.4 million people with dementia worldwide.
• “Cognitive Stimulation Therapy”
3
Motivation
• Lifelogging with the Narrative Clip.
• Up to 2,000–3,000 images per day!
• Summarization is needed.
4
Goal
5
Automatically summarize events:
• Sorting by priority.
• Trade-off between RELEVANCE and DIVERSITY.
• Obtaining sorted ranks.
State of the art
• This project continues the work started by Ricard Mestre:
– Event segmentation and selection of the most repeated image from an event.
• Off-the-shelf algorithms used:
– Informativeness network: provided by Marc Bolaños (to be published).
– Blur detection: Crete et al., “The blur effect: perception and estimation with a new no-reference perceptual blur metric.”
– Saliency maps: provided by Kevin McGuinness (to be published).
– Face detection: Zhu et al., “Face detection, pose estimation, and landmark localization in the wild.”
– Object candidates: Arbelaez et al., “Multiscale Combinatorial Grouping.”
– Object detector: Hoffman et al., “Large Scale Detection through Adaptation.”
– Affective: Campos et al., “Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction.”
8
Pipeline
9
Prefiltering
11
Aim: Remove uninformative images.
Informativeness network
Fine-tuned with human annotations.
Filtering out: discard clearly uninformative frames.
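The prefiltering step can be sketched as a simple threshold on the informativeness scores produced by the CNN. This is an illustrative sketch: the function name and the 0.1 threshold are assumptions, not values from the thesis.

```python
def prefilter(frames, scores, threshold=0.1):
    """Keep only frames whose informativeness score passes the threshold.

    `scores` would come from the informativeness CNN (here just floats);
    the 0.1 threshold is illustrative, not the value used in the thesis.
    """
    return [f for f, s in zip(frames, scores) if s >= threshold]
```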
Pipeline
12
Relevance
14
What is relevance? Frame-level:
• Repeated.
• Unusual.
• WHAT? Representative of an activity.
• WHO? Social interactions.
• WHERE? Environment.
• WHEN the event occurred.
• HOW the activity occurred.
Relevance
15
What is relevance? Frame-level:
• WHAT? Representative of an activity → Saliency maps, Object detection.
• WHO? Social interactions → Face detection, Sentiment analysis (affectivity).
Relevance Ranking: pipeline
16
Prefiltering
Diversity re-ranking
Relevance ranking: Saliency maps
SalNet CNN
Aim: Determining interesting zones.
Scoring for relevance: Averaging all saliency-map values.
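The saliency scoring rule above is a plain average over the map. A minimal sketch (the function name is illustrative; a real SalNet map would be a 2-D array of per-pixel saliencies):

```python
def saliency_score(saliency_map):
    """Relevance score of a frame = average of its saliency-map values.

    `saliency_map` is a 2-D list of per-pixel saliencies in [0, 1],
    such as the output of a saliency model like SalNet.
    """
    total = sum(sum(row) for row in saliency_map)
    count = sum(len(row) for row in saliency_map)
    return total / count
```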
17
Relevance ranking
18
Objects
LSDA Large Scale Detection through Adaptation
Object Detector
Aim: Finding well-defined objects.
Scoring for relevance: summing the scores of all detected objects.
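The object-based score is simply the sum of detector confidences. A sketch, assuming detections arrive as (label, confidence) pairs from a detector such as LSDA:

```python
def object_score(detections):
    """Relevance score = sum of the confidence scores of all detected objects.

    `detections` is a list of (label, confidence) pairs, e.g. from LSDA.
    """
    return sum(conf for _, conf in detections)
```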
Relevance ranking
19
Faces
Face detection, pose estimation, and landmark localization in the wild.
Aim: Finding well-defined faces.
Scoring for relevance: summing the exponentials of all face confidences.
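One way to read “summing exponentially” is a sum of exponentials of the detector confidences, so one highly confident face outweighs several weak detections. The exact weighting is an assumption here, since the slides do not spell it out:

```python
import math

def face_score(confidences):
    """Face-based relevance: sum of exponentials of detector confidences.

    The exponential form is an assumption; the slides only say the face
    confidences are summed "exponentially".
    """
    return sum(math.exp(c) for c in confidences)
```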
Relevance Ranking: pipeline
20
Prefiltering
Diversity re-ranking
Pipeline
21
Diversity re-ranking
Re-ranking by Soft Max Diversity Fusion
23
Color similarity
Faces similarity
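The diversity re-ranking described above can be sketched as a greedy loop that discounts each candidate's relevance by its similarity to the images already selected, taking the maximum over the similarity channels (colour, faces) as a simple fusion. This is a sketch of the idea, not the exact Soft Max Diversity Fusion formulation from the thesis:

```python
def diversity_rerank(relevance, sims):
    """Greedy diversity re-ranking sketch (illustrative, not the exact method).

    `relevance[i]` is the relevance score of image i; `sims` is a list of
    similarity matrices (e.g. colour similarity, face similarity), each with
    sims[k][i][j] in [0, 1]. At every step we pick the image whose relevance,
    discounted by its maximum similarity to any already-selected image
    (max over channels and selected images), is highest.
    """
    n = len(relevance)
    remaining = set(range(n))
    ranking = []
    while remaining:
        def score(i):
            if not ranking:
                return relevance[i]
            penalty = max(sims[k][i][j]
                          for k in range(len(sims)) for j in ranking)
            return relevance[i] * (1.0 - penalty)
        best = max(remaining, key=score)
        ranking.append(best)
        remaining.remove(best)
    return ranking
```

With one channel where images 0 and 1 are near-duplicates, the re-ranker promotes the dissimilar image 2 ahead of image 1 despite its lower relevance.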
Similarity measure
26
ImageNet
Euclidean distance between features (L2 norm).
CNN trained with ImageNet DB (1000 classes) using CaffeNet Architecture.
Fully connected layer 8 removed.
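The similarity measure reduces to an L2 distance between CNN feature vectors (e.g. the activations left after removing fc8 from CaffeNet). A minimal sketch:

```python
import math

def l2_distance(feat_a, feat_b):
    """Euclidean (L2) distance between two CNN feature vectors,
    e.g. fc7 activations of a CaffeNet trained on ImageNet (fc8 removed)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)))
```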
Pipeline
27
Assessment
29
Validation of automatic approach
Manually annotated summaries
• 7 datasets with labelled ground-truth
• 2 online questionnaires
• Mean Opinion Score
Psychologists' feedback:
INTERMEDIATE VALIDATION FINAL EVALUATION
Subjective problem
30
[Figure: precision of the SELECTED images with respect to the GROUND-TRUTH set.]
Metric
31
Mean Normalized Sum of Max Similarities (MNSMS)
[Figure: MNSMS_n (%) curve, built by comparing the ground-truth set against the top-n images of the sorted result list for n = 1 … N.]
Normalization in both axes:
• Y: divide by the number of ground-truth samples.
• X: reshape the samples to N bins.
For each n, the similarity sum accumulates, for every ground-truth image, its maximum similarity to the top-n results; the area under the resulting curve (AUC) is the final score.
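The MNSMS curve described above can be sketched as follows; names are illustrative, and the AUC is approximated here by the mean of the curve values rather than a particular integration rule:

```python
def mnsms_curve(gt_feats, result_feats, similarity):
    """Sketch of the Mean Normalized Sum of Max Similarities (MNSMS).

    For each cut-off n over the sorted result list, sum for every
    ground-truth image its maximum similarity to the top-n results, and
    normalize by the number of ground-truth images. `similarity(a, b)`
    returns a value in [0, 1]. The final quality score is the area under
    the curve, approximated here by the mean of the curve values.
    """
    curve = []
    for n in range(1, len(result_feats) + 1):
        top_n = result_feats[:n]
        total = sum(max(similarity(g, r) for r in top_n) for g in gt_feats)
        curve.append(total / len(gt_feats))
    auc = sum(curve) / len(curve)
    return curve, auc
```

With a 0/1 similarity and ground truth {a, b}, ranking [a, c, b] reaches full coverage only at n = 3, which the curve reflects.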
Assessment
36
Validation of automatic approach
Manually annotated summaries
• 7 datasets with labelled ground-truth
• MNSMS (ImageNet) AUC
• 2 online questionnaires
• Mean Opinion Score
Psychologists' feedback:
INTERMEDIATE VALIDATION FINAL EVALUATION
Intermediate validation
37
Prefiltering:
• Informativeness Network
• Hand-crafted estimators
• No prefiltering
Intermediate validation
38
Saliency Relevance
• SalNet
• SalNet + Gaussian
Objects Relevance
• LSDA (object detector)
• MCG (object candidates)
[Charts: Saliency Relevance AUC (SalNet vs. SalNet + Gauss) and Objects Relevance AUC (LSDA vs. MCG), with values in the 0.7–0.9 range.]
Intermediate validation
Affective Relevance:
• Positive
• Negative
• Extremum
• Random
Sentiment analysis CNN, 2 classes: positive / negative.
39
Assessment
40
Validation of automatic approach
Manually annotated summaries
• 7 datasets with labelled ground-truth
• MNSMS (ImageNet) AUC
• 2 rounds of online questionnaires
• Mean Opinion Score
Psychologists' feedback:
INTERMEDIATE VALIDATION FINAL EVALUATION
Final evaluation
41
SIMILARITY• ImageNet CNN (fc8 removed)
• Places CNN (fc8 removed)
• LSDA (only spatial NMS)
• Fusion (ImageNet + Places + LSDA)
(Diversity re-ranking + Weight fusion in MNSMS)
Final evaluation
43
MEAN OPINION SCORE• ImageNet configuration
• Uniform Sampling
• Ground-truth (previous manual annotation)
Final results
Representativeness of summaries:
Preferred summary:
Mean Opinion Score (1 worst – 5 best)
45
Generalization: MediaEval diverse task
• APPLICATION: Finding more information about a place to visit.
• GOAL: Provide a ranked list of Flickr photos for a predefined set of queries. The refined list should be both relevant to the query and also diverse.
46
A. Lidon, M. Bolaños, M. Seidl, X. Giro-i Nieto, P. Radeva, and M. Zeppelzauer, “UPC-UB-STP @ MediaEval 2015 diversity task: Iterative reranking of relevant images,” in MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[Chart: Run 1 F1@20 (Visual), roughly in the 0.40–0.56 range.]
Conclusions
• Contributions:
– Mean Normalized Sum of Max Similarities.
– New criterion for semantic diversity (based on LSDA).
– New method for diversity fusion.
– Online evaluation questionnaires.
47
Conclusions
• Tested in two applications:
– Memory reinforcement for mild dementia.
– Diverse Social Images Task from the scientific MediaEval benchmark.
• Mean Opinion Score of 4.6 out of 5.00.
• Publications:
– Working-notes paper in the MediaEval challenge.
– Paper in the special issue “Wearable and Ego-vision Systems for Augmented Experience” of the journal IEEE Transactions on Human-Machine Systems.
• Code available: https://imatge.upc.edu/web/resources/semantic-and-diverse-summarization-egocentric-photo-events-software
48
Future work
• Explore other relevance criteria.
• Higher level of semantics.
• Determine the summary length automatically.
49
Thanks for your attention!
50
Prefiltering
51
Hand-crafted estimators: Blur (Crete et al.), Black, Burned, Color mean.
Informativeness network
• CNN trained with ImageNet + Places.
• Fine-tuned with human annotations: relevant / irrelevant.
By Marc Bolaños (UB).
Relevance ranking
52
Affective
• VitorNet CNN (2-class sentiment predictions)
by Victor Campos (UPC)
Relevance ranking
53
Late fusion
• Score normalization:
– By rank
– By score
• Aggregate scores.
Weights will be learned using MNSMS.
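The late-fusion step above can be sketched as rank-based normalization of each score list followed by a weighted sum. The weights are illustrative inputs here; in the thesis they would be learned by maximizing MNSMS:

```python
def rank_normalize(scores):
    """Normalize scores by rank: best item -> 1.0, worst -> close to 0."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(scores)
    normed = [0.0] * n
    for rank, i in enumerate(order):
        normed[i] = 1.0 - rank / n
    return normed

def late_fusion(score_lists, weights):
    """Weighted aggregation of several rank-normalized score lists.

    `weights` are illustrative; in the thesis they would be learned
    using the MNSMS metric.
    """
    normed = [rank_normalize(s) for s in score_lists]
    return [sum(w * ns[i] for w, ns in zip(weights, normed))
            for i in range(len(score_lists[0]))]
```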
Similarity measure
54
ImageNet
Places
LSDA
CNN trained with ImageNet DB (1000 classes) using CaffeNet Architecture.
Fully connected layer 8 removed.
CNN trained with Places (476 classes) DB using CaffeNet Architecture.
Fully connected layer 8 removed.
Object detector: Large Scale Detection through Adaptation (7,500 classes).
Knowledge transfer: turns classifiers trained without bounding-box annotations into detectors.
Two post-processing steps of non-maxima suppression.
Result: MediaEval diverse task
• APPLICATION: Finding more information about a place to visit.
• GOAL: Provide a ranked list of Flickr photos for a predefined set of queries. The refined list should be both relevant to the query and also diverse.
Pipeline stages:
• Ranking for relevance: Informativeness network, Textual.
• Filtering: keep N% top results.
• Distance computation: ImageNet, Places, Textual.
• Diversity: diverse top results.
Result: MediaEval diverse task
[Chart: results per run — Visual, Textual, Multi, Crediv. Multi.]