
Semantic Summarization of Egocentric Photo Stream Events

Aniol Lidon, Marc Bolaños, Mariella Dimiccoli, Petia Radeva, Maite Garolera, and Xavier Giró-i-Nieto, Member, IEEE

Abstract—With the rapid increase in the number of wearable camera users in recent years and in the amount of data they produce, there is a strong need for automatic retrieval and summarization techniques. This work addresses the problem of automatically summarizing egocentric photo streams captured through a wearable camera by taking an image retrieval perspective. After removing non-informative images with a new CNN-based filter, images are ranked by relevance to ensure semantic diversity and finally re-ranked by a novelty criterion to reduce redundancy. To assess the results, a new evaluation metric is proposed which takes into account the non-uniqueness of the solution. Experimental results on a database of 7,110 images from 6 different subjects, evaluated by experts, yielded 95.74% expert satisfaction and a Mean Opinion Score of 4.57 out of 5.0.

Index Terms—photo stream summarization, lifelogging, semantic summarization, convolutional neural networks

I. INTRODUCTION

From smartphones to wearable devices, digital cameras are becoming ubiquitous. This process is being accompanied by the progressive reduction of digital storage cost, making it possible to collect large amounts of high-quality pictures in a very easy and affordable way. However, browsing these large collections and searching for specific information is still an extremely time-consuming and tedious process. As a consequence, most people do not even have time to look at their pictures, and finding a specific event, person or situation is too difficult. This situation raises a number of natural questions: how can we manage this large amount of pictures? Do we really need to store all of them? Summarization, the process of generating a proper, compact and meaningful representation of a given image collection through a subset of representative images, is crucial to help manage and browse efficiently large volumes of visual content. Although summarization is not a new research topic in Computer Vision [1], it is still a largely open problem.

Our goal in this paper is to address the summarization problem focusing on a particular scenario: the analysis of photo streams acquired through a wearable camera. Motivation for this work is given by the explosion of the number of wearable camera users in recent years and, consequently, by the huge amount of data they produce. Recording daily experiences from an egocentric, first-person perspective is a trend that has been growing progressively since 1998, when Steve Mann proposed the WearCam [2]. Later, in 2000, Mayol et al. proposed a necklace-like lifelogging device [3]. In 2006, Microsoft Research started to commercialize the first egocentric lifelogging portable camera, the SenseCam, for research purposes [4].

P. Radeva, M. Bolaños and M. Dimiccoli are with Universitat de Barcelona and the Computer Vision Center.
M. Garolera is with Consorci Sanitari de Terrassa.
X. Giró-i-Nieto and A. Lidon are with Universitat Politècnica de Catalunya.
Manuscript received October 30, 2015; revised Month Day, 201x.

Fig. 1. Example of temporally neighbouring images acquired by a wearable photo camera.

Lifelogging devices also gain more uses and functionalities every day, and this increasing number of capabilities allows us to build more complex and useful applications. Some examples of applications are: summarization of a person's day [5], extraction of nutritional information, extraction of physical activities and actions performed by the camera wearer [6], extraction of information about objects [7] or people [8] in the user's environment, etc.

Several authors [9], [10], [11], [12] have studied the benefits of lifelogging cues such as egocentric images to help people with dementia enhance their memory or remember their forgotten past. Sellen et al. [9] showed that episodic details from a visual lifelog can be presented to users as memory cues to assist them in remembering the details of their original experience. To support people with dementia, Piasek et al. [11] introduced the "SenseCam Therapy" as a therapeutic approach similar to the well-established "Cognitive Stimulation Therapy" [13]. Participants were asked to wear SenseCam in order to collect images of events from their everyday lives; the images were then reviewed with a trained therapist.

However, lifelogging technologies produce huge amounts of data (thousands of images per day) that should be reviewed by both patients and caregivers. To be efficient, lifelogging systems need to summarize the most relevant information in the images. On the other hand, in order to make it possible to record images over the whole day, it is necessary to use wearable cameras with low temporal resolution (2 fpm). The peculiarity of these image collections is that, due to the low temporal resolution of the camera and to the free motion of its wearer, temporally adjacent images may be very different in appearance even if they belong to the same event and should therefore be grouped together. For instance, during



a meeting, the people the camera wearer is interacting with and the objects around them may change their position and frequently appear occluded (see Fig. 1).

As a consequence, taking the semantics into account is crucial to summarize egocentric sequences acquired by a low temporal resolution camera. This paper proposes a method to summarize each of the events present in a daily egocentric sequence, aiming at preserving semantic information and diversity while reducing the total number of images. The method consists of three major steps: first, non-informative images are removed; second, the remaining images are ranked by semantic relevance; and finally, a new re-ranking is applied by enforcing diversity among the chosen subset of pictures. Our contributions can be summarized as follows:

1) Propose a CNN-based informativeness estimator for egocentric images.

2) Define a set of semantic relevance criteria for egocentric images.

3) Formulate the summarization task as a retrieval problem by combining informativeness, relevance and novelty criteria.

4) Define a soft metric to assess the novelty from partially annotated image datasets.

5) Validate our results with medical experts, with the aim of using them in a cognitive training framework to reinforce the memory of patients with mild cognitive impairment.

The rest of the paper is organized as follows: the next section reviews the related work, Section III details the proposed method, Section IV describes our experimental setup, Section V presents the experimental results and, finally, Section VI ends the paper with some concluding remarks.

II. RELATED WORK

A. Summarization by key-frame selection in Lifelogging

Egocentric photo stream summarization has been traditionally formulated as the problem of grouping lifelog images into coherent collections (or events) by first extracting low-level spatio-temporal features and then selecting the most representative image from each event. In this spirit, many authors proposed different strategies for temporal segmentation and key-frame selection. Doherty et al. [14] proposed a key-frame selection technique, which seeks to select the image with the highest quality as key-frame by relying on five types of features: contrast, color variance, global sharpness, noise, saliency and external sensor data (accelerometers and light). In addition to image quality, Blighe et al. [15] considered image similarity to select a key-frame in an event. Basically, after applying an image quality filter that removes all poor quality images, they selected as key-frame the image with the highest average similarity to all other images in the event. In that case, similarity was measured relying on the distance of SIFT descriptors. In [16], the key-frame is selected by a nearest neighbour ratio strategy that favors high quality images and, if the difference in quality is not large enough, favors images closer to the middle of the temporal segment. More recently, the authors in [17] proposed to use a Random Walk for selecting a single, most representative image for each temporal segment.

While these methods rely solely on low-level or mid-level features for the temporal segmentation and key-frame detection, a few recent works have introduced a higher semantic level in the selection process for video cameras, although, in these cases, due to the higher temporal resolution of the camera (about 30 fps), the aim is to select subshots (short video sub-sequences) instead of unique key-frames. Lu and Grauman [5] and Ghosh et al. [18] suggested that video summarization should preserve the narrative character of a visual lifelog and, therefore, it should ideally be made of a coherent chain of video subshots in which each subshot influences the next through some subset of key visual objects. Following this idea, in [5], [18], important people and objects are first discovered based on their interaction time with the camera wearer and then a subshot selection driven by key-object event occurrences is applied. Subshot selection is performed by incorporating into an objective function a term corresponding to the influence between subshots as well as image diversity. To model diversity, the authors relied on GIST descriptors and color histograms to model scenes and proposed a measure of diversity that is high when the scenes in sequential subshots are dissimilar, to ensure visual uniqueness. Visual diversity in lifelog summaries is modeled in [19] through the concept of novelty, which the authors heuristically defined as the deviation from some standard background. According to this definition, novelty is detected based on the absence of a good registration, in terms of ego-motion and the environment, between a new sequence and stored reference sequences. More recently, Gong et al. [20] proposed the so-called Sequential Determinantal Point Process (seqDPP) approach for video summarization, a probabilistic model with the ability to teach the system how to select informative and diverse subsets from human-created summaries, so as to best fit human-perceived quality based on evaluation metrics. This work is an adaptation to sequences of [21], which is able to capture the strong dependency structures between items in sequences. It is worth mentioning that all these works have been conceived to deal with video data, where, assuming that the temporal segmentation is good, temporal coherence and frame redundancy make the key-frame selection process easier.

B. Diversity and novelty in information retrieval

One of the first and seminal works on diversity in information retrieval was introduced by Carbonell & Goldstein [22] in 1998. Recognizing that, in the context of text retrieval and summarization, pure relevance ranking is not sufficient, they proposed to take into account simultaneously relevance and diversity to reduce redundancy. Following this first approach, diversity aims to "reduce redundancy while maintaining query relevance when re-ranking retrieved documents". Their approach, called Maximal Marginal Relevance (MMR), is a re-ranking technique that linearly combines independent measurements of relevance and diversity into a single metric, which is maximized iteratively.

A similar formulation of the diversification problem to MMR was given by Deselaers et al. [23] and was found to outperform a common clustering-based diversification approach in the context of diverse image retrieval. As in MMR, diversification is achieved via the optimization of a criterion that linearly combines relevance and diversity. However, [23] gave a more general formulation and used dynamic programming algorithms to perform the optimization in addition to the greedy and iterative algorithm presented in [22].

Fig. 2. General scheme of our event summarization methodology.

Diversity in social image retrieval was one of the focuses of the MediaEval 2013, 2014 and 2015 benchmarks and attracted the interest of many groups working in this area. Most participants developed diversification approaches that combined clustering with a key-frame selection strategy to extract representative images for each cluster. Spyromitros et al. proposed an MMR-based approach [24] that jointly considers relevance and diversity, but uses a supervised classification model to obtain the relevance scores learned from the user feedback. The contribution from Dang-Nguyen et al. [25] was to filter out non-relevant images at the beginning of the process, before applying diversity, hence simplifying it.

In [26], the authors presented a method for detecting and resolving the ambiguity of a query based on the textual features of the image collection. If a query has an ambiguous nature, this ambiguity should be reflected in the diversity of the result to increase user satisfaction. Leuken et al. [27] proposed to reduce the ambiguity of results provided by image search engines relying on textual descriptions by seeking the visual diversification of image search results. This is achieved by clustering the retrieved images based on their visual similarity and by selecting a representative image for each cluster. In these works, "diversity" is aimed at addressing the "ambiguity" of the textual descriptions (tags) in queries more than at avoiding redundancy in search results. The same terminology is used by [28], where the term "novelty" is meant to address redundancy in retrieved documents. In this work we will use the terms novelty and diversity as defined in [28].

III. METHODOLOGY

Inspired by image retrieval work, we investigate which diversity and relevance criteria, applied to egocentric images, are most suitable for keeping the narrative character of the visual lifelog. Taking into account that the images are acquired non-intentionally, it is important to disregard the non-informative images (see Section III-A) and keep the minimal set that represents the visual event. In this section, we explain the four main steps applied to construct the final resulting summary (see Fig. 2):

1) Informativeness Filtering (Section III-A): egocentric images are acquired non-intentionally and, therefore, many of them may be non-informative, capturing objects or people only partially or not at all, or being blurred or dark. By means of a CNN-based informativeness filtering, we discard most of the non-informative images from the egocentric event.

2) Relevance and Diversity-aware Ranking (Section III-B): an initial relevance image ranking is computed taking into account different criteria, such as Saliency and Objectness for dealing with the ambiguity or under-specification of queries like What is the user doing? or Where is the user?, and Face Detection for answering the question With whom is the user most likely interacting?

3) Novelty-based Re-ranking (Section III-C): a final re-ranking based on a novelty-maximization procedure is applied to the images already ranked by relevance. This step is crucial to select the images that represent the most varied set of concepts, avoiding redundancy without semantic loss.

4) Estimation of Fusion Weights with Mean Sum of Maximal Similarities (Section III-D): in this step, we define a novel soft metric, which we call Mean Sum of Maximal Similarities (MSMS), in order to define the priorities of the different relevance terms and construct the final summary.

A. Informativeness Filtering

Considering both the wearability of egocentric cameras and the non-intentionality of the pictures they take, several problems are inherent to them, such as over- or under-exposure, blurriness, pictures of the sky or ground, and pictures where possible objects of interest are badly centered in the image or even non-existent. Such images are referred to as non-informative pictures. A way to keep a good amount of the undesired images out of the final summary, and also to boost the performance of the next steps in our methodology, is to discriminate the informative pictures. Fig. 3 illustrates informative vs. non-informative photos taken by the wearable camera.

Fig. 3. Examples of informative images (top) and non-informative images (bottom) belonging to the same events. Faces appearing in the images have been manually blurred for privacy concerns.

In order to learn a model of informative pictures, and taking into account the complexity of visually distinguishing whether an image is informative enough or not, we propose training a CNN for a binary problem. All the images with an undesired artifact (empty image, blurred image, image with a small amount of information, image where something occludes most of the image region, sky, ground, ceiling, room wall, etc.) are considered non-informative (label 0), and the rest (images with semantic content) are considered informative (label 1). With this procedure, we are able to extract an informativeness score for each image and filter out the useless ones. In order to remove as many images as possible, but with the ultimate goal of having a very high recall (always trying to keep any informative image for the next steps), we only discard the images with informativenessScore < 0.025, which corresponds to the ones that are considered certainly non-informative by the CNN (see the experimental results in Section V-A).
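As a minimal sketch of this filtering step (assuming the per-image informativeness scores have already been produced by the fine-tuned CNN as the probability of the informative class), the thresholding could be written as follows; the function name and variable names are ours, only the 0.025 cut-off comes from the text.

```python
# Minimal sketch of the informativeness filter described above.
# `scores` holds, for each image of the event, the CNN probability of being
# informative; only images considered certainly non-informative (below the
# 0.025 threshold reported in the text) are discarded.

INFORMATIVENESS_THRESHOLD = 0.025  # value reported in Section V-A

def filter_informative(image_paths, scores, threshold=INFORMATIVENESS_THRESHOLD):
    """Return the images whose informativeness score reaches the threshold."""
    return [path for path, score in zip(image_paths, scores) if score >= threshold]

# Hypothetical usage:
# kept = filter_informative(["img_001.jpg", "img_002.jpg"], [0.91, 0.01])
# -> ["img_001.jpg"]
```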

The network training for the binary class distinction is performed by fine-tuning CaffeNet [29], pre-trained on the ImageNet [30] dataset and provided with the Caffe software [31]. Using this network as an initialization is crucial for assuring the high performance of the method, for two reasons:

1) As shown by Yosinski et al. in [32], when transferring the parameters of a CNN learned on an initial problem (object recognition, in this case) with a huge database of images to a similar one (informativeness), the network converges faster and is able to find a better optimum of the loss function.

2) The data needed for training a CNN from scratch is extremely difficult to gather in the lifelogging field, because of the cost of labeling huge sets of images and the limited number of public lifelogging image datasets [33].

Considering that lifelogging images are not only focused on objects and people, but also reflect daily life scenes, one could first think that using either PlacesCNN or HybridNet, the models proposed by Zhou et al. in [34], as pre-trained networks should obtain a higher performance after fine-tuning. Nevertheless, after performing some initial tests with these two networks, we discovered that using a net trained on ImageNet for object recognition gives better results. This behavior can be explained by: 1) the distinction criteria applied to label informative images (images usually with distinguishable and clear people or objects rather than scenes), and 2) the fact that most of the pictures in our daily life are usually taken in indoor spaces, where we are performing some activity with someone or with something that will usually be in the foreground and possibly occludes the general scene.

Fig. 4. Images acquired by a lifelogging device, where objects of interest appear, such as: computer, mobile, coffee, hand, bicycle, person, face, flower, apple, etc.

B. Relevance and Diversity-aware Ranking

Once non-informative images are discarded, we proceed to rank the remaining images by considering relevance criteria. Our solution formulates the summarization problem in similar terms as information retrieval. A ranked list of event frames is generated in such a way that a summary of T images directly corresponds to a truncation of the ranked list of N elements at its T-th position. Classic retrieval problems build their ranked lists as a response to a user query; in the summarization of events, this query would correspond to: Select T images to describe the event depicted in these N frames. This query is highly ambiguous, as an event can be described from different perspectives. For example, an egocentric visual summary of an event may respond to multiple intentions, such as: Where is the user? What activity is the user performing? With whom is the user interacting? We hypothesize that the relevance of each image with respect to these questions can be estimated with computer vision tools for saliency prediction, detection of objects and detection of faces.

1) Saliency Prediction for image relevance: We assume that images with more salient content are relevant and should have a higher probability of being included in the summary. Visual saliency can be triggered by a broad range of reasons, such as objects, people or characteristic features appearing in the picture. Many computer vision algorithms try to estimate the fixation points of the human eyes in a scene by means of saliency maps. These are heat maps of the same size as the image, whose higher values correspond to the image locations with a higher probability of capturing human visual attention.

In this work, we compute the saliency maps with SalNet [35], an end-to-end convolutional network for saliency prediction. We adopt the overall sum of the values in the saliency map as a quantitative estimator of the image relevance.

2) Object Detection for image relevance: We assume that images containing objects are relevant, since these objects likely correspond to the ones the user is interacting with. These objects would address the question What is the user doing? (see Fig. 4). In addition, introducing a semantic interpretation of the scene targets the summary at a higher abstraction level than saliency maps.

In our work, we used the off-the-shelf tool Large Scale Detection through Adaptation (LSDA) presented in [36]. This object detector is based on a CNN fine-tuned for local scale and provides a semantic label and a confidence score for the localization of the objects in the scene (see Fig. 5). This tool allows estimating the relevance of each frame by summing the detection scores of all objects in the picture, so that the frames with higher-confidence detections will be considered more relevant.

Fig. 5. Objects detected by the LSDA method in an egocentric image.

3) Face Detection for image relevance: We assume that images containing people are also relevant, since these people likely correspond to the ones the user is interacting with, and they would be useful to answer the question With whom is the user interacting? Therefore, a face detector complements the object detector by providing a cue for the user's social interactions during the event.

In our solution, we adopted the off-the-shelf face detector by Zhu et al. [37], which provides a confidence value for each of the detected faces. The relevance of each frame in terms of social interaction was estimated by summing the confidence scores of the face detections. In the particular implementation of [37], detection scores may also take negative values, so we actually used these scores in an exponential sum, which conveniently deals with the negative scores as well as encourages the selection of frames with multiple detected faces.
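The three per-frame relevance scores described above can be summarized in the following sketch. The detector outputs (a saliency map from SalNet, object detection confidences from LSDA, face detection confidences from the detector of [37]) are assumed to be already computed; the function names are ours, while the sum and exponential-sum aggregations follow the text.

```python
import numpy as np

def saliency_relevance(saliency_map):
    """Relevance as the overall sum of the saliency map values (Section III-B1)."""
    return float(np.sum(saliency_map))

def object_relevance(object_scores):
    """Relevance as the sum of the confidence scores of all detected objects."""
    return float(np.sum(object_scores))

def face_relevance(face_scores):
    """Relevance as an exponential sum of face detection confidences.

    The exponential handles negative scores and rewards frames with several
    detected faces, as described in Section III-B3.
    """
    return float(np.sum(np.exp(face_scores)))

# Hypothetical example for one frame:
# s = saliency_relevance(np.random.rand(480, 640))
# o = object_relevance([0.8, 0.65])   # two confident object detections
# f = face_relevance([-0.2, 1.3])     # one weak and one strong face detection
```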

4) Diversity-aware Ranking: The three relevance criteria detailed above allow us to cope with the ambiguity of the query, since they estimate the relevance from three different perspectives. The relevance scores computed for each case are then used to build three ranked lists that will be combined into a single one. This combination is based on generating a set of normalized scores based simply on the position of the frames. Normalized scores $r_k(x)$ are linearly distributed from the top (1 for the most relevant) to the bottom (0 for non-relevant) of the ranked list as follows:

$$r_k(x) = \frac{M - R_k(x)}{M - 1},$$

where $R_k(x) = 1, \dots, M$ is the ranking position associated with image $x$ according to each relevance criterion $k \in \{1, 2, 3\}$ and $M$ is the number of informative frames in the event, with $M < N$ and $N$ the total number of images in the event. The standard score normalization [38], [39] with the min and max scores was also tested, giving similar results, so the rank-based normalization was adopted to save computational resources.
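A small sketch of the rank-based normalization is given below: for one relevance criterion, the input is the list of the event frames already sorted from most to least relevant, and the output maps each frame to its normalized score $r_k(x)$; the function name is ours.

```python
def rank_normalized_scores(ranked_frames):
    """Rank-based normalization r_k(x) = (M - R_k(x)) / (M - 1).

    `ranked_frames` is the list of the M informative frames of the event,
    sorted by one relevance criterion (most relevant first).  The top frame
    gets score 1.0 and the bottom frame gets score 0.0.
    """
    M = len(ranked_frames)
    if M == 1:
        return {ranked_frames[0]: 1.0}
    return {frame: (M - rank) / (M - 1)
            for rank, frame in enumerate(ranked_frames, start=1)}

# Example: rank_normalized_scores(["f3", "f1", "f2"])
# -> {"f3": 1.0, "f1": 0.5, "f2": 0.0}
```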

C. Novelty-based Re-ranking

The relevance and diversity-aware ranked list sorts the informative images by combining different criteria for relevance, but does not explicitly cope with redundancy in the information. Further processing is necessary to maximize the novelty provided by each image with respect to the rest of the summary. In information retrieval, novelty is defined as a quality of a system that avoids redundancy [28]. In our approach of building a summary by truncating a ranked list of images, introducing novelty implies that each image in the list should differ as much as possible from its predecessors.

Our approach adopts the greedy selection algorithm presented by Deselaers et al. [40] to re-rank the fused list based on novelty. The goal of our summarization algorithm is to analyze the input set of informative and ranked images $X = \{x_1, \dots, x_M\}$ to iteratively build another set with minimal redundancy, say $\mathcal{Y}_T = \{x_{y_1}, \dots, x_{y_T}\}$, where $T \leq M$. Our approach starts by selecting the top-ranked image $x_1$ in the diversity-aware ranked list as the first element of $\mathcal{Y}$, that is, $\mathcal{Y}_1 = \{x_{y_1}\}$. The novelty of each candidate image $x^*$ to be added to the summary at iteration $t = 2, \dots, T$ is defined as:

$$n(x^*, \mathcal{Y}_t) = 1 - s(x^*, \mathcal{Y}_t) = 1 - \max_{x_{y_j} \in \mathcal{Y}_t} s(x^*, x_{y_j}) \qquad (1)$$

where $s(x^*, x_{y_j})$ is a normalized similarity measure between the candidate $x^*$ and image $x_{y_j}$. In this way, the more different a new image is with respect to the ones in $\mathcal{Y}_t$, the higher its novelty.

In our work, the similarity $s(x^*, x_{y_j})$ is based on visual appearance. Each image is described by a feature vector corresponding to the seventh fully-connected layer of the CaffeNet convolutional neural network [31], which was also used as initialization for the informativeness filter (see Section III-A), and whose architecture was inspired by AlexNet [29] and trained with the ImageNet dataset [30]. The similarity score is computed from the Euclidean distance between the feature vectors.
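The text does not detail how the Euclidean distance between fc7 features is turned into the normalized similarity required by Equation (1), so the sketch below rescales the distances by the maximum pairwise distance within the event; this normalization choice, as well as the function name, is an assumption of ours.

```python
import numpy as np

def pairwise_similarities(features):
    """Normalized visual similarities from CNN feature vectors.

    `features` is an (M, D) array with one fc7 feature vector per image.
    Euclidean distances are rescaled by their maximum so that the returned
    similarities lie in [0, 1] (assumption: this normalization is ours).
    """
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    max_dist = dists.max()
    if max_dist == 0:
        return np.ones_like(dists)
    return 1.0 - dists / max_dist

# Hypothetical example with 4 images and 4096-dimensional fc7 features:
# sims = pairwise_similarities(np.random.rand(4, 4096))
```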

The summary of the event is built by combining the relevance and the novelty of the candidate frames $x^*$. A greedy selection algorithm iteratively chooses the next image in the summary as:

$$x_{y_{t+1}} = \operatorname*{argmax}_{x^* \in X \setminus \mathcal{Y}_t} \big( r(x^*) + n(x^*, \mathcal{Y}_t) \big), \qquad \mathcal{Y}_{t+1} = \mathcal{Y}_t \cup \{x_{y_{t+1}}\} \qquad (2)$$

that is, the image with the highest sum of relevance and novelty. We detail the computation of $r(x^*)$ in Equation (4). Fig. 6 shows an example of the evolution of the obtained sum of relevance and novelty measures along the ranked list of an event. Note the general descending tendency of the relevance and novelty score; the small perturbations are due to the greedy nature of the computation of the novelty score.
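The greedy re-ranking of Equations (1) and (2) can be sketched as the loop below. It takes the frames sorted by fused relevance, their fused scores $r(x)$ and a pairwise similarity function (for instance the one sketched above), and returns the full re-ranked list, which is truncated at T to obtain the summary; the helper names are ours.

```python
def novelty_rerank(frames, relevance, similarity):
    """Greedy re-ranking combining relevance and novelty (Eqs. (1)-(2)).

    frames     : frames sorted by fused relevance (most relevant first).
    relevance  : dict mapping each frame to its fused score r(x).
    similarity : function s(a, b) in [0, 1] between two frames.
    """
    selected = [frames[0]]            # Y_1 = top-ranked image
    remaining = set(frames[1:])
    while remaining:
        def score(candidate):
            # novelty: 1 minus the maximal similarity to the already selected images
            novelty = 1.0 - max(similarity(candidate, y) for y in selected)
            return relevance[candidate] + novelty
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected                   # truncate at T to obtain the summary
```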


Fig. 6. The aggregation of relevance (r) and novelty (n) scores tends to decrease with the position in the ranking.

Fig. 7. Visual representation of the Sum of Maximal Similarities when comparing the first T = 3 images from the ranked list with the P = 3 red, green and blue images from the validation set. The similarity between images from the validation set and images from the ranked list is represented as a T×P matrix of gray-level squares, where each square (i, j) of the matrix represents the similarity between image i from the ranked list and image j from the validation set. High-intensity gray-level values indicate high similarity.

D. Estimation of Fusion Weights with Mean Sum of Maximal Similarities

Different relevance criteria may have different importance for the final ranking. To this aim, we propose a novel approach to estimate the priority of each criterion and fuse them, based on comparing the similarity of the images from a validation set of P elements $V = \{x_{v_1}, \dots, x_{v_P}\}$ with a summary of T elements $\mathcal{Y}_T = \{x_{y_1}, \dots, x_{y_T}\}$.

The Sum of Maximal Similarities (SMS) of $V$ with respect to $\mathcal{Y}_T$ is defined as:

$$SMS(V, \mathcal{Y}_T) = \frac{1}{P} \sum_{i=1}^{P} s(x_{v_i}, \mathcal{Y}_T) \qquad (3)$$

where $s(x_{v_i}, \mathcal{Y}_T)$ is the similarity of image $x_{v_i}$ from the validation set with respect to $\mathcal{Y}_T$, i.e., its maximal similarity to any image in $\mathcal{Y}_T$. Following this metric, the more similar the validation set $V$ is to the selected images $\mathcal{Y}_T$, the higher the average similarity.

The main advantage of our soft metric presented in Equation (3) is that automatic summaries that are very similar to the validation images, although they do not coincide exactly, will still be assessed as valid solutions. This feature of our soft metric is especially important when working with sequences of egocentric photo streams, which usually contain a high amount of redundancy.

Fig. 8. MSMS curves and AUC for each relevance criterion used separately.

Our summaries are built as a truncation of a ranked list, so the value of SMS depends on T. Fig. 7 shows a schematic example, where a validation set composed of P = 3 images (represented by three colors: green, blue and red) is compared with an event of M = 5 images, which have been previously filtered and ranked based on our diversity and novelty-aware criterion. In this example, a summary with T = 3 images is considered. Let us imagine that the green image from V is matched with the third one in the ranked list, while the blue and red images are matched with the second one.

Notice that, since in our setup the validation set V is always a subset of the input set X, the final average similarity (i.e., considering the whole sequence, T = M images) for $\mathcal{Y}_M$ will always be equal to one. In the example of Fig. 7, SMS must reach a value of one when the whole ranked list is considered, that is, when T = M.

Let us consider the evolution of the SMS of summary $\mathcal{Y}$ as the parameter $t$ grows ($t = 1, 2, \dots, M$), defined as:

$$SMS(V, \mathcal{Y}) = \{SMS(V, \mathcal{Y}_1), \dots, SMS(V, \mathcal{Y}_M)\}$$

We construct the SMS curves for the whole validation set, interpolate them, normalize them with respect to the length M of each sequence and average the curves, one curve for each event in the validation set. The resulting curve represents the Mean Sum of Maximal Similarities (MSMS).
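A sketch of the SMS and of the per-event curve that is averaged into the MSMS is given below. The similarity function is assumed to be the same normalized measure used for the novelty computation, and the interpolation onto a common t/M axis with a fixed number of points is an implementation choice of ours.

```python
import numpy as np

def sms(validation_set, summary, similarity):
    """Sum of Maximal Similarities (Eq. (3)) of V with respect to Y_T."""
    return float(np.mean([max(similarity(v, y) for y in summary)
                          for v in validation_set]))

def sms_curve(validation_set, ranked_frames, similarity, n_points=100):
    """SMS(V, Y_t) for t = 1..M, interpolated over t/M so that events of
    different lengths can be averaged into the MSMS curve."""
    M = len(ranked_frames)
    values = [sms(validation_set, ranked_frames[:t], similarity)
              for t in range(1, M + 1)]
    return np.interp(np.linspace(0.0, 1.0, n_points),
                     np.arange(1, M + 1) / M, values)

# The MSMS curve is the mean of the per-event curves, and its area under the
# curve (e.g. with np.trapz) gives the quality measure used for the fusion weights.
```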

Fig. 8 shows three examples of MSMS curves. Each curve illustrates the evolution of the MSMS as a function of the percentage of images covered by the visual summary. The X-axis represents the proportion of event images covered by the top t items in the list, that is, the MSMS is plotted over t/M, where M is the number of filtered images in the event. In this way, the curves of events of different lengths M can be compared on the same plot. The best MSMS curves are those which reach higher values with the minimal amount of images in the summary.

Note that the MSMS can be computed for each of the three relevance criteria separately, which gives a measure of the quality of each of them. Hence, we introduce a final refinement in the fusion of the ranked lists by estimating the confidence of each of the three relevance criteria and using it to weight the corresponding relevance terms. Thus, we leverage the contribution of each of the three relevance criteria based on their stand-alone performance (see Fig. 8). This weight corresponds to the normalized Area Under the Curve (AUC) of the MSMS measure.

As a result, the fusion weight $w(k)$ for the relevance criterion $k$ is estimated as:

$$w(k) = \frac{AUC(k)}{\sum_{i=1}^{3} AUC(i)}.$$

Finally, the three normalized relevance scores associated with each frame are aggregated with a weighted sum to obtain their fused score $r(x)$, as described by the following equation:

$$r(x) = \sum_{k=1}^{3} w(k)\, r_k(x), \qquad (4)$$

and the final set of frames is obtained by applying the updated relevance and novelty criteria of Equation (2).
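The AUC-based fusion can be sketched as follows; `auc_per_criterion` would come from the MSMS curves of each criterion on the validation set (Fig. 8), the dictionaries of normalized scores from the rank-based normalization sketched earlier, and the function names are ours.

```python
def fusion_weights(auc_per_criterion):
    """Normalize the per-criterion AUC values into fusion weights w(k)."""
    total = sum(auc_per_criterion.values())
    return {k: auc / total for k, auc in auc_per_criterion.items()}

def fused_relevance(normalized_scores, weights):
    """Weighted sum r(x) = sum_k w(k) * r_k(x)  (Eq. (4)).

    `normalized_scores[k][x]` is the rank-normalized score of frame x for
    criterion k; the result maps each frame to its fused relevance score.
    """
    frames = next(iter(normalized_scores.values())).keys()
    return {x: sum(weights[k] * normalized_scores[k][x] for k in normalized_scores)
            for x in frames}

# Hypothetical usage:
# w = fusion_weights({"saliency": 0.70, "objects": 0.75, "faces": 0.80})
# r = fused_relevance({"saliency": {...}, "objects": {...}, "faces": {...}}, w)
```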

IV. EXPERIMENTAL SETUP

A. Dataset

The dataset used for the validation of our method was acquired with the wearable camera Narrative Clip (www.getnarrative.com), which takes a picture every 30 seconds (2 fpm). It is composed of 10 day-long lifelogs from 6 different subjects, with a total of 7,110 images¹. Each day was manually segmented into between 10 and 25 events, although any automatic segmentation method could be used (e.g. [41]). The event segmentation separated the pictures into a set of ordered and relevant semantic events or segments. In this paper, we use this segmentation as the starting point for the semantic summarization.

In order to carry out a quantitative evaluation of our method, each daily lifelog was annotated at two levels by psychologists in the following way:

1) GT Level 1 - Informativeness: a positive or negative label depending on whether the image is considered informative or non-informative (see Section III-A). The proportion of images labeled as informative is 61.22% of the complete dataset.

2) GT Level 2 - Summaries: a set of key-frames (a subset of the informative images) was selected for each dataset by the experts. These key-frames cover all the relevant aspects of the events, with enough images to easily follow each event story.

B. Semantic Assessment of Summaries

The assessment of visual summaries is a challenging task due to the rich semantic content of the images and the ambiguity in the evaluation criteria. In addition, human annotation is an expensive resource, which becomes dramatically scarce when dealing with tasks that require expert annotations in the domain. In our case, the ground truth annotations have been performed by psychologists, who defined the summaries of the events the wearers were involved in from a human-centered point of view.

¹ The link to the dataset and its ground truth are prepared to be made publicly available when the article is published.

We addressed the challenge of summary assessment as follows: a subset of the event summaries built by the psychologists was used as a validation set to learn and study the settings of our system, and then the automatic summaries built with the best configuration of our system were presented in a blind taste test to the expert annotators.

1) Validation based on MSMS: The classical methods to evaluate summaries cannot be applied in our setup, because they require the annotation of the full dataset. In a traditional case, each document in the ground truth is labeled as relevant or non-relevant for the summary and, in some cases, also clustered into groups of redundant items that cover the same sub-topic. Such annotations allow the definition of metrics like Precision [22] for relevance and Cluster Recall (or subtopic-recall) [42] for diversity and/or novelty. However, in other cases (like ours), only a small portion of the dataset is annotated.

Given this limitation, following the framework of Subsection III-C, we propose an evaluation approach based on the similarity of the images in the ground truth set $G = \{x_{g_1}, \dots, x_{g_P}\}$ when compared to the automatic summary of T elements, $\mathcal{Y}_T = \{x_{y_1}, \dots, x_{y_T}\}$. In this case, we apply the SMS, $SMS(G, \mathcal{Y}_T)$, of the ground truth $G$ with respect to the extracted summary $\mathcal{Y}_T$. As before, we obtain the MSMS as the average value when considering all events in the ground truth dataset, which is finally plotted on the Y-axis of our evaluation scores. Finally, the AUC is computed to obtain a quantitative measure of each configuration for the summary.

2) Blind taste test: The resulting summaries are evaluated with a blind taste test by the team of experts who generated the ground truth summaries. This methodology, previously used in [5], [17], shows different summaries from the same event to the experts, so they can rate them from a comparative perspective. By randomly changing the position of the different techniques for each event, graders are unaware of which technique was used to generate each of the summaries, guaranteeing this way that their judgments are not biased towards any of the algorithm components.

The number of images included in each summary corresponds to the number of frames selected by the experts when building the ground truth summaries, that is, P = T. In this way, the summary shown to the graders is a truncation of the final ranked list of images. In our case, the experts decided the length of the automatic summary to be a percentage of the length of the whole event. Note that developing an automatic system to establish the length of the summary is out of the scope of this paper.

Additionally, the selected images are sorted in the summary according to their time stamps. This final sorting helps the user reconstruct the story between the images and have a better understanding of the event.

An online platform was developed for the experts to evaluate our results. In a first round of questionnaires (see Fig. 9), we wanted to compare the results of our diversity method using ImageNet or Places features as the similarity criterion (see Equation (1)). For each event, the questionnaire first shows all the images belonging to the event, and then the user has to answer the questions Is the summary representative of the event? and Which summary do you prefer? for each similarity method (using ImageNet or Places).

Fig. 9. Screenshot of the first questionnaire presented to the evaluators.

In the second questionnaire, the experts had to give a grade to each of the presented ranked summaries. A screenshot of the questionnaire is shown in Fig. 10. The three summaries correspond, respectively, to: our approach, a uniform sampling of images as a lower-bound example, and the ground truth summaries constructed by the same experts two months earlier as an upper bound of the performance score. The order in which the different summaries are shown is also randomly chosen for each event, again to avoid any bias due to the sorting.

In this case, a comparison between the three types of summaries is obtained by asking the experts to grade each solution from 1 (worst) to 5 (best). The collected data allowed the computation of the Mean Opinion Score (MOS) for each configuration (see Section V-D).

Fig. 10. Screenshot of the second questionnaire presented to the evaluators.

V. RESULTS

Two types of experiments validate the presented results: one for the informativeness filtering and a second one using online questionnaires for the relevance, diversity and novelty of the final visual summaries. In both cases, the evaluation was based on the feedback provided by the experts.

A. Informativeness Filtering

To validate how successful our algorithm is at filtering non-informative frames, we applied a 10-fold (one day set out) cross-validation. For all the experiments, we used the following parameters: base_lr = 10^-8, lr_policy = "step", gamma = 0.1, stepsize = 3000, blobs_lr_fc6 = blobs_lr_conv5 × 5, blobs_lr_fc7 = blobs_lr_conv5 × 8 and blobs_lr_fc8 = blobs_lr_conv5 × 10. In Table I, we present the validation accuracy obtained on each of the sets and the number of iterations applied to achieve the best result. In Fig. 11, we show the training process for one of the sets. After training the networks for each cross-validation fold, we evaluated the general average performance in terms of Accuracy, Precision, Recall and F-Measure for different informativeness score threshold values². We obtain the highest F-Measure value when we filter images with an informativenessScore < 0.025. In Fig. 12, we show the Precision-Recall curves for all thresholds, obtaining the best results with respect to the F-Measure of: F-Measure = 0.881, Accuracy = 0.846, Precision = 0.84 and Recall = 0.926 with threshold = 0.05.

TABLE I
BEST VALIDATION ACCURACY ON EACH SET AND THE RESPECTIVE NUMBER OF TRAINING ITERATIONS PERFORMED TO ACHIEVE THE RESULTS. SubjX REPRESENTS THE ANONYMIZED SETS FOR A GIVEN SUBJECT.

Set          SubjA1  SubjA2  SubjB1  SubjB2  SubjB3  SubjC1  SubjD1  SubjE1  SubjF1  SubjF2
Accuracy     0.759   0.841   0.805   0.795   0.799   0.837   0.805   0.867   0.795   0.897
Iteration #  3,600   1,000   3,600   3,400   3,800   2,600   2,200   3,600   2,400   18,000

Fig. 11. Training process on the validation set SubjC1.

Fig. 12. Precision-Recall curve for different informativeness score threshold values (a small subset of them shown as black dots) for all the sets.

B. Relevance Estimation

² If the informativeness score of a sample is below the threshold, it is considered non-informative.

We evaluated the three relevance criteria separately: saliency, object detection and face detection. Fig. 13 shows the top 5 images extracted with the different criteria. We can see that, although most of them give high scores to images where faces are present, their combination has a lot of potential for also offering a diverse set of images.

Fig. 13. Example of the top 5 images obtained using each relevance criterion: saliency (top), objectness (middle) and faces (bottom).

For the object relevance method, we compared the LSDA object detector with another state-of-the-art method, namely Multiscale Combinatorial Grouping (MCG) [43], to determine which algorithm gives better performance: MCG as an object candidate generator, and LSDA as an object detector. To test them, the prefiltering and diversity stages were switched off. The experiments proposed to compare them were the following:

1) Rank events using the objects' LSDA relevance and evaluate the results.

2) Rank events using the objects' MCG relevance and evaluate the results.

As we can see in Fig. 14, the LSDA object detector outperformed the MCG object candidates method. Our intuition of why the former beats the latter is that LSDA detects only a few, but highly probable, objects. On the other hand, MCG gives thousands of possible objects that, when summing all their scores, introduce noise produced by a huge set of low-probability ones. The strength of LSDA is the Non-Maximum Suppression (NMS) step, where only clear objects are kept.

To find the best fusion method for the different relevance cues, we compared a weighted fusion based on the AUC value obtained by each method against a fusion that simply averages the obtained relevance values. In Fig. 15, we can see the comparison, which corroborates that using a weighted fusion based on AUC obtains a better overall AUC measure.


Fig. 14. MSMS graph of object relevance methods.

Fig. 15. MSMS curves and AUC for each fusion method, based on weights extracted from the respective AUC of each relevance method (blue) or on simply averaging the measures (green).

C. Diversity

Fig. 16 qualitatively illustrates the differences obtained when introducing diversity into the ranked list. As we can see, when we introduce the novelty re-ranking step, we are able not only to obtain a more visually acceptable set of images, but also to describe with pictures everything that is happening during the whole event. Thus, we avoid focusing on a single set of highly relevant images that picture the same concept, activity or background.

D. Ranking Summary Quality

The first round of blind taste tests posed two questions for each event. Firstly, experts decided whether each presented summary was representative and, secondly, they chose the preferred one (see Fig. 9). For the question Is the summary representative of the event?, in 95.74% of the cases the experts agreed with our solution using the ImageNet features to compute the similarity in the novelty-based re-ranking. Using the other alternative, features from the Places CNN, the obtained result was 94.33%. We conclude that our approach generates representative summaries for a very high portion of events.

Fig. 16. Three examples of the top 5 images obtained before introducing diversity (odd rows) and after introducing it (even rows).

The second question asked was: Which summary do you prefer?, and the experts had to choose their preferred summary, allowing just one answer. The solution based on ImageNet features was chosen in 59.57% of the cases, while the one based on Places was selected 53.19% of the times. The total adds up to more than 100% because some summaries were identical for both configurations.

The second round of evaluations aimed to compare our solution based on ImageNet features with a baseline of uniform sampling and an upper bound defined by the summaries in the ground truth. Experts were asked to grade each visual summary from 1 (worst) to 5 (best). The results obtained for each solution are provided in Table II.

TABLE II
MEAN OPINION SCORE FOR IMAGENET, GROUND-TRUTH AND UNIFORM SAMPLING SUMMARIES.

Dataset        Our solution   Ground-truth   Uniform Sampling
All datasets   4.57           4.94           3.99

The performance obtained by uniform sampling (3.99/5) is remarkably high, since this score can be interpreted as a good solution. The results obtained with our solution are also satisfactory, because they are closer to the ground truth than to the uniform sampling. The experts have shown coherence with their own ground truth, giving it a Mean Opinion Score of 4.94.

VI. CONCLUSIONS

In this paper, we presented a novel approach for the semantic summarization of egocentric photo stream events, constructed as a ranked list based on general semantic criteria. Our method relies on two main criteria: relevance, to optimize semantic diversity in the summary, and novelty, to avoid redundancy in the final result. After applying a pre-processing step to filter non-informative images through a new CNN-based method, a relevance and diversity-aware ranking is obtained by integrating state-of-the-art techniques for saliency detection, object recognition and face detection. This list is re-ranked to reduce redundancy, so that each image of the truncated list differs as much as possible from its predecessors. We proposed a new soft metric to rank the informative frames and construct the final summary that does not penalize summaries equivalent (very similar, but not exactly coinciding) to the ground truth. Experimental results indicate high acceptance and satisfaction from the psychologists, achieving a Mean Opinion Score of 4.57 out of 5.0.

As future work, we are interested in exploring an automatic way to define the length of the summaries. On the other hand, it would be interesting to jointly maximize relevance and diversity by exploiting dynamic programming techniques, as proposed in [40]. Our first application will be to use the semantic summaries as a basis for the cognitive and memory training of patients affected by mild cognitive impairment. Currently, the psychologist of our team is working on defining a cognitive training framework to address the cognitive dysfunction symptoms of patients in the early stages of Alzheimer's disease. Automatic semantic summarization will be used as a basis to generate powerful cues to work with the patients in order to improve their cognitive abilities as well as their psycho-social well-being and self-confidence in their daily life.

ACKNOWLEDGMENT

This work was partially funded by the projects TEC2013-43935-R and TIN2012-38187-C03-01, funded by MEC, Spain, and SGR1219 and SGR1421, funded by AGAUR, Catalunya, Spain. M. Dimiccoli is supported by a Beatriu de Pinós grant (Marie Curie COFUND action). We gratefully acknowledge the support of NVIDIA Corporation with the donation of three GeForce GTX Titan GPUs used in this work.

REFERENCES

[1] S. P. Rajendra and N. Keshaveni, “A survey of automatic video summarization techniques,” International Journal of Electronics, Electrical and Computational System, vol. 2, 2014.

[2] S. Mann, “’WearCam’ (the wearable camera): personal imaging systems for long-term use in wearable tetherless computer-mediated reality and personal photo/videographic memory prosthesis,” in Wearable Computers, 1998. Digest of Papers. Second International Symposium on, Oct 1998, pp. 124–131.

[3] W. W. Mayol, B. J. Tordoff, and D. W. Murray, “Wearable visual robots,” Personal Ubiquitous Comput., vol. 6, no. 1, pp. 37–48, Jan. 2002. [Online]. Available: http://dx.doi.org/10.1007/s007790200004

[4] S. Hodges, L. Williams, E. Berry, S. Izadi, J. Srinivasan, A. Butler, G. Smyth, N. Kapur, and K. Wood, “SenseCam: A retrospective memory aid,” in Proceedings of the 8th International Conference on Ubiquitous Computing, ser. UbiComp ’06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 177–193.

[5] Z. Lu and K. Grauman, “Story-driven summarization for egocentric video,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 2714–2721.

[6] K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto, “Fast unsupervised ego-action learning for first-person sports videos,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 3241–3248.

[7] M. Bolanos and P. Radeva, “Ego-object discovery,” arXiv preprint arXiv:1504.01639, 2015.

[8] M. Aghaei, M. Dimiccoli, and P. Radeva, “Multi-face tracking by extended bag-of-tracklets in egocentric videos,” arXiv preprint arXiv:1507.04576, 2015.

[9] A. J. Sellen, A. Fogg, M. Aitken, S. Hodges, C. Rother, and K. Wood, “Do life-logging technologies support memory for the past?: An experimental study using SenseCam,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI ’07. New York, NY, USA: ACM, 2007, pp. 81–90. [Online]. Available: http://doi.acm.org/10.1145/1240624.1240636

[10] M. L. Lee and A. K. Dey, “Lifelogging memory appliance for people with episodic memory impairment,” in Proceedings of the 10th International Conference on Ubiquitous Computing, ser. UbiComp ’08. New York, NY, USA: ACM, 2008, pp. 44–53. [Online]. Available: http://doi.acm.org/10.1145/1409635.1409643

[11] P. Piasek, K. Irving, and A. Smeaton, “SenseCam intervention based on cognitive stimulation therapy framework for early-stage dementia,” in Pervasive Computing Technologies for Healthcare (PervasiveHealth), 2011 5th International Conference on, May 2011, pp. 522–525.

[12] P. Piasek, A. F. Smeaton et al., “Using lifelogging to help construct the identity of people with dementia,” 2014.

[13] A. Spector, L. Thorgrimsen, B. Woods, L. Royan, S. Davies, M. Butterworth (deceased), and M. Orrell, “Efficacy of an evidence-based cognitive stimulation therapy programme for people with dementia,” The British Journal of Psychiatry, vol. 183, no. 3, pp. 248–254, 2003.

[14] A. R. Doherty, D. Byrne, A. F. Smeaton, G. J. Jones, and M. Hughes, “Investigating keyframe selection methods in the novel domain of passively captured visual lifelogs,” in Proceedings of the 2008 International Conference on Content-based Image and Video Retrieval, ser. CIVR ’08. New York, NY, USA: ACM, 2008, pp. 259–268. [Online]. Available: http://doi.acm.org/10.1145/1386352.1386389

[15] M. Blighe, A. Doherty, A. F. Smeaton, and N. E. O’Connor, “Keyframe detection in visual lifelogs,” in Proceedings of the 1st International Conference on PErvasive Technologies Related to Assistive Environments, ser. PETRA ’08. New York, NY, USA: ACM, 2008, pp. 55:1–55:2. [Online]. Available: http://doi.acm.org/10.1145/1389586.1389652

[16] A. Jinda-Apiraksa, J. Machajdik, and R. Sablatnig, “A keyframe selection of lifelog image sequences,” Erasmus Mundus M.Sc. in Vision and Robotics thesis, Vienna University of Technology (TU Wien), 2012.

[17] M. Bolanos, R. Mestre, E. Talavera, X. Giro-i-Nieto, and P. Radeva, “Visual summary of egocentric photostreams by representative keyframes,” arXiv preprint arXiv:1505.01130, 2015.

[18] J. Ghosh, Y. J. Lee, and K. Grauman, “Discovering important people and objects for egocentric video summarization,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 1346–1353.

[19] O. Aghazadeh, J. Sullivan, and S. Carlsson, “Novelty detection from an ego-centric perspective,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 3297–3304.

[20] B. Gong, W.-L. Chao, K. Grauman, and F. Sha, “Diverse sequential subset selection for supervised video summarization,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2069–2077. [Online]. Available: http://papers.nips.cc/paper/5413-diverse-sequential-subset-selection-for-supervised-video-summarization.pdf

[21] A. Kulesza and B. Taskar, “Determinantal point processes for machine learning,” arXiv preprint arXiv:1207.6083, 2012.

[22] J. Carbonell and J. Goldstein, “The use of MMR, diversity-based reranking for reordering documents and producing summaries,” in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’98. New York, NY, USA: ACM, 1998, pp. 335–336. [Online]. Available: http://doi.acm.org/10.1145/290941.291025

[23] T. Deselaers, T. Gass, P. Dreuw, and H. Ney, “Jointly optimising relevance and diversity in image retrieval,” in Proceedings of the ACM International Conference on Image and Video Retrieval, ser. CIVR ’09. New York, NY, USA: ACM, 2009, pp. 39:1–39:8. [Online]. Available: http://doi.acm.org/10.1145/1646396.1646443

[24] E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, and I. Vlahavas, “SocialSensor: Finding diverse images at MediaEval 2014,” 2014.

[25] D.-T. Dang-Nguyen, L. Piras, G. Giacinto, G. Boato, and F. De Natale, “Retrieval of diverse images by pre-filtering and hierarchical clustering,” Working Notes of MediaEval, 2014.

[26] K. Song, Y. Tian, W. Gao, and T. Huang, “Diversifying the image retrieval results,” in Proceedings of the 14th Annual ACM International Conference on Multimedia, ser. MULTIMEDIA ’06. New York, NY, USA: ACM, 2006, pp. 707–710. [Online]. Available: http://doi.acm.org/10.1145/1180639.1180789

[27] R. H. van Leuken, L. Garcia, X. Olivares, and R. van Zwol, “Visual diversification of image search results,” in Proceedings of the 18th International Conference on World Wide Web, ser. WWW ’09. New York, NY, USA: ACM, 2009, pp. 341–350. [Online]. Available: http://doi.acm.org/10.1145/1526709.1526756

[28] C. L. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Buttcher, and I. MacKinnon, “Novelty and diversity in information retrieval evaluation,” in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2008, pp. 659–666.

[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[30] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.

[31] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.

[32] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.

[33] M. Bolanos, M. Dimiccoli, and P. Radeva, “Towards storytelling from visual lifelogging: An overview,” arXiv preprint arXiv:1507.06120, 2015.

[34] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in Advances in Neural Information Processing Systems, 2014, pp. 487–495.

[35] J. Pan, K. McGuinness, and X. Giro-i-Nieto, “End-to-end convolutional network for saliency prediction,” submitted to CVPR, 2016.

[36] J. Hoffman, S. Guadarrama, E. Tzeng, J. Donahue, R. B. Girshick, T. Darrell, and K. Saenko, “LSDA: Large scale detection through adaptation,” CoRR, vol. abs/1407.5035, 2014. [Online]. Available: http://arxiv.org/abs/1407.5035

[37] X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2879–2886.

[38] M. Montague and J. A. Aslam, “Relevance score normalization for metasearch,” in Proceedings of the Tenth International Conference on Information and Knowledge Management. ACM, 2001, pp. 427–433.

[39] M. E. Renda and U. Straccia, “Web metasearch: rank vs. score based rank aggregation methods,” in Proceedings of the 2003 ACM Symposium on Applied Computing. ACM, 2003, pp. 841–846.

[40] T. Deselaers, T. Gass, P. Dreuw, and H. Ney, “Jointly optimising relevance and diversity in image retrieval,” in Proceedings of the ACM International Conference on Image and Video Retrieval. ACM, 2009, p. 39.

[41] E. Talavera, M. Dimiccoli, M. Bolanos, M. Aghaei, and P. Radeva, “R-clustering for egocentric video segmentation,” in Pattern Recognition and Image Analysis. Springer, 2015, pp. 327–336.

[42] C. X. Zhai, W. W. Cohen, and J. Lafferty, “Beyond independent relevance: methods and evaluation metrics for subtopic retrieval,” in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2003, pp. 10–17.

[43] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in Computer Vision and Pattern Recognition (CVPR), 2014.

Aniol Lidon received his B.S. degree in Audiovisual Engineering from the Universitat Politecnica de Catalunya in 2014, and his interuniversity M.S. degree in Computer Vision from the Universitat Politecnica de Catalunya, Universitat Autonoma de Barcelona, Universitat Pompeu Fabra and Universitat Oberta de Catalunya in 2015. His research interests have focused on egocentric vision and social media.

Marc Bolanos received his B.S. degree in Computer Science from the Universitat de Barcelona, Spain, in 2013. He received his interuniversity M.S. degree in Artificial Intelligence from the Universitat Politecnica de Catalunya, Universitat de Barcelona and Universitat Rovira i Virgili in 2015, and is currently a PhD candidate at the Universitat de Barcelona. He has received multiple collaboration grants and is participating in a European project, working in the Applied Mathematics and Analysis Department (MAIA) at the Universitat de Barcelona. His research interests are in the area of deep neural networks and lifelogging image devices for health improvement.

Mariella Dimiccoli is a Beatriu de Pinos fellow (Marie Curie COFUND action) at the Computer Vision Center and an associate professor at the University of Barcelona. She received an M.S. degree in Computer Engineering in 2004 from the Technical University of Bari (POLIBA), a MAS degree from the Technical University of Catalonia (UPC) in 2006, and a PhD in Image Processing from the UPC (Cum Laude with European Honors) in 2009. Her current research interests are the automatic analysis and organization of lifelogging data acquired by a wearable camera and their exploitation in health research.

Petia Radeva received her Ph.D. degree from the Universitat Autonoma de Barcelona in 1996. Currently, she is Head of the Barcelona Perceptual Computing Laboratory (BCNPCL) at the University of Barcelona and Head of the Medical Imaging Laboratory (MILab) of the Computer Vision Center (www.cvc.uab.es). Her present research interests are the development of learning-based approaches for computer vision and image processing, especially focused on egocentric vision and visual lifelogging.

Maite Garolera was licensed in Psychology in 1987. In 1994, she joined the Psychiatric Unit of Hospital de Terrassa (Consorci Sanitari de Terrassa) as a Clinical Psychologist/Neuropsychologist. Currently, her main activities are related to clinical research focused on Alzheimer's disease diagnosis and brain injury cognitive rehabilitation.

Xavier Giro-i-Nieto is an associate professor at the Universitat Politecnica de Catalunya (UPC). He graduated in Electrical Engineering at UPC in 2000, after completing his master thesis on image compression at the Vrije Universiteit Brussel (VUB). In 2001 he worked in the digital television group of Sony Brussels, before joining the Image Processing Group at UPC in 2002. He obtained his PhD on image retrieval in 2012 from the Signal Theory and Communications Department of the same university. Between 2008 and 2014, he was a visiting scholar at the Digital Video and MultiMedia Laboratory at Columbia University in New York City.