Multimed Tools Appl
DOI 10.1007/s11042-012-1079-z

Concurrent photo sequence organization

Liliana Lo Presti · Marco La Cascia

© Springer Science+Business Media, LLC 2012

Abstract Personal photo album organization is a highly demanding domain where advanced tools are required to manage large photo collections. In contrast to many previous works, which try to solve the problem of organizing a single-user photo sequence, we present a new technique for the concurrent photo sequence organization problem, that is, the problem of organizing multiple photo sequences taken during the same event. Given a set of sequences acquired at the same place during the same temporal window by several users with different cameras, our framework captures the evolution of the event and groups photos based on temporal proximity and visual content. The method automatically organizes a reference sequence in a tree capturing the event structure. This structure is then used to align the remaining photo sequences to the reference one. We tested our approach on the publicly available Gallagher dataset and on a new dataset we collected; the new dataset is composed of four photo sequences taken by four users at a public event. Results demonstrate the effectiveness of our method.

Keywords Digital library · Personal photo album · Concurrent photos · Co-organization · Content analysis · Hidden Markov Model

This research was conducted while Dr. Lo Presti was a post-doctoral researcher at the University of Palermo.

L. Lo Presti (B)
Computer Science Department, Boston University, Boston, MA, USA
e-mail: [email protected]

M. La Cascia
DICGIM, University of Palermo, Palermo, Italy
e-mail: [email protected]

1 Introduction

Nowadays cameras are commonplace, and the large photo collections being accumulated need to be properly organized. Much of the research in this domain focuses on the problem of managing personal photo collections acquired by a single user, taking advantage of information such as who is in the photo, and when and where each photo has been acquired [5, 10, 19, 28]. Photo organization based on the identities of faces detected in the collection is often performed using clustering methods and asking the user to tag the representative faces or to select from a list of probable labels, as in Choi et al. [4] and Gallagher and Chen [9]. Recently, the problem has also been modeled as a data association problem in Lo Presti et al. [20] and Zhang et al. [29].

Organization of photo collections can take advantage of contextual information, such as photo timestamps and/or geo-reference information coming, for example, from GPS-equipped devices. Some methods perform visual content-based photo organization [13, 17]. In this case, however, "where" the photo has been collected does not refer to geo-reference data but to the particular situation in which the photo has been acquired, for example outdoor or indoor, at the park, at the sea, etc.

In this paper, we focus on concurrent photo sequence organization, that is, the problem of organizing multiple photo sequences acquired at the same place during the same temporal window by several users with different cameras. This problem has received little attention from the scientific community, but tools based on the ideas we present could be very useful in practice.

In fact, the scope of taking and managing photos is changing; people want to capture important moments of their life and share such moments with others. Indeed, photo sharing has recently been one of the most popular applications. The number of photo-sharing applications on smart phones is growing fast and there is strong demand for them. Moreover, social networks have gained much popularity in recent years and are often used to share photos among friends. In particular, if a group of "friends" has been involved in a social event, they would probably like to share all the photos taken at the event; in this case, a proper organization of these photos is required. Tools for organizing several photo sequences of the same event are currently missing both in mobile photo-sharing applications and in social networks like Facebook [7] or Google+ [11]. Nonetheless, concurrent photo sequence sharing is a very interesting scenario that enables the use of collective knowledge for photo collection analysis and management [25]. This is the main motivation for our work: providing a framework to co-organize photos taken concurrently at a social event and to help users browse them within their social group.

Events are the key concepts often used to organize photos in albums. In this paper, we focus on the event structure of the shared sequences; since the photos are taken during the same event, it is likely that all the sequences have a similar event structure. Therefore, in our framework the temporal and visual structure of a chosen sequence is learned, and then transferred to the remaining unprocessed sequences.

We consider the case in which the photo sequences are not fully overlapping and the cameras are not temporally synchronized. In such a scenario, absolute temporal information is not very helpful to organize the sequences; only the time difference between successive photos can reliably be used. Moreover, other information related to the visual content can be effectively used to find correspondences among sequences.


There are several reasons why the problem of concurrent sequence organization is difficult. First, the set of possible points of view from which photos are taken is very large. Second, users can focus on particular aspects of the scene based on personal preferences: it is possible that two users acquiring photos at the same moment and place will focus on completely different aspects/objects, taking photos that do not look similar. Third, the scene to acquire is hard to represent, and there is little evidence about what needs to be measured to obtain a good description of each photo.

Other challenges should also be considered, for example illumination changes and different sensor characteristics that can affect the quality of the photos. Most important, each user can have his/her own preferences while setting up the camera (flash on/off, zoom, micro-utility...). All these factors make the problem of concurrent sequence organization very difficult and open new interesting research directions to investigate in the future.

The simplest approach to solve the sequence co-organization problem would be to consider all the photos together and use a clustering technique to find groups of photos that look similar. However, such an organization could perform poorly because of the high variance of the photo content, and would not consider at all the temporal order of the photos along the sequence, which, instead, is an important clue. Each photo sequence can be seen as a story with a temporal structure. All these stories are partially overlapping, and our goal is to detect which parts of the stories overlap and to properly integrate them to get a unique and more complete story.

In our framework, one of the sequences (the longest one or the one selected by the user) is set as the reference, and the remaining sequences are organized accordingly. Our contribution is twofold: first, we present a method to organize a sequence of photos hierarchically to capture the structure of the event, considering time information and visual content; second, the temporal structure and evolution of the event is captured by a Hidden Markov Model (HMM) that serves to classify the photos belonging to the remaining sequences.

The organization of the paper is as follows: in Section 2, we present previous works on photo collection organization, while in Section 3, we define the problem. In Section 4, we present our method to organize the reference sequence in a tree. In Section 5, we describe the probabilistic framework used to determine the alignment among concurrent photo sequences. Finally, in Sections 6 and 7, we describe the datasets and experimental results, and discuss conclusions and future directions.

2 Related works

In recent years, many works focusing on photo collection management have been proposed, enabling users to share, browse, and search their own photos. In Gong and Jain [10], many aspects of photo segmentation are pointed out, for example the possibility of making use of contextual information to organize the collections and of using where and when photos have been taken to ease browsing. In particular, the authors suggest that EXchangeable Image File (EXIF) data and/or an a priori known event model can provide useful insights to understand and represent the structure of a photo stream.


Many approaches are based on partitive clustering techniques for classifying images into a default number of groups. However, such techniques, derived from classic image retrieval studies, do not yield suitable results when applied to personal photo collections. In Ardizzone et al. [1], mean-shift clustering is used to automatically organize image data focusing on faces, background, and time information. Data organization does not need any human intervention, since image features are automatically extracted and clustering parameters are automatically determined according to an entropy-based figure of merit. However, the event structure to which photos refer, which could be useful to the user for browsing the collection, does not completely emerge because no temporal analysis is performed on the sequence.

Some methods focus on organizing photos based on who is in the picture. In Lo Presti et al. [20], a data association problem is set up to group faces belonging to the same identity in order to ease the user's tagging task. A probabilistic framework is used to find correspondences based on online-learned face and clothing models estimated for each identity. In Lo Presti et al. [21], the same method has been extended to also consider time information.

In Zhang et al. [29], users are allowed to multi-select a group of photos and assign a name/tag to the person appearing in them. The method attempts to propagate names from photo level to face level, i.e., to infer the correspondence between name and face. However, whilst the user's tagging effort is minimized, the user still has to manually identify the group of photos where a person appears. Moreover, in some cases the method is not able to disambiguate between persons in the photos (i.e., when some persons always appear together in the set of photos).

In Li et al. [17], photo collections are organized based on image content. Color histograms of faces, clothing and background are used as image content features. A similarity matrix for the photos in the collection is computed according to temporal and content features; then, hierarchical clustering is used to group similar photos. The contrast context histogram (CCH) technique is used to summarize each cluster.

All these methods, whilst effective for easing photo collection browsing, do not implement any capability to really understand both the photo sequence structure and its time evolution. In this paper, given a set of sequences taken at the same event, we propose to transfer what is learned about the temporal and visual structure of a chosen sequence to the remaining unprocessed ones, considering that all the sequences likely share the same event structure. To the best of our knowledge, this problem has been faced only in Jang et al. [14]. They cluster concurrent photos by first selecting a preferred sequence and then estimating, via temporal analysis, the basis clusters that will be used as a reference model for the remaining sequences. Then, based on the user's preferences, photos are iteratively clustered considering temporal and visual information until the user's preferences are satisfied. Photos are grouped by a hierarchical clustering method combining time and visual similarity.

On the contrary, our work uses clustering techniques only to organize the reference photo sequence, while all the other sequences are processed by means of the estimated temporal model. We use a Hidden Markov Model to explicitly represent the temporal model of the reference sequence and to infer correspondences with the photos belonging to the other sequences. In this way, it is not required that cameras be time-synchronized. In contrast to the work in Jang et al. [14], where the main goal is to satisfy the user's preferences, we propose a general unsupervised strategy to estimate the structure and dynamics of an event, and we show how to transfer such information to the other sequences. From this point of view, our method is closer to event understanding than to photo clustering.

3 Problem definition

Let us consider a photo sequence $S_r$ chosen as the reference sequence, and a set $C$ of $N$ sequences $C = \{S_1, S_2, \ldots, S_N\}$ all related to the same event, that is, all the sequences were collected during the same temporal window and at the same place by different users. Concurrent photo sequence organization aims to group photos from different sequences into clusters with meaningful "visual semantics", that is, with similar content. For example, if the sequences are related to a birthday party, meaningful moments could be when the candles on the cake are blown out or when presents are opened. If the event is a marriage, possible meaningful clusters would regard the moment when the bride and groom enter the church, or when they come out, and so on.

In this paper, we propose to organize the photos of a reference sequence in a tree to represent the event structure; then, we model the dynamics occurring during the event itself by means of a Markov chain and use such a dynamic model to classify the pictures belonging to the other sequences. Of course, several methods may be used to infer the tree structure or model the event dynamics. Here, for the sake of demonstrating how our idea may be applied, we present specific implementations for modeling the event structure and dynamics, but we believe that other techniques could be selected instead.

In our implementation, the reference sequence $S_r$ is organized in a tree whose nodes, at each level, represent clusters of photos computed considering some peculiar characteristic. Nodes at the first level are computed considering temporal information coming from photo timestamps, in order to group photos acquired "near in time". In the following, we call a node at this level a situation.

Photos within each situation are then clustered based on visual similarity to obtain a more accurate photo organization. We use hue and saturation values to represent the visual content, trying in this way to account for illumination changes. To group similar photos, we use a clustering technique that does not require any a priori information about the collection. In this way, nodes at the second level of the tree represent groups of photos that look similar because of both their color distribution and their temporal proximity. In the following, we refer to nodes at this level as content-clusters. Of course, other features could be used to represent content information, and further levels could be added to the tree depending on the properties to highlight. For example, the objects that appear in each photo, or the faces that are detected, could be used to generate a new level in the tree.
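To make the structure concrete, the following is a minimal sketch of how such a two-level event tree could be represented in memory; the class and field names are our own illustrative choices, not part of the original implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Photo:
    timestamp: float          # seconds, e.g., from EXIF data
    descriptor: List[float]   # visual descriptor (hue/saturation histogram)

@dataclass
class ContentCluster:         # second-level node: visually similar photos
    photos: List[Photo] = field(default_factory=list)

@dataclass
class Situation:              # first-level node: photos taken near in time
    clusters: List[ContentCluster] = field(default_factory=list)

@dataclass
class EventTree:              # root: the whole reference sequence
    situations: List[Situation] = field(default_factory=list)
```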

The tree computed for the sequence $S_r$ is used to classify the photos belonging to the remaining sequences in the set $C$. As we do not require the sequences to be time-synchronized, we cannot use temporal information to perform the alignment; we instead need to classify the photos based on their visual content. However, in this context, the temporal order of photos is meaningful. If the event has a temporal structure (as happens for a birthday party or a marriage), then this structure may emerge, completely or partially, from each sequence and can be used to find an alignment between them. For this reason, we use a first-order Hidden Markov Model (HMM) to represent the structure of the event with respect to the reference photo sequence, and infer how the remaining sequences can match the same temporal order. In practice, photos taken in a certain sequence $S_i$ are treated as observations in the model while, as we will explain later, the content-clusters detected for the reference sequence $S_r$ are used to define the possible states that the hidden variables can assume. Photos in the sequence to classify are considered in time order and, while performing inference, the transition matrix of the Markov chain is computed in such a way that the temporal sequence adapts to the temporal structure of the reference sequence. The consecutiveness of the situations discovered in sequence $S_r$ must hold also for the other sequences. Sequences in the set $C$ are processed individually. In Fig. 1, we present a scenario with three photo sequences. Victor's photos are used as the reference sequence and are organized in a tree. The leaf nodes are used to classify and organize Alice's and Bob's photos. While performing this classification, the temporal order estimated for Victor's photos is used to organize the remaining sequences. However, some photos may remain unclassified (see the Null links in Bob's sequence) because their content can largely differ from that of Victor's photos.

Fig. 1 Schema of the proposed framework: the reference sequence is organized in a tree considering temporal proximity and visual content; the tree is used to classify the photos belonging to the remaining sequences. (The figure shows Victor's photos as the reference sequence, with Alice's and Bob's photos aligned to it; some of Bob's photos map to the Null state.)


4 Hierarchical sequence organization

Given a photo sequence $S_r$, our goal is to organize it considering when and where each photo has been acquired, by using context and content. Our method organizes photos in a tree where each node represents a cluster of photos with similar features. Figure 2 shows part of the tree obtained for a sequence of our dataset.

This kind of organization is also somewhat related to the problem of automatic video segmentation, where three steps are generally performed. The first step is shot boundary detection (SBD), that is, the task of identifying similar consecutive frames. The second step is keyframe selection, which extracts one or more frames to represent the shot. Finally, scene segmentation groups together related shots [26]. A shot groups together frames taken in a certain temporal window and, therefore, it resembles a situation. However, shot segmentation is performed considering visual properties of consecutive frames; in our problem, instead, situations are found considering only time information. Keyframes are conceptually similar to our content-clusters. In video segmentation, visually dissimilar keyframes are selected to represent a shot. In our problem, a set of similar photos in the same situation is grouped and used to estimate a probabilistic model representing their "appearance". Several of these models are then used to represent the content of a situation.

Fig. 2 The image shows some nodes of the estimated tree for the reference sequence. Photos within the ellipse have been taken in the same situation; they are then grouped based on visual similarity


4.1 Time segmentation

Given a time-ordered photo sequence, our goal is to isolate all those pictures that were acquired within the same temporal window, that is, in the same situation. A possible solution would be to use clustering methods to find all the photos with similar timestamps. Such methods generally require setting a priori the number of clusters (e.g., k-means) or a similarity threshold (e.g., hierarchical clustering). In the latter case, the problem is challenging because the threshold affects the grain at which clusters are computed, and a proper threshold is difficult to set when the sequence has been acquired in a short temporal window. The risk is that of detecting too many short situations (in the worst case, each photo becomes its own cluster) or too long temporal windows mixing different situations.

Instead of using the timestamp value directly to group photos, we take a different approach, similar in spirit to the one presented in Cooper et al. [6]. Temporal segmentation is performed considering the difference $\delta_k$ between a photo timestamp $T_p(k)$ and the next one $T_p(k+1)$, computed as follows:

$$\delta_k = T_p(k+1) - T_p(k) \qquad (1)$$

where $k$ indexes the photos along the time-ordered sequence. The set of computed distances comprises intra-situation distances $\delta_k^s$ and inter-situation distances $\delta_k^i$.

Figure 3 shows the plot of such differences along the ordered timestamps for a sequence of 200 photos taken over a period of almost 10 h. Peaks correspond to large time differences and can be considered as the start of possible new situations. In practice, the lower the time difference between two subsequent photos, the more probable it is that the two photos belong to the same situation. The temporal segmentation problem then reduces to detecting the "meaningful peaks". If the number of situations along the sequence were known a priori, say K, it would be sufficient to take the K − 1 highest peaks to segment the sequence into situations. However, in general such information is not known a priori, and a threshold to detect the peaks is required (Fig. 4).

Fig. 3 Plot of the timestamp differences of consecutive photos along the time-ordered sequence (x-axis: photo indexes; y-axis: timestamp differences in seconds). Peaks show where a new situation starts


Fig. 4 Situations detected for the first 15 photos of a photo sequence. Time is referred to the first photo in the sequence. Each situation is generally composed of a different number of photos showing a large variance of the visual content

In contrast to Cooper et al. [6], where a multi-scale analysis is applied to determine the peaks, we model the probability density function of the distances. We assume the distances between two successive photos follow one of two Gaussians, depending on whether or not the photos belong to the same situation. To automatically determine a suitable threshold and separate intra-situation from inter-situation distances, we train a mixture of two Gaussians by maximizing the likelihood of the data within the Expectation-Maximization framework.

To limit the effect of outliers, our technique uses a robust estimator to learn the parameters, as described in Li [18]. During training, we consider the minimum of a default threshold and the distance between the Gaussian mean and the training sample. We set this threshold equal to the number of seconds in a day; this choice is reasonable for general personal photo collections. The Gaussian Mixture Model (GMM) estimation provides a clustering of the samples into two classes: intra-situation distances $\{\delta_k^s\}$ and inter-situation distances $\{\delta_k^i\}$. The inter-situation distances permit detection of the beginning of a new situation in the timestamp sequence. The threshold providing the same clustering found through the GMM can be chosen in the interval $[\max_k\{\delta_k^s\}, \min_j\{\delta_j^i\}]$. Therefore, we set the threshold $T$ as:

$$T = \frac{1}{2} \cdot \left( \max_k \{\delta_k^s\} + \min_j \{\delta_j^i\} \right). \qquad (2)$$


To find the situations at the first level of the tree, the sequence of timestamp differences is sequentially analyzed. Every time a distance is greater than the threshold $T$, a new situation is discovered and added as a node of the tree.

The grain of the temporal segmentation depends on the threshold $T$ and, if a finer segmentation is required, it is possible to decrease the value of this threshold. However, the automatically computed threshold permits the method to adapt to the temporal duration of the collection. As the data are one-dimensional, the Expectation-Maximization procedure used to train the GMM is fast and efficient.

Figure 5 shows the estimated GMM for the example sequence. As the plot shows, the two modes are well separable. The most peaked one corresponds to the mode of the intra-situation distances, while the other corresponds to the inter-situation distances. It is possible to note that the latter mode is quite flat; this is due to the large variance of the inter-situation distances. The plot also shows the position of the estimated threshold which, for this photo sequence, is 266 s and yields 18 different situations. In Fig. 4 we show the situations detected for the first 15 photos.

Algorithm 1 summarizes the main steps of our technique to organize photos into situations without any a priori information about the collection. The function train_GMM simply trains a mixture of two Gaussians and returns the two sets $\{\delta_k^s\}$ and $\{\delta_k^i\}$ of intra-situation and inter-situation distances.
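A minimal sketch of this procedure follows, assuming scikit-learn's GaussianMixture as the EM trainer; the robust estimation step of Li [18] and the default-threshold capping are omitted for brevity, and the function name segment_situations is our own.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def segment_situations(timestamps):
    """Split a time-ordered list of photo timestamps (in seconds) into
    situations, following the scheme above (it is assumed that both GMM
    components receive at least one sample)."""
    t = np.asarray(sorted(timestamps), dtype=float)
    deltas = np.diff(t)                                   # Eq. (1)

    gmm = GaussianMixture(n_components=2).fit(deltas.reshape(-1, 1))
    labels = gmm.predict(deltas.reshape(-1, 1))

    # The component with the smaller mean holds the intra-situation gaps.
    intra = 0 if gmm.means_[0, 0] < gmm.means_[1, 0] else 1
    delta_s = deltas[labels == intra]                     # intra-situation
    delta_i = deltas[labels != intra]                     # inter-situation

    T = 0.5 * (delta_s.max() + delta_i.min())             # Eq. (2)

    # A new situation starts wherever the gap exceeds the threshold T.
    situations, current = [], [0]
    for k, d in enumerate(deltas):
        if d > T:
            situations.append(current)
            current = []
        current.append(k + 1)
    situations.append(current)
    return situations, T
```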

4.2 Content-based organization

4.2.1 Content representation

As Fig. 4 shows, the visual content of the photos can change widely within the same situation, and different clusters could emerge from the photo similarities. Representing the content of the photos with the goal of establishing matches among them is a difficult task because of the large changes in point of view, different camera settings, illumination changes and so on. In some works [15], matches between images are established by extracting interest points, for instance SIFT [22] and SURF [2], and then finding correspondences. However, when these methods are adopted, it is generally assumed a priori that the content of the two images overlaps (the images have some content in common). In our case, this knowledge is not given a priori.


Fig. 5 Plot of the mixture of two Gaussians representing the intra-situation and inter-situation distances for the sequence of 200 photos (x-axis: distances; the two distance modes and the threshold T are marked in the plot). The threshold T is automatically computed by (2). The range of distances has been limited to [0, 900] to enhance the visualization

In other works, interest points are used in a bag-of-words paradigm [8]. In this case, learning a vocabulary of words to describe the image content introduces some loss of information due to quantization or binarization of the descriptors. Moreover, such an approach requires a suitable training set to learn the vocabulary, and is generally used to learn class models in a supervised way; in our problem, no supervision is available a priori.

To build the second level of the tree representing the reference sequence, the visual content of each photo is represented by means of the distribution of hue and saturation in the HSI space within maximally stable extremal regions (MSERs) [23]. For each image, MSERs are detected. Each MSER is closed under perspective transformations of the image coordinates and under monotonic transformations of the image intensities. Indeed, MSERs are computed considering the connected components detected over a range of possible thresholds of the gray-level image. As a result, these regions are invariant to affine transformations of the image intensities, and are stable, as they correspond to extremal regions that remain unchanged over a range of thresholds. In order to discard overly fine details and represent content in meaningful areas, we filter out regions that are too small or too large, and we use an incremental adaptive clustering method (as in Leow and Li [16]) to merge regions with similar positions.

In our implementation, the MSER regions were computed by means of the code its authors made publicly available.¹ We used the default parameters for computing the regions and then analyzed and filtered the detected regions. We retain only the regions whose area is between 1% and 35% of the total image size. In the agglomerative clustering, we grouped the regions whose centroid Euclidean distance is lower than 10. These parameters were tuned manually to increase the precision in retrieval of a subset of images.

¹ The code may be found at http://www.featurespace.org/.


We then represent the content by the global joint histogram of the hue and saturation values within the detected MSER regions. We adopt a uniform quantization of the hue and saturation space with 8 bins per channel. The joint histogram is then normalized and transformed into a vector. This content representation implicitly assigns a greater weight to those pixels that fall in overlapping MSERs, as they are probably more meaningful for the content representation of the whole image.

Considering only hue and saturation makes the representation more robust to illumination changes. The advantage of adopting a histogram representation is that it does not depend on the structural and spatial composition of the image. In our context this is particularly important, because our final goal is to compare photos taken from very different points of view without trying to estimate the real scene structure. Of course, it would be possible to use other affine region detectors; a very useful comparison among state-of-the-art approaches is reported in [24].
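The descriptor computation can be sketched as follows. This version uses OpenCV's MSER detector rather than the featurespace.org code used in the paper, and omits the position-based region merging, so it is an illustrative approximation rather than the original implementation.

```python
import cv2
import numpy as np

def mser_hs_descriptor(image_bgr, bins=8):
    """MSER-based hue/saturation joint histogram, under the assumptions
    stated above (default OpenCV MSER parameters, no position merging)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)

    regions, _ = cv2.MSER_create().detectRegions(gray)

    # Keep regions whose area is between 1% and 35% of the image size.
    n_pixels = gray.shape[0] * gray.shape[1]
    regions = [r for r in regions
               if 0.01 * n_pixels <= len(r) <= 0.35 * n_pixels]

    # Accumulate hue/saturation of every pixel inside the retained regions;
    # pixels covered by several MSERs are counted once per region, which
    # implicitly weights them more, as discussed above.
    hist = np.zeros((bins, bins))
    for r in regions:
        h = hsv[r[:, 1], r[:, 0], 0]   # OpenCV hue lies in [0, 179]
        s = hsv[r[:, 1], r[:, 0], 1]   # saturation lies in [0, 255]
        hist += np.histogram2d(h, s, bins=bins,
                               range=[[0, 180], [0, 256]])[0]

    hist /= max(hist.sum(), 1.0)       # normalize the joint histogram
    return hist.ravel()                # descriptor as a vector
```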

4.2.2 Mean-shift clustering

We use the MSER-based descriptor to cluster the photos within the same situation into groups with similar visual content. It is not possible to know a priori how many content-clusters have to be computed within the same situation; therefore, methods such as k-means cannot be adopted. We instead adopt the unsupervised mean-shift algorithm proposed in Wu and Yang [27].

Mean-shift is a gradient-based iterative algorithm used to automatically compute the modes of the probability distribution underlying the samples. Initially, all the points in the sample set are candidate modes of the distribution. Then, by means of an iterative procedure, these points are moved towards the true modes of the distribution. Mean-shift clustering suffers from the problem of determining how fast these points must be moved. In Wu and Yang [27], the authors propose to use the p-th order Epanechnikov kernel, with the sample variance of the distances among samples as the size of the kernel's spatial window. The method's performance depends on the order p of the kernel, which acts as a stabilization parameter. They propose a technique to automatically determine the best value of p. In practice, they define a function representing the shape of the estimated density distribution given p; then they evaluate this function over a range of values of p. The best value of p is the one for which this shape remains unchanged, which can be assessed through the correlation of the shape function evaluated for each sample at pairs of consecutive values of p. The selected value of p is the one for which the correlation tends to 1 (for more details see Wu and Yang [27]).

In Fig. 6, each curve represents the plot of the correlation computed for the MSER-based descriptors of the photos within the same situation for different values of p. In our experiments, the correlation did not always reach the value 1 (this case is mentioned in Wu and Yang [27]). Therefore, we empirically set a threshold for the correlation values and choose p so that the correlation value is greater than this threshold (in our experiments we used 0.995).
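Since the Wu and Yang [27] variant is somewhat involved, the following sketch shows only the basic mean-shift iteration with a flat (uniform) kernel; it is a simplified stand-in for the p-th order Epanechnikov kernel with automatically selected p used in the paper. Following the paper's suggestion, the bandwidth could be set from the sample variance of the inter-sample distances.

```python
import numpy as np

def mean_shift(samples, bandwidth, n_iter=50, tol=1e-6):
    """Plain mean-shift with a flat kernel: each candidate mode is moved
    to the average of all original samples within its window."""
    X = np.asarray(samples, dtype=float)
    shifted = X.copy()
    for _ in range(n_iter):
        moved = 0.0
        for i, x in enumerate(shifted):
            d = np.linalg.norm(X - x, axis=1)
            new_x = X[d <= bandwidth].mean(axis=0)
            moved = max(moved, np.linalg.norm(new_x - x))
            shifted[i] = new_x
        if moved < tol:   # all candidate modes have stabilized
            break
    return shifted
```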

4.2.3 Content-based clustering

After running mean-shift clustering, the samples have been shifted toward the distribution modes. As suggested in Wu and Yang [27], to determine the final clusters, we use an agglomerative clustering method that groups together all the points returned by the mean-shift method whose distance is lower than a threshold τ. In our experiments, we set τ to 0.001; this parameter does not affect the performance of the whole system, as mean-shift provides well separated clusters. The points are sequentially analyzed; every time the distance of a point from the already discovered cluster centroids is larger than τ, a new cluster is added. Each cluster is represented as a multivariate normal probability density with mean $\mu$, computed by averaging all the samples in the cluster, and covariance $\Sigma$, set constant and diagonal, that is $\Sigma = \mathrm{diag}(\sigma^2)$, where $\sigma = 0.1$. In case the cluster is composed of only a single photo, the corresponding MSER-based descriptor is used as the mean.

Fig. 6 Plot of the correlation ρ computed for the shape function for different values of the kernel order p. Each curve was computed for a different segment

In Fig. 2, the content-based clustering level shows how the photos are grouped considering the MSER-based descriptors and the mean-shift clustering method. In practice, photos representing similar points of view of the scene are grouped together, letting a visual semantics emerge from the set of photos.


Algorithm 2 summarizes the main steps needed to obtain the content-clusters within each situation. The function extract_MSER detects the MSERs of a certain image s within the situation S. The regions are then filtered to remove too large or too small regions and, by using an agglomerative clustering, to remove strongly overlapping regions. Then, meanshift_clustering is used to compute the modes of the distribution. The function agglomerative_clustering detects the content-clusters by analyzing the points stored in shifted_samples.
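The final clustering step can be sketched as follows, assuming mean-shift has already produced the array of stabilized points; the function and variable names are illustrative, not those of the original implementation.

```python
import numpy as np

def content_clusters(descriptors, shifted, tau=0.001, sigma=0.1):
    """Group the mean-shifted points into content-clusters (threshold tau)
    and represent each cluster by the mean of its members and a constant
    diagonal covariance, as described in Section 4.2.3."""
    centroids, members = [], []
    for i, p in enumerate(shifted):
        # Assign the point to the first centroid closer than tau...
        for j, c in enumerate(centroids):
            if np.linalg.norm(p - c) <= tau:
                members[j].append(i)
                break
        else:                      # ...or open a new cluster.
            centroids.append(p)
            members.append([i])

    clusters = []
    for idx in members:
        mu = np.mean([descriptors[i] for i in idx], axis=0)
        cov = np.eye(len(mu)) * sigma ** 2     # Sigma = diag(sigma^2)
        clusters.append((mu, cov, idx))
    return clusters
```

Note that, for a cluster composed of a single photo, the mean reduces to the photo's own MSER-based descriptor, consistently with the text above.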

5 Sequence co-organization

An event has an inherent temporality and can be modeled as a process that unfolds in time. In practice, we consider photos taken at a certain event as observations of particular states of the event process. We make the assumption that an event can be modeled as a Markov process, that is, a time-varying process for which the Markov property holds: the next state depends only on the current state. In particular, we adopt a first-order Markov chain. In our formulation, we hierarchically organize the reference sequence, obtaining a set of content-clusters, and we use each of these clusters to model a state of the event process. The order of the situations to which the content-clusters refer permits the definition of the transitions among states for the whole process.

We adopt a Hidden Markov Model (HMM) [3] to model the dynamics of the process across time. In our framework, each photo to classify is an observation; therefore, the sequence of photos corresponds to the sequence of observations. Given such a sequence, our goal is to infer the sequence of states that generated the observations. As a state is a content-cluster in the event tree, inferring the hidden states corresponds to classifying each photo and associating it with the corresponding content-cluster. The transition probability model of the HMM thus permits transfer of the event time structure from the reference sequence to the sequence to classify.

Figure 7 shows the graphical model associated with a first-order discrete-time Markov model, where the probability at $t+1$ depends only on the state inferred at time $t$. Here, $t$ represents the index of the observation (photo) to classify.

The state at any round $t$ is denoted $c_t$. In our formulation, each $c_t$ represents the index of a content-cluster, that is, one of the leaf nodes in the tree computed for the reference sequence $S_r$, as described in Section 4. Each photo in the sequence to classify is an observation $O_t$, and our framework infers the corresponding content-cluster in the reference sequence.

Fig. 7 Graphical model for an HMM. At round i, the variable $C_i$ represents the hidden discrete state, while $O_i$ represents the observation


The temporal model for the whole event is described by the transition probabilities $P(c_{t+1}|c_t)$, while the observation model $P(o_t|c_t)$ is defined in terms of the probability of observing a certain photo given the content-cluster.

Given the reference sequence, the temporal structure of the event is represented by the sequence of situations at the first level of the tree, estimated as described in Section 4. However, since we assume that temporal information is uncertain or missing for the sequence to classify, we can establish correspondences between sequences considering only visual features. Based on how the event tree has been estimated, each situation may be represented by several content-clusters. Thus, transition probabilities must be computed so as to guarantee that the process properly unfolds: in practice, from each content-cluster it is possible to transit only to the content-clusters belonging to the same situation or to content-clusters of situations forward in time. Such a transition probability matrix models the dynamics of the event itself. We note that event dynamics have never been modeled or considered in previous works about photo organization.

Let us indicate with $C_k$ the set of content-clusters that refer to the same situation $k$. At round $t+1$, it is possible to transit from state $c_t \in C_k$ to state $c_{t+1} \in H$, with $H = \cup_{\tau \geq k}\{C_\tau\}$. To enforce such a constraint, we set the transition probability as follows:

$$p(c_{t+1}|c_t) = \begin{cases} \frac{1}{\#H} & \text{if } c_{t+1} \in H \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

where $\#(\cdot)$ computes the number of content-clusters in $H$. In other words, $H$ is the set of content-clusters of all the situations from the $k$-th one onward. Equation 3 assures that the choice of the content-cluster at time $t+1$ is uniform over all the content-clusters reachable from the content-cluster at time $t$.

The probability of an observation given the state at round $t$ is modeled by a multivariate normal distribution, so that:

$$p(o_t|c_t) = \mathcal{N}(o_t \mid \mu_{c_t}, \Sigma_{c_t}) \qquad (4)$$

where $o_t$ is the MSER-based descriptor computed as described in Section 4 for the $t$-th photo in the sequence to classify, while $\mu_{c_t}$ and $\Sigma_{c_t}$ are the parameters representing the $c_t$-th content-cluster in the reference sequence.

We already pointed out that, in some cases, photos in concurrent sequences can largely differ: it is possible that some photos do not match any particular state of the reference sequence. In this case, instead of forcing the method to associate each photo with a content-cluster in the reference sequence, it is possible to introduce an additional state, which we call the "Null" state. Every time an observation matches the Null state, the corresponding photo is left unclassified. We empirically set the probability of an observation given the Null state to a constant value; however, other models could be used for this probability. Our goal here is to demonstrate that some of the pictures cannot be properly classified by the event tree. In practice, other methods may be adopted to exploit the information from the unclassified photos to refine the event tree and add new nodes.


Given a sequence to classify, the inference of the content-clusters corresponding to each photo is performed by maximizing the likelihood $L$ of states and observations. The likelihood is computed as:

$$L = p(c_1) \cdot p(o_1|c_1) \cdot \prod_{i=2}^{N} p(o_i|c_i) \cdot p(c_i|c_{i-1}) \qquad (5)$$

where $N$ is the number of photos to classify. We consider all the content-clusters to have the same initial probability. To maximize the likelihood, we use the Viterbi algorithm [3], a dynamic programming method that maximizes the joint probability of the assignment of photos to leaf nodes. We summarize the procedure in the pseudo-code of Algorithm 3.
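The decoding of (5) can be sketched with a standard log-domain Viterbi pass, as below; the Null state is omitted, initial probabilities are uniform as in the paper, and the function name is ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def viterbi(observations, clusters, A):
    """Infer the most likely content-cluster sequence. observations are the
    MSER-based descriptors of the sequence to classify, clusters the
    (mu, Sigma) pairs of the reference content-clusters, A the transition
    matrix of (3)."""
    n = len(clusters)
    # Log-likelihood of each observation under each content-cluster, Eq. (4).
    logB = np.array([[multivariate_normal.logpdf(o, mean=mu, cov=cov)
                      for (mu, cov) in clusters] for o in observations])
    with np.errstate(divide="ignore"):
        logA = np.log(A)            # forbidden transitions become -inf

    T = len(observations)
    delta = np.zeros((T, n))
    psi = np.zeros((T, n), dtype=int)
    delta[0] = -np.log(n) + logB[0]            # uniform initial probability
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA  # scores[i, j]: state i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[t]

    # Backtrack the most likely state (content-cluster) sequence.
    states = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(psi[t, states[-1]]))
    return states[::-1]
```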

6 Experimental results

6.1 Dataset

We tested our framework on a dataset collected at a public event, which we will make freely and publicly available. The dataset is composed of four different sequences taken at the same place by four different users using different cameras. The non-professional photographers who collected the dataset did not know the purpose for which it was being collected and took as many photos as they wanted. Table 1 summarizes the characteristics of each photo sequence. The sequences were taken at a public conference. It is possible to recognize three different sub-events: before arriving at the conference (A), during the conference (B), and after the conference, at the cocktail party (C). To determine the ground truth for content-clustering, we asked four users to manually annotate the dataset. We merged all the sequences and split them into the three main groups described above (to make it easier for the users to annotate so many photos). Each annotator assigned the same label to all the photos that he/she felt were similar, without considering the temporal relations among photos. None of the annotators was at the conference, and timestamps were not available to them during the annotation. Each annotator used different personal criteria, so they provided different partitions of the photos. In practice, for each sequence we have four different ground truths, each one provided by a different annotator. In the table, we report for each sequence the number of different classes found by each of the four annotators. It is worth stressing that such classes were found by the annotators on the merged sequences, that is, considering all the 356 photos together. Over all the sequences together, Annotator 1 found 49 clusters, Annotator 2 found 44 clusters, Annotator 3 found 32 clusters and, finally, Annotator 4 found 52 clusters. When splitting the ground truth per sequence, the numbers of annotated clusters per annotator and per sequence are those reported in Table 1. Figure 8 reports some pictures extracted from each of the 4 sequences and shows that the photos' content largely differs from sequence to sequence, making their organization more difficult.

Table 1 Dataset for Photo Sequence Co-Organization

Seq.     No. photos   Before conf. (A)   At the conf. (B)   After conf. (C)   Clusters ann. 1   Clusters ann. 2   Clusters ann. 3   Clusters ann. 4
User 1   200          15                 167                18                19                21                15                24
User 2   56           4                  52                 0                 8                 8                 6                 6
User 3   30           0                  24                 6                 11                9                 8                 10
User 4   70           19                 37                 14                37                31                26                36

Fig. 8 Images sampled from the sequences. In some cases, the content largely differs


As an additional dataset for evaluating the hierarchical organization of the reference sequence, we used the Gallagher dataset [9], a public dataset of 589 photos collected over almost six consecutive months by a single user. We asked a user to manually classify each photo in the Gallagher dataset, grouping all the photos that look similar (we will also make this annotation available). The photos were presented to the annotator all together, and the annotator, who did not have access to the temporal information associated with each photo, used only visual information to group the pictures.

6.2 Performance evaluation of the reference sequence organization

Performance evaluation was carried out by computing the accuracy of the partitions provided by our method against each of the ground-truth annotations. When evaluating the performance of the reference sequence organization, for each cluster we find the dominant label coming from the true annotation and count how many photos have a label equal to the dominant one. In this way, the number of correct matches for each detected content-cluster is the number of photos in the intersection between the content-cluster itself and the most overlapping annotated cluster. We compute the ratio between the correct matches across the clusters and the total number of photos in the sequence, that is:

$$\text{accuracy} = \frac{\sum_{\forall \text{content-cluster } C} \text{Number of Correct Matches in cluster } C}{\text{Number of Photos in the Sequence}}. \qquad (6)$$

We tested the hierarchical sequence organization on the Gallagher dataset and on each of the four sequences collected at the conference. Since for the four sequences we have ground truth from four different annotators, we compute the accuracy per annotator independently.
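The accuracy of (6) can be computed in a few lines; the following sketch assumes the predicted and annotated partitions are given as dictionaries mapping photo indexes to cluster labels (the function and argument names are hypothetical).

```python
from collections import Counter

def clustering_accuracy(predicted, annotated):
    """Accuracy of (6): for every predicted cluster, photos carrying its
    dominant ground-truth label count as correct matches."""
    by_cluster = {}
    for photo, cluster in predicted.items():
        by_cluster.setdefault(cluster, []).append(annotated[photo])

    correct = sum(Counter(labels).most_common(1)[0][1]
                  for labels in by_cluster.values())
    return correct / len(predicted)
```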

6.2.1 Performance evaluation of the temporal segmentation

We performed experiments on all the available sequences to test the temporal segmentation method. On each sequence, the automatically computed threshold permits the method to adapt to the temporal duration of the sequence. Figure 9 shows the plot of the timestamp differences along the ordered sequence for the Gallagher dataset [9]. Peaks correspond to large time differences and can be considered as the start of possible new situations. Our method estimates a threshold of almost 9.5 h, yielding 117 different temporal segments for the whole collection. We observed that each situation represents photos taken within the same day; for a collection acquired over such a long temporal window, this is indeed a reasonable result. The average duration of the situations was almost 2 h. In case a finer segmentation is required, the method can easily be extended by asking the user to manually set a threshold. It is also possible to apply our method recursively on each situation until a minimum duration for each situation is reached. However, in this work, we applied our method only once, to build the first level of the reference sequence organization. Figure 10 shows an example of the time segmentation obtained for the Gallagher dataset where, within the same temporal event, more than one content-cluster can emerge, as the photos are visually dissimilar.


Fig. 9 Plot of the timestamp differences of consecutive photos along the time-ordered sequence (x-axis: photo indexes; y-axis: timestamp differences in hours). Peaks show where a new situation starts

In Table 2, we present the results obtained on our 4 concurrent photo sequences. For each sequence, we report the number of photos, the duration of the temporal window in which the photo sequence was acquired, the automatically estimated threshold, the number of situations detected and the average duration of each situation. In the first sequence, composed of a greater number of photos, more situations were found. For the other sequences, where fewer photos were taken, a comparable number of situations was found.

Fig. 10 Situations detected for the first 15 photos of a photo sequence. Time is referred to the first photo in the sequence. Each situation is generally composed of a different number of photos showing a large variance of the visual content

Table 2 Temporal segmentation on the 4 concurrent photo sequences

Sequence   Total no. of photos   Duration of time window (hours)   Estimated threshold (min)   No. of situations   Avg. duration (min)
User 1     200                   10                                4.4                         18                  4.3
User 2     56                    8.5                               17                          5                   24.2
User 3     30                    4                                 15.7                        6                   7
User 4     70                    9                                 10                          8                   15

For comparison purposes, we tested iPhoto v. 8.1.2 [12] for automatically organizing pictures based on time information. In iPhoto, this is achieved by the "automatic event detection" tool. We noted that this tool groups together all the photos taken during the same day. Therefore, for each of the four sequences, the tool discovered only one event, as the pictures were taken on the same day. When considering all the photos together, iPhoto discovered 2 events: indeed, the cameras were not synchronized, and one of them was set to the wrong day, originating a new event. Such an event organization, therefore, does not estimate the real temporal event structure as our method, instead, attempts to do. On the Gallagher dataset, we found that iPhoto computes exactly the same events (situations, in our terminology) as our method does, where each detected event corresponds to the set of pictures taken on the same day.

6.2.2 Performance evaluation of the hierarchical sequence organization

After applying the temporal segmentation to the Gallagher dataset, we organize the photos in each situation based on content, as described in Section 4.2. Within each situation, only a few content-clusters (generally between 1 and 4) emerge. The measured accuracy of the content-based organization within the automatically detected situations is around 83%. Figure 11 shows the hierarchical organization of the first 15 photos of the Gallagher dataset.

We also measured the performance of the hierarchical organization on each of the 4 concurrent sequences. In Table 3, we report the results obtained considering the 4 different annotations. Lower performance corresponds, in general, to finer partitions provided by the annotators.

Table 3 Accuracy of the Hierarchical Organization on each of the 4 concurrent photo sequences

Sequence   Annotator 1 (%)   Annotator 2 (%)   Annotator 3 (%)   Annotator 4 (%)   Average (%)
User 1     81                85.5              88.5              83                84.5
User 2     87.5              77                96.4              85.7              86.5
User 3     83.33             80                90                84.33             84.41
User 4     75.71             81.43             72.9              77.14             76.79

Whilst the sequences used for testing (the Gallagher dataset and the 4 concurrent photo sequences) have different characteristics, the performance is comparable: on average, independently of the annotator, the accuracy of the hierarchical organization is 82.44%. It is worth noting that, as the annotators did not consider the temporal relations among photos while annotating them, the actual accuracy could be higher. Indeed, considering time as well, annotators could provide more accurate partitions of the data. This is difficult to realize in practice: first, the annotators have to be aware of the temporal structure of a sequence; moreover, it is complex to ask them to provide annotations without conditioning their judgements. For this reason we asked the annotators to limit their attention to the content.


Fig. 11 Hierarchical organization of the first 15 photos of the Gallagher dataset

6.3 Performance evaluation of the concurrent photo sequence organization

To measure the performance of our concurrent photo sequence organization, we used the HMM to classify each photo in a sequence and assign as its label the index of the associated content-cluster in the reference tree. Then, comparing the partitions computed by our method to the available ground truth, we computed the accuracy for each classified sequence by (6).

6.3.1 Simulations

To test the probabilistic framework used for concurrent photo sequence organization, we generated a sequence of different content-clusters by sampling a mean and covariance matrix for each of them. Then, using such reference content-clusters, we generated a photo sequence by randomly choosing a content-cluster and enforcing the smoothness constraint in time (that is, the generated photo sequence has the same temporal structure as the reference one). We then used our HMM and the Viterbi algorithm to infer, for the generated photo sequence, the corresponding content-clusters in the reference sequence. This is not a trivial experiment, because it permits testing the model we imposed for the state transition probabilities. We randomly generated 100 pairs "reference sequence / sequence to classify". When generating the event structure, we generated 20 content-clusters. For each of these clusters we "tossed a coin" and decided whether the content-cluster belongs to a new situation or not. We used this information to generate a random number of photos for each content-cluster. On average, the generated photo sequences were composed of 195 photos. We obtained an average accuracy of 95.80%.

6.3.2 Using the longest sequence as reference

We used the longest sequence ("User 1") as the reference sequence. We performed experiments with and without the Null state. In the latter case (without the Null state), we forced the method to classify the photos even when the probability of the match was very low, that is, when the match was probably unreliable. Results of the organization of the remaining sequences considering the different annotations are reported in Table 4, for the case in which the Null state is not used, and in Table 5, for the case in which the Null state is used. In the latter case, some photos may be left unclassified. In particular, 5.36%, 16.67%, and 27.14% of the photos were not classified for the sequences "User 2", "User 3", and "User 4" respectively.

Table 4 Accuracy of the concurrent photo sequence organization over the tree computed for the sequence "User 1" (no Null state)

Sequence        Annotator 1 (%)   Annotator 2 (%)   Annotator 3 (%)   Annotator 4 (%)
User 1 - Tree   81                85.5              88.5              83
User 2          78.6              76.8              87.5              76.8
User 3          63.3              66.7              83.3              60
User 4          65.7              65.7              60                65.7
Average acc.    72.15             73.9              79.8              71.4
(whole set)

Table 5 Accuracy of the concurrent photo sequence organization over the tree computed for the sequence "User 1" (with Null state)

Sequence        Annotator 1 (%)   Annotator 2 (%)   Annotator 3 (%)   Annotator 4 (%)
User 1 - Tree   81                85.5              88.5              83
User 2          80.3              78.6              91.1              78.6
User 3          60.67             70                86.67             60
User 4          65.7              64.3              64.3              65.7
Average acc.    71.92             74.6              82.64             71.8
(whole set)


6.3.3 Using the shortest sequence as reference

We then performed experiments using the shortest sequence ("User 3") as the reference. It is worth noting that "User 3" does not contain any photo of sub-event A (see Table 1); of course, this affects the performance of the method, because pictures of the remaining sequences cannot be properly classified. Results of the organization of the remaining sequences considering the different annotations are reported in Table 6, for the case in which the Null state is not used, and in Table 7, for the case in which the Null state is used. In the latter case, some photos may be left unclassified. In particular, 6%, 0%, and 38.57% of the photos were not classified for the sequences "User 1", "User 2", and "User 4" respectively.

Table 6 Accuracy of the concurrent photo sequence organization over the tree computed for the sequence "User 3" (no Null state)

Sequence        Annotator 1 (%)   Annotator 2 (%)   Annotator 3 (%)   Annotator 4 (%)
User 3 - Tree   83.33             80                90                84.33
User 1          60.5              66.5              69.5              53.5
User 2          64.3              57.1              85.7              64.2
User 4          44.3              47.1              45.7              44.3
Average acc.    63.10             62.10             72.72             61.58
(whole set)

Table 7 Accuracy of the concurrent photo sequence organization over the tree computed for the sequence "User 3" (with Null state)

Sequence        Annotator 1 (%)   Annotator 2 (%)   Annotator 3 (%)   Annotator 4 (%)
User 3 - Tree   83.33             80                90                84.33
User 1          67                72.5              76                60
User 2          64.3              57.1              85.7              64.2
User 4          72.9              78.6              71.4              74.3
Average acc.    71.88             72.05             80.77             69.96
(whole set)

    6.3.4 Discussion and comparison to baseline method

As expected, comparing the tables shows how the choice of the reference sequence affects the accuracy of the organization. When using the sequence "User 1", accuracy is higher; this is due to the fact that this sequence is the most complete and overlaps with all the remaining sequences. When using the sequence "User 3", which has a limited number of pictures and a limited overlap with the other sequences, the accuracy decreases. It is evident that the more information is available, the more reliable the learned event structure is.

Adding the Null state, and considering the sequence "User 1" as reference, accuracy slightly improves. The improvement is much higher when using sequence "User 3" as reference, at the cost of a higher number of unclassified photos. Such an unclassification rate makes even more evident how limited the overlap between sequence "User 3" and sequence "User 4" is.

Table 7 Accuracy of the concurrent photo sequence organization over the tree computed for the sequence "User 3" – with Null state

    Sequence                         Annotator 1 (%)   Annotator 2 (%)   Annotator 3 (%)   Annotator 4 (%)
    User 3 - Tree                    83.33             80                90                84.33
    User 1                           67                72.5              76                60
    User 2                           64.3              57.1              85.7              64.2
    User 4                           72.9              78.6              71.4              74.3
    Average acc. for the whole set   71.88             72.05             80.77             69.96


The last row in Tables 4, 5, 6, and 7 reports the average accuracy computed over all four sequences. The accuracies for the sequences organized in a tree are those reported in Table 3.

As a baseline for comparison purposes, we consider the content-based organization performed on the whole set of photos with no distinction among sequences. We applied mean-shift clustering over the MSER-based descriptors. We did not consider temporal information, which in our hypothesis is unreliable; we stress that in practical situations this information can be missing, or the cameras may not be synchronized (as happens in our dataset). The accuracy of the baseline method was measured as 57.3% when using "Annotator 1", 63.2% when using "Annotator 2", 72.5% when using "Annotator 3", and 54.2% when using "Annotator 4". Comparing these values with those reported in the previous tables, our method generally outperforms the baseline. A minimal sketch of this baseline follows.
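The sketch below illustrates the baseline under stated assumptions: one pooled MSER-based descriptor vector per photo (here replaced by random stand-in data) and scikit-learn's mean-shift implementation; the descriptor dimensionality and the bandwidth quantile are hypothetical.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Stand-in for the real per-photo MSER-based descriptors: one feature
# vector per photo, pooled over all users, with no sequence or
# timestamp information.
rng = np.random.default_rng(0)
descriptors = rng.random((400, 64))

# Mean shift needs no preset number of clusters; the kernel bandwidth
# is estimated from the data.
bandwidth = estimate_bandwidth(descriptors, quantile=0.2)
labels = MeanShift(bandwidth=bandwidth).fit_predict(descriptors)
print("baseline found", labels.max() + 1, "content-clusters")
```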

7 Conclusions and future work

In this paper we addressed the problem of organizing multiple photo sequences taken by different users with different cameras at the same place and during the same temporal window. Our framework takes one of the sequences as reference and organizes it hierarchically in a tree to capture the event structure; it then uses this tree to organize the photos of the other sequences. Our method does not require that camera timestamps are synchronized, but it requires the reference sequence to have timestamps embedded. The reference sequence is organized hierarchically based on both temporal and visual information. The leaves of the tree are the content-clusters, which contain photos acquired in the same temporal window and with similar content. To perform the co-organization, we used an HMM to represent the temporal dynamics of the event and the Viterbi method to infer the sequence of content-clusters to which each photo to classify belongs. Our experiments show that our technique is effective in co-organizing photo sequences and that it may provide semantically meaningful clusters.

However, our method presents some limitations, as it is unable to identify parallel sub-events. For example, during the same temporal window, several events may occur in parallel, and each photographer may focus on only one of them. In this case, our method can provide only the content-clusters for the events that are represented in the reference photo sequence. In our implementation, we do not refine the tree computed for the reference sequence. Techniques to perform such a refinement iteratively using the other sequences could be adopted, and this remains a topic for future investigation. These techniques may help to handle the parallel sub-event case: for example, the unclassified pictures in each sequence may be used to introduce new content-clusters into the event tree to represent the missing parallel sub-events. In this case, suitable strategies have to be defined to reason about the time information of the added sub-events. In future work we will consider the adoption of Gaussian mixture models (GMMs) to model the content within each situation; each Gaussian component would represent a particular content-cluster (see the sketch below). We will also focus on finding new techniques to co-organize the set of photo sequences without selecting a reference.
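As a rough illustration of this direction only, the sketch below fits a per-situation GMM with scikit-learn; the descriptor dimensionality, the number of components, and the diagonal covariance are hypothetical choices, not a finalized design.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_situation_gmm(features, n_components):
    """Fit a GMM over the photo descriptors of one situation; each
    Gaussian component would play the role of one content-cluster,
    and the responsibilities give soft photo-to-cluster assignments."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag",
                           random_state=0).fit(features)

# Hypothetical descriptors for the photos of a single situation.
feats = np.random.default_rng(0).random((40, 64))
gmm = fit_situation_gmm(feats, n_components=3)
soft_assignments = gmm.predict_proba(feats)  # shape (40, 3)
```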


When several subsequent situations in the event tree have very similar content structures, our method could be unable to recover the correct alignment between the reference sequence and the sequence to classify, because the correspondences are found considering only visual similarities. Therefore, in future work, we will also explore how stricter temporal constraints can be enforced when defining the time structure of the event, by adding new dependencies into the dynamic Bayesian network. Indeed, by also considering the dynamics in the sequence to classify, it could be possible to find a finer alignment between the two photo sequences.

Finally, we believe photo sequence co-organization is still a largely unexplored problem that opens many future directions to investigate, such as estimating the differences among the sensors of different cameras while co-organizing the sequences, and then using this information to improve the multiple photo-sequence co-organization.

Acknowledgements We thank all the anonymous reviewers and the associate editor, whose insightful comments and very constructive reviews led to significant improvements of the manuscript.


Liliana Lo Presti received her Master's degree and PhD in computer engineering from the University of Palermo, Italy, in 2006 and 2010, respectively. During 2008–2009, she was a visiting researcher in the Image and Video Computing Research Group at Boston University. In 2010–2011, she was a post-doctoral researcher in the Computer Engineering Department at the University of Palermo. Currently, she is a post-doctoral researcher in the Computer Science Department at Boston University. Her research interests include computer vision and machine learning with applications to distributed video-surveillance systems, multimedia, and information filtering and retrieval.



Marco La Cascia received his MSEE in electrical engineering and PhD from the University of Palermo, Italy, in 1994 and 1998, respectively. During 1996–1999, he was a research associate in the Image and Video Computing Research Group at Boston University. After that, he worked as a senior software engineer in the computer telephony group at Offnet S.p.A., Rome, Italy. He joined the University of Palermo as an assistant professor at the end of 2000 and is currently an associate professor at the same university. His research interests include low- and mid-level computer vision, image and video database retrieval, and vision-based video surveillance. Dr. La Cascia has co-authored more than 50 refereed journal and conference papers.
