
Geographically-aware Cross-media Retrieval for Associating Photos to Travelogues

Rui Candeias
[email protected]

Instituto Superior Técnico, INESC-ID
Av. Professor Cavaco Silva

2744-016 Porto Salvo, Portugal

ABSTRACT

The automatic association of illustrative photos to paragraphs of text is a challenging cross-media retrieval problem with many practical applications. In this paper we propose novel geographic information retrieval methods to automatically associate photos to textual documents. The proposed methods are based on the recognition and disambiguation of location names in the texts, using the corresponding geospatial coordinates to query Flickr for candidate photos. The photos are also clustered with basis on visual similarity, and we use relevance estimators obtained from both individual images and image clusters. The best photos are selected with basis on their popularity, on their proximity, on temporal cohesion, and on the similarity between the photo's textual descriptions and the text of the documents. We tested different learning to rank approaches to combine a rich set of features capturing photo relevance. A method that uses the Coordinate Ascent algorithm to combine term-level textual similarity, topical similarity, geographic proximity, and temporal cohesion obtained the best results.

1. INTRODUCTION

The automatic association of illustrative photos to paragraphs of text is a challenging cross-media retrieval problem with many practical applications. For instance, the Zemanta1 blog enrichment extension is a commercial application capable of suggesting photos from Flickr to blog posts. Another example concerns textual documents describing travel experiences, usually called travelogues, which can give interesting information in the context of planning a trip. Today, there are several websites where these documents are shared, and the use of web information for planning travels has also increased. However, the use of the travelogues by themselves is very restrictive. It is our conviction that the visualization of photos associated with specific parts of the travelogue, like common scenarios and points of interest, may lead to a better usage of travelogues.

1http://www.zemanta.com/

Despite the huge number of high quality photos available in websites like Flickr2, these photos are currently not being properly explored in cross-media retrieval applications. In this paper, we propose novel geographic information retrieval methods to automatically associate photos, published on Flickr, to textual documents. These methods are based on mining geographic information from textual documents, using a free service to recognize and disambiguate location names and points of interest in those documents. The places recognized in the documents are then used to query Flickr for related photos. The photos are clustered together with basis on their visual similarity, and we use relevance estimators obtained from both the individual photos and from the visual image clusters, measuring the association between the photos and the textual documents. Finally, the best photos are selected with basis on their popularity and on the similarity between their information (e.g., textual, geographical and temporal metadata, from individual images and from image clusters) and the information from the document (e.g., textual contents, recognized places and temporal metadata). To combine the multiple relevance estimates, we experimented with the usage of state-of-the-art learning to rank methods. Experimental results attest to the adequacy of the proposed approaches, showing that a method that uses the Coordinate Ascent algorithm to combine term-level textual similarity, topical similarity, geographic proximity and temporal cohesion can lead to values of 0.73 and 0.83, respectively, in terms of the Precision@1 and Reciprocal Rank metrics.

The rest of this paper is organized as follows: Section 2 presents the main concepts and related work. Section 3 describes the proposed methods, detailing the mining of geographic information contained in texts and the selection of the best photos, based on their popularity and similarity. Section 4 describes the validation methodology and the obtained results. Finally, Section 5 presents our conclusions and points directions for future work.

2http://www.flickr.com

2. RELATED WORK

Typical problems related with the treatment of geographic references in textual documents have been widely studied in the area of Geographic Information Retrieval [1, 16, 21, 24]. Using this information requires the recognition of place names in texts (i.e., delimiting the text tokens referencing locations) and the disambiguation of those place names in order to know their real location on the surface of the Earth (i.e., assigning unique identifiers, typically geospatial coordinates, to the location names that were found). The main challenges in both tasks are related with the ambiguity of natural language. Amitay et al. characterized those ambiguity problems according to two types, namely geo/non-geo and geo/geo [1]. Geo/non-geo ambiguity occurs when location names have a non-geographic meaning (e.g., the word Turkey can refer to either the country or the bird). Geo/geo ambiguity refers to distinct locations with the same name (e.g., London, a city in England or in Ontario).

Leidner studied different approaches for the recognition and disambiguation of geographic references in text [16]. Most of the studied methods resolve place references by matching expressions from the texts against dictionaries of location names, and use disambiguation heuristics like default senses (e.g., the most important referenced location is chosen, estimated by the population size) or spatial minimality (e.g., the correct disambiguation should minimize the polygon that covers all the geographic references contained in the document). Recently, Martins et al. studied the usage of machine learning approaches in the recognition and disambiguation of geographic references, using Hidden Markov Models in the recognition task and SVM regression models, with features corresponding to the heuristics surveyed by Leidner, in the disambiguation task [21]. Other recent works focused on place reference recognition and disambiguation problems that are particularly complex, involving the processing of texts where geographic references are very ambiguous and have a low granularity (e.g., mountaineering texts mentioning tracks and specific regions in mountains), and where it is important to distinguish between the location names pertinent to route descriptions and those that are pertinent to the description of panoramas [24].

Currently, there are also many commercial products for recognizing and disambiguating place references in text. An example is the Yahoo! Placemaker3 web service, which was used in this work and is described in more detail in Section 3.1.

Previous works have also studied the usage of Flickr as a Geographic Information Retrieval information source [6]. The information stored in this service revealed itself to be very useful for many different applications, due to the direct links between geospatial coordinates (i.e., the coordinates of the places where the photos were taken, either given by cameras with GPS capabilities or manually set by the authors), dates (i.e., the moments when the photos were taken) and text descriptions that are semantically rich (i.e., descriptions and tags associated to photos).

In particular, Lu et al. addressed the automatic association of photos, published on Flickr, to Chinese travelogues [20], with basis on a probabilistic topic model detailed in a previous work [10], which was an extension of the Probabilistic Latent Semantic Indexing (pLSA) method [11]. The main idea in the work by Lu et al. is similar to the basis of our work, as the authors tested different methods for the selection of photos, obtained by querying Flickr's search engine with the location names recognized in the texts. The probabilistic topic model is used by the authors to bridge the gap between the vocabulary used in the documents and the textual descriptions used in photos, modeling photos and/or documents as probabilistic distributions over topics, which in turn can be seen as probabilistic distributions over words. The authors tested four different approaches for the selection of relevant photos, namely (i) a baseline approach based on a simple word-to-word matching between the words from the travelogue texts and the tags that represent the photos, (ii) a mechanism based on a probabilistic model created with the travelogue texts, (iii) a mechanism based on a probabilistic model created with the tags that represent the photos, and (iv) a mechanism based on a probabilistic model using both the texts and the tags, which obtained the best results. In our work, we approached the problem in a slightly different way, starting by querying Flickr with the geospatial information associated with the places recognized in the documents.

3http://developer.yahoo.com/geo/placemaker/

In terms of previous works related to the area of cross-media retrieval, Deschacht and Moens presented an approach that tries to find the best picture of a person or an object, stored in a database of photos, using the captions associated to each picture [7]. The authors built appearance models (i.e., language models that represent the image textual captions), to capture persons or objects that are featured in an image. Two types of entity-based appearance models were tested, namely an appearance model based on the visualness (i.e., the degree to which an entity is perceived visually), and another appearance model based on the salience (i.e., the importance of an entity in a text). As baseline approaches, the authors built two simpler appearance models, namely (i) a bag-of-words (BOW) model based on the words of the image captions, and (ii) a bag-of-nouns (BON) model based on the nouns and proper nouns contained in the image captions. From a dataset composed of several image-caption pairs, the authors created two different sets of images annotated with the entities, namely (i) an easy dataset composed of images with one entity, and (ii) a difficult dataset composed of images with three or more entities. The results showed that when the dataset was queried with only one entity, the method using the appearance model based on the visualness achieved the best results. On the other hand, when the query was composed of two entities, the method using the bag-of-words had better results.

Coelho and Ribeiro approached the task of finding suitable images to illustrate text, from specific news stories to more generic blog entries, through a content-based multimedia retrieval technique based on a three-stage process involving (i) textual search, (ii) score filtering and (iii) visual clustering [5]. The system was tested on the SAPO-Labs media collection, containing photos of known personalities, and on the MIRFlickr-25000 collection, with photos and user tags collected from Flickr. Visual content was described by the Joint Composite Descriptor and indexed by a Permutation-Prefix Index. The obtained MIRFlickr results correspond to precision values above 70%, while the SAPO-Labs personality search fared worse, with precision values below 40%.

Previous works in information retrieval, and also in geographic information retrieval [22], have addressed the usage of supervised machine learning for developing search engine ranking formulas, combining multiple estimators for document relevance in an optimal way. Both Tie-Yan Liu [19] and Hang Li [17] presented good surveys on the subject of learning to rank for information retrieval (L2R4IR), categorizing the existing algorithms into three groups, according to their input representation and optimization objectives:

• Pointwise approach - The L2R4IR task is seen as either a regression or a classification problem. Given feature vectors of each single resource (e.g., a photo) from the data for the input space, the relevance degree of each of those individual resources is predicted with scoring functions. Through these scores we can sort resources and produce the final ranked list. Several pointwise methods have been proposed, including Multi-class Classification for Ranking (McRank) [18].

• Pairwise approach - The L2R4IR task is seen as a binary classification problem for the pairs of resources to be ranked, since the relevance degree can be regarded as a binary value telling which ordering is better for a given pair of resources. Given feature vectors of pairs of resources from the data for the input space, the relevance degree of each resource can be predicted with scoring functions which try to minimize the average number of misclassified resource pairs. Several different pairwise methods have been proposed, including SVMrank [14], RankNet [4] or RankBoost [9].

• Listwise approach - The L2R4IR task is addressed in a way that takes into account an entire set of resources, associated with a query, as instances. These methods train a ranking function through the minimization of a listwise loss function defined on the predicted list and the ground truth list. Given feature vectors of a list of resources of the data for the input space, the relevance degree of each of those documents can be predicted with scoring functions which try to directly optimize the value of a particular information retrieval evaluation metric averaged over all queries in the training data [19]. Several different listwise methods have also been proposed, including SVMmap [28], AdaRank [26] or Coordinate Ascent [23].

In this paper, we made experiments with the application of state-of-the-art learning to rank algorithms available through the RankLib4 open-source Java package, namely RankBoost and Coordinate Ascent. We modeled the task of selecting relevant photos to associate to textual documents as a learning to rank problem where we represent the associations between documents and photos as feature vectors containing multiple relevance estimators (e.g., the textual similarity between the document and the tags from the photo), and use training data to learn a ranking model capable of sorting photos in relation to a given document in a way that optimizes the Reciprocal Rank evaluation metric.

4http://www.cs.umass.edu/~vdang/ranklib.html

3. ASSOCIATING PHOTOS TO TEXTS

The proposed method for the automatic association of photos to textual documents is essentially based on a pipeline of four stages, which involves (i) recognizing and disambiguating location names and points of interest referenced in documents, (ii) collecting candidate photos through Flickr's API5, (iii) clustering photos with basis on visual image similarity, and (iv) selecting the best photos with basis on their importance and on their similarity (e.g., textual, geographical and temporal similarity, computed from the individual photo and from the corresponding image cluster) towards the document. This section details these four steps.
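As an illustration of step (ii), the sketch below queries the Flickr REST API for geotagged photos taken near a disambiguated place. The parameter names follow the public flickr.photos.search method; the helper function, the search radius and the choice of extras are our own for the example, while the date range matches the one later used for the test collection.

import requests

FLICKR_ENDPOINT = "https://api.flickr.com/services/rest/"

def search_photos_near(api_key, lat, lon, radius_km=5, per_page=50):
    """Return Flickr photos taken near the given coordinates, together with
    their geo, tag, view-count and date-taken metadata."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "lat": lat,
        "lon": lon,
        "radius": radius_km,                 # search radius, in kilometres
        "min_taken_date": "2000-01-01",      # date range used for the test collection
        "max_taken_date": "2010-06-01",
        "extras": "geo,tags,views,date_taken,description",
        "per_page": per_page,
        "format": "json",
        "nojsoncallback": 1,
    }
    response = requests.get(FLICKR_ENDPOINT, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["photos"]["photo"]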

3.1 Mining geographic data in documents

In this work, we used the Yahoo! Placemaker web service in order to extract locations and specific points of interest from texts. Placemaker can identify and disambiguate places mentioned in textual documents. The service takes as input a textual document with the information to be processed, and returns an XML document that lists the referenced locations. For each location found in the input document, the service also returns its position in the text, the expression that was recognized as the location, the type of location (e.g., country, city, suburb, point of interest, etc.), a unique identifier in the locations database used by the service (i.e., the Where On Earth Identifier - WOEID - used by Yahoo! GeoPlanet6), and the coordinates of the centroid that is associated to the location (i.e., the gravity center of the minimum rectangle that covers its geographic area). Also, for each document taken as input, the service returns the bounding box corresponding to the document (i.e., the minimum rectangle that covers all its geographic locations).
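For concreteness, the sketch below shows how the fields described above (recognized expression, place type, WOEID and centroid coordinates) could be pulled out of such an XML response. The element names are illustrative placeholders rather than the exact Placemaker schema, so the parser would need to be adapted to the real response format.

import xml.etree.ElementTree as ET

def parse_place_response(xml_text):
    """Extract the recognized places from a Placemaker-style XML response.
    The element names ("place", "name", "type", "woeid", "latitude",
    "longitude") are assumed for illustration; they mirror the fields
    described in the text, not the exact schema of the service."""
    places = []
    for place in ET.fromstring(xml_text).iter("place"):
        places.append({
            "name": place.findtext("name"),
            "type": place.findtext("type"),
            "woeid": place.findtext("woeid"),
            "lat": float(place.findtext("latitude")),
            "lon": float(place.findtext("longitude")),
        })
    return places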

3.2 Finding relevant photos

The main challenges in collecting and selecting photos relevant to a segment of text are related to the semantic gap between the photo metadata and the text, as well as to the noise present in the documents and in the descriptions of the photos. For instance, in the case of travelogues, and despite the fact that these documents have a uniform structure, their authors frequently mention information related to transportation and accommodation, and not only descriptions of the most interesting locations. For example, if the text of a travelogue mentions an airport or the city where the trip ends, while describing the arrival, one can select photos related to these locations, which are not important for illustrating the interesting contents of the document. Travelogues thus frequently mention locations that are only slightly relevant, and so it is very important to distinguish between relevant and irrelevant locations.

Other challenges in collecting and selecting relevant photos are related with the fact that photos published in Flickr are frequently associated to tags, titles or textual descriptions that are irrelevant to their actual contents (e.g., titles and tags are usually identical among different photos uploaded by the same person, at the same time), and also the vocabulary used in Flickr can be very different from the vocabulary used in textual documents.

Having these limitations in mind, we tested different approaches for the selection of relevant photos, attempting to combine multiple estimators of relevance in order to improve results. These approaches are as follows:

5http://www.flickr.com/services/api/
6http://developer.yahoo.com/geo/geoplanet/


T1: Selection based on a baseline textual similarity metric: We compute the textual similarity between the tags plus the title of the photos, and the text of the document. Specifically, we compute the cosine measure between the textual descriptions of the photos (i.e., joining tags and title) and the textual document, using the Term Frequency × Inverse Document Frequency (TF-IDF) method to weight terms in the feature vectors. The idea behind this method is that, if a photo has textual descriptions more similar to the text of a document, in terms of having many common words, then it can be considered as a good photo to be associated to the document.
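A minimal sketch of this baseline, using scikit-learn as a stand-in for the implementation actually used in the paper (which additionally applies stop-word removal, stemming and the tag weighting described at the end of this section):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def t1_scores(document_text, photo_descriptions):
    """TF-IDF cosine similarity between a document and each photo's
    textual description (tags joined with the title)."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([document_text] + list(photo_descriptions))
    doc_vector, photo_vectors = matrix[0], matrix[1:]
    return cosine_similarity(doc_vector, photo_vectors).ravel()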

T2: Selection based on rich textual similarity features: We compute the textual similarity between the tags plus the title of the photos, and the text of the document, using a rich set of textual similarity features that goes beyond the baseline TF-IDF metric from method T1. We represent the textual contents of the documents and the textual metadata for the photos as either term vectors or as probability distributions over a set of topics, afterwards measuring similarity with basis on these representations. In the case of the topic-based representations, they are obtained through the application of the Latent Dirichlet Allocation (LDA) topic modeling technique [2] as implemented in the LingPipe package7, specifically by fitting a model with 25 topics over a representative collection of Flickr data that contains tags, titles and descriptions, namely the large MIRFlickr collection [13]. We build the LDA model in an offline pre-processing step, afterwards computing Bayesian point estimates for the topic distribution of documents and photo descriptions through Gibbs sampling.

The complete set of textual features is as follows:

– Two features capturing (i) the number of terms in the textual descriptions of the photos (i.e., joining tags and title), and (ii) the number of terms in the textual document.

– The number of terms shared between the descriptions of the photos (i.e., joining tags and title) and the textual document, normalized according to the number of terms in the textual descriptions of the photo (i.e., a term frequency feature).

– The cosine similarity measure between the term vectors corresponding to the descriptions of the photos and the textual document, using the Term Frequency × Inverse Document Frequency (TF-IDF) method to weight individual terms.

– The cosine measure between the topic vectors obtained through LDA from the descriptions of the photos and from the textual document. Each of the 25 dimensions on these vectors contains a score representing the probability of the photo or document belonging to that particular topic.

– A symmetrized form of the Kullback-Leibler divergence computed between the probability distributions for the topics present in the descriptions of the photos and in the textual document.

7http://alias-i.com/lingpipe/

– A binary feature indicating if the most probable topic present in the description of the photo is the same as the most probable topic obtained for the textual document.

By representing photos and documents in terms of the underlying topics, instead of just tags and words, we hope to avoid the semantic gap between the vocabulary used in the textual documents and the vocabulary used to tag photos, similarly to Lu et al. [20].
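Assuming the 25-dimensional topic distributions have already been estimated (in the paper, with LingPipe's LDA implementation), the topic-level features above reduce to a few lines; the symmetrization used here (the mean of the two directed KL divergences) is one common choice and is an assumption of the sketch.

import numpy as np

def topic_features(doc_topics, photo_topics, eps=1e-12):
    """Cosine similarity, symmetrized KL divergence and a same-top-topic
    flag computed from two topic distributions."""
    p = np.asarray(doc_topics, dtype=float) + eps
    q = np.asarray(photo_topics, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()

    cosine = float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))
    sym_kl = 0.5 * float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
    same_top_topic = int(np.argmax(p) == np.argmax(q))
    return cosine, sym_kl, same_top_topic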

T3: Selection based on rich textual similarity features and sentimental polarity match: We combined the rich textual similarity from T2 with features based on the sentimental mood associated to both the documents and the photos. A simple opinion mining method, based on counting the number of words from the document appearing in a subjectivity lexicon8, was used to estimate the polarity of the textual document, assigning it a value ranging from minus one (i.e., negative sentimental mood) to plus one (i.e., positive mood). A similar value is computed from the visual contents of the candidate image, assigning it a score ranging from minus one (i.e., cooler images, where the blue component dominates over the red component) to plus one (i.e., warmer images, where the red component dominates over the blue component). Under the assumption that cooler images should be assigned to documents having a negative sentimental polarity, and warmer images should be assigned to documents having a positive sentimental polarity, we compute the absolute difference between the values computed from the document d and the image i. The actual formula is shown below.

match(d, i) = | (|words(d,+)| − |words(d,−)|) / |words(d)| − Σ_{p∈i} (p_R − p_B) / Σ_{p∈i} 255 |

In the formula, the function words() returns the number of words from the textual document matching the polarity lexicon, and p ∈ i denotes the RGB representation of a pixel from image i, from where we take the R and B components.
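A direct transcription of this formula into code, assuming the lexicon word counts and the image pixels (as RGB tuples) have been obtained elsewhere:

def polarity_match(pos_words, neg_words, total_words, pixels):
    """Absolute difference between the lexicon-based text polarity and the
    red-vs-blue "warmth" of the image; lower values indicate a better
    sentiment match.  `pixels` is a sequence of (R, G, B) tuples."""
    text_polarity = (pos_words - neg_words) / float(total_words)
    image_warmth = sum(r - b for r, _, b in pixels) / (255.0 * len(pixels))
    return abs(text_polarity - image_warmth)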

T4: Selection based on rich textual similarity features, sentimental polarity and geographic proximity: We combined the textual and polarity similarity features from T3 with the similarity, based on the geospatial coordinates, between the locations recognized in the document and the locations where photos were taken. The geographical similarity is computed according to the formula 1/(1+d), where d is the geodesic distance between the two locations. Because multiple locations can be recognized in the document, we computed the maximum and the average similarity towards each photo. The idea behind this method is that a photo that was taken near a location recognized in the document can be considered as a good photo to be associated to the document.
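The paper does not state how the geodesic distance was computed; the sketch below assumes the haversine great-circle approximation, and then derives the maximum and average 1/(1+d) similarities over the places recognized in the document.

import math

def geodesic_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (haversine approximation)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def geo_similarities(photo_coord, document_coords):
    """Maximum and average of 1 / (1 + d) over the places recognized in the
    document, with d the geodesic distance to the photo's location."""
    sims = [1.0 / (1.0 + geodesic_km(photo_coord[0], photo_coord[1], lat, lon))
            for (lat, lon) in document_coords]
    return max(sims), sum(sims) / len(sims)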

8http://sentiwordnet.isti.cnr.it/

T5: Selection based on rich textual similarity, sentimental polarity, geographical proximity and temporal cohesion: We combine method T4 with the temporal distance, in terms of semesters, between the publication date of the document and the moment when a photo was taken. Similarly to what is done in method T4, we use a temporal similarity metric computed according to the formula 1/(1+t), where t is the number of semesters separating the photo from the document. The idea behind this method is that a photo taken at a moment close to the date when the document was written can be considered as a good photo to be associated to the document.
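A small sketch of this temporal feature; how a "semester" is delimited is not specified in the paper, so the code assumes calendar half-years.

import datetime

def semester_index(date):
    """Map a date to a half-year index: Jan-Jun -> even, Jul-Dec -> odd."""
    return date.year * 2 + (0 if date.month <= 6 else 1)

def temporal_similarity(document_date, photo_date):
    """1 / (1 + t), with t the number of semesters between the two dates."""
    t = abs(semester_index(document_date) - semester_index(photo_date))
    return 1.0 / (1.0 + t)

# Example: a photo taken roughly one year before the document was written.
print(temporal_similarity(datetime.date(2010, 5, 1), datetime.date(2009, 4, 20)))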

T6: Selection based on rich textual similarity, sentimental polarity, geographic proximity, temporal cohesion and photo interestingness: We combine method T5 with other information related to the interestingness of the photos (e.g., the number of comments and the number of times other users considered the photo as a favorite). In this case, if a photo was taken in a location inside the bounding box of the document (i.e., the bounding box that contains all locations), then the number of comments and the number of times a photo was marked as favorite are considered as features, otherwise these features assume the value of minus one. The idea behind this method is that a photo that was taken near the locations recognized in the document, and that is considered an interesting photo due to the high number of comments and the high number of times users marked it as a favorite, can be considered a good candidate photo to be associated to the document.
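The bounding-box gating of the interestingness features can be written as below; the photo field names are placeholders for whatever metadata structure is actually used.

def interestingness_features(photo, bounding_box):
    """Return (comments, favorites) for photos taken inside the document's
    bounding box, and (-1, -1) otherwise, as described for method T6.
    `bounding_box` is (min_lat, min_lon, max_lat, max_lon) and `photo` is a
    dict with illustrative keys."""
    min_lat, min_lon, max_lat, max_lon = bounding_box
    inside = (min_lat <= photo["lat"] <= max_lat and
              min_lon <= photo["lon"] <= max_lon)
    if not inside:
        return -1, -1
    return photo["num_comments"], photo["num_favorites"]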

T7: Selection combining individual photo similarity, polarity and interestingness, together with features computed from visual image clusters: We group together images with similar visual characteristics (i.e., with similar colors in the same regions) and compute the similarity between these groups of photos and the textual document, combining these similarity scores with the features from method T6. Specifically, we use the K-means clustering algorithm together with a simple visual similarity metric9 based on local color features to group the photos taken in a location inside the bounding box of the document (i.e., the bounding box that contains all locations) into K clusters. The algorithm for computing the visual similarity starts by normalizing image dimensions to 300x300 pixels, afterwards computing 25 × 3 features corresponding to the average of the RGB values on 25 regions with 30x30 pixels each, uniformly sampled from the image. To compute the similarity between two images A and B, we take the 25 regions from each image, compute the Euclidean distance between the corresponding regions, and accumulate the values, afterwards normalizing the aggregated distance d to a similarity score according to the formula 1/(1+d) (a sketch of this computation is given after the feature list below). For photos coming from a location outside the bounding box of the document, these features assume the value of minus one. The number of clusters K is determined through the method described in [25], which involves choosing a value that maximizes inter-cluster similarity and minimizes intra-cluster similarity. The photo is assigned to the most similar cluster and the following set of features is then computed:

9http://www.lac.inpe.br/JIPCookbook/6050-howto-compareimages.jsp

– The total number of terms in the set of all textual descriptions from the photos belonging to the most similar cluster.

– The number of terms shared between the descriptions of all the photos belonging to the most similar cluster, and the textual document, normalized according to the number of terms in the full set of textual descriptions for the photos.

– The cosine similarity measure between the term vectors corresponding to the descriptions of all the photos belonging to the most similar cluster, and the textual document, using the Term Frequency × Inverse Document Frequency (TF-IDF) method to weight individual terms.

– The average similarity, based on the geospatial coordinates, between the locations recognized in the document and the centroid location of the photo's most similar cluster. The geographical similarity is computed according to the formula 1/(1+d), where d is the average geodesic distance between the two locations.

– The maximum similarity, based on the geospatial coordinates, between the locations recognized in the document and the closest location associated to one of the photos from the most similar cluster. Similarity here is also computed according to the formula 1/(1+d), where d is the minimum geodesic distance between the two locations.

– The average temporal distance, in semesters, between the publication date of the document and the moments when the photos from the most similar cluster were taken. We use the average of the temporal similarities computed according to the formula 1/(1+t), where t is the number of semesters separating the photos from the document.

– The maximum temporal similarity, computed in terms of semesters as in the case of the previous feature, between the publication date of the document and the moments when the photos from the most similar cluster were taken.

The above features computed from the most similar cluster are used together with those from method T6, with the intuition that visual clusters can be used to propagate relevance information across similar images, thus avoiding the data sparseness issues commonly associated to Flickr tags. If we can group together images depicting the same monuments and the same points of interest, then we can propagate tags and other metadata information between these visual clusters of images, hopefully improving the results.
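The sketch below, assuming Pillow and NumPy, follows the description in T7: images are resized to 300x300 pixels, 25 regions of 30x30 pixels are sampled uniformly over the image, and the per-region Euclidean distances are accumulated and mapped to 1/(1+d). The resulting 25x3 descriptors can then be handed to a K-means implementation (e.g., scikit-learn's KMeans) to form the visual clusters.

import numpy as np
from PIL import Image

def color_signature(path):
    """25x3 descriptor: average RGB over 25 regions of 30x30 pixels sampled
    uniformly (a 5x5 grid of region corners) from the 300x300 resized image."""
    img = np.asarray(Image.open(path).convert("RGB").resize((300, 300)), dtype=float)
    corners = np.linspace(0, 270, 5, dtype=int)       # region top-left corners
    regions = [img[r:r + 30, c:c + 30].reshape(-1, 3).mean(axis=0)
               for r in corners for c in corners]
    return np.stack(regions)

def visual_similarity(sig_a, sig_b):
    """Accumulate the per-region Euclidean distances and map the total
    distance d to a similarity score 1 / (1 + d)."""
    d = float(np.linalg.norm(sig_a - sig_b, axis=1).sum())
    return 1.0 / (1.0 + d)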

The above combination approaches were based on the usage of learning to rank algorithms to combine the multiple features. The following subsection details the considered learning to rank algorithms.

One should also note that, when processing the textual descriptions of the photos and the text of the documents, stopwords were first removed, followed by a stemming process, and tags were considered to be more important than titles to describe the photos. Thus, we applied different weights to the different types of textual descriptions, weighting the tags as twice as important (i.e., representing the photos through virtual text documents where the text from the tags is repeated twice, and the text from the title appears only once).
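A minimal sketch of this weighting scheme (stop-word removal and stemming are assumed to happen elsewhere in the text-processing pipeline):

def virtual_description(tags, title):
    """Build a photo's virtual text document with the tags weighted twice
    as much as the title, by repeating the tag text."""
    tag_text = " ".join(tags)
    return " ".join([tag_text, tag_text, title])

# Example: the tags appear twice, the title once.
print(virtual_description(["louvre", "paris", "pyramid"], "Louvre at dusk"))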

3.3 Learning to rank photos

As previously stated, we experimented with the application of state-of-the-art learning to rank algorithms available through the RankLib open-source Java package, namely:

• RankBoost - This is a pairwise boosting technique for ranking [9]. Training proceeds in rounds, starting with all the pairs of photos being assigned an equal weight. At each round, the learner selects the weak ranker that achieves the smallest pairwise loss on the training data, with respect to the current weight distribution. Pairs that are correctly ranked have their weight decreased and those that are incorrectly ranked have their weight increased, so that the learner will focus more on the hard samples in the next round. The final model is essentially a linear combination of weak rankers. Weak rankers can theoretically be of any type, but they are most commonly chosen as a binary function with a single feature and a threshold. This function assigns a score of 1 to a document whose feature value exceeds the threshold, and 0 otherwise. In our experiments, the number of threshold candidates was set to 50 and the number of rounds to train was set to 1000.

• Coordinate Ascent - This is a state-of-the-art listwise method, originally proposed by Metzler and Croft, which uses coordinate ascent to optimize the ranking model parameters [23]. Coordinate ascent is a well-known technique for unconstrained optimization, which optimizes multivariate objective functions by sequentially doing optimization in one dimension at a time. The method cycles through each parameter and optimizes over it, while fixing all the other parameters.
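Our experiments used RankLib's implementations of both algorithms; the sketch below only illustrates the coordinate ascent idea on a linear combination of the photo relevance features, directly optimizing mean reciprocal rank. The step grid and the number of sweeps are arbitrary choices for the example, not values taken from RankLib.

import numpy as np

def reciprocal_rank(scores, relevance):
    """RR of the first relevant photo under the ranking induced by `scores`."""
    for rank, idx in enumerate(np.argsort(-scores), start=1):
        if relevance[idx]:
            return 1.0 / rank
    return 0.0

def mean_rr(weights, queries):
    """`queries` is a list of (feature_matrix, relevance_vector) pairs,
    one entry per document (i.e., per learning-to-rank query)."""
    return float(np.mean([reciprocal_rank(X @ weights, y) for X, y in queries]))

def coordinate_ascent(queries, n_features, sweeps=20, steps=np.linspace(-2, 2, 41)):
    """Optimize the weights of a linear scoring function one coordinate at a
    time, keeping a change only when it improves the target metric."""
    w = np.ones(n_features)
    best = mean_rr(w, queries)
    for _ in range(sweeps):
        for j in range(n_features):
            base, best_val = w[j], w[j]
            for delta in steps:
                w[j] = base + delta
                score = mean_rr(w, queries)
                if score > best:
                    best, best_val = score, w[j]
            w[j] = best_val
    return w, best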

We modeled the task of selecting relevant photos to associate to textual documents as a learning to rank problem where we represent the associations between documents and photos as feature vectors containing multiple relevance estimators (e.g., the textual similarity between the paragraph and the tags plus the title from the photo), and used training data to learn a ranking model capable of sorting photos in relation to a given document in a way that optimizes the reciprocal rank (RR) evaluation metric.

Previous authors have argued that some evaluation metrics are more informative than others, and thus the target metrics used in the optimization of learning to rank models may act as bottlenecks that summarize the training data. Yilmaz and Robertson challenged the assumption that retrieval systems should be designed to directly optimize a metric that is assumed to evaluate user satisfaction, showing that even if user satisfaction can be measured by a metric X, optimizing the model for a more informative metric Y may result in a better test performance according to X [27]. This is particularly true in cases when the number of features is large and when we only have access to a limited amount of training data. Our experiments confirmed this hypothesis, although we do not report those experiments here due to space constraints.

4. VALIDATION EXPERIMENTS

In order to validate the proposed methods, we created a collection with 900 photos downloaded from Flickr, containing geographical information and a sufficiently large textual description (i.e., more than 100 words and containing location names or points of interest). We used expressions frequently used in travelogues, such as monument or vacations, to filter the photos collected from Flickr. The collected photos were taken in a point contained in the bounding box corresponding to the geospatial footprint of one of the world's 18 most visited cities10 (i.e., a set of 50 photos was collected from each of the 18 cities, all containing large textual descriptions and associated to geospatial coordinates). Also, the considered photos were taken on dates ranging from 2000-01-01 to 2010-06-01. For each photo, the number of comments, the number of times it was considered as favorite by other users and the number of times other users accessed the photo were also collected.

In order to conduct the experiments, we needed a collection of documents with relevance judgments for photos, i.e., a correct photo associated to each document. The photo descriptions from Flickr, with the above characteristics, are fairly good examples of documents with relevance judgments, because the owner considered the photo as a relevant photo to be associated to the large textual description. So, for the purpose of our experiments, we considered the textual descriptions as representations of textual documents having the same characteristics as travelogues, and the photos from which the textual descriptions were taken as the relevant photos that should be automatically associated with basis on their titles, tags, geospatial coordinates and popularity information derived from comments and visits.

A prototype photo association system, implementing different configurations for the feature vectors and also using one of the considered learning to rank algorithms, was then used to process the documents, associating them to relevant photos. The configurations for the feature vectors are described in Section 3.2, and Section 3.3 detailed the considered learning to rank algorithms.

The 50 textual descriptions associated to each of the 18 cities were randomly split into two groups: we used 25 × 18 textual descriptions for building a training dataset for the learning to rank algorithms, and the other 25 × 18 textual descriptions for building a test dataset. In the case of the training dataset, and in an attempt to build a balanced collection with both relevant and non-relevant photos, we associated each textual description to a maximum of ten photos, namely the relevant photo to be associated to the text, a non-relevant photo containing the highest textual similarity according to the TF-IDF metric, a non-relevant photo containing the highest textual similarity according to the topic distributions metric, a non-relevant photo taken in a place closest to the mentioned locations, a non-relevant photo taken in the same semester, a non-relevant photo containing the highest sentimental polarity, a non-relevant photo containing the highest cluster textual similarity according to the TF-IDF metric, a non-relevant photo whose image cluster geographical location is closest to the mentioned locations, a non-relevant photo whose image cluster date falls in the same semester, a non-relevant photo whose image cluster has the highest sentimental polarity, and finally a photo selected randomly.

10http://en.wikipedia.org/wiki/Tourism

[Figure 1: Reciprocal Rank and Precision@1 for each method and city. One panel per city (Paris, London, New York, Barcelona, Madrid, Amsterdam, Shanghai, Bangkok, Istanbul, Hong Kong, Singapore, Kuala Lumpur, Moscow, Prague, Rome, Beijing, Vienna and Seoul), with bars showing the Mean Reciprocal Rank and the Mean Precision@1, on a 0.0 to 1.0 scale, for configurations T1, RB2-RB7 and CA2-CA7.]

With the results for each textual description, obtained by considering all seven possible feature vector configurations and the two different learning to rank algorithms, we used the trec_eval tool to evaluate the matchings between photos and documents. Figure 1 presents the results obtained in terms of Precision at cut-off position 1 (Precision@1) and in terms of the Mean Reciprocal Rank, separately for each of the considered cities. The horizontal lines represent the mean value of Reciprocal Rank, in red, and the mean value of Precision@1, in blue, over all the cities and with the best configuration. In all the charts, the fully colored red bars represent the value of Mean Reciprocal Rank, while the shaded blue bars represent the value of Precision@1.

[Figure 2: Histograms with the number of words (0 to 1000) and the number of places (0 to 20) for each document.]

The graphics show that method T5, using either the Coordinate Ascent or the RankBoost algorithm, outperforms the other methods in all the cities. These results suggest that the usage of multiple features (e.g., geographical proximity and temporal cohesion) combined with the textual similarity is better than the usage of the textual similarity alone.

Table 1 summarizes the obtained results averaged across all the considered cities, showing a higher performance of the CA algorithm in method T5, either in terms of Reciprocal Rank or Precision@1.

[Figure 3: Variations in Precision@1 and Reciprocal Rank in terms of the number of words (100 to 1000) and the number of places (1 to 18) mentioned in the documents.]

Figure 4 illustrates the obtained results for two example textual descriptions, presenting the top-3 most relevant photos as returned by the best performing method (i.e., T5 with the CA algorithm), together with their tags from Flickr. Notice that for a text referencing France, the top-3 most relevant photos returned are indeed related to France.

Figure 2 presents the number of documents, in the entire collection of test and training documents, for each possible number of words, and the number of documents mentioning different numbers of places. In the collection, there is a higher number of documents with 100 to 200 words. Also, the number of recognized places is frequently low, with most of the documents containing 1 to 5 places.

Figure 3 illustrates the relationships existing between the values of Precision@1 and Reciprocal Rank and the number of words and the number of places, when considering the combination method that had the best results. These results suggest that neither a higher number of words nor a higher number of places improves the results, in terms of either Precision@1 or Reciprocal Rank. The higher values of Reciprocal Rank and Precision@1 for documents with 900 and 1000 words can be explained by the small number of documents with these word counts. Also, some of the higher values of Reciprocal Rank and Precision@1 for documents with more than 10 places can be explained by the small number of documents mentioning that many places.

[Figure 4: The top three most relevant photos returned for an example document (photo ids 3265571095, 2226067358 and 2728788074), shown together with their Flickr tags. The example text reads: "The Louvre Pyramid is a large glass and metal pyramid, surrounded by three smaller pyramids, in the main courtyard of the Louvre Palace in Paris. The large pyramid serves as the main entrance to the Louvre Museum. Completed in 1989, it has become a landmark for the city of Paris. The construction of the pyramid triggered considerable controversy because many people felt that the futuristic edifice looked quite out of place in front of the Louvre Museum with its classical architecture."]

Method   Combination Algorithm   P@1 Avg   P@1 StDev   RR Avg   RR StDev
T1       -                       0.44      0.07        0.54     0.06
T2       RB                      0.44      0.14        0.59     0.12
T2       CA                      0.50      0.12        0.62     0.09
T3       RB                      0.44      0.14        0.59     0.12
T3       CA                      0.50      0.12        0.64     0.10
T4       RB                      0.50      0.17        0.66     0.13
T4       CA                      0.58      0.14        0.70     0.10
T5       RB                      0.80      0.17        0.86     0.11
T5       CA                      0.76      0.11        0.87     0.08
T6       RB                      0.72      0.18        0.83     0.12
T6       CA                      0.72      0.12        0.83     0.09
T7       RB                      0.70      0.16        0.83     0.11
T7       CA                      0.68      0.17        0.80     0.13

Table 1: The obtained results.

5. CONCLUSIONS AND FUTURE WORK

In this paper, we described methods for the automatic association of photos to textual documents. The described methods are based on a pipeline of four steps, in which geographic references are first extracted from documents, then photos matching the geographic references are collected using Flickr's API, afterwards photos are clustered with basis on visual similarity, and finally the best photos are selected with basis on their similarity and interestingness. Different methods to select relevant photos were compared, and a method based on the combination of term-level textual similarity, topical similarity, geographic proximity and temporal cohesion, using the Coordinate Ascent learning to rank approach for performing the combination, obtained the best results.

Despite the good results from our initial experiments, there are also many challenges for future work. From our point of view, the major challenge is to improve the evaluation mechanism. The evaluation should be done with a collection of static photos, with relevance judgments clearly established by humans. The Content-based Photo Image Retrieval (CoPhIR) collection, described in [3] and built from 106 million photos from Flickr, as well as the MIRFlickr collections for visual concept detection [12, 13], could be important baselines for the development of a more adequate test collection for our work.

A particularly interesting idea for future work concerns using spatial outlier detection approaches, or even place reference classification methods such as those proposed by Piotrowski et al. for mountaineering accounts [24], in order to filter from the textual documents those place references that are only marginally relevant. Besides filtering place references, it would also be interesting to experiment with other features (e.g., features derived from image classifiers trained for detecting specific objects or concepts) or with refined versions of the features that were already considered (e.g., better methods for estimating the sentimental polarity, or better methods for measuring image similarity when clustering images).

Finally, we also believe that it would be interesting to experiment with unsupervised learning to rank methods for combining the different relevance estimators. Recent works in the area of information retrieval have described several advanced unsupervised learning to rank methods, capable of outperforming the popular CombSUM and CombMNZ approaches [8]. In IR, rank aggregation is currently a very hot topic of research and, for future work, we would for instance like to experiment with the ULARA algorithm, which was recently proposed by Klementiev et al. [15].

References

[1] E. Amitay, N. Har'El, R. Sivan, and A. Soffer. Web-a-where: Geotagging web content. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004.

[2] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 2003.

[3] P. Bolettieri, A. Esuli, F. Falchi, C. Lucchese, R. Perego, T. Piccioli, and F. Rabitti. CoPhIR: A test collection for content-based image retrieval. Technical report, Institute of Information Science and Technologies, National Research Council, Pisa, Italy, 2009.

[4] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, 2005.

[5] F. Coelho and C. Ribeiro. Automatic illustration with cross-media retrieval in large-scale collections. In Proceedings of the 9th International Workshop on Content-Based Multimedia Indexing, 2011.

[6] D. J. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg. Mapping the world's photos. In Proceedings of the 18th International Conference on World Wide Web, 2009.

[7] K. Deschacht and M. Moens. Finding the best picture: Cross-media retrieval of content. In Proceedings of the 30th European Conference on Information Retrieval, 2008.

[8] E. A. Fox and J. A. Shaw. Combination of multiple searches. In Proceedings of the 2nd Text Retrieval Conference, 1994.

[9] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, December 2003.

[10] Q. Hao, R. Cai, X. Wang, J. Yang, Y. Pang, and L. Zhang. Generating location overviews with images and tags by mining user-generated travelogues. In Proceedings of the 17th ACM International Conference on Multimedia, 2009.

[11] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.

[12] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, 2008.

[13] M. J. Huiskes, B. Thomee, and M. S. Lew. New trends and ideas in visual concept detection: The MIR Flickr retrieval evaluation initiative. In Proceedings of the International Conference on Multimedia Information Retrieval, 2010.

[14] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the 9th ACM Conference on Knowledge Discovery and Data Mining, 2002.

[15] A. Klementiev, D. Roth, K. Small, and I. Titov. Unsupervised rank aggregation with domain-specific expertise. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, 2009.

[16] J. Leidner. Toponym Resolution in Text: Annotation, Evaluation and Applications of Spatial Grounding of Place Names. PhD thesis, Institute for Communicating and Collaborative Systems, School of Informatics, University of Edinburgh, 2007.

[17] H. Li. Learning to Rank for Information Retrieval and Natural Language Processing. Morgan & Claypool Publishers, 2011.

[18] P. Li, C. Burges, and Q. Wu. Learning to rank using multiple classification and gradient boosting. In Proceedings of the 21st Conference on Advances in Neural Information Processing Systems, 2008.

[19] T. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 2009.

[20] X. Lu, Y. Pang, Q. Hao, and L. Zhang. Visualizing textual travelogue with location-relevant images. In Proceedings of the 2009 International Workshop on Location Based Social Networks, 2009.

[21] B. Martins, I. Anastacio, and P. Calado. A machine learning approach for resolving place references in text. In Proceedings of the 13th AGILE International Conference on Geographic Information Science, 2010.

[22] B. Martins and P. Calado. Learning to rank for geographic information retrieval. In Proceedings of the 6th ACM Workshop on Geographic Information Retrieval, 2010.

[23] D. Metzler and W. B. Croft. Linear feature-based models for information retrieval. Information Retrieval, 10, 2007.

[24] M. Piotrowski, S. Liubli, and M. Volk. Towards mapping of alpine route descriptions. In Proceedings of the 6th ACM Workshop on Geographic Information Retrieval, 2010.

[25] S. Ray and R. H. Turi. Determination of number of clusters in k-means clustering and application in colour image segmentation. In Proceedings of the 4th International Conference on Advances in Pattern Recognition and Digital Techniques, 1999.

[26] J. Xu and H. Li. AdaRank: A boosting algorithm for information retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007.

[27] E. Yilmaz and S. Robertson. On the choice of effectiveness measures for learning to rank. Information Retrieval, 13, 2010.

[28] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In Proceedings of the 30th ACM SIGIR International Conference on Research and Development in Information Retrieval, 2007.