
Learning to Tag

Lei Wu∗
MOE-MS KeyLab of MCC, University of Science and Technology of China
[email protected]

Linjun Yang
Microsoft Research Asia, 49 Zhichun Road, Beijing 100190, China
[email protected]

Nenghai Yu
MOE-MS KeyLab of MCC, University of Science and Technology of China
[email protected]

Xian-Sheng Hua
Microsoft Research Asia, 49 Zhichun Road, Beijing 100190, China
[email protected]

ABSTRACT
Social tagging provides valuable and crucial information for large-scale web image retrieval. It is ontology-free and easy to obtain; however, irrelevant tags frequently appear, and users typically do not tag all semantic objects in the image, a problem also called semantic loss. To reduce noise and compensate for the semantic loss, tag recommendation has been proposed in the literature. However, current recommendation methods simply rank the related tags based on the single modality of tag co-occurrence over the whole dataset, ignoring other modalities such as visual correlation. This paper proposes a multi-modality recommendation based on both tag and visual correlation, and formulates tag recommendation as a learning problem. Each modality is used to generate a ranking feature, and the RankBoost algorithm is applied to learn an optimal combination of these ranking features from different modalities. Experiments on Flickr data demonstrate the effectiveness of this learning-based multi-modality recommendation strategy.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - indexing methods; H.2.8 [Database Applications]: Image databases

General Terms
Algorithms, Theory, Experimentation

Keywords
Tag recommendation; learning to tag; multi-modality RankBoost; social tagging

1. INTRODUCTION
With the advance of Web 2.0 technology, multimedia content creation and distribution are much easier than ever before [6].

∗This work was performed when Lei Wu was visiting Microsoft Research Asia as a research intern.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2009, April 20–24, 2009, Madrid, Spain.
ACM 978-1-60558-487-4/09/04.

Along with the proliferation of images on the World Wide Web, effective image search approaches for obtaining targeted images have become an urgent demand. Currently, the performance of web image search depends mainly on the quality of image annotations, or keywords (tags). Some methods automatically generate this metadata by analyzing the image content or the surrounding text on webpages, while others generate the textual metadata by manual tagging. Most recently, social tagging has become a popular means of annotating web images.

Although the automatic creation of metadata costs little human effort, the results of these statistical-model-based automatic methods are generally unsatisfying [14][1], especially on web images, which are quite noisy. To improve the performance of automatic annotation, some approaches combine image content analysis with the surrounding text on the image's webpages, e.g., [11][20][16][19]. These methods obtain some improvement over purely content-based methods, but they are still unacceptable for practical use. Manual metadata generation is relatively more accurate and practical than automatic annotation.

Manual metadata generation is mainly based on ontology-based labeling, which first defines an ontology and then lets users label web resources using the semantic markups in the ontology. There is also some work on mitigating the manual labeling effort through semi-automatic annotation [5]. Although ontology-based annotation has been successful in some applications, e.g., bioinformatics and knowledge management, it has several limitations. First, building a semantic ontology that covers sufficient descriptions of multimedia content is expensive, time consuming, and often requires domain knowledge [15]. Second, ontology-based annotation usually requires users to be familiar with the ontology, which is too complicated for anyone without specialized training and knowledge.

Recently, a promising approach to manual metadata generation is social tagging, which asks all the users in a social network to label web resources with their own keywords and share them with others. This labeling operation is called "tagging". Different from ontology-based annotation, there is no pre-defined ontology or taxonomy in social tagging, so the task is more convenient for ordinary users. Social tagging has attracted a huge number of web users and effectively helps the organization of web resources. This strategy is adopted by some famous websites (e.g., Delicious, Flickr). This organic system of organization is also called "folksonomy".


Figure 1: Tag distribution over a collection of 640 million images from Flickr.com, containing 1,300 million tags in total. Around 1% of the tags appear more than 20,000 times; these carry little information. Around 5.82% of the tags appear more than 5,000 times in the collection and are considered popular tags. 33.21% of the tags appear between 50 and 5,000 times and are defined as specific tags. 60% of the tags appear fewer than 50 times.


Although social tagging is easy to perform, it also has some drawbacks. First, it suffers from the polysemy and synonymy problems. As users can use their own words to tag images, different users may tag similar images with different words. So when querying "sea", one may not find images tagged "ocean", which represents the same concept. On the other hand, it is difficult for users to input all the tags with equivalent meanings, so many images may not be effectively retrieved. Second, ambiguity is also a problem. Users may use a general tag to represent different things. For example, when an image is tagged "apple", it may refer to the fruit, the corporation, or the product. In general, it is quite difficult for web users to realize the existence of ambiguity when tagging if they do not think of, or even know, the other meanings of the word. With these ambiguous tags, many irrelevant images may be retrieved.

To tackle the above problems, some researchers have proposed query expansion and suggestion [9][23], which extend the query with related words to make the intention clearer. However, this does not completely eliminate the synonymy and tag ambiguity problems. The information in the query is limited, and query expansion frequently cannot compensate for the semantic loss in the tagging process, where users may ignore some semantic objects in the images. Recently, Li et al. [10] proposed a neighbor voting algorithm for image retrieval, which tries to predict the relevance of user-contributed tags; however, the similarity between individual images is itself an open and complex problem. In this paper, we propose to tackle the semantic loss problem during the tagging process by combining visual correlation at the concept level with tag co-occurrence information. The semantically or visually related tags are recommended to the users to improve the tagging quality.

The recommendation system reminds the users of alternative tags and can also help clarify the true semantics of the images. For example, when the user tags an image with the word "sea", the recommendation system will list richer and more precise tags based on the input tags, such as "ocean", "water", "wave", etc. These recommendations help users clarify the image content as well as remind them of related semantics that might otherwise be ignored.

The quality of tag recommendation is quite important to social tagging and the consequent performance of image search. First, high-quality tag recommendation will motivate users to contribute more useful tags to an image [13]. The average number of tags per image on Flickr is relatively small [2]. One reason users do not provide many tags is that they generally cannot think of many words [17] in a short moment, and few people would like to spend much time thinking about alternative or more precise tags. With the help of high-quality tag recommendation, users can provide many useful tags in a short time, and spelling errors can be effectively avoided. Thus the average number of correct tags per image is expected to increase. Second, tag recommendation will remind users of richer and more specific tags. The distribution of tags on Flickr follows a power law (Figure 1). Most users only use the popular keywords, which make up only 5.82% of the whole tag collection. These tags are popular because they are common vocabulary and easily come to mind. Another 33.21% of the tags, which appear 50-5,000 times, are also informative but generally ignored by most users, because these words are more professional terms or are only used for specific objects or situations. Tag recommendation will help remind the user to use both popular and specific tags for social tagging, which also helps create more precise tags. Third, tag recommendation can suppress the noise in the social tagging system. The tag distribution shows that around 60% of the tags in the corpus are misspelled or meaningless words. With the help of tag recommendation, users can tag an image by choosing rather than typing, which effectively avoids these spelling errors.

Existing tag recommendation approaches rank the related tags based on tag co-occurrence information. Much information is ignored in these methods, such as the visual correlation between tags and the image content. A better choice is to use correlation from multiple modalities, such as tag co-occurrence, correlation between tag-related images, the content of the target image, etc. However, it is not easy to combine these multi-modality correlations, since the modalities should be weighted differently for different samples. The basic idea of this paper is to learn an optimal combination of the multi-modality correlations to generate a ranking function for tag recommendation. Given the image and one or more initial tags, the algorithm ranks the remaining tags based on the tag correlation from each modality. Each correlation is taken as a weak ranker, and RankBoost [7] is adopted to combine the weak rankers into a better ranking function. Users can click the tags on the ranking list to annotate the image; after each click, the algorithm updates the ranking function as well as the tag recommendation. Since the recommendation is based on the multi-modality correlations and depends on the ever-increasing tags in the database, it is as if the users were tagging with a selected ontology. The proposed method thus regularizes the folksonomy with this dynamic ontology and makes the tagging converge to the underlying taxonomy.


Figure 2: The flowchart of the social tagging recommendation system. Tags and images from social tagging feed a tag inverted table (tag co-occurrence), a visual language model (visual correlation), and a tag + content branch, which RankBoost combines into the recommendation.


The rest of the paper is organized as follows. Section 2 briefly introduces related work on tag recommendation. Section 3 gives an overview of the recommendation framework. Section 4 discusses the tag co-occurrence measurement. Section 5 discusses the tag content correlation measurement. Section 6 elaborates on the image-conditioned tag correlation. The hybrid-information-based RankBoost algorithm is discussed in Section 7. Experimental results and discussion are given in Section 8. Section 9 concludes the paper.

2. RELATED WORK
Recently, tag recommendation based on collective knowledge [18] has been proposed. The authors measured the similarity between tags by their co-occurrence in the data collection and used the most similar tags as recommendations. This work achieved some exciting results on Flickr data; however, this kind of similarity measurement is greatly influenced by synonymous and polysemous tags. For example, the similarity between "player" and "football" may be underestimated, since the co-occurrence of "player" and "soccer" should also be taken into account when calculating the similarity between "player" and "football". In the other direction, the similarity between "apple" and "ipod" is overestimated, since "apple" here only indicates the corporation, and the cases where it represents the fruit should not be taken into account. These problems can be better handled if the multi-modality visual similarity between the tags is used.

3. OVERVIEW OF THE TAG RECOMMENDATION FRAMEWORK
Given an image and one or several initial tags, we would like to recommend more tags which have a semantic or visual correlation to the image. We rank these tags by their correlation to the target image and list the top N keywords as recommendations for further tagging.

We use a combination of three kinds of correlation to rank the tags: tag co-occurrence, tag visual correlation, and image-conditioned tag correlation. Web users are very likely to tag images with semantically related words, like "ipod" together with "apple". Based on this assumption, this semantic relationship can be partly captured by tag co-occurrence on a large online photo sharing website with a great number of independent users. However, co-occurrence does not capture all relations between tags, such as the "tyre" in a photo of a car, or the "eye" in a photo of a face: photos containing both a tyre and a car may be tagged "car" only, and the same holds for "eye" and "face". To tackle these problems, the visual correlation of the tags is applied. We build a visual language model (VLM) [22] for each tag and then use the inverse of the distance between these visual language models to measure the tag visual similarity. These two kinds of correlation only use the relation between tags and ignore the content of the target image. Furthermore, the image-conditioned tag correlation is proposed to capture the tag similarity with respect to the target image.

We also formulate the recommendation as a learning-to-rank problem and combine these three kinds of correlation to generate the ranking. Since the different types of correlation are independent measurements, it does not make sense to combine them linearly. In this paper, we consider these different correlations as different ranking features and combine them in the RankBoost framework, which uses the order of instances rather than absolute distances. The flowchart of the system is shown in Fig. 2.

4. TAG CO-OCCURRENCE (TC)
Concept co-occurrence in daily life contains useful information for measuring similarity in the semantic domain. The semantics of concepts are related to human cognition, and since roughly 80% of human cognition is formed from visual information in daily life, the co-occurrence of concepts in daily life contributes greatly to their semantics. For example, "monkey" is semantically related to "tree" because we often see monkeys living in trees; this visual co-occurrence information contributes to forming the "monkey - tree" semantic relationship. The same holds for many other semantic relationships, such as "flower - fruit", "fish - sea", "football - soccer", "bird - sky", etc.

Tag co-occurrence (TC) on Flickr can partially capture such conceptual relationships in daily life. We assume that if two tags are frequently assigned to the same image, the corresponding concepts also have a high probability of co-occurring in daily life. This assumption makes sense for the following reasons. First, the Flickr dataset contains a huge number of daily-life photos generated by individual users; it is as if hundreds of millions of cameras were capturing object co-occurrence in daily life. Second, users have good reason to label images according to their content: studies [2] show that users' motivation for tagging is social incentive. In other words, users would like to make themselves known by contributing to the tagging task, and are therefore more likely to tag images truly according to their content rather than introduce noise.

The calculation of tag co-occurrence on Flickr has already been investigated in recent work [18]. Here we adopt a similar method to calculate the tag co-occurrence over a large dataset of 6 million images from Flickr, which is sufficiently large for generating reliable co-occurrence statistics. Depending on the application, the tag relevance is normalized into asymmetric and symmetric forms, briefly presented as follows.

Asymmetric relevance measure. The relevance between two tags is defined as

    R^a_{tag}(t_i, t_j) = \frac{|t_i \cap t_j|}{|t_i|}    (1)

where t_i and t_j are any two tags in the database and |t_i| denotes the number of times tag t_i appears in the database. This relevance measure is asymmetric, which makes sense: the probability of tag t_j given tag t_i need not equal the probability of tag t_i given tag t_j.

Symmetric relevance measure. Although the asymmetric relevance measure makes sense for tag recommendation, in some cases the symmetric measure is more convenient:

    R^s_{tag}(t_i, t_j) = \frac{|t_i \cap t_j|}{|t_i \cup t_j|}    (2)
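For concreteness, both measures can be computed directly from an inverted index mapping each tag to the set of images carrying it. The following is a minimal sketch of this computation (our illustration, not the authors' implementation; the data layout and names are assumed):

from collections import defaultdict

def build_tag_index(image_tags):
    """image_tags: dict mapping image id -> set of tags.
    Returns an inverted index: tag -> set of image ids."""
    index = defaultdict(set)
    for image_id, tags in image_tags.items():
        for tag in tags:
            index[tag].add(image_id)
    return index

def r_asym(index, ti, tj):
    """Asymmetric relevance R^a_tag(ti, tj) = |ti intersect tj| / |ti| (Eq. 1)."""
    if not index[ti]:
        return 0.0
    return len(index[ti] & index[tj]) / len(index[ti])

def r_sym(index, ti, tj):
    """Symmetric relevance R^s_tag(ti, tj) = |ti intersect tj| / |ti union tj| (Eq. 2)."""
    union = index[ti] | index[tj]
    return len(index[ti] & index[tj]) / len(union) if union else 0.0

# Toy usage with hypothetical data:
images = {1: {"sea", "wave"}, 2: {"sea", "ocean"}, 3: {"sea", "wave", "beach"}}
idx = build_tag_index(images)
print(r_asym(idx, "wave", "sea"))  # 1.0: every image tagged "wave" also has "sea"
print(r_asym(idx, "sea", "wave"))  # 0.667: only two of three "sea" images have "wave"

The two asymmetric values differ, which is exactly the asymmetry motivated above.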

5. TAG CONTENT CORRELATION (TCC)
Tag co-occurrence ignores some kinds of correlation. For example, if a user has tagged an image with "football", he or she may not also tag it with "soccer"; in other words, the correlation between synonymous tags is not well estimated by tag co-occurrence. In other cases, like the previously mentioned "tyre - car" and "eye - face" pairs, users usually ignore some appearing concepts unconsciously. This drawback can be compensated by calculating the tag content correlation (TCC).

To represent the content correlation of the tags, a visual language model is adopted to model each tag. For the i-th tag t_i, we collect a set of images marked with the tag, build a VLM on these images to represent the content of the tag, and use the differences between these quantified models to measure the tag visual correlation.

5.1 Data Source
In the multimedia domain, the content of images generally carries the semantic information of the tag, so we use images as the tag content. Since a single image contains insufficient information to describe a tag, we collect a set of images to represent each tag, aiming to generate a semantic representation of the tags from these image sets. In order to sample sufficient images to describe the tags, we generated a large image pool consisting of 1,000,000 images related to more than 50,000 popular tags by random walk sampling [4] from the popular photo sharing website Flickr.

5.2 Visual Representation
The visual language model (VLM) [22] is adopted to model the content of the tags in the visual domain. This type of model can be generated very fast, which is appropriate for large-scale datasets. It captures not only the frequency of the visual features related to the tag but also the spatial relationship between neighboring features; this additional spatial information makes the model more discriminative in characterizing different tags.

The generation of the VLM is briefly described as follows. Each image is first divided into uniformly distributed, equal-sized patches. Then local appearance features, such as texture histograms, color moments, etc., are generated for each patch. To suppress noise and simplify the subsequent training process, these local appearance features are usually coded into visual words by k-means clustering or a hash coding method. Afterwards, the VLM assumes that there are visual grammar constraints on the arrangement of these visual words. According to the different constraints, the VLM can be divided into unigram, bigram and trigram models. The unigram model assumes that the visual words are independent of each other; its output is the conditional distribution of the visual words given the tag. The bigram model assumes that each visual word depends on one of its neighboring words (usually the left neighbor); its output is the conditional distribution of the visual words given both the tag and that neighboring word. Accordingly, the trigram model assumes that each visual word depends on two of its neighboring words, so its output is the conditional distribution given the tag and two neighboring words. The dependency assumptions can be described by the following equations.

    P(x) = \prod_{i,j} P(w_{ij})    (3)

    P(x) = \prod_{i=1}^{n-1} P(w_{i0}) \prod_{j=1}^{n-1} P(w_{ij} \mid w_{i,j-1})    (4)

    P(x) = P(w_{00}) \prod_{j=1}^{n-1} P(w_{0j} \mid w_{0,j-1}) \prod_{i=1}^{n-1} P(w_{i0} \mid w_{i-1,0}) \prod_{i,j=1}^{n-1} P(w_{ij} \mid w_{i-1,j}, w_{i,j-1})    (5)

where x is an image and w_{ij} is the visual word of the (i, j)-th patch in the image. The back-off and smoothing methods commonly used in statistical language modeling are also adopted to estimate the conditional distributions P(w_{ij}), P(w_{ij} | w_{i,j-1}), and P(w_{i0} | w_{i-1,0}). More details of this modeling method are discussed in previous work [22].

For the unigram model, the VLM is the visual word distribution for each tag: P(w_i), w_i ∈ V, i = 1, ..., |V|, where V is the vocabulary of size |V|. For the bigram and trigram models, the VLM is the conditional distribution of visual words, P(w_i | w_j), w_i, w_j ∈ V, and P(w_i | w_j, w_k), w_i, w_j, w_k ∈ V, respectively. In summary, in all types of VLM, the content representation of a tag is a conditional distribution.
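To make the unigram case concrete, the sketch below estimates P(w | tag) from the visual-word grids of a tag's image set with add-alpha smoothing. Patch extraction and visual-word quantization are abstracted away, and all names are our assumptions rather than the paper's code:

import numpy as np

def unigram_vlm(word_grids, vocab_size, alpha=1.0):
    """Estimate the unigram VLM P(w | tag) from the visual-word grids of
    all images collected for one tag.
    word_grids: list of 2-D integer arrays, each cell holding the visual-word
    id (in [0, vocab_size)) of one image patch. alpha: add-alpha smoothing."""
    counts = np.full(vocab_size, alpha)           # smoothing pseudo-counts
    for grid in word_grids:
        ids, freq = np.unique(grid, return_counts=True)
        counts[ids] += freq
    return counts / counts.sum()                  # distribution over the vocabulary

# Hypothetical usage: two 8x8 visual-word grids for one tag, vocabulary of 500 words.
rng = np.random.default_rng(0)
grids = [rng.integers(0, 500, size=(8, 8)) for _ in range(2)]
p_w = unigram_vlm(grids, vocab_size=500)
assert abs(p_w.sum() - 1.0) < 1e-9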

5.3 Similarity Measurement
Similarity measurement is critical in many applications [24]. This is also true in the tag recommendation system, whose core is the measurement of the visual relationship between tags; this similarity measurement directly influences the recommendation results. The direct measurement of the tag relationship is hard to achieve. As a result, we represent the tags with their related images via context modeling and dictionary learning, which have been successful in object category recognition [3][25][12]. The tag similarity is measured by the differences between the VLMs of the tags, which is easily calculated using a generalized Flickr Distance [21]. If two tags are semantically correlated, their images are more likely to share some objects or scenes. Since the VLM captures the visual statistics of a tag and each VLM is a conditional distribution, we can calculate the square root of the JS divergence between these distributions to measure the visual distance among the tags. Although there are other kinds of distance measurements [8], this measurement is simple and effective.


The content correlation is divided into symmetric and asymmetric measurements. The symmetric measurement considers the two tags equally important. However, sometimes the two tags should not be treated equally: given the tag "apple tree", the concept "apple" is more likely to appear than an "apple tree" is given the tag "apple", since "apple" has many other meanings and may occur in many contexts. Thus the asymmetric relevance measurement makes sense. In general, the two kinds of measurements differ only slightly, and since the symmetric relevance halves the computational burden, it is widely used in common cases.

Asymmetric relevance measure.

    D^a_{TCC}(t_i, t_j) = KL(L(t_i) \,\|\, L(t_j))    (6)

    R^a_{TCC}(t_i, t_j) = \frac{1}{D^a_{TCC}(t_i, t_j)}    (7)

where L(t_i) denotes the VLM for tag t_i and KL(·||·) is the KL divergence between two visual language models. D^a_{TCC} is defined as the asymmetric tag content distance.

Symmetric relevance measure.

    D^s_{TCC}(t_i, t_j) = \frac{1}{2}\big[ KL(L(t_i) \,\|\, M) + KL(L(t_j) \,\|\, M) \big]    (8)

    M = \frac{1}{2}\big[ L(t_i) + L(t_j) \big]    (9)

    R^s_{TCC}(t_i, t_j) = \frac{1}{D^s_{TCC}(t_i, t_j)}    (10)

where M is the average language model of the two tags and D^s_{TCC} is the symmetric content distance between them. In fact, considering the VLM as a probability distribution, this distance equals the well-known Jensen-Shannon divergence, whose square root has been shown to be a strict metric. The symmetric relevance measure R^s_{TCC}(t_i, t_j) between the content of the tags is defined as the reciprocal of their symmetric visual distance.

6. IMAGE-CONDITIONED TAG CORRELATION MEASUREMENT (ITC)

Although the tag visual similarity uses the visual information of the images related to each tag to calculate the relevance, it does not use information about the test image. We believe that the similarity measurement should differ for different test images. For example, the tags "apple" and "pear" are highly related; however, given a test image of an ipod, "pear" is not a proper recommendation. To exploit this kind of conditional correlation, the image-conditioned tag correlation (ITC) measurement is proposed.

We assume that if two tags are similar to each other, their likelihoods of generating the target image should also be similar. We represent each tag by the likelihoods of generating the target image under its unigram, bigram and trigram models respectively; these three likelihoods can be taken as the location of the tag in a likelihood space.

Let x be a novel test image (for the user to tag), and let x_k, k = 1, ..., L be the L images related to a tag t ∈ T, where T is the tag corpus. First, we measure the similarity between the target image x and all the tag-associated images. To achieve this, the target image x is also represented as the matrix of visual words x = [w_{ij}]_{i,j=1,...,n}, as discussed in Section 5. Then the similarity is defined by the likelihood L.

Figure 3: The target-image-conditioned tag distance measurement. Given different target images x_1, x_2, the distance between tags t_i, t_j is different.

Figure 4: The geometric interpretation of the asymmetric relevance measurement.

    L^m_t(x) \propto P(x \mid VLM^m_t)    (11)

Applying the visual context constraints, the similarity measurement can be further represented in unigram, bigram and trigram forms:

    L^m_t(x) \propto \prod_{i,j} P(w_{ij} \mid VLM^m_t)    (12)

    L^m_t(x) \propto \prod_{i=1}^{n-1} P(w_{i0} \mid VLM^m_t) \prod_{j=1}^{n-1} P(w_{ij} \mid w_{i,j-1}, VLM^m_t)    (13)

    L^m_t(x) \propto P(w_{00} \mid VLM^m_t) \prod_{j=1}^{n-1} P(w_{0j} \mid w_{0,j-1}, VLM^m_t) \prod_{i=1}^{n-1} P(w_{i0} \mid w_{i-1,0}, VLM^m_t) \prod_{i,j=1}^{n-1} P(w_{ij} \mid w_{i-1,j}, w_{i,j-1}, VLM^m_t)    (14)

where VLM^m_t is the m-gram visual language model of the t-th tag. The m-gram used here can be the unigram, bigram, or trigram model; for simplicity, we index the three likelihood functions by m = 1, 2, 3 respectively, where m = 1 corresponds to the unigram model, m = 2 to the bigram model, and m = 3 to the trigram model.

Given the image x, each tag t is represented by three likelihood values corresponding to the unigram, bigram and trigram models respectively, as shown in Fig. 3. The distance between two tags t_i and t_j can then be defined in this likelihood space via the inner product of the two likelihood vectors. This distance also comes in symmetric and asymmetric forms, formulated in Eqs. (15) and (17) respectively.


    D^s_{ITC}(t_i, t_j, x) = \frac{L_{t_i}(x) \cdot L_{t_j}(x)}{\|L_{t_i}(x)\| \, \|L_{t_j}(x)\|}    (15)

    L_{t_i}(x) = [\, L^1_{t_i}, L^2_{t_i}, L^3_{t_i} \,]    (16)

and the asymmetric distance is defined as

    D^a_{ITC}(t_i, t_j, x) = \left\| \frac{L_{t_i}(x) \cdot L_{t_j}(x)}{\|L_{t_i}(x)\|} - L_{t_i}(x) \right\|    (17)

where t_i is usually an initial tag related to the target image and t_j is the novel tag.

Correspondingly, the similarity between tags given the target image is the inverse of the distance:

    R^s_{ITC}(t_i, t_j, x) = \frac{1}{D^s_{ITC}(t_i, t_j, x)}    (18)

    R^a_{ITC}(t_i, t_j, x) = \frac{1}{D^a_{ITC}(t_i, t_j, x)}    (19)

The asymmetric distance measurement (Eq. (17)) can also be interpreted from a relevance-decomposition view, illustrated in Fig. 4. In the likelihood space, the novel tag t_j is represented as a 3-dimensional vector, which is decomposed into two components: the relevant component t_{jh}, parallel to the initial tag t_i, and the irrelevant component t_{jv}, perpendicular to t_i. The correlation is inversely related to the distance between the relevant component t_{jh} and the initial tag t_i.

In the target-image-conditioned tag similarity measurement, the relevance to both the initial tag and the target image is considered, which makes sense for image tagging recommendation.
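As an illustration, the sketch below represents each tag by its three m-gram likelihoods of generating the target image (Eq. 16) and compares two tags through the cosine form of Eq. (15) and its reciprocal (Eq. 18). The likelihood functions themselves are assumed given (e.g., computed from the VLMs of Section 5); all names are ours:

import numpy as np

def tag_likelihood_vector(x, vlm_likelihoods):
    """Locate tag t in the 3-D likelihood space (Eq. 16):
    vlm_likelihoods is a list of three callables returning the unigram,
    bigram and trigram likelihoods L^m_t(x) of the target image x;
    their implementation (Eqs. 12-14) is assumed available."""
    return np.array([lik(x) for lik in vlm_likelihoods])

def d_itc_sym(l_ti, l_tj):
    """Symmetric image-conditioned measure (Eq. 15): the cosine of the
    angle between the two likelihood vectors."""
    return float(l_ti @ l_tj / (np.linalg.norm(l_ti) * np.linalg.norm(l_tj)))

def r_itc_sym(l_ti, l_tj):
    """Image-conditioned relevance (Eq. 18): reciprocal of the measure."""
    return 1.0 / d_itc_sym(l_ti, l_tj)

# Toy usage: likelihood vectors of two tags for the same target image.
l_apple = np.array([0.8, 0.5, 0.3])
l_pear = np.array([0.2, 0.1, 0.05])
print(d_itc_sym(l_apple, l_pear), r_itc_sym(l_apple, l_pear))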

7. MULTI-DOMAIN RELEVANCE FOR TAG RECOMMENDATION
In this section, we discuss how to combine multi-domain relevance for tag recommendation (MRR). Here, multi-domain relevance refers to the tag co-occurrence, the tag content correlation, and the image-conditioned tag correlation. The basic idea is to produce an accurate ranking function by combining many "weak" learners. Unlike a traditional training procedure, these "weak" learners are trained on cross-domain relevance of the semantic targets.

7.1 Recommendation framework
Given an image and one or more tags, the task is to recommend more related tags for the image. The recommendation is performed by ranking based on three types of similarity: tag co-occurrence, tag-to-tag similarity, and image-conditioned tag similarity. Tag co-occurrence (TC) measures how likely two different tags are to be labeled together; since tags are generated by web users, their co-occurrence can reflect the tag relationship in human cognition. Tag-to-tag similarity (TTS) measures the correlation of the visual content of the tags, i.e., the tag content correlation measured through their VLMs (Section 5). Image-conditioned tag similarity (ITS) measures the similarity between the tags conditioned on the target image (Section 6).

These different types of correlation can be combined to produce a more robust recommender. However, as these similarity metrics are generated from different domains, a linear combination of them does not make sense. To tackle this multi-domain measurement problem, we combine the multi-domain metrics within the RankBoost framework to generate a reasonable combined recommender. Since the RankBoost algorithm only considers the relative order of the samples and does not actually use the distance metric, it can generate a fair ranking based on the multi-domain information.

7.2 RankBoost
In this framework, each sample tag is considered an instance, and all the tags in the dataset form the instance space X. For each weak ranker, we aim to generate a function f_i that maps an instance x_i from the instance space X to the preference space or ranking space R. These given rankings of the instances are called ranking features, which are used to train the weak rankers. In fact, a ranking feature is any kind of measurement indicating the relative order between two instances. The measure does not have to be a metric, since only the relative order is used and the distance is not: f_i(x_1) > f_i(x_2) means that x_1 ranks above x_2, while a comparison of differences such as f_i(x_1) − f_i(x_2) > f_i(x_2) − f_i(x_3) is meaningless. Because of this property, the ranking features can be generated from different kinds of information or from different domains; the only requirement is that they can determine the relative order between any two instances. In this paper, these ranking features are generated from both the textual domain and the visual domain, so the weak rankers are also called multi-domain weak rankers.

In order to learn the ranking, we define the ranking loss as follows.

    rloss_D(H) = \sum_{x_0, x_1} D(x_0, x_1) \, \delta(H(x_1) \le H(x_0))    (20)

    D(x_0, x_1) = c \cdot \max(0, \Phi(x_0, x_1))    (21)

where H is the weighted sum of all weak rankers, and δ(π) is 1 if π holds and 0 otherwise. Φ : X × X → R is the feedback function: if x_1 is more relevant to the tag than x_0, then Φ(x_0, x_1) > 0. This feedback function is usually generated from user interaction.

As demonstrated in [7], this ranking loss is upper-bounded by the product of the per-round normalizers Z_t,

    Z_t = \sum_{x_0, x_1} D_t(x_0, x_1) \exp\big( \alpha_t (h_t(x_0) - h_t(x_1)) \big)    (22)

where h_t is the output of the t-th weak ranker. The weak learner only needs to minimize this upper bound.

7.3 Learning to tag
Given an image and some of its initial user-created tags, we would like to recommend a list of related tags which may also be applicable to the image. We denote the set of initial tags as OT and the set of remaining tags as UT. The relevance of the tags is represented in two domains: the average tag co-occurrence on other Flickr images is deemed one domain, and the content correlation another. For these domains, we generate ranking features {f_l}_{l=1}^{3n} (n is the number of initial tags). The first n ranking features are generated based on tag co-occurrence, the next n by the tag content correlation (TTS), and the last n by the image-conditioned tag correlation (ITS), as given in Eqs. (23)-(25).


Algorithm 1 Cross-domain RankBoost training process

Input: initial tags t_1, ..., t_n \in OT; remaining tags t_1, ..., t_m \in UT; a distribution D over UT \times UT, where OT is the set of initial tags and UT is the set of remaining tags in the database.

Initialize D_1 = D.
Generate ranking features \{f_l\}_{l=1}^{3n}: for all t_i \in UT, t_l \in OT,
    f_l(t_i, t_l) = R^s_{TC}(t_i, t_l), l = 1, ..., n
    f_{n+l}(t_i, t_l) = R^s_{TCC}(t_i, t_l), l = 1, ..., n
    f_{2n+l}(t_i, t_l) = R^s_{ITC}(t_i, t_l), l = 1, ..., n
for k = 1, ..., K do
    • Select a pair (t_i, t_j) \in UT \times UT according to D_k.
    • Get the weak ranking h_k from the ranking features of the selected pairs.
    • Update \alpha_k = \frac{1}{2} \ln \frac{1+r}{1-r}, where r = \sum_{t_i, t_j} D_k(t_i, t_j) (h_k(t_i) - h_k(t_j)).
    • Update D_{k+1}(t_i, t_j) = D_k(t_i, t_j) \exp(\alpha_k (h_k(t_i) - h_k(t_j))) / Z_k, where Z_k = \sum_{t_i, t_j} D_k(t_i, t_j) \exp(\alpha_k (h_k(t_i) - h_k(t_j))) is a normalization factor.
end for
Output the final ranking: H(t) = \sum_{k=1}^{K} \alpha_k h_k(t).

    f_l(t_i, t_l) = R^s_{TC}(t_i, t_l), \quad t_l \in OT, \ t_i \in UT, \ l = 1, \dots, n    (23)

    f_{n+l}(t_i, t_l) = R^s_{TCC}(t_i, t_l), \quad t_l \in OT, \ t_i \in UT, \ l = 1, \dots, n    (24)

    f_{2n+l}(t_i, t_l) = R^s_{ITC}(t_i, t_l), \quad t_l \in OT, \ t_i \in UT, \ l = 1, \dots, n    (25)
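The following is a compact sketch of the training loop of Algorithm 1. For brevity we use each precomputed ranking feature directly as a weak ranking h_k and greedily pick the feature with the largest weighted agreement r, rather than the thresholded weak rankers of [7]; all data structures are our assumptions:

import numpy as np

def rankboost(features, d0, rounds=50):
    """features: array of shape (F, T); row f holds ranking-feature scores
    of the T candidate tags, scaled to [0, 1]. d0: array (T, T) with
    d0[i, j] the weight of the preference 'tag j should rank above tag i'.
    Returns the weak-ranker weights and the chosen feature indices."""
    dist = d0 / d0.sum()
    alphas, chosen = [], []
    for _ in range(rounds):
        # r_f = sum_{i,j} D(i, j) * (h_f(j) - h_f(i)) for every feature f.
        r_all = np.array([np.sum(dist * (h[None, :] - h[:, None])) for h in features])
        f = int(np.argmax(np.abs(r_all)))
        r = float(np.clip(r_all[f], -0.999, 0.999))
        alpha = 0.5 * np.log((1 + r) / (1 - r))
        h = features[f]
        # Reweight pairs: pairs the chosen ranker orders wrongly gain weight.
        dist = dist * np.exp(alpha * (h[:, None] - h[None, :]))
        dist /= dist.sum()
        alphas.append(alpha)
        chosen.append(f)
    return np.array(alphas), chosen

def final_ranking(features, alphas, chosen):
    """H(t) = sum_k alpha_k h_k(t)."""
    return sum(a * features[f] for a, f in zip(alphas, chosen))

Sorting the candidate tags by final_ranking in descending order and taking the top N yields the recommendation list.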

These ranking features from the two domains are combined in the RankBoost framework. Eq. (22) shows that the weak ranker depends only on the relative ordering of the samples rather than on the specific scores. Since we adopt similarity measurements from multiple domains, and the scores may represent different semantics, it does not make sense to use the scores directly. For these reasons, it is reasonable to adopt the RankBoost framework as described in [7].

In the training process, the algorithm generates random pairs over UT × UT according to the distribution D. Based on these random pairs, weak learners are trained using both the tag co-occurrence relevance and the content correlation between the tags in each pair. The distribution D is also updated to minimize the ranking loss: a high weight is assigned to a pair to indicate the importance of ordering that pair correctly.

We adopt the same type of weak ranker as [7]. In the recommendation process, the top N relevant tags according to the ranking result are recommended to the user for further tagging.

8. EXPERIMENT
In this section, we compare the proposed approach with the commonly used tag co-occurrence based recommendation approach on a real-world image tagging task.

8.1 Setting
We collected 1,000,000 images and 200,000 associated tags from Flickr as the database. Since these photos are uploaded by a large number of independent users, they cover diverse topics and reflect the real situation of web images. We randomly select 500 images for the manual tagging task; the remaining images with their tags are used to train the tag co-occurrence measurement as well as the content correlation measurement.

Some frequency filtering is adopted to remove noise from the tag co-occurrence statistics. In the real world, web users may make mistakes during tagging, and misspelled tags appear as low-frequency tags. To suppress this kind of noise, we adopt a simple and effective method: removing tags that appear fewer than 50 times in the dataset. There are also high-frequency stop words, such as "bravo", "image", "photo", etc. These tags contain little information, so we also treat them as irrelevant and suppress them by removing tags that appear more than 10,000 times in the dataset.
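This filtering amounts to a frequency band-pass over the tag vocabulary; a minimal sketch (our code, with a toy corpus standing in for the 6-million-image collection):

from collections import Counter

def filter_vocabulary(tag_counts, low=50, high=10000):
    """Keep tags whose corpus frequency lies in [low, high]: drop likely
    misspellings (too rare) and uninformative stop tags (too frequent)."""
    return {t for t, c in tag_counts.items() if low <= c <= high}

# Toy usage; `corpus` stands in for the per-image tag lists of the dataset.
corpus = [["sea", "wave"], ["sea", "bravo"], ["sea", "ocaen"]]
counts = Counter(tag for tags in corpus for tag in tags)
print(filter_vocabulary(counts, low=2, high=5))  # {'sea'} survives under toy thresholds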

To initialize the feedback function Φ for the RankBoost algorithm, we adopt a pseudo feedback function generated from WordNet, which contains human-created knowledge about word correlations. Using WordNet for initialization is less expensive than human labeling. However, since many tags fall outside the scope of WordNet, we can only generate the feedback function among the tags within the WordNet corpus; to extend the coverage of the feedback function, human interaction is unavoidable. Given some initial tags, if tag x_1 gets a higher similarity score to the initial tags than tag x_0 in WordNet, the feedback function Φ(x_0, x_1) > 0; otherwise Φ(x_0, x_1) < 0.
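One plausible realization of this pseudo feedback uses NLTK's WordNet interface; the similarity function below is our assumption, as the paper does not specify which WordNet measure it uses:

from nltk.corpus import wordnet as wn  # requires the WordNet corpus to be downloaded

def wordnet_score(tag, initial_tags):
    """Best path similarity between any sense of `tag` and any sense of the
    initial tags; 0.0 for words outside WordNet's coverage."""
    best = 0.0
    for other in initial_tags:
        for s1 in wn.synsets(tag):
            for s2 in wn.synsets(other):
                sim = s1.path_similarity(s2)
                if sim is not None:
                    best = max(best, sim)
    return best

def feedback(x0, x1, initial_tags):
    """Pseudo feedback Phi(x0, x1): positive iff x1 scores higher against
    the initial tags than x0 does."""
    return wordnet_score(x1, initial_tags) - wordnet_score(x0, initial_tags)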

For each test image, we give 5 initial tags generated by users on Flickr. For the baseline method, we adopt the commonly used recommendation by tag co-occurrence. For the second method, we use only tag co-occurrence as the ranking feature for the RankBoost recommendation. For the third method, we use both tag co-occurrence and content correlation in the RankBoost recommendation. To provide real-time recommendation, we calculate the co-occurrence similarity and content similarity of all tags in the database offline, and the RankBoost algorithm is only performed on the top 100 relevant tags for each image.

8.2 Evaluation
Each recommendation approach generates an ordered list of relevant tags for each image. A volunteer then evaluates these recommended tags: if a tag is relevant to the image, it is marked true; otherwise false. The average precision of the top N (N = 5) recommendations and the coverage over all correct recommendations are adopted to quantitatively measure the performance of each recommendation method. Coverage is defined as the proportion of correct tags (including all correct tags from the compared methods and the initial tags) that are recommended by the specific method. We adopt coverage rather than recall, because recall is inapplicable to the recommendation task.

    Coverage(m_i) = \frac{\sum_{x \in TopN} \delta(x \mid m_i)}{\sum_i \sum_{x \in TopN} \delta(x \mid m_i)}    (26)

    \delta(x \mid m_i) = \begin{cases} 1, & \text{given method } m_i, \ x \text{ is a related tag;} \\ 0, & \text{otherwise.} \end{cases}    (27)
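A direct transcription of this coverage definition (our code; each set is assumed to hold the tags a method placed in its top N that the volunteer judged relevant):

def coverage(relevant_topn, method):
    """Coverage(m_i) (Eq. 26): the share of all correct recommended tags,
    pooled over the compared methods, that method m_i recovers in its top N.
    relevant_topn: dict mapping method name -> set of judged-relevant tags."""
    pooled = set().union(*relevant_topn.values())
    return len(relevant_topn[method]) / len(pooled) if pooled else 0.0

# Toy usage with hypothetical judgments for one image:
judged = {"TC": {"ocean", "water"}, "MRR": {"ocean", "water", "wave", "beach"}}
print(coverage(judged, "MRR"))  # 1.0
print(coverage(judged, "TC"))   # 0.5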

8.3 Experiment A: Multi-domain relevance vs. tag co-occurrence
In this experiment, we compare the Multi-domain Relevance based RankBoost recommendation (MRR) with the Tag Co-occurrence based tag recommendation (TC). For the MRR method, one weak ranker is trained on the tag co-occurrence information, another on the tag content correlation, and a third on the image-conditioned tag correlation; these weak rankers are combined in the RankBoost framework. The comparison between MRR and TC shows how much improvement is gained by incorporating the tag content correlation and the image-conditioned tag correlation. To reveal how much is gained by the multi-domain features themselves, we further compare MRR with the Tag Co-occurrence based RankBoost (TCR).

We randomly choose 20% of the sample images as test data; the remaining images and their associated tags are used as training data. To train the TCR model, we count the co-occurrence frequency between every pair of tags in the collection and normalize the counts into the range 0 to 1. For the MRR method, we train the tag-based weak ranker the same way as the purely tag co-occurrence based method. In the visual domain, we generate the ranking features based on the visual similarity of the tags, as discussed in the previous sections. We then combine these weak rankers with the RankBoost algorithm.

Figure 5 and Figure 6 compare the performance of the methods given 5 initial tags. In Figure 5, "M1" represents the TC method [18], "M2" denotes TCR, and "M3" denotes the MRR method. Comparing M2 with M1, we find that the supervised recommendation outperforms the unsupervised method by 3.3% in precision and 8.8% in coverage. Comparing M3 with M2 shows that adding the content correlation gains a further 5.5% in precision and 10.5% in coverage, respectively. In total, the proposed hybrid-information-based RankBoost gains 9.0% in precision and 20.3% in coverage over the commonly used tag co-occurrence approach. These results demonstrate the effectiveness of the supervised recommendation as well as the usefulness of the image content correlation.

Figure 5: Performance of different methods: (a) Precision@5, (b) Coverage@5.

Figure 6: Tag recommendation examples.

Figure 6 illustrates tag recommendation results. The text below each image lists its tags: the first row is the initial tags assigned by the Flickr users; the second row is the result of the simple tag co-occurrence method [18]; the third row is the result of the tag-domain RankBoost method; and the last row is the multi-domain RankBoost method. The proposed method provides more relevant tags.

8.4 Experiment B: RankBoost vs. linear combination
To combine the visual correlation with the tag co-occurrence, one could also use a linear combination rather than the RankBoost algorithm. For the multi-domain relevance linear combination method (MRL), we generate a new similarity score of the following form.

    R^s_{linear} = R^s_{TC} + \eta_1 R^s_{TCC} + \eta_2 R^s_{ITC}    (28)

    R^a_{linear} = R^a_{TC} + \zeta_1 R^a_{TCC} + \zeta_2 R^a_{ITC}    (29)

where R^s_{linear} is the combination of the symmetric similarity measurements and R^a_{linear} the combination of the asymmetric ones. η_1, η_2, ζ_1, ζ_2 are the combination coefficients, which are determined by experiments.

This kind of linear combination also uses the multi-domain correlation information, but it has two disadvantages. First, the combination coefficients are difficult to determine automatically. Second, the linear combination mixes similarity scores from different domains, which is not reasonable. In the RankBoost method, we combine the outputs of the weak rankers rather than the similarity scores.

In this experiment, we demonstrate the advantage of RankBoost-based multi-domain recommendation over the linear combination. For each test image, we give K (K = 1, 3, 5, 8) initial tags, and a ranking of the remaining tags is generated by both RankBoost and the linear combination method. For the linear combination method, we set both η_1, η_2 and ζ_1, ζ_2 to 1. The precision of the top 10 tags is then evaluated manually. The results are shown in Fig. 7.

From these comparisons, we find that both precision@10 and coverage@10 rise as the number of initial tags increases. The RankBoost algorithm outperforms the simple linear combination of the ranking features.



Figure 7: Comparison with linear combinations.

Figure 8: Precision@N with K_I initial tags.

This advantage is more obvious when the number of initial tags is larger: with more initial tags, the RankBoost algorithm has more information for ranking and more information for deciding how to combine the two kinds of ranking features effectively. For the linear combination algorithm, since the combination coefficients are fixed, the method cannot optimize the combination; its performance increase comes purely from the extra information for ranking.

8.5 Experiment C: Parameter influence
In this experiment, we study the influence of the number of initial tags K_I. In the real world, there may not be a large number of tags for each image, so the number of initial tags may greatly limit the applicability of the method. In this study, we selected a collection of images which have more than 10 initial tags, set K_I from 1 to 8, and calculated precision@1, 5, 10 under each initial tag setting. The results are shown in Figure 8.

From these results, we find that the recommendation precision improves as K_I increases, because with more initial tags the algorithm has more information to predict the content of the image. The precision with 5 initial tags is already acceptable for a common recommendation system. Fortunately, according to statistics from 10,000 random samples from Flickr, the average number of correct tags per image is above 5. This means that in general users can label more than 5 correct tags, so it is feasible to apply this recommendation scheme on Flickr to help users tag more.

Figure 9: Precision@N with K_N irrelevant tags.

8.6 Experiment D: Noise resistance
Noise exists in real-world data; we cannot assume that the tags on the web are all correct, so for real applications noise resistance is a critical feature. In this experiment we show this property of the proposed recommendation approach. Five correct initial tags are given for each test image, and we also generate some noise (irrelevant tags), which is combined with the correct tags. For a detailed analysis of the influence of noise on the recommendation results, we gradually increase the number of irrelevant tags K_N from 1 up to 5. The final precision at top N (N = 1, 5, 10) under each irrelevance rate is shown in Figure 9.

From the figure, we find that the precision of the recommendation drops as the number of irrelevant tags increases, because the irrelevant tags mislead the ranking function. However, the irrelevant tags are typically unrelated to each other, while the correct tags of the same image are semantically related; thus the ranking algorithm still gives more weight to tags related to the correct ones. This is reflected in the results: even with 5 irrelevant tags, i.e., 50% of the tags, precision@1, 5, 10 remains above 75%, 73% and 53% respectively. This means the algorithm is applicable with 50% irrelevant tags, which is actually worse than the real-world situation: sampling 10,000 images from Flickr shows that, on average, about 24% of the tags of each image are irrelevant.

8.7 Experiment E: Computational cost
Computational cost is also a main consideration for web scenarios. The main computational time of the proposed methods lies in the visual similarity measurement between tags. Fortunately, this process can be performed offline: we first generate a dictionary of tags, and then compute both the co-occurrence based similarity and the content based tag similarity offline. For the online recommendation, we only need to rank the tags according to the initially labeled tags, which is very fast. The computational time per sample is shown in Table 1. The experiments were performed on an Intel(R) Core(TM)2 Duo CPU at 2.33GHz with 1GB memory. For each image, the multi-domain RankBoost method takes only 0.02 second.


Table 1: Computational cost (seconds) of different methods.

            TC      TCR     MRL     MRR
Training    0.178   0.183   2.190   2.195
Test        0.001   0.002   0.001   0.002

In Table 1, the training time is the cost of calculating the similarity between each concept pair, and the test time is the cost of generating a recommendation list for one target image. The comparison shows that the computational cost of the proposed method lies mainly in the computation of the visual information; however, the building of the VLMs can be performed offline in the training process, so the computational time of the online recommendation is almost the same for all methods.

9. CONCLUSIONS
In this paper, we have introduced a learning-based tag recommendation approach. It generates ranking features from multi-modality correlations and learns an optimal combination of these ranking features with the RankBoost algorithm. The recommendation is updated each time a new tag is added, and this efficiency makes the method suitable for real-time applications. With this learning-based recommendation, higher-quality recommendations are achieved, and users are reminded of more diverse and correlated tags. Experiments also demonstrate that this learning-to-tag framework is more effective than current approaches and that the combination of multi-modality relevance is helpful for tag recommendation.

10. ACKNOWLEDGMENTS
This research is supported in part by the National Natural Science Foundation of China (60672056), the 863 Program (2008AA01Z117), and the USTC Postgraduate Innovation Foundation (KD2007049).

11. REFERENCES
[1] E. Akbas and F. Yarman Vural. Automatic image annotation by ensemble of visual descriptors. In CVPR '07, June 2007.
[2] M. Ames and M. Naaman. Why we tag: motivations for annotation in mobile and online media. In CHI '07, 2007.
[3] J. Amores, N. Sebe, and P. Radeva. Context-based object-class recognition and retrieval by generalized correlograms. IEEE Trans. Pattern Anal. Mach. Intell., 29(10):1818–1833, 2007.
[4] Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. In WWW '06, 2006.
[5] J. Blythe and Y. Gil. Incremental formalization of document annotations through ontology-based paraphrasing. In WWW '04, 2004.
[6] S. Boll, P. Sandhaus, A. Scherp, and U. Westermann. Semantics, content, and structure of many for the creation of personal photo albums. In ACM Multimedia '07, 2007.
[7] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. In ICML '98, 1998.
[8] G. Koloniari, Y. Petrakis, E. Pitoura, and T. Tsotsos. Query workload-aware overlay construction using histograms. In CIKM '05, 2005.
[9] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang. PFP: Parallel FP-growth for query recommendation. In ACM Recommender Systems, Lausanne, 2008.
[10] X. Li, C. G. Snoek, and M. Worring. Learning tag relevance by neighbor voting for social image retrieval. In MIR '08, 2008.
[11] J. Liu, B. Wang, M. Li, Z. Li, W. Ma, H. Lu, and S. Ma. Dual cross-media relevance model for image annotation. In ACM Multimedia '07, 2007.
[12] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. 2008.
[13] M. Naaman and R. Nair. ZoneTag's collaborative tag suggestions: What is this person doing in my phone? IEEE Multimedia, 2008.
[14] G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, T. Mei, and H.-J. Zhang. Correlative multi-label video annotation. In ACM Multimedia '07, 2007.
[15] Y. Qi, K. S. Candan, J. Tatemura, S. Chen, and F. Liao. Supporting OLAP operations over imperfectly integrated taxonomies. In SIGMOD '08, 2008.
[16] X. Rui, M. Li, Z. Li, W.-Y. Ma, and N. Yu. Bipartite graph reinforcement model for web image annotation. In ACM Multimedia '07, 2007.
[17] S. Sen, S. K. Lam, A. M. Rashid, D. Cosley, D. Frankowski, J. Osterhouse, F. M. Harper, and J. Riedl. Tagging, communities, vocabulary, evolution. In CSCW '06, 2006.
[18] B. Sigurbjornsson and R. van Zwol. Flickr tag recommendation based on collective knowledge. In WWW '08, 2008.
[19] C. G. M. Snoek, B. Huurnink, L. Hollink, M. de Rijke, G. Schreiber, and M. Worring. Adding semantics to detectors for video retrieval. IEEE Transactions on Multimedia, 9, 2007.
[20] C. Wang, F. Jing, L. Zhang, and H.-J. Zhang. Content-based image annotation refinement. In CVPR '07, 2007.
[21] L. Wu, X.-S. Hua, N. Yu, W.-Y. Ma, and S. Li. Flickr distance. In ACM Multimedia '08, 2008.
[22] L. Wu, M. Li, Z. Li, W.-Y. Ma, and N. Yu. Visual language modeling for image classification. In MIR '07, 2007.
[23] R. Yan and A. Hauptmann. Query expansion using probabilistic local feedback with application to multimedia retrieval. In CIKM '07, 2007.
[24] J. Yu, J. Amores, N. Sebe, P. Radeva, and Q. Tian. Distance learning for similarity estimation. IEEE Trans. Pattern Anal. Mach. Intell., 30(3):451–462, 2008.
[25] Y.-T. Zheng, S.-Y. Neo, T.-S. Chua, and Q. Tian. Visual synset: towards a higher-level visual representation. In CVPR '08, 2008.
