

Improving Image Annotation via Ranking-Oriented Neighbor Search and Learning-Based Keyword Propagation

Chaoran Cui, Jun Ma, Tao Lian, and Zhumin Chen
School of Computer Science and Technology, Shandong University, 1500 Shunhua Road, Jinan, 250101, China. E-mail: [email protected]; [email protected]; [email protected]; [email protected]

Shuaiqiang Wang
School of Computer Science and Technology, Shandong University of Finance and Economics, 40 Shungeng Road, Jinan, 250014, China. E-mail: [email protected]

Automatic image annotation plays a critical role in modern keyword-based image retrieval systems. For this task, the nearest-neighbor–based scheme works in two phases: first, it finds the most similar neighbors of a new image from the set of labeled images; then, it propagates the keywords associated with the neighbors to the new image. In this article, we propose a novel approach for image annotation, which simultaneously improves both phases of the nearest-neighbor–based scheme. In the phase of neighbor search, different from existing work discovering the nearest neighbors with the predicted distance, we introduce a ranking-oriented neighbor search mechanism (RNSM), where the ordering of labeled images is optimized directly without going through the intermediate step of distance prediction. In the phase of keyword propagation, different from existing work using simple heuristic rules to select the propagated keywords, we present a learning-based keyword propagation strategy (LKPS), where a scoring function is learned to evaluate the relevance of keywords based on their multiple relations with the nearest neighbors. Extensive experiments on the Corel 5K data set and the MIR Flickr data set demonstrate the effectiveness of our approach.

Introduction

With advances in Internet and Web 2.0 technologies, the creation and distribution of digital images are much easier than ever before. This proliferation of images on the web has highlighted the urgent need for effective image retrieval techniques. Because of the well-known semantic gap between low-level features and high-level semantics, modern image retrieval systems regard the annotation of images as a natural bridge narrowing the gap between text-based queries and the visual content of images. The performance of image retrieval systems depends on the quality of annotations. However, manual image annotation is a laborious and time-consuming process. Therefore, automatic image annotation has been extensively studied for image retrieval.

Automatic image annotation aims to assign to an image a set of relevant keywords that reflect its visual content. In general, conventional approaches boil down to building the connection between visual features and keywords over a well-established training set. They typically learn probabilistic models (Feng, Manmatha, & Lavrenko, 2004; Yakhnenko & Honavar, 2008) to infer the joint probability distribution of visual features and keywords, or classification models (Cusano, Ciocca, & Schettini, 2004; Fan, Gao, & Luo, 2007) to predict the presence or absence of each keyword in an image. However, because of the lack of sufficient training images with high-quality labels, these conventional model-driven approaches are rarely extended to large-scale image repositories.

In recent years, along with the popularity of social networking sites like Flickr, we have witnessed the advent of massive numbers of user-tagged images on the web. Although some of the user-contributed tags may be inaccurate and subjective, these images provide plenty of annotated resources for our research.

Received April 3, 2013; revised September 17, 2013; accepted October 23, 2013

© 2014 ASIS&T • Published online 6 May 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.23163

JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 66(1):82–98, 2015


With these valuable resources, many data-driven techniques have been developed to resolve the image annotation task. The nearest-neighbor–based scheme (Makadia, Pavlovic, & Kumar, 2008) has become increasingly attractive because of its superior performance and straightforward framework. It is based on the assumption that visually similar images are more likely to share common keywords. Given a new image, the nearest-neighbor–based scheme works in two phases. First, it finds the most similar neighbors of the new image from the set of labeled images. Then the scheme propagates the keywords associated with the neighbors to the new image.

In spite of the simplicity of the nearest-neighbor–based annotation framework, some critical research issues remain to be addressed. One important aspect of the framework is how to search the nearest neighbors effectively. Some early work (Li, Chen, Zhang, Lin, & Ma, 2006; Wang, Zhang, Jing, & Ma, 2006) was concerned with developing fast indexing or matching techniques to speed up the search process. Most recently, distance metric learning (Liu & Jin, 2006) has become popular and has been applied to image annotation to some extent. Many studies (Makadia, Pavlovic, & Kumar, 2008; Wu, Hoi, Jin, Zhu, & Yu, 2009; Wu, Hoi, Zhao, & He, 2011; Zhang, Huang, Li, & Metaxas, 2012) have attempted to define a suitable distance metric between images by linearly combining the distances in different dimensions of the visual feature space. All labeled images were ranked by their distance from the new image, and the nearest neighbors were subsequently discovered. Typically, these studies followed the general principle of learning the optimal metric such that the distance between each predetermined pair of similar images is minimized, whereas the distance for dissimilar image pairs is maximized. The optimization objective was to reduce the difference between the predicted distance and the predetermined "true" distance.

However, in the task of neighbor search, the real quantity of interest is the ordering of images induced by distances in the learned metric, rather than the actual numerical values of the distances themselves. The methods of distance prediction are actually solving a harder intermediate problem than necessary, which may require more training data to achieve a desirable result. More importantly, we argue that higher accuracy in distance prediction does not necessarily lead to a better ordering of labeled images. For example, let u and v denote two labeled images whose true distances from the new image are 4 and 5, respectively. Suppose a method predicts the distance to be 5 for u and 4 for v. Although this is a good result in terms of prediction error (e.g., mean squared error of distance), it fails to ensure the correct ordering of u and v. In light of these problems, it is necessary to shift our attention from approximating the absolute distance between images to directly predicting the relative ordering of labeled images.

After finding a group of nearest neighbors, the next step is to evaluate the relevance of the keywords associated with the neighbors and propagate the most relevant ones to the new image. However, this process is not yet well investigated in existing work. In most cases, studies used some heuristic rules to select the propagated keywords. For instance, Torralba, Fergus, and Freeman (2008) ranked keywords in terms of keyword frequency (kf) in the neighbors. However, general keywords that occur frequently in the entire collection may dominate the results. Wang, Jing, Zhang, and Zhang (2008) assessed the relevance of a keyword by multiplying its kf by the inverse keyword frequency (ikf). The ikf value of a keyword is inversely and logarithmically proportional to the frequency of the keyword in the entire collection. Nevertheless, this scheme tends to overrate rare keywords (Li, Snoek, & Worring, 2009).

Generally speaking, these heuristic rules assumed a surrogate measurement (e.g., kf or kf*ikf) of keyword relevance and selected the propagated keywords based on it. However, the surrogate measurement may not always be valid in different situations; thus, there is no guarantee of the quality of the keywords selected by these heuristic rules. In fact, with a set of training data, we can attempt to use supervised learning methods to directly learn a reliable criterion for evaluating keyword relevance, and thereby avoid postulating what it means for a keyword to be "relevant" in an ad hoc way.

Motivated by the earlier discussion, we propose a novel approach for image annotation, which simultaneously improves both phases of the nearest-neighbor–based scheme. In the neighbor search phase, we present a ranking-oriented neighbor search mechanism (RNSM), which uses learning-to-rank (LTR) (Liu, 2009) techniques to directly optimize the relative ordering of labeled images rather than their absolute distance with respect to a given image. Unlike standard LTR methods, our proposed ranking algorithm exploits the implicit preference information hidden in the labeled training images. Besides, as only the top-K neighbors are generally considered in the nearest-neighbor–based scheme, we enforce the ranking model to focus more on the correctness of the top-K ranked results. A boosting algorithm (Freund, Iyer, Schapire, & Singer, 2003) is used to solve the resulting optimization problem. In the keyword propagation phase, we introduce a learning-based keyword propagation strategy (LKPS), with the aim of learning a scoring function that can evaluate the relevance of candidate annotation keywords for a new image. The relevance is assessed based on different kinds of relations between the candidate keywords and the neighbors of the new image. We adopt the structural support vector machine (SVM) (Tsochantaridis, Joachims, Hofmann, & Altun, 2006) as the backbone of our learning approach, and the cutting plane algorithm (Joachims, Finley, & Yu, 2009) is applied in the training process.

Our experiments were conducted on two publicly available image data sets: Corel 5K and MIR Flickr. Experimental results reveal that our proposed approach achieves a remarkable improvement over many state-of-the-art methods. In addition, we also demonstrate the effectiveness of each individual component of our approach for image annotation.



The main contributions of this article can be summarized as follows:

1. An effective image annotation approach is proposed by simultaneously improving both phases of the nearest-neighbor–based scheme.

2. A novel RNSM is presented to directly optimize the ordering of all labeled images without going through the intermediate step of distance prediction.

3. A new LKPS is introduced to learn a scoring function that evaluates the relevance of keywords according to their various relations with different neighbors.

The rest of this article is organized as follows. In the next section, we provide a review of related work. Then we illustrate our RNSM and LKPS strategies, respectively. The Experiments section presents the experimental results and analysis. Finally, we point out some directions for future research in the Conclusions section.

Related Work

In this section, we review some existing studies on image annotation. Moreover, we briefly introduce two fields of work closely related to our approach, that is, LTR and structural learning.

Image Annotation

Model-Driven Methods. Since the early 2000s, to address the challenge of automatic image annotation, many machine learning models have been proposed for building the connection between annotations and low-level visual features. Among these model-driven methods, two main groups can be identified: probability-based methods and classification-based methods. The probability-based methods infer the joint probability distribution of image features and annotation keywords. The conditional probability of each keyword given the features is computed by normalizing the joint likelihood of annotating a new image. Representative probability-based methods include topic models and mixture models. Topic models assume annotated images to be samples from a specific mixture of latent topics, where each topic is a joint distribution over image regions and keywords. Examples of topic models include latent Dirichlet allocation (Barnard, Duygulu, Forsyth, de Freitas, Blei, & Jordan, 2003), probabilistic latent semantic analysis (Monay & Gatica-Perez, 2004), and hierarchical Dirichlet processes (Yakhnenko & Honavar, 2008). Mixture models aim to directly define the joint distribution over image features and annotation keywords. To this end, they separately estimate the probability density functions for image features and keywords, which are typically assumed to be distributed as multinomials (Jeon, Lavrenko, & Manmatha, 2003; Lavrenko, Manmatha, & Jeon, 2004) or Bernoullis (Feng et al., 2004).

The classification-based methods formulate image annotation as a classification problem in which each keyword is treated as a class label. Various supervised learning methods have been applied for this purpose, including SVMs (Cusano et al., 2004; Grangier & Bengio, 2008), Bayes point machines (Chang, Goh, Sychay, & Wu, 2003), and boosting (Fan et al., 2007). Qi, Hua, Rui, Tang, Mei, and Zhang (2007) formulated the annotation task as a multilabel classification problem and proposed a correlative multilabeling approach by exploiting semantic correlations between keywords. In spite of their encouraging achievements, the effectiveness of both families of model-driven methods relies heavily on sufficient training images with high-quality labels, which are not easily available in real-world, large-scale image data sets.

Data-Driven Methods. With the increasing amount of annotated resources on the web, data-driven methods have attracted much more attention in the field of image annotation. The task of automatically annotating an image by mining knowledge from web images and their associated context, such as surrounding text and existing tags, is also known as image tagging in the literature. Several research endeavors have been made to leverage label diffusion over a similarity graph of labeled and unlabeled images to tackle the tagging problem (Liu, Li, Ma, Liu, & Lu, 2006; Liu, Li, Liu, Lu, & Ma, 2009). Wang, Jing, Zhang, and Zhang (2006) collected candidate tags from textual information (e.g., captions and surrounding text) by exploiting term frequency-inverse document frequency (tf-idf) weights and then used a random walk algorithm to rerank these candidate tags based on the visual similarity of images. Yang, Huang, Shen, and Zhou (2011) handled the tag incompletion problem by using multitag associations mined from near-duplicate image clusters. Yang, Yang, and Shen (2013) proposed a cross-media tag transfer framework, which used a graph-based semisupervised learning model to transfer tags from images to videos. Recently, Makadia et al. (2008) proposed the nearest-neighbor–based scheme, where they propagated the keywords (tags) of the nearest neighbors to a new image. In this method, the nearest neighbors were determined by the average of several distances computed from different visual feature spaces. Because distance calculation is involved, it is natural to integrate distance metric learning (Liu & Jin, 2006) into the nearest-neighbor–based scheme. Guillaumin, Mensink, Verbeek, and Schmid (2009) proposed the TagProp model, which weighted different base distances by maximizing the likelihood of the annotations of training images. To boost the recall of rare keywords, they also introduced a word-specific modulation of the weighted neighbor predictions. Wu et al. (2009) focused on learning an optimal distance metric by exploring implicit side information associated with web images, such as surrounding text and existing tags. By extending this idea, they further proposed a unified distance metric learning approach, which not only exploits



side information, but also unifies both inductive and transductive metric learning formulations (Wu et al., 2011). Most recently, much work has used sparse coding techniques (Wright, Ma, Mairal, Sapiro, Huang, & Yan, 2010) to explore the relations among images. Zhang et al. (2012) proposed a regularization-based feature selection algorithm to leverage both the sparsity and clustering properties of features. The selected features were then combined to construct the distance metric between images. Tang, Hong, Yan, Chua, Qi, and Jain (2011) proposed sparsely reconstructing an image from its neighbors in feature space, and the reconstruction coefficients were further used to perform graph-based label diffusion. Similarly, Yang, Yang, Huang, Shen, and Nie (2011) used joint group sparsity to localize tags to image regions with spatial correlations. These studies have demonstrated the benefits of integrating distance metric learning into the nearest-neighbor–based scheme. However, higher accuracy in distance prediction does not necessarily lead to better neighbor ordering, as discussed in the Introduction. Therefore, in this article, we seek to directly optimize the relative ordering of training images rather than their absolute distances for a given image.

The studies discussed earlier are mainly concerned with the problem of nearest neighbor search. In contrast, few articles focus on the subsequent phase of keyword propagation. After finding the nearest neighbors, many methods merely selected the propagated keywords by majority or weighted neighbor voting (Mei, Wang, Hua, Gong, & Li, 2008; Wu et al., 2009, 2011). Li et al. (2009) demonstrated that the difference between the kf in the local neighbor set and the entire image collection is a good keyword relevance indicator. Nonetheless, the assumption required by this conclusion can only be satisfied with a low probability (less than 40%) in real situations. In our approach, we attempt to apply supervised learning techniques to derive a reliable criterion for evaluating keyword relevance.

LTR

LTR has attracted extensive attention in the machine learning community because of its importance in a wide variety of applications, such as information retrieval and collaborative filtering. Previous work can be divided into three categories: pointwise methods, pairwise methods, and listwise methods. In pointwise methods (Li, Burges, & Wu, 2007), ranking is modeled as regression or classification on individual items to predict the relevance degree of each item. In pairwise methods (Burges et al., 2005; Joachims, 2002), ranking is transformed into classification on item pairs to predict the preference between two items. In listwise methods (Cao, Qin, Liu, Tsai, & Li, 2007; Xia, Liu, Wang, Zhang, & Li, 2008), entire ranked lists of items are taken as training instances, and the loss is defined over whole lists rather than over individual items or pairs. A comprehensive survey of LTR can be found in Liu (2009).

In real ranking scenarios, it is well accepted that users mainly care about the top results, which means the ordering of the top results is critical for the user's experience. This characteristic of the ranking problem has been explored in recent studies (Niu, Guo, Lan, & Cheng, 2012; Xia, Liu, & Li, 2009) and is referred to as the top-K ranking problem. Our neighbor ranking problem also belongs to this research topic.

Structural Learning

Structural learning is the problem of learning to predict outputs that are not simple labels but instead have a more complex structure, such as sets, sequences, or graphs. Probabilistic graphical models, for example, Bayesian networks and conditional random fields, provide a powerful framework for expressing structured objects via a graph-based representation and have been shown to be quite effective for many structured prediction problems (Nowozin & Lampert, 2011). Tsochantaridis et al. (2006) introduced the structural SVM, which extends traditional SVM classification to prediction problems involving complex outputs. The structural SVM framework has been widely used in computer vision and multimedia domains, including image segmentation (Bertelli, Yu, Vu, & Gokturk, 2011), object localization (Blaschko & Lampert, 2008), image ranking (Siddiquie, Feris, & Davis, 2011), and video summarization (Li, Zhou, Xue, Zha, & Yu, 2011). To the best of our knowledge, our study is the first attempt to embed the structural SVM into the nearest-neighbor–based scheme for image annotation.

Ranking-Oriented Neighbor Search Mechanism

In this section, we propose a ranking-oriented neighbor search mechanism (RNSM), which exploits LTR techniques to directly optimize the ordering of all labeled images for a given image. The top-K ranked results are then selected as the K nearest neighbors. For ease of explanation, we first introduce some notation.

Let $\mathcal{X}$ denote an image collection, and let $T = \{t_1, t_2, \ldots, t_c\}$ be the set of all keywords appearing in the collection, where $c$ is the total number of unique keywords. In the image annotation task, we are given a set of $n$ labeled training images $S = \{x_i \in \mathcal{X} \mid i = 1, \ldots, n\}$, in which each labeled image $x_i$ is associated with a $c$-dimensional binary label vector $y_i \in \{0, 1\}^c$, whose $j$th element $y_i^{(j)}$ indicates the presence of keyword $t_j$ in $x_i$; that is, $y_i^{(j)} = 1$ if $x_i$ is labeled by $t_j$ and $y_i^{(j)} = 0$ otherwise. Given a new image $x_{\mathrm{new}} \in \mathcal{X}$, our goal is to learn a ranking function $H: \mathcal{X} \times S \to \mathbb{R}$ from the training data, such that $H(x_{\mathrm{new}}, x_i)$ represents the relevance of the labeled image $x_i$ with respect to $x_{\mathrm{new}}$, and $x_i$ is ranked before $x_j$ if $H(x_{\mathrm{new}}, x_i) > H(x_{\mathrm{new}}, x_j)$.

Although LTR has been extensively studied (Liu, 2009), it is not straightforward to directly apply standard LTR methods to our problem, for the following two reasons. First,



unlike common LTR tasks, where some preference information (in the form of pairwise or listwise constraints) has been explicitly given to supervise the learning process, in our problem, preference information is only implicitly available in the training data. Second, we generally consider only the K nearest neighbors of a new image, to prohibit the potential noisy keywords introduced by distant neighbors. Therefore, in our ranking task, the correct ordering of the top-ranked results is crucial, whereas mistakes at low ranks may not degrade the final performance. It is necessary to redesign the training procedure to ensure that the top-ranked results are as accurate as possible.

Based on this analysis, in the following subsections, we first present the process of extracting the implicit preference information hidden in the training data. With the preference information, we then introduce a new LTR algorithm that emphasizes the accuracy of the top-K results.

Extraction of Implicit Preference Information

Because no explicit preference information is given for our problem, to facilitate LTR, we first derive some preference information from the training data. In other words, for a labeled training image, we need to find information that can reflect the relative ordering of the other labeled images with respect to it.

It is notable that the intuition behind the nearest-neighbor–based methods is that similar images should share more common keywords. That is, given an image, its close neighbors should have a higher keyword agreement with it than its distant neighbors. In accordance with this principle, we measure the degree of relevance between two labeled images by the consistency of their keywords. Let $y$ and $\hat{y}$ be two label vectors; their consistency $CON(y, \hat{y})$ is defined in a manner similar to the F1 measure:

$$CON(y, \hat{y}) = \frac{2pr}{p + r}, \qquad p = \frac{y^T \hat{y}}{\hat{y}^T \hat{y}}, \qquad r = \frac{y^T \hat{y}}{y^T y}. \qquad (1)$$

With this definition, for a labeled image $x_i$, we define $R_i = \{r_1^i, \ldots, r_{i-1}^i, r_{i+1}^i, \ldots, r_n^i\}$ to correspond to the relevance degrees of the other labeled images $S \setminus x_i = \{x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n\}$, where $r_j^i = CON(y_i, y_j)$. Then, we denote by $\pi_i$ the total ranking of $S \setminus x_i$ with respect to $x_i$, which is derived in descending order of $R_i$, and $\pi_i(x_j)$ stands for the position of the labeled image $x_j \in S \setminus x_i$ in the ranking.
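To make the measure concrete, the following minimal Python sketch computes $CON$ for two binary label vectors. It is a sketch only; the guard for empty label vectors is our own assumption, as the paper does not discuss that corner case.

```python
import numpy as np

def con(y, y_hat):
    """Keyword consistency CON(y, y_hat) of Equation (1).

    For binary label vectors, 2pr / (p + r) simplifies to
    2 * (y . y_hat) / (y . y + y_hat . y_hat), i.e., the F1 (Dice) measure.
    Returning 0.0 for empty vectors is our assumption.
    """
    overlap = float(y @ y_hat)
    denom = float(y @ y) + float(y_hat @ y_hat)
    return 2.0 * overlap / denom if denom > 0 else 0.0
```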

Although $\pi_i$ seems a natural way of representing the preference information associated with $x_i$, the variance in the importance of partial orderings at different ranking positions cannot be easily reflected in $\pi_i$. To allow for the encoding of this kind of information, we construct the preference information in the form of ordered pairs of instances. In particular, a subset of $m$ ordered pairs $P_i$ is randomly picked from the pairs induced by $\pi_i$; that is, $P_i \subseteq O_i$ and $|P_i| = m$, where $O_i = \{(x, \hat{x}) \mid \pi_i(x) < \pi_i(\hat{x})\}$. Moreover, for an arbitrary ordered pair $(x_j, x_q) \in P_i$, we define a function $W_i: P_i \to \mathbb{R}$ such that $W_i(x_j, x_q)$ represents the importance of $(x_j, x_q)$ being satisfied. To determine the value of $W_i(x_j, x_q)$, we first denote by $\pi_i'$ a new ranking of $S \setminus x_i$ that exchanges the positions of $x_j$ and $x_q$ in $\pi_i$. Then we calculate the normalized discounted cumulative gain (NDCG@K) metric (Järvelin & Kekäläinen, 2002) for $\pi_i'$:

$$NDCG@K(\pi_i') = \frac{1}{Z_K} \sum_{l=1}^{K} \frac{2^{r_l} - 1}{\log(1 + l)}, \qquad (2)$$

where $r_l$ denotes the relevance degree of the image at position $l$ in $\pi_i'$, and $Z_K$ is a normalization factor chosen so that the NDCG@K of the correct ranking $\pi_i$ is 1. Furthermore, the exact form of $W_i(x_j, x_q)$ is given as

$$W_i(x_j, x_q) = \begin{cases} 1 - NDCG@K(\pi_i') & \text{if } \pi_i(x_j) \le K, \\ \eta & \text{otherwise}. \end{cases} \qquad (3)$$

If $x_j$ or $x_q$ involves the top-K instances in $\pi_i$, we take the decrease of $\pi_i'$ in terms of NDCG@K as the value of $W_i(x_j, x_q)$. Intuitively, because the NDCG metric includes a position discount factor in its definition, incorrectly ordering the higher ranks of $\pi_i$ leads to greater losses in terms of NDCG@K. As a result, a large weight is assigned to ordered pairs at high positions. On the contrary, if both $x_j$ and $x_q$ appear after the $K$th position in $\pi_i$, exchanging their positions has no effect on the value of NDCG@K. Therefore, we set $W_i(x_j, x_q)$ to a small constant $\eta$. Finally, we repeat the process for each labeled training image, and the ultimate set of preference information is denoted as $P = \bigcup_{i=1}^{n} P_i$, which is used as input for the following LTR algorithm.
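The following Python sketch illustrates how the pair weights of Equations (2) and (3) could be computed. The base-2 logarithm, the rejection-style random sampling, and the assumption that $m$ is small relative to the number of candidate pairs are ours; the paper leaves these details unspecified.

```python
import numpy as np

def ndcg_at_k(rels_in_rank_order, K, Z_K):
    """NDCG@K of a ranking whose relevance degrees are given in rank order (Eq. 2)."""
    gains = 2.0 ** np.asarray(rels_in_rank_order[:K], dtype=float) - 1.0
    discounts = np.log2(np.arange(2, len(gains) + 2))  # log(1 + l), base 2 assumed
    return float(np.sum(gains / discounts)) / Z_K

def preference_pairs(rels, K, m, eta, seed=0):
    """Sample m weighted ordered pairs from the ideal ranking pi_i (Eq. 3).

    `rels[j]` is the CON-based relevance degree r_j^i of labeled image x_j.
    """
    rng = np.random.default_rng(seed)
    rels = np.asarray(rels, dtype=float)
    order = np.argsort(-rels)                  # pi_i: image indices in rank order
    ranked = rels[order]
    Z_K = float(np.sum((2.0 ** ranked[:K] - 1.0) / np.log2(np.arange(2, K + 2))))
    pairs = {}
    while len(pairs) < m:
        a, b = sorted(rng.choice(len(order), size=2, replace=False))
        j, q = order[a], order[b]              # pi_i(x_j) < pi_i(x_q)
        if a < K:                              # pair touches the top K positions
            swapped = ranked.copy()
            swapped[a], swapped[b] = swapped[b], swapped[a]
            pairs[(j, q)] = 1.0 - ndcg_at_k(swapped, K, Z_K)
        else:                                  # both instances below position K
            pairs[(j, q)] = eta
    return pairs                               # {(j, q): W_i(x_j, x_q)}
```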

Top-K Focused Ranking Algorithm

With the derived preference information set $P$, we now present the formulation of our proposed top-K focused LTR algorithm. In our study, we are more interested in the relative ordering of two labeled images than in their absolute ranking scores or pairwise distances with respect to a given image. Therefore, the basic idea of our approach is to construct a ranking function $H$ that is consistent with the preference pairs in $P$ as much as possible. To this end, we define the ranking error of $H$ with respect to $P$ as follows:

$$err = \sum_{i=1}^{n} \sum_{(x_j, x_q) \in P_i} W_{ijq}\, I(H_{iq} \ge H_{ij}). \qquad (4)$$

Here, we write $W_{ijq} = W_i(x_j, x_q)$ and $H_{ij} = H(x_i, x_j)$ for simplicity of expression. $I(\cdot)$ is an indicator function that outputs 1 if the input Boolean expression is true and 0 otherwise. In effect, $err$ measures the weighted number of preference pairs misordered by $H$.



As described earlier, the preference pairs at high positions have larger weights; thus, incorrect orderings of these pairs result in a more severe ranking error, whereas pairs that involve only instances after the $K$th position have been assigned smaller weights, and misordering them has little impact on the error. Therefore, by minimizing $err$, we can find the optimal ranking function $H$ that gives priority to ensuring the correctness of the top-K results.

However, the ranking error defined in Equation (4) is a nonsmooth function because the indicator function $I(\cdot)$ is nonsmooth. It is well known that directly optimizing a nonsmooth function is computationally infeasible (Rockafellar, 1996). To address this problem, we follow the idea of the AdaBoost algorithm (Schapire, 1999) and replace the indicator function $I(x \ge y)$ with the exponential function $\exp(x - y)$. The resulting new ranking error is

$$\widetilde{err} = \sum_{i=1}^{n} \sum_{(x_j, x_q) \in P_i} W_{ijq} \exp(H_{iq} - H_{ij}). \qquad (5)$$

Because it always holds that $\exp(x - y) \ge I(x \ge y)$, by minimizing the new error $\widetilde{err}$, we effectively reduce the original ranking error $err$. Another advantage of using $\widetilde{err}$ comes from the theoretical properties of AdaBoost: minimizing the exponential loss not only reduces the training error but also increases the margins of the training samples, and the enlarged margins are the key to ensuring a low generalization error on test instances (Schapire, 1999).

In our method, the ranking error $\widetilde{err}$ is similar to the objective function used by RankBoost (Freund et al., 2003); we thus adopt the RankBoost algorithm to learn the optimal ranking function $H$. To guarantee the correct execution of the algorithm, we need to define a set of $g$ ranking features $F = \{f_1, \ldots, f_g\}$, each of which corresponds to a linear ordering of the images to be ranked. Specifically, we calculate the distance from the query image to all ranked images in a certain visual feature space, and a ranking feature is generated in ascending order of the distance. It is worth noting that the ranking features are related only to the ordering of the ranked images, not to the actual numerical values of their distances.

Algorithm 1 shows the details of the RankBoost algorithm. The algorithm operates for $T$ iterations. At each iteration $t = 1, 2, \ldots, T$, it maintains a weight distribution $D^{(t)}$ over the preference pairs in $P$, where $D^{(t)}_{ijq}$ denotes the weight on the pair $(x_j, x_q)$ associated with $x_i$. Initially, the weights are set according to the importance of the preference pairs (line 1). At iteration $t$, a weak ranker $h_t$ is created from $F$ based on the current weight distribution $D^{(t)}$ (line 3). We use the same generation process for weak rankers as described by Freund et al. (2003), whose computational cost scales as $O(gn + mn)$. Then the algorithm chooses a weight coefficient $\alpha_t$ for $h_t$ by measuring its ranking accuracy on all preference pairs (line 4). Intuitively, a greater coefficient is given to a more accurate weak ranker. Meanwhile, the weight distribution $D^{(t)}$ is also updated according to the accuracy of $h_t$ (line 5). The preference pairs misordered by $h_t$ have their weights increased, whereas the weights are decreased for pairs that are ordered correctly. Therefore, the weak ranker at the next iteration, $h_{t+1}$, will concentrate more on the "hard" pairs for $h_t$. As the iterations proceed, preference pairs that are difficult to order correctly receive ever-increasing weight. Once all the weak rankers have been created, the algorithm outputs the final ranking function $H$ as their weighted combination, where $\alpha_t$ is the contribution of $h_t$ (line 7).

ALGORITHM 1 RankBoost algorithm for minimizing $\widetilde{err}$.

Input: $P$, $F$, and $T$
Output: $H$
1: Initialize a distribution $D^{(1)}$ over all preference pairs in $P$, i.e., $D^{(1)}_{ijq} = W_{ijq} / Z_0$, where $Z_0 = \sum_{i=1}^{n} \sum_{(x_j, x_q) \in P_i} W_{ijq}$
2: for $t = 1, \ldots, T$ do
3:   Create a weak ranker $h_t: \mathcal{X} \times S \to \mathbb{R}$ from $F$ based on $D^{(t)}$
4:   Compute $\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 + r}{1 - r}\right)$, where $r = \sum_{i=1}^{n} \sum_{(x_j, x_q) \in P_i} D^{(t)}_{ijq} \left( h_t(x_i, x_j) - h_t(x_i, x_q) \right)$
5:   Update $D^{(t+1)}_{ijq} = \frac{D^{(t)}_{ijq} \exp\!\left( h_t(x_i, x_q) - h_t(x_i, x_j) \right)}{Z_t}$, where $Z_t = \sum_{i=1}^{n} \sum_{(x_j, x_q) \in P_i} D^{(t)}_{ijq} \exp\!\left( h_t(x_i, x_q) - h_t(x_i, x_j) \right)$
6: end for
7: return $H(x_{\mathrm{new}}, x_i) = \sum_{t=1}^{T} \alpha_t h_t(x_{\mathrm{new}}, x_i)$

After obtaining the ranking function $H$, given a new image $x_{\mathrm{new}}$, we use $H$ to rank all labeled images for $x_{\mathrm{new}}$ and take the top-K results as the set of its K nearest neighbors, denoted by $N_H(x_{\mathrm{new}})$.
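For illustration, a simplified Python sketch of the boosting loop of Algorithm 1 follows. It treats each ranking feature itself as a weak ranker (omitting the threshold search of Freund et al., 2003) and assumes weak-ranker outputs in $[0, 1]$ so that $|r| < 1$; it sketches the control flow, not the authors' implementation.

```python
import numpy as np

def rankboost(pair_w, rank_feats, T):
    """Simplified RankBoost sketch in the spirit of Algorithm 1.

    pair_w     : importance weights W_ijq from Equation (3), one per pair.
    rank_feats : array of shape (g, n_pairs); rank_feats[f, p] holds
                 h_f(x_i, x_j) - h_f(x_i, x_q) for preference pair p,
                 with weak-ranker outputs assumed to lie in [0, 1].
    Returns (alphas, chosen): coefficients and feature indices whose
    weighted combination defines H (line 7 of Algorithm 1).
    """
    D = np.asarray(pair_w, dtype=float)
    D = D / D.sum()                            # initial distribution D^(1)
    alphas, chosen = [], []
    for _ in range(T):
        rs = rank_feats @ D                    # r for every candidate ranker
        f = int(np.argmax(np.abs(rs)))         # most discriminative feature
        r = float(rs[f])
        alphas.append(0.5 * np.log((1.0 + r) / (1.0 - r)))
        chosen.append(f)
        D = D * np.exp(-rank_feats[f])         # up-weight misordered pairs
        D = D / D.sum()
    return alphas, chosen
```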

Learning-Based Keyword Propagation Strategy

With the set of neighbor images $N_H(x_{\mathrm{new}})$, the next step is to evaluate the relevance of the keywords associated with the neighbors and to propagate the most relevant ones to the new image $x_{\mathrm{new}}$. Unlike the conventional methods that use heuristic rules to select the propagated keywords, in this section we present a learning-based keyword propagation strategy (LKPS).

Recall that in our study, each labeled training image has been assigned a label vector encoding its associated keywords. From this point of view, the image annotation task can be viewed as generating a label vector that encodes the predicted keywords for a new image. In other words, the goal is to find a hypothesis $G: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ denotes the entire image collection and $\mathcal{Y}$ is the set of all possible label vectors. Given the new image $x_{\mathrm{new}} \in \mathcal{X}$, we can use $G$ to predict a label vector $y^* \in \mathcal{Y}$ for $x_{\mathrm{new}}$.



In our LKPS, we seek to learn a scoring function $F: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ from the training data, where $F(x_{\mathrm{new}}, y)$ measures how well a group of candidate annotation keywords represented by $y$ fits $x_{\mathrm{new}}$. Under the nearest-neighbor–based scheme, we determine the value of $F(x_{\mathrm{new}}, y)$ by the relations between the candidate keywords and the neighbors of $x_{\mathrm{new}}$. Once the scoring function $F(x_{\mathrm{new}}, y)$ is learned, the hypothesis $G$ can predict the label vector $y^*$ for $x_{\mathrm{new}}$ by maximizing $F(x_{\mathrm{new}}, y)$ over all possible $y \in \mathcal{Y}$:

$$y^* = G(x_{\mathrm{new}}) = \operatorname*{argmax}_{y \in \mathcal{Y}} F(x_{\mathrm{new}}, y). \qquad (6)$$

To achieve this, we represent the image–label vector pair $(x_{\mathrm{new}}, y)$ by a joint feature vector $\Psi(x_{\mathrm{new}}, y)$ and assume the scoring function $F(x_{\mathrm{new}}, y)$ to be linear in $\Psi(x_{\mathrm{new}}, y)$:

$$F(x_{\mathrm{new}}, y) = w^T \Psi(x_{\mathrm{new}}, y), \qquad (7)$$

where $w$ is a weight vector that needs to be learned during the training phase. In our method, $\Psi(x_{\mathrm{new}}, y)$ expresses the various relations of the keywords represented by $y$ with respect to the different neighbors of $x_{\mathrm{new}}$, and an element of the weight vector $w$ essentially reflects the importance of a certain kind of relation with a certain neighbor. We discuss the exact form of $\Psi$ in the next subsection.

In this article, we use the structural SVM (Tsochantaridis et al., 2006) as the backbone of our learning approach. With the training examples $\{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y} \mid i = 1, \ldots, n\}$, the structural SVM conducts the learning of $F$ by requiring the resulting hypothesis $G$ to minimize the empirical risk

$$R_\Delta(G) = \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, G(x_i)). \qquad (8)$$

Here, $\Delta(y_i, G(x_i))$ quantifies the loss of the predicted label vector $G(x_i)$ compared with the ground truth $y_i$. We define the loss function based on the consistency measure in Equation (1), that is,

$$\Delta(y, \hat{y}) = 1 - CON(y, \hat{y}) = 1 - \frac{2\, y^T \hat{y}}{y^T y + \hat{y}^T \hat{y}}, \qquad (9)$$

where $y$ and $\hat{y}$ denote two label vectors. The details of our structural learning framework are given in a later subsection.

Joint Feature Representation

In this subsection, we discuss the joint feature representation $\Psi$ in Equation (7). As mentioned earlier, given an image–label vector pair $(x, y)$, $\Psi(x, y)$ reveals the multiple relations between the set of candidate keywords represented by $y$ and the $K$ nearest neighbors $N_H(x)$ of $x$. Specifically, we define the form of $\Psi(x, y)$ as follows:

$$\Psi(x, y) = M_x^T y = \begin{bmatrix} \phi(t_1, NN_1) & \cdots & \phi(t_1, NN_K) \\ \vdots & \ddots & \vdots \\ \phi(t_c, NN_1) & \cdots & \phi(t_c, NN_K) \end{bmatrix}^T \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(c)} \end{bmatrix}. \qquad (10)$$

Here, $y^{(i)}$ denotes the $i$th element of $y$, which indicates whether the $i$th keyword $t_i$ appears in this set of keywords. $M_x$ is a $c \times K$ block matrix, whose block in the $i$th row and $j$th column, $\phi(t_i, NN_j)$, is a row vector encoding the relations between the $i$th keyword $t_i$ and the $j$th neighbor $NN_j \in N_H(x)$ of $x$. From the above definition, we can see that $\Psi(x, y)$ is actually the linear combination of the relation blocks between each selected keyword and the different neighbors.

To describe the relations between the keyword $t_i$ and the neighbor $NN_j$, some previous work (Siddiquie et al., 2011) set $\phi(t_i, NN_j)$ to be the output of a trained concept detector concerning the presence or absence of $t_i$ in $NN_j$. In our study, we formulate $\phi(t_i, NN_j)$ in an unsupervised manner, considering the frequency, co-occurrence, and semantic similarity of $t_i$ with respect to $NN_j$.

Based on the frequency of $t_i$ in the training images, we can estimate the probability of annotating $NN_j$ with $t_i$. An indicator function could represent this information, that is, the probability of each keyword given an image would be 0 or 1. However, this is an absolute representation, whereas an image is labeled by a keyword only with some probability. Therefore, we adopt a probabilistic measure and estimate the probability of $t_i$ given $NN_j$ using a multiple Bernoulli model (Feng et al., 2004):

$$P(t_i \mid NN_j) = \frac{\mu\, \delta_{t_i, NN_j} + n_{t_i}}{\mu + n}, \qquad (11)$$

where $\mu$ is a smoothing factor, $\delta_{t_i, NN_j} = 1$ if $t_i$ occurs in the annotations of $NN_j$ and 0 otherwise, $n_{t_i}$ denotes the number of training images labeled with $t_i$, and $n$ is the total number of training images.

Generally speaking, if $t_i$ has a high co-occurrence with the annotation keywords of $NN_j$, their corresponding concepts are more likely to be relevant. In our work, we compute the normalized co-occurrence between two keywords in a way similar to Sigurbjörnsson and van Zwol (2008):

$$S_{co}(t, \hat{t}) = \frac{tf(t, \hat{t})}{tf(\hat{t})}, \qquad (12)$$

where $t$ and $\hat{t}$ are two keywords, $tf(\hat{t})$ denotes the total frequency of $\hat{t}$ in the training set, and $tf(t, \hat{t})$ is the number of images containing both $t$ and $\hat{t}$. In fact, $S_{co}(t, \hat{t})$ approximates the probability of an image being annotated with $t$ given that it was annotated with $\hat{t}$. The co-occurrence between $t_i$ and $NN_j$ is subsequently computed by looking for the "closest" keyword of $NN_j$ with respect to $t_i$:

$$R_{co}(t_i, NN_j) = \max_{\hat{t} \in NN_j} S_{co}(t_i, \hat{t}). \qquad (13)$$

Keyword co-occurrence reflects only a local statistical correlation dependent on the training set. We further capture the semantic correlation between $t_i$ and $NN_j$ via the WordNet lexicon. Lin's (1998) similarity measure is used to estimate the semantic similarity between two keywords:

$$S_{Lin}(t, \hat{t}) = \frac{2 \cdot IC(lcs(t, \hat{t}))}{IC(t) + IC(\hat{t})}, \qquad (14)$$

where $IC(t)$ and $IC(\hat{t})$ are the information content of $t$ and $\hat{t}$, respectively, and $lcs(t, \hat{t})$ is their least common subsumer in WordNet, that is, the common ancestor having the maximum information content. Then, similar to Equation (13), the semantic similarity between $t_i$ and $NN_j$ is given as

$$R_{wn}(t_i, NN_j) = \max_{\hat{t} \in NN_j} S_{Lin}(t_i, \hat{t}). \qquad (15)$$

Following Equations (11), (13), and (15), the exact form of $\phi(t_i, NN_j)$ is a three-dimensional vector:

$$\phi(t_i, NN_j) = \left[ P(t_i \mid NN_j),\; R_{co}(t_i, NN_j),\; R_{wn}(t_i, NN_j) \right]^T. \qquad (16)$$

As a result, the final dimension of the feature representation $\Psi(x, y)$ is $3K$ when we consider the $K$ nearest neighbors of $x$.
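As an illustration, the following Python sketch assembles $M_x$ and $\Psi(x, y)$. Only Equation (11) is spelled out; the helper interfaces for Equations (13) and (15) are assumptions about how one might organize the computation, not the authors' code.

```python
import numpy as np

def smoothed_presence(t, nn_keywords, n_t, n, mu=10.0):
    """Multiple-Bernoulli estimate P(t | NN_j) of Equation (11)."""
    delta = 1.0 if t in nn_keywords else 0.0
    return (mu * delta + n_t) / (mu + n)

def build_Mx(neighbors, vocab, P, R_co, R_wn):
    """Assemble the relation matrix M_x of Equation (10) as a c-by-3K array.

    `neighbors` lists the K nearest neighbors of x and `vocab` the c keywords.
    P, R_co, and R_wn are callables implementing Equations (11), (13), and (15).
    """
    c, K = len(vocab), len(neighbors)
    M = np.zeros((c, 3 * K))
    for i, t in enumerate(vocab):
        for j, nn in enumerate(neighbors):
            # phi(t_i, NN_j): frequency, co-occurrence, WordNet similarity
            M[i, 3 * j:3 * j + 3] = (P(t, nn), R_co(t, nn), R_wn(t, nn))
    return M

def psi(M_x, y):
    """Joint feature vector Psi(x, y) = M_x^T y, of dimension 3K."""
    return M_x.T @ y
```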

Learning With Structural SVM

In this subsection, we use the structural SVM framework to find the optimal scoring function $F$ for our LKPS. In analogy to the linear SVM, the structural SVM determines the weight vector $w$ of $F$ by solving the following quadratic programming problem (Tsochantaridis et al., 2006).

Optimization Problem 1 (structural SVM):

$$\min_{w,\, \xi_i \ge 0} \; \frac{1}{2} \|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \qquad (17)$$

subject to

$$\forall i,\; \forall y \in \mathcal{Y} \setminus y_i: \quad w^T \Psi(x_i, y_i) \ge w^T \Psi(x_i, y) + \Delta(y_i, y) - \xi_i. \qquad (18)$$

In the given formulation, constraint (18) requires that, for a training example $(x_i, y_i)$, if the scoring value of an incorrect label vector $y$ is greater than that of the ground truth $y_i$, that is, $w^T \Psi(x_i, y) > w^T \Psi(x_i, y_i)$, then the corresponding slack variable $\xi_i$ must be at least $\Delta(y_i, y)$. Under this constraint, it can be shown that the average of the slack variables, $\frac{1}{n} \sum_{i=1}^{n} \xi_i$, is an upper bound on the empirical risk $R_\Delta$ in Equation (8) (Tsochantaridis et al., 2006). Therefore, the parameter $C$ in the objective function (Equation 17) essentially controls the tradeoff between the model complexity $\|w\|^2$ and the corresponding empirical risk.

The main difficulty in solving this optimization problem is that there is an exponential number of constraints with respect to the total number of unique keywords: for each training image, each incorrect label vector is associated with a constraint. In this article, we use the cutting plane algorithm (Joachims et al., 2009) to solve the optimization problem. The idea behind the algorithm is to find a subset of constraints such that the solution for this subset also fulfills all the constraints at a precision of $\varepsilon$. The pseudocode is presented in Algorithm 2. For each training example $(x_i, y_i)$, it starts with an empty working set $V_i$ of active constraints (line 1) and iteratively finds the $y$ that generates the most violated constraint for $x_i$ with the current $w$ (line 5). If the corresponding constraint is violated by more than $\varepsilon$, the algorithm adds $y$ to the working set $V_i$ and then reoptimizes Equation (17) using the constraints in the updated working sets (lines 7–9). Note that the problem at this step differs only in a single constraint from iteration to iteration; we therefore restart the optimizer from the current solution, which greatly reduces the run time. Algorithm 2 is guaranteed to terminate within $O(1/\varepsilon^2)$ iterations for any desired precision $\varepsilon$ (Tsochantaridis et al., 2006).

ALGORITHM 2 Cutting plane algorithm.

Input: $(x_1, y_1), \ldots, (x_n, y_n)$, $C$, $\varepsilon$
Output: $w$
1: Initialize $V_i \leftarrow \emptyset$ for all $i = 1, \ldots, n$
2: repeat
3:   for $i = 1, \ldots, n$ do
4:     $H(y; w) \equiv \Delta(y_i, y) + w^T \Psi(x_i, y)$
5:     Compute $\hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}} H(y; w)$
6:     Compute $\xi_i = \max\{0,\, \max_{y \in V_i} H(y; w)\}$
7:     if $H(\hat{y}; w) > \xi_i + \varepsilon$ then
8:       $V_i \leftarrow V_i \cup \{\hat{y}\}$
9:       Optimize (17) over $V = \bigcup_i V_i$
10:    end if
11:  end for
12: until no $V_i$ has changed during an iteration
13: return $w$
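A minimal Python sketch of the outer loop of Algorithm 2 follows. The quadratic programming solver `fit_qp` and the constraint generator `most_violated` are assumed helpers (the latter corresponding to Algorithm 3 below), so only the control flow is shown.

```python
import numpy as np

def cutting_plane(examples, eps, fit_qp, most_violated, psi, loss):
    """Outer loop of Algorithm 2, under stated assumptions.

    `fit_qp(working_sets)` is assumed to re-solve Equation (17) restricted
    to the active constraints and return the new w; `psi` and `loss`
    implement Equations (10) and (9), respectively.
    """
    V = [[] for _ in examples]                 # working sets of constraints
    w = np.zeros_like(psi(*examples[0]))       # starting point (assumption)
    changed = True
    while changed:                             # "until no V_i has changed"
        changed = False
        for i, (x, y) in enumerate(examples):
            H = lambda yy: loss(y, yy) + float(w @ psi(x, yy))
            y_hat = most_violated(x, y, w)     # line 5
            xi = max([0.0] + [H(v) for v in V[i]])  # line 6
            if H(y_hat) > xi + eps:            # line 7
                V[i].append(y_hat)             # line 8
                w = fit_qp(V)                  # line 9: re-optimize Eq. (17)
                changed = True
    return w
```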

In Algorithm 2, we need to find the most violated constraint in each iteration, that is, to solve the maximization problem in line 5:

$$\hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}} \; \Delta(y_i, y) + w^T \Psi(x_i, y). \qquad (19)$$



By introducing Equations (9) and (10) into this formula, the problem becomes

$$\hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}} \; z_i^T y, \quad \text{where } z_i = M_{x_i} w - \frac{2\, y_i}{\|y_i\|_2^2 + \|y\|_2^2}. \qquad (20)$$

Here, it should be noted that the value of $z_i$ depends only on the number of nonzero elements in $y$. Suppose that we restrict the range of $y$ to the set of label vectors with exactly $k$ nonzero elements, that is, $\mathcal{Y}_k = \{y \in \mathcal{Y} : \|y\|_2^2 = k\}$. In this case, $z_i$ becomes a constant with respect to $y$; thus, Equation (20) can be solved efficiently by finding the indexes of the largest $k$ entries of $z_i$ and setting the corresponding elements of $y$ to 1. Finally, we repeat the process for $k = 1, \ldots, c$, and $\hat{y}$ is the optimal solution over all iterations. Algorithm 3 describes these steps in detail. It has been shown that finding the largest $k$ elements of a list of size $c$ can be done in $O(c)$ time (Knuth, 1998). Therefore, the overall complexity of Algorithm 3 is $O(c^2)$.

ALGORITHM 3 Most violated constraint generation.

Input: $(x_i, y_i)$, $w$, $M_{x_i}$
Output: $\hat{y}$
1: Initialize $max = -\infty$
2: for $k = 1, \ldots, c$ do
3:   $z_i = M_{x_i} w - \frac{2\, y_i}{\|y_i\|_2^2 + k}$
4:   Compute $\tilde{y} = \operatorname*{argmax}_{y \in \mathcal{Y}_k} z_i^T y$
5:   Compute $current = z_i^T \tilde{y}$
6:   if $current > max$ then
7:     $max = current$
8:     $\hat{y} = \tilde{y}$
9:   end if
10: end for
11: return $\hat{y}$
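The following Python sketch mirrors Algorithm 3. Sorting $z_i$ for each $k$ is a simplification over the $O(c)$ selection the paper cites, so this version runs in $O(c^2 \log c)$ rather than the stated $O(c^2)$.

```python
import numpy as np

def most_violated(M_x, w, y_true):
    """Most violated constraint search of Algorithm 3 / Equation (20).

    Within Y_k (label vectors with exactly k keywords), z_i is constant,
    so the best y simply activates the k largest entries of z_i.
    """
    c = M_x.shape[0]
    norm_true = float(y_true @ y_true)                 # ||y_i||^2
    best_val, best_y = -np.inf, None
    for k in range(1, c + 1):
        z = M_x @ w - 2.0 * y_true / (norm_true + k)   # Eq. (20), ||y||^2 = k
        top = np.argsort(-z)[:k]                       # k largest entries of z
        y = np.zeros(c)
        y[top] = 1.0
        val = float(z @ y)
        if val > best_val:
            best_val, best_y = val, y
    return best_y
```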

Once the weight vector $w$ is learned, we can predict the label vector for a new image $x_{\mathrm{new}}$ by solving Equation (6). The problem can be transformed into a form similar to Equation (20):

$$y^* = \operatorname*{argmax}_{y \in \mathcal{Y}} \; z_{\mathrm{new}}^T y, \quad \text{where } z_{\mathrm{new}} = M_{x_{\mathrm{new}}} w. \qquad (21)$$

Because $z_{\mathrm{new}}$ here is a constant vector independent of $y$, the solution $y^*$ is the binary vector that indicates the positive elements of $z_{\mathrm{new}}$, which can be found in $O(c)$ time. Moreover, it is worth noting that $z_{\mathrm{new}}$ can be regarded as a relevance scoring vector, whose $i$th element reflects the relevance of the $i$th keyword with respect to $x_{\mathrm{new}}$. Therefore, according to $z_{\mathrm{new}}$, we can further rank all the keywords or select a fixed number of the most relevant ones for $x_{\mathrm{new}}$, as is usually required in most testing scenarios for the image annotation task.

Experiments

In this section, we report the experimental results of the proposed approach for image annotation. As described earlier, our approach simultaneously improves both phases of the nearest-neighbor–based scheme. For simplicity, we denote it as Improved-NN in the following.

Data Sets

To evaluate Improved-NN, we conduct extensive experiments on two publicly available data sets: the expert-annotated benchmark data set Corel 5K and the real-world user-tagged image data set MIR Flickr.

The Corel 5K data set was first used by Duygulu, Barnard, de Freitas, and Forsyth (2002). Since then, it has become a benchmark for image annotation and has been widely used in many methods. As a result, we can directly compare the experimental results of our approach against the published results of previous studies. The data set contains around 5,000 images, split into 4,500 training images and 499 testing images. The annotations associated with the images were generally produced by professional assessors following well-defined procedures. Each image is annotated with one to five keywords. In total, 260 distinct keywords appear in both the training and testing sets.

The MIR Flickr data set was recently introduced (Huiskes & Lew, 2008) for evaluating image retrieval methods. The data set consists of 25,000 images and their associated tags, crawled from Flickr. Many tags occur only rarely in the data set (fewer than five times) and are usually misspellings or specific names. Although these tags may contain valuable information for users, they are meaningless for general annotation or retrieval tasks. Therefore, in our experiments, the collection of tags was limited to those appearing at least 20 times. This left 1,386 tags, and the average number of tags per image was 3.77. We took half of the total images for training, using their associated tags as training labels, and the rest for testing, ignoring the user-contributed tags associated with them. Experimental results were evaluated on 38 concepts defined in MIR Flickr, each of which corresponds to a frequent tag describing a scene or an object of the image content, such as car or sky. The ground truth of these concepts for all images is provided with the data set.

In this article, each image in both data sets was represented by the same visual features as described by Guillaumin et al. (2009). The representation includes two types of global image features: Gist and color histograms. The color histograms were extracted in three different color spaces: RGB, LAB, and HSV, which are the most commonly used color spaces in computer vision and have been applied in many image annotation studies (Makadia et al., 2008; Zhang et al., 2012). We divided the color histograms into 16 bins in each color channel, yielding 16³ = 4,096-dimensional histograms for each image. These features



characterize the image content from a global view. Because local features can capture more semantic content of images than global features, the scale-invariant feature transform (SIFT) and a robust hue descriptor were also adopted as two local features. Both were extracted on multiscale grids and around Harris–Laplacian interest points. Each local descriptor was quantized into visual words using K-means clustering; an image was then represented as a "bag-of-words" histogram. Furthermore, to capture the spatial layout of images, all histogram features except Gist were also computed over three horizontal regions of the image, and the resulting three histograms were concatenated to form a new feature. It should be mentioned that these new color histograms were requantized to 12 bins in each color channel, yielding 3 × 12³ = 5,184-dimensional histograms for each image. The choice of 12 bins is a compromise between limiting the feature size and avoiding excessive loss of color distribution information. Finally, a total of 15 visual features were extracted from each image. In our approach, the distance in each feature space needs to be defined separately. Specifically, we used the L2 distance as the base metric for Gist, the L1 distance for color histograms, and the χ² statistic for the others. Table 1 summarizes the basic information about our data sets.

Evaluation Criteria

To compare the results of different annotation methods, we adopted the standard performance measures widely used in previous work, where the quality of the predicted annotations is evaluated by retrieving test images using the keywords or concepts in the annotation vocabulary.

On the Corel 5K data set, for ease of comparison with published results, we annotated each test image with the five most relevant keywords, as done by Makadia et al. (2008), Guillaumin et al. (2009), and Zhang et al. (2012). Then, for a keyword $t$, its precision and recall were computed as follows:

$$Precision(t) = \frac{N_t}{N_p} \qquad (22)$$

$$Recall(t) = \frac{N_t}{N_r}, \qquad (23)$$

where $N_t$ denotes the number of images correctly annotated with $t$, $N_p$ denotes the number of images predicted to have $t$, and $N_r$ is the number of images annotated with $t$ in the ground truth. The average precision and average recall were computed over all keywords as two evaluation measures. In addition, we also considered a measure of the coverage of correctly annotated keywords, that is, the number of keywords with nonzero recall, denoted "NonWord" for short.
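A straightforward Python sketch of these measures follows; the mapping-based data layout (image-to-keyword-set dictionaries) is our assumption.

```python
import numpy as np

def corel_metrics(pred, truth, vocab):
    """Per-keyword precision and recall (Equations 22-23), averages, NonWord.

    `pred` and `truth` map each test image to its set of predicted and
    ground-truth keywords, respectively.
    """
    precisions, recalls, nonzero = [], [], 0
    for t in vocab:
        N_t = sum(1 for img in pred if t in pred[img] and t in truth[img])
        N_p = sum(1 for img in pred if t in pred[img])
        N_r = sum(1 for img in truth if t in truth[img])
        p = N_t / N_p if N_p else 0.0
        r = N_t / N_r if N_r else 0.0
        precisions.append(p)
        recalls.append(r)
        nonzero += int(r > 0)                 # contributes to "NonWord"
    return float(np.mean(precisions)), float(np.mean(recalls)), nonzero
```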

On MIR Flickr, because the ground truth is related only to the predefined concepts, we no longer selected a fixed number of tags from the whole vocabulary for each test image. Instead, given a concept, we ranked all test images according to their predicted relevance to the concept. With the ranked list, we then used the following three common information retrieval metrics (Manning, Raghavan, & Schütze, 2008) to evaluate the annotation performance:

• Precision at 10 (P@10): P@10 is defined as the proportion of relevant images among the top 10 ranked results:

$$P@10 = \frac{\#\text{ of relevant images in the top 10 results}}{10}. \qquad (24)$$

• Break-even point (BEP): Often called R-precision, BEP measures the percentage of relevant images within the top $R$ ranking positions:

$$BEP = \frac{\#\text{ of relevant images in the top } R \text{ results}}{R}, \qquad (25)$$

where $R$ is the number of images relevant to the concept in the ground truth. Compared with P@10, BEP avoids overrating the results for concepts with many relevant images.

• Average precision (AP): AP measures ranking quality of thewhole list. It is calculated as

\[ \mathrm{AP} = \frac{\sum_{i=1}^{l} P@i \times \delta_i}{R}, \tag{26} \]

where l is the length of the ranked list, δ_i = 1 if the ith image is relevant, and δ_i = 0 otherwise. In our experiments, we reported the average P@10 and BEP, as well as the mean value of AP (MAP) over all concepts, to evaluate the overall performance. A small sketch of these three metrics appears below.
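The sketch computes all three metrics from a single ranked list; the binary relevance-vector representation is our own framing, and the list is assumed to contain at least 10 images:

import numpy as np

def ranking_metrics(relevance):
    # relevance: binary vector ordered by predicted score (1 = relevant image)
    rel = np.asarray(relevance, dtype=float)
    R = int(rel.sum())                       # relevant images in ground truth
    p_at_10 = rel[:10].sum() / 10.0          # Equation (24)
    bep = rel[:R].sum() / R if R else 0.0    # Equation (25)
    prec_at_i = np.cumsum(rel) / np.arange(1, len(rel) + 1)  # P@i at every cutoff
    ap = (prec_at_i * rel).sum() / R if R else 0.0           # Equation (26)
    return p_at_10, bep, ap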

Parameter Settings

In this subsection, we study the influence of the parameters on the performance of Improved-NN. There are three parameters to be tuned in our approach: the number of neighbors considered in the nearest-neighbor–based scheme K, the smoothing factor μ in Equation (11), and the regularization parameter C in Equation (17).

In our preliminary experiments, we experimented with several values for the parameter μ and found consistently superior performance when μ varied between 5 and 15. Here, we let μ = 10.

TABLE 1. Summary of the experimental data sets.

Data set and feature   Description
Corel 5K               5,000 images and 260 keywords
MIR Flickr             25,000 images and 1,386 tags
Color histogram        In RGB, LAB, and HSV spaces with two spatial layouts; using L1-distance
Gist                   Using L2-distance
Sift histogram         Dense and interest point sampling with two spatial layouts; using χ² statistic
Hue histogram          Same as above


For K and C, their optimal values were chosen via an exhaustive grid search with 5-fold cross-validation on the training set. First, we used the Corel 5K data set to optimize the parameter C, which balances the model complexity and the empirical risk in the LKPS. Figure 1 shows the impact of C on different performance metrics when K = 150, and we can see that the best performance is achieved when C has a relatively small value (C = 0.1 or C = 1.0). This indicates that a model trained with more emphasis on model robustness than on low training loss can perform better in our problem. In the experiments, we also found that the best choice of C is not sensitive to the change of K. Therefore, we set C = 0.1 in our later experiments.
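The outer loop of such a search might look like the sketch below; train_and_score is a caller-supplied stand-in for fitting Improved-NN on one fold and returning a validation score, and the candidate grids shown are illustrative rather than the exact values we searched:

import numpy as np
from itertools import product
from sklearn.model_selection import KFold

def grid_search(train_images, train_and_score,
                K_grid=(30, 50, 100, 150, 200, 300, 500),
                C_grid=(0.01, 0.1, 1.0, 10.0)):
    # Exhaustive grid search over (K, C) with 5-fold cross-validation
    best_params, best_score = None, -np.inf
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for K, C in product(K_grid, C_grid):
        fold_scores = [train_and_score(train_images, fit_idx, val_idx, K=K, C=C)
                       for fit_idx, val_idx in kf.split(train_images)]
        if np.mean(fold_scores) > best_score:
            best_params, best_score = (K, C), np.mean(fold_scores)
    return best_params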

Figure 2 presents the performance comparisons by varying K from 30 to 500 on the Corel 5K data set. From the results, we can see that the performance is desirable when K = 200, which is in accordance with a previously published report (Guillaumin et al., 2009). It can also be observed that a very small or large value of K degrades the performance. This is reasonable because a small number of neighbors cannot provide sufficient information to reflect the characteristics of a new image, whereas too many neighbors may introduce information irrelevant to that image. In addition, Figure 3 shows the effect of K on the MIR Flickr data set, where a similar performance variation trend can be observed as K changes, and the optimal value of K is around 500. Therefore, we set K = 200 for the Corel 5K data set and K = 500 for the MIR Flickr data set in the rest of the experiments.

Comparisons With State-of-the-Art Methods

To demonstrate the efficacy of Improved-NN for image annotation, we compared its performance with that of state-of-the-art methods on both data sets.

On Corel 5K, we directly compared the results obtained by Improved-NN against the results of previously published studies. Table 2 shows the performance comparisons of different methods.

FIG. 1. Effect of the variation of C on Corel 5K. The horizontal axis gives the different values of C. Bars present AP and recall according to the left axis, and the line denotes NonWord according to the right axis.

FIG. 2. Effect of the variation of K on Corel 5K. The horizontal axis stands for the different values of K, and the vertical axis is identical with that in Figure 1.


As can be seen in Table 2, Improved-NN outperforms 9 of the 10 listed methods in all evaluation criteria. The maximum performance increases are 17%, 19%, and 51 in terms of Precision, Recall, and NonWord, whereas the minimum gains still reach 3%, 5%, and 12, respectively. These improvements suggest the validity of Improved-NN in the image annotation task. The only method that we fail to match is the TagProp model, to which Improved-NN is inferior in terms of Recall and NonWord. This may be attributed to the fact that, to boost its performance in Recall, TagProp adopts word-specific models and separately modulates the parameters for each keyword, whereas in our setting, we do not adjust the method configuration for a single keyword.

In our experiments, we performed paired t tests (Smucker, Allan, & Carterette, 2007) to analyze whether the improvements of Improved-NN over other methods were statistically significant. On Corel 5K, we conducted paired t tests between Improved-NN and each of CRM, MBRM, JEC, LASSO, and GS, because their source code was available or the experimental settings were described in enough detail, allowing us to reproduce the results for each keyword. Table 3 lists the p values of the paired t tests in terms of precision and recall, respectively. The results indicate that the improvement of Improved-NN over the other methods was statistically significant at a significance level of .05.

On MIR Flickr, where there are no published results available, we compared Improved-NN with the following three approaches for image annotation: TagProp (Guillaumin et al., 2009), TagRelevance (Li et al., 2009), and kNN-SGSSL (Tang et al., 2011). In addition, we also included comparisons with a baseline method, which used an SVM to learn a binary classifier for each concept and ranked the test images by the classifier output score. The SVM classifier was trained with an RBF kernel that defined the kernel value of two images as k(x_i, x_j) = exp(−d(x_i, x_j)/λ), where d(x_i, x_j) denotes their average distance over the different visual feature spaces and λ is the average of all pairwise distances between training images. A summary of the experimental results is presented in Table 4.
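As a sketch, this baseline can be reproduced with a precomputed kernel; the helper below follows the kernel definition given above, with the upper-triangle mean as our reading of "the average of all pairwise distances":

import numpy as np
from sklearn.svm import SVC

def svm_baseline_scores(D_train, y, D_test):
    # D_train: (n, n) average distances between training images over all feature spaces
    # D_test:  (m, n) distances from test images to training images
    lam = D_train[np.triu_indices_from(D_train, k=1)].mean()  # lambda: mean pairwise distance
    K_train = np.exp(-D_train / lam)   # k(x_i, x_j) = exp(-d(x_i, x_j) / lambda)
    K_test = np.exp(-D_test / lam)
    clf = SVC(kernel="precomputed")
    clf.fit(K_train, y)                # y: binary labels for one concept
    return clf.decision_function(K_test)  # scores used to rank the test images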

From Table 4, we can see that Improved-NN achieves the best performance, whereas TagRelevance obtains the worst. Perhaps surprisingly, the results of TagRelevance are even worse than those of the baseline SVM classifier on all measures.

FIG. 3. Effect of the variation of K on MIR Flickr. The horizontal axis gives the different values of K.

TABLE 2. Performance comparisons between our approach and previously published work on Corel 5K.

Methods                                             Precision %   Recall %   NonWord
CRM (Lavrenko et al., 2004)                         16            19         107
InfNet (Metzler & Manmatha, 2004)                   17            24         112
NPDE (Yavlinsky, Schofield, & Rüger, 2005)          18            21         114
SML (Carneiro, Chan, Moreno, & Vasconcelos, 2007)   23            29         137
TGLM (Liu et al., 2009)                             25            29         131
MBRM (Feng et al., 2004)                            24            25         122
JEC (Makadia et al., 2008)                          27            32         139
LASSO (Makadia et al., 2008)                        24            29         127
TagProp (Guillaumin et al., 2009)                   33            42         160
GS (Zhang et al., 2012)                             30            33         146
Improved-NN                                         33            38         158

TABLE 3. Paired t test results on Corel 5K.

            CRM     MBRM    JEC     LASSO   GS
Precision   <.001   <.001   <.001   <.001   .0316
Recall      <.001   <.001   .0107   <.001   .0224


A possible reason is that TagRelevance adopts the Euclidean distance to look for the nearest neighbors, which may lead to inaccurate neighbor search results. The other three approaches are all superior to the SVM classifier. Compared with kNN-SGSSL, the proposed Improved-NN gains improvements of 2.9%, 4.3%, and 1.9% in terms of BEP, MAP, and P@10, respectively, whereas in contrast with TagProp, Improved-NN achieves comparable performance in P@10 but is much better in BEP and MAP. Table 5 shows the results of paired t tests between Improved-NN and the other methods on MIR Flickr. According to the p values, we can see that although Improved-NN does not differ significantly from TagProp in terms of P@10, it outperforms its counterparts on all the other measures with statistical significance at the .05 level. In summary, among these comparative methods, Improved-NN can provide more satisfactory annotations for users, especially when dealing with real-world web image repositories.

Benefits From Individual Approach Components

Different from previous methods, our Improved-NN simultaneously improves both phases of the nearest-neighbor–based scheme through the two proposed approach components, that is, RNSM and LKPS. In this subsection, we intend to investigate the efficacy of each individual component for image annotation. To this end, we design a group of modified versions of the original Improved-NN, each of which is constructed by replacing either RNSM or LKPS with alternative techniques. For clarity, we list the modified methods as follows:

• Distance + LKPS (DL): Instead of using RNSM, it determines the nearest neighbors by the average distance between two images over the different visual feature spaces.

• RankSVM + LKPS (RL): This approach substitutes RNSM with the RankSVM algorithm (Herbrich, Graepel, & Obermayer, 1999) for producing the ordering of all labeled images.

• EqualWeight + LKPS (EL): This method adopts a ranking algorithm similar to our RNSM, but weights all preference pairs equally.

• RNSM + Majority (RM): This method replaces LKPS with a heuristic rule, which propagates keywords according to the majority voting of the nearest neighbors.

• RNSM + Weighted (RW): It goes a step further than the previous method by assigning different voting weights to the neighbors. The weight of a neighbor is set according to its ranking position in RNSM (a small sketch of both voting rules follows this list).
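As referenced above, a minimal sketch of the two voting rules follows; the 1/rank weighting in RW is our own illustrative choice, since the article only states that weights depend on ranking position:

from collections import Counter

def majority_vote(neighbor_tags, top=5):
    # RM: every neighbor votes equally; keep the most frequent keywords
    votes = Counter(t for tags in neighbor_tags for t in tags)
    return [t for t, _ in votes.most_common(top)]

def weighted_vote(neighbor_tags, top=5):
    # RW: a neighbor's vote is weighted by its ranking position.
    # neighbor_tags is ordered best-first; 1/rank is a plausible weight,
    # not necessarily the one used in the article.
    votes = Counter()
    for rank, tags in enumerate(neighbor_tags, start=1):
        for t in tags:
            votes[t] += 1.0 / rank
    return [t for t, _ in votes.most_common(top)]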

A series of experiments was conducted by applying the modified methods described above to the image annotation task. Figure 4 gives the performance comparisons between the original Improved-NN and the modified methods on the Corel 5K data set.

As shown in Figure 4, compared with Improved-NN, the performance of all modified methods shows some decline in almost every evaluation measure. However, in contrast with the results listed in Table 2, although implemented with only RNSM or LKPS involved, these modified methods still outperform most of the existing approaches. Therefore, it can be concluded that each individual component of our Improved-NN can offer significant help in improving the performance of image annotation.

In Figure 4, the first three modified methods replace RNSM with other techniques to look for the nearest neighbors, among which RL and EL achieve better performance than DL. This indicates that during the process of nearest-neighbor search, the ranking-oriented approaches (e.g., RL and EL) may find more reliable neighbors than conventional distance-oriented approaches like DL. We can also see that EL gains a slight improvement in NonWord against Improved-NN, but loses a lot in terms of precision and recall, respectively. This may be attributed to the fact that, unlike the original implementation of RNSM, EL assigns equal weight to all preference pairs. As a result, it is difficult for EL to sufficiently ensure the correctness of the top-ranked results, which will be subsequently selected as the nearest neighbors. Although the coverage of the keywords associated with these neighbors may become larger, the accuracy of the annotated keywords is sacrificed to some extent. Such results prove the importance of focusing more on the ordering of the top-ranked results for the success of our RNSM.

In the experiments, RM and RW are the two modified methods that select the propagated keywords by the rule of neighbor voting instead of the proposed LKPS. According to their performance, we find that weighted voting performs slightly better than majority voting in the phase of keyword propagation.

TABLE 4. Results of different annotation methods on MIR Flickr.

Methods         BEP %   MAP %   P@10 %
SVM             29.1    26.4    56.3
TagRelevance    28.5    26.0    55.0
kNN-SGSSL       30.0    27.1    56.6
TagProp         31.5    29.4    58.6
Improved-NN     32.9    31.4    58.5

TABLE 5. Paired t test results on MIR Flickr.

        SVM     TagRelevance   kNN-SGSSL   TagProp
BEP     <.001   <.001          .0076       .0279
MAP     <.001   <.001          <.001       .0012
P@10    .0334   .0140          .0471       .3925


In addition, it can be observed that, compared with Improved-NN, the performance of RM and RW, where LKPS is replaced, shows a larger decrease than that of the preceding three methods, where RNSM is replaced. Therefore, we may conclude that LKPS makes a greater contribution than RNSM to the effectiveness of our Improved-NN.

Figure 5 shows the results of the different approaches on the MIR Flickr data set. As expected, Improved-NN still outperforms all modified methods. Besides, in contrast with the results shown in Figure 4, we can see that, on average, the relative performance gap between Improved-NN and the modified methods increases slightly from 6.7% on Corel 5K to 8.4% on MIR Flickr. Meanwhile, many results obtained by the modified methods seem inferior to those of the previous work shown in Table 4. This suggests that, confronted with a large amount of web images with noisy tags, it is necessary to rely on the collaboration of RNSM and LKPS to achieve a desirable performance for image annotation.

Computational Cost

Lastly, we examine the efficiency of the proposed framework for image annotation. All experiments were run on a server equipped with a 2.20 GHz Intel Xeon processor (12 cores) and 12 GB RAM. We measured the run time for extracting visual features on both data sets, and Table 6 reports the average extraction time for each feature per image. In total, it took about 934.7 ms to extract all visual features from an image. The extraction process was carried out with our Matlab scripts.

In our approach, most of the computational cost comes from the time for training models in both phases of the nearest-neighbor–based scheme. For the RNSM, as shown in Algorithm 1, the time complexity of the training step is O(T[gn + mn]), where T denotes the number of iterations, g the number of ranking features, n the number of training images, and m the number of preference pairs associated with each training image.

FIG. 4. Performance comparisons between Improved-NN and different modified models on Corel 5K. Bars present precision and recall according to the left axis, as well as NonWord according to the right axis.

FIG. 5. Performance comparisons between Improved-NN and different modified models on MIR Flickr.


To balance model accuracy and time complexity, we chose T = 200 and m = 50, respectively. The algorithm was implemented in Java,¹ and all well-performing models could be trained within 2 hours on both data sets. For the learning-based keyword propagation strategy, during training, Algorithm 2 loops for O(1/ε²) iterations, and in each iteration, Algorithm 3 is called at most O(n) times, with a time complexity of O(c²) per call, where c is the total number of unique keywords. For ease of development, we implemented the algorithms using a Python interface² to SVMstruct, and it took approximately 2 hours to train a robust model on the Corel 5K data set and 4.5 hours on the MIR Flickr data set. The process can be further sped up by adopting recently developed, more efficient optimizers (Teo, Vishwanthan, Smola, & Le, 2010) or by using the standard C implementation³ of SVMstruct.

Once training is completed, given a new image, our approach can first find its nearest neighbors in O(n) time and then produce its predicted labels in O(c) time. According to the elapsed time measured during testing, our approach took an average of 20.9 and 91.7 ms, respectively, to accomplish the label prediction for a new image on the Corel 5K data set and the MIR Flickr data set. This means that our trained model can be used interactively by users without any perceived delay.

Conclusions

In this article, we sought improved image annotation accuracy through adaptation of the nearest-neighbor–based scheme. In the neighbor search phase, we introduced a ranking-oriented neighbor search mechanism (RNSM), which uses the LTR framework to directly optimize the relative ordering of labeled images rather than their absolute distances with respect to a given image. In the keyword propagation phase, we proposed a learning-based keyword propagation strategy (LKPS), which derives a scoring function from training data to evaluate the relevance of keywords according to their multiple relations with the nearest neighbors. Experimental results have shown that the performance of our approach is superior to that of many state-of-the-art methods. Furthermore, it has also been shown that each individual component of our approach has a positive effect on improving the image annotation performance.

In the future, we plan to experiment with other LTR algorithms for improving the neighborhood quality in our RNSM. We also intend to incorporate the kernel method into the current structural SVM framework in our LKPS. Finally, we will further investigate the scalability of our approach by experimenting on larger image data sets.

Acknowledgments

This work was supported by the Natural Science Foundation of China (60970047, 61103151, 61272240), the Doctoral Fund of the Ministry of Education of China (20110131110028), and the Natural Science Foundation of Shandong Province (ZR2012FM037, 2012BSB01550).

References

Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D. M., & Jordan, M. I. (2003). Matching words and pictures. Journal of Machine Learning Research, 3, 1107–1135.

Bertelli, L., Yu, T., Vu, D., & Gokturk, B. (2011). Kernelized structural SVM learning for supervised object segmentation. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (pp. 2153–2160). Los Alamitos, CA, USA: IEEE Computer Society.

Blaschko, M. B., & Lampert, C. H. (2008). Learning to localize objects with structured output regression. In Proceedings of the Tenth European Conference on Computer Vision (pp. 2–15). Marseille, France: Springer-Verlag.

Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005). Learning to rank using gradient descent. In Proceedings of the Twenty-Second International Conference on Machine Learning (pp. 89–96). New York, NY, USA: ACM.

Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., & Li, H. (2007). Learning to rank: From pairwise approach to listwise approach. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (pp. 129–136). New York, NY, USA: ACM.

Carneiro, G., Chan, A., Moreno, P., & Vasconcelos, N. (2007). Supervised learning of semantic classes for image annotation and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29, 394–410.

¹ http://people.cs.umass.edu/vdang/ranklib.html
² http://tfinley.net/software/svmpython2/
³ http://www.cs.cornell.edu/people/tj/svm_light/svm_struct.html

TABLE 6. Average extraction time (in milliseconds) of each feature per image.

Feature                   Extraction time   Feature               Extraction time
Dense sift                52.2              RGB histogram         18.0
Dense sift (r)            72.0              RGB histogram (r)     19.3
Interest point sift       39.0              LAB histogram         3.8
Interest point sift (r)   73.1              LAB histogram (r)     3.9
Dense hue                 97.3              HSV histogram         6.8
Dense hue (r)             144.1             HSV histogram (r)     8.3
Interest point hue        103.2             Gist                  121.8
Interest point hue (r)    172.3             Total                 934.7

Note. The mark (r) indicates that the feature is computed over three horizontal regions of an image, as described in the Data Sets subsection.


Chang, E., Goh, K., Sychay, G., & Wu, G. (2003). CBSA: Content-based soft annotation for multimodal image retrieval using Bayes point machines. IEEE Transactions on Circuits and Systems for Video Technology, 13, 26–38.

Cusano, C., Ciocca, G., & Schettini, R. (2004). Image annotation using SVM. In Proceedings of the SPIE (Vol. 5304, pp. 330–338). San Jose, CA, USA: SPIE.

Duygulu, P., Barnard, K., De Freitas, J., & Forsyth, D. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of the Seventh European Conference on Computer Vision (Vol. 4, pp. 97–112). Berlin, Heidelberg: Springer.

Fan, J., Gao, Y., & Luo, H. (2007). Hierarchical classification for automatic image annotation. In Proceedings of the Thirtieth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 111–118). New York, NY, USA: ACM.

Feng, S., Manmatha, R., & Lavrenko, V. (2004). Multiple Bernoulli relevance models for image and video annotation. In Proceedings of the 2004 IEEE Conference on Computer Vision and Pattern Recognition (Vol. 2, pp. 1002–1009). Los Alamitos, CA, USA: IEEE Computer Society.

Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.

Grangier, D., & Bengio, S. (2008). A discriminative kernel-based approach to rank images from text queries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 1371–1384.

Guillaumin, M., Mensink, T., Verbeek, J., & Schmid, C. (2009). TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In Proceedings of the International Conference on Computer Vision (pp. 309–316). Los Alamitos, CA, USA: IEEE Computer Society.

Herbrich, R., Graepel, T., & Obermayer, K. (1999). Large margin rank boundaries for ordinal regression. Advances in Neural Information Processing Systems, 88, 115–132.

Huiskes, M. J., & Lew, M. S. (2008). The MIR Flickr retrieval evaluation. In Proceedings of the First ACM International Conference on Multimedia Information Retrieval (pp. 39–43). New York, NY, USA: ACM.

Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4), 422–446.

Jeon, J., Lavrenko, V., & Manmatha, R. (2003). Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the Twenty-Sixth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 119–126). New York, NY, USA: ACM.

Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 133–142). New York, NY, USA: ACM.

Joachims, T., Finley, T., & Yu, C.-N. (2009). Cutting-plane training of structural SVMs. Machine Learning, 77, 27–59.

Knuth, D. E. (1998). The art of computer programming: Fundamental algorithms (Vol. 1). Redwood City, CA, USA: Addison Wesley Longman Publishing.

Lavrenko, V., Manmatha, R., & Jeon, J. (2004). A model for learning the semantics of pictures. In Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press.

Li, L., Zhou, K., Xue, G.-R., Zha, H., & Yu, Y. (2011). Video summarization via transferrable structured learning. In Proceedings of the Twentieth International Conference on World Wide Web (pp. 287–296). New York, NY, USA: ACM.

Li, P., Burges, C. J. C., & Wu, Q. (2007). McRank: Learning to rank using multiple classification and gradient boosting. In Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems (pp. 845–852). Red Hook, NY, USA: Curran Associates Inc.

Li, X., Chen, L., Zhang, L., Lin, F., & Ma, W.-Y. (2006). Image annotation by large-scale content-based image retrieval. In Proceedings of the Fourteenth Annual ACM International Conference on Multimedia (pp. 607–610). New York, NY, USA: ACM.

Li, X., Snoek, C., & Worring, M. (2009). Learning social tag relevance by neighbor voting. IEEE Transactions on Multimedia, 11(7), 1310–1322.

Lin, D. (1998). An information-theoretic definition of similarity. In Proceedings of the Fifteenth International Conference on Machine Learning (Vol. 1, pp. 296–304). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Liu, J., Li, M., Liu, Q., Lu, H., & Ma, S. (2009). Image annotation via graph learning. Pattern Recognition, 42, 218–228.

Liu, J., Li, M., Ma, W.-Y., Liu, Q., & Lu, H. (2006). An adaptive graph model for automatic image annotation. In Proceedings of the Eighth ACM International Workshop on Multimedia Information Retrieval (pp. 61–70). New York, NY, USA: ACM.

Liu, T.-Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3, 225–331.

Liu, Y., & Jin, R. (2006). Distance metric learning: A comprehensive survey. Technical report, Mahwah, NJ, USA: Michigan State University.

Makadia, A., Pavlovic, V., & Kumar, S. (2008). A new baseline for image annotation. In Proceedings of the Tenth European Conference on Computer Vision (pp. 316–329). Berlin, Heidelberg: Springer-Verlag.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. New York, NY, USA: Cambridge University Press.

Mei, T., Wang, Y., Hua, X.-S., Gong, S., & Li, S. (2008). Coherent image annotation by learning semantic distance. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–8). Los Alamitos, CA, USA: IEEE Computer Society.

Metzler, D., & Manmatha, R. (2004). An inference network approach to image retrieval. In Proceedings of the International Conference on Image and Video Retrieval (pp. 42–50). Berlin, Heidelberg: Springer.

Monay, F., & Gatica-Perez, D. (2004). PLSA-based image auto-annotation: Constraining the latent space. In Proceedings of the Twelfth Annual ACM International Conference on Multimedia (pp. 348–351). New York, NY, USA: ACM.

Niu, S., Guo, J., Lan, Y., & Cheng, X. (2012). Top-k learning to rank: Labeling, ranking and evaluation. In Proceedings of the Thirty-Fifth International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 751–760). New York, NY, USA: ACM.

Nowozin, S., & Lampert, C. H. (2011). Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6, 185–365.

Qi, G.-J., Hua, X.-S., Rui, Y., Tang, J., Mei, T., & Zhang, H.-J. (2007). Correlative multi-label video annotation. In Proceedings of the Fifteenth International Conference on Multimedia (pp. 17–26). New York, NY, USA: ACM.

Rockafellar, R. (1996). Convex analysis (Vol. 28). Princeton, NJ, USA: Princeton University Press.

Schapire, R. E. (1999). Theoretical views of boosting and applications. In Algorithmic Learning Theory (Vol. 1720, Lecture Notes in Computer Science, pp. 13–25). Berlin, Heidelberg: Springer.

Siddiquie, B., Feris, R., & Davis, L. (2011). Image ranking and retrieval based on multi-attribute queries. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (pp. 801–808). Los Alamitos, CA, USA: IEEE Computer Society.

Sigurbjörnsson, B., & van Zwol, R. (2008). Flickr tag recommendation based on collective knowledge. In Proceedings of the Seventeenth International Conference on World Wide Web (pp. 327–336). New York, NY, USA: ACM.

Smucker, M. D., Allan, J., & Carterette, B. (2007). A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (pp. 623–632). New York, NY, USA: ACM.

Tang, J., Hong, R., Yan, S., Chua, T.-S., Qi, G.-J., & Jain, R. (2011). Image annotation by kNN-sparse graph-based label propagation over noisily tagged web images. ACM Transactions on Intelligent Systems and Technology, 2(2), 14:1–14:15.


Teo, C. H., Vishwanthan, S., Smola, A. J., & Le, Q. V. (2010). Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11, 311–365.

Torralba, A., Fergus, R., & Freeman, W. (2008). 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), 1958–1970.

Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2006). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453–1484.

Wang, C., Jing, F., Zhang, L., & Zhang, H.-J. (2006). Image annotation refinement using random walk with restarts. In Proceedings of the Fourteenth ACM International Conference on Multimedia (pp. 647–650). New York, NY, USA: ACM.

Wang, C., Jing, F., Zhang, L., & Zhang, H.-J. (2008). Scalable search-based image annotation. Multimedia Systems, 14(4), 205–220.

Wang, X.-J., Zhang, L., Jing, F., & Ma, W.-Y. (2006). AnnoSearch: Image auto-annotation by search. In Proceedings of the 2006 IEEE Conference on Computer Vision and Pattern Recognition (Vol. 2, pp. 1483–1490). Los Alamitos, CA, USA: IEEE Computer Society.

Wright, J., Ma, Y., Mairal, J., Sapiro, G., Huang, T., & Yan, S. (2010). Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE, 98, 1031–1044.

Wu, L., Hoi, S. C., Jin, R., Zhu, J., & Yu, N. (2009). Distance metric learning from uncertain side information with application to automated photo tagging. In Proceedings of the Seventeenth ACM International Conference on Multimedia (pp. 135–144). New York, NY, USA: ACM.

Wu, P., Hoi, S. C.-H., Zhao, P., & He, Y. (2011). Mining social images with distance metric learning for automated image tagging. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (pp. 197–206). New York, NY, USA: ACM.

Xia, F., Liu, T.-Y., & Li, H. (2009). Statistical consistency of top-k ranking. In Proceedings of the Twenty-Third Annual Conference on Neural Information Processing Systems (pp. 2098–2106). Red Hook, NY, USA: Curran Associates Inc.

Xia, F., Liu, T.-Y., Wang, J., Zhang, W., & Li, H. (2008). Listwise approach to learning to rank: Theory and algorithm. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (pp. 1192–1199). New York, NY, USA: ACM.

Yakhnenko, O., & Honavar, V. (2008). Annotating images and image objects using a hierarchical Dirichlet process model. In Proceedings of the Ninth International Workshop on Multimedia Data Mining (pp. 1–7). New York, NY, USA: ACM.

Yang, Y., Huang, Z., Shen, H., & Zhou, X. (2011). Mining multi-tag association for image tagging. World Wide Web, 14, 133–156.

Yang, Y., Yang, Y., Huang, Z., Shen, H. T., & Nie, F. (2011). Tag localization with spatial correlations and joint group sparsity. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (pp. 881–888). Los Alamitos, CA, USA: IEEE Computer Society.

Yang, Y., Yang, Y., & Shen, H. T. (2013). Effective transfer tagging from image to video. ACM Transactions on Multimedia Computing, Communications, and Applications, 9, 14:1–14:20.

Yavlinsky, A., Schofield, E., & Rüger, S. (2005). Automated image annotation using global features and robust nonparametric density estimation. In Proceedings of the International Conference on Image and Video Retrieval (pp. 507–517). Berlin, Heidelberg: Springer.

Zhang, S., Huang, J., Li, H., & Metaxas, D. (2012). Automatic image annotation and retrieval using group sparsity. IEEE Transactions on Systems, Man, and Cybernetics, 42, 838–849.
