IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 4, JUNE 2014 1115

Image Attribute Adaptation

Yahong Han, Yi Yang, Zhigang Ma, Haoquan Shen, Nicu Sebe, Senior Member, IEEE, and

Xiaofang Zhou, Senior Member, IEEE

Abstract—Visual attributes can be considered as a middle-level semantic cue that bridges the gap between low-level image features and high-level object classes. Thus, attributes have the advantage of transcending specific semantic categories and describing objects across categories. Since attributes are often human-nameable and domain specific, much work constructs attribute annotations ad hoc or takes them from an application-dependent ontology. To facilitate other applications with attributes, it is necessary to develop methods that can adapt a well-defined set of attributes to novel images. In this paper, we propose a framework for image attribute adaptation. The goal is to automatically adapt the knowledge of attributes from a well-defined auxiliary image set to a target image set, thus assisting in predicting appropriate attributes for target images. In the proposed framework, we use a non-linear mapping function corresponding to multiple base kernels to map each training image of both the auxiliary and the target sets to a Reproducing Kernel Hilbert Space (RKHS), where we reduce the mismatch between the data distributions of auxiliary and target images. To make use of unlabeled images, we incorporate a semi-supervised learning process. We also introduce a robust loss function into our framework to remove the shared irrelevance and noise of training images. Experiments on two pairs of auxiliary-target image sets demonstrate that the proposed framework predicts attributes for target testing images more accurately than three baselines and two state-of-the-art domain adaptation methods.

Index Terms—Image attributes, domain adaptation, transfer learning, semi-supervised learning, multiple kernel learning, robust multiple kernel regression.

I. INTRODUCTION

LEARNING-BASED object recognition has become a major focus of computer vision and multimedia retrieval [1], [2], [3]. For example, tagging images has been well

Manuscript received May 05, 2013; revised September 11, 2013; accepted January 17, 2014. Date of publication February 12, 2014; date of current version May 13, 2014. This work was supported in part by NSFC (under Grant 61202166), the National Program on the Key Basic Research Project (under Grant 2013CB329301), and the Doctoral Fund of the Ministry of Education of China (under Grant 20120032120042). The work of Y. Yang was supported in part by the UQ Early Career Researcher (ECR) Grants Scheme (under Grant 2013002401). The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Alan Hanjalic.

Y. Han is with the School of Computer Science and Technology and the Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University, Tianjin, China (e-mail: [email protected]).

Y. Yang and X. Zhou are with the School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Australia (e-mail: [email protected]; [email protected]).

Z. Ma and N. Sebe are with the Department of Information Engineering and Computer Science, University of Trento, Trento, Italy (e-mail: [email protected]; [email protected]).

H. Shen is with the College of Computer Science, Zhejiang University, Hangzhou, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2014.2306092

explored in recent years, which also facilitates the tasks of image semantic understanding and image retrieval [4], [5], [6]. However, assigning labels or tags to images provides only limited semantic descriptions of objects. Moreover, from a machine learning point of view, tagging images via trained models can only predict labels within the training set and thus lacks the ability to infer, for testing images, semantics not present in the training set. Much recent work has therefore proposed to progress beyond “labels” to richer semantic descriptions of images, such as attributes [7], [8].

Visual attributes [9] are observable properties (e.g., “red”, “furry”, and “round”) in images, which provide more informative descriptions of object classes. Visual attributes and object classes are both visual concepts, but visual attributes can be taken as a middle-level semantic cue that bridges the gap between low-level image features and high-level object classes. In recent studies, attributes have been shown to be beneficial for improving object categorization [7], generating more informative descriptions of images [10], and facilitating image retrieval [11]. Furthermore, as a middle-level semantic cue, attributes can be used to describe objects both within and across object categories [7], and even in the case of unseen object classes [7], [8]. For example, in [8], unseen object classes (disjoint training and testing classes) are detected by mining class-attribute relations. Because attributes transcend specific semantic categories, they can be used to enable richer textual descriptions for new images [7], [10]. From the above discussion, it can be seen that automatic learning and recognition of attributes can augment category-level recognition and therefore improve the degree to which machines perceive visual objects [9].

Attributes are often human-nameable and domain specific.

In much previous work, the sets of attributes used are either constructed ad hoc or taken from an application-dependent ontology [7], [8], [12]. The dataset “Animals with Attributes” (AwA) [8] uses existing class-attribute association data from work in cognitive science [13]. Since the images in these benchmarks come from specific categories and vary little in visual and semantic content, existing attribute sets can be used directly to describe them. To scale attributes to a large number of web images, in [12], attributes for specific types of objects are automatically discovered by extracting the highly ranked associated texts of web images. However, because the associated texts of web images are often noisy, good attribute discovery performance is obtained only for images from some specific shopping websites.

Thus the current status is: on the one hand, there are many sets of attributes that are domain specific (like the attributes in AwA), and these attributes have been manually labeled on image sets of moderate size; on the other hand, the attributes of abundant web images need to be learned to facilitate other applications with


attributes. As attributes have the advantage of transcending specific semantic categories and describing objects across categories, it is both possible and necessary to develop methods that can adapt a well-defined set of attributes to another set of images with different visual and semantic content.

This paper explores the automatic adaptation of available labeled

attributes and the prediction of appropriate attributes for novel images, which is a typical application of domain adaptation [14], [15], [16], [17], [18]. Taking an image set with well-defined attributes as the auxiliary domain (or source domain) and another image set whose attributes are to be learned as the target domain, we want to adapt the knowledge in the auxiliary domain to the target images and thereby help improve the attribute prediction performance on target images. We therefore call our method Image Attribute Adaptation (IAA). For example, we can learn auxiliary classifiers from auxiliary image sets, from which the target classifier for attribute prediction is adapted [14], [17]. Because attributes are a middle-level semantic cue and transcend specific semantic categories, one attribute may correspond to a latent space or a shared space of common features extracted from both the auxiliary and target images [19]. Attributes can be adapted based on such a space [15], [16]. As auxiliary images and target images may follow different distributions, in this paper we use the Maximum Mean Discrepancy (MMD) criterion [20] to reduce the mismatch, measured by the distance between the means of samples from different domains in a Reproducing Kernel Hilbert Space (RKHS).

The flowchart of the proposed framework of image attribute

adaptation is illustrated in Fig. 1. Our goal is to predict attributes for target testing images, given a well-defined auxiliary image set labeled with attributes and a target training image set. The setting of the target training images is as follows: given a small number of attribute-labeled images and a large number of unlabeled images, we manually label some target images with attributes to form the set of labeled target training images. To describe different visual aspects of images, we extract multiple types of visual features (e.g., color, texture, visual words, and edges) to represent each image sample in the auxiliary and target datasets. As these heterogeneous types of features may have diverse distributions and statistical properties, we employ multiple kernels to map each training sample in the auxiliary and target datasets to the corresponding Reproducing Kernel Hilbert Space (RKHS) through a non-linear mapping function. To make use of the unlabeled images in the target training set, we perform local regression on both unlabeled target training images and labeled images (in both the auxiliary and target datasets) to capture the local structure [1], [21]. This local regression induces a semi-supervised learning process in our framework, which further improves the performance of attribute adaptation. By simultaneously minimizing the mismatch of distributions between the two domains and optimizing the local regression process, we learn a target decision function for attribute adaptation. In particular, we represent target testing images in the RKHS using the learned kernel combination coefficients. The decision values for predicting the attributes of testing images are then obtained through a multiple kernel regression.

In the experiments, two pairs of auxiliary-target image sets

are used to evaluate the proposed framework of attribute adaptation: (1) Images with well-defined attributes sampled from the AwA dataset [8] are used as the auxiliary dataset. For target images, we submitted the 50 animal (class) names in AwA as 50 query keywords to ImageNet and downloaded images for each class. We then collected these images to form the target image dataset. (2) aPascal [7] and aYahoo [7] are taken as the auxiliary and target image sets, respectively. Experimental results demonstrate that the proposed framework performs better than the baselines and the state-of-the-art methods.

In summary, in this paper we propose a framework of Image

Attribute Adaptation (IAA) by collaboratively taking advantage of both transfer learning and semi-supervised learning to improve the performance of attribute prediction for new images. The main contributions of our work are summarized below:

(1) A cross-domain visual concept learning framework, IAA, is proposed to automatically adapt knowledge from a well-defined auxiliary image set labeled with attributes to a target image set, with the goal of improving the performance of attribute prediction for novel images. As a typical application of domain adaptation, image attribute adaptation is important to attribute-related research.

(2) Compared to common domain adaptation methods, the novel contributions of our IAA framework are: (a) a semi-supervised learning process is incorporated to utilize the labeled and unlabeled data in the auxiliary and target domains simultaneously; (b) to remove the shared irrelevance and noise, we introduce a robust loss function into our IAA framework.

The rest of this paper is organized as follows. In Section II, we briefly review recent related work. The framework of image attribute adaptation and the solution of its objective function are introduced in Section III and Section IV, respectively. The experimental analysis is given in Section V. Conclusions are drawn in Section VI.

II. RELATED WORK

A. Learning with Attributes

Attribute-based methods have shown promising potential in object recognition and have recently been receiving much attention [7], [8], [10], [22], [23]. In [7], the authors propose to first select features that can predict attributes within an object class and then use only those features to train the attribute classifier. Since one attribute may correlate with multiple object classes, this feature selection process is very time-consuming. To alleviate this problem, a multi-task learning approach has been proposed in [19] to learn a shared low-dimensional feature space across both visual attributes and object classes.

The broad variety of attributes requires different types of features to describe different visual aspects [7]. To overcome this diversity, some works in object recognition design invariant and robust features [24]. Because different features are discriminative for different object classes, fusing a set of diverse and complementary features is widely used [4], [17], [25]. Multiple kernel learning (MKL) methods have recently shown great advantage in this task [4], [17].

In previous work on attribute learning, the methods proposed in [7] and [10] can also be used to predict attributes for testing images. However, because the relative orderings of attributes need to be manually labeled, the method in [10] cannot


be applied to the attribute adaptation scenario proposed in this paper. Though the method in [7] can be used for “across category” prediction of attributes, its model for attribute prediction is trained only on the auxiliary image set; our framework differs in that we train the model on the training images of both the auxiliary and target sets. Moreover, because the method in [7] trains attribute classifiers on the auxiliary dataset and directly tests them on the target dataset, it fails to address how to reduce the mismatch between the data distributions of the auxiliary and target datasets, which is the key issue of attribute adaptation. Transfer methods for attribute-based classification have been proposed in [8], [26]. However, the goal of the methods in [8], [26] is to predict classes rather than attributes for testing images, which differs from our proposed attribute adaptation method. A method of correlated attribute transfer has been proposed in [27], whose goal is similar to that of our proposed method. However, because a linear feature selection matrix is learned directly from the source images and the distribution mismatch between source and target images is not optimized, its performance of attribute prediction for target test images is constrained.

B. Domain Adaptation Methods

Domain adaptation aims to adapt a classifier or regression model trained in a source domain for use in a target domain, where the source and target domains may be different but related [15]. Two categories of techniques can be employed to attain the goal of domain adaptation: classifier adaptation [3], [14], [17] and feature representation adaptation [16], [28]. In [3], multiple feature selection functions of different tasks are simultaneously learned in a joint framework, which enables the proposed algorithm to utilize the common knowledge of multiple tasks. A-SVMs [14] have been proposed to adapt an auxiliary classifier to the target domain by learning a “delta function” between the decision functions of the auxiliary classifiers and those of the new classifier, using an objective function extended from standard SVMs. Motivated by A-SVMs, a new domain adaptation method, A-MKL, has been proposed in [17] to learn a robust target classifier adapted from a set of prelearned (auxiliary) classifiers. Recently, the knowledge adaptation method in [18] was proposed to optimize the concept classifier and the event detector jointly.

As discussed in Section II-A, the broad variety of attributes requires different types of features to describe different visual aspects. This variety may be enlarged when adapting between source and target domains. Therefore, it is better to adopt feature representation adaptation for image attribute adaptation. A key issue in feature representation adaptation is how to reduce the difference between the distributions of the source and target domain data. Borgwardt et al. [20] have proposed a straightforward criterion, namely Maximum Mean Discrepancy (MMD), for comparing distributions based on a Reproducing Kernel Hilbert Space (RKHS). As we fuse different types of features using multiple kernels in an RKHS for images from the source and target domains, MMD can be utilized for image attribute adaptation. From the above discussion, the proposed framework of Image Attribute Adaptation (IAA) is a typical application of domain adaptation, or cross-domain visual concept learning.

III. PROPOSED FRAMEWORK OF IMAGE ATTRIBUTE ADAPTATION

In this section, we present the details of feature representation adaptation in an RKHS. We then explore how to effectively capture local structures through a local regression process. To learn a decision function for attribute prediction on target testing images, feature representation adaptation and local regression are combined with a robust multiple kernel regression. Finally, we arrive at the whole framework of image attribute adaptation.

A. Terms and Notations

In this work, we extract multiple heterogeneous visual features from each image. For example, we can extract color, texture, visual words, and edges as base features and represent each image by combining them together. Suppose we are given an auxiliary image set $\mathcal{A}=\{(x_i^A, y_i^A)\}_{i=1}^{n_A}$ with well-defined attributes, where $y_i^A$ indicates whether the image $x_i^A$ is labeled with a certain attribute. Let $\mathcal{T}=\mathcal{T}_l\cup\mathcal{T}_u$ denote the training image set of the target domain, where $\mathcal{T}_l$ and $\mathcal{T}_u$ are the subsets of labeled and unlabeled training images, respectively, and the size of $\mathcal{T}$ is $n_T$. Given a testing image $x^*$ in the target domain, our goal is to learn from $\mathcal{A}\cup\mathcal{T}$ an attribute decision function $f(x^*)$. Supposing the target image set is under the marginal data distribution $\mathcal{P}^T$ and the auxiliary image set is under the marginal data distribution $\mathcal{P}^A$, the key issue of attribute adaptation from auxiliary images to target images is how to reduce the mismatch between $\mathcal{P}^A$ and $\mathcal{P}^T$.

In the following, the transpose of a vector/matrix is denoted by the superscript $^{\top}$ and the trace of a matrix is represented as $\operatorname{tr}(\cdot)$. We define $I_n$ as an $n$-by-$n$ identity matrix. $\mathbf{0}_n$ and $\mathbf{1}_n$ are $n$-by-1 vectors of all zeros and ones, respectively. The inequality $u \preceq v$ means that $u_i \le v_i$ for $i=1,\ldots,n$. The element-wise product between vectors $u$ and $v$ is represented as $u \odot v$. We let $\|u\|_p$ denote the $\ell_p$-norm, which is defined as $\|u\|_p=(\sum_i |u_i|^p)^{1/p}$.

B. Feature Representation Adaptation

As the feature representations of images from the auxiliary and target domains follow different marginal distributions, a straightforward approach to image attribute adaptation is to minimize the mismatch between these distributions. In this way, we can learn an optimal common feature space for domain adaptation. To attain this goal, we use a nonparametric criterion called Maximum Mean Discrepancy (MMD) [20] to compare data distributions based on the distance between the means of samples from the two domains in a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$, which has been shown to be effective in domain adaptation [17], [15]. The MMD criterion is:

$$\mathrm{MMD}(\mathcal{A},\mathcal{T})=\left\|\frac{1}{n_A}\sum_{i=1}^{n_A}\varphi(x_i^A)-\frac{1}{n_T}\sum_{j=1}^{n_T}\varphi(x_j^T)\right\|_{\mathcal{H}}^{2}, \qquad (1)$$

where $x_i^A$ and $x_j^T$ are images from the auxiliary and target domains, respectively, and $\|\cdot\|_{\mathcal{H}}$ denotes the norm in $\mathcal{H}$. A kernel function $k(x_i,x_j)=\varphi(x_i)^{\top}\varphi(x_j)$ is induced from the nonlinear feature mapping function $\varphi(\cdot)$.


Fig. 1. The flowchart of image attribute adaptation. We extract multiple types of visual features to represent each image sample in the auxiliary and target datasets. Multiple kernels are employed to map each training sample in the auxiliary and target datasets to the corresponding Reproducing Kernel Hilbert Space (RKHS), through the nonlinear mapping function $\varphi$. The local structure among auxiliary and target training images (labeled and unlabeled) is captured by a local regression process. Representing the target testing images in the RKHS using the learned kernel combination coefficients, attributes are adapted from the auxiliary and target training images to the target testing images.

To simplify the MMD criterion in Eq. (1), we define a column vector $\mathbf{s}\in\mathbb{R}^{n_A+n_T}$ whose first $n_A$ entries equal $1/n_A$ and whose remaining $n_T$ entries equal $-1/n_T$; then Eq. (1) can be transformed to [20]

$$\mathrm{MMD}(\mathcal{A},\mathcal{T})=\operatorname{tr}(KS), \qquad (2)$$

where $S=\mathbf{s}\mathbf{s}^{\top}$ and

$$K=\begin{bmatrix}K^{A,A}&K^{A,T}\\K^{T,A}&K^{T,T}\end{bmatrix},$$

where $K^{A,A}$ and $K^{T,T}$ are the kernel matrices defined for the auxiliary and target domains, respectively, and $K^{A,T}=(K^{T,A})^{\top}$ is the kernel matrix defined for the cross-domain from the auxiliary images to the target images.

As discussed above, in this work the broad variety of

attributes requires different types of features to describe different visual aspects. To effectively fuse different types of features for attribute adaptation, we employ multiple base kernels [4], [17] and assume the kernel in Eq. (1) is a linear combination of a set of base kernels $k_m$, namely,

$$k=\sum_{m=1}^{M}d_m k_m,$$

where $d_m\ge 0$ and $\sum_{m=1}^{M}d_m=1$, and the $k_m$ ($m=1,\ldots,M$) are base kernels derived from both the auxiliary and target images. Thus, the multiple kernel version of the MMD criterion is:

$$\mathrm{MMD}(\mathcal{A},\mathcal{T})=\operatorname{tr}(KS)=\mathbf{d}^{\top}\mathbf{p}, \qquad (3)$$

where $p_m=\operatorname{tr}(K_mS)$ and $\mathbf{d}=[d_1,\ldots,d_M]^{\top}$ is the vector of kernel combination coefficients. Through $\varphi(\cdot)$ and the induced multiple base kernels, the samples $x_i^A$ and $x_j^T$ are projected into a higher dimensional or even infinite-dimensional space, where we can capture the higher order statistics of images from the two domains. When we minimize $\mathrm{MMD}(\mathcal{A},\mathcal{T})$ to be close to zero, the data distributions of the two domains are close to each other [20] and the feature representations from the two domains therefore become matched.
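For concreteness, the sketch below evaluates the MMD of Eqs. (2) and (3) from precomputed kernel matrices in Python; the function names are our own, and the construction assumes the auxiliary rows come first in each kernel matrix.

```python
import numpy as np

def mmd_squared(K, n_a, n_t):
    """Squared MMD of Eq. (2): s^T K s = tr(K S) with S = s s^T, where the
    column vector s holds 1/n_a for auxiliary rows and -1/n_t for target rows."""
    s = np.concatenate([np.full(n_a, 1.0 / n_a), np.full(n_t, -1.0 / n_t)])
    return s @ K @ s

def multi_kernel_mmd(base_kernels, d, n_a, n_t):
    """Multiple kernel MMD of Eq. (3): with K = sum_m d_m K_m, the criterion
    is linear in the kernel weights, i.e. d^T p with p_m = tr(K_m S)."""
    p = np.array([mmd_squared(Km, n_a, n_t) for Km in base_kernels])
    return d @ p
```

Because the criterion is linear in $\mathbf{d}$, minimizing it alone would put all weight on the base kernel with the smallest $p_m$; it is therefore balanced against the regression terms of the full objective introduced below.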

C. Capturing Local Structures by Local Regression

In this section, we perform local regression to capture the local structures among both the unlabeled target training images and the labeled images (in both the auxiliary and target datasets). The basic intuition is that if two images are close to each other in the feature space, they are likely to contain common visual aspects and should be assigned similar attributes.


Suppose there are $c$ attributes to be adapted. We define a predicted label indicator matrix $F=[f_1,\ldots,f_n]^{\top}\in\mathbb{R}^{n\times c}$ for all the $n$ images in the auxiliary set and target training set (both labeled and unlabeled). Note that $f_i\in\mathbb{R}^{c}$ ($i=1,\ldots,n$) is the predicted label vector of image $x_i$. Inspired by [1], we construct a local clique $\mathcal{N}_i$ for each image $x_i$, which comprises $k+1$ data points, including $x_i$ and its $k$ nearest neighbors. We define a linear regression model $W_i^{\top}x_j+b_i$ to predict $f_j$ for each image $x_j\in\mathcal{N}_i$, where $W_i$ is the local projection matrix and $b_i$ is the bias term. We therefore propose to optimize the following regularized regression to capture the local structures:

$$\min_{W_i,b_i}\sum_{x_j\in\mathcal{N}_i}\left\|W_i^{\top}x_j+b_i-f_j\right\|_2^{2}+\lambda\left\|W_i\right\|_F^{2}, \qquad (4)$$

where $\|\cdot\|_F$ denotes the Frobenius norm, $\lambda$ is a regularization parameter, and the regularization term $\lambda\|W_i\|_F^2$ is imposed to avoid overfitting. Letting $X_i$ be the data matrix of the images in $\mathcal{N}_i$ and $F_i$ be the corresponding predicted label indicator matrix, the objective function in Eq. (4) can be rewritten, summed over all local cliques, as:

$$\min_{F,\{W_i,b_i\}}\sum_{i=1}^{n}\left(\left\|X_i^{\top}W_i+\mathbf{1}b_i^{\top}-F_i\right\|_F^{2}+\lambda\left\|W_i\right\|_F^{2}\right). \qquad (5)$$

Note that optimizing the objective function in Eq. (5) only induces smoothness over the local structures of the auxiliary and target training images. To better utilize the unlabeled training images in the target domain, $F$ should also be consistent with the ground truth labels of the training data. Letting $Y\in\mathbb{R}^{n\times c}$ be the matrix of label ground truth, we propose to optimize the following objective function:

$$\min_{F,\{W_i,b_i\}}\sum_{i=1}^{n}\left(\left\|X_i^{\top}W_i+\mathbf{1}b_i^{\top}-F_i\right\|_F^{2}+\lambda\left\|W_i\right\|_F^{2}\right)+\beta\operatorname{tr}\left((F-Y)^{\top}U(F-Y)\right), \qquad (6)$$

where $\beta$ is a trade-off parameter that regulates the balance between the two terms, $U$ is a diagonal matrix whose diagonal element $U_{ii}=1$ if $x_i$ is labeled and $U_{ii}=0$ otherwise, and the rows of $Y$ corresponding to labeled images (in $\mathcal{A}$ and $\mathcal{T}_l$) contain their ground-truth label vectors. Eq. (6) can be rewritten as:

$$\min_{F}\ \operatorname{tr}\left(F^{\top}LF\right)+\beta\operatorname{tr}\left((F-Y)^{\top}U(F-Y)\right), \qquad (7)$$

where $L$ is a Laplacian-like matrix distilled from the local regressions; its explicit form is derived in Section IV-A.

It is clear that optimizing the objective function in Eq. (7) simultaneously captures the local structures and induces a semi-supervised learning process, which better utilizes the unlabeled target images for image attribute adaptation. In the following section, we integrate the objective functions in Eqs. (7) and (3) into a regularized multiple kernel regression and arrive at the whole framework of image attribute adaptation.
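As a concrete illustration of the semi-supervised step, Eq. (7) is an unconstrained quadratic in $F$ and admits a closed-form minimizer; the Python sketch below (hypothetical names; a small ridge is added on the assumption that $L$ is only positive semi-definite) solves it directly.

```python
import numpy as np

def solve_label_matrix(L, Y, labeled_mask, beta=100.0, eps=1e-8):
    """Minimize tr(F^T L F) + beta * tr((F - Y)^T U (F - Y)) over F (Eq. (7)).
    L: (n, n) matrix distilled from the local regressions; Y: (n, c) label
    ground truth with zero rows for unlabeled images; U: diagonal indicator
    of labeled images. Zeroing the gradient gives (L + beta * U) F = beta * U Y."""
    n = L.shape[0]
    U = np.diag(labeled_mask.astype(float))
    A = L + beta * U + eps * np.eye(n)  # ridge for numerical safety
    return np.linalg.solve(A, beta * U @ Y)
```

Rows of the solution belonging to unlabeled images receive labels propagated through the local structures encoded in $L$, which is exactly the semi-supervised effect described above.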

D. Robust Multiple Kernel Regression for Attribute Adaptation

To predict attributes for target testing images, in this section we show how to learn the decision function for testing images. As multiple base kernels are used in the MMD criterion in Eq. (3), we employ a multiple kernel regression to learn the decision function for target testing images. It has been shown that the $\ell_1$-norm and its derived sparse loss functions are more robust to outliers and noise than the $\ell_2$-norm [18]. Recent studies have shown that sparsity-inducing norms can also be used to select multiple base kernels [29]. In this way, we can remove the shared irrelevance and noise in non-linear spaces by replacing the squared loss function in multiple kernel regression with a sparse loss function. Specifically, we propose to minimize an $\ell_{2,p}$-norm loss function to reduce the negative impact of the irrelevance and noise. The parameter $p$ controls the degree to which the negative effects of noisy data are reduced: the lower $p$ is, the fewer negative effects are imposed on the whole training process. Since for $p\le 1$ the $\ell_{2,p}$-norm has the robust property that the regression errors introduced by noisy data are not squared, we form a robust multiple kernel regression as follows:

$$\min_{\mathbf{W},\mathbf{b}}\ \left\|\varphi(X)^{\top}\mathbf{W}+\mathbf{1}\mathbf{b}^{\top}-F\right\|_{2,p}^{p}+\theta\operatorname{tr}\left(\mathbf{W}^{\top}\mathbf{W}\right), \qquad (8)$$

where $\theta$ is a trade-off parameter that regulates the balance between the two terms, $\mathbf{W}$ is the regression coefficient matrix, and $\mathbf{b}$ is the bias vector. Note that we use boldface $\mathbf{W}$ and $\mathbf{b}$ to distinguish them from the $W_i$ and $b_i$ of Eqs. (4)-(6). The term $\operatorname{tr}(\mathbf{W}^{\top}\mathbf{W})$ is the regularization term of multiple kernel ridge regression. Given the estimated $\mathbf{W}$, $\mathbf{b}$, and $\mathbf{d}$, the decision function for a testing image $x^{*}$ in the target domain is:

$$f(x^{*})=\mathbf{W}^{\top}\varphi(x^{*})+\mathbf{b}, \qquad (9)$$

where $\varphi(x^{*})$ is represented using the learned kernel combination coefficients $\mathbf{d}$. To this end, we integrate the MMD criterion of Eq. (3) and the semi-supervised term of Eq. (7) into the robust multiple kernel regression in Eq. (8), and arrive at the whole framework of image attribute adaptation as follows:

$$\min_{\mathbf{d},\mathbf{W},\mathbf{b},F}\ \mathrm{MMD}(\mathcal{A},\mathcal{T})+\left\|\varphi(X)^{\top}\mathbf{W}+\mathbf{1}\mathbf{b}^{\top}-F\right\|_{2,p}^{p}+\theta\operatorname{tr}\left(\mathbf{W}^{\top}\mathbf{W}\right)+\gamma\left[\operatorname{tr}\left(F^{\top}LF\right)+\beta\operatorname{tr}\left((F-Y)^{\top}U(F-Y)\right)\right], \qquad (10)$$

where $\gamma$ is a trade-off parameter. In the following section, we introduce how to solve the optimization problem in Eq. (10). Since there are multiple variables to be estimated, we propose an efficient alternating algorithm for image attribute adaptation.
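For intuition about the robust loss, the non-smooth $\ell_{2,p}$ term in Eqs. (8) and (10) is typically handled by iteratively reweighted least squares: with the residuals fixed, the loss is replaced by a weighted squared loss whose diagonal weight matrix is recomputed each round (Eq. (14) in Section IV plays this role). The sketch below shows the standard construction for $\ell_{2,p}$ losses; it is our illustration under that assumption, not necessarily the authors' exact matrix.

```python
import numpy as np

def l2p_reweight(R, p=1.0, eps=1e-8):
    """Diagonal reweighting matrix D for an l2,p loss: with r_i the i-th
    residual row, D_ii = (p / 2) * ||r_i||^(p - 2). Minimizing tr(R^T D R)
    with D fixed matches the gradient of sum_i ||r_i||^p, so alternating
    the two steps handles the non-squared robust loss. eps guards against
    vanishing residual rows when p < 2."""
    norms = np.maximum(np.linalg.norm(R, axis=1), eps)
    return np.diag(0.5 * p * norms ** (p - 2))
```

A lower $p$ shrinks the weight of rows with large residuals, which is precisely why outlying training images contribute less to the fit.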


IV. SOLUTION AND ALGORITHM

A. Optimization and Algorithm

Though the objective function in Eq. (10) is not jointly convex w.r.t. all of $\mathbf{d}$, $\mathbf{W}$, $\mathbf{b}$, $F$, and the local variables $W_i$ and $b_i$, it is convex w.r.t. each variable when the others are fixed. Thus, we propose an alternating algorithm to solve the optimization problem.

Firstly, $W_i$ and $b_i$ affect only the local regression part of the objective function in Eq. (10). Therefore, we first fix the other variables and set the derivative of the objective function w.r.t. $W_i$ and $b_i$ to zero, respectively. Then we have:

$$W_i=\left(X_iH_iX_i^{\top}+\lambda I\right)^{-1}X_iH_iF_i \quad\text{and}\quad b_i=\frac{1}{k+1}\left(F_i^{\top}\mathbf{1}-W_i^{\top}X_i\mathbf{1}\right),$$

where $H_i=I_{k+1}-\frac{1}{k+1}\mathbf{1}\mathbf{1}^{\top}$ is the local centering matrix. Then Eq. (5) can be written as $\operatorname{tr}(F^{\top}LF)$ [1], where $L=\sum_{i=1}^{n}S_iL_iS_i^{\top}$ and $L_i=H_i-H_iX_i^{\top}\left(X_iH_iX_i^{\top}+\lambda I\right)^{-1}X_iH_i$. Here $S_i\in\{0,1\}^{n\times(k+1)}$ is a selection matrix, in which $(S_i)_{uv}=1$ if $x_u$ is the $v$-th element in $\mathcal{N}_i$ and $(S_i)_{uv}=0$ otherwise. Note that $X_i$ and $F_i$ are selected from $X$ and $F$ accordingly. The objective function of Eq. (10) becomes:

$$\min_{\mathbf{d},\mathbf{W},\mathbf{b},F}\ \mathrm{MMD}(\mathcal{A},\mathcal{T})+\left\|\varphi(X)^{\top}\mathbf{W}+\mathbf{1}\mathbf{b}^{\top}-F\right\|_{2,p}^{p}+\theta\operatorname{tr}\left(\mathbf{W}^{\top}\mathbf{W}\right)+\gamma\left[\operatorname{tr}\left(F^{\top}LF\right)+\beta\operatorname{tr}\left((F-Y)^{\top}U(F-Y)\right)\right]. \qquad (11)$$

Fixing $\mathbf{d}$ and applying some linear algebraic transformations, the optimization problem in Eq. (11) is equivalent to a reweighted least-squares problem:

(12)

where the coefficient matrices in Eq. (12) are known matrices of the form:

(13)

where $\mathbf{0}$ denotes a matrix with all elements equal to 0. The matrix $D$ is a diagonal matrix calculated as follows: let $r^{i}$ denote the $i$-th row of the residual matrix $R=\varphi(X)^{\top}\mathbf{W}+\mathbf{1}\mathbf{b}^{\top}-F$; then

$$D_{ii}=\frac{p}{2}\left\|r^{i}\right\|_{2}^{p-2}. \qquad (14)$$

Setting the derivative of the objective function in Eq. (12) w.r.t. $F$ to zero, we obtain a closed-form solution, with which we can then get $\mathbf{W}$, $\mathbf{b}$, and the residuals. Note that the matrix to be inverted is not guaranteed to be invertible; in practice, we can add a small ridge $\epsilon I$ to it to avoid numerical instability.

We then fix $\mathbf{W}$, $\mathbf{b}$, and $F$, so the objective function reduces to a problem in the kernel combination coefficients $\mathbf{d}$ alone. After some linear algebraic transformations, this objective function is equivalent to:

(15)

in which the quadratic and linear coefficients, denoted here by $H$ and $\mathbf{f}$, are assembled from the base kernel matrices $K_m$ ($m=1,\ldots,M$). The optimization problem in Eq. (15) is then equivalent

to solving the following Quadratic Programming (QP) problem:

$$\min_{\mathbf{d}}\ \frac{1}{2}\mathbf{d}^{\top}H\mathbf{d}+\mathbf{f}^{\top}\mathbf{d}\quad \text{s.t.}\quad \mathbf{d}\succeq\mathbf{0},\ \mathbf{1}^{\top}\mathbf{d}=1, \qquad (16)$$

which can be efficiently solved using a QP optimization toolbox [30]. We summarize the detailed solution of image attribute adaptation in Algorithm 1.

Algorithm 1 Algorithm of Image Attribute Adaptation (IAA)

Input: auxiliary image set $\mathcal{A}$ and target training image set $\mathcal{T}$; $Y$ is the attribute ground truth.

Output: $\mathbf{d}$, $\mathbf{W}$, and $\mathbf{b}$

1: Initialize the kernel combination coefficients $\mathbf{d}$;

2: Initialize $\mathbf{W}$, $\mathbf{b}$, and $F$ randomly;

3: Compute $K=\sum_{m=1}^{M}d_mK_m$ using the base kernels $K_m$ ($m=1,\ldots,M$);

4: repeat

5: Compute $D$ using Eq. (14);

6: Compute the coefficient matrices using Eq. (13);

7: Solve for $\mathbf{W}$, $\mathbf{b}$, and $F$ in closed form from Eq. (12);

8: Solve the QP problem in Eq. (16) to update $\mathbf{d}$;

9: until Convergence

10: Output $\mathbf{d}$, $\mathbf{W}$, and $\mathbf{b}$.
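Step 8 of Algorithm 1 is a small QP in the $M$ kernel weights. The sketch below solves it with SciPy's SLSQP as a generic stand-in for the QP toolbox [30], assuming the simplex constraints stated in Section III-B ($d_m \ge 0$, $\sum_m d_m = 1$) and the $H$, $\mathbf{f}$ notation of our reconstruction of Eq. (16).

```python
import numpy as np
from scipy.optimize import minimize

def solve_kernel_weights(H, f):
    """Solve min_d 0.5 * d^T H d + f^T d  s.t.  d >= 0, sum(d) = 1 (Eq. (16))."""
    M = len(f)
    res = minimize(lambda d: 0.5 * d @ H @ d + f @ d,
                   np.full(M, 1.0 / M),                  # uniform start
                   jac=lambda d: H @ d + f,              # analytic gradient
                   bounds=[(0.0, None)] * M,
                   constraints=({'type': 'eq',
                                 'fun': lambda d: d.sum() - 1.0},),
                   method='SLSQP')
    return res.x
```

Since $M$ is small (8 base kernels in the experiments), this step is cheap relative to the closed-form updates of step 7.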

B. Convergence and Computational Complexity

The optimal $\mathbf{d}$ obtained by solving Eq. (16) with $\mathbf{W}$, $\mathbf{b}$, and $F$ fixed decreases the value of the objective function in Eq. (11). Likewise, when we alternately fix $\mathbf{d}$ and solve for $\mathbf{W}$, $\mathbf{b}$, and $F$, we find an optimal solution that decreases the value of the objective function in Eq. (11). Since each step monotonically decreases an objective that is bounded below, the alternating algorithm of Image Attribute Adaptation (IAA) is guaranteed to converge.


Fig. 2. Example images from AwA (top row) and AIN (bottom row). Images in each column are from the same category, i.e., antelope, bear, dolphin, fox, and zebra. Attribute annotations are in italics.

We now discuss the computational complexity of Algorithm 1, beginning with the training process. We suppose the multiple kernels can be precomputed and loaded into memory before IAA training. The calculation of the kernel $K$ in step 3 of Algorithm 1 then takes $O(Mn^2)$ time, where $n$ is the number of training images in the auxiliary and target domains and $M$ is the number of base kernels. Since the Laplacian matrix $L$ can be precomputed before the iteration steps (steps 4 to 9) of Algorithm 1, the calculation of the coefficient matrices can be accelerated. The most time-consuming operations in each iteration round are the closed-form solution in step 7 and solving the QP problem in step 8. Since the time complexity of steps 7 and 8 is $O(n^3)$, the computational cost of Algorithm 1 is approximately $O(Mn^2+Tn^3)$, where $T$ is the number of iterations required for convergence. In the experiments, we observe that the algorithm converges within 20 iterations. In real-world applications of image attribute adaptation, the number of training images in the auxiliary and target domains is not very large compared to the large number of testing images in the target domain; our algorithm is therefore efficient for obtaining the attributes of large-scale image collections. It is worth noting that, though the dimensionality of the features affects the computational cost, a high feature dimension only slows the calculation of the kernel matrix $K$ and the Laplacian matrix $L$, both of which can be precomputed. Thus, even if the feature dimension of the images increases, our algorithm remains computationally effective. From these discussions, as $M \ll n$, the computational cost of our algorithm is approximately $O(Tn^3)$, which is similar to multiple kernel regression.

We next analyze the time complexity of the testing process. The core operation in the testing phase is the computation of the decision function in Eq. (9), which takes $O(Mn)$ time to compute the decision value for a testing image. As $M$ is the number of base kernels and is a moderate constant in practice ($M=8$ in the experiments of this paper), the computational complexity of tagging one image is approximately linear w.r.t. the size of the training set.

V. EXPERIMENTS

A. Datasets of Auxiliary and Target Images

Two pairs of auxiliary-target image sets are used to evaluate the proposed framework of attribute adaptation.

1) “Animals with Attributes” and “Animals from ImageNet”:

The well-devised “Animals with Attributes” (AwA) dataset [8] is used as the auxiliary image set, which contains 30475 animal

Fig. 3. Example images from a-Pascal (top row) and a-Yahoo (bottom row). Images in a-Pascal and a-Yahoo are from disjoint categories. Attribute annotations are in italics.

images that are categorized into 50 classes. Lampert et al. aligned the 50 animal classes with the classical class-attribute matrix [13], thus providing 85 numeric attribute values for each class, which can be taken as the ground truth of class-attribute correlations. AwA can therefore be considered a well-devised auxiliary image set. For the target images, we submitted the 50 animal (class) names in AwA as 50 query keywords to ImageNet1 and downloaded images for each class. As a result, we collected altogether 48199 images to form the target image dataset; these images are categorized into the same animal classes as in AwA. The collected images are thus appropriate as target images and are used to evaluate the performance of the proposed image attribute adaptation algorithm. In the following, we denote this dataset as “Animals from ImageNet” (AIN). In Fig. 2, we show some example images from the AwA and AIN datasets.

2) “a-Pascal” and “a-Yahoo”: In this experiment, we take

a-Pascal [7] and a-Yahoo [7] as the auxiliary and target image sets, respectively. Images in a-Pascal are selected from the Pascal VOC 2008 dataset,2 which contains 20 object class names. The number of objects from each category ranges from 150 to 1000. For the target images in a-Yahoo, an additional 12 object classes were collected from Yahoo image search, and the images are labeled with the list of 64 attributes. This allows us to evaluate the generalization ability of attribute adaptation, as attributes have the traits of transcending specific semantic categories and describing objects across categories. In Fig. 3, we show some example images from the a-Pascal and a-Yahoo datasets.

3) Base Features and Training/Testing Settings: For images

in AwA and AIN, we use the feature extraction toolbox FELib [31] to extract six types of global features: grid color moment, color histogram, edge histogram, local binary pattern, Gabor wavelets texture, and GIST [32]. These six histograms are combined together, resulting in an 809-dimensional feature, which we refer to as the base feature. For images in a-Pascal and a-Yahoo [7], local texture, HOG, edge, and color descriptors inside the bounding box are binned into individual histograms. To represent shapes and locations, histograms for each feature type are generated for each cell in a grid of three vertical and two horizontal blocks. These histograms are combined together to form a 9751-dimensional base feature. Such heterogeneous base features can better describe the different visual aspects of images.

1 http://www.image-net.org/
2 http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/


For the auxiliary images of AwA and a-Pascal, we randomly sampled 2000 images, respectively. For the target images in AIN, we randomly sample 5000 images to form the target image set, in which we sample 1000 images as target training images and take the remaining images as target testing images. For the target images in a-Yahoo, we also sample 1000 images as target training images and take the remaining images as testing images. For all these datasets, the sampling process is repeated five times to generate five random auxiliary and target training/testing partitions, and the average performance over the five rounds is reported. To investigate the impact of local structures among both labeled and unlabeled target training images, we set different ratios of labeled images in the target training set, which also demonstrates the effect of semi-supervised learning in our proposed image attribute adaptation framework.

B. Experimental Configuration

1) Compared Methods and Parameter Settings: We compare the proposed image attribute adaptation (IAA) algorithm to three baselines and two state-of-the-art domain adaptation methods. First, we train a linear SVM on the labeled training image set of the target domain and use the learned model to predict attributes for target testing images. To investigate the effect of local regression in the semi-supervised learning process, we compare our algorithm to the semi-supervised subspace learning algorithm long-term RF (LTRF) proposed in [1]. It is worth noting that LTRF employs a process of local regression similar to Eq. (7) to explore the distributions of labeled and unlabeled training images, which makes it a suitable point of comparison for our proposed algorithm. The third baseline is TaylorBoost [33], a non-linear boosting algorithm extended from AdaBoost. Two representative domain adaptation methods, namely A-MKL (Adaptive Multiple Kernel Learning) [17] and SAR (Structural Adaptive Regression) [18], are also compared in this experiment. Note that when the size of the auxiliary image set is zero, Eq. (10) reduces to an optimization problem of Semi-Supervised Multiple Kernel Regression (SSMKR), which is a special case of the IAA algorithm. To investigate the impact of auxiliary images, we compare IAA to SSMKR.

Six parameters in our proposed framework need to be set and tuned. We set the size of each local clique to $k+1$. The parameter $\lambda$ in Eq. (7) and $\beta$, $\theta$, $\gamma$ in Eq. (10) are trade-off parameters that balance the effect of the different terms. In the experiments, we set two of them to 100; the values of the others are tuned over a candidate range, and the best performance from the optimal parameter settings is reported. The parameter $p$ in the $\ell_{2,p}$-norm is also tuned over a range of values. For the comparison algorithms, the trade-off parameters in LTRF [1], A-MKL [17], and SAR [18], and the regularization parameter used in SVM and A-MKL [17], are tuned over the same range. In Fig. 4, we show the impact of two of the trade-off parameters when they are set to different values. From the results we can see that the performance of attribute adaptation by our method is not very sensitive to these parameters.

Fig. 4. Impact of two trade-off parameters when they are set to take different values. The best performance is indicated by the text arrow.

2) Base Kernels: We test the performance with four types of base kernels: the Gaussian kernel (i.e., $k(x,z)=\exp(-\sigma\|x-z\|^{2})$), the polynomial kernel (i.e., $k(x,z)=(x^{\top}z)^{a}$), the PolyPlus kernel (i.e., $k(x,z)=(1+x^{\top}z)^{a}$), and the linear kernel (i.e., $k(x,z)=x^{\top}z$). For the Gaussian kernel, we set several values of the bandwidth parameter $\sigma$; for the polynomial and PolyPlus kernels, we set several values of the degree parameter $a$. In total we thus obtain 8 base kernels. These kernels are used in our proposed image attribute adaptation algorithm and in A-MKL.
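A sketch of how such a bank of base kernels can be built is given below; the bandwidth and degree values are placeholders, since the exact values used in the paper are not stated here (the defaults are chosen so that 3 Gaussian, 2 polynomial, 2 PolyPlus, and 1 linear kernel give 8 kernels in total).

```python
import numpy as np
from scipy.spatial.distance import cdist

def build_base_kernels(X, Z, sigmas=(0.5, 1.0, 2.0), degrees=(2, 3)):
    """Gaussian exp(-sigma * ||x - z||^2), polynomial (x^T z)^a,
    PolyPlus (1 + x^T z)^a, and linear x^T z kernels between the rows
    of X and Z, returned as a list of (len(X), len(Z)) matrices."""
    sq_dists = cdist(X, Z, 'sqeuclidean')
    lin = X @ Z.T
    kernels = [np.exp(-s * sq_dists) for s in sigmas]  # Gaussian
    kernels += [lin ** a for a in degrees]             # polynomial
    kernels += [(1.0 + lin) ** a for a in degrees]     # PolyPlus
    kernels.append(lin)                                # linear
    return kernels
```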

3) Evaluation Metric: We evaluate the performance of attribute prediction for target testing images in terms of the F1-score (F-measure). Since there are multiple attributes associated with each image, to measure the global performance across multiple attributes we use micro-averaging, following [34]. The evaluation criterion we use is therefore MicroF1. More specifically, we present the “micro” definition as follows. Let $Y^{*}$ denote the indicator matrix of ground truth for the testing data, and let $\hat{Y}^{*}$ denote the corresponding estimated indicator matrix. The function $\mathrm{F1}(u,v)$ computes the F1-score between binary vectors $u$ and $v$. Letting $\operatorname{vec}(\cdot)$ denote the operator that converts a matrix to a vector by concatenating its columns sequentially, the “micro” criterion is

$$\mathrm{MicroF1}=\mathrm{F1}\left(\operatorname{vec}(Y^{*}),\operatorname{vec}(\hat{Y}^{*})\right).$$
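A minimal implementation of this micro-averaged criterion for binary indicator matrices (the names are our own):

```python
import numpy as np

def micro_f1(Y_true, Y_pred):
    """MicroF1: both indicator matrices are flattened column-wise (the
    vec(.) operator above) and a single F1-score is computed over the
    concatenated vectors."""
    y = Y_true.ravel(order='F')
    yhat = Y_pred.ravel(order='F')
    tp = np.sum((y == 1) & (yhat == 1))
    fp = np.sum((y == 0) & (yhat == 1))
    fn = np.sum((y == 1) & (yhat == 0))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)
```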

C. Experimental Results

1) Results of Image Attribute Prediction: In Table I, we report the comparison results of image attribute prediction when the size of the auxiliary image set is 2000, where “→” denotes the direction of adaptation from auxiliary to target images. For all methods, we compare the results when the ratio of labeled training images in the target domain is 5%, 10%, and 50%, respectively. From the results we observe: (1) The proposed IAA algorithm outperforms the three baseline algorithms, namely the supervised classifier SVM, the semi-supervised learning method LTRF [1], and the non-linear boosting algorithm TaylorBoost [33]. Since these baselines do not involve an adaptation process and cannot borrow knowledge from the auxiliary images (e.g., the attribute information of images in AwA and a-Pascal), their attribute prediction performance is lower than that of our method. (2) As IAA successfully makes use of the local structures among auxiliary and target training images (both labeled and unlabeled; see Eq. (7)), we obtain better performance than A-MKL, which is


TABLE I
COMPARISON RESULTS OF IMAGE ATTRIBUTE PREDICTION ON AwA→AIN AND a-Pascal→a-Yahoo. THE SIZE OF THE AUXILIARY IMAGE SET IS 2000. FOR ALL METHODS, THE RATIO OF LABELED TRAINING IMAGES IN THE TARGET DOMAIN IS 5%, 10%, AND 50%, RESPECTIVELY

Fig. 5. Performance comparison of image attribute recognition between the IAA and SSMKR algorithms. The ratio of labeled images in the target training set is set to take different values.

also a domain adaptation method with multiple kernel learning. (3) Owing to the strengths of domain adaptation and the robust nature of sparsity, SAR [18] achieves good attribute prediction performance. However, because our method involves a semi-supervised learning process that utilizes the unlabeled training images in the target domain, IAA outperforms SAR for all ratios of labeled training images in the target domain.

As introduced in Section V-A, AwA and a-Pascal are both taken as the well-defined auxiliary image sets in this experiment. Furthermore, as shown in Figs. 2 and 3, the images in AwA→AIN and a-Pascal→a-Yahoo come from different settings of semantic categories: in AwA→AIN, the images of both the auxiliary and target domains are from the same set of semantic categories, whereas in a-Pascal→a-Yahoo, the images of the auxiliary and target domains are from disjoint sets of semantic categories. These different settings allow us to evaluate the generalization ability of attribute adaptation, as attributes have the traits of transcending specific semantic categories and describing objects both within and across categories. The experimental results in Table I show that the proposed IAA algorithm adapts attributes well both within categories and across categories. Moreover, this result also demonstrates attributes' advantage of transcending specific semantic categories and describing objects across categories.

2) Impact of Auxiliary Images:

In this section, we investigate the impact of the auxiliary image set in the adaptation framework. We compare IAA to SSMKR, the special case of IAA in which the size of the auxiliary image set is zero (see Section V-B1). The results are displayed in Fig. 5. From the results we observe the following: compared to image attribute recognition without adaptation from an auxiliary image set, IAA has better performance. For example, when the ratio of labeled training images in the target domain is 50%, the performance improvement is about

Fig. 6. Performance comparison when noisy attribute annotations are added to the auxiliary images. Different numbers of noisy attributes are added to AwA and a-Pascal, respectively.

3% and 2% for AwA→AIN and a-Pascal→a-Yahoo, respectively; when the ratio of labeled training images in the target domain is 5%, the improvement is about 5% and 8% for AwA→AIN and a-Pascal→a-Yahoo, respectively. These results clearly demonstrate the contribution of the auxiliary image set to the overall performance of IAA.

To further show the impact of auxiliary images in our IAA

adaptation framework, we deliberately add some noise to the auxiliary image set and investigate its effect. In particular, to simulate noisy images, we deliberately add noise to the attribute ground truth of the auxiliary images. As there are respectively 85 and 64 attributes annotated in AwA→AIN and a-Pascal→a-Yahoo, we add different numbers of noisy attributes to each. The results are displayed in Fig. 6. From the results we observe the following: (1) Because we deliberately add noise to the attribute ground truth of the auxiliary images, i.e., we deliberately make some wrong attribute annotations in the auxiliary images, the performance decreases compared to that of IAA without noise. (2) As the ratio of noisy attribute annotations increases, the performance of image attribute adaptation decreases. The reason is that more noisy knowledge is adapted from the auxiliary images to the target images, which pulls down the attribute prediction performance.

3) Investigation of Domain Adaptation:

In this section, we investigate the domain adaptation performance of A-MKL, SAR, and IAA when the number of auxiliary images is set to different values. The results on AwA→AIN and a-Pascal→a-Yahoo are reported in Fig. 7 and Fig. 8, respectively. For each setting of the auxiliary set size, we also report the results of image attribute prediction when the ratios of labeled training


Fig. 7. Domain adaptation performance comparison on AwA→AIN. The number of auxiliary images is set to take different values.

Fig. 8. Domain adaptation performance comparison on a-Pascal→a-Yahoo. The number of auxiliary images is set to take different values.

images are set to different values. From the results we observe: (1) For all numbers of auxiliary images, our IAA outperforms A-MKL and SAR on AwA→AIN and a-Pascal→a-Yahoo. As A-MKL and SAR are two state-of-the-art domain adaptation methods, this result shows that our proposed IAA framework is more effective for the application of image attribute adaptation. (2) Comparing the results within each single algorithm (A-MKL, SAR, or IAA) w.r.t. different numbers of auxiliary images, we see that none of the three domain adaptation methods obtains better image attribute adaptation performance simply by using more auxiliary images. We conclude that too many auxiliary images may even have a negative impact on attribute adaptation performance. Thus, in image attribute adaptation, the number of auxiliary images should be carefully determined, which remains an open problem.

4) Impact of Unlabeled Training Images: The impact of unlabeled training images in the target domain is twofold. First, we employ a local regression to capture the local structures among both labeled and unlabeled training images; second, we constrain the predicted label indicators to be consistent with the ground truth labels of the training data, which induces a semi-supervised learning process (see Eq. (7)). In this section, we set the ratio of labeled training images in the target domain to different values and compare IAA to the other methods. The results are displayed in Fig. 9. From the results we observe the following: (1) As the number of labeled training samples increases, the performance obviously increases. (2) Compared to the supervised baselines and the two domain adaptation comparison methods, our method achieves a noticeable improvement, thanks to its successful utilization of unlabeled training images. (3) Our IAA outperforms the semi-supervised LTRF significantly at all ratios of labeled training images

Fig. 9. Performance comparison when the ratio of labeled images in the target training set is set to take different values.

for AwA→AIN and a-Pascal→a-Yahoo, as IAA borrows knowledge of the images' attribute information and uses multiple kernels to fuse heterogeneous features. (4) Though our IAA algorithm outperforms SAR only slightly at the largest ratio of labeled training images, IAA beats SAR with a clear advantage at the smaller ratios, which demonstrates the positive impact of unlabeled training images in attribute adaptation by IAA.

5) Investigation of the $\ell_{2,p}$-Norm Loss Function: In IAA (see Eq. (10)), we replace the Frobenius norm loss function in multiple kernel regression with an $\ell_{2,p}$-norm sparse loss function. The goal is to remove the shared irrelevance and noise. The experimental results reported in Table I and Fig. 9 have shown the efficacy of the $\ell_{2,p}$-norm loss function. In this section, we first investigate the influence of the parameter $p$ in the $\ell_{2,p}$-norm loss function, and then compare the performance of the $\ell_{2,p}$-norm and Frobenius norm loss functions in our framework.

In Fig. 10, we show the performance comparison of our IAA algorithm when the parameter $p$ is set to different values. From the results we observe: (1) There is only one



Fig. 10. Performance comparison of the proposed IAA algorithm when the parameter p in the ℓ2,p-norm is set to different values. The size of the auxiliary image set is 2000.

Fig. 11. Performance comparison of the ℓ2,p-norm and Frobenius-norm loss functions in IAA. The value of p is tuned over a range of candidate values, and the best performance is reported. The size of the auxiliary image set is 2000.

(1) In only one case does IAA obtain its best performance at p = 2 (see the result on a-Pascal→a-Yahoo when the ratio is set to 10%). The results in Fig. 10 therefore indicate that we may miss better performance if we simply use the Frobenius-norm loss function; in most cases, the best performance of IAA is achieved at smaller values of p. (2) As discussed in Section III-D, in theory, when there is more noise in the positive samples, a lower p reduces the influence of the noise more strongly and therefore yields better performance. Conversely, if there is little or no noise in the positive samples and the data distribution is uniform, a larger p makes the regression account for every data point and helps to find a more faithful classifier. The results in Fig. 10 show that the best performance is not always obtained at the same p, which indicates that the amount of noise in the positive samples and the data distribution vary across image sets. Because the parameter p affects the performance of IAA and its choice is data-dependent, we need to tune the value of p appropriately. In Fig. 11, we report the performance comparison of the ℓ2,p-norm and Frobenius-norm loss functions in our IAA framework. We observe that, by tuning and determining the value of p appropriately, we obtain better performance with the ℓ2,p-norm loss function, which further demonstrates its effectiveness in our framework.
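As an illustration of why a smaller p suppresses noisy training images, here is a minimal sketch assuming the robust loss is the ℓ2,p-norm of the residual matrix, with one row per training image; the exact form of the loss in Eq. (10) and the reweighting scheme below are assumptions for illustration, not the paper's implementation.

```python
# A minimal sketch, assuming the robust loss is the l_{2,p}-norm of the
# residual matrix R (one row per training image, one column per attribute):
#   loss(R) = sum_i ||R_i||_2^p ,  0 < p <= 2.
# With p = 2 this reduces to the squared Frobenius norm; smaller p shrinks
# the influence of rows with large residuals (noisy/irrelevant images).
import numpy as np

def l2p_loss(R, p):
    row_norms = np.linalg.norm(R, axis=1)
    return np.sum(row_norms ** p)

def irls_row_weights(R, p, eps=1e-8):
    """Per-image weights for an iteratively reweighted least-squares step:
    minimizing sum_i d_i * ||R_i||_2^2 with d_i = (p/2) ||R_i||_2^(p-2)
    has the same gradient as the l_{2,p} objective at the current R;
    rows with big residuals get small weights when p < 2, which is the
    claimed robustness effect."""
    row_norms = np.linalg.norm(R, axis=1)
    return (p / 2.0) * (row_norms + eps) ** (p - 2)

# Illustration: a clean residual row vs. an outlier row.
R = np.array([[0.1, -0.2], [3.0, 4.0]])    # second image is an outlier
print(l2p_loss(R, 2.0), l2p_loss(R, 0.5))  # outlier dominates less for small p
print(irls_row_weights(R, 0.5))            # outlier row gets a tiny weight
```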

VI. CONCLUSION

This paper has proposed a framework for image attribute adaptation (IAA), the goal of which is to learn to predict attributes for novel images in the target domain. In the proposed framework, multiple kernels are employed to map each training sample in the auxiliary and target datasets to the corresponding RKHS, where we reduce the mismatch between the data distributions of the different domains. To make use of the unlabeled training images, we incorporate a semi-supervised learning process into our framework. We also introduce a robust loss function into our framework to remove the shared irrelevance and noise of training images. Experimental results show that the proposed IAA achieves better image attribute adaptation performance than the state-of-the-art methods. IAA also extends naturally: new base kernels can be incorporated to effectively fuse other specific types of visual features.
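For readers who want to connect the distribution-mismatch criterion to code, the following is a minimal sketch of the biased empirical estimate of the kernel maximum mean discrepancy (MMD) from [20], which measures the RKHS distance between the mean feature mappings of the auxiliary and target images; the RBF base kernel and the bandwidth gamma are illustrative choices, not necessarily those used in IAA.

```python
# A minimal sketch of the biased empirical kernel MMD [20]:
#   MMD^2 = || mean phi(X_aux) - mean phi(X_tar) ||_H^2
#         = mean(K_aa) + mean(K_tt) - 2 * mean(K_at).
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X_aux, X_tar, gamma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    Kaa = rbf_kernel(X_aux, X_aux, gamma).mean()
    Ktt = rbf_kernel(X_tar, X_tar, gamma).mean()
    Kat = rbf_kernel(X_aux, X_tar, gamma).mean()
    return Kaa + Ktt - 2.0 * Kat
```

Minimizing such a criterion over the kernel combination encourages auxiliary and target images to look alike in the RKHS, which is the intuition behind the mismatch-reduction step summarized above.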

REFERENCES
[1] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, and Y. Pan, "A multimedia retrieval framework based on semi-supervised ranking and relevance feedback," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 723–742, 2012.
[2] Z.-J. Zha, M. Wang, Y.-T. Zheng, Y. Yang, R. Hong, and T.-S. Chua, "Interactive video indexing with statistical active learning," IEEE Trans. Multimedia, vol. 14, no. 1, pp. 17–27, 2012.
[3] Y. Yang, Z. Ma, A. Hauptmann, and N. Sebe, "Feature selection for multimedia analysis by sharing information among multiple tasks," IEEE Trans. Multimedia, vol. 15, no. 3, pp. 661–669, 2013.
[4] J. Yang, Y. Tian, L. Duan, T. Huang, and W. Gao, "Group-sensitive multiple kernel learning for object recognition," IEEE Trans. Image Process., vol. 21, no. 5, pp. 2838–2852, 2012.
[5] Y. Han, F. Wu, Q. Tian, and Y. Zhuang, "Image annotation by input-output structural grouping sparsity," IEEE Trans. Image Process., vol. 21, no. 6, pp. 3066–3079, 2012.
[6] Z.-J. Zha, L. Yang, T. Mei, M. Wang, and Z. Wang, "Visual query suggestion," in Proc. ACM Multimedia, 2009, pp. 15–24.
[7] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, "Describing objects by their attributes," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1778–1785.
[8] C. Lampert, H. Nickisch, and S. Harmeling, "Learning to detect unseen object classes by between-class attribute transfer," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009, pp. 951–958.
[9] V. Ferrari and A. Zisserman, "Learning visual attributes," Adv. Neural Inf. Process. Syst. 20, pp. 433–440, 2008.
[10] D. Parikh and K. Grauman, "Relative attributes," in Proc. IEEE Int. Conf. Computer Vision (ICCV), 2011, pp. 503–510.
[11] F. X. Yu, R. Ji, M.-H. Tsai, G. Ye, and S.-F. Chang, "Weak attributes for large-scale image retrieval," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2949–2956.
[12] T. Berg, A. Berg, and J. Shih, "Automatic attribute discovery and characterization from noisy web data," in Proc. Eur. Conf. Computer Vision (ECCV), 2010, pp. 663–676.
[13] D. Osherson, J. Stern, O. Wilkie, M. Stob, and E. Smith, "Default probability," Cognit. Sci., vol. 15, no. 2, pp. 251–269, 1991.
[14] J. Yang, R. Yan, and A. Hauptmann, "Cross-domain video concept detection using adaptive SVMs," in Proc. ACM Multimedia, 2007, pp. 188–197.
[15] S. Pan, I. Tsang, J. Kwok, and Q. Yang, "Domain adaptation via transfer component analysis," IEEE Trans. Neural Netw., vol. 22, no. 2, pp. 199–210, 2011.
[16] B. Wang, J. Tang, W. Fan, S. Chen, Z. Yang, and Y. Liu, "Heterogeneous cross domain ranking in latent space," in Proc. ACM Conf. Information and Knowledge Management, 2009, pp. 987–996.
[17] L. Duan, D. Xu, I. Tsang, and J. Luo, "Visual event recognition in videos by learning from web data," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1959–1966.
[18] Z. Ma, Y. Yang, Y. Cai, N. Sebe, and A. Hauptmann, "Knowledge adaptation for ad hoc multimedia event detection with few exemplars," in Proc. ACM Multimedia, 2012, pp. 469–478.
[19] S. Hwang, F. Sha, and K. Grauman, "Sharing features between objects and their attributes," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1761–1768.
[20] K. Borgwardt, A. Gretton, M. Rasch, H. Kriegel, B. Schölkopf, and A. Smola, "Integrating structured biological data by kernel maximum mean discrepancy," Bioinformatics, vol. 22, no. 14, pp. e49–e57, 2006.
[21] Y. Han, Z. Xu, Z. Ma, and Z. Huang, "Image classification with manifold learning for out-of-sample data," Signal Process., vol. 93, pp. 2169–2177, 2013.



[22] N. Kumar, A. Berg, P. Belhumeur, and S. Nayar, "Describable visual attributes for face verification and image search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 10, pp. 1962–1977, 2011.
[23] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. Berg, and T. Berg, "Baby talk: Understanding and generating simple image descriptions," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1601–1608.
[24] Y. Boureau, F. Bach, Y. LeCun, and J. Ponce, "Learning mid-level features for recognition," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010, pp. 2559–2566.
[25] T. Xia, D. Tao, T. Mei, and Y. Zhang, "Multiview spectral embedding," IEEE Trans. Syst., Man, Cybern. B: Cybern., vol. 40, no. 6, pp. 1438–1446, 2010.
[26] X. Yu and Y. Aloimonos, "Attribute-based transfer learning for object categorization with zero/one training example," in Proc. Eur. Conf. Computer Vision (ECCV), 2010, pp. 127–140.
[27] Y. Han, F. Wu, X. Lu, Q. Tian, Y. Zhuang, and J. Luo, "Correlated attribute transfer with multi-task graph-guided fusion," in Proc. 20th ACM Int. Conf. Multimedia, 2012, pp. 529–538.
[28] Y. Yang, Y. Yang, and H. T. Shen, "Effective transfer tagging from image to video," ACM Trans. Multimedia Comput., Commun., Applicat. (TOMCCAP), vol. 9, no. 2, pp. 14.1–14.20, 2013.
[29] F. Bach, "Exploring large feature spaces with hierarchical multiple kernel learning," Adv. Neural Inf. Process. Syst. 21, pp. 105–112, 2009.
[30] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[31] J. Zhu, S. Hoi, M. Lyu, and S. Yan, "Near-duplicate keyframe retrieval by nonrigid image matching," in Proc. ACM Multimedia, 2008, pp. 41–50.
[32] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vision, vol. 42, no. 3, pp. 145–175, 2001.
[33] M. Saberian, H. Masnadi-Shirazi, and N. Vasconcelos, "TaylorBoost: First and second-order boosting algorithms with explicit margin control," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2011, pp. 2929–2934.
[34] D. Lewis, "Evaluating text categorization," in Proc. Speech and Natural Language Workshop, 1991, pp. 312–318.

Yahong Han received the Ph.D. degree from Zhejiang University, Hangzhou, China. He is currently an Associate Professor with the School of Computer Science and Technology, Tianjin University, Tianjin, China. His current research interests include multimedia analysis, retrieval, and machine learning.

Yi Yang received the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China, in 2010. He is now a DECRA Fellow with the University of Queensland, Brisbane, Australia. Prior to that, he was a postdoctoral research fellow at the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. His research interests include machine learning and its applications to multimedia content analysis and computer vision, e.g., multimedia indexing and retrieval, surveillance video analysis, and video semantics understanding.

Zhigang Ma received the Ph.D. degree in computer science from the University of Trento, Trento, Italy, in 2013. His research interests include machine learning and its application to computer vision and multimedia analysis.

Haoquan Shen is currently a student with the College of Computer Science, Zhejiang University, Hangzhou, China. He was a visiting scholar at Carnegie Mellon University, Pittsburgh, PA, from May to December 2013, where he participated in the TRECVID MED and SED tasks. His research interests include machine learning and its application to computer vision and multimedia analysis.

Nicu Sebe (M'01–SM'11) received the Ph.D. degree in computer science from Leiden University, Leiden, The Netherlands, in 2001. Currently, he is with the Department of Information Engineering and Computer Science, University of Trento, Trento, Italy, where he leads research in the areas of multimedia information retrieval and human-computer interaction in computer vision applications. He was a General Co-Chair of the IEEE Automatic Face and Gesture Recognition Conference (FG 2008) and ACM Multimedia 2013, and a Program Chair of the ACM International Conference on Image and Video Retrieval (CIVR) 2007 and 2010 and of ACM Multimedia 2007 and 2011. He is a Program Chair of ECCV 2016 and ICCV 2017. He is a Senior Member of the IEEE and of the ACM and a Fellow of the IAPR.

Xiaofang Zhou received the B.S. and M.S. degrees in computer science from Nanjing University, China, in 1984 and 1987, respectively, and the Ph.D. degree in computer science from The University of Queensland, Australia, in 1994. He is a Professor of computer science with The University of Queensland, where he is the head of the Data and Knowledge Engineering Research Division, School of Information Technology and Electrical Engineering. He is the Director of the ARC Research Network in Enterprise Information Infrastructure (EII) and a Chief Investigator of the ARC Centre of Excellence in Bioinformatics. He is also an Adjunct Professor with Renmin University of China, appointed under the Chinese National Qianren Scheme. From 1994 to 1999, he was a Senior Research Scientist and Project Leader at CSIRO. His research is focused on finding effective and efficient solutions to managing, integrating, and analyzing very large amounts of complex data for business and scientific applications. His research interests include spatial and multimedia databases, high-performance query processing, web information systems, data mining, bioinformatics, and e-research. He is a Senior Member of the IEEE.