6
Bayesian Multiview Dimensionality Reduction for Learning Predictive Subspaces Mehmet G¨ onen 1 and ulefs ¸an Bozkurt G¨ onen 2 and Fikret G ¨ urgen 3 Abstract. Multiview learning basically tries to exploit different fea- ture representations to obtain better learners. For example, in video and image recognition problems, there are many possible feature rep- resentations such as color- and texture-based features. There are two common ways of exploiting multiple views: forcing similarity (i) in predictions and (ii) in latent subspace. In this paper, we introduce a novel Bayesian multiview dimensionality reduction method coupled with supervised learning to find predictive subspaces and its infer- ence details. Experiments show that our proposed method obtains very good results on image recognition tasks in terms of classifica- tion and retrieval performances. 1 INTRODUCTION Multiview learning considers problems that can describe data points with different feature representations (i.e., views or modalities). The main idea is to exploit these different views to obtain better learners than the learners that can be found from each view separately. We can also transfer information from a subset of views (i.e., source views) to a particular view (i.e., target view) if we do not have enough training instances in the latter to build a reliable learner, which is known as transfer learning. There are two common approaches for multiview learning: (i) training separate learners for each view in a coupled manner by forcing them to have similar predictions on matching data points, (ii) projecting the data points from each view into a unified subspace and training a common learner in this subspace. The first attempt to exploit multiple views is proposed for semi- supervised learning with two views, which is known as co-training [3]. In this approach, two distinct learners are trained separately us- ing a small set of labeled instances from both views. Then, the un- labeled examples that are classified most confidently by these two learners are added to the set of labeled data points. Recently, the co- training idea is reformulated with a Bayesian approach applicable to a large set of problems [19]. One other strategy is minimizing the regularization errors of all views by training distinct learners simul- taneously and a regularization term that penalizes the disagreement between views at the same time [4, 7, 8, 16, 20]. We can also exploit multiple views by finding a unified subspace from them. Canonical correlation analysis (CCA) [12] and kernel CCA (KCCA) [11], which extract a shared representation from two 1 Department of Computational Biology, Sage Bionetworks, Seattle, WA 98109, USA, email:[email protected] Present address: Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR 97239, USA, email: [email protected] 2 Department of Computer Engineering, Bo˘ gazic ¸i University, ˙ Istanbul, Turkey, email: [email protected] 3 Department of Computer Engineering, Bo˘ gazic ¸i University, ˙ Istanbul, Turkey, email: [email protected] multivariate variables, are the first two methods that come to mind. The main restriction of such methods is that they are required to have matching samples from the views. [15] proposes a probabilis- tic KCCA variant using Gaussian process regression to find a shared representation from two views. [13] formulates an algorithm to find shared and private representations for each view using structured sparsity. [13, 15] obtain good performances for human pose estima- tion from image features (i.e., inferring missing data of one view using the other). [17, 18] extend spectral embedding and stochastic neighborhood embedding for multiview learning, respectively, and perform experiments on image and video retrieval tasks. However, the generalization performances of these unsupervised methods may not be good enough for prediction tasks due to their unsupervised nature. [5, 6] propose a supervised algorithm, which is called max-margin harmonium (MMH), for finding a predictive subspace from multiple views using an undirected latent space Markov network with a large margin approach. MMH obtains better results than its competitor al- gorithms on video and image recognition data sets in terms of clas- sification, annotation, and retrieval performances. [14] introduces a multiview metric learning algorithm that tries to preserve cross-view neighborhood by placing similarly labeled data points from different views nearby in the projected subspace. The proposed method out- performs CCA on an image retrieval task, where k-nearest neighbor strategy is used for retrieval. In this paper, we propose a novel Bayesian multiview dimension- ality reduction (BMDR) method, where data points from different views are projected into a unified subspace without the restriction of having matching data samples from these views. We make the following contributions: In §2, we give the graphical model of our approach for multiclass classification. §3 introduces an efficient vari- ational approximation approach in a detailed manner. We report our experimental results in §4 and conclude in §5. 2 BAYESIAN MULTIVIEW DIMENSIONALITY REDUCTION FOR LEARNING PREDICTIVE SUBSPACES We propose to combine linear dimensionality reduction and linear supervised learning in a joint probabilistic model to obtain predic- tive subspaces for multiview learning problems. The main idea is to map the training instances of different views to a unified subspace using linear projection matrices and to estimate the target outputs in this projected subspace. Performing dimensionality reduction and supervised learning separately (generally with two different objec- tive functions) may not result in a predictive subspace and may have low generalization performance. For multiview learning problems, ECAI 2014 T. Schaub et al. (Eds.) © 2014 The Authors and IOS Press. This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License. doi:10.3233/978-1-61499-419-0-387 387

Bayesian Multiview Dimensionality Reduction for Learning ...users.ics.aalto.fi/gonen/files/gonen_ecai14b_paper.pdfBayesian Multiview Dimensionality Reduction for Learning Predictive

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Bayesian Multiview Dimensionality Reduction for Learning ...users.ics.aalto.fi/gonen/files/gonen_ecai14b_paper.pdfBayesian Multiview Dimensionality Reduction for Learning Predictive

Bayesian Multiview Dimensionality Reductionfor Learning Predictive Subspaces

Mehmet Gonen1 and Gulefsan Bozkurt Gonen2 and Fikret Gurgen3

Abstract. Multiview learning basically tries to exploit different fea-ture representations to obtain better learners. For example, in videoand image recognition problems, there are many possible feature rep-resentations such as color- and texture-based features. There are twocommon ways of exploiting multiple views: forcing similarity (i) inpredictions and (ii) in latent subspace. In this paper, we introduce anovel Bayesian multiview dimensionality reduction method coupledwith supervised learning to find predictive subspaces and its infer-ence details. Experiments show that our proposed method obtainsvery good results on image recognition tasks in terms of classifica-tion and retrieval performances.

1 INTRODUCTION

Multiview learning considers problems that can describe data pointswith different feature representations (i.e., views or modalities). Themain idea is to exploit these different views to obtain better learnersthan the learners that can be found from each view separately. We canalso transfer information from a subset of views (i.e., source views) toa particular view (i.e., target view) if we do not have enough traininginstances in the latter to build a reliable learner, which is known astransfer learning. There are two common approaches for multiviewlearning: (i) training separate learners for each view in a coupledmanner by forcing them to have similar predictions on matching datapoints, (ii) projecting the data points from each view into a unifiedsubspace and training a common learner in this subspace.

The first attempt to exploit multiple views is proposed for semi-supervised learning with two views, which is known as co-training[3]. In this approach, two distinct learners are trained separately us-ing a small set of labeled instances from both views. Then, the un-labeled examples that are classified most confidently by these twolearners are added to the set of labeled data points. Recently, the co-training idea is reformulated with a Bayesian approach applicable toa large set of problems [19]. One other strategy is minimizing theregularization errors of all views by training distinct learners simul-taneously and a regularization term that penalizes the disagreementbetween views at the same time [4, 7, 8, 16, 20].

We can also exploit multiple views by finding a unified subspacefrom them. Canonical correlation analysis (CCA) [12] and kernelCCA (KCCA) [11], which extract a shared representation from two

1 Department of Computational Biology, Sage Bionetworks, Seattle, WA98109, USA, email:[email protected] address: Department of Biomedical Engineering, Oregon Health &Science University, Portland, OR 97239, USA, email: [email protected]

2 Department of Computer Engineering, Bogazici University, Istanbul,Turkey, email: [email protected]

3 Department of Computer Engineering, Bogazici University, Istanbul,Turkey, email: [email protected]

multivariate variables, are the first two methods that come to mind.The main restriction of such methods is that they are required tohave matching samples from the views. [15] proposes a probabilis-tic KCCA variant using Gaussian process regression to find a sharedrepresentation from two views. [13] formulates an algorithm to findshared and private representations for each view using structuredsparsity. [13, 15] obtain good performances for human pose estima-tion from image features (i.e., inferring missing data of one viewusing the other). [17, 18] extend spectral embedding and stochasticneighborhood embedding for multiview learning, respectively, andperform experiments on image and video retrieval tasks. However,the generalization performances of these unsupervised methods maynot be good enough for prediction tasks due to their unsupervisednature.

[5,6] propose a supervised algorithm, which is called max-marginharmonium (MMH), for finding a predictive subspace from multipleviews using an undirected latent space Markov network with a largemargin approach. MMH obtains better results than its competitor al-gorithms on video and image recognition data sets in terms of clas-sification, annotation, and retrieval performances. [14] introduces amultiview metric learning algorithm that tries to preserve cross-viewneighborhood by placing similarly labeled data points from differentviews nearby in the projected subspace. The proposed method out-performs CCA on an image retrieval task, where k-nearest neighborstrategy is used for retrieval.

In this paper, we propose a novel Bayesian multiview dimension-ality reduction (BMDR) method, where data points from differentviews are projected into a unified subspace without the restrictionof having matching data samples from these views. We make thefollowing contributions: In §2, we give the graphical model of ourapproach for multiclass classification. §3 introduces an efficient vari-ational approximation approach in a detailed manner. We report ourexperimental results in §4 and conclude in §5.

2 BAYESIAN MULTIVIEW DIMENSIONALITYREDUCTION FOR LEARNING PREDICTIVESUBSPACES

We propose to combine linear dimensionality reduction and linearsupervised learning in a joint probabilistic model to obtain predic-tive subspaces for multiview learning problems. The main idea is tomap the training instances of different views to a unified subspaceusing linear projection matrices and to estimate the target outputsin this projected subspace. Performing dimensionality reduction andsupervised learning separately (generally with two different objec-tive functions) may not result in a predictive subspace and may havelow generalization performance. For multiview learning problems,

ECAI 2014T. Schaub et al. (Eds.)© 2014 The Authors and IOS Press.This article is published online with Open Access by IOS Press and distributed under the termsof the Creative Commons Attribution Non-Commercial License.doi:10.3233/978-1-61499-419-0-387

387

Page 2: Bayesian Multiview Dimensionality Reduction for Learning ...users.ics.aalto.fi/gonen/files/gonen_ecai14b_paper.pdfBayesian Multiview Dimensionality Reduction for Learning Predictive

Φo Xo

Qo Zo

W To b

Ψ yo λ

V

φfo,s ∼ G(φfo,s;αφ, βφ) ∀(o, f, s)qfo,s|φfo,s ∼ N (qfo,s; 0, φ

−1s ) ∀(o, f, s)

zso,i|qo,s,xo,i ∼ N (zso,i; q�o,sxo,i, 1) ∀(o, s, i)

λc ∼ G(λc;αλ, βλ) ∀cbc|λc ∼ N (bc; 0, λ

−1c ) ∀c

ψsc ∼ G(ψsc ;αψ, βψ) ∀(s, c)wsc |ψsc ∼ N (wsc ; 0, (ψ

sc)−1) ∀(s, c)

tco,i|bc,wc, zo,i ∼ N (tco,i;w�c zo,i + bc, 1) ∀(o, c, i)

yo,i|to,i ∼∏

c�=yo,iδ(t

yo,io,i > tco,i) ∀(o, i)

Figure 1. Graphical model and distributional assumptions of Bayesian multiview dimensionality reduction for learning predictive subspaces.

we should consider the predictive performance of the unified pro-jected subspace while learning the projection matrices. We give de-tailed derivations for multiclass classification, but our derivations caneasily be extended to binary classification and regression.

Figure 1 illustrates the proposed probabilistic model with a graph-ical model and its distributional assumptions. N (·;μ,Σ) denotes thenormal distribution with the mean vector μ and the covariance matrixΣ. G(·;α, β) denotes the gamma distribution with the shape param-eter α and the scale parameter β. δ(·) denotes the Kronecker deltafunction that returns 1 if its argument is true and 0 otherwise. Thereason for choosing these specific distributions in our probabilisticmodel becomes clear when we explain our inference procedure in thefollowing section. The notation we use throughout the manuscript issummarized in Table 1. As short-hand notations, all prior variablesin the model are denoted by Ξ = {λ, {Φo}Vo=1,Ψ}, where the re-maining variables by Θ = {b, {Qo}Vo=1, {To}Vo=1,W, {Z}Vo=1}and the hyper-parameters by ω = {αλ, βλ, αφ, βφ, αψ, βψ}. De-pendence on ω is omitted for clarity throughout the manuscript.

Table 1. List of notation.

V Number of views (i.e., feature representations)No Number of training instances for view oDo Dimensionality of input space for view oK Number of classesR Dimensionality of unified projected subspaceXo Do ×No data matrix for view oQo Do ×R matrix of projection variables for view oΦo Do ×R matrix of priors over projection variables for view oZo R×No matrix of projected variables for view oW R×K matrix of weight parametersΨ R×K matrix of priors over weight parametersb K × 1 vector of bias parametersλ K × 1 vector of priors over bias parametersTo No ×K matrix of auxiliary variables for view oyo No × 1 vector of class labels from {1, . . . ,K} for view o

The basic steps of our algorithm can be summarized as follows:

1. The data matrices {Xo}Vo=1 are used to project data points intoa low-dimensional unified subspace using the projection matrices{Qo}Vo=1.

2. The low-dimensional representations of data points {Zo}Vo=1 andthe shared set of classification parameters {W, b} are used to cal-culate the classification scores.

3. Finally, the given class label vectors {yo}Vo=1 are generated fromthe score matrices {To}Vo=1.

The auxiliary variables between the class labels and the projectedinstances are introduced to make the inference procedures efficient[1]. Exact inference for this probabilistic model is intractable andwe instead formulate a deterministic variational approximation in thefollowing section.

3 INFERENCE USING VARIATIONALAPPROXIMATION

Inference using a Gibbs sampling approach is computationally ex-pensive [9]. We instead formulate a deterministic variational approx-imation, which is more efficient in terms of computation time. Thevariational methods use a lower bound on the marginal likelihoodusing an ensemble of factored posteriors to find the joint parame-ter distribution [2]. Note that there is not a strong coupling betweenthe parameters of our model, although the factorable ensemble im-plies independence of the approximate posteriors. The factorable en-semble approximation of the required posterior for our model can bewritten as

p(Θ,Ξ|{Xo}Vo=1, {yo}Vo=1) ≈ q(Θ,Ξ) =

q({Φo}Vo=1)q({Qo}Vo=1)q({Zo}Vo=1)

q(λ)q(Ψ)q(b,W)q({To}Vo=1).

Each factor in the ensemble is defined just like its full conditionaldistribution:

q({Φo}Vo=1) =V∏o=1

Do∏f=1

R∏s=1

G(φfo,s;α(φfo,s), β(φfo,s))

q({Qo}Vo=1) =V∏o=1

R∏s=1

N (qo,s;μ(qo,s),Σ(qo,s))

q({Zo}Vo=1) =V∏o=1

No∏i=1

N (zo,i;μ(zo,i),Σ(zo,i))

q(λ) =

K∏c=1

G(λc;α(λc), β(λc))

q(Ψ) =R∏s=1

K∏c=1

G(ψsc ;α(ψsc), β(ψsc))

q(b,W) =K∏c=1

N([

bcwc

];μ(bc,wc),Σ(bc,wc)

)

M. Gönen et al. / Bayesian Multiview Dimensionality Reduction for Learning Predictive Subspaces388

Page 3: Bayesian Multiview Dimensionality Reduction for Learning ...users.ics.aalto.fi/gonen/files/gonen_ecai14b_paper.pdfBayesian Multiview Dimensionality Reduction for Learning Predictive

q({To}Vo=1) =V∏o=1

No∏i=1

T N (to,i;μ(to,i),Σ(to,i), ρ(to,i)),

where α(·), β(·), μ(·), and Σ(·) denote the shape parameter, the scaleparameter, the mean vector, and the covariance matrix for their argu-ments, respectively. T N (·;μ,Σ, ρ(·)) denotes the truncated normaldistribution with the mean vector μ, the covariance matrix Σ, andthe truncation rule ρ(·) such that T N (·;μ,Σ, ρ(·)) ∝ N (·;μ,Σ)if ρ(·) is true and T N (·;μ,Σ, ρ(·)) = 0 otherwise.

We can bound the marginal likelihood using Jensen’s inequality:

log p({yo}Vo=1|{Xo}Vo=1) ≥

Eq(Θ,Ξ)[log p({yo}Vo=1,Θ,Ξ|{Xo}Vo=1)]−Eq(Θ,Ξ)[log q(Θ,Ξ)]

and optimize this bound by maximizing with respect to each factorseparately until convergence. The approximate posterior distributionof a specific factor τ can be found as

q(τ ) ∝ exp(Eq({Θ,Ξ}\τ)[log p({yo}

Vo=1,Θ,Ξ|{Xo}Vo=1)]

).

Due to conjugate distributions in our probabilistic model, the re-sulting approximate posterior distribution of each factor follows thesame distribution as the corresponding factor.

Dimensionality reduction part has two sets of parameters: the pro-jection matrices that have normally distributed entries and the priormatrices that determine the precisions for these projection matrices.The approximate posterior distribution of the priors can be formu-lated as a product of gamma distributions:

q({Φo}Vo=1) =

V∏o=1

Do∏f=1

R∏s=1

G

⎛⎝φfo,s;αφ +1

2,

⎛⎝ 1

βφ+

˜(qfo,s)2

2

⎞⎠−1⎞⎠,

where the tilde notation gives the posterior expectations as usual, i.e.,f(τ ) = Eq(τ)[f(τ )]. The approximate posterior distribution of theprojection matrices is a product of multivariate normal distributions:

q({Qo}Vo=1) =

V∏o=1

R∏s=1

N (qo,s; Σ(qo,s)Xozso, (diag(φo,s) +XoX�o )−1).

The approximate posterior distribution of the projected instances canbe found as a product of multivariate normal distributions:

q({Zo}Vo=1) =

V∏o=1

No∏i=1

N (zo,i; Σ(zo,i)(Q�o xo,i+Wto,i−Wb), (I+ ˜WW�)−1).

Supervised learning part has two sets of parameters: the bias vec-tor and the weight matrix that have normally distributed entries, andthe corresponding priors are from gamma distribution. The approx-imate posterior distributions of the priors on the bias vector and theweight matrix can be formulated as products of gamma distributions:

q(λ) =K∏c=1

G(λc;αλ +

1

2,

(1

βλ+

b2c2

)−1)

q(Ψ) =R∏s=1

K∏c=1

G

⎛⎝ψsc ;αψ +1

2,

(1

βψ+

˜(wsc)2

2

)−1⎞⎠.

The approximate posterior distribution of the supervised learning pa-rameters is a product of multivariate normal distributions:

q(b,W) =

K∏c=1

N

⎛⎜⎜⎝[bcwc

]; Σ(bc,wc)

⎡⎢⎢⎣V∑o=1

1�tcoV∑o=1

Zotco

⎤⎥⎥⎦,⎡⎢⎢⎣λc +

V∑o=1

No

V∑o=1

1�Z�oV∑o=1

Zo1 diag(ψc) +V∑o=1

˜ZoZ�o

⎤⎥⎥⎦−1⎞⎟⎟⎠,

where we couple different views using the same bias vector andweight matrix for classification. The projection matrix for each viewtries to embed corresponding data points accordingly.

The auxiliary variables of each point follow a truncated multivari-ate normal distribution whose mean vector depends on the weightmatrix, the bias vector, and the corresponding projected instance.The approximate posterior distribution of the auxiliary variables isa product of truncated multivariate normal distributions:

q({To}Vo=1) =

V∏o=1

No∏i=1

T N

⎛⎝to,i;W�zo,i + b, I,∏

c�=yo,iδ(t

yo,io,i > tco,i)

⎞⎠.

However, we need to find the posterior expectations of the auxiliaryvariables to update the approximate posterior distributions of the pro-jected instances and the supervised learning parameters. We can ap-proximate these expectations using a naive sampling approach [10].

Updating the projection matrices {Qo}Vo=1 is the most time-consuming step, which requires inverting Do × Do matrices forthe covariance calculations and dominates the overall running time.When we have high-dimensional views, we can use an unsuperviseddimensionality reduction method (e.g., principal component analy-sis) before running the algorithm to reduce the computational com-plexity of our algorithm.

After convergence, we have a separate projection matrix for eachview and a unified set of classification parameters for the pro-jected subspace. For a test data point, we can perform dimension-ality reduction and classification using only the available views.p(Qo|{Xo}Vo=1, {yo}Vo=1) can be replaced with its approximateposterior distribution q(Qo) for the prediction step. We obtain thepredictive distribution of the projected instance zo,� for a new datapoint xo,� from a particular view as

p(zo,�|xo,�, {Xo}Vo=1, {yo}Vo=1) =

R∏s=1

N (zso,�;μ(qo,s)�xo,�, 1 + x�o,�Σ(qo,s)xo,�).

The predictive distribution of the auxiliary variables to,� can alsobe found by replacing p(b,W|{Xo}Vo=1, {yo}Vo=1) with its approx-imate posterior distribution q(b,W):

p(to,�|{Xo}Vo=1, {yo}Vo=1, zo,�) =

K∏c=1

N(tco,�;μ(bc,wc)

�[

1zo,�

], 1 +

[1 zo,�

]Σ(bc,wc)

[1

zo,�

])and the predictive distribution of the class label yo,� can be formu-lated using these auxiliary variables:

M. Gönen et al. / Bayesian Multiview Dimensionality Reduction for Learning Predictive Subspaces 389

Page 4: Bayesian Multiview Dimensionality Reduction for Learning ...users.ics.aalto.fi/gonen/files/gonen_ecai14b_paper.pdfBayesian Multiview Dimensionality Reduction for Learning Predictive

p(yo,� = c|xo,�, {Xo}Vo=1, {yo}Vo=1) =

Ep(u)

⎡⎣∏j �=c

Φ(Σ(tjo,�)

−1(uΣ(tco,�) + μ(tco,�)− μ(tjo,�)))⎤⎦,

where the random variable u is standardized normal and Φ(·) is thestandardized normal cumulative distribution function. The expecta-tion can be found using a naive sampling approach. If we have morethan one view for testing, we can find the predictive distribution foreach view separately and calculate the average probability to estimatethe class label.

4 EXPERIMENTS

We test our algorithm BMDR by performing classification andretrieval experiments on FLICKR image data set from [5, 6],which contains 3411 images from 13 animal categories, namely,squirrel, cow, cat, zebra, tiger, lion, elephant,whales, rabbit, snake, antlers, wolf, and hawk. Eachanimal image is represented using 500-dimensional SIFT featuresand 634-dimensional low-level image features (e.g., color histogram,edge direction histogram, etc.). We use 2054 images for train-ing and the rest for testing as provided. We implement our algo-rithm in Matlab, which is publicly available at https://github.com/mehmetgonen/bmdr. The default hyper-parameter valuesfor BMDR are selected as (αλ, βλ) = (αφ, βφ) = (αψ, βψ) =(1, 1). We run our algorithm for 500 iterations.

In classification experiments, we use both views for training andonly image features for testing (i.e., 634-dimensional low-level im-age features). We evaluate the classification results using the test ac-curacy. Table 2 shows the classification results on FLICKR data set.We compare our results with only the results of [5,6] because MMHoutperforms several algorithms significantly in terms of classifica-tion accuracy using 30 latent topics. BMDR obtains higher test accu-racies than MMH using 10 or 15 dimensions. Figure 2 displays eighttraining images, which corresponds the images with four smallestand four largest coordinate values, for each dimension obtained byBMDR with R = 10. We can easily see that most of the dimen-sions have clear meanings. For example, the dimensions #1, #4, #8,and #10 aim to separate zebra, whales, tiger, and lion cate-gories, respectively, from other categories.

Table 2. Classification results on FLICKR data set.

Algorithm Test Accuracy

MMH (30 topics) 51.70BMDR (R = 5) 48.34BMDR (R = 10) 54.02BMDR (R = 15) 54.68

In retrieval experiments, each test image is considered as a sepa-rate query and training images are ranked based on their cosine sim-ilarities with the given test image. The cosine similarity is calculatedusing the subspace projections obtained using only image features. Atraining image is taken as relevant if it belongs to the category of thetest image. We evaluate the retrieval results using the mean averageprecision score. Table 3 shows the retrieval results on FLICKR dataset. We again compare our results with only the results of [5, 6] be-cause MMH outperforms several algorithms significantly in terms ofaverage precision using 60 latent topics. BMDR obtains significantlyhigher average precisions than MMH independent of the subspacedimensionality. Figure 3 displays one test image from each category

and the first seven training images in the ranked result list for that testimage. We see that the initial images in the result list are very mean-ingful for most of the categories even though there are some mistakesfor confusing category groups such as {cat, tiger, lion, wolf}.

Table 3. Retrieval results on FLICKR data set.

Algorithm Average Precision

MMH (60 topics) 0.163BMDR (R = 5) 0.341BMDR (R = 10) 0.383BMDR (R = 15) 0.395

Our method also decreases the computational complexity of re-trieval tasks due to low-dimensional representation used for imagesas in indexing and hashing schemes. When we need to retrieve im-ages similar to a query image, we can calculate the similarities be-tween the query image and other images very fast.

5 CONCLUSIONS

We introduce a Bayesian multiview dimensionality reduction methodcoupled with supervised learning to find predictive subspaces. Welearn a unified subspace from multiple views (i.e., feature representa-tions) by exploiting the correlation information between them. Thisapproach can also be interpreted as transfer learning between dif-ferent views. We give detailed derivations for multiclass classifica-tion using a variational approximation scheme and extensions to bi-nary classification and regression are straightforward. Experimentalresults on FLICKR image data set show that the proposed methodobtains a unified predictive subspace for classification and retrievaltask using different views.

ACKNOWLEDGEMENTS

Most of this work has been done while the first author was work-ing at the Helsinki Institute for Information Technology HIIT, De-partment of Information and Computer Science, Aalto University,Espoo, Finland. This work was financially supported by the Integra-tive Cancer Biology Program of the National Cancer Institute (grantno 1U54CA149237) and the Academy of Finland (Finnish Centreof Excellence in Computational Inference Research COIN, grant no251170).

REFERENCES

[1] J. H. Albert and S. Chib, ‘Bayesian analysis of binary and polychoto-mous response data’, Journal of the American Statistical Association,88(422), 669–679, (1993).

[2] M. J. Beal, Variational Algorithms for Approximate Bayesian Inference,Ph.D. dissertation, The Gatsby Computational Neuroscience Unit, Uni-versity College London, 2003.

[3] A. Blum and T. Mitchell, ‘Combining labeled and unlabeled data withco-training’, in Proceedings of the 11th Annual Conference on Compu-tational Learning Theory, (1998).

[4] U. Brefeld, T. Gartner, T. Scheffer, and S. Wrobel, ‘Efficient co-regularised least squares regression’, in Proceedings of the 23rd Inter-national Conference on Machine Learning, (2006).

[5] N. Chen, J. Zhu, F. Sun, and E. P. Xing, ‘Large-margin predictive latentsubspace learning for multiview data analysis’, IEEE Transactions onPattern Analysis and Machine Intelligence, 34(12), 2365–2378, (2012).

[6] N. Chen, J. Zhu, and E. P. Xing, ‘Predictive subspace learning for multi-view data: A large margin approach’, in Advances in Neural Informa-tion Processing Systems 23, (2010).

M. Gönen et al. / Bayesian Multiview Dimensionality Reduction for Learning Predictive Subspaces390

Page 5: Bayesian Multiview Dimensionality Reduction for Learning ...users.ics.aalto.fi/gonen/files/gonen_ecai14b_paper.pdfBayesian Multiview Dimensionality Reduction for Learning Predictive

�� �� � �� ��

� � �

��� ��� ��� ��� ��� ���� ���� ����

� � �

���� ���� ����� ����� ��� ��� ��� ���

� � �

������ ��� ���� ���� ���� ��� ��� ���

� � �

���� ������ ���� ���� ���� ���� ���� ����

� � �

���� ���� ������ ��� ���� ��� ��� ������

� � �

���� ���� ��� ���� ��� ������ ��� ����

� � �

���� ������ ���� ������ ��� ��� ��� �����

� � �

���� ���� ���� ���� ���� ���� ���� ����

� � �

��� ������ ���� ���� ������ ��� ���� ������

� � �

���� ���� ���� ���� ��� ��� ��� ���

Figure 2. Training images of FLICKR data set projected on the dimensions obtained by BMDR with R = 10.

[7] T. Diethe, D. R. Hardoon, and J. Shawe-Taylor, ‘Constructing nonlin-ear discriminants from multiple data views’, in Proceedings of the Eu-ropean Conference on Machine Learning and Principles and Practiceof Knowledge Discovery in Databases, (2010).

[8] J. D. R. Farquhar, D. Hardoon, Hongying Meng, J. Shawe-Taylor, andS. Szedmak, ‘Two view learning: SVM-2K, theory and practice’, inAdvances in Neural Information Processing Systems 18, (2006).

[9] A. E. Gelfand and A. F. M. Smith, ‘Sampling-based approaches to cal-culating marginal densities’, Journal of the American Statistical Asso-ciation, 85, 398–409, (1990).

[10] M. Girolami and S. Rogers, ‘Variational Bayesian multinomial probitregression with Gaussian process priors’, Neural Computation, 18(8),1790–1817, (2006).

[11] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, ‘Canonical correla-tion analysis: An overview with application to learning methods’, Neu-ral Computation, 16(12), 2639–2664, (2004).

[12] H. Hotelling, ‘Relations between two sets of variates’, Biometrika,28(3/4), 321–327, (1936).

[13] Y. Jia, M. Salzmann, and T. Darrell, ‘Factorized latent spaces withstructured sparsity’, in Advances in Neural Information Processing Sys-tems 23, (2010).

[14] N. Quadrianto and C. H. Lampert, ‘Learning multi-view neighborhoodpreserving projections’, in Proceedings of the 28th International Con-ference on Machine Learning, (2011).

[15] A. Shon, K. Grochow, A. Hertzmann, and R. Rao, ‘Learning sharedlatent structure for image synthesis and robotic imitation’, in Advancesin Neural Information Processing Systems 18, (2006).

[16] V. Sindhwani and D. S. Rosenberg, ‘An RKHS for multi-view learningand manifold co-regularization’, in Proceedings of the 25th Interna-tional Conference on Machine Learning, (2008).

[17] T. Xia, D. Tao, T. Mei, and Y. Zhang, ‘Multiview spectral embedding’,

M. Gönen et al. / Bayesian Multiview Dimensionality Reduction for Learning Predictive Subspaces 391

Page 6: Bayesian Multiview Dimensionality Reduction for Learning ...users.ics.aalto.fi/gonen/files/gonen_ecai14b_paper.pdfBayesian Multiview Dimensionality Reduction for Learning Predictive

�����

������

� � � � � �

�������� �������� �������� �� �������� ����� �������� �����

��� ������� ��� ��� ��� ������� ������� ���

�� �� �� �� ���� �� �� ����

����� ����� ����� ����� ����� ����� ����� �����

���� ���� ���� ���� ���� ���� �� ����

���� ���� ���� ������� ������� ������� ���� ����

������� ������� ��� ��� ������� ������� ��� �������

������ ������ ������ ������ ������� ������ ������ ������

����� �� ���� �� ����� ����� ����� ����

����� ����� ����� ����� ����� ����� ����� �����

������ ������ ������ ������ ������ ������ ������ ������

���� �������� ���� �������� ���� ���� ���� �����

���� ���� ���� �������� ���� ���� ��� ������

Figure 3. Sample queries and result images obtained by BMDR with R = 10 on FLICKR data set.

IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cyber-netics, 40(6), 1438–1446, (2010).

[18] B. Xie, Y. Mu, D. Tao, and K. Huang, ‘m-SNE: Multiview stochasticneighbor embedding’, IEEE Transactions on Systems, Man, and Cyber-netics – Part B: Cybernetics, 41(4), 1088–1096, (2011).

[19] S. Yu, B. Krishnapuram, R. Rosales, and R. B. Rao, ‘Bayesian co-training’, Journal of Machine Learning Research, 12(Sep), 2649–2680,

(2011).[20] D. Zhang, J. He, Y. Liu, L. Si, and R. D. Lawrence, ‘Multi-view transfer

learning with a large margin approach’, in Proceedings of the 17th ACMSIGKDD International Conference on Knowledge Discovery and DataMining, (2011).

M. Gönen et al. / Bayesian Multiview Dimensionality Reduction for Learning Predictive Subspaces392