
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, MONTH 20XX 1

Multi-view Intact Space Learning

Chang Xu, Dacheng Tao, Fellow, IEEE, Chao Xu

Abstract—It is practical to assume that an individual view is unlikely to be sufficient for effective multi-view learning. Therefore, integration of multi-view information is both valuable and necessary. In this paper, we propose the Multi-view Intact Space Learning (MISL) algorithm, which integrates the encoded complementary information in multiple views to discover a latent intact representation of the data. Even though each view on its own is insufficient, we show theoretically that by combining multiple views we can obtain abundant information for latent intact space learning. Employing the Cauchy loss, a technique used in statistical learning, as the error measurement strengthens robustness to outliers. We propose a new definition of multi-view stability and then derive the generalization error bound based on multi-view stability and Rademacher complexity, and show that the complementarity between multiple views is beneficial for both stability and generalization. MISL is efficiently optimized using a novel Iteratively Reweighted Residuals (IRR) technique, whose convergence is theoretically guaranteed. Experiments on synthetic data and real-world datasets demonstrate that MISL is an effective and promising algorithm for practical applications.

Index Terms—Multi-view learning, robust algorithms


1 INTRODUCTION

Most of the data used in video surveillance, social computing, and environmental sciences are collected from diverse domains or obtained from various feature extractors. These data are heterogeneous, because their variables can be naturally partitioned into groups. Each variable group is referred to as a particular view, and the multiple views for a particular problem can take different forms. For example, a sparse camera network containing multiple cameras is used for person re-identification and for understanding global activity through color descriptors, local binary patterns, local shape descriptors, slow features, and spatio-temporal contexts.

The information obtained from an individual view cannot comprehensively describe all examples. It has therefore become popular to leverage the information derived from the connections and differences between multiple views to better describe the objects, which has resulted in multi-view learning algorithms that integrate multiple features from diverse views (or simply multi-view features).

• Chang Xu and Chao Xu are with the Key Laboratory of Machine Perception (Ministry of Education), School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China (e-mail: [email protected]; [email protected]).

• D. Tao is with the Centre for Quantum Computation & Intelligent Systems and the Faculty of Engineering & Information Technology, University of Technology, Sydney, 235 Jones Street, Ultimo, NSW 2007, Australia (e-mail: [email protected]; office: +61 2 9514 1829; fax: +61 2 9514 4517).

© 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Recently, a number of multi-view learning algorithms have been designed and successfully applied to various computer vision and intelligent system problems [1, 2]. Co-training [3] is one of the earliest semi-supervised schemes for multi-view learning. It trains alternately to maximize the mutual agreement on two distinct views of the unlabeled data. Many variants [4–11], such as co-EM [4] and co-regularization [6, 7], have since been developed. Their success relies on the assumption that the two sufficient and redundant views are conditionally independent of each other given the class label. However, this assumption tends to be too rigorous for many practical applications, and thus some alternative assumptions have been studied. [12] showed that weak dependence can also guarantee successful co-training. [13] proved that a weaker assumption, called ε-expansion, is sufficient for iterative co-training to succeed. After that, Wang and Zhou conducted a series of in-depth analyses and revealed some interesting properties of co-training, including the large diversity of classifiers [14, 15], label propagation over two views [16], and co-training with insufficient views [17].

Multiple kernel learning (MKL) was originally developed to control the search space capacity of possible kernel matrices to achieve good generalization, but it has been widely applied to problems involving multi-view data [18, 19]. This is because the kernels in MKL naturally correspond to different views, and combining kernels either linearly or non-linearly improves learning performance, especially when the views are assumed to be independent. [20, 21] formulated MKL as a semi-definite programming problem. [22] treated MKL as a second-order cone programming problem and developed an SMO algorithm to efficiently obtain the optimal solution. [23, 24] developed an efficient semi-infinite linear program and made MKL applicable to

arXiv:1904.02340v1 [cs.CV] 4 Apr 2019



large-scale problems. [25, 26] proposed SimpleMKL by exploring an adaptive 2-norm regularization formulation. [27, 28] constructed the connection between MKL and group-LASSO to model group structure.

A number of works exploit the shared latent subspace across diverse views, such as canonical correlation analysis (CCA) [29–31], its kernel extension [32], its probabilistic interpretation [33], and its sparse formulation [34]. Recently, other methodologies have been proposed for this task: [35, 36] used Gaussian processes to discover a latent variable model shared by multi-view data; [37, 38] found a joint embedding for multi-view data by maximizing mutual information; these techniques are particularly effective for modeling the correlations between different views. To simultaneously account for the dependencies and independencies of different input views, various methods have been introduced that factorize the latent space into a shared part common to all views and a private part for each view [39, 40]. By considering side information, the recent work on max-margin Harmonium (MMH) [41] showed that applying the large-margin principle to learn a subspace shared by multi-view data is more suitable for prediction.

However, existing multi-view learning methods have their own limitations. First, since an individual view is insufficient for learning, the integration of multi-view information is necessary and valuable; however, apart from several works concentrating on co-training-style algorithms [17, 42], the issue of single-view insufficiency has not been clearly addressed and comprehensively studied. Second, the term "intact" means complete and not damaged in Merriam-Webster, which are exactly the two favorable properties we wish the latent intact space to possess. However, most existing multi-view learning algorithms fail to discover latent intact spaces, due to the information loss in learning from insufficient views or the influence of noise in insufficient views. Finally, there is a demand for theoretical support to guarantee the performance of multi-view learning.

In this paper, we assume that while each individual view only captures partial information, all the views together possess redundant information about the object. In contrast to most existing multi-view learning models that assume view sufficiency, we propose a Multi-view Intact Space Learning (MISL) algorithm to address the insufficiency of each individual view and to integrate the encoded complementary information. The new view functions in the MISL algorithm are rigorously studied. To enhance the robustness of the model, we measure the reconstruction error from different views using the Cauchy loss, which is robust to outliers and has an optimal breakdown point compared with the conventional L2 and L1 losses [43]. To solve the two sub-problems w.r.t. the view generation functions and the intact space derived from the optimization problem, we develop an Iteratively Reweighted Residuals (IRR) optimization technique, which is efficiently implemented and has guaranteed convergence. Although each view only captures partial information of the latent intact space, MISL theoretically guarantees that, given enough views, the latent intact space can be approximately restored. We introduce a new definition of "multi-view stability" to analyze the robustness of the proposed algorithm. Moreover, we derive the generalization error bound based on the multi-view stability and Rademacher complexity, and show that the complementarity of multiple views can improve the multi-view stability and generalization. Finally, we conduct experiments to explicitly illustrate the view insufficiency assumption and the robustness of the proposed algorithm and show, using real-world datasets, that our approach can accurately discover an intact space for subsequent classification tasks.

Fig. 1. View insufficiency assumption.

The rest of the paper is organized as follows. In Section 2, we formulate the multi-view learning problem and propose the MISL algorithm. The optimization method is presented in Section 3, and theoretical analysis is given in Section 4. Section 5 presents the experimental results, and Section 6 concludes the paper. The detailed proofs of the theoretical results are in Section 7.

2 PROBLEM FORMULATION

View sufficiency is usually not guaranteed in practice. By contrast, we assume "view insufficiency": each view only captures partial information, but all the views together carry redundant information about the latent intact representation (shown in Figure 1). Many practical problems support this assumption. For example, in a camera network, cameras are placed in public areas to predict potentially dangerous situations in time to take necessary action. However, each camera alone captures insufficient information and thus cannot comprehensively describe the environment, which can only be fully recovered by integrating the data from all the cameras.

In multi-view learning, an example x is represented by multi-view features z^v, where m is the number



of views and 1 ≤ v ≤ m. Supposing x is the latent intact representation, each view z^v is a particular reflection of the example, obtained from the view generation function f_v on x,

$$z^v = f_v(x) + \varepsilon^v, \qquad (1)$$

where ε^v is the view-dependent noise. According to the view insufficiency assumption, the function f_v(x) is non-invertible, so we cannot recover x from z^v even given the view function f_v : X → Z_v. For a linear function f_v(x) = W_v x, non-invertibility implies that W_v is not of full column rank.
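As a concrete illustration of Eq. (1), the following NumPy sketch (all dimensions, ranks, and the noise level are illustrative assumptions, not values from the paper) generates several views through rank-deficient linear maps W_v: no single view determines x, but stacking enough views restores it.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10                      # dimension of the latent intact space
x = rng.normal(size=d)      # a latent intact data point

# Three insufficient views: each W_v has rank at most 4 < d, so
# z^v = W_v x + eps^v (Eq. (1)) cannot be inverted for x on its own.
Ws, views = [], []
for v in range(3):
    W = rng.normal(size=(8, 4)) @ rng.normal(size=(4, d))   # rank <= 4
    Ws.append(W)
    views.append(W @ x + 0.01 * rng.normal(size=8))

# Least squares on a single view returns the minimum-norm solution of an
# under-determined system, missing the part of x outside W's row space.
x_one, *_ = np.linalg.lstsq(Ws[0], views[0], rcond=None)

# Stacking all views makes the combined map full column rank (generically),
# so x is approximately restored.
x_all, *_ = np.linalg.lstsq(np.vstack(Ws), np.concatenate(views), rcond=None)

print(np.linalg.norm(x_one - x))   # large: one view is insufficient
print(np.linalg.norm(x_all - x))   # small: the views are jointly sufficient
```

The ranks here are chosen so that three views together span the latent space; with fewer views the stacked system would remain under-determined.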

Hence, our objective is to learn a series of view generation functions {W_v}_{v=1}^m that generate the multi-view data points from a latent intact space X. A straightforward approach is to minimize the empirical risk over {z^v − f_v(x)}_{v=1}^m using the L1 or L2 loss. Given Eq. (1), the noise in different views seriously influences the discovery of the optimal view generation functions and the latent intact space. However, as thoroughly studied in robust statistics [44], neither the L1 nor the L2 loss is robust to outliers, and thus the performance of multi-view learning will be seriously degraded.

2.1 Robust Estimators

The M-estimator is popular in robust statistics. Let r_i be the residual of the i-th data point, i.e., the difference between the i-th observation and its fitted value. The standard least-squares method tries to minimize $\sum_i r_i^2$, which is unstable if outliers are present; a single outlier can have a strong effect and distort the estimated parameters. M-estimators try to reduce the effect of outliers by replacing the squared residuals $r_i^2$ with another function of the residuals,

$$\min \sum_i \rho(r_i), \qquad (2)$$

where ρ is a symmetric, positive-definite function with a unique minimum at zero, chosen to increase more slowly than the square function. The corresponding influence function is defined as

$$\psi(x) = \frac{\partial \rho(x)}{\partial x}, \qquad (3)$$

which measures the influence of a random data point on the value of the parameter estimate.

As shown in Figure 2, for the L2 estimator (least-squares) with ρ(x) = x²/2, the influence function is ψ(x) = x; that is, the influence of a data point on the estimate increases linearly with the size of its error. This confirms the non-robustness of the least-squares estimate. Although the L1 (absolute value) estimator with ρ(x) = |x| reduces the influence of large errors, its influence function has no cut-off. For an estimator to be robust, the influence of any single observation must be insufficient to yield a significant offset. The Cauchy estimator has been shown to possess this valuable property, with

$$\rho(x) = \log\left(1 + (x/c)^2\right) \qquad (4)$$

along with the upper-bounded influence function

$$\psi(x) = \frac{2x}{c^2 + x^2}. \qquad (5)$$

Fig. 2. Example robust estimators: the ρ-functions and ψ-functions of the least-squares, least-absolute, and Cauchy estimators.

Furthermore, the Cauchy estimator [45] theoretically has a breakdown point of nearly 50%, which means that nearly half of the observations (e.g., arbitrarily large observations) can be incorrect before the estimator gives an incorrect (e.g., arbitrarily large) result. Therefore, we deploy the Cauchy estimator in MISL.
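The boundedness of the Cauchy influence function is easy to check numerically; a short sketch (with the illustrative choice c = 1) comparing Eqs. (4)-(5) against the least-squares estimator:

```python
import numpy as np

def rho_cauchy(r, c=1.0):
    return np.log(1 + (r / c) ** 2)        # Eq. (4)

def psi_cauchy(r, c=1.0):
    return 2 * r / (c ** 2 + r ** 2)       # Eq. (5), d(rho)/dr

r = np.linspace(-100, 100, 100001)

# The least-squares influence psi(r) = r grows without bound with the error,
print(np.max(np.abs(r)))                   # 100.0

# while the Cauchy influence is bounded by 1/c, attained at r = +-c,
# so a single gross outlier cannot dominate the estimate.
print(np.max(np.abs(psi_cauchy(r))))       # 1.0
```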

2.2 Multi-view Intact Space Learning

We consider a multi-view training sample D = {z_i^v | 1 ≤ i ≤ n, 1 ≤ v ≤ m} whose view number is m and sample size is n. The reconstruction error over the latent intact space X can be measured using the Cauchy loss

$$\frac{1}{mn}\sum_{v=1}^{m}\sum_{i=1}^{n} \log\left(1 + \frac{\|z_i^v - W_v x_i\|^2}{c^2}\right), \qquad (6)$$

where x_i ∈ R^d is a data point in the latent intact space X, W_v ∈ R^{D_v×d} is the v-th view generation matrix, and c is a constant scale parameter.

Moreover, we adopt some regularization terms to penalize the latent data points x and the view generation matrices W. Finally, the resulting objective function can be written as

$$\min_{x,W}\; \frac{1}{mn}\sum_{v=1}^{m}\sum_{i=1}^{n} \log\left(1 + \frac{\|z_i^v - W_v x_i\|^2}{c^2}\right) + C_1\sum_{v=1}^{m}\|W_v\|_F^2 + C_2\sum_{i=1}^{n}\|x_i\|_2^2, \qquad (7)$$

where C_1 and C_2 are non-negative constants that can be determined using cross validation. Problem (7) jointly models the relationships between the latent intact space X and each view space Z_v using a robust approach. It can be expected that by solving this



problem with an input of multiple insufficient views, a series of view generation functions and a latent intact space can be found that represents the object in its entirety.
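For reference, the objective of Problem (7) is straightforward to evaluate; the sketch below (dimensions and the constants c, C1, C2 are illustrative assumptions) checks that a latent code which reconstructs the views scores better than an uninformative one.

```python
import numpy as np

def misl_objective(Z, W, X, c=1.0, C1=0.01, C2=0.01):
    """Objective of Problem (7): Cauchy reconstruction loss over views and
    samples, plus Frobenius regularization on each W_v and L2 on each x_i.
    Z: list of (D_v, n) view matrices, W: list of (D_v, d), X: (d, n)."""
    m, n = len(Z), X.shape[1]
    loss = sum(
        np.log(1 + np.sum((Z[v] - W[v] @ X) ** 2, axis=0) / c ** 2).sum()
        for v in range(m)
    ) / (m * n)
    reg = C1 * sum(np.sum(Wv ** 2) for Wv in W) + C2 * np.sum(X ** 2)
    return loss + reg

rng = np.random.default_rng(0)
d, n, dims = 5, 20, [8, 6]
X = rng.normal(size=(d, n))                                   # latent points
W = [rng.normal(size=(Dv, d)) for Dv in dims]                 # view maps
Z = [W[v] @ X + 0.1 * rng.normal(size=(dims[v], n)) for v in range(2)]

# The true latent code reconstructs the views far better than a zero code.
print(misl_objective(Z, W, X) < misl_objective(Z, W, np.zeros_like(X)))  # True
```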

At inference, given a new multi-view example {z^1, · · · , z^m}, the corresponding data point x in the intact space can be obtained by solving the problem

$$\min_{x}\; \frac{1}{m}\sum_{v=1}^{m} \log\left(1 + \frac{\|z^v - W_v^* x\|^2}{c^2}\right) + C_2\|x\|_2^2, \qquad (8)$$

where W_v^* is the optimal view generation function.

2.3 Kernel Extension

When the view space Z lies in an infinite-dimensional Hilbert space, there exists a nonlinear mapping φ : R^d → H such that k(z, z_i) = ⟨φ(z), φ(z_i)⟩. Furthermore, we denote the learned projection function W in the feature space as φ(W). Assuming the atoms of W lie in the space spanned by the input data, we can write φ(W) = φ(Z)A, where A is the atom representation matrix and φ(Z) = [φ(z_1), · · · , φ(z_n)]. The kernel extension of Problem (7) can then be obtained through

$$\begin{aligned} \|z_i^v - W_v x_i\|_{\mathcal{H}}^2 &= \langle \phi(z_i^v) - \phi(W_v)x_i,\; \phi(z_i^v) - \phi(W_v)x_i\rangle\\ &= \langle\phi(z_i^v), \phi(z_i^v)\rangle - 2\langle\phi(z_i^v), \phi(W_v)x_i\rangle + \langle\phi(W_v)x_i, \phi(W_v)x_i\rangle\\ &= k(z_i^v, z_i^v) - 2\sum_{j=1}^{n} k(z_i^v, z_j^v)\,(A_v x_i)_j + x_i^T A_v^T K_{Z_v} A_v x_i, \end{aligned}$$

and

$$\|W_v\|_{\mathcal{H}}^2 = \langle\phi(Z_v)A_v, \phi(Z_v)A_v\rangle = \operatorname{tr}\left(A_v^T K_{Z_v} A_v\right), \qquad (9)$$

where K_{Z_v} and A_v are the kernel matrix and the atom matrix of view v, respectively. The kernelized problem can be optimized using the same technique as the linear MISL defined by Eq. (7).

3 OPTIMIZATION

Problem (7) can be decomposed into two sub-problems, over the view generation functions W and over the latent intact space X, using the alternating optimization method. Inspired by the generalized Weiszfeld's method [46], we develop an Iteratively Reweighted Residuals (IRR) algorithm to efficiently optimize these two sub-problems.

Given fixed view generation functions {W_v}_{v=1}^m, Eq. (7) can be minimized over each latent point x in the latent intact space X,

$$\min_{x}\; \mathcal{J} = \frac{1}{m}\sum_{v=1}^{m} \log\left(1 + \frac{\|z^v - W_v x\|^2}{c^2}\right) + C_2\|x\|_2^2. \qquad (10)$$

Setting the gradient of J with respect to x to zero, we have

$$\sum_{v=1}^{m} \frac{-2W_v^T(z^v - W_v x)}{c^2 + \|z^v - W_v x\|_2^2} + 2mC_2 x = 0, \qquad (11)$$

which can be rewritten as

$$\left(\sum_{v=1}^{m} \frac{W_v^T W_v}{c^2 + \|z^v - W_v x\|_2^2} + mC_2\right) x = \sum_{v=1}^{m} \frac{W_v^T z^v}{c^2 + \|z^v - W_v x\|_2^2}, \qquad (12)$$

where r^v = z^v − W_v x is referred to as the residual of the example x on each view. A weight function is then defined as

$$Q = \left[\frac{1}{c^2 + \|r^1\|^2}, \cdots, \frac{1}{c^2 + \|r^m\|^2}\right], \qquad (13)$$

which can be used to reduce the influence of outliers and adjust the errors introduced by different views. Based on Eqs. (12) and (13), we have

$$x = \left(\sum_{v=1}^{m} W_v^T Q_v W_v + mC_2\right)^{-1} \sum_{v=1}^{m} W_v^T Q_v z^v. \qquad (14)$$

Since Q depends on x, we iteratively update x using Eq. (14) from an initial estimate until convergence. The iterative procedure is described in Algorithm 1.

Algorithm 1 Iteratively Reweighted Residuals (IRR)

Input: z^1, · · · , z^m, {W_v}_{v=1}^m, and x^0
Initialize the residuals {r^v}_{v=1}^m using x^0
for k = 1, 2, · · · do
    Choose the weight function Q through Eq. (13)
    Use Eq. (14) to obtain the estimate x^k
    Update the residuals {r^v}_{v=1}^m
    if the estimates of x converge then
        break
    end if
end for
Output: x^k
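A minimal NumPy sketch of Algorithm 1 for the x-update (Eqs. (13)-(14)); the data, the constants c and C2, and the convergence tolerance are illustrative assumptions. Consistent with the convergence analysis in Section 4, the objective of Eq. (10) never increases across iterations.

```python
import numpy as np

def irr_solve_x(zs, Ws, x0, c=1.0, C2=0.1, max_iter=50, tol=1e-8):
    """Algorithm 1: iteratively reweighted residuals for Problem (10).
    zs: list of m view vectors z^v; Ws: list of m matrices W_v; x0: start."""
    m, x = len(zs), x0.copy()
    for _ in range(max_iter):
        # Eq. (13): one Cauchy-derived weight per view; views with large
        # residuals (outliers) are automatically down-weighted.
        q = [1.0 / (c ** 2 + np.sum((zs[v] - Ws[v] @ x) ** 2)) for v in range(m)]
        # Eq. (14): closed-form reweighted ridge solution for x.
        A = sum(q[v] * Ws[v].T @ Ws[v] for v in range(m)) + m * C2 * np.eye(len(x))
        b = sum(q[v] * Ws[v].T @ zs[v] for v in range(m))
        x_new = np.linalg.solve(A, b)
        if np.linalg.norm(x_new - x) < tol:     # estimates of x converge
            return x_new
        x = x_new
    return x

def J(zs, Ws, x, c=1.0, C2=0.1):                # objective of Eq. (10)
    return sum(np.log(1 + np.sum((z - W @ x) ** 2) / c ** 2)
               for z, W in zip(zs, Ws)) / len(zs) + C2 * np.sum(x ** 2)

rng = np.random.default_rng(0)
d, dims = 4, [6, 5, 7]
x_true = rng.normal(size=d)
Ws = [rng.normal(size=(Dv, d)) for Dv in dims]
zs = [W @ x_true + 0.05 * rng.normal(size=W.shape[0]) for W in Ws]

x0 = np.zeros(d)
x_hat = irr_solve_x(zs, Ws, x0)
print(J(zs, Ws, x_hat) <= J(zs, Ws, x0))   # True: the update never increases J
```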

By fixing all data points in the intact space X, Eq. (7) reduces to the minimization over each view generation function W,

$$\min_{W}\; \mathcal{J} = \frac{1}{n}\sum_{i=1}^{n} \log\left(1 + \frac{\|z_i - W x_i\|^2}{c^2}\right) + C_1\|W\|^2. \qquad (15)$$

Given the residual r_i = z_i − W x_i and the weight function on the training data

$$Q = \left[\frac{1}{c^2 + \|r_1\|^2}, \cdots, \frac{1}{c^2 + \|r_n\|^2}\right], \qquad (16)$$

we can update the projection function W by

$$W = \left(\sum_{i=1}^{n} z_i Q_i x_i^T\right)\left(\sum_{i=1}^{n} x_i Q_i x_i^T + nC_1\right)^{-1}. \qquad (17)$$

Similar to the optimization over the latent comprehensive space X, we can also estimate W using Algorithm 1.
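The analogous per-view W-update (Eqs. (16)-(17)) can be sketched the same way; again the data sizes and constants are illustrative assumptions.

```python
import numpy as np

def irr_solve_W(Zv, X, W0, c=1.0, C1=0.1, max_iter=50):
    """Iteratively reweighted residuals for Problem (15) on one view.
    Zv: (D, n) view data; X: (d, n) fixed latent points; W0: (D, d) start."""
    n = X.shape[1]
    W = W0.copy()
    for _ in range(max_iter):
        R = Zv - W @ X                               # residuals r_i, column-wise
        q = 1.0 / (c ** 2 + np.sum(R ** 2, axis=0))  # Eq. (16): per-sample weights
        num = (Zv * q) @ X.T                         # sum_i z_i Q_i x_i^T
        den = (X * q) @ X.T + n * C1 * np.eye(X.shape[0])
        W = num @ np.linalg.inv(den)                 # Eq. (17)
    return W

rng = np.random.default_rng(1)
d, D, n = 3, 5, 200
X = rng.normal(size=(d, n))
W_true = rng.normal(size=(D, d))
Zv = W_true @ X + 0.05 * rng.normal(size=(D, n))

W_hat = irr_solve_W(Zv, X, np.zeros((D, d)))
# The fitted map reconstructs the view data much better than W = 0.
print(np.linalg.norm(Zv - W_hat @ X) < np.linalg.norm(Zv))   # True
```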

Page 5: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE … · 2019-04-05 · IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, MONTH 20XX 1 Multi-view Intact

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, MONTH 20XX 5

4 THEORETICAL ANALYSIS

In this section, we analyze the convergence of the optimization technique, present a new definition of "multi-view stability", and then derive the generalization error bound of the proposed multi-view learning algorithm. The detailed proofs are given in Section 7.

4.1 Convergence Analysis

We employ the majorize-minimize framework from [46] to analyze the convergence of the IRR algorithm. The key idea of this framework is to globally approximate J using a sequence of quadratic functions. Taking the subproblem over x as an example, after having found x_k, we can construct a quadratic function ψ(x; x_k) to upper bound J(x) such that the following conditions hold:

$$\psi(x_k; x_k) = \mathcal{J}(x_k), \qquad \psi'(x_k; x_k) = \mathcal{J}'(x_k). \qquad (18)$$

Then ψ(x; x_k) has the form

$$\psi(x; x_k) = \mathcal{J}(x_k) + (x - x_k)^T\mathcal{J}'(x_k) + (x - x_k)^T C(x_k)(x - x_k), \qquad (19)$$

with the symmetric matrix

$$C(x_k) = \sum_{v=1}^{m} \frac{W_v^T W_v}{c^2 + \|z^v - W_v x_k\|_2^2} + C_2. \qquad (20)$$

Therefore, we have the following theorem to guarantee the convergence of the IRR algorithm.

Theorem 1. The IRR algorithm update in Eq. (14) guarantees that the sequence J(x_k) is monotonically non-increasing, i.e., J(x_{k+1}) ≤ J(x_k) for all k, and the sequence x_k converges to a local minimizer of J.

4.2 View Insufficiency Analysis

Information theory provides a natural way to explain the view insufficiency assumption. In particular, for discrete random variables A, B, and C, the conditional mutual information I(A; B|C) measures how much information is shared between A and B conditioned on the already-known C.

Given an active view set S in the multi-view learning setting, each time we randomly generate a view Z^v from the latent intact space X and add it into S. For example, beginning with the view set S already containing one view Z^1, the insufficiency of the newly generated view Z^2 can be measured by

$$I(X; Z^2|Z^1) \geq \varepsilon_{\mathrm{info}}^{(2)}, \qquad (21)$$

where ε^{(2)}_info is a variable larger than zero. The conventional view sufficiency assumption, I(X; Z^2|Z^1) ≤ ε_info for some small ε_info > 0 [47], states that Z^1 and Z^2 are redundant with regard to their information about X. By contrast, our view insufficiency assumption (i.e., Eq. (21)) implies that an individual view cannot sufficiently describe X, and therefore each view will carry additional information about X that the other views do not have. For the current view set S = {Z^1, Z^2}, the information obtained with respect to X is measured using

$$I(X; Z^1, Z^2) = I(X; Z^1) + I(X; Z^2|Z^1). \qquad (22)$$

According to the chain rule of mutual information, we have the following proposition.

Proposition 1. Considering there are m randomly generated views {Z^1, · · · , Z^m} from the latent comprehensive space X, the information obtained to learn X can be measured by

$$I(X; Z^1, \cdots, Z^m) = \sum_{i=1}^{m} I(X; Z^i|Z^{i-1}, \cdots, Z^1).$$

Proposition 1 suggests that more views bring in more information with respect to X. Although each individual view is insufficient, we can receive abundant information to learn the latent intact space X by exploiting the complementarity between multiple views.

Assume that the latent intact space X can be completely captured by the ideal view set S*_M = {Z^1, · · · , Z^M}. Let X*_M denote the optimal solution with respect to S*_M, i.e., X*_M = arg min_X L(X; S*_M), where L(·) is the expectation of the loss function ℓ(·) over the samples on different views. Since M could be very large, we randomly select m views from S*_M to construct a smaller view set S*_m for multi-view learning. Sridharan and Kakade [47] developed a significant information-theoretical framework to analyze multi-view learning algorithms, based on which we show that the learning error of the newly proposed algorithm is bounded by the following theorem.

Theorem 2. Given the ideal view set S*_M = {Z^1, · · · , Z^M} and the bounded loss function ℓ(·), the expected losses of multi-view learning based on S*_M and its subset S*_m are denoted by L(X; S*_M) and L(X; S*_m). Their difference is bounded by

$$|L(X; S_M^*) - L(X; S_m^*)| \leq \sqrt{I(X; S_M^*\backslash S_m^*\,|\,S_m^*)},$$

which decreases with increasing m.

According to Theorem 2, although we cannot obtain all the views necessary to learn the latent intact space X, we can approximately restore it when provided with enough views.

4.3 Generalization Error Analysis

In this section, we propose a new definition of multi-view stability and use it together with the Rademacher complexity to analyze the generalization error of the proposed multi-view learning algorithm.



In learning theory, algorithmic stability [48] is employed to measure the variation of the output for different training sets that are identical up to the removal or change of a single example. We apply this idea to multi-view learning, and then propose the definition of multi-view stability.

Definition 1. Given f = (f_1, · · · , f_m) with f ∈ F, where f_v is the view function on the v-th view, the function class F is said to have multi-view stability β if, for any two multi-view examples z = {z^1, · · · , z^m} and z̃ = {z̃^1, · · · , z̃^m} that differ only at a single coordinate of an individual view, sup_{f∈F} ‖f(z) − f(z̃)‖_1 ≤ β.

Stability characterizes a system's persistence against the perturbation of its variables. On the other hand, since the perturbed variables can be regarded as outliers (or at least noisy examples) for the system, stability has an implicit connection with robustness. Hence, the theoretical analysis based on stability is expected to deliver similar conclusions from the perspective of robustness.

Conventional concentration inequalities, e.g., Hoeffding's inequality and McDiarmid's inequality, are designed under the independent and identically distributed (i.i.d.) assumption. However, in more complex cases (e.g., multi-view learning), the variables can depend on each other, which calls for a new concentration inequality developed for dependent random variables. By leveraging recent results on the concentration of dependent random variables [49], we use the following concentration inequality for our analysis of multi-view learning.

Theorem 3. Let φ : Z → R, and let M ∈ R^{m×m} be the coefficient matrix measuring variable dependence. For any two multi-view examples z = {z^1, · · · , z^m} and z̃ = {z̃^1, · · · , z̃^m} that differ only at a single coordinate of an individual view, |φ(z) − φ(z̃)| ≤ c. Then for any ε > 0,

$$P\{\phi - \mathbb{E}_z[\phi] \geq \epsilon\} \leq \exp\left(-\frac{2\epsilon^2}{mc^2\|M\|_\infty^2}\right).$$

Before proceeding, we define some notation for convenience. Given a multi-view example z = {z^1, · · · , z^m} and multi-view functions f = (f_1, · · · , f_m), we define f(z) = (f_1(z^1), · · · , f_m(z^m)). For any f ∈ F, let

$$F(z) = \frac{1}{m}\sum_{v=1}^{m} f_v(z^v), \qquad \psi(\mathcal{F}, z) = \sup_{f\in\mathcal{F}}\left[\mathbb{E}_z[F(z)] - F(z)\right].$$

Besides multi-view stability, we employ theRademacher complexity to measure the hypothesiscomplexity, and its definition is adapted from [50] byremoving the assumption that z1, · · · , zm are i.i.d.

Definition 2. Let $z = \{z^1, \cdots, z^m\}$ be a set of random variables, and let $\{\sigma_v\}_{v=1}^m$ be a set of independent Rademacher random variables with zero mean and unit standard deviation. The empirical Rademacher complexity of $\mathcal{F}$ is

$$\mathcal{R}(\mathcal{F}, z) = \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{v=1}^m \sigma_v f_v(z^v) \,\Big|\, z\right]. \quad (23)$$

The Rademacher complexity of $\mathcal{F}$ is $\mathcal{R}_m(\mathcal{F}) = \mathbb{E}_z[\mathcal{R}(\mathcal{F}, z)]$.
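For a concrete linear function class with a shared weight vector, the supremum in Eq. (23) has a closed form by Cauchy-Schwarz, so the empirical Rademacher complexity can be estimated by direct Monte Carlo sampling of $\sigma$. A minimal sketch; the class and the data here are illustrative assumptions, not the paper's hypothesis space:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, trials = 8, 5, 4000

# One fixed multi-view example: z^v in R^d for each of the m views.
z = rng.normal(size=(m, d))

# For the linear class F = { z -> (w^T z^1, ..., w^T z^m) : ||w||_2 <= 1 },
# the supremum in Eq. (23) has a closed form by Cauchy-Schwarz:
#   sup_w (1/m) sum_v sigma_v w^T z^v = (1/m) || sum_v sigma_v z^v ||_2.
sigma = rng.choice([-1.0, 1.0], size=(trials, m))
sups = np.linalg.norm(sigma @ z, axis=1) / m
R_hat = sups.mean()
print(R_hat)
```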

With these definitions, we now present our main result.

Theorem 4. Fix $\delta \in (0, 1)$ and $m \geq 1$. If $\mathcal{F}$ has multi-view stability $\beta$, then with probability at least $1 - \delta$ over a multi-view example $z = \{z^1, \cdots, z^m\}$,

$$\sup_{f \in \mathcal{F}} \mathbb{E}_z[F(z)] - F(z) \leq 2\mathcal{R}_m(\mathcal{F}) + \beta \|M\|_\infty \sqrt{\frac{\ln(1/\delta)}{2m}}.$$

We can directly apply Theorem 4 to the setting in which there are $n$ i.i.d. multi-view examples.

Corollary 1. Fix $\delta \in (0, 1)$, $m \geq 1$ and $n \geq 1$. Let $Z = \{z_i\}_{i=1}^n$ be a set of $n$ multi-view examples. If $\mathcal{F}$ has multi-view stability $\beta$, then with probability at least $1 - \delta$ over $Z$,

$$\sup_{f \in \mathcal{F}} \mathbb{E}_z[F(Z)] - F(Z) \leq 2\mathcal{R}_{mn}(\mathcal{F}) + \beta \|M\|_\infty \sqrt{\frac{\ln(1/\delta)}{2mn}}.$$

Through this theoretical analysis, we find that the generalization error of multi-view learning algorithms can be well bounded by the multi-view stability and the Rademacher complexity of the hypothesis. Next we proceed to analyze the specific multi-view stability and Rademacher complexity of the proposed MISL algorithm.

4.3.1 Multi-view Stability

For any multi-view example $z = \{z^1, \cdots, z^m\}$, we consider a perturbation $\tau$ at the $j$-th coordinate of the $k$-th view. The new multi-view example is thus written as $\bar{z} = \{z^1, \cdots, z^k + \Delta^k, \cdots, z^m\}$, where $\Delta^k$ has only one non-zero element $\tau$ at the $j$-th coordinate. We suppose $\|W_v\| \leq \Omega_v$.

The following proposition illustrates the multi-view stability of the proposed MISL algorithm.

Proposition 2. Given $f = (f_1, \cdots, f_m)$ with $f \in \mathcal{F}$, where $f_v$ is the view function on the $v$-th view, the function class $\mathcal{F}$ learned through the MISL algorithm has multi-view stability

$$\beta = \frac{\sqrt{2}}{c}|\tau| + \sum_{v=1}^m \frac{(128)^{1/4}\,\Omega_v}{c} \sqrt{\frac{|\tau|}{m c C_2}}.$$

According to the above analysis, we know that multiple views together determine the multi-view stability. If one view is perturbed by outliers or noise, the other clean views will act to alleviate the influence of the perturbation and keep the output results unchanged. This resistance is further strengthened as the number of clean views increases. Thus, multi-view learning performance is improved through the cooperation between multiple views. However, if we model all views independently, this cooperation is ignored.

4.3.2 Rademacher Complexity

We proceed to bound the Rademacher complexity of $\mathcal{F}$ by first bounding its covering number [51, 52]. Given $W = [W_1; \cdots; W_m]$, we assume that $W$ lies in a sphere with radius $\Upsilon$. The covering number of $\mathcal{F}$ can be bounded by the following lemma.

Lemma 1. For $\varepsilon > 0$ and $\|x\| \leq \chi$, the covering number of $\mathcal{F}$ is upper bounded by

$$\mathcal{N}(\mathcal{F}, \varepsilon) \leq \left(\frac{3\sqrt{2}\,\chi\Upsilon}{\varepsilon c}\right)^s, \quad (24)$$

where the exponent $s$ is the dimension of the constraint set $\mathcal{W}$ in the sense of its manifold structure.

Obviously, $s$ can be upper bounded by $d(D_1 + \cdots + D_m)$. We suggest that if there exist dependencies between multiple views, handling multiple views simultaneously is advantageous over modeling each view independently. In particular, the connections between the matrices $\{W_v\}_{v=1}^m$ will decrease the intrinsic dimension of $W$, leading to a lower $s$ and thus a smaller covering number.

By applying the discretization theorem to the covering number, we easily obtain the following proposition bounding the Rademacher complexity of the proposed algorithm.

Proposition 3. The Rademacher complexity of $\mathcal{F}$ learned through the MISL algorithm is bounded by

$$\mathcal{R}_m(\mathcal{F}) \leq \inf_{\varepsilon} \left\{ \sqrt{\frac{2s \ln\big(3\sqrt{2}\,\chi\Upsilon/(\varepsilon c)\big)}{m}} + \varepsilon \right\}.$$
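The infimum over $\varepsilon$ can be evaluated numerically by a simple grid search; all constants below ($s$, $\chi$, $\Upsilon$, $c$, $m$) are hypothetical values chosen only to illustrate the shape of the bound:

```python
import numpy as np

# Hypothetical constants (illustrative only): s = intrinsic dimension,
# chi = bound on ||x||, Upsilon = radius of the sphere containing W,
# c = Cauchy-loss scale, m = number of views.
s, chi, Upsilon, c, m = 10, 1.0, 5.0, 1.0, 20

def prop3_bound(eps):
    # sqrt(2 s ln(3 sqrt(2) chi Upsilon / (eps c)) / m) + eps
    return np.sqrt(2 * s * np.log(3 * np.sqrt(2) * chi * Upsilon / (eps * c)) / m) + eps

eps_grid = np.linspace(1e-3, 3.0, 10000)
bound = prop3_bound(eps_grid).min()
print(bound)
```

The grid minimum trades off the covering-number term (which blows up as $\varepsilon \to 0$) against the additive $\varepsilon$.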

Finally, by integrating Propositions 2 and 3 with Corollary 1, we can easily obtain the generalization error bound of the proposed MISL algorithm. Though some views may be noisy, multi-view stability preserves the performance of multi-view learning by exploiting the information in the other, accurate views. The dependencies between views decrease the complexity of the hypothesis space, and thus improve the generalization error bound.

5 EXPERIMENTS

In this section, we present qualitative as well as quantitative evaluations on two toy datasets and three real-world datasets. The proposed MISL algorithm and its kernel extension KMISL were compared with the convex Multi-view Subspace Learning algorithm (MSL) [53], the Factorized Latent Spaces with Structured Sparsity algorithm (FLSSS) [40], and the shared Gaussian Process Latent Variable Model (sGPLVM) [36].

5.1 Toy Examples

First we evaluated our approach on the problem of 3-D point cloud reconstruction. For the 3-D models provided by [54], we extracted the point cloud data X (i.e., point positions in rectangular space coordinates), and then projected it onto three 2-D planes (i.e., the X-Y, X-Z, and Y-Z planes) as the base views Z1, Z2 and Z3 of X. To further validate the robustness of the proposed algorithm, we generated additional noisy views from these base views: given a randomly placed window of fixed size on each base view, we added noise to the data points inside it, thereby obtaining a series of distinct noisy views for multi-view learning.
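The view-construction protocol described above can be sketched as follows; the point cloud, window size and SNR handling are simplified stand-ins for the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# X: a toy 3-D point cloud (n x 3).  The three base views are its
# projections onto the X-Y, X-Z and Y-Z coordinate planes.
n = 500
X = rng.normal(size=(n, 3))
base_views = [X[:, [0, 1]], X[:, [0, 2]], X[:, [1, 2]]]

def noisy_view(view, snr_db, window=0.5, rng=rng):
    """Corrupt the points falling inside a randomly placed square window."""
    v = view.copy()
    center = rng.uniform(v.min(axis=0), v.max(axis=0))
    inside = np.all(np.abs(v - center) < window, axis=1)
    signal_power = np.mean(v[inside] ** 2) if inside.any() else 1.0
    noise_std = np.sqrt(signal_power / (10 ** (snr_db / 10)))
    v[inside] += rng.normal(scale=noise_std, size=v[inside].shape)
    return v

# Three noisy copies per base view -> nine noisy views (SNR = 10 dB).
views = [noisy_view(bv, snr_db=10) for bv in base_views for _ in range(3)]
print(len(views), views[0].shape)
```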

In Figure 3, we show the 3-D latent intact spaces discovered by the MISL algorithm for the vase model, with different numbers of noisy views under distinct signal-to-noise ratios (SNR) as input. Specifically, Figure 3 (a) shows the initial 3-D point cloud in red and its corresponding meshed result in green. For each base view, we randomly generated 3, 6 and 9 noisy views, giving three settings with 9, 18 and 27 noisy views in total for training. Given SNR = 10, we show the intact spaces discovered from these three settings in Figure 3 (b), (c) and (d), respectively. When SNR = 20, the intact spaces are presented in Figure 3 (e), (f) and (g). The reconstruction results under similar settings for the chair and compel models are presented in Figures 4 and 5, respectively. From these reconstruction results, we find that the cleaner the signal (i.e., the higher the SNR), the more accurate the reconstructed object. More importantly, although each individual view is noisy and insufficient for completely inferring the 3-D objects, the MISL algorithm is sufficiently robust to outliers and can accurately discover the intact space by utilizing the complementarity between these views.

The second toy example is based on the synthetic "S-curve" data shown in Figure 6 (a). We uniformly sampled some 3-dimensional data points X (see Figure 6 (b)), and then projected them onto three 2-D planes (i.e., the X-Y, X-Z, and Y-Z planes) as three views Z1, Z2 and Z3 of X (see Figure 6 (c)). Based on these three views, the proposed MISL algorithm can effectively reconstruct the S-curve, as shown in Figure 6 (d). We further added noise to the clean views to evaluate the robustness of MISL. Figures 6 (e) and (g) show the noisy views at SNR = 15 and SNR = 20, respectively. In particular, it has already become difficult for a human being to discern the original curve from the heavily corrupted views in Figure 6 (e). Thanks to the robustness of MISL, the noise is appropriately handled, and the intact spaces under SNR = 15 and SNR = 20 (see Figures 6 (f) and (h)) are approximately restored.



Fig. 3. Reconstruction of 3-D vase model using MISL algorithm.


Fig. 4. Reconstruction of 3-D chair model using MISL algorithm.


Fig. 5. Reconstruction of 3-D compel model using MISL algorithm.

Fig. 6. Reconstruction of s-curve using MISL algorithm.

5.2 Face Recognition

The CMU PIE face database [55] contains 41,368 images of 68 people; each person is imaged under 13 different poses, 43 different illumination conditions, and with 4 different expressions. To construct the multi-view setting, we selected two near-frontal poses (i.e., C9 and C29) as two views; a pair of images of one person under these two poses with the same illumination can then be seen as a two-view example. Different algorithms were used to project the multi-view faces into appropriate spaces for face recognition. 50% of the images of each person were randomly selected for training, and the rest for testing. The k-nearest neighbor method based on the Euclidean distance was applied for face recognition, where k was set to 3. Given the noisy views at SNR = 2 and SNR = 5, the face recognition accuracies of the different algorithms over different dimensional spaces are shown in Figure 7 (a) and (b).
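The evaluation protocol (random 50/50 split, 3-nearest-neighbor classification with the Euclidean distance) can be sketched as below; the embedded features here are synthetic stand-ins, not PIE data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 20 subjects x 10 images each, already embedded in a
# d-dimensional space by some projection algorithm (hypothetical features).
n_subjects, n_images, d = 20, 10, 8
centers = rng.normal(scale=3.0, size=(n_subjects, d))
X = np.repeat(centers, n_images, axis=0) + rng.normal(size=(n_subjects * n_images, d))
y = np.repeat(np.arange(n_subjects), n_images)

# Random 50/50 split, as in the protocol above.
perm = rng.permutation(len(y))
half = len(y) // 2
train, test = perm[:half], perm[half:]

def knn_predict(Xtr, ytr, Xte, k=3):
    # Euclidean k-nearest-neighbour majority vote.
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]
    votes = ytr[nn]
    return np.array([np.bincount(v).argmax() for v in votes])

pred = knn_predict(X[train], y[train], X[test], k=3)
acc = (pred == y[test]).mean()
print(acc)
```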

From Figure 7, we find that MISL stably outperforms the other algorithms at all dimensionalities. The noisy views do not seriously damage the performance of MISL, but are optimally combined to find the latent intact space. This is mainly due to the robustness of MISL and its ability to appropriately



Fig. 7. Face recognition accuracies of different algorithms on different dimensional spaces.

Fig. 8. Convergence curves of MISL for different dimensional latent spaces.

handle the complementarity between multiple views.

In Figure 8, we present the convergence curves of the MISL algorithm in discovering different low-dimensional intact spaces. It is instructive to note that IRR is an effective method for optimizing MISL, and leads to fast convergence of the reconstruction error (i.e., Eq. (6)).
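One plausible reading of an iteratively reweighted update for the Cauchy-loss objective of Eq. (39) is the following majorization-minimization sketch: freeze per-view Cauchy weights at the current iterate and solve the resulting regularized least-squares problem. This illustrates the reweighting idea only; it is not the paper's exact Eq. (14):

```python
import numpy as np

rng = np.random.default_rng(0)
m, D, d, c, C2 = 5, 12, 3, 1.0, 0.1

# Hypothetical fixed view generators W_v and a multi-view example z_v.
W = rng.normal(size=(m, D, d))
x_true = rng.normal(size=d)
z = np.stack([Wv @ x_true + 0.05 * rng.normal(size=D) for Wv in W])
z[0] += 5.0  # make one view an outlier

def objective(x):
    res2 = np.array([np.sum((zv - Wv @ x) ** 2) for zv, Wv in zip(z, W)])
    return np.log1p(res2 / c**2).mean() + C2 * np.dot(x, x)

# Iteratively reweighted update: freeze the Cauchy weights
# w_v = 1 / (c^2 + ||z_v - W_v x||^2) and solve the resulting
# regularized least-squares problem for x.
x = np.zeros(d)
for _ in range(50):
    w = np.array([1.0 / (c**2 + np.sum((zv - Wv @ x) ** 2)) for zv, Wv in zip(z, W)])
    A = sum(wv * Wv.T @ Wv for wv, Wv in zip(w, W)) / m + C2 * np.eye(d)
    b = sum(wv * Wv.T @ zv for wv, (zv, Wv) in zip(w, zip(z, W))) / m
    x = np.linalg.solve(A, b)

print(objective(np.zeros(d)), objective(x))
```

Because the quadratic surrogate majorizes the log term, each such step cannot increase the objective, mirroring the monotonic decrease proved in Theorem 1.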

5.3 Human Motion Recognition

The UCF101 dataset [56] is a large dataset of human actions. It consists of 101 action classes, which can be further divided into five types: Human-Object Interaction, Body-Motion Only, Human-Human Interaction, Playing Musical Instrument, and Sports. In total, there are over 13k clips and 27 hours of video data. The entire dataset was split into training and test samples three times, each split randomly selecting two-thirds of the data for training and the remaining data for testing, such that videos from the same group never appear in both the training and test sets. Recently, deep learning models [57, 58] have been widely used to learn effective features from videos for action recognition. Since feature learning is not among our research interests, we employed easily obtained conventional handcrafted features as input, and launched multi-view learning to evaluate whether the complementarity between multiple views can actually be exploited to improve recognition performance. Motion Boundary Histograms (MBH) and Histograms of Oriented Gradients (HOG) are two well-known

TABLE 1
Average performance of different algorithms.

            MBH+HOG    MBH        HOG
sGPLVM      58.43%     52.80%     42.02%
FLSSS       60.75%     53.56%     43.12%
MSL         60.49%     54.70%     42.74%
MISL        62.57%     55.75%     43.81%
KMISL       64.23%     56.23%     44.62%

motion descriptors [59], and thus we chose these two views to represent each clip. In the experiments, we used a linear SVM for classification, and conducted cross-validation to find the optimal C. In the case of multi-class classification, we used a one-against-rest approach and selected the class with the highest score. The performance measure was the average classification accuracy over all classes on the three splits.

We utilized different algorithms to project the multi-view video clips into 1500-dimensional latent spaces and report the recognition accuracy based on these embedded examples. We present the confusion matrix for KMISL over the 101 action classes in Figure 9 (a). After merging the recognition results of classes belonging to the same action type, the confusion matrices over the five action types for MISL, MSL, FLSSS, and sGPLVM are shown in Figure 9 (b)-(e), respectively. MISL provides performance improvements in most of the action types. The recognition accuracy of the different algorithms based on various feature combinations is summarized in Table 1. Compared with recognition based on a single kind of feature as input, learning with multiple features through the different multi-view learning algorithms demonstrates varying degrees of performance improvement, as a result of exploiting the complementary information underlying the multi-view features. In particular, MISL obtains the best recognition result, due to its robust approach to obtaining the intact latent space.

5.4 RGB-D Object Recognition

The RGB-D dataset is a large-scale multi-view object dataset collected with an RGB-D camera. It contains color and depth images of 300 physically distinct everyday objects. The RGB image and the corresponding depth image can therefore be seen as two approaches capturing the shape and the visual



Fig. 9. Confusion tables of different algorithms on 101 action classes and 5 action types.

Fig. 10. Category recognition accuracy of different algorithms on different dimensional spaces.

appearance of an object from one of 51 categories. Video sequences of each object were recorded at 20 Hz and then subsampled by taking every fifth video frame, giving 41,877 RGB and depth images for the experiments. We extracted 1000-dimensional gradient kernel descriptors to describe both the RGB image and the depth image. For category recognition, we randomly left one object out from each category for testing and trained SVM classifiers on all the remaining objects. The performance was measured as the average recognition accuracy across 10 trials.
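The leave-one-object-out protocol (hold out one whole object instance per category) can be sketched in a few lines; the inventory below is a toy stand-in for the 300-object RGB-D dataset:

```python
import random

# Hypothetical inventory: category -> list of object instance ids.
inventory = {
    "apple": ["apple_1", "apple_2", "apple_3"],
    "bowl": ["bowl_1", "bowl_2"],
    "mug": ["mug_1", "mug_2", "mug_3", "mug_4"],
}

def leave_one_object_out(inventory, seed=0):
    """Hold one whole object instance out per category for testing."""
    rng = random.Random(seed)
    train, test = [], []
    for category, objects in inventory.items():
        held_out = rng.choice(objects)
        test.append(held_out)
        train.extend(o for o in objects if o != held_out)
    return train, test

train, test = leave_one_object_out(inventory)
print(len(train), len(test))
```

Holding out whole objects (rather than individual frames) prevents near-duplicate frames of the same object from appearing in both splits.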

We used different algorithms to project the multi-view objects into spaces of different dimensionalities and report the category recognition accuracy in these spaces in Figure 10. We can see that both MISL and its kernel extension perform much better than the other competitors. In contrast to these competitors, whose objective is to discover a subspace bearing the dependencies or independencies of different views, the proposed MISL algorithm attempts to find an intact space fusing all the information provided by the diverse views. Therefore, the dimensionality of the spaces discovered by MISL is not limited to being lower than those of the original features. Moreover, the robustness of MISL ensures that the noise introduced by multiple views can be appropriately handled, which leads to the performance improvement in multi-view learning.

6 CONCLUSION

We have presented a novel robust learning algorithm for recovering the latent intact representations of multi-view examples, by assuming view insufficiency, i.e., that each view captures only partial information but all views together carry redundant information about the object. Theoretical analysis of the view insufficiency assumption suggests that we can approximately restore the latent intact space by exploiting the complementarity between multiple insufficient views. Based on a new definition of "multi-view stability" and the Rademacher complexity, we derive the generalization error bound of the proposed multi-view learning algorithm, and show that this bound can be improved through the complementarity between multiple views. Finally, we design a new Iteratively Reweight Residuals (IRR) technique that converges quickly to solve the optimization problem. Experimental results on synthetic data and real-world datasets demonstrate that the proposed MISL algorithm and its kernel extension are robust, effective and promising for practical applications.

7 PROOFS

We provide below the detailed proofs of the theoretical results in Section 4.

7.1 Proof of Theorem 1

Assume that $\psi(x; x_k)$ is locally convex with respect to $x$ and has a local minimizer. Setting $x_{k+1}$ as this minimizer, we have

$$\psi'(x_{k+1}; x_k) = \mathcal{J}'(x_k) + 2C(x_k)(x_{k+1} - x_k) = 0. \quad (25)$$

Substituting for $\mathcal{J}'$, we obtain the update rule in Eq. (14) (and similarly for Eq. (17)).


By appropriately choosing $x_k$ near $x$, we have $\mathcal{J}(x) \leq \psi(x; x_k)$, implying that

$$\mathcal{J}(x_{k+1}) \leq \psi(x_{k+1}; x_k) = \mathcal{J}(x_k) + (x_{k+1} - x_k)^T \mathcal{J}'(x_k) + (x_{k+1} - x_k)^T C(x_k)(x_{k+1} - x_k). \quad (26)$$

By combining Eqs. (25) and (26), we have

$$\mathcal{J}(x_{k+1}) - \mathcal{J}(x_k) \leq -(x_{k+1} - x_k)^T C(x_k)(x_{k+1} - x_k) \leq -C_2 \|x_{k+1} - x_k\|^2 \leq 0. \quad (27)$$

This shows that $\mathcal{J}(x_{k+1}) \leq \mathcal{J}(x_k)$, proving the first part of Theorem 1.

Since the sequence $\mathcal{J}(x_k)$ is monotonic and lower bounded, it converges. From Eq. (27), we can then write

$$\|x_{k+1} - x_k\|^2 \leq \frac{1}{C_2}\big[\mathcal{J}(x_k) - \mathcal{J}(x_{k+1})\big]. \quad (28)$$

Considering that $(\mathcal{J}(x_k))$ is convergent, we conclude that $(x_k)$ has convergent subsequences.

In non-convex optimization [60], a common assumption is that the non-convex function $\mathcal{J}(\cdot)$ is "locally convex" around its "local" minimum. Suppose the initialization $x_0$ is "good", in that it is situated sufficiently close to a local minimizer $x^*$. It is then possible that the entire sequence $(x_k)$ is restricted to a ball $B_r(x^*)$ of small enough radius $r$ around $x^*$, and that the ball contains no other stationary points of $\mathcal{J}$. In this case, we are guaranteed that every convergent subsequence, and hence the whole sequence $(x_k)$, converges to $x^*$.

7.2 Proof of Theorem 2

Considering $|\ell(x)| \leq 1$ and two probability distributions $P$ and $Q$, we have

$$\left|\int \ell(x)\,dQ - \int \ell(x)\,dP\right| = \left|\int (1 - \beta)\ell\,dQ\right| \leq \int |1 - \beta|\,dQ \leq \sqrt{KL(Q; P)}, \quad (29)$$

where $\beta = \frac{dP}{dQ}$ and the last step is due to the $L_1$ variational distance being bounded by the square root of the KL divergence. Using this, for a fixed view set we have

$$\big|\mathbb{E}_{X|S^*_m} \ell(X; S^*_M) - \mathbb{E}_{X|S^*_M} \ell(X; S^*_M)\big| \leq \sqrt{KL(P_{X|S^*_M}; P_{X|S^*_m})}. \quad (30)$$

Taking the expectation with respect to $S^*_M$ and using Jensen's inequality, we get

$$\big|\mathbb{E}_{S^*_M}\mathbb{E}_{X|S^*_m} \ell(X; S^*_M) - L(X; S^*_M)\big| \leq \sqrt{\mathbb{E}_{S^*_M} KL(P_{X|S^*_M}; P_{X|S^*_m})}. \quad (31)$$

Since

$$L(X; S^*_m) \leq \mathbb{E}_{S^*_M}\mathbb{E}_{X|S^*_m} \ell(X; S^*_M), \quad (32)$$

and $L(X; S^*_M) \leq L(X; S^*_m)$, we obtain

$$|L(X; S^*_M) - L(X; S^*_m)| \leq \sqrt{\mathbb{E}_{S^*_M} KL(P_{X|S^*_M}; P_{X|S^*_m})}. \quad (33)$$

Considering

$$\mathbb{E}_{S^*_M} KL(P_{X|S^*_M}; P_{X|S^*_m}) = I(X; S^*_M \setminus S^*_m \mid S^*_m), \quad (34)$$

we have

$$|L(X; S^*_M) - L(X; S^*_m)| \leq \sqrt{I(X; S^*_M \setminus S^*_m \mid S^*_m)}. \quad (35)$$
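The analytic step behind Eq. (29) is the classical control of the $L_1$ variational distance by the KL divergence; in its standard (Pinsker) form the bound carries a factor of 2 under the square root, which can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pinsker's inequality: ||Q - P||_1 <= sqrt(2 KL(Q; P)).  For a bounded
# statistic |ell| <= 1 this controls |E_Q ell - E_P ell| as in Eq. (29).
for _ in range(200):
    P = rng.dirichlet(np.ones(10))
    Q = rng.dirichlet(np.ones(10))
    l1 = np.abs(Q - P).sum()
    kl = np.sum(Q * np.log(Q / P))
    assert l1 <= np.sqrt(2.0 * kl) + 1e-12
print("Pinsker bound verified on 200 random pairs")
```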

Furthermore,

$$\begin{aligned}
& I(X; S^*_M \setminus S^*_{m-1} \mid S^*_{m-1}) - I(X; S^*_M \setminus S^*_m \mid S^*_m) \\
&= H(X, S^*_{m-1}) + H(S^*_M) - H(X, S^*_M) - H(S^*_{m-1}) \\
&\quad - \big(H(X, S^*_m) + H(S^*_M) - H(X, S^*_M) - H(S^*_m)\big) \\
&= H(Z_m \mid S^*_{m-1}) - H(Z_m \mid X, S^*_{m-1}) \\
&= I(X; Z_m \mid S^*_{m-1}) \geq 0,
\end{aligned}$$

which shows that the larger $m$ is, the smaller $I(X; S^*_M \setminus S^*_m \mid S^*_m)$ will be, and hence the difference between $L(X; S^*_M)$ and $L(X; S^*_m)$ will decrease.

7.3 Proof of Theorem 4

For any two multi-view examples $z = \{z^1, \cdots, z^m\}$ and $\bar{z} = \{\bar{z}^1, \cdots, \bar{z}^m\}$ that differ only at a single coordinate on an individual view, assume that $\psi(\mathcal{F}, z) \geq \psi(\mathcal{F}, \bar{z})$. We have

$$\begin{aligned}
|\psi(\mathcal{F}, z) - \psi(\mathcal{F}, \bar{z})| &= \left|\sup_{f \in \mathcal{F}} \big[\mathbb{E}_z[F(z)] - F(z)\big] - \sup_{f \in \mathcal{F}} \big[\mathbb{E}_z[F(z)] - F(\bar{z})\big]\right| \\
&\leq \left|\sup_{f \in \mathcal{F}} \big[\mathbb{E}_z[F(z)] - F(z) - \mathbb{E}_z[F(z)] + F(\bar{z})\big]\right| \\
&= \left|\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{v=1}^m \big[f_v(\bar{z}^v) - f_v(z^v)\big]\right| \\
&\leq \sup_{f \in \mathcal{F}} \frac{1}{m}\|f(z) - f(\bar{z})\|_1 \leq \frac{\beta}{m}.
\end{aligned}$$

The last inequality results from the multi-view stability. According to Theorem 3, we therefore have

$$\psi(\mathcal{F}, z) \leq \mathbb{E}_z[\psi(\mathcal{F}, z)] + \beta\|M\|_\infty \sqrt{\frac{\ln(1/\delta)}{2m}}, \quad (36)$$

with probability at least $1 - \delta$. Using the symmetry argument and the contraction principle, we proceed to upper-bound $\mathbb{E}_z[\psi(\mathcal{F}, z)]$ by the Rademacher complexity $\mathcal{R}_m(\mathcal{F})$. It is necessary to note that our analysis does not require the random variables to be independent.

Based on Jensen's inequality, we begin with the definition of $\mathbb{E}_z[\psi(\mathcal{F}, z)]$ and have

$$\mathbb{E}_z[\psi(\mathcal{F}, z)] = \mathbb{E}_z\left[\sup_{f \in \mathcal{F}} \big[\mathbb{E}_{\bar{z}}[F(\bar{z})] - F(z)\big]\right] \leq \mathbb{E}_{z,\bar{z}}\left[\sup_{f \in \mathcal{F}} \big[F(\bar{z}) - F(z)\big]\right].$$

Given a set of independent variables $\{\sigma_v\}_{v=1}^m$, uniformly distributed on $\{-1, 1\}$, we define

$$h_{\sigma_v}(z, \bar{z}) = \begin{cases} z & \text{if } \sigma_v = 1, \\ \bar{z} & \text{if } \sigma_v = -1, \end{cases} \quad (37)$$


and

$$\bar{h}_{\sigma_v}(z, \bar{z}) = \begin{cases} \bar{z} & \text{if } \sigma_v = 1, \\ z & \text{if } \sigma_v = -1. \end{cases} \quad (38)$$

By symmetry,

$$\begin{aligned}
\mathbb{E}_z[\psi(\mathcal{F}, z)] &\leq \mathbb{E}_{z,\bar{z}}\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{v=1}^m f_v(\bar{z}^v) - f_v(z^v)\right] \\
&= \mathbb{E}_\sigma\left[\mathbb{E}_{z,\bar{z}}\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{v=1}^m f_v(\bar{h}_{\sigma_v}(z^v, \bar{z}^v)) - f_v(h_{\sigma_v}(z^v, \bar{z}^v)) \,\Big|\, \sigma\right]\right] \\
&= \mathbb{E}_{\sigma,z,\bar{z}}\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{v=1}^m \sigma_v\big(f_v(\bar{z}^v) - f_v(z^v)\big)\right] \\
&\leq 2\,\mathbb{E}_{\sigma,z}\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{v=1}^m \sigma_v f_v(z^v)\right] = 2\mathcal{R}_m(\mathcal{F}).
\end{aligned}$$

Finally, we complete the proof by combining the above results.

7.4 Proof of Proposition 2

We obtain the optimal intact representations of the multi-view examples $z$ and $\bar{z}$ through

$$\min_x \frac{1}{m}\sum_{v=1}^m \log\left(1 + \frac{\|z^v - W_v x\|^2}{c^2}\right) + C_2\|x\|^2 \quad (39)$$

and

$$\min_{\bar{x}} \frac{1}{m}\sum_{v=1}^m \log\left(1 + \frac{\|\bar{z}^v - W_v \bar{x}\|^2}{c^2}\right) + C_2\|\bar{x}\|^2, \quad (40)$$

and define $\Delta x = \bar{x} - x$. We introduce the following notation for convenience:

$$R(z, x) = \frac{1}{m}\sum_{v=1}^m \log\left(1 + \frac{\|z^v - W_v x\|^2}{c^2}\right). \quad (41)$$

Though $R(z, x)$ is not globally convex w.r.t. $x$, it can be assumed to be locally convex within a small region. Given $z$ and its perturbed copy $\bar{z}$, we assume that their intact representations $x$ and $\bar{x}$ are not far from each other, and fulfill the following inequalities derived from convexity: for all $t \in [0, 1]$,

$$R(\bar{z}, x + t\Delta x) - R(\bar{z}, x) \leq t\big(R(\bar{z}, \bar{x}) - R(\bar{z}, x)\big) \quad (42)$$

and

$$R(\bar{z}, \bar{x} - t\Delta x) - R(\bar{z}, \bar{x}) \leq t\big(R(\bar{z}, x) - R(\bar{z}, \bar{x})\big). \quad (43)$$

Summing the two inequalities yields

$$R(\bar{z}, x + t\Delta x) - R(\bar{z}, x) + R(\bar{z}, \bar{x} - t\Delta x) - R(\bar{z}, \bar{x}) \leq 0. \quad (44)$$

Considering that $x$ and $\bar{x}$ are the optimal solutions of Eq. (39) and Eq. (40), respectively, we have

$$R(z, x) + C_2\|x\|^2 \leq R(z, x + t\Delta x) + C_2\|x + t\Delta x\|^2 \quad (45)$$

and

$$R(\bar{z}, \bar{x}) + C_2\|\bar{x}\|^2 \leq R(\bar{z}, \bar{x} - t\Delta x) + C_2\|\bar{x} - t\Delta x\|^2. \quad (46)$$

Using Eq. (44), Eq. (46) can be rewritten as

$$R(\bar{z}, x + t\Delta x) - R(\bar{z}, x) \leq C_2\|\bar{x} - t\Delta x\|^2 - C_2\|\bar{x}\|^2. \quad (47)$$

Summing Eqs. (45) and (47), we obtain

$$\begin{aligned}
& mC_2\big(\|x\|^2 - \|x + t\Delta x\|^2 - \|\bar{x} - t\Delta x\|^2 + \|\bar{x}\|^2\big) \\
&\leq \log\left(1 + \frac{\|z^k - W_k(x + t\Delta x)\|^2}{c^2}\right) - \log\left(1 + \frac{\|\bar{z}^k - W_k(x + t\Delta x)\|^2}{c^2}\right) \\
&\quad + \log\left(1 + \frac{\|\bar{z}^k - W_k x\|^2}{c^2}\right) - \log\left(1 + \frac{\|z^k - W_k x\|^2}{c^2}\right) \\
&\leq \frac{\sqrt{2}}{c}\big(\|z^k - W_k(x + t\Delta x)\| - \|\bar{z}^k - W_k(x + t\Delta x)\| + \|\bar{z}^k - W_k x\| - \|z^k - W_k x\|\big) \\
&\leq \frac{2\sqrt{2}}{c}\|z^k - \bar{z}^k\| = \frac{2\sqrt{2}\,|\tau|}{c}.
\end{aligned}$$

By setting $t = \frac{1}{2}$, the above inequality can then be reformulated as

$$\|\Delta x\|^2 \leq \frac{4\sqrt{2}\,|\tau|}{m c C_2}. \quad (48)$$
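The simplification at $t = 1/2$ relies on the parallelogram identity: with $\bar{x} = x + \Delta x$, both middle terms equal $\|x + \Delta x/2\|^2$, and the left-hand side collapses to $\|\Delta x\|^2/2$. A quick numerical verification:

```python
import numpy as np

rng = np.random.default_rng(0)

# With t = 1/2 and x_bar = x + dx:
#   ||x||^2 - ||x + dx/2||^2 - ||x_bar - dx/2||^2 + ||x_bar||^2 = ||dx||^2 / 2.
for _ in range(100):
    x = rng.normal(size=7)
    dx = rng.normal(size=7)
    x_bar = x + dx
    lhs = (np.dot(x, x) - np.sum((x + dx / 2) ** 2)
           - np.sum((x_bar - dx / 2) ** 2) + np.dot(x_bar, x_bar))
    assert abs(lhs - np.dot(dx, dx) / 2) < 1e-9
print("identity verified")
```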

For any view except the $k$-th view, the difference between $f_v(z^v, x)$ and $f_v(z^v, \bar{x})$ is

$$\begin{aligned}
|f_v(z^v, x) - f_v(z^v, \bar{x})| &= \left|\log\left(1 + \frac{\|z^v - W_v x\|^2}{c^2}\right) - \log\left(1 + \frac{\|z^v - W_v \bar{x}\|^2}{c^2}\right)\right| \\
&\leq \frac{\sqrt{2}}{c}\Big|\|z^v - W_v x\| - \|z^v - W_v \bar{x}\|\Big| \\
&\leq \frac{\sqrt{2}}{c}\|W_v \Delta x\| \leq \frac{(128)^{1/4}\,\Omega_v}{c}\sqrt{\frac{|\tau|}{m c C_2}}.
\end{aligned}$$

For the $k$-th view, we have

$$\begin{aligned}
|f_k(z^k, x) - f_k(\bar{z}^k, \bar{x})| &= \left|\log\left(1 + \frac{\|z^k - W_k x\|^2}{c^2}\right) - \log\left(1 + \frac{\|\bar{z}^k - W_k \bar{x}\|^2}{c^2}\right)\right| \\
&\leq \frac{\sqrt{2}}{c}\Big|\|z^k - W_k x\| - \|\bar{z}^k - W_k \bar{x}\|\Big| \\
&\leq \frac{\sqrt{2}}{c}\big(\|z^k - \bar{z}^k\| + \|W_k \Delta x\|\big) \\
&\leq \frac{(128)^{1/4}\,\Omega_k}{c}\sqrt{\frac{|\tau|}{m c C_2}} + \frac{\sqrt{2}}{c}|\tau|.
\end{aligned}$$

Therefore, $\beta = \|f(z) - f(\bar{z})\|_1$ can be bounded by

$$\|f(z) - f(\bar{z})\|_1 \leq \frac{\sqrt{2}}{c}|\tau| + \sum_{v=1}^m \frac{(128)^{1/4}\,\Omega_v}{c}\sqrt{\frac{|\tau|}{m c C_2}}.$$

7.5 Proof of Lemma 1

Let us consider the function class $\mathcal{F}_v = \{f_v(g_v(\cdot)) : g_v \in \mathcal{G}_v\}$ on view $v$. In the proposed algorithm, $f_v(\cdot)$ is the Cauchy loss function, so $f_v(g_v(\cdot)) = \log(1 + g_v(\cdot)^2/c^2)$, where $g_v = \|z^v - W_v x\|$. It is proven in [61] that if $f_v(\cdot)$ is a Lipschitz function with Lipschitz constant $L$, then the covering number of $\mathcal{F}_v$ satisfies $\mathcal{N}(\mathcal{F}_v, \varepsilon) \leq \mathcal{N}(\mathcal{G}_v, \varepsilon/L)$. Since $L = \sqrt{2}/c$ for any view in the proposed algorithm, we have

$$\mathcal{N}(\mathcal{F}, \varepsilon) \leq \mathcal{N}\left(\mathcal{G}, \frac{c\varepsilon}{\sqrt{2}}\right). \quad (49)$$

Considering $g_v = \|z^v - W_v x\|$ and $g'_v = \|z^v - W'_v x'\|$, we have

$$g'_v - g_v = \|z^v - W'_v x'\| - \|z^v - W_v x\| \leq \|z^v - W'_v x\| - \|z^v - W_v x\| \leq \|x\|\,\|W'_v - W_v\|. \quad (50)$$


Given $\|x\| \leq \chi$, we see that $g_v(\cdot)$ is a $\chi$-Lipschitz continuous function w.r.t. $W_v$. Hence, for all views, we have

$$\mathcal{N}(\mathcal{F}, \varepsilon) \leq \mathcal{N}\left(\mathcal{W}, \frac{c\varepsilon}{\sqrt{2}\chi}\right). \quad (51)$$

$\mathcal{W}$ is composed of the function classes $\mathcal{W}_1, \cdots, \mathcal{W}_m$ on the different views. For any combination $W_1, \cdots, W_m$, the proposed multi-view learning algorithm tends to assign them a shared coefficient $x$ for the multi-view example $\{z^v\}_{v=1}^m$, that is, $[z^1; \cdots; z^m] = [W_1; \cdots; W_m]x$. Hence, by defining $W = [W_1; \cdots; W_m]$ and $z = [z^1; \cdots; z^m]$, the complexity of $\mathcal{W}$ can be approximated by that of the function class of concatenated matrices $W$. It is necessary to emphasize that, though we concatenate the transformation matrices to estimate the complexity of the hypotheses, this does not imply that concatenating multiple views is equivalent to the proposed multi-view learning algorithm. For the concatenated features $z$, directly searching for $W$ in a large space would increase the risk of over-fitting. By contrast, by grouping features into different views, we are able to effectively discover the optimal matrix $W_v$ for each view in a smaller space with the help of the prior information from the other views.

Suppose that $W$ lies in a sphere with radius $\Upsilon$. It is well known that the covering number of the sphere is upper bounded by

$$\mathcal{N}(\mathcal{W}, \varepsilon) \leq \left(\frac{3\Upsilon}{\varepsilon}\right)^s, \quad (52)$$

where the exponent $s$ is the dimension of the constraint set $\mathcal{W}$ in the sense of its manifold structure. Finally, we complete the proof by combining Eqs. (51) and (52).

8 ACKNOWLEDGMENT

The authors greatly thank the handling Associate Editor and the three anonymous reviewers for their constructive comments on this submission.

REFERENCES

[1] L. Yuan, Y. Wang, P. M. Thompson, V. A. Narayan, and J. Ye, "Multi-source learning for joint analysis of incomplete multi-modality neuroimaging data," in SIGKDD. ACM, 2012, pp. 1149–1157.

[2] S. Xiang, L. Yuan, W. Fan, Y. Wang, P. M. Thompson, and J. Ye, "Multi-source learning with block-wise missing data for alzheimer's disease prediction," in SIGKDD. ACM, 2013, pp. 185–193.

[3] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in COLT. ACM, 1998, pp. 92–100.

[4] K. Nigam and R. Ghani, "Analyzing the effectiveness and applicability of co-training," in CIKM. ACM, 2000, pp. 86–93.

[5] I. Muslea, S. Minton, and C. A. Knoblock, "Active + semi-supervised learning = robust multi-view learning," in ICML, vol. 2, 2002, pp. 435–442.

[6] V. Sindhwani, P. Niyogi, and M. Belkin, "A co-regularization approach to semi-supervised learning with multiple views," in Workshop on Learning with Multiple Views at ICML 2005. Citeseer, 2005.

[7] A. Kumar, P. Rai, and H. Daume, "Co-regularized multi-view spectral clustering," in NIPS, 2011, pp. 1413–1421.

[8] I. Muslea, S. Minton, and C. Knoblock, "Active learning with multiple views," Journal of Artificial Intelligence Research, vol. 27, no. 1, pp. 203–233, 2006.

[9] S. Bickel and T. Scheffer, "Multi-view clustering," in ICDM, vol. 36, 2004.

[10] A. Kumar, P. Rai, and H. Daume III, "Co-regularized spectral clustering with multiple kernels," in NIPS 2010 Workshop: New Directions in Multiple Kernel Learning, 2010.

[11] S. Yu, B. Krishnapuram, R. Rosales, and R. Rao, "Bayesian co-training," The Journal of Machine Learning Research, vol. 999888, pp. 2649–2680, 2011.

[12] S. Abney, "Bootstrapping," in ACL. Association for Computational Linguistics, 2002, pp. 360–367.

[13] M.-F. Balcan, A. Blum, and K. Yang, "Co-training and expansion: Towards bridging theory and practice," in NIPS, 2004, pp. 89–96.

[14] W. Wang and Z.-H. Zhou, "Analyzing co-training style algorithms," in Machine Learning: ECML 2007. Springer, 2007, pp. 454–465.

[15] Z.-H. Zhou and M. Li, "Tri-training: Exploiting unlabeled data using three classifiers," Knowledge and Data Engineering, IEEE Transactions on, vol. 17, no. 11, pp. 1529–1541, 2005.

[16] W. Wang and Z. Zhou, "A new analysis of co-training," in ICML, 2010, pp. 1135–1142.

[17] W. Wang and Z.-H. Zhou, "Co-training with insufficient views," in ACML, 2013, pp. 467–482.

[18] S. Ji, L. Sun, R. Jin, and J. Ye, "Multi-label multiple kernel learning," in NIPS, 2008, pp. 777–784.

[19] L. Tang, J. Chen, and J. Ye, "On multiple kernel learning with multiple labels," in IJCAI, 2009, pp. 1255–1260.

[20] G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. Jordan, "Learning the kernel matrix with semi-definite programming," Computer, 2002.

[21] G. R. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan, "Learning the kernel matrix with semidefinite programming," The Journal of Machine Learning Research, vol. 5, pp. 27–72, 2004.

[22] F. R. Bach, G. R. Lanckriet, and M. I. Jordan, "Multiple kernel learning, conic duality, and the smo algorithm," in ICML. ACM, 2004, p. 6.

[23] S. Sonnenburg, G. Ratsch, and C. Schafer, "A general and efficient multiple kernel learning algorithm," 2006.

[24] S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf, "Large scale multiple kernel learning," The Journal of Machine Learning Research, vol. 7, pp. 1531–1565, 2006.

[25] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet, "More efficiency in multiple kernel learning," in ICML. ACM, 2007, pp. 775–782.

[26] A. Rakotomamonjy, F. Bach, S. Canu, Y. Grandvalet et al., "Simplemkl," Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.

[27] M. Szafranski, Y. Grandvalet, and A. Rakotomamonjy, "Composite kernel learning," Machine Learning, vol. 79, no. 1, pp. 73–103, 2010.

[28] Z. Xu, R. Jin, H. Yang, I. King, and M. Lyu, "Simple and efficient multiple kernel learning by group lasso," in ICML, 2010, pp. 1175–1182.

[29] K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan, "Multi-view clustering via canonical correlation analysis," in ICML, 2009.

[30] S. M. Kakade and D. P. Foster, "Multi-view regression via canonical correlation analysis," in Learning Theory. Springer, 2007, pp. 82–96.

[31] L. Sun, S. Ji, and J. Ye, "Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 1, pp. 194–200, 2011.

[32] S. Akaho, "A kernel method for canonical correlation analysis," arXiv preprint cs/0609071, 2006.

[33] F. R. Bach and M. I. Jordan, "A probabilistic interpretation of canonical correlation analysis," 2005.

[34] C. Archambeau and F. Bach, "Sparse probabilistic projections," 2009.

[35] N. D. Lawrence, "Gaussian process latent variable models for visualisation of high dimensional data," NIPS, vol. 16, no. 329-336, p. 3, 2004.

[36] A. Shon, K. Grochow, A. Hertzmann, and R. P. Rao, "Learning shared latent structure for image synthesis and robotic imitation," in NIPS, 2005, pp. 1233–1240.

[37] R. Memisevic, "Kernel information embeddings," in ICML, 2006.

[38] R. Memisevic, L. Sigal, and D. J. Fleet, "Shared kernel information embedding for discriminative inference," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, no. 4, pp. 778–790, 2012.

[39] M. Salzmann, C. H. Ek, R. Urtasun, and T. Darrell, "Factorized orthogonal latent spaces," in AISTATS, 2010, pp. 701–708.

[40] Y. Jia, M. Salzmann, and T. Darrell, "Factorized latent spaces with structured sparsity," in NIPS, 2010, pp. 982–990.

[41] N. Chen, J. Zhu, F. Sun, and E. Xing, "Large-margin predictive latent subspace learning for multi-view data analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, 2012.

[42] R. K. Ando and T. Zhang, "Two-view feature generation model for semi-supervised learning," in ICML. ACM, 2007, pp. 25–32.

[43] X. He, D. G. Simpson, and G. Wang, “Breakdownpoints of t-type regression estimators,” Biometrika,vol. 87, no. 3, pp. 675–687, 2000.

[44] W. J. Rey, Introduction to robust and quasi-robust statisticalmethods. Springer-Verlag New York, 1983, vol. 983.

[45] I. Mizera and C. H. Muller, “Breakdown points ofcauchy regression-scale estimators,” Statistics & prob-ability letters, vol. 57, no. 1, pp. 79–89, 2002.

[46] H. Voss and U. Eckhardt, “Linear convergence of gen-eralized weiszfeld’s method,” Computing, vol. 25, no. 3,pp. 243–251, 1980.

[47] K. Sridharan and S. M. Kakade, "An information theoretic framework for multi-view learning," in COLT, 2008, pp. 403–414.

[48] O. Bousquet and A. Elisseeff, "Stability and generalization," The Journal of Machine Learning Research, vol. 2, pp. 499–526, 2002.

[49] L. A. Kontorovich, K. Ramanan et al., "Concentration inequalities for dependent random variables via the martingale method," The Annals of Probability, vol. 36, no. 6, pp. 2126–2158, 2008.

[50] P. L. Bartlett and S. Mendelson, "Rademacher and Gaussian complexities: Risk bounds and structural results," The Journal of Machine Learning Research, vol. 3, pp. 463–482, 2003.

[51] D.-X. Zhou, "The covering number in learning theory," Journal of Complexity, vol. 18, no. 3, pp. 739–767, 2002.

[52] T. Zhang, "Covering number bounds of certain regularized linear function classes," The Journal of Machine Learning Research, vol. 2, pp. 527–550, 2002.

[53] M. White, X. Zhang, D. Schuurmans, and Y.-l. Yu, "Convex multi-view subspace learning," in NIPS, 2012, pp. 1682–1690.

[54] Y. Wang, S. Asafi, O. van Kaick, H. Zhang, D. Cohen-Or, and B. Chen, "Active co-analysis of a set of shapes," ACM Transactions on Graphics (TOG), vol. 31, no. 6, p. 165, 2012.

[55] T. Sim, S. Baker, and M. Bsat, "The CMU pose, illumination, and expression (PIE) database," in FGR. IEEE, 2002, pp. 46–51.

[56] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," arXiv preprint arXiv:1212.0402, 2012.

[57] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 1, pp. 221–231, 2013.

[58] F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. E. Barbano, "Toward automatic phenotyping of developing embryos from videos," Image Processing, IEEE Transactions on, vol. 14, no. 9, pp. 1360–1371, 2005.

[59] H. Wang, C. Schmid et al., "Action recognition with improved trajectories," in ICCV, 2013.

[60] I. Singer, Duality for nonconvex approximation and optimization. Springer, 2007.

[61] M. Anthony and P. L. Bartlett, Neural network learning: Theoretical foundations. Cambridge University Press, 2009.

Chang Xu received the B.E. degree from Tianjin University in 2011. He is currently a Ph.D. candidate with the Key Laboratory of Machine Perception (Ministry of Education) at Peking University. Previously, he was a research intern with the Knowledge Mining group at Microsoft Research Asia. He won the Best Student Paper Award at the ACM International Conference on Internet Multimedia Computing and Service 2013. His research interests lie primarily in machine learning, multimedia search, and computer vision.

Dacheng Tao (M'07-SM'12-F'15) is Professor of Computer Science with the Centre for Quantum Computation & Intelligent Systems and the Faculty of Engineering & Information Technology at the University of Technology, Sydney. He mainly applies statistics and mathematics to data analysis problems in data mining, computer vision, machine learning, multimedia, and video surveillance. He has authored and co-authored 100+ scientific articles at top venues, including IEEE T-PAMI, T-NNLS, T-IP, NIPS, ICML, AISTATS, ICDM, CVPR, ICCV, ECCV, ACM T-KDD, KDD, and Multimedia, with several best paper awards, such as the Best Theory/Algorithm Paper Runner Up Award at IEEE ICDM07, the Best Student Paper Award at IEEE ICDM13, and the 2014 IEEE ICDM 10-Year Highest-Impact Paper Award. He is a Fellow of the IAPR, the IEEE, and the IET/IEE.

Chao Xu received the B.E. degree from Tsinghua University in 1988, the M.S. degree from the University of Science and Technology of China in 1991, and the Ph.D. degree from the Institute of Electronics, Chinese Academy of Sciences in 1997. From 1991 to 1994 he was an assistant professor at the University of Science and Technology of China. Since 1997 Dr. Xu has been with the School of EECS at Peking University, where he is currently a Professor. His research interests are in image and video coding, processing, and understanding. He has authored or co-authored more than 80 publications and 5 patents in these fields.