
Sparse Representation Classification Beyond ℓ1 Minimization and the Subspace Assumption

Cencheng Shen, Li Chen, Yuexiao Dong, Carey E. Priebe

Abstract—The sparse representation classifier (SRC) has been utilized in various classification problems; it makes use of ℓ1 minimization and works well for image recognition satisfying a subspace assumption. In this paper we propose a new implementation of SRC via screening, establish its equivalence to the original SRC under regularity conditions, and prove its classification consistency under a latent subspace model and contamination. The results are demonstrated via simulations and real data experiments, where the new algorithm achieves comparable numerical performance and is significantly faster.

Index Terms—feature screening, marginal regression, angle condition, stochastic block model


1 INTRODUCTION

Sparse coding is widely recognized as a useful tool in machine learning, thanks to the theoretical advancement in regularized regression and ℓ1 minimization [1], [2], [3], [4], [5], [6], [7], [8], as well as numerous classification and clustering applications in computer vision and pattern recognition [9], [10], [11], [12], [13], [14].

In this paper, we concentrate on the sparse representation classification (SRC), which is proposed in [9] and exhibits state-of-the-art performance for robust face recognition. It is easy to implement, works well for data satisfying the subspace assumption (e.g. face recognition, motion segmentation, and activity recognition), is robust against data contamination, and can be extended to block-wise algorithms and structured data sets [15], [16], [17].

Cencheng Shen is with the Department of Applied Economics and Statistics at University of Delaware, Li Chen is with Intel, Yuexiao Dong is with the Department of Statistical Science at Temple University, and Carey E. Priebe is with the Department of Applied Mathematics and Statistics at Johns Hopkins University (email: [email protected]; [email protected]; [email protected]; [email protected]). This work was partially supported by the Johns Hopkins University Human Language Technology Center of Excellence, the XDATA program of the Defense Advanced Research Projects Agency administered through Air Force Research Laboratory contract FA8750-12-2-0303 and the SIMPLEX program through SPAWAR contract N66001-15-C-4041, and the National Science Foundation Division of Mathematical Sciences award DMS-1712947. This paper was presented in part at the Joint Statistical Meeting and the ICML Learning and Reasoning with Graphs workshop. The authors thank the editor and reviewer for their constructive and valuable comments that led to significant improvements of the manuscript.

Given a set of training data X = [x_1, . . . , x_n] ∈ R^{m×n} with the corresponding known class labels Y = [y_1, . . . , y_n], the task is to classify a new testing observation x of unknown label. SRC identifies a small subset X ∈ R^{m×s} of the training data to best represent the testing observation, calculates the least squares regression coefficients, and computes the regression residual for classification. Compared to nearest-neighbor and nearest-subspace classifiers, SRC exhibits better finite-sample performance on face recognition and is argued to be robust against image occlusion and contamination.

Other steps being standard, the most crucial and time-consuming part of SRC is to extract an appropriate sparse representation for the testing observation. Among all possible representations, the sparse representation X that minimizes both the residual and the sparsity level s often yields better inference performance, by the statistical principle of parsimony and the bias-variance trade-off. By adding an ℓ0 constraint to the linear regression problem, one can minimize the residual and the sparsity level s at the same time. As ℓ0 minimization is NP-hard and


unfeasible for large samples, ℓ1 minimization is the best substitute due to its computational advantage, and it has a rich theoretical literature on exact sparsity recovery under various conditions [3], [18], [5], [6], [7], [8]. Towards this direction, it is argued in [9] that SRC is able to find the most appropriate representation and ensures successful face recognition under the subspace assumption: if data of the same class lie in the same subspace while data of different classes lie in different subspaces, then the sparse representation X identified by ℓ1 minimization shall only consist of observations from the correct class. Moreover, using ℓ1 minimization and assuming the existence of a perfect representation, [13] derives a theoretical condition for perfect variable selection.

However, to achieve correct classification, the sparse representation X does not need to perfectly represent the testing observation, nor to select only training data of the correct class. Indeed, a perfect representation is generally not possible when the feature (or dimension) size m exceeds the sample size n, while an approximate representation is often non-unique. A number of works have also pointed out that neither ℓ1 minimization nor the subspace assumption is indispensable for SRC to perform well [19], [20], [21], [22]. Intuitively, SRC can succeed whenever the sparse representation X contains some training data of the correct class and the correct class dominates the regression coefficients. It is not really required to recover the sparsest representation by ℓ1 minimization or to achieve perfect variable selection under the subspace assumption.

The above insights motivate us to propose a faster SRC algorithm and investigate its classification consistency. In Section 2 we introduce basic notations and review the original SRC framework. Section 3 is the main section: in Section 3.1 we propose a new SRC algorithm via screening and a slightly different classification rule; in Section 3.2 we compare and establish the equivalence between the two classification rules under regularity conditions; then we prove the consistency of SRC under a latent subspace mixture model in Section 3.3, which is further extended to contamination models and network models. Our results better explain the success and applicability of SRC, making it more appealing in terms of theoretical foundation, computational complexity and general applicability. The new SRC algorithm runs much faster than before and achieves comparable numerical performance, as supported by a wide variety of simulations and real data experiments on images and network graphs in Section 4. All proofs are in Section 5.

2 PRELIMINARY

Notations

Let X = [x_1, x_2, . . . , x_n] ∈ R^{m×n} be the training data matrix and Y = [y_1, y_2, . . . , y_n] ∈ [K]^n be the known class label vector, where m is the number of dimensions (or feature size), n is the number of observations (or sample size), and K is the number of classes with [K] = [1, . . . , K]. Denote (x, y) ∈ R^m × [K] as the testing pair, where y is the true but unobserved label.

As a common statistical assumption, we assume (x, y), (x_1, y_1), · · · , (x_n, y_n) are all independent realizations from the same distribution F_{XY}. A classifier g_n(x, D_n) is a function that estimates the unknown label y ∈ [K] based on the training pairs D_n = {(x_1, y_1), · · · , (x_n, y_n)} and the testing observation x. For brevity, we always denote the classifier as g_n(x), and the classifier is correct if and only if g_n(x) = y. Throughout the paper, we assume all observations are of unit norm (‖x_i‖_2 = 1), because SRC scales all observations to unit norm by default.

The sparse representation is a subset of the training data, which we denote as

X = [x_1, x_2, . . . , x_s] ∈ R^{m×s},

where each x_i is selected from the training data X, and s is the number of observations in the representation, i.e., the sparsity level. Once X is determined, β denotes the s × 1 least squares regression coefficient vector between X and x, and the regression residual equals ‖x − Xβ‖_2. For each class k ∈ [K] and a given X, we define

X_k = {x_i ∈ X, i = 1, . . . , s | y_{x_i} = k},

X_{−k} = {x_i ∈ X, i = 1, . . . , s | y_{x_i} ≠ k}.


Namely, X_k is the subset of X that contains all observations from class k, and X_{−k} = X − X_k. Moreover, denote β_k as the regression coefficients of β corresponding to X_k, and β_{−k} as the regression coefficients corresponding to X_{−k}, i.e.,

X_k β_k + X_{−k} β_{−k} = X β.

Note that the original SRC in Algorithm 1 uses the class-wise regression residual ‖x − X_k β_k‖_2 to classify.
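As a small illustration of this class-wise decomposition, here is a minimal NumPy helper (the naming is ours, not the paper's) that computes the residuals ‖x − X_k β_k‖_2 used by the magnitude rule.

```python
import numpy as np

def classwise_residuals(X_hat, labels_hat, beta, x):
    """Class-wise residuals ||x - X_k beta_k||_2 for a sparse representation
    X_hat (m x s), its labels (length s), coefficients beta (length s),
    and a unit-norm test vector x (length m)."""
    residuals = {}
    for k in np.unique(labels_hat):
        mask = (np.asarray(labels_hat) == k)
        # X_k beta_k uses only the columns (and coefficients) of class k.
        fit_k = X_hat[:, mask] @ beta[mask]
        residuals[k] = np.linalg.norm(x - fit_k)
    return residuals
```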

Sparse Representation Classification by ℓ1

SRC consists of three steps: subset selection, least squares regression, and classification via the regression residual. Algorithm 1 describes the original algorithm: Equation 1 identifies the sparse representation X and computes the regression coefficients β; then Equation 2 assigns the class by minimizing the class-wise regression residual. In terms of computational time complexity, the ℓ1 minimization step requires at least O(mns), while the classification step is much cheaper and takes O(msK).

The ℓ1 minimization step is the only computationally expensive part of SRC. Computation-wise, there exist various greedy and iterative implementations of similar complexity, such as the ℓ1 homotopy method [1], [2], [4], orthogonal matching pursuit (OMP) [23], [24], and the augmented Lagrangian method [12], among many others. We use the homotopy algorithm for subsequent analysis and numerical comparison without delving into the algorithmic details, as most ℓ1 minimization algorithms share similar performance, as shown in [12].

Note that model selection is inherent to ℓ1 minimization, and indeed to almost all variable selection methods: one needs to either specify a tolerance noise level ε or a maximum sparsity level for the iterative algorithm to stop. The choice does not affect the theorems, but it can impact the actual numerical performance and is thus a separate topic for investigation [25], [26]. In this paper we simply set the maximal sparsity level s = min{n/ log(n), m} for both the ℓ1 minimization here and the later screening method in Section 3.1, which achieves good empirical performance for both algorithms.

Algorithm 1 Sparse Representation Classification by ℓ1 Minimization and Magnitude Rule

Input: The training data matrix X, the known label vector Y, the testing observation x, and an error level ε.

ℓ1 Minimization: For each testing observation x, find X and β that solve the ℓ1 minimization problem:

β = arg min ‖β‖_1 subject to ‖x − Xβ‖_2 ≤ ε.   (1)

Classification: Assign the testing observation by minimizing the class-wise residual, i.e.,

g_n^{ℓ1}(x) = arg min_{k∈[K]} ‖x − X_k β_k‖_2,   (2)

where ties are broken deterministically. We name this classification rule the magnitude rule.

Output: The estimated class label g_n^{ℓ1}(x).
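For illustration only, the sketch below mimics Algorithm 1 in Python, with a penalized lasso from scikit-learn standing in for the constrained problem in Equation 1; the penalty value alpha and all names are our own choices rather than the paper's implementation (which uses the ℓ1 homotopy solver).

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_l1_magnitude(X, Y, x, alpha=0.01):
    """Sketch of Algorithm 1.  X: m x n training matrix with unit-norm
    columns, Y: length-n label vector, x: unit-norm test vector.
    A lasso penalty (alpha is our choice) stands in for Equation 1."""
    Y = np.asarray(Y)
    beta = Lasso(alpha=alpha, fit_intercept=False,
                 max_iter=10000).fit(X, x).coef_
    support = np.nonzero(beta)[0]        # indices kept by l1 minimization
    classes = np.unique(Y[support]) if support.size else np.unique(Y)
    residuals = []
    for k in classes:                    # magnitude rule of Equation 2
        idx = support[Y[support] == k]
        residuals.append(np.linalg.norm(x - X[:, idx] @ beta[idx]))
    return classes[int(np.argmin(residuals))]
```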


3 MAIN RESULTS

In this section, we present the new SRC algorithm, investigate its equivalence to the original SRC algorithm, and prove the classification consistency under a latent subspace mixture model, followed by further generalizations. Note that one advantage of SRC is that it is applicable to both high-dimensional problems (m ≥ n) and low-dimensional problems (m < n). Its finite-sample numerical success mostly lies in high-dimensional domains where traditional classifiers often fail. For example, the feature size m is much larger than the sample size n in all the image data we consider, and m = n for the network adjacency matrices. The new SRC algorithm inherits the same advantage, and all our theoretical results hold regardless of m.

3.1 SRC via Screening and Angle Rule

The new SRC algorithm is presented in Algorithm 2, which replaces ℓ1 minimization by


screening, and then assigns the class by minimizing the class-wise residual in angle. To distinguish it from the magnitude rule of Algorithm 1, we name Equation 3 the angle rule. Algorithm 2 has a better computational complexity due to the screening procedure, which simply chooses the s observations of X that are most correlated with the testing observation x, requiring only O(mn + n log(n)) operations instead of O(mns) for ℓ1.

The screening procedure has recently gained popularity as a fast alternative to regularized regression for high-dimensional data analysis. The speed advantage makes it a suitable candidate for efficient data extraction for extremely large m, and it can be equivalent to ℓ1 and ℓ0 minimization under various regularity conditions [27], [28], [29], [30], [31], [32]. In particular, setting the maximal sparsity level as s = min{n/ log(n), m} is shown to work well for screening [27], and is thus the default choice in this paper.

3.2 Equivalence Between Angle Rule and Magnitude Rule

The angle rule in Algorithm 2 appears different from the magnitude rule in Algorithm 1. For a given sparse representation X and regression vector β, we analyze these two rules and establish their equivalence under certain conditions.

Theorem 1. Given X and x, we have g_n^{ℓ1}(x) = g_n^{scr}(x) when either of the following conditions holds:
• K = 2 and X is of full rank;
• Data of different classes are orthogonal to each other, i.e., cos θ(X_y β_y, X_k β_k) = 0 for all k ≠ y.

These conditions are quite common in classification: binary classification problems are prevalent in many supervised learning tasks, and random vectors in high-dimensional space are orthogonal to each other with probability increasing to 1 as the number of dimensions increases [33]. Indeed, in all the multiclass high-dimensional simulations and experiments we run in Section 4.3, the two classification rules yield very similar classification errors. We chose to use the angle rule in the new SRC algorithm because it provides a direct path to classification consistency while the magnitude rule does not.

Algorithm 2 Sparse Representation Classification by Screening and Angle Rule

Input: The training data matrix X, the known label vector Y, and the testing observation x.

Screening: Calculate Ω = {|x_1^T x|, |x_2^T x|, · · · , |x_n^T x|} (T denotes the transpose), and sort the elements in decreasing order. Take X = [x_(1), x_(2), . . . , x_(s)] with s = min{n/ log(n), m}, where |x_(i)^T x| is the ith largest element in Ω.

Regression: Solve the ordinary least squares problem between X and x, namely, compute β = X^† x, where X^† is the Moore-Penrose pseudo-inverse of X.

Classification: Assign the testing observation by

g_n^{scr}(x) = arg min_{k∈[K]} θ(x, X_k β_k),   (3)

where θ denotes the angle between two vectors, and ties are broken deterministically. We name this classification rule the angle rule.

Output: The estimated class label g_n^{scr}(x).
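A minimal NumPy sketch of Algorithm 2 follows; it assumes unit-norm columns, and the function and variable names are ours.

```python
import numpy as np

def src_screening_angle(X, Y, x):
    """Sketch of Algorithm 2.  X: m x n training matrix with unit-norm
    columns, Y: length-n labels, x: unit-norm test vector."""
    m, n = X.shape
    Y = np.asarray(Y)
    s = min(int(n / np.log(n)), m)
    # Screening: keep the s columns most correlated with x.
    keep = np.argsort(-np.abs(X.T @ x))[:s]
    X_hat, Y_hat = X[:, keep], Y[keep]
    beta = np.linalg.pinv(X_hat) @ x            # ordinary least squares
    best_class, best_angle = None, np.inf
    for k in np.unique(Y_hat):                  # angle rule (Equation 3)
        fit_k = X_hat[:, Y_hat == k] @ beta[Y_hat == k]
        cosine = x @ fit_k / max(np.linalg.norm(fit_k), 1e-12)
        angle = np.arccos(np.clip(cosine, -1.0, 1.0))
        if angle < best_angle:
            best_angle, best_class = angle, k
    return best_class
```

The magnitude rule of Algorithm 1 can be recovered from the same screened set and coefficients by replacing the angle with the residual norm ‖x − X_k β_k‖_2, which is how the two rules are compared in Section 4.3.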


3.3 Consistency under Latent Subspace Mixture Model

To investigate the consistency of SRC, we first formalize the probabilistic setting of classification based on [34]. Let

(X, Y ), (X_1, Y_1), . . . , (X_n, Y_n) i.i.d. ∼ F_{XY}

denote the random variables underlying the sample realizations (x, y), (x_1, y_1), . . . , (x_n, y_n). The prior probability of each class k is denoted by ρ_k ∈ [0, 1] with


Σ_{k=1}^{K} ρ_k = 1,

and the probability of error is defined by

L(g_n) = Prob(g_n(X) ≠ Y ).

The classifier that minimizes the probability of error is called the Bayes classifier, whose error rate is optimal and denoted by L*. The sequence of classifiers {g_n} is consistent for a given distribution F_{XY} if and only if

L(g_n) → L* as n → ∞.

SRC cannot be universally consistent, i.e., there exists some distribution F_{XY} for which SRC is not consistent. A simple example is a two-dimensional data space where all the data lie on the same line passing through the origin: SRC cannot distinguish between the classes (as the normalized data are essentially a single point), whereas a simple linear discriminant without normalization is consistent. To that end, we propose the following model:

Definition (Latent Subspace Mixture Model). We say (X, Y ) ∼ F_{XY} ∈ (R^m × [K]) follows a latent subspace mixture model if and only if there exist a lower-dimensional continuously supported latent variable U ∈ R^d (d ≤ m) and m × d matrices W_k ∈ R^{m×d} for each k ∈ [K] such that

X|Y = W_Y U.

Namely, we observe a high-dimensional object X, and there exist a hidden low-dimensional latent variable U and an unobserved class-dependent transformation W_Y. The latent subspace mixture model well reflects the original subspace assumption: data of the same class lie in the same subspace, while data of different classes lie in different subspaces. The subspace location is determined by W_k, and the model does not require a perfect linear recovery. Similar models have been used in a number of probabilistic high-dimensional analyses, e.g., probabilistic principal component analysis in [35]. Note that each subspace represented by W_k does not need to

be of equal dimension. As long as m denotes the maximum dimension over all W_k, the matrix representation can capture all lower-dimensional transformations. For example, let K = 2, m = 3, d = 2, U = (u_1, u_2)^T, and

W_1 = [ 1  0        W_2 = [ 1  −1
        0  1                0   0
        0  0 ],             0   0 ].

Then the subspace associated with class 1 is two-dimensional with X|(Y = 1) = (u_1, u_2, 0)^T, and the subspace associated with class 2 is one-dimensional with X|(Y = 2) = (u_1 − u_2, 0, 0)^T.

Definition (The Angle Condition). Under the latent subspace mixture model, denote

W = [W_1 | W_2 | · · · | W_K] ∈ R^{m×Kd}

as the concatenation of all the W_k, and let W/W_k denote the same concatenation excluding W_k. We say W satisfies the angle condition if and only if

span(W_k) ∩ span(W/W_k) = {0}   (4)

for each k ∈ [K].

Essentially, the condition states that the subspace of each class does not overlap with the subspaces of the other classes; therefore testing data from one class cannot be perfectly represented by any linear combination of the training data from other classes. The angle condition and the latent subspace mixture model allow data of the same class to be arbitrarily close in angle, while data of different classes always differ in angle, which leads to the classification consistency of SRC.
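Numerically, Equation 4 can be verified through ranks: the two spans intersect only at the origin exactly when rank([W_k | W/W_k]) = rank(W_k) + rank(W/W_k). A hedged sketch of such a check (helper name ours):

```python
import numpy as np

def satisfies_angle_condition(W_list, tol=1e-10):
    """Check Equation 4 for a list of m x d_k matrices W_k."""
    for k in range(len(W_list)):
        W_k = W_list[k]
        W_rest = np.hstack([W for j, W in enumerate(W_list) if j != k])
        joint = np.linalg.matrix_rank(np.hstack([W_k, W_rest]), tol=tol)
        separate = (np.linalg.matrix_rank(W_k, tol=tol)
                    + np.linalg.matrix_rank(W_rest, tol=tol))
        if joint < separate:
            return False   # the spans share a nonzero direction
    return True
```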

Theorem 2. Under the latent subspace mixture model with W satisfying the angle condition, Algorithm 2 is consistent with L* being zero, i.e.,

L(g_n^{scr}) → L* = 0 as n → ∞.

3.4 Robustness against Contamination

In feature contamination, certain features or dimensions of the data are contaminated or unobserved, and thus treated as zero. Under the latent subspace mixture model, this can be equivalently characterized by imposing the


contamination on the transformation matrix W_k, i.e., some entries of W_k are set to 0. By default, we assume there is no degenerate observation whose features are all contaminated to 0.

Definition (Latent Subspace Mixture Model with Fixed Contamination). Under the latent subspace mixture model, define V_k as the 1 × m contamination vector for each class k:

V_k(j) = 1 if the jth dimension is not contaminated, and V_k(j) = 0 if the jth dimension is contaminated.

Then the contaminated random variable X is

X|Y = diag(V_Y) W_Y U,

where diag(V_k) is the m × m diagonal matrix satisfying diag(V_k)(j, j) = V_k(j).

A more interesting contamination model is the following:

Definition (Latent Subspace Mixture Model with Random Contamination). Under the latent subspace mixture model, for each class k define V_k ∈ [0, 1]^{1×m} as the contamination probability vector, and Bernoulli(V_Y) as the corresponding 0-1 contamination vector where 0 represents the entry being contaminated. Then the contaminated random variable X is

X|Y = diag(Bernoulli(V_Y)) W_Y U.

The two contamination models are very similar, except that one is governed by a fixed vector while the other is governed by a random process.
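Both mechanisms amount to masking the observed vector; a short sketch (names and seed ours) applies each scheme to an observation x of class k with contamination vector V_k.

```python
import numpy as np

rng = np.random.default_rng(1)

def fixed_contamination(x, V_k):
    """Fixed contamination: V_k is a 0-1 vector, 0 marks a contaminated dimension."""
    return np.asarray(V_k) * x

def random_contamination(x, V_k):
    """Random contamination: V_k holds, per dimension, the probability of keeping
    (not contaminating) the entry; a fresh Bernoulli draw is used per observation."""
    keep = rng.binomial(1, np.asarray(V_k))
    return keep * x
```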

Theorem 3. Under the fixed contamination model, Algorithm 2 is consistent when

W_V = [diag(V_1)W_1 | · · · | diag(V_K)W_K]

satisfies the angle condition.
Under the random contamination model, Algorithm 2 is consistent when

W_V = [diag(I(V_1 = 1))W_1 | · · · | diag(I(V_K = 1))W_K]

satisfies the angle condition, where I is the indicator function.

Note that the notation I(V_k = 1) represents a 0-1 vector that applies the indicator function element-wise for class k, with an entry of 1 if and only if the respective dimension of class k is never contaminated. For example, let K = 3, m = 6, d = 1, and

I(V_1 = 1) = [1, 1, 1, 0, 0, 0],
I(V_2 = 1) = [1, 1, 0, 0, 1, 0],
I(V_3 = 1) = [1, 1, 0, 0, 0, 1].

Namely, for class 1 data, the last three dimensions may be randomly contaminated but not the first three; for class 2 data, the first two and the fifth dimension are never contaminated; for class 3 data, the first two and the last dimension are never contaminated. The theorem holds for the random contamination model when the common un-contaminated dimensions satisfy the angle condition, e.g., the first two dimensions in the above example.

Since W_V can be regarded as a projected version of W, where the projection is enforced by the contamination, Theorem 3 essentially states that if the angle condition still holds for this projected version of W, then SRC is still consistent. If all dimensions can be randomly contaminated with non-zero probability, W_V becomes the empty matrix and the theorem no longer holds. This is because different subspaces may now overlap with each other simply by chance, so SRC is no longer guaranteed to be consistent.

In practice, one generally has no prior knowledge about how exactly the data are contaminated. Theorem 3 thus suggests that SRC can still perform well when the number of contaminated features is relatively small or sparse among all features. The simulation in Section 4.1.1 shows that under the two contamination models in Theorem 3, SRC performs almost as well as in the no-contamination case; whereas if all dimensions are allowed to be contaminated, SRC exhibits a much worse classification error.

3.5 Consistency under Stochastic Block Model

SRC is shown to be a robust vertex classifier in [14], exhibiting superior performance compared to other classifiers for both simulated and real networks. Here we prove SRC consistency for the stochastic block model [36], [37], [38], which


is a popular network model commonly used for classification and clustering. Although the results are extendable to undirected, weighted, and other similar graph models, for ease of presentation we concentrate on the directed and unweighted SBM.

Definition (Directed and Unweighted Stochastic Block Model (SBM)). Given the class memberships Y, a directed stochastic block model generates an n × n binary adjacency matrix X via a class connectivity matrix V ∈ [0, 1]^{K×K} and the Bernoulli distribution B(·):

X(i, j) = B(V(y_i, y_j)).

From the definition, the adjacency matrix produced by an SBM is a high-dimensional object that is characterized by a low-dimensional class connectivity matrix. It is thus similar to the latent subspace mixture model.
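A short NumPy sketch of this generative step follows; the vectorized form, the seed, and the zero-based class labels are our own conventions, and the example reuses the connectivity matrix and priors of Section 4.2.1.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_directed_sbm(Y, V):
    """Directed, unweighted SBM: X(i, j) ~ Bernoulli(V[y_i, y_j])."""
    Y = np.asarray(Y)
    P = V[np.ix_(Y, Y)]              # n x n matrix of edge probabilities
    return rng.binomial(1, P)

# Example: the 3-class connectivity matrix used in Section 4.2.1
# (classes relabeled 0-2 for indexing).
V = np.array([[0.3, 0.1, 0.1],
              [0.1, 0.3, 0.1],
              [0.1, 0.1, 0.3]])
Y = rng.choice([0, 1, 2], size=150, p=[0.3, 0.4, 0.3])
A = sample_directed_sbm(Y, V)
```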

Theorem 4. Denote ρ ∈ [0, 1]^K as the 1 × K vector of prior probabilities, I_{Y=i} as the Bernoulli random variable with probability ρ_i, and define a set of new random variables Q_k and their un-centered correlations q_{kl} as

Q_k = Σ_{i=1}^{K} I_{Y=i} V(k, i),

q_{kl} = E(Q_k Q_l) / √( E(Q_k^2) E(Q_l^2) ) ∈ [0, 1]

for all k, l ∈ [K]. Then Algorithm 2 is consistent for SBM vertex classification when the regression coefficients are constrained to be non-negative and

q_{kl}^2 · E(Q_k^2)/E(Q_l^2) < E(Q_k)/E(Q_l)   (5)

for all 1 ≤ k < l ≤ K.

Equation 5 essentially requires data of the same class to be more similar in angle than data of different classes, and is thus inherently the same as the angle condition for the latent subspace mixture model. Note that the theorem requires the regression coefficients to be non-negative. This is actually relaxed in the proof, which presents a more technical condition that maintains consistency while allowing some negative coefficients. Since network data are non-negative and the SBM is a binary model, in practice the regression coefficients are mostly non-negative in either ℓ1 minimization or screening. Alternatively, it is also easy to enforce the non-negativity constraint in the algorithm if needed [39], which yields similar numerical performance.

4 NUMERICAL EXPERIMENTS

In this section we compare the new SRC algorithm by screening to the original SRC algorithm in various simulations and experiments. The evaluation criterion is the leave-one-out error: within each data set, one observation is held out for testing and the remaining observations are used for training; we classify the held-out observation, and repeat until each observation in the given data has been held out once. The simulations show that SRC is consistent under both the latent subspace mixture model and the stochastic block model, and is robust against contamination. The phenomenon is the same for real image and network data. Overall, we observe that Algorithm 2 performs very similarly to Algorithm 1 in accuracy and achieves this with significantly better running time. Additional algorithm comparisons are provided in Section 4.3 to compare various choices for the variable selection and classification steps. Note that we also compared two common benchmarks, the k-nearest-neighbor classifier and linear discriminant analysis, for a number of network simulations and real graphs in [14], which shows SRC is significantly better than those common benchmarks; this comparison is thus not repeated here.
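The leave-one-out protocol itself is straightforward; the sketch below assumes a classifier with the signature of the earlier src_screening_angle sketch and is not taken from the paper's code.

```python
import numpy as np

def leave_one_out_error(X, Y, classifier):
    """Hold out each observation once, train on the rest, and report the error rate."""
    n = X.shape[1]
    Y = np.asarray(Y)
    errors = 0
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        prediction = classifier(X[:, keep], Y[keep], X[:, i])
        errors += (prediction != Y[i])
    return errors / n

# Example usage (assuming the earlier sketch of Algorithm 2):
# err = leave_one_out_error(X_train, Y_train, src_screening_angle)
```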

4.1 Image-Related Experiments

4.1.1 Latent Subspace Mixture Simulation

The model parameters are set as m = 5, d = 2, and K = 3 with ρ_1 = ρ_3 = 0.3 and ρ_2 = 0.4. The W_k matrices are

W_1 = [ 3  1        W_2 = [ 1  1        W_3 = [ 1  1
        1  1                3  1                1  1
        1  1                1  1                3  1
        1  1                1  1                1  1
        ⋮  ⋮ ],             ⋮  ⋮ ],             ⋮  ⋮ ],


[Figure 1: three panels of classification error versus sample size; panels (a) No Contamination, (b) With Contamination, (c) Three Contamination Models; curves: SRC by Screening, SRC by L1, and Random Contamination 1 / Fixed Contamination / Random Contamination 2.]

Fig. 1: SRC errors under the Latent Subspace Mixture Model. The left and center panels compare Algorithm 1 and Algorithm 2 on no-contamination and 20% random contamination data. The right panel compares SRC performance under three different contamination schemes, showing that SRC can perform very well when the contamination satisfies Theorem 3. Note that the right panel only shows Algorithm 2, as the behavior is the same for Algorithm 1.

which satisfies the angle condition of Theorem 2. We generate sample data (X, Y) for n = 30, 60, . . . , 300, compute the leave-one-out error, repeat for 100 Monte-Carlo replicates, and plot the average errors in Figure 1. The left panel has no contamination, while the center panel has 20% of the features contaminated to 0 for each observation. In both panels, Algorithm 2 is almost as good as Algorithm 1.

In the right panel of Figure 1, we show how different contamination models may affect consistency. Consider three contamination models: the same random contamination as in the center panel, which sets each dimension to 0 randomly for each observation; a fixed contamination that sets the same 20% of the dimensions to 0 throughout all data; and a second random contamination that never contaminates the first three dimensions (thus keeping the angle condition) and randomly sets each of the remaining dimensions to 0 with 20% probability for each observation. The first contamination model is not guaranteed to be consistent, while the remaining two are guaranteed consistent by Theorem 3. We use the same model settings as before except letting m = 10 and growing the sample size up to 1000 to better compare the convergence of the errors. Indeed, the results support the consistency theorem: for the fixed contamination and the random contamination satisfying the angle condition, SRC achieves almost perfect classification, while for the purely random contamination the SRC error remains very high and far from optimal.

4.1.2 Face and Object Images

Next we experiment on two image data sets on which the original SRC excels. The Extended Yale B database has 2414 face images of 38 individuals under various poses and lighting conditions [40], [41], which are re-sized to 32 × 32. Thus m = 1024, n = 2414, and K = 38. The Columbia Object Image Library (Coil20) [42] consists of 400 object images of 20 objects under various angles, and each image is also of size 32 × 32. In this case m = 1024, n = 400, and K = 20.

The leave-one-out errors are reported in the first two rows of Table 2, and the running times are reported in the first two columns of Table 1. The new SRC algorithm is similar to the original SRC in error rate with a far better running time. Next we verify the robustness of Algorithm 2 against contamination. Figure 3 shows some examples of the image data, pre- and post-contamination. As the contamination rate increases from 0 to 50% of the pixels, the error rate increases significantly. Algorithm 2 still enjoys the same classification performance as the original SRC in Algorithm 1, as shown in the top two panels of Figure 4.

4.2 Network-Related Experiments

4.2.1 Stochastic Block Model Simulation

Next we generate the adjacency matrix by the stochastic block model. We set K = 3 with


[Figure 2: two panels of classification error versus sample size under the stochastic block model, (a) No Contamination and (b) With Contamination; curves: SRC by Screening and SRC by L1.]

Fig. 2: SRC errors under the Stochastic Block Model.

ρ_1 = ρ_3 = 0.3 and ρ_2 = 0.4, generate sample data (X, Y) for n = 30, 60, . . . , 300, compute the leave-one-out error, repeat for 100 Monte-Carlo replicates, and plot the average errors in Figure 2. The class connectivity matrix V is set to

V = [ 0.3  0.1  0.1
      0.1  0.3  0.1
      0.1  0.1  0.3 ],

which satisfies the condition in Theorem 4. The new SRC algorithm is similar to the original SRC algorithm in error rate, and both of them have very low errors, supporting the consistency result of Theorem 4.

4.2.2 Article Hyperlinks and Neural Connectome

In this section we apply SRC to vertex classification of network graphs. The first graph is collected from Wikipedia article hyperlinks [43]. A total of 1382 English documents based on the 2-neighborhood of the English article “algebraic geometry” are collected, and the adjacency matrix is formed via the documents' hyperlinks. This is a directed, unweighted, and sparse graph without self-loops, where the graph density is 1.98% (number of edges divided by the maximal number of possible edges). There are five classes based on article contents (119 articles in the category class, 372 articles about people, 270 articles about locations, 191 articles on dates, and 430 articles on actual mathematics). Thus, we have m = n = 1382 and K = 5.

The second graph we consider is the electric neural connectome of Caenorhabditis elegans

[Figure 3: example images, (a) original and (b) contaminated.]

Fig. 3: Images with contamination.

(C. elegans) [44], [45], [46]. The hermaphrodite C. elegans somatic nervous system has over two hundred neurons, classified into 3 classes: motor neurons, interneurons, and sensory neurons. The adjacency matrix is undirected, unweighted, and sparse, with density 1.32%. This is a relatively small data set where m = n = 253 and K = 3.

The leave-one-out errors are reported in the first two rows of Table 3, the running times are reported in the last two columns of Table 1, and the classification performance under contamination is shown in the bottom panels of Figure 4. The interpretation and performance curves are very similar to those of the image data, where Algorithm 2 is much faster without losing performance.

4.3 Experiments on Various Algorithmic Choices

The SRC algorithm can be implemented with other algorithmic choices. For example, one may adopt a different variable selection technique for the first step of SRC: the OMP from [24] is an ideal greedy algorithm to use, and screening followed by ℓ1 minimization may achieve better variable selection, as argued in [27]. For the classification rule, one naturally wonders what the practical difference between the angle and magnitude rules is. To that end, we run the same simulations and experiments as above without contamination, using several different algorithmic choices. The results are summarized in Table 2 for the image experiments and in Table 3 for the network experiments.


TABLE 1: Running Time Comparison on Real Data (in seconds)

                      Yale Images   Coil Images   Wikipedia Graph   C-elegans Network
  SRC by Screening         72.7          25.1            32.3               0.3
  SRC by ℓ1              1101.5         345.9           573.7               9.1

[Figure 4: classification error versus contamination rate (0 to 0.5) on four real data sets, (a) Yale Face Images, (b) COIL Object Images, (c) Wikipedia Hyperlinks, (d) C-Elegans Network; curves: SRC by Screening and SRC by L1.]

Fig. 4: SRC for contaminated real data.


Overall, we do not observe much difference in accuracy regardless of these choices: (1) for the variable selection step, OMP is quite similar to ℓ1 homotopy, and screening followed by ℓ1 does not improve upon screening either, which implies that it suffices to simply use the fastest screening method for SRC; (2) for the classification rule, the angle and magnitude rules also behave either exactly the same or very similarly, and one can be slightly better than the other depending on the data in use.

5 PROOFS

Theorem 1 Proof

To prove the theorem, we first state the following lemma:

Lemma 1. Given the sparse representation X and the testing observation x, g_n^{ℓ1}(x) = y if and only if ‖X_{−y}β_{−y}‖_2 < ‖X_{−k}β_{−k}‖_2 for all classes k ≠ y, and g_n^{scr}(x) = y if and only if θ(x, X_y β_y) < θ(x, X_k β_k) for all classes k ≠ y.

Proof. This lemma can be proved as follows: the testing observation can be decomposed via X as

x = Xβ + ε = X_k β_k + X_{−k} β_{−k} + ε

for any class k, where ε is the regression residual, orthogonal to both X_k β_k and X_{−k} β_{−k}. For the magnitude rule, g_n^{ℓ1}(x) = y if and only if

‖x − X_y β_y‖ < ‖x − X_k β_k‖ for all k ≠ y
⇔ ‖X_{−y}β_{−y} + ε‖ < ‖X_{−k}β_{−k} + ε‖ for all k ≠ y
⇔ ‖X_{−y}β_{−y}‖ < ‖X_{−k}β_{−k}‖ for all k ≠ y,

where the last line follows from the orthogonality of the regression residual ε. For the angle rule, it is immediate that g_n^{scr}(x) = y if and only if θ(x, X_y β_y) < θ(x, X_k β_k) for all k ≠ y. This completes the proof of Lemma 1.

Now we prove Theorem 1:

Proof. As x = X_y β_y + X_{−y} β_{−y} + ε and ‖x‖_2 = 1, it follows that

cos θ(x, X_y β_y) = x^T X_y β_y / (‖x‖_2 ‖X_y β_y‖_2)
                 = (‖X_y β_y‖_2^2 + (X_{−y}β_{−y})^T X_y β_y) / ‖X_y β_y‖_2
                 = ‖X_y β_y‖_2 + (X_{−y}β_{−y})^T X_y β_y / ‖X_y β_y‖_2
                 = ‖X_y β_y‖_2 + ‖X_{−y}β_{−y}‖_2 · cos θ(X_y β_y, X_{−y}β_{−y}).   (6)

The first line expresses the angle via normalized inner products; the second line decomposes x, where ε is eliminated because it is orthogonal to both X_y β_y and X_{−y}β_{−y}; the third line divides the square of ‖X_y β_y‖_2 by itself; and the last line re-expresses the remainder term via the angle.
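Identity (6) only uses the orthogonality of the least squares residual and ‖x‖_2 = 1, so it is easy to check numerically; the following sketch is entirely our own.

```python
import numpy as np

rng = np.random.default_rng(3)
m, s = 20, 6
X_hat = rng.standard_normal((m, s))            # sparse representation (m x s)
labels = np.array([0, 0, 0, 1, 1, 1])          # first three columns play the role of class y
x = rng.standard_normal(m)
x /= np.linalg.norm(x)                         # SRC assumes unit-norm observations
beta = np.linalg.pinv(X_hat) @ x               # least squares coefficients

fit_y = X_hat[:, labels == 0] @ beta[labels == 0]
fit_rest = X_hat[:, labels == 1] @ beta[labels == 1]
lhs = x @ fit_y / np.linalg.norm(fit_y)        # cos(theta(x, X_y beta_y)), since ||x|| = 1
rhs = (np.linalg.norm(fit_y)
       + np.linalg.norm(fit_rest)
       * (fit_y @ fit_rest) / (np.linalg.norm(fit_y) * np.linalg.norm(fit_rest)))
print(np.isclose(lhs, rhs))                    # True up to numerical precision
```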


TABLE 2: Leave-one-out error comparison for image-related simulations and real data. The first two rows correspond to the original and new SRC algorithms, and the remaining rows consider other algorithmic choices. For each data column, the best error rate is highlighted.

  Algorithm / Data                       Latent Subspace   Yale Faces   Coil Objects
  ℓ1 with Magnitude (Algorithm 1)              1.33%          0.62%         0.56%
  Screening with Angle (Algorithm 2)           1.00%          1.66%         0.01%
  ℓ1 with Angle                                1.33%          2.11%         0.42%
  Screening with Magnitude                     1.00%          0.79%         0.01%
  OMP with Angle                               2.00%          1.62%         1.25%
  OMP with Magnitude                           2.00%          0.75%         1.18%
  Screening ℓ1 with Angle                      1.33%          3.07%         0.21%
  Screening ℓ1 with Magnitude                  1.33%          1.08%         0.21%

TABLE 3: Leave-one-out error comparison for network-related simulations and real data.

  Algorithm / Data                       SBM Simulation   Wikipedia   C-elegans
  ℓ1 with Magnitude (Algorithm 1)              0.67%        29.45%      48.22%
  Screening with Angle (Algorithm 2)           0.33%        32.27%      42.69%
  ℓ1 with Angle                                0.67%        29.38%      41.50%
  Screening with Magnitude                     0.33%        32.27%      45.06%
  OMP with Angle                               2.33%        30.97%      40.32%
  OMP with Magnitude                           2.33%        32.13%      46.25%
  Screening ℓ1 with Angle                      0.67%        30.68%      39.53%
  Screening ℓ1 with Magnitude                  0.67%        30.82%      44.66%

Using the above equation and Lemma 1, we show that the magnitude rule is the same as the angle rule under either of the following conditions:

• When K = 2 and X is of full rank: first note that the angle between vectors satisfies −1 ≤ cos θ(X_y β_y, X_{−y}β_{−y}) ≤ 1. Moreover, when the representation is of full rank, X_y β_y and X_{−y}β_{−y} cannot be in the same direction, so the ≤ 1 inequality is strict. Next, as there are only two classes, X_{−y}β_{−y} is exactly the representation of the other class, and cos θ(X_y β_y, X_{−y}β_{−y}) is the same for both classes. Assume Y = 1 and Y = 2 are the two classes. Then Equation 6 simplifies to

  cos θ(x, X_1 β_1) = a + b · c,
  cos θ(x, X_2 β_2) = b + a · c,

where a = ‖X_1 β_1‖, b = ‖X_2 β_2‖, and c = cos θ(X_y β_y, X_{−y}β_{−y}) < 1. Without loss of generality, assume g_n^{ℓ1}(x) = 1, which is equivalent to a > b by Lemma 1. This inequality holds if and only if cos θ(x, X_1 β_1) > cos θ(x, X_2 β_2), or equivalently θ(x, X_1 β_1) < θ(x, X_2 β_2). Thus g_n^{scr}(x) = 1 by Lemma 1 on the angle rule. Therefore, the magnitude rule and the angle rule are the same.

• When data of one class are always orthogonal to data of another class: under this condition, cos θ(X_k β_k, X_{−k}β_{−k}) = 0 for any k, so Equation 6 simplifies to

  cos θ(x, X_k β_k) = ‖X_k β_k‖
                    = ‖x − X_{−k}β_{−k} − ε‖
                    = (‖x‖^2 − ‖X_{−k}β_{−k}‖^2 − ‖ε‖^2)^{1/2},

where we used the fact that X_k β_k, X_{−k}β_{−k} and ε are all pairwise orthogonal to each other in the third line. Since ‖x‖ and ‖ε‖


are both fixed in the classification step, the class y with the smallest ‖X_{−y}β_{−y}‖ has the largest cos θ(x, X_y β_y) and thus the smallest angle. Therefore, g_n^{ℓ1}(x) = g_n^{scr}(x) by Lemma 1.

Theorem 2 Proof

Recall that (x, y) denotes the testing observation pair generated by (X, Y ). To prove the theorem, we state two more lemmas:

Lemma 2. Given a testing pair (X, Y ) generated under a latent subspace mixture model satisfying the angle condition, denote X_{−Y} = [X_1, X_2, . . . , X_s] as a collection of random variables X_i with Y_i ≠ Y, and C = [c_1, c_2, . . . , c_s] as a nonzero coefficient vector of size s. Then it holds that

min_C θ(X, X_{−Y} · C) > 0.   (7)

Proof. This lemma essentially states that the testing data X cannot be perfectly explained by any linear combination of the training data from the incorrect classes, i.e., X ≠ X_{−Y} · C for any C. Note that C is not necessarily the regression coefficient vector, but rather an arbitrary coefficient vector. Under the latent subspace mixture model, X = W_Y U for the testing data and X_i = W_{Y_i} U_i for the training data, where each Y_i ≠ Y. If there existed a vector C such that X = X_{−Y} C, then

X = W_Y · U = X_{−Y} C = Σ_{i=1}^{s} W_{Y_i} · U_i · c_i  ⇒  span(W_Y) ∩ span(W/W_Y) ≠ {0},

which contradicts the angle condition.

Lemma 3. Given a testing pair (X, Y ) generated under a latent subspace mixture model satisfying the angle condition, denote X_1, X_2, . . . , X_n as a group of random variables that are independently and identically distributed as X, satisfying Y_i = Y for all i. As n → ∞ it holds that

θ(X, X_(1)) → 0,   (8)

where X_(1) denotes the order statistic with the smallest angle difference to X.

Proof. This lemma is guaranteed by the property of order statistics: for any ε > 0,

Prob(θ(X, X_i) < ε) > 0  ⇒  Prob(θ(X, X_(1)) < ε) → 1 as n → ∞.

Therefore, as long as there are sufficiently many training data of the correct class, with probability converging to 1 there exists one training observation of the same class that is sufficiently close to the testing observation.

Now we prove Theorem 2:

Proof. To prove consistency, it suffices to prove that g_n^{scr}(X) = Y asymptotically. By Lemma 1 on the angle rule, this is equivalent to proving that for sufficiently large n, for every k ≠ Y it holds that

θ(X, X_Y β_Y) < θ(X, X_k β_k).

By Lemma 2, there exists a constant ε such that θ(X, X_k β_k) > ε regardless of n. By Lemma 3, as the sample size increases, with probability converging to 1 it holds that θ(X, X_(1)) < ε, where Y_(1) = Y. Moreover, X_(1) is guaranteed to enter the sparse representation, as by definition it is the first to enter the sparse representation X during screening. Since Y_(1) = Y, X_(1) is part of X_Y, and it follows that with probability converging to 1,

θ(X, X_Y β_Y) ≤ θ(X, X_(1)) < ε < θ(X, X_k β_k)

for all k ≠ Y. Thus Algorithm 2 is consistent under the latent subspace mixture model and the angle condition.

Theorem 3 Proof

Proof. In both contamination models, it suffices to prove that W_V satisfying the angle condition is equivalent to W satisfying the angle condition; then applying Theorem 2 yields consistency in both models.

In the fixed contamination case, the W matrix simply reduces to W_V, so the conclusion follows directly. For the random contamination


case, observe that each W_i can be decomposed into two components with disjoint spans:

W_i = diag(I(V_i = 1)) W_i + diag(I(V_i < 1)) W_i,
span{diag(I(V_i = 1)) W_i} ∩ span{diag(I(V_i < 1)) W_i} = {0}.

Therefore, as long as the uncontaminated part

W_V = [diag(I(V_1 = 1))W_1 | · · · | diag(I(V_K = 1))W_K]

satisfies the angle condition, the random matrix W always satisfies the angle condition. For example, say the first three dimensions of W are never contaminated while the remaining dimensions can each be contaminated with nonzero probability. Then the first three dimensions satisfying Equation 4 guarantees that W also satisfies Equation 4, regardless of how the remaining dimensions of W are contaminated.

Theorem 4 Proof

Proof. By Lemma 1 on the angle rule, it suffices to prove that for sufficiently large sample size, it holds that

θ(x, X_y β_y) < θ(x, X_k β_k)

for every k ≠ y. Without loss of generality, assume x is the testing adjacency vector of size 1 × n from class 1, x′ is a training adjacency vector of size 1 × n also from class 1, and x_1, x_2, . . . , x_s is any group of adjacency vectors not from class 1. Similar to the proof of Theorem 2, it further suffices to prove that for sufficiently large n, it holds that

cos θ(x, x′) > cos θ(x, Σ_{i=1}^{s} c_i x_i)

for any non-negative and non-zero vector C = [c_1, . . . , c_s].

First, for the within-class angle:

cos θ(x, x′)
  = Σ_{j=1}^{n} B(V(1, Y_j)V(1, Y_j)) / √( Σ_{j=1}^{n} B(V(1, Y_j)) · Σ_{j=1}^{n} B(V(1, Y_j)) )
  = ( Σ_{j=1}^{n} B(V(1, Y_j)V(1, Y_j)) / n ) / √( Σ_{j=1}^{n} B(V(1, Y_j)) / n · Σ_{j=1}^{n} B(V(1, Y_j)) / n )
  → Σ_{i=1}^{K} ρ_i V(1, i)V(1, i) / Σ_{i=1}^{K} ρ_i V(1, i)   as n → ∞
  = E(Q_1^2) / E(Q_1).

For the between-class angle:

cos θ(x, Σ_{i=1}^{s} c_i x_i)
  = Σ_{j=1}^{n} Σ_{i=1}^{s} c_i B(V(1, Y_j)V(y_i, Y_j)) / √( Σ_{j=1}^{n} B(V(1, Y_j)) · Σ_{j=1}^{n} ( Σ_{i=1}^{s} c_i B(V(y_i, Y_j)) )^2 )
  → Σ_{i=1}^{s} c_i E(Q_1 Q_{y_i}) / √( E(Q_1) ( Σ_{i=1}^{s} c_i^2 E(Q_{y_i}) + 2 Σ_{j>i}^{s} Σ_{i=1}^{s} c_i c_j E(Q_{y_i} Q_{y_j}) ) )   as n → ∞
  ≤ Σ_{i=1}^{s} c_i E(Q_1 Q_{y_i}) / √( Σ_{i=1}^{s} c_i^2 E(Q_1) E(Q_{y_i}) ).

The last inequality holds because Σ_{j>i}^{s} Σ_{i=1}^{s} c_i c_j E(Q_{y_i} Q_{y_j}) > 0 for any non-negative coefficient vector C; thus one can remove all these cross terms for the between-class angle and place a ≤ sign above. Since the network data are always non-negative and the SBM is a binary model, the bulk of the regression vector is expected to be positive. In practice, the above inequality almost always holds for non-negative data even if a few regression coefficients turn out to be negative.


Therefore, consistency holds when

cos θ(x, x′) > cos θ(x, Σ_{i=1}^{s} c_i x_i)  a.s.

⇐  E(Q_1^2)/E(Q_1) ≥ Σ_{i=1}^{s} c_i E(Q_1 Q_{y_i}) / √( Σ_{i=1}^{s} c_i^2 E(Q_1) E(Q_{y_i}) )

⇔  E(Q_1^2)/E(Q_1) ≥ Σ_{i=1}^{s} c_i q_{1y_i} √( E(Q_1^2) E(Q_{y_i}^2) ) / √( Σ_{i=1}^{s} c_i^2 E(Q_1) E(Q_{y_i}) )

⇐  √( Σ_{i=1}^{s} c_i^2 E(Q_{y_i}) / E(Q_1) ) > Σ_{i=1}^{s} c_i q_{1y_i} √( E(Q_{y_i}^2) / E(Q_1^2) )

⇐  q_{1k}^2 · E(Q_k^2)/E(Q_1^2) < E(Q_k)/E(Q_1)  for all k ≠ 1,

which is exactly Equation 5 when generalized to an arbitrary class l instead of just class 1. Note that this result is also used in Lemma 1 and Theorem 1 of [14]. The condition can be readily verified for any given SBM model, and it usually holds for models that are densely connected within-class and sparsely connected between-class.
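The within-class limit E(Q_1^2)/E(Q_1) derived above can also be checked by simulation; the sketch below (ours, with an arbitrary seed) uses the connectivity matrix of Section 4.2.1.

```python
import numpy as np

rng = np.random.default_rng(4)
V = np.array([[0.3, 0.1, 0.1],
              [0.1, 0.3, 0.1],
              [0.1, 0.1, 0.3]])
rho = np.array([0.3, 0.4, 0.3])
n = 20000

Y = rng.choice(3, size=n, p=rho)
p1 = V[0, Y]                              # edge probabilities for class-1 rows
x = rng.binomial(1, p1)                   # testing adjacency vector from class 1
x_prime = rng.binomial(1, p1)             # training adjacency vector from class 1

empirical = (x @ x_prime) / np.sqrt(x.sum() * x_prime.sum())
theoretical = (rho @ V[0] ** 2) / (rho @ V[0])   # E(Q_1^2) / E(Q_1)
print(empirical, theoretical)             # the two numbers should be close for large n
```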

REFERENCES

[1] M. Osborne, B. Presnell, and B. Turlach, "A new approach to variable selection in least squares problems," IMA Journal of Numerical Analysis, vol. 20, pp. 389–404, 2000.
[2] M. Osborne, B. Presnell, and B. Turlach, "On the lasso and its dual," Journal of Computational and Graphical Statistics, vol. 9, pp. 319–337, 2000.
[3] D. Donoho and X. Huo, "Uncertainty principles and ideal atomic decomposition," IEEE Transactions on Information Theory, vol. 47, pp. 2845–2862, 2001.
[4] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.
[5] E. Candes and T. Tao, "Decoding by linear programming," IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
[6] D. Donoho, "For most large underdetermined systems of linear equations the minimal l1-norm near solution approximates the sparsest solution," Communications on Pure and Applied Mathematics, vol. 59, no. 10, pp. 907–934, 2006.
[7] E. Candes and T. Tao, "Near-optimal signal recovery from random projections: Universal encoding strategies?," IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5406–5425, 2006.
[8] E. Candes, J. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1233, 2006.
[9] J. Wright, A. Y. Yang, A. Ganesh, S. Shankar, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[10] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, "Sparse representation for computer vision and pattern recognition," Proceedings of the IEEE, vol. 98, no. 6, pp. 1031–1044, 2010.
[11] J. Yin, Z. Liu, Z. Jin, and W. Yang, "Kernel sparse representation based classification," Neurocomputing, vol. 77, no. 1, pp. 120–128, 2012.
[12] A. Yang, Z. Zhou, A. Ganesh, S. Sastry, and Y. Ma, "Fast l1-minimization algorithms for robust face recognition," IEEE Transactions on Image Processing, vol. 22, no. 8, pp. 3234–3246, 2013.
[13] E. Elhamifar and R. Vidal, "Sparse subspace clustering: Algorithm, theory, and applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2765–2781, 2013.
[14] L. Chen, C. Shen, J. T. Vogelstein, and C. E. Priebe, "Robust vertex classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 578–590, 2016.
[15] Y. Eldar and M. Mishali, "Robust recovery of signals from a structured union of subspaces," IEEE Transactions on Information Theory, vol. 55, no. 11, pp. 5302–5316, 2009.
[16] Y. Eldar, P. Kuppinger, and H. Bolcskei, "Compressed sensing of block-sparse signals: Uncertainty relations and efficient recovery," IEEE Transactions on Signal Processing, vol. 58, no. 6, pp. 3042–3054, 2010.
[17] E. Elhamifar and R. Vidal, "Block-sparse recovery via convex optimization," IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4094–4107, 2012.
[18] D. Donoho and M. Elad, "Optimal sparse representation in general (nonorthogonal) dictionaries via l1 minimization," Proceedings of the National Academy of Sciences, pp. 2197–2202, 2003.
[19] R. Rigamonti, M. Brown, and V. Lepetit, "Are sparse representations really relevant for image classification?," in Computer Vision and Pattern Recognition (CVPR), 2011.
[20] L. Zhang, M. Yang, and X. Feng, "Sparse representation or collaborative representation: which helps face recognition?," in International Conference on Computer Vision (ICCV), 2011.
[21] Q. Shi, A. Eriksson, A. Hengel, and C. Shen, "Is face recognition really a compressive sensing problem?," in Computer Vision and Pattern Recognition (CVPR), 2011.
[22] Y. Chi and F. Porikli, "Classification and boosting with multiple collaborative representations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1519–1531, 2013.
[23] J. Tropp, "Greed is good: Algorithmic results for sparse approximation," IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, 2004.
[24] J. Tropp and A. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4655–4666, 2007.
[25] T. Zhang, "On the consistency of feature selection using greedy least squares regression," Journal of Machine Learning Research, vol. 10, pp. 555–568, 2009.
[26] T. Cai and L. Wang, "Orthogonal matching pursuit for sparse signal recovery with noise," IEEE Transactions on Information Theory, vol. 57, no. 7, pp. 4680–4688, 2011.
[27] J. Fan and J. Lv, "Sure independence screening for ultrahigh dimensional feature space," Journal of the Royal Statistical Society: Series B, vol. 70, no. 5, pp. 849–911, 2008.


[28] J. Fan, R. Samworth, and Y. Wu, "Ultrahigh dimensional feature selection: beyond the linear model," Journal of Machine Learning Research, vol. 10, pp. 2013–2038, 2009.
[29] L. Wasserman and K. Roeder, "High dimensional variable selection," Annals of Statistics, vol. 37, no. 5A, pp. 2178–2201, 2009.
[30] J. Fan, Y. Feng, and R. Song, "Nonparametric independence screening in sparse ultra-high-dimensional additive models," Journal of the American Statistical Association, vol. 106, no. 494, pp. 544–557, 2011.
[31] C. Genovese, J. Lin, L. Wasserman, and Z. Yao, "A comparison of the lasso and marginal regression," Journal of Machine Learning Research, vol. 13, pp. 2107–2143, 2012.
[32] M. Kolar and H. Liu, "Marginal regression for multitask learning," Journal of Machine Learning Research W & CP, vol. 22, pp. 647–655, 2012.
[33] R. Vershynin, High Dimensional Probability: An Introduction with Applications in Data Science. 2017.
[34] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer, 1996.
[35] M. E. Tipping and C. M. Bishop, "Probabilistic principal component analysis," Journal of the Royal Statistical Society, Series B, vol. 61, pp. 611–622, 1999.
[36] P. Holland, K. Laskey, and S. Leinhardt, "Stochastic blockmodels: First steps," Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
[37] D. Sussman, M. Tang, D. Fishkind, and C. Priebe, "A consistent adjacency spectral embedding for stochastic blockmodel graphs," Journal of the American Statistical Association, vol. 107, no. 499, pp. 1119–1128, 2012.
[38] J. Lei and A. Rinaldo, "Consistency of spectral clustering in stochastic block models," The Annals of Statistics, vol. 43, no. 1, pp. 215–237, 2015.
[39] N. Meinshausen, "Sign-constrained least squares estimation for high-dimensional regression," Electronic Journal of Statistics, vol. 7, pp. 1607–1631, 2013.
[40] A. Georghiades, P. Belhumeur, and D. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[41] K. Lee, J. Ho, and D. Kriegman, "Acquiring linear subspaces for face recognition under variable lighting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684–698, 2005.
[42] S. Nene, S. Nayar, and H. Murase, "Columbia object image library (COIL-20)," in Technical Report CUCS-005-96, 1996.
[43] C. E. Priebe, D. J. Marchette, Z. Ma, and S. Adali, "Manifold matching: Joint optimization of fidelity and commensurability," Brazilian Journal of Probability and Statistics, vol. 27, no. 3, pp. 377–400, 2013.
[44] D. H. Hall and R. Russell, "The posterior nervous system of the nematode Caenorhabditis elegans: serial reconstruction of identified neurons and complete pattern of synaptic interactions," The Journal of Neuroscience, vol. 11, no. 1, pp. 1–22, 1991.
[45] L. R. Varshney, B. L. Chen, E. Paniagua, D. H. Hall, and D. B. Chklovskii, "Structural properties of the Caenorhabditis elegans neuronal network," PLoS Computational Biology, vol. 7, no. 2, p. e1001066, 2011.
[46] L. Chen, J. T. Vogelstein, V. Lyzinski, and C. E. Priebe, "A joint graph inference case study: the C. elegans chemical and electrical connectomes," Worm, vol. 5, no. 2, p. 1, 2016.