Noname manuscript No. (will be inserted by the editor)

Bayesian Discriminant Analysis Using Many Predictors

Xingqi Du · Subhashis Ghosal

Received: date / Accepted: date

Abstract We consider the problem of Bayesian discriminant analysis using a high dimensional predictor. In this setting, the underlying precision matrices can be estimated with reasonable accuracy only if some appropriate additional structure like sparsity is assumed. We induce a prior on the precision matrix through a sparse prior on its Cholesky decomposition. For computational ease, we use shrinkage priors to induce sparsity on the off-diagonal entries of the Cholesky decomposition matrix and exploit certain conditional conjugacy structure. We obtain the contraction rate of the posterior distribution for the mean and the precision matrix respectively using the Euclidean and the Frobenius distance, and show that under a mild restriction on the growth of the dimension, the misclassification probability of the Bayesian classification procedure converges to that of the oracle classifier for both linear and quadratic discriminant analysis. Extensive simulations show that the proposed Bayesian methods perform very well. An application to identifying cancerous breast tumors based on image data obtained using fine needle aspirate is considered.

Keywords Discriminant analysis · High dimensional predictor · Posterior concentration · Shrinkage prior · Sparsity

Research of the second author is partially supported by NSF grant DMS-1510238.

Xingqi Du
North Carolina State University
E-mail: [email protected]

Subhashis Ghosal
North Carolina State University
E-mail: [email protected]


1 Introduction

Classification has always been an important problem in statistics and its importance has grown further in the age of data science. Classifying between two multivariate normal populations leads to the idea of discriminant analysis, which is essentially the Bayes classifier for the problem. Mahalanobis was one of the pioneering developers of classification techniques. In 1936, he introduced the notion of a distance between populations (Mahalanobis [29]), and the corresponding sample analogs, which indicate the level of difficulty in a classification problem between two normal populations with a common covariance matrix. The paper received more than six thousand citations and the distance famously became known as the Mahalanobis distance, but some of the ideas can even be traced back to his earlier paper (Mahalanobis [28]). He applied discriminant analysis extensively, especially in his papers on anthropometry (Mahalanobis et al. [31], Mahalanobis [25,24,26,27,30], Majumdar et al. [23]). Quadratic discriminant analysis (QDA) is a popular method for two-class classification based on a simple application of the Bayes theorem, which constructs a combination of the features as a Bayes classifier. The features from the two classes follow multivariate normal distributions with different means µ0, µ1 and precision matrices Ω0 = Σ0^{-1}, Ω1 = Σ1^{-1}. Under the restriction that Ω0 = Ω1 = Ω (say), the Bayes classifier becomes a linear combination of the features, and the analysis is called linear discriminant analysis (LDA).

Data involving high dimensional predictors arise naturally in modern applications from genetics, artificial intelligence, machine learning and so on. Since the classification rule largely relies on the estimation of the precision matrices, obtaining good estimates of Ω0 and Ω1 is the key issue for this problem, especially when the number of features p is even bigger than the number of observations n. In a high dimensional setting, these matrices can be estimated only in the presence of additional structures such as sparsity. A sparse structure of a precision matrix gives rise to a graphical structure on the predictors, in that conditional independence of two predictors given the others implies that the corresponding off-diagonal entry of the precision matrix is zero. For Gaussian graphical models, which we are currently concerned with, the converse also holds. Thus a non-zero off-diagonal entry of the precision matrix implies an intrinsic relation between the two predictors. In a high dimensional setting, it is reasonable to think that only a few such intrinsic relations can be present among the predictors, and hence the precision matrix can be assumed to be sparse. Thus the real dimension of the parameter space will be much smaller than the ambient dimension, allowing sensible inference by explicitly using sparsity. Such a strategy was pursued in Fan and Fan [12], who studied properties of misclassification rates.

Non-Bayesian approaches to estimation based on regularization methods for the precision matrix Ω or the covariance matrix Σ have been proposed in the literature. Huang et al. [18] developed a nonparametric method to estimate the covariance matrix through its Cholesky decomposition. Ledoit and Wolf [22] put forward the estimation of well-conditioned covariance matrices. Bickel and Levina [5,6] introduced a method to estimate a nearly banded covariance matrix. Cai et al. [7,8] proposed a constrained ℓ1-minimization approach to estimate a sparse precision matrix. Friedman et al. [13] proposed the graphical lasso estimator by minimizing the sum of the negative log-likelihood of Ω and a constant multiple of the ℓ1-norm of Ω. A closely related method was proposed by Yuan and Lin [38]. Meinshausen and Buhlmann [32] regressed each component on the others imposing a lasso-type penalty to obtain sparse estimates of the regression coefficients, which leads to a sparse graphical model, but the lack of symmetry in selection may lead to conceptual problems. Peng et al. [34] improved the approach by taking symmetry of the precision matrix into account. A weighted version of their procedure that cleverly uses convexity to ensure fast convergence of the iterative algorithm was considered by Khare et al. [21]. Du and Ghosal [11] generalized the procedure in the context of multivariate observations at each node.

Bayesian methods for estimating covariance and precision matrices have also been considered. In a low-dimensional setting, a Wishart prior on the precision matrix leads to a conjugate prior in the Gaussian setting, but this does not induce sparsity. In higher dimensions, we need to consider priors that give special emphasis to the value zero. Banerjee and Ghosal [1] considered a conjugate graphical Wishart prior. They studied posterior contraction properties in the spectral norm. However, when the graphical structure is not given, posterior computation then requires evaluation of marginal probabilities of numerous different graphs. This is extremely difficult even for moderate dimension, as the number of possible graphs increases rapidly with the number of vertices. Bayesian methods directly imposing prior distributions on precision matrices, and thus inducing prior distributions on the corresponding graphs, were also proposed. Wang [36] suggested using a double-exponential prior on off-diagonal entries and an exponential prior on diagonal ones, which allows identifying the graphical lasso as the posterior mode, and hence the procedure is termed the Bayesian graphical lasso. This prior also does not induce sparsity and needs a forceful restriction to the cone of positive definite matrices. Wang [36] developed a block Gibbs sampler for efficient computation. Banerjee and Ghosal [2] modified the Bayesian graphical lasso to allow point mass on the prior for off-diagonal entries, thus inducing genuine sparsity, but it makes the computation a lot more challenging, especially since a reversible jump Markov chain Monte Carlo (MCMC) sampling scheme will be needed to accommodate moves across dimensions. Banerjee and Ghosal [2] proposed using a Laplace approximation technique to avoid MCMC methods altogether for large samples. The method can only approximate model posterior probabilities of “regular models” in which the graphical lasso does not have any zero component. Although priors which use a combination of point mass and a continuous density possess attractive theoretical properties, the computational burden is generally difficult to overcome. A spike-and-slab prior (Mitchell and Beauchamp [33], George and McCulloch [14], Ishwaran and Rao [19]) is a mixture of two normal distributions, one of which has a very small variance so that the corresponding distribution is highly concentrated at zero. The computation is generally carried out by Stochastic Search Variable Selection (George and McCulloch [14]), where an auxiliary variable is created to indicate whether or not to include that variable in the model. A faster alternative for computing the maximum posterior probability model was recently developed by Rockova and George [35].

Recently a variety of continuous shrinkage priors which can be expressed as global-local scale mixtures of Gaussians have been developed. Carvalho et al. [9,10] introduced the horseshoe prior, where the local shrinkage parameter follows a half-Cauchy distribution. The main idea is to use a single density with a high concentration near zero and thick tails to mimic the combined effect of a point-mass and a thick-tailed continuous density, which avoids the complication of using two different components of a mixture of a point-mass and a continuous density prior. Other alternatives such as the normal-gamma prior (Griffin and Brown [17]), the Dirichlet-Laplace prior (Bhattacharya et al. [4]) and the horseshoe+ prior (Bhadra et al. [3]) have a similar nature. Use of such shrinkage priors typically leads to faster computation. In the context of a precision matrix, such shrinkage priors may be used for the off-diagonal entries. However, as the precision matrix needs to be positive definite, it is convenient to use a Cholesky decomposition of the precision matrix so that the entries become unrestricted and hence a shrinkage prior without further restrictions can be imposed.

In this paper, we consider the problem of Bayesian discriminant analysis for a high dimensional predictor. A full Bayesian method for classification, instead of plugging in estimated values of the group means and penalized estimates of the precision matrices into the Bayes classifier, is more informative as the former can automatically address the model uncertainty. We consider two methods of constructing sparse priors on the precision matrix through its Cholesky decomposition, using spike-and-slab or horseshoe priors on the off-diagonal entries. We also obtain an important relation connecting the sparsity of the precision matrix and that of its Cholesky decomposition.

The paper is organized as follows. Section 2 gives the technical description of the problem and obtains the likelihood function. The spike-and-slab prior and the horseshoe prior on the Cholesky decomposition are introduced in Section 3, and their posterior computing strategies are developed. In Section 4, rates of contraction of the posterior distributions of the model parameters are derived, and convergence of the misclassification rate of the Bayesian discriminant analysis to that of the oracle classifier is established under a mild restriction on the dimension assuming sparsity of the true precision matrix. An extensive simulation study to compare the performance of the proposed methods with some other classification methods is conducted in Section 5. An application to a dataset for classifying breast cancer is considered in Section 6. Proofs are presented in Section 7.


2 Problem Description

Consider observing training data (Xm, Ym), m = 1, . . . , n, where Xm ∈ 𝒳 and Ym ∈ {0, 1}, and
\[
X_m \mid Y_m = k \sim \begin{cases} p_0(x), & \text{if } Y_m = 0,\\ p_1(x), & \text{if } Y_m = 1, \end{cases} \tag{1}
\]
for some densities p0 and p1. Thus the X-observations come from either Class 0 with density p0 or Class 1 with density p1, depending on the label Y. However, beyond the training data, only X-values are observed and the problem is to predict the label Y based on the predictor X. Let π = P(Y = 1), the prevalence rate of Class 1 in the mixture population with density (1 − π)p0 + πp1. A classification rule (or a classifier) φ is a map φ : 𝒳 → [0, 1], such that φ(x) stands for the probability of classifying a new observation to Class 1. These rules are generally randomized, but if φ takes values only in {0, 1} (i.e., the class is deterministically obtained given the predictor X), it will be called a non-randomized classifier. The misclassification rate of a classifier φ is given by r(φ) = E[Y(1 − φ(X)) + (1 − Y)φ(X)]. If π, p0, p1 are given, then the Bayes classifier φ∗ is given by
\[
\phi^*(x) = \begin{cases} 1, & \text{if } \pi p_1(x) > (1-\pi)p_0(x),\\ 0, & \text{if } \pi p_1(x) \le (1-\pi)p_0(x). \end{cases} \tag{2}
\]

The following result, which is related to the Bayes risk minimizing property of Bayes rules, shows that the Bayes classifier has the smallest possible misclassification rate.

Theorem 1 The Bayes classifier φ∗(X) minimizes the misclassification rate, that is, for any other classifier φ,
\[
\mathrm{E}\bigl[Y(1-\phi^*(X)) + (1-Y)\phi^*(X)\bigr] \le \mathrm{E}\bigl[Y(1-\phi(X)) + (1-Y)\phi(X)\bigr].
\]

Although the result is essentially known, for the sake of completeness, a formal proof is given in Section 7.

Now consider the situation where the predictor variables are obtained from multivariate normal populations: Xm ∼ N(µ0, Σ0) if Ym = 0, and Xm ∼ N(µ1, Σ1) if Ym = 1. Let Ω0 = Σ0^{-1} and Ω1 = Σ1^{-1} be the corresponding precision matrices. Then the Bayes classifier reduces to the indicator of
\[
\tfrac{1}{2}\log\det\Omega_1 - \tfrac{1}{2}X^T\Omega_1 X + X^T\Omega_1\mu_1 - \tfrac{1}{2}\mu_1^T\Omega_1\mu_1 + \log\pi
> \tfrac{1}{2}\log\det\Omega_0 - \tfrac{1}{2}X^T\Omega_0 X + X^T\Omega_0\mu_0 - \tfrac{1}{2}\mu_0^T\Omega_0\mu_0 + \log(1-\pi). \tag{3}
\]

The decision boundary is quadratic in X and hence the classification procedure is called Quadratic Discriminant Analysis (QDA). If further Ω0 = Ω1 = Ω (say), then the classification procedure simplifies to
\[
X^T\Omega(\mu_1-\mu_0) > \tfrac{1}{2}\bigl(\mu_1^T\Omega\mu_1 - \mu_0^T\Omega\mu_0\bigr) + \log\frac{1-\pi}{\pi}. \tag{4}
\]


Because the decision boundary is linear in X, the procedure is called Linear Discriminant Analysis (LDA).
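To make the classifiers in (3) and (4) concrete, the following is a minimal sketch (assuming numpy; the function names and the commented example are illustrative and not part of the paper) of the oracle QDA and LDA decision statistics; a new observation is assigned to Class 1 exactly when the returned value is positive.

```python
import numpy as np

def qda_score(x, pi, mu0, mu1, Omega0, Omega1):
    """Oracle QDA statistic from (3): classify to Class 1 when the value is positive."""
    def part(mu, Omega, prior):
        _, logdet = np.linalg.slogdet(Omega)
        return (0.5 * logdet - 0.5 * x @ Omega @ x + x @ Omega @ mu
                - 0.5 * mu @ Omega @ mu + np.log(prior))
    return part(mu1, Omega1, pi) - part(mu0, Omega0, 1.0 - pi)

def lda_score(x, pi, mu0, mu1, Omega):
    """Oracle LDA statistic from (4): classify to Class 1 when the value is positive."""
    return (x @ Omega @ (mu1 - mu0)
            - 0.5 * (mu1 @ Omega @ mu1 - mu0 @ Omega @ mu0)
            - np.log((1.0 - pi) / pi))

# Illustrative use with equal prevalence:
# x = np.zeros(3); mu0 = -0.5 * np.ones(3); mu1 = 0.5 * np.ones(3)
# print(lda_score(x, 0.5, mu0, mu1, np.eye(3)) > 0)
```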

In practice, the parameters π, µ0, µ1, Ω0, Ω1 are unknown. Hence the Bayes classifier cannot be executed and should only be viewed as an oracle classifier. The goal is then to mimic the oracle classifier in terms of the misclassification rate. If the true parameter values are given by π∗, µ0∗, µ1∗, Ω0∗, Ω1∗ but (3) is applied with parameters (π, µ0, µ1, Ω0, Ω1), then the misclassification rate as a function of (π, µ0, µ1, Ω0, Ω1) can be written as
\[
r(\pi,\mu_0,\mu_1,\Omega_0,\Omega_1) = \pi^* P_{(\mu_1^*,\Omega_1^*)}\{X^TAX + B^TX + C \le 0\} + (1-\pi^*)\,P_{(\mu_0^*,\Omega_0^*)}\{X^TAX + B^TX + C > 0\}, \tag{5}
\]
where A = −(Ω1 − Ω0)/2, B = Ω1µ1 − Ω0µ0, and
\[
C = \log\pi - \log(1-\pi) + \bigl(\log\det\Omega_1 - \log\det\Omega_0 - \mu_1^T\Omega_1\mu_1 + \mu_0^T\Omega_0\mu_0\bigr)/2,
\]
and the probabilities are evaluated under the true distributions N(µ1∗, Σ1∗) and N(µ0∗, Σ0∗) respectively. Since A, B, C depend continuously on (π, µ0, µ1, Ω0, Ω1), it is clear that r(π, µ0, µ1, Ω0, Ω1) also depends continuously on (π, µ0, µ1, Ω0, Ω1). In fact it is also locally Lipschitz continuous since the function is clearly continuously differentiable. Further, for LDA, the misclassification probability has an explicit expression

\[
\begin{aligned}
r(\pi, \mu_0, \mu_1, \Omega)
&= \pi^*\,\Phi\!\left(\frac{\tfrac12\mu_1^T\Omega\mu_1 - \tfrac12\mu_0^T\Omega\mu_0 - (\mu_1-\mu_0)^T\Omega\mu_1^* + \log\frac{1-\pi}{\pi}}{\sqrt{(\mu_1-\mu_0)^T\Omega\Sigma^*\Omega(\mu_1-\mu_0)}}\right) \\
&\quad+ (1-\pi^*)\,\Phi\!\left(\frac{\tfrac12\mu_0^T\Omega\mu_0 - \tfrac12\mu_1^T\Omega\mu_1 - (\mu_0-\mu_1)^T\Omega\mu_0^* + \log\frac{\pi}{1-\pi}}{\sqrt{(\mu_1-\mu_0)^T\Omega\Sigma^*\Omega(\mu_1-\mu_0)}}\right),
\end{aligned} \tag{6}
\]

where Φ stands for the standard normal cumulative distribution function.

Now, when the parameters are unknown, substituting their consistent estimators in both QDA and LDA will lead to classification rules whose misclassification rates converge to those of the oracle classifier, and hence asymptotic optimality is obtained. In a Bayesian setting, it is more natural to put priors on the parameters and obtain their posterior distributions given the observations. The Bayes classifier is then given by the posterior expectation. An advantage of the Bayesian procedure is that an uncertainty quantification about the classification is automatically obtained. Consistency of the posterior then implies that the misclassification probability of the Bayes classifier converges to that of the oracle classifier, assuming that the parameter space remains fixed with respect to the sample size. In high dimensional models, however, the parameter space changes with the sample size and a more refined analysis is necessary to conclude that the misclassification probability of the Bayes classifier converges to that of the oracle classifier. In particular, evaluating the Lipschitz constant for the misclassification probability as a function of the dimension, a bound for the posterior contraction rate and a sufficiently slow growth of the dimension with respect to the sample size ensure convergence. Theorem 3 in Section 4 addresses this issue.
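As an illustration of how the closed form (6) can be evaluated, the following is a minimal sketch (assuming numpy and scipy; the function name and the commented example are illustrative) that computes the LDA misclassification rate of a plugged-in rule judged under the true populations.

```python
import numpy as np
from scipy.stats import norm

def lda_misclassification_rate(pi, mu0, mu1, Omega, pi_star, mu0_star, mu1_star, Sigma_star):
    """Evaluate r(pi, mu0, mu1, Omega) in (6): the plugged-in rule (4) is applied,
    while probabilities are taken under the true N(mu_k*, Sigma*) populations."""
    d = mu1 - mu0
    denom = np.sqrt(d @ Omega @ Sigma_star @ Omega @ d)
    q1 = 0.5 * (mu1 @ Omega @ mu1) - 0.5 * (mu0 @ Omega @ mu0)
    num1 = q1 - d @ Omega @ mu1_star + np.log((1 - pi) / pi)              # Class-1 error term
    num0 = -q1 - (mu0 - mu1) @ Omega @ mu0_star + np.log(pi / (1 - pi))   # Class-0 error term
    return pi_star * norm.cdf(num1 / denom) + (1 - pi_star) * norm.cdf(num0 / denom)

# Plugging in the true parameters reproduces the oracle risk, e.g.:
# mu1 = np.array([0.5, 0.5]); mu0 = -mu1
# print(lda_misclassification_rate(0.5, mu0, mu1, np.eye(2), 0.5, mu0, mu1, np.eye(2)))
```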

Let Ω0 = Σ0^{-1} = L0D0L0^T and Ω1 = Σ1^{-1} = L1D1L1^T be the Cholesky decompositions of Ω0 and Ω1 respectively, where L0 = ((l0,ij)) and L1 = ((l1,ij)) are lower triangular matrices with diagonal elements equal to 1, and D0 = diag(d0,1, . . . , d0,p) and D1 = diag(d1,1, . . . , d1,p) with d0,i > 0 and d1,i > 0 for all i = 1, . . . , p. We put priors on the Cholesky decomposition of the precision matrix in order to ensure the positive definiteness of the matrix. Without loss of generality, we assume that the first N observations are from Class 1 and the rest are from Class 0. Let
\[
\bar X_0 = (n-N)^{-1}\sum_{m=N+1}^{n} X_m, \qquad \bar X_1 = N^{-1}\sum_{m=1}^{N} X_m,
\]
\[
S_0 = \sum_{m=N+1}^{n}(X_m-\bar X_0)(X_m-\bar X_0)^T, \qquad S_1 = \sum_{m=1}^{N}(X_m-\bar X_1)(X_m-\bar X_1)^T.
\]
Then the likelihood can be written as

\[
\begin{aligned}
\ell(\mu_0,\mu_1,\Omega_0,\Omega_1)
&\propto |L_0D_0L_0^T|^{(n-N)/2}\,|L_1D_1L_1^T|^{N/2}
\exp\Bigl\{-\tfrac12\Bigl[\sum_{m=1}^{N}(X_m-\mu_1)^T\Omega_1(X_m-\mu_1) + \sum_{m=N+1}^{n}(X_m-\mu_0)^T\Omega_0(X_m-\mu_0)\Bigr]\Bigr\} \\
&= \prod_{i=1}^{p} d_{0,i}^{(n-N)/2}\, d_{1,i}^{N/2}
\exp\Bigl\{-\tfrac12\Bigl[\sum_{m=1}^{N}(X_m-\mu_1)^T\Omega_1(X_m-\mu_1) + \sum_{m=N+1}^{n}(X_m-\mu_0)^T\Omega_0(X_m-\mu_0)\Bigr]\Bigr\} \\
&= \prod_{i=1}^{p} d_{0,i}^{(n-N)/2}
\exp\Bigl\{-\tfrac{n-N}{2}(\bar X_0-\mu_0)^T L_0D_0L_0^T(\bar X_0-\mu_0) - \tfrac12\mathrm{tr}(S_0L_0D_0L_0^T)\Bigr\} \\
&\qquad\times \prod_{i=1}^{p} d_{1,i}^{N/2}
\exp\Bigl\{-\tfrac{N}{2}(\bar X_1-\mu_1)^T L_1D_1L_1^T(\bar X_1-\mu_1) - \tfrac12\mathrm{tr}(S_1L_1D_1L_1^T)\Bigr\}.
\end{aligned}
\]

If we have prior information that Ω0 = Ω1 = Ω = LDL^T, where D has diagonal elements d1, . . . , dp, then the likelihood can be simplified as
\[
\begin{aligned}
\ell(\mu_0,\mu_1,\Omega)
&\propto \prod_{i=1}^{p} d_i^{n/2}
\exp\Bigl\{-\tfrac12\Bigl[\sum_{m=1}^{N}(X_m-\mu_1)^T\Omega(X_m-\mu_1) + \sum_{m=N+1}^{n}(X_m-\mu_0)^T\Omega(X_m-\mu_0)\Bigr]\Bigr\} \\
&= \prod_{i=1}^{p} d_i^{n/2}
\exp\Bigl\{-\tfrac12\mathrm{tr}(SLDL^T) - \tfrac{n-N}{2}(\bar X_0-\mu_0)^T LDL^T(\bar X_0-\mu_0) - \tfrac{N}{2}(\bar X_1-\mu_1)^T LDL^T(\bar X_1-\mu_1)\Bigr\},
\end{aligned}
\]
where S = S0 + S1.
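For concreteness, the following is a minimal sketch (assuming numpy; names are illustrative) of evaluating the LDA log-likelihood above from the sufficient statistics S, X̄0 and X̄1, up to an additive constant.

```python
import numpy as np

def lda_log_likelihood(X, y, mu0, mu1, L, D):
    """Log-likelihood (up to an additive constant) of the LDA model with
    Omega = L D L^T, computed from (S, Xbar0, Xbar1); X is n x p, y in {0, 1}."""
    Omega = L @ np.diag(D) @ L.T
    X1, X0 = X[y == 1], X[y == 0]
    N, n = X1.shape[0], X.shape[0]
    Xbar1, Xbar0 = X1.mean(axis=0), X0.mean(axis=0)
    S = (X1 - Xbar1).T @ (X1 - Xbar1) + (X0 - Xbar0).T @ (X0 - Xbar0)
    return (0.5 * n * np.sum(np.log(D))
            - 0.5 * np.trace(S @ Omega)
            - 0.5 * (n - N) * (Xbar0 - mu0) @ Omega @ (Xbar0 - mu0)
            - 0.5 * N * (Xbar1 - mu1) @ Omega @ (Xbar1 - mu1))
```

Here the determinant term uses |LDL^T| = d1 · · · dp, since L has unit diagonal.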


3 Prior assignment and posterior computation

First we consider the model in QDA. For each i = 1, . . . , p, we give dk,i a gamma prior dk,i ∼ Ga(α, β) for k = 0, 1. A normal prior N(0, σ²Ip) with a large σ is natural for µ0 and µ1. To put a sparse prior on Ω0 and Ω1 using the Cholesky decomposition, we need to make the off-diagonal entries of L0 and L1 sparse (or approximately so). For LDA, just one lower-triangular matrix L and one diagonal matrix D need to be considered. A problem with putting a prior on the precision matrix through the Cholesky decomposition is that the prior depends on the ordering of the predictor variables. While it is not possible to completely avoid the problem, much of it can be alleviated by appropriately changing the sparsity level at each row of L = ((lij)). As the following proposition shows, decreasing the probability of a non-zero entry inversely proportionally to the square root of the serial number of the row makes the sparsity roughly of the order of a constant. Let ≍ stand for equality of order.

Proposition 1 Consider a prior on the entries lij, j < i, such that the entries are independent, and let ρi = Π(lij ≠ 0). If ρi = Cp/√i for some constant Cp possibly depending on p, then for i ≍ j → ∞, Π(ωij ≠ 0) ≍ Cp², free of i.

In the conclusion, exact sparsity can be replaced by approximate sparsity. It is also typical to let Cp decay like a power of p for good large sample contraction properties of the posterior; see Section 4. The proof of the proposition is given in Section 7.
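A quick numerical check of Proposition 1 (a sketch; the values of p and Cp are arbitrary), using the exact expression 1 − (1 − ρiρj)^{min(i,j)} derived in its proof in Section 7:

```python
import numpy as np

# With row-wise sparsity rho_i = Cp / sqrt(i), the induced probability that
# omega_{ij} != 0, namely 1 - (1 - rho_i * rho_j)^{min(i, j)}, stays roughly
# constant (about Cp^2) along the band j = i - 1.
p, Cp = 500, 0.05
rho = Cp / np.sqrt(np.arange(1, p + 1))
for i in (10, 50, 100, 400):
    j = i - 1
    prob = 1 - (1 - rho[i - 1] * rho[j - 1]) ** min(i, j)
    print(i, round(prob, 5), "target Cp^2 =", Cp ** 2)
```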

For the parameter π standing for the prevalence rate of Class 1 in the entire population, a beta prior π ∼ Be(a0, a1) is natural and conjugate, leading to the posterior distribution π ∼ Be(a0 + N, a1 + n − N), which depends only on the counts. This posterior concentrates around the true prevalence rate π∗ at the rate n^{-1/2}, as well as around the empirical estimate π̂ = N/n. The difference between using the full Bayesian approach and plugging in the estimator π̂ is minimal from both theoretical and computational points of view.
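A one-line illustration of this conjugate update (a sketch assuming scipy; the hyperparameters and counts are arbitrary):

```python
from scipy.stats import beta

# pi | data ~ Be(a0 + N, a1 + n - N), concentrating near the empirical rate N/n.
a0, a1, n, N = 1.0, 1.0, 100, 46
post = beta(a0 + N, a1 + n - N)
print(post.mean(), post.std(), N / n)   # posterior mean and sd vs. empirical estimate
```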

3.1 Spike-and-Slab Prior

Let γk,ij = 0 or 1, for each i > j, k = 0, 1. The priors on the elements of Lk are given by the normal mixture

lk,ij |γk,ij ∼ (1− γk,ij)N(0, v0) + γk,ijN(0, v1). (7)

The indicators γk,ij are considered to be independent Bernoulli variables

γk,ij ∼ Ber(Cp/√i), (8)

where Cp only depends on p. It will be found from the simulation study that the misclassification rate is not sensitive to the choice of Cp.

Sampling from the posterior distribution of the parameters can be carried out by Gibbs sampling using the following steps; here, for any parameter θ, let θ | · stand for the posterior distribution of θ given the remaining parameters, and for an index i, the notation −i stands for {j : j ≠ i}. Let
\[
S_1^* = \sum_{m=1}^{N}(X_m-\mu_1)(X_m-\mu_1)^T = S_1 + N(\bar X_1-\mu_1)(\bar X_1-\mu_1)^T,
\]
\[
S_0^* = \sum_{m=N+1}^{n}(X_m-\mu_0)(X_m-\mu_0)^T = S_0 + (n-N)(\bar X_0-\mu_0)(\bar X_0-\mu_0)^T.
\]

(i) Sampling of L0, L1: For k = 0, 1 and each pair 1 ≤ j < i ≤ p, let τk,ij = (1 − γk,ij)v0 + γk,ij v1. Then
\[
l_{k,ij}\mid\cdot \;\sim\; N\Bigl(-\bigl(\tfrac{1}{\tau_{k,ij}} + d_{k,j}S^*_{k,ii}\bigr)^{-1} l_{k,-i,j}^T S^*_{k,-i,i},\;\; \bigl(\tfrac{1}{\tau_{k,ij}} + d_{k,j}S^*_{k,ii}\bigr)^{-1}\Bigr).
\]
(ii) Sampling of D0, D1: Given Lk and µk,
\[
d_{k,i}\mid\cdot \;\sim\; \mathrm{Ga}\bigl(\alpha + n/2,\; \beta + (L_k^T S_k^* L_k)_{ii}/2\bigr).
\]
(iii) Sampling of γ0, γ1: Given L0, L1 and Cp, γk,ij | · ∼ Ber(pk,ij), k = 0, 1, where
\[
p_{k,ij} = \frac{(C_p/\sqrt{i})\, f(l_{k,ij}\mid\gamma_{k,ij}=1)}{(C_p/\sqrt{i})\, f(l_{k,ij}\mid\gamma_{k,ij}=1) + (1-C_p/\sqrt{i})\, f(l_{k,ij}\mid\gamma_{k,ij}=0)}
\]
and f stands for the density of lk,ij given the knowledge of the component.
(iv) Sampling of µ0, µ1: Given Ω0 = L0D0L0^T, Ω1 = L1D1L1^T and σ,
\[
\mu_1\mid\cdot \sim N\Bigl(\bigl(\tfrac{1}{\sigma^2}I_p + N\Omega_1\bigr)^{-1} N\Omega_1\bar X_1,\;\; \bigl(\tfrac{1}{\sigma^2}I_p + N\Omega_1\bigr)^{-1}\Bigr)
\]
and
\[
\mu_0\mid\cdot \sim N\Bigl(\bigl(\tfrac{1}{\sigma^2}I_p + (n-N)\Omega_0\bigr)^{-1} (n-N)\Omega_0\bar X_0,\;\; \bigl(\tfrac{1}{\sigma^2}I_p + (n-N)\Omega_0\bigr)^{-1}\Bigr),
\]
where $\tilde X_1^T = (X_1, \ldots, X_N) - \mu_1 1_N^T$ and $\tilde X_0^T = (X_{N+1}, \ldots, X_n) - \mu_0 1_{n-N}^T$; here $1_k$ stands for the k-dimensional column vector with all entries equal to 1.

In the case of LDA, as Ω0 = Ω1 = Ω (say), Steps (i) and (ii) are modified to
\[
l_{ij}\mid\cdot \sim N\Bigl(-\bigl(\tfrac{1}{\tau_{ij}} + d_j S^*_{ii}\bigr)^{-1} l_{-i,j}^T S^*_{-i,i},\;\; \bigl(\tfrac{1}{\tau_{ij}} + d_j S^*_{ii}\bigr)^{-1}\Bigr)
\]
and $d_i\mid\cdot \sim \mathrm{Ga}\bigl(\alpha + n/2,\; \beta + (L^T S^* L)_{ii}/2\bigr)$, where Ω = LDL^T, S∗ = S1∗ + S0∗ and τ0,ij = τ1,ij = τij, and in (iv) Ω0, Ω1 are replaced by Ω.
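The following is a minimal sketch of the conditionally conjugate updates in steps (ii)–(iv) for the LDA version (the normal update of the entries of L in step (i) is analogous). It assumes numpy, takes the Ga(α, β) prior in the shape–rate parametrization, and all names are illustrative.

```python
import numpy as np

def gibbs_pass_lda(Xbar0, Xbar1, S_star, L, D, mu0, mu1, gamma,
                   N, n, alpha, beta, v0, v1, Cp, sigma2, rng):
    """One pass over steps (ii)-(iv) of the spike-and-slab sampler, LDA version.
    S_star = S0* + S1* is assumed to be recomputed for the current mu0, mu1."""
    p = len(D)
    # (ii) d_i | . ~ Ga(alpha + n/2, beta + (L^T S* L)_{ii} / 2), shape-rate
    M = L.T @ S_star @ L
    for i in range(p):
        D[i] = rng.gamma(alpha + n / 2.0, 1.0 / (beta + M[i, i] / 2.0))
    # (iii) gamma_{ij} | . ~ Ber(p_{ij}), prior inclusion probability Cp / sqrt(row)
    for i in range(1, p):
        rho = Cp / np.sqrt(i + 1)          # row index is 1-based in the paper
        for j in range(i):
            f1 = np.exp(-L[i, j] ** 2 / (2 * v1)) / np.sqrt(v1)  # slab density (up to constants)
            f0 = np.exp(-L[i, j] ** 2 / (2 * v0)) / np.sqrt(v0)  # spike density (up to constants)
            gamma[i, j] = rng.binomial(1, rho * f1 / (rho * f1 + (1 - rho) * f0))
    # (iv) conjugate normal updates for the class means, common Omega = L D L^T
    Omega = L @ np.diag(D) @ L.T
    for mu, count, xbar in ((mu1, N, Xbar1), (mu0, n - N, Xbar0)):
        cov = np.linalg.inv(np.eye(p) / sigma2 + count * Omega)
        mu[:] = rng.multivariate_normal(cov @ (count * Omega @ xbar), cov)
    return D, gamma, mu0, mu1
```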

3.2 Horseshoe Prior

Similar to the previous model, for each i = 1, . . . , p and k = 0, 1, we let dk,i ∼ Ga(α, β) and µ0, µ1 ∼ N(0, σ²Ip) independently, where σ² → ∞. To put a prior on Ωk, k = 0, 1, we let each off-diagonal entry lk,ij, 1 ≤ j < i ≤ p, of Lk have a scale mixture of normals prior
\[
l_{k,ij}\mid\lambda_{k,ij} \sim N(0, \lambda_{k,ij}^2\tau^2), \qquad \lambda_{k,ij} \sim C^+(0,1), \tag{9}
\]
where C⁺(0, 1) is the standard half-Cauchy distribution on the positive real line. We refer to the λk,ij as the local shrinkage parameters and to τ as the global shrinkage parameter.

Sampling from the posterior distribution uses the following steps of Gibbs sampling:

(i) Sampling of L0, L1: For k = 0, 1 and 1 ≤ j < i ≤ p,
\[
l_{k,ij}\mid\cdot \sim N\Bigl(-\bigl(\tfrac{1}{\lambda_{k,ij}^2\tau^2} + d_{k,j}S^*_{k,ii}\bigr)^{-1} l_{k,-i,j}^T S^*_{k,-i,i},\;\; \bigl(\tfrac{1}{\lambda_{k,ij}^2\tau^2} + d_{k,j}S^*_{k,ii}\bigr)^{-1}\Bigr),
\]
where S0∗ and S1∗ are as in the previous section.
(ii) Sampling of D0, D1: Given L0, L1, α, β, S0∗ and S1∗, for k = 0, 1 and 1 ≤ i ≤ p,
\[
d_{k,i}\mid\cdot \sim \mathrm{Ga}\bigl(\alpha + n/2,\; \beta + (L_k^T S_k^* L_k)_{ii}/2\bigr).
\]
(iii) Sampling of λ0,ij, λ1,ij: Given L0, L1 and τ, for k = 0, 1 and each pair 1 ≤ j < i ≤ p, λk,ij has density
\[
\rho(\lambda_{k,ij}) \propto \frac{1}{\lambda_{k,ij}(1+\lambda_{k,ij}^2)} \exp\Bigl\{-\frac{l_{k,ij}^2}{2\lambda_{k,ij}^2\tau^2}\Bigr\}.
\]
(iv) Sampling of µ0, µ1: Given Ω0 = L0D0L0^T, Ω1 = L1D1L1^T and σ,
\[
\mu_1\mid\cdot \sim N\Bigl(\bigl(\tfrac{1}{\sigma^2}I_p + N\Omega_1\bigr)^{-1} N\Omega_1\bar X_1,\;\; \bigl(\tfrac{1}{\sigma^2}I_p + N\Omega_1\bigr)^{-1}\Bigr)
\]
and
\[
\mu_0\mid\cdot \sim N\Bigl(\bigl(\tfrac{1}{\sigma^2}I_p + (n-N)\Omega_0\bigr)^{-1} (n-N)\Omega_0\bar X_0,\;\; \bigl(\tfrac{1}{\sigma^2}I_p + (n-N)\Omega_0\bigr)^{-1}\Bigr),
\]
where $\bar X_0$ and $\bar X_1$ are as before.

In the case of LDA, as explained for the spike-and-slab prior, only Steps (i) and (ii) need to be modified to work with the common values lij and di, and in (iv) the common Ω is to be used.
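Step (iii) requires a draw from a non-standard univariate density; one simple way (a sketch, not the authors' implementation — assuming numpy, with an approximate inverse-CDF draw on a log-spaced grid) is:

```python
import numpy as np

def sample_lambda(l_ij, tau, rng, grid_size=400):
    """Draw a local shrinkage parameter from the density proportional to
    exp(-l^2 / (2 lam^2 tau^2)) / (lam (1 + lam^2)), by normalizing it on a
    log-spaced grid and sampling a grid point (a discrete approximation)."""
    lam = np.logspace(-4, 4, grid_size)
    log_dens = (-l_ij ** 2 / (2.0 * lam ** 2 * tau ** 2)
                - np.log(lam) - np.log1p(lam ** 2))
    w = np.exp(log_dens - log_dens.max())
    return rng.choice(lam, p=w / w.sum())

# rng = np.random.default_rng(0)
# print(sample_lambda(0.3, 0.1, rng))
```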

4 Asymptotic Properties of Bayes Classifier

In this section, we study some large sample properties of the proposed Bayesian methods for estimation and of the resulting Bayes classifier. Let µ0∗, µ1∗, Ω∗ (or Ω0∗, Ω1∗ for QDA), and π∗ be the true values of µ0, µ1, Ω (or Ω0 and Ω1 for QDA), and π respectively. Let $\Omega^* = L^*D^*L^{*T}$ (or $\Omega_0^* = L_0^*D_0^*L_0^{*T}$, $\Omega_1^* = L_1^*D_1^*L_1^{*T}$). Let Dn be the observed data (X1, Y1), . . . , (Xn, Yn). We generically write Π for the prior distribution and Π(· | Dn) for the posterior distribution given Dn. The following assumptions are made throughout.

(A1) There exists a constant B0 > 0, free of n and p, such that the entries of µ0∗, µ1∗, D∗ and L∗ (or D0∗, D1∗, L0∗, L1∗ for QDA) are bounded in absolute value by B0.
(A2) There exist constants 0 < Λmin ≤ Λmax < ∞ such that the true precision matrix Ω∗ satisfies Λmin ≤ λmin(Ω∗) ≤ λmax(Ω∗) ≤ Λmax, where λmin and λmax denote the smallest and largest eigenvalues of a matrix respectively.
(A3) Let s be the number of non-zero off-diagonal elements in L∗ (or in L0∗, L1∗ for QDA).


(A4) There exists a finite constant M0 > 0 such that (µ1∗ − µ0∗)^T Ω∗ (µ1∗ − µ0∗) ≥ M0 > 0 (or (µ1∗ − µ0∗)^T Ωk∗ (µ1∗ − µ0∗) ≥ M0 > 0, k = 0, 1, for QDA).

We assume that the prior on L and D (or L0, L1, D0 and D1 for QDA) is truncated to the domain ‖(LDL^T)^{-1}‖S ≲ p^ν for some predetermined constant ν > 0. As in Wei and Ghosal [37], we assume the following conditions on the prior density πi(·) for an entry of L in the ith row: for some b′ > 2, and for all sufficiently large B > 0,
\[
1 - \min_{1\le i\le p} \int_{-\varepsilon_n/p^\nu}^{\varepsilon_n/p^\nu} \pi_i(l)\,dl \le p^{-b'}, \tag{10}
\]
\[
- \max_{1\le i\le p} \log\Bigl(\int_{|l|\ge B} \pi_i(l)\,dl\Bigr) \gtrsim B, \tag{11}
\]
\[
- \min_{1\le i\le p} \log\Bigl(\inf_{l\in[-B_0,B_0]} \pi_i(l)\Bigr) \lesssim \log n; \tag{12}
\]

here and below ≲ stands for inequality up to a constant multiple.

By following the calculations of Wei and Ghosal [37], for the spike-and-slab prior, the conditions can be simplified to log v1 + B²/(2v1) ≲ nεn²/s and v0 ≍ εn/(p^ν √(log p)). The second condition holds since the tail of the prior decays like e^{−B²/(2v1)}. For the horseshoe prior (slightly modified to have the prior distribution of λij truncated above by 1/τ), the conditions hold if τ < p^{−(2+b2′)} for some constant b2′ > 0, and τ²/B < p^{−(1+b3′)} for some constant b3′ > 0.

First we obtain posterior contraction rates of the parameters used in QDA and LDA in the following sense. Let ‖ · ‖ stand for the Euclidean norm of a vector, and ‖ · ‖F and ‖ · ‖S respectively for the Frobenius (Euclidean) and spectral norms of a matrix.

Theorem 2 Under (A1)–(A3), for k = 0, 1, and any Mn → ∞,
\[
\mathrm{E}\bigl[\Pi\{\|\mu_k - \mu_k^*\| > M_n\varepsilon_n \mid D_n\}\bigr] \to 0, \tag{13}
\]
\[
\mathrm{E}\bigl[\Pi\{\|\Omega_k - \Omega_k^*\|_F > M_n\varepsilon_n \mid D_n\}\bigr] \to 0, \tag{14}
\]
where εn = n^{-1/2}(p + s)^{1/2}(log p)^{1/2}.

For LDA, where Ω0 = Ω1 = Ω, the second assertion reduces to
\[
\mathrm{E}\bigl[\Pi\{\|\Omega - \Omega^*\|_F > M_n\varepsilon_n \mid D_n\}\bigr] \to 0. \tag{15}
\]

The following shows the asymptotic equivalence of the misclassification probabilities of the Bayes classifier and the oracle classifier.

Theorem 3 Under Assumptions (A1)–(A4), for any Mn → ∞,
\[
\Pi\Bigl\{|r(\pi,\mu_0,\mu_1,\Omega_0,\Omega_1) - r(\pi^*,\mu_0^*,\mu_1^*,\Omega_0^*,\Omega_1^*)| > M_n(p+s)\sqrt{\frac{\log p}{n}} \;\Big|\; D_n\Bigr\} \to 0
\]
in probability under the true distribution.

For LDA, the same assertion holds for |r(π, µ0, µ1, Ω) − r(π∗, µ0∗, µ1∗, Ω∗)|.


The theorem implies that if s = O(p) (which is typically the situation), then the Bayes classifier performs almost like the oracle classifier if p² log p = o(n). This is a substantial improvement over the requirement p⁴ = o(n) needed to make the misclassification rate converge to that of the oracle without assuming any sparse structure in the precision matrix.

5 Simulation

We conduct a series of simulation studies to assess the performance of our proposed methods. We compare three Bayesian approaches in this section: the spike-and-slab prior (Spike-and-slab), the horseshoe continuous shrinkage prior (Horseshoe), and a Wishart prior (Wishart) that does not use sparsity of the precision matrix. We also include in our studies one non-Bayesian approach (Bickel-Levina) proposed by Bickel and Levina [5,6].

In the first study, we consider the case where Ω0 = Ω1 = Ω. We specify six different models of Ω or Σ = Ω⁻¹ as follows:

(i) Model 1: AR(1) model, σij = 0.7^{|i−j|}.
(ii) Model 2: AR(2) model, ωii = 1, ωi,i−1 = ωi−1,i = 0.5, ωi,i−2 = ωi−2,i = 0.25.
(iii) Model 3: Block model, σii = 1, σij = 0.5 for 1 ≤ i ≠ j ≤ p/2, σij = 0.5 for p/2 + 1 ≤ i ≠ j ≤ p, and σij = 0 otherwise.
(iv) Model 4: Star model, where every node is connected to the first node, and ωii = 1, ω1,i = ωi,1 = 0.1, and ωij = 0 otherwise.
(v) Model 5: Circle model, ωii = 2, ωi,i−1 = ωi−1,i = 1, ω1,p = ωp,1 = 0.9.
(vi) Model 6: Scale-free model, where Ω is a randomly generated positive definite matrix with 10% sparsity.

We fix the total sample size n = 100 for all scenarios. The data are generated using the steps below:

(i) Obtain the p × p precision matrix Ω described above and compute the covariance matrix Σ = Ω⁻¹.
(ii) Generate a p × 1 vector µ1 = (µ1,1, . . . , µ1,p)^T, where each element is randomly sampled from N(0, 0.2), and let µ0 = −µ1.
(iii) Generate Xm ∼ N(µ1, Σ) if 1 ≤ m ≤ n/2, and Xm ∼ N(µ0, Σ) if n/2 < m ≤ n. Set Ym = 1 for m = 1, . . . , n/2 and Ym = 0 for m = n/2 + 1, . . . , n.

Thus the prevalence rates are set to 50% for both classes. We randomly select 70% of the data as training data, and treat the remaining 30% as test data. We compare the misclassification rates on the test data. The misclassification rates with standard errors for the test sets are given in Table 1.
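A minimal sketch of this data-generation scheme for Model 1 (AR(1)), assuming numpy; since the text does not state whether N(0, 0.2) refers to the variance or the standard deviation, the variance convention is assumed here, and all names are illustrative.

```python
import numpy as np

def generate_ar1_data(n=100, p=50, rho=0.7, seed=0):
    """Simulate one LDA data set: Sigma_ij = rho^|i-j|, mu1 ~ N(0, 0.2) entrywise
    (variance convention assumed), mu0 = -mu1, half the sample from each class."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    mu1 = rng.normal(0.0, np.sqrt(0.2), size=p)
    mu0 = -mu1
    X = np.vstack([rng.multivariate_normal(mu1, Sigma, n // 2),
                   rng.multivariate_normal(mu0, Sigma, n - n // 2)])
    y = np.concatenate([np.ones(n // 2, dtype=int), np.zeros(n - n // 2, dtype=int)])
    # 70/30 train-test split, as in the study
    perm = rng.permutation(n)
    train, test = perm[: int(0.7 * n)], perm[int(0.7 * n):]
    return X[train], y[train], X[test], y[test]
```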

All approaches give similar misclassification rates when Ω is extremely sparse and its structure is relatively simple (like AR(1) and AR(2)). However, as we specify more complicated structures of Ω, both the spike-and-slab and horseshoe methods outperform the non-Bayesian method and also the Wishart method, which does not assume sparsity in Ω. This is more obvious when we have larger p.


Table 1: Misclassification rates for normally distributed data with standard errors (se) in parentheses in the LDA setting

Model       p     Wishart        Spike-and-slab  Horseshoe      Bickel-Levina
AR(1)       50    0.08 (0.051)   0.04 (0.035)    0.04 (0.035)   0.04 (0.027)
            100   0.11 (0.076)   0.06 (0.046)    0.05 (0.032)   0.05 (0.049)
AR(2)       50    0.27 (0.091)   0.19 (0.090)    0.19 (0.075)   0.18 (0.073)
            100   0.28 (0.086)   0.23 (0.068)    0.23 (0.073)   0.17 (0.071)
Block       50    0.11 (0.065)   0.06 (0.038)    0.06 (0.041)   0.12 (0.086)
            100   0.13 (0.042)   0.08 (0.079)    0.09 (0.065)   0.32 (0.116)
Star        50    0.25 (0.103)   0.16 (0.113)    0.17 (0.115)   0.18 (0.075)
            100   0.24 (0.091)   0.20 (0.084)    0.20 (0.071)   0.39 (0.192)
Circle      50    0.13 (0.084)   0.08 (0.065)    0.07 (0.061)   0.17 (0.121)
            100   0.19 (0.081)   0.14 (0.072)    0.12 (0.045)   0.26 (0.076)
Scale-free  50    0.10 (0.048)   0.06 (0.047)    0.06 (0.047)   0.13 (0.077)
            100   0.07 (0.061)   0.04 (0.047)    0.04 (0.044)   0.23 (0.114)

Table 2: Misclassification rates for t-distributed data with standard errors (se) in parentheses in the LDA setting

Model       p     Wishart        Spike-and-slab  Horseshoe      Bickel-Levina
AR(1)       50    0.19 (0.083)   0.13 (0.083)    0.13 (0.062)   0.24 (0.137)
            100   0.18 (0.039)   0.15 (0.048)    0.16 (0.039)   0.16 (0.061)
AR(2)       50    0.30 (0.101)   0.28 (0.055)    0.27 (0.075)   0.32 (0.109)
            100   0.28 (0.064)   0.28 (0.120)    0.32 (0.063)   0.37 (0.113)
Block       50    0.25 (0.072)   0.16 (0.063)    0.16 (0.070)   0.27 (0.156)
            100   0.14 (0.060)   0.13 (0.073)    0.14 (0.077)   0.33 (0.121)
Star        50    0.30 (0.092)   0.25 (0.076)    0.25 (0.084)   0.33 (0.094)
            100   0.25 (0.090)   0.22 (0.082)    0.25 (0.097)   0.40 (0.140)
Circle      50    0.24 (0.053)   0.22 (0.062)    0.22 (0.035)   0.25 (0.086)
            100   0.24 (0.097)   0.18 (0.063)    0.17 (0.052)   0.34 (0.148)
Scale-free  50    0.19 (0.113)   0.16 (0.081)    0.16 (0.080)   0.20 (0.093)
            100   0.14 (0.057)   0.12 (0.045)    0.11 (0.039)   0.27 (0.126)

In the second study, we still use the six models of Ω or Σ = Ω⁻¹ as above. However, the data generation steps are slightly different.

(i) Obtain the p × p precision matrix Ω and compute the covariance matrix Σ = Ω⁻¹.
(ii) Generate a p × 1 vector µ1 = (µ1,1, . . . , µ1,p)^T, where each element is randomly sampled from N(0, 0.2), and let µ0 = −µ1.
(iii) Generate observations (Xm, Ym) for 1 ≤ m ≤ n by letting Ym = 1 and Xm ∼ t2(µ1, Σ) for m = 1, . . . , n/2, and Ym = 0 and Xm ∼ t2(µ0, Σ) for m = n/2 + 1, . . . , n, where t2 stands for the multivariate t-distribution with 2 degrees of freedom.

We incorrectly assume that the data follow Gaussian distributions and check how the methods perform. The misclassification rates with standard errors for the test sets are given in Table 2.

As expected, the performance of all methods is worse than in the Gaussian case, since the model is misspecified. Nevertheless, we note that the spike-and-slab and horseshoe approaches provide lower misclassification rates than the non-Bayesian method and the Wishart prior method.


Table 3: Misclassification rates for normally distributed data with standard errors (se) in parentheses in the QDA setting

Model       p     Method  Wishart        Spike-and-slab  Horseshoe      Bickel-Levina
AR(1)       50    LDA     0.10 (0.061)   0.05 (0.050)    0.04 (0.041)   0.07 (0.056)
                  QDA     0.06 (0.054)   0.04 (0.031)    0.01 (0.016)   0.07 (0.090)
            100   LDA     0.13 (0.080)   0.11 (0.062)    0.11 (0.068)   0.11 (0.059)
                  QDA     0.25 (0.108)   0.09 (0.061)    0.08 (0.083)   0.46 (0.085)
AR(2)       50    LDA     0.22 (0.110)   0.17 (0.086)    0.18 (0.091)   0.17 (0.127)
                  QDA     0.30 (0.095)   0.14 (0.027)    0.13 (0.072)   0.13 (0.067)
            100   LDA     0.31 (0.058)   0.23 (0.060)    0.24 (0.057)   0.18 (0.066)
                  QDA     0.18 (0.086)   0.13 (0.044)    0.14 (0.065)   0.12 (0.072)
Block       50    LDA     0.07 (0.055)   0.06 (0.037)    0.06 (0.043)   0.09 (0.079)
                  QDA     0.07 (0.044)   0.02 (0.032)    0.01 (0.023)   0.44 (0.066)
            100   LDA     0.12 (0.042)   0.10 (0.043)    0.11 (0.050)   0.21 (0.080)
                  QDA     0.15 (0.103)   0.03 (0.045)    0.06 (0.052)   0.47 (0.098)
Star        50    LDA     0.13 (0.101)   0.09 (0.099)    0.09 (0.099)   0.12 (0.102)
                  QDA     0.22 (0.112)   0.09 (0.079)    0.09 (0.079)   0.41 (0.082)
            100   LDA     0.15 (0.069)   0.12 (0.048)    0.13 (0.047)   0.32 (0.095)
                  QDA     0.20 (0.067)   0.10 (0.057)    0.13 (0.061)   0.43 (0.083)
Circle      50    LDA     0.18 (0.055)   0.14 (0.062)    0.13 (0.083)   0.21 (0.072)
                  QDA     0.23 (0.147)   0.08 (0.061)    0.05 (0.074)   0.38 (0.103)
            100   LDA     0.22 (0.063)   0.18 (0.057)    0.17 (0.076)   0.33 (0.097)
                  QDA     0.28 (0.099)   0.13 (0.060)    0.13 (0.072)   0.43 (0.109)
Scale-free  50    LDA     0.13 (0.089)   0.06 (0.049)    0.07 (0.058)   0.15 (0.098)
                  QDA     0.07 (0.055)   0.01 (0.023)    0.03 (0.029)   0.34 (0.116)
            100   LDA     0.08 (0.081)   0.05 (0.048)    0.07 (0.076)   0.17 (0.146)
                  QDA     0.05 (0.032)   0.02 (0.032)    0.02 (0.041)   0.36 (0.223)

Finally, we conduct a simulation study for the cases where the data are from two distributions with different covariance matrices. To be specific, let Xm ∼ N(µ1, Σ1) if Ym = 1, and Xm ∼ N(µ0, Σ0) if Ym = 0, m = 1, . . . , n. Similar to the previous settings, we specify six different models of Ω0, Ω1 or Σ0, Σ1.

(i) Model 1: AR(1) model, σ0,ij = 0.9^{|i−j|}, σ1,ij = 0.2^{|i−j|}.
(ii) Model 2: AR(2) model, ω0,ii = ω1,ii = 1, ω0,i,i−1 = ω0,i−1,i = 0.6, ω0,i,i−2 = ω0,i−2,i = 0.3, ω1,i,i−1 = ω1,i−1,i = 0.1, ω1,i,i−2 = ω1,i−2,i = 0.05.
(iii) Model 3: Block model, σ0,ii = σ1,ii = 1, σ0,ij = 0.9 and σ1,ij = 0.2 for 1 ≤ i ≠ j ≤ p/2, σ0,ij = 0.9 and σ1,ij = 0.2 for p/2 + 1 ≤ i ≠ j ≤ p, and σ0,ij = 0, σ1,ij = 0 otherwise.
(iv) Model 4: Star model, where every node is connected to the first node, and ω0,ii = 1, ω1,ii = 3, ω0,1i = ω0,i1 = 0.05, ω1,1i = ω1,i1 = 0.3, and ω0,ij = 0, ω1,ij = 0 otherwise.
(v) Model 5: Circle model, ω0,ii = 2, ω1,ii = 1, ω0,i,i−1 = ω0,i−1,i = 1, ω0,1p = ω0,p1 = 0.9, and ω1,i,i−1 = ω1,i−1,i = 0.3, ω1,1p = ω1,p1 = 0.1.
(vi) Model 6: Scale-free model, where Ω0 and Ω1 are both randomly generated with 10% sparsity.

The misclassification rates with standard errors for the test sets are given in Table 3.


Similar to the first two studies, the proposed Bayesian approaches (spike-and-slab and horseshoe) outperform the other two counterparts in most scenarios. It is also apparent that for these two methods, the QDA models always give lower misclassification rates, since the data in the two classes are from distributions with different covariance matrices.

6 Application

Breast cancer is now the second leading cause of cancer deaths among women. There exist several methods to diagnose breast cancer, among which biopsies maintain the highest accuracy in distinguishing malignant lumps from benign ones. However, they are invasive, time consuming, and costly.

A revolutionary computer imaging system termed “fine needle aspirate” (FNA) that helps diagnose breast cancer with high accuracy has been developed at the University of Wisconsin-Madison. As described in Izenman [20], “A small-gauge needle is used to extract a fluid sample (i.e., FNA) from a patient's breast lump or mass (detected by self-examination and/or mammography); the FNA is placed on a glass slide and stained to highlight the nuclei of the constituent cells; an image from the FNA is transferred to a workstation by a video camera mounted on a microscope; and the exact boundaries of the nuclei are determined.”

In the data set, features are computed from an image of an FNA of a breast mass. Specifically, ten real-valued features of the nucleus of each cell are computed from the fluid samples. The details of the features are given in Table 4. For each of these features, the following statistics are computed: the mean value, the extreme value (i.e., largest or worst value, biggest size, most irregular shape), and the standard deviation, resulting in a total of 30 real-valued variables.

Similar to the data pre-processing steps in Izenman [20], we replaced data values of zero by 0.001 and took natural logarithms of each variable. The data set consists of 569 cases (images), of which 212 were diagnosed as malignant (confirmed by biopsy) and 357 as benign (confirmed by biopsy or by subsequent periodic medical examinations). We apply our proposed methods spike-and-slab and horseshoe to the data set, with comparison to Bickel-Levina and a decision tree (Tree). The prediction errors are calculated using 5-fold cross-validation. The results are given in Table 5. It can be seen that both spike-and-slab and horseshoe give smaller prediction errors than Wishart and Bickel-Levina. The prediction error of the decision tree is comparatively small. However, the standard error indicates that it is less stable than our proposed approaches. If QDA is used, all methods give better prediction errors than the corresponding LDA.


Table 4: Ten features computed for each cell nucleus

Feature             Description
Radius              Mean of distances from center to points on the perimeter
Texture             Standard deviation of gray-scale values
Perimeter
Area
Smoothness          Local variation in radius lengths
Compactness         Calculated as perimeter² / area − 1.0
Concavity           Severity of concave portions of the contour
Concave points      Number of concave portions of the contour
Symmetry
Fractal dimension   Calculated as “coastline approximation” − 1

Table 5: Prediction errors using 5-fold cross-validation for the breast data analysis with estimated standard errors in parentheses

Model  Wishart       Spike-and-slab  Horseshoe     Bickel-Levina  Tree
LDA    0.15 (0.017)  0.11 (0.021)    0.12 (0.024)  0.14 (0.032)   0.10 (0.031)
QDA    0.13 (0.017)  0.08 (0.020)    0.07 (0.022)  0.13 (0.026)
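The following is a minimal sketch of the preprocessing and 5-fold cross-validation pipeline described above. It assumes the WDBC data are available through scikit-learn's load_breast_cancer (where, to my understanding, target 0 codes malignant), and it uses a simple plug-in LDA rule only as a stand-in for the Bayesian classifiers; the ridge term and all names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold

# Preprocessing as described above: replace zero entries by 0.001, take logs.
data = load_breast_cancer()
X = np.log(np.where(data.data == 0, 0.001, data.data))
y = 1 - data.target                      # recode so that 1 = malignant (assumption)

errors = []
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X1, X0 = X[train][y[train] == 1], X[train][y[train] == 0]
    mu1, mu0 = X1.mean(axis=0), X0.mean(axis=0)
    pi = y[train].mean()
    S = np.cov(np.vstack([X1 - mu1, X0 - mu0]).T)          # pooled covariance
    Omega = np.linalg.inv(S + 1e-6 * np.eye(X.shape[1]))   # ridge for numerical stability
    score = (X[test] @ Omega @ (mu1 - mu0)
             - 0.5 * (mu1 @ Omega @ mu1 - mu0 @ Omega @ mu0)
             - np.log((1 - pi) / pi))
    errors.append(np.mean((score > 0).astype(int) != y[test]))
print(np.mean(errors), np.std(errors))
```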

7 Proof

Proof (Proof of Proposition 1) The (i, j)th element of the precision matrix Ω = LDL^T is $\omega_{ij} = \sum_{k=1}^{\min(i,j)} d_k l_{ik}l_{jk}$, which, in view of the continuous prior on the dk, is non-zero with probability one if and only if lik ljk ≠ 0 for some k. Now
\[
\begin{aligned}
\Pi\Bigl(\sum_{k=1}^{\min(i,j)} l_{ik}l_{jk} \ne 0\Bigr)
&= \Pi(l_{ik}l_{jk} \ne 0 \text{ for some } k) \\
&= 1 - \Pi(l_{ik}l_{jk} = 0 \text{ for all } k) \\
&= 1 - \prod_{k=1}^{\min(i,j)} \Pi(l_{ik}l_{jk} = 0) \\
&= 1 - \prod_{k=1}^{\min(i,j)} \{1 - \Pi(l_{ik}l_{jk} \ne 0)\} \\
&= 1 - \prod_{k=1}^{\min(i,j)} \{1 - \Pi(l_{ik} \ne 0)\Pi(l_{jk} \ne 0)\} \\
&= 1 - (1 - \rho_i\rho_j)^{\min(i,j)}.
\end{aligned}
\]
Suppose that i and j are of the same order; then the above expression can be written as
\[
1 - (1 - \rho_i\rho_j)^{\min(i,j)} \approx 1 - (1 - \rho_i^2)^{i} \approx i\rho_i^2.
\]
In order to have the same sparsity level for each i, we need iρi² to be constant in i, and thus ρi ∼ Cp/√i.


Proof (Proof of Theorem 1) The proof of the theorem uses an argument similar to that used in the proof of the Neyman-Pearson lemma. We have
\[
\begin{aligned}
r(\phi^*) - r(\phi)
&= \mathrm{E}\bigl[Y(1-\phi^*(X)) + (1-Y)\phi^*(X)\bigr] - \mathrm{E}\bigl[Y(1-\phi(X)) + (1-Y)\phi(X)\bigr] \\
&= \mathrm{E}\bigl[Y\bigl(\phi(X)-\phi^*(X)\bigr) + (1-Y)\bigl(\phi^*(X)-\phi(X)\bigr)\bigr] \\
&= \mathrm{E}\bigl[\bigl(\phi(X)-\phi^*(X)\bigr)(2Y-1)\bigr] \\
&= \mathrm{E}\bigl[\mathrm{E}\bigl(\bigl(\phi(X)-\phi^*(X)\bigr)(2Y-1)\mid X\bigr)\bigr] \\
&= \mathrm{E}\bigl[\bigl(\phi(X)-\phi^*(X)\bigr)\bigl(2\mathrm{E}(Y\mid X)-1\bigr)\bigr].
\end{aligned}
\]
Clearly
\[
\mathrm{E}(Y\mid X) = \frac{\pi p_1(X)}{\pi p_1(X) + (1-\pi)p_0(X)} > \frac12 \tag{16}
\]
if πp1(X) > (1 − π)p0(X). Since in this case φ∗(X) = 1 ≥ φ(X), we have
\[
\bigl(\phi(X)-\phi^*(X)\bigr)\bigl(2\mathrm{E}(Y\mid X)-1\bigr) \le 0. \tag{17}
\]
On the other hand, when the expression in (16) is less than or equal to 1/2, φ∗(X) = 0 ≤ φ(X), again leading to (17). Therefore, r(φ∗) ≤ r(φ).

Alternatively, we can view φ∗ as the Bayes decision rule for the parameter space {p1, p0}, action space {1, 0} and prior distribution (π, 1 − π) based on the observation X, and r(·) as the Bayes risk, which is minimized by the Bayes rule φ∗.

To prove Theorem 2, we need the following lemma, which is an extension of Lemma A.1 of Banerjee and Ghosal [2] to accommodate non-zero means.

Lemma 1 Let q and q′ respectively denote the densities of Nm(µ, Ω⁻¹) and Nm(µ′, Ω′⁻¹), and let h(q, q′) = ‖√q − √q′‖2 stand for their Hellinger distance, where the dimension m is potentially growing but the eigenvalues of Ω remain bounded between two fixed positive numbers. Then there exist positive constants c0 (depending on Ω) and δ0 such that h²(q, q′) ≤ c0(‖µ − µ′‖² + ‖Ω − Ω′‖F²), and if h(q, q′) < δ0, then (‖µ − µ′‖² + ‖Ω − Ω′‖F²) ≤ c0 h²(q, q′). Moreover c0 can be taken to be a constant multiple of max{‖Ω‖S, ‖Ω⁻¹‖S}.

Proof Let λi, i = 1, . . . , m, be the eigenvalues of the matrix $A = \Omega^{-1/2}\Omega'\Omega^{-1/2}$. Then the squared Hellinger distance h²(q, q′) between q and q′ is given by
\[
\begin{aligned}
&1 - \frac{\det(\Omega)^{-1/4}\det(\Omega')^{-1/4}}{\det\bigl(\tfrac{\Omega^{-1}+\Omega'^{-1}}{2}\bigr)^{1/2}} \exp\Bigl\{-\tfrac18(\mu-\mu')^T\Bigl(\frac{\Omega^{-1}+\Omega'^{-1}}{2}\Bigr)^{-1}(\mu-\mu')\Bigr\} \\
&\quad= 1 - 2^{m/2}\prod_{i=1}^{m}\bigl(\lambda_i^{1/2}+\lambda_i^{-1/2}\bigr)^{-1/2}\exp\Bigl\{-\tfrac14(\mu-\mu')^T\bigl(\Omega^{-1}+\Omega'^{-1}\bigr)^{-1}(\mu-\mu')\Bigr\} \\
&\quad\le \Bigl[1 - 2^{m/2}\prod_{i=1}^{m}\bigl(\lambda_i^{1/2}+\lambda_i^{-1/2}\bigr)^{-1/2}\Bigr] + \tfrac14(\mu-\mu')^T\Omega(\mu-\mu').
\end{aligned}
\]


Clearly the second term is bounded by a multiple of ‖µ − µ′‖². The first term has been bounded by a multiple of ‖Ω − Ω′‖F² in Lemma A.1 (ii) of Banerjee and Ghosal [2].

For the converse, observe that $1 - 2^{m/2}\prod_{i=1}^{m}(\lambda_i^{1/2}+\lambda_i^{-1/2})^{-1/2}$ is bounded by h²(q, q′). Let δ = h(q, q′) < δ0 for a sufficiently small δ0, so that by the arguments used in the proof of Part (ii) of Lemma A.1 of Banerjee and Ghosal [2], it follows that
\[
\|\Omega-\Omega'\|_F^2 \le \tfrac12 c_0\Bigl[1 - 2^{m/2}\prod_{i=1}^{m}\bigl(\lambda_i^{1/2}+\lambda_i^{-1/2}\bigr)^{-1/2}\Bigr] \le c_0 h^2(q,q')
\]
for some constant c0 > 0. As the Frobenius norm dominates the spectral norm, this in particular implies that ‖Ω − Ω′‖S is also small, and hence the eigenvalues of Ω′ also lie between two fixed positive numbers. Now h(q, q′) < δ0 also implies that
\[
1 - \exp\Bigl\{-\tfrac18(\mu-\mu')^T\Bigl(\frac{\Omega^{-1}+\Omega'^{-1}}{2}\Bigr)^{-1}(\mu-\mu')\Bigr\} < \delta_0^2,
\]
which implies that
\[
(\mu-\mu')^T\Bigl(\frac{\Omega^{-1}+\Omega'^{-1}}{2}\Bigr)^{-1}(\mu-\mu') \lesssim h^2(q,q').
\]
Hence ‖µ − µ′‖² ≤ ½ c0 h²(q, q′). This gives the desired bound.

Proof (Proof of Theorem 2) To obtain the posterior contraction rate εn of (π, µ0, µ1, Ω0, Ω1) at (π∗, µ0∗, µ1∗, Ω0∗, Ω1∗), we apply the general theory of posterior contraction rates described in Ghosal and van der Vaart [16], by executing the steps stated below.

(i) Find a sieve Pn ⊂ P such that $\Pi(\mathcal{P}_n^c) \le e^{-Mn\varepsilon_n^2}$ for a sufficiently large constant M > 0, and the εn-metric entropy of Pn is bounded by a constant multiple of nεn².
(ii) Show that the prior probability of an εn²-neighborhood of the true density in the Kullback-Leibler sense is at least $e^{-cn\varepsilon_n^2}$ for some constant c > 0.

This gives the posterior contraction rate εn in terms of the Hellinger distance on the densities, which can be converted to the Euclidean distance on the mean and the Frobenius distance on the precision matrix in view of Lemma 1.

Let qµ0,Ω0 and qµ1,Ω1 denote the two possible densities of an observation conditional on the classification information. The learning of π is solely based on N, and the posterior for π is clearly concentrating at its true value π∗ at rate 1/√n. Thus, to simplify notation, in the remaining analysis we may treat π as given to be π∗ and establish posterior concentration for (µ0, µ1, Ω0, Ω1) based on N samples from Class 1 and n − N from Class 0. Then the average squared Hellinger distance is expressed as
\[
\frac{n-N}{n}\, h^2\bigl(q_{\mu_0,\Omega_0}, q_{\mu_0^*,\Omega_0^*}\bigr) + \frac{N}{n}\, h^2\bigl(q_{\mu_1,\Omega_1}, q_{\mu_1^*,\Omega_1^*}\bigr). \tag{18}
\]
As N/n → π∗ almost surely and 0 < π∗ < 1, it suffices to establish posterior concentration for (µ0, Ω0) and (µ1, Ω1) separately. Therefore we use the generic notation (µ, Ω) to denote either, and we establish the rate of convergence of µ and Ω here. The analysis of posterior concentration is similar to that of Theorem 3.1 of Banerjee and Ghosal [2], with the difference being that there is an extra mean parameter. Moreover, unlike them we use a prior on the precision matrix through its Cholesky decomposition, and the sparsity of the off-diagonal entries is only approximate since we do not use point-mass priors. However, posterior contraction for a point-mass prior can be recovered from our results since it corresponds to the limiting case v0 → 0 in the spike-and-slab prior.

Observe that for (µ, L, D), (µ′, L′, D′) with all entries of µ, µ′, L, L′, D, D′ bounded by B, and for Ω = LDL^T, Ω′ = L′D′L′^T, we can write ‖Ω − Ω′‖F² as
\[
\sum_{i,j=1}^{p}\Bigl[\sum_{k=1}^{p}\bigl(d_k l_{ik}l_{jk} - d_k' l_{ik}'l_{jk}'\bigr)\Bigr]^2
= \sum_{i,j=1}^{p}\Bigl[\sum_{k=1}^{p}\bigl\{(d_k - d_k')l_{ik}l_{jk} + d_k'(l_{ik}-l_{ik}')l_{jk} + d_k'l_{ik}'(l_{jk}-l_{jk}')\bigr\}\Bigr]^2,
\]
which can be bounded by
\[
\begin{aligned}
&3p\sum_{i,j=1}^{p}\sum_{k=1}^{p}\bigl\{(d_k-d_k')^2 l_{ik}^2 l_{jk}^2 + d_k'^2(l_{ik}-l_{ik}')^2 l_{jk}^2 + d_k'^2 l_{ik}'^2(l_{jk}-l_{jk}')^2\bigr\} \\
&\quad\le 3B^4p^2\Bigl\{p\sum_{k=1}^{p}(d_k-d_k')^2 + \sum_{i=1}^{p}\sum_{k=1}^{p}(l_{ik}-l_{ik}')^2 + \sum_{j=1}^{p}\sum_{k=1}^{p}(l_{jk}-l_{jk}')^2\Bigr\} \\
&\quad= 3B^4p^2\bigl[p\|D-D'\|_F^2 + 2\|L-L'\|_F^2\bigr].
\end{aligned}
\]
Hence if ‖D − D′‖∞ ≤ εn/(3B²p^{2+ν}) and ‖L − L′‖∞ ≤ εn/(3B²p^{2+ν}), where ‖·‖∞ stands for the maximum norm of a vector or matrix, then
\[
\|\Omega-\Omega'\|_F^2 \le 3B^4p^2\bigl[p^2\|D-D'\|_\infty^2 + 2p^2\|L-L'\|_\infty^2\bigr] \le \frac{9B^4p^4\varepsilon_n^2}{(3B^2p^{2+\nu})^2} = \frac{\varepsilon_n^2}{p^{2\nu}}.
\]

Further, ‖Ω‖S ≤ tr(Ω) ≤ p²B² ≤ p^ν without loss of generality, by increasing ν if necessary. Define the effective edges to be the set {(i, j) : |lij| > εn/p^ν, i > j}. Consider the sieve Pn consisting of L with the number of effective edges at most r, a sufficiently large multiple of nεn²/log n, and with each entry of µ, D and L bounded in absolute value by B ∈ [b1nεn², b1nεn² + 1]. Then the εn/p^ν-metric entropy of Pn with respect to the Euclidean norm for µ and the Frobenius norm for D and L is given by
\[
\log\Bigl\{\Bigl(\frac{Bp^{1/2+\nu}}{\varepsilon_n}\Bigr)^{p}\,\sum_{j=1}^{r}\binom{\binom{p}{2}}{j}\Bigl(\frac{3B^2p^{2+\nu}}{\varepsilon_n}\Bigr)^{j}\,\Bigl(\frac{3B^2p^{2+\nu}}{\varepsilon_n}\Bigr)^{p}\Bigr\},
\]
where the first factor $(B\sqrt{p}/\varepsilon_n)^p$ is from the mean parameter, the second is from the off-diagonal entries of L and the last is from the diagonal elements of D. Note that for a component with span at most εn/p^ν, only one point is needed for a covering. In view of Lemma 1, the Hellinger distance between the corresponding densities is bounded by εn as max{‖Ω‖, ‖Ω′‖, ‖Ω⁻¹‖, ‖Ω′⁻¹‖} ≤ p^ν. Thus the entropy is bounded by a multiple of
\[
\log\Bigl\{r\Bigl(\frac{3B^2p^{2+\nu}}{\varepsilon_n}\Bigr)^{r+2p}p^{2r}\Bigr\} \lesssim (r+p)\bigl(\log p + \log B + \log(1/\varepsilon_n)\bigr).
\]
For our choice of r and B, this shows that the metric entropy is bounded by a constant multiple of nεn².

Now we bound $\Pi(\mathcal{P}_n^c)$. Let R be the number of effective edges of L. Then $R \sim \mathrm{Bin}\bigl(\binom{p-1}{2}, \eta_0\bigr)$, where $\eta_0 = \Pi(|l| > \varepsilon_n/p) \le p^{-b'}$. Then, from the tail estimate of the binomial distribution, $\Pi(R > r) \le e^{-cr\log r}$ for some c > 0. Using the condition on the prior and the choice of B, we have
\[
\Pi(\mathcal{P}_n^c) \le \Pi(R > r) + 2p^2\exp(-b_1 n\varepsilon_n^2) \lesssim e^{-Cn\varepsilon_n^2}, \tag{19}
\]
where C can be chosen as large as we like, by simply choosing b1 large enough. This verifies the required conditions on the sieve.

Finally, to check the prior concentration rate, we need
\[
\Pi\bigl(B(p^*, \varepsilon_n)\bigr) := \Pi\{p : K(p^*, p) \le \varepsilon_n^2,\; V(p^*, p) \le \varepsilon_n^2\} \ge \exp(-n\varepsilon_n^2), \tag{20}
\]
where $K(p^*, p) = \int p^*\log(p^*/p)$ and $V(p^*, p) = \int p^*\{\log(p^*/p)\}^2$. Note that for X ∼ Np(µ, Σ) and a p × p symmetric matrix A, we have
\[
\mathrm{E}(X^TAX) = \mathrm{tr}(A\Sigma) + \mu^TA\mu, \qquad \mathrm{var}(X^TAX) = 2\,\mathrm{tr}(A\Sigma A\Sigma) + 4\mu^TA\Sigma A\mu.
\]
We use the above result to find the expressions for K(p∗, p) and V(p∗, p). Denoting the eigenvalues of the matrix $\Omega^{*-1/2}\Omega\Omega^{*-1/2}$ by λi, i = 1, . . . , p, we have K(p∗, p) given by
\[
\begin{aligned}
&\tfrac12\log\frac{\det\Omega^*}{\det\Omega} - \tfrac12\mathrm{E}_{\mu^*,\Omega^*}\bigl\{(X-\mu^*)^T\Omega^*(X-\mu^*) - (X-\mu)^T\Omega(X-\mu)\bigr\} \\
&\quad= \tfrac12\log\frac{\det\Omega^*}{\det\Omega} - \frac{p}{2} + \tfrac12\mathrm{E}_{\mu^*,\Omega^*}\bigl\{(X-\mu)^T\Omega(X-\mu)\bigr\} \\
&\quad= -\tfrac12\sum_{i=1}^{p}(\log\lambda_i + 1 - \lambda_i) + \tfrac12(\mu^*-\mu)^T\Omega(\mu^*-\mu) \\
&\quad\le -\tfrac12\sum_{i=1}^{p}\log\lambda_i - \tfrac12\sum_{i=1}^{p}(1-\lambda_i) + \tfrac12\|\mu^*-\mu\|^2\|\Omega\|_S.
\end{aligned}
\]
Now $\sum_{i=1}^{p}(1-\lambda_i)^2 = \|I - \Omega^{*-1/2}\Omega\Omega^{*-1/2}\|_F^2$, so if $\|I - \Omega^{*-1/2}\Omega\Omega^{*-1/2}\|_F$ is sufficiently small, then $\max_i|1-\lambda_i| < 1$, and hence $\sum_{i=1}^{p}(\lambda_i - 1 - \log\lambda_i) \lesssim \sum_{i=1}^{p}(1-\lambda_i)^2$, leading to the relation
\[
K(p^*, p) \lesssim \sum_{i=1}^{p}(1-\lambda_i)^2 + \|\mu^*-\mu\|^2\|\Omega\|_S. \tag{21}
\]


Also, V(p∗, p) is given by
\[
\begin{aligned}
&\tfrac14\mathrm{Var}_{\mu^*,\Omega^*}\bigl\{-(X-\mu^*)^T\Omega^*(X-\mu^*) + (X-\mu)^T\Omega(X-\mu)\bigr\} \\
&\quad= \tfrac14\mathrm{Var}_{\mu^*,\Omega^*}\bigl\{(X-\mu^*)^T(\Omega-\Omega^*)(X-\mu^*) + 2(\mu-\mu^*)^T\Omega(X-\mu^*)\bigr\} \\
&\quad\le \tfrac12\mathrm{Var}_{\mu^*,\Omega^*}\bigl\{(X-\mu^*)^T(\Omega-\Omega^*)(X-\mu^*)\bigr\} + 2\,\mathrm{Var}_{\mu^*,\Omega^*}\bigl\{(\mu-\mu^*)^T\Omega(X-\mu^*)\bigr\} \\
&\quad= \mathrm{tr}\bigl\{(\Omega-\Omega^*)\Omega^{*-1}(\Omega-\Omega^*)\Omega^{*-1}\bigr\} + (\mu-\mu^*)^T\Omega\Omega^{*-1}\Omega(\mu-\mu^*) \\
&\quad= \mathrm{tr}\bigl\{(I_p-\Omega^{*-1/2}\Omega\Omega^{*-1/2})^2\bigr\} + (\mu-\mu^*)^T\Omega\Omega^{*-1}\Omega(\mu-\mu^*) \\
&\quad\lesssim \sum_{i=1}^{p}(1-\lambda_i)^2 + \|\mu^*-\mu\|^2\|\Omega\|_S^2\|\Omega^{*-1}\|_S.
\end{aligned}
\]
By the assumption that Ω∗⁻¹ has bounded spectral norm, we have
\[
\sum_{i=1}^{p}(1-\lambda_i)^2 = \|I_p - \Omega^{*-1/2}\Omega\Omega^{*-1/2}\|_F^2 \le \|\Omega^{*-1}\|_S^2\,\|\Omega-\Omega^*\|_F^2,
\]
implying that for some sufficiently small constant c > 0,
\[
\Pi\{p : K(p^*, p) \le \varepsilon_n^2,\; V(p^*, p) \le \varepsilon_n^2\} \ge \Pi\{\|\Omega-\Omega^*\|_F^2 \le c\varepsilon_n^2,\; \|\mu-\mu^*\|^2 \le c\varepsilon_n^2\}.
\]
Furthermore, we have
\[
\begin{aligned}
\|\Omega-\Omega^*\|_F^2
&= \|LDL^T - L^*D^*L^{*T}\|_F^2 \\
&\le 3\|LDL^T - LDL^{*T}\|_F^2 + 3\|LDL^{*T} - LD^*L^{*T}\|_F^2 + 3\|LD^*L^{*T} - L^*D^*L^{*T}\|_F^2 \\
&\le 3\|L\|_S^2\|D\|_S^2\|L-L^*\|_F^2 + 3\|L\|_S^2\|L^*\|_S^2\|D-D^*\|_F^2 + 3\|L^*\|_S^2\|D^*\|_S^2\|L-L^*\|_F^2.
\end{aligned}
\]
Since the Frobenius norm dominates the spectral norm, and ‖D∗‖S and ‖L∗‖S are bounded by some constant, so are ‖D‖S and ‖L‖S if ‖D − D∗‖F and ‖L − L∗‖F are small. Thus for some sufficiently small constant c′ > 0,
\[
\begin{aligned}
\Pi\{p : K(p^*, p) \le \varepsilon_n^2,\; V(p^*, p) \le \varepsilon_n^2\}
&\ge \Pi\{\|L-L^*\|_F^2 \le c'\varepsilon_n^2,\; \|D-D^*\|_F^2 \le c'\varepsilon_n^2,\; \|\mu-\mu^*\|^2 \le c'\varepsilon_n^2\} \\
&\ge \Pi\{\|L-L^*\|_\infty \le c'\varepsilon_n/p,\; \|D-D^*\|_\infty \le c'\varepsilon_n/\sqrt{p},\; \|\mu-\mu^*\|_\infty \le c'\varepsilon_n/\sqrt{p}\}.
\end{aligned}
\]

In the actual prior, we constraint L and D such that Ω−1 has spectral normbounded by B′, but such a constraint can only increase the probability ofthe Kullback-Leibler neighborhood of the true density since Ω∗ satisfies therequired constraints. Therefore, we may pretend that the components of Land D are independently distributed. Then the above expression simplifies interms of products of marginal probabilities. Let ηi = Π(|lij | > εn/p) be theprobability that an element in the ith row of L is non-zero, and ζi = Π(|lij −

Page 22: Bayesian Discriminant Analysis Using Many Predictorssghosal/papers/QDA.pdf · applied discriminant analysis extensively, especially in his papers on anthro-pometry (Mahalanobis et

22 Xingqi Du, Subhashis Ghosal

l∗ij | < c2εn/p) be the probability that this element is in the neighborhood ofits true value when l∗ij 6= 0. Then

Π‖L− L∗‖∞ ≤ c′εn/p =

p∏i=1

ζsii (1− ηi)i−1−si ,

where si is the number of non-zero elements in the ith row of L∗. Note thats1 + · · · + sp = s. From the assumptions, ηi ≤ p−b

′/√i ≤ p−b

′−1/2 for some

b′ > 2 and ζi = Π(|lij − l∗ij | < c2εn/p∣∣|lij | > εn/p)Π(|lij | > εn/p) ≥ εnp−c

′for

some c′ > 0. Therefore, the lower bound for the above probability is given by

\[
\prod_{i=1}^p \zeta_i^{s_i}(1-\eta_i)^{i-1-s_i}
 \ge \big(\epsilon_n p^{-c'}\big)^s\big(1-p^{-b'-1/2}\big)^{p(p-1)/2-s}
 \gtrsim e^{-c's\log(p/\epsilon_n)}.
\]

Thus we have $\Pi(\|L-L^*\|_\infty \le c'\epsilon_n/p) \gtrsim (c'\epsilon_n/p)^s$. Similarly, we have $\Pi(\|D-D^*\|_\infty \le c'\epsilon_n/p) \gtrsim (c'\epsilon_n/p)^p$ and $\Pi(\|\mu-\mu^*\|_\infty \le c'\epsilon_n/p) \gtrsim (c'\epsilon_n/p)^p$. Hence the prior concentration rate condition holds, since
\[
(p+s)\big(\log p + \log(1/\epsilon_n)\big) \lesssim n\epsilon_n^2
\]
for the choice $\epsilon_n = n^{-1/2}(p+s)^{1/2}(\log n)^{1/2}$.
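As an illustrative check of this choice (ours, not part of the formal argument), note that $n\epsilon_n^2 = (p+s)\log n$, while $\log(1/\epsilon_n) \le \tfrac{1}{2}\log n$, so that
\[
(p+s)\big(\log p + \log(1/\epsilon_n)\big) \le (p+s)\big(\log p + \tfrac{1}{2}\log n\big) \lesssim (p+s)\log n = n\epsilon_n^2
\]
whenever $\log p \lesssim \log n$.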

For LDA, the proof uses a similar idea but is notationally slightly more complicated because the same parameter $\Omega$ is shared by the two groups and the two groups of observations need to be considered together. In this case, the $X$-observations are not i.i.d., so we work with the average squared Hellinger distance (18) and dominate it by distances on $\mu_0$, $\mu_1$, $\Omega$. The Kullback--Leibler divergences can also be bounded similarly. Then entropy and prior probability estimates of the same nature are established analogously.

Proof (Proof of Theorem 3) We first consider the case of LDA. The misclassification rate can be written as

\[
r = \pi^*\Phi\bigg(\frac{\frac{1}{2}\mu_1^T\Omega\mu_1 - \frac{1}{2}\mu_0^T\Omega\mu_0 - (\mu_1-\mu_0)^T\Omega\mu_1^* + \log\frac{1-\pi}{\pi}}{\sqrt{(\mu_1-\mu_0)^T\Omega\Sigma^*\Omega(\mu_1-\mu_0)}}\bigg)
 + (1-\pi^*)\Phi\bigg(\frac{\frac{1}{2}\mu_0^T\Omega\mu_0 - \frac{1}{2}\mu_1^T\Omega\mu_1 - (\mu_0-\mu_1)^T\Omega\mu_0^* + \log\frac{\pi}{1-\pi}}{\sqrt{(\mu_1-\mu_0)^T\Omega\Sigma^*\Omega(\mu_1-\mu_0)}}\bigg).
\]
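The display above can be evaluated directly; the following minimal Python sketch (ours, with arbitrary function and variable names) computes the misclassification rate of the plug-in linear discriminant rule with parameters $(\pi, \mu_1, \mu_0, \Omega)$ when the data are generated under $(\pi^*, \mu_1^*, \mu_0^*, \Sigma^*)$.

\begin{verbatim}
# Misclassification rate of the plug-in LDA rule, following the display above.
import numpy as np
from scipy.stats import norm

def lda_misclassification(pi, mu1, mu0, Omega, pi_star, mu1_star, mu0_star, Sigma_star):
    diff = mu1 - mu0
    denom = np.sqrt(diff @ Omega @ Sigma_star @ Omega @ diff)
    c1 = 0.5 * mu1 @ Omega @ mu1 - 0.5 * mu0 @ Omega @ mu0 - diff @ Omega @ mu1_star
    c0 = 0.5 * mu0 @ Omega @ mu0 - 0.5 * mu1 @ Omega @ mu1 + diff @ Omega @ mu0_star
    return (pi_star * norm.cdf((c1 + np.log((1 - pi) / pi)) / denom)
            + (1 - pi_star) * norm.cdf((c0 + np.log(pi / (1 - pi))) / denom))

# Sanity check: at the true parameters this is the oracle risk Phi(-Delta/2).
p = 3
mu0, mu1 = np.zeros(p), np.ones(p)
print(lda_misclassification(0.5, mu1, mu0, np.eye(p), 0.5, mu1, mu0, np.eye(p)))
# approximately 0.193 = Phi(-sqrt(3)/2)
\end{verbatim}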

Let
\begin{align*}
C_1 &= \frac{1}{2}\mu_1^T\Omega\mu_1 - \frac{1}{2}\mu_0^T\Omega\mu_0 - (\mu_1-\mu_0)^T\Omega\mu_1^*,\\
C_0 &= \frac{1}{2}\mu_0^T\Omega\mu_0 - \frac{1}{2}\mu_1^T\Omega\mu_1 - (\mu_0-\mu_1)^T\Omega\mu_0^*,\\
C_2 &= \sqrt{(\mu_1-\mu_0)^T\Omega\Sigma^*\Omega(\mu_1-\mu_0)},
\end{align*}


then

\begin{align*}
&|r(\pi^*, \mu_1^*, \mu_0^*, \Omega^*) - r(\pi, \mu_1, \mu_0, \Omega)|\\
&\quad= \Big|\pi^*\Phi\Big(\frac{C_1 + \log\frac{1-\pi}{\pi}}{C_2}\Big) + (1-\pi^*)\Phi\Big(\frac{C_0 + \log\frac{\pi}{1-\pi}}{C_2}\Big)
 - \pi^*\Phi\Big(\frac{C_1^* + \log\frac{1-\pi^*}{\pi^*}}{C_2^*}\Big) - (1-\pi^*)\Phi\Big(\frac{C_0^* + \log\frac{\pi^*}{1-\pi^*}}{C_2^*}\Big)\Big|\\
&\quad\le \Big|\Phi\Big(\frac{C_1 + \log\frac{1-\pi}{\pi}}{C_2}\Big) - \Phi\Big(\frac{C_1^* + \log\frac{1-\pi^*}{\pi^*}}{C_2^*}\Big)\Big|
 + \Big|\Phi\Big(\frac{C_0 + \log\frac{\pi}{1-\pi}}{C_2}\Big) - \Phi\Big(\frac{C_0^* + \log\frac{\pi^*}{1-\pi^*}}{C_2^*}\Big)\Big|.
\end{align*}
We have

\begin{align*}
|C_2^2 - C_2^{*2}| &= |(\mu_1-\mu_0)^T\Omega\Sigma^*\Omega(\mu_1-\mu_0) - (\mu_1^*-\mu_0^*)^T\Omega^*\Sigma^*\Omega^*(\mu_1^*-\mu_0^*)|\\
&\le |(\mu_1-\mu_0)^T\Omega\Sigma^*\Omega(\mu_1-\mu_0) - (\mu_1^*-\mu_0^*)^T\Omega\Sigma^*\Omega(\mu_1-\mu_0)|\\
&\quad+ |(\mu_1^*-\mu_0^*)^T\Omega\Sigma^*\Omega(\mu_1-\mu_0) - (\mu_1^*-\mu_0^*)^T\Omega\Sigma^*\Omega(\mu_1^*-\mu_0^*)|\\
&\quad+ |(\mu_1^*-\mu_0^*)^T(\Omega\Sigma^*\Omega - \Omega\Sigma^*\Omega^*)(\mu_1^*-\mu_0^*)|\\
&\quad+ |(\mu_1^*-\mu_0^*)^T(\Omega\Sigma^*\Omega^* - \Omega^*\Sigma^*\Omega^*)(\mu_1^*-\mu_0^*)|.
\end{align*}

We obtain that
\[
|(\mu_1-\mu_0)^T\Omega\Sigma^*\Omega(\mu_1-\mu_0) - (\mu_1^*-\mu_0^*)^T\Omega\Sigma^*\Omega(\mu_1-\mu_0)|
 \le \|(\mu_1-\mu_0) - (\mu_1^*-\mu_0^*)\|_2\|\Omega\|_S^2\|\Sigma^*\|_S\|\mu_1-\mu_0\|_2
 = O\bigg(\sqrt{\frac{p(p+s)\log p}{n}}\bigg),
\]
and
\[
|(\mu_1^*-\mu_0^*)^T\Omega\Sigma^*\Omega(\mu_1-\mu_0) - (\mu_1^*-\mu_0^*)^T\Omega\Sigma^*\Omega(\mu_1^*-\mu_0^*)|
 \le \|(\mu_1-\mu_0) - (\mu_1^*-\mu_0^*)\|_2\|\Omega\|_S^2\|\Sigma^*\|_S\|\mu_1^*-\mu_0^*\|_2
 = O\bigg(\sqrt{\frac{p(p+s)\log p}{n}}\bigg).
\]

Let $R$ be the number of effective edges in $L$ from $P_n$. From the proof of Theorem 2 it follows that the number of effective edges in $\Omega$ is bounded by $R$, and by the choice of $P_n$, with posterior probability tending to one in probability, $R \lesssim n\epsilon_n^2/\log n = O(p+s)$. Let $A = \Omega - \Omega^*$; then the number of non-zero elements in $A$ is also $O(p+s)$ with posterior probability tending to one in probability. Thus,

\[
|(\mu_1^*-\mu_0^*)^T(\Omega\Sigma^*\Omega - \Omega\Sigma^*\Omega^*)(\mu_1^*-\mu_0^*)|
 \le \|\Omega\|_S\|\Sigma^*\|_S\sum_{i,j=1}^p |a_{ij}||\mu_{1,i}^*-\mu_{0,i}^*||\mu_{1,j}^*-\mu_{0,j}^*|,
\]


which is $O(n^{-1/2}(p+s)\sqrt{\log p})$ because
\[
\sum_{i,j=1}^p |a_{ij}| \le \sqrt{p+s}\,\|\Omega-\Omega^*\|_F = O\bigg(\sqrt{\frac{(p+s)^2\log p}{n}}\bigg).
\]
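The first inequality in the last display is the Cauchy--Schwarz inequality: for a matrix $A$ with $k$ non-zero entries, $\sum_{i,j}|a_{ij}| \le \sqrt{k}\,\|A\|_F$. A tiny numerical illustration (ours) follows.

\begin{verbatim}
# Cauchy-Schwarz: sum of |entries| of a k-sparse matrix is at most sqrt(k)*||A||_F.
import numpy as np

rng = np.random.default_rng(5)
A = np.zeros((6, 6))
idx = rng.choice(A.size, size=10, replace=False)   # k = 10 non-zero positions
A.flat[idx] = rng.normal(size=10)
k = np.count_nonzero(A)
print(np.abs(A).sum(), np.sqrt(k) * np.linalg.norm(A, 'fro'))
\end{verbatim}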

Similarly, it follows that
\[
|(\mu_1^*-\mu_0^*)^T(\Omega\Sigma^*\Omega^* - \Omega^*\Sigma^*\Omega^*)(\mu_1^*-\mu_0^*)| = O\bigg(\sqrt{\frac{(p+s)^2\log p}{n}}\bigg).
\]

Thus, $|C_2^2 - C_2^{*2}| = O(n^{-1/2}(p+s)\sqrt{\log p}) = o(1)$. Hence by Assumption (A4), $C_2^2 \ge M_0 > 0$. Therefore,

\begin{align*}
&|r(\pi^*, \mu_1^*, \mu_0^*, \Omega^*) - r(\pi, \mu_1, \mu_0, \Omega)|\\
&\quad\le \Big|\Phi\Big(\frac{C_1 + \log\frac{1-\pi}{\pi}}{C_2}\Big) - \Phi\Big(\frac{C_1^* + \log\frac{1-\pi^*}{\pi^*}}{C_2^*}\Big)\Big|
 + \Big|\Phi\Big(\frac{C_0 + \log\frac{\pi}{1-\pi}}{C_2}\Big) - \Phi\Big(\frac{C_0^* + \log\frac{\pi^*}{1-\pi^*}}{C_2^*}\Big)\Big|\\
&\quad\le \frac{1}{\sqrt{M}}\Big(|C_1 - C_1^*| + |C_0 - C_0^*|
 + \Big|\log\frac{1-\pi}{\pi} - \log\frac{1-\pi^*}{\pi^*}\Big| + \Big|\log\frac{\pi}{1-\pi} - \log\frac{\pi^*}{1-\pi^*}\Big|\Big).
\end{align*}

We know that
\[
|\mu_1^T\Omega\mu_1 - \mu_1^{*T}\Omega^*\mu_1^*| \le |\mu_1^T\Omega\mu_1 - \mu_1^T\Omega\mu_1^*| + |\mu_1^T\Omega\mu_1^* - \mu_1^{*T}\Omega\mu_1^*| + |\mu_1^{*T}\Omega\mu_1^* - \mu_1^{*T}\Omega^*\mu_1^*|.
\]

Similar to the proof above,
\[
|\mu_1^{*T}\Omega\mu_1^* - \mu_1^{*T}\Omega^*\mu_1^*| \le \sum_{i,j=1}^p |a_{ij}||\mu_{1,i}^*||\mu_{1,j}^*|
 \lesssim \sum_{i,j=1}^p |a_{ij}| \le \sqrt{p+s}\,\|A\|_F,
\]
which is $O(n^{-1/2}(p+s)\sqrt{\log p})$. Also,
\begin{align*}
|\mu_1^T\Omega\mu_1^* - \mu_1^{*T}\Omega\mu_1^*| &\le \|\mu_1-\mu_1^*\|_2\|\Omega\|_S\|\mu_1^*\|_2 = O\bigg(\sqrt{\frac{p(p+s)\log p}{n}}\bigg),\\
|\mu_1^T\Omega\mu_1 - \mu_1^T\Omega\mu_1^*| &\le \|\mu_1-\mu_1^*\|_2\|\Omega\|_S\|\mu_1\|_2 = O\bigg(\sqrt{\frac{p(p+s)\log p}{n}}\bigg).
\end{align*}

Therefore, we have that
\[
|\mu_k^T\Omega\mu_k - \mu_k^{*T}\Omega^*\mu_k^*| = O\bigg(\sqrt{\frac{(p+s)^2\log p}{n}}\bigg), \qquad k = 0, 1.
\]


In addition,
\begin{align*}
&|(\mu_1-\mu_0)^T\Omega\mu_1^* - (\mu_1^*-\mu_0^*)^T\Omega^*\mu_1^*|\\
&\quad\le |(\mu_1-\mu_0)^T\Omega\mu_1^* - (\mu_1^*-\mu_0^*)^T\Omega\mu_1^*| + |(\mu_1^*-\mu_0^*)^T\Omega\mu_1^* - (\mu_1^*-\mu_0^*)^T\Omega^*\mu_1^*|\\
&\quad\le \|(\mu_1-\mu_0) - (\mu_1^*-\mu_0^*)\|_2\|\Omega\|_S\|\mu_1^*\|_2 + \|\mu_1^*-\mu_0^*\|_2\|\mu_1^*\|_2\|\Omega-\Omega^*\|_F\\
&\quad= O\bigg(\sqrt{\frac{p(p+s)\log p}{n}}\bigg) + O\bigg(\sqrt{\frac{(p+s)^2\log p}{n}}\bigg)
 = O\bigg(\sqrt{\frac{(p+s)^2\log p}{n}}\bigg).
\end{align*}
Therefore, $\max\{|C_1-C_1^*|, |C_0-C_0^*|\} = O(n^{-1/2}(p+s)\sqrt{\log p}) \to 0$ under the condition that $(p+s)^2\log p = o(n)$.

For QDA, the misclassification rate does not have an explicit expression, but (5) leads to upper bounds of a similar nature.
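Although no closed form is available, the misclassification probability of a plug-in quadratic discriminant rule can always be approximated by Monte Carlo; the sketch below (ours, with hypothetical names, not the paper's procedure) simulates labels and predictors from the true model and applies the quadratic discriminant scores.

\begin{verbatim}
# Monte Carlo approximation of the misclassification rate of a plug-in QDA rule.
import numpy as np

def qda_misclassification_mc(pi, mu1, mu0, Omega1, Omega0,
                             pi_star, mu1_star, mu0_star, Sigma1_star, Sigma0_star,
                             n_mc=200000, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.binomial(1, pi_star, size=n_mc)
    X = np.where(y[:, None] == 1,
                 rng.multivariate_normal(mu1_star, Sigma1_star, size=n_mc),
                 rng.multivariate_normal(mu0_star, Sigma0_star, size=n_mc))

    def score(mu, Omega, prior):
        # log(prior) + (1/2) log det(Omega) - (1/2) (x - mu)' Omega (x - mu)
        d = X - mu
        return (np.log(prior) + 0.5 * np.linalg.slogdet(Omega)[1]
                - 0.5 * np.einsum('ni,ij,nj->n', d, Omega, d))

    yhat = (score(mu1, Omega1, pi) > score(mu0, Omega0, 1 - pi)).astype(int)
    return np.mean(yhat != y)
\end{verbatim}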
