
  • Bayesian inference on high-dimensionalSeemingly Unrelated Regressions

    Alex Lewin and Marco Banterle

    Brunel University London

    April 2017

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 1 / 22

  • Motivation

    Motivation

    Genetic studies aiming at identifying association between point mutations (SNPs) and multivariate phenotypes:

    gene expression measurements

    metabolomics data

    protein concentrations

    ...

    Looking for sparse variable selection

    Take into account data correlations

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 2 / 22

  • Motivation

    Multivariate Data

    Predictor matrix X: n observations, p variables.

    Response matrix Y: n observations, q variables.

    Aim: identify which of the p variables in X are significantly associated with the outcomes in Y.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 3 / 22

  • Motivation

    Case study: mQTL discovery in the North Finland Birth Cohort study (NFBC)

    The NFBC66 is a cohort of 12000 adults followed since 1966

    Collection of data at age 31 years, including clinical data and blood samples

    DNA extracted, leading to a sample of 5746 adults genotyped across the genome (∼ 300,000 SNPs)

    Metabolite lipid profile quantified from serum samples by NMR, giving measures of 137 metabolites

    Question of interest is the discovery of genetic markers associated with metabolite regulation of lipids

    These responses are highly structured, with strong correlations

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 4 / 22

  • Motivation

    Correlations in the mQTL data set

    [Figures: Y metabolite correlations and X SNP correlations (only a subset plotted).]

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 5 / 22

  • Model

    Multivariate Regression Model with Variable Selection

    Frame the problem as a multivariate linear regression model:

    Y (n×q) = X (n×p) B (p×q) + E (n×q)

    or equivalently: Y ∼ MN(XB, I_n, R)

    Sparse associations: set most elements of B to zero

    Correlated outcomes: allow a non-diagonal residual covariance matrix R

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 6 / 22
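    As a toy illustration (my addition, not from the slides), a minimal R sketch that simulates data from this model; the sparsity level and the AR(1)-type residual correlation R are assumptions made purely for the example:

        ## simulate Y = XB + E with rows of E iid N(0, R), i.e. Y ~ MN(XB, I_n, R)
        set.seed(1)
        n <- 100; p <- 30; q <- 30
        X <- matrix(rnorm(n * p), n, p)
        B <- matrix(0, p, q)
        B[sample(p * q, 20)] <- runif(20, 1, 2)        # sparse coefficient matrix (assumed sparsity)
        R <- 0.6 ^ abs(outer(1:q, 1:q, "-"))           # assumed residual correlation between outcomes
        E <- matrix(rnorm(n * q), n, q) %*% chol(R)    # row-independent, column-correlated errors
        Y <- X %*% B + E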

  • Model

    Variable selection is performed through a binary matrix Γ (p × q):

    γ_jk = 1 if outcome k is associated with predictor j, and 0 otherwise

    Sparsity prior: γ_jk ∼ Bern(ω_jk), ω_jk ∼ Beta(·)

    [Figure: the Γ matrix, predictor variables × outcomes.]

    Predictor X_j only appears in the regression for outcome k if γ_jk = 1.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 7 / 22

  • Model

    Bayesian Setting

    Very high dimensional data: p ≈ 10^4 to 10^6 variables in X, q ranging from 1 to 10^4 variables in Y, around n = 5000 observations.

    Focus on sparse Bayesian Variable Selection (sparse BVS): minimise arbitrary tuning of the model.

    Provides the posterior probability of association for each predictor and each response.

    Bayesian model averaging: explore the space of 2^p models.

    Marginal posterior inclusion probabilities are model averages.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 8 / 22

  • Model

    SUR model

    Seemingly Unrelated Regressions (SUR) model:

    y_k (n×1) = X_{γk} (n×d_k) β_{γk} (d_k×1) + ε_k (n×1),  for k = 1, ..., q

    Cov[ε_k, ε_l] = R_kl ≠ 0  ⟹  the outcomes do not naturally separate.

    Different variables selected for each outcome: γ_k, k = 1, ..., q.

    So, vectorise the model:

    vec(y_1, y_2, ..., y_q) ∼ N( vec(X_{γ1}β_{γ1}, X_{γ2}β_{γ2}, ..., X_{γq}β_{γq}), R ⊗ I_n )

    The covariance matrix is not block diagonal.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 9 / 22

  • Model

    SUR model

    Priors:

    vec(β_1, β_2, ..., β_q) | γ_1, γ_2, ..., γ_q ∼ N( 0, (I_q ⊗ W)_{vec(γ_1, γ_2, ..., γ_q)} )

    R ∼ IW(ν, M)

    γ_kj ∼ Bern(ω_k), ω_k ∼ Beta(·)

    W is a p × p matrix (constant, or g-prior g(X^T X)^{-1}); different rows/columns are selected for different outcomes.

    Not conjugate: cannot integrate out β and R.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 10 / 22

  • Model

    SUR model

    Can calculate posterior full conditionals for β_k and R → Gibbs sampler for γ_k, β_k and R.

    However, this is computationally prohibitive due to the calculation of (X^T (R^{-1} ⊗ I_n) X)^{-1}, where X is the block-diagonal matrix

    X = diag(X_{γ1}, X_{γ2}, ..., X_{γq}),

    an nq × (d_1 + d_2 + ... + d_q) matrix, which changes at every MCMC iteration.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 11 / 22
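    For concreteness, a small R sketch (illustrative sizes, not the talk's code) of the block-diagonal design and of the matrix whose inverse makes this Gibbs step expensive:

        library(Matrix)
        n <- 100; q <- 5
        Xg   <- lapply(1:q, function(k) matrix(rnorm(n * (k + 1)), n, k + 1))  # X_{gamma_k}, d_k = k + 1 here
        Xbig <- bdiag(Xg)                                   # nq x (d_1 + ... + d_q); changes every iteration
        R    <- 0.5 ^ abs(outer(1:q, 1:q, "-"))             # some residual correlation (assumed)
        A    <- t(Xbig) %*% (solve(R) %x% Diagonal(n)) %*% Xbig   # the matrix whose inverse is needed
        dim(A)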

  • Model

    New work: computation for the SUR model

    Idea from Zellner and Ando (2010): decompose the likelihood:

    y_1 = X_{γ1}β_{γ1} + ε_1

    y_2 = X_{γ2}β_{γ2} + ρ_21 (y_1 − X_{γ1}β_{γ1}) + ε_2

    ...

    y_k = X_{γk}β_{γk} + Σ_{l<k} ρ_kl (y_l − X_{γl}β_{γl}) + ε_k

  • Model

    New work: priors in the transformed space

    We aim to decompose into a product over responses, as with the likelihood.

    Betas are straightforward: ∏_k N(β_k | γ_k, W)

    Covariance matrix: R ∼ IW(ν, M) becomes

    ∏_k N({ρ_k1, ..., ρ_k,k−1} | σ²_k, M) × IG(σ²_k | {ρ_k1, ..., ρ_k,k−1}, ν, M)

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 13 / 22

  • Model

    New: posterior conditionals in transformed space

    Covariance matrix: in the transformed space,

    ∏_k N({ρ_k1, ...} | σ²_k, M, Y, B, Γ) × IG(σ²_k | {ρ_k1, ...}, ν, M, Y, B, Γ)

    So the MCMC updates for the R parameters factorise over responses.

    Betas: MCMC for B is not so straightforward: Zellner and Ando used a simplified factorisation + Gibbs resampling.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 14 / 22

  • Model

    New: posterior conditionals in transformed space

    We found the correct full conditionals for B (they do decompose):

    β_{γk} | ... ∼ N( W_k X_{γk}^T ỹ_k, W_k ),   k = 1, ..., q

    where

    W_k = ( X_{γk}^T X_{γk} (1/σ²_k + Σ_{l>k} ρ²_lk / σ²_l) + W_{γk}^{-1} )^{-1},

    ỹ_k = a function of Y, B, Γ across responses.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 15 / 22

  • Model

    New work: computation for the SUR model

    Main point: we have got rid of the (X^T (R^{-1} ⊗ I_n) X)^{-1} big matrix calculation.

    Using the ESS (evolutionary stochastic search) algorithm (Bottolo et al.) to explore the space of Γ variable selection parameters.

    [Figures: computation time versus q, versus p, and versus n.]

    These timings are in R; in C++ the relative speed-up will be greater.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 16 / 22

  • Model

    Simulated data

    n = 100, q = 30, p = 30, with correlated residuals.

    [Figure: true outcome correlation matrix.]

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 17 / 22

  • Model

    Simulated data

    [Figures: true Γ matrix (predictors × outcomes); HESS Γ and SUR Γ full heat maps; histograms of posterior inclusion probabilities (PIP) for HESS Γ and SUR Γ.]

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 18 / 22

  • Case Study: mQTL discovery

    mQTL analysis of NFBC data

    After quality control: n = 4023 people, q = 103 metabolites, p = 9172 SNPs on chromosome 16.

    [Figures: data correlation and residual correlation matrices.]

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 19 / 22

  • Case Study: mQTL discovery

    Evidence of enhanced linkage for Chromosome 16

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 20 / 22

  • Summary, Acknowledgements

    Summary

    Bayesian SUR model with a sparsity prior to perform variable selection for multiple responses.

    Estimating the residual covariance matrix increases the accuracy of the variable selection.

    We have extended the Zellner and Ando method to obtain the correct posteriors directly.

    Computational speed-up → the model can be used on large genomic data sets.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 21 / 22

  • Summary, Acknowledgements

    Thank you! Sylvia Richardson, Leonardo Bottolo, Marjo-Riitta Jarvelin, Habib Saadi, Marc Chadeau-Hyam

    Lewin A et al. (2015) MT-HESS: an efficient Bayesian approach for simultaneous association detection in OMICS datasets, with application to eQTL mapping in multiple tissues. Bioinformatics, 10.1093.

    Bottolo L, Chadeau-Hyam M, et al. (2013) GUESS-ing polygenic associations with multiple phenotypes using a GPU-based Evolutionary Stochastic Search algorithm. PLoS Genetics, 10.1371.

    Bhadra A and Mallick BK (2013) Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics, 10.1111.

    Bottolo L, Petretto E, et al. (2011) Bayesian detection of expression quantitative trait loci hot spots. Genetics 189:1449–1459.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 22 / 22

  • Model-based clustering for multivariatecategorical data

    Michael Fop, Keith Smart and Brendan Murphy

    6th IBS Channel Network Conference

    1 / 25

  • Back Pain Dataset

    ▶ A study to investigate the use of a mechanisms-based classification of musculoskeletal pain in clinical practice.

    ▶ The aim of the study was to assess the discriminative power of the taxonomy of pain into Nociceptive, Peripheral Neuropathic and Central Sensitization for low-back disorders.

    ▶ There are N = 464 patients who were assessed according to a list of 36 binary clinical indicators (“Present”/“Absent”).

    ▶ Some of the indicators carry the same information about the pain categories, so the interest here is to select a subset of the most relevant clinical criteria while partitioning the patients.

    ▶ Does the partition of the patients agree with the clinical taxonomy?

    2 / 25

  • Clustering and Variable Selection

    ▶ The motivating example shows the need for:

    ▶ Clustering: Can we establish the existence of subgroups? How can we characterize these subgroups?

    ▶ Variable Selection: Can we use a subset of the variables to distinguish the subgroups?

    3 / 25

  • Model-Based Clustering/Mixture Models

    ▶ Denote the N ×M data matrix by X

    ▶ The nth observation is denoted by Xn.

    ▶ Model-based clustering assumes that Xn arises from a finite mixture model.

    ▶ Assuming G classes (components),

    p(Xn | τ, θ, G) = Σ_{g=1}^{G} τ_g p(Xn | θ_g).

    ▶ τ_g are the mixture weights.

    ▶ p(Xn | θ_g) is the component distribution.

    4 / 25

  • Latent Class Analysis (LCA) model

    ▶ Latent Class Analysis (LCA) is a model for clusteringcategorical data.

    ▶ Let Xn = (Xn1,Xn2, . . . ,XnM) where Xnm takes a value from{1, 2, . . . ,Cm}.

    ▶ In LCA we assume local independence between variables, so that if we knew Xn was in class g we could write its density as

    p(Xn | θ_g) = Π_{m=1}^{M} Π_{c=1}^{Cm} θ_{gmc}^{I(Xnm = c)},

    where {θ_{gm1}, ..., θ_{gmCm}} give the probabilities of observing the categories {1, ..., Cm} in variable m.

    ▶ θ_g characterizes and embodies the differences between groups.

    5 / 25

  • LCA model (general)

    ▶ Model likelihood of the form,

    p(Xn | θ, τ, G) = Σ_{g=1}^{G} τ_g Π_{m=1}^{M} Π_{c=1}^{Cm} θ_{gmc}^{I(Xnm = c)}.

    ▶ More convenient to work with completed data

    ▶ Augment the data with class labels Zn = (Zn1, Zn2, ..., ZnG), where Zng = 1 if observation n belongs to group g, and 0 otherwise.

    ▶ Then we can write down the completed-data likelihood for an observation:

    p(Xn, Zn | θ, τ, G) = Π_{g=1}^{G} { τ_g Π_{m=1}^{M} Π_{c=1}^{Cm} θ_{gmc}^{I(Xnm = c)} }^{Zng}.

    6 / 25
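    As an illustration of the mixture density just written down, a minimal R sketch (my addition, not from the talk) that evaluates p(Xn | θ, τ, G) for one observation, with θ stored as one M × Cm probability matrix per class; all numerical values below are assumed for the example:

        lca_density <- function(x, theta, tau) {
          G <- length(tau)
          comp <- sapply(seq_len(G), function(g) {
            probs <- theta[[g]]                       # M x C matrix: probs[m, c] = theta_{g m c}
            prod(probs[cbind(seq_along(x), x)])       # local independence: product over the M variables
          })
          sum(tau * comp)                             # mix over the G classes with weights tau_g
        }

        # e.g. two classes, three binary variables
        theta <- list(matrix(c(0.9, 0.1, 0.8, 0.2, 0.7, 0.3), 3, 2, byrow = TRUE),
                      matrix(c(0.2, 0.8, 0.3, 0.7, 0.1, 0.9), 3, 2, byrow = TRUE))
        lca_density(x = c(1, 2, 1), theta = theta, tau = c(0.6, 0.4))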

  • LCA model (general)

    ▶ Estimation by EM algorithm or VB (see the BayesLCA package).

    ▶ Note that G must be chosen in advance; the best G for the data can be chosen using information criteria (e.g. BIC).

    ▶ Bayesian approaches:
    ▶ Pandolfi, Bartolucci and Friel (2014) use reversible jump to get the posterior probability for G.
    ▶ White, Wyse and Murphy (2016) use a collapsed Gibbs sampler and incorporate variable selection.

    7 / 25

  • Back Pain Data: LCA Results

    [Figure: estimated class-conditional probabilities (0 to 1) for criteria Crit.1–Crit.38, for Classes 1–5.]

    8 / 25

  • Back Pain Data: LCA Clustering

                               1     2     3     4     5
    Central Sensitization    48     1     5     0    41
    Nociceptive                0    10    96   126     3
    Peripheral Neuropathic     0    89     1     1     4

    9 / 25

  • Variable Selection: Dean & Raftery’s Greedy Search

    ▶ Dean & Raftery (2010) proposed a greedy stepwise variableselection algorithm for LCA.

    ▶ The observation vector Xn is partitioned as

    Xn = (Xn^C, Xn^P, Xn^O)

    where
    ▶ Xn^C are the current clustering variables,
    ▶ Xn^P is proposed to be added to the clustering variables,
    ▶ Xn^O are the other variables.

    10 / 25

  • Dean & Raftery’s Greedy Search

    ▶ Two competing models are compared:

    [Diagrams: two competing graphical models, M1 and M2*, each linking z, X^C, X^P and X^O.]

    ▶ M1 assumes that the proposed variable has clustering structure.

    ▶ M2* assumes that the proposed variable has no clustering structure.

    ▶ This framework reduces the independence assumption of the previously described approach.

    11 / 25

  • Local Independence (A Problem?)

    ▶ When analyzing the back pain data, we achieved very little data reduction.

    ▶ In fact, only one variable was labeled as non-clustering.

    ▶ An explanation for this is the local independence assumption in the model.

    ▶ Suppose we have two variables that are highly dependent and both exhibit clustering.

    ▶ The variable selection method will include both variables in the model, even if one variable contains no extra clustering information.

    12 / 25

  • Novel Extension: Relaxing Independence Further

    ▶ It is unrealistic to assume that Xn^C and Xn^P are conditionally independent in M2*.

    ▶ We propose replacing M2* with a different model, M2, in which X^P depends on a subset X^R ⊆ X^C of the clustering variables.

    [Diagrams: graphical models M1 and M2, each linking z, X^C, X^P and X^O; in M2, X^P is connected to X^R ⊆ X^C.]

    ▶ M1 assumes that the proposed variable has clustering structure.

    ▶ M2 assumes that the proposed variable has no clustering structure beyond that explained by the clustering variables.

    13 / 25

  • Stepwise Search Algorithm

    ▶ We propose a stepwise search algorithm to find an optimal set of variables for clustering.

    ▶ The algorithm involves the following steps:
    ▶ Add: add a variable to the current clustering variables.
    ▶ Remove: remove a variable from the current clustering variables.
    ▶ Swap: swap a proposed variable with one already in the clustering variables.

    ▶ Model selection is implemented using the Bayesian Information Criterion (BIC).

    14 / 25

  • Back Pain Data

    ▶ The proposed model was applied to the back pain data:

    Variables       N. latent classes      BIC         ARI
    All                    5            -12582.62     0.50
    All                    3*           -12763.81     0.82
    35 Criteria            5            -12116.32     0.50
    35 Criteria            3*           -12305.67     0.80
    11 Criteria            3             -3965.24     0.75

    ▶ The new model achieves much greater data reduction.

    15 / 25

  • Algorithm Run

    Iter.  Proposal           BIC diff.  Decision   Proposal                     BIC diff.  Decision
    1      Remove Crit.5       -122.2    Accepted
    2      Remove Crit.23      -126.3    Accepted   Swap Crit.22 with Crit.5       -73.2    Rejected
    3      Remove Crit.38      -109.0    Accepted   Swap Crit.25 with Crit.5       -81.5    Rejected
    4      Remove Crit.4       -103.5    Accepted   Swap Crit.2 with Crit.38       -98.6    Rejected
    5      Remove Crit.1        -78.3    Accepted   Swap Crit.29 with Crit.4       -23.1    Rejected
    6      Remove Crit.29       -73.2    Accepted   Swap Crit.12 with Crit.1         2.7    Accepted
    7      Remove Crit.1        -73.5    Accepted   Swap Crit.26 with Crit.29        3.2    Accepted
    8      Remove Crit.29       -66.8    Accepted   Swap Crit.18 with Crit.12      -10.2    Rejected
    9      Remove Crit.35       -63.0    Accepted   Swap Crit.7 with Crit.29        -9.0    Rejected
    10     Remove Crit.7        -59.6    Accepted   Swap Crit.11 with Crit.35       -7.6    Rejected
    11     Remove Crit.10       -62.9    Accepted   Swap Crit.8 with Crit.7        -76.1    Rejected
    12     Remove Crit.11       -50.4    Accepted   Swap Crit.16 with Crit.10        6.8    Accepted
    13     Remove Crit.8        -54.5    Accepted   Swap Crit.10 with Crit.16      -32.0    Rejected
    14     Remove Crit.3        -44.2    Accepted   Swap Crit.31 with Crit.16       -9.5    Rejected
    15     Remove Crit.31       -33.2    Accepted   Swap Crit.18 with Crit.16      -22.7    Rejected
    16     Remove Crit.22       -30.9    Accepted   Swap Crit.24 with Crit.23       -1.7    Rejected
    17     Remove Crit.14       -22.7    Accepted   Swap Crit.32 with Crit.31       -5.0    Rejected
    18     Remove Crit.32       -19.2    Accepted   Swap Crit.37 with Crit.14       -8.0    Rejected
    19     Remove Crit.10       -35.4    Accepted   Swap Crit.9 with Crit.3         -1.3    Rejected
    20     Remove Crit.24       -17.6    Accepted   Swap Crit.30 with Crit.8        15.7    Accepted
    21     Remove Crit.34       -15.7    Accepted   Swap Crit.37 with Crit.1        -0.7    Rejected
    22     Remove Crit.25       -13.7    Accepted   Swap Crit.36 with Crit.1         3.3    Accepted
    23     Remove Crit.18       -10.5    Accepted   Swap Crit.1 with Crit.31         8.5    Accepted
    24     Remove Crit.27       -13.7    Accepted   Swap Crit.6 with Crit.26         6.1    Accepted
    25     Remove Crit.31        -1.3    Accepted   Swap Crit.20 with Crit.6         5.6    Accepted
    26     Remove Crit.37         1.4    Rejected   Swap Crit.6 with Crit.5         -3.1    Accepted
    27     Remove Crit.5          0.4    Rejected   Swap Crit.37 with Crit.20        4.0    Rejected

    16 / 25

  • Clustering / Clinical Taxonomy

    ▶ The clustering closely follows the clinical taxonomy.

                                Class 1   Class 2   Class 3
    Nociceptive                    210        21         4
    Peripheral Neuropathic           5        88         2
    Central Sensitization            3         3        89

    ▶ It is not unusual for patients diagnosed as Nociceptive to have Peripheral Neuropathic aspects to their back pain.

    17 / 25

  • Clustering Variables

    ▶ The selected variables exhibit strong clustering across the three groups.

    [Figure: class-conditional probabilities (0 to 1) of the selected criteria (Crit.2, 5, 8, 9, 13, 15, 19, 26, 28, 33, 37) in Classes 1–3.]

    18 / 25

  • Chosen Variables with Descriptions

    The chosen variables have the following descriptions.

    Crit.  Description                                                            Class 1  Class 2  Class 3
    2      Pain associated with trauma, pathologic process or dysfunction           0.94     0.90     0.04
    5      Usually intermittent and sharp with movement/mechanical provocation      0.94     0.84     0.24
    8      Pain localized to the area of injury/dysfunction                         0.97     0.50     0.31
    9      Pain referred in a dermatomal or cutaneous distribution                  0.06     1.00     0.11
    13     Disproportionate, nonmechanical, unpredictable pattern of pain           0.01     0.00     0.91
    15     Pain in association with other dysesthesias                              0.03     0.51     0.34
    19     Night pain/disturbed sleep                                               0.34     0.70     0.86
    26     Pain in association with high levels of functional disability            0.07     0.36     0.79
    28     Clear, consistent and proportionate pattern of pain                      0.97     0.94     0.07
    33     Diffuse/nonanatomic areas of pain/tenderness on palpation                0.03     0.01     0.73
    37     Pain/symptom provocation on palpation of relevant neural tissues         0.07     0.57     0.19

    19 / 25

  • Discarded Variables

    ▶ Many of the discarded variables are related to the clustering variables.

    [Figure: heat map of counts (0–400) cross-tabulating the discarded criteria against the selected criteria.]

    ▶ These are not clustering variables because they do not exhibit clustering beyond what can be explained by the clustering variables.

    20 / 25

  • Summary

    ▶ Model-based approaches to clustering and variable selection achieve excellent performance.

    ▶ Removing independence assumptions in the model achieves improved variable selection. Care is needed when interpreting the chosen/discarded variables.

    21 / 25

  • Simulation 1

    [Diagram: Simulation 1 graphical model linking the latent class z to variables X1–X12.]

    22 / 25

  • Simulation 1 Results

    [Figure: variable selection proportions (0 to 1) for X1–X12.]

    23 / 25

  • Simulation 2

    [Diagram: Simulation 2 graphical model linking the latent class z to variables X1–X10.]

    24 / 25

  • Simulation 2 Results

    [Figure: variable selection proportions (0 to 1) for X1–X10.]

    25 / 25

  • Exploring the dependence structure between categorical variables: Benefits and limitations of using variable selection within Bayesian clustering

    Michail Papathomas

    6th IBS Channel Network Conference - Hasselt - 2017

    (University of St Andrews) 1 / 37

  • Collaborators

    Main collaborators in the development of the clustering approach [profile regression, based on the Dirichlet process]:

    I Silvia Liverani
    I David Hastie
    I John Molitor
    I Sylvia Richardson

    Work on the relation between clustering and log-linear modelling with Sylvia Richardson

    (University of St Andrews) 2 / 37

  • Motivation

    Dependence structure and log-linear modelling

    I Assume observations from P categorical variables {x.1, ..., x.P}. For example,

      Subject   Smoking (X)   Drinking (Y)
      John           0             0
      Mary           0             1
      Jim            1             1
      ...

    I The resulting data can be arranged as counts in a P-way contingency table. For example,

    Table 1: Smoking (X) and Drinking (Y).
                           Drinking (Y)
    Smoking (X)        No (0)      Yes (1)
    No (0)               456          44
    Yes (1)              583         911

    (University of St Andrews) 3 / 37

  • Motivation

    Dependence structure and log-linear modelling

    I Denote the cell counts as n_l, l = 1, ..., n.
    I A Poisson distribution is assumed for the counts, so that E(n_l) = µ_l.
    I A Poisson log-linear model log(µ) = X_DM λ is a GLM that relates the expected counts to the variables.

    Table 1: Smoking (X) and Drinking (Y).
                           Drinking (Y)
    Smoking (X)        No (0)      Yes (1)
    No (0)               456          44
    Yes (1)              583         911

    (University of St Andrews) 4 / 37

  • Motivation

    Dependence structure and log-linear modelling

    Sometimes, the dependence structure between {x.1, ..., x.P} (marginal and conditional independence) can be inferred from the form of the log-linear model. For example,

    I  log(µ_ij) = λ + λ^X_i + λ^Y_j + λ^XY_ij   implies that X and Y are dependent.

    I  log(µ_ij) = λ + λ^X_i + λ^Y_j   implies marginal independence.

    I In practice, for more than 3 variables, interpreting the dependence structure by looking at the presence/absence of interaction terms becomes too difficult.

    (University of St Andrews) 5 / 37
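    As a concrete illustration (my addition, not a slide), Table 1 can be fitted as Poisson log-linear GLMs in R, comparing the independence model with the saturated one:

        tab <- data.frame(smoking  = factor(c(0, 0, 1, 1)),
                          drinking = factor(c(0, 1, 0, 1)),
                          count    = c(456, 44, 583, 911))
        saturated   <- glm(count ~ smoking * drinking, family = poisson, data = tab)
        independent <- glm(count ~ smoking + drinking, family = poisson, data = tab)
        anova(independent, saturated, test = "Chisq")   # tests the XY interaction term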

  • Motivation

    Dependence structure and log-linear modelling

    Joint probabilities for the categorical variables are obtained using the parameters of the log-linear model. In our simple example,

    P(X = i, Y = j) = µ_ij / Σ_{i',j'=0,1} µ_i'j'.

    Then, for instance,

    P(smoker, drinker) = P(X = 1, Y = 1) = µ_11 / (µ_00 + µ_10 + µ_01 + µ_11)
                       = exp(λ + λ^X_1 + λ^Y_1 + λ^XY_11) / (exp(λ) + exp(λ + λ^X_1) + ...)

    I So, in principle, we could fully explore the variables' dependence structure by calculating many joint probabilities and considering the laws of probability,
    I or, even better, by considering graphical log-linear models.

    (University of St Andrews) 6 / 37

  • Motivation

    Graphical models

    I They allow us to visualize and build complex dependence structures for the covariates under consideration.

    I Undirected Graphs and Directed Acyclic Graphs allow us to define and write complex joint distributions through factorization and conditional independence; Lauritzen (2011).

    I Neighborhoods of models are easily defined, and it is straightforward to move in the space of models by adding, removing or replacing edges.

    (University of St Andrews) 7 / 37

  • Motivation

    Example of a graphical model

    (University of St Andrews) 8 / 37

  • Motivation

    Problems with Log-linear modelling for detectinginteractions

    When it is of interest to detect interactions between covariates, log-linear regression modelling may become problematic.

    I In a classical setting, fitting log-linear models with many parameters sometimes requires an impractically large vector of observations for valid inferences (Burton et al., IJE, 2009). Identifiability and collinearity problems are also often present.

    I In Bayesian model comparison, the space of models becomes vast, and model search algorithms like the Reversible Jump approach (Green, Bka 1995) require an impractically large number of iterations before they converge (Dobra and Massam, St Meth 2010).

    (University of St Andrews) 9 / 37

  • Bayesian clustering with the Dirichlet process

    Bayesian clustering with the Dirichlet process

    I Partitions the subjects into groups according to their profile
    I Flexible Bayesian clustering
    I Uncertainty with regard to the clustering is evaluated
    I Post-processing leads to tractable output

    (University of St Andrews) 10 / 37

  • Bayesian clustering with the Dirichlet process

    Notation

    I Consider categorical variables x.p, p = 1, ..., P.
    I For individual i, denote the variable/covariate profile by x_i = (x_i1, ..., x_iP).
    I For example,
      x.1: smokes, does not smoke
      x.2: drinks, does not drink
      x.3: exercises, does not exercise
    I Now, for instance, x_i = (smokes, drinks, does not exercise).

    (University of St Andrews) 11 / 37

  • Bayesian clustering with the Dirichlet process

    Notation

    For individual i:
    I z_i = c allocates subject i to cluster c.
    I φ_cp(x) is the probability that variable x.p = x, given z_i = c.
    I Given z_i = c, x.p has a multinomial distribution with cluster-specific parameters φ_cp = [φ_cp(1), ..., φ_cp(Mp)].
    I A priori, φ_cp ∼ Dirichlet(λ_1, ..., λ_Mp).
    I ψ_c denotes the probability that a subject is assigned to cluster c.

    (University of St Andrews) 12 / 37

  • Bayesian clustering with the Dirichlet process

    Statistical Framework

    For φ = {φ_cp, c ∈ N, p = 1, ..., P}:
    I 'stick-breaking' prior on the allocation weights ψ_c.
    I x.1, ..., x.P are assumed independent given the clustering allocation and parameters...
    I ...and calculating joint probabilities for the categorical variables becomes easy!
    I Pr(x_i | z, φ) = Π_{p=1}^{P} φ_{z_i p}(x_ip) for i = 1, 2, ..., n.
    I This implies

    Pr(x_i | φ, ψ) = Σ_{c=1}^{∞} Pr(z_i = c | ψ) Π_{p=1}^{P} Pr(x_ip | z_i = c) = Σ_{c=1}^{∞} ψ_c Π_{p=1}^{P} φ_cp(x_ip).

    (University of St Andrews) 13 / 37
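    A small R sketch of the 'stick-breaking' construction of the allocation weights ψ_c mentioned above; truncating at C components is an assumption made purely for illustration (the talk uses the full Dirichlet process):

        stick_breaking <- function(C, alpha = 1) {
          v <- rbeta(C, 1, alpha)                  # stick-breaking fractions
          v[C] <- 1                                # close the stick at the truncation level
          v * c(1, cumprod(1 - v[-C]))             # psi_c = v_c * prod_{l < c} (1 - v_l)
        }
        set.seed(1)
        psi <- stick_breaking(C = 20, alpha = 1)   # allocation weights; sum(psi) equals 1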

  • Bayesian clustering with the Dirichlet process

    Using 0-1 variable selection switches

    I Identify covariates that contribute more than others to the formation of clusters. [Tadesse et al. (2005); Chung and Dunson (2009); Papathomas et al. (2012)]

    I Cluster-specific binary indicators γ^c_p, so that γ^c_p = 1 when covariate x.p is important for allocating subjects to cluster c; otherwise γ^c_p = 0.

    I Prior for switches: given ρ_p, γ^c_p ∼ Bernoulli(ρ_p).

    I We consider a sparsity-inducing prior for ρ with an atom at zero:

    ρ_p ∼ 1{w_p=0} δ_0(ρ_p) + 1{w_p=1} Beta(α_ρ, β_ρ),  where w_p ∼ Bernoulli(0.5).

    Similar to Chung and Dunson (JASA, 2009), but in their set-up covariate observations contribute to the likelihood through a regression model. In our case, covariate observations contribute directly to the likelihood, and we introduce π_p(x).

    (University of St Andrews) 14 / 37

  • Example - Using the R package PReMiuM; Liverani et al. (2015, JSS)

    I Simulated observations from 10000 subjects, recording...
    I 10 binary variables, say {SMO, DRI, EXE, D, E, ..., H, I, J}.

  • Example

    Table 2: Cluster profiles (in parentheses, the number of subjects typically allocated to each group), Simulation 1.

                   SMO   DRI   EXE    D     E     F     G     H     I     J
    Median(ρp)    0.36  0.78  0.32  0.75  0.06  0.05  0.00  0.48  0.57  0.50
    Group 1 (5465), Group 2 (3159), Group 3 (1376): [cluster profile symbols]

  • Linear graphical model determination

    Clustering and linear modelling

    I It is not clear how clustering output translates to interactions in a log-linear regression modelling framework.

    I Can we assist the process of comparing a large number of linear models with the clustering variable selection results?

    I The important aspect of a model that combines clustering and variable selection is that covariates are not chosen according to the size of their marginal effect. They are selected because they combine to create distinct groups of subjects. Consequently, we expect this type of modelling to be able to inform on interactions in a linear model setting.

    (University of St Andrews) 18 / 37

  • Linear graphical model determination

    Theoretical results

    Theorem 1: Consider random variables x.p and x.q, 1 ≤ p, q ≤ P, p ≠ q. If Σ_{c=1}^{C} γ^c_p × γ^c_q = 0, then x.p and x.q are independent.

    Theorem 2: Consider a set of random variables {x.1, ..., x.P}. If, for some p ∈ {1, ..., P}, Σ_{c=1}^{C} γ^c_p × γ^c_q = 0 for all q ≠ p, then x.p is independent of {x.1, ..., x.P} \ x.p.

    Proofs: see Papathomas and Richardson (2016, JSPI). Note that the converse is not true.

    The previous Theorems imply the following Corollary.

    Corollary: Consider covariate x.p. If Σ_{c=1}^{C} γ^c_p = 0, then x.p is independent of all other covariates.

    (University of St Andrews) 19 / 37

  • Linear graphical model determination

    Theoretical results

    Therefore,
    I if the selection probability ρ_p for x.p is zero or close to zero, which implies that Σ_{c=1}^{C} γ^c_p is also zero or close to zero, we can assume that x.p is independent of all other covariates.
    I Assuming that our interest lies in exploring interactions, to reduce the dimensionality of the problem when fitting log-linear models to sparse contingency tables, x.p could be removed from the analysis.

    (University of St Andrews) 20 / 37

  • Simulated data using graphical log-linear models

  • Linear graphical model determination

    Construction of matrix T γ

    I For iteration it and for each cluster c with more than one subject, form the matrix T^{c,it}, so that element (p1, p2), 1 ≤ p1 < p2 ≤ P, is either zero or one, and equal to γ^c_{p1}(it) × γ^c_{p2}(it). All other matrix cells are empty.

    I Sum up all matrices T^{c,it}, weighting by cluster size, to create an information matrix T^γ:

    T^γ = Σ_it Σ_c n_{c,it} × T^{c,it},

    where n_{c,it} is the size of cluster c at iteration it. Therefore, T^γ is a straightforward summary of all T^{c,it} matrices into one, with small clusters contributing less to this summary.

    I For ease of interpretation, reweight the elements of T^γ so that the maximum element is one: T^γ = (max{T^γ})^{-1} × T^γ.

    The matrix T^γ is constructed in such a manner that if element t^γ(p1, p2), 1 ≤ p1 < p2 ≤ P, is close to zero, an edge between x.p1 and x.p2 is unlikely to be present in a highly supported graphical model (a small R sketch of this construction follows below).

    (University of St Andrews) 22 / 37
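    A minimal R sketch of the construction just described (my own illustration); the storage format assumed below, a list with a C × P binary matrix gamma and a length-C vector size per MCMC iteration, is an assumption and not the output format of any particular package:

        build_T_gamma <- function(mcmc) {
          P  <- ncol(mcmc[[1]]$gamma)
          Tg <- matrix(0, P, P)
          for (it in mcmc) {
            keep <- which(it$size > 1)                   # clusters with more than one subject
            for (cc in keep) {
              g  <- it$gamma[cc, ]                       # selection switches gamma^c_p at this iteration
              Tg <- Tg + it$size[cc] * tcrossprod(g)     # add n_{c,it} * gamma_p1 * gamma_p2
            }
          }
          diag(Tg) <- 0                                  # only the p1 < p2 cells are defined
          Tg / max(Tg)                                   # rescale so the maximum element is one
        }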

  • Linear graphical model determination

    Example. Simulation 1

    I Simulated observations from 10000 subjects.
    I 10 binary categorical variables {A, B, C, ..., H, I, J} are observed.
    I Observations are simulated according to the log-linear model

    log(µ) = λ + λ^A + λ^B + λ^C + ... + λ^J + λ^AB + λ^BC + λ^CD + λ^DA + λ^HI + λ^IJ + λ^HJ + λ^HIJ

    I (For more details see Papathomas and Richardson (2016).)

    (University of St Andrews) 23 / 37

  • Example. Simulation 1

    log(µ) = λ + λ^A + λ^B + λ^C + ... + λ^J + λ^AB + λ^BC + λ^CD + λ^DA + λ^HI + λ^IJ + λ^HJ + λ^HIJ

  • Example. Simulation 1

    log(µ) = λ + λ^A + λ^B + λ^C + ... + λ^J + λ^AB + λ^BC + λ^CD + λ^DA + λ^HI + λ^IJ + λ^HJ + λ^HIJ

    Table 2: Cluster profiles (in parentheses, the number of subjects typically allocated to each group), Simulation 1.

                    A     B     C     D     E     F     G     H     I     J
    Median(ρp)    0.36  0.78  0.32  0.75  0.06  0.05  0.00  0.48  0.57  0.50
    Group 1 (5465), Group 2 (3159), Group 3 (1376): [cluster profile symbols]

  • Linear graphical model determination

    Example. Simulation 1

    T^sim1_γ =

            B      C      D      E      F      G      H      I      J
    A     .52    .08    .50    .04    .02    .02    .20    .27    .15
    B            .45     1     .06    .04    .03    .47    .64    .47
    C                   .45    .02    .02   .009    .12    .23    .16
    D                          .06    .04    .03    .45    .65    .48
    E                                .003   .003    .03    .04    .03
    F                                       .002    .02    .03    .03
    G                                               .02    .02    .02
    H                                                      .61    .56
    I                                                             .74

    (University of St Andrews) 26 / 37

  • Linear graphical model determination

    Example. Simulation 1

    Table 3: Mixing performance of samplers. The median number of iterations to the best model is calculated over 30 runs of the reversible jump MCMC chain. First and third quartiles are given in parentheses. PDV denotes the unrefined model search strategy adopted in Papathomas et al. (2011b). See Figure 2 for the highest posterior probability model.

    Simulation 1
                                   Acceptance rate    Iterations (median) to highest    Posterior probability for
                                   as a percentage    posterior probability model       highest probability model
    (a) Uniformly random (PDV)           5.1               590 (452, 821)                        0.55
    (b) Cluster specific                 3.8               247 (164, 369)                        0.55
    (c) Combined (30%, 10%)              5.3               540 (290, 674)                        0.53
    (d) Combined (20%, 20%)              4.9               403 (312, 493)                        0.55

    (University of St Andrews) 27 / 37

  • Linear graphical model determination

    Real data example

    I 30 single nucleotide polymorphisms (SNPs) in chromosomes 6 and 15. (Data from 4260 subjects in a genome-wide association study of lung cancer presented in Hung et al. (2008).)

    I 12 SNPs were indicated as important by variable selection within clustering. (Two from chromosome 15 and ten from chromosome 6.)

    I The SNPs were highly correlated, so 3 SNPs were included in the competing log-linear graphical models as representatives: rs8034191 from chromosome 15 and {rs4324798, rs1950081} from chromosome 6.

    I Age, gender and smoking status were also included in the competing log-linear graphical models, to search for gene-environment interactions.

    (University of St Andrews) 28 / 37

  • Linear graphical model determination

    Real data example

    I Reducing the number of SNPs from 30 to 12, and then to 3, allows the use of reversible jump MCMC to compare competing graphical models. The 2^33 contingency table would be too sparse, with the vast majority of cells equal to zero.

    I The highest posterior probability model (P = 0.8) is

    'SNP1 + SNP2 + SNP3 + AGE*GENDER*SMOKING',

    which does not support the presence of gene-gene or gene-environment interactions.

    (University of St Andrews) 29 / 37

  • Real data example

    ‘SNP1 + SNP2 + SNP3 + AGE ∗GENDER ∗ SMOKING′

    Table 6: Cluster profiles (in parentheses, the number of subjects typically allocated to each group). Genetic-environmental data (GE).

                  rs8034191 (A)   rs4324798 (B)   rs1950081 (C)   age (D)   gender (E)   smoking (F)
    Median(ρp)        0.01            0.00            0.10          0.92       0.82          0.85
    Cluster 1 (2222), Cluster 2 (2059): [cluster profile symbols]

  • Real data example

    ‘SNP1 + SNP2 + SNP3 + AGE ∗GENDER ∗ SMOKING′

    T^Real data_γ =

             S2       S3      AGE      GEN      SM
    S1     0.002     .01      .06      .06     .06
    S2              .001      .02      .02     .02
    S3                        .09      .07     .08
    AGE                                 1      .98
    GEN                                        .88

  • Linear graphical model determination

    Real data example

    Table 7: Mixing performance of samplers. The median number of iterations to the best model is calculated over 300 runs of the reversible jump MCMC chain. First and third quartiles are given in parentheses. PDV denotes the unrefined model search strategy adopted in Papathomas et al. (2011b).

    Genetic-environmental data [including important (characterized as such by clustering) representative SNPs]; highest probability model 'A+B+C+DEF'.
                                 Acceptance rate    Iterations (median) to highest    Posterior probability for
                                 as a percentage    posterior probability model       highest probability model
    (a) Uniformly random               6.3              564 (257, 1205)                       0.53
    (b) Cluster specific               8.4              196 (83, 443)                         0.51
    (c) Combined (30%, 10%)            6.9              310 (147, 670)                        0.51
    (d) Combined (20%, 20%)            7.5              235 (91, 516)                         0.52

    (University of St Andrews) 32 / 37

  • Relevant work

    Relevant work on latent class structures and log-linear modelling (Johndrow et al., 2014)

    I Johndrow et al. (2014) consider standard and novel latent class structures. The DP is a special case.

    I Its rank is defined as the minimum number of clusters required to describe the joint probability tensor for the categorical covariates.

    I Bounds are derived for the rank, in relation to the number and structure of the interactions in a weakly hierarchical log-linear model.

    I A massive reduction in the upper bound of the rank is shown under a sparse log-linear model.

    I The rank of the latent structure depends only on variables that are not marginally independent.

    I A straightforward application gives that an upper bound of the rank corresponding to simulation 1 is 2^7, rather than the default 2^9. The upper bound corresponding to simulation 5 is 2^8, rather than the default 2^99.

    (University of St Andrews) 33 / 37

  • Relevant work

    Relevant work on latent class structures and log-linear modelling (Zhou et al., 2015)

    I Zhou et al. (2015) also utilize the idea that marginally independent variables reduce the dimensionality of the problem.

    I A PARAFAC factorization is adopted, which can be viewed as a more general representation of the Dirichlet process.

    I Dimensionality reduction is achieved with the sparse PARAFAC (sp-PARAFAC) formulation, where marginal independence is modelled with quantities similar in nature to the π_p(x).

    I The focus is on providing expressions for parameters of the log-linear models, assessing the level of shrinkage, and the convergence of the probability tensor induced by sp-PARAFAC to the true probability tensor.

    I The prior formulation for detecting marginally independent covariates and reducing dimensionality is also different in the two approaches.

    I Different objectives, as we focus on accelerating log-linear model selection with the Reversible Jump by utilizing the clustering process.

    (University of St Andrews) 34 / 37

  • Additional remarks - References

    Summary

    I The advantage in utilizing variable selection within partitioning to inform log-linear model selection is mostly pertinent to marginal independence.

    I For sparse contingency tables, this information can lead to a substantial reduction in the number of covariates considered, making the exploration of the model space feasible.

    I Informing the model search algorithm with T^γ often improves the efficiency of the search. Marginal independence is not always detected, because the converse of the Theorems does not hold.

    I Importantly, using T^γ to assist the model search never resulted in a worse algorithm, compared to the standard model search approach in Papathomas et al. (2011b).

    (University of St Andrews) 35 / 37

  • Additional remarks - References

    Further work

    I Sparse contingency tables and conditional independence
    I Improved model search algorithms
    I Sparse contingency tables and identifiability

    (University of St Andrews) 36 / 37

  • Additional remarks - References

    Some References
    I Chung, Y., Dunson, D.B. (2009) Nonparametric Bayes conditional distribution modelling with variable selection. J. Am. Stat. Assoc., 104, 1646-60.
    I Johndrow, J.E., Bhattacharya, A., Dunson, D.B. (2014) Tensor decompositions and sparse log-linear models. arXiv:1404.0396v1.
    I Liverani, S., Hastie, D.I., Azizi, L., Papathomas, M. and Richardson, S. (2015) PReMiuM: An R package for Profile Regression Mixture Models using Dirichlet Processes. Journal of Statistical Software, 64(7), 1-30.
    I Molitor, J.T., Papathomas, M., Jerrett, M. and Richardson, S. (2010) Bayesian Profile Regression with an Application to the National Survey of Children's Health. Biostatistics, 11, 484-498.
    I Papathomas, M., Molitor, J., Richardson, S., Riboli, E. and Vineis, P. (2011) Examining the joint effect of multiple risk factors using exposure risk profiles: lung cancer in non smokers. Environmental Health Perspectives, 119, 84-91.
    I Papathomas, M., Molitor, J., Hoggart, C., Hastie, D. and Richardson, S. (2012) Exploring data from genetic association studies using Bayesian variable selection and the Dirichlet process: application to searching for gene-gene patterns. Genetic Epidemiology, 36, 663-674.
    I Papathomas, M. and Richardson, S. (2016) Exploring dependence between categorical variables: benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms. Journal of Statistical Planning and Inference, 173, 47-63.
    I Zhou, J., Bhattacharya, A., Herring, A.H., Dunson, D.B. (2015) Bayesian factorizations of big sparse tensors. J. Am. Stat. Assoc., accepted manuscript.

    (University of St Andrews) 37 / 37

  • Fast sampling with Gaussian scale-mixture priors in high-dimensional regression

    Bani K. Mallick (joint with Anirban Bhattacharya & Antik Chakraborty), Department of Statistics, Texas A&M University

    April 13, 2017

  • Outline

    I High Dimensional Regression
    I Global-Local Scale Mixtures of Gaussians
    I Efficient sampling from structured multivariate Gaussian distributions
    I Applications

  • Motivation

    I High dimensional regression with p covariates and sample size n
    I p is much larger than n
    I Sparsity through continuous shrinkage priors
    I Need an efficient computational scheme for p greater than n problems

  • Linear model with global-local prior

    I Y = Xβ + ε, ε ∼ N(0, σ²I_n)
    I X: n × p matrix of covariates, where p is potentially much larger than n
    I In such a setting, one expects β, the vector of regression coefficients, to be sparse
    I Global-local sparse prior on β

  • Global-local Prior

    I β_j | λ_j, τ, σ ∼ N(0, λ²_j τ² σ²), j = 1, ..., p
    I λ_j ∼ f, τ ∼ g, σ ∼ h
    I f, g, h are densities supported on (0, ∞)
    I λ²_j: local variance component, providing the scale mixing
    I τ²: global variance component (like the regularization parameter in the penalized likelihood formulation)

  • Sparsity Prior βj by scale mixing

    I Create different sparsity priors by choosing f, the distribution of λ_j
    I Student-t [Tipping 2001]: f is inverse-gamma
    I Double-Exponential [Park and Casella, 2008]: f is exponential
    I Normal/Jeffreys [Bae and Mallick, 2004]: Jeffreys' prior
    I Horseshoe prior [Carvalho et al., 2010]: half-Cauchy

  • Horseshoe Priors

    I Flat, Cauchy-like tails allow strong signals to remain large (un-shrunk)
    I A tall spike at the origin provides severe shrinkage for the zero elements of β
    I Due to the scale mixture of Gaussians formulation, most of the conditional distributions are in explicit form

  • Posterior computation: high-dimensional regression

    I The conditional posterior of β given λ, τ and σ is Gaussian:

    β | y, λ, τ, σ ∼ N(A^{-1} X^T y, σ² A^{-1}),   A = (X^T X + D^{-1}),

    where D = τ² diag(λ²_1, ..., λ²_p).

    I Need to invert A, which is p × p
    I The covariance is no longer diagonal
    I The design matrix X distorts the prior geometry
    I Need to sample from a high-dimensional Gaussian per iteration
    I The p local scale parameters λ_j have conditionally independent posteriors: λ = (λ_1, ..., λ_p)^T is updated in a block

  • Computational Difficulties

    I A standard algorithm to sample from Gaussian distributions can be found in Rue (2001)
    I It avoids inverting A and instead performs a Cholesky decomposition of A and a series of linear system solutions to generate samples
    I This is efficient for moderate values of p, obtaining a Cholesky decomposition of A at each MCMC step

  • Algorithm

    I Let A be an n × n symmetric, positive definite matrix
    I Cholesky decomposition: L is a lower triangular matrix with L_ii > 0 and A = LL^T

    Algorithm: Solving Ax = b where A > 0
    (i) Compute the Cholesky decomposition A = LL^T.
    (ii) Solve Lv = b.
    (iii) Solve L^T x = v.
    (iv) Return x.

    I As x = (L^T)^{-1} v = (L^T)^{-1}(L^{-1} b) = (LL^T)^{-1} b = A^{-1} b
    I The solutions in (ii) and (iii) can be done through forward and backward substitution due to the triangular nature of L

  • Computational Difficulties

    I It becomes highly expensive for large p
    I One cannot resort to precomputing the Cholesky factors, since the matrix D changes from one iteration to the next
    I The resulting computational bottleneck obscures the computational advantages of global-local priors when p is large

  • General scheme

    I Given X ∈

  • General scheme

    I Rue (2001): sampling from N(µ = Q^{-1} b, Σ = Q^{-1}):

    Algorithm (Rue, 2001)
    (i) Compute the Cholesky decomposition Q = LL^T.
    (ii) Solve Lv = b.
    (iii) Solve L^T m = v.
    (iv) Solve L^T w = z, where z ∼ N(0, I_p).
    (v) Set θ = m + w. Then θ ∼ N(µ, Σ).

    I Original motivation: GMRFs, where Q is banded.
    I The Cholesky factor in step (i) can then be computed very fast.
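    A direct R transcription of this Cholesky-based sampler (a sketch, not the authors' code); it returns one draw from N(Q^{-1} b, Q^{-1}):

        rue_draw <- function(Q, b) {
          p <- nrow(Q)
          U <- chol(Q)                     # R's chol() returns upper-triangular U with Q = U'U, so L = t(U)
          v <- forwardsolve(t(U), b)       # (ii)  solve L v = b
          m <- backsolve(U, v)             # (iii) solve L' m = v
          w <- backsolve(U, rnorm(p))      # (iv)  solve L' w = z, with z ~ N(0, I_p)
          m + w                            # (v)   theta = m + w ~ N(Q^{-1} b, Q^{-1})
        }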

  • General scheme

    Present setting: Q = (X^T X + D^{-1}) and b = X^T Y.

    Algorithm (Rue, 2001)
    (i) Compute the Cholesky decomposition (X^T X + D^{-1}) = LL^T.
    (ii) Solve Lv = X^T Y.
    (iii) Solve L^T m = v.
    (iv) Solve L^T w = z, where z ∼ N(0, I_p).
    (v) Set θ = m + w. Then θ ∼ N(µ, Σ).

    I (X^T X + D^{-1}) is a p × p dense matrix.
    I The Cholesky decomposition is costly in the present setting.
    I Overall complexity: O(p³).

  • Our proposal

    I Sample from N_p(µ, Σ) with

    Σ = (X^T X + D^{-1})^{-1},   µ = Σ X^T Y.

    I By the Woodbury matrix identity,

    Σ = (X^T X + D^{-1})^{-1} = D − DX^T (XDX^T + I_n)^{-1} XD.

    I One can show µ = DX^T (XDX^T + I_n)^{-1} Y.

    I Sample η ∼ N(0, Σ) and set θ = µ + η.

    I Data augmentation: sample an (n + p)-dimensional quantity.

  • Our proposal

    I Recall P = (XDX^T + I_n), S = XD and R = D.
    I η ∼ N(0, Σ)
    I η = u − S^T P^{-1} v
    I µ = DX^T (XDX^T + I_n)^{-1} Y
    I µ = S^T P^{-1} Y
    I Finally θ = µ + η
    I θ = u + S^T P^{-1} (Y − v)

  • Proposed algorithm

    Algorithm: Proposed algorithm
    (i) Sample u ∼ N(0, D) and δ ∼ N(0, I_n) independently.
    (ii) Set v = Xu + δ.
    (iii) Solve (XDX^T + I_n) w = (Y − v).
    (iv) Set θ = u + DX^T w.

    Overall complexity: O(n²p) if p > n. In p ≫ n settings, this is a reduction from cubic to linear in p.
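    A direct R transcription of the proposed algorithm above (a sketch; σ² is absorbed into the prior variances d here, which is an assumption of the example, and the data below are purely illustrative):

        fast_gaussian_draw <- function(X, y, d) {        # d = diag(D): the prior variances
          n <- nrow(X); p <- ncol(X)
          u     <- rnorm(p, sd = sqrt(d))                # (i)  u ~ N(0, D)
          delta <- rnorm(n)                              #      delta ~ N(0, I_n)
          v     <- as.vector(X %*% u) + delta            # (ii) v = X u + delta
          XD    <- X * rep(d, each = n)                  #      X D (scale column j of X by d_j)
          w     <- solve(XD %*% t(X) + diag(n), y - v)   # (iii) solve (X D X' + I_n) w = y - v
          u + as.vector(t(XD) %*% w)                     # (iv) theta = u + D X' w
        }

        # illustrative use
        set.seed(1)
        n <- 200; p <- 2000
        X <- matrix(rnorm(n * p), n, p); y <- rnorm(n)
        theta <- fast_gaussian_draw(X, y, d = rep(1, p))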

  • Time comparison

    Table: Absolute time (in seconds) to run 6000 iterations of the Gibbs sampler, reported for the two algorithms for chosen values of p.

       p      proposed        old
      200        5.50        6.05
      500        8.79       31.03
     1000       12.83      160.92
     2000       20.04      944.78
     3000       27.60     2616.80
     4000       35.76     5775.70
     5000       43.99    11314.28

    Big gains when p is large.

  • Simulation setting

    I Replicated simulation study with the horseshoe prior.
    I n = 200 and p = 5000. The true β_0 has 5 non-zero entries.
    I Two signal strengths:
      (i)  weak:     β_0S = ±(0.75, 1, 1.25, 1.5, 1.75)
      (ii) moderate: β_0S = ±(1.5, 1.75, 2, 2.25, 2.5)
    I Two types of design matrix:
      (i)  independent:       X_j i.i.d. N(0, I_p)
      (ii) compound symmetry: X_j i.i.d. N(0, Σ), Σ_jj' = 0.5 + 0.5 δ_jj'
    I 100 data sets generated.
    I Compare the horseshoe with MCP and SCAD.

  • Simulation Results

    [Figure: boxplots of ℓ1, ℓ2 and prediction error across 100 simulation replicates. HSme and HSm respectively denote the pointwise posterior median and mean for the horseshoe prior. The true β_0 is 5-sparse with non-zero entries ±{1.5, 1.75, 2, 2.25, 2.5}. Top row: Σ = I_p (independent). Bottom row: Σ_jj = 1, Σ_jj' = 0.5 for j ≠ j' (compound symmetry).]

  • Simulation Results

    [Figure: same setting as in the previous figure; the true β_0 is 5-sparse with non-zero entries ±{0.75, 1, 1.25, 1.5, 1.75}.]

  • Confidence/Credible intervals

    I Recent focus in the frequentist literature on post-selection inference (POSI): van de Geer et al. (2013), Javanmard & Montanari (2014).
    I Provide confidence sets for coefficients of subsets of variables.
    I Bayesian variable selection by post-processing MCMC output with one-group priors (Bondell & Reich, 2012; Hahn & Carvalho, 2015; Li & Pati, 2015).
    I Used in conjunction with shrinkage priors.

  • Simulation Results

    p                                 500                                                1000
    Design              Independent            Comp Symm                 Independent            Comp Symm
                      HS     LASSO    SS     HS     LASSO    SS        HS     LASSO    SS     HS     LASSO    SS
    Signal Coverage  93(1.0) 75(12.0) 82(3.7) 95(0.9) 73(4.0) 80(4.0)  94(2.0) 78(12.0) 85(5.1) 94(1.0) 77(2.0) 82(7.4)
    Signal Length     42      46      41      85      71      75       39      41      42      82      76      77
    Noise Coverage  100(0.0) 99(0.8) 99(1.0) 100(0.0) 98(1.0) 99(0.6)  99(0.0) 99(1.0) 98(0.9) 100(0.0) 99(1.0) 99(0.1)
    Noise Length       2      43      40       4      69      73      0.6      42      41     0.7      76      77

    Frequentist coverages (%) and 100 × lengths of pointwise 95% intervals. Average coverages and lengths are reported after averaging across all signal variables (rows 1 and 2) and noise variables (rows 3 and 4). Values in parentheses are standard errors (%) for the coverages. LASSO and SS respectively stand for the methods in van de Geer et al. (2013) and Javanmard & Montanari (2014). The intervals for the horseshoe (HS) are the symmetric posterior credible intervals.

  • TCGA Data

    I The Cancer Genome Atlas (TCGA) provides comprehensive molecular profiles for each of at least 33 different human tumor types (http://cancergenome.nih.gov) (Akbani et al., 2013).
    I The data portal has DNA and reverse-phase protein array (RPPA) expression data, with several other clinical variables, e.g. survival time of the patients.
    I Integrative analysis is possible as we have quantitative protein expression data over large cohorts of well-characterized TCGA patient tumors, with linked DNA and RNA analyses.

  • Classification Analysis

    I We have integrated the RPPA data with the DNA expression data for our analysis.
    I Toward the goal of classification, we consider two types of lung tumor data: lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC).
    I After some clean-up of the raw data we end up with 179 proteins and 76 genes which are present in both tumor types. The tumors have 55 and 45 samples respectively.
    I Our goal is to classify the tumor groups with the help of genes and proteins together, and simultaneously select the important genes and proteins.

  • Probit model

    I Suppose y_i ∈ {0, 1} and X_i ∈ R^p. The probit model sets

    y_i = 1 if X_i^T β + ε_i > 0, and 0 otherwise, with ε_i ∼ N(0, 1).   (2)

    I Given a posterior for β, predictions can be done using the posterior predictive distribution.

  • Variable selection

    I To select the genes and proteins that are important to the tumor classification, we use the horseshoe shrinkage prior from Carvalho et al. (2009) on the coefficients β.
    I Specifically, π(β_j | λ_j, τ) ∼ N(0, λ²_j τ²) and λ_j, τ ∼ C+(0, 1) for j = 1, ..., p.
    I The latent variable formulation enables simple conditional Gibbs steps for posterior computation.
    I However, we need to employ some post-processing scheme to select important variables, due to the continuous nature of the prior.

  • Post processing posterior summaries of β

    I Let β̄ denote the posterior mean of β.
    I The posterior predictive loss ||Xβ̄ − Xβ||²_2 can be minimized subject to an ℓ1 constraint to perform selection.
    I A decision-theoretic justification of such a post-processing step can be found in Hahn & Carvalho (2015).
    I We minimize the following objective function to select important variables:

    β^P = argmin_β ||Xβ̄ − Xβ||²_2 + Σ_{j=1}^{p} λ_j |β_j|.   (3)

    I Following Chakraborty et al. (2016), we use λ_j = 1/β̄_j².
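    A hedged R sketch of this post-processing step, using glmnet's per-coefficient penalty.factor to carry the weights λ_j = 1/β̄_j²; choosing the overall penalty level by cross-validation is my assumption for the example, not the authors' rule:

        library(glmnet)
        post_process <- function(X, beta_bar) {
          fitted_mean <- as.vector(X %*% beta_bar)              # X beta-bar, the posterior-mean fit
          w <- 1 / beta_bar^2                                    # heavier penalty on near-zero coefficients
          cvfit <- cv.glmnet(X, fitted_mean, penalty.factor = w, intercept = FALSE)
          as.vector(coef(cvfit, s = "lambda.min"))[-1]           # sparse beta^P (intercept dropped)
        }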

  • Results

    I To compare our results we choose the Lasso penalty (Tibshirani, 1996) with a logistic link function from the R package ncvreg/glmnet.
    I We report the selected model size and misclassification rate for both methods.
    I For the horseshoe probit model the misclassification rate was 5% and the selected model size was 5: the selected genes are GSKA-3 and ERBB3, and the proteins are PI3KP110ALPHA, MYOSINIIA_pS1943, and ANNEXIN1.
    I For the LASSO logistic model they were 5% and 32 respectively.
    I The variables selected by the horseshoe probit model were also selected by the LASSO logistic model.

  • The selected variables

    I The effect of the GSKA-3 gene, as one of the hypoxia-inducible factors resulting in solid cancer, was established in Gort et al. (2008).
    I The gene ERBB3 has been seen to have a direct impact on lung cancer; see Sitharaman & Anderson (2008).
    I A recent work by Elkabets et al. (2013) studies in detail the effect of the PI3KP110ALPHA gene on breast cancer and lung cancer.
    I Wong et al. (2012) developed treatments for the suppression of lung cancer cell tumor markers related to the ANNEXIN1 gene.

  • Survival Analysis for pan-cancer data

    I One of the fundamental interests is to establish the relationship between patients' survival and different proteins
    I Data come from different kinds of tumors
    I As the sample size may not be very large in each group, a Bayesian hierarchical model will be useful to borrow strength
    I Furthermore, to deal with high dimensionality, we require either a penalized approach (frequentist statistics) or a shrinkage-based approach (Bayesian statistics)
    I We apply Bayesian techniques with horseshoe priors

  • Kidney Tumors

    I We consider the kidney tumors: Kidney Chromophobe (KICH), 63 samples; Kidney renal clear cell carcinoma (KIRC), 150 samples; Kidney renal papillary cell carcinoma (KIRP), 124 samples
    I All the tumors have 189 proteins

  • AFT model

    I Typically, survival outcomes have two variables: the time (t) and a censoring indicator (δ), which allows the data to be either censored or not
    I i: patient, j: protein, k: tumor type
    I We fit an Accelerated Failure Time (AFT) model:

    log(t_ik) = Σ_{j=1}^{p} x_ijk β_jk + ε_ik,

    where t_ik is the survival time of the i-th patient who has the k-th cancer, the other symbols have their usual meaning, and the ε_ik are iid N(0, σ²).

    I For Bayesian MCMC we impute the censored data w_ik:
      w_ik = log t_ik if t_ik is an event time
      w_ik > log t_ik if t_ik is right censored

  • Shrinkage Prior

    We can carry out a regular shrinkage analysis as in linear models. Here we adopt the global-local horseshoe prior and fit an individual regression model for each kind of tumor:

    log t_ik | β_jk, σ²_k ∼ N( Σ_{j=1}^{p} x_ijk β_jk, σ²_k )

    β_jk | λ_jk, τ, σ²_k ∼ N(0, λ²_jk τ² σ²_k)

    λ_jk ∼ C+(0, 1),  τ ∼ C+(0, 1),  π(σ²_k) ∝ 1/σ²_k

  • Estimation of Protein Effects

  • Pan Cancer Model

    I The previous plot shows that, due to the shrinkage power of the horseshoe prior and the absence of a sufficient number of samples in each kidney tumor group, almost all of the proteins turn out to be insignificant in explaining the survival curve
    I So we decide to fit a pan-cancer model
    I While fitting this model we make use of the idea of borrowing strength across cancers, by specifying the prior distributions of the parameters accordingly.

  • Pan Cancer Model

    log t_ik | β_jk, σ² ∼ N( Σ_{j=1}^{p} x_ijk β_jk, σ² )

    β_jk | λ_jk, τ, σ² ∼ N(b_j, λ²_jk τ² σ²)

    λ_jk | τ, σ² ∼ C+(0, 1),  τ | σ² ∼ C+(0, 1) I(0, 1),  π(σ²) ∝ 1/σ²

    b_j ∼ N(0, σ²_b),  j = 1, ..., p

    Then,

    Corr(β_jk, β_jk') = σ²_b / [ (λ²_jk τ² σ² + σ²_b)^{1/2} (λ²_jk' τ² σ² + σ²_b)^{1/2} ]

  • Estimation

  • Result

    I Some of the proteins which are significant for the KIRC tumor group are BAK, CRAF_pS338, GAB2, HER3_pY1298, PCADHERIN, PCNA, RAD51, FOXO3A_pS318S321, SF2, DIRAS3
    I The effects of these proteins have been discussed in the literature; e.g. see Adams et al. (2012), GAB2 - A Scaffolding Protein in Cancer, Molecular Cancer Research
    I The significant proteins for the other tumor groups can be found in a similar fashion
