
  • Bayesian inference on high-dimensionalSeemingly Unrelated Regressions

    Alex Lewin and Marco Banterle

    Brunel University London

    April 2017

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 1 / 22

  • Motivation

    Motivation

    Genetic studies aiming at identifying association between point mutations (SNPs) and multivariate phenotypes:

    gene expression measurements

    metabolomics data

    protein concentrations

    ...

    Looking for sparse variable selection

    Take into account data correlations

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 2 / 22

  • Motivation

    Multivariate Data

    Predictor matrix X: n observations, p variables.

    Response matrix Y: n observations, q variables.

    Aim: identify which of the p variables in X are significantly associated with the outcomes in Y.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 3 / 22

  • Motivation

    Case study: mQTL discovery in the North Finland Birth Cohort study (NFBC)

    The NFBC66 is a cohort of 12000 adults followed since 1966

    Collection of data at age 31 years, including clinical data and blood samples

    DNA extracted, leading to a sample of 5746 adults genotyped across the genome (∼ 300,000 SNPs)

    Metabolite lipid profile quantified from serum samples by NMR, giving measures of 137 metabolites

    Question of interest is the discovery of genetic markers associated with metabolite regulation of lipids

    These responses are highly structured, with strong correlations

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 4 / 22

  • Motivation

    Correlations in the mQTL data set

    [Figures: Y metabolite correlations and X SNP correlations (only a subset plotted).]

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 5 / 22

  • Model

    Multivariate Regression Model with Variable Selection

    Frame the problem as a multivariate linear regression model:

    Y (n×q) = X (n×p) B (p×q) + E (n×q)

    or equivalently: Y ∼ MN(XB, I_n, R)

    Sparse associations: set most elements of B to zero

    Correlated outcomes: allow a non-diagonal residual covariance matrix R

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 6 / 22
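    As a toy illustration (my addition, not from the slides), a minimal R sketch that simulates data from this model; the sparsity level and the AR(1)-type residual correlation R are assumptions made purely for the example:

        ## simulate Y = XB + E with rows of E iid N(0, R), i.e. Y ~ MN(XB, I_n, R)
        set.seed(1)
        n <- 100; p <- 30; q <- 30
        X <- matrix(rnorm(n * p), n, p)
        B <- matrix(0, p, q)
        B[sample(p * q, 20)] <- runif(20, 1, 2)        # sparse coefficient matrix (assumed sparsity)
        R <- 0.6 ^ abs(outer(1:q, 1:q, "-"))           # assumed residual correlation between outcomes
        E <- matrix(rnorm(n * q), n, q) %*% chol(R)    # row-independent, column-correlated errors
        Y <- X %*% B + E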

  • Model

    Variable selection is performed through a binary matrix Γ (p × q):

    γ_jk = 1 if outcome k is associated with predictor j, and 0 otherwise

    Sparsity prior: γ_jk ∼ Bern(ω_jk), ω_jk ∼ Beta(·)

    [Figure: the Γ matrix, predictor variables × outcomes.]

    Predictor X_j only appears in the regression for outcome k if γ_jk = 1.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 7 / 22

  • Model

    Bayesian Setting

    Very high dimensional data: p ≈ 10^4 to 10^6 variables in X, q ranging from 1 to 10^4 variables in Y, around n = 5000 observations.

    Focus on sparse Bayesian Variable Selection (sparse BVS): minimise arbitrary tuning of the model.

    Provides the posterior probability of association for each predictor and each response.

    Bayesian model averaging: explore the space of 2^p models.

    Marginal posterior inclusion probabilities are model averages.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 8 / 22

  • Model

    SUR model

    Seemingly Unrelated Regressions (SUR) model:

    y_k (n×1) = X_{γk} (n×d_k) β_{γk} (d_k×1) + ε_k (n×1),  for k = 1, ..., q

    Cov[ε_k, ε_l] = R_kl ≠ 0  ⟹  the outcomes do not naturally separate.

    Different variables selected for each outcome: γ_k, k = 1, ..., q.

    So, vectorise the model:

    vec(y_1, y_2, ..., y_q) ∼ N( vec(X_{γ1}β_{γ1}, X_{γ2}β_{γ2}, ..., X_{γq}β_{γq}), R ⊗ I_n )

    The covariance matrix is not block diagonal.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 9 / 22

  • Model

    SUR model

    Priors:

    vec(β_1, β_2, ..., β_q) | γ_1, γ_2, ..., γ_q ∼ N( 0, (I_q ⊗ W)_{vec(γ_1, γ_2, ..., γ_q)} )

    R ∼ IW(ν, M)

    γ_kj ∼ Bern(ω_k), ω_k ∼ Beta(·)

    W is a p × p matrix (constant, or g-prior g(X^T X)^{-1}); different rows/columns are selected for different outcomes.

    Not conjugate: cannot integrate out β and R.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 10 / 22

  • Model

    SUR model

    Can calculate posterior full conditionals for β_k and R → Gibbs sampler for γ_k, β_k and R.

    However, this is computationally prohibitive due to the calculation of (X^T (R^{-1} ⊗ I_n) X)^{-1}, where X is the block-diagonal matrix

    X = diag(X_{γ1}, X_{γ2}, ..., X_{γq}),

    an nq × (d_1 + d_2 + ... + d_q) matrix, which changes at every MCMC iteration.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 11 / 22
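    For concreteness, a small R sketch (illustrative sizes, not the talk's code) of the block-diagonal design and of the matrix whose inverse makes this Gibbs step expensive:

        library(Matrix)
        n <- 100; q <- 5
        Xg   <- lapply(1:q, function(k) matrix(rnorm(n * (k + 1)), n, k + 1))  # X_{gamma_k}, d_k = k + 1 here
        Xbig <- bdiag(Xg)                                   # nq x (d_1 + ... + d_q); changes every iteration
        R    <- 0.5 ^ abs(outer(1:q, 1:q, "-"))             # some residual correlation (assumed)
        A    <- t(Xbig) %*% (solve(R) %x% Diagonal(n)) %*% Xbig   # the matrix whose inverse is needed
        dim(A)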

  • Model

    New work: computation for the SUR model

    Idea from Zellner and Ando (2010): decompose the likelihood:

    y_1 = X_{γ1}β_{γ1} + ε_1

    y_2 = X_{γ2}β_{γ2} + ρ_21 (y_1 − X_{γ1}β_{γ1}) + ε_2

    ...

    y_k = X_{γk}β_{γk} + Σ_{l<k} ρ_kl (y_l − X_{γl}β_{γl}) + ε_k

  • Model

    New work: priors in the transformed space

    We aim to decompose into a product over responses, as with the likelihood.

    Betas are straightforward: ∏_k N(β_k | γ_k, W)

    Covariance matrix: R ∼ IW(ν, M) becomes

    ∏_k N({ρ_k1, ..., ρ_k,k−1} | σ²_k, M) × IG(σ²_k | {ρ_k1, ..., ρ_k,k−1}, ν, M)

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 13 / 22

  • Model

    New: posterior conditionals in transformed space

    Covariance matrix: in the transformed space,

    ∏_k N({ρ_k1, ...} | σ²_k, M, Y, B, Γ) × IG(σ²_k | {ρ_k1, ...}, ν, M, Y, B, Γ)

    So the MCMC updates for the R parameters factorise over responses.

    Betas: MCMC for B is not so straightforward: Zellner and Ando used a simplified factorisation + Gibbs resampling.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 14 / 22

  • Model

    New: posterior conditionals in transformed space

    We found the correct full conditionals for B (they do decompose):

    β_{γk} | ... ∼ N( W_k X_{γk}^T ỹ_k, W_k ),   k = 1, ..., q

    where

    W_k = ( X_{γk}^T X_{γk} (1/σ²_k + Σ_{l>k} ρ²_lk / σ²_l) + W_{γk}^{-1} )^{-1},

    ỹ_k = a function of Y, B, Γ across responses.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 15 / 22

  • Model

    New work: computation for the SUR model

    Main point: we have got rid of the (X^T (R^{-1} ⊗ I_n) X)^{-1} big matrix calculation.

    Using the ESS (evolutionary stochastic search) algorithm (Bottolo et al.) to explore the space of Γ variable selection parameters.

    [Figures: computation time versus q, versus p, and versus n.]

    These timings are in R; in C++ the relative speed-up will be greater.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 16 / 22

  • Model

    Simulated data

    n = 100, q = 30, p = 30, with correlated residuals.

    [Figure: true outcome correlation matrix.]

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 17 / 22

  • Model

    Simulated data

    [Figures: true Γ matrix (predictors × outcomes); HESS Γ and SUR Γ full heat maps; histograms of posterior inclusion probabilities (PIP) for HESS Γ and SUR Γ.]

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 18 / 22

  • Case Study: mQTL discovery

    mQTL analysis of NFBC data

    After quality control: n = 4023 people, q = 103 metabolites, p = 9172 SNPs on chromosome 16.

    [Figures: data correlation and residual correlation matrices.]

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 19 / 22

  • Case Study: mQTL discovery

    Evidence of enhanced linkage for Chromosome 16

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 20 / 22

  • Summary, Acknowledgements

    Summary

    Bayesian SUR model with a sparsity prior to perform variable selection for multiple responses.

    Estimating the residual covariance matrix increases the accuracy of the variable selection.

    We have extended the Zellner and Ando method to obtain the correct posteriors directly.

    Computational speed-up → the model can be used on large genomic data sets.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 21 / 22

  • Summary, Acknowledgements

    Thank you! Sylvia Richardson, Leonardo Bottolo, Marjo-Riitta Jarvelin, Habib Saadi, Marc Chadeau-Hyam

    Lewin A et al. (2015) MT-HESS: an efficient Bayesian approach for simultaneous association detection in OMICS datasets, with application to eQTL mapping in multiple tissues. Bioinformatics, 10.1093.

    Bottolo L, Chadeau-Hyam M, et al. (2013) GUESS-ing polygenic associations with multiple phenotypes using a GPU-based Evolutionary Stochastic Search algorithm. PLoS Genetics, 10.1371.

    Bhadra A and Mallick BK (2013) Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics, 10.1111.

    Bottolo L, Petretto E, et al. (2011) Bayesian detection of expression quantitative trait loci hot spots. Genetics 189:1449–1459.

    Alex Lewin (Brunel University London) Bayesian Variable Selection April 2017 22 / 22

  • Model-based clustering for multivariatecategorical data

    Michael Fop, Keith Smart and Brendan Murphy

    6th IBS Channel Network Conference

    1 / 25

  • Back Pain Dataset

    ▶ A study to investigate the use of a mechanisms-based classification of musculoskeletal pain in clinical practice.

    ▶ The aim of the study was to assess the discriminative power of the taxonomy of pain into Nociceptive, Peripheral Neuropathic and Central Sensitization for low-back disorders.

    ▶ There are N = 464 patients who were assessed according to a list of 36 binary clinical indicators (“Present”/“Absent”).

    ▶ Some of the indicators carry the same information about the pain categories, so the interest here is to select a subset of the most relevant clinical criteria while partitioning the patients.

    ▶ Does the partition of the patients agree with the clinical taxonomy?

    2 / 25

  • Clustering and Variable Selection

    ▶ The motivating example shows the need for:

    ▶ Clustering: Can we establish the existence of subgroups? How can we characterize these subgroups?

    ▶ Variable Selection: Can we use a subset of the variables to distinguish the subgroups?

    3 / 25

  • Model-Based Clustering/Mixture Models

    ▶ Denote the N ×M data matrix by X

    ▶ The nth observation is denoted by Xn.

    ▶ Model-based clustering assumes that Xn arises from a finite mixture model.

    ▶ Assuming G classes (components),

    p(Xn | τ, θ, G) = Σ_{g=1}^{G} τ_g p(Xn | θ_g).

    ▶ τ_g are the mixture weights.

    ▶ p(Xn | θ_g) is the component distribution.

    4 / 25

  • Latent Class Analysis (LCA) model

    ▶ Latent Class Analysis (LCA) is a model for clusteringcategorical data.

    ▶ Let Xn = (Xn1,Xn2, . . . ,XnM) where Xnm takes a value from{1, 2, . . . ,Cm}.

    ▶ In LCA we assume local independence between variables, so that if we knew Xn was in class g we could write its density as

    p(Xn | θ_g) = Π_{m=1}^{M} Π_{c=1}^{Cm} θ_{gmc}^{I(Xnm = c)},

    where {θ_{gm1}, ..., θ_{gmCm}} give the probabilities of observing the categories {1, ..., Cm} in variable m.

    ▶ θ_g characterizes and embodies the differences between groups.

    5 / 25

  • LCA model (general)

    ▶ Model likelihood of the form,

    p(Xn | θ, τ, G) = Σ_{g=1}^{G} τ_g Π_{m=1}^{M} Π_{c=1}^{Cm} θ_{gmc}^{I(Xnm = c)}.

    ▶ More convenient to work with completed data

    ▶ Augment the data with class labels Zn = (Zn1, Zn2, ..., ZnG), where Zng = 1 if observation n belongs to group g, and 0 otherwise.

    ▶ Then we can write down the completed-data likelihood for an observation:

    p(Xn, Zn | θ, τ, G) = Π_{g=1}^{G} { τ_g Π_{m=1}^{M} Π_{c=1}^{Cm} θ_{gmc}^{I(Xnm = c)} }^{Zng}.

    6 / 25
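    As an illustration of the mixture density just written down, a minimal R sketch (my addition, not from the talk) that evaluates p(Xn | θ, τ, G) for one observation, with θ stored as one M × Cm probability matrix per class; all numerical values below are assumed for the example:

        lca_density <- function(x, theta, tau) {
          G <- length(tau)
          comp <- sapply(seq_len(G), function(g) {
            probs <- theta[[g]]                       # M x C matrix: probs[m, c] = theta_{g m c}
            prod(probs[cbind(seq_along(x), x)])       # local independence: product over the M variables
          })
          sum(tau * comp)                             # mix over the G classes with weights tau_g
        }

        # e.g. two classes, three binary variables
        theta <- list(matrix(c(0.9, 0.1, 0.8, 0.2, 0.7, 0.3), 3, 2, byrow = TRUE),
                      matrix(c(0.2, 0.8, 0.3, 0.7, 0.1, 0.9), 3, 2, byrow = TRUE))
        lca_density(x = c(1, 2, 1), theta = theta, tau = c(0.6, 0.4))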

  • LCA model (general)

    ▶ Estimation by EM algorithm or VB (see the BayesLCA package).

    ▶ Note that G must be chosen in advance; the best G for the data can be chosen using information criteria (e.g. BIC).

    ▶ Bayesian approaches:
    ▶ Pandolfi, Bartolucci and Friel (2014) use reversible jump to get the posterior probability for G.
    ▶ White, Wyse and Murphy (2016) use a collapsed Gibbs sampler and incorporate variable selection.

    7 / 25

  • Back Pain Data: LCA Results

    [Figure: estimated class-conditional probabilities (0 to 1) for criteria Crit.1–Crit.38, for Classes 1–5.]

    8 / 25

  • Back Pain Data: LCA Clustering

                               1     2     3     4     5
    Central Sensitization    48     1     5     0    41
    Nociceptive                0    10    96   126     3
    Peripheral Neuropathic     0    89     1     1     4

    9 / 25

  • Variable Selection: Dean & Raftery’s Greedy Search

    ▶ Dean & Raftery (2010) proposed a greedy stepwise variableselection algorithm for LCA.

    ▶ The observation vector Xn is partitioned as

    Xn = (Xn^C, Xn^P, Xn^O)

    where
    ▶ Xn^C are the current clustering variables,
    ▶ Xn^P is proposed to be added to the clustering variables,
    ▶ Xn^O are the other variables.

    10 / 25

  • Dean & Raftery’s Greedy Search

    ▶ Two competing models are compared:

    [Diagrams: two competing graphical models, M1 and M2*, each linking z, X^C, X^P and X^O.]

    ▶ M1 assumes that the proposed variable has clustering structure.

    ▶ M2* assumes that the proposed variable has no clustering structure.

    ▶ This framework reduces the independence assumption of the previously described approach.

    11 / 25

  • Local Independence (A Problem?)

    ▶ When analyzing the back pain data, we achieved very little data reduction.

    ▶ In fact, only one variable was labeled as non-clustering.

    ▶ An explanation for this is the local independence assumption in the model.

    ▶ Suppose we have two variables that are highly dependent and both exhibit clustering.

    ▶ The variable selection method will include both variables in the model, even if one variable contains no extra clustering information.

    12 / 25

  • Novel Extension: Relaxing Independence Further

    ▶ It is unrealistic to assume that Xn^C and Xn^P are conditionally independent in M2*.

    ▶ We propose replacing M2* with a different model, M2, in which X^P depends on a subset X^R ⊆ X^C of the clustering variables.

    [Diagrams: graphical models M1 and M2, each linking z, X^C, X^P and X^O; in M2, X^P is connected to X^R ⊆ X^C.]

    ▶ M1 assumes that the proposed variable has clustering structure.

    ▶ M2 assumes that the proposed variable has no clustering structure beyond that explained by the clustering variables.

    13 / 25

  • Stepwise Search Algorithm

    ▶ We propose a stepwise search algorithm to find an optimal set of variables for clustering.

    ▶ The algorithm involves the following steps:
    ▶ Add: add a variable to the current clustering variables.
    ▶ Remove: remove a variable from the current clustering variables.
    ▶ Swap: swap a proposed variable with one already in the clustering variables.

    ▶ Model selection is implemented using the Bayesian Information Criterion (BIC).

    14 / 25

  • Back Pain Data

    ▶ The proposed model was applied to the back pain data:

    Variables       N. latent classes      BIC         ARI
    All                    5            -12582.62     0.50
    All                    3*           -12763.81     0.82
    35 Criteria            5            -12116.32     0.50
    35 Criteria            3*           -12305.67     0.80
    11 Criteria            3             -3965.24     0.75

    ▶ The new model achieves much greater data reduction.

    15 / 25

  • Algorithm Run

    Iter.  Proposal           BIC diff.  Decision   Proposal                     BIC diff.  Decision
    1      Remove Crit.5       -122.2    Accepted
    2      Remove Crit.23      -126.3    Accepted   Swap Crit.22 with Crit.5       -73.2    Rejected
    3      Remove Crit.38      -109.0    Accepted   Swap Crit.25 with Crit.5       -81.5    Rejected
    4      Remove Crit.4       -103.5    Accepted   Swap Crit.2 with Crit.38       -98.6    Rejected
    5      Remove Crit.1        -78.3    Accepted   Swap Crit.29 with Crit.4       -23.1    Rejected
    6      Remove Crit.29       -73.2    Accepted   Swap Crit.12 with Crit.1         2.7    Accepted
    7      Remove Crit.1        -73.5    Accepted   Swap Crit.26 with Crit.29        3.2    Accepted
    8      Remove Crit.29       -66.8    Accepted   Swap Crit.18 with Crit.12      -10.2    Rejected
    9      Remove Crit.35       -63.0    Accepted   Swap Crit.7 with Crit.29        -9.0    Rejected
    10     Remove Crit.7        -59.6    Accepted   Swap Crit.11 with Crit.35       -7.6    Rejected
    11     Remove Crit.10       -62.9    Accepted   Swap Crit.8 with Crit.7        -76.1    Rejected
    12     Remove Crit.11       -50.4    Accepted   Swap Crit.16 with Crit.10        6.8    Accepted
    13     Remove Crit.8        -54.5    Accepted   Swap Crit.10 with Crit.16      -32.0    Rejected
    14     Remove Crit.3        -44.2    Accepted   Swap Crit.31 with Crit.16       -9.5    Rejected
    15     Remove Crit.31       -33.2    Accepted   Swap Crit.18 with Crit.16      -22.7    Rejected
    16     Remove Crit.22       -30.9    Accepted   Swap Crit.24 with Crit.23       -1.7    Rejected
    17     Remove Crit.14       -22.7    Accepted   Swap Crit.32 with Crit.31       -5.0    Rejected
    18     Remove Crit.32       -19.2    Accepted   Swap Crit.37 with Crit.14       -8.0    Rejected
    19     Remove Crit.10       -35.4    Accepted   Swap Crit.9 with Crit.3         -1.3    Rejected
    20     Remove Crit.24       -17.6    Accepted   Swap Crit.30 with Crit.8        15.7    Accepted
    21     Remove Crit.34       -15.7    Accepted   Swap Crit.37 with Crit.1        -0.7    Rejected
    22     Remove Crit.25       -13.7    Accepted   Swap Crit.36 with Crit.1         3.3    Accepted
    23     Remove Crit.18       -10.5    Accepted   Swap Crit.1 with Crit.31         8.5    Accepted
    24     Remove Crit.27       -13.7    Accepted   Swap Crit.6 with Crit.26         6.1    Accepted
    25     Remove Crit.31        -1.3    Accepted   Swap Crit.20 with Crit.6         5.6    Accepted
    26     Remove Crit.37         1.4    Rejected   Swap Crit.6 with Crit.5         -3.1    Accepted
    27     Remove Crit.5          0.4    Rejected   Swap Crit.37 with Crit.20        4.0    Rejected

    16 / 25

  • Clustering / Clinical Taxonomy

    ▶ The clustering closely follows the clinical taxonomy.

                                Class 1   Class 2   Class 3
    Nociceptive                    210        21         4
    Peripheral Neuropathic           5        88         2
    Central Sensitization            3         3        89

    ▶ It is not unusual for patients diagnosed as Nociceptive to have Peripheral Neuropathic aspects to their back pain.

    17 / 25

  • Clustering Variables

    ▶ The selected variables exhibit strong clustering across the three groups.

    [Figure: class-conditional probabilities (0 to 1) of the selected criteria (Crit.2, 5, 8, 9, 13, 15, 19, 26, 28, 33, 37) in Classes 1–3.]

    18 / 25

  • Chosen Variables with Descriptions

    The chosen variables have the following descriptions.

    Crit.  Description                                                            Class 1  Class 2  Class 3
    2      Pain associated with trauma, pathologic process or dysfunction           0.94     0.90     0.04
    5      Usually intermittent and sharp with movement/mechanical provocation      0.94     0.84     0.24
    8      Pain localized to the area of injury/dysfunction                         0.97     0.50     0.31
    9      Pain referred in a dermatomal or cutaneous distribution                  0.06     1.00     0.11
    13     Disproportionate, nonmechanical, unpredictable pattern of pain           0.01     0.00     0.91
    15     Pain in association with other dysesthesias                              0.03     0.51     0.34
    19     Night pain/disturbed sleep                                               0.34     0.70     0.86
    26     Pain in association with high levels of functional disability            0.07     0.36     0.79
    28     Clear, consistent and proportionate pattern of pain                      0.97     0.94     0.07
    33     Diffuse/nonanatomic areas of pain/tenderness on palpation                0.03     0.01     0.73
    37     Pain/symptom provocation on palpation of relevant neural tissues         0.07     0.57     0.19

    19 / 25

  • Discarded Variables

    ▶ Many of the discarded variables are related to the clustering variables.

    [Figure: heat map of counts (0–400) cross-tabulating the discarded criteria against the selected criteria.]

    ▶ These are not clustering variables because they do not exhibit clustering beyond what can be explained by the clustering variables.

    20 / 25

  • Summary

    ▶ Model-based approaches to clustering and variable selection achieve excellent performance.

    ▶ Removing independence assumptions in the model achieves improved variable selection. Care is needed when interpreting the chosen/discarded variables.

    21 / 25

  • Simulation 1

    [Diagram: Simulation 1 graphical model linking the latent class z to variables X1–X12.]

    22 / 25

  • Simulation 1 Results

    [Figure: variable selection proportions (0 to 1) for X1–X12.]

    23 / 25

  • Simulation 2

    [Diagram: Simulation 2 graphical model linking the latent class z to variables X1–X10.]

    24 / 25

  • Simulation 2 Results

    [Figure: variable selection proportions (0 to 1) for X1–X10.]

    25 / 25

  • Exploring the dependence structure between categorical variables: Benefits and limitations of using variable selection within Bayesian clustering

    Michail Papathomas

    6th IBS Channel Network Conference - Hasselt - 2017

    (University of St Andrews) 1 / 37

  • Collaborators

    Main collaborators in the development of the clustering approach [profile regression, based on the Dirichlet process]:

    I Silvia Liverani
    I David Hastie
    I John Molitor
    I Sylvia Richardson

    Work on the relation between clustering and log-linear modelling with Sylvia Richardson

    (University of St Andrews) 2 / 37

  • Motivation

    Dependence structure and log-linear modelling

    I Assume observations from P categorical variables {x.1, ..., x.P}. For example,

      Subject   Smoking (X)   Drinking (Y)
      John           0             0
      Mary           0             1
      Jim            1             1
      ...

    I The resulting data can be arranged as counts in a P-way contingency table. For example,

    Table 1: Smoking (X) and Drinking (Y).
                           Drinking (Y)
    Smoking (X)        No (0)      Yes (1)
    No (0)               456          44
    Yes (1)              583         911

    (University of St Andrews) 3 / 37

  • Motivation

    Dependence structure and log-linear modelling

    I Denote the cell counts as n_l, l = 1, ..., n.
    I A Poisson distribution is assumed for the counts, so that E(n_l) = µ_l.
    I A Poisson log-linear model log(µ) = X_DM λ is a GLM that relates the expected counts to the variables.

    Table 1: Smoking (X) and Drinking (Y).
                           Drinking (Y)
    Smoking (X)        No (0)      Yes (1)
    No (0)               456          44
    Yes (1)              583         911

    (University of St Andrews) 4 / 37

  • Motivation

    Dependence structure and log-linear modelling

    Sometimes, the dependence structure between {x.1, ..., x.P} (marginal and conditional independence) can be inferred from the form of the log-linear model. For example,

    I  log(µ_ij) = λ + λ^X_i + λ^Y_j + λ^XY_ij   implies that X and Y are dependent.

    I  log(µ_ij) = λ + λ^X_i + λ^Y_j   implies marginal independence.

    I In practice, for more than 3 variables, interpreting the dependence structure by looking at the presence/absence of interaction terms becomes too difficult.

    (University of St Andrews) 5 / 37
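    As a concrete illustration (my addition, not a slide), Table 1 can be fitted as Poisson log-linear GLMs in R, comparing the independence model with the saturated one:

        tab <- data.frame(smoking  = factor(c(0, 0, 1, 1)),
                          drinking = factor(c(0, 1, 0, 1)),
                          count    = c(456, 44, 583, 911))
        saturated   <- glm(count ~ smoking * drinking, family = poisson, data = tab)
        independent <- glm(count ~ smoking + drinking, family = poisson, data = tab)
        anova(independent, saturated, test = "Chisq")   # tests the XY interaction term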

  • Motivation

    Dependence structure and log-linear modelling

    Joint probabilities for the categorical variables are obtained using the parameters of the log-linear model. In our simple example,

    P(X = i, Y = j) = µ_ij / Σ_{i',j'=0,1} µ_i'j'.

    Then, for instance,

    P(smoker, drinker) = P(X = 1, Y = 1) = µ_11 / (µ_00 + µ_10 + µ_01 + µ_11)
                       = exp(λ + λ^X_1 + λ^Y_1 + λ^XY_11) / (exp(λ) + exp(λ + λ^X_1) + ...)

    I So, in principle, we could fully explore the variables' dependence structure by calculating many joint probabilities and considering the laws of probability,
    I or, even better, by considering graphical log-linear models.

    (University of St Andrews) 6 / 37

  • Motivation

    Graphical models

    I They allow us to visualize and build complex dependence structures for the covariates under consideration.

    I Undirected Graphs and Directed Acyclic Graphs allow us to define and write complex joint distributions through factorization and conditional independence; Lauritzen (2011).

    I Neighborhoods of models are easily defined, and it is straightforward to move in the space of models by adding, removing or replacing edges.

    (University of St Andrews) 7 / 37

  • Motivation

    Example of a graphical model

    (University of St Andrews) 8 / 37

  • Motivation

    Problems with Log-linear modelling for detectinginteractions

    When it is of interest to detect interactions between covariates, log-linear regression modelling may become problematic.

    I In a classical setting, fitting log-linear models with many parameters sometimes requires an impractically large vector of observations for valid inferences (Burton et al., IJE, 2009). Identifiability and collinearity problems are also often present.

    I In Bayesian model comparison, the space of models becomes vast, and model search algorithms like the Reversible Jump approach (Green, Bka 1995) require an impractically large number of iterations before they converge (Dobra and Massam, St Meth 2010).

    (University of St Andrews) 9 / 37

  • Bayesian clustering with the Dirichlet process

    Bayesian clustering with the Dirichlet process

    I Partitions the subjects into groups according to their profile
    I Flexible Bayesian clustering
    I Uncertainty with regard to the clustering is evaluated
    I Post-processing leads to tractable output

    (University of St Andrews) 10 / 37

  • Bayesian clustering with the Dirichlet process

    Notation

    I Consider categorical variables x.p, p = 1, ..., P.
    I For individual i, denote the variable/covariate profile by x_i = (x_i1, ..., x_iP).
    I For example,
      x.1: smokes, does not smoke
      x.2: drinks, does not drink
      x.3: exercises, does not exercise
    I Now, for instance, x_i = (smokes, drinks, does not exercise).

    (University of St Andrews) 11 / 37

  • Bayesian clustering with the Dirichlet process

    Notation

    For individual i:
    I z_i = c allocates subject i to cluster c.
    I φ_cp(x) is the probability that variable x.p = x, given z_i = c.
    I Given z_i = c, x.p has a multinomial distribution with cluster-specific parameters φ_cp = [φ_cp(1), ..., φ_cp(Mp)].
    I A priori, φ_cp ∼ Dirichlet(λ_1, ..., λ_Mp).
    I ψ_c denotes the probability that a subject is assigned to cluster c.

    (University of St Andrews) 12 / 37

  • Bayesian clustering with the Dirichlet process

    Statistical Framework

    For φ = {φ_cp, c ∈ N, p = 1, ..., P}:
    I 'stick-breaking' prior on the allocation weights ψ_c.
    I x.1, ..., x.P are assumed independent given the clustering allocation and parameters...
    I ...and calculating joint probabilities for the categorical variables becomes easy!
    I Pr(x_i | z, φ) = Π_{p=1}^{P} φ_{z_i p}(x_ip) for i = 1, 2, ..., n.
    I This implies

    Pr(x_i | φ, ψ) = Σ_{c=1}^{∞} Pr(z_i = c | ψ) Π_{p=1}^{P} Pr(x_ip | z_i = c) = Σ_{c=1}^{∞} ψ_c Π_{p=1}^{P} φ_cp(x_ip).

    (University of St Andrews) 13 / 37
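    A small R sketch of the 'stick-breaking' construction of the allocation weights ψ_c mentioned above; truncating at C components is an assumption made purely for illustration (the talk uses the full Dirichlet process):

        stick_breaking <- function(C, alpha = 1) {
          v <- rbeta(C, 1, alpha)                  # stick-breaking fractions
          v[C] <- 1                                # close the stick at the truncation level
          v * c(1, cumprod(1 - v[-C]))             # psi_c = v_c * prod_{l < c} (1 - v_l)
        }
        set.seed(1)
        psi <- stick_breaking(C = 20, alpha = 1)   # allocation weights; sum(psi) equals 1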

  • Bayesian clustering with the Dirichlet process

    Using 0-1 variable selection switches

    I Identify covariates that contribute more than others to the formation of clusters. [Tadesse et al. (2005); Chung and Dunson (2009); Papathomas et al. (2012)]

    I Cluster-specific binary indicators γ^c_p, so that γ^c_p = 1 when covariate x.p is important for allocating subjects to cluster c; otherwise γ^c_p = 0.

    I Prior for switches: given ρ_p, γ^c_p ∼ Bernoulli(ρ_p).

    I We consider a sparsity-inducing prior for ρ with an atom at zero:

    ρ_p ∼ 1{w_p=0} δ_0(ρ_p) + 1{w_p=1} Beta(α_ρ, β_ρ),  where w_p ∼ Bernoulli(0.5).

    Similar to Chung and Dunson (JASA, 2009), but in their set-up covariate observations contribute to the likelihood through a regression model. In our case, covariate observations contribute directly to the likelihood, and we introduce π_p(x).

    (University of St Andrews) 14 / 37

  • Example - Using the R package PReMiuM; Liverani et al. (2015, JSS)

    I Simulated observations from 10000 subjects, recording...
    I 10 binary variables, say {SMO, DRI, EXE, D, E, ..., H, I, J}.

  • Example

    Table 2: Cluster profiles (in parentheses, the number of subjects typically allocated to each group), Simulation 1.

                   SMO   DRI   EXE    D     E     F     G     H     I     J
    Median(ρp)    0.36  0.78  0.32  0.75  0.06  0.05  0.00  0.48  0.57  0.50
    Group 1 (5465), Group 2 (3159), Group 3 (1376): [cluster profile symbols]

  • Linear graphical model determination

    Clustering and linear modelling

    I It is not clear how clustering output translates to interactions in a log-linear regression modelling framework.

    I Can we assist the process of comparing a large number of linear models with the clustering variable selection results?

    I The important aspect of a model that combines clustering and variable selection is that covariates are not chosen according to the size of their marginal effect. They are selected because they combine to create distinct groups of subjects. Consequently, we expect this type of modelling to be able to inform on interactions in a linear model setting.

    (University of St Andrews) 18 / 37

  • Linear graphical model determination

    Theoretical results

    Theorem 1: Consider random variables x.p and x.q, 1 ≤ p, q ≤ P, p ≠ q. If Σ_{c=1}^{C} γ^c_p × γ^c_q = 0, then x.p and x.q are independent.

    Theorem 2: Consider a set of random variables {x.1, ..., x.P}. If, for some p ∈ {1, ..., P}, Σ_{c=1}^{C} γ^c_p × γ^c_q = 0 for all q ≠ p, then x.p is independent of {x.1, ..., x.P} \ x.p.

    Proofs: see Papathomas and Richardson (2016, JSPI). Note that the converse is not true.

    The previous Theorems imply the following Corollary.

    Corollary: Consider covariate x.p. If Σ_{c=1}^{C} γ^c_p = 0, then x.p is independent of all other covariates.

    (University of St Andrews) 19 / 37

  • Linear graphical model determination

    Theoretical results

    Therefore,
    I if the selection probability ρ_p for x.p is zero or close to zero, which implies that Σ_{c=1}^{C} γ^c_p is also zero or close to zero, we can assume that x.p is independent of all other covariates.
    I Assuming that our interest lies in exploring interactions, to reduce the dimensionality of the problem when fitting log-linear models to sparse contingency tables, x.p could be removed from the analysis.

    (University of St Andrews) 20 / 37

  • Simulated data using graphical log-linear models

  • Linear graphical model determination

    Construction of matrix T γ

    I For iteration it and for each cluster c with more than one subject, form the matrix T^{c,it}, so that element (p1, p2), 1 ≤ p1 < p2 ≤ P, is either zero or one, and equal to γ^c_{p1}(it) × γ^c_{p2}(it). All other matrix cells are empty.

    I Sum up all matrices T^{c,it}, weighting by cluster size, to create an information matrix T^γ:

    T^γ = Σ_it Σ_c n_{c,it} × T^{c,it},

    where n_{c,it} is the size of cluster c at iteration it. Therefore, T^γ is a straightforward summary of all T^{c,it} matrices into one, with small clusters contributing less to this summary.

    I For ease of interpretation, reweight the elements of T^γ so that the maximum element is one: T^γ = (max{T^γ})^{-1} × T^γ.

    The matrix T^γ is constructed in such a manner that if element t^γ(p1, p2), 1 ≤ p1 < p2 ≤ P, is close to zero, an edge between x.p1 and x.p2 is unlikely to be present in a highly supported graphical model (a small R sketch of this construction follows below).

    (University of St Andrews) 22 / 37
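    A minimal R sketch of the construction just described (my own illustration); the storage format assumed below, a list with a C × P binary matrix gamma and a length-C vector size per MCMC iteration, is an assumption and not the output format of any particular package:

        build_T_gamma <- function(mcmc) {
          P  <- ncol(mcmc[[1]]$gamma)
          Tg <- matrix(0, P, P)
          for (it in mcmc) {
            keep <- which(it$size > 1)                   # clusters with more than one subject
            for (cc in keep) {
              g  <- it$gamma[cc, ]                       # selection switches gamma^c_p at this iteration
              Tg <- Tg + it$size[cc] * tcrossprod(g)     # add n_{c,it} * gamma_p1 * gamma_p2
            }
          }
          diag(Tg) <- 0                                  # only the p1 < p2 cells are defined
          Tg / max(Tg)                                   # rescale so the maximum element is one
        }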

  • Linear graphical model determination

    Example. Simulation 1

    I Simulated observations from 10000 subjects.
    I 10 binary categorical variables {A, B, C, ..., H, I, J} are observed.
    I Observations are simulated according to the log-linear model

    log(µ) = λ + λ^A + λ^B + λ^C + ... + λ^J + λ^AB + λ^BC + λ^CD + λ^DA + λ^HI + λ^IJ + λ^HJ + λ^HIJ

    I (For more details see Papathomas and Richardson (2016).)

    (University of St Andrews) 23 / 37

  • Example. Simulation 1

    log(µ) = λ + λ^A + λ^B + λ^C + ... + λ^J + λ^AB + λ^BC + λ^CD + λ^DA + λ^HI + λ^IJ + λ^HJ + λ^HIJ

  • Example. Simulation 1

    log(µ) = λ + λ^A + λ^B + λ^C + ... + λ^J + λ^AB + λ^BC + λ^CD + λ^DA + λ^HI + λ^IJ + λ^HJ + λ^HIJ

    Table 2: Cluster profiles (in parentheses, the number of subjects typically allocated to each group), Simulation 1.

                    A     B     C     D     E     F     G     H     I     J
    Median(ρp)    0.36  0.78  0.32  0.75  0.06  0.05  0.00  0.48  0.57  0.50
    Group 1 (5465), Group 2 (3159), Group 3 (1376): [cluster profile symbols]

  • Linear graphical model determination

    Example. Simulation 1

    T^sim1_γ =

            B      C      D      E      F      G      H      I      J
    A     .52    .08    .50    .04    .02    .02    .20    .27    .15
    B            .45     1     .06    .04    .03    .47    .64    .47
    C                   .45    .02    .02   .009    .12    .23    .16
    D                          .06    .04    .03    .45    .65    .48
    E                                .003   .003    .03    .04    .03
    F                                       .002    .02    .03    .03
    G                                               .02    .02    .02
    H                                                      .61    .56
    I                                                             .74

    (University of St Andrews) 26 / 37

  • Linear graphical model determination

    Example. Simulation 1

    Table 3: Mixing performance of samplers. The median number of iterations to the best model is calculated over 30 runs of the reversible jump MCMC chain. First and third quartiles are given in parentheses. PDV denotes the unrefined model search strategy adopted in Papathomas et al. (2011b). See Figure 2 for the highest posterior probability model.

    Simulation 1
                                   Acceptance rate    Iterations (median) to highest    Posterior probability for
                                   as a percentage    posterior probability model       highest probability model
    (a) Uniformly random (PDV)           5.1               590 (452, 821)                        0.55
    (b) Cluster specific                 3.8               247 (164, 369)                        0.55
    (c) Combined (30%, 10%)              5.3               540 (290, 674)                        0.53
    (d) Combined (20%, 20%)              4.9               403 (312, 493)                        0.55

    (University of St Andrews) 27 / 37

  • Linear graphical model determination

    Real data example

    I 30 single nucleotide polymorphisms (SNPs) in chromosomes 6 and 15. (Data from 4260 subjects in a genome-wide association study of lung cancer presented in Hung et al. (2008).)

    I 12 SNPs were indicated as important by variable selection within clustering. (Two from chromosome 15 and ten from chromosome 6.)

    I The SNPs were highly correlated, so 3 SNPs were included in the competing log-linear graphical models as representatives: rs8034191 from chromosome 15 and {rs4324798, rs1950081} from chromosome 6.

    I Age, gender and smoking status were also included in the competing log-linear graphical models, to search for gene-environment interactions.

    (University of St Andrews) 28 / 37

  • Linear graphical model determination

    Real data example

    I Reducing the number of SNPs from 30 to 12, and then to 3, allows the use of reversible jump MCMC to compare competing graphical models. The 2^33 contingency table would be too sparse, with the vast majority of cells equal to zero.

    I The highest posterior probability model (P = 0.8) is

    'SNP1 + SNP2 + SNP3 + AGE*GENDER*SMOKING',

    which does not support the presence of gene-gene or gene-environment interactions.

    (University of St Andrews) 29 / 37

  • Real data example

    ‘SNP1 + SNP2 + SNP3 + AGE ∗GENDER ∗ SMOKING′

    Table 6: Cluster profiles (in parentheses, the number of subjects typically allocated to each group). Genetic-environmental data (GE).

                  rs8034191 (A)   rs4324798 (B)   rs1950081 (C)   age (D)   gender (E)   smoking (F)
    Median(ρp)        0.01            0.00            0.10          0.92       0.82          0.85
    Cluster 1 (2222), Cluster 2 (2059): [cluster profile symbols]

  • Real data example

    ‘SNP1 + SNP2 + SNP3 + AGE ∗GENDER ∗ SMOKING′

    T^Real data_γ =

             S2       S3      AGE      GEN      SM
    S1     0.002     .01      .06      .06     .06
    S2              .001      .02      .02     .02
    S3                        .09      .07     .08
    AGE                                 1      .98
    GEN                                        .88

  • Linear graphical model determination

    Real data example

    Table 7: Mixing performance of samplers. The median number of iterations to the best model is calculated over 300 runs of the reversible jump MCMC chain. First and third quartiles are given in parentheses. PDV denotes the unrefined model search strategy adopted in Papathomas et al. (2011b).

    Genetic-environmental data [including important (characterized as such by clustering) representative SNPs]; highest probability model 'A+B+C+DEF'.
                                 Acceptance rate    Iterations (median) to highest    Posterior probability for
                                 as a percentage    posterior probability model       highest probability model
    (a) Uniformly random               6.3              564 (257, 1205)                       0.53
    (b) Cluster specific               8.4              196 (83, 443)                         0.51
    (c) Combined (30%, 10%)            6.9              310 (147, 670)                        0.51
    (d) Combined (20%, 20%)            7.5              235 (91, 516)                         0.52

    (University of St Andrews) 32 / 37

  • Relevant work

    Relevant work on latent class structures and log-linear modelling (Johndrow et al., 2014)

    I Johndrow et al. (2014) consider standard and novel latent class structures. The DP is a special case.

    I Its rank is defined as the minimum number of clusters required to describe the joint probability tensor for the categorical covariates.

    I Bounds are derived for the rank, in relation to the number and structure of the interactions in a weakly hierarchical log-linear model.

    I A massive reduction in the upper bound of the rank is shown under a sparse log-linear model.

    I The rank of the latent structure depends only on variables that are not marginally independent.

    I A straightforward application gives that an upper bound of the rank corresponding to simulation 1 is 2^7, rather than the default 2^9. The upper bound corresponding to simulation 5 is 2^8, rather than the default 2^99.

    (University of St Andrews) 33 / 37

  • Relevant work

    Relevant work on latent class structures and log-linear modelling (Zhou et al., 2015)

    I Zhou et al. (2015) also utilize the idea that marginally independent variables reduce the dimensionality of the problem.

    I A PARAFAC factorization is adopted, which can be viewed as a more general representation of the Dirichlet process.

    I Dimensionality reduction is achieved with the sparse PARAFAC (sp-PARAFAC) formulation, where marginal independence is modelled with quantities similar in nature to the π_p(x).

    I The focus is on providing expressions for parameters of the log-linear models, assessing the level of shrinkage, and the convergence of the probability tensor induced by sp-PARAFAC to the true probability tensor.

    I The prior formulation for detecting marginally independent covariates and reducing dimensionality is also different in the two approaches.

    I Different objectives, as we focus on accelerating log-linear model selection with the Reversible Jump by utilizing the clustering process.

    (University of St Andrews) 34 / 37

  • Additional remarks - References

    Summary

    I The advantage in utilizing variable selection within partitioning to inform log-linear model selection is mostly pertinent to marginal independence.

    I For sparse contingency tables, this information can lead to a substantial reduction in the number of covariates considered, making the exploration of the model space feasible.

    I Informing the model search algorithm with T^γ often improves the efficiency of the search. Marginal independence is not always detected, because the converse of the Theorems does not hold.

    I Importantly, using T^γ to assist the model search never resulted in a worse algorithm, compared to the standard model search approach in Papathomas et al. (2011b).

    (University of St Andrews) 35 / 37

  • Additional remarks - References

    Further work

    I Sparse contingency tables and conditional independence
    I Improved model search algorithms
    I Sparse contingency tables and identifiability

    (University of St Andrews) 36 / 37

  • Additional remarks - References

    Some References
    I Chung, Y., Dunson, D.B. (2009) Nonparametric Bayes conditional distribution modelling with variable selection. J. Am. Stat. Assoc., 104, 1646-60.
    I Johndrow, J.E., Bhattacharya, A., Dunson, D.B. (2014) Tensor decompositions and sparse log-linear models. arXiv:1404.0396v1.
    I Liverani, S., Hastie, D.I., Azizi, L., Papathomas, M. and Richardson, S. (2015) PReMiuM: An R package for Profile Regression Mixture Models using Dirichlet Processes. Journal of Statistical Software, 64(7), 1-30.
    I Molitor, J.T., Papathomas, M., Jerrett, M. and Richardson, S. (2010) Bayesian Profile Regression with an Application to the National Survey of Children's Health. Biostatistics, 11, 484-498.
    I Papathomas, M., Molitor, J., Richardson, S., Riboli, E. and Vineis, P. (2011) Examining the joint effect of multiple risk factors using exposure risk profiles: lung cancer in non smokers. Environmental Health Perspectives, 119, 84-91.
    I Papathomas, M., Molitor, J., Hoggart, C., Hastie, D. and Richardson, S. (2012) Exploring data from genetic association studies using Bayesian variable selection and the Dirichlet process: application to searching for gene-gene patterns. Genetic Epidemiology, 36, 663-674.
    I Papathomas, M. and Richardson, S. (2016) Exploring dependence between categorical variables: benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms. Journal of Statistical Planning and Inference, 173, 47-63.
    I Zhou, J., Bhattacharya, A., Herring, A.H., Dunson, D.B. (2015) Bayesian factorizations of big sparse tensors. J. Am. Stat. Assoc., accepted manuscript.

    (University of St Andrews) 37 / 37

  • Fast sampling with Gaussian scale-mixture priors in high-dimensional regression

    Bani K. Mallick (joint with Anirban Bhattacharya & Antik Chakraborty), Department of Statistics, Texas A&M University

    April 13, 2017

  • Outline

    I High Dimensional Regression
    I Global-Local Scale Mixtures of Gaussians
    I Efficient sampling from structured multivariate Gaussian distributions
    I Applications

  • Motivation

    I High dimensional regression with p covariates and sample size n
    I p is much larger than n
    I Sparsity through continuous shrinkage priors
    I Need an efficient computational scheme for p greater than n problems

  • Linear model with global-local prior

    I Y = Xβ + ε, ε ∼ N(0, σ²I_n)
    I X: n × p matrix of covariates, where p is potentially much larger than n
    I In such a setting, one expects β, the vector of regression coefficients, to be sparse
    I Global-local sparse prior on β

  • Global-local Prior

    I β_j | λ_j, τ, σ ∼ N(0, λ²_j τ² σ²), j = 1, ..., p
    I λ_j ∼ f, τ ∼ g, σ ∼ h
    I f, g, h are densities supported on (0, ∞)
    I λ²_j: local variance component, providing the scale mixing
    I τ²: global variance component (like the regularization parameter in the penalized likelihood formulation)

  • Sparsity Prior βj by scale mixing

    I Create different sparsity priors by choosing f, the distribution of λ_j
    I Student-t [Tipping 2001]: f is inverse-gamma
    I Double-Exponential [Park and Casella, 2008]: f is exponential
    I Normal/Jeffreys [Bae and Mallick, 2004]: Jeffreys' prior
    I Horseshoe prior [Carvalho et al., 2010]: half-Cauchy

  • Horseshoe Priors

    I Flat, Cauchy-like tails allow strong signals to remain large (un-shrunk)
    I A tall spike at the origin provides severe shrinkage for the zero elements of β
    I Due to the scale mixture of Gaussians formulation, most of the conditional distributions are in explicit form

  • Posterior computation: high-dimensional regression

    I The conditional posterior of β given λ, τ and σ is Gaussian:

    β | y, λ, τ, σ ∼ N(A^{-1} X^T y, σ² A^{-1}),   A = (X^T X + D^{-1}),

    where D = τ² diag(λ²_1, ..., λ²_p).

    I Need to invert A, which is p × p
    I The covariance is no longer diagonal
    I The design matrix X distorts the prior geometry
    I Need to sample from a high-dimensional Gaussian per iteration
    I The p local scale parameters λ_j have conditionally independent posteriors: λ = (λ_1, ..., λ_p)^T is updated in a block

  • Computational Difficulties

    I A standard algorithm to sample from Gaussian distributions can be found in Rue (2001)
    I It avoids inverting A and instead performs a Cholesky decomposition of A and a series of linear system solutions to generate samples
    I This is efficient for moderate values of p, obtaining a Cholesky decomposition of A at each MCMC step

  • Algorithm

    I Let A be an n × n symmetric, positive definite matrix
    I Cholesky decomposition: L is a lower triangular matrix with L_ii > 0 and A = LL^T

    Algorithm: Solving Ax = b where A > 0
    (i) Compute the Cholesky decomposition A = LL^T.
    (ii) Solve Lv = b.
    (iii) Solve L^T x = v.
    (iv) Return x.

    I As x = (L^T)^{-1} v = (L^T)^{-1}(L^{-1} b) = (LL^T)^{-1} b = A^{-1} b
    I The solutions in (ii) and (iii) can be done through forward and backward substitution due to the triangular nature of L

  • Computational Difficulties

    I It becomes highly expensive for large p
    I One cannot resort to precomputing the Cholesky factors, since the matrix D changes from one iteration to the next
    I The resulting computational bottleneck obscures the computational advantages of global-local priors when p is large

  • General scheme

    I Given X ∈

  • General scheme

    I Rue (2001): sampling from N(µ = Q^{-1} b, Σ = Q^{-1}):

    Algorithm (Rue, 2001)
    (i) Compute the Cholesky decomposition Q = LL^T.
    (ii) Solve Lv = b.
    (iii) Solve L^T m = v.
    (iv) Solve L^T w = z, where z ∼ N(0, I_p).
    (v) Set θ = m + w. Then θ ∼ N(µ, Σ).

    I Original motivation: GMRFs, where Q is banded.
    I The Cholesky factor in step (i) can then be computed very fast.
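    A direct R transcription of this Cholesky-based sampler (a sketch, not the authors' code); it returns one draw from N(Q^{-1} b, Q^{-1}):

        rue_draw <- function(Q, b) {
          p <- nrow(Q)
          U <- chol(Q)                     # R's chol() returns upper-triangular U with Q = U'U, so L = t(U)
          v <- forwardsolve(t(U), b)       # (ii)  solve L v = b
          m <- backsolve(U, v)             # (iii) solve L' m = v
          w <- backsolve(U, rnorm(p))      # (iv)  solve L' w = z, with z ~ N(0, I_p)
          m + w                            # (v)   theta = m + w ~ N(Q^{-1} b, Q^{-1})
        }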

  • General scheme

    Present setting: Q = (X^T X + D^{-1}) and b = X^T Y.

    Algorithm (Rue, 2001)
    (i) Compute the Cholesky decomposition (X^T X + D^{-1}) = LL^T.
    (ii) Solve Lv = X^T Y.
    (iii) Solve L^T m = v.
    (iv) Solve L^T w = z, where z ∼ N(0, I_p).
    (v) Set θ = m + w. Then θ ∼ N(µ, Σ).

    I (X^T X + D^{-1}) is a p × p dense matrix.
    I The Cholesky decomposition is costly in the present setting.
    I Overall complexity: O(p³).

  • Our proposal

    I Sample from N_p(µ, Σ) with

    Σ = (X^T X + D^{-1})^{-1},   µ = Σ X^T Y.

    I By the Woodbury matrix identity,

    Σ = (X^T X + D^{-1})^{-1} = D − DX^T (XDX^T + I_n)^{-1} XD.

    I One can show µ = DX^T (XDX^T + I_n)^{-1} Y.

    I Sample η ∼ N(0, Σ) and set θ = µ + η.

    I Data augmentation: sample an (n + p)-dimensional quantity.

  • Our proposal

    I Recall P = (XDX^T + I_n), S = XD and R = D.
    I η ∼ N(0, Σ)
    I η = u − S^T P^{-1} v
    I µ = DX^T (XDX^T + I_n)^{-1} Y
    I µ = S^T P^{-1} Y
    I Finally θ = µ + η
    I θ = u + S^T P^{-1} (Y − v)

  • Proposed algorithm

    Algorithm: Proposed algorithm
    (i) Sample u ∼ N(0, D) and δ ∼ N(0, I_n) independently.
    (ii) Set v = Xu + δ.
    (iii) Solve (XDX^T + I_n) w = (Y − v).
    (iv) Set θ = u + DX^T w.

    Overall complexity: O(n²p) if p > n. In p ≫ n settings, this is a reduction from cubic to linear in p.
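    A direct R transcription of the proposed algorithm above (a sketch; σ² is absorbed into the prior variances d here, which is an assumption of the example, and the data below are purely illustrative):

        fast_gaussian_draw <- function(X, y, d) {        # d = diag(D): the prior variances
          n <- nrow(X); p <- ncol(X)
          u     <- rnorm(p, sd = sqrt(d))                # (i)  u ~ N(0, D)
          delta <- rnorm(n)                              #      delta ~ N(0, I_n)
          v     <- as.vector(X %*% u) + delta            # (ii) v = X u + delta
          XD    <- X * rep(d, each = n)                  #      X D (scale column j of X by d_j)
          w     <- solve(XD %*% t(X) + diag(n), y - v)   # (iii) solve (X D X' + I_n) w = y - v
          u + as.vector(t(XD) %*% w)                     # (iv) theta = u + D X' w
        }

        # illustrative use
        set.seed(1)
        n <- 200; p <- 2000
        X <- matrix(rnorm(n * p), n, p); y <- rnorm(n)
        theta <- fast_gaussian_draw(X, y, d = rep(1, p))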

  • Time comparison

    Table: Absolute time (in seconds) to run 6000 iterations of the Gibbs sampler, reported for the two algorithms for chosen values of p.

       p      proposed        old
      200        5.50        6.05
      500        8.79       31.03
     1000       12.83      160.92
     2000       20.04      944.78
     3000       27.60     2616.80
     4000       35.76     5775.70
     5000       43.99    11314.28

    Big gains when p is large.

  • Simulation setting

    I Replicated simulation study with the horseshoe prior.
    I n = 200 and p = 5000. The true β_0 has 5 non-zero entries.
    I Two signal strengths:
      (i)  weak:     β_0S = ±(0.75, 1, 1.25, 1.5, 1.75)
      (ii) moderate: β_0S = ±(1.5, 1.75, 2, 2.25, 2.5)
    I Two types of design matrix:
      (i)  independent:       X_j i.i.d. N(0, I_p)
      (ii) compound symmetry: X_j i.i.d. N(0, Σ), Σ_jj' = 0.5 + 0.5 δ_jj'
    I 100 data sets generated.
    I Compare the horseshoe with MCP and SCAD.

  • Simulation Results

    [Figure: boxplots of ℓ1, ℓ2 and prediction error across 100 simulation replicates. HSme and HSm respectively denote the pointwise posterior median and mean for the horseshoe prior. The true β_0 is 5-sparse with non-zero entries ±{1.5, 1.75, 2, 2.25, 2.5}. Top row: Σ = I_p (independent). Bottom row: Σ_jj = 1, Σ_jj' = 0.5 for j ≠ j' (compound symmetry).]

  • Simulation Results

    [Figure: same setting as in the previous figure; the true β_0 is 5-sparse with non-zero entries ±{0.75, 1, 1.25, 1.5, 1.75}.]

  • Confidence/Credible intervals

    I Recent focus in the frequentist literature on post-selection inference (POSI): van de Geer et al. (2013), Javanmard & Montanari (2014).
    I Provide confidence sets for coefficients of subsets of variables.
    I Bayesian variable selection by post-processing MCMC output with one-group priors (Bondell & Reich, 2012; Hahn & Carvalho, 2015; Li & Pati, 2015).
    I Used in conjunction with shrinkage priors.

  • Simulation Results

    p                                 500                                                1000
    Design              Independent            Comp Symm                 Independent            Comp Symm
                      HS     LASSO    SS     HS     LASSO    SS        HS     LASSO    SS     HS     LASSO    SS
    Signal Coverage  93(1.0) 75(12.0) 82(3.7) 95(0.9) 73(4.0) 80(4.0)  94(2.0) 78(12.0) 85(5.1) 94(1.0) 77(2.0) 82(7.4)
    Signal Length     42      46      41      85      71      75       39      41      42      82      76      77
    Noise Coverage  100(0.0) 99(0.8) 99(1.0) 100(0.0) 98(1.0) 99(0.6)  99(0.0) 99(1.0) 98(0.9) 100(0.0) 99(1.0) 99(0.1)
    Noise Length       2      43      40       4      69      73      0.6      42      41     0.7      76      77

    Frequentist coverages (%) and 100 × lengths of pointwise 95% intervals. Average coverages and lengths are reported after averaging across all signal variables (rows 1 and 2) and noise variables (rows 3 and 4). Values in parentheses are standard errors (%) for the coverages. LASSO and SS respectively stand for the methods in van de Geer et al. (2013) and Javanmard & Montanari (2014). The intervals for the horseshoe (HS) are the symmetric posterior credible intervals.

  • TCGA Data

    I The Cancer Genome Atlas (TCGA) provides comprehensive molecular profiles for each of at least 33 different human tumor types (http://cancergenome.nih.gov) (Akbani et al., 2013).
    I The data portal has DNA and reverse-phase protein array (RPPA) expression data, with several other clinical variables, e.g. survival time of the patients.
    I Integrative analysis is possible as we have quantitative protein expression data over large cohorts of well-characterized TCGA patient tumors, with linked DNA and RNA analyses.

  • Classification Analysis

    I We have integrated the RPPA data with the DNA expression data for our analysis.
    I Toward the goal of classification, we consider two types of lung tumor data: lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC).
    I After some clean-up of the raw data we end up with 179 proteins and 76 genes which are present in both tumor types. The tumors have 55 and 45 samples respectively.
    I Our goal is to classify the tumor groups with the help of genes and proteins together, and simultaneously select the important genes and proteins.

  • Probit model

    I Suppose y_i ∈ {0, 1} and X_i ∈ R^p. The probit model sets

    y_i = 1 if X_i^T β + ε_i > 0, and 0 otherwise, with ε_i ∼ N(0, 1).   (2)

    I Given a posterior for β, predictions can be done using the posterior predictive distribution.

  • Variable selection

    I To select the genes and proteins that are important to the tumor classification, we use the horseshoe shrinkage prior from Carvalho et al. (2009) on the coefficients β.
    I Specifically, π(β_j | λ_j, τ) ∼ N(0, λ²_j τ²) and λ_j, τ ∼ C+(0, 1) for j = 1, ..., p.
    I The latent variable formulation enables simple conditional Gibbs steps for posterior computation.
    I However, we need to employ some post-processing scheme to select important variables, due to the continuous nature of the prior.

  • Post processing posterior summaries of β

    I Let β̄ denote the posterior mean of β.
    I The posterior predictive loss ||Xβ̄ − Xβ||²_2 can be minimized subject to an ℓ1 constraint to perform selection.
    I A decision-theoretic justification of such a post-processing step can be found in Hahn & Carvalho (2015).
    I We minimize the following objective function to select important variables:

    β^P = argmin_β ||Xβ̄ − Xβ||²_2 + Σ_{j=1}^{p} λ_j |β_j|.   (3)

    I Following Chakraborty et al. (2016), we use λ_j = 1/β̄_j².
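    A hedged R sketch of this post-processing step, using glmnet's per-coefficient penalty.factor to carry the weights λ_j = 1/β̄_j²; choosing the overall penalty level by cross-validation is my assumption for the example, not the authors' rule:

        library(glmnet)
        post_process <- function(X, beta_bar) {
          fitted_mean <- as.vector(X %*% beta_bar)              # X beta-bar, the posterior-mean fit
          w <- 1 / beta_bar^2                                    # heavier penalty on near-zero coefficients
          cvfit <- cv.glmnet(X, fitted_mean, penalty.factor = w, intercept = FALSE)
          as.vector(coef(cvfit, s = "lambda.min"))[-1]           # sparse beta^P (intercept dropped)
        }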

  • Results

    I To compare our results we choose the Lasso penalty (Tibshirani, 1996) with a logistic link function from the R package ncvreg/glmnet.
    I We report the selected model size and misclassification rate for both methods.
    I For the horseshoe probit model the misclassification rate was 5% and the selected model size was 5: the selected genes are GSKA-3 and ERBB3, and the proteins are PI3KP110ALPHA, MYOSINIIA_pS1943, and ANNEXIN1.
    I For the LASSO logistic model they were 5% and 32 respectively.
    I The variables selected by the horseshoe probit model were also selected by the LASSO logistic model.

  • The selected variables

    I The effect of the GSKA-3 gene, as one of the hypoxia-inducible factors resulting in solid cancer, was established in Gort et al. (2008).
    I The gene ERBB3 has been seen to have a direct impact on lung cancer; see Sitharaman & Anderson (2008).
    I A recent work by Elkabets et al. (2013) studies in detail the effect of the PI3KP110ALPHA gene on breast cancer and lung cancer.
    I Wong et al. (2012) developed treatments for the suppression of lung cancer cell tumor markers related to the ANNEXIN1 gene.

  • Survival Analysis for pan-cancer data

    I One of the fundamental interests is to establish the relationship between patients' survival and different proteins
    I Data come from different kinds of tumors
    I As the sample size may not be very large in each group, a Bayesian hierarchical model will be useful to borrow strength
    I Furthermore, to deal with high dimensionality, we require either a penalized approach (frequentist statistics) or a shrinkage-based approach (Bayesian statistics)
    I We apply Bayesian techniques with horseshoe priors

  • Kidney Tumors

    I We consider the kidney tumors: Kidney Chromophobe (KICH), 63 samples; Kidney renal clear cell carcinoma (KIRC), 150 samples; Kidney renal papillary cell carcinoma (KIRP), 124 samples
    I All the tumors have 189 proteins

  • AFT model

    I Typically, survival outcomes have two variables: the time (t) and a censoring indicator (δ), which allows the data to be either censored or not
    I i: patient, j: protein, k: tumor type
    I We fit an Accelerated Failure Time (AFT) model:

    log(t_ik) = Σ_{j=1}^{p} x_ijk β_jk + ε_ik,

    where t_ik is the survival time of the i-th patient who has the k-th cancer, the other symbols have their usual meaning, and the ε_ik are iid N(0, σ²).

    I For Bayesian MCMC we impute the censored data w_ik:
      w_ik = log t_ik if t_ik is an event time
      w_ik > log t_ik if t_ik is right censored

  • Shrinkage Prior

    We can carry out a regular shrinkage analysis as in linear models. Here we adopt the global-local horseshoe prior and fit an individual regression model for each kind of tumor:

    log t_ik | β_jk, σ²_k ∼ N( Σ_{j=1}^{p} x_ijk β_jk, σ²_k )

    β_jk | λ_jk, τ, σ²_k ∼ N(0, λ²_jk τ² σ²_k)

    λ_jk ∼ C+(0, 1),  τ ∼ C+(0, 1),  π(σ²_k) ∝ 1/σ²_k

  • Estimation of Protein Effects

  • Pan Cancer Model

    I The previous plot shows that, due to the shrinkage power of the horseshoe prior and the absence of a sufficient number of samples in each kidney tumor group, almost all of the proteins turn out to be insignificant in explaining the survival curve
    I So we decide to fit a pan-cancer model
    I While fitting this model we make use of the idea of borrowing strength across cancers, by specifying the prior distributions of the parameters accordingly.

  • Pan Cancer Model

    log t_ik | β_jk, σ² ∼ N( Σ_{j=1}^{p} x_ijk β_jk, σ² )

    β_jk | λ_jk, τ, σ² ∼ N(b_j, λ²_jk τ² σ²)

    λ_jk | τ, σ² ∼ C+(0, 1),  τ | σ² ∼ C+(0, 1) I(0, 1),  π(σ²) ∝ 1/σ²

    b_j ∼ N(0, σ²_b),  j = 1, ..., p

    Then,

    Corr(β_jk, β_jk') = σ²_b / [ (λ²_jk τ² σ² + σ²_b)^{1/2} (λ²_jk' τ² σ² + σ²_b)^{1/2} ]

  • Estimation

  • Result

    I Some of the proteins which are significant for the KIRC tumor group are BAK, CRAF_pS338, GAB2, HER3_pY1298, PCADHERIN, PCNA, RAD51, FOXO3A_pS318S321, SF2, DIRAS3
    I The effects of these proteins have been discussed in the literature; e.g. see Adams et al. (2012), GAB2 - A Scaffolding Protein in Cancer, Molecular Cancer Research
    I The significant proteins for the other tumor groups can be found in a similar fashion
