
Dirichlet process Bayesian clustering with the R package PReMiuM

Dr Silvia Liverani
Brunel University London

July 2015

Silvia Liverani (Brunel University London) Profile Regression 1 / 18


Outline

- Motivation
- Method
- R package PReMiuM
- Examples


Many collaborators

- John Molitor (University of Oregon)
- Sylvia Richardson (Medical Research Council Biostatistics Unit)
- Michail Papathomas (University of St Andrews)
- David Hastie
- Aurore Lavigne (University of Lille 3, France)
- Lucy Leigh (University of Newcastle, Australia)
- ...


Motivation

Multicollinearity

- The goal of epidemiological studies is to investigate the joint effect of different covariates / risk factors on a phenotype...
- ... but highly correlated risk factors create collinearity problems!

Example
Researchers are interested in determining if a relationship exists between blood pressure (y = BP, in mm Hg) and

- weight (x1 = Weight, in kg)
- body surface area (x2 = BSA, in sq m)
- duration of hypertension (x3 = Dur, in years)
- basal pulse (x4 = Pulse, in beats per minute)
- stress index (x5 = Stress)


Motivation

[Figure: scatter plot of BSA (vertical axis) against Weight (horizontal axis, ticks from 86 to 100).]

BP = y, Weight = x1, BSA = x2


Motivation

- Highly correlated risk factors create collinearity problems, causing instability in model estimation

Model          β̂1     SE(β̂1)   β̂2       SE(β̂2)
y ∼ x1         2.64   0.30     –        –
y ∼ x2         –      –        3.34     1.33
y ∼ x1 + x2    6.58   0.53     -20.44   2.28

- Effect 1: the estimated regression coefficient of any one variable depends on which other predictor variables are included in the model.
- Effect 2: the precision of the estimated regression coefficients decreases as more predictor variables are added to the model.
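
The two effects can be reproduced on simulated data. The sketch below is illustrative only: it does not use the blood pressure data behind the table, and the sample size, noise levels and true coefficients (2 for x1, -1 for x2) are assumptions chosen to make x2 almost a copy of x1. It fits y ∼ x1, y ∼ x2 and y ∼ x1 + x2 by ordinary least squares using closed-form formulas on centred data.

```python
import math
import random

def centered_sums(u, v):
    """Sum of cross-products of two samples after centring each at its mean."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def fit_one(x, y):
    """OLS slope and its standard error for y ~ x (intercept included)."""
    n = len(y)
    sxx, sxy, syy = centered_sums(x, x), centered_sums(x, y), centered_sums(y, y)
    b = sxy / sxx
    s2 = (syy - b * sxy) / (n - 2)  # residual variance
    return b, math.sqrt(s2 / sxx)

def fit_two(x1, x2, y):
    """OLS slopes and standard errors for y ~ x1 + x2 (intercept included)."""
    n = len(y)
    s11, s22, s12 = centered_sums(x1, x1), centered_sums(x2, x2), centered_sums(x1, x2)
    s1y, s2y, syy = centered_sums(x1, y), centered_sums(x2, y), centered_sums(y, y)
    det = s11 * s22 - s12 ** 2  # close to zero when x1 and x2 are collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    s2 = (syy - b1 * s1y - b2 * s2y) / (n - 3)  # residual variance
    return (b1, math.sqrt(s2 * s22 / det)), (b2, math.sqrt(s2 * s11 / det))

# Hypothetical simulated data, not the data from the slides.
rng = random.Random(1)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + rng.gauss(0, 0.1) for a in x1]  # x2 is almost a copy of x1
y = [2 * a - b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

b1_alone, se1_alone = fit_one(x1, y)
b2_alone, se2_alone = fit_one(x2, y)
(b1_joint, se1_joint), (b2_joint, se2_joint) = fit_two(x1, x2, y)

print(f"y ~ x1      : b1 = {b1_alone:.2f} (SE {se1_alone:.2f})")
print(f"y ~ x2      : b2 = {b2_alone:.2f} (SE {se2_alone:.2f})")
print(f"y ~ x1 + x2 : b1 = {b1_joint:.2f} (SE {se1_joint:.2f}), "
      f"b2 = {b2_joint:.2f} (SE {se2_joint:.2f})")
```

With predictors this strongly correlated, the standard errors in the joint fit should come out several times larger than in the single-predictor fits, which is Effect 2 in miniature.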


Profile Regression

Issues caused by
- correlated risk factors
- interacting risk factors

Profile regression
- partitions the multi-dimensional risk surface into groups having similar risks
- investigation of the joint effects of multiple risk factors
- jointly models the covariate patterns and health outcomes
- flexible but tractable Bayesian model


Profile Regression

Notation

For individual i

- $y_i$: outcome of interest
- $x_i = (x_{i1}, \ldots, x_{iP})$: covariate profile
- $w_i$: fixed effects
- $z_i = c$: the allocation variable indicates the cluster to which individual i belongs


Profile Regression

Statistical Framework

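- Joint covariate and response model

$$f(x_i, y_i \mid \phi, \theta, \psi, \beta) = \sum_{c} \psi_c \, f(x_i \mid z_i = c, \phi_c) \, f(y_i \mid z_i = c, \theta_c, \beta, w_i)$$

- For example, for discrete covariates

$$f(x_i \mid z_i = c, \phi_c) = \prod_{j=1}^{J} \phi_{z_i, j, x_{i,j}}$$

- For example, for Bernoulli response

$$\operatorname{logit}\{p(y_i = 1 \mid \theta_c, \beta, w_i)\} = \theta_c + \beta^T w_i$$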
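
To make the joint model concrete, the following sketch evaluates the mixture density for one individual with discrete covariates and a Bernoulli response. It assumes a finite mixture with two clusters, whereas the Dirichlet process mixture has infinitely many. All parameter values and function names are hypothetical and chosen only for illustration; this is not the PReMiuM implementation.

```python
import math

def covariate_lik(x, phi_c):
    """prod_j phi_{c, j, x_j} for one individual's discrete covariates."""
    return math.prod(phi_c[j][level] for j, level in enumerate(x))

def bernoulli_lik(y, theta_c, beta, w):
    """Bernoulli likelihood with logit p = theta_c + beta^T w."""
    eta = theta_c + sum(b * wk for b, wk in zip(beta, w))
    p = 1.0 / (1.0 + math.exp(-eta))
    return p if y == 1 else 1.0 - p

def joint_density(x, y, w, psi, phi, theta, beta):
    """f(x_i, y_i | phi, theta, psi, beta) as a finite mixture over clusters."""
    return sum(
        psi[c] * covariate_lik(x, phi[c]) * bernoulli_lik(y, theta[c], beta, w)
        for c in range(len(psi))
    )

# Hypothetical toy parameters: 2 clusters, 2 covariates with 3 levels each.
psi = [0.6, 0.4]                              # mixture weights
phi = [
    [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]],       # cluster 0: phi[c][j][level]
    [[0.1, 0.1, 0.8], [0.2, 0.2, 0.6]],       # cluster 1
]
theta = [-1.0, 1.5]                           # cluster-specific intercepts
beta = [0.5]                                  # one global fixed-effect coefficient

density = joint_density(x=[0, 0], y=1, w=[0.2],
                        psi=psi, phi=phi, theta=theta, beta=beta)
print(density)
```

Because the covariates and the response are both discrete here, summing `joint_density` over every possible (x, y) pair for a fixed w should give 1, which is a quick check that the pieces fit together.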


- Prior model for the mixture weights $\psi_c$
  - stick-breaking priors (constructive definition of the Dirichlet process)

$$P(z_i = c \mid \psi) = \psi_c, \qquad \psi_1 = V_1, \qquad \psi_c = V_c \prod_{l<c} (1 - V_l), \qquad V_c \sim \mathrm{Beta}(1, \alpha)$$

  - the larger the concentration parameter $\alpha$, the more evenly distributed the resulting distribution
  - the smaller the concentration parameter $\alpha$, the more sparsely distributed the resulting distribution, with all but a few weights $\psi_c$ having a probability near zero
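
The stick-breaking construction above can be sketched in a few lines. This is a minimal illustration that truncates the infinite sequence at a fixed number of components. The function name, the truncation level of 20 and the two values of α are assumptions made for the example.

```python
import random

def stick_breaking_weights(alpha, n_components, rng):
    """First n_components weights psi_c = V_c * prod_{l<c} (1 - V_l), V_c ~ Beta(1, alpha)."""
    weights = []
    remaining = 1.0  # length of the stick not yet allocated
    for _ in range(n_components):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

rng = random.Random(2015)
for alpha in (0.5, 5.0):
    psi = stick_breaking_weights(alpha, n_components=20, rng=rng)
    print(f"alpha = {alpha}: largest weight {max(psi):.3f}, "
          f"mass in first 20 components {sum(psi):.3f}")
```

For small α the first few breaks tend to take most of the stick, so a handful of clusters carry nearly all the mass. For larger α the mass is spread over more components.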


R package PReMiuM

Implementation: R package PReMiuM

We have implemented profile regression in C++ within the R package PReMiuM.

- Binary, binomial, categorical, Normal, Poisson and survival outcomes
- Allows for spatial correlation
- Fixed effects (global parameters) including also spatial CAR term
- Normal and/or discrete covariates
- Dependent or independent slice sampling (Kalli et al., 2011) or truncated Dirichlet process model (Ishwaran and James, 2001)
- Fixed alpha or update alpha, or use the Pitman-Yor process prior
- Handles missing data


- Allows users to run predictive scenarios
- Performs post processing
- Contains plotting functions

Currently working on:
- Quantile profile regression
- Enriched Dirichlet processes


Applications and features of the model

Example: Simulated data

The profiles are given by
- y: outcome, Bernoulli
- x: 5 covariates, all discrete with 3 levels
- w: 2 fixed effects, continuous or discrete
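
A data set with this structure could be generated along the following lines. This is a sketch under stated assumptions, not the generator shipped with the package: the number of clusters (3), the cluster weights, the level probabilities, the intercepts and the fixed-effect coefficients are all hypothetical, and one fixed effect is taken to be continuous and the other binary.

```python
import math
import random

def simulate(n, psi, phi, theta, beta, rng):
    """Draw n individuals (z, x, w, y) from a finite-mixture version of the model."""
    data = []
    for _ in range(n):
        # Cluster allocation z_i drawn with probabilities psi.
        z = rng.choices(range(len(psi)), weights=psi)[0]
        # Five discrete covariates, each with levels 0, 1, 2.
        x = [rng.choices(range(3), weights=phi[z][j])[0] for j in range(5)]
        # Two fixed effects: one continuous, one binary (an assumption).
        w = [rng.gauss(0, 1), rng.randint(0, 1)]
        # Bernoulli outcome with logit p = theta_z + beta^T w.
        eta = theta[z] + sum(b * wk for b, wk in zip(beta, w))
        y = 1 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0
        data.append({"z": z, "x": x, "w": w, "y": y})
    return data

# Hypothetical parameters: cluster c favours level c on every covariate.
psi = [0.5, 0.3, 0.2]
phi = [[[0.8 if level == c else 0.1 for level in range(3)] for _ in range(5)]
       for c in range(3)]
theta = [-2.0, 0.0, 2.0]
beta = [0.5, -0.5]

data = simulate(500, psi, phi, theta, beta, random.Random(42))
for c in range(3):
    members = [d for d in data if d["z"] == c]
    rate = sum(d["y"] for d in members) / max(len(members), 1)
    print(f"cluster {c}: {len(members)} individuals, observed outcome rate {rate:.2f}")
```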


Applications and features of the model

Survival response with censoring: sleep study

[Figure: survival curves; horizontal axis Time (0 to 5), vertical axis Survival (0.0 to 1.0).]


Variable selection


Applications and features of the model

Spatially correlated response: deprivation in London

[Figure: legend classes [−39.1, −7.73], (−7.73, −2.04], (−2.04, 2.47], (2.47, 7.88], (7.88, 31.1].]


Applications and features of the model

References

- Liverani, S., Hastie, D. I., Azizi, L., Papathomas, M. and Richardson, S. (2015) PReMiuM: An R package for Profile Regression Mixture Models using Dirichlet Processes. Journal of Statistical Software, 64(7), 1-30.

- Molitor, J. T., Papathomas, M., Jerrett, M. and Richardson, S. (2010) Bayesian Profile Regression with an Application to the National Survey of Children's Health. Biostatistics, 11, 484-498.

- Molitor, J., Brown, I. J., Papathomas, M., Molitor, N., Liverani, S., Chan, Q., Richardson, S., Van Horn, L., Daviglus, M. L., Stamler, J. and Elliott, P. (2014) Blood pressure differences associated with DASH-like lower sodium compared with typical American higher sodium nutrient profile: INTERMAP USA. Hypertension.

- Hastie, D. I., Liverani, S. and Richardson, S. (2015) Sampling from Dirichlet process mixture models with unknown concentration parameter: Mixing issues in large data implementations. To appear in Statistics and Computing.

- Papathomas, M., Molitor, J., Richardson, S., Riboli, E. and Vineis, P. (2011) Examining the joint effect of multiple risk factors using exposure risk profiles: lung cancer in non-smokers. Environmental Health Perspectives, 119, 84-91.

- Papathomas, M., Molitor, J., Hoggart, C., Hastie, D. and Richardson, S. (2012) Exploring data from genetic association studies using Bayesian variable selection and the Dirichlet process: application to searching for gene-gene patterns. Genetic Epidemiology.
