Advanced statistical methods II

Lecture 2


Page 1: Lecture 2

Advanced statistical methods II

Page 2: Lecture 2

Learning objectives

• Implement an analytic strategy using mediation

• Implement an analytic strategy using path analysis

• Appreciate the role of structural equation modelling

Page 3: Lecture 2

Stata usage

Menu or command driven

• Use the menus to find out how to write a command, then save as a program

Use a program in a ‘do’ file for analysis and save it

• Replication

Page 4: Lecture 2

How to use do file

– Click “do-file editor” on the toolbar

– Type your command and click “run” button

– Click “file” -> “save as” to save your do-file. You can choose to save it under your account (e.g. “stata01”)

– To see what files are under your account
  • ls

– Use “do-file editor” -> “open” to open a saved do-file

Page 5: Lecture 2

Recap - Stata

Key commands (use class examples as templates)

• browse
  – to look at your data

• tabu edup ov7 if sex==1
  – to cross tab your data with a selection

• bysort sex: summarize bw
  – to get means for different groups

• xi: regress bmi7z i.edup bw i.sex
  – to do regression

• xi: logistic ov7 i.edup bw i.sex
  – to do logistic regression

Page 6: Lecture 2

Interpreting Stata output (1)

• browse, tabu, summarize – self-explanatory

• regress shows you
  – number of obs
  – R-squared
  – For each exposure
    • Coef. May think of it as beta (β)
    • Std. err.
    • P>|t| - may think of it as p-value
    • 95% CI

Page 7: Lecture 2

Interpreting Stata output (2)

• logistic shows you
  – number of obs
  – LR chi2(df)
  – For each exposure
    • Odds ratio
    • Std. err.
    • P>|z| - may think of it as p-value
    • 95% CI

Page 8: Lecture 2

Stata – things to be aware of

Stata does not include in the analysis observations with missing values (.)

In regression (any sort), Stata by default uses the lowest-valued group as the reference group for a categorical variable

There is a fantastic website with annotated Stata analyses

http://www.ats.ucla.edu/stat/stata

Page 9: Lecture 2

New statistics tools

• Mediation
  – examining the ‘active’ ingredient

• Path analysis
  – examining several ‘active’ ingredients

• Structural equation modeling
  – examining several ‘active’ ingredients including unmeasured concepts

Page 10: Lecture 2

Mediation

• The ‘active ingredient’ of an exposure
• The mechanism by which an exposure works
• Increases biological plausibility of theory
• Reduces likelihood that the exposure-disease relationship is caused by a confounder

[Diagram: exposure E, mediator M and disease D, with arrows E → D and E → M → D]

Direct effect: not through the hypothesized mediator (E-D)

Indirect effect: through the hypothesized mediator (E-M-D)

Page 11: Lecture 2

Mediation as causal explanation

• Crude OR for BMI and breast cancer = 2.0
• Adjusted for estradiol, aOR = 1.0
• Full mediation – all of the association goes through this pathway
• If adjusted was 1.5 – partial mediation (another pathway exists)
• Identifies more proximal cause of disease for potential intervention/prevention

[Diagram: BMI → Estradiol → Breast cancer, and BMI → Breast cancer]
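The crude-versus-adjusted comparison on this slide can be written as a small rule. The sketch below is in Python rather than Stata; the function name `classify_mediation` and the tolerance `tol` are illustrative assumptions, not part of the lecture, and in practice confidence intervals would be considered rather than point estimates alone.

```python
import math

def classify_mediation(crude_or, adjusted_or, tol=0.05):
    """Label mediation from a crude OR and a mediator-adjusted OR.

    Hypothetical helper: `tol` is an arbitrary cut-off on the log-OR scale
    for treating an OR as "equal to 1"; it is not a formal test.
    """
    crude = abs(math.log(crude_or))
    adjusted = abs(math.log(adjusted_or))
    if crude <= tol:
        return "no crude association"
    if adjusted <= tol:
        return "full mediation"
    # Attenuated toward the null (OR = 1) without reaching it
    if adjusted < crude:
        return "partial mediation"
    return "no evidence of mediation"

print(classify_mediation(2.0, 1.0))  # BMI and breast cancer, adjusted for estradiol
print(classify_mediation(2.0, 1.5))  # the 'partial mediation' case from the slide
```

The comparison is made on the log scale so that ORs above and below 1 are treated symmetrically.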

Page 12: Lecture 2

Baron & Kenny 4-step approach

1. Exposure (BMI) should be associated with outcome (BrCa)
2. Exposure (BMI) should be associated with mediator (Estradiol)
3. Mediator (Estradiol) should be associated with outcome (BrCa)
4. Association of exposure (BMI) with outcome (BrCa) should be reduced by adjusting for mediator (Estradiol)

[Diagram:
BMI → Estradiol (step 2): OR > 1?
Estradiol → Breast cancer (step 3): OR > 1?
BMI → Breast cancer (steps 1, 4): Crude OR > Adjusted OR]

Page 13: Lecture 2

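Sobel and Goodman tests

• Null hypothesis: indirect effect is 0
• Test statistic (normally distributed)

$$ z = \frac{\alpha \beta}{\sigma_{\alpha\beta}} $$

  – $\sigma_{\alpha\beta}$ approximated by

$$ \sigma_{\alpha\beta} \approx \sqrt{\alpha^2 \sigma_\beta^2 + \beta^2 \sigma_\alpha^2} \quad \text{(Sobel)} $$

$$ \sigma_{\alpha\beta} \approx \sqrt{\alpha^2 \sigma_\beta^2 + \beta^2 \sigma_\alpha^2 - \sigma_\alpha^2 \sigma_\beta^2} \quad \text{(Goodman)} $$

[Diagram: E → M labelled α, M → D labelled β, and E → D]

Direct effect: not through the hypothesized mediator (E-D)

Indirect effect: through the hypothesized mediator (E-M-D)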

Page 14: Lecture 2

Example of mediation

Question: Does childhood growth mediate the association between infant growth and adolescent systolic blood pressure?

In the dataset you have
• Systolic blood pressure - bpsys
• Infant growth as change in weight z-score from birth to 3 months - w0to3mz
• Height z-score at ~7 years - height7z

Page 15: Lecture 2

Theoretical model

[Path diagram:
Infant growth → Systolic blood pressure: path c
Infant growth → Childhood growth: path a
Childhood growth → Systolic blood pressure: path b]

Page 16: Lecture 2

Read data

• use /home/asm2/s2/mediation,clear

See what kind of variables you got

• describe

Page 17: Lecture 2

Install “sgmediation” package

• findit sgmediation


Page 19: Lecture 2

Sobel-Goodman mediation tests

• sgmediation bpsys, mv(height7z) iv(w0to3mz)

http://www.ats.ucla.edu/stat/stata/faq/sgmediation.htm

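. sgmediation bpsys, mv(height7z) iv(w0to3mz)

Model with dv regressed on iv (path c)

      Source |       SS       df       MS              Number of obs =    3159
-------------+------------------------------           F(  1,  3157) =    8.61
       Model |  1026.14489     1  1026.14489           Prob > F      =  0.0034
    Residual |  376186.198  3157  119.159391           R-squared     =  0.0027
-------------+------------------------------           Adj R-squared =  0.0024
       Total |  377212.343  3158  119.446594           Root MSE      =  10.916

------------------------------------------------------------------------------
       bpsys |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     w0to3mz |   .6655623   .2268029     2.93   0.003     .2208664    1.110258
       _cons |   102.1715    .202861   503.65   0.000     101.7738    102.5693
------------------------------------------------------------------------------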

Page 20: Lecture 2

Direct Association

[Path diagram: Infant growth → Systolic blood pressure (path c), β = 0.67; Childhood growth]

Page 21: Lecture 2

Sobel-Goodman mediation tests

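Model with mediator regressed on iv (path a)

      Source |       SS       df       MS              Number of obs =    3159
-------------+------------------------------           F(  1,  3157) =   87.90
       Model |  72.2828983     1  72.2828983           Prob > F      =  0.0000
    Residual |  2596.17089  3157  .822353782           R-squared     =  0.0271
-------------+------------------------------           Adj R-squared =  0.0268
       Total |  2668.45379  3158    .8449822           Root MSE      =  .90684

------------------------------------------------------------------------------
    height7z |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     w0to3mz |   .1766453   .0188414     9.38   0.000     .1397027     .213588
       _cons |     -.1604   .0168525    -9.52   0.000    -.1934429   -.1273571
------------------------------------------------------------------------------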

Page 22: Lecture 2

Association of exposure with mediator

[Path diagram:
Infant growth → Systolic blood pressure (path c): β = 0.67
Infant growth → Childhood growth (path a): β = 0.17]

Page 23: Lecture 2

Sobel-Goodman mediation tests

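Model with dv regressed on mediator and iv (paths b and c')

      Source |       SS       df       MS              Number of obs =    3159
-------------+------------------------------           F(  2,  3156) =  165.83
       Model |  35872.0721     2   17936.036           Prob > F      =  0.0000
    Residual |   341340.27  3156  108.155979           R-squared     =  0.0951
-------------+------------------------------           Adj R-squared =  0.0945
       Total |  377212.343  3158  119.446594           Root MSE      =    10.4

------------------------------------------------------------------------------
       bpsys |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    height7z |   3.663611   .2041073    17.95   0.000     3.263415    4.063808
     w0to3mz |   .0184025   .2190649     0.08   0.933    -.4111216    .4479266
       _cons |   102.7592   .1960212   524.23   0.000     102.3749    103.1435
------------------------------------------------------------------------------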

Page 24: Lecture 2

Association of mediator with outcome

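[Path diagram:
Infant growth → Systolic blood pressure (path c): initial β = 0.67, with mediator β = 0.02
Infant growth → Childhood growth (path a): β = 0.17
Childhood growth → Systolic blood pressure (path b): β = 3.66]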
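For linear regressions fitted to the same observations, the total effect splits into a direct and an indirect part: c = c' + a × b. The following sketch checks this with the coefficients from the three regression outputs above. It is written in Python rather than Stata, and the variable names are illustrative only.

```python
# Coefficients copied from the three regression outputs on the preceding slides
c_total = 0.6655623    # path c:  bpsys on w0to3mz
a = 0.1766453          # path a:  height7z on w0to3mz
b = 3.663611           # path b:  bpsys on height7z, adjusted for w0to3mz
c_direct = 0.0184025   # path c': bpsys on w0to3mz, adjusted for height7z

# Indirect effect through the mediator is the product of paths a and b
indirect = a * b

print(f"indirect effect (a*b)  = {indirect:.4f}")
print(f"direct + indirect      = {c_direct + indirect:.4f}")
print(f"total effect (path c)  = {c_total:.4f}")
print(f"proportion mediated    = {indirect / c_total:.3f}")
```

The direct and indirect parts add up to the total effect apart from rounding in the printed coefficients.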

Page 25: Lecture 2

Sobel-Goodman mediation tests

Sobel-Goodman Mediation Tests

                  Coef         Std Err      Z        P>|Z|
Sobel             .64715983    .07787652    8.31     0
Goodman-1         .64715983    .07797141    8.3      0
Goodman-2         .64715983    .07778151    8.32     0

Indirect effect = .64715983
Direct effect   = .01840251
Total effect    = .66556235

Proportion of total effect that is mediated: .97235043
Ratio of indirect to direct effect:          35.166928
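The standard errors in this table can be recomputed by hand from paths a and b and their standard errors. The sketch below is in Python rather than Stata, and `sobel_goodman` is a hypothetical helper name. It treats Goodman-1 as the variant that adds the product of the two squared standard errors and Goodman-2 as the variant that subtracts it, which reproduces the reported values.

```python
import math

def sobel_goodman(a, se_a, b, se_b):
    """Return {test name: (standard error, z)} for the indirect effect a*b."""
    base = a**2 * se_b**2 + b**2 * se_a**2
    extra = se_a**2 * se_b**2
    variances = {
        "Sobel": base,
        "Goodman-1": base + extra,   # adds the product of the variances
        "Goodman-2": base - extra,   # subtracts the product of the variances
    }
    indirect = a * b
    return {name: (math.sqrt(v), indirect / math.sqrt(v))
            for name, v in variances.items()}

# Path a and path b with standard errors, taken from the regression outputs
results = sobel_goodman(a=0.1766453, se_a=0.0188414, b=3.663611, se_b=0.2041073)

for name, (se, z) in results.items():
    # Two-sided p-value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    print(f"{name:10s} SE = {se:.5f}  z = {z:.2f}  p = {p:.2g}")
```

All three tests give z of about 8.3, so the null hypothesis of no indirect effect is rejected whichever approximation is used.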

Page 26: Lecture 2

Issues

• Other mediators may exist on the ‘direct’ pathway, just not the mediators we are evaluating
• Need to consider confounding when estimating direct or indirect effect
• Not all the conditions may be necessary
• May need other methods to estimate Sobel test statistics for other than linear variables

[Diagram: BMI → Estradiol → Breast cancer, with a confounder C]

Page 27: Lecture 2

Why path analysis?

[Two diagrams, each with exposures E1, E2, E3, E4 and disease D: “Multiple regression model” and “Path analysis”]

Do not know how the exposures relate to each other

Any mediation here?

Page 28: Lecture 2

How to do path analysis

1. Draw out ‘a priori’ path diagram

2. Compute the co-efficients for each path
  – based on multiple regression
  – Use Stata “pathreg”

3. Draw the final path diagram with only the paths with significant co-efficients

Page 29: Lecture 2

Path analysis question

• How does late infant growth affect systolic blood pressure in adolescence and does bmi z-score at 7 years play a role?

Page 30: Lecture 2

‘a priori’ path diagram

[Path diagram with four variables: Weight z-score change from 3 to 9 months, Height z-score change from 3 to 9 months, BMI z-score at 7 years, Blood pressure at 11 years]

Note – no ‘causal’ loops

Page 31: Lecture 2

Path analysis

• findit pathreg

• corr (w3to9mz h3to9mz)

. corr (w3to9mz h3to9mz)
(obs=3159)

             |  w3to9mz  h3to9mz
-------------+------------------
     w3to9mz |   1.0000
     h3to9mz |   0.2782   1.0000

http://www.ats.ucla.edu/stat/stata/faq/pathreg.htm

Page 32: Lecture 2

Path analysis

• pathreg (bmi7z w3to9mz h3to9mz)(bpsys bmi7z w3to9mz h3to9mz)

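. pathreg (bmi7z w3to9mz h3to9mz)(bpsys bmi7z w3to9mz h3to9mz)

------------------------------------------------------------------------------
       bmi7z |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
     w3to9mz |   .1243758    .036085     3.45   0.001                 .0636919
     h3to9mz |    .049892   .0345088     1.45   0.148                 .0267163
       _cons |   .1752513   .0224686     7.80   0.000                        .
------------------------------------------------------------------------------
n = 3159  R2 = 0.0057  sqrt(1 - R2) = 0.9971

------------------------------------------------------------------------------
       bpsys |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
       bmi7z |   3.104725   .1542642    20.13   0.000                 .3371197
     w3to9mz |   .0197162   .3133117     0.06   0.950                 .0010963
     h3to9mz |   1.109793    .299163     3.71   0.000                  .064528
       _cons |   101.9994   .1965874   518.85   0.000                        .
------------------------------------------------------------------------------
n = 3159  R2 = 0.1198  sqrt(1 - R2) = 0.9382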

Page 33: Lecture 2

Path analysis

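[Path diagram with standardized co-efficients:
Weight z-score change from 3 to 9 months → BMI z-score at 7 years: 0.06
BMI z-score at 7 years → Blood pressure at 11 years: 0.34
Height z-score change from 3 to 9 months → Blood pressure at 11 years: 0.06
Correlation between weight and height z-score change from 3 to 9 months: 0.28]

Standardized co-efficients: that fraction of the standard deviation of the dependent variable for which the designated factor is directly responsible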
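In a path diagram, an indirect effect is usually read off as the product of the standardized co-efficients along the chain of arrows. The sketch below does this for infant weight growth acting on blood pressure through BMI, using the Beta column of the pathreg output. It is written in Python rather than Stata, and the variable names are illustrative only.

```python
# Standardized co-efficients (Beta column) from the pathreg output
w_to_bmi = 0.0636919    # w3to9mz -> bmi7z
bmi_to_bp = 0.3371197   # bmi7z   -> bpsys
w_to_bp = 0.0010963     # w3to9mz -> bpsys (direct path, not significant)

# Indirect effect: product of the paths along w3to9mz -> bmi7z -> bpsys
indirect = w_to_bmi * bmi_to_bp
total = w_to_bp + indirect

print(f"indirect effect via BMI = {indirect:.4f}")
print(f"direct effect           = {w_to_bp:.4f}")
print(f"direct + indirect       = {total:.4f}")
```

On this reading, almost all of the small standardized effect of infant weight growth on blood pressure runs through BMI at 7 years.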

Page 34: Lecture 2

How do coefficients in path analysis relate to that in multiple regression?

• regress bpsys w3to9mz h3to9mz bmi7z, beta

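. regress bpsys w3to9mz h3to9mz bmi7z,beta

      Source |       SS       df       MS              Number of obs =    3159
-------------+------------------------------           F(  3,  3155) =  143.19
       Model |   45205.083     3   15068.361           Prob > F      =  0.0000
    Residual |   332007.26  3155  105.232095           R-squared     =  0.1198
-------------+------------------------------           Adj R-squared =  0.1190
       Total |  377212.343  3158  119.446594           Root MSE      =  10.258

------------------------------------------------------------------------------
       bpsys |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
     w3to9mz |   .0197162   .3133117     0.06   0.950                 .0010963
     h3to9mz |   1.109793    .299163     3.71   0.000                  .064528
       bmi7z |   3.104725   .1542642    20.13   0.000                 .3371197
       _cons |   101.9994   .1965874   518.85   0.000                        .
------------------------------------------------------------------------------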

Same beta co-efficients as from the second step of the path analysis

Page 35: Lecture 2

What did path analysis do?

• 2 regressions

• Standardized the coefficients

• We manually calculated the correlation between the two exogenous variables (infant weight growth and infant height growth)

Page 36: Lecture 2

Pros & Cons of path analysis

• Path analysis may help distinguish plausibility of different hypotheses (could compare models)

• Path analysis may help present complicated relations

• Path diagram does NOT imply causal associations
  – Cannot establish direction of causality

• Path analysis only works with continuous variables

• Path analysis only uses observed variables

Page 37: Lecture 2

Structural Equation Modeling

Combination of

• path analysis

• factor analysis

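[Two diagrams: a path model of measured exposures E1 to E4 and outcome D, and a structural equation model in which measured variables (E) are linked to latent constructs L1 and L2]

L represents some underlying or latent construct, e.g., growth potential, ability, etc.; also called hypothetical or unobserved

E is measured, observed or manifest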

Page 38: Lecture 2

Structural Equation Modeling

[Diagram: structural equation model with outcome D, measured variables (E) and latent constructs L1 and L2]

Also called covariance structure analysis, covariance structure modeling, and analysis of covariance structures.

One-way arrows stand for regression weights

Two-way arrows stand for correlation among the predictors

Page 39: Lecture 2

Uses of SEM

• Confirm a model

• Compare models

(both based on some theoretical model, using external knowledge)

• Create a new model
  – POST HOC ANALYSIS
  – Needs cross-validation

Page 40: Lecture 2

Key steps in SEM

• Create theoretical model

• Common factor analysis to establish the number of latents

• Confirmatory factor analysis to confirm the measurement model. As a further refinement, factor loadings can be constrained to 0 for any measured variable's crossloadings on other latent variables, so every measured variable loads only on its latent.

• Test nested models to get the most parsimonious one. Alternatively, test other research studies' findings or theory by constraining parameters as they suggest should be the case. Consider tightening the alpha significance level from .05 to .01 for a more stringent test of the model.

• Relate back to theory

Page 41: Lecture 2

Reminder on factor analysis

• ASM 1

• Groups measured variables according to the correlations between them

• Enables measured variables to be grouped into distinct latent factors representing the same concept

Page 42: Lecture 2

Test the model

• Absolute fit index, e.g.
  – RMSEA: how well the model, with unknown but optimally chosen parameter estimates, would fit the population's covariance matrix. Magic numbers <0.07 to 0.10

• Incremental fit indices, e.g.
  – CFI: this statistic assumes that all latent variables are uncorrelated (null/independence model) and compares the sample covariance matrix with this null model. Magic numbers >0.9

• Parsimony fit indices, e.g.
  – AIC: can be used to compare models
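SEM software reports these indices directly, but both RMSEA and CFI can be computed from model chi-square statistics. The sketch below uses commonly quoted formulas; some packages use N rather than N - 1 in RMSEA. It is written in Python rather than Stata, the function names are illustrative, and the chi-square values are invented for illustration and do not come from the lecture data.

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation (N - 1 convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_null, df_null):
    """Comparative fit index relative to the null (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_null = max(chi2_null - df_null, d_model, 0.0)
    return 1.0 if d_null == 0 else 1.0 - d_model / d_null

# Hypothetical chi-square results, invented for illustration only
fit_rmsea = rmsea(chi2=85.3, df=40, n=500)
fit_cfi = cfi(chi2_model=85.3, df_model=40, chi2_null=900.0, df_null=55)

print(f"RMSEA = {fit_rmsea:.3f}  (slide threshold: < 0.07 to 0.10)")
print(f"CFI   = {fit_cfi:.3f}  (slide threshold: > 0.9)")
```

With these made-up inputs, both indices fall on the acceptable side of the thresholds quoted on the slide.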

Page 43: Lecture 2

Pros and Cons of SEM

• SEM may help distinguish plausibility of different hypotheses (if you compare models)

• SEM may help present complicated relations

• SEM works best with normally distributed continuous variables

• Needs specialised software, e.g., LISREL, AMOS, Mplus

• Model construction can be quite subjective

• Does NOT imply causal associations

• GIGO (garbage in, garbage out)

Page 44: Lecture 2

Take home messages