
Interpretable Linear Modeling

Jeffrey M. Girard
Carnegie Mellon University

Overview

• Linear Modeling (LM)
• Generalized Linear Modeling (GLM)
• Multilevel Modeling (MLM)

Linear Modeling

What is linear modeling?

• Linear models (LM) are statistical prediction models with high interpretability
• They are best combined with a small number of interpretable features
• They allow you to quantify the size and direction of each feature's effect
• They allow you to isolate the unique effect of each feature

• LM incorporates regression, ANOVA, t-tests, and F-tests
• LM allows for multiple X (predictor or independent) variables
• LM allows for multiple Y (explained or dependent) variables

• LM assumes the Y variables are normally distributed (to avoid → GLM)
• LM assumes the data points are independent (to avoid → MLM)


Simple linear regression

• Linear regression is often presented with the following notation:

y_i = \alpha + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \epsilon_i

• y is a vector of observations on the explained variable
• x is a vector of observations on the predictor variable
• α is the intercept
• β are the slopes
• ε is a vector of normally distributed and independent residuals


Simple linear regression

• Instead, we will use the following (equivalent) notation
• This notation will make it easier to generalize LM

y_i \sim \mathrm{Normal}(\mu_i, \sigma)    (likelihood)

\mu_i = \alpha + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}    (linear model)

• This can be read as: y_i is normally distributed around μ_i with SD σ
• μ_i relates y_i to x_i and defines the "best fit" prediction line
• σ captures the residuals or errors when y_i ≠ μ_i
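To make this generative reading concrete, here is a minimal R sketch that simulates data from the model; the sample size and parameter values are made up purely for illustration:

set.seed(1)
n     <- 150   # sample size (arbitrary)
alpha <- 5.0   # intercept (made-up value)
beta  <- 1.0   # slope (made-up value)
sigma <- 2.0   # residual SD (made-up value)
x  <- runif(n, 0, 4)                   # one predictor
mu <- alpha + beta * x                 # the linear model defines each mean
y  <- rnorm(n, mean = mu, sd = sigma)  # y is normal around mu with SD sigma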


Estimating the regression coefficients

• The goal is to find the parameter values that minimize the residuals
• In linear modeling, this means estimating α, β, and σ

• This can be accomplished in several different ways
• Maximum likelihood (i.e., minimize the residual sum of squares)
• Bayesian approximation (e.g., maximum a posteriori or MCMC)

• I will provide code for use in R's lme4 and in Python's statsmodels (see the R sketch below)
• These both use maximum likelihood estimation by default
• I also highly recommend MCMC (especially for MLM)
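As a minimal fitting sketch in base R (lme4 only becomes necessary for the multilevel models later), assuming a hypothetical data frame videos containing the variables used in the examples that follow:

fit <- lm(review_score ~ 1 + smile_rate, data = videos)  # maximum likelihood / OLS
summary(fit)  # estimates, standard errors, adjusted R-squared
confint(fit)  # 95% confidence intervals for alpha and beta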


Simulated LM example dataset

[Figure: Simulated data of 150 YouTube book review videos]

๐‘ฆ๐‘ฆ๐‘–๐‘– ~ Normal(๐œ‡๐œ‡๐‘–๐‘– ,๐œŽ๐œŽ)

๐œ‡๐œ‡๐‘–๐‘– = ๐›ผ๐›ผ

review_score ~ 1

11/19/2019 Interpretable Linear Modeling 9

Example of LM with intercept-only

Variable Estimate 95% CI

๐›ผ๐›ผ (Intercept) 6.51 [6.20, 6.83]๐œŽ๐œŽ Residual SD 1.94

Deviance = 563.16, Adjusted ๐‘…๐‘…2 = 0.000
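The fit statistics reported in these tables can be recovered from a fitted lm object; a quick sketch, again assuming the hypothetical videos data frame:

fit0 <- lm(review_score ~ 1, data = videos)
deviance(fit0)               # residual sum of squares for a Gaussian lm
summary(fit0)$adj.r.squared  # adjusted R-squared (0.000 for intercept-only)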

๐‘ฆ๐‘ฆ๐‘–๐‘– ~ Normal(๐œ‡๐œ‡๐‘–๐‘– ,๐œŽ๐œŽ)

๐œ‡๐œ‡๐‘–๐‘– = ๐›ผ๐›ผ + ๐›ฝ๐›ฝ๐‘ฅ๐‘ฅ๐‘–๐‘–

review_score ~ 1 + is_critic

11/19/2019 Interpretable Linear Modeling 10

Example of LM with binary predictor

Variable Estimate 95% CI

๐›ผ๐›ผ (Intercept) 6.85 [6.46, 7.24]๐›ฝ๐›ฝ is_critic โˆ’0.87 [โˆ’1.50,โˆ’0.24]๐œŽ๐œŽ Residual SD 1.90

Deviance = 536.15, Adjusted ๐‘…๐‘…2 = 0.042
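To read this table: α is the expected review_score when is_critic = 0, and β is the shift for critics, so the two group means implied by the fit are:

\hat{\mu}_{\text{non-critic}} = 6.85, \qquad \hat{\mu}_{\text{critic}} = 6.85 - 0.87 = 5.98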

๐‘ฆ๐‘ฆ๐‘–๐‘– ~ Normal(๐œ‡๐œ‡๐‘–๐‘– ,๐œŽ๐œŽ)

๐œ‡๐œ‡๐‘–๐‘– = ๐›ผ๐›ผ + ๐›ฝ๐›ฝ๐‘ฅ๐‘ฅ๐‘–๐‘–

review_score ~ 1 + smile_rate

11/19/2019 Interpretable Linear Modeling 11

Example of LM with continuous predictor

Variable Estimate 95% CI

๐›ผ๐›ผ (Intercept) 5.40 [4.90, 5.90]๐›ฝ๐›ฝ smile_rate 1.15 [0.72, 1.57]๐œŽ๐œŽ Residual SD 1.78

Deviance = 471.43, Adjusted ๐‘…๐‘…2 = 0.157

๐‘ฆ๐‘ฆ๐‘–๐‘– ~ Normal(๐œ‡๐œ‡๐‘–๐‘– ,๐œŽ๐œŽ)

๐œ‡๐œ‡๐‘–๐‘– = ๐›ผ๐›ผ + ๐›ฝ๐›ฝ1๐‘ฅ๐‘ฅ1๐‘–๐‘– + ๐›ฝ๐›ฝ2๐‘ฅ๐‘ฅ2๐‘–๐‘–

review_score ~ 1 + smile_rate + is_critic

11/19/2019 Interpretable Linear Modeling 12

Example of LM with two predictors

Variable Estimate 95% CI

๐›ผ๐›ผ (Intercept) 5.71 [5.13, 6.29]๐›ฝ๐›ฝ1 smile_rate 1.07 [0.65, 1.49]๐›ฝ๐›ฝ2 is_critic โˆ’0.61 [โˆ’1.20,โˆ’0.01]๐œŽ๐œŽ Residual SD 1.77

Deviance = 458.68, Adjusted ๐‘…๐‘…2 = 0.174

๐‘ฆ๐‘ฆ๐‘–๐‘– ~ Normal(๐œ‡๐œ‡๐‘–๐‘– ,๐œŽ๐œŽ)๐œ‡๐œ‡๐‘–๐‘– = ๐›ผ๐›ผ + ๐›ฝ๐›ฝ1๐‘ฅ๐‘ฅ1๐‘–๐‘– + ๐›ฝ๐›ฝ2๐‘ฅ๐‘ฅ2๐‘–๐‘– + ๐›ฝ๐›ฝ3๐‘ฅ๐‘ฅ1๐‘–๐‘–๐‘ฅ๐‘ฅ2๐‘–๐‘–

review_score ~ 1 + smile_rate * is_critic

11/19/2019 Interpretable Linear Modeling 13

Example of LM with interaction effect

Variable Estimate 95% CI

๐›ผ๐›ผ (Intercept) 5.96 [5.31, 6.61]๐›ฝ๐›ฝ1 smile_rate 0.83 [0.33, 1.34]๐›ฝ๐›ฝ2 is_critic โˆ’1.31 [โˆ’2.32,โˆ’0.30]๐›ฝ๐›ฝ3 Interaction 0.79 [โˆ’0.13, 1.70]๐œŽ๐œŽ Residual SD 1.76

Deviance = 449.88, Adjusted ๐‘…๐‘…2 = 0.185
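Note that in R's formula syntax, * is shorthand for main effects plus their interaction, so the model above can equivalently be written with an explicit : term (videos is still the hypothetical data frame):

fit_a <- lm(review_score ~ 1 + smile_rate * is_critic, data = videos)
fit_b <- lm(review_score ~ 1 + smile_rate + is_critic + smile_rate:is_critic,
            data = videos)  # fits the identical model to fit_a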

Generalized Linear Modeling

What is generalized linear modeling?

• Generalized linear modeling (GLM) is an extension of LM
• It allows LM to accommodate non-normally distributed Y variables
• Examples include variables that are: binary, discrete, and bounded

• This is accomplished using GLM families and link functions
• Families model the likelihood function using a specific distribution
• Link functions connect the linear model to the mean of the likelihood
• Link functions also ensure that the mean is the "right" kind of number


๐‘ฆ๐‘ฆ๐‘–๐‘– ~ Family ๐œ‡๐œ‡๐‘–๐‘– , โ€ฆ

link(๐œ‡๐œ‡๐‘–๐‘–) = ๐›ผ๐›ผ + ๐›ฝ๐›ฝ1๐‘ฅ๐‘ฅ1๐‘–๐‘– + โ‹ฏ+ ๐›ฝ๐›ฝ๐‘๐‘๐‘ฅ๐‘ฅ๐‘๐‘๐‘–๐‘–

11/19/2019 Interpretable Linear Modeling 16

Common GLM families and link functions

Family Uses Supports Link Function

Normal Linear data โ„: (โˆ’โˆž,โˆž) Identity ๐œ‡๐œ‡

Gamma Non-negative data(reaction times) โ„: (0,โˆž) Inverse

or Power ๐œ‡๐œ‡โˆ’1

Poisson Count data โ„ค: 0, 1, 2, โ€ฆ Log log(๐œ‡๐œ‡)

Binomial Binary dataCategorical data

โ„ค: 0, 1โ„ค: [0,๐พ๐พ) Logit log

๐œ‡๐œ‡1 โˆ’ ๐œ‡๐œ‡

Simulated GLM example dataset

[Figure: Simulated data of 150 romantic couples' interactions]

๐‘ฆ๐‘ฆ๐‘–๐‘– ~ Normal(๐œ‡๐œ‡๐‘–๐‘– ,๐œŽ๐œŽ)

๐œ‡๐œ‡๐‘–๐‘– = ๐›ผ๐›ผ + ๐›ฝ๐›ฝ๐‘ฅ๐‘ฅ๐‘–๐‘–

divorced ~ 1 + satisfaction

11/19/2019 Interpretable Linear Modeling 18

Example of LM for binary data

Variable Estimate 95% CI

๐›ผ๐›ผ (Intercept) 0.98 [0.89, 1.07]๐›ฝ๐›ฝ satisfaction โˆ’0.22 [โˆ’0.25,โˆ’0.19]๐œŽ๐œŽ Residual SD 0.304

Deviance = 13.70, Adjusted ๐‘…๐‘…2 = 0.617

๐‘ฆ๐‘ฆ๐‘–๐‘– ~ Binomial(1,๐‘๐‘๐‘–๐‘–)

logit(๐‘๐‘๐‘–๐‘–) = ๐›ผ๐›ผ + ๐›ฝ๐›ฝ๐‘ฅ๐‘ฅ๐‘–๐‘–divorced ~ 1 + satisfaction

family = binomial(link = "logit")

11/19/2019 Interpretable Linear Modeling 19

Example of GLM for binary data

Variable Estimate 95% CI

๐›ผ๐›ผ (Intercept) 3.52 [2.45, 4.87]๐›ฝ๐›ฝ satisfaction โˆ’1.78 [โˆ’2.40,โˆ’1.30]

Deviance = 82.04, Pseudo ๐‘…๐‘…2 = 0.664

Note: Results are in transformed (i.e., logit) units
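Because these estimates are on the logit scale, exponentiating them converts the slope into an odds ratio; a minimal sketch, assuming a hypothetical couples data frame:

fit <- glm(divorced ~ 1 + satisfaction,
           family = binomial(link = "logit"), data = couples)
exp(coef(fit))  # e.g., exp(-1.78) ≈ 0.17, so each unit of satisfaction
                # multiplies the odds of divorce by about 0.17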

๐‘ฆ๐‘ฆ๐‘–๐‘– ~ Normal(๐œ‡๐œ‡๐‘–๐‘– ,๐œŽ๐œŽ)

๐œ‡๐œ‡๐‘–๐‘– = ๐›ผ๐›ผ + ๐›ฝ๐›ฝ๐‘ฅ๐‘ฅ๐‘–๐‘–

interruptions ~ 1 + satisfaction

11/19/2019 Interpretable Linear Modeling 20

Example of LM for count data

Variable Estimate 95% CI

๐›ผ๐›ผ (Intercept) 17.11 [15.76, 18.45]๐›ฝ๐›ฝ satisfaction โˆ’3.95 [โˆ’4.38,โˆ’3.52]๐œŽ๐œŽ Residual SD 4.61

Deviance = 3143.62, Adjusted ๐‘…๐‘…2 = 0.688

๐‘ฆ๐‘ฆ๐‘–๐‘– ~ Poisson(๐œ†๐œ†๐‘–๐‘–)

log(๐œ†๐œ†๐‘–๐‘–) = ๐›ผ๐›ผ + ๐›ฝ๐›ฝ๐‘ฅ๐‘ฅ๐‘–๐‘–interruptions ~ 1 + satisfactionfamily = poisson(link = "log")

11/19/2019 Interpretable Linear Modeling 21

Example of GLM for count data

Variable Estimate 95% CI

๐›ผ๐›ผ (Intercept) 3.16 [3.08, 3.24]๐›ฝ๐›ฝ satisfaction โˆ’0.77 [โˆ’0.83,โˆ’0.72]

Deviance = 325.24, Pseudo ๐‘…๐‘…2 = 0.895

Note: Results are in transformed (i.e., log) units
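Likewise, exponentiating log-scale estimates turns them into rate ratios (using the same hypothetical couples data frame):

fit <- glm(interruptions ~ 1 + satisfaction,
           family = poisson(link = "log"), data = couples)
exp(coef(fit))  # e.g., exp(-0.77) ≈ 0.46, so each unit of satisfaction cuts
                # the expected interruption count roughly in half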

Multilevel Modeling

What is multilevel modeling?

• MLM allows (G)LM to handle data points that are dependent/clustered
• Dependence is problematic because data within a cluster are likely to be similar
• Effects (e.g., intercepts and slopes) may also differ between clusters

[Diagram: clustered data structure in which Person 1, Person 2, and Person 3 each contribute Obs. 1, 2, and 3, and each observation has X and Y values]

What is multilevel modeling?

• One approach would be to estimate effects across all clusters (complete pooling)
• But this would require all clusters to be nearly identical, which they may not be

• Another would be to estimate effects for each cluster separately (no pooling)
• But this would ignore similarities between clusters and may overfit the data

• The MLM approach tries to have the best of both approaches (partial pooling)
• MLM tries to learn the similarities and the differences between clusters
• It estimates effects for each individual cluster separately
• But these effects are assumed to be drawn from a population of clusters
• MLM explicitly models this population and the cluster-specific variation

What is multilevel modeling?

• MLM comes with several important benefits over single-level models
• Improved estimates for repeat sampling
• Improved estimates for imbalanced sampling
• Estimates of variation across clusters
• Avoids averaging and retains variation
• Better accuracy and inference!

• Many have argued that "multilevel regression deserves to be the default approach"

๐‘ฆ๐‘ฆ๐‘–๐‘– ~ Normal ๐œ‡๐œ‡๐‘–๐‘– ,๐œŽ๐œŽ

๐œ‡๐œ‡๐‘–๐‘– = ๐›ผ๐›ผCLUSTER[๐‘–๐‘–]

๐›ผ๐›ผCLUSTER ~ Normal(๐›ผ๐›ผ,๐œŽ๐œŽCLUSTER)

โ€ข ๐›ผ๐›ผ captures the overall intercept (i.e., the mean of all clusters' intercepts)โ€ข ๐›ผ๐›ผCLUSTER ๐‘–๐‘– captures each cluster's deviation from the overall interceptโ€ข ๐œŽ๐œŽCLUSTER captures how much variability there is across clusters' interceptsโ€ข We now have an intercept for every cluster and a "population" of interceptsโ€ข The model learns about each cluster's intercept from the population of intercepts

โ€ข This pooling within parameters is very helpful for imbalanced sampling


Actual MLM example dataset

[Figure: Real data of 751 smiles from 136 subjects (Girard, Shandar, et al., 2019)]

Complete pooling (underfitting)

y_i \sim \mathrm{Normal}(\mu_i, \sigma)

\mu_i = \alpha

smile_int ~ 1

Variable           Estimate   95% CI
α (Intercept)      2.14       [2.10, 2.19]
σ (Residual SD)    0.66

Deviance = 1495.8

No pooling (overfitting)

y_i \sim \mathrm{Normal}(\mu_i, \sigma)

\mu_i = \alpha_{\mathrm{PERSON}[i]}

smile_int ~ 1 + factor(subject)

Variable                Estimate   95% CI
α₁ (Intercept for P1)   2.14       [1.93, 2.89]
α₂ (Intercept for P2)   2.17       [1.49, 2.85]
…                       …          …
σ (Residual SD)         0.60

Deviance = 1210.10

Partial pooling (ideal fit)

y_i \sim \mathrm{Normal}(\mu_i, \sigma)

\mu_i = \alpha_{\mathrm{PERSON}[i]}

\alpha_{\mathrm{PERSON}} \sim \mathrm{Normal}(\alpha, \sigma_{\mathrm{PERSON}})

smile_int ~ 1 + (1 | subject)

Variable                  Estimate   95% CI
α (Intercept)             2.15       [2.09, 2.21]
σ_PERSON (Intercept SD)   0.27
σ (Residual SD)           0.60

Deviance = 1458.81
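A minimal lme4 sketch of this partial-pooling fit, assuming a hypothetical smiles data frame with smile_int and subject columns:

library(lme4)
fit <- lmer(smile_int ~ 1 + (1 | subject), data = smiles)
summary(fit)  # overall intercept, intercept SD across subjects, residual SD
coef(fit)     # per-subject intercepts after shrinkage toward the population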

Shrinkage in action

[Figure: per-subject estimates moving from the no-pooling fit toward the population mean under partial pooling]

๐‘ฆ๐‘ฆ๐‘–๐‘– ~ Normal ๐œ‡๐œ‡๐‘–๐‘– ,๐œŽ๐œŽ

๐œ‡๐œ‡๐‘–๐‘– = ๐›ผ๐›ผCLUSTER ๐‘–๐‘– + ๐›ฝ๐›ฝCLUSTER ๐‘–๐‘– ๐‘ฅ๐‘ฅ๐‘–๐‘–๐›ผ๐›ผCLUSTER๐›ฝ๐›ฝCLUSTER ~ MVNormal

๐›ผ๐›ผ๐›ฝ๐›ฝ, ๐’๐’

๐’๐’ =๐œŽ๐œŽ๐›ผ๐›ผ 00 ๐œŽ๐œŽ๐›ฝ๐›ฝ

๐‘๐‘๐œŽ๐œŽ๐›ผ๐›ผ 00 ๐œŽ๐œŽ๐›ฝ๐›ฝ

โ€ข We can pool and learn from the covariance between varying intercepts and slopesโ€ข We do this with a 2D normal distribution with means ๐›ผ๐›ผ and ๐›ฝ๐›ฝ and covariance matrix ๐’๐’โ€ข We build ๐’๐’ through matrix multiplication of the variances and correlation matrix ๐‘๐‘โ€ข This is more complex but results in even better pooling: within and across parameters


MLM with varying intercepts and slopes

y_i \sim \mathrm{Normal}(\mu_i, \sigma)

\mu_i = \alpha_{\mathrm{PERSON}[i]} + \beta_{\mathrm{PERSON}[i]} x_i

\begin{bmatrix} \alpha_{\mathrm{PERSON}} \\ \beta_{\mathrm{PERSON}} \end{bmatrix} \sim \mathrm{MVNormal}\left( \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \mathbf{S} \right)

\mathbf{S} = \begin{bmatrix} \sigma_\alpha & 0 \\ 0 & \sigma_\beta \end{bmatrix} \mathbf{R} \begin{bmatrix} \sigma_\alpha & 0 \\ 0 & \sigma_\beta \end{bmatrix}

smile_int ~ 1 + amused_sr + (1 + amused_sr | subject)

Variable                             Estimate   95% CI
α (Intercept)                        1.98       [1.91, 2.06]
β (amused_sr)                        0.12       [0.09, 0.15]
σ_α (Intercept SD)                   0.24       [0.17, 0.32]
σ_β (Slope SD)                       0.04       [0.00, 0.08]
R (corr of intercepts and slopes)    0.30       [−0.69, 0.96]
σ (Residual SD)                      0.57       [0.54, 0.60]

Deviance = 1458.81

Depiction of varying intercepts and slopes

[Figure: per-person regression lines with varying intercepts and slopes]