Biostatistics Case Studies 2014





REI Summer Fellowship Biostatistics

Biostatistics Case Studies 2014
Youngju Pak, PhD., Biostatistician
ypak@labiomed.org
Session 4: Regression Models and Multivariate Analyses
10/7/2014, Biostatistics Case Study: Session 4

What and Why?
Multivariate analysis (MVA) techniques allow more than two variables to be analysed at once, compared with univariate or bivariate analyses.

Data richness as computational technologies have advanced => data reduction or classification
e.g., factor analysis, principal component analysis (PCA)

Several variables are potentially correlated to some degree => potential confounding can bias the result
e.g., analysis of covariance (ANCOVA), multiple linear or generalized linear regression models

What and Why?
Many variables are all interrelated, with multiple dependent and independent variables
e.g., multivariate analysis of variance (MANOVA), path models, structural equation models (SEM), partial least squares (PLS) models

This session will focus on multiple regression models.

Why regression models?
To reduce random noise in the data => better variance estimation by accounting for additional sources of variability in your dependent variable
e.g., ANCOVA
To determine an optimal set of predictors => predictive models
e.g., variable selection procedures for multiple regression models
To adjust for potential confounding effects
e.g., regression models with covariates

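Actual mathematical models

ANOVA
$Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$,
where $Y_{ij}$ represents the $j$th observation ($j = 1, 2, \dots, n$) on the $i$th treatment ($i = 1, 2, \dots, l$ levels). The errors $\varepsilon_{ij}$ are assumed to be normally and independently distributed (NID), with mean zero and variance $\sigma^2$.

ANCOVA with $k$ covariates
$Y_{ij} = \mu + \tau_i + \beta_1 X_{1ij} + \beta_2 X_{2ij} + \dots + \beta_k X_{kij} + \varepsilon_{ij}$

MANOVA (with $p$ outcome variables)
$Y_{(n \times p)} = X_{(n \times [q+1])} B_{([q+1] \times p)} + E_{(n \times p)}$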

Actual mathematical models
Simple Linear Regression Models (SLR)
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

where $\mu_Y = \beta_0 + \beta_1 X_i$ is the true mean value of Y

$\varepsilon$ = error (random noise due to random sampling error), assumed to follow a normal distribution with mean = 0 and variance = $\sigma^2$

$\beta_0$ & $\beta_1$ = intercept & slope, often called regression (or beta) coefficients

Y = dependent variable (DV), X = independent variable (IV)
e.g., Y = insulin sensitivity, X = fatty acid in percentage

Multiple Linear Regression Models (MLR)
Simple Logistic Models (SL)
Multiple Logistic Models (ML)

SLR: Example SPSS output
Two-sided p-value = 0.002. Thus, there is statistically significant evidence (alpha = 0.05) to conclude that the true slope is not zero => fatty acid (%) is significantly related to insulin sensitivity. Mean insulin sensitivity increases by 37.208 units for each one-percentage-point increase in fatty acid (%).
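The slope and intercept that SPSS reports are ordinary least-squares estimates. The sketch below shows how they can be computed by hand. The data values are hypothetical and are NOT the data behind the SPSS output; the function name `fit_slr` is an assumption made for illustration.

```python
from statistics import mean

def fit_slr(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data for illustration only
fatty_acid = [17.0, 18.0, 19.0, 20.0, 21.0]          # X, in percent
insulin_sens = [200.0, 250.0, 270.0, 330.0, 350.0]   # Y

b0, b1 = fit_slr(fatty_acid, insulin_sens)

# Residuals = observed minus fitted; with an intercept in the model they sum to zero
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(fatty_acid, insulin_sens)]
print(f"intercept={b0:.3f}, slope={b1:.3f}")
```

The slope is read the same way as in the SPSS example: the estimated change in mean Y for a one-unit increase in X. The `residuals` list is what a residual plot would display against X or against the fitted values.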

SLR w/CI

Checking the assumptions using a residual plot

The plot should look RANDOM: no special pattern should appear if the assumptions are met.

Actual mathematical models
Multiple Linear Regression Models (MLR)
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$

where $\mu_Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$ is the true mean value of Y

Assumptions are the same as for SLR, with one addition: the Xs are not highly correlated with each other. If they are, this is called multicollinearity, which makes the model very unstable.

Diagnosis for multicollinearity: Variance Inflation Factor (VIF)
VIF = 1: OK
VIF < 5: tolerable
VIF > 5: problematic => remove the variable which has a high VIF or do PCA
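As a rough illustration of the VIF rule of thumb above: in general, the VIF for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j on the other predictors. With exactly two predictors this reduces to 1 / (1 - r^2), where r is their Pearson correlation. The sketch below covers only that two-predictor case. The data are hypothetical, the function names are assumptions, and treating a VIF of exactly 5 as "tolerable" is also an assumption, since the cutoffs above leave that boundary open.

```python
import math
from statistics import mean

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a model with exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

def vif_label(vif):
    """Apply the rule-of-thumb cutoffs (boundary at exactly 5 is an assumption)."""
    if vif > 5:
        return "problematic"
    if math.isclose(vif, 1.0):
        return "OK"
    return "tolerable"

# Hypothetical predictor values for illustration only
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]

vif = vif_two_predictors(x1, x2)
label = vif_label(vif)
print(f"VIF = {vif:.2f} ({label})")
```

With more than two predictors, each R_j^2 needs its own regression, which is usually left to a statistics package.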


MLR: Example
$\mu_Y = -56.935 + 1.634 X_1 + 0.249 X_2$
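To make the fitted equation concrete, the sketch below evaluates it for two punters who differ by one degree of flexibility but have the same leg strength. The coefficients are those of the equation above; the input values (90 and 91 degrees, 150 lb) and the function name are hypothetical.

```python
def mean_punt_distance(flexibility_deg, strength_lb):
    """Estimated mean punt distance (feet) from the fitted MLR equation."""
    return -56.935 + 1.634 * flexibility_deg + 0.249 * strength_lb

# Hypothetical inputs: same strength, flexibility differs by one degree
base = mean_punt_distance(90.0, 150.0)
more_flexible = mean_punt_distance(91.0, 150.0)

# Holding strength fixed, the difference is exactly the flexibility coefficient
diff = more_flexible - base
print(f"base={base:.3f} ft, difference={diff:.3f} ft")
```

Holding the other predictor fixed while changing one is what "adjusting for" means for a regression coefficient.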

1.634 * Flexibility
For every 1 degree increase in flexibility, MEAN punt distance increases by 1.634 feet, adjusting for leg strength.

0.249 * Strength
For every 1 lb increase in strength, MEAN punt distance increases by 0.249 feet, adjusting for flexibility.

What do we mean by "adjusted for"?
If the covariates are categorical, e.g.,

Mean muscle gain % (n)
            Exercise & Diet    Exercise only
Male        20% (10)           15% (40)
Female      10% (40)            5% (10)

Mean % gain w/o adjustment for gender
Exercise & Diet: (20% x 10 + 10% x 40) / 50 = 12%
Exercise only: (15% x 40 + 5% x 10) / 50 = 13%

Mean % gain with adjustment for gender
Exercise & Diet: Male avg. x 0.5 + Female avg. x 0.5 = 20% x 0.5 + 10% x 0.5 = 15%
Exercise only: Male avg. x 0.5 + Female avg. x 0.5 = 15% x 0.5 + 5% x 0.5 = 10%

Why different?
% gain for males is 10 percentage points higher than for females in both groups => potential confounding.
However, the two groups are unbalanced in terms of gender, i.e., 80% male in the exercise-only group versus 20% male (80% female) in the exercise & diet group => this dilutes the treatment effect.

If the covariate is continuous, such as baseline age, a similar adjustment is performed based on the correlation between % gain and baseline age.
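The crude and gender-adjusted means above can be reproduced in a few lines. This is a minimal sketch using the cell means and counts from the muscle-gain table; the equal 0.5 / 0.5 gender weights are the ones used in the adjusted calculation, and the function names are assumptions made for illustration.

```python
# (mean % gain, n) per gender within each treatment group, from the table
groups = {
    "Exercise & Diet": {"Male": (20.0, 10), "Female": (10.0, 40)},
    "Exercise only": {"Male": (15.0, 40), "Female": (5.0, 10)},
}

def crude_mean(cells):
    """Unadjusted mean: cell means weighted by the observed sample sizes."""
    total_n = sum(n for _, n in cells.values())
    return sum(m * n for m, n in cells.values()) / total_n

def adjusted_mean(cells, weights):
    """Adjusted mean: cell means weighted by a common reference distribution."""
    return sum(cells[gender][0] * w for gender, w in weights.items())

equal_weights = {"Male": 0.5, "Female": 0.5}

crude = {name: crude_mean(cells) for name, cells in groups.items()}
adjusted = {name: adjusted_mean(cells, equal_weights) for name, cells in groups.items()}

for name in groups:
    print(f"{name}: crude={crude[name]:.1f}%, adjusted={adjusted[name]:.1f}%")
```

The crude comparison favors exercise only (13% vs. 12%), while the adjusted comparison favors exercise & diet (15% vs. 10%). The reversal comes entirely from applying the same gender weights to both groups.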

Graphical illustration : Adjusting for a continuous covariate

* Changes in adiponectin (a glucose-regulating protein) b/w two groups

Multiple Logistic Regression Models
The model:
$\text{Logit}(\pi) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$
where $\pi = \text{Prob}(\text{event} = 1)$ and $\text{Logit}(\pi) = \ln[\pi / (1 - \pi)]$,
or $\pi = e^{LP} / (1 + e^{LP})$, where $LP = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$

Interpretation of the coefficients in logistic regression models
For a continuous predictor, the exponentiated coefficient ($e^{\beta}$) represents the multiplicative change in the odds of Y = 1 for a one-unit change in X => the odds ratio for X+1 versus X.

Similarly, for a nominal predictor, the exponentiated coefficient represents the odds ratio for one group (X = 1) versus another (X = 0).

Remember, a multiple regression model has other covariates. Hence, the interpretation of one coefficient applies with the other covariates adjusted for.

Estimated Prob. vs. Age
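A curve of estimated probability against age follows directly from the inverse-logit formula. The sketch below uses hypothetical coefficients (intercept -5.0, age slope 0.1), which are NOT taken from the lecture, to show how the probability is computed and why the exponentiated slope equals the odds ratio for a one-year increase in age.

```python
import math

def inverse_logit(lp):
    """Convert a linear predictor LP into a probability: e^LP / (1 + e^LP)."""
    return math.exp(lp) / (1.0 + math.exp(lp))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Hypothetical coefficients for illustration only
B0, B_AGE = -5.0, 0.1

def prob_event(age):
    """Estimated probability of the event at a given age."""
    return inverse_logit(B0 + B_AGE * age)

p50 = prob_event(50)
p51 = prob_event(51)

# The odds ratio for age 51 vs. 50 equals exp(B_AGE), whatever the starting age
odds_ratio = odds(p51) / odds(p50)
print(f"P(event | age 50) = {p50:.3f}, odds ratio per year = {odds_ratio:.4f}")
```

The odds ratio is constant across ages, but the change in probability per year is not: it is largest near a probability of 0.5 and flattens toward 0 and 1, which gives the curve its S shape.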

Other Models

Ordinal logistic regression: for ordinal responses such as cancer stage I, II, III, IV. Assumes proportional odds, i.e., the same odds ratio at every cut-point between adjacent response categories.
Poisson regression: for count responses such as number of pregnancies. Overdispersion is common, and sometimes a negative binomial distribution is used instead.
Mixed models: commonly used for repeated measures ANOVA or ANCOVA. Time is used as a within-subject factor and a random factor. Mixed models are also used for nested designs.
Cox proportional hazards models: multivariable models for survival data.
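As a quick illustration of the overdispersion point for count data: a Poisson distribution has variance equal to its mean, so a sample variance far above the sample mean is a warning sign. The sketch below is a crude marginal check that ignores covariates; the counts are hypothetical and the function name is an assumption. A model-based check would use the residual dispersion of a fitted Poisson regression instead.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; about 1 under a Poisson model."""
    return variance(counts) / mean(counts)

# Hypothetical count data for illustration only
counts = [0, 0, 1, 1, 2, 2, 3, 8]

ratio = dispersion_ratio(counts)
print(f"variance / mean = {ratio:.2f}")
```

A ratio well above 1 suggests considering a negative binomial model, as noted above.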

General Linear Model vs. Generalized Linear Model (GLM)
A linear model => General Linear Model
e.g., ANOVA, ANCOVA, MANOVA, MANCOVA, linear regression, mixed model
A nonlinear model => Generalized Linear Model
e.g., logistic, ordinal logistic, Poisson
All of these use a link function for the response variable (Y), such as a logit link (logistic) or a log link (Poisson).
GEE (Generalized Estimating Equation) models are an extension of GLM.

Variable Selection Procedures

Forward
Add the new predictor that has the lowest p-value, and keep repeating this step until no more predictors can be added at the 0.05 alpha level.

Backward
Start with a full model containing all predictors, eliminate the predictor with the highest p-value, and keep repeating this procedure until no more predictors can be eliminated at the 0.05 alpha level.

Stepwise
Combination of Forward and Backward. Level of stay: 0.01 and level of entry: 0.05 are usually used.

Simulation studies suggest that Backward is the most recommendable.
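The backward procedure described above is essentially a loop. The sketch below shows only the control flow: the callback `p_values_for` is a hypothetical stand-in for refitting the regression model on the current predictors and returning their p-values, and the toy p-values and predictor names are made up for illustration.

```python
def backward_eliminate(predictors, p_values_for, alpha=0.05):
    """Drop the predictor with the highest p-value until all are <= alpha.

    p_values_for(names) must refit the model on `names` and return a dict
    mapping each name to its p-value (a hypothetical callback, supplied by the user).
    """
    current = list(predictors)
    while current:
        p_values = p_values_for(current)
        worst = max(current, key=lambda name: p_values[name])
        if p_values[worst] <= alpha:
            break  # every remaining predictor is significant at alpha
        current.remove(worst)
    return current

# Toy p-values standing in for real model refits (illustration only)
toy_p_values = {
    frozenset({"age", "bmi", "sex"}): {"age": 0.01, "bmi": 0.40, "sex": 0.08},
    frozenset({"age", "sex"}): {"age": 0.005, "sex": 0.03},
}

def toy_fit(names):
    """Look up the made-up p-values for a given set of predictors."""
    return toy_p_values[frozenset(names)]

selected = backward_eliminate(["age", "bmi", "sex"], toy_fit)
print(selected)
```

The p-values are recomputed after every removal because dropping one predictor changes the estimates for the others. In the toy data, `sex` only becomes significant once `bmi` is removed.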

Bariatric Surgery
Roux-en-Y gastric bypass, sleeve gastrectomy, gastric banding, biliopancreatic diversion.

Table 1
Figure 1
Appendix ?

Factors Associated with Achieving the Primary End Points at 3 Years