BIO503: Lecture 5 Harvard School of Public Health Wintersession 2009 Jess Mar Department of Biostatistics [email protected]


Page 1

BIO503: Lecture 5

Harvard School of Public Health Wintersession 2009

Jess Mar

Department of Biostatistics

[email protected]

Page 2

Roadmap for Today

Some More Advanced Statistical Models

Multiple Linear Regression

Generalized Linear Models

– Logistic Regression

– Poisson Regression

– Survival Analysis

Multivariate Data Analysis

Programming Tutorials

Bits & Pieces

Page 3

Tutorial 4

Page 4

Multiple Linear Regression

Some handy functions to know about:

new.model <- update(old.model, new.formula)

Model selection functions available in base R and the MASS package:

drop1, dropterm

add1, addterm

step, stepAIC

Model terms can also be assessed with chi-squared (likelihood ratio) tests via:

anova(modObj, test="Chisq")
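
As a sketch of how these pieces fit together (the names dat, y, x1, x2, x3 are hypothetical, not from the course data):

library(MASS)                                   # for stepAIC, dropterm, addterm
old.model <- lm(y ~ x1 + x2, data = dat)        # initial fit
new.model <- update(old.model, . ~ . + x3)      # add a term via update()
drop1(new.model, test = "F")                    # test each term for removal
stepAIC(new.model, direction = "both")          # stepwise search by AIC
anova(old.model, new.model, test = "Chisq")     # compare the nested fits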

Page 5

Generalized Linear Models

Linear regression models hinge on the assumption that the response variable follows a Normal distribution.

Generalized linear models handle non-Normal response variables by modelling a transformation of the mean (the link function) as a linear function of the covariates.

Page 6

Logistic Regression

When faced with a binary response Y ∈ {0, 1}, we use logistic regression.

$$\pi_i = P(Y_i = 1 \mid x_i, \beta) = \frac{\exp\!\left(\sum_j \beta_j x_{ij}\right)}{1 + \exp\!\left(\sum_j \beta_j x_{ij}\right)} = \frac{\exp(x_i^{T}\beta)}{1 + \exp(x_i^{T}\beta)}$$

where the log odds are linear in the covariates:

$$\log\!\left(\frac{P(Y_i = 1 \mid x_i, \beta)}{P(Y_i = 0 \mid x_i, \beta)}\right) = \log\!\left(\frac{\pi_i}{1 - \pi_i}\right) = x_i^{T}\beta = \sum_j \beta_j x_{ij}$$

Page 7

Problem 2 – Logistic Regression

Read in the anaesthetic data set, data file: anaesthetic.txt.

Covariates:

move    binary numeric vector for patient movement (1 = movement, 0 = no movement)

conc    anaesthetic concentration

Goal: estimate how the probability of movement varies with increasing concentration of the anesthetic agent.
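
One way to read the file, assuming anaesthetic.txt is a whitespace-delimited table with a header row in the working directory (the data frame name anesthetic matches the glm call on the next slide):

> anesthetic <- read.table("anaesthetic.txt", header=TRUE)
> str(anesthetic)
> table(anesthetic$move, anesthetic$conc)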

Page 8

Fit the Logistic Regression Model

> anes.logit <- glm(nomove ~ conc, family=binomial(link=logit), data=anesthetic)

The output summary looks like this:

> summary(anes.logit)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -6.469      2.418  -2.675  0.00748 **
conc           5.567      2.044   2.724  0.00645 **

Estimates of P(Y=1) are given by: > fitted.values(anes.logit)

Page 9

Estimating Log Odds Ratio

To get back the fitted log odds (the linear predictors):

> anes.logit$linear.predictors

> plot(anesthetic$conc, anes.logit$linear.predictors)

> abline(coefficients(anes.logit))

Looks like the odds of not moving increase markedly once the concentration of the anesthetic agent exceeds about 0.8.
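
To look at the fitted probabilities rather than the log odds, one option (a sketch reusing the objects above) is:

> plot(anesthetic$conc, fitted.values(anes.logit),
+      xlab="concentration", ylab="fitted P(no movement)")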

Page 10

Problem 3 – Multiple Logistic Regression

Read in the data set birthwt.txt.

low     indicator of birth weight less than 2.5 kg
age     mother's age in years
lwt     mother's weight in pounds at last menstrual period
race    mother's race (1 = white, 2 = black, 3 = other)
smoke   smoking status during pregnancy
ptl     number of previous premature labours
ht      history of hypertension
ui      presence of uterine irritability
ftv     number of physician visits during the first trimester
bwt     birth weight in grams

We fit a logistic regression using the glm function with the binomial family.
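
A sketch of such a fit, assuming birthwt.txt is whitespace-delimited with a header row and uses the variable names above (race is treated as a factor; bwt is left out because low is derived from it):

birthwt <- read.table("birthwt.txt", header=TRUE)
bwt.logit <- glm(low ~ age + lwt + factor(race) + smoke + ptl + ht + ui + ftv,
                 family=binomial(link=logit), data=birthwt)
summary(bwt.logit)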

Page 11

Problem 4 - Poisson Regression

Poisson regression is often used for the analysis of count data or the calculation of rates associated with a rare event or disease.

Example: schooldata.csv.

We can fit the Poisson regression model using the glm function and the poisson family.
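
A sketch of the call only; the column names cases, program and enrollment below are hypothetical stand-ins, since the variables in schooldata.csv are not listed here:

school <- read.csv("schooldata.csv")
pois.fit <- glm(cases ~ program + offset(log(enrollment)),
                family=poisson(link=log), data=school)   # the offset turns counts into rates
summary(pois.fit)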

Page 12

Survival Analysis

library(survival)

Example: aml leukemia data

Kaplan-Meier curve

fit1 <- survfit(Surv(time, status) ~ 1, data=aml[1:11, ])   # first 11 rows: one treatment group

summary(fit1)

plot(fit1)

Log-rank test

survdiff(Surv(time, status)~x, data=aml)
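
To draw the Kaplan-Meier curves for both treatment groups on one plot (a sketch using the same aml data):

fit.both <- survfit(Surv(time, status) ~ x, data=aml)
plot(fit.both, lty=1:2)
legend("topright", legend=levels(aml$x), lty=1:2)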

Page 13

Survival Analysis

Fit a Cox proportional hazards model

coxfit1 <- coxph(Surv(time, status)~x, data=aml)

summary(coxfit1)

Cumulative baseline hazard estimator:

basehaz(coxph(Surv(time, status)~x, data=aml))

Survival function for one group:

plot(survfit(coxfit1, newdata=data.frame(x="Maintained")))

Page 14

Tutorial 5

Page 15

Cluster Analysis

Hierarchical Methods:

(Agglomerative, Divisive) + (Single, Average, Complete) Linkage…

Model-based Methods:

Mixed models. Plaid models. Mixture models…

A clustering problem is generally much harder than a classification problem because we don’t know the number of classes.

Clustering observations on the basis of experiments or across a time series.

Clustering experiments together on the basis of observations.

Page 16

Examples of Clustering Algorithms Available in R

[Diagram: the gene expression data matrix, with G genes as rows and E experiments or microarray slides as columns, entries x_11 ... x_GE.]

Hierarchical Methods:

hclust

agnes

Partitioning Methods:

som

kmeans

pam

Packages:

cluster

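
A minimal sketch of how these functions are called, using a small random matrix in place of real expression data (rows stand in for genes, columns for experiments):

library(cluster)
x <- matrix(rnorm(100 * 6), nrow=100)        # 100 "genes" x 6 "experiments"
hc <- hclust(dist(x), method="average")      # agglomerative hierarchical clustering
plot(hc)                                     # dendrogram
groups <- cutree(hc, k=4)                    # cut the tree into 4 clusters
km <- kmeans(x, centers=4)                   # k-means partitioning
pm <- pam(x, k=4)                            # partitioning around medoids (cluster package)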

Page 17

Hierarchical Clustering

Agglomerative methods start with n genes in n clusters and successively merge them; divisive methods start with all n genes in 1 cluster and successively split them.

We join (or break) nodes based on the notion of maximum (or minimum) ‘similarity’.

Euclidean distance

(Pearson) correlation

Source: J-Express Manual

Page 18

Different Ways to Determine Distances Between Clusters

Single linkage

Complete linkage

Average linkage
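
In hclust, the linkage is controlled by the method argument; a sketch on a small hypothetical distance matrix:

d <- dist(matrix(rnorm(50 * 4), nrow=50))   # distances between 50 hypothetical observations
hclust(d, method="single")                  # nearest-neighbour distance between clusters
hclust(d, method="complete")                # farthest-neighbour distance between clusters
hclust(d, method="average")                 # mean pairwise distance between clusters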

Page 19

Partitioning Methods

Examples of partitioning methods are k-means and partitioning around medoids (pam).

Gap statistic:

source("http://www.bioconductor.org/biocLite.R")

biocLite("SAGx")

?gap

The goal is to choose the number of clusters that maximizes the gap statistic.
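
If SAGx is not to hand, the clusGap function in the cluster package computes the same statistic; a sketch for choosing k with k-means on hypothetical data:

library(cluster)
x <- matrix(rnorm(200 * 2), ncol=2)          # hypothetical two-dimensional data
gap <- clusGap(x, FUNcluster=kmeans, K.max=8, B=50)
plot(gap)                                    # inspect the gap curve over k = 1..8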

Page 20

K-means Clustering

W – within-cluster variance

B – between-cluster variance

Reference: J-Express manual
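
These quantities are returned directly by kmeans; a short sketch on hypothetical data:

km <- kmeans(matrix(rnorm(200 * 2), ncol=2), centers=4)
km$tot.withinss   # W: total within-cluster sum of squares
km$betweenss      # B: between-cluster sum of squares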

Page 21

[Figure: k-means clustering of 241 genes from 19 cell samples into 6 clusters.]

Page 22

Classification (Machine Learning)

Machine learning algorithms predict the classes of new observations based on patterns learned from existing data.

Classification algorithms are a form of supervised learning.

Clustering algorithms are a form of unsupervised learning.

R packages:

class – contains knn, SOM

nnet

MLInterfaces (Bioconductor) – a simplified way to construct machine learning algorithms from microarray data.

Goal: derive a rule (classifier) that assigns a new object (e.g. a patient's microarray profile) to a pre-specified group (e.g. aggressive vs non-aggressive prostate cancer).

Page 23

Classification

Linear Discriminant Analysis: lda

Support Vector Machines: svm (library e1071)

K-nearest neighbors: knn

Tree-based methods: rpart, randomForest
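
A minimal sketch of the calling pattern, using the built-in iris data (lda from MASS, knn from class):

library(MASS)      # lda
library(class)     # knn
train <- sample(nrow(iris), 100)
lda.fit <- lda(Species ~ ., data=iris[train, ])
predict(lda.fit, iris[-train, ])$class                      # LDA predictions for the held-out rows
knn(iris[train, 1:4], iris[-train, 1:4],
    cl=iris$Species[train], k=3)                            # 3-nearest-neighbour predictions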

Page 24

Scaling Methods

Principal Component Analysis

prcomp

Multi-dimensional Scaling

cmdscale (classical MDS); isoMDS in MASS

Self Organizing Maps

SOM

Independent Component Analysis

fastICA
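
A sketch of PCA with prcomp, again on the built-in iris measurements:

pc <- prcomp(iris[, 1:4], scale.=TRUE)       # centre and scale, then rotate
summary(pc)                                  # variance explained by each component
plot(pc$x[, 1], pc$x[, 2], col=iris$Species,
     xlab="PC1", ylab="PC2")                 # samples on the first two components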

Page 25

R Shortcuts

Ctrl + A: move the cursor to the start of the command line

Ctrl + E: move the cursor to the end of the command line

Ctrl + K: delete from the cursor to the end of the line

Esc: interrupt the current command

{Up, Down} Arrow: scroll back and forward through the command history

Page 26

Laundry List

.Rprofile file

Outline of R packages

Graphics – lattice, Rwiki

Homework

R/SAS/Stata Comparison

Exercises