BIO503: Lecture 5 Harvard School of Public Health Wintersession 2009 Jess Mar Department of Biostatistics [email protected]


Page 1

BIO503: Lecture 5

Harvard School of Public Health Wintersession 2009

Jess Mar

Department of Biostatistics

[email protected]

Page 2

Roadmap for Today

Some More Advanced Statistical Models

Multiple Linear Regression

Generalized Linear Models

– Logistic Regression

– Poisson Regression

– Survival Analysis

Multivariate Data Analysis

Programming Tutorials

Bits & Pieces

Page 3

Tutorial 4

Page 4

Multiple Linear Regression

Some handy functions to know about:

new.model <- update(old.model, new.formula)

Model selection functions available in base R and the MASS package:

drop1, dropterm

add1, addterm

step, stepAIC

Model terms can also be assessed with chi-squared (likelihood ratio) tests via:

anova(modObj, test="Chisq")
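
As a sketch of how these pieces fit together (the names dat, y, x1, x2, x3 are hypothetical, not from the course data):

library(MASS)                                   # for stepAIC, dropterm, addterm
old.model <- lm(y ~ x1 + x2, data = dat)        # initial fit
new.model <- update(old.model, . ~ . + x3)      # add a term via update()
drop1(new.model, test = "F")                    # test each term for removal
stepAIC(new.model, direction = "both")          # stepwise search by AIC
anova(old.model, new.model, test = "Chisq")     # compare the nested fits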

Page 5

Generalized Linear Models

Linear regression models hinge on the assumption that the response variable follows a Normal distribution.

Generalized linear models handle non-Normal response variables by modelling a transformation of the mean (the link function) as a linear function of the covariates.

Page 6

Logistic Regression

When faced with a binary response Y ∈ {0, 1}, we use logistic regression.

$$\pi_i = P(Y_i = 1 \mid x_i, \beta) = \frac{\exp\!\left(\sum_j \beta_j x_{ij}\right)}{1 + \exp\!\left(\sum_j \beta_j x_{ij}\right)} = \frac{\exp(x_i^{T}\beta)}{1 + \exp(x_i^{T}\beta)}$$

where the log odds are linear in the covariates:

$$\log\!\left(\frac{P(Y_i = 1 \mid x_i, \beta)}{P(Y_i = 0 \mid x_i, \beta)}\right) = \log\!\left(\frac{\pi_i}{1 - \pi_i}\right) = x_i^{T}\beta = \sum_j \beta_j x_{ij}$$

Page 7

Problem 2 – Logistic Regression

Read in the anaesthetic data set, data file: anaesthetic.txt.

Covariates:

move    binary numeric vector for patient movement (1 = movement, 0 = no movement)

conc    anaesthetic concentration

Goal: estimate how the probability of movement varies with increasing concentration of the anesthetic agent.
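
One way to read the file, assuming anaesthetic.txt is a whitespace-delimited table with a header row in the working directory (the data frame name anesthetic matches the glm call on the next slide):

> anesthetic <- read.table("anaesthetic.txt", header=TRUE)
> str(anesthetic)
> table(anesthetic$move, anesthetic$conc)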

Page 8

Fit the Logistic Regression Model

> anes.logit <- glm(nomove ~ conc, family=binomial(link=logit), data=anesthetic)

The output summary looks like this:

> summary(anes.logit)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -6.469      2.418  -2.675  0.00748 **
conc           5.567      2.044   2.724  0.00645 **

Estimates of P(Y=1) are given by: > fitted.values(anes.logit)

Page 9

Estimating Log Odds Ratio

To get back the fitted log odds (the linear predictors):

> anes.logit$linear.predictors

> plot(anesthetic$conc, anes.logit$linear.predictors)

> abline(coefficients(anes.logit))

Looks like the odds of not moving increase markedly once the concentration of the anesthetic agent exceeds about 0.8.
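
To look at the fitted probabilities rather than the log odds, one option (a sketch reusing the objects above) is:

> plot(anesthetic$conc, fitted.values(anes.logit),
+      xlab="concentration", ylab="fitted P(no movement)")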

Page 10

Problem 3 – Multiple Logistic Regression

Read in the data set birthwt.txt.

low     indicator of birth weight less than 2.5 kg
age     mother's age in years
lwt     mother's weight in pounds at last menstrual period
race    mother's race (1 = white, 2 = black, 3 = other)
smoke   smoking status during pregnancy
ptl     number of previous premature labours
ht      history of hypertension
ui      presence of uterine irritability
ftv     number of physician visits during the first trimester
bwt     birth weight in grams

We fit a logistic regression using the glm function with the binomial family.
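
A sketch of such a fit, assuming birthwt.txt is whitespace-delimited with a header row and uses the variable names above (race is treated as a factor; bwt is left out because low is derived from it):

birthwt <- read.table("birthwt.txt", header=TRUE)
bwt.logit <- glm(low ~ age + lwt + factor(race) + smoke + ptl + ht + ui + ftv,
                 family=binomial(link=logit), data=birthwt)
summary(bwt.logit)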

Page 11

Problem 4 - Poisson Regression

Poisson regression is often used for the analysis of count data or the calculation of rates associated with a rare event or disease.

Example: schooldata.csv.

We can fit the Poisson regression model using the glm function and the poisson family.
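
A sketch of the call only; the column names cases, program and enrollment below are hypothetical stand-ins, since the variables in schooldata.csv are not listed here:

school <- read.csv("schooldata.csv")
pois.fit <- glm(cases ~ program + offset(log(enrollment)),
                family=poisson(link=log), data=school)   # the offset turns counts into rates
summary(pois.fit)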

Page 12

Survival Analysis

library(survival)

Example: aml leukemia data

Kaplan-Meier curve

fit1 <- survfit(Surv(time, status) ~ 1, data=aml[1:11, ])   # first 11 rows: one treatment group

summary(fit1)

plot(fit1)

Log-rank test

survdiff(Surv(time, status)~x, data=aml)
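
To draw the Kaplan-Meier curves for both treatment groups on one plot (a sketch using the same aml data):

fit.both <- survfit(Surv(time, status) ~ x, data=aml)
plot(fit.both, lty=1:2)
legend("topright", legend=levels(aml$x), lty=1:2)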

Page 13

Survival Analysis

Fit a Cox proportional hazards model

coxfit1 <- coxph(Surv(time, status)~x, data=aml)

summary(coxfit1)

Cumulative baseline hazard estimator:

basehaz(coxph(Surv(time, status)~x, data=aml))

Survival function for one group:

plot(survfit(coxfit1, newdata=data.frame(x="Maintained")))

Page 14

Tutorial 5

Page 15

Cluster Analysis

Hierarchical Methods:

(Agglomerative, Divisive) + (Single, Average, Complete) Linkage…

Model-based Methods:

Mixed models. Plaid models. Mixture models…

A clustering problem is generally much harder than a classification problem because we don’t know the number of classes.

Clustering observations on the basis of experiments or across a time series.

Clustering experiments together on the basis of observations.

Page 16

Examples of Clustering Algorithms Available in R

[Diagram: the gene expression data matrix, with G genes as rows and E experiments or microarray slides as columns, entries x_11 ... x_GE.]

Hierarchical Methods:

hclust

agnes

Partitioning Methods:

som

kmeans

pam

Packages:

cluster

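
A minimal sketch of how these functions are called, using a small random matrix in place of real expression data (rows stand in for genes, columns for experiments):

library(cluster)
x <- matrix(rnorm(100 * 6), nrow=100)        # 100 "genes" x 6 "experiments"
hc <- hclust(dist(x), method="average")      # agglomerative hierarchical clustering
plot(hc)                                     # dendrogram
groups <- cutree(hc, k=4)                    # cut the tree into 4 clusters
km <- kmeans(x, centers=4)                   # k-means partitioning
pm <- pam(x, k=4)                            # partitioning around medoids (cluster package)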

Page 17

Hierarchical Clustering

Agglomerative methods start with n genes in n clusters and successively merge them; divisive methods start with all n genes in 1 cluster and successively split them.

We join (or break) nodes based on the notion of maximum (or minimum) ‘similarity’.

Euclidean distance

(Pearson) correlation

Source: J-Express Manual

Page 18

Different Ways to Determine Distances Between Clusters

Single linkage

Complete linkage

Average linkage
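
In hclust, the linkage is controlled by the method argument; a sketch on a small hypothetical distance matrix:

d <- dist(matrix(rnorm(50 * 4), nrow=50))   # distances between 50 hypothetical observations
hclust(d, method="single")                  # nearest-neighbour distance between clusters
hclust(d, method="complete")                # farthest-neighbour distance between clusters
hclust(d, method="average")                 # mean pairwise distance between clusters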

Page 19

Partitioning Methods

Examples of partitioning methods are k-means and partitioning around medoids (pam).

Gap statistic:

source("http://www.bioconductor.org/biocLite.R")

biocLite("SAGx")

?gap

The goal is to choose the number of clusters that maximizes the gap statistic.
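
If SAGx is not to hand, the clusGap function in the cluster package computes the same statistic; a sketch for choosing k with k-means on hypothetical data:

library(cluster)
x <- matrix(rnorm(200 * 2), ncol=2)          # hypothetical two-dimensional data
gap <- clusGap(x, FUNcluster=kmeans, K.max=8, B=50)
plot(gap)                                    # inspect the gap curve over k = 1..8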

Page 20

K-means Clustering

W – within-cluster variance

B – between-cluster variance

Reference: J-Express manual
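
These quantities are returned directly by kmeans; a short sketch on hypothetical data:

km <- kmeans(matrix(rnorm(200 * 2), ncol=2), centers=4)
km$tot.withinss   # W: total within-cluster sum of squares
km$betweenss      # B: between-cluster sum of squares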

Page 21

[Figure: k-means clustering of 241 genes from 19 cell samples into 6 clusters.]

Page 22

Classification (Machine Learning)

Machine learning algorithms predict the classes of new observations based on patterns learned from existing data.

Classification algorithms are a form of supervised learning.

Clustering algorithms are a form of unsupervised learning.

R packages:

class – contains knn, SOM

nnet

MLInterfaces (Bioconductor) – a simplified way to construct machine learning algorithms from microarray data.

Goal: derive a rule (classifier) that assigns a new object (e.g. a patient's microarray profile) to a pre-specified group (e.g. aggressive vs non-aggressive prostate cancer).

Page 23

Classification

Linear Discriminant Analysis: lda

Support Vector Machines: svm (library e1071)

K-nearest neighbors: knn

Tree-based methods: rpart, randomForest
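
A minimal sketch of the calling pattern, using the built-in iris data (lda from MASS, knn from class):

library(MASS)      # lda
library(class)     # knn
train <- sample(nrow(iris), 100)
lda.fit <- lda(Species ~ ., data=iris[train, ])
predict(lda.fit, iris[-train, ])$class                      # LDA predictions for the held-out rows
knn(iris[train, 1:4], iris[-train, 1:4],
    cl=iris$Species[train], k=3)                            # 3-nearest-neighbour predictions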

Page 24

Scaling Methods

Principal Component Analysis

prcomp

Multi-dimensional Scaling

cmdscale (classical MDS); isoMDS in MASS

Self Organizing Maps

SOM

Independent Component Analysis

fastICA
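
A sketch of PCA with prcomp, again on the built-in iris measurements:

pc <- prcomp(iris[, 1:4], scale.=TRUE)       # centre and scale, then rotate
summary(pc)                                  # variance explained by each component
plot(pc$x[, 1], pc$x[, 2], col=iris$Species,
     xlab="PC1", ylab="PC2")                 # samples on the first two components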

Page 25

R Shortcuts

Ctrl + A: move the cursor to the start of the command line

Ctrl + E: move the cursor to the end of the command line

Ctrl + K: delete from the cursor to the end of the line

Esc: interrupt the current command

{Up, Down} Arrow: scroll back and forward through the command history

Page 26

Laundry List

.Rprofile file

Outline of R packages

Graphics – lattice, Rwiki

Homework

R/SAS/Stata Comparison

Exercises