
  • Stat 101: Lecture 5

    Summer 2006

  • Outline

    Review and Questions

    Causation and Association

    Regression

    Quiz

  • The correlation

    r measures the strength of the linear association between X and Y values in a scatterplot.

    - r ∈ [−1, 1].
    - r = 1 iff all points lie on a line with positive slope.
    - r = −1 iff all points lie on a line with negative slope.
    - |r| ≠ 0 does not imply a causal relationship between X and Y.
    - r², the coefficient of determination, is the proportion of variation in Y that is explained by X; see the sketch below.
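
    A minimal Python sketch (not part of the original lecture; the data are made up) computing r and r² for a small dataset:

        import numpy as np

        # Hypothetical data: hours studied (x) and exam score (y)
        x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
        y = np.array([55.0, 62.0, 70.0, 74.0, 85.0])

        r = np.corrcoef(x, y)[0, 1]   # correlation coefficient
        r2 = r ** 2                   # proportion of variation in y explained by x
        print(f"r = {r:.4f}, r^2 = {r2:.4f}")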

  • Some Examples

  • Causation and Association

    Correlations are often high when some factor affects both X and Y. ⇒ Association!

    - GPA and SAT scores are both affected by study style.
    - GPA and number of sleeping hours are both affected by lifestyle.

    Correlation alone does not establish causation.

    Sometimes there is a causal relationship, e.g. hours of study and GPA.

  • Ecological Correlations

    An ecological correlation occurs when X or Y or both are averages, proportions, or percentages for groups.

    The scatterplot of lung cancer rate against the proportion of smokers for 11 countries shows a positive linear trend.

    What is the confounding? (In each country, it may be the non-smokers who are getting the lung cancer.)

    The lesson: avoid working with averages whenever individual-level data are available! See the simulation sketch below.
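
    A minimal Python sketch (not part of the original lecture; the groups and noise levels are made up) showing how averaging within groups can turn a weak individual-level correlation into a strong ecological one:

        import numpy as np

        rng = np.random.default_rng(0)

        # 11 hypothetical "countries", 100 individuals each; X and Y share
        # a country-level component but have large individual noise.
        n_groups, n_per = 11, 100
        base = rng.uniform(0, 10, size=n_groups)
        x = np.repeat(base, n_per) + rng.normal(0, 5, n_groups * n_per)
        y = np.repeat(base, n_per) + rng.normal(0, 5, n_groups * n_per)

        r_individual = np.corrcoef(x, y)[0, 1]    # weak

        # Ecological correlation: correlate the 11 country averages
        xb = x.reshape(n_groups, n_per).mean(axis=1)
        yb = y.reshape(n_groups, n_per).mean(axis=1)
        r_ecological = np.corrcoef(xb, yb)[0, 1]  # much stronger

        print(f"individual r = {r_individual:.2f}, ecological r = {r_ecological:.2f}")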

  • An example

    A study showed that cantons in Switzerland with high literacy rates also had high suicide rates.

    This is an ecological correlation. What might be wrong?

    What kinds of studies would help to establish causation?

  • Regression

    Some terms:

    - Y: response variable or dependent variable.

    - X: explanatory variable, independent variable, covariate, predictor.

    - Multiple regression has more than one explanatory variable.

  • SD Line

    - The SD line passes through (X̄, Ȳ) with slope SDy/SDx (taking the sign of r). The regression line also passes through (X̄, Ȳ), but with slope r·SDy/SDx.

    - Regression fallacy (Page 169, § 4): the regression effect in a test-retest example.

      Regressing father on son gives a different line from regressing son on father. E.g., for a man of 5'10", the best estimate of his weight is 170 pounds, but that does not mean that for a man of 170 pounds the best estimate of his height is 5'10". Each prediction is pulled toward the mean.

      slopes: r·SDf/SDs versus r·SDs/SDf (illustrated in the sketch below)

      ⇒ each regression line is less steep than expected, i.e. less steep than the SD line.

    - A baseball player who performs exceptionally well in the first half of a season tends to perform worse in the second; the chance of performing exceptionally well in both halves is small.
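
    A minimal Python sketch (not part of the original lecture; the simulated heights are made up) showing that the two regression slopes are each r times an SD ratio, so both lines are flatter than the corresponding SD line:

        import numpy as np

        rng = np.random.default_rng(1)

        # Hypothetical father/son heights (inches), positively correlated
        father = rng.normal(68, 2.7, size=1000)
        son = 68 + 0.5 * (father - 68) + rng.normal(0, 2.3, size=1000)

        r = np.corrcoef(father, son)[0, 1]
        sd_f, sd_s = father.std(), son.std()

        print(f"SD line slope (son vs father):  {sd_s / sd_f:.3f}")
        print(f"regression of son on father:    {r * sd_s / sd_f:.3f}")
        print(f"regression of father on son:    {r * sd_f / sd_s:.3f}")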

  • Regression Line

    - What is regression?

      It tries to fit the "best" line to the data.

    - What is regression used for?

      It predicts the average value of Y for a specific value of X.

    - The line represents a population average, not any single individual.

    - Regression effect.

  • Mathematical model for regression

    - Each point (Xi, Yi) in the scatterplot satisfies:

      Yi = a + b·Xi + εi

    - εi ~ N(0, sd = σ), where σ is usually unknown. The ε's have nothing to do with one another (they are independent); e.g., a big εi does not imply a big εj.

    - We know the Xi's exactly. This implies that all error occurs in the vertical direction. (A simulation sketch follows.)
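
    A minimal Python sketch (not part of the original lecture; a, b, and σ are made-up values) simulating data from this model:

        import numpy as np

        rng = np.random.default_rng(2)

        a, b, sigma = 2.0, 0.5, 1.0               # hypothetical true values
        x = np.linspace(0, 10, 50)                # the Xi's are known exactly
        eps = rng.normal(0, sigma, size=x.size)   # independent N(0, sigma) errors
        y = a + b * x + eps                       # all error is vertical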

  • Estimating the regression line

    ei = Yi − (a + b·Xi) is called the i-th residual. It measures the vertical distance from a point to the line. One estimates a and b by minimizing

      f(a, b) = Σ(i=1..n) (Yi − (a + b·Xi))²

    Taking the derivatives of f(a, b) with respect to a and b and setting them to 0 gives

      â = Ȳ − b̂·X̄;   b̂ = [(1/n)·Σ(i=1..n) Xi·Yi − X̄·Ȳ] / [(1/n)·Σ(i=1..n) Xi² − X̄²]

    f(a, b) is also referred to as the Sum of Squared Errors (SSE). The estimates are implemented in the sketch below.
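
    A minimal Python sketch (not part of the original lecture) implementing these closed-form estimates; the result can be checked against np.polyfit(x, y, 1):

        import numpy as np

        def least_squares(x, y):
            """Return (a_hat, b_hat) minimizing the sum of squared errors."""
            x_bar, y_bar = x.mean(), y.mean()
            # b_hat = [(1/n)·ΣXiYi − X̄·Ȳ] / [(1/n)·ΣXi² − X̄²]
            b_hat = ((x * y).mean() - x_bar * y_bar) / ((x**2).mean() - x_bar**2)
            a_hat = y_bar - b_hat * x_bar          # â = Ȳ − b̂·X̄
            return a_hat, b_hat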

  • An example

    A biologist wants to predict brain weight from body weight, based on a sample of 62 mammals. A scatterplot of the data is shown (figure not reproduced here). Ecological correlation?

  • The regression equation is

    Y = 90.996 + 0.966·X

    - The correlation is 0.9344, but it is heavily influenced by a few outliers.

    - The sd of the residuals is 334.721. This is the typical vertical distance of a point from the regression line.

    - Under the "Parameter Estimates" portion of the printout, the last column tells whether the intercept and slope are significantly different from 0. Small numbers indicate significant differences; values less than 0.05 are usually taken to indicate real differences from zero, as opposed to chance error. (A Python sketch of such a printout follows.)
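
    A hedged sketch (not from the original lecture, which used JMP) of how such a printout can be produced in Python with statsmodels; the arrays below are illustrative stand-ins, not the course's actual dataset:

        import numpy as np
        import statsmodels.api as sm

        # Hypothetical body weights (kg) and brain weights (g) for a few mammals
        body = np.array([1.35, 465.0, 36.33, 27.66, 1.04, 2547.0])
        brain = np.array([8.1, 423.0, 119.5, 115.0, 5.5, 4603.0])

        X = sm.add_constant(body)     # adds the intercept column
        fit = sm.OLS(brain, X).fit()
        print(fit.summary())          # includes estimates, sds, and p-values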

  • The root mean squared error (RMSE) is the standard deviation of the vertical distances between each point and the estimated line. It is an estimate of the standard deviation of the vertical distances between the observations and the true line. Formally,

      RMSE = √[ (1/n)·Σ(i=1..n) (Yi − (â + b̂·Xi))² ]

    - Note that â + b̂·Xi is the estimated mean of the Y-values at Xi. The RMSE computation appears in the sketch below.
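
    A one-function Python sketch (not part of the original lecture) computing the RMSE from fitted coefficients:

        import numpy as np

        def rmse(x, y, a_hat, b_hat):
            """Typical vertical distance between the points and the fitted line."""
            residuals = y - (a_hat + b_hat * x)
            return np.sqrt((residuals ** 2).mean())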

  • The regression line predicts the average value of the Y-values at a given X.

    - In practice, one often wants to predict an individual value for a particular value of X. E.g., if my weight is 50 kg, then how much would my brain weigh?

    The prediction (in grams) is

      Ŷ = â + b̂·X = 90.996 + 0.966 × 50 ≈ 139.3

    - But this is just the average for all mammals who weigh as much as I do.

  • The individual value is less certain than the average value. To predict the average value, the only source of uncertainty is the exact location of the regression line (i.e., â and b̂ are estimates of the true intercept and slope).

    - In order to predict my brain weight, the uncertainty about my deviation from the average is added to the uncertainty about the location of the line; see the sketch below.

    - For example, if I weigh 50 kg, then my brain should weigh about 139.3 g + ε. Assuming the regression model is correct, ε has a normal distribution with mean zero and standard deviation 334.721.

    - Note: with this model, my brain could easily have "negative" weight. This should make us question the regression assumptions.
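
    A rough Python sketch of an individual-level prediction interval; the formula is the standard one for simple linear regression (not stated on these slides) and combines uncertainty about the line with the individual's scatter around it:

        import numpy as np

        def prediction_interval(x, y, a_hat, b_hat, x_new, z=1.96):
            """Approximate 95% interval for an individual Y at x_new."""
            n = x.size
            resid = y - (a_hat + b_hat * x)
            s = np.sqrt((resid ** 2).sum() / (n - 2))   # df-adjusted residual sd
            # "1" = individual scatter; "1/n" and the leverage term = line uncertainty
            se = s * np.sqrt(1 + 1/n + (x_new - x.mean())**2 / ((x - x.mean())**2).sum())
            center = a_hat + b_hat * x_new
            return center - z * se, center + z * se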

  • Transformations

    - The scatterplot of brain weight against body weight showed that the line was probably controlled by a few large values (high-leverage points). Even worse, the scatterplot did not resemble the football-shaped point cloud that supports the regression assumptions listed before.

    - In cases like this, one can consider transforming the response variable, the explanatory variable, or both. For this data, consider taking the base-10 logarithm of both brain weight and body weight.

    - The scatterplot is much better.

  • Taking logs shows that the outliers are not surprising. The regression equation is now:

      log Y = 0.908 + 0.763·log X

    - Now 91.23% of the variation in log brain weight is explained by log body weight. Both the intercept and the slope are highly significant. The estimated sd of ε is 0.317. This is the typical vertical distance between a point and the line.

    - Making transformations is an art. Here the analysis suggests that

      Y = 8.1 × X^0.763

    - So there is a power-law relationship between brain mass and body mass; the back-transformation is worked out below.
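
    The back-transformation, as a short worked step (the 8.1 comes from exponentiating the intercept, 10^0.908 ≈ 8.1):

      log Y = 0.908 + 0.763·log X
      ⇒ Y = 10^(0.908 + 0.763·log X) = 10^0.908 · X^0.763 ≈ 8.1 · X^0.763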

  • Extrapolation

    - Predicting Y values for X values outside the range of X values observed in the data is called extrapolation.

    - This is risky, because you have no evidence that the linear relationship seen in the scatterplot continues to hold in the new X region. Extrapolated values can be entirely wrong.

    - It would be unreliable to predict the brain weight of a blue whale or of a hog-nosed bat.

  • Residuals

    - Estimate the regression line (using JMP software or by calculating â and b̂ by hand).

    - Then find the difference between each observed Yi and the predicted value Ŷi from the fitted line. These differences are called the residuals.

    - Plot each difference against the corresponding Xi value. This plot is called a residual plot.

  • If the assumptions for linear regression hold, what should one see in the residual plot? An unstructured horizontal band around zero; see the sketch after this list.

    If the pattern of the residuals around the horizontal line at zero is:

    - curved, then the assumption of linearity is violated;
    - fan-shaped, then the assumption of constant sd is violated;
    - filled with many outliers, then again the assumption of constant sd is violated;
    - showing a pattern (e.g. positive, negative, positive, negative), then the assumption of independent errors is violated.
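
    A minimal Python sketch (not part of the original lecture) that draws a residual plot; the styling choices are arbitrary:

        import numpy as np
        import matplotlib.pyplot as plt

        def residual_plot(x, y, a_hat, b_hat):
            """Plot residuals against X; look for curvature, fanning, or trends."""
            residuals = y - (a_hat + b_hat * x)
            plt.scatter(x, residuals)
            plt.axhline(0, color="gray")   # reference line at zero
            plt.xlabel("X")
            plt.ylabel("residual")
            plt.show()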

  • When the residuals have a histogram that looks normal and the residual plot shows no pattern, we can use the normal distribution to make inferences about individuals.

    - Suppose we do not make the log transformation. What percentage of 20-kilogram mammals have brains that weigh more than 180 grams?

    - The regression equation says that the mean brain weight for 20-kilogram animals is 90.996 + 0.966 × 20 = 110.33. The sd of the residuals is 334.721. Under the regression assumptions, the 20-kilogram mammals have brain weights that are normally distributed with mean 110.33 and sd 334.721.

    - The z-transformation is (180 − 110.33) / 334.72 = 0.208. From the table (which gives the area between −z and z, about 15.85% for z ≈ 0.2), the area under the curve to the right of 0.208 is (100 − 15.85) / 2 = 42.075%; a quick numerical check follows.
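
    A quick Python check of this arithmetic (not part of the original lecture); scipy's norm.sf gives the upper-tail area directly, and differs slightly from the table answer because the table rounds z to 0.20:

        from scipy.stats import norm

        mean = 90.996 + 0.966 * 20    # predicted mean brain weight at 20 kg
        sd = 334.721                  # sd of the residuals
        z = (180 - mean) / sd         # ≈ 0.208
        print(norm.sf(z))             # ≈ 0.418, vs. 42.075% from the table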

  • Quiz instruction

    - The quiz is 20 minutes long, with ten questions based on the lecture notes.

    - You may use a calculator.
    - The normal table is on Page A-105.
    - You may bring one page of notes.

  • Enjoy your PIRATE
