Unit I: Introduction to simple linear regression

© Judith D. Singer, Harvard Graduate School of Education


Page 1: Unit I: Introduction to simple linear regression

Page 2: Unit I: Introduction to simple linear regression

The S-030 roadmap: Where's this unit in the big picture?

Building a solid foundation
• Unit 1: Introduction to simple linear regression
• Unit 2: Correlation and causality

Mastering the subtleties
• Unit 3: Inference for the regression model
• Unit 4: Regression assumptions: Evaluating their tenability
• Unit 5: Transformations to achieve linearity

Adding additional predictors
• Unit 6: The basics of multiple regression
• Unit 7: Statistical control in depth: Correlation and collinearity

Generalizing to other types of predictors and effects
• Unit 8: Categorical predictors I: Dichotomies
• Unit 9: Categorical predictors II: Polychotomies
• Unit 10: Interaction and quadratic effects

Pulling it all together
• Unit 11: Regression modeling in practice

Page 3: Unit I: Introduction to simple linear regression

In this unit, we're going to learn about…

• The 3 trinities for describing research: the 3 types of variables, predictors, and research questions
• Statistical models and how they differ from deterministic models
• Examining predictor and outcome distributions and scatterplots
• Mathematically representing the population model and interpreting its components
– Using sample data to motivate a hypothesized population linear regression model
– Assumptions made in postulating the simple linear regression model
• Fitting the model to data—understanding the method of least squares
• Residuals—definitions and interpretations
• Uses of the fitted regression model
– How the fitted regression model helps improve our predictions
• Explained variation—what the R² statistic is (and what it is not)
• Using the analysis of variance to estimate the mean square error (MSE)

Page 4: Unit I: Introduction to simple linear regression

The continuing consequences of segregation
Charles, Dinwiddie, and Massey (2004), Social Science Quarterly

RQ: “We seek to determine whether the high levels of African-American residential segregation experienced have continuing academic consequences”

Hypothesis: “Because segregation works to concentrate poverty and the social problems associated with it, the friends and relatives of African-American students face an elevated risk of stressful life events, which undermine grade performance”

Sample: Representative sample of 3,924 students—n’s by race/ethnicity—who participated in the National Longitudinal Survey of Freshmen (NLSF)

Target population: African American, Latino, Asian and White undergraduates at 28 selective US colleges & universities

Analytic approach: “Estimate a regression model to connect segregation to academic performance through the intervening variable of family stress.”

Variables:
• Student race/ethnicity
• Segregation of the HS neighborhood
• Family SES—education, $, etc.
• Stressful life events during college
• College GPA

Results: “African-American students from segregated neighborhoods experience higher levels of family stress than others. This stress is largely a function of violence and disorder in segregated neighborhoods. Students respond by devoting more time to family issues and their health and grades suffer as a result”

Charles, C.Z., Dinwiddie, G., & Massey, D.S. 2004. The continuing consequences of segregation: Family stress and college academic performance. Social Science Quarterly, 85(5): 1353-1373.

Page 5: Unit I: Introduction to simple linear regression

Gray peril or loyal support?
Berkman & Plutzer (2004), Social Science Quarterly

RQ: “Do large concentrations of elderly represent a ‘gray peril’ to maintaining adequate educational expenditures?”

Hypothesis: “The gray peril hypothesis is a misleading caricature of more complex political dynamics…not equally applicable to all elderly. Expenditures will decline as the concentration of newly arrived elderly increases; high concentrations of longstanding elderly will have no effect or result in expenditure increases”

Target population: All fiscally independent US school districts with > 35 students in 1989-1990.

Sample: All 9,129 districts that met this criterion.

Analytic approach: "We regress per-pupil expenditures on the percentage of the population over 60…and add a series of economic and demographic controls"

Variables:
• Pct of district residents who are > 60
• Pct who are newly arrived elderly
• Pct who are longstanding elderly
• SES and demographic controls
• Per pupil expenditure (PPE)

Results: “Older residents represent a source of support for educational expenditures while elderly migrants lower spending. … The gray peril hypothesis … must be rejected”


Berkman, M.B., & Plutzer, E. 2004. Gray peril or loyal support? The effects of the elderly on educational expenditures. Social Science Quarterly, 85(5): 1178-1192.

Page 6: Unit I: Introduction to simple linear regression

The 3 trinities for describing research
The 3 types of Variables, Predictors, and Research Questions (RQs)

The 3 types of variables:
• Outcomes: variables used to measure the predictors' effects
• Question predictors: variables whose effects you want to study
• Covariates: variables whose effects you want to 'control'

The 3 types of predictors:
• Innovations and interventions (e.g., vouchers, a new curriculum)
• Potentially changeable characteristics (e.g., class size, per pupil expenditures)
• Fixed attributes (e.g., race, gender)

The 3 types of research questions:
• Descriptive RQs: provide descriptive statistics for an outcome
• Relational RQs: identify relationships between a predictor and an outcome
• Causal RQs: demonstrate a predictor's causal impact on an outcome

Page 7: Unit I: Introduction to simple linear regression

Models: Simplified representations of relationships between variables

Outcome = Systematic component + Residual

Goal 1: Identify the systematic components and determine how they fit the data
Goal 2: Assess how well we did by examining the magnitude of the residuals

Mathematical models: modeling geometric shapes (e.g., squares)
• Perimeter = 4(side)
• Area = (side)²
Mathematical models are deterministic—
• Some are linear; some nonlinear, but…
• All squares behave this way—once we know the "rule," we can use it to fit the model to data perfectly.

Statistical models: modeling people, organizations, …, any type of social unit—all the kinds of models we expect to develop and fit to data. Statistical models must allow for:
• Other systematic components (not included in the model or not measured)
• Measurement error
• Individual variation
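To make the contrast concrete, here is a minimal SAS sketch—not from the original deck; the dataset name demo, the coefficients, and the seed are invented for illustration—that generates one deterministic outcome and one statistical outcome from the same systematic component:

data demo;                                    * hypothetical illustration only;
  do i = 1 to 53;
    x = 70 + 60*ranuni(1234);                 * a made-up predictor spread over roughly 70-130;
    y_math = 10 + 0.9*x;                      * deterministic rule: every case falls exactly on the line;
    y_stat = 10 + 0.9*x + 7*rannor(1234);     * statistical model: systematic component + random error;
    output;
  end;
run;

Plotted against x, y_math traces a perfect line, while y_stat scatters around that line—the scatter is the residual component a statistical model must allow for.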

Page 8: Unit I: Introduction to simple linear regression

How do we "do" statistical modeling?

Step 1: Articulate your RQs in terms of outcomes, question predictors, and covariates (RQs often also specify the target population and sample)…this is a matter of substance
Step 2: Postulate a statistical model and fit the model to sample data…what we'll discuss in this unit
Step 3: Determine whether the relationship we think we found in this sample is happenstance or whether we think it really exists in the population…what we'll discuss in Unit 3

Population and sample:
• Population: the group of interest—the target population about which you want to make inferences
• Sample: the group that you will study in your research—a subset of the larger population
We draw a sample from the population and then make inferences from the sample back to the population.

Page 9: Unit I: Introduction to simple linear regression

Clarifying the "standard" terminology

Term: Outcome
Synonyms: response; dependent variable; left-hand side variable; Y
Definition: variable whose behavior we are trying to explain

Term: Predictor
Synonyms: independent variable; right-hand side variable; X
Definition: variable we are using to explain the variation in the outcome

Term: Relationship
Synonyms: association; correlation; covariation
Definition: how two variables relate to each other, without implying causality

Let's get started by studying one of the oldest social science RQs: nature vs. nurture…

Page 10: Unit I: Introduction to simple linear regression

Studying the origins of "natural ability": Meet Sir Francis Galton
(16 February 1822 – 17 January 1911)

Research interest: "Those qualifications of intellect and disposition which … lead to reputation"

Galton's "genetic utopia": "Bright, healthy individuals were treated and paid well, and encouraged to have plenty of children. Social undesirables were treated with reasonable kindness so long as they worked hard and stayed celibate."

He didn't have data on "intelligence," so he instead studied HEIGHT.

• Although a self-proclaimed genius who wrote that he could read at 2½, write and do arithmetic at 4, and was comfortable with Latin texts at 8, he couldn't figure out how to model these data(!)
• He went to JD Dickson, a mathematician at Cambridge, who formalized the relationship by developing what we now know as linear regression

Page 11: Unit I: Introduction to simple linear regression

From physical attributes to mental abilities: Meet Sir Cyril Burt
(3 March 1883 – 10 October 1971)

Research interest: Heritability of IQ. Can we predict the IQs of identical twins raised in "foster" (adoptive) homes from the IQs of their siblings raised in their biological parents' homes?

• Over a 30-year period, he and two RAs—Miss Howard and Miss Conway—accrued data on 53 pairs of separated twins
– 15 pairs in 1943
– Up to 21 pairs in 1955
– Up to 53 pairs in 1966
• "'Intelligence', when adequately assessed, is largely dependent on genetic constitution" (Burt, 1966)

Growing accusations
• In 1973, Arthur Jensen, a supporter of Burt, noted "misprints and inconsistencies in some of the data"
• In 1974, Leon Kamin noted how odd it was that Burt's correlation coefficients remained virtually unchanged as the sample size increased (r=.770, r=.771, and r=.771)
• In 1976, a London Sunday Times reporter tried to find the RAs and concluded that they did not exist
• In 1979, The British Journal of Psychology added the following notice to Burt's 1966 paper: "The attention of readers of the Journal is drawn to the fact that it has now been established that this paper contains spurious data"
• In 1995, an edited volume with 5 essays, Cyril Burt: Fraud or Framed? (Oxford), found evidence of sloppy writing and cutting and pasting of text, but perhaps not fraudulent data
• Debate continues to this day—and with Burt long dead, the conclusion may be that we'll never know
• Much more info under "Supplemental Resources" on the S-030 website

Page 12: Unit I: Introduction to simple linear regression

IQ scores for Cyril Burt's identical twins reared apart
Results of PROC PRINT

RQ: What's the relationship between the IQ of the child raised in an adoptive home and that of his/her identical twin raised in the birth home?

Predictor (X): OwnIQ    Outcome (Y): FostIQ    n = 53

ID OwnIQ FostIQ     ID OwnIQ FostIQ     ID OwnIQ FostIQ
 1   68    63       19   92    91       37  105   109
 2   71    76       20   92    96       38  106   107
 3   73    77       21   93    87       39  106   108
 4   75    72       22   93    99       40  107   108
 5   78    71       23   93    99       41  107   101
 6   79    75       24   94    94       42  108    95
 7   81    86       25   95    96       43  111    98
 8   82    82       26   96    93       44  112   116
 9   82    93       27   96   109       45  114   104
10   83    86       28   97    92       46  114   125
11   85    83       29   97    95       47  115   108
12   86    94       30   97   112       48  116   116
13   87    93       31   97   113       49  118   116
14   87    97       32   99   105       50  121   118
15   89   102       33  100    88       51  125   128
16   90    80       34  101   115       52  129   117
17   91    82       35  102   104       53  131   132
18   91    88       36  103   106

Page 13: Unit I: Introduction to simple linear regression

Distribution of the outcome (FOSTIQ) and the predictor (OWNIQ)
Results of PROC UNIVARIATE

The UNIVARIATE Procedure

Variable: FostIQ
Basic Statistical Measures
  Mean    98.11321    Std Deviation        15.21343
  Median  97.00000    Variance            231.44848
  Mode    93.00000    Range                69.00000
                      Interquartile Range  20.00000

Variable: OwnIQ
Basic Statistical Measures
  Mean    97.35849    Std Deviation        14.69052
  Median  96.00000    Variance            215.81132
  Mode    97.00000    Range                63.00000
                      Interquartile Range  20.00000

[Stem-and-leaf displays and boxplots omitted]

For both variables: mean ≈ median ≈ 100 and sd ≈ 15; each distribution is symmetric with reasonable tails.

Page 14: Unit I: Introduction to simple linear regression

Examining the relationship between Y and X
Results of PROC PLOT and PROC GPLOT (GPLOT stands for "Graphics Plot")

Plot of FostIQ vs. OwnIQ (plot fostiq*owniq; legend: A = 1 obs, B = 2 obs, etc.)

[Line-printer scatterplot of FostIQ (60–140) against OwnIQ (60–140) omitted: the points rise roughly linearly from about (68, 63) to (131, 132)]

PROC PLOT produces an old-style "line printer" graph; PROC GPLOT produces a much more aesthetically pleasing graph.

Page 15: Unit I: Introduction to simple linear regression

Five questions to ask when examining scatterplots

• Direction of relationship?
• Linearity of relationship?
• Strength of relationship?
• Magnitude of relationship?
• Any unusual observations?

[Illustrative panels contrast lines with the same slope but different strengths, and the same strength but different slopes—magnitude (slope) and strength (scatter) are distinct]

Page 16: Unit I: Introduction to simple linear regression

What do we see in the plot of FostIQ vs. OwnIQ?

• Direction of relationship? Positive
• Linearity of relationship? Approximately linear
• Strength of relationship? Fairly strong—points tightly clustered, but some variability
• Magnitude of relationship? Slope ≈ 1
• Any unusual observations? None

Page 17: Unit I: Introduction to simple linear regression

The Importance of Axis Scale and Range in Examining Relationships

[Two panels: the plot from the last slide next to the same data replotted—"A different relationship?"]

Note the difference in the axes! Simply by adjusting the SCALE and RANGE of each axis, we can make the relationship look different. But the magnitude and strength are the same!

Page 18: Unit I: Introduction to simple linear regression

The Importance of Axis Units in Examining Relationships

We also need to pay attention to the UNITS (e.g., dollars vs. thousands of dollars, or months vs. years). Note that the scale and range are the same, but the UNITS are different.

Wisconsin has salary data for all of its school teachers and administrators available online. These plots come from a random sample of 728 teachers from across the state. For more information, visit: http://dpi.wi.gov/sig/dm-stafftchr.html

Page 19: Unit I: Introduction to simple linear regression

How do we statistically model the relationship between Y and X?
Step 1: Decide on the model's functional form

In theory, we could fit a model using most any functional form. Why are straight lines so popular?

• Mathematical simplicity: a straight line is one of the simplest mathematical relationships between variables—it makes our work very tractable
• Actual linearity: many relationships—such as that in Cyril Burt's data—are indeed linear
• Limited range of X may yield linearity: range restrictions are common in social research
• Transformations to achieve linearity: in Units 5 & 10, we'll learn how to use the straight-line machinery we're developing to fit curves to data

Page 20: Unit I: Introduction to simple linear regression

How do we statistically model the relationship between Y and X?
Step 2: Mathematically represent the model's functional form

Y = b + m·X, that is, Y = intercept + slope·X

• Intercept: value of Y when X = 0 (even if X = 0 isn't an observed value)
• Slope: difference in Y per 1-unit difference in X; the slope's sign indicates whether the relationship is positive (+) or negative (−)
• Two points—any two points—determine the line

So… if we have sample data and we identify the line that "best" describes the observed pattern, is that our statistical model?

NO! For two reasons:
1. Statistical models describe hypothesized behavior in the population, not in any particular sample. The model itself is imagined; we will never really see it
2. The equation we've written so far (incorrectly) assumes a fixed functional relationship between Y and X—it does not allow for individual variation

Page 21: Unit I: Introduction to simple linear regression

What do we mean by individual variation?

[Illustration: four adoptees whose identical twins, each raised by the birth mother, all have IQ = 120; the adoptees' own IQs are 121, 117, 118, and 123]

Taken together, all adoptees in the population (not just these 4) whose siblings have an IQ of 120 have an average IQ; that's what we'd like our model to estimate. But we expect any particular adoptee's IQ to differ from the population average we're trying to estimate because of individual variation.

Page 22: Unit I: Introduction to simple linear regression

How do we statistically model the relationship between Y and X?
Step 3: Postulate a linear regression model

Outcome = Systematic component + residual

Understanding the model algebraically:

    yᵢ = β₀ + β₁xᵢ + εᵢ

where y is the outcome, x is the predictor, β₀ and β₁ are "population parameters" or "regression coefficients" to be estimated, and ε is random error. We include the subscript i to emphasize that the model describes the behavior of Y for individual cases (more on subscripts in the Appendix).

Understanding the model graphically: remember that a model describes what we think exists in the population; you need to be able to imagine that it's possible to envision having data on the entire population.

Page 23: Unit I: Introduction to simple linear regression

From sample data to population model: Understanding what we're hypothesizing

[Graphic: sample cases (IDs 19–36 from the data listing) alongside the hypothesized population model. At each value of X (x₁, x₂, …), there is a distribution of Y whose mean is µ_Y|x; the hypothesized straight line joins these conditional means:]

    µ_Y|X = β₀ + β₁X

Page 24: Unit I: Introduction to simple linear regression

From population model to sample data: How do we fit the hypothesized model to observed data?

Population model:    Y = β₀ + β₁X + ε
Fitted model:        Ŷ = β̂₀ + β̂₁X

"Hats" denote estimates. Note that the fitted model has no error term: it traces the estimated mean of Y at each value of X.

Page 25: Unit I: Introduction to simple linear regression

Understanding the (ordinary) least squares (OLS) criterion

• Observed values (yᵢ) are the sample data points
• Predicted values (ŷᵢ) are estimated using the fitted line: ŷᵢ = β̂₀ + β̂₁xᵢ
• Residuals (yᵢ − ŷᵢ) are the distances between the observed and predicted values at a given value of X

So a "good" line would go through the "center" of the data and have small residuals (yᵢ − ŷᵢ)…perhaps as small as possible??? How do we find the "good" line that has the smallest residuals possible?

Ordinary Least Squares (OLS) criterion: minimize the sum of the squared residuals

    Σ (yᵢ − ŷᵢ)² = Σ (yᵢ − (β̂₀ + β̂₁xᵢ))²

The least squares criterion selects those parameter estimates that make the sum of squared residuals as small as possible (for this particular sample).
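The slides stop at stating the criterion; for reference, minimizing the sum of squared residuals with calculus yields the standard closed-form solution (not shown in the original deck, but standard for simple linear regression):

    β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = r(s_Y/s_X)        β̂₀ = ȳ − β̂₁x̄

As a check against Burt's data: r = √0.7686 ≈ 0.877, so β̂₁ ≈ 0.877 × (15.213/14.691) ≈ 0.908 and β̂₀ ≈ 98.113 − 0.908 × 97.358 ≈ 9.72, matching the PROC REG estimates on slide 27.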

Page 26: Unit I: Introduction to simple linear regression

Four assumptions about the population required for LS estimation

1. At each value of X, there is a distribution of Y. These distributions have a mean µ_Y|X and a variance σ²_Y|X
2. The straight line model is correct. The means of each of these distributions, the µ_Y|X's, may be joined by a straight line: µ_Y|X = β₀ + β₁X
3. Homoscedasticity. The variances of each of these distributions, the σ²_Y|X's, are identical
4. Independence of observations. At each given value of X (at each xᵢ), the values of Y (the yᵢ's) are independent of each other. We can't see this visually… so how do we evaluate this assumption? We won't; it's another class!

Page 27: Unit I: Introduction to simple linear regression

Results of fitting a least squares regression line to Cyril Burt's data
Results of PROC REG

The REG Procedure
Model: MODEL1
Dependent Variable: fostiq          [verify the outcome and predictor]

Number of Observations Read    53
Number of Observations Used    53   [check the sample size]

Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square    F Value    Pr > F
Model              1    9250.65939  9250.65939     169.42    <.0001
Error             51    2784.66136    54.60120
Corrected Total   52         12035

Root MSE         7.38926    R-Square    0.7686
Dependent Mean  98.11321    Adj R-Sq    0.7641
Coeff Var        7.53136

Parameter Estimates
                       Parameter     Standard
Variable      DF        Estimate        Error    t Value    Pr > |t|
Intercept      1         9.71949      6.86647       1.42      0.1630
owniq          1         0.90792      0.06975      13.02      <.0001

The fitted model (Ŷ = β̂₀ + β̂₁X):

    FOSTIQ-hat = 9.72 + 0.91(OWNIQ)

Intercept: value of FOSTIQ at OWNIQ = 0
Slope: difference in Y per 1-unit difference in X. On average, each 1-point difference in OWNIQ is positively associated with a 0.91-point difference in FOSTIQ.

(We'll discuss the standard errors, t values, and p-values in Unit 3, and R-Square in a bit.)
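The next few slides work with predicted values and residuals; PROC REG's OUTPUT statement will compute and save both for you. A minimal sketch, assuming the dataset one from the Appendix (the output dataset name preds and the variable names yhat and residual are mine):

proc reg data=one;
  model fostiq = owniq;
  output out=preds p=yhat r=residual;   * p= saves predicted values; r= saves residuals;
run;
quit;

proc print data=preds;
  var owniq fostiq yhat residual;
  id id;
run;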

Page 28: Unit I: Introduction to simple linear regression

Why the awkward language? Why not just say "increase" and "decrease"?

Be careful about causal language!

Read: Azar, B. (2006). Discussing your findings. GradPsych, 4(1)—with much more to come in Unit 2.

Page 29: Unit I: Introduction to simple linear regression

Plotting the fitted least squares regression line

Estimated least squares regression line:

    FOSTIQ-hat = 9.7195 + 0.9079(OWNIQ)

Estimate two fitted values:
When X = 80:  Ŷ = 9.7195 + 0.9079(80) = 82.35  → plot the point (80, 82.35)
When X = 120: Ŷ = 9.7195 + 0.9079(120) = 118.67 → plot the point (120, 118.67)

Note that the line passes through the sample means: when X = 97.36 (the mean of OWNIQ), Ŷ = 9.7195 + 0.9079(97.36) = 98.11 (the mean of FOSTIQ)—the point (97.36, 98.11).

It's wise to stay within the range of the sample data when graphing a line.
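Rather than computing endpoints by hand, SAS/GRAPH can overlay the least squares line directly on the scatterplot through the SYMBOL statement's interpolation option. A sketch, again assuming dataset one:

proc gplot data=one;
  title2 "FostIQ vs OwnIQ with fitted least squares line";
  plot fostiq*owniq;
  symbol1 value=dot interpol=rl;   * rl = linear regression: overlays the OLS line on the points;
run;
quit;

Because the overlay is drawn across the observed values of OwnIQ, it also respects the advice above about staying within the range of the sample data.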

Page 30: Unit I: Introduction to simple linear regression

Four uses of the model and evaluating how well it achieves these goals

• Description: Just as a mean summarizes the behavior of Y or X, the regression equation, which represents the mean of Y at each X, summarizes their relationship
• Prediction: The regression equation allows us to predict Y—albeit imperfectly—if we have a given value of X. Predicted values (ŷᵢ) are estimated using the fitted line ŷᵢ = β̂₀ + β̂₁xᵢ
• Explanation: The regression equation shows how X "explains" some of the variation in Y
• Removing/controlling: Once we have a summary, we can remove it and see what's left over—the residuals (yᵢ − ŷᵢ), the distances between the observed and predicted values for a given value of X

What benchmark might we use to evaluate how well the regression equation achieves these goals?
• The least squares criterion identifies those parameter estimates that minimize the sum of the squared residuals
• So we may ask: how small did we get the residuals to be? The smaller they are, the better the fit
• But…how do we evaluate whether the residuals are small, large, or something in between?

Page 31: Unit I: Introduction to simple linear regression

Meet Sir R. A. Fisher: One of the "fathers" of modern statistics
(17 February 1890 – 29 July 1962)

• Also initially a eugenicist
• In 1919, segued to doing agricultural research at Rothamsted Experimental Station
• Credited with bringing statistics into practice with the publication of his accessible book…
• Popularized many modern statistical concepts and techniques, including randomized trials, degrees of freedom, and the use of p-values for hypothesis testing

Right now, we're going to focus on one of his contributions, the Analysis of Variance, which shows how the relative size of the residuals helps us evaluate how well the regression line fits the data.

Page 32: Unit I: Introduction to simple linear regression

Step 1: Let's make sure we understand how to compute residuals:
Vertical distances between observed values (yᵢ) and fitted values (ŷᵢ)

ID  OwnIQ  FostIQ     yhat    residual
 1    68     63      71.459    -8.4586
 2    71     76      74.182     1.8177
 3    73     77      75.998     1.0019
 4    75     72      77.814    -5.8139
 5    78     71      80.538    -9.5376
...
49   118    116     116.854    -0.8536
50   121    118     119.577    -1.5773
51   125    128     123.209     4.7911
52   129    117     126.841    -9.8405
53   131    132     128.656     3.3437

For ID 1 (OWNIQ = 68, FOSTIQ = 63):
Step 1: Compute ŷᵢ by substituting OWNIQ into the regression equation: FOSTIQ-hat = 9.7195 + 0.9079(68) = 71.46
Step 2: Calculate the residual: yᵢ − ŷᵢ = 63 − 71.46 = −8.46

Conclusion: The FOSTIQ of ID 1 (the IQ of the child who was adopted) is 8.5 points lower than we would have predicted on the basis of his/her OWNIQ (the IQ of the child raised by the birth parents).

Positive residuals mean we under-predicted; negative residuals mean we over-predicted. Sometimes we under-predict, sometimes we over-predict, but across the full sample, the residuals will always sum to 0.
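A quick way to verify that the residuals sum to 0 across the full sample is to summarize the saved residuals from the PROC REG sketch shown earlier (again assuming the hypothetical preds dataset):

proc means data=preds n mean sum min max;
  var residual;   * the sum (and mean) should be 0 up to rounding;
run;

For Burt's data this should report n = 53 and a sum of essentially zero, with the min and max flagging the largest over- and under-predictions.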

Page 33: Unit I: Introduction to simple linear regression

Step 2: To what might we compare the size of the residual?
Let's start with a single case…here, ID 46 (OWNIQ = 114, FOSTIQ = 125)

The mean would be our "best guess" for all values of Y if we had no information about the regression model: ȳ = 98.11. For ID 46, the observed value is yᵢ = 125 and the fitted value is ŷᵢ = 9.7195 + 0.9079(114) = 113.22, so:

Total deviation:       (yᵢ − ȳ) = 125 − 98.11 = 26.89
Regression deviation:  (ŷᵢ − ȳ) = 113.22 − 98.11 = 15.11
Error deviation:       (yᵢ − ŷᵢ) = 125 − 113.22 = 11.78

ANOVA regression decomposition:

    (yᵢ − ȳ) = (ŷᵢ − ȳ) + (yᵢ − ŷᵢ)
    Total Dev = Regr Dev + Error Dev     (check: 26.89 = 15.11 + 11.78)

Two ways the ANOVA regression decomposition helps us evaluate the quality of the fit:
1. Total deviations provide a context for evaluating the magnitude of the residuals
2. Instead of focusing on just the magnitude of the error deviations (the residuals), we can equivalently focus on the magnitude of the regression deviations

But…
1. How do we generalize these ideas across cases?
2. How do we numerically make the comparison?

Page 34: Unit I: Introduction to simple linear regression

Step 3: Let's generalize and quantify these comparisons
The general case of regression decomposition

Total Dev: (yᵢ − ȳ)     Regr Dev: (ŷᵢ − ȳ)     Error Dev: (yᵢ − ŷᵢ)

    R² = Σ(Regr Dev)² / Σ(Total Dev)² = SSRegression / SSTotal

where SS means "Sum of Squares." Equivalently, because the total sum of squares splits into regression and error pieces,

    R² = 1 − Σ(Error Dev)² / Σ(Total Dev)² = 1 − SSError / SSTotal

Page 35: Unit I: Introduction to simple linear regression

Analysis of Variance regression decomposition

The REG Procedure
Model: MODEL1
Dependent Variable: fostiq

Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square    F Value    Pr > F
Model              1    9250.65939  9250.65939     169.42    <.0001
Error             51    2784.66136    54.60120
Corrected Total   52         12035

Root MSE         7.38926    R-Square    0.7686
Dependent Mean  98.11321    Adj R-Sq    0.7641
Coeff Var        7.53136

The ANOVA table gives the regression decomposition:

    Σ(yᵢ − ȳ)²  =  Σ(ŷᵢ − ȳ)²  +  Σ(yᵢ − ŷᵢ)²
    SS Total    =  SS Regress  +  SS Error
    12035       =  9251        +  2785

    R² = SS Regress / SS Total = 9251/12035 = 0.769

Interpreting R²: 76.9 percent of the variation in the foster twins' IQ scores is "associated with" or "predicted by" the IQ of the twin raised in the natural home.

What about the remaining 23.1%?
• Measurement error
• Random error/individual variation
• Other predictors (environment, SES of household)
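If you'd rather see the decomposition computed from the data than read it off the ANOVA table, a short data-step sketch works (assuming the hypothetical preds dataset from earlier; 98.11321 is the dependent mean PROC REG reports):

data decomp;
  set preds;
  total_sq = (fostiq - 98.11321)**2;   * squared total deviation from the mean;
  regr_sq  = (yhat   - 98.11321)**2;   * squared regression deviation;
  error_sq = (fostiq - yhat)**2;       * squared error deviation (squared residual);
run;

proc means data=decomp sum;
  var total_sq regr_sq error_sq;       * sums should approximate 12035, 9251, and 2785;
run;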

Page 36: Unit I: Introduction to simple linear regression

Notes on the interpretation of R²

• R² says nothing about causality
• The context in which you interpret the value of R² depends upon your discipline (more in Unit 2)
• R² does not tell us about the appropriateness of straight lines nor the strength of nonlinear relationships (more in Units 4 and 5)
• R² is not a measure of the slope of the line—steep and shallow slopes can have low or high R² statistics (unless R² = 0, in which case the slope will also be 0)

Page 37: Unit I: Introduction to simple linear regression

One last parameter to estimate: The residual variance, σ²_Y|X

Recall assumptions 1 and 3: at each value of X, there is a distribution of Y with mean µ_Y|X and variance σ²_Y|X, and (homoscedasticity) these variances are identical. Of what importance is σ²_Y|X, the residual variance of Y at each value of X? It tells us about the variability of the residuals—the unexplained variability in Y that's "left over."

How do we estimate variances? Let's start by reviewing the sample variance of Y:

    σ̂²_Y = Σ(yᵢ − ȳ)² / (n − 1) = 12,035 / 52 = 231.4

Why do we subtract 1? Because we estimated 1 parameter (the mean) in order to estimate this other parameter (the variance).

Does the numerator look familiar? It is the Corrected Total sum of squares in the Analysis of Variance table:

Analysis of Variance
                            Sum of        Mean
Source            DF       Squares      Square
Model              1    9250.65939  9250.65939
Error             51    2784.66136    54.60120
Corrected Total   52         12035

So how does this help us estimate σ²_Y|X?

Page 38: Unit I: Introduction to simple linear regression

From estimating σ²_Y to estimating σ²_Y|X

The sample variance of Y,

    σ̂²_Y = Σ(yᵢ − ȳ)² / (n − 1),

is a special case of a general recipe:

    estimated variance = sum of squares / (n − # parameters estimated)

Applying the recipe to the residuals:

    σ̂²_Y|X = Σ(yᵢ − ŷᵢ)² / (n − 2) = 2784.66 / 51 = 54.60

We take away 2 because we estimated both β₀ and β₁ to estimate ŷ. This estimate is the Mean Square Error (MSE) in the Analysis of Variance table; its square root, the Root Mean Square Error (RMSE = √54.60 = 7.38926), is the standard deviation of the residuals.

Does this concept of "penalizing" our calculations for the number of parameters estimated have a name?

Page 39: Unit I: Introduction to simple linear regression

Developing an understanding about degrees of freedom (df)

Imagine a random sample of 3 #s from an infinite population. The number of df depends on the number of constraints on their values.

You are told that…
• You have no constraints → all three numbers could be any number (e.g., 10, 20 & 30; 200, −7, ⅞; −2.4, 86.8, 0) → df = 3
• The sample mean = 10 (i.e., 1 constraint) → two numbers could be any number; the third is fixed (e.g., 0 and 10 force the third to be 20; 5 and 10 force 15) → df = 2
• Sample mean = 10 and sample SD = 10 (i.e., 2 constraints) → one number could be anything; the remaining two are fixed (e.g., 10 forces 0 & 20; 0 forces 10 & 20) → df = 1

As the constraints increase… freedom decreases.

In regression, the degrees of freedom for a parameter estimate depend on: (1) the sample size AND (2) the # of other parameters you need to estimate in order to estimate this parameter.
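To connect degrees of freedom back to the PROC REG output: with n = 53 cases and 2 estimated parameters (β̂₀ and β̂₁), the residuals carry

    df(error) = n − 2 = 53 − 2 = 51

which is exactly the Error DF in the Analysis of Variance table, and the divisor that turns the error sum of squares into the MSE: 2784.66/51 = 54.60.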

Page 40: Unit I: Introduction to simple linear regression

What's the big takeaway from this unit?

• The regression model represents your hypothesis about the population
– When you fit a regression model to data, you are computing sample estimates of population parameters that you'll never actually calculate directly
– Don't confuse sample estimates with true values—estimates are just estimates, even if they have sound statistical properties
– The regression model focuses on the average of Y at each given value of X. Individual variation figures prominently in the model through the error term (and residuals)
• Be sure to fully understand the meaning of the regression coefficients
– These are the building blocks for all further data analysis; take the time to make sure you have a complete and instinctual understanding of what they tell us
– Distinguish clearly between the magnitude and strength of an effect—don't confuse these separate concepts
– The regression approach assumes linearity. We'll learn in Units 4 and 5 how to evaluate this assumption and what to do if it doesn't hold
• R² is a nifty summary of how much the regression model helps us
– Be careful about causal language—the phrase "explained by" does not imply causality
– The regression decomposition, which leads to R² and our estimate of the Root MSE (the residual standard deviation), will appear in subsequent calculations; be sure you understand what they do and do not mean

Page 41: Unit I: Introduction to simple linear regression

Appendix: Annotated PC SAS code for Unit 1, Burt data

options nodate nocenter nonumber;

title1 "Unit 1: IQs of Cyril Burt's identical twins";
footnote1 "m:\SAS Programs\Unit 1--Burt analysis.sas";

*-----------------------------------------------------*
Be sure to update the infile reference to the
file's location on your computer
*-----------------------------------------------------*;

*-----------------------------------------------------*
Input Burt data and name variables in dataset
*-----------------------------------------------------*;
data one;
  infile 'm:\datasets\Burt.txt';
  input ID 1-2 OwnIQ 4-6 FostIQ 8-10;

*-----------------------------------------------------*
List owniq & fostiq data for entire Burt sample
*-----------------------------------------------------*;
proc print data=one;
  title2 "Data listing";
  var owniq fostiq;
  id id;

(Continued on next page)

Notes:
• Every SAS statement ends with a semicolon (;)
• The options statement specifies how you'd like the output to look—here eliminating dates, centering, and page numbers
• The title and footnote statements provide text that will appear on the output; add as many as you like, but always enclose the text in quotes
• Comments start with an asterisk (*) and can run over several lines (don't forget the semicolon). Unlike titles and footnotes, they appear only in your program and log
• The data step has (at least) three statements: the data statement reads raw data from an external file (here, Burt.txt) into a temporary SAS dataset (here called one); the infile statement specifies the location of the raw data (indicate the appropriate drive where the file is stored—usually e: or f: for a flash drive, etc.); and the input statement specifies the variable names and their column locations in the raw data file
• proc print prints the newly created SAS dataset (named "one"). The var statement identifies the variables you want printed; adding another title statement and an id statement helps make the output more readable

Page 42: Unit I: Introduction to simple linear regression

Appendix: Annotated PC SAS code for Unit 1, Burt data, continued

proc univariate data=one plot;
  title2 "Descriptive statistics";
  var fostiq owniq;
  id id;

*---------------------------------------------------*
Bivariate scatterplot of fostiq vs owniq
using proc plot & proc gplot
*---------------------------------------------------*;
proc plot data=one;
  title2 "Line printer plot of FostIQ vs OwnIQ";
  plot fostiq*owniq;

proc gplot data=one;
  title2 "High quality plot of FostIQ vs OwnIQ";
  plot fostiq*owniq;
  symbol value=dot;

*----------------------------------------------------*
Fitting OLS regression model fostiq on owniq
*----------------------------------------------------*;
proc reg data=one;
  title2 "Regression of FostIQ on OwnIQ";
  model fostiq = owniq;

run;
quit;

Notes:
• proc univariate presents summary statistics (e.g., means, sd's, stem-and-leaf displays). The var statement specifies the variables you want analyzed; the id statement provides identifiers for extreme values
• proc plot presents a "line printer" scatterplot. The plot statement specifies the variables you want analyzed; the syntax is outcome*predictor
• proc gplot presents a high-quality scatterplot suitable for presentation. Its plot statement syntax is also outcome*predictor. If you don't use a symbol statement, SAS will use + as a plotting symbol; here, we ask it to use a dot (●)
• proc reg fits a linear regression model using the variables you specify. Its model statement syntax is outcome = predictor(s) (note the switch from an asterisk for the plots to an = for the model)
• The run statement tells SAS to execute the entire program; the quit statement tells SAS to stop execution

Page 43: Unit I: Introduction to simple linear regression

Appendix: Relationships between Variance, MSE, and R²
(on both squared and square-root scales)

Variance
– Incorporates: Sum of Squares Total (SST—differences between observed values and the mean)
– Measures: variation in Y overall
– Interpretation: average squared distance from the mean
– Formula: σ̂²_Y = Σ(yᵢ − ȳ)² / (n − 1)

SD (square root of the variance)
– Interpretation: average distance from the mean

MSE
– Incorporates: Sum of Squares Error (SSE—differences between observed and fitted values)
– Measures: variation in residuals around the fitted values
– Interpretation: estimated variance of Y at each value of X
– Formula: σ̂²_Y|X = Σ(yᵢ − ŷᵢ)² / (n − 2)

RMSE (square root of the MSE)
– Interpretation: estimated SD of Y at each value of X

R²
– Incorporates: Sum of Squares Regression (SSR—differences between fitted values and the mean of Y) and SST
– Measures: ratio of SSR to SST
– Interpretation: proportion of variation in Y attributable to X
– Formula: R² = Σ(ŷᵢ − ȳ)² / Σ(yᵢ − ȳ)² = SSR / SST

r (square root of R²)
– Measures: strength of association
– Interpretation: correlation coefficient (more in Unit 2)

Page 44: Unit I: Introduction to simple linear regression

Appendix: It's all Greek to me!

Mean
– Population: µ ("mu")
– Sample estimate: x̄ ("x bar," the sample mean)

Mean of Y at a given value of x
– Population: µ_Y|x ("mu sub-Y at this given value of x")
– Sample estimate: ȳ|x (the sample mean of Y at this given value of x)

Variance
– Population: σ² ("sigma-squared")
– Sample estimate: s² (the sample variance)

Standard deviation
– Population: σ ("sigma")
– Sample estimate: s (the sample standard deviation)

Y-intercept in a regression equation
– Population: β₀ ("beta-zero")
– Sample estimate: β̂₀ ("beta-zero hat")

Slope of a regression equation
– Population: β₁ ("beta-one")
– Sample estimate: β̂₁ ("beta-one hat")

Random error
– Population: ε ("epsilon")
– Sample estimate: yᵢ − ŷᵢ (the residual)

Regression model expressed using generic random variables
– Population: Y = β₀ + β₁X + ε
– Fitted: Ŷ = β̂₀ + β̂₁X

Regression model expressed using variable names
– Population: FOSTIQ = β₀ + β₁(OWNIQ) + ε
– Fitted: FOSTIQ-hat = β̂₀ + β̂₁(OWNIQ)

Σ (sigma, an upper-case letter): sum across all observations. Note that this capital Greek letter does not refer to a population parameter!

Page 45: Unit I: Introduction to simple linear regression

Appendix: Using subscripts: Don't be confused by the "i"

ID  OwnIQ  FostIQ          ID   X    Y
 1    68     63             1   x₁   y₁
 2    71     76             2   x₂   y₂
 3    73     77             3   x₃   y₃
 4    75     72             4   x₄   y₄
 5    78     71             5   x₅   y₅
 …                          …

xᵢ is a generic name for x₁, x₂, etc.; yᵢ is a generic name for y₁, y₂, etc.

We use the subscript "i" to mean several things, depending on context:
1. any one of the observations;
2. any one of the X values;
3. more uses later in the semester.

[Number line: the specific values 10, 20, 30, 40 relabeled generically as x₁, x₂, x₃, x₄—again, xᵢ is a generic name for x₁, x₂, etc.]

Page 46: Unit I: Introduction to simple linear regression

Glossary terms included in Unit 1

• Assumptions of regression
• Covariate
• Degrees of freedom
• Individual variation
• Intercept
• Least squares regression
• Magnitude
• Measurement error
• MSE (mean square error)
• Observed values and estimated values
• Parameter estimates
• Residual
• R-squared
• Slope
• Strength
