© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 1
Unit I: Introduction to simple linear regression
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 2
The S-030 roadmap: Where’s this unit in the big picture?
Building a solid foundation
- Unit 1: Introduction to simple linear regression
- Unit 2: Correlation and causality
- Unit 3: Inference for the regression model

Mastering the subtleties
- Unit 4: Regression assumptions: Evaluating their tenability
- Unit 5: Transformations to achieve linearity

Adding additional predictors
- Unit 6: The basics of multiple regression
- Unit 7: Statistical control in depth: Correlation and collinearity

Generalizing to other types of predictors and effects
- Unit 8: Categorical predictors I: Dichotomies
- Unit 9: Categorical predictors II: Polychotomies
- Unit 10: Interaction and quadratic effects

Pulling it all together
- Unit 11: Regression modeling in practice
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 3
In this unit, we’re going to learn about…
• The 3 trinities for describing research: 3 types of variables, predictors, and research questions
• Statistical models and how they differ from deterministic models
• Examining predictor and outcome distributions and scatterplots
• Mathematically representing the population model and interpreting its components
  – Using sample data to motivate a hypothesized population linear regression model
  – Assumptions made in postulating the simple linear regression model
• Fitting the model to data: understanding the method of least squares
• Residuals: definitions and interpretations
• Uses of the fitted regression model
  – How the fitted regression model helps improve our predictions
• Explained variation: what the R2 statistic is (and what it is not)
• Using the analysis of variance to estimate the mean square error (MSE)
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 4
The continuing consequences of segregationCharles, Dinwiddie and Massey (2004) Social Science Quarterly
RQ: “We seek to determine whether the high levels of African-American residential segregation experienced have continuing academic consequences”
Hypothesis: “Because segregation works to concentrate poverty and the social problems associated with it, the friends and relatives of African-American students face an elevated risk of stressful life events, which undermine grade performance”
Sample: Representative sample of 3,924 students—n’s by race/ethnicity—who participated in the National Longitudinal Survey of Freshmen (NLSF)
Target population: African American, Latino, Asian and White undergraduates at 28 selective US colleges & universities
Analytic approach: “Estimate a regression model to connect segregation to academic performance through the intervening variable of family stress.”
Variables:
• Student race/ethnicity
• Segregation of the HS neighborhood
• Family SES (education, $, etc.)
• Stressful life events during college
• College GPA
Results: “African-American students from segregated neighborhoods experience higher levels of family stress than others. This stress is largely a function of violence and disorder in segregated neighborhoods. Students respond by devoting more time to family issues and their health and grades suffer as a result”
Charles, C.Z., Dinwiddie, G., & Massey, D.S. 2004. The continuing consequences of segregation: Family stress and college academic performance. Social Science Quarterly, 85(5): 1353-1373.
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 5
Gray peril or loyal support?Berkman & Plutzer (2004) Social Science Quarterly
RQ: “Do large concentrations of elderly represent a ‘gray peril’ to maintaining adequate educational expenditures?”
Hypothesis: “The gray peril hypothesis is a misleading caricature of more complex political dynamics…not equally applicable to all elderly. Expenditures will decline as the concentration of newly arrived elderly increases; high concentrations of longstanding elderly will have no effect or result in expenditure increases”
Target population: All fiscally independent US school districts with > 35 students in 1989-1990.

Sample: All 9,129 districts that met this criterion.

Analytic approach: "We regress per pupil expenditures (PPE) on the percentage of the population over 60…and add a series of economic and demographic controls"

Variables:
• Pct district residents who are > 60
• Pct also newly arrived
• Pct also longstanding
• SES and demographic controls
• Per pupil expenditure (PPE)
Results: “Older residents represent a source of support for educational expenditures while elderly migrants lower spending. … The gray peril hypothesis … must be rejected”
Berkman, M.B., & Plutzer, E. 2004. Gray peril or loyal support? The effects of the elderly on educational expenditures. Social Science Quarterly, 85(5): 1178-1192.
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 6
The 3 trinities for describing researchThe 3 types of Variables, Predictors, and Research Questions (RQs)
The 3 types of variables:
• Question predictors: variables whose effects you want to study
• Covariates: variables whose effects you want to 'control'
• Outcomes: variables used to measure the predictors' effects

The 3 types of predictors:
• Innovations and interventions (e.g., vouchers, a new curriculum)
• Potentially changeable characteristics (e.g., class size, per pupil expenditures)
• Fixed attributes (e.g., race, gender)

The 3 types of research questions:
• Descriptive RQs: provide descriptive statistics for an outcome
• Relational RQs: identify relationships between a predictor and an outcome
• Causal RQs: demonstrate a predictor's causal impact on an outcome
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 7
Models: Simplified representations of relationships between variables
Mathematical models (e.g., modeling geometric shapes, such as squares):
• Perimeter = 4(side)
• Area = (side)²
Mathematical models are deterministic:
• Some are linear; some nonlinear, but…
• All squares behave this way: once we know the "rule," we can use it to fit the model to data perfectly.

Statistical models (modeling people, organizations, …, any type of social unit; all the kinds of models we expect to develop and fit to data):

    Outcome = Systematic component + Residual

• Goal 1: Identify the systematic components and determine how they fit the data
• Goal 2: Assess how well we did by examining the magnitude of the residuals

Statistical models must allow for:
• Other systematic components (not included in the model or not measured)
• Measurement error
• Individual variation
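The deterministic/statistical contrast can be made concrete with a tiny sketch (Python rather than the course's SAS, purely for illustration; the coefficients 10 and 0.9 and the residual values are invented):

```python
# Deterministic model: every square obeys the rule exactly, no residual.
def perimeter(side):
    return 4 * side

def area(side):
    return side ** 2

# Statistical model: outcome = systematic component + residual.
# The hypothetical coefficients 10 and 0.9 are made up for illustration.
def outcome(x, residual):
    return 10 + 0.9 * x + residual

y1 = outcome(100, +3)   # two units with the same x = 100 ...
y2 = outcome(100, -2)   # ... still differ, because their residuals differ
```

Two squares with the same side always have the same perimeter and area; two people with the same predictor value generally do not have the same outcome.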
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 8
How do we "do" statistical modeling?

Step 1: Articulate your RQs in terms of outcomes, question predictors, and covariates (RQs often also specify the target population and sample)…this is a matter of substance
Step 2: Postulate a statistical model and fit the model to sample data…what we'll discuss in this unit
Step 3: Determine whether the relationship we think we found in this sample is happenstance or whether we think it really exists in the population…what we'll discuss in Unit 3

Population and sample:
• Population: the group of interest. You want to make inferences about the target population.
• Sample: the group that you will study in your research; a subset of the larger population. (Draw a sample from the population; make inferences back to it.)
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 9
Clarifying the “standard” terminology
Term: Outcome
Synonyms: response, dependent variable, left-hand side variable, Y
Definition: variable whose behavior we are trying to explain

Term: Predictor
Synonyms: independent variable, right-hand side variable, X
Definition: variable we are using to explain the variation in the outcome

Term: Relationship
Synonyms: association, correlation, covariation
Definition: how two variables relate to each other, without implying causality
Let's get started by studying one of the oldest social science RQs: nature vs. nurture…
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 10
Studying the origins of “natural ability”: Meet Sir Francis Galton
(16 February 1822 – 17 January 1911)
Research interest: "Those qualifications of intellect and disposition which … lead to reputation"

Galton's "genetic utopia": "Bright, healthy individuals were treated and paid well, and encouraged to have plenty of children. Social undesirables were treated with reasonable kindness so long as they worked hard and stayed celibate."

He didn't have data on "intelligence," so he instead studied HEIGHT.
• Although a self-proclaimed genius who wrote that he could read at 2½, write and do arithmetic at 4, and was comfortable with Latin texts at 8, he couldn't figure out how to model these data(!)
• He went to JD Dickson, a mathematician at Cambridge, who formalized the relationship by developing what we now know as linear regression
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 11
From physical attributes to mental abilities: Meet Sir Cyril Burt
(3 March 1883 – 10 October 1971)
Research interest: Heritability of IQ: Can we predict the IQs of identical twins raised in "foster" (adoptive) homes from the IQs of their siblings raised in their biological parents' homes?
• Over a 30-year period, he and two RAs, Miss Howard and Miss Conway, accrued data on 53 pairs of separated twins: 15 pairs in 1943, up to 21 pairs in 1955, and up to 53 pairs in 1966
• "'Intelligence', when adequately assessed, is largely dependent on genetic constitution" (Burt, 1966)

Growing accusations:
• In 1973, Arthur Jensen, a supporter of Burt, noted "misprints and inconsistencies in some of the data"
• In 1974, Leon Kamin noted how odd it was that Burt's correlation coefficients remained virtually unchanged as the sample size increased (r = .770, r = .771, and r = .771)
• In 1976, a London Sunday Times reporter tried to find the RAs and concluded that they did not exist
• In 1979, The British Journal of Psychology added the following notice to Burt's 1966 paper: "The attention of readers of the Journal is drawn to the fact that it has now been established that this paper contains spurious data"
• In 1995, an edited volume with 5 essays, Cyril Burt: Fraud or Framed (Oxford), found evidence of sloppy writing, cutting and pasting of text, but perhaps not fraudulent data
• Debate continues to this day, and with Burt long dead, the conclusion may be that we'll never know
• Much more info under "Supplemental Resources" on the S-030 website
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 12
IQ scores for Cyril Burt's identical twins reared apartResults of PROC PRINT
ID  OwnIQ  FostIQ
 1    68     63
 2    71     76
 3    73     77
 4    75     72
 5    78     71
 6    79     75
 7    81     86
 8    82     82
 9    82     93
10    83     86
11    85     83
12    86     94
13    87     93
14    87     97
15    89    102
16    90     80
17    91     82
18    91     88
19    92     91
20    92     96
21    93     87
22    93     99
23    93     99
24    94     94
25    95     96
26    96     93
27    96    109
28    97     92
29    97     95
30    97    112
31    97    113
32    99    105
33   100     88
34   101    115
35   102    104
36   103    106
37   105    109
38   106    107
39   106    108
40   107    108
41   107    101
42   108     95
43   111     98
44   112    116
45   114    104
46   114    125
47   115    108
48   116    116
49   118    116
50   121    118
51   125    128
52   129    117
53   131    132
Predictor (X): OwnIQ
Outcome (Y): FostIQ
n = 53

RQ: What's the relationship between the IQ of the child raised in an adoptive home and his/her identical twin raised in the birth home? ("Heritability")
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 13
Distribution of the outcome (FOSTIQ) and the predictor (OWNIQ)
Results of PROC UNIVARIATE
The UNIVARIATE Procedure
Variable: FostIQ

  Mean    98.11321    Std Deviation        15.21343
  Median  97.00000    Variance            231.44848
  Mode    93.00000    Range                69.00000
                      Interquartile Range  20.00000

[Stem-and-leaf plot and boxplot for FostIQ (stems ×10): roughly symmetric, centered near 98, no extreme outliers]
The UNIVARIATE Procedure
Variable: OwnIQ

  Mean    97.35849    Std Deviation        14.69052
  Median  96.00000    Variance            215.81132
  Mode    97.00000    Range                63.00000
                      Interquartile Range  20.00000

[Stem-and-leaf plot and boxplot for OwnIQ (stems ×10): roughly symmetric, centered near 97, no extreme outliers]
For both variables: mean ≈ median ≈ 100, sd ≈ 15. Each distribution is symmetric with reasonable tails.
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 14
Examining the relationship between Y and X
Results of PROC PLOT and PROC GPLOT (stands for "Graphics Plot")

Plot of Y vs. X: Plot of FostIQ vs. OwnIQ
Plot of FostIQ*OwnIQ. Legend: A = 1 obs, B = 2 obs, etc.

[Line-printer scatterplot from PROC PLOT, an old style "line printer" graph: FostIQ (60 to 140) against OwnIQ (60 to 140); the point cloud rises steadily from lower left to upper right]

PROC GPLOT draws the same relationship as a much more aesthetically pleasing graph.
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 15
Five questions to ask when examining scatterplots
1. Direction of relationship?
2. Linearity of relationship?
3. Strength of relationship?
4. Magnitude of relationship?
5. Any unusual observations?

[Example plots comparing relationships with the same slope and with the same strength]
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 16
What do we see in the plot of FostIQ vs. OwnIQ?
• Direction of relationship? Positive
• Linearity of relationship? Approximately linear
• Strength of relationship? Fairly strong: points tightly clustered, but some variability
• Magnitude of relationship? Slope ≈ 1
• Any unusual observations? None
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 17
The Importance of Axis Scale and Range in Examining Relationships
Plot from Last Slide A Different Relationship?
Note the difference in the axes!!! Simply by adjusting the SCALE and RANGE of each axis, we can make the relationship look different. But the magnitude and strength are the same!
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 18
The Importance of Axis Units in Examining Relationships
We also need to pay attention to the UNITS (e.g., dollars vs. thousands of dollars, or months vs. years).
Note that the scale and range are the same, but the UNITS are different
Wisconsin has salary data for all of its school teachers and administrators available on-line. These plots come from a random sample of 728 teachers from across the state. For more information, visit: http://dpi.wi.gov/sig/dm-stafftchr.html
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 19
How do we statistically model the relationship between Y and X?
Step 1: Decide on the model’s functional form
In theory, we could fit a model using most any functional form. Why are straight lines so popular?

• Mathematical simplicity: a straight line is one of the simplest mathematical relationships between variables, which makes our work very tractable
• Actual linearity: many relationships, such as that in Cyril Burt's data, are indeed linear
• Limited range of X may yield linearity: range restrictions are common in social research
• Transformations to achieve linearity: in Units 5 & 10, we'll learn how to use the straight line machinery we're developing to fit curves to data
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 20
How do we statistically model the relationship between Y and X?
Step 2: Mathematically represent the model’s functional form
Two points, any two points, determine the line.

    Y = intercept + slope*X   (in slope-intercept form, Y = b + m*X)

• Intercept: value of Y when X = 0 (even if X = 0 isn't an observed value)
• Slope: difference in Y per 1-unit difference in X. The slope's sign indicates whether the relationship is positive (+) or negative (-)

So… if we have sample data and we identify the line that "best" describes the observed pattern, is that our statistical model?

NO! For two reasons:
1. Statistical models describe hypothesized behavior in the population, not in any particular sample. The model itself is imagined; we will never really see it
2. The equation we've written so far (incorrectly) assumes a fixed functional relationship between Y and X; it does not allow for individual variation
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 21
What do we mean by individual variation?
[Diagram: four children raised by their birth mothers, each with IQ = 120; their adopted-away identical twins have IQs of 121, 117, 118, and 123]

Taken together, all adoptees in the population (not just these 4) whose siblings have an IQ of 120 have an average IQ; that's what we'd like our model to estimate. But we expect any particular adoptee's IQ to differ from the population average we're trying to estimate because of individual variation.
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 22
How do we statistically model the relationship between Y and X?
Step 3: Postulate a linear regression model

    Outcome = Systematic component + residual

Understanding the model algebraically:

    y_i = β_0 + β_1 x_i + ε_i

where y_i is the outcome, x_i is the predictor, β_0 and β_1 are "population parameters" or "regression coefficients" to be estimated, and ε_i is random error. We include the subscript i to emphasize that the model describes the behavior of Y for individual cases.

Remember that a model describes what we think exists in the population; you need to be able to imagine that it's possible to envision having data on the entire population.
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 23
From sample data to population model: Understanding what we’re hypothesizing
[Graph: at each value of X (x_1, x_2, …, x_3) there is a distribution of Y; the conditional means μ_Y|x1, μ_Y|x2, μ_Y|x3 fall on the population regression line]

    μ_Y|X = β_0 + β_1 X
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 24
From population model to sample data: How do we fit the hypothesized model to observed data?

    Population model:  μ_Y|X = β_0 + β_1 X
    Fitted model:      Ŷ = β̂_0 + β̂_1 X

"Hats" denote estimates. Note that the fitted model has no error term.

[Graph: the population regression line through the conditional means μ_Y|x1, μ_Y|x2, μ_Y|x3 at x_1, x_2, …, x_3]
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 25
Understanding the (ordinary) least squares (OLS) criterion
• Observed values (y_i) are the sample data points
• Predicted values (ŷ_i) are estimated using the fitted line: ŷ_i = β̂_0 + β̂_1 x_i
• Residuals (y_i - ŷ_i) are the distances between the observed and predicted values at a given value of X

So a "good" line would go through the "center" of the data and have small residuals (y_i - ŷ_i)…perhaps as small as possible??? How do we find the "good" line that has the smallest residuals possible?

Ordinary Least Squares (OLS) criterion: minimize the sum of the squared residuals,

    Σ (y_i - ŷ_i)² = Σ (y_i - (β̂_0 + β̂_1 x_i))²

The least squares criterion selects those parameter estimates that make the sum of squared residuals as small as possible (for this particular sample).
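For simple regression, the OLS minimization has a closed-form solution (a standard result, not derived on the slide): β̂_1 = S_xy / S_xx and β̂_0 = ȳ - β̂_1 x̄. A short Python sketch, using Burt's 53 pairs transcribed from the earlier data listing (the slides themselves use SAS), recovers the estimates PROC REG reports later in the unit:

```python
# OLS "by hand" for Burt's twin data: b1 = S_xy / S_xx, b0 = ybar - b1*xbar.
own = [68, 71, 73, 75, 78, 79, 81, 82, 82, 83, 85, 86, 87, 87, 89, 90,
       91, 91, 92, 92, 93, 93, 93, 94, 95, 96, 96, 97, 97, 97, 97, 99,
       100, 101, 102, 103, 105, 106, 106, 107, 107, 108, 111, 112, 114,
       114, 115, 116, 118, 121, 125, 129, 131]
fost = [63, 76, 77, 72, 71, 75, 86, 82, 93, 86, 83, 94, 93, 97, 102, 80,
        82, 88, 91, 96, 87, 99, 99, 94, 96, 93, 109, 92, 95, 112, 113,
        105, 88, 115, 104, 106, 109, 107, 108, 108, 101, 95, 98, 116,
        104, 125, 108, 116, 116, 118, 128, 117, 132]

n = len(own)
xbar = sum(own) / n
ybar = sum(fost) / n
sxx = sum((x - xbar) ** 2 for x in own)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(own, fost))
b1 = sxy / sxx                 # estimated slope
b0 = ybar - b1 * xbar          # estimated intercept
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(own, fost))
print(round(b0, 2), round(b1, 3), round(sse, 1))
```

The printed values should agree with the Parameter Estimates and Error sum of squares in the PROC REG output shown on a later slide (about 9.72, 0.908, and 2784.7).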
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 26
Four assumptions about the population required for LS estimation
1. At each value of X, there is a distribution of Y. These distributions have a mean μ_Y|X and a variance σ²_Y|X.
2. The straight line model is correct. The means of each of these distributions, the μ_Y|X's, may be joined by a straight line: μ_Y|X = β_0 + β_1 X.
3. Homoscedasticity. The variances of each of these distributions, the σ²_Y|X's, are identical.
4. Independence of observations. At each given value of X (at each x_i), the values of Y (the y_i's) are independent of each other. We can't see this visually…so how do we evaluate this assumption? We won't; it's another class!

[Graph: distributions of Y centered at μ_Y|x1, μ_Y|x2, μ_Y|x3 along the line μ_Y|X = β_0 + β_1 X]
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 27
Results of fitting a least squares regression line to Cyril Burt’s data
Results of PROC REGThe REG ProcedureModel: MODEL1Dependent Variable: fostiq
Number of Observations Read 53Number of Observations Used 53
Analysis of Variance Sum of MeanSource DF Squares Square F Value Pr > F
Model 1 9250.65939 9250.65939 169.42 <.0001Error 51 2784.66136 54.60120 Corrected Total 52 12035
Root MSE 7.38926 R-Square 0.7686Dependent Mean 98.11321 Adj R-Sq 0.7641Coeff Var 7.53136
Parameter Estimates Parameter StandardVariable DF Estimate Error t Value Pr > |t|
Intercept 1 9.71949 6.86647 1.42 0.1630owniq 1 0.90792 0.06975 13.02 <.0001
Fitted model (Ŷ = β̂_0 + β̂_1 X):

    FOSTIQ-hat = 9.72 + 0.91(OWNIQ)

Slope: difference in Y per 1-unit difference in X. On average, each 1-point difference in OWNIQ is positively associated with a 0.91-point difference in FOSTIQ.

Intercept: value of FOSTIQ at OWNIQ = 0.

Reading the output: verify the outcome and predictor; check the Number of Observations to confirm the sample size; the standard errors, t values, and p values we'll discuss in Unit 3; R-Square and Root MSE we'll discuss in a bit.
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 28
Why the awkward language? Why not just say "increase" and "decrease"? Be careful about causal language!

Read: Azar, B. (2006). Discussing your findings. GradPsych, 4(1). Much more to come in Unit 2.
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 29
Plotting the fitted least squares regression line
Estimated least squares regression line:

    FOSTIQ-hat = 9.7195 + 0.9079(OWNIQ)

Estimate two fitted values:
• When X = 80:  Ŷ = 9.7195 + 0.9079(80) = 82.35, the point (80, 82.35)
• When X = 120: Ŷ = 9.7195 + 0.9079(120) = 118.67, the point (120, 118.67)

The fitted line also passes through the point of means: when X = 97.36, Ŷ = 9.7195 + 0.9079(97.36) = 98.11, the point (97.36, 98.11).

It's wise to stay within the range of the sample data when graphing a line.
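The two plotting points (and the point of means) can be reproduced with a few lines of Python (illustrative only; the slides do this by hand with the rounded coefficients 9.7195 and 0.9079):

```python
# Fitted values for plotting, using the rounded coefficients from the slide.
def fostiq_hat(owniq):
    return 9.7195 + 0.9079 * owniq

# Evaluate the line at the two plotting points and at the mean of OwnIQ.
points = [(x, round(fostiq_hat(x), 2)) for x in (80, 97.36, 120)]
print(points)
```

The rounded results match the slide: 82.35 at X = 80, 98.11 at the mean X = 97.36, and 118.67 at X = 120.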
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 30
Four uses of the model and evaluating how well it achieves these goals
• Description: just as a mean summarizes the behavior of Y or X, the regression equation, which represents the mean of Y at each X, summarizes their relationship
• Prediction: the regression equation allows us to predict Y, albeit imperfectly, if we have a given value of X
• Explanation: the regression equation shows how X "explains" some of the variation in Y
• Removing/controlling: once we have a summary, we can remove it and see what's left over: the residuals. Predicted values (ŷ_i) are estimated using the fitted line ŷ_i = β̂_0 + β̂_1 x_i; residuals (y_i - ŷ_i) are the distances between the observed and predicted values for a given value of X

What benchmark might we use to evaluate how well the regression equation achieves these goals?
• The least squares criterion identifies those parameter estimates that minimize the sum of the squared residuals
• So we may ask: how small did we get the residuals to be? The smaller they are, the better the fit
• But…how do we evaluate whether the residuals are small, large, or something in between?
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 31
Meet Sir R. A. Fisher: One of the “fathers” of modern statistics
(17 February 1890 – 29 July 1962)
Credited with bringing statistics into practice with the publication of his accessible book…

• Also initially a eugenicist.
• In 1919, segued to doing agricultural research at Rothamsted Experimental Station
• Popularized many modern statistical concepts and techniques including randomized trials, degrees of freedom, and the use of p-values for hypothesis testing

Right now, we're going to focus on one of his contributions, the Analysis of Variance, which helps us show how the relative size of the residuals helps us evaluate how well the regression line fits the data.
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 32
Step 1: Let's make sure we understand how to compute residuals: the vertical distances between observed values (y_i) and fitted values (ŷ_i).

For ID 1 (OWNIQ = 68, FOSTIQ = 63):
• Step 1: Compute ŷ_i by substituting OWNIQ into the regression equation: FOSTIQ-hat = 9.7195 + 0.9079(68) = 71.46
• Step 2: Calculate the residual = y_i - ŷ_i: residual = 63 - 71.46 = -8.46

Conclusion: The FOSTIQ of ID 1 (the IQ of the child who was adopted) is 8.5 points lower than we would have predicted on the basis of his/her OWNIQ (the IQ of the child raised by the birth parents).

ID  OwnIQ  FostIQ    yhat    residual
 1    68     63     71.459   -8.4586
 2    71     76     74.182    1.8177
 3    73     77     75.998    1.0019
 4    75     72     77.814   -5.8139
 5    78     71     80.538   -9.5376
...
49   118    116    116.854   -0.8536
50   121    118    119.577   -1.5773
51   125    128    123.209    4.7911
52   129    117    126.841   -9.8405
53   131    132    128.656    3.3437

Positive residuals mean we under-predicted; negative residuals mean we over-predicted. Sometimes we under-predict, sometimes we over-predict, but across the full sample the residuals will always sum to 0.
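The yhat and residual columns above can be recomputed directly (an illustrative Python sketch, not the course's SAS; coefficients taken from the PROC REG output):

```python
# Recompute fitted values and residuals for the first few twin pairs,
# using the PROC REG coefficient estimates.
b0, b1 = 9.71949, 0.90792
rows = [(1, 68, 63), (2, 71, 76), (3, 73, 77), (4, 75, 72), (5, 78, 71)]
for case, own, fost in rows:
    yhat = b0 + b1 * own          # predicted FostIQ
    resid = fost - yhat           # observed minus predicted
    print(case, round(yhat, 3), round(resid, 4))
```

The values agree with the printed table to within rounding of the coefficients (e.g., ID 1: yhat ≈ 71.458, residual ≈ -8.458).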
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 33
Step 2: To what might we compare the size of the residual? Let's start with a single case…here, ID 46 (OWNIQ = 114, FOSTIQ = 125):

    Observed value:  y_i = 125
    Fitted value:    ŷ_i = 113.22
    Sample mean:     ȳ = 98.11

    Total deviation:       (y_i - ȳ)  = 26.89
    Regression deviation:  (ŷ_i - ȳ)  = 15.11
    Error deviation:       (y_i - ŷ_i) = 11.78

The mean would be our "best guess" for all values of Y if we had no information about the regression model.

ANOVA regression decomposition:

    (y_i - ȳ) = (ŷ_i - ȳ) + (y_i - ŷ_i)
    Total Dev = Regr Dev + Error Dev

Two ways the ANOVA regression decomposition helps us evaluate the quality of the fit:
1. Total deviations provide a context for evaluating the magnitude of the residuals
2. Instead of focusing on just the magnitude of the error deviations (the residuals), we can equivalently focus on the magnitude of the regression deviations

But…
1. How do we generalize these ideas across cases?
2. How do we numerically make the comparison?
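The decomposition for ID 46 can be checked with simple arithmetic (an illustrative Python sketch; the mean and coefficients come from earlier slides):

```python
# The three deviations for ID 46 (OwnIQ = 114, FostIQ = 125).
ybar = 98.11                       # sample mean of FostIQ
y = 125                            # observed FostIQ for ID 46
yhat = 9.7195 + 0.9079 * 114       # fitted value, about 113.22

total_dev = y - ybar               # about 26.89
regr_dev = yhat - ybar             # about 15.11
error_dev = y - yhat               # about 11.78

# The identity holds exactly: total = regression + error.
assert abs(total_dev - (regr_dev + error_dev)) < 1e-9
```

The identity holds for every case, which is what lets the decomposition generalize from one observation to the whole sample.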
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 34
Step 3: Let's generalize and quantify these comparisons. The general case of regression decomposition:

    Total Dev: (y_i - ȳ)     Regr Dev: (ŷ_i - ȳ)     Error Dev: (y_i - ŷ_i)

    R² = Σ(Regr Dev)² / Σ(Total Dev)² = 1 - Σ(Error Dev)² / Σ(Total Dev)²

    R² = SSRegression / SSTotal,  where SS means "Sum of Squares"
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 35
Analysis of Variance regression decomposition
The REG ProcedureModel: MODEL1Dependent Variable: fostiq
Number of Observations Read 53Number of Observations Used 53
Analysis of Variance Sum of MeanSource DF Squares Square F Value Pr > F
Model 1 9250.65939 9250.65939 169.42 <.0001Error 51 2784.66136 54.60120 Corrected Total 52 12035
Root MSE 7.38926 R-Square 0.7686Dependent Mean 98.11321 Adj R-Sq 0.7641Coeff Var 7.53136
Analysis of Variance regression decomposition:

    Σ(y_i - ȳ)² = Σ(ŷ_i - ȳ)² + Σ(y_i - ŷ_i)²
    SSTotal     = SSRegress    + SSError
    12035       = 9251         + 2785

    R² = 9251 / 12035 = 0.7686

Interpreting R²: 76.9 percent of the variation in the foster twins' IQ scores is "associated with" or "predicted by" the IQ of the twin raised in the natural home.

What about the remaining 23.1%?
• Measurement error
• Random error/individual variation
• Other predictors (environment, SES of household)
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 36
Notes on the interpretation of R2
• R2 says nothing about causality
• Context in which you interpret the value of R2 depends upon your discipline (more in Unit 2)
• R2 does not tell us about the appropriateness of straight lines nor the strength of nonlinear relationships (more in Units 4 and 5)
• R2 is not a measure of the slope of the line (steep and shallow slopes can have low or high R2 statistics) (unless R2=0, in which case the slope will also be 0)
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 37
One last parameter to estimate: the residual variance, σ²_Y|X
Analysis of Variance Sum of MeanSource DF Squares Square Model 1 9250.65939 9250.65939 Error 51 2784.66136 54.60120 Corrected Total 52 12035
Root MSE 7.38926 R-Square 0.7686Dependent Mean 98.11321 Adj R-Sq 0.7641Coeff Var 7.53136
[Graph: distributions of Y at x_1, x_2, …, x_3 with common variance σ²_Y|X around the regression line]

Recall two of the regression assumptions:
1. At each value of X, there is a distribution of Y. These distributions have a mean μ_Y|X and a variance σ²_Y|X.
3. Homoscedasticity. The variances of each of these distributions, the σ²_Y|X's, are identical.

Of what importance is σ²_Y|X (the residual variance of Y at each value of X)? It tells us about the variability of the residuals: the unexplained variability in Y that's "left over."

How do we estimate variances? Let's start by reviewing the sample variance of Y:

    σ̂²_Y = Σ(y_i - ȳ)² / (n - 1) = 12,035 / 52 = 231.4

Why do we subtract 1? Because we estimated 1 parameter (the mean) to estimate this other parameter (the variance).

Does this numerator look familiar? It is the Corrected Total sum of squares in the ANOVA table. So how does this help us estimate σ²_Y|X?
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 38
From estimating σ²_Y to estimating σ²_Y|X
Analysis of Variance Sum of MeanSource DF Squares Square Model 1 9250.65939 9250.65939 Error 51 2784.66136 54.60120 Corrected Total 52 12035
Root MSE 7.38926 R-Square 0.7686Dependent Mean 98.11321 Adj R-Sq 0.7641Coeff Var 7.53136
In general,

    estimated variance = sum of squares / (n - # parameters estimated)

For the sample variance of Y:

    σ̂²_Y = Σ(y_i - ȳ)² / (n - 1)

For the residual variance:

    σ̂²_Y|X = Σ(y_i - ŷ_i)² / (n - 2) = 2784.66 / 51 = 54.60

We take away 2 because we estimated both β_0 and β_1 to estimate ŷ. This estimate is the Mean Square Error (MSE). Its square root, the Root Mean Square Error (RMSE), is the standard deviation of the residuals (Root MSE = 7.38926 in the output above).

Does this concept of "penalizing" our calculations for the number of parameters estimated have a name?
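All of the summary statistics on this slide follow from the ANOVA table by simple arithmetic (an illustrative Python sketch; the sums of squares are copied from the PROC REG output above):

```python
import math

# Arithmetic behind the ANOVA table: n = 53, 2 parameters (b0 and b1).
ss_regress, ss_error = 9250.65939, 2784.66136
ss_total = ss_regress + ss_error       # "Corrected Total", about 12035
n = 53

r_square = ss_regress / ss_total       # R-Square
mse = ss_error / (n - 2)               # Mean Square Error = SSError / df
rmse = math.sqrt(mse)                  # Root MSE, sd of the residuals
var_y = ss_total / (n - 1)             # sample variance of FostIQ
print(round(r_square, 4), round(mse, 3), round(rmse, 3), round(var_y, 1))
```

Note how every quantity in the SAS output (R-Square 0.7686, MSE 54.601, Root MSE 7.389, and the sample variance 231.4 from PROC UNIVARIATE) is recoverable from the three sums of squares and the degrees of freedom.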
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 39
Developing an understanding about degrees of freedom (df)
Imagine a random sample of 3 numbers from an infinite population. The number of df depends on the number of constraints on their values:

• You have no constraints → all three numbers could be any numbers (e.g., 10, 20 & 30; 200, -7, ⅞; -2.4, 86.8, 0) → df = 3
• The sample mean = 10 (i.e., 1 constraint) → two numbers could be anything; the third is fixed (e.g., 0 and 10 force 20; 5 and 10 force 15) → df = 2
• Sample mean = 10 and sample SD = 10 (i.e., 2 constraints) → one number could be anything; the remaining two are fixed (e.g., 10 forces 0 & 20; 0 forces 10 & 20) → df = 1

As the constraints increase…freedom decreases.

In regression, the degrees of freedom for a parameter estimate depend on: (1) the sample size AND (2) the # of other parameters you need to estimate in order to estimate this parameter.
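The "mean = 10" row can be verified mechanically (an illustrative Python sketch of the counting argument, not part of the original slide):

```python
# With n = 3 numbers and a known sample mean, only two numbers can vary
# freely: the sum must equal n * mean, so the third value is forced.
def third_number(a, b, mean):
    return 3 * mean - a - b

# Examples from the slide, with mean = 10:
assert third_number(0, 10, 10) == 20
assert third_number(5, 10, 10) == 15

# Sanity check: the completed triple really does have the required mean.
vals = [0, 10, third_number(0, 10, 10)]
assert sum(vals) / 3 == 10
```

Adding the SD constraint removes one more free choice, which is exactly why the residual variance divides by n - 2 after estimating two regression parameters.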
© Judith D. Singer, Harvard Graduate School of Education Unit 1/Slide 40
What’s the big takeaway from this unit?
• The regression model represents your hypothesis about the population
– When you fit a regression model to data, you are estimating sample values of population parameters that you'll never actually calculate directly
– Don’t confuse sample estimates with true values—estimates are just estimates, even if they have sound statistical properties
– The regression model focuses on the average of Y at each given value of X. Individual variation figures prominently into the model through the error term (and residuals)
• Be sure to fully understand the meaning of the regression coefficients
– These are the building blocks for all further data analysis; take the time to make sure you have a complete and instinctual understanding of what they tell us
– Distinguish clearly between the magnitude and strength of an effect—don’t confuse these separate concepts
– The regression approach assumes linearity. We’ll learn in Units 4 and 5 how to evaluate these assumptions and what to do if they don’t hold
• R2 is a nifty summary of how much the regression model helps us
– Be careful about causal language: the phrase "explained by" does not imply causality
– The regression decomposition, which leads to R2 and our estimate of the Root MSE (the residual standard deviation), will appear in subsequent calculations; be sure you understand what they do and do not mean
Appendix: Annotated PC SAS code for Unit 1, Burt data
options nodate nocenter nonumber;
title1 "Unit 1: IQs of Cyril Burt's identical twins";
footnote1 "m:\SAS Programs\Unit 1--Burt analysis.sas";

*-----------------------------------------------------*
Be sure to update the infile reference to the
file's location on your computer
*-----------------------------------------------------*;

*-----------------------------------------------------*
Input Burt data and name variables in dataset
*-----------------------------------------------------*;
data one;
  infile 'm:\datasets\Burt.txt';
  input ID 1-2 OwnIQ 4-6 FostIQ 8-10;

*-----------------------------------------------------*
List owniq & fostiq data for entire Burt sample
*-----------------------------------------------------*;
proc print data=one;
  title2 "Data listing";
  var owniq fostiq;
  id id;
Continued on next page
Every SAS statement ends with a semicolon ;

The options statement specifies how you'd like the output to look—here eliminating dates, centering and page numbers.

The title and footnote statements provide text that will appear on the output; add as many as you like but always enclose the text in quotes.

Comments start with an asterisk * and can run over several lines (don't forget the semicolon). Unlike titles and footnotes, they appear only in your program and log.

The data step has (at least) three statements:
• The data statement reads raw data from an external file (here, Burt.txt) into a temporary SAS dataset (here called one).
• The infile statement specifies the location of the raw data. Indicate the appropriate drive where the file is stored (usually e: or f: for a flash drive, etc.)
• The input statement specifies the variable names and their column locations in the raw data file.

proc print prints the newly created SAS data set (named "one"). The var statement identifies the variables you want printed; adding another title statement and id statement helps make the output more readable.
proc univariate data=one plot;
  title2 "Descriptive statistics";
  var fostiq owniq;
  id id;

*---------------------------------------------------*
Bivariate scatterplot of fostiq vs owniq
using proc plot & proc gplot
*---------------------------------------------------*;
proc plot data=one;
  title2 "Line Printer plot of FostIQ vs OwnIQ";
  plot fostiq*owniq;

proc gplot data=one;
  title2 "High quality plot of FostIQ vs OwnIQ";
  plot fostiq*owniq;
  symbol value=dot;

*----------------------------------------------------*
Fitting OLS regression model fostiq on owniq
*----------------------------------------------------*;
proc reg data=one;
  title2 "Regression of FostIQ on OwnIQ";
  model fostiq = owniq;
run;
quit;
Appendix: Annotated PC SAS code for Unit 1, Burt data, continued
proc univariate presents summary statistics (e.g., means, sd's, stem-and-leaf displays). The var statement specifies the variables you want analyzed; the id statement provides identifiers for extreme values.

proc plot presents a "line printer" scatterplot. The plot statement specifies the variables you want analyzed; the syntax is outcome*predictor.

proc gplot presents a high quality scatterplot suitable for presentation. Its plot statement syntax is also outcome*predictor. If you don't use a symbol statement, SAS will use + as a plotting symbol; here, we ask it to use a dot ●.

proc reg fits a linear regression model using variables you specify. Its model statement syntax is outcome=predictor(s) (note the switch from an asterisk for the plots to an = for the model).

The run statement tells SAS to execute the entire program; the quit statement tells SAS to stop execution.
Appendix: Relationships between Variance, MSE, and R2
(on both squared and square root scales)
• Variance: incorporates the Sum of Squares Total (SST: differences between observed values and the mean); measures variation in Y overall; interpreted as the average squared distance from the mean.
• SD (square root of variance): interpreted as the average distance from the mean.
• MSE: incorporates the Sum of Squares Error (SSE: differences between observed and fitted values); measures variation in the residuals around the fitted values; interpreted as the estimated variance of Y at each value of X.
• RMSE (square root of MSE): interpreted as the estimated SD of Y at each value of X.
• R2: incorporates the Sum of Squares Regression (SSR: differences between fitted values and the mean of Y) and SST; measures the ratio of SSR to SST; interpreted as the proportion of variation in Y attributable to X.
• r (square root of R2): measures strength of association; this is the correlation coefficient (more in Unit 2).
In symbols:

Variance:  $s^2 = \dfrac{\sum_i (y_i - \bar{y})^2}{n - 1}$

R-squared:  $R^2 = \dfrac{SSR}{SST} = \dfrac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}$

MSE:  $\hat{\sigma}^2_{y|x} = \dfrac{\sum_i (y_i - \hat{y}_i)^2}{n - 2}$
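To see the decomposition at work, here is a small numeric sketch (in Python, not part of the original slides) that rebuilds SST, SSR, SSE, R2, MSE, and RMSE from the five Burt observations listed in the subscripts appendix:

```python
# A sketch using only the five Burt observations listed in the subscripts
# appendix, showing how SST decomposes into SSR + SSE and how R^2, MSE,
# and RMSE are built from those sums of squares.
x = [68, 71, 73, 75, 78]   # OwnIQ
y = [63, 76, 77, 72, 71]   # FostIQ
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)                 # total variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))    # residual variation
ssr = sum((yh - ybar) ** 2 for yh in yhat)              # regression variation

r2 = ssr / sst
mse = sse / (n - 2)       # two parameters estimated, so n - 2 df
rmse = mse ** 0.5
print(round(sst, 2), round(ssr + sse, 2))   # both print 122.8: SST = SSR + SSE
```

Note that MSE divides by n − 2, not n − 1: estimating both the intercept and the slope costs a second degree of freedom.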
Appendix: It’s all Greek to me!
Quantity | Population symbol (pronunciation) | Sample estimate (pronunciation)
Mean | $\mu$ (mu) | $\bar{x}$ (x bar; the sample mean)
Mean of Y at a given value of x | $\mu_{Y|x}$ (mu sub-Y at this given value of x) | $\bar{Y}|x$ (sample mean of Y at this given value of x)
Variance | $\sigma^2$ (sigma-squared) | $s^2$ (sample variance)
Standard deviation | $\sigma$ (sigma) | $s$ (sample standard deviation)
Y intercept in a regression equation | $\beta_0$ (beta-zero) | $\hat{\beta}_0$ (beta-zero hat)
Slope of a regression equation | $\beta_1$ (beta-one) | $\hat{\beta}_1$ (beta-one hat)
Random error | $\varepsilon$ (epsilon) | $y_i - \hat{y}_i$ (residual)
Regression model expressed using generic random variables | $Y = \beta_0 + \beta_1 X + \varepsilon$ | $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$
Regression model expressed using variable names | $FOSTIQ = \beta_0 + \beta_1 OWNIQ + \varepsilon$ | $\widehat{FOSTIQ} = \hat{\beta}_0 + \hat{\beta}_1 OWNIQ$
Sum across all observations | $\Sigma$ (Sigma; this is an upper-case letter). Note that this capital Greek letter does not refer to a population parameter!
Appendix: Using Subscripts: Don’t be confused by the “i”
ID  OwnIQ  FostIQ
 1   68     63
 2   71     76
 3   73     77
 4   75     72
 5   78     71
...

ID   X    Y
 1   x1   y1
 2   x2   y2
 3   x3   y3
 4   x4   y4
 5   x5   y5
...

xi is a generic name for x1, x2, etc.
yi is a generic name for y1, y2, etc.
We use the subscript “i” to mean several things, depending on context:
1. any one of the observations;
2. any one of the X values;
3. more uses later in the semester.
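In code terms, the subscript is just an index into the data columns (a tiny Python sketch, not part of the original slides):

```python
# Sketch (not from the slides): the subscript i simply indexes observations,
# so x_i and y_i are the i-th entries of the data columns.
own_iq  = [68, 71, 73, 75, 78]   # x1, x2, x3, x4, x5
fost_iq = [63, 76, 77, 72, 71]   # y1, y2, y3, y4, y5

i = 3                            # pick the third observation
print(own_iq[i - 1], fost_iq[i - 1])   # x3 and y3: prints 73 77
```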
[Number line with tick marks at 10, 20, 30, 40, labeled x1, x2, x3, x4]
Again, xi is a generic name for x1, x2, etc.
Glossary terms included in Unit 1
• Assumptions of regression
• Covariate
• Degrees of freedom
• Individual variation
• Intercept
• Least squares regression
• Magnitude
• Measurement error
• MSE (mean square error)
• Observed values and estimated values
• Parameter estimates
• Residual
• R-squared
• Slope
• Strength