Lecture 18: Thurs., Nov. 6th
• Chapters 8.3.2, 8.4, 8.6.1
• Outliers and Influential Observations
• Transformations
• Interpretation of log transformations (8.4)
• R² (8.6.1)
Outliers and Influential Observations
• An outlier is an observation that lies outside the overall pattern of the other observations. A point can be an outlier in the x direction, the y direction or in the direction of the scatterplot. For regression, the outliers of concern are those in the x direction and the direction of the scatterplot. A point that is an outlier in the direction of the scatterplot will have a large residual.
• An observation is influential if removing it markedly changes the least squares regression line. A point that is an outlier in the x direction will often be influential.
• The least squares method is not resistant to outliers. Follow the outlier examination strategy in Display 3.6 for dealing with outliers in the x direction and outliers in the direction of the scatterplot.
Outliers Example
• Does the age at which a child begins to talk predict the child's score on a test of mental ability taken at a later age?
• gesell.JMP contains data on each child's age at first word (x) and Gesell Adaptive score (y), the score on an ability test taken at a later age.
• Child 18 is an outlier in the x direction and potentially influential. Child 19 is an outlier in the direction of the scatterplot.
• To assess whether a point is influential, fit the least squares line with and without the point (in JMP, exclude the row to fit the line without it) and see how much difference it makes.
• Child 18 is highly influential; child 19 is not highly influential.
[Figure: Bivariate Fit of Score By Age. Scatterplot of Score (50 to 120) against Age (5 to 45) with the fitted line.]
Parameter Estimates
Term        Estimate    Std Error   Prob>|t|
Intercept   109.87384   5.067802    <.0001
Age         -1.126989   0.310172    0.0018
[Figure: Bivariate Fit of Score By Age. Scatterplot of Score (70 to 130) against Age (5 to 25) with the fitted line.]
Parameter Estimates
Term        Estimate    Std Error   Prob>|t|
Intercept   105.62987   7.161928    <.0001
Age         -0.779221   0.516733    0.1489
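The with-and-without comparison can be sketched in a few lines of Python. This is only an illustration: the data below are made up (they are not the gesell.JMP values), and `least_squares` is a hypothetical helper written for this sketch.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least squares line for paired lists x, y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data (NOT the gesell.JMP values); the last point is far out in x.
age = [8, 9, 10, 11, 12, 40]
score = [100, 104, 98, 103, 99, 60]

# Fit once with every point, then again with the suspect point excluded.
intercept_all, slope_all = least_squares(age, score)
intercept_drop, slope_drop = least_squares(age[:-1], score[:-1])

print(f"with point:    slope = {slope_all:.3f}")
print(f"without point: slope = {slope_drop:.3f}")
```

A large change in the slope between the two fits is what marks the excluded point as influential.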
Case Study 8.1.1
• Biologists are interested in the relationship between the area of islands (X) and the number of animal and plant species (Y) living on them.
– Estimates of this relationship are useful in conservation biology for predicting species extinction rates due to diminishing habitat.
• Data in Display 8.1 are number of reptile and amphibian species and the island areas for seven islands in the West Indies.
Scatterplots for Species Data
• Regression function does not appear to be linear.
[Figure: Bivariate Fit of SPECIES By AREA. Scatterplot of SPECIES against AREA, and a plot of Residual against AREA.]
Case Study 8.1.2
• In an industrial laboratory, batches of electrical insulating fluid were subjected to different voltages until the insulating property of the fluid broke down.
• Y = time to breakdown of an insulating fluid, X = voltage. The residual plot shows a “horn-shaped” pattern, indicating both nonlinearity and nonconstant variance.
[Figure: Scatterplot of TIME against VOLTAGE, and a plot of Residual against VOLTAGE.]
Tukey’s Bulging Rule
• Draw a circle, divide into 4 pieces
• Try transformations based on what quadrant the shape of the data falls in.
• Upper left: sqrt X, log X, 1/X, Y²
• Upper right: X², Y²
• Lower left: sqrt X, log X, 1/X, sqrt Y, log Y, 1/Y
• Lower right: X², sqrt Y, log Y, 1/Y
• Try different transformations, draw residual plots and see which works best. If no transformation works, polynomial regression (Ch. 9) must be used.
Transformations for Voltage Data
[Figure: Log Time against VOLTAGE, and a plot of Residual against VOLTAGE.]
[Figure: Bivariate Fit of Square Root of Time By VOLTAGE. Square Root of Time against VOLTAGE, and a plot of Residual against VOLTAGE.]
Transformations for Species Data
[Figure: SPECIES against Log Area, and a plot of Residual against Log Area.]
[Figure: Log Area against Log Species, and a plot of Residual against Log Species.]
Prediction After Transformation
• To predict y given x (or to estimate $\mu\{Y|X\}$) when y has been transformed to f(y) and x to g(x),
$\hat{\mu}\{Y|X=x\} = f^{-1}\left(\hat{\mu}\{f(Y)|g(X)=g(x)\}\right)$
• Species data, log-log transformation: Y transformed to log Y, X transformed to log X.
Linear Fit
Log Species = 1.9365081 + 0.2496799 Log Area
• Predicted number of species given area = 30000:
– Predicted log species given log area = log(30000) = 10.31 equals 1.94 + 0.25*10.31 = 4.52.
– Predicted number of species given area = 30000 equals exp(predicted log species given log area = log(30000)) = exp(4.52) = 91.84.
Second Prediction Example
• For voltage data, if using the square root transformation, to predict y based on x, square the predicted square root of y.
Linear Fit
Square Root of Time = 61.784472 - 1.6958968 VOLTAGE
• Predicted Time for Voltage = 30:
– Predicted Square Root of Time for Voltage = 30 equals 61.78 - 1.70*30 = 10.78
– Predicted Time for Voltage = 30 equals 10.78² = 116.21
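Both prediction examples follow the same recipe: transform x, evaluate the fitted line, then apply the inverse of the response transformation. A minimal Python sketch, using the fitted coefficients shown in the Linear Fit output; the function name `predict_original_scale` and its parameters are hypothetical, chosen only for this illustration.

```python
import math

def predict_original_scale(intercept, slope, x, g, f_inverse):
    """Back-transform a fitted value: f^{-1}(intercept + slope * g(x))."""
    return f_inverse(intercept + slope * g(x))

# Species data, log-log fit: Log Species = 1.9365081 + 0.2496799 Log Area
species_at_30000 = predict_original_scale(
    1.9365081, 0.2496799, 30000, g=math.log, f_inverse=math.exp
)

# Voltage data, square-root fit: Square Root of Time = 61.784472 - 1.6958968 VOLTAGE
time_at_30 = predict_original_scale(
    61.784472, -1.6958968, 30, g=lambda v: v, f_inverse=lambda r: r ** 2
)

print(round(species_at_30000, 1), round(time_at_30, 1))
```

With unrounded coefficients the results come out near 91 species and near 119 time units. The hand calculations use coefficients rounded to two decimals, which is why they differ slightly; back-transformation can magnify small rounding differences.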
Testing whether Y is Associated with X
• To test whether Y is associated with X, we can test whether f(Y) is associated with g(X) by testing whether the slope is zero in the transformed model.
• Strong evidence that number of species is associated with area.
• Interpreting the slope and intercept is difficult except for log transformations.
Linear Fit
Log Area = -7.595201 + 3.9585895 Log Species
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   61.784472   7.776881    7.94      <.0001
VOLTAGE     -1.695897   0.233695    -7.26     <.0001
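The t Ratio used to test whether a coefficient is zero is the estimate divided by its standard error. A short Python check against the Parameter Estimates values; the helper name `t_ratio` is hypothetical, and the p-value is left out because it depends on the degrees of freedom, which are not given here.

```python
def t_ratio(estimate, std_error):
    """t statistic for testing H0: coefficient = 0."""
    return estimate / std_error

# Values taken from the Parameter Estimates table (Intercept and VOLTAGE rows).
t_intercept = t_ratio(61.784472, 7.776881)
t_slope = t_ratio(-1.695897, 0.233695)

print(round(t_intercept, 2), round(t_slope, 2))
```

The rounded results agree with the t Ratio column of the table.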
Interpreting log transformations
• Case I: Response is logged, explanatory variable is not logged.
• $\mu\{\log Y|X\} = \beta_0 + \beta_1 X$
• Median{Y|X} = $\exp(\beta_0)\exp(\beta_1 X)$
• Consequently, Median{Y|(X+1)} / Median{Y|X} = $\exp(\beta_1)$
• Interpretation:
– If $\beta_1 > 0$, as X increases by 1, the median of Y increases by $(e^{\beta_1} - 1) \times 100\%$
– If $\beta_1 < 0$, as X increases by 1, the median of Y decreases by $(1 - e^{\beta_1}) \times 100\%$
Interpretation in Voltage Study
• Interpretation: It is estimated that the median failure time decreases by $(1 - e^{\hat{\beta}_1}) \times 100\% = 40\%$ with each 1 kV increase in voltage. 95% CI: median failure time decreases by $(1 - e^{-0.393}, 1 - e^{-0.6217}) \times 100\% = (32.50\%, 46.30\%)$ for each 1 kV increase in voltage.
[Figure: Log Time against VOLTAGE with the fitted line.]
Parameter Estimates
Term        Estimate    Prob>|t|   Lower 95%   Upper 95%
Intercept   18.955459   <.0001     15.149663   22.761254
VOLTAGE     -0.507365   <.0001     -0.621729   -0.393001
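The Case I arithmetic for the voltage study can be reproduced directly from the slope estimate and its confidence limits. A small Python sketch; `percent_change_in_median` is a hypothetical helper name.

```python
import math

def percent_change_in_median(beta1):
    """Percent change in Median{Y|X} per one-unit increase in X when Y is logged."""
    return (math.exp(beta1) - 1.0) * 100.0

# Slope estimate and 95% limits for VOLTAGE from the Parameter Estimates table.
estimate = percent_change_in_median(-0.507365)
ci_low, ci_high = sorted(
    percent_change_in_median(b) for b in (-0.621729, -0.393001)
)

print(f"{estimate:.1f}% ({ci_low:.1f}%, {ci_high:.1f}%)")
```

The values are negative because the slope is negative: a change of about -40% is the same statement as a 40% decrease in the median.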
Case II: Explanatory variable is logged
• $\mu\{Y|\log(X)\} = \beta_0 + \beta_1 \log X$
• Implies $\mu\{Y|\log(2X)\} - \mu\{Y|\log(X)\} = \beta_1 \log(2)$
• Interpretation: Doubling of X is associated with a $\beta_1 \log(2)$ change in the mean of Y.
• Species Example:
Parameter Estimates
Term        Estimate    Prob>|t|   Lower 95%   Upper 95%
Intercept   -5.294043   0.6845     -36.88145   26.293365
Log Area    8.8605231   0.0033     4.5212408   13.199805
• Interpretation: Doubling of Area is associated with an increase in mean species of 8.86*log(2) = 6.14. 95% CI = (4.52*log(2), 13.20*log(2)) = (3.13, 9.15)
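For Case II the change in the mean per doubling of X is the slope times log(2), with log the natural logarithm. A brief Python check of the species numbers; `mean_change_per_doubling` is a hypothetical helper name.

```python
import math

def mean_change_per_doubling(beta1):
    """Change in the mean of Y when X doubles, for a fit of Y on log(X)."""
    return beta1 * math.log(2)

# Log Area slope estimate and 95% limits from the Parameter Estimates table.
estimate = mean_change_per_doubling(8.8605231)
ci = (mean_change_per_doubling(4.5212408), mean_change_per_doubling(13.199805))

print(round(estimate, 2), tuple(round(c, 2) for c in ci))
```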
Case III: Both response and explanatory variable logged
• $\mu\{\log(Y)|\log(X)\} = \beta_0 + \beta_1 \log(X)$, so Median{Y|X} = $e^{\beta_0} X^{\beta_1}$
• Interpretation:
– A doubling of X is associated with a multiplicative change of $2^{\beta_1}$ in the median of Y.
– A ten-fold increase in X is associated with a multiplicative change of $10^{\beta_1}$ in the median of Y.
Case III Example
• Species Example:
[Figure: Log Species against Log Area with the fitted line.]
Parameter Estimates
Term        Estimate    Prob>|t|   Lower 95%   Upper 95%
Intercept   1.9365081   <.0001     1.7099593   2.1630569
Log Area    0.2496799   <.0001     0.218558    0.2808018
• Since $2^{0.25} = 1.19$, each doubling of island area is associated with a 19% increase in the median number of reptile and amphibian species. 95% CI for the increase = (16.4%, 21.5%).
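The Case III multipliers come from raising the scaling factor to the power of the slope. A minimal Python sketch with the log-log species fit; `multiplicative_change` is a hypothetical helper name.

```python
def multiplicative_change(beta1, factor=2.0):
    """Multiplicative change in Median{Y|X} when X is multiplied by `factor`."""
    return factor ** beta1

# Log Area slope estimate and 95% limits from the Parameter Estimates table.
doubling = multiplicative_change(0.2496799)
ci = (multiplicative_change(0.218558), multiplicative_change(0.2808018))
tenfold = multiplicative_change(0.2496799, factor=10.0)

# Express each multiplier as a percentage increase in the median.
print(f"doubling: {100 * (doubling - 1):.1f}%")
print(f"95% CI:   ({100 * (ci[0] - 1):.1f}%, {100 * (ci[1] - 1):.1f}%)")
print(f"ten-fold: {100 * (tenfold - 1):.1f}%")
```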
R-Squared
• The R-squared statistic, also called the coefficient of determination, is the percentage of response variation explained by the explanatory variable.
$R^2 = 100 \times \left( \frac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}} \right) \%$
• Total sum of squares = $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$. This is the best (smallest) sum of squared prediction errors without using x.
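The R-squared formula translates directly into code. The sketch below uses made-up response values (not data from the course) and a hypothetical helper `r_squared_percent`; it also checks the two boundary cases, a perfect fit and a horizontal line at the mean.

```python
def r_squared_percent(y, fitted):
    """R-squared as a percentage: 100 * (Total SS - Residual SS) / Total SS."""
    y_bar = sum(y) / len(y)
    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 100.0 * (total_ss - residual_ss) / total_ss

# Hypothetical responses, for illustration only.
y = [2.0, 4.1, 5.9, 8.2, 9.8]
y_bar = sum(y) / len(y)

perfect = r_squared_percent(y, y)                # fitted values equal the data
flat = r_squared_percent(y, [y_bar] * len(y))    # slope-zero line at the mean

print(perfect, flat)
```

A perfect fit leaves no residual variation, and a slope-zero line leaves all of it, so the two cases bracket the possible values of the statistic for a least squares fit.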
R-Squared example
• R² = 86.70%. Read as “86.70 percent of the variation in neuron activity was explained by linear regression on years played.”
[Figure: Bivariate Fit of Neuron activity index By Years playing. Neuron activity index (0 to 30) against Years playing (0 to 20) with the fitted line.]
Linear Fit
Neuron activity index = 7.9715909 + 1.0268308 Years playing
Summary of Fit
RSquare                      0.866986
RSquare Adj                  0.855902
Root Mean Square Error       3.025101
Mean of Response             15.89286
Observations (or Sum Wgts)   14
Interpreting R²
• If the residuals are all zero (a perfect fit), then R² is 100%. If the least squares line has slope 0, R² will be 0%.
• R² is useful as a unitless summary of the strength of linear association, but
– It is not useful for assessing model adequacy (e.g., linearity) or whether or not there is an association.
– A good R² depends on the context. In precise laboratory work, R² values under 90% might be too low, but in social science contexts, where a single variable rarely explains a great deal of the variation in the response, R² values of 50% may be considered remarkably good.
Coverage of Second Midterm
• Transformations of the data for the two-group problem (Ch. 3.5)
• Welch t-test (Ch. 4.3.2)
• Comparisons Among Several Samples (5.1-5.3, 5.5.1)
• Multiple Comparisons (6.3-6.4)
• Simple Linear Regression (Ch. 7.1-7.4, 7.5.3)
• Assumptions for Simple Linear Regression and Diagnostics (Ch. 8.1-8.4, 8.6.1, 8.6.3)