rahuliiim
7/28/2019 7 Bivariate Eda
bivariate EDA and regression
analysis
[Figure: scatterplot; axis labels: length, width]
[Figure: scatterplot; axis labels: distance from quarry, weight of core]
[Figure, repeated on three slides: scatterplot of AG_C1_2 (y-axis) against AG_C1_1 (x-axis)]
scatterplot matrix
[Figure: scatterplot matrix of AG_C1_1, AG_C2_1, AG_C3_1, AG_C4_1 against AG_C1_2, AG_C2_2, AG_C3_2, AG_C4_2]
[Figure: scatterplot matrix of all eight variables: AG_C1_1, AG_C2_1, AG_C3_1, AG_C4_1, AG_C1_2, AG_C2_2, AG_C3_2, AG_C4_2]
[Figure: scatterplot of AG_C2_1 (y-axis) against AG_C1_1 (x-axis)]
scatterplots
scatterplots provide the most detailed summary of a bivariate relationship, but they are not concise, and there are limits to what else you can do with them
simpler kinds of summaries may be useful
more compact; often capture less detail
may support more extended mathematical analyses
may reveal fundamental relationships
[Figure: scatterplot of AG_C1_2 (y-axis) against AG_C1_1 (x-axis)]
y = a + bx
y = a + bx
[Figure: line drawn on x-y axes through the points (x1,y1) and (x2,y2)]
a = y intercept
b = slope
b = Δy/Δx
b = (y2-y1)/(x2-x1)
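As a small illustration of the slope and intercept definitions, the sketch below recovers a and b from two points on a line. The function name `line_through` and the example points are hypothetical, chosen only for illustration.

```python
def line_through(p1, p2):
    """Return (a, b) for the line y = a + b*x passing through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x2 == x1:
        raise ValueError("vertical line: slope is undefined")
    b = (y2 - y1) / (x2 - x1)  # slope: change in y per unit change in x
    a = y1 - b * x1            # intercept: value of y where x = 0
    return a, b


# Hypothetical points, not taken from the slides.
a, b = line_through((1, 2), (5, 4))
print(f"y = {a} + {b}x")
```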
y = a + bx
we can predict values of y from values of x
predicted values of y are called y-hat
the predicted values (y) are often regarded as dependent on the (independent) x values
try to assign independent values to the x-axis, dependent values to the y-axis
$\hat{y} = a + bx$
y = a + bx
becomes a concise summary of a point distribution, and a model of a relationship
may have important explanatory and predictive value
how do we come up with these lines?
various options:
by eye
calculating a Tukey Line (resistant to outliers)
locally weighted regression (LOWESS)
least squares regression
linear regression
linear regression and correlation analysis
are generally concerned with fitting lines to
real data
least squares regression is one of the main
tools
attempts to minimize deviation of observed
points from the regression line
maximizes its potential for prediction
standard approach minimizes the squared variation in y
Note:
these are the vertical deviations
this is a sum-squared-error approach
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
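To make the sum-squared-error criterion concrete, here is a minimal sketch that scores any candidate line y-hat = a + bx by its squared vertical deviations. The function name and the five data points are hypothetical, not taken from the slides.

```python
def sum_squared_error(xs, ys, a, b):
    """Sum of squared vertical deviations of the points from y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Compare two candidate lines: the one with the smaller value fits better
# in the least squares sense.
print(sum_squared_error(xs, ys, 2.2, 0.6))  # a sloped candidate line
print(sum_squared_error(xs, ys, 4.0, 0.0))  # a flat line at the mean of y
```

For these points the sloped line gives the smaller sum, so it would be preferred under this criterion.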
regressing x on y would involve defining the line
$\hat{x}_i = c + d y_i$
by minimizing
$\sum (x_i - \hat{x}_i)^2$
calculating a line that minimizes the squared vertical deviations in y is called regressing y on x
appropriate when we are trying to predict y from x
this is also called Model I Regression
start by calculating the slope (b):
$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
(numerator: covariance)
once you have the slope, you can calculate the y-intercept (a):
$a = \bar{y} - b\bar{x} = \frac{\sum y_i - b \sum x_i}{n}$
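The two steps (slope first, then intercept) can be sketched in a few lines of plain Python. The function name `least_squares` and the data are hypothetical; this is an illustration of the formulas, not a replacement for a statistics package.

```python
def least_squares(xs, ys):
    """Regress y on x: return (a, b) for y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of cross-products of deviations (the covariance term).
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations in x.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x-bar, y-bar)
    return a, b


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x")
```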
regression pathologies
things to avoid in regression analysis
Tukey Line
resistant to outliers
divide cases into thirds, based on the x-axis
identify the median x and y values in the upper and lower thirds
slope (b) = (My3-My1)/(Mx3-Mx1)
intercept (a) = median of all values yi-b*xi
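The Tukey Line steps translate fairly directly into code. The sketch below follows the recipe as stated on the slide. One assumption is mine: when the number of cases is not divisible by three, the outer thirds each take n // 3 cases, which is only one of several conventions. The function name and data are hypothetical.

```python
from statistics import median


def tukey_line(xs, ys):
    """Resistant line: return (a, b) for y = a + b*x using medians of outer thirds."""
    pts = sorted(zip(xs, ys))   # order the cases along the x-axis
    k = len(pts) // 3           # assumed size of the lower and upper thirds
    if k == 0:
        raise ValueError("need at least three cases")
    lower, upper = pts[:k], pts[-k:]
    mx1 = median([x for x, _ in lower])
    my1 = median([y for _, y in lower])
    mx3 = median([x for x, _ in upper])
    my3 = median([y for _, y in upper])
    b = (my3 - my1) / (mx3 - mx1)
    a = median([y - b * x for x, y in pts])  # median of all values yi - b*xi
    return a, b


# Hypothetical data with one extreme outlier in the last case.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2, 3, 4, 5, 5, 7, 8, 9, 40]
print(tukey_line(xs, ys))
```

Because only medians are used, the outlying last case has no influence on the fitted line in this example.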
Correlation
regression concerns fitting a linear model to observed data
correlation concerns the degree of fit between observed data and the model...
if most points lie near the line:
the fit of the model is good
the two variables are strongly correlated
values of y can be well predicted from x
Pearson's r
this is assessed using the product-moment correlation coefficient:
$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
= covariance (the numerator), standardized by a measure of variation in both x and y
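A minimal sketch of the product-moment coefficient, written to mirror the formula term by term. The function name `pearson_r` and the example data are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # covariance term
    sxx = sum((x - x_bar) ** 2 for x in xs)                       # variation in x
    syy = sum((y - y_bar) ** 2 for y in ys)                       # variation in y
    return sxy / sqrt(sxx * syy)


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 3))
```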
[Figure: x-y plot with quadrants marked + and -, showing a point (xi,yi) alongside the formula for r]
unlike the covariance, r is unit-less
ranges between -1 and 1
0 = no correlation
-1 and 1 = perfect negative and positive correlation (respectively)
r is symmetrical
correlation between x and y is the same as between y and x
no question of independence or dependence
recall, this symmetry is not true of regression
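The contrast between symmetric correlation and asymmetric regression can be checked numerically. In this sketch (hypothetical function names and data), swapping the variables leaves r unchanged but gives a different regression slope. The product of the two slopes equals r squared.

```python
from math import sqrt


def cross_products(xs, ys):
    """Return (sxy, sxx, syy): sums of cross-products and squared deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy, sxx, syy


def pearson_r(xs, ys):
    sxy, sxx, syy = cross_products(xs, ys)
    return sxy / sqrt(sxx * syy)


def slope(predictor, response):
    """Least squares slope when regressing `response` on `predictor`."""
    sxy, sxx, _ = cross_products(predictor, response)
    return sxy / sxx


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(pearson_r(xs, ys), pearson_r(ys, xs))  # identical
print(slope(xs, ys), slope(ys, xs))          # different lines
```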
regression/correlation
one can assess the strength of a relationship by
seeing how knowledge of one variable
improves the ability to predict the other
if you ignore x, the best predictor of y will be the mean of all y values (y-bar)
if the y measurements are widely scattered, prediction errors will be greater than if they are close together
we can assess the dispersion of y values around their mean by:
$\sum (y_i - \bar{y})^2$
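[Figure: a point's squared deviation from the mean, $(y_i - \bar{y})^2$, compared with its squared deviation from the regression line, $(y_i - \hat{y}_i)^2$]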
coefficient of determination (r2)
$r^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$
describes the proportion of variation that is explained or accounted for by the regression line
r2 = .5
half of the variation is explained by the regression
half of the variation in y is explained by variation in x
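One way to see the coefficient of determination as "variation explained" is to compute it from the two sums of squares: deviations around the regression line versus deviations around the mean of y. The sketch below does this for hypothetical data; the function names are illustrative.

```python
def least_squares(xs, ys):
    """Regress y on x: return (a, b) for y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def r_squared(xs, ys):
    """Proportion of variation in y accounted for by the regression on x."""
    a, b = least_squares(xs, ys)
    y_bar = sum(ys) / len(ys)
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # around the line
    total = sum((y - y_bar) ** 2 for y in ys)                          # around the mean
    return 1 - unexplained / total


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(r_squared(xs, ys), 2))
```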
correlation and percentages
much of what we want to learn about association between variables can be learned from counts
ex: are high counts of bone needles associated with high counts of end scrapers?
sometimes, similar questions are posed of percent-standardized data
ex: are high proportions of decorated pottery associated with high proportions of copper bells?
caution
these are different questions and have
different implications for formal regression
percents will show at least some level of correlation even if the underlying counts do not
spurious correlation (negative)
closed-sum effect
case  C_v1  C_v2  C_v3  C_v4  C_v5  C_v6  C_v7  C_v8  C_v9  C_v10
1     15    14    94    59    76    13    8     97    10    95
2     35    1     89    95    23    77    14    9     27    43
3     20    96    73    31    90    65    74    60    85    27
4     23    59    7     52    33    83    71    35    57    90
5     36    90    86    15    97    54    52    41    34    3
6     79    2     26    5     11    68    74    44    13    87
7     40    99    28    66    77    23    69    22    63    36
8     95    36    22    75    21    48    95    58    74    68
9     27    0     58    99    32    30    5     5     100   75
10    67    93    98    61    62    94    3     16    43    48
[column subsets marked on the slide: 10 vars., 5 vars., 3 vars., 2 vars.]
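The closed-sum effect can be explored with the counts in the table above. This sketch converts the counts to percents over a subset of variables and averages the pairwise correlations. One assumption is mine: each subset is taken to be the first k columns, since the slide does not say which columns were used. Function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Counts copied from the table above (10 cases x 10 variables).
counts = [
    [15, 14, 94, 59, 76, 13, 8, 97, 10, 95],
    [35, 1, 89, 95, 23, 77, 14, 9, 27, 43],
    [20, 96, 73, 31, 90, 65, 74, 60, 85, 27],
    [23, 59, 7, 52, 33, 83, 71, 35, 57, 90],
    [36, 90, 86, 15, 97, 54, 52, 41, 34, 3],
    [79, 2, 26, 5, 11, 68, 74, 44, 13, 87],
    [40, 99, 28, 66, 77, 23, 69, 22, 63, 36],
    [95, 36, 22, 75, 21, 48, 95, 58, 74, 68],
    [27, 0, 58, 99, 32, 30, 5, 5, 100, 75],
    [67, 93, 98, 61, 62, 94, 3, 16, 43, 48],
]


def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def percents(rows, k):
    """Percent-standardize each case over its first k variables (assumed subset)."""
    return [[100 * v / sum(row[:k]) for v in row[:k]] for row in rows]


def mean_pairwise_r(rows):
    """Average Pearson's r over all pairs of variables (columns)."""
    cols = list(zip(*rows))
    rs = [pearson_r(c1, c2) for c1, c2 in combinations(cols, 2)]
    return sum(rs) / len(rs)


print("counts:", round(mean_pairwise_r(counts), 2))
for k in (10, 5, 3, 2):
    print(f"percents ({k} vars.):", round(mean_pairwise_r(percents(counts, k)), 2))
```

With only two variables the percents must sum to 100, so one is completely determined by the other and r is forced to -1 regardless of what the counts show.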
[Figure: distributions of r, each on an axis from -1.0 to 1.0, for original counts, percents (10 vars.), percents (5 vars.), percents (3 vars.), and percents (2 vars.)]
[Figure: five scatterplots: C_V2 against C_V1, P10_V2 against P10_V1, T5_V2 against T5_V1, T3_V2 against T3_V1, and T2_V2 against T2_V1]
regression assumptions
both variables are measured at the interval scale or above
variation is the same at all points along the regression line (variation is homoscedastic)
residuals
vertical deviations of points around the regression line
for case i, residual = yi - y-hati = yi - (a + b*xi)
residuals in y should not show patterned variation either with x or y-hat
normally distributed around the regression line
residual error should not be autocorrelated (errors/residuals in y are independent)
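A short sketch of how residuals might be computed for inspection. The names and data are hypothetical. Note that a least squares fit always produces residuals that sum to zero and are linearly uncorrelated with x, so the checks listed here concern whatever pattern remains (curvature, changing spread, runs of same-signed errors), which is usually judged from a plot of residuals against x or y-hat.

```python
def least_squares(xs, ys):
    """Regress y on x: return (a, b) for y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def residuals(xs, ys):
    """Vertical deviations yi - (a + b*xi) from the least squares line."""
    a, b = least_squares(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
for x, e in zip(xs, residuals(xs, ys)):
    print(x, round(e, 2))  # pairs to plot: residual against x
```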
standard error of the regression
recall: standard error of an estimate (SEE) is like a standard deviation
can calculate an SEE for residuals associated with a regression formula
$S_{y-\hat{y}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n}}$
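A minimal sketch of the standard error of the regression, treating it as the root mean squared residual. The divisor is an assumption worth flagging: the code divides by n by default, while many texts divide by n - 2 for a two-parameter line, so the hypothetical `ddof` argument lets you choose. Names and data are illustrative.

```python
from math import sqrt


def least_squares(xs, ys):
    """Regress y on x: return (a, b) for y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def standard_error_of_estimate(xs, ys, ddof=0):
    """Root mean squared residual; ddof=2 gives the n - 2 divisor used in many texts."""
    a, b = least_squares(xs, ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sse / (len(xs) - ddof))


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
see = standard_error_of_estimate(xs, ys)
a, b = least_squares(xs, ys)
x_new = 3
print(f"y-hat at x={x_new}: {a + b * x_new:.2f} +/- {see:.2f}")
```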
to the degree that the regression assumptions hold, there is a 68% probability that true values of y lie within 1 SEE of y-hat
95% within 2 SEE
can plot lines showing the SEE
y-hat = a+bx +/- SEE
data transformations and
regression
read Shennan, Chapter 9 (esp. pp. 151-173)
[Figure: two scatterplots of VAR2 against VAR1, both axes from 0 to 200]
[Figure: scatterplot of VAR2 against VAR1]
[Figure: scatterplot of VAR2 against VAR1T]
let VAR1T = sqr(VAR1)
distribution and fall-off models
ex: density of obsidian vs. distance from the quarry:
[Figure: scatterplot of DENSITY against DIST]
[Figure: scatterplot of DENSITY against DIST]
Plot of Residuals against Predicted Values
[Figure: RESIDUAL plotted against ESTIMATE]
[Figure: two scatterplots against DIST: DENSITY, and LG_DENS]
LG_DENS = log(DENSITY)
[Figure: scatterplot of LG_DENS against DIST]
y = 1.70-.05x
[remember y is logged density]
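To illustrate fitting a straight line to logged density, the sketch below builds synthetic, noise-free densities from an exponential fall-off and then regresses their logs on distance. The data are generated for illustration only and are not the obsidian measurements behind the slide. The natural log is assumed because the deck back-transforms with exp.

```python
from math import exp, log


def least_squares(xs, ys):
    """Regress y on x: return (a, b) for y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Synthetic illustration: densities that fall off exponentially with distance.
dist = [0, 10, 20, 30, 40, 50, 60, 70, 80]
density = [exp(1.70 - 0.05 * d) for d in dist]

# Transform, then fit a straight line on the log scale.
lg_dens = [log(v) for v in density]
a, b = least_squares(dist, lg_dens)
print(f"log(density) = {a:.2f} + ({b:.2f}) * dist")
```

Because the synthetic densities are exactly exponential, the logged values fall on a straight line and the fit recovers the coefficients used to generate them.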
[Figure: two scatterplots of DENSITY against DISTANCE]
log y = 1.70-.05x
fplot y = exp(1.70-.05*x)
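The back-transformation can be sketched in Python as well: a line fitted on the log scale becomes an exponential curve on the original scale. The function name `predicted_density` is hypothetical; the coefficients 1.70 and -.05 are the ones given in the deck.

```python
from math import exp


def predicted_density(distance, a=1.70, b=-0.05):
    """Back-transform the log-scale line log(y) = a + b*x to the original scale."""
    return exp(a + b * distance)


# Points along the curve, e.g. for overlaying on a plot of the raw data.
for d in (0, 20, 40, 60, 80):
    print(d, round(predicted_density(d), 2))
```

On the original scale the fitted model is a curve that falls off steeply near the quarry and flattens with distance, even though it is a straight line in the logged data.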
begin
PLOT DENSITY*DISTANCE / FILL=1,0,0
fplot y = exp(1.70-.05*x) ; XLABEL='' YLABEL=''
XTICK=0 XPIP=0 YTICK=0 YPIP=0 XMIN=0
XMAX=80 YMIN=0 YMAX=6
end
transformation summary
correcting left skew:
x^4 stronger
x^3 strong
x^2 mild
correcting right skew:
sqrt(x) weak
log(x) mild
-1/x strong
-1/x^2 stronger
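The relative strength of the right-skew corrections can be checked by computing a skewness statistic before and after each transformation. The sketch below uses the moment coefficient of skewness, which is my choice of measure, on a small hypothetical right-skewed sample. Values near zero indicate rough symmetry, and negative values indicate that the transformation has overcorrected.

```python
from math import log, sqrt


def skewness(values):
    """Moment coefficient of skewness: third central moment over m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Transformations for right skew, ordered from weak to stronger.
ladder = [
    ("x (raw)", lambda x: x),
    ("sqrt(x)", sqrt),
    ("log(x)", log),
    ("-1/x", lambda x: -1 / x),
    ("-1/x^2", lambda x: -1 / x ** 2),
]

# Hypothetical right-skewed sample (positive values only).
data = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]

skews = {name: skewness([f(v) for v in data]) for name, f in ladder}
for name, s in skews.items():
    print(f"{name:8s} skewness = {s:6.2f}")
```

For left-skewed data the same check applies with the power transformations (x^2, x^3, x^4) in place of this ladder.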