Medical Statistics (full English class)
Shaoqi Rao, PhD
School of Public Health
Sun Yat-Sen University
Slides adapted from Dr. Ji-Qian Fang’s
Chapter 8 Linear Regression
How does the value of one variable depend on that of another one?
How does the son’s height depend on the father’s height?
How does the death rate of animals depend on the drug dosage?
How does an infant’s weight depend on its age in months?
How does the body surface area depend on the height?
---- To explore linear dependence quantitatively between two continuous variables.
8.1 Statistical Description of Linear Regression
8.1.1 Linear regression equation
Initial meaning of “regression”: Galton noted that if the father is tall, his son will be relatively tall; if the father is short, his son will be relatively short. But if the father is very tall, his son will usually not be taller than his father; if the father is very short, his son will usually not be shorter than his father.
Otherwise, ……?!
Galton called this phenomenon “regression to the mean”.
Independent variable (explanatory variable), X
randomly changing
or fixed by the researcher
Dependent variable (response variable), Y
randomly following a linear equation
What is regression in statistics?
To find out the track of the means
[Figure: scatter plot of son’s height (cm) against father’s height (cm); both axes run from 100 to 220.]
Given the value of X, Y varies around a center ($\mu_{y|x}$).
All the centers are located on a line -- the regression line.
The relationship between the center $\mu_{y|x}$ and X is described by a linear equation:
$\mu_{y|x} = \alpha + \beta X$
Linear regression
Try to estimate $\alpha$ and $\beta$, getting
$\hat{Y} = a + bX$
Where
a -- estimate of $\alpha$, intercept
b -- estimate of $\beta$, slope
$\hat{Y}$ -- estimate of $\mu_{y|x}$
8.1.2 Regression coefficient and its calculation
To find a straight line to best fit the points.
Residual: $y - \hat{y}$
Fitness of the regression line: $\sum (y - \hat{y})^2$
Principle of least squares: To find a straight line that minimizes the sum of squared residuals.
Under such a principle, it is easy to get the formulas for a and b by calculus:
$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{l_{xy}}{l_{xx}}$ (8.3)
$a = \bar{y} - b\bar{x}$ (8.4)
Such a line must go through the point $(\bar{x}, \bar{y})$, and cross the vertical axis at $a$ ---- Why?
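One way to approach the "Why?" is to substitute (8.4) into the fitted line. The short derivation below is a sketch that uses only the symbols already defined in (8.3) and (8.4).

```latex
\hat{y} = a + bx = (\bar{y} - b\bar{x}) + bx = \bar{y} + b\,(x - \bar{x})
% Setting x = \bar{x} gives \hat{y} = \bar{y}, so the line passes through (\bar{x}, \bar{y}).
% Setting x = 0 gives \hat{y} = a, so the line crosses the vertical axis at a.
```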
Example 8.1 Calculate the regression equation of the height of son Y on the height of father X.
No.                  1    2    3    4    5    6    7    8    9    10
Father’s height, X   150  153  155  158  161  164  165  167  168  169
Son’s height, Y      159  157  163  166  169  170  169  167  169  170
No.                  11   12   13   14   15   16   17   18   19   20
Father’s height, X   170  171  172  174  175  177  178  181  183  185
Son’s height, Y      173  170  170  176  178  174  173  178  176  180
$\bar{x} = 168.8$, $\bar{y} = 170.35$, $l_{xx} = 1859.2$, $l_{xy} = 1059.4$
$b = \frac{l_{xy}}{l_{xx}} = \frac{1059.4}{1859.2} = 0.5698$
$a = 170.35 - (0.5698)(168.8) = 74.17$
$\hat{Y} = 74.17 + 0.5698X$
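The calculation in Example 8.1 can be reproduced with a few lines of plain Python. This is a minimal sketch: the function name `fit_line` and the variable names are illustrative choices, not part of the original slides, and the data are the 20 father/son pairs from the example.

```python
# Data from Example 8.1: father's height X and son's height Y (cm).
X = [150, 153, 155, 158, 161, 164, 165, 167, 168, 169,
     170, 171, 172, 174, 175, 177, 178, 181, 183, 185]
Y = [159, 157, 163, 166, 169, 170, 169, 167, 169, 170,
     173, 170, 170, 176, 178, 174, 173, 178, 176, 180]


def fit_line(x, y):
    """Least-squares intercept a and slope b, following (8.3) and (8.4).

    fit_line is a hypothetical helper name used only for this sketch.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = l_xy / l_xx        # slope, equation (8.3)
    a = y_bar - b * x_bar  # intercept, equation (8.4)
    return a, b


a, b = fit_line(X, Y)
print(f"Y_hat = {a:.2f} + {b:.4f} X")  # prints: Y_hat = 74.17 + 0.5698 X
```

Working from the raw data rather than from rounded intermediate values keeps rounding error out of the later inference steps.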
8.2 Statistical Inference on Regression
8.2.1 Hypothesis tests
8.2.1.1 The t-test for regression coefficient
b is the sample regression coefficient, changing from sample to sample
There is a population regression coefficient, denoted by $\beta$
Question: Whether $\beta = 0$ or not?
$H_0$: $\beta = 0$, $H_1$: $\beta \neq 0$, $\alpha = 0.05$
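Statistic: $t_b = \frac{b - 0}{s_b}$, $\nu = n - 2$
Standard deviation of regression coefficient: $s_b = \frac{s}{\sqrt{\sum (X - \bar{X})^2}}$
Standard deviation of residual: $s = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$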
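For Example 8.1
$H_0$: $\beta = 0$, $H_1$: $\beta \neq 0$
$s = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{94.92}{20 - 2}} = 2.2964$
$s_b = \frac{s}{\sqrt{l_{xx}}} = \frac{2.2964}{\sqrt{1859.2}} = 0.05326$
$t_b = \frac{b - 0}{s_b} = \frac{0.5698}{0.05326} = 10.70$, $\nu = 20 - 2 = 18$
p < 0.001.
Reject $H_0$ ---- the regression of the son’s height on the father’s height is statistically significant.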
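The t-test for the slope in Example 8.1 can be checked numerically. The sketch below uses only the standard library; the critical value 3.922 (two-sided, alpha = 0.001, 18 degrees of freedom) is taken from a standard t table and is an assumption of this example, as are all variable names. Because it works from unrounded intermediate values, the last digits may differ slightly from hand calculations.

```python
import math

# Data from Example 8.1 (cm).
X = [150, 153, 155, 158, 161, 164, 165, 167, 168, 169,
     170, 171, 172, 174, 175, 177, 178, 181, 183, 185]
Y = [159, 157, 163, 166, 169, 170, 169, 167, 169, 170,
     173, 170, 170, 176, 178, 174, 173, 178, 176, 180]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
l_xx = sum((xi - x_bar) ** 2 for xi in X)
l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
b = l_xy / l_xx
a = y_bar - b * x_bar

# Standard deviation of residual, with n - 2 degrees of freedom.
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(X, Y))
s = math.sqrt(ss_residual / (n - 2))

# Standard deviation (standard error) of the regression coefficient.
s_b = s / math.sqrt(l_xx)

# Test statistic for H0: beta = 0.
t_b = (b - 0) / s_b

# Assumed table value: two-sided critical t for alpha = 0.001, 18 df.
T_CRIT_0001 = 3.922
reject_h0 = abs(t_b) > T_CRIT_0001

print(f"s = {s:.4f}, s_b = {s_b:.5f}, t_b = {t_b:.2f}, df = {n - 2}")
print("Reject H0 at alpha = 0.001:", reject_h0)
```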
8.2.1.2 Analysis of variance
$H_0$: The contribution of the linear regression is 0
$H_1$: The contribution of the linear regression is not 0
(1) Before regression, we can only use $\bar{y}$ to estimate $\mu_{y|x}$
(2) After regression, we can use $\hat{Y}$ to estimate $\mu_{y|x}$
(3) The regression makes the sum of squared deviations decline:
$SS_{Regression} = SS_{Total} - SS_{Residual}$, $\nu_{Regression} = \nu_{Total} - \nu_{Residual} = 1$
(4) To test $H_0$: The contribution of regression is 0, F-statistic is used
For Example 8.1
Source       SS       DF   MS       F        P
Regression   603.63   1    603.63   114.54   < 0.01
Residual     94.92    18   5.27
Total        698.55   19
Conclusion: the regression of the son’s height on the father’s height is statistically significant.
The slight difference between these two approaches:
• The t test can be used for both one-sided and two-sided problems;
• ANOVA is for two-sided problems only. However, the idea of ANOVA can easily be extended to the cases of nonlinear regression and multiple regression.
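The analysis-of-variance decomposition for Example 8.1 can be sketched in plain Python as follows. The F statistic is computed here as the ratio of the regression mean square to the residual mean square, which is the usual definition; variable names are illustrative. For simple linear regression F equals the square of the t statistic for the slope, and the last lines check that numerically. Unrounded values may differ from a hand-computed table in the last digits.

```python
import math

# Data from Example 8.1 (cm).
X = [150, 153, 155, 158, 161, 164, 165, 167, 168, 169,
     170, 171, 172, 174, 175, 177, 178, 181, 183, 185]
Y = [159, 157, 163, 166, 169, 170, 169, 167, 169, 170,
     173, 170, 170, 176, 178, 174, 173, 178, 176, 180]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
l_xx = sum((xi - x_bar) ** 2 for xi in X)
l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
b = l_xy / l_xx
a = y_bar - b * x_bar

# Sums of squares: total, residual, and regression by subtraction.
ss_total = sum((yi - y_bar) ** 2 for yi in Y)
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(X, Y))
ss_regression = ss_total - ss_residual

# Degrees of freedom and mean squares.
df_total = n - 1
df_residual = n - 2
df_regression = df_total - df_residual  # equals 1
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

F = ms_regression / ms_residual

# Cross-check against the t statistic for the slope: F should equal t_b squared.
t_b = b / (math.sqrt(ms_residual) / math.sqrt(l_xx))

print(f"SS_regression = {ss_regression:.2f}, SS_residual = {ss_residual:.2f}, "
      f"SS_total = {ss_total:.2f}")
print(f"F = {F:.2f}, t_b^2 = {t_b ** 2:.2f}")
```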
8.2.2 Determination coefficient
For Example 8.1
$SS_{Regression} = 603.63$, $SS_{Total} = 698.55$
$R^2 = \frac{SS_{Regression}}{SS_{Total}} = \frac{603.63}{698.55} = 0.8641$
$r^2 = 0.9296^2 = 0.8641$
Determination coefficient: Contribution of regression by %
$R^2 = \frac{SS_{Regression}}{SS_{Total}}$, $0 \le R^2 \le 1$
• It reflects the percentage of the total sum of squared deviations that can be explained by the regression.
• If both X and Y are random variables, $R^2$ = square of the correlation coefficient.
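A short numerical check of the determination coefficient for Example 8.1, as a sketch with illustrative variable names. It computes $R^2$ from the sums of squares and, separately, the correlation coefficient r from $l_{xy}$, $l_{xx}$ and $l_{yy}$ (the usual product-moment formula), then confirms that the two agree.

```python
import math

# Data from Example 8.1 (cm).
X = [150, 153, 155, 158, 161, 164, 165, 167, 168, 169,
     170, 171, 172, 174, 175, 177, 178, 181, 183, 185]
Y = [159, 157, 163, 166, 169, 170, 169, 167, 169, 170,
     173, 170, 170, 176, 178, 174, 173, 178, 176, 180]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
l_xx = sum((xi - x_bar) ** 2 for xi in X)
l_yy = sum((yi - y_bar) ** 2 for yi in Y)  # this is SS_total
l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))

b = l_xy / l_xx
a = y_bar - b * x_bar

# Determination coefficient from the sums of squares.
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(X, Y))
ss_regression = l_yy - ss_residual
r_squared = ss_regression / l_yy

# Correlation coefficient from the product-moment formula.
r = l_xy / math.sqrt(l_xx * l_yy)

print(f"R^2 = {r_squared:.4f}, r = {r:.4f}, r^2 = {r ** 2:.4f}")
```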
In practice, it is suggested to report the value of the determination coefficient after an analysis of regression to describe how good the regression is.
Here is a story:
The two variables (X, Y) are an index of liver function and a score for psychological status.
The regression is statistically significant, with $r = 0.2$ and $b = 0.01$.
Claimed: “the index for liver function can be improved by psychological consultation”
Is it wrong? Why?
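A worked number helps with the question in this story. Assuming both variables are random, so that the determination coefficient equals the square of the correlation coefficient, the reported r gives:

```latex
R^2 = r^2 = 0.2^2 = 0.04
% Only about 4% of the total sum of squared deviations is explained by the regression,
% even though the regression is statistically significant.
```

Statistical significance here says only that the slope differs from zero; it does not establish that one variable can be well predicted from the other, and it does not establish causality.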
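8.3 The Application of Linear Regression
8.3.1 Two interval estimations
8.3.1.1 Confidence interval for $\mu_{y|x}$
$\hat{Y}_0 \pm t_{\alpha,\nu}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
8.3.1.2 Prediction interval for Y
$\hat{Y}_0 \pm t_{\alpha,\nu}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$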
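The two interval formulas can be tried on the Example 8.1 data. This is a sketch under stated assumptions: the father's height $x_0 = 170$ cm is a hypothetical value chosen for illustration, the critical value 2.101 (two-sided, alpha = 0.05, 18 degrees of freedom) is taken from a standard t table, and the variable names are not from the original slides.

```python
import math

# Data from Example 8.1 (cm).
X = [150, 153, 155, 158, 161, 164, 165, 167, 168, 169,
     170, 171, 172, 174, 175, 177, 178, 181, 183, 185]
Y = [159, 157, 163, 166, 169, 170, 169, 167, 169, 170,
     173, 170, 170, 176, 178, 174, 173, 178, 176, 180]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
l_xx = sum((xi - x_bar) ** 2 for xi in X)
l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
b = l_xy / l_xx
a = y_bar - b * x_bar
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(X, Y)) / (n - 2))

x0 = 170        # hypothetical father's height, chosen for illustration
T_CRIT = 2.101  # assumed table value: two-sided t, alpha = 0.05, 18 df

y0_hat = a + b * x0
leverage = 1 / n + (x0 - x_bar) ** 2 / l_xx

# Confidence interval for the mean of Y at x0.
ci_half = T_CRIT * s * math.sqrt(leverage)
ci = (y0_hat - ci_half, y0_hat + ci_half)

# Prediction interval for an individual Y at x0 (note the extra "1 +").
pi_half = T_CRIT * s * math.sqrt(1 + leverage)
pi = (y0_hat - pi_half, y0_hat + pi_half)

print(f"Y0_hat = {y0_hat:.2f}")
print(f"95% confidence interval for the mean: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% prediction interval for Y:        ({pi[0]:.2f}, {pi[1]:.2f})")
```

The prediction interval is always wider than the confidence interval, because it accounts for the variation of an individual observation around the mean as well as the uncertainty in the estimated mean.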
8.3.3 On the basic assumptions ---- LINE
(1) Linear: There exists a linear tendency between the dependent variable and the independent variable
(2) Independent: The individual observations are independent of each other
(3) Normal: Given the value of X, the corresponding Y follows a normal distribution
(4) Equal variances: The variances of Y for different values of X are all equal, denoted with $\sigma^2$.
In practice, one may use a scatter diagram to observe whether the basic assumptions are met.
The assumption of linearity is essential: using a linear model to describe a curvilinear relationship is obviously inappropriate;
The assumption of independence is also essential;
Violation of the assumptions of normal distribution and equal variance might not seriously affect the least squares estimates, though the introduced formulas for statistical inference might not be valid.
Once the assumptions (1), (3) and (4) are violated, some transformations are worth trying.
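Besides the scatter diagram of Y against X, a listing of the residuals against X is a common informal way to look for curvature or unequal spread. The sketch below only computes and prints the residuals for Example 8.1; it is not a formal test of any assumption, and the variable names are illustrative.

```python
# Data from Example 8.1 (cm).
X = [150, 153, 155, 158, 161, 164, 165, 167, 168, 169,
     170, 171, 172, 174, 175, 177, 178, 181, 183, 185]
Y = [159, 157, 163, 166, 169, 170, 169, 167, 169, 170,
     173, 170, 170, 176, 178, 174, 173, 178, 176, 180]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
l_xx = sum((xi - x_bar) ** 2 for xi in X)
l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
b = l_xy / l_xx
a = y_bar - b * x_bar

# Residuals: observed Y minus fitted Y.
residuals = [yi - (a + b * xi) for xi, yi in zip(X, Y)]

# For a least-squares line with an intercept, the residuals sum to zero
# (up to floating-point rounding).
residual_sum = sum(residuals)

# Print residuals in order of X; a trend or a fan shape would be a warning sign.
for xi, ei in zip(X, residuals):
    print(f"X = {xi:3d}   residual = {ei:+.2f}")
```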
Summary Regression and Correlation
1. Distinction and connection
Distinction:
Correlation: Both X and Y are random
Regression: Y must be random; X could be random or not random
Connection: When both X and Y are random
1) Same sign for correlation coefficient and regression coefficient
2) t tests are equivalent: $t_r = t_b$
3) Determination coefficient $= \frac{SS_{Regression}}{SS_{Total}} = r^2$
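The equivalence of the two t tests can be verified on the Example 8.1 data. The sketch below assumes the usual t statistic for a correlation coefficient, $t_r = r\sqrt{n-2}/\sqrt{1-r^2}$, which is not written out in this chapter; variable names are illustrative.

```python
import math

# Data from Example 8.1 (cm).
X = [150, 153, 155, 158, 161, 164, 165, 167, 168, 169,
     170, 171, 172, 174, 175, 177, 178, 181, 183, 185]
Y = [159, 157, 163, 166, 169, 170, 169, 167, 169, 170,
     173, 170, 170, 176, 178, 174, 173, 178, 176, 180]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
l_xx = sum((xi - x_bar) ** 2 for xi in X)
l_yy = sum((yi - y_bar) ** 2 for yi in Y)
l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))

# Regression side: slope, residual standard deviation, t statistic for b.
b = l_xy / l_xx
a = y_bar - b * x_bar
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(X, Y)) / (n - 2))
t_b = b / (s / math.sqrt(l_xx))

# Correlation side: r and its t statistic (assumed standard formula).
r = l_xy / math.sqrt(l_xx * l_yy)
t_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(f"b = {b:.4f}, r = {r:.4f}")
print(f"t_b = {t_b:.4f}, t_r = {t_r:.4f}")
```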
2. Caution -- for regression and correlation
1) Don’t put any two variables together for correlation and regression -- they must have some relation in subject matter;
2) Correlation and regression do not necessarily mean causality ---- sometimes the relation may be indirect, or there may be no real relation at all;
3) A big value of r does not necessarily mean a big regression coefficient b;
4) To reject $H_0: \rho = 0$ does not necessarily mean that the correlation is strong, only that $\rho \neq 0$;
5) That a regression equation is statistically significant does not necessarily mean that one can well predict Y by X, only that $\beta \neq 0$; whether Y is well predicted or not depends on the coefficient of determination;
6) A scatter diagram is useful before working with linear correlation and linear regression;
7) The regression equation should not be applied beyond the range of the data set.