Presented by: Shubham Mehta
Regression and Correlation
A researcher believes that there is a linear relationship between the BMI (kg/m²) of pregnant mothers and the birth-weight (BW, in kg) of their newborns.
The following data set provides information on 15 pregnant mothers who were contacted for a study:
Example
BMI (kg/m²)   Birth-weight (kg)
20            2.7
30            2.9
50            3.4
45            3.0
10            2.2
30            3.1
40            3.3
25            2.3
50            3.5
20            2.5
10            1.5
55            3.8
60            3.7
50            3.1
35            2.8
A scatter diagram is a graphical method to display the relationship between two variables. It plots pairs of bivariate observations (x, y) on the X-Y plane.
Y is called the dependent variable.
X is called the independent variable.
Scatter Diagram
Scatter diagram of BMI and Birth weight
Scatter diagrams are important for initial exploration of the relationship between two quantitative variables
In the above example, we may wish to summarize this relationship by a straight line drawn through the scatter of points
Is there a linear relationship between BMI and BW?
Although we could fit a line "by eye" e.g. using a transparent ruler, this would be a subjective approach and therefore unsatisfactory. An objective, and therefore better, way of determining the position of a straight line is to use the method of least squares. Using this method, we choose a line such that the sum of squares of vertical distances of all points from the line is minimized.
Simple Linear Regression
These vertical distances, i.e., the distances between the observed y values and their corresponding estimated values on the line, are called residuals.
The line which fits the best is called the regression line or, sometimes, the least-squares line
The line always passes through the point defined by the mean of Y and the mean of X.
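The least-squares idea can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and the BMI and birth-weight pairs are the 15 values as read from the flattened data table, so the fitted numbers may not match figures quoted elsewhere in these slides.

```python
from statistics import mean

# BMI (kg/m^2) and birth-weight (kg) pairs as read from the example table.
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]


def least_squares(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_sq_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


a, b = least_squares(bmi, bw)
print(f"fitted line: BW = {a:.4f} + {b:.4f} * BMI")

# The fitted line passes through (mean of X, mean of Y).
print(abs(a + b * mean(bmi) - mean(bw)) < 1e-9)

# Estimated BW for a mother with BMI = 40, using this fit.
bw_at_40 = a + b * 40
print(f"estimated BW at BMI 40: {bw_at_40:.2f} kg")
```

Nudging the slope or intercept away from the fitted values and calling `sum_sq_residuals` again gives a larger sum, which is what "best fit" means here.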
Least-squares or regression line
The method of least squares is available in most statistical packages (and also on some scientific calculators) and is usually referred to as linear regression.
Y is also known as the outcome variable (dependent variable).
X is also called the predictor (independent variable).
Linear Regression Model
Linear regression assumes that:
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same (homogeneity of variances)
4. The observations are independent
Assumptions
Estimated Regression Line
This equation allows you to estimate BW of other newborns when the BMI is given. e.g., for a mother who has BMI=40, i.e. X = 40 we predict BW to be
Application of Regression Line
R is a measure of the strength of the linear association between two variables, x and y. Most statistical packages and some hand calculators can calculate R.
For the data in our example, R = 0.94.
R has some unique characteristics.
Correlation Coefficient, R
Correlation
Correlation measures and describes the strength and direction of a relationship.
Bivariate techniques require two variable scores from the same individuals (dependent and independent variables).
Multivariate techniques are used when more than two variables are involved (e.g. the effect of advertising and prices on sales).
cov(X,Y) > 0 X and Y are positively correlated
cov(X,Y) < 0 X and Y are inversely correlated
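cov(X,Y) = 0 X and Y are uncorrelated (independent variables have zero covariance, but zero covariance alone does not guarantee independence)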
Interpreting Covariance
Correlation coefficient
Pearson’s Correlation Coefficient is standardized covariance (unit-less):
$$r = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}}$$
Measures the relative strength of the linear relationship between two variables
Unit-less
Ranges between –1 and 1
The closer to –1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker any linear relationship
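A small sketch of why standardizing the covariance makes r unit-less. The height and weight values are made up for illustration, the helper names are hypothetical, and the n - 1 denominator is an assumption (it cancels in r either way).

```python
from math import sqrt
from statistics import mean


def covariance(x, y):
    """Sample covariance with an n - 1 denominator (an assumption of this sketch)."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)


def pearson_r(x, y):
    """Covariance divided by the square root of the product of the variances."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Made-up data: the same heights expressed in metres and in centimetres.
height_m = [1.50, 1.60, 1.70, 1.80, 1.90]
height_cm = [h * 100 for h in height_m]
weight_kg = [52, 58, 63, 72, 80]

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
r_m = pearson_r(height_m, weight_kg)
r_cm = pearson_r(height_cm, weight_kg)

print(cov_m, cov_cm)  # the covariance changes with the unit of measurement
print(r_m, r_cm)      # r stays the same up to floating-point rounding
```

Changing the unit multiplies the covariance by 100 but leaves r untouched, which is why r, rather than the raw covariance, is used to compare strengths of association.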
Correlation
The Difference
In correlation, the two variables are treated as equals. In regression, one variable is considered independent (=predictor) variable (X) and the other the dependent (=outcome) variable Y.
Remember this:
Y = mX + B?
m: slope
B: intercept
A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.
What is “Linear”?
If you know something about X, this knowledge helps you predict something about Y.
(Sound familiar?…sound like conditional probabilities?)
Prediction
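Regression equation…
Expected value of y at a given level of x:
$$E(y_i \mid x_i) = \alpha + \beta x_i$$
Predicted value for an individual:
$$y_i = \alpha + \beta x_i + \text{random error}_i$$
Here α is the intercept and β is the slope. The term α + βxi is fixed (exactly on the line); the random error follows a normal distribution.
Random error is often denoted by ei.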
Scatter Plots of Data with Various Correlation Coefficients
[Figure: scatter plots of Y against X for r = -1, r = -.6, r = 0, r = +.3, and r = +1, plus a further panel with r = 0]
[Figure: scatter plots of Y against X contrasting linear relationships with curvilinear relationships]
Linear Correlation
[Figure: scatter plots of Y against X contrasting strong relationships with weak relationships]
Linear Correlation
[Figure: scatter plots of Y against X showing no relationship]
Correlation Coefficient “r”
A measure of the strength and direction of a linear relationship between two variables
The range of r is from –1 to 1.
If r is close to 1 there is a strong positive correlation.
If r is close to –1 there is a strong negative correlation.
If r is close to 0 there is no linear correlation.
R takes values between -1 and +1.
R = 0 represents no linear relationship between the two variables.
R > 0 implies a direct linear relationship.
R < 0 implies an inverse linear relationship.
The closer R comes to either +1 or -1, the stronger the linear relationship.
Correlation Coefficient, R
Though R measures how closely the two variables approximate a straight line, it does not validly measure the strength of a non-linear relationship.
When the sample size, n, is small, we also have to be careful about the reliability of the correlation.
Outliers can have a marked effect on R.
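Both limitations can be seen on tiny made-up data sets. The sketch below is illustrative only: `pearson_r` is a hypothetical helper and none of the numbers come from the slides.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation computed from sums of squares about the means."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


# 1. A perfect but non-linear relationship: y is exactly x squared.
x = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [xi ** 2 for xi in x]
r_curve = pearson_r(x, y_curve)
print(r_curve)  # 0.0, although y is fully determined by x

# 2. A small sample with no upward trend, then the same sample plus one outlier.
x_small = [1, 2, 3, 4, 5]
y_small = [3, 1, 2, 3, 1]
r_without = pearson_r(x_small, y_small)
r_with = pearson_r(x_small + [20], y_small + [20])
print(round(r_without, 2), round(r_with, 2))
```

In the second case a single extreme point turns a weak negative r into a value close to +1, which is why a scatter diagram should be inspected before R is trusted.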
Limitations of the correlation coefficient
Introduction
Spearman's rank correlation coefficient or Spearman's rho is named after Charles Spearman
Denoted by the Greek letter ρ (rho) or by rs; it is a non-parametric measure of statistical dependence between two variables
Assesses how well the relationship between two variables can be described using a monotonic function
A monotonic (or monotone) function, in mathematics, is one that preserves the given order.
If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other
Spearman Rho Correlation
A correlation coefficient is a numerical measure or index of the amount of association between two sets of scores. It ranges in size from a maximum of +1.00 through 0.00 to -1.00
The ‘+ ’ sign indicates a positive correlation (the scores on one variable increase as the scores on the other variable increase)
The ‘-’ sign indicates a negative correlation (as the scores on one variable increase, the scores on the other variable decrease)
Calculation
Often thought of as the Pearson correlation coefficient between the ranked variables (the values of each variable replaced by their ranks)
The n raw scores Xi, Yi are converted to ranks xi, yi, and the differences di = xi − yi between the ranks of each observation on the two variables are calculated
If there are no tied ranks, then ρ is given by this formula:
$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
Interpretation
The sign of the Spearman correlation indicates the direction of association between X (the independent variable) and Y (the dependent variable)
If Y tends to increase when X increases, the Spearman correlation coefficient is positive
If Y tends to decrease when X increases, the Spearman correlation coefficient is negative
A Spearman correlation of zero indicates that there is no tendency for Y to either increase or decrease when X increases
Interpretation (cont.)
An alternative name for the Spearman rank correlation is the "grade correlation", in which the "rank" of an observation is replaced by the "grade"
When X and Y are perfectly monotonically related, the Spearman correlation coefficient becomes +1 (increasing relationship) or −1 (decreasing relationship)
A perfect monotone increasing relationship implies that for any two pairs of data values Xi, Yi and Xj, Yj, that Xi − Xj and Yi − Yj always have the same sign
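To illustrate, the sketch below treats Spearman's coefficient as the Pearson correlation of the ranks and applies it to a made-up monotone but non-linear relationship (y = x³). The helper names are hypothetical, and `ranks` assumes there are no repeated values.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation computed from sums of squares about the means."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


def ranks(values):
    """Rank 1 is the smallest value. Assumes no repeated values (no ties)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation between the two rank lists."""
    return pearson_r(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6]
y_increasing = [xi ** 3 for xi in x]       # monotone increasing, not linear
y_decreasing = [-v for v in y_increasing]  # monotone decreasing

print(spearman_rho(x, y_increasing))  # perfect monotone increase
print(spearman_rho(x, y_decreasing))  # perfect monotone decrease
print(pearson_r(x, y_increasing))     # below 1: the points are not on a line
```

Because only the ordering matters, any strictly increasing transformation of Y leaves Spearman's coefficient unchanged, while Pearson's r generally changes.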
Example 1: Calculate the correlation between the IQ of a person and the number of hours spent in class per week.
Find the value of the term d²i:
1. Sort the data by the first column (Xi). Create a new column xi and assign it the ranked values 1, 2, 3, ..., n.
2. Sort the data by the second column (Yi). Create a fourth column yi and similarly assign it the ranked values 1, 2, 3, ..., n.
3. Create a fifth column di to hold the differences between the two rank columns (xi and yi).
IQ, Xi Hours of class per week, Yi
106 7
86 0
100 27
101 50
99 28
103 29
97 20
113 12
112 6
110 17
Example 1 (cont.)
4. Create one final column to hold the value of column di squared.
IQ (Xi)   Hours of class per week (Yi)   rank xi   rank yi   di   d²i
86        0                              1         1         0    0
97        20                             2         6         -4   16
99        28                             3         8         -5   25
100       27                             4         7         -3   9
101       50                             5         10        -5   25
103       29                             6         9         -3   9
106       7                              7         3         4    16
110       17                             8         5         3    9
112       6                              9         2         7    49
113       12                             10        4         6    36
Example # 1- Result
With the d²i values found, we can add them to find Σd²i = 194
The value of n is 10, so:
$$\rho = 1 - \frac{6 \times 194}{10(10^2 - 1)}$$
ρ = −0.18
The low value shows that the correlation between IQ and hours spent in the class is very low
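The steps of Example 1 can be reproduced with a short script. This is a sketch using the IQ and hours data from the example; the `ranks` helper is hypothetical and assumes no tied values, which holds for this data.

```python
# IQ and hours of class per week, in the order given in the example.
iq = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
hours = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]


def ranks(values):
    """Rank 1 is the smallest value. Assumes no repeated values (no ties)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


rank_x = ranks(iq)
rank_y = ranks(hours)

# Squared differences between the two ranks of each observation.
d_squared = [(rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y)]

n = len(iq)
sum_d2 = sum(d_squared)
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d2, round(rho, 2))  # 194 -0.18
```

Sorting is only needed to assign the ranks; the differences are taken per observation, so the result does not depend on the order in which the rows are listed.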
Outliers.....
Outliers are dangerous
Here we have a spurious correlation of r=0.68
without IBM, r=0.48
without IBM & GE, r=0.21
r is the correlation coefficient for the sample. The correlation coefficient for the population is ρ (rho).
For a two-tailed test for significance:
H0: ρ = 0 (The correlation is not significant)
Ha: ρ ≠ 0 (The correlation is significant)
Under H0, the standardized test statistic follows a t-distribution with n – 2 d.f.
Standardized test statistic:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
Hypothesis Test for Significance
Test of Significance
The correlation between the number of times absent and a final grade is r = –0.975. There were seven pairs of data. Test the significance of this correlation. Use α = 0.01.
1. Write the null and alternative hypotheses.
H0: ρ = 0 (The correlation is not significant)
Ha: ρ ≠ 0 (The correlation is significant)
2. State the level of significance.
α = 0.01
3. Identify the sampling distribution.
A t-distribution with 5 degrees of freedom
4. Find the critical value.
Critical values ± t0 = ±4.032
5. Find the rejection region.
Rejection regions: t < –4.032 or t > 4.032
6. Find the test statistic.
df\p   0.40       0.25       0.10       0.05       0.025      0.01       0.005      0.0005
1      0.324920   1.000000   3.077684   6.313752   12.70620   31.82052   63.65674   636.6192
2      0.288675   0.816497   1.885618   2.919986   4.30265    6.96456    9.92484    31.5991
3      0.276671   0.764892   1.637744   2.353363   3.18245    4.54070    5.84091    12.9240
4      0.270722   0.740697   1.533206   2.131847   2.77645    3.74695    4.60409    8.6103
5      0.267181   0.726687   1.475884   2.015048   2.57058    3.36493    4.03214    6.8688
[Figure: t-distribution with rejection regions below –4.032 and above +4.032]
7. Make your decision.
t = –9.811 falls in the rejection region. Reject the null hypothesis.
8. Interpret your decision.
There is a significant negative correlation between the number of times absent and final grades.
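The decision in this example can be checked numerically with the usual statistic t = r / sqrt((1 - r²) / (n - 2)). The sketch below is illustrative: `t_statistic` is a hypothetical helper, and the critical value 4.032 is copied from the t table (5 degrees of freedom, column 0.005) rather than computed.

```python
from math import sqrt


def t_statistic(r, n):
    """Standardized test statistic for a sample correlation r from n pairs."""
    return r / sqrt((1 - r ** 2) / (n - 2))


r = -0.975   # correlation between times absent and final grade
n = 7        # number of data pairs
t_crit = 4.032  # two-tailed critical value for alpha = 0.01, d.f. = 5 (from the table)

t = t_statistic(r, n)
reject_h0 = abs(t) > t_crit

print(f"t = {t:.2f}, reject H0: {reject_h0}")
```

The computed t is about -9.81, far beyond -4.032, so the null hypothesis of no correlation is rejected at the 0.01 level.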
Regression indicates the degree to which the variation in one variable, Y, is related to or can be explained by the variation in another variable, X.
Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables. This equation is called the line of regression or least-squares line.
The equation of a line may be written as y = mx + b, where m is the slope of the line and b is the y-intercept.
The line of regression is:
$$\hat{y} = mx + b$$
The slope m is:
$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$
The y-intercept is:
$$b = \bar{y} - m\bar{x}$$
The Line of Regression
[Figure: scatter plot of revenue against Ad $ with the fitted line, marking a data point (xi, yi), the point on the line with the same x-value, and the residual between them]
Best fitting straight line
Calculating manually
Simpler calculation formula…
$$\hat{r} = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}}$$
SSxy is the numerator of the covariance.
SSx and SSy are the numerators of the variances.
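One way to see why the simpler formula agrees with the standardized-covariance definition of r is to write out the shared denominators. The n − 1 denominator is an assumption here, since the slides do not state which one is used; any common denominator cancels in the same way.

```latex
\begin{aligned}
SS_{xy} &= \sum_i (x_i - \bar{x})(y_i - \bar{y}), \qquad
SS_x = \sum_i (x_i - \bar{x})^2, \qquad
SS_y = \sum_i (y_i - \bar{y})^2 \\[4pt]
r &= \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}}
   = \frac{SS_{xy} / (n - 1)}{\sqrt{\dfrac{SS_x}{n - 1} \cdot \dfrac{SS_y}{n - 1}}}
   = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}}
\end{aligned}
```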
Distribution of the correlation coefficient:
$$SE(\hat{r}) = \sqrt{\frac{1 - r^2}{n - 2}}$$
*Note: like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself, so substitute in the estimated r.
The sample correlation coefficient, divided by its standard error, follows a t-distribution with n-2 degrees of freedom under the null hypothesis of no correlation (since you have to estimate the standard error).
R² is another important measure of linear association between x and y (0 ≤ R² ≤ 1)
R² measures the proportion of the total variation in y which is explained by x
For example, R² = 0.8751 indicates that 87.51% of the variation in BW is explained by the independent variable x (BMI).
Coefficient of Determination
The correlation coefficient of number of times absent and final grade is r = –0.975. The coefficient of determination is r² = (–0.975)² = 0.9506.
Interpretation: About 95% of the variation in final grades can be explained by the number of times a student is absent. The other 5% is unexplained and can be due to sampling error or other variables such as intelligence, amount of time studied, etc.
Strength of the Association
The coefficient of determination, r², measures the strength of the association and is the ratio of explained variation in y to the total variation in y.
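The "explained over total variation" reading can be checked against the squared correlation on a small example. The data below are made up for illustration and the variable names are hypothetical; the fit is an ordinary least-squares line.

```python
from math import sqrt
from statistics import mean

# Made-up illustrative data (not from the slides).
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]

x_bar, y_bar = mean(x), mean(y)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)  # total variation in y

# Least-squares line and its fitted values.
slope = ss_xy / ss_x
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]

# Split the total variation into explained and unexplained parts.
explained = sum((fi - y_bar) ** 2 for fi in fitted)
unexplained = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

r = ss_xy / sqrt(ss_x * ss_y)
r_squared = explained / ss_y

print(round(r_squared, 4), round(r ** 2, 4))  # the two values agree
```

The explained and unexplained parts add up to the total variation, so 1 - r² is the share of variation in y left unexplained by the line.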
The standard error of Y given X is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.
[Figure: regression line with Sy/x marked as the spread around the line at several values of X]
Correlation Coefficient, R, measures the strength of bivariate association
The regression line is a prediction equation that estimates the values of y for any given x
Difference between Correlation and Regression
Thank You