
Page 1: Correlation and Regression

Presented by: Shubham Mehta

Regression and Correlation

Page 2: Correlation and Regression

A researcher believes that there is a linear relationship between the BMI (kg/m²) of pregnant mothers and the birth weight (BW, in kg) of their newborns.

The following data set provides information on 15 pregnant mothers who were contacted for the study:

Example

Page 3: Correlation and Regression

BMI (kg/m²)   Birth weight (kg)
20            2.7
30            2.9
50            3.4
45            3.0
10            2.2
30            3.1
40            3.3
25            2.3
50            3.5
20            2.5
10            1.5
55            3.8
60            3.7
50            3.1
35            2.8

Page 4: Correlation and Regression

A scatter diagram is a graphical method to display the relationship between two variables. It plots pairs of bivariate observations (x, y) on the X-Y plane.

Y is called the dependent variable.

X is called the independent variable.

Scatter Diagram

Page 5: Correlation and Regression

Scatter diagram of BMI and Birth weight

Page 6: Correlation and Regression

Scatter diagrams are important for initial exploration of the relationship between two quantitative variables

In the above example, we may wish to summarize this relationship by a straight line drawn through the scatter of points

Is there a linear relationship between BMI and BW?

Page 7: Correlation and Regression

Although we could fit a line "by eye" e.g. using a transparent ruler, this would be a subjective approach and therefore unsatisfactory. An objective, and therefore better, way of determining the position of a straight line is to use the method of least squares. Using this method, we choose a line such that the sum of squares of vertical distances of all points from the line is minimized.

Simple Linear Regression
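As an illustration (not part of the original slides), a least-squares line for the BMI/birth-weight example can be fitted with NumPy; the arrays below assume the Page 3 table is read as 15 (BMI, BW) pairs.

```python
import numpy as np

# Page 3 data, read as 15 (BMI, birth-weight) pairs -- an assumption about the table layout
bmi = np.array([20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35], dtype=float)
bw  = np.array([2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8])

# np.polyfit with degree 1 chooses the line that minimises the sum of
# squared vertical distances (residuals) -- the least-squares criterion
slope, intercept = np.polyfit(bmi, bw, deg=1)
fitted = intercept + slope * bmi      # points on the regression line
residuals = bw - fitted               # vertical distances from the line

print(f"BW_hat = {intercept:.3f} + {slope:.4f} * BMI")
print("sum of squared residuals:", round(np.sum(residuals**2), 3))
```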

Page 8: Correlation and Regression

These vertical distances, i.e., the distances between the observed y values and their corresponding estimated values on the line, are called residuals.

The line which fits the best is called the regression line or, sometimes, the least-squares line

The line always passes through the point defined by the mean of Y and the mean of X.

Least-squares or regression line

Page 9: Correlation and Regression

The method of least squares is available in most statistical packages (and also on some scientific calculators) and is usually referred to as linear regression.

Y is also known as the outcome variable (dependent variable).

X is also called a predictor (independent variable).

Linear Regression Model

Page 10: Correlation and Regression

Linear regression assumes that :-

1. The relationship between X and Y is linear

2. Y is distributed normally at each value of X

3. The variance of Y at every value of X is the same (homogeneity of variances)

4. The observations are independent

Assumptions

Page 11: Correlation and Regression

Estimated Regression Line

Page 12: Correlation and Regression

This equation allows you to estimate the BW of other newborns when the BMI is given. For example, for a mother who has BMI = 40 (i.e., X = 40), we predict the BW by substituting X = 40 into the estimated regression line, as in the sketch below.

Application of Regression Line
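Continuing the illustrative sketch (same assumed transcription of the Page 3 data), the prediction at BMI = 40 is the fitted line evaluated at X = 40; the numeric result depends on the fitted coefficients, which this slide does not show.

```python
import numpy as np

bmi = np.array([20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35], dtype=float)
bw  = np.array([2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8])

coeffs = np.polyfit(bmi, bw, deg=1)        # [slope, intercept]
predicted_bw = np.polyval(coeffs, 40.0)    # evaluate the fitted line at BMI = 40
print(f"Predicted birth weight at BMI = 40: {predicted_bw:.2f} kg")
```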

Page 13: Correlation and Regression

R is a measure of the strength of the linear association between two variables, x and y.

Most statistical packages and some hand calculators can calculate R.

For the data in our example, R = 0.94.

R has some unique characteristics.

Correlation Coefficient, R
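As a rough check of the quoted value (a sketch, assuming the same transcription of the Page 3 data; the exact figure depends on how the table is read), R can be computed with np.corrcoef:

```python
import numpy as np

bmi = np.array([20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35], dtype=float)
bw  = np.array([2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8])

# off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(bmi, bw)[0, 1]
print(f"R = {r:.2f}")
```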

Page 14: Correlation and Regression

Correlation

Correlation measures and describes the strength and direction of the relationship between variables.

Bivariate techniques require two variable scores from the same individuals (dependent and independent variables).

Multivariate techniques apply when more than two variables are involved (e.g., the effect of advertising and prices on sales).

Page 15: Correlation and Regression

cov(X, Y) > 0: X and Y are positively correlated

cov(X, Y) < 0: X and Y are inversely correlated

cov(X, Y) = 0: X and Y are uncorrelated (no linear relationship; zero covariance alone does not imply independence)

Interpreting Covariance
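A minimal sketch of the sign convention, using made-up numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2 * x + 1      # y moves with x     -> positive covariance
y_neg = 10 - 3 * x     # y moves against x  -> negative covariance

print(np.cov(x, y_pos)[0, 1])   # > 0
print(np.cov(x, y_neg)[0, 1])   # < 0
```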

Page 16: Correlation and Regression

Correlation coefficient

Pearson’s Correlation Coefficient is standardized covariance (unit less):

r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}

Page 17: Correlation and Regression

Measures the relative strength of the linear relationship between two variables

Unit-less

Ranges between –1 and 1

The closer to –1, the stronger the negative linear relationship

The closer to 1, the stronger the positive linear relationship

The closer to 0, the weaker any linear relationship

Correlation

Page 18: Correlation and Regression

The Difference

In correlation, the two variables are treated as equals. In regression, one variable is considered independent (=predictor) variable (X) and the other the dependent (=outcome) variable Y.

Page 19: Correlation and Regression

Remember this:

Y=mX+B?

m: slope

A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

What is “Linear”?


Page 20: Correlation and Regression

If you know something about X, this knowledge helps you predict something about Y.

(Sound familiar?…sound like conditional probabilities?)

Prediction

Page 21: Correlation and Regression

Regression equation…

Expected value of y at a given level of x: E(y | x) = α + βx

Page 22: Correlation and Regression

yi = α + β·xi + random errori (εi)

The fixed part, α + β·xi, lies exactly on the line and is the predicted value for an individual with predictor value xi.

The random error εi follows a normal distribution and is often denoted by ei.
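To make the pieces of the model concrete, here is a small simulation (the intercept, slope, and error standard deviation are made-up values, not estimates from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, beta, sigma = 1.7, 0.035, 0.3        # hypothetical intercept, slope, error SD
x = rng.uniform(15, 45, size=100)           # predictor values
fixed_part = alpha + beta * x               # exactly on the line
epsilon = rng.normal(0.0, sigma, size=100)  # random error, normally distributed
y = fixed_part + epsilon                    # observed outcome

print(np.polyfit(x, y, 1))                  # estimates should land near (beta, alpha)
```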

Page 23: Correlation and Regression

Scatter Plots of Data with Various Correlation Coefficients

[Six scatter plots illustrating different correlation coefficients: r = −1, r = −0.6, r = 0, r = +0.3, r = +1, and a non-linear pattern with r = 0.]

Page 24: Correlation and Regression

[Scatter plots contrasting linear relationships with curvilinear relationships.]

Linear Correlation

Page 25: Correlation and Regression

[Scatter plots contrasting strong linear relationships with weak linear relationships.]

Linear Correlation

Page 26: Correlation and Regression

Linear Correlation

[Scatter plots showing no relationship between X and Y.]

Page 27: Correlation and Regression

Correlation Coefficient “r”

A measure of the strength and direction of a linear relationship between two variables.

The range of r is from −1 to 1.

If r is close to 1, there is a strong positive correlation.

If r is close to −1, there is a strong negative correlation.

If r is close to 0, there is no linear correlation.

Page 28: Correlation and Regression

R takes values between −1 and +1.

R = 0 represents no linear relationship between the two variables.

R > 0 implies a direct linear relationship; R < 0 implies an inverse linear relationship.

The closer R comes to either +1 or −1, the stronger the linear relationship.

Correlation Coefficient, R

Page 29: Correlation and Regression

Though R measures how closely the two variables approximate a straight line, it does not validly measure the strength of a non-linear relationship.

When the sample size, n, is small, we also have to be careful about the reliability of the correlation.

Outliers can have a marked effect on R.

Limitations of the correlation coefficient

Page 30: Correlation and Regression

Introduction

Spearman's rank correlation coefficient, or Spearman's rho, is named after Charles Spearman.

It is denoted by the Greek letter ρ (rho) or by rs, and is a non-parametric measure of statistical dependence between two variables.

It assesses how well the relationship between two variables can be described using a monotonic function.

A monotonic (or monotone) function is one that preserves a given order.

If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.

Spearman Rho Correlation

Page 31: Correlation and Regression

Spearman Rho Correlation

A correlation coefficient is a numerical measure or index of the amount of association between two sets of scores. It ranges in size from a maximum of +1.00 through 0.00 to -1.00

The ‘+ ’ sign indicates a positive correlation (the scores on one variable increase as the scores on the other variable increase)

The ‘–’ sign indicates a negative correlation (as the scores on one variable increase, the scores on the other variable decrease)

Page 32: Correlation and Regression

Spearman Rho Correlation

Calculation

Often thought of as the Pearson correlation coefficient computed between the ranked variables

The n raw scores Xi, Yi are converted to ranks xi, yi, and the differences di = xi − yi between the ranks of each observation on the two variables are calculated

If there are no tied ranks, then ρ is given by this formula:

\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}

Page 33: Correlation and Regression

Interpretation

The sign of the Spearman correlation indicates the direction of association between X (the independent variable) and Y (the dependent variable)

If Y tends to increase when X increases, the Spearman correlation coefficient is positive

If Y tends to decrease when X increases, the Spearman correlation coefficient is negative

A Spearman correlation of zero indicates that there is no tendency for Y to either increase or decrease when X increases

Spearman Rho Correlation

Page 34: Correlation and Regression

Interpretation cont…/

An alternative name for the Spearman rank correlation is the "grade correlation", in which the "rank" of an observation is replaced by its "grade"

When X and Y are perfectly monotonically related, the Spearman correlation coefficient becomes 1

A perfect monotone increasing relationship implies that for any two pairs of data values Xi, Yi and Xj, Yj, that Xi − Xj and Yi − Yj always have the same sign

Spearman Rho Correlation
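A small sketch of this point, using a hypothetical monotone but non-linear relationship (y = x³): Spearman's rho is exactly 1 while Pearson's r is below 1.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# hypothetical monotone but non-linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x**3

print("Pearson r:   ", pearsonr(x, y)[0])    # below 1: the relationship is not linear
print("Spearman rho:", spearmanr(x, y)[0])   # exactly 1: the relationship is monotone
```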

Page 35: Correlation and Regression

Spearman Rho Correlation

Example 1: Calculate the correlation between the IQ of a person and the number of hours spent in class per week.

Find the value of the term Σd²i:

1. Sort the data by the first column (Xi). Create a new column xi and assign it the ranked values 1, 2, 3, ..., n.

2. Sort the data by the second column (Yi). Create a fourth column yi and similarly assign it the ranked values 1, 2, 3, ..., n.

3. Create a fifth column di to hold the differences between the two rank columns (xi and yi).

IQ, Xi Hours of class per week, Yi

106 7

86 0

100 27

101 50

99 28

103 29

97 20

113 12

112 6

110 17

Page 36: Correlation and Regression

Spearman Rho Correlation

Example 1 (cont.):

4. Create one final column, d²i, to hold the value of column di squared.

IQ (Xi)  Hours of class per week (Yi)  rank xi  rank yi  di  d²i

86 0 1 1 0 0

97 20 2 6 -4 16

99 28 3 8 -5 25

100 27 4 7 -3 9

101 50 5 10 -5 25

103 29 6 9 -3 9

106 7 7 3 4 16

110 17 8 5 3 9

112 6 9 2 7 49

113 12 10 4 6 36

Page 37: Correlation and Regression

Example # 1- Result

With the d²i values found, we can add them to get Σd²i = 194

The value of n is 10, so:

\rho = 1 - \frac{6 \times 194}{10(10^2 - 1)}

ρ =  −0.18

The low value shows that the correlation between IQ and hours spent in the class is very low

Spearman Rho Correlation
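A sketch verifying the worked example with the no-ties formula; scipy.stats.spearmanr should return the same value (about −0.176, which rounds to −0.18):

```python
import numpy as np
from scipy.stats import spearmanr

iq    = np.array([106, 86, 100, 101, 99, 103, 97, 113, 112, 110])
hours = np.array([7, 0, 27, 50, 28, 29, 20, 12, 6, 17])

# ranks (1 = smallest); argsort of argsort gives 0-based ranks for distinct values
rank_x = iq.argsort().argsort() + 1
rank_y = hours.argsort().argsort() + 1
d = rank_x - rank_y
n = len(iq)

rho_manual = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))
print(rho_manual)               # about -0.176, i.e. roughly -0.18
print(spearmanr(iq, hours)[0])  # SciPy gives the same value
```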

Page 38: Correlation and Regression

Outliers.....

Outliers are dangerous

Here we have a spurious correlation of r=0.68

without IBM, r=0.48

without IBM & GE, r=0.21

Page 39: Correlation and Regression

r is the correlation coefficient for the sample. The correlation coefficient for the population is ρ (rho).

The sampling distribution for r is a t-distribution with n – 2 d.f.

Standardized test statistic:

t = \frac{r}{\sqrt{\dfrac{1 - r^{2}}{n - 2}}}

For a two-tailed test of significance:

Hypothesis Test for Significance

H0: ρ = 0 (the correlation is not significant)

Ha: ρ ≠ 0 (the correlation is significant)

Page 40: Correlation and Regression

A t-distribution with 5 degrees of freedom

Test of Significance: The correlation between the number of times absent and the final grade is r = –0.975. There were seven pairs of data. Test the significance of this correlation. Use α = 0.01.

1. Write the null and alternative hypotheses.

2. State the level of significance.

3. Identify the sampling distribution.

H0: ρ = 0 (the correlation is not significant)

Ha: ρ ≠ 0 (the correlation is significant)

α = 0.01

Page 41: Correlation and Regression

[t-distribution showing the rejection regions in both tails, with critical values ±t0 = ±4.032]

4. Find the critical value.

5. Find the rejection region.

6. Find the test statistic.

Critical values of t:

df\p  0.40      0.25      0.10      0.05      0.025     0.01      0.005     0.0005
1     0.324920  1.000000  3.077684  6.313752  12.70620  31.82052  63.65674  636.6192
2     0.288675  0.816497  1.885618  2.919986  4.30265   6.96456   9.92484   31.5991
3     0.276671  0.764892  1.637744  2.353363  3.18245   4.54070   5.84091   12.9240
4     0.270722  0.740697  1.533206  2.131847  2.77645   3.74695   4.60409   8.6103
5     0.267181  0.726687  1.475884  2.015048  2.57058   3.36493   4.03214   6.8688

Page 42: Correlation and Regression

[t-distribution with the critical values –4.032 and +4.032 marked]

t = –9.811 falls in the rejection region. Reject the null hypothesis.

There is a significant negative correlation between the number of times absent and final grades.

7. Make your decision.

8. Interpret your decision.
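The arithmetic in steps 4–7 can be reproduced in a few lines (a sketch; the critical value is taken from SciPy rather than the printed table):

```python
import numpy as np
from scipy.stats import t as t_dist

r, n, alpha = -0.975, 7, 0.01
df = n - 2

t_stat = r * np.sqrt(df / (1 - r**2))    # standardized test statistic, about -9.81
t_crit = t_dist.ppf(1 - alpha / 2, df)   # two-tailed critical value, about 4.032

print(t_stat, t_crit)
print("reject H0:", abs(t_stat) > t_crit)   # True -> the correlation is significant
```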

Page 43: Correlation and Regression

The equation of a line may be written as y = mx + b, where m is the slope of the line and b is the y-intercept. The line of regression is ŷ = mx + b.

The slope m is:

m = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^{2} - (\sum x)^{2}}

The y-intercept is:

b = \bar{y} - m\bar{x}

Regression indicates the degree to which the variation in one variable, X, is related to or can be explained by the variation in another variable, Y.

Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables. This equation is called the line of regression or least-squares line.

The Line of Regression

Page 44: Correlation and Regression

[Scatter plot of advertising spend ($) versus revenue with the fitted line: each data point (xi, yi) has a residual, the vertical distance between the point and the point on the line with the same x-value.]

Best fitting straight line

Page 45: Correlation and Regression

Calculating manually

Page 46: Correlation and Regression

Simpler calculation formula…

\hat{r} = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}

where SS_{xy} is the numerator of the covariance and SS_{xx}, SS_{yy} are the numerators of the variances.
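A sketch applying this formula to the BMI/birth-weight example (same assumed transcription of the Page 3 data):

```python
import numpy as np

bmi = np.array([20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35], dtype=float)
bw  = np.array([2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8])

ss_xy = np.sum((bmi - bmi.mean()) * (bw - bw.mean()))  # numerator of the covariance
ss_xx = np.sum((bmi - bmi.mean()) ** 2)                # numerator of var(x)
ss_yy = np.sum((bw - bw.mean()) ** 2)                  # numerator of var(y)

r_hat = ss_xy / np.sqrt(ss_xx * ss_yy)
print(r_hat)   # should match np.corrcoef(bmi, bw)[0, 1]
```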

Page 47: Correlation and Regression

*Note: like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself, so we substitute in the estimated r.

Distribution of the correlation coefficient:

SE(\hat{r}) = \sqrt{\frac{1 - \hat{r}^{2}}{n - 2}}

The sample correlation coefficient follows a T-distribution with n-2 degrees of freedom (since you have to estimate the standard error).

Page 48: Correlation and Regression

R² is another important measure of linear association between x and y (0 ≤ R² ≤ 1).

R² measures the proportion of the total variation in y which is explained by x.

For example, r² = 0.8751 indicates that 87.51% of the variation in BW is explained by the independent variable x (BMI).

Coefficient of Determination
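A sketch of R² computed as explained variation over total variation (exact figures depend on the data transcription):

```python
import numpy as np

bmi = np.array([20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35], dtype=float)
bw  = np.array([2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8])

slope, intercept = np.polyfit(bmi, bw, 1)
fitted = intercept + slope * bmi

ss_res = np.sum((bw - fitted) ** 2)      # unexplained variation
ss_tot = np.sum((bw - bw.mean()) ** 2)   # total variation in y
r_squared = 1 - ss_res / ss_tot

print(r_squared)   # equals np.corrcoef(bmi, bw)[0, 1] ** 2 for a straight-line fit
```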

Page 49: Correlation and Regression

The correlation coefficient of the number of times absent and the final grade is r = –0.975. The coefficient of determination is r² = (–0.975)² = 0.9506.

Interpretation: About 95% of the variation in final grades can be explained by the number of times a student is absent. The other 5% is unexplained and can be due to sampling error or other variables such as intelligence, amount of time studied, etc.

Strength of the Association

The coefficient of determination, r2, measures the strength of the association and is the ratio of explained variation in y to the total variation in y.

Page 50: Correlation and Regression

The standard error of Y given X is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.

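A sketch of Sy/x for the example data, using one common estimate: the square root of the residual sum of squares divided by n − 2 (again assuming the Page 3 transcription):

```python
import numpy as np

bmi = np.array([20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35], dtype=float)
bw  = np.array([2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8])

slope, intercept = np.polyfit(bmi, bw, 1)
residuals = bw - (intercept + slope * bmi)

n = len(bw)
s_yx = np.sqrt(np.sum(residuals ** 2) / (n - 2))   # standard error of Y given X
print(s_yx)
```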

Page 51: Correlation and Regression

Correlation Coefficient, R, measures the strength of bivariate association

The regression line is a prediction equation that estimates the values of y for any given x

Difference between Correlation and Regression

Page 52: Correlation and Regression

Thank You