Correlation and Regression

Presented by: Shubham Mehta

Regression and Correlation

A researcher believes that there is a linear relationship between the BMI (kg/m²) of pregnant mothers and the birth weight (BW, in kg) of their newborns.

The following data set provides information on 15 pregnant mothers who were contacted for a study:

Example

BMI (kg/m²)    Birth weight (kg)
20             2.7
30             2.9
50             3.4
45             3.0
10             2.2
30             3.1
40             3.3
25             2.3
50             3.5
20             2.5
10             1.5
55             3.8
60             3.7
50             3.1
35             2.8

A scatter diagram is a graphical method to display the relationship between two variables. It plots pairs of bivariate observations (x, y) on the X-Y plane.

Y is called the dependent variable.

X is called the independent variable.

Scatter Diagram

Scatter diagram of BMI and Birth weight

Scatter diagrams are important for initial exploration of the relationship between two quantitative variables

In the above example, we may wish to summarize this relationship by a straight line drawn through the scatter of points

Is there a linear relationship between BMI and BW?

Although we could fit a line "by eye" e.g. using a transparent ruler, this would be a subjective approach and therefore unsatisfactory. An objective, and therefore better, way of determining the position of a straight line is to use the method of least squares. Using this method, we choose a line such that the sum of squares of vertical distances of all points from the line is minimized.

Simple Linear Regression

These vertical distances, i.e., the distances between the observed y values and their corresponding estimated values on the line, are called residuals.

The line which fits the best is called the regression line or, sometimes, the least-squares line

The line always passes through the point defined by the mean of Y and the mean of X.
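As a minimal sketch of the least-squares idea described above, the Python snippet below fits a line to the 15 BMI and birth-weight pairs. The values are transcribed here from the example table, so treat them as an assumption. It then checks two properties: the fitted line passes through the point of means, and its sum of squared residuals is smaller than that of an arbitrary line chosen "by eye". The function names and the by-eye coefficients are illustrative and not part of the original slides.

```python
# BMI (kg/m^2) and birth weight (kg), transcribed from the example table (assumed values).
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]


def fit_least_squares(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


intercept, slope = fit_least_squares(bmi, bw)
mean_bmi = sum(bmi) / len(bmi)
mean_bw = sum(bw) / len(bw)

# The fitted line evaluated at the mean BMI returns the mean birth weight.
value_at_mean = intercept + slope * mean_bmi

# A hypothetical line drawn "by eye", for comparison only.
sse_fit = sum_squared_residuals(bmi, bw, intercept, slope)
sse_eye = sum_squared_residuals(bmi, bw, 1.5, 0.04)

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print(f"value at mean BMI = {value_at_mean:.3f}, mean BW = {mean_bw:.3f}")
print(f"SSE (least squares) = {sse_fit:.3f}, SSE (by eye) = {sse_eye:.3f}")
```

Because the table values are rounded, the fitted coefficients need not reproduce figures quoted elsewhere in the slides exactly.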

Least-squares or regression line

The method of least squares is available in most statistical packages (and also on some scientific calculators) and is usually referred to as linear regression.

Y is also known as the outcome variable (dependent variable).

X is also called the predictor (independent variable).

Linear Regression Model

Linear regression assumes that:

1. The relationship between X and Y is linear

2. Y is distributed normally at each value of X

3. The variance of Y at every value of X is the same (homogeneity of variances)

4. The observations are independent

Assumptions

Estimated Regression Line

This equation allows you to estimate BW of other newborns when the BMI is given. e.g., for a mother who has BMI=40, i.e. X = 40 we predict BW to be

Application of Regression Line

R is a measure of the strength of the linear association between two variables, x and y.

Most statistical packages and some hand calculators can calculate R.

For the data in our example, R = 0.94.

R has some unique characteristics.

Correlation Coefficient, R

Correlation

Correlation measures and describes the strength and direction of the relationship.

Bivariate techniques require two variable scores from the same individuals (dependent and independent variables).

Multivariate techniques apply when more than two variables are involved (e.g., the effect of advertising and prices on sales).

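cov(X,Y) > 0: X and Y are positively correlated

cov(X,Y) < 0: X and Y are inversely correlated

cov(X,Y) = 0: X and Y are uncorrelated (independent variables have zero covariance, but zero covariance alone does not prove independence)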

Interpreting Covariance

Correlation coefficient

Pearson's correlation coefficient is standardized covariance (unitless):

$$ r = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}} $$

Measures the relative strength of the linear relationship between two variables

Unit-less

Ranges between –1 and 1

The closer to –1, the stronger the negative linear relationship

The closer to 1, the stronger the positive linear relationship

The closer to 0, the weaker any linear relationship
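The idea of r as a standardized, unitless covariance can be sketched in Python. The data below are hypothetical and the helper names are illustrative. The snippet computes the sample covariance and Pearson's r, then rescales x to show that the covariance changes with the units while r does not.

```python
import math

# Hypothetical paired observations, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def covariance(a, b):
    """Sample covariance (divides by n - 1)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def pearson_r(a, b):
    """Pearson's r: covariance divided by the square root of the product of variances."""
    return covariance(a, b) / math.sqrt(covariance(a, a) * covariance(b, b))


cov_xy = covariance(x, y)
r = pearson_r(x, y)

# Rescaling x (for example, changing its units) changes the covariance but not r.
x_scaled = [100 * xi for xi in x]
cov_scaled = covariance(x_scaled, y)
r_scaled = pearson_r(x_scaled, y)

print(f"cov = {cov_xy:.3f}, r = {r:.4f}")
print(f"cov (x * 100) = {cov_scaled:.3f}, r (x * 100) = {r_scaled:.4f}")
```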

Correlation

The Difference

In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (predictor) variable X and the other the dependent (outcome) variable Y.

Remember this:

Y=mX+B?

m: slope

A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

What is “Linear”?

[Figure: the line Y = mX + B, with the intercept B and the slope m marked]

If you know something about X, this knowledge helps you predict something about Y.

(Sound familiar?…sound like conditional probabilities?)

Prediction

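Regression equation…

Expected value of y at a given level of x:

$$ E(y_i \mid x_i) = \alpha + \beta x_i $$

Predicted value for an individual:

$$ y_i = \alpha + \beta x_i + \text{random error}_i $$

The part α + βxi is fixed (exactly on the line); the random error follows a normal distribution.

Random error is often denoted by ei.

Here α plays the role of the intercept B and β the role of the slope m.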

Scatter Plots of Data with Various Correlation Coefficients

[Figure: six scatter plots of Y against X, with r = –1, r = –0.6, r = 0, r = +0.3, r = +1, and r = 0]

Linear Correlation

[Figure: scatter plots of Y against X. Panels: linear relationships versus curvilinear relationships; strong relationships versus weak relationships; no relationship]

Correlation Coefficient “r”

A measure of the strength and direction of a linear relationship between two variables

The range of r is from –1 to 1.

If r is close to 1 there is a strong positive correlation.

If r is close to –1 there is a strong negative correlation.

If r is close to 0 there is no linear correlation.

[Figure: number line for r running from –1 through 0 to 1]

R takes values between –1 and +1.

R = 0 represents no linear relationship between the two variables.

R > 0 implies a direct linear relationship.

R < 0 implies an inverse linear relationship.

The closer R comes to either +1 or –1, the stronger the linear relationship.

Correlation Coefficient, R

Though R measures how closely the two variables approximate a straight line, it does not validly measure the strength of a non-linear relationship.

When the sample size, n, is small, we also have to be careful with the reliability of the correlation.

Outliers could have a marked effect on R.

Limitations of the correlation coefficient

Introduction

Spearman's rank correlation coefficient or Spearman's rho is named after Charles Spearman

Denoted by the Greek letter ρ (rho) or by rs; it is a non-parametric measure of statistical dependence between two variables

Assesses how well the relationship between two variables can be described using a monotonic function

A monotonic (or monotone) function, in mathematics, is a function that preserves the given order.

If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other
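A small sketch may help show what "perfect monotone function" means in practice. The data are hypothetical (y is the cube of x) and the helper names are illustrative. The relationship is monotone but not linear, so Spearman's coefficient, computed here as Pearson's r on the ranks, reaches 1 while Pearson's r on the raw values stays below 1.

```python
import math


def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no repeated values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def pearson_r(a, b):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    s_ab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    s_aa = sum((ai - mean_a) ** 2 for ai in a)
    s_bb = sum((bi - mean_b) ** 2 for bi in b)
    return s_ab / math.sqrt(s_aa * s_bb)


# Hypothetical data: y is a monotone increasing but non-linear function of x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [xi ** 3 for xi in x]

pearson = pearson_r(x, y)
spearman = pearson_r(ranks(x), ranks(y))  # Pearson's r computed on the ranks

print(f"Pearson r = {pearson:.3f}")
print(f"Spearman rho = {spearman:.3f}")
```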

Spearman Rho Correlation

http://en.wikipedia.org/wiki/Rho_(letter)


A correlation coefficient is a numerical measure or index of the amount of association between two sets of scores. It ranges in size from a maximum of +1.00 through 0.00 to -1.00

The '+' sign indicates a positive correlation (the scores on one variable increase as the scores on the other variable increase)

The '-' sign indicates a negative correlation (as the scores on one variable increase, the scores on the other variable decrease)


Calculation

Often thought of as the Pearson correlation coefficient (a measure of the relationship between two items) computed on the ranked variables

The n raw scores Xi, Yi are converted to ranks xi, yi, and the differences di = xi − yi between the ranks of each observation on the two variables are calculated

If there are no tied ranks, then ρ is given by this formula:

$$ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$

Interpretation

The sign of the Spearman correlation indicates the direction of association between X (the independent variable) and Y (the dependent variable)

If Y tends to increase when X increases, the Spearman correlation coefficient is positive

If Y tends to decrease when X increases, the Spearman correlation coefficient is negative

A Spearman correlation of zero indicates that there is no tendency for Y to either increase or decrease when X increases


Interpretation cont…/

An alternative name for the Spearman rank correlation is the "grade correlation", in which the "rank" of an observation is replaced by the "grade"

When X and Y are perfectly monotonically related, the Spearman correlation coefficient becomes +1 (increasing relationship) or −1 (decreasing relationship)

A perfect monotone increasing relationship implies that for any two pairs of data values Xi, Yi and Xj, Yj, that Xi − Xj and Yi − Yj always have the same sign



Example 1

Calculate the correlation between the IQ of a person and the number of hours spent in class per week.

Find the value of the term d²i:

1. Sort the data by the first column (Xi). Create a new column xi and assign it the ranked values 1, 2, 3, ..., n.

2. Sort the data by the second column (Yi). Create a fourth column yi and similarly assign it the ranked values 1, 2, 3, ..., n.

3. Create a fifth column di to hold the differences between the two rank columns (xi and yi).

IQ (Xi)    Hours of class per week (Yi)
106        7
86         0
100        27
101        50
99         28
103        29
97         20
113        12
112        6
110        17


Example # 1 cont…/

4. Create one final column to hold the value of column di squared.

IQ (Xi)    Hours of class per week (Yi)    rank xi    rank yi    di    d²i
86         0                               1          1          0     0
97         20                              2          6          -4    16
99         28                              3          8          -5    25
100        27                              4          7          -3    9
101        50                              5          10         -5    25
103        29                              6          9          -3    9
106        7                               7          3          4     16
110        17                              8          5          3     9
112        6                               9          2          7     49
113        12                              10         4          6     36

Example # 1- Result

With the d²i values found, we can add them to find Σd²i = 194

The value of n is 10, so:

$$ \rho = 1 - \frac{6 \times 194}{10(10^2 - 1)} $$

ρ = −0.18

The value close to zero shows that the correlation between IQ and hours spent in class is very weak
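The steps of Example 1 can be sketched in Python as a check on the hand calculation. The data are the ten IQ and hours pairs from the example; the `ranks` helper is illustrative and only valid when there are no tied values, which holds here.

```python
# IQ and hours of class per week, from the example table.
iq = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
hours = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]


def ranks(values):
    """Rank values from 1 (smallest) to n; valid only when there are no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


rank_x = ranks(iq)
rank_y = ranks(hours)

# Differences between the two ranks of each observation, and their squares.
d = [rx - ry for rx, ry in zip(rank_x, rank_y)]
sum_d_squared = sum(di ** 2 for di in d)

# Spearman's formula for untied ranks.
n = len(iq)
rho = 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

print(f"sum of d^2 = {sum_d_squared}")
print(f"rho = {rho:.2f}")
```

This reproduces the sum of squared rank differences (194) and the rounded coefficient (−0.18) from the worked example.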


Outliers.....

Outliers are dangerous

Here we have a spurious correlation of r=0.68

without IBM, r=0.48

without IBM & GE, r=0.21
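The company data behind these figures is not included in the text, so the sketch below uses a small hypothetical data set instead. It illustrates the same effect: a single extreme point can turn a weak correlation into an apparently strong one.

```python
import math


def pearson_r(a, b):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    s_ab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    s_aa = sum((ai - mean_a) ** 2 for ai in a)
    s_bb = sum((bi - mean_b) ** 2 for bi in b)
    return s_ab / math.sqrt(s_aa * s_bb)


# Hypothetical data with only a weak linear trend.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 3, 2, 3, 2]

# The same data with one hypothetical outlier far from the rest.
x_with_outlier = x + [20]
y_with_outlier = y + [20]

r_without = pearson_r(x, y)
r_with = pearson_r(x_with_outlier, y_with_outlier)

print(f"r without the outlier = {r_without:.2f}")
print(f"r with the outlier    = {r_with:.2f}")
```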

Hypothesis Test for Significance

r is the correlation coefficient for the sample. The correlation coefficient for the population is ρ (rho).

The sampling distribution for r is a t-distribution with n – 2 d.f.

Standardized test statistic:

$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$

For a two-tailed test for significance:

H0: ρ = 0 (The correlation is not significant)

Ha: ρ ≠ 0 (The correlation is significant)

Test of Significance

The correlation between the number of times absent and the final grade is r = –0.975. There were seven pairs of data. Test the significance of this correlation. Use α = 0.01.

1. Write the null and alternative hypotheses.

H0: ρ = 0 (The correlation is not significant)

Ha: ρ ≠ 0 (The correlation is significant)

2. State the level of significance.

α = 0.01

3. Identify the sampling distribution.

A t-distribution with 5 degrees of freedom

4. Find the critical value.

Critical values ± t0 = ±4.032

5. Find the rejection region.

Rejection regions: t < –4.032 or t > 4.032

6. Find the test statistic.

$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{-0.975}{\sqrt{\dfrac{1 - (-0.975)^2}{7 - 2}}} \approx -9.811 $$

df\p   0.40       0.25       0.10       0.05       0.025      0.01       0.005      0.0005
1      0.324920   1.000000   3.077684   6.313752   12.70620   31.82052   63.65674   636.6192
2      0.288675   0.816497   1.885618   2.919986   4.30265    6.96456    9.92484    31.5991
3      0.276671   0.764892   1.637744   2.353363   3.18245    4.54070    5.84091    12.9240
4      0.270722   0.740697   1.533206   2.131847   2.77645    3.74695    4.60409    8.6103
5      0.267181   0.726687   1.475884   2.015048   2.57058    3.36493    4.03214    6.8688

7. Make your decision.

t = –9.811 falls in the rejection region (t < –4.032 or t > +4.032). Reject the null hypothesis.

8. Interpret your decision.

There is a significant negative correlation between the number of times absent and final grades.
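The arithmetic of this test can be sketched in Python. The values r = –0.975 and n = 7 come from the example, and the critical value 4.032 is read from the t table (df = 5, two-tailed α = 0.01). The variable names are illustrative.

```python
import math

r = -0.975          # sample correlation from the example
n = 7               # number of data pairs
t_critical = 4.032  # two-tailed critical value for alpha = 0.01, df = 5 (from the table)

# Degrees of freedom and standard error of r.
df = n - 2
standard_error = math.sqrt((1 - r ** 2) / df)

# Standardized test statistic.
t = r / standard_error

# Two-tailed decision: reject H0 when |t| exceeds the critical value.
reject_null = abs(t) > t_critical

print(f"df = {df}, t = {t:.3f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

The computed statistic is about –9.81, in line with the value quoted above, and it lies well inside the rejection region.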

The Line of Regression

Regression indicates the degree to which the variation in one variable, Y, is related to or can be explained by the variation in another variable, X.

Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables. This equation is called the line of regression or least-squares line.

The equation of a line may be written as y = mx + b, where m is the slope of the line and b is the y-intercept.

The line of regression is:

$$ \hat{y} = mx + b $$

The slope m is:

$$ m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} $$

The y-intercept is:

$$ b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n} $$
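A common computational form of the least-squares slope and intercept uses only the sums Σx, Σy, Σxy and Σx². The sketch below applies that form to a small hypothetical data set; the data and the `predict` helper are illustrative assumptions, not taken from the slides.

```python
# Hypothetical paired data, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Slope and y-intercept of the least-squares line y = m*x + b.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * sum_x / n


def predict(x_new):
    """Estimated y for a given x on the fitted line."""
    return m * x_new + b


print(f"m = {m:.3f}, b = {b:.3f}")
print(f"predicted y at x = 6: {predict(6):.3f}")
```

Working from raw sums suits hand or calculator work; with software, the deviation form (sums of products of deviations from the means) is usually less prone to rounding problems.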

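Best fitting straight line

[Figure: scatter plot of revenue (vertical axis, 180 to 260) against Ad $ (horizontal axis, 1.5 to 3.0) with the fitted line. Legend: (xi, yi) = a data point; a point on the line with the same x-value; a residual, the vertical distance between the two]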

Calculating manually

Simpler calculation formula…

$$ \hat{r} = \frac{SS_{xy}}{\sqrt{SS_x\,SS_y}} $$

where SS_xy is the numerator of the covariance and SS_x and SS_y are the numerators of the variances.
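The link between this calculation formula and the standardized-covariance definition can be written out as a short derivation. The notation SS_xy, SS_x and SS_y for the sums of cross-products and squares is assumed to follow the slide, and sample (n − 1) divisors are assumed for the covariance and variances.

```latex
\begin{aligned}
SS_{xy} &= \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad
SS_x = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
SS_y = \sum_{i=1}^{n} (y_i - \bar{y})^2 \\[4pt]
\hat{r} &= \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}}
         = \frac{SS_{xy} / (n - 1)}{\sqrt{\dfrac{SS_x}{n - 1} \cdot \dfrac{SS_y}{n - 1}}}
         = \frac{SS_{xy}}{\sqrt{SS_x\, SS_y}}
\end{aligned}
```

The factors of n − 1 cancel, which is why only the numerators of the covariance and the variances are needed.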

Distribution of the correlation coefficient:

$$ SE(\hat{r}) = \sqrt{\frac{1 - r^2}{n - 2}} $$

The sample correlation coefficient follows a t-distribution with n − 2 degrees of freedom (since you have to estimate the standard error).

*Note: like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself, so substitute in the estimated r.

R² is another important measure of linear association between x and y (0 ≤ R² ≤ 1)

R² measures the proportion of the total variation in y which is explained by x

For example, r² = 0.8751 indicates that 87.51% of the variation in BW is explained by the independent variable x (BMI).

Coefficient of Determination

The correlation coefficient of number of times absent and final grade is r = –0.975. The coefficient of determination is r² = (–0.975)² = 0.9506.

Interpretation: About 95% of the variation in final grades can be explained by the number of times a student is absent. The other 5% is unexplained and can be due to sampling error or other variables such as intelligence, amount of time studied, etc.

Strength of the Association

The coefficient of determination, r², measures the strength of the association and is the ratio of explained variation in y to the total variation in y.
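The "explained over total variation" reading of r² can be checked numerically. The sketch below uses a small hypothetical data set and illustrative variable names: it fits the least-squares line, splits the total variation in y into explained and unexplained parts, and compares the ratio with the square of Pearson's r.

```python
import math

# Hypothetical paired data, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

# Least-squares line and its fitted values.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * xi for xi in x]

# Partition of the variation in y.
total_variation = s_yy
unexplained_variation = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
explained_variation = total_variation - unexplained_variation

r_squared = explained_variation / total_variation
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"explained / total = {r_squared:.4f}")
print(f"r squared         = {r ** 2:.4f}")
```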

The standard error of Y given X is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.

[Figure: regression line with the same spread Sy/x of Y marked at several values of X]
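The slides do not give a formula for Sy/x. The sketch below assumes the usual definition, the square root of the sum of squared residuals divided by n − 2, and applies it to a small hypothetical data set.

```python
import math

# Hypothetical paired data, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept.
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x

# Residuals: observed y minus the value on the line at the same x.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Standard error of Y given X (assumed definition: sqrt(SSE / (n - 2))).
s_yx = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(f"S y/x = {s_yx:.4f}")
```

The divisor n − 2 reflects the two quantities (slope and intercept) estimated from the data.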

Correlation Coefficient, R, measures the strength of bivariate association

The regression line is a prediction equation that estimates the values of y for any given x

Difference between Correlation and Regression

Thank You
