BC COMS 1016: Intro to Comp Thinking & Data Science Copyright © 2016 Barnard College Lecture 19 – Correlation & Regression

Lecture 19 – Correlation & Regression



Announcements

§ Lab07 – Normal Distribution and Variance of Sample Means (short)
  • Due Wednesday 11/23

§ Homework 7 – Confidence Intervals, Resampling, the Bootstrap, and the Central Limit Theorem
  • Due Thursday 11/24
  • Not the shortest

§ Homeworks:
  • Run all cells before submitting

§ Dropping 2 homeworks and labs


Correlation


Prediction

§ To predict the value of a variable:
  • Identify (measurable) attributes that are associated with that variable
  • Describe the relation between the attributes and the variable you want to predict
  • Then, use the relation to predict the value of the variable


Visualizing Two Numerical Variables

§ Trend
  • Positive association
  • Negative association

§ Pattern
  • Any discernible “shape” in the scatter
  • Linear
  • Non-linear

Visualize, then quantify


The Correlation Coefficient r

§ Measures linear association

§ Based on standard units

§ -1 ≤ r ≤ 1
  • r = 1: scatter is a perfect straight line sloping up
  • r = -1: scatter is a perfect straight line sloping down

§ r = 0: No linear association; uncorrelated


Definition of r

Correlation Coefficient (r) = average of the products of standard(x) and standard(y), where standard(x) is x converted to standard units

Measures how clustered the scattered data are around a straight line


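Steps: 1. convert x to standard units; 2. convert y to standard units; 3. multiply each pair of values; 4. average the products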
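The definition can be sketched in a few lines of Python. This is a minimal illustration that assumes NumPy; the helper names `standard_units` and `correlation` and the five data points are made up for this example and do not come from the lecture.

```python
import numpy as np

def standard_units(values):
    """Convert values to standard units: (value - mean) / SD."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def correlation(x, y):
    """r = average of the products of x and y, both in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Made-up illustrative data (an assumption, not lecture data)
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

r = correlation(x, y)
print(round(r, 3))
```

For these points r comes out to about 0.775, which agrees with NumPy's built-in `np.corrcoef(x, y)[0, 1]`.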


Operations that leave r unchanged

r is not affected by:

§ Changing the units of measurement of the data
  • Because r is based on standard units

§ Which variable is plotted on the x-axis and which on the y-axis
  • Because the product of standard units is the same
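Both properties can be checked numerically. The sketch below assumes NumPy; the `correlation` helper and the height and weight values are hypothetical, chosen only to illustrate the point.

```python
import numpy as np

def correlation(x, y):
    """r = average of the products of x and y in standard units."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_su = (x - x.mean()) / x.std()
    y_su = (y - y.mean()) / y.std()
    return np.mean(x_su * y_su)

# Hypothetical measurements (not lecture data)
heights_in = np.array([60.0, 62.0, 65.0, 68.0, 71.0])       # inches
weights_lb = np.array([110.0, 125.0, 140.0, 155.0, 180.0])  # pounds

r_original = correlation(heights_in, weights_lb)
# Change units: inches -> centimeters, pounds -> kilograms
r_metric = correlation(heights_in * 2.54, weights_lb * 0.4536)
# Swap which variable plays the role of x and which plays y
r_swapped = correlation(weights_lb, heights_in)

print(r_original, r_metric, r_swapped)
```

All three values are equal up to floating-point rounding.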


Interpreting r


Causal Conclusion

Be careful …

§ Correlation measures linear association

§ Association doesn’t imply causation

§ Two variables might be correlated, but that doesn’t mean one causes the other


Nonlinearity and Outliers

Both can affect correlation

§ Draw a scatter plot before computing r
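Two small synthetic cases show why the scatter plot comes first. This is a sketch that assumes NumPy; all data are made up. In the first case y depends entirely on x, yet r is 0 because the relation is not linear. In the second case a single outlier turns a perfect positive correlation into a negative one.

```python
import numpy as np

# Case 1: a perfect but non-linear relation (y = x squared)
x_curve = np.arange(-5, 6, dtype=float)
y_curve = x_curve ** 2
r_curve = np.corrcoef(x_curve, y_curve)[0, 1]

# Case 2: a perfect line, then the same line plus one outlier
x_line = np.arange(1, 11, dtype=float)
y_line = x_line.copy()
r_line = np.corrcoef(x_line, y_line)[0, 1]

x_outlier = np.append(x_line, 30.0)   # one extra point far from the rest
y_outlier = np.append(y_line, 0.0)
r_outlier = np.corrcoef(x_outlier, y_outlier)[0, 1]

print(r_curve, r_line, r_outlier)
```

Here `r_curve` is 0 up to rounding, `r_line` is 1, and `r_outlier` is slightly negative, even though ten of the eleven points still lie on a straight line.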


Ecological Correlation

§ Correlations based on groups or aggregated data

§ Can be misleading:
  • For example, they can be artificially high
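A simulation can illustrate the effect. The sketch below assumes NumPy and uses entirely synthetic data: 10 hypothetical groups of 200 individuals, where group centers lie on a line but individuals vary widely around their group's center. The group count, group size, and noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable

n_groups, per_group = 10, 200
centers = np.arange(n_groups, dtype=float)            # group centers 0, 1, ..., 9
group_ids = np.repeat(np.arange(n_groups), per_group)

# Each individual's x and y are the group center plus independent noise
x = centers[group_ids] + rng.normal(0, 3, size=group_ids.size)
y = centers[group_ids] + rng.normal(0, 3, size=group_ids.size)

# Correlation computed on individuals
r_individual = np.corrcoef(x, y)[0, 1]

# Correlation computed on group averages (the ecological correlation)
x_means = np.array([x[group_ids == g].mean() for g in range(n_groups)])
y_means = np.array([y[group_ids == g].mean() for g in range(n_groups)])
r_group = np.corrcoef(x_means, y_means)[0, 1]

print(r_individual, r_group)
```

Averaging within groups removes most of the individual variation, so the correlation between group averages is much closer to 1 than the correlation between individuals.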


Prediction


Guess the future

§ Based on incomplete information

§ One way of making predictions:
  • To predict an outcome for an individual,
  • find others who are like that individual
  • and whose outcomes you know.
  • Use those outcomes as the basis of your prediction.


Galton’s Heights

Goal: Predict the height of a new child, based on that child’s midparent height


Galton’s Heights

How can we predict a child’s height given a midparent height of 68 inches?

Idea: Use the average height of the children of all families where the midparent height is close to 68 inches



Predicted Heights


Graph of Averages

For each x value, the prediction is the average of the y values in its nearby group.

The graph of these predictions is the graph of averages.

If the association between x and y is linear, then points in the graph of averages tend to fall on a line. That line is called the regression line.


Nearest Neighbor Regression

A method for predicting a numerical y, given a value of x:

§ Identify the group of points where the values of x are close to the given value

§ The prediction is the average of the y values for the group
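The two steps above can be sketched as follows. This assumes NumPy; the function name `nn_predict`, the half-inch window that defines "close", and the small table of heights are all assumptions made for illustration, not values from the lecture or from Galton's data.

```python
import numpy as np

def nn_predict(x_values, y_values, given_x, window=0.5):
    """Predict y at given_x as the average y of points whose x is within `window`."""
    x_values = np.asarray(x_values, dtype=float)
    y_values = np.asarray(y_values, dtype=float)
    close = np.abs(x_values - given_x) <= window   # step 1: find the nearby group
    if not close.any():
        raise ValueError("no points within the window")
    return y_values[close].mean()                  # step 2: average their y values

# Hypothetical midparent and child heights in inches (not Galton's data)
midparent = np.array([66.0, 67.6, 67.8, 68.0, 68.3, 68.4, 70.0, 72.0])
child = np.array([65.0, 67.0, 68.0, 66.5, 69.0, 68.5, 70.0, 73.0])

prediction = nn_predict(midparent, child, 68, window=0.5)
print(prediction)
```

Five families fall within half an inch of 68, and the prediction is the average of their children's heights, 67.8 inches. A wider window uses more points but blurs the prediction; a narrower one may leave too few points to average.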


Linear Regression


Where is the prediction line?

r = 0.99


Where is the prediction line?

r = 0.0


Where is the prediction line?

r = 0.5


Identifying the Line

§ If the scatter plot is oval shaped, then we can spot an important feature of the regression line


Linear Regression

A statement about x and y pairs
  • Measured in standard units
  • Describing the deviation of x from 0 (the average of x's)
  • And the deviation of y from 0 (the average of y’s)

On average, y deviates from 0 less than x deviates from 0

$y_{su} = r \times x_{su}$
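The standard-units form of the regression line can be tried out directly. This is a minimal sketch that assumes NumPy; the helper `standard_units` and the data are made up for illustration. Because r is between -1 and 1, each predicted y in standard units is no farther from 0 than the corresponding x.

```python
import numpy as np

def standard_units(values):
    """Convert values to standard units: (value - mean) / SD."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Made-up illustrative data (not lecture data)
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

x_su = standard_units(x)
y_su = standard_units(y)
r = np.mean(x_su * y_su)

# Regression prediction in standard units: predicted y_su = r * x_su
predicted_y_su = r * x_su

for xs, ps in zip(x_su, predicted_y_su):
    print(f"x_su = {xs:+.2f}  ->  predicted y_su = {ps:+.2f}")
```

The predictions describe averages, not individual points: an actual y value in standard units can still lie farther from 0 than its x value.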


Slope and Intercept


Regression Line Equation

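In original units, the regression line has this equation:

$$\frac{\text{estimate of } y - \text{mean}(y)}{\text{SD of } y} = r \times \frac{\text{given } x - \text{mean}(x)}{\text{SD of } x}$$

Lines can be expressed by slope and intercept:

$$y = \text{slope} \times x + \text{intercept}$$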
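To move from the first form to the second, solve the regression equation for the estimate of y. The derivation below is a short sketch; the shorthand $\hat{y}$ for "estimate of y" and $\text{SD}(\cdot)$ for "SD of" is notation assumed here, not used on the slides.

```latex
\begin{align*}
\frac{\hat{y} - \text{mean}(y)}{\text{SD}(y)}
  &= r \times \frac{x - \text{mean}(x)}{\text{SD}(x)} \\[4pt]
\hat{y}
  &= r \times \frac{\text{SD}(y)}{\text{SD}(x)} \times \bigl(x - \text{mean}(x)\bigr) + \text{mean}(y) \\[4pt]
  &= \underbrace{r \times \frac{\text{SD}(y)}{\text{SD}(x)}}_{\text{slope}} \times x
     \;+\; \underbrace{\text{mean}(y) - \text{slope} \times \text{mean}(x)}_{\text{intercept}}
\end{align*}
```

The coefficient of x is the slope, and everything that does not multiply x is the intercept.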


Regression Line

[Figure: the regression line shown in two panels, "Standard Units" and "Original Units"]


Slope and Intercept

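$$\text{estimate of } y = \text{slope} \times x + \text{intercept}$$

Slope of the regression line:

$$\text{slope} = r \times \frac{\text{SD of } y}{\text{SD of } x}$$

Intercept of the regression line:

$$\text{intercept} = \text{mean}(y) - \text{slope} \times \text{mean}(x)$$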
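These two formulas translate directly into code. The sketch below assumes NumPy; the function names `correlation`, `slope`, and `intercept` and the data points are chosen for this illustration and are not taken from the lecture.

```python
import numpy as np

def correlation(x, y):
    """r = average of the products of x and y in standard units."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_su = (x - x.mean()) / x.std()
    y_su = (y - y.mean()) / y.std()
    return np.mean(x_su * y_su)

def slope(x, y):
    """Slope of the regression line: r * (SD of y) / (SD of x)."""
    return correlation(x, y) * np.std(y) / np.std(x)

def intercept(x, y):
    """Intercept of the regression line: mean(y) - slope * mean(x)."""
    return np.mean(y) - slope(x, y) * np.mean(x)

# Made-up illustrative data (not lecture data)
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

m = slope(x, y)
b = intercept(x, y)
estimate_at_6 = m * 6 + b   # estimate of y for a given x of 6

print(m, b, estimate_at_6)
```

For these points the slope is 0.6 and the intercept is 2.2, so the estimate of y at x = 6 is 5.8. The same line comes out of a least-squares fit such as `np.polyfit(x, y, 1)`.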
