Chapter 2 part3-Least-Squares Regression

INTRODUCTION TO STATISTICS & PROBABILITY Chapter 2: Looking at Data–Relationships (Part 3)

1

Dr. Nahid Sultana

Chapter 2: Looking at Data–Relationships

2

2.1: Scatterplots

2.2: Correlation

2.3: Least-Squares Regression

2.5: Data Analysis for Two-Way Tables

3

Objectives

Regression lines

Prediction and Extrapolation

Correlation and r2

2.3 Least-Squares Regression

Regression line 4

Correlation tells us about strength and direction of the linear relationship between two quantitative variables.

In Regression we study the association between two variables in order to explain the values of one from the values of the other (i.e., make predictions).

When there is a linear association between two variables, then a straight line equation can be used to model the relationship.

In regression the distinction between Response and Explanatory is important.

Regression line (Cont…) 5

A regression line is a line that best describes the linear

relationship between the two variables, and it is expressed by

means of an equation of the form:

Where is the slope and is the intercept.

Once the equation of the regression line is established, we can

use it to predict the response y for a specific value of the

explanatory variable x .

The least-squares regression line 6

The least-squares regression line is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.

The least-squares regression line (Cont.) 7

is the predicted y value (y hat) b1 is the slope

b0 is the y-intercept

ˆ y

xbby 10ˆ +=

The equation of the least-squares regression line of y on x is

b1 = rsy

sx

First we calculate the slope of the line,

Where r is the correlation, sy is the standard deviation of the response variable y, sx is the standard deviation of the explanatory variable x.

Once we know b1, the slope, we can calculate b0, the y-intercept:

b0 = y − b1x Where and are the sample means of the x and y variables

How to plot the least-squares regression line

8

Typically, we use stats software.

x

y

How to plot the least-squares regression line (Cont…)

9

To plot the regression line you only need to plug the x values into the equation, get y, and draw the line that goes through those points. Hint: The regression line always passes through the mean of x and y.

9

The points you use for drawing the regression line are derived from the equation. They are NOT points from your sample data (except by pure coincidence).

9

Two different regression lines can be drawn if we interchange the roles of x and y. Example:

10

Correlation coefficient of NEA and Fat, r = -0.779 stay same in both cases

Nonexercise activity (calories)

Fat

ga

in (

Kilo

gra

ms)

7006005004003002001000-100

4

3

2

1

0

Fitted Line PlotFat = 3.505 - 0.003441 NEA

Fat gain (Kilograms)

No

ne

xe

rcis

e a

cti

vit

y (

ca

lori

es)

43210

700

600

500

400

300

200

100

0

-100

Fitted Line PlotNEA = 745.3 - 176.1 Fat

BEWARE!!!

Not all calculators and software use the same convention. Some use: And some use:

bxay +=ˆ

ˆ y = ax + bMake sure you know what YOUR calculator gives you for a and b before you answer homework or exam questions.

11

Making predictions

The equation of the least-squares regression allows you to predict y for any x within the range studied.

y

ˆ y = 0.0144x + 0.0008Nobody in the study drank 6.5

beers, but by finding the value

from the regression line for x = 6.5

we would expect a blood alcohol

content of 0.094 mg/ml.

mg/ml 0944.00008.0936.0ˆ0008.05.6*0144.0ˆ=+=

+=yy

Year Powerboat s Dead Manate es1977 447 131978 460 211979 481 241980 498 161981 513 241982 512 201983 526 151984 559 341985 585 331986 614 331987 645 391988 675 431989 711 501990 719 47

There is a positive linear relationship between the number of powerboats registered and the number of manatee deaths.

(in 1000s)

1.214.415.62ˆ 4.41)500(125.0ˆ =−=⇒−= yy Roughly 21 manatees.

Thus if we were to limit the number of powerboat registrations to 500,000, what could we expect for the number of manatee deaths?

The least squares regression line has the equation:

ˆ y = 0 . 125 x − 41 . 4

ˆ y = 0 . 125 x − 41 . 4

13 ----Could we use this regression line to predict the number of manatee

deaths for a year with 200,000 powerboat registrations?

Extrapolation is the use of a regression line for prediction far outside the range of values of x used to obtain the line. Such predictions are often not accurate.

!!!

!!!

Extrapolation 14

Sometimes the y-intercept is not biologically possible. Here we have negative blood alcohol content, which makes no sense…

y-intercept shows negative blood alcohol But the negative value is

appropriate for the equation of the regression line. There is a lot of scatter in the data, and the line is just an estimate.

The y intercept 15

Coefficient of determination, r2

16

Least-squares regression looks at the distances of the data points from the line only in the y direction.

The variables x and y play different roles in regression.

Even though correlation r ignores the distinction between x and y, there is a close connection between correlation and regression.

r2 is called the coefficient of determination.

r2 represents the percentage of the variance in y (vertical scatter from the regression line) that can be explained by changes in x.

r = -1 r2 = 1

Changes in x explain 100% of the variations in y. Y can be entirely predicted for any given value of x.

r = 0 r2 = 0

Changes in x explain 0% of the variations in y. The values y takes are entirely independent of what value x takes.

Here the change in x only explains 76% of the change in y. The rest of the change in y (the vertical scatter, shown as red arrows) must be explained

by something other than x.

r = 0.87 r2 = 0.76

17

17

r = –0.3, r 2 = 0.09, or 9% The regression model explains not even 10% of the variations in y.

r = –0.7, r 2 = 0.49, or 49% The regression model explains nearly half of the variations in y.

r = –0.99, r 2 = 0.9801, or ~98% The regression model explains almost all of the variations in y.

r = –0.3, r 2 = 0.09, or 9% The regression model explains not even 10% of the variations in y.

r = –0.7, r 2 = 0.49, or 49% The regression model explains nearly half of the variations in y.

r = –0.99, r 2 = 0.9801, or ~98% The regression model explains almost all of the variations in y.

18

Observed y

Predicted ŷ residual )ˆ( dist. =− yy

Residuals 19

Points above the line have a positive residual.

Points below the line have a negative residual.

A residual is the difference between an observed value of the response variable and the value predicted by the regression line:

residual = observed y – predicted y

= y − ˆ y

The sum of these residuals is always 0.

A residual plot is a scatterplot of the regression residuals against the explanatory variable.

Residual plots help us assess the fit of a regression line.

If residuals are scattered randomly around 0, chances are your data fit a linear model, was normally distributed, and you didn’t have outliers.

Residual plots 20

The x-axis in a residual plot is the same as on the scatterplot.

Only the y-axis is different.

21

Residuals are randomly scattered—good!

22

Curved pattern—means the relationship you are looking at is not linear.

A change in variability across a plot is a warning sign. You need to find out why it is, and remember that predictions made in areas of larger variability will not be as good.

2.5 Data Analysis for Two-Way Tables

23

Objectives

The Two-Way Table

Marginal Distribution

Conditional Distributions

23

24

Two-way tables

Two-way tables summarize data about two categorical variables (or factors) collected on the same set of individuals.

Example (Smoking Survey in Arizona): High school students were asked whether they smoke and whether their parents smoke.

Does parental smoking influence the smoking habits of their high school children?

Explanatory Variable: Smoking habit of student’s parents

(both smoke/ one smoke/ neither smoke) Response variable: Smoking habit of student

(smokes/does not smoke) To analyze the relationship we can summarize the result in a Two-way table:

25

Two-way tables (Cont …)

Explanatory (Row) Variable: Smoking habit of student’s parents

Response (Column) variable: Smoking habit of student

This 3X2 two-way table has 3 rows and 2 columns. Numbers are counts or frequency

400 1380 416 1823 188 1168

First factor: Parent smoking status

Second factor: Student smoking status

High school students were asked whether they smoke, and whether their parents smoke:

26

Margins

Margins show the total for each column and each row.

For each cell, we can compute a proportion by dividing the cell entry by the total sample size. The collection of these proportions is the joint distribution of the two categorical variables.

400 1380 416 1823 188 1168

Margin for parental smoking

Margin for student smoking

27

Marginal distributions (When examine the distribution of a single variable in a two-way table)

Marginal distributions: Distribution of column variable separately (or row variable separately) expressed in counts or percent.

%1.3353751780

≈

%7.1853751004

=

400 1380 33.1% 416 1823 41.7% 188 1168 25.2% 18.7% 81.3% 100%

400 1380 416 1823 188 1168

28

Marginal distribution (Cont..)

The marginal distributions can be displayed on separate bar graphs, typically expressed as percents instead of raw counts.

Each graph represents only one of the two variables, ignoring the second one. Each marginal distribution can also be shown in a pie chart.

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

Smoker Nonsmoker

Perc

ent o

f stu

dent

s in

terv

iew

ed

Sum of Counts

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

Both One Neither

Perc

ent o

f stu

dent

s in

terv

iew

ed

Sum of Counts Parental smoking

Student smoking

29

Conditional Distribution

A conditional distribution is the distribution of one factor for each level of the other factor.

A conditional percent is computed using the counts within a single row or a single column. The denominator is the corresponding row or column total (rather than the table grand total).

Percent of students who smoke when both parents smoke = 400/1780 = 22.5%

400 1380 416 1823 188 1168

Percent of students who smoke when both parents smoke = 400/1780 = 22.5%

400 1380 416 1823 188 1168

30

Conditional distributions (Cont…)

Conditional distribution of student smokers for different parental smoking statuses: Percent of students who smoke when both parents smoke = 400/1780 = 22.5% Percent of students who smoke when one parent smokes = 416/2239 = 18.6% Percent of students who smoke when neither parent smokes = 188/1356 = 13.9%

400 1380 416 1823 188 1168

Comparing conditional distributions helps us describe the “relationship" between the two categorical variables.

We can compare the percent of individuals in one level of factor 1 for each level of factor 2.

31


Conditional distribution of student smoking status for different levels of parental smoking status:

Percent who

smoke Percent who do not smoke Row total

Both parents smoke 22% 78% 100% One parent smokes 19% 81% 100% Neither parent smokes 14% 86% 100%

The conditional distributions can be compared graphically by displaying the percents

making up one level of one factor, for each level of the other factor.

32

Conditional Distribution

In the table below, the 25 to 34 age group occupies the first column.

33


Here the percents are calculated by age range (columns).

29.30% = 11071 37785 = cell total . column total

34

The conditional distributions can be graphically compared using side by side bar graphs of one variable for each value of the other variable.

Here, the percents are calculated by age range

(columns).

34

35

Music and wine purchase decision

We want to compare the conditional distributions of the response variable (wine purchased) for each value of the explanatory variable (music played). Therefore, we calculate column percents.

What is the relationship between type of music played in supermarkets and type of wine purchased?

We calculate the column conditional percents similarly for

each of the nine cells in the table:

Calculations: When no music was played, there were 84 bottles of wine sold. Of these, 30 were French wine. 30/84 = 0.357 35.7% of the wine sold was French when no music was played.

30 = 35.7% 84 = cell total . column total

For every two-way table, there are two sets of possible conditional distributions.

Wine purchased for each kind of music played (column percents)

Music played for each kind of wine purchased

(row percents)

Does background music in supermarkets

influence customer purchasing decisions?

36

Education

Chapter 2 part3-Least-Squares Regression