STATISTICS 200 Lecture #5 Tuesday, September 6, 2016
Textbook: Sections 2.7 through 3.2

Objectives:
• Define z-scores and relate them to the empirical (68-95-99.7) rule
• Explore scatterplots as a tool for visualizing two quantitative variables
• Familiarize yourselves with least squares regression lines:
  – slope interpretation
  – y-intercept interpretation
  – dangerous to extrapolate

Source: personal.psu.edu/drh20/200/lectures/lecture05.pdf



Standardized z-scores

• Tells us how many standard deviations an observation is from the mean.

• A useful measure of the relative value of any observation in a dataset

• Allows comparison of observations in different data sets.

Standardized z-scores

• Z-scores correspond directly to the Empirical Rule.
• About 68% of values have z-scores between –1 and 1.
• About 95% of values have z-scores between –2 and 2.
• About 99.7% of values have z-scores between –3 and 3.
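The 68-95-99.7 figures can be checked numerically. The sketch below assumes an exactly normal (bell-shaped) distribution and uses `statistics.NormalDist` from Python's standard library; the helper name `share_within` is invented for illustration and does not come from the lecture.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0 and standard deviation 1,
# so its values are already z-scores.
STANDARD_NORMAL = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution with z-scores between -k and k."""
    return STANDARD_NORMAL.cdf(k) - STANDARD_NORMAL.cdf(-k)


for k in (1, 2, 3):
    print(f"z between -{k} and {k}: {share_within(k):.1%}")
```

The printed shares are roughly 68.3%, 95.4%, and 99.7%, which is where the rule's rounded percentages come from. Real data that are only approximately bell-shaped will match these figures only approximately.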

Example 1

• What is the z-score and interpretation in the following situation?
  – Obs = 3, mean = 4, SD = 0.5

Z-score = (observation – mean)/SD
= (3 – 4) / 0.5
= –1 / 0.5
= –2

Interpretation: The observation of 3 is 2 standard deviations below the mean.

Example 2

• What is the z-score and interpretation in the following situation?
  – Obs = 200, mean = 150, SD = 20

Z-score = (observation – mean)/SD
= (200 – 150) / 20
= 50 / 20
= 2.5

Interpretation: The observation of 200 is 2.5 standard deviations above the mean.
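Both examples use the same formula, so they can be reproduced with one small function. This is a minimal sketch; the names `z_score` and `interpret` are chosen here for illustration and are not part of the course materials.

```python
def z_score(observation, mean, sd):
    """Number of standard deviations the observation lies from the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (observation - mean) / sd


def interpret(observation, mean, sd):
    """Describe the z-score in words, in the style of the slides."""
    z = z_score(observation, mean, sd)
    if z == 0:
        return f"{observation} is equal to the mean"
    direction = "above" if z > 0 else "below"
    return f"{observation} is {abs(z):g} standard deviations {direction} the mean"


print(interpret(3, 4, 0.5))      # Example 1
print(interpret(200, 150, 20))   # Example 2
```

The sign of the z-score carries the direction (negative means below the mean), and its size carries the distance in standard deviations.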


More complicated example: which person has a more unusual height?

Me: a 53” tall woman

• Women’s heights are normal with mean 54” and std. dev. 3”.

My husband: a 73” tall man

• Men’s heights are normal with mean 70” and std. dev. 3”

These heights come from different distributions, so we cannot compare them directly. We need a tool to make them comparable…

Z-score!

Calculate Z-scores for both:

• Me: Z-score = (obs – mean) / (std. dev)
  = (53 – 54) / 3
  = –1/3
  ≈ –0.33

• Husband: Z-score = (obs – mean) / (std. dev)
  = (73 – 70) / 3
  = 3 / 3
  = 1


Compare Z-scores – draw them below

Me Husband

Compare Z-scores

Me: 0.33 std. dev. below the mean
Husband: 1 std. dev. above the mean

Conclusion:

My husband’s height is more unusual than mine, because it is more standard deviations away from the mean.
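The comparison can be sketched in a few lines: convert each height to a z-score using its own distribution, then compare absolute values, since "unusual" means far from the mean in either direction. The numbers come from the slides; the variable names are illustrative only.

```python
def z_score(observation, mean, sd):
    """Number of standard deviations the observation lies from the mean."""
    return (observation - mean) / sd


# (height, mean, standard deviation), all in inches, as given on the slides.
people = {
    "Me": (53, 54, 3),
    "Husband": (73, 70, 3),
}

z_scores = {name: z_score(*values) for name, values in people.items()}

# Compare distances from the mean, ignoring direction.
most_unusual = max(z_scores, key=lambda name: abs(z_scores[name]))

for name, z in z_scores.items():
    print(f"{name}: z = {z:.2f}")
print("More unusual height:", most_unusual)
```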

So far…

• We have talked about quantitative variables, but only one at a time.
• Now we’re going to begin looking at the relationships between two different quantitative variables.
• Start by looking at a scatterplot.

Scatterplots:

• A scatterplot is a two-dimensional graph of two numeric variables.
• There are two axes on a scatterplot, the vertical axis (y-axis) and the horizontal axis (x-axis).
  – The y-axis is assigned to the response variable.
  – The x-axis is assigned to the explanatory variable.

Example 1: Apartment size and rent

Two Variables:
• size of one-bedroom apartment (square feet)
• monthly rent ($)

Size (Square Ft)    Rent ($)
415                 1000
438                 833
485                 1003
636                 1089
548                 1150
666                 1181
646                 1237
545                 1225
690                 1469
688                 1501
538                 1177
469                 958


What is the average pattern?

What is the direction of the pattern?

A positive, linear association

• Explanatory / independent / x variable
• Response / dependent / y variable

Linear versus curvilinear

• Linear relationship
  – a relationship that, on average, will follow a line
• Curvilinear or nonlinear relationship
  – a relationship that, on average, will follow a curve

Association: a term used to describe the direction of the pattern shown by the two variables.

• A positive association occurs when the values of one variable tend to increase as the values of the other variable increase.
• A negative association occurs when the values of one variable tend to decrease as the values of the other variable increase.

Outliers

• When we consider two variables, an outlier is a point with an unusual combination of values.
• Outliers may be unusual and interesting data points, or they may be errors.

Example – Tornado Activity

Variables:
• year
• number of tornadoes (Jan – May)

Source: National Weather Service

Unusually high observations that don’t follow the trend of the other observations

Formalize the trend: Regression lines

[Fitted line plot of Weight versus Height: Weight = -195.9 + 5.175 Height; S = 24.3673, R-Sq = 43.0%, R-Sq(adj) = 43.0%]

• Regression line: a straight line that describes how values of the response variable (y) are related, on average, to values of the explanatory variable (x).
• We can use the regression line to…
  – Estimate the average value of y at a specified value of x
  – Predict the unknown value of y for an individual using that individual’s x value.
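Both uses amount to plugging an x value into the fitted equation. The sketch below uses the coefficients reported on the fitted line plot; the function name `predict_weight` is hypothetical, and the units (inches for height, pounds for weight) are an assumption, since the plot does not state them.

```python
# Coefficients read from the fitted line plot: Weight = -195.9 + 5.175 Height
INTERCEPT = -195.9
SLOPE = 5.175


def predict_weight(height):
    """Estimated average weight at the given height (units assumed, see above)."""
    return INTERCEPT + SLOPE * height


# A height of 70 lies inside the plotted range of heights (about 50 to 85).
print(round(predict_weight(70), 2))
```

The same number serves as the estimated average weight of everyone at that height and as the predicted weight of one individual at that height.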

Specify Linear Relationships with Simple Linear Regression Model

Regression:
• used to find the best straight line to fit the data points

Name of Procedure: Least Squares

Least Squares Model:
• the line with the smallest sum of the squared differences among all possible lines
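To make "smallest sum of squared differences" concrete, the sketch below fits a line using the standard closed-form least squares formulas (which are not shown on the slides) and then checks that a few other candidate lines give a larger sum. The data are hypothetical, invented purely for illustration, and the function names are not from the lecture or from Minitab.

```python
def fit_least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_differences(xs, ys, b0, b1):
    """Sum of squared vertical differences between the points and a line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_least_squares(xs, ys)
best = sum_squared_differences(xs, ys, b0, b1)
print(f"least squares line: y-hat = {b0:.2f} + {b1:.2f} x")
print(f"sum of squared differences: {best:.2f}")

# Other candidate lines give a larger sum of squared differences.
for other_b0, other_b1 in [(b0 + 0.5, b1), (b0, b1 + 0.2), (b0 - 0.3, b1 - 0.1)]:
    worse = sum_squared_differences(xs, ys, other_b0, other_b1)
    print(f"  line {other_b0:.2f} + {other_b1:.2f} x gives {worse:.2f}")
```

Trying three alternatives is only a spot check, not a proof; the closed-form solution is what guarantees that no other straight line does better.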

The regression equation

In statistics: $\hat{y} = b_0 + b_1 x$
– $\hat{y}$: average value of y
– $b_0$: y-intercept
– $b_1$: slope

In math: the same straight line, written in slope-intercept form

In a picture:

[Figure: the regression line $\hat{y} = b_0 + b_1 x$]

Example: Positive linear relationship between meal bill ($) and amount of tip ($)

r = 0.830 and n = 10 bills

data from a restaurant

Example: Tip example

Question: How can we use the amount of the bill ($) to estimate the amount of tip left ($), on average?

Identify the Variables:
• Bill ($): explanatory
• Tip ($): response
• Note: the explanatory variable is also called the predictor variable

To fit a regression line in Minitab: Stat > Regression > Fitted Line Plot

• Correctly identify the explanatory variable and the response variable.
• Straight line: simple linear regression

Least Squares Regression Equation

The regression equation is Tip = -0.60 + 0.190 Bill

• sample y-intercept (b0): -0.60
• sample slope (b1): 0.190

Slope Interpretation

Tip = -0.60 + 0.19 Bill

b1 = $0.19

• For each additional $1 found on the bill, you can expect the tip to increase by 19 cents, on average.
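The slope interpretation can be verified directly from the equation: raising the bill by $1 changes the predicted tip by the same amount no matter where you start. A minimal sketch, with `predicted_tip` as an illustrative name:

```python
def predicted_tip(bill):
    """Estimated average tip ($) from the equation Tip = -0.60 + 0.19 Bill."""
    return -0.60 + 0.19 * bill


for bill in (20, 30, 40):
    change = predicted_tip(bill + 1) - predicted_tip(bill)
    print(f"bill ${bill} -> ${bill + 1}: predicted tip changes by ${change:.2f}")
```

Each line reports a change of $0.19, because the slope of a straight line is constant.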

Y-intercept Interpretation

Tip = -0.60 + 0.19 Bill

b0 = -$0.60
In theory it says: When you have no bill, you can expect a tip to be -$0.60.

• So does the y-intercept have a logical interpretation in the context of this problem?
  No: we have no data for bill = 0

Estimation & Limitations

Question: If the bill is $30, estimate the average amount left for a tip.

Tip = -0.60 + 0.19 Bill

Can: Estimate within the range of $15 to $45

Tip = -0.60 + 0.19×(30)

Tip = $5.10

Note: Bill = $30 is not an actual observation in the sample

Example 5B: Estimation & Limitations

Question: If the bill is $70, estimate the average amount left for a tip.

x = $70

Tip = -0.60 + 0.19 × Bill

Can’t: Extrapolate outside the range of $15 to $45
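One way to build the "estimate inside the range, don't extrapolate outside it" rule into a calculation is to refuse any bill outside the sampled range. This is only a sketch: the $15 to $45 range and the equation come from the slides, while the function name `estimate_tip` and the choice to raise an error are illustrative.

```python
# Range of bill amounts in the sample, as stated on the slides.
BILL_MIN, BILL_MAX = 15, 45


def estimate_tip(bill):
    """Estimated average tip ($) for a bill inside the sampled range."""
    if not BILL_MIN <= bill <= BILL_MAX:
        raise ValueError(
            f"bill ${bill} is outside ${BILL_MIN} to ${BILL_MAX}; refusing to extrapolate"
        )
    return -0.60 + 0.19 * bill


# A $30 bill is inside the range, so an estimate is reasonable.
print(f"${estimate_tip(30):.2f}")

# A $70 bill is outside the range, so no estimate is given.
try:
    estimate_tip(70)
except ValueError as error:
    print("Cannot estimate:", error)
```

The equation would still produce a number for a $70 bill; the guard is there because nothing in the sample tells us whether the linear pattern continues that far.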

To remember about regression equations:

Y-intercept:
• has a logical interpretation only when x = 0 is in the range of the data in the sample

No Extrapolation:
• don’t use a regression equation to estimate a value for the response variable outside the range of x values

Estimation:
• the regression equation estimates the average value of y at a given value of x.

Review: If you understood today’s lecture, you should be able to solve

• 3.1, 3.3, 3.5, 3.13, 3.15, 3.19, 3.21

Recall Objectives:
• Define z-scores and relate them to the empirical (68-95-99.7) rule
• Explore scatterplots as a tool for visualizing two quantitative variables
• Familiarize yourselves with least squares regression lines:
  – slope interpretation
  – y-intercept interpretation
  – dangerous to extrapolate