Stats Chap10

Embed Size (px)

Citation preview

  • 8/3/2019 Stats Chap10

    1/63

    Chapter 10

    Correlation and Regression

    McGraw-Hill, Bluman, 5th ed., Chapter 10 1

  • 8/3/2019 Stats Chap10

    2/63

    Chapter 10 OverviewIntroduction

    10-1 Correlation

    10-2 Regression 10-3 Coefficient of Determination and

    Standard Error of the Estimate

    Bluman, Chapter 10 2

  • 8/3/2019 Stats Chap10

    3/63

    Chapter 10 Objectives1. Draw a scatter plot for a set of ordered pairs.

    2. Compute the correlation coefficient.

    3. Test the hypothesis H0: = 0.

    4. Compute the equation of the regression line.

    5. Compute the coefficient of determination.

    6. Compute the standard error of the estimate.

    7. Find a prediction interval.

    Bluman, Chapter 10 3

  • 8/3/2019 Stats Chap10

    4/63

    Introduction

    In addition to hypothesis testing andconfidence intervals, inferential statisticsinvolves determining whether a

    relationship between two or morenumerical or quantitative variables exists.

    Bluman, Chapter 10 4

  • 8/3/2019 Stats Chap10

    5/63

    Introduction

    CorrelationCorrelation is a statistical method usedto determine whether a linear relationshipbetween variables exists.

    RegressionRegression is a statistical method usedto describe the nature of the relationship

    between variablesthat is, positive ornegative, linear or nonlinear.

    Bluman, Chapter 10 5

  • 8/3/2019 Stats Chap10

    6/63

    Introduction

    The purpose of this chapter is to answerthese questions statistically:

    1. Are two or more variables related?

    2. If so, what is the strength of therelationship?

    3. What type of relationship exists?4. What kind of predictions can be

    made from the relationship?

    Bluman, Chapter 10 6

  • 8/3/2019 Stats Chap10

    7/63

    Introduction

    1. Are two or more variables related?

    2. If so, what is the strength of the

    relationship?

    To answer these two questions, statisticians usethe correlation coefficientcorrelation coefficient, a numericalmeasure to determine whether two or more

    variables are related and to determine thestrength of the relationship between or amongthe variables.

    Bluman, Chapter 10 7

  • 8/3/2019 Stats Chap10

    8/63

    Introduction

    3. What type of relationship exists?

    There are two types of relationships: simple andmultiple.

    In a simple relationship, there are two variables: anindependent variableindependent variable (predictor variable) and adependent variabledependent variable (response variable).

    In a multiple relationship, there are two or moreindependent variables that are used to predict onedependent variable.

    Bluman, Chapter 10 8

  • 8/3/2019 Stats Chap10

    9/63

    Introduction

    4. What kind of predictions can be made

    from the relationship?

    Predictions are made in all areas and daily. Examplesinclude weather forecasting, stock market analyses,sales predictions, crop predictions, gasoline pricepredictions, and sports predictions. Some predictionsare more accurate than others, due to the strength of

    the relationship. That is, the stronger the relationship isbetween variables, the more accurate the prediction is.

    Bluman, Chapter 10 9

  • 8/3/2019 Stats Chap10

    10/63

    10.1 Scatter Plots and Correlation

    A scatter plotscatter plot is a graph of the orderedpairs (x, y) of numbers consisting of theindependent variablexand the

    dependent variable y.

    Bluman, Chapter 10 10

  • 8/3/2019 Stats Chap10

    11/63

    Chapter 10Correlation and Regression

    Section 10-1Example 10-1

    Page #536

    Bluman, Chapter 10 11

  • 8/3/2019 Stats Chap10

    12/63

    Example 10-1: Car Rental Companies

    Construct a scatter plot for the data shown for car rentalcompanies in the United States for a recent year.

    Step 1: Draw and label thexand yaxes.

    Step 2: Plot each point on the graph.

    Bluman, Chapter 10 12

  • 8/3/2019 Stats Chap10

    13/63

    Example 10-1: Car Rental Companies

    Bluman, Chapter 10 13

    Positive Relationship

  • 8/3/2019 Stats Chap10

    14/63

    Correlation

    The correlation coefficientcorrelation coefficient computed fromthe sample data measures the strength anddirection of a linear relationship between two

    variables. There are several types of correlation

    coefficients. The one explained in this sectionis called the Pearson product momentPearson product moment

    correlation coefficient (PPMC)correlation coefficient (PPMC). The symbol for the sample correlation

    coefficient is r. The symbol for the populationcorrelation coefficient is V.

    Bluman, Chapter 10 14

  • 8/3/2019 Stats Chap10

    15/63

    Correlation

    The range of the correlation coefficient is from1 to 1.

    If there is a strong positive linearstrong positive linear

    relationshiprelationship between the variables, the valueofrwill be close to 1.

    If there is a strong negative linearstrong negative linearrelationshiprelationship between the variables, the value

    ofr will be close to 1.

    Bluman, Chapter 10 15

  • 8/3/2019 Stats Chap10

    16/63

    Correlation

    Bluman, Chapter 10 16

  • 8/3/2019 Stats Chap10

    17/63

    Correlation Coefficient

    The formula for the correlation coefficient is

    where n is the number of data pairs.

    Rounding Rule: Round to three decimal places.

    Bluman, Chapter 10 17

    2 2

    2 2

    !

    - -

    n xy x yr

    n x x n y y

  • 8/3/2019 Stats Chap10

    18/63

    Chapter 10Correlation and Regression

    Section 10-1Example 10-2

    Page #540

    Bluman, Chapter 10 18

  • 8/3/2019 Stats Chap10

    19/63

    Example 10-4: Car Rental Companies

    Compute the correlation coefficient for the data inExample 101.

    Bluman, Chapter 10 19

    CompanyCarsx

    (in 10,000s)Income y

    (in billions) xy x 2 y2

    A

    BCDEF

    63.0

    29.020.819.113.48.5

    7.0

    3.92.12.81.41.5

    441.00

    113.1043.6853.4818.762.75

    3969.00

    841.00432.64364.81179.5672.25

    49.00

    15.214.417.841.962.25

    x =

    153.8

    y =

    18.7

    xy =

    682.77

    x2 =

    5859.26

    y2 =

    80.67

  • 8/3/2019 Stats Chap10

    20/63

    Example 10-4: Car Rental Companies

    Compute the correlation coefficient for the data inExample 101.

    Bluman, Chapter 10 20

    x = 153.8, y = 18.7, xy = 682.77, x2 = 5859.26,

    y2 = 80.67, n = 6

    2 22 2

    ! - -

    n xy x yr

    n x x n y y

    2 2

    6 682.77 153.8 18.7

    6 5859.26 153.8 6 80.67 18.7

    !

    - -

    r

    0.982 (strong positive relationship)!r

  • 8/3/2019 Stats Chap10

    21/63

    Hypothesis Testing

    In hypothesis testing, one of the following istrue:

    H0: V ! 0 This null hypothesis means that

    there is no correlation between thexand yvariables in the population.

    H1: V { 0 This alternative hypothesis meansthat there is a significant correlationbetween the variables in thepopulation.

    Bluman, Chapter 10 21

  • 8/3/2019 Stats Chap10

    22/63

    tTest for the Correlation Coefficient

    Bluman, Chapter 10 22

    2

    2

    1

    with degrees of freedom equal to 2.

    !

    nt r

    r

    n

  • 8/3/2019 Stats Chap10

    23/63

    Chapter 10Correlation and Regression

    Section 10-1

    Example 10-3

    Page #544

    Bluman, Chapter 10 23

  • 8/3/2019 Stats Chap10

    24/63

    Example 10-3: Car Rental Companies

    Test the significance of the correlation coefficient found inExample 102. Use = 0.05 and r= 0.982.

    Step 1: State the hypotheses.

    H0: = 0 and H1: { 0

    Step 2: Find the critical value.

    Since = 0.05 and there are 6 2 = 4 degrees of

    freedom, the critical values obtained from Table Fare 2.776.

    Bluman, Chapter 10 24

  • 8/3/2019 Stats Chap10

    25/63

    Step 3: Compute the test value.

    Step 4:

    Make the de

    cision

    .

    Reject the null hypothesis.

    Step 5: Summarize the results.

    There is a significant relationship between thenumber of cars a rental agency owns and itsannual income.

    Example 10-3: Car Rental Companies

    Bluman, Chapter 10 25

    2

    2

    1

    !

    nt r

    r 2

    6 20.982 10.4

    1 0.982

    ! !

  • 8/3/2019 Stats Chap10

    26/63

    Chapter 10Correlation and Regression

    Section 10-1

    Example 10-4

    Page #545

    Bluman, Chapter 10 26

  • 8/3/2019 Stats Chap10

    27/63

    Example 10-4: Car Rental Companies

    Using Table I, test the significance of the correlationcoefficient r = 0.067, from Example 103, at = 0.01.

    Step 1: State the hypotheses.

    H0: = 0 and H1: { 0

    There are 9 2 = 7 degrees of freedom. The value inTable I when = 0.01 is 0.798.

    For a significant relationship, rmust be greater than 0.798or less than -0.798. Since r= 0.067, do not reject the null.

    Hence, there is not enough evidence to say that there is asignificant linear relationship between the variables.

    Bluman, Chapter 10 27

  • 8/3/2019 Stats Chap10

    28/63

    Possible Relationships Between

    VariablesWhen the null hypothesis has been rejected for a specifica value, any of the following five possibilities can exist.

    1. There is a direct cause-and-effectrelationship

    between the variables. That is,xcauses y.2. There is a reverse cause-and-effectrelationship

    between the variables. That is, ycausesx.

    3. The relationship between the variables may be

    caused by a third variable.4. There may be a complexity of interrelationships

    among many variables.

    5. The relationship may be coincidental.

    Bluman, Chapter 10 28

  • 8/3/2019 Stats Chap10

    29/63

    Possible Relationships Between

    Variables1. There is a reverse cause-and-effectrelationship

    between the variables. That is, ycausesx.

    For example,

    water causes plants to grow

    poison causes death

    heat causes ice to melt

    Bluman, Chapter 10 29

  • 8/3/2019 Stats Chap10

    30/63

    Possible Relationships Between

    Variables2. There is a reverse cause-and-effectrelationship

    between the variables. That is, ycausesx.

    For example,

    Suppose a researcher believes excessive coffeeconsumption causes nervousness, but theresearcher fails to consider that the reverse

    situation may occur. That is, it may be that anextremely nervous person craves coffee to calmhis or her nerves.

    Bluman, Chapter 10 30

  • 8/3/2019 Stats Chap10

    31/63

    Possible Relationships Between

    Variables3. The relationship between the variables may be

    caused by a third variable.

    For example,

    If a statistician correlated the number of deathsdue to drowning and the number of cans of softdrink consumed daily during the summer, he or

    she would probably find a significant relationship.However, the soft drink is not necessarilyresponsible for the deaths, since both variablesmay be related to heat and humidity.

    Bluman, Chapter 10 31

  • 8/3/2019 Stats Chap10

    32/63

    Possible Relationships Between

    Variables4. There may be a complexity of interrelationships

    among many variables.

    For example,

    A researcher may find a significant relationshipbetween students high school grades and collegegrades. But there probably are many other

    variables involved, such as IQ, hours of study,influence of parents, motivation, age, andinstructors.

    Bluman, Chapter 10 32

  • 8/3/2019 Stats Chap10

    33/63

    Possible Relationships Between

    Variables5. The relationship may be coincidental.

    For example,

    A researcher may be able to find a significantrelationship between the increase in the number ofpeople who are exercising and the increase in thenumber of people who are committing crimes. But

    common sense dictates that any relationshipbetween these two values must be due tocoincidence.

    Bluman, Chapter 10 33

  • 8/3/2019 Stats Chap10

    34/63

    10.2 Regression

    If the value of the correlation coefficient issignificant, the next step is to determinethe equation of the regression lineregression line

    which is the datas line of best fit.

    Bluman, Chapter 10 34

  • 8/3/2019 Stats Chap10

    35/63

    Regression

    Bluman, Chapter 10 35

    Best fitBest fit means that the sum of thesquares of the vertical distance fromeach point to the line is at a minimum.

  • 8/3/2019 Stats Chap10

    36/63

    Regression Line

    Bluman, Chapter 10 36

    d! y a bx

    2

    22

    22

    where

    = intercept

    = the slope of the line.

    !

    !

    d

    y x x xya

    n x x

    n xy x yb

    n x x

    a y

    b

  • 8/3/2019 Stats Chap10

    37/63

    Chapter 10Correlation and Regression

    Section 10-2

    Example 10-5

    Page #553

    Bluman, Chapter 10 37

  • 8/3/2019 Stats Chap10

    38/63

    Example 10-5: Car Rental Companies

    Find the equation of the regression line for the data inExample 102, and graph the line on the scatter plot.

    Bluman, Chapter 10 38

    x = 153.8, y = 18.7, xy = 682.77, x2 = 5859.26,

    y2 = 80.67, n = 6

    2

    22

    !

    y x x xya

    n x x

    22

    !

    n xy x yb

    n x x

    2

    18.7 5859.26 153.8 682.77

    6 5859.26 153.8

    !

    0.396!

    2

    6 682.77 153.8 18.7

    6 5859.26 153.8

    !

    0.106!

    0.396 0.106d d! p ! y a bx y x

  • 8/3/2019 Stats Chap10

    39/63

    Find two points to sketch the graph of the regression line.

    Use anyxvalues between 10 and 60. For example, letxequal 15 and 40. Substitute in the equation and find thecorresponding yvalue.

    Plot (15,1.986) and (40,4.636), and sketch the resultingline.

    Example 10-5: Car Rental Companies

    Bluman, Chapter 10 39

    0.396 0.106

    0.396 0.106 15

    1.986

    d!

    !

    !

    y x

    0.396 0.106

    0.396 0.106 40

    4.636

    d!

    !

    !

    y x

  • 8/3/2019 Stats Chap10

    40/63

    Example 10-5: Car Rental Companies

    Find the equation of the regression line for the data inExample 102, and graph the line on the scatter plot.

    Bluman, Chapter 10 40

    0.396 0.106d! y x

    15, 1.986

    40, 4.636

  • 8/3/2019 Stats Chap10

    41/63

    Chapter 10Correlation and Regression

    Section 10-2

    Example 10-6

    Page #555

    Bluman, Chapter 10 41

  • 8/3/2019 Stats Chap10

    42/63

    Use the equation of the regression line to predict theincome of a car rental agency that has 200,000automobiles.

    x = 20 corresponds to 200,000 automobiles.

    Hence, when a rental agency has 200,000 automobiles, itsrevenue will be approximately $2.516 billion.

    Example 10-11: Car Rental Companies

    Bluman, Chapter 10 42

    0.396 0.106

    0.396 0.106 20

    2.516

    d!

    !

    !

    y x

  • 8/3/2019 Stats Chap10

    43/63

    Regression

    The magnitude of the change in one variablewhen the other variable changes exactly 1 unitis called a marginal changemarginal change. The value ofslope b of the regression line equationrepresents the marginal change.

    For valid predictions, the value of thecorrelation coefficient must be significant.

    When ris not significantly different from 0, thebest predictor ofyis the mean of the datavalues ofy.

    Bluman, Chapter 10 43

  • 8/3/2019 Stats Chap10

    44/63

    Assumptions for Valid Predictions

    1. For any specific value of the independentvariablex, the value of the dependent variableymust be normally distributed about theregression line. See Figure 1016(a).

    2. The standard deviation of each of thedependent variables must be the same foreach value of the independent variable. See

    Figure 1016(b).

    Bluman, Chapter 10 44

  • 8/3/2019 Stats Chap10

    45/63

    Extrapolations (Future Predictions)

    ExtrapolationExtrapolation, or making predictions beyondthe bounds of the data, must be interpretedcautiously.

    Remember that when predictions are made,they are based on present conditions or on thepremise that present trends will continue. Thisassumption may or may not prove true in the

    future.

    Bluman, Chapter 10 45

  • 8/3/2019 Stats Chap10

    46/63

    Procedure Table

    Step 1: Make a table with subject,x, y,xy,x2,and y2 columns.

    Step 2: Find the values ofxy,x2, and y2. Place

    them in the appropriate columns and sumeach column.

    Step 3: Substitute in the formula to find the valueofr.

    Step 4: When ris significant, substitute in theformulas to find the values ofa and b forthe regression line equation yd = a +bx.

    Bluman, Chapter 10 46

  • 8/3/2019 Stats Chap10

    47/63

    10.3 Coefficient of Determination

    and Standard Error of the Estimate The total variationtotal variation is the

    sum of the squares of the vertical

    distances each point is from the mean. The total variation can be divided into two

    parts: that which is attributed to the

    relationship ofxand y, and that which isdue to chance.

    Bluman, Chapter 10 47

    2

    y y

  • 8/3/2019 Stats Chap10

    48/63

    Variation

    The variation obtained from therelationship (i.e., from the predicted y'values) is and is called the

    ex

    plained variationex

    plained variation. Variation due to chance, found by

    , is called the unexplainedunexplainedvariationvariation. This variation cannot beattributed to the relationships.

    Bluman, Chapter 10 48

    2

    d y y

    2

    d y y

  • 8/3/2019 Stats Chap10

    49/63

    Variation

    Bluman, Chapter 10 49

  • 8/3/2019 Stats Chap10

    50/63

    Coefficient of Determiation

    The coefficient of determinationcoefficient of determination is theratio of the explained variation to the totalvariation.

    The symbol for the coefficient ofdetermination is r2.

    Another way to arrive at the value forr2

    is to square the correlation coefficient.

    Bluman, Chapter 10 50

    2 explained variation

    total variation

    !r

  • 8/3/2019 Stats Chap10

    51/63

    Coefficient of Nondetermiation

    The coefficient ofcoefficient of nondeterminationnondetermination isa measure of the unexplained variation.

    The formula for the coefficient of

    determination is 1.00 r2

    .

    Bluman, Chapter 10 51

  • 8/3/2019 Stats Chap10

    52/63

    Standard Error of the Estimate

    The standard error of estimatestandard error of estimate,denoted by sest is the standard deviationof the observed yvalues about the

    predicted y'values. The formula for thestandard error of estimate is:

    Bluman, Chapter 10 52

    2

    2

    d

    !

    est

    y ys

    n

  • 8/3/2019 Stats Chap10

    53/63

    Chapter 10Correlation and Regression

    Section 10-3

    Example 10-7

    Page #569

    Bluman, Chapter 10 53

  • 8/3/2019 Stats Chap10

    54/63

    A researcher collects the following data and determinesthat there is a significant relationship between the age of acopy machine and its monthly maintenance cost. Theregression equation is yd = 55.57 + 8.13x. Find thestandard error of the estimate.

    Example 10-7: Copy Machine Costs

    Bluman, Chapter 10 54

  • 8/3/2019 Stats Chap10

    55/63

    Example 10-7: Copy Machine Costs

    Bluman, Chapter 10 55

    MachineAgex(years)

    Monthlycost, y yd y yd (y yd)2

    ABCD

    EF

    1234

    46

    62787090

    93103

    63.7071.8379.9688.09

    88.09104.35

    -1.706.17

    -9.961.91

    4.91-1.35

    2.8938.068999.20163.6481

    24.10811.8225

    169.7392

    55.57 8.13

    55.57 8.13 1 63.70

    55.57 8.13 2 71.8355.57 8.13 3 79.96

    55.57 8.13 4 88.09

    55.57 8.13 6 104.35

    d!

    d! !

    d

    ! !d! !

    d! !

    d! !

    y x

    y

    yy

    y

    y

    2

    2

    d!

    est

    y ys

    n

    169.73926.51

    4! !

    ests

  • 8/3/2019 Stats Chap10

    56/63

    Chapter 10Correlation and Regression

    Section 10-3

    Example 10-8

    Page #570

    Bluman, Chapter 10 56

  • 8/3/2019 Stats Chap10

    57/63

    Example 10-8: Copy Machine Costs

    Bluman, Chapter 10 57

    2

    2

    !

    est

    y a y b xys

    n

  • 8/3/2019 Stats Chap10

    58/63

    Example 10-8: Copy Machine Costs

    Bluman, Chapter 10 58

    496 1778

    MachineAgex(years)

    Monthlycost, y xy y 2

    ABCD

    EF

    1234

    46

    62787090

    93103

    62156210360

    372618

    3,8446,0844,9008,100

    8,64910,609

    2

    2

    !

    est

    y a y b xys

    n

    42,186

    42,186 55.57 496 8.13 17786.48

    4

    ! !

    ests

  • 8/3/2019 Stats Chap10

    59/63

    Formula for the Prediction Interval

    about a Value yd

    Bluman, Chapter 10 59

    2

    2 22

    2

    2 22

    11

    11

    with d.f. = - 2

    d

    d

    est

    est

    n x Xy t s y

    n n x x

    n x Xy t s

    n n x x

    n

    E

    E

  • 8/3/2019 Stats Chap10

    60/63

    Chapter 10Correlation and Regression

    Section 10-3

    Example 10-9Page #571

    Bluman, Chapter 10 60

  • 8/3/2019 Stats Chap10

    61/63

    For the data in Example 107, find the 95% predictioninterval for the monthly maintenance cost of a machinethat is 3 years old.

    Step 1: Find

    Step 2: Find yd forx = 3.

    Step 3: Find sest.

    (as shown in Example 10-13)

    Example 10-9: Copy Machine Costs

    Bluman, Chapter 10 61

    2, , and . x x X 2 2020 82 3.3

    6! ! ! ! x x X

    55.57 8.13 3 79.96d! !y

    6.48!est

    s

  • 8/3/2019 Stats Chap10

    62/63

    Step 4: Substitute in the formula and solve.

    Example 10-9: Copy Machine Costs

    Bluman, Chapter 10 62

    2

    2 22

    2

    2 22

    11

    11

    d

    d

    est

    est

    n x Xy t s y

    n n x x

    n x Xy t s

    n n x x

    E

    E

    2

    2

    2

    2

    6 3 3.3179.96 2.776 6.48 1

    66 82 20

    6 3 3.3179.96 2.776 6.48 1

    6 6 82 20

    y

  • 8/3/2019 Stats Chap10

    63/63

    Step 4: Substitute in the formula and solve.

    Example 10-9: Copy Machine Costs

    2

    2

    2

    2

    6 3 3.3179.96 2.776 6.48 1

    6 6 82 20

    6 3 3.31

    79.96 2.776 6.48 1 6 6 82 20

    y

    79.96 19.43 79.96 19.43

    60.53 99.39

    y

    y

    Hence, you can be 95% confident that the interval60.53 < y < 99.39 contains the actual value ofy.