Week 6 in Class Lecture


  • 8/10/2019 Week 6 in Class Lecture


    Week 6

    Simple Linear Regression

    Models

    Representation of some phenomenon

    A mathematical model is a mathematical expression of some phenomenon

    Often describes relationships between variables

    Types: deterministic models and probabilistic models

    Deterministic Models

    Hypothesize exact relationships

    Suitable when prediction error is negligible

    Example: force is exactly mass times acceleration

    $F = ma$

    Probabilistic Models

    Hypothesize two components: deterministic and random error

    Example: sales volume (y) is 10 times advertising spending (x) + random error

    $y = 10x + \varepsilon$

    Random error may be due to factors other than advertising
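The sales example above can be simulated in a few lines. This is only a sketch: the slide does not say how the random error is distributed, so the normal distribution, the standard deviation of 2, and the name `simulate_sales` are assumptions made for illustration.

```python
import random


def simulate_sales(x, slope=10.0, noise_sd=2.0, rng=None):
    """Return y = slope * x + random error.

    The error is drawn from Normal(0, noise_sd); both the normal shape and
    noise_sd = 2 are assumptions, not values given in the lecture.
    """
    rng = rng or random.Random()
    return slope * x + rng.gauss(0.0, noise_sd)


rng = random.Random(42)          # fixed seed so the run is repeatable
x = 5.0                          # advertising spending
draws = [simulate_sales(x, rng=rng) for _ in range(10_000)]

deterministic = 10.0 * x         # deterministic component, 10x
mean_y = sum(draws) / len(draws) # average of the simulated sales values

print(f"deterministic component: {deterministic}")
print(f"average simulated y:     {mean_y:.2f}")
```

Individual draws scatter around 10x, but their average stays close to the deterministic component because the error has mean 0.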

    General Form of Probabilistic Models

    y = Deterministic component + Random error

    where y is the variable of interest.

    We always assume that the mean value of the random error equals 0:

    E(y) = Deterministic component

    A First-Order (Straight Line) Probabilistic Model

    $y = \beta_0 + \beta_1 x + \varepsilon$

    where

    y = Dependent or response variable (variable to be modeled)

    x = Independent or predictor variable (variable used as a predictor of y)

    $E(y) = \beta_0 + \beta_1 x$ = Deterministic component

    $\varepsilon$ (epsilon) = Random error component

    A First-Order (Straight Line) Probabilistic Model

    $y = \beta_0 + \beta_1 x + \varepsilon$

    $\beta_0$ (beta zero) = y-intercept of the line, that is, the point at which the line intercepts or cuts through the y-axis

    $\beta_1$ (beta one) = slope of the line, that is, the change (amount of increase or decrease) in the deterministic component of y for every 1-unit increase in x

    A First-Order (Straight Line) Probabilistic Model

    A positive slope implies that E(y) increases by the amount $\beta_1$ for each unit increase in x.

    A negative slope implies that E(y) decreases by the amount $|\beta_1|$ for each unit increase in x.

    Five-Step Procedure

    Step 1: Hypothesize the deterministic component of the model that relates the mean, E(y), to the independent variable x.

    Step 2: Use the sample data to estimate unknown parameters in the model.

    Step 3: Specify the probability distribution of the random error and estimate the standard deviation of this distribution.

    Step 4: Statistically evaluate the usefulness of the model.

    Step 5: When satisfied that the model is useful, use it for prediction, estimation, and other purposes.

    Scattergram

    1. Plot of all $(x_i, y_i)$ pairs

    2. Suggests how well model will fit

    [Figure: scattergram of y versus x, both axes from 0 to 60]

    Thinking Challenge

    [Figure: scattergram of y versus x, both axes from 0 to 60]

    How would you draw a line through the points?

    How do you determine which line fits best?

    Least Squares Line

    The least squares line is one that has the following two properties:

    1. The sum of the errors equals 0, i.e., mean error = 0.

    2. The sum of squared errors (SSE) is smaller than for any other straight-line model, i.e., the error variance is minimum.
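For reference, the line with these two properties has a standard closed form. The shorthand $SS_{xx}$ and $SS_{xy}$ is introduced here for convenience and does not appear on the slide.

```latex
\begin{aligned}
SS_{xy} &= \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad
SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
\hat{\beta}_1 &= \frac{SS_{xy}}{SS_{xx}}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\
\hat{y}_i &= \hat{\beta}_0 + \hat{\beta}_1 x_i, \qquad
\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\end{aligned}
```

Because $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, the fitted line passes through $(\bar{x}, \bar{y})$, which is what forces the errors to sum to 0.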

    Interpreting the Estimates of $\beta_0$ and $\beta_1$ in Simple Linear Regression

    y-intercept: represents the predicted value of y when x = 0 (Caution: This value will not be meaningful if the value x = 0 is nonsensical or outside the range of the sample data.)

    slope: represents the increase (or decrease) in y for every 1-unit increase in x (Caution: This interpretation is valid only for x-values within the range of the sample data.)

    [Minitab screenshot: Dependent and Independent variable selection]

    Least Squares Example

    You're an economist for the county cooperative. You gather the following data:

    Fertilizer (lb.)    Yield (lb.)
    4                   3.0
    6                   5.5
    10                  6.5
    12                  9.0

    Find the least squares line relating crop yield and fertilizer.
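As a cross-check on the Minitab result, the fit can be reproduced by hand. This is a minimal sketch using the standard closed-form estimates; the function name `least_squares_line` is made up for this example.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


fertilizer = [4, 6, 10, 12]        # x, in lb.
crop_yield = [3.0, 5.5, 6.5, 9.0]  # y, in lb.

b0, b1 = least_squares_line(fertilizer, crop_yield)

# Errors (residuals) of the fitted line and their sum of squares.
residuals = [y - (b0 + b1 * x) for x, y in zip(fertilizer, crop_yield)]
sse = sum(e ** 2 for e in residuals)

print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
print(f"sum of errors = {sum(residuals):.6f}, SSE = {sse:.2f}")
```

The residuals sum to zero (up to floating-point rounding), which is the first property of the least squares line.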

    Scattergram

    Crop Yield vs. Fertilizer

    Stat -> Regression -> Fitted Line Plot

    Coefficient Interpretation Solution

    $\hat{y} = 0.8 + 0.65x$

    1. Slope ($\hat{\beta}_1$): Crop Yield (y) is expected to increase by 0.65 lb. for each 1 lb. increase in Fertilizer (x).

    2. y-Intercept ($\hat{\beta}_0$): Since 0 is outside of the range of the sampled values of x, the y-intercept has no meaningful interpretation.


    Basic Assumptions of the Probability Distribution


    A Test of Model Utility: Simple Linear Regression

    One-Tailed Test

    $H_0: \beta_1 = 0$

    $H_a: \beta_1 < 0$ (or $H_a: \beta_1 > 0$)

    Test statistic: $t = \hat{\beta}_1 / s_{\hat{\beta}_1}$

    Rejection region: $t < -t_\alpha$ (or $t > t_\alpha$ when $H_a: \beta_1 > 0$)

    where $t_\alpha$ is based on $(n - 2)$ degrees of freedom

    A Test of Model Utility: Simple Linear Regression

    Two-Tailed Test

    $H_0: \beta_1 = 0$

    $H_a: \beta_1 \neq 0$

    Test statistic: $t = \hat{\beta}_1 / s_{\hat{\beta}_1}$

    Rejection region: $|t| > t_{\alpha/2}$

    where $t_{\alpha/2}$ is based on $(n - 2)$ degrees of freedom

    Interpreting p-Values for Coefficients in Regression

    Almost all statistical computer software packages report a two-tailed p-value for each of the parameters in the regression model. For example, in simple linear regression, the p-value for the two-tailed test $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$ is given on the printout. If you want to conduct a one-tailed test of hypothesis, you will need to adjust the p-value accordingly.
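The adjustment can be sketched as a small helper. It uses the usual rule: halve the two-tailed p-value when the sign of t agrees with the alternative, otherwise take one minus that half. The function name and the sample numbers are illustrative, not taken from a particular printout.

```python
def one_tailed_p_value(two_tailed_p, t_stat, alternative):
    """Convert a two-tailed p-value for the slope into a one-tailed p-value.

    alternative: "greater" for Ha: beta1 > 0, "less" for Ha: beta1 < 0.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    half = two_tailed_p / 2.0
    # The sample supports Ha only if t points in the direction Ha claims.
    if alternative == "greater":
        agrees = t_stat > 0
    else:
        agrees = t_stat < 0
    return half if agrees else 1.0 - half


# Illustrative values: a reported two-tailed p-value of 0.035 with t = 3.656.
p_upper = one_tailed_p_value(0.035, 3.656, "greater")
p_lower = one_tailed_p_value(0.035, 3.656, "less")
print(f"Ha: beta1 > 0 -> p = {p_upper:.4f}")
print(f"Ha: beta1 < 0 -> p = {p_lower:.4f}")
```

Halving is only appropriate when the estimated slope has the sign the alternative hypothesis predicts; the second call shows what happens when it does not.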

    Test of Slope Coefficient Example

    You're a marketing analyst for Hasbro Toys. You find $\hat{\beta}_0 = -0.1$, $\hat{\beta}_1 = 0.7$, and s = 0.6055.

    Ad Expenditure (100$)    Sales (Units)
    1                        1
    2                        1
    3                        2
    4                        2
    5                        4

    Is the relationship significant at the .05 level of significance?
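The quantities in this example can be recomputed from the five data points. The sketch below uses only the standard library, so it compares t with the tabled critical value for 3 degrees of freedom (3.182 for an upper-tail area of .025) instead of computing a p-value; variable names are chosen for this example.

```python
import math

ad = [1, 2, 3, 4, 5]     # ad expenditure, in units of $100
sales = [1, 1, 2, 2, 4]  # sales, in units

n = len(ad)
x_bar = sum(ad) / n
y_bar = sum(sales) / n
ss_xx = sum((x - x_bar) ** 2 for x in ad)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad, sales))

b1 = ss_xy / ss_xx       # estimated slope
b0 = y_bar - b1 * x_bar  # estimated intercept

# s estimates the standard deviation of the random error, with n - 2 df.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ad, sales))
s = math.sqrt(sse / (n - 2))

se_b1 = s / math.sqrt(ss_xx)  # standard error of the slope
t_stat = b1 / se_b1           # test statistic for H0: beta1 = 0

T_CRITICAL = 3.182            # tabled t for alpha/2 = .025, df = 3
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, s = {s:.4f}")
print(f"SE(b1) = {se_b1:.6f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```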

    Test of Slope Coefficient Solution

    $H_0: \beta_1 = 0$. If the slope ($\beta_1$) is zero, then there is no relationship.

    $H_a: \beta_1 \neq 0$. This is the claim: there is a relationship because the slope is not zero.

    Test of Slope Coefficient Solution

    $H_0: \beta_1 = 0$

    $H_a: \beta_1 \neq 0$

    $\alpha = .05$

    df = 5 - 2 = 3

    Critical value(s): $t = \pm 3.182$ (area .025 in each tail); reject $H_0$ if t < -3.182 or t > 3.182

    Inverse Cumulative Distribution Function, Student's t distribution with 3 DF (... -> Probability Distributions -> t)

    Test of Slope Coefficient Computer Output

    Stat -> Regression -> General Regression

    General Regression Analysis: Sales (Units) versus Ad Expenditure (100$)

    Regression Equation

    Sales (Units) = -0.1 + 0.7 Ad Expenditure (100$)

    Coefficients

    Term                     Coef   SE Coef    T          P
    Constant                 -0.1   0.635085   -0.15746   0.885
    Ad Expenditure (100$)    0.7    0.191485   3.65563    0.035

    Summary of Model

    S = 0.605530    R-Sq = 81.67%    R-Sq(adj) = 75.56%
    PRESS = 4.43367    R-Sq(pred) = 26.11%

    Analysis of Variance

    Source                     DF   Seq SS   Adj SS   Adj MS    F         P
    Regression                 1    4.9      4.9      4.90000   13.3636   0.0353528
      Ad Expenditure (100$)    1    4.9      4.9      4.90000   13.3636   0.0353528

    In the Coefficients table, the Coef column gives $\hat{\beta}_0 = -0.1$ (Constant) and $\hat{\beta}_1 = 0.7$ (Ad Expenditure); the T and P columns give the t statistic and its p-value.

    Test of Slope Coefficient Solution

    $H_0: \beta_1 = 0$

    $H_a: \beta_1 \neq 0$

    $\alpha = .05$

    df = 5 - 2 = 3

    Critical value(s): $t = \pm 3.182$ (area .025 in each tail); reject $H_0$ if t < -3.182 or t > 3.182

    Test statistic: $t \approx 3.657$

    Decision: Reject $H_0$ at $\alpha = .05$ because t = 3.657 > 3.182, or equivalently because the P-value (0.035) is smaller than $\alpha$.

    Conclusion: There is evidence of a relationship.

    Correlation Models

    Answers: how strong is the linear relationship between two variables?

    Coefficient of correlation
      Sample correlation coefficient denoted r
      Population correlation coefficient denoted $\rho$
      Values range from -1 to +1


    Coefficient of Correlation


    Coefficient of Correlation Solution

    r = 0.9038805

    $r \approx 0.904$: strong positive relation between x and y

    Stat -> Regression -> Fitted Line Plot

    Coefficient of Correlation Example

    You're an economist for the county cooperative. You gather the following data:

    Fertilizer (lb.)    Yield (lb.)
    4                   3.0
    6                   5.5
    10                  6.5
    12                  9.0

    Find the coefficient of correlation.
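One way to check the answer without Minitab is to apply the usual sample correlation formula, r = SSxy / sqrt(SSxx * SSyy), directly. A minimal sketch follows; the helper name `correlation` is chosen for this example.

```python
import math


def correlation(xs, ys):
    """Sample coefficient of correlation r between paired observations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return ss_xy / math.sqrt(ss_xx * ss_yy)


fertilizer = [4, 6, 10, 12]        # x, in lb.
crop_yield = [3.0, 5.5, 6.5, 9.0]  # y, in lb.

r = correlation(fertilizer, crop_yield)
print(f"r = {r:.3f}")
```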

    Coefficient of Correlation Solution

    $r \approx 0.956$

    Stat -> Regression -> Fitted Line Plot

    Coefficient of Determination

    It represents the proportion of the total sample variability around $\bar{y}$ that is explained by the linear relationship between y and x.

    $0 \le r^2 \le 1$

    $r^2 = (\text{coefficient of correlation})^2$

    Coefficient of Determination Example

    You're a marketing analyst for Hasbro Toys. You know r = .904.

    Ad Expenditure (100$)    Sales (Units)
    1                        1
    2                        1
    3                        2
    4                        2
    5                        4

    Calculate and interpret the coefficient of determination.

    Coefficient of Determination Solution

    $r^2 = (\text{coefficient of correlation})^2$

    $r^2 = (.904)^2$

    $r^2 = .817$

    Interpretation: About 81.7% of the sample variation in Sales (y) can be explained by using Ad $ (x) to predict Sales (y) in the linear model. The remaining 18.3% is due to other factors.
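The same number can be reached from the "explained variability" definition, as 1 - SSE / SSyy, instead of squaring r. The sketch below does that for the ad expenditure data; the function name `r_squared` is illustrative.

```python
def r_squared(xs, ys):
    """Proportion of the variability in ys explained by the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_yy = sum((y - y_bar) ** 2 for y in ys)  # total variability around y-bar
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    return 1.0 - sse / ss_yy


ad = [1, 2, 3, 4, 5]     # ad expenditure, in units of $100
sales = [1, 1, 2, 2, 4]  # sales, in units

r2 = r_squared(ad, sales)
print(f"r^2 = {r2:.4f}")
```

Squaring the rounded r = .904 gives a value that differs from this one only in the fourth decimal place.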

    Regression Modeling Steps

    1. Hypothesize deterministic component

    2. Estimate unknown model parameters

    3. Specify probability distribution of random error term; estimate standard deviation of error

    4. Evaluate model

    5. Use model for prediction and estimation

    Prediction With Regression Models

    Types of predictions
      Point estimates
      Interval estimates

    What is predicted
      Population mean value of y, E(y), for given x (confidence interval)
      Individual response ($y_i$) for given x (prediction interval)

    Confidence Interval Estimate Example

    You're a marketing analyst for Hasbro Toys. You find $\hat{\beta}_0 = -0.1$, $\hat{\beta}_1 = 0.7$, and s = 0.6055.

    Ad Expenditure (100$)    Sales (Units)
    1                        1
    2                        1
    3                        2
    4                        2
    5                        4

    Find a 95% confidence interval for the mean sales when advertising is $400 (x = 4).
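A hand calculation of this interval might look like the sketch below. It uses the standard formula, fit ± t * s * sqrt(1/n + (x_new - x_bar)^2 / SSxx), with the tabled value t = 3.182 for 3 degrees of freedom, so the endpoints agree with software output only to about three decimals. The function name `mean_response_ci` is made up for this example.

```python
import math

T_CRIT = 3.182  # tabled t for alpha/2 = .025 with n - 2 = 3 df


def mean_response_ci(xs, ys, x_new, t_crit):
    """Return (fit, lower, upper): a confidence interval for E(y) at x_new."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    fit = b0 + b1 * x_new
    # Standard error of the estimated mean response at x_new.
    se_fit = s * math.sqrt(1.0 / n + (x_new - x_bar) ** 2 / ss_xx)
    margin = t_crit * se_fit
    return fit, fit - margin, fit + margin


ad = [1, 2, 3, 4, 5]     # ad expenditure, in units of $100
sales = [1, 1, 2, 2, 4]  # sales, in units

fit, lower, upper = mean_response_ci(ad, sales, 4, T_CRIT)
print(f"fit = {fit:.2f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```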


    Prediction Interval Solution


    Interval Estimate Computer Output

    General Regression Analysis: Sales (Units) versus Ad Expenditure (100$)

    Regression Equation

    Sales (Units) = -0.1 + 0.7 Ad Expenditure (100$)

    Coefficients

    Term                     Coef   SE Coef    T          P
    Constant                 -0.1   0.635085   -0.15746   0.885
    Ad Expenditure (100$)    0.7    0.191485   3.65563    0.035

    Summary of Model

    S = 0.605530    R-Sq = 81.67%    R-Sq(adj) = 75.56%
    PRESS = 4.43367    R-Sq(pred) = 26.11%

    Analysis of Variance

    Source                     DF   Seq SS   Adj SS   Adj MS    F         P
    Regression                 1    4.9      4.9      4.90000   13.3636   0.0353528
      Ad Expenditure (100$)    1    4.9      4.9      4.90000   13.3636   0.0353528
    Error                      3    1.1      1.1      0.36667
    Total                      4    6.0

    Fits and Diagnostics for Unusual Observations

    No unusual observations

    Predicted Values for New Observations

    New Obs   Fit   SE Fit     95% CI                95% PI
    1         2.7   0.331662   (1.64450, 3.75550)    (0.502806, 4.89719)

    Values of Predictors for New Observations

    New Obs   Ad Expenditure (100$)
    1         4

    Interval Estimate Computer Output

    Fits and Diagnostics for Unusual Observations

    No unusual observations

    Predicted Values for New Observations

    New Obs   Fit   SE Fit     95% CI                95% PI
    1         2.7   0.331662   (1.64450, 3.75550)    (0.502806, 4.89719)

    Values of Predictors for New Observations

    New Obs   Ad Expenditure (100$)
    1         4

    Fit = 2.7 is the predicted y when x = 4, SE Fit is $s_{\hat{y}}$, and the 95% CI column gives the confidence interval.

    Prediction Interval Example

    You're a marketing analyst for Hasbro Toys. You find $\hat{\beta}_0 = -0.1$, $\hat{\beta}_1 = 0.7$, and s = 0.6055.

    Ad Expenditure (100$)    Sales (Units)
    1                        1
    2                        1
    3                        2
    4                        2
    5                        4

    Predict the sales when advertising is $400 (x = 4). Use a 95% prediction interval.


    Prediction Interval Solution


    Interval Estimate Computer Output

    Fits and Diagnostics for Unusual Observations

    No unusual observations

    Predicted Values for New Observations

    New Obs   Fit   SE Fit     95% CI                95% PI
    1         2.7   0.331662   (1.64450, 3.75550)    (0.502806, 4.89719)

    Values of Predictors for New Observations

    New Obs   Ad Expenditure (100$)
    1         4

    Fit = 2.7 is the predicted y when x = 4, SE Fit is $s_{\hat{y}}$, and the 95% PI column gives the prediction interval.

    Confidence Intervals v. Prediction Intervals

    The prediction interval is always wider than the corresponding confidence interval

    Added uncertainty involved in predicting a single response versus the mean response
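The width comparison can be seen numerically with the ad expenditure data. In the sketch below the only difference between the two half-widths is the extra "1 +" under the square root, which accounts for the random error of a single response. The name `interval_half_widths` and the local variable `leverage` are chosen for this example, and t = 3.182 is the tabled value for 3 degrees of freedom.

```python
import math

T_CRIT = 3.182  # tabled t for alpha/2 = .025 with n - 2 = 3 df


def interval_half_widths(xs, ys, x_new, t_crit):
    """Return (ci_half_width, pi_half_width) at x_new for a fitted straight line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    leverage = 1.0 / n + (x_new - x_bar) ** 2 / ss_xx
    ci_half = t_crit * s * math.sqrt(leverage)        # mean response
    pi_half = t_crit * s * math.sqrt(1.0 + leverage)  # single response
    return ci_half, pi_half


ad = [1, 2, 3, 4, 5]     # ad expenditure, in units of $100
sales = [1, 1, 2, 2, 4]  # sales, in units

widths = {x: interval_half_widths(ad, sales, x, T_CRIT) for x in ad}
for x, (ci_half, pi_half) in widths.items():
    print(f"x = {x}: CI half-width = {ci_half:.3f}, PI half-width = {pi_half:.3f}")
```

Both intervals are narrowest at the mean of x (here x = 3) and widen as x moves away from it.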

    Example

    Suppose a fire insurance company wants to relate the amount of fire damage in major residential fires to the distance between the burning house and the nearest fire station. The study is to be conducted in a large suburb of a major city; a sample of 15 recent fires in this suburb is selected. The amount of damage, y, and the distance between the fire and the nearest fire station, x, are recorded for each fire.

    Example

    Step 1: First, we hypothesize a model to relate fire damage, y, to the distance from the nearest fire station, x. We hypothesize a straight-line probabilistic model:

    $y = \beta_0 + \beta_1 x + \varepsilon$

    Example

    Step 2: Use a statistical software package to estimate the unknown parameters in the deterministic component of the hypothesized model. The least squares estimates of the slope $\beta_1$ and intercept $\beta_0$, highlighted on the printout, are

    $\hat{\beta}_1 = 4.919$

    $\hat{\beta}_0 = 10.278$

    Example

    General Regression Analysis: DAMAGE versus DISTANCE

    Regression Equation (least squares equation)

    DAMAGE = 10.2779 + 4.91933 DISTANCE

    Coefficients

    Term       Coef      SE Coef   T         P
    Constant   10.2779   1.42028   7.2366    0.000
    DISTANCE   4.9193    0.39275   12.5254   0.000

    Summary of Model

    S = 2.31635    R-Sq = 92.35%    R-Sq(adj) = 91.76%
    PRESS = 93.2117    R-Sq(pred) = 89.77%

    Analysis of Variance

    Source       DF   Seq SS    Adj SS    Adj MS    F         P
    Regression   1    841.766   841.766   841.766   156.886   0.0000000
      DISTANCE   1    841.766   841.766   841.766   156.886   0.0000000
    Error        13   69.751    69.751    5.365
    Total        14   911.517

    Fits and Diagnostics for Unusual Observations

    No unusual observations
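Several entries on this printout are tied together by simple identities, which makes a handy sanity check when reading regression output. The sketch below recomputes them from the sums of squares and the coefficient row; all numbers are copied from the printout, and the variable names are chosen for this example.

```python
import math

# Values copied from the printout.
ss_regression = 841.766
ss_error = 69.751
ss_total = 911.517
df_error = 13
slope_coef = 4.9193
slope_se = 0.39275

r_sq = ss_regression / ss_total             # R-Sq: explained / total
s = math.sqrt(ss_error / df_error)          # S: sqrt of mean square error
t_slope = slope_coef / slope_se             # T: Coef / SE Coef
f_stat = ss_regression / (ss_error / df_error)  # F: MS regression / MS error

print(f"R-Sq = {100 * r_sq:.2f}%")
print(f"S = {s:.5f}")
print(f"T (DISTANCE) = {t_slope:.4f}")
print(f"F = {f_stat:.3f}, T squared = {t_slope ** 2:.3f}")
```

In simple linear regression the F statistic equals the square of the slope's t statistic, up to rounding in the printed values.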

    Example

    This prediction equation is graphed in the Minitab Fitted Line Plot.

    Example

    The least squares estimate of the slope, $\hat{\beta}_1 = 4.919$, implies that the estimated mean damage increases by $4,919 for each additional mile from the fire station (with damage y recorded in thousands of dollars). This interpretation is valid over the range of x, or from .7 to 6.1 miles from the station. The estimated y-intercept, $\hat{\beta}_0 = 10.278$, has no practical interpretation because x = 0 is outside the sampled range.
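The interpretation above translates into a small prediction helper that refuses to extrapolate. This is a sketch: the coefficients come from the printed regression equation, the range limits are the .7 and 6.1 miles quoted above, and the function name `predict_damage` plus the reading of damage in thousands of dollars are assumptions made for this example.

```python
B0 = 10.2779              # estimated intercept from the printout
B1 = 4.91933              # estimated slope from the printout
X_MIN, X_MAX = 0.7, 6.1   # sampled range of distance, in miles


def predict_damage(distance_miles):
    """Estimated mean fire damage (assumed to be in thousands of dollars).

    Raises ValueError outside the sampled range, where the straight-line
    interpretation is not supported by the data.
    """
    if not X_MIN <= distance_miles <= X_MAX:
        raise ValueError(
            f"distance {distance_miles} is outside the sampled range "
            f"[{X_MIN}, {X_MAX}] miles"
        )
    return B0 + B1 * distance_miles


damage_at_3_5 = predict_damage(3.5)
damage_at_4_5 = predict_damage(4.5)
print(f"estimated mean damage at 3.5 miles: {damage_at_3_5:.3f}")
print(f"increase for one additional mile:   {damage_at_4_5 - damage_at_3_5:.3f}")
```

Calling `predict_damage(0)` raises an error, which mirrors the point that the y-intercept has no practical interpretation here.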

    Example

    Step 4: First, test the null hypothesis that the slope $\beta_1$ is 0 (that is, that there is no linear relationship between fire damage and the distance from the nearest fire station) against the alternative hypothesis that fire damage increases as the distance increases. We test

    $H_0: \beta_1 = 0$

    $H_a: \beta_1 > 0$

    The two-tailed observed significance level for testing is approximately 0. Dividing by 2, the one-tailed p-value is also approximately 0. (P