
KFS2312 Topic 10(1t)


8/2/2019 KFS2312 Topic 10(1t)

http://slidepdf.com/reader/full/kfs2312-topic-101t 1/27

INTRODUCTION

In Topic 8, we learned how to check visually for a relationship between two variables using the two-way scatter plot, and how to measure the strength of that relationship using correlation. If a relationship exists, we would like to describe its form. Once the relationship has been expressed as an equation, we can predict the value of one variable given the value of the other. The statistical method used to examine a linear relationship between two variables is called Simple Linear Regression. Only quantitative variables are considered in this topic.

LEARNING OUTCOMES

By the end of this topic, you should be able to:

1. Explain regression concepts;

2. Construct a simple linear regression model and identify the assumptions made;

3. Show mathematically, using the least squares estimate method, how a regression model is constructed;

4. Identify inferential concepts for the regression parameters;

5. Use appropriate methods to evaluate data suitability in fitting a regression model; and

6. Use regression analysis for prediction and variable estimation.

Topic 10: Simple Linear Regression Analysis


  TOPIC 10 CORRELATION & REGRESSION

10.1 INTRODUCTION TO REGRESSION CONCEPTS

Regression analysis deals with finding the best relationship between a dependent variable y and an independent variable x, quantifying the strength of that relationship, and using methods that allow prediction of the response values (y) given values of the regressor x. The value of y can only be determined if the values of the independent variables (denoted by $x_1, x_2, \ldots, x_k$, where k is the number of independent variables) are known.

Examples of dependent variables are the amount of electrical consumption in a house, the profit made by a company, students' final examination grades, the selling price of a house, etc. These are considered dependent variables because their values depend on other variables. For example, the amount of electrical consumption in a house depends on the outside temperature during that day. If the temperature for the day were high, the occupants of the house would most probably turn on their air-conditioner or fan to cool themselves down. Hence, we can say that temperature is an independent variable, since it is a factor that influences the amount of electrical consumption in a house. Another possible independent variable is the number of electrical appliances in a house: the more appliances it has, the greater the amount of electrical consumption.

Regression analysis is used to determine the mathematical relationship between these variables through a linear equation termed the regression model. From the model, we can predict the y value for a given value of x.


SELF-CHECK 

Try to think of the independent variables for the following dependent

variables:

(a) Profit made by a firm;

(b) Students’ final examination grade; and

(c) Selling price of a house.


10.2 SIMPLE LINEAR REGRESSION MODEL AND ITS ASSUMPTIONS

A simple linear regression model involves only one independent variable, that is, the case k = 1. Multiple linear regression is employed for cases involving more than one independent variable (k > 1). A simple linear regression model is written as

$y = \beta_0 + \beta_1 x + \varepsilon$

Here ε refers to the random variable for errors/residuals. Errors exist because the relationship between the variables is imperfect and because measurements are rarely made without error. To understand errors further, let us see the following example.

A property development manager would like to estimate the selling price of each house that will be built. He knows that the cost of building a house is RM90 per square foot and that the land price is RM25,000 for an area of 4,500 square feet. Hence, the manager can estimate the selling price using the equation below:

$y = 25{,}000 + 90x$ (10.1)

where y = selling price and x = house size in square feet. If the house is 2,000 square feet, the price would be RM205,000, that is,

$y = 25{,}000 + 90(2{,}000) = 205{,}000$

However, this is only an estimated price; the actual price (based on observation) would be between RM180,000 and RM250,000. For this reason, to reflect the actual situation, the previous model is replaced by another simple linear regression model, that is:

$y = 25{,}000 + 90x + \varepsilon$ (10.2)

where ε is a random error variable representing all other variables that are not considered in equation (10.1). In other words, the selling price of houses of the same size will still differ because of other factors such as location, number of bedrooms, number of toilets and other unknown factors.
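The manager's example can be sketched in a few lines of Python. This is only an illustration: the function name `estimated_price` and the observed price of RM230,000 are hypothetical (the observed price is simply one value inside the RM180,000 to RM250,000 range quoted above), while the land price and building cost come from the example.

```python
# Deterministic part of the manager's pricing model: price = 25,000 + 90 * size.
LAND_PRICE = 25_000     # RM, land price from the example
COST_PER_SQFT = 90      # RM per square foot, building cost from the example


def estimated_price(size_sqft):
    """Return the estimated selling price in RM for a house of the given size."""
    return LAND_PRICE + COST_PER_SQFT * size_sqft


# Hypothetical observed price, chosen only for illustration.
observed_price = 230_000

estimate = estimated_price(2_000)
error = observed_price - estimate   # the error term (epsilon) for this one house

print(f"estimated price: RM{estimate:,}")
print(f"error (epsilon): RM{error:,}")
```

For this hypothetical house the error is RM25,000, which stands for everything the size of the house alone does not explain.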

The simple linear regression model $y = \beta_0 + \beta_1 x + \varepsilon$ is a population model, and the regression coefficients $\beta_0$ and $\beta_1$ are population parameters. It is difficult to obtain the values of the population parameters, so sample data are collected to estimate them. The estimation model is as shown below.

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ (10.3)


Here, $\hat{y}$ is the predicted/fitted value of y, $\hat{\beta}_0$ is the estimate of the population parameter $\beta_0$ and $\hat{\beta}_1$ is the estimate of the population parameter $\beta_1$. The estimation model (10.3) is a linear equation with $\hat{\beta}_1$ as the regression slope and $\hat{\beta}_0$ as the y-intercept, which is the value of y when x is zero (refer to Figure 10.1). However, in most cases the value of y at x = 0 does not carry any significant meaning, and at times x = 0 cannot occur. The slope of a straight line is a fixed value that gives the change (increase or decrease) in y for a one-unit change in x.

Figure 10.1: Estimation model

Errors (refer to Figure 10.1) are obtained as the difference between the observed y values and the fitted $\hat{y}$ values. The error for observation i is denoted by $\varepsilon_i$, for i = 1, 2, …, n, and the formula is:

$\varepsilon_i = y_i - \hat{y}_i$ (10.4)

The residual $\varepsilon_i$ is a random variable. To determine whether a calculated simple linear regression is a good estimate for the population, we need to ensure that the random variable $\varepsilon_i$ satisfies a few conditions. The assumptions made on $\varepsilon_i$ are:

(a) $\varepsilon_i$ is normally distributed; that is, $\varepsilon_i \sim N(0, \sigma^2)$, i = 1, 2, …, n.

(b) The mean of $\varepsilon_i$ is zero; that is, $E(\varepsilon_i) = 0$, i = 1, 2, …, n.

(c) The standard deviation of $\varepsilon_i$ is a fixed value σ; that is, $\sigma(\varepsilon_i) = \sigma$, i = 1, 2, …, n.

(d) $\varepsilon_i$ for any y value is independent of $\varepsilon_i$ for other values of y.

Assumption (a) is made to facilitate the inferential processes (hypothesis tests and confidence intervals) on the significance of the relationship between x and y, as displayed by the fitted line. Assumptions (b) and (c) describe how y is distributed around the regression line: its mean lies on the line and its spread is constant. Suppose we have the population regression model below:

$y = \beta_0 + \beta_1 x + \varepsilon$ (10.5)

For each x value, y is normally distributed with mean

$E(y) = \beta_0 + \beta_1 x$ (10.6)

and standard deviation

$\sigma(y) = \sigma_\varepsilon$. (10.7)

Observe from equation (10.6) that the mean E(y) depends on x, whereas the standard deviation does not depend on anything, because $\sigma_\varepsilon$ is fixed for all x values. The visual display of a simple linear regression is shown in Figure 10.2 below.

Figure 10.2: Simple linear regression model
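The behaviour described by equations (10.6) and (10.7) can be checked with a small simulation. The sketch below is illustrative only: it reuses the intercept and slope of the house price example, and the error standard deviation `SIGMA` of RM15,000 is a made-up value that does not come from the text.

```python
import random
import statistics

BETA_0 = 25_000   # intercept from the house price example
BETA_1 = 90       # slope from the house price example
SIGMA = 15_000    # hypothetical error standard deviation (not from the text)


def simulate_prices(size_sqft, n, rng):
    """Draw n values of y = beta_0 + beta_1 * x + epsilon, epsilon ~ N(0, SIGMA^2)."""
    return [BETA_0 + BETA_1 * size_sqft + rng.gauss(0, SIGMA) for _ in range(n)]


rng = random.Random(0)   # fixed seed so the run is repeatable
summary = {}
for size in (1_500, 2_000):
    prices = simulate_prices(size, 20_000, rng)
    summary[size] = (statistics.mean(prices), statistics.stdev(prices))
    print(size, round(summary[size][0]), round(summary[size][1]))
```

The sample means should come out close to 160,000 and 205,000, the values of E(y) at the two sizes, while the sample standard deviations should both stay close to `SIGMA`.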


10.3 THE LEAST SQUARES METHOD

Using the available data, how can we derive a simple linear regression model $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$? Earlier, we learned about the two-way scatter plot for visualising the relationship between two variables, in this case between the independent and dependent variables. If there were a perfect linear relationship, we would be able to draw a straight line through all the data points. However, this situation is rare because of errors/residuals.

You can try some online activities at these websites:

• http//www.stat.uiuc.edu/~stat100/java/guess/PPApplet.html

• http//www.texasoft.com/winkslr.html

When a straight line fails to pass through all the data points (x, y) on the graph, what must we do to obtain the best straight line? The best straight line is the fitted line on the two-way scatter plot that best represents the relationship between the two variables. It is the straight line that lies close to the points (x, y), so that the errors between the estimated points on the line and the actual observed points are minimised. However, the total error $\sum \varepsilon_i$ does not represent the distance between the fitted and observed points. Let us look at an example that shows why $\sum (y_i - \hat{y}_i)$ is not suitable for representing the distance between the fitted and observed points.


EXERCISE 10.1

Given the regression equation $\hat{y} = -12.84 + 36.18x$, state the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ and explain both values. Next, calculate the residuals using the following data:

 x 8.3 8.3 12.1 12.1 17.0 17.0 17.0 24.3 24.3 24.3 33.6

 y 227 312 362 521 640 539 728 945 738 759 1263


Figure 10.3(a): Data (a)

Figure 10.3(b): Data (b)

With reference to Figures 10.3(a) and 10.3(b), we can see that the positions of the two data sets [data (a) and data (b)] are different. The total errors for data (a) and data (b) are calculated as:

| Data (a): $\varepsilon_i = y_i - \hat{y}_i$ | Data (b): $\varepsilon_i = y_i - \hat{y}_i$ |
|---|---|
| 8 – 6 = 2 | 7 – 6 = 1 |
| 1 – 5 = –4 | 6 – 5 = 1 |
| 6 – 4 = 2 | 2 – 4 = –2 |
| $\sum \varepsilon_i = 0$ | $\sum \varepsilon_i = 0$ |

The total error is zero for both data (a) and data (b); in fact, the residuals from a least squares line with an intercept always sum to zero. Taken at face value, this result would suggest that data points (a) and (b) are the same distance from the regression line. However, from both graphs in Figure 10.3, we can see that this is not true. The positions of data points (a) and (b) relative to the regression line differ: data points (b) are closer to the regression line than data points (a). Hence, $\sum \varepsilon_i$ is not suitable as a selection criterion.

So, how can we solve this problem? It can be solved by squaring each error before summing them up. The following table gives the values of $\sum (y_i - \hat{y}_i)^2$ for data (a) and (b).

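| Data (a): $\varepsilon_i^2 = (y_i - \hat{y}_i)^2$ | Data (b): $\varepsilon_i^2 = (y_i - \hat{y}_i)^2$ |
|---|---|
| $(8 - 6)^2 = 4$ | $(7 - 6)^2 = 1$ |
| $(1 - 5)^2 = 16$ | $(6 - 5)^2 = 1$ |
| $(6 - 4)^2 = 4$ | $(2 - 4)^2 = 4$ |
| $\sum \varepsilon_i^2 = 24$ | $\sum \varepsilon_i^2 = 6$ |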

Based on the $\sum (y_i - \hat{y}_i)^2$ values for data (a) and (b), the sum of squared errors for data (b) is smaller than that for data (a). This shows that the points in data (b) are nearer to their regression line. For a given data set, the line that gives the smallest sum of squared errors is the best fitted line. This method of obtaining the best fitted line by minimising the sum of squared errors is known as the least squares method.
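The comparison of the two criteria can be reproduced with a short Python sketch. The (observed, fitted) pairs are read from the two tables for data (a) and data (b); the function names `total_error` and `sum_squared_error` are chosen here only for illustration.

```python
# (observed y, fitted y-hat) pairs taken from the tables for data (a) and (b).
data_a = [(8, 6), (1, 5), (6, 4)]
data_b = [(7, 6), (6, 5), (2, 4)]


def total_error(pairs):
    """Sum of the raw errors y - y_hat; positive and negative errors cancel."""
    return sum(y - y_hat for y, y_hat in pairs)


def sum_squared_error(pairs):
    """Sum of the squared errors (y - y_hat)^2; no cancellation is possible."""
    return sum((y - y_hat) ** 2 for y, y_hat in pairs)


for name, pairs in (("a", data_a), ("b", data_b)):
    print(name, total_error(pairs), sum_squared_error(pairs))
```

Both data sets give a total error of 0, but the sums of squares are 24 and 6, so only the squared criterion separates them.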

To fit the regression line, we need estimates of the regression coefficients $\beta_0$ and $\beta_1$. Using the least squares method, the formula for the regression coefficient $\hat{\beta}_1$ is:

$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$ (10.8)

where

$\hat{\beta}_1$ = Estimated value of regression coefficient $\beta_1$

$x_i$ = Value of independent variable

$y_i$ = Value of dependent variable

$\bar{x}$ = Mean value of independent variable

$\bar{y}$ = Mean value of dependent variable

n = Number of (x, y) pairs

After getting the estimate $\hat{\beta}_1$, we can derive the value of $\hat{\beta}_0$. The formula for $\hat{\beta}_0$ is:

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ (10.9)

where

$\hat{\beta}_0$ = Estimated value of regression coefficient $\beta_0$

Worked Example 10.1

For the following data, find the values of the regression coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$, and write down the fitted regression model:

 x 3 7 6 6 10 12 12 12 13 13 14 15

 y 33 38 24 61 52 45 29 65 82 63 50 79

Answer:

To facilitate the calculation of $\hat{\beta}_0$ and $\hat{\beta}_1$, we can form the following table.

| $x_i$ | $y_i$ | $x_i y_i$ | $x_i^2$ | $y_i^2$ |
|---|---|---|---|---|
| 3 | 33 | 99 | 9 | 1089 |
| 7 | 38 | 266 | 49 | 1444 |
| 6 | 24 | 144 | 36 | 576 |
| 6 | 61 | 366 | 36 | 3721 |
| 10 | 52 | 520 | 100 | 2704 |
| 12 | 45 | 540 | 144 | 2025 |
| 12 | 29 | 348 | 144 | 841 |
| 12 | 65 | 780 | 144 | 4225 |
| 13 | 82 | 1066 | 169 | 6724 |
| 13 | 63 | 819 | 169 | 3969 |
| 14 | 50 | 700 | 196 | 2500 |
| 15 | 79 | 1185 | 225 | 6241 |
| $\sum x_i = 123$ | $\sum y_i = 621$ | $\sum x_i y_i = 6833$ | $\sum x_i^2 = 1421$ | $\sum y_i^2 = 36059$ |

From the table, we first calculate $\bar{x}$ and $\bar{y}$:

$\bar{x} = \dfrac{\sum x_i}{n} = \dfrac{123}{12} = 10.25$ and $\bar{y} = \dfrac{\sum y_i}{n} = \dfrac{621}{12} = 51.75$.

Now, we can get the regression coefficient $\hat{\beta}_1$ using this formula:

$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \dfrac{6833 - 12(10.25)(51.75)}{1421 - 12(10.25)^2} = 2.92$

and for the regression coefficient $\hat{\beta}_0$,

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 51.75 - 2.92(10.25) = 21.82$

Hence, the simple linear regression model is $\hat{y} = 21.82 + 2.92x$.
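The calculation in Worked Example 10.1 can be checked with a minimal Python sketch of formulas (10.8) and (10.9). The function name `least_squares` is illustrative only and is not part of any library.

```python
# Data from Worked Example 10.1.
x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]


def least_squares(x, y):
    """Return (beta_0_hat, beta_1_hat) using formulas (10.8) and (10.9)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    beta_1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1


beta_0, beta_1 = least_squares(x, y)
print(round(beta_0, 2), round(beta_1, 2))
```

The slope rounds to 2.92 as in the worked example. The intercept comes out as 21.83 rather than 21.82 because the sketch uses the unrounded slope, whereas the worked example rounds the slope to 2.92 before computing the intercept.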


EXERCISE 10.2

Fit a simple linear regression model to the data below:

 x 60 62 64 65 66 67 68 70 72 74

 y 63.6 65.2 66.0 65.5 66.9 67.1 67.4 68.3 70.1 70

Interpret the regression coefficients obtained in this model.


10.4 INFERENCES ON REGRESSION COEFFICIENTS

Inferential statistics is the branch of statistics concerned with drawing conclusions about a population based on information from samples. The regression coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ calculated from sample data are only estimates of the population parameters. In other words, the values of the regression coefficients are subject to sampling error. A positive or negative sample coefficient does not necessarily mean that the corresponding population parameter has the same value. Hence, a test is needed to verify whether the population parameter is truly positive or negative, that is, not zero.

We are going to test the population regression slope $\beta_1$ using the regression coefficient $\hat{\beta}_1$. The hypothesis testing process for $\beta_1$ is similar to that for testing a mean or a variance. We begin with a hypothesis statement. The null hypothesis claims that there is no linear relationship, which means the slope of the regression line is zero. If we do not reject the null hypothesis, the population regression line may be a horizontal line, so that y does not change as x changes. In this case, information on x does not help in predicting y. On the other hand, if the null hypothesis is rejected, there is enough evidence to say that $\beta_1$ is not zero, that is, either $\beta_1 > 0$ or $\beta_1 < 0$. This shows that the regression line tends to increase or decrease, and this helps in predicting y using x.


SELF-CHECK 

State the application of the least squares estimate method.


We can perform either a one-sided ($\beta_1 > 0$ or $\beta_1 < 0$) or a two-sided ($\beta_1 \neq 0$) test to determine whether there is enough evidence to conclude that a linear relationship exists (that is, the population $\beta_1$ is not zero). Hence, we test the hypotheses:

$H_0$: $\beta_1 = 0$

$H_1$: (1) $\beta_1 > 0$; (2) $\beta_1 < 0$; (3) $\beta_1 \neq 0$

Test statistic: $T = \dfrac{\hat{\beta}_1 - \beta_1}{s(\hat{\beta}_1)} = \dfrac{\hat{\beta}_1}{s(\hat{\beta}_1)}$

Test result: T follows a t distribution with v = n – 2 degrees of freedom at significance level α.

Reject $H_0$ when: (1) $T > t_{\alpha,v}$; (2) $T < -t_{\alpha,v}$; (3) $|T| > t_{\alpha/2,v}$

$s(\hat{\beta}_1)$ is the standard deviation of $\hat{\beta}_1$. The formula for the standard deviation of $\hat{\beta}_1$ is:

$s(\hat{\beta}_1) = \sqrt{\dfrac{\left(\sum y_i^2 - \hat{\beta}_0 \sum y_i - \hat{\beta}_1 \sum x_i y_i\right)/(n - 2)}{\sum x_i^2 - n\bar{x}^2}}$

Apart from hypothesis testing, we can also construct a confidence interval for $\beta_1$. A confidence interval provides a range that contains the value of the population parameter with a stated level of confidence. Based on the (two-sided) T test statistic, which follows a $t_{n-2}$ distribution, we can construct a (1 - α)100% confidence interval as below:

$\hat{\beta}_1 - t_{\alpha/2,\,n-2}\, s(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2,\,n-2}\, s(\hat{\beta}_1)$

Worked Example 10.2

Based on the data in Example 10.1, show that at the 0.05 significance level, there is enough evidence to say that there is a linear relationship between x and y (that is, $\beta_1 \neq 0$). Construct a 95% confidence interval for $\beta_1$.

Answer:

The hypothesis statement:

$H_0$: $\beta_1 = 0$

$H_1$: $\beta_1 \neq 0$

Test statistic: $T = \dfrac{\hat{\beta}_1}{s(\hat{\beta}_1)} = \dfrac{2.92}{1.26} = 2.317$

Test result: T follows a t distribution with v = 12 – 2 = 10 degrees of freedom at the 0.05 significance level.

Reject $H_0$ when $|T| > t_{0.025,10} = 2.228$.

Prior to obtaining the test statistic value, we need to calculate the value of $s(\hat{\beta}_1)$:

$s(\hat{\beta}_1) = \sqrt{\dfrac{\left[36059 - 21.82(621) - 2.92(6833)\right]/10}{1421 - 12(10.25)^2}} = 1.26$

Since the test statistic T = 2.317 exceeds $t_{0.025,10} = 2.228$, we reject the null hypothesis. Hence, we can conclude that $\beta_1$ is not zero; that is, there is enough evidence of the existence of a linear relationship between x and y.

The 95% confidence interval (hence α = 0.05) for $\beta_1$ is

$\hat{\beta}_1 - t_{0.025,10}\, s(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{0.025,10}\, s(\hat{\beta}_1)$

$2.92 - 2.228(1.26) \le \beta_1 \le 2.92 + 2.228(1.26)$

$0.113 \le \beta_1 \le 5.727$

This confidence interval shows that, with 95% confidence, y increases by between 0.113 and 5.727 for each one-unit increase in x. The wide range for $\beta_1$ is due to the small sample size.
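The test and the confidence interval in Worked Example 10.2 can be reproduced with the sketch below. It starts from the column totals of Worked Example 10.1 and takes the critical value 2.228 from the t table quoted in the example rather than computing it. The variable names are illustrative only.

```python
import math

# Column totals from Worked Example 10.1.
n = 12
sum_x, sum_y = 123, 621
sum_xy, sum_x2, sum_y2 = 6833, 1421, 36059
T_CRIT = 2.228   # t(0.025, 10), the table value quoted in the example

x_bar, y_bar = sum_x / n, sum_y / n
s_xx = sum_x2 - n * x_bar ** 2                 # denominator of formula (10.8)
beta_1 = (sum_xy - n * x_bar * y_bar) / s_xx
beta_0 = y_bar - beta_1 * x_bar

# Standard deviation of the slope estimate.
sse = sum_y2 - beta_0 * sum_y - beta_1 * sum_xy
se_beta_1 = math.sqrt(sse / (n - 2) / s_xx)

t_stat = beta_1 / se_beta_1
reject = abs(t_stat) > T_CRIT                  # two-sided test at the 0.05 level
ci = (beta_1 - T_CRIT * se_beta_1, beta_1 + T_CRIT * se_beta_1)

print(round(se_beta_1, 2), round(t_stat, 2), reject)
print(round(ci[0], 3), round(ci[1], 3))
```

Because the sketch keeps the coefficients unrounded, the test statistic and interval limits differ slightly in the later decimal places from the hand calculation, which works with 2.92 and 1.26. The decision to reject the null hypothesis is the same.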


A test can also be performed on the y-intercept $\beta_0$ using the regression coefficient $\hat{\beta}_0$. However, as discussed in Section 10.2, in most cases the y-intercept does not carry any meaning; hence a test on its value can be ignored.

10.5 MODEL ADEQUACY CHECK

Confidence interval constructed using available data must be able to represent the

 population involved. The fitted regression model should satisfy all assumptions in

the underlying model. The inferential model is not valid if these assumptions are

not satisfied. There are three methods to check for model adequacy – coefficient of 

determination, residual plot and transformation.

10.5.1 Coefficient of Determination, $R^2$

The coefficient of determination, $R^2$, measures the proportion of variation in the dependent variable that can be explained by the fitted regression model. To further understand the coefficient of determination, refer to Figure 10.4.


EXERCISE 10.3

Answer the following questions based on the data in Exercise 10.2:

(a) Test the significance of the parameter $\beta_1$ at the 0.05 level; and

(b) Construct a 99% confidence interval for $\beta_1$. Interpret this confidence interval.


Figure 10.4

This is similar to saying that we are quantifying the contribution of x in predicting y. Refer to Figure 10.4. For each x value, for example $x_0$, we can separate the deviation of $y_0$ from the mean $\bar{y}$ into two parts: one part is the "explained variation" and the other part is the "unexplained variation". The total variation is the sum of squares of the deviations of the y points from their mean, that is, $\sum (y_i - \bar{y})^2$. This can be derived from the decomposition of the deviation in y, that is,

$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$

$\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$

The unexplained variation is the sum of squares of the deviations of the observed y from the fitted values, that is, $\sum (y_i - \hat{y}_i)^2$, and the explained variation is the sum of squares of the deviations of the fitted values from the mean, that is, $\sum (\hat{y}_i - \bar{y})^2$.

Hence, if we want to express the proportion of explained variation, the simplified formula for the coefficient of determination $R^2$ is

$R^2 = \dfrac{\hat{\beta}_0 \sum y_i + \hat{\beta}_1 \sum x_i y_i - n\bar{y}^2}{\sum y_i^2 - n\bar{y}^2}$ (10.10)


The coefficient of determination (10.10) always lies between 0 and 1, that is, $0 \le R^2 \le 1$, and it is usually expressed as a percentage by multiplying by 100%. For example, if $R^2 = 0.57$, we say that 57% of the variation in y can be explained by the fitted regression. The remaining 43% cannot be explained. The bigger the value of $R^2$ (approaching 1), the better the simple linear regression model fits the data.

Refer to the following example:

Worked Example 10.3

Based on the data in Example 10.1, calculate the coefficient of determination and

interpret its meaning if  y = sales and x = number of radio advertisements.

Answer:

The coefficient of determination is

$R^2 = \dfrac{21.82(621) + 2.92(6833) - 12(51.75)^2}{36059 - 12(51.75)^2} = 0.3481$

This means the fitted regression model can explain only 34.81% of the variation in sales; the remaining 65.19% of the variation in sales is due to other factors.
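As a cross-check on Worked Example 10.3, the sketch below computes the total, unexplained and explained variation directly from the data of Example 10.1 and then forms $R^2$ as explained over total. It is an illustrative calculation with variable names chosen here, not a prescribed procedure.

```python
# Data from Worked Example 10.1.
x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares coefficients, formulas (10.8) and (10.9).
beta_1 = (sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar) / (
    sum(a * a for a in x) - n * x_bar ** 2
)
beta_0 = y_bar - beta_1 * x_bar
fitted = [beta_0 + beta_1 * a for a in x]

# The three sums of squares in the decomposition of the variation in y.
total = sum((b - y_bar) ** 2 for b in y)
unexplained = sum((b - f) ** 2 for b, f in zip(y, fitted))
explained = sum((f - y_bar) ** 2 for f in fitted)

r_squared = explained / total
print(round(total, 2), round(unexplained, 2), round(explained, 2))
print(round(r_squared, 4))
```

The explained and unexplained parts add up to the total variation, and the ratio agrees with the value 0.3481 obtained from formula (10.10).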

ACTIVITY 10.2

What is the difference between the application of correlation coefficients and the coefficient of determination in a regression model? Discuss in class.

EXERCISE 10.4

Based on the data in Exercise 10.2, calculate the coefficient of determination $R^2$ and interpret the value obtained.

10.5.2 Residual Plot

The validity of many of the inferences associated with a regression analysis depends on the error term, ε, satisfying certain assumptions. Hence, it is highly recommended that some sort of analysis be conducted to assess these assumptions. No regression analysis can be considered complete without such an examination.

This can be done through a graphical technique called residual plot. From the plot,

we will be able to check whether:

(a) The fitted model is linear or not;

(b) The variance of the error $\varepsilon_i$ is constant or proportional to $x_i$; and

(c) The residuals are normally distributed or not.

The following are a few graphs that show deviations from the assumptions made.

Figure 10.5 is a plot of $\varepsilon_i$ versus the fitted values $\hat{y}_i$ (or $x_i$), used to determine whether the linearity assumption is met. The plotted points form a curve; hence, we can conclude that the relationship is non-linear and a straight-line model is not adequate.

Figure 10.5: Model is non-linear (curved pattern)

Figure 10.6 (a plot of $\varepsilon_i$ versus the fitted values $\hat{y}_i$ or $x_i$) shows a deviation from the assumption that the random errors have constant variance. The plot shows a bell-shaped pattern. This means that, instead of being constant, the variance of the random errors is proportional to the y values. The random errors have constant variance if the graph shows a random pattern with no trend.

Figure 10.6: Model with non-constant error variance, instead proportional to $\hat{y}_i$ or $x_i$

The graph in Figure 10.7 is a histogram of the errors. It is used to determine whether the random errors or residuals are normally distributed. If they are, the histogram will form a bell-shaped curve. The histogram in Figure 10.7 shows that the normality assumption for the random errors is not met, as the histogram's shape is not normal.

 Figure 10.7: Histogram of residuals
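The quantities behind a residual plot and a residual histogram can be produced without any plotting library. The sketch below, which is only an illustration, fits the data of Example 10.1, lists each residual against its fitted value, and prints a crude text histogram using bins of width 10 (the bin width is an arbitrary choice made here).

```python
from collections import Counter

# Data from Worked Example 10.1.
x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
beta_1 = (sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar) / (
    sum(a * a for a in x) - n * x_bar ** 2
)
beta_0 = y_bar - beta_1 * x_bar

fitted = [beta_0 + beta_1 * a for a in x]
residuals = [b - f for b, f in zip(y, fitted)]

# Residual-versus-fitted listing: look for curvature or a changing spread.
for f, r in sorted(zip(fitted, residuals)):
    print(f"{f:6.2f} {r:7.2f}")

# Crude text histogram of the residuals: look for a roughly bell-shaped outline.
bins = Counter(int(r // 10) * 10 for r in residuals)
for left in sorted(bins):
    print(f"[{left:4d}, {left + 10:4d}) " + "*" * bins[left])
```

With only 12 observations, such listings give a rough impression at best; they are a starting point for the checks described above rather than a formal test.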


ACTIVITY 10.3

What can you do to the model if there exists violation of assumptions?

EXERCISE 10.5

Explain the meaning of each of these plots:

(Plots (a), (b) and (c) are not reproduced in this text; each has $\hat{y}_i$ or $x_i$ on the horizontal axis.)

10.5.3 Some Transformations

Transformation is important if the relationship between the variables is non-linear. The linearity of any model can be checked by drawing a two-way scatter plot. A linear regression model will display a linear function, that is, a straight line. Common transformations are the logarithm and the inverse (reciprocal), applied to either x or y. The following are a few examples of transformations that change some non-linear functions to their linear form.

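Table 10.1: Some Transformations

| Functional Form that Relates y to x | Transformation | Linear Regression Model Form |
|---|---|---|
| Exponential: $y = \beta_0 e^{\beta_1 x}$ | $y^* = \ln y$ | $y^* = \ln \beta_0 + \beta_1 x$ |
| Power: $y = \beta_0 x^{\beta_1}$ | $y^* = \log y$; $x^* = \log x$ | $y^* = \log \beta_0 + \beta_1 x^*$ |
| Inverse: $y = \beta_0 + \beta_1 \left(\dfrac{1}{x}\right)$ | $x^* = \dfrac{1}{x}$ | $y = \beta_0 + \beta_1 x^*$ |
| Hyperbolic: $y = \dfrac{x}{\beta_0 + \beta_1 x}$ | $y^* = \dfrac{1}{y}$; $x^* = \dfrac{1}{x}$ | $y^* = \beta_0 x^* + \beta_1$ |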

A two-way scatter plot is very useful for ascertaining whether a model has a linear or non-linear form. Hence, it is good to know the shapes of the exponential, power, inverse and hyperbolic functions (refer to Figure 10.8). Refer to Table 10.1 and Figure 10.8 when choosing a transformation.
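The exponential row of Table 10.1 can be illustrated with a short sketch. The data below are hypothetical: they are generated exactly from $y = 2e^{0.5x}$ so that the recovered coefficients can be compared with known values. The helper `fit_line` is an illustrative name for the least squares formulas (10.8) and (10.9).

```python
import math


def fit_line(x, y):
    """Least squares intercept and slope, formulas (10.8) and (10.9)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar) / (
        sum(a * a for a in x) - n * x_bar ** 2
    )
    return y_bar - slope * x_bar, slope


# Hypothetical data generated exactly from y = 2 * exp(0.5 * x).
x = [1, 2, 3, 4, 5]
y = [2.0 * math.exp(0.5 * a) for a in x]

# Transformation y* = ln y turns the model into y* = ln(beta_0) + beta_1 * x.
y_star = [math.log(b) for b in y]
intercept, slope = fit_line(x, y_star)

beta_0 = math.exp(intercept)   # undo the logarithm on the intercept
beta_1 = slope
print(round(beta_0, 4), round(beta_1, 4))
```

Because the data contain no error term, the fit on the transformed scale recovers the generating values 2 and 0.5 up to floating-point rounding. With real data the recovered values would only be estimates.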


EXERCISE 10.6

Draw a two-way plot on the following regression models and then perform

transformation on the models and obtain the linear regression models.

(a) $y = 2.67 - 0.68\left(\dfrac{1}{x}\right)$

(b) $y = 2e^{3.1x}$

(c) $y = 1.5x^{0.85}$

(d) $y = \dfrac{x}{0.4 + 2x}$


Figure 10.8: Functional forms: (a) Exponential function; (b) Power function; (c) Inverse function; (d) Hyperbolic function


10.6 PREDICTION AND ESTIMATION USING REGRESSION MODEL

One of the reasons for building a linear regression model is to predict the value of the dependent variable at future x values. For example, refer to the property development manager's problem of estimating the selling price (in RM) of each house built (refer to Section 10.2), using the regression model

$\hat{y} = 25{,}000 + 90x, \quad x_b \le x \le x_a$ (10.11)

where y = selling price and x = house size (in square feet). The x values lie between $x_b$ and $x_a$. If we would like to predict the selling price of a house with a built-up area of 2,000 square feet, where 2,000 > $x_a$, we can use the regression model with x = 2,000. Based on the regression equation, the manager can predict that the selling price of each 2,000-square-foot house is RM205,000.

However, this selling price is a single-value estimate, and it says nothing about how close that value is to the actual selling price. In other words, is the estimated value close to the actual value or very different? This relates to the reliability of a prediction. To get information on the position of estimated values relative to actual values, we need to use intervals. Two types of interval are used: the prediction interval for an individual value of the dependent variable y, and the estimation (confidence) interval for the mean value of y.

ACTIVITY 10.4

What are other situations that require prediction or estimation?

10.6.1 Prediction Interval for an Individual Value of y

The prediction interval is used to predict a certain value of the dependent variable y, given a specific value of the independent variable x, when this x value is outside the range of x values, that is, x > $x_a$ or x < $x_b$. The term "prediction interval" is used rather than "confidence interval" because a population parameter is not being estimated in this case; instead, the response or performance of a single individual in the population is being predicted. The formula for a (1 - α)100% prediction interval is:

$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$ (10.12)

$\hat{y}$ = Future estimated value of the dependent variable, calculated from $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_g$

$t_{\alpha/2}$ = The critical value of the t distribution with n – 2 degrees of freedom at significance level α

$s_\varepsilon$ = The standard error of the estimate

$x_g$ = The specific value of the independent variable

$\bar{x}$ = Mean value of the independent variable

n = Sample size

The standard error of the estimate, $s_\varepsilon$, is a measure of the reliability of an estimation equation. It is an estimate of the standard deviation of the observations around the regression line, that is, the degree of deviation of the observed values from the values estimated by the regression line. It is often referred to as the standard error of the regression. The formula for $s_\varepsilon$ is:

$s_\varepsilon = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}$ (10.13)

Worked Example 10.4

Refer to the data in Example 10.1. Calculate the 95% prediction interval for x = 20 and explain its meaning if y = sales and x = number of radio advertisements.

Answer:

From Example 10.1, the simple linear regression model is

$\hat{y} = 21.82 + 2.92x$

When $x_g = 20$, $\hat{y}$ = 21.82 + 2.92(20) = 80.22.

To get the standard error of the estimate, we need the fitted values $\hat{y}_i$. These can be generated using the regression model $\hat{y} = 21.82 + 2.92x$, giving the data in the table:

| x | 3 | 7 | 6 | 6 | 10 | 12 | 12 | 12 | 13 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| y | 33 | 38 | 24 | 61 | 52 | 45 | 29 | 65 | 82 | 63 | 50 | 79 |
| $\hat{y}$ | 30.58 | 42.26 | 39.34 | 39.34 | 51.02 | 56.86 | 56.86 | 56.86 | 59.78 | 59.78 | 62.70 | 65.62 |

Hence,

$$s_\epsilon = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{2556.946}{10}} = 15.99$$

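Since α = 0.05 and there are n – 2 = 10 degrees of freedom, $t_{\alpha/2} = t_{0.025} = 2.228$. Thus, the 95% prediction interval for $x_g = 20$ is

$$\hat{y} \pm t_{\alpha/2}\, s_\epsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

$$= 80.22 \pm (2.228)(15.99) \sqrt{1 + \frac{1}{12} + \frac{(20 - 10.25)^2}{160.25}}$$

$$= 80.22 \pm 46.13$$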

The lower and upper limits of the prediction interval are 34.09 and 126.35 respectively. That is, we can be 95% confident that sales will lie between about 34 and 126 units when 20 advertisements are broadcast on radio.
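
The calculation in Worked Example 10.4 can be reproduced with a short script. The following is a minimal Python sketch and is not part of the original module: the function names `standard_error` and `prediction_interval` are illustrative, the coefficients 21.82 and 2.92 are the rounded values quoted in the text, and the critical value 2.228 is copied from the text rather than computed.

```python
import math

# Data from Example 10.1, as tabulated in the text
x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]

B0, B1 = 21.82, 2.92  # fitted intercept and slope quoted in the text
T_CRIT = 2.228        # t_{0.025} with n - 2 = 10 degrees of freedom (from the text)


def standard_error(x, y, b0, b1):
    """Standard error of the estimate, equation (10.13)."""
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - 2))


def prediction_interval(x, y, b0, b1, x_g, t_crit):
    """Prediction interval for a single future y at x_g, equation (10.12)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)  # sum of squared deviations of x
    y_hat = b0 + b1 * x_g
    s = standard_error(x, y, b0, b1)
    half_width = t_crit * s * math.sqrt(1 + 1 / n + (x_g - x_bar) ** 2 / sxx)
    return y_hat - half_width, y_hat + half_width


lower, upper = prediction_interval(x, y, B0, B1, 20, T_CRIT)
print(f"s = {standard_error(x, y, B0, B1):.2f}")
print(f"95% prediction interval: {lower:.2f} to {upper:.2f}")
```

The printed limits should agree with the hand calculation above up to rounding. For a different confidence level, only `T_CRIT` needs to change, using the appropriate value from a t table.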

EXERCISE 10.7

Refer to data in Exercise 10.2. Calculate the 99% prediction interval for x = 86 and provide an explanation for it.

10.6.2 Confidence Interval for a Mean Value of y

In the population regression model $y = \beta_0 + \beta_1 x + \epsilon$, for each value of x the response y is normally distributed with mean

$$E(y) = \beta_0 + \beta_1 x$$

Hence, to estimate the mean value of y for any given value $x_g$, we can use the following interval:

$$\hat{y} \pm t_{\alpha/2}\, s_\epsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} \qquad (10.14)$$

This interval applies when the specific value $x_g$ lies within the range of the observed values of the independent variable, i.e. $x_b \le x_g \le x_a$, where $x_b$ and $x_a$ denote the lower and upper ends of that range.

Worked Example 10.5

Refer to the data in Example 10.1. Calculate the 95% confidence interval for the mean value of y when $x_g = 11$ and explain its meaning if y = sales and x = number of radio advertisements.

Answer:

The values of $t_{\alpha/2}$ and $s_\epsilon$ are the same as in Example 10.4, and $\hat{y} = 21.82 + 2.92(11) = 53.94$. Hence, the 95% confidence interval for the mean value of y when $x_g = 11$ is:

 

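$$\hat{y} \pm t_{\alpha/2}\, s_\epsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

$$= 53.94 \pm (2.228)(15.99) \sqrt{\frac{1}{12} + \frac{(11 - 10.25)^2}{160.25}}$$

$$= 53.94 \pm 10.50$$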

Note that the lower and upper confidence limits for the mean value of y are 43.44 and 64.44 respectively. That is, we can be 95% confident that mean sales lie between about 43 and 64 units when 11 radio advertisements are broadcast.
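
As a check on Worked Example 10.5, the interval in equation (10.14) can be computed directly from the data. This is a minimal Python sketch and is not part of the original module: the function name `mean_confidence_interval` is illustrative, and the coefficients and the critical value 2.228 are the rounded figures quoted in the text.

```python
import math

# Data from Example 10.1, as tabulated in the text
x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]

B0, B1 = 21.82, 2.92  # fitted intercept and slope quoted in the text
T_CRIT = 2.228        # t_{0.025} with n - 2 = 10 degrees of freedom (from the text)


def mean_confidence_interval(x, y, b0, b1, x_g, t_crit):
    """Confidence interval for the mean of y at x_g, equation (10.14)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Standard error of the estimate, equation (10.13)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x_g
    # No leading "1 +" under the root: only the uncertainty in the
    # estimated mean is included, not the scatter of a single observation.
    half_width = t_crit * s * math.sqrt(1 / n + (x_g - x_bar) ** 2 / sxx)
    return y_hat - half_width, y_hat + half_width


lower, upper = mean_confidence_interval(x, y, B0, B1, 11, T_CRIT)
print(f"95% confidence interval for mean y: {lower:.2f} to {upper:.2f}")
```

Because the term under the square root omits the leading 1, this interval is always narrower than the prediction interval (10.12) at the same $x_g$.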


EXERCISE 10.8

Refer to Exercise 10.2. Calculate the 99% confidence interval for the mean of y when x = 69 and provide an explanation for it.


EXERCISE 10.9

1. Determine whether the following statements are true or false. If false, write down the correct statement.

(a) Regression analysis is used to display the validity of an estimated equation that explains the relationship of subjects under study.

(b) Given a straight-line equation y = a – bx, we can say the relationship between y and x is positive linear.

(c) An $R^2$ value approaching zero indicates a strong correlation between x and y.

(d) The regression line is derived from a sample, not from the population under study.

2. Fill in the blanks for each of the following statements:

(a) If the value of a dependent variable decreases when the value of an independent variable increases, their relationship is _____________________.

(b) Each straight line has ___________________, which describes how much the dependent variable changes for a unit increase in the independent variable.

(c) The least squares method is used to get _______________ and __________________ of the regression line.

(d) If the coefficient of determination is 0.80, this means 80% of the variation in the dependent variable ______________ by variation in the independent variable.

3. An economist would like to know the influence of interest rate on the total investment made by a company. He has collected data over a period of eight months, displayed in the table below:

Investment (RM Million)   1.8   1.8   2.1   2.2   2.8   3.1   3.6   4.1
Interest Rate (%)         9.9   10.5  9.6   9.8   12.1  9.2   9.5   7.7

 


(a) State the dependent and independent variables.

(b) Without using a two-way scatter plot, is the regression slope positive or negative? Explain its meaning.

(c) By how much will the investment value change for a unit change in the interest rate?

(d) How much of the variation in investment can be explained by variation in the interest rate?

4. Many people assume that the total amount of money saved depends on their total income. The following data show the average saving per month (RM'00) and average income per month (RM'00) for various groups of employees.

Income   19    22    27    30    36    43    47    51    61    64
Saving   1.0   1.4   1.8   2.4   3.0   3.8   4.3   4.5   5.8   6.3

(a) Fit a simple linear regression model for this data.

(b) How much money will a person save if his monthly income is RM4,500?

(c) Test the significance of the regression slope at the 5% significance level. Use a one-sided test and explain the reason for using this test.

(d) Obtain a 95% confidence interval for $\beta_1$.

5. For the following data:

x   1    2    3    7     10    11    14    14    16    17
y   20   30   50   100   150   200   260   400   400   700

(a) Calculate the residuals.

(b) Plot residuals versus x and $\hat{y}$.

(c) What can you conclude from the plot in (b)?

(d) Is the normality assumption for random errors satisfied?

(e) Is there any sign of violation of the model assumptions? If yes, what needs to be done?

(f) Obtain the appropriate transformation.


ACTIVITY 10.5


Please visit the following websites to read more about simple linear regression:

• http://www.pinkmonkey.com/studyguides/subjects/stats/chap8/s0808n01.asp

• http://stat.tamu.edu/stat30x/notes/node155.html

• Simple linear regression is a technique used to analyse the relationship

 between two variables. Regression analysis assumes that the relationship is in

linear form.

• The least squares method is used to get parameter estimates for slope of 

regression line and intercept on y-axis.

• A hypothesis test is performed on the slope of regression line to determine if 

there is enough evidence to support the existence of a linear relationship. The

validity of the relationship between the two variables can be determined using

coefficient of determination.

• Model adequacy, that is, whether the model assumptions are violated, can be checked using residual plots and a histogram.

• If the simple linear regression model obtained is adequate for the data, the model can be used to predict the value of the dependent variable for any specific value of the independent variable. The model can also be used to estimate the mean value of the dependent variable.
