8/2/2019 KFS2312 Topic 10(1t)
http://slidepdf.com/reader/full/kfs2312-topic-101t 1/27
INTRODUCTION
In this topic, we will learn about a method to visually check for the relationship between two variables using the two-way scatter plot, and another method to measure the strength of this relationship using correlation. If a relationship exists, we would like to know its meaning. Once we have expressed the relationship as an equation, we will be able to predict the value of one variable given the value of the other. The statistical method used to examine a linear relationship between two variables is called Simple Linear Regression. Only quantitative variables are considered in this case.
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Explain regression concepts;
2. Construct a simple linear regression model and identify the assumptions made;
3. Prove mathematically, using the least squares estimation method, how a regression model is constructed;
4. Identify inferential concepts for the regression parameters;
5. Use appropriate methods to evaluate data suitability in fitting a regression model; and
6. Use regression analysis for prediction and variable estimation.
Topic 10  Simple Linear Regression Analysis
TOPIC 10 CORRELATION & REGRESSION
10.1 INTRODUCTION TO REGRESSION CONCEPTS
Regression analysis deals with finding the best relationship between a dependent variable Y and an independent variable X, quantifying the strength of that relationship, and using methods that allow prediction of the response values (Y) given values of the regressor x. The value of the y variable can only be determined if the values of the independent variables (denoted by x₁, x₂, ..., x_k, where k is the number of independent variables) are known.
Examples of dependent variables are the amount of electricity consumed in a house, the profit made by a company, students' final examination grades, the selling price of a house, and so on. These are considered dependent variables because their values depend on other variables. For example, the amount of electricity consumed in a house depends on the outside temperature during the day. If the temperature for the day were high, the occupants of the house would most probably turn on their air-conditioner or fan to cool themselves down. Hence, we can say that temperature is an independent variable, since it is a factor that influences the amount of electricity consumed in a house. Another possible variable is the number of electrical appliances in a house: the more it has, the greater the electricity consumption.
Regression analysis is used to determine the mathematical relationship between these variables through a linear equation termed the regression model. From the model, we can predict the y value for a given value of x.
SELF-CHECK
Try to think of the independent variables for the following dependent
variables:
(a) Profit made by a firm;
(b) Students’ final examination grade; and
(c) Selling price of a house.
10.2 SIMPLE LINEAR REGRESSION MODEL AND ITS ASSUMPTIONS
A simple linear regression model involves only one independent variable, that is, the k = 1 case. Multiple linear regression is employed for cases involving more than one independent variable (k > 1). A simple linear regression model is written as

y = β₀ + β₁x + ∈    (10.1)
∈ refers to the random variable for errors (residuals). Errors exist because the relationship between variables is imperfect and measurements are rarely made without error. To further understand errors, let us see the following example:
A property development manager would like to estimate the selling price of each house that will be built. He knows that the cost of building a house is RM90 per square foot and the land price is RM25,000 for an area of 4,500 square feet. Hence, the manager can estimate the selling price using the equation below:

y = 25,000 + 90x    (10.2)
where y = selling price and x = house size in square feet. If the house is 2,000
square feet, the price would be RM205,000, that is
y = 25,000 + 90(2,000) = 205,000
However, this is only an estimated price; the actual price (based on observation) would be between RM180,000 and RM250,000. For this reason, to reflect the actual situation, another simple linear regression model replaces the previous one:

y = 25,000 + 90x + ∈

where ∈ is a random variable for errors representing all other variables not considered in equation (10.2). In other words, the selling price for houses of the same size will also differ due to other factors such as location, the number of bedrooms and toilets, and other unknown factors.
The simple linear regression model y = β₀ + β₁x + ∈ is a population model, and the regression coefficients β₀ and β₁ are population parameters. It is difficult to obtain these population parameter values, so sample data is collected to estimate them. The estimation model is shown below:

ŷ = β̂₀ + β̂₁x    (10.3)
Here, ŷ is the predicted (fitted) value of y, β̂₀ is the estimate of the population parameter β₀, and β̂₁ is the estimate of the population parameter β₁. The estimation model (10.3) is a linear equation with parameter β̂₁ as the regression slope and parameter β̂₀ as the y-intercept, which is the value of y when x is zero (refer to Figure 10.1). However, in most cases the y value at x = 0 does not carry any meaningful interpretation, and at times x = 0 cannot occur at all. The slope of a straight line is a fixed value that describes the change (increase or decrease) in y for a one-unit change in x.
Figure 10.1: Estimation model
Errors (refer to Figure 10.1) are obtained from the difference between the observed y values and the fitted y values. The error is denoted by ∈ᵢ for i = 1, 2, ..., n, and the formula is:

∈ᵢ = yᵢ − ŷᵢ    (10.4)

The residual ∈ᵢ is a random variable. To determine whether a calculated simple linear regression is a good estimate for the population, we need to ensure that the random variable ∈ᵢ satisfies a few conditions. The assumptions made on the random variable ∈ᵢ are:

(a) ∈ᵢ is normally distributed; that is, ∈ᵢ ~ N(0, σ²), i = 1, 2, ..., n;

(b) the mean of ∈ᵢ is zero; that is, E(∈ᵢ) = 0, i = 1, 2, ..., n;

(c) the standard deviation of ∈ᵢ is a fixed σ; that is, σ(∈ᵢ) = σ, i = 1, 2, ..., n; and
(d) ∈ᵢ for any y value is independent of ∈ᵢ for other values of y.

Assumption (a) is made to facilitate the inferential procedures (hypothesis tests and confidence intervals) on the significance of the relationship between x and y, as displayed by the fitted line. Assumptions (b) and (c) refer to the linearity of a regression model. Suppose we have the population regression model below:

y = β₀ + β₁x + ∈    (10.5)

For each x value, y is normally distributed with mean

E(y) = β₀ + β₁x    (10.6)

and standard deviation

σ(y) = σ∈.    (10.7)

Observe from equation (10.6) that the mean E(y) depends on x, but the standard deviation does not depend on x. This is because σ∈ is fixed for all x values. A visual display of the simple linear regression model is shown in Figure 10.2 below.
Figure 10.2: Simple linear regression model
10.3 THE LEAST SQUARES METHOD
Using the available data, how can we derive the simple linear regression model ŷ = β̂₀ + β̂₁x? In an earlier topic, we learned to use the two-way scatter plot to visualize the relationship between two variables, in this case between the independent and dependent variables. If a linear relationship exists, we would like to draw a straight line through all the available data. However, a line passing through every point is rare because of errors (residuals).
You can try some online activities at these websites:
• http://www.stat.uiuc.edu/~stat100/java/guess/PPApplet.html
• http://www.texasoft.com/winkslr.html
When the straight line fails to pass through all the data (the points (x, y) on the graph), what must we do to obtain the best straight line? This best straight line refers to the fitted line drawn in the two-way scatter plot that best represents the relationship between the two variables. The fitted line should be close to the points (x, y), so that the errors between the points on the straight line (estimated) and the actual observed points are minimised. However, the sum of errors Σ∈ᵢ does not properly represent the distance between the actual and estimated points. Let us look at an example that shows why Σ(yᵢ − ŷᵢ) is not suitable to represent the distance between the actual and estimated points.
EXERCISE 10.1

Given the regression equation ŷ = –12.84 + 36.18x, state the values of β̂₀ and β̂₁ and interpret both values. Next, calculate the residuals using the following data:

x 8.3 8.3 12.1 12.1 17.0 17.0 17.0 24.3 24.3 24.3 33.6
y 227 312 362 521 640 539 728 945 738 759 1263
Figure 10.3(a): Data (a)
Figure 10.3(b): Data (b)
With reference to Figures 10.3(a) and 10.3(b), we can see that the positions of the two data sets [data (a) and data (b)] are different. The total errors for data (a) and data (b) are calculated as:
Data (a): ∈ᵢ = yᵢ − ŷᵢ    Data (b): ∈ᵢ = yᵢ − ŷᵢ
8 − 6 = 2                 7 − 6 = 1
1 − 5 = −4                6 − 5 = 1
6 − 4 = 2                 2 − 4 = −2
Σ∈ᵢ = 0                   Σ∈ᵢ = 0
The total errors are zero for both data (a) and data (b), and this always holds. Taken at face value, this would suggest that data sets (a) and (b) lie equally close to the regression line. However, from both graphs in Figure 10.3, we can see that this is not true. The data points differ in their positions relative to the regression line: the points in data (b) are closer to the regression line than those in data (a). Hence, Σ∈ᵢ is not suitable as a selection criterion.
So, how can we solve this problem? It can be solved by squaring each error before summing. The following table shows the values of Σ(yᵢ − ŷᵢ)² for data (a) and data (b).

Data (a): ∈ᵢ² = (yᵢ − ŷᵢ)²    Data (b): ∈ᵢ² = (yᵢ − ŷᵢ)²
(8 − 6)² = 4                  (7 − 6)² = 1
(1 − 5)² = 16                 (6 − 5)² = 1
(6 − 4)² = 4                  (2 − 4)² = 4
Σ∈ᵢ² = 24                     Σ∈ᵢ² = 6

Based on the Σ(yᵢ − ŷᵢ)² values for data (a) and (b), the total sum of squares for data (b) is smaller than that for data (a). This confirms that the points in data (b) are nearer to the regression line, and that line is the better fit. This method of obtaining the best fitted line by minimising the sum of squared errors is known as the least squares method.
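The comparison above can be sketched in a few lines of Python (a sketch added here, not part of the original module; the residuals are read directly off the two tables):

```python
# Residuals read off the tables above (not computed from raw data).
res_a = [8 - 6, 1 - 5, 6 - 4]   # data (a): 2, -4, 2
res_b = [7 - 6, 6 - 5, 2 - 4]   # data (b): 1, 1, -2

# Raw sums cancel to zero for both sets, so they cannot discriminate.
sum_a, sum_b = sum(res_a), sum(res_b)

# Squared sums do discriminate: 24 for data (a) versus 6 for data (b).
ss_a = sum(e ** 2 for e in res_a)
ss_b = sum(e ** 2 for e in res_b)
print(sum_a, sum_b, ss_a, ss_b)
```

The squared sums single out data (b) as the better fit, which is exactly the criterion the least squares method minimises.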
To fit the regression line, we need estimates of the regression coefficients β₀ and β₁. Using the least squares method, the formula for the regression coefficient β̂₁ is:

β̂₁ = ( Σᵢ₌₁ⁿ xᵢyᵢ − n x̄ ȳ ) / ( Σᵢ₌₁ⁿ xᵢ² − n x̄² )    (10.8)

where

β̂₁ = estimated value of regression coefficient β₁
xᵢ = value of the independent variable
yᵢ = value of the dependent variable
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable
n = number of (x, y) pairs

After getting the estimate β̂₁, we can derive the value of β̂₀. The formula for β̂₀ is:

β̂₀ = ȳ − β̂₁ x̄    (10.9)

where

β̂₀ = estimated value of regression coefficient β₀
Worked Example 10.1

For the following data, find the values of the regression coefficients β̂₀ and β̂₁, and write down the fitted regression model:

x 3 7 6 6 10 12 12 12 13 13 14 15
y 33 38 24 61 52 45 29 65 82 63 50 79

Answer:

To facilitate the calculation of the parameter values β̂₀ and β̂₁, we can form the following table.
xᵢ    yᵢ    xᵢyᵢ    xᵢ²    yᵢ²
3     33    99      9      1089
7     38    266     49     1444
6     24    144     36     576
6     61    366     36     3721
10    52    520     100    2704
12    45    540     144    2025
12    29    348     144    841
12    65    780     144    4225
13    82    1066    169    6724
13    63    819     169    3969
14    50    700     196    2500
15    79    1185    225    6241
Σxᵢ = 123   Σyᵢ = 621   Σxᵢyᵢ = 6833   Σxᵢ² = 1421   Σyᵢ² = 36059
From the table, we first calculate x̄ and ȳ:

x̄ = Σxᵢ / n = 123/12 = 10.25  and  ȳ = Σyᵢ / n = 621/12 = 51.75.

Now we can get the regression coefficient β̂₁ using the formula:

β̂₁ = ( Σxᵢyᵢ − n x̄ ȳ ) / ( Σxᵢ² − n x̄² )
    = ( 6833 − 12(10.25)(51.75) ) / ( 1421 − 12(10.25)² )
    = 2.92

and for the regression coefficient β̂₀,

β̂₀ = ȳ − β̂₁ x̄ = 51.75 − 2.92(10.25) = 21.82

Hence, the simple linear regression model is ŷ = 21.82 + 2.92x.
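As a check, formulas (10.8) and (10.9) can be sketched in plain Python on the data of this example (a sketch added here, not part of the original module; note that carrying full precision gives β̂₀ ≈ 21.83, while the text's 21.82 comes from using the rounded slope 2.92):

```python
# Data from Worked Example 10.1.
x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]

n = len(x)
x_bar = sum(x) / n                             # 10.25
y_bar = sum(y) / n                             # 51.75
sxy = sum(xi * yi for xi, yi in zip(x, y))     # 6833
sxx = sum(xi ** 2 for xi in x)                 # 1421

# Least squares estimates, equations (10.8) and (10.9).
b1 = (sxy - n * x_bar * y_bar) / (sxx - n * x_bar ** 2)   # approx 2.92
b0 = y_bar - b1 * x_bar                                   # approx 21.83
print(round(b1, 2), round(b0, 2))
```

The fitted model agrees with the worked example up to rounding in the intercept.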
EXERCISE 10.2
Fit a simple linear regression model to the data below:
x 60 62 64 65 66 67 68 70 72 74
y 63.6 65.2 66.0 65.5 66.9 67.1 67.4 68.3 70.1 70
Interpret the regression coefficients obtained in this model.
10.4 INFERENCES ON REGRESSION COEFFICIENTS
Inferential statistics is the branch of statistics concerned with drawing conclusions about a population based on information from samples. The regression coefficients β̂₀ and β̂₁ calculated from sample data are only estimates of the population parameters. In other words, the regression coefficient values are subject to sampling error. A positive or negative sample coefficient does not necessarily mean that the corresponding population parameter takes the sample value. Hence, a test is needed to verify whether the regression coefficient obtained is genuinely positive or negative, that is, not zero.

We are going to test the population regression slope parameter β₁ using the regression coefficient β̂₁. The hypothesis testing process for the population parameter β₁ is similar to that for testing a mean or a variance. We begin with a hypothesis statement. The null hypothesis claims that there is no linear relationship, which means the slope of the regression line is zero. If we fail to reject the null hypothesis, the population regression line is a straight line along which the y value does not change as the x value changes. In this case, information on x is not enough to assist in predicting the y value. On the other hand, if the null hypothesis is rejected, there is enough evidence to say that β₁ is not zero, that is, either β₁ > 0 or β₁ < 0. This shows that the regression line tends to increase or decrease, which helps in predicting the y value from the x value.
SELF-CHECK
State the application of the least squares estimate method.
We can perform either a one-sided (β₁ > 0 or β₁ < 0) or a two-sided (β₁ ≠ 0) test to determine whether there is enough evidence to conclude that a linear relationship exists (that is, the population β₁ is not zero). Hence, test the hypotheses:

H₀ : β₁ = 0
H₁ : (1) β₁ > 0;  (2) β₁ < 0;  (3) β₁ ≠ 0

Test statistic:  T = (β̂₁ − β₁) / s(β̂₁) = β̂₁ / s(β̂₁) under H₀

Test result: T follows a t distribution with v = n − 2 degrees of freedom at significance level α.

Reject H₀ when:  (1) T > t(α, v);  (2) T < −t(α, v);  (3) |T| > t(α/2, v)

Here s(β̂₁) is the standard deviation of β̂₁. The formula for the standard deviation of β̂₁ is:

s(β̂₁) = √[ ( Σyᵢ² − β̂₀ Σyᵢ − β̂₁ Σxᵢyᵢ ) / ( (n − 2)( Σxᵢ² − n x̄² ) ) ]

Apart from hypothesis testing, we can also construct a confidence interval for β₁. A confidence interval provides a range that contains the population parameter value at a given confidence level. Based on the (two-sided) T test statistic that follows a t(n−2) distribution, we can construct a (1 − α)100% confidence interval as below:

β̂₁ − t(α/2, n−2) s(β̂₁)  ≤  β₁  ≤  β̂₁ + t(α/2, n−2) s(β̂₁)

Worked Example 10.2

Based on the data in Example 10.1, show that at the 0.05 significance level there is
enough evidence to say that there is a linear relationship between x and y, that is, β₁ ≠ 0. Then construct a 95% confidence interval for β₁.

Answer:

The hypothesis statement:

H₀ : β₁ = 0
H₁ : β₁ ≠ 0

Test result: T follows a t distribution with v = 12 − 2 = 10 degrees of freedom at the 0.05 significance level.

Reject H₀ when |T| > t(0.025, 10) = 2.228.

Before obtaining the test statistic value, we need to calculate s(β̂₁):

s(β̂₁) = √[ ( 36059 − 21.82(621) − 2.92(6833) ) / ( 10 × ( 1421 − 12(10.25)² ) ) ] = 1.26

Test statistic:

T = β̂₁ / s(β̂₁) = 2.92 / 1.26 = 2.317

Since the test statistic (T = 2.317) > 2.228 = t(0.025, 10), we reject the null hypothesis. Hence, we can conclude that β₁ is not zero, that is, there is enough evidence of a linear relationship between x and y.

The 95% confidence interval (hence α = 0.05) for β₁ is

β̂₁ − t(0.025, 10) s(β̂₁)  ≤  β₁  ≤  β̂₁ + t(0.025, 10) s(β̂₁)
2.92 − 2.228(1.26)  ≤  β₁  ≤  2.92 + 2.228(1.26)
0.113  ≤  β₁  ≤  5.727

This confidence interval indicates that y increases by between 0.113 and 5.727 units for each unit increment in x. The wide range for β₁ is due to the small sample size.
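A minimal Python sketch of this test and interval (a sketch added here, not part of the original module; the critical value t(0.025, 10) = 2.228 is taken as given, and small differences from the text's 0.113 and 5.727 arise from intermediate rounding):

```python
import math

n = 12
b0, b1 = 21.82, 2.92                 # coefficients from Worked Example 10.1
syy, sy, sxy = 36059, 621, 6833      # sums from the table in Example 10.1
sxx_c = 1421 - n * 10.25 ** 2        # sum(x^2) - n*x_bar^2 = 160.25

# Standard deviation of the slope estimate.
s_b1 = math.sqrt((syy - b0 * sy - b1 * sxy) / ((n - 2) * sxx_c))  # approx 1.26

# Test statistic and two-sided decision at the 0.05 level.
T = b1 / s_b1                        # approx 2.31 at full precision
t_crit = 2.228                       # t(0.025, 10), from the text's t table
reject_h0 = abs(T) > t_crit          # True: the slope is significant

# 95% confidence interval for beta_1.
lower, upper = b1 - t_crit * s_b1, b1 + t_crit * s_b1
print(round(s_b1, 2), round(T, 2), reject_h0, round(lower, 3), round(upper, 3))
```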
A test can also be performed on the y-intercept β₀ using the regression coefficient β̂₀. However, as discussed in Section 10.2, in most cases the y-intercept does not carry any meaning; hence a test on its value can be ignored.
10.5 MODEL ADEQUACY CHECK
A model constructed from the available data must adequately represent the population involved. The fitted regression model should satisfy all the assumptions of the underlying model; inferences from the model are not valid if these assumptions are not satisfied. There are three tools for checking and improving model adequacy: the coefficient of determination, the residual plot, and transformation.
10.5.1 Coefficient of Determination, R²

The coefficient of determination, R², measures the proportion of variation in the dependent variable that can be explained by the fitted regression model. To further understand the coefficient of determination, refer to Figure 10.4.
EXERCISE 10.3

Answer the following questions based on the data in Exercise 10.2:

(a) Test the significance of the β₁ parameter at the 0.05 level; and

(b) Construct a 99% confidence interval for β₁. Interpret this confidence interval.
Figure 10.4
This is similar to saying that we are quantifying the contribution of x in predicting y. Refer to Figure 10.4. For each x value, for example x₀, we can separate the deviation of y₀ from the mean ȳ into two parts: one part is the "explained variation" and the other is the "unexplained variation". The total variation is the total sum of squared deviations of the y points from their mean, that is, Σ(yᵢ − ȳ)². This can be derived from the decomposition of the deviation in y:

yᵢ − ȳ = (ŷᵢ − ȳ) + (yᵢ − ŷᵢ)

Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)²

The unexplained variation is the sum of squared deviations of the observed y from the fitted y, that is, Σ(yᵢ − ŷᵢ)², and the explained variation is the sum of squared deviations of the fitted values from the mean, that is, Σ(ŷᵢ − ȳ)².

Hence, to express the proportion of explained variation, a simplified formula for the coefficient of determination R² is:

R² = ( β̂₀ Σyᵢ + β̂₁ Σxᵢyᵢ − n ȳ² ) / ( Σyᵢ² − n ȳ² )    (10.10)
The coefficient of determination (10.10) always satisfies 0 ≤ R² ≤ 1, and it is usually expressed as a percentage by multiplying by 100%. For example, if R² = 0.57, we say that 57% of the variation in y can be explained by the fitted regression; the remaining 43% cannot. The larger the value of R² (approaching 1), the better the data fit the simple linear regression model, that is, the better the model explains the variation in the data.
Refer to the following example:
Worked Example 10.3

Based on the data in Example 10.1, calculate the coefficient of determination and interpret its meaning if y = sales and x = number of radio advertisements.

Answer:

The coefficient of determination is

R² = ( 21.82(621) + 2.92(6833) − 12(51.75)² ) / ( 36059 − 12(51.75)² ) = 0.3481

This means the fitted regression model can explain only 34.81% of the variation in sales; the remaining 65.19% of the variation in sales is explained by other factors.
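Equation (10.10) can be sketched in Python using the sums already computed for this example (a sketch added here, not part of the original module; full precision gives 0.3482 against the text's rounded 0.3481):

```python
# Sums from the table in Worked Example 10.1.
n, y_bar = 12, 51.75
b0, b1 = 21.82, 2.92
sy, sxy, syy = 621, 6833, 36059

# Coefficient of determination, equation (10.10).
r2 = (b0 * sy + b1 * sxy - n * y_bar ** 2) / (syy - n * y_bar ** 2)
print(round(r2, 4))   # approx 0.3482, i.e. about 34.8% of variation explained
```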
10.5.2 Residual Plot
The validity of many of the inferences associated with a regression analysis
depends on the error term, ε, satisfying certain assumptions. Hence, it is highly
ACTIVITY 10.2
What is the difference between the application of the correlation coefficient and the coefficient of determination in a regression model? Discuss in class.
EXERCISE 10.4
Based on data in Exercise 10.2, calculate the coefficient of determination R2
and interpret the value obtained.
recommended that some analysis be conducted to assess these assumptions. No regression analysis can be considered complete without such an examination. This can be done through a graphical technique called the residual plot. From the plot, we will be able to check whether:

(a) The fitted model is linear or not;

(b) The variance of the errors ∈ᵢ is constant or proportional to xᵢ; and

(c) The residuals are normally distributed or not.

The following are a few graphs that show deviations from the assumptions made. Figure 10.5 is a plot of ∈ᵢ versus the fitted values ŷᵢ or xᵢ, used to determine whether the linearity assumption is met. The graph shows that the plotted data form a curve; hence, we can conclude that the fitted model is non-linear.
Figure 10.5: Non-linear model: the residuals show a curved pattern
Figure 10.6 (a plot of ∈ᵢ versus the fitted values ŷᵢ or xᵢ) shows a deviation from the assumption that the random errors have constant variance. The plotted data show a bell-shaped pattern. This means the random errors do not have constant variance; instead, the variance is proportional to the ŷ values. The random errors have constant variance if the plot shows a random pattern with no trend.
Figure 10.6: Model with non-constant error variance, proportional to ŷᵢ or xᵢ
The graph in Figure 10.7 is a histogram of the errors. It is used to determine whether the random errors (residuals) are normally distributed. If they are, the histogram will form a bell-shaped curve. The histogram in Figure 10.7 shows that the normality assumption on the random errors is not met, as its shape is not that of a normal distribution.
Figure 10.7: Histogram of residuals
ACTIVITY 10.3
What can you do to the model if the assumptions are violated?
EXERCISE 10.5
Explain the meaning of each of these plots:
(a) (b) (c)
10.5.3 Some Transformations

Transformation is important when the regression relationship is non-linear. The linearity of a model can be checked by drawing a two-way scatter plot: a linear regression model will display a linear (straight-line) function. Common transformation functions are the logarithm and the inverse, applied to x or y. The following are a few transformation examples that change some non-linear functions into linear form.
Table 10.1: Some Transformations

Functional form relating y to x       Transformation             Linear regression model form
Exponential: y = β₀e^(β₁x)            y* = ln y                  y* = ln β₀ + β₁x
Power:       y = β₀x^(β₁)             y* = log y; x* = log x     y* = log β₀ + β₁x*
Inverse:     y = β₀ + β₁(1/x)         x* = 1/x                   y = β₀ + β₁x*
Hyperbolic:  y = x/(β₀ + β₁x)         y* = 1/y; x* = 1/x         y* = β₀x* + β₁
A two-way scatter plot is very useful for ascertaining whether a model has a linear or non-linear form. Hence, it is good to know the shapes of the Exponential, Power, Inverse and Hyperbolic functions (refer to Figure 10.8). Observe Figure 10.8 for the chosen transformations.
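As an illustration of the first row of Table 10.1, the following Python sketch (synthetic data, not from the source) linearises an exponential relationship by taking logs; because no noise is added, the fitted line recovers the original coefficients exactly:

```python
import math

# Synthetic exponential data: y = b0 * e^(b1 * x) with b0 = 2, b1 = 0.5.
b0_true, b1_true = 2.0, 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [b0_true * math.exp(b1_true * x) for x in xs]

# Transform: y* = ln y turns the model into y* = ln(b0) + b1 * x.
ystar = [math.log(y) for y in ys]

# Least squares fit of y* on x (centered form of equation 10.8).
n = len(xs)
xb = sum(xs) / n
yb = sum(ystar) / n
b1 = sum((x - xb) * (y - yb) for x, y in zip(xs, ystar)) / \
     sum((x - xb) ** 2 for x in xs)
b0 = math.exp(yb - b1 * xb)          # undo the log on the intercept
print(b1, b0)                        # recovers 0.5 and 2.0
```

The same pattern (transform, fit a straight line, back-transform the coefficients) applies to the other rows of Table 10.1.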
EXERCISE 10.6

Draw a two-way plot for each of the following regression models, then transform each model to obtain its linear regression form.

(a) y = 2.67 − 0.68(1/x)

(b) y = 2e^(3.1x)

(c) y = 1.5x^(0.85)

(d) y = x / (0.4 + 2x)
(a) Exponential Function
(b) Power Function
(c) Inverse Function
(d) Hyperbolic Function

Figure 10.8: Functional forms
10.6 PREDICTION AND ESTIMATION USING REGRESSION MODEL
One of the reasons to build a linear regression model is to predict the value of the dependent variable at future x values. For example, refer to the property development manager's problem of estimating the selling price (in RM) of each house built (refer to Section 10.2), using the regression model

ŷ = 25,000 + 90x,   x_a ≤ x ≤ x_b    (10.11)

where y = selling price and x = house size (in square feet). The x values lie between x_a and x_b. If we would like to predict the selling price of a house whose built-up area is 2,000 square feet, where 2,000 lies within this range, we can use the regression model with x = 2,000. Based on the regression equation, the manager can predict that the selling price of a 2,000 square foot house is RM205,000.

However, this selling price is a point estimate and says nothing about the position of that value with respect to the actual selling price. In other words, is the estimate close to the actual value or very different? This relates to the reliability of the prediction. To get information on the position of estimated values versus actual values, we need to use intervals. There are two types of interval used: the prediction interval for an individual value of the dependent variable y, and the confidence interval for the mean value of y.
10.6.1 Prediction Interval for an Individual Value of y

The prediction interval is used to predict a particular value of the dependent variable y given a specific value of the independent variable x, including when this x value is outside the range of the observed x values, that is, x < x_a or x > x_b. The term "prediction interval" is used rather than "confidence interval" because no population parameter is being estimated here; instead, the response of a single individual in
ACTIVITY 10.4
What are other situations that require prediction or estimation?
the population is being predicted. The formula for a (1 − α)100% prediction interval is:

ŷ ± t(α/2) s∈ √( 1 + 1/n + (x_g − x̄)² / Σ(xᵢ − x̄)² )    (10.12)

where

ŷ = future estimated value of the dependent variable, calculated from ŷ = β̂₀ + β̂₁x_g
t(α/2) = the critical value of the t distribution with n − 2 degrees of freedom at significance level α
s∈ = the standard error of the estimate
x_g = the specific value of the independent variable
x̄ = mean value of the independent variable
n = sample size

The standard error of the estimate, s∈, is a measure of the reliability of the estimated equation. It is an estimate of the standard deviation around the regression line, and is often referred to as the standard error of the regression: the degree of deviation of the observed values from the values estimated by the regression line. The formula for s∈ is:

s∈ = √( Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − 2) )    (10.13)
Worked Example 10.4

Refer to the data in Example 10.1. Calculate the 95% prediction interval for x = 20 and explain its meaning if y = sales and x = number of radio advertisements.

Answer:

From Example 10.1, the simple linear regression model is

ŷ = 21.82 + 2.92x

When x_g = 20, ŷ = 21.82 + 2.92(20) = 80.22.

To get the standard error of the estimate, we need the ŷ values, which can be generated using the regression model ŷ = 21.82 + 2.92x. Hence, the data are as in the table:

x   3      7      6      6      10     12
y   33     38     24     61     52     45
ŷ   30.58  42.26  39.34  39.34  51.02  56.86

x   12     12     13     13     14     15
y   29     65     82     63     50     79
ŷ   56.86  56.86  59.78  59.78  62.70  65.62

Hence,

s∈ = √( Σ(yᵢ − ŷᵢ)² / (n − 2) ) = √( 2556.946 / 10 ) = 15.99

Since α = 0.05, t(α/2) = t(0.025) = 2.228. Thus, the 95% prediction interval for x_g = 20 is

ŷ ± t(α/2) s∈ √( 1 + 1/n + (x_g − x̄)² / Σ(xᵢ − x̄)² )
= 80.22 ± (2.228)(15.99) √( 1 + 1/12 + (20 − 10.25)² / 160.25 )
= 80.22 ± 46.13

The lower and upper limits of the prediction interval are 34.09 and 126.35 respectively. This shows that predicted sales are at minimum about 34 units and at maximum about 126 units when 20 advertisements are broadcast on radio.
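Equation (10.12) and this example can be sketched in Python as follows (a sketch added here, not part of the original module; the quantities x̄ = 10.25, Σ(xᵢ − x̄)² = 160.25, s∈ = 15.99 and t(0.025, 10) = 2.228 are taken from the worked example):

```python
import math

b0, b1 = 21.82, 2.92
n, x_bar, sxx_c = 12, 10.25, 160.25   # sxx_c = sum((x_i - x_bar)^2)
s_e = 15.99                           # standard error of the estimate (10.13)
t_crit = 2.228                        # t(0.025, 10)

# Prediction interval (10.12) for an individual y at x_g = 20.
x_g = 20
y_hat = b0 + b1 * x_g                 # point prediction, 80.22
half = t_crit * s_e * math.sqrt(1 + 1 / n + (x_g - x_bar) ** 2 / sxx_c)
print(round(y_hat, 2), round(y_hat - half, 2), round(y_hat + half, 2))
```

Note the leading "1 +" under the square root: it accounts for the variability of a single observation, which is why the prediction interval is wider than the confidence interval for the mean in Section 10.6.2.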
10.6.2 Confidence Interval for a Mean Value of y

For the points on the least squares line corresponding to each x value in the population regression model y = β₀ + β₁x + ∈, y is normally distributed with mean
EXERCISE 10.7
Refer to data in Exercise 10.2. Calculate the 99% prediction interval for x = 86
and provide an explanation for it.
E(y) = β₀ + β₁x

Hence, to estimate the mean value of y given any value x_g, we can use the following interval:

ŷ ± t(α/2) s∈ √( 1/n + (x_g − x̄)² / Σ(xᵢ − x̄)² )    (10.14)

This interval applies when the specific value of x lies within the range of the independent variable x values, i.e. x_a ≤ x ≤ x_b.
Worked Example 10.5

Refer to the data in Example 10.1. Calculate the 95% confidence interval for the mean value of y when x_g = 11 and explain its meaning if y = sales and x = number of radio advertisements.

Answer:

The values of t(α/2) and s∈ can be obtained from Example 10.4, and ŷ = 21.82 + 2.92(11) = 53.94. Hence, the 95% confidence interval for the mean value of y when x_g = 11 is:

ŷ ± t(α/2) s∈ √( 1/n + (x_g − x̄)² / Σ(xᵢ − x̄)² )
= 53.94 ± (2.228)(15.99) √( 1/12 + (11 − 10.25)² / 160.25 )
= 53.94 ± 10.50

Note that the lower and upper confidence limits for the mean value of y are 43.44 and 64.44 respectively. This shows that mean sales are at minimum about 43 units and at maximum about 64 units when 11 radio advertisements are broadcast.
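Equation (10.14) can be sketched the same way (a sketch added here, not part of the original module; the sums are again taken from Example 10.4):

```python
import math

b0, b1 = 21.82, 2.92
n, x_bar, sxx_c = 12, 10.25, 160.25   # sxx_c = sum((x_i - x_bar)^2)
s_e, t_crit = 15.99, 2.228            # standard error and t(0.025, 10)

# Confidence interval (10.14) for the mean of y at x_g = 11.
x_g = 11
y_hat = b0 + b1 * x_g                 # 53.94
half = t_crit * s_e * math.sqrt(1 / n + (x_g - x_bar) ** 2 / sxx_c)
print(round(y_hat - half, 2), round(y_hat + half, 2))
```

Compared with the prediction interval of Example 10.4, the square root here lacks the leading "1 +", so the interval for the mean (± 10.50) is much narrower than the interval for an individual value (± 46.13).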
EXERCISE 10.8
Refer to Exercise 10.2. Calculate the 99% confidence interval for mean y
when x = 69 and provide an explanation on it.
EXERCISE 10.9
1. Determine whether the following statements are true or false. If false,
write down the correct statement.
(a) Regression analysis is used to display the validity of an estimated
equation that explains the relationship of subjects under study.
(b) Given a straight-line equation y = a – bx, we can say the
relationship between y and x is positive linear.
(c) R2 value approaching zero indicates a strong correlation between x
and y.
(d) Regression line is derived from sampling, not from population
under study.
2. Fill in the blanks for each of the following statements:
(a) If the value of a dependent variable decreases when the value of an
independent variable increases, their relationship is
_____________________.
(b) Each straight line has ___________________, which explains how
much the change in dependent variable given a unit increase in
independent variable.
(c) The least squares method is used to get _______________ and
__________________ of regression line.
(d) If the coefficient of determination is 0.80, this means 80% of
variation in the dependent variable ______________ by variation
in the independent variable.
3. An economist would like to know the influence of interest rate on total
investments made by a company. He has collected some data for a
duration of eight months and it is displayed in the table below:
Investment (RM Million)   1.8  1.8   2.1  2.2  2.8   3.1  3.6  4.1
Interest Rate (%)         9.9  10.5  9.6  9.8  12.1  9.2  9.5  7.7
(a) State the dependent and independent variables.

(b) Without using a two-way scatter plot, is the regression slope positive or negative? Explain its meaning.

(c) By how much will the investment value change for a unit change in the interest rate?

(d) How much of the variation in investment can be explained by variation in the interest rate?

4. Many people assume that the total amount of money saved depends on total income. The following data show the average saving per month (RM'00) and average income per month (RM'00) for various groups of employees.

Income   19   22   27   30   36   43   47   51   61   64
Saving   1.0  1.4  1.8  2.4  3.0  3.8  4.3  4.5  5.8  6.3

(a) Fit a simple linear regression model to this data.

(b) How much money will a person save if his monthly income is RM4,500?

(c) Test the significance of the regression slope at the 5% significance level. Use a one-sided test and explain the reason for using this test.

(d) Obtain a 95% confidence interval for β₁.

5. For the following data:

x   1   2   3   7    10   11   14   14   16   17
y   20  30  50  100  150  200  260  400  400  700

(a) Calculate the residuals.

(b) Plot the residuals versus x and ŷ.

(c) What can you conclude from the plot in (b)?

(d) Is the normality assumption for the random errors satisfied?

(e) Is there any sign of violation of the model assumptions? If yes, what needs to be done?

(f) Obtain the appropriate transformation.
ACTIVITY 10.5
Please visit the following websites to read more about simple linear regression:
• http://www.pinkmonkey.com/studyguides/subjects/stats/chap8/s0808n01.asp
• http://stat.tamu.edu/stat30x/notes/node155.html
• Simple linear regression is a technique used to analyse the relationship between two variables. Regression analysis assumes that the relationship is linear in form.

• The least squares method is used to obtain parameter estimates for the slope of the regression line and its intercept on the y-axis.

• A hypothesis test is performed on the slope of the regression line to determine whether there is enough evidence to support the existence of a linear relationship. The strength of the relationship between the two variables can be assessed using the coefficient of determination.

• Model adequacy checking, that is, checking for violations of the assumptions, can be performed using the residual plot and histogram.

• If the simple linear regression model obtained is adequate for the data, the model can be used to estimate the dependent variable value for any specific independent variable value. The model can also be used to predict the mean value of the dependent variable.