8/9/2019 11 SSGB Amity BSI Regression
1/43
Module-11
Linear Regression
Linear & Polynomial Regression
Learning Objectives
At the end of this section delegates will be able to:
Understand the role of regression analysis within the Transactional DMAIC Improvement Process
Understand that regression can be used to explore the relationships between inputs (xs) and outputs (ys)
Linear & Polynomial Regression - Agenda
Regression within DMAIC
Review of Scatter Diagrams
Introduction to Regression
Linear Regression
Polynomial Regression
Summary
Six Sigma Transactional Improvement Process
[Figure: DMAIC roadmap (Define, Measure, Analyse, Improve, Control), illustrated with a capability histogram (LSL, USL), a process flowchart (Start, Process Steps, Decision, Stop), a strategy selection matrix and a control chart of y with Upper and Lower Control Limits. Each phase ends with a Phase Review.]
Define: Select Project; Define Project Objective; Form the Team; Map the Process; Identify Customer Requirements; Identify Priorities; Update Project File; Phase Review
Measure: Define Measures (ys); Check Data Integrity; Determine Process Stability; Determine Process Capability; Set Targets for Measures; Phase Review
Analyse: Develop Detailed Process Maps; Identify Critical Process Steps (xs) by looking for Process Bottlenecks, Rework / Repetition, Non-value Added Steps and Sources of Error / Mistake; Map the Ideal Process; Identify gaps between current and ideal; Phase Review
Improve: Brainstorm Potential Improvement Strategies; Select Improvement Strategy; Plan and Implement Pilot; Verify Improvement; Implement Countermeasures; Phase Review
Control: Control Critical xs; Monitor ys; Validate Control Plan; Identify further opportunities; Close Project; Phase Review
Strategy selection matrix:
Criteria   A   B   C   D
Time       +   s   -   +
Cost       +   -   +   s
Service    -   +   -   +
Etc        s   s   -   +
Scatter Diagram Revisited
Purpose
To show:
How one variable changes in response to changes in another.
The nature of the relationship between two variables.
The strength of the relationship between two variables.
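Scatter Diagram
Suspected Cause: Amount Overweight
Suspected Effect: Health Index
Amount      Health  |  Amount      Health  |  Amount      Health
Overweight  Index   |  Overweight  Index   |  Overweight  Index
 50         .53     |   49         .45     |   70         .43
 81         .32     |   70         .50     |  121         .15
117         .10     |   89         .25     |   80         .49
100         .13     |   70         .34     |   40         .65
 68         .59     |  115         .18     |   75         .22
 77         .40     |   52         .60     |   35         .58
112         .28     |   90         .42     |  100         .35
[Figure: schematic scatter diagram with Suspected Cause and Suspected Effect axes, each running from Low to High]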
Scatter Diagram
[Figure: scatter diagram of Health Index (.10 to .60) against Amount Overweight (45 to 125)]
Scatter Diagram
Positive Correlation: An increase in Y may depend on increases in X. If X is controlled, Y could be controlled.
Possible Positive Correlation: If X is increased, Y may increase somewhat, but Y seems to have other causes than X.
No Correlation: There is no correlation.
Possible Negative Correlation: If X is increased, Y may decrease somewhat, but Y seems to have other causes than X.
Negative Correlation: An increase in X may cause a decrease in Y. Therefore, if X is controlled, Y could be controlled.
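The strength and direction of a scatter pattern can be put into a single number. The sketch below computes the Pearson correlation coefficient r, which is a standard statistic but is not named on these slides, for the Amount Overweight and Health Index pairs as they can be read from the earlier data slide. The function name pearson_r and the variable names are illustrative only.

```python
from math import sqrt

# (Amount Overweight, Health Index) pairs as read from the data slide.
pairs = [
    (50, 0.53), (81, 0.32), (117, 0.10), (100, 0.13), (68, 0.59),
    (77, 0.40), (112, 0.28), (49, 0.45), (70, 0.50), (89, 0.25),
    (70, 0.34), (115, 0.18), (52, 0.60), (90, 0.42), (70, 0.43),
    (121, 0.15), (80, 0.49), (40, 0.65), (75, 0.22), (35, 0.58),
    (100, 0.35),
]


def pearson_r(points):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(pairs)
# A value near -1 indicates a negative correlation, near +1 a positive one,
# and near 0 no linear correlation.
print(f"r = {r:.2f}")
```

For this data r comes out clearly negative: Health Index tends to fall as Amount Overweight rises. As with the scatter diagram itself, this does not prove cause and effect.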
Scatter Diagram: Risks & Limitations
Does not prove anything
Both axes should be of equal length
Conclusions must not be made outside the experimental range
Experimental range should be wide enough to draw useful conclusions
Regression
At the heart of Six Sigma activities is identifying which inputs or process steps cause unwanted variation in process outputs
y = f(x)
Regression analysis will allow us to determine which inputs (xs) influence our output or outputs (ys)
We can sometimes use regression analysis to build a mathematical model which can be used to predict the value of our outputs
Why do we use Regression?
Regression analysis is used when we wish to determine the relationship between two or more continuous variables
In Six Sigma activities we often need to understand the relationship between our output (y) and our critical xs
In Problem Solving activities this may help us to discover root causes
In Process Improvement activities, regression analysis will allow us to find optimal settings for our critical xs
[Figure: plot of y against x]
Regression Exercise
Exercise: consider the following pairs of measures. Could we draw a line which might summarise the relationship (regression) between them?
Volume of Speech and Alcohol Volume Consumed
Site Location and Quantity of Defects
Shipping Defects and Customer Distance from Distribution Depot
WIP and Yield
Education and Salary
Age and Beauty
Sales and Advertising
Pick Errors and Cycle time
Sales Representative and Sales value
Goals Scored per Season and Purchase Price of the Player
Quantity Sold and Selling Price
Speed of Query Resolution and Experience of the Operator
Linear Regression
The simplest form of regression is single variable linear regression
y is the dependent variable
x is the independent variable
The equation for linear regression is:
$y = \beta_0 + \beta_1 x + \text{error}$
$\beta_0$ is the intercept
$\beta_1$ is the slope
[Figure: plot of y against x]
Linear Regression - Example
A finance department is carrying out an investigation into the number of errors that are generated on customer invoices.
They suspect that the number of errors may be affected by the volume of invoicing on any particular day.
The following slide shows the data for the past 50 working days.
Linear Regression - Example
Volume  Errors  |  Volume  Errors  |  Volume  Errors
175      3      |  178      6      |  155      2
173      3      |  155      3      |  165      5
201      7      |  186      5      |  170      3
297     28      |  201      7      |  198      5
165      5      |  241      8      |  276     26
193      2      |  174      8      |  209      3
162      0      |  163      4      |  186      4
271     17      |  207     10      |  288     23
179      5      |  188      6      |  176      3
197      8      |  154      1      |  208      4
162      2      |  163      3      |  163      5
265     14      |  178      5      |  173      3
221      9      |  210      9      |  174      1
165      6      |  263     13      |  223     11
154      5      |  162      3      |  196      7
199      5      |  165      3      |  241     10
                |  224      6      |  283     26
Scatter Diagram
Open Worksheet: Invoicing Errors
Enter Errors in Y and Volume in X, then click OK
Scatter Diagram
A scatter diagram reveals that there may be a relationship between the number of errors and the volume of invoices. A regression analysis will reveal the existence and/or the strength of the relationship.
[Figure: Scatterplot of Errors vs Volume (Errors 0 to 30 against Volume 150 to 300)]
Linear Regression: Least Squares Method
We first need to establish the equation for the best fitting line, that is, the line which minimises the sum of the squared differences between the observed y values and the y values predicted by the line. In short, this is known as the least squares method.
[Figure: Scatterplot of Errors vs Volume (Errors 0 to 30 against Volume 150 to 300)]
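The least squares line can also be worked out by hand. The sketch below applies the usual closed-form formulas for a single x (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x) to the 50 Volume and Errors pairs from the data slide. It is a minimal illustration rather than a description of how Minitab computes the fit, and the variable names are illustrative.

```python
# (Volume, Errors) pairs for the 50 working days, as listed on the data slide.
invoices = [
    (175, 3), (173, 3), (201, 7), (297, 28), (165, 5), (193, 2), (162, 0),
    (271, 17), (179, 5), (197, 8), (162, 2), (265, 14), (221, 9), (165, 6),
    (154, 5), (199, 5),
    (178, 6), (155, 3), (186, 5), (201, 7), (241, 8), (174, 8), (163, 4),
    (207, 10), (188, 6), (154, 1), (163, 3), (178, 5), (210, 9), (263, 13),
    (162, 3), (165, 3), (224, 6),
    (155, 2), (165, 5), (170, 3), (198, 5), (276, 26), (209, 3), (186, 4),
    (288, 23), (176, 3), (208, 4), (163, 5), (173, 3), (174, 1), (223, 11),
    (196, 7), (241, 10), (283, 26),
]

n = len(invoices)
mean_x = sum(x for x, _ in invoices) / n
mean_y = sum(y for _, y in invoices) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - mean_x) ** 2 for x, _ in invoices)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in invoices)

# Least squares estimates of slope and intercept.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# The quantity that least squares minimises: the sum of squared residuals.
ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in invoices)

print(f"Errors = {intercept:.2f} + {slope:.4f} Volume")
print(f"Sum of squared residuals = {ss_residual:.2f}")
```

Any other straight line through this data would give a larger sum of squared residuals than the one printed here.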
Regression - Minitab
Open Worksheet: Invoicing Errors
Regression - Minitab
1. Enter Errors and Volume
2. Check Linear
Minitab Regression Plot
[Figure: Fitted Line Plot of Errors (0 to 30) against Volume (150 to 300)]
Errors = - 21.74 + 0.1465 Volume
S 2.98583
R-Sq 79.3%
R-Sq(adj) 78.9%
This is the equation for the best fit line. We can use it to predict: e.g. if we have 200 invoices we would predict:
-21.74 + 0.1465 (200) = 7.6 errors
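The prediction step can be sketched as a small function. The coefficients are the rounded values from the fitted line plot. The range check is an addition for illustration: it uses the smallest and largest daily volumes in the data (154 and 297) to flag predictions that fall outside the experimental range. The names predict_errors and VOLUME_RANGE are hypothetical.

```python
INTERCEPT = -21.74          # intercept of the fitted line
SLOPE = 0.1465              # additional errors per additional invoice
VOLUME_RANGE = (154, 297)   # smallest and largest daily volume in the data


def predict_errors(volume):
    """Return (predicted errors, True if volume lies inside the observed range)."""
    inside = VOLUME_RANGE[0] <= volume <= VOLUME_RANGE[1]
    return INTERCEPT + SLOPE * volume, inside


prediction, in_range = predict_errors(200)
print(f"Predicted errors at 200 invoices: {prediction:.1f} (in range: {in_range})")

# Outside the observed range the line can give meaningless answers,
# for example a negative number of errors at a volume of 100.
print(predict_errors(100))
```

This is why conclusions should not be drawn outside the experimental range: the straight line is only supported by data between the smallest and largest volumes observed.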
Minitab Regression Plot
[Figure: Fitted Line Plot of Errors (0 to 30) against Volume (150 to 300)]
Errors = - 21.74 + 0.1465 Volume
S 2.98583
R-Sq 79.3%
R-Sq(adj) 78.9%
The R-Squared and R-Squared (adjusted) tell us how much of the variation in Errors can be explained by the changes in Volume. Here it is around 79%.
Minitab Regression Plot
[Figure: Fitted Line Plot of Errors (0 to 30) against Volume (150 to 300)]
Errors = - 21.74 + 0.1465 Volume
S 2.98583
R-Sq 79.3%
R-Sq(adj) 78.9%
The s value is the standard error of the y values about the best fit line. It is the standard deviation of the residuals (the differences between actual and best-fit y values for each x)
Linear Regression Minitab Output
Regression Analysis: Errors versus Volume
The regression equation is
Errors = - 21.74 + 0.1465 Volume
S = 2.98583 R-Sq = 79.3% R-Sq(adj) = 78.9%
Analysis of Variance
Source DF SS MS F P
Regression 1 1642.07 1642.07 184.19 0.000
Error 48 427.93 8.92
Total 49 2070.00
A p value of
Analysis of Variance (ANOVA) for Linear Regression
The analysis of variance divides up the total variation in y (errors) into its constituent parts.
Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F-Ratio
Regression                             1          1642.07       1642.07    184.19
Residual (Error)                      48           427.93          8.92
Total                                 49          2070.00
We can learn a lot from this table:
1. What is the overall variation in y?
2. Is there a significant relationship between y and x?
3. How much of the variation in y is due to changes in x?
4. How much variation in y is still unexplained?
5. How accurate is my prediction of y for a given value of x?
What is the overall variation in y?
Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F-Ratio
Regression                             1          1642.07       1642.07    184.19
Residual (Error)                      48           427.93          8.92
Total                                 49          2070.00
The total variation in y is given by the Total Sum of Squares = 2070.00
The Total Sum of Squares = $\sum (y - \bar{y})^2$
The total mean square = The total sum of squares / Total Degrees of Freedom
$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1} = \frac{2070.00}{49} = 42.245$
$s = \sqrt{42.245} = 6.5$
Check this out by calculating the standard deviation of the 50 error results
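The suggested check can be sketched with Python's standard library, using the 50 Errors values from the data slide. statistics.stdev divides by n - 1, so it corresponds to the square root of the total mean square.

```python
from statistics import mean, stdev

# The 50 Errors values from the data slide.
errors = [
    3, 3, 7, 28, 5, 2, 0, 17, 5, 8, 2, 14, 9, 6, 5, 5,
    6, 3, 5, 7, 8, 8, 4, 10, 6, 1, 3, 5, 9, 13, 3, 3, 6,
    2, 5, 3, 5, 26, 3, 4, 23, 3, 4, 5, 3, 1, 11, 7, 10, 26,
]

y_bar = mean(errors)

# Total Sum of Squares: squared deviations of each y from the mean of y.
ss_total = sum((y - y_bar) ** 2 for y in errors)

# Total mean square: Total Sum of Squares over n - 1 degrees of freedom.
total_mean_square = ss_total / (len(errors) - 1)

print(f"Total Sum of Squares = {ss_total:.2f}")
print(f"Total mean square    = {total_mean_square:.3f}")
print(f"Standard deviation   = {stdev(errors):.2f}")
```

The standard deviation of the raw Errors values agrees with the square root of the total mean square, which is what the Total row of the ANOVA table summarises.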
Is there a significant relationship between y and x?
Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F-Ratio
Regression                             1          1642.07       1642.07    184.19
Residual (Error)                      48           427.93          8.92
Total                                 49          2070.00
We can test the significance of the relationship between y and x by examining the F-Ratio. The F-Ratio is named after Sir Ronald Fisher, who devised this test for comparing variances.
F-Ratio = Regression Mean Square / Residual Mean Square = 1642.07 / 8.92 = 184.19
Examining the F tables for F0.05,1,48 gives a value of 4.03.
Our value of 184.19 is greater than 4.03 so we can conclude that there is a statistically significant relationship between y and x.
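The F-Ratio arithmetic can be reproduced from the sums of squares and degrees of freedom in the ANOVA table. The sketch below works from the unrounded residual mean square, which is why it gives 184.19, whereas dividing by the rounded 8.92 gives about 184.09. The critical value is the one quoted from the F tables on the slide, and the variable names are illustrative.

```python
# Values taken from the ANOVA table.
ss_regression, df_regression = 1642.07, 1
ss_residual, df_residual = 427.93, 48

# Mean square = sum of squares divided by its degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F-Ratio compares variation explained by the regression with unexplained variation.
f_ratio = ms_regression / ms_residual

F_CRITICAL = 4.03  # F(0.05; 1, 48) as quoted from the F tables on the slide

print(f"Residual mean square = {ms_residual:.2f}")
print(f"F-Ratio = {f_ratio:.2f}")
print("Significant at the 5% level" if f_ratio > F_CRITICAL else "Not significant")
```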
Is there a significant relationship between y and x?
Analysis of Variance
Source       DF       SS       MS       F      P
Regression    1  1642.07  1642.07  184.19  0.000
Error        48   427.93     8.92
Total        49  2070.00
Minitab gives a P value as the outcome of a Hypothesis Test:
H0: The regression is not significant (i.e. variation in the x is not significant in explaining the variation in the y)
H1: The regression is significant
Minitab's P value is the probability that we would get an F value at least this large if the Null Hypothesis were true
Since it is below 0.05 we can conclude with at least 95% Confidence that the number of errors is influenced by the volume of invoices processed
How much variation in y is explained by changes in x?
Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F-Ratio
Regression                             1          1642.07       1642.07    184.19
Residual (Error)                      48           427.93          8.92
Total                                 49          2070.00
SSTOTAL = SSREGRESSION + SSRESIDUAL
SSTOTAL = Total Sum of Squares = Total variability in y values.
SSREGRESSION = Regression Sum of Squares = the amount of variability in the y values explained by the regression relationship.
SSRESIDUAL = Residual Sum of Squares (or Error Sum of Squares) = the amount of variability in the y values not accounted for by the regression relationship.
How much variation in y is explained by changes in x?
The Coefficient of Determination R2
The coefficient of determination is normally expressed as a percentage. It represents the percentage of the total variability accounted for by the regression relationship. It can also be used to test whether the regression accounts for a statistically significant amount of the total variability.
$R^2 = \frac{SS_{REGRESSION}}{SS_{TOTAL}} = \frac{1642.07}{2070.00} = 0.79$
Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F-Ratio
Regression                             1          1642.07       1642.07    184.19
Residual (Error)                      48           427.93          8.92
Total                                 49          2070.00
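Both R-Sq figures reported by Minitab can be reproduced from the ANOVA table. R-Sq is the ratio of sums of squares shown above. For R-Sq(adj) the sketch assumes the usual textbook definition, one minus the ratio of the residual mean square to the total mean square, which is not spelled out on the slides.

```python
# Values taken from the ANOVA table.
ss_regression = 1642.07
ss_residual, df_residual = 427.93, 48
ss_total, df_total = 2070.00, 49

# Coefficient of determination: share of total variability explained by the regression.
r_sq = ss_regression / ss_total

# Adjusted R-Sq (assumed textbook definition): compares mean squares rather than
# sums of squares, so it allows for the degrees of freedom used by the model.
r_sq_adj = 1 - (ss_residual / df_residual) / (ss_total / df_total)

print(f"R-Sq      = {r_sq:.1%}")
print(f"R-Sq(adj) = {r_sq_adj:.1%}")
```

With a single x and 50 data points the two values are close; the gap widens when a model uses more terms relative to the amount of data.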
How much variation in y is still unexplained?
Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F-Ratio
Regression                             1          1642.07       1642.07    184.19
Residual (Error)                      48           427.93          8.92
Total                                 49          2070.00
The Residual (Error) term provides us with information concerning the amount of variation in y which is not accounted for by the regression.
The square root of the residual mean square is the standard error of y about the regression equation.
$\sqrt{MS_{Error}}$ = standard error of y about x
We can use the standard error to calculate confidence intervals for y values for any given value of x.
Residuals
Residuals are the differences between the observed values of y and the predicted values based on the regression model.
$\sqrt{MS_{Error}}$ = standard error of y about x
[Figure: Scatterplot of Errors vs Volume (Errors 0 to 30 against Volume 150 to 300), with the Actual value, Predicted value and Residual marked for one point]
Examination of Residuals
         Observed   Predicted ŷ            Residual   Squared residual
  x         y       = -21.74 + 0.1465x     (y - ŷ)    (y - ŷ)²
155         2       0.9675                  1.0325     1.066
165         5       2.4325                  2.5675     6.592
170         3       3.1650                 -0.1650     0.027
  *         *       *                       *          *
199         5       7.4135                 -2.4135     5.825
Sum of squared residuals over all 50 rows = 427.93 = SSRESIDUAL
Residuals are the differences between the observed values of y and the predicted values based on the regression model. If there was no difference between these two entities, then we would have a perfect model. In reality, this is unlikely to occur.
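The residual calculation for individual rows can be sketched as follows, using the rounded coefficients from the regression equation and the four (x, y) rows that are visible in the table. The residual is taken here as observed minus predicted, which is the usual convention, so signs are reversed relative to any tabulation of predicted minus observed. The function name residual_row is illustrative.

```python
INTERCEPT = -21.74   # from the regression equation
SLOPE = 0.1465


def residual_row(x, y):
    """Return (predicted y, residual, squared residual) for one observation."""
    predicted = INTERCEPT + SLOPE * x
    residual = y - predicted          # observed minus predicted
    return predicted, residual, residual ** 2


# The (x, y) rows visible in the table.
rows = [(155, 2), (165, 5), (170, 3), (199, 5)]

print(f"{'x':>4} {'y':>3} {'predicted':>10} {'residual':>9} {'squared':>8}")
for x, y in rows:
    predicted, residual, squared = residual_row(x, y)
    print(f"{x:>4} {y:>3} {predicted:>10.4f} {residual:>9.4f} {squared:>8.3f}")
```

Summing the squared residuals over all 50 observations, rather than just these four, gives the Residual Sum of Squares in the ANOVA table.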
Residuals vs Fits
By examining the residual plot we can check for:
Lack of fit (model inadequacy)
Non-constant variability
When we have sufficient data points, a normality test can also be carried out. The distribution of residuals should be normal if the model is a good fit to the data.
[Figure: Residuals Versus the Fitted Values (response is Errors); Residual (-5.0 to 7.5) against Fitted Value (0 to 20)]
Normality of Residuals
In this case a Normality Test of the Residuals shows that they are Normal (p value > 0.05)
[Figure: Probability Plot of RESI1 (Normal); Percent (1 to 99) against RESI1 (-8 to 8)]
Mean 3.055334E-15
StDev 2.955
N 50
AD 0.342
P-Value 0.479
How accurate is my prediction of y?
Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F-Ratio
Regression                             1          1642.07       1642.07    184.19
Residual (Error)                      48           427.93          8.92
Total                                 49          2070.00
$\sqrt{MS_{Error}}$ = standard error of y about x
Statistical software programs will use the error mean square to calculate confidence intervals when predicting y for a given value of x. We can obtain confidence intervals for the predicted mean value and also for the predicted individual values.
How accurate is my prediction of y?
Open Worksheet: Invoicing Errors
How accurate is my prediction of y?
1. Enter Errors and Volume
2. Check Linear
3. Click on Options
How accurate is my prediction of y?
Tick both
Display Options
How accurate is my prediction of y?
[Figure: Fitted Line Plot of Errors (-10 to 30) against Volume (150 to 300), showing the Regression line with 95% CI and 95% PI bands]
Errors = - 21.74 + 0.1465 Volume
S 2.98583
R-Sq 79.3%
R-Sq(adj) 78.9%
95% Confidence Intervals show the range of values we expect for the average value of errors for any particular volume of invoices being processed
95% Prediction Intervals show the range of values within which we expect 95% of the individual error values to fall if we use the regression equation to predict them
Precise values can be obtained within the Stat > Regression > Regression menu
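For readers without Minitab to hand, the two kinds of interval can be approximated with the standard textbook formulas for simple linear regression. These formulas, and the t value of 2.0106 for 48 degrees of freedom taken from a standard t table, are assumptions that do not appear on the slides. The coefficients and S are the rounded values from the Minitab output, so results may differ slightly from Minitab's own.

```python
from math import sqrt

# The 50 Volume values from the data slide (needed for the mean and Sxx of x).
volumes = [
    175, 173, 201, 297, 165, 193, 162, 271, 179, 197, 162, 265, 221, 165,
    154, 199, 178, 155, 186, 201, 241, 174, 163, 207, 188, 154, 163, 178,
    210, 263, 162, 165, 224, 155, 165, 170, 198, 276, 209, 186, 288, 176,
    208, 163, 173, 174, 223, 196, 241, 283,
]

INTERCEPT, SLOPE = -21.74, 0.1465   # from the regression equation
S = 2.98583                         # standard error of y about the line
T_CRIT = 2.0106                     # t(0.975; 48 df), assumed from a t table

n = len(volumes)
mean_x = sum(volumes) / n
s_xx = sum((x - mean_x) ** 2 for x in volumes)


def intervals(x0):
    """Return (fit, CI half-width, PI half-width) at volume x0."""
    fit = INTERCEPT + SLOPE * x0
    leverage = 1 / n + (x0 - mean_x) ** 2 / s_xx
    ci_half = T_CRIT * S * sqrt(leverage)        # for the average value of y
    pi_half = T_CRIT * S * sqrt(1 + leverage)    # for an individual value of y
    return fit, ci_half, pi_half


fit, ci_half, pi_half = intervals(200)
print(f"Fit at 200 invoices: {fit:.2f}")
print(f"95% CI: {fit - ci_half:.2f} to {fit + ci_half:.2f}")
print(f"95% PI: {fit - pi_half:.2f} to {fit + pi_half:.2f}")
```

The prediction interval is much wider than the confidence interval because it covers the scatter of individual days about the line, not just the uncertainty in where the line itself lies. Both intervals widen as x0 moves away from the mean volume.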
Regression Exercises
Question 1:
A company developing healthcare software solutions is bidding for a new contract and has historical data on similar previous contracts. It wants to minimise the risk of failing to deliver the solution on time, so wants a good estimate of the man-years of effort needed (the output measure, or y).
The variables previously recorded are the number of application sub-programs written (x1), and the number of software configuration change proposals implemented (x2).
Use regression to:
1. Investigate the relationship between x1 and the man-years required
2. Investigate the relationship between x2 and the man-years required
3. If the company estimates that 150 application sub-programs will be required, and there are likely to be 100 software configuration change proposals implemented, what would be your recommendation for the number of man-years they should estimate?
Data is in Minitab Worksheet: Transactional Regression Exercises.mtw
Regression Exercises
Question 2:
The team investigating the Expense Claims process have identified a potential input variable (x) that they believe could affect the amount of time taken to pay the claims. The potential variable is the amount of money claimed, and they have gathered data on amounts claimed for the 100 payment times they already had. Use Regression Analysis to investigate the relationship, and be prepared to advise the team on your conclusions.
Data is in Minitab Worksheet: PAYMENT TIMES.mtw
Summary - Linear & Polynomial Regression
Regression Analysis can be used to identify xs that are affecting the ys
A linear or polynomial regression model of y=f(x) can be developed for individual xs
The model can be tested to see if it is significant and how well it fits the data
The model can be used to make predictions of y for given values of x
Regression is used much more extensively in operational and DFSS activities