Thoratec
Workshop in Applied Statistics for QA/QC, Mfg, and R&D
Part 1 of 3: Basic Statistical Concepts
Instructor: John Zorich (www.JOHNZORICH.COM, [email protected])
Part 1 was designed for students who know high-school algebra but who have never had a college-level statistics course.
John Zorich's Qualifications:
• 20 years as a "regular" employee in the medical device industry (R&D, Mfg, Quality)
• ASQ Certified Quality Engineer (since 1996)
• Statistical consultant and trainer (since 1999) for many companies, including Siemens Medical, Boston Scientific, Stryker, and Novellus
• Instructor in applied statistics for Ohlone College, Silicon Valley Polytechnic Institute, and KEMA/DEKRA
• Past instructor in applied statistics for UC Santa Cruz Extension, ASQ Silicon Valley Biomedical Group, & TUV
• Publisher of 9 commercial, formally validated statistical application Excel spreadsheets that have been purchased by over 80 companies worldwide. Applications include: Reliability, Normality Tests & Normality Transformations, Sampling Plans, SPC, Gage R&R, and Power.
You’re invited to “connect” with me on LinkedIn.
Objectives
PART 1 (today's topics): Obtain an understanding of BASIC statistics in general (its vocabulary, methods, & uses) as needed to understand Parts 2 and 3.
PART 2 (not today's topics): Learn INTERMEDIATE statistical applications and tests as needed to understand Part 3; also includes "reliability calculations", power calculations, and sample size determinations.
PART 3 (not today's topics): Become familiar with commonly used ADVANCED statistical applications (Reliability Plotting, Sampling Plans, SPC, Process Capability calculations, Equipment Control).
Self-teaching & Reference Texts
RECOMMENDED by John Zorich
Clements: Handbook of Statistical Methods in Manufacturing
Kaminsky et al.: Statistics and Quality Control for the Workplace
Mlodinow: The Drunkard’s Walk --- How Randomness Rules Our Lives
Motulsky: Intuitive Biostatistics
NIST Engineering Statistics Internet Handbook (free), at... http://www.itl.nist.gov/div898/handbook/index.htm
Philips: How to Think about Statistics
Main Topics in Today's Workshop
• Regulatory Requirements
• Population vs. Sample
• Parameter vs. Statistic
• Probability
• Law of Large Numbers
• Distributions (Charting and Graphing)
• Binomial Distribution
• Hypergeometric Distribution
• Normal Distribution
• Central Limit Theorem
• Standard Deviation and Standard Error
• Linear Regression & Correlation Coefficients
Regulatory Requirements
ISO 9001:2008 (8.1) and ISO 13485:2003 (8.1): "The organization shall plan and implement the monitoring, measurement, analysis and improvement processes needed to demonstrate conformity [to requirements].... This shall include determination of applicable methods, including statistical techniques, & the extent of their use."
21CFR820.250 (FDA): "Where appropriate, each manufacturer shall establish and maintain procedures for identifying valid statistical techniques required for establishing, controlling, and verifying the acceptability of process capability and product characteristics."
Population vs. Sample (as used in this class...)
Sample means part of a Population. The sample could be the part of an individual batch or lot that was purchased or produced; you inspect the sample prior to applying an "approved" label to the entire batch or lot.
"Representative Sample": a sample that represents the population --- it is typically not a "Random Sample" but rather is usually taken evenly from throughout the population (e.g., a few items taken from each box in the batch).
"Sample size" can be anything from 1 to over 1,000,000. The term "one sample" or "a single sample" means the entire sample, no matter what the sample size is.
Parameter vs. Statistic (as used in this class...)
Statistic is a mathematical summary value calculated from data taken from a Sample. All of the following are statistics:
• Avg thickness of every 100th cable produced last week
• Range of thicknesses in that sample
• Median thickness in that sample
Parameter is a mathematical summary value calculated from data taken from the entire Population; that is, from every data point in the entire population (e.g., average thickness of all cables produced last week).
"Statistics" as a science is the mathematical analysis of "statistics", not of parameters. Statistics is the science of
using "statistics" to guesstimate "parameters".
As a group, let's discuss...
Which are parameters and which are statistics?
1. Baseball "stats"
Answer: Parameters, because baseball "stats" are calculated using all the data.
2. United States Census data
Answer: Some are statistics, because they are just a sample of the population (this is the preference of Democrats), whereas others are parameters, because we attempt to count the entire US population (this is the preference of Republicans).
3. Average age of the people in this class
Answer: It depends -- is this class a population or a sample?
Probability (as used in this class) means...
The same as "chance" or "odds", but not based on a hunch or intuition or on what has historically occurred.
• The following statements are not using probability in the sense we mean here today:
He'll probably come home before 9pm. They'll probably win tonight's game. They haven’t won a game in 6 weeks---they’re due!!
• Those are examples of “Adverbial Probability” (see Inductive Logic, by Hibbens, 1896, chapter 15)
• Instead, the Science of Statistics uses... “Mathematical Probability”.
“Mathematical Probability” is the same as the "theoretical expected frequency", that is, the # of times one type of event would happen (if no cheating occurs) divided by the total number of all possible equi-probable events; e.g.,...
Probability
1:1 = "Fifty-Fifty" = 1 / 2 = 0.50 = 50 %
Those terms (above) all mean the same thing.They all mean that you have the same chance at winning as you have at losing, as opposed to...
1 / 4 = 0.25 = 25 % chance or odds of...
1 / 10 = 0.10 = 10 % probability of...
1 / 3 = 0.3333 (rounded) = 33.33 %
1 / 6 = 0.1667 (rounded) = 16.67 %
(By definition...) Probability / chance / likelihood... never can exceed 1.00 = 100%, and never can be less than 0.00 = 0%
Probability
[Chart: Probability of rolling a given number on 1 toss of 1 die (Null Hypothesis: the die is "honest"). X-axis: number observed on the die (1 through 6); Y-axis: probability. All six bars have the same height, 1/6 ≈ 0.167.]
On a single die, the chance of one number appearing face up is the same as that of any other number.
Probability
[Chart: Probability of observing a given sum on 1 toss of 2 dice (Null Hypothesis: the dice are "honest"). X-axis: sum of the numbers observed on the 2 dice (2 through 12); Y-axis: probability. The bars peak at a sum of 7, with probability 6/36 ≈ 0.167.]
These probabilities can be calculated by enumeration (instructor has a demo file on this).
Probability
[Chart: Probability of observing heads on a flip of 1 coin (Null Hypothesis: the coin is "honest"). X-axis: number of heads observed (0 or 1); Y-axis: probability. Both bars are at 0.50.]
The probability of having heads come up on a toss of one coin is 0.50. "Tails" has the same probability.
Probability
[Chart: Probability of multiple heads in one toss of many coins (Null Hypothesis: the coins are honest, i.e., probability of heads = 0.50). X-axis: number of observed heads (0 to 30); Y-axis: probability, peaking at 15 heads.]
Instructor has an Excel file that generates a series of charts like this one, based on tosses of up to several hundred coins.
When tossing 30 honest coins, the "true" average is 15 heads, but by chance we may see some other result.
Probability of Independent Events
The MULTIPLICATIVE RULE:
The probability of Event A happening and Event B and Event C (assuming that they are independent events), is the multiplication of their probabilities: Pa x Pb x Pc (where Pa is the probability of Event A, and so on).
-- Class Exercise --
Let's try answering these questions:
The MULTIPLICATIVE RULE (examples): The chance of rolling 2 dice and obtaining a 5 on both of them is...
1 / 6 x 1 / 6 = 1 / 36 = 0.028 = 2.8%
The probability of flipping a coin 4 times and obtaining "heads" every time is... 1 / 2 x 1 / 2 x 1 / 2 x 1 / 2 = 1 / 16 = 0.0625 = 6.25%
Let's try that (flipping 4 coins & counting heads)
The likelihood of drawing 3 good parts from a lot of 100 million parts, 99% of which are good, is...
0.99 x 0.99 x 0.99 = 0.9703 = 97.03%
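These multiplicative-rule examples can be verified with a few lines of code; Python is used here only for illustration (the workshop itself relies on Excel):

```python
# Multiplicative rule: P(A and B and C) = Pa * Pb * Pc for independent events.

# Chance of rolling a 5 on both of 2 dice:
p_two_fives = (1/6) * (1/6)

# Chance of 4 heads in 4 coin flips:
p_four_heads = (1/2) ** 4

# Chance of drawing 3 good parts from a huge lot that is 99% good
# (the lot is so large that the draws are effectively independent):
p_three_good = 0.99 ** 3

print(round(p_two_fives, 3))   # 0.028 (= 1/36)
print(p_four_heads)            # 0.0625 (= 1/16)
print(round(p_three_good, 4))  # 0.9703
```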
Probability
The MULTIPLICATIVE RULE (corollary): Conditional Probability: If the probability changes after each sampling event, then the separate probabilities are not identical, because they are "conditional" not "independent"; e.g.
What is the probability of drawing 3 good parts from a lot of 100 parts, 99% of which are good (that is 99 of which are good and one of which is bad)?
1st draw x 2nd draw x 3rd draw: 99/100 x 98/99 x 97/98 = 0.9700
The probability of a given draw is "conditional" based upon what happened in the previous draw.
( do not use: 99/100 x 99/100 x 99/100 = 0.9703 )
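The difference between conditional and (incorrectly assumed) independent draws is easy to check numerically; the sketch below (Python, for illustration) uses exact fractions:

```python
from fractions import Fraction

# Sampling WITHOUT replacement from a lot of 100 parts (99 good, 1 bad):
# each draw's probability is conditional on the previous draws.
p_conditional = Fraction(99, 100) * Fraction(98, 99) * Fraction(97, 98)
print(p_conditional)         # 97/100
print(float(p_conditional))  # 0.97

# The wrong answer, if the draws are incorrectly treated as independent:
p_wrong = 0.99 ** 3
print(round(p_wrong, 4))     # 0.9703
```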
Probability of Independent Events
Assuming that only one event can happen at a time, the sum of the probabilities of all possible events equals 1.000 exactly.
On a single die, only one number appears face up at a time. Therefore P1 + P2 + P3 + P4 + P5 + P6 = 1.00, where P1 is the probability of the #1 being face up, etc.
The ADDITIVE RULE:The probability of Event A happening or Event B or Event C, assuming that only one event can possibly happen at a time, is the sum of their probabilities: Pa + Pb + Pc (where Pa is the probability of Event A, and so on) --- in this case, there are assumed to be other possible events, i.e., Pc, Pd, Pe, etc.
-- Class Exercise --
Let's try answering these questions:
The ADDITIVE RULE (examples):The chance of rolling 2 dice and obtaining a total of either 2 or 12 is
1 / 36 + 1 / 36 = 2 / 36 = 0.056 = 5.6%
The probability of flipping 4 coins and obtaining either all heads or all tails is
1 / 16 + 1 / 16 = 2 / 16 = 0.125 = 12.5%(based upon our example a few slides ago)
Likelihood that an n = 1 sample is out-of-spec if taken from a lot with 2% out-of-spec high & 5% out low is...
0.02 + 0.05 = 0.07 = 7%
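A quick numerical check of the additive-rule examples above (Python, for illustration):

```python
# Additive rule: P(A or B) = Pa + Pb for mutually exclusive events.

# Rolling 2 dice and totaling either 2 or 12:
p_2_or_12 = 1/36 + 1/36

# Flipping 4 coins and getting either all heads or all tails:
p_all_same = 1/16 + 1/16

# One sampled item out-of-spec, from a lot with 2% out high and 5% out low:
p_out_of_spec = 0.02 + 0.05

print(round(p_2_or_12, 3))      # 0.056
print(p_all_same)               # 0.125
print(round(p_out_of_spec, 2))  # 0.07
```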
Probability
[Chart: Probability of multiple heads in one toss of several coins (Null Hypothesis: the coins are honest, i.e., probability of heads = 0.50). X-axis: number of observed heads; Y-axis: probability.]
The probability of getting 3 or more heads in a single toss of 4 coins is about 30% = 0.30 = the approximate sum of the individual histogram bar probabilities for getting 3 heads or 4 heads (0.25 + 0.05 = 0.30) (assuming the coins are honest!!).
Calculation of each of these probabilities is simple to do by enumeration (presenter has a demo file).
Probability
[Chart: Probability of multiple heads in one toss of many coins. X-axis: number of observed heads; Y-axis: probability.]
The probability of getting 22 or more heads in a single toss of 40 coins is about 30% (≈ the sum of the individual histogram bar probabilities of 22 and above on the X-axis) (0.10 + 0.075 + 0.06 + 0.03 + 0.02 + 0.01 + 0.005 = 0.30) (assuming the coins are honest!!).
Calculation of each of these probabilities is done the same way as for 4 coins, but is much more tedious because there are over a million possibilities!!
t-Test of Null Hypothesis
Null Hypothesis: True Average is not greater than the Specification
[Chart: a smooth, bell-shaped probability curve. X-axis: Sample Average minus Specification (measured values, increasing in magnitude from left to right); Y-axis: probability or frequency.]
If the number of possible result values is "infinite" or very large, then the probability histogram is more conveniently represented by a smooth curve (such as this one) rather than by a histogram like those in the previous slides. For example: the individual weights of thousands of coins, or the individual avg weights of thousands of samples taken from a very large population of coins.
Always think of the area under the curve as filled with histogram bars that we are too cheap to print.
Probability
t-Test of Null Hypothesis
Null Hypothesis: True Average is not greater than the Specification
[Chart: the curve from the previous slide, with the area to the right of a point "A" on the X-axis shaded red. X-axis: measured values, increasing in magnitude from left to right (e.g., the widths of cables made last week); Y-axis: probability = frequency.]
The probability of getting a measurement equal to or greater than value "A" on the X-axis is exactly 0.30 = the fraction of the area under the curve that is to the right of that point on the X-axis (the red-shaded area equals 30% of the area under the entire curve).
In the language of calculus, the red area is the integral of the distribution function, from "A" to infinity.
Probability
t-Test of Null Hypothesis
Null Hypothesis: True Average is not greater than the Specification
[Chart: the same curve, with the area to the right of a point "B" on the X-axis shaded red. X-axis: measured values, increasing in magnitude from left to right (e.g., the widths of cables made last week); Y-axis: probability or frequency.]
The probability of getting a measurement equal to or greater than value "B" on the X-axis is exactly 0.05 (= the fraction of the area under the curve that is to the right of that point on the X-axis) (the red-shaded area equals 5% of the area under the entire curve).
We will use this concept many times in Day 2. Do you understand it completely? (Let's examine it with the Instructor's Excel files.)
The "Law of Large Numbers"
(per JZ) This "law", generalized, is somewhat self-evident, and was known in principle to Archimedes over 2 millennia ago.
It applies to calculated "statistics", such as averages & standard deviations, and says nothing about the distribution of raw data.
Possibly a better name for this law is the one used over 100 years ago: The “Law of Tendency” ----
“…the law of tendency is that the larger the number of instances, the greater [= better ] will be the approximation to an accurate and definite result.”
(quote from pg 240 of Inductive Logic, 1896 by J.G. Hibbens, Scribner & Sons)
This quote shows that the "Law of Large Numbers" is part of our common language, but it is unfortunately often applied incorrectly. It is misapplied whenever it is invoked to claim "statistical significance", because the "Law of Large ("big") Numbers" itself has nothing to do with "statistical significance" (we will discuss "statistical significance" in Day 2 of this workshop).
Law of Large Numbers translates (in this example) as...
The larger the sample size, the closer the calculated value is likely to be to 100 = the population value (i.e., the closer the statistic is likely to be to the population parameter).
[Chart: Distribution of sample averages taken from the 1st 250 rows (250 sample averages per each sample size), from "Law of Large Numbers.xls" in the Student files. X-axis: sample size (1 to 30); Y-axis: sample average (70 to 130), with the parameter = 100 marked. Each mark on each line represents the avg of a different random sample taken from a uniformly distributed population, 75 to 125. The scatter of sample averages narrows around 100 as the sample size increases.]
-- Class Exercise --
If this population has an average value of 100, will the average value of a SMALL sample from this population, in the long run, be smaller or larger than 100? Will the average value of a LARGE sample, in the long run, be larger or smaller than 100?
Answer: In the long run, both small-sample and large-sample averages will be close to 100. In the long run, sample avgs equal population avgs, no matter what the sample size or population shape (that is, in the long run, statistics = parameters).
Graphical Methods used to Describe Variability
Number Line
. . .. ..... ... . .400 450 500 550 600
The small red squares graphically depict the variability, or the "distribution", of the data.
Histograms and Line Charts
Bar Charts and Line Charts
Pareto Chart
[Chart: Reasons why customers returned china place settings ordered over the internet from ZTC. Y-axis: number of returns in Jan 2005 (0 to 30); the bars, in descending order, have heights 25, 17, 12, and smaller values.]
This is shown here only to "complete" a survey of types of charts. We won't mention Pareto charts in the rest of the workshop.
-- Class Exercise --
If a population distribution looks bimodal, the distribution of data in a SMALL sample from that population will, on average, look like...what? The distribution of a LARGE sample from that population will, on average, look like...what?
This (below) is bimodal:
[Chart: a bimodal distribution.]
Answer: On average, both samples will look ≈ bimodal. On average, samples look like the parent population, no matter what the sample size.
"Binomial Distribution" Histogram
[Chart: Probability of multiple heads in one toss of many coins, 30 coins at a time (Null Hypothesis: the coins are honest, i.e., probability of heads = 0.50). X-axis: number of observed heads (0 to 30); Y-axis: probability.]
The "binomial distribution" describes frequencies when there are only 2 possible outcomes (e.g., heads or tails on a coin, or a vote for or against a proposed law).
The formula for the "Binomial Distribution" is used to calculate, e.g., the probability of 26 heads appearing on a toss of 30 coins. Part of the formula includes the following calculation:
26 x 25 x 24 x 23 x ....... 4 x 3 x 2 x 1 = ???
= (approximately) 400 Million x Billion x Billion
Prior to computers, such calculations were "impossible", except by idiot savants ( = the first "computers" --- they were actually sought after and well paid !)
Calculation of Binomial Probability
How to easily calculate the height of a single barin a Binomial Distribution Probability Histogram…
(MS Excel function) =binomdist(N,S,B,false)
N = Number of heads observed in a given toss of coins
S = Sample size = number of coins per toss
B = Probability of getting heads on a single coin = 0.5
false = (tells Excel to give the probability of a single histogram bar)
e.g., =binomdist(11,30,0.5,false) = 0.0509 (check that value vs. the histogram a couple slides ago)
Binomial distributions are symmetrical when probability = 0.500, but skewed when the probability is any other value (the farther from 0.500, the more extreme the skewness --- see next slide).
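The same bar height can be computed directly from the binomial formula; here is a sketch in Python (for illustration; Excel's =binomdist performs the same arithmetic):

```python
from math import comb

def binom_pmf(n_heads, n_coins, p_heads):
    """Probability of exactly n_heads in a toss of n_coins (binomial formula)."""
    return (comb(n_coins, n_heads)
            * p_heads**n_heads
            * (1 - p_heads)**(n_coins - n_heads))

# Same case as Excel's =binomdist(11, 30, 0.5, false):
print(round(binom_pmf(11, 30, 0.5), 4))  # 0.0509

# The full histogram always sums to 1:
total = sum(binom_pmf(k, 30, 0.5) for k in range(31))
print(round(total, 6))  # 1.0
```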
"Binomial Distribution" Histogram
[Chart: If a die had 10 sides, one of which had a star on it: probability of multiple stars face up in a toss of 30 dice. X-axis: number of stars face up in the toss of 30 dice (0 to 30); Y-axis: probability.]
This situation is modeled by the Binomial distribution because we are looking at only 2 possible outcomes: Star or Not-a-Star. The probability of a star coming face up is 1/10 = 10%. The corresponding binomial histogram has a peak at 30 x 10% = 3, but is not symmetrical (it is skewed to the right).
"Hypergeometric Distribution"
The "Binomial distribution" describes frequencies of independent events, where the probability of one result is NOT influenced by a previous result (e.g., coin tosses --- reference the "multiplicative rule" of probability calculation, discussed previously).
The "Hypergeometric distribution" looks almost identical to the Binomial, but describes frequencies where the probability of one result is influenced by a previous result, so the events are NOT independent (e.g., sampling from a lot of 100 parts, only 99 of which are good --- reference the "multiplicative rule" "corollary", discussed previously).
The Hypergeometric Distribution is very difficult to calculate by hand, but...
The MS Excel function of the probability for the "Hypergeometric distribution" is...
=hypgeomdist(N,S,D,P)
N = Observed number of items in the Sample that exhibit the sought-after characteristic (e.g., 7 "good" parts)
S = Sample size (e.g., 8 parts)
D = # of items in the Population that exhibit the sought-after characteristic (e.g., 99 “good” parts )
P = Population Size (e.g., 100 parts in the lot)
"Hypergeometric Distribution"
(back in the discussion on "probability" we asked...)
What is the probability of drawing 3 good parts from a lot of 100 parts, 99% of which are good (that is, 99 of which are good and one of which is bad)?
Back then, we calculated it like so:
1st draw x 2nd draw x 3rd draw: 99/100 x 98/99 x 97/98 = 0.9700
Now we can use the hypergeometric Excel function instead: =hypgeomdist( 3, 3, 99, 100 ) = 0.9700
If we had instead used the binomial Excel function, we would have obtained this wrong answer: =binomdist( 3, 3, 0.99, false ) = 0.9703 ( which equals 99/100 x 99/100 x 99/100 )
Binomial vs. Hypergeometric Formula
As long as sample size is not more than 1% of lot size,the two formulae give the "same" result. For example...
SmplSize = 10, LotSize = 1000 (= Sample is 1% of Lot)
=hypgeomdist( 10, 10, 990, 1000 ) = 0.904
=binomdist( 10, 10, 0.99, false ) = 0.904
SmplSize = 100, LotSize = 1000 (= Sample is 10% of Lot)
=hypgeomdist( 100, 100, 990, 1000 ) = 0.347 (right)
=binomdist( 100, 100, 0.99, false ) = 0.366 (wrong)
FYI: MS Excel cannot calculate every combination of Hypergeometric values --- for example...
=hypgeomdist( 135, 135, 9900, 10000 ) = #NUM!
=binomdist( 135, 135, 0.99, false ) = 0.258
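For readers who want to check these numbers outside Excel, here is a sketch of the hypergeometric formula in Python (for illustration); because Python's math.comb works with exact integers, even the case that makes Excel return #NUM! computes without trouble:

```python
from math import comb

def hypergeom_pmf(n_good, sample, good_in_pop, pop):
    """Probability of exactly n_good successes when drawing `sample` items
    WITHOUT replacement from a population of `pop` items that contains
    `good_in_pop` successes (hypergeometric formula)."""
    return (comb(good_in_pop, n_good)
            * comb(pop - good_in_pop, sample - n_good)
            / comb(pop, sample))

# Same case as Excel's =hypgeomdist(3, 3, 99, 100):
print(round(hypergeom_pmf(3, 3, 99, 100), 4))  # 0.97

# The combination where Excel gives #NUM! is fine with exact integer math:
print(round(hypergeom_pmf(135, 135, 9900, 10000), 3))
```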
Examples of Normal Distributions
[Chart: several "normal" curves, each describing a population that has the same average value but a different degree of variability within the population. X-axis is in the same units as the raw data; Y-axis is count, i.e., # of observed items of a given X-value.]
The single most-used distribution in statistical analysis is the Normal distribution.
Examples of Normal Distributions
[Chart: normal curves plotted with the X-axis in "standard units" (which we will discuss later); Y-axis is count, i.e., # of observed items of a given X-value.]
"Normal Distribution" equation
The equation for what we now call the "Normal distribution histogram" was discovered around 1730, as a way to simplify calculation of the Binomial distribution; only power & square root tables were needed (rather than idiot savants).
The Normal distribution histogram has the "same" shape as the Binomial when sample size is large and the probabilities of the outcomes are exactly 50:50 (for example, a histogram describing the various possible number of heads in a toss of a 10,000 coins).
The larger the sample (e.g., the more coins), the closer the Normal histogram shape is to the Binomial histogram shape.
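That closeness can be checked numerically. The sketch below (Python, for illustration) compares binomial bar heights for 1,000 fair coins against the Normal curve with the matching mean and standard deviation:

```python
from math import comb, sqrt, pi, exp

# For a toss of many fair coins, the binomial histogram is closely matched
# by the Normal curve with mu = n*p and sigma = sqrt(n*p*(1-p)).
n = 1000
p = 0.5
mu = n * p                       # 500
sigma = sqrt(n * p * (1 - p))    # about 15.8

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def normal_height(x):
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

for k in (480, 500, 520):
    # The two printed columns agree closely at every k.
    print(k, round(binom_pmf(k), 5), round(normal_height(k), 5))
```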
"Normal Distribution" equation
Independently re-discovered ≈ 1800 by 2 astronomers (Gauss & Laplace); nowadays it is sometimes called the "Gaussian curve".
They used it to describe the distribution of errors in measurements; it became known as the " error curve "...
...because errors in measurements act like a binomial situation, that is, a very precise measurement can be only one of two possibilities, namely either greater than the true value or less than the true value (ignoring the remote possibility of being exactly equal to the true value).
Renamed the " Normal Distribution " around 1900 after it was discovered that the "error curve" closely described the typical (i.e., the normal) distribution of many biological values (e.g., heights of humans, weights of walruses, lengths of lizards).
"Normal Distribution Histogram"
If a histogram of your measurement data does not mimic the histogram created by this equation, then your
data may actually not be "normal" !!
This equation looks intimidating, but your "student" files contain a spreadsheet that does the calculations for you,
and then automatically creates the histogram!
The equation is: Y = N × i × [1/(σ√(2π))] × e^(−(X−μ)²/(2σ²)), where...
Y = # of items expected at X (divide by N to get probability)
N = # of items examined (e.g., 225 people)
i = width of each single bar (= length of interval) on the histogram (for binomial & other discrete distributions, i = 1)
X = x-axis midpoint of a given histogram bar
μ = average or expected value of all N items
σ = standard deviation of all N items (we'll explain in a few minutes what a "standard deviation" is)
Let's examine "Student Normal Histogram.xls"
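For readers without the spreadsheet at hand, the normal histogram equation can also be sketched in a few lines of Python (for illustration; the N, i, Avg, and StdDev values are the ones used on the next slide):

```python
from math import sqrt, pi, exp

# Normal Distribution Histogram equation:
#   Y = N * i * (1/(sigma*sqrt(2*pi))) * exp(-(X - mu)**2 / (2*sigma**2))
N = 225        # items examined
i = 0.1        # width of each histogram bar
mu = 5.5       # average
sigma = 0.33   # standard deviation

def expected_count(x):
    """Expected # of items in the histogram bar centered at x."""
    return N * i * exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# Expected count in the tallest bar (at the mean):
print(round(expected_count(5.5), 1))  # 27.2

# Summing the bars from 4.0 to 7.0 recovers approximately N = 225:
total = sum(expected_count(4.0 + k * i) for k in range(31))
print(round(total))  # 225
```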
"Normal (quantity) Histogram"
[Chart: Normal QUANTITY Distribution Histogram. X-axis: 4.0 to 7.0; Y-axis: quantity, 0 to 30.]
This was created using the Normal Distribution Histogram equation, with N = 225, i = 0.1, Avg = 5.5, & StdDev = 0.33. This could represent the distribution of the heights of 225 randomly selected people. The sum of all these bars = N = 225.
"Normal (probability) Histogram"
[Chart: Normal PROBABILITY Distribution Histogram. X-axis: 4.0 to 7.0; Y-axis: probability, 0.00 to 0.14.]
This was created from the previous chart by dividing each quantity by N. The sum of all these bars = 1.000, no matter what the sample size is (N = 225, or N = 1,000,000,000).
"Normal (probability) Curve"
[Chart: Normal PROBABILITY Curve. X-axis: 4.0 to 7.0; Y-axis: probability.]
This was created from the previous chart by drawing a smooth line from top to top of each bar, and then deleting the bars. The sum of the area under this curve is defined as = 1.000.
This is the basis of the "normal curve" used in many statistical tests! Always view such curves as really a histogram whose bars we are too cheap to print.
The "Central Limit Theorem"
(The text above is a scanned image from Bowker & Lieberman, Engineering Statistics, 2nd ed., p. 100)
CENTRAL LIMIT THEOREM translates as... for any population of raw data with any shaped distribution, in regard to the distribution of a large number of statistics taken repeatedly from the population (e.g., averages, ranges, standard deviations, etc.)...
• the distribution of the statistics will look more+more "normal" ("bell" shaped) the larger+larger the sample size is; that is true because the value of a statistic will be somewhere near the parameter, either larger or smaller than it (ignoring the unlikely event of equaling the parameter); i.e., it has a binomial distribution, which, as we saw before, is modeled by the "Error Curve", which in modern times is called the "Normal distribution".
• the distribution of the statistics will never "be" Normal, except in cases when N is very large and the raw-data population distribution is "normal". Often, the distribution of statistics is "t" shaped, as we will see in Day 2 of this course.
Let's examine STUDENT file: Central Limit Theorem.xls
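The theorem can also be sketched without the spreadsheet. The Python snippet below (for illustration) draws many sample averages from a uniform, decidedly non-"bell"-shaped population and shows that the averages cluster tightly around the population mean, with a spread close to what the theory predicts:

```python
import random
from math import sqrt

# Central Limit Theorem sketch: averages of samples (n = 25) drawn from a
# uniform population (75 to 125) pile up around the population mean, 100.
random.seed(2)

def sample_means(n, trials=2000):
    return [sum(random.uniform(75, 125) for _ in range(n)) / n
            for _ in range(trials)]

means = sample_means(n=25)
grand_mean = sum(means) / len(means)
spread = sqrt(sum((m - grand_mean)**2 for m in means) / (len(means) - 1))

# The population std dev of uniform(75, 125) is 50/sqrt(12), about 14.43;
# the averages' spread should be close to 14.43 / sqrt(25), about 2.89.
print(round(grand_mean, 1))  # close to the population mean, 100
print(round(spread, 2))      # close to 2.89
```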
Distribution of Sample Avgs. vs. Population
[Chart: Distribution of sample averages taken from the 1st 250 rows (250 sample averages per each sample size). X-axis: sample size (1 to 30); Y-axis: sample average (70 to 130). Overlaid: the theoretical distribution of thousands of individual avgs taken from the population. Its shape is due to the Central Limit Theorem; its width is due to the Law of Large Numbers.]
Let's look at this in more detail using MS Excel.
Numerical Expressions
Range (1848?)
Standard Deviation (1893?)
Standard Error ( 1897 ? )
Another important term is the " Mean ", which is another way to say the "average".
"Mean" in that sense was coined about 1750.
What is an "average" ?
About a hundred years ago, "average" usually meant the "median" ("the median home price in Dallas is..."). However, in more modern times, the word "average", by itself, always refers to the sum of all the values, divided by the number of values (i.e., the "arithmetic mean"):
Value#1 + Value#2 + Value#3 + (etc.) + Value#N = Sum of all Values
Average = Sum of all Values / N
What is a "range" ?
The "range" of a set of numbers refers to the difference between the largest and smallest value in that set:
Range = Largest Value – Smallest Value
The range of heights of people in this room is approximately...? (different for women than for men...?)
What is a "range" ?
. . .. ..... ... . .400 450 500 550 600
This "number line" uses small red squares to graphically depict the variability, that is, the "distribution", of the data in a small sample. The difference between the value on the far left-hand side and the value on the far right-hand side is the "range".
In this data, the range looks to be about 200 units.
“Standard” calculations
Standard XXX (the mathematical definition, for the population parameter):
Standard XXX = square root of [ ∑ ( Xi – Mean )² / ( # of data points in the Mean ) ]
Standard XXX (when using a sample to guess what the population-parameter Standard XXX is):
Standard XXX = square root of [ ∑ ( Xi – Mean )² / ( # of data points in the Mean, minus Y ) ]
Y = whole number, greater than zero; its value depends on which "standard" statistic is being calculated.
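Applied to a small made-up data set (Python, for illustration; the five values are hypothetical), the two formulas give slightly different answers, with Y = 1 yielding the familiar "n-1" sample standard deviation:

```python
from math import sqrt

# The two "Standard XXX" formulas from the slide, applied to sample data.
data = [98, 101, 103, 99, 104]   # hypothetical measurements
mean = sum(data) / len(data)
ss = sum((x - mean)**2 for x in data)   # sum of squared deviations

std_population = sqrt(ss / len(data))       # definition, for a parameter
std_sample = sqrt(ss / (len(data) - 1))     # sample estimate (Y = 1, "n-1")

print(round(std_population, 2))  # 2.28
print(round(std_sample, 2))      # 2.55, a bit larger than the "n" version
```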
Standard Deviation & Standard Error
Standard XXX (from a previous slide)
XXX = "Deviation" (that is "Standard Deviation") when talking about raw data (e.g., heights of humans, and lengths of lizards).
XXX = "Error" (that is, "Standard Error") when talking about calculated values (i.e., Statistics), for example: -- sample means ("Standard Error of the Mean"), or -- sample standard deviations ("Standard Error of the Standard Deviation").
[Chart: Calculated standard deviation vs. sample size (log scale, 1 to 10,000), for the "n-1" formula and the "n" formula. 100 samples per data point (each point = average Std Dev of all 100); random samples taken from a normal population with Std Dev = 10.0. Both curves approach 10.0 as the sample size grows.]
Other random samples would produce differently shaped curves; but, on average, the "n" curve would be farther away (on the low side) from the true value than the "n-1" curve. That is, the "n-1" statistic is a better estimator of the parameter than the "n" one.
This is another example of how to think about the "Law of Large Numbers"; that is, the larger the sample size, the closer (on average) the "statistic" is to the "parameter".
(revisited) Distribution of Sample Avgs. vs. Population
[Chart: the population distribution overlaid with the (much narrower) theoretical distribution of thousands of individual avgs taken from the population. The width of the distribution of sample averages is measured in Std Errors; the width of the population distribution is measured in Std Deviations.]
As was stated on a previous slide, a distribution (of raw data or of statistics such as "averages") is "normal" if its histogram mimics a "normal" one.
Said differently, a distribution is "normal" if its distribution has characteristics that mimic that of the "normal probability curve", such as...
+/– 1 StdXXX from Avg = 68.3 % of area under curve
+/– 2 StdXXX from Avg = 95.5 % of area under curve
+/– 3 StdXXX from Avg = 99.7 % of area under curve
as seen in next few slides...
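Those three percentages can be verified from the normal curve itself; here is a sketch using the error function from Python's standard library (for illustration):

```python
from math import erf, sqrt

# Fraction of a normal population within +/- z standard deviations of the
# mean, via the normal CDF: erf(z / sqrt(2)).
def within(z):
    return erf(z / sqrt(2))

print(round(100 * within(1), 1))  # 68.3
print(round(100 * within(2), 1))  # 95.4 (the slide's 95.5 rounds up 95.45)
print(round(100 * within(3), 1))  # 99.7
```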
Areas under the "normal" curve
[Chart: normal curve, X-axis 70 to 130; the darkened area, within +/– 1 StdXXX of the mean, equals 68.3% of the area under the curve.]
Areas under the "normal" curve
[Chart: normal curve, X-axis 70 to 130; the darkened area, within +/– 2 StdXXX of the mean, equals 95.5% of the area under the curve.]
Areas under the "normal" curve
[Chart: normal curve, X-axis 70 to 130; the darkened area, within +/– 3 StdXXX of the mean, equals 99.73% of the area under the curve.]
If a population with Avg = 100 and StdXXX = 10 is believed to be Normally distributed, then (1 – 0.9973) / 2 of the population (≈ 0.135%) is predicted to be below X = 70.
This +/– 3 interval is used extensively in "Statistical Process Control" (SPC).
Areas under the "normal" curve
[Chart: normal curve, X-axis 70 to 130; the darkened area equals 99.0% of the area under the curve.]
This +/– 2.58 interval is used extensively in "Gage R&R" and other "Metrology" methods.
Areas under the "normal" curve
[Chart: normal probability curve, X-axis 67 to 133, Y-axis probability; the darkened area equals 95.0% of the area under the curve. If the Standard XXX is 10, then +/– 1.96 Std XXX equals 95.00% of the area under the curve.]
This +/– 1.96 interval is used in some Reliability calculations & in some tests of "Significance".
[Table: a "Z" Table, giving the area A under the normal curve between the mean and Z standard deviations above it.]
In a normal distribution, +/– Z std deviations from the Parameter Avg encompasses 2 x A of the population of numbers. For example:
+/– 1.96 standard deviations equals 2 x 0.4750 = 95.0% of the area under the normal curve.
+/– 3.00 standard deviations equals 2 x 0.4987 = 99.7% of the area under the normal curve.
Class exercise: Estimation of Std Dev
[Chart: a ≈ normal distribution of raw data, centered at 100, X-axis 70 to 130.]
Assuming this ≈ normal distribution of raw data, approximately what is the Std Deviation?
Answer: Almost all of the distribution is ≈ Mean +/– 30. If "normal", then 30 ≈ 3 StdDevs; therefore StdDeviation ≈ 10.
Class exercise: Estimation of Std Error
[Chart: a ≈ normal distribution of sample averages, centered at 100, X-axis 85 to 115.]
Assuming this ≈ normal distribution of Smpl Avgs, approximately what is the Standard Error?
Answer: Almost all of the distribution is ≈ Mean +/– 15. If "normal", then 15 ≈ 3 Std Errors; therefore Std Error ≈ 5.
Calculating a "standard error"
Any statistic from a single sample will likely not be identical to the parameter. For example, you can expect a sample mean to be off by some unknown amount from the population mean, i.e. to have some amount of "error". The "standard" amount of error to expect is called the "standard error". The theoretical definitions of two important standard errors are:
Std Error of Mean = Std Dev of all possible (or at least a very large number of) sample averages (of a single sample size) taken from a Population.
Std Error of StdDev = Std Dev of all possible (or at least a very large number of) "n-1" std deviations (of a single sample size) taken from a Population.
Calculating a "standard error"
Multiple samples (with replacement) of the same sample size, from the same population, generated these Avgs & StdDevs:

Avg#1    StdDev#1
Avg#2    StdDev#2
Avg#3    StdDev#3
Avg#4    StdDev#4
etc.     etc.
Avg#N    StdDev#N
-------- --------
Std Dev of Avgs = Std Error of the Mean
Std Dev of StdDevs = Std Error of the Std Deviation
Standard Error of the (sample) Mean ( estimated from 1 sample )
Practical formula for "Std Error of Mean":

Std Error of Mean  =  ( Sample Standard Deviation ) / √( Sample Size )
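The two routes to the standard error can be checked against each other. A sketch with an assumed example population (mean 100, std dev 10, so the true Std Error of the Mean for n = 25 is 10 / √25 = 2.0):

```python
# Estimate the Std Error of the Mean two ways:
#   1) brute force -- the std dev of many sample averages, and
#   2) the practical one-sample formula, s / sqrt(n).
import random
import statistics as stats

random.seed(1)
n = 25                                   # sample size
population = [random.gauss(100, 10) for _ in range(100_000)]

# Brute force: std dev of many sample averages taken from the population.
averages = [stats.mean(random.sample(population, n)) for _ in range(2_000)]
brute_force_se = stats.stdev(averages)

# Practical formula applied to a single sample: s / sqrt(n).
one_sample = random.sample(population, n)
formula_se = stats.stdev(one_sample) / n ** 0.5

print(brute_force_se, formula_se)        # both should be near 10 / sqrt(25) = 2.0
```

The brute-force value is essentially the theoretical definition from the previous slide; the formula value is what you can actually compute from one sample.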
Linear Regression & the
Correlation Coefficient
What is the meaning of a Linear Regression Correlation Coefficient?
In 2009, a billion dollar manufacturing company submitted to a government regulatory agency a report from a product technical file, claiming that performance data between the stressed and unstressed product were not significantly different, because the “correlation coefficient” between the data sets was large (about 0.99).
The regulatory personnel knew that such a claim is nonsense, so they officially requested a literature or textbook reference that explained such a rationale. After a few rounds of emails and re-writings of the report (and still no literature reference), the company consulted a professional statistician, who recommended using a different statistical method to prove equivalency.
X (class = grade in school)    Y (minutes of study before being easily distracted)
2                              4.2
3                              5.9
5                              10.4
6                              11.5

Is there a linear relationship between class (= grade) in school and tendency toward distraction? How strong is it?
How consistent is the relationship (that is, what is the degree of co-relation, more commonly called "correlation")? Let's use Excel to find out!
Understanding Linear Regression & the Correlation Coefficient
• This is a "linear regression plot" of the data.• The "regression coefficient" is 1.91• The "correlation coefficient" is " R " or " r " ,
that is, " r " = the square root of 0.9897 = 0.995
Understanding Linear Regression & the Correlation Coefficient
(chart: X vs. Y scatter plot, X axis 0 to 7, Y axis 0 to 14; trendline y = 1.91x + 0.36, R² = 0.9897)
• Linear regression puts the "best" straight line thru a plot of X vs. Y data points.
• The "regression coefficient" (= 1.91 = the slope of this line) tells us how STRONG the relationship is.
Understanding Linear Regression & the Correlation Coefficient
(chart: same scatter plot; trendline y = 1.91x + 0.36, R² = 0.9897)
• The linear regression equation (e.g., Y = 1.91X + 0.36 ) allows us to predict the Y value for a nearby X value.
• CLASS EXERCISE: What Y value do we expect at X = 1.0?
ANSWER: ( 1.91 times 1.0 ) + 0.36 = 2.27
Understanding Linear Regression & the Correlation Coefficient
(chart: same scatter plot; trendline y = 1.91x + 0.36, R² = 0.9897)
This is an example of "Reliability Plotting", which is discussed in Day 3 of this workshop.
MS Excel Spreadsheet functions...
linear regression coefficient: =SLOPE( known_y's, known_x's )
correlation coefficient --- same result given by either...
=CORREL( known_y's, known_x's ) or =CORREL( known_x's, known_y's )
Notice that the function formula for the slope cares about which data set is X and which is Y, but the formula for the correlation coefficient does not.
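That asymmetry is easy to verify outside of Excel. A sketch (the `slope` and `correl` helper names are assumptions, written to mimic Excel's SLOPE and CORREL) using the workshop's class/distraction data:

```python
# Python equivalents of Excel's =SLOPE and =CORREL, showing that the slope
# depends on which variable is X, but the correlation coefficient does not.
import statistics as stats

def slope(ys, xs):                       # like Excel =SLOPE(known_y's, known_x's)
    mx, my = stats.mean(xs), stats.mean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def correl(a, b):                        # like Excel =CORREL(array1, array2)
    return slope(b, a) * stats.stdev(a) / stats.stdev(b)

X = [2, 3, 5, 6]
Y = [4.2, 5.9, 10.4, 11.5]

print(slope(Y, X))                       # 1.91 -- the slope of Y on X
print(slope(X, Y))                       # a different number: order matters
print(correl(X, Y), correl(Y, X))        # identical either way (≈ 0.995)
```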
Understanding Linear Regression & the Correlation Coefficient
Are Correlation Coefficients the same if data sets are the same except for magnitude...???
(chart: three parallel data sets on one plot, X axis 0 to 20, Y axis 0 to 1000; r = 0.955 for each)
YES !!
The regression coefficients are also all identical, because the slopes are all identical.
Understanding Linear Regression & the Correlation Coefficient
Does the Correlation Coefficient increase in size with additional data points...???
(chart: three data sets with differing numbers of points, X axis 0 to 20, Y axis 0 to 1000; r = 0.955, 0.962, 0.971)
NO !!
This data seems to say that the CC decreases in size as the number of data points increases !!
Does a large Correlation Coefficient indicate that the data is truly linear...???
(chart: three data sets, X axis 0 to 20, Y axis 0 to 1000; r = 0.955 for each; slight slope to the lowest regression line)
NO !! (notice how the lower-most 2 data sets show a slight curve) (the solid black lines are all straight, not curved)
But the regression coefficients are not identical, because the slopes are not identical.
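The same point can be made numerically. A sketch with assumed data (not the slide's): Y = X², an obvious curve, still produces a large correlation coefficient.

```python
# A clearly curved data set can still yield a large " r ",
# so a big correlation coefficient never proves linearity.
import statistics as stats

xs = list(range(1, 11))
ys = [x ** 2 for x in xs]                # an obvious curve, not a straight line

mx, my = stats.mean(xs), stats.mean(ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
r = sxy / ((len(xs) - 1) * stats.stdev(xs) * stats.stdev(ys))

print(r)                                 # large (> 0.97) despite the curvature
```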
Understanding Linear Regression & the Correlation Coefficient
If the data is close to the line, is the Correlation Coefficient always large...???
(chart: three data sets, X axis 0 to 20, Y axis 0 to 1000; r = 0.955, 0.064, 0.791; slight slope & 2 dots per point in the lowest regression line)
NO !!
Understanding Linear Regression & the Correlation Coefficient
Does a large Correlation Coefficient indicate that the X,Y data have a strong relationship (i.e., that the regression coefficient is large)...???
(chart: three data sets with differing slopes; r = 0.955 for each)
NO !!
There are at least a dozen different formulas for the Correlation Coefficient.
The instructor considers this the best formula for teaching the meaning of Correlation.
The next few slides explain it....
Understanding Linear Regression & the Correlation Coefficient

| r |  =  Se / Sy

( Se = std deviation of the predicted Ye values; Sy = std deviation of the observed Y values )
Ye is calculated from the linear regression equation that is used to draw the "straight line" thru the data: Ye = aX + b
The square root of 0.9897 = r = 0.995 = correlation coefficient (this chart & equation were produced by MS Excel).
Understanding Linear Regression & the Correlation Coefficient
(chart: same scatter plot; trendline y = 1.91x + 0.36, R² = 0.9897)
Continuing with the data and equation from the previous slide: Ye = 1.91 ( X ) + 0.36

observed X   observed Y   Ye
2            4.2          4.18
3            5.9          6.09
5            10.4         9.91
6            11.5         11.82
Std Dev      3.505 = Sy   3.487 = Se

r = 3.487 / 3.505 = 0.995 = same as on the previous slide (this is not a trick; it is just one of many mathematically identical formulas for calculating the magnitude of “ r ”)
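The table above can be reproduced in a few lines. A sketch confirming that the ratio of the two standard deviations, Se / Sy, matches the correlation coefficient from the trendline:

```python
# Reproduce the slide's worked example: r = Se / Sy.
import statistics as stats

X = [2, 3, 5, 6]
Y = [4.2, 5.9, 10.4, 11.5]

Ye = [1.91 * x + 0.36 for x in X]        # predicted Y values from the trendline
Sy = stats.stdev(Y)                      # std dev of the observed Y values
Se = stats.stdev(Ye)                     # std dev of the predicted Ye values

print(round(Sy, 3), round(Se, 3))        # 3.505 and 3.487, as in the table
print(round(Se / Sy, 3))                 # 0.995 = the correlation coefficient
```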
Understanding Linear Regression & the Correlation Coefficient
The absolute value of the Correlation Coefficient is the ratio of 2 standard deviations: the numerator is the smallest possible standard deviation that can be expected in the Y data points ( = Se ), and the denominator is the observed standard deviation in the Y data points ( = Sy ). If the observed data were closer to the linear regression line, then Sy would be smaller, and the Se / Sy ratio would be closer to 1.000. THE CORRELATION COEFFICIENT THEREFORE IS A MEASURE OF VARIABILITY, OF HOW CONSISTENTLY THE PLOTTED DATA TRACKS TO THE LINEAR REGRESSION LINE.
Understanding Linear Regression & the Correlation Coefficient
| r |  =  Se / Sy
The Correlation Coefficient is... the fraction of the observed Y data variation ( = Sy, the std deviation of the observed Y values) that is explainable by a linear relationship between X and Y ( the variation “associated with” or “caused by” that linear relationship is Se, the std deviation of the predicted Y values). The rest of the variation in the data is definitely due to something else (e.g., poor measurement equipment, poor measurement technique, other factors, random error, or... the fact that the data are NOT linearly related !!).
Understanding Linear Regression & the Correlation Coefficient
| r |  =  Se / Sy
(chart: two data sets, X axis 0 to 20, Y axis 0 to 1000; slight slope to the lowest regression line)
Understanding Linear Regression & the Correlation Coefficient
Assuming Y is dependent on X, what is the source (the "cause") of the variation in Y-values?
( r = 0.955 for the upper data set; r = 0.064 for the lower data set )
Almost all variation in Y is "caused" by relationship
between X & Y (variation in X "causes" variation in Y).
Almost no variation in Y is "caused" by relationship between X & Y (something else is the "cause", such as assay variation or measurement error).
Sometimes there is no "cause" (e.g., correlation between arm-length and leg-length).
• The correlation coefficient is...an indicator of predictability in the data on the Y axis.
• It represents...the fraction of the variation in the Y-data that can be explained by an hypothesized linear relationship between X and Y.
• If that hypothesis is false, i.e., if the relationship between X and Y is not truly linear, then the Correlation Coefficient is meaningless.
What is the meaning of a (linear regression) Correlation Coefficient?

r  =  ( stdev of solid dots ) / ( stdev of hollow dots )
As mentioned earlier, in 2009, a billion dollar company submitted to a regulatory agency a report in a tech file, claiming that performance data between the stressed and unstressed product were not significantly different, because the “correlation coefficient” between the data sets was large (about 0.99). Have you learned enough to explain why that is nonsense?
(charts: two plots of STRESSED vs. UNSTRESSED performance data, shown at two different axis scales, each with trendline y = 0.5076x − 0.078 and R² = 0.9937)
Note the slope: at ≈ 0.5, the stressed values are only about half the unstressed values, yet r ≈ 0.99. Correlation measures consistency, not equivalency.
(chart: three data sets, X axis 0 to 20, Y axis 0 to 1000; r = 0.955 for each)
Conclusion to: Understanding Linear Regression & the Correlation Coefficient:
• Just because Excel lets you put a Linear Regression line thru data points does not mean the data is a straight line.
• Just because the Correlation Coefficient is large does not mean you have a straight line.
• You must use your judgment to determine if the line is straight, and if "yes", then and only then can you use the Linear Regression Equation and Correlation Coefficient to help you evaluate the relationship between your X and Y values.
How to implement what you learned today?
A new language (and some of its vocabulary) is primarily what you learned today.
Like any language, you must speak it if you are to learn it well.
Read your company's SOP (or ??) on statistical techniques.
Ask to read some of the validation protocols and validation reports that relate to your work, and study their "statistics" section (or it might be called the "data analysis" section).
Ask your boss to explain statistical statements made in meetings, reports, or SOPs.