Thoratec
Workshop in Applied Statistics for QA/QC, Mfg, and R&D
Part 1 of 3: Basic Statistical Concepts
Instructor: John Zorich (www.JOHNZORICH.COM, [email protected])
Part 1 was designed for students who know high-school algebra but who have never had a college-level statistics course.
John Zorich's Qualifications:
• 20 years as a "regular" employee in the medical device industry (R&D, Mfg, Quality)
• ASQ Certified Quality Engineer (since 1996)
• Statistical consultant and trainer (since 1999) for many companies, including Siemens Medical, Boston Scientific, Stryker, and Novellus
• Instructor in applied statistics for Ohlone College, Silicon Valley Polytechnic Institute, and KEMA/DEKRA
• Past instructor in applied statistics for UC Santa Cruz Extension, ASQ Silicon Valley Biomedical Group, & TUV
• Publisher of 9 commercial, formally validated statistical application Excel spreadsheets that have been purchased by over 80 companies worldwide. Applications include: Reliability, Normality Tests & Normality Transformations, Sampling Plans, SPC, Gage R&R, and Power.
You’re invited to “connect” with me on LinkedIn.
Objectives
PART 1 (today's topics): Obtain an understanding of BASIC statistics in general (its vocabulary, methods, & uses) as needed to understand Parts 2 and 3.
PART 2 (not today's topics): Learn INTERMEDIATE statistical applications and tests as needed to understand Part 3; also includes "reliability calculations", power calculations, and sample size determinations.
PART 3 (not today's topics): Become familiar with commonly used ADVANCED statistical applications (Reliability Plotting, Sampling Plans, SPC, Process Capability calculations, Equipment Control).
Self-teaching & Reference Texts
RECOMMENDED by John Zorich
Clements: Handbook of Statistical Methods in Manufacturing
Kaminsky et al.: Statistics and Quality Control for the Workplace
Mlodinow: The Drunkard’s Walk --- How Randomness Rules Our Lives
Motulsky: Intuitive Biostatistics
NIST Engineering Statistics Internet Handbook (free), at... http://www.itl.nist.gov/div898/handbook/index.htm
Philips: How to Think about Statistics
Main Topics in Today's Workshop
• Regulatory Requirements
• Population vs. Sample
• Parameter vs. Statistic
• Probability
• Law of Large Numbers
• Distributions (Charting and Graphing)
• Binomial Distribution
• Hypergeometric Distribution
• Normal Distribution
• Central Limit Theorem
• Standard Deviation and Standard Error
• Linear Regression & Correlation Coefficients
Regulatory Requirements
ISO 9001:2008 (8.1) and ISO 13485:2003 (8.1): "The organization shall plan and implement the monitoring, measurement, analysis and improvement processes needed to demonstrate conformity [to requirements].... This shall include determination of applicable methods, including statistical techniques, & the extent of their use."
21CFR820.250 (FDA): "Where appropriate, each manufacturer shall establish and maintain procedures for identifying valid statistical techniques required for establishing, controlling, and verifying the acceptability of process capability and product characteristics."
Population vs. Sample (as used in this class...)
Sample means part of a Population. The sample could be the part of an individual batch or lot that was purchased or produced; you inspect the sample prior to applying an "approved" label to the entire batch or lot.
"Representative Sample": a sample that represents the population --- it is typically not a "Random Sample" but rather is usually taken evenly from throughout the population (e.g., a few items taken from each box in the batch).
"Sample size" can be anything from 1 to over 1,000,000. The term "one sample" or "a single sample" means the entire sample, no matter what the sample size is.
Parameter vs. Statistic (as used in this class...)
Statistic is a mathematical summary value calculated from data taken from a Sample. All of the following are statistics:
• Avg thickness of every 100th cable produced last week
• Range of thicknesses in that sample
• Median thickness in that sample
Parameter is a mathematical summary value calculated from data taken from the entire Population; that is, from every data point in the entire population (e.g., average thickness of all cables produced last week).
"Statistics" as a science is the mathematical analysis of "statistics", not of parameters. Statistics is the science of
using "statistics" to guesstimate "parameters".
As a group, let's discuss...
Which are parameters and which are statistics?
1. Baseball "stats"
Answer: Parameters, because baseball "stats" are calculated using all the data.
2. United States Census data
Answer: Some are statistics, because they are just a sample of the population (this is the preference of Democrats), whereas others are parameters, because we attempt to count the entire US population (this is the preference of Republicans).
3. Average age of the people in this class
Answer: It depends -- is this class a population or a sample?
Probability (as used in this class) means...
The same as "chance" or "odds", but not based on a hunch or intuition or on what has historically occurred.
• The following statements are not using probability in the sense we mean here today:
He'll probably come home before 9pm. They'll probably win tonight's game. They haven’t won a game in 6 weeks---they’re due!!
• Those are examples of “Adverbial Probability” (see Inductive Logic, by Hibbens, 1896, chapter 15)
• Instead, the Science of Statistics uses... “Mathematical Probability”.
“Mathematical Probability” is the same as the "theoretical expected frequency", that is, the # of times one type of event would happen (if no cheating occurs) divided by the total number of all possible equi-probable events; e.g.,...
Probability
1:1 = "Fifty-Fifty" = 1 / 2 = 0.50 = 50 %
Those terms (above) all mean the same thing.They all mean that you have the same chance at winning as you have at losing, as opposed to...
1 / 4 = 0.25 = 25 % chance or odds of...
1 / 10 = 0.10 = 10 % probability of...
1 / 3 = 0.3333 (rounded) = 33.33 %
1 / 6 = 0.1667 (rounded) = 16.67 %
(By definition...) Probability / chance / likelihood... never can exceed 1.00 = 100%, and never can be less than 0.00 = 0%
Probability
[Chart: Probability of rolling a given number on 1 toss of 1 die (Null Hypothesis: the die is "honest"). X-axis: number observed on the die (1 through 6); Y-axis: probability. All six bars have the same height, 1/6 ≈ 0.167.]
On a single die, the chance of one number appearing face up is the same as that of any other number.
Probability
[Chart: Probability of observing a given sum on 1 toss of 2 dice (Null Hypothesis: the dice are "honest"). X-axis: sum of the numbers observed on the 2 dice (2 through 12); Y-axis: probability. The bars peak at a sum of 7, with probability 6/36 ≈ 0.167.]
These probabilities can be calculated by enumeration (instructor has a demo file on this).
Probability
[Chart: Probability of observing heads on a flip of 1 coin (Null Hypothesis: the coin is "honest"). X-axis: number of heads observed (0 or 1); Y-axis: probability. Both bars are at 0.50.]
The probability of having heads come up on a toss of one coin is 0.50. "Tails" has the same probability.
Probability
[Chart: Probability of multiple heads in one toss of many coins (Null Hypothesis: the coins are honest, i.e., probability of heads = 0.50). X-axis: number of observed heads (0 to 30); Y-axis: probability, peaking at 15 heads.]
Instructor has an Excel file that generates a series of charts like this one, based on tosses of up to several hundred coins.
When tossing 30 honest coins, the "true" average is 15 heads, but by chance we may see some other result.
Probability of Independent Events
The MULTIPLICATIVE RULE:
The probability of Event A happening and Event B and Event C (assuming that they are independent events), is the multiplication of their probabilities: Pa x Pb x Pc (where Pa is the probability of Event A, and so on).
-- Class Exercise --
Let's try answering these questions:
The MULTIPLICATIVE RULE (examples): The chance of rolling 2 dice and obtaining a 5 on both of them is...
1 / 6 x 1 / 6 = 1 / 36 = 0.028 = 2.8%
The probability of flipping a coin 4 times and obtaining "heads" every time is... 1 / 2 x 1 / 2 x 1 / 2 x 1 / 2 = 1 / 16 = 0.0625 = 6.25%
Let's try that (flipping 4 coins & counting heads)
The likelihood of drawing 3 good parts from a lot of 100 million parts, 99% of which are good, is...
0.99 x 0.99 x 0.99 = 0.9703 = 97.03%
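These multiplicative-rule examples can be verified with a few lines of code; Python is used here only for illustration (the workshop itself relies on Excel):

```python
# Multiplicative rule: P(A and B and C) = Pa * Pb * Pc for independent events.

# Chance of rolling a 5 on both of 2 dice:
p_two_fives = (1/6) * (1/6)

# Chance of 4 heads in 4 coin flips:
p_four_heads = (1/2) ** 4

# Chance of drawing 3 good parts from a huge lot that is 99% good
# (the lot is so large that the draws are effectively independent):
p_three_good = 0.99 ** 3

print(round(p_two_fives, 3))   # 0.028 (= 1/36)
print(p_four_heads)            # 0.0625 (= 1/16)
print(round(p_three_good, 4))  # 0.9703
```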
Probability
The MULTIPLICATIVE RULE (corollary): Conditional Probability: If the probability changes after each sampling event, then the separate probabilities are not identical, because they are "conditional" not "independent"; e.g.
What is the probability of drawing 3 good parts from a lot of 100 parts, 99% of which are good (that is 99 of which are good and one of which is bad)?
1st draw x 2nd draw x 3rd draw: 99/100 x 98/99 x 97/98 = 0.9700
The probability of a given draw is "conditional" based upon what happened in the previous draw.
( do not use: 99/100 x 99/100 x 99/100 = 0.9703 )
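The difference between conditional and (incorrectly assumed) independent draws is easy to check numerically; the sketch below (Python, for illustration) uses exact fractions:

```python
from fractions import Fraction

# Sampling WITHOUT replacement from a lot of 100 parts (99 good, 1 bad):
# each draw's probability is conditional on the previous draws.
p_conditional = Fraction(99, 100) * Fraction(98, 99) * Fraction(97, 98)
print(p_conditional)         # 97/100
print(float(p_conditional))  # 0.97

# The wrong answer, if the draws are incorrectly treated as independent:
p_wrong = 0.99 ** 3
print(round(p_wrong, 4))     # 0.9703
```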
Probability of Independent Events
Assuming that only one event can happen at a time, the sum of the probabilities of all possible events equals 1.000 exactly.
On a single die, only one number appears face up at a time. Therefore P1 + P2 + P3 + P4 + P5 + P6 = 1.00, where P1 is the probability of the #1 being face up, etc.
The ADDITIVE RULE:The probability of Event A happening or Event B or Event C, assuming that only one event can possibly happen at a time, is the sum of their probabilities: Pa + Pb + Pc (where Pa is the probability of Event A, and so on) --- in this case, there are assumed to be other possible events, i.e., Pc, Pd, Pe, etc.
-- Class Exercise --
Let's try answering these questions:
The ADDITIVE RULE (examples):The chance of rolling 2 dice and obtaining a total of either 2 or 12 is
1 / 36 + 1 / 36 = 2 / 36 = 0.056 = 5.6%
The probability of flipping 4 coins and obtaining either all heads or all tails is
1 / 16 + 1 / 16 = 2 / 16 = 0.125 = 12.5%(based upon our example a few slides ago)
Likelihood that an n = 1 sample is out-of-spec if taken from a lot with 2% out-of-spec high & 5% out low is...
0.02 + 0.05 = 0.07 = 7%
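A quick numerical check of the additive-rule examples above (Python, for illustration):

```python
# Additive rule: P(A or B) = Pa + Pb for mutually exclusive events.

# Rolling 2 dice and totaling either 2 or 12:
p_2_or_12 = 1/36 + 1/36

# Flipping 4 coins and getting either all heads or all tails:
p_all_same = 1/16 + 1/16

# One sampled item out-of-spec, from a lot with 2% out high and 5% out low:
p_out_of_spec = 0.02 + 0.05

print(round(p_2_or_12, 3))      # 0.056
print(p_all_same)               # 0.125
print(round(p_out_of_spec, 2))  # 0.07
```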
Probability
[Chart: Probability of multiple heads in one toss of several coins (Null Hypothesis: the coins are honest, i.e., probability of heads = 0.50). X-axis: number of observed heads; Y-axis: probability.]
The probability of getting 3 or more heads in a single toss of 4 coins is about 30% = 0.30 = the approximate sum of the individual histogram bar probabilities for getting 3 heads or 4 heads (0.25 + 0.05 = 0.30) (assuming the coins are honest!!).
Calculation of each of these probabilities is simple to do by enumeration (presenter has a demo file).
Probability
[Chart: Probability of multiple heads in one toss of many coins. X-axis: number of observed heads; Y-axis: probability.]
The probability of getting 22 or more heads in a single toss of 40 coins is about 30% (≈ the sum of the individual histogram bar probabilities of 22 and above on the X-axis) (0.10 + 0.075 + 0.06 + 0.03 + 0.02 + 0.01 + 0.005 = 0.30) (assuming the coins are honest!!).
Calculation of each of these probabilities is done the same way as for 4 coins, but is much more tedious because there are over a million possibilities!!
t-Test of Null Hypothesis
Null Hypothesis: True Average is not greater than the Specification
[Chart: a smooth, bell-shaped probability curve. X-axis: Sample Average minus Specification (measured values, increasing in magnitude from left to right); Y-axis: probability or frequency.]
If the number of possible result values is "infinite" or very large, then the probability histogram is more conveniently represented by a smooth curve (such as this one) rather than by a histogram like those in the previous slides. For example: the individual weights of thousands of coins, or the individual avg weights of thousands of samples taken from a very large population of coins.
Always think of the area under the curve as filled with histogram bars that we are too cheap to print.
Probability
t-Test of Null Hypothesis
Null Hypothesis: True Average is not greater than the Specification
[Chart: the curve from the previous slide, with the area to the right of a point "A" on the X-axis shaded red. X-axis: measured values, increasing in magnitude from left to right (e.g., the widths of cables made last week); Y-axis: probability = frequency.]
The probability of getting a measurement equal to or greater than value "A" on the X-axis is exactly 0.30 = the fraction of the area under the curve that is to the right of that point on the X-axis (the red-shaded area equals 30% of the area under the entire curve).
In the language of calculus, the red area is the integral of the distribution function, from "A" to infinity.
Probability
t-Test of Null Hypothesis
Null Hypothesis: True Average is not greater than the Specification
[Chart: the same curve, with the area to the right of a point "B" on the X-axis shaded red. X-axis: measured values, increasing in magnitude from left to right (e.g., the widths of cables made last week); Y-axis: probability or frequency.]
The probability of getting a measurement equal to or greater than value "B" on the X-axis is exactly 0.05 (= the fraction of the area under the curve that is to the right of that point on the X-axis) (the red-shaded area equals 5% of the area under the entire curve).
We will use this concept many times in Day 2. Do you understand it completely? (Let's examine it with the Instructor's Excel files.)
The "Law of Large Numbers"
(per JZ) This "law", generalized, is somewhat self-evident, and was known in principle to Archimedes over 2 millennia ago.
It applies to calculated "statistics", such as averages & standard deviations, and says nothing about the distribution of raw data.
Possibly a better name for this law is the one used over 100 years ago: The “Law of Tendency” ----
“…the law of tendency is that the larger the number of instances, the greater [= better ] will be the approximation to an accurate and definite result.”
(quote from pg 240 of Inductive Logic, 1896 by J.G. Hibbens, Scribner & Sons)
This quote shows that the "Law of Large Numbers" is part of our common language, but it is unfortunately often applied incorrectly. It is misapplied whenever it is invoked to claim "statistical significance", because the "Law of Large ("big") Numbers" itself has nothing to do with "statistical significance" (we will discuss "statistical significance" in Day 2 of this workshop).
Law of Large Numbers translates (in this example) as...
The larger the sample size, the closer the calculated value is likely to be to 100 = the population value (i.e., the closer the statistic is likely to be to the population parameter).
[Chart: Distribution of sample averages taken from the 1st 250 rows (250 sample averages per each sample size), from "Law of Large Numbers.xls" in the Student files. X-axis: sample size (1 to 30); Y-axis: sample average (70 to 130), with the parameter = 100 marked. Each mark on each line represents the avg of a different random sample taken from a uniformly distributed population, 75 to 125. The scatter of sample averages narrows around 100 as the sample size increases.]
-- Class Exercise --
If this population has an average value of 100, will the average value of a SMALL sample from this population, in the long run, be smaller or larger than 100? Will the average value of a LARGE sample, in the long run, be larger or smaller than 100?
Answer: In the long run, both small-sample and large-sample averages will be close to 100. In the long run, sample avgs equal population avgs, no matter what the sample size or population shape (that is, in the long run, statistics = parameters).
Graphical Methods used to Describe Variability
Number Line
. . .. ..... ... . .400 450 500 550 600
The small red squares graphically depict the variability, or the "distribution", of the data.
Histograms and Line Charts
Bar Charts and Line Charts
Pareto Chart
[Chart: Reasons why customers returned china place settings ordered over the internet from ZTC. Y-axis: number of returns in Jan 2005 (0 to 30); the bars, in descending order, have heights 25, 17, 12, and smaller values.]
This is shown here only to "complete" a survey of types of charts. We won't mention Pareto charts in the rest of the workshop.
-- Class Exercise --
If a population distribution looks bimodal, the distribution of data in a SMALL sample from that population will, on average, look like...what? The distribution of a LARGE sample from that population will, on average, look like...what?
This (below) is bimodal:
[Chart: a bimodal distribution.]
Answer: On average, both samples will look ≈ bimodal. On average, samples look like the parent population, no matter what the sample size.
"Binomial Distribution" Histogram
[Chart: Probability of multiple heads in one toss of many coins, 30 coins at a time (Null Hypothesis: the coins are honest, i.e., probability of heads = 0.50). X-axis: number of observed heads (0 to 30); Y-axis: probability.]
The "binomial distribution" describes frequencies when there are only 2 possible outcomes (e.g., heads or tails on a coin, or a vote for or against a proposed law).
The formula for the "Binomial Distribution" is used to calculate, e.g., the probability of 26 heads appearing on a toss of 30 coins. Part of the formula includes the following calculation:
26 x 25 x 24 x 23 x ....... 4 x 3 x 2 x 1 = ???
= (approximately) 400 Million x Billion x Billion
Prior to computers, such calculations were "impossible", except by idiot savants ( = the first "computers" --- they were actually sought after and well paid !)
Calculation of Binomial Probability
How to easily calculate the height of a single barin a Binomial Distribution Probability Histogram…
(MS Excel function) =binomdist(N,S,B,false)
N = Number of heads observed in a given toss of coins
S = Sample size = number of coins per toss
B = Probability of getting heads on a single coin = 0.5
false = (tells Excel to give the probability of a single histogram bar)
e.g., =binomdist(11,30,0.5,false) = 0.0509 (check that value vs. the histogram a couple slides ago)
Binomial distributions are symmetrical when probability = 0.500, but skewed when the probability is any other value (the farther from 0.500, the more extreme the skewness --- see next slide).
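The same bar height can be computed directly from the binomial formula; here is a sketch in Python (for illustration; Excel's =binomdist performs the same arithmetic):

```python
from math import comb

def binom_pmf(n_heads, n_coins, p_heads):
    """Probability of exactly n_heads in a toss of n_coins (binomial formula)."""
    return (comb(n_coins, n_heads)
            * p_heads**n_heads
            * (1 - p_heads)**(n_coins - n_heads))

# Same case as Excel's =binomdist(11, 30, 0.5, false):
print(round(binom_pmf(11, 30, 0.5), 4))  # 0.0509

# The full histogram always sums to 1:
total = sum(binom_pmf(k, 30, 0.5) for k in range(31))
print(round(total, 6))  # 1.0
```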
"Binomial Distribution" Histogram
[Chart: If a die had 10 sides, one of which had a star on it: probability of multiple stars face up in a toss of 30 dice. X-axis: number of stars face up in the toss of 30 dice (0 to 30); Y-axis: probability.]
This situation is modeled by the Binomial distribution because we are looking at only 2 possible outcomes: Star or Not-a-Star. The probability of a star coming face up is 1/10 = 10%. The corresponding binomial histogram has a peak at 30 x 10% = 3, but is not symmetrical (it is skewed to the right).
"Hypergeometric Distribution"
The "Binomial distribution" describes frequencies of independent events, where the probability of one result is NOT influenced by a previous result (e.g., coin tosses --- reference the "multiplicative rule" of probability calculation, discussed previously).
The "Hypergeometric distribution" looks almost identical to the Binomial, but describes frequencies where the probability of one result is influenced by a previous result, so the events are NOT independent (e.g., sampling from a lot of 100 parts, only 99 of which are good --- reference the "multiplicative rule" "corollary", discussed previously).
The Hypergeometric Distribution is very difficult to calculate by hand, but...
The MS Excel function of the probability for the "Hypergeometric distribution" is...
=hypgeomdist(N,S,D,P)
N = Observed number of items in the Sample that exhibit the sought-after characteristic (e.g., 7 "good" parts)
S = Sample size (e.g., 8 parts)
D = # of items in the Population that exhibit the sought-after characteristic (e.g., 99 “good” parts )
P = Population Size (e.g., 100 parts in the lot)
"Hypergeometric Distribution"
(back in the discussion on "probability" we asked...)
What is the probability of drawing 3 good parts from a lot of 100 parts, 99% of which are good (that is, 99 of which are good and one of which is bad)?
Back then, we calculated it like so:
1st draw x 2nd draw x 3rd draw: 99/100 x 98/99 x 97/98 = 0.9700
Now we can use the hypergeometric Excel function instead: =hypgeomdist( 3, 3, 99, 100 ) = 0.9700
If we had instead used the binomial Excel function, we would have obtained this wrong answer: =binomdist( 3, 3, 0.99, false ) = 0.9703 ( which equals 99/100 x 99/100 x 99/100 )
Binomial vs. Hypergeometric Formula
As long as sample size is not more than 1% of lot size,the two formulae give the "same" result. For example...
SmplSize = 10, LotSize = 1000 (= Sample is 1% of Lot)
=hypgeomdist( 10, 10, 990, 1000 ) = 0.904
=binomdist( 10, 10, 0.99, false ) = 0.904
SmplSize = 100, LotSize = 1000 (= Sample is 10% of Lot)
=hypgeomdist( 100, 100, 990, 1000 ) = 0.347 (right)
=binomdist( 100, 100, 0.99, false ) = 0.366 (wrong)
FYI: MS Excel cannot calculate every combination of Hypergeometric values --- for example...
=hypgeomdist( 135, 135, 9900, 10000 ) = #NUM!
=binomdist( 135, 135, 0.99, false ) = 0.258
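For readers who want to check these numbers outside Excel, here is a sketch of the hypergeometric formula in Python (for illustration); because Python's math.comb works with exact integers, even the case that makes Excel return #NUM! computes without trouble:

```python
from math import comb

def hypergeom_pmf(n_good, sample, good_in_pop, pop):
    """Probability of exactly n_good successes when drawing `sample` items
    WITHOUT replacement from a population of `pop` items that contains
    `good_in_pop` successes (hypergeometric formula)."""
    return (comb(good_in_pop, n_good)
            * comb(pop - good_in_pop, sample - n_good)
            / comb(pop, sample))

# Same case as Excel's =hypgeomdist(3, 3, 99, 100):
print(round(hypergeom_pmf(3, 3, 99, 100), 4))  # 0.97

# The combination where Excel gives #NUM! is fine with exact integer math:
print(round(hypergeom_pmf(135, 135, 9900, 10000), 3))
```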
Examples of Normal Distributions
[Chart: several "normal" curves, each describing a population that has the same average value but a different degree of variability within the population. X-axis is in the same units as the raw data; Y-axis is count, i.e., # of observed items of a given X-value.]
The single most-used distribution in statistical analysis is the Normal distribution.
Examples of Normal Distributions
[Chart: normal curves plotted with the X-axis in "standard units" (which we will discuss later); Y-axis is count, i.e., # of observed items of a given X-value.]
"Normal Distribution" equation
The equation for what we now call the "Normal distribution histogram" was discovered around 1730, as a way to simplify calculation of the Binomial distribution; only power & square root tables were needed (rather than idiot savants).
The Normal distribution histogram has the "same" shape as the Binomial when sample size is large and the probabilities of the outcomes are exactly 50:50 (for example, a histogram describing the various possible number of heads in a toss of a 10,000 coins).
The larger the sample (e.g., the more coins), the closer the Normal histogram shape is to the Binomial histogram shape.
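That closeness can be checked numerically. The sketch below (Python, for illustration) compares binomial bar heights for 1,000 fair coins against the Normal curve with the matching mean and standard deviation:

```python
from math import comb, sqrt, pi, exp

# For a toss of many fair coins, the binomial histogram is closely matched
# by the Normal curve with mu = n*p and sigma = sqrt(n*p*(1-p)).
n = 1000
p = 0.5
mu = n * p                       # 500
sigma = sqrt(n * p * (1 - p))    # about 15.8

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def normal_height(x):
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

for k in (480, 500, 520):
    # The two printed columns agree closely at every k.
    print(k, round(binom_pmf(k), 5), round(normal_height(k), 5))
```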
"Normal Distribution" equation
Independently re-discovered ≈ 1800 by 2 astronomers (Gauss & Laplace); nowadays it is sometimes called the "Gaussian curve".
They used it to describe the distribution of errors in measurements; it became known as the " error curve "...
...because errors in measurements act like a binomial situation, that is, a very precise measurement can be only one of two possibilities, namely either greater than the true value or less than the true value (ignoring the remote possibility of being exactly equal to the true value).
Renamed the " Normal Distribution " around 1900 after it was discovered that the "error curve" closely described the typical (i.e., the normal) distribution of many biological values (e.g., heights of humans, weights of walruses, lengths of lizards).
"Normal Distribution Histogram"
If a histogram of your measurement data does not mimic the histogram created by this equation, then your
data may actually not be "normal" !!
This equation looks intimidating, but your "student" files contain a spreadsheet that does the calculations for you,
and then automatically creates the histogram!
The equation is: Y = N × i × [1/(σ√(2π))] × e^(−(X−μ)²/(2σ²)), where...
Y = # of items expected at X (divide by N to get probability)
N = # of items examined (e.g., 225 people)
i = width of each single bar (= length of interval) on the histogram (for binomial & other discrete distributions, i = 1)
X = x-axis midpoint of a given histogram bar
μ = average or expected value of all N items
σ = standard deviation of all N items (we'll explain in a few minutes what a "standard deviation" is)
Let's examine "Student Normal Histogram.xls"
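For readers without the spreadsheet at hand, the normal histogram equation can also be sketched in a few lines of Python (for illustration; the N, i, Avg, and StdDev values are the ones used on the next slide):

```python
from math import sqrt, pi, exp

# Normal Distribution Histogram equation:
#   Y = N * i * (1/(sigma*sqrt(2*pi))) * exp(-(X - mu)**2 / (2*sigma**2))
N = 225        # items examined
i = 0.1        # width of each histogram bar
mu = 5.5       # average
sigma = 0.33   # standard deviation

def expected_count(x):
    """Expected # of items in the histogram bar centered at x."""
    return N * i * exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# Expected count in the tallest bar (at the mean):
print(round(expected_count(5.5), 1))  # 27.2

# Summing the bars from 4.0 to 7.0 recovers approximately N = 225:
total = sum(expected_count(4.0 + k * i) for k in range(31))
print(round(total))  # 225
```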
"Normal (quantity) Histogram"
[Chart: Normal QUANTITY Distribution Histogram. X-axis: 4.0 to 7.0; Y-axis: quantity, 0 to 30.]
This was created using the Normal Distribution Histogram equation, with N = 225, i = 0.1, Avg = 5.5, & StdDev = 0.33. This could represent the distribution of the heights of 225 randomly selected people. The sum of all these bars = N = 225.
"Normal (probability) Histogram"
[Chart: Normal PROBABILITY Distribution Histogram. X-axis: 4.0 to 7.0; Y-axis: probability, 0.00 to 0.14.]
This was created from the previous chart by dividing each quantity by N. The sum of all these bars = 1.000, no matter what the sample size is (N = 225, or N = 1,000,000,000).
"Normal (probability) Curve"
[Chart: Normal PROBABILITY Curve. X-axis: 4.0 to 7.0; Y-axis: probability.]
This was created from the previous chart by drawing a smooth line from top to top of each bar, and then deleting the bars. The sum of the area under this curve is defined as = 1.000.
This is the basis of the "normal curve" used in many statistical tests! Always view such curves as really a histogram whose bars we are too cheap to print.
The "Central Limit Theorem"
(The text above is a scanned image from Bowker & Lieberman, Engineering Statistics, 2nd ed., p. 100)
CENTRAL LIMIT THEOREM translates as... for any population of raw data with any shaped distribution, in regard to the distribution of a large number of statistics taken repeatedly from the population (e.g., averages, ranges, standard deviations, etc.)...
• the distribution of the statistics will look more+more "normal" ("bell" shaped) the larger+larger the sample size is; that is true because the value of a statistic will be somewhere near the parameter, either larger or smaller than it (ignoring the unlikely event of equaling the parameter); i.e., it has a binomial distribution, which, as we saw before, is modeled by the "Error Curve", which in modern times is called the "Normal distribution".
• the distribution of the statistics will never "be" Normal, except in cases when N is very large and the raw-data population distribution is "normal". Often, the distribution of statistics is "t" shaped, as we will see in Day 2 of this course.
Let's examine STUDENT file: Central Limit Theorem.xls
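The theorem can also be sketched without the spreadsheet. The Python snippet below (for illustration) draws many sample averages from a uniform, decidedly non-"bell"-shaped population and shows that the averages cluster tightly around the population mean, with a spread close to what the theory predicts:

```python
import random
from math import sqrt

# Central Limit Theorem sketch: averages of samples (n = 25) drawn from a
# uniform population (75 to 125) pile up around the population mean, 100.
random.seed(2)

def sample_means(n, trials=2000):
    return [sum(random.uniform(75, 125) for _ in range(n)) / n
            for _ in range(trials)]

means = sample_means(n=25)
grand_mean = sum(means) / len(means)
spread = sqrt(sum((m - grand_mean)**2 for m in means) / (len(means) - 1))

# The population std dev of uniform(75, 125) is 50/sqrt(12), about 14.43;
# the averages' spread should be close to 14.43 / sqrt(25), about 2.89.
print(round(grand_mean, 1))  # close to the population mean, 100
print(round(spread, 2))      # close to 2.89
```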
Distribution of Sample Avgs. vs. Population
[Chart: Distribution of sample averages taken from the 1st 250 rows (250 sample averages per each sample size). X-axis: sample size (1 to 30); Y-axis: sample average (70 to 130). Overlaid: the theoretical distribution of thousands of individual avgs taken from the population. Its shape is due to the Central Limit Theorem; its width is due to the Law of Large Numbers.]
Let's look at this in more detail using MS Excel.
Numerical Expressions
Range (1848?)
Standard Deviation (1893?)
Standard Error ( 1897 ? )
Another important term is the " Mean ", which is another way to say the "average".
"Mean" in that sense was coined about 1750.
What is an "average" ?
About a hundred years ago, "average" usually meant the "median" ("the median home price in Dallas is..."). However, in more modern times, the word "average", by itself, always refers to the sum of all the values, divided by the number of values (i.e., the "arithmetic mean"):
Value#1 + Value#2 + Value#3 + (etc.) + Value#N = Sum of all Values
Average = Sum of all Values / N
What is a "range" ?
The "range" of a set of numbers refers to the difference between the largest and smallest value in that set:
Range = Largest Value – Smallest Value
The range of heights of people in this room is approximately...? (different for women than for men...?)
What is a "range" ?
. . .. ..... ... . .400 450 500 550 600
This "number line" uses small red squares to graphically depict the variability, that is, the "distribution", of the data in a small sample. The difference between the value on the far left-hand side and the value on the far right-hand side is the "range".
In this data, the range looks to be about 200 units.
“Standard” calculations
Standard XXX (the mathematical definition, for the population parameter):
Standard XXX = square root of [ ∑ ( Xi – Mean )² / ( # of data points in the Mean ) ]
Standard XXX (when using a sample to guess what the population-parameter Standard XXX is):
Standard XXX = square root of [ ∑ ( Xi – Mean )² / ( # of data points in the Mean, minus Y ) ]
Y = whole number, greater than zero; its value depends on which "standard" statistic is being calculated.
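Applied to a small made-up data set (Python, for illustration; the five values are hypothetical), the two formulas give slightly different answers, with Y = 1 yielding the familiar "n-1" sample standard deviation:

```python
from math import sqrt

# The two "Standard XXX" formulas from the slide, applied to sample data.
data = [98, 101, 103, 99, 104]   # hypothetical measurements
mean = sum(data) / len(data)
ss = sum((x - mean)**2 for x in data)   # sum of squared deviations

std_population = sqrt(ss / len(data))       # definition, for a parameter
std_sample = sqrt(ss / (len(data) - 1))     # sample estimate (Y = 1, "n-1")

print(round(std_population, 2))  # 2.28
print(round(std_sample, 2))      # 2.55, a bit larger than the "n" version
```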
Standard Deviation & Standard Error
Standard XXX (from a previous slide)
XXX = "Deviation" (that is "Standard Deviation") when talking about raw data (e.g., heights of humans, and lengths of lizards).
XXX = "Error" (that is, "Standard Error") when talking about calculated values (i.e., Statistics), for example: -- sample means ("Standard Error of the Mean"), or -- sample standard deviations ("Standard Error of the Standard Deviation").
[Chart: Calculated standard deviation vs. sample size (log scale, 1 to 10,000), for the "n-1" formula and the "n" formula. 100 samples per data point (each point = average Std Dev of all 100); random samples taken from a normal population with Std Dev = 10.0. Both curves approach 10.0 as the sample size grows.]
Other random samples would produce differently shaped curves; but, on average, the "n" curve would be farther away (on the low side) from the true value than the "n-1" curve. That is, the "n-1" statistic is a better estimator of the parameter than the "n" one.
This is another example of how to think about the "Law of Large Numbers"; that is, the larger the sample size, the closer (on average) the "statistic" is to the "parameter".
(revisited) Distribution of Sample Avgs. vs. Population
[Chart: the population distribution overlaid with the (much narrower) theoretical distribution of thousands of individual avgs taken from the population. The width of the distribution of sample averages is measured in Std Errors; the width of the population distribution is measured in Std Deviations.]
As was stated on a previous slide, a distribution (of raw data or of statistics such as "averages") is "normal" if its histogram mimics a "normal" one.
Said differently, a distribution is "normal" if its distribution has characteristics that mimic that of the "normal probability curve", such as...
+/– 1 StdXXX from Avg = 68.3 % of area under curve
+/– 2 StdXXX from Avg = 95.5 % of area under curve
+/– 3 StdXXX from Avg = 99.7 % of area under curve
as seen in next few slides...
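Those three percentages can be verified from the normal curve itself; here is a sketch using the error function from Python's standard library (for illustration):

```python
from math import erf, sqrt

# Fraction of a normal population within +/- z standard deviations of the
# mean, via the normal CDF: erf(z / sqrt(2)).
def within(z):
    return erf(z / sqrt(2))

print(round(100 * within(1), 1))  # 68.3
print(round(100 * within(2), 1))  # 95.4 (the slide's 95.5 rounds up 95.45)
print(round(100 * within(3), 1))  # 99.7
```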
Areas under the "normal" curve
[Chart: normal curve, X-axis 70 to 130; the darkened area, within +/– 1 StdXXX of the mean, equals 68.3% of the area under the curve.]
Areas under the "normal" curve
[Chart: normal curve, X-axis 70 to 130; the darkened area, within +/– 2 StdXXX of the mean, equals 95.5% of the area under the curve.]
Areas under the "normal" curve
[Chart: normal curve, X-axis 70 to 130; the darkened area, within +/– 3 StdXXX of the mean, equals 99.73% of the area under the curve.]
If a population with Avg = 100 and StdXXX = 10 is believed to be Normally distributed, then (1 – 0.9973) / 2 of the population (≈ 0.135%) is predicted to be below X = 70.
This +/– 3 interval is used extensively in "Statistical Process Control" (SPC).
Areas under the "normal" curve
[Chart: normal curve, X-axis 70 to 130; the darkened area equals 99.0% of the area under the curve.]
This +/– 2.58 interval is used extensively in "Gage R&R" and other "Metrology" methods.
Areas under the "normal" curve
[Chart: normal probability curve, X-axis 67 to 133, Y-axis probability; the darkened area equals 95.0% of the area under the curve. If the Standard XXX is 10, then +/– 1.96 Std XXX equals 95.00% of the area under the curve.]
This +/– 1.96 interval is used in some Reliability calculations & in some tests of "Significance".
[Table: a "Z" Table, giving the area A under the normal curve between the mean and Z standard deviations above it.]
In a normal distribution, +/– Z std deviations from the Parameter Avg encompasses 2 x A of the population of numbers. For example:
+/– 1.96 standard deviations equals 2 x 0.4750 = 95.0% of the area under the normal curve.
+/– 3.00 standard deviations equals 2 x 0.4987 = 99.7% of the area under the normal curve.
Class exercise: Estimation of Std Dev
[Chart: a ≈ normal distribution of raw data, centered at 100, X-axis 70 to 130.]
Assuming this ≈ normal distribution of raw data, approximately what is the Std Deviation?
Answer: Almost all of the distribution is ≈ Mean +/– 30. If "normal", then 30 ≈ 3 StdDevs; therefore StdDeviation ≈ 10.
Class exercise: Estimation of Std Error
[Chart: a ≈ normal distribution of sample averages, centered at 100, X-axis 85 to 115.]
Assuming this ≈ normal distribution of Smpl Avgs, approximately what is the Standard Error?
Answer: Almost all of the distribution is ≈ Mean +/– 15. If "normal", then 15 ≈ 3 Std Errors; therefore Std Error ≈ 5.
Calculating a "standard error"
Any statistic from a single sample will likely not be identical to the parameter. For example, you can expect a sample mean to be off by some unknown amount from the population mean, i.e. to have some amount of "error". The "standard" amount of error to expect is called the "standard error". The theoretical definitions of two important standard errors are:
Std Error of Mean = Std Dev of all possible (or at least a very large number of) sample averages (of a single sample size) taken from a Population.
Std Error of StdDev = Std Dev of all possible (or at least a very large number of) "n-1" std deviations (of a single sample size) taken from a Population.
Calculating a "standard error"
Multiple samples (with replacement) of the same sample size, from the same population, generated these Avgs & StdDevs:

Avg#1    StdDev#1
Avg#2    StdDev#2
Avg#3    StdDev#3
Avg#4    StdDev#4
etc.     etc.
Avg#N    StdDev#N
-------- --------
Std Dev of Avgs = Std Error of the Mean
Std Dev of StdDevs = Std Error of the Std Deviation
Standard Error of the (sample) Mean ( estimated from 1 sample )
Practical formula for "Std Error of Mean":

Std Error of Mean  =  ( Sample Standard Deviation ) / √( Sample Size )
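The two routes to the standard error can be checked against each other. A sketch with an assumed example population (mean 100, std dev 10, so the true Std Error of the Mean for n = 25 is 10 / √25 = 2.0):

```python
# Estimate the Std Error of the Mean two ways:
#   1) brute force -- the std dev of many sample averages, and
#   2) the practical one-sample formula, s / sqrt(n).
import random
import statistics as stats

random.seed(1)
n = 25                                   # sample size
population = [random.gauss(100, 10) for _ in range(100_000)]

# Brute force: std dev of many sample averages taken from the population.
averages = [stats.mean(random.sample(population, n)) for _ in range(2_000)]
brute_force_se = stats.stdev(averages)

# Practical formula applied to a single sample: s / sqrt(n).
one_sample = random.sample(population, n)
formula_se = stats.stdev(one_sample) / n ** 0.5

print(brute_force_se, formula_se)        # both should be near 10 / sqrt(25) = 2.0
```

The brute-force value is essentially the theoretical definition from the previous slide; the formula value is what you can actually compute from one sample.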
Linear Regression & the
Correlation Coefficient
What is the meaning of a Linear Regression Correlation Coefficient?
In 2009, a billion dollar manufacturing company submitted to a government regulatory agency a report from a product technical file, claiming that performance data between the stressed and unstressed product were not significantly different, because the “correlation coefficient” between the data sets was large (about 0.99).
The regulatory personnel knew that such a claim is nonsense, so they officially requested a literature or textbook reference that explained such a rationale. After a few rounds of emails and re-writings of the report (and still no literature reference), the company consulted a professional statistician, who recommended using a different statistical method to prove equivalency.
X (class = grade in school)    Y (minutes of study before being easily distracted)
2                              4.2
3                              5.9
5                              10.4
6                              11.5

Is there a linear relationship between class (= grade) in school and tendency toward distraction? How strong is it?
How consistent is the relationship (that is, what is the degree of co-relation, more commonly called "correlation")? Let's use Excel to find out!
Understanding Linear Regression & the Correlation Coefficient
• This is a "linear regression plot" of the data.• The "regression coefficient" is 1.91• The "correlation coefficient" is " R " or " r " ,
that is, " r " = the square root of 0.9897 = 0.995
Understanding Linear Regression & the Correlation Coefficient
(chart: X vs. Y scatter plot, X axis 0 to 7, Y axis 0 to 14; trendline y = 1.91x + 0.36, R² = 0.9897)
• Linear regression puts the "best" straight line thru a plot of X vs. Y data points.
• The "regression coefficient" (= 1.91 = the slope of this line) tells us how STRONG the relationship is.
Understanding Linear Regression & the Correlation Coefficient
(chart: same scatter plot; trendline y = 1.91x + 0.36, R² = 0.9897)
• The linear regression equation (e.g., Y = 1.91X + 0.36 ) allows us to predict the Y value for a nearby X value.
• CLASS EXERCISE: What Y value do we expect at X = 1.0?
ANSWER: ( 1.91 times 1.0 ) + 0.36 = 2.27
Understanding Linear Regression & the Correlation Coefficient
(chart: same scatter plot; trendline y = 1.91x + 0.36, R² = 0.9897)
This is an example of "Reliability Plotting", which is discussed in Day 3 of this workshop.
MS Excel Spreadsheet functions...
linear regression coefficient: =SLOPE( known_y's, known_x's )
correlation coefficient --- same result given by either...
=CORREL( known_y's, known_x's ) or =CORREL( known_x's, known_y's )
Notice that the function formula for the slope cares about which data set is X and which is Y, but the formula for the correlation coefficient does not.
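That asymmetry is easy to verify outside of Excel. A sketch (the `slope` and `correl` helper names are assumptions, written to mimic Excel's SLOPE and CORREL) using the workshop's class/distraction data:

```python
# Python equivalents of Excel's =SLOPE and =CORREL, showing that the slope
# depends on which variable is X, but the correlation coefficient does not.
import statistics as stats

def slope(ys, xs):                       # like Excel =SLOPE(known_y's, known_x's)
    mx, my = stats.mean(xs), stats.mean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def correl(a, b):                        # like Excel =CORREL(array1, array2)
    return slope(b, a) * stats.stdev(a) / stats.stdev(b)

X = [2, 3, 5, 6]
Y = [4.2, 5.9, 10.4, 11.5]

print(slope(Y, X))                       # 1.91 -- the slope of Y on X
print(slope(X, Y))                       # a different number: order matters
print(correl(X, Y), correl(Y, X))        # identical either way (≈ 0.995)
```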
Understanding Linear Regression & the Correlation Coefficient
Are Correlation Coefficients the same if data sets are the same except for magnitude...???
(chart: three parallel data sets on one plot, X axis 0 to 20, Y axis 0 to 1000; r = 0.955 for each)
YES !!
The regression coefficients are also all identical, because the slopes are all identical.
Understanding Linear Regression & the Correlation Coefficient
Does the Correlation Coefficient increase in size with additional data points...???
(chart: three data sets with differing numbers of points, X axis 0 to 20, Y axis 0 to 1000; r = 0.955, 0.962, 0.971)
NO !!
This data seems to say that the CC decreases in size as the number of data points increases !!
Does a large Correlation Coefficient indicate that the data is truly linear...???
(chart: three data sets, X axis 0 to 20, Y axis 0 to 1000; r = 0.955 for each; slight slope to the lowest regression line)
NO !! (notice how the lower-most 2 data sets show a slight curve) (the solid black lines are all straight, not curved)
But the regression coefficients are not identical, because the slopes are not identical.
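The same point can be made numerically. A sketch with assumed data (not the slide's): Y = X², an obvious curve, still produces a large correlation coefficient.

```python
# A clearly curved data set can still yield a large " r ",
# so a big correlation coefficient never proves linearity.
import statistics as stats

xs = list(range(1, 11))
ys = [x ** 2 for x in xs]                # an obvious curve, not a straight line

mx, my = stats.mean(xs), stats.mean(ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
r = sxy / ((len(xs) - 1) * stats.stdev(xs) * stats.stdev(ys))

print(r)                                 # large (> 0.97) despite the curvature
```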
Understanding Linear Regression & the Correlation Coefficient
If the data is close to the line, is the Correlation Coefficient always large...???
(chart: three data sets, X axis 0 to 20, Y axis 0 to 1000; r = 0.955, 0.064, 0.791; slight slope & 2 dots per point in the lowest regression line)
NO !!
Understanding Linear Regression & the Correlation Coefficient
Does a large Correlation Coefficient indicate that the X,Y data have a strong relationship (i.e., that the regression coefficient is large)...???
(chart: three data sets with differing slopes; r = 0.955 for each)
NO !!
There are at least a dozen different formulas for the Correlation Coefficient.
The instructor considers this the best formula for teaching the meaning of Correlation.
The next few slides explain it....
Understanding Linear Regression & the Correlation Coefficient

| r |  =  Se / Sy

( Se = std deviation of the predicted Ye values; Sy = std deviation of the observed Y values )
Ye is calculated from the linear regression equation that is used to draw the "straight line" thru the data: Ye = aX + b
The square root of 0.9897 = r = 0.995 = correlation coefficient (this chart & equation were produced by MS Excel).
Understanding Linear Regression & the Correlation Coefficient
(chart: same scatter plot; trendline y = 1.91x + 0.36, R² = 0.9897)
Continuing with the data and equation from the previous slide: Ye = 1.91 ( X ) + 0.36

observed X   observed Y   Ye
2            4.2          4.18
3            5.9          6.09
5            10.4         9.91
6            11.5         11.82
Std Dev      3.505 = Sy   3.487 = Se

r = 3.487 / 3.505 = 0.995 = same as on the previous slide (this is not a trick; it is just one of many mathematically identical formulas for calculating the magnitude of “ r ”)
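The table above can be reproduced in a few lines. A sketch confirming that the ratio of the two standard deviations, Se / Sy, matches the correlation coefficient from the trendline:

```python
# Reproduce the slide's worked example: r = Se / Sy.
import statistics as stats

X = [2, 3, 5, 6]
Y = [4.2, 5.9, 10.4, 11.5]

Ye = [1.91 * x + 0.36 for x in X]        # predicted Y values from the trendline
Sy = stats.stdev(Y)                      # std dev of the observed Y values
Se = stats.stdev(Ye)                     # std dev of the predicted Ye values

print(round(Sy, 3), round(Se, 3))        # 3.505 and 3.487, as in the table
print(round(Se / Sy, 3))                 # 0.995 = the correlation coefficient
```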
Understanding Linear Regression & the Correlation Coefficient
The absolute value of the Correlation Coefficient is the ratio of 2 standard deviations: the numerator is the smallest possible standard deviation that can be expected in the Y data points ( = Se ), and the denominator is the observed standard deviation in the Y data points ( = Sy ). If the observed data were closer to the linear regression line, then Sy would be smaller, and the Se / Sy ratio would be closer to 1.000. THE CORRELATION COEFFICIENT THEREFORE IS A MEASURE OF VARIABILITY, OF HOW CONSISTENTLY THE PLOTTED DATA TRACKS TO THE LINEAR REGRESSION LINE.
Understanding Linear Regression & the Correlation Coefficient
| r |  =  Se / Sy
The Correlation Coefficient is... the fraction of the observed Y data variation ( = Sy, the std deviation of the observed Y values) that is explainable by a linear relationship between X and Y ( the variation “associated with” or “caused by” that linear relationship is Se, the std deviation of the predicted Y values). The rest of the variation in the data is definitely due to something else (e.g., poor measurement equipment, poor measurement technique, other factors, random error, or... the fact that the data are NOT linearly related !!).
Understanding Linear Regression & the Correlation Coefficient
| r |  =  Se / Sy
(chart: two data sets, X axis 0 to 20, Y axis 0 to 1000; slight slope to the lowest regression line)
Understanding Linear Regression & the Correlation Coefficient
Assuming Y is dependent on X, what is the source (the "cause") of the variation in Y-values?
( r = 0.955 for the upper data set; r = 0.064 for the lower data set )
Almost all variation in Y is "caused" by relationship
between X & Y (variation in X "causes" variation in Y).
Almost no variation in Y is "caused" by relationship between X & Y (something else is the "cause", such as assay variation or measurement error).
Sometimes there is no "cause" (e.g., correlation between arm-length and leg-length).
• The correlation coefficient is...an indicator of predictability in the data on the Y axis.
• It represents...the fraction of the variation in the Y-data that can be explained by an hypothesized linear relationship between X and Y.
• If that hypothesis is false, i.e., if the relationship between X and Y is not truly linear, then the Correlation Coefficient is meaningless.
What is the meaning of a (linear regression) Correlation Coefficient?

r  =  ( stdev of solid dots ) / ( stdev of hollow dots )
As mentioned earlier, in 2009, a billion dollar company submitted to a regulatory agency a report in a tech file, claiming that performance data between the stressed and unstressed product were not significantly different, because the “correlation coefficient” between the data sets was large (about 0.99). Have you learned enough to explain why that is nonsense?
(charts: two plots of STRESSED vs. UNSTRESSED performance data, shown at two different axis scales, each with trendline y = 0.5076x − 0.078 and R² = 0.9937)
Note the slope: at ≈ 0.5, the stressed values are only about half the unstressed values, yet r ≈ 0.99. Correlation measures consistency, not equivalency.
(chart: three data sets, X axis 0 to 20, Y axis 0 to 1000; r = 0.955 for each)
Conclusion to: Understanding Linear Regression & the Correlation Coefficient:
• Just because Excel lets you put a Linear Regression line thru data points does not mean the data is a straight line.
• Just because the Correlation Coefficient is large does not mean you have a straight line.
• You must use your judgment to determine if the line is straight, and if "yes", then and only then can you use the Linear Regression Equation and Correlation Coefficient to help you evaluate the relationship between your X and Y values.
How to implement what you learned today?
A new language (and some of its vocabulary) is primarily what you learned today.
Like any language, you must speak it if you are to learn it well.
Read your company's SOP (or ??) on statistical techniques.
Ask to read some of the validation protocols and validation reports that relate to your work, and study their "statistics" section (or it might be called the "data analysis" section).
Ask your boss to explain statistical statements made in meetings, reports, or SOPs.