
1. Statistics, Data, and Statistical Thinking

1.1 The Science of Statistics

Definition 1.1

Statistics is the science of data. This involves collecting, classifying, summarizing, organizing, analyzing, and interpreting numerical information.

1.2 Types of Statistical Applications

Definition 1.2

Descriptive statistics utilizes numerical and graphical methods to look for patterns in a data set, to summarize the information revealed in a data set, and to present that information in a convenient form.


Definition 1.3

Inferential statistics utilizes sample data to make estimates, decisions, predictions or other generalizations about a larger set of data.

1.3 Fundamental Elements of Statistics

Definition 1.4

A population is a set of units (usually people, objects, transactions, or events) that we are interested in studying.

Definition 1.5


A variable is a characteristic or property of an individual population unit.

For example, we may be interested in the variables age, gender, and/or the number of years of education of the people currently unemployed in the United States.

The name “variable” is derived from the fact that any particular characteristic may vary among the units in a population.

Definition 1.6

A sample is a subset of the units of a population.

Definition 1.7

A statistical inference is an estimate, prediction, or some other generalization about a population based on information contained in a sample.


The following examples are for checking your understanding of "population, variable, sample, and inference."

We also need to know the reliability of an inference – that is, how good the inference is.

Thus, we introduce an element of uncertainty into our inferences.

Reliability is the fifth element of inferential statistical problems.

Definition 1.8

A measure of reliability is a statement (usually quantified) about the degree of uncertainty associated with a statistical inference.

Four Elements of Descriptive Statistical Problems

1. The population or sample of interest.


2. One or more variables that are to be investigated.

3. Tables, graphs, or numerical summary tools.

4. Identification of patterns in the data.

Five Elements of Inferential Statistical Problems

1. The population of interest.
2. One or more variables that are to be investigated.
3. The sample of population units.
4. The inference about the population based on information contained in the sample.
5. A measure of reliability for the inference.

1.4 Types of Data

Definition 1.9


Quantitative data are measurements that are recorded on a naturally occurring numerical scale.

Examples of quantitative data:

the temperature, or the current unemployment rate for each of the 50 states, or the scores of a sample of 150 law school applicants on the LSAT, or the number of convicted murderers who receive the death penalty each year over a 10-year period.

Definition 1.10

Qualitative data are measurements that cannot be measured on a natural numerical scale; they can only be classified into one of a group of categories.

Examples of qualitative data:


The political party affiliation (Democratic, Republican, or Independent) in a sample of 50 voters.

A taste-tester's ranking (best, worst, etc.) of four brands of barbecue sauce for a panel of 10 testers.

1.5 Collecting Data

Data can be obtained in four different ways:
1. Data from a published source (book, journal, newspaper).
2. Data from a designed experiment.
3. Data from a survey.
4. Data from an observational study.

Definition 1.11


A representative sample exhibits characteristics typical of those possessed by the target population.

A random sample ensures that every subset of fixed size in the population has the same chance of being included in the sample.

1.6 The Role of Statistics in Critical Thinking

Definition 1.12

Statistical thinking involves applying rational thought to assess data and the inferences made from them critically.


2. Methods for Describing Sets of Data

2.1 Describing Qualitative Data

Definition 2.1

A class is one of the categories into which qualitative data can be classified.


Definition 2.2

A class frequency is the number of observations in the data set falling in a particular class.

Definition 2.3

The class relative frequency is the class frequency divided by the total number of observations in the data set, i.e.,

Class relative frequency = class frequency / n
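To make these definitions concrete, here is a minimal Python sketch (the party-affiliation data are made up for illustration) that computes class frequencies and class relative frequencies:

from collections import Counter

# Hypothetical qualitative data: party affiliation of 10 sampled voters
data = ["Dem", "Rep", "Ind", "Dem", "Dem", "Rep", "Ind", "Dem", "Rep", "Dem"]

n = len(data)
freq = Counter(data)                 # class frequencies
for cls, f in freq.items():
    print(cls, f, f / n)             # class, class frequency, class relative frequency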

2.2 Graphical Methods for Describing Quantitative Data

Dot plot

For example, here is a typical dotplot.

110 |**
111 |***
112 |**
113 |*****
114 |******
115 |***
116 |**
117 |*

Stemplot

A set of data like the number of home runs that Barry Bonds hit can be represented by a list: 16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73, 46, 45, 36. It is very difficult for me or just about anybody else to learn much about this data set from looking at a list of numbers like this, but a stemplot can provide a lot of insight. We use the tens digit as the stem and the ones digit as the leaves to produce the display.

Excel output

Stem-and-Leaf Display   Variable: Barry Bonds   Leaf unit: 1

1 | 6 9
2 | 4 5 5
3 | 3 3 4 4 6 7 7
4 | 0 2 5 6 6 9
5 |
6 |
7 | 3

*Note the huge gap between the 40s and the 73!
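The display can also be built programmatically. Below is a small Python sketch (not part of the original notes) that groups the home run list by its tens digit:

# Barry Bonds home run data from the text
hr = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
      40, 37, 34, 49, 73, 46, 45, 36]

stems = {}
for x in sorted(hr):
    stems.setdefault(x // 10, []).append(x % 10)   # tens digit -> list of ones digits

for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(d) for d in stems.get(stem, []))
    print(f"{stem} | {leaves}")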

Histograms

Sometimes we have too much data to do a stem plot easily. Then a histogram is a more efficient choice. Here is the algorithm for doing such a plot.

1. Divide the data into classes of equal width.
2. Count the number of observations in each class.
3. Draw the histogram. Put the variable values (classes) on the horizontal axis and the frequencies or relative frequencies (= frequency / total) on the vertical axis. Leave no space between the bars. The relative frequencies sum to 1, or 100%.

From the Barry Bonds home run data, we divide the data into eight classes in the following way:

Class    # of HR
1-10     0
11-20    2
21-30    3
31-40    8
41-50    5
51-60    0
61-70    0
71-80    1

Excel output

[Excel histogram: "Barry Bonds HR Histogram" – Frequency versus class, with classes ending at 10, 20, 30, 40, 50, 60, 70, 80, More.]
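The class counts in the table can be checked with a short Python sketch (it only tallies the classes; drawing the bars is left to a plotting tool):

hr = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
      40, 37, 34, 49, 73, 46, 45, 36]

# classes 1-10, 11-20, ..., 71-80, each of width 10
for lo in range(1, 80, 10):
    hi = lo + 9
    count = sum(lo <= x <= hi for x in hr)
    print(f"{lo}-{hi}: frequency {count}, relative frequency {count / len(hr):.3f}")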

2.3 Summation Notation

The summation symbol ∑ means "add up all these numbers".

2.4 Numerical Measures of Central Tendency

Measuring center: Mean, Median and Mode

Definition 2.4


The mean of a set of quantitative data is the sum of the measurements divided by the number of measurements contained in the data set.

One measure of center is the mean or average. The mean is defined as follows. Suppose we have a list of numbers denoted x1, x2, …, xn; that is, there are n numbers in our list. The mean or average x̄ (x-bar) of our data is defined by adding up all the numbers and dividing by the number of numbers. In symbols,

x̄ = (x1 + x2 + ⋯ + xn)/n = (1/n) ∑(i=1 to n) xi.

Symbols for the Sample Mean and the Population Mean

The symbols for the mean are

x̄ = Sample mean
μ = Population mean

Definition 2.5

The median M of a quantitative data set is the middle number when the measurements are arranged in ascending (or descending) order.

How to find the median.

1. Order the observations from smallest to largest.
2. If n is odd, the median is the value of the center observation. Its location is at (n+1)/2 in the ordered list.
3. If n is even, the median is defined to be the average of the two center observations in the ordered list.

Comparing the Mean and the Median

Right Skewed Curve

Normal (Bell-shaped) Curve

Left Skewed Curve


Definition 2.8

The mode is the measurement that occurs most frequently in the data set.
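As a quick check of these three measures of center, here is a minimal Python sketch using the standard statistics module (Python 3.8+) on the Barry Bonds home run data from Section 2.2:

import statistics

hr = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
      40, 37, 34, 49, 73, 46, 45, 36]

print(statistics.mean(hr))       # sample mean x-bar
print(statistics.median(hr))     # middle value of the ordered list (n = 19, so the 10th value)
print(statistics.multimode(hr))  # most frequent measurement(s); ties are possible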

2.5 Numerical Measures of Variability

Measuring spread:

Range, Sample Variance, and Sample Standard Deviation

Definition 2.9

The range of a quantitative data set is equal to the largest measurement minus the smallest measurement.

Definition 2.10

The sample variance for a sample of n measurements is equal to the sum of the squared distances from the mean divided by (n − 1). In symbols, using s² to represent the sample variance,

s² = ∑(i=1 to n) (xi − x̄)² / (n − 1)

Note: A shortcut formula for calculating s² is

s² = [ ∑(i=1 to n) xi² − (∑(i=1 to n) xi)² / n ] / (n − 1).

Definition 2.11

The sample standard deviation, s, is defined as the positive square root of the sample variance, s². Thus,

s = √(s²).
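A minimal Python sketch (with a small made-up sample) that evaluates the defining formula, the shortcut formula, and the built-in statistics.variance, all of which should agree:

import statistics

x = [5, 7, 1, 2, 4]                 # made-up sample
n = len(x)
xbar = sum(x) / n

s2_def = sum((xi - xbar) ** 2 for xi in x) / (n - 1)                 # definition
s2_short = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)    # shortcut formula
print(s2_def, s2_short, statistics.variance(x))                      # all three agree
print(statistics.stdev(x))                                           # s = sqrt(s^2)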


Symbols for Variance and Standard Deviation

s² = Sample variance
s = Sample standard deviation
σ² = Population variance
σ = Population standard deviation

s is large when the observations are widely spread about the mean and small when the data are closely clustered about the mean. The value of s ranges from zero to infinity. A value of s = 0 would mean all the values in the data set are the same, and thus there is no spread at all in their values.

2.6 Interpreting the standard deviation

The 68-95-99.7 Rule

In any Normal Curve:

Sixty-eight percent of all observations fall within s units on either side of the mean x̄.

95% of all observations fall within 2 standard deviations (2s) of the mean x̄.

99.7% of all observations fall within 3 standard deviations (3s) of the mean x̄.

Chebyshev’s Rule

Chebyshev’s Rule applies to any data set, regardless of the shape of the frequency distribution of the data.

No useful information is provided on the fraction of measurements that fall within 1 standard deviation of the mean, i.e., within the interval (x̄ − s, x̄ + s) for samples and (μ − σ, μ + σ) for populations.

At least 3/4 of the measurements will fall within the interval (x̄ − 2s, x̄ + 2s) for samples and (μ − 2σ, μ + 2σ) for populations.

At least 8/9 of the measurements will fall within the interval (x̄ − 3s, x̄ + 3s) for samples and (μ − 3σ, μ + 3σ) for populations.
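Chebyshev's Rule is easy to verify empirically. The following Python sketch (made-up data) compares the observed fraction of measurements within k standard deviations of the mean with the guaranteed bound 1 − 1/k²:

import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9, 12, 15, 21, 30]   # made-up sample
xbar = statistics.mean(data)
s = statistics.stdev(data)

for k in (2, 3):
    inside = sum(xbar - k * s <= x <= xbar + k * s for x in data)
    print(k, inside / len(data), ">=", 1 - 1 / k**2)   # observed fraction vs. Chebyshev bound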

2.7 Numerical Measures of Relative Standing

Definition 2.12


For any set of n measurements (arranged in ascending or descending order), the pth percentile is a number such that p% of the measurements fall below the pth percentile and (100 − p)% fall above it.

Standard Normal Distribution

A special normal curve to study is the standard normal, with μ = 0 and σ = 1. This is special because every normal problem can be converted to a problem about a standard normal. The conversion from a normally distributed variable X with mean μ and standard deviation σ is carried out by the z-score transform given by

z = (x − μ)/σ.

Definition 2.13


The sample z-score for a measurement x is

z = (x − x̄)/s

The population z-score for a measurement x is

z = (x − μ)/σ

Interpretation of z-Scores for Mound-Shaped Distribution of Data

Approximately 68% of the measurements will have a z-score between −1 and 1.
Approximately 95% of the measurements will have a z-score between −2 and 2.
Approximately 99.7% of the measurements will have a z-score between −3 and 3.
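A minimal Python sketch (made-up scores) that converts each measurement to its sample z-score:

import statistics

scores = [62, 70, 74, 78, 81, 85, 90]    # made-up exam scores
xbar = statistics.mean(scores)
s = statistics.stdev(scores)

for x in scores:
    z = (x - xbar) / s                   # sample z-score
    print(x, round(z, 2))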

2.8 Methods for Detecting Outliers


Sometimes it is important to identify inconsistent or unusual measurements in a data set. An observation that is unusually large or small relative to the data values we want to describe is called an outlier.

Definition 2.14

An observation (or measurement) that is unusually large or small relative to the other values in a data set is called an outlier. Outliers typically are attributable to one of the following causes:

1. The measurement associated with the outlier may be invalid.

2. The measurement comes from a different population.

3. The measurement is correct, but represents a rare (chance) event.

Measures Based on the Quartiles

We can now define some special percentiles:


The first quartile Q1 is the 25th percentile, 25 percent of the observations in a list are smaller than Q1.

The second quartile, Q2 is the 50th percentile, or the median. About half the data are less than this value Q2.

The third quartile, Q3 is the 75th percentile, about 75 percent of the observations are below this value Q3.

Notice that these three quartiles cut the data set into four parts, hence the name quartiles: 1) the part between the minimum and Q1 (25%), 2) the part between Q1 and Q2 (25%), 3) the part between Q2 and Q3 (25%), and 4) the part between Q3 and the maximum (25%).

How to find the quartiles.

1.Arrange the observations in increasing order and locate the median M in the ordered list of observations.

2.The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median.


3.The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median.

Boxplot

A boxplot is a graph of the five-number summary.

A central box spans the quartiles Q1 and Q3.
A line in the box marks the median M.
Lines extend from the box out to the smallest and largest observations.

A measure of spread based on these quartiles is the Interquartile range IQR =Q3 - Q1, the distance between the quartiles. The IQR gives the spread in data values covered by the middle half of the data.

The quartiles in IQR give a good measure of spread because they are not sensitive to a few extreme observations in the tails. Thus, when a dataset has outliers or skewness the IQR is an appropriate summary measure.


A common rule of thumb for detecting outliers is based on 1.5 times the IQR: most of the data should lie between Q1 − 1.5·IQR and Q3 + 1.5·IQR, so values in the data set that are either bigger than Q3 + 1.5·IQR or less than Q1 − 1.5·IQR are often flagged for further consideration as potential outliers.
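A minimal Python sketch of this 1.5·IQR rule applied to the Barry Bonds home run data; note that statistics.quantiles uses its own interpolation, which can differ slightly from the median-of-halves method described above:

import statistics

data = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
        40, 37, 34, 49, 73, 46, 45, 36]

q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr       # outlier fences

outliers = [x for x in data if x < lower or x > upper]
print(q1, q2, q3, iqr, outliers)                    # the 73 is flagged as a potential outlier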

3. Probability


3.1 Events, Sample Spaces, and Probability

Definition 3.1

An experiment is an act or process of observation that leads to a single outcome that cannot be predicted with certainty.

Definition 3.2

A sample point is the most basic outcome of an experiment.

Definition 3.3

The sample space S of an experiment is the collection of all its sample points.

Definition 3.4

An event is a subset of the sample space.

Here are probability rules for sample points:


1. All sample point probabilities must lie between zero and one.

2. The probabilities of all the sample points within a sample space must sum to 1.

Example Toss a coin. There are two possible sample points, and the sample space is

S = {heads, tails} or more briefly, S = {H,T}.

Example Toss a coin four times and record the results. That’s a bit vague. To be exact, record the results of each of the four tosses in order. The sample space S is the set of all 16 strings of four H’s and T’s:

S = { HHHH, HHHT, HHTH, HHTT,

HTHH, HTHT, HTTH, HTTT,

THHH, THHT, THTH, THTT,

TTHH, TTHT, TTTH, TTTT }

Suppose that our only interest is the number of heads in four tosses. The sample space contains only five outcomes:

S = { 0, 1, 2, 3, 4}.

Probability of an Event


The probability of an event A is calculated by summing the probabilities of the sample points in the sample space for A

Example

Take the sample space S for four tosses of a coin to be the 16 possible outcomes in the form HTHH. Then “exactly 2 heads” is an event. Call this event A. The event A expressed as a subset of outcomes is

A={HHTT, HTHT, HTTH, THHT, THTH, TTHH}

P(A)=P({HHTT, HTHT, HTTH, THHT, THTH, TTHH})

=P({HHTT})+P({HTHT})+P({HTTH})+P({THHT})+P({THTH})+P({TTHH})= 6/16=3/8
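This result can be verified by brute-force enumeration of the sample space; a minimal Python sketch:

from itertools import product
from fractions import Fraction

outcomes = list(product("HT", repeat=4))          # the 16 equally likely sample points
A = [o for o in outcomes if o.count("H") == 2]    # event A: exactly 2 heads

print(len(A), len(outcomes))                      # 6 sample points out of 16
print(Fraction(len(A), len(outcomes)))            # P(A) = 3/8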


3.2 Unions and Intersections

Definition 3.5

The union of two events A and B is the event that occurs if either A or B or both occur on a single performance of the experiment. We denote the union of events A and B by the symbol A∪B. A∪B consists of all the sample points that belong to A or B or both.

Figure. Venn diagram showing disjoint events A and B.


Definition 3.6

The intersection of two events A and B is the event that occurs if both A and B occur on a single performance of the experiment. We denote the intersection of events A and B by the symbol A∩B. A∩B consists of all the sample points that belong to both A and B.

3.3 Complementary Events

Definition 3.7

The complement of an event A is the event that A does not occur – that is, the event consisting of all sample points that are not in event A. We denote the complement of A by Ac.


Here are some rules about probabilities:

The probability of an event happening is one minus the probability of the event not happening. That is, P(A) = 1 − P(Ac), where Ac is the complement of A.

If the events have no outcomes in common the probability of either of them happening is the sum of their probabilities. In notation, P(A or B) = P(A) + P(B).

For example, suppose that in a certain little town the distribution of the number of children in households with children is

Outcome      1    2    3    4    5    6 or more
Probability  .15  .55  .10  .10  .05  .05

The probability of two or fewer children is P(1 or 2) = P(1) + P(2) = .15 + .55 = .7.

Let's denote A = {1, 2}. Then P(A) = .7. How do you find P(Ac)?


P(Ac) = 1 − P(A) = 1 − .7 = .3.

3.4 The Additive Rule and Mutually Exclusive Events

Additive Rule of Probability

The probability of the union of events A and B is the sum of the probability of events A and B minus the probability of the intersection of events A and B, that is

P(A∪B) = P(A) + P(B) − P(A∩B)


Definition 3.8

Events A and B are mutually exclusive if A∩B contains no sample points, that is, if A and B have no sample points in common.

Probability of Union of Two Mutually Exclusive Events

If two events A and B are mutually exclusive, the probability of the union of A and B equals the sum of the probabilities of A and B; that is,

P(A∪B) = P(A) + P(B)

3.5 Conditional Probability

The new notation P( A|B) is a conditional probability. That is, it gives the probability of one event under the condition that we know another event. You can read the bar | as “given the information that.”


Formula for P( A|B )

To find the conditional probability that event A occurs given that event B occurs, divide the probability that both A and B occur by the probability that B occurs, that is,

P(A|B) = P(A∩B) / P(B)

We assume that P(B) ≠ 0.


Example Let’s define two events:

A = the woman chosen is young, ages 18 to 29

B = the woman chosen is married

The probability of choosing a young woman is

P(A) = 22,512 / 103,870 = 0.217.

The probability that we choose a woman who is both young and married is

P(A and B) = 7,842 / 103,870 = 0.075.

The conditional probability that a woman is married when we know she is under age 30 is

P(B|A) = P(A and B) / P(A) = 7,842 / 22,512 = 0.348.
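The same arithmetic in a short Python sketch, using the counts quoted in the example:

n_total = 103_870        # all women in the table
n_A = 22_512             # young (ages 18 to 29)
n_A_and_B = 7_842        # young and married

P_A = n_A / n_total
P_A_and_B = n_A_and_B / n_total
P_B_given_A = P_A_and_B / P_A            # = n_A_and_B / n_A

print(round(P_A, 3), round(P_A_and_B, 3), round(P_B_given_A, 3))   # 0.217, 0.075, 0.348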


3.6 The Multiplicative Rule and Independent Events

Multiplication Rule of Probability

The probability that both of two events A and B happen together can be found by

P(A ∩ B) = P(A) × P(B|A).

Example Slim is still at the poker table. At the moment, he wants very much to draw two diamonds in a row. As he sits at the table looking at his hand and at the upturned cards on the table, Slim sees 11 cards. Of these, 4 are diamonds. The full deck contains 13 diamonds among its 52 cards, so 9 of the 41 unseen cards are diamonds. To find Slim’s probability of drawing two diamonds, first calculate

P(first card diamond) = 9/41

P(second card diamond | first card diamond) = 8/40

The multiplication rule P(A ∩ B) = P(A) × P(B|A) now says that

P(both cards diamonds) = (9/41) × (8/40) = 0.044.

Slim will need luck to draw his diamonds.
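A minimal Python sketch of the same multiplication-rule calculation, using exact fractions:

from fractions import Fraction

p_first = Fraction(9, 41)                  # 9 diamonds among the 41 unseen cards
p_second_given_first = Fraction(8, 40)     # 8 diamonds left among 40 unseen cards

p_both = p_first * p_second_given_first    # multiplication rule
print(p_both, float(p_both))               # 9/205, about 0.044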


Probability Trees

Many probability and decision making problems can be conceptualized as happening in stages, and probability trees are a great way to express such a process or problem.


Example  There are two disjoint paths to B (professional play). By the addition rule, P(B) is the sum of their probabilities. The probability of reaching B through college (top half of the tree) is

P(B and A) = P(A) × P(B|A) = 0.05 × 0.017 = 0.00085.

The probability of reaching B without college (bottom half of the tree) is

P(B and Ac) = P(Ac) × P(B|Ac) = 0.95 × 0.0001 = 0.000095.

Adding the two gives P(B) ≈ 0.0009: about 9 high school athletes out of 10,000 will play professional sports.


Independent Events

Two events A and B that both have positive probability are independent if

P( A |B)=P( A ) .

When events A and B are independent, it is also true that

P( B |A )=P(B ).

Events that are not independent are said to be dependent.

Probability of Intersection of Two Independent Events

If events A and B are independent, the probability of the intersection of A and B equals the product of the probabilities of A and B; that is

P( A ∩B)=P ( A ) P(B ) .

The converse is also true: If

P( A ∩B)=P ( A ) P(B ),


then events A and B are independent.

3.7 Random Sampling

Definition 3.10

If n elements are selected from a population in such a way that every set of n elements in the population has an equal probability of being selected, the n elements are said to be a random sample.

A method of determining the number of samples is to use combinatorial mathematics. The combinatorial symbol for the number of different ways of selecting n elements from N elements is (N choose n), which is read "the number of combinations of N elements taken n at a time." The formula for calculating the number is

(N choose n) = N! / [n!(N − n)!]

where "!" is the factorial symbol and is shorthand for the following multiplication:

n! = n(n − 1)(n − 2)⋯(3)(2)(1)

Thus, for example, 5! = 5·4·3·2·1 = 120. (The quantity 0! is defined to be 1.)
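Python exposes this count directly as math.comb; a minimal sketch with N = 5 and n = 2 (arbitrary illustrative values):

from math import comb, factorial

N, n = 5, 2
print(comb(N, n))                                          # 10 ways to choose 2 of 5 elements
print(factorial(N) // (factorial(n) * factorial(N - n)))   # same value from the formula
print(factorial(5))                                        # 5! = 120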


4. Discrete Random Variables

Definition 4.1

A random variable is a variable that assumes numerical values associated with the random outcomes of an experiment, where one (and only one) numerical value is assigned to each sample point.

For example, define the random variable X as the number of heads in 2 tosses of a fair (50-50) coin. The sample space is S = {HT, HH, TH, TT}; the corresponding outcomes in this sample space are associated with the values {1, 2, 1, 0} of the random variable X because the outcomes have 1, 2, 1, and 0 heads, respectively.

4.1 Two Types of Random Variables

A random variable is one of two types: a discrete random variable or a continuous random variable.

Discrete Random Variable

A discrete random variable X can assume a countable number of possible values.

The following are examples of discrete random variables:

1. The number of seizures an epileptic patient has in a given week: x = 0, 1, 2, …
2. The number of voters in a sample of 500 who favor impeachment of the president: x = 0, 1, 2, …, 500
3. The number of students applying to medical schools this year: x = 0, 1, 2, …
4. The number of errors on a page of an accountant's ledger: x = 0, 1, 2, …
5. The number of customers waiting to be served in a restaurant at a particular time: x = 0, 1, 2, …

Continuous Random variable


Random variables that can assume values corresponding to any of the points contained in one or more intervals are called continuous.

Suppose that we want to choose a number at random between 0 and 1, allowing any number between 0 and 1 as the outcome. Software random number generators will do this. You can visualize such a random number by thinking of a spinner (Figure). The sample space is now an entire interval of numbers:

S = {all numbers x such that 0 ≤ x ≤ 1}.


Figure. A spinner that generates a random number between 0 and 1.

4.2 Probability Distributions for Discrete Random Variables

The probability distribution of X lists the values and their probabilities:

Value of X   Probability
x1           p1
x2           p2
x3           p3
⋮            ⋮
xk           pk

The Probabilities pi must satisfy two requirements:

1. Every probability pi is a number between 0 and 1.

2. p1+p2 … +pk = 1

We usually summarize all the information about a random variable with a probability table like:

X      0    1    2
P(x)   1/4  1/2  1/4


This is the probability table representing the random variable X defined above for the two-toss coin experiment. There is one outcome with zero heads, two with one head, and one with two heads. All outcomes are equally likely, and this means the probabilities are defined as the number of outcomes in the event divided by the total number of outcomes.

Definition 4.4

The probability distribution of a discrete random variable is a graph, table, or formula that specifies the probability associated with each possible value the random variable can assume.

4.3 Expected values of Discrete Random Variables

Definition 4.5

The mean, or expected value, of a discrete random variable x is

μ = E(x) = x1·p1 + x2·p2 + ⋯ + xk·pk = ∑(i=1 to k) xi·pi.

Suppose that X is a discrete random variable whose distribution is

Value of X   Probability
x1           p1
x2           p2
x3           p3
⋮            ⋮
xk           pk

To find the mean of X, multiply each possible value by its probability, then add all the products:

μ = E(x) = x1·p1 + x2·p2 + ⋯ + xk·pk = ∑(i=1 to k) xi·pi.

This means that the average or expected value, μ, of the random variable X is equal to the sum of all possible values xi of the variable, each multiplied by the probability of that value occurring.


In our 2 tosses of a coin example, we can compute the average number of heads in 2 tosses by 0(1/4)+1(1/2)+2(1/4)=1. That is, the average number or expected number of heads in 2 tosses is one head.

A more helpful way to implement this formula is to create the random variable table again, but now add an additional column, X·P(X), in which the value of X is multiplied by its probability. For example,

X    P(x)   X·P(X)
0    1/4    0
1    1/2    1/2
2    1/4    1/2

Then the average or expected value of X is found by adding up all the values in the third column to obtain μ = 1.

As another example, suppose we toss a coin 3 times and let X be the number of heads in the 3 tosses. The table is:

X    P(x)   X·P(X)
0    1/8    0
1    3/8    3/8
2    3/8    6/8
3    1/8    3/8

This gives μ = 12/8 = 1.5, so the expected number of heads in three tosses is one and a half heads.

Since a probability distribution can be viewed as a representation of a population, we will use the population variance to measure its variability.

Definition 4.6

The variance of a random variable x is

σ² = E[(x − μ)²] = ∑ (x − μ)² p(x)

Definition 4.7

The standard deviation of a discrete random variable x is equal to the square root of the variance, i.e.,

σ = √(σ²)
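A minimal Python sketch that applies these formulas to the two-toss coin distribution from Section 4.2:

from fractions import Fraction
from math import sqrt

# x -> p(x) for X = number of heads in 2 tosses of a fair coin
dist = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

mu = sum(x * p for x, p in dist.items())               # E(x) = 1
var = sum((x - mu) ** 2 * p for x, p in dist.items())  # sigma^2 = 1/2
print(mu, var, sqrt(var))                              # sigma = sqrt(1/2)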


4.4 The Binomial Random Variable

Characteristics of a Binomial Random Variable

1. The experiment consists of n identical trials.
2. There are only two possible outcomes on each trial. We will denote one outcome by S (for Success) and the other by F (for Failure).
3. The probability of S remains the same from trial to trial. This probability is denoted by p, and the probability of F is denoted by q. Note that q = 1 − p.
4. The trials are independent.
5. The binomial random variable x is the number of S's in n trials.


The Binomial distributions for sample counts

Think of tossing a coin n times as an example of the binomial setting. Each toss gives either heads or tails. The outcomes of successive tosses are independent. If we call heads a success, then p is the probability of obtaining a head. The number of heads we count is a random variable X. The distribution of X is determined by the number of observations n and the success probability p.

Binomial Distribution

The distribution of the count X of successes is called the binomial distribution with parameters n and p. The parameter n is the number of observations, and p is the probability of a success on any one observation. The possible values of X are the whole numbers from 0 to n. As an abbreviation, we say that X is B(n,p).

Example 5.2 (a) Toss a balanced coin 10 times and count the number X of heads. There are n=10 tosses. Successive tosses are independent. If the coin is balanced, the


probability of a head is p=0.5 on each toss. The number of heads we observe has the binomial distribution B(10, 0.5).

In general, we can use combinatorial mathematics to count the number of sample points. For example,

Number of sample points for which x = 3
  = Number of different ways of selecting 3 successes in the 4 trials
  = (4 choose 3) = 4! / [3!(4 − 3)!] = (4·3·2·1) / [(3·2·1)(1)] = 4

The formula that works for any value of x can be deduced as follows: suppose p=0.1 and q=0.9,

P( x=3 )=(43 )( . 1)3×( . 9 )1=(4x )( . 1)x×( .9 )4− x

The component (43 )

counts the number of sample points with x successes and the

Page 55: Report Engineering Probability and Statistics Ahmedawad

components ( .1 )x×(. 9 )4−x is the probability

associated with each sample point having x successes.

The Binomial probability Distribution

P(x) = (n choose x) p^x q^(n−x),   x = 0, 1, 2, …, n

where

p = Probability of a success on a single trial
q = 1 − p
n = Number of trials
x = Number of successes in n trials
(n choose x) = n! / [x!(n − x)!]

Page 56: Report Engineering Probability and Statistics Ahmedawad

As noted in Chapter 3, 5! = 5·4·3·2·1 = 120. Similarly, n! = n·(n−1)·(n−2)⋯3·2·1.
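A minimal Python sketch of the binomial formula, checked on the n = 4, p = 0.1 case worked above:

from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for a binomial random variable with parameters n and p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

print(binom_pmf(3, 4, 0.1))                            # (4 choose 3)(0.1)^3(0.9)^1 = 0.0036
print(sum(binom_pmf(x, 4, 0.1) for x in range(5)))     # the probabilities sum to 1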

Binomial Mean and Standard Deviation

If a count X has the binomial distribution B(n,p), then

Mean: μ = n·p

Variance: σ² = n·p·q

Standard deviation: σ = √(n·p·q)


Example  The Helsinki study planned to give gemfibrozil to about 2000 men aged 40 to 55 and a placebo to another 2000. The probability of a heart attack during the five-year period of the study for men this age is about 0.04. What are the mean and standard deviation of the number of heart attacks that will be observed in one group if the treatment does not change this probability?

(Solution) There are 2000 independent observations, each having probability p = 0.04 of a heart attack. The count X of heart attacks is B(2000, 0.04), so that

μ = n·p = 2000 × 0.04 = 80

σ = √(n·p·(1 − p)) = √(2000 × 0.04 × (1 − 0.04)) = 8.76

Finding binomial probabilities: Tables

We can find binomial probabilities for some values of n and p by looking up probabilities in Table II (see page 885) in the back of the book. The entries in the table are the probabilities P(X=k) of individual outcomes for a binomial random variable X.

Example A quality engineer selects an SRS of 10 switches from a large shipment for detailed inspection. Unknown to the engineer, 10% of the switches in the shipment fail to meet the specifications. What is the probability that no more than 1 of the 10 switches in the sample fails inspection?


(Solution). Let X = the count of bad switches in the sample.

The probability that a switch in the shipment fails to meet the specifications is p = 0.1, and the sample size is n = 10. Thus, X is B(n=10, p=0.1).

We want to calculate

P(X ≤ 1) = P(X = 0) + P(X = 1)

Look at page 885 in Table II for this calculation: look opposite n = 10 and under p = 0.10. (This part of the table appears at the left.) The entry opposite each k is P(X = k). We find

P(X ≤ 1) = P(X = 0) + P(X = 1) = 0.736.

About 74% of all samples will contain no more than 1 bad switch.
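The table lookup can be reproduced directly from the binomial formula; a minimal Python sketch:

from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

p_at_most_1 = sum(binom_pmf(x, 10, 0.1) for x in (0, 1))
print(round(p_at_most_1, 3))     # 0.736, matching the Table II value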


Figure Probability histogram for the binomial distribution with n=10 and p=0.1, for Example.


Example Corinne is a basketball player who makes 80% of her free throws over the course of a season. In a key game, Corinne shoots 15 free throws and misses 5 of them. The fans think that she failed because she was nervous. Is it unusual for Corinne to perform this poorly?

(Solution). Because the probability of making a free throw is greater than 0.5, we count misses in order to use Table II.

Let X = the number of misses in 15 attempts.

The probability of a miss is p=1-0.80=0.20. Thus, X is B(n=15, p=0.20).

We want the probability of missing 5 or more. This is

P(X ≥ 5) = P(X = 5) + ⋯ + P(X = 15).

Let’s look at page 885 in the Table II for this calculation, look opposite n=15 and under p=0.20. This part of the table appears at the left.

The entry opposite each k is P( X=k ). We find

P(X ≥ 5) = P(X = 5) + ⋯ + P(X = 15) = 1 − P(X ≤ 4) = 1 − 0.838 = 0.162.

Corinne will miss 5 or more out of 15 free throws about 16% of the time, or roughly one of every six games. While below her average level, this performance is well within the range of the usual chance variation in her shooting.


4.5 The Poisson Random Variable

A type of probability distribution that is often useful in describing the number of events that will occur in a specific period of time or in a specific area or volume is the Poisson distribution (named after the 18th-century physicist and mathematician, Simeon Poisson)

Characteristics of a Poisson Random Variable

1. The experiment consists of counting the number of times a certain event occurs during a given unit of time or in a given area or volume.
2. The probability that an event occurs in a given unit of time, area, or volume is the same for all the units.
3. The number of events that occur in one unit of time, area, or volume is independent of the number that occur in other units.
4. The mean (or expected) number of events in each unit is denoted by the Greek letter lambda, λ.


Probability Distribution, Mean, and Variance for a Poisson Random variable

P(x) = λ^x e^(−λ) / x!,   x = 0, 1, 2, …

μ = λ,   σ² = λ

where

λ = Mean number of events during the given unit of time, area, volume, etc.
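A minimal Python sketch of the Poisson formula; λ = 2 is an arbitrary illustrative mean:

from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return lam**x * exp(-lam) / factorial(x)

lam = 2.0
print([round(poisson_pmf(x, lam), 4) for x in range(6)])   # P(0), P(1), ..., P(5)
print(sum(poisson_pmf(x, lam) for x in range(100)))        # sums to (essentially) 1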


5. Continuous Random Variables

5.1 Continuous Probability Distribution

Continuous Random variable

A continuous random variable takes all values in an interval of numbers. The probability distribution of X is described by a density curve. The probability of any event is the area under the density curve and above the values of X that make up the event.

Figure The probability distribution of a continuous random variable assigns probabilities as area under a density curve.


The probability associated with a particular value of x is equal to 0; that is, P(x=a)=0 and hence

P(a< x<b)=P(a≤x≤b ) .

5.2 The Uniform Distribution

Continuous random variables that appear to have equally likely outcomes over their range of possible values possess a uniform probability distribution, perhaps the simplest of all continuous probability distributions.

Figure Assigning probabilities for generating a random number between 0 and 1. The probability of any interval of numbers is the area above the interval and under the curve.


Suppose the random variable x can assume values only in an interval c ≤ x ≤ d. The height of f(x) is constant in that interval and equals 1/(d − c). Therefore, the total area under f(x) is given by

Total area of rectangle = (Base)(Height) = (d − c) · [1/(d − c)] = 1

Probability Distribution, Mean, and Standard Deviation of a Uniform Random Variable x

f(x) = 1/(d − c),   c ≤ x ≤ d

μ = (c + d)/2

σ = (d − c)/√12
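A minimal Python sketch of these uniform formulas, with c = 2 and d = 6 chosen arbitrarily for illustration:

from math import sqrt

c, d = 2.0, 6.0                         # assumed interval endpoints

def uniform_prob(a, b):
    """P(a < x < b) for a uniform random variable on [c, d]."""
    a, b = max(a, c), min(b, d)
    return max(b - a, 0) / (d - c)

print(uniform_prob(3, 5))               # 0.5
print((c + d) / 2, (d - c) / sqrt(12))  # mean and standard deviation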

5.3 The Normal Distribution


Probability Distribution for a Normal Random Variable x

f(x) = [1/(σ√(2π))] e^(−(1/2)[(x−μ)/σ]²)

where

μ = Mean of the normal random variable x
σ = Standard deviation
π = 3.1416…
e = 2.71828…

Definition 5.1

The standard normal distribution is a normal distribution with μ = 0 and σ = 1. A random variable with a standard normal distribution, denoted by the symbol z, is called a standard normal random variable.

Normal distributions as probability distributions

In the language of random variables, if x has the N(μ, σ) distribution, then the standardized variable

z = (x − μ)/σ

is a standard normal random variable having the distribution N(0, 1).

Here are the steps for finding a probability corresponding to a normal random variable.

1. Draw a normal curve first; do your best.
2. Next, label the center or mean of the curve with zero, because standard normal curves have a mean of zero.
3. Put in scaling by finding the distance from the center to the inflection point. This distance above μ is one unit; put in a 1, that is, one standard deviation above μ.
4. Use Table IV in Appendix A to find the areas corresponding to the z-values.


5.4 Descriptive Methods for Assessing Normality

Determining Whether the Data Are From an Approximately Normal Distribution

1. Construct either a histogram or stem-and-leaf display for the data and note the shape of the graph. If the data are approximately normal, the shape of the histogram or stem-and-leaf display will be similar to the normal curve.

2. Compute the intervals (x̄ − s, x̄ + s), (x̄ − 2s, x̄ + 2s), and (x̄ − 3s, x̄ + 3s) and determine the percentage of measurements falling in each. If the data are approximately normal, the percentages will be approximately equal to 68%, 95%, and 100%, respectively.

3. Find the interquartile range, IQR, and the standard deviation, s, for the sample, then calculate the ratio IQR/s. If the data are approximately normal, then IQR/s ≈ 1.3.

4. Construct a normal probability plot for the data. If the data are approximately normal, the points will fall (approximately) on a straight line.

Definition 5.2


A normal probability plot for a data set is a scatterplot with the ranked data values on one axis and their corresponding expected z-scores from a standard normal distribution on the other axis.

[Arc 1.06 output, data set "baseball": performance/salary data for Major League Baseball teams in 1995, from Samaniego, F. J. and Watnik, M. R. (1997), "The Separation Principle in Linear Regression," Journal of Statistics Education 5(3), http://www.stat.ncsu.edu:80/info/jse/v5n3/samaniego.html. Variables (n = 28): Hitpay (payroll of non-pitchers only), Payroll (total payroll, millions of dollars), Pitchpay (payroll of pitchers only), Wins (number of games won), Team (team name).]

5.6 The Exponential Distribution


The exponential distribution is an example of a skewed distribution. It is a popular model for populations such as the length of time a light bulb lasts. For this reason, the exponential distribution is sometimes called the waiting time distribution.

Probability Distribution for an Exponential Random Variable x

The Probability density function:

f(x) = (1/θ) e^(−x/θ),   x > 0

Mean: μ = θ

Standard deviation: σ = θ.
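For this density, P(X > a) = e^(−a/θ). A minimal Python sketch, with θ = 500 as a made-up mean lifetime (e.g., hours a light bulb lasts):

from math import exp

theta = 500.0                    # assumed mean theta

def expon_tail(a):
    """P(X > a) = exp(-a/theta) for an exponential random variable with mean theta."""
    return exp(-a / theta)

print(expon_tail(500))           # P(X > theta) = e^-1, about 0.368
print(1 - expon_tail(1000))      # P(X <= 1000)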

Chapter 6: Some Continuous Probability Distributions


Again, PDFs are population quantities which give us information about the distribution of items in the population. There are many PDFs that are used to understand probabilities associated with random variables. There are a few PDFs which are used for multiple real-life situations. These PDFs are described next.

From this chapter, it is important to learn the following:

What are these PDFs which can be used for multiple situations

When can these PDFs be used


The means and variances for random variables with these PDFs

All PDFs in this chapter will be for continuous random variables.


6.1: Continuous Uniform Distribution

The simplest PDF for continuous random variables is when the probability of observing a particular range of values for X is the same for all equal length ranges! Since the probabilities are the same, this PDF is called the uniform PDF.

The Uniform PDF – Let X be a random variable on the interval [A, B]. The uniform PDF is

f(x; A, B) = 1/(B − A) for A ≤ x ≤ B, and 0 elsewhere.

Notes:
o We examined this PDF at the beginning of Section 3.3!
o The parameters, A and B, control the location of the PDF. In general, this is what a graph of the PDF looks like.


o The area under the curve is 1. Since the PDF looks like a rectangle, we can take base × height = (B − A)·[1/(B − A)] to find that the area is 1.

Example: Uniform distribution with A=1 and B=4 (uniform.xls)


[Figure: Uniform PDF with A = 1 and B = 4; x on the horizontal axis, f(x) on the vertical axis.]

Areas underneath the curve correspond to probabilities. For example, P(1<X<3) = 0.67.

How could I find this using calculus?

Note the blue lines on the x-axis should be extended to the end of the plot.

Theorem 6.1 – The mean and variance of a random variable X with a uniform PDF are

μ = (A + B)/2   and   σ² = (B − A)²/12.


6.2: Normal Distribution

This is the main PDF that we will be using since it occurs in many applications.

Normal PDF – Let X be a random variable with mean E(X) = μ and Var(X) = σ². The normal PDF is

f(x; μ, σ) = [1/(σ√(2π))] e^(−(x−μ)²/(2σ²)),   −∞ < x < ∞.

Notes:
The parameters, μ and σ, control the location and scale of the distribution, respectively. These are the population mean and standard deviation! Thus, a nice simplification with the normal PDF is that the mean and standard deviation can be represented easily as parameters in the function.

In most realistic applications, μ and σ will not be known and we will need to estimate them. How to do this will be discussed in future chapters.

The book denotes f(x; μ, σ) by n(x; μ, σ).


Terminology: Suppose X is a random variable with a normal PDF. One can shorten how this is said by saying X is a normal random variable.

In general, this is what a graph of the distribution looks like.

o The curve graphed is the set of (x, f(x)) points, connected.
o The PDF is centered at μ (symmetric about μ). Thus, P(X > μ) = P(X < μ) = 0.5. The parameter μ is often called a location parameter since it gives the central location of the PDF.
o The area under the curve is 1.


o The left and right sides of the curve extend out to −∞ and +∞ without touching the x-axis (although the curve gets very close). Note the plot above may be a little misleading with respect to this. The left and right sides of the PDF are often called the "tails" of the PDF.

o σ controls the scale of the PDF. The larger σ, the more spread out the PDF (large variability). The smaller σ, the less spread out the PDF (small variability). Below are three normal PDFs demonstrating this.


[Figure: Normal PDF Example – three normal PDFs with different spreads; x (MPG) versus f(x).]

A VERY IMPORTANT specific case of a normal PDF is the standard normal PDF. This PDF has μ = 0 and σ = 1. Therefore,

f(z; 0, 1) = [1/√(2π)] e^(−z²/2).

Typically, "Z" is used instead of "X" to denote a standard normal random variable. This will be discussed more later.

Showing that the total area under the PDF is 1 is not as easy as it was in Chapter 3. The proof involves making a transformation to polar coordinates. Pages 104-5 of Casella and Berger's (1990) textbook show the proof (this book is used for STAT 882).

Example: Interactive normal PDFs (normal_dist.xls)

This file is constructed to help you visualize the normal probability distribution. For example, below is the normal PDF for μ = 50 and σ = 3.


Experiment on your own using different values of μ and σ to see changes in the distribution. Make sure you understand the following:

What happens when μ is increased or decreased?

What happens when σ is increased or decreased?

Where is the highest point on the distribution? What is this highest point?

Also in the file are examples of how to use the NORMDIST( ) and NORMINV( ) Excel functions, which will be discussed in detail in Section 6.3.

Below is the proof showing that E(X) = μ. A similar proof can be done to show Var(X) = σ² (see p. 146 of the book).


6.3-6.4: Areas Under the Normal Curve and Applications of the Normal Distribution

Example: Grand Am (grand_am_normal.xls)

Suppose that it is reasonable to assume a Grand Am's MPG has a normal PDF with a mean MPG of μ = 24.3 and a standard deviation of σ = 0.6. Let X denote the MPG for one tank of gas. Answer the following questions.

1) Find the probability that a randomly selected Grand Am gets less than 23 MPG for one tank of gas.

We need to find P(X<23) = F(23). This is the area to the left of the red line underneath the PDF.


[Figure: Grand Am Normal PDF Example, m=24.3 & s=0.6; x (MPG) versus f(x), with a vertical red line at x = 23.]

This probability can be found by

P(X < 23) = ∫ from −∞ to 23 of [1/(0.6√(2π))] e^(−(x − 24.3)²/(2·0.6²)) dx.

Using Maple without evaluating at the limits of integration, we get:

> assume(sigma>0);
> f:=1/(sqrt(2*Pi)*0.6)*exp(-(x-24.3)^2/(2*0.6^2));

    f := .8333333335 sqrt(2) exp(-1.388888889 (x - 24.3)^2) / sqrt(Pi)

> int(f,x);

    .4999999998 erf(1.178511302 x - 28.63782464)

Notice that a capital P is used in the Pi function. See http://mathworld.wolfram.com/Erf.html for more information on the erf() function.

Using Maple with the limits of integration, we get:

> int(f,x=-infinity..23);

    .0151301397

where Maple uses numerical approximations for the last integral.

To make finding probabilities easier, many software packages (and calculators) have special functions which do the integration for X in some interval. In Excel, the NORMDIST(x, μ, σ, TRUE) function finds F(x) for a normal random variable with mean μ and standard deviation σ.

(Note: using FALSE as the last argument instead evaluates f(x) at x = 23; that is, it returns the height of the curve at x = 23.)

For this example, use

NORMDIST(23,24.3,0.6,TRUE)

This results in 0.0151.
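The same cumulative probability is available outside Excel; for example, Python's standard-library NormalDist gives a minimal cross-check of the NORMDIST result:

from statistics import NormalDist

X = NormalDist(mu=24.3, sigma=0.6)       # Grand Am MPG model
print(round(X.cdf(23), 4))               # P(X < 23) = 0.0151
print(round(X.cdf(25) - X.cdf(23), 4))   # P(23 < X < 25) = 0.8632, as used in part 4 below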

Chris Malone’s Excel Instructions website contains help for this function at http://www.statsclass.com/excel/tables/prob_values.html#prob_n. The web page shows another way to use the function through a window based format.

Side note: To find the probability in Maple using its specialized functions, you can use the following code:

> with(stats);

    [anova, describe, fit, importdata, random, statevalf, statplots, transform]

> statevalf[cdf,normald[24.3,0.6]](23);

    .01513014001


2) Suppose σ is increased to σ = 1.3. What do you expect to happen to P(X<23)?

The Excel function is NORMDIST(23,24.3,1.3,TRUE)

[Figure: Grand Am Normal PDF Example, m=24.3 & s=1.3; x (MPG) versus f(x).]

3) Suppose σ = 0.6 again, but μ is decreased to μ = 23.1. What do you expect to happen to P(X<23)?

The Excel function is NORMDIST(23,23.1,0.6,TRUE)

(Answer to 2: the probability goes up since there is more variability, so it is more plausible to have X < 23; P(X<23) = 0.1587.)

(Answer to 3: the probability goes up since the mean is closer to 23, so it is more plausible to have X < 23; P(X<23) = 0.4338.)

[Figure: Grand Am Normal PDF Example, m=23.1 & s=0.6; x (MPG) versus f(x).]

Below is a nice comparative graph for the 3 examples above.

[Figure: comparative plot of the three Grand Am normal PDFs: m=24.3 & s=0.6, m=24.3 & s=1.3, and m=23.1 & s=0.6; x (MPG) versus f(x).]


4) Suppose σ = 0.6 and μ = 24.3 again. What is P(23<X<25)?

The probability needs to be broken up since the NORMDIST( ) function only finds probabilities in the form of F(x).

P(23<X<25) = P(X<25) – P(X<23) = F(25) – F(23).

This can be found with the Excel functions:

NORMDIST(25,24.3,0.6,TRUE)-NORMDIST(23,24.3,0.6,TRUE)

The probability is 0.8632.


[Figure: Grand Am Normal Probability Distribution Example for m=24.3, s=0.6, with the area corresponding to P(23<X<25) shaded; X versus f(X).]

5) Suppose σ = 0.6 and μ = 24.3 again. What is P(X>23)?

(Answer: use the complement: P(X>23) = 1 − P(X<23) = 1 − 0.0151 = 0.9849.)

6) Suppose σ = 0.6 and μ = 24.3 again. What is P(X<23 or X>25)?

(Answer: use the complement: P(X<23 or X>25) = 1 − P(23<X<25) = 1 − 0.8632 = 0.1368.)

7) What MPG is at least required for a car to be in the top 5% of all Grand Ams? Suppose σ = 0.6 and μ = 24.3 again.

This problem requires going in the opposite direction. We are now given a probability and need to find the corresponding "x" that works for P(X>x) = 0.05. In terms of integration, we are trying to find x in the equation below:

∫ from x to ∞ of [1/(0.6√(2π))] e^(−(y − 24.3)²/(2·0.6²)) dy = 0.05.

Equivalently,

∫ from −∞ to x of [1/(0.6√(2π))] e^(−(y − 24.3)²/(2·0.6²)) dy = 0.95.

Notice the limits of integration used are in terms of y. This is done to avoid the confusion of integrating from "x = x to ∞".


The x value can be found by using Excel's NORMINV(area, μ, σ) function, where area = P(X<x).

Be careful! Notice that the area is for P(X<x), not P(X>x).

The x value can be found with the Excel function:

NORMINV(0.95,24.3,0.6)

Therefore, P(X>25.29)=0.05.
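The inverse calculation can be cross-checked the same way with Python's standard-library NormalDist; a minimal sketch:

from statistics import NormalDist

X = NormalDist(mu=24.3, sigma=0.6)
x_cut = X.inv_cdf(0.95)          # the x with P(X < x) = 0.95, i.e. P(X > x) = 0.05
print(round(x_cut, 2))           # 25.29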


See http://www.statsclass.com/excel/tables/crit_values.html#crit_n for more information about this function. Note that we will eventually use these types of values as “critical points” in hypothesis testing.

Here are other ways to find the value of x in Maple:

> with(stats);

    [anova, describe, fit, importdata, random, statevalf, statplots, transform]

> statevalf[icdf,normald[24.3,0.6]](0.95);

    25.28691218

> f:=1/(sqrt(2*Pi)*0.6)*exp(-(y-mu)^2/(2*sigma^2));

    f := .8333333335 sqrt(2) exp(-(1/2) (y - mu)^2 / sigma^2) / sqrt(Pi)

> solve(0.95 = eval(int(f, y=-infinity..x), [mu=24.3, sigma=0.6]), x);

    25.28691217

Example: Grading (grade_bell.xls)

Suppose the set of test #2 grades in the class has a normal distribution with μ = 73% and σ = 8%. Let X be a student's grade. Answer the following.

1) What is the probability that a randomly chosen student in the class received a grade of 90% or better?

[Figure: Grading Normal PDF Example, m=73 & s=8; x (Grade) versus f(x).]


Let X be a normal random variable with μ = 73% and σ = 8%. Find P(X>90); that is, we need to find P(X>90) = 1 − F(90).

The Excel function is 1-NORMDIST(90,73,8,TRUE) and the answer is 0.0168.

2) What percentage of students scored between a 70% and 90%?

[Figure: Grading Normal PDF Example, m=73 & s=8; x (Grade) versus f(x).]


The Excel function is NORMDIST(90,73,8,TRUE)-NORMDIST(70,73,8,TRUE) and the answer is 0.6294.

3) Suppose that your instructor curves the test #2 grades and that ONLY the top 10% of test scores receive A's. Would a student be better off with a test #2 grade of 81% (still with μ = 73% and σ = 8%) or a grade of 68% on a different test #2 that has a normal distribution with μ = 62% and σ = 3%?


[Figure: Grading Normal PDF Example, showing both m=73 & s=8 and m=62 & s=3; x (Grade) versus f(x).]

Find the top 10% of the scores for each situation.

For μ = 73% and σ = 8%, find x for P(X>x) = 0.10.

The Excel function to find this is NORMINV(0.9,73,8) and the answer is 83.25.

For μ = 62% and σ = 3%, find x for P(X>x) = 0.10.


The Excel function to find this is NORMINV(0.9,62,3) and the answer is 65.84.

A student would prefer the second test: a grade of 68% exceeds the cutoff of 65.84, while a grade of 81% falls short of the cutoff of 83.25, so only the second grade would receive an A.

Rule of thumb for the number of standard deviations all data lies from its mean:

In Chapter 4, we discussed that approximately all of the data lies within 2σ or 3σ of its mean. We also discussed the more formal expression of this using Chebyshev's Rule. Examine what happens if our data come from a normal PDF. The end result is what is often called the Empirical Rule.

Example: Standard normal distribution template (stand_norm_prob.xls)

Let Z be a random variable with a standard normal PDF. Thus, μ=0 and σ=1. All of these results apply for μ≠0 and σ≠1 as well. Below are three screen captures that show a standard normal PDF. The plots show the area within 1, 2, and 3 standard deviations of the mean.

[Three plots: the standard normal PDF with the area within 1, 2, and 3 standard deviations of the mean shaded.]

Notice how large the probability is that Z falls within 2 or 3 standard deviations of the mean!
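As a numerical check on these pictures, the areas within 1, 2, and 3 standard deviations of the mean can be computed directly. A small Python sketch (assuming scipy is installed; offered only as an illustration) reproduces the familiar Empirical Rule values of roughly 0.68, 0.95, and 0.997:

from scipy import stats

# P(-k < Z < k) for a standard normal random variable Z
for k in (1, 2, 3):
    prob = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(k, round(prob, 4))   # 0.6827, 0.9545, 0.9973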

Reminder about P(X=x)=0

What is P(X=x)? It is 0 since X is a continuous random variable. To see why this is true, consider this proof by example.

Let Z be a standard normal random variable. The following table of probabilities can then be constructed.

Interval           Probability
P(0.95<Z<1.05)     0.0242
P(0.98<Z<1.02)     0.0096
P(0.99<Z<1.01)     0.0049
P(0.99<Z<1.00)     0.0024
P(1.00<Z<1.01)     0.0025
P(Z=1)             0

Notice the probability gets smaller and smaller as the interval gets smaller. Eventually, the probability will become 0.
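The shrinking-interval argument can be reproduced numerically. The sketch below (Python with scipy; an illustration, not part of the original spreadsheet) recomputes the probabilities in the table as the interval closes in on the single point z = 1:

from scipy import stats

# P(a < Z < b) for intervals shrinking down to the point z = 1
intervals = [(0.95, 1.05), (0.98, 1.02), (0.99, 1.01), (0.99, 1.00), (1.00, 1.01)]
for a, b in intervals:
    prob = stats.norm.cdf(b) - stats.norm.cdf(a)
    print(a, b, round(prob, 4))
# The probabilities shrink toward 0; the limiting case P(Z = 1) is exactly 0
# because Z is a continuous random variable.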

Remember that P(a<X<b) = ∫ from a to b of f(x) dx for some PDF f(x), where X is a continuous random variable. When a=b, then

P(X=a) = ∫ from a to a of f(x) dx = 0.

Standard normal PDF

Probabilities associated with the standard normal PDF have been tabled.


Example: Standard normal distribution tables (stand_norm_table.xls)

Before there were readily accessible software packages or calculators with functions for the normal PDF, people used tables based on the standard normal PDF in order to find probabilities associated with ANY normal PDF. Table A.3 on p. 670-1 of the book is one of these tables. It provides F(z), the CDF for a standard normal random variable Z. I use Z here because that is the common practice when discussing standard normal random variables.

Thus, Table A.3 gives probabilities such as the one shown below.


Below is an excerpt from the table contained in stand_norm_table.xls.

z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
-3.4   0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0002
-3.3   0.0005  0.0005  0.0005  0.0004  0.0004  0.0004  0.0004  0.0004  0.0004  0.0003
-3.2   0.0007  0.0007  0.0006  0.0006  0.0006  0.0006  0.0006  0.0005  0.0005  0.0005
-3.1   0.0010  0.0009  0.0009  0.0009  0.0008  0.0008  0.0008  0.0008  0.0007  0.0007
-3.0   0.0013  0.0013  0.0013  0.0012  0.0012  0.0011  0.0011  0.0011  0.0010  0.0010
-2.9   0.0019  0.0018  0.0018  0.0017  0.0016  0.0016  0.0015  0.0015  0.0014  0.0014
-2.8   0.0026  0.0025  0.0024  0.0023  0.0023  0.0022  0.0021  0.0021  0.0020  0.0019
-2.7   0.0035  0.0034  0.0033  0.0032  0.0031  0.0030  0.0029  0.0028  0.0027  0.0026
-2.6   0.0047  0.0045  0.0044  0.0043  0.0041  0.0040  0.0039  0.0038  0.0037  0.0036
-2.5   0.0062  0.0060  0.0059  0.0057  0.0055  0.0054  0.0052  0.0051  0.0049  0.0048
-2.4   0.0082  0.0080  0.0078  0.0075  0.0073  0.0071  0.0069  0.0068  0.0066  0.0064
-2.3   0.0107  0.0104  0.0102  0.0099  0.0096  0.0094  0.0091  0.0089  0.0087  0.0084
-2.2   0.0139  0.0136  0.0132  0.0129  0.0125  0.0122  0.0119  0.0116  0.0113  0.0110
-2.1   0.0179  0.0174  0.0170  0.0166  0.0162  0.0158  0.0154  0.0150  0.0146  0.0143
-2.0   0.0228  0.0222  0.0217  0.0212  0.0207  0.0202  0.0197  0.0192  0.0188  0.0183
-1.9   0.0287  0.0281  0.0274  0.0268  0.0262  0.0256  0.0250  0.0244  0.0239  0.0233

This table uses the NORMDIST(z, 0, 1, TRUE) function to find P(Z<z). For example,

P(Z<-3.41) = 0.0003,

P(Z<-3.03) = 0.0012,

P(Z<-2.57) = 0.0051,

Why are we concerned with this table of standard normal probabilities?

A simple transformation can be made from ANY normal PDF to the standard normal PDF using the following formula:

Z = (X − μ)/σ

where X is a normal random variable with mean μ and standard deviation σ, and Z is a standard normal random variable with mean 0 and standard deviation 1.

Therefore, using this one table, we can find all normal PDF probabilities WITHOUT Excel or other means.
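The standardization idea can also be verified with a short Python sketch (scipy assumed; an illustration only): P(X<23) computed directly from the normal(μ=24.3, σ=0.6) CDF equals P(Z < (23−μ)/σ) computed from the standard normal CDF.

from scipy import stats

mu, sigma = 24.3, 0.6
direct = stats.norm.cdf(23, loc=mu, scale=sigma)   # normal(mu, sigma) CDF at 23
z = (23 - mu) / sigma                              # standardize: Z = (X - mu)/sigma
standardized = stats.norm.cdf(z)                   # standard normal CDF at z
print(round(direct, 4), round(standardized, 4))    # both print 0.0151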

Example: Grand Am (grand_am_normal.xls)

Suppose that it is reasonable to assume a Grand Am's MPG has a normal PDF with a mean MPG of μ=24.3 and a standard deviation of σ=0.6. Let X denote the MPG for one tank of gas. Answer the following questions.

1) Find the probability that a randomly selected Grand Am gets less than 23 MPG for one tank of gas.

We need to find P(X<23) = F(23). This is the area to the left of the red line underneath the PDF.

[Plot: Grand Am Normal PDF Example, μ=24.3 & σ=0.6; f(x) versus x (MPG), with the area to the left of 23 of interest.]

The function, NORMDIST(23,24.3,0.6,TRUE), can be used in Excel to find the probability to be 0.0151.

Using the tables, P(X<23) = P(Z < (23−24.3)/0.6) = P(Z<−2.1667) ≈ P(Z<−2.17) = 0.0150.

2) Suppose σ is increased to σ=1.3. What do you expect to happen to P(X<23)?

The function, NORMDIST(23,24.3,1.3,TRUE), can be used to find the probability to be 0.1587.

[Plot: Grand Am Normal PDF Example, μ=24.3 & σ=1.3; f(x) versus x (MPG).]

Using the tables, P(X<23) = P(Z < (23−24.3)/1.3) = P(Z<−1) = 0.1587.

3) Suppose σ=0.6 again, but μ is decreased to μ=23.1. What do you expect to happen to P(X<23)?

The function, NORMDIST(23,23.1,0.6,TRUE), can be used to find the probability to be 0.4338.

Using the tables, P(X<23) = P(Z < (23−23.1)/0.6) = P(Z<−0.1667) ≈ P(Z<−0.17) = 0.4325.

4) Suppose σ=0.6 and μ=24.3 again. What is P(23<X<25)?

The function, NORMDIST(25,24.3,0.6,TRUE)-NORMDIST(23,24.3,0.6,TRUE)

can be used to find the probability to be 0.8632

Using the tables, P(23<X<25)
= P((23−24.3)/0.6 < Z < (25−24.3)/0.6)
= P(−2.1667<Z<1.1667)
≈ P(−2.17<Z<1.17)
= P(Z<1.17) − P(Z<−2.17)
= 0.8790 − 0.0150
= 0.8640

5) Suppose σ=0.6 and μ=24.3 again. What is P(X>23)? Use the complement: P(X>23) = 1 − P(X<23) = 1 − 0.0151 = 0.9849.

6) Suppose σ=0.6 and μ=24.3 again. What is P(X<23 or X>25)? Use the complement: P(X<23 or X>25) = 1 − P(23<X<25) = 1 − 0.8632 = 0.1368.

7) What MPG is required for a car to be in the top 5% of all Grand Ams? Suppose σ=0.6 and μ=24.3 again.

The x value for P(X>x)=0.05 was found with the Excel function =NORMINV(0.95,24.3,0.6). This produced P(X>25.29)=0.05.


Using the tables, P(X>x) = P(Z > (x−24.3)/0.6) = 0.05, which means P(Z < (x−24.3)/0.6) = 0.95.

Note that P(Z<z) = 0.95 produces z ≈ 1.64.

Then (x−24.3)/0.6 = 1.64 and x = 24.3 + 1.64(0.6). Thus, x = 25.284. Therefore, P(X>25.284) ≈ 0.05.
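A quick software check of this back-transformation (a Python sketch with scipy, not part of the notes): invert the standard normal CDF to get the exact z, then un-standardize with x = μ + zσ.

from scipy import stats

mu, sigma = 24.3, 0.6
z = stats.norm.ppf(0.95)         # exact z with P(Z < z) = 0.95, about 1.6449
x = mu + z * sigma               # un-standardize back to the MPG scale
print(round(z, 4), round(x, 3))  # 1.6449 and 25.287, close to the table answer of 25.284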


Observing a sample from a population characterized by a normal PDF

Suppose a population can be characterized by a normal PDF. What characteristics would you expect for a sample taken from that population?

Example: MPG (gen_norm.xls)

MPG example from before: X is a normal random variable with μ = E(X) = 24.3 and σ = √Var(X) = 0.6. Suppose 1,000 different x's are observed. In other words, a sample of 1,000 is taken from the population.

Questions:
1) What would you expect the average value of the 1,000 observed x's to be, approximately?
2) What range would you expect most of the x's to fall within?

Observed values of a normal random variable can also be generated in the same way as was done in Chapters 3 and 5. Excel also has a specific normal PDF option in the Random Number Generation window. The file gen_norm.xls gives an example of using the window below. More directions are available at Chris Malone's Excel help website at http://www.statsclass.com/excel/misc/norm_dist.html.

[Screenshot: Excel's Random Number Generation window set up for the normal distribution.]

In this case, 1 variable with 1,000 observed values is generated. The mean μ=24.3 and standard deviation σ=0.6 are used to coincide with the Grand Am example. The seed number gives Excel a starting point when generating these observed values. I can use this seed number again and generate the exact same data!

Below is part of the results. The first column of the spreadsheet holds the 1,000 observed MPG values (the first few are 23.74819, 23.48401, 24.59496, 24.19059, 24.32339, …). The summary measures and frequency distribution are:

MPG                  population   sample
mean                 24.3         24.32106
standard deviation   0.6          0.596285

Bin    Frequency
22.6       0
22.8       3
23        11
23.2      23
23.4      23
23.6      53
23.8      83
24       110
24.2     113
24.4     121
24.6     111
24.8     136
25        93
25.2      55
25.4      37
25.6      16
25.8       5
26         4
26.2       1
26.4       1
26.6       1
26.8       0
More       0

Notes:

Notice how close μ and σ are to the sample mean and standard deviation. The sample standard deviation is calculated as

s = √[ Σᵢ (xᵢ − x̄)² / (n − 1) ]

where x̄ is the sample mean and xᵢ, for i=1,…,n, is the ith observed value. An explanation of why this formula is used will be given in Chapter 8.
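A comparable simulation can be run outside Excel. The Python sketch below (numpy assumed; the seed value is arbitrary and used only for reproducibility) draws 1,000 observations from a normal PDF with μ=24.3 and σ=0.6 and computes the sample mean and the sample standard deviation with the n−1 divisor:

import numpy as np

rng = np.random.default_rng(1514)            # any seed gives a reproducible sample
sample = rng.normal(loc=24.3, scale=0.6, size=1000)

xbar = sample.mean()                         # sample mean
s = sample.std(ddof=1)                       # sample standard deviation (divides by n-1)
print(round(xbar, 4), round(s, 4))           # values near 24.3 and 0.6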

Here is an example of how to simulate a sample from a normal PDF using Maple:

> randomize(1514);

1514

> data := stats[random, normald[24.3, 0.6]](100);

data := 25.36372908, 24.96025314, 24.44663243, 25.27318122, 23.94262355,
        23.69829609, 23.96992063, 23.72400640, 24.06923492, 24.38832186,
        24.54452405, 23.47219191, 24.51653894, 24.22545826, 24.58063212,
        24.40056631, 24.22519976, 24.73647509, 23.04956592, 24.94875357,
        24.02254401, 24.35341391, 24.67885308, 24.81796173, 23.60716054,
        24.15571156, 24.48549168, 23.84686372, 25.62993784, 24.95907390,
        24.13187013, 24.40491872, 25.04623787, 23.81147131, 23.04161664,
        25.57549338, 23.34059716, 24.46719408, 24.23062843, 23.80346201,
        25.20382342, 23.72508178, 23.35185260, 23.99842442, 24.55421301,
        24.06936962, 23.50756715, 24.22223306, 24.28139128, 24.47253728,
        24.50969275, 25.31179898, 24.30883191, 24.39745116, 24.34240361,
        24.44507802, 24.28610049, 24.04085590, 25.13232101, 24.66322075,
        25.09714835, 25.12040542, 24.69746294, 24.51272238, 23.75350627,
        25.60826660, 24.19990788, 25.02525917, 24.41097845, 24.17714648,
        24.63990563, 24.74360918, 23.45013063, 24.52780462, 24.47851759,
        24.27232784, 23.27915406, 25.15368420, 24.38724182, 23.47378351,
        24.23063511, 24.06653251, 24.43778592, 24.04812858, 25.20231330,
        23.34198654, 23.30621749, 24.58547842, 24.40825270, 23.90859335,
        25.63674860, 24.48445061, 24.56049376, 23.33174552, 24.26972911,
        23.65460645, 24.16122685, 24.74861908, 24.58956375, 24.41964871

> evalf(stats[describe,mean]([data]),4);

24.30

> evalf(stats[describe, standarddeviation]([data]),4);

.5871

Page 6.118 of the notes shows one possible frequency distribution for the sample. This gives information about how often the observed values fell into chosen classes. In Excel, I originally entered the values in the "classes" column. After performing a few steps, Excel automatically generates a frequency distribution. One needs to be VERY careful with interpreting what Excel gives. Below is another representation of it:

Classes             Frequency
22.6 or less            0
>22.6 and ≤22.8         3
>22.8 and ≤23          11
>23 and ≤23.2          23
>23.2 and ≤23.4        23
>23.4 and ≤23.6        53
>23.6 and ≤23.8        83
>23.8 and ≤24         110
>24 and ≤24.2         113
>24.2 and ≤24.4       121
>24.4 and ≤24.6       111
>24.6 and ≤24.8       136
>24.8 and ≤25          93
>25 and ≤25.2          55
>25.2 and ≤25.4        37
>25.4 and ≤25.6        16
>25.6 and ≤25.8         5
>25.8 and ≤26           4
>26 and ≤26.2           1
>26.2 and ≤26.4         1
>26.4 and ≤26.6         1
>26.6 and ≤26.8         0
>26.8                   0

Thus, 136 sampled values are greater than 24.6 and less than or equal to 24.8.

Why were these classes chosen? There is more than one set of classes that can be used. Here are some guidelines:

a) Find the minimum and maximum observed values. You can use the MIN() and MAX() functions in Excel to do this.

b) Choose classes which are of equal size.

c) Choose the classes between the minimum and maximum values which make sense relative to the data set. You may need to choose a few different ones until you think the frequency distribution represents the data well.


d) Note that 1, 2, or 3 classes do not work!

The frequency distribution is often plotted. This plot is called a histogram. Below is the histogram created by Excel.

[Histogram of 1,000 MPG observed values (Excel output): Frequency versus x = MPG, bins from 22.6 through 26.8 plus "More".]

Does the histogram have a similar shape to the normal PDF with μ=24.3 and σ=0.6? If so, a normal PDF approximation to the distribution of MPG would be appropriate.

[Plot: Grand Am Normal PDF Example, μ=24.3 & σ=0.6; f(x) versus x (MPG).]

Below is an outline of the steps to find the frequency distribution and histogram for this example. General information about how to find a frequency distribution and histogram are available at http://www.statsclass.com/excel/graphs/histogram.html.

1) Find the minimum and maximum values:
   min = MIN(B10:B1009) = 22.71475
   max = MAX(B10:B1009) = 26.46672

2) In an empty area in the spreadsheet, create a column of classes:
   22.6, 22.8, 23, 23.2, 23.4, 23.6, 23.8, 24, 24.2, 24.4, 24.6, 24.8, 25, 25.2, 25.4, 25.6, 25.8, 26, 26.2, 26.4, 26.6, 26.8

3) Select TOOLS > DATA ANALYSIS from the main Excel menu bar.

4) Select HISTOGRAM and OK from the DATA ANALYSIS window.

5) The HISTOGRAM window will then appear. In the window, do the following:
   a) Input the cell range of the 1,000 observed values in the INPUT RANGE.
   b) Input the cell range of the classes into the BIN RANGE.
   c) Select an OUTPUT RANGE for the corresponding frequency distribution to start at. I usually specify the first cell to the right of my classes.
   d) Select the CHART OUTPUT option to have a histogram created.
   e) Select OK to have the frequency distribution and the histogram created!

Below is what my spreadsheet looks like immediately after OK is selected.

[Screenshot: the spreadsheet immediately after OK is selected, showing the generated frequency distribution and the default histogram.]

6) Edit the histogram so that it looks nicer:

[Edited histogram of the 1,000 MPG observed values: Frequency versus x = MPG, bins from 22.6 through 26.8 plus "More".]

Chris Malone has created a spreadsheet called data_summary.xls, which can be used when one wants to determine whether a normal PDF approximation is appropriate. Below is the spreadsheet result when used with the 1,000 MPG observed values.

The curve drawn on the histogram is a normal PDF with mean 24.3211 and standard deviation 0.5963. Thus, the sample mean and standard deviation are substituted in for the population mean and standard deviation. You are not responsible for knowing how this plot was created, but you will need to be able to use the spreadsheet. There are also other summary measures displayed (box plot and dot plot) which may be discussed in future chapters.

From the results in data_summary.xls, does a normal PDF approximation for MPG seem appropriate? Explain.

Please see p. 18-19 of the book for more information about frequency distributions and histograms.

Validity of the normal PDF assumption

All of the probabilities found using the normal PDF ASSUME the normal PDF is the correct PDF for the random variable. What if this assumption is incorrect? The probabilities found using this assumption are WRONG!


Example: Grand Am (grand_am.xls)

Suppose X really has a uniform distribution with A=22.3 and B=26.3. Then P(X<23) is base × height = 0.7 × 0.25 = 0.175. With the normal assumption of μ=24.3 and σ=0.6, the probability was found to be 0.0151.

[Plot: Grand Am Normal and Uniform PDF Example; the normal PDF with mean 24.3 and s.d. 0.6 overlaid on the uniform PDF; f(x) versus x (MPG).]

How does one know when the normal PDF assumption is valid?


Rarely, if ever, will it be 100% correct.

If a sample from the population is possible, construct a histogram of the observed values and check to see if it has the shape of a normal PDF. In addition, calculate the sample mean and variance to see if they are close to the population mean and variance (if they are known). If the histogram does have a similar shape to a normal PDF and the sample and population mean and variance are about the same (if the population values are known), then the normal PDF assumption is a reasonable approximation.

Suppose a histogram was constructed and the data did not appear to come from a normal or other known PDF. What can you do?

You can still use the normal PDF with the sample mean provided the sample size is large enough. The central limit theorem is used here in order to make a normal PDF approximation. Chapter 8 talks about this in detail.

6.5: Normal Approximation to the Binomial

Skip!

Theorem 6.2: If X is a binomial random variable with mean μ = E(X) = np and variance σ² = Var(X) = np(1−p), then the limiting form of the PDF for

Z = (X − np) / √(np(1−p)),

as n→∞, is the standard normal PDF. Another way this can be worded: X can be approximated by a normal random variable with mean np and variance np(1−p).

Thus, as the number of trials increases, Z behaves more and more like a standard normal random variable.

This information will be used in Section 9.10.


6.6: Gamma and Exponential Distributions

We have already been using the Gamma and Exponential PDFs! These PDFs are often used in survival and reliability analysis. For example, these PDFs are used for modeling lifetimes of individuals or manufactured products.

Definition 6.2: The gamma function is defined by

Γ(α) = ∫₀^∞ x^(α−1) e^(−x) dx,  for α>0.

Notes:
When α is a positive integer, Γ(α) = (α−1)!; for example, Γ(3) = (3−1)! = 2! = 2·1 = 2.
Through integrating by parts, one can show Γ(α) = (α−1)Γ(α−1).
Γ(1/2) = √π.
In Maple, the gamma function is represented by the GAMMA() function, where GAMMA needs to be in capital letters. For example,

> GAMMA(3);

2

Gamma PDF: The continuous random variable X has a gamma PDF, with parameters α and β, if its PDF is given by

f(x) = x^(α−1) e^(−x/β) / (β^α Γ(α)) for x>0, and f(x)=0 elsewhere,

where α>0 and β>0.

Notes: In most realistic applications, α and β will not be known and we will need to estimate them. How to do this will be discussed in future chapters.

α controls the shape of the PDF since it mostly influences the "peakedness" of the PDF.

β controls the scale of the PDF since most of its influence is on the spread of the PDF.

In Maple, this can be programmed as

> assume(x>0);
> assume(alpha>0);
> assume(beta>0);

> about(x, alpha, beta);
Originally x, renamed x~: is assumed to be: RealRange(Open(0),infinity)
Originally alpha, renamed alpha~: is assumed to be: RealRange(Open(0),infinity)
Originally beta, renamed beta~: is assumed to be: RealRange(Open(0),infinity)

> f(x) := 1/(beta^alpha*GAMMA(alpha)) * x^(alpha-1)*exp(-x/beta);

f(x~) := x~^(α−1) e^(−x~/β) / (β^α Γ(α))

> simplify(int(f(x), x=0..infinity));

1

There are easier ways to use the gamma PDF in Maple that will be discussed later.

Below are a few comparative plots (gamma.xls). Notice the x- and y-axis scales are fixed for comparative purposes. Values of X could be greater than 24!

[Seven comparative Gamma PDF plots, each showing f(x) versus x over 0 to 24 with the y-axis fixed at 0 to 1:
α=1, β=1 (μ=1, σ²=1);  α=1, β=2 (μ=2, σ²=4);  α=1, β=3 (μ=3, σ²=9);  α=2, β=1 (μ=2, σ²=2);
α=4, β=1 (μ=4, σ²=4);  α=4, β=2 (μ=8, σ²=16);  α=2.5, β=2.5 (μ=6.25, σ²=15.625).]

Questions:
What happens if α and/or β are increased? What happens if α and/or β are decreased?
Why would someone want to use different values of α and/or β?

Theorem 6.3: The mean and variance of the gamma PDF are: E(X) = μ = αβ and Var(X) = σ² = αβ².

pf:

E(X) = ∫₀^∞ x·f(x) dx = [1/(β^α Γ(α))] ∫₀^∞ x^α e^(−x/β) dx.

Notice that x^α e^(−x/β) / (β^(α+1) Γ(α+1)) is a gamma PDF with α+1 and β as its parameters! Thus,

∫₀^∞ x^α e^(−x/β) / (β^(α+1) Γ(α+1)) dx = 1,  so  ∫₀^∞ x^α e^(−x/β) dx = β^(α+1) Γ(α+1),  and

E(X) = β^(α+1) Γ(α+1) / (β^α Γ(α)) = αβ.

A similar proof can be done for the variance.

Maple code:

> E(X) := simplify(int(x*f(x), x=0..infinity));

E(X) := α β

> Var(X) := simplify(int((x-E(X))^2*f(x), x=0..infinity));

Var(X) := α β²

Examine what happens to the PDF as the values of μ and σ² change in the gamma PDF plots on the previous pages.

Example: Distribution of lifetimes (gamma_actuary.xls)

Let X be a random variable denoting the lifetime of a person in a particular population. An actuary uses the gamma PDF for X below to model the lifetimes of all people in this population:

f(x) = x^(α−1) e^(−x/β) / (β^α Γ(α)) for x>0.

For this example, β=15 and α=2.

[Plot: Gamma PDF for the actuary example (α=2, β=15); f(x) versus x from 0 to 150.]

In Maple, the plot is

> plot(eval(f(x), [alpha=2, beta=15]), x=0..150, title="Gamma PDF, alpha=2, beta=15", labels=["x", "f(x)"]);

This particular PDF may not be realistic for what we would commonly perceive to be the distribution of lifetimes in the United States.

Questions:
What are the mean and variance?

The mean and variance are μ = αβ = 2·15 = 30 and σ² = αβ² = 2·15² = 450. Thus, one would expect to live 30 years on average in this population.

What is the probability a person in the population lives longer than 80 years?

The probability can be found from

P(X>80) = ∫₈₀^∞ (x/225) e^(−x/15) dx.

Notice that integration by parts would be needed here. If the integration is done in Maple,

> P(X>80) := int(eval(f(x), [alpha=2,beta=15]), x=80..infinity);

P(80 < X) := (19/3) e^(−16/3)

> evalf(P(X>80), 4);

.03059

Also, note that P(X>80) = 1 − P(X<80) = 1 − F(80). Thus, the CDF can be used to find the probability. The GAMMADIST(x, α, β, TRUE) function in Excel can simply be used here. Thus,

=1-GAMMADIST(80,2,15,TRUE)

results in a value of 0.0306.

Using the stats package in Maple,


> 1-stats[statevalf,cdf,gamma[2,15]](80);

.0305770166

What is the median lifetime?

The value c needs to be found such that the probability of living less than c years is 0.5. Then we could use

0.5 = ∫₀^c (x/225) e^(−x/15) dx

and solve for c. If the integration and solving are done in Maple,

> solve(int(eval(f(x), [alpha=2, beta=15]), x=0..c) = 0.5, c);

−11.52058571, 25.17520485

Of course, the positive value for c is the answer. The GAMMAINV(prob., α, β) function can be used in Excel to find c. Thus,

=GAMMAINV(0.5,2,15)

results in c = 25.18.


Using the stats package in Maple,

> stats[statevalf,icdf, gamma[2,15] ](0.5);

25.17520485
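The same two answers can be reproduced with scipy's gamma distribution, where the shape argument corresponds to α and the scale argument to β. This is a Python sketch offered as an alternative to the Excel and Maple calls above, not part of the original notes:

from scipy import stats

alpha, beta = 2, 15
lifetime = stats.gamma(a=alpha, scale=beta)

print(round(lifetime.sf(80), 4))    # P(X > 80), approximately 0.0306
print(round(lifetime.ppf(0.5), 2))  # median lifetime, approximately 25.18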

There are a few important special cases of the gamma PDF. One of them is the exponential PDF.

Exponential PDF: The continuous random variable X has an exponential PDF, with parameter β, if its PDF is given by

f(x) = (1/β) e^(−x/β) for x>0, and f(x)=0 elsewhere,

where β>0.

Notes:
This is the gamma PDF with α=1. In most realistic applications, β will not be known and it will need to be estimated. How to do this will be discussed in future chapters.

β controls the scale of the PDF since most of its influence is on the spread of the PDF. In general, the plot of the PDF starts at its highest point at x=0 and decreases steadily toward 0 as x grows.

The height of the curve at a point x₀ is f(x₀) = (1/β) e^(−x₀/β). Notice that when x₀=0, f(0) = 1/β since e⁰ = 1.

Theorem 6.3: The mean and variance for the exponential PDF are: E(X) = μ = β and Var(X) = σ² = β².

pf: See the Chapter 4 examples with tire tread wear. Substitute β in for 30. Also, see the proof used with the gamma PDF earlier.

Example: Tire life (tire_wear.xls from Chapter 3)

The number of miles an automobile tire lasts before it reaches a critical point in tread wear can be represented by a PDF. Let X = the number of miles (in thousands) an automobile is driven before it reaches the critical tread wear point for one tire. Suppose the PDF for X is

f(x) = (1/β) e^(−x/β) for x>0.

In Chapters 3 and 4, we used β=30. Remember that we found in Chapter 4 that E(X) = μ = 30 and Var(X) = σ² = 30²!

In the spreadsheet, different values of β can be entered into the cell to see how it affects the PDF. Below is a screen capture of the spreadsheet.

Note that the line on the plot should extend past x=225.

Questions:
What happens if β is increased? Explain why β has this effect, relative to it being called a "scale" parameter, E(X)=β, and Var(X)=β².
What happens if β is decreased? Explain why β has this effect, relative to it being called a "scale" parameter, E(X)=β, and Var(X)=β².
Why would someone want to use different values of β?

Find the probability that a randomly selected tire will last (will not get to the critical tread wear point) longer than 30,000 miles. In Chapter 3, we found the probability through integration:

P(X>30) = ∫₃₀^∞ (1/30) e^(−x/30) dx = e^(−1) ≈ 0.3679.

Using the relationship between the gamma and exponential PDFs, we can use the GAMMADIST() function:

=1-GAMMADIST(30,1,30,TRUE)

Notes: Remember that if FALSE is used instead of TRUE in the function, then f(x) is given as a result (the height of the curve).

Excel also has a function specifically for the exponential PDF: EXPONDIST(x, 1/beta, TRUE), which finds F(x). Please note that 1/beta corresponds to what Excel defines as λ. Thus, Excel uses a PDF of

f(x) = λ e^(−λx) for x>0.

To find P(X>30), note that P(X>30) = 1 – P(X<30) = 1 – F(30). In Excel,

=1-EXPONDIST(30,1/30,TRUE)

To avoid confusion with λ = 1/β, I recommend using the GAMMADIST() function instead.

Find the number of miles (in thousands) by which 0.95 of all tires will have reached the critical tread wear point. In Chapter 3, we found the value of c as a solution to

0.95 = ∫₀^c (1/30) e^(−x/30) dx.

The value of c was 30·ln(20) ≈ 89.87. We can use the relationship between the gamma and exponential PDFs to find the same answer with

=GAMMAINV(0.95,1,30)
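Because the exponential PDF is the gamma PDF with α=1, the same two answers also come out of scipy's exponential distribution with scale = β. A Python sketch (illustration only; scipy assumed):

from scipy import stats

tread_life = stats.expon(scale=30)      # exponential PDF with beta = 30 (thousand miles)

print(round(tread_life.sf(30), 4))      # P(X > 30) = e^(-1), approximately 0.3679
print(round(tread_life.ppf(0.95), 2))   # 95th percentile, 30*ln(20), approximately 89.87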

Example: Exponential distribution with β=10/3 (exp.xls)

[Plot: Exponential PDF with β=10/3; f(x) versus x from 0 to 18.]

To find the probability P(2<X<4), find the area underneath that part of the plot. Note that P(2<X<4) = P(X<4) − P(X<2) = F(4) − F(2) since

∫₂⁴ (3/10) e^(−3x/10) dx = ∫₀⁴ (3/10) e^(−3x/10) dx − ∫₀² (3/10) e^(−3x/10) dx.

The Excel functions to find the probability are

GAMMADIST(4,1,10/3,TRUE) – GAMMADIST(2,1,10/3,TRUE)


and the answer is 0.2476.

Final notes:
o Go back to Chapter 3 and examine example_sample_tire.xls. Notice how choosing β = 30 results in a very good fit of the PDF to the sampled values displayed in the histogram!

o If you are an engineering major, I recommend examining the Weibull PDF in Section 6.10.

o The chi-square PDF in Section 6.8 is an often used PDF which we will discuss later in the course.


7. Inference Based on a Single Sample: Estimation with Confidence Intervals

In this chapter, we’ll put all the preceding material into practice; that is, we’ll estimate population means and proportions based on a single sample selected from the population of interest.

7.1 Large-Sample Confidence Interval for a Population Mean

According to the Central Limit Theorem, the sampling distribution of the sample mean is approximately normal for large samples. Let us calculate the interval

x̄ ± 2σ_x̄ = x̄ ± 2σ/√n.

That is, we form an interval 4 standard deviations wide – from 2 standard deviations below the sample mean to 2 standard deviations above it.

Definition 7.1

An interval estimator (or confidence interval) is a formula that tells us how to use sample data to calculate an interval that estimates a population parameter.

Definition 7.2

The confidence coefficient is the probability that an interval estimator encloses the population parameter – that is, the relative frequency with which the interval estimator encloses the population parameter when the estimator is used repeatedly a very large number of times. The confidence level is the confidence coefficient expressed as a percentage.

A confidence interval provides an estimate of an unknown parameter of a population or process along with an indication of how accurate this estimate is and how confident we are that the interval is correct. Confidence intervals have two parts. One is an interval computed from our data. This interval typically has the form

estimate ±margin of error


Figure Twenty-five samples from the population gave these 95% confidence intervals. In the long run, 95% of all samples give an interval that covers μ .

Large-Sample Confidence Interval for a Population Mean

The precise formula for calculating a confidence interval for μ is:

x̄ ± z_(α/2) σ/√n

where x̄ is the sample average, σ is the standard deviation of the population measurements, n is the sample size, and z_(α/2) is a value from the standard normal table (Table IV).

Note: When σ is unknown (as is almost always the case) and n is large (say n ≥ 30), the confidence interval is approximately equal to

x̄ ± z_(α/2) s/√n

where s is the sample standard deviation.

Assumptions: None, since the Central Limit Theorem guarantees that the sampling distribution of x̄ is approximately normal.

For example, if we want a 95 percent confidence interval for μ, we use z_(α/2) = 1.96.

Why is this the correct value?

Well, the correct value of z is found by trying to capture probability .95 between two symmetric boundaries around zero on the standard normal curve. This means there is .025 in each tail, and looking up the boundary with area .475 between 0 and z gives 1.96 as the correct value of z from Table IV. Verify that a 90 percent confidence interval will use z_(α/2) = 1.645 and a 99 percent confidence interval will use 2.576.

Here are the most important entries from that part of the table:

z_(α/2)       1.645    1.96    2.576
100(1−α)%     90%      95%     99%

So there is probability C that x̄ lies between

μ − z_(α/2) σ/√n  and  μ + z_(α/2) σ/√n.
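These table entries are simply quantiles of the standard normal distribution, so they can be recovered with software as well. A short Python sketch (scipy assumed; an illustration only) for the three common confidence levels:

from scipy import stats

# z_(alpha/2) leaves area alpha/2 in the upper tail, so use the 1 - alpha/2 quantile
for level in (0.90, 0.95, 0.99):
    alpha = 1 - level
    z = stats.norm.ppf(1 - alpha / 2)
    print(level, round(z, 3))   # 1.645, 1.96, 2.576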


Figure The area between - z* and z* under the standard normal curve is C.

Interpretation of a Confidence Interval for a Population Mean

When we form a 100(1−α)% confidence interval for μ, we usually express our confidence in the interval with a statement such as: "We can be 100(1−α)% confident that μ lies between the lower and upper bounds of the confidence interval."

7.2 Small-Sample Confidence Interval for a Population Mean

In Chapter 6, we considered the (unrealistic) situation in which we knew the population standard deviation σ. In this section, we consider the more realistic case where σ is not known and we must estimate σ from our SRS by the sample standard deviation s. In Chapter 6 we used the one-sample z statistic

z = (x̄ − μ)/(σ/√n)

which has the N(0,1) distribution. Replacing σ by s, we now use the one-sample t statistic

t = (x̄ − μ)/(s/√n)

which has the t distribution with n−1 degrees of freedom. When σ is not known, we estimate it with the sample standard deviation s, and then we estimate the standard deviation of x̄ by s/√n.

Standard Error

When the standard deviation of a statistic is estimated from the data, the result is called the standard error of the statistic. The standard error of the sample mean is

SE_x̄ = s/√n.

The t Distributions

Suppose that an SRS of size n is drawn from an N(μ, σ) population. Then the one-sample t statistic

t = (x̄ − μ)/(s/√n)

has the t distribution with n−1 degrees of freedom.

Degrees of freedom

There is a different t distribution for each sample size. A particular t distribution is specified by giving the degrees of freedom. The degrees of freedom for this t statistic come from the sample standard deviation s in the denominator of t.

History of Statistics

The t distributions were discovered in 1908 by William S. Gosset. Gosset was a statistician employed by the Guinness brewing company, which required that he not publish his discoveries under his own name. He therefore wrote under the pen name "Student." The t distribution is called "Student's t" in his honor.

Figure. Density Curve for the standard normal and t(5) distributions. Both are symmetric with center 0. The t distributions have more probability in the tails than does the standard normal distribution due to the extra variability caused by substituting the random variable s for the fixed parameter σ .

We use t(k) to stand for the t distribution with k degrees of freedom.

The One-Sample t Confidence Interval

Suppose that an SRS of size n is drawn from a population having unknown mean μ. A level C confidence interval for μ is

x̄ ± t_(α/2) s/√n

where t_(α/2) is the value for the t(n−1) density curve with area C between −t_(α/2) and t_(α/2). This interval is exact when the population distribution is normal and is approximately correct for large n in other cases.

So the margin of error for the population mean, when we use the data to estimate σ, is t_(α/2) s/√n.

Example In fiscal year 1996, the U.S. Agency for International Development provided 238,300 metric tons of corn soy blend (CSB) for development programs and emergency relief in countries throughout the world. CSB is a highly nutritious, low-cost fortified food that is partially precooked and can be incorporated into different food preparations by the recipients. As part of a study to evaluate appropriate vitamin C levels in this commodity, measurements were taken on samples of CSB produced in a factory. The following data are the amounts of vitamin C, measured in milligrams per 100 grams of blend (dry basis), for a random sample of size 8 from a production run:

26, 31, 23, 22, 11, 22, 14, 31

We want to find a 95% confidence interval for μ, the mean vitamin C content of the CSB produced during this run. The sample mean is x̄ = 22.50 and the standard deviation is s = 7.19 with degrees of freedom n−1 = 7. The standard error is

SE_x̄ = s/√n = 7.19/√8 = 2.54.

From Table VI we find t*=2.365. The 95% confidence interval is

x̄ ± t_(0.05/2) s/√n = 22.50 ± 2.365 × 7.19/√8 = 22.50 ± 2.365 × (2.54) = 22.5 ± 6.0 = (16.5, 28.5).

We are 95% confident that the mean vitamin C content of the CSB for this run is between 16.5 and 28.5 mg/100 g.
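The same interval can be reproduced from the raw data. The Python sketch below (numpy and scipy assumed; an illustration, not a required part of the notes) computes x̄, s, the t critical value with 7 degrees of freedom, and the 95% confidence interval:

import numpy as np
from scipy import stats

vitamin_c = np.array([26, 31, 23, 22, 11, 22, 14, 31])   # mg per 100 g
n = len(vitamin_c)
xbar = vitamin_c.mean()                  # 22.5
s = vitamin_c.std(ddof=1)                # about 7.19
t_star = stats.t.ppf(0.975, df=n - 1)    # about 2.365
margin = t_star * s / np.sqrt(n)         # about 6.0
print(round(xbar - margin, 1), round(xbar + margin, 1))   # roughly (16.5, 28.5)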

7.3 Large-Sample Confidence Interval for a Population Proportion

Sampling Distribution of p̂

The mean of the sampling distribution of p̂ is p; that is, p̂ is an unbiased estimator of p.
The standard deviation of the sampling distribution of p̂ is √(pq/n), where q = 1 − p.
For large samples, the sampling distribution of p̂ is approximately normal. A sample size is considered large if the interval p̂ ± 3σ_p̂ does not include 0 or 1.

Large-Sample Confidence Interval for p

p̂ ± z_(α/2) σ_p̂ = p̂ ± z_(α/2) √(pq/n) ≈ p̂ ± z_(α/2) √(p̂q̂/n)

where p̂ = x/n and q̂ = 1 − p̂.

Note: When n is large, p̂ can approximate the value of p in the formula for σ_p̂.

Adjusted (1-α )100% Confidence Interval for a Population Proportion, p

p̃ ± z_(α/2) √( p̃(1−p̃)/(n+4) )

where p̃ = (x+2)/(n+4) is the adjusted sample proportion of observations with the characteristic of interest, x is the number of successes in the sample, and n is the sample size.

7.4 Determining The Sample Size


Sample Size Determination for (1-α )100% Confidence Intervals for μ

In order to estimate μ to within a bound B with (1-α )100% confidence, the required sample size is found as follows:

z_(α/2) (σ/√n) = B

The solution can be written in terms of B as follows:

n = (z_(α/2))² σ² / B²

The value of σ is usually unknown. It can be estimated by the standard deviation s from a prior sample. Alternatively, we may approximate the range R of observations in the population, and (conservatively) estimate

σ ≈ R/4.

Sample Size Determination for (1-α )100% Confidence Intervals for p


In order to estimate a binomial probability p to within a bound B with (1-α )100% confidence, the required sample size is found by solving the following equation for n:

z_(α/2) √(pq/n) = B

The solution can be written in terms of B as follows:

n = (z_(α/2))² pq / B²

Since the value of the product pq is unknown, it can be estimated by using the sample fraction of successes, p̂, from a prior sample.

Remember (Table 7.5) that the value of pq is at its maximum when p equals 0.5, so that you can obtain conservatively large values of n by approximating p by 0.5 or values close to 0.5. In any case, you should round the value of n obtained upward to ensure that the sample size will be sufficient to achieve the specified reliability.
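Both sample-size formulas are easy to program. The Python sketch below (scipy assumed) is an illustration only; the bound B and the planning values for σ and p are made-up numbers, not taken from the notes. It also shows the conservative p = 0.5 choice and the rounding up:

import math
from scipy import stats

z = stats.norm.ppf(0.975)            # 95% confidence, z_(alpha/2) = 1.96

# Sample size for estimating a mean to within B, using a planning value for sigma
B, sigma = 2.0, 10.0                 # hypothetical bound and sigma estimate
n_mean = math.ceil((z * sigma / B) ** 2)
print(n_mean)                        # 97

# Sample size for estimating a proportion to within B, conservative p = 0.5
B, p = 0.03, 0.5                     # hypothetical bound
n_prop = math.ceil(z ** 2 * p * (1 - p) / B ** 2)
print(n_prop)                        # 1068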


8. Inference Based on a Single Sample: Tests of Hypothesis


We’ll see how to utilize sample information to test what the value of a population parameter may be. This type of inference is called a test of hypothesis. We’ll also see how to conduct a test of hypothesis about a population mean and a population proportion.

8.1–8.3 Large-Sample Test of Hypothesis about a Population Mean

A test of significance consists of four steps:

1. Specify the null and alternative hypotheses.
2. Calculate the test statistic.
3. Calculate the P-value.
4. Give a complete conclusion.

Null Hypothesis

The statement being tested in a test of significance is called the null hypothesis. The test of significance is designed to assess the strength of the evidence against the null hypothesis. Usually the null hypothesis is a statement of “no effect” or “no difference”.

We abbreviate "null hypothesis" as H0 and "alternative hypothesis" as Ha. These are statements about a parameter in the population, or beliefs about the truth. The alternative hypothesis is usually what the investigator wishes to establish or prove. The null hypothesis is just the logical opposite of the alternative.

Example Suppose we work for a consumer testing group that is to evaluate a new cigarette that the manufacturer claims has low tar (average less than 5mg per cig).

From our perspective, the alternative hypothesis is Ha: μ>5, because we are only concerned if the average tar content is too high, that is, not consistent with what the tobacco company claims.

The null hypothesis is then H0 : μ≤5 , or the opposite of the alternative. These are both statements about the true average tar content of the cigarettes, a parameter.

Three possible cases for hypotheses:

Case 1: H0: μ=μ0 vs. Ha: μ≠μ0
Case 2: H0: μ≤μ0 vs. Ha: μ>μ0
Case 3: H0: μ≥μ0 vs. Ha: μ<μ0

The symbol μ0 stands for the value of μ that is assumed under the null hypothesis.

Test statistics

In the second step we summarize the experimental evidence into a summary statistic.

From the example, suppose there were n=36 cigarettes tested and they had x̄ = 5.5 mg and σ = 1.2. We summarize this information with a z-statistic. The test statistic for this problem is:

Z = (x̄ − μ0)/(σ/√n) = (5.5 − 5)/(1.2/√36) = 2.5

What does the z-value mean?

Well, it is usually easier to discuss such things in terms of probabilities. The test statistic is used to compute a P-value, which is the probability of getting a test statistic at least as extreme as the z-value observed, where the probability is computed when the null hypothesis is true. This is what the third step in the process is about.

P-values


The probability, computed assuming that H0 is true, that the test statistic would take a value as extreme or more extreme than that actually observed is called the P-value of the test. The smaller the P-value, the stronger the evidence against H0 provided by the data.

Put another way, the P-value is the tail area associated with the calculated test statistic value in the distribution we know it has if the null hypothesis is true. From both of these statements you can see that the P-value is a probability.

From our tobacco example the P-value is the probability of observing a value of z more extreme than 2.5. What does more extreme mean here? It is specified by the direction of the alternative hypothesis, in our problem it is greater than. This means that the P-value we want is

P(Z ≥ 2.5) = 1 − P(Z < 2.5) = 1 − 0.9938 = 0.0062.
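The whole calculation is a couple of lines in software. A Python sketch for the cigarette example (scipy assumed; an illustration only):

import math
from scipy import stats

xbar, mu0, sigma, n = 5.5, 5, 1.2, 36
z = (xbar - mu0) / (sigma / math.sqrt(n))   # test statistic, 2.5
p_value = stats.norm.sf(z)                  # upper-tail area, about 0.0062
print(round(z, 2), round(p_value, 4))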

Now it is time for the fourth step in a test: the conclusion. We can compare the P-value we calculated with a fixed value that we regard as decisive. The decisive value of P is called the significance level. It is denoted by α , the Greek letter alpha.

α = P(Type I error) = P(rejecting the null hypothesis when in fact the null hypothesis is true)

Statistical Significance

If the P-value is as small as or smaller than α, we say that the data are statistically significant at level α, or we say that we "reject the null hypothesis (H0) at level α."

If we choose α = 0.05, then from our tobacco example the P-value is 0.0062.

Since the P-value = 0.0062 is less than α = 0.05, we reject the null hypothesis (H0: μ≤5) at level α = 0.05.

Note

We usually choose α = 0.05 or α = 0.01. But if we choose α = 0.01, then we are insisting on stronger evidence against H0 compared to the case of α = 0.05. In this course, I will ask for statistical significance at α = 0.05.

A test of significance is a recipe for assessing the significance of the evidence provided by data against a null hypothesis. The four steps common to all tests of significance are as follows:

1. State the null hypothesis H0 and the alternative hypothesis Ha. The test is designed to assess the strength of the evidence against H0; Ha is the statement that we will accept if the evidence enables us to reject H0.

2. Calculate the value of the test statistic on which the test will be based. This statistic usually measures how far the data are from H0.

3. Find the P-value for the observed data. This is the probability, calculated assuming that H0 is true, that the test statistic will weigh against H0 at least as strongly as it does for these data.

4. State a conclusion. One way to do this is to choose a significance level α, which is how much evidence against H0 you regard as decisive. If the P-value is less than or equal to α, you conclude that there is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis.

Here is the conclusion for our example problem.

We have evidence for the alternative hypothesis that μ, the average tar content, is actually above 5 mg per cigarette. This contradicts the company's claim that this is a low-tar cigarette. Let's call the lawyers and start the complaint process with the tobacco industry.

Z Test for a Population Mean

To test the hypothesis H0: μ=μ0 based on an SRS of size n from a population with unknown mean μ and known standard deviation σ, compute the test statistic

Z = (x̄ − μ0)/(σ/√n)

In terms of a standard normal random variable Z, the P-value for a test of H0 against

Ha: μ>μ0 is P(Z ≥ z)
Ha: μ<μ0 is P(Z ≤ z)
Ha: μ≠μ0 is 2P(Z ≥ |z|)

These P-values are exact if the population distribution is normal.

8.4 Small-Sample Test of Hypothesis about a Population Mean

The One-Sample t Test

Suppose that an SRS of size n is drawn from a population having unknown mean μ. To test the hypothesis H0: μ=μ0, compute the one-sample t statistic

t = (x̄ − μ0)/(s/√n)

In terms of a random variable T having the t(n−1) distribution, the P-value for a test of H0 against

Ha: μ>μ0 is P(T ≥ t)
Ha: μ<μ0 is P(T ≤ t)
Ha: μ≠μ0 is 2P(T ≥ |t|)

These P-values are exact if the population distribution is normal and are approximately correct for large n in other cases.

Example The specifications for the CSB described in the previous example state that the mixture should contain 2 pounds of vitamin premix for every 2000 pounds of product. These specifications are designed to produce a mean (μ) vitamin C content in the final product of 40 mg/100 g. We can test a null hypothesis that the mean vitamin C content of the production run conforms to these specifications. Specifically, we test

H0: μ=40
Ha: μ≠40

Recall that n=8, x̄=22.50, and s=7.19. The t test statistic is

t = (x̄ − μ0)/(s/√n) = (22.5 − 40)/(7.19/√8) = −6.88

Because the degrees of freedom are n-1=7, this t statistic has the t(7) distribution.

Figure. The P-value for this example.

The figure shows that the P-value is 2P(T ≥ 6.88), where T has the t(7) distribution. From the table, we see that P(T ≥ 5.408) = 0.0005. Therefore, we conclude that the P-value is less than 2 × 0.0005 = 0.001. Since the P-value is smaller than α = 0.05, we reject H0 and conclude that the mean vitamin C content for this run does not conform to the specification of 40 mg/100 g; in fact, it falls below it.
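A Python sketch of the same two-sided t test, using the raw vitamin C data from the earlier example (numpy and scipy assumed; scipy's ttest_1samp returns the two-sided P-value directly):

import numpy as np
from scipy import stats

vitamin_c = np.array([26, 31, 23, 22, 11, 22, 14, 31])      # mg per 100 g
t_stat, p_value = stats.ttest_1samp(vitamin_c, popmean=40)  # two-sided test of mu = 40
print(round(t_stat, 2), p_value)   # t is about -6.88; the P-value is well below 0.001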


Example For the vitamin C problem described in the previous example, we want to test whether or not vitamin C is lost or destroyed by the production process. Here we test

H0: μ=40
Ha: μ<40

The t test statistic does not change: t = −6.88.

Figure. The P-value for this example.

As the figure illustrates, however, the P-value is now P(T ≤ −6.88). From the table, we can determine that P ≤ 0.0005. We conclude that the production process has lost or destroyed some of the vitamin C.

8.5 Large-Sample Test of Hypothesis about a Population Proportion

In this section we consider inference about a population proportion p from an SRS of size n, based on the sample proportion p̂ = X/n, where X is the number of successes in the sample.

Large-Sample Significance test for a Population Proportion

Draw an SRS of size n from a large population with unknown proportion p of successes.

To test the hypothesis H0: p=p0, compute the z-statistic

z = (p̂ − p0) / √( p0(1−p0)/n )

In terms of a standard normal random variable Z, the appropriate P-value for a test of H0 against

Ha: p>p0 is P(Z ≥ z)
Ha: p<p0 is P(Z ≤ z)
Ha: p≠p0 is 2P(Z ≥ |z|)

Example The French naturalist Count Buffon once tossed a coin 4040 times and obtained 2048 heads. This is a binomial experiment with n=4040. The sample proportion is

p̂ = 2048/4040 = 0.5069

If Buffon’s coin was balanced, then the probability of obtaining heads on any toss is 0.5. To assess whether the data provide evidence that the coin was not balanced, we test

H0: p=0.5
Ha: p≠0.5

The test statistic is

z = (p̂ − 0.5) / √(0.5(1−0.5)/4040) = (0.5069 − 0.5) / √(0.5(1−0.5)/4040) = 0.88

The figure illustrates the calculation of the P-value. From Table IV we find P(Z ≤ 0.88) = 0.8106.

The probability in the upper tail is 1 − 0.8106 = 0.1894, and the two-sided P-value is P = 2 × 0.1894 = 0.38. Since the P-value is larger than α = 0.05, we do not reject H0: p=0.5 at the level α = 0.05.
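A Python sketch of the same calculation (scipy assumed; an illustration only):

import math
from scipy import stats

x, n, p0 = 2048, 4040, 0.5
p_hat = x / n                                      # 0.5069
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)    # about 0.88
p_value = 2 * stats.norm.sf(abs(z))                # two-sided P-value, about 0.38
print(round(z, 2), round(p_value, 2))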

Figure The P-value for Example 8.2.


Example A coin was tossed n=4040 times and we observed X=1992 tails. We want to test the null hypothesis that the coin is fair- that is, that the probability of tails is 0.5. So p is the probability that the coin comes up tails and we test

H0: p=0.5
Ha: p≠0.5

The test statistic is

z = (p̂ − 0.5) / √(0.5(1−0.5)/4040) = (0.4931 − 0.5) / √(0.5(1−0.5)/4040) = −0.88

Using Table IV, we find that P = 2 × 0.1894 = 0.38.

Since the P-value is larger than α = 0.05, we do not reject H0: p=0.5 at the level α = 0.05.

9. Inferences Based on Two Samples

Now that we’ve learned to make inferences about a single population, we’ll learn how to compare two populations.

For example, we may wish to compare the mean gas mileages for two models of automobiles, or the mean reaction times of men and women to a visual stimulus.

In this chapter we’ll see how to decide whether differences exist and how to estimate the differences between population means and proportions.

9.1 Comparing two population means: Independent Sampling

One of the most commonly used significance tests is the comparison of two population means μ1 and μ2 .


Two-sample Problems

The goal of inference is to compare the responses in two groups. Each group is considered to be a sample from a distinct population. The responses in each group are independent of those in the other group.

A two sample problem can arise from a randomized comparative experiment that randomly divides the subjects into two groups and exposes each group to a different treatment. The two samples may be of different sizes.

Two-Sample z Statistic

Suppose that x̄1 is the mean of an SRS of size n1 drawn from an N(μ1, σ1) population and that x̄2 is the mean of an SRS of size n2 drawn from an N(μ2, σ2) population. Then the two-sample z statistic

z = [ (x̄1 − x̄2) − (μ1 − μ2) ] / √( σ1²/n1 + σ2²/n2 )

has the standard normal N(0,1) sampling distribution.

Large-Sample Confidence Interval for μ1−μ2

(x̄1 − x̄2) ± z_(α/2) √( σ1²/n1 + σ2²/n2 )

Assumptions: The two samples are randomly selected in an independent manner from the two populations. The sample sizes n1 and n2 are large enough.

Example for C.I. of μ1−μ2

Example for Test of Significance

In the unlikely event that both population standard deviations are known, the two-sample z statistic is the basis for inference about μ1−μ2. Exact z procedures are seldom used because σ1 and σ2 are rarely known.

The two-sample t procedures

Suppose that the population standard deviations σ1 and σ2 are not known. We estimate them by the sample standard deviations s1 and s2 from our two samples.

The Pooled Two-Sample t Procedures

The pooled two-sample t procedures are used when we can safely assume that the two populations have equal variances. The modification in the procedure is the use of the pooled estimator of the common unknown variance,

s_p² = [ (n1−1)s1² + (n2−1)s2² ] / (n1 + n2 − 2).

This is called the pooled estimator of σ².

When both populations have variance σ², the addition rule for variances says that x̄1 − x̄2 has variance equal to the sum of the individual variances, which is

σ²/n1 + σ²/n2 = σ² (1/n1 + 1/n2)

The standardized difference of means in this equal-variance case is

z = [ (x̄1 − x̄2) − (μ1 − μ2) ] / [ σ √(1/n1 + 1/n2) ]

This is a special two-sample z statistic for the case in which the populations have the same σ. Replacing the unknown σ by the estimate s_p gives a t statistic. The degrees of freedom are n1 + n2 − 2.

The Pooled Two-Sample t Procedures

Suppose that an SRS of size n1 is drawn from a normal population with unknown mean μ1 and that an independent SRS of size n2 is drawn from another normal population with unknown mean μ2. Suppose also that the two populations have the same standard deviation. A level C confidence interval for μ1−μ2 is given by

(x̄1 − x̄2) ± t* s_p √(1/n1 + 1/n2)

Here t* is the value for the t(n1+n2−2) density curve with area C between −t* and t*.

To test the hypothesis H0: μ1=μ2, compute the pooled two-sample t statistic

t = (x̄1 − x̄2) / [ s_p √(1/n1 + 1/n2) ]

In terms of a random variable T having the t(n1+n2−2) distribution, the P-value for a test of H0 against

Ha: μ1>μ2 is P(T ≥ t)
Ha: μ1<μ2 is P(T ≤ t)
Ha: μ1≠μ2 is 2P(T ≥ |t|)

Example Take Group 1 to be the calcium group and Group 2 to be the placebo group. The evidence that calcium lowers blood pressure more than a placebo is assessed by testing

H0 : μ1≤μ2

Ha : μ1>μ2

Here are the summary statistics for the decrease in blood pressure:

Group Treatment  n x   s  1 Calcium 10 5.000 8.743

Page 201: Report Engineering Probability and Statistics Ahmedawad

2 Placebo 11 -0.273 5.901

The calcium group shows a drop in blood pressure, and the placebo group has a small increase. The sample standard deviations do not rule out equal population standard deviations. A difference this large will often arise by chance in samples this small. We are willing to assume equal population standard deviations. The pooled sample variance is

s_p² = [ (n1−1)s1² + (n2−1)s2² ] / (n1 + n2 − 2)
     = [ (10−1)(8.743)² + (11−1)(5.901)² ] / (10 + 11 − 2)
     = 54.536,

so that s_p = √54.536 = 7.385.

The pooled two-sample t statistic is

t = (x̄1 − x̄2) / [ s_p √(1/n1 + 1/n2) ]
  = [ 5.000 − (−0.273) ] / [ 7.385 √(1/10 + 1/11) ]
  = 5.273/3.227
  = 1.634

The P-value is P(T ≥ 1.634), where T has the t(19) distribution. From the table, we can see that P lies between 0.05 and 0.10. The experiment found no evidence that calcium reduces blood pressure (t=1.634, df=19, 0.05<P<0.10).

Example We estimate the effect of calcium supplementation by the difference between the sample means of the calcium and the placebo groups, x̄1 − x̄2 = 5.273 mm. A 90% confidence interval for μ1 − μ2 uses the critical value t* = 1.729 from the t(19) distribution. The interval is

(x̄1 − x̄2) ± t* sp √(1/n1 + 1/n2) = (5.000 − (−0.273)) ± (1.729)(7.385) √(1/10 + 1/11) = 5.273 ± 5.579


We are 90% confident that the difference in means is in the interval (-0.306, 10.852). The calcium treatment reduced blood pressure by about 5.3mm more than a placebo on the average, but the margin of error for this estimate is 5.6mm.
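The pooled test and confidence interval above can be reproduced from the summary statistics alone. The following is a minimal Python sketch (not part of the original report), assuming scipy is available; it should return sp ≈ 7.385, t ≈ 1.634, a one-sided P-value between 0.05 and 0.10, and roughly the interval (−0.31, 10.85).

```python
# Pooled two-sample t test and 90% CI from summary statistics (calcium example).
from math import sqrt
from scipy import stats

n1, xbar1, s1 = 10, 5.000, 8.743     # calcium group
n2, xbar2, s2 = 11, -0.273, 5.901    # placebo group

sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # pooled variance
sp = sqrt(sp2)
se = sp * sqrt(1 / n1 + 1 / n2)
t = (xbar1 - xbar2) / se                                      # about 1.634
df = n1 + n2 - 2
p_one_sided = stats.t.sf(t, df)                               # P(T >= t)

tstar = stats.t.ppf(0.95, df)                                 # 90% CI: 0.05 in each tail
ci = ((xbar1 - xbar2) - tstar * se, (xbar1 - xbar2) + tstar * se)
print(round(sp, 3), round(t, 3), round(p_one_sided, 4), ci)
```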

Approximate Small-Sample Procedures when the two populations have different variances (σ1² ≠ σ2²)

Suppose that the population standard

deviations σ 1 and σ 2 are not known. We estimate them by the sample standard

deviations s1 and s2 from our two samples.

Equal Sample Sizes (n1=n2=n )

The confidence interval for μ1 − μ2 is given by

(x̄1 − x̄2) ± tα/2 √((s1² + s2²)/n)

To test the hypothesis H0: μ1 = μ2, compute the two-sample t statistic

t = (x̄1 − x̄2) / √((s1² + s2²)/n)

where t is based on df v = n1 + n2 − 2 = 2(n − 1).

Unequal Sample Sizes (n1≠n2)

The confidence interval for μ1 − μ2 is given by

(x̄1 − x̄2) ± tα/2 √(s1²/n1 + s2²/n2)

To test the hypothesis H0: μ1 = μ2, compute the two-sample t statistic

t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)

where t is based on degrees of freedom

v = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ].

Note: The value of v will generally not be an integer. Round v down to the nearest integer to use the t table.
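The approximate degrees of freedom v is just arithmetic on the two sample variances. Below is a hedged Python sketch (not from the report); the example call uses the DRP summary statistics that appear in the example further below and should give a non-integer v of roughly 37.9, which would be rounded down for the t table.

```python
# Welch-Satterthwaite approximate degrees of freedom for unequal variances.
def welch_df(s1, n1, s2, n2):
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Illustration with the DRP example's summary statistics (s1, n1, s2, n2):
print(welch_df(11.01, 21, 17.15, 23))
```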

The Two-Sample t Significance test

Suppose that an SRS of size n1 is drawn from a

normal population with unknown mean μ1 and

that an independent SRS of size n2 is drawn from another normal population with unknown

mean μ2 . To test the hypothesis H0 : μ1=μ2 , compute the two-sample t statistic


t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)

and use P-values or critical values for the t(k) distribution, where the degrees of freedom k are the smaller of n1 − 1 and n2 − 1.


Example An educator believes that new directed reading activities in the classroom will help elementary school pupils improve some aspects of their reading ability. She arranges for a third-grade class of 21 students to take part in these activities for an eight-week period. A control classroom of 23 third-graders follows the same curriculum without the activities. At the end of the eight weeks, all students are given a Degree of Reading Power (DRP) test, which measures the aspects of reading ability that the treatment is designed to improve. The summary statistics using Excel are


                     Treatment Group   Control Group
Mean                 51.47619048       41.52173913
Standard Error       2.402002188       3.575758061
Median               53                42
Mode                 43                42
Standard Deviation   11.00735685       17.14873323
Sample Variance      121.1619048       294.0790514
Kurtosis             0.803583546       0.614269919
Skewness             -0.626692173      0.309280608
Range                47                75
Minimum              24                10
Maximum              71                85
Sum                  1081              955
Count                21                23

Because we hope to show that the treatment (Group 1) is better than the control (Group 2), the hypotheses are

H0 : μ1=μ2 vs. Ha : μ1>μ2

The two-sample t statistic is


t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2) = (51.48 − 41.52) / √(11.01²/21 + 17.15²/23) = 2.31

The P-value for the one-sided test is P(T≥2 .31 ). The degree of freedom k is equal

to the smaller of n1−1=21−1=20 and n2−1=23−1=22 . Comparing 2.31 with entries in Table for 20 degrees of freedom, we see that P lies between 0.02 and 0.01. The data strongly suggest that directed reading activity improves the DRP score (t=2.31, df=20, 0.01<P<0.02).

Example We will find a 95% confidence interval for the mean improvement in the entire population of third-graders. The interval is

(x̄1 − x̄2) ± t* √(s1²/n1 + s2²/n2) = (51.48 − 41.52) ± t* √(11.01²/21 + 17.15²/23) = 9.96 ± 4.31 × t*


From the previous example, we use the t(20) distribution. Table D gives t* = t0.025,20 = 2.086. With this approximation we have

9.96 ± 4.31 × t* = 9.96 ± 4.31 × 2.086 = 9.96 ± 8.99 = (1.0, 18.9)

Zero lies outside the interval (1.0, 18.9), so we can say that μ1 − μ2 is not equal to zero.

9.2 Comparing two population means: Paired Difference Experiments

Matched Pairs t procedures

One application of the one-sample t procedure is to the analysis of data from matched pairs studies. We compute the differences between the two values of a matched pair (often before and after measurements on the same unit) to produce a single sample value. The sample mean and standard deviation of these differences are computed.


Paired Difference Confidence Interval for μD=μ1−μ2

Large Sample

x̄D ± zα/2 σD/√nD ≈ x̄D ± zα/2 sD/√nD

Assumption: The sample differences are randomly selected from the population of differences.

Small Sample

x̄D ± tα/2 sD/√nD

where tα/2 is based on (nD − 1) degrees of freedom.

Assumptions:
1. The relative frequency distribution of the population of differences is normal.
2. The sample differences are randomly selected from the population of differences.

Paired Difference Test of Hypothesis for μD=μ1−μ2

One-Tailed Test

H0 : μD=D0 H0 : μD=D0

Ha : μD<D0 or Ha : μD>D0

Two-Tailed Test

H0 : μD=D0 Ha : μD≠D0

Large Sample

Test statistic

z = (x̄D − D0) / (σD/√nD) ≈ (x̄D − D0) / (sD/√nD)


Assumption: The sample differences are randomly selected from the population of differences.

Small Sample

Test statistic

t = (x̄D − D0) / (sD/√nD)

where t is based on (nD − 1) degrees of freedom.

Assumptions:
1. The relative frequency distribution of the population of differences is normal.
2. The sample differences are randomly selected from the population of differences.


Example To analyze these data, we first subtract the pretest score from the posttest score to obtain the improvement for each student. These 20 differences form a single sample. They appear in the “Gain” columns in Table 7.1. The first teacher, for example, improved from 32 to 34, so the gain is 34 − 32 = 2. To assess whether the institute significantly improved the teachers’ comprehension of spoken French, we test

H0: μD = 0
Ha: μD > 0

Here μD is the mean improvement that would be achieved if the entire population of French teachers attended a summer institute. The null


hypothesis says that no improvement occurs,

and Ha says that posttest scores are higher on the average. The 20 differences have x̄D = 2.5 and sD = 2.893.

The one-sample t statistic is

t = (x̄D − D0) / (sD/√nD) = (2.5 − 0) / (2.893/√20) = 3.86

The P-value is found from the t(19) distribution (n-1=20-1=19). Table shows that 3.86 lies between the upper 0.001 and 0.0005 critical values of the t(19) distribution. The P-value lies between 0.0005 and 0.001. “The improvement in score was significant (t=3.86, df=19, p=0.00053).”

Example A 90% confidence interval for the mean improvement in the entire population requires the critical value tα/2 = 1.729 from the t table. The confidence interval is

x̄D ± tα/2 sD/√nD = 2.5 ± 1.729 × 2.893/√20 = 2.5 ± 1.12 = (1.38, 3.62)


The estimated average improvement is 2.5 points, with margin of error 1.12 for 90% confidence. Though statistically significant, the effect of the institute was rather small.
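The paired analysis above is a one-sample t calculation on the differences. Here is a minimal Python sketch (not part of the original report) using the stated summaries (n = 20, mean gain 2.5, sD = 2.893); with the raw gains available, scipy's ttest_rel or ttest_1samp would give the same t statistic.

```python
# Paired-difference t test and 90% CI from the summary of the 20 gain scores.
from math import sqrt
from scipy import stats

n, dbar, sd = 20, 2.5, 2.893
se = sd / sqrt(n)
t = (dbar - 0) / se                      # about 3.86
p = stats.t.sf(t, n - 1)                 # one-sided P-value, about 0.0005
tstar = stats.t.ppf(0.95, n - 1)         # 1.729 for a 90% CI
ci = (round(dbar - tstar * se, 2), round(dbar + tstar * se, 2))
print(round(t, 2), round(p, 5), ci)
```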

9.3 Comparing two population proportions: Independent Sampling

Suppose a presidential candidate wants to compare the preference of registered voters in the northeastern United States (NE) to those in the southeastern United States (SE). Such a comparison would help determine where to concentrate campaign efforts.

Properties of the Sampling Distribution of (p̂1 − p̂2)

1. The mean of the sampling distribution of (p̂1 − p̂2) is (p1 − p2); that is,

E(p̂1 − p̂2) = (p1 − p2)


which means that (p̂1 − p̂2) is an unbiased estimator of (p1 − p2).

2. The standard deviation of the sampling distribution of (p̂1 − p̂2) is

σ(p̂1−p̂2) = √( p1(1−p1)/n1 + p2(1−p2)/n2 )

3. If the sample sizes n1 and n2 are large, the sampling distribution of (p̂1 − p̂2) is approximately normal.

Large-Sample Confidence Interval for (p1 − p2)

(p̂1 − p̂2) ± zα/2 √( p1(1−p1)/n1 + p2(1−p2)/n2 ) ≈ (p̂1 − p̂2) ± zα/2 √( p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2 )

Assumption: The two samples are independent random samples. Both samples should be large


enough that the normal distribution provides an adequate approximation to the sampling

distributions of p̂1 and p̂2.

Large-Sample Test of Hypothesis about ( p1−p2 )

One-Tailed Test

H0 :( p1−p2)=0 H0 :( p1−p2)=0Ha :( p1−p2 )<0 or Ha :( p1−p2 )>0

Two-Tailed Test

H0 :( p1−p2)=0 Ha :( p1−p2 )≠0

Large Sample


Test statistic

z = (p̂1 − p̂2) / σ(p̂1−p̂2)

Note:

σ(p̂1−p̂2) = √( p1(1−p1)/n1 + p2(1−p2)/n2 ) ≈ √( p̂(1−p̂)(1/n1 + 1/n2) )

where p̂ = (x1 + x2) / (n1 + n2) is the pooled sample proportion.

Assumption: Same as for large-sample

confidence interval for ( p1−p2 ).
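The pooled test statistic is easy to compute directly. The sketch below (not part of the original report) uses hypothetical voter counts purely for illustration; only the formula, not the data, comes from the text.

```python
# Large-sample z test for p1 - p2 using the pooled proportion.
from math import sqrt
from scipy import stats

x1, n1 = 546, 1000    # hypothetical NE counts favoring the candidate
x2, n2 = 475, 1000    # hypothetical SE counts favoring the candidate

p1_hat, p2_hat = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                      # pooled estimate under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se
p_two_sided = 2 * stats.norm.sf(abs(z))
print(round(z, 2), round(p_two_sided, 4))
```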


9.4 Determining the Sample Size

Determination of Sample Size for Estimating ( μ1−μ2 )

To estimate ( μ1−μ2 ) to within a given bound

B with probability (1−α ) , use the following


formula to solve for equal sample sizes that will achieve the desired reliability:

n1 = n2 = (zα/2)² (σ1² + σ2²) / B²

You will need to substitute estimates for the values of σ1² and σ2² before solving for the sample size. These estimates might be sample variances s1² and s2² from prior sampling, or educated guesses based on the range, that is, s ≈ R/4.

Determination of Sample Size for Estimating ( p1−p2 )

To estimate ( p1−p2 ) to within a given bound

B with probability (1−α ) , use the following formula to solve for equal sample sizes that will achieve the desired reliability:


n1 = n2 = (zα/2)² (p1 q1 + p2 q2) / B²   (where qi = 1 − pi)

You will need to substitute estimates for the

values of p1 and p2 before solving for the sample size. These estimates might be based on prior samples, obtained from educated guesses or, most conservatively, specified as p1=p2=0 .5 .
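Both sample-size formulas can be wrapped in small helper functions. The sketch below is not from the report; the bound B and the variance and proportion guesses passed in are placeholder values, and the results are rounded up to whole units.

```python
# Equal sample sizes for estimating mu1 - mu2 and p1 - p2 within bound B.
import math
from scipy import stats

def n_for_means(sigma1_sq, sigma2_sq, B, alpha=0.05):
    z = stats.norm.ppf(1 - alpha / 2)
    return math.ceil(z**2 * (sigma1_sq + sigma2_sq) / B**2)

def n_for_proportions(p1, p2, B, alpha=0.05):
    z = stats.norm.ppf(1 - alpha / 2)
    return math.ceil(z**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / B**2)

print(n_for_means(sigma1_sq=10, sigma2_sq=12, B=1))      # n1 = n2
print(n_for_proportions(p1=0.5, p2=0.5, B=0.03))         # conservative p1 = p2 = 0.5
```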

10. Analysis of Variance

In this chapter we extend the methodology of chapters 7-9 in two important ways. First, we discuss the critical elements in the design of a sampling experiment. Then we see how to analyze the experiment in order to compare more than two populations. We’ll look at several of the more popular experimental designs.


Sampling selects a part of the population of interest to represent the whole.

Done properly, sampling can yield reliable information about a population.

Two basic types of studies are observational studies and designed experiments. An observational study observes individuals and measures variables of interest but does not attempt to influence the responses. A sample survey is a type of observational study. A designed experiment deliberately imposes some treatment or conditions on individuals to observe their responses. A drug study where some patients get drug A and others get drug B is an example of an experiment. The investigator decides which subjects get which drug. When we wish to assess cause-and-effect relationships, an experiment is the only true way to evaluate the effects of the experimental conditions. Observational studies can shed light, but they tend not to be as convincing as a well-designed experiment.

10.1 Elements of a Designed Experiment

Definition 10.1

The response variable is the variable of interest to be measured in the experiment. We also refer to the response as the dependent variable.

The response might be the SAT scores of a high school senior, the total sales of a firm last year, or the total income of a particular household this year.

Experimental Units, Subjects, Treatment


The individuals on which the experiment is done are the experimental units. When the units are human beings, they are called subjects. A specific experimental condition applied to the units is called a treatment.

Example Does regularly taking aspirin help protect people against heart attack?

The Physicians’ Health Study looked at the effects of two drugs: aspirin and beta-carotene.

The subjects were 21,996 male physicians. There were two factors, each having two levels: aspirin (yes or no) and beta-carotene (yes or no). Combinations of the levels of these factors form the four treatments shown in Figure. One-fourth of the subjects were assigned to each of these treatments. So there are four treatments in the experiment.


Figure The treatments in the Physicians’ Health Study.

The result shows that 239 of the placebo group but only 139 of the aspirin group had suffered heart attack.

Definition 10.2

Factors are those variables whose effect on the response is of interest to the experimenter.

Definition 10.3

Factor levels are the values of the factor utilized in the experiment.


Definition 10.4

The treatments of an experiment are the factor-level combination utilized.

10.2 The Completely Randomized Design

This design assigns experimental units to treatments with random assignment.

Figure Outline of a completely randomized design comparing three treatments.

General ANOVA Summary Table for a Completely Randomized Design

Source       df     SS          MS                F
Treatments   p−1    SST         MST = SST/(p−1)   MST/MSE
Error        n−p    SSE         MSE = SSE/(n−p)
Total        n−1    SS(Total)

The variation between the treatment means is measured by the Sum of Squares for Treatments (SST).

SST = Σ (for i = 1 to 3) ni (x̄i − x̄)²

where x̄ is the overall mean and x̄i is the i-th treatment mean (written here for three treatments).

The sampling variability within the treatments is measured by the Sum of Squares for Error (SSE).

SSE = (n1 − 1)s1² + (n2 − 1)s2² + (n3 − 1)s3²

where s1², s2², and s3² are the sample variances associated with the three treatments.

Test to Compare p Treatment Means for a Completely Randomized Design

H0 : μ1=μ2=⋯=μ p

Ha: At least two treatment means differ

Test statistic

F = MST/MSE

Assumptions:

1. Samples are selected randomly and independently from the respective populations.

2. All p population probability distributions are normal.

3. The p population variances are equal.

Rejection region: F> Fα , where Fα is based

on v1=( p−1) numerator degrees of freedom



(associated with MST) and v2=(n−p ) denominator d.f. (associated with MSE). Also,

Reject H0 if p-value is less than α .

One-Way ANOVA partitions the total variation into:
- Among-groups variation: Sum of Squares Among (also called Sum of Squares Between, or Sum of Squares Treatment)
- Within-groups variation: Sum of Squares Within (also called Sum of Squares Error)


Total Variation (figure: responses X in Groups 1-3 plotted about the overall mean X̄)

SS(Total) = (X11 − X̄)² + (X21 − X̄)² + … + (Xij − X̄)²


Treatment Variation (figure: group means X̄1, X̄2, X̄3 plotted about the overall mean X̄)

SST = n1(X̄1 − X̄)² + n2(X̄2 − X̄)² + … + np(X̄p − X̄)²


Random (Error) Variation (figure: responses plotted about their own group means X̄1, X̄2, X̄3)

SSE = (X11 − X̄1)² + (X21 − X̄1)² + … + (Xpj − X̄p)²


One-Way ANOVA F-Test Statistic

1. Test Statistic: F = MST / MSE, where MST is the Mean Square for Treatment and MSE is the Mean Square for Error
2. Degrees of Freedom: v1 = p − 1 and v2 = n − p, where p = number of populations, groups, or levels and n = total sample size


(Figure: the F(p − 1, n − p) distribution; reject H0 when F exceeds the critical value Fα, do not reject H0 otherwise.)

One-Way ANOVA F-Test Example

As production manager, you want to see if 3 filling machines have different mean filling times. You assign 15 similarly trained & experienced workers, 5 per machine, to the machines. At the .05 level, is there a difference in mean filling times?

Mach1   Mach2   Mach3
25.40   23.40   20.00
26.31   21.80   22.20
24.10   23.50   19.75
23.74   22.75   20.60
25.10   21.60   20.40


H0: μ1 = μ2 = μ3
Ha: Not all means are equal
α = .05, v1 = 2, v2 = 12
Critical value: F.05 = 3.89

Test statistic: F = MST/MSE = 23.5820/.9211 = 25.6

Decision: Reject H0 at α = .05. There is evidence the population means are different.

Summary Table

Source of Variation      Degrees of Freedom   Sum of Squares   Mean Square (Variance)   F
Treatment (Machines)     3 − 1 = 2            47.1640          23.5820                  25.60
Error                    15 − 3 = 12          11.0532          .9211
Total                    15 − 1 = 14          58.2172
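The raw filling-time data are given above, so the F test can be checked directly. The following Python sketch (not part of the original report) uses scipy's one-way ANOVA function and should reproduce F ≈ 25.6, to be compared with F.05(2, 12) = 3.89.

```python
# One-way ANOVA F test for the three filling machines.
from scipy import stats

mach1 = [25.40, 26.31, 24.10, 23.74, 25.10]
mach2 = [23.40, 21.80, 23.50, 22.75, 21.60]
mach3 = [20.00, 22.20, 19.75, 20.60, 20.40]

F, p = stats.f_oneway(mach1, mach2, mach3)
print(round(F, 2), p)
```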

10.3 Multiple Comparisons of Means


Consider a completely randomized design with three treatments, A, B, and C. Suppose we determine that the treatment means are statistically different via the ANOVA F-test of Section 10.2. To complete the analysis, we want to rank the three treatment means.

In the three-treatment experiment, for example, we would construct confidence intervals for the

following differences: μA−μB , μA−μC , and μB−μC . In general, if there are p treatment means, there are

c=p ( p−1)/2

pairs of means that can be compared.

Tukey Procedure



1. Tells which population means are significantly different (example: μ1 = μ2 ≠ μ3)
2. Post hoc procedure, done after rejection of equal means in ANOVA
3. Output from many statistical computer programs
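As a concrete illustration, the sketch below (not part of the original report) applies a Tukey HSD comparison to the filling-machine data from the previous example; it assumes the statsmodels package is available. With p = 3 treatments there are c = 3(3 − 1)/2 = 3 pairwise comparisons.

```python
# Tukey multiple comparisons of the three machine means.
from statsmodels.stats.multicomp import pairwise_tukeyhsd

times = [25.40, 26.31, 24.10, 23.74, 25.10,
         23.40, 21.80, 23.50, 22.75, 21.60,
         20.00, 22.20, 19.75, 20.60, 20.40]
groups = ['Mach1'] * 5 + ['Mach2'] * 5 + ['Mach3'] * 5

print(pairwise_tukeyhsd(times, groups, alpha=0.05))
```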

Experimental designs (diagram): One-Way ANOVA (Completely Randomized design, Randomized Block design) and Two-Way ANOVA (Factorial design).


10.4 The Randomized Block Design

Block Design. A block is a group of experimental units or subjects that are known before the experiment to be similar in some way that is expected to affect the response to the treatments. In a block design, the random assignment of units to treatments is carried out separately within each block.

Figure Outline of a block design. The blocks consist of male and female subjects. The treatments are three therapies for cancer.


1. Experimental Units (Subjects) Are Assigned Randomly to Blocks

Blocks are Assumed Homogeneous

2. One Factor or Independent Variable of Interest

2 or More Treatment Levels or Classifications

3. One Blocking Factor

General ANOVA Summary Table for a Randomized Block Design

Source       df         SS          MS                F
Treatments   p−1        SST         MST = SST/(p−1)   MST/MSE
Block        b−1        SSB         MSB = SSB/(b−1)
Error        n−p−b+1    SSE         MSE
Total        n−1        SS(Total)

The variation between the treatment means is measured by the Sum of Squares for Treatments (SST).

SST = Σ (for i = 1 to p) b (x̄Ti − x̄)²

where x̄Ti represents the sample mean for the i-th treatment, b (the number of blocks) is the number of measurements for each treatment, and p is the number of treatments.

The blocks also account for some of the variation among the different responses. The sampling variability between the blocks is measured by the Sum of Squares for Blocks (SSB):

SSB = Σ (for i = 1 to b) p (x̄Bi − x̄)²

where x̄Bi represents the sample mean for the i-th block and p (the number of treatments) is the number of measurements in each block.

The total variation is

SS(Total) = Σ (for i = 1 to n) (xi − x̄)²

Then the variation attributable to sampling error is found by subtraction:



SSE=SS(Total)-SST-SSB

Test to Compare p Treatment Means for a Randomized Block Design

H0 : μ1=μ2=⋯=μ p

Ha: At least two treatment means differ


Test statistic

F = MST/MSE

Rejection region: F> Fα , where Fα is based

on v1=( p−1) numerator degrees of freedom

(associated with MST) and v2=(n−b−p+1 ) denominator d.f. (associated with MSE). Also,

Reject H0 if p-value is less than α .

Assumptions:

1. The probability distributions of observations corresponding to all the block-treatment combinations are normal.

2. The variances of all probability distributions are equal.


Randomized Block F-Test Example

Golfer (Block)   Brand A   Brand B   Brand C   Brand D
B1               202.4     203.2     223.7     203.6
B2               242.0     248.7     259.8     240.7
B3               220.4     227.3     240.0     207.4
B4               230.0     243.1     247.7     226.9
B5               191.6     211.4     218.7     200.1
B6               247.7     253.0     268.1     244.0
B7               214.8     214.8     233.9     195.8
B8               245.4     243.6     257.8     227.9
B9               224.0     231.5     238.2     215.7
B10              252.2     255.2     265.4     245.2

Excel Output


Anova: Two-Factor Without Replication                     

SUMMARY    Count   Sum      Average   Variance
B1         4       832.9    208.225   106.6825
B2         4       991.2    247.800   76.28667
B3         4       895.1    223.775   185.0692
B4         4       947.7    236.925   100.8958
B5         4       821.8    205.450   143.8033
B6         4       1012.8   253.200   112.3133
B7         4       859.3    214.825   241.9358
B8         4       974.7    243.675   150.4492
B9         4       909.4    227.350   93.96333
B10        4       1018.0   254.500   70.36

Brand A    10      2270.5   227.05    410.6428
Brand B    10      2331.8   233.18    341.5507
Brand C    10      2453.3   245.33    297.6312
Brand D    10      2207.3   220.73    352.4534

ANOVA
Source of Variation   SS         df   MS         F          P-value     F crit
Rows                  12073.88   9    1341.542   66.26468   4.504E-16   2.250133
Columns               3298.657   3    1099.552   54.31172   1.448E-11   2.960348
Error                 546.6208   27   20.24521
Total                 15919.16   39
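The randomized block sums of squares can also be computed directly from the golf-ball data with the formulas above. The following Python sketch (not part of the original report) should match the Excel output, giving F ≈ 54.3 for brands (treatments) and F ≈ 66.3 for golfers (blocks).

```python
# Randomized block ANOVA computed from first principles with numpy.
import numpy as np

data = np.array([   # rows = golfers (blocks), columns = brands A-D
    [202.4, 203.2, 223.7, 203.6], [242.0, 248.7, 259.8, 240.7],
    [220.4, 227.3, 240.0, 207.4], [230.0, 243.1, 247.7, 226.9],
    [191.6, 211.4, 218.7, 200.1], [247.7, 253.0, 268.1, 244.0],
    [214.8, 214.8, 233.9, 195.8], [245.4, 243.6, 257.8, 227.9],
    [224.0, 231.5, 238.2, 215.7], [252.2, 255.2, 265.4, 245.2]])

b, p = data.shape                                      # 10 blocks, 4 treatments
grand = data.mean()
ss_total = ((data - grand) ** 2).sum()
sst = (b * (data.mean(axis=0) - grand) ** 2).sum()     # treatments (brands)
ssb = (p * (data.mean(axis=1) - grand) ** 2).sum()     # blocks (golfers)
sse = ss_total - sst - ssb
mst, msb, mse = sst / (p - 1), ssb / (b - 1), sse / (b * p - p - b + 1)
print(round(mst / mse, 2), round(msb / mse, 2))
```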

10.5 Factorial Experiments

All the experiments discussed in Sections 10.2 and 10.4 were single-factor experiments. The treatments were levels of a single factor, with the sampling of experimental units performed using either a completely randomized or randomized block design.



Factorial Design

1. Experimental Units (Subjects) Are Assigned Randomly to Treatments

Subjects are Assumed Homogeneous

2. Two or More Factors or Independent Variables

Each Has 2 or More Treatments (Levels)

3. Analyzed by Two-Way ANOVA


Definition 10.9

A complete factorial experiment is one in which every factor-level combination is utilized. That is, the number of treatments in the experiment equals the total number of factor-level combinations.

Factorial Design Example


                          Factor 2 (Training Method)
Factor 1 (Motivation)     Level 1        Level 2        Level 3
Level 1 (High)            19 hr, 11 hr   20 hr, 17 hr   22 hr, 31 hr
Level 2 (Low)             27 hr, 29 hr   25 hr, 30 hr   31 hr, 49 hr

Advantages of Factorial Designs

1. Saves Time & Effort

e.g., Could Use Separate Completely Randomized Designs for Each Variable

2. Controls Confounding Effects by Putting Other Variables into Model

3. Can Explore Interaction Between Variables

(The factorial example above has 2 × 3 = 6 treatments.)


Two-Way ANOVA

1. Tests the Equality of 2 or More Population Means When Several Independent Variables Are Used

2. Same Results as Separate One-Way ANOVA on Each Variable

No Interaction Can Be Tested

3. Used to Analyze Factorial Designs

Two-Way ANOVA Data Table


Xijk = observation k at level i of Factor A and level j of Factor B.

Factor A    Factor B: 1      2            ...   b
1           X111, X112       X121, X122   ...   X1b1, X1b2
2           X211, X212       X221, X222   ...   X2b1, X2b2
:           :                :                  :
a           Xa11, Xa12       Xa21, Xa22   ...   Xab1, Xab2

Two-Way ANOVA Total Variation Partitioning: SS(Total) = SSA (variation due to Treatment A) + SSB (variation due to Treatment B) + SS(AB) (variation due to interaction) + SSE (variation due to random sampling).

Two-Way ANOVA Summary Table


Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F
A (Row)               a − 1                SS(A)            MS(A)         MS(A)/MSE
B (Column)            b − 1                SS(B)            MS(B)         MS(B)/MSE
AB (Interaction)      (a−1)(b−1)           SS(AB)           MS(AB)        MS(AB)/MSE
Error                 n − ab               SSE              MSE
Total                 n − 1                SS(Total)

(The Total row is the same as in the other designs.)

Test Conducted in Analyses of Factorial Experiments: Completely Randomized Design, r Replicates per treatment

Test for Treatment Means

H0: No difference among the ab treatment means

Ha: At least two treatment means differ

Test statistic


F = MST/MSE

Rejection region: F> Fα , where Fα is based

on v1=(ab−1) numerator degrees of freedom

and v2=(n−ab) denominator d.f. (Note: n=abr).

Test for Factor Interaction

H0: Factors A and B do not interact to affect the response mean

Ha: Factors A and B do interact to affect the response mean

Test statistic

F = MS(AB)/MSE

Rejection region: F> Fα , where Fα is based

on v1=(a−1 )(b−1) numerator degrees of

freedom and v2=(n−ab) denominator d.f.


Test for Main Effect Factor A

H0: No difference among the a mean levels of factor A

Ha: At least two factor A mean levels differ

Test statistic

F = MS(A)/MSE

Rejection region: F> Fα , where Fα is based

on v1=(a−1 ) numerator degrees of freedom

and v2=(n−ab) denominator d.f.

Test for Main Effect Factor B

H0: No difference among the a mean levels of factor B

Ha: At least two factor B mean levels differ

Test statistic

F = MS(B)/MSE

Rejection region: F > Fα, where Fα is based on v1 = (b − 1) numerator degrees of freedom and v2 = (n − ab) denominator d.f.

(Figure: Effects of Motivation (High or Low) and Training Method (A, B, C) on mean learning time; one panel shows interaction, the other shows no interaction.)

Assumptions for All Tests

1. The response distribution for each factor-level combination (treatment) is normal.
2. The response variance is constant for all treatments.
3. Random and independent samples of experimental units are associated with each treatment.
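A two-way ANOVA with interaction can be run in a few lines with statsmodels. The sketch below (not part of the original report) uses the 2 × 3 learning-time layout as reconstructed in the Factorial Design Example table (two replicates per treatment); that data arrangement is an assumption, and the statsmodels package is assumed to be available.

```python
# Two-way (factorial) ANOVA with interaction for the motivation x method example.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    'motivation': ['High'] * 6 + ['Low'] * 6,
    'method':     ['M1', 'M1', 'M2', 'M2', 'M3', 'M3'] * 2,
    'time':       [19, 11, 20, 17, 22, 31,  27, 29, 25, 30, 31, 49],
})

model = smf.ols('time ~ C(motivation) * C(method)', data=df).fit()
print(anova_lm(model))   # rows: main effects, interaction, residual (error)
```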

Graphs of Interaction


11. Simple Linear Regression

Learning Objectives

Describe the Linear Regression Model
State the Regression Modeling Steps
Explain Ordinary Least Squares
Compute Regression Coefficients
Predict Response Variable
Interpret Computer Output

Models

Representation of Some Phenomenon

Mathematical Model Is a Mathematical Expression of Some Phenomenon

Often Describe Relationships between Variables

Types

o Deterministic Models
o Probabilistic Models


Deterministic Models

1. Hypothesize Exact Relationships

2. Suitable When Prediction Error is Negligible

3.Example: Force Is Exactly Mass Times Acceleration

F = m·a


Probabilistic Models

1. Hypothesize 2 Components

Deterministic
Random Error

Y = Deterministic component + Random Error

where Y is the variable of interest.


2. Example: Sales Volume Is 10 Times Advertising Spending + Random Error

Y = 10X + Random Error May Be Due to Factors

Other Than Advertising

Types of Probabilistic Models

A First-Order (Straight-Line) Probabilistic Model

y=β0+β1 x+ε

where

y is the dependent (or response) variable,

x is the independent (or predictor) variable,

E(y) = β0 + β1x is the deterministic portion of the model,

β0 is the y-intercept of the line, and

β1 is the slope of the line.

(Diagram: probabilistic models include regression models, correlation models, and other models. Figure: a straight line Y = mX + b, where b is the Y-intercept and the slope m is the change in Y per unit change in X.)

Regression Models

1. Answer ‘What Is the Relationship Between the Variables?’

2. Equation Used

1 Numerical Dependent (Response) Variable


What Is to Be Predicted

1 or More Numerical or Categorical Independent (Explanatory) Variables

3. Used Mainly for Prediction & Estimation

Regression Modeling Steps

1. Hypothesize Deterministic Component

2. Estimate Unknown Model Parameters

3. Specify Probability Distribution of Random Error Term

Estimate Standard Deviation of Error
4. Evaluate Model
5. Use Model for Prediction & Estimation

(Diagram: regression models are classified as simple, with 1 explanatory variable, or multiple, with 2 or more explanatory variables; each may be linear or non-linear.)

(Figure: the unknown population relationship Yi = β0 + β1Xi + εi is estimated from a random sample by the fitted line Ŷi = β̂0 + β̂1Xi.)

Sample Linear Regression Model


(Figure: observed values Yi = β̂0 + β̂1Xi + ε̂i scatter about the fitted line Ŷi = β̂0 + β̂1Xi; ε̂i is the residual (estimated random error), and an unsampled observation also lies about the line.)

Origins of Regression:

“Regression Analysis was first developed by Sir Francis Galton in the latter part of the 19th Century. Galton had studied the relation between heights of fathers and sons and noted that the heights of sons of both tall & short fathers appeared to ‘revert’ or ‘regress’ to the mean of the group. He considered this tendency to be a regression to ‘mediocrity.’ Galton developed a mathematical description of this tendency, the precursor to today’s regression models.” (From page 6 of Neter, Kutner, Nachtsheim, and Wasserman, 1996).

Regression Line


The regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.

Table. Mean height of children in Kalama, Egypt, age from 18 to 29 months.

Scattergram


1. Plot of All (Xi, Yi) Pairs
2. Suggests How Well Model Will Fit

Figure. Mean height of children in Kalama, Egypt, plotted against age from 18 to 29 months, from Table 2.7.


Figure. The regression line fitted to the Kalama data and used to predict height at age 32 months.

In Figure, we have drawn the regression line with the equation

Height = 64.93 + (0.635 × age)

It means that b=0.635 is the slope of the line and a=64.93 is the intercept.


If we substitute 32 for the age in the equation,

Height = 64.93 + (0.635 × 32) = 85.25 centimeters.

How would you draw a line through the points? How do you determine which line ‘fits best’?

Least Squares

1. ‘Best Fit’ Means Difference Between Actual Y Values & Predicted Y Values Are a Minimum

But Positive Differences Off-Set Negative

The least squares regression line is the

straight line y=a+bx which minimizes the sum of the squares of the vertical distances between the line and the observed values y.

∑ (error )2=∑( observed y−predicted y )2

2. LS Minimizes the Sum of the Squared Differences (SSE)

Least Square regression


If we predict 85.25 centimeters for the mean height at age 32 months and the actual mean turns out to be 84 centimeters, our error is

Error = observed height – predicted height

= 84 -85.25 = -1.25 centimeters

Figure The least-squares idea: make the errors in predicting y as small as possible by minimizing the sum of their squares.

Least Squares Graphically


(Figure: four data points with residuals ε̂1, ε̂2, ε̂3, ε̂4 measured vertically from the fitted line Ŷi = β̂0 + β̂1Xi.)

Coefficient Equations

LS minimizes Σ (for i = 1 to n) ε̂i² = ε̂1² + ε̂2² + ε̂3² + ε̂4²


Sample slope:

β̂1 = [ Σ XiYi − (Σ Xi)(Σ Yi)/n ] / [ Σ Xi² − (Σ Xi)²/n ]

Sample Y-intercept:

β̂0 = Ȳ − β̂1 X̄

Prediction equation:

Ŷi = β̂0 + β̂1 Xi

Interpretation of Coefficients

1. Slope (β̂1): Estimated Y changes by β̂1 for each 1-unit increase in X. If β̂1 = 2, then Sales (Y) is expected to increase by 2 for each 1-unit increase in Advertising (X).

2. Y-Intercept (β̂0): Average value of Y when X = 0.

If β̂0 = 4, then average Sales (Y) is expected to be 4 when Advertising (X) is 0.

Parameter Estimation Example

You’re a marketing analyst for Hasbro Toys. You gather the following data:


Ad $   Sales (Units)
1      1
2      1
3      2
4      2
5      4

What is the relationship between sales & advertising?

Parameter Estimation Solution


Coefficient Interpretation

1. Slope (β̂1): Sales Volume (Y) is expected to increase by .7 units for each $1 increase in Advertising (X).

2. Y-Intercept (β̂0): Average value of Sales Volume (Y) is −.10 units when Advertising (X) is 0.

Difficult to Explain to Marketing Manager

Expect Some Sales Without Advertising

Parameter Estimation Computer Output

β̂1 = [ Σ XiYi − (Σ Xi)(Σ Yi)/n ] / [ Σ Xi² − (Σ Xi)²/n ] = [37 − (15)(10)/5] / [55 − (15)²/5] = 7/10 = 0.70

β̂0 = Ȳ − β̂1 X̄ = 2 − (0.70)(3) = −0.10


Parameter Estimates

Variable   DF   Parameter Estimate   Standard Error   T for H0: Param=0   Prob>|T|
INTERCEP   1    -0.1000              0.6350           -0.157              0.8849
ADVERT     1    0.7000               0.1914           3.656               0.0354

(Figure: Barry Bonds least-squares regression line, with RBI plotted against HR.)


Typically, the equation of the least squares regression line is obtained by computer software with a regression function.

Excel output from Barry Bonds Statistics

            Coefficients
Intercept   39.7618446
Slope       1.568414403

RESIDUAL OUTPUT

Observation   Predicted RBI   Residuals
1             64.8565         -16.8565
2             78.9722         -19.9722
3             77.4038         -19.4038
4             69.5617         -11.5617
5             91.5195         22.4805
6             78.9722         37.0278
7             93.0879         9.9121
8             111.9089        11.0911
9             97.7932         -16.7932
10            91.5195         12.4805
11            105.6352        23.3648
12            102.4984        -1.4984
13            97.7932         24.2068
14            93.0879         -10.0879
15            116.6142        -10.6142
16            154.2561        -17.2561
17            91.5195         -16.5195


From Dr. Chris Bilder’s website.

Select Tools > Data Analysis from the main Excel menu bar to bring up the Data Analysis window. Select Regression and OK to produce the Regression window. Below is the finished window.

The Residual option produces the residuals in the output. The Line Fit Plots option produces a


plot similar to a scatter plot with an estimated regression line plotted upon it. Notice the above output does not look exactly like a scatter plot with estimated regression line plotted upon it. Below is one way to fix the plot. Note that other steps are often necessary to make the plot more “professional” looking (changing the scale on the axes, adding tick marks, changing graph titles, etc…)

1) Change background from grey to white
   a) Right click on the grey background (a menu should appear)
   b) Select Format Plot Area to bring up the formatting window
      i) Select None as the area
      ii) Select OK
2) Remove legend
   a) Right click in the legend
   b) Select Clear
3) Create the regression line
   a) Right click on one of the estimated Y values (should be in pink); a menu should appear
   b) Select Format Data Series to bring up the formatting window
      i) Under Marker, select None
      ii) Under Line, select Automatic
      iii) Select OK

Linear Regression Assumptions

1. Mean of Probability Distribution of Error Is 0

2. Probability Distribution of Error Has Constant Variance

3. Probability Distribution of Error is Normal

4. Errors Are Independent


Error Probability Distribution

Measures of Variation in Regression

1. Total Sum of Squares (SSyy)

Measures Variation of Observed Yi Around the Mean¯Y

2. Explained Variation (SSR)

Variation Due to Relationship Between X & Y

3. Unexplained Variation (SSE)

Variation Due to Other Factors

(Figure for Error Probability Distribution: normal error curves f(ε) centered on the regression line at X1 and X2.)

Variation Measures (figure): for an observation Yi at Xi,
Total sum of squares: (Yi − Ȳ)²
Explained sum of squares: (Ŷi − Ȳ)²
Unexplained sum of squares: (Yi − Ŷi)²

Estimation of σ² for a Straight-Line Model

s² = SSE / (n − 2)

where


SSE = Σ (for i = 1 to n) (Yi − Ŷi)²

s = √s² = √(SSE/(n − 2))

We will refer to s as the estimated standard error of the regression model.

Interpretation of s, the estimated Standard

Deviation of ε

We expect most (about 95%) of the observed y values to lie within 2s of their respective least squares predicted values, ŷ.

Test of Slope Coefficient


1. Shows if there is a linear relationship between X & Y
2. Involves the population slope β1
3. Hypotheses: H0: β1 = 0 (no linear relationship); Ha: β1 ≠ 0 (linear relationship)
4. Theoretical basis is the sampling distribution of the slope

(Figure: different samples give different fitted lines about the population line; the sample slopes, e.g. 2.5, 1.6, 1.8, 2.1, ..., form the sampling distribution of β̂1 with standard error sβ̂1.)

Slope Coefficient Test Statistic


Test of an Individual parameter Coefficient in the Simple Linear Regression Model

One-Tailed Test

H0 : β1=0

Ha : β1<0 [or Ha : β1>0 ]

Two-Tailed Test

H0 : β1=0

Ha : β1≠0

Test statistic

t = β̂1 / sβ̂1

where

sβ̂1 = s / √( Σ (for i = 1 to n) (Xi − X̄)² )

Rejection region


t <−tα or t >tα or |t|>tα /2

where tα and tα /2 are based on (n-2) df.

or

Reject H0 : β1=0 if p-value is less than α , (for example, α =0.05 )

A 100(1 − α)% Confidence Interval for the β1 parameter

β̂1 ± tα/2 · sβ̂1 = ( β̂1 − tα/2 sβ̂1 , β̂1 + tα/2 sβ̂1 )

where tα/2 is based on (n − 2) degrees of freedom.

Coefficient of Correlation

Scatterplots provide a visual tool for looking at the relationship between two variables. Unfortunately our eyes are not good tools for judging the strength of the relationship. Changes in the scale or the amount of white space in the graph can easily affect our judgement as to the strength of the relationship. Correlation is a numerical measure we will use to show the strength of linear association.


Figure 2.9 Two scatterplots of the same data

Correlation The correlation measures the direction and strength of the linear relationship between two quantitative variables. Correlation is usually denoted by r .

Suppose that we have data on variables x and y for n individuals. The mean and standard

deviations of the two variables are x and Sx for

the x-values, and y and S y for the y-values.


The correlation coefficient r between x and y is

r = (1/(n − 1)) Σ ((xi − x̄)/Sx) ((yi − ȳ)/Sy).

The correlation coefficient r has possible values between negative one and positive one. That is, −1≤r≤1 .

When r is positive it means that there is a positive linear association between the variables, and when it is negative there is a negative linear association. A scatterplot for a dataset with r = 1 would have all points falling on a perfectly straight upward-sloping line. A scatterplot for a dataset with r = −1 would have all points on a perfectly straight downward-sloping line. A value near r = 0 would give a scatterplot with a blob shape and no apparent upward or downward trend.


Figure How the correlation r measures the direction and strength of linear association.


(Figure: four scatterplots of Y versus X illustrating r² = 1, r² = 1, r² = .8, and r² = 0.)

Coefficient of Determination

Proportion of Variation “Explained” by Relationship between X and Y

r² = Explained Variation / Total Variation = [ Σ (Yi − Ȳ)² − Σ (Yi − Ŷi)² ] / Σ (Yi − Ȳ)²

Practical Interpretation of the Coefficient of Determination

100 (r2) % of the variation in y can be explained by using x to predict y in the straight-line model.

Coefficient of Determination Examples


Coefficient of Determination Example

You’re a marketing analyst for Hasbro Toys. You find β̂0 = −0.1 and β̂1 = 0.7.

Ad $   Sales (Units)
1      1
2      1
3      2
4      2
5      4

Interpret a coefficient of determination of 0.8167: about 81.67% of the variation in sales can be explained by using advertising to predict sales in the straight-line model.

Prediction With Regression Models

1. Types of Predictions

Point Estimates Interval Estimates

2. What Is Predicted

Population Mean Response E(Y) for Given X

Point on Population Regression Line

Individual Response (Yi) for Given X


What Is Predicted (figure: at X = Xp, the mean Y, E(Y) = β0 + β1X, lies on the population line; an individual Y and the prediction Ŷ vary about it)

Confidence Interval Estimate of Mean Y

Ŷ − t(n−2, α/2) · SŶ ≤ E(Y) ≤ Ŷ + t(n−2, α/2) · SŶ

where

SŶ = s √( 1/n + (Xp − X̄)² / Σ (for i = 1 to n) (Xi − X̄)² )


Factors Affecting Interval Width

1. Level of confidence (1 − α): width increases as confidence increases
2. Data dispersion (s): width increases as variation increases
3. Sample size: width decreases as sample size increases
4. Distance of Xp from the mean X̄: width increases as distance increases

Why Distance from Mean? (Figure: sample regression lines pivot about (X̄, Ȳ), so fitted values at X2, which is farther from X̄ than X1, show greater dispersion.)


Confidence Interval Estimate Example

You’re a marketing analyst for Hasbro Toys. You find b0 = −.1, b1 = .7, and s = .60553.

Ad $   Sales (Units)
1      1
2      1
3      2
4      2
5      4

Estimate the mean sales when advertising is $4 at the .05 level.

Confidence Interval Estimate Solution


The predicted value at Xp = 4 (the X to be predicted) is Ŷ = −0.1 + 0.7(4) = 2.7, with

SŶ = 0.60553 √( 1/5 + (4 − 3)²/10 ) = 0.3316

Ŷ − t(n−2, α/2) SŶ ≤ E(Y) ≤ Ŷ + t(n−2, α/2) SŶ
2.7 − 3.1824(0.3316) ≤ E(Y) ≤ 2.7 + 3.1824(0.3316)
1.6445 ≤ E(Y) ≤ 3.7553

Prediction Interval of Individual Response

Ŷ − t(n−2, α/2) S(Y−Ŷ) ≤ Yp ≤ Ŷ + t(n−2, α/2) S(Y−Ŷ)

where

S(Y−Ŷ) = s √( 1 + 1/n + (Xp − X̄)² / Σ (for i = 1 to n) (Xi − X̄)² )

Note the extra '1' (the extra s) under the radical: an individual response varies about its mean, so the prediction interval is wider than the confidence interval for E(Y).

Why the Extra 'S'?
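Both intervals at Xp = 4 can be reproduced from the data. The Python sketch below (not part of the original report) should give approximately (1.644, 3.755) for the mean response and (0.502, 4.897) for an individual response, matching the computer output that follows.

```python
# 95% confidence interval for E(Y) and prediction interval for Y at Xp = 4.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5.])
y = np.array([1, 1, 2, 2, 4.])
n, xp = len(x), 4.0

b1 = ((x * y).sum() - x.sum() * y.sum() / n) / ((x**2).sum() - x.sum()**2 / n)
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(((y - (b0 + b1 * x))**2).sum() / (n - 2))
sxx = ((x - x.mean())**2).sum()
yhat = b0 + b1 * xp                                           # 2.7
tstar = stats.t.ppf(0.975, n - 2)                             # 3.1824

se_mean = s * np.sqrt(1/n + (xp - x.mean())**2 / sxx)         # about 0.3316
se_pred = s * np.sqrt(1 + 1/n + (xp - x.mean())**2 / sxx)     # extra '1' for an individual Y
print((yhat - tstar * se_mean, yhat + tstar * se_mean))
print((yhat - tstar * se_pred, yhat + tstar * se_pred))
```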


(Figure: at X = Xp, the individual Y we're trying to predict varies about the expected (mean) Y on the line E(Y) = β0 + β1X; the prediction Ŷi = β̂0 + β̂1Xi serves for both.)

Interval Estimate Computer Output

Obs   Dep Var SALES   Pred Value   Std Err Predict   Low95% Mean   Upp95% Mean   Low95% Predict   Upp95% Predict
1     1.000           0.600        0.469             -0.892        2.092         -1.837           3.037
2     1.000           1.300        0.332             0.244         2.355         -0.897           3.497
3     2.000           2.000        0.271             1.138         2.861         -0.111           4.111
4     2.000           2.700        0.332             1.644         3.755         0.502            4.897
5     4.000           3.400        0.469             1.907         4.892         0.962            5.837

(For X = 4: the predicted Y is 2.700 with SŶ = 0.332, confidence interval (1.644, 3.755) and prediction interval (0.502, 4.897).)

Hyperbolic Interval Bands


(Figure: the confidence and prediction bands about the fitted line Ŷi = β̂0 + β̂1Xi are hyperbolic, narrowest at X̄ and widening as X moves away from X̄.)

12. Multiple Regression Models



Learning Objectives

1. Explain the Linear Multiple Regression Model

2. Test Overall Significance3. Describe Various Types of Models4. Evaluate Portions of a Regression Model5. Interpret Linear Multiple Regression

Computer Output6. Describe Stepwise Regression7. Explain Residual Analysis8. Describe Regression Pitfalls

Most practical applications of regression analysis utilize models that are more complex than the simple straight-line model. For example, a realistic probabilistic model for reaction time would include more than just the amount of a particular drug in the bloodstream. Factors such as age, a measure of visual perception, and sex of the subjects are a few of the many variables that might be related to reaction time.

Regression Modeling Steps


1. Hypothesize Deterministic Component
2. Estimate Unknown Model Parameters
3. Specify Probability Distribution of Random Error Term; Estimate Standard Deviation of Error
4. Evaluate Model
5. Use Model for Prediction & Estimation

The multiple regression model relates one dependent variable to two or more independent variables through a linear function:

Yi = β0 + β1X1i + β2X2i + ⋯ + βkXki + εi

where Yi is the dependent (response) variable, X1i, ..., Xki are the independent (explanatory) variables, β1, ..., βk are the population slopes, β0 is the population Y-intercept, and εi is the random error.

Probabilistic models that include more than one independent variable are called multiple regression models. The general form of these models is


The dependent variable Y is now written as a

function of k independent variables,x1 , x2 ,. .. , xk .

The random error term is added to make the model probabilistic rather than deterministic. The

value of the coefficient β i determines the

contribution of the independent variable x i , and β0 is the y-intercept. The coefficients β0 , β1 ,. . . , βk are usually unknown because they represent population parameters.

Actually, x1 , x2 ,. .. , xk can be functions of variables as long as the functions do not contain unknown parameters. For example, the reaction time, Y, of a subject to a visual stimulus could be a function of the independent variables

x1 = Age of the subject
x2 = (Age)² = x1²


x3=1 if male subject, 0 if female subject

The x2 term is called a higher-order term, since

it is the value of a quantitative variable (x1 ) squared (i.e., raised to the second power). The x3 term is an indicator variable representing a qualitative variable (gender).

The General Multiple Regression Model

Y=β0+β1 x1+β2 x2+⋯+βk xk+ε

where

Y is the dependent (or response) variable,


x1, x2, ..., xk are the independent (or predictor) variables,

E(Y) = β0 + β1x1 + β2x2 + ⋯ + βkxk is the deterministic portion of the model, and

βi determines the contribution of the independent variable xi.

(Figure, bivariate model: the response plane E(Y) = β0 + β1X1i + β2X2i in (X1, X2, Y) space; an observed Yi = β0 + β1X1i + β2X2i + εi lies a vertical distance εi from the plane above the point (X1i, X2i).)

Population Multiple Regression Model

Analyzing a Multiple Regression Model


1. Hypothesize the deterministic component of the model. This component relates the mean, E(Y), to the independent variables x1, x2, ..., xk. This involves the choice of the independent variables to be included in the model.

2. Use the sample data to estimate the unknown model parameters β0, β1, ..., βk in the model.

3. Specify the probability distribution of the random error term, ε, and estimate the standard deviation of this distribution, σ.

4. Check that the assumptions on ε are satisfied, and make model modifications if necessary.

5. Statistically evaluate the usefulness of the model.

6. When satisfied that the model is useful, use it for prediction, estimation, and other purposes.

Multiple linear regression


Two or more independent variables are used to estimate 1 dependent variable.

Notes:
1) εi ~ independent N(0, σ²)
2) β0, β1, ..., βp−1 are parameters with corresponding estimates β̂0, β̂1, ..., β̂p−1
3) Xi1, ..., Xi,p−1 are known constants
4) The second subscript on Xij denotes the jth independent variable
5) i = 1, ..., n

Parameter Estimation Example


1. Slope (β̂k): Estimated Y changes by β̂k for each 1-unit increase in Xk, holding all other variables constant. Example: if β̂1 = 2, then Sales (Y) is expected to increase by 2 for each 1-unit increase in Advertising (X1), given the number of sales reps (X2).

2. Y-Intercept (β̂0): Average value of Y when Xk = 0.

You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.) & newspaper circulation (000) on the number of ad responses (00).

You’ve collected the following data:

Resp   Size   Circ
1      1      2
4      8      8
1      3      1
3      5      7
2      6      4
4      10     6


Parameter Estimates

Variable   DF   Parameter Estimate   Standard Error   T for H0: Param=0   Prob>|T|
INTERCEP   1    0.0640               0.2599           0.246               0.8214
ADSIZE     1    0.2049               0.0588           3.656               0.0399
CIRC       1    0.2805               0.0686           4.089               0.0264


Interpretation of Coefficients Solution

1. Slope (β̂1): # responses to ad is expected to increase by .2049 (20.49) for each 1 sq. in. increase in ad size, holding circulation constant.

2. Slope (β̂2): # responses to ad is expected to increase by .2805 (28.05) for each 1-unit (1,000) increase in circulation, holding ad size constant.


Assumptions for Random Error ε

1. For any given values of x1, x2, ..., xk, the random error ε has a normal probability distribution with mean equal to 0 and variance equal to σ².
2. The random errors are independent.

Estimator of σ² for a Multiple Regression Model with k Independent Variables

s² = SSE / [n − (k+1)] = Σ (yi − ŷi)² / [n − (k+1)],

where k + 1 = number of estimated β parameters.

Test of an Individual parameter Coefficient in the Multiple Regression Model

One-Tailed Test

H0 : β i=0

Ha : βi<0 [or Ha : βi>0 ]


Two-Tailed Test

H0 : β i=0

Ha : βi≠0

Test statistic

t = β̂i / sβ̂i

Reject H0 : β i=0 if p-value is less than α , (for example, α =0.05 )

A 100(1 − α)% Confidence Interval for a β parameter

β̂i ± tα/2 · sβ̂i = ( β̂i − tα/2 sβ̂i , β̂i + tα/2 sβ̂i )

where tα/2 is based on n − (k+1) degrees of freedom,
n = number of observations, and
k + 1 = number of β parameters in the model.


Testing Overall Significance

1. Shows if there is a linear relationship between all X variables together and Y
2. Uses the F test statistic
3. Hypotheses: H0: β1 = β2 = ... = βk = 0 (no linear relationship); Ha: at least one coefficient is not 0 (at least one X variable affects Y)

Testing Global Usefulness of the Model: The Analysis of Variance F-Test

H0 : β1=β2=⋯=βk=0

(All model terms are unimportant for predicting y)

Ha: At least one β i≠0

(At least one model term is useful for predicting y)

Test statistic

F = MSR/MSE

Reject H0 if p-value is less than α .


Testing Overall Significance Computer Output

Analysis of Variance

Source    DF            Sum of Squares   Mean Square   F Value   Prob>F
Model     k = 2         9.2497           4.6249        55.440    0.0043
Error     n−k−1 = 3     0.2503           0.0834
C Total   n−1 = 5       9.5000

(F Value = MS(Model)/MS(Error); Prob>F is the P-value.)

Types of Regression Models (diagram), by explanatory variable:
- 1 quantitative variable: 1st-order, 2nd-order, or 3rd-order model
- 2 or more quantitative variables: 1st-order, 2nd-order, or interaction model
- 1 qualitative variable: dummy-variable model


First-Order Model With 1 Independent Variable

1. Relationship between 1 dependent & 1 independent variable is linear
2. Used when the expected rate of change in Y per unit change in X is stable
3. Used with curvilinear relationships if the relevant range is linear

E(Y) = β0 + β1X1i

First-Order Model Relationships (figure: a line sloping up when β1 > 0 and down when β1 < 0).


Second-Order Model With 1 Independent Variable

1. Relationship between 1 dependent & 1 independent variable is a quadratic function
2. Useful first model if a non-linear relationship is suspected
3. Model

E(Y) = β0 + β1X1i + β2X1i²

(β1X1i is the linear effect; β2X1i² is the curvilinear effect. Figure: curves for β2 > 0 and β2 < 0.)


Third-Order Model With 1 Independent Variable

1. Relationship between 1 dependent & 1 independent variable has a 'wave'
2. Used if there is 1 reversal in curvature
3. Model

E(Y) = β0 + β1X1i + β2X1i² + β3X1i³

(β1X1i is the linear effect; β2X1i² and β3X1i³ are the curvilinear effects. Third-Order Model Relationships, figure: curves for β3 > 0 and β3 < 0.)


First-Order Model With 2 Independent Variables

1. Relationship between 1 dependent & 2 independent variables is a linear function
2. Assumes no interaction between X1 & X2: the effect of X1 on E(Y) is the same regardless of the X2 value
3. Model

E(Y) = β0 + β1X1i + β2X2i

First-Order Model Relationships, No Interaction (figure): with E(Y) = 1 + 2X1 + 3X2, the lines for X2 = 0, 1, 2, 3 are
E(Y) = 1 + 2X1 + 3(0) = 1 + 2X1
E(Y) = 1 + 2X1 + 3(1) = 4 + 2X1
E(Y) = 1 + 2X1 + 3(2) = 7 + 2X1
E(Y) = 1 + 2X1 + 3(3) = 10 + 2X1
These are parallel lines: the effect (slope) of X1 on E(Y) does not depend on the X2 value.

Interaction Model With 2 Independent Variables

1. Hypothesizes Interaction Between Pairs of X Variables

Response to One X Variable Varies at Different Levels of Another X Variable

2. Contains Two-Way Cross Product Terms

3. Can Be Combined With Other Models

Example: Dummy-Variable Model

E(Y )=β0+ β1 X1i+β2 X 2i+β3 X1 i X2 i

Page 314: Report Engineering Probability and Statistics Ahmedawad

Interaction Model Relationships (figure): with E(Y) = 1 + 2X1 + 3X2 + 4X1X2,
E(Y) = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1 (when X2 = 0)
E(Y) = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1 (when X2 = 1)
The lines are not parallel: the effect (slope) of X1 on E(Y) does depend on the X2 value.

Effect of Interaction

1. Given: E(Y) = β0 + β1X1i + β2X2i + β3X1iX2i
2. Without the interaction term, the effect of X1 on Y is measured by β1
3. With the interaction term, the effect of X1 on Y is measured by β1 + β3X2; the effect increases as X2i increases

Page 315: Report Engineering Probability and Statistics Ahmedawad

Second-Order Model With 2 Independent Variables

1. Relationship between 1 dependent & 2 or more independent variables is a quadratic function
2. Useful first model if a non-linear relationship is suspected
3. Model

E(Y) = β0 + β1X1i + β2X2i + β3X1iX2i + β4X1i² + β5X2i²

Second-Order Model Relationships (figure: response surfaces over (X1, X2) for β4 + β5 > 0, for β4 + β5 < 0, and for β3² > 4β4β5).

Page 316: Report Engineering Probability and Statistics Ahmedawad

Types of Regression Models

Dummy-Variable Model

1. Involves Categorical X Variable With 2 Levels

e.g., Male-Female; College-No College

2. Variable Levels Coded 0 & 1

3. Number of Dummy Variables Is 1 Less Than Number of Levels of Variable

4. May Be Combined With Quantitative Variable (1st Order or 2nd Order Model)

ExplanatoryVariable

1stOrderModel

3rdOrderModel

2 or MoreQuantitative

Variables

2ndOrderModel

1stOrderModel

2ndOrderModel

Inter-ActionModel

1Qualitative

Variable

DummyVariable

Model

1Quantitative

Variable

Interpreting Dummy-Variable Model Equation

Given: Ŷi = β̂0 + β̂1X1i + β̂2X2i, where Y = starting salary of college grads, X1 = GPA, and X2 = 0 if Male, 1 if Female.

Males (X2 = 0): Ŷi = β̂0 + β̂1X1i + β̂2(0) = β̂0 + β̂1X1i

Females (X2 = 1): Ŷi = β̂0 + β̂1X1i + β̂2(1) = (β̂0 + β̂2) + β̂1X1i

Dummy-Variable Model Relationships

Figure: two parallel lines with the same slope β̂1; the intercept is β̂0 for Males and β̂0 + β̂2 for Females.

Dummy-Variable Model Example

Computer Output: Ŷi = 3 + 5X1i + 7X2i, with X2 = 0 if Male, 1 if Female.

Males (X2 = 0): Ŷi = 3 + 5X1i + 7(0) = 3 + 5X1i

Females (X2 = 1): Ŷi = 3 + 5X1i + 7(1) = (3 + 7) + 5X1i = 10 + 5X1i

Both groups have the same slope (5); the intercept for Females is 7 higher than for Males.
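A model like this is fit in R by including a factor as a predictor; R creates the 0/1 dummy automatically (one fewer dummy than the number of factor levels). The sketch below uses simulated placeholder salary data patterned on the example above.

set.seed(4)
grads <- data.frame(gpa    = runif(40, 2, 4),
                    gender = factor(rep(c("Male", "Female"), each = 20)))
grads$salary <- 3 + 5 * grads$gpa + 7 * (grads$gender == "Female") + rnorm(40)

fit_dummy <- lm(salary ~ gpa + gender, data = grads)   # R codes 'gender' as a single 0/1 dummy
summary(fit_dummy)   # one shared slope for gpa; the gender coefficient is the intercept shift
                     # relative to the factor's reference level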

Residual Analysis

1. Graphical Analysis of Residuals

Plot Estimated Errors vs. Xi Values

Estimated Errors (Residuals) Are the Differences Between Actual Yi & Predicted Yi

Plot Histogram or Stem-&-Leaf of Residuals

2. Purposes

Examine Functional Form (Linear vs. Non-Linear Model)

Evaluate Violations of Assumptions
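A minimal residual-analysis sketch in R, using the built-in cars data set purely for illustration:

fit <- lm(dist ~ speed, data = cars)                   # any fitted lm object works here
res <- residuals(fit)                                  # residuals: actual Y minus predicted Y

plot(cars$speed, res, xlab = "X", ylab = "Residual")   # residuals vs. the X values
abline(h = 0, lty = 2)                                 # reference line at zero
hist(res, main = "Histogram of residuals")             # rough check of the error distribution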


Linear Regression Assumptions

1. Mean of Probability Distribution of Error Is 0

2. Probability Distribution of Error Has Constant Variance

3. Probability Distribution of Error is Normal

4. Errors Are Independent

Multicollinearity

1. High Correlation Between X Variables

2. Coefficients Measure Combined Effect

3. Leads to Unstable Coefficients Depending on X Variables in Model

4. Always Exists -- Matter of Degree

5. Example: Using Both Age & Height as Explanatory Variables in Same Model

Correlation Analysis: Pearson Corr Coeff / Prob>|R| under H0: Rho=0 / N=6

              RESPONSE     ADSIZE       CIRC
RESPONSE      1.00000      0.90932      0.93117
              0.0          0.0120       0.0069
ADSIZE        0.90932      1.00000      0.74118
              0.0120       0.0          0.0918
CIRC          0.93117      0.74118      1.00000
              0.0069       0.0918       0.0

(Diagonal entries are all 1; rY1 and rY2 are the correlations of each X with Y, and r12 is the correlation between the two X variables.)

Detecting Multicollinearity

1. Examine Correlation Matrix

Correlations Between Pairs of X Variables Are Higher Than Their Correlations With the Y Variable

2. Examine Variance Inflation Factor (VIF)

If VIFj > 5, Multicollinearity Exists

3. Few Remedies

Obtain New Sample Data

Eliminate One Correlated X Variable

Correlation Matrix Computer Output

Variance Inflation Factors Computer Output

Parameter estimates:
Variable   DF   Estimate   Std Error   T for H0: Param=0   Prob>|T|
INTERCEP    1    0.0640      0.2599          0.246           0.8214
ADSIZE      1    0.2049      0.0588          3.656           0.0399
CIRC        1    0.2805      0.0686          4.089           0.0264

Variance inflation:
Variable   DF   Inflation
INTERCEP    1     0.0000
ADSIZE      1     2.2190
CIRC        1     2.2190

Here VIF ≤ 5 for each X variable, so multicollinearity is not judged severe.

Extrapolation

Figure: predictions made within the relevant range of the observed X values are interpolation; predictions made outside that range are extrapolation.
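Both detection steps can be carried out in R. The sketch below uses simulated placeholder data with two deliberately correlated predictors; vif() from the car package (if installed) reports the same quantity that the formula VIFj = 1/(1 − Rj²) computes directly.

set.seed(5)
x1 <- rnorm(30)
x2 <- x1 + rnorm(30, sd = 0.1)               # x2 nearly duplicates x1
y  <- 1 + x1 + x2 + rnorm(30)
dat <- data.frame(y, x1, x2)

cor(dat[, c("x1", "x2")])                    # step 1: correlation matrix of the X variables
fit <- lm(y ~ x1 + x2, data = dat)
1 / (1 - summary(lm(x1 ~ x2, data = dat))$r.squared)   # VIF for x1, computed directly
# car::vif(fit)                              # same idea via the 'car' package; VIF > 5 flags trouble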


13. Categorical Data Analysis

Learning Objectives

1. Explain χ² Test for Proportions

2. Explain χ² Test of Independence

3. Solve Hypothesis Testing Problems: Two or More Population Proportions; Independence

Data Types

Figure (Data Types): Data → Quantitative (Discrete, Continuous) or Qualitative.

Qualitative Data

1. Qualitative Random Variables Yield Responses That Classify

Example: Gender (Male, Female)

2. Measurement Reflects # in Category

3. Nominal or Ordinal Scale

4. Examples

Do You Own Savings Bonds?

Do You Live On-Campus or Off-Campus?

Hypothesis Tests Qualitative Data

Figure (Hypothesis Tests for Qualitative Data): Proportion: 1 population, Z Test; 2 populations, Z Test; 2 or more populations, χ² Test. Independence: χ² Test.

Chi-Square (χ²) Test for k Proportions

1. Tests Equality (=) of Proportions Only

Example: p1 = .2, p2 = .3, p3 = .5

2. One Variable With Several Levels

3. Assumptions

Multinomial Experiment

Large Sample Size (All Expected Counts ≥ 5)

4. Uses One-Way Contingency Table

Multinomial Experiment

1. n Identical Trials

2. k Outcomes to Each Trial

3. Constant Outcome Probability, pk

4. Independent Trials

5. Random Variable Is Count, nk

6. Example: Ask 100 People (n) Which of 3 Candidates (k) They Will Vote For

One-Way Contingency Table

Candidate:             Tom    Bill    Mary    Total
Number of responses:    35      20      45      100

Outcomes (k = 3)

1. Shows # Observations in k Independent Groups (Outcomes or Variable Levels)

χ² Test for k Proportions

1. Hypotheses

H0: p1 = p1,0, p2 = p2,0, ..., pk = pk,0

Ha: Not all pi are equal

2. Test Statistic

χ² = Σ (all cells) [ni − E(ni)]² / E(ni), where ni is the observed count, E(ni) = n·pi,0 is the expected count, k is the number of outcomes, and pi,0 is the hypothesized probability.

3. Degrees of Freedom: k − 1

χ² Test Basic Idea

1. Compares Observed Count to Expected Count If Null Hypothesis Is True

2. Closer Observed Count to Expected Count, the More Likely the H0 Is True

If ni = E(ni), χ² = 0. Do Not Reject H0

Measured by Squared Difference Relative to Expected Count

Reject H0 for Large Values of χ²

Finding Critical Value Example

What is the critical χ² value if k = 3 and α = .05?

df = k − 1 = 2, α = .05

χ² Table (Portion)

Upper Tail Area
DF     .995    …    .95     …    .05
1       ...    …   0.004    …   3.841
2      0.010   …   0.103    …   5.991

Critical value: χ²₀ = 5.991; reject H0 if χ² > 5.991.

χ² Test for k Proportions Example

As personnel director, you want to test the perception of fairness of three methods of performance evaluation. Of 180 employees, 63 rated Method 1 as fair, 45 rated Method 2 as fair, and 72 rated Method 3 as fair. At the .05 level, is there a difference in perceptions?

χ² Test for k Proportions Solution

H0: p1 = p2 = p3 = 1/3

Ha: At least one pi is different

α = .05

n1 = 63, n2 = 45, n3 = 72

Critical Value: χ²₀ = 5.991 (df = k − 1 = 2); reject H0 if χ² > 5.991

Test Statistic: χ² = 6.3 (computed below)

Decision: Reject H0 at α = .05

Conclusion: There is evidence of a difference in proportions

E(ni) = n·pi,0

E(n1) = E(n2) = E(n3) = 180(1/3) = 60

χ² = Σ (all cells) [ni − E(ni)]² / E(ni)

   = [n1 − 60]²/60 + [n2 − 60]²/60 + [n3 − 60]²/60

   = [63 − 60]²/60 + [45 − 60]²/60 + [72 − 60]²/60

   = 6.3
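The same test can be run in R with chisq.test, supplying the hypothesized proportions through the p argument; it should reproduce the statistic computed above (χ² = 6.3, df = 2), with a p-value of roughly 0.04, consistent with rejecting H0 at α = .05.

observed <- c(63, 45, 72)                    # counts for Methods 1, 2, 3 (n = 180)
chisq.test(observed, p = c(1/3, 1/3, 1/3))   # H0: p1 = p2 = p3 = 1/3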

χ² Test of Independence

1. Shows If a Relationship Exists Between 2 Qualitative Variables

One Sample Is Drawn

Does Not Show Causality

2. Assumptions

Multinomial Experiment

All Expected Counts ≥ 5

3. Uses Two-Way Contingency Table

χ² Test of Independence Contingency Table


              House Location
House Style   Urban   Rural   Total
Split-Level     63      49     112
Ranch           15      33      48
Total           78      82     160

1. Shows # Observations From 1 Sample Jointly in 2 Qualitative Variables (rows are the levels of variable 1, columns the levels of variable 2)

χ² Test of Independence

1. Hypotheses

H0: Variables Are Independent

Ha: Variables Are Related (Dependent)

2. Test Statistic

χ² = Σ (all cells) [nij − E(nij)]² / E(nij)

Degrees of Freedom: (r − 1)(c − 1)


Computing expected cell counts

The null hypothesis is that there is no relationship between row variable and column variable in the population. The alternative hypothesis is that these two variables are related.

Here is the formula for the expected cell counts under the hypothesis of “no relationship”.

Expected Cell Counts

Expected count = (row total × column total) / n

The null hypothesis is tested by the chi-square statistic, which compares the observed counts with the expected counts:

X² = Σ (observed − expected)² / expected
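The formula translates directly into R. The sketch below applies it to the house style / house location counts shown above, then cross-checks against chisq.test with the continuity correction turned off so that the plain Pearson statistic is reported.

observed <- matrix(c(63, 49,
                     15, 33), nrow = 2, byrow = TRUE)   # split-level / ranch by urban / rural
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)   # row total * column total / n
X2 <- sum((observed - expected)^2 / expected)           # Pearson chi-square statistic
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
pchisq(X2, df, lower.tail = FALSE)                      # P-value: P(chi-square >= X2)
chisq.test(observed, correct = FALSE)$statistic         # same statistic from the built-in test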

Under the null hypothesis, X² has approximately the χ² distribution with (r − 1)(c − 1) degrees of freedom. The P-value for the test is

P(χ² ≥ X²)

where χ² is a random variable having the χ²(df) distribution with df = (r − 1)(c − 1).

Figure. Chi-Square Test for Two-Way Tables


Example In a study of heart disease in male federal employees, researchers classified 356 volunteer subjects according to their socioeconomic status (SES) and their smoking habits. There were three categories of SES: high, middle, and low. Individuals were asked whether they were current smokers, former smokers, or had never smoked, producing three categories for smoking habits as well. Here is the two-way table that summarizes the data:

This is a 3×3 table, to which we have added the marginal totals obtained by summing across rows and columns. For example, the first-row total is 51+22+43=116. The grand total, the number of subjects in the study, can be computed by summing the row totals, 116+141+99=356, or the column totals, 211+52+93=356.

Observed counts for smoking and SES

                     SES
Smoking    High   Middle   Low   Total
Current     51      22      43    116
Former      92      21      28    141
Never       68       9      22     99
Total      211      52      93    356

Example: What is the expected count in the upper-left cell in the table of the previous example, corresponding to high-SES current smokers, under the null hypothesis that smoking and SES are independent?

The row total, the count of current smokers, is 116. The column total, the count of high-SES subjects, is 211. The total sample size is n = 356. The expected number of high-SES current smokers is therefore

(116)(211)/356 = 68.75

We summarize these calculations in a table of expected counts:

Expected counts for smoking and SES

                     SES
Smoking    High    Middle    Low      All
Current    68.75    16.94   30.30   115.99
Former     83.57    20.60   36.83   141.00
Never      58.68    14.46   25.86    99.00
Total     211.00    52.00   92.99   355.99

Computing the chi-square statistic

The expected counts are all large, so we proceed with the chi-square test. We compare the table of observed counts with the table of expected counts using the X² statistic. We must calculate the term for each cell, then sum over all nine cells. For the high-SES current smokers, the observed count is 51 and the expected count is 68.75. The contribution to the X² statistic for this cell is

(51 − 68.75)²/68.75 = 4.583

Similarly, the calculation for the middle-SES current smokers is

(22 − 16.94)²/16.94 = 1.511

The X² statistic is the sum of nine such terms:

X² = Σ (observed − expected)²/expected

   = (51 − 68.75)²/68.75 + (22 − 16.94)²/16.94 + (43 − 30.30)²/30.30

   + (92 − 83.57)²/83.57 + (21 − 20.60)²/20.60 + (28 − 36.83)²/36.83

   + (68 − 58.68)²/58.68 + (9 − 14.46)²/14.46 + (22 − 25.86)²/25.86

   = 4.583 + 1.511 + 5.323 + 0.850 + 0.008 + 2.117 + 1.480 + 2.062 + 0.576 = 18.51

Because there are r = 3 smoking categories and c = 3 SES groups, the degrees of freedom for this statistic are

(r − 1)(c − 1) = (3 − 1)(3 − 1) = 4

Under the null hypothesis that smoking and SES are independent, the test statistic X² has the χ²(4) distribution. To obtain the P-value, refer to the row in the table corresponding to 4 df.

The calculated value X² = 18.51 lies between the upper critical points corresponding to probabilities 0.001 and 0.0005. The P-value is therefore between 0.0005 and 0.001. Because the expected cell counts are all large, the P-value from Table F will be quite accurate. There is strong evidence (X² = 18.51, df = 4, P < 0.001) of an association between smoking and SES in the population of federal employees.
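The whole calculation can be reproduced in R from the observed table; chisq.test should return X-squared = 18.51 on 4 df with P < 0.001, and its $expected component should match the table of expected counts above.

smoke <- matrix(c(51, 22, 43,
                  92, 21, 28,
                  68,  9, 22),
                nrow = 3, byrow = TRUE,
                dimnames = list(Smoking = c("Current", "Former", "Never"),
                                SES     = c("High", "Middle", "Low")))
chisq.test(smoke)              # Pearson chi-square test of independence
chisq.test(smoke)$expected     # expected counts under independence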

Expected Count Calculation

Expected count = (Row total × Column total) / Sample size

Split-Level, Urban: 112·78/160 = 54.6    Split-Level, Rural: 112·82/160 = 57.4

Ranch, Urban: 48·78/160 = 23.4    Ranch, Rural: 48·82/160 = 24.6

                      House Location
                  Urban           Rural
House Style    Obs.   Exp.     Obs.   Exp.    Total
Split-Level     63    54.6      49    57.4     112
Ranch           15    23.4      33    24.6      48
Total           78    78        82    82       160

χ² Test of Independence Example

You're a marketing research analyst. You ask a random sample of 286 consumers if they purchase Diet Pepsi or Diet Coke. At the .05 level, is there evidence of a relationship?

χ² Test of Independence Solution

H0: No Relationship

Ha: Relationship

α = .05

df = (2 − 1)(2 − 1) = 1

Critical Value: χ²₀ = 3.841; reject H0 if χ² > 3.841

Test Statistic: χ² = 54.29

Decision: Reject H0 at α = .05

Conclusion: There is evidence of a relationship

χ² Test of Independence Thinking Challenge

OK. There is a statistically significant relationship between purchasing Diet Coke & Diet Pepsi. So what do you think the relationship is? Aren’t they competitors?

You Re-Analyze the Data

Low Income:
              Diet Pepsi
Diet Coke    No    Yes   Total
No           80      2     82
Yes           8    120    128
Total        88    122    210

High Income:
              Diet Pepsi
Diet Coke    No    Yes   Total
No            4     30     34
Yes          40      2     42
Total        44     32     76

True Relationships*

Figure: the apparent relation between Diet Coke and Diet Pepsi purchases reflects an underlying causal relation running through a control or intervening variable (the true cause), here household income.

Conclusion

1. Explained χ² Test for Proportions

2. Explained χ² Test of Independence

3. Solved Hypothesis Testing Problems

Two or More Population Proportions; Independence

Page 343: Report Engineering Probability and Statistics Ahmedawad

Using R-Web Software

Consider University of Illinois business school data:

Major Female Male

Accounting 68 56

Administration 91 40

Economics 5 6

Finance 61 59

We wish to determine if the proportion female differs between the four majors.

This is a test of the null hypothesis Ho: p_ac=p_ad=p_e=p_f

We use the Pearson χ² statistic, as in previous problems.

If the test gives a small p-value, how do we determine if the groups differ?

χ² Contributions

Answer: We look at a table of contributions to the χ² statistic.

Cells with large values are contributing greatly to the overall discrepancy between the observed and expected counts.

Large values tell us which cells to examine more closely.

Residuals

As we have seen previously in regression problems, we can measure the deviation of what was observed from what is expected under H0 by using a residual.

Residuali = (Oi − Ei) / √Ei


Residual Usage

Think of these residuals as being on a standard normal scale.

For example, a residual of -3.26 means the observed count was far less than what would be expected under H0.

A residual of 2.58 means the cell’s observed value was far above what would be expected under Ho.

A residual like .24 or -.39 means the cell is not far from what would be expected under Ho.

The sign + or – of the residual tells if the observed cell count was above or below what is expected under Ho.

Abnormally large (in absolute value) residuals will also have large contributions to χ².

Input the Table


The R-Web command for inputting the Illinois student table data is:

x <- matrix(c(68, 56, 91, 40, 5 , 6, 61, 59), nc = 2, byrow=T)

This means input the cell counts by rows, where the table has 2 columns, (nc=2).

Obtaining Test Statistic & P-Val

chisq.test(x)

This command produces the Pearson χ² test statistic, p-value, and degrees of freedom.

Contributions to χ²

To find the cells that contribute most to the rejection of the Ho, type :

chisq.test(x)$residuals^2

Residuals

Type: chisq.test(x)$residuals

Observed & Expected Tables


Type: chisq.test(x)$observed

chisq.test(x)$expected

These will help you understand the table behavior.

Example

Submit these commands:

x <- matrix(c(68, 56, 91, 40, 5 , 6, 61, 59), nc = 2, byrow=T)

chisq.test(x)

chisq.test(x)$residuals^2

chisq.test(x)$residuals

chisq.test(x)$observed

chisq.test(x)$expected

Pearson's Chi-squared test

data:  x
X-squared = 10.8267, df = 3, p-value = 0.0127

Rweb:> chisq.test(x)$residuals^2
          [,1]      [,2]
[1,] 0.2534128 0.3541483
[2,] 2.8067873 3.9225288
[3,] 0.3109070 0.4344974
[4,] 1.1447050 1.5997431

Rweb:> chisq.test(x)$residuals
           [,1]       [,2]
[1,] -0.5034012  0.5951036
[2,]  1.6753469 -1.9805375

[3,] -0.5575903  0.6591641
[4,] -1.0699089  1.2648095

Rweb:> chisq.test(x)$observed
     [,1] [,2]
[1,]   68   56
[2,]   91   40
[3,]    5    6
[4,]   61   59

Rweb:> chisq.test(x)$expected
          [,1]      [,2]
[1,] 72.279793 51.720207
[2,] 76.360104 54.639896
[3,]  6.411917  4.588083
[4,] 69.948187 50.051813

Example Conclusion

First, note that the p-value for the test is small; this is evidence that the proportions of females differ among the four majors.

How do they differ? From the contributions to χ² and the residuals we see that the second row (Administration) has the biggest discrepancy between observed and expected counts.

From either the residuals or the observed vs expected tables we see that females are much more likely to major in administration than would be expected and males less likely than expected under the Ho.

The administration proportion is much higher than the others for females, and this is the primary major that produces the evidence that the majors differ.


14. Nonparametric Statistics

Learning Objectives

1. Distinguish Parametric & Nonparametric Test Procedures

2. Explain a Variety of Nonparametric Test Procedures

3. Solve Hypothesis Testing Problems Using Nonparametric Tests

4. Compute Spearman’s Rank Correlation

Hypothesis Testing Procedures

Figure: Hypothesis Testing Procedures → Parametric (Z Test, t Test, One-Way ANOVA) and Nonparametric (Wilcoxon Rank Sum Test, Kruskal-Wallis H-Test).

Parametric Test Procedures

1. Involve Population Parameters

Example: Population Mean

2. Require Interval Scale or Ratio Scale

Whole Numbers or Fractions

Example: Height in Inches (72, 60.5, 54.7)

3. Have Stringent Assumptions

Example: Normal Distribution

4. Examples: Z Test, t Test, χ² Test

Nonparametric Test Procedures

1. Do Not Involve Population Parameters

Example: Probability Distributions, Independence

2. Data Measured on Any Scale

Ratio or Interval

Ordinal Example: Good-Better-Best

Nominal Example: Male-Female

3. Example: Wilcoxon Rank Sum Test
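For instance, a Wilcoxon rank sum test is available in base R as wilcox.test; the two small samples below are hypothetical placeholder values, used only to show the call.

group_a <- c(71, 65, 83, 59, 77)     # hypothetical sample from population A
group_b <- c(62, 54, 68, 49, 57)     # hypothetical sample from population B
wilcox.test(group_a, group_b)        # rank-based test of a shift in location between the groups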

Advantages of Nonparametric Tests

1. Used With All Scales

2. Easier to Compute

Developed Originally Before Wide Computer Use

3. Make Fewer Assumptions

4. Need Not Involve Population Parameters

5. Results May Be as Exact as Parametric Procedures
