
Math 321 - Dr. Minnotte


Introduction: What is Statistics?

Definition: Statistics is the science of measurement and decision-making under conditions of uncertainty, randomness, and variability.

More briefly: Statistics is the field of dealing with data.

In statistics, we make observations, to collect information, to help make decisions.

If that sounds familiar, it should. We do that sort of thing every day, in every field of study, and in our everyday life.

In statistics, we simply formalize this process mathematically. This allows us to recognize smaller differences than might otherwise be found, and to make decisions under conditions of greater uncertainty.


The term statistic is also used to describe any bit of numerical information, like the 6.3% unemployment rate in April, 2014 or the 15,143 students enrolled at UND in Fall, 2013.

These numerical bits of data are thrown at us every time we read the newspaper, or watch TV news, or read a journal in our field.

Just as words should be read with understanding, so should statistics. If we uncritically accept the numbers others give us, we open ourselves to believing misinformation.



Statistics is an important tool in almost every field. In this class, we'll look at examples like:

How can doctors tell if a new vaccine really works?

How can irrigation engineers use past river flow rates to predict future flows?

How can polltakers use responses from a few thousand voters to predict the results of an election in which more than a hundred million people vote?

What are some other examples of statistics in practice?


The Challenger Disaster: A Statistical Cautionary Tale

In 1986, a lack of statistical thinking contributed to a tragedy: the explosion of the space shuttle Challenger.

The destruction of the Challenger killed seven astronauts, including Christa McAuliffe, a 37-year-old teacher selected to be the first teacher in space, and set the U.S. manned space program back several years.




The solid rocket motors used to launch the space shuttles are shipped to the Kennedy Space Center in four pieces. Large rubber O-rings are used to seal the three joints between the pieces.

The Challenger explosion occurred when one of the O-rings failed to seal quickly enough to prevent hot gases from escaping from the rocket and igniting the large external fuel tank.

Implicated in the failure was the unusually cold (for Florida) launch temperature of 29°F.


The night before the launch, forecasters predicted a temperature of 31°F for the launch time.

A three-hour teleconference took place between people at:

Morton Thiokol (manufacturer of the rocket motors)

Marshall Space Flight Center (NASA center for motor design control), and

Kennedy Space Center.


There was concern that the cold temperatures could lead to problems with the O-rings.

In 7 out of 23 previous launches, some O-ring damage had occurred.

Some participants recommended delaying the launch until the temperature rose above 53°F, the lowest previous launch temperature, at which the greatest number of damaged O-rings had occurred.



In the end, the recommendation was made to launch on schedule, in part because of the following plot.

The plot shows temperature vs. number of damaged O-rings for the 7 affected launches.

The relationship seems limited, at most.

What error was made preparing this plot?





By only including the launches in which incidents occurred, the investigators left out some important information!

When the data from all 23 launches is plotted, a temperature dependence becomes obvious.

All 4 of the launches below 66°F had damage. Only 3 of the other 19 flights suffered damage.

Note where 31°F or 29°F would appear on that plot.


More sophisticated analyses are possible, but unnecessary.

Had the concerned engineers presented the complete data in such a format, they might well have convinced the decision-makers to delay the launch and prevented the tragedy.

There's more to this story, so we'll return to it later in the semester.



Chapter 1: Univariate Data

Populations and Samples

Definition: A population consists of all potential observations from a distribution of interest.

In an enumerative study, the population will be tangible, real and finite, and might be represented by a sampling frame listing the members of the population.

o Examples include populations of people, or corporations, or items in a shipment.



In an analytic study, we study an ongoing process, and the conceptual population is infinite and simply a useful theoretical construct. No sampling frame is possible.

o Examples include populations of rainfall over time, or objects coming off an ongoing assembly line, or repeated measurements of the same underlying weight.

As an investigator, you have a great deal of flexibility in defining the population of interest.


Example: We are interested in the ages of UND students. What are some possible relevant populations?

Example: A quality engineer wishes to study the volume of milk in containers coming off a production line. What are possible populations?

Example: We wish to examine the incidence of obesity in preteen children. What is an appropriate population?


Once we have defined our population, we take a sample from that population.

Measurements from each member of the sample will be the observations which make up the dataset we will analyze.

Example: Student ages.


Experiments

Suppose that a chemical engineer wants to determine how the concentration of a catalyst affects the yield of a process.

The engineer can run the process several times, changing the concentration each time, and compare the yields that result.

This sort of experiment is called a controlled experiment because the values of the concentration variable are under the control of the experimenter.


Observational Studies

There are many situations in which scientists cannot control the variables of interest.

Many studies have been conducted to determine the effect of cigarette smoking on the risk of lung cancer. In these studies, rates of cancer among smokers are compared with rates among nonsmokers.

The experimenter cannot control who smokes and who doesn't.

This kind of study is called an observational study.



When we study a sample, we must make sure it is representative of the population.

One option is a census, or complete enumeration, of everyone in the population. What are some problems with this approach?



Usually, the best solution is to take a random sample, choosing your sample with planned probability methods.

The most basic such method is called a simple random sample (SRS).

In a SRS, we draw individuals out of the population with the equivalent of drawing names out of a (well-mixed) hat.

Each subset of the population of the appropriate size is equally likely to make up the sample.

This is theoretically convenient, but often hard to arrange in practice.
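The "names out of a hat" idea can be sketched in Python. This is only an illustration: the sampling frame of 1,000 ID numbers, the function name `simple_random_sample`, and the seed are assumptions made for the example, not part of the course material.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct members of the frame; every subset of size n is equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement, like names from a well-mixed hat
    return rng.sample(frame, n)

# Hypothetical sampling frame: ID numbers 1..1000
frame = list(range(1, 1001))
sample = simple_random_sample(frame, 25, seed=321)
print(sorted(sample))
```

Running it again with a different seed gives a different sample, which is a small-scale picture of sampling variation.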


When viewed in order, or over time, the observations of a SRS should not show any noticeable pattern or trend.

A SRS is not guaranteed to reflect the population perfectly.

SRSs always differ in some ways from each other; occasionally a sample is substantially different from the population.

This phenomenon is known as sampling variation.



The items in a sample are independent if knowing the values of some of the items does not help to predict the values of the others.

Items in a simple random sample may be treated as independent in most cases encountered in practice. The exception occurs when the population is finite and the sample comprises a large fraction (more than 5%) of the population.



Samples of Convenience

A nonrandom sample, or sample of convenience, may be easier to collect, but may be nonrepresentative in some important ways.

Such a sample may bias your results, making them worthless (or at least a whole lot less trustworthy).


Example: We are interested in the size of hometowns for all U.S. college students, but only sample at UND.

Example: We want to survey UND students on math anxiety, and pick a class to interview:

Math 321? Upper-division English?



Example: Not everyone will consent to test a new AIDS vaccine. We could give those who consent the vaccine, and leave those who don't alone to be the control group.

What about a historical control (compare vaccinated group with past infection rates)?

Terminology and Notation

From each individual person or object in our sample, we are generally interested only in a small number of characteristics.

Each characteristic we record will be called a variable, and assigned a letter from the end of the alphabet.



Data that we collect may be of two main types:

1) Categorical: classifying the subject into one of several distinct groups.

o X = Sex
o T = Hair Color
o W = Zip Code

2) Numerical: data recorded as a number, where operations like averages make sense.

o Y = Age
o U = Rainfall
o Z = Volume of milk


We also classify datasets based on how many variables we measure on each individual.

If we only collect a single variable (e.g. age), we say the dataset is univariate.

If we collect two variables for each individual (e.g. age and sex), we say it is bivariate.

With still more variables, we say that it is trivariate, quadrivariate, and so on, or more commonly, that it is multivariate.



We often use subscripts on the variable name (letter) to indicate specific observations in a dataset, such as X1, X2, ..., Xn.

A subscript of i (occasionally j or k) indicates a specific, but arbitrary, observation.

We usually reserve the label n for the number of observations (the sample size).


There are two primary branches of statistics:

1) Descriptive statistics simply attempts to simplify and understand a dataset.

2) Inferential statistics attempts to say (infer) something about the broader population or distribution from which the data was drawn.

Descriptive statistics are simpler, so we'll start there.



Summary Statistics (1.2)

Given data X1, X2, ..., Xn, we frequently use sample statistics to summarize the dataset.

A statistic is anything which may be calculated from a dataset. A sample statistic simply makes clear that it derives from a sample.

Use of sample statistics can improve our understanding of the data, as well as make it easier to communicate with others about it.


The Sample Mean

The most important feature of a dataset to describe is generally its location, or the location of its center.

The most commonly used statistic for center is the familiar average, or sample mean.

Definition: The sample mean of data X1, X2, ..., Xn is

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$


Example: Stocks:


To understand how the mean works, suppose we were to take a very thin yardstick or similarly marked board, and place a small (equal) weight at the mark for each observation's value.

The mean may be thought of as the point where this would balance.


Outliers

An outlier is an observation which is very different from the rest of the sample. For univariate data, this means it is much larger or much smaller than the rest.

Outliers should be carefully examined. Often they are the result of measurement or recording errors.

If so, they should be fixed or deleted. Correct but unusual values, however, should be kept.



The sample mean is not robust (resistant to outliers). Changing even one observation can change the sample mean as much as we want.

Example: Mistype the final stock return as 374 (instead of 37.4). What is the sample mean now?
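The effect of a single mistyped value can be seen in a small Python sketch. The five data values below are hypothetical (they are not the stock returns used in lecture), and `sample_mean` is an illustrative helper name.

```python
def sample_mean(xs):
    """Sample mean: the sum of the observations divided by n."""
    return sum(xs) / len(xs)

# Hypothetical data (not the stock returns from the lecture)
data = [12.0, 15.5, 9.8, 14.2, 11.5]
typo = data[:-1] + [115.0]   # last value mistyped: 115 instead of 11.5

print(sample_mean(data))   # mean of the correct data
print(sample_mean(typo))   # a single bad value drags the mean far upward
```

With the decimal point misplaced in just one of five observations, the mean moves from about 12.6 to about 33.3, well above every correctly recorded value.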



Measures of Variability After center, the second-most-used

feature to describe a sample is its variability, or spread.

The simplest measure of variability is the range, the difference between the maximum and minimum values.

R = max(X) - min(X)

Unfortunately, the range both wastes most of the data, and is maximally non-robust, using only the two extreme data points, so it is rarely used.


A better solution looks at the deviations from the mean, $X_i - \bar{X}$. This removes the effect of the mean (location), and looks only at the variability around the mean.

One option: Look at the average deviation from the mean.

Problem: Positive deviations cancel out negative ones, and the average deviation from the mean is always 0.
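The cancellation follows in one line from the definition of the sample mean, $\bar{X} = \frac{1}{n}\sum X_i$, which gives $\sum X_i = n\bar{X}$:

```latex
\sum_{i=1}^{n} (X_i - \bar{X})
  = \sum_{i=1}^{n} X_i - n\bar{X}
  = n\bar{X} - n\bar{X}
  = 0
```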




We could take absolute values of the deviations, but for a few theoretical reasons, it's better to look at the squared deviations instead.

Definition: The sample variance, $s^2$, measures the spread of a dataset:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$

Definition: The sample standard deviation, s, is the square root of the sample variance:

$$s = \sqrt{s^2}$$


Use of the definition formula is tedious, as it requires finding and squaring each of the n deviations from the mean.

It is usually simpler to calculate $s^2$ using the following computation formula:

$$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$$
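As a sketch, both routes to the sample variance can be compared in Python: the definition (sum of squared deviations divided by n - 1) and the computation form (sum of squares minus n times the squared mean, divided by n - 1). The data and function names are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import math

def variance_definition(xs):
    """s^2 from the definition: sum of squared deviations over n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_computation(xs):
    """s^2 from the computation form: (sum of squares - n * mean^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)

data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]   # hypothetical data, mean = 5
s2 = variance_definition(data)
s = math.sqrt(s2)
print(s2, variance_computation(data), s)
```

Here the squared deviations sum to 24, so both functions give 24/5 = 4.8. The computation form is convenient by hand; in floating-point software the definition form is usually the safer choice, because subtracting two large, nearly equal numbers can lose precision.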


Example: What are the variance and standard deviation of the stocks data?



The sample variance and standard deviation are measures of the spread of a dataset, and estimates of the variance and standard deviation of the underlying population or distribution.

Like the sample mean, they are not robust.

Example: Stocks, replace 37.4 with 374: $s^2$ = ? s = ?


While very useful practically and theoretically, the variance and standard deviation are a little tricky intuitively.

One helpful rule of thumb:

About 2/3 of data should fall in $\bar{X} \pm s$.

About 95% of data should fall in $\bar{X} \pm 2s$.

Almost all data should fall in $\bar{X} \pm 3s$.

Example: Stock data:

If X1, ..., Xn is a sample, and Yi = a + b Xi, where a and b are constants, then

$$\bar{Y} = a + b\bar{X}, \qquad s_Y^2 = b^2 s_X^2, \qquad s_Y = |b|\, s_X$$

This is most commonly needed if we change units for our data.
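A change of units can be checked numerically. In this sketch the Celsius readings are hypothetical, and the standard-library `statistics.stdev` is used because it computes the sample standard deviation (divisor n - 1). The check is that the mean of Y = a + bX is a + b times the mean of X, while the standard deviation is only multiplied by |b|.

```python
import statistics

celsius = [18.0, 22.5, 25.0, 30.5, 34.0]     # hypothetical temperatures
a, b = 32.0, 9.0 / 5.0                        # Fahrenheit = 32 + (9/5) Celsius
fahrenheit = [a + b * x for x in celsius]

mean_c, sd_c = statistics.mean(celsius), statistics.stdev(celsius)
mean_f, sd_f = statistics.mean(fahrenheit), statistics.stdev(fahrenheit)

print(mean_f, a + b * mean_c)    # the mean shifts and scales
print(sd_f, abs(b) * sd_c)       # the standard deviation only scales
```

The additive constant a moves every observation by the same amount, so it changes the center but not the spread.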



Example: Let X1, ..., Xn be a sample of temperatures measured in degrees Celsius, with $\bar{X} = 30$. Let Y1, ..., Yn be the same temperatures in degrees Fahrenheit, Yi = 9/5 Xi + 32. What is $\bar{Y}$?

Example: Let the variance of the Celsius temperatures be $s_X^2 = 25$.

What is the standard deviation? What is the variance of the Fahrenheit temperatures? The s.d.?



Order Statistics and Robust Measures of Center and Spread

Definition: The ith order statistic, X(i), is the ith smallest value when the X's are sorted. The minimum is X(1), the second smallest X(2), and so on up to the maximum, X(n).


Example: Stock data (sorted):

X(1) = -7.2, X(4) = 1.3, X(20) = 37.4, and so on.

Because outliers will always be in the first or last few order statistics, values computed from middle order statistics will be very robust.



Definition: The sample median is the middle of the sorted data.

If n is odd, the sample median is the ((n+1)/2)th order statistic.

If n is even, it is the average of the (n/2)th and ((n+2)/2)th order statistics.

Example: Stocks: sample median = ?


The sample median has 50% of the data on either side of it.

The sample median is very robust; changing one or a few observations won't change it much, if at all.

Example: Stocks: Replace 37.4 with 374, and the sample median remains 17.6.


Quartiles

The quartiles of the data divide the sample into quarters.

The first quartile, Q1, splits the lowest quarter of the sample from the rest.

If (n+1)/4 is an integer, Q1 is the ((n+1)/4)th order statistic.

If (n+1)/4 is not an integer, Q1 is the average of the two order statistics on either side.

The third quartile, Q3, splits the highest quarter from the rest.

Find it as for Q1, but using 3(n+1)/4.



Example: Sorted stocks:

Q1 = ? Q3 = ?


Definition: The sample interquartile range is a robust measure of spread, found as the difference between the sample quartiles, IQR = Q3 - Q1.

Example: Stocks: IQR = ?

Note: Changing 37.4 to 374 doesn't change Q1, Q3, or IQR.


Percentiles

Definition: The pth sample percentile has (roughly) p% of the data below it, and (100-p)% above it.

Compute p(n + 1)/100. If this is an integer, use that order statistic. If not, average the two closest order statistics.

The median and quartiles are just special names for the 50th, 25th, and 75th percentiles.
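The position rule p(n + 1)/100 translates directly into code. The sketch below is one possible reading of that rule; the names `order_statistic_rule` and `percentile`, the data, and the clamping of positions that fall outside 1..n (which the rule itself does not address) are all assumptions of this example.

```python
import math

def order_statistic_rule(xs, position):
    """Value at a (possibly fractional) 1-based position in the sorted data.

    Integer position: that order statistic. Otherwise: the average of the
    two neighboring order statistics.
    """
    s = sorted(xs)
    n = len(s)
    # Clamp to 1..n; this only matters for extreme p and is an assumption here
    lo = min(max(math.floor(position), 1), n)
    hi = min(max(math.ceil(position), 1), n)
    return (s[lo - 1] + s[hi - 1]) / 2

def percentile(xs, p):
    """pth sample percentile using the position p(n + 1)/100."""
    return order_statistic_rule(xs, p * (len(xs) + 1) / 100)

data = [3, 7, 8, 12, 13, 14, 18, 21, 25, 30, 31]   # hypothetical data, n = 11
q1, median, q3 = (percentile(data, p) for p in (25, 50, 75))
print(q1, median, q3, q3 - q1)
```

With n = 11 the positions are 3, 6, and 9, so the quartiles and median are the 3rd, 6th, and 9th order statistics. Statistical packages often interpolate between neighbors instead of averaging them, so their quartiles may differ slightly from this rule.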


Example: Descriptive Statistics in Minitab

Descriptive Statistics: Stock Returns 1976-1995

Variable           Mean  StDev  Variance  Minimum    Q1  Median     Q3  Maximum
Stock Returns 19  15.37  13.66    186.49    -7.20  5.48   17.60  28.90    37.40

Variable            IQR
Stock Returns 19  23.43



Basic Statistical Graphics (1.3)

Some of the most powerful tools available for understanding a dataset are graphics which we can use to look at our data.

It's very hard to get much useful out of large tables or long columns of numbers. But the human eye is very good at picking out patterns in pictures.


Bar Charts

Given categorical data, the most useful plot available is usually a simple bar chart.

A bar is drawn for each category, with the height proportional to the count (frequency) or percentage found in that category.

Other measurements for each category may also be compared.



Example: Television Picture Grades (Perfect, Good, Satisfactory, Fail)


Category      Count
Perfect          64
Good             47
Satisfactory     33
Fail              6
Total           150


Spaces between the bars show categories.

Bars should start at 0 and show full height (no truncation!). Otherwise, relative heights get distorted.




Unless there is a strong natural ordering (e.g. poor-fair-good-excellent; not alphabetical), bars should be sorted in ascending or descending order. This makes comparisons between close values much easier.





Many categories or long category names may be better served by horizontal bars.


3-D perspective looks fancy but hurts clarity; it is usually a bad idea.


A stacked bar chart includes a second categorical variable, but focuses on the totals for the main category of the bars.


[Figure: stacked bar chart, "Individuals on the Titanic". One bar each for 1st Class, 2nd Class, 3rd Class, and Crew, split into Survived and Died; vertical axis from 0 to 1000.]

A clustered bar chart focuses on the counts of the specific combinations of categories, and is useful for comparing the distribution of one variable for different values of the other.


[Figure: clustered bar chart. Side-by-side Died and Survived bars for 1st Class, 2nd Class, 3rd Class, and Crew; vertical axis from 0 to 800.]

Example Minitab Bar Charts






Pie Charts

The pie chart is the other common chart for categorical data.

A pie chart should only be used when the categories represent (all of the) parts of some whole, and so should always plot percentages.



Each category's slice gets an angle equal to

$$360^\circ \times \frac{\text{category percentage}}{100}$$


Comparing angles is much more difficult than comparing heights or lengths. Bar charts are almost always more effective.

3-D pie charts are the work of the devil. (Probably worse than no chart.)

Minitab:



Dotplots

Dotplots are simple plots which are very useful for looking at univariate numeric data, especially when the sample size is small or there are many ties in the data.

Each observation is plotted at its location above an appropriate number line. If there are ties, one dot is stacked for each tied observation.


Example: Temperature (°F) at launch of the first 25 space shuttle launches.


66 70 69 80 68
67 72 73 70 57
63 78 70 67 53
75 67 70 81 76
79 75 76 58 31


Histograms

A histogram is a bar chart for numerical data.

The shape of the histogram describes the shape of the distribution of the data.

If you have a large, randomly collected sample, the shape is also descriptive of the population the sample was taken from.

Your book also describes stem-and-leaf plots, which are similar, but rarely used.



Constructing a Histogram

1) Find the minimum and maximum of the data.

2) Break that interval into class intervals.

5-20 classes is often a good start. More for large samples, fewer for small ones.

A reasonable rule of thumb is

Select your classes so that each is of equal width.


3) Find the frequencies (counts, ni) and relative frequencies (fi = ni/n) in each class.

4) Plot the bar chart with a bar over each class whose height equals fi or ni.
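Steps 1 through 3 can be sketched in a few lines of Python. The data, the function name `histogram_counts`, and the convention that each class includes its left endpoint (with the maximum placed in the last class) are assumptions of this example; it also assumes the data are not all equal, so the class width is positive.

```python
def histogram_counts(xs, num_classes):
    """Frequencies n_i and relative frequencies f_i = n_i / n for equal-width classes."""
    lo, hi = min(xs), max(xs)               # step 1: minimum and maximum
    width = (hi - lo) / num_classes         # step 2: equal-width classes
    counts = [0] * num_classes
    for x in xs:
        # the maximum value would land one past the last class, so clamp it
        i = min(int((x - lo) / width), num_classes - 1)
        counts[i] += 1
    n = len(xs)
    return counts, [c / n for c in counts]  # step 3: n_i and f_i

# Hypothetical data
data = [1.2, 1.9, 2.3, 2.8, 3.1, 3.3, 3.6, 4.0, 4.4, 5.7, 6.2, 8.9]
counts, rel = histogram_counts(data, 4)
print(counts)
print(rel)
```

The counts fall off toward the larger values, with one observation alone in the top class, which is the pattern a right-skewed histogram shows. Calling the function with a different number of classes is a quick way to see how much the picture depends on that choice.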


Example: Stock Data (Annual Rate of Return, 1976-1995):



The shape of the histogram tells us about the distribution. Some things to look for include:

Is the distribution left-skewed? Symmetric? Right-skewed?


Is the distribution bimodal?

Multimodal?

Are there any outliers?


It's a good idea to look at several choices of bin width and location, as different choices here can produce dramatically different histograms.

Features that remain in many histograms are likely to be trustworthy; those that only appear sometimes are less certain.



Example: Milk Fill Weights Data







Boxplots

Definition: A boxplot is another graphical tool for displaying a sample:



The box goes from the first to the third quartile, with a line at the median.

For boxplots, outliers are usually defined as any values below Q1 - 1.5 IQR or above Q3 + 1.5 IQR. Those points are marked individually.

The whiskers go from the quartiles to the least and greatest values among the non-outliers.
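The boxplot ingredients can be computed for the 25 launch temperatures from the dotplot example. This is a sketch: the helper names are illustrative, and the quartiles use the k(n+1)/4 position rule with neighbor averaging (software may compute quartiles slightly differently).

```python
import math

def quartile(sorted_xs, k):
    """k-th quartile (k = 1 or 3) from the k(n+1)/4 rule; assumes n >= 3."""
    n = len(sorted_xs)
    pos = k * (n + 1) / 4
    lo, hi = math.floor(pos), math.ceil(pos)
    return (sorted_xs[lo - 1] + sorted_xs[hi - 1]) / 2

def boxplot_summary(xs):
    """Quartiles, whisker ends, and outliers using the 1.5 IQR fences."""
    s = sorted(xs)
    q1, q3 = quartile(s, 1), quartile(s, 3)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in s if x < low_fence or x > high_fence]
    inside = [x for x in s if low_fence <= x <= high_fence]
    return {"q1": q1, "q3": q3, "whiskers": (inside[0], inside[-1]), "outliers": outliers}

# Launch temperatures (degrees F) for the first 25 shuttle launches
temps = [66, 70, 69, 80, 68, 67, 72, 73, 70, 57, 63, 78, 70,
         67, 53, 75, 67, 70, 81, 76, 79, 75, 76, 58, 31]
summary = boxplot_summary(temps)
print(summary)
```

Under this rule Q1 = 66.5 and Q3 = 75.5, so the fences sit at 53 and 89. The only value outside them is 31, the Challenger launch temperature, which would be plotted as an individual point.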


Boxplots are much less informative than histograms for a single distribution, so the histogram is usually preferable.

On the other hand, comparing histograms is difficult, while comparing boxplots is easy.

Use boxplots to compare 2-20 (or more) distributions.


Example: Fish length data



Example: Circuit board data by board.


Ch. 2: Bivariate Data

Statistics is most powerful when looking at relationships between variables.

In the simplest case, this involves looking at pairs of measurements made on the same subjects, (x, y).

Recall, such data is called bivariate (two variables).


Examples:

Heights and weights of a group of people.

ACT score and Freshman GPA for college students.

January and April average temperatures for many years at a specified location.

January and February inflows of the Nile river at a location.



We usually picture our variables in a cause-and-effect relationship.

The explanatory (independent, predictor) variable, x, is assumed to play some role in determining the value of the response (dependent) variable, y.

x → y


Scatterplots (2.1)

Definition: A scatterplot is the most common graph for displaying bivariate data. It consists of plotting each point at (xi, yi), on a standard x-y graph.

The pattern formed by the points describes the relationship between the variables.




Minitab Scatterplot:

Correlation

Suppose we have a sample of (x, y) pairs and compute the sample means, $\bar{x}$ and $\bar{y}$.

For each observation (xi, yi), compute the product of the two deviations from the means, $(x_i - \bar{x})(y_i - \bar{y})$.

Dividing the scatterplot at the means results in two quadrants where the product is positive, and two where it is negative.




For a scatterplot with a positive relationship, most of the products will have a positive sign, and the sum will be positive.

Likewise, if the picture shows a negative relationship, the sum of the products will be negative.

Unfortunately, the exact value of the sum depends on the units and spread (as measured by standard deviation) of the variables.



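Dividing by measures of spread for x and y solves this issue.

Then

$$r = \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

is a good, unitless measure of the linear relationship between x and y called the correlation coefficient.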
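As a sketch, the correlation coefficient can be computed as the sum of products of standardized deviations, divided by n - 1. The data and the function name `correlation` are hypothetical; the second call rescales and shifts both variables to illustrate that the result is unitless.

```python
import math

def correlation(xs, ys):
    """Sample correlation: sum of products of standardized deviations over n - 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0, 5.0]     # hypothetical data
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r = correlation(x, y)

# Changing units (positive rescaling plus a shift) leaves r unchanged
r_rescaled = correlation([100 * v + 7 for v in x], [0.5 * v - 3 for v in y])
print(r, r_rescaled)
```

Because each deviation is divided by its own standard deviation, any positive rescaling or shift of x or y cancels out, while flipping the sign of one variable flips the sign of r.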


Example: Nile flow data: n=115

What is r?



Properties of r

1. The value of r does not depend on the units of x or y. We will not change r if we multiply all x's, all y's, or both by a positive constant or if we add any constant to all x's, all y's, or both.

2. The value of r does not depend on which variable is labeled x.

3. Correlation is always between -1 and +1.

4. The sign of r shows whether the relationship between x and y is positive or negative.


Properties of r (continued)

5. The absolute value of r measures the strength of the linear relationship between x and y. Roughly speaking:

a. If |r| < 0.5, the relationship (if any) is weak.

b. If 0.5 < |r| < 0.8, the association is moderate.

c. If 0.8 < |r| < 1.0, the association is strong.

d. If |r| = 1.0, the association is perfect. This occurs only when all (x, y) points fall in a perfect line.

Note that strength is often context- and discipline-dependent. An engineer might find any correlation less than .95 to be weak, while a social scientist might find a correlation of .3 to be very strong.





Properties of r (continued)

6. The correlation coefficient cannot measure the strength of a nonlinear (curved) relationship.


7. Outliers can also lead to an inappropriate value, in either direction!



High correlation indicates strong association, not necessarily causality.

If |r| is large, there are at least 3 possible explanations:

1) x determines y

2) y determines x

3) Some third value, z, (called a confounding factor) determines both x and y.


Example: Weekly surveys show that per capita chocolate consumption is strongly correlated with traffic fatalities.

Should driving under the influence of chocolate be outlawed?

Do people eat a lot of chocolate at funerals?

Is there a third explanation that makes more sense?


Example: Over time, ministers' salaries in Massachusetts are strongly correlated with the price of rum in Havana. What is the causal relationship here?

Example: Children's shoe size is correlated with size of vocabulary. What is the causal relationship?


One advantage of well-designed randomized, controlled experiments is that potential confounding factors should be (roughly) balanced between levels of the independent variable we are investigating, so should be much less likely to produce a spurious correlation.



Linear Regression (2.2-2.3)

Definition: Regression involves modeling and predicting the values of one response variable, based on the observed values of one or more other explanatory variables.

We'll focus on the case of simple linear regression, where a straight line is fit to a scatterplot of x and y.


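We want an equation for a line of the form

$$y = \beta_0 + \beta_1 x$$

The most common way to estimate $\beta_0$ and $\beta_1$ uses the least squares fit, minimizing

$$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$$

This leads to the least squares estimates,

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$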


Deviations from a potential regression line:



The least squares line best fits the scatter plot.
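A least squares fit can be sketched in Python using the standard closed-form estimates: the slope is the sum of products of deviations divided by the sum of squared x deviations, and the intercept makes the line pass through the point of means. The data and the function name `least_squares_line` are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line for the (x, y) pairs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # products of deviations
    sxx = sum((x - mx) ** 2 for x in xs)                     # squared x deviations
    slope = sxy / sxx
    intercept = my - slope * mx      # the line passes through (mean x, mean y)
    return intercept, slope

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
y = [2.0, 4.1, 5.9, 8.2, 9.8]
b0, b1 = least_squares_line(x, y)
print(b0, b1)
print(b0 + b1 * 3.5)            # prediction at x = 3.5, inside the range of the data
```

For these data the fitted line is approximately y = 0.09 + 1.97 x. Predictions are most trustworthy for x values inside the range that was actually observed.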


Example: Nile flow data

What is the least-squares line for this data, and what should we predict the flow for February to be if January's was 3?



What would we predict for February from a January value of 10?

Is this likely to be a valid prediction? (Recall, January's mean is about 4, and its standard deviation is about 1.)

Extrapolation outside the range of the data is dangerous.



Residuals and Goodness-of-Fit

Definition: Given a data set (xi, yi) and an associated fitted regression model, the fitted value for observation i is

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

Definition: The residual for i is

$$e_i = y_i - \hat{y}_i$$

The smaller the residuals, the better x and the regression line are at predicting y.


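The error sum of squares (SSE) is

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

SSE is usually compared to the total sum of squares, SST:

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

and the regression sum of squares, SSR:

$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

To avoid having to calculate all the residuals, we may use the computing formula:

SSE = SST - SSR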



The coefficient of determination, $r^2$, measures the proportion of the total variation of y which is explained by x:

$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

The closer $r^2$ is to 1, the more successful the relationship is at explaining the variation in y.

As the notation suggests, the coefficient of determination is the square of the correlation coefficient.
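The decomposition of the total variation can be checked numerically. In this sketch the data are hypothetical and `regression_sums_of_squares` is an illustrative name; it fits the least squares line and then computes the error, total, and regression sums of squares from the fitted values.

```python
def regression_sums_of_squares(xs, ys):
    """Return (SSE, SST, SSR) for the least squares line through the (x, y) pairs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    fitted = [intercept + slope * x for x in xs]
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # squared residuals
    sst = sum((y - my) ** 2 for y in ys)                  # total variation in y
    ssr = sum((f - my) ** 2 for f in fitted)              # variation explained by the line
    return sse, sst, ssr

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
y = [2.0, 4.1, 5.9, 8.2, 9.8]
sse, sst, ssr = regression_sums_of_squares(x, y)
r_squared = ssr / sst
print(sse, sst, ssr, r_squared)
```

For a least squares line the three quantities always satisfy SST = SSR + SSE (up to rounding), which is why SSE can be obtained by subtraction instead of from the individual residuals.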




Example: Nile flow data:

Find SST, SSR, SSE, and $r^2$.

What do these say about our predictions?

Note: r = 0.933.

The coefficient of determination $r^2$ is found as R-Sq in Minitab output.

The sums of squares may be found in the SS column of the Analysis of Variance table.

The regression equation is
February Inflow = - 0.4698 + 0.8362 January Inflow

S = 0.330519   R-Sq = 87.1%   R-Sq(adj) = 87.0%

Analysis of Variance
Source       DF       SS       MS       F      P
Regression    1  83.3794  83.3794  763.25  0.000
Error       113  12.3444   0.1092
Total       114  95.7238
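As an arithmetic check on this output, using only the sums of squares printed in the Analysis of Variance table (the sign of r is positive because the fitted slope is positive):

```latex
r^2 = \frac{SSR}{SST} = \frac{83.3794}{95.7238} \approx 0.871,
\qquad
SSE = SST - SSR = 95.7238 - 83.3794 = 12.3444,
\qquad
r = \sqrt{0.871} \approx 0.933
```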



Chapter 3: Probability

Definition: Probability is the branch of mathematics dealing with chance, randomness, and uncertainty.

Probability provides most of the mathematical foundation for inferential statistics.



Definition: A situation for which the outcome cannot be determined in advance is called an experiment.

Examples: The roll of a die. The draw of a card. The lifetime of an electronic component.


Definition: The sample space, S, of an experiment is the set of all possible outcomes.

Examples: Die: S = {1, 2, 3, 4, 5, 6} Card: S = ? Component: S = ?

An experiment with several steps can be visually represented by a tree diagram:

Example: Toss a coin three times:




Events

Definition: Set A is a subset of set B (A ⊆ B) if every element of A is also in B.

Example: S = {1, 2, 3, 4, 5, 6}

A = {1, 3, 5} ⊆ S

B = {1, 2, 6, 7} ⊄ S

Every set is a subset of itself.

The empty set, ∅, consisting of no elements, is a subset of every set.


Definition: Any interesting subset of the sample space can be called an event.

Examples: Die: A = odd numbers = {1, 3, 5} Card: B = ? Component: C = ?

The individual outcomes which make up S are sometimes called simple events.


Combining Events

For subsets of S, A and B (A ⊆ S, B ⊆ S):

1) The union of A and B (A ∪ B) is the set consisting of all elements found in A, B, or both.

Keyword: or

Example: S = {1, 2, 3, 4, 5, 6}; A = {1, 3, 5} ⊆ S; B = {1, 2, 3} ⊆ S; A ∪ B = ?



2) The intersection of A and B (A ∩ B) is the set consisting of all elements found in both A and B.

Keywords: and, both

Example: S = {1, 2, 3, 4, 5, 6}; A = {1, 3, 5}; B = {1, 2, 3}; A ∩ B = ?


3) The complement of A (A^c) is the set consisting of all elements of S not found in A.

Keyword: not

Example: S = {1, 2, 3, 4, 5, 6}; A = {1, 3, 5}; A^c = ?


4) Sets A and B are said to be mutually exclusive if there are no elements in both A and B. That is, if A ∩ B = ∅ (the empty set).

Example: S = {1, 2, 3, 4, 5, 6}; A = {1, 3, 5}; C = {4, 6}. A ∩ C = ∅, so A and C are mutually exclusive.


Example: Three coin tosses.

S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

Let A = First toss is a head = ?

Let B = Last toss is a head = ?

What simple events make up the event A and B?

A or B? Not A? Are A and B mutually exclusive?



The Axioms of Probability

Definition: A probability function P(·) is a function from subsets of S (events) to the real numbers which satisfies the following axioms of probability:

1) P(S) = 1.

2) 0 ≤ P(A) ≤ 1 for all events A.

3) If A and B are mutually exclusive, P(A ∪ B) = P(A) + P(B).


Example: A fair die.

P(1) = 1/6, P(2) = 1/6, P(3) = 1/6, P(4) = 1/6, P(5) = 1/6, P(6) = 1/6.

Probabilities of bigger events are found by axiom 3:

P({1,3}) = P(1) + P(3) = 1/6 + 1/6 = 2/6 = 1/3

P({1,3,5}) = ?



Example: A biased die.

P(1) = 1/12, P(2) = 1/6, P(3) = 1/6, P(4) = 1/6, P(5) = 1/6, P(6) = 3/12 = 1/4.

Note: 0 ≤ P(i) ≤ 1 for every outcome i (as required by axiom 2).

P({1,3}) = P(1) + P(3) = 1/12 + 1/6 = 1/4

P({1,3,5}) = ?
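The fair and biased dice can be written as probability functions and checked against the axioms. This sketch uses exact fractions from the standard library; the helper name `prob` and the dictionary representation are choices made for the example.

```python
from fractions import Fraction as F

def prob(event, pmf):
    """P(event) by axiom 3: add the probabilities of its mutually exclusive outcomes."""
    return sum((pmf[o] for o in event), F(0))

fair = {k: F(1, 6) for k in range(1, 7)}
biased = {1: F(1, 12), 2: F(1, 6), 3: F(1, 6), 4: F(1, 6), 5: F(1, 6), 6: F(1, 4)}

for pmf in (fair, biased):
    assert prob(pmf.keys(), pmf) == 1                 # axiom 1: P(S) = 1
    assert all(0 <= p <= 1 for p in pmf.values())     # axiom 2: each value is in [0, 1]

print(prob({1, 3}, fair), prob({1, 3}, biased))   # 1/3 and 1/4, as in the examples
```

Using `Fraction` avoids rounding error, so the check that the outcome probabilities add up to exactly 1 is reliable.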


When applied to real experiments, probability measures (long-term) likelihood: if the experiment is repeated many times, event A should occur roughly P(A) fraction of the time.


Additional Properties of Probability

The axioms of probability imply some additional properties:

1) For any event A, P(A^c) = 1 - P(A). This is sometimes called the complementary events rule, or the opposites rule.

Show:

Note: Since S^c = ∅, P(∅) = 0.



2) For any events A and B, P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

This is sometimes called the general addition rule.

Show:

Note: if A and B are mutually exclusive, P(A ∩ B) = P(∅) = 0, so this is the same as axiom 3.
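Both rules can be verified on a small example. The sketch below uses a fair die with Python sets standing in for events; the events C and D are hypothetical, picked only for illustration, and `P` counts equally likely outcomes.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}

def P(event):
    """Fair die: each outcome has probability 1/6, so P(A) = |A| / 6."""
    return Fraction(len(event & S), len(S))

C = {2, 4, 6}      # hypothetical event: even roll
D = {4, 5, 6}      # hypothetical event: roll of at least 4

assert P(S - C) == 1 - P(C)                       # complementary events rule
assert P(C | D) == P(C) + P(D) - P(C & D)         # general addition rule
print(P(C | D))
```

Adding P(C) and P(D) counts the shared outcomes 4 and 6 twice; subtracting the probability of the intersection removes the double count.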


Example: A fair die.

P(1) = 1/6, P(2) = 1/6, P(3) = 1/6, P(4) = 1/6, P(5) = 1/6, P(6) = 1/6.

A = {1, 3, 5}, P(A) = 3/6 = 1/2.

B = {1, 2}, P(B) = 2/6 = 1/3.

P(A^c) = ?

A ∩ B = {1}, P(A ∩ B) = 1/6.

P(A ∪ B) = ?


We don't need to know the entire probability function to use these.

Example: Lifetime of a component (T). Suppose we know:

P(A) = P(T ≤ 60) = .47

P(B) = P(40 ≤ T ≤ 80) = .34

P(A ∩ B) = P(40 ≤ T ≤ 60) = .26

Then:

P(T > 60) = ?

P(lifetime no more than 80) = ?



Example: Suppose the probability that an integrated circuit chip has defective etching is 0.12. The probability that the chip has a crack defect is 0.29. And the probability of both defects is 0.07.

What is the probability the chip does not have defective etching?

What is the probability it has at least one defect?

What is the probability it has neither defect?


Equally Likely Outcomes

If S consists of N equally likely outcomes, and event A consists of k of them, P(A) = k/N.

Example: A fair die (see the fair die examples above).

Example: Draw a card at random from a standard deck (52 cards, 13 spades). What is the probability of drawing a spade?

Example: A shipment of 1000 hard drives contains 6 which do not work. If we draw one at random, what is the probability of selecting a defective drive?


Conditional Probability (3.2)

Suppose we have partial information about the outcome of an experiment. In particular, suppose we know that the event B has occurred.

We may use this information to revise the probability of another event, A.

We call the revised probability a conditional probability, as it depends on the condition of B being true.



Example: Fair die. Let

A = {1, 3, 5}, P(A) = 3/6 = 1/2
B = {1, 2, 3}, P(B) = 3/6 = 1/2
P(A ∩ B) = P({1, 3}) = 2/6 = 1/3

If I roll the die and, without showing you, tell you event B has occurred (I rolled no greater than 3), now what is the probability of event A?


Since B has occurred, the sample space reduces to B: {1, 2, 3}.

Two of the three possibilities are odd (in A), and the chances are still equal. So P(A|B) = 2/3.

Once we know the roll is 3 or less, the probability that it's odd increases to 2/3.


Definition: The conditional probability of A given B is

P(A|B) = P(A ∩ B) / P(B)

(undefined if P(B) = 0).

This is the probability, given that event B has occurred, that event A has also occurred.

Die: P(A|B) = P(A ∩ B) / P(B) = (1/3) / (1/2) = 2/3.


Example (continued from slide 155): P(defective etching) = 0.12. P(crack defect) = 0.29. P(etching and crack defects) = 0.07.

If a chip has a crack defect, what is the (conditional) probability that it also has defective etching?


What is the probability that a chip has a crack defect but satisfactory etching?

If a chip has a crack defect, what is the probability that it has satisfactory etching?

Note: P(A|B) = 1 − P(A^c|B), just like P(A) = 1 − P(A^c).


If a chip has defective etching, what is the probability that it also has a crack defect?

In general, P(A|B) ≠ P(B|A); the two conditional probabilities answer different questions and should not be confused.
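As a rough illustration of the chip questions, the sketch below applies the definition of conditional probability directly. The helper name `conditional` is an assumption introduced here for illustration only.

```python
p_etch = 0.12    # P(defective etching)
p_crack = 0.29   # P(crack defect)
p_both = 0.07    # P(etching and crack defects)

def conditional(p_joint, p_given):
    """P(A | B) = P(A and B) / P(B); undefined when P(B) = 0."""
    if p_given == 0:
        raise ValueError("conditional probability is undefined when P(B) = 0")
    return p_joint / p_given

# P(defective etching | crack)
p_etch_given_crack = conditional(p_both, p_crack)

# P(crack and satisfactory etching) = P(crack) - P(crack and defective etching)
p_crack_good_etch = p_crack - p_both

# P(satisfactory etching | crack)
p_good_etch_given_crack = conditional(p_crack_good_etch, p_crack)

# P(crack | defective etching): conditioning the other way round
p_crack_given_etch = conditional(p_both, p_etch)

print(round(p_etch_given_crack, 4), round(p_good_etch_given_crack, 4),
      round(p_crack_given_etch, 4))
```

The first two conditional probabilities sum to 1, while the last one differs from the first, which shows that swapping the roles of A and B changes the answer.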




Independence

Definition: If P(A ∩ B) = P(A) P(B), we say A and B are independent.

If A and B are independent, P(A) > 0, P(B) > 0, then

P(A|B) = P(A ∩ B) / P(B) = P(A) P(B) / P(B) = P(A).

Likewise, P(B|A) = P(B). Your book uses this as the definition of independence.


Assuming P(A) > 0, P(B) > 0, any one of

P(A ∩ B) = P(A) P(B)
P(A|B) = P(A)
P(B|A) = P(B)

proves independence and the other two.


Example: Draw one card at random from a well-shuffled deck. Define:

A = {draw a club} B = {draw an ace} C = {draw a red card}

Are A and B independent? A and C?


Note that events being mutually exclusive and their being independent are not the same thing.

Show: If P(A) > 0, P(B) > 0, and A and B are mutually exclusive, they cannot be independent!
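One way to check independence for the card example is to enumerate the deck and compare P(A ∩ B) with P(A) P(B). This is only a sketch; the deck representation and helper names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 equally likely (rank, suit) outcomes

def prob(event):
    """P(event) = (number of cards in the event) / (number of cards)."""
    return Fraction(sum(1 for card in DECK if event(card)), len(DECK))

def is_club(card):
    return card[1] == "clubs"

def is_ace(card):
    return card[0] == "A"

def is_red(card):
    return card[1] in ("diamonds", "hearts")

def independent(e1, e2):
    """Check the definition P(A and B) == P(A) * P(B) exactly."""
    joint = prob(lambda card: e1(card) and e2(card))
    return joint == prob(e1) * prob(e2)

print(independent(is_club, is_ace))  # A and B
print(independent(is_club, is_red))  # A and C
```

Clubs and red cards are mutually exclusive, each with positive probability, so their joint probability is 0 while the product of their probabilities is not; this is the situation described in the "Show" item above.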



We'll often assume independence to calculate probabilities of intersections.

Example: Roll a red die and a black die.

A = {red 6}, P(A) = 1/6 (fair dice)
B = {black 6}, P(B) = 1/6

Results on one die shouldn't influence the other, so we assume independence.

P(double-sixes) = P(A ∩ B) = P(A) P(B) = (1/6)(1/6) = 1/36.


This extends to more than 2 events.

The multiplication law for independent events says that if events A1, A2, …, An are independent (that is, knowledge of any combination of the Ai's does not change the probabilities of the remainder), then

P(A1 ∩ A2 ∩ … ∩ An) = P(A1) P(A2) … P(An).

Note: this is the probability that all n events occur.



Example: Flip a fair coin 4 times. Let Ai = {Flip i is a head}.

P(Ai) = 1/2, i = 1, 2, 3, 4

Separate flips are independent. (Why?)

P(4 heads) = P(A1 ∩ A2 ∩ A3 ∩ A4) = P(A1) P(A2) P(A3) P(A4) = (1/2)(1/2)(1/2)(1/2) = 1/16.


Example: Draw a card from a standard deck 3 times with replacement (replace and reshuffle after each draw).

Let Ai = {Draw i is a spade}.

P(Ai) = 13/52 = 1/4, i = 1, 2, 3

Separate draws are independent. (Why?)

P(3 spades) = ?


What if events aren't independent?

Recall, P(A|B) = P(A ∩ B) / P(B).

Therefore, P(A ∩ B) = P(A|B) P(B).

The general multiplication law:

P(A1 and A2) = P(A1) P(A2|A1).



Example: Suppose we have 4 cards, labeled 1, 2, 3, and 4. Suppose we draw two at random without replacement. What is the probability both cards are odd?


Example: Suppose we draw two cards at random without replacement from a standard deck. What is the probability both cards are spades?
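The two without-replacement examples can be sketched with the general multiplication law, using exact fractions. The enumeration at the end is an added cross-check, not something taken from the notes.

```python
from fractions import Fraction
from itertools import permutations

# Four cards labeled 1-4: P(first odd) * P(second odd | first odd)
p_both_odd = Fraction(2, 4) * Fraction(1, 3)

# Standard deck: P(first spade) * P(second spade | first spade)
p_both_spades = Fraction(13, 52) * Fraction(12, 51)

# Cross-check the four-card answer by listing all ordered draws of two cards.
draws = list(permutations([1, 2, 3, 4], 2))
favorable = sum(1 for first, second in draws if first % 2 == 1 and second % 2 == 1)
p_both_odd_enumerated = Fraction(favorable, len(draws))

print(p_both_odd, p_both_spades, p_both_odd_enumerated)
```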


Random Variables (3.3)

Definition: A random variable is a random number. It is obtained by assigning a number to each outcome of an experiment.

Example: Roll a die. The number rolled is a random variable.



Example: Flip a coin 5 times. Is the sequence of heads and tails a random variable (Example: HHTHT)?

Some random variables we could generate from 5 coin flips:

X = # H
Y = # H − # T
Z = # H before first T

We usually denote random variables by capital letters from the end of the alphabet.


Example: Select a rat at random from a large colony. What are some possible random variables?


There are two main types of random variables: discrete and continuous.

Definition: A discrete random variable can only take on a specified (countable) list of values. There is a gap between any two elements in its sample space.

In practice, these are usually counts of some sort, and thus whole numbers.

Example: Number of heads in 5 coin flips.



Definition: A continuous random variable may take any real number in some (set of) interval(s).

Examples: Weight, lifetime.

We will need to deal differently with discrete and continuous random variables.


Discrete Random Variables

Definition: The probability mass function (p.m.f.) of a discrete random variable X is a function p(·) from the support of X to the real numbers, where

p(x) = P(X = x).

Notation:
X: capital letter, indicates a random variable.
x: lowercase letter, indicates a specific value.


Example: Let X be the roll of a fair die. S = {1, 2, 3, 4, 5, 6}, p(1) = P(X = 1) = 1/6, p(2) = P(X = 2) = 1/6, and so on.

We might write

p(x) = 1/6, x ∈ {1, 2, 3, 4, 5, 6}



Example: An industrial plant has 3 machines. Let X be the number of machines operating at a randomly chosen time. The probabilities for X are given by

x      0     1     2     3
p(x)   0.12  0.27  0.46  0.15


The laws of probability tell us that:

1) ? ≤ p(x) ≤ ? for all x

2) Σ p(x) = ? (sum over all x in S)


A p.m.f. is plotted as spikes:


Or as a probability histogram, with areas equal to probabilities:






Continuous Random Variables

Recall, a continuous random variable may take any value in some real interval.

Continuous random variables are typically measurements (length, weight, lifetime, etc.).


With continuous random variables, we can't use a p.m.f. to find probabilities. Instead:

Definition: A probability density function (density, p.d.f.), f(x), is a function which determines the probability properties of a continuous random variable. If X ~ f(x), then

P(a ≤ X ≤ b) = ∫ f(x) dx, integrated from a to b.


If f(x) is a p.d.f.:

f(x) ≥ ? for all x, and

∫ f(x) dx = 1, integrated from −∞ to ∞.

Note: for a continuous random variable,

P(X = x) = 0 for every single value x.

Why?



Example: a continuous random variable has p.d.f.

Is f(x) a true p.d.f.?


Example (continued): What is the probability that X will be between 0.5 and 1.0?

P(2.5 ≤ X ≤ 3.0) = ?

P(0.2 ≤ X ≤ 0.2) = ?

P(X < 1.0) = ?




Definition: The cumulative distribution function (c.d.f.), F(x), of a random variable is defined as

F(x) = P(X ≤ x).

If X is continuous,

F(x) = ∫ f(t) dt, integrated from −∞ to x.


Properties of continuous c.d.f.s:

1) lim F(x) = 0 as x → −∞

2) lim F(x) = 1 as x → ∞

3) F is nondecreasing (if x < y, F(x) ≤ F(y)).

4) P(a ≤ X ≤ b) = P(X ≤ b) − P(X ≤ a) = F(b) − F(a).

This is often easier than integrating f(x).


Example (back to earlier p.d.f.):

P(0.5 ≤ X ≤ 1.0) = ? (Compare to slide 192.)



The Population Mean

Definition: The population mean (expectation, expected value) of random variable X is

μ = E(X) = Σ x p(x)  (sum over all x in S)

if X is discrete, and

μ = E(X) = ∫ x f(x) dx  (integral from −∞ to ∞)

if X is continuous.

It can be thought of as the long-term average of X, or the mean of a sample that follows the distribution of X perfectly.


Example: Die roll, p(x) = 1/6, x ∈ {1, 2, …, 6}. μ = ?

Example: Machines

x      0     1     2     3
p(x)   0.12  0.27  0.46  0.15

μ = ?

Example:

μ = ?

Example:

μ = ?
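For the two discrete examples (the die roll and the machines), the population mean is the probability-weighted sum of the values. A minimal sketch, assuming the p.m.f. is stored as a dictionary from value to probability (a representation chosen here for illustration):

```python
from fractions import Fraction

def mean(pmf):
    """Population mean of a discrete random variable: sum of x * p(x)."""
    return sum(x * p for x, p in pmf.items())

# Fair die: p(x) = 1/6 for x = 1, ..., 6
die = {x: Fraction(1, 6) for x in range(1, 7)}

# Machines example: number of machines operating
machines = {0: 0.12, 1: 0.27, 2: 0.46, 3: 0.15}

print(mean(die))                 # exact fraction
print(round(mean(machines), 2))  # decimal probabilities, so round for display
```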



Expectations of Functions of Random Variables

Given a random variable, X, suppose we are really interested in a function, h(X).

The expected value of h(X) is

E[h(X)] = Σ h(x) p(x)  (sum over all x in S)

if X is discrete, and

E[h(X)] = ∫ h(x) f(x) dx  (integral from −∞ to ∞)

if X is continuous.

Example: X ~ p(x) = , x = 1, 2. What is E(X²)?

Note: In general, E[h(X)] ≠ h[E(X)].

Example: For the above p.m.f., what is E(X)? [E(X)]²?

Is E(X²) = [E(X)]²?



The Population Variance and Standard Deviation

Just as we have a population mean to measure the center of a distribution, the population variance and standard deviation measure a distribution's spread.



Definition: Let X be a random variable with mean μ. Then the population variance of X, σ², is

σ² = V(X) = E[(X − μ)²] = E(X²) − μ².

Definition: The population standard deviation, σ, of random variable X is the square root of the variance of X.


Example: Die roll, p(x) = 1/6, x ∈ {1, 2, …, 6}. μ = ? E(X²) = ? V(X) = ? σ = ?

Example: p(x) = 1/2, x ∈ {3, 4}. μ = ? E(X²) = ? V(X) = ? σ = ?


Example: Machines

x      0     1     2     3
p(x)   0.12  0.27  0.46  0.15

μ = ? E(X²) = ?

V(X) = ?

σ = ?



Example:

μ = ? E(X²) = ?

V(X) = ?

σ = ?
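The discrete variance examples can be sketched with the shortcut V(X) = E(X²) − μ². As before, storing each p.m.f. as a dictionary is an illustrative assumption.

```python
from fractions import Fraction
from math import sqrt

def mean(pmf):
    """E(X) = sum of x * p(x)."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """V(X) = E(X^2) - [E(X)]^2."""
    mu = mean(pmf)
    second_moment = sum(x**2 * p for x, p in pmf.items())
    return second_moment - mu**2

die = {x: Fraction(1, 6) for x in range(1, 7)}
two_point = {3: Fraction(1, 2), 4: Fraction(1, 2)}
machines = {0: 0.12, 1: 0.27, 2: 0.46, 3: 0.15}

for name, pmf in [("die", die), ("two-point", two_point), ("machines", machines)]:
    v = variance(pmf)
    print(name, v, round(sqrt(v), 4))  # variance, then standard deviation
```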


Linear Functions of Random Variables (3.4)

Recall, a linear function (or linear combination) of variables x1, x2, …, xn, is a function of the form

f(x1, x2, …, xn) = a1x1 + a2x2 + … + anxn + b

where b and all of the ai's are fixed constants.


Given any random variables X1, X2, …, Xn and known constants a1, a2, …, an, and b, then

E(a1X1 + a2X2 + … + anXn + b) = a1E(X1) + a2E(X2) + … + anE(Xn) + b.

To find the expectation of a linear combination of random variables, we need only know the constants and the expectation of each random variable individually.



Example: Let X be a random temperature measured in degrees Celsius, with E(X) = 10. Let Y be the same temperature in degrees Fahrenheit, Y = 9/5 X + 32. What is E(Y)?

Example: The expectation of the roll of a fair die is 3.5. What is the expectation of the sum of four such rolls?

Independent Random Variables

Recall, events are said to be independent if knowledge of one does not affect the probability of the other.

Likewise, random variables X and Y are independent if knowing the value of X does not affect probabilities of Y, no matter what value X takes (and vice-versa).



If X and Y are independent, any event involving X alone will be independent from any event involving Y alone.

P(X ∈ A and Y ∈ B) = P(X ∈ A) P(Y ∈ B) for any A and B.

Draws with replacement are independent.

Draws in a simple random sample are not independent, but may be treated as though they are if the sample size is much smaller than the population size.



If the random variables are independent, then

V(a1X1 + a2X2 + … + anXn + b) = a1² V(X1) + a2² V(X2) + … + an² V(Xn).

Notes:
The shift b does not affect the variance.
The coefficients ai are squared.
Dependent random variables require a more complex formula.


Example: Let the variance of the Celsius temperature X be V(X) = 25.

What is the standard deviation of X?

What is the variance of Y = 9/5 X + 32?

What is the standard deviation of Y?


Example: The variance of the roll of a fair die is 35/12. What is the variance of the sum of four such rolls?

If we take a single roll and multiply it by 4, what is the variance of the result? Why is this different?
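To see why the sum of four independent rolls and four times a single roll behave differently, both variances can be computed exactly by listing the equally likely outcomes. This is a sketch; the `variance` helper is an illustrative assumption that treats every listed value as equally likely.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)

def variance(values):
    """Variance of a list of equally likely values."""
    n = len(values)
    mu = Fraction(sum(values), n)
    return sum((v - mu) ** 2 for v in values) / n

# Sum of four independent rolls: 6^4 equally likely outcomes.
var_sum_of_four = variance([sum(rolls) for rolls in product(FACES, repeat=4)])

# One roll multiplied by 4: only 6 equally likely outcomes.
var_four_times_one = variance([4 * x for x in FACES])

print(var_sum_of_four)     # 4 * V(X)
print(var_four_times_one)  # 4^2 * V(X)
```

In the sum, high and low rolls tend to offset one another, so the coefficients are all 1 and the variances simply add; multiplying one roll by 4 applies the squared coefficient 16 to a single variance.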



Suppose X and Y each have mean 10 and variance 4. What are the mean and variance of Z = X − Y?


Mean and Variance of the Sample Mean

An important special case concerns the sample mean of the Xi's,

X̄ = (X1 + X2 + … + Xn) / n.

Note that X̄ is a linear combination of the Xi's.


Theorem: If X1, X2, …, Xn are independent random variables, each with E(Xi) = μ and V(Xi) = σ², then

E(X̄) = μ

and

V(X̄) = σ²/n.

Proof:



Example: A (possibly biased) coin has probability p of coming up heads. We flip it and let X = 1 if heads, 0 if tails.

What are E(X) and V(X)?

Suppose we flip it n times, and look at

Chapter 4: Common Distributions

Often we will have useful mathematical forms which represent entire families of distributions.

These distributions include one or more constants (called parameters) which must be specified to define a specific distribution.

We will concentrate on two especially important families, the binomial and normal distributions.


The Binomial Distribution (4.1)

The binomial distribution is the most important common named family of discrete distributions.

Recall, a discrete distribution is described by a probability mass function p(·), where

p(0) = P(X = 0), p(1) = P(X = 1), and so on.



Suppose our experiment consists of trials with only two possible outcomes.

One outcome, called a "success," occurs with probability p.

The other outcome is called a "failure," and occurs with probability (1 − p).

Such a process is called a Bernoulli trial (after 17th-century probabilist James Bernoulli).

The binomial distribution looks at a fixed number of independent identical Bernoulli trials, and counts the number of successes.


Example: Suppose silicon computer chips are made in pairs, and that 30% of all chips produced are defective.

Also assume that the chips in a pair are independent of each other.

Out of pairs in which the first chip is good, the second is defective in 30% of pairs. This remains true for pairs in which the first chip is defective.


Out of all pairs, 70% will have a good first chip. Out of those, 70% will also have a good second chip. Overall, 70% of 70%, or 49% (.7*.7 = .49) will have two good chips.

Likewise, 30% of that 70%, or 21% overall (.7*.3 = .21) will have a good first chip and a defective second chip.

By the same reasoning, 30% will have a defective first chip, and 70% of those (21% overall) will have a good second chip.

Finally, 30% of 30%, or 9% will have both chips defective.



If we let the letter S (for success) represent a good chip, and F (for failure) represent a defective one, we can summarize as:

P(SS) = .7*.7 = .49 P(SF) = .7*.3 = .21 P(FS) = .3*.7 = .21 P(FF) = .3*.3 = .09


Now let X be the number of good chips produced in a pair.

Then X can take the values 0, 1, or 2.

From the above,

p(0) = P(X = 0) = P(FF) = .09
p(2) = P(X = 2) = P(SS) = .49
p(1) = P(X = 1) = P(SF or FS) = .21 + .21 = .42


What if the chips are produced in sets of 4?

If we want the probability of a set consisting of 2 good and 2 defective chips, we can think about the case of SSFF: the first and second chips are good, while the third and fourth are defective.

The probability of this particular outcome will be .7*.7*.3*.3 = .0441 or 4.41%.



But there are other ways we can have two successes and two failures (5 other ways, in this case):

P(SSFF) = .7*.7*.3*.3 = .0441
P(SFSF) = .7*.3*.7*.3 = .0441
P(SFFS) = .7*.3*.3*.7 = .0441
P(FSSF) = .3*.7*.7*.3 = .0441
P(FSFS) = .3*.7*.3*.7 = .0441
P(FFSS) = .3*.3*.7*.7 = .0441

Overall, p(2) = P(X = 2) = 6*.0441 = .2646.



In general, suppose we have an experiment consisting of n independent Bernoulli trials.

Those trials which satisfy the condition we wish to count are called successes, and occur with probability p.

The remaining trials are called failures; these occur with probability (1 − p).

Let X be the number of successes in the full experiment.


If these conditions are true, we say that X, the number of successes in the experiment, has a binomial distribution with parameters n and p.

X ~ Binomial(n, p) or X ~ Bin(n, p).

The mass function for X is:

p(x) = P(X = x) = [n! / (x! (n − x)!)] p^x (1 − p)^(n − x), x = 0, 1, …, n.

Note: the exclamation mark is pronounced "factorial."

Given n items, n! is the number of arrangements, and is found as

n! = n × (n − 1) × (n − 2) × … × 2 × 1.

Since there is one (empty) way to arrange 0 objects, we define 0! = 1.

Example: The chips (30% defective) are produced in batches of 4. Let X be the number of good chips in a batch.

What distribution does X follow?

What is p(2)?

What is the probability that a random batch will contain no more than one good chip?
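The batch-of-4 example can be sketched by coding the binomial mass function directly. The function name `binomial_pmf` is an illustrative assumption; `math.comb` supplies the count of arrangements n! / (x! (n − x)!).

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 4, 0.7  # 4 chips per batch, each good with probability 0.7

p_two_good = binomial_pmf(2, n, p)
p_at_most_one_good = binomial_pmf(0, n, p) + binomial_pmf(1, n, p)

print(round(p_two_good, 4))          # matches 6 * .0441 from the listing of outcomes
print(round(p_at_most_one_good, 4))
```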


Example: In a genetics study, a second-generation cross of pure green peas with pure yellow peas leads to pods where p = P(yellow) = 3/4.

If pods contain 8 seeds, what is the probability that a random pod will contain 6 yellow seeds?

What is the probability that a random pod will contain at least 6 yellow seeds?



Table A.1 in your book can save calculations by providing probabilities of P(X ≤ x) for n ≤ 20 and certain values of p.

Example: Draw 16 times with replacement from a standard deck, and let X = number of spades drawn.

Find P(X > 6).


With standard distributions, the mean and variance may generally be found as a function of the parameters.

If X ~ Binomial(n, p), then μ = np.

Example: If 75% of all seeds are yellow, and each pod contains 8 seeds, what is the mean number of yellow seeds per pod?

Example: If we have 4 fair coins which we flip as a batch, what is the mean number of heads?


Additionally, if X ~ Bin(n, p), then σ² = np(1 − p).

Example: X = # yellow seeds ~ Bin(8, .75). What are the variance and standard deviation of X?

Example: X = # heads in 4 flips ~ Bin(4, .5). What are the variance and standard deviation of X?



Recall, draws without replacement (simple random samples) are not independent.

However, we may do calculations as though they are independent (including binomial calculations) as long as the sample size is small (less than 5%) compared to the population size.


Example: A lot of several thousand components contains 7% defective. We sample 8 at random.

What is the probability of no defective components in our sample?

What is the probability of at least one defective?

What is the expected number of defectives in our sample?



The Normal Distribution (4.3)

The continuous normal (or Gaussian) distribution has two parameters, μ and σ². If X ~ N(μ, σ²),

f(x) = [1 / (σ √(2π))] exp(−(x − μ)² / (2σ²)), −∞ < x < ∞.

This distribution is often seen in practice, and is also very important theoretically.



The normal p.d.f. is a bell-shaped curve, symmetric around, and with its peak at, μ. E(X) = μ.

Its width is determined by σ²; large values of σ² imply a wide, low curve, while small values imply a narrow, tall one. V(X) = σ².


An important special case is the standard normal distribution, with μ = 0 and σ² = 1.

We usually identify standard normal variables with the letter Z.

If Z is standard normal, Z ~ N(0, 1) and the density of Z is

f(z) = [1 / √(2π)] exp(−z² / 2).


There is no closed-form integral for the normal probability density function, so we can't find probabilities that way.

To find normal probabilities, we must use computer programs (which themselves use numeric integration), or tables such as Table A.2 (p. 521-522, and inside the front cover of your book) of the standard normal distribution.




Examples: P(Z ≤ 1.00) = ?

P(Z > 1.00) = ?

P(−2.00 ≤ Z ≤ 0.75) = ?


For X ~ N(μ, σ²), we find proportions by converting to standard units.

If X ~ N(μ, σ²), then Z = (X − μ)/σ ~ N(0, 1).

Remember to convert both sides of any inequality the same way.



Examples: Let X ~ N(3, 4). P(X ≤ 6.00) = ?

P(X > 4.00) = ?
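As an alternative to Table A.2, normal probabilities can be computed in software. The sketch below uses Python's standard-library `statistics.NormalDist`; note that it is parameterized by the standard deviation, so N(3, 4) in the notation of these notes becomes `sigma=2`.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # standard normal

p_z_below_1 = Z.cdf(1.00)                 # P(Z <= 1.00)
p_z_above_1 = 1 - Z.cdf(1.00)             # P(Z > 1.00), by the complement rule
p_z_between = Z.cdf(0.75) - Z.cdf(-2.00)  # P(-2.00 <= Z <= 0.75) = F(b) - F(a)

X = NormalDist(mu=3, sigma=2)  # X ~ N(3, 4): variance 4, standard deviation 2

p_x_below_6 = X.cdf(6.00)      # same as P(Z <= (6 - 3)/2)
p_x_above_4 = 1 - X.cdf(4.00)  # same as P(Z > (4 - 3)/2)

print(round(p_z_below_1, 4), round(p_z_above_1, 4), round(p_z_between, 4))
print(round(p_x_below_6, 4), round(p_x_above_4, 4))
```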


Normal Percentiles

Just as for samples, the pth percentile of a distribution has p% of the probability below it, and (100 − p)% above.

We find percentiles for the normal distribution using Table A.2 again, but reading from the inside out.

Since probabilities are in the middle of the table, start there.

Read to the outside to find the percentile.


Example: Z ~ N(0, 1). What is the 70th percentile of Z?

Example: What is the 25th percentile of Z?



For non-standard normal variables, first find the desired percentile for the standard normal, then use the fact that since Z = (X − μ)/σ, therefore X = μ + σZ.

Example: X ~ N(10, 25). What is the 95th percentile of X?
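Reading the table "from the inside out" corresponds to evaluating the inverse c.d.f. A minimal sketch using `statistics.NormalDist.inv_cdf` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)

z_70 = Z.inv_cdf(0.70)  # 70th percentile of Z
z_25 = Z.inv_cdf(0.25)  # 25th percentile of Z (negative, since it is below the mean)

# X ~ N(10, 25): sigma = 5. Convert the standard percentile with X = mu + sigma * Z.
mu, sigma = 10, 5
x_95 = mu + sigma * Z.inv_cdf(0.95)

print(round(z_70, 4), round(z_25, 4), round(x_95, 2))
```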

Besides the binomial and normal distributions, there are a number of other named families of distributions with useful properties.

For example, the Poisson distribution (Section 4.2) is useful for modeling random counts in a fixed interval of time or space.

See Sections 4.4-4.6 for discussion of the lognormal, exponential, gamma, and Weibull distributions, which are useful for modeling continuous histograms which are positively skewed and unimodal.


Sampling Distributions (4.8)

Suppose random variable X is drawn from some distribution f. (X ~ f)

Now suppose we generate n of these random variables, X1, …, Xn, independently from f.

We say that X1, …, Xn make a random sample from f.

Sometimes we say that X1, …, Xn are i.i.d. (independent and identically distributed) from f.



Since the X's make a sample, we can compute sample statistics such as the mean, X̄.

Recall (3.4), since the X's are random, so is X̄, and since it is a number, X̄ is itself a random variable with a distribution.

This distribution is referred to as the sampling distribution of X̄ and plays a large role in inferential statistics.


Example: Let pX(x) = 1/3, x = 1, 2, 3, and let X1 and X2 be independent draws from pX(x).

Now let X̄ = (X1 + X2)/2 be the average of X1 and X2.

Note that X̄ is also a discrete random variable, and therefore has a probability mass function.

What is the mass function (sampling distribution) of X̄?
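One way to build this sampling distribution is to list all nine equally likely (X1, X2) pairs and tally the resulting averages. The following is a sketch under that approach; the names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

values = [1, 2, 3]                        # each has probability 1/3
pairs = list(product(values, repeat=2))   # 9 equally likely (X1, X2) outcomes

# Tally how many pairs produce each value of the sample mean.
counts = Counter(Fraction(x1 + x2, 2) for x1, x2 in pairs)

# Convert counts to probabilities: the p.m.f. of the sample mean.
sampling_pmf = {xbar: Fraction(c, len(pairs)) for xbar, c in sorted(counts.items())}

for xbar, prob in sampling_pmf.items():
    print(float(xbar), prob)
```

The resulting mass function is symmetric about 2 and more concentrated near the center than the original uniform distribution on {1, 2, 3}.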


Example: Suppose X ~ N(50, 4). A histogram of 1000 X's looks like this:



Sample 25 X's and compute X̄.

If we repeat this process 1000 times, we get a histogram such as this:


Note that X̄ has a distribution that:

Is centered on 50 (μ);

Is narrower than the solid normal curve for the individual X's (the variance and standard deviation of X̄ are smaller than those of X);

Remains bell-shaped and (roughly?) normal.

Understanding the distributions of sample statistics and their relationships to the associated population parameters is the basis of most of inferential statistics.


In general, if a sample statistic is used to estimate a population parameter:

The sampling distribution of the statistic is centered on (or at least near) the parameter.

The spread of the sampling distribution will decrease as the sample size gets larger.

As the sample size gets larger, the shape of the sampling distribution will usually get more and more bell-shaped (normal).



Let X̄ be the sample mean of a random sample X1, X2, …, Xn, from a population or process with mean μ and standard deviation σ. Then (recall, Section 3.4):

The mean of the sampling distribution of X̄, μ_X̄, is μ, the population mean, regardless of sample size n.

The standard deviation of the sampling distribution of X̄, σ_X̄, is σ/√n, the population standard deviation divided by the square root of the sample size.


Sampling Distributions of the Mean

The standard deviation of the sample mean, σ_X̄ = σ/√n, is often called the standard error of the sample mean.

This emphasizes that it describes a sampling distribution, not a population.


As the sample size gets larger, we have more information and can make better estimates, so the standard error decreases.

(Note, however, that the square root means we have diminishing returns; each new observation provides less new information than the previous one.)

The larger the sample, the closer X̄ is likely to be to μ.




If our original population has a normal distribution, the sampling distribution of X̄ is also normal, regardless of sample size.

Example: An automated filling machine fills soft drink cans with a volume that has a normal distribution with σ = 0.05 ounces.

If we sample 4 cans and take the sample mean, what is the probability that X̄ will be within 0.04 ounces of the population mean μ?


The Central Limit Theorem

The Central Limit Theorem is the most important theorem in statistics.

It shows the importance of the normal distribution, and provides the justification of many of the most fundamental statistical methods.




If we know that a population or process has a normal distribution, we know that the sampling distribution of X̄ will also be normal. This allows us to compute useful probabilities.

Unfortunately, we often do not know the population distribution (or perhaps we know that it is not normal).

Fortunately, this is not always required.


The sample mean (or sum) of a large number of independent random variables has a sampling distribution which is approximately normal, no matter what distribution the original random variables come from.

This important result is the Central Limit Theorem.


Theorem (Central Limit Theorem): If X1, X2, …, Xn are independent random variables, from a population or process with mean μ and standard deviation σ, then as long as n is sufficiently large,

X̄ ~ N(μ, σ²/n) approximately, and X1 + X2 + … + Xn ~ N(nμ, nσ²) approximately.

We can use this to find probabilities for sums or averages, without knowing the distribution of the Xi's!




Example: The (population) mean time required for maintenance on an air-conditioning unit is 1 hour, and the standard deviation is also 1 hour. A company operates 50 such units.

Could we find the probability that the maintenance on a single unit requires more than 2 hours from the information given?


What is the probability that the average time for maintenance will be more than 75 minutes?

What is the probability that the total time for maintenance will be less than 40 hours?
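The two questions about the 50 units can be approximated with the Central Limit Theorem. The sketch below assumes the 50 maintenance times are independent and that n = 50 is large enough for the normal approximation to be reasonable; it uses `statistics.NormalDist` from the standard library.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 1.0, 1.0, 50  # hours, hours, number of units

# Sample mean: approximately N(mu, sigma^2 / n), so its sd is sigma / sqrt(n).
xbar_dist = NormalDist(mu, sigma / sqrt(n))
p_avg_over_75_min = 1 - xbar_dist.cdf(75 / 60)  # 75 minutes = 1.25 hours

# Total time: approximately N(n * mu, n * sigma^2), so its sd is sigma * sqrt(n).
total_dist = NormalDist(n * mu, sigma * sqrt(n))
p_total_under_40_hr = total_dist.cdf(40)

print(round(p_avg_over_75_min, 4), round(p_total_under_40_hr, 4))
```

The single-unit question cannot be answered this way: the theorem describes sums and averages, and the distribution of one maintenance time is not given.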


How large is large?

As a general rule, n ≥ 30 is usually large enough that the Central Limit Theorem is reasonable.

Symmetric populations can get by with much less, often as few as 10, or even fewer.

Highly skewed populations require more. 50 or more should be fairly safe in all but the worst cases.


The Normal Approximation to the Binomial Distribution

Recall, if X ~ B(n, p), then E(X) = np and V(X) = np(1 − p).

If the particular values of n and p lead to a binomial distribution which is not very skewed, the N(np, np(1 − p)) distribution can be a good approximation to the B(n, p) distribution.

We usually require that np ≥ 10 and n(1 − p) ≥ 10.


Example: Roll a die 120 times and count the number of 6s rolled (X).

What distribution does X follow?

What are E(X) and V(X)?

What is P(X ≥ 25)?



The true binomial probability is 0.136.

We're pretty close, but we can do better.

Binomial probabilities are located entirely on the integers, but normal probabilities are smeared out over the whole real line (remember the probability histogram).

We'll get a better approximation if we use a continuity correction, by taking the normal probability from (x − .5) to (x + .5) to approximate the binomial P(X = x).



So, for X ~ B(120, 1/6),

P(X ≥ 25) ≈ P(X ≥ 24.5) = ?

Example: If X ~ Bin(120, 1/6), use the normal approximation to estimate P(15 < X < 25).
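The effect of the continuity correction can be checked numerically by comparing the exact binomial tail with the two normal approximations. This is a sketch under the setup X ~ B(120, 1/6); the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 120, 1 / 6

# Exact binomial tail: P(X >= 25) = sum of p(k) for k = 25, ..., 120.
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(25, n + 1))

# Normal approximation with the same mean np and variance np(1 - p).
approx = NormalDist(n * p, sqrt(n * p * (1 - p)))

plain = 1 - approx.cdf(25)        # no continuity correction
corrected = 1 - approx.cdf(24.5)  # with continuity correction

# P(15 < X < 25) = P(16 <= X <= 24), approximated by the area from 15.5 to 24.5.
between = approx.cdf(24.5) - approx.cdf(15.5)

print(round(exact, 3), round(plain, 3), round(corrected, 3), round(between, 3))
```

Note that the strict inequalities in P(15 < X < 25) exclude both 15 and 25, so the corrected endpoints are 15.5 and 24.5.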




Chapter 5: Statistical Estimation

The remainder of the course will focus on inferential statistics.

Recall, in probability, we generally know the distribution in question and wish to calculate something about particular outcomes or events.

In inferential statistics, we have a sample, and wish to use that information to say something about the population or distribution the sample was drawn from.


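[Diagram: Population and Sample. Probability reasons from the population to the sample; inferential statistics reasons from the sample back to the population.]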


Recall: A parameter is an unknown quantity related to a population or distribution.

A statistic is a known quantity which can be calculated from a dataset.

Estimation uses a statistic (what we know) to tell us something about an unknown parameter (what we wish we knew).



Point Estimation (5.1)

Definition: A point estimate of a parameter, θ, is a statistic, θ̂, which represents a best guess for θ.

Example: We have an unknown distribution, X ~ f(x), and we wish to know the unknown parameter μ = E(X). We take a sample X1, X2, …, Xn, and estimate μ with the known statistic X̄.


Other common point estimates:

Estimate V(X) = σ² with s².

If X ~ Binomial(n, p) (n known, p unknown), estimate p with p̂ = X/n.

All of our standard sample statistics (median, quartiles, etc.) are good estimates of the corresponding population or distribution parameters.

Properties of Estimates

There are a few properties that we like to see in a parameter estimate.

On average (over many samples), an estimate should give the correct value for the parameter. If the mean of the sampling distribution of our estimate is the parameter we are estimating, that is,

E(θ̂) = θ,

we say that θ̂ is an unbiased estimate of θ.



Example: We know that E(X̄) = μ, so X̄ is an unbiased estimate of μ.

Also, E(s²) = σ² and E(p̂) = p (proof:)

so the sample variance and proportion are unbiased estimates of the population variance and proportion.

This is why we divide by (n − 1) instead of n to find s².


On the other hand, the sample standard deviation, s, has E(s) ≠ σ, so s is a biased estimate for σ.

Fortunately, the bias (defined as E(s) − σ, or more generally, E(θ̂) − θ) is small, especially as n gets large.


Note that just because an estimate is unbiased, does not guarantee that it will give you the exact parameter on this (or possibly, any) sample.

Example: X ~ Binomial(n = 25, p = 0.3). Even though p̂ = X/n is unbiased for p, there is no value of X that will give p̂ = 0.3.

Remember our sampling distributions; an unbiased estimate's distribution will be centered correctly, but it will still have some spread.



The variance of the sampling distribution of our estimate measures that spread and is also important in measuring how well it performs.


We combine these two aspects into a single measure, the mean squared error:

MSE(θ̂) = E[(θ̂ − θ)²] = V(θ̂) + [Bias(θ̂)]².

A small MSE means that both bias and variance are small.


Example: Suppose X1 and X2 are independent, with E(X1) = E(X2) = μ and V(X1) = V(X2) = σ².

Let

Find:



Example (continued): Let

Find:

For what values of μ and σ² is



Confidence Intervals (5.2)

Having a good estimate is a good first step in learning about a population parameter.

We should also be interested in how close our estimate is likely to be to the parameter.

One approach is to calculate the standard error, remembering that we will usually be within 2-3 standard errors of the parameter (if we use an unbiased estimate).


Another way to look at this issue is that we know our estimate is incorrect. (We just don't know by exactly how much.)

We can improve this situation by expanding our point estimate to an interval estimate, providing a range of plausible values for θ.

Done carefully, we can identify how likely it is that our interval includes θ.



If our sample size, n, is large, we can use the Central Limit Theorem to give us the following.

P(−1.96 ≤ (X̄ − μ)/(s/√n) ≤ 1.96) ≈ 0.95

Therefore, the interval

X̄ ± 1.96 s/√n

is a random interval which covers the population mean with probability 0.95.

We call such an interval a 95% confidence interval.

This represents a set of plausible values of μ that are consistent with the data.

Example: A random sample of 80 auto body shops for cost to repair a particular kind of damage have mean $472.36 and standard deviation $62.35.

What is the 95% confidence interval for the mean of this population?
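The body-shop interval can be sketched as the sample mean plus or minus 1.96 standard errors. The function name `z_interval` is an illustrative assumption, and the large-sample (n = 80) normal approximation is taken for granted.

```python
from math import sqrt

def z_interval(xbar, s, n, z=1.96):
    """Large-sample confidence interval: xbar +/- z * s / sqrt(n)."""
    half_width = z * s / sqrt(n)
    return xbar - half_width, xbar + half_width

# Repair costs: n = 80 shops, sample mean $472.36, sample sd $62.35
low, high = z_interval(xbar=472.36, s=62.35, n=80)

print(round(low, 2), round(high, 2))
```

Passing a different critical value through the `z` argument (for example 1.645 or 2.575) gives intervals at other confidence levels.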




Is it correct to say P(458.70 ≤ μ ≤ 486.02) = 0.95?

No! Nothing inside the probability statement is random. Recall:

P(X̄ − 1.96 s/√n ≤ μ ≤ X̄ + 1.96 s/√n) ≈ 0.95

The random parts are the sample statistics.

The interval is random, not the population parameter, μ.


If we constructed many 95% confidence intervals from independent datasets, we'd get many different sample means and sample standard deviations, and each would lead to a different confidence interval.

In the long run, about 95% of these different confidence intervals would contain the true parameter μ.

Remember, randomness is in the sample and the interval, not in the parameter!




We call the value 95% the confidence level. We say we are 95% confident that the population mean lies within the computed interval.

We can select other confidence levels if desired, by replacing the critical value 1.96 with the Z-percentile that gives the appropriate center probability.

A confidence level of 95% (1.96) is most common, but levels of 90% (1.645) and 99% (2.575) are also often used.

In general, define z_p to be the value above which there is probability p in the tail of the standard normal distribution.

Then z_p will be the 100(1 − p)th percentile of the standard normal distribution.

For a 100(1 − α)% confidence interval, we use the critical value z_(α/2).

Example: What critical value would we use for an 80% confidence interval?






What factors affect the length (precision) of the confidence interval?

s: If s is bigger, X̄ is less accurate, and the interval must be wider.

Confidence level: To be more confident of including the true value, we must make the interval wider.

n: As n gets bigger, the standard error of X̄ gets smaller, and the interval gets narrower.


If we require a 95% confidence interval of error width (interval half-width) no more than w, we can compute a (rough) minimum sample size if we have an estimate or upper bound for s:

n ≥ (1.96 s / w)²

Of course, we can substitute the appropriate Z critical value to find sample sizes for other confidence levels.



Example: Milk fill weights. n = 50, x̄ = 2.0727, s = 0.0711. Find a 95% confidence interval for μ.

w = ?

If we require w ≤ 0.01, how big should n be?
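A sketch of the milk example follows: the half-width w is 1.96 standard errors, and the rough minimum sample size comes from solving 1.96 s/√n ≤ w for n and rounding up. Treating the current s as a stand-in for the standard deviation of the future sample is an assumption.

```python
from math import ceil, sqrt

n, xbar, s = 50, 2.0727, 0.0711
z = 1.96  # critical value for 95% confidence

# Half-width and interval from the current sample.
w = z * s / sqrt(n)
interval = (xbar - w, xbar + w)

# Smallest whole n with z * s / sqrt(n) <= target, i.e. n >= (z * s / target)^2.
target_w = 0.01
n_required = ceil((z * s / target_w) ** 2)

print(round(w, 4), tuple(round(end, 4) for end in interval), n_required)
```

Because the half-width shrinks with the square root of n, roughly halving w from about 0.02 to 0.01 needs about four times as many observations.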

Confidence Bounds

Sometimes, we only wish to know a lower (or upper) bound on μ.

We can generate one-sided confidence intervals, also called confidence bounds, in a similar way to the usual two-sided case.


If we have a large sample, then:

A 95% lower confidence bound for μ is x̄ - 1.645 s/√n.

A 95% upper confidence bound for μ is x̄ + 1.645 s/√n.

To get 90%, 99%, or 100(1-α)% bounds, replace 1.645 with 1.28, 2.33, or z_α, respectively.



Example: A sample of 48 shear strength measurements gives a mean of 17.17 N/mm² and a standard deviation of 3.28 N/mm².

If we only care that the population mean shear strength is great enough, find a 90% lower bound on μ.
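A short numerical sketch of this lower bound, using the one-sided 90% critical value 1.28 and the large-sample form x̄ - z s/√n (variable names are chosen here for illustration):

```python
import math

n, xbar, s = 48, 17.17, 3.28       # shear strength data, in N/mm^2
z = 1.28                            # one-sided 90% critical value z_{0.10}

lower_bound = xbar - z * s / math.sqrt(n)
print(f"90% lower confidence bound: {lower_bound:.2f} N/mm^2")
```

The bound is about 16.56 N/mm², so we are 90% confident that the population mean shear strength is at least that large.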



For our normal-based confidence interval and level to be valid, we must know (or at least assume) that:

The sample is a random draw from the population.

The sample size n is large enough that the sample mean is approximately normally distributed and that s is a good estimate of σ.


Chapter 6: Hypothesis Testing

Estimation (both point and interval) is useful for providing an idea of the value of a population parameter.

Frequently, we may wish to investigate a more specific question about a parameter. For this purpose, we use the other major branch of inferential statistics, hypothesis testing.


One-Sample Z-Tests (6.1-6.2)

Example: (Milk data) Suppose our bottle-filling machine is supposed to dispense 2.04 L of milk. Recall, a sample of size 50 gave x̄ = 2.0727, s = 0.0711. Does the machine need to be recalibrated?

To answer this, let's assume that the machine is working properly, and see how likely we are to get a sample mean as far or further from the expected value as the sample mean we actually saw (2.0727).



More formally, we choose a null hypothesis, H0.

This is a statement about a population parameter (say, μ), generally that it is equal to the value of interest (denoted μ0).

Usually, the null hypothesis means everything is as it should be, or nothing interesting is happening.

Here: H0: μ = 2.04 (= μ0)


We also choose an alternative hypothesis, H1, that the null is incorrect.

H1: μ ≠ 2.04

Literally, the alternative says only that the null is incorrect, but this is often the more interesting or important result.



Next, we compute a test statistic, under the assumption that H0 is correct.

For large-sample tests on the population mean, μ, we usually use the z-statistic:

z = (x̄ - μ0) / (s / √n)

Here: z = ?

If H0 is true, then z ~ N(0, 1), approximately.

Is z a typical value from a N(0, 1) distribution?


Formally, we find a P-value, the probability that a sample from the null distribution would give a test statistic as unusual as, or more unusual than, the one we just saw.

Since H1: μ ≠ 2.04, we use a two-sided P-value: P = P(|z| ≥ 3.25) (z ~ N(0, 1)).

From our table, if z ~ N(0, 1), P(|z| ≥ 3.25) = .0012.


So we have two possibilities:

1) H0 is correct, μ = 2.04, and we got very unlucky to hit the (roughly) 1 in 800 chance of getting 2.0727 (or the equally unusual 2.0073), or

2) H0 is wrong.

Which seems more reasonable to believe?

Since P is so small, we reject H0 and decide the filling machine does require recalibration.
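The milk calculation above can be reproduced with the standard library's normal distribution in place of the table. This is a sketch; the variable names are chosen here for illustration.

```python
import math
from statistics import NormalDist

n, xbar, s, mu0 = 50, 2.0727, 0.0711, 2.04   # milk data and null value

# Test statistic: how many standard errors xbar lies from mu0.
z = (xbar - mu0) / (s / math.sqrt(n))

# Two-sided P-value: probability of |Z| >= |z| when Z ~ N(0, 1).
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, P = {p_two_sided:.4f}")
```

The statistic rounds to z = 3.25, and the computed P-value is close to the table value of .0012 (the small difference comes from rounding z before the table lookup).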



All hypothesis tests follow this general pattern:

1) We observe some difference in a sample and wish to decide if it reflects a true difference in the population.

2) Identify the null and alternative hypotheses.

3) Compute a test statistic which has a known distribution when the null hypothesis is true.

4) Find a P-value: the probability of a statistic as or more unusual than the one we observed, when the null hypothesis is true.

5) If P is small, reject the null hypothesis. Otherwise, fail to reject it.


This basic pattern holds for many different tests on different parameters with different assumptions.

For questions about the population mean for a single population, we often use the one-sample z-test demonstrated above.


Details on the one-sample z-test:

1) We have a single population, and a specific value, μ0, we wish to consider for the population mean.

This may be a known population mean for some related population (see next example).

Or it may be a desired population mean (example: milk data).

A sample from the population will give a sample mean different from μ0, even if that is the actual population mean.



2) Identify H0 and H1.

H1 is a statement that something interesting is going on. It is usually what we wish to prove.

We should decide if we care about a one-sided or two-sided alternative, ideally before we ever see data.

Two-sided: H0: μ = μ0 vs. H1: μ ≠ μ0.

One-sided: H0: μ ≤ μ0 vs. H1: μ > μ0

or: H0: μ ≥ μ0 vs. H1: μ < μ0

We always compute z and P using μ0, so μ = μ0 is always part of H0.
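The choice of alternative changes only which tail(s) of N(0, 1) count toward P. A minimal sketch, assuming a large sample; the function name one_sample_z_test and the labels "two-sided", "greater", and "less" are conventions chosen for this illustration.

```python
import math
from statistics import NormalDist

def one_sample_z_test(xbar, s, n, mu0, alternative="two-sided"):
    """Large-sample z-test of H0 against the chosen alternative.

    alternative: "two-sided" (H1: mu != mu0), "greater" (H1: mu > mu0),
    or "less" (H1: mu < mu0). Returns (z, P).
    """
    z = (xbar - mu0) / (s / math.sqrt(n))   # always computed using mu0
    phi = NormalDist().cdf(z)               # P(Z <= z) for Z ~ N(0, 1)
    if alternative == "two-sided":
        p = 2 * min(phi, 1 - phi)           # both tails
    elif alternative == "greater":
        p = 1 - phi                         # upper tail only
    elif alternative == "less":
        p = phi                             # lower tail only
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    return z, p

# Milk data with the two-sided alternative used earlier in the notes.
z, p = one_sample_z_test(2.0727, 0.0711, 50, 2.04)
print(f"z = {z:.2f}, P = {p:.4f}")
```

For the same data, the "greater" P-value is half the two-sided one, while the "less" P-value is close to 1. This is one reason to fix the alternative before looking at the data.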


Example: A newspaper article says that college freshmen average 7.5 hours per week at parties.

We suspect the number is lower at our college.

H0 = ?

H1 = ?

Math