Chapter 1

STAT 505

PROBABILITY & STATISTICS FOR ENGINEERING AND SCIENCE

PROBABILITY

Informally, probable is one of several words applied to uncertain events or knowledge, being closely related in meaning to likely, risky, hazardous, and doubtful.

Chance, odds, and bet are other words expressing similar notions.

The theory of probability attempts to quantify the notion of probable.

Probability always lies between 0 and 1.

If the probability of an event is 1, the event is certain to happen; if the probability is 0, the event will not occur.

STATISTICS

Statistics is a mathematical science pertaining to the collection, analysis, interpretation and presentation of data. It is applicable to a wide variety of academic disciplines from the physical and social sciences to the humanities, as well as to business, government, medicine and industry.

Given a collection of data, statistics may be employed to summarize or describe the data; this use is called descriptive statistics.

EXAMPLES: optical character recognition, speech recognition, genomics, computational biology, survival analysis, statistical genetics, portfolio optimization and management, financial risk management, credit rating/scoring.

Statistical Packages

R http://cran.r-project.org/bin/windows/base/ http://stat.ethz.ch/R-manual/R-patched/doc/html/ http://www.omegahat.org/REventLoop/man.pdf http://www.r-project.org/other-docs.html

SAS http://v8doc.sas.com/sashtml/

Descriptive Statistics

Descriptive statistics can be used to summarize the data, either numerically or graphically, to describe the sample.

Basic examples of numerical descriptions include the mean and standard deviation.

Graphical summaries include various kinds of charts and graphs.

Pareto Charts

A Pareto chart is a bar graph for qualitative data, with the bars arranged in order according to frequencies.

How To Construct A Pareto Chart

A Pareto chart can be constructed by segmenting the range of the data into groups (also called segments, bins or categories).

For example, if your business was investigating the delay associated with processing credit card applications, you could group the data into the following categories:

- No signature
- Residential address not valid
- Non-legible handwriting
- Already a customer
- Other

The left-side vertical axis of the Pareto chart is labeled Frequency (the number of counts for each category), the right-side vertical axis is the cumulative percentage, and the horizontal axis is labeled with the group names of your response variables.
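As a rough illustration of the construction described above, the sketch below sorts categories by frequency and accumulates the cumulative percentage used on the right-side axis. It borrows the accidental-death counts tabulated in the Pie Chart section of this chapter; the function name `pareto_table` is an illustrative choice, not part of any library.

```python
# Accidental-death counts from the Pie Chart table in this chapter (USA, 2002).
counts = {
    "Motor Vehicle": 43500,
    "Falls": 12200,
    "Poison": 6400,
    "Drowning": 4600,
    "Fire": 4200,
    "Ingestion of Food/Object": 2900,
    "Firearms": 1400,
}


def pareto_table(counts):
    """Return (category, frequency, cumulative percent) rows, largest first."""
    total = sum(counts.values())
    rows = []
    running = 0
    # Sort by frequency, descending, as the bars of a Pareto chart are ordered.
    for category, freq in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += freq
        rows.append((category, freq, 100 * running / total))
    return rows


pareto_rows = pareto_table(counts)
for category, freq, cum_pct in pareto_rows:
    print(f"{category:<26}{freq:>8}{cum_pct:>8.1f}%")
```

Reading down the cumulative column shows how few categories account for most of the total, which is the question the 80/20 rule asks.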

What Questions The Pareto Chart Answers

What are the largest issues facing our team or business?

What 20% of sources are causing 80% of the problems (80/20 Rule)?

Where should we focus our efforts to achieve the greatest improvements?

[Figure: Pareto chart of accidental deaths. Vertical axis: frequency, 5000 to 45000. Categories shown: Poison, Drowning, Falls, Motor Vehicles, Fire, Firearms, Ingestion of Food/Object]

Stem-and-Leaf Plots

A Stem-and-Leaf Plot is very useful. It can show the distribution of the data, yet not lose the actual data points.

The Stem-and-Leaf plot is most easily explained using an example.

Consider the following data, which represents the daily high temperature for a city over a 40-day span:

78 76 82 75 85 82 78 74 83 90 70 76 85 92 87 67 65 68 73 74 83 88 86 85 92 90 82 75 69 80 85 77 86 85 90 85 80 70 65 60

We can see that the data ranges from 60 to 92.

We sort the data from lowest to highest: 60 65 65 67 68 69 70 70 73 74 74 75 75 76 76 77 78 78 80 80 82 82 82 83 83 85 85 85 85 85 85 86 86 87 88 90 90 90 92 92

Now we can create the stem-and-leaf plot as follows:

Stem | Leaves
6 | 055789
7 | 003445566788
8 | 00222335555556678
9 | 00022


If you look at the page sideways you can see the distribution of the data. The same rule that says you should use 5 to 20 classes of data in a histogram applies to a stem-and-leaf diagram. The diagram could be expanded to include more rows or condensed to include fewer rows.
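The construction can be sketched in a few lines of Python. This is a minimal illustration that assumes two-digit values, with the tens digit as the stem and the units digit as the leaf; the helper name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Daily high temperatures from the example above.
temps = [78, 76, 82, 75, 85, 82, 78, 74, 83, 90,
         70, 76, 85, 92, 87, 67, 65, 68, 73, 74,
         83, 88, 86, 85, 92, 90, 82, 75, 69, 80,
         85, 77, 86, 85, 90, 85, 80, 70, 65, 60]


def stem_and_leaf(values):
    """Map each tens digit (stem) to the sorted string of units digits (leaves)."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):
        leaves_by_stem[v // 10].append(str(v % 10))
    return {stem: "".join(leaves) for stem, leaves in sorted(leaves_by_stem.items())}


plot = stem_and_leaf(temps)
for stem, leaves in plot.items():
    print(stem, "|", leaves)
```

Because every leaf is kept, the original values can be read back from the plot, which is the property that distinguishes it from a histogram.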

HISTOGRAM

A histogram is like a bar chart - it consists of a horizontal scale for values of the data being represented and a vertical scale for frequencies, and bars representing the frequency of each class of values.

A relative frequency histogram will have the same shape and horizontal scale as the histogram - but the vertical scale will be marked with relative frequencies.

Theoretically each bar should be marked with the lower class boundary at the left and upper class boundary at the right.
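To make the link between frequencies and relative frequencies concrete, here is a small sketch that tabulates class counts for the temperature data from the stem-and-leaf example. The class width of 10 and the starting boundary of 60 are assumptions chosen for illustration, and `class_frequencies` is a hypothetical helper.

```python
# Daily high temperatures from the stem-and-leaf example.
temps = [78, 76, 82, 75, 85, 82, 78, 74, 83, 90,
         70, 76, 85, 92, 87, 67, 65, 68, 73, 74,
         83, 88, 86, 85, 92, 90, 82, 75, 69, 80,
         85, 77, 86, 85, 90, 85, 80, 70, 65, 60]


def class_frequencies(values, lower, width, n_classes):
    """Count values in each class [lower + k*width, lower + (k+1)*width)."""
    freqs = [0] * n_classes
    for v in values:
        k = int((v - lower) // width)
        if 0 <= k < n_classes:
            freqs[k] += 1
    return freqs


freqs = class_frequencies(temps, lower=60, width=10, n_classes=4)
rel_freqs = [f / len(temps) for f in freqs]

# A text histogram: bar length is the frequency, followed by the relative frequency.
for k, (f, r) in enumerate(zip(freqs, rel_freqs)):
    lo = 60 + 10 * k
    print(f"{lo}-{lo + 9}: {'#' * f} {f} ({r:.3f})")
```

The bars have the same shape whichever scale is used; only the labels on the vertical axis change.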


Histograms are more commonly used than stem-and-leaf plots. They preserve some information about the shape of the data distribution and are not limited by the size of the data set.

The purpose of a histogram is to display data collected from a process graphically, showing how the distribution is centered around the mean or the main specification. From the data, the histogram will graphically show:

- The center of the data.
- The spread of the data.
- Any skewness in the data.
- The presence of outliers (product outside the specification range).
- The presence of multiple modes (or peaks) within the data.


BOXPLOT

In 1977, John Tukey published an efficient method for displaying a five-number data summary. The graph is called a boxplot and summarizes the following statistical measures:

- median
- upper and lower quartile
- minimum and maximum value

Histograms are excellent for focusing attention on key aspects of the shape of a distribution (symmetry, skewness), but they are not good tools for making comparisons among datasets. Boxplots are ideal for making comparisons.

John Wilder Tukey

John Wilder Tukey (June 16, 1915 - July 26, 2000) was a statistician born in New Bedford, Massachusetts.

Tukey obtained an A.B. in 1936 and an Sc.M. in 1937, both in Chemistry, from Brown University, before moving to Princeton University where he received his Ph.D. in mathematics. During World War II, Tukey worked at the Fire Control Research Office and collaborated with Samuel Wilks and William Cochran. After the war, he returned to Princeton, dividing his time between the university and AT&T Bell Laboratories.

Lottery payoffs for winning numbers for three time periods (May 1975-March 1976, November 1976-September 1977, and December 1980-September 1981).

Boxplots

The median for each dataset is indicated by the black center line, and the first and third quartiles are the edges of the red area, which is known as the inter-quartile range (IQR).

The whiskers (the lines extending from the box) end at the most extreme values that lie within 1.5 times the inter-quartile range of the lower or upper quartile. Points farther than 1.5 times the IQR from the nearest quartile are plotted individually as asterisks. These points represent potential outliers.

In this example, the three boxplots have nearly identical median values. The IQR is decreasing from one time period to the next, indicating reduced variability of payoffs in the second and third periods. In addition, the extreme values are closer to the median in the later time periods.

Dot Plot

In a dot plot, each data entry is plotted, using a point, above a horizontal axis.

Use a dot plot to display the ages of the 30 students in the statistics class.

18 20 21 27 29 20

19 30 32 19 34 19

24 29 18 37 38 22

30 39 32 44 33 46

54 49 18 51 21 21

[Figure: dot plot titled "Ages of Students". Horizontal axis from 15 to 57 in steps of 3]

From this graph, we can conclude that most of the values lie between 18 and 32.
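A text version of the dot plot can be produced by counting how often each age occurs. The sketch below is only an illustration: it prints one row per distinct age, so it is rotated relative to the usual horizontal-axis layout, and it also checks the conclusion about the 18 to 32 range.

```python
from collections import Counter

# Ages of the 30 students in the statistics class.
ages = [18, 20, 21, 27, 29, 20,
        19, 30, 32, 19, 34, 19,
        24, 29, 18, 37, 38, 22,
        30, 39, 32, 44, 33, 46,
        54, 49, 18, 51, 21, 21]

age_counts = Counter(ages)

# One row per distinct age; each dot stands for one student.
for age in sorted(age_counts):
    print(f"{age:>2} {'.' * age_counts[age]}")

# How many ages fall in the interval discussed in the text?
in_range = sum(1 for a in ages if 18 <= a <= 32)
print(f"{in_range} of {len(ages)} ages lie between 18 and 32")
```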

Pie Chart

A pie chart is a circle that is divided into sectors that represent categories. The area of each sector is proportional to the frequency of each category.

Accidental Deaths in the USA in 2002

(Source: US Dept. of Transportation)

Type Frequency

Motor Vehicle 43,500

Falls 12,200

Poison 6,400

Drowning 4,600

Fire 4,200

Ingestion of Food/Object 2,900

Firearms 1,400

To create a pie chart for the data, find the relative frequency (percent) of each category.

Type                       Frequency   Relative Frequency

Motor Vehicle              43,500      0.578

Falls                      12,200      0.162

Poison                     6,400       0.085

Drowning                   4,600       0.061

Fire                       4,200       0.056

Ingestion of Food/Object   2,900       0.039

Firearms                   1,400       0.019

n = 75,200

Next, find the central angle. To find the central angle, multiply the relative frequency by 360°.

Type                       Frequency   Relative Frequency   Angle

Motor Vehicle              43,500      0.578                208.2°

Falls                      12,200      0.162                58.4°

Poison                     6,400       0.085                30.6°

Drowning                   4,600       0.061                22.0°

Fire                       4,200       0.056                20.1°

Ingestion of Food/Object   2,900       0.039                13.9°

Firearms                   1,400       0.019                6.7°
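The two columns above can be reproduced with a short calculation. In this sketch (the function name `pie_sectors` is illustrative) the angle is computed from the unrounded relative frequency, which appears to be how the tabulated angles were obtained: for example, 0.578 × 360° would give 208.1°, not the 208.2° shown.

```python
# Accidental deaths in the USA in 2002, from the table above.
counts = {
    "Motor Vehicle": 43500,
    "Falls": 12200,
    "Poison": 6400,
    "Drowning": 4600,
    "Fire": 4200,
    "Ingestion of Food/Object": 2900,
    "Firearms": 1400,
}


def pie_sectors(counts):
    """Return {category: (relative frequency, central angle in degrees)}."""
    total = sum(counts.values())
    return {name: (freq / total, 360 * freq / total) for name, freq in counts.items()}


sectors = pie_sectors(counts)
for name, (rel, angle) in sectors.items():
    print(f"{name:<26}{rel:>7.3f}{angle:>8.1f} degrees")
```

The angles sum to 360° up to floating-point rounding, which is a quick check that no category has been left out.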

[Figure: pie chart of accidental deaths. Motor vehicles 57.8%, Falls 16.2%, Poison 8.5%, Drowning 6.1%, Fire 5.6%, Ingestion 3.9%, Firearms 1.9%]

Time Series Chart

A data set that is composed of quantitative data entries taken at regular intervals over a period of time is a time series. A time series chart is used to graph a time series.

Example: The following table lists the number of minutes Robert used on his cell phone for the last six months.

Month Minutes

January 236

February 242

March 188

April 175

May 199

June 135

Construct a time series chart for the number of minutes used.

[Figure: time series chart titled "Robert’s Cell Phone Usage". Vertical axis: Minutes, 0 to 250. Horizontal axis: Month, Jan to June]

Quartiles and Percentiles

Percentile: A percentile is a measure that tells us what percent of the total frequency scored at or below that measure.

Quartiles: Quartiles are percentiles that divide the data into four equal parts. The first quartile is the 25th percentile, the second quartile (the median) is the 50th percentile, and the third quartile is the 75th percentile.

The median is the value in the middle of the ordered array, the lower quartile is the middle value of the half of the data below the median, and the upper quartile is the middle value of the half of the data above the median.


 i   x[i]
 1   102
 2   104
 3   105   ---- the first quartile, Q1 = 105
 4   107
 5   108
 6   109   ---- the second quartile, Q2 or median = 109
 7   110
 8   112
 9   115   ---- the third quartile, Q3 = 115
10   115
11   118
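The median-of-halves rule stated above can be sketched as follows. This is one convention among several: statistical packages often interpolate between observations and may report slightly different quartiles. The function name `quartiles` is an illustrative choice, and the sketch assumes that when the number of observations is odd the median itself is left out of both halves, as the description implies.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    When the number of observations is odd, the median itself is
    excluded from both the lower and the upper half.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return median(lower), median(xs), median(upper)


data = [102, 104, 105, 107, 108, 109, 110, 112, 115, 115, 118]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)  # 105 109 115
```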


For this data set:

- smallest non-outlier observation = 5 (left "whisker")
- lower (first) quartile (Q1, x.25) = 7
- median (second quartile) (Med, x.5) = 8.5
- upper (third) quartile (Q3, x.75) = 9
- largest non-outlier observation = 10
- interquartile range, IQR = Q3 − Q1 = 2
- the value 3.5 is a "mild" outlier, between 1.5*(IQR) and 3*(IQR) below Q1
- the value 0.5 is an "extreme" outlier, more than 3*(IQR) below Q1
- the data is skewed to the left (negatively skewed)


                            +-----+-+
  *           o     |-------|     | |---|
                            +-----+-+
+---+---+---+---+---+---+---+---+---+---+   number line
0   1   2   3   4   5   6   7   8   9  10
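The mild and extreme outlier labels follow from comparing each value's distance outside the box with 1.5*(IQR) and 3*(IQR). A minimal sketch, using the quartiles Q1 = 7 and Q3 = 9 quoted above; the function name `classify` and the label strings are illustrative assumptions.

```python
def classify(value, q1, q3):
    """Label a value 'none', 'mild', or 'extreme' using 1.5*IQR and 3*IQR fences."""
    iqr = q3 - q1
    # Distance outside the box: zero for values between Q1 and Q3.
    distance = max(q1 - value, value - q3, 0)
    if distance > 3 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:
        return "mild"
    return "none"


Q1, Q3 = 7, 9
labels = {v: classify(v, Q1, Q3) for v in (0.5, 3.5, 5, 10)}
print(labels)
```

With IQR = 2 the fences below Q1 sit at 4 and 1, so 3.5 falls between them and 0.5 falls beyond the outer one, in agreement with the summary.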

Measures of the center

Now that we have seen how to picture data, we will explore methods of measuring characteristics of data.

The measure we first look at is a measure of central tendency. This is a value at the center or middle of a data set.

Consider the following example where we introduce the mean, median, mode and midrange. Here is some data

10 11 12 12 15 17 21 22 23 27


The mean (or arithmetic mean) is the average of these data points. To calculate the mean, add the data points and divide by the number of data points. The mean is denoted by $\bar{x}$. In our example above:

Sum of data points: 10+11+12+12+15+17+21+22+23+27 = 170

Number of data points = 10

Average = 170/10 = 17

The median is the middle value when the scores are arranged in order of increasing (or decreasing) magnitude. To calculate the median, follow this rule:

- If the number of scores is odd, the median is the number located in the exact middle of the list.
- If the number of scores is even, the median is found by computing the mean of the two middle numbers.

NOTE: TO APPLY THE RULES ABOVE THE LIST MUST BE SORTED!


In our example above: 15 and 17 are the middle numbers. So the median is (15+17)/2 = 16.

The mode of the data set is the score that occurs most frequently. When two scores occur with the same greatest frequency, each one is a mode and the data is bimodal. If more than two scores occur with the same greatest frequency, each is a mode and the data is multimodal. When all scores occur just once there is no mode. The mode is denoted by M.

The value 12 in the above dataset occurs most frequently and is therefore the mode.

The midrange is simply (low value + high value)/2. In our example above this is (10+ 27)/2 = 37/2 = 18.5
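The four measures can be checked with Python's standard `statistics` module. This is a small sketch; note one difference from the definition above: when every value occurs once, `multimode` returns all the values instead of reporting that there is no mode.

```python
from statistics import mean, median, multimode

data = [10, 11, 12, 12, 15, 17, 21, 22, 23, 27]

data_mean = mean(data)                            # sum of values / number of values
data_median = median(data)                        # mean of the two middle values (n is even)
data_modes = multimode(data)                      # list of the most frequent value(s)
data_midrange = (min(data) + max(data)) / 2       # (low value + high value) / 2

print(data_mean, data_median, data_modes, data_midrange)
```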

SOME MATHEMATICAL NOTATION

Mathematicians like to have symbols to represent complicated calculations. Here are some we will use throughout the course:

- ∑ denotes the summation of a group of values (this means add them all up)
- x denotes the variable, usually used to represent the individual data values
- n represents the number of values in a sample
- N represents the number of values in a population
- $\bar{x} = \frac{\sum x}{n}$ is the mean of a sample
- $\mu = \frac{\sum x}{N}$ is the mean of a population

Measures of Variation

Measures of central tendency tell us where the middle of a set of data occurs, but this is not enough to characterize a set of data.

Consider the following 2 data sets:

50 60 70 80 90

and

69 69 70 71 71

Both these data sets have a mean of 70. Yet the first data set is more widely dispersed than the second data set. So a measure of variation is clearly needed.

Consider the following data - it represents the actual weight of a 20 oz steak at a restaurant. We will use this throughout this section

17 20 21 18 20 20 20 18 19 19 20 19 22 20 18 20 18 19 20 19

The range is the difference between the highest value and the lowest value in a dataset.

To compute it simply subtract the lowest value from the highest value. In the example above the range is (22-17)=5

Range can be misleading since it does not take into consideration every value. Consider each of the following data sets:

1 10 10 10 10

And 1 2 5 8 10

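Both have a range of 9, yet the values are distributed very differently: in the first data set four of the five values are identical, while in the second the values are spread across the whole range.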


A more accurate measure of variation can be given by the standard deviation of the data.

The standard deviation of a set of sample scores is a measure of variation of scores about the mean. It is calculated by

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$


The procedure for finding the standard deviation is as follows:

1. Find the mean of the scores.
2. Subtract the mean from each individual score.
3. Square each of the values in step 2.
4. Add up all the squares obtained in step 3.
5. Divide the total in step 4 by n-1.
6. Find the square root of step 5.
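The six steps translate directly into code. The sketch below applies them to the steak weights given earlier in this section; the function name `sample_variance_and_sd` is an illustrative choice. The value produced at step 5 is the sample variance, and step 6 turns it into the standard deviation.

```python
from math import sqrt

# Actual weights (oz) of a 20 oz steak, from the data above.
weights = [17, 20, 21, 18, 20, 20, 20, 18, 19, 19,
           20, 19, 22, 20, 18, 20, 18, 19, 20, 19]


def sample_variance_and_sd(xs):
    """Follow the six-step procedure; return (variance, standard deviation)."""
    n = len(xs)
    x_bar = sum(xs) / n                      # step 1: mean of the scores
    deviations = [x - x_bar for x in xs]     # step 2: subtract the mean
    squares = [d ** 2 for d in deviations]   # step 3: square each deviation
    total = sum(squares)                     # step 4: add up the squares
    variance = total / (n - 1)               # step 5: divide by n - 1
    return variance, sqrt(variance)          # step 6: take the square root


variance, sd = sample_variance_and_sd(weights)
print(f"mean = {sum(weights) / len(weights):.2f}, s^2 = {variance:.4f}, s = {sd:.4f}")
```

For these data the mean is 19.35 and the sum of squared deviations is 26.55, so the variance is 26.55/19.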


The sample variance is the standard deviation squared. To calculate it, follow all the steps for the standard deviation except taking the final square root. Here is the formula:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Interpretation of standard deviation

A small standard deviation means the data is close together; a large standard deviation means the data is widely spread.

The range rule of thumb states that for typical data sets, the range of the data is about 4 standard deviations wide, so the standard deviation is about the range divided by 4. This is a very rough estimate.

The 68-95-99.7 rule states that about 68% of all scores fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations of the mean, and about 99.7% fall within 3 standard deviations of the mean. This only works for data that is approximately bell shaped.

The above rule tells us that data more than 2 standard deviations from the mean is unusual, while data within 2 standard deviations is normal.

Chebyshev's Theorem states that at least 75% of all scores fall within 2 standard deviations of the mean and at least 89% fall within 3 standard deviations of the mean. This works for ANY distribution (not just bell shaped).
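The 75% and 89% figures come from the Chebyshev bound 1 - 1/k² evaluated at k = 2 and k = 3. The sketch below computes that bound and, for comparison, the fraction of the steak weights that actually falls within k standard deviations of the mean; the helper names `chebyshev_bound` and `fraction_within` are illustrative.

```python
from statistics import mean, stdev

# Steak weights (oz) from the Measures of Variation example.
weights = [17, 20, 21, 18, 20, 20, 20, 18, 19, 19,
           20, 19, 22, 20, 18, 20, 18, 19, 20, 19]


def chebyshev_bound(k):
    """Minimum fraction of any data set within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2


def fraction_within(xs, k):
    """Observed fraction of xs within k sample standard deviations of the mean."""
    x_bar, s = mean(xs), stdev(xs)
    return sum(1 for x in xs if abs(x - x_bar) <= k * s) / len(xs)


for k in (2, 3):
    print(k, round(chebyshev_bound(k), 3), fraction_within(weights, k))
```

The observed fractions are well above the Chebyshev minimums, as expected: the theorem is a guarantee for any distribution, so it is conservative for data that is roughly bell shaped.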

Z-Scores

How do we compare values from two different sets of data?

Suppose you are comparing gas mileage on two separate kinds of automobiles, say light trucks and compact cars. Assume the mean for light trucks is 23.6 miles per gallon with a standard deviation of 3.6 miles per gallon, and the mean for compact cars is 28.7 miles per gallon with a standard deviation of 5.7 miles per gallon.

You want to compare a light truck with a rating of 27.5 miles per gallon and a compact car with a rating of 31.2 miles per gallon. Which one is more "unusual"?

To solve this problem we need some way to standardize these scores, so that we do not have to know what scale is being used. The way to get a standard score is the z-score.

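The standard score, or z-score, is the number of standard deviations that a given value x is above or below the mean. You calculate the z-score using:

$$z = \frac{x - \bar{x}}{s}$$

So for the light truck described above (rating 27.5, class mean 23.6, standard deviation 3.6), the z-score is z = (27.5 - 23.6)/3.6 = 1.08 standard deviations above the mean.

The z-score for the compact car described above (rating 31.2, class mean 28.7, standard deviation 5.7) is z = (31.2 - 28.7)/5.7 = 0.44 standard deviations above the mean.

Relative to its own class, the light truck's rating is therefore the more unusual of the two.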

Example: According to The American Freshman, the number of hours per week that college freshmen spend studying has a mean of 7.06 hours with a standard deviation of 2.32 hours. Suppose Sally Simplestudent spends 2 hours per week studying. Does Sally spend an unusually small amount of time studying?

Her z-score is z = (2 - 7.06)/2.32 = -2.18. Sally is more than 2 standard deviations below the mean, so her low amount of study time is unusual.
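The z-score calculations in this section can be reproduced with a one-line function. In this sketch each value is standardized against the mean and standard deviation of its own group (the truck against the light-truck figures, the car against the compact-car figures), and a value is flagged as unusual when it lies more than 2 standard deviations from the mean. The names `z_score` and `is_unusual` are illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def is_unusual(z):
    """Rule of thumb from the text: more than 2 standard deviations from the mean."""
    return abs(z) > 2


z_truck = z_score(27.5, mean=23.6, sd=3.6)   # light truck vs. light-truck mean
z_car = z_score(31.2, mean=28.7, sd=5.7)     # compact car vs. compact-car mean
z_sally = z_score(2, mean=7.06, sd=2.32)     # Sally's weekly study hours

for name, z in [("truck", z_truck), ("car", z_car), ("Sally", z_sally)]:
    print(f"{name}: z = {z:.2f}, unusual = {is_unusual(z)}")
```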


Intuition: a measure of how far an individual score is from the mean compared to the average distance of scores in the entire distribution from the mean.

Intuition: you can think of z-Scores as simply indicating the number of standard deviations a certain data point is away from the mean.

The Empirical Rule: for any bell-shaped, nearly symmetric distribution of data, the interval $(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the data points, the interval $(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the data points, and the interval $(\bar{x} - 3s, \bar{x} + 3s)$ contains nearly all (approximately 99.7%) of the data points.
