Chapter 7: Statistical Applications in Traffic Engineering


Chapter objectives: By the end of this chapter the student will be able to do the following. (We spend 3 lecture periods on this chapter. We skip simple descriptive statistics because they were covered in CE361.)

Lecture number / Lecture objectives (after these lectures you will be able to)

Lecture 3 (Chap 7a file)

• Apply the basic principles of statistics contained in section 7.1 to traffic data analyses
• Explain the characteristics of the normal distribution, read the normal distribution table correctly (section 7.2), and get necessary values from Excel
• Explain the meaning of confidence bounds and determine the confidence interval of the mean (section 7.3)
• Determine sample sizes of traffic data collection (section 7.4)
• Explain how random variables are added (section 7.5)
• Explain the implication of the central limit theorem (section 7.5.1)
• Explain the characteristics of various probabilistic distributions useful for traffic engineering studies and choose a correct distribution for the study (section 7.6)

Lecture 4a (Chap 7b file)

• Explain the special characteristics of the Poisson distribution and its usefulness to traffic engineering studies (section 7-7)
• Conduct a hypothesis test correctly (two-sided, one-sided, paired test, F-test) (section 7-8)

Lecture 4b (Chap 7 file)

• Conduct a Chi-square test to test hypotheses on an underlying distribution f(x) (section 7-8)

Introduction

Traffic engineering studies: Infer the characteristics in a population (typically infinite) by observing the characteristics of a finite sample.

Statistical analysis is used to address the following questions:

• How many samples are required?
• What confidence should I have in this estimate?
• What statistical distribution best describes the observed data mathematically?
• Has a traffic engineering design resulted in a change in characteristics of the population (hypothesis tests)?

7.1 An Overview of Probability Functions and Statistics

Most of the topics in this section are reviews of what we have learned in CEEn 361. (Review 7.1.1, 7.1.3 and 7.1.4 by yourself.)

7.1.2 Randomness and distributions describing randomness

“Model the system as simply (or as precisely) as possible (or necessary) for all practical purposes.”

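One new topic in 7.1.4 is a method to estimate the standard deviation. It is based on the normal distribution: the probability of being within one standard deviation of the mean (two-sided) is 68.3%. The range between the 15th and 85th percentiles covers 85% - 15% = 70%, which is close enough, so that range is taken as roughly two standard deviations:

$s_{est} = \frac{P_{85} - P_{15}}{2}$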
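The percentile-based estimate of the standard deviation can be tried out with a short sketch. The speed values below are hypothetical (not from the textbook), and the function name is made up for illustration; the ordinary sample standard deviation is computed alongside for comparison.

```python
from statistics import quantiles, stdev

def estimate_std_from_percentiles(data):
    """Estimate the standard deviation as (P85 - P15) / 2."""
    # quantiles(..., n=100) returns the 99 cut points P1..P99
    pct = quantiles(data, n=100, method="inclusive")
    p15, p85 = pct[14], pct[84]
    return (p85 - p15) / 2

# Hypothetical spot speeds in mph (illustrative values only)
speeds = [48, 50, 51, 52, 53, 54, 55, 55, 56, 57, 58, 59, 60, 62, 65]

s_est = estimate_std_from_percentiles(speeds)
s = stdev(speeds)  # conventional sample standard deviation
print(f"s_est = {s_est:.2f} mph, s = {s:.2f} mph")
```

For data that are roughly normal the two values should be of similar size; a large gap suggests the data are not well described by a normal distribution.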

The discussion of turning vehicles is very instructive. P.132 right column.

Connection between the typical computation and probability involving formulas for mean and variance

Typical computation of the mean:

$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$

Data (Population)   P(x)   x*P(x)   (x-µ)^2*P(x)
3.50                0.17   0.58     0.01
4.25                0.17   0.71     0.05
2.70                0.17   0.45     0.17
2.70                0.17   0.45     0.17
3.65                0.17   0.61     0.00
5.50                0.17   0.92     0.53
Sum                        3.72     0.93

Mean 3.72, Variance 0.93

(Population) Mean $\mu = \sum x P(x)$, Variance $\sigma^2 = \sum (x - \mu)^2 P(x)$

(Sample) $s^2 = \frac{\sum (x_i - \bar{x})^2}{N - 1}$
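The link between the probability form and the usual arithmetic form can be checked numerically. This is a minimal sketch using the six data values from the table, with each value assigned P(x) = 1/6 (shown rounded as 0.17 in the table).

```python
data = [3.50, 4.25, 2.70, 2.70, 3.65, 5.50]
N = len(data)
p = 1 / N  # each observation is equally likely

# Population form: weight each value by its probability
mu = sum(x * p for x in data)
var_pop = sum((x - mu) ** 2 * p for x in data)

# Sample form: divide by N - 1 instead of N
x_bar = sum(data) / N
s2 = sum((x - x_bar) ** 2 for x in data) / (N - 1)

print(f"mu = {mu:.2f}, population variance = {var_pop:.2f}")
print(f"x_bar = {x_bar:.2f}, sample variance = {s2:.2f}")
```

The population mean and the typical mean coincide; the sample variance is larger than the population variance by the factor N/(N - 1).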

7.2 The normal distribution and its applications

Mean = 55 mph, STD = 7 mph

What’s the probability the next value will be 65 mph or less?

From the sample normal distribution to the standard normal distribution:

$z = \frac{x - \mu}{\sigma} = \frac{65 - 55}{7} = 1.43$

P(z ≤ 1.43) = 0.9236 from Table 7.3

(Discuss the 3 procedures in p. 137 left column top)

Use of the standard normal distribution table, Table 7-3

Z = 1.43 (the value looked up in Table 7-3)

The most popular range is 95% within µ ± 1.96σ. (Excel functions: NORMSDIST and NORMSINV)
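For readers working outside Excel, the same lookups can be sketched with Python's standard library. Here `NormalDist().cdf` plays the role of NORMSDIST and `NormalDist().inv_cdf` the role of NORMSINV; the numbers are the 55 mph / 7 mph example from this section.

```python
from statistics import NormalDist

mean_speed, std_speed = 55.0, 7.0   # mph
z = (65.0 - mean_speed) / std_speed  # standardize x = 65 mph

std_normal = NormalDist()            # mean 0, standard deviation 1
p_below = std_normal.cdf(z)          # cumulative probability, like NORMSDIST(z)
z_95 = std_normal.inv_cdf(0.975)     # two-sided 95% value, like NORMSINV(0.975)

print(f"z = {z:.2f}, P(speed <= 65 mph) = {p_below:.4f}")
print(f"z for 95% two-sided = {z_95:.2f}")
```

The probability differs from the table value only in the fourth decimal because the table is entered with z rounded to 1.43.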

7.3 Confidence bounds (of the mean)

Point estimates: A point estimate is a single-valued estimate of a population parameter made from a sample.

Interval estimates: An interval estimate is a probability statement that a population parameter is between two computed values (bounds).

[Figure: two-sided interval estimate. The point estimate X̄ from a sample lies at the center of the interval from X̄ – t_α s/sqrt(n) to X̄ + t_α s/sqrt(n); µ marks the true population mean.]

7.3 (cont)

When n gets larger (n >= 30), t can be replaced by z. The probability of a normally distributed random variable y being within 1.96 standard deviations of the mean is 0.95, written as:

$P[(\mu - 1.96\sigma) \le y \le (\mu + 1.96\sigma)] = 0.95$

Obviously we do not know µ and σ. Hence we restate this in terms of the distribution of sample means:

$P[(\bar{x} - 1.96E) \le \mu \le (\bar{x} + 1.96E)] = 0.95$

where E = s/SQRT(n) is the standard error of the mean.

When E is meant to mean tolerance, we use the symbol e.
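As a rough illustration of the confidence bounds of the mean, the sketch below computes the standard error E and the two-sided interval for a large sample (so z is used instead of t). The sample values (mean 55 mph, s = 7 mph, n = 49) and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(x_bar, s, n, confidence=0.95):
    """Two-sided confidence interval of the mean for large n (z instead of t)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # 1.96 for 95%
    E = s / sqrt(n)                                 # standard error of the mean
    return x_bar - z * E, x_bar + z * E

# Hypothetical sample: mean 55 mph, s = 7 mph, n = 49 observations
lower, upper = mean_confidence_interval(55.0, 7.0, 49)
print(f"95% confidence interval: {lower:.2f} to {upper:.2f} mph")
```

With n = 49 the standard error is 7/7 = 1 mph, so the interval is simply 55 ± 1.96 mph.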

7.4 Sample size computations

For cases in which the distribution of means can be considered normal, the confidence range for 95% confidence is:

$\pm 1.96 \frac{s}{\sqrt{n}}$

If this value is called the tolerance (or “precision”), and given the symbol e, then the following equation can be solved for n, the desired sample size:

$e = 1.96 \frac{s}{\sqrt{n}}$ and $n = 3.84 \frac{s^2}{e^2}$

By replacing 1.96 with z and 3.84 with z², we can use this for any level of confidence.
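The sample size formula with a general z can be sketched as follows. The inputs (s = 5 mph, tolerances of 1 and 2 mph) are hypothetical, and the result is rounded up because a fraction of an observation cannot be collected.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(s, e, confidence=0.95):
    """n = z^2 * s^2 / e^2, rounded up to the next whole observation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil((z * s / e) ** 2)

# Hypothetical study: s = 5 mph
n_tight = required_sample_size(s=5.0, e=1.0)  # tolerance of +/- 1 mph
n_loose = required_sample_size(s=5.0, e=2.0)  # tolerance of +/- 2 mph
print(n_tight, n_loose)
```

Halving the tolerance roughly quadruples the required sample size, since n varies with 1/e².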

7.5 Addition of random variables

Summation of random variables:

$Y = \sum_i a_i X_i$

Expected value (or mean) of the random variable Y:

$\mu_Y = \sum_i a_i \mu_{x_i}$

Variance of the random variable Y (for independent X_i):

$\sigma_Y^2 = \sum_i a_i^2 \sigma_{x_i}^2$

These concepts are useful for statistical work. See the sample problems on page 140.
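A small sketch of adding random variables, assuming the variables are independent. The example (a trip made of two links with hypothetical mean travel times and standard deviations) is illustrative only.

```python
from math import sqrt

def combine(coeffs, means, std_devs):
    """Mean and variance of Y = sum(a_i * X_i) for independent X_i."""
    mu_y = sum(a * m for a, m in zip(coeffs, means))
    var_y = sum(a ** 2 * s ** 2 for a, s in zip(coeffs, std_devs))
    return mu_y, var_y

# Hypothetical trip made of two links: travel times in minutes
mu_y, var_y = combine(coeffs=[1, 1], means=[10.0, 20.0], std_devs=[3.0, 4.0])
print(f"mean = {mu_y} min, variance = {var_y}, std = {sqrt(var_y)} min")
```

Note that the standard deviations do not add directly: 3 and 4 minutes combine to 5 minutes, not 7.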

7.5.1 The central limit theorem

Definition: The population may have any unknown distribution with a mean µ and a finite variance of σ². Take samples of size n from the population. As the size of n increases, the distribution of sample means will approach a normal distribution with mean µ and a variance of σ²/n.

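[Figure: the X distribution, X ~ any (µ, σ²), plotted as f(x) versus x with mean µ, approaches the X̄ distribution, X̄ ~ N(µ, σ_X̄²), plotted as f(X̄) versus X̄ with the same mean µ.]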
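The central limit theorem can be illustrated by simulation. This sketch assumes an exponential population (clearly not normal) with µ = 1 and σ² = 1; the sample size, number of samples, and random seed are arbitrary choices.

```python
import random
from statistics import mean, pvariance

rng = random.Random(42)   # fixed seed so the run is repeatable
n = 30                    # size of each sample
num_samples = 5000        # number of samples drawn

# Exponential population with rate 1: mu = 1, sigma^2 = 1
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

mean_of_means = mean(sample_means)
var_of_means = pvariance(sample_means)
print(f"mean of sample means     = {mean_of_means:.3f} (theory: 1.000)")
print(f"variance of sample means = {var_of_means:.4f} (theory: {1 / n:.4f})")
```

A histogram of `sample_means` would look close to a bell curve even though the individual observations are strongly skewed.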

7.6 The Binomial Distribution Related to the Bernoulli and Normal Distributions

7.6.1 Bernoulli and the Binomial distribution (discrete probability functions)

Bernoulli distribution:

Discrete distribution

Has only two possible outcomes: heads-tails, one-zero, yes-no

Probability mass function:

Event X   Probability
1         P(X = 1) = p
0         P(X = 0) = 1 - p

Assumptions:

There is a single trial with only two possible outcomes.

The probability of an outcome is constant for each trial.

Explanation of the Binomial distribution

Assumptions:

N independent Bernoulli trials

Only 2 possible outcomes on each trial

Constant probability for each outcome on each trial

The quantity of interest is the total number X of positive outcomes, a value between 0 and N.

[Figure: probability mass function; horizontal axis: Outcome (0, 1, 2, 3)]

Example: 3 trials of flipping a coin

No. of tails   Possible outcomes   Prob. of outcome
0              HHH                 (1/2)^0 (1/2)^3
1              HHT HTH THH         3 (1/2)^1 (1/2)^2
2              TTH THT HTT         3 (1/2)^2 (1/2)^1
3              TTT                 (1/2)^3 (1/2)^0

Read 7.6.2 for a sample application of the Binomial distribution.

(See equation 7-14)

Mean: Np, Variance: Npq (where q = 1 - p). Discuss 7.6.2.
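The coin-flip example and the mean and variance expressions can be verified with a few lines. This is a minimal sketch; the function name is made up for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x positive outcomes in n Bernoulli trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 3, 0.5   # three flips of a fair coin, counting tails
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

print(pmf)                                   # probabilities of 0, 1, 2, 3 tails
print(mean, n * p)                           # mean from the pmf vs Np
print(variance, n * p * (1 - p))             # variance from the pmf vs Npq
```

The multipliers 1, 3, 3, 1 in the table are the values of `comb(3, x)`, the number of ways each count of tails can occur.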

7.7 The Poisson distribution (“counting distribution” or “random arrival” discrete probability function)

$P(X = x) = \frac{m^x e^{-m}}{x!}$

With mean µ = m and variance σ² = m.

If the above characteristic is not met, the Poisson theoretically does not apply.

For large n and small p, the binomial distribution tends to approach the Poisson distribution with parameter m = np. Also, the binomial distribution approaches the normal distribution when np/(1-p) >= 9.

When time headways are exponentially distributed with mean = 1/λ, the number of arrivals in an interval T is Poisson distributed with mean = m = λT. Note that the unit of λ is veh/unit time (arrival rate).
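A short sketch of the Poisson probabilities for a counting interval. The arrival rate (360 veh/h) and interval length (30 s) are hypothetical values chosen so that m = λT = 3 vehicles.

```python
from math import exp, factorial

def poisson_pmf(x, m):
    """P(X = x) = m^x * e^(-m) / x!"""
    return m ** x * exp(-m) / factorial(x)

arrival_rate = 360 / 3600   # hypothetical: 360 veh/h expressed as veh/s
T = 30                      # counting interval in seconds
m = arrival_rate * T        # expected arrivals per interval (about 3)

probs = [poisson_pmf(x, m) for x in range(6)]
for x, px in enumerate(probs):
    print(f"P(X = {x}) = {px:.4f}")
```

Keeping λ and T in the same time unit (here seconds) is the step most easily missed when computing m.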

(Read the sample problem in page 144, table 7.5)

7.8 Hypothesis testing

Two distinct choices:

Null hypothesis, H0

Alternative hypothesis: H1

E.g. Inspect 100,000 vehicles, of which 10,000 vehicles are “unsafe.” This is the fact given to us.

H0: The vehicle being tested is “safe.”

H1: The vehicle being tested is “unsafe.”

In this inspection,

15% of the unsafe vehicles are determined to be safe: Type II error (bad error),

and 5% of the safe vehicles are determined to be unsafe: Type I error (economically bad, but safety-wise it is better than Type II error).
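The vehicle counts behind the two error types in this inspection example follow from simple arithmetic, sketched here with the figures given above.

```python
total = 100_000
unsafe = 10_000
safe = total - unsafe            # 90,000 safe vehicles

# Type II error: H0 ("safe") is accepted although the vehicle is unsafe
type_ii_count = unsafe * 15 // 100

# Type I error: H0 ("safe") is rejected although the vehicle is safe
type_i_count = safe * 5 // 100

print(f"Unsafe vehicles passed as safe (Type II): {type_ii_count}")
print(f"Safe vehicles flagged as unsafe (Type I): {type_i_count}")
```

Although the Type I percentage is smaller, it applies to a much larger group, so more vehicles are affected by it.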

Types of errors

Reality        Decision: Reject H0                                Decision: Accept H0
H0 is true     Type I error (reject a correct null hypothesis)    Correct
H1 is true     Correct                                            Type II error (fail to reject a false null hypothesis)

We especially want to minimize Type II error.

Steps of the Hypothesis Testing

State the hypothesis

Select the significance level

Compute sample statistics and estimate parameters

Compute the test statistic

Determine the acceptance and critical regions of the test statistic

Reject or do not reject H0

P(type I error) = α (level of significance)

P(type II error) = β

(See the binary case on pp. 145-146 to get a feel for Type II error.)

Dependence between α, β, and sample size n

There is a distinct relationship between the two probability values α and β and the sample size n for any hypothesis. The value of any one is found by using the test statistic and set values of the other two.

Given α and n, determine β. Usually the α and n values are the most crucial, so they are established and the β value is not controlled.

Given α and β, determine n. Set up the test statistic for α and β with an H0 value and an H1 value of the parameter and two different n values.

The t (or z) statistic is:

$t \text{ or } z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

7.8.1 Before-and-after tests with two distinct choices

Here we are comparing means; hence divide σ by sqrt(n).
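For the case of two distinct choices, β can be computed once α and n are fixed. The sketch below assumes a one-sided upper z-test with known σ, which is a standard textbook setup rather than something spelled out in these notes; the means, σ, and sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu0, mu1, sigma, n, alpha=0.05):
    """Beta for a one-sided upper z-test of H0: mu = mu0 against H1: mu = mu1 (> mu0)."""
    se = sigma / sqrt(n)                                  # comparing means: sigma / sqrt(n)
    critical = mu0 + NormalDist().inv_cdf(1 - alpha) * se  # decision point set by alpha
    # Probability of falling below the decision point when H1 is actually true
    return NormalDist(mu1, se).cdf(critical)

# Hypothetical means (mph) and standard deviation
beta_small_n = type_ii_error(mu0=55.0, mu1=58.0, sigma=7.0, n=49)
beta_large_n = type_ii_error(mu0=55.0, mu1=58.0, sigma=7.0, n=100)
print(f"beta (n = 49)  = {beta_small_n:.3f}")
print(f"beta (n = 100) = {beta_large_n:.3f}")
```

With α held at 0.05, increasing n is the only way in this setup to reduce β.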

7.8.2 Before-and-after tests with generalized alternative hypothesis

The significance of the hypothesis test is indicated by α, the type I error probability. α = 0.05 is most common: there is a 5% level of significance, which means that on the average a type I error (rejecting a true H0) will occur 5 in 100 times that H0 and H1 are tested. In addition, there is a 95% confidence level that the result is correct.

If H1 involves a not-equal relation, no direction is given, so the significance area is equally divided between the two tails of the testing distribution.

If it is known that the parameter can go in only one direction, a one-sided test is performed, so the significance area is in one tail of the distribution.

[Figure: significance areas. Two-sided: 0.025 in each tail. One-sided upper: 0.05 in the upper tail.]

Two-sided or one-sided test

These tests are done to compare the effectiveness of an improvement to a highway or street by using mean speeds.

If you want to prove that a difference exists between the two data samples, you conduct a two-sided test. (H0: there is no change.)

Null hypothesis H0: µ1 = µ2 (there is no change)

Alternative H1: µ1 ≠ µ2

If you are sure that the change can only go one way (for example, you are sure there was no decrease), you conduct a one-sided test.

Null hypothesis H0: µ1 = µ2 (there is no increase)

Alternative H1: µ1 < µ2 (there is an increase)

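Example

                     Existing   After improvement
Sample size          55         55
Mean                 60 min     55 min
Standard deviation   8 min      8 min

Standard deviation of the difference in means:

$\sigma_Y = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{8^2}{55} + \frac{8^2}{55}} = 1.53$

At significance level α = 0.05 (see Table 7-3): $z_{\alpha/2} = 1.96$ (two-sided), $z_{\alpha} = 1.65$ (one-sided)

The decision point (typically written zc):

For two-sided: 1.96*1.53 = 2.998

For one-sided: 1.65*1.53 = 2.525

|µ1 - µ2| = |60-55| = 5 > zc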

By either test, H0 is rejected.
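The before-and-after example can be reproduced in a few lines. This sketch uses the sample sizes, means, and standard deviations from the example and the z values 1.96 and 1.65 quoted for α = 0.05.

```python
from math import sqrt

n1 = n2 = 55
mean_before, mean_after = 60.0, 55.0
sd_before = sd_after = 8.0

# Standard deviation of the difference in means
sigma_y = sqrt(sd_before ** 2 / n1 + sd_after ** 2 / n2)

z_two_sided, z_one_sided = 1.96, 1.65   # alpha = 0.05
diff = abs(mean_before - mean_after)

reject_two_sided = diff > z_two_sided * sigma_y
reject_one_sided = diff > z_one_sided * sigma_y

print(f"sigma_Y = {sigma_y:.2f}")
print(f"two-sided decision point = {z_two_sided * sigma_y:.3f}, reject H0: {reject_two_sided}")
print(f"one-sided decision point = {z_one_sided * sigma_y:.3f}, reject H0: {reject_one_sided}")
```

The decision points printed here differ slightly in the third decimal from the hand calculation because σ_Y is not rounded to 1.53 before multiplying.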

7.8.3 Other useful statistical tests

The t-test (for small samples, n<=30) – Table 7.6:

$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$

The F-test (for small samples) – Table 7.7:

In using the t-test we assume that the standard deviations of the two samples are the same. To test this hypothesis we can use the F-test.

$F = \frac{s_1^2}{s_2^2}$ (By definition the larger s is always on top.)

(See the samples on pages 149 and 151.)
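A minimal sketch of the pooled t statistic for two small samples. The sample sizes, means, and standard deviations are hypothetical; the computed t would then be compared with the critical value from Table 7.6 for n1 + n2 - 2 degrees of freedom.

```python
from math import sqrt

def pooled_std(n1, s1, n2, s2):
    """Pooled standard deviation s_p of two samples."""
    return sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))

def t_statistic(x1, s1, n1, x2, s2, n2):
    """t = (x1 - x2) / (s_p * sqrt(1/n1 + 1/n2))"""
    sp = pooled_std(n1, s1, n2, s2)
    return (x1 - x2) / (sp * sqrt(1 / n1 + 1 / n2))

# Hypothetical small samples
sp = pooled_std(n1=10, s1=2.0, n2=10, s2=2.0)
t = t_statistic(x1=12.0, s1=2.0, n1=10, x2=10.0, s2=2.0, n2=10)
dof = 10 + 10 - 2
print(f"s_p = {sp:.2f}, t = {t:.3f}, degrees of freedom = {dof}")
```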

7.8.3 Other useful statistical tests (cont)

The F-test to test if s1=s2

When the t-test and other similar means tests are conducted, there is an implicit assumption made that s1=s2. The F-test can test this hypothesis.

$F = \frac{s_1^2}{s_2^2}$ (The numerator variance > the denominator variance when you compute an F-value.)

If Fcomputed ≥ Ftable (n1-1, n2-1, α), then s1≠s2 at the α significance level.

If Fcomputed < Ftable (n1-1, n2-1, α), then s1=s2 at the α significance level.
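The F computation and decision rule can be sketched as below. The standard deviations are hypothetical, and `f_table` is a placeholder that must be replaced with the value read from Table 7.7; the number used here is not taken from the table.

```python
def f_statistic(s1, s2):
    """F = larger variance / smaller variance, so F is always >= 1."""
    v1, v2 = s1 ** 2, s2 ** 2
    return max(v1, v2) / min(v1, v2)

def variances_differ(s1, s2, f_table):
    """True when F_computed >= F_table, i.e. s1 and s2 are judged different."""
    return f_statistic(s1, s2) >= f_table

# Hypothetical sample standard deviations
F = f_statistic(2.0, 3.0)

# Placeholder critical value; look up the real one in Table 7.7 using the
# degrees of freedom of the numerator sample first, then the denominator sample
f_table = 2.5
print(F, variances_differ(2.0, 3.0, f_table))
```

Because the larger variance goes on top, the degrees of freedom must follow the same order when reading the table.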

Discuss the problem in p.151.

Paired difference test

You perform a paired difference test only when you have control over the sequence of data collection.

E.g., simulation: you control the parameters. You have two different signal timing schemes. Only the timing parameters are changed. Use the same random number seeds. Then you can pair. If you cannot control random number seeds in simulation, you are not able to do a paired test.

Table 7-8 shows an example of the benefits of paired testing. The only thing changed is the method used to collect speed data. The same vehicle’s speed was measured by the two methods.

Paired or not-paired example (Table 7.8)

                 Method 1   Method 2   Difference
Estimated mean   56.9       61.2       4.3
Estimated SD     7.74       7.26       1.5

H0: No increase in test scores (means one-sided or one-tailed)

Not paired:

$s_Y = \sqrt{\frac{7.74^2}{15} + \frac{7.26^2}{15}} = 2.74$

|56.9 – 61.2| = 4.3 < 4.52 (=1.65*2.74)

Hence, H0 is NOT rejected.

Paired:

$E = \frac{1.50}{\sqrt{15}} = 0.387$

4.3 increase > 0.639 (=1.65*0.387)

Hence, H0 is clearly rejected.
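The contrast between the unpaired and paired treatments of the Table 7.8 data can be reproduced with the summary statistics alone. This sketch assumes n = 15 for each method and the one-sided z value of 1.65 used in the example.

```python
from math import sqrt

n = 15
mean_1, mean_2 = 56.9, 61.2
sd_1, sd_2 = 7.74, 7.26
sd_diff = 1.50            # standard deviation of the paired differences
z_one_sided = 1.65        # alpha = 0.05, one-sided

increase = mean_2 - mean_1

# Not paired: the two samples are treated as independent
s_y = sqrt(sd_1 ** 2 / n + sd_2 ** 2 / n)
reject_unpaired = increase > z_one_sided * s_y

# Paired: work with the differences measured on the same vehicles
E = sd_diff / sqrt(n)
reject_paired = increase > z_one_sided * E

print(f"not paired: s_Y = {s_y:.2f}, decision point = {z_one_sided * s_y:.2f}, reject H0: {reject_unpaired}")
print(f"paired:     E = {E:.3f}, decision point = {z_one_sided * E:.3f}, reject H0: {reject_paired}")
```

Pairing removes the vehicle-to-vehicle variation, which is why the same 4.3 difference is significant in one analysis and not in the other.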

Chi-square (χ²) test (so-called “goodness-of-fit” test)

Example: Distribution of height data in Table 7-9.

H0: The underlying distribution is uniform.

H1: The underlying distribution is NOT uniform.

The authors intentionally used the uniform distribution to make the computation simple. We will test a normal distribution in class using Excel.

Steps of Chi-square (χ²) test

Define categories or ranges (or bins), assign data to the categories, and find n_i = the number of observations in each category i. (Use at least 5 bins, and each should have at least 5 observations.)

Compute the expected number of samples for each category (theoretical frequency), using the assumed distribution. Define f_i = the expected number of samples for each category i.

Compute the quantity:

$\chi^2 = \sum_{i=1}^{N} \frac{(n_i - f_i)^2}{f_i}$

Steps of Chi-square (χ²) test (cont)

χ² is chi-square distributed (see Table 5-8). This value is low if our hypothesis is correct. Usually we use α = 0.05 (5% significance level or 95% confidence level). When you look up the table, the degree of freedom is f = N – 1 – g, where g is the number of parameters we use in the assumed distribution. For the normal distribution g = 2 because we use µ and σ to describe the shape of the normal distribution.

If the computed χ² value is smaller than the critical χ²_c value, we accept H0.
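The chi-square steps can be sketched for a uniform assumed distribution. The observed counts are hypothetical (they are not the Table 7-9 data), g = 0 is an assumption that no parameters were estimated from the data, and 9.488 is the standard chi-square critical value for 4 degrees of freedom at α = 0.05.

```python
def chi_square_statistic(observed, expected):
    """Sum over bins of (n_i - f_i)^2 / f_i."""
    return sum((n_i - f_i) ** 2 / f_i for n_i, f_i in zip(observed, expected))

# Hypothetical observed counts n_i in 5 bins
observed = [18, 22, 20, 25, 15]
total = sum(observed)

# Uniform assumed distribution: the same expected count f_i in every bin
expected = [total / len(observed)] * len(observed)

chi2 = chi_square_statistic(observed, expected)

g = 0                              # assumption: no parameters estimated from the data
dof = len(observed) - 1 - g        # f = N - 1 - g
critical = 9.488                   # chi-square critical value, 4 dof, alpha = 0.05
accept_h0 = chi2 < critical

print(f"chi-square = {chi2:.2f}, dof = {dof}, accept H0: {accept_h0}")
```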

What’s the Chi-square (χ²) test testing?

[Diagram: the assumed distribution yields the expected distribution (or histogram); the Chi-square (χ²) test compares it with the actual histogram.]

You need to know how to pull out values from the assumed distribution to create the expected histogram.
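One way to pull values out of an assumed normal distribution and build the expected histogram is to difference the cumulative probabilities at the bin boundaries. In this sketch the bin edges, mean, standard deviation, and sample size are all hypothetical, and the two end bins are left open-ended so that the expected counts add up to the sample size.

```python
from statistics import NormalDist

def expected_counts(edges, mu, sigma, total):
    """Expected count per bin; the first and last bins are open-ended."""
    dist = NormalDist(mu, sigma)
    # Cumulative probability at each boundary, padded with 0 and 1 for the tails
    cdf = [0.0] + [dist.cdf(e) for e in edges] + [1.0]
    return [total * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]

# Hypothetical speed bins (mph): <45, 45-50, 50-55, 55-60, 60-65, >65
edges = [45, 50, 55, 60, 65]
counts = expected_counts(edges, mu=55.0, sigma=7.0, total=100)

for label, f_i in zip(["<45", "45-50", "50-55", "55-60", "60-65", ">65"], counts):
    print(f"{label:>6}: {f_i:.1f}")
```

These expected counts are the f_i values that go into the chi-square sum; bins with fewer than 5 expected observations would normally be merged with a neighbor.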
