Lecture 4 Applied Econometrics and Economic Modeling

Methods for Selecting Random Samples

8.2 | 8.3 | 8.4 | 8.5 | 8.6

RANDSAMP.XLS

This file contains data about the annual incomes of 40 families.

We want to choose a simple random sample of size 10 from this frame.

How can this be done?

And how do summary statistics of the chosen families compare to the corresponding summary statistics of the population?

Data

Sampling Terminology

In any sampling problem there is a relevant population, the set of all members about which the study intends to make inferences.

Before we select a sample from a given population, we typically need a list of all members of the population. This list is called the frame, and the potential sample members are called sampling units.

There are two types of samples: probability samples and judgmental samples.

Sampling Terminology -- continued

A probability sample is a sample in which the sampling units are chosen from the population by means of a random mechanism such as a random number table.

No formal random mechanism is used to select a judgmental sample; in this case the sampling units are chosen according to the sampler’s judgment.

The simplest type of sampling scheme is appropriately called simple random sampling.

Solution

The idea is very simple. We first generate a column of random numbers in column C. Then we sort the rows according to the random numbers and choose the first 10 families in the sorted rows.

The following procedure produces the results.

– Random numbers. Enter the formula =RAND() in cell C10 and copy it down column C.

– Replace with values. To enable sorting we must “freeze” the random numbers - that is, replace their formulas with values. To do this, select the range C10:C49, use Edit/Copy, and then use Edit/Paste Special with the Values option.

Solution -- continued

– Copy to a new range. Copy the range A10:C49 to the range E10:G49.

– Sort. Select the range E10:G49 and use the Data/Sort menu item. Sort according to the Random # column in ascending order. Then the 10 families with the 10 smallest random numbers are the ones in the sample.

– Means. Use the AVERAGE, MEDIAN and STDEV functions in row 6 to calculate summary statistics of the first 10 incomes in column F.
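The steps above can be sketched in Python. This is only an illustrative sketch, not part of RANDSAMP.XLS: the function name simple_random_sample and the placeholder incomes are assumptions, and the random keys play the role of the =RAND() column.

```python
import random
import statistics

def simple_random_sample(frame, sample_size, rng=None):
    """Attach a random number to each unit, sort by it, keep the first rows."""
    if rng is None:
        rng = random.Random()
    keyed = [(rng.random(), unit) for unit in frame]  # like =RAND() copied down
    keyed.sort(key=lambda pair: pair[0])              # like Data/Sort, ascending
    return [unit for _, unit in keyed[:sample_size]]

# Hypothetical incomes for 40 families (placeholders, not the file's data).
rng = random.Random(0)
incomes = [rng.randint(20000, 90000) for _ in range(40)]
sample = simple_random_sample(incomes, 10, rng)

# Summary statistics of the sample next to those of the whole frame.
print(statistics.mean(sample), statistics.median(sample), statistics.stdev(sample))
print(statistics.mean(incomes), statistics.median(incomes), statistics.stdev(incomes))
```

Because every ordering of the random keys is equally likely, the first 10 sorted rows form a simple random sample of the frame.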

Results

More Random Samples Automatically

If we would like more random samples of size 10, we would need to repeat the whole process for each new sample.

To save you the trouble, we have set up a macro to automate the process. See the Automated sheet of the RANDSAMP.XLS file. By clicking on the button we get a different random sample.

Example 8.2

Methods for Selecting Random Samples

RECEIVE.XLS

This file contains 280 accounts receivable for the Spring Mills Company. There are three variables:

– Size: customer size (small, medium, large), depending on its volume of business with Spring Mills

– Days: number of days since the customer was billed

– Amount: amount of the bill

Generate 50 random samples of size 15 each from the small customers only, calculate the average amount owed in each random sample, and construct a histogram of these 50 averages.

Generated Random Sample

Solution

To select small accounts only, insert a blank row after account 150 (the last small account).

Then, with the cursor anywhere in the small account data set, use the StatPro/Statistical Inference/Generate Random Samples menu item, enter 50 and 15 as the number of samples and the sample size, and put the results in a new sheet.

To find the amounts owed for the sampled accounts, enter the formula =VLOOKUP(B3,Data!Data,4) in cell B21 and copy it to the range B21:AY35.

Then calculate the averages in row 37 with the AVERAGE function and transpose this row of averages to a column of averages in BA4:BA53 by entering the formula =TRANSPOSE(B37:AY37) and pressing Ctrl-Shift-Enter.

Use StatPro’s histogram procedure to create a histogram of these averages. The histogram will look different each time because different random samples are selected.

The histogram indicates the variability of sample means we might obtain by selecting many different random samples of size 15 from this population of small customer accounts.
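The same experiment can be sketched outside the spreadsheet. The following is a minimal sketch, assuming 150 small accounts with made-up amounts (the real amounts are in RECEIVE.XLS); the helper names sample_means and histogram_counts are hypothetical and are not StatPro procedures.

```python
import random
import statistics

def sample_means(values, num_samples, sample_size, rng=None):
    """Draw num_samples simple random samples and return the mean of each."""
    if rng is None:
        rng = random.Random()
    return [statistics.mean(rng.sample(values, sample_size))
            for _ in range(num_samples)]

def histogram_counts(data, num_bins=8):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins or 1.0   # guard against identical values
    counts = [0] * num_bins
    for x in data:
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical amounts owed by the 150 small customers (placeholders only).
rng = random.Random(0)
amounts = [round(rng.uniform(50, 1000), 2) for _ in range(150)]

means = sample_means(amounts, num_samples=50, sample_size=15, rng=rng)
for count in histogram_counts(means):
    print("*" * count)   # crude text histogram of the 50 sample means
```

Rerunning with a different seed gives a different histogram, which mirrors the behavior of the spreadsheet when new random samples are generated.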

Example 8.3

Methods for Selecting Random Samples

STRATIFIED.XLS

This file contains a frame of all 1000 people in the city of Smalltown who have Sears credit cards.

Sears is interested in estimating the average number of other credit cards these people own, as well as other information about their use of credit.

The company decides to stratify these customers by age, select a stratified sample of size 100 with proportional sample sizes, and then contact these 100 people by phone.

How might Sears proceed?

Systematic Sampling

A systematic sample provides a convenient way to choose the sample.

It works as follows:

– First, we calculate the sampling interval as the population size divided by the sample size. In this illustration the sampling interval is 220.

– Next, we use a random mechanism to choose a number between 1 and 220 (say, number 131).

– Then we choose the 131st name, the 351st name, the 571st name, and so on. The result is a systematic sample of size n=250.
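The procedure can be sketched as follows. This is an illustrative sketch only: the function name systematic_sample is hypothetical, and the frame of 55,000 names is an assumption inferred from an interval of 220 and a sample size of 250 (220 x 250).

```python
import random

def systematic_sample(frame, sample_size, rng=None):
    """Pick a random start within the first interval, then every k-th unit."""
    if rng is None:
        rng = random.Random()
    interval = len(frame) // sample_size      # sampling interval k
    start = rng.randint(1, interval)          # 1-based position, e.g. 131
    positions = [start + i * interval for i in range(sample_size)]
    return [frame[p - 1] for p in positions]  # convert to 0-based indexing

# Assumed frame: names numbered 1..55000, so the interval is 220.
frame = list(range(1, 55001))
sample = systematic_sample(frame, 250, random.Random(0))
print(sample[:3])   # three positions spaced 220 apart
```

Only the starting position is random; once it is chosen, the rest of the sample is determined.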

Stratified Sampling

Suppose we can identify various subpopulations within the total population. We call these subpopulations strata.

It often makes sense to select a simple random sample from each stratum instead of from the entire population. This is called stratified sampling.

This method is particularly useful when there is considerable variation between the various strata but relatively little variation within a given stratum.

Stratified Sampling -- continued

To obtain a stratified random sample we must choose a total sample size n, and we must choose a sample size ni for each stratum i.

There are many ways to choose these numbers but the most popular method is proportional sample sizes.

The advantage of proportional sample sizes is that they are very easy to determine. The disadvantage is that they ignore differences in variability among the strata.
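Proportional sample sizes can be written as a formula. The symbols N (population size) and N_i (size of stratum i) are notation introduced here for illustration; they do not appear in the slides.

```latex
n_i = n \cdot \frac{N_i}{N}, \qquad \sum_i n_i \approx n
```

Each n_i is rounded to a whole number, so the rounded sizes may add up to slightly more or less than n.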

Solution

First Sears must decide exactly how to stratify by age.

Their reasoning is that different age groups probably have different attitudes and behavior regarding credit.

After preliminary investigation they decide to have three age categories: 18-30, 31-62, and 63-80.

The spreadsheet for the calculation is set up as follows:

– the total sample size is in cell C3

– the definitions of the strata are in rows 6-8

– the customer data are in the range A11:B1010

Stratified Sample

Solution -- continued

To see what age category each customer falls in, we enter the formula

=IF(B11<=$D$6,1,IF(B11<=$D$7,2,3))

in cell C11 and then copy it down column C.

Next, it is useful to “unstack” the data into three groups, one for each age category.

– It is easy to unstack the data in columns A-C.

– With the cursor anywhere in A10:C1010 select StatPro/Data Utilities/Unstack Variables. Select Category as the Code variable, select Cust and Age as the variables to unstack, and accept the default location for the unstacked variables.

Once the variables are unstacked we can calculate the counts and sample sizes in F6:G8 with the formulas =COUNT(E11:E142) and =ROUND(TotSampSize*F6/1000,0).

Finally, we copy the data in columns E and F into columns L and M, append a column of random numbers, sort on the random number column, and choose the first 13 (or however many) customers.

The file shows the calculations for the other categories.
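The whole stratified procedure can be sketched in Python. This is a sketch under stated assumptions: cells D6 and D7 are assumed to hold the upper age limits 30 and 62, the customer ages below are made up, and the function names are hypothetical. Note also that Python's round() rounds exact halves to the nearest even number, which can differ from Excel's ROUND in that one case.

```python
import random

def age_category(age, upper1=30, upper2=62):
    """Same logic as =IF(B11<=$D$6,1,IF(B11<=$D$7,2,3))."""
    if age <= upper1:
        return 1
    if age <= upper2:
        return 2
    return 3

def proportional_sizes(counts, total_sample_size):
    """Sample size per stratum, proportional to the stratum count."""
    population = sum(counts.values())
    return {k: round(total_sample_size * c / population)
            for k, c in counts.items()}

def stratified_sample(customers, total_sample_size, rng=None):
    """customers is a list of (customer_id, age) pairs."""
    if rng is None:
        rng = random.Random()
    strata = {}
    for cust, age in customers:               # "unstack" by age category
        strata.setdefault(age_category(age), []).append((cust, age))
    sizes = proportional_sizes({k: len(v) for k, v in strata.items()},
                               total_sample_size)
    # Simple random sample within each stratum.
    return {k: rng.sample(strata[k], sizes[k]) for k in strata}

# Hypothetical frame of 1000 customers aged 18 to 80 (placeholder data).
rng = random.Random(0)
customers = [(i, rng.randint(18, 80)) for i in range(1, 1001)]
for category, members in sorted(stratified_sample(customers, 100, rng).items()):
    print(category, len(members))
```

With a stratum count of 132 out of 1000 and a total sample size of 100, proportional_sizes gives 13, which matches the 13 customers chosen in the slide.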

Cluster Sampling

Suppose a company is interested in various characteristics of households in a particular city. The sampling units are households.

We could proceed with the sampling methods discussed so far, but another approach is more convenient.

We could divide the city into city blocks as sampling units and then sample all the households in the chosen blocks.

In this case the city blocks are called clusters and the sampling is called cluster sampling.

Cluster Sampling -- continued

The advantage of cluster sampling is sampling convenience (and possibly lower cost).

It is straightforward to select a cluster sample. The key is to define the sampling units as the clusters, then select a simple random sample of clusters. Then sample all the population members in each selected cluster.

When all sampling units within each selected cluster are taken, the scheme is called a single-stage sampling scheme.

Real applications are often more complex and result in multistage sampling schemes.
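A single-stage cluster sample can be sketched as follows. The city blocks and household labels are made-up placeholders, and the function name cluster_sample is hypothetical.

```python
import random

def cluster_sample(clusters, num_clusters, rng=None):
    """clusters maps a cluster id (e.g. a city block) to its households.

    A simple random sample of clusters is drawn, and every household in
    each chosen cluster is included (single-stage cluster sampling).
    """
    if rng is None:
        rng = random.Random()
    chosen = rng.sample(sorted(clusters), num_clusters)
    return {c: list(clusters[c]) for c in chosen}

# Hypothetical city: 20 blocks with 5 households each.
blocks = {b: [f"block{b}-house{h}" for h in range(1, 6)] for b in range(1, 21)}
sample = cluster_sample(blocks, 4, random.Random(0))
print(sorted(sample))   # ids of the 4 sampled blocks
```

A multistage scheme would add a further sampling step inside each chosen cluster instead of taking every household.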

Example 8.4

An Introduction to Estimation

AUDIT.XLS

An internal auditor for a furniture retailer wants to estimate the average of all accounts receivable taken over the population of all customer accounts.

The company has approximately 10,000 accounts. An exhaustive enumeration is impossible.

Therefore, the auditor randomly samples 100 of the accounts. This file contains the observed data.

What can the auditor conclude from this sample?

Random Sample

Sources of Estimation Error

There are two basic sources of errors that can occur when we sample randomly from a population:

– Sampling error results from “unlucky” samples.

– Nonsampling errors, which are quite different, can occur for a variety of reasons.

• Nonresponse bias occurs when a portion of the sample fails to respond to the survey.

• Nontruthful responses are a particular problem when respondents are asked sensitive questions. One solution is to use a randomized response technique that presents two questions, one sensitive and one innocuous.

• Measurement error occurs when the responses to the questions do not reflect what the investigator had in mind.

Sampling Distribution of the Sample Mean

We typically estimate the population mean by the sample mean of the randomly chosen sample.

The sample mean is called a point estimate of the population mean.

In general a point estimate of any population parameter is a single-value estimate of that parameter, based on observed sample data.

The sampling error is the difference between the observed sample mean and the true population mean.

Sampling Distribution of the Sample Mean -- continued

A negative sampling error means an underestimate of the population mean.

The standard deviation of the sampling distribution of the sample mean is called the standard error of the mean.

The sample mean is an unbiased estimate of the population mean.

Solution

The receivables for the 100 sampled accounts appear in column E.

We calculate the sample mean and the sample standard deviation. Then we calculate the (approximate) standard error of the mean with the formula

=Sstdev/SQRT(SampSize)

in cell B9.
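The same calculation can be sketched in Python. The function name standard_error and the sample values are illustrative assumptions; the real data are in AUDIT.XLS.

```python
import math
import statistics

def standard_error(sample):
    """Approximate standard error of the mean: sample std dev / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Hypothetical receivables for a handful of accounts (placeholder values).
receivables = [120.0, 310.5, 95.0, 480.25, 260.0, 150.75]
print(statistics.mean(receivables), standard_error(receivables))
```

The result is only approximate because the sample standard deviation stands in for the unknown population standard deviation.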

Interpretation

The auditor should interpret these values as follows:

– The sample mean can be used to estimate the unknown population mean. It provides a best guess for the average of the receivables for the 10,000 accounts.

– The standard error provides a measure of accuracy.

The auditor can be approximately 95% certain that the mean of all 10,000 accounts is within the interval $279 plus or minus $84, that is, between $195 and $363.
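The arithmetic behind the interval can be written out as below. The reading of $84 as two standard errors is an assumption based on the usual two-standard-error rule; under that assumption the standard error would be $42.

```latex
\bar{x} \pm 2\,\mathrm{SE}(\bar{x}) = 279 \pm 84
\quad\Longrightarrow\quad
[\,279 - 84,\; 279 + 84\,] = [\,195,\; 363\,]
```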


Background Information

Suppose you have the opportunity to play a game with a “wheel of fortune”. When you spin a large wheel, it is equally likely to stop in any position.

Depending on where it stops, you win anywhere from $0 to $1000.

Let’s suppose your winnings are actually based on not one, but n spins of the wheel.

If n=2, your winnings are based on the average of two spins. How does the distribution of your winnings depend on n?

Random Sampling?

What does this experiment have to do with random sampling?

Here, the population is the set of all outcomes we could obtain from a single spin of the wheel; that is, all dollar values from $0 to $1000. Each spin results in one randomly sampled dollar value from the population.

Furthermore, because we have assumed that the wheel is equally likely to land in any position, all possible values in the continuum from $0 to $1000 have the same chance of occurring.

Random Sampling?

The resulting population distribution is called the uniform distribution on the interval from $0 to $1000.

It can be shown that the mean and standard deviation are $500 and $289, respectively.
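These two values follow from the standard formulas for a uniform distribution on an interval from a to b, here with a = 0 and b = 1000:

```latex
\mu = \frac{a + b}{2} = \frac{0 + 1000}{2} = 500,
\qquad
\sigma = \frac{b - a}{\sqrt{12}} = \frac{1000}{\sqrt{12}} \approx 288.68 \approx 289
```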

SPIN1.XLS

In order to analyze the distribution of winnings based on the average of n spins we need to do a sequence of simulations for n=1, n=2, n=3, n=6 and n=10.

This spreadsheet contains the simulation for n=1. The other simulations can be found in the following spreadsheets, SPIN2.XLS, SPIN3.XLS, SPIN6.XLS, and SPIN10.XLS.

For each simulation we consider 1000 replications of an experiment.

Simulations

The experiment simulates n spins of the wheel and calculates the average - that is, the winnings - from the n spins.

Based on these 1000 replications, we can then calculate the average winnings, the standard deviation of winnings, and a histogram of winnings for each n. These will show clearly how the distribution of winnings depends on n.

The following slide shows the results for n=1. Here, there is no averaging.

Simulations -- continued

To replicate the experiment 1000 times and collect statistics, we proceed as follows.

– Random outcomes. To generate outcomes uniformly distributed between $0 and $1000 we enter the formula =$B$3+RAND()*($B$4-$B$3) in cell B11 and copy it down column B. The effect of this formula is to generate a random number between 0 and 1 and multiply it by $1000.

– Summary measures. Calculate the average and standard deviation of the 1000 winnings in column B with the AVERAGE and STDEV functions. These values appear in cells E4 and E5.

– Frequency table and histogram. Use the StatPro histogram procedure to create a histogram of the values in column B.

– Note the following from the summary measures and histogram for n=1:

• The sample mean of the winnings (cell E4) is very close to the population mean: $500.

• The standard deviation of the winnings (cell E5) is very close to the population standard deviation: $289.

• The histogram is nearly flat.

– These results should come as no surprise. With n=1 no averaging takes place, so the distribution of winnings is the same as the flat population distribution.
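The n=1 simulation can be sketched in Python. This is a minimal sketch, not the SPIN1.XLS workbook: the function name spin is hypothetical, and the seed is there only to make the run repeatable.

```python
import random
import statistics

def spin(low=0.0, high=1000.0, rng=random):
    """One spin of the wheel: low + RAND() * (high - low)."""
    return low + rng.random() * (high - low)

rng = random.Random(1)                            # seeded for repeatability
winnings = [spin(rng=rng) for _ in range(1000)]   # 1000 replications, n = 1

# Compare with the population values of 500 and about 289.
print(round(statistics.mean(winnings)), round(statistics.stdev(winnings)))
```

The printed values vary from run to run when the seed is changed, just as the spreadsheet values change whenever it recalculates.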

Simulations -- continued

But what happens when n > 1?

The following slide contains the chart and graph of the n=2 simulation.

To do this we formed a second column of outcomes in column C corresponding to a second spin in each experiment. We average the values in column B and C to obtain each of the winnings in column D.

The average winnings is very close to $500, but the standard deviation is much lower and the histogram is no longer flat.

The histogram is now triangular - symmetric, but not yet bell shaped.

To develop similar simulations for n=3, n=6, n=10, or any other n, we insert additional outcome columns and make sure that the AVERAGE formula in the Winnings column averages all n outcomes to its left.

The resulting histograms clearly show two effects of increasing n:

– the histogram becomes more bell shaped

– there is less variability.
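Both effects can be checked with a short simulation. This sketch stands in for the SPIN2.XLS through SPIN10.XLS workbooks; the function name simulate_winnings is hypothetical.

```python
import random
import statistics

def simulate_winnings(n, replications=1000, rng=None):
    """Each replication averages n spins uniformly distributed on 0..1000."""
    if rng is None:
        rng = random.Random()
    winnings = []
    for _ in range(replications):
        spins = [rng.uniform(0, 1000) for _ in range(n)]
        winnings.append(statistics.mean(spins))
    return winnings

rng = random.Random(2)   # seeded only so the sketch is repeatable
for n in (1, 2, 3, 6, 10):
    w = simulate_winnings(n, rng=rng)
    # The mean should stay near 500 while the standard deviation shrinks.
    print(n, round(statistics.mean(w)), round(statistics.stdev(w)))
```

Plotting a histogram of each list of winnings would show the shape moving from flat toward a bell as n grows.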

Histogram for Three Spins

Histogram for Six Spins

Histogram for Ten Spins

Central Limit Theorem

The mean stays right at $500.

This behavior is exactly as the central limit theorem predicts.

– For any population distribution with mean mu and standard deviation sigma, the sampling distribution of the sample mean is approximately normal with mean mu and standard deviation \( \sigma/\sqrt{n} \), and the approximation improves as n increases.

In fact, because the population distribution is symmetric in this example - it’s flat - we see the effect of the theorem for n much less than 30; it is already evident for n as low as 6.
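For the wheel example, the theoretical standard deviation of the average of n spins can be worked out from the population standard deviation of about 288.68. These are theoretical values, not the simulation outputs from the spreadsheets:

```latex
\frac{\sigma}{\sqrt{n}}:\qquad
\frac{288.68}{\sqrt{1}} \approx 288.68,\quad
\frac{288.68}{\sqrt{2}} \approx 204.12,\quad
\frac{288.68}{\sqrt{3}} \approx 166.67,\quad
\frac{288.68}{\sqrt{6}} \approx 117.85,\quad
\frac{288.68}{\sqrt{10}} \approx 91.29
```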


Background Information

A marketing researcher has been hired by a videocassette rental company to estimate the average number of videocassettes rented annually by households in a particular metropolitan area.

The researcher decides to determine the sample size that makes the maximum probable absolute error approximately equal to 10.

Discuss how she should proceed.

Sample Size Determination

The determination of sample size is usually driven by sampling error considerations.

The usual procedure is to select an acceptable sampling error B, called the maximum probable absolute error, and then find the sample size n from the equation

\[ n = \frac{4\sigma^2}{B^2} \]

where \( \sigma \) is the population standard deviation.

The implication is that if we randomly sample this many members from the population, then there is about a 95% chance that the resulting sampling error will be no greater than B in magnitude.
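One way to see where the sample size equation comes from, assuming the usual rule that about 95% of sample means fall within two standard errors of the population mean (2 is a rounding of 1.96):

```latex
B = 2 \cdot \frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
\sqrt{n} = \frac{2\sigma}{B}
\quad\Longrightarrow\quad
n = \frac{4\sigma^2}{B^2}
```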

SAMPSIZE.XLS

This file contains the data needed to solve the problem.

The researcher has chosen the maximum probable absolute error criterion, with B=10 as the value she is willing to tolerate.

Therefore, she should use the maximum probable error equation.

Solution -- continued

To use this equation she must estimate a value of sigma, the population standard deviation.

Based on her knowledge of the industry and available historical data, she uses a best guess of sigma=50.

She then uses the values from C7 and C8 to find the required sample size in C10 with the formula

=4*PopStDev^2/MaxAbsErr^2

Finally, she takes a sample of size 100 and observes the sample values shown in column F. Based on this sample, we calculate summary measures in the usual way in the range C13:C16.
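The sample size formula can be checked with a few lines of Python. The function name required_sample_size is hypothetical, and rounding up to a whole number is an addition made here; the spreadsheet formula itself does not round.

```python
import math

def required_sample_size(pop_std_dev, max_abs_err):
    """n = 4 * sigma^2 / B^2, rounded up to a whole number of units."""
    return math.ceil(4 * pop_std_dev ** 2 / max_abs_err ** 2)

# Best guess sigma = 50 and tolerated error B = 10, as in the example.
print(required_sample_size(50, 10))   # 4 * 2500 / 100 = 100
```

Halving the tolerated error B quadruples the required sample size, because B enters the formula squared.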

Sample Size Determination

Results

The absolute error in cell C16 is 2 times as great as the standard error in cell C15.

It is slightly higher than the maximum absolute error she specified in cell C8 because she observed a larger standard deviation than she had guessed.

In other words, the fact that there is evidently more variation in the population than she thought makes her sample mean based on 100 households slightly less accurate than she intended.
