SJS SDI_161 Design of Statistical Investigations Stephen Senn Random Sampling I

SJS SDI_16 1

Design of Statistical Investigations

Stephen Senn

Random Sampling I


Simple Random Sample

Definition

If a sample of size n is drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected, the sampling procedure is called simple random sampling. The sample thus obtained is called a simple random sample.

Scheaffer, Mendenhall and Ott,

Elementary Survey Sampling, Fourth Edition


Typical Use of Sample

• The typical use of a sample is to say something about a population mean or proportion

• Point estimate
  – mean, proportion

• Confidence interval
  – 95%, 99% etc.

• Occasionally we are interested in estimating totals
  – Total weight, total value etc.


[Figure: 95% CI for mean for 100 samples of size 10 from N(50, 4). Horizontal axis: sample (0 to 100); vertical axis: y (40 to 60).]
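A figure of this kind can be reproduced by simulation. The following is a minimal Python sketch, not the code used for the slide. It assumes that the 4 in N(50, 4) is a standard deviation (the slide does not say whether it is a standard deviation or a variance), and the function name, seed and use of a t-interval are illustrative choices.

```python
import math
import random
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom.
T_CRIT_9DF = 2.262

def simulate_intervals(n_samples=100, n=10, mu=50.0, sd=4.0, seed=1):
    """Draw n_samples samples of size n and return a 95% CI for each mean."""
    rng = random.Random(seed)
    intervals = []
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sd) for _ in range(n)]
        mean = statistics.mean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)  # estimated standard error
        intervals.append((mean - T_CRIT_9DF * se, mean + T_CRIT_9DF * se))
    return intervals

intervals = simulate_intervals()
covered = sum(lo <= 50.0 <= hi for lo, hi in intervals)
print(f"{covered} of {len(intervals)} intervals contain the true mean 50")
```

In the long run about 95 in every 100 such intervals should contain the true mean; any single run of 100 will vary around that figure.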


With or without replacement?

The above definition is slightly more general than that which we encountered previously. It allows for sampling without replacement.

Our previous definition stressed independence. Strictly speaking, for any finite population, draws are not quite independent if they occur without replacement.

Why is this? Consider a sample of size two drawn from N without replacement. There are

$$\binom{N}{2} = \frac{N!}{2!\,(N-2)!} = \frac{N(N-1)}{2}$$

ways of choosing the sample. Hence, the probability of a given sample being chosen is

$$\frac{2}{N(N-1)}$$


But an independence argument would produce a different answer. The probability of a given item being chosen is 1/N. Hence the probability of two given items being independently chosen in any order is

$$2 \times \frac{1}{N} \times \frac{1}{N} = \frac{2}{N^2} \ne \frac{2}{N(N-1)}$$

Note, however, that provided N is large compared to n, the distinction between sampling with and without replacement is unimportant.

This is fortunate, since correcting for sampling without replacement from finite populations involves a lot of tedious but elementary algebra!
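To see how quickly the distinction fades, the two probabilities for a sample of size two, 2/(N(N-1)) without replacement and 2/N² from the independence argument, can be compared for a few population sizes. This is a small illustrative Python sketch; the function names and the chosen values of N are assumptions made for the example.

```python
from fractions import Fraction
from math import comb

def prob_without_replacement(N):
    """Probability of one particular pair when 2 units are drawn without replacement."""
    return Fraction(1, comb(N, 2))

def prob_independence_argument(N):
    """Probability of the same pair, in either order, treating draws as independent."""
    return 2 * Fraction(1, N) * Fraction(1, N)

for N in (5, 50, 5000):
    exact = prob_without_replacement(N)
    approx = prob_independence_argument(N)
    # The ratio approx / exact equals (N - 1) / N, which tends to 1 as N grows.
    print(N, exact, approx, float(approx / exact))
```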


How Not to Draw a Simple Random Sample

• Do not use own judgement
  – This is haphazard sampling
  – Subject to psychological bias
  – Human beings are not good randomisers

• Do not use systematic sampling
  – There may be cyclic patterns or other trends in the population


The Swiss Lottery

• Draw 6 from 45
  – 45C6 = 8,145,060 combinations

• Professor Hans Riedwyl’s study of a given draw
  – 16,862,596 tickets sold
  – approximately two tickets per choice
  – There were over 5000 combinations that were chosen more than 50 times!
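The force of these figures can be checked with a rough calculation. The Python sketch below assumes, purely for illustration, that every ticket is an independent, uniformly random choice among the combinations, so that the number of tickets on any one combination is approximately Poisson. Under that assumption a combination chosen more than 50 times should essentially never occur. The helper name and the truncation point of the sum are choices made for the example.

```python
import math

combinations_total = math.comb(45, 6)          # number of possible 6-from-45 choices
tickets_sold = 16_862_596                      # figure quoted for the draw studied
mean_per_combination = tickets_sold / combinations_total

def poisson_upper_tail(lam, k_min, k_max=400):
    """Approximate P(X >= k_min) for X ~ Poisson(lam), summing terms up to k_max."""
    return sum(
        math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        for k in range(k_min, k_max + 1)
    )

tail = poisson_upper_tail(mean_per_combination, 51)   # "more than 50 times"
expected_count = combinations_total * tail

print(f"combinations: {combinations_total:,}")
print(f"mean tickets per combination: {mean_per_combination:.2f}")
print(f"expected number of combinations chosen more than 50 times: {expected_count:.3g}")
```

The observed count of over 5000 such combinations is therefore strong evidence that punters do not choose at random.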


The UK Lottery

• This is a 6/49 lottery

• In the first 282 draws
  – average jackpot £2 million
  – maximum £22.6 million

• Draw 9, January 1995
  – 133 people bought the winning combination
  – 7, 17, 23, 32, 38, 42
  – £122,510 each

Source: John Haigh, Taking Chances, Oxford


Random Pattern?

 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
26 27 28 29 30
31 32 33 34 35
36 37 38 39 40
41 42 43 44 45
46 47 48 49

Random, from the point of view of the lottery machine but evidently not to the punter!


How to Choose a Random Sample

• Sampling frame of N population units with each item identified by a unique number 1 to N

• ‘Generate’ random number between 0 and 1
  – Using computer, random number table, randomising device

• Multiply by N and round up

• Select population member indicated

• Repeat n times
  – For sampling without replacement draw again if number is chosen twice
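The steps above can be sketched directly in code. This is a minimal Python illustration of the multiply-and-round-up rule, not a recommended production routine; the function name `draw_sample` and its parameters are hypothetical. Because the generator used here returns values in [0, 1), an exact 0 is redrawn so that unit numbers always lie between 1 and N.

```python
import math
import random

def draw_sample(N, n, replace=False, rng=None):
    """Return n unit numbers from 1..N using the multiply-and-round-up rule."""
    if rng is None:
        rng = random.Random()
    if not replace and n > N:
        raise ValueError("cannot draw more than N units without replacement")
    chosen = []
    while len(chosen) < n:
        u = rng.random()             # random number in [0, 1)
        unit = math.ceil(u * N)      # multiply by N and round up
        if unit == 0:                # u was exactly 0: draw again
            continue
        if not replace and unit in chosen:
            continue                 # number already chosen: draw again
        chosen.append(unit)
    return chosen

rng = random.Random(2024)
print(sorted(draw_sample(20, 10, replace=True, rng=rng)))    # repeats allowed
print(sorted(draw_sample(20, 10, replace=False, rng=rng)))   # all distinct
```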


The S-PLUS Approach

> # To illustrate different approaches to sampling
> N <- 20  # Size of population
> n <- 10  # Size of sample
> Identify <- c(1:N)  # Population identifiers
> Identify
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> # Sample with replacement
> sort(sample(Identify, size = n, replace = T))
 [1]  6  7  7  7 12 14 16 16 19 19
> # Sample without replacement
> sort(sample(Identify, size = n, replace = F))
 [1]  2  8 11 12 13 14 16 17 18 20


Finite Population Correction Factors

In practice we very rarely carry out sampling with replacement.

EXCEPTION: the bootstrap, a re-sampling investigation of the properties of statistics.

However, often populations are large compared to samples. Hence we can behave as if draws were independent. The theory of sampling with replacement applies. In the next few slides we consider what happens when the population is not large.

Let $y_1, y_2, \ldots, y_n$ be a simple random sample from a population of values $u_1, u_2, \ldots, u_N$.


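$$E(y_i) = \sum_{i=1}^{N} u_i / N = \mu, \qquad V(y_i) = \sum_{i=1}^{N} (u_i - \mu)^2 / N = \sigma^2$$

$$E(\bar{y}) = E\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(y_i) = \mu$$

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} u_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} u_i\right)^2$$

For $i \ne j$,

$$\mathrm{Cov}(y_i, y_j) = E[(y_i - \mu)(y_j - \mu)] = E(y_i y_j) - E(y_i)E(y_j), \qquad E(y_i y_j) = \frac{1}{N(N-1)}\sum_{i \ne j} u_i u_j$$

so that

$$\begin{aligned}
\mathrm{Cov}(y_i, y_j) &= \frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j \ne i} u_i u_j - \left(\frac{1}{N}\sum_{i=1}^{N} u_i\right)^2 \\
&= \frac{1}{N(N-1)}\left[\left(\sum_{i=1}^{N} u_i\right)^2 - \sum_{i=1}^{N} u_i^2\right] - \left(\frac{1}{N}\sum_{i=1}^{N} u_i\right)^2
\end{aligned}$$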

This section based closely on Scheaffer, Mendenhall and Ott


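$$\begin{aligned}
\mathrm{Cov}(y_i, y_j) &= \frac{1}{N(N-1)}\left[\left(\sum_{i=1}^{N} u_i\right)^2 - \sum_{i=1}^{N} u_i^2\right] - \left(\frac{1}{N}\sum_{i=1}^{N} u_i\right)^2 \\
&= \left[\frac{1}{N(N-1)} - \frac{1}{N^2}\right]\left(\sum_{i=1}^{N} u_i\right)^2 - \frac{1}{N(N-1)}\sum_{i=1}^{N} u_i^2 \\
&= \frac{1}{N^2(N-1)}\left(\sum_{i=1}^{N} u_i\right)^2 - \frac{1}{N(N-1)}\sum_{i=1}^{N} u_i^2 \\
&= -\frac{1}{N-1}\left[\frac{1}{N}\sum_{i=1}^{N} u_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} u_i\right)^2\right] \\
&= -\frac{\sigma^2}{N-1}
\end{aligned}$$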

We can use this fact to find the variance of $\bar{y}$.


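$$\begin{aligned}
V(\bar{y}) &= \frac{1}{n^2}\left[\sum_{i=1}^{n} V(y_i) + 2\sum_{i<j}\mathrm{Cov}(y_i, y_j)\right] \\
&= \frac{1}{n^2}\left[n\sigma^2 - 2\,\frac{n(n-1)}{2}\,\frac{\sigma^2}{N-1}\right] \\
&= \frac{\sigma^2}{n}\left[1 - \frac{n-1}{N-1}\right] \\
&= \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)
\end{aligned}$$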
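The result V(ȳ) = (σ²/n)(N − n)/(N − 1) can be verified exactly for a small population by listing every possible sample drawn without replacement. The Python sketch below does this with exact fractions; the population values and the sample size are hypothetical numbers chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations

population = [1, 2, 4, 7, 11]   # hypothetical values u_1, ..., u_N
N = len(population)
n = 2

mu = Fraction(sum(population), N)
sigma2 = sum((u - mu) ** 2 for u in population) / N   # divisor N, as in the slides

# Every possible sample of size n is equally likely under simple random sampling.
sample_means = [Fraction(sum(s), n) for s in combinations(population, n)]
mean_of_means = sum(sample_means) / len(sample_means)
var_of_means = sum((m - mean_of_means) ** 2 for m in sample_means) / len(sample_means)

formula = sigma2 / n * Fraction(N - n, N - 1)

print("E(ybar) =", mean_of_means, " mu =", mu)
print("V(ybar) by enumeration =", var_of_means, " formula =", formula)
```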


Variance estimation

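$$\begin{aligned}
E(s^2) &= E\left[\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2\right] = \frac{1}{n-1}E\left[\sum_{i=1}^{n}(y_i - \bar{y})^2\right] \\
&= \frac{1}{n-1}E\left[\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right] \\
&= \frac{1}{n-1}\left[\sum_{i=1}^{n} E(y_i^2) - nE(\bar{y}^2)\right] \\
&= \frac{1}{n-1}\left[n(\sigma^2 + \mu^2) - n\left(V(\bar{y}) + \mu^2\right)\right] \\
&= \frac{1}{n-1}\left[n\sigma^2 - n\,\frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)\right]
\end{aligned}$$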


When Can we Ignore FPCFs?

• Large population relative to sample
  – N/n is large

• The sample does not form part of the population for which we are issuing the prediction
  – Destructive sampling of manufacturing output

• From now on we shall ignore FPCFs
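To give a feel for what "N/n is large" means in practice, the short Python sketch below tabulates the finite population correction factor (N − n)/(N − 1) that multiplies σ²/n in the variance of the sample mean. The sample size of 100 and the population sizes are hypothetical values chosen for illustration.

```python
def fpc(N, n):
    """Finite population correction factor (N - n) / (N - 1)."""
    return (N - n) / (N - 1)

n = 100
for N in (200, 1_000, 10_000, 1_000_000):
    print(f"N = {N:>9,}  n/N = {n / N:.4f}  FPC = {fpc(N, n):.4f}")
```

When the sample is a small fraction of the population, the factor is close to 1 and ignoring it changes the variance very little.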


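$$\begin{aligned}
E(s^2) &= \frac{n}{n-1}\left[\sigma^2 - \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)\right] \\
&= \frac{\sigma^2}{n-1}\left[\frac{n(N-1) - (N-n)}{N-1}\right] \\
&= \frac{\sigma^2}{n-1}\cdot\frac{N(n-1)}{N-1} = \frac{N}{N-1}\,\sigma^2
\end{aligned}$$

Hence

$$E\left[\frac{s^2}{n}\left(\frac{N-n}{N}\right)\right] = \frac{N-n}{Nn}\cdot\frac{N}{N-1}\,\sigma^2 = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) = V(\bar{y})$$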
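Both expectations can be confirmed by exact enumeration over all samples from a small population: E(s²) = Nσ²/(N − 1), and (s²/n)(N − n)/N has expectation equal to V(ȳ). The Python sketch below uses hypothetical population values and hypothetical helper names, and works in exact fractions so that the equalities hold without rounding.

```python
from fractions import Fraction
from itertools import combinations

population = [1, 2, 4, 7, 11]   # hypothetical values u_1, ..., u_N
N, n = len(population), 3

def mean(values):
    """Exact arithmetic mean."""
    return Fraction(1, len(values)) * sum(values)

def sample_variance(values):
    """s^2 with divisor (number of values - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

mu = mean(population)
sigma2 = sum((u - mu) ** 2 for u in population) / N   # divisor N, as in the slides

samples = list(combinations(population, n))           # all equally likely samples
expected_s2 = mean([sample_variance(s) for s in samples])
var_ybar = mean([(mean(s) - mu) ** 2 for s in samples])
expected_estimate = mean(
    [sample_variance(s) / n * Fraction(N - n, N) for s in samples]
)

print("E(s^2) =", expected_s2, " N sigma^2 / (N - 1) =", Fraction(N, N - 1) * sigma2)
print("E[(s^2/n)(N-n)/N] =", expected_estimate, " V(ybar) =", var_ybar)
```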


Error Bounds and Sample Size

It is traditional to use error bounds of two standard errors.

This is a way of giving an impression of the precision of the survey.

The desired error bound, B, can be used to fix the size of the sample.

$$B = \frac{2\sigma}{\sqrt{n}} \quad\Rightarrow\quad \sqrt{n} = \frac{2\sigma}{B} \quad\Rightarrow\quad n = \frac{4\sigma^2}{B^2}$$

This is the appropriate formula for the sample size given a desired bound on the mean
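As a small worked illustration, the sample size formula n = 4σ²/(bound)² can be wrapped in a function. This Python sketch ignores the finite population correction, rounds up to the next whole unit, and assumes that a value for σ is available from a pilot study or prior knowledge; the function name and the numbers used are hypothetical.

```python
import math

def sample_size_for_bound(sigma, bound):
    """Smallest n for which two standard errors, 2 * sigma / sqrt(n), do not exceed bound."""
    if sigma <= 0 or bound <= 0:
        raise ValueError("sigma and bound must be positive")
    return math.ceil(4 * sigma ** 2 / bound ** 2)

# Example: a guessed standard deviation of 4 and a desired bound of 1 on the mean.
print(sample_size_for_bound(4, 1))
```

Halving the bound quadruples the required sample size, since n varies with the inverse square of the bound.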


Questions

• How many people do you have to have in a room before the probability that at least two share the same birthday is at least 1/2?

• Suppose we want to estimate the total in a population rather than the mean. What is the error bound on the total?

• What is the error bound for a population proportion?
