Statistical software for Sampling from
Finite Populations: an analysis using
Monte Carlo simulations
Michele De Meo
PhD in Statistics
Summary of the thesis
Università degli Studi di Bari - Italy
Aim of the thesis
Test the quality of statistical software for probability proportional to size (PPS)
sampling.
Quality here means the ability of the software to preserve the theoretical
properties of the algorithms:
Monte Carlo Simulations
PPS Sampling without Replacement
1. Hanurav-Vijayan
2. Rao-Sampford
Three important properties:
first-order inclusion probabilities proportional to size;
these algorithms enable the computation of the joint selection probabilities;
the joint selection probabilities usually ensure the non-negativity and the
stability of the Sen-Yates-Grundy variance estimator.
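In standard notation (not reproduced on the slides), the Sen-Yates-Grundy variance estimator for the Horvitz-Thompson total estimator is:

```latex
\widehat{V}_{\mathrm{SYG}}\!\left(\hat{Y}_{\mathrm{HT}}\right)
  = \sum_{i \in s}\sum_{\substack{j \in s \\ j > i}}
    \frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}}
    \left( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \right)^{2}
```

It is non-negative whenever the joint probabilities satisfy π_i π_j ≥ π_ij for all pairs, which is why the behaviour of the π_ij matters so much.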
Statistical Software and PPS sampling
Closed source software:
SAS PROC SURVEYSELECT
SPSS COMPLEX SAMPLES
Open source software:
R SAMPLING library
Some notes
The official documentation of Hanurav-Vijayan (H-V) appears confusing;
two scientific articles tried to correct the original one.
According to the official user's guides of SAS and SPSS, the H-V algorithm
was implemented in a way that does not exactly coincide with the original
one, and the two implementations also seem to differ from each other.
The source code of SAS and SPSS is "closed" and not available, so it is
not possible to check the implemented algorithm "directly".
Hanurav-Vijayan is not available in R, so the algorithm was developed (and
tested) according to the official bibliography.
Simulation and software control:
the sampled population.
The target population (used to test the algorithms) is the
following, assuming a sample of size n = 5:
the auxiliary variable (x) used to select the sample is equal to i.
This is a "trick" to simplify the code while keeping the
experiment valid.
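With the trick x_i = i, the first-order inclusion probabilities follow directly from the PPS definition π_i = n·x_i / Σx. A minimal sketch (the population size N = 10 is an assumption for illustration; the slides show the population only as an image):

```python
# First-order inclusion probabilities for PPS sampling when the
# auxiliary variable is x_i = i.
# N = 10 is a hypothetical population size, not stated in the slides.
N = 10
n = 5                               # sample size used in the thesis
x = list(range(1, N + 1))           # auxiliary variable: x_i = i
total = sum(x)                      # N(N+1)/2 = 55 for N = 10

# pi_i = n * x_i / sum(x): inclusion probability proportional to size
pi = [n * xi / total for xi in x]

print(pi)        # pi_i grows linearly with i, as the trick intends
print(sum(pi))   # the inclusion probabilities must sum to n
```

Choosing x_i = i makes the theoretical π_i trivial to compute by hand, which is exactly what makes the population convenient for checking software output.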
Simulation and software control:
the sampled population.
A positive outcome for the tests performed with this
population (with a sample of size n = 5) is necessary
but not sufficient for the validity of the sampling
algorithms.
A negative outcome with this population and this sample size
would be sufficient to invalidate an algorithm: it must "work"
regardless of the population or the sample size.
Simulation and software control:
test 1
The Joint Probability Matrix for Hanurav-Vijayan and Rao-Sampford
is well known.
A first test compares the output of the software with this correct
matrix.
Simulation and software control:
test 1
Such a test is not easy in SAS and SPSS:
it is a hidden procedure;
the returned matrix refers only to the selected units, not to the
whole population;
an ad-hoc procedure therefore had to be developed.
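The core of such an ad-hoc check is an element-wise comparison between the matrix the software returns and the theoretical one. A toy sketch (the matrices below are placeholders, not the thesis values):

```python
def max_abs_diff(estimated, theoretical):
    """Largest element-wise absolute difference between two
    same-shaped matrices given as lists of lists."""
    return max(
        abs(e - t)
        for row_e, row_t in zip(estimated, theoretical)
        for e, t in zip(row_e, row_t)
    )

# placeholder 2x2 joint probability matrices for illustration only
theoretical = [[0.0, 0.12], [0.12, 0.0]]
returned    = [[0.0, 0.12], [0.12, 0.0]]

diff = max_abs_diff(returned, theoretical)
print(diff)   # 0.0 when the software output matches the theory exactly
```

The practical difficulty reported in the thesis is not this comparison but assembling the full-population matrix at all, since SAS and SPSS only expose joint probabilities for the selected units.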
Simulation and software control:
results of test 1
In SAS, SPSS and R (for both Hanurav-Vijayan and Rao-
Sampford) the matrix is exactly equal to the original
one!
Simulation and software control:
test 2
Perform a Monte Carlo simulation to obtain a numerical
estimate of the joint probability matrix.
Measure the "distance" between estimates and the original
matrix.
Simulation steps
1. define the target population and the sample size;
2. define the matrix P^(0) = [0]_{N×N}, where N is the population size;
3. define the number of simulations (K);
4. execute the following steps K times, for H = 1, 2, …, K:
5. draw a sample of n = 5 units, then build the vector s_H = [s_H(i)]_{N×1};
the element s_H(i) is equal to 1 if unit i of the population has been
selected, 0 otherwise;
6. update the matrix P: P^(H) = P^(H-1) + s_H s'_H;
7. the cross product s_H s'_H produces a symmetric N × N matrix whose
element (i, j) is 1 for pairs of drawn units, 0 otherwise.
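The steps above can be sketched as follows. As a stand-in for the PPS selection step (the actual Hanurav-Vijayan or Rao-Sampford draw), this sketch uses simple random sampling without replacement, so only the indicator-vector and matrix-update machinery is illustrated:

```python
import random

N, n, K = 10, 5, 1000        # population size, sample size, replications
random.seed(42)              # reproducibility

# step 2: P^(0) = [0]_{NxN}
P = [[0] * N for _ in range(N)]

# step 4: repeat K times
for _ in range(K):
    # step 5: draw a sample and build the indicator vector s
    # (SRSWOR here, as a stand-in for the PPS algorithm under test)
    drawn = set(random.sample(range(N), n))
    s = [1 if i in drawn else 0 for i in range(N)]

    # steps 6-7: P^(H) = P^(H-1) + s s', where (s s')_{ij} = 1
    # exactly when both units i and j were drawn
    for i in range(N):
        if s[i]:
            for j in range(N):
                if s[j]:
                    P[i][j] += 1

# numerical estimate of the joint probability matrix: P^(K) / K
pi_hat = [[P[i][j] / K for j in range(N)] for i in range(N)]
```

Under SRSWOR the diagonal entries should approach n/N = 0.5 and the off-diagonal entries n(n-1)/(N(N-1)) ≈ 0.222 as K grows; with a real PPS scheme they would instead approach π_i and π_ij.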
At the end of this simulation process, the "numerical" estimate
of the Joint Probability Matrix is equal to P^(K)/K.
The justification for this simulation process is
the weak law of large numbers:
P^(K)/K is a "good" estimate of π.
It is possible to "measure the distance" between
the estimated and the true values using the following
distribution:
For each pair of units (i, j), we can use the p-level to
analyze "how good the estimate is".
P-levels too close to zero are indicative of a
wrong output!
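The slides do not reproduce the test distribution used. One standard choice (an assumption here, not necessarily the thesis's exact test) is a normal approximation to the binomial count of joint selections: under H0 the number of times pair (i, j) is drawn in K replications is Binomial(K, π_ij). A sketch:

```python
import math

def p_level(count, K, pi_ij):
    """Two-sided p-value for observing `count` joint selections of a
    pair in K replications under H0: true joint probability = pi_ij.
    Normal approximation to the binomial (an assumed test, not taken
    from the slides); requires 0 < pi_ij < 1."""
    mean = K * pi_ij
    sd = math.sqrt(K * pi_ij * (1 - pi_ij))
    z = (count - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))   # = 2 * P(Z > |z|)

# a count equal to its expectation gives p-level 1 ...
print(p_level(1200, 10_000, 0.12))
# ... while a badly miscalibrated count gives a p-level near 0,
# the "wrong output" signal described above
print(p_level(900, 10_000, 0.12))
```

This is how p-levels close to zero flag disagreement between the simulated frequencies and the theoretical matrix.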
Simulation results
P-level in R for Hanurav-Vijayan.
k=10,000,001 and n=5. p-level<.01 highlighted in red.
Simulation results
P-level in SAS for Hanurav-Vijayan.
k=10,000,001 and n=5. p-level<.01 highlighted in red.
Simulation results
P-level in SPSS for Hanurav-Vijayan.
k=1,000,001 and n=5. p-level<.01 highlighted in red.
Simulation results
P-level in R for Rao-Sampford.
k=10,000,001 and n=5. p-level<.01 highlighted in red.
Simulation results
P-level in SAS for Rao-Sampford.
k=1,000,001 and n=5. p-level<.01 highlighted in red.
Simulation results
P-level in SPSS for Rao-Sampford.
k=1,000,001 and n=5. p-level<.01 highlighted in red.
Conclusions
R is better suited for the development of this type of
simulation:
the data.frame (the "container" of the data) is easily managed
as a matrix-like object, so it is easy to access the data;
specific "libraries" are needed for working with matrices in SAS
and SPSS;
R code is more fluid and more powerful!
Conclusions
Looking at the joint probability matrix in R, there is a clear
"correspondence" between estimated and real values (for
both Rao-Sampford and Hanurav-Vijayan).
The results show a generally bad situation for both algorithms
tested in SAS and SPSS:
almost all p-values are equal to zero, even for the
first-order inclusion probabilities!
Conclusions
SAS and SPSS do not converge to the correct result, for two
possible reasons:
1. wrong implementation of the algorithm in the program code
(both of them!);
2. a wrong pseudo-random number generator (PRNG):
Conclusions
Pseudo-Random Number Generators used:
SAS:
linear congruential generator, Park-Miller (period: 2^31 - 1)
R and SPSS:
Mersenne-Twister (period: 2^19937 - 1)
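For illustration, the Park-Miller ("minimal standard") generator can be sketched in a few lines; the contrast with the Mersenne-Twister's enormous period and state makes the concern about SAS's generator concrete:

```python
# Park-Miller multiplicative linear congruential generator:
#   x_{k+1} = 16807 * x_k  mod (2^31 - 1)
# The whole state is one 31-bit integer, so the sequence repeats
# after at most 2^31 - 2 draws, versus 2^19937 - 1 for Mersenne-Twister.
M = 2**31 - 1   # Mersenne prime modulus
A = 16807       # Park-Miller multiplier

def park_miller(seed, count):
    """Return `count` raw values of the Park-Miller LCG from `seed`."""
    x = seed
    out = []
    for _ in range(count):
        x = (A * x) % M
        out.append(x)
    return out

# classic check values for seed = 1
print(park_miller(1, 3))   # [16807, 282475249, 1622650073]
```

Beyond the short period, LCGs of this type are known to show lattice structure in higher dimensions, which is one plausible mechanism for the biased joint-selection frequencies reported above.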
Conclusions
SAS and SPSS lead to strongly "biased" results, regardless
of the cause of the non-convergence (both for Hanurav-Vijayan
and for Rao-Sampford).
This has a negative impact on the validity of simulation studies
carried out with these procedures (in SAS and SPSS), for
example Monte Carlo simulations to verify the bias of an
estimator (such as an estimator of the total or of the variance).
Bibliography
Brewer, K. and Hanif, M. (1983), Sampling with Unequal Probabilities, Springer-Verlag, New York.
Chambers, J. (2008), Software for Data Analysis: Programming with R, Springer, New York.
Chieppa, M. and D'Orazio, M. (1999), Appunti di Teoria dei Campioni, Università degli Studi del Sannio, Benevento.
Cicchitelli, G., Herzel, A. and Montanari, G. E. (1997), Il campionamento statistico, Il Mulino, Bologna.
Efron, B. and Tibshirani, R. (1991), Statistical data analysis in the computer age, Science, vol. 253, pp. 390-395.
Fishman, G. and Moore, L. R. (1981), In search of correlation in multiplicative congruential generators with modulus 2^31-1, Computer Science and Statistics, Proceedings of the 13th Symposium on the Interface, pp. 155-157.
Fox, D. (1989), Computer Selection of Size-Biased Samples, The American Statistician, vol. 43 (3), pp. 168-171.
Gentle, J., Härdle, W. and Mori, Y. (2008), Handbook of Computational Statistics: Concepts and Methods, Springer-Verlag, New York.
Golmant, J. (1990), Correction: Computer Selection of Size-Biased Samples, The American Statistician, vol. 44 (2), p. 194.
Hanurav, T. (1967), Optimum Utilization of Auxiliary Information: Sampling of Two Units from a Stratum, Journal of the Royal Statistical Society, vol. B (29), pp. 374-391.
Lauro, C. (1996), Computational statistics or statistical computing, is that the question?, Computational Statistics and Data Analysis, vol. 23 (1), pp. 191-193.
L'Ecuyer, P. (1990), Random Numbers for Simulation, Communications of the ACM, vol. 33 (10), pp. 85-97.
Marsaglia, G. (1995), The Diehard Battery of Tests of Randomness, Technical report, Florida State University, http://www.stat.fsu.edu/pub/diehard/.
Matsumoto, M. and Nishimura, T. (1998), Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator, ACM Transactions on Modeling and Computer Simulation, vol. 8 (1), pp. 3-30.
Mecatti, F. (2004), Lezioni di Metodi di Simulazione, Università degli Studi di Milano-Bicocca.
Mood, A., Graybill, A. and Boes, D. (1991), Introduzione alla Statistica, McGraw-Hill, Milano.
Rao, J. (1965), On two simple schemes of unequal probability sampling without replacement, The Indian Journal of Statistics, vol. 3, pp. 173-180.
Raynald, L. (2008), Programming and Data Management for SPSS Statistics 17.0. A Guide for SPSS Statistics and SAS Users, http://www.spss.com/statistics/base/ProgDataMgmtSPSS17.pdf.
Sampford, M. (1967), On sampling without replacement with unequal probabilities of selection, Biometrika, vol. 54, pp. 499-513.
SAS-Institute (1999), SAS/STAT User's Guide - Version 8, http://www.math.wpi.edu/saspdf/stat/chap63.pdf.
Sen, A. (1953), On the estimate of variance in sampling with varying probabilities, Journal of the Indian Society of Agricultural Statistics, vol. 5, pp. 119-127.
Tillé, Y. (2006), Sampling Algorithms, Springer Series in Statistics, New York.
Tillé, Y. and Alina, M. (2009), Package sampling, http://cran.r-project.org/web/packages/sampling/sampling.pdf.
Vijayan, K. (1967), An Exact pps Sampling Scheme: Generalization of a Method of Hanurav, Journal of the Royal Statistical Society, vol. B (30), pp. 556-566.
Watts, D. (1991), Correction: Computer Selection of Size-Biased Samples, The American Statistician, vol. 45 (2), p. 172.
Wu, C. (2005), R/S-PLUS codes for the pseudo EL method and the Rao-Sampford sampling procedure, Technical report, University of Waterloo, http://www.math.uwaterloo.ca/~cbwu/Rcodes/04JSS.R.
Yates, F. and Grundy, P. (1953), Selection without replacement from within strata with probability proportional to size, Journal of the Royal Statistical Society, vol. B (15), pp. 253-261.