Statistical software for Sampling from
Finite Populations: an analysis using
Monte Carlo simulations
Michele De Meo
PhD in Statistics
Summary of the thesis
Università degli Studi di Bari - Italy
Aim of the thesis
Test the quality of statistical software for probability proportional to size (PPS)
sampling.
Quality here means the ability of the software to preserve the theoretical
properties of the algorithms:
Monte Carlo Simulations
PPS Sampling without Replacement
1. Hanurav-Vijayan
2. Rao-Sampford
Three important properties:
first-order inclusion probabilities proportional to size;
these algorithms enable the computation of the joint selection probabilities;
the joint selection probabilities usually ensure the non-negativity and the
stability of the Sen-Yates-Grundy variance estimator.
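In standard notation (not reproduced on the slides), the Sen-Yates-Grundy variance estimator for the Horvitz-Thompson total estimator is:

```latex
\widehat{V}_{\mathrm{SYG}}\!\left(\hat{Y}_{\mathrm{HT}}\right)
  = \sum_{i \in s}\sum_{\substack{j \in s \\ j > i}}
    \frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}}
    \left( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \right)^{2}
```

It is non-negative whenever the joint probabilities satisfy π_i π_j ≥ π_ij for all pairs, which is why the behaviour of the π_ij matters so much.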
Statistical Software and PPS sampling
Closed source software:
SAS PROC SURVEYSELECT
SPSS COMPLEX SAMPLES
Open source software:
R SAMPLING library
Some notes
The official documentation of Hanurav-Vijayan (H-V) appears confusing;
two scientific articles tried to correct the original one.
According to the official user's guides of SAS and SPSS, the H-V algorithm
was implemented in a way that does not exactly coincide with the original
one, and the two implementations also seem to differ from each other.
The source code of SAS and SPSS is "closed" and not available, so it is
not possible to check the implemented algorithm "directly".
Hanurav-Vijayan is not available in R, so the algorithm was developed (and
tested) according to the official bibliography.
Simulation and software control:
the sampled population.
The target population (used to test the algorithms) is the
following, assuming a sample of size n = 5:
the auxiliary variable (x) used to select the sample is equal to i.
This is a "trick" to simplify the code while keeping the
experiment valid.
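With the trick x_i = i, the first-order inclusion probabilities follow directly from the PPS definition π_i = n·x_i / Σx. A minimal sketch (the population size N = 10 is an assumption for illustration; the slides show the population only as an image):

```python
# First-order inclusion probabilities for PPS sampling when the
# auxiliary variable is x_i = i.
# N = 10 is a hypothetical population size, not stated in the slides.
N = 10
n = 5                               # sample size used in the thesis
x = list(range(1, N + 1))           # auxiliary variable: x_i = i
total = sum(x)                      # N(N+1)/2 = 55 for N = 10

# pi_i = n * x_i / sum(x): inclusion probability proportional to size
pi = [n * xi / total for xi in x]

print(pi)        # pi_i grows linearly with i, as the trick intends
print(sum(pi))   # the inclusion probabilities must sum to n
```

Choosing x_i = i makes the theoretical π_i trivial to compute by hand, which is exactly what makes the population convenient for checking software output.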
Simulation and software control:
the sampled population.
A positive outcome for the tests performed with this
population (with a sample of size n = 5) is necessary
but not sufficient for the validity of the sampling
algorithms.
A negative outcome with this population and this sample size
would be sufficient to invalidate an algorithm: it must "work"
regardless of the population or the sample size.
Simulation and software control:
test 1
The Joint Probability Matrix for Hanurav-Vijayan and Rao-Sampford
is well known.
A first test compares the output of the software with this correct
matrix.
Simulation and software control:
test 1
Such a test is not easy in SAS and SPSS:
it is a hidden procedure;
the returned matrix refers only to the selected units, not to the
whole population;
an ad-hoc procedure therefore had to be developed.
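The core of such an ad-hoc check is an element-wise comparison between the matrix the software returns and the theoretical one. A toy sketch (the matrices below are placeholders, not the thesis values):

```python
def max_abs_diff(estimated, theoretical):
    """Largest element-wise absolute difference between two
    same-shaped matrices given as lists of lists."""
    return max(
        abs(e - t)
        for row_e, row_t in zip(estimated, theoretical)
        for e, t in zip(row_e, row_t)
    )

# placeholder 2x2 joint probability matrices for illustration only
theoretical = [[0.0, 0.12], [0.12, 0.0]]
returned    = [[0.0, 0.12], [0.12, 0.0]]

diff = max_abs_diff(returned, theoretical)
print(diff)   # 0.0 when the software output matches the theory exactly
```

The practical difficulty reported in the thesis is not this comparison but assembling the full-population matrix at all, since SAS and SPSS only expose joint probabilities for the selected units.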
Simulation and software control:
results of test 1
In SAS, SPSS and R (for both Hanurav-Vijayan and Rao-
Sampford) the matrix is exactly equal to the original
one!
Simulation and software control:
test 2
Perform a Monte Carlo simulation to obtain a numerical
estimate of the joint probability matrix.
Measure the "distance" between estimates and the original
matrix.
Simulation steps
1. define the target population and the sample size;
2. define the matrix P^(0) = [0]_{N×N}, where N is the population size;
3. define the number of simulations (K);
4. execute the following steps K times, for H = 1, 2, …, K:
5. draw a sample of n = 5 units, then build the vector s_H = [s_H(i)]_{N×1};
the element s_H(i) is equal to 1 if unit i of the population has been
selected, 0 otherwise;
6. update the matrix P: P^(H) = P^(H-1) + s_H s'_H;
7. the cross product s_H s'_H produces a symmetric N × N matrix whose
element (i, j) is 1 for pairs of drawn units, 0 otherwise.
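The steps above can be sketched as follows. As a stand-in for the PPS selection step (the actual Hanurav-Vijayan or Rao-Sampford draw), this sketch uses simple random sampling without replacement, so only the indicator-vector and matrix-update machinery is illustrated:

```python
import random

N, n, K = 10, 5, 1000        # population size, sample size, replications
random.seed(42)              # reproducibility

# step 2: P^(0) = [0]_{NxN}
P = [[0] * N for _ in range(N)]

# step 4: repeat K times
for _ in range(K):
    # step 5: draw a sample and build the indicator vector s
    # (SRSWOR here, as a stand-in for the PPS algorithm under test)
    drawn = set(random.sample(range(N), n))
    s = [1 if i in drawn else 0 for i in range(N)]

    # steps 6-7: P^(H) = P^(H-1) + s s', where (s s')_{ij} = 1
    # exactly when both units i and j were drawn
    for i in range(N):
        if s[i]:
            for j in range(N):
                if s[j]:
                    P[i][j] += 1

# numerical estimate of the joint probability matrix: P^(K) / K
pi_hat = [[P[i][j] / K for j in range(N)] for i in range(N)]
```

Under SRSWOR the diagonal entries should approach n/N = 0.5 and the off-diagonal entries n(n-1)/(N(N-1)) ≈ 0.222 as K grows; with a real PPS scheme they would instead approach π_i and π_ij.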
At the end of this simulation process, the "numerical" estimate
of the Joint Probability Matrix is equal to P^(K)/K.
The justification for this simulation process is
the weak law of large numbers:
P^(K)/K is a "good" estimate of π.
It is possible to "measure the distance" between
the estimated and the true values using the following
distribution:
For each pair of units (i, j), we can use the p-level to
analyze "how good the estimate is".
P-levels too close to zero are indicative of a
wrong output!
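The slides do not reproduce the test distribution used. One standard choice (an assumption here, not necessarily the thesis's exact test) is a normal approximation to the binomial count of joint selections: under H0 the number of times pair (i, j) is drawn in K replications is Binomial(K, π_ij). A sketch:

```python
import math

def p_level(count, K, pi_ij):
    """Two-sided p-value for observing `count` joint selections of a
    pair in K replications under H0: true joint probability = pi_ij.
    Normal approximation to the binomial (an assumed test, not taken
    from the slides); requires 0 < pi_ij < 1."""
    mean = K * pi_ij
    sd = math.sqrt(K * pi_ij * (1 - pi_ij))
    z = (count - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))   # = 2 * P(Z > |z|)

# a count equal to its expectation gives p-level 1 ...
print(p_level(1200, 10_000, 0.12))
# ... while a badly miscalibrated count gives a p-level near 0,
# the "wrong output" signal described above
print(p_level(900, 10_000, 0.12))
```

This is how p-levels close to zero flag disagreement between the simulated frequencies and the theoretical matrix.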
Simulation results
P-level in R for Hanurav-Vijayan.
k=10,000,001 and n=5. p-level<.01 highlighted in red.
Simulation results
P-level in SAS for Hanurav-Vijayan.
k=10,000,001 and n=5. p-level<.01 highlighted in red.
Simulation results
P-level in SPSS for Hanurav-Vijayan.
k=1,000,001 and n=5. p-level<.01 highlighted in red.
Simulation results
P-level in R for Rao-Sampford.
k=10,000,001 and n=5. p-level<.01 highlighted in red.
Simulation results
P-level in SAS for Rao-Sampford.
k=1,000,001 and n=5. p-level<.01 highlighted in red.
Simulation results
P-level in SPSS for Rao-Sampford.
k=1,000,001 and n=5. p-level<.01 highlighted in red.
Conclusions
R is better suited for the development of this type of
simulation:
the data.frame (the "container" of the data) is easily managed
as a matrix-like object, so it is easy to access the data;
specific "libraries" are needed for working with matrices in SAS
and SPSS;
R code is more fluid and more powerful!
Conclusions
Looking at the joint probability matrix in R, there is a clear
"correspondence" between estimated and real values (for
both Rao-Sampford and Hanurav-Vijayan).
The results show a generally bad situation for both algorithms
tested in SAS and SPSS:
almost all p-values are equal to zero, even for the
first-order inclusion probabilities!
Conclusions
SAS and SPSS do not converge to the correct result, for two
possible reasons:
1. wrong implementation of the algorithm in the program code
(both of them!);
2. a wrong pseudo-random number generator (PRNG):
Conclusions
Pseudo-Random Number Generators used:
SAS:
linear congruential generator, Park-Miller (period: 2^31 - 1)
R and SPSS:
Mersenne-Twister (period: 2^19937 - 1)
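For illustration, the Park-Miller ("minimal standard") generator can be sketched in a few lines; the contrast with the Mersenne-Twister's enormous period and state makes the concern about SAS's generator concrete:

```python
# Park-Miller multiplicative linear congruential generator:
#   x_{k+1} = 16807 * x_k  mod (2^31 - 1)
# The whole state is one 31-bit integer, so the sequence repeats
# after at most 2^31 - 2 draws, versus 2^19937 - 1 for Mersenne-Twister.
M = 2**31 - 1   # Mersenne prime modulus
A = 16807       # Park-Miller multiplier

def park_miller(seed, count):
    """Return `count` raw values of the Park-Miller LCG from `seed`."""
    x = seed
    out = []
    for _ in range(count):
        x = (A * x) % M
        out.append(x)
    return out

# classic check values for seed = 1
print(park_miller(1, 3))   # [16807, 282475249, 1622650073]
```

Beyond the short period, LCGs of this type are known to show lattice structure in higher dimensions, which is one plausible mechanism for the biased joint-selection frequencies reported above.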
Conclusions
SAS and SPSS lead to strongly "biased" results, regardless
of the cause of the non-convergence (both for Hanurav-Vijayan
and for Rao-Sampford).
This has a negative impact on the validity of simulation studies
carried out with these procedures (in SAS and SPSS), for
example Monte Carlo simulations to verify the bias of an
estimator (such as an estimator of the total or of the variance).
Bibliography
Brewer, K. and Hanif, M. (1983), Sampling with Unequal Probabilities, Springer-Verlag, New York.
Chambers, J. (2008), Software for Data Analysis: Programming with R, Springer, New York.
Chieppa, M. and D'Orazio, M. (1999), Appunti di Teoria dei Campioni, Università degli Studi del Sannio, Benevento.
Cicchitelli, G., Herzel, A. and Montanari, G. E. (1997), Il campionamento statistico, Il Mulino, Bologna.
Efron, B. and Tibshirani, R. (1991), Statistical data analysis in the computer age, Science, vol. 253, pp. 390-395.
Fishman, G. and Moore, L. R. (1981), In search of correlation in multiplicative congruential generators with modulus 2^31-1, Computer Science and Statistics, Proceedings of the 13th Symposium on the Interface, pp. 155-157.
Fox, D. (1989), Computer Selection of Size-Biased Samples, The American Statistician, vol. 43 (3), pp. 168-171.
Gentle, J., Härdle, W. and Mori, Y. (2008), Handbook of Computational Statistics: Concepts and Methods, Springer-Verlag, New York.
Golmant, J. (1990), Correction: Computer Selection of Size-Biased Samples, The American Statistician, vol. 44 (2), p. 194.
Hanurav, T. (1967), Optimum Utilization of Auxiliary Information: Sampling of Two Units from a Stratum, Journal of the Royal Statistical Society, vol. B (29), pp. 374-391.
Lauro, C. (1996), Computational statistics or statistical computing, is that the question?, Computational Statistics and Data Analysis, vol. 23 (1), pp. 191-193.
L'Ecuyer, P. (1990), Random Numbers for Simulation, Communications of the ACM, vol. 33 (10), pp. 85-97.
Marsaglia, G. (1995), The Diehard Battery of Tests of Randomness, Technical report, Florida State University, http://www.stat.fsu.edu/pub/diehard/.
Matsumoto, M. and Nishimura, T. (1998), Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator, ACM Transactions on Modeling and Computer Simulation, vol. 8 (1), pp. 3-30.
Mecatti, F. (2004), Lezioni di Metodi di Simulazione, Università degli Studi di Milano-Bicocca.
Mood, A., Graybill, A. and Boes, D. (1991), Introduzione alla Statistica, McGraw-Hill, Milano.
Rao, J. (1965), On two simple schemes of unequal probability sampling without replacement, The Indian Journal of Statistics, vol. 3, pp. 173-180.
Raynald, L. (2008), Programming and Data Management for SPSS Statistics 17.0. A Guide for SPSS Statistics and SAS Users, http://www.spss.com/statistics/base/ProgDataMgmtSPSS17.pdf.
Sampford, M. (1967), On sampling without replacement with unequal probabilities of selection, Biometrika, vol. 54, pp. 499-513.
SAS-Institute (1999), SAS/STAT User's Guide - Version 8, http://www.math.wpi.edu/saspdf/stat/chap63.pdf.
Sen, A. (1953), On the estimate of variance in sampling with varying probabilities, Journal of the Indian Society of Agricultural Statistics, vol. 5, pp. 119-127.
Tillé, Y. (2006), Sampling Algorithms, Springer Series in Statistics, New York.
Tillé, Y. and Alina, M. (2009), Package sampling, http://cran.r-project.org/web/packages/sampling/sampling.pdf.
Vijayan, K. (1967), An Exact pps Sampling Scheme: Generalization of a Method of Hanurav, Journal of the Royal Statistical Society, vol. B (30), pp. 556-566.
Watts, D. (1991), Correction: Computer Selection of Size-Biased Samples, The American Statistician, vol. 45 (2), p. 172.
Wu, C. (2005), R/S-PLUS codes for the pseudo EL method and the Rao-Sampford sampling procedure, Technical report, University of Waterloo, http://www.math.uwaterloo.ca/~cbwu/Rcodes/04JSS.R.
Yates, F. and Grundy, P. (1953), Selection without replacement from within strata with probability proportional to size, Journal of the Royal Statistical Society, vol. B (15), pp. 253-261.