Eurostat
R and the package StatMatch
Training Course «Statistical Matching»
Rome, 6-8 November 2013
Marcello D’Orazio
Dept. National Accounts and Economic Statistics, Istat
madorazi [at] istat.it

Applying SM techniques: Software Issues

One of the major problems in applying SM techniques was the lack of ad hoc software.

The problem has been tackled in different ways:

a) Writing ad hoc code:
• SAS code by Moriarity (2001)
• S-Plus code by Rässler (2002)
• R code by D’Orazio et al. (2006)
• …

b) Using existing software, typically programs developed for hot deck imputation, when the nonparametric micro SM approach is considered

For further details see Scanu (2009)

Applying SM techniques: Software for SM

These reasons led to the development of two ad hoc software tools:

• SAMWIN (Sacco, 2008): stand-alone program for MS Windows; limited functionality (mainly nonparametric micro approach)
• StatMatch (D’Orazio, 2013; first release in 2008): a free package for the R environment (R Core Team, 2013; http://www.r-project.org/) which implements different SM techniques and some additional functions
  http://CRAN.R-project.org/package=StatMatch

Applying SM techniques: the package StatMatch

StatMatch does not come as stand-alone software: it is an open source add-on package for the R environment.

The first release of StatMatch on the repositories of the Comprehensive R Archive Network (CRAN) dates back to 2008. This release was based on the R code provided in the Appendix of D’Orazio et al. (2006).

Since then a number of updates have been released. The latest version is 1.2.0, released 2012-12-03.

A valuable contribution to the improvement of StatMatch came from the work done within Eurostat’s ESSnet project on “Data Integration” (D’Orazio, 2011).
http://www.cros-portal.eu/sites/default/files//Statistical_Matching_with_StatMatch.pdf

Applying SM techniques: the R environment

“R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS”

It is available for 32 bit or 64 bit Operating Systems (Windows, Mac, and Linux).

It can be downloaded from one of the available CRAN (Comprehensive R Archive Network) mirrors: http://cran.r-project.org/mirrors.html

For further details (license, download, install, etc.) see the FAQ: http://cran.r-project.org/faqs.html

The latest version of R is 3.0.2 (“Frisbee Sailing”), released 25 September 2013.

Applying SM techniques: StatMatch features

StatMatch 1.2.0 provides a series of functions to perform:

• SM macro when dealing with variables distributed according to a multivariate normal distribution: function mixed.mtc
• SM micro methods by means of:
  o random hot deck: function RANDwNND.hotdeck
  o distance hot deck: function NND.hotdeck
  o rank hot deck: function rankNND.hotdeck
  o mixed method (based on predictive mean matching, starting from a multivariate normal distribution): function mixed.mtc

Applying SM techniques: StatMatch features (cont.)

StatMatch 1.2.0 provides a series of functions to perform:

• SM of data from complex sample surveys (micro or macro; mainly categorical variables). Renssen’s calibration-based approach is considered:
  o function harmonize.x
  o function comb.samples
• Exploration of uncertainty in SM when dealing with categorical variables:
  o function Frechet.bounds.cat
  o function Fbwidths.by.x

Applying SM techniques: StatMatch features (cont.)

StatMatch 1.2.0 provides a series of functions to perform:

• Additional useful functions:
  o comp.prop: comparison of the marginal distribution of the same variable(s)
  o pw.assoc: association and PRE measures for categorical variables
  o gower.dist
  o mahalanobis.dist
  o maximum.dist
  o fact2dummy: substitutes a categorical variable with the corresponding dummies
  o create.fused: physically creates the synthetic data source after the application of hot deck SM methods

Basic R: Commands

Technically R is an expression language with a very simple syntax.

When R is launched it issues a prompt (typically “>” in MS Windows) and expects input commands. In practice it is a command line interface.

Elementary commands consist of:

Expressions: an expression is evaluated and printed, and the value is lost. Ex.:
> 2+2
[1] 4

Assignments: the expression is evaluated and the value is passed to a variable, but the result is not automatically printed.
> x <- 2+2 # saves the result in the object ‘x’
>

Basic R: Commands and Data Objects

Usually different commands are written on separate lines.
A long command can be split over several lines (the prompt then changes to “+”).

R is case sensitive

R can store data in different types of objects (vector, factor, matrix, array, data.frame, list)

The printed R output is terse, but usually all the results of the computations are stored in an object (typically a list).

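As a minimal illustration of the object types listed above, the sketch below builds one small example of each using base R only. The object names and values are made up for illustration.

```r
# One small example of each object type (names and values are illustrative)
v  <- c(10, 20, 5)                              # numeric vector
f  <- factor(c("low", "high", "low"))           # factor (levels sorted: "high", "low")
m  <- matrix(1:6, nrow = 2)                     # 2 x 3 matrix
a  <- array(1:24, dim = c(2, 3, 4))             # 3-dimensional array
df <- data.frame(id = 1:3, value = v, grp = f)  # data.frame: columns of different types
l  <- list(vec = v, tab = table(f))             # list: components of any type

# Results stored in a list are retrieved by component name
l$tab
str(df)
```

Components of a list returned by a function are accessed in the same way, with `$` followed by the component name.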
Basic R: Calling Functions

Most of the computations can be done by calling already available functions with the following syntax:

> name_fcn(arg1, arg2, …)

arg1, arg2 are the arguments (mandatory or optional) of the function name_fcn; they must be separated by “,”

Example:
> x <- c(10, 20, 5) # creates “x” with vals. 10, 20 and 5
> mean(x) # calls function “mean” and applies it to “x”
[1] 11.66667

The details about a function can be found in the corresponding help pages:
> ?mean # to access help on “mean”

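To make the distinction between mandatory and optional arguments concrete, here is a small sketch with the base function mean. Its optional argument na.rm has a default value (FALSE) and is usually passed by name; the data values are made up.

```r
x  <- c(10, 20, NA, 5)        # NA marks a missing value
m1 <- mean(x)                 # optional argument na.rm left at its default (FALSE)
m2 <- mean(x, na.rm = TRUE)   # optional argument passed by name
m1                            # missing, because x contains an NA
m2                            # mean of the three observed values
```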
R: Manuals and Packages

Further details can be found on the R web site: http://www.r-project.org

Official manuals can be downloaded here: http://cran.r-project.org/manuals.html

Additional documentation (in many languages) can be found here: http://cran.r-project.org/other-docs.html

Google search on the R web site: http://cran.at.r-project.org/search.html

Additional packages (about 5,000) can be downloaded from the CRAN repository:
http://cran.at.r-project.org/web/packages/available_packages_by_name.html
Here listed by topic: http://cran.at.r-project.org/web/views/

R: Additional Interfaces

R is provided with a command line interface (CLI), which is the preferred interface for power users but requires a good knowledge of the language.

Unfortunately such an interface is intimidating for beginners.

For this reason a number of additional interfaces have been developed:
http://www.sciviews.org/_rgui/

R and the RStudio Interface

A powerful user interface is RStudio (http://www.rstudio.com/).

It’s free and open source, and works on Windows, Mac, and Linux.

Some features:
• Editor window with syntax highlighting, code completion, and smart indentation
• Workspace browser and data viewer
• Plot history, zooming, and flexible image and PDF export
• Integrated R help and documentation

It can be downloaded here: http://www.rstudio.com/ide/download/

Tip: it is better to install R first and then RStudio

Installing and Loading the Package StatMatch

To use the StatMatch functions in R it is necessary to:
o Install StatMatch (just once)
o Load StatMatch (every time R is launched and the StatMatch functions are needed)

Installation
If R is installed on a PC connected to the internet, once R is launched the following command is necessary:

> install.packages("StatMatch")

StatMatch and the other necessary packages will be automatically downloaded and installed on the PC

Installing and Loading the Package StatMatch (cont.)

Loading
Once installed, launch R and send the following command:

> library(StatMatch)

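In a script it can be convenient to check that a package is installed before loading it. The sketch below is one possible way to do so; the helper name pkg_available is hypothetical and is not part of R or StatMatch, while requireNamespace is a base R function.

```r
# Hypothetical helper: TRUE if the package is installed, FALSE (with a hint) otherwise
pkg_available <- function(pkg) {
  ok <- requireNamespace(pkg, quietly = TRUE)
  if (!ok) {
    message("Package '", pkg, "' is not installed; run install.packages(\"",
            pkg, "\") first.")
  }
  ok
}

# Load StatMatch only when it is actually installed
if (pkg_available("StatMatch")) library(StatMatch)
```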
StatMatch: function mixed.mtc

mixed.mtc(data.rec, data.don, match.vars, y.rec, z.don, method="ML", rho.yz=0, micro=FALSE, constr.alg="Hungarian")

Main features:
- Estimates the parameters of the multivariate normal distribution (micro=FALSE)
- Parameters estimated via ML (method="ML") or via the methods proposed by Moriarity & Scheuren (method="MS")
- Estimation under the CI assumption (rho.yz=0) or using auxiliary information, i.e. a guess for ρ_YZ
- Provides the synthetic file (micro=TRUE) by using a mixed SM procedure
- The matching variables (match.vars) can include categorical variables: each is substituted by the corresponding dummies, assumed to be normally distributed

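To show what estimation under the CI assumption implies, the sketch below computes the covariance between Y and Z as cov(Y,X) var(X)^-1 cov(X,Z), the value implied by conditional independence of Y and Z given X. It is a simple moment-based illustration in base R, not the ML or Moriarity & Scheuren estimator implemented in mixed.mtc. The split of the iris data into files A and B and the choice of variables are assumptions made for the example.

```r
# File A observes (X, Y), file B observes (X, Z); the split is illustrative
set.seed(1)
pos <- sample(nrow(iris), 50)
x.vars <- c("Petal.Length", "Petal.Width")
A <- iris[pos,  c(x.vars, "Sepal.Length")]   # Y = Sepal.Length
B <- iris[-pos, c(x.vars, "Sepal.Width")]    # Z = Sepal.Width

S.xx <- cov(as.matrix(rbind(A[, x.vars], B[, x.vars])))  # X is observed in both files
S.xy <- cov(as.matrix(A[, x.vars]), A$Sepal.Length)      # 2 x 1, from file A only
S.xz <- cov(as.matrix(B[, x.vars]), B$Sepal.Width)       # 2 x 1, from file B only

# Covariance of Y and Z implied by conditional independence given X
s.yz.ci <- drop(t(S.xy) %*% solve(S.xx) %*% S.xz)
s.yz.ci
```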
StatMatch: function Frechet.bounds.cat

Frechet.bounds.cat(tab.x, tab.xy, tab.xz, print.f="tables", tol=0.0001)

Main features:
- Estimates the probabilities of the cells in the table Y vs. Z under the CI assumption
- Estimates the Fréchet bounds for the probabilities of the cells in the table Y vs. Z, by conditioning or not on the X variables
- Measures the uncertainty (two alternatives are available)
- Handles only categorical variables

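The bounds computed without conditioning on X can be sketched in base R from the two marginal distributions alone: each cell probability lies between max(0, p_y + p_z - 1) and min(p_y, p_z). The marginal distributions below are made-up values, and the code is an illustration of the idea rather than the implementation used in Frechet.bounds.cat.

```r
# Made-up marginal distributions of Y and Z
p.y <- c(a = 0.2, b = 0.8)
p.z <- c(u = 0.5, v = 0.3, w = 0.2)

# Frechet bounds for each cell of the Y vs. Z table (no conditioning on X)
low <- outer(p.y, p.z, function(py, pz) pmax(0, py + pz - 1))
up  <- outer(p.y, p.z, pmin)

# Cell probabilities under independence of Y and Z, for comparison
ind <- outer(p.y, p.z)

# Width of the interval: a simple cell-level measure of uncertainty
up - low
```

The independence estimate always falls inside the bounds; conditioning on informative X variables typically narrows the intervals.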
StatMatch: function RANDwNND.hotdeck

RANDwNND.hotdeck(data.rec, data.don, match.vars=NULL, don.class=NULL, dist.fun="Manhattan", cut.don="rot", k=NULL, weight.don=NULL, ...)

Main features:
- Random hot deck within classes
- Chance of selecting donors with prob. proportional to some weights (weighted random hot deck)
- Random selection of donors within groups formed according to various rules:
  cut.don="k.dist": donors with distance from recipient <= k
  cut.don="exact": the k closest donors
  cut.don="span": the proportion k of the closest donors
  ….

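The simplest case, random hot deck within donation classes, can be sketched in base R as follows. This is an illustration of the technique under assumed data (a split of iris with Species as the donation class), not the code used inside RANDwNND.hotdeck; the helper pick.donor is hypothetical.

```r
set.seed(123)
lab <- c(1:15, 51:65, 101:115)
rec <- iris[lab,  c("Sepal.Length", "Species")]   # recipient file: no Petal.Width
don <- iris[-lab, c("Petal.Width", "Species")]    # donor file: observes Petal.Width

# Hypothetical helper: draw at random one donor row from the given class
pick.donor <- function(cl) {
  cand <- which(don$Species == cl)
  cand[sample.int(length(cand), 1)]
}

# One randomly chosen donor per recipient, within the same Species
don.row <- vapply(as.character(rec$Species), pick.donor, integer(1))
rec$Petal.Width <- don$Petal.Width[don.row]       # donated values
head(rec)
```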
StatMatch: function create.fused

create.fused(data.rec, data.don, mtc.ids, z.vars, dup.x=FALSE, match.vars=NULL)

Main features:
- Creates the synthetic data set after the application of hot deck SM methods

StatMatch: function NND.hotdeck

NND.hotdeck(data.rec, data.don, match.vars, don.class=NULL, dist.fun="Manhattan", constrained=FALSE, constr.alg="Hungarian", ...)

Main features:
- Distance hot deck, with or without donation classes
- Several distance functions; handling of mixed type matching variables
- Possibility to perform constrained distance hot deck (constrained=TRUE): each donor can be used just once, and donors are assigned so as to minimize the overall matching distance

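A possible usage sketch combining NND.hotdeck and create.fused is given below, using the argument names shown in the signatures. The split of the iris data into recipient and donor files is an assumption made for the example, and the component name mtc.ids of the NND.hotdeck output is taken from the package documentation. The code runs only if StatMatch is installed.

```r
if (requireNamespace("StatMatch", quietly = TRUE)) {
  library(StatMatch)

  lab <- c(1:15, 51:65, 101:115)
  iris.rec <- iris[lab,  c(1:3, 5)]     # recipient: Petal.Width not observed
  iris.don <- iris[-lab, c(1:2, 4:5)]   # donor: Petal.Length not observed

  # Unconstrained distance hot deck within donation classes (Species)
  out.nnd <- NND.hotdeck(data.rec = iris.rec, data.don = iris.don,
                         match.vars = c("Sepal.Length", "Sepal.Width"),
                         don.class = "Species")

  # Synthetic file: recipient records plus the donated Petal.Width
  fused <- create.fused(data.rec = iris.rec, data.don = iris.don,
                        mtc.ids = out.nnd$mtc.ids, z.vars = "Petal.Width")
  head(fused)
}
```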
StatMatch: function rankNND.hotdeck

rankNND.hotdeck(data.rec, data.don, var.rec, var.don=var.rec, don.class=NULL, weight.rec=NULL, weight.don=NULL, constrained=FALSE, constr.alg="Hungarian")

Main features:
- Implements rank hot deck (just one continuous variable)
- Possibility to consider donation classes
- Possibility to use weights in estimating the empirical cumulative distribution function
- Matching can be constrained or unconstrained

StatMatch: function comp.prop

comp.prop(p1, p2, n1, n2=NULL, ref=FALSE)

Main features:
- Compares marginal/joint distributions of categorical variables
- The second distribution (p2) can be the reference one (how close is the first to the second, assumed to be the reliable one)
- Provides similarity/dissimilarity measures
- Performs the Chi-square test and provides a rough idea of the value of the generalised design effect that would determine the acceptance of H0 (alpha=0.05) when comparing distributions estimated from complex sample survey data

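As an illustration of what similarity/dissimilarity measures between two distributions look like, the base R sketch below computes the total variation distance and the overlap for two made-up distributions. The choice of these two measures is an assumption for the example; the code does not reproduce comp.prop or its output.

```r
# Two made-up distributions of the same categorical variable
p1 <- c(a = 0.30, b = 0.50, c = 0.20)
p2 <- c(a = 0.25, b = 0.45, c = 0.30)

tvd     <- 0.5 * sum(abs(p1 - p2))   # dissimilarity: total variation distance
overlap <- sum(pmin(p1, p2))         # similarity: overlap, equal to 1 - tvd
c(tvd = tvd, overlap = overlap)
```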
StatMatch: function pw.assoc

pw.assoc(formula, data, weights=NULL, freq0c=NULL)

Main features:
- Computes Cramér’s V
- Computes measures of the proportional reduction of the variance of the target categorical variable due to each of the available categorical predictors. The measures being considered are Kendall & Stuart lambda and tau and Theil’s uncertainty coefficient U
- It is possible to use units’ weights in estimating the cell frequencies

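For reference, Cramér’s V for a two-way table can be computed in base R from the Pearson chi-square statistic as sqrt(chi2 / (n * (min(r, c) - 1))). The sketch below uses the mtcars data purely as an example and is not the code used by pw.assoc.

```r
# Two-way contingency table of two categorical variables (example data)
tab <- table(mtcars$cyl, mtcars$gear)

# Pearson chi-square statistic (small expected counts trigger a warning here)
chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)

n <- sum(tab)
V <- sqrt(as.numeric(chi2) / (n * (min(dim(tab)) - 1)))   # Cramer's V, between 0 and 1
V
```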
StatMatch: function Fbwidths.by.x

Fbwidths.by.x(tab.x, tab.xy, tab.xz)

Main features:
- Computes uncertainty measures for each possible combination of the X variables
- Two measures of uncertainty are available

StatMatch: function harmonize.x

harmonize.x(svy.A, svy.B, form.x, x.tot=NULL, cal.method="linear", ...)

Main features:
- Harmonizes the marginal/joint distribution of the X variables between two surveys
- Different possible “configurations” of the X variables can be considered (joint distribution of two variables or marginal distributions of both the variables, etc.)
- Weights modified via calibration (linear, raking, etc.) or poststratification
- Possibility to specify the “true” distribution of the X variables that has to be met; alternatively the target distribution is estimated by “pooling” the starting ones

StatMatch: function comb.samples

comb.samples(svy.A, svy.B, svy.C=NULL, y.lab, z.lab, form.x, estimation=NULL, micro=FALSE, ...)

Main features:
- Estimates the joint distribution of Y (y.lab) and Z (z.lab) under the CI assumption
- When data from an auxiliary survey (svy.C) are available it is possible to estimate the contingency table by incomplete or synthetic two-way stratification
- Possibility to obtain predicted cell probabilities at unit level (micro=TRUE) using Linear Probability Models