Eurostat R and the package StatMatch Training Course «Statistical Matching» Rome, 6-8 November 2013 Marcello D’Orazio Dept. National Accounts and Economic Statistics, Istat madorazi [at] istat.it


Eurostat

R and the package StatMatch

Training Course «Statistical Matching»

Rome, 6-8 November 2013

Marcello D’Orazio
Dept. National Accounts and Economic Statistics, Istat
madorazi [at] istat.it


Applying SM techniques: Software Issues

One of the major problems in applying SM techniques has been the lack of ad hoc software.

The problem has been tackled in different ways:

a) Writing ad hoc code:
• SAS code by Moriarity (2001);
• S-Plus code by Rässler (2002);
• R code by D’Orazio et al. (2006)
• …

b) Using existing software, typically programs developed for hot deck imputation, when nonparametric micro SM is considered

For further details see Scanu (2009)


Applying SM techniques: Software for SM

These reasons led to the development of two ad hoc software tools:

• SAMWIN (Sacco, 2008): stand-alone program for MS Windows; limited functionality (mainly nonparametric micro approach)

• StatMatch (D’Orazio, 2013; first release in 2008): a free package for the R environment (R Core Team, 2013; http://www.r-project.org/) which implements different SM techniques and some additional functions
http://CRAN.R-project.org/package=StatMatch


Applying SM techniques: the package StatMatch

StatMatch does not come as stand-alone software: it is an open source add-on package for the R environment.

The first release of StatMatch on the Comprehensive R Archive Network (CRAN) repositories dates back to 2008. This release was based on the R code provided in the Appendix of D’Orazio et al. (2006).

Since then a number of updates have been released. The latest version is 1.2.0, released 2012-12-03.

A valuable contribution to the improvement of StatMatch came from the work done within Eurostat’s ESSnet project on “Data Integration” (D’Orazio, 2011).
http://www.cros-portal.eu/sites/default/files//Statistical_Matching_with_StatMatch.pdf


Applying SM techniques: the R environment

“R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS”

It is available for 32 bit or 64 bit Operating Systems (Windows, Mac, and Linux).

It can be downloaded from one of the available CRAN (Comprehensive R Archive Network) mirrors: http://cran.r-project.org/mirrors.html

For further details (license, download, install, etc.) see the FAQ: http://cran.r-project.org/faqs.html

The latest version of R is 3.0.2 (“Frisbee Sailing”), released 25 September 2013.


Applying SM techniques: StatMatch features

StatMatch 1.2.0 provides a series of functions to perform:

• SM macro when dealing with variables distributed according to a multivariate normal distribution: function mixed.mtc

• SM micro methods by means of:
o random hot deck: function RANDwNND.hotdeck
o distance hot deck: function NND.hotdeck
o rank hot deck: function rankNND.hotdeck
o mixed method (based on predictive mean matching, starting from a multivariate normal distribution): function mixed.mtc


Applying SM techniques: StatMatch features (cont.)

StatMatch 1.2.0 provides a series of functions to perform:

• SM of data from complex sample surveys (micro or macro; mainly categorical variables). Renssen’s calibration-based approach is considered:
o function harmonize.x
o function comb.samples

• Exploration of uncertainty in SM when dealing with categorical variables:
o function Frechet.bounds.cat
o function Fbwidths.by.x


Applying SM techniques: StatMatch features (cont.)

StatMatch 1.2.0 also provides:

• Additional useful functions:
o comp.prop: comparison of the marginal distribution of the same variable(s)
o pw.assoc: association and PRE measures for categorical variables
o gower.dist
o mahalanobis.dist
o maximum.dist
o fact2dummy: substitutes a categorical variable with the corresponding dummies
o create.fused: physically creates the synthetic data source after the application of hot deck SM methods


Basic R: Commands

Technically R is an expression language with a very simple syntax.

When R is launched it issues a prompt (typically “>” in MS Windows) and waits for input commands. In practice it is a command line interface.

Elementary commands consist of:

Expressions: an expression is evaluated and printed, and its value is lost. Ex.:
> 2+2
[1] 4

Assignments: an assignment evaluates the expression and passes the value to a variable, but the result is not automatically printed.

> x <- 2+2 # saves the result in the object ‘x’
>


Basic R: Commands and Data Objects

Usually different commands are written on separate lines.

A long command can be split over several lines (the prompt then changes to “+”).

R is case sensitive.

R can store data in different types of objects (vector, factor, matrix, array, data.frame, list).

R’s printed output is terse, but usually all the results of a computation are stored in an object (typically a list).
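As an illustration of the object types listed above, the following sketch builds a few of them and then inspects the object returned by a model fit. It uses only base R and the built-in cars data set; all object names are chosen for this example.

```r
# Common R data objects (names are illustrative)
v  <- c(1.5, 2.0, 3.5)                    # numeric vector
f  <- factor(c("low", "high", "low"))     # factor (categorical variable)
m  <- matrix(1:6, nrow = 2)               # 2 x 3 matrix
df <- data.frame(id = 1:3, score = v)     # data.frame: columns may differ in type
l  <- list(vec = v, fac = f)              # list: container for heterogeneous objects

# The result of a computation is usually stored in a list-like object
fit <- lm(dist ~ speed, data = cars)      # linear regression on built-in data
names(fit)                                # components stored in the result
fit$coefficients                          # extract one component with "$"
```

Printing fit shows only a short summary; the remaining results are retrieved from its components.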


Basic R: Calling Functions

Most computations can be done by calling already available functions with the following syntax:

> name_fcn(arg1, arg2, …)

arg1, arg2 are the arguments (mandatory or optional) of the function name_fcn; they must be separated by “,”.

Example:

> x <- c(10, 20, 5) # creates “x” with values 10, 20 and 5
> mean(x) # calls the function “mean” and applies it to “x”
[1] 11.66667

The details about a function can be found in the corresponding help page:
> ?mean # to access help on “mean”
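The difference between mandatory and optional arguments can be sketched with the same function. The example below adds a missing value to the vector and uses the optional argument na.rm of mean; the object m is introduced only for illustration.

```r
x <- c(10, 20, 5, NA)            # NA marks a missing value

mean(x)                          # mandatory argument only: the result is NA
mean(x, na.rm = TRUE)            # optional argument passed by name: NA is dropped
m <- mean(na.rm = TRUE, x = x)   # named arguments can be given in any order
m
```

Optional arguments have a default value (here na.rm = FALSE), which is used whenever they are not supplied.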


R: Manuals and Packages

Further details can be found on the R web site: http://www.r-project.org

Official manuals can be downloaded here: http://cran.r-project.org/manuals.html

Additional documentation (in many languages) can be found here: http://cran.r-project.org/other-docs.html

Google search on the R web site: http://cran.at.r-project.org/search.html

Additional packages (about 5,000) can be downloaded from the CRAN repository:
http://cran.at.r-project.org/web/packages/available_packages_by_name.html

Here they are listed by topic: http://cran.at.r-project.org/web/views/


R: Additional Interfaces

R is provided with a command line interface (CLI), which is the preferred interface for power users but requires a good knowledge of the language.

Unfortunately such an interface is intimidating for beginners.

For this reason a number of additional interfaces have been developed.
http://www.sciviews.org/_rgui/


R and the RStudio Interface

A powerful user interface is RStudio (http://www.rstudio.com/).

It’s free and open source, and works on Windows, Mac, and Linux.

Some features:
• Editor window with syntax highlighting, code completion, and smart indentation
• Workspace browser and data viewer
• Plot history, zooming, and flexible image and PDF export
• Integrated R help and documentation

It can be downloaded here: http://www.rstudio.com/ide/download/

Tip: it is better to install R first and then RStudio


Installing and Loading the Package StatMatch

To use the StatMatch functions in R it is necessary to:
o Install StatMatch (just once)
o Load StatMatch (every time R is launched and StatMatch functions are needed)

Installation
If R is installed on a PC connected to the internet, once R has been launched the following command is necessary:

> install.packages("StatMatch")

StatMatch and the other necessary packages will be automatically downloaded and installed on the PC.


Installing and Loading the Package StatMatch (cont.)

Loading
Once installed, launch R and run the following command:

> library(StatMatch)
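The two steps (install once, load at every session) can be combined in a script that checks whether the package is already available. This is a minimal sketch using base R functions only; the object names pkg and installed are illustrative.

```r
pkg <- "StatMatch"

# requireNamespace() returns TRUE when the package is already installed
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  library(pkg, character.only = TRUE)   # load the package for this session
} else {
  message("Package not found: run install.packages(\"", pkg, "\") once, then reload")
}
```

The sketch deliberately does not install anything by itself, so it can also be run on a PC without an internet connection.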


StatMatch: function mixed.mtc

mixed.mtc(data.rec, data.don, match.vars, y.rec, z.don, method="ML", rho.yz=0, micro=FALSE, constr.alg="Hungarian")

Main features:
- Estimates the parameters of the multivariate normal distribution (micro=FALSE)
- Parameters estimated via ML (method="ML") or via the methods proposed by Moriarity & Scheuren (method="MS")
- Estimation under the CI assumption (rho.yz=0) or using auxiliary information, i.e. a guess for ρ_YZ
- Provides the synthetic file (micro=TRUE) by using a mixed SM procedure
- The matching variables (match.vars) can include categorical variables: these are replaced by the corresponding dummies, which are assumed to be normally distributed
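To make the CI assumption concrete: with a single matching variable X and jointly normal (X, Y, Z), conditional independence of Y and Z given X implies cov(Y, Z) = b_YX * b_ZX * var(X), where b_YX and b_ZX are the regression slopes of Y and Z on X. The sketch below estimates this quantity from two simulated files in base R. It only illustrates the idea and is not the code used by mixed.mtc; the data, the seed and all object names are assumptions.

```r
set.seed(1)
# File A observes (X, Y); file B observes (X, Z); both are simulated here
n_a <- 200
n_b <- 300
A <- data.frame(x = rnorm(n_a))
A$y <- 2 * A$x + rnorm(n_a)
B <- data.frame(x = rnorm(n_b))
B$z <- 3 * B$x + rnorm(n_b)

b_yx <- coef(lm(y ~ x, data = A))[["x"]]   # slope of Y on X, estimated on A
b_zx <- coef(lm(z ~ x, data = B))[["x"]]   # slope of Z on X, estimated on B
s_xx <- var(c(A$x, B$x))                   # variance of X on the pooled files
                                           # (n - 1 denominator, not the ML one)

# Covariance of Y and Z implied by conditional independence given X
s_yz <- b_yx * b_zx * s_xx
s_yz
```

Y and Z are never observed together, so this covariance comes entirely from the assumption and cannot be checked on the two files alone.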


StatMatch: function Frechet.bounds.cat

Frechet.bounds.cat(tab.x, tab.xy, tab.xz, print.f="tables", tol=0.0001)

Main features:
- Estimates the probabilities of the cells in the table Y vs. Z under the CI assumption
- Estimates the Fréchet bounds for the probabilities of the cells in the table Y vs. Z, conditioning or not on the X variables
- Measures the uncertainty (two alternatives are available)
- Handles only categorical variables
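The unconditioned Fréchet bounds can be sketched in a few lines of base R: each cell probability P(y, z) lies between max(0, P(y) + P(z) - 1) and min(P(y), P(z)). The marginal distributions below are invented for illustration, and the code is not the implementation of Frechet.bounds.cat.

```r
p_y <- c(0.3, 0.7)     # marginal distribution of Y (illustrative values)
p_z <- c(0.6, 0.4)     # marginal distribution of Z (illustrative values)

low <- pmax(outer(p_y, p_z, "+") - 1, 0)   # Fréchet lower bounds
up  <- outer(p_y, p_z, pmin)               # Fréchet upper bounds
ci  <- outer(p_y, p_z)                     # estimate under independence of Y and Z

up - low                                   # width of the bounds, cell by cell
```

The independence estimate always lies within the bounds; conditioning on the X variables can only make the intervals narrower or leave them unchanged.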


StatMatch: function RANDwNND.hotdeck

RANDwNND.hotdeck(data.rec, data.don, match.vars=NULL, don.class=NULL, dist.fun="Manhattan", cut.don="rot", k=NULL, weight.don=NULL, ...)

Main features:
- Random hot deck within classes
- Chance of selecting donors with probability proportional to some weights (weighted random hot deck)
- Random selection of donors within groups formed according to various rules:
cut.don="k.dist": donors with distance from recipient <= k
cut.don="exact": the k closest donors
cut.don="span": the proportion k of the closest donors
…
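The idea of random hot deck within donation classes can be sketched in base R as follows. The two small data frames, the class variable area and the helper pick_donor are hypothetical; this is not the code of RANDwNND.hotdeck.

```r
set.seed(123)
rec <- data.frame(id = 1:4, area = c("N", "S", "S", "N"),
                  stringsAsFactors = FALSE)
don <- data.frame(area = c("N", "N", "S", "S", "S"),
                  z    = c(10, 12, 20, 22, 24),
                  stringsAsFactors = FALSE)

# For each recipient, draw one donor at random among those in the same class
pick_donor <- function(cl) {
  pool <- which(don$area == cl)          # row indices of the eligible donors
  pool[sample.int(length(pool), 1)]      # safe even when the pool has size one
}
idx <- vapply(rec$area, pick_donor, integer(1), USE.NAMES = FALSE)

fused <- rec
fused$z <- don$z[idx]                    # recipient file with the donated Z
fused
```

A weighted version would pass donor weights to the prob argument of sample.int.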


StatMatch: function create.fused

create.fused(data.rec, data.don, mtc.ids, z.vars, dup.x=FALSE, match.vars=NULL)

Main features:
- Creates the synthetic data set after hot deck matching
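A possible workflow, sketched on the built-in iris data, is to run a hot deck function and pass the matched identifiers it returns to create.fused. The split of iris into a recipient and a donor file is arbitrary, and the sketch assumes that the output of NND.hotdeck contains a component named mtc.ids, as the argument of create.fused suggests. The calls are executed only if StatMatch is installed.

```r
# Split the built-in iris data into a recipient and a donor file
lab <- c(1:15, 51:65, 101:115)
iris.rec <- iris[lab,  c(1:3, 5)]    # Petal.Width is not observed in the recipient
iris.don <- iris[-lab, c(1:2, 4:5)]  # Petal.Length is not observed in the donor

fused <- NULL
if (requireNamespace("StatMatch", quietly = TRUE)) {
  # Distance hot deck within the donation classes defined by Species
  out.NND <- StatMatch::NND.hotdeck(data.rec = iris.rec, data.don = iris.don,
                                    match.vars = c("Sepal.Length", "Sepal.Width"),
                                    don.class = "Species")
  # Attach the donated variable to the recipient file
  fused <- StatMatch::create.fused(data.rec = iris.rec, data.don = iris.don,
                                   mtc.ids = out.NND$mtc.ids,
                                   z.vars = "Petal.Width")
}
```

When the package is available, fused has one row per recipient unit and includes the donated Petal.Width.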


StatMatch: function NND.hotdeck

NND.hotdeck(data.rec, data.don, match.vars, don.class=NULL, dist.fun="Manhattan", constrained=FALSE, constr.alg="Hungarian", ...)

Main features:
- Distance hot deck, with or without donation classes
- Several distance functions; handling of mixed-type matching variables
- Possibility to perform constrained distance hot deck (constrained=TRUE): a donor can be used just once, and donors are assigned so as to minimize the overall matching distance
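The unconstrained case can be sketched in base R: compute the Manhattan distance between every recipient and every donor on the matching variables, then take the closest donor. The data frames and variable names are invented, and this is not the code of NND.hotdeck.

```r
rec <- data.frame(x1 = c(1, 5, 9),    x2 = c(2, 2, 8))
don <- data.frame(x1 = c(0, 4.5, 10), x2 = c(1, 3, 9), z = c(10, 20, 30))
vars <- c("x1", "x2")

# Manhattan distance between every recipient (rows) and every donor (columns)
d <- sapply(seq_len(nrow(don)), function(j) {
  rowSums(abs(sweep(as.matrix(rec[, vars]), 2, unlist(don[j, vars]))))
})

idx <- unname(apply(d, 1, which.min))   # closest donor for each recipient
fused <- cbind(rec, z = don$z[idx])     # unconstrained: donors may be reused
fused
```

In the constrained case the row-wise minimum is replaced by the solution of an assignment problem on the same distance matrix, so that each donor is used at most once.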


StatMatch: function rankNND.hotdeck

rankNND.hotdeck(data.rec, data.don, var.rec, var.don=var.rec, don.class=NULL, weight.rec=NULL, weight.don=NULL, constrained=FALSE, constr.alg="Hungarian")

Main features:
- Implements rank hot deck (just one continuous variable)
- Possibility to consider donation classes
- Possibility to use weights in estimating the empirical cumulative distribution function
- Matching can be constrained or unconstrained
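Rank hot deck matches units on their position in the distribution rather than on the raw values. A minimal unweighted, unconstrained sketch in base R follows; the two vectors are invented and the code is not that of rankNND.hotdeck.

```r
rec_x <- c(10, 20, 30, 40)        # variable observed in the recipient file
don_x <- c(100, 400, 200, 300)    # same concept, different scale, in the donor file

F_rec <- ecdf(rec_x)(rec_x)       # empirical cumulative distribution, recipients
F_don <- ecdf(don_x)(don_x)       # empirical cumulative distribution, donors

# For each recipient pick the donor with the closest cumulative probability
idx <- vapply(F_rec, function(p) which.min(abs(F_don - p)), integer(1))
don_x[idx]
```

Because only the ranks matter, the smallest recipient is matched with the smallest donor even though the two variables are measured on different scales.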


StatMatch: function comp.prop

comp.prop(p1, p2, n1, n2=NULL, ref=FALSE)

Main features:
- Compares marginal/joint distributions of categorical variables
- The second distribution (p2) can be the reference one (how close the first is to the second, which is assumed to be the reliable one)
- Provides similarity/dissimilarity measures
- Performs the chi-square test and provides a rough idea of the value of the generalised design effect that would determine the acceptance of H0 (alpha=0.05) when comparing distributions estimated from complex sample survey data
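Some common similarity/dissimilarity measures between two categorical distributions can be computed directly in base R. The two distributions are invented, and the choice of measures shown here is an assumption about the kind of output comp.prop provides, not a description of it.

```r
p1 <- c(0.20, 0.50, 0.30)   # distribution estimated from the first source
p2 <- c(0.25, 0.45, 0.30)   # reference distribution

tvd     <- 0.5 * sum(abs(p1 - p2))   # total variation distance (dissimilarity)
overlap <- sum(pmin(p1, p2))         # overlap between distributions (similarity)
bhatt   <- sum(sqrt(p1 * p2))        # Bhattacharyya coefficient (similarity)
hell    <- sqrt(1 - bhatt)           # Hellinger distance (dissimilarity)

c(tvd = tvd, overlap = overlap, Bhatt = bhatt, Hell = hell)
```

For two proper distributions the overlap equals one minus the total variation distance, so the two measures carry the same information.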


StatMatch: function pw.assoc

pw.assoc(formula, data, weights=NULL, freq0c=NULL)

Main features:
- Computes Cramér’s V
- Computes measures of the proportional reduction of the variance of the target categorical variable due to each of the available categorical predictors. The measures considered are Kendall & Stuart’s lambda and tau and Theil’s uncertainty coefficient U
- It is possible to use the units’ weights in estimating the cell frequencies
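Cramér’s V can be obtained from the Pearson chi-square statistic as V = sqrt(chi2 / (n * (min(r, c) - 1))), where r and c are the numbers of rows and columns of the table. The helper cramers_v below is a hypothetical base R sketch for unweighted tables, not the code of pw.assoc.

```r
cramers_v <- function(tab) {
  # Pearson chi-square statistic without continuity correction
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)
  unname(sqrt(chi2 / (n * (min(dim(tab)) - 1))))
}

tab <- table(mtcars$cyl, mtcars$gear)   # two categorical variables, built-in data
cramers_v(tab)

perfect <- diag(c(10, 20, 30))          # perfectly associated 3 x 3 table
cramers_v(perfect)                      # equals 1
```

V ranges from 0 (no association) to 1 (perfect association), which makes it convenient for ranking candidate matching variables.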


StatMatch: function Fbwidths.by.x

Fbwidths.by.x(tab.x, tab.xy, tab.xz)

Main features:
- Computes uncertainty measures for each possible combination of the X variables
- Two measures of uncertainty are available


StatMatch: function harmonize.x

harmonize.x(svy.A, svy.B, form.x, x.tot=NULL, cal.method="linear", ...)

Main features:
- Harmonizes the marginal/joint distribution of the X variables between two surveys
- Different possible “configurations” of the X variables can be considered (joint distribution of two variables or marginal distributions of both the variables, etc.)
- Weights modified via calibration (linear, raking, etc.) or poststratification
- Possibility to specify the “true” distribution of the X variables that has to be met; alternatively, the target distribution is estimated by “pooling” the starting ones
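The simplest way to modify the weights is poststratification: within each category of X the weights are rescaled so that they add up to a target total. The sketch below uses invented data and base R only; it is not the code of harmonize.x, which relies on calibration procedures for survey data.

```r
svyA <- data.frame(sex = c("F", "F", "M", "M", "M"),
                   w   = c(100, 150, 120, 80, 50),   # starting survey weights
                   stringsAsFactors = FALSE)
target <- c(F = 300, M = 200)                        # totals to be met

est <- sapply(split(svyA$w, svyA$sex), sum)   # totals estimated with starting weights
adj <- target[names(est)] / est               # one adjustment factor per category
svyA$w_new <- svyA$w * unname(adj[svyA$sex])  # poststratified weights

new_tot <- sapply(split(svyA$w_new, svyA$sex), sum)
new_tot                                       # now reproduces the target totals
```

Applying the same targets to both surveys makes their estimated X distributions coincide, which is the purpose of the harmonization step.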


StatMatch: function comb.samples

comb.samples(svy.A, svy.B, svy.C=NULL, y.lab, z.lab, form.x, estimation=NULL, micro=FALSE, ...)

Main features:
- Estimates the joint distribution of Y (y.lab) and Z (z.lab) under the CI assumption
- When data from an auxiliary survey (svy.C) are available, it is possible to estimate the contingency table by incomplete or synthetic two-way stratification
- Possibility to obtain predicted cell probabilities at unit level (micro=TRUE) using linear probability models
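Under the CI assumption the joint distribution of Y and Z follows from the identity P(y, z) = sum over x of P(x) P(y | x) P(z | x). The sketch below applies it to invented probabilities in base R; it ignores survey weights and is not the estimator implemented in comb.samples.

```r
# Illustrative inputs: P(X), P(Y | X) from survey A, P(Z | X) from survey B
p_x   <- c(x1 = 0.4, x2 = 0.6)
p_y_x <- rbind(x1 = c(y1 = 0.7, y2 = 0.3),
               x2 = c(y1 = 0.2, y2 = 0.8))   # each row sums to 1
p_z_x <- rbind(x1 = c(z1 = 0.5, z2 = 0.5),
               x2 = c(z1 = 0.1, z2 = 0.9))   # each row sums to 1

# P(y, z) = sum over x of P(x) * P(y | x) * P(z | x)
p_yz <- t(p_y_x * p_x) %*% p_z_x
p_yz
```

The result is a Y by Z table of probabilities that sums to one and whose margins agree with the distributions of Y and Z implied by the inputs.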