For the e-Stat meeting of 6-7 April 2011

Paul Lambert / DAMES Node inputs

1) Updates on DAMES

2) Bringing DAMES inputs to e-Stat

3) Misc. feedback - Stat-JR

4)Outputs / applications

1) Updates on DAMES

• DAMES Node extended period ends 31st July 2011
• Some ongoing funding in E-Stat & NeISS projects until 2012
• Dissemination workshop in Oxford in June 2011
• Most funded posts have ended (1 programmer still funded)

• Our main contributions have been:
– ‘GESDE’ services for specialist data resources and the data services supporting them (recent paper)
– Training events / online materials
– Social care and e-Health application projects

www.dames.org.uk

http://www.cros-portal.eu/sites/default/files/PS1%20Poster%202.pdf

http://www.dames.org.uk/

GESDE: online services for data coordination/organisation

Tools for handling variables in social science data

Recoding measures; standardisation / harmonisation; Linking; Curating

17/MAR/2010 DIR workshop: Handling Social Science Data

The data curation tool

The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way

It includes a file storage system allowing users to upload files and access their own and others’ files

2) Bringing DAMES inputs to e-Stat

a) Possible mechanisms for data linking & connecting StatJR with GESDE resources (and/or other files)
• Filestore system, or manual inputs

b) Data-management templates / pre-analysis functionality

c) Workflows/e-book inputs
• ‘documentation for replication’

Supporting data linkage
• The current framework needs manual linkage (e.g. 2 files on pc)
• Templates could be written to link with fixed file(s); or named files + fixed qualities (e.g.: matching vars with gb91soc90.dta)

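** Sample Stata code:
* Name of the occupation variable in the data in memory, and prefix for the new score
global soc "occ"
global new "occ"
sav temp.dta, replace
* Build a lookup file from the online CAMSIS index file
use http://www.camsis.stir.ac.uk/downloads/data/gb91soc90.dta, clear
keep if ukempst==0
keep soc90 mcamsis
rename soc90 $soc
rename mcamsis ${new}_mcamsis
sort $soc
sav temp2.dta, replace
* Return to the original data and match on the occupation variable
use temp.dta
sort $soc
merge $soc using temp2.dta
* Keep all original records, whether matched or not
keep if _merge==1 | _merge==3
drop _merge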
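The same linkage logic could in principle sit inside a template. A minimal Python sketch follows, under stated assumptions: the function name `merge_lookup`, the record layout and the score values are all hypothetical (the scores are placeholders, not real CAMSIS values), and nothing here is an existing Stat-JR facility.

```python
def merge_lookup(master, lookup, key, new_name):
    """Left join: keep every master record and attach the lookup value
    for record[key] where one exists (None when there is no match)."""
    merged = []
    for record in master:
        out = dict(record)                         # copy, leave the input untouched
        out[new_name] = lookup.get(record[key])    # None mirrors _merge==1
        merged.append(out)
    return merged


# Hypothetical master data: 'occ' stands in for a SOC90 occupation code.
master = [{"id": 1, "occ": 100}, {"id": 2, "occ": 999}]
# Hypothetical lookup: occupation code -> placeholder score.
lookup = {100: 61.2, 101: 58.9}

result = merge_lookup(master, lookup, key="occ", new_name="occ_mcamsis")
```

Unmatched lookup entries (code 101 here) are dropped and unmatched master records are kept, which corresponds to keeping `_merge==1 | _merge==3` in the Stata version.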

Stat-JR within the DAMES filestore?
• Could we install Stat-JR on the Stirling unix system and allow it to be invoked, via our portal, on datasets/templates within the portal?
– would allow users to add own data & link with our data
– (would need programmers; could give team here access to portal)

(Uploading through file browser; could potentially also use curation tool)
(Actually, Templates, too, could be placed online/shared in this manner)

b) Data-oriented templates
– Deterministic file matching routines
– E.g. BHPS file matching routine for compiling data across multiple files (cf. PanelWhiz)
– Recodes (manual input or external file input)
– Aggregating/standardising variables
– Templates for weighted models in relevant packages
– {Perhaps: responding to leverage/diagnostics}

• I could do many of these via templates which compile and run Stata/R command files

• Is there any value in that (cf. just doing them in Stata/R!)
• Is there value in writing code for the e-Stat engine itself?

• E.g.: BHPS panel merge macro (similar to ‘PanelWhiz’)
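As an illustration of what a deterministic file matching routine might do, here is a minimal Python sketch that stacks wave files into one long-format file. It assumes a BHPS-style convention in which every wave-specific variable carries a one-letter wave prefix (a, b, c, ...) and a person identifier `pid` is constant across waves; the function name `stack_waves` and the example records are hypothetical.

```python
def stack_waves(wave_files, id_var="pid"):
    """Stack per-wave records into long format.

    wave_files maps a wave letter ('a', 'b', ...) to a list of records.
    The wave prefix is stripped from every variable except id_var, and a
    'wave' number (a=1, b=2, ...) is added to each record.
    """
    long_rows = []
    for letter in sorted(wave_files):
        wave_number = ord(letter) - ord("a") + 1
        for record in wave_files[letter]:
            out = {"wave": wave_number}
            for name, value in record.items():
                if name != id_var and name.startswith(letter):
                    name = name[len(letter):]   # e.g. 'ajbsoc' -> 'jbsoc'
                out[name] = value
            long_rows.append(out)
    return long_rows


# Hypothetical two-wave example with one person and one variable.
waves = {
    "a": [{"pid": 1, "ajbsoc": 100}],
    "b": [{"pid": 1, "bjbsoc": 101}],
}
panel = stack_waves(waves)
```

The sketch only covers the stacking step; selecting variables and handling records missing from some waves would need further options.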

e.g. Recode examples (shown before)

Stata syntax:
recode var1 1/5=1 6/10=2 *=3, generate(var2)

SPSS syntax:
recode var1 (1 thru 5=1) (6 thru 10=2) (else=3) into var2.

Data matrix format: -> Manual entry available in StatJR, but doesn’t seem to preserve metadata?
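The recode rules can also be held as data, one row per rule (low, high, new value), which is roughly what a data matrix input would supply. A minimal Python sketch, where the function name `recode` and the test values are hypothetical:

```python
def recode(value, rules, default):
    """Return the new code for value: the first rule whose closed range
    [low, high] contains it, otherwise the default (the 'else' rule)."""
    for low, high, new in rules:
        if low <= value <= high:
            return new
    return default


# Same rules as the Stata/SPSS examples: 1-5 -> 1, 6-10 -> 2, else -> 3.
rules = [(1, 5, 1), (6, 10, 2)]
var1 = [1, 5, 6, 10, 11, 0]
var2 = [recode(v, rules, default=3) for v in var1]
```

Because the rules are plain data, they could be read from an external file as easily as typed in by hand.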

c) Workflows / e-books

Two main objectives:
• ‘Documentation for replication’

..I think syntax for Stat-JR would help here..

• Sensitivity analysis across multiple measures / models / data permutations

Data storage/access
Linking different variables
Compiling results across many models

Idea of auto-compiled user notes?

• Full account of models constructed (‘What was that?’)
– Of benefit to novice and advanced practitioners
– Potentially a part of the e-notebook, but could be a linked online guide (static)
– E-Stat commands to provide documentation for replication
– Terminologies used for the model/other user notes
– Software equivalents or near equivalents (including estimator specs)
– Algebraic expression and model abstract

• ?possible tools for storing/compiling multiple model results – (mentioned previously, cf. ‘est table’ in Stata)
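One possible shape for such a tool is sketched below in Python: it lines up stored estimates from several models side by side, loosely in the spirit of ‘est table’. The function name `compile_results`, the storage layout and the numbers are hypothetical placeholders, not output from any real model.

```python
def compile_results(stored, width=10):
    """Build a plain-text table with one row per term and one column per model.

    stored maps a model name to a dict of term -> estimate; terms missing
    from a model are left blank in that model's column.
    """
    models = list(stored)
    terms = []
    for estimates in stored.values():
        for term in estimates:
            if term not in terms:          # keep first-seen order of terms
                terms.append(term)
    lines = ["term".ljust(width) + "".join(m.rjust(width) for m in models)]
    for term in terms:
        cells = []
        for m in models:
            est = stored[m].get(term)
            text = "" if est is None else "%.3f" % est
            cells.append(text.rjust(width))
        lines.append(term.ljust(width) + "".join(cells))
    return "\n".join(lines)


# Hypothetical stored estimates from two models.
stored = {
    "m1": {"cons": 1.5, "y8": 0.25},
    "m2": {"cons": 1.2, "y8": 0.2, "sex": -0.1},
}
table = compile_results(stored)
```

Printing `table` gives one line per term, which could be appended to the auto-compiled user notes.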

Any missing components of ‘model description’ user notes? (slight modification from Sept 2010)

1) E-Stat model syntax:
model{
  for (i in 1:length(y36)) {
    y36[i] ~ dnorm(mu[i], tau)
    mu[i] <- cons[i] * beta0 + y8[i] * beta1
  }
  ….

2) E-Stat model:

Template1Lev = Linear regression using MCMC

3) Model abstract/background information: E.g. something like: “This model is suitable for a single outcome measure with a continuous distribution. It is comparable to the widely used OLS regression model, and usually leads to identical results.. [etc]. See … for further description.”

4) Algebraic representation:

[Image from Latex code]
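For illustration, the algebraic form implied by the model syntax in item 1 could be typeset roughly as follows. This is a sketch assuming the BUGS convention that the second argument of dnorm is a precision (so the variance is 1/tau), with n standing for length(y36); it is not the original slide image.

```latex
\begin{align}
  \mathit{y36}_i &\sim \mathrm{N}\!\left(\mu_i,\ \tau^{-1}\right), \\
  \mu_i &= \beta_0\,\mathit{cons}_i + \beta_1\,\mathit{y8}_i,
  \qquad i = 1, \dots, n
\end{align}
```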

5) Specification of the model in other popular packages:
BUGS syntax: [input here]
MLwiN syntax: [input here]
R: [input here]
Stata: MCMC estimation routines not available

6) Data copy

[Data after model, e.g. including new variables]

7) Outputs from model

Log file; images

8) Variables summary

[Summary stats]

• Est store demo here

3) Some feedback on Stat-JR

• My own current thoughts {see sep. review notes file}
– Look and feel
– a syntactical record of the model specification..?
– ‘back’ and ‘forward’ options; add # categories to summary; Pre-specified default settings (e.g. burn-in, cons, etc)
– Make links to users’ datasets easy
– data entry template(?)
– Export data as part of output in popular formats
– handling large numbers of data files & folders
– any way to tie in metadata about the records, e.g. variable labels?

Dataset metadata in StatJR?

• Comparable options for variable labels, value labels, missing data are widely used/desirable

• Effort to bring these in could help
• Also relates to having data open in other package at same time

• Could a ‘functional form tool’ be incorporated?
– For every dataset, it would associate each variable with a basic functional form, i.e. metric, nominal or ordinal, that the user can set/change
– Impacts on data options: e.g. separate summary window to summarise categorical variables such as frequency table/bar chart; options to derive dummy variables and recode values for categorical variables (some of this is similar to what’s available on ‘NESSTAR’)
– Use this data in some models options (or pref. just let the user decide..)?
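To make the idea concrete, a minimal Python sketch of how a declared functional form could drive the summary and dummy-variable options. The names `FORMS`, `summarise` and `dummies` and the example values are hypothetical; this is not an existing Stat-JR feature.

```python
from collections import Counter
from statistics import mean

FORMS = ("metric", "ordinal", "nominal")


def summarise(values, form):
    """Summary chosen by functional form: descriptive statistics for
    metric variables, a frequency table for ordinal/nominal ones."""
    if form not in FORMS:
        raise ValueError("unknown functional form: %r" % (form,))
    if form == "metric":
        return {"n": len(values), "mean": mean(values),
                "min": min(values), "max": max(values)}
    return dict(Counter(values))


def dummies(values):
    """Derive one 0/1 indicator list per category of a categorical variable."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Hypothetical variables with user-declared forms.
age_summary = summarise([20, 30, 40], "metric")
sex_summary = summarise(["f", "m", "f"], "nominal")
sex_dummies = dummies(["f", "m", "f"])
```

The user-set form is passed in explicitly, in keeping with the preference for letting the user decide rather than having the software guess.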

Social science users

• I’ve shown Alpha version to a couple of colleagues
• {Comments notes doc from Chris Playford}
• Impressed by the range of options and potential for software comparisons
• Frightened by the specification options/terms; statistical outputs; point and click format; and the current installation requirements
• The most common critical comment has been ‘why?’
– as in ‘[Stata] already does everything I need’ and/or ‘I bet this doesn’t work with large and complex data’
• Think about niche: sophisticated users can already use software, whilst basic users don’t want advanced options?
• I suspect that training / pedagogical value is relevant here

4) Outputs / Applications
• Applications I’d most like to test...:
– Evaluating different socio-economic measures for model performance (cf. GESDE services)
– ‘Large scale’ data compilation/analysis
• To highlight some output opportunities:
– LWS/E-Stat/DAMES (NCRM/DRS) collaborative research seminar + book proposal, Sept/Oct 2011 on ‘Modelling key variables in social science research’
– Social stratification research conference, Sept 2011
– Training support - an installation package plus good illustrative template for use at workshops, e.g. Essex Summer School course, July 2011?
