For the e-Stat meeting of 6-7 April 2011
Paul Lambert / DAMES Node inputs
1)Updates on DAMES
2)Bringing DAMES inputs to e-Stat
3)Misc. feedback - Stat-JR
4)Outputs / applications
1) Updates on DAMES
• DAMES Node extended period ends 31st July 2011
• Some ongoing funding in E-Stat & NeISS projects until 2012
• Dissemination workshop in Oxford in June 2011
• Most funded posts have ended (1 programmer still funded)
• Our main contributions have been:
  • ‘GESDE’ services for specialist data resources and the data services supporting them (recent paper)
  • Training events / online materials
  • Social care and e-Health application projects
www.dames.org.uk
GESDE: online services for data coordination/organisation
Tools for handling variables in social science data
Recoding measures; standardisation / harmonisation; Linking; Curating
17/MAR/2010 DIR workshop: Handling Social Science Data
The data curation tool
The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way
It includes a file storage system allowing users to upload files and access their own and others’ files
2) Bringing DAMES inputs to e-Stat
a) Possible mechanisms for data linking & connecting StatJR with GESDE resources (and/or other files)
• Filestore system, or manual inputs
b) Data-management templates / pre-analysis functionality
c) Workflows/e-book inputs • ‘documentation for replication’
Supporting data linkage
• The current framework needs manual linkage (e.g. 2 files on pc)
• Templates could be written to link with fixed file(s); or named files + fixed qualities (e.g.: matching vars with gb91soc90.dta)
** Sample Stata code:
global soc "occ"
global new "occ"
sav temp.dta, replace
use http://www.camsis.stir.ac.uk/downloads/data/gb91soc90.dta, clear
keep if ukempst==0
keep soc90 mcamsis
rename soc90 $soc
rename mcamsis ${new}_mcamsis
sort $soc
sav temp2.dta, replace
use temp.dta
sort $soc
merge $soc using temp2.dta
keep if _merge==1 | _merge==3
drop _merge
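The linkage logic in the Stata sample can be sketched outside Stata as well. The following is a minimal Python illustration, not the DAMES or Stat-JR implementation: the two lists stand in for temp.dta and gb91soc90.dta, and every value in them is invented for the example. Only the variable names (soc90, mcamsis, ukempst, occ) come from the slide.

```python
# Hypothetical in-memory stand-ins for temp.dta (survey) and gb91soc90.dta (lookup).
# All values below are made up for illustration.
survey = [
    {"id": 1, "occ": 100},
    {"id": 2, "occ": 200},
    {"id": 3, "occ": 999},  # no matching code in the lookup
]
lookup = [
    {"soc90": 100, "mcamsis": 61.5, "ukempst": 0},
    {"soc90": 200, "mcamsis": 48.2, "ukempst": 0},
    {"soc90": 200, "mcamsis": 50.0, "ukempst": 1},  # removed by the ukempst filter
    {"soc90": 300, "mcamsis": 39.9, "ukempst": 0},  # lookup-only code, not kept
]


def link_scores(records, table, key="occ", new="occ"):
    """Attach a score to each record, keeping all records (like _merge==1 | _merge==3)."""
    # Equivalent of: keep if ukempst==0; keep soc90 mcamsis
    scores = {row["soc90"]: row["mcamsis"] for row in table if row["ukempst"] == 0}
    linked = []
    for rec in records:
        out = dict(rec)
        out[f"{new}_mcamsis"] = scores.get(rec[key])  # None when there is no match
        linked.append(out)
    return linked


linked = link_scores(survey, lookup)
```

Unmatched survey records are retained with a missing score, and lookup-only codes are discarded, which mirrors the final `keep` in the Stata version.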
Stat-JR within the DAMES filestore?
• Could we install Stat-JR on the Stirling unix system and allow it to be invoked, via our portal, on datasets/templates within the portal?
– would allow users to add own data & link with our data
– (would need programmers; could give team here access to portal)
(Uploading through file browser; could potentially also use curation tool)
(Actually, Templates, too, could be placed online/shared in this manner)
b) Data-oriented templates
– Deterministic file matching routines
– E.g. BHPS file matching routine for compiling data across multiple files (cf. PanelWhiz)
– Recodes (manual input or external file input)
– Aggregating/standardising variables
– Templates for weighted models in relevant packages
– {Perhaps: responding to leverage/diagnostics}
• I could do many of these via templates which compile and run Stata/R command files
• Is there any value in that (cf. just doing them in Stata/R!)?
• Is there value in writing code for the e-Stat engine itself?
• E.g.: BHPS panel merge macro (similar to ‘PanelWhiz’)
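One way to picture a template that "compiles and runs" a Stata command file is as a function that fills parameters into do-file text. The sketch below is an assumption about how such a template might look, not Stat-JR's template API: `build_do_file` and its parameters are hypothetical names, and the Stata commands are copied from the data-linkage sample earlier in these slides. Actually invoking Stata on the result is left out.

```python
def build_do_file(master, lookup_url, key, new_prefix):
    """Return Stata do-file text that links a lookup score onto `master`.

    Hypothetical helper: the command sequence follows the linkage sample
    in the slides; nothing here calls Stata.
    """
    lines = [
        f'global soc "{key}"',
        f'global new "{new_prefix}"',
        f"use {lookup_url}, clear",
        "keep if ukempst==0",
        "keep soc90 mcamsis",
        "rename soc90 $soc",
        "rename mcamsis ${new}_mcamsis",
        "sort $soc",
        "save temp2.dta, replace",
        f"use {master}, clear",
        "sort $soc",
        "merge $soc using temp2.dta",
        "keep if _merge==1 | _merge==3",
        "drop _merge",
    ]
    return "\n".join(lines) + "\n"


do_text = build_do_file(
    master="temp.dta",
    lookup_url="http://www.camsis.stir.ac.uk/downloads/data/gb91soc90.dta",
    key="occ",
    new_prefix="occ",
)
```

The generated text could then be written to a .do file and passed to Stata in batch mode; whether that is preferable to writing the do-file by hand is exactly the question raised on the slide.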
e.g. Recode examples (shown before)
Stata syntax:
recode var1 1/5=1 6/10=2 *=3, generate(var2)
SPSS syntax:
recode var1 (1 thru 5=1) (6 thru 10=2) (else=3) into var2.
Data matrix format: -> Manual entry available in StatJR, but doesn’t seem to preserve metadata?
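For comparison with the Stata and SPSS recodes on this slide, the same mapping can be written as a short Python sketch. This is illustrative only; the input values are invented, and treating a missing value (`None`) as part of the catch-all category is an assumption made to mirror the `*` / `else` rule.

```python
def recode(value):
    """Map 1-5 to 1, 6-10 to 2, and everything else (including None) to 3."""
    if value is not None and 1 <= value <= 5:
        return 1
    if value is not None and 6 <= value <= 10:
        return 2
    return 3


var1 = [1, 5, 6, 10, 11, 0, None]  # made-up example values
var2 = [recode(v) for v in var1]
```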
c) Workflows / e-books
Two main objectives: • ‘Documentation for replication’
..I think syntax for Stat-JR would help here..
• Sensitivity analysis across multiple measures / models / data permutations
– Data storage/access
– Linking different variables
– Compiling results across many models
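A sensitivity analysis of this kind amounts to enumerating every combination of measure, model and data permutation and recording one specification per combination. The sketch below shows only that enumeration step; the measure, model and sample names are hypothetical placeholders, and no model is actually fitted.

```python
from itertools import product

# Hypothetical option lists; substitute the measures/models under study.
measures = ["camsis", "measure_b", "measure_c"]
models = ["linear", "logit"]
samples = ["all", "employed_only"]

# One specification per combination, in a form that could be logged
# as 'documentation for replication'.
specs = [
    {"measure": m, "model": k, "sample": s}
    for m, k, s in product(measures, models, samples)
]
```

Each entry in `specs` could then drive one model run, with the results stored under the same keys.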
Idea of auto-compiled user notes?
• Full account of models constructed (‘What was that?’)
– Of benefit to novice and advanced practitioners
– Potentially a part of the e-notebook, but could be a linked online guide (static)
– E-Stat commands to provide documentation for replication
– Terminologies used for the model/other user notes
– Software equivalents or near equivalents (including estimator specs)
– Algebraic expression and model abstract
• Possible tools for storing/compiling multiple model results?
– (mentioned previously, cf. ‘est table’ in Stata)
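As a rough idea of what a results-compiling tool might do, the sketch below arranges coefficients from several stored models into a single term-by-model table, loosely in the spirit of Stata's ‘est table’. It is not an existing Stat-JR feature; `compile_results` is a hypothetical helper and the coefficient values are invented.

```python
def compile_results(results):
    """Arrange {model: {term: coefficient}} into a header plus term-by-model rows."""
    terms = []
    for coefs in results.values():
        for term in coefs:
            if term not in terms:  # keep first-seen order of terms
                terms.append(term)
    header = ["term"] + list(results)
    rows = [[t] + [results[m].get(t) for m in results] for t in terms]
    return header, rows


# Made-up coefficients for two hypothetical models.
results = {
    "m1": {"cons": 1.20, "y8": 0.45},
    "m2": {"cons": 0.90, "y8": 0.40, "female": -0.10},
}
header, rows = compile_results(results)
```

Terms that a model does not include appear as `None` in that model's column, so models with different specifications can sit side by side.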
Any missing components of ‘model description’ user notes? (slight modification from Sept 2010)
1) E-Stat model syntax:
model{
  for (i in 1:length(y36)) {
    y36[i] ~ dnorm(mu[i], tau)
    mu[i] <- cons[i] * beta0 + y8[i] * beta1
  }
  ….
2) E-Stat model:
Template1Lev = Linear regression using MCMC
3) Model abstract/background information: E.g. something like: “This model is suitable for a single outcome measure with a continuous distribution. It is comparable to the widely used OLS regression model, and usually leads to identical results.. [etc]. See … for further description.”
4) Algebraic representation:
[Image from Latex code]
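For the example model in item 1, the algebraic representation could read as follows. This is a sketch derived from the model syntax alone, assuming the BUGS-style convention that the second argument of dnorm is a precision (so the variance is 1/tau); it needs the amsmath package for the align environment.

```latex
\begin{align}
  \mathrm{y36}_i &\sim \mathrm{N}\!\left(\mu_i,\ \tau^{-1}\right), \\
  \mu_i &= \beta_0\,\mathrm{cons}_i + \beta_1\,\mathrm{y8}_i .
\end{align}
```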
5) Specification of the model in other popular packages:
BUGS syntax: [input here]
MLwiN syntax: [input here]
R: [input here]
Stata: MCMC estimation routines not available
6) Data copy
[Data after model, e.g. including new variables]
7) Outputs from model
Log file; images
8) Variables summary
[Summary stats]
• Est store demo here
3) Some feedback on Stat-JR
• My own current thoughts {see sep. review notes file}
– Look and feel
– a syntactical record of the model specification..?
– ‘back’ and ‘forward’ options; add # categories to summary; Pre-specified default settings (e.g. burn-in, cons, etc)
– Make links to users’ datasets easy
– data entry template(?)
– Export data as part of output in popular formats
– handling large numbers of data files & folders
– any way to tie in metadata about the records, e.g. variable labels?
Dataset metadata in StatJR?
• Comparable options for variable labels, value labels, missing data are widely used/desirable
• Effort to bring these in could help
• Also relates to having data open in other package at same time
• Could a ‘functional form tool’ be incorporated?
• For every dataset, associate each variable with a basic functional form, i.e. metric, nominal or ordinal, that the user can set/change
• Impacts on data options: e.g. separate summary window to summarise categorical variables such as frequency table/bar chart; options to derive dummy variables and recode values for categorical variables (some of this is similar to what’s available on ‘NESSTAR’)
• Use this information in some model options (or pref. just let the user decide..)?
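To make the metadata and ‘functional form’ ideas concrete, the sketch below attaches a variable label, value labels, missing-value codes and a functional form to each variable, and lets the functional form decide which summary is produced. It is a hypothetical illustration, not a description of how Stat-JR stores data; the `Variable` class, the `summarise` function and the example variables are all invented.

```python
from collections import Counter
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Variable:
    """Hypothetical per-variable metadata record."""
    name: str
    label: str
    form: str = "metric"  # "metric", "nominal" or "ordinal"; user can change it
    value_labels: dict = field(default_factory=dict)
    missing: set = field(default_factory=set)


def summarise(var, values):
    """Mean for metric variables, labelled frequency table otherwise."""
    valid = [v for v in values if v not in var.missing]
    if var.form == "metric":
        return {"n": len(valid), "mean": mean(valid)}
    counts = Counter(valid)
    return {var.value_labels.get(k, k): n for k, n in sorted(counts.items())}


# Made-up example variables and data.
sex = Variable("sex", "Sex of respondent", "nominal", {1: "Male", 2: "Female"}, {-9})
age = Variable("age", "Age in years")

sex_summary = summarise(sex, [1, 2, 2, -9])
age_summary = summarise(age, [20, 30, 40])
```

Changing `form` on a variable switches its summary without touching the data, which is the behaviour the ‘functional form tool’ bullet describes.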
Social science users
• I’ve shown Alpha version to a couple of colleagues
• {Comments notes doc from Chris Playford}
• Impressed by the range of options and potential for software comparisons
• Frightened by the specification options/terms; statistical outputs; point and click format; and the current installation requirements
• The most common critical comment has been ‘why?’
– as in ‘[Stata] already does everything I need’ and/or ‘I bet this doesn’t work with large and complex data’!!
Think about niche – sophisticated users can already use software, whilst basic users don’t want advanced options?
I suspect that training / pedagogical value is relevant here
4) Outputs / Applications
• Applications I’d most like to test...:
– Evaluating different socio-economic measures for model performance (cf. GESDE services)
– ‘Large scale’ data compilation/analysis
• To highlight some output opportunities:
– LWS/E-Stat/DAMES (NCRM/DRS) collaborative research seminar + book proposal, Sept/Oct 2011 on ‘Modelling key variables in social science research’
– Social stratification research conference, Sept 2011
– Training support - an installation package plus good illustrative template for use at workshops, e.g. Essex Summer School course, July 2011?