The importance of data management. Paul Lambert, 31st January 2012. Talk to the seminar ‘Data management in the social sciences and the contribution of the DAMES Node’, a session organised as part of the Data Management through e-Social Science ESRC research Node, www.dames.org.uk.
Paul Lambert, 31st January 2012
Talk to the seminar ‘Data management in the social sciences and the contribution of the DAMES Node’, a session organised as part of the Data Management through e-Social Science ESRC research Node www.dames.org.uk
The importance of data management
DAMES, 31/JAN/2012, T1
Today’s session (2V1/2V3)
‘Data Management through e-Social Science’
DAMES – www.dames.org.uk
ESRC funded research Node
Funded 2008-11, with ongoing work into 2012 with the NeISS (www.neiss.org.uk) and ‘eStat’ (www.bristol.ac.uk/cmm/research/estat/) projects
Aim: Useful social science provisions
Specialist data topics – occupations; education qualifications; ethnicity; social care; health
Computer science research on secure data models; metadata and linking data; workflows
Programme of case studies and provisions
‘Data management’ means…
‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ […DAMES Node..]
Usually performed by social scientists themselves
Most overt in quantitative survey data analysis
• ‘variable constructions’, ‘data manipulations’
• navigating abundance of data – thousands of variables
Usually a substantial component of the work process
Here we differentiate from archiving / controlling data itself
Some components…
Manipulating data
• Recoding categories / ‘operationalising’ variables
Linking data
• Linking related data (e.g. longitudinal studies)
• Combining / enhancing data (e.g. linking micro- and macro-data)
Secure access to data
• Linking data with different levels of access permission
• Detailed access to micro-data cf. access restrictions
Harmonisation standards
• Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
• Recommendations on particular ‘variable constructions’
Cleaning data
• ‘missing values’; implausible responses; extreme values
Example – recoding data
Count: highest educational qualification (rows) by educ4 (columns)
educ4 categories: -9.00; 1.00 Degree; 2.00 Diploma; 3.00 Higher school or vocational; 4.00 School level or below

Highest educational qualification     -9.00   1.00   2.00   3.00   4.00   Total
-9 Missing or wild                      323      0      0      0      0     323
-7 Proxy respondent                     982      0      0      0      0     982
1 Higher Degree                           0    425      0      0      0     425
2 First Degree                            0   1597      0      0      0    1597
3 Teaching QF                             0      0    340      0      0     340
4 Other Higher QF                         0      0   3434      0      0    3434
5 Nursing QF                              0      0    161      0      0     161
6 GCE A Levels                            0      0      0   1811      0    1811
7 GCE O Levels or Equiv                   0      0      0      0   2518    2518
8 Commercial QF, No OLevels               0      0      0    331      0     331
9 CSE Grade 2-5, ScotGrade 4-5            0      0      0      0    421     421
10 Apprenticeship                         0      0      0    257      0     257
11 Other QF                             102      0      0      0      0     102
12 No QF                                  0      0      0      0   2787    2787
13 Still At School No QF                138      0      0      0      0     138
Total                                  1545   2022   3935   2399   5726   15627
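The recode shown in the cross-tabulation above can be written down as an explicit, documented mapping. What follows is a minimal sketch in Python rather than the SPSS/Stata syntax the talk refers to. The mapping and the counts are read off the table, while the names `EDUC4_RECODE`, `recode_educ4` and `SOURCE_COUNTS` are hypothetical and introduced only for illustration.

```python
from collections import Counter

# Source code -> educ4 code, read off the cross-tabulation on the slide.
EDUC4_RECODE = {
    -9: -9, -7: -9, 11: -9, 13: -9,  # missing, proxy, other QF, still at school
    1: 1, 2: 1,                      # 1 = Degree
    3: 2, 4: 2, 5: 2,                # 2 = Diploma
    6: 3, 8: 3, 10: 3,               # 3 = Higher school or vocational
    7: 4, 9: 4, 12: 4,               # 4 = School level or below
}


def recode_educ4(code):
    """Return the educ4 category for a source qualification code.

    Unknown codes raise an error so that they are never silently dropped.
    """
    if code not in EDUC4_RECODE:
        raise ValueError(f"No educ4 category defined for code {code!r}")
    return EDUC4_RECODE[code]


# Row totals from the slide, keyed by source qualification code.
SOURCE_COUNTS = {
    -9: 323, -7: 982, 1: 425, 2: 1597, 3: 340, 4: 3434, 5: 161, 6: 1811,
    7: 2518, 8: 331, 9: 421, 10: 257, 11: 102, 12: 2787, 13: 138,
}

# Aggregate the source counts into the four-category measure.
educ4_counts = Counter()
for source_code, n in SOURCE_COUNTS.items():
    educ4_counts[recode_educ4(source_code)] += n
```

Keeping the mapping in one named structure means the same recode can be cited, re-run and compared against alternative groupings, and the aggregated counts can be checked against the column totals of the table.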
Example - Linking data (on related adults in the BHPS)
Outcomes (four columns each): Used health services in last year (Y=43%) | GHQ score
Cluster definition: indv cp hh xhid | indv cp hh xhid
Female 0.63 0.77 0.69 0.65 1.36 1.36 1.36 1.53
Age 0.02 0.03 0.02 0.02 0.13 0.13 0.14 0.14
Age-squared(*100) -0.12 -0.13 -0.13 -0.13
Cohabiting -0.58 -0.58 -0.54 -0.59
Ln(household inc.) -0.09 -0.14 -0.12 -0.11 -0.63 -0.62 -0.63 -0.62
Constant -0.65 -0.67 -0.59 -0.55 12.9 12.8 12.6 12.6
ICC L2% (VC) 0 6.3 8.8 7.9 0 22.9 15.8 7.8
Mean cluster size 1 1.4 1.8 4.6 1 1.4 1.8 4.5
L2:sd(cons) 0.61 0.51 0.53 2.54 1.91 1.15
L2:sd(fem) 2.00 0.82 0.00 2.81 2.32 1.64
L1:sd(cons) 1.81 1.81 1.81 1.81 5.40 4.30 4.76 5.28
-Log-like (-40k) 9648 9625 9624 9632 3529 3383 3410 3512
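The ‘mean cluster size’ row in the table above depends on which identifier is used to link related adults. A minimal sketch of that calculation follows, on the assumption that ‘cp’ and ‘hh’ denote couple and household groupings. The records and the field names `pid`, `cp_id` and `hh_id` are hypothetical toy data, not BHPS variables.

```python
from collections import defaultdict

# Toy person records; field names and values are hypothetical.
PERSONS = [
    {"pid": 1, "cp_id": "c1", "hh_id": "h1"},
    {"pid": 2, "cp_id": "c1", "hh_id": "h1"},
    {"pid": 3, "cp_id": "c2", "hh_id": "h1"},  # single adult sharing household h1
    {"pid": 4, "cp_id": "c3", "hh_id": "h2"},
    {"pid": 5, "cp_id": "c4", "hh_id": "h3"},
    {"pid": 6, "cp_id": "c4", "hh_id": "h3"},
]


def mean_cluster_size(records, id_field):
    """Group records by `id_field` and return the mean number of records per group."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec[id_field]].append(rec["pid"])
    return len(records) / len(clusters)


size_by_person = mean_cluster_size(PERSONS, "pid")       # every adult is its own cluster
size_by_couple = mean_cluster_size(PERSONS, "cp_id")     # adults linked within couples
size_by_household = mean_cluster_size(PERSONS, "hh_id")  # adults linked within households
```

The same records give different cluster structures depending on the linking identifier, which is why the choice of identifier needs to be documented alongside the model results.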
‘The significance of data management for social survey research’
The data manipulations described above are a major component of the social survey research workload
Pre-release manipulations performed by distributors / archivists
• Coding measures into standard categories; dealing with missing records
Post-release manipulations performed by researchers
• Re-coding measures into simple categories
• All serious researchers perform extended post-release management (and have the scars to show for it)
We do have existing tools, facilities and expert experience to help us… but we don’t make a good job of using them efficiently or consistently
So the ‘significance’ of DM is about how much better research might be if we did things more effectively…
..being more effective probably involves..
Knowing about, using and citing previous standard measures/strategies
Effective documentation/dissemination of information on the approach used
Being proactive (not just relying on the most convenient measure to hand)
Trying a few alternatives – sensitivity analysis
‘Documentation’ (and its dissemination) is probably the key…
By documentation we mean the ‘paper trail’ (such as data & syntax files during secondary survey research)
For scientists, this is the log book / journal / laboratory notebook
For social sciences, there are few agreed standards
Image of Alexander Graham Bell’s 1876 notebook, taken from: http://sandacom.wordpress.com/2010/03/11/the-face-rings-a-bell/
Effective documentation is possible, but requires some effort (e.g. Long, 2009)
..good levels of documentation are not engrained in the social sciences!
“…Little or nothing is systematically archived from these electronic sources. How many of us routinely keep copies of our old word-processing files once they are no longer of current relevance for research or teaching activities. We have been reminded…of the insecurity and non-survival of departmental and professional files stored in broom cupboards, but how many electronic files even get into that cupboard in the first place?” (p142 of Scott, J. (2005) ‘Some principal concerns in the shaping of sociology’, in Halsey, A.H. and Runciman, W. (eds) British Sociology Seen from Without and Within. London: British Academy)
...Yet, ‘documentation for replication’ is a reasonable expectation for a scientific model of research (e.g. Steuer, Dale, Freese)…
Steuer, M. (2003). The Scientific Study of Society. Boston: Kluwer Academic.
Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods & Research, 36(2), 153-171.
A bit of focus…
Most of the DAMES applications aim to facilitate one of two data management activities, their documentation, and the dissemination of that documentation:
1) Variable constructions
o Coding and re-coding values
2) Linking datasets
o Internal and external linkages
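As an illustration of an external linkage, the sketch below attaches macro-level (regional) information to micro-level (person) records and reports which records failed to match. It is only a sketch under stated assumptions: the data, the field names (`pid`, `region`, `unemployment_rate`) and the function `link_macro` are hypothetical and do not describe a DAMES tool.

```python
# Hypothetical micro-data: one record per person.
PERSONS = [
    {"pid": 1, "region": "north", "age": 34},
    {"pid": 2, "region": "south", "age": 51},
    {"pid": 3, "region": "west", "age": 45},  # no macro record exists for this region
]

# Hypothetical macro-data: one record per region.
REGIONS = {
    "north": {"unemployment_rate": 7.5},
    "south": {"unemployment_rate": 5.0},
}


def link_macro(persons, regions, key="region"):
    """Left-join macro-level fields onto each person record.

    Returns (linked, unmatched_pids) so that failed matches are
    documented rather than silently dropped.
    """
    linked, unmatched = [], []
    for person in persons:
        macro = regions.get(person[key])
        if macro is None:
            unmatched.append(person["pid"])
            macro = {}
        linked.append({**person, **macro})
    return linked, unmatched


linked, unmatched = link_macro(PERSONS, REGIONS)
```

Returning the unmatched identifiers alongside the linked records keeps the outcome of the linkage on record, which is part of the documentation the talk argues for.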
‘Documentation for replication’ supports replication of..
Your own analysis (in response to comments, revisions, requests for access)
Others’ analysis
• To build upon – cumulative science
• To critique / cross-examine
In secondary survey research
• Complex data is often updated (new related records; revised and re-released; re-weighted or re-standardised; new levels of access/linkage)
• New analysis feasible - variable operationalisations; new statistical methods
Most documentation requirements are achieved by effective use of software (‘syntax’ programming)
See our training workshops, www.dames.org.uk/workshops
Keep clear records of your DM activities!
Reproducible (for self)
Replicable (for all)
Paper trail for whole lifecycle
In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)
Syntax Examples: www.dames.org.uk/workshops
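The same idea of a paper trail can be kept in machine-readable form alongside an annotated command file. The following is a minimal sketch in Python, not one of the SPSS/Stata workshop examples. The class `PaperTrail`, its fields and the sample file name are hypothetical and only illustrate the kind of record that might be kept.

```python
import hashlib
import json
from datetime import date


def file_fingerprint(data: bytes) -> str:
    """SHA-256 of the input data, so a later run can confirm the same file was used."""
    return hashlib.sha256(data).hexdigest()


class PaperTrail:
    """Collects an ordered, human-readable record of data management steps."""

    def __init__(self, project, analyst):
        self.record = {
            "project": project,
            "analyst": analyst,
            "started": date.today().isoformat(),
            "inputs": [],
            "steps": [],
        }

    def add_input(self, name, data):
        # Record which data went in, identified by name and content hash.
        self.record["inputs"].append({"name": name, "sha256": file_fingerprint(data)})

    def add_step(self, description, rows_in, rows_out):
        # Number the steps in the order they were carried out.
        self.record["steps"].append({
            "n": len(self.record["steps"]) + 1,
            "description": description,
            "rows_in": rows_in,
            "rows_out": rows_out,
        })

    def to_json(self):
        return json.dumps(self.record, indent=2)


# Hypothetical usage: log the inputs and each manipulation as it is performed.
trail = PaperTrail(project="example recode", analyst="A. Researcher")
trail.add_input("survey_extract.csv", b"pid,qual\n1,2\n2,12\n")
trail.add_step("Drop proxy respondents", rows_in=2, rows_out=2)
trail.add_step("Recode qual into 4 categories", rows_in=2, rows_out=2)
log_text = trail.to_json()
```

Saving `log_text` next to the command file gives a record that supports both reproducing the work for oneself and replication by others.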
We’ve written a guide for researchers...
‘Software Session 1: Documentation and workflows with popular software packages’ (www.dames.org.uk/workshops/stir10/docs_workflows_2010.html)
Dozens of sample command files in SPSS, Stata and R from DAMES Node workshops at www.dames.org.uk
For data distributors, the provision of systematic metadata is also beneficial
Example of DDI format metadata
(see also talk 5)
NESSTAR
What more is needed for good data management?
1) Good standards in the operationalisation of variables
See yesterday’s workshop sessions (www.dames.org.uk)
Most options have already been studied!
Using GEODE/GEMDE/GEEDE to facilitate sensitivity analysis and comparisons of alternative plausible measures
• Collect documentation/metadata on specialist records
• Promote more effective measurement options
e.g. effect proportional scaling; replication of measures used before; derivation of recommended standards
DAMES ‘GESDE’ tools: online services for data coordination/organisation
Tools for handling variables in social science data
Recoding measures; standardisation / harmonisation; Linking; Curating
[Figure: two panels, Britain and Sweden, each plotting ‘Increase in R-squared’ and ‘Increase in BIC’ for a range of occupation-based measures]
Source: BHPS and LNU 1991, adults aged 23-55 in work in 1991, N=4536 Britain, 2504 Sweden. Model 1: Health = quadratic age + gender + age*gender; Model 2: Health = (Model 1) + classification. Graph shows improvement in pseudo R2 for logistic regression, Model 2 vs Model 1, plus scaled BIC statistic ((Model 2 BIC - Model 1 BIC) / Model 1 BIC), cropped at 2*r2. Unweighted data.
Predictors of ‘poor health’ in Sweden (comparison of different occupation-based measures, from DAMES, TP 2011-1)
What more is needed for good data management?
2) Incentives/disincentives
Arguably, good data management is penalised at present (‘Don’t get it right, get it published’)
Few formalised requirements of documentation or data management activity (cf. metadata publishing standards such as DDI)
Citation rankings might incentivise here (citation of your do files..)
Prospects are probably rather bleak for good science..!!
Summary
The ‘significance’ of DM is about how much better research might be if we did things more effectively…
Can (try to) provide data oriented facilities supporting improved data management
May also need a cultural change in expectations…