GEODE - Glasgow DCC, Nov 2006

Data curation standards and the messy world of social science occupational information resources

Paper presented to the 2nd International Digital Curation Conference, 21-22nd November 2006, Glasgow.

Paul Lambert, Larry Tan, Ken Turner, & Vernon Gayle, University of Stirling
Richard Sinnott, University of Glasgow
Ken Prandy, Cardiff University




GEODE – www.geode.stir.ac.uk

Grid Enabled Occupational Data Environment

Operate as a ‘portal’
• User friendly access to occupational data
• High volume use

Support a community of occupational data providers
• Depository of occupational information resources
• Limited volume use

Experiment with / promote ‘e-Social Science’


(Part 1) Occupational analyses in the social sciences

(Quotes as reproduced in Coxon and Jones 1978; Crompton 1998)

“A man’s work is as good a clue as any to the course of his life and to his social being and identity” (Hughes, 1958)

“The backbone of the class structure, and indeed of the entire reward system of modern Western society, is the occupational order” (Parkin, 1972)

“Nothing stamps a man as much as his occupation. Daily work determines the mode of life.. It constrains our ideas, feelings and tastes” (Goblot, 1961)


Why is occupational research ‘messy’?

Two stage process:

1. Collect & preserve ‘source occupational data’

2. Summary / translation of source data

This model is a ‘scientific’ approach
• Published documentation (at both stages)
• Replicable
• Validation exercises

But social researchers have not been good at using it…
• (Bechhofer 1969; Marsh 1986; Rose and Pevalin 2003)


{Stage 1 - Collecting Occupational Data – Examples}

Example 1: BHPS

Occ description          Employment status    SOC-2000   EMPST
Miner (coal)             Employee             8122       7
Police officer (Serg.)   Supervisor           3312       6
Electrical engineer      Employee             2123       7
Retail dealer (cars)     Self-employed w/e    1234       2

Example 2: European Social Survey, parent’s data

Occ description             SOC-2000   EMPST
Miner                       ?8122      ?6/7
Police officer              ?3312      ?6/7
Engineer                    ??         ??
Self employed businessman   ??         ?1/2


{Stage 1 - Collecting occupational data – summary}

All methods lead eventually to coding to an occupational index scheme:

– Occupational Unit Groups
– Standardised Industrial Classifications
– Standardised employment status classifications

Occupational index schemes are the point of departure for GEODE


Stage 2: Summary / translation of source occ. data

a) Published ‘occupational information resources’ used to link source data, via an index scheme, with substantively meaningful measures

• Social class schemes

• Stratification scales

• Gender segregation statistics

• Labour process statistics

b) Coding by fiat – (Allocation by ‘expert’ social scientist)

• Lack of documentation / replicability / consistency

• Unscientific…


What’s the problem?

But…
• Low uptake of existing occupational information resources
• Strict security constraints on users’ micro-social survey data
• Problems in the formatting / distribution of occupational information resources (Part 2)

External user (micro-social data)

id   oug   sex
1    110   1
2    320   1
3    320   2
4    874   1
5    874   2

Occ information (index file) (aggregate)

oug   CS-M   CS-F   EGP
110   60     58     I
320   69     71     II
874   39     51     VIIa

User’s output (micro-social data)

id   oug   CS
1    110   60
2    320   69
3    320   71
4    874   39
5    874   51
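
The matching step pictured above joins the user's micro-data to the aggregate index file on the occupational unit group and picks the CS value by sex. It can be sketched in a few lines of Python. This is a minimal illustration, not GEODE's implementation: the function name `attach_cs` and the dictionary layout are hypothetical, and the coding sex 1 = male, 2 = female is an assumption inferred from the output column on the slide.

```python
# Index file (aggregate): occupational unit group -> output measures.
INDEX_FILE = {
    110: {"CS-M": 60, "CS-F": 58, "EGP": "I"},
    320: {"CS-M": 69, "CS-F": 71, "EGP": "II"},
    874: {"CS-M": 39, "CS-F": 51, "EGP": "VIIa"},
}

# User's micro-social data, one record per respondent.
MICRO_DATA = [
    {"id": 1, "oug": 110, "sex": 1},
    {"id": 2, "oug": 320, "sex": 1},
    {"id": 3, "oug": 320, "sex": 2},
    {"id": 4, "oug": 874, "sex": 1},
    {"id": 5, "oug": 874, "sex": 2},
]

# Assumption: sex 1 = male, 2 = female (inferred from the slide's output).
CS_COLUMN_BY_SEX = {1: "CS-M", 2: "CS-F"}


def attach_cs(micro_data, index_file):
    """Return id/oug/CS records; CS is None when no match is possible."""
    output = []
    for record in micro_data:
        measures = index_file.get(record["oug"])
        column = CS_COLUMN_BY_SEX.get(record["sex"])
        cs = measures[column] if measures and column else None
        output.append({"id": record["id"], "oug": record["oug"], "CS": cs})
    return output


for row in attach_cs(MICRO_DATA, INDEX_FILE):
    print(row["id"], row["oug"], row["CS"])
```

Unmatched unit groups yield `None` rather than an error, so gaps in an index file's coverage stay visible in the output.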


Handling Occupational Information

• Messy because:
  – Large volume of occupational information resources
  – Limited coordination between resources
  – Inconsistencies in access and exploitation processes

Occupational information resources are used to interpret occupational records


Some illustrative occupational information resources

Resource                                      Index units        # distinct files (average size kb)   Updates?
CAMSIS, www.camsis.stir.ac.uk                 Local OUG*(e.s.)   200 (100)                            y
CAMSIS value labels, www.camsis.stir.ac.uk    Local OUG          50 (50)                              n
ISEI tools, home.fsw.vu.nl/~ganzeboom         Int. OUG           20 (50)                              y
E-Sec matrices, www.iser.essex.ac.uk/esec     Int. OUG*(e.s.)    20 (200)                             n
Hakim gender seg codes (Hakim 1998)           Local OUG          2 (paper)                            n


Occupational information resources

Large volumes of occupational information resources
• Coverage across countries and time periods
• Different research fields / topics
• Dynamic: updates to occupational information resources
• Internet based distributions lead to duplication and expansion, e.g. ISEI - ISCO translation files at:
  – PISA webpages (Ganzeboom)
  – IDEAS/Repec webpages (Hendrickx)
  – CAMSIS occupational data webpage

Some maths:
• 100+ alternative index schemes (OUGs; others)
  ×
• 500+ alternative output measures (class schemes, etc)


Occupational information resources

Limited coordination
• Varying metadata practices
  – Coordinated structure, e.g. ISEI at IDEAS/Repec [rare]
  – Natural language, e.g. CAMSIS [common]
  – No documentation
• Varying data file formats
  – SPSS, Stata, Plain text
• One-way distribution
  – Internet download; text publications
• Gaps between NSIs and academic researchers
  – NSIs make regular changes to favoured resources


Occupational information resources

Limited coordination (ctd)
• Varying translation rules
  – One file for all occupations (‘universal’)
  – Multiple files for different contexts (‘specific’)
• Different occupational index requirements

ISEI             CAMSIS                    EGP               Wright
{status scale}   {stratification scale}    {class scheme}    {class scheme}
Occ title        Occ title; e.s.; gender   Occ title; e.s.   Occ conditions
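
The differing index requirements above determine which measures a given dataset can support. A small sketch, assuming "e.s." stands for employment status and using hypothetical field names (`occ_title`, `emp_status`, `gender`, `occ_conditions`), shows the check:

```python
# Index information each measure needs, as listed on the slide.
# "e.s." is read as employment status; the field names are hypothetical.
REQUIREMENTS = {
    "ISEI": {"occ_title"},
    "CAMSIS": {"occ_title", "emp_status", "gender"},
    "EGP": {"occ_title", "emp_status"},
    "Wright": {"occ_conditions"},
}


def derivable_measures(available_fields):
    """List the measures whose index requirements are all available."""
    available = set(available_fields)
    return [name for name, needed in REQUIREMENTS.items() if needed <= available]


# A survey that records only an occupation title.
print(derivable_measures({"occ_title"}))
# A survey that also records employment status and gender.
print(derivable_measures({"occ_title", "emp_status", "gender"}))
```

Under these assumptions, a source that preserves only a free-text occupation title supports ISEI but not CAMSIS or EGP.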


Occupational information resources

Inconsistencies in access / exploitation
• Occupational Unit Group schemes’ variants
  – Decennial updates / International variations
  – Localised adaptations [e.g. HESA] / Survey variations [e.g. GHS]
• Numeric or string format preservation
• Hierarchical organisations
  – E.g. ISCO-88
  – 1234 → 123 → 12 → 1
  – 110 = 0110 → 11 → 1 → 0
• Focus for application of occupational data
  – Individual level measures
  – Household / career contexts
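
The ISCO-88 example above combines two of these problems: codes are hierarchical, and numeric storage silently drops leading zeros (0110 becomes 110). A minimal sketch of a helper that restores the padding before deriving the higher-level groups follows. The function name is hypothetical, and it assumes every input is a unit-group (4-digit) code, since a genuine 3-digit minor-group code would be padded incorrectly.

```python
def isco88_levels(code):
    """Return the 4-, 3-, 2- and 1-digit groups for an ISCO-88 unit code.

    Accepts an int or a string. Numeric storage drops leading zeros,
    so the code is left-padded back to four digits first.
    """
    text = str(code).strip().zfill(4)
    if len(text) != 4 or not text.isdigit():
        raise ValueError(f"not a 4-digit ISCO-88 code: {code!r}")
    return [text[:width] for width in (4, 3, 2, 1)]


print(isco88_levels(1234))
print(isco88_levels(110))
```

Keeping the groups as strings preserves the leading zero at every level, which is the practical argument for string rather than numeric storage.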


Returning to the occupational research model

Two stage process:

1. Collection & preservation of ‘source occupational data’

2. Summary / translation of source data via occupational information resources

Critically, stage (2) places responsibility for reviewing occupational information resources with the social scientist

The volume of variants / inconsistencies isn’t huge, but is enough to impede easy application


(Part 2) Curating Occupational Data

• GEODE – Grid Enabled Occupational Data Environment

• Core provision: support the management of and access to occupational information resources

  – ‘Occupational information depository’
  – Easy access to occupational data (portal for occupational data)


Metadata - Occupational information depository

How to facilitate searching, registering, accessing index service?

Establish a ‘GEODE-M’ meta-data subset (.xml)
• Founded on Michigan Data Documentation Initiative
• Semantic curation of occupational information

<docDscr>    Release date
<stdyDscr>   Country; Time period; Author
<fileDscr>   Format
<otherMat>   Missing data; Data extensions
<dataDscr> <varGrp> <var>
             <concept> to differentiate index and output variable groups
             <stdCatgry> to reference variable definitions
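
To make the element list above concrete, here is a hedged sketch that assembles a record with Python's standard `xml.etree.ElementTree`. Only the tag names shown on the slides (`docDscr`, `stdyDscr`, `fileDscr`, `dataDscr`, `var`, `concept`, `stdCatgry`, plus `nation` and `timePrd`) come from the source. The `codeBook` root, the nesting, the `releaseDate`, `author` and `format` tags, and all values are illustrative assumptions, and the result is not validated against the DDI schema.

```python
import xml.etree.ElementTree as ET


def build_record(release_date, nation, time_period, author, file_format,
                 index_scheme, output_measure):
    """Return a simplified GEODE-M style record as an XML string."""
    root = ET.Element("codeBook")

    doc = ET.SubElement(root, "docDscr")
    ET.SubElement(doc, "releaseDate").text = release_date  # hypothetical tag

    study = ET.SubElement(root, "stdyDscr")
    ET.SubElement(study, "nation").text = nation
    ET.SubElement(study, "timePrd").text = time_period
    ET.SubElement(study, "author").text = author  # hypothetical tag

    file_dscr = ET.SubElement(root, "fileDscr")
    ET.SubElement(file_dscr, "format").text = file_format  # hypothetical tag

    data = ET.SubElement(root, "dataDscr")
    # One variable for the index scheme and one for the output measure;
    # <concept> tells the two groups apart.
    for name, role in ((index_scheme, "index"), (output_measure, "output")):
        var = ET.SubElement(data, "var", name=name)
        ET.SubElement(var, "concept").text = role
    # <stdCatgry> names the standard scheme the index variable follows.
    ET.SubElement(data.find("var"), "stdCatgry").text = index_scheme

    return ET.tostring(root, encoding="unicode")


# Placeholder values only; the date and author are invented for illustration.
print(build_record("2006-11-21", "GB", "1990-2000", "A. Depositor",
                   "plain text", "SOC90", "CAMSIS"))
```

Because the output is ordinary XML, the same record can be parsed back and queried by tag, which is what makes searching and registering resources tractable.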


Benefits of DDI-XML curation

XML suits:
• OGSA-DAI (data access & integration, www.ogsadai.org.uk)
  – Supports data indexing / preservation / management
  – Supports secure data matching programme
  – Could facilitate analytical queries
• ‘Gridsphere’ search programmes
• Data curation standards
  – DDI widely deployed in social science resources
  – XML accessibility / transferability
  – Repeatability of tags very helpful, e.g. data files; index measures; contexts; authors


Implementing ‘GEODE-M’ metadata

• Critical entries:
  – Context of data [country, time period]
  – Index scheme
  – <stdCatgry> : GEODE database of known index schemes
  – Source uri for resource

2 stage curation process (…?)
1) Web-proforma for supply of occupational data
  • Author; context, index units
  • Gridsphere ‘portlet’
2) Manual updating of xml resource by depositor / GEODE members
  • Gridsphere ‘portlet’


Example issues

• <stdCatgry> [Variant implementations <-> indexed translation files]
• <context> [cross-country resources]
• <producer role="formatting"> [caters to multiple author roles]
• <fileDscr id="dkcherisco88.sav"> [caters to multiple files]
• <abstract>

                ISEI            CAMSIS                    EGP               Wright
                Occ title       Occ title; e.s.; gender   Occ title; e.s.   Occ conditions

<stdCatgry> (from www.geode.stir.ac.uk/ougs.html#)
                ISCO88          SOC90; ukempst; gdr       SOC90; ukempst    SIC92; SUPVIS; ..

<context>: <nation abbr=".."> <timePrd></timePrd>
                10 [all]; all   GB; 1990-2000             GB; 1950-2000     10 [all]; 1985-2000
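
The context entries above are what a search over the depository would filter on. A minimal sketch of such a lookup follows. The record layout and function name are hypothetical, and two readings of the slide are assumptions: "10 [all]" is treated as "applies to any nation" and a time period of "all" as unbounded.

```python
# Context entries transcribed from the slide. A nation of None stands for the
# slide's "10 [all]" and a year of None for "all"; both readings are assumptions.
RESOURCES = [
    {"measure": "ISEI", "nation": None, "years": (None, None)},
    {"measure": "CAMSIS", "nation": "GB", "years": (1990, 2000)},
    {"measure": "EGP", "nation": "GB", "years": (1950, 2000)},
    {"measure": "Wright", "nation": None, "years": (1985, 2000)},
]


def find_resources(resources, nation, year):
    """Return the measures whose context covers the given nation and year."""
    matches = []
    for res in resources:
        first, last = res["years"]
        nation_ok = res["nation"] is None or res["nation"] == nation
        year_ok = ((first is None or first <= year)
                   and (last is None or year <= last))
        if nation_ok and year_ok:
            matches.append(res["measure"])
    return matches


print(find_resources(RESOURCES, "GB", 1960))
print(find_resources(RESOURCES, "FR", 1990))
```

Holding nation and time period as separate, typed fields is what allows this kind of filtering. A free-text description of context would not support it.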


Management of GEODE-M curation

Metadata considerations
• ‘GEODE-M’ as {flexible} recommended components of DDI
• GEODE-M templates
  – webpages at GEODE
  – Other facilities?

Data considerations:
• Stored at GEODE vs linkage to external data
• Proprietary software (plain text / SPSS / STATA)

At present:
• Stage 1 – automated curation (allows external linkage, any file format)
• Stage 2 – extended manual curation (requires GEODE server copy of data, translation to plain text rectangular format)
• Premised upon small commitment from depositors & GEODE


GEODE – user uptake

• High potential demand
  – Numerous queries on occupational data management
  – Numerous researchers wishing to distribute occupational data
• Prototype GEODE services not yet user-friendly

Carrots
– High demands for easier access and review

Sticks
– Poor standards of much previous research, which neglects good review of occupational information

Hurdles
– Change research cultures in social science disciplines(?)


Conclusions

• Occupational data curation and the Grid
  – Grid facilitates management / access via xml formats (OGSA-DAI)
  – Current models require moderate specialist input (manual curation)
• Grid offers new level of service not previously available
  – Dynamic coordinated file storage
  – File matching [security]
• Occupational data as case study for focused DDI xml curation
  – Complex but finite range of occupational information resources
  – High user demand
  – Uptake will require combination of motivation, and instigation


App 1: e-Social Science

‘The Grid’ and ‘e-Science’:

1. Online coordination of electronic resources and collaborations
   (Distributed computing) Large scale; Collaborative; Heterogeneous

2. Standard protocols / information management systems

UK eSocial Science:

1) Investment in assessing / implementing technology

2) Computationally demanding data analysis

3) Qualitative and quantitative data collection technologies

4) **Data sharing, processing and access**


App 2: GEODE architecture


App 3: {Collecting occupational data}

a) Follow a recommended process: ONS good practice
• www.statistics.gov.uk/methods_quality/ns_sec/questions.asp
• Industry description / occupation description / size of organisation / employment status / supervisory status
• Occupation descriptions -> standardised numeric index
• Text coding tools, e.g. CASCOT - www2.warwick.ac.uk/fac/soc/ier/publications/software/cascot/

b) Do your own thing: European Social Survey parental occupational questions
• Free text description of parental occupations


App 4: Summary data: what is the best class scheme?

a) Published ‘occupational information resources’ link source data, via index scheme, with substantively meaningful measures

‘Occupation-based social classifications’
– Social class schemes
  • Registrar General’s Social Class Scheme (1907-2001) [skill / prestige]
  • National Statistics Socio-Economic Classification (2002-) [employment relations]
  • Goldthorpe / CASMIN / EGP [employment relations]
  • Wright [ownership and authority]
  • W.E.S. [female occupational groupings]
– Stratification scales
  • SIOPS [prestige]
  • ISEI [socio-economic status – education and income average]
  • CAMSIS [social interaction]

{CAMSIS is the best…}