Data Citation for the Social Sciences Mary Vardigan ICPSR CODATA Conference on Data Attribution and Citation August 22-23, 2011


Data Citation for the Social Sciences

Mary Vardigan ICPSR

CODATA Conference on Data Attribution and Citation August 22-23, 2011

Today’s Presentation

• Norms in the social sciences and implications for data citation

• Summary of major citation issues for social science

Knowledge claims

• Social science advances through knowledge claims published in the literature

• Need to verify and extend claims; secondary analysis encouraged

• Follows that data need to be available for reuse and cited

Data sharing

• Strong tradition of data sharing, both formal and informal

• Active social science data archives around the world

• Some PIs distribute data on Web sites

• Pienta, Alter, and Lyle found 88.5% of data generated not publicly archived (since 1985)

Metadata

• Metadata play important role – Documentation necessary to understand the data

• Questionnaires, user guides, methodology descriptions, record layouts also provided

• Heterogeneous in format – most unstructured

• Data Documentation Initiative (DDI) seeks to provide a structured metadata standard

Granularity and versioning

• “Studies” may be single datasets or aggregations

• Also a need to cite data subsets that support the findings in publications

• Data are sometimes updated and need to be versioned

Content and formats

• Mostly quantitative data and some qualitative

• Boundaries blurring between social science and other domains

• Survey data supplemented by biomarker data

• Survey data merged with administrative records

• Trend toward complex collections

• Social media data

• Video, audio data

Confidentiality concerns

• Survey respondents promised anonymity, a critical pledge to uphold

• Legal agreements required for restricted data use

• New mechanisms to analyze restricted data online emerging – virtual enclaves and virtual datasets

• Often a public-use version and restricted versions coexist

Replication

• Most claims cannot be replicated from the information in publications

• Replication archives -- ICPSR, Dataverse, etc.

• What is required is chain of evidence and record of decisions – deep citation and provenance

• Need both production transparency (record of decisions in transforming data) and analytic transparency (how conclusions drawn)

Some tradition of citation

• Citation standard for machine-readable files created in 1979

• Citations available from data providers -- Census Bureau and ICPSR since late 1980s

• Journals just beginning to cite data

• Persistent identifiers: DOIs or handles
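
As a rough illustration of the citation elements these slides touch on (persistent identifier, version, subset), here is a minimal Python sketch. The class name, field names, output layout, and sample values (including the DOI) are all hypothetical. This is not the 1979 standard or any archive's actual citation format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    """Hypothetical record of elements a dataset citation might carry."""
    authors: str
    year: int
    title: str
    distributor: str
    identifier: str                 # persistent identifier, e.g. a DOI or handle
    version: Optional[str] = None   # set when the data have been updated
    subset: Optional[str] = None    # set when only part of a study was analyzed

    def render(self) -> str:
        """Join the available elements into one citation string."""
        parts = [f"{self.authors} ({self.year}). {self.title}"]
        if self.version:
            parts.append(f"Version {self.version}")
        if self.subset:
            parts.append(f"Subset: {self.subset}")
        parts.append(f"{self.distributor} [distributor]")
        parts.append(self.identifier)
        return ". ".join(parts) + "."


# Made-up example values, for illustration only.
citation = DataCitation(
    authors="Doe, Jane",
    year=2011,
    title="Example Household Survey",
    distributor="Example Data Archive",
    identifier="doi:10.0000/example",
    version="2",
)
print(citation.render())
```

Keeping version and subset as optional fields means a citation to a whole, never-updated study stays short, while a citation to an updated file or an extract carries the extra detail.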

Journal practices

• Historically little effort to standardize or verify data references in publications

• Growing movement to require data behind findings to be publicly available

• AER: Will publish only if “data used in the analysis are clearly and precisely documented and readily available for replication.”

Influencing journals

• Data-PASS campaign to influence journals sponsored by professional associations

• Wrote to major professional associations demonstrating inconsistencies in citing data

• Success with American Sociological Review, which changed submission criteria

Linking data and publications

• ICPSR has done this since the beginning in 1962

• Now a Bibliography of 60K citations to publications with two-way linking to data

• Vendors like Thomson Reuters now interested in these linkages

Summary -- Citation issues for social science

• Versioning – Data can be dynamic

• Unit/Granularity – What is optimal?

• Importance of metadata – How to create durable link?

• Replication – Cite subsets and replication/workflow files containing scripts?

Thank you…

–Mary Vardigan [email protected]