Data Sharing & Data Citation Micah Altman, Institute for Quantitative Social Science, Harvard University Prepared for data coding, analysis, archiving, and sharing for open collaboration NSF Sept 15-16, 2011


Page 1: Data Sharing & Data Citation

Data Sharing & Data Citation

Micah Altman, Institute for Quantitative Social Science, Harvard University

Prepared for data coding, analysis, archiving, and sharing for open collaboration

NSF, Sept 15-16, 2011

Page 2: Data Sharing & Data Citation

Collaborators*

Margaret Adams, George Alter, Leonid Andreev, Ed Bachman, Adam Buchbinder, Ken Bollen, Bryan Beecher, Steve Burling, Tom Carsey, Kevin Condon, Jonathan Crabtree, Merce Crosas, Darrell Donakowski, Myron Gutmann, Gary King, Patrick King, Tom Lipkis, Freeman Lo, Jared Lyle, Marc Maynard, Nancy McGovern, Amy Pienta, Lois Timms-Ferrara, Akio Sone, Bob Treacy, Copeland Young

Research Support

Thanks to the Library of Congress (PA#NDP03-1), the National Science Foundation (DMS-0835500, SES 0112072), IMLS (LG-05-09-0041-09), the Harvard University Library, the Institute for Quantitative Social Science, the Harvard-MIT Data Center, and the Murray Research Archive.

* And co-conspirators

Page 3: Data Sharing & Data Citation

Related Work

Altman, M., and J. Crabtree. 2011. "Using the SafeArchive System: TRAC-Based Auditing of LOCKSS." Proceedings of Archiving 2011.

Crosas, M. 2011. "The Dataverse Network: An Open-Source Application for Sharing, Discovering and Preserving Data." D-Lib Magazine 17(1/2).

Altman, M., Adams, M., Crabtree, J., Donakowski, D., Maynard, M., Pienta, A., and Young, C. 2009. "Digital preservation through archival collaboration: The Data Preservation Alliance for the Social Sciences." The American Archivist 72(1): 169-182.

Gutmann, M., Abrahamson, M., Adams, M.O., Altman, M., Arms, C., Bollen, K., Carlson, M., Crabtree, J., Donakowski, D., King, G., Lyle, J., Maynard, M., Pienta, A., Rockwell, R., Timms-Ferrara, L., and Young, C. 2009. "From Preserving the Past to Preserving the Future: The Data-PASS Project and the challenges of preserving digital social science data." Library Trends 57(3): 315-33.

Altman, M. 2008. "A Fingerprint Method for Verification of Scientific Data." In Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007). Springer Verlag.

Altman, M., and G. King. 2007. "A Proposed Standard for the Scholarly Citation of Quantitative Data." D-Lib Magazine 13(3/4) (March/April).

King, G. 2007. "An Introduction to the Dataverse Network as an Infrastructure for Data Sharing." Sociological Methods and Research 36(2): 173-199.

Page 4: Data Sharing & Data Citation


Motivations

Page 5: Data Sharing & Data Citation

Access to Data is the Foundation of Science

- Science is not (only) about being scientific
- Scientific progress requires community: competition and collaboration in the pursuit of common goals
- Without access to the same materials, no community exists

… data is the nucleus of scientific collaboration

- The value of an article that can't be replicated: ?
- Scholarly articles are summaries, not the actual research results
- Experimental data are expensive to reproduce; observational data are impossible to reproduce
- Hard for journal editors to verify
- If you find it, how do you know it's the same?
- Replication projects show: many published articles cannot be replicated

… data is needed for scientific replication

Sources: Fienberg et al. 1985; ICSU 2004; Nature 2009

Page 6: Data Sharing & Data Citation

Open Data Broadens & Deepens Impact

- Data-Intensive Science
  - Increased opportunities for interdisciplinarity
  - Science modeling across multiple scales
  - Continuous, complete, fine-grained information on physical processes, systems, human behavior
- Education
  - Data eases transition from education to research
- Open Data Democratizes Science
  - Citizen-scientists
  - Developing countries
  - Researchers outside of the inner circle of an institution
  - Crowd-sourcing, open notebooks, and mashups

& Data Sharing Increases Publication Impact

[Gleditsch 2003; Wilson 2008; Piwowar 2007]

Page 7: Data Sharing & Data Citation

Data is Key to Government

- Statistics = state-istics
- Reformers use data
  - To assess the performance of the state
  - To assess social conditions
- Governments attempt to control access to data to evade accountability
- Policy debates often center on data
  - War on poverty, civil rights, consumer protection: all made heavy use of statistical arguments
  - Economic and environmental policies are data-intensive
- Data access brings together both sides of the political spectrum
  - In a modern democracy the public needs a direct source of information
  - Liberals and conservatives support access to data informing policy

Source: “Propaganda” http://www.media-studies.ca/articles/images/berlin_wall.jpg

Sources: Gough 2003; Shulman 2006; Wagner & Steinzor 2006; Alonzo and Starr 1988

Page 8: Data Sharing & Data Citation

Open Data is “Research Insurance”

- Keeps options open after the nominal end of the project; extends the lifecycle
  - Continuation projects
  - Publication revisions
  - Broader research programs
- Insures against loss of “project memory”
  - Departure of senior personnel from the institution
  - Departure of post-docs, graduate students
  - Accidental loss of data due to local IT failures
  - Reduces questions from secondary analysts
- Insures against intentional and unintentional errors
  - All collaborators can verify results prior to publication
  - Enables more intensive peer review

Source: Berman et al. 2008.

Page 9: Data Sharing & Data Citation

Data Sharing Across Communities

- Data sharing practices vary greatly across communities: proprietary, formal sharing, formal deposit
- Significant correlates: tacit knowledge, individual investment of time in data collection, confidentiality, journal practices, funder policies & practices

Source: R.I.N. 2008; also see Borgman 2007; Niu 2006

Page 10: Data Sharing & Data Citation

So when do things go wrong?

Source: Reich & Rosenthal 2005

Page 11: Data Sharing & Data Citation


Page 12: Data Sharing & Data Citation

Confidentiality Restrictions for Personal Private Information

- Overlapping laws differ:
  - People/subjects covered
  - Organizations covered
  - Required technical and procedural controls
  - Definition of identifiability
- Some Strategies
  - Consent for sharing up front
  - Commercialize
  - Observe public activity
  - Share aggregates only
  - De-identify
- Recent Statistical Results (Oversimplified)
  - De-identification often leaks
  - Aggregation sometimes leaks

Not included: EU directives, foreign laws,

ANPRM Request for Comment on proposed revisions to 45 CFR 46

www.hhs.gov/ohrp/humansubjects/anprm2011page.html
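One way to probe the claim that de-identification often leaks is a k-anonymity count: once direct identifiers are dropped, any combination of the remaining quasi-identifiers shared by fewer than k records is a re-identification risk. The sketch below is illustrative only and is not taken from the slides; the field names (`zip`, `birth_year`, `sex`) and the sample records are hypothetical.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values on the given quasi-identifier fields."""
    counts = Counter(
        tuple(rec[field] for field in quasi_identifiers) for rec in records
    )
    return min(counts.values())


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group(records, quasi_identifiers) >= k


# Hypothetical "de-identified" records: names removed, quasi-identifiers kept.
records = [
    {"zip": "02138", "birth_year": 1970, "sex": "F", "income": 52000},
    {"zip": "02138", "birth_year": 1970, "sex": "F", "income": 61000},
    {"zip": "02139", "birth_year": 1985, "sex": "M", "income": 48000},
]

# The third record is unique on (zip, birth_year, sex), so the table
# is not 2-anonymous even though no names are present.
print(is_k_anonymous(records, ["zip", "birth_year", "sex"], k=2))
```

Passing a k-anonymity check is a necessary screen at best; it does not by itself show that a release is safe.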

Page 13: Data Sharing & Data Citation


Integrating Tools

Page 14: Data Sharing & Data Citation

Data Management - Goals


Page 15: Data Sharing & Data Citation

Data Management Elements


Page 16: Data Sharing & Data Citation

Core Requirements for Data Sharing Infrastructure

- Stakeholder incentives: recognition; citation; payment; compliance; services
- Dissemination: access to metadata; documentation; data
- Access control: authentication; authorization; rights management
- Provenance: chain of control; verification of metadata, bits, semantic content
- Persistence: bits; semantic content; use
- Legal protection: rights management; consent; record keeping; auditing
- Usability: discovery; deposit; curation; administration; collaboration
- Business model

Sources: King 2007; ICSU 2004; NSB 2005
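Verification of bits, listed above under provenance and persistence, is commonly done with a fixity check: record a cryptographic digest when the data are deposited, then recompute and compare it later. The following is a minimal sketch using Python's standard `hashlib`; the sample byte strings are made up, and this byte-level check is a different technique from the semantic fingerprint method cited under Related Work.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_fixity(data: bytes, recorded_digest: str) -> bool:
    """True if the data still match the digest recorded at deposit time."""
    return sha256_digest(data) == recorded_digest


# Hypothetical deposit: the digest is stored alongside the data.
original = b"id,score\n1,0.5\n2,0.7\n"
recorded = sha256_digest(original)

# A later copy in which one value has silently changed.
corrupted = b"id,score\n1,0.5\n2,0.1\n"

print(verify_fixity(original, recorded))
print(verify_fixity(corrupted, recorded))
```

A byte-level digest detects any change to the file, including harmless reformatting, which is why verification of semantic content appears as a separate requirement.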

Page 17: Data Sharing & Data Citation

Why is Infrastructure for Data Sharing Necessary?

- Accessibility:
  - Many large data sets: in public archives
  - Most data in published articles: not accessible, results not replicable without the original author
  - Most data sets from federal grants: not publicly available
- Problems with discovery and linking even with professional archives:
  - Data in different archives have different identifiers
  - Archives change identifiers, links
  - Changes to data are made; identifiers are reused or removed; old data are lost
  - Locating/browsing/extracting requires specialized tools & approaches
- Sharing data requires exposing tacit knowledge
  - Explicit documentation of data structure, collection process, interpretation
  - Harmonizing/linking to known ontologies, metadata schemas, vocabularies
- Data sets are not preserved like books
  - Static data files (even if on the web): unreadable after a few years
  - When storage methods change: some data sets are lost; others have altered content!
- Why not a single centralized infrastructure?
  - Single point of failure
  - Difficult when data are heterogeneous in format, origin, size, effort needed to collect or analyze, legal access rules, etc.
  - Data producers want credit, control, and visibility

Page 18: Data Sharing & Data Citation

Dataverse

• Gateway to over 39000 social science studies (world’s largest catalog)
• Web Virtual Hosting 2.0 Service: over 350 virtual archives
• Federated search and delivery

For Scholars

• Brand it like your own website
• Upload any type of data
• Establish a persistent data citation
• Facilitate data discovery
• Provide live analysis
• Receive permanent storage space

For Organizations

• Used by archives, libraries, journals, schools
• Enable contributors to upload data
• Organize studies by collections
• Search across a universe of data
• Control access and terms of use
• Federate with catalogs and partners: OAI-PMH, LOCKSS, Z39.50, DDI
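Federation over OAI-PMH, mentioned on this slide, works by sending HTTP requests whose query string names a protocol verb. As a rough sketch, the helper below builds a `ListRecords` request URL; the verb, the `metadataPrefix` and `set` arguments, and the `oai_dc` format come from the OAI-PMH specification, while the base URL and the function name are hypothetical and do not refer to any actual Dataverse endpoint.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    metadata_prefix selects the record format (oai_dc is the Dublin Core
    format every OAI-PMH repository must support); set_spec optionally
    restricts the harvest to one set.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical repository endpoint, for illustration only.
BASE_URL = "https://archive.example.org/oai"

print(list_records_url(BASE_URL))
print(list_records_url(BASE_URL, set_spec="social_science"))
```

A harvester would fetch this URL, parse the XML response, and follow any resumption token to page through the catalog; those steps are omitted here.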

Page 19: Data Sharing & Data Citation

Virtual Archive: Scholar Site

Scholar retains control over branding and dissemination

Preservation and long-term access are guaranteed

Dissemination and compliance with Data Management Plans are verifiable

Integrates with OpenScholar

Page 20: Data Sharing & Data Citation

Interoperability & Integration

Page 21: Data Sharing & Data Citation

Mind the Gaps

- GAP: Coverage across entire lifecycle: decoupling of dissemination, formal publication, long-term access, reuse
- GAP: Interoperability and integration across tools
- GAP: Maturity and sustainability of tools: most tools have small communities of maintainers, particularly worrisome given the lack of interoperability

[Figure: tool categories mapped against research lifecycle stages]

Lifecycle stages: design, collection, processing, integration, analysis, publishing, dissemination, archiving, reuse

Tool categories: CATI / CAPI; enhanced publication (Sweave); identifiers; Google-__________; data archives, hosting, networks; general digital libraries and repositories; scientific workflow systems

Page 22: Data Sharing & Data Citation


Supporting Institutions

Page 23: Data Sharing & Data Citation

Institutional Data Access Strategies*


“Ignore it, maybe someone else will take care of it” (internet archive, …)

“We’ll always be here” (self-preservation)

“Let the publishers do it”

“We are ever true to [Insert Alma Mater]” (institutional archives)

“Ask us (domain archive) to do it” (ICPSR, MRA, Roper, …)

“Ask someone(s) else to do it” (Data-PASS, MetaArchive, CLOCKSS)

“Trust No One” (LOCKSS)

*All quotes are entirely fictional :-)

Page 24: Data Sharing & Data Citation

Institutional Preservation Strategies -- Corollaries

- There are potential single points of failure in technology, organization, and legal regimes:
  - Diversify your portfolio: multiple software systems, hardware, organizations (e.g., Data-PASS :-)
  - Seek international partners
- Many combinations of preservation & dissemination strategies are compatible:
  - Layer technologies and strategies
  - Leverage dissemination (in a planned way) for preservation (and vice-versa)
- Preservation is impossible to demonstrate conclusively:
  - Consider organizational credentials
  - No organization is absolutely certain to be reliable


Institutional Topologies

Projects at the Intersection

Page 25: Data Sharing & Data Citation

- Partnership Agreements
  - MOU
  - Succession Plans & Agreements
- Coordinating Operations
  - Development of shared procedures
- Joint “Not-bad” practices
  - Identification & selection
  - Metadata
  - Confidentiality
- Shared Catalog
  - Unified Discovery
  - Content replication

Data-PASS is a broad-based partnership of data archives dedicated to acquiring and preserving data at risk of being lost to the social science research community.

Data-PASS partners have rescued thousands of data sets and created the largest catalog of social science data in existence.

Data-PASS partners collaborate to identify and promote good archival practices, seek out at-risk research data, build preservation infrastructure, and mutually safeguard

Page 26: Data Sharing & Data Citation

Ideal integration of policy and technology?

- Expressed in high-level domain/business language
- Captures a significant portion of business domain
- Translated to a formal schematization
- Automatically measurable
- Directly controls procedures & actions to achieve compliance
- Verifiable translation from business domain policy

Policy: A set of rules and objectives, expressed at a high domain level, that controls actions at a lower level

Page 27: Data Sharing & Data Citation


“The repository system must be able to identify the number of copies of all stored digital objects, and the location of each object and their copies.”

Policy

Schematization

Behavior (Operationalization)
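The three layers on this slide can be sketched in code: the quoted policy is the prose rule, a small data structure is its schematization, and an audit function is the behavior that measures it. This is an illustrative sketch only, not the SafeArchive implementation; the `ReplicationPolicy` class, the `min_copies` threshold, and the host and study names are all assumptions added for the example (the quoted policy itself only requires that copies and their locations be identifiable).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPolicy:
    """Schematization: the prose policy reduced to a measurable rule.

    min_copies is an assumed threshold; the quoted policy does not set one.
    """
    min_copies: int


def audit(holdings, policy):
    """Behavior: for each object, report the number of copies, where they
    are, and whether the count meets the policy.

    holdings maps an object identifier to the list of hosts holding a copy.
    """
    report = {}
    for object_id, locations in holdings.items():
        distinct = sorted(set(locations))  # count each host only once
        report[object_id] = {
            "copies": len(distinct),
            "locations": distinct,
            "compliant": len(distinct) >= policy.min_copies,
        }
    return report


# Hypothetical holdings for two studies across three hosts.
holdings = {
    "study-001": ["host-a", "host-b", "host-c"],
    "study-002": ["host-a"],
}
policy = ReplicationPolicy(min_copies=3)

for object_id, row in audit(holdings, policy).items():
    print(object_id, row["copies"], row["locations"], row["compliant"])
```

Keeping the rule in a separate object from the audit code is what makes the translation from policy to behavior checkable: changing the policy changes one value, not the procedure.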

Page 28: Data Sharing & Data Citation

SafeArchive: TRAC-Based Management of LOCKSS

Facilitating collaborative replication and preservation with technology…

- Collaborators declare explicit non-uniform resource commitments
- Policy records commitments, storage network properties
- Storage layer provides replication, integrity, freshness, versioning
- SafeArchive software provides monitoring, auditing, and provisioning
- Content is harvested through HTTP (LOCKSS) or OAI-PMH
- Integration of LOCKSS, The Dataverse Network, TRAC

Page 29: Data Sharing & Data Citation

Aligning Incentives

Page 30: Data Sharing & Data Citation

Stakeholders & Information Flow

[Diagram: information transfer among stakeholders, from data collection to publication of research products, annotated with stakeholder concerns and legal issues]

Stakeholders:
- Research sources: research subjects; owners of subject material; owners of supplementary data
- Research sponsors: home institution; funding sources
- Project personnel: investigators; research staff
- Research publishers: print publishers; research archives
- Research consumers: readers; secondary researchers

Legal issues (as grouped in the diagram):
- Licensing; Copyright; DMCA; Informed Consent; Privacy; Trade secrets
- Licensing; Freedom of Information; Copyright
- Copyright
- Copyright; Licensing
- Fair Use
- Privacy; Confidentiality; Intellectual Property

Stakeholder concerns (as grouped in the diagram):
- Replicable research; policy relevance; accessibility of research; protect IP; avoid third-party IP/privacy issues
- Replicable research; publish; promote use of publications; track use
- Replicable research; promote use of their publications; protect publisher IP; avoid third-party IP/privacy issues
- Replicate and extend; secondary analysis; link research

Flow stages: Data Collection; Information Transfer; Publication of Research Products

Page 31: Data Sharing & Data Citation

Data Citation as a Leverage Point

Services
- Identifiers to specific fixed versions of data are needed to establish unambiguous chains of provenance
- Identifiers that can be globally resolved to machine-understandable metadata and to the identified object are needed to build generalized access and analysis services
- Persistence of identifiers is needed to maintain long-term access

Incentives
- Scholarly credit (intellectual attribution) is a large motivator for many researchers: citation creates an incentive for researchers to publish data
- Scholars also comply with enforceable journal policies: requiring data citation is a lightweight method to make data access policies auditable
- Impact/usage is a motivator for public research funders: data citation provides a foundation for measures of usage and impact

Page 32: Data Sharing & Data Citation

Common Principles

Page 33: Data Sharing & Data Citation


Page 34: Data Sharing & Data Citation

Thanks to 37 Participants


Page 35: Data Sharing & Data Citation


What is a citation?

Page 36: Data Sharing & Data Citation


Page 37: Data Sharing & Data Citation

Workflow


Page 38: Data Sharing & Data Citation

Workflow


Page 39: Data Sharing & Data Citation

Design Principles

- Separate scientific principles, use cases, requirements
- Distinguish syntax, semantics, from presentation
- Design for ecosystem & lifecycle
- Incremental value for incremental effort
- Think Globally, Act Locally

Page 40: Data Sharing & Data Citation


Theory

Page 41: Data Sharing & Data Citation

Theory +

- Data citations should be first-class objects for publication: they should appear with other citations, and data should be as easy to cite as other works
- At minimum, all data necessary to understand, assess, and extend conclusions in scholarly work should be cited
- Citations should persist and enable access to a fixed version of the data at least as long as the citing work
- Data citation should support unambiguous attribution of credit to all contributors, possibly through the citation ecosystem

Page 42: Data Sharing & Data Citation

Theory + Practice


Page 43: Data Sharing & Data Citation


Use Cases

Page 44: Data Sharing & Data Citation

Use Cases (details)

Operational Constraints?

- Syntax
- Interoperability
- Technical contexts of use

Page 45: Data Sharing & Data Citation

Actors


Page 46: Data Sharing & Data Citation

Simple Proposal

- Semantic: Persistent ID, Author, Title, Version (or at least date)
- Presentation: any style; grouped with other references; actionable in context
- Policy: treat data cites as first class; if it's needed to support a claim, cite it; offer credit to contributors
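The semantic elements of the proposal (persistent ID, author, title, version) can be held in a small record that is kept separate from how the citation is displayed. The sketch below is one possible reading, not a format defined in the slides; the `DataCitation` class, the rendering style, and the sample identifier `hdl:0000.0/EXAMPLE` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCitation:
    """Semantic content of a data citation, independent of display style."""
    authors: tuple
    year: int
    title: str
    persistent_id: str
    version: str

    def render(self) -> str:
        """One arbitrary presentation; the proposal allows any style."""
        names = "; ".join(self.authors)
        return (
            f'{names}, {self.year}, "{self.title}", '
            f"{self.persistent_id}, {self.version}"
        )


# Made-up example values; the identifier does not resolve to anything.
citation = DataCitation(
    authors=("Doe, Jane", "Roe, Richard"),
    year=2011,
    title="Example Survey Data",
    persistent_id="hdl:0000.0/EXAMPLE",
    version="V2",
)

print(citation.render())
```

Separating the fields from `render` mirrors the design principle of distinguishing semantics from presentation: a journal can restyle the output without losing the identifier or version.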

Page 47: Data Sharing & Data Citation

Discussion

- We cannot depend on a single tool: plans for integration and interoperability through citations and linking mechanisms, interchange formats, ontology hooks, protocols?
- A large portion of the benefit from data sharing arises from open access…: how can OpenShare “nudge” researchers toward Open Data?
- Individual researchers cannot ensure long-term access: how will OpenShapa fit in the institutional ecosystem?

Page 48: Data Sharing & Data Citation

Contact


Micah Altman

futurelib.org