The Rise of Data Publishing in the Digital World
(and how Dataverse and DataTags help)

Mercè Crosas, Ph.D.
Chief Data Science and Technology Officer
Institute for Quantitative Social Science
Harvard University
@mercecrosas

NDSR 2016 Symposium

From 1665 to the late 20th century: a steady increase in the size and complexity of research output

The number of journals has doubled every 20 years since the 1750s, along with the growth in the number of scientists (Mabe, 2003)

• 1700: 3 journals
• 1800: ~10 journals
• 1900: ~400 journals
• 2000: ~14,000 journals (peer-reviewed)

[Figure: number of journals over time; time axis 1665 to 1965, count axis marked at 100 and 10000]
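The 20-year doubling claim can be checked against the journal counts quoted on this slide. The following is a minimal sketch, assuming simple exponential growth between each pair of quoted years; the function name `doubling_time` is illustrative and does not come from the talk.

```python
import math

# Journal counts as quoted on the slide (Mabe, 2003); values are approximate.
journal_counts = {1700: 3, 1800: 10, 1900: 400, 2000: 14_000}

def doubling_time(year_a, count_a, year_b, count_b):
    """Years needed to double, assuming exponential growth between two points."""
    growth_rate = math.log(count_b / count_a) / (year_b - year_a)
    return math.log(2) / growth_rate

years = sorted(journal_counts)
doubling_times = {
    (a, b): doubling_time(a, journal_counts[a], b, journal_counts[b])
    for a, b in zip(years, years[1:])
}

for (a, b), t in doubling_times.items():
    print(f"{a}-{b}: doubling time of about {t:.0f} years")
```

Under this assumption, the 1800-1900 and 1900-2000 intervals both imply a doubling time of a little under 20 years, while the 1700-1800 interval is much slower.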

Data Tables and Visuals Become Increasingly Common, and Part of the Scientific Argument

• a few tables & visuals, as part of the text
• 50% of articles have tables & figures
• most articles have tables & figures, often standalone
• 50% cite previous work
• 100% with citations (1 per 100 words)
• part of scholarly credit
• method sections appear
• First line graphs and bar charts (Playfair, 1786)
• First scatterplots (Herschel, 1833; Galton 1896)

[Figure: annotations placed along the journal-growth timeline; time axis 1665 to 1965, count axis marked at 100 and 10000]

Scholarly Publishing Adapts to the Increase of Cognitive Complexity (Gross et al., 2001)

• 18th century:
  • formal components appear in articles (introduction, conclusions, tables, figures, citations)
• 19th century:
  • articles explain data instead of establishing observations of facts
  • wide use of visuals, high citation density, methods section
• 20th century:
  • structured quantitative data with increased use of statistics
  • wide range of data types with new technologies
• The number of scientists increases from hundreds to a few million
• Science becomes extremely specialized:
  • from 1 journal to 14,000 peer-reviewed journals
  • one new journal for each 150 authors, read by 500

In the last decades, more and more publications and data

A Steeper Growth of Scholarly Output

Since 1950, the total number of journals doubles every ~15 years

• 2010: 80,000 journals
• 2010: 33,000 peer-reviewed

An Outburst of Research Data and Specialization Results in > 1000 Community Repositories

• 1920 - 1950s: First Social Science Data Archives (ODUM, ICPSR, ...)
• 1970 - 1980s: First Biomedical Databases (PDB, GenBank, ...)
• 2016: A wide range of Research Data Repositories; 1500 repositories listed in re3data.org

Data Publishing Emerges as the Union of Scholarly Publishing and Data Archiving

Scholarly Publishing: Distribute research output

• Attribution and credit
• Dissemination
• Finding & Reuse

Data Archiving: Long-term access to data

• Accessibility
• Preservation
• Finding & Reuse

Why Data Publishing now?

• Data (and software) have become common inputs and outputs of research
• A scholarly article cannot hold or accurately describe these vast amounts of data and software
• As inputs and outputs of research, data must be citable and accessible to enable validation and reuse, with attribution

Extending the thesis of Gross et al., data publishing accommodates the complexity of research input and output in the digital world.

What is needed for FAIR Data Publishing

FAIR = Findable, Accessible, Interoperable, Reusable

Data Citation

• Persistent id to reference data uniquely
• Support for versions and fixity
• Attribution to authors and repository

Metadata

• Catalog to discover and locate the data
• Sufficient information to understand and reuse the data

Repository

• Digital access to metadata and data
• Archive and preservation for long-term access
• Interoperability through standards and APIs

Dataverse: a data repository system that serves as a solution for publishing FAIR research data

Around the World

Dataverse repositories serve a community, an institution, an archive, ...

Harvard Dataverse: Generic data repository open to researchers worldwide

Dataverses contain datasets; datasets contain metadata and data files

Data Citation in Dataverse

Components labeled on the example citation:

• Authors
• Published Year
• Dataset Title
• Global Persistent Identifier
• Repository = Data Publisher
• Version (or time range)
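The citation components listed on this slide can be assembled into a single citation string. The sketch below is illustrative only: the class name `DatasetCitation`, the field order, the separators, and every sample value (including the placeholder identifier) are assumptions, not the exact format produced by Dataverse.

```python
from dataclasses import dataclass

@dataclass
class DatasetCitation:
    """The six citation components named on the slide (field names are hypothetical)."""
    authors: list[str]
    year: int
    title: str
    persistent_id: str   # global persistent identifier, e.g. a DOI
    repository: str      # the repository acts as the data publisher
    version: str         # version label (or time range)

    def format(self) -> str:
        # Assumed layout: authors, year, quoted title, identifier, repository, version
        author_part = "; ".join(self.authors)
        return (f'{author_part}, {self.year}, "{self.title}", '
                f"{self.persistent_id}, {self.repository}, {self.version}")

# All values below are made-up placeholders
citation = DatasetCitation(
    authors=["Doe, Jane", "Roe, Richard"],
    year=2016,
    title="Example Survey Data",
    persistent_id="doi:10.5072/FK2/EXAMPLE",
    repository="Example Dataverse",
    version="V1",
)
print(citation.format())
```

Keeping the version and the persistent identifier as separate fields reflects the slide's point that a citation must reference one specific, fixed state of the data.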

Data Citation Basics

Force11, Joint Declaration of Data Citation Principles; Starr et al, 2015

The dataset landing page is accessible and guaranteed by the repository (or data publisher), even when data are restricted or deaccessioned

Metadata in Dataverse

Metadata Level: Citation Metadata
  Fields: author, title, repository, year published, version, etc.
  Standards:
  • Dublin Core
  • DataCite

Metadata Level: Domain-specific Metadata
  Fields: data collection info (methods, organism, observation, survey, experiment, etc.)
  Standards:
  • DDI (social sciences)
  • ISA-Tab BioCaddie (biomed)
  • Virtual Observatory (astro)
  • + Custom metadata blocks

Metadata Level: File-level Metadata
  Fields: metadata inside the data file (variables, instrument details, geospatial info, etc.)
  Standards:
  • DDI (for variables)
  • + more to be determined

Dataverse JSON Schema

Information Extraction: Tabular Files

File formats: RData, Stata, SPSS, Excel, CSV

         var 1   var 2   var 3
obs 1    2       a       0
obs 2    4       c       0
obs 3    6       b       1
obs 4    1       e       0
obs 5    2       a       1
obs 6    3       b       1

Variable Metadata: variable name, label, type, stats, geospatial coordinates

Data Values: independent of format

Universal Numerical Fingerprint (UNF): checksum on data values, from canonical format
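The idea of a checksum computed on data values in a canonical form, rather than on the bytes of a file, can be sketched as follows. This is a simplified illustration and not the actual UNF algorithm: the canonical number format, the value separator, the truncation to 16 bytes, and the function names `canonical` and `fingerprint` are all assumptions made for the example.

```python
import base64
import hashlib

def canonical(value, digits=7):
    """Render a value in a format-independent way (simplified assumption)."""
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, (int, float)):
        # Fixed number of significant digits, so that 2 and 2.0 agree
        return f"{float(value):.{digits - 1}e}"
    return str(value)

def fingerprint(rows):
    """Hash the canonical data values, ignoring the original file format."""
    digest = hashlib.sha256()
    for row in rows:
        for value in row:
            digest.update(canonical(value).encode("utf-8") + b"\n\x00")
    return base64.b64encode(digest.digest()[:16]).decode("ascii")

# The six observations from the slide, as they might be read from a CSV file
from_csv = [(2, "a", 0), (4, "c", 0), (6, "b", 1),
            (1, "e", 0), (2, "a", 1), (3, "b", 1)]
# The same values as another package might expose them (floats instead of ints)
from_stata = [(2.0, "a", 0.0), (4.0, "c", 0.0), (6.0, "b", 1.0),
              (1.0, "e", 0.0), (2.0, "a", 1.0), (3.0, "b", 1.0)]

# Same values in a different representation give the same fingerprint
same = fingerprint(from_csv) == fingerprint(from_stata)
# Changing a single data value gives a different fingerprint
changed = fingerprint(from_csv) == fingerprint([(2, "a", 1)] + from_csv[1:])
print(same, changed)
```

Because the hash is taken over normalized values, the same dataset saved in two formats yields one fingerprint, which is what makes such a checksum usable for fixity in a data citation.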

Information Extraction: FITS (astro) Files

Header Metadata: coordinates (R.A., declination), photometric info, ...

Data Objects:
• Image files
• Spectra
• Data cubes
• Tables
• ...
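Extracting header metadata from a FITS file amounts to reading fixed-width keyword records ("cards"). The sketch below parses a few hand-written sample cards with the standard library only. It is a simplified reader, not a complete FITS implementation: it ignores quoted-quote escapes, continuation cards, and exponent forms such as `1.0D3`. The `RA`, `DEC`, and `OBJECT` keywords and all values are hypothetical examples, and the function name `parse_card` is an assumption.

```python
def parse_card(card):
    """Parse one 80-character FITS header card into (keyword, value, comment)."""
    keyword = card[:8].strip()
    if card[8:10] != "= ":
        # Commentary or END card: no value field
        return keyword, None, card[8:].strip()
    body = card[10:]
    if body.lstrip().startswith("'"):
        # String value: text between the first pair of single quotes
        start = body.index("'") + 1
        end = body.index("'", start)
        value = body[start:end].rstrip()
        comment = body[end + 1:].partition("/")[2].strip()
    else:
        raw, _, comment = body.partition("/")
        raw, comment = raw.strip(), comment.strip()
        if raw in ("T", "F"):
            value = raw == "T"          # logical value
        else:
            try:
                value = int(raw)
            except ValueError:
                value = float(raw)
    return keyword, value, comment

# Hypothetical sample cards; a real header pads every card to 80 characters
cards = [
    "SIMPLE  =                    T / conforms to FITS standard",
    "NAXIS   =                    2 / number of data axes",
    "RA      =            83.633083 / [deg] right ascension (hypothetical value)",
    "DEC     =            22.014500 / [deg] declination (hypothetical value)",
    "OBJECT  = 'Example '           / hypothetical target name",
    "END",
]

header = {}
for card in cards:
    keyword, value, _comment = parse_card(card.ljust(80))
    if value is not None:
        header[keyword] = value

print(header)
```

In practice a maintained astronomy library would be used for this step; the point here is only that coordinates and similar fields sit in the header as plain keyword and value pairs that a repository can index.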

In addition to data citation and metadata features, Dataverse has a rich set of features that facilitate data publishing

Tiered Access

Tier                   Metadata   Files        How to Access
Open (default): CC0    Open       Open         Click to download
GuestBook              Open       Open         Fill in guestbook before download
Terms of Use           Open       Open         Click through terms of use before download
Data Restricted        Open       Restricted   Request access via click through
Data Restricted        Open       Restricted   Request access via application

Page 98: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

Data Publishing Workflows

Page 107: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

Data Publishing Workflows

• Create Dataset (landing page restricted)
• Review (collaborators or anonymous reviewers)
• Publish v. 1
• Minor change (metadata only) → Publish v. 1.1
• Major change (might include new data file) → Publish v. 2
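The versioning rule in this workflow can be sketched in a few lines: a minor (metadata-only) change bumps the second number, and a major change bumps the first. The function names and the tuple representation are assumptions for illustration, not Dataverse code.

```python
def next_version(current, change):
    """Return the version that follows `current` for a given kind of change.

    `current` is a (major, minor) tuple; `change` is "minor" for a
    metadata-only edit or "major" for a change that might add a data file.
    """
    major, minor = current
    if change == "minor":
        return (major, minor + 1)
    if change == "major":
        return (major + 1, 0)
    raise ValueError(f"unknown change type: {change!r}")


def label(version):
    """Format a version the way the slide writes it: v. 1, v. 1.1, v. 2."""
    major, minor = version
    return f"v. {major}" if minor == 0 else f"v. {major}.{minor}"


first = (1, 0)                              # Publish v. 1
after_minor = next_version(first, "minor")  # metadata-only change
after_major = next_version(after_minor, "major")  # e.g. a new data file

print(label(first), label(after_minor), label(after_major))
```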

Page 108: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

And more at dataverse.org guides ...

Page 109: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

Biomedical Dataverse addresses data publication of large files: SBGridData

Page 110: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

The Biomedical Dataverse at Harvard Medical School - also tested as a persistent repository for LINCS data

(NIH Library of Integrated Network-based Cellular Signatures)

Collaboration with Piotr Sliz and Caroline Shamu (HMS)

Page 112: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

An additional challenge for data publishing:

Sensitive Data

Page 113: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

“User Uploads must be void of all identifiable information, such that re-identification of any subjects from the amalgamation of the information available from all of the materials (across datasets and dataverses) uploaded under any one author and/or user should not be possible.”

Page 114: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

“Submitter represents and warrants that the Content does not contain any information (i) which identifies, or which can be used in conjunction with other publicly available information to personally identify, any individual;”

Page 115: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

“If you are submitting human sequences to GenBank, do not include any data that could reveal the personal identity of the source. It is our assumption that you have received any necessary informed consent authorizations that your organizations require prior to submitting your sequences.”

GenBank

Page 116: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

How can we maximize the publishing of sensitive data while being mindful of privacy?

Page 117: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

Sweeney L, Crosas M, Bar-Sinai M. Sharing Sensitive Data with Confidence: The DataTags System. Technology Science. 2015101601. October 16, 2015. http://techscience.org/a/2015101601

The DataTags System

Page 120: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

A datatag is a set of security features and access requirements for file handling

A datatags repository is one that stores and shares data files in accordance with standardized, ordered levels of security and access requirements

Page 121: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

Datatags Levels

Tag Type | Description | Security Features | Access Requirements
Blue | Public | Clear storage, Clear transmission | Open
Green | Controlled public | Clear storage, Clear transmission | Email, OAuth verified registration
Yellow | Accountable | Clear storage, Encrypted transmit | Password, Registered, Approval, Click DUA
Orange | More accountable | Encrypted storage, Encrypted transmit | Password, Registered, Approval, Signed DUA
Red | Fully accountable | Encrypted storage, Encrypted transmit | Two-factor authentication, Approval, Signed DUA
Crimson | Maximally restricted | Multi-encrypt store, Encrypted transmit | Two-factor authentication, Approval, Signed DUA
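Because the levels are ordered, they map naturally onto an ordered enumeration. The sketch below encodes the six tags from the table and derives the storage and transmission requirements from the ordering. The class and function names, and the idea of a repository declaring a maximum supported tag, are assumptions for illustration and are not taken from the DataTags implementation.

```python
from enum import IntEnum


class DataTag(IntEnum):
    """The six datatag levels, ordered from least to most restrictive."""
    BLUE = 1      # Public
    GREEN = 2     # Controlled public
    YELLOW = 3    # Accountable
    ORANGE = 4    # More accountable
    RED = 5       # Fully accountable
    CRIMSON = 6   # Maximally restricted


def requires_encrypted_transit(tag):
    """Per the table, transmission is encrypted from Yellow upward."""
    return tag >= DataTag.YELLOW


def requires_encrypted_storage(tag):
    """Per the table, storage is encrypted from Orange upward."""
    return tag >= DataTag.ORANGE


def repository_can_host(repo_max_tag, file_tag):
    """Hypothetical check: a repository that supports levels up to
    `repo_max_tag` can hold any file tagged at or below that level."""
    return file_tag <= repo_max_tag


for tag in DataTag:
    print(tag.name, requires_encrypted_storage(tag), requires_encrypted_transit(tag))
```

An ordered type keeps comparisons such as "at least Orange" explicit, which is the property the definition of a datatags repository depends on.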

Page 122: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

DataTags Workflow in a Dataverse Repository (under development)

[Workflow diagram labels: Sensitive Dataset; Data File Ingestion; Automatic Interview; Review Board Approval; Direct Access; Privacy Preserving Access; Two-factor Authentication; Signed DUA]

http://datatags.org

http://privacytools.seas.harvard.edu

Page 123: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

Example of DataTags Interview

Page 129: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

Thanks! And join us at this year’s Dataverse Community Meeting

Page 130: The Rise of Data Publishing in the Digital World (and how Dataverse and DataTags Help)

References

• http://dataverse.org

• http://dataverse.harvard.edu

• http://datatags.org

• Sweeney L, Crosas M, Bar-Sinai M, 2015, Sharing Sensitive Data with Confidence: The DataTags System. Technology Science, http://techscience.org/a/2015101601

• Gross, Harmon, Reidy, 2001, Communicating Science

• Mabe, 2003, The Growth and Number of Journals

• Friendly, 2006, A Brief History of Data Visualization