P. Bryan Heidorn University of Arizona and JRS Biodiversity Foundation 2011 Scripting Life: the science behind ViBRANT Paris, France 20-21 January 2011 The Path to Enlightened Solutions for Biodiversity's Dark Data


DESCRIPTION

The Path to Enlightened Solutions for Biodiversity's Dark Data Keynote at Scripting Life: the science behind ViBRANT http://vbrant.eu/presentations


Page 1: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

P. Bryan Heidorn
University of Arizona and JRS Biodiversity Foundation

2011 Scripting Life: the science behind ViBRANT
Paris, France

20-21 January 2011

The Path to Enlightened Solutions for Biodiversity's Dark Data

Page 2: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

University of Arizona

Today: 25°C, Sunny

Page 3: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Thesis

Large amounts of data remain uncurated

Most of that data is from small data sets and is currently largely invisible – Dark Data

This data should be curated locally but not by scientists alone

Need for long-lived institutions

Page 4: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Cyberinfrastructure Vision

“The anticipated growth in both the production and repurposing of digital data raises complex issues not only of scale and heterogeneity, but also of stewardship, curation and long-term access.”

NSF Cyberinfrastructure Vision for 21st Century Discovery, Chapter 3

Page 5: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Recognition of need for data curation

“Recommendation 6: The NSF, working in partnership with collection managers and the community at large, should act to develop and mature the career path for data scientists and to ensure that the research enterprise includes a sufficient number of high-quality data scientists.”

Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century, Recommendations

Page 6: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Recognition of the importance of Information

Recognition of the need for education

New work roles within traditional institutions

Interagency Working Group on Digital Data

Page 7: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Why Libraries and Museums

Long history of scholarly data management

Skills overlap, such as development of metadata standards, ontologies, controlled vocabularies, thesauri

Long-lived institutions

Existing overlap with museums and archives

Page 8: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

The problem

Recognition of the problem

Information is not in accessible format

Computer Science, Information Science and Technology has not addressed the problem

No training or incentive for data generators

Page 9: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Dark data is the data that we know is/was there but we can’t see it.

Hubble Space Telescope composite image "ring" of dark matter in the galaxy cluster Cl 0024+17

Page 10: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Related Ideas

John Porter: Deep versus Wide databases

Swanson: Undiscovered Public Knowledge

Science Commons: Big versus Small Science

Page 11: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

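Power Law of Science Data

$f(x) = a x^{k} + o(x^{k})$

[Figure: power-law curve. Vertical axis: Data Volume. Horizontal axis: Science Projects and Initiatives. A second label on the plot reads $f(x) = a x^{k} + o(x^{k})$ for $x < 0.20$.]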
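To make the power-law claim concrete, here is a minimal sketch. It assumes a pure power law with hypothetical values a = 1 and k = -1 (the slides give neither), drops the lower-order o(x^k) term, and ranks a hypothetical population of 1000 projects from largest to smallest. The function names are illustrative only.

```python
def power_law(rank, a=1.0, k=-1.0):
    """Data volume of the project at a given rank under f(x) = a * x**k.

    a and k are assumed values for illustration, not taken from the talk.
    """
    return a * rank ** k


def head_share(n_projects, head_fraction=0.20, a=1.0, k=-1.0):
    """Fraction of total volume held by the largest `head_fraction` of projects."""
    volumes = [power_law(rank, a, k) for rank in range(1, n_projects + 1)]
    head_count = round(n_projects * head_fraction)
    return sum(volumes[:head_count]) / sum(volumes)


if __name__ == "__main__":
    share = head_share(1000)
    print(f"Largest 20% of projects hold {share:.0%} of the volume")
    print(f"Remaining 80% of projects hold {1 - share:.0%} of the volume")
```

Changing k changes how heavy the tail is, so the share held by the many small projects depends entirely on the assumed exponent.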

Page 12: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Does NSF’s Data Follow the Power Law?

I do not know, but if $1 = X bytes…

[Figure: Awarded Amount 2007. Vertical axis: award amount, $0 to $7,000,000. Horizontal axis ticks run from 1 to 8776.]

Page 13: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

20-80 Rule: The small are big!

Total Grants: 9347
Total Dollars: $2,137,636,716

                 Largest 20%            Smallest 80%
Number Grants    1869                   7478
Total Dollars    $1,199,088,125         $938,548,595
Range            $6,892,810-$350,000    $350,000-$831
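The arithmetic behind the 20-80 split can be checked with a short sketch. The figures are copied from the slide. Computing each share from the sum of the two columns, rather than from the stated total, is an assumption of this example, and the helper name `share` is illustrative.

```python
# Figures copied from the slide (awarded amounts, 2007).
LARGEST_20 = {"grants": 1869, "dollars": 1_199_088_125}
SMALLEST_80 = {"grants": 7478, "dollars": 938_548_595}


def share(part, other):
    """Return `part` as a fraction of `part + other`."""
    return part / (part + other)


grant_share_small = share(SMALLEST_80["grants"], LARGEST_20["grants"])
dollar_share_small = share(SMALLEST_80["dollars"], LARGEST_20["dollars"])

print(f"Smallest grants: {grant_share_small:.1%} of awards, "
      f"{dollar_share_small:.1%} of dollars")
```

On these figures, the smaller 80% of awards still account for well over 40% of the money, which is the point of "the small are big".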

Page 14: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Biology 2009

#Grants: 1886

$Total: $744,168,471 ≈ €550,000,000

Distribution: 1266 < $.5 million ≈ €370,000

Mode: $304,691 ≈ €225,000

Myth of the mega-project
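A quick check of the Biology 2009 figures, as a sketch: the constants are copied from the slide, and the derived quantities (share of grants under $0.5 million, mean award) are computed here and were not stated in the talk.

```python
# Figures copied from the Biology 2009 slide.
N_GRANTS = 1886
TOTAL_DOLLARS = 744_168_471
GRANTS_UNDER_HALF_MILLION = 1266
MODE_DOLLARS = 304_691

share_under_half_million = GRANTS_UNDER_HALF_MILLION / N_GRANTS
mean_award = TOTAL_DOLLARS / N_GRANTS

print(f"Grants under $0.5 million: {share_under_half_million:.1%}")
print(f"Mean award: ${mean_award:,.0f} (mode: ${MODE_DOLLARS:,})")
```

About two thirds of the grants fall under $0.5 million, and the mean award sits above the mode, as expected for a distribution with a few large awards and many small ones.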

Page 15: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Because it is high volume

Because it is information rich: high entropy

While needs of large data are understood, small data and integration are not understood

Heidorn, P. Bryan (2008). Shedding Light on the Dark Data in the Long Tail of Science. Library Trends 57(2), Fall 2008. Institutional Repositories: Current State and Future. Edited by Sarah Shreeves and Melissa Cragin. (http://hdl.handle.net/2142/9127).

Small data is big science

Page 16: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Where to find dark data

Scientists’ backpacks and desks
Literature/Biodiversity Heritage Library
Museum Specimens
Field notes
Citizen Observations

Page 17: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

What is dark data good for?

Ecological Niche Modeling
Climate Change niche change prediction
Taxonomic Name Resolution
Literature Search Support
Taxonomic intelligence
Key-like character searching
Phenology and Phenology change
Food-web / trophic level

Page 18: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Problematic Transition

Personal Information Management vs Knowledge Organization

Pluralistic vs Unified (Hjørland, 2007)

Page 19: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Contrast in Styles (White, in press)

Personal Information Management
One-Few users
Visual/Spatial
Project Oriented

Knowledge Organization
Many users
Language based
Long-term orientation

Page 20: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

New Information Disciplines

Digital Curator: an expert knowledgeable of and with responsibility for the content of a digital collection(s)

Digital Archivist: an expert competent to appraise, acquire, authenticate, preserve, and provide access to records in digital form

Data Scientists: the information and computer scientists, database and software engineers and programmers, disciplinary experts, expert annotators, and others, who are crucial to the successful management of a digital data collection

(Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century, report of the National Science Board, September 2005)

Page 21: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Roles

Page 22: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Skills

Page 23: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Library Roles

Life Cycle Phases: Plan, Create, Keep, Dispose

Data Management Function: Access, Document, Organize, Protect

Page 24: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

How to Organize at a higher level?

It is difficult to find what is already known
Clonal specimens may be stored in different museums around the world
DNA analysis may be conducted on one but not the other
Micrographs may be in a database
Taxonomic treatments or revisions may exist

Page 25: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Biological Science Collections (BiSciCol) Tracker

[Diagram: three specimen records connected by unresolved ("?") relationships.
S1: KNM (Nairobi National Museum)
S2: MNHN (Muséum national d'histoire naturelle)
S3: MBG (Living Collection: Missouri Botanical Garden)
Linked items: Determination, Gene Sequence (GENBANK), Parasitism, Agave sisalana]
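The linking problem on this slide can be pictured as a small graph of records. The sketch below is illustrative only and is not BiSciCol's actual design: the `RecordGraph` class, the relation labels, the GenBank identifier, and every link are hypothetical. Only the institution abbreviations and Agave sisalana come from the slide.

```python
from collections import defaultdict


class RecordGraph:
    """Undirected graph of records joined by labelled relationships."""

    def __init__(self):
        self._links = defaultdict(set)

    def link(self, source, relation, target):
        """Record a relationship in both directions."""
        self._links[source].add((relation, target))
        self._links[target].add((relation, source))

    def linked_records(self, start):
        """Return every record reachable from `start`, excluding `start`."""
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for _relation, neighbour in self._links[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        seen.discard(start)
        return seen


# All links below are hypothetical examples, not data from the talk.
graph = RecordGraph()
graph.link("S1:KNM", "clone_of", "S2:MNHN")
graph.link("S2:MNHN", "clone_of", "S3:MBG")
graph.link("S2:MNHN", "has_sequence", "genbank:HYPOTHETICAL-0001")
graph.link("S3:MBG", "determined_as", "Agave sisalana")

print(sorted(graph.linked_records("S1:KNM")))
```

With links like these recorded, a sequence attached to one specimen becomes discoverable from its clones held in other institutions, which is the gap the "?" marks in the diagram point to.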

Page 26: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

BiSciCol Tracker

Page 27: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

The Future is all about Data

How do we get it?
How do we analyze it?
How do we disseminate it (maps, charts, tables...)?
How do we keep it?
Provenance, Storage, Weeding
How do we make it sustainable?

Page 28: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Digital/Data Curation Programs

University of Illinois
Graduate School of Library and Information Science

University of Arizona
School of Information Resources and Library Science

University of North Carolina
School of Information and Library Science

Page 29: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Education Needs

Biological Information Specialist

Concentration in Data Curation (MSLIS)

Certificate of Advanced Study in Data Curation for Libraries and Scientists

Information and professional education in biodiversity informatics

Page 30: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

MSLIS Data Curation Concentration

Data Curation Educational Program (DCEP)

IMLS – Laura Bush 21st Century Librarian Program,

RE-05-06-0036-06 (Heidorn, PI)

Students with the DC concentration will be trained to add value to data and promote sharing across labs and disciplinary specializations

Page 31: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Biological Information Specialists

At present:

Biologists at all degree levels self-trained in information technology

Information technologists at all degree levels self-trained in biology

(both with gaps in knowledge for many months, years)

Differing roles of BIS in large and small

Page 32: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Master of Science in Biological Informatics

Degree Program began September 2007

Part of campus-wide bioinformatics masters program

NSF/CISE/IIS, Education Research and Curriculum Development, 0534567 (Palmer, PI)

Combines Biology, Bioinformatics, Computer Science core with LIS courses

Page 33: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

What does a BIS need to know?

Biological training and interest in solving biological research problems

Information skills

Evaluation and implementation of information systems: user-based assessment and continual quality improvement for the development of tools that work and are used.

Information acquisition, management, and dissemination: development of digital libraries, data archives, institutional repositories, and related tools.

Information organization and integration: ontology development, structuring information for optimal use and sharing, and standards development.

Page 34: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

UIUC bioinformatics core coursework

Cross-disciplinary course distribution requirement

Bioinformatics:
Computing in Molecular Biology
Algorithms in Bioinformatics
Principles of Systematics

Computer Science:
Algorithms
Database Systems

Biology:
Human Genetics
Introductory Biochemistry
Macromolecular Modeling

Page 35: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Sample of existing LIS courses

Information Organization and Knowledge Representation
LIS 551 Interfaces to Information Systems
LIS 590DM Document Modeling
LIS 590RO Representing and Organizing Information Resources
LIS 590ON Ontologies in Natural Science

Information Resources, Uses and Users
LIS 503 Use and Users of Information
LIS 522 Information Sources in the Sciences
LIS 590TR Information Transfer and Collaboration in Science

Information Systems
LIS 456 Information Storage and Retrieval
LIS 509 Building Digital Libraries
LIS 566 Architecture of Network Information Systems
LIS 590EP Electronic Publishing

Disciplinary Focus
LIS 530B Health Sciences Information Services and Resources
LIS 590HI Healthcare Informatics (Healthcare Infrastructure)
LIS 590EI/BDI Ecological Informatics (Biodiversity Informatics)

Page 36: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

University of Arizona
Graduate Certificate in Digital Records Management

Six Graduate Courses within MLA program

Focus on repositories

Cross over with Knowledge Representation and Metadata

Page 37: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Workforce

Data Curation Workforce Summit, Dec 6th at IDCC Chicago

Identify the skill sets needed for government data curation

Department of Energy, US National Science Foundation, Institute of Museum and Library Services, Oak Ridge National Laboratory, USGS National Biological Information Infrastructure, CIESIN

Page 38: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

The Future is Collaboration and Data Sharing

• Libraries
• Museums
• Government
• Universities
• NGO
• Private Land Holders
• Ranches
• Farms

To bring the best data to the major problems and opportunities of our time and the future

Page 39: Heidorn The Path to Enlightened Solutions for Biodiversity's Dark DataViBRANT 2011

Merci (Thank you)