Heidorn: The Path to Enlightened Solutions for Biodiversity's Dark Data, ViBRANT 2011


The Path to Enlightened Solutions for Biodiversity's Dark Data. Keynote at Scripting Life: the science behind ViBRANT. http://vbrant.eu/presentations


P. Bryan Heidorn

University of Arizona and JRS Biodiversity Foundation

2011 Scripting Life: the science behind ViBRANT

Paris, France

20-21 January 2011

The Path to Enlightened Solutions for Biodiversity's Dark Data

University of Arizona

Today: 25°C, Sunny

Thesis

Large amounts of data remain uncurated

Most of that data is from small data sets and is currently largely invisible – Dark Data

This data should be curated locally but not by scientists alone

Need for long-lived institutions

Cyberinfrastructure Vision

“The anticipated growth in both the production and repurposing of digital data raises complex issues not only of scale and heterogeneity, but also of stewardship, curation and long-term access.”

NSF Cyberinfrastructure Vision for 21st Century Discovery, Chapter 3

Recognition of need for data curation

“Recommendation 6: The NSF, working in partnership with collection managers and the community at large, should act to develop and mature the career path for data scientists and to ensure that the research enterprise includes a sufficient number of high-quality data scientists.”

Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century, Recommendations

Recognition of the importance of Information

Recognition of the need for education

New work roles within traditional institutions

Interagency Working Group on Digital Data

Why Libraries and Museums

Long history of scholarly data management

Skills overlap, such as development of metadata standards, ontologies, controlled vocabularies, thesauri

Long-lived institutions

Existing overlap with museums and archives

The problem

Recognition of the problem

Information is not in an accessible format

Computer Science, Information Science and Technology have not addressed the problem

No training or incentive for data generators

Dark data is the data that we know is/was there but we can’t see it.

Hubble Space Telescope composite image "ring" of dark matter in the galaxy cluster Cl 0024+17

Related Ideas

John Porter: Deep versus Wide databases

Swanson: Undiscovered Public Knowledge

Science Commons: Big versus Small science

$f(x) = a x^{k} + o(x^{k})$

Power Law of Science Data

$f(x) = a x^{k} + o(x^{k}) \mid x < 0.20$

[Figure: Data Volume (vertical axis) against Science Projects and Initiatives (horizontal axis).]
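As a rough illustration of the power-law claim, the sketch below gives each project a data volume from the leading term a·x^k alone and reports how much of the total sits in the largest 20% of projects and how much in the long tail. It assumes x is the project's rank (1 = largest); the values of `a`, `k` and the number of projects are hypothetical and are not taken from the talk.

```python
def power_law_volumes(n_projects, a=1.0, k=-1.0):
    """Volume of the project at rank x (1 = largest), using f(x) = a * x**k.

    a and k are hypothetical illustration values, not figures from the talk.
    """
    return [a * rank ** k for rank in range(1, n_projects + 1)]


def head_share(volumes, head_fraction=0.20):
    """Fraction of the total volume held by the largest `head_fraction` of projects."""
    ordered = sorted(volumes, reverse=True)
    head_count = max(1, int(len(ordered) * head_fraction))
    return sum(ordered[:head_count]) / sum(ordered)


volumes = power_law_volumes(1000)  # hypothetical: 1000 projects, k = -1
head = head_share(volumes)
tail = 1.0 - head
print(f"head (top 20%): {head:.1%}, long tail (bottom 80%): {tail:.1%}")
```

A steeper (more negative) `k` concentrates more volume in the head, while a `k` closer to zero shifts more of the total into the many small projects of the tail.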

Does NSF’s Data Follow the Power Law?

I do not know, but if $1 = X bytes...

[Figure: Awarded Amount 2007. Vertical axis: award amount, $0 to $7,000,000. Horizontal axis: grants, numbered 1 to 8776.]

20-80 Rule: The small are big!

Total Grants: 9347

Total Dollars: $2,137,636,716

Share of grants    20%                    80%
Number of Grants   1869                   7478
Total Dollars      $1,199,088,125         $938,548,595
Range              $6,892,810-$350,000    $350,000-$831
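The arithmetic behind the 20-80 slide can be checked with a short sketch using only the figures shown above. Under the talk's working premise that data volume scales with dollars ($1 = X bytes), the dollar share stands in for the data share; that premise is an assumption, not a measurement. The function and variable names are illustrative.

```python
# Figures as given on the slide (NSF awards, 2007).
total_grants = 9347
total_dollars = 2_137_636_716
top = {"grants": 1869, "dollars": 1_199_088_125}    # largest 20% of grants
bottom = {"grants": 7478, "dollars": 938_548_595}   # remaining 80% of grants


def summarize(group):
    """Share of grants, share of dollars, and mean award size for one group."""
    return {
        "grant_share": group["grants"] / total_grants,
        "dollar_share": group["dollars"] / total_dollars,
        "mean_award": group["dollars"] / group["grants"],
    }


top_summary = summarize(top)
bottom_summary = summarize(bottom)

for label, s in (("largest 20%", top_summary), ("remaining 80%", bottom_summary)):
    print(
        f"{label}: {s['grant_share']:.1%} of grants, "
        f"{s['dollar_share']:.1%} of dollars, "
        f"mean award ${s['mean_award']:,.0f}"
    )
```

On these figures the 80% of smaller grants account for roughly 44% of the dollars, which is the sense in which "the small are big".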

Biology 2009

#Grants: 1886

$Total: $744,168,471 ≈ €550,000,000

Distribution: 1266 < $0.5 million ≈ €370,000

Mode: $304,691 ≈ €225,000

Myth of the mega-project

Because it is high volume

Because it is information rich – high entropy

While needs of large data are understood, small data and integration are not understood

Heidorn, P. Bryan (2008). Shedding Light on the Dark Data in the Long Tail of Science. Library Trends 57(2), Fall 2008. Institutional Repositories: Current State and Future. Edited by Sarah Shreeves and Melissa Cragin. (http://hdl.handle.net/2142/9127).

Small data is big science

Where to find dark data

Scientists' backpacks and desks

Literature/Biodiversity Heritage Library

Museum Specimens

Field notes

Citizen Observations

What is dark data good for?

Ecological Niche Modeling

Climate Change niche change prediction

Taxonomic Name Resolution

Literature Search Support

Taxonomic intelligence

Key-like – character searching

Phenology and Phenology change

Food-web / trophic level

Problematic Transition

Personal Information Management vs Knowledge Organization

Pluralistic vs Unified (Hjørland, 2007)

Contrast in Styles (White, in press)

Personal Information Management

One-Few users

Visual/Spatial

Project Oriented

Knowledge Organization

Many users

Language based

Long-term orientation

New Information Disciplines

Digital Curator: an expert knowledgeable of and with responsibility for the content of a digital collection(s)

Digital Archivist: an expert competent to appraise, acquire, authenticate, preserve, and provide access to records in digital form

Data Scientists: the information and computer scientists, database and software engineers and programmers, disciplinary experts, expert annotators, and others, who are crucial to the successful management of a digital data collection

(Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century, report of the National Science Board, September 2005)

Roles

Skills

Library Roles

Life Cycle Phases: Plan, Create, Keep, Dispose

Data Management Function: Access, Document, Organize, Protect

How to Organize at a higher level?

It is difficult to find what is already known

Clonal specimens may be stored in different museums around the world

DNA analysis may be conducted on one but not the other

Micrographs may be in a database

Taxonomic treatments or revisions may exist

Biological Science Collections (BiSciCol) Tracker

[Figure: BiSciCol Tracker. Diagram labels: S1: KNM (Nairobi National Museum); S2: MNHN (Muséum national d'histoire naturelle); S3: MBG (Living Collection: Missouri Botanical Garden); Determination; Gene Sequence; GENBANK; Parasitism; Agave sisalana. Several links in the diagram are marked "?".]
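The tracking problem described above, where specimens, sequences and determinations are held by different institutions, can be pictured as a graph of linked records. The sketch below is a minimal illustration of that idea, not the BiSciCol implementation: every identifier and link in it is hypothetical, and the traversal is an ordinary breadth-first search.

```python
from collections import defaultdict, deque

# Hypothetical identifiers and links, for illustration only.
links = [
    ("specimen:S1-KNM", "specimen:S2-MNHN"),    # assumed clones of one plant
    ("specimen:S2-MNHN", "specimen:S3-MBG"),
    ("specimen:S1-KNM", "sequence:GENBANK-1"),  # hypothetical sequence record
    ("specimen:S3-MBG", "determination:D1"),    # hypothetical determination
]


def build_graph(pairs):
    """Undirected adjacency map: record id -> set of directly linked record ids."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def related_records(graph, start):
    """Breadth-first search: every record reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}


graph = build_graph(links)
found = related_records(graph, "specimen:S2-MNHN")
print(sorted(found))
```

Starting from the specimen held at one museum, the search surfaces the sequence and the determination attached to its clones elsewhere, which is the kind of already-known information the slide says is hard to find.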

The Future is all about Data

How do we get it?

How do we analyze it?

How do we disseminate it (maps, charts, tables...)?

How do we keep it?

Provenance, Storage, Weeding

How do we make it sustainable?

Digital/Data Curation Programs

University of Illinois, Graduate School of Library and Information Science

University of Arizona, School of Information Resources and Library Science

University of North Carolina, School of Information and Library Science

Education Needs

Biological Information Specialist

Concentration in Data Curation (MSLIS)

Certificate of Advanced Study in Data Curation for Libraries and Scientists

Information and professional education in biodiversity informatics

MSLIS Data Curation Concentration

Data Curation Educational Program (DCEP)

IMLS – Laura Bush 21st Century Librarian Program,

RE-05-06-0036-06 (Heidorn, PI)

Students with the DC concentration will be trained to add value to data and promote sharing across labs and disciplinary specializations

Biological Information Specialists

At present:

Biologists at all degree levels self-trained in information technology

Information technologists at all degree levels self-trained in biology

(both with gaps in knowledge for many months, years)

Differing roles of BIS in large and small

Master of Science in Biological Informatics

Degree Program began September 2007

Part of campus-wide bioinformatics masters program

NSF/CISE/IIS, Education Research and Curriculum Development, 0534567 (Palmer, PI)

Combines Biology, Bioinformatics, Computer Science core with LIS courses

What does a BIS need to know?

Biological training and interest in solving biological research problems

Information skills

Evaluation and implementation of information systems: user-based assessment and continual quality improvement for the development of tools that work and are used.

Information acquisition, management, and dissemination: development of digital libraries, data archives, institutional repositories, and related tools.

Information organization and integration: ontology development, structuring information for optimal use and sharing, and standards development.

UIUC bioinformatics core coursework

Cross-disciplinary course distribution requirement

Bioinformatics: Computing in Molecular Biology; Algorithms in Bioinformatics; Principles of Systematics

Computer Science: Algorithms; Database Systems

Biology: Human Genetics; Introductory Biochemistry; Macromolecular Modeling

Sample of existing LIS courses

Information Organization and Knowledge Representation

LIS 551 Interfaces to Information Systems

LIS 590DM Document Modeling

LIS 590RO Representing and Organizing Information Resources

LIS 590ON Ontologies in Natural Science

Information Resources, Uses and users

LIS 503 Use and Users of Information

LIS 522 Information Sources in the Sciences

LIS 590TR Information Transfer and Collaboration in Science

Information Systems

LIS 456 Information Storage and Retrieval

LIS 509 Building Digital Libraries

LIS 566 Architecture of Network Information Systems

LIS 590EP Electronic Publishing

Disciplinary Focus

LIS 530B Health Sciences Information Services and Resources

LIS 590HI Healthcare Informatics (Healthcare Infrastructure)

LIS 590EI/BDI Ecological Informatics (Biodiversity Informatics)

University of Arizona

Graduate Certificate in Digital Records Management

Six Graduate Courses within MLA program

Focus on repositories

Cross over with Knowledge Representation and Metadata

Workforce

Data Curation Workforce Summit, Dec 6th at IDCC Chicago

Identify the skill sets needed for government data curation

Department of Energy, US National Science Foundation, Institute of Museum and Library Services, Oak Ridge National Laboratory, USGS National Biological Information Infrastructure, CIESIN

The Future is Collaboration and Data Sharing

• Libraries

• Museums

• Government

• Universities

• NGO

• Private Land Holders

• Ranches

• Farms

To bring the best data to the major problems and opportunities of our time and the future

Merci