A Deep Survey of the Digital Resource Landscape: Perspectives from the Neuroscience Information Framework
Maryann E. Martone, Ph.D., University of California, San Diego


DESCRIPTION

Presentation at the National Science Foundation


Page 1: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

A Deep Survey of the Digital Resource Landscape:

Perspectives from the Neuroscience Information Framework

Maryann E. Martone, Ph.D., University of California, San Diego

Page 2: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

• NIF is an initiative of the NIH Blueprint consortium of institutes
– What types of resources (data, tools, materials, services) are available to the neuroscience community?
– How many are there?
– What domains do they cover? What domains do they not cover?
– Where are they?
  • Web sites
  • Databases
  • Literature
  • Supplementary material
  • PDF files
  • Desk drawers
– Who uses them?
– Who creates them?
– How can we find them?
– How can we make them better in the future?

http://neuinfo.org

Page 3: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

The Neuroscience Information Framework

• NIF has developed a production technology platform for researchers to:
– Discover
– Share
– Analyze
– Integrate neuroscience-relevant information
• Since 2008, NIF has assembled the largest searchable catalog of neuroscience data and resources on the web
• Cost-effective and innovative strategy for managing data assets

“This unique data depository serves as a model for other Web sites to provide research data.” - Choice Reviews Online

NIF is poised to capitalize on the new tools and emphasis on big data and open science

Page 4: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

http://neuinfo.org

June 10, 2013, dkCOIN Investigator's Retreat

The Neuroscience Information Framework: Discovery and utilization of web-based resources for neuroscience

• A portal for finding and using neuroscience resources

A consistent framework for describing resources

Provides simultaneous search of multiple types of information, organized by category

Supported by an expansive ontology for neuroscience

Utilizes advanced technologies to search the “hidden web”

UCSD, Yale, Cal Tech, George Mason, Washington Univ

Literature

Database Federation

Registry

Page 5: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Part 1: Surveying the resource landscape

• NIF Registry: A catalog of neuroscience-relevant resources
  • > 6000 currently listed
  • > 2200 databases
• And we are finding more every day

Page 6: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

How do resources get added to the NIF Registry?

• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines

• NIF Registry
  • Requires no special skills
  • Site map available for local hosting
• NIF Data Federation
  • DISCO interop
  • Requires some programming skill

Bandrowski et al., 2012

Page 7: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

NIF Registry

• ~2003: First catalog: SFN Neuroscience Database Gateway
• 2006: NIF 0.5
• 2008: NIF 1.0+

• Simple metadata model
– Name, description, type, URL, other names, keywords, unique identifier
• Extended over time
– Parent resource
– Supporting agency
– Grant numbers
– Accessibility
– Related to
– Organism
– Disease or condition
– Last updated
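The metadata model on this slide could be written down as a small record type. The following is only a sketch: the class name `RegistryEntry`, the field names, the choice of which fields count as required, and the example values are all assumptions made for illustration, not the NIF Registry's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RegistryEntry:
    """Hypothetical record mirroring the fields named on the slide."""
    # Fields of the simple metadata model
    identifier: str
    name: str
    description: str
    resource_type: str
    url: str
    other_names: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    # Fields added as the model was extended over time
    parent_resource: Optional[str] = None
    supporting_agency: Optional[str] = None
    grant_numbers: List[str] = field(default_factory=list)
    accessibility: Optional[str] = None
    related_to: List[str] = field(default_factory=list)
    organism: Optional[str] = None
    disease_or_condition: Optional[str] = None
    last_updated: Optional[str] = None


def missing_core_fields(entry: RegistryEntry) -> List[str]:
    """List the core fields that are empty (which fields are 'core' is assumed)."""
    core = ["identifier", "name", "description", "resource_type", "url"]
    return [name for name in core if not getattr(entry, name)]


# Made-up example entry with an empty description.
entry = RegistryEntry(
    identifier="example-0001",
    name="Example Atlas",
    description="",
    resource_type="Database",
    url="http://example.org",
)
print(missing_core_fields(entry))
```

Keeping the later fields optional reflects the slide's point that the model started simple and was extended, so older entries may not carry the newer fields.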

Page 8: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Resource Curation

• NIF Registry is hosted on the Semantic MediaWiki platform Neurolex
– Community can add, review, edit without special privileges
– Searchable by Google
– Integrated with NIF ontologies
– Graph structure

http://neurolex.org

Page 9: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

The resource graph

NIF is creating the linked data graph of resources

Page 10: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Keeping the Registry Current

– NIF employs an automated link checker
– Last analysis: 478/6100 invalid URLs (~8%)
– 199 could not be located at another university or location and appear to be out of service (~3%)
– Bigger issue: Many resources are no longer updated or maintained

[Figure: resources added and last updated, by year, 1996 to 2014; two vertical scales, 0 to 200 and 0 to 3500]
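An automated link checker of the kind mentioned on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not NIF's tool: the function names are hypothetical, a URL is counted as valid when a HEAD request returns a 2xx or 3xx status, and the status lookup is injectable so the summary logic can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a HEAD request, or 0 if the request fails."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        return 0


def check_links(urls: Iterable[str],
                fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each URL to True when its status is 2xx or 3xx (assumed criterion)."""
    return {url: 200 <= fetch(url) < 400 for url in urls}


def invalid_fraction(results: Dict[str, bool]) -> float:
    """Fraction of checked URLs that were judged invalid."""
    if not results:
        return 0.0
    return sum(1 for ok in results.values() if not ok) / len(results)


# Offline demonstration with made-up URLs and canned status codes.
canned = {"http://a.example": 200, "http://b.example": 404, "http://c.example": 0}
results = check_links(canned, fetch=canned.get)
print(results, invalid_fraction(results))
```

Applied to the counts on the slide, the same ratio is 478/6100, which rounds to the ~8% reported.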

Page 11: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Tracking the fate of digital resources

• Automated text mining is used to look for “web page last updated” or copyright dates
– Identified for 570 resources; manual review suggested that the results were accurate, although we can’t guarantee that the date itself is accurate
– 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources identified by 3DVC for their catalog
– 38 not updated within the past 2 years (~20%)
– 8 migrated to new addresses or institutions
– 7 are no longer in service (~3%)
– 3 were deemed no longer appropriate

Yuling Li, Paul Sternberg, Cal Tech
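Looking for "last updated" or copyright dates in page text can be approximated with regular expressions. The sketch below is an assumption about how such a step might look, not the pipeline used by the authors: the two patterns, the function names, and the year-level staleness rule are all hypothetical, and real pages would need many more date formats.

```python
import re
from datetime import date
from typing import Optional

# Assumed patterns: a "last updated/modified" phrase followed shortly by a
# four-digit year, or a copyright notice with a single year or a year range.
LAST_UPDATED = re.compile(
    r"last\s+(?:updated|modified).{0,40}?\b((?:19|20)\d{2})\b", re.IGNORECASE)
COPYRIGHT = re.compile(
    r"(?:©|\(c\)|copyright)\s*(?:(?:19|20)\d{2}\s*[-–]\s*)?((?:19|20)\d{2})\b",
    re.IGNORECASE)


def last_updated_year(page_text: str) -> Optional[int]:
    """Return the first year found by either pattern, preferring 'last updated'."""
    for pattern in (LAST_UPDATED, COPYRIGHT):
        match = pattern.search(page_text)
        if match:
            return int(match.group(1))
    return None


def is_stale(year: Optional[int], today: date, max_age_years: int = 2) -> Optional[bool]:
    """True when the year is more than max_age_years behind today's year."""
    if year is None:
        return None
    return today.year - year > max_age_years


print(last_updated_year("Page last updated: March 3, 2011"))
print(last_updated_year("Copyright 2005-2012 Example Lab"))
print(is_stale(2010, date(2013, 6, 10)))
```

As the slide cautions, a date extracted this way says only what the page claims; it does not guarantee that the claimed date is accurate.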

Page 12: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Keeping content up to date

Connectome

Tractography

Epigenetics

• New tags come into existence
• New resource types come into existence, e.g., mobile apps
• Resources add new types of content
  • Change name
  • Change scope
• > 7000 updates to the registry last year

It’s a challenge to keep the registry up to date; sitemaps, curation, ontologies, community review

Page 13: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Ontology provides a human-centric model for search and data integration

Page 14: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Last updated...

• Some neglected resources are still valuable
– Complete data sets
– Rare data
• Software may still be usable
• Some databases, however, may only be of historical interest
– “all metalloproteins found in PDB”

Are all databases and data sets equally valuable?

Page 15: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Summary

• The NIF Registry has created a linked data graph of web-accessible resources
• Maintained on a community wiki platform
• Provides data on the fluidity of the resource landscape
– New resources continue to be created and found
– Relatively few disappear altogether
– Many more grow stale, although their value may still be significant
– Maintaining up-to-date curation requires frequent updating

NIF Registry provides insight into the state of digital resources on the web

Page 16: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Part 2: Surveying the data landscape

• The NIF data federation performs deep search over the content of over 200 databases
• New databases are added at a rate of 25-40 per year
  • Latest update: Open Source Brain; ingest completed in 2 hours
• Databases chosen on a variety of criteria:
  • Early: testing different types of resources
  • Thematic areas
  • Volunteers

Page 17: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Data Federation Growth

NIF searches the largest collation of neuroscience-relevant data on the web

[Figure: number of federated records (millions) and number of federated databases over time, Jun-08 to May-13; labeled DISCO]

Page 18: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework


Data Ingestion Architecture

Current / Planned

DISCO Dashboard Functions
• Ingest Script Manager
• Public Script Repository
• Data & Event Tracker
• Versioning System
• Curator Tool
• Data Transformer Manager

Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd, Yale University

Page 19: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

DISCO Dashboard

• Management of registry resources through a single administrative dashboard

• Associated discovery pipeline

• Tools to manage data updates

• Change tracking

• Globally unique identifier creation

Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd, Yale University

Page 20: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

NIF data federation

NIF was designed to be populated rapidly with progressive refinement

Page 21: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

What are the connections of the hippocampus?

Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”

• Query expansion: Synonyms and related concepts
• Boolean queries
• Data sources categorized by “data type” and level of nervous system
• Common views across multiple sources
• Tutorials for using full resource when getting there from NIF
• Link back to record in original source
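The hippocampus example shows a search term being expanded into a Boolean OR over its synonyms. A minimal sketch of that step follows; the synonym table is hard-coded for illustration (in NIF the synonyms come from its ontologies), and the function names are hypothetical.

```python
from typing import Dict, List

# Illustrative synonym table; the entries are taken from the slide's example.
SYNONYMS: Dict[str, List[str]] = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> str:
    """Join the term and its known synonyms with OR."""
    alternatives = [term] + synonyms.get(term.lower(), [])
    return " OR ".join(quote(t) for t in alternatives)


print(expand_query("Hippocampus"))
```

A term with no entry in the table is returned unchanged, so the expansion never narrows a search.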

Page 22: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Results are organized within a common framework

• Connects to
• Synapsed with
• Synapsed by
• Input region
• innervates
• Axon innervates
• Projects to
• Cellular contact
• Subcellular contact
• Source site
• Target site

Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases

Page 23: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

NIF Semantic Framework: NIFSTD ontology

• NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology

NIFSTD: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure

Page 24: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Use of Ontologies
• Controlled vocabulary for describing type of resource and content
– Database, Image, Diabetes
• Entity-mapping of database and data content
• Data integration across sources
• Search: Mixture of mapped content and string-based search
– Different parts of the infrastructure use the vocabularies in different ways
– Utilize synonyms, parents, children to refine search
– Increasing use of other relationships and logical inferencing
• Generation of semantic content (i.e., RDF, Linked Data)

Page 25: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

NIF Concept Mapper

Aligns sources to the NIF semantic framework

Page 26: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Column level mapping: Reducing false positives

Page 27: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

The scourge of neuroanatomical nomenclature: Importance of NIF semantic framework

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (Human fMRI)
  • Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
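The counts above compare the brain-region vocabularies of several databases, first by exact term and then after synonym mapping. The sketch below shows one way such counts could be computed on toy data. It is not the analysis behind the slide: the database names, terms, synonym table, and the case-insensitive normalization are assumptions, and partonomy matching is left out.

```python
from collections import Counter
from typing import Dict, Set


def normalize(term: str) -> str:
    """Lower-case and collapse whitespace (an assumed notion of 'exact')."""
    return " ".join(term.lower().split())


def shared_exact_terms(databases: Dict[str, Set[str]]) -> Set[str]:
    """Terms that appear, after normalization, in more than one database."""
    counts: Counter = Counter()
    for terms in databases.values():
        counts.update({normalize(t) for t in terms})
    return {term for term, n in counts.items() if n > 1}


def shared_after_synonyms(databases: Dict[str, Set[str]],
                          preferred: Dict[str, str]) -> Set[str]:
    """Same count after replacing each synonym with its preferred label.

    Keys and values of `preferred` are expected to be normalized already.
    """
    mapped = {
        name: {preferred.get(normalize(t), normalize(t)) for t in terms}
        for name, terms in databases.items()
    }
    return shared_exact_terms(mapped)


# Toy vocabularies; names and contents are invented for the example.
toy = {
    "db_a": {"Hippocampus", "CA1"},
    "db_b": {"hippocampus", "Ammon's horn"},
    "db_c": {"Cornu Ammonis", "Striatum"},
}
toy_synonyms = {"ammon's horn": "cornu ammonis"}

print(sorted(shared_exact_terms(toy)))
print(sorted(shared_after_synonyms(toy, toy_synonyms)))
```

Even in this toy case the synonym table adds a match that exact comparison misses, which is the effect the slide attributes to the NIF semantic framework.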

Page 28: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Content Annotation – Google Refine

Page 29: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Resource Provider Services - Linkout

Page 30: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

What have we learned: Grabbing the long tail of small data

• NIF can be used to survey the data landscape

• Analysis of NIF shows multiple databases with similar scope and content

• Many contain partially overlapping data

• Data “flows” from one resource to the next
– Data is reinterpreted, reanalyzed or added to

• Is duplication good or bad?

Page 31: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

What do you mean by data?
Databases come in many shapes and sizes

• Primary data:
– Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the meaning of data
  • E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries:
– Metadata
– Pointers to data sets or materials stored elsewhere
• Data aggregators
– Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
– Data acquired within a single context, e.g., Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

Page 32: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

NIF Analytics: The Neuroscience Landscape

NIF is in a unique position to answer questions about the neuroscience landscape

Where are the data?

[Figure: data by brain region versus data source; labeled brain regions include Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex, and Brain]

Vadim Astakhov, Kepler Workflow Engine

Page 33: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Whither neuroscience information?

What is easily machine processable and accessible

What is potentially knowable

What is known: Literature, images, human knowledge

Unstructured; natural language processing, entity recognition, image processing and analysis; communication

Page 34: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Open world meets closed world

We know a lot about some things and less about others; some of NIF’s sources are comprehensive; others are highly biased

But...NIF has > 900,000 antibodies, 250,000 model organisms, and 3 million microarray records

Page 35: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Diseases of nervous system

What drives discovery?

The combination of ontologies, diverse data and analytics lets us look at the current landscape in interesting ways

[Figure: diseases of the nervous system in NIH Reporter and in NIF data federated sources; labeled categories include Neurodegenerative, Seizure disorders, and Neoplastic disease of nervous system]

Page 36: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Embracing duplication: Data Mash ups

• NIF queries across 3 of approximately 10 fMRI databases
• Two resources, Brede and SUMSdb, curated activation foci from the literature
• ~300 PMIDs were common between Brede and SUMSdb
  • PMID serves as a unique identifier for an article
• Same information; value added

Data is additive
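Because the PMID identifies the article, records from two curated sources can be combined by joining on it. The following is a toy sketch of that join: the PMIDs and coordinates are invented, the record layout is hypothetical, and neither Brede's nor SUMSdb's real data format is represented.

```python
from typing import Dict, List, Tuple

Focus = Tuple[int, int, int]  # an (x, y, z) activation coordinate


def merge_by_pmid(source_a: Dict[str, List[Focus]],
                  source_b: Dict[str, List[Focus]]) -> Dict[str, Dict[str, List[Focus]]]:
    """Keep only articles present in both sources, with each source's foci side by side."""
    shared = source_a.keys() & source_b.keys()
    return {
        pmid: {"source_a": source_a[pmid], "source_b": source_b[pmid]}
        for pmid in sorted(shared)
    }


# Made-up records keyed by made-up PMIDs.
first_source = {"100001": [(-42, 12, 8)], "100002": [(10, -60, 20)]}
second_source = {"100002": [(11, -61, 19), (30, 4, 2)], "100003": [(0, 0, 0)]}

merged = merge_by_pmid(first_source, second_source)
print(merged)
```

Keeping both sources' foci under the shared identifier, rather than discarding one copy, matches the slide's view that overlapping curation is additive.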

Page 37: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Same data: different analysis

Chronic vs acute morphine in striatum

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
• Analysis:
  • 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  • Results for 1 gene were opposite in DRG and Gemma
  • 45 did not have enough information provided in the paper to make a judgment

Relatively simple standards would make life easier
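One simple standard that would help here is recording the baseline alongside each direction of change, so that claims can be brought to a common baseline before they are compared. The sketch below illustrates the idea for a two-condition comparison only; the function names, the "up"/"down" labels, and the choice of saline as the common baseline are assumptions, and it does not reproduce the analysis reported on the slide.

```python
from typing import Optional


def to_common_baseline(direction: str, baseline: str, common: str = "saline") -> str:
    """Express an 'up'/'down' claim relative to the common baseline.

    Assumes a two-condition comparison, so swapping the baseline flips the direction.
    """
    if direction not in ("up", "down"):
        raise ValueError(f"unexpected direction: {direction!r}")
    if baseline == common:
        return direction
    return "down" if direction == "up" else "up"


def compare_claims(claim_a: Optional[str], claim_b: Optional[str]) -> str:
    """Classify two claims that are already expressed against the same baseline."""
    if claim_a is None or claim_b is None:
        return "insufficient information"
    return "consistent" if claim_a == claim_b else "opposite"


# A claim reported against a chronic-morphine baseline and one against saline.
first = to_common_baseline("up", "chronic morphine")
second = to_common_baseline("down", "saline")
print(first, second, compare_claims(first, second))
```

Without the baseline recorded, the two claims in the example would look contradictory even though they describe the same change.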

Page 38: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Phases of NIF
• 2006-2008: A survey of what was out there
• 2008-2009: Strategy for resource discovery
– NIF Registry vs NIF data federation
– Ingestion of data contained within different technology platforms, e.g., XML vs relational vs RDF
– Effective search across semantically diverse sources
  • NIFSTD ontologies
• 2009-2011: Strategy for data integration
– Unified views across common sources
– Mapping of content to NIF vocabularies
• 2011-present: Data analytics
– Uniform external data references
• 2012-present: SciCrunch: unified biomedical resource services

NIF provides a strategy and set of tools applicable to all biomedical science

Page 39: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Where is the Neuroscience in NIF?
• Search semantics
• Ranking
• Resources supported by NIH Blueprint Institutes are more thoroughly covered
• Data types, e.g., Brain activation foci

Page 40: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Building a Uniform Resource Layer

Discoverability
• Data resources simply described
• Automated data harvesting technologies
• Common resource registry

Accessibility
• Data specified via simple semantics
• Data in a usable form
• Semantically-enabled search

Web of Data
• Enhanced semantics
• Standardized representation
• Linked Open Data - RDF

A production data (resource) catalog and underlying technology platform for researchers to discover, share, access, analyze, and integrate biomedical information

Page 41: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Community Built Uniform Resource Layer


SciCrunch

NIF

Neuroscience

MONARCH

Animal Models

Community Services

dkCOIN

Shared Resources

Undiagnosed Disease Program

Phenotype RCN

3D Virtual Cell

National Institute on Aging

One Mind for Research

BIRN

International Neuroinformatics

Coordinating Facility

Model Organism Databases

Community Outreach

DELSA

Varied

(not just a data catalog)

Page 42: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Each project shares resources and adds unique value to the resource layer

• 3DVC: Focus on models and simulation
• Gene Ontology: Focus on bioinformatics tools
• National Institute on Aging: Aging-related data sets
• Monarch: Phenotype-Genotype; deep semantic data integration
• One Mind for Research: Biospecimen repositories
• NeuroGateway: Computational resources
• FORCE11: Tools for next-gen publishing and e-scholarship

SciCrunch

SciCrunch is actively supporting multiple communities; multiple communities are enriching and improving SciCrunch

Page 43: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Customized portals and rankings

SciCrunch

NIF

Neuroscience

MONARCH

Animal Models

Community Services

dkCOIN

Shared Resources

Undiagnosed Disease Program

Phenotype RCN

3D Virtual Cell

National Institute on Aging

One Mind for Research

BIRN

International Neuroinformatics

Coordinating Facility

Model Organism Databases

Community Outreach

DELSA

Varied

dkCOIN Ontology

SciCrunch Shared Resources

Page 44: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Community database: beginning

Community database: end

Register your resource to NIF!

“How do I share my data/tool?”

“There is no database for my data”


Institutional repositories

Cloud

INCF: Global infrastructure

Government

Education

Industry

NIF is designed to leverage existing investments in resources and infrastructure

Tool repositories

Page 45: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

Collaboration, competition, coordination, cooperation

• The diversity and dynamism of biomedical data will always make data integration challenging
• The overall data space is vast: No one group or individual can do everything
– Cooperation and coordination are essential
• Creating a core resource registry and data catalog allows the entire community to track resources, work together to keep it updated, promote cross-fertilization, and build better resources

Page 46: A Deep Survey of the Digital Resource Landscape:Perspectives from the Neuroscience Information Framework

NIF team (past and present)

Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum

Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer

And my colleagues in Monarch, dkNet, 3DVC, Force 11