Where are the Data? Perspectives from the Neuroscience Information Framework


Jeffrey S. Grethe, Ph. D. Center for Research in Biological Systems

University of California, San Diego

Introduction

“Neural Choreography”

“A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits -- their spatial organization, local and long-distance connections, their temporal orchestration, and their dynamic features. Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior....

However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.”

Akil et al., Science, Feb 11, 2011

“We speak piously of taking measurements and making small studies that will add another brick to the temple of science. Most such bricks just lie around the brickyard.”

Platt, J.R. (1964) Strong Inference. Science. 146: 347-353.

"We now have unprecedented ability to collect data about nature…but there is now a crisis developing in biology, in that completely unstructured information does not enhance understanding”

Sidney Brenner

Neuroscience is unlikely to be served by a few large databases, as the genomics and proteomics communities are

Whole brain data (20 µm microscopic MRI)

Mosaic LM images (1 GB+)

Conventional LM images

Individual cell morphologies

EM volumes & reconstructions

Solved molecular structures

No single technology serves these all equally well.

→ Multiple data types; multiple scales; multiple databases

The Data Federation Problem

Where are the data?

What do you mean by data? Databases come in many shapes and sizes

•  Primary data:
  –  Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
•  Secondary data
  –  Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
•  Tertiary data
  –  Claims and assertions about the meaning of data
    •  E.g., gene upregulation/downregulation, brain activation as a function of task
•  Registries:
  –  Metadata
  –  Pointers to data sets or materials stored elsewhere
•  Data aggregators
  –  Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
•  Single source
  –  Data acquired within a single context, e.g., Allen Brain Atlas

Data, not just stories about them!

47/53 major preclinical published cancer studies could not be replicated

•  “The scientific community assumes that the claims in a preclinical study can be taken at face value -- that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.”

•  Getting data out sooner in a form where they can be exposed to many eyes and many analyses, and easily compared, may allow us to expose errors and develop better metrics to evaluate the validity of data

Begley and Ellis, 29 MARCH 2012 | VOL 483 | NATURE | 531

•  “There are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process.”

In an ideal world...

We’d like to be able to find

•  What is known:
  –  What is the average diameter of a Purkinje neuron?
  –  Is GRM1 expressed in cerebral cortex?
  –  What are the projections of hippocampus?
  –  What genes have been found to be upregulated in chronic drug abuse in adults?
  –  Find images showing dendritic spines containing membrane bound organelles
  –  What animal models have similar phenotypes to Parkinson’s disease?
  –  What studies used my polyclonal antibody against GABA in humans?
•  What is not known:
  –  Connections among data
  –  Gaps in knowledge

Without some sort of framework, very difficult to do

The Problems Researchers Face

•  We are not publishing data in a form that is easy to find or integrate

•  What we mean isn’t clear to a search engine (or even to a human)

•  NIF Registry: A catalog of neuroscience-relevant resources
  –  > 4700 currently described
  –  > 2000 databases

•  Searching and navigating across individual resources takes an inordinate amount of human effort

But we have Google!

•  Current web is designed to share documents
  –  Documents are unstructured data

•  Much of the content of digital resources is part of the “hidden web”

•  Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.

But we have PubMed!

“...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial (Science 11 February 2011: Vol. 331 no. 6018 p. 649 )

Author, year, journal, keywords

•  Bulk of neuroscience data is published as part of papers
  –  > 20,000,000

•  Structured vs. unstructured information

NIF: A New Type of Entity for New Modes of Scientific Dissemination

•  NIF’s mission is to maximize the awareness of, access to and utility of digital resources produced worldwide to enable better science and promote efficient use
  –  NIF is the only neuroscience information entity that views resources globally without respect to domain, funding agency, institute or community
  –  NIF is like a “PubMed” for all neuroscience resources
  –  Aggregates all the different databases, tools and resources now produced by the scientific community
  –  Makes them searchable from a single interface
  –  A practical approach to the data deluge
  –  The “authority” on resources for neuroscience
  –  Educate neuroscientists and students about effective data sharing

People use NIF to...

•  Find resources
  –  “Where can I find a translation of Talairach to MNI coordinates?” - NIF Forum
  –  “What biospecimen banks are available with tissues from opiate addicts?” - NIH
•  Find answers
  –  “What is the amount of data published on males vs females?” - NIH
  –  “What projects to the ventral lateral geniculate nucleus?” - UCSD researcher
  –  “What is known about the choroid plexus?” - Small business owner
•  Track resource utilization
  –  What projects are using my antibody/mouse/database?
•  Serve as a springboard
  –  NIF ontologies, tools and data resources are used by many groups (>80,000 hits/month on NIF services)
  –  NIF technologies and expertise jumpstart related efforts
    •  One Mind for Research

•  NIF is listed in the library guides of > 85 research universities worldwide (up 70% from last year)
•  NIF receives hits from > 350 colleges and universities every month
•  NIF receives hits from pharmaceutical companies
•  Listed as link on 4 societies: Society for Neuroscience, American Association of Anatomists, Society of Immune Pharmacology, American Academy of Neurology

An Overview of NIF

•  Assembled the largest searchable collation of neuroscience data on the web
•  The largest catalog of biomedical resources (data, tools, materials, services) available
•  The largest ontology for neuroscience
•  NIF search portal: simultaneous search over data, NIF catalog and biomedical literature
•  Neurolex Wiki: a community wiki serving neuroscience concepts
•  A unique technology platform
•  Cross-neuroscience analytics
•  A reservoir of cross-disciplinary biomedical data expertise

NIF services for data providers

•  NIF ensures that all data are discoverable, accessible and understandable
  –  If data are already in a database, NIF federates them
    •  Aligns data to common framework
    •  Makes them collectively searchable
    •  Provides uniform data access services for linking resources
  –  If data are not in a database:
    •  NIF locates a suitable database within its federation and facilitates ingestion
    •  If no database is available, NIF creates a reasonable structure using its database tools, stores data in available data repositories (currently UCSD CRBS/SDSC) and makes them available through the NIF portal
      –  Assigns a URI for data identification

NIF uses manual, semi-automated and automated tools for ingestion and curation
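The "aligns data to common framework" and "collectively searchable" steps can be pictured as a per-source field mapping followed by a search over the shared fields. This is only a minimal sketch: the source names, field names, and records below are hypothetical and do not describe NIF's actual schema or services.

```python
# Hypothetical per-source mappings from native field names to a shared schema.
FIELD_MAPS = {
    "source_a": {"ab_name": "name", "target": "antigen", "org": "species"},
    "source_b": {"title": "name", "antigen_symbol": "antigen", "host": "species"},
}


def to_common(source, record):
    """Rename a source record's fields to the shared schema and tag its origin."""
    mapping = FIELD_MAPS[source]
    common = {mapping[k]: v for k, v in record.items() if k in mapping}
    common["source"] = source
    return common


def search(federated, field, value):
    """Case-insensitive exact match on one shared field across all sources."""
    return [r for r in federated if str(r.get(field, "")).lower() == value.lower()]


# Invented example records from two sources with different native layouts.
federated = [
    to_common("source_a", {"ab_name": "anti-GABA", "target": "GABA", "org": "rabbit"}),
    to_common("source_b", {"title": "GABA antibody", "antigen_symbol": "gaba", "host": "mouse"}),
]

hits = search(federated, "antigen", "GABA")
print([h["source"] for h in hits])
```

Once both records share the `antigen` field, a single query reaches both sources even though their native column names differ.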

Registering a resource in NIF

NIF provides a set of tools and services for easy sharing of data and linking of data to articles, web sites etc.
  –  NIF makes it easy to add and manage resources through NIF
    •  Need to respect resource and time constraints of resource providers
  –  Different levels of access
    •  NIF Registry (basic)
    •  NIF Site Map
    •  NIF level 2
      –  Create web access and basic structure for resources without API
      –  Utilizes DISCO tools developed at Yale
    •  NIF level 3: Web service access, schema registration

What users are searching for:

NIF Registry

•  NIF Registry: each resource gets its own URI and own Wiki page
  –  Insert maps, Twitter feeds
•  NIF site map: manage updates to your resource page
  –  Utilizes DISCO protocol (Luis Marenco, Rixin Wang, Yale U)
  –  NIF also consumes other sitemaps for bioscience, e.g., Biositemaps

The NeuroLex Wiki: A lexicon for neuroscience

•  Semantic wiki tracking > 18,000 neuroscience concepts

•  Built from and for NIF ontologies

•  Supports integration of tools and widgets

A dynamic index for neuroscience

Parts of rodent brain

Parts of human brain

Parts of white matter

A Semantically Enabled Search Engine

•  NIF has developed a production technology platform for researchers to discover, share, access, analyze, and integrate neuroscience-relevant information
  –  Semantically-enabled search engine and interface that customizes results for neuroscience
  –  System that searches the “hidden web”, i.e., content not well served by search engines
  –  Automated data harvesting technologies that produce dynamic indices of data content including databases, web pages, text, xml etc.
  –  Easy to use tools to make products and data available
•  NIF has developed a wealth of knowledge about data resources and data integration in the life sciences

[Figure: Number of Federated Databases (axis 0 to 160) and Number of Federated Records in Millions (axis 0.01 to 1000), Jun-08 to Apr-12; labels: DISCO, RDP]

NIF Data Federation

NIF provides access to the largest collection of neuroscience relevant data on the web, all from a single interface. We have already surpassed year 4 cumulative targets.

Resource Registry: 4700 ...

Antibodies: 935,000
Brain connectivity: 66,000
Animal models: 270,000
Brain activation foci: 56,000

NIF Search Interface


Making common neuroscience concepts computable: concept-based queries

•  Search Google: GABAergic neuron
•  Search NIF: GABAergic neuron
  –  NIF automatically searches for types of GABAergic neurons
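A minimal sketch of how a concept-based query could expand a term into its subtypes before searching. The small hierarchy, the `SUBCLASSES` table, and both function names are hypothetical illustrations; they are not NIFSTD content or NIF's search API.

```python
from collections import deque

# Hypothetical toy "is-a" hierarchy; a real system would read this from an ontology.
SUBCLASSES = {
    "GABAergic neuron": [
        "Purkinje cell",
        "Medium spiny neuron",
        "Cerebellar GABAergic interneuron",
    ],
    "Cerebellar GABAergic interneuron": [
        "Cerebellar basket cell",
        "Cerebellar stellate cell",
    ],
}


def expand_concept(term, subclasses=SUBCLASSES):
    """Return the term followed by all of its descendants, breadth-first."""
    seen = [term]
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in subclasses.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


def build_query(term):
    """Combine the expanded terms into a single OR query string."""
    return " OR ".join(f'"{t}"' for t in expand_concept(term))


print(build_query("GABAergic neuron"))
```

A plain keyword search matches only the literal string, whereas the expanded query also retrieves records annotated with a specific cell type that never mention the word "GABAergic".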

“Search computing”

What genes are upregulated by drugs of abuse in the adult mouse?

Morphine | Increased expression | Adult Mouse

Some concepts, e.g., age category, are quantitative but still must be interpreted in a global query system
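One way to picture this: a category such as "adult" has to be turned into a numeric range per species before sources that record age in days can be queried together. The age ranges and records below are placeholder assumptions for illustration only, not NIF's definitions.

```python
# Placeholder age ranges in postnatal days; these numbers are illustrative assumptions.
AGE_CATEGORIES = {
    ("mouse", "juvenile"): (21, 60),
    ("mouse", "adult"): (60, None),  # None means no upper bound
}


def matches_category(species, category, age_days):
    """True if age_days falls in the half-open range [low, high) for the category."""
    low, high = AGE_CATEGORIES[(species, category)]
    return age_days >= low and (high is None or age_days < high)


# Invented records that store age as a number rather than a category.
records = [
    {"id": "r1", "species": "mouse", "age_days": 90},
    {"id": "r2", "species": "mouse", "age_days": 30},
]

adult = [r["id"] for r in records if matches_category(r["species"], "adult", r["age_days"])]
print(adult)
```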

NIF STANDARD ONTOLOGIES (NIFSTD)

•  Set of modular ontologies
  –  Covering neuroscience relevant terminologies
  –  Comprehensive: 50,000+ distinct concepts + synonyms
•  Expressed in OWL-DL language
•  Closely follows OBO community best practices
  –  As long as they seem practical
•  Avoids duplication of efforts
  –  Standardized to the same upper level ontologies, e.g., Basic Formal Ontology (BFO), OBO Relations Ontology (OBO-RO), Phenotypic Quality Ontology (PATO)
  –  Relies on existing community ontologies, e.g., CHEBI, GO, PRO, OBI etc.
•  Modules cover orthogonal domains, e.g., Brain Regions, Cells, Molecules, Subcellular parts, Diseases, Nervous system functions, etc.

Bill Bug et al.

Data Services for Users

Current Planned

Vocabulary
•  NITRC (autocomplete)
•  Neuroscience.com (annotate)
•  INCF Atlasing tools

Data Summary (NIF Navigator)
•  NIDA, Blueprint
•  NeuroLex

Individual Data Sources
•  DOMEO
•  OneMind
•  Eagle I

DISCO Services (LinkOut)
•  PubMed

NIF LinkOut Broker: Connecting Resources

NIF inserts links between data and articles on behalf of data providers using NCBI’s LinkOut feature

NIF inserted > 800,000 references to PubMed IDs

Grabbing the long tail of small data

•  Analysis of NIF shows multiple databases with similar scope and content

•  Many contain partially overlapping data

•  Data “flows” from one resource to the next
  –  Data is reinterpreted, reanalyzed or added to
  –  When does it become something else?

•  Is duplication good or bad?

NIF Analytics: The Neuroscience Ecosystem

NIF is in a unique position to answer questions about the neuroscience ecosystem

Where are the data?

[Figure: data by brain region and data source; labeled regions include Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex, Brain]

How much of the landscape do we have?

Query for “reference” brain structures and their parts in NIF Connectivity database

Embracing duplication: Data Mashups

•  ~300 PMIDs were common between Brede and SUMSdb
•  Same information; value added

Same data - different aspects

Same data: different analysis

Chronic vs acute morphine in striatum

•  Drug Related Gene database: extracted statements from figures, tables and supplementary data from published article

•  Gemma: Reanalyzed microarray results from GEO using different algorithms

•  Both provide results of increased or decreased expression as a function of experimental paradigm
  –  4 strains of mice
  –  3 conditions: chronic morphine, acute morphine, saline

Mined NIF for all references to GEO IDs: found a small number of cases where the same dataset was represented in two or more databases
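A rough sketch of this kind of mining, assuming each database exposes free-text records. The database names, record texts, and accession numbers are invented for illustration; only the accession prefixes (GSE for series, GDS for curated data sets) follow GEO's naming.

```python
import re
from collections import defaultdict

# Matches GEO series (GSE) and data set (GDS) accessions.
GEO_ID = re.compile(r"\b(?:GSE|GDS)\d+\b")

# Invented records for illustration: database name -> free-text entries.
records = {
    "db_a": ["Expression after morphine, GSE00001", "see GDS00002"],
    "db_b": ["Reanalysis of GSE00001", "GSE00003 striatum"],
    "db_c": ["no accession here"],
}


def shared_geo_ids(records_by_db):
    """Map each GEO accession to the databases mentioning it; keep those found in 2+."""
    sources = defaultdict(set)
    for db, texts in records_by_db.items():
        for text in texts:
            for accession in GEO_ID.findall(text):
                sources[accession].add(db)
    return {acc: sorted(dbs) for acc, dbs in sources.items() if len(dbs) >= 2}


print(shared_geo_ids(records))
```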

http://www.chibi.ubc.ca/Gemma/home.html

How easy was it to compare?

•  Gemma: Gene ID + Gene Symbol
•  DRG: Gene name + Probe ID
•  Gemma: Increased expression/decreased expression
•  DRG: Increased expression/decreased expression
  –  But... Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
•  Analysis:
  –  1370 statements from Gemma regarding gene expression as a function of chronic morphine
  –  617 were consistent with DRG → over half of the claims of the paper were not confirmed in this analysis
  –  Results for 1 gene were opposite in DRG and Gemma
  –  45 did not have enough information provided in the paper to make a judgment
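The baseline mismatch between the two databases can be handled by re-expressing every statement against the same reference condition before comparing. The sketch below assumes a simplified statement format (gene, direction, baseline) with invented gene names; it is not the pipeline actually used for this analysis.

```python
FLIP = {"increased": "decreased", "decreased": "increased"}


def normalize(statement, target_baseline="saline"):
    """Re-express a two-condition comparison relative to target_baseline.

    Flipping the direction is only valid for a pairwise comparison
    (here: chronic morphine vs saline).
    """
    gene, direction, baseline = statement
    if baseline != target_baseline:
        direction = FLIP[direction]
    return gene, direction


# Invented example statements: (gene, direction, baseline condition).
gemma = [
    ("geneA", "decreased", "chronic morphine"),
    ("geneB", "increased", "chronic morphine"),
]
drg = [
    ("geneA", "increased", "saline"),
    ("geneB", "increased", "saline"),
]

gemma_norm = dict(normalize(s) for s in gemma)
drg_norm = dict(normalize(s) for s in drg)
consistent = [g for g in gemma_norm if drg_norm.get(g) == gemma_norm[g]]
print(consistent)

# Proportion reported on the slide: 617 of 1370 Gemma statements agreed with DRG.
print(round(617 / 1370, 3))
```

With the slide's counts, 617 of 1370 is about 45%, which is why over half of the statements count as not confirmed.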

NIF annotation standard

A global view of data

Informatics should not be an afterthought
  –  You (and the machine) have to be able to find it
    •  Accessible through the web
    •  Annotations
  –  You have to be able to use it
    •  Data type specified and in a usable form
  –  You have to know what the data mean
    •  Some semantics
    •  Context: Experimental metadata
    •  Provenance: Where did the data come from?

Reporting neuroscience data within a consistent framework helps enormously

•  We live in a linked world: “Too Big to Know”

•  Multiple efforts are underway simultaneously
  –  Launched without knowledge of others
  –  Mine is better / Not Invented Here

•  Cooperation and coordination will allow us to move forward faster
  –  NIF has tried to be a good citizen by sharing expertise, data, knowledge, tools

Competition Cooperation Coordination Collaboration

NIF team (past and present)

Maryann Martone, UCSD, Principal Investigator
Jeffrey Grethe, UCSD, Co Investigator
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Tim Clark, Harvard University
Paolo Ciccarese

Vadim Astakhov, Davis Banks, Bill Bug, Jonathan Cachat, Chris Condit, Mark Ellisman, Lee Hornbrook, Fahim Imam, Stephen Larson, Jennifer Lawrence, Cliff Lee, Larry Lui, Sarah Maynard, Binh Ngo, Andrea Arnaud Stagg, Xufei Qian, Willie Wong

Jonathan Pollock, NIH, Program Officer
Karen Skinner, NIH, Program Officer

Thank You…
