DESCRIPTION
Presented during the 34th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'12). Part of the workshop 'New Models and Modes for Data Sharing: Experiences from Neuroscience'. Presented by Jeffrey S. Grethe, Ph.D. from the Center for Research in Biological Systems at the University of California, San Diego. This workshop featured several large scale efforts to establish data sharing platforms, standards and tools to promote data intensive analysis in the neurosciences. As we head into the second decade of the 21st century, many scientists realize that current methods for publishing and accessing data are outmoded and inefficient. Neuroscience, with its large diverse and highly competitive community, has been slow to adopt more open sharing of data and has lacked effective tools to do so. There has been a significant investment in databases and tools for biological science, and frequent calls for more of them, but few calls to the biological community to adopt practices and frameworks for making their resources more easily discoverable and data more accessible. Data are contained within diverse sources, from web pages, databases, literature to personal lab systems, making for a haphazard mechanism for data and tool discovery. Although these mechanisms are effective for small communities, they are parochial for the totality of resources available, leading to fragmentation in the resource ecosystem. Neuroscience, with its diverse subdisciplines, complex data types and broad domain, presents the perfect exemplar of the current practices, bottlenecks and issues surrounding open access to data. This situation is changing, however, as groups have started to work together to define new models and tools for sharing and analyzing neuroscience data on an international scale. 
In this workshop, we bring together experts from national and international projects to discuss issues of data access and progress towards establishing platforms and best practices for effective sharing of neuroscience data in support of basic and clinical neuroscience.
Where are the Data?
Perspectives from the Neuroscience Information Framework
Jeffrey S. Grethe, Ph. D. Center for Research in Biological Systems
University of California, San Diego
Introduction
“Neural Choreography”
“A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits -- their spatial organization, local and long-distance connections, their temporal orchestration, and their dynamic features. Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior....
However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.”
Akil et al., Science, Feb 11, 2011
“We speak piously of taking measurements and making small studies that will add another brick to the temple of science. Most such bricks just lie around the brickyard.”
Platt, J.R. (1964) Strong Inference. Science. 146: 347-353.
“We now have unprecedented ability to collect data about nature…but there is now a crisis developing in biology, in that completely unstructured information does not enhance understanding”
Sidney Brenner
Neuroscience is unlikely to be served by a few large databases like the genomics and proteomics community
Whole brain data (20 µm microscopic MRI)
Mosaic LM images (1 GB+)
Conventional LM images
Individual cell morphologies
EM volumes & reconstructions
Solved molecular structures
No single technology serves these all equally well.
→ Multiple data types; multiple scales; multiple databases
The Data Federation Problem
Where are the data?
What do you mean by data? Databases come in many shapes and sizes
• Primary data:
– Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data:
– Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data:
– Claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task
• Registries:
– Metadata
– Pointers to data sets or materials stored elsewhere
• Data aggregators:
– Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source:
– Data acquired within a single context, e.g., Allen Brain Atlas
Data, not just stories about them!
47/50 major preclinical published cancer studies could not be replicated
• “The scientific community assumes that the claims in a preclinical study can be taken at face value - that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.”
• Getting data out sooner in a form where they can be exposed to many eyes and many analyses, and easily compared, may allow us to expose errors and develop better metrics to evaluate the validity of data
Begley and Ellis, 29 MARCH 2012 | VOL 483 | NATURE | 531
• “There are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process.”
In an ideal world...
We’d like to be able to find
• What is known:
– What is the average diameter of a Purkinje neuron?
– Is GRM1 expressed in cerebral cortex?
– What are the projections of hippocampus?
– What genes have been found to be upregulated in chronic drug abuse in adults?
– Find images showing dendritic spines containing membrane bound organelles
– What animal models have similar phenotypes to Parkinson’s disease?
– What studies used my polyclonal antibody against GABA in humans?
• What is not known:
– Connections among data
– Gaps in knowledge
Without some sort of framework, very difficult to do
The Problems Researchers Face
• We are not publishing data in a form that is easy to find or integrate
• What we mean isn’t clear to a search engine (or even to a human)
• NIF Registry: A catalog of neuroscience-relevant resources
– > 4700 currently described
– > 2000 databases
• Searching and navigating across individual resources takes an inordinate amount of human effort
But we have Google!
• Current web is designed to share documents
– Documents are unstructured data
• Much of the content of digital resources is part of the “hidden web”
• Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.
But we have PubMed!
“...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial (Science 11 February 2011: Vol. 331 no. 6018 p. 649 )
Author, year, journal, keywords
• Bulk of neuroscience data is published as part of papers
– > 20,000,000
• Structured vs. unstructured information
NIF: A New Type of Entity for New Modes of Scientific Dissemination
• NIF’s mission is to maximize the awareness of, access to and utility of digital resources produced worldwide to enable better science and promote efficient use
– NIF is the only neuroscience information entity that views resources globally without respect to domain, funding agency, institute or community
– NIF is like a “PubMed” for all neuroscience resources
– Aggregates all the different databases, tools and resources now produced by the scientific community
– Makes them searchable from a single interface
– A practical approach to the data deluge
– The “authority” on resources for neuroscience
– Educate neuroscientists and students about effective data sharing
People use NIF to...
• Find resources
– “Where can I find a translation of Talairach to MNI coordinates?” - NIF Forum
– “What biospecimen banks are available with tissues from opiate addicts?” - NIH
• Find answers
– “What is the amount of data published on males vs females?” - NIH
– “What projects to the ventral lateral geniculate nucleus?” - UCSD researcher
– “What is known about the choroid plexus?” - Small business owner
• NIF is listed in the library guides of > 85 research universities worldwide (up 70% from last year)
• NIF receives hits from > 350 colleges and universities every month
• NIF receives hits from pharmaceutical companies
• Listed as link on 4 societies: Society for Neuroscience, American Association of Anatomists, Society of Immune Pharmacology, American Academy of Neurology
• Track resource utilization
– What projects are using my antibody/mouse/database?
• Serve as a springboard
– NIF ontologies, tools and data resources are used by many groups (>80,000 hits/month on NIF services)
– NIF technologies and expertise jumpstart related efforts
• One Mind for Research
An Overview of NIF
• Assembled the largest searchable collation of neuroscience data on the web
• The largest catalog of biomedical resources (data, tools, materials, services) available
• The largest ontology for neuroscience
• NIF search portal: simultaneous search over data, NIF catalog and biomedical literature
• NeuroLex Wiki: a community wiki serving neuroscience concepts
• A unique technology platform
• Cross-neuroscience analytics
• A reservoir of cross-disciplinary biomedical data expertise
NIF services for data providers
• NIF ensures that all data are discoverable, accessible and understandable
– If data are already in a database, NIF federates them
  • Aligns data to common framework
  • Makes them collectively searchable
  • Provides uniform data access services for linking resources
– If data are not in a database:
  • NIF locates a suitable database within its federation and facilitates ingestion
  • If no database is available, NIF creates a reasonable structure using its database tools; stores data in available data repositories (currently UCSD CRBS/SDSC) and makes it available through the NIF portal
– Assigns a URI for data identification
NIF uses manual, semi-automated and automated tools for ingestion and curation
Registering a resource in NIF
NIF provides a set of tools and services for easy sharing of data and linking of data to articles, web sites etc.
– NIF makes it easy to add and manage resources through NIF
  • Need to respect resource and time constraints of resource providers
– Different levels of access
  • NIF Registry (basic)
  • NIF Site Map
  • NIF level 2
    – Creates web access and basic structure for resources without API
    – Utilizes DISCO tools developed at Yale
  • NIF level 3: Web service access, schema registration
What users are searching for:
NIF Registry
• NIF Registry: each resource gets its own URI and own Wiki page
– Insert maps, Twitter feeds
• NIF site map: manage updates to your resource page
– Utilizes DISCO protocol (Luis Marenco, Rixin Wang, Yale U)
– NIF also consumes other sitemaps for bioscience, e.g., Biositemaps
The NeuroLex Wiki: A lexicon for neuroscience
• Semantic wiki tracking > 18,000 neuroscience concepts
• Built from and for NIF ontologies
• Supports integration of tools and widgets
A dynamic index for neuroscience
Parts of rodent brain
Parts of human brain
Parts of white matter
A Semantically Enabled Search Engine
• NIF has developed a production technology platform for researchers to discover, share, access, analyze, and integrate neuroscience-relevant information
– Semantically-enabled search engine and interface that customizes results for neuroscience
– System that searches the “hidden web”, i.e., content not well served by search engines
– Automated data harvesting technologies that produce dynamic indices of data content including databases, web pages, text, xml etc.
– Easy to use tools to make products and data available
• NIF has developed a wealth of knowledge about data resources and data integration in the life sciences
[Figure: Number of Federated Databases and Number of Federated Records (Millions) over time, Jun-08 to Apr-12; chart labels: DISCO, RDP]
NIF Data Federation
NIF provides access to the largest collection of neuroscience relevant data on the web, all from a single interface
– already have surpassed year 4 cumulative targets
Resource Registry: 4700 ...
Antibodies: 935,000
Brain connectivity: 66,000
Animal models: 270,000
Brain activation foci: 56,000
NIF Search Interface
Making common neuroscience concepts computable: concept-based queries
• Search Google: GABAergic neuron
• Search NIF: GABAergic neuron
– NIF automatically searches for types of GABAergic neurons
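The concept-based expansion described above can be illustrated with a minimal sketch. It assumes a small in-memory parent-to-subtype table; the table contents, `expand_concept` and `build_query` are hypothetical names made up for this illustration and are not taken from NIFSTD or from NIF's actual implementation.

```python
from collections import deque

# Hypothetical toy hierarchy: parent concept -> direct subtypes.
# Illustrative only; not an excerpt of the NIF ontologies.
SUBTYPES = {
    "GABAergic neuron": ["Purkinje cell", "Basket cell"],
    "Basket cell": ["Cerebellar basket cell"],
}


def expand_concept(term, subtypes=SUBTYPES):
    """Return the term followed by all of its descendants, breadth-first."""
    seen = [term]
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in subtypes.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


def build_query(term):
    """OR together the quoted term and every subtype found for it."""
    return " OR ".join(f'"{t}"' for t in expand_concept(term))


print(build_query("GABAergic neuron"))
```

A plain keyword search would match only the literal string, whereas the expanded query also retrieves records annotated with a specific subtype.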
“Search computing”
What genes are upregulated by drugs of abuse in the adult mouse?
Morphine
Increased expression
Adult Mouse
Some concepts, e.g., age category, are quantitative but still must be interpreted in a global query system
NIF STANDARD ONTOLOGIES (NIFSTD)
• Set of modular ontologies
– Covering neuroscience relevant terminologies
– Comprehensive: 50,000+ distinct concepts + synonyms
• Expressed in OWL-DL language
• Closely follows OBO community best practices
– As long as they seem practical
• Avoids duplication of efforts
– Standardized to the same upper level ontologies, e.g., Basic Formal Ontology (BFO), OBO Relations Ontology (OBO-RO), Phenotypical Qualities Ontology (PATO)
– Relies on existing community ontologies, e.g., CHEBI, GO, PRO, OBI etc.
• Modules cover orthogonal domains, e.g., Brain Regions, Cells, Molecules, Subcellular parts, Diseases, Nervous system functions, etc.
Bill Bug et al.
Data Services for Users
Current Planned
Vocabulary • NITRC (autocomplete) • Neuroscience.com (annotate) • INCF Atlasing tools
Data Summary (NIF Navigator) • NIDA, Blueprint • NeuroLex
Individual Data Sources • DOMEO • OneMind • Eagle I
DISCO Services (LinkOut) • PubMed
NIF Link Out Broker: Connecting Resources
NIF inserts links between data and articles on behalf of data providers using NCBI’s LinkOut feature
NIF inserted > 800,000 references to PubMed IDs
Grabbing the long tail of small data
• Analysis of NIF shows multiple databases with similar scope and content
• Many contain partially overlapping data
• Data “flows” from one resource to the next
– Data is reinterpreted, reanalyzed or added to
– When does it become something else?
• Is duplication good or bad?
NIF Analytics: The Neuroscience Ecosystem
NIF is in a unique position to answer questions about the neuroscience ecosystem
Where are the data?
[Figure: data sources by brain region (Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex, Brain); axes: Brain region, Data source]
How much of the landscape do we have?
Query for “reference” brain structures and their parts in NIF Connectivity database
Embracing duplication: Data Mash ups
• ~300 PMIDs were common between Brede and SUMSdb
• Same information; value added
Same data - different aspects
Same data: different analysis
Chronic vs acute morphine in striatum
• Drug Related Gene database: extracted statements from figures, tables and supplementary data from published article
• Gemma: Reanalyzed microarray results from GEO using different algorithms
• Both provide results of increased or decreased expression as a function of experimental paradigm
– 4 strains of mice
– 3 conditions: chronic morphine, acute morphine, saline
Mined NIF for all references to GEO IDs: found small number where the same dataset was represented in two or more databases
http://www.chibi.ubc.ca/Gemma/home.html
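The mining step described above, finding dataset IDs that are referenced by two or more databases, can be sketched as a simple grouping. This is a minimal illustration under stated assumptions: the input is a list of (database, dataset ID) pairs, `shared_datasets` is a hypothetical helper name, and the IDs are placeholders rather than real GEO accessions.

```python
from collections import defaultdict


def shared_datasets(records):
    """Group databases by dataset ID and keep only the IDs that
    appear in two or more distinct databases."""
    sources = defaultdict(set)
    for database, dataset_id in records:
        sources[dataset_id].add(database)
    return {ds: dbs for ds, dbs in sources.items() if len(dbs) >= 2}


# Made-up (database, dataset ID) pairs; the IDs are placeholders.
records = [
    ("DRG", "DATASET-A"),
    ("Gemma", "DATASET-A"),
    ("Gemma", "DATASET-B"),
    ("DRG", "DATASET-C"),
    ("DRG", "DATASET-A"),  # a repeated reference does not count twice
]

for dataset_id, databases in shared_datasets(records).items():
    print(dataset_id, sorted(databases))
```

Using a set per dataset ID means repeated references from the same database are counted once, so only genuine cross-database overlap is reported.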
How easy was it to compare?
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma: Increased expression/decreased expression
• DRG: Increased expression/decreased expression
– But...Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
• Analysis:
– 1370 statements from Gemma regarding gene expression as a function of chronic morphine
– 617 were consistent with DRG; → over half of the claims of the paper were not confirmed in this analysis
– Results for 1 gene were opposite in DRG and Gemma
– 45 did not have enough information provided in the paper to make a judgment
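The baseline problem above can be made concrete with a small sketch: before two sources are compared, each direction of change is re-expressed against the same baseline. The sketch assumes gene identifiers have already been mapped to a common key (the slide notes that Gemma and DRG use different identifiers); the gene names and the functions `to_saline_baseline` and `compare` are hypothetical and only illustrate the idea. The last line reproduces the fraction implied by the counts on the slide.

```python
FLIP = {"increased": "decreased", "decreased": "increased"}


def to_saline_baseline(direction, baseline):
    """Express a direction of change relative to saline.
    A result reported relative to chronic morphine is inverted."""
    if baseline == "saline":
        return direction
    if baseline == "chronic morphine":
        return FLIP[direction]
    raise ValueError(f"unknown baseline: {baseline}")


def compare(source_a, source_b):
    """Count genes whose normalized directions agree, disagree,
    or are absent from the second source."""
    counts = {"consistent": 0, "opposite": 0, "missing": 0}
    for gene, (direction, baseline) in source_a.items():
        if gene not in source_b:
            counts["missing"] += 1
            continue
        a = to_saline_baseline(direction, baseline)
        b = to_saline_baseline(*source_b[gene])
        counts["consistent" if a == b else "opposite"] += 1
    return counts


# Made-up statements keyed by a hypothetical shared gene identifier.
gemma = {
    "GENE1": ("increased", "chronic morphine"),
    "GENE2": ("decreased", "chronic morphine"),
    "GENE3": ("increased", "chronic morphine"),
}
drg = {
    "GENE1": ("decreased", "saline"),
    "GENE2": ("decreased", "saline"),
}

print(compare(gemma, drg))
# Fraction of Gemma statements reported as consistent with DRG on the slide.
print(round(617 / 1370, 2))
```

Without the normalization step, GENE1 would look contradictory and GENE2 would look concordant, which is the reverse of what the two sources actually say.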
NIF annotation standard
A global view of data
Informatics should not be an afterthought
– You (and the machine) have to be able to find it
  • Accessible through the web
  • Annotations
– You have to be able to use it
  • Data type specified and in a usable form
– You have to know what the data mean
  • Some semantics
  • Context: Experimental metadata
  • Provenance: Where did the data come from?
Reporting neuroscience data within a consistent framework helps enormously
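As one way to picture the checklist above, the sketch below models a dataset description as a record and reports which of the listed items are still empty. The class name, field names and `missing_fields` helper are assumptions made for this illustration; they do not describe the NIF annotation standard itself.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetRecord:
    """Hypothetical record mirroring the find / use / understand checklist."""
    url: str = ""                    # find: accessible through the web
    annotations: tuple = ()          # find: annotation terms
    data_type: str = ""              # use: data type specified
    experimental_context: str = ""   # understand: experimental metadata
    provenance: str = ""             # understand: where the data came from


def missing_fields(record):
    """List the names of fields that are empty, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


record = DatasetRecord(url="http://example.org/data", data_type="image")
print(missing_fields(record))
```

A check of this kind could run at submission time so that gaps in context or provenance are flagged before the data are shared.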
• We live in a linked world: “Too Big to Know”
• Multiple efforts are underway simultaneously
– Launched without knowledge of others
– Mine is better / Not Invented Here
• Cooperation and coordination will allow us to move forward faster
– NIF has tried to be a good citizen by sharing expertise, data, knowledge, tools
Competition Cooperation Coordination Collaboration
NIF team (past and present) Maryann Martone, UCSD, Principal Investigator Jeffrey Grethe, UCSD, Co Investigator Amarnath Gupta, UCSD, Co Investigator Anita Bandrowski, NIF Project Leader Gordon Shepherd, Yale University Perry Miller Luis Marenco Rixin Wang David Van Essen, Washington University Erin Reid Paul Sternberg, Cal Tech Arun Rangarajan Hans Michael Muller Yuling Li Giorgio Ascoli, George Mason University Sridevi Polavarum Tim Clark, Harvard University Paolo Ciccarese
Vadim Astakhov Davis Banks Bill Bug Jonathan Cachat Chris Condit Mark Ellisman Lee Hornbrook Fahim Imam Stephen Larson Jennifer Lawrence Cliff Lee Larry Lui Sarah Maynard Binh Ngo Andrea Arnaud Stagg Xufei Qian Willie Wong Jonathan Pollock, NIH, Program Officer
Karen Skinner, NIH, Program Officer
Thank You…