Knowledge-Based Integration of Neuroscience Data Sources

Amarnath Gupta

Bertram Ludäscher

Maryann Martone

University of California San Diego

A Standard Information Mediation Framework

[Diagram: Client Query → Integrated XML View → Mediator (View Definition) → three XML Views → two Wrappers over two Data Sources, plus one XML Data Source]
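
The layering in this framework can be sketched in a few lines of Python. This is only an illustration: the class names `Wrapper` and `Mediator`, the in-memory sources `src1` and `src2`, and the element names are assumptions made for the example, not part of the system described in these slides.

```python
import xml.etree.ElementTree as ET


class Wrapper:
    """Exports one source as an XML view (hypothetical interface)."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows  # stand-in for a real database or web source

    def xml_view(self):
        root = ET.Element("source", name=self.name)
        for row in self.rows:
            rec = ET.SubElement(root, "record")
            for key, value in row.items():
                ET.SubElement(rec, key).text = str(value)
        return root


class Mediator:
    """Combines the wrappers' XML views into one integrated view."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def integrated_view(self):
        view = ET.Element("integrated_view")
        for wrapper in self.wrappers:
            view.append(wrapper.xml_view())
        return view

    def query(self, tag, value):
        # Client query: records from any source whose child <tag> equals value.
        return [rec for rec in self.integrated_view().iter("record")
                if rec.findtext(tag) == value]


mediator = Mediator([
    Wrapper("src1", [{"protein": "RyR", "amount": 30}]),
    Wrapper("src2", [{"protein": "RyR", "spine_length": 12.348}]),
])
hits = mediator.query("protein", "RyR")
```

The client only sees `query`; which source a record came from is hidden behind the integrated view.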

A Neuroscience Question

Cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity?

How about other rodents?

[Diagram: the question is posed against an Integrated View defined at the Mediator (View Definition); four Wrappers connect the Mediator to sources on protein localization, morphometry, neurotransmission, and the WWW (CaBP, Expasy)]

Integration Issues

• Structural Heterogeneity
  – Resolved by converting to a common semistructured data model
• Heterogeneity in Query Capabilities
  – Resolved by writing wrappers with binding patterns and other capability-definition languages
• Semantic Heterogeneity
  – Schema conflicts
    • Partially resolved by mapping rules in the mediator
  – Hidden Semantics?
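
A binding pattern states which attributes a source needs as input before it can answer a query. The sketch below shows the idea in Python; the `BoundWrapper` class, the `"b"`/`"f"` (bound/free) encoding, the `lookup` function, and the placeholder sequence are all assumptions made for illustration.

```python
class BoundWrapper:
    """Wrapper that only answers queries satisfying its binding pattern."""

    def __init__(self, attributes, pattern, fetch):
        if len(attributes) != len(pattern):
            raise ValueError("one pattern letter is needed per attribute")
        self.attributes = attributes
        self.pattern = pattern  # 'b' = must be bound, 'f' = may be free
        self.fetch = fetch

    def can_answer(self, bindings):
        return all(attr in bindings
                   for attr, mode in zip(self.attributes, self.pattern)
                   if mode == "b")

    def query(self, **bindings):
        if not self.can_answer(bindings):
            raise ValueError("query violates binding pattern " + self.pattern)
        return self.fetch(**bindings)


# Hypothetical form-based source: the protein name must be supplied.
SEQUENCES = {"NCS-1": "SEQ_PLACEHOLDER"}  # placeholder, not real sequence data


def lookup(name, sequence=None):
    return [(name, SEQUENCES[name])] if name in SEQUENCES else []


wrapper = BoundWrapper(("name", "sequence"), "bf", lookup)
answer = wrapper.query(name="NCS-1")
```

A mediator can call `can_answer` while planning, so that it only sends a source the queries that source is able to execute.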

Hidden Semantics: Protein Localization

<protein_localization>
  <neuron type="purkinje cell" />
  <protein channel="red">
    <name>RyR</name>
    ….
  </protein>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8">
        <name>spine</name>
        <amount name="RyR">0</amount>
      </structure>
      <structure fraction="0.2">
        <name>branchlet</name>
        <amount name="RyR">30</amount>
      </structure>

• Molecular layer of Cerebellar Cortex
• Purkinje Cell layer of Cerebellar Cortex
• Fragment of dendrite

Hidden Semantics: Morphometry

<neuron name="purkinje cell">
  <branch level="10">
    <shaft> … </shaft>
    <spine number="1">
      <attachment x="5.3" y="-3.2" z="8.7" />
      <length>12.348</length>
      <min_section>1.93</min_section>
      <max_section>4.47</max_section>
      <surface_area>9.884</surface_area>
      <volume>7.930</volume>
      <head>
        <width>4.47</width>
        <length>1.79</length>
      </head>
    </spine>
    …

• Branch level beyond 4 is a branchlet
• Must be dendritic because Purkinje cells don’t have somatic spines
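
The two implicit rules on this slide can be made explicit in code. The Python sketch below applies them to a cut-down, well-formed version of the morphometry fragment; the function name `anatomical_context` and the trimmed XML are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the morphometry fragment (illustrative).
MORPHOMETRY = """
<neuron name="purkinje cell">
  <branch level="10">
    <spine number="1"><length>12.348</length></spine>
  </branch>
</neuron>
"""


def anatomical_context(neuron):
    """Yield (spine number, structure, compartment) using the implicit rules."""
    for branch in neuron.iter("branch"):
        # Rule 1: a branch level beyond 4 is a branchlet.
        structure = "branchlet" if int(branch.get("level")) > 4 else "branch"
        for spine in branch.iter("spine"):
            # Rule 2: Purkinje cells have no somatic spines,
            # so a spine on a Purkinje cell must be dendritic.
            if neuron.get("name") == "purkinje cell":
                compartment = "dendrite"
            else:
                compartment = "unknown"
            yield spine.get("number"), structure, compartment


rows = list(anatomical_context(ET.fromstring(MORPHOMETRY)))
```

Once the spine is tagged as lying on a branchlet, it can be matched against the `branchlet` structure in the protein localization data, which the raw XML alone does not allow.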

The Problem

• Multiple Worlds Integration
  – compatible terms not directly joinable
  – complex, indirect associations among schema elements
  – unstated integrity constraints
• Why not use ontologies?
  – typical ontologies associate terms along a limited number of dimensions
• What’s needed
  – a “theory” under which non-identical terms can be “semantically” joined

Our Approach

• Modify the standard Mediation Architecture
  – Wrapper
    • Extend to encode an object version of the structure schema
  – Mediator
    • Redesign to incorporate auxiliary knowledge sources to
      – Correlate object schemas of sources
      – Define additional objects not specified but derivable from sources
• At the Mediator
  – Use a logic engine to
    • Encode the mapping rules between sources
    • Define integrated views using a combination of exported objects from sources and the auxiliary knowledge sources
    • Perform query decomposition
• We still use the Global-as-View form of mediation

The KIND Architecture

[Diagram components: Integrated User View; View Definition Rules; Logic Engine; Integration Logic; Schema of Registered Sources; Materialized Views; Auxiliary Knowledge Source 1 and Auxiliary Knowledge Source 2; an Object Wrapper and a Structure Wrapper for each of Src 1 and Src 2]

The Knowledge-Base

• Situate every data object in its anatomical context
  – An illustration
  – New data is registered with the knowledge-base
  – Insertion of new data reconciles the current knowledge-base with the new information by:
    • Indexing the data with the source as part of registration
    • Extending the knowledge-base
    • Creating new views with complex rules to encode additional domain knowledge
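
Registration and indexing against an anatomical knowledge base might look roughly like the following Python sketch. The `KnowledgeBase` class, its method names, and the tiny anatomy fragment are assumptions for illustration; the slides do not specify the actual data structures.

```python
from collections import defaultdict


class KnowledgeBase:
    """Toy anatomical knowledge base with a source index (hypothetical)."""

    def __init__(self):
        self.has_a = defaultdict(set)  # anatomical term -> direct parts
        self.index = defaultdict(set)  # anatomical term -> registered sources

    def add_part(self, whole, part):
        self.has_a[whole].add(part)

    def register(self, source, term, parent=None):
        # Extend the knowledge base if a parent is given for the term,
        # then index the source under the term.
        if parent is not None:
            self.add_part(parent, term)
        self.index[term].add(source)

    def sources_under(self, term):
        # All sources holding data about the term or any of its parts.
        found = set(self.index.get(term, set()))
        for part in self.has_a.get(term, set()):
            found |= self.sources_under(part)
        return found


kb = KnowledgeBase()
kb.add_part("cerebellum", "cerebellar cortex")
kb.register("PROLAB", "molecular layer", parent="cerebellar cortex")
relevant = kb.sources_under("cerebellum")
```

Because each source is indexed at the anatomical term it describes, a query about a large region can find sources that only mention its sub-structures.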

F-Logic for the Mediation Engine

• Why F-Logic?
  – Provides the power of Datalog (with negation) and object creation through Skolem IDs
  – Correct amount of “notational sugar” and rules to provide object-oriented abstraction
  – Schema-level reasoning
  – Expressing variable arity
• F-Logic in KIND
  – Source schema wrapped into F-Logic schema
  – Knowledge-sources programmed in F-Logic
  – Definition of Integrated Views

Wrapping into Logic Objects

• Automated Part
  <!ELEMENT Studies (Study)*>
  <!ELEMENT Study (study_id, … animal, experiments, experimenters)>
  <!ELEMENT experiments (experiment)*>
  <!ELEMENT experiment (description, instrument, parameters)>

  studyDB[studies =>> study].
  study[study_id => string; … animal => animal; experiments =>> experiment; experimenters =>> string].
  …
• Non-automated Part
  • Subclasses
    mushroom_spine::spine
  • Rules
    S:mushroom_spine IF S:spine[head -> _; neck -> _].
  • Integrity Constraints
    ic1(S):alert[type -> "invalid spine"; object -> S] IF S:spine[undef ->> {head, neck}].
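
For readers unfamiliar with F-Logic, the subclass rule and the integrity constraint on this slide can be paraphrased in Python. This is a rough analogue, not a translation of the engine's semantics; the function names, the dictionary representation of a spine, and the neck value are assumptions for illustration.

```python
def classify(spine):
    """Classes of a spine object: a spine with head and neck is a mushroom spine."""
    classes = {"spine"}
    if spine.get("head") is not None and spine.get("neck") is not None:
        classes.add("mushroom_spine")
    return classes


def check_ic1(spine_id, spine):
    """Return an alert object if head or neck is undefined, else None."""
    undefined = [attr for attr in ("head", "neck") if spine.get(attr) is None]
    if undefined:
        return {"type": "invalid spine", "object": spine_id, "undef": undefined}
    return None


spines = {
    "s1": {"head": 4.47, "neck": 1.2},   # neck value is made up
    "s2": {"head": 4.47, "neck": None},  # violates the constraint
}
alerts = [a for a in (check_ic1(k, v) for k, v in spines.items()) if a]
```

The alert plays the role of the `ic1(S)` object in the rule: it is created only when the constraint is violated and records which object caused it.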

Computing with Auxiliary Sources

• Creating Mediated Classes
  – union view:
    animal[M => R] IF S:source, S.animal[M => R].
    animal[taxon => 'TAXON'.taxon].
  – association rule:
    X[taxon -> T] IF X:'PROLAB'.animal[name -> N], words(N,[W1,W2|_]), T:'TAXON'.taxon[genus -> W1; species -> W2].
• Reasoning with Schema
  – Schema:
    taxon[subspecies => string; species => string; genus => string; … phylum => string; kingdom => string; superkingdom => string].
  – At Mediator:
    subspecies::species::genus:: … kingdom::superkingdom
  – Class creation by schema reasoning:
    T:TR, TR::TR1 IF T:'TAXON'.taxon[Taxon_Rank -> TR, Taxon_Rank1 -> TR1], Taxon_Rank::Taxon_Rank1.
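
The association rule links an animal to a taxon by splitting the animal's name into words and matching the first two against genus and species. A Python paraphrase is sketched below; the record layout, the identifiers, and the sample rows are assumptions for illustration.

```python
# Illustrative stand-ins for the 'TAXON' and 'PROLAB' sources.
TAXA = [
    {"id": "t1", "genus": "Rattus", "species": "norvegicus"},
    {"id": "t2", "genus": "Mus", "species": "musculus"},
]
ANIMALS = [
    {"id": "a1", "name": "Rattus norvegicus"},
    {"id": "a2", "name": "rat"},  # fewer than two words: no association
]


def taxon_of(animal, taxa):
    """Mirror of the rule body: words(N,[W1,W2|_]), then match genus and species."""
    words = animal["name"].split()
    if len(words) < 2:
        return None
    w1, w2 = words[0], words[1]
    for taxon in taxa:
        if taxon["genus"] == w1 and taxon["species"] == w2:
            return taxon
    return None


links = {a["id"]: taxon_of(a, TAXA) for a in ANIMALS}
```

An animal recorded only under a common name gets no taxon from this rule, which is one reason additional domain knowledge is needed at the mediator.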

Integrated View Definition

• Views are defined between sources and knowledge base
• Example: protein_distribution
  – given: organism, protein, brain_region
  – KB Anatom:
    • recursively traverse the has_a paths under brain_region; collect all anatomical_entities
  – Source PROLAB:
    • join with anatomical structures and collect the value of attribute “image.segments.features.feature.protein_amount” where “image.segments.features.feature.protein_name” = protein and “study_db.study.animal.name” = organism
  – Mediator:
    • aggregate over all parents up to brain_region
    • report distribution

a second integrated view
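
The three steps of the protein_distribution view (traverse has_a, join with the source, aggregate upward) can be sketched in Python as follows. The anatomy fragment, the measurement rows, and the choice of summation as the aggregate are assumptions for illustration; the amounts are made up.

```python
# Toy has_a hierarchy standing in for the anatomical knowledge base.
HAS_A = {
    "cerebellum": ["cerebellar cortex"],
    "cerebellar cortex": ["molecular layer", "purkinje cell layer"],
}

# (organism, protein, anatomical entity, amount): stand-in for PROLAB data.
MEASUREMENTS = [
    ("rat", "RyR", "molecular layer", 30.0),
    ("rat", "RyR", "purkinje cell layer", 10.0),
    ("mouse", "RyR", "molecular layer", 5.0),
]


def protein_distribution(organism, protein, region):
    """Map each anatomical entity under region to its aggregated amount."""

    def total(entity):
        # Amount measured at this entity plus everything below it.
        own = sum(amount for o, p, e, amount in MEASUREMENTS
                  if (o, p, e) == (organism, protein, entity))
        return own + sum(total(part) for part in HAS_A.get(entity, []))

    distribution = {}

    def walk(entity):
        # Recursively traverse the has_a paths under the region.
        distribution[entity] = total(entity)
        for part in HAS_A.get(entity, []):
            walk(part)

    walk(region)
    return distribution


dist = protein_distribution("rat", "RyR", "cerebellum")
```

The result reports an amount for every level of the hierarchy, so a user can see both the total for the cerebellum and how it is spread over sub-structures.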

Query Evaluation Example

• protein distribution of Human NCS-1 homologue
  – from wrapped CaBP website:
    • get the amino acid sequence for human NCS-1
  – from wrapped Expasy website:
    • submit amino acid sequence, get ranked homologues
  – at Mediator:
    • select homologues H found in rat, and homology > 0.70
  – at Mediator:
    • for each h in H
      – from previous view:
        » protein_distribution(rat, h, cerebellum, distribution)
    • Construct result
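
The evaluation plan above is a short pipeline, sketched here in Python with stubbed sources. Every function below is a stand-in: the real CaBP and Expasy wrappers are web sources, and the homologue names, scores, and amounts are placeholders rather than real data.

```python
def cabp_sequence(protein):
    """Stand-in for the wrapped CaBP site: protein name -> sequence."""
    return {"human NCS-1": "SEQ_PLACEHOLDER"}[protein]


def expasy_homologues(sequence):
    """Stand-in for the wrapped Expasy site: ranked (name, organism, homology)."""
    return [("hom_a", "rat", 0.95), ("hom_b", "mouse", 0.90),
            ("hom_c", "rat", 0.72), ("hom_d", "rat", 0.40)]


def protein_distribution(organism, protein, region):
    """Stand-in for the protein_distribution integrated view."""
    return {region: {"hom_a": 12.0, "hom_c": 3.0}.get(protein, 0.0)}


def evaluate(protein, organism, region, threshold=0.70):
    sequence = cabp_sequence(protein)                    # step 1: CaBP
    ranked = expasy_homologues(sequence)                 # step 2: Expasy
    selected = [name for name, org, score in ranked      # step 3: mediator filter
                if org == organism and score > threshold]
    return {name: protein_distribution(organism, name, region)  # step 4: per-h view
            for name in selected}


result = evaluate("human NCS-1", "rat", "cerebellum")
```

Steps 1 and 2 go to external sources, while the selection and the final loop run at the mediator, which matches the split shown in the plan.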

Implementation

• System
  – Flora as F-Logic Engine
  – Communicate with ODBC databases through underlying XSB Prolog
  – XML wrapping and Web querying through XMAS, our XML query language and custom-built wrappers
• Data
  – Human Brain Project sites
  – NPACI Neuroscience Thrust sites

Work in Progress

• Architecture
  – plug-in architecture for
    • domain knowledge sources
    • conceptual models from data sources
• Functionality
  – better handling of large data
  – operations
    • expressive query language
    • operators for domain knowledge manipulation
  – query evaluation
    • query optimization using domain knowledge
• Demonstration
  – at VLDB 2000
