Integrated Querying Across Disparate Data Sources José Luis Ambite & Gully APC Burns Information Sciences Institute University of Southern California


Integrated Querying Across Disparate Data Sources

José Luis Ambite & Gully APC Burns

Information Sciences Institute, University of Southern California


Team

Information Integration Infrastructure: Jose Luis Ambite, Craig Knoblock, Maria Muslea, Gowri Kumaraguruparan, Kristina Lerman (USC/ISI)

Domain Collaborators:

FBIRN: Naveen Ashish (UCI), Jessica Turner (UCI), Karl Helmer (MGH), Tim Olsen (WUSTL), Dingying Wei (UCI)

NHPRC: John Nylander, Dave Brink, Liz Moran (NHPRC)

CVRG: Naveen Ashish (UCI), Steve Granite (JHU)

Security: Rachana Ananthakrishnan (UC), Laura Pearlman (USC/ISI)

Data Management: Robert Schuler, Ann Chervenak (USC/ISI)

Knowledge Engineering: Gully Burns, Tom Russ (USC/ISI), Naveen Ashish, Jessica Turner (UCI)

User Interfaces: Naveen Ashish (UCI), Jose Luis Ambite, Pedro Szekely, Craig Rogers, Gowri Kumaraguruparan, Maria Muslea (USC/ISI)


Information Integration

Mediator: uniform structured query access to heterogeneous sources

Challenges:
• Syntactic (access/format) heterogeneity: wrappers
  – Structured sources: DBMS, XML/XQuery DBs
  – Semi-structured sources: HTML, text, PDF
  – Web services: XML, SOAP, WSDL
• Semantic heterogeneity: mediator
  – Schema: source modeling
  – Data: record linkage
• Scalability: mediation, security, source addition, record linkage, efficient query execution

[Figure: mediator diagram. Labels: Decision Support, Application Programs, Mediator, Knowledge Bases, Databases, Computer Programs, The Web]


Information Mediator

• Virtual Integration Architecture:
  – Virtual organization: community of data providers and consumers that want to share data for a specific purpose
  – Autonomous sources: data and control remain at the sources; no change to access methods or schemas; data is accessed in real time in response to user queries
  – Mediator: integrator defines the domain schema and describes source contents
    • Domain schema: agreed-upon view of the domain preferred by the virtual organization
    • Source descriptions: logical formulas relating source and domain schemas
    • Easy to add new sources
• Query Answering
  – User writes query in the domain schema
  – Mediator:
    • Determines sources relevant to the user query
    • Rewrites the query in the sources' schemas
    • Breaks the query into sub-queries for the sources
    • Optimizes the query evaluation plan
    • Combines answers from the sources
  – Efficient query evaluation
    • Streaming dataflow

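[Figure: mediator architecture. User queries are posed against the Domain Schema; inside the Mediator, Reformulation (driven by the Logical Source Descriptions), the Optimizer, and the Execution Engine process them; the engine reaches each Data Source through a Wrapper that exposes the source's schema.]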
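
The query-answering steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ISI mediator's implementation: source descriptions are reduced to plain attribute mappings rather than logical formulas, the optimizer is omitted, and the filter is applied at the mediator instead of being pushed to the sources. Every name here (`SourceDescription`, `Mediator`, the `subject` relation, `site_a`, `site_b`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]


@dataclass
class SourceDescription:
    """Relates one source relation to a domain relation (simplified, hypothetical)."""
    name: str
    domain_relation: str
    attribute_map: Dict[str, str]        # domain attribute -> source attribute
    fetch: Callable[[], Iterator[Row]]   # wrapper: rows in the source's own schema


class Mediator:
    def __init__(self, descriptions: List[SourceDescription]):
        self.descriptions = descriptions

    def relevant_sources(self, relation: str,
                         attributes: List[str]) -> List[SourceDescription]:
        """Keep sources that describe the relation and cover every attribute."""
        return [d for d in self.descriptions
                if d.domain_relation == relation
                and all(a in d.attribute_map for a in attributes)]

    def query(self, relation: str, attributes: List[str],
              predicate: Callable[[Row], bool]) -> Iterator[Row]:
        """Run one sub-query per relevant source and stream the combined answers."""
        for desc in self.relevant_sources(relation, attributes):
            for source_row in desc.fetch():
                # Translate the source row into the domain schema.
                row = {a: source_row[desc.attribute_map[a]] for a in attributes}
                if predicate(row):
                    yield row  # generator: answers stream as they are produced


# Two invented sources that store the same kind of data under different names.
site_a = [{"subj_id": "A1", "sex": "M", "age_years": 61},
          {"subj_id": "A2", "sex": "F", "age_years": 55}]
site_b = [{"ID": "B7", "gender": "M", "age": 47},
          {"ID": "B9", "gender": "M", "age": 72}]

mediator = Mediator([
    SourceDescription("site_a", "subject",
                      {"id": "subj_id", "sex": "sex", "age": "age_years"},
                      lambda: iter(site_a)),
    SourceDescription("site_b", "subject",
                      {"id": "ID", "sex": "gender", "age": "age"},
                      lambda: iter(site_b)),
])

results = list(mediator.query("subject", ["id", "sex", "age"],
                              lambda r: r["sex"] == "M" and r["age"] > 50))
```

Adding a third source in this sketch means appending one more `SourceDescription`; the `Mediator` class and the user query stay unchanged, which is the property the slide summarizes as "easy to add new sources".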


BIRN Grid-based Virtual Data Integration Architecture

[Figure: architecture diagram. Recoverable labels:]
• Clients: Web Portal, Client Program
• ISI-Mediator: logical source descriptions; reconcile semantics / query rewriting; query optimization / execution
• OGSA-DQP/DAI, with an OGSA-DAI instance in front of each source
• BIRN Gateways at the boundary between the Internet and the network internal to each organization
• Source wrappers over: relational DB (e.g., HID), XML DB (e.g., eXist-db), Web service (e.g., XNAT)
• Grid Security Infrastructure (TLS + PKI)
• Security: encryption, authentication, authorization

FBIRN Data Integration Use Case: HID and XNAT

[Figure: data-flow diagram. Recoverable labels:]
• User query: find all male patients over 50 with T1 scans
• BIRN Mediator: receives the domain query, uses the logical source descriptions, and issues an SQL query and an XML query
• Sources: HID@MRN and HID@UCI (Human Imaging Database(s), Oracle DB); XNAT (eXtensible Neuroimaging Archive Toolkit, Web service API)
• HID results and XNAT results (XML) flow back to the mediator
• Results integrated from XNAT and HID
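
The user query in this use case (male patients over 50 with T1 scans) can be illustrated with two small wrappers, one issuing SQL and one reading XML, whose answers are merged into a single domain schema. The sketch below is only an illustration: an in-memory SQLite database stands in for the Oracle-based HID, a literal XML string stands in for an XNAT web-service response, and all table names, element names, attribute names, and records are invented.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Stand-in for a relational HID instance (invented table layout and data).
hid = sqlite3.connect(":memory:")
hid.executescript("""
    CREATE TABLE subject (subject_id TEXT, sex TEXT, age INTEGER);
    CREATE TABLE scan (subject_id TEXT, scan_type TEXT);
    INSERT INTO subject VALUES ('hid-001', 'M', 63), ('hid-002', 'F', 58), ('hid-003', 'M', 41);
    INSERT INTO scan VALUES ('hid-001', 't1'), ('hid-002', 't1'), ('hid-003', 't1');
""")

# Stand-in for an XML answer from an XNAT-like web service (invented format).
XNAT_XML = """
<subjects>
  <subject id="xnat-17" gender="male" age="71"><scan type="T1"/></subject>
  <subject id="xnat-18" gender="male" age="66"><scan type="T2"/></subject>
  <subject id="xnat-19" gender="female" age="59"><scan type="T1"/></subject>
</subjects>
""".strip()


def hid_wrapper(conn):
    """SQL sub-query; returns rows in the domain schema (id, sex, age, source)."""
    sql = """SELECT DISTINCT s.subject_id, s.sex, s.age
             FROM subject s JOIN scan c ON c.subject_id = s.subject_id
             WHERE s.sex = 'M' AND s.age > 50 AND c.scan_type = 't1'
             ORDER BY s.subject_id"""
    return [{"id": sid, "sex": sex, "age": age, "source": "HID"}
            for sid, sex, age in conn.execute(sql)]


def xnat_wrapper(xml_text):
    """XML sub-query; maps gender='male' to the domain value 'M'."""
    rows = []
    for subj in ET.fromstring(xml_text).findall("subject"):
        has_t1 = any(s.get("type", "").lower() == "t1"
                     for s in subj.findall("scan"))
        age = int(subj.get("age"))
        if subj.get("gender") == "male" and age > 50 and has_t1:
            rows.append({"id": subj.get("id"), "sex": "M",
                         "age": age, "source": "XNAT"})
    return rows


# Combine the per-source answers into one integrated result.
integrated = hid_wrapper(hid) + xnat_wrapper(XNAT_XML)
```

Each wrapper hides one kind of syntactic heterogeneity (SQL rows versus XML elements) and one kind of semantic heterogeneity (`'M'` versus `'male'`), so the caller sees a single list of records in the domain vocabulary.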


CardioVascular Research Grid

[Figure: data-flow diagram. Recoverable labels:]
• A domain query goes to the BIRN Mediator, which uses logical source descriptions and returns integrated results
• Sources:
  – ECG_Mesa (MySQL DB)
  – Chesnokov Analysis (eXistDB XML DBMS)
  – Image Metadata, dcm4che PACS (MySQL DB)
  – WaveformDB (eXistDB XML DBMS)
  – DICOM Image Files (file system)
  – Waveform Files (file system)
• Same BIRN mediator: just plug in CVRG source descriptions and an additional wrapper for eXistDB (XML/XQuery database)

• Use the mediator to identify subjects and files of interest


BIRN NHPRC Data Integration Use Case

• Provide data integration infrastructure for NHPRC:
  – Colony management, genetics, pathology, …
• BIRN NHPRC Activities:
  – BIRN/ISI demonstrated Colony Management integration prototype
  – BIRN/ISI released data integration system to NHPRC team
  – NHPRC team developing DNA Banking application using mediator
  – Deployment to NPRC Centers in progress


BIRN and Biomedical Ontologies

Ontology development challenges:
• Modeling complex domains is challenging and requires specialized expertise
• The community of ontology development efforts is large and somewhat daunting to navigate

Our goal: to provide an ontology development process that
• leverages existing ontology development
• creates effective dialog between domain users and biomedical ontologists
• informs and documents the design of domain models for integration
• publishes curated domain ontologies to the community


Domain/Ontology Engineering Strategy


Variables and Data Values are the basis of scientific assertions


Our ontology engineering approach is based on experimental variables

e.g., ‘CDNF protects nigral dopaminergic neurons in vivo’

This statistically significant effect is the experimental basis for the findings of this study.

from Lindholm, P. et al. (2007), Nature, 448(7149): p. 73-7


Ontology of Experimental Variables and Values (‘OoEVV’)

• Capture semantics of experimental variables
  – Based on Measurement Scales (nominal, ordinal, interval, ratio, etc.) [1]
• Publish to standardized ontology repositories (NCBO, OBO Foundry, etc.)
  – Expressed in RDF / OWL
• Serve as a focal point of interaction with other ontology curation activities
  – e.g., Ontology of Biomedical Investigations (OBI, http://obi-ontology.org/)

[1] Stevens, S. S. (1946). "On the theory of scales of measurement." Science 103(2684): 677-680.


OoEVV Basic Design

[Figure: class diagram. Classes: Variable, Characteristic, Measurement Scale, Measurement Value, with links to External Ontologies. Relation labels: measures, value, usesMeasurementScale, onMeasurementScale]
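
One way to read this design is as four small record types connected by the labelled relations. The sketch below is an assumption-laden illustration, not the published OoEVV schema (which the slides say is expressed in RDF / OWL): the direction of each relation (a Variable measures a Characteristic and uses a Measurement Scale; a Measurement Value sits on a Measurement Scale) is inferred from the label names, and the Python class and field names are hypothetical. The example instance reuses the handedness terms and the `birnlex_2178` identifier that appear in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Characteristic:
    """What is being measured; may point at a term in an external ontology."""
    label: str
    external_term: Optional[str] = None


@dataclass(frozen=True)
class MeasurementScale:
    label: str
    kind: str  # "nominal", "ordinal", "interval" or "ratio"


@dataclass(frozen=True)
class MeasurementValue:
    value: object
    on_measurement_scale: MeasurementScale   # onMeasurementScale (assumed direction)


@dataclass(frozen=True)
class Variable:
    label: str
    measures: Characteristic                   # measures (assumed direction)
    uses_measurement_scale: MeasurementScale   # usesMeasurementScale

    def make_value(self, raw: object) -> MeasurementValue:
        """Attach a raw value to the scale this variable uses."""
        return MeasurementValue(raw, self.uses_measurement_scale)


handedness = Characteristic("Handedness", external_term="birnlex_2178")
scale = MeasurementScale("Nominal Handedness Scale", kind="nominal")
variable = Variable("Nominal Handedness Variable", handedness, scale)
value = variable.make_value("lefthanded")
```

Keeping the scale as a separate object, rather than a field on the value, lets two variables that measure the same characteristic (for example a nominal and an inventory-based handedness variable) differ only in the scale they use.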


Example: Simple (‘nominal’) Handedness

[Figure: class hierarchy. Recoverable content:]
• Characteristic
  – Handedness (BirnLexOntology: birnlex_2178)
• Variable
  – Handedness Variable
    • Nominal Handedness Variable
• Measurement Value
  – Nominal Measurement Value
    • Nominal Handedness Value: lefthanded, ambidextrous, righthanded
• Measurement Scale
  – Nominal Measurement Scale
    • Nominal Handedness Scale


Example: Edinburgh Handedness Inventory

[Figure: class hierarchy. Recoverable content:]
• Characteristic
  – Handedness (BirnLexOntology: birnlex_2178)
• Variable
  – Handedness Variable
    • Edinburgh Handedness Variable
• Measurement Value
  – Ordinal Measurement Value
    • Edinburgh Handedness Value: float values [-100.0, +100.0]
• Measurement Scale
  – Ordinal Measurement Scale
    • Edinburgh Handedness Inventory
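
The two handedness examples differ mainly in which values their scales admit: the nominal scale enumerates three categories, while the Edinburgh Handedness Inventory takes float values in [-100.0, +100.0]. A minimal sketch of that difference as value validation follows. The class names `NominalScale` and `BoundedNumericScale` and the `accepts` method are hypothetical and not part of OoEVV; the slides classify the Edinburgh scale as ordinal, which this sketch does not model beyond the numeric bounds.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class NominalScale:
    """A scale whose values are an unordered set of named categories."""
    label: str
    categories: FrozenSet[str]

    def accepts(self, value: object) -> bool:
        return isinstance(value, str) and value in self.categories


@dataclass(frozen=True)
class BoundedNumericScale:
    """A scale whose values are numbers within a closed interval."""
    label: str
    low: float
    high: float

    def accepts(self, value: object) -> bool:
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        return self.low <= value <= self.high


nominal_handedness = NominalScale(
    "Nominal Handedness Scale",
    frozenset({"lefthanded", "ambidextrous", "righthanded"}),
)
edinburgh = BoundedNumericScale("Edinburgh Handedness Inventory", -100.0, 100.0)
```

A value recorded under one scale is rejected by the other (`edinburgh.accepts("righthanded")` is false), which is the kind of mismatch a mediator has to reconcile when two sources report handedness differently.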


OoEVV elements for FBIRN

• FBIRN (HID/XNAT) domain model
  – 193 attributes, 83 OoEVV variables
  – Mainly clinical assessments
  – e.g., Structured Clinical Interview for DSM Disorders (‘SCID’), Mini-Mental State Examination, Clinical Dementia Rating, etc.


Integrated Querying Across Disparate Data Sources

• General information integration infrastructure
  – Mediators
    • bridge semantics across data sources
    • provide integrated data for analysis and visualization
  – Domain model development and curation process
    • Balance bottom-up/top-down domain/ontology development and reuse
  – Security and user data access control built-in
• Approach
  – Engage research communities
  – User-led integration: NHPRC, FBIRN
  – Build applications incrementally
  – Enhance capabilities while providing useful tools