Integration of Biological Data (LifeDB)
Presented by Md. Shazzad Hosain ([email protected])
Supervised by Dr. Hasan Jamil ([email protected])
Wayne State University, Detroit, USA
04/18/23 3
Data Integration Example
Detroit to Bologna air ticket: Alitalia (Italy's airline), Air France, Northwest Airlines, Lufthansa, etc.
Integration Example cont.
CheapAir.com / Expedia.com integrate: Alitalia, Lufthansa, Air France, Delta
myAirFare.com integrates: CheapAir.com, Expedia.com, ……
Integration Approaches
Warehouse Integration
Mediator based Integration
Navigational Integration
Warehouse Integration
Materializes data from all sources into a local warehouse
Emphasizes data translation rather than query translation
Advantages: low network bottleneck, efficient
Disadvantages: reliability in terms of most up-to-date data, system maintenance
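A minimal sketch of the warehouse idea, assuming two hypothetical sources (`SOURCE_A`, `SOURCE_B`) with made-up records and an in-memory SQLite database standing in for the local warehouse. The translation happens on the data at load time, so queries afterwards run locally and never touch the sources; the flip side is that the warehouse only reflects the sources as of the last load.

```python
import sqlite3

# Hypothetical source extracts, each in its own native shape (illustrative data only).
SOURCE_A = [{"gene_id": "g1", "symbol": "abc1", "organism": "fly"}]
SOURCE_B = [("g2", "XYZ2", "mouse")]


def translate_a(record):
    """Data translation for source A: dict record -> warehouse tuple."""
    return (record["gene_id"], record["symbol"].upper(), record["organism"])


def translate_b(row):
    """Data translation for source B: positional row -> warehouse tuple."""
    gene_id, symbol, organism = row
    return (gene_id, symbol.upper(), organism)


def build_warehouse(conn):
    """Materialize every source into one local table."""
    conn.execute(
        "CREATE TABLE gene (gene_id TEXT PRIMARY KEY, symbol TEXT, "
        "organism TEXT, source TEXT)"
    )
    rows = [translate_a(r) + ("A",) for r in SOURCE_A]
    rows += [translate_b(r) + ("B",) for r in SOURCE_B]
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def query_symbols(conn, organism):
    """Queries run against the local copy only; no source is contacted."""
    cur = conn.execute(
        "SELECT symbol FROM gene WHERE organism = ? ORDER BY symbol", (organism,)
    )
    return [row[0] for row in cur]


warehouse = sqlite3.connect(":memory:")
build_warehouse(warehouse)
fly_symbols = query_symbols(warehouse, "fly")
```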
Mediator-based Integration
Concentrates on query translation
Two approaches: Global-as-View (GAV) and Local-as-View (LAV)
GAV Approach
Query reformulation is easy, but adding or removing sources is difficult
Preferred when sources are known and stable
S1 S2 S3 S4
Mediator Schema
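A minimal GAV sketch, assuming two hypothetical sources `S1` and `S2` and a single global relation `Gene(gene_id, symbol)`; all names and rows are illustrative. The global relation is written as a view over the sources, so answering a query is just unfolding that view. Adding or removing a source means editing the view definition itself, which is why GAV suits known, stable sources.

```python
# Hypothetical sources, each exposing rows in its own local schema.
S1 = [{"id": "g1", "sym": "ABC1"}]
S2 = [{"locus": "g2", "name": "XYZ2"}]


def gene_view():
    """Global relation Gene(gene_id, symbol), defined as a view over S1 and S2."""
    for row in S1:
        yield {"gene_id": row["id"], "symbol": row["sym"]}
    for row in S2:
        yield {"gene_id": row["locus"], "symbol": row["name"]}


# The mediator schema maps each global relation to its view definition.
GLOBAL_SCHEMA = {"Gene": gene_view}


def query(relation, predicate):
    """Reformulation is easy: unfold the view, then filter the tuples."""
    return [t for t in GLOBAL_SCHEMA[relation]() if predicate(t)]


result = query("Gene", lambda t: t["symbol"].startswith("X"))
```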
LAV Approach
Query reformulation is difficult, but adding or removing sources is easy
Appropriate for large-scale ad-hoc integration
ARIADNE, Discovery Link, TAMBIS, KIND, etc.
Mediator Schema
S1 S2 S3 S4
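A much simplified LAV sketch, assuming a hypothetical `SourceDescription` record that states which global relation, attributes, and organism a source covers. Real LAV reformulation has to rewrite a query over the mediator schema into a query over views, which is the hard part; this sketch only selects the sources whose descriptions cover the query. What it does show is that adding a source is a single registration and leaves the mediator schema and the other sources untouched.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SourceDescription:
    """Describes one source as a view over the global (mediator) schema."""
    name: str
    relation: str          # global relation this source is a view over
    attributes: frozenset  # global attributes the source can return
    organism: str          # the source only holds rows for this organism
    fetch: Callable[[], List[Dict[str, str]]]


REGISTRY: List[SourceDescription] = []


def register(source):
    """Adding a source only appends its description."""
    REGISTRY.append(source)


def answer(relation, wanted, organism):
    """Pick sources whose description covers the query, then union their rows."""
    rows = []
    for src in REGISTRY:
        covers = (
            src.relation == relation
            and set(wanted) <= src.attributes
            and src.organism == organism
        )
        if covers:
            rows.extend({a: r[a] for a in wanted} for r in src.fetch())
    return rows


register(SourceDescription(
    "S1", "Gene", frozenset({"gene_id", "symbol"}), "fly",
    lambda: [{"gene_id": "g1", "symbol": "ABC1"}],
))
before = answer("Gene", ["symbol"], "mouse")

# A new source joins without any change to the existing definitions.
register(SourceDescription(
    "S2", "Gene", frozenset({"gene_id", "symbol"}), "mouse",
    lambda: [{"gene_id": "g2", "symbol": "XYZ2"}],
))
after = answer("Gene", ["symbol"], "mouse")
```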
Navigational Integration
Some sources provide information that would be hard or impossible to access without point-and-click navigation
1. Name it: LocusLink
2. Name them: Link, LocusID, Org, Symbol, Description, respectively
3. Select appropriate transformations
4. Press <Update & Redraw> button
1. Select ‘LocusLink’ table
2. Type in ‘LocusLinkQuery’ as a query name
3. Check these fields to display
4. Double click here
Resource Discovery, Automatic Schema/Ontology Matching, Query Optimization, WorkFlows
LifeDB
BioFlow (A declarative WorkFlow Language)
Glimpse of BioFlow
GeneBankURL FlyBaseURL
DNA sequence repositories
EMBL format, GeneBank format
Combine these sequences
Reading Frame Predictor (input_seq : FASTA format, species)
Score and predicted DNA region
University of Minnesota
BioFlow
workflow open_reading_frame ;
use ontology BioSystems ;
declare found logical, count int ;
define data sequences_1 at GeneBankURL as (seq_1 DNA) ;
define tool orf at URL parameter (seq DNA, target organism)
    results (score int, predicted_region DNA) ;
combine sequences_1, sequences_2 into sequences (seqs) ;
select seqs, orf (seqs, “drosophila”) from sequences ;
The goal is to develop a formal BioFlow language syntax with compositionality, the closure property, and type safety
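The combine-then-select shape of the workflow above can be sketched in Python. This is not BioFlow and not the remote Reading Frame Predictor: the `orf` function here is a toy stand-in that returns the longest forward-strand region from an `ATG` to the first in-frame stop codon and uses its length as the score. The sequence data, the contents of `sequences_2`, and the helper names `combine` and `select` are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf(seq, organism):
    """Toy stand-in for the remote tool: returns (score, predicted_region).

    Finds the longest ATG..stop region on the forward strand. The organism
    argument is accepted to mirror the tool signature but is not used here.
    """
    best = ""
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the start codon to the first in-frame stop.
        for end in range(start + 3, len(seq) - 2, 3):
            if seq[end:end + 3] in STOP_CODONS:
                candidate = seq[start:end + 3]
                if len(candidate) > len(best):
                    best = candidate
                break
    return len(best), best


def combine(*collections):
    """Union several sequence collections into one, preserving order."""
    return [s for collection in collections for s in collection]


def select(sequences, organism):
    """Apply the tool to every sequence and return (seq, score, region) rows."""
    return [(s,) + orf(s, organism) for s in sequences]


# Hypothetical data; in the workflow these would be fetched from repositories.
sequences_1 = ["CCATGAAATAGGG"]
sequences_2 = ["TTTT"]

results = select(combine(sequences_1, sequences_2), "drosophila")
```

Because `select` returns plain rows, its output can be fed into another `combine` or `select` step, which is a small-scale picture of the closure and compositionality goals named above.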
Research Scope
Resource Discovery, Automatic Schema/Ontology Matching, Query Optimization, WorkFlows
7-8 PhD positions, 3-5 years of funding