
Supporting on-the-fly data Integration for bioinformatics

Page 1: Supporting  on-the-fly data Integration for bioinformatics

Supporting on-the-fly data Integration for bioinformatics

Candidate: Xuan Zhang

Advisor: Gagan Agrawal

Page 2: Supporting  on-the-fly data Integration for bioinformatics

Road Map

• Mission Statement

• Motivation

• Implementation

• Comprehensive Examples

• Future work

• Conclusion

Page 3: Supporting  on-the-fly data Integration for bioinformatics

Mission Statement

• Enhance information integration systems on
  – Functionality
    • On-the-fly data incorporation
    • Flat file data processing
  – Usability
    • Declarative interface
    • Low programming requirement

Page 4: Supporting  on-the-fly data Integration for bioinformatics

Motivation

• Integration is essential for biological research
  – Biological data include
    • Sequences: DNA (GenBank), protein (Swiss-Prot)
    • Structure: RNA (RNAbase), protein (PDB)
    • Interaction: pathway (KEGG), regulation (GRBase)
    • Function: disease (OMIM)
    • Secondary: protein family (Pfam)
  – Biological data are inter-related.

Page 5: Supporting  on-the-fly data Integration for bioinformatics

Motivation

• Challenges of bioinformatics integration
  – Data volume: overwhelming
    • DNA sequence: 100 gigabases (August, 2005)
  – Data growth: exponential

Figure provided by PDB

Page 6: Supporting  on-the-fly data Integration for bioinformatics

Motivation

• Challenges of bioinformatics integration (cont.)
  – Tools: many, and growing in number
  – Service interfaces: varied
    • Web pages
    • Web service
    • Grid service

Page 7: Supporting  on-the-fly data Integration for bioinformatics

Motivation

• Challenges of bioinformatics integration (cont.)
  – Inter-operability: low
    • Heterogeneous data sources
      – Semi-structured by nature
      – Flat file, relational, object-oriented databases
    • Independently developed tools
    • No data exchange standard
  – Little collaboration

Page 8: Supporting  on-the-fly data Integration for bioinformatics

Road Map

• Mission Statement

• Motivation

• Implementation
  – Approach Overview
  – Advantage
  – Components

• Future

• Conclusion

Page 9: Supporting  on-the-fly data Integration for bioinformatics

Approach Summary

• Metadata
  – Declarative description of data
  – Data mining algorithms for semi-automatic writing
  – Reusable by different requests on same data
• Code generation
  – Request analysis and execution separated
  – General modules with plug-in data module

Page 10: Supporting  on-the-fly data Integration for bioinformatics

System Overview

[Figure: Information Integration System. Understand Data: Data File, Layout Miner, Schema Miner, Metadata Description (one Layout Descriptor and one Schema Descriptor per dataset). Process Data: User Request, Code Generation, Request Processor, Answer.]

Page 11: Supporting  on-the-fly data Integration for bioinformatics

Advantages

• Simple interface
  – At metadata level, declarative
• General data model
  – Semi-structured data
  – Flat file data
• Low human involvement
  – Semi-automatic data incorporation
  – Low maintenance cost
• Acceptable performance
  – Linear scaling guaranteed

Page 12: Supporting  on-the-fly data Integration for bioinformatics

Road Map

• Mission Statement

• Motivation

• Implementation
  – Approach Overview
  – Advantage
  – Components

• Future

• Conclusion

Page 13: Supporting  on-the-fly data Integration for bioinformatics

System Components

• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Wrapper generation
  – Query processing
  – Query processing with indices

Page 14: Supporting  on-the-fly data Integration for bioinformatics

Layout Mining

• Goal 1: Separate delimiters from values
  – D-score: location & frequency
• Goal 2: Organize delimiters and values
  – NFA

[Figure: layout mining pipeline. Data File → Token Parser → Tokens → Delimiter Mining → Candidate Delimiters → Layout Learning → Layout Descriptor]

Page 15: Supporting  on-the-fly data Integration for bioinformatics

Schema Mining Road Map

• Schema Mining
  – Overview
  – Mining System
  – Core Mining Algorithm
  – Experiments

Page 16: Supporting  on-the-fly data Integration for bioinformatics

Schema Mining Goals

• Ultimate goal: discover schema about an unknown flat file dataset

• Immediate goal: Assign attributes with meaningful labels

Page 17: Supporting  on-the-fly data Integration for bioinformatics

Our Approach

• Summarize values from bottom up
• Use knowledge from
  – Ontology
  – Heuristics
• A heads-up: attribute label ≠ attribute name
  – What we can mine
    • date
  – What we cannot do
    • Creation date, last modification date, birthday, …

Page 18: Supporting  on-the-fly data Integration for bioinformatics

Schema Mining Road Map

• Schema Mining
  – Overview
  – Mining System
  – Core Mining Algorithm
  – Experiments

Page 19: Supporting  on-the-fly data Integration for bioinformatics

Schema Mining System

• Major Components
  – Data cleaning and summarization
  – Score calculation
    • Score function
    • Ontology
    • Heuristics
  – Score clustering

[Figure: schema mining pipeline. Raw attribute values → Value cleaning and summarization → Attribute summaries → Score calculation → Scores → Clustering algorithm → Cutoff values → Labeling → Attribute labels]

Page 20: Supporting  on-the-fly data Integration for bioinformatics

Data Summarization

• Goal: reduce amount of data
• Collect frequent tokens
  – Approximate frequent token mining algorithm
• Token categorization by profile
  – Token profile: an ordered list of N (numerical), A (alphabetic) and special characters
  – Token categories:
    • Word, number, else, and other user-defined categories
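
The token profile idea can be illustrated with a minimal sketch. It assumes that a run of digits collapses to a single N, a run of letters to a single A, and every other character is kept as-is; the function names `token_profile` and `token_category` are hypothetical, and the category rules use only the "number" profiles {N, N.N} listed in these slides.

```python
import re

# One run of digits or one run of ASCII letters (assumed definition of N and A).
_RUN = re.compile(r"[0-9]+|[A-Za-z]+")


def token_profile(token: str) -> str:
    """Collapse digit runs to 'N' and letter runs to 'A'; keep special characters."""
    return _RUN.sub(lambda m: "N" if m.group()[0].isdigit() else "A", token)


def token_category(token: str) -> str:
    """Sort a token into the built-in categories: word, number, else."""
    profile = token_profile(token)
    if profile == "A":
        return "word"
    if profile in ("N", "N.N"):
        return "number"
    return "else"
```

With this definition, a GenBank-style date such as `12-SEP-1993` has profile `N-A-N`, which is one of the "date" profiles used by the heuristics later in the talk.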

Page 21: Supporting  on-the-fly data Integration for bioinformatics

Score Function Template

• Desired properties
  – Simple
  – Adjustable trade-off between sensitivity and error tolerance

[Figure: score function template curve, score axis from 0.0 to 1.0; labels: F_pt, B_pt, t, Temperature]

Page 22: Supporting  on-the-fly data Integration for bioinformatics

Score Clustering

• Goal: Sort attributes into three groups, H (high), L (low) and M (middle), by scores

• Mathematically, find two scores, score_i and score_j, from {score_1, score_2, score_3, …, score_N}, to minimize the standard deviation

• N (number of attributes) is not large. Exact answer can be found.
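
Because N is small, the two cut points can be found by brute force. The sketch below is one possible reading, not the authors' code: it assumes the objective is the sum of the population standard deviations of the three groups and that every group must be non-empty. The function name `cluster_scores` is hypothetical.

```python
from itertools import combinations
from statistics import pstdev


def cluster_scores(scores):
    """Return (low, middle, high) groups by trying every pair of cut positions."""
    ordered = sorted(scores)
    if len(ordered) < 3:
        raise ValueError("need at least three scores to form three groups")
    best_cost, best_groups = None, None
    # i and j are the positions where the sorted list is cut (assumed objective:
    # minimize the summed within-group standard deviation).
    for i, j in combinations(range(1, len(ordered)), 2):
        groups = (ordered[:i], ordered[i:j], ordered[j:])
        cost = sum(pstdev(group) for group in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups
```

The search visits N(N-1)/2 cut pairs at most, which is why an exact answer is affordable when the number of attributes is small.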

Page 23: Supporting  on-the-fly data Integration for bioinformatics

Schema Mining Road Map

• Schema Mining
  – Overview
  – Mining System
  – Core Mining Algorithm
    • Mining with ontology
    • Mining with heuristics
  – Experiments

Page 24: Supporting  on-the-fly data Integration for bioinformatics

Use of Ontology

• An observation: a similarity between ontology and schema
  – Both satisfy the “is-a” relation
    • E.g. “Diabetes is a disease.”
    • Ontology: “diabetes” is a child of “disease”
    • Schema: “diabetes” is a valid instance of attribute “disease”
• Common ancestors in ontology ~ attribute label

Page 25: Supporting  on-the-fly data Integration for bioinformatics

Real-world Complications

• To find an arbitrary value in an ontology
  – Complete and comprehensive ontology?
    • Selective sampling
  – Error-free dataset?
    • Adjustable sensitivity & fault tolerance
• Performance

Page 26: Supporting  on-the-fly data Integration for bioinformatics

Ontology Database

• Goal: to approximate a complete, comprehensive ontology database
• Approach
  – “Complete”: sample popular terms
  – “Comprehensive”: public ontology databases + common facts
• Result
  – 6 major categories
  – 386 terms

Page 27: Supporting  on-the-fly data Integration for bioinformatics

Ontology Based Metrics (1)

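1. \[
\mathrm{Occurrence}(term) =
\begin{cases}
\mathrm{Frequent\_Count}[i], & \text{if } term = \mathrm{Frequent\_Token}[i] \\
\min_{i \in [0, t]} \mathrm{Frequent\_Count}[i], & \text{if } term = \mathrm{Frequent\_Token}[0] \,|\, \dots \,|\, \mathrm{Frequent\_Token}[t] \\
0, & \text{otherwise}
\end{cases}
\]

2. \[
\mathrm{Strength}(term) = \mathrm{Occurrence}(term) + \mathrm{Strength}(child\_term)
\]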
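
A small sketch of the two metrics above, under two stated assumptions: a multi-token term is treated as the sequence Frequent_Token[0] … Frequent_Token[t] and scored by its rarest token, and Strength adds up the strengths of all child terms. The dictionaries `frequent_counts` and `children` and both function names are hypothetical stand-ins for the frequent-token table and the ontology tree.

```python
def occurrence(term, frequent_counts):
    """Count of a single-token term, or the minimum count over a multi-token term."""
    tokens = term.lower().split()
    if not tokens:
        return 0
    # A token missing from the frequent-token table counts as 0 (the "otherwise" case).
    return min(frequent_counts.get(token, 0) for token in tokens)


def strength(term, children, frequent_counts):
    """Occurrence of the term plus the strength of every descendant (assumed: summed)."""
    total = occurrence(term, frequent_counts)
    for child in children.get(term, []):
        total += strength(child, children, frequent_counts)
    return total
```

For example, if "diabetes" occurs 5 times, "heart" 3 times and "disease" 2 times, then "heart disease" has occurrence 2 and the parent concept "disease" accumulates strength 2 + 5 + 2 = 9.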

Page 28: Supporting  on-the-fly data Integration for bioinformatics

Ontology Based Metrics (2)

• Two factors
  – Relative strength compared with other concepts
  – Completeness of ontology as a whole
• Ontology score = product of two factors
  – Each modulated by the template score function

Page 29: Supporting  on-the-fly data Integration for bioinformatics

Mining With Heuristics (1)

• Use token profile
  – “number”: {N, N.N}
  – “date”: {N-A-N, N/N/N}
• Use frequent token counts
  – “identification”: Frequent_Counts[] = 1
• Use other token information
  – “biological sequence”: length > 45, or in 10’s

Page 30: Supporting  on-the-fly data Integration for bioinformatics

Mining With Heuristics (2)

• Use token sequence information
  – “people name”: length (2~3), separator (“,” or “and”), profile (not number, date)
• Again, these counts are modulated by the template function to calculate scores

Page 31: Supporting  on-the-fly data Integration for bioinformatics

Schema Mining Road Map

• Schema Mining
  – Overview
  – Mining System
  – Core Mining Algorithm
  – Experiments

Page 32: Supporting  on-the-fly data Integration for bioinformatics

Schema Mining Experiment Design

• Datasets
  – GenBank, UniProt SWISSPROT and Pfam
• Cutoff values
  – Exact clustering
• Evaluation
  – Weighted Cohen’s Kappa
  – Compare groups most, middle and little with true labels Y (yes), P (partial) and N (no)
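
For readers unfamiliar with the evaluation metric, here is a minimal sketch of weighted Cohen's kappa for ordinal categories. The slides do not say which weighting scheme was used; linear disagreement weights are assumed here. Categories are encoded as integers (for instance little/N = 0, middle/P = 1, most/Y = 2), and the function name is hypothetical.

```python
def weighted_kappa(predicted, actual, n_categories=3):
    """Weighted Cohen's kappa with linear disagreement weights (an assumption)."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    k = n_categories
    observed = [[0] * k for _ in range(k)]
    for p, a in zip(predicted, actual):
        observed[p][a] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 for opposite ends
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row_totals[i] * col_totals[j] / n
    if expected_disagreement == 0:
        return 1.0  # every item falls in one category on both sides
    return 1.0 - observed_disagreement / expected_disagreement
```

A value of 1 means the mined groups agree perfectly with the true labels, 0 means agreement no better than chance, and partial mismatches (most vs. P) are penalized less than full mismatches (most vs. N).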

Page 33: Supporting  on-the-fly data Integration for bioinformatics

Result Summary: Kappa

[Figure: Kappa value per attribute label, with bands marked Very good, Good and Moderate]

Attribute labels: 1: cellular component, 2: database, 3: date, 4: free text, 5: ID, 6: molecule type, 7: name, 8: number, 9: organism, 10: publication method, 11: sequence

Page 34: Supporting  on-the-fly data Integration for bioinformatics

Cellular Component (O)

Page 35: Supporting  on-the-fly data Integration for bioinformatics

Date (H)

Page 36: Supporting  on-the-fly data Integration for bioinformatics

Organism Name (O)

Page 37: Supporting  on-the-fly data Integration for bioinformatics

Schema Mining Summary

• According to Kappa tests, results are good or very good

• Possible improvement
  – Clustering method with better intelligence
  – Better ontology database
  – More involved language analysis
  – Hybrid of bottom-up and top-down approaches

Page 38: Supporting  on-the-fly data Integration for bioinformatics

System Components

• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Metadata description language
  – Wrapper generation
  – Query processing
  – Query processing with indices

Page 39: Supporting  on-the-fly data Integration for bioinformatics

Data Process Overview

• Automatic code generation approach
• Input
  – Metadata about datasets involved
  – Optional:
    • Implicit data transformation task
    • Request by users
    • Indexing functions
• Output
  – Executable programs
    • General modules
    • Task-specific data module

Page 40: Supporting  on-the-fly data Integration for bioinformatics

Metadata Description

• Two aspects of data in flat files
  – Logical view of the data
  – Physical data organization
• Two components of every data descriptor
  – Schema description
  – Layout description
• Design goals
  – Powerful
  – Easy to write and interpret

Page 41: Supporting  on-the-fly data Integration for bioinformatics

Metadata Challenges

• Examples of sequence formats
  – ALN/ClustalW format
  – AMPS Block file format
  – ClustalW
  – Codata
  – EMBL
  – GCG/MSF
  – GDE
  – GenBank
  – Fasta (Pearson)
  – NBRF/PIR
  – PDB format
  – Pfam/Stockholm format
  – Phylip
  – Raw
  – RSF
  – UniProtKB/Swiss-Prot

List and example provided by EMBL-EBI

>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

{
name "Short name for sequence"
longname "Long (more descriptive) name for sequence"
sequence-ID "Unique ID number"
creation-date "mm/dd/yy hh:mm:ss"
direction [-1|1]
strandedness [1|2]
type [DNA|RNA||PROTEIN|TEXT|MASK]
offset (-999999,999999)
group-ID (0,999)
creator "Author's name"
descrip "Verbose description"
comments "Lines of comments that can be fairly arbitrary text about a sequence. Return characters are allowed, but no internal double quotes or brace characters. Remember to close with a double quote"
sequence "gctagctagctagctagctcttagctgtagtcgtagctgatgctagct
gatgctagctagctagctagctgatcgatgctagctgatcgtagctgacg
gactgatgctagctagctagctagctgtctagtgtcgtagtgcttattgc"
}

LOCUS       MMFOSB 4145 bp mRNA linear ROD 12-SEP-1993
DEFINITION  Mouse fosB mRNA.
ACCESSION   X14897
VERSION     X14897.1 GI:50991
KEYWORDS    fos cellular oncogene; fosB oncogene; oncogene.
SOURCE      Mus musculus.
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE   1 (bases 1 to 4145)
  AUTHORS   Zerial,M., Toschi,L., Ryseck,R.P., Schuermann,M., Muller,R. and
            Bravo,R.
  TITLE     The product of a novel growth factor activated gene, fos B,
            interacts with JUN proteins enhancing their DNA binding activity
  JOURNAL   EMBO J. 8 (3), 805-813 (1989)
  MEDLINE   89251612
  PUBMED    2498083
COMMENT     clone=AC113-1; cell line=NIH3T3.
FEATURES             Location/Qualifiers
     source          1..4145
                     /organism="Mus musculus"
                     /db_xref="taxon:10090"

     CDS             1202..2218
                     /note="fosB protein (AA 1-338)"
                     /codon_start=1
                     /protein_id="CAA33026.1"
                     /db_xref="GI:50992"
                     /db_xref="MGD:95575"
                     /db_xref="SWISS-PROT:P13346"
                     /translation="MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQEC
                     AGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGT
                     SYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRV
                     RRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAH
                     KPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNL
                     TASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPS
                     LLAL"
BASE COUNT  960 a 1186 c 1007 g 991 t 1 others
ORIGIN
        1 ataaattctt attttgacac tcaccaaaat agtcacctgg aaaacccgct ttttgtgaca
       61 aagtacagaa ggcttggtca catttaaatc actgagaact agagagaaat actatcgcaa
      121 actgtaatag acattacatc cataaaagtt tccccagtcc ttattgtaat attgcacagt
      181 gcaattgcta catggcaaac tagtgtagca tagaagtcaa agcaaaaaca aaccaaagaa
      241 aggagccaca agagtaaaac tgttcaacag ttaatagttc aaactaagcc attgaatcta
      301 tcattgggat cgttaaaatg aatcttccta caccttgcag tgtatgattt aacttttaca
      361 gaacacaagc caagtttaaa atcagcagta gagatattaa aatgaaaagg tttgctaata
      421 gagtaacatt aaataccctg aaggaaaaaa aacctaaata tcaaaataac tgattaaaat
      481 tcacttgcaa attagcacac gaatatgcaa cttggaaatc atgcagtgtt ttatttaaga
      541 aaacataaaa caaaactatt aaaatagttt tagagggggt aaaatccagg tcctctgcca
      601 ggatgctaaa attagacttc aggggaattt tgaagtcttc aattttgaaa cctattaaaa
      661 agcccatgat tacagttaat taagagcagt gcacgcaaca gtgacacgcc tttagagagc
      721 attactgtgt atgaacatgt tggctgctac cagccacagt caatttaaca aggctgctca
      781 gtcatgaact taatacagag agagcacgcc taggcagcaa gcacagcttg ctgggccact
      841 ttcctccctg tcgtgacaca atcaatccgt gtacttggtg tatctgaagc gcacgctgca
      901 ccgcggcact gcccggcggg tttctgggcg gggagcgatc cccgcgtcgc cccccgtgaa
      961 accgacagag cctggacttt caggaggtac agcggcggtc tgaaggggat ctgggatctt
     1021 gcagagggaa cttgcatcga aacttgggca gttctccgaa ccggagacta agcttccccg
     1081 agcagcgcac tttggagacg tgtccggtct actccggact cgcatctcat tccactcggc
     1141 catagccttg gcttcccggc gacctcagcg tggtcacagg ggcccccctg tgcccaggga
     1201 aatgtttcaa gcttttcccg gagactacga ctccggctcc cggtgtagct catcaccctc
     1261 cgccgagtct cagtacctgt cttcggtgga ctccttcggc agtccaccca ccgccgccgc
     1321 ctcccaggag tgcgccggtc tcggggaaat gcccggctcc ttcgtgccaa cggtcaccgc
     1381 aatcacaacc agccaggatc ttcagtggct cgtgcaaccc accctcatct cttccatggc
     1441 c

• Major Challenges:

1. Various representations

2. Semi-structured data

Page 42: Supporting  on-the-fly data Integration for bioinformatics

Schema Descriptors

• Follow XML DTD standard for semi-structured data

• Simple attribute list for relational data

<?xml version='1.0' encoding='UTF-8'?>
<!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT DESCRIPTION (#PCDATA)>
<!ELEMENT SEQ (#PCDATA)>

[FASTA]                //Schema name
ID = string            //Data type definitions
DESCRIPTION = string
SEQ = string

Page 43: Supporting  on-the-fly data Integration for bioinformatics

Layout Descriptors

• Overall structure (FASTA example)

DATASET "FASTAData" {          //Dataset name
  DATATYPE {FASTA}             //Schema name
  DATASPACE LINESIZE=80 {
    // ---- File layout details go here ----
  }
  DATA {osu/fasta}             //File location
}

Page 44: Supporting  on-the-fly data Integration for bioinformatics

File Layout

• Key observations on line-based biological data files
  – Strings of variable length
  – Delimiters widely used
  – Data fields may be divided into variables
  – Repetitive structures

>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …
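
To make the observations concrete, here is a hand-written reader for the sample above. It is only a sketch of what a generated program would have to do for this one format: it assumes a record starts with the ">" delimiter, that the first space separates ID from DESCRIPTION, and that the sequence may span several lines. The function name `read_fasta` is hypothetical.

```python
def read_fasta(text):
    """Return a list of (ID, DESCRIPTION, SEQ) tuples from FASTA-formatted text."""
    entries = []
    current = None  # [id, description, list of sequence lines]
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                entries.append((current[0], current[1], "".join(current[2])))
            # ">" ID " " DESCRIPTION: split the header on the first space.
            identifier, _, description = line[1:].partition(" ")
            current = [identifier, description, []]
        elif current is not None:
            # Repetitive structure: zero or more sequence lines per entry.
            current[2].append(line)
    if current is not None:
        entries.append((current[0], current[1], "".join(current[2])))
    return entries
```

Every new format would need another reader like this one, which is the manual effort the declarative layout descriptors are meant to remove.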

Page 45: Supporting  on-the-fly data Integration for bioinformatics

Layout Descriptors

• File layout (FASTA example)

DATASPACE LINESIZE=80 {
  <
    ">" ID " " DESCRIPTION
    < "\n" SEQ >
    "\n" | EOF
  >
}

>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …

Page 46: Supporting  on-the-fly data Integration for bioinformatics

System Component

• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Metadata description language
  – Wrapper generation
  – Query execution
  – Query execution with indices

Page 47: Supporting  on-the-fly data Integration for bioinformatics

Wrapper Generation Road Map

• Motivation and overview

• System structure

• Wrapper generation

• Wrapper execution

• Experiments

Page 48: Supporting  on-the-fly data Integration for bioinformatics

Wrapper Generation Motivation

• Wrappers are essential for bioinformatics integration
  – Heterogeneous data sources
  – Function: transform data
• Current solutions
  – Manually written wrappers
  – Scripts

Page 49: Supporting  on-the-fly data Integration for bioinformatics

Wrapper Generation Advantages

• Wrappers generated automatically
  – Stand-alone programs for integration systems and workflows
  – Little human intervention; new resources can be integrated on-the-fly
  – Direct transformation; no unnecessary intermediate form needed
  – Only requires data description at metadata level, one descriptor per data source
• Transfer data from flat files directly
  – No DB support required
  – No other domain or format heuristics

Page 50: Supporting  on-the-fly data Integration for bioinformatics

Wrapper Generation System Overview

[Figure: wrapper generation system. Inputs: Layout Descriptor, Schema Descriptors, Mapping File. Components: Layout Parser, Data Entry Representation, Mapping Generator, Schema Mapping, Mapping Parser, Application Analyzer. Output: WRAPINFO, plugged into the wrapper (DataReader, Synchronizer, DataWriter), which reads the Source Dataset and writes the Target Dataset.]

Page 51: Supporting  on-the-fly data Integration for bioinformatics

Layout Parse Tree

• FASTA example

DATASPACE LINESIZE=80 {
  <
    ">" ID " " DESCRIPTION
    < "\n" SEQ >
    "\n" | EOF
  >
}

DATASPACE root (linesize = 80)
  < >
    ">"-ID
    " "-DESCRIPTION
    < >
      "\n"-SEQ
    "\n"-DUMMY | EOF

Leaf: delimiter-variable (DLM-VAR) pair

Internal node: environment

Page 52: Supporting  on-the-fly data Integration for bioinformatics

Schema Mapping

• Algorithm: strict name matching

  for field ft in target schema
    for field fs in source schema
      if ft = fs then add pair (fs, ft) to the mapping

• Output
  – A list of attribute pairs
  – An editable file for the user to verify and modify
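
The strict name matching loop above translates almost directly into Python. This is an illustrative sketch; the function name `map_schemas` and the representation of a schema as a list of field names are assumptions.

```python
def map_schemas(source_fields, target_fields):
    """Pair every target field with the source field of exactly the same name."""
    mapping = []
    for target_field in target_fields:
        for source_field in source_fields:
            if target_field == source_field:
                mapping.append((source_field, target_field))
    return mapping
```

Target fields with no namesake in the source are simply left unmapped, which is why the generated mapping is written to an editable file for the user to check and complete.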

Page 53: Supporting  on-the-fly data Integration for bioinformatics

Wrapping Assumptions

• Convert semi-structured (and structured) data to structured data

• Both datasets are stored record-wise

• Order of records not disturbed after wrapping

Semi-structured → Structured

Data can be transformed entry by entry

Page 54: Supporting  on-the-fly data Integration for bioinformatics

Application Analyzer

• Task: to generate clear directions for wrapper and organize them in WRAPINFOR

• Sub-tasks
  – What values to store
  – How to extract values
  – How to store values
  – How to write values

Page 55: Supporting  on-the-fly data Integration for bioinformatics

Important Concepts (1)

• “Useful”
  – An attribute is useful iff its values are in the target
• “Reachable”
  – Node b is reachable from node a if there exists a valid layout configuration such that a.DLM and b.DLM define the boundaries of a.VAR, i.e. “… a.DLM a.VAR b.DLM …”
  – A value instance lies between
    • Its own delimiter
    • The first appearance of one of its reachable delimiters
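
The "reachable" rule can be sketched as a small extraction routine: a value starts right after its own delimiter and ends at the earliest occurrence of any reachable delimiter. The function name and signature are hypothetical, and treating end-of-text as a boundary when no reachable delimiter is found is an assumption.

```python
def extract_value(text, start, own_delimiter, reachable_delimiters):
    """Return (value, end_position), or None if own_delimiter is not found."""
    begin = text.find(own_delimiter, start)
    if begin < 0:
        return None
    begin += len(own_delimiter)
    # The value ends at the first appearance of any reachable delimiter.
    candidates = [text.find(d, begin) for d in reachable_delimiters]
    candidates = [position for position in candidates if position >= 0]
    end = min(candidates) if candidates else len(text)  # assumed: EOF closes the value
    return text[begin:end], end
```

For the FASTA layout, ID is bounded by ">" and " ", DESCRIPTION by " " and "\n", and each SEQ line by "\n" and the next "\n".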

Page 56: Supporting  on-the-fly data Integration for bioinformatics

Important Concepts (2)

• Attribute cardinality
  – Regular attribute: fixed number of values per entry
    • ID
  – Semi-structured attribute: varied number of values per entry
    • References

Page 57: Supporting  on-the-fly data Integration for bioinformatics

WRAPINFOR

• Contents: information to answer a particular wrapping task

• Form: XML
  – 5 look-up tables
    • Delimiter, Usefulness, Cardinality, Label, Reachable
  – 3 parameters
    • one_to_one_total, one_to_multiple_total, complete_in

• Function: plug into general modules to form a functional wrapper

Page 58: Supporting  on-the-fly data Integration for bioinformatics

Wrapper Generation Road Map

• Motivation and overview of our approach

• System structure

• Wrapper generation

• Wrapper execution

• Experiments

Page 59: Supporting  on-the-fly data Integration for bioinformatics

Wrapper Overview

[Figure: wrapper overview. Input dataset → Dataset buffer → DataReader → Value buffer (one_to_one_values, one_to_multiple_values) → DataWriter → Output dataset, coordinated by the Synchronizer; control labels: load, run, FARA, RA, halt]

Page 60: Supporting  on-the-fly data Integration for bioinformatics

Wrapper Structure

• One data module: WRAPINFO

• Three general action modules
  – Synchronizer: central controller
  – DataReader, DataWriter: interact with datasets

• One value buffer

• Suitable for data grid

• Transform data one entry at a time

Page 61: Supporting  on-the-fly data Integration for bioinformatics

Wrapper Execution

• DataReader
  – Extract attribute value
    • Delimiter table + Reachable table
  – Fill value buffer: Label look-up table
• DataWriter
  – Retrieve from value buffer: Label look-up table
  – Write target file
    • Delimiter table + Reachable table + Label table
• Synchronizer
  – Call DataReader on source: parameters
  – Call DataWriter on target: parameters

Page 62: Supporting  on-the-fly data Integration for bioinformatics

Wrapper Experiments (1)

TRANSFAC-to-Reference Problem

[Figure: both axes in logarithmic scale]

• Analysis time constant
• Execution time linear

Page 63: Supporting  on-the-fly data Integration for bioinformatics

Wrapper Experiments (2)

SWISSPROT-to-FASTA Problem

• Performance comparable to handwritten code

Page 64: Supporting  on-the-fly data Integration for bioinformatics

System Components

• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Metadata description language
  – Wrapper generation
  – Query execution
  – Query execution with indices

Page 65: Supporting  on-the-fly data Integration for bioinformatics

Query Execution Road Map

• Motivation

• System Overview

• System Implementation
  – Languages
  – System

• Experiments

Page 66: Supporting  on-the-fly data Integration for bioinformatics

Limitation of Wrapper

• Data wrapping = data formatting + data projection
• Other query types
  – Selection
  – Cross product
  – Join
• New functionalities
  – Value examination
  – Multiple datasets

Page 67: Supporting  on-the-fly data Integration for bioinformatics

Advantages

• Retrieve multiple pieces of information all at once

• Data easily available

• Declarative languages only

• High flexibility

• Low over-head

• Suitable for data grid

Page 68: Supporting  on-the-fly data Integration for bioinformatics

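System Enhanced

[Figure: enhanced system. Query analysis: the query goes to the Query parser, which passes source/target names to the Metadata collection; Dataset descriptors go to the Descriptor parser; schema and layout information plus mappings go to the Application analyzer, which outputs QUERYINFOR. Query execution: QUERYINFOR drives the DataReader, Synchronizer and DataWriter, which read the Source data files and write the Target data file.]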

Page 69: Supporting  on-the-fly data Integration for bioinformatics

Query Execution Road Map

• Motivation
• System Overview
• System Implementation
  – Languages
    • Metadata Description Language
    • Query Language
  – System
    • Query Analysis
    • Query Execution
• Experiments

• Experiments

Page 70: Supporting  on-the-fly data Integration for bioinformatics

Query Language

• Declarative, SQL-like
• Projection, selection, cross product, join queries
• Example

AUTOWRAP POSTBLAST
FROM BLASTP, SWISSPROT
BY BLASTP.SP_ID = SWISSPROT.ID
WHERE
  POSTBLAST.QUERY = BLASTP.QUERY
  POSTBLAST.SP_AC = BLASTP.SP_AC
  POSTBLAST.SP_ID = BLASTP.SP_ID
  POSTBLAST.FULL_DESCR = SWISSPROT.DE
  POSTBLAST.SEQUENCE = SWISSPROT.SQ
  POSTBLAST.SCORE = BLASTP.SCORE
  POSTBLAST.E_VALUE = BLASTP.E_VALUE

Annotations: AUTOWRAP names the target dataset, FROM lists the source datasets, BY gives the join criteria, and WHERE lists the attribute pairs.

Page 71: Supporting  on-the-fly data Integration for bioinformatics

Application Analyzer Enhancement

• Constant values in query
  – Pseudo-label look-up table
• Other query information
  – Parameters: comparing field pairs
• Output: QUERYINFOR

Page 72: Supporting  on-the-fly data Integration for bioinformatics

Query Execution

• Query-Proc Structure
• DataReader and DataWriter
  – Similar to wrapper
• Value buffer
  – Store useful values from one data entry of every source dataset

[Figure: Query-Proc structure. QUERYINFOR; DataReader, Synchronizer, DataWriter; Source data files; Target data file]

Page 73: Supporting  on-the-fly data Integration for bioinformatics

Enhanced Synchronizer

• Synchronizer
  – Set up pseudo-attributes: Pseudo-label look-up table
  – Call DataReader on source 1 and 2; call DataWriter on target: Parameters
  – Test join conditions: Parameters
  – Clean value buffer: Parameters

Page 74: Supporting  on-the-fly data Integration for bioinformatics

Post-BLAST Query

• Goal: Enhance BLAST output to FASTA format
• Query: Join query between BLAST output (source 1) and SWISSPROT (source 2)
• 2 modes
  – UNIQUE: halt once a match is found in source 2
  – ALL: search all source 2 entries

[Figure: Time (sec), 0 to 2000, versus Query Size (Sequence Number: 3, 5, 12), for UNIQUE and ALL modes]
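
The difference between the two modes can be shown with a plain nested-loop join. This is a sketch under assumptions, not the generated query program: entries are represented as dictionaries, and the function name and parameters are hypothetical.

```python
def join_entries(source1, source2, key1, key2, mode="ALL"):
    """Nested-loop join; return matching (source1 entry, source2 entry) pairs."""
    if mode not in ("UNIQUE", "ALL"):
        raise ValueError("mode must be 'UNIQUE' or 'ALL'")
    pairs = []
    for left in source1:
        for right in source2:
            if left[key1] == right[key2]:
                pairs.append((left, right))
                if mode == "UNIQUE":
                    break  # stop scanning source 2 after the first match
    return pairs
```

UNIQUE saves time only when a match exists and appears early in source 2; ALL always scans source 2 completely for every source 1 entry.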

Page 75: Supporting  on-the-fly data Integration for bioinformatics

Chip-Supplement Query

• Goal: Look up information on microarray genes and output it in tabular format
• Query: Join query between protein array and yeast genome database
• 2 queries
  – Chip-Supplement:
    • array join genome
  – Chip-Supplement-Sorted:
    • genome join array

[Figure: Time (sec), 0 to 90, by Query Type (Chip-Supplement, Chip-Supplement-Sorted), for UNIQUE and ALL modes]

Page 76: Supporting  on-the-fly data Integration for bioinformatics

OMIM-Plus Query

• Add reverse links of proteins to disease database

• Join query between OMIM database and SWISSPROT database

• Results in OMIM form

• 86.38 seconds/entry * 12,158 OMIM entries = 291.7 hours
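
Spelled out, the estimate above works as follows (the conversion to days is added here for scale):

```latex
\[
86.38\ \tfrac{\text{s}}{\text{entry}} \times 12{,}158\ \text{entries}
\approx 1.05 \times 10^{6}\ \text{s}
= \frac{1.05 \times 10^{6}}{3600}\ \text{h}
\approx 291.7\ \text{h}
\approx 12.2\ \text{days}
\]
```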

Page 77: Supporting  on-the-fly data Integration for bioinformatics

System Components

• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Metadata description language
  – Wrapper generation
  – Query execution
  – Query execution with indices

Page 78: Supporting  on-the-fly data Integration for bioinformatics

Query with Indices Road Map

• Motivation and Overview

• System

• System Enhancement
  – Language
  – System Implementation

• Experiments

Page 79: Supporting  on-the-fly data Integration for bioinformatics

Query With Indices Motivation

• Goal
  – Improve the performance of the query-proc program
    • Index
  – Maintain the advantages
    • Flat file based
    • Low requirement on programming

Page 80: Supporting  on-the-fly data Integration for bioinformatics

Challenges & Approaches

• Various indexing algorithms for various biological data
  – User-defined indexing functions
  – Standard function interfaces
• Flat file data
  – Values parsed implicitly and ready to be indexed
  – Byte offset as pointer
• Metadata about indices
  – Layout descriptor

Page 81: Supporting  on-the-fly data Integration for bioinformatics

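System Revisit

[Figure: system with index support. Query analysis: the query goes to the Query parser, which passes source/target names to the Metadata collection; Dataset descriptors go to the Descriptor parser; schema and layout information plus mappings go to the Application analyzer, which outputs QUERYINFOR. Query execution: QUERYINFOR drives the DataReader, Synchronizer and DataWriter, which read the Source data files and write the Target data file. New elements: Index file, Index functions.]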

Page 82: Supporting  on-the-fly data Integration for bioinformatics

Language Enhancement

• Describe indices
  – Indexing is a property of a dataset
  – Extend layout descriptors

DATASET "name" {
  …
  INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
         [, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}
}

  – Maintain query format

AUTOWRAP GNAMES
FROM CHIPDATA, YEASTGENOME
BY CHIPDATA.GENE = YEASTGENOME.ID
WHERE …

New meaning of "=":
  If an index is available, use the index retrieving function
  Else, compare values directly

Page 83: Supporting  on-the-fly data Integration for bioinformatics

System Enhancement

• Metadata Descriptor Parser
  + parse index information
• Application Analyzer
  + index information: index look-up table
  + test condition: compare_field_indexing

Page 84: Supporting  on-the-fly data Integration for bioinformatics

Query-Proc Enhancement

• Synchronizer
  + If an index is applicable, check availability of the index data file
    • If not available, call the index generation function
  + Load indices
  + Call the index retrieving function first for the candidate entry list
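
A byte-offset index over a flat file can be sketched in a few lines. The example below is illustrative only: it indexes FASTA-style headers, keeps the index in memory instead of an index file, and all function names are hypothetical rather than the system's actual index generation and retrieving interfaces.

```python
def fasta_key(line):
    """Return the entry ID if the raw line is a FASTA header, else None."""
    if line.startswith(b">"):
        fields = line[1:].split()
        return fields[0].decode() if fields else None
    return None


def build_offset_index(path, key_of):
    """Index generation: map each key to the byte offsets where its entries start."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            key = key_of(line)
            if key is not None:
                index.setdefault(key, []).append(offset)
    return index


def fetch_entry(path, offset):
    """Read one entry starting at a byte offset, up to the next header or EOF."""
    lines = []
    with open(path, "rb") as handle:
        handle.seek(offset)
        lines.append(handle.readline().decode().rstrip("\n"))
        for line in handle:
            if line.startswith(b">"):
                break
            lines.append(line.decode().rstrip("\n"))
    return lines


def find_entries(path, key, index=None):
    """Index retrieval if an index is given; otherwise fall back to a full scan."""
    if index is None:
        index = build_offset_index(path, fasta_key)  # scans and compares every header
    return [fetch_entry(path, offset) for offset in index.get(key, [])]
```

Once built, the index lets each join probe jump straight to the candidate entries with `seek`, while the data itself stays in the original flat file.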

Page 85: Supporting  on-the-fly data Integration for bioinformatics

Microarray Gene Information Look-up

• Goal: gather information about genes (120)

• Query: microarray output join genome database

• Index: gene names in genome

[Figure: Performance (sec), 0 to 90. Query analysis: 0.01; index generation: 0.72; query with indices: 20.89; query w/o indices: 81.59]

Page 86: Supporting  on-the-fly data Integration for bioinformatics

BLAST-ENHANCE Query

• Goal: Add extra information to BLAST output

• Query: BLAST output join Swiss-Prot database

• Index: protein ID in Swiss-Prot

[Figure: Performance (sec), 0 to 1200, for query sizes 3, 5 and 12; series: index generation, query w/ indices, query w/o indices]

Page 87: Supporting  on-the-fly data Integration for bioinformatics

OMIM-PLUS Query

• Goal: add Swiss-Prot link to OMIM

• Query: OMIM join Swiss-Prot

• Index: protein ID in Swiss-Prot

[Figure: Performance (sec), logarithmic axis from 1 to 10000000; series: index generation, query w/ indices, query w/o indices]

Page 88: Supporting  on-the-fly data Integration for bioinformatics

Homology Search Query

• Goal: find similar sequences

• Query: query sequence list * sequence database

• Indexing algorithm
  – Sequence-based
  – Transformation of sub-string composition
  – Indexing n-D numerical values

Page 89: Supporting  on-the-fly data Integration for bioinformatics

Homology Search (1)

• Index (Singh’s algorithm)
  – Data: yeast genome
  – Wavelet coefficients
  – Minimum bounding rectangles

[Figure: Performance (sec), 0 to 350, versus Database size (9.8MB), 1 to 5; series: Index generation, 10, 20, 40]

Page 90: Supporting  on-the-fly data Integration for bioinformatics

Homology Search (2)

• Index (Ferhatosmanoglu’s algorithm)
  – Data: GenBank
  – Wavelet coefficients
  – Scalar quantization
  – R-tree

[Figure: Performance (sec), 0 to 30, versus Database size (250MB), 1 to 5; series: 10, 20, 40]

Page 91: Supporting  on-the-fly data Integration for bioinformatics

Road Map

• Mission Statement

• Motivation

• Implementation

• Comprehensive Example

• Future work

• Conclusion

Page 92: Supporting  on-the-fly data Integration for bioinformatics

Gene Name Nomenclature

• It is crucial to identify genes CORRECTLY and UNAMBIGUOUSLY
  – Genes with multiple names
  – Multiple genes sharing the same name
• Historically, little central control over the naming process

“…As biologists strive to make sense of the growing wealth of genomic information, this messy nomenclature is becoming a bugbear…”

Helen Pearson, Nature, 2001

Page 93: Supporting  on-the-fly data Integration for bioinformatics

Gene Name in DBs

• Databases related to genes
  – Genome databases (main force in nomenclature)
    • SGD (yeast)
    • HGNC (human)
    • TAIR (a plant)
    • dictyBase (a one-cell amoeba)
  – Curated gene databases
    • Entrez Gene by NCBI
  – Curated gene product databases
    • Swiss-Prot by SIB and EBI

Page 94: Supporting  on-the-fly data Integration for bioinformatics

Queries About Gene Name

• Gene identifier usage in databases
  – How are gene symbols in DB A used in DB B?
  – How are gene aliases in DB A used in DB B?
• Nomenclature across species
  – Q1-Q2: genome – Entrez Gene, Swiss-Prot
  – Q3-Q4: Entrez Gene – Swiss-Prot
• Nomenclature over time
  – Q5-Q7: Swiss-Prot – genome

Page 95: Supporting  on-the-fly data Integration for bioinformatics

Challenges

• Various data representation
  – Line-based texts
  – Tabular forms with or without title
  – Format evolves over time
• Data storage
  – Large volume
  – Each file queried limited times

Approaches: metadata descriptors; format and schema learning; flat file processing

Page 96: Supporting  on-the-fly data Integration for bioinformatics

Integration System Revisit

[Figure: Information Integration System. Understand Data: Data File (Genome, Entrez Gene, Swiss-Prot), Layout Miner, Schema Miner, Metadata Description (one Layout Descriptor and one Schema Descriptor per dataset). Process Data: User Request (join queries), Code Generation, Query Processor.]

Page 97: Supporting  on-the-fly data Integration for bioinformatics

Nomenclature Results (1)

• Across Species

[Figure Q1-Q2: Percentage (%), 0 to 90, for Entrez Gene ID, Entrez Gene Alias, Swiss-Prot ID and Swiss-Prot Alias; series: SGD, HGNC, TAIR, dictyBase]

[Figure Q3-Q4: Percentage (%), 0 to 60, for Swiss-Prot ID and Swiss-Prot Alias; series: SGD, HGNC, TAIR, dictyBase]

Page 98: Supporting  on-the-fly data Integration for bioinformatics

Nomenclature Results (2)

• Over time

Q5: How many gene IDs in Swiss-Prot are gene IDs in genome?
Q6: How many gene IDs in Swiss-Prot are aliases in genome?
Q7: How many gene aliases in Swiss-Prot are gene IDs in genome?

Page 99: Supporting  on-the-fly data Integration for bioinformatics

Performance

• Linear w.r.t. source 1 size

Page 100: Supporting  on-the-fly data Integration for bioinformatics

Conclusion

• A framework and a set of tools for on-the-fly flat file data integration
  – New data sources understood semi-automatically by data mining tools
  – New data processed automatically by generated programs
• Advantages
  – High-level interface, flat file based, acceptable performance, low maintenance cost