Supporting on-the-fly data Integration for bioinformatics

Candidate: Xuan Zhang

Advisor: Gagan Agrawal

Road Map

• Mission Statement

• Motivation

• Implementation

• Comprehensive Examples

• Future work

• Conclusion

Mission Statement

• Enhance information integration systems on
  – Functionality
    • On-the-fly data incorporation
    • Flat file data processing
  – Usability
    • Declarative interface
    • Low programming requirement

Motivation

• Integration is essential for biological research
  – Biological data include
    • Sequences: DNA (GenBank), protein (Swiss-Prot)
    • Structure: RNA (RNAbase), protein (PDB)
    • Interaction: pathway (KEGG), regulation (GRBase)
    • Function: disease (OMIM)
    • Secondary: protein family (Pfam)
  – Biological data are inter-related

Motivation

• Challenges of bioinformatics integration
  – Data volume: overwhelming
    • DNA sequence: 100 gigabases (August 2005)
  – Data growth: exponential

[Figure: data growth chart, provided by PDB]

Motivation

• Challenges of bioinformatics integration (cont.)
  – Tools: many, and more keep appearing
  – Service interfaces: varied
    • Web pages
    • Web services
    • Grid services

Motivation

• Challenges of bioinformatics integration (cont.)
  – Inter-operability: low
    • Heterogeneous data sources
      – Semi-structured by nature
      – Flat file, relational, object-oriented databases
    • Independently developed tools
    • No data exchange standard
  – Little collaboration

Road Map


• Motivation
• Implementation
  – Approach Overview
  – Advantage
  – Components
• Future
• Conclusion

Approach Summary

• Metadata
  – Declarative description of data
  – Data mining algorithms for semi-automatic writing
  – Reusable by different requests on same data
• Code generation
  – Request analysis and execution separated
  – General modules with plug-in data module

System Overview

[Figure: architecture of the information integration system, in two stages. Understand data: the Layout Miner and Schema Miner read a data file and produce its metadata description (a layout descriptor and a schema descriptor). Process data: code generation uses the metadata description and the user request to build a request processor, which reads the data file and returns the answer.]

Advantages

• Simple interface
  – At metadata level, declarative
• General data model
  – Semi-structured data
  – Flat file data
• Low human involvement
  – Semi-automatic data incorporation
  – Low maintenance cost
• Acceptable performance
  – Linear scaling guaranteed


System Components

• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Wrapper generation
  – Query process
  – Query process with indices

Layout Mining

• Goal 1: Separate delimiters from values
  – D-score: location & frequency
• Goal 2: Organize delimiters and values
  – NFA

[Figure: layout mining pipeline. Data File → Token Parser → Tokens → Delimiter Mining → Candidate Delimiters → Layout Learning → Layout Descriptor]

Schema Mining Road Map

• Schema Mining
  – Overview
  – Mining System
  – Core Mining Algorithm
  – Experiments

Schema Mining Goals

• Ultimate goal: discover the schema of an unknown flat file dataset

• Immediate goal: assign meaningful labels to attributes

Our Approach

• Summarize values from bottom up
• Use knowledge from
  – Ontology
  – Heuristics
• A heads-up: attribute label ≠ attribute name
  – What we can mine
    • date
  – What we cannot do
    • Creation date, last modification date, birthday, …



Schema Mining System

• Major Components
  – Data cleaning and summarization
  – Score calculation
    • Score function
    • Ontology
    • Heuristics
  – Score clustering

[Figure: schema mining pipeline. Raw attribute values → Value cleaning and summarization → Attribute summaries → Score calculation → Scores → Clustering algorithm → Cutoff values → Labeling → Attribute labels]

Data Summarization

• Goal: reduce amount of data
• Collect frequent tokens
  – Approximate frequent token mining algorithm
• Token categorization by profile
  – Token profile: an ordered list of N (numerical), A (alphabetic) and special characters
  – Token categories:
    • Word, number, else and other user-defined categories
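
The token profile idea can be illustrated with a minimal Python sketch. It assumes that a run of digits collapses to a single N and a run of letters to a single A, which matches profiles such as N-A-N used later in the heuristics but is not spelled out on the slide. The function names `token_profile` and `categorize` are hypothetical.

```python
import re


def token_profile(token: str) -> str:
    """Collapse digit runs to 'N' and letter runs to 'A'; keep special characters.

    Assumption: runs are collapsed, so "1993" and "12" both map to "N".
    """
    def collapse(match: re.Match) -> str:
        return "N" if match.group(0)[0].isdigit() else "A"

    # One pass over the token, so an inserted "N" is never re-read as a letter.
    return re.sub(r"[0-9]+|[A-Za-z]+", collapse, token)


def categorize(token: str) -> str:
    """Map a token to one of the built-in categories: word, number, else."""
    profile = token_profile(token)
    if profile == "A":
        return "word"
    if profile in {"N", "N.N"}:
        return "number"
    return "else"
```

For example, `token_profile("12-SEP-1993")` yields `N-A-N`, and `categorize("X14897")` falls into the `else` category because its profile is `AN`.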

Score Function Template

• Desired properties
  – Simple
  – Adjustable trade-off between sensitivity and error tolerance

[Figure: score function curve, with the score axis running from 0.0 to 1.0; labels on the plot: F_pt, B_pt, t, Temperature]

Score Clustering

• Goal: Sort attributes into three groups, H (high), L (low) and M (middle), by scores

• Mathematically, find two scores, score_i and score_j, from {score_1, score_2, score_3, …, score_N}, to minimize the standard deviation

• N (number of attributes) is not large. Exact answer can be found.
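
Because N is small, the exact search can be sketched as a brute-force loop over every pair of cut positions. The objective used here, the sum of the population standard deviations of the three groups, is an assumption; the slide only says "minimize the standard deviation". The name `cluster_scores` is hypothetical.

```python
from itertools import combinations
from statistics import pstdev


def cluster_scores(scores):
    """Split scores into (low, middle, high) groups by exhaustive search.

    Assumed objective: minimize the sum of within-group standard deviations.
    """
    ordered = sorted(scores)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three scores to form three groups")
    best_cost, best_groups = None, None
    # i and j are the positions where the sorted list is cut; 1 <= i < j <= n-1
    # keeps all three groups non-empty.
    for i, j in combinations(range(1, n), 2):
        groups = (ordered[:i], ordered[i:j], ordered[j:])
        cost = sum(pstdev(group) for group in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups
```

The search examines (N-1)(N-2)/2 candidate splits, which stays cheap for the handful of attributes in a typical flat file.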


• Schema Mining
  – Overview
  – Mining System
  – Core Mining Algorithm
    • Mining with ontology
    • Mining with heuristics
  – Experiments

Use of Ontology

• An observation: a similarity between ontology and schema
  – Both satisfy the “is-a” relation
    • E.g., “Diabetes is a disease.”
    • Ontology: “diabetes” is a child of “disease”
    • Schema: “diabetes” is a valid instance of attribute “disease”
• Common ancestors in ontology ~ attribute label

Real-world Complications

• To find an arbitrary value in an ontology
  – Complete and comprehensive ontology?
    • Selective sampling
  – Error-free dataset?
    • Adjustable sensitivity & fault tolerance
• Performance

Ontology Database

• Goal: to approximate a complete, comprehensive ontology database
• Approach
  – “Complete”: sample popular terms
  – “Comprehensive”: public ontology databases + common facts
• Result
  – 6 major categories
  – 386 terms

Ontology Based Metrics (1)

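1. \[
\mathrm{Occurrence}(term) =
\begin{cases}
\mathrm{Frequent\_Count}[i], & \text{if } term = \mathrm{Frequent\_Token}[i] \\
\min_{i \in [0, t]} \mathrm{Frequent\_Count}[i], & \text{if } term = \mathrm{Frequent\_Token}[0] \mid \dots \mid \mathrm{Frequent\_Token}[t] \\
0, & \text{otherwise}
\end{cases}
\]

2. \[ \mathrm{Strength}(term) = \mathrm{Occurrence}(term) + \mathrm{Strength}(child\_term) \]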
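
A minimal Python sketch of the two metrics follows. It rests on two assumptions that the slide does not state: a multi-word term counts as matched when every one of its words is a frequent token, and Strength sums over all children of a term. The dictionaries `frequent_counts` and `children` are hypothetical inputs.

```python
def occurrence(term, frequent_counts):
    """Occurrence of an ontology term among the frequent tokens of an attribute."""
    if term in frequent_counts:
        return frequent_counts[term]
    words = term.split()
    # Assumed reading of the second case: a term made of several frequent
    # tokens occurs as often as its rarest token.
    if len(words) > 1 and all(word in frequent_counts for word in words):
        return min(frequent_counts[word] for word in words)
    return 0


def strength(term, children, frequent_counts):
    """Occurrence of the term plus the strength of its descendants.

    Assumption: the contribution is summed over every child of the term.
    """
    return occurrence(term, frequent_counts) + sum(
        strength(child, children, frequent_counts)
        for child in children.get(term, ())
    )
```

With `frequent_counts = {"diabetes": 4, "breast": 5, "cancer": 3}` and `children = {"disease": ["diabetes", "breast cancer"]}`, the concept "disease" never appears in the data itself but accumulates strength from its children, which is how a common ancestor becomes a candidate attribute label.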

Ontology Based Metrics (2)

• Two factors
  – Relative strength compared with other concepts
  – Completeness of ontology as a whole
• Ontology score = product of two factors
  – Each modulated by the template score function

Mining With Heuristics (1)

• Use token profile
  – “number”: {N, N.N}
  – “date”: {N-A-N, N/N/N}
• Use frequent token counts
  – “identification”: Frequent_Counts[]=1
• Use other token information
  – “biological sequence”: length >45, or in 10’s

Mining With Heuristics (2)

• Use token sequence information
  – “people name”: length (2~3), separator (“,” or “and”), profile (not number, date)
• Again, these counts are modulated by the template function to calculate scores
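
The heuristics can be sketched as simple counters over an attribute's values. This is an illustration under several assumptions: profiles collapse digit and letter runs, "identification" is read as every value being unique, and only the "length > 45" half of the sequence rule is implemented because the "in 10's" clause is not defined precisely on the slide. The raw counts would still have to pass through the template score function, whose exact form is not given here.

```python
import re


def profile(token):
    """Collapse digit runs to N and letter runs to A (assumed profile rule)."""
    return re.sub(
        r"[0-9]+|[A-Za-z]+",
        lambda m: "N" if m.group(0)[0].isdigit() else "A",
        token,
    )


def heuristic_counts(values):
    """Count how many values of one attribute support each heuristic label."""
    counts = {"number": 0, "date": 0, "biological sequence": 0, "identification": 0}
    for value in values:
        p = profile(value)
        if p in {"N", "N.N"}:
            counts["number"] += 1
        if p in {"N-A-N", "N/N/N"}:
            counts["date"] += 1
        if len(value) > 45:
            counts["biological sequence"] += 1
    # Assumed reading of Frequent_Counts[]=1: no value occurs more than once.
    if len(set(values)) == len(values):
        counts["identification"] = len(values)
    return counts
```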



Schema Mining Experiment Design

• Datasets
  – GenBank, UniProt SWISSPROT and Pfam
• Cutoff values
  – Exact clustering
• Evaluation
  – Weighted Cohen’s Kappa
  – Compare group most, middle and little with true label Y (yes), P (partial) and N (no)
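
For reference, a weighted Cohen's Kappa over three ordered categories can be computed as below. The linear disagreement weights are an assumption, since the slides do not say which weighting was used, and the inputs are taken to be ordinal ranks (0 for little/N, 1 for middle/P, 2 for most/Y).

```python
def weighted_kappa(predicted, actual, k=3):
    """Linearly weighted Cohen's Kappa for ordinal ranks in range(k).

    Undefined (division by zero) when both raters use a single category.
    """
    n = len(predicted)
    observed = [[0.0] * k for _ in range(k)]
    for p, a in zip(predicted, actual):
        observed[p][a] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Disagreement weight grows linearly with the distance between ranks.
    weight = lambda i, j: abs(i - j) / (k - 1)
    disagree_observed = sum(
        weight(i, j) * observed[i][j] for i in range(k) for j in range(k)
    )
    disagree_expected = sum(
        weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k)
    )
    return 1 - disagree_observed / disagree_expected
```

Perfect agreement gives 1, chance-level agreement gives 0, and confusing adjacent groups is penalized less than confusing the two extremes.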

Result Summary: Kappa

[Figure: weighted Kappa per attribute label, with agreement bands Moderate, Good and Very good. Attribute labels: 1: cellular component, 2: database, 3: date, 4: free text, 5: ID, 6: molecule type, 7: name, 8: number, 9: organism, 10: publication method, 11: sequence. Panels: Cellular Component (O), Date (H), Organism Name (O)]

Schema Mining Summary

• According to Kappa tests, results are good or very good

• Possible improvement
  – Clustering method with better intelligence
  – Better ontology database
  – More involved language analysis
  – Hybrid of bottom-up and top-down approaches

System Components


• Process data
  – Metadata description language
  – Wrapper generation
  – Query process
  – Query process with indices

Data Process Overview

• Automatic code generation approach
• Input
  – Metadata about datasets involved
  – Optional:
    • Implicit data transformation task
    • Request by users
    • Indexing functions
• Output
  – Executable programs
    • General modules
    • Task-specific data module


• Two aspects of data in flat files
  – Logical view of the data
  – Physical data organization
• Two components of every data descriptor
  – Schema description
  – Layout description
• Design goals
  – Powerful
  – Easy to write and interpret

Metadata Challenges

• Examples of sequence formats
  – ALN/ClustalW format
  – AMPS Block file format
  – ClustalW
  – Codata
  – EMBL
  – GCG/MSF
  – GDE
  – GenBank
  – Fasta (Pearson)
  – NBRF/PIR
  – PDB format
  – Pfam/Stockholm format
  – Phylip
  – Raw
  – RSF
  – UniProtKB/Swiss-Prot

List and example provided by EMBL-EBI

>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

{
name "Short name for sequence"
longname "Long (more descriptive) name for sequence"
sequence-ID "Unique ID number"
creation-date "mm/dd/yy hh:mm:ss"
direction [-1|1]
strandedness [1|2]
type [DNA|RNA||PROTEIN|TEXT|MASK]
offset (-999999,999999)
group-ID (0,999)
creator "Author's name"
descrip "Verbose description"
comments "Lines of comments that can be fairly arbitrary text about a sequence. Return characters are allowed, but no internal double quotes or brace characters. Remember to close with a double quote"
sequence "gctagctagctagctagctcttagctgtagtcgtagctgatgctagct
gatgctagctagctagctagctgatcgatgctagctgatcgtagctgacg
gactgatgctagctagctagctagctgtctagtgtcgtagtgcttattgc"
}

LOCUS MMFOSB 4145 bp mRNA linear ROD 12-SEP-1993
DEFINITION Mouse fosB mRNA.
ACCESSION X14897
VERSION X14897.1 GI:50991
KEYWORDS fos cellular oncogene; fosB oncogene; oncogene.
SOURCE Mus musculus.
  ORGANISM Mus musculus
    Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE 1 (bases 1 to 4145)
  AUTHORS Zerial,M., Toschi,L., Ryseck,R.P., Schuermann,M., Muller,R. and Bravo,R.
  TITLE The product of a novel growth factor activated gene, fos B, interacts with JUN proteins enhancing their DNA binding activity
  JOURNAL EMBO J. 8 (3), 805-813 (1989)
  MEDLINE 89251612
  PUBMED 2498083
COMMENT clone=AC113-1; cell line=NIH3T3.
FEATURES Location/Qualifiers
  source 1..4145
    /organism="Mus musculus"
    /db_xref="taxon:10090"

CDS 1202..2218 /note="fosB protein (AA 1-338)" /codon_start=1 /protein_id="CAA33026.1" /db_xref="GI:50992" /db_xref="MGD:95575" /db_xref="SWISS-PROT:P13346" /translation="MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQEC AGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGT SYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRV RRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAH KPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNL TASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPS LLAL" BASE COUNT 960 a 1186 c 1007 g 991 t 1 others ORIGIN 1 ataaattctt attttgacac tcaccaaaat agtcacctgg aaaacccgct ttttgtgaca 61 aagtacagaa ggcttggtca catttaaatc actgagaact agagagaaat actatcgcaa 121 actgtaatag acattacatc cataaaagtt tccccagtcc ttattgtaat attgcacagt 181 gcaattgcta catggcaaac tagtgtagca tagaagtcaa agcaaaaaca aaccaaagaa 241 aggagccaca agagtaaaac tgttcaacag ttaatagttc aaactaagcc attgaatcta 301 tcattgggat cgttaaaatg aatcttccta caccttgcag tgtatgattt aacttttaca 361 gaacacaagc caagtttaaa atcagcagta gagatattaa aatgaaaagg tttgctaata 421 gagtaacatt aaataccctg aaggaaaaaa aacctaaata tcaaaataac tgattaaaat 481 tcacttgcaa attagcacac gaatatgcaa cttggaaatc atgcagtgtt ttatttaaga 541 aaacataaaa caaaactatt aaaatagttt tagagggggt aaaatccagg tcctctgcca 601 ggatgctaaa attagacttc aggggaattt tgaagtcttc aattttgaaa cctattaaaa 661 agcccatgat tacagttaat taagagcagt gcacgcaaca gtgacacgcc tttagagagc 721 attactgtgt atgaacatgt tggctgctac cagccacagt caatttaaca aggctgctca 781 gtcatgaact taatacagag agagcacgcc taggcagcaa gcacagcttg ctgggccact 841 ttcctccctg tcgtgacaca atcaatccgt gtacttggtg tatctgaagc gcacgctgca 901 ccgcggcact gcccggcggg tttctgggcg gggagcgatc cccgcgtcgc cccccgtgaa 961 accgacagag cctggacttt caggaggtac agcggcggtc tgaaggggat ctgggatctt 1021 gcagagggaa cttgcatcga aacttgggca gttctccgaa ccggagacta agcttccccg 1081 agcagcgcac tttggagacg tgtccggtct actccggact cgcatctcat tccactcggc 1141 catagccttg gcttcccggc gacctcagcg tggtcacagg ggcccccctg tgcccaggga 1201 aatgtttcaa gcttttcccg 
gagactacga ctccggctcc cggtgtagct catcaccctc 1261 cgccgagtct cagtacctgt cttcggtgga ctccttcggc agtccaccca ccgccgccgc 1321 ctcccaggag tgcgccggtc tcggggaaat gcccggctcc ttcgtgccaa cggtcaccgc 1381 aatcacaacc agccaggatc ttcagtggct cgtgcaaccc accctcatct cttccatggc 1441 c

• Major Challenges:

1. Various representations

2. Semi-structured data

Schema Descriptors

• Follow XML DTD standard for semi-structured data

<?xml version='1.0' encoding='UTF-8'?>
<!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT DESCRIPTION (#PCDATA)>
<!ELEMENT SEQ (#PCDATA)>

• Simple attribute list for relational data

[FASTA] //Schema name
ID = string //Data type definitions
DESCRIPTION = string
SEQ = string

Layout Descriptors

• Overall structure (FASTA example)

DATASET "FASTAData" { //Dataset name
  DATATYPE {FASTA} //Schema name
  DATASPACE LINESIZE=80 {
    // ---- File layout details go here ----
  }
  DATA {osu/fasta} //File location
}

File Layout

• Key observations on line-based biological data files
  – Strings of variable length
  – Delimiters widely used
  – Data fields may be divided into variables
  – Repetitive structures

>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …

Layout Descriptors

• File layout (FASTA example)

DATASPACE LINESIZE=80 {
  <
    ">" ID " " DESCRIPTION
    < "\n" SEQ >
    "\n" | EOF
  >
}

>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …
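
To make the layout concrete, here is a hand-written Python reader that follows the same structure as the FASTA descriptor: ">" introduces ID, the first space introduces DESCRIPTION, and each following line is one SEQ value until the next ">" or end of file. It is only an illustration of what the descriptor expresses, not the code the system generates; `parse_fasta` is a hypothetical name.

```python
def parse_fasta(text):
    """Parse FASTA text into entries with ID, DESCRIPTION and a list of SEQ lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # ">" ID " " DESCRIPTION
            identifier, _, description = line[1:].partition(" ")
            entries.append({"ID": identifier, "DESCRIPTION": description, "SEQ": []})
        elif entries:
            # < "\n" SEQ > : SEQ repeats, one value per line
            entries[-1]["SEQ"].append(line)
    return entries
```

SEQ is kept as a list because the inner repetition in the descriptor makes it an attribute with a varying number of values per entry.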

System Component


• Process data
  – Metadata description language
  – Wrapper generation
  – Query execution
  – Query execution with indices

Wrapper Generation Road Map

• Motivation and overview

• System structure

• Wrapper generation

• Wrapper execution

• Experiments

Wrapper Generation Motivation

• Wrappers are essential for bioinformatics integration
  – Heterogeneous data sources
  – Function: transform data
• Current solutions
  – Manually written wrappers
  – Scripts

Wrapper Generation Advantages

• Wrappers generated automatically
  – Stand-alone programs for integration systems and workflows
  – Little human interference. New resources can be integrated on-the-fly
  – Direct transformation. No unnecessary intermediate form needed
  – Only requires data description at metadata level, one descriptor per data source
• Transfer data from flat files directly
  – No DB support required
  – No other domain or format heuristics

Wrapper Generation System Overview

[Figure: wrapper generation system and the wrapper it produces. Generation side: Layout Descriptor, Layout Parser, Data Entry Representation; Schema Descriptors, Mapping Generator, Mapping File, Mapping Parser, Schema Mapping; Application Analyzer, WRAPINFO. Wrapper side: DataReader, Synchronizer and DataWriter, transforming the source dataset into the target dataset.]

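• FASTA example

DATASPACE LINESIZE=80 {
  <
    ">" ID " " DESCRIPTION
    < "\n" SEQ >
    "\n" | EOF
  >
}

[Figure: parse tree of the descriptor. Root: DATASPACE (linesize = 80). Below it, the outer < > environment holds the leaves ">"-ID and " "-DESCRIPTION, the inner < > environment with the leaf "\n"-SEQ, and the leaf "\n"-DUMMY | EOF.]

Leaf: delimiter-variable (DLM-VAR) pair

Internal node: environment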

Schema Mapping

• Algorithm: strict name matching

for field ft in target schema
    for field fs in source schema
        if ft = fs then add pair (fs, ft) to the mapping

• Output
  – A list of attribute pairs
  – An editable file for user to verify and modify
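
The strict name matching above translates almost line for line into Python. The function name `map_schemas` and the example field names are illustrative only.

```python
def map_schemas(source_fields, target_fields):
    """Pair every target field with the source field of exactly the same name."""
    mapping = []
    for ft in target_fields:
        for fs in source_fields:
            if ft == fs:
                mapping.append((fs, ft))
    return mapping
```

For a source schema `["ID", "DE", "SQ"]` and a target schema `["ID", "SEQ"]`, only `("ID", "ID")` is produced; the SQ-to-SEQ pair is exactly the kind of mapping a user would add by hand in the editable mapping file.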

Wrapping Assumptions

• Convert semi-structured (and structured) data to structured data

• Both datasets are stored record-wise

• Order of records not disturbed after wrapping

Semi-structured → Structured

Data can be transformed entry by entry

Application Analyzer

• Task: to generate clear directions for wrapper and organize them in WRAPINFOR
• Sub-tasks
  – What values to store
  – How to extract values
  – How to store values
  – How to write values

Important Concepts (1)

• “Useful”
  – An attribute is useful iff its values are in target
• “Reachable”
  – Node b is reachable from node a if there exists a valid layout configuration such that a.DLM and b.DLM define the boundaries of a.VAR, i.e., “… a.DLM a.VAR b.DLM …”
  – A value instance is between
    • Its own delimiter
    • The first appearance of its reachable delimiters
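
The boundary rule for a value instance can be sketched as a small extraction function: start right after the attribute's own delimiter and stop at the earliest occurrence of any reachable delimiter. Treating end of text as an implicit boundary is an assumption, and `extract_value` is a hypothetical helper rather than part of the generated wrapper.

```python
def extract_value(text, start, own_delimiter, reachable_delimiters):
    """Return (value, end_position), or None if own_delimiter is not found.

    The value lies between its own delimiter and the first appearance of
    any reachable delimiter after it.
    """
    begin = text.find(own_delimiter, start)
    if begin < 0:
        return None
    begin += len(own_delimiter)
    positions = [text.find(d, begin) for d in reachable_delimiters]
    positions = [p for p in positions if p >= 0]
    # Assumption: with no reachable delimiter left, the value runs to the end.
    end = min(positions) if positions else len(text)
    return text[begin:end], end
```

On the FASTA line `>seq1 comment1`, ID uses `">"` as its own delimiter with `" "` reachable, and DESCRIPTION uses `" "` with `"\n"` reachable, so two successive calls walk through the header.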

Important Concepts (2)

• Attribute Cardinality
  – Regular attribute: fixed number of values per entry
    • ID
  – Semi-structured attribute: varied number of values per entry
    • References

WRAPINFOR

• Contents: information to answer a particular wrapping task
• Forms: in XML
  – 5 look-up tables
    • Delimiter, Usefulness, Cardinality, Label, Reachable
  – 3 parameters
    • one_to_one_total, one_to_multiple_total, complete_in
• Function: plug into general modules to form a functional wrapper

Wrapper Generation Road Map

• Motivation and overview of our approach

• System structure

• Wrapper generation

• Wrapper execution

• Experiments

Wrapper Overview

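[Figure: wrapper data flow. Input dataset → dataset buffer → DataReader → value buffer (one_to_one_values, one_to_multiple_values) → DataWriter → output dataset. The Synchronizer controls DataReader and DataWriter; control labels in the figure include load, run and halt.]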

Wrapper Structure

• One data module: WRAPINFO

• Three general action modules
  – Synchronizer: central controller
  – DataReader, DataWriter: interact with datasets

• One value buffer

• Suitable for data grid

• Transform data one entry at a time

Wrapper Execution

• DataReader
  – Extract attribute value
    • Delimiter table + Reachable table
  – Fill value buffer: Label look-up table
• DataWriter
  – Retrieve from value buffer: Label look-up table
  – Write target file
    • Delimiter table + Reachable table + Label table
• Synchronizer
  – Call DataReader on source: parameters
  – Call DataWriter on target: parameters
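
The division of labor between the three modules can be sketched in Python as follows. This is a toy illustration, not the generated wrapper: the source and target formats (tab-separated in, FASTA-style header out) are hard-coded where the real system would consult the WRAPINFOR look-up tables, and all class and function names are assumptions.

```python
import io


class DataReader:
    """Reads one entry at a time from the source and fills the value buffer."""

    def __init__(self, stream):
        self.stream = stream

    def read_entry(self, buffer):
        line = self.stream.readline()
        if not line:
            return False  # end of the source dataset
        identifier, _, description = line.rstrip("\n").partition("\t")
        buffer["ID"] = identifier
        buffer["DESCRIPTION"] = description
        return True


class DataWriter:
    """Retrieves values from the buffer and writes one target entry."""

    def __init__(self, stream):
        self.stream = stream

    def write_entry(self, buffer):
        self.stream.write(">" + buffer["ID"] + " " + buffer["DESCRIPTION"] + "\n")


def synchronize(reader, writer):
    """Central controller: transform data one entry at a time."""
    buffer = {}
    entries = 0
    while reader.read_entry(buffer):
        writer.write_entry(buffer)
        buffer.clear()
        entries += 1
    return entries
```

Because only one entry is buffered at a time, memory use is independent of file size, which is consistent with the linear execution time reported in the experiments.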

Wrapper Experiments (1)

TRANSFAC-to-Reference Problem

[Figure: performance plot with both axes in logarithmic scale]

• Analysis time constant
• Execution time linear

Wrapper Experiments (2)

SWISSPROT-to-FASTA Problem

• Performance comparable to handwritten code



Query Execution Road Map

• Motivation

• System Overview

• System Implementation
  – Languages
  – System

• Experiments

Limitation of Wrapper

• Data wrapping = data formatting + data projection
• Other query types
  – Selection
  – Cross product
  – Join
• New functionalities
  – Value examination
  – Multiple datasets

Advantages

• Retrieve multiple pieces of information all at once

• Data easily available

• Declarative languages only

• High flexibility

• Low overhead

• Suitable for data grid

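System Enhanced

[Figure: query system in two phases. Query analysis: the query parser extracts source/target names and mappings from the query; the descriptor parser reads dataset descriptors from the metadata collection and provides schema and layout information; the application analyzer produces QUERYINFOR. Query execution: DataReader, Synchronizer and DataWriter use QUERYINFOR to read the source data files and write the target data file.]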

Query Execution Road Map

• Motivation
• System Overview
• System Implementation
  – Languages
    • Metadata Description Language
    • Query Language
  – System
    • Query Analysis
    • Query Execution
• Experiments

Query Language

• Declarative, SQL-like
• Projection, selection, cross product, join queries
• Example

AUTOWRAP POSTBLAST
FROM BLASTP, SWISSPROT
BY BLASTP.SP_ID = SWISSPROT.ID
WHERE
POSTBLAST.QUERY = BLASTP.QUERY
POSTBLAST.SP_AC = BLASTP.SP_AC
POSTBLAST.SP_ID = BLASTP.SP_ID
POSTBLAST.FULL_DESCR = SWISSPROT.DE
POSTBLAST.SEQUENCE = SWISSPROT.SQ
POSTBLAST.SCORE = BLASTP.SCORE
POSTBLAST.E_VALUE = BLASTP.E_VALUE

AUTOWRAP names the target dataset, FROM lists the source datasets, BY gives the join criteria, and WHERE lists the attribute pairs.

Application Analyzer Enhancement

• Constant values in query
  – Pseudo-label look-up table
• Other query information
  – Parameters: comparing field pairs
• Output: QUERYINFOR

Query Execution

• Query-Proc Structure
• DataReader and DataWriter
  – Similar to wrapper
• Value buffer
  – Store useful values from one data entry of every source dataset

[Figure: Query-Proc structure. QUERYINFOR plugs into the Synchronizer, which moves data from the source data files to the target data file.]

Enhanced Synchronizer

• Synchronizer
  – Set up pseudo-attributes: Pseudo-label look-up table
  – Call DataReader on source 1 and 2; call DataWriter on target: Parameters
  – Test join conditions: Parameters
  – Clean value buffer: Parameters

Post-BLAST Query

• Goal: Enhance BLAST output to FASTA format
• Query: Join query between BLAST output (source 1) and SWISSPROT (source 2)
• 2 modes
  – UNIQUE: halt once a match is found in source 2
  – ALL: search all source 2 entries

[Figure: time (sec), axis from 0 to 2000, versus query size (3, 5 and 12 sequences), for the UNIQUE and ALL modes]
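
The difference between the two modes can be sketched as a nested-loop join over entries represented as dictionaries. This mirrors the scan-based execution described for the Synchronizer only loosely; the function name, the dictionary representation and the field names in the usage note are assumptions.

```python
def nested_loop_join(source1, source2, key1, key2, mode="ALL"):
    """Join two lists of entries on source1[key1] == source2[key2].

    mode="UNIQUE" stops scanning source 2 after the first match for each
    source 1 entry; mode="ALL" scans every source 2 entry.
    """
    results = []
    for entry1 in source1:
        for entry2 in source2:
            if entry1[key1] == entry2[key2]:
                # Values from both entries feed the target entry.
                results.append({**entry2, **entry1})
                if mode == "UNIQUE":
                    break
    return results
```

With BLAST hits as source 1 and Swiss-Prot entries as source 2, `nested_loop_join(hits, proteins, "SP_ID", "ID", mode="UNIQUE")` avoids scanning the rest of source 2 once a hit has found its protein, which is where the UNIQUE mode saves time when identifiers are unique.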

Chip-Supplement Query

• Goal: Look up microarray gene information and output it in tabular format
• Query: Join query between protein array and yeast genome database
• 2 queries
  – Chip-Supplement:
    • array join genome
  – Chip-Supplement-Sorted:
    • genome join array

[Figure: time (sec), axis from 0 to 90, by query type (Chip-Supplement, Chip-Supplement-Sorted), for the UNIQUE and ALL modes]

OMIM-Plus Query

• Add reverse links of proteins to disease database

• Join query between OMIM database and SWISSPROT database

• Results in OMIM form

• 86.38 seconds/entry × 12,158 OMIM entries ≈ 1,050,208 seconds ≈ 291.7 hours



Query with Indices Road Map

• Motivation and Overview
• System
• System Enhancement
  – Language
  – System Implementation
• Experiments

Query With Indices Motivation

• Goal
  – Improve the performance of query-proc program
    • Index
  – Maintain the advantages
    • Flat file based
    • Low requirement on programming

Challenges & Approaches

• Various indexing algorithms for various biological data
  – User-defined indexing functions
  – Standard function interfaces
• Flat file data
  – Values parsed implicitly and ready to be indexed
  – Byte offset as pointer
• Metadata about indices
  – Layout descriptor

System Revisit

[Figure: the query system diagram extended with indices. Query analysis: query parser (source/target names, mappings), descriptor parser over the dataset descriptors in the metadata collection (schema and layout information), application analyzer, QUERYINFOR. Query execution: Synchronizer between the source data files and the target data file, now also using the index file and index functions.]

Language Enhancement

• Describe indices
  – Indexing is a property of dataset
  – Extend layout descriptors

DATASET "name" {
  …
  INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc[, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}
}

  – Maintain query format

AUTOWRAP GNAMES
FROM CHIPDATA, YEASTGENOME
BY CHIPDATA.GENE = YEASTGENOME.ID
WHERE …

New meaning of "=": if an index is available, use the index retrieving function; else, compare values directly

System Enhancement

• Metadata Descriptor Parser
  + Parse index information
• Application Analyzer
  + Index information: index look-up table
  + Test condition: compare_field_indexing

Query-Proc Enhancement

• Synchronizer
  + If index is applicable, check availability of index data file
    • If no, call index generation function
  + Load indices
  + Call index retrieving function first for candidate entry list
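
A byte-offset index over a flat file can be sketched with a pair of functions playing the roles of the index generation and index retrieving functions. The sketch assumes one entry per line and an in-memory dictionary as the index; the real system lets users plug in their own functions and stores the index in a file, and the names below are hypothetical.

```python
import io


def build_offset_index(stream, key_of):
    """Index generation: map each key to the byte offsets of its entries."""
    index = {}
    while True:
        offset = stream.tell()  # byte offset where the next entry starts
        line = stream.readline()
        if not line:
            break
        index.setdefault(key_of(line), []).append(offset)
    return index


def retrieve_entries(stream, index, key):
    """Index retrieval: seek directly to the candidate entries for a key."""
    entries = []
    for offset in index.get(key, []):
        stream.seek(offset)
        entries.append(stream.readline())
    return entries
```

The stream must be opened in binary mode (or be an `io.BytesIO`) so that `tell` and `seek` work on true byte offsets. The join then visits only the candidate entries returned for each key instead of scanning the whole source file.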

Microarray Gene Information Look-up

• Goal: gather information about genes (120)
• Query: microarray output join genome database
• Index: gene names in genome

[Figure: bar chart of performance (sec). Query analysis: 0.01; index generation: 0.72; query with indices: 20.89; query w/o indices: 81.59]

BLAST-ENHANCE Query

• Goal: Add extra information to BLAST output
• Query: BLAST output join Swiss-Prot database
• Index: protein ID in Swiss-Prot

[Figure: performance (sec), axis from 0 to 1200, at x-axis values 3, 5 and 12, for index generation, query w/ indices and query w/o indices]

OMIM-PLUS Query

• Goal: add Swiss-Prot link to OMIM
• Query: OMIM join Swiss-Prot
• Index: protein ID in Swiss-Prot

[Figure: performance (sec) on a logarithmic axis from 1 to 10,000,000, for index generation, query w/ indices and query w/o indices]

Homology Search Query

• Goal: find similar sequences

• Query: query sequence list * sequence database

• Indexing algorithm
  – Sequence-based
  – Transformation of sub-string composition
  – Indexing n-D numerical values

Homology Search (1)

• Index (Singh’s algorithm)
  – Data: yeast genome
  – Wavelet coefficients
  – Minimum bounding rectangles

[Figure: performance (sec), axis from 0 to 350, versus database size (1 to 5, in units of 9.8 MB); series labeled Index generation, 10, 20 and 40]

Homology Search (2)

• Index (Ferhatosmanoglu’s algorithm)
  – Data: GenBank
  – Wavelet coefficients
  – Scalar quantization
  – R-tree

[Figure: performance (sec), axis from 0 to 30, versus database size (1 to 5, in units of 250 MB); series labeled 10, 20 and 40]

Road Map


• Motivation

• Implementation

• Comprehensive Example

• Future work

• Conclusion

Gene Name Nomenclature

• It is crucial to identify genes CORRECTLY and UNAMBIGUOUSLY
  – Genes with multiple names
  – Multiple genes sharing the same name
• Historically, little central control on naming process

“…As biologists strive to make sense of the growing wealth of genomic information, this messy nomenclature is becoming a bugbear…”

Helen Pearson, Nature, 2001

Gene Name in DBs

• Databases related to genes
  – Genome databases (main force in nomenclature)
    • SGD (yeast)
    • HGNC (human)
    • TAIR (a plant)
    • dictyBase (a one-cell amoeba)
  – Curated gene databases
    • Entrez Gene by NCBI
  – Curated gene product databases
    • Swiss-Prot by SIB and EBI

Queries About Gene Name

• Gene identifier usage in databases
  – How are gene symbols in DB A used in DB B?
  – How are gene aliases in DB A used in DB B?
• Nomenclature across species
  – Q1-Q2: genome – Entrez Gene, Swiss-Prot
  – Q3-Q4: Entrez Gene – Swiss-Prot
• Nomenclature over time
  – Q5-Q7: Swiss-Prot – genome

Challenges

• Various data representations
  – Line-based texts
  – Tabular forms with or without title
  – Format evolves over time
• Data storage
  – Large volume
  – Each file queried limited times

[Figure callouts: Metadata descriptors; Format and schema learning; Flat file processing]

Integration System Revisit

[Figure: the system overview diagram applied to the nomenclature study. Data files: genome databases, Entrez Gene, Swiss-Prot. User requests: join queries. Understand data: Layout Miner and Schema Miner produce the metadata description (layout descriptor and schema descriptor). Process data: code generation and the query processor.]

Nomenclature Results (1)

• Across species

[Figure: Q1-Q2, percentage (%), axis from 0 to 90, for Entrez Gene ID, Entrez Gene alias, Swiss-Prot ID and Swiss-Prot alias, across SGD, HGNC, TAIR and dictyBase]

[Figure: Q3-Q4, percentage (%), axis from 0 to 60, for Swiss-Prot ID and Swiss-Prot alias, across SGD, HGNC, TAIR and dictyBase]

Nomenclature Results (2)

• Over time

Q5: How many gene IDs in Swiss-Prot are gene IDs in genome?
Q6: How many gene IDs in Swiss-Prot are aliases in genome?
Q7: How many gene aliases in Swiss-Prot are gene IDs in genome?

Performance

• Linear w.r.t. source 1 size

Conclusion

• A framework and a set of tools for on-the-fly flat file data integration
  – New data sources understood semi-automatically by data mining tools
  – New data processed automatically by generated programs
• Advantages
  – High-level interface, flat file based, acceptable performance, low maintenance cost
