MoBIoS A Metric-space DBMS to Support Biological Discovery Presenter: Enohi I. Ibekwe

MoBIoS

A Metric-space DBMS to Support Biological Discovery

Presenter: Enohi I. Ibekwe

Overview

• MoBIoS Project
• Motivation
• The challenge
• Established similarity measures
• Metric-space distance measure
• Disk-based metric tree index
• MoBIoS as a DBMS
• Application of MoBIoS

MoBIoS Project

• Molecular Biological Information System

• Project at the UT-Austin Center for Computational Biology and Bioinformatics.

• DBMS based on metric-space indexing techniques, an object-relational model of genomic and proteomic data types, and a database query language that embodies the semantics of genomic and proteomic data.

Motivation

Develop a DBMS to power biological information systems.

The Challenge

• Established biological models of similarity do not form a metric.

• Scalable disk-based metric indexes suffer from the curse of dimensionality.

Established Similarity Measure (I)

• Sequence Homology
– Query Sequence
– Database of sequences
– Substitution Matrix (PAM / BLOSUM)
– Similarity Measure
  – Global Sequence Alignment (Edit distance)
  – Local Sequence Alignment (Most important)

Established Similarity Measure (II)

• Local Sequence Alignment
– A local sequence alignment query asks: given a query sequence S, a database of sequences T, and a similarity matrix corresponding to an evolutionary model, return all subsequences of T that are sufficiently similar to a subsequence of S.
– Main issue: the result is a set of answers.

• A metric distance function must return a single value for each pair of arguments.

Established Similarity Measure (III)

• Global Sequence Alignment
– Given an alphabet A and a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment of two sequences s and t is the problem of finding strings a and b, obtained from s and t respectively by inserting spaces either into or at the ends of s and t, whose score computed using M is at a maximum (similarity measure) over all pairs of such strings obtained from s and t. (example)
– Issue: The result may be negative because the substitution matrix is based on log-odds probabilities, and a similarity measure favors larger positive scores. A metric distance, by contrast, must be non-negative.
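
The definition above can be sketched with the standard Needleman-Wunsch dynamic program. This is a minimal illustration, not the MoBIoS implementation: the scoring function `toy_score` and the linear gap penalty of -2 are hypothetical stand-ins for a real substitution matrix such as PAM250.

```python
def global_alignment_score(s, t, score, gap=-2):
    """Maximum global alignment score of s and t (Needleman-Wunsch)."""
    rows, cols = len(s) + 1, len(t) + 1
    # best[i][j] holds the best score for aligning s[:i] with t[:j].
    best = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        best[i][0] = i * gap  # s[:i] aligned entirely against spaces
    for j in range(1, cols):
        best[0][j] = j * gap  # t[:j] aligned entirely against spaces
    for i in range(1, rows):
        for j in range(1, cols):
            best[i][j] = max(
                best[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # substitute
                best[i - 1][j] + gap,                            # space in t
                best[i][j - 1] + gap,                            # space in s
            )
    return best[-1][-1]


def toy_score(a, b):
    # Hypothetical scores for illustration only; not PAM or BLOSUM values.
    return 2 if a == b else -1
```

With these toy scores, `global_alignment_score("AAAA", "TTTT", toy_score)` is negative, which illustrates why a raw similarity score cannot serve directly as a metric distance.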

Metric-space Distance measure (I)

• Homology Search
• Query Sequence: substrings of length q (q-grams)
• Database of sequences: metric-indexed records of fixed-length strings of length q (indexed q-grams)
• Substitution Matrix (mPAM)
• Similarity Measure (distance measure)
– Local alignments are computed from global alignments.

Metric-space Distance measure (II)

• mPAM substitution matrix
– Accepted Point Mutation model.
– PAM calculates scores based on the frequency with which individual pairs of amino acids substitute for each other.
– mPAM, instead of calculating the frequency of substitutions as PAM does, computes the expected time between substitutions.
– mPAM has been validated (see Appendix VIII).

Metric-space Distance measure (III)

• Computing Local Alignment from Global Alignment (Algorithm)

– Offline
1. Divide the database of sequences into substrings (q-grams).
2. Build a metric-space index structure on the q-grams.

– Online
1. Divide the query sequence into substrings (q-grams).
2. Use global alignment as the distance function to match the query q-grams against the index.
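
The offline phase can be sketched as follows. This is only an illustration: it assumes overlapping windows with step size 1 (as Appendix X states), and it uses a plain dictionary in place of a metric-space index. The names `qgrams` and `build_qgram_table` are hypothetical.

```python
from collections import defaultdict


def qgrams(sequence, q, step=1):
    """Return (offset, substring) pairs for every window of length q."""
    return [(i, sequence[i:i + q])
            for i in range(0, len(sequence) - q + 1, step)]


def build_qgram_table(database, q):
    """Map each distinct q-gram to the (sequence id, offset) pairs where it occurs."""
    table = defaultdict(list)
    for seq_id, sequence in database.items():
        for offset, gram in qgrams(sequence, q):
            table[gram].append((seq_id, offset))
    return table
```

A dictionary only answers exact-match lookups; a metric index is needed to answer "all q-grams within distance r" without scanning every entry.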

Disk-based metric-tree index

• Phases
– Initialization
– Searching

• Query performance metric
– Number of disk I/Os (nodes visited)
– Number of distance computations

• Options Exploited
– M-Tree
– Generalized Hyperplane tree
– MVP-Tree (optimal)
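
To make the "number of distance computations" metric concrete, here is a minimal in-memory vantage-point tree with a range search that prunes subtrees using the triangle inequality. It is a simplified relative of the MVP-tree (one vantage point per node, no disk pages), not any of the three structures listed above, and all names are hypothetical.

```python
import statistics


class Node:
    def __init__(self, point, radius=0.0, inside=None, outside=None):
        self.point, self.radius = point, radius
        self.inside, self.outside = inside, outside


def build(points, dist):
    """Build a vantage-point tree; the first point of each subset is the vantage point."""
    if not points:
        return None
    vantage, rest = points[0], points[1:]
    if not rest:
        return Node(vantage)
    distances = [dist(vantage, p) for p in rest]
    radius = statistics.median(distances)
    inside = [p for p, d in zip(rest, distances) if d <= radius]
    outside = [p for p, d in zip(rest, distances) if d > radius]
    return Node(vantage, radius, build(inside, dist), build(outside, dist))


def range_search(node, query, r, dist, stats):
    """Return all points within distance r of query, counting distance calls in stats."""
    if node is None:
        return []
    stats["distance_calls"] += 1
    d = dist(query, node.point)
    hits = [node.point] if d <= r else []
    # Inside points p satisfy dist(p, vantage) <= radius, so by the triangle
    # inequality dist(query, p) >= d - radius; skip them when d - radius > r.
    if d - r <= node.radius:
        hits += range_search(node.inside, query, r, dist, stats)
    # Outside points p satisfy dist(p, vantage) > radius, so
    # dist(query, p) > radius - d; skip them when radius - d >= r.
    if d + r > node.radius:
        hits += range_search(node.outside, query, r, dist, stats)
    return hits
```

The pruning is only valid when `dist` is a true metric, which is why the substitution matrix has to be replaced by a metric one before this kind of index can be used.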

Disk-based metric-tree index (initialization)

• M-Tree initialization
– Best case: O(n log n)
– Worst case: O(n^3)

• Generalized Hyperplane (GH-Tree) initialization
– Best case: O(n log n)
– Worst case: O(n^2)

• GH-Tree: bi-directional

• M-Tree: bottom-up

• In practice, both M-Tree and GH-Tree scale linearly

Disk-based metric-tree index (Searching)

MoBIoS as a DBMS (I)

• Mckoi (Java RDBMS)
– Plus metric-space indexing
– Plus biological data types
– Plus biological semantics

• Life science data store
– Biological sequence data
– Mass-spectrometry protein signatures

MoBIoS as a DBMS (III)

• Language Extension
– M-SQL

• Data type Extension
– Data type for sequences (DNA, RNA, peptide)
– Data type for mass spectrum

• Semantics Extension
– Subsequence operators
– Local alignment

MoBIoS as a DBMS (IV)

• Semantics Extension
– Similarity (metric distance) between data types
  • mPAM250
  • Cosine distance
  • Lk norms

• Keys Extension
– Primary key (metrickey)
– Index (metric)
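
Two of the listed distance functions can be sketched for numeric vectors such as mass spectra. The slides do not define "cosine distance", so this sketch assumes the angle between the two vectors, which satisfies the triangle inequality (the common alternative, 1 minus cosine similarity, does not in general). The function names are hypothetical.

```python
import math


def lk_distance(x, y, k=2):
    """Minkowski (L_k) distance between equal-length vectors, for k >= 1."""
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def angular_distance(x, y):
    """Angle between x and y in radians (assumed reading of 'cosine distance')."""
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cosine_similarity(x, y))))
```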

Application of MoBIoS (I)

• MS/MS Protein Identification

1. Break the protein down into fragments called peptides using a protease enzyme.

2. Identify the protein by using a mass spectrometer to measure the mass-to-charge ratio of the fragments and comparing the experimental result to a database of precomputed spectra.

Application of MoBIoS (II)

M-SQL Solution

Create table protein_sequences (
    accession_id int,
    sequence peptide,
    primary metrickey(sequence, mPAM250));

Create table digested_sequences (
    accession_id int,
    fragment peptide,
    enzyme varchar,
    ms_peak int,
    primary key(enzyme, accession_id));

Create index fragment_sequence on digested_sequences (fragment) metric(mPAM250);

Create table mass_spectra (
    accession_id int,
    enzyme varchar,
    spectrum spectrum,
    primary metrickey(spectrum, cosine_distance));

Application of MoBIoS (III)

• M-SQL Solution

SELECT Prot.accession_id, Prot.sequence
FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
WHERE MS.enzyme = DS.enzyme = E and
    Cosine_Distance(S, MS.spectrum, range1) and
    DS.accession_id = MS.accession_id = Prot.accession_id and
    DS.ms_peak = P and
    MPAM250(PS, DS.fragment, range2)

BLAST vs MoBIoS

MoBIoS
1. Molecular Biological Information System
2. DBMS specialized for storage, retrieval and mining of biological data
3. Both the sequence database and the query sequence are divided into q-grams, and the database is indexed offline.

BLAST
1. Basic Local Alignment Search Tool
2. Utility specialized for retrieval and mining of biological data outside a database
3. Only the query sequence is divided, and the hot-point index is built at query time.

MoBIoS Demo

• MoBIoS: http://ccvweb.csres.utexas.edu:9080/msfound/ccForm.jsp

• PDB : http://www.rcsb.org/pdb/

Conclusion

• Biological data is not random and very likely exhibits the intrinsic structure necessary for metric-space indexing to succeed.


Appendix

Appendix I - Metric

A metric space is a set of objects S with a distance function d such that, for any three objects x, y, z in S, the following hold:

1. Non-Negativity

d(x,y) > 0 for x ≠ y; d(x,y) = 0 for x = y

2. Symmetry

d(x,y) = d(y,x)

3. Triangular inequality

d(x,y) + d(y,z) ≥ d(x,z)
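
The three properties can be spot-checked for a candidate distance function on a finite sample of objects. The sketch below is illustrative only (the name `metric_violations` is hypothetical), and passing it on a sample does not prove that a function is a metric on the whole space.

```python
from itertools import product


def metric_violations(objects, d, tol=1e-12):
    """Return (axiom, objects) pairs where d breaks a metric axiom on the sample."""
    violations = []
    for x, y in product(objects, repeat=2):
        if x == y and abs(d(x, y)) > tol:
            violations.append(("identity", (x, y)))
        if x != y and d(x, y) <= 0:
            violations.append(("non-negativity", (x, y)))
        if abs(d(x, y) - d(y, x)) > tol:
            violations.append(("symmetry", (x, y)))
    for x, y, z in product(objects, repeat=3):
        if d(x, y) + d(y, z) < d(x, z) - tol:
            violations.append(("triangle", (x, y, z)))
    return violations
```

For example, absolute difference on numbers passes all checks, while squared difference fails the triangular inequality.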

Appendix II - Sequence

• 2 RNA sequences from a DNA strand.

Appendix III - PAM

Percent Accepted Mutation (PAM)

A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence (e.g., PAM250).

A unit to quantify the amount of evolutionary change in a protein sequence. Based on log-odds probability.
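
A log-odds score compares how often a pair of residues is observed aligned with how often it would be expected by chance. The sketch below uses the generic form of that ratio with a base-10 logarithm and a scale factor of 10. The function name and the frequencies in the usage note are hypothetical, and the exact normalisation used for any published matrix may differ.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=10.0):
    """Scaled log-odds score: positive when the pair occurs more often than chance."""
    expected = freq_a * freq_b  # chance of the pair if residues were independent
    return round(scale * math.log10(pair_freq / expected))
```

A pair observed twice as often as chance, such as `log_odds_score(0.02, 0.1, 0.1)`, gets a positive score, and a pair observed half as often gets a negative one, which is the source of the negative alignment scores noted earlier.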

Appendix IV – PAM250

• At this evolutionary distance (250 substitutions per hundred residues)

Appendix V - BLOSUM

Blocks Substitution Matrix (BLOSUM)

A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins (e.g., BLOSUM62).

A unit to quantify the amount of evolutionary change in a protein sequence. Based on log-odds probability.

Appendix VI – BLOSUM62

• BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical

Appendix VII – mPAM250

• Expected time based on 250 PAM distance as a unit.

Appendix VIII – mPAM Validation

• Based on a benchmark query set by Smith-Waterman.

• Graph shows ROC50 values (Receiver Operating Characteristic).

• Negative x-axis values indicate that mPAM performs better.

Difference between ROC50 values using mPAM and PAM250

Appendix IX - Distance measure

Global Sequence Alignment

Given an alphabet A and a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment of two sequences s and t is the problem of finding strings a and b, obtained from s and t respectively by inserting spaces either into or at the ends of s and t, whose score computed using M is at a maximum (similarity measure) or minimum (distance measure) over all pairs of such strings obtained from s and t.

Appendix X – Homology Search

Build Index Structure (Offline)

1. Divide the database sequences into a set of overlapping substrings of length q (q-grams) with step size 1.

2. Build a metric-space index D based on global alignment to support constant time lookup of exact match.

Homology Search Query (Online)

1. Divide the query sequence W into overlapping substrings, F = {w_i | i = 0..|W|-q}, of length q with step size 1.

2. For each w_i in F, run range query Q(w_i, r) against database D to find a set of matching q-grams, R_i = {f_i,j | d(f_i,j, w_i) <= r, f_i,j ∈ D, w_i ∈ F}, where d is the distance function.

3. Use a greedy heuristic algorithm to extend and chain all fragments in R_0 ∪ R_1 ∪ … ∪ R_|W|-q to deduce the result of the homology search based on local alignment for query W.
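
The online steps can be sketched end to end under simplifying assumptions: a linear scan stands in for the metric index, the distance between two q-grams is the sum of per-position substitution costs (no gaps), and "chaining" just merges consecutive hits on the same diagonal. All names and the cost function are hypothetical; the actual greedy heuristic is not described on the slide.

```python
def qgram_distance(a, b, cost):
    """Sum of per-position substitution costs for two q-grams of equal length."""
    return sum(cost(x, y) for x, y in zip(a, b))


def range_query(index, w, r, cost):
    """Linear-scan stand-in for Q(w, r): (seq_id, offset) of q-grams within r of w."""
    return [(seq_id, offset)
            for seq_id, offset, gram in index
            if qgram_distance(gram, w, cost) <= r]


def homology_search(database, query, q, r, cost):
    """Return (seq_id, diagonal, query_start, length) for each chained match."""
    index = [(seq_id, i, seq[i:i + q])
             for seq_id, seq in database.items()
             for i in range(len(seq) - q + 1)]
    # Group hits by (sequence, diagonal), where diagonal = db offset - query offset.
    diagonals = {}
    for i in range(len(query) - q + 1):
        for seq_id, offset in range_query(index, query[i:i + q], r, cost):
            diagonals.setdefault((seq_id, offset - i), []).append(i)
    results = []
    for (seq_id, diag), starts in diagonals.items():
        starts.sort()
        run_start = prev = starts[0]
        # The trailing None flushes the last run of consecutive query offsets.
        for s in starts[1:] + [None]:
            if s is None or s > prev + 1:
                results.append((seq_id, diag, run_start, prev - run_start + q))
                run_start = s
            prev = s
    return results
```

Replacing `range_query` with a metric-tree search is what would turn this linear scan into an indexed search.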

Appendix XI - GSA
