Better Filtering with Gapped q-grams
S. Burkhardt, J. Kärkkäinen
Center for Bioinformatics, Saarbrücken
Max-Planck-Institut für Informatik, Saarbrücken
Outline
• Motivation
• The `classic` q-gram Lemma
• q-shapes
• Measuring filter quality/speed
• Experimental results
• Conclusion
The k-mismatches problem
For a pattern P, a string S, and a value k:
find all occurrences of P in S with at most k character replacements.
Filter Algorithms
Filtration Stage:
Examine S with a filter criterion
Return areas with potential matches
Verification Stage:
Verify which areas have true matches
Pattern P
A C T C
Find occurrences of P with at most k errors
k = 1
String S
G C A T T C G A T G G A C T G G A C T A G T G A T T G A G T
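As a baseline for the example above, the k-mismatches problem can be solved without any filter by comparing P against every window of S. The following is a minimal sketch of that brute-force check, not a method from the slides; the function name `k_mismatch_search` is an assumption made for illustration.

```python
def k_mismatch_search(pattern, text, k):
    """Return all start positions in text where pattern matches
    with at most k character replacements (Hamming distance <= k)."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(pattern, text[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this window already has too many errors
        if mismatches <= k:
            hits.append(start)
    return hits


# Pattern and string from the example slide, k = 1
S = "GCATTCGATGGACTGGACTAGTGATTGAGT"
print(k_mismatch_search("ACTC", S, 1))  # [2, 11, 16]
```

This scan costs O(|P|·|S|) character comparisons, which is the work a filter stage tries to avoid on most of S.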
The q-gram Lemma
For a pattern P, a string S, a value k:
Matches of P in S with at most k errors share at least
|P| - q + 1 - kq
substrings of length q (q-grams) with P.
Pattern P: T C G A T T A C (|P| = 8)
q-grams of P (q = 3): TCG, CGA, GAT, ATT, TTA, TAC
S: G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
Number of q-grams in P: |P| - q + 1 = 6
Error number k = 1: at least t = |P| - q + 1 - qk common q-grams in |P| letters
=> t = 8 - 3 + 1 - 3·1 = 3
In the DP matrix, one can count the number of matching q-grams per diagonal.
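The per-diagonal counting can be sketched as follows for the k-mismatches case. This is an illustrative sketch, not the authors' implementation: it indexes the q-grams of P, scans S once, and credits each q-gram hit to the diagonal d = (position in S) - (position in P). The function names are assumptions, and the threshold uses the lemma's formula t = |P| - q + 1 - kq.

```python
from collections import defaultdict


def qgram_threshold(m, q, k):
    """Threshold from the classic q-gram lemma: t = m - q + 1 - k*q."""
    return m - q + 1 - k * q


def diagonal_qgram_counts(pattern, text, q):
    """Count q-gram hits per diagonal d = j - i, where j is the hit's
    position in text and i its position in pattern."""
    index = defaultdict(list)
    for i in range(len(pattern) - q + 1):
        index[pattern[i:i + q]].append(i)
    counts = defaultdict(int)
    for j in range(len(text) - q + 1):
        for i in index.get(text[j:j + q], ()):
            counts[j - i] += 1
    return dict(counts)


def candidate_diagonals(pattern, text, q, k):
    """Diagonals with at least t hits; only these need verification."""
    t = qgram_threshold(len(pattern), q, k)
    if t <= 0:
        raise ValueError("threshold <= 0: the filter cannot exclude anything")
    counts = diagonal_qgram_counts(pattern, text, q)
    return sorted(d for d, c in counts.items() if c >= t)


P = "TCGATTAC"
S = "GCATTCGATGGACTGGACTAGTGAATCAGT"
print(diagonal_qgram_counts(P, S, 3))   # {-1: 1, 4: 3}
print(candidate_diagonals(P, S, 3, 1))  # [4]
```

In this example diagonal 4 passes the filter, yet the window of S starting at position 4 differs from P in three positions. The filter criterion is necessary but not sufficient, so the verification stage is still required.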
General idea:
• Use substrings with gaps (q-shapes)
• Compute the correct threshold t
• The total length s is called the span

Example with |Q| = 11, k = 3 (O = match, X = mismatch):

3-gram ###: t = 0, no filter!
OOXOOXOOXOO
OOX OXO XOO OOX OXO XOO OOX OXO XOO

3-shape ##.#: s = 4, 1 gap, t = 1
OOOXXOOXOOO
OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O
We developed a DP-based approach for computing the threshold t given a q-shape and a query length |P|.
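For small parameters the threshold of a q-shape can be checked by exhaustive enumeration. The sketch below is a brute-force substitute, not the DP-based approach mentioned above: it tries every placement of k mismatches in a query of length m and returns the smallest number of shape placements that survive. The function names and the shape encoding (`#` for a sampled position, `.` for a gap) are assumptions that follow the notation of the slides.

```python
from itertools import combinations


def shape_offsets(shape):
    """Offsets of the sampled positions, e.g. '##.#' -> [0, 1, 3]."""
    return [i for i, c in enumerate(shape) if c == "#"]


def brute_force_threshold(shape, m, k):
    """Minimum number of shape placements left intact by any choice
    of k mismatch positions in a query of length m.
    Exponential in k; intended only for small m and k."""
    offsets = shape_offsets(shape)
    starts = range(m - len(shape) + 1)
    best = len(starts)
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        surviving = sum(
            1 for s in starts if all(s + o not in bad for o in offsets)
        )
        best = min(best, surviving)
    return best


print(brute_force_threshold("###", 11, 3))   # 0
print(brute_force_threshold("##.#", 11, 3))  # 1
```

The two calls reproduce the values on the q-shapes slide for a query of length 11 with k = 3: the contiguous 3-gram gives t = 0, the gapped 3-shape gives t = 1.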
Judging the quality of q-shapes I
Observation: The threshold t is not the only factor that influences the behaviour of a q-shape.
We define the minimum coverage as the minimum number of matching characters for any arrangement of t matching q-shapes in P and a substring of length |P| in S.
Judging the quality of q-shapes II
##.#
 ##.#
-----
For t = 2 and the 3-shape ##.#, the minimum coverage is 5.
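The minimum coverage can likewise be checked by enumeration for small cases. This is a minimal sketch under the definition given on the previous slide: it tries every set of t distinct shape placements in a query of length m and returns the smallest number of distinct characters they cover. The function name is an assumption, and the exhaustive search is only practical for small m and t.

```python
from itertools import combinations


def minimum_coverage(shape, m, t):
    """Smallest number of distinct query positions covered by any
    t distinct placements of the shape in a query of length m."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    return min(
        len({s + o for s in chosen for o in offsets})
        for chosen in combinations(starts, t)
    )


print(minimum_coverage("##.#", 11, 2))  # 5
print(minimum_coverage("###", 11, 2))   # 4
```

Two placements of ##.# can share at most one character, so they cover at least 5 characters, whereas two adjacent contiguous 3-grams share two characters and cover only 4.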
The value q (i.e. the number of matching characters in a shape) determines the expected number of occurrences in a random string S.
Judging the quality of q-shapes III
3-shape: ##.#, alphabet Σ = {A,C,G,T}
Expected number of occurrences of a single 3-shape in S:
$\mathrm{occ} = |S| \cdot \frac{1}{|\Sigma|^{q}}$
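The formula is easy to evaluate for concrete values. The sketch below assumes a uniformly random string over the four-letter alphabet; the text length of 50 million is chosen for illustration only, and the function name is hypothetical.

```python
def expected_occurrences(text_length, alphabet_size, q):
    """Expected occurrences of one fixed q-shape in a uniformly random
    string: |S| / |alphabet|^q (boundary effects ignored)."""
    return text_length / alphabet_size ** q


for q in (3, 6, 9, 12):
    print(q, expected_occurrences(50_000_000, 4, q))
```

Each additional sampled position divides the expected number of hits by the alphabet size, which is why larger q makes the filter step faster.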
The speed of the filter step is influenced by the expected number of matching q-shapes in S. The efficiency of the filtration correlates closely with the minimum coverage.
Judging the quality of q-shapes IV
Speed: value of q
Efficiency: minimum coverage
Good shapes are not necessarily regular or predictable in their form.
Judging the quality of q-shapes V
Shapes with maximal minimum coverage for |Q| = 50, k = 5:
q = 6:  ##......#..#..#.#
q = 9:  ###..#..#.#...#.##
q = 10: ###..#..#.#..###.#
q = 11: #######.##.##
q = 12: ###.#..###.#..###.#
Experimental setup for q-shapes:
• 50 million character random (Bernoulli) string S
• 1000 random queries of length 500
• queries have no approximate matches in S
• compute threshold for |Q| = 50
• actual value of |Q| is 500! (to reduce runtime of tests)
Experiments show 10x reduced filter efficiency; relative performance between shapes unaffected.
Evaluating q-shapes
What we measured for every shape and all queries:
A) The total number of occurrences of all shapes. Good indicator of the total work for the filter phase.
B) The number of diagonals containing at least t shapes. Good indicator of the filter efficiency.
The experiments show a good correlation between A and the predicted values, as well as between B and the minimum coverage.
Evaluating q-shapes
• An analysis of q-grams with gaps (q-shapes)
• Results include:
  • experimental evidence for their superiority when compared to standard q-grams
  • a method to roughly judge their quality, the minimum coverage
  • a way to calculate the parameters required to use them in a filter algorithm
Our work….
• an algorithm to predict the best shapes
• improve the quality measure for q-grams
• extension to the k-differences problem (with insertions and deletions)
• a thorough analysis of filter behaviour for > k differences (use as a heuristic filter)
Todo….