View
217
Download
0
Category
Tags:
Preview:
Citation preview
Overview
Two stage search Indexing on intervals Frames Scoring Optimization Results Questions + answers
Two-stage searching
Coarse searching Uses heuristics to find promising alignments Interval matching
Fine searching Intensive processing of results from coarse search Calculates the actual score of the alignments by using
more detailed information from the matching sequences
This requires retrieval of sequences from the database which is expensive
Intervals
Overlapping intervals:Sequence: ACCTGACG, with length l=8 Interval length: n=3 Intervals: ACC, CCT, TGA, GAC, ACG
Indexed VS. Exhaustive
Exhaustive:Every sequence in the database is retrieved
and processed, this is costly for large databases
Often use heuristics to reduce the number of sequences that need to be aligned
Popular exhaustive systems FASTA, BLAST1, BLAST2 are based on Wilbur-Lipman interval approach
Indexed VS. Exhaustive
Wilbur-LipmanBuild a hashing structure with all intervals in
the query sequence as keysHash all intervals in the entire database, and
if the interval is present in the hashing structure we know that there is a match
This is faster than walking through the query sequence for every interval in the database
Wilbur-Lipman
A C T G A C T C
A C T
C T G
T G A
G A C
C T C
0 4
5
1
3
2
intervaloffsets
query sequence
hashing structure
Indexed search
Instead of retrieving all sequences from the database, use an index to locate promising alignments
Coststorage of the index
Advantages fast lookup, only index needs to be accessedenables partial retrieval (for fine searching)
Indexed search
FLASH Enables gapped searching by indexing permuted
intervals Example:
Interval length = 5, subsequence length = 3 Sequence: ACCTGATT, results in: ACC, ACT, ACG ACT ACG ATG
Problem: Index becomes huge
Indexed search in CAFE
(similar to RAMDB) Index contains:
Search structure Searching is done on keys, these keys
represent overlapping intervals that occur in the database
Posting lists For every interval a list is stored
containing references to all occurences of this interval. These references contain a sequence id and an offset
Indexed search in CAFE
Posting lists can be long Compressed to reduce used disk space
AA
AC
8 12 1117
4
3
8
6 142
2011
interval sequence idoffsets
Compression techniques
To reduce the size of the index posting lists are compressed
Instead of storing each sequence number of the sequence in which the interval occurs, store the first one and use relative offsets for the rest
Eg. 101, 109, 217, 412, 980, 1013 becomes 101, 8, 108, 195, 568, 33
Compression techniques
Of course the same same can be done with the places where the interval occurs inside a query
Also Elias Gamma and Delta coding But just compression was not good
enoughUse heuristics to reduce the size of the index
even further
Heuristics to decrease index size
Limit indexing of wildcardswildcards are instantiated before being
indexed.Several adjacent wildcards may lead to an
explosion of matches (Eg. NNNNN)Solution: only index intervals containing at
most one wildcard
Heuristics to decrease index size
Another one:Do not create an entry in the index for
intervals that occur in more than x% of the sequences
Coarse searching
Uses FRAMES “A frame is a set of one or more matching
intervals between a database sequence and a query sequence that are at the same relative offset”
FRAMES
Frames are constructed using the offsets stored in the index
A frame can be represented by a combination of sequence id and offset
The information contained in frames is used to determine the most promising alignments
FRAMES
F2 : (9,7), (10,8), (27,25), (28,26), (29,27), (30,28), (31,29), (32,…)
F1 : (17,16), (18,17)
F26 : (53,27), (54,28)
Constructing frames
For every interval match with offsets x and y:Calculate the relative offset z If a frame with this relative offset (Fz) already
exists, append (x, y) to this frameOtherwise, create a new frame containing (x, y)
Frame ranking schemes
First approach: FRAMECOUNTCount the amount of matching intervals for
each frame, and take the maximum over all frames
FRAMES
F2 : Contains 8 matching intervals
F1 : Contains 2 matching intervals
F26 : Contains 2 matching intervals
FRAMECOUNT = 8, resulting from frame F2
Frame ranking schemes
Problem with FRAMECOUNT:Does not take into account relative positioning
of intervals within a single frame COVERAGE:
Amount of matching bases per frameTakes into account overlapping of intervals
Comparing FRAMECOUNT and COVERAGE: A : FRAMECOUNT = 7, COVERAGE = 9
B : FRAMECOUNT = 7, COVERAGE = 21
COVERAGE seems to be a more accurate scoring scheme
Frame ranking schemes
Another scheme: LENGTHAmount of bases between first and last base
contained in matching intervals in a frame
Frame ranking schemes
What we did not understand: “Despite this, the length scheme is particularly attractive since it
ranks highly regions that are longer and, therefore, will rank long homologous alignments ahead of shorter alignments.”
COMBINED scheme: COVERAGE and LENGTH combined Takes into account residues that are not part of matching intervals COMBINED = COVERAGE – k * (LENGTH – COVERAGE) The value for k is determined empirically (<< 1)
Scoring with amino acids
COVERAGE scheme can be improved by using a substitution matrix Instead of counting the number of matches,
use the sum of appropriate values in the substitution matrix
Interval scores can be cached
Normalizing scores
To compensate for increased score with longer sequences:Snorm = (S * 21.21)/(ln l1 * ln l2)
“l1 and l2 are the lengths of the two aligned regions” We assume they mean the lengths of the
aligned sequences, since we have read that this is used by Shpaer et al.
Optimising frames
Apply a fixed ceiling on the amount of frames. Decreases memory usage and CPU time Which frames are most important ?
Frames containing the most discriminating intervals So we should look up the most discriminating intervals first. The database-frequency of each interval occuring in the
query is determined After sorting them, we lookup the intervals with the lowest
frequency first, in this way have the best discrimination.
Optimising frames
NEIGHBOURHOOD schemeGives higher scores for frames that are close
to other framesAllows gaps (indels), by “combining” framesFor every other frame in the same sequence:
add s1/d to the score, where s is the score of the other frame d is the difference in offset from the other frame
Optimising frames
Information stored for frames can be (re)used for fine searching:Matching regions can be used as a starting
point for fine-searchingFrames allow us to partially retrieve all
relevant sequences from the database
Test data
PIR database used to assess accuracy Sequences used that are classified in super families Single member families removed Filtered test set: 1834 sequences Only query sequences with #residues < 500 Precision and recall are measured
Precision = Relevant results/Total results Recall = Relevant results/Total relevant
Test data
Used for assessing speed and index size:Genbank databases:
GENBANK97: 652 mln. nucleotide bases in 1 mln. sequences
GENBANK108: 1797 mln. nucleotide bases in 2.5 mln. sequences
VERTE: 177 mln. nucleotide bases in 0.12 mln. sequences
Results
CAFE compared to BLAST and FASTA: CAFE has similar precision CAFE becomes relatively faster with increasing
database size
Further observations For CAFE, queries take more time and memory for
processing when their intervals occur often in the database
For BLAST and FASTA, performance drops when the entire database can not be stored in memory
Questions Bogdan:
Question 1(overlapping intervals): Why do they use overlapping intervals? Is that
because if you have the string “ABCDEF” and the intervals “ABC” and “DEF”, that than a query interval “BCD” wouldn’t match ?
Answer: We think your intuition is correct, if we do not use
overlapping intervals lots of existing intervals would be missed
Questions Question 2 (optimising frames):
What do they mean with sorting the query intervals,and how does sorting make them discriminate well between sequences?
Answer: Query intervals: the intervals that are generated from the query
sequence. The database-frequency of each interval occuring in the query is determined After sorting them, we lookup the intervals with the lowest frequency first, in
this way have the best discrimination.
Question 3: “Could you explain the two alternatives when the threshold is reached?”
Answer: Option 1: Stop checking matches immediately when the ceiling is reached Option 2: Keep adjusting existing frames with new matches
Questions
Marjolijn:Question:
Can you explain the difference between an inverted index and a fine-grain index?
Answer: Only one type of index is used This index is an inverted list The index also has a high granularity (fine-grain)
Questions
Lee:Question 1 (compression):
What sorts of compression are used to decrease the size of the index ?
Answer: See the part of this presentation about
compression
Questions
Question 2: “They say that the problem with uncompressed lists is that
you can be penalized in time by the disk retrievals. On page 22 they are saying "However, these algorithms are highly reliant on having sufficent memory to store the complete database." So if I understand it right, they store the index on the disk and the database in main memory? Isn't it more common to put the index in main memory and the database on disk?
Answer: The sentence from page 22 is not about CAFE, but about
two other methods: FASTA and BLAST. These methods do not use an inverted index and hence
need fast access to the database.
Questions Jacob:
Question(COMBINED): “I was wondering if the combination of coverage and length as
mentioned on page 14 is sufficient for scoring alignments. I believe that there are lots of examples with different alignments which would score the same. Like this one
CGATCGAATAGCATCGTGGCGGTGAGCGGTTTCTGTTTCTGTTCTT :::::: :::TTATCGAATAGCGGCGCTAGCATCGATCATTCTACTTTCAAACTGC
CGATCGAATAGCATCGTGGCGGTGAGCGGTTTCTGTTTCTGTTCTT ::: ::: :::TTATCTCTATAGCGGCGGGCGCATCGATCATTCTCTTTCAAACTGC
Answer: We should investigate whether clustering influences the probability of
a successful allignment If so we could try to think of a new scoring scheme:
Scheme: CLUSTERING Calculate the degree of clustering within a frame
Questions
Laurence: Question:
“On page 15 in section 3.3 in paragraph 4 it says that composition alignment between matching intervals in the length metric is difficult to model without fine-searching the region. Therefore they elect not to modify the LENGTH scheme for complex models but rather apply statistical normalisation to incorporate variations in composition the resultant score.
Can you maybe explain in further detail why is chosen to apply statistical normalisation in contrast to modifying the LENGTH scheme, and what does fine-searching the region have to do with this”
Answer: LENGTH misses compositional information Compositional information requires retrieval of database sequences However, when random query and database strings get larger, average
allignment scores get higher. To rule this out we do a normalisation. By doing this, length still plays an important role in scoring.
Questions Bram:
Question: “For COVERAGE they use pre-calculations of the scores to
incorporate variations in composition in the resultant score. In LENGTH they use the Shpaer normalisation for it. At the end of the paragraph they propose to use the Shpaer scheme to normalise the COVERAGE scores. What's the advantage of normalising the COVERAGE scores, if we already did the pre-calculation?”
Answer: You probably think of normalisation as ruling out which substitution
matrix was used, that’s not what is meant. The Shpaer normalisation is applied to values calculated for frames
Questions
Adriano Question:
page 22 “In CAFE, evaluation times depend on query length and on statistics of the intervals in the query; intervals with longer inverted lists require more processing. What does "longer inverted lists" mean? and why does it require more processing?”
Answer: Inverted lists = posting lists for matching intervals For every matching interval we need to process its posting
list in order to constuct frames
Recommended