www.strandls.com
Read Alignment Algorithms
The Problem
• Given a very long reference sequence of length n and given several short strings (reads) of length m each, m << n
• Find the best matching location for each read in the reference
• Where the best location is that which minimizes the number of mismatches
• We ignore insertions and deletions for the moment; those will come later
• Provided the number of mismatches is at most, say 5% of m
Indexing the Reference
• What if we do not allow any mismatches at all?
• Pre-process the reference sequence so…
• Each query (find the best matching location of a read) can be answered in time proportional to m and independent of n
• The resulting data structure is called an index
• Suffix trees are one possible index
• A trie of all suffixes of the reference sequence, with a $ marker at the end
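The idea of an index with O(m) exact-match queries can be sketched with a plain, uncompressed suffix trie. This is only an illustration under stated assumptions: the names `build_suffix_trie` and `find_exact` are hypothetical, the trie uses nested dictionaries, and it takes quadratic space, so it is not the compact suffix tree the slides go on to discuss.

```python
def build_suffix_trie(reference):
    """Insert every suffix of reference + '$' into a nested-dict trie."""
    text = reference + "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        # The multi-character key "start" cannot clash with single-character edges.
        node["start"] = start
    return root


def find_exact(trie, read):
    """Walk down the trie in len(read) steps, then list the matching offsets."""
    node = trie
    for ch in read:
        if ch not in node:
            return []          # stuck: the read does not occur exactly
        node = node[ch]
    # Every leaf below this node is a suffix that begins with the read.
    starts, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "start":
                starts.append(child)
            else:
                stack.append(child)
    return sorted(starts)
```

The walk down the trie costs one step per read character regardless of the reference length; only listing the occurrences afterwards depends on how many there are.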
Suffix Trees
[Figure: suffix tree built over the reference C G A C G, shown with the query C G C]
Space Required by Suffix Trees
• n-1 internal nodes plus n leaves, so 2n-1 nodes
• 2n-2 tree pointers + n pointers into the reference
• So ~3n pointers
• 36GB for n ≈ 3 billion with 4-byte pointers!
• Can we make this smaller?
Indexing the Reference with Mismatches
• What if we allow mismatches?
• So we put the query through the suffix tree but get stuck and can't proceed further
• Next, resume by dropping the first character, but without redoing the work already done
• How?
Suffix Links in Suffix Trees
[Figure: suffix tree with suffix links over the reference C G A C G, shown with the query G C G]
Indexing with Mismatches (Contd)
• For an internal node A with string x leading down from the root to that node and branching into xa and xb
• Let x=cy
• Then there exists a node B with string y leading down from the root to that node
• The suffix link from A leads to this node B
• Such a node exists because ya and yb both occur in the reference whenever xa and xb do, so y branches as well
• So if you get stuck, you follow the suffix link in constant time and continue from where you left off, to find the longest perfect-match substring starting at each position in the read
• Or alternatively, find all substrings of a certain minimum length that match
• Check explicitly for the number of mismatches at each of these locations
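The final verification step can be sketched as follows. This is a minimal illustration, assuming the candidate offsets have already been produced by some exact-match seeding step; the function names and the `max_fraction` parameter are hypothetical, with the default of 0.05 taken from the 5% bound stated in the problem.

```python
def count_mismatches(reference, read, offset):
    """Number of positions where the read differs from the reference window."""
    window = reference[offset:offset + len(read)]
    return sum(a != b for a, b in zip(window, read))


def best_location(reference, read, candidates, max_fraction=0.05):
    """Return (offset, mismatches) for the best candidate within the limit, or None."""
    limit = max_fraction * len(read)
    best = None
    for offset in candidates:
        # Skip candidates where the read would run off the reference.
        if offset < 0 or offset + len(read) > len(reference):
            continue
        mismatches = count_mismatches(reference, read, offset)
        if mismatches <= limit and (best is None or mismatches < best[1]):
            best = (offset, mismatches)
    return best
```

Each candidate costs O(m) to check, so the seeding step matters mostly for keeping the candidate list short.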
Space Required by Suffix Trees & Links
• n-1 internal nodes plus n leaves, so 2n-1 nodes, plus n-1 suffix links
• 2n-2 tree pointers + n-1 suffix link pointers + n pointers into the reference
• So ~4n pointers
• 48GB!
• Can we make this smaller? Can we fit this tree into an array?
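The pointer counts can be turned into a quick size estimate. The sketch below assumes 4-byte pointers and a reference of 3 billion characters; both are assumptions chosen because they reproduce the 36GB and 48GB figures, and the function names are hypothetical.

```python
def suffix_tree_pointers(n, with_suffix_links=False):
    """Pointer count for a suffix tree over n characters."""
    tree_pointers = 2 * n - 2        # one per non-root node
    reference_pointers = n           # one per leaf, into the reference
    link_pointers = n - 1 if with_suffix_links else 0
    return tree_pointers + reference_pointers + link_pointers


def gigabytes(pointers, bytes_per_pointer=4):
    """Storage in GB (10^9 bytes) for the given number of pointers."""
    return pointers * bytes_per_pointer / 1e9


n = 3_000_000_000  # assumed reference length
print(round(gigabytes(suffix_tree_pointers(n))))
print(round(gigabytes(suffix_tree_pointers(n, with_suffix_links=True))))
```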
A Succinct Data Structure
The Reference: C G A C $
All circular shifts, sorted lexicographically (with $ sorting last):
A C $ C G
C G A C $
C $ C G A
G A C $ C
$ C G A C
Burrows-Wheeler Transform: the last column, G $ A C C
• Store only the first and last columns and the links back to the reference
• Used in bzip
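The table of sorted circular shifts can be built naively to see what is being stored. This sketch assumes the alphabet A, C, G, T with $ sorting last, as in the example table; `ORDER` and `bwt_columns` are hypothetical names, and sorting whole rotations like this is far too slow and memory-hungry for a real reference.

```python
# Assumed character order: $ sorts after the four bases, matching the example.
ORDER = {ch: rank for rank, ch in enumerate("ACGT$")}


def bwt_columns(text):
    """Return (first column, last column, links back to the reference).

    `text` must end with a single '$'.
    """
    n = len(text)
    # Sort rotation start positions by the rotation they spell out.
    starts = sorted(
        range(n),
        key=lambda i: [ORDER[c] for c in text[i:] + text[:i]],
    )
    first = "".join(text[i] for i in starts)
    # The last character of the rotation starting at i is text[i - 1].
    last = "".join(text[i - 1] for i in starts)
    return first, last, starts


first, last, starts = bwt_columns("CGAC$")
print(first, last, starts)
```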
A Succinct Data Structure
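[Figure: the sorted circular shifts of the reference C G A C $ with their links back to the reference (2, 0, 3, 1, 4); the first column reads A C C G $ and the last column reads G $ A C C]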
• The reference can be reconstructed from the first and last columns
• Claim: The ith G in the first column corresponds to the ith G in the last column! Likewise for A, C and T.
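The claim is enough to rebuild the reference from the two columns. The sketch below is one way to do it, assuming a single $ terminator; `reconstruct` is a hypothetical name. It starts at the row beginning with $, reads the character in the last column (the one that precedes $ in the reference), jumps to the row in the first column holding the same occurrence of that character, and repeats.

```python
def reconstruct(first, last):
    """Rebuild the reference (ending in '$') from the first and last columns."""
    # Row of the r-th occurrence of each character in the first column.
    row_in_first, seen = {}, {}
    for row, ch in enumerate(first):
        r = seen.get(ch, 0)
        row_in_first[(ch, r)] = row
        seen[ch] = r + 1

    # Rank of each last-column character among equal characters above it.
    rank_in_last, seen = [], {}
    for ch in last:
        r = seen.get(ch, 0)
        rank_in_last.append(r)
        seen[ch] = r + 1

    row = first.index("$")
    collected = ["$"]
    for _ in range(len(first) - 1):
        ch = last[row]                       # character preceding this row's start
        collected.append(ch)
        row = row_in_first[(ch, rank_in_last[row])]
    return "".join(reversed(collected))


print(reconstruct("ACCG$", "G$ACC"))
```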
Proof of Claim
yG<xG if and only if Gy<Gx; that's it!
So given a G in the first column, say corresponding to the string Gx
– Its rank r is trivial to find because the first column is sorted, just store counts for all 4 characters
– We need to locate the corresponding G in the last column
– In other words, the index of the string xG in the table
– Which is the rth G in the last column [The Select Query]
So given a G in the last column, say corresponding to the string xG
– Find its rank r among G's in the last column [The Rank Query]
– We need to locate the corresponding G in the first column
– In other words, the index of the string Gx in the table
– Which is the rth G in the first column, trivial to find
Select and Rank Queries
Given a binary array
– SELECT: Given index i, find the ith 1
– RANK: Given index i, find how many 1s precede this location
Use a separate array for each of the 4 characters
RANK is easy, just keep counts at Δ milestones and answer queries by traversing to the nearest milestone in time Δ
– 4n/Δ bytes of storage, O(Δ) time
SELECT needs a bit more, keep counts for Δ-rank milestones
– Go to the nearest rank milestone and traverse from there
– May need to traverse quite a bit though
– So need an extra data structure to get to the next 1, which you store at Δ milestones
– So 8n/Δ bytes storage, O(Δ) time
Of course we need the 4 n-bit binary arrays as well
So 4n bits + 48n/Δ bytes and O(Δ) time
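The milestone idea can be sketched for a single binary array. The class below is a simplified illustration with hypothetical names: it keeps a running count at every Δ-th position for RANK and the position of every Δ-th 1 for SELECT. It deliberately omits the extra "next 1" structure, so its SELECT scan is not bounded by Δ when the 1s are sparse.

```python
class MilestoneIndex:
    """Rank/select over a list of 0/1 values using milestones every `delta`."""

    def __init__(self, bits, delta):
        self.bits = bits
        self.delta = delta
        # rank_marks[k] = number of 1s in bits[0 : k * delta]
        self.rank_marks = [0]
        running = 0
        for pos, bit in enumerate(bits, start=1):
            running += bit
            if pos % delta == 0:
                self.rank_marks.append(running)
        # select_marks[k] = position of the 1 with 0-based rank k * delta
        self.select_marks = []
        ones_seen = 0
        for pos, bit in enumerate(bits):
            if bit:
                if ones_seen % delta == 0:
                    self.select_marks.append(pos)
                ones_seen += 1

    def rank(self, i):
        """How many 1s precede position i (scans fewer than delta entries)."""
        block = i // self.delta
        return self.rank_marks[block] + sum(self.bits[block * self.delta:i])

    def select(self, r):
        """Position of the 1 with 0-based rank r (scan length is not bounded)."""
        pos = self.select_marks[r // self.delta]
        remaining = r % self.delta
        while remaining:
            pos += 1
            if self.bits[pos]:
                remaining -= 1
        return pos
```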
String Matching using Rank-Selects
Given a string Gx
Assume inductively we have the band B of indices in the table corresponding to suffixes that begin with x
We want the band B’ that begins with Gx
Take the band B, take the last column, identify the rank of the first and last G in the last column, find their corresponding first column indices; that’s the band
– All doable using RANK alone
At the end you have the band containing all suffixes which begin with Gx
Unless of course, there are none, in which case the band will vanish at some point
We can use this to find matches for say all length 16 substrings of a read
So 4n bits + 48n/Δ bytes and O(mΔ) time per read
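The band-narrowing step can be sketched as a backward search over the two columns. This is an illustration under assumptions: the table is built by naive rotation sorting with $ sorting last, RANK is answered with a plain `str.count` instead of milestones, and the names `build_columns` and `backward_search` are hypothetical.

```python
ORDER = {ch: rank for rank, ch in enumerate("ACGT$")}  # assumed order, $ last


def build_columns(reference):
    """Naively build the first/last columns and reference links for reference + '$'."""
    text = reference + "$"
    starts = sorted(
        range(len(text)),
        key=lambda i: [ORDER[c] for c in text[i:] + text[:i]],
    )
    first = "".join(text[i] for i in starts)
    last = "".join(text[i - 1] for i in starts)
    return first, last, starts


def backward_search(first, last, pattern):
    """Return the band [lo, hi) of table rows that begin with `pattern`."""
    lo, hi = 0, len(first)                 # empty pattern: the whole table
    for ch in reversed(pattern):           # extend x to Gx, one character at a time
        base = first.find(ch)              # first row starting with ch
        if base < 0:
            return 0, 0
        # RANK of ch in the last column at both ends of the current band.
        lo = base + last.count(ch, 0, lo)
        hi = base + last.count(ch, 0, hi)
        if lo >= hi:
            return 0, 0                    # the band has vanished
    return lo, hi


first, last, starts = build_columns("CGAC")
lo, hi = backward_search(first, last, "C")
print(sorted(starts[lo:hi]))
```

Replacing `last.count` with a milestone-based RANK is what brings each step down to O(Δ).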
Identifying Indices in the Reference
We still have to go from a band in the table to indices in the reference
4n bytes if we store explicitly
We can use the same trick, store explicitly at Δ milestones
Then, if a row with string Gx sits at reference index i, we can move to the row with string xG, which sits at reference index i+1, and so on till we get to a milestone
4n/Δ bytes storage
Time per index is O(Δ)
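The milestone walk can be sketched as follows. Assumptions are stated plainly: the table is built naively, the Gx to xG step is precomputed as an array rather than answered with SELECT, milestones are the rows whose reference index is a multiple of Δ, and all names are hypothetical.

```python
ORDER = {ch: rank for rank, ch in enumerate("ACGT$")}  # assumed order, $ last


def build_table(text):
    """First/last columns and reference index per row; `text` ends with '$'."""
    starts = sorted(
        range(len(text)),
        key=lambda i: [ORDER[c] for c in text[i:] + text[:i]],
    )
    first = "".join(text[i] for i in starts)
    last = "".join(text[i - 1] for i in starts)
    return first, last, starts


def forward_links(first, last):
    """links[row] = row of xG when `row` holds Gx (r-th G in first to r-th G in last)."""
    rows_in_last = {}
    for row, ch in enumerate(last):
        rows_in_last.setdefault(ch, []).append(row)
    links, seen = [], {}
    for ch in first:
        r = seen.get(ch, 0)
        links.append(rows_in_last[ch][r])
        seen[ch] = r + 1
    return links


def locate(row, links, milestones, n):
    """Reference index of `row`, found by walking forward to a stored milestone."""
    steps = 0
    while row not in milestones:
        row = links[row]       # reference index grows by 1 per step
        steps += 1
    return (milestones[row] - steps) % n


text = "CGACGTTACAG$"
first, last, starts = build_table(text)
links = forward_links(first, last)
delta = 4
milestones = {row: s for row, s in enumerate(starts) if s % delta == 0}
print([locate(row, links, milestones, len(text)) for row in range(len(text))])
```

Because reference index 0 is always a milestone, the walk terminates after fewer than Δ steps.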
Sorting Circular Shifts
It remains to describe the construction of the table in the first place
Given a string S=x0 x1 x2 ….$
– Consider string S’=(x0 x1 x2) (x1 x2 x3) (x3 x4 x5) (x4 x5 x6)….
– Note (x2 x3 x4) and other triplets starting at 2 mod 3 are missing
– Rename S’ so identical tuples get the same number and distinct tuples get different numbers
– Recursively sort S’
• How does x0 x1 x2 … compare to x1 x2 x3 … ?
– Already available from recursion
• How does x0 x1 x2 … compare to x2 x3 x4 … ?
– Compare x0 , x2 and then x1 x2 … , x3 x4 …
– We have info for comparing all pairs of suffixes!
– Sort the 2 mod 3 suffixes and then merge them in
– Time T(n)= T(2n/3)+O(n), which is O(n), since S’ has 2n/3 tuples
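The comparison and merge steps can be sketched in isolation. In the illustration below the recursive sort of S’ is replaced by a naive sort of the 0 mod 3 and 1 mod 3 suffixes, which is an explicit stand-in and not linear time; the point is only that, once those ranks exist, every remaining comparison needs at most two characters and one rank lookup. It sorts plain suffixes with Python's default character order, and all names are hypothetical.

```python
def suffix_order_via_sample(s):
    """Sorted suffix start positions of s, using ranks of the 0/1 mod 3 suffixes."""
    n = len(s)
    # Stand-in for the recursion: rank the sample suffixes by naive sorting.
    sample_sorted = sorted((i for i in range(n) if i % 3 != 2), key=lambda i: s[i:])
    sample_rank = {i: r for r, i in enumerate(sample_sorted)}

    def rank(i):
        # Only called for sample positions or positions past the end.
        return sample_rank.get(i, -1)

    def char(i):
        return s[i] if i < n else ""   # "" sorts before every real character

    # A 2 mod 3 suffix is its first character plus a 0 mod 3 suffix.
    others = sorted((i for i in range(n) if i % 3 == 2),
                    key=lambda i: (s[i], rank(i + 1)))

    def sample_is_smaller(i, j):
        """Compare sample suffix i against 2 mod 3 suffix j."""
        if i % 3 == 0:
            # i+1 is 1 mod 3 and j+1 is 0 mod 3: both ranked.
            return (char(i), rank(i + 1)) < (char(j), rank(j + 1))
        # i is 1 mod 3: peel two characters so i+2 and j+2 are both ranked.
        return (char(i), char(i + 1), rank(i + 2)) < (char(j), char(j + 1), rank(j + 2))

    merged, a, b = [], 0, 0
    while a < len(sample_sorted) and b < len(others):
        if sample_is_smaller(sample_sorted[a], others[b]):
            merged.append(sample_sorted[a])
            a += 1
        else:
            merged.append(others[b])
            b += 1
    return merged + sample_sorted[a:] + others[b:]
```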
A Generalization: Difference Covers
[Figure: a string marked at positions v, 2v, 3v, with a set D of indices mod v]
This string has size |D|n/v
Time taken to create this string is O(n |D|)
Sorting suffixes of this string gives the sorted order of all suffixes which begin at indices j such that j mod v is in D
A Generalization: Difference Covers
D is a Difference Cover if distances between beads in D generate 0,1…,v-1
Then for any 2 indices i and j, i-j mod v is the distance between some two beads in D
So some shift x<v moves both i and j onto beads of D
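The difference cover property is easy to check by brute force for small v. The sketch below uses hypothetical function names; the sets {0, 1} mod 3 (the case used for sorting circular shifts) and {0, 1, 3} mod 7 are standard small examples.

```python
def is_difference_cover(D, v):
    """True if differences between elements of D cover every residue mod v."""
    differences = {(a - b) % v for a in D for b in D}
    return differences == set(range(v))


def common_shift(i, j, D, v):
    """Smallest x < v with (i + x) mod v and (j + x) mod v both in D, else None."""
    for x in range(v):
        if (i + x) % v in D and (j + x) % v in D:
            return x
    return None


print(is_difference_cover({0, 1}, 3), is_difference_cover({0, 1, 3}, 7))
print(common_shift(2, 5, {0, 1}, 3))
```

With such a shift available, comparing suffixes i and j reduces to comparing at most x characters and then two suffixes whose order is already known.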
A Generalization: Difference Covers
There exists a Difference Cover of size 1.5*sqrt(v)!
[Figure: difference cover construction, with two segments labeled sqrt(v)]
Thank you