CS262 Lecture 15, Win07, Batzoglou
Multiple Sequence Alignments
Saving cells in DP
1. Find local alignments
2. Chain: O(N log N) via L.I.S.
3. Restricted DP
Methods to CHAIN Local Alignments
Sparse Dynamic Programming: O(N log N)
The Problem: Find a Chain of Local Alignments
Chaining (x, y) before (x’, y’) requires x < x’ and y < y’
Each local alignment has a weight
FIND the chain with highest total weight
Sparse Dynamic Programming
Back to the LCS problem:
• Given two sequences x = x1, …, xm
y = y1, …, yn
• Find the longest common subsequence
• Quadratic solution with DP
• How about when “hits” xi = yj are sparse?
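As a baseline for the sparse approach, here is a minimal sketch of the quadratic DP for LCS mentioned above. The function name `lcs_length` is an illustrative choice, not something from the lecture.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y in O(m * n) time."""
    m, n = len(x), len(y)
    # dp[i][j] = LCS length of the prefixes x[:i] and y[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                # A "hit": extend the best alignment of the shorter prefixes
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


x = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]
y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]
print(lcs_length(x, y))  # 5
```

Every cell of the m × n table is filled, even though only the cells with xi = yj can increase the score. That waste is what the sparse method avoids.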
Sparse Dynamic Programming
[Figure: hit matrix with x = 15 3 24 16 20 4 24 3 11 18 along the columns and y = 4 20 24 3 11 15 11 4 18 20 along the rows]
• Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead
Sparse Dynamic Programming – L.I.S.
• Longest Increasing Subsequence
• Given a sequence over an ordered alphabet
x = x1, …, xm
• Find the longest subsequence
s = s1, …, sk
such that s1 < s2 < … < sk
Sparse Dynamic Programming – L.I.S.
Let input be w: w1, …, wn
INITIALIZATION:
L: array of last LIS elements; L[0] = -inf, L[1] = w1, L[2…n] = +inf
B: array holding the indices of the LIS elements; B[0] = 0, B[1] = 1
P: array of backpointers; P[1] = 0
// L[j]: smallest jth element wi of a j-long increasing subsequence seen so far
ALGORITHM
for i = 2 to n {
    Find j such that L[j – 1] < w[i] ≤ L[j]
    L[j] ← w[i]
    B[j] ← i
    P[i] ← B[j – 1]
}
That’s it!!!
• Running time?
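A minimal Python sketch of the procedure above, assuming a strictly increasing subsequence is wanted. The names `tails`, `tail_index` and `prev` stand in for the slide's L, B and P (0-based here). The "find j" step is done with binary search from the standard `bisect` module, so each of the n iterations costs O(log n).

```python
from bisect import bisect_left


def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w in O(n log n)."""
    tails = []             # tails[j]: smallest last value of an increasing run of length j + 1
    tail_index = []        # tail_index[j]: position in w of tails[j] (the slide's B)
    prev = [-1] * len(w)   # prev[i]: position of the element preceding w[i] (the slide's P)
    for i, value in enumerate(w):
        j = bisect_left(tails, value)   # first j with tails[j] >= value
        if j == len(tails):
            tails.append(value)
            tail_index.append(i)
        else:
            tails[j] = value
            tail_index[j] = i
        prev[i] = tail_index[j - 1] if j > 0 else -1
    # Follow the backpointers from the end of the longest run
    result = []
    i = tail_index[-1] if tail_index else -1
    while i != -1:
        result.append(w[i])
        i = prev[i]
    return result[::-1]


print(longest_increasing_subsequence([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 4, 5, 6]
```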
Sparse LCS expressed as LIS
Create a sequence w
• Every matching point (i, j), is inserted into w as follows:
• For each column j = 1…m, insert in w the points (i, j), in decreasing row i order
• The 11 example points are inserted in the order given
• a = (y, x), b = (y’, x’) can be chained iff
a is before b in w, and y < y’
[Figure: hit matrix of x (columns) against y (rows); the 11 matching points are numbered 1 to 11 in the order they are inserted into w]
Sparse LCS expressed as LIS
Create a sequence w
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w’s elements as ordered lexicographically, where
• (y, x) < (y’, x’) if y < y’
Claim: An increasing subsequence of w is a common subsequence of x and y
[Figure: the same hit matrix of x (columns) against y (rows), with the matching points numbered in insertion order]
Why don’t we insert elements (i, j) in w in
increasing row i order?
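The reduction can be sketched as follows, under the assumption that sequences are plain Python lists and rows and columns are 0-based. The function name `sparse_lcs` is hypothetical. The decreasing row order within a column matters because two hits in the same column use the same element of x; listing them in decreasing row order guarantees that an increasing subsequence of w never picks both.

```python
from bisect import bisect_left
from collections import defaultdict


def sparse_lcs(x, y):
    """LCS of x and y, computed as an LIS over the matching points (hits)."""
    rows_of = defaultdict(list)   # symbol -> rows i of y that hold it, increasing
    for i, symbol in enumerate(y):
        rows_of[symbol].append(i)
    # For each column j of x (left to right), list the hit rows in DECREASING order.
    w = [i for symbol in x for i in reversed(rows_of.get(symbol, []))]

    # LIS over the row values of w, with backpointers
    tails, tail_pos, prev = [], [], [-1] * len(w)
    for k, row in enumerate(w):
        j = bisect_left(tails, row)
        if j == len(tails):
            tails.append(row)
            tail_pos.append(k)
        else:
            tails[j] = row
            tail_pos[j] = k
        prev[k] = tail_pos[j - 1] if j > 0 else -1

    out = []
    k = tail_pos[-1] if tail_pos else -1
    while k != -1:
        out.append(y[w[k]])
        k = prev[k]
    return out[::-1]


x = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]
y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]
print(sparse_lcs(x, y))  # [4, 24, 3, 11, 18]
```

With h hits the LIS step takes O(h log h), which beats the quadratic table whenever hits are sparse.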
Sparse Dynamic Programming for LIS
Example:
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
L = [L1] [L2] [L3] [L4] [L5] …
1. (4,2)
2. (3,3)
3. (3,3) (10,5)
4. (2,5) (10,5)
5. (2,5) (8,6)
6. (1,6) (8,6)
7. (1,6) (3,7)
8. (1,6) (3,7) (4,8)
9. (1,6) (3,7) (4,8) (7,9)
10. (1,6) (3,7) (4,8) (5,9)
11. (1,6) (3,7) (4,8) (5,9) (9,10)
Longest common subsequence: s = 4, 24, 3, 11, 18
[Figure: the same hit matrix of x (columns) against y (rows), with the matching points numbered in insertion order]
Sparse DP for rectangle chaining
• 1,…, N: rectangles
• (hj, lj): y-coordinates of rectangle j
• w(j): weight of rectangle j
• V(j): optimal score of chain ending in j
• L: list of triplets (lj, V(j), j)
L is sorted by lj: smallest (North) to largest (South) value
L is implemented as a balanced binary tree
Sparse DP for rectangle chaining
Main idea:
• Sweep through x-coordinates
• To the right of b, anything chainable to a is chainable to b
• Therefore, if V(b) > V(a), rectangle a is “useless” for subsequent chaining
• In L, keep rectangles j sorted by increasing lj-coordinate; they are then also sorted by increasing V(j) score
Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of rectangle i:
a. j: rectangle in L, with largest lj < hi
b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
a. k: rectangle in L, with largest lk ≤ li
b. If V(i) > V(k):
i. INSERT (li, V(i), i) in L
ii. REMOVE all (lj, V(j), j) with V(j) ≤ V(i) & lj ≥ li
Is k ever removed?
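The sweep can be sketched in Python as below. This is an illustration under stated assumptions, not the lecture's implementation: rectangles are tuples `(x_left, x_right, h, l, weight)` with h ≤ l, chaining requires strict inequalities in both coordinates, and L is kept as two parallel sorted Python lists searched with `bisect`. List insertion and deletion are linear, so a balanced binary search tree would be needed to actually reach O(N log N). The coordinates in the usage example are made up.

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """rects: list of (x_left, x_right, h, l, weight).
    Returns V, where V[i] is the best score of a chain ending in rectangle i."""
    events = []
    for i, (x_left, x_right, _h, _l, _weight) in enumerate(rects):
        events.append((x_left, 0, i))    # kind 0: left end, handled first on ties
        events.append((x_right, 1, i))   # kind 1: right end
    events.sort()

    ls, vs = [], []            # the list L: l-coordinates and scores, both increasing
    V = [0] * len(rects)
    for _x, kind, i in events:
        _, _, h, l, weight = rects[i]
        if kind == 0:
            j = bisect_left(ls, h) - 1            # largest l_j < h_i
            V[i] = weight + (vs[j] if j >= 0 else 0)
        else:
            k = bisect_right(ls, l) - 1           # largest l_k <= l_i
            if k >= 0 and vs[k] >= V[i]:
                continue                           # i is useless: k dominates it
            start = bisect_left(ls, l)            # entries from here on have l_j >= l_i
            end = start
            while end < len(ls) and vs[end] <= V[i]:
                end += 1                           # dominated entries are consecutive
            ls[start:end] = [l]
            vs[start:end] = [V[i]]
    return V


# Hypothetical rectangles a, b, c, d, e with weights 5, 6, 3, 4, 2
rects = [(0, 2, 2, 5, 5), (3, 8, 6, 9, 6), (3, 5, 10, 11, 3),
         (6, 10, 12, 15, 4), (9, 12, 14, 16, 2)]
print(chain_rectangles(rects))  # [5, 11, 8, 12, 13]
```

In this sketch rectangle k itself is removed only when lk = li and V(i) > V(k); otherwise k lies strictly north of i and stays in L.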
Example
[Figure: rectangles a to e in the x-y plane with weights a: 5, b: 6, c: 3, d: 4, e: 2; axis ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]
Scores computed by the sweep:
V(a) = 5, V(b) = 11, V(c) = 8, V(d) = 12, V(e) = 13
Triplets (li, V(i), i) inserted into L during the sweep:
(5, 5, a), (11, 8, c), (9, 11, b), (15, 12, d), (16, 13, e)
Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
• Searching L takes log N
• Inserting to L takes log N
• All deletions are consecutive, so log N per deletion
• Each element is deleted at most once: < N log N for all deletions
• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree
Whole-genome Alignment Pipelines
Given N species, phylogenetic tree:
1. Local Alignment between all pairs – BLAST
2. In the order of the tree:
   1. Synteny mapping: find long regions with lots of collinear alignments
   2. In each synteny region,
      1. Chaining
      2. Global alignment
Alternatively, all species are mapped to one reference (e.g., human)
Then, in each unbroken synteny region between multiple species, perform chaining & progressive multiple alignment
Examples
Human Genome Browser
ABC
Whole-genome alignment Rat—Mouse—Human
Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned
Gene Recognition
Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Serge Saxonov
The Central Dogma
DNA: CCTGAGCCAACTATTGATGAA
↓ transcription
RNA: CCUGAGCCAACUAUUGAUGAA
↓ translation
Protein: PEPTIDE
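The slide's example can be checked with a short sketch. It assumes the standard genetic code, written here as the usual compact 64-letter string over the base order T, C, A, G, with `*` marking stop codons. The helper names `transcribe` and `translate` are illustrative.

```python
BASES = "TCAG"
# Amino acids for codons TTT, TTC, TTA, TTG, TCT, ... GGG in T, C, A, G order
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate complete codons of the coding strand, reading frame 0."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[p:p + 3]] for p in range(0, len(dna) - 2, 3))


print(transcribe("CCTGAGCCAACTATTGATGAA"))  # CCUGAGCCAACUAUUGAUGAA
print(translate("CCTGAGCCAACTATTGATGAA"))   # PEPTIDE
```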
Gene structure
exon1 – intron1 – exon2 – intron2 – exon3
transcription
splicing
translation
exon = protein-coding
intron = non-coding
Codon: a triplet of nucleotides that is converted to one amino acid
Finding Genes in Yeast
Start codon: ATG
Stop codon: TAG / TGA / TAA
5’ – Intergenic – Coding – Intergenic – 3’
Mean coding length about 1500bp (500 codons)
Transcript
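For intron-free genes such as those in yeast, a simple ORF scan already finds candidates: look for an ATG followed, in the same frame, by the first stop codon. A minimal sketch, assuming only the forward strand is scanned and using a hypothetical `min_codons` length filter:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end) pairs, 0-based and half-open, of ATG...stop ORFs
    in the three forward reading frames. end includes the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None                     # look for the next ORF in this frame
    return sorted(orfs)


dna = "CCATGAAATTTTAGGG"
print(find_orfs(dna))             # [(2, 14)]
print(dna[2:14])                  # ATGAAATTTTAG
```

A real scan would also cover the reverse complement strand and would set `min_codons` high enough to suppress ORFs that arise by chance.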
Finding Genes in Yeast
[Figure: Yeast ORF distribution]
Introns: The Bane of ORF Scanning
Start codon: ATG
Stop codon: TAG / TGA / TAA
Splice sites
5’ – Intergenic – Exon – Intron – Exon – Intron – Exon – Intergenic – 3’
Transcript
Introns: The Bane of ORF Scanning
• Drosophila:
• 3.4 introns per gene on average
• mean intron length 475, mean exon length 397
• Human:
• 8.8 introns per gene on average
• mean intron length 4400, mean exon length 165
• ORF scanning is defeated
Where are the genes?
Needles in a Haystack
Signals for Gene Finding
• We need to use more information to help recognize genes
1. Regular gene structure
2. Exon/intron lengths
3. Nucleotide composition
4. Motifs at the boundaries of exons, introns, etc.: start codon, stop codon, splice sites
5. Patterns of conservation
Regular Gene Structure
• Start, Stop of translation region: protein-coding region starts with ATG and ends with TAA / TAG / TGA
• Exon – Intron – Exon – Intron … – Exon
• g[ GT/GC ]gag – Intron – cAGt
• Exon reading frame:
NNN – NNN – NNN – NNN – NN…
NN – NNN – NNN – NNN – NN…
N – NNN – NNN – NNN – NNN…
Next Exon: Frame 0
Next Exon: Frame 1
Exon/Intron Lengths
Nucleotide Composition
• Base composition in exons is characteristic due to the genetic code
Amino Acid      SLC   DNA Codons
Isoleucine      I     ATT, ATC, ATA
Leucine         L     CTT, CTC, CTA, CTG, TTA, TTG
Valine          V     GTT, GTC, GTA, GTG
Phenylalanine   F     TTT, TTC
Methionine      M     ATG
Cysteine        C     TGT, TGC
Alanine         A     GCT, GCC, GCA, GCG
Glycine         G     GGT, GGC, GGA, GGG
Proline         P     CCT, CCC, CCA, CCG
Threonine       T     ACT, ACC, ACA, ACG
Serine          S     TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y     TAT, TAC
Tryptophan      W     TGG
Glutamine       Q     CAA, CAG
Asparagine      N     AAT, AAC
Histidine       H     CAT, CAC
Glutamic acid   E     GAA, GAG
Aspartic acid   D     GAT, GAC
Lysine          K     AAA, AAG
Arginine        R     CGT, CGC, CGA, CGG, AGA, AGG
Biological Signals
• How does the cell recognize start/stop codons and splice sites? In part, from characteristic base composition
• Donor site (start of intron) is recognized by a section of U1 snRNA
U1 snRNA: GUCCAUUCA
Donor site consensus: MAGGTRAGT
M means “A or C”, R means “A or G”
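The relation between the two strings can be checked with a small sketch. It assumes the U1 segment is written so that it pairs with the donor site position by position, and it expands only the two ambiguity codes defined above. The helper names are illustrative.

```python
import re

# Ambiguity codes used on the slide: M = A or C, R = A or G
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "[AC]", "R": "[AG]"}


def consensus_to_regex(consensus):
    """Turn a consensus string such as MAGGTRAGT into a compiled regex."""
    return re.compile("".join(IUPAC[c] for c in consensus))


def rna_complement(rna):
    """Base-by-base Watson-Crick complement of an RNA string (no reversal)."""
    return rna.translate(str.maketrans("ACGU", "UGCA"))


DONOR = consensus_to_regex("MAGGTRAGT")

paired = rna_complement("GUCCAUUCA")
print(paired)                                            # CAGGUAAGU
print(bool(DONOR.fullmatch(paired.replace("U", "T"))))   # True
```

The sequence that pairs perfectly with the U1 segment is therefore one instance of the donor consensus, with M = C and R = A.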
[Figure: gene sequence with the signals highlighted: atg, tga, ggtgag, caggtg, cagatg, cagttg, caggccggtgag]
Splice Sites
Donor site (5’ to 3’), nucleotide frequency (%) by position:
Position   -8   …   -2   -1    0    1    2   …   17
A          26   …   60    9    0    0   54   …   21
C          26   …   15    5    0    1    2   …   27
G          25   …   12   78  100    0   41   …   27
T          23   …   13    8    0   99    3   …   25
Splice Sites
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
• WMM: weight matrix model = PSSM (Staden 1984)
• WAM: weight array model = 1st order Markov (Zhang & Marr 1993)
• MDD: maximal dependence decomposition (Burge & Karlin 1997)
  Decision-tree algorithm to take pairwise dependencies into account
Starting with a training set of known splice sites:
• For each position i, calculate $S_i = \sum_{j \neq i} \chi^2(C_i, X_j)$, where C_i indicates whether position i matches the consensus and X_j is the nucleotide at position j
• Choose i* such that Si* is maximal and partition into two subsets, until
  • No significant dependencies left, or
  • Not enough sequences in subset
Train separate WMM models for each subset
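A WMM for one subset can be sketched as a table of per-position nucleotide frequencies, scored as a log-odds ratio against a background. The training sites below are made up for illustration; the pseudocount and the uniform background of 0.25 are assumptions, not values from the lecture.

```python
import math


def train_wmm(sites, alphabet="ACGT", pseudocount=1.0):
    """Estimate per-position nucleotide probabilities from aligned sites."""
    model = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        model.append({b: counts[b] / total for b in alphabet})
    return model


def log_odds(model, seq, background=0.25):
    """Sum over positions of log2(P(base at position) / background)."""
    return sum(math.log2(col[base] / background) for col, base in zip(model, seq))


# Hypothetical aligned donor sites, all matching MAGGTRAGT except the last base of one
sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGT", "AAGGTAAGG"]
wmm = train_wmm(sites)
print(log_odds(wmm, "CAGGTAAGT") > log_odds(wmm, "CATTTAAGT"))  # True
```

Because each position is scored independently, a WMM cannot capture the pairwise dependencies that MDD handles by splitting the training set first.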
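[Figure: MDD decision tree. All donor splice sites split into G5 / not G5; G5 splits into G5G-1 / G5 not G-1; G5G-1 splits into G5G-1A2 / G5G-1 not A2; G5G-1A2 splits into G5G-1A2U6 / G5G-1A2 not U6]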
Splice Sites