CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Page 1: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

CS262 Lecture 15, Win07, Batzoglou

Multiple Sequence Alignments

Page 2: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Saving cells in DP

1. Find local alignments

2. Chain: O(N log N), via L.I.S. (longest increasing subsequence)

3. Restricted DP

Page 3: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Methods to CHAIN Local Alignments

Sparse Dynamic Programming: O(N log N)

Page 4: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

The Problem: Find a Chain of Local Alignments

(x, y) → (x’, y’) requires x < x’ and y < y’

Each local alignment has a weight

FIND the chain with highest total weight

Page 5: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Sparse Dynamic Programming

Back to the LCS problem:

• Given two sequences x = x1, …, xm and y = y1, …, yn

• Find the longest common subsequence

• Quadratic solution with DP

• How about when “hits” xi = yj are sparse?
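As a baseline for the sparse methods that follow, here is a minimal sketch of the quadratic DP for LCS. The function name `lcs_length` is a hypothetical choice, and the two sequences are the example sequences printed on the slides.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, using the classic O(mn) table."""
    m, n = len(x), len(y)
    # dp[i][j] = LCS length of the prefixes x[:i] and y[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                # A "hit": extend the best alignment of the shorter prefixes
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


x = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]
y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]
print(lcs_length(x, y))  # prints 5
```

Every one of the m·n cells is filled even though only the hit cells can increase the score, which is the waste the sparse approach avoids.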

Page 6: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Sparse Dynamic Programming

[Figure: match grid. Columns x = 15 3 24 16 20 4 24 3 11 18; rows y = 4 20 24 3 11 15 11 4 18 20]

• Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead

Page 7: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Sparse Dynamic Programming – L.I.S.

• Longest Increasing Subsequence

• Given a sequence over an ordered alphabet

x = x1, …, xm

• Find a subsequence

s = s1, …, sk

s1 < s2 < … < sk

Page 8: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Sparse Dynamic Programming – L.I.S.

Let input be w: w1, …, wn

INITIALIZATION:
  L: array of last LIS elements; L[0] = -inf, L[1] = w1, L[2…n] = +inf
  B: array holding the indices of LIS elements; B[0] = 0, B[1] = 1
  P: array of backpointers
  // L[j]: smallest jth element wi of any j-long increasing subsequence seen so far

ALGORITHM:
  for i = 2 to n {
    Find j such that L[j – 1] < w[i] ≤ L[j]
    L[j] ← w[i]
    B[j] ← i
    P[i] ← B[j – 1]
  }

That’s it!!!

• Running time?
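The L/B/P bookkeeping above can be sketched in Python as follows. This is an illustrative version, not the lecture's own code: it uses 0-based lists, grows the `tails` list on demand instead of pre-filling with +inf, and finds j with the standard-library `bisect_left`, so each step costs O(log n). The names `tails`, `tails_idx` and `prev` are assumptions standing in for L, B and P.

```python
from bisect import bisect_left


def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w in O(n log n)."""
    tails = []            # tails[j]: smallest last element of an increasing run of length j + 1 (L)
    tails_idx = []        # index in w of tails[j] (B)
    prev = [-1] * len(w)  # back-pointers (P)
    for i, value in enumerate(w):
        j = bisect_left(tails, value)  # first j with tails[j] >= value
        if j == len(tails):
            tails.append(value)
            tails_idx.append(i)
        else:
            tails[j] = value
            tails_idx[j] = i
        prev[i] = tails_idx[j - 1] if j > 0 else -1

    # Follow back-pointers from the end of the longest run
    result = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        result.append(w[i])
        i = prev[i]
    return result[::-1]


print(longest_increasing_subsequence([3, 1, 4, 1, 5, 9, 2, 6]))  # prints [1, 4, 5, 6]
```

Because `tails` stays sorted, the search for j is a binary search, giving O(n log n) overall.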

Page 9: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Sparse LCS expressed as LIS

Create a sequence w

• Every matching point (i, j), is inserted into w as follows:

• For each column j = 1…m, insert in w the points (i, j), in decreasing row i order

• The 11 example points are inserted in the order given

• a = (y, x), b = (y’, x’) can be chained iff a is before b in w, and y < y’

[Figure: match grid of x (columns) against y (rows); the 11 matching points are numbered 1–11 in insertion order]

Page 10: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Sparse LCS expressed as LIS

Create a sequence w

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

Consider now w’s elements as ordered lexicographically, where

• (y, x) < (y’, x’) if y < y’

Claim: An increasing subsequence of w is a common subsequence of x and y

Why don’t we insert elements (i, j) in w in increasing row i order?

Page 11: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Sparse Dynamic Programming for LIS

Example:

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

L = [L1] [L2] [L3] [L4] [L5] …

1. (4,2)
2. (3,3)
3. (3,3) (10,5)
4. (2,5) (10,5)
5. (2,5) (8,6)
6. (1,6) (8,6)
7. (1,6) (3,7)
8. (1,6) (3,7) (4,8)
9. (1,6) (3,7) (4,8) (7,9)
10. (1,6) (3,7) (4,8) (5,9)
11. (1,6) (3,7) (4,8) (5,9) (9,10)

Longest common subsequence: s = 4, 24, 3, 11, 18
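The reduction from sparse LCS to LIS can be sketched as below. This is a minimal illustration under stated assumptions: the function name `sparse_lcs` is hypothetical, indices are 0-based, and match points are enumerated directly from the two sequences (columns of x left to right, rows of y in decreasing order within a column) rather than taken from a precomputed hit list.

```python
from bisect import bisect_left
from collections import defaultdict


def sparse_lcs(x, y):
    """Longest common subsequence of x and y via LIS over the match points."""
    rows_of = defaultdict(list)  # symbol -> rows i of y where it occurs
    for i, symbol in enumerate(y):
        rows_of[symbol].append(i)

    # Build w column by column. Within a column the rows are inserted in
    # decreasing order, so two points of the same column can never both be
    # part of one increasing subsequence.
    w = [i for symbol in x for i in reversed(rows_of.get(symbol, ()))]

    tails, tails_pos = [], []  # smallest last row per chain length, and its position in w
    back = [-1] * len(w)       # back-pointers into w
    for pos, row in enumerate(w):
        j = bisect_left(tails, row)
        if j == len(tails):
            tails.append(row)
            tails_pos.append(pos)
        else:
            tails[j] = row
            tails_pos[j] = pos
        back[pos] = tails_pos[j - 1] if j > 0 else -1

    chain = []
    pos = tails_pos[-1] if tails_pos else -1
    while pos != -1:
        chain.append(y[w[pos]])
        pos = back[pos]
    return chain[::-1]


x = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]
y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]
print(sparse_lcs(x, y))  # prints [4, 24, 3, 11, 18]
```

The running time is O(n + m + h log h) for h hits, which is the gain over the quadratic table when hits are sparse.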

Page 12: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Sparse DP for rectangle chaining

• 1,…, N: rectangles

• (hj, lj): y-coordinates of rectangle j

• w(j): weight of rectangle j

• V(j): optimal score of chain ending in j

• L: list of triplets (lj, V(j), j)

L is sorted by lj: smallest (North) to largest (South) value

L is implemented as a balanced binary tree

[Figure: a rectangle on the y-axis, with top edge h and bottom edge l]

Page 13: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Sparse DP for rectangle chaining

Main idea:

• Sweep through x-coordinates

• To the right of b, anything chainable to a is chainable to b

• Therefore, if V(b) > V(a), rectangle a is “useless” for subsequent chaining

• In L, keep rectangles j sorted by increasing lj-coordinate; the list is then also sorted by increasing V(j) score

Page 14: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Sparse DP for rectangle chaining

Go through rectangle x-coordinates, from lowest to highest:

1. When on the leftmost end of rectangle i:

a. j: rectangle in L, with largest lj < hi

b. V(i) = w(i) + V(j)

2. When on the rightmost end of i:

a. k: rectangle in L, with largest lk ≤ li

b. If V(i) > V(k):

i. INSERT (li, V(i), i) in L

ii. REMOVE all (lj, V(j), j) with V(j) ≤ V(i) & lj ≥ li

Is k ever removed?
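The sweep can be sketched in Python as follows. This is an illustration under several assumptions, not the lecture's implementation: rectangles are given as hypothetical tuples `(x_left, x_right, h, l, weight)` with h < l (y grows southward) and positive weights; chaining requires strict inequalities in both x and y; a rectangle with no predecessor gets V(i) = w(i); and L is held in parallel sorted Python lists searched with `bisect`, which stands in for the balanced binary tree (list insertion is O(N), so this version does not reach the O(N log N) bound).

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """Return (best score, chain of rectangle indices) for the heaviest chain."""
    events = []
    for i, (x_left, x_right, _h, _l, _wt) in enumerate(rects):
        events.append((x_left, 0, i))   # 0 = leftmost end, handled first on ties
        events.append((x_right, 1, i))  # 1 = rightmost end
    events.sort()

    ls, vs, ids = [], [], []  # the list L as parallel arrays, sorted by l (and so by V)
    V = [0] * len(rects)
    pred = [-1] * len(rects)
    for _, kind, i in events:
        _, _, h, l, wt = rects[i]
        if kind == 0:
            p = bisect_left(ls, h) - 1   # j: largest l_j < h_i
            V[i] = wt + (vs[p] if p >= 0 else 0)
            pred[i] = ids[p] if p >= 0 else -1
        else:
            k = bisect_right(ls, l) - 1  # k: largest l_k <= l_i
            if k < 0 or V[i] > vs[k]:
                pos = bisect_left(ls, l)
                end = pos
                # Dominated entries (l_j >= l_i and V(j) <= V(i)) are consecutive
                while end < len(ls) and vs[end] <= V[i]:
                    end += 1
                ls[pos:end] = [l]
                vs[pos:end] = [V[i]]
                ids[pos:end] = [i]

    if not rects:
        return 0, []
    best = max(range(len(rects)), key=V.__getitem__)
    chain, j = [], best
    while j != -1:
        chain.append(j)
        j = pred[j]
    return V[best], chain[::-1]


# Hypothetical input: rectangle 2 overlaps rectangle 0 in y, so it starts its own chain.
rects = [(0, 2, 0, 2, 5), (3, 5, 3, 5, 6), (3, 6, 1, 4, 3), (7, 9, 6, 8, 4)]
print(chain_rectangles(rects))  # prints (15, [0, 1, 3])
```

In this sketch the entry k is overwritten only when lk = li, since the removal range starts at the first entry with l ≥ li.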

Page 15: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Example

[Figure: five rectangles in the x–y plane, with weights a: 5, b: 6, c: 3, d: 4, e: 2; y-axis ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]

Scores V:

  i      a   b   c   d   e
  V(i)   5   11  8   12  13

Triplets (li, V(i), i) entered into L:

  li    V(i)   i
  5     5      a
  11    8      c
  9     11     b
  15    12     d
  16    13     e

Page 16: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Time Analysis

1. Sorting the x-coords takes O(N log N)

2. Going through x-coords: N steps

3. Each of N steps requires O(log N) time:

• Searching L takes log N

• Inserting to L takes log N

• All deletions are consecutive, so log N per deletion

• Each element is deleted at most once: < N log N for all deletions

• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree

Page 17: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Whole-genome Alignment Pipelines

Given N species, phylogenetic tree:

1. Local Alignment between all pairs – BLAST

2. In the order of the tree:

   1. Synteny mapping: find long regions with lots of collinear alignments

   2. In each synteny region,

      1. Chaining

      2. Global alignment

Alternatively, all species are mapped to one reference (e.g., human)

Then, in each unbroken synteny region between multiple species, perform chaining & progressive multiple alignment

Page 18: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Examples

Human Genome Browser

ABC

Page 19: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Whole-genome alignment Rat—Mouse—Human

Page 20: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned

Page 21: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Gene Recognition

Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Serge Saxonov

Page 22: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

The Central Dogma

DNA → (transcription) → RNA → (translation) → Protein

DNA:     CCTGAGCCAACTATTGATGAA
RNA:     CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
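The slide's example can be reproduced with a short sketch. The function names `transcribe` and `translate` are hypothetical; the codon table is the standard genetic code built from the usual TCAG ordering, with `*` marking stop codons, and translation here reads the coding-strand DNA directly in frame 0.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for first, second, third base
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.replace("T", "U")


def translate(dna):
    """Translate coding-strand DNA codon by codon, stopping at a stop codon."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "CCTGAGCCAACTATTGATGAA"
print(transcribe(dna))  # prints CCUGAGCCAACUAUUGAUGAA
print(translate(dna))   # prints PEPTIDE
```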

Page 23: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Gene structure

exon1 – intron1 – exon2 – intron2 – exon3

transcription → splicing → translation

exon = protein-coding

intron = non-coding

Codon: A triplet of nucleotides that is converted to one amino acid

Page 24: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Finding Genes in Yeast

[Figure: 5’ → 3’ layout of a yeast gene: Intergenic | Coding | Intergenic, with the transcript marked]

Start codon: ATG

Stop codon: TAG/TGA/TAA

Mean coding length about 1500 bp (500 codons)
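With no introns, gene finding reduces to scanning for open reading frames between a start codon and an in-frame stop codon. The sketch below illustrates this under simplifying assumptions: the function name `find_orfs` and the `min_codons` threshold are hypothetical, only the forward strand is scanned, and each ORF begins at the first ATG seen after the previous stop in its frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=0):
    """Return (start, end) pairs, end exclusive and including the stop codon."""
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # open a candidate ORF at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                # Count codons before the stop; keep the ORF if it is long enough
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return orfs


dna = "CCATGAAATTTTAGGG"
print(find_orfs(dna))  # prints [(2, 14)]
```

On real yeast sequence one would set `min_codons` high enough to reject short chance ORFs; the right threshold is a modelling choice, not something fixed by the slides.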

Page 25: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Finding Genes in Yeast

Yeast ORF distribution

Page 26: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Introns: The Bane of ORF Scanning

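[Figure: 5’ → 3’ layout of a gene with introns: Intergenic | Exon | Intron | Exon | Intron | Exon | Intergenic, with splice sites at the exon–intron boundaries and the transcript marked]

Start codon: ATG

Stop codon: TAG/TGA/TAA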

Page 27: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Introns: The Bane of ORF Scanning

• Drosophila:

• 3.4 introns per gene on average

• mean intron length 475 bp, mean exon length 397 bp

• Human:

• 8.8 introns per gene on average

• mean intron length 4400 bp, mean exon length 165 bp

• ORF scanning is defeated

Page 28: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Where are the genes?

Page 29: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Page 30: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Needles in a Haystack

Page 31: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Signals for Gene Finding

• We need to use more information to help recognize genes

1. Regular gene structure

2. Exon/intron lengths

3. Nucleotide composition

4. Motifs at the boundaries of exons, introns, etc.: start codon, stop codon, splice sites

5. Patterns of conservation

Page 32: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Regular Gene Structure

• Start, Stop of translation region: the protein-coding region starts with ATG and ends with TAA / TAG / TGA

• Exon – Intron – Exon – Intron … – Exon

• g[ GT/GC ]gag – Intron – cAGt

• Exon reading frame:

  NNN – NNN – NNN – NNN – NN…
  NN – NNN – NNN – NNN – NN…
  N – NNN – NNN – NNN – NNN…

Page 33: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Next Exon: Frame 0

Next Exon: Frame 1

Page 34: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Exon/Intron Lengths

Page 35: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Nucleotide Composition

• Base composition in exons is characteristic due to the genetic code

Amino Acid      SLC   DNA Codons
Isoleucine      I     ATT, ATC, ATA
Leucine         L     CTT, CTC, CTA, CTG, TTA, TTG
Valine          V     GTT, GTC, GTA, GTG
Phenylalanine   F     TTT, TTC
Methionine      M     ATG
Cysteine        C     TGT, TGC
Alanine         A     GCT, GCC, GCA, GCG
Glycine         G     GGT, GGC, GGA, GGG
Proline         P     CCT, CCC, CCA, CCG
Threonine       T     ACT, ACC, ACA, ACG
Serine          S     TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y     TAT, TAC
Tryptophan      W     TGG
Glutamine       Q     CAA, CAG
Asparagine      N     AAT, AAC
Histidine       H     CAT, CAC
Glutamic acid   E     GAA, GAG
Aspartic acid   D     GAT, GAC
Lysine          K     AAA, AAG
Arginine        R     CGT, CGC, CGA, CGG, AGA, AGG

Page 36: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Biological Signals

• How does the cell recognize start/stop codons and splice sites? In part, from characteristic base composition

• Donor site (start of intron) is recognized by a section of U1 snRNA

U1 snRNA: GUCCAUUCA

Donor site consensus: MAGGTRAGT

M means “A or C”, R means “A or G”

Page 37: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

[Figure: gene sequence with signal motifs marked: atg, tga, ggtgag (×4), caggtg, cagatg, cagttg, caggcc]

Page 38: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Splice Sites

Donor site (5’ → 3’): base frequencies (%) by position

Position   -8   …   -2   -1     0    1    2   …   17
A          26   …   60    9     0    0   54   …   21
C          26   …   15    5     0    1    2   …   27
G          25   …   12   78   100    0   41   …   27
T          23   …   13    8     0   99    3   …   25
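A position frequency table like this one can be turned into a simple weight-matrix score. The sketch below uses only the five fully listed columns (positions -2 to 2) of the donor-site table; the function name `wmm_score`, the pseudocount of 1 (needed because some counts are 0) and the uniform 0.25 background are assumptions made for illustration.

```python
import math

# Percent frequencies for positions -2, -1, 0, 1, 2 of the donor-site table
FREQ = {
    "A": [60, 9, 0, 0, 54],
    "C": [15, 5, 0, 1, 2],
    "G": [12, 78, 100, 0, 41],
    "T": [13, 8, 0, 99, 3],
}


def wmm_score(site, pseudocount=1.0, background=0.25):
    """Log-odds score (in bits) of a 5-base candidate donor site."""
    score = 0.0
    for k, base in enumerate(site):
        column_total = sum(FREQ[b][k] for b in "ACGT") + 4 * pseudocount
        probability = (FREQ[base][k] + pseudocount) / column_total
        score += math.log2(probability / background)
    return score


for candidate in ("AGGTA", "AGGTG", "CCCCC"):
    print(candidate, round(wmm_score(candidate), 2))
```

Each position contributes independently to the score, which is exactly the independence assumption that the weight array and MDD models relax.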

Page 39: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Splice Sites

(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

Page 40: CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments

Splice Sites

• WMM: weight matrix model = PSSM (Staden 1984)

• WAM: weight array model = 1st order Markov (Zhang & Marr 1993)

• MDD: maximal dependence decomposition (Burge & Karlin 1997)

Decision-tree algorithm to take pairwise dependencies into account

Starting with a training set of known splice sites:

• For each position i, calculate $S_i = \sum_{j \neq i} \chi^2(C_i, X_j)$

• Choose i* such that Si* is maximal and partition into two subsets, until

  • No significant dependencies left, or

  • Not enough sequences in subset

Train separate WMM models for each subset

All donor splice sites
  G5
    G5G-1
      G5G-1A2
        G5G-1A2U6
        G5G-1A2 not U6
      G5G-1 not A2
    G5 not G-1
  not G5
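The MDD dependence score, the sum over positions j ≠ i of the chi-square statistic between C_i and X_j, can be sketched as below. This is a simplified illustration and not Burge & Karlin's implementation: C_i is taken to be "the site matches a single consensus base at position i", X_j is the base at position j, and the function names `chi2` and `mdd_scores` as well as the toy site list are hypothetical.

```python
from collections import Counter


def chi2(pairs):
    """Pearson chi-square statistic for a list of (row label, column label) observations."""
    n = len(pairs)
    row_totals = Counter(r for r, _ in pairs)
    col_totals = Counter(c for _, c in pairs)
    observed = Counter(pairs)
    statistic = 0.0
    for r in row_totals:
        for c in col_totals:
            expected = row_totals[r] * col_totals[c] / n
            statistic += (observed[(r, c)] - expected) ** 2 / expected
    return statistic


def mdd_scores(sites, consensus):
    """S_i for every position i of equal-length aligned sites."""
    length = len(consensus)
    scores = []
    for i in range(length):
        matches_i = [site[i] == consensus[i] for site in sites]  # C_i per site
        scores.append(sum(
            chi2([(m, site[j]) for m, site in zip(matches_i, sites)])
            for j in range(length) if j != i
        ))
    return scores


# Toy data: positions 0 and 1 always agree, position 2 varies independently
sites = ["AAG", "AAT", "CCG", "CCT"]
print(mdd_scores(sites, "AAG"))  # prints [4.0, 4.0, 0.0]
```

MDD would split the training set on the position with the largest score (here position 0 or 1) and then repeat the computation within each subset.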