GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
intergene exon intron exon intron exon intergene
Find Gene Structures in DNA
Intergene State
First Exon State
Intron State
Hidden Markov Model for Gene Finding
• Intron, Exon, Intergenic states
• Exon frame is encoded in the architecture by defining more states
• Exon states have explicit duration density
• Intron states have geometric duration
• Parameters are trained separately for different levels of GC content (GC content correlates with gene density and with the lengths of exons and introns)
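A state that loops back to itself with a fixed probability has a geometric duration, which is why the intron states need no explicit duration density. A minimal sketch of that relationship follows; the value of `p_stay` is an arbitrary illustration, not a trained parameter.

```python
def geometric_duration_pmf(p_stay, d):
    """P(state lasts exactly d steps) when it self-loops with probability p_stay."""
    if d < 1:
        return 0.0
    return (p_stay ** (d - 1)) * (1.0 - p_stay)


def expected_duration(p_stay):
    """Mean of the geometric duration distribution: 1 / (1 - p_stay)."""
    return 1.0 / (1.0 - p_stay)


p_stay = 0.999  # hypothetical self-transition probability for an intron state
mean_len = expected_duration(p_stay)
# The probabilities over d = 1, 2, ... sum to 1 (checked here up to a long cutoff).
total = sum(geometric_duration_pmf(p_stay, d) for d in range(1, 20001))
print(mean_len, total)
```

Exon lengths do not follow this shape well, which is the motivation for giving exon states an explicit duration density instead.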
Comparison-based Methods
Cross-species gene finding
5’ 3’
Exon1 Intron1 Exon2 Intron2 Exon3
[human] GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
[mouse] C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-
Comparison of 1196 orthologous genes (Makalowski et al., 1996)
• Sequence identity between genes in human/mouse
– exons: 84.6%
– protein: 85.4%
– introns: 35%
– 5’ UTRs: 67%
– 3’ UTRs: 69%
• 27 proteins were 100% identical.
Human Mouse
Human-mouse homology
Not always: HoxA human-mouse
Twinscan
• Twinscan is an augmented version of the Genscan HMM.
[Figure: exon (E) and intron (I) states, each annotated with transitions, duration, and emissions; example emitted sequence: ACUAUACAGACAUAUAUCAU]
Twinscan Algorithm
1. Align the two sequences (e.g. from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), match ( | )
New “alphabet”: 4 x 3 = 12 letters
= { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
Twinscan Algorithm
3. Run Viterbi using emissions ek(b) where b ∈ { A-, A:, A|, …, U| }
Note:
Emission distributions ek(b) estimated from real genes from human/mouse
eI(x|) < eE(x|): matches favored in exons
eI(x-) > eE(x-): gaps (and mismatches) favored in introns
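The two inequalities above can be turned into a simple log-odds comparison between the exon and intron states. The sketch below is only an illustration: the emission values are hypothetical numbers chosen to respect the inequalities, they depend on the conservation mark alone rather than on the base, and transitions and durations are ignored, so this is not the Viterbi computation itself.

```python
import math

# Hypothetical emission probabilities per conservation mark (NOT trained values).
# They only respect: e_E(match) > e_I(match) and e_E(gap) < e_I(gap).
EXON = {"|": 0.85, ":": 0.12, "-": 0.03}
INTRON = {"|": 0.50, ":": 0.30, "-": 0.20}


def exon_log_odds(marks):
    """Sum of log(e_E / e_I) over a string of marks; a positive total favors exon."""
    return sum(math.log(EXON[c] / INTRON[c]) for c in marks)


conserved = exon_log_odds("||:|:|||||||||:|")  # mostly matches
diverged = exon_log_odds("|-:--|::-|:--:|-")   # many gaps and mismatches
print(conserved, diverged)
```

A well-conserved stretch gets a positive score and a diverged one a negative score, which is the intuition behind adding the conservation marks to the alphabet.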
Example
Human: ACGGCGACGUGCACGU
Mouse: ACUGUGACGUGCACUU
Alignment: ||:|:|||||||||:|
Input to Twinscan HMM: A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|
Recall, eE(A|) > eI(A|)
eE(A-) < eI(A-)
Likely exon
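The conversion from an alignment to the 12-letter input can be sketched as follows. The function name `twinscan_symbols` is hypothetical, and skipping columns where the human row has a gap is an assumption of this sketch (such columns contain no human base to label).

```python
def twinscan_symbols(human, mouse):
    """Label each human base with '|' (match), ':' (mismatch) or '-' (gap).

    Both arguments are rows of one alignment: equal length, '-' marks a gap.
    Columns where the human row has a gap are skipped (assumption of this sketch).
    """
    if len(human) != len(mouse):
        raise ValueError("aligned rows must have equal length")
    symbols = []
    for h, m in zip(human, mouse):
        if h == "-":
            continue  # no human base in this column
        if m == "-":
            mark = "-"
        elif h == m:
            mark = "|"
        else:
            mark = ":"
        symbols.append(h + mark)
    return symbols


encoded = twinscan_symbols("ACGGCGACGUGCACGU", "ACUGUGACGUGCACUU")
print(" ".join(encoded))
# prints: A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|
```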
HMMs for simultaneous alignment and gene finding:
Generalized Pair HMMs
A Pair HMM for alignments
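[Figure: pair HMM with states BEGIN, M, I, J, and END. M emits an aligned pair with probability P(xi, yj), I emits xi with probability P(xi), and J emits yj with probability P(yj).]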
Generalized Pair HMMs
Exon GPHMM
1. Choose exon lengths (d, e).
2. Generate alignment of length d+e.
Cross-species gene finding
5’ 3’
Exon1 Intron1 Exon2 Intron2 Exon3
CNS CNS CNS
[human]
[mouse]
The SLAM hidden Markov model
Model   Time      Space
HMM     N²T       NT
PHMM    N²TU      NTU
GHMM    D²N²T     NT
GPHMM   D⁴N²TU    NTU

N = number of states
D = max duration
T = length of seq1
U = length of seq2
Computational complexity
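To get a feel for how quickly these costs grow, the leading terms can be evaluated for concrete sizes. This sketch reads the table entries as N²T, N²TU, D²N²T and D⁴N²TU for time; constants are ignored and the sizes passed in are arbitrary illustrations.

```python
def dp_costs(n_states, max_dur, len_t, len_u):
    """Leading-order (time, space) terms per model, constants ignored."""
    n, d, t, u = n_states, max_dur, len_t, len_u
    return {
        "HMM": (n**2 * t, n * t),
        "PHMM": (n**2 * t * u, n * t * u),
        "GHMM": (d**2 * n**2 * t, n * t),
        "GPHMM": (d**4 * n**2 * t * u, n * t * u),
    }


# Arbitrary illustrative sizes: 2 states, durations up to 3, sequences of 10 and 5.
costs = dp_costs(2, 3, 10, 5)
for model, (time_term, space_term) in costs.items():
    print(model, time_term, space_term)
```

Even for modest D, the D⁴ factor dominates, which is why the generalized pair HMM needs further restrictions to be practical.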
Approximate alignment
Reduces the TU factor to hT
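The saving can be illustrated with a banded dynamic program. This is a swapped-in simplification, not the SLAM procedure: a fixed band around the main diagonal stands in for the region around an approximate alignment, edit distance stands in for the pair-HMM score, and reading h as the band half-width is an assumption.

```python
def banded_edit_distance(x, y, h):
    """Edit distance restricted to cells (i, j) with |i - j| <= h.

    Returns (distance, cells_filled). The distance is None when the final
    cell lies outside the band, and it is exact only if some optimal
    alignment path stays inside the band.
    """
    t, u = len(x), len(y)
    if abs(t - u) > h:
        return None, 0
    inf = float("inf")
    prev = {j: j for j in range(0, min(u, h) + 1)}  # row i = 0
    cells = len(prev)
    for i in range(1, t + 1):
        cur = {}
        for j in range(max(0, i - h), min(u, i + h) + 1):
            if j == 0:
                best = i  # x[:i] aligned against the empty prefix of y
            else:
                sub = prev.get(j - 1, inf) + (x[i - 1] != y[j - 1])
                dele = prev.get(j, inf) + 1    # x[i-1] against a gap
                ins = cur.get(j - 1, inf) + 1  # y[j-1] against a gap
                best = min(sub, dele, ins)
            cur[j] = best
        cells += len(cur)
        prev = cur
    return prev[u], cells


dist, cells = banded_edit_distance("kitten", "sitting", 1)
full_cells = (len("kitten") + 1) * (len("sitting") + 1)
print(dist, cells, full_cells)
```

The number of filled cells grows like (2h+1)·T instead of T·U, at the price of missing alignments that leave the band.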
Measuring Performance
Example: HoxA2 and HoxA3
[Figure: tracks compared: SLAM, SGP-2, Twinscan, Genscan, TBLASTX, SLAM CNS, VISTA, RefSeq]
Suffix Trees
(a short break from biology)
Suffix Trees
• Suffix trees are a method to find all maximal matches between two strings (and much more)
Example: x = dabdac
[Figure: suffix tree for x, with leaves numbered 1 to 6]
Definition of a Suffix Tree
Definition:
For string x = x1…xm, a suffix tree is:
A rooted tree with m leaves
Leaf i: xi…xm
Each edge is labeled with a substring of x
No two edges out of a node start with the same letter
It follows that every substring of x corresponds to an initial part of a path from the root to a leaf
Naïve Algorithm to Construct a Suffix Tree
1. Initialize tree T: a single root node r
2. Insert special symbol $ at end of x
3. For j = 1 to m
• Find longest match of xj…xm to T, starting from r
• Split edge where match stops: new node w
• Create edge (w, j), and label with unmatched portion of xj…xm
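The steps above can be sketched in Python as follows. This is a minimal quadratic-time illustration, not an optimized implementation; the names `Node`, `build_suffix_tree`, `contains` and `leaves` are hypothetical. Edges store index pairs into the text rather than copied substrings.

```python
class Node:
    def __init__(self):
        self.children = {}  # first letter -> [start, end, child]; label is x[start:end]
        self.leaf = None    # 1-based start of the suffix for leaves, else None


def build_suffix_tree(text):
    """Naive construction: insert every suffix of text + '$' from the root."""
    x = text + "$"
    root = Node()
    for j in range(len(x)):
        node, i = root, j
        while True:
            edge = node.children.get(x[i])
            if edge is None:
                break  # match stops at a node: hang the new leaf here
            start, end, child = edge
            k = 0
            while start + k < end and x[start + k] == x[i + k]:
                k += 1
            if start + k == end:
                node, i = child, i + k  # whole edge matched, keep descending
                continue
            # Match stops inside the edge: split it at a new node w.
            w = Node()
            w.children[x[start + k]] = [start + k, end, child]
            node.children[x[i]] = [start, start + k, w]
            node, i = w, i + k
            break
        leaf = Node()
        leaf.leaf = j + 1
        node.children[x[i]] = [i, len(x), leaf]  # unmatched rest of the suffix
    return root, x


def contains(root, x, pattern):
    """True if pattern spells an initial part of some root-to-leaf path."""
    node, i = root, 0
    while i < len(pattern):
        edge = node.children.get(pattern[i])
        if edge is None:
            return False
        start, end, child = edge
        for k in range(start, end):
            if i == len(pattern):
                return True
            if x[k] != pattern[i]:
                return False
            i += 1
        node = child
    return True


def leaves(node):
    """Leaf numbers below node."""
    if node.leaf is not None:
        return [node.leaf]
    found = []
    for _, _, child in node.children.values():
        found.extend(leaves(child))
    return found


root, x = build_suffix_tree("dabda")
print(sorted(leaves(root)), contains(root, x, "bda"), contains(root, x, "dad"))
```

Because '$' occurs only at the end, no suffix is a prefix of another, so every suffix ends at its own leaf.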
Example of Suffix Tree Construction
x = d a b d a $
1. Insert d a b d a $
2. Insert a b d a $
3. Insert b d a $
4. Insert d a $
5. Insert a $
6. Insert $
Memory to Store Suffix Tree
• Can store in O( N ) memory!
• Every edge is labeled with (i, j):
(i,j) denotes xi…xj
• Tree has O( N ) nodes
Proof:
1. # internal nodes ≤ # leaves – 1 (every internal node has at least two children)
2. # leaves = |x|
Faster Construction
Several algorithms
O( N ) time,
O( N ) memory with a big constant ~15 bytes/char
Technical but not deep, outside the scope of this course
Optional: Gusfield, chapter 6
Application: find all matches between x, y
1. Build suffix tree for x, mark nodes with x
2. Insert y in suffix tree, mark all nodes y “passes through” with y
The path label of every node marked with both x and y is a common substring
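A simplified sketch of this idea: walk every suffix of y down a structure built from x and record how far it gets. For brevity this uses an uncompressed suffix trie stored as nested dicts, which takes quadratic space, rather than a linear-size suffix tree; the function names are hypothetical.

```python
def suffix_trie(x):
    """Uncompressed suffix trie as nested dicts: O(|x|^2) nodes, for clarity only."""
    root = {}
    for j in range(len(x)):
        node = root
        for ch in x[j:]:
            node = node.setdefault(ch, {})
    return root


def longest_matches(x, y):
    """For each start position in y, the longest prefix of y[start:] found in x."""
    root = suffix_trie(x)
    found = []
    for start in range(len(y)):
        node, length = root, 0
        # Every node visited here is one that y "passes through".
        while start + length < len(y) and y[start + length] in node:
            node = node[y[start + length]]
            length += 1
        found.append(y[start:start + length])
    return found


matches = longest_matches("dabda", "abada")
print(matches)  # ['ab', 'b', 'a', 'da', 'a']
```

For x = dabda and y = abada the longest shared substrings are "ab" and "da".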
Example of Suffix Tree construction
x = d a b d a $
y = a b a d a $
1. Construct tree for x
2. Insert a b a d a $
3. Insert b a d a $
4. Insert a d a $
5. Insert d a $
6. Insert a $
7. Insert $
Application: common substrings of k strings
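To find the longest common substring of s1, s2, …, sn:
1. Build suffix tree for s1, …, sn
2. All nodes labeled {si1, …, sik} represent a match between si1, …, sik
3. The deepest node (by path-label length) labeled with all n strings gives the longest common substring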
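The labeling idea can be sketched with a generalized suffix trie in which every node records which input strings pass through it. As a simplification this uses an uncompressed trie (quadratic space) instead of a suffix tree; the function name and the tie-breaking rule (alphabetically smallest among the longest) are assumptions of this sketch.

```python
def longest_common_substring(strings):
    """Longest substring shared by all strings, via a labeled suffix trie."""
    root = {"next": {}, "owners": set()}
    for idx, s in enumerate(strings):
        for j in range(len(s)):
            node = root
            for ch in s[j:]:
                node = node["next"].setdefault(ch, {"next": {}, "owners": set()})
                node["owners"].add(idx)  # label the node with string idx
    everyone = set(range(len(strings)))
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["next"].items():
            if child["owners"] == everyone:  # node is labeled with every string
                path = label + ch
                if len(path) > len(best) or (len(path) == len(best) and path < best):
                    best = path
                stack.append((child, path))
    return best


print(longest_common_substring(["ABRACADABRA", "CADABRAS", "XABRAY"]))  # ABRA
```

Only nodes labeled with every string are explored further, since any extension of a path that is missing from one string is missing from it as well.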
Suffix Arrays
ABRACADABRA$
11 $
10 A$
 7 ABRA$
 0 ABRACADABRA$
 3 ACADABRA$
 5 ADABRA$
 8 BRA$
 1 BRACADABRA$
 4 CADABRA$
 6 DABRA$
 9 RA$
 2 RACADABRA$
• Fast search for any query string by binary search: O(m log n) for a query of length m
• Used in data compression, as in bzip2
• Can be built in O(n) time by first building the suffix tree and then reading off the suffixes in order with a depth-first traversal that visits children alphabetically
– Too much memory: ~15n bytes
– Difficult to implement
• Theoretical build in O(n log n) time using O(n / sqrt(log n)) extra memory
• How to build them fast in practice is a hot topic
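A minimal sketch of a suffix array and its binary search follows. The construction simply sorts the suffixes, which is far from the linear-time methods mentioned above but easy to check by hand; the function names are hypothetical.

```python
def suffix_array(text):
    """Start positions of all suffixes in sorted order (simple, not linear time)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """All start positions of pattern, found with two binary searches over sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # leftmost suffix whose first m letters are >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # leftmost suffix whose first m letters are > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "ABRACADABRA$"
sa = suffix_array(text)
print(sa)                                   # [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
print(find_occurrences(text, sa, "ABRA"))   # [0, 7]
```

Each binary search makes O(log n) comparisons of at most m letters, and all suffixes starting with the query form one contiguous block of the array.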