1
Repeats!
2
Introduction
A repeat family is a collection of repeats which appear multiple times in a genome.
Our objective is to identify all families of interspersed repeats within a single genome
3
Challenges when identifying repeat families
Challenges:
Regions containing repeat occurrences are not known a priori
Repeat boundaries are not known a priori
Many repeat occurrences appear as partial copies
4
Why are repeats important
Repeats have been implicated in:
Genome rearrangements (Kazazian, 2004; Achaz et al., 2003)
Accelerated loss of gene order (Rocha et al., 2003)
Creation of novel biological functions (Lynch et al., 2002)
Increased rate of evolution under stress (Capy et al., 2000)
5
Identifying repeats de novo
Suppose we obtain a new genome and know nothing about it. We can:
Use a database of known repeats (RepeatMasker/RepBase)
  novel repeat elements may not be in the database
  repetitive gene families are never in the database
Identify repeats de novo using sequence analysis
6
Existing methods for detection of repeat families
Nearly all existing algorithms for de novo identification of repeat families rely on a set of pairwise similarities:
REPuter (Kurtz et al., 2000)
RepeatFinder (Volfovsky et al., 2001)
RECON (Bao and Eddy, 2002)
RepeatGluer (Pevzner et al., 2004)
PILER (Edgar and Myers, 2005)
RepeatScout (Price et al., 2005)
7
Mutational forces at play
Over time, indels & substitutions will affect copies of repeat families:
AGGCTACCCCTTTAGGCTAGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT
AGGCTGCCCCTTTAGGCTDGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT
AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCCTATT
AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCDTATT
AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCTATT
AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCTATT
Require alignments (& gaps) to attempt to reconstruct true repeat boundaries
8
de novo repeat detection
One approach: self-search with a pairwise local-alignment tool such as BLAST
The number of pairwise alignments grows as O(r^2) in the copy number r of the repeat
Inherent difficulty defining repeat boundaries among collections of pairwise alignments
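To make the quadratic growth concrete: r copies of a repeat can be paired in r(r-1)/2 ways, so a self-search may report up to that many pairwise alignments for one family. The small sketch below only illustrates this count; the function name is made up for the example.

```python
def pairwise_alignment_count(r):
    """Number of distinct pairs among r copies of a repeat: r choose 2."""
    return r * (r - 1) // 2


# Growth for a few copy numbers (illustrative values only).
for copies in (10, 100, 1000):
    print(copies, pairwise_alignment_count(copies))
```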
9
Alternative methods? Local multiple alignment
A single local multiple alignment uses O(N) space for a genome of length N
An example local multiple alignment:
1. AACAAGCA-A-ACTTTTATCCATGGTCGTGGTACAGAGGGGTC
2. AACAAGCA-A-ACTTTTGTCCATGGTCGTGGTACAGAGTGGTC
3. AACATGCAGA-ACTTTTATCCATGGTCGTCGTACAGAGGGGT-
4. AACAAGCAGACACTTTTATCCATGGTCGTGGTAC---------
5. AACAAGCA----CTTTTATCCATAGTCGTGGTA----------
6. ------------CTTTTATCCATGGTCGTGGTACAGAGGGGTC
10
Local multiple alignment
Local multiple alignment has the inherent potential to avoid pitfalls associated with pairwise alignment.
But multiple alignment under the sum-of-pairs (SP) objective function remains intractable…
Progressive alignment heuristics offer excellent speed and accuracy (e.g. MUSCLE).
So why not directly construct a multiple alignment?
11
12
Steps 1-3: Chaining seeds from the Input Sequence
The method incorporated three novel ideas:
(1) palindromic spaced seed patterns to match both DNA strands simultaneously
(2) seed extension (chaining) in order of decreasing multiplicity, and
(3) procrastination when low multiplicity matches are encountered.
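Ideas (1) and (2) can be illustrated with a minimal sketch. It is not the authors' implementation: the pattern `1101011`, the function names, and the choice of the lexicographic minimum as a strand-independent key are all assumptions made for this example. Because the pattern reads the same in both directions, the masked positions line up on the forward and reverse-complement strands, so one key serves both. Procrastination (idea 3) is not shown.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def seed_key(window, pattern):
    """Strand-independent key for one window under a palindromic spaced seed.

    pattern is a string of '1' (position must match) and '0' (ignored).
    """
    assert pattern == pattern[::-1], "pattern must be palindromic"
    assert len(window) == len(pattern)
    fwd = "".join(c for c, p in zip(window, pattern) if p == "1")
    rev = "".join(c for c, p in zip(revcomp(window), pattern) if p == "1")
    return min(fwd, rev)


def seed_matches(sequence, pattern):
    """Group window start positions by key, highest multiplicity first."""
    index = defaultdict(list)
    width = len(pattern)
    for start in range(len(sequence) - width + 1):
        index[seed_key(sequence[start:start + width], pattern)].append(start)
    groups = [positions for positions in index.values() if len(positions) > 1]
    return sorted(groups, key=len, reverse=True)
```

Sorting the groups by size gives the order of decreasing multiplicity in which the slide says seeds are chained.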
13
Step 4: Gapped Extension
After chaining a seed match, we must perform gapped extension to approximate the true repeat boundaries
This is an essential step to consider, assuming we would like to improve repeat boundary predictions
But how can this be done efficiently?
14
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
...
GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
Our approach to gapped extension
15
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
...
GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
HMM approach to gapped extension
Dynamically calculate the extension window length: l = 70 * e^(-0.01 * |Mi|)
Example: |Mi| = 200 gives l = 10
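As a quick check of the window formula on this slide, the sketch below evaluates 70 * e^(-0.01 * |Mi|). The slide's example of l = 10 for |Mi| = 200 is reproduced if the raw value (about 9.47) is rounded up; rounding up is an assumption of this example, as are the function and parameter names.

```python
import math


def extension_window(m_size, scale=70.0, decay=0.01):
    """Window length l for a match with |Mi| = m_size.

    Rounds up to a whole number of columns (an assumption of this sketch).
    """
    return math.ceil(scale * math.exp(-decay * m_size))


print(extension_window(200))  # the slide's example value of |Mi|
```

The exponential decay means larger values of |Mi| get shorter windows.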
16
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
...
GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
HMM approach to gapped extension
Use MUSCLE to perform alignment of extension window
17
ACAAGGGCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
TACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
TTCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACGGCCC-TCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
AGGCCGGCCC-TGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
AACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
ATTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGAAAGCCCC-TGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
ATTCCGCCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
...
ATTCGGCCCC-CGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
HMM approach to gapped extension
Use HMM to detect & unalign unrelated sequence
18
ACAAGGGCCC--TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
TACGAGCCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
TTCATCCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACGGCCC--TCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
AGGCCGGCCC--TGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
AACCCGCCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
ATTTTGCCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGAAAGCCCC--TGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
ATTCCGCCCC--TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
...
ATTCGGCCCC--CGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
HMM approach to gapped extension
Extension successful, continue extending
19
ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
...
ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
HMM approach to gapped extension
20
ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
...
ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
HMM approach to gapped extension
Use HMM to detect & unalign unrelated sequence
21
ACAAGACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
ACGAGACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
TCATCTCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACAGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
AGGCCAGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
ACCCGACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
TTTTGTTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
ATTCCATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
...
ATTCGATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
HMM approach to gapped extension
Finished leftward extension, now to the right…
22
ACAAGACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
ACGAGACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
TCATCTCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACAGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
AGGCCAGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
ACCCGACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
TTTTGTTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
ATTCCATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
...
ATTCGATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
HMM approach to gapped extension
23
TAGTTTAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GCAGCCA
TACGATACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA
TTCATTTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT
AGAAAAGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG
CCGATCCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA
AACCCAACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT
TTTTTTTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG
GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG
CCTATCCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT
...
-TTCG-TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC
HMM approach to gapped extension
Perform MUSCLE alignment on window
24
TAGTTTAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GCAGCCA
TACGATACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA
TTCATTTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT
AGAAAAGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG
CCGATCCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA
AACCCAACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT
TTTTTTTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG
GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG
CCTATCCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT
...
-TTCG-TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC
HMM approach to gapped extension
Use HMM to detect & unalign unrelated sequence
25
TAGTTTAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GA---GCAGCCA
TACGATACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTAATTTGACA
TTCATTTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGAAGAGCCCCCGT
AGAAAAGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGACTAGGATGG
CCGATCCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAATTAAAAAAATTA
AACCCAACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTAATTTGCTCTAT
TTTTTTTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCGGCCCTTATAGG
GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAAAAGAGCGCCCG
CCTATCCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGGACCGAATTAAT
...
-TTCG-TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCTTTCCCCCCGGC
HMM approach to gapped extension
Extension successful, continue extending
26
TAGTTTAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGAGCAGCCACCA
TACGATACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA
TTCATTTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT
AGAAAAGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG
CCGATCCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA
AACCCAACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT
TTTTTTTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG
GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG
CCTATCCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT
...
-TTCG-TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC
HMM approach to gapped extension
Use MUSCLE to perform alignment of extension window
27
TAGTTTAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC---GAGCAGCCAC-
TACGATACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGA----
TTCATTTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCC----AAGAGCCCCC
AGAAAAGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCG---AGACTAGGAT-
CCGATCCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAAT----
AACCCAACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCC---AATTTGCTCT-
TTTTTTTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCC----GGCCCTTATA
GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCC---AAAGAGCGCC-
CCTATCCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCC----GACCGAATTA
...
-TTCG-TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCG----TTTCCCCCCG
HMM approach to gapped extension
Use HMM to detect & unalign unrelated sequence
28
TAGTTTAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC---GAGCAGCCAC----GAGCAGCCAC-
TACGATACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGA----TTTAATTTGA----
TTCATTTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCC----AAGAGCCCCC----AAGAGCCCCC
AGAAAAGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCG---AGACTAGGAT----AGACTAGGAT-
CCGATCCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAAT----TTAAAAAAAT----
AACCCAACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCC---AATTTGCTCT----AATTTGCTCT-
TTTTTTTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCC----GGCCCTTATA----GGCCCTTATA
GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCC---AAAGAGCGCC----AAAGAGCGCC-
CCTATCCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCC----GACCGAATTA----GACCGAATTA
...
-TTCG-TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCG----TTTCCCCCCG----TTTCCCCCCG
HMM approach to gapped extension
Extension failed, stop extending
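The walkthrough on the preceding slides (pick a window, align it, let the HMM trim unrelated columns, then continue or stop) can be summarised as a loop. The sketch below is only a schematic under stated assumptions: `align_window` and `count_homologous` are hypothetical callables standing in for the MUSCLE alignment and the HMM step, and the stopping rule (stop as soon as the HMM keeps less than the full window) is an assumption, since the slides do not state the exact criterion.

```python
def extend_boundary(boundary, step, window, limit, align_window, count_homologous):
    """Extend a repeat boundary in one direction, one window at a time.

    boundary: current boundary coordinate
    step: -1 for leftward extension, +1 for rightward extension
    window: extension window length in columns
    limit: coordinate beyond which extension cannot go (e.g. sequence end)
    align_window(lo, hi): hypothetical wrapper that aligns the window
    count_homologous(alignment): hypothetical HMM step returning how many
        window columns, counted outward from the boundary, are homologous
    """
    while boundary != limit:
        size = min(window, abs(limit - boundary))
        alignment = align_window(boundary, boundary + step * size)
        kept = count_homologous(alignment)
        boundary += step * kept
        if kept < size:  # assumed rule: extension failed, stop extending
            break
    return boundary
```

Running the function once with `step=-1` and once with `step=+1` mirrors the leftward-then-rightward order shown on the slides.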
29
Wait a moment..
The MUSCLE alignment software reports the highest scoring global multiple alignment of the input sequences, regardless of common ancestry.
As a result, it is likely that this method forcibly aligns unrelated sequence.
We therefore use an HMM to detect alignments of unrelated sequence.
30
Step 5: detecting unrelated sequence
The HMM consists of two hidden states, Homologous and Unrelated.
The observable states are the pairwise alignment columns: all possible pairs over {A,G,C,T,-}, collapsed under strand and species symmetry (e.g. AG=GA=TC=CT).
The emission probabilities for each possible pair of aligned nucleotides were extracted from the HOXD substitution matrix presented by Chiaromonte et al.
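The symmetry rule for observed columns can be made concrete with a small sketch. Species symmetry is taken here to mean that the order of the two aligned characters does not matter, and strand symmetry that complementing both characters gives the same observation; this reading, the function name, and the use of the lexicographic minimum as the class representative are assumptions of the example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "-": "-"}


def canonical_column(x, y):
    """Representative of a pairwise alignment column under both symmetries."""
    variants = [
        x + y,                          # as observed
        y + x,                          # species symmetry: swap the rows
        COMPLEMENT[x] + COMPLEMENT[y],  # strand symmetry: complement both
        COMPLEMENT[y] + COMPLEMENT[x],  # both symmetries combined
    ]
    return min(variants)


# The slide's example: AG, GA, TC and CT collapse to one observation.
print({canonical_column(x, y) for x, y in ("AG", "GA", "TC", "CT")})
```

Under this reading the 16 nucleotide pairs collapse to 6 distinct observations.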
31
[Figure: HMM illustration with states U (Unrelated) and H (Homologous); the value 0.5 is shown in the figure]
Compute emission frequencies for the Unrelated state of our HMM using the background frequencies of G/C and A/T, assuming strand and species symmetry:
U_AA = U_AT = U_TA = U_TT = (f_AT/2) * (f_AT/2)
U_CC = U_CG = U_GC = U_GG = (f_GC/2) * (f_GC/2)
U_AC = U_AG = U_TC = U_TG = (f_AT/2) * (f_GC/2)
U_CA = U_CT = U_GA = U_GT = (f_GC/2) * (f_AT/2)
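A short sketch of the calculation above, assuming f_AT = 1 - f_GC and that each base takes half of its class frequency. The function name is illustrative, and 0.48 is simply the G+C content quoted elsewhere in these slides.

```python
from itertools import product


def unrelated_emissions(f_gc):
    """Emission frequency of every nucleotide pair in the Unrelated state."""
    f_at = 1.0 - f_gc
    per_base = {"A": f_at / 2, "T": f_at / 2, "G": f_gc / 2, "C": f_gc / 2}
    # Unrelated bases are treated as independent draws from the background.
    return {x + y: per_base[x] * per_base[y] for x, y in product("ACGT", repeat=2)}


emissions = unrelated_emissions(0.48)
print(sum(emissions.values()))  # the 16 frequencies sum to 1 up to rounding
```

The sum equals (f_AT + f_GC)^2, which is 1 by construction.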
32
To empirically estimate gap-open and gap-extend values for the Unrelated state, we aligned a 10-kb, 48% G+C content region taken from E. coli CFT073 (Accession AF447814.1, coordinates 37,300-38,300) with an unrelated sequence.
33
We performed the alignment with MUSCLE and counted the number of gap-open and gap-extend columns in the resulting alignment of unrelated sequences.
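Counting gap-open and gap-extend columns in a pairwise alignment can be sketched as follows. The definitions used here are assumptions: a gap-open column is the first column of a run of gaps in one row, and every further column of the same run is a gap-extend column. The function name is illustrative.

```python
def count_gap_columns(row_a, row_b):
    """Return (gap_open, gap_extend) column counts for two aligned rows."""
    assert len(row_a) == len(row_b), "rows must have equal aligned length"
    opens = extends = 0
    previous = None  # which row was gapped in the previous column, if any
    for a, b in zip(row_a, row_b):
        if a == "-" and b != "-":
            current = "a"
        elif b == "-" and a != "-":
            current = "b"
        else:
            current = None  # ungapped column (or a column gapped in both rows)
        if current is not None:
            if current == previous:
                extends += 1
            else:
                opens += 1
        previous = current
    return opens, extends


print(count_gap_columns("AC--GTA-C", "ACTTG-AAC"))
```

Dividing each count by the number of alignment columns would give the empirical frequencies.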
34
Gap-open and gap-extend frequencies for the Homologous state were estimated by constructing an alignment of 10 kb of orthologous sequence shared between a pair of divergent organisms.
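Once emission and gap frequencies exist for both states, each alignment column can be labelled Homologous or Unrelated. The slides do not say which decoding procedure is used; the sketch below uses standard Viterbi decoding as one common option. Every number in it is a made-up illustrative value, and the three-symbol alphabet (M = match, X = mismatch, G = gap) is a simplification of the real pair-column observations.

```python
import math


def viterbi(observations, states, start, trans, emit):
    """Most probable state path for a sequence of observed symbols."""
    score = {s: math.log(start[s]) + math.log(emit[s][observations[0]]) for s in states}
    backpointers = []
    for symbol in observations[1:]:
        previous, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: previous[p] + math.log(trans[p][s]))
            pointers[s] = best
            score[s] = previous[best] + math.log(trans[best][s]) + math.log(emit[s][symbol])
        backpointers.append(pointers)
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


# Illustrative parameters only; none of these values come from the slides.
STATES = "HU"
START = {"H": 0.5, "U": 0.5}
TRANS = {"H": {"H": 0.9, "U": 0.1}, "U": {"H": 0.1, "U": 0.9}}
EMIT = {"H": {"M": 0.80, "X": 0.15, "G": 0.05},
        "U": {"M": 0.25, "X": 0.50, "G": 0.25}}

print(viterbi("XGXXGXMMMMMMMM", STATES, START, TRANS, EMIT))
```

Columns labelled U would then be the ones to unalign.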
35
[Figure: HMM state path, a run of U (Unrelated) states followed by a run of H (Homologous) states; the value 0.5 is shown in the figure]