Repeats!

Page 1: Repeats!

1

Repeats!

Page 2: Repeats!

2

Introduction

A repeat family is a collection of repeats which appear multiple times in a genome.

Our objective is to identify all families of interspersed repeats within a single genome

Page 3: Repeats!

3

Challenges when identifying repeat families

Challenges:

  • Regions containing repeat occurrences are not known a priori

  • Repeat boundaries are not known a priori

  • Many repeat occurrences appear as partial copies

......

Page 4: Repeats!

4

Why are repeats important

Repeats have been implicated in:

Genome rearrangements (Kazazian, 2004; Achaz et al., 2003)

Accelerated loss of gene order (Rocha et al., 2003)

Creation of novel biological functions (Lynch et al., 2002)

Increased rate of evolution under stress (Capy et al., 2000)

Page 5: Repeats!

5

Identifying repeats de novo

Assume we get a new genome and know nothing about it. We can:

Use a database of known repeats (RepeatMasker/RepBase)

  • Novel repeat elements may not be in the database

  • Repetitive gene families are never in the database

Identify repeats de novo using sequence analysis

Page 6: Repeats!

6

Existing methods for detection of repeat families

Nearly all existing algorithms for de novo identification of repeat families rely on a set of pairwise similarities:

  • REPuter (Kurtz et al., 2000)

  • RepeatFinder (Volfovsky et al., 2001)

  • RECON (Bao and Eddy, 2002)

  • RepeatGluer (Pevzner et al., 2004)

  • PILER (Edgar and Myers, 2005)

  • RepeatScout (Price et al., 2005)

Page 7: Repeats!

Mutational forces at play

Over time, indels & substitutions will affect copies of repeat families:

AGGCTACCCCTTTAGGCTAGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT
AGGCTGCCCCTTTAGGCTDGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT
AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCCTATT
AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCDTATT
AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCTATT
AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCTATT

Require alignments (& gaps) to attempt to reconstruct true repeat boundaries

7

Page 8: Repeats!

8

de novo repeat detection

One approach: self-search with a pairwise local-alignment tool such as BLAST

Number of pairwise alignments grows as O(r^2) in the copy number r of the repeat

Inherent difficulty defining repeat boundaries among collections of pairwise alignments
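The quadratic growth can be made concrete with a small count. This is a minimal sketch that assumes every pair of copies of a repeat yields one pairwise alignment; the function name is hypothetical and not part of any tool named in these slides.

```python
def pairwise_alignment_count(r: int) -> int:
    """Number of unordered pairs among r copies of a repeat: r * (r - 1) / 2."""
    if r < 0:
        raise ValueError("copy number must be non-negative")
    return r * (r - 1) // 2


if __name__ == "__main__":
    # Each tenfold increase in copy number gives roughly a hundredfold
    # increase in the number of pairwise alignments to process.
    for copies in (10, 100, 1000):
        print(copies, pairwise_alignment_count(copies))
```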

Page 9: Repeats!

9

Alternative methods? Local multiple alignment

A single local multiple alignment uses O(N) space for a genome of length N

An example local multiple alignment:

1. AACAAGCA-A-ACTTTTATCCATGGTCGTGGTACAGAGGGGTC
2. AACAAGCA-A-ACTTTTGTCCATGGTCGTGGTACAGAGTGGTC
3. AACATGCAGA-ACTTTTATCCATGGTCGTCGTACAGAGGGGT-
4. AACAAGCAGACACTTTTATCCATGGTCGTGGTAC---------
5. AACAAGCA----CTTTTATCCATAGTCGTGGTA----------
6. ------------CTTTTATCCATGGTCGTGGTACAGAGGGGTC

Page 10: Repeats!

10

Local multiple alignment

Local multiple alignment has the inherent potential to avoid the pitfalls associated with pairwise alignment.

But multiple alignment under the sum-of-pairs (SP) objective function remains intractable…

Progressive alignment heuristics (e.g. MUSCLE) offer excellent speed and accuracy.

So why not directly construct a multiple alignment?
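For readers unfamiliar with the SP objective, the sketch below scores a given multiple alignment by summing column scores over every pair of rows. The match, mismatch, and gap values are hypothetical placeholders, not the parameters used by MUSCLE or by the method in these slides. Scoring a fixed alignment is cheap; the intractable part is searching for the alignment that maximizes this score.

```python
from itertools import combinations


def sp_score(rows, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a multiple alignment given as equal-length strings.

    Scoring values are illustrative assumptions only.
    """
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all alignment rows must have equal length")
    total = 0
    for a, b in combinations(rows, 2):
        for x, y in zip(a, b):
            if x == "-" and y == "-":
                continue  # gap-gap columns contribute nothing to this pair
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    print(sp_score(["AC-T", "ACGT", "A-GT"]))
```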

Page 11: Repeats!

11

Page 12: Repeats!

12

Steps 1-3: Chaining seeds from the Input Sequence

The method incorporated three novel ideas:

(1) palindromic spaced seed patterns to match both DNA strands simultaneously

(2) seed extension (chaining) in order of decreasing multiplicity, and

(3) procrastination when low multiplicity matches are encountered.
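Ideas (1) and (2) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the pattern `11*1*11`, the function names, and the use of a dictionary index are all assumptions, and procrastination (3) is not modeled. Because the pattern reads the same in both directions, the same window positions are sampled on the forward and the reverse-complement strand, so one pattern suffices for both.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def mask(window, pattern):
    """Keep bases at '1' positions of the pattern; wildcard the rest."""
    return "".join(c if p == "1" else "*" for c, p in zip(window, pattern))


def seed_matches(genome, pattern="11*1*11"):
    """Group window positions by strand-independent masked seed.

    Returns groups of (position, strand) with multiplicity > 1,
    ordered by decreasing multiplicity.
    """
    if pattern != pattern[::-1]:
        raise ValueError("pattern must be palindromic")
    k = len(pattern)
    groups = defaultdict(list)
    for i in range(len(genome) - k + 1):
        window = genome[i:i + k]
        fwd = mask(window, pattern)
        rev = mask(revcomp(window), pattern)
        key = min(fwd, rev)  # canonical key shared by both strands
        groups[key].append((i, "+" if key == fwd else "-"))
    matches = [g for g in groups.values() if len(g) > 1]
    matches.sort(key=len, reverse=True)  # highest multiplicity first
    return matches


if __name__ == "__main__":
    for group in seed_matches("ACAGGGTGT", pattern="1*1"):
        print(group)
```

In the toy genome above, the window `ACA` at position 0 and its reverse complement `TGT` at position 6 fall into the same group, which is the behavior a strand-aware seed index needs.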

Page 13: Repeats!

13

Step 4: Gapped Extension

After chaining a seed match, we must perform gapped extension to approximate the true repeat boundaries

This is an essential step to consider, assuming we would like to improve repeat boundary predictions

But how can this be done efficiently?

Page 14: Repeats!

14

TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA

GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA

CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC

AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG

CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA

TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC

ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC

GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG

CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA

...

GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

Our approach to gapped extension

Page 15: Repeats!

15

TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA

GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA

CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC

AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG

CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA

TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC

ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC

GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG

CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA

...

GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

HMM approach to gapped extension

Dynamically calculate the extension window $l = 70\,e^{-0.01\,|M_i|}$; for example, $|M_i| = 200$ gives $l \approx 9.5$, i.e. a window of about 10.
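The window formula can be evaluated directly. The sketch below assumes that |Mi| is a size measure of the current match Mi supplied by the caller (the slides do not define it further); the function name is hypothetical.

```python
import math


def extension_window(match_size: float) -> float:
    """Extension window l = 70 * exp(-0.01 * |Mi|) from the slide.

    `match_size` stands in for |Mi|; its exact definition is an assumption.
    """
    if match_size < 0:
        raise ValueError("|Mi| must be non-negative")
    return 70.0 * math.exp(-0.01 * match_size)


if __name__ == "__main__":
    # The window shrinks as |Mi| grows, which limits the work per extension.
    for size in (2, 50, 200):
        print(size, round(extension_window(size), 2))
```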

Page 16: Repeats!

16

TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA

GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA

CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC

AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG

CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA

TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC

ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC

GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG

CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA

...

GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

HMM approach to gapped extension

Use MUSCLE to perform alignment of extension window

Page 17: Repeats!

17

ACAAGGGCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA

TACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA

TTCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC

AGAACGGCCC-TCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG

AGGCCGGCCC-TGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA

AACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC

ATTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC

GGAAAGCCCC-TGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG

ATTCCGCCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA

...

ATTCGGCCCC-CGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

HMM approach to gapped extension

Use HMM to detect & unalign unrelated sequence

Page 18: Repeats!

18

ACAAGGGCCC--TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA

TACGAGCCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA

TTCATCCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC

AGAACGGCCC--TCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG

AGGCCGGCCC--TGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA

AACCCGCCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC

ATTTTGCCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC

GGAAAGCCCC--TGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG

ATTCCGCCCC--TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA

...

ATTCGGCCCC--CGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

HMM approach to gapped extension

Extension successful, continue extending

Page 19: Repeats!

19

ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA

ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA

TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC

AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG

AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA

ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC

TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC

GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG

ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA

...

ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

HMM approach to gapped extension

Page 20: Repeats!

20

ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA

ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA

TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC

AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG

AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA

ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC

TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC

GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG

ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA

...

ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

HMM approach to gapped extension

Use HMM to detect & unalign unrelated sequence

Page 21: Repeats!

21

ACAAGACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA

ACGAGACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA

TCATCTCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC

AGAACAGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG

AGGCCAGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA

ACCCGACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC

TTTTGTTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC

GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG

ATTCCATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA

...

ATTCGATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

HMM approach to gapped extension

Finished leftward extension, now to the right…

Page 22: Repeats!

22

ACAAGACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA

ACGAGACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA

TCATCTCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC

AGAACAGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG

AGGCCAGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA

ACCCGACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC

TTTTGTTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC

GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG

ATTCCATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA

...

ATTCGATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

HMM approach to gapped extension

Page 23: Repeats!

23

TAGTTTAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GCAGCCA

TACGATACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA

TTCATTTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT

AGAAAAGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG

CCGATCCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA

AACCCAACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT

TTTTTTTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG

GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG

CCTATCCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT

...

-TTCG-TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC

HMM approach to gapped extension

Perform MUSCLE alignment on window

Page 24: Repeats!

24

TAGTTTAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GCAGCCA

TACGATACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA

TTCATTTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT

AGAAAAGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG

CCGATCCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA

AACCCAACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT

TTTTTTTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG

GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG

CCTATCCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT

...

-TTCG-TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC

HMM approach to gapped extension

Use HMM to detect & unalign unrelated sequence

Page 25: Repeats!

25

TAGTTTAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GA---GCAGCCA

TACGATACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTAATTTGACA

TTCATTTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGAAGAGCCCCCGT

AGAAAAGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGACTAGGATGG

CCGATCCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAATTAAAAAAATTA

AACCCAACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTAATTTGCTCTAT

TTTTTTTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCGGCCCTTATAGG

GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAAAAGAGCGCCCG

CCTATCCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGGACCGAATTAAT

...

-TTCG-TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCTTTCCCCCCGGC

HMM approach to gapped extension

Extension successful, continue extending

Page 26: Repeats!

26

TAGTTTAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGAGCAGCCACCA

TACGATACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA

TTCATTTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT

AGAAAAGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG

CCGATCCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA

AACCCAACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT

TTTTTTTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG

GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG

CCTATCCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT

...

-TTCG-TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC

HMM approach to gapped extension

Use MUSCLE to perform alignment of extension window

Page 27: Repeats!

27

TAGTTTAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC---GAGCAGCCAC-

TACGATACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGA----

TTCATTTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCC----AAGAGCCCCC

AGAAAAGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCG---AGACTAGGAT-

CCGATCCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAAT----

AACCCAACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCC---AATTTGCTCT-

TTTTTTTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCC----GGCCCTTATA

GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCC---AAAGAGCGCC-

CCTATCCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCC----GACCGAATTA

...

-TTCG-TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCG----TTTCCCCCCG

HMM approach to gapped extension

Use HMM to detect & unalign unrelated sequence

Page 28: Repeats!

28

TAGTTTAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC---GAGCAGCCAC----GAGCAGCCAC-

TACGATACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGA----TTTAATTTGA----

TTCATTTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCC----AAGAGCCCCC----AAGAGCCCCC

AGAAAAGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCG---AGACTAGGAT----AGACTAGGAT-

CCGATCCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAAT----TTAAAAAAAT----

AACCCAACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCC---AATTTGCTCT----AATTTGCTCT-

TTTTTTTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCC----GGCCCTTATA----GGCCCTTATA

GGAAAGGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCC---AAAGAGCGCC----AAAGAGCGCC-

CCTATCCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCC----GACCGAATTA----GACCGAATTA

...

-TTCG-TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCG----TTTCCCCCCG----TTTCCCCCCG

HMM approach to gapped extension

Extension failed, stop extending

Page 29: Repeats!

29

Wait a moment...

The MUSCLE alignment software reports the highest-scoring global multiple alignment it can find for the input sequences, regardless of common ancestry.

As a result, this method is likely to forcibly align unrelated sequence.

We therefore use HMMs to detect alignments of unrelated sequence.

Page 30: Repeats!

30

Step 5: detecting unrelated sequence

The HMM consists of two hidden states, Homologous and Unrelated.

The observations are the pairwise alignment columns: all possible pairs over {A,G,C,T,-}, collapsed under strand and species symmetry (i.e. AG=GA=TC=CT).

The emission probabilities for each possible pair of aligned nucleotides were extracted from the HOXD substitution matrix presented by Chiaromonte et al.
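The strand and species symmetry on alignment columns can be sketched as a canonicalization step. This is an illustrative assumption about how the collapsing might be coded, not the authors' implementation: species symmetry swaps the two rows, strand symmetry complements both bases, and the smallest of the resulting variants is chosen as the class label.

```python
COMP = {"A": "T", "C": "G", "G": "C", "T": "A", "-": "-"}


def canonical_pair(x, y):
    """Map an aligned pair (x, y) to one label per symmetry class."""
    candidates = [
        (x, y),
        (y, x),                # species symmetry: swap the two sequences
        (COMP[x], COMP[y]),    # strand symmetry: complement both bases
        (COMP[y], COMP[x]),    # both symmetries combined
    ]
    return "".join(min(candidates))


if __name__ == "__main__":
    # AG, GA, TC and CT all collapse to the same observation symbol.
    print({canonical_pair(*p) for p in ("AG", "GA", "TC", "CT")})
```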

Page 31: Repeats!

31

[Figure: HMM state path, U = Unrelated, H = Homologous; value shown: 0.5]

Compute emission frequencies for the Unrelated state of our HMM using the background frequencies of G/C and A/T, assuming strand and species symmetry:

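\[
\begin{aligned}
U_{AA} = U_{AT} = U_{TA} = U_{TT} &= \frac{f_{AT}}{2} \cdot \frac{f_{AT}}{2} \\
U_{CC} = U_{CG} = U_{GC} = U_{GG} &= \frac{f_{GC}}{2} \cdot \frac{f_{GC}}{2} \\
U_{AC} = U_{AG} = U_{TC} = U_{TG} &= \frac{f_{AT}}{2} \cdot \frac{f_{GC}}{2} \\
U_{CA} = U_{CT} = U_{GA} = U_{GT} &= \frac{f_{GC}}{2} \cdot \frac{f_{AT}}{2}
\end{aligned}
\]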
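The Unrelated-state emission frequencies follow directly from the background G+C fraction. The sketch below is a minimal illustration that covers only the 16 nucleotide pairs; gap columns are handled separately in the slides and are left out here. The function name is hypothetical.

```python
def unrelated_emissions(gc_content: float) -> dict:
    """Emission frequency of each nucleotide pair under the Unrelated state.

    Each base is drawn independently: A and T each have frequency f_AT / 2,
    and G and C each have frequency f_GC / 2.
    """
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must lie in [0, 1]")
    f_gc = gc_content
    f_at = 1.0 - gc_content
    base_freq = {"A": f_at / 2, "T": f_at / 2, "G": f_gc / 2, "C": f_gc / 2}
    return {x + y: base_freq[x] * base_freq[y] for x in "ACGT" for y in "ACGT"}


if __name__ == "__main__":
    emissions = unrelated_emissions(0.48)
    print(round(emissions["AA"], 4), round(emissions["AC"], 4), round(emissions["GG"], 4))
    print(round(sum(emissions.values()), 6))  # the 16 frequencies sum to 1
```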

Page 32: Repeats!

32

[Figure: HMM state path, U = Unrelated, H = Homologous; value shown: 0.5]

To empirically estimate gap-open and extend values for the unrelated state, align a 10-kb, 48% G+C content region taken from E. coli CFT073 (Accession AF447814.1, coordinates 37,300-38,300) with an unrelated sequence.

Page 33: Repeats!

33

[Figure: HMM state path, U = Unrelated, H = Homologous; value shown: 0.5]

We ran MUSCLE on the unrelated sequences and counted the number of gap-open and gap-extend columns in the resulting alignment.
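Counting gap-open and gap-extend columns in a pairwise alignment might look like the sketch below. The convention used is an assumption: a gap column counts as an open when the previous column did not carry a gap in the same row, and as an extend otherwise; columns with gaps in both rows are ignored.

```python
def count_gap_columns(row_a: str, row_b: str):
    """Return (gap_open_columns, gap_extend_columns) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    opens = extends = 0
    prev_gap_row = None  # row that carried a gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" and y != "-":
            current = "a"
        elif y == "-" and x != "-":
            current = "b"
        else:
            current = None  # no gap, or a gap in both rows
        if current is not None:
            if current == prev_gap_row:
                extends += 1
            else:
                opens += 1
        prev_gap_row = current
    return opens, extends


if __name__ == "__main__":
    opens, extends = count_gap_columns("AC--GT-A", "ACTTG-CA")
    columns = len("AC--GT-A")
    # Dividing by the number of columns gives empirical frequencies.
    print(opens, extends, opens / columns, extends / columns)
```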

Page 34: Repeats!

34

[Figure: HMM state path, U = Unrelated, H = Homologous; value shown: 0.5]

Gap-open and gap-extend frequencies for the Homologous state were estimated by constructing an alignment of 10 kb of orthologous sequence shared by a pair of divergent organisms.

Page 35: Repeats!

35

[Figure: HMM state path, U = Unrelated, H = Homologous; value shown: 0.5]
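To show how a two-state HMM can label alignment columns as Homologous (H) or Unrelated (U), here is a minimal Viterbi sketch. Every number in it is a hypothetical toy value, not an estimate from the slides, and the observation alphabet is reduced to match (m), mismatch (x), and gap (g) rather than the full set of symmetric nucleotide pairs.

```python
import math

STATES = ("H", "U")
# All probabilities below are toy assumptions for illustration only.
START = {"H": 0.5, "U": 0.5}
TRANS = {"H": {"H": 0.95, "U": 0.05}, "U": {"H": 0.05, "U": 0.95}}
EMIT = {"H": {"m": 0.80, "x": 0.15, "g": 0.05},
        "U": {"m": 0.30, "x": 0.50, "g": 0.20}}


def column_symbols(row_a, row_b):
    """Reduce each pairwise alignment column to m, x, or g."""
    symbols = []
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            symbols.append("g")
        elif x == y:
            symbols.append("m")
        else:
            symbols.append("x")
    return symbols


def viterbi(symbols):
    """Most probable H/U state path for a symbol sequence (log space)."""
    if not symbols:
        return ""
    score = {s: math.log(START[s]) + math.log(EMIT[s][symbols[0]]) for s in STATES}
    back = []
    for sym in symbols[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][sym]))
            pointers[s] = best_prev
        back.append(pointers)
        score = new_score
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):  # trace back from the last column
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    row_a = "A" * 20 + "C" * 20
    row_b = "G" * 20 + "C" * 20
    print(viterbi(column_symbols(row_a, row_b)))
```

Columns labeled U by such a decoder are the ones a gapped-extension step would unalign; with real parameters the emissions would come from the substitution matrix and the empirical gap frequencies described above.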