Picking Alignments from (Steiner) Trees Fumei Lam Marina Alexandersson Lior Pachter

Page 1: Picking Alignments from (Steiner) Trees

Picking Alignments from (Steiner) Trees

Fumei Lam

Marina Alexandersson

Lior Pachter

Page 2: Picking Alignments from (Steiner) Trees

Alignment

Pair Hidden Markov Models

Steiner Networks

ATCG--G
A-CGTCA

M

X

Y

biologically meaningful

fast alignments based on HMM structure

Page 3: Picking Alignments from (Steiner) Trees

Some basic definitions:

Let G be a graph and S ⊆ V(G). A k-spanner for S is a subgraph G' ⊆ G such that for any u, v ∈ S the length of the shortest path between u and v in G' is at most k times the distance between u and v in G.

Let V(G) = R^2 and E(G) = horizontal and vertical line segments. A Manhattan network is a 1-spanner for a set S of points in R^2. Vertices in the Manhattan network that are not in S are called Steiner points.
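The definition can be checked directly on small instances. The following is a minimal sketch, assuming the network is given as a list of axis-parallel segments between explicit vertices with integer coordinates (a representation chosen here for illustration, not taken from the slides). The helper names `shortest_paths` and `is_manhattan_network` are hypothetical.

```python
import heapq
from itertools import combinations

def shortest_paths(adj, src):
    """Dijkstra from src over an adjacency dict {vertex: [(neighbor, length), ...]}."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def is_manhattan_network(points, segments):
    """Return True if every pair of points in S is joined by a path in the
    network whose length equals their L1 (Manhattan) distance.

    segments: list of ((x1, y1), (x2, y2)), each horizontal or vertical,
    with endpoints at network vertices (points of S or Steiner points).
    """
    adj = {}
    for a, b in segments:
        if a[0] != b[0] and a[1] != b[1]:
            raise ValueError("segments must be axis-parallel")
        length = abs(a[0] - b[0]) + abs(a[1] - b[1])
        adj.setdefault(a, []).append((b, length))
        adj.setdefault(b, []).append((a, length))
    dist_from = {p: shortest_paths(adj, p) for p in points}
    for p, q in combinations(points, 2):
        l1 = abs(p[0] - q[0]) + abs(p[1] - q[1])
        if dist_from[p].get(q, float("inf")) != l1:
            return False
    return True

# Two points joined through the Steiner point (1, 0).
ok = is_manhattan_network([(0, 0), (1, 1)],
                          [((0, 0), (1, 0)), ((1, 0), (1, 1))])
```

A U-shaped network on the corners of a 2-by-1 rectangle fails the check, because the two top corners are at L1 distance 2 but the only path between them has length 4.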

Page 4: Picking Alignments from (Steiner) Trees

Example:

S: red points

Manhattan network

Steiner point

Page 5: Picking Alignments from (Steiner) Trees

[Gudmundsson-Levcopoulos-Narasimhan 2001] Find the shortest Manhattan network connecting the points

4-approximation in O(n^3) and 8-approximation in O(n log n)

Page 6: Picking Alignments from (Steiner) Trees

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 1. it suffices to work on the Hanan grid
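The Hanan grid of S is the grid obtained by drawing a horizontal and a vertical line through every point of S. A minimal sketch of its construction follows; the function name `hanan_grid` and the vertex/edge-list representation are assumptions made for illustration.

```python
def hanan_grid(points):
    """Build the Hanan grid of a point set.

    Returns (vertices, edges): vertices are all intersections of the
    horizontal and vertical lines through the input points; edges join
    grid-adjacent vertices.
    """
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    vertices = [(x, y) for x in xs for y in ys]
    edges = []
    # Horizontal edges between consecutive x-coordinates on each row.
    for y in ys:
        for x1, x2 in zip(xs, xs[1:]):
            edges.append(((x1, y), (x2, y)))
    # Vertical edges between consecutive y-coordinates on each column.
    for x in xs:
        for y1, y2 in zip(ys, ys[1:]):
            edges.append(((x, y1), (x, y2)))
    return vertices, edges

vertices, edges = hanan_grid([(0, 0), (1, 2), (3, 1)])
```

For n points with distinct coordinates the grid has n^2 vertices, so restricting the search to it keeps the candidate Steiner points finite.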

Page 7: Picking Alignments from (Steiner) Trees

A(v) = {u : v is the topmost node below and to the left of u}

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 2. Construct local slides (for all four orientations)

v

slide

Page 8: Picking Alignments from (Steiner) Trees

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 3. Solve each slide

The minimum slide arborescence problem:

Lingas-Pinter-Rivest-Shamir 1982

O(n^3) optimal solution using dynamic programming

Page 9: Picking Alignments from (Steiner) Trees

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 4. Proof of correctness

u

v

b

a

Page 10: Picking Alignments from (Steiner) Trees

What is an alignment?

ATCG--GACATTACC-AC
AC-GTCA-GATTA-CAAC

Page 11: Picking Alignments from (Steiner) Trees

M

X

Y

M = (mis)match
X = insert seq1
Y = insert seq2

Pair HMMs

Simple sequence-alignment PHMM

Page 12: Picking Alignments from (Steiner) Trees

Hidden sequence:

M M X M Y Y M

AA

TC

C-

GG

-T

-C

GA

Observed sequence:

ATCGG
ACGTCA

Hidden alignment:

ATCG--G
AC-GTCA

Pair HMMs

transition probabilities

output probabilities

Page 13: Picking Alignments from (Steiner) Trees

Using the Pair HMM

In practice, we have the observed sequences

ATCGG
ACGTCA

for which we wish to infer the underlying hidden states.

One solution: among all possible sequences of hidden states, determine the most likely (Viterbi algorithm).

ATCG--G
AC-GTCA

MMXMYYM
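To make the link between a hidden state path and an alignment concrete, here is a minimal sketch that replays a state path over the two observed sequences. The function name `alignment_from_states` is hypothetical; the state meanings (M emits one character from each sequence, X from seq1 only, Y from seq2 only) follow the slides.

```python
def alignment_from_states(seq1, seq2, states):
    """Turn a hidden state path over {M, X, Y} into the two alignment rows."""
    row1, row2 = [], []
    i = j = 0  # next unread position in seq1 and seq2
    for s in states:
        if s == "M":      # (mis)match: consume one character from each sequence
            row1.append(seq1[i])
            row2.append(seq2[j])
            i += 1
            j += 1
        elif s == "X":    # insert in seq1: gap in seq2
            row1.append(seq1[i])
            row2.append("-")
            i += 1
        elif s == "Y":    # insert in seq2: gap in seq1
            row1.append("-")
            row2.append(seq2[j])
            j += 1
        else:
            raise ValueError(f"unknown state {s!r}")
    if i != len(seq1) or j != len(seq2):
        raise ValueError("state path does not consume both sequences")
    return "".join(row1), "".join(row2)

rows = alignment_from_states("ATCGG", "ACGTCA", "MMXMYYM")
# rows == ("ATCG--G", "AC-GTCA")
```

The Viterbi algorithm searches over all such state paths; this helper only decodes one given path.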

Page 14: Picking Alignments from (Steiner) Trees

Viterbi in PHMM ≡ Needleman-Wunsch

[Figure: PHMM state diagram with states M, X, Y; edge labels read "1-3"]

Match prob: p_m
Mismatch prob: p_r
Gap prob: p_g

Match score: log(p_m)
Mismatch score: log(p_r)
Gap score: log(p_g)
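The equivalence can be illustrated with a minimal Needleman-Wunsch sketch that uses the log-probabilities above as scores. The numeric defaults for `p_m`, `p_r`, and `p_g` are hypothetical values chosen only for illustration, and transition probabilities are assumed to be absorbed into these scores; the function name is likewise an assumption.

```python
import math

def needleman_wunsch(a, b, p_m=0.6, p_r=0.1, p_g=0.15):
    """Global alignment with log-probability scores.

    Returns (best_score, state_path), where state_path is a string over
    {M, X, Y}: M = (mis)match, X = insert in a, Y = insert in b.
    """
    match, mismatch, gap = math.log(p_m), math.log(p_r), math.log(p_g)
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: only gaps in b
        score[i][0] = i * gap
        back[i][0] = "X"
    for j in range(1, m + 1):          # first row: only gaps in a
        score[0][j] = j * gap
        back[0][j] = "Y"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + sub, "M"),
                (score[i - 1][j] + gap, "X"),
                (score[i][j - 1] + gap, "Y"),
            ]
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Trace back from the bottom-right corner to recover the state path.
    states = []
    i, j = n, m
    while i > 0 or j > 0:
        s = back[i][j]
        states.append(s)
        if s == "M":
            i, j = i - 1, j - 1
        elif s == "X":
            i -= 1
        else:
            j -= 1
    return score[n][m], "".join(reversed(states))

best_score, path = needleman_wunsch("ACGT", "ACGT")
# path == "MMMM"
```

Maximizing a sum of log-scores is the same as maximizing a product of probabilities, which is why the dynamic program plays the role of Viterbi here.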

Page 15: Picking Alignments from (Steiner) Trees

Want to take into account that the sequences are genomic sequences:

Example: a pair of syntenic genomic regions

Page 16: Picking Alignments from (Steiner) Trees

YX

PHMM

M

X

Y

Page 17: Picking Alignments from (Steiner) Trees

YX

PHMM

• A property of “single sequence” states is that all paths in the Viterbi graph between two vertices have the same weight

Page 18: Picking Alignments from (Steiner) Trees

Strategy for Alignment

GATTACATTGATCAGACAGGTGAAGA

GATCTTCATGTAG

Page 19: Picking Alignments from (Steiner) Trees

The CD4 region

human

mouse

50000

50000

0

0

Page 20: Picking Alignments from (Steiner) Trees

5’ 3’

Exon 1 Exon 2 Exon 3 Exon 4

Intron 1 Intron 2 Intron 3

Branchpoint

CTGAC

Splice site: CAG

Splice site: GGTGAG

Translation initiation: ATG

Stop codon: TAG/TGA/TAA

Page 21: Picking Alignments from (Steiner) Trees

Suggests a new Steiner problem

Find the shortest 1-spanner connecting reds to blues

Page 22: Picking Alignments from (Steiner) Trees

Generalizes the Manhattan network problem (all points red and blue)

Generalizes the Rectilinear Steiner Arborescence problem

Page 23: Picking Alignments from (Steiner) Trees

History of the Rectilinear Steiner Arborescence Problem

1985, Trubin - polynomial time algorithm

1992, Rao-Sadayappan-Hwang-Shor - error in Trubin

2000, Shi and Su - NP-complete!

Page 24: Picking Alignments from (Steiner) Trees

Results for unlabeled problem

• An O(n^3) 2-approximation algorithm (implemented)

• An O(n log n) 4-approximation algorithm

• Testing on CD4 region in human/mouse

• Implementation (SLIM) http://bio.math.berkeley.edu/slim/

• SLIM for SLAM (in progress) http://bio.math.berkeley.edu/slam/

Page 25: Picking Alignments from (Steiner) Trees

TAAT GTATTGAGGTATTGAG TGAA

CTG GTTGGTCCTCAG GTG TGTC

ATGTCCACGG

GA GT TACA TC

TTGTACACGGCA G

T GT ACGCT GG

ATGTAAC

ACATGTA

X

CNS

Y

M

D

I

Page 26: Picking Alignments from (Steiner) Trees

The Viterbi graph for a more complicated alignment

PHMM

Page 27: Picking Alignments from (Steiner) Trees

Comparison and Analysis of Performance

Our method has two main steps (L = length of seqs, n = #HSP):

1. Building the network: O(n^3) or O(n log n)

2. Running the Viterbi algorithm for the HMM on the network: O(nL) worst case

• Banding algorithms are O(L^2) worst case for step 2.

• Chaining algorithms are O(n^2) in the case where gap penalties can depend on the sequences.

• These strategies do not generalize well for more sophisticated HMMs.

Page 28: Picking Alignments from (Steiner) Trees

Summary

Thanks: Nick Bray and Simon Cawley

Software:

SLIM (network build): http://bio.math.berkeley.edu/slim/
SLAM (alignment): http://bio.math.berkeley.edu/slam/

ATCG--G
A-CGTCA

M

X

Y