Pairwise Sequence Alignment

Sushmita Roy
BMI/CS 576
www.biostat.wisc.edu/bmi576/
sroy@biostat.wisc.edu
Sep 10th, 2013


Key concepts in today’s class

• What is pairwise sequence alignment?
• How to handle gaps?
  – Gap penalties
• What are the issues in sequence alignment?
• What are the different types of alignment problems?
  – Global alignment
  – Local alignment
• Dynamic programming solution to finding the best alignment
• Given a scoring function, find the best alignment of two sequences
  – Global
  – Local


What is sequence alignment?

The task of locating equivalent regions of two or more sequences to maximize their overall similarity


Why sequence alignment?

• Identifying similarity between two sequences

• Sequence specifies the function of a protein

• Similarity in sequence can imply similarity in function.
  – Assign function to uncharacterized sequences based on characterized sequences

• Sequences from different species can be compared to estimate the evolutionary relationships between species
  – We will come back to this in Phylogenetic trees.


Alignment example

T H I S S E Q U E N C E

T H A T S E Q U E N C E


How to compare these two sequences?

T H I S S E Q U E N C E

T H A T I S A S E Q U E N C E


Need to incorporate gaps

Alignment 1: 3 gaps, 9 matches

_ _ _ T H I S S E Q U E N C E
T H A T I S A S E Q U E N C E

Alignment 2: 3 gaps, 10 matches

T H I S _ _ _ S E Q U E N C E
T H A T I S A S E Q U E N C E
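The comparison of the two gapped alignments can be checked by counting columns. The following is a minimal Python sketch; the function name and the use of `_` as the gap character are illustrative assumptions, not part of the lecture. Counted column by column, the first alignment has 9 matching columns and the second has 10, each with 3 gap columns.

```python
def count_matches_and_gaps(row1, row2, gap="_"):
    """Count matching columns and gap columns in a two-row alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    # A column is a match when both rows hold the same non-gap character.
    matches = sum(1 for a, b in zip(row1, row2) if a == b and a != gap)
    # A column is a gap column when either row holds the gap character.
    gaps = sum(1 for a, b in zip(row1, row2) if a == gap or b == gap)
    return matches, gaps


alignment_1 = ("___THISSEQUENCE", "THATISASEQUENCE")
alignment_2 = ("THIS___SEQUENCE", "THATISASEQUENCE")

print(count_matches_and_gaps(*alignment_1))  # (9, 3)
print(count_matches_and_gaps(*alignment_2))  # (10, 3)
```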


Issues in sequence alignment

• What type of alignment?
  – Align the entire sequence or part of it?

• How to score an alignment?
  – The sequences we’re comparing typically differ in length
  – Some amino acid pairs are more substitutable than others

• How to find the alignment?
  – Algorithms for alignment

• How to tell if the alignment is biologically meaningful?
  – Computing significance of an alignment


Pairwise alignment: task definition

Given
  – a pair of sequences (DNA or protein)
  – a scoring function

Do
  – determine the common substrings in the sequences such that the similarity score is maximized


Sequences can change through different mutations

• substitutions: ACGA → AGGA

• insertions: ACGA → ACGGA

• deletions: ACGA → AGA

Insertions and deletions give rise to gaps in an alignment.


Scoring function

• Scoring function comprises
  – Substitution matrix s(a,b): indicates the score of aligning a with b
  – Gap penalty function γ(g), where g is the gap length

• DNA has a simple scoring function
  – Consider match and mismatch

• Proteins have a more complex scoring function
  – Some amino acid pairs might be more substitutable than others
  – Scores might capture
    • similar physical and chemical properties
    • evolutionary relationships
    • More next class
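To make the two components of the scoring function concrete, here is a minimal Python sketch of a DNA-style score. The values (match +1, mismatch -1, cost 2 per gap column) and the function names are assumptions chosen for illustration; a protein scoring function would replace `s` with a lookup in a substitution matrix.

```python
def s(a, b, match=1, mismatch=-1):
    """Substitution score for aligning character a with character b (assumed values)."""
    return match if a == b else mismatch


def score_alignment(row1, row2, d=2, gap="_"):
    """Score a gapped two-row alignment: substitution scores minus d per gap column."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            total -= d          # each gap column costs d
        else:
            total += s(a, b)    # aligned pair scored by s(a, b)
    return total


print(score_alignment("AAAC", "AG_C"))  # 1 - 1 - 2 + 1 = -1
```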


Gap penalty function

• Different gap penalty functions require somewhat different scoring algorithms

• Linear score
  – γ(g) = -dg, where d is a fixed cost and g is the gap length
  – The longer the gap, the higher the penalty

• Affine score
  – γ(g) = -(d + (g-1)e), where d and e are fixed penalties and e < d
  – d: gap open penalty
  – e: gap extension penalty
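The two gap penalty functions can be compared side by side with a short Python sketch. The numeric values of d and e below are arbitrary illustrations (chosen so that e < d), and the function names are hypothetical.

```python
def linear_gap(g, d=2):
    """Linear gap score: gamma(g) = -d * g."""
    return -d * g


def affine_gap(g, d=4, e=1):
    """Affine gap score: gamma(g) = -(d + (g - 1) * e), for gap length g >= 1."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -(d + (g - 1) * e)


for g in range(1, 5):
    print(g, linear_gap(g), affine_gap(g))
# 1 -2 -4
# 2 -4 -5
# 3 -6 -6
# 4 -8 -7
```

With these example values, opening a gap is more expensive under the affine score, but each extra position costs less, so long gaps are penalized less than under the linear score.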


Types of pairwise sequence alignment problems

• Global alignment (Needleman-Wunsch algorithm)
  – Align the entire sequence

• Local alignment (Smith-Waterman algorithm)
  – Align part of the sequence


Global alignment

• Given two sequences u and v and a scoring function, find the alignment with the maximal score.

• The number of possible alignments is very large: it grows exponentially with sequence length.


Number of possible alignments

Sequence 1: CAATGA
Sequence 2: ATTGAT

Alignment 1:
CAATGA_
_ATTGAT

Alignment 2:
CAATGA
ATTGAT

Alignment 3:
CAATGA_
AT_TGAT


Number of possible alignments

• There are

\[
\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}
\]

possible global alignments for 2 sequences of length n

• E.g. two sequences of length 100 have \(\binom{200}{100} \approx 9 \times 10^{58}\) possible alignments

• But we can use dynamic programming to find an optimal alignment efficiently
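The size of the search space can be evaluated directly. The sketch below computes the binomial coefficient exactly with Python's `math.comb` (available from Python 3.8) and compares it with the Stirling-style approximation; the function names are illustrative. Evaluating the formula for n = 100 gives roughly 9 × 10^58.

```python
import math


def num_alignments_exact(n):
    """Exact value of the binomial coefficient C(2n, n)."""
    return math.comb(2 * n, n)


def num_alignments_approx(n):
    """Approximation 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


exact = num_alignments_exact(100)
approx = num_alignments_approx(100)
print(f"{exact:.3e}")   # about 9.05e+58
print(f"{approx:.3e}")  # about 9.07e+58
```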


Dynamic programming (DP)

• Divide and conquer

• Solve a problem using pre-computed solutions for subparts of the problem


Dynamic programming for sequence alignment

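• Two sequences: AAAC, AAG
• x: AAAC
  – xi denotes the ith letter in x
• y: AAG
  – yj denotes the jth letter in y
• Assume cost of gap is d.
• Assume we have aligned x1..xi to y1..yj
• There are three possibilities for how this alignment ends

  General form:
    x:  A A xi
    y:  A A yj

  (1) xi aligned with yj:
    A A A
    A A G

  (2) yj aligned with a gap:
    A A _
    A A G

  (3) xi aligned with a gap:
    A A A
    A A _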


Dynamic programming idea

• Let x’s length be n
• Let y’s length be m
• Construct an (n+1) × (m+1) matrix F
• F(j, i) = score of the best alignment of x1…xi with y1…yj

Matrix layout (x down the rows, y across the columns):

             y:   A    G    C

  x:   A
       A
       A          *
       C

  * = score of best alignment of AAA to AG


DP for global alignment with linear gap penalty

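DP recurrence relation:

\[
F(j,i) = \max
\begin{cases}
F(j-1,\,i-1) + s(x_i, y_j) \\
F(j-1,\,i) - d \\
F(j,\,i-1) - d
\end{cases}
\]

F(j,i) is the score of the best partial alignment between x1..xi and y1..yj.

[Figure: cell F(j,i) is computed from three neighbouring cells: F(j-1,i-1) with a step scoring s(xi,yj), and F(j-1,i) and F(j,i-1) with steps scoring -d.]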


DP algorithm sketch: global alignment

• Initialize first row and column of matrix

• Fill in rest of matrix from top to bottom, left to right

• For each F(j, i), save pointer(s) to cell(s) that resulted in best score

• F(m, n) holds the optimal alignment score

• Trace back from F(m, n) to F(0, 0) to recover alignment


Global alignment example

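• Suppose we choose the following scoring scheme
  – s(a,b): +1 for a match, -1 for a mismatch (the substitution matrix figure is missing from this text; these values are inferred from the filled matrix in this example)
  – d = 2, where d is the gap penalty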


Initializing matrix: global alignment with linear gap penalty

             A     G     C
       0    -d   -2d   -3d
  A   -d
  A  -2d
  A  -3d
  C  -4d


Global alignment example

            A    G    C
       0   -2   -4   -6
  A   -2    1   -1   -3
  A   -4   -1    0   -2
  A   -6   -3   -2   -1
  C   -8   -5   -4   -1


Trace back to retrieve the alignment

            A    G    C
       0   -2   -4   -6
  A   -2    1   -1   -3
  A   -4   -1    0   -2
  A   -6   -3   -2   -1
  C   -8   -5   -4   -1

One optimal alignment:

  x:  A A A C
  y:  A G _ C
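The initialization, fill, and traceback steps can be put together in a short Python sketch of global alignment with a linear gap penalty. Several things here are assumptions rather than the lecture's own code: the function name, the match/mismatch scores of +1/-1 (chosen because, with d = 2, they reproduce the matrix values of the example for x = AAAC and y = AGC), and the tie-breaking order in the traceback. Note that `F[i][j]` in the code uses rows for x and columns for y, which corresponds to F(j, i) in the slide notation.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment with linear gap penalty d; returns (F, aligned_x, aligned_y)."""
    n, m = len(x), len(y)

    def s(a, b):
        return match if a == b else mismatch

    # F[i][j]: best score for aligning x[:i] with y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d            # first column: x prefix aligned to gaps
    for j in range(1, m + 1):
        F[0][j] = -j * d            # first row: y prefix aligned to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # x_i aligned with y_j
                F[i - 1][j] - d,                          # x_i aligned with a gap
                F[i][j - 1] - d,                          # y_j aligned with a gap
            )

    # Traceback from the bottom-right cell to F[0][0].
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        elif i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, aligned_x, aligned_y = needleman_wunsch("AAAC", "AGC")
print(F[-1][-1])   # -1
print(aligned_x)   # AAAC
print(aligned_y)   # AG_C
```

Other optimal alignments with the same score can exist; which one is returned depends only on the order in which ties are checked during the traceback.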


Computational complexity

• initialization: O(m), O(n) where sequence lengths are m, n
• filling in rest of matrix: O(mn)
• traceback: O(m + n)
• Since the sequences have nearly the same length, the computational complexity is O(n^2)


Summary of DP

• Maximize F(j,i) using F(j-1,i-1), F(j-1,i) or F(j,i-1)
• Works for either DNA or protein sequences, although the substitution matrices used differ
• Finds an optimal alignment
• The exact algorithm (and computational complexity) depends on the gap penalty function


Local alignment

• Look for local regions of sequence similarity.
  – Aligning substrings of x and y.

• Try to find the best alignment over all possible substrings.

• This seems difficult computationally.
  – But can be solved by DP.


Local alignment motivation

• Comparing protein sequences that share a domain (independently folded unit) but differ elsewhere

• Comparing DNA sequences that share a similar sequence pattern but differ elsewhere

• More sensitive when comparing highly diverged sequences
  – Selection on the genome differs between regions
  – Unconstrained regions are not very alignable


Local alignment DP

Similar to global alignment with a few exceptions

• DP recurrence is different

• Top row and left column are initialized to 0s.

\[
F(j,i) = \max
\begin{cases}
F(j-1,\,i-1) + s(x_i, y_j) \\
F(j-1,\,i) - d \\
F(j,\,i-1) - d \\
0
\end{cases}
\]

New in local: the 0 term, which starts a new alignment


Local alignment DP algorithm

• Initialization: first row and first column initialized with 0’s

• Traceback:
  – Find maximum value of F(j, i)
    • can be anywhere in matrix
  – Stop when we get to a cell with value 0


Local alignment example

           A   A   G   A
       0   0   0   0   0
  T    0   0   0   0   0
  T    0   0   0   0   0
  A    0   1   1   0   1
  A    0   1   2   0   1
  G    0   0   0   3   1

Best local alignment (score 3):

  x:  A A G
  y:  A A G
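The local alignment procedure differs from the global one only in the zero initialization, the extra 0 option in the recurrence, and the traceback start and stop rules. The Python sketch below illustrates this under stated assumptions: the function name is hypothetical, the scores (+1 match, -1 mismatch, d = 2) are assumed to be the same as in the global example, and the sequences TTAAG and AAGA are read from the matrix labels of the example. With these assumptions the best local score is 3, aligning AAG with AAG.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=2):
    """Local alignment with linear gap penalty d; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)

    def s(a, b):
        return match if a == b else mismatch

    # First row and first column stay at 0.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,                                        # start a new alignment
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # x_i aligned with y_j
                F[i - 1][j] - d,                          # x_i aligned with a gap
                F[i][j - 1] - d,                          # y_j aligned with a gap
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback from the maximum cell until a cell with value 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("TTAAG", "AAGA"))  # (3, 'AAG', 'AAG')
```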