
Recap

3 different types of comparisons

1. Whole genome comparison

2. Gene search

3. Motif discovery (shared pattern discovery)

Agenda

More about Shared Pattern Discovery

Edit Distance
– Recap
– What you need to know for the next quiz

Alignment
– More details
– More examples

Shared Pattern Discovery

I have 10 rats that all have green eyes.

I have 10 rats that all have blue eyes.

What exactly do the 10 rats have in common that gives them green eyes?


Multiple alignment can be used to measure the strength of a genomic pattern found in a set of sequences

– First, completely align the 10 green-eyed rats
– Then, align green-eyed rats with blue-eyed rats
– Finally, compare the statistical difference

Initially, this is how genes were pin-pointed


[Figure: multiple alignment of 10 green-eyed rats, and alignment of a blue-eyed rat with a green-eyed rat. Similarity labels shown in the figure: 99.2%, 99.4%, 99.1%, 94.5%, 99.3%, 95.2%, 99.2%, 94.7%.]

Recap: Exact string matching

It’s important to know why exact matching doesn’t work.
– Target: CGTACGAC
– Pattern: CGTACGTACGTACGTTCA

Problem: The target can NOT be found in the pattern even though there is a near-match

Sequences either match or don’t match. There is no ‘in-between’.
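The all-or-nothing behaviour can be seen with a minimal sketch in C. It relies on the standard library's `strstr`; the wrapper name `exact_match` is only illustrative and does not come from the slides.

```c
#include <assert.h>
#include <string.h>

/* Returns the 0-based offset of the first exact occurrence of target
   inside pattern, or -1 if there is none. */
int exact_match(const char *pattern, const char *target)
{
    assert(pattern != NULL && target != NULL);
    const char *hit = strstr(pattern, target);
    return hit ? (int)(hit - pattern) : -1;
}
```

Calling `exact_match("CGTACGTACGTACGTTCA", "CGTACGAC")` reports no hit at all, even though the target is only one symbol away from a stretch of the pattern.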

Recap: Edit Dist. for Local Search

Question: How many edits are needed to exactly match the target with part of the pattern?
– Target: CGTACGAC
– Pattern: CGTACGTACGTACGTTCA

Answer: 1 deletion

Example of local search: gene finding

Recap: Edit Dist. for Global Comp.

Question: How many edits are needed to exactly match the ENTIRE target to the WHOLE pattern?
– Target: CGTACGAC
– Pattern: CGTACGTACGTACGTTCA

Answer: 10 deletions

Example of global comparison (whole genome comparison)
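Both recap answers can be checked with one small routine. This is a sketch under stated assumptions: only insertions and deletions are counted as edits (a substitution is not a single edit here), and the function name `edit_distance`, the `local` flag and the `MAXLEN` limit are illustrative choices, not part of the slides.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 64  /* illustrative size limit, not from the slides */

static int min2(int a, int b) { return a < b ? a : b; }

/* Fewest insertions/deletions needed to match target against the whole
   pattern (local == 0) or against its best-matching part (local != 0). */
int edit_distance(const char *target, const char *pattern, int local)
{
    int n = (int)strlen(target), m = (int)strlen(pattern);
    assert(n <= MAXLEN && m <= MAXLEN);
    static int d[MAXLEN + 1][MAXLEN + 1];

    for (int i = 0; i <= n; i++) d[i][0] = i;
    /* A local search may start anywhere in the pattern for free. */
    for (int j = 1; j <= m; j++) d[0][j] = local ? 0 : j;

    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            if (target[i - 1] == pattern[j - 1])
                d[i][j] = d[i - 1][j - 1];
            else
                d[i][j] = min2(d[i][j - 1] + 1, d[i - 1][j] + 1);

    if (!local) return d[n][m];

    /* A local search may also stop anywhere: best cell of the last row. */
    int best = d[n][0];
    for (int j = 1; j <= m; j++) best = min2(best, d[n][j]);
    return best;
}
```

With the target and pattern from the recap, the local call gives 1 and the global call gives 10. The only difference between the two modes is the first row and where the answer is read from.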

Quiz coming up!

You need to be able to compute optimal edit distance.

You need to fill in the table.

Edit Distance – Dynamic Programming

[Dynamic programming table from the slide. Sequence labels as extracted: "A C G T C G C A T" and "A C G T G T G C".]

0 1 2 3 4 5 6 7 8 9
1 0 1 2 3 4 5 6 7 8
2 1 2 1 2 3 4 5 6 7
3 2 3 2 1 2 3 4 5 6
4 3 2 3 2 1 2 3 4 5
5 4 3 2 3 2 1 2 3 4
6 5 4 3 4 3 2 1 2 3
7 6 5 4 5 4 3 2 3 4
8 7 6 5 6 5 5 3 2 3

Optimal edit distance for TG and TCG

Optimal edit distance for TG and TCGA

Optimal edit distance for TGA and TCG

Final Answer: optimal edit distance for TGA and TCGA
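As a worked instance of the four entries above, assume each insertion or deletion costs 1 and a matching pair of symbols costs nothing. The notation $D(\cdot,\cdot)$ is introduced here only for illustration. The final entry then follows from its three neighbours:

$$
D(\text{TGA},\text{TCGA}) =
\begin{cases}
D(\text{TG},\text{TCG}) & \text{if the last symbols match} \\
\min\bigl(D(\text{TG},\text{TCGA}),\; D(\text{TGA},\text{TCG})\bigr) + 1 & \text{otherwise}
\end{cases}
$$

Both strings end in A, so the first case applies: $D(\text{TGA},\text{TCGA}) = D(\text{TG},\text{TCG}) = 1$, the single edit being the inserted C.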

Edit Distance

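#define min(a, b) ((a) < (b) ? (a) : (b))

/* n and m are the lengths of seq1 and seq2 */
int edit_distance(const char *seq1, int n, const char *seq2, int m)
{
    int matrix[n+1][m+1];
    int x, y;

    /* matching a prefix against the empty string costs its length */
    for (x = 0; x <= n; x++) matrix[x][0] = x;
    for (y = 1; y <= m; y++) matrix[0][y] = y;

    for (x = 1; x <= n; x++)
        for (y = 1; y <= m; y++)
            /* the table is 1-indexed, the strings are 0-indexed */
            if (seq1[x-1] == seq2[y-1])
                matrix[x][y] = matrix[x-1][y-1];
            else
                /* keep the cheaper of the two edits */
                matrix[x][y] = min(matrix[x][y-1] + 1,
                                   matrix[x-1][y] + 1);

    return matrix[n][m];
}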


Why Edit Distance Stinks for Genetic Data

DNA evolves in strange ways

…TAGATCCCAGATCAGTATTCAAGTTATAC… (This is a gene in the rat genome)

…GATCTCCCAGATAGAAGCAGTATTCAGTCA… (This is the same gene in the fruit bat)

…CCTATCAGCAGGATCAAGTATGTCATACTAC… (This is a totally unrelated region of the AIDS virus)

The edit distance between rat and virus is smaller than the edit distance between rat and fruit bat.

Alignment

We need a more robust way to measure similarity

Alignment meets several requirements

1. It rewards matches

2. It penalizes mismatches

3. It allows different strategies for penalizing gaps

4. It helps visualize similarity.

Alignment

Two examples

Seq1 G C T A G T A T G C C G A T A C T G A

Seq2 G C T A G A T G C A G A T A C T T G A

Seq3 G C T A G T A T G C C G A T A C G A

Seq4 G A T A G A C G C A G A T G C T T G T

What’s more similar?
– Seq1 & Seq2, or
– Seq3 & Seq4

Alignment

Three steps in the dynamic programming algorithm for alignment

1. Initialization

2. Matrix fill (scoring)

3. Traceback (alignment)

Initialization

Matrix Fill

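For each position, $M_{i,j}$ is defined to be the maximum score at position $(i,j)$:

$$
M_{i,j} = \max \left[
\begin{array}{ll}
M_{i-1,j-1} + S_{i,j} & \text{(match/mismatch)} \\
M_{i,j-1} + w & \text{(gap in sequence \#1)} \\
M_{i-1,j} + w & \text{(gap in sequence \#2)}
\end{array}
\right]
$$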

Matrix Fill


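$$
M_{i,j} = \max \left[
\begin{array}{ll}
M_{i-1,j-1} + S_{i,j} & \text{(match/mismatch)} \\
M_{i,j-1} + w & \text{(gap in sequence \#1)} \\
M_{i-1,j} + w & \text{(gap in sequence \#2)}
\end{array}
\right]
$$

$S_{i,j} = 1$ if symbols match, otherwise $S_{i,j} = 0$

$w = 0$ (no gap penalty)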

Matrix Fill

The score at position 1,1 can be calculated.

The first residue in both sequences is a G

Thus, $S_{1,1} = 1$

Thus,

$$
M_{1,1} = \max[M_{0,0} + 1,\; M_{1,0} + 0,\; M_{0,1} + 0] = \max[1, 0, 0] = 1
$$
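The same calculation repeats for every cell. Below is a minimal sketch of the fill step in C for this slide's scheme (match 1, mismatch 0, no gap penalty). The names `fill_matrix`, `max3` and `MAXLEN` are illustrative and not from the slides. The sequences in the usage note are the ones shown on the final traceback slide, with its gap symbols removed.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 32  /* illustrative size limit, not from the slides */

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Fills M for the simple scheme on this slide: S = 1 for a match,
   S = 0 for a mismatch, w = 0. Returns the bottom-right score. */
int fill_matrix(const char *seq1, const char *seq2,
                int M[MAXLEN + 1][MAXLEN + 1])
{
    int n = (int)strlen(seq1), m = (int)strlen(seq2);
    const int w = 0;
    assert(n <= MAXLEN && m <= MAXLEN);

    /* Initialization: with w = 0, row 0 and column 0 are all zeros. */
    for (int i = 0; i <= n; i++) M[i][0] = 0;
    for (int j = 0; j <= m; j++) M[0][j] = 0;

    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int s = (seq1[i - 1] == seq2[j - 1]) ? 1 : 0;
            M[i][j] = max3(M[i - 1][j - 1] + s,  /* match/mismatch */
                           M[i][j - 1] + w,      /* gap in sequence #1 */
                           M[i - 1][j] + w);     /* gap in sequence #2 */
        }
    return M[n][m];
}
```

For `GAATTCAGTTA` and `GGATCGA` the first cell comes out as 1, as computed above, and the bottom-right score is 6.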


Tracing Back

(Seq #1) A

|

(Seq #2) A


Tracing back the alignment

(Seq #1) TA

|

(Seq #2) A

Tracing Back

(Seq #1) TTA

|

(Seq #2) A

Tracing Back

(Seq #1) GAATTCAGTTA
         | || |  |  |
(Seq #2) GGA_TC_G__A
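The three steps (initialization, matrix fill, traceback) can be sketched together in C. This assumes the simple scheme used in this example (match 1, mismatch 0, no gap penalty). The names `align`, `count_matches` and `MAXLEN` are illustrative. When several moves tie, this sketch prefers the diagonal, so the alignment it prints may differ from the one on the slide while having the same score.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 32  /* illustrative size limit, not from the slides */

/* Aligns seq1 and seq2 with S = 1 (match), S = 0 (mismatch), w = 0.
   Writes the gapped sequences to out1/out2 (each needs 2*MAXLEN + 1
   bytes, '_' marks a gap) and returns the number of aligned columns. */
int align(const char *seq1, const char *seq2, char *out1, char *out2)
{
    int n = (int)strlen(seq1), m = (int)strlen(seq2);
    static int M[MAXLEN + 1][MAXLEN + 1];
    assert(n <= MAXLEN && m <= MAXLEN);

    /* Steps 1 and 2: initialization and matrix fill. */
    for (int i = 0; i <= n; i++) M[i][0] = 0;
    for (int j = 0; j <= m; j++) M[0][j] = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int best = M[i - 1][j - 1] + (seq1[i - 1] == seq2[j - 1]);
            if (M[i][j - 1] > best) best = M[i][j - 1];
            if (M[i - 1][j] > best) best = M[i - 1][j];
            M[i][j] = best;
        }

    /* Step 3: traceback from the bottom-right corner, back to front. */
    char r1[2 * MAXLEN + 1], r2[2 * MAXLEN + 1];
    int i = n, j = m, len = 0;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            M[i][j] == M[i - 1][j - 1] + (seq1[i - 1] == seq2[j - 1])) {
            r1[len] = seq1[--i]; r2[len] = seq2[--j];  /* aligned pair */
        } else if (i > 0 && (j == 0 || M[i][j] == M[i - 1][j])) {
            r1[len] = seq1[--i]; r2[len] = '_';        /* gap in sequence #2 */
        } else {
            r1[len] = '_'; r2[len] = seq2[--j];        /* gap in sequence #1 */
        }
        len++;
    }

    /* Reverse into the output buffers. */
    for (int k = 0; k < len; k++) {
        out1[k] = r1[len - 1 - k];
        out2[k] = r2[len - 1 - k];
    }
    out1[len] = out2[len] = '\0';
    return len;
}

/* Counts the columns in which two gapped sequences carry the same symbol. */
int count_matches(const char *a, const char *b)
{
    int hits = 0;
    for (int k = 0; a[k] != '\0' && b[k] != '\0'; k++)
        if (a[k] == b[k] && a[k] != '_') hits++;
    return hits;
}
```

Applied to `GAATTCAGTTA` and `GGATCGA`, the traceback yields six matching columns, the same count as the six bars in the alignment above.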

Robust Scoring


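$$
M_{i,j} = \max \left[
\begin{array}{ll}
M_{i-1,j-1} + S_{i,j} & \text{(match/mismatch)} \\
M_{i,j-1} + w_1 & \text{(gap in sequence \#1)} \\
M_{i-1,j} + w_2 & \text{(gap in sequence \#2)}
\end{array}
\right]
$$

Si,j (symmetric)

       A    C    G    T
A     1.1  0.0  0.3  0.5
C          1.3  0.1  0.0
G               1.0  0.0
T                    1.2

w1 = -0.5

w2 = -0.7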

Alignment Scoring


Seq1   G    T    A    C    _    T    A    C    G    A    C

Seq2   G    A    A    C    G    T    A    _    G    A    C

score  1.0  0.5  1.1  1.3 -0.5  1.2  1.1 -0.7  1.0  1.1  1.3

Alignment score = 8.4

Alignment Scoring


Can you find a better alignment?

Alignment Scoring




Seq1   G    T    A    C    T    A    C    G    A    C

Seq2   G    A    A    C    G    T    A    G    A    C

score  1.0  0.5  1.1  1.3  0.0  0.5  0.0  1.0  1.1  1.3

Alignment score = 7.8
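Both totals can be reproduced by summing the per-column scores. The sketch below hard-codes the scoring matrix and the two gap penalties from the slides. The function names and the `_` gap symbol are illustrative. The gap positions in the first usage example are an assumption, inferred from where -0.5 and -0.7 appear in the slide's score row.

```c
#include <assert.h>
#include <string.h>

/* Symmetric substitution scores from the slide, indexed A, C, G, T. */
static const double S[4][4] = {
    /*        A    C    G    T  */
    /* A */ {1.1, 0.0, 0.3, 0.5},
    /* C */ {0.0, 1.3, 0.1, 0.0},
    /* G */ {0.3, 0.1, 1.0, 0.0},
    /* T */ {0.5, 0.0, 0.0, 1.2},
};
static const double W1 = -0.5;  /* gap in sequence #1 */
static const double W2 = -0.7;  /* gap in sequence #2 */
static const char BASES[] = "ACGT";

static int base_index(char c)
{
    const char *p = (c != '\0') ? strchr(BASES, c) : NULL;
    assert(p != NULL);  /* only A, C, G, T are scored */
    return (int)(p - BASES);
}

/* Sums the per-column scores of two gapped sequences of equal length;
   '_' marks a gap. */
double alignment_score(const char *a1, const char *a2)
{
    double total = 0.0;
    assert(strlen(a1) == strlen(a2));
    for (size_t k = 0; a1[k] != '\0'; k++) {
        if (a1[k] == '_')      total += W1;
        else if (a2[k] == '_') total += W2;
        else total += S[base_index(a1[k])][base_index(a2[k])];
    }
    return total;
}
```

`alignment_score("GTAC_TACGAC", "GAACGTA_GAC")` comes to 8.4 and the ungapped `alignment_score("GTACTACGAC", "GAACGTAGAC")` comes to 7.8, up to floating-point rounding.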

Alignment Scoring

Summary:

We have a way of rewarding different types of matches and mismatches

We have a separate way of penalizing gaps

We could choose not to penalize gaps
– if we knew that didn’t affect biological similarity

We could even reward some types of mismatches
– if we knew they were still biologically similar

Alignment scoring

Process

1. Experts (chemists or biologists) look at sequence segments that are known to be biologically similar and compare them to sequence segments that are biologically dissimilar.

2. Use direct observation and statistics to develop a scoring scheme

3. Given the scoring scheme, develop an algorithm to compute the maximum scoring alignment.

Alignment – Algorithmic Point of View

Align the symbols of two strings.
– Maximize the number of symbols that match.
– Minimize the number of symbols that do NOT match

Gaps can be inserted to improve alignments.

A scoring system is used to measure the quality of an alignment.

Scoring matrix

      A   C   G   T
A     5  -3  -4  -5
C    -3   4  -4  -4
G    -4  -4   4  -3
T    -5  -4  -3   5

Gap penalty: -8

In practice:
– Scoring matrices and gap penalties are based on biological knowledge and statistical analysis

Local Alignment and Global Alignment

In Global Alignment the two strings must be entirely aligned (every aligned pair of symbols is scored).

In Local Alignment segments from each string are aligned and the rest of the string can be ignored

Global alignment is used to compare the similarity of entire organisms

Local alignment is used to search for genes

A G A G T A C T C A G T A T C T G A T

A C A T A C T A C A G T A T C C A

A G A G T A C T C A G T A T C T G A T

A C A T A C T A C A G T A T C C A

Alignment Scoring Revisited

Given a scoring system, the alignment score is the sum of the scores for each aligned pair of symbols plus the gap penalties

Local Alignment

A G A G T A C T C A G T A T C T G A T

A C A T A C T A C T G T A T C C A

Scoring matrix

      A   C   G   T
A     3  -3  -4  -5
C    -3   2  -4  -4
G    -4  -4   2  -3
T    -5  -4  -3   3

Gap penalty: -6

Aligned segments:

T  A  C  T  _  C  A  G  T  A  T  C

T  A  C  T  A  C  T  G  T  A  T  C

3  3  2  3 -6  2 -5  2  3  3  3  2

Total Score = 15

Alignment - Computer Science Perspective

Given two input strings and a scoring system, find the highest scoring local alignment among all possible alignments.

Fact: The number of possible alignments grows exponentially with the length of the input strings

Solving this problem efficiently was an open problem until Smith and Waterman (1981) designed an efficient dynamic programming algorithm

The algorithm takes O(nm) time where n and m are the lengths of the two input strings

Interesting History

The Smith Waterman algorithm for computing local alignment is considered one of the most important algorithms in computational biology.

However, the algorithm is merely a generalization of the edit distance algorithm, which was already published and well-known in computer science.

Converting the edit distance algorithm to solve the alignment problem is “trivial.”

Smith and Waterman are considered almost legendary for this accomplishment.

It is a perfect example of “being in the right place at the right time.”

Smith Waterman Algorithm

Dynamic programming table (partially filled)

          A   C   G   C
      0   0   0   0   0
A     0   8   3   0   0
G     0   3   4
C     0
T     0

M[i][j] = MAX(0, M[i-1][j-1] + S(i,j), M[i-1][j] + w, M[i][j-1] + w)

S(i,j)

      A   C   G   T
A     8  -3  -4  -5
C    -3   7  -4  -4
G    -4  -4   7  -3
T    -5  -4  -3   8

w = -5


          A   C   G   C   T   A
      0   0   0   0   0   0   0
A     0
G     0
C     0
T     0
C     0
A     0

S(i,j)

      A   C   G   T
A     6  -3  -4  -5
C    -3   5  -4  -4
G    -4  -4   4  -3
T    -5  -4  -3   7

w = -5


Filling the table one cell at a time (columns A C G C T A):

A     0   6
A     0   6   1
A     0   6   1   0   0   0   6
G     0   1
G     0   1   2
G     0   1   2   5

          A   C   G   C   T   A
      0   0   0   0   0   0   0
A     0   6   1   0   0   0   6
G     0   1   2   5   0   0   1
C     0   0   6   1  10   5   0
T     0   0   1   3   5  17  12
C     0   0   5   0   8  12  14
A     0   6   1   1   3   7  18

Traceback from the highest score (18):

A  _  G  C  T  C  A

A  C  G  C  T  _  A
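The table fill for this example can be sketched in C as follows. The scoring matrix and gap penalty are the ones given for this example, and the recurrence is the one stated on the Smith Waterman slide. The names `smith_waterman`, `base_index` and `MAXLEN` are illustrative. The sketch covers only the fill step and leaves out the traceback.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 32  /* illustrative size limit, not from the slides */

/* Scoring matrix and gap penalty of the worked example, indexed A, C, G, T. */
static const int S[4][4] = {
    { 6, -3, -4, -5},
    {-3,  5, -4, -4},
    {-4, -4,  4, -3},
    {-5, -4, -3,  7},
};
static const int W = -5;
static const char BASES[] = "ACGT";

static int base_index(char c)
{
    const char *p = (c != '\0') ? strchr(BASES, c) : NULL;
    assert(p != NULL);  /* only A, C, G, T are scored */
    return (int)(p - BASES);
}

/* Fills the local-alignment table M (rows follow seq1, columns follow
   seq2) and returns the highest score found anywhere in it. */
int smith_waterman(const char *seq1, const char *seq2,
                   int M[MAXLEN + 1][MAXLEN + 1])
{
    int n = (int)strlen(seq1), m = (int)strlen(seq2), best = 0;
    assert(n <= MAXLEN && m <= MAXLEN);

    for (int i = 0; i <= n; i++) M[i][0] = 0;
    for (int j = 0; j <= m; j++) M[0][j] = 0;

    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int v = 0;  /* a local alignment may restart at any cell */
            int diag = M[i - 1][j - 1]
                     + S[base_index(seq1[i - 1])][base_index(seq2[j - 1])];
            int up = M[i - 1][j] + W;
            int left = M[i][j - 1] + W;
            if (diag > v) v = diag;
            if (up > v) v = up;
            if (left > v) v = left;
            M[i][j] = v;
            if (v > best) best = v;
        }
    return best;
}
```

With `AGCTCA` as the rows and `ACGCTA` as the columns, the highest cell is 18 in the bottom-right corner, which is where the traceback starts. The nested loops visit each of the n × m cells once, which is the source of the O(nm) running time.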
