44
Classical Statistics and Scoring of Alignments

Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Embed Size (px)

Citation preview

Page 1: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Classical Statistics and Scoring of AlignmentsClassical Statistics and Scoring of Alignments

Page 2: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Consider a probe of length l and a database of total length m.

How many subsequences of length n are there in l?

l-n+1

How many subsequences of length n are there in m?

m-n+1

SO THERE ARE EXACTLY

(l-m+1) (m-n+1) potential matches of length m between the probe and the database.

Page 3: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Probability: Coin TossProbability: Coin Toss

• There are only two outcomes possible for a coin toss: heads and tails.

• For a fair coin, the probability of getting heads on any one toss is 0.5; p = 0.5

• The probability of getting tails on any one toss is 0.5; q = (1-p) = 0.5

• There are only two outcomes possible for a coin toss: heads and tails.

• For a fair coin, the probability of getting heads on any one toss is 0.5; p = 0.5

• The probability of getting tails on any one toss is 0.5; q = (1-p) = 0.5

Page 4: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• If you do lots of tosses, you expect heads and tails approximately half the time; the probability for one toss has meaning over many tosses, too.

• But if you do a finite number of tosses, you have a small chance of getting exactly half heads and and half tails; most of the time you’ll get almost half and half

• If you do lots of tosses, you expect heads and tails approximately half the time; the probability for one toss has meaning over many tosses, too.

• But if you do a finite number of tosses, you have a small chance of getting exactly half heads and and half tails; most of the time you’ll get almost half and half

Page 5: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• If you do many sets of tosses, you can find out empirically what your chances are of getting say 4 heads out of 10 tosses; P(k=4 heads out of N=10 tosses)

• Or you can figure it out mathematically by thinking about how many ways you could toss a coin 10 times and get 4 heads: combinations formula

• If you do many sets of tosses, you can find out empirically what your chances are of getting say 4 heads out of 10 tosses; P(k=4 heads out of N=10 tosses)

• Or you can figure it out mathematically by thinking about how many ways you could toss a coin 10 times and get 4 heads: combinations formula

Page 6: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Ways to get headsWays to get heads

ways to get dif #’s heads1

1 2 1

1 3 3 1

1 4 6 4 1

1 5 10 10 5 1

1 6 15 20 15 6 1

ways to get dif #’s heads1

1 2 1

1 3 3 1

1 4 6 4 1

1 5 10 10 5 1

1 6 15 20 15 6 1

#trials12

3

4

5

6

Page 7: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Binomial theoremBinomial theorem

• The combinations in the triangle come from the binomial theorem: since we have only two outcomes, you can write the probability of all outcomes as (p + q)n, where n = the number of trials

(p+q)n = pn + n! p(n-1) q + n! p(n-2) q2…..+qn

i!j! i!j!• where i = the number of heads and j = the number

of tails• E.g., what’s the probability of 4 out of 6 heads?

n = 6, i = 4, j = 2: 6!/4!2!(0.5)4(0.5)2

• The combinations in the triangle come from the binomial theorem: since we have only two outcomes, you can write the probability of all outcomes as (p + q)n, where n = the number of trials

(p+q)n = pn + n! p(n-1) q + n! p(n-2) q2…..+qn

i!j! i!j!• where i = the number of heads and j = the number

of tails• E.g., what’s the probability of 4 out of 6 heads?

n = 6, i = 4, j = 2: 6!/4!2!(0.5)4(0.5)2

Page 8: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• You can make a plot of all the possible probability scores against the number of heads you get

• You can make a plot of all the possible probability scores against the number of heads you get

2 tosses

0

1

2

3

0 1 2

Number of heads

Wa

ys o

f o

bta

inin

g h

ea

ds

Series1

10 tosses

050

100150200250300

0 1 2 3 4 5 6 7 8 9 10

Number of heads

Wa

ys o

f o

bta

inin

g

he

ad

sSeries1

Page 9: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• Notice that as n increases, the binomial distribution approaches a bell-shaped curve: the gaussian distribution

• The gaussian distribution has the familiar mean = x and standard deviation = s for common statistical analyses: t-test, etc.

• Notice that as n increases, the binomial distribution approaches a bell-shaped curve: the gaussian distribution

• The gaussian distribution has the familiar mean = x and standard deviation = s for common statistical analyses: t-test, etc.

Page 10: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Gaussian DistributionGaussian Distribution

Page 11: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

What about statistics for alignments?What about statistics for alignments?• What we want is a way to find out the

significance of the score we assign to alignments – either the similarity search or the alignments that we will be talking about next week

• We need the probability of finding a particular score

• What we want is a way to find out the significance of the score we assign to alignments – either the similarity search or the alignments that we will be talking about next week

• We need the probability of finding a particular score

Page 12: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Probability of finding a score – simplified exampleProbability of finding a score – simplified example• Let’s consider sequences with an alphabet

of just 2 letters, A and B• Then p = probability of a match to A, and q

= probability of match to B (or no match to A); this is just like tossing a coin and p = q

• For random sequences, we expect a total score of ½ N (N = length of sequences)

• Let’s consider sequences with an alphabet of just 2 letters, A and B

• Then p = probability of a match to A, and q = probability of match to B (or no match to A); this is just like tossing a coin and p = q

• For random sequences, we expect a total score of ½ N (N = length of sequences)

Page 13: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

In-class exercise IIn-class exercise I

• Assume identity scoring, a gapless alignment and a sequence alphabet of A and B. What is the probability of two random sequences of length 4 being identical?

• What is the probability of 3 out of 4 characters identical?

• Assume identity scoring, a gapless alignment and a sequence alphabet of A and B. What is the probability of two random sequences of length 4 being identical?

• What is the probability of 3 out of 4 characters identical?

Page 14: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Probability of finding a score – almost realisticProbability of finding a score – almost realistic• So what if we let there be 4 characters (as

in the nucleic acid case)? We’ll keep identity scoring and no gaps

• Now p(match) = 0.25, and q(no match) = 0.75

• Since there’s still only 2 outcomes we can use the binomial theorem to calculate the distribution of scores just as before (this is like a “loaded” coin)

• So what if we let there be 4 characters (as in the nucleic acid case)? We’ll keep identity scoring and no gaps

• Now p(match) = 0.25, and q(no match) = 0.75

• Since there’s still only 2 outcomes we can use the binomial theorem to calculate the distribution of scores just as before (this is like a “loaded” coin)

Page 15: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

In-class exercise IIIn-class exercise II

• Assume an alphabet of 4 letters A,G,C,T; a gapless alignment, and identity scoring. What is the probability of two random sequences of length 4 being identical?

• What is the probability of 3 out of 4 characters identical?

• Assume an alphabet of 4 letters A,G,C,T; a gapless alignment, and identity scoring. What is the probability of two random sequences of length 4 being identical?

• What is the probability of 3 out of 4 characters identical?

Page 16: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Distribution for 4-character, identity scoringDistribution for 4-character, identity scoring• You can convince

yourself that the probability distributions for scores of alignments using a 4-character alphabet are not gaussian

• This means you can’t use t-tests, etc., to test significance of alignments

• You can convince yourself that the probability distributions for scores of alignments using a 4-character alphabet are not gaussian

• This means you can’t use t-tests, etc., to test significance of alignments

N = 2

00.10.20.30.40.50.6

0 1 2

number of identities

pro

ba

bili

ty o

f id

en

titi

es

Series1

N = 5

0

0.1

0.2

0.3

0.4

0.5

0 1 2 3 4 5

number of identities

pro

ba

bili

ty o

f id

en

titi

es

Series1

Page 17: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Alignment statisticsAlignment statistics

• No theory for global alignments

• Can use randomized sets

• Statistics for ungapped local alignments well-understood

• No theory for global alignments

• Can use randomized sets

• Statistics for ungapped local alignments well-understood

Page 18: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

FASTA statisticsFASTA statistics

• FASTA generates a random set of proteins sequences and reports the frequency of init1’s (modified hits) against the randomized set

• The tail of an actual search distribution usually contains more hits than are expected from the randomized set; some of those extra hits will represent homologous proteins

• FASTA generates a random set of proteins sequences and reports the frequency of init1’s (modified hits) against the randomized set

• The tail of an actual search distribution usually contains more hits than are expected from the randomized set; some of those extra hits will represent homologous proteins

Page 19: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• Some of the other hits in the areas of higher probability of random hits will also represent homologous proteins

• Some of the other hits in the areas of higher probability of random hits will also represent homologous proteins

Page 20: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Karlin-Altschul statisticsKarlin-Altschul statistics

• BLAST initially generates ungapped alignments• Problem: How many hits of score S are generated using a

probe of length m vs. a database of total sequence length n? • Answer: for high scores the number of hits is determined by

the extreme value distribution:• E=Kmn exp(-S)

• BLAST initially generates ungapped alignments• Problem: How many hits of score S are generated using a

probe of length m vs. a database of total sequence length n? • Answer: for high scores the number of hits is determined by

the extreme value distribution:• E=Kmn exp(-S)

Page 21: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• So as the probe length and the data base size go up, the expected number of hits go up linearly. As the cutoff score is raised, the number decreases exponentially

and are parameters which depend on factors such as the scoring matrix used

• So as the probe length and the data base size go up, the expected number of hits go up linearly. As the cutoff score is raised, the number decreases exponentially

and are parameters which depend on factors such as the scoring matrix used

Page 22: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Statistics of alignmentStatistics of alignment

• Consider a probe of length l and a database of total length n. How many subsequences of length m are there in l?

• l-m+1• How many subsequences of length m are

there in n?• n-m+1• SO THERE ARE EXACTLY (l-m+1) (n-m+1)

potential matches of length m between the probe and the database.

• Consider a probe of length l and a database of total length n. How many subsequences of length m are there in l?

• l-m+1• How many subsequences of length m are

there in n?• n-m+1• SO THERE ARE EXACTLY (l-m+1) (n-m+1)

potential matches of length m between the probe and the database.

Page 23: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• The number of matches expected is just the product of the number of potential matches and the probability of a pair matching:– Expectation value = (l-m+1) (n-m+1) pm

• where p is the probability of a match at each position; if each outcome (letter in the alphabet) is of equal probability, p is the inverse of the number of outcomes (letters).

• The number of matches expected is just the product of the number of potential matches and the probability of a pair matching:– Expectation value = (l-m+1) (n-m+1) pm

• where p is the probability of a match at each position; if each outcome (letter in the alphabet) is of equal probability, p is the inverse of the number of outcomes (letters).

Page 24: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• Using gapless alignments and identity scoring, the score of a particular comparison between a probe and a database is just the number of matches in the region considered.

• Suppose we extend any hits in both directions until we reach a mismatch; the score obtained will be just the number of consecutive hits m.

• We seek the number of random hits expected of score at least S=M. The expectation value is just

• (l-M+1) (n-M+1) pM

• for DNA (four bases) and large l and n, this is ~ n l 4-S

• Using gapless alignments and identity scoring, the score of a particular comparison between a probe and a database is just the number of matches in the region considered.

• Suppose we extend any hits in both directions until we reach a mismatch; the score obtained will be just the number of consecutive hits m.

• We seek the number of random hits expected of score at least S=M. The expectation value is just

• (l-M+1) (n-M+1) pM

• for DNA (four bases) and large l and n, this is ~ n l 4-S

Page 25: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Statistics of alignmentStatistics of alignment

• Now suppose we modify our extension rules so that to get score S we need to get only i out of m matches, with j misses allowed. The probability of i hits out of m is just

m! pi qj i!j!

• Now suppose we modify our extension rules so that to get score S we need to get only i out of m matches, with j misses allowed. The probability of i hits out of m is just

m! pi qj i!j!

Page 26: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• We can find the expectation value by multiplying by the number of potential m base comparisons. for DNA (four bases) and large l and n, this is

~ n l m! pi qj

i!j!

• We can find the expectation value by multiplying by the number of potential m base comparisons. for DNA (four bases) and large l and n, this is

~ n l m! pi qj

i!j!

Page 27: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• The expectation value for the number of hits of score S in a database of length n with a probe of length l is

~ n l m! pi qj i!j!

• if it takes i hits out of i+j elements to reach S.• Using identity scoring, S=i and for q>>p and small j • this is close to ~ n l m! pi i!j!

• The expectation value for the number of hits of score S in a database of length n with a probe of length l is

~ n l m! pi qj i!j!

• if it takes i hits out of i+j elements to reach S.• Using identity scoring, S=i and for q>>p and small j • this is close to ~ n l m! pi i!j!

Page 28: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• The expression ~ n l m! pi i!j! or ~ n l m! (1/p)-S

i!j!

• is a reasonable approximation for DNA sequences only for j=0 or 1, but for amino acid sequences q is large enough to tolerate several mismatches.

• The expression ~ n l m! pi i!j! or ~ n l m! (1/p)-S

i!j!

• is a reasonable approximation for DNA sequences only for j=0 or 1, but for amino acid sequences q is large enough to tolerate several mismatches.

Page 29: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• More generally,

~ n l m! (1/p)-S i!j!

• is equivalent to K n l (1/p )-S , where K contains both the binomial coefficients and the terms in q.

• For non-identity scoring, we have K n l exp (-S)• which is the general form of the extreme value

distribution; is a scale factor for the score system.

• More generally,

~ n l m! (1/p)-S i!j!

• is equivalent to K n l (1/p )-S , where K contains both the binomial coefficients and the terms in q.

• For non-identity scoring, we have K n l exp (-S)• which is the general form of the extreme value

distribution; is a scale factor for the score system.

Page 30: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Algorithms Algorithms

• An algorithm is an exact set of instructions for carrying out a task using a limited menu of commands

• Algorithms are most often associated with computer programs, but they can also be used by humans to solve problems

• Small number of operations defined in the programming language compared to, say, words or phrases in English

• An algorithm is an exact set of instructions for carrying out a task using a limited menu of commands

• Algorithms are most often associated with computer programs, but they can also be used by humans to solve problems

• Small number of operations defined in the programming language compared to, say, words or phrases in English

Page 31: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Algorithms in minimal CAlgorithms in minimal C• Instructions:

• main () means program is beginning

• Statements have operators:

• =, +, -, *, ++ (increment), % (modulus), () (separating terms)

• Statements are not algebra!

»x = x + 1

• Instructions:

• main () means program is beginning

• Statements have operators:

• =, +, -, *, ++ (increment), % (modulus), () (separating terms)

• Statements are not algebra!

»x = x + 1

Page 32: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• Statements end in ;• Statements are about variables

• Variables start with characters (i, x, b5)• Variables need to be declared so the program can

use them• float x (float means any real number); int I (int

means an integer); char b (char means character) ;• Arrays are a list of data that can be associated with

a variable; arrays run from 0 to n-1 where n is the number in brackets

• float x[]; int a[50]; char b[500]; float x [5][4];

• Statements end in ;• Statements are about variables

• Variables start with characters (i, x, b5)• Variables need to be declared so the program can

use them• float x (float means any real number); int I (int

means an integer); char b (char means character) ;• Arrays are a list of data that can be associated with

a variable; arrays run from 0 to n-1 where n is the number in brackets

• float x[]; int a[50]; char b[500]; float x [5][4];

Page 33: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• Variables need to be initialized (given a starting value) if they are going to be incremented

• Loops and conditions– For (conditions) { statements; } means do statement

operations while conditions are met– If (conditions) { statements; } else {statements; } means do

first statements while conditions are met, and when conditions are not met do second statements – this is a one-time thing

– While (conditions) { statements; } means keep doing statements while conditions are met, and as soon as they are not met, stop – this is continuous while conditions are met

– {continue; } go to next iteration of current loop skipping intervening statements (curly brackets)

– { break; } means get out of the current loop

• Variables need to be initialized (given a starting value) if they are going to be incremented

• Loops and conditions– For (conditions) { statements; } means do statement

operations while conditions are met– If (conditions) { statements; } else {statements; } means do

first statements while conditions are met, and when conditions are not met do second statements – this is a one-time thing

– While (conditions) { statements; } means keep doing statements while conditions are met, and as soon as they are not met, stop – this is continuous while conditions are met

– {continue; } go to next iteration of current loop skipping intervening statements (curly brackets)

– { break; } means get out of the current loop

Page 34: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• Loops have curly brackets around them– Loops can be nested– Loops are operations that continue until something

changes

• Arithmetic: sqrt (x), exp (x), log (x), etc.• Logical operators: == (equals), != (not equal), >, <, <=,

>=, && (boolean AND), | | (boolean OR)• Real programs need an input and an output function;

input reads data into the program, and output records the results of the operations in the program. You will not need to worry about the input and output parts of the algorithms, we will provide them to you.

• Loops have curly brackets around them– Loops can be nested– Loops are operations that continue until something

changes

• Arithmetic: sqrt (x), exp (x), log (x), etc.• Logical operators: == (equals), != (not equal), >, <, <=,

>=, && (boolean AND), | | (boolean OR)• Real programs need an input and an output function;

input reads data into the program, and output records the results of the operations in the program. You will not need to worry about the input and output parts of the algorithms, we will provide them to you.

Page 35: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• Simple algorithm to find the average of N terms x(n)Main() { starts programFloat s, av, x[ ]; declares variablesint i, N; declares variables“Input” (x, N); gets datas = 0; initializes s (value 0)for (i = 0; i<N; i++) { loop and statements = s + x[i];}av = s/N arithmetic statement“Output” (av) average reported} end program

• Simple algorithm to find the average of N terms x(n)Main() { starts programFloat s, av, x[ ]; declares variablesint i, N; declares variables“Input” (x, N); gets datas = 0; initializes s (value 0)for (i = 0; i<N; i++) { loop and statements = s + x[i];}av = s/N arithmetic statement“Output” (av) average reported} end program

Page 36: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• How a simple loop works:for (i = 0; i<N; i++); {s = s + x[i];}

• Initially, we had set S=0, so the first time through the loop i=0 and S is set equal to X(0). Then we increment i by one, so I = 1, and s = x(0) + x(1). This process continues until i = N, at which point s =x(0) + x(1) + … x(N-1); in other words, s is the sum of all the elements of the array x, which is the input.

• How a simple loop works:for (i = 0; i<N; i++); {s = s + x[i];}

• Initially, we had set S=0, so the first time through the loop i=0 and S is set equal to X(0). Then we increment i by one, so I = 1, and s = x(0) + x(1). This process continues until i = N, at which point s =x(0) + x(1) + … x(N-1); in other words, s is the sum of all the elements of the array x, which is the input.

Page 37: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

In-class exerciseIn-class exercise• Use the following data and the algorithm for

the average of N terms to find the average: 3,5,8,9

Main() {Float s, av, x[ ];int i, N;“Input” (x, N);s = 0;for (i = 0; i<N; i++) {s = s + x[i];}av = s/N “Output” (av)}

• Use the following data and the algorithm for the average of N terms to find the average: 3,5,8,9

Main() {Float s, av, x[ ];int i, N;“Input” (x, N);s = 0;for (i = 0; i<N; i++) {s = s + x[i];}av = s/N “Output” (av)}

Page 38: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Nested LoopsNested Loops

• Often we need to cover a space defined by

multiple variables. Suppose we have a matrix

with elements X(i,j):

• X(1,1) X(1,2) X(1,3) …… X (1,n)

• X(2,1) X(2,2) X(2,3) …….X (2,n) • .• .• X(m,1)

• Often we need to cover a space defined by

multiple variables. Suppose we have a matrix

with elements X(i,j):

• X(1,1) X(1,2) X(1,3) …… X (1,n)

• X(2,1) X(2,2) X(2,3) …….X (2,n) • .• .• X(m,1)

Page 39: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

• To write the elements of X(i,j), we can use two nested loops, one indexed to i and the other to j:

• for (i=0, i<m , i++) {• for (j=0, j<n, j++) {• “output” x(i,j)• } • }

• To write the elements of X(i,j), we can use two nested loops, one indexed to i and the other to j:

• for (i=0, i<m , i++) {• for (j=0, j<n, j++) {• “output” x(i,j)• } • }

Outside loop

Inside loop

Page 40: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Another algorithm: simple sequence similarity searchAnother algorithm: simple sequence similarity search• Probe sequence w of length J;

elements w(j), j = 1,2, … J

• Library sequence m of length I; elements m(i), i = 1,2, … I

• Problem: find all exact matches of for the whole length of w in m

• Probe sequence w of length J; elements w(j), j = 1,2, … J

• Library sequence m of length I; elements m(i), i = 1,2, … I

• Problem: find all exact matches of for the whole length of w in m

Page 41: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Search algorithmSearch algorithm• Main () {• int i, j, I, J; char w[ ], m[ ];• “Input” (w, m, I, J);• i = 0, j = 0;• for (i=0; i<I; i ++){

for (j=0; j<J; j ++){if (w [ j ] != m [ i + j ]) {break;}if (j==J) {output i}

}}

}

• Main () {• int i, j, I, J; char w[ ], m[ ];• “Input” (w, m, I, J);• i = 0, j = 0;• for (i=0; i<I; i ++){

for (j=0; j<J; j ++){if (w [ j ] != m [ i + j ]) {break;}if (j==J) {output i}

}}

}

Outer loop

Inner loop

Page 42: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Search algorithmSearch algorithm

• How it works: outside loop in i controls the start of the sequence comparison in m: i is the # of the sequence element in m compared to the first element in w.

• inside loop in j increments the index of both w and m, allowing the comparison of successive elements for each starting point fixed by i.

• How it works: outside loop in i controls the start of the sequence comparison in m: i is the # of the sequence element in m compared to the first element in w.

• inside loop in j increments the index of both w and m, allowing the comparison of successive elements for each starting point fixed by i.

Page 43: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

Sliding Box ModelSliding Box Model• i controls the region of m selected for

comparison

• i controls the region of m selected for comparison

• W1 W2...M1 M2 --------------------------------------------MI

M1 M2 -----------------------MI

Above: box represents the position of w on m for i=1, so that W1 gets compared to M1, W2 to M2, etc. as j increases

W1 W2...

Below: box represents the position of w on m for i=2, the second time through the outer loop. As j increases W1 is compared to M2, W2 to M3, etc.

Page 44: Classical Statistics and Scoring of Alignments. Consider a probe of length l and a database of total length m. How many subsequences of length n are there

In-class exerciseIn-class exercise

• W: AB

• M : QLABCD

• Use the simple sequence similarity search algorithm to find matches between w and m

• W: AB

• M : QLABCD

• Use the simple sequence similarity search algorithm to find matches between w and m