DNA Properties and Genetic Sequence Alignment
CSE, Marmara University
mimoza.marmara.edu.tr/~m.sakalli/cse546
Nov/8/09

Devising a scoring system


Page 1: Devising a scoring system

DNA Properties and Genetic Sequence Alignment

CSE, Marmara University

mimoza.marmara.edu.tr/~m.sakalli/cse546

Nov/8/09

Page 2: Devising a scoring system

Devising a scoring system

Importance: Scoring matrices appear in all analyses involving sequence comparison. The choice of matrix can strongly influence the outcome of the analysis. Understanding the theories underlying a given scoring matrix can aid in making the proper choice:

Some matrices reflect similarity: good for database searching. Some reflect distance: good for phylogenies.

Log-odds matrices, a normalisation method for matrix values:

S_ij is a log-odds score comparing the probability that two residues, i and j, are aligned by evolutionary descent with the probability that they are aligned by chance:

S_ij = log( q_ij / (p_i · p_j) )

q_ij are the frequencies with which i and j are observed to align in sequences known to be related; p_i and p_j are their frequencies of occurrence in the set of sequences.
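For illustration (this snippet is not from the slides, and the frequencies in it are invented), a log-odds score can be computed directly from the pair and background frequencies:

import math

# s_ij = log2( q_ij / (p_i * p_j) ): positive when i and j are aligned in related
# sequences more often than chance would predict. All numbers here are made up.
p = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}        # background frequencies p_i
q = {("A", "A"): 0.12, ("A", "G"): 0.04}            # aligned-pair frequencies q_ij

def log_odds(i, j):
    return math.log2(q[(i, j)] / (p[i] * p[j]))

print(round(log_odds("A", "A"), 2))    # 0.42: seen more often than expected by chance
print(round(log_odds("A", "G"), 2))    # -0.58: seen less often than expected by chance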

Page 3: Devising a scoring system

Next week: Aligning Two Strings

Each row and each column is labelled with a number and the symbol of the sequence present up to that position. For example, the sequences are represented as:

www.bioalgorithms.info / Winfried Just

Alignment as a Path in the Edit Graph

0 1 2 2 3 4 5 6 7 7
A T _ G T T A T _
A T C G T _ A _ C
0 1 2 3 4 5 5 6 6 7

(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

Page 4: Devising a scoring system

Horizontal and vertical edges represent indels (insertions and deletions) in v and w, scoring 0. Diagonal edges represent exact matches, scoring 1.

The score of the alignment path in the graph is 5.

Every path in the edit graph corresponds to an alignment:

Alternative Alignment

   0122345677
v= AT_GTTAT_
w= ATCGT_A_C
   0123455667

v= AT_GTTAT_
w= ATCG_TA_C

Page 5: Devising a scoring system

Alignment: Dynamic Programming ???

Use this scoring algorithm

s(i,j) = MAX of:
  s(i-1, j-1) + 1   if v_i = w_j   (diagonal)
  s(i-1, j)         (value from top)
  s(i, j-1)         (value from left)

There are no matches at the beginning of the alignment

The first row and the first column (i = 0 or j = 0) are all initialized to zero.

Page 6: Devising a scoring system

Backtracking Example: Find the matches in row 2 and in column 2.

For i = 2: j = 2 and j = 5 are matches (T). For j = 2: i = 4, 5 and 7 are matches (T).

Since v_i = w_j, S(i,j) = S(i-1,j-1) + 1

S(2,2) = [S(1,1) = 1] + 1
S(2,5) = [S(1,4) = 1] + 1
S(4,2) = [S(3,1) = 1] + 1
S(5,2) = [S(4,1) = 1] + 1
S(7,2) = [S(6,1) = 1] + 1

The simplest form of a sequence similarity analysis, and a solution to the Longest Common Subsequence (LCS) problem.

   0122345677          0122345677
v= AT_GTTAT_           AT_GTTAT_
w= ATCGT_A_C           ATCG_TA_C
   0123455667          0123455667

Page 7: Devising a scoring system

To solve the alignment, replace mismatches with insertions and deletions. The score for vertex s(i,j) is the same as in the previous example, with zero penalties for indels. Once LCS(v,w) has created the alignment grid, read off the best alignment by following the arrows backwards from the sink.

LCS(v, w)
  for i ← 1 to n
    s[i,0] ← 0
  for j ← 1 to m
    s[0,j] ← 0
  for i ← 1 to n
    for j ← 1 to m
                    s[i-1,j]
      s[i,j] ← max  s[i,j-1]
                    s[i-1,j-1] + 1,  if v_i = w_j

                    "↑"   if s[i,j] = s[i-1,j]
      b[i,j] ←      "←"   if s[i,j] = s[i,j-1]
                    "↖"   if s[i,j] = s[i-1,j-1] + 1
  return (s[n,m], b)
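A minimal runnable Python version of the pseudocode above (not part of the original slides; the arrow labels and tie-breaking order are illustrative choices):

def lcs(v, w):
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]      # s[i][j]: LCS score of v[:i], w[:j]
    b = [[None] * (m + 1) for _ in range(n + 1)]   # backtrack arrows
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            b[i][j] = "up" if s[i][j] == s[i - 1][j] else "left"
            if v[i - 1] == w[j - 1] and s[i - 1][j - 1] + 1 > s[i][j]:
                s[i][j] = s[i - 1][j - 1] + 1
                b[i][j] = "diag"
    return s[n][m], b

def backtrack(b, v, i, j, out):
    # Follow the arrows backwards from the sink to spell out one LCS.
    if i == 0 or j == 0:
        return
    if b[i][j] == "diag":
        backtrack(b, v, i - 1, j - 1, out)
        out.append(v[i - 1])
    elif b[i][j] == "up":
        backtrack(b, v, i - 1, j, out)
    else:
        backtrack(b, v, i, j - 1, out)

score, b = lcs("ATGTTAT", "ATCGTAC")
chars = []
backtrack(b, "ATGTTAT", 7, 7, chars)
print(score, "".join(chars))   # 5, and one longest common subsequence (e.g. ATGTA)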

Page 8: Devising a scoring system

LCS Runtime

The Alignment Problem tries to find the longest path between vertices (i,j) and (n,m) in the edit graph. If (i,j) = (0,0) and (n,m) is the end vertex, then the alignment is global.

To create the n×m matrix of best scores from vertex (0,0) to all other vertices takes O(nm) time. The LCS pseudocode has a nested "for" inside a "for" to set up the n×m matrix; this visits every value w_j for every value v_i.

Changing penalties

In the LCS Problem, matches score 1 and indels score 0.

Consider penalizing indels and mismatches with negative scores:

#matches – μ·(#mismatches) – σ·(#indels)

s(i,j) = max of:
  s(i-1, j-1) + 1   if v_i = w_j     (in general: s(i-1, j-1) + δ(v_i, w_j))
  s(i-1, j-1) – μ   if v_i ≠ w_j
  s(i-1, j)   – σ   if indel         (in general: s(i-1, j) + δ(v_i, –))
  s(i, j-1)   – σ   if indel         (in general: s(i, j-1) + δ(–, w_j))
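As a quick check of this scoring scheme (not from the slides; μ = 1 and σ = 2 below are arbitrary example penalties), the Page 4 alignment can be scored directly:

def alignment_score(a, b, mu=1, sigma=2, gap="_"):
    # #matches - mu * #mismatches - sigma * #indels for two equal-length aligned rows
    matches = sum(x == y and x != gap for x, y in zip(a, b))
    indels = sum(x == gap or y == gap for x, y in zip(a, b))
    mismatches = len(a) - matches - indels
    return matches - mu * mismatches - sigma * indels

print(alignment_score("AT_GTTAT_", "ATCGT_A_C"))   # 5 - 1*0 - 2*4 = -3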

Page 9: Devising a scoring system

Affine Gap Penalties and Manhattan Grids

Gaps, contiguous runs of spaces in one of the rows, are more likely in nature.

The score for a gap of length x is -(ρ + σx), where ρ > 0 is the penalty for initiating a gap, chosen larger relative to σ, the smaller penalty for extending the gap.

The three recurrences of the scoring algorithm create a 3-tiered graph corresponding to three layered Manhattan grids.

The top and bottom levels create/extend gaps in the sequences w and v respectively, while the middle level extends matches and mismatches.

Page 10: Devising a scoring system

Affine Gap Penalty Recurrences

Top level (creates/extends gaps in w, i.e. deletions):
  s_top(i,j) = max of:
    s_top(i-1, j) – σ          (continue gap in w: deletion)
    s(i-1, j) – (ρ + σ)        (start gap in w: deletion, from the middle level)

Bottom level (creates/extends gaps in v, i.e. insertions):
  s_bottom(i,j) = max of:
    s_bottom(i, j-1) – σ       (continue gap in v: insertion)
    s(i, j-1) – (ρ + σ)        (start gap in v: insertion, from the middle level)

Middle level (matches and mismatches):
  s(i,j) = max of:
    s(i-1, j-1) + δ(v_i, w_j)  (match or mismatch)
    s_top(i,j)                 (end deletion: from the top level)
    s_bottom(i,j)              (end insertion: from the bottom level)

Match score: +1       Matches: 18 × (+1)
Mismatch score: 0     Mismatches: 2 × 0
Gap penalty: –1       Gaps: 7 × (–1)

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Score = 18 – 7 = +11

Page 11: Devising a scoring system

The 3 Grids

Edge weights in the three-level graph: diagonal edges within the middle grid score +δ(v_i, w_j); edges leading away from the mid-level (opening a gap) score –(ρ + σ); edges leading back toward the mid-level (closing a gap) score +0.
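A sketch of the three-level recurrences in code, in the style of Gotoh's affine-gap algorithm (this is not from the slides; the values match = +1, mismatch = -1, ρ = 4, σ = 1 are illustrative):

NEG = float("-inf")

def affine_align_score(v, w, match=1, mismatch=-1, rho=4, sigma=1):
    n, m = len(v), len(w)
    middle = [[NEG] * (m + 1) for _ in range(n + 1)]   # matches / mismatches
    top    = [[NEG] * (m + 1) for _ in range(n + 1)]   # gaps in w (deletions)
    bottom = [[NEG] * (m + 1) for _ in range(n + 1)]   # gaps in v (insertions)
    middle[0][0] = 0
    for i in range(1, n + 1):                          # a leading gap of length i costs rho + i*sigma
        top[i][0] = middle[i][0] = -(rho + i * sigma)
    for j in range(1, m + 1):
        bottom[0][j] = middle[0][j] = -(rho + j * sigma)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            top[i][j] = max(top[i - 1][j] - sigma,               # continue gap in w
                            middle[i - 1][j] - (rho + sigma))    # start gap in w
            bottom[i][j] = max(bottom[i][j - 1] - sigma,         # continue gap in v
                               middle[i][j - 1] - (rho + sigma)) # start gap in v
            diag = match if v[i - 1] == w[j - 1] else mismatch
            middle[i][j] = max(middle[i - 1][j - 1] + diag,      # match or mismatch
                               top[i][j],                        # end deletion
                               bottom[i][j])                     # end insertion
    return middle[n][m]

print(affine_align_score("ATGTTAT", "ATCGTAC"))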

Page 12: Devising a scoring system

A Recursive Approach to Alignment and Time complexity

% Choose the best alignment based on these three possibilities:
align(seq1, seq2) {
    if (both sequences empty) { return 0; }
    if (one sequence empty) {
        return(gapscore * num chars in nonempty seq);
    } else {
        score1 = score(firstchar(seq1), firstchar(seq2))
                 + align(tail(seq1), tail(seq2));
        score2 = align(tail(seq1), seq2) + gapscore;
        score3 = align(seq1, tail(seq2)) + gapscore;
        return(min(score1, score2, score3));
    }
}

What is the recurrence equation for the time needed by RecurseAlign?

T(n) = 3·T(n – 1), since each call spawns three recursive calls on inputs one character shorter.

The recursion tree therefore branches by a factor of 3 at every level: 3, 9, 27, ... calls, reaching 3^n calls at depth n, so the running time is O(3^n).
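The exponential blow-up comes from recomputing the same (i, j) subproblems over and over; caching (memoising) them is exactly what the DP table does, reducing the work to O(nm). An illustrative sketch (not from the slides; the 0/1 match/mismatch costs and gap cost of 1 are arbitrary, and min is used as in the pseudocode):

from functools import lru_cache

def align_cost(seq1, seq2, gapscore=1):
    def score(a, b):
        return 0 if a == b else 1        # illustrative: 0 for match, 1 for mismatch

    @lru_cache(maxsize=None)
    def rec(i, j):                       # cost of aligning seq1[i:] with seq2[j:]
        if i == len(seq1) and j == len(seq2):
            return 0
        if i == len(seq1):
            return gapscore * (len(seq2) - j)
        if j == len(seq2):
            return gapscore * (len(seq1) - i)
        return min(score(seq1[i], seq2[j]) + rec(i + 1, j + 1),
                   rec(i + 1, j) + gapscore,
                   rec(i, j + 1) + gapscore)

    return rec(0, 0)

print(align_cost("ATGTTAT", "ATCGTAC"))   # each (i, j) pair is solved only once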


Page 14: Devising a scoring system

Needleman-Wunsch

Sequences: ACTCG and ACAGTAG

Match: +1, Mismatch: 0, Gap: –1

Vertical/horizontal move: score + (simple) gap penalty

Diagonal move: score + match/mismatch score

Take the MAX of the three possibilities

        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1   1   0  -1  -2  -3
C  -2   0   2   1   0  -1
A  -3  -1   1   2   1   0
G  -4  -2   0   1   2   2
T  -5  -3  -1   1   1   2
A  -6  -4  -2   0   1   1
G  -7  -5  -3  -1   0   2

The optimal alignment score is at the lower-right corner.

The alignment is reconstructed backwards by recording where the MAX at each step came from.

Space inefficiency.
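The table above can be regenerated with the short sketch below (an illustration, not part of the slides):

def needleman_wunsch(v, w, match=1, mismatch=0, gap=-1):
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = match if v[i - 1] == w[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + diag,   # diagonal move
                          s[i - 1][j] + gap,        # vertical move
                          s[i][j - 1] + gap)        # horizontal move
    return s

table = needleman_wunsch("ACAGTAG", "ACTCG")
for row in table:
    print(row)
print("optimal score:", table[-1][-1])   # 2, the lower-right corner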

Page 15: Devising a scoring system

How to generate a multiple alignment? Given a pairwise alignment, just add the third, then the fourth, and so on!

dront     AGAC
t-rex     --AC
unicorn   AG--

dront     AGAC
unicorn   AG--
t-rex     --AC

t-rex     AC--
unicorn   AG--
dront     AGAC

t-rex     --AC
unicorn   --AG

A possible general method would be to extend the pairwise alignment method into a simultaneous N-wise alignment, using a complete dynamic-programming algorithm in N dimensions. Algorithmically, this is not difficult to do.

In the case of three sequences to be aligned, one can visualize this reasonably easily: one would set up a three-dimensional matrix (a cube) instead of the two-dimensional matrix for the pairwise comparison. Then one basically performs the same procedure as for the two-sequence case. This time, the result is a path that goes diagonally through the cube from one corner to the opposite.

The problem here is that the time to compute this N-wise alignment becomes prohibitive as the number of sequences grows. The algorithmic complexity is something like O(c^(2n)), where c is a constant and n is the number of sequences. (See the next section for an explanation of this notation.) This is disastrous, as may be seen in a simple example: if a pairwise alignment of two sequences takes 1 second, then four sequences would take 10^4 seconds (2.8 hours), five sequences 10^6 seconds (11.6 days), six sequences 10^8 seconds (3.2 years), seven sequences 10^10 seconds (317 years), and so on. The expected number of hits with score ≥ S is E = K·m·n·e^(–λS) (the Karlin-Altschul E-value).

Page 16: Devising a scoring system

MML

For a hypothesis H and data D we have, from Bayes:

P(H&D) = P(H)·P(D|H) = P(D)·P(H|D)    (1)

P(H): the prior probability of hypothesis H
P(H|D): the posterior probability of hypothesis H
P(D|H): the likelihood of the hypothesis, actually a function of the data given H

From Shannon's entropy, the message length of an event E, MsgLen(E), where E has probability P(E), is given by

MsgLen(E) = –log2(P(E))    (2)

From (1) and (2): MsgLen(H&D) = MsgLen(H) + MsgLen(D|H) = MsgLen(D) + MsgLen(H|D)

Now, in inductive inference one often wants the hypothesis H with the largest posterior probability: max{P(H|D)}.

MsgLen(H) can usually be estimated well, for some reasonable prior on hypotheses. MsgLen(D|H) can also usually be calculated. Unfortunately, it is often impractical to estimate P(D), which is a pity because it would yield P(H|D).

However, for two rival hypotheses, H and H':

MsgLen(H|D) – MsgLen(H'|D) = MsgLen(H) + MsgLen(D|H) – MsgLen(H') – MsgLen(D|H'), which is the posterior –log odds ratio.
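A small numeric illustration of comparing two rival hypotheses by two-part message length (not from the slides; the priors and likelihoods are invented):

import math

def msg_len(prob):
    return -math.log2(prob)              # MsgLen(E) = -log2(P(E)), in bits

def two_part_length(prior, likelihood):
    return msg_len(prior) + msg_len(likelihood)   # MsgLen(H) + MsgLen(D|H)

len_h1 = two_part_length(prior=0.20, likelihood=0.01)   # simpler hypothesis
len_h2 = two_part_length(prior=0.01, likelihood=0.03)   # more complex, fits better

# The difference of the two totals is the posterior log-odds; the shorter
# message wins (here the simpler hypothesis, by about 2.7 bits).
print(round(len_h1, 2), round(len_h2, 2), round(len_h2 - len_h1, 2))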

Page 17: Devising a scoring system

Consider a transmitter T and a receiver R connected by one of Shannon's communication channels. T must transmit some data D to R. T and R may have previously agreed on a code book for hypotheses, using common knowledge and prior expectations. If T can find a good hypothesis, H (theory, structure, pattern, ...), to fit the data then she may be able to transmit the data economically. An explanation is a two-part message: (i) transmit H, taking MsgLen(H) bits, and (ii) transmit D given H, taking MsgLen(D|H) bits.

The message paradigm keeps us "honest": any information that is not common knowledge must be included in the message for it to be decipherable by the receiver; there can be no hidden parameters. This issue extends to inferring (and stating) real-valued parameters to the "appropriate" level of precision.

The method is "safe": If we use an inefficient code it can only make the hypothesis The method is "safe": If we use an inefficient code it can only make the hypothesis look less attractive than otherwise.look less attractive than otherwise.There is a natural hypothesis test: The null-theory corresponds to transmitting There is a natural hypothesis test: The null-theory corresponds to transmitting the data "as is". (That does not necessarily mean in 8-bit ascii; the language the data "as is". (That does not necessarily mean in 8-bit ascii; the language must be efficient.) If a hypothesis cannot better the null-theory then it is not must be efficient.) If a hypothesis cannot better the null-theory then it is not acceptable. acceptable.

Page 18: Devising a scoring system

A more complex hypothesis fits the data better than a simpler model, in general. We see that MML encoding gives a trade-off between hypothesis complexity, MsgLen(H), and the goodness of fit to the data, MsgLen(D|H). The MML principle is one way to justify and realize Occam's razor.

Continuous Real-Valued Parameters

When a model has one or more continuous, real-valued parameters, they must be stated to an "appropriate" level of precision. The parameter must be stated in the explanation, and only a finite number of bits can be used for the purpose, as part of MsgLen(H). The stated value will often be close to the maximum-likelihood value, which minimises MsgLen(D|H).

If the -log likelihood, MsgLen(D|H), varies rapidly for small changes in the parameter, the parameter should be stated to high precision.

If the -log likelihood varies only slowly with changes in the parameter, the parameter should be stated to low precision.

The simplest case is the multi-state or multinomial distribution, where the data is a sequence of independent values from such a distribution. The hypothesis, H, is an estimate of the probabilities of the various states (e.g. the bias of a coin or a dice). The estimate must be stated to an "appropriate" precision, i.e. in an appropriate number of bits.

Page 19: Devising a scoring system

Elementary Probability

A sample space (e.g. S = {a,c,g,t}) is the set of possible outcomes of some experiment.

Events A, B, C, ..., H, ...: an event is a subset (possibly a singleton) of the sample space, e.g. Purine = {a, g}. Events have probabilities P(A), P(B), etc.

Random variables X, Y, Z, ...: a random variable X takes values, with certain probabilities, from the sample space. We may write P(X=a), P(a) or P({a}) for the probability that X = a.

Thomas Bayes (1702-1761). Bayes' Theorem

If B_1, B_2, ..., B_k is a partition of a set B (of causes), then

P(B_i|A) = P(A|B_i) P(B_i) / ∑_{j=1..k} P(A|B_j) P(B_j),   where i = 1, 2, ..., k

Inference

Bayes' theorem is relevant to inference because we may be entertaining a number of exclusive and exhaustive hypotheses H_1, H_2, ..., H_k, and wish to know which is the best explanation of some observed data D. In that case P(H_i|D) is called the posterior probability of H_i, "posterior" because it is the probability after the data has been observed.

∑_{j=1..k} P(D|H_j) P(H_j) = P(D)

P(H_i|D) = P(D|H_i) P(H_i) / P(D)   -- the posterior

Note that the H_i can even be an infinite enumerable set.

P(H_i) is called the prior probability of H_i, "prior" because it is the probability before D is known.

Page 20: Devising a scoring system

Conditional Probability

The probability of B given A is written P(B|A). It is the probability of B provided that A is true; we do not care, either way, if A is false. Conditional probability is defined by:

P(A&B) = P(A)·P(B|A) = P(B)·P(A|B)

P(A|B) = P(A&B) / P(B),   P(B|A) = P(A&B) / P(A)

These rules are a special case of Bayes' theorem for k = 2. There are four combinations for two Boolean variables:

            A                        not A                            margin
B           A & B                    not A & B                        (A or not A) & B = B
not B       A & not B                not A & not B                    (A or not A) & not B = not B
margin      A = A & (B or not B)     not A = not A & (B or not B)

We can still ask what is the probability of A, say, alone:
P(A) = P(A & B) + P(A & not B)
P(B) = P(A & B) + P(not A & B)

Page 21: Devising a scoring system

Independence

A and B are said to be independent if the probability of A does not depend on B, and vice versa. In that case P(A|B) = P(A) and P(B|A) = P(B), so:

P(A & B) = P(A)·P(B)
P(A & not B) = P(A)·P(not B)
P(not A & B) = P(not A)·P(B)
P(not A & not B) = P(not A)·P(not B)

A Puzzle

I have a dice (made it myself, so it might be "tricky") which has 1, 2, 3, 4, 5 and 6 on different faces. Opposite faces sum to 7. The results of rolling the dice 100 times (good vigorous rolls on carpet) were:

  1- 20:  3 1 1 3 3 5 1 4 4 2    3 4 3 1 2 4 6 6 6 6
 21- 40:  3 3 5 1 3 1 5 3 6 5    1 6 2 4 1 2 2 4 5 5
 41- 60:  1 1 1 1 6 6 5 5 3 5    4 3 3 3 4 3 2 2 2 3
 61- 80:  5 1 3 3 2 2 2 2 1 2    4 4 1 4 1 5 4 1 4 2
 81-100:  5 5 6 4 4 6 6 4 6 6    6 3 1 1 1 6 6 2 4 5

Can you learn anything about the dice from these results? What would you predict might come up at the next roll? How certain are you of your prediction?
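One way to start on the puzzle (my addition, not part of the slides) is simply to tally the rolls; the relative frequencies are the maximum-likelihood estimates discussed later in these notes.

from collections import Counter

# The 100 recorded rolls, transcribed row by row from the table above.
rolls = [3,1,1,3,3,5,1,4,4,2, 3,4,3,1,2,4,6,6,6,6,
         3,3,5,1,3,1,5,3,6,5, 1,6,2,4,1,2,2,4,5,5,
         1,1,1,1,6,6,5,5,3,5, 4,3,3,3,4,3,2,2,2,3,
         5,1,3,3,2,2,2,2,1,2, 4,4,1,4,1,5,4,1,4,2,
         5,5,6,4,4,6,6,4,6,6, 6,3,1,1,1,6,6,2,4,5]

counts = Counter(rolls)
for face in range(1, 7):
    print(face, counts[face], counts[face] / len(rolls))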

Page 22: Devising a scoring system

Information

Examples. More probable, less information.

I toss a coin and tell you that it came down `heads'. What is the information? A computer scientist immediately says `one bit' (I hope). {a,c,g,t} information: log2(b^d) for base b and d digits; codon and protein information.

Suppose we have a trick coin with two heads and this is common knowledge. I toss the coin and tell you that it came down ... heads. How much information? If you had not known that it was a trick coin, you would have learned something, so the information learned depends on your prior knowledge!

Pulling a coin out of my pocket, I tell you that it is a trick coin with two heads. How much information have you gained? Well, "quite a lot", maybe 20 or more bits, because trick coins are very rare and you may not even have seen one before.

A biased coin has a head and a tail but comes down heads about 75% of the time and tails about 25%, and this is common knowledge. I toss the coin and tell you ... tails: two bits of information. A second toss lands ... heads: rather less than one bit, wouldn't you say?

Information: Definition

The amount of information in learning of an event A which has probability P(A) is

I(A) = –log2(P(A)) bits
I(A) = –ln(P(A)) nits (aka nats)

Note that:
P(A) = 1     =>  I(A) = –log2(1) = 0, no information
P(B) = 1/2   =>  I(B) = –log2(1/2) = log2(2) = 1
P(C) = 1/2^n =>  I(C) = n
P(D) = 0     =>  I(D) = ∞ ... think about it

Page 23: Devising a scoring system

Entropy

Entropy tells us the average (expected) information in a probability distribution over a sample space S. It is defined to be

H = – ∑_{v in S} P(v) log2 P(v)

This is for a discrete sample space, but it can be extended to a continuous one by the use of an integral.

Examples

The fair coin:
H = –1/2 log2(1/2) – 1/2 log2(1/2) = 1/2 + 1/2 = 1 bit

That biased coin, P(head) = 0.75, P(tail) = 0.25:
H = –3/4 log2(3/4) – 1/4 log2(1/4) = 3/4 (2 – log2(3)) + 2/4 = 2 – 3/4 log2(3) bits < 1 bit

A biased four-sided dice, p(a) = 1/2, p(c) = 1/4, p(g) = p(t) = 1/8:
H = –1/2 log2(1/2) – 1/4 log2(1/4) – 1/8 log2(1/8) – 1/8 log2(1/8) = 1 3/4 bits
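These worked examples can be checked with a few lines of Python (my addition, not part of the slides):

import math

def entropy(probs):
    # H = -sum_v P(v) log2 P(v), ignoring zero-probability outcomes
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                    # fair coin: 1.0 bit
print(entropy([0.75, 0.25]))                  # biased coin: ~0.811 bits (< 1 bit)
print(entropy([0.5, 0.25, 0.125, 0.125]))     # biased four-sided dice: 1.75 bits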

Page 24: Devising a scoring system

Theorem H1

(The result is "classic" but this is from notes taken during talks by Chris Wallace (1988).) http://www.csse.monash.edu.au/~lloyd/tildeMML/Notes/Information/

If (p_i)_{i=1..N} and (q_i)_{i=1..N} are probability distributions, i.e. each non-negative and summing to one, then the expression

∑_{i=1..N} { – p_i log2(q_i) }   is minimized when q_i = p_i

Proof: First note that to minimize f(a,b,c) subject to g(a,b,c) = 0, we consider f(a,b,c) + λ·g(a,b,c). We have to do this because a, b and c are not independent; they are constrained by g(a,b,c) = 0. If we were just to set d/da{f(a,b,c)} to zero we would miss any effects that a has on b and c through g( ). We don't know how important these effects are in advance, but λ will tell us. We differentiate and set to zero the following: d/da{f(a,b,c) + λ·g(a,b,c)} = 0, and similarly for d/db, d/dc and d/dλ.

d/dw_i { – ∑_{j=1..N} p_j log2 w_j + λ((∑_{j=1..N} w_j) – 1) } = – p_i/w_i + λ = 0

hence w_i = p_i / λ

∑ p_i = 1 and ∑ w_i = 1, so λ = 1

hence w_i = p_i

Corollary (Information Inequality)

∑_{i=1..N} { p_i log2( p_i / q_i ) } ≥ 0,   with equality iff p_i = q_i; e.g., see Farr (1999).

Page 25: Devising a scoring system

Kullback-Leibler Distance

The left-hand side of the information inequality,

∑_{i=1..N} { p_i log2( p_i / q_i ) },

is called the Kullback-Leibler distance (also relative entropy) of the probability distribution (q_i) from the probability distribution (p_i). It is always non-negative. Note that it is not symmetric in general. (The Kullback-Leibler distance is defined on continuous distributions through the use of an integral in place of the sum.)

Exercise

• Calculate the Kullback-Leibler distances between the fair and biased (above) probability distributions for the four-sided dice.
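A short sketch of the exercise (my addition, not part of the slides), using the fair (1/4 each) and biased (1/2, 1/4, 1/8, 1/8) distributions from the previous page:

import math

def kl(p, q):
    # Kullback-Leibler distance of (q_i) from (p_i), in bits
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair   = [0.25, 0.25, 0.25, 0.25]
biased = [0.5, 0.25, 0.125, 0.125]

print(kl(biased, fair))   # 0.25 bits
print(kl(fair, biased))   # 0.25 bits -- equal here by coincidence, not in general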

Notes

• S. Kullback & R. A. Leibler. On Information and Sufficiency. Annals of Math. Stats. 22, pp. 79-86, 1951
• S. Kullback. Information Theory and Statistics. Wiley, 1959
• C. S. Wallace. Information Theory. Dept. Computer Science, Monash University, Victoria, Australia, 1988
• G. Farr. Information Theory and MML Inference. School of Computer Science and Software Engineering, Monash University, Victoria, Australia, 1999

Page 26: Devising a scoring system

Inference, Introduction

People often distinguish between selecting a model class, selecting a model from a given class, and estimating the parameter values of a given model to fit some data.

It is argued that although these distinctions are sometimes of practical convenience, that's all: they are all really one and the same process of inference.

A (parameter estimate of a) model (class) is not a prediction, at least not a prediction of future data, although it might be used to predict future data. It is an explanation of, a hypothesis about, the process that generated the data. It can be a good explanation or a bad explanation. Naturally we prefer good explanations.

Model Class

e.g. Polynomial models form a model class for sequences of points {<x_i, y_i>} s.t. x_i < x_{i+1}. The class includes constants, straight lines, quadratics, cubics, etc.; note that these have increasing numbers of parameters; they grow more complex. (Note also that the class does not include functions based on trigonometric functions, since they form another class.)

Model

e.g. The general cubic equation y = ax^3 + bx^2 + cx + d is a model for {<x_i, y_i>} s.t. x_i < x_{i+1}. It has four parameters a, b, c and d, which can be estimated to fit the model to some given data.

It is usually the case that a model has a fixed number of parameters (e.g. four above), but this can become blurred if hierarchical parameters or dependent parameters crop up. Some writers reserve `model' for a model (as above) where all the parameters have fixed, e.g. inferred, values; if there is any ambiguity I will (try to) use model instance for the latter. e.g. The normal distribution N(μ,σ) is a model and N(0,1) is a model instance.

Parameter Estimation

e.g. If we estimate the parameters of a cubic to be a=1, b=2, c=3 and d=4, we get the particular cubic polynomial y = x^3 + 2x^2 + 3x + 4.
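For illustration (not from the slides, and assuming NumPy is available), parameter estimation for the cubic model class can be done by least squares; the data points below are generated from the particular cubic just mentioned:

import numpy as np

xs = np.arange(-3.0, 4.0)                    # seven x values
ys = xs**3 + 2 * xs**2 + 3 * xs + 4          # exact values of y = x^3 + 2x^2 + 3x + 4

a, b, c, d = np.polyfit(xs, ys, deg=3)       # least-squares estimates of a, b, c, d
print(np.round([a, b, c, d], 3))             # ~[1. 2. 3. 4.]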

Page 27: Devising a scoring system

Inference: Hypothesis Complexity

Over-fitting

Over-fitting often appears as selecting a too-complex model for the data. e.g. Given ten data points from a physics experiment, a 9th-degree polynomial could be fitted through them, exactly. This would almost certainly be a ridiculous thing to do. That small amount of data is probably better described by a straight line or by a quadratic, with any minor variations explained as "noise" and experimental error.

Parameter estimation provides another manifestation of over-fitting: stating a parameter value too accurately is also over-fitting.

• e.g. I toss a coin three times. It lands `heads' once and `tails' twice. What do we infer? p=0.5, p=0.3, p=0.33, p=0.333 or p=0.3333, etc.? In fact we learn almost nothing, except that the coin does have a head on one side and a tail on the other; p>0 and p<1.

• e.g. I toss a coin 30 times. It lands `heads' 10 times and `tails' 20 times. We are probably justified in starting to suspect a bias, perhaps 0.2 < p < 0.4.

• The accuracy with which the bias of a coin can be estimated can be made precise; see the [2-State / Binomial Distribution].

Classical statistics has developed a variety of significance tests to judge whether a model is justified by the data. An alternative is described below.

Page 28: Devising a scoring system

Inference: MML

Attempts to minimise the discrepancy between given data, D, and the values implied by a hypothesis, H, almost always result in over-fitting, i.e. a too-complex hypothesis (model, parameter estimate, ...). e.g. If a quadratic gives a certain root mean squared (RMS) error, then a cubic will in general give a smaller RMS value.

Some penalty for the complexity of H is needed to give teeth to the so-called "law of diminishing returns". The minimum message length (MML) criterion is to consider a two-part message (remember [Bayes]):

2-part message: H; (D|H)
probabilities: P(H&D) = P(H)·P(D|H)
message length: msgLen(H&D) = msgLen(H) + msgLen(D|H)

A complex hypothesis has a small prior probability P(H), i.e. a large msgLen(H); it had better make a big saving on msgLen(D|H) to pay for its msgLen(H). The name `minimum message length' is after Shannon's mathematical theory of communication.

The first part of the two-part message can be considered to be a "header", as in data compression or data communication. Many file compression algorithms produce a header in the compressed file which states a number of parameter values etc. which are necessary for the data to be decoded.

The use of a prior, P(H), is considered to be controversial in classical statistics.

Notes

The idea of using compression to guide inference seems to have started in the 1960s.

• R. J. Solomonoff. A Formal Theory of Inductive Inference, I and II. Information and Control 7, pp. 1-22 and pp. 224-254, 1964
• A. N. Kolmogorov. Three Approaches to the Quantitative Definition of Information. Problems of Information and Transmission 1(1), pp. 1-7, 1965
• G. J. Chaitin. On the Length of Programs for Computing Finite Binary Sequences. JACM 13(4), pp. 547-569, Oct. 1966
• C. S. Wallace and D. M. Boulton. An Information Measure for Classification. CACM 11(2), pp. 185-194, Aug. 1968

Page 29: Devising a scoring system

Inference: More Formally

• Theta: variously the parameter space, model class, etc.
• theta or H: variously a particular parameter value, hypothesis, model, etc.
• P(theta) or P(H): prior probability of a parameter value, hypothesis, etc.
• X: the data space, the set of all possible observations
• D: the data, an observation, D ∈ X, often x ∈ X
• P(D|H): the likelihood of the data D, also written as f(D|H)
• P(H|D) or P(theta|D): the posterior probability of an estimate or hypothesis etc. given the observed data
• P(D, H) = P(D & H): the joint probability of D and H
• m(D): X -> Theta, a function mapping observed data D onto an inferred parameter estimate, hypothesis, etc., as appropriate

Maximum Likelihood

The maximum likelihood principle is to choose H so as to maximize P(D|H). e.g.

Binomial Distribution (Bernoulli Trials)

A coin is tossed N times, landing heads #head times and landing tails #tail = N - #head times. We want to infer p = P(head). The likelihood of <#head, #tail> is:

P(#head|p) = p^#head · (1-p)^#tail

Take the negative log because (it's easier and) maximizing the likelihood is equivalent to minimizing the negative log likelihood:

-log2(P(#head|p)) = -#head·log2(p) - #tail·log2(1-p)

Page 30: Devising a scoring system

-log2(P(#head|p)) = -#head·log2(p) - #tail·log2(1-p)

Differentiate with respect to p and set to zero:

-#head/p + #tail/(1-p) = 0
#head/p = #tail/(1-p)
#head·(1-p) = (N - #head)·p
#head = N·p
p = #head / N,   q = 1-p = #tail / N

So the maximum-likelihood inference for the bias of the coin is #head/N.

To sow some seeds of doubt, note that if the coin is thrown just once, the estimate for p must be either 0.0 or 1.0, which seems rather silly, although one could argue that such a small number of trials is itself rather silly. Still, if the coin were thrown 10 times and happened to land heads 10 times, which is conceivable, an estimate of 1.0 is not sensible.
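Numerically (my addition, not part of the slides), a simple grid search over p shows the minimum of the negative log-likelihood sitting at #head/N:

import math

def neg_log_like(p, heads, tails):
    return -heads * math.log2(p) - tails * math.log2(1 - p)

heads, tails = 10, 20
candidates = [i / 100 for i in range(1, 100)]
best = min(candidates, key=lambda p: neg_log_like(p, heads, tails))
print(best, heads / (heads + tails))   # both ~0.33, the maximum-likelihood estimate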

e.g. Normal Distribution

Given N data points, the maximum likelihood estimators for the parameters of a normal distribution, μ' and σ', are given by:

μ' = {∑_{i=1..N} x_i} / N   -- the sample mean
σ'^2 = {∑_{i=1..N} (x_i - μ')^2} / N   -- the sample variance

Note that σ'^2 is biased: e.g. if there are just two data values, it is implicitly assumed that they lie on opposite sides of the mean, which plainly is not necessarily the case, i.e. the variance is under-estimated.

Notes

This section follows Farr (p. 17..., 1999).

• G. Farr. Information Theory and MML Inference. School of Computer Science and Software Engineering, Monash University, 1999

Page 31: Devising a scoring system

HMM (PFSM)

A Markov Model is a probabilistic process defined over a finite sequence of events, or states, (s_1, ..., s_n), the probability of each of which depends only on the preceding events. Each state transition generates a character from the alphabet of the process. The probability that a sequence x with a certain pattern appears next in a given state S(j), pr(x(t) = S(j)), may depend on the prior history of t-1 events.

Probabilistic Finite State Automata (PFSA) or Machine

The order of the model is the length of the history, or the context, upon which the probabilities of the next outcome depend.

A 0th-order model has no dependency on the past: pr{x(t) = S(i)} = pr{x(t') = S(j)}.

A 2nd-order model depends upon the two previous states: pr{x(t) = S(j)} | {pr{x(t-1) = S(m)}, pr{x(t-2) = S(n)}}, for j, m, n = 1..k.

A Hidden Markov Model (HMM) is an MM where the states are hidden. For example, in the case of deleted genetic sequences where nucleotide density information is lost, neither the architecture of the model nor the transitions are known: hidden.

One must search for an automaton that explains the data well (pdf).

It is necessary to weigh the trade-off between the complexity of the model and its fit to the data.
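A toy first-order Markov model over the alphabet {a, c, g, t} (my addition, not part of the slides; the transition probabilities are invented): each transition emits one character, and the next-character distribution depends only on the current character.

import random

transitions = {
    "a": {"a": 0.4, "c": 0.2, "g": 0.2, "t": 0.2},
    "c": {"a": 0.2, "c": 0.4, "g": 0.2, "t": 0.2},
    "g": {"a": 0.2, "c": 0.2, "g": 0.4, "t": 0.2},
    "t": {"a": 0.2, "c": 0.2, "g": 0.2, "t": 0.4},
}

def generate(length, start="a", seed=0):
    rng = random.Random(seed)
    state, out = start, [start]
    for _ in range(length - 1):
        choices = list(transitions[state])
        weights = [transitions[state][c] for c in choices]
        state = rng.choices(choices, weights=weights)[0]
        out.append(state)
    return "".join(out)

print(generate(30))   # in an HMM the state sequence itself would not be visible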

Page 32: Devising a scoring system

A Hidden Markov Model (HMM) represents stochastic sequences as Markov chains where the states are not directly observed, but are associated with a probability density function (pdf). The generation of a random sequence is then the result of a random walk in the chain (i.e. the browsing of a random sequence of states S = {s_1, ..., s_n}) and the result of a draw (called an emission) at each visit of a state.

The sequence of states, which is the quantity of interest in most pattern recognition problems, can be observed only through the stochastic processes defined in each state (i.e. before being able to associate a set of states with a set of observations, the parameters of the pdfs of each state must be known). The true sequence of states is therefore hidden by a first layer of stochastic processes. HMMs are dynamic models, in the sense that they are specifically designed to account for some macroscopic structure of the random sequences. Consider random sequences of observations as the result of a series of independent draws in one or several Gaussian densities. To this simple statistical modeling scheme, HMMs add the specification of some statistical dependence between the (Gaussian) densities from which the observations are drawn.

Page 33: Devising a scoring system

Assessing significance requires a distribution. I have an apple of diameter 5". How unusual?

Is a match significant? It depends on:

The scoring system
The database
The sequence searched for
  *** Length
  *** Composition

How do we determine the random sequences? Generating "random" sequences:

Random uniform model: P(G) = P(A) = P(C) = P(T) = 0.25
  Doesn't reflect nature

Use sequences from a database
  Might have genuine homology; we want unrelated sequences

Random shuffling of sequences
  Preserves composition
  Removes true homology

[Histogram: frequency or counts vs. match score]

What distribution do we expect to see? The mean of n random (i.i.d.) events tends towards a Gaussian distribution. Example: throw n dice and compute the mean.

Distribution of means (shown for n = 2 and n = 1000).
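A sketch of the shuffling idea (my addition, not part of the slides): shuffle one sequence to preserve composition while destroying homology, and collect the resulting scores as an empirical null distribution. simple_score here just counts ungapped positional matches; a real search would use the alignment scoring above.

import random

def simple_score(a, b):
    return sum(x == y for x, y in zip(a, b))

def shuffled_scores(query, target, trials=1000, seed=0):
    rng = random.Random(seed)
    letters = list(target)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)                     # same composition, random order
        scores.append(simple_score(query, "".join(letters)))
    return scores

query = target = "ACGTCTGATACGCCGTATAGTCTATCT"
null = shuffled_scores(query, target)
observed = simple_score(query, target)
p_value = sum(s >= observed for s in null) / len(null)
print(observed, p_value)                         # how often chance does as well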

Page 34: Devising a scoring system

The extreme value distribution

This means that if we get the match scores for our sequence with n other sequences, the mean would follow a Gaussian distribution.

The maximum of n (i.i.d.) random events tends towards the extreme value distribution as n grows large.

Extreme value:  f(x) = e^(-x) · exp(-e^(-x))

Gaussian:  f(x) = (1 / (σ·sqrt(2π))) · exp(-(x - μ)^2 / (2σ^2))
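A quick simulation (my addition, not part of the slides; the Gaussian source and sample sizes are arbitrary) shows the difference: means of i.i.d. samples cluster symmetrically, while maxima are shifted and right-skewed, which is why database-search score statistics use the extreme value distribution.

import random, statistics

rng = random.Random(0)

means  = [statistics.mean(rng.gauss(0, 1) for _ in range(100)) for _ in range(5000)]
maxima = [max(rng.gauss(0, 1) for _ in range(100)) for _ in range(5000)]

print("means :", round(statistics.mean(means), 3), round(statistics.stdev(means), 3))
print("maxima:", round(statistics.mean(maxima), 3), round(statistics.stdev(maxima), 3))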