Computational Biology, Part A More on Sequence Operations Robert F. Murphy Copyright 1997, 2001. All rights reserved


Representation and Matching of Sequences

Representation of Sequences

characters
  simplest
  easy to read, edit, etc.

bit-coding
  more compact, both on disk and in memory
  comparisons more efficient

Matching one character - with character variables

Assume two character variables "C" and "Q"

test for exact match
  If(Q=C) {...}

need complicated statements to handle wildcards
  If(Q=C | (Q='A'&(C='A'|C='R'|C='W'|C='M'|C='D'|C='H'|C='V'|C='N')|Q='C'&...)) {...}

can build into a function
  If(TestBase(Q,C)) {...}

Efficient method to match one character

Convert char to int 0-25
Create 26x26 matrix showing which matches which
Look up the two characters to be compared to find the value
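The lookup approach above can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the course's implementation: the names MatchTable, IubCodes, BasesFor and BuildMatchTable are hypothetical, TestBase is borrowed from the earlier slide but its body here is an assumption, letters are assumed to be uppercase and contiguous (as in ASCII), and two codes are taken to match when they share at least one base, which is the rule implied by the wildcard statement for Q='A'.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* MatchTable[q][c] is 1 when codes q and c can stand for the same base.
   Rows and columns are letters converted to 0-25 ('A' -> 0 ... 'Z' -> 25). */
static unsigned char MatchTable[26][26];

/* Each entry: an IUB code followed by the bases it represents. */
static const char *const IubCodes[] = {
    "AA", "CC", "GG", "TT",
    "RAG", "YCT", "SCG", "WAT", "MAC", "KGT",
    "BCGT", "DAGT", "HACT", "VACG", "NACGT"
};

/* Return the bases represented by a code, or "" if it is not an IUB code. */
static const char *BasesFor(char Code)
{
    size_t i;
    for (i = 0; i < sizeof IubCodes / sizeof IubCodes[0]; i++)
        if (IubCodes[i][0] == Code)
            return IubCodes[i] + 1;
    return "";
}

/* Fill the 26x26 table once, before any lookups. */
void BuildMatchTable(void)
{
    int q, c;
    memset(MatchTable, 0, sizeof MatchTable);
    for (q = 0; q < 26; q++)
        for (c = 0; c < 26; c++)
            /* two codes match when they share at least one base */
            if (strpbrk(BasesFor((char)('A' + q)),
                        BasesFor((char)('A' + c))) != NULL)
                MatchTable[q][c] = 1;
}

/* Look up whether query code Q matches sequence code C. */
int TestBase(char Q, char C)
{
    assert(isupper((unsigned char)Q) && isupper((unsigned char)C));
    return MatchTable[Q - 'A'][C - 'A'];
}
```

After one call to BuildMatchTable(), each comparison costs a single array access, so TestBase('A','R') is non-zero and TestBase('A','Y') is zero without evaluating any chain of character tests.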

Bit-coding

let the following binary values represent each base
  A = 0001
  C = 0010
  G = 0100
  T = 1000

then
  G = 4
  A or C = 0011 = 3
  A, G or T = 1101 = 13
  etc.

Matching one character - with bit coding

Assume two integer variables "I" and "J"

test for exact match
  If(I=J) {...}

test for match with wildcards (no lookup!)
  If(I&J) {...}
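The two tests above might look like this in C. It is a sketch only: IubMask, ExactMatch and WildcardMatch are hypothetical names, and the bit values follow the assignment given on the bit-coding slide (A=1, C=2, G=4, T=8), with each ambiguity code being the OR of the bases it allows.

```c
#include <assert.h>

enum { BASE_A = 1, BASE_C = 2, BASE_G = 4, BASE_T = 8 };

/* Return the 4-bit mask for an IUB code, or 0 if the code is not recognized. */
unsigned IubMask(char Code)
{
    switch (Code) {
    case 'A': return BASE_A;
    case 'C': return BASE_C;
    case 'G': return BASE_G;
    case 'T': return BASE_T;
    case 'R': return BASE_A | BASE_G;
    case 'Y': return BASE_C | BASE_T;
    case 'S': return BASE_C | BASE_G;
    case 'W': return BASE_A | BASE_T;
    case 'M': return BASE_A | BASE_C;
    case 'K': return BASE_G | BASE_T;
    case 'B': return BASE_C | BASE_G | BASE_T;
    case 'D': return BASE_A | BASE_G | BASE_T;
    case 'H': return BASE_A | BASE_C | BASE_T;
    case 'V': return BASE_A | BASE_C | BASE_G;
    case 'N': return BASE_A | BASE_C | BASE_G | BASE_T;
    default:  return 0;
    }
}

/* Exact match: both positions allow exactly the same set of bases. */
int ExactMatch(unsigned I, unsigned J)
{
    assert(I < 16 && J < 16);   /* masks use only the low four bits */
    return I == J;
}

/* Wildcard match: the two sets of allowed bases overlap. */
int WildcardMatch(unsigned I, unsigned J)
{
    assert(I < 16 && J < 16);
    return (I & J) != 0;
}
```

With this coding, A against R passes the wildcard test because 0001 & 0101 is non-zero, while A against Y fails because 0001 & 1010 is zero.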

Matching more than one character - pattern matching

Example: recognition site for a restriction enzyme

Input sequence string into variable Seq
Define Site as string of characters or masks
  EcoRI recognizes GAATTC
  AccI recognizes GTMKAC

Create function to search a sequence for that site
  Find(Site,LenSite,Seq,LenSeq)
  for each position in Seq, see if Site matches starting there
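One way the search function could be written is sketched below. The argument list follows the Find(Site,LenSite,Seq,LenSeq) call on the slide, but the return convention (0-based position of the first match, or -1) and the helper MaskOf are assumptions made for this illustration. MaskOf relies on the bit coding A=1, C=2, G=4, T=8: the codes in its lookup string are ordered so that the code at index k has mask k+1.

```c
#include <assert.h>
#include <string.h>

/* Return the 4-bit mask for an IUB code, or 0 if it is not recognized.
   The string is ordered so that the code at index k has mask k + 1. */
static unsigned MaskOf(char Code)
{
    static const char Codes[] = "ACMGRSVTWYHKDBN";
    const char *p = (Code != '\0') ? strchr(Codes, Code) : NULL;
    return p ? (unsigned)(p - Codes) + 1 : 0;
}

/* Return the 0-based position of the first place in Seq where Site
   matches, or -1 if there is none. */
long Find(const char *Site, size_t LenSite, const char *Seq, size_t LenSeq)
{
    size_t Pos, k;

    assert(Site != NULL && Seq != NULL);
    if (LenSite == 0 || LenSite > LenSeq)
        return -1;

    /* for each position in Seq, see if Site matches starting there */
    for (Pos = 0; Pos + LenSite <= LenSeq; Pos++) {
        for (k = 0; k < LenSite; k++)
            if ((MaskOf(Site[k]) & MaskOf(Seq[Pos + k])) == 0)
                break;              /* mismatch: try the next position */
        if (k == LenSite)
            return (long)Pos;       /* every character of Site matched */
    }
    return -1;
}
```

For example, Find("GTMKAC", 6, "AAGTCGACTT", 10) returns 2, because M accepts the C and K accepts the G in GTCGAC.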

Automating Probability Calculations using Nucleotide Frequencies


Automating the Calculation

Goal: Calculate probability of occurrence of a sequence that may include ambiguous bases

What we need is a way to consider all possible allowed nucleotides at each position in all allowed combinations

When using dinucleotide probabilities, have to be careful about how the probabilities are combined

Illustration

Question: What is the probability of observing sequence feature ART (A followed by a purine {either A or G}, followed by a T) using dinucleotide probabilities?

Which is right?

$p_{ART} = p_A (p^*_{AA} + p^*_{AG})(p^*_{AT} + p^*_{GT})$   [eq.1]

$p_{ART} = p_A (p^*_{AA} p^*_{AT} + p^*_{AG} p^*_{GT})$   [eq.2]

Expansions

$p_{ART} = p_A (p^*_{AA} + p^*_{AG})(p^*_{AT} + p^*_{GT})$   [eq.1]

$p_{ART} = p_A p^*_{AA} p^*_{AT} + p_A p^*_{AA} p^*_{GT} + p_A p^*_{AG} p^*_{AT} + p_A p^*_{AG} p^*_{GT}$

$p_{ART} = p_A (p^*_{AA} p^*_{AT} + p^*_{AG} p^*_{GT})$   [eq.2]

$p_{ART} = p_A p^*_{AA} p^*_{AT} + p_A p^*_{AG} p^*_{GT}$

Proof

$p_{ART} = p_{AAT} + p_{AGT}$

$p_{AAT} = p_A p^*_{AA} p^*_{AT}$

$p_{AGT} = p_A p^*_{AG} p^*_{GT}$

$p_{ART} = p_A p^*_{AA} p^*_{AT} + p_A p^*_{AG} p^*_{GT}$

This matches equation 2 on previous slide

Need further convincing?

Imagine that $p^*_{AA} = 0$ and $p^*_{GT} = 0$ (but all other $p^*$ are non-zero)
Then $p_{ART}$ should be zero since there is no way to create either AAT or AGT
This is predicted by eq. 2 but not by eq. 1
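The thought experiment above can be checked numerically. The sketch below assumes that the starred quantities are stored in a 4x4 array d, where d[x][y] is the probability that base y follows base x; the function names ArtEq1 and ArtEq2, the index names and the array layout are all hypothetical, and the values used to try it out are made up for illustration.

```c
#include <assert.h>

enum { iA, iC, iG, iT };   /* row/column indices for the four bases */

/* eq. 1: add the alternatives at each step, then multiply the sums. */
double ArtEq1(double pA, double d[4][4])
{
    assert(pA >= 0.0 && pA <= 1.0);
    return pA * (d[iA][iA] + d[iA][iG]) * (d[iA][iT] + d[iG][iT]);
}

/* eq. 2: multiply along each allowed path (AAT, AGT), then add the paths. */
double ArtEq2(double pA, double d[4][4])
{
    assert(pA >= 0.0 && pA <= 1.0);
    return pA * (d[iA][iA] * d[iA][iT] + d[iA][iG] * d[iG][iT]);
}
```

If d[iA][iA] and d[iG][iT] are set to zero and every other entry is positive, ArtEq2 returns exactly zero, while ArtEq1 still returns pA * d[iA][iG] * d[iA][iT], a product that corresponds to no actual sequence.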

More complicated probability illustration

What is the probability of observing the sequence feature ARYT (A followed by a purine {either A or G}, followed by a pyrimidine {either C or T}, followed by a T)?

Using equal mononucleotide frequencies
  $p_A = p_C = p_G = p_T = 1/4$
  $p_{ARYT} = 1/4 \times (1/4 + 1/4) \times (1/4 + 1/4) \times 1/4 = 1/64$

Illustration (continued)

Using observed mononucleotide frequencies:
  $p_{ARYT} = p_A (p_A + p_G)(p_C + p_T) p_T$

Using dinucleotide frequencies:
  $p_{ARYT} = p_A \bigl( p^*_{AA} (p^*_{AC} p^*_{CT} + p^*_{AT} p^*_{TT}) + p^*_{AG} (p^*_{GC} p^*_{CT} + p^*_{GT} p^*_{TT}) \bigr)$

Illustration (continued)

Using dinucleotide frequencies:

A     R          Y           T

A --- +A=AA --- +C=AAC --- +T=AACT
  |         \-- +T=AAT --- +T=AATT
  \-- +G=AG --- +C=AGC --- +T=AGCT
            \-- +T=AGT --- +T=AGTT

Multiply then add

We conclude that for such strings our rule should be "multiply dinucleotide probabilities along each allowed path and then add the results"

How do we program this?
  "for" loops?
  Nested "if" structure?
  Other?

Will this work?

result=monoprob(seq(1));
for i=2 to n
{
  temp=0.
  for j=1 to 4 /*for each base*/
  {
    if(seq(i)&mask(j)) temp=temp+diprob(seq(i-1),seq(i))
  }
  result=result*temp
}

No to for

No, it generates add then multiply
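To see why, it may help to write out what the loop accumulates for ART. This reads the pseudocode as summing, at each position, the dinucleotide probabilities of every allowed base before multiplying into the running result. That reading is an interpretation, since the slides do not spell out what diprob does with an ambiguous code.

```latex
\[
\text{result} = p_A \,\bigl(p^*_{AA} + p^*_{AG}\bigr)\,\bigl(p^*_{AT} + p^*_{GT}\bigr)
\]
```

Under that reading the loop reproduces eq. 1, so its expansion includes the cross terms $p_A p^*_{AA} p^*_{GT}$ and $p_A p^*_{AG} p^*_{AT}$, each of which pairs a step into one base with a step out of a different base.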

A recursive solution

Some programming languages allow recursion - the calling (invoking) of a function by itself

This is useful here because we can branch when we encounter an ambiguous base and consider all alternatives separately

Allows multiplication down the branches and then addition
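A recursive probability calculation along these lines might be sketched as follows. This is not the course's program: SiteProb, SiteProbFrom and MaskOf are hypothetical names, MonoProb[b] and DiProb[x][y] are assumed to hold the mononucleotide probability of base b and the probability that base y follows base x (bases indexed A=0, C=1, G=2, T=3), and unrecognized characters are simply treated as matching nothing.

```c
#include <assert.h>
#include <string.h>

/* Return the 4-bit mask (A=1, C=2, G=4, T=8) for an IUB code, or 0.
   The string is ordered so that the code at index k has mask k + 1. */
static unsigned MaskOf(char Code)
{
    static const char Codes[] = "ACMGRSVTWYHKDBN";
    const char *p = (Code != '\0') ? strchr(Codes, Code) : NULL;
    return p ? (unsigned)(p - Codes) + 1 : 0;
}

/* Probability of the rest of the site, Site[Index...], given that the
   base chosen at the previous position was Prev (0=A, 1=C, 2=G, 3=T). */
static double SiteProbFrom(const char *Site, size_t Index, int Prev,
                           double DiProb[4][4])
{
    double Total = 0.0;
    int b;

    if (Site[Index] == '\0')
        return 1.0;                 /* end of the site: unwind */

    for (b = 0; b < 4; b++)
        if (MaskOf(Site[Index]) & (1u << b))
            /* multiply down this branch, then add across branches */
            Total += DiProb[Prev][b] * SiteProbFrom(Site, Index + 1, b, DiProb);
    return Total;
}

/* Probability of a site that may contain ambiguous bases. */
double SiteProb(const char *Site, const double MonoProb[4], double DiProb[4][4])
{
    double Total = 0.0;
    int b;

    assert(Site != NULL && Site[0] != '\0');
    for (b = 0; b < 4; b++)
        if (MaskOf(Site[0]) & (1u << b))
            Total += MonoProb[b] * SiteProbFrom(Site, 1, b, DiProb);
    return Total;
}
```

With every mononucleotide and dinucleotide probability set to 1/4, SiteProb("ARYT", ...) adds four paths of 1/256 each and returns 1/64, the value obtained on the equal-frequency slide. Each recursive call fixes one base and passes it down as Prev, which is what keeps a step into one base from being combined with a step out of another.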

Site Probability Calculation via Recursion

Illustration: Make a function that prints out all possible sequences that can match a restriction site

(Demo Program PossibleSites.c)
(found in /afs/andrew.cmu.edu/usr/murphy/CompBiol/DemoPrograms or Mellon: BioServer: Comp. Biol. 03-310: Demo Programs: PossibleSites ƒ)

PossibleSites.c

/* PossibleSites.c

   Prints out all possible sites that can match a string of IUB codes

   January 22, 1997 - R.F. Murphy */

#include <stdbool.h>   /* true, false */
#include <stdio.h>
#include <string.h>

void PossibleSites(char SiteString[], int Index);
short Test1(char SiteString[], int Index);
short Test2(char SiteString[], int Index);
short Test3(char SiteString[], int Index);
short Test4(char SiteString[], int Index);

int main(void)
{
    char Site[11];   /* up to 10 characters plus the terminating '\0' */

    for (;;) {
        printf("Enter a string of IUB codes (up to 10 characters): ");
        if (scanf("%10s", Site) != 1)
            break;   /* stop at end of input */
        PossibleSites(Site, 0);
    }
    return 0;
}

void PossibleSites(char SiteString[], int Index)
{
    if ((size_t)Index >= strlen(SiteString)) {
        /* every position has been resolved: print this site */
        printf("%s\n", SiteString);
        return;
    }

    /* the first test that recognizes the character does the recursion */
    if (Test1(SiteString, Index))
        ;
    else if (Test2(SiteString, Index))
        ;
    else if (Test3(SiteString, Index))
        ;
    else if (Test4(SiteString, Index))
        ;
    else {
        printf("Illegal character (%c) encountered\n", SiteString[Index]);
        PossibleSites(SiteString, Index + 1);
    }
}

short Test1(char SiteString[], int Index)
{
    /* printf("In Test1: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */

    switch (SiteString[Index]) {
    case 'A':
    case 'C':
    case 'G':
    case 'T':
        break;          /* unambiguous base: keep it as it is */
    default:
        return false;
    }
    PossibleSites(SiteString, Index + 1);   /* go on to the next position */
    return true;
}

Unwind here

Test for each type of ambiguous base

short Test2(char SiteString[], int Index)
{
    char Save;

    /* printf("In Test2: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */

    /* Two-base codes. Each alternative is substituted in place and
       PossibleSites is called again at the same Index, where Test1
       accepts the substituted base and moves on to Index+1. */
    Save = SiteString[Index];
    switch (SiteString[Index]) {
    case 'R':
        SiteString[Index] = 'A'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'G'; PossibleSites(SiteString, Index);
        break;
    case 'Y':
        SiteString[Index] = 'C'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'T'; PossibleSites(SiteString, Index);
        break;
    case 'S':
        SiteString[Index] = 'G'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'C'; PossibleSites(SiteString, Index);
        break;
    case 'W':
        SiteString[Index] = 'A'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'T'; PossibleSites(SiteString, Index);
        break;
    case 'M':
        SiteString[Index] = 'A'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'C'; PossibleSites(SiteString, Index);
        break;
    case 'K':
        SiteString[Index] = 'G'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'T'; PossibleSites(SiteString, Index);
        break;
    default:
        return false;
    }
    SiteString[Index] = Save;   /* restore the ambiguous code for the caller */
    return true;
}

short Test3(char SiteString[], int Index)
{
    char Save;

    /* printf("In Test3: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */

    Save = SiteString[Index];
    switch (SiteString[Index]) {
    case 'B': /* not A */
        SiteString[Index] = 'C'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'G'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'T'; PossibleSites(SiteString, Index);
        break;
    case 'D': /* not C */
        SiteString[Index] = 'A'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'G'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'T'; PossibleSites(SiteString, Index);
        break;
    case 'H': /* not G */
        SiteString[Index] = 'A'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'C'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'T'; PossibleSites(SiteString, Index);
        break;
    case 'V': /* not T/U */
        SiteString[Index] = 'A'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'C'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'G'; PossibleSites(SiteString, Index);
        break;
    default:
        return false;
    }
    SiteString[Index] = Save;
    return true;
}

short Test4(char SiteString[], int Index)
{
    char Save;

    /* printf("In Test4: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */

    Save = SiteString[Index];
    switch (SiteString[Index]) {
    case 'N': /* A,C,G,T/U (iNdeterminate) */
    case 'X': /* alternate for N */
        SiteString[Index] = 'A'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'C'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'G'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'T'; PossibleSites(SiteString, Index);
        break;
    default:
        return false;
    }
    SiteString[Index] = Save;
    return true;
}