Entropy & Information content By Thomas Nordahl Petersen

  • Entropy, Information content & Logo plots

    By Thomas Nordahl Petersen

  • Multiple alignment of acceptor sites from 268 yeast DNA sequences

    What is the biological signal around the site?

    What are the important positions?

    How can it be visualized?

    Biological information

    Logo plot with Information Content

    Exon Intron Exon

  • Entropy - Definition

    The entropy of a random variable is a measure of its uncertainty.

    In thermodynamics: G = H - TS

    The entropy S of a system is its degree of disorder.

  • Entropy - Definition

    Entropy of a distribution of amino acids

    The Shannon entropy:

    $H(p) = -\sum_a p_a \log_2(p_a)$, where $p$ is an amino acid distribution.

    $H(p)$ is measured in bits: $\log_2(2) = 1$, $\log_2(4) = 2$

    Multiple alignment of 3 sequences

    Seq1: A L P K

    Seq2: A V P R

    Seq3: A I K R

    High entropy - high disorder

    Low entropy - low disorder

  • Entropy - example

    $H(p) = -\sum_a p_a \log_2(p_a)$

    Multiple alignment of 3 sequences

    Seq1: A L R

    Seq2: A V R

    Seq3: A I K

    Pos1: H(p) = -[1*log2(1)] = 0

    Pos2: H(p) = -[1/3*log2(1/3) + 1/3*log2(1/3) + 1/3*log2(1/3)] = log2(3) ≈ 1.58

    Pos3: H(p) = -[2/3*log2(2/3) + 1/3*log2(1/3)] ≈ 0.92
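The per-position calculation above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `column_entropy` is made up for this example, and the entropy is written as the sum of p*log2(1/p), which is the same quantity as -sum p*log2(p).

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the residue frequencies in one alignment column."""
    total = len(column)
    # p * log2(1/p) equals -p * log2(p); this form avoids printing "-0.00".
    return sum((n / total) * math.log2(total / n) for n in Counter(column).values())

# The three-sequence example alignment from the slide.
alignment = ["ALR", "AVR", "AIK"]

# Transpose the alignment so that each string is one column (position).
columns = ["".join(col) for col in zip(*alignment)]
entropies = [column_entropy(col) for col in columns]

for pos, h in enumerate(entropies, start=1):
    print(f"Pos{pos}: H(p) = {h:.2f} bits")
```

Running this gives roughly 0.00, 1.58 and 0.92 bits for positions 1 to 3.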

  • Relative Entropy

    The Kullback-Leibler distance D

    How different is an amino acid distribution $p_a$ from a background distribution $q_a$, i.e. what is the distance D between them?

    $D(p||q) = \sum_a p_a \log_2(p_a/q_a)$

    Normally, a background distribution of the amino acids is obtained as frequencies from a large database like UniProt (values below in percent).

    Ala (A) 7.82 Gln (Q) 3.94 Leu (L) 9.62 Ser (S) 6.87

    Arg (R) 5.32 Glu (E) 6.60 Lys (K) 5.93 Thr (T) 5.46

    Asn (N) 4.20 Gly (G) 6.94 Met (M) 2.37 Trp (W) 1.16

    Asp (D) 5.30 His (H) 2.27 Phe (F) 4.01 Tyr (Y) 3.07

    Cys (C) 1.56 Ile (I) 5.90 Pro (P) 4.85 Val (V) 6.71
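As a rough sketch, the relative entropy of an alignment column against this background can be computed as follows. The code assumes the table values are percentages (they sum to about 99.9, so they are renormalized), and the names `BACKGROUND` and `relative_entropy` are illustrative only.

```python
import math
from collections import Counter

# Background frequencies copied from the table above, assumed to be in percent.
BACKGROUND_PERCENT = {
    "A": 7.82, "Q": 3.94, "L": 9.62, "S": 6.87,
    "R": 5.32, "E": 6.60, "K": 5.93, "T": 5.46,
    "N": 4.20, "G": 6.94, "M": 2.37, "W": 1.16,
    "D": 5.30, "H": 2.27, "F": 4.01, "Y": 3.07,
    "C": 1.56, "I": 5.90, "P": 4.85, "V": 6.71,
}

# Renormalize so that the background probabilities sum to exactly 1.
_total = sum(BACKGROUND_PERCENT.values())
BACKGROUND = {aa: value / _total for aa, value in BACKGROUND_PERCENT.items()}

def relative_entropy(column, background=BACKGROUND):
    """D(p||q) in bits for the observed residues in one alignment column."""
    n = len(column)
    return sum(
        (count / n) * math.log2((count / n) / background[aa])
        for aa, count in Counter(column).items()
    )

d_trp = relative_entropy("WWW")  # conserved rare residue
d_leu = relative_entropy("LLL")  # conserved common residue
print(f"D(WWW) = {d_trp:.2f} bits, D(LLL) = {d_leu:.2f} bits")
```

With a non-uniform background, a conserved rare residue such as Trp scores higher than a conserved common residue such as Leu, because it deviates more from what the background predicts.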

  • Information content

    $D(p||q) = \sum_a p_a \log_2(p_a/q_a)$

    Often the Information content is used as a measure of the degree of conservation.

    $I = \sum_a p_a \log_2(p_a/q_a)$

    A special case is a uniform background, where all amino acids have the same background frequency: $q_a = 1/20$

  • Information content

    $I = \sum_a p_a \log_2(p_a/(1/20))$

    $= \sum_a p_a [\log_2(p_a) - \log_2(1/20)]$

    $= -H(p) - \sum_a p_a \log_2(1/20)$

    $= -H(p) + \sum_a p_a \log_2(20)$

    $= -H(p) + \log_2(20)$ (because $\sum_a p_a = 1$)

    $= -H(p) + 4.32$

  • Information content

    $I = -H(p) + 4.32 = \sum_a p_a \log_2(p_a) + 4.32$

    The Information content is at its maximum when the entropy is zero, i.e. at a fully conserved position in a multiple alignment.

    Multiple alignment of 3 sequences:

    Seq1: A L R

    Seq2: A V R

    Seq3: A I K

    Pos1: I = [1*log2(1)] + 4.32 = 4.32

    Pos2: I = [1/3*log2(1/3) + 1/3*log2(1/3) + 1/3*log2(1/3)] + 4.32 ≈ 2.74

    Pos3: I = [2/3*log2(2/3) + 1/3*log2(1/3)] + 4.32 ≈ 3.40
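A short numerical check of the uniform-background case, written as a sketch: it computes the Information content once with the shortcut log2(20) - H(p) and once directly as the relative entropy with q_a = 1/20, and the two should agree. The function names are made up for this example.

```python
import math
from collections import Counter

def entropy(column):
    """Shannon entropy H(p) in bits, written as sum of p * log2(1/p)."""
    n = len(column)
    return sum((c / n) * math.log2(n / c) for c in Counter(column).values())

def information_uniform(column, alphabet_size=20):
    """Shortcut form: I = log2(alphabet_size) - H(p)."""
    return math.log2(alphabet_size) - entropy(column)

def relative_entropy_uniform(column, alphabet_size=20):
    """Direct form: D(p||q) with q_a = 1/alphabet_size for every residue."""
    n = len(column)
    q = 1.0 / alphabet_size
    return sum((c / n) * math.log2((c / n) / q) for c in Counter(column).values())

alignment = ["ALR", "AVR", "AIK"]
columns = ["".join(col) for col in zip(*alignment)]
info = [information_uniform(col) for col in columns]
direct = [relative_entropy_uniform(col) for col in columns]

for pos, (a, b) in enumerate(zip(info, direct), start=1):
    print(f"Pos{pos}: I = {a:.2f} bits (direct D(p||q) = {b:.2f} bits)")
```

Both forms give about 4.32, 2.74 and 3.40 bits for positions 1 to 3.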

  • Logo plots - HowTo

  • Logo plots - Information Content

    Calculate Information Content

    $I = \sum_a p_a \log_2(p_a) + \log_2(4)$; the maximal value is 2 bits

    The total height at a position is the Information Content measured in bits. The height of each letter is proportional to the frequency of that letter. A Logo plot is a visualization of a multiple alignment.

    ~0.5 each

    Completely conserved
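The letter heights of a DNA logo can be sketched as below, assuming a uniform background over the four nucleotides and columns without gaps. The function name `logo_heights` is hypothetical, and this is not the code used by any particular logo program.

```python
import math
from collections import Counter

def logo_heights(column, alphabet="ACGT"):
    """Letter heights in bits for one alignment column.

    The column height is I = log2(len(alphabet)) - H(p); each letter gets
    the share p_a * I of that height.
    """
    n = len(column)
    counts = Counter(column)
    freqs = {base: counts[base] / n for base in alphabet}
    entropy = sum(p * math.log2(1 / p) for p in freqs.values() if p > 0)
    info = math.log2(len(alphabet)) - entropy
    return {base: p * info for base, p in freqs.items() if p > 0}

conserved = logo_heights("GGGG")  # one letter, full 2 bits
mixed = logo_heights("AAGG")      # two equally frequent letters
print(conserved)
print(mixed)
```

A completely conserved column yields a single letter of height 2 bits, while a column split evenly between two nucleotides has 1 bit in total, so each letter is 0.5 bits high.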

  • Programs to make a Logo plot

    WebLogo: requires a multiple alignment as input; protein or DNA sequences; more output formats

    Blast2Logo: requires a FASTA file as input; only protein sequences; runs PSI-BLAST and makes a table of frequencies; PDF logo plot

  • WebLogo - http://weblogo.berkeley.edu/

  • Find important positions

    >sp|Q00017|RHA1_ASPAC Rhamnogalacturonan acetylesterase

    MKTAALAPLFFLPSALATTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGR

    SARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGV

    NETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAG

    VEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVL

    TTTSFEGTCL

    What is the next step?

    Find homologous sequences - how?

    Blast or PsiBlast; download sequences; make a multiple alignment (ClustalW or others); or use the Blast2Logo program

  • Multiple alignment programs

  • Blast2logo - http://www.cbs.dtu.dk/biotools/Blast2logo-1.0/

  • Important positions

    Important positions in proteins are conserved positions => high Information Content.

    Conserved for a reason:

    Functionally important positions: catalytic residues

    Structurally important positions: maintain the correct fold of the protein

  • Blast2logo

    Runs iterative BLAST, i.e. PSI-BLAST

    Searches for homologous sequences by use of Position Specific Scoring Matrices (PSSM).

    Iteration 1 - use the BLOSUM62 scoring matrix

    Iteration 2 - make a profile of the sequences found in iteration 1

    Iteration 3 - make a profile of the sequences found in iteration 2

    Iteration 4 - calculate amino acid frequencies at each position in the query sequence. Correct for low counts and weight the sequences such that very similar sequences are down-weighted
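The last step (correcting for low counts and down-weighting similar sequences) can be illustrated with a deliberately simple sketch. It assumes per-sequence weights are already given and uses a flat additive pseudocount; the actual weighting and low-count correction used by Blast2logo or PSI-BLAST are not described in these slides and may well differ. The function name and parameters are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def weighted_frequencies(column, weights, alphabet=AMINO_ACIDS, pseudocount=0.05):
    """Residue frequencies for one column from weighted counts plus pseudocounts.

    column      : residues at this position, one per aligned sequence
    weights     : one weight per sequence (similar sequences share weight)
    pseudocount : flat count added to every residue (illustrative assumption)
    """
    totals = {aa: pseudocount for aa in alphabet}
    for residue, weight in zip(column, weights):
        totals[residue] += weight
    norm = sum(totals.values())
    return {aa: total / norm for aa, total in totals.items()}

# Sequences 1 and 2 are assumed to be near-identical, so they share one unit
# of weight; sequence 3 is distinct and keeps a full unit.
freqs = weighted_frequencies("LLV", weights=[0.5, 0.5, 1.0])
print(f"p(L) = {freqs['L']:.3f}, p(V) = {freqs['V']:.3f}, p(W) = {freqs['W']:.4f}")
```

With these weights Leu and Val end up equally frequent even though Leu is observed twice, and residues that were never observed keep a small non-zero frequency, so log2(p_a/q_a) stays finite.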

  • Important positions - counting

  • Example. Where is the active site?

    Sequence profiles might show you where to look! The active site could be around S9, G42, N74, and H195

  • Exercise

    Calculate nucleotide frequencies from a multiple alignment of human donor sites

    Calculate Entropy and Information content

    Draw (by hand) a Logo plot

    Use 2 Logo plot programs

    Learn to interpret Logo & frequency plots

    Active site residues & structural residues