Entropy & Information content By Thomas Nordahl Petersen

Entropy, Information content & Logo plots

By Thomas Nordahl Petersen

GTTCTTCGTGTTTATTTTTAGGAAATTGATGATTGTTTCTCCTTTTAAAATAGTACTGCTGTTTTTTACTAACGACACATTGAAGAAATCACTTTGGATACGCTTACCGTTATCCAGAGCTACAGCGCTACTAATATGTAATACTTCAGCTCCCCTTAATATTGAGATCTTTTTTAACTAGTTAGGTCTACCTTCTCCCCTTCTTCATTTTAGCCTGTTTGGACTAACATAACTTATTTACATAGTGCCATTGAACGATATTTCCCGTTGTGTTAAGGCTGAGAAGAATTTTCCCGACCATCAAGACAGGTGATTTATCATGCAAAAACTTTTTTTCACAGGGCTAACTTGCGTTTATTGTGTTTCCACTCAGTTAAAAAACGAAACGTACTTTAATATTTATAGTACTTCATTCGAACATGCTATTTTTCATACAGCAACCTCACATCTGCACTCATCATTAGATTAGAGGAACATGGATACTTTTCTTTATCTAAGCAGCTAACTCAACTATCAACATGCTATTGAACTAGAGATCCACCTATAACTAACATGACTTTAACAGGGCTAATTTACAGTACTAACTAATTAACTTAGAACATTAACATGATCACCGTCACATTTATTAGAATTTCAAACGCAGTGGAATTTTTTTTTCTAGAAATGGTATCGCTCTATGACCAATAAAAACAGACTGTACTTTCAAATGGTATTATTTATAACAGTTGAACATTTCATAAATATGCGATCAATATAGACCGTTGATATATTTTACTTTTTTTTTTTTAGGAGCTCCAAGAATTTATTTCCTTATAATACAGACACGGTTACATCGCAATTAATTTTCTAATAGTTTTTCATTTTGACCATCTTTCTTTTCCCCAGTGCTAAACACGAACCTTCTTTCTCATTCGTAGATTACTGTTGCAATTACTAACAGCTGTAATAGCCGACAAATTTCTCTCTGCGCGTCCAATTTAGCTATACTGTTGTTGTTTTGTTTTGTCGTACAGTGTTTGGAGAAAAACTTCCATTTCTTACATAGATCATCGCCATTCCTTTCCATAATTTATTCAGCGCTTTGGTATCGATTTACTATTTCCATTTAGACGTTGTTCAAAATTTACTAACAATACTTCAGTTTATAATGGATCCTATACTAACAATTTGTAGTTCATAAATAA

• Multiple alignment of acceptor sites from 268 yeast DNA sequences
– What is the biological signal around the site?
– Which positions are important?
– How can it be visualized?

Biological information

Sequence-logo

• Logo plot with Information Content

Exon Intron Exon

Entropy - Definition

• The entropy of a random variable is a measure of its uncertainty

• In thermodynamics, G = H - TS
– The entropy S of a system is its degree of disorder

Entropy - Definition

• Entropy of a distribution of amino acids
– The Shannon entropy:

$H(p) = -\sum_a p_a \log_2(p_a)$, where $p$ is an amino acid distribution.

$H(p)$ is measured in bits: $\log_2(2) = 1$, $\log_2(4) = 2$

Multiple alignment of 3 sequences
Seq1: A L P K
Seq2: A V P R
Seq3: A I K R

High entropy - high disorder
Low entropy - low disorder

Entropy - example

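$H(p) = -\sum_a p_a \log_2(p_a)$

Multiple alignment of 3 sequences
Seq1: A L R
Seq2: A V R
Seq3: A I K

Pos1: $H(p) = -[1 \cdot \log_2(1)] = 0$

Pos2: $H(p) = -[1/3 \cdot \log_2(1/3) + 1/3 \cdot \log_2(1/3) + 1/3 \cdot \log_2(1/3)] = \log_2(3) \approx 1.58$

Pos3: $H(p) = -[2/3 \cdot \log_2(2/3) + 1/3 \cdot \log_2(1/3)] \approx 0.92$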
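The per-position entropies of the small alignment can be checked with a short script. This is a minimal sketch in Python; the function name `column_entropy` and the variable names are chosen here for illustration and are not part of the original slides.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy in bits of the symbols in one alignment column."""
    total = len(column)
    # Sum -p * log2(p) over the symbols that actually occur in the column.
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(column).values())

# The three-sequence example alignment (Seq1, Seq2, Seq3).
alignment = ["ALR", "AVR", "AIK"]

# zip(*alignment) turns rows (sequences) into columns (positions).
columns = ["".join(col) for col in zip(*alignment)]
entropies = [column_entropy(col) for col in columns]

for pos, h in enumerate(entropies, start=1):
    print(f"Pos{pos}: H = {h:.2f} bits")
```

A fully conserved column gives zero entropy, while a column with three different residues gives log2(3) bits.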

Relative Entropy
The Kullback-Leibler distance D

How different is an amino acid distribution $p_a$ from a background distribution $q_a$, i.e. what is the distance D between them?

$D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$

Normally a background distribution of the amino acids is obtained as frequencies from a large database like UniProt.

Background amino acid frequencies (%):

Ala (A) 7.82    Gln (Q) 3.94    Leu (L) 9.62    Ser (S) 6.87
Arg (R) 5.32    Glu (E) 6.60    Lys (K) 5.93    Thr (T) 5.46
Asn (N) 4.20    Gly (G) 6.94    Met (M) 2.37    Trp (W) 1.16
Asp (D) 5.30    His (H) 2.27    Phe (F) 4.01    Tyr (Y) 3.07
Cys (C) 1.56    Ile (I) 5.90    Pro (P) 4.85    Val (V) 6.71
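To see how the background changes the result, the relative entropy can be computed against the frequencies listed above. The sketch below is illustrative only: the names `BACKGROUND_PERCENT` and `relative_entropy` are assumptions made here, and the table values are renormalised because they sum to 99.90 rather than 100.

```python
import math

# Background frequencies in percent, copied from the table above.
BACKGROUND_PERCENT = {
    "A": 7.82, "Q": 3.94, "L": 9.62, "S": 6.87,
    "R": 5.32, "E": 6.60, "K": 5.93, "T": 5.46,
    "N": 4.20, "G": 6.94, "M": 2.37, "W": 1.16,
    "D": 5.30, "H": 2.27, "F": 4.01, "Y": 3.07,
    "C": 1.56, "I": 5.90, "P": 4.85, "V": 6.71,
}

# Convert percentages to probabilities that sum to exactly 1.
total = sum(BACKGROUND_PERCENT.values())
q = {aa: value / total for aa, value in BACKGROUND_PERCENT.items()}

def relative_entropy(p, q):
    """D(p||q) in bits; terms with p_a = 0 contribute nothing."""
    return sum(pa * math.log2(pa / q[aa]) for aa, pa in p.items() if pa > 0)

# Two fully conserved positions: one tryptophan, one leucine.
d_trp = relative_entropy({"W": 1.0}, q)
d_leu = relative_entropy({"L": 1.0}, q)

print(f"Conserved W: D = {d_trp:.2f} bits")
print(f"Conserved L: D = {d_leu:.2f} bits")
```

With a non-uniform background, a conserved rare residue (Trp) scores higher than a conserved common residue (Leu), which is the point of comparing against a background at all.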

Information content

$D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$

The information content is often used as a measure of the degree of conservation.

$I = \sum_a p_a \log_2(p_a/q_a)$

A special case is when all amino acids have the same background frequency: $q_a = 1/20$

Information content

• $I = \sum_a p_a \log_2(p_a/(1/20))$

• $= \sum_a p_a [\log_2 p_a - \log_2(1/20)]$

• $= -H(p) - \sum_a p_a \log_2(1/20)$

• $= -H(p) + \sum_a p_a \log_2(20)$

• $= -H(p) + \log_2(20)$, since $\sum_a p_a = 1$

• $= -H(p) + 4.32$

Information content

• $I = -H(p) + 4.32 = \sum_a p_a \log_2 p_a + 4.32$

The information content is at its maximum when the entropy is zero, i.e. at a fully conserved position in a multiple alignment.

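Multiple alignment of 3 sequences:
Seq1: A L R
Seq2: A V R
Seq3: A I K

Pos1: $I = [1 \cdot \log_2(1)] + 4.32 = 4.32$

Pos2: $I = [1/3 \cdot \log_2(1/3) + 1/3 \cdot \log_2(1/3) + 1/3 \cdot \log_2(1/3)] + 4.32 = -1.58 + 4.32 \approx 2.74$

Pos3: $I = [2/3 \cdot \log_2(2/3) + 1/3 \cdot \log_2(1/3)] + 4.32 = -0.92 + 4.32 \approx 3.40$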
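The information content of the three example positions can be computed in two ways: as log2(20) minus the entropy, and directly as the relative entropy against a uniform background of 1/20. The following Python sketch (function names are illustrative assumptions) checks that both routes agree.

```python
import math
from collections import Counter

ALPHABET_SIZE = 20  # uniform background over the 20 amino acids

def column_frequencies(column):
    """Observed frequency of each residue in one alignment column."""
    total = len(column)
    return {aa: n / total for aa, n in Counter(column).items()}

def info_from_entropy(column):
    """I = log2(20) - H(p)."""
    p = column_frequencies(column)
    entropy = sum(-pa * math.log2(pa) for pa in p.values())
    return math.log2(ALPHABET_SIZE) - entropy

def info_from_relative_entropy(column):
    """I = sum_a p_a * log2(p_a / q_a) with q_a = 1/20."""
    q = 1 / ALPHABET_SIZE
    p = column_frequencies(column)
    return sum(pa * math.log2(pa / q) for pa in p.values())

alignment = ["ALR", "AVR", "AIK"]
columns = ["".join(col) for col in zip(*alignment)]
info = [info_from_entropy(col) for col in columns]
info_kl = [info_from_relative_entropy(col) for col in columns]

for pos, (a, b) in enumerate(zip(info, info_kl), start=1):
    print(f"Pos{pos}: I = {a:.2f} bits (via entropy), {b:.2f} bits (via D)")
```

The conserved first position reaches the maximum of log2(20), about 4.32 bits, and the more variable positions score lower.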

GTTCTTCGTGTTTATTTTTAGGAAATTGATGATTGTTTCTCCTTTTAAAATAGTACTGCTGTTTTTTACTAACGACACATTGAAGAAATCACTTTGGATACGCTTACCGTTATCCAGAGCTACAGCGCTACTAATATGTAATACTTCAGCTCCCCTTAATATTGAGATCTTTTTTAACTAGTTAGGTCTACCTTCTCCCCTTCTTCATTTTAGCCTGTTTGGACTAACATAACTTATTTACATAGTGCCATTGAACGATATTTCCCGTTGTGTTAAGGCTGAGAAGAATTTTCCCGACCATCAAGACAGGTGATTTATCATGCAAAAACTTTTTTTCACAGGGCTAACTTGCGTTTATTGTGTTTCCACTCAGTTAAAAAACGAAACGTACTTTAATATTTATAGTACTTCATTCGAACATGCTATTTTTCATACAGCAACCTCACATCTGCACTCATCATTAGATTAGAGGAACATGGATACTTTTCTTTATCTAAGCAGCTAACTCAACTATCAACATGCTATTGAACTAGAGATCCACCTATAACTAACATGACTTTAACAGGGCTAATTTACAGTACTAACTAATTAACTTAGAACATTAACATGATCACCGTCACATTTATTAGAATTTCAAACGCAGTGGAATTTTTTTTTCTAGAAATGGTATCGCTCTATGACCAATAAAAACAGACTGTACTTTCAAATGGTATTATTTATAACAGTTGAACATTTCATAAATATGCGATCAATATAGACCGTTGATATATTTTACTTTTTTTTTTTTAGGAGCTCCAAGAATTTATTTCCTTATAATACAGACACGGTTACATCGCAATTAATTTTCTAATAGTTTTTCATTTTGACCATCTTTCTTTTCCCCAGTGCTAAACACGAACCTTCTTTCTCATTCGTAGATTACTGTTGCAATTACTAACAGCTGTAATAGCCGACAAATTTCTCTCTGCGCGTCCAATTTAGCTATACTGTTGTTGTTTTGTTTTGTCGTACAGTGTTTGGAGAAAAACTTCCATTTCTTACATAGATCATCGCCATTCCTTTCCATAATTTATTCAGCGCTTTGGTATCGATTTACTATTTCCATTTAGACGTTGTTCAAAATTTACTAACAATACTTCAGTTTATAATGGATCCTATACTAACAATTTGTAGTTCATAAATAA

Count nucleotides at each position (29 positions, 268 sequences):

A 94 88 84 75 78 78 71 69 70 60 68 77 32 49 87 93 93 134 9 266 0 86 66 85 81 89 81 88 82
C 31 45 52 44 56 46 62 54 56 51 46 37 30 42 32 44 30 25 122 1 0 38 65 52 43 62 62 57 43
T 113 110 113 117 104 117 111 120 118 125 136 140 182 155 122 100 124 75 137 0 0 72 85 82 91 83 73 67 96
G 30 25 19 32 30 27 24 25 24 32 18 14 24 22 27 31 21 34 0 1 268 72 52 49 53 34 52 56 47

Convert to frequencies:

A 0.35 0.33 0.31 0.28 0.29 0.29 0.26 0.26 0.26 0.22 0.25 0.29 0.12 0.18 0.32 0.35 0.35 0.50 0.03 0.99 0.00 0.32 0.25 0.32 0.30 0.33 0.30 0.33 0.31
C 0.12 0.17 0.19 0.16 0.21 0.17 0.23 0.20 0.21 0.19 0.17 0.14 0.11 0.16 0.12 0.16 0.11 0.09 0.46 0.00 0.00 0.14 0.24 0.19 0.16 0.23 0.23 0.21 0.16
T 0.42 0.41 0.42 0.44 0.39 0.44 0.41 0.45 0.44 0.47 0.51 0.52 0.68 0.58 0.46 0.37 0.46 0.28 0.51 0.00 0.00 0.27 0.32 0.31 0.34 0.31 0.27 0.25 0.36
G 0.11 0.09 0.07 0.12 0.11 0.10 0.09 0.09 0.09 0.12 0.07 0.05 0.09 0.08 0.10 0.12 0.08 0.13 0.00 0.00 1.00 0.27 0.19 0.18 0.20 0.13 0.19 0.21 0.18
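The conversion from counts to frequencies is a division of each count by the column total. A minimal Python sketch using the count table of the acceptor-site alignment follows; the dictionary layout and the names `COUNTS`, `totals` and `FREQS` are choices made for this example.

```python
# Nucleotide counts per alignment position, copied from the count table.
COUNTS = {
    "A": [94, 88, 84, 75, 78, 78, 71, 69, 70, 60, 68, 77, 32, 49, 87,
          93, 93, 134, 9, 266, 0, 86, 66, 85, 81, 89, 81, 88, 82],
    "C": [31, 45, 52, 44, 56, 46, 62, 54, 56, 51, 46, 37, 30, 42, 32,
          44, 30, 25, 122, 1, 0, 38, 65, 52, 43, 62, 62, 57, 43],
    "T": [113, 110, 113, 117, 104, 117, 111, 120, 118, 125, 136, 140, 182,
          155, 122, 100, 124, 75, 137, 0, 0, 72, 85, 82, 91, 83, 73, 67, 96],
    "G": [30, 25, 19, 32, 30, 27, 24, 25, 24, 32, 18, 14, 24, 22, 27,
          31, 21, 34, 0, 1, 268, 72, 52, 49, 53, 34, 52, 56, 47],
}

n_positions = len(COUNTS["A"])

# Number of sequences contributing to each column.
totals = [sum(COUNTS[nt][i] for nt in COUNTS) for i in range(n_positions)]

# Frequency = count / column total.
FREQS = {nt: [count / total for count, total in zip(row, totals)]
         for nt, row in COUNTS.items()}

for nt, row in FREQS.items():
    print(nt, " ".join(f"{f:.2f}" for f in row))
```

Every column sums to 268, the number of aligned sequences, so the frequencies in each column sum to 1.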

Frequency-logo:

Logo plots - HowTo

Logo plots - Information Content

Sequence-logo

Calculate Information Content

$I = \sum_a p_a \log_2 p_a + \log_2(4)$; the maximal value is 2 bits

• Total height at a position is the ‘Information Content’ measured in bits.
• Height of each letter is proportional to the frequency of that letter.
• A Logo plot is a visualization of a multiple alignment.
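The two height rules can be turned into numbers before any drawing is done. The sketch below, in Python, computes the stack height and the individual letter heights for three positions of the acceptor-site count table (alignment positions 19, 20 and 21); the function name `logo_heights` is an assumption made for this example, and the actual plotting is left out.

```python
import math

def logo_heights(counts, alphabet_size=4):
    """Return (information content in bits, height of each letter in bits)."""
    total = sum(counts.values())
    freqs = {nt: n / total for nt, n in counts.items()}
    # 0 * log2(0) is taken as 0, so absent letters are skipped.
    entropy = sum(-p * math.log2(p) for p in freqs.values() if p > 0)
    info = math.log2(alphabet_size) - entropy
    # Each letter gets its share of the stack, proportional to its frequency.
    return info, {nt: p * info for nt, p in freqs.items()}

# Counts at alignment positions 19, 20 and 21 of the acceptor-site table.
pos19 = {"A": 9, "C": 122, "T": 137, "G": 0}
pos20 = {"A": 266, "C": 1, "T": 0, "G": 1}
pos21 = {"A": 0, "C": 0, "T": 0, "G": 268}

info19, heights19 = logo_heights(pos19)
info20, heights20 = logo_heights(pos20)
info21, heights21 = logo_heights(pos21)

for name, info in [("19", info19), ("20", info20), ("21", info21)]:
    print(f"Position {name}: stack height = {info:.2f} bits")
```

Position 19, where C and T each occur at roughly 0.5, gives a short stack, while the conserved A and G at positions 20 and 21 come close to or reach the 2-bit maximum.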

[Logo annotations: ~0.5 each; Completely conserved]

Programs to make a Logo plot

• WebLogo
– Requires a multiple alignment as input
– Protein or DNA sequences
– More output formats

• Blast2Logo
– Requires a fasta file as input
– Only protein sequences
– Runs PSI-blast and makes a table of frequencies
– PDF logo plot

WebLogo - http://weblogo.berkeley.edu/


Find important positions

>sp|Q00017|RHA1_ASPAC Rhamnogalacturonan acetylesterase
MKTAALAPLFFLPSALATTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL

What is the next step?

1. Find homologous sequences - how?

- Blast or PsiBlast
- Download sequences
- Make a multiple alignment
- ClustalW or others
- or use the Blast2Logo program

Multiple alignment programs


Blast2logo - http://www.cbs.dtu.dk/biotools/Blast2logo-1.0/


Important positions

Important positions in proteins are conserved positions => high Information Content.

Conserved for a reason:
• Functionally important positions
– Catalytic residues
• Structurally important positions
– Maintain the correct fold of the protein

Blast2logo

Runs iterative BLAST, i.e. PSI-BLAST

Searches for homologous sequences using Position Specific Scoring Matrices (PSSMs).

1. Iteration - use the Blosum62 scoring matrix
2. Iteration - make a profile of the sequences found in iteration 1
3. Iteration - make a profile of the sequences found in iteration 2
4. Iteration - calculate amino acid frequencies at each position in the query sequence. Correct for low counts and weight the sequences so that very similar sequences are down-weighted
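The slides do not say how the correction for low counts is done. One common and simple approach is to blend the observed counts with a background distribution through pseudocounts; the sketch below illustrates that idea only and should not be read as the method Blast2logo actually uses. The function name `smoothed_frequencies`, the weight `beta` and the uniform background are all assumptions made for this example, and sequence weighting is not covered.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def smoothed_frequencies(counts, background, beta=5.0):
    """Blend observed counts with a background distribution.

    beta is a hypothetical pseudocount weight: the larger it is, the more
    the result is pulled towards the background when few sequences are seen.
    """
    total = sum(counts.values())
    return {aa: (counts.get(aa, 0) + beta * background[aa]) / (total + beta)
            for aa in background}

# Uniform background for brevity (an assumption; a database-derived
# background could be used instead).
background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

# A column supported by only three sequences, all with alanine.
observed = {"A": 3}
freqs = smoothed_frequencies(observed, background)

print(f"A: {freqs['A']:.3f}")
print(f"W: {freqs['W']:.3f}")
```

With only three sequences, the raw frequency of alanine would be 1.0 and every other residue 0.0; the pseudocounts keep unseen residues at a small non-zero frequency so the position does not look more conserved than the data supports.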

Important positions - counting

Example. Where is the active site?
• Sequence profiles might show you where to look!
• The active site could be around S9, G42, N74, and H195

Exercise

1. Calculate nucleotide frequencies from a multiple alignment of human donor sites

2. Calculate Entropy and Information content

3. Draw (by hand) a Logo plot

4. Use 2 Logo plot programs

5. Learn to interpret Logo & frequency plots

6. Active site residues & structural residues
