Entropy, Information Content & Logo Plots
By Thomas Nordahl Petersen
Multiple alignment of acceptor sites from 268 yeast DNA sequences
What is the biological signal around the site?
What are the important positions?
How can it be visualized?
Biological information
Logo plot with Information Content
Exon Intron Exon
Entropy - Definition
The entropy of a random variable is a measure of its uncertainty.
In thermodynamics: G = H - TS
The entropy S of a system is the degree of disorder.
Entropy - Definition
Entropy of a distribution of amino acids
The Shannon entropy:
$H(p) = -\sum_a p_a \log_2(p_a)$, where $p$ is an amino acid distribution.
H(p) is measured in bits: log2(2) = 1, log2(4) = 2
Multiple alignment of 3 sequences
Seq1: A L P K
Seq2: A V P R
Seq3: A I K R
High entropy - high disorder
Low entropy - low disorder
Entropy - example
$H(p) = -\sum_a p_a \log_2(p_a)$
Multiple alignment of 3 sequences
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: H(p)= -[1*log2(1)] = 0
Pos2: H(p)= -[1/3*log2(1/3)+ 1/3*log2(1/3)+ 1/3*log2(1/3)] = log2(3) = 1.58
Pos3: H(p)= -[2/3*log2(2/3)+ 1/3*log2(1/3)] = 0.92
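The per-position entropies of this small alignment can also be checked numerically. The following is a minimal sketch, not part of the original slides; the function name `column_entropy` is made up for illustration, and frequencies are taken as raw counts with no correction for the small number of sequences.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column.

    Written as sum(p * log2(1/p)), which equals -sum(p * log2(p)).
    """
    counts = Counter(column)
    total = len(column)
    return sum((n / total) * log2(total / n) for n in counts.values())

# The three-sequence example alignment from the slide.
alignment = ["ALR", "AVR", "AIK"]

# Transpose the alignment so that each string is one column (position).
columns = ["".join(col) for col in zip(*alignment)]
entropies = [column_entropy(col) for col in columns]

for pos, h in enumerate(entropies, start=1):
    print(f"Pos{pos}: H(p) = {h:.2f} bits")
```

Position 1 is fully conserved and has zero entropy, while position 2, with three different residues, has the highest entropy of the three.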
Relative Entropy
The Kullback-Leibler distance D
How different is an amino acid distribution p_a from a background distribution q_a, i.e. what is the distance D between them?
$D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$
Normally the background distribution of the amino acids is
obtained as frequencies from a large database like UniProt.
The frequencies below are given in percent.
Ala (A) 7.82 Gln (Q) 3.94 Leu (L) 9.62 Ser (S) 6.87
Arg (R) 5.32 Glu (E) 6.60 Lys (K) 5.93 Thr (T) 5.46
Asn (N) 4.20 Gly (G) 6.94 Met (M) 2.37 Trp (W) 1.16
Asp (D) 5.30 His (H) 2.27 Phe (F) 4.01 Tyr (Y) 3.07
Cys (C) 1.56 Ile (I) 5.90 Pro (P) 4.85 Val (V) 6.71
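As an illustration of the relative entropy with this background, here is a minimal Python sketch. It is not from the original slides: the names `BACKGROUND_PERCENT`, `Q` and `relative_entropy` are made up for the example, and the table values are renormalised because the rounded percentages do not sum to exactly 100.

```python
from math import log2

# Background frequencies in percent, copied from the table above.
BACKGROUND_PERCENT = {
    "A": 7.82, "Q": 3.94, "L": 9.62, "S": 6.87,
    "R": 5.32, "E": 6.60, "K": 5.93, "T": 5.46,
    "N": 4.20, "G": 6.94, "M": 2.37, "W": 1.16,
    "D": 5.30, "H": 2.27, "F": 4.01, "Y": 3.07,
    "C": 1.56, "I": 5.90, "P": 4.85, "V": 6.71,
}

# Convert to probabilities that sum to 1 (assumption: simple renormalisation).
_total = sum(BACKGROUND_PERCENT.values())
Q = {aa: pct / _total for aa, pct in BACKGROUND_PERCENT.items()}

def relative_entropy(p, q):
    """Kullback-Leibler distance D(p||q) in bits.

    Residues with p_a = 0 contribute nothing to the sum and are skipped.
    """
    return sum(pa * log2(pa / q[aa]) for aa, pa in p.items() if pa > 0)

# A fully conserved alanine column versus a column with 2/3 Arg and 1/3 Lys.
d_conserved = relative_entropy({"A": 1.0}, Q)
d_mixed = relative_entropy({"R": 2 / 3, "K": 1 / 3}, Q)
# A distribution identical to the background has distance zero.
d_background = relative_entropy(Q, Q)

print(f"conserved A: {d_conserved:.2f} bits")
print(f"2/3 R, 1/3 K: {d_mixed:.2f} bits")
```

With a non-uniform background, a conserved rare residue such as Trp scores higher than a conserved common residue such as Leu, which is the point of comparing against q_a rather than using the entropy alone.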
Information content
$D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$
Often the Information content is used as a measure of the
degree of conservation.
$I = \sum_a p_a \log_2(p_a/q_a)$
A special case is that where all amino acids have the same background frequency: q_a = 1/20
Information content
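$$
\begin{aligned}
I &= \sum_a p_a \log_2\!\left(\frac{p_a}{1/20}\right) \\
  &= \sum_a p_a \left[\log_2 p_a - \log_2(1/20)\right] \\
  &= -H(p) - \sum_a p_a \log_2(1/20) \\
  &= -H(p) + \sum_a p_a \log_2(20) \\
  &= -H(p) + \log_2(20) \\
  &= -H(p) + 4.32
\end{aligned}
$$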
Information content
$I = -H(p) + 4.32 = \sum_a p_a \log_2 p_a + 4.32$
The Information content is at its maximum when the entropy is zero, i.e. at a fully conserved position in a multiple alignment.
Multiple alignment of 3 sequences:
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: I = [1*log2(1)] + 4.32 = 4.32
Pos2: I = [1/3*log2(1/3)+ 1/3*log2(1/3)+ 1/3*log2(1/3)] + 4.32 = 2.74
Pos3: I = [2/3*log2(2/3)+ 1/3*log2(1/3)] + 4.32 = 3.40
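With a uniform background q_a = 1/20, the information content is log2(20) minus the entropy. The sketch below (not part of the original slides; `information_content` is a name chosen for illustration) computes it directly from the definition for the three-sequence example, using raw frequencies without any low-count correction.

```python
from collections import Counter
from math import log2

def information_content(column, alphabet_size=20):
    """Information content (bits) of one column against a uniform background.

    Computes sum(p_a * log2(p_a / q_a)) with q_a = 1 / alphabet_size.
    """
    counts = Counter(column)
    total = len(column)
    return sum(
        (n / total) * log2((n / total) * alphabet_size)
        for n in counts.values()
    )

# The three-sequence example alignment from the slide.
alignment = ["ALR", "AVR", "AIK"]
columns = ["".join(col) for col in zip(*alignment)]
info = [information_content(col) for col in columns]

for pos, i in enumerate(info, start=1):
    print(f"Pos{pos}: I = {i:.2f} bits")
```

The fully conserved first position reaches the maximum of log2(20), about 4.32 bits, and the values fall as the columns become more variable.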
Logo plots - HowTo
Logo plots - Information Content
Calculate Information Content
For DNA (4 letters): $I = \sum_a p_a \log_2 p_a + \log_2(4)$. The maximal value is 2 bits.
Total height at a position is the Information Content measured in bits. The height of each letter is proportional to the frequency of that letter. A Logo plot is a visualization of a multiple alignment.
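The letter heights in a DNA logo follow directly from these two rules: the stack height is the information content, and each letter gets a share equal to its frequency. Below is a minimal Python sketch; the function `logo_heights` and the four example sites are hypothetical, the columns are assumed to contain only A, C, G and T (no gaps), and no small-sample correction is applied.

```python
from collections import Counter
from math import log2

def logo_heights(column, alphabet="ACGT"):
    """Return {letter: height in bits} for one alignment column.

    Stack height = log2(len(alphabet)) + sum(p * log2(p));
    each letter's height = its frequency * stack height.
    """
    counts = Counter(column)
    total = len(column)
    freqs = {b: counts[b] / total for b in alphabet if counts[b] > 0}
    info = log2(len(alphabet)) + sum(p * log2(p) for p in freqs.values())
    return {b: p * info for b, p in freqs.items()}

# Hypothetical aligned sites, made up for illustration only.
sites = ["CAGG", "TAGG", "CAGA", "TAGT"]
columns = ["".join(col) for col in zip(*sites)]
heights = [logo_heights(col) for col in columns]

for pos, h in enumerate(heights, start=1):
    stack = ", ".join(f"{b}={v:.3f}" for b, v in sorted(h.items()))
    print(f"Pos{pos}: {stack}")
```

In this toy example the first column has two equally frequent letters, so the stack is 1 bit tall and each letter is 0.5 bits high, while the fully conserved second and third columns give a single letter of 2 bits.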
~0.5 each
Completely conserved
Programs to make a Logo plot
WebLogo
  Requires a multiple alignment as input
  Protein or DNA sequences
  More output formats
Blast2Logo
  Requires a fasta file as input
  Only protein sequences
  Runs PSI-BLAST and makes a table of frequencies
  PDF logo plot
WebLogo - http://weblogo.berkeley.edu/
Find important positions
>sp|Q00017|RHA1_ASPAC Rhamnogalacturonan acetylesterase
MKTAALAPLFFLPSALATTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGR
SARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGV
NETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAG
VEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVL
TTTSFEGTCL
What is the next step ?
Find homologous sequences - how ?
Blast or PSI-BLAST
Download sequences
Make a multiple alignment with ClustalW or others
or use the Blast2Logo program
Multiple alignment programs
Blast2logo - http://www.cbs.dtu.dk/biotools/Blast2logo-1.0/
Important positions
Important positions in proteins are conserved
positions => high Information Content.
Conserved for a reason:
Functionally important positions: catalytic residues
Structurally important positions: maintain the correct fold of the protein
Blast2logo
Runs iterative BLAST, i.e. PSI-BLAST
Searches for homologous sequences using
Position Specific Scoring Matrices (PSSM).
1. Iteration - use the Blosum62 scoring matrix
2. Iteration - make a profile of the sequences found in iteration 1
3. Iteration - make a profile of the sequences found in iteration 2
4. Iteration - calculate amino acid frequencies at each position in the
query sequence. Correct for low counts and weight the
sequences such that very similar sequences are down-weighted
Important positions - counting
Example. Where is the active site?
Sequence profiles might show you where to look! The active site could be around S9, G42, N74, and H195
Exercise
Calculate nucleotide frequencies from a multiple alignment of human donor sites
Calculate Entropy and Information content
Draw (by hand) a Logo plot
Use 2 Logo plot programs
Learn to interpret Logo & frequency plots
Active site residues & structural residues