Sequence motifs, information content, logos, and HMM’s
Morten Nielsen,
CBS, BioCentrum,
DTU
Outline
Multiple alignments and sequence motifs
Weight matrices and consensus sequence
– Sequence weighting
– Low (pseudo) counts
Information content
– Sequence logos
– Mutual information
Example from the real world
HMM’s and profile HMM’s
– TMHMM (trans-membrane protein)
– Gene finding
Links to HMM packages
Multiple alignment and sequence motifs
Core
Consensus sequence
Weight matrices
Problems
– Sequence weights
– Low counts
----------MLEFVVEADLPGIKA--------
----------MLEFVVEFALPGIKA--------
----------MLEFVVEFDLPGIAA--------
-------------YLQDSDPDSFQD--------
---GSDTITLPCRMKQFINMWQE----------
---RNQEERLLADLMQNYDPNLR----------
-------YDPNLRPAERDSDVVNVSLK------
----------NVSLKLTLTNLISLNEREEA---
----EREEALTTNVWIEMQWCDYR---------
----------WCDYRLRWDPRDYEGLWVLR---
--LWVLRVPSTMVWRPDIVLEN-----------
------------IVLENNVDGVFEVALYCNVL-
-------------YCNVLVSPDGCIYWLPPAIF
---------PPAIFRSACSISVTYFPFDW----
             *********
             FVVEFDLPG   Consensus
Sequence weighting 1 - Clustering
----------MLEFVVEADLPGIKA--------
----------MLEFVVEFALPGIKA--------
----------MLEFVVEFDLPGIAA--------
-------------YLQDSDPDSFQD--------
---GSDTITLPCRMKQFINMWQE----------
---RNQEERLLADLMQNYDPNLR----------
-------YDPNLRPAERDSDVVNVSLK------
----------NVSLKLTLTNLISLNEREEA---
----EREEALTTNVWIEMQWCDYR---------
----------WCDYRLRWDPRDYEGLWVLR---
--LWVLRVPSTMVWRPDIVLEN-----------
------------IVLENNVDGVFEVALYCNVL-
-------------YCNVLVSPDGCIYWLPPAIF
---------PPAIFRSACSISVTYFPFDW----
             *********
Homologous sequences (the first three rows): weight = 1/n (here 1/3)
Consensus sequence with clustering: YRQELDPLV
Previous consensus sequence: FVVEFDLPG
Sequence weighting 2 - Henikoff & Henikoff
Peptide    W
FVVEADLPG  0.37
FVVEFALPG  0.43
FVVEFDLPG  0.32
YLQDSDPDS  0.59
MKQFINMWQ  0.90
LMQNYDPNL  0.68
PAERDSDVV  0.75
LKLTLTNLI  0.85
VWIEMQWCD  0.84
YRLRWDPRD  0.51
WRPDIVLEN  0.71
VLENNVDGV  0.59
YCNVLVSPD  0.71
FRSACSISV  0.75
• $w'_{aa} = 1/(r \cdot s)$
• r: Number of different aa in a column
• s: Number of occurrences of the aa in the column
• Normalize so $\sum w_{aa} = 1$ for each column
• Sequence weight is the sum of $w_{aa}$ over the columns
Example, first column: r=7 (F, Y, M, L, P, V, W)
F: s=4, w’=1/28, w = 0.055
Y: s=3, w’=1/21, w = 0.073
M, P, W: s=1, w’=1/7, w = 0.218
L, V: s=2, w’=1/14, w = 0.109
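The position-based weighting above can be sketched in a few lines of Python. This is only an illustration of the 1/(r·s) rule applied to the 14 core peptides; the function name `henikoff_weights` is made up here, and no extra normalization is applied beyond the fact that each column's contributions already sum to 1.

```python
from collections import Counter

# The 14 core 9-mers from the alignment.
peptides = [
    "FVVEADLPG", "FVVEFALPG", "FVVEFDLPG", "YLQDSDPDS", "MKQFINMWQ",
    "LMQNYDPNL", "PAERDSDVV", "LKLTLTNLI", "VWIEMQWCD", "YRLRWDPRD",
    "WRPDIVLEN", "VLENNVDGV", "YCNVLVSPD", "FRSACSISV",
]

def henikoff_weights(seqs):
    """Position-based weights: for each sequence, sum 1 / (r * s) over columns."""
    weights = [0.0] * len(seqs)
    for column in zip(*seqs):
        counts = Counter(column)      # s for every residue type in this column
        r = len(counts)               # number of different residues in this column
        for k, aa in enumerate(column):
            weights[k] += 1.0 / (r * counts[aa])
    return weights

weights = henikoff_weights(peptides)
for pep, w in zip(peptides, weights):
    print(f"{pep} {w:.2f}")
```

Worked by hand, FVVEADLPG comes out at 0.37 and MKQFINMWQ at 0.90, in line with the table. The weights sum to the number of columns (9), because every column distributes a total weight of 1.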
Low count correction
--------MLEFVVEADLPGIKA--------
--------MLEFVVEFALPGIKA--------
--------MLEFVVEFDLPGIAA--------
-----------YLQDSDPDSFQD--------
-GSDTITLPCRMKQFINMWQE----------
-RNQEERLLADLMQNYDPNLR----------
-----YDPNLRPAERDSDVVNVSLK------
--------NVSLKLTLTNLISLNEREEA---
--EREEALTTNVWIEMQWCDYR---------
--------WCDYRLRWDPRDYEGLWVLR---
LWVLRVPSTMVWRPDIVLEN-----------
----------IVLENNVDGVFEVALYCNVL-
-----------YCNVLVSPDGCIYWLPPAIF
-------PPAIFRSACSISVTYFPFDW----
           *********
Limited amount of data
Poor sampling of sequence space
I is not found at position P1. Does this mean that I is forbidden?
No! Use a Blosum matrix to estimate the pseudo frequency of I at P1
Low count correction using Blosum matrices
# I L V
L 0.1154 0.3755 0.0962
V 0.1646 0.1303 0.2689
Blosum62 substitution frequencies
Every time, for instance, L or V is observed, I is also likely to occur
Estimate the low (pseudo) count correction using this approach
As more data are included, the pseudo count correction becomes less important
$N_L = 2$, $N_V = 2$, $N_{eff} = 12$ => $f_I = (2 \cdot 0.1154 + 2 \cdot 0.1646)/12 = 0.05$
$p_I^* = (N_{eff} \cdot p_I + \beta \cdot f_I)/(N_{eff} + \beta) = (12 \cdot 0 + 10 \cdot 0.05)/(12 + 10) = 0.02$
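The arithmetic above can be checked with a short Python sketch. The function names are made up for illustration, and the weight of 10 on the pseudo counts is simply the value used in the worked numbers; only the L and V counts from the example are included.

```python
# Blosum62 substitution frequencies for I given the observed residue (from the table).
q_I_given = {"L": 0.1154, "V": 0.1646}

def pseudo_frequency(counts, q_given, n_eff):
    """f_I: expected frequency of I, given the residues actually observed."""
    return sum(n * q_given[aa] for aa, n in counts.items()) / n_eff

def corrected_frequency(p_obs, f_pseudo, n_eff, beta):
    """p* = (Neff * p + beta * f) / (Neff + beta)."""
    return (n_eff * p_obs + beta * f_pseudo) / (n_eff + beta)

n_eff = 12    # effective number of sequences
beta = 10     # weight on the pseudo counts (value used in the example)

f_I = pseudo_frequency({"L": 2, "V": 2}, q_I_given, n_eff)
p_I = corrected_frequency(0.0, f_I, n_eff, beta)
print(f"f_I = {f_I:.2f}, p_I* = {p_I:.2f}")
```

As Neff grows relative to the pseudo count weight, the observed frequency dominates and the correction fades, which is the behaviour described above.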
Information content
Information and entropy
– Conserved amino acid regions contain a high degree of information (high order == low entropy)
– Variable amino acid regions contain a low degree of information (low order == high entropy)
Shannon information
$D = \log_2 N + \sum_i p_i \log_2 p_i$ (for proteins N=20, for DNA N=4)
Conserved residue: $p_A = 1$, $p_{i \neq A} = 0$, so $D = \log_2 N$ (= 4.3 for proteins)
Variable region: $p_A = 0.05$, $p_C = 0.05$, ..., so $D = 0$
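A minimal Python sketch of the Shannon information content, checking the two limiting cases above. The function name is made up for illustration, and terms with p = 0 are taken to contribute nothing.

```python
import math

def information_content(probs, n_symbols=20):
    """D = log2(N) + sum_i p_i * log2(p_i); terms with p_i = 0 contribute 0."""
    entropy_term = sum(p * math.log2(p) for p in probs if p > 0)
    return math.log2(n_symbols) + entropy_term

conserved = [1.0] + [0.0] * 19    # a single residue at this position
variable = [0.05] * 20            # all 20 residues equally likely

print(f"conserved: {information_content(conserved):.1f} bits")
print(f"variable:  {abs(information_content(variable)):.1f} bits")
```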
Sequence logo
Height of a column is equal to D
Relative height of a letter (for example A) is $p_A$
Highly useful tool to visualize sequence motifs
[Figure: MHC class II logo from 10 sequences, with the high information positions marked]
http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
Frequency matrix
Frequencies x 100 (rows are motif positions, columns are amino acids)
     A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
P1   2  1  1  1  1  1  1  1  1  4 16  1  6 15  7  1  2  7 18 13
P2   8 19  1  1  7  2  2  2  1  3 15 13  6  2  1  2  2  7  1  8
P3   3  2  7  2  1 17 13  2  1  8 14  3  1  1  7  7  2  0  1  8
P4   8 13 13 14  1  2 13  2  1  2  3  3  1  7  1  3  7  0  1  7
P5   4  1  7  7  7  1  2  2  1 13 15  2  6  6  1  7  2  7  7  4
P6   5  2  8 23  1  6  3  2  1  3  3  2  1  1  1 13  8  0  1 18
P7   2  1  7 13  1  1  2  2  1  8 14  2  6  1 20  7  2  7  1  3
P8   3  7  7  8  7  1  7  8  1  2  8  2  1  1 13  7  2  7  1  7
P9   3  2  7 19  1  6  2  8  1  9  9  2  1  1  1  7  2  0  1 18
More on Logos
Information content
$D = \sum_i p_i \log_2 (p_i/q_i)$
Shannon, $q_i = 1/N = 0.05$
$D = \sum_i p_i \log_2 p_i - \sum_i p_i \log_2 (1/N)$
$= \log_2 N + \sum_i p_i \log_2 p_i$
Kullback-Leibler, $q_i$ = background frequency
– V/L/A are more frequent than, for instance, C/H/W
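The relation between the two definitions can be checked numerically. The sketch below uses a toy 4-letter alphabet; the distributions `p` and `skewed` are invented purely for illustration and are not taken from the slides.

```python
import math

def relative_entropy(p, q):
    """D = sum_i p_i * log2(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy example (made-up numbers): one letter dominates the column.
p = [0.7, 0.1, 0.1, 0.1]
uniform = [0.25] * 4              # Shannon case, q_i = 1/N
skewed = [0.4, 0.2, 0.2, 0.2]     # hypothetical background: first letter is common

shannon = math.log2(4) + sum(pi * math.log2(pi) for pi in p)
print(f"uniform background: {relative_entropy(p, uniform):.3f} (Shannon: {shannon:.3f})")
print(f"skewed background:  {relative_entropy(p, skewed):.3f}")
```

With a uniform background the two values agree. With a background in which the dominant letter is already common, the same column carries less information, which is the reason for using real background frequencies.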
Mutual information
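$I(i,j) = \sum_{aa_i} \sum_{aa_j} P(aa_i, aa_j) \log \left[ \frac{P(aa_i, aa_j)}{P(aa_i) \cdot P(aa_j)} \right]$
$P(G_1) = 2/9 = 0.22$, ..., $P(V_6) = 4/9 = 0.44$, ...
$P(G_1, V_6) = 2/9 = 0.22$, $P(G_1) \cdot P(V_6) = 8/81 = 0.10$
$\log(0.22/0.10) > 0$
Peptides (P1 is the first residue, P6 the sixth):
ALWGFFPVA
ILKEPVHGV
ILGFVFTLT
LLFGYPVYV
GLSPTVWLS
YMNGTMSQV
GILGFVFTL
WLSLLVPFV
FLPSDFFPS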
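A small Python sketch of the mutual information between two positions, using the nine peptides from this slide. The function name is made up, and the base-2 logarithm is an assumption (the formula only says "log").

```python
import math
from collections import Counter

# The nine peptides from the slide.
peptides = [
    "ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
    "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS",
]

def mutual_information(seqs, i, j):
    """I(i,j) = sum over observed pairs of P(a,b) * log2(P(a,b) / (P(a) * P(b)))."""
    n = len(seqs)
    count_i = Counter(s[i] for s in seqs)
    count_j = Counter(s[j] for s in seqs)
    count_ij = Counter((s[i], s[j]) for s in seqs)
    total = 0.0
    for (a, b), c in count_ij.items():
        joint = c / n
        total += joint * math.log2(joint / ((count_i[a] / n) * (count_j[b] / n)))
    return total

# P1 and P6 are 0-based indices 0 and 5.
mi_p1_p6 = mutual_information(peptides, 0, 5)
print(f"I(P1, P6) = {mi_p1_p6:.2f} bits")
```

With only nine sequences, estimates like this are strongly inflated by sampling noise, so the number mainly illustrates the mechanics.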
Mutual information
[Figure: mutual information for 313 binding peptides and for 313 random peptides]
Weight matrices
Estimate amino acid frequencies from the alignment, including sequence weighting and pseudo counts
Now a weight matrix is given as
$W_{ij} = \log(p_{ij}/q_j)$
Here i is a position in the motif, and j an amino acid
$q_j$ is the background frequency for amino acid j
W is an L x 20 matrix, where L is the motif length
Score a sequence against the weight matrix by looking up and adding L values from the matrix
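The lookup-and-add scoring can be sketched as follows. To keep it short, the example uses a hypothetical 3-position motif over a 4-letter alphabet with made-up frequencies (assumed to already include pseudo counts, so none is zero); the base-2 logarithm is also an assumption.

```python
import math

def build_weight_matrix(freqs, background):
    """W[i][a] = log2(p_ia / q_a) for each motif position i and letter a."""
    return [
        {a: math.log2(p / background[a]) for a, p in column.items()}
        for column in freqs
    ]

def score(sequence, weight_matrix):
    """Look up one matrix value per position and add them."""
    return sum(weight_matrix[i][a] for i, a in enumerate(sequence))

# Hypothetical frequencies, for illustration only.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
freqs = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

W = build_weight_matrix(freqs, background)
print(f"AGT: {score('AGT', W):.2f}")
print(f"CCT: {score('CCT', W):.2f}")
```

A sequence that follows the motif gets a positive score, one that goes against it a negative score, and a position matching the background contributes nothing.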
Example from real life
10 peptides from MHCpep database
Bind to the MHC complex
Relevant for immune system recognition
Estimate sequence motif and weight matrix
Evaluate on 528 peptides
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Example from real life (cont.)
Raw sequence counting
– No sequence weighting
– No pseudo count
– Prediction accuracy 0.45
Sequence weighting
– No pseudo count
– Prediction accuracy 0.5
Example from real life (cont.)
Sequence weighting and pseudo count
– Prediction accuracy 0.60
Motif found on all data (485)
– Prediction accuracy 0.79
Hidden Markov Models
Weight matrices do not deal with insertions and deletions
In alignments, this is done in an ad hoc manner by optimizing the two gap penalties, for the first gap and for gap extension
An HMM is a natural framework in which insertions and deletions are dealt with explicitly
HMM (a simple example)
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Example from A. Krogh
Core region defines the number of states in the HMM (red)
Insertion and deletion statistics are derived from the non-core part of the alignment (blue)
[Figure: HMM for this alignment, with the core of the alignment marked. Each state carries emission probabilities over A, C, G, T, and the arrows carry transition probabilities.]
HMM construction
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
• 5 residues in the gap region: A, 2xC, T, G
• 5 transitions in the gap region
• C out, G out
• A-C, C-T, T out
• Out transition 3/5
• Stay transition 2/5
Align sequence to HMM
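ACA---ATG: $0.8 \times 1 \times 0.8 \times 1 \times 0.8 \times 0.4 \times 1 \times 0.8 \times 1 \times 0.2 = 3.3 \times 10^{-2}$
TCAACTATC: $0.2 \times 1 \times 0.8 \times 1 \times 0.8 \times 0.6 \times 0.2 \times 0.4 \times 0.4 \times 0.4 \times 0.2 \times 0.6 \times 1 \times 1 \times 0.8 \times 1 \times 0.8 = 0.0075 \times 10^{-2}$
ACAC--AGC = $1.2 \times 10^{-2}$
AGA---ATC = $3.3 \times 10^{-2}$
ACCG--ATC = $0.59 \times 10^{-2}$
Consensus:
ACAC--ATC = $4.7 \times 10^{-2}$
ACA---ATC = $13.1 \times 10^{-2}$
Exceptional:
TGCT--AGG = $0.0023 \times 10^{-2}$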
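The products above can be reproduced with a small Python sketch. The emission and transition values are the plain frequencies counted from the five-sequence alignment (no pseudo counts); the variable and function names are made up for illustration, and the code only handles sequences already laid out in the 9-column format used here.

```python
# Emission probabilities of the six core states, counted from the core columns
# of the alignment (columns 1-3 and 7-9).
match_emit = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]
# Insert state: 5 residues in the gap region (A, 2xC, T, G).
insert_emit = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}
TO_INSERT, SKIP_INSERT = 0.6, 0.4   # 3 of 5 sequences enter the insert state
STAY, LEAVE = 0.4, 0.6              # stay 2/5, out 3/5

def path_probability(aligned):
    """Probability of a 9-column aligned sequence; columns 4-6 hold the insertion."""
    core = aligned[:3] + aligned[6:]
    inserted = aligned[3:6].replace("-", "")
    p = 1.0
    for state, base in zip(match_emit, core):
        p *= state.get(base, 0.0)          # unseen base in a core state gives 0
    if inserted:
        p *= TO_INSERT * LEAVE * STAY ** (len(inserted) - 1)
        for base in inserted:
            p *= insert_emit[base]
    else:
        p *= SKIP_INSERT
    return p

for seq in ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "ACAC--ATC", "TGCT--AGG"]:
    print(seq, f"{path_probability(seq):.2e}")
```

Transitions with probability 1 are left out of the product since they do not change the result.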
Align sequence to HMM - Null model
Score depends strongly on length
Null model is a random model. For length L the score is $0.25^L$
Log-odds score for sequence S: $\log\left(P(S)/0.25^L\right)$
ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6
Consensus:
ACAC--ATC = 6.7
ACA---ATC = 6.3
Exceptional:
TGCT--AGG = -0.97
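A short sketch of the log-odds calculation. The natural logarithm is an assumption, chosen because it reproduces the values listed above; the length L counts residues only, not gap characters, and the function name is made up.

```python
import math

def log_odds(p_model, length, alphabet_size=4):
    """Natural-log odds of the HMM probability against a uniform null model."""
    p_null = (1.0 / alphabet_size) ** length
    return math.log(p_model / p_null)

# HMM probabilities written out as the products of the non-trivial factors.
p_aca_atg = 0.8 * 0.8 * 0.8 * 0.4 * 0.8 * 0.2    # ACA---ATG, L = 6
p_aca_atc = 0.8 ** 5 * 0.4                       # ACA---ATC, L = 6

print(f"ACA---ATG: {log_odds(p_aca_atg, 6):.1f}")
print(f"ACA---ATC: {log_odds(p_aca_atc, 6):.1f}")
```

Dividing by the null model removes most of the length dependence, so sequences of different lengths become comparable.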
Note!
HMM’s and weight matrices
Note. In the case of un-gapped alignments HMM’s become simple weight matrices
It might still be useful to use an HMM tool package to estimate a weight matrix
– Sequence weighting
– Pseudo counts
EM55_HUMAN WWQGRVEGSSKESAGLIPSPELQEWRVASMAQSAP--SEAPSCSPFGKKKK-YKDKYLAK
CSKP_HUMAN WWQGKLENSKNGTAGLIPSPELQEWRVACIAMEKTKQEQQASCTWFGKKKKQYKDKYLAK
KAPB_MOUSE -----PENLLIDHQGYIQVTDFGFAKRVKG------------------------------
NRC2_NEUCR -----PENILLHQSGHIMLSDFDLSKQSDPGGKPTMIIGKNGTSTSSLPTIDTKSCIANF

EM55_HUMAN HSSIFDQLDVVSYEEVVRLPAFKRKTLVLIGASGVGRSHIKNALLSQNPEKFVYPVPYTT
CSKP_HUMAN HNAVFDQLDLVTYEEVVKLPAFKRKTLVLLGAHGVGRRHIKNTLITKHPDRFAYPIPHTT
KAPB_MOUSE RTWTLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSG
NRC2_NEUCR RTNSFVGTEEYIAPEVIKGSGHTSAVDWWTLGILIYEMLYGTTPFKGKNRNATFANILRE

EM55_HUMAN RPPRKSEEDGKEYHFISTEEMTRNISANEFLEFGSYQGNMFGTKFETVHQIHKQNKIAIL
CSKP_HUMAN RPPKKDEENGKNYYFVSHDQMMQDISNNEYLEYGSHEDAMYGTKLETIRKIHEQGLIAIL
KAPB_MOUSE KVRFPSHF-----SSDLKDLLRNLLQVDLTKRFGNLKNGVSDIKTHKWFATTDWIAIYQR
NRC2_NEUCR DIPFPDHAGAPQISNLCKSLIRKLLIKDENRRLG-ARAGASDIKTHPFFRTTQWALI--R

EM55_HUMAN NNGVDETLKKLQEAFDQACSSPQWVPVSWVY
CSKP_HUMAN NNEIDETIRHLEEAVELVCTAPQWVPVSWVY
KAPB_MOUSE EKCGKEFCEF---------------------
NRC2_NEUCR ENAVDPFEEFNSVTLHHDGDEEYHSDAYEKR
Profile HMM’s
[Figure: profile HMM with insertion and deletion states]
Profile HMM’s
All M/D pairs must be visited once
TMHMM (trans-membrane HMM) (Sonnhammer, von Heijne, and Krogh)
Models the TM length distribution. This is a strength of HMM’s and is difficult to do in an alignment.
Combination of HMM’s - Gene finding
x cccxxxxxxxxATGccc cccTAAxxxxxxxx
Inter-genic region
Region around start codon
Coding region
Region around stop codon
Start codon
Stop codon
HMM packages
HMMER (http://hmmer.wustl.edu/)
– S.R. Eddy, WashU St. Louis. Freely available.
SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)
– R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users.
META-MEME (http://metameme.sdsc.edu/)
– William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search.
NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html)
– Freely available to academia, nominal license fee for commercial users.
– Allows HMM architecture construction.
Simple Hmmer command
hmmbuild --gapmax 0.0 --fast A2.hmmer A2.fsa
hmmbuild - build a hidden Markov model from an alignment
HMMER 2.2g (August 2001)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Alignment file: A2.fsa
File format: a2m
Search algorithm configuration: Multiple domain (hmmls)
Model construction strategy: Fast/ad hoc (gapmax 0.0)
Null model used: (default)
Sequence weighting method: G/S/C tree weights
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Alignment: #1
Number of sequences: 232
Number of columns: 9
Determining effective sequence number ... done. [192]
Weighting sequences heuristically ... done.
Constructing model architecture ... done.
Converting counts to probabilities ... done.
Setting model name, etc. ... done. [A2.fasta]
Constructed a profile HMM (length 9)
Average score: -6.42 bits
Minimum score: -15.47 bits
Maximum score: -0.84 bits
Std. deviation: 2.72 bits
>HLA-A.0201 16 Example_for_Ligand
SLLPAIVEL
>HLA-A.0201 16 Example_for_Ligand
YLLPAIVHI
>HLA-A.0201 16 Example_for_Ligand
TLWVDPYEV
>HLA-A.0201 16 Example_for_Ligand
SXPSGGXGV
>HLA-A.0201 16 Example_for_Ligand
GLVPFLVSV