Sequence motifs, information content, logos, and HMM’s

Sequence motifs, information content, logos, and HMM’s

Morten Nielsen,

CBS, BioCentrum,

DTU

Outline

Multiple alignments and sequence motifs Weight matrices and consensus sequence

– Sequence weighting– Low (pseudo) counts

Information content– Sequence logos– Mutual information

Example from the real world HMM’s and profile HMM’s

– TMHMM (trans-membrane protein) – Gene finding

Links to HMM packages

Multiple alignment and sequence motifs

Core Consensus sequence Weight matrices Problems

– Sequence weights– Low counts

----------MLEFVVEADLPGIKA------------------MLEFVVEFALPGIKA------------------MLEFVVEFDLPGIAA---------------------YLQDSDPDSFQD-----------GSDTITLPCRMKQFINMWQE-------------RNQEERLLADLMQNYDPNLR-----------------YDPNLRPAERDSDVVNVSLK----------------NVSLKLTLTNLISLNEREEA-------EREEALTTNVWIEMQWCDYR-------------------WCDYRLRWDPRDYEGLWVLR-----LWVLRVPSTMVWRPDIVLEN-----------------------IVLENNVDGVFEVALYCNVL--------------YCNVLVSPDGCIYWLPPAIF---------PPAIFRSACSISVTYFPFDW---- ********* FVVEFDLPG

Consensus

Sequences weighting 1 - Clustering

----------MLEFVVEADLPGIKA------------------MLEFVVEFALPGIKA------------------MLEFVVEFDLPGIAA---------------------YLQDSDPDSFQD-----------GSDTITLPCRMKQFINMWQE-------------RNQEERLLADLMQNYDPNLR-----------------YDPNLRPAERDSDVVNVSLK----------------NVSLKLTLTNLISLNEREEA-------EREEALTTNVWIEMQWCDYR-------------------WCDYRLRWDPRDYEGLWVLR-----LWVLRVPSTMVWRPDIVLEN-----------------------IVLENNVDGVFEVALYCNVL--------------YCNVLVSPDGCIYWLPPAIF---------PPAIFRSACSISVTYFPFDW----

*********

} Homologous sequencesWeight = 1/n (1/3)

Consensus sequence

YRQELDPLV

Previous

FVVEFDLPG

Sequences weighting 2 - (Henikoff & Henikoff)

W FVVEADLPG 0.37FVVEFALPG 0.43FVVEFDLPG 0.32YLQDSDPDS 0.59MKQFINMWQ 0.90LMQNYDPNL 0.68PAERDSDVV 0.75LKLTLTNLI 0.85VWIEMQWCD 0.84YRLRWDPRD 0.51WRPDIVLEN 0.71VLENNVDGV 0.59YCNVLVSPD 0.71FRSACSISV 0.75

• Waa’ = 1/rs• r: Number of different aa in a column• s: Number occurrences• Normalize so Waa= 1 for each column• Sequence weight is sum of Waa

F: r=7 (FYMLPVW), s=4 w’=1/28, w = 0.055Y: s=3, w`=1/21, w = 0.073M,P,W: s=1, w’=1/7, w = 0.218L,V: s=2, w’=1/14, w = 0.109

Low count correction

--------MLEFVVEADLPGIKA----------------MLEFVVEFALPGIKA----------------MLEFVVEFDLPGIAA-------------------YLQDSDPDSFQD---------GSDTITLPCRMKQFINMWQE-----------RNQEERLLADLMQNYDPNLR---------------YDPNLRPAERDSDVVNVSLK--------------NVSLKLTLTNLISLNEREEA-----EREEALTTNVWIEMQWCDYR-----------------WCDYRLRWDPRDYEGLWVLR---LWVLRVPSTMVWRPDIVLEN---------------------IVLENNVDGVFEVALYCNVL------------YCNVLVSPDGCIYWLPPAIF-------PPAIFRSACSISVTYFPFDW---- *********

Limited number of data Poor sampling of

sequence space I is not found at position

P1. Does this mean that I is forbidden?

No! Use Blosum matrix to estimate pseudo frequency of I

P1

Low count correction using Blosum matrices

# I L V

L 0.1154 0.3755 0.0962

V 0.1646 0.1303 0.2689

Blosum62 substitution frequencies Every time for instance L/V is observed, I is also likely to occur

Estimate low (pseudo) count correction using this approach

As more data are included the pseudo count correction becomes less important

NL = 2, NV=2, Neff=12 =>fI = (2*0.1154 + 2*0.1646)/12 = 0.05

pI* = (Neff * pI + * fI)/(Neff+) = (12*0 + 10*0.05)/(12+10) = 0.02

Information content

Information and entropy– Conserved amino acid regions contain high degree of

information (high order == low entropy)– Variable amino acid regions contain low degree of information

(low order == high entropy) Shannon information

D = log2(N) + pi log2 pi (for proteins N=20, DNA N=4)

Conserved residue pA=1, pi<>A=0, D = log2(N) ( = 4.3 for proteins)

Variable region pA=0.05, pC=0.05, .., D = 0

Sequence logo

Height of a column equal to D Relative height of a letter is

pA

Highly useful tool to visualize sequence motifs

High information position

MHC class IILogo from 10 sequences

http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html

Frequency matrix

A R N D C Q E G H I L K M F P S T W Y V 2 1 1 1 1 1 1 1 1 4 16 1 6 15 7 1 2 7 18 13 8 19 1 1 7 2 2 2 1 3 15 13 6 2 1 2 2 7 1 8 3 2 7 2 1 17 13 2 1 8 14 3 1 1 7 7 2 0 1 8 8 13 13 14 1 2 13 2 1 2 3 3 1 7 1 3 7 0 1 7 4 1 7 7 7 1 2 2 1 13 15 2 6 6 1 7 2 7 7 4 5 2 8 23 1 6 3 2 1 3 3 2 1 1 1 13 8 0 1 18 2 1 7 13 1 1 2 2 1 8 14 2 6 1 20 7 2 7 1 3 3 7 7 8 7 1 7 8 1 2 8 2 1 1 13 7 2 7 1 7 3 2 7 19 1 6 2 8 1 9 9 2 1 1 1 7 2 0 1 18

Frequencies x 100

More on Logos

Information contentD = pi log2 (pi/qi)

Shannon, qi = 1/N = 0.05

D = pi log2 (pi) - pi log2 (1/N)

= log2 N - pi log2 (pi)

Kullback-Leibler, qi = background frequency– V/L/A more frequent than for instance C/H/W

Mutual information

I(i,j) = aai aaj

P(aai, aaj) *

log[P(aai, aaj)/P(aai)*P(aaj)]

P(G1) = 2/9 = 0.22, ..P(V6) = 4/9 = 0.44,..P(G1,V6) = 2/9 = 0.22, P(G1)*P(V6) = 8/81 = 0.10

log(0.22/0.10) > 0

ALWGFFPVAILKEPVHGVILGFVFTLTLLFGYPVYVGLSPTVWLSYMNGTMSQV

GILGFVFTL WLSLLVPFVFLPSDFFPS

P1 P6

Mutual information

313 binding peptides 313 random peptides

Weight matrices

Estimate amino acid frequencies from alignment inc. sequence weighting and pseudo counts

Now a weight matrix is given as

Wij = log(pij/qj) Here i is a position in the motif, and j an amino acid.

qj is the background frequency for amino acid j. W is a L x 20 matrix, L is motif length Score sequences to weight matrix by looking up and

adding L values from matrix

Example from real life

10 peptides from MHCpep database

Bind to the MHC complex

Relevant for immune system recognition

Estimate sequence motif and weight matrix

Evaluate on 528 peptides

ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV

Example from real life (cont.)

Raw sequence counting

– No sequence weighting – No pseudo count– Prediction accuracy 0.45

Sequence weighting– No pseudo count– Prediction accuracy 0.5

Example from real life (cont.)

Sequence weighting and pseudo count

– Prediction accuracy 0.60

Motif found on all data (485)

– Prediction accuracy 0.79

Hidden Markov Models

Weight matrices do not deal with insertions and deletions

In alignments, this is done in an ad-hoc manner by optimization of the two gap penalties for first gap and gap extension

HMM is a natural frame work where insertions/deletions are dealt with explicitly

HMM (a simple example)

ACA---ATG

TCAACTATC

ACAC--AGC

AGA---ATC

ACCG--ATC

Example from A. Krogh Core region defines the

number of states in the HMM (red)

Insertion and deletion statistics is derived from the non-core part of the alignment (blue)

Core of alignment

.8

.2

ACGT

ACGT

ACGT

ACGT

ACGT

ACGT.8

.8 .8.8

.2.2.2

.2

1

ACGT .2

.2

.2

.4

1. .4 1. 1.1.

.6.6

.4

HMM construction

ACA---ATG

TCAACTATC

ACAC--AGC

AGA---ATC

ACCG--ATC

• 5 matches. A, 2xC, T, G• 5 transitions in gap region

• C out, G out• A-C, C-T, T out• Out transition 3/5• Stay transition 2/5

ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2 = 3.3x10-2

Align sequence to HMM

ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2 = 3.3x10-2

TCAACTATC 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8 = 0.0075x10-2

ACAC--AGC = 1.2x10-2

AGA---ATC = 3.3x10-2

ACCG--ATC = 0.59x10-2

Consensus:

ACAC--ATC = 4.7x10-2, ACA---ATC = 13.1x10-2

Exceptional:

TGCT--AGG = 0.0023x10-2

Align sequence to HMM - Null model

Score depends strongly on length

Null model is a random model. For length L the score is

0.25L

Log-odd score for sequence S

Log( P(S)/0.25L)

ACA---ATG = 4.9

TCAACTATC = 3.0 ACAC--AGC = 5.3AGA---ATC = 4.9ACCG--ATC = 4.6Consensus:ACAC--ATC = 6.7 ACA---ATC = 6.3Exceptional:TGCT--AGG = -0.97

Note!

HMM’s and weight matrices

Note. In the case of un-gapped alignments HMM’s become simple weight matrices

It still might be useful to use a HMM tool package to estimate a weight matrix– Sequence weighting– Pseudo counts

EM55_HUMAN WWQGRVEGSSKESAGLIPSPELQEWRVASMAQSAP--SEAPSCSPFGKKKK-YKDKYLAKCSKP_HUMAN WWQGKLENSKNGTAGLIPSPELQEWRVACIAMEKTKQEQQASCTWFGKKKKQYKDKYLAKKAPB_MOUSE -----PENLLIDHQGYIQVTDFGFAKRVKG------------------------------NRC2_NEUCR -----PENILLHQSGHIMLSDFDLSKQSDPGGKPTMIIGKNGTSTSSLPTIDTKSCIANF

EM55_HUMAN HSSIFDQLDVVSYEEVVRLPAFKRKTLVLIGASGVGRSHIKNALLSQNPEKFVYPVPYTTCSKP_HUMAN HNAVFDQLDLVTYEEVVKLPAFKRKTLVLLGAHGVGRRHIKNTLITKHPDRFAYPIPHTTKAPB_MOUSE RTWTLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGNRC2_NEUCR RTNSFVGTEEYIAPEVIKGSGHTSAVDWWTLGILIYEMLYGTTPFKGKNRNATFANILRE

EM55_HUMAN RPPRKSEEDGKEYHFISTEEMTRNISANEFLEFGSYQGNMFGTKFETVHQIHKQNKIAILCSKP_HUMAN RPPKKDEENGKNYYFVSHDQMMQDISNNEYLEYGSHEDAMYGTKLETIRKIHEQGLIAILKAPB_MOUSE KVRFPSHF-----SSDLKDLLRNLLQVDLTKRFGNLKNGVSDIKTHKWFATTDWIAIYQRNRC2_NEUCR DIPFPDHAGAPQISNLCKSLIRKLLIKDENRRLG-ARAGASDIKTHPFFRTTQWALI--R

EM55_HUMAN NNGVDETLKKLQEAFDQACSSPQWVPVSWVYCSKP_HUMAN NNEIDETIRHLEEAVELVCTAPQWVPVSWVYKAPB_MOUSE EKCGKEFCEF---------------------NRC2_NEUCR ENAVDPFEEFNSVTLHHDGDEEYHSDAYEKR

Profile HMM’s

Insertion

Deletion

Profile HMM’s

All M/D pairs must be visited once

TMHMM (trans-membrane HMM)(Sonnhammer, von Heijne, and Krogh)

Model TM length distribution.Power of HMM.Difficult in alignment.

Combination of HMM’s -Gene finding

x cccxxxxxxxxATGccc cccTAAxxxxxxxx

Inter-genicregion

Region aroundstart codon

Coding region

Region aroundstop codon

Start codon

Stop codon

HMM packages

HMMER (http://hmmer.wustl.edu/)– S.R. Eddy, WashU St. Louis. Freely available.

SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)– R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa

Cruz. Freely available to academia, nominal license fee for commercial users.

META-MEME (http://metameme.sdsc.edu/)– William Noble Grundy, UC San Diego. Freely available. Combines

features of PSSM search and profile HMM search.

NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html)– Freely available to academia, nominal license fee for commercial users.– Allows HMM architecture construction.

Simple Hmmer command

hmmbuild --gapmax 0.0 --fast A2.hmmer A2.fsa

hmmbuild - build a hidden Markov model from an alignmentHMMER 2.2g (August 2001)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Alignment file: A2.fsa

File format: a2mSearch algorithm configuration: Multiple domain (hmmls)

Model construction strategy: Fast/ad hoc (gapmax 0.0)Null model used: (default)

Sequence weighting method: G/S/C tree weights- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Alignment: #1Number of sequences: 232

Number of columns: 9Determining effective sequence number ... done. [192]

Weighting sequences heuristically ... done.Constructing model architecture ... done.Converting counts to probabilities ... done.

Setting model name, etc. ... done. [A2.fasta]Constructed a profile HMM (length 9)

Average score: -6.42 bitsMinimum score: -15.47 bitsMaximum score: -0.84 bits

Std. deviation: 2.72 bits

>HLA-A.0201 16 Example_for_LigandSLLPAIVEL>HLA-A.0201 16 Example_for_LigandYLLPAIVHI>HLA-A.0201 16 Example_for_LigandTLWVDPYEV>HLA-A.0201 16 Example_for_LigandSXPSGGXGV>HLA-A.0201 16 Example_for_LigandGLVPFLVSV

Documents

Sequence motifs, information content, logos, and HMM’s