4. Profile HMMs - eclass.uoa.gr


Profile HMMs for sequence families

1. So far we have concentrated on the intrinsic properties of single sequences, such as CpG islands.
2. However, functional biological sequences typically come in families, so what we are after is the relationship of an individual sequence to a sequence family.
3. A multiple sequence alignment can show how the sequences in a family relate to each other.
4. Some positions in a multiple sequence alignment are more conserved than others (e.g. helices as opposed to loop regions).
5. Therefore, when identifying new family members, it is desirable to concentrate on the conserved features.
6. For this task we will discuss a special type of HMM that is well suited to modelling multiple alignments; we will call these profile HMMs.
7. Profile HMMs are the most popular application of HMMs in molecular biology.


Position specific score matrix

A common feature of protein multiple sequence alignments is that gaps tend to line up with each other, leaving solid blocks with no insertions or deletions.

A probabilistic model for a conserved region (block) specifies independent probabilities e_i(a) of observing amino acid a in position i. The probability of a new sequence x under this model M is:

$$P(x \mid M) = \prod_{i=1}^{L} e_i(x_i)$$

where L is the length of the block. To test for family membership we evaluate, as usual, the log-odds ratio against the background frequencies q_a:

$$S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q_{x_i}}$$

The values log(e_i(a)/q_a) behave like elements in a score matrix s(a, b), where the second index is the position i rather than an amino acid b.

Such an approach is known as a position specific score matrix (PSSM).
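The PSSM log-odds score can be sketched in a few lines of Python. This is only an illustration: the function name `pssm_log_odds`, the two-letter alphabet and all probabilities are made up for the example and do not come from the slides.

```python
import math

def pssm_log_odds(x, emissions, background):
    """Return S = sum_i log(e_i(x_i) / q_{x_i}) for a block of length L = len(emissions)."""
    if len(x) != len(emissions):
        raise ValueError("sequence length must equal the block length L")
    score = 0.0
    for i, residue in enumerate(x):
        # One PSSM entry: log-odds of this residue at position i.
        score += math.log(emissions[i][residue] / background[residue])
    return score

# Toy two-letter alphabet and a block of length 3 (invented numbers).
background = {"A": 0.5, "B": 0.5}
emissions = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.5, "B": 0.5},
    {"A": 0.2, "B": 0.8},
]

print(pssm_log_odds("AAB", emissions, background))  # positive: fits the block model
print(pssm_log_odds("BAA", emissions, background))  # negative: fits the background better
```

A positive score means the sequence is better explained by the block model than by the background distribution; in practice the entries log(e_i(a)/q_a) would be precomputed once and looked up per position.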


Profile HMMs: αs and es


Profile HMMs: transition structure

• Match
• Insertion
• Deletion
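As a hedged illustration of how these three state types are wired together, the sketch below enumerates the transitions of a profile HMM in the conventional layout, where match M_j, insert I_j and delete D_j all lead to M_{j+1}, I_j and D_{j+1}. The state names, the Begin state written as `M0` and the End state written as `M{n_match+1}` are assumptions made for the example, since the slide's diagram is not preserved.

```python
def profile_hmm_transitions(n_match):
    """List (source, target) pairs for a profile HMM with n_match match states.

    Assumed layout: M0 is Begin, M{n_match+1} is End, insert states
    I0..I{n_match}, delete states D1..D{n_match}.
    """
    transitions = []
    for j in range(n_match + 1):
        sources = [f"M{j}", f"I{j}"]
        if j >= 1:
            sources.append(f"D{j}")  # there is no D0
        for src in sources:
            transitions.append((src, f"M{j + 1}"))      # next match state (or End)
            transitions.append((src, f"I{j}"))          # insertion after position j
            if j + 1 <= n_match:
                transitions.append((src, f"D{j + 1}"))  # skip match position j+1
    return transitions

for src, dst in profile_hmm_transitions(2):
    print(f"{src} -> {dst}")
```

Each state has at most three outgoing transitions, which is what keeps dynamic programming over the model cheap.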


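Profile HMMs: Viterbi

Let V^M_j(i) be the log-odds score of the best path matching subsequence x_1…x_i to the submodel up to state j, ending with x_i being emitted by state M_j (match).

Similarly, V^I_j(i) and V^D_j(i) are the scores of the best paths ending in state I_j (insertion) and D_j (deletion), respectively.

Then we can write, in the standard form, with a denoting transition probabilities, e emission probabilities and q background frequencies:

$$V^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max \begin{cases} V^M_{j-1}(i-1) + \log a_{M_{j-1}M_j} \\ V^I_{j-1}(i-1) + \log a_{I_{j-1}M_j} \\ V^D_{j-1}(i-1) + \log a_{D_{j-1}M_j} \end{cases}$$

$$V^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max \begin{cases} V^M_j(i-1) + \log a_{M_jI_j} \\ V^I_j(i-1) + \log a_{I_jI_j} \\ V^D_j(i-1) + \log a_{D_jI_j} \end{cases}$$

$$V^D_j(i) = \max \begin{cases} V^M_{j-1}(i) + \log a_{M_{j-1}D_j} \\ V^I_{j-1}(i) + \log a_{I_{j-1}D_j} \\ V^D_{j-1}(i) + \log a_{D_{j-1}D_j} \end{cases}$$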
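The recurrences for V^M, V^I and V^D can be sketched as a small dynamic program. This is a minimal illustration under stated assumptions, not the slides' own code: the state names (`M0` for Begin, `M{L+1}` for End), the dictionary-based parameters, the helper `toy_transitions` and every number below are invented for the example, and the alignment is global (the whole sequence against the whole model).

```python
import math

NEG_INF = float("-inf")

def profile_viterbi(x, match_scores, insert_scores, log_a):
    """Best log-odds score of aligning all of x to a profile HMM.

    match_scores[j-1][c]: log(e_Mj(c) / q_c) for match states j = 1..L
    insert_scores[j][c]:  log(e_Ij(c) / q_c) for insert states j = 0..L
    log_a[(s, t)]:        log transition probability between named states
    """
    L, n = len(match_scores), len(x)

    def t(src, dst):
        return log_a.get((src, dst), NEG_INF)  # missing transition = forbidden

    # V[kind][j][i]: best score for x[:i] with the path ending in state kind_j
    V = {k: [[NEG_INF] * (n + 1) for _ in range(L + 1)] for k in "MID"}
    V["M"][0][0] = 0.0  # Begin state, nothing emitted yet
    for j in range(L + 1):
        for i in range(n + 1):
            if j >= 1 and i >= 1:  # M_j emits x_i, coming from column j-1
                V["M"][j][i] = match_scores[j - 1][x[i - 1]] + max(
                    V[k][j - 1][i - 1] + t(f"{k}{j - 1}", f"M{j}") for k in "MID")
            if i >= 1:             # I_j emits x_i, staying in column j
                V["I"][j][i] = insert_scores[j][x[i - 1]] + max(
                    V[k][j][i - 1] + t(f"{k}{j}", f"I{j}") for k in "MID")
            if j >= 1:             # D_j is silent, coming from column j-1
                V["D"][j][i] = max(
                    V[k][j - 1][i] + t(f"{k}{j - 1}", f"D{j}") for k in "MID")
    return max(V[k][L][n] + t(f"{k}{L}", f"M{L + 1}") for k in "MID")

def toy_transitions(L):
    """Made-up transition probabilities, identical at every position
    (not renormalised at the last position, where no delete state follows)."""
    probs = {"M": {"M": 0.8, "I": 0.1, "D": 0.1},
             "I": {"M": 0.5, "I": 0.4, "D": 0.1},
             "D": {"M": 0.8, "I": 0.1, "D": 0.1}}
    log_a = {}
    for j in range(L + 1):
        for s in "MID":
            if s == "D" and j == 0:
                continue  # there is no D0
            log_a[(f"{s}{j}", f"M{j + 1}")] = math.log(probs[s]["M"])
            log_a[(f"{s}{j}", f"I{j}")] = math.log(probs[s]["I"])
            if j < L:
                log_a[(f"{s}{j}", f"D{j + 1}")] = math.log(probs[s]["D"])
    return log_a

L = 2
match_scores = [
    {"A": math.log(0.9 / 0.5), "B": math.log(0.1 / 0.5)},  # M1 prefers A
    {"A": math.log(0.2 / 0.5), "B": math.log(0.8 / 0.5)},  # M2 prefers B
]
insert_scores = [{"A": 0.0, "B": 0.0} for _ in range(L + 1)]  # insert emissions = background
log_a = toy_transitions(L)
print(profile_viterbi("AB", match_scores, insert_scores, log_a))
```

Setting the insert scores to zero corresponds to insert states that emit with the background distribution, so insertions are penalised only through their transition probabilities. A traceback of the arg-max choices, omitted here for brevity, would recover the alignment itself.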


Profile HMMs: model construction

[Figure sequence not preserved; the only surviving labels are the Begin and End states.]

HMMER

hmmalign - align sequences to a profile HMM
hmmbuild - construct profile HMM(s) from multiple sequence alignment(s)
hmmconvert - convert profile file to a HMMER format
hmmemit - sample sequences from a profile HMM
hmmfetch - retrieve profile HMM(s) from a file
hmmpress - prepare an HMM database for hmmscan
hmmscan - search sequence(s) against a profile database
hmmsearch - search profile(s) against a sequence database
hmmsim - collect score distributions on random sequences
hmmstat - display summary statistics for a profile file
jackhmmer - iteratively search sequence(s) against a protein database
phmmer - search protein sequence(s) against a protein sequence database

SOURCE: http://hmmer.janelia.org/
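A typical use of these programs is to build a profile with hmmbuild and then search a sequence database with hmmsearch. The sketch below only assembles the two command lines. It assumes the HMMER3 argument order (`hmmbuild <hmmfile_out> <msafile>`, `hmmsearch <hmmfile> <seqdb>`) and the `--tblout` option of hmmsearch; all file names are hypothetical placeholders, and the helper names are invented for the example.

```python
import shutil
import subprocess

def hmmer_commands(msa, hmm, seqdb, table):
    """Argument lists for a build-then-search run (file names are placeholders)."""
    return [
        ["hmmbuild", hmm, msa],                        # profile HMM from an alignment
        ["hmmsearch", "--tblout", table, hmm, seqdb],  # search a sequence database
    ]

def run_if_available(commands):
    """Run the commands only if every program is on PATH; return True if they ran."""
    if any(shutil.which(cmd[0]) is None for cmd in commands):
        return False
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return True

commands = hmmer_commands("family.sto", "family.hmm", "proteins.fasta", "hits.tbl")
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_if_available(commands)` would execute the two steps when HMMER is installed and the input files exist; it is deliberately not invoked here so that the example runs anywhere.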

The number of match states in the model.

Symbol alphabet type.

Reference annotation flag. Reference column annotation is picked up from a Stockholm alignment file’s #=GC RF line.

Consensus structure annotation flag. Consensus structure annotation is picked up from a Stockholm file’s #=GC SS_cons line.

Map annotation flag. The HMM/alignment map annotates each match state with the index of the alignment column from which it came. It can be used for quickly mapping any subsequent HMM alignment back to the original multiple alignment, via the model.

The number of sequences that the HMM was trained on.

The effective total number of sequences determined by hmmbuild during sequence weighting, for combining observed counts with Dirichlet prior information in parameterizing the model.

This number is calculated from the training sequence data, and used in conjunction with the alignment map information to verify that a given alignment is indeed the alignment that the map is for.
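The header annotations above can be read programmatically. The following is a minimal sketch, assuming the header is stored as one `TAG value` pair per line and that the tags are `LENG`, `ALPH`, `RF`, `CS`, `MAP`, `NSEQ`, `EFFN` and `CKSUM` (the tag names are an assumption about the HMMER3 save-file layout; they do not appear in the text above). The header excerpt and all values in it are made up for illustration.

```python
# Hypothetical header excerpt: tag names are assumed, values are invented.
HEADER = """\
NAME  example
LENG  149
ALPH  amino
RF    no
CS    no
MAP   yes
NSEQ  4
EFFN  0.964844
CKSUM 2027839109
"""


def is_yes(value):
    """Interpret a yes/no annotation flag as a boolean."""
    return value == "yes"


# How each known tag is converted; unknown tags are kept as strings.
CONVERTERS = {
    "LENG": int,      # number of match states
    "NSEQ": int,      # number of training sequences
    "EFFN": float,    # effective sequence number after weighting
    "CKSUM": int,     # training alignment checksum
    "RF": is_yes,     # reference annotation flag
    "CS": is_yes,     # consensus structure annotation flag
    "MAP": is_yes,    # map annotation flag
}


def parse_header(text):
    """Return a dict mapping each header tag to its converted value."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, value = line.partition(" ")
        fields[tag] = CONVERTERS.get(tag, str)(value.strip())
    return fields


header = parse_header(HEADER)
print(header["LENG"], header["ALPH"], header["MAP"], header["EFFN"])
```

Keeping the converters in a table makes it easy to see at a glance which fields are counts, which are flags, and which are free text.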

Flags the start of the main model section.

These are the model’s overall average match state emission probabilities, which are used as a background residue composition in the “filter null” model.

Statistical parameters needed for E-value calculations.
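To illustrate how stored statistical parameters can turn a bit score into an E-value, here is a minimal sketch. It assumes that each parameter pair is a location and a slope, that optimal-alignment scores follow a Gumbel distribution, and that summed (Forward-type) scores follow an exponential tail; these distributional forms, the function names and the numeric values are assumptions for illustration and are not stated in the text above.

```python
import math


def gumbel_pvalue(score, mu, lam):
    """P(S >= score) for a Gumbel distribution with location mu and slope lam."""
    # 1 - exp(-exp(-lam * (score - mu))), written with expm1 for accuracy.
    return -math.expm1(-math.exp(-lam * (score - mu)))


def exponential_tail_pvalue(score, tau, lam):
    """P(S >= score) for an exponential tail that starts at tau."""
    if score <= tau:
        return 1.0
    return math.exp(-lam * (score - tau))


def evalue(pvalue, n_targets):
    """Expected number of false hits at this score in n_targets comparisons."""
    return pvalue * n_targets


# Hypothetical parameters and score, chosen only to exercise the functions.
p = gumbel_pvalue(score=12.0, mu=-9.0, lam=0.71)
print(p, evalue(p, n_targets=100_000))
```

The E-value scales linearly with the number of comparisons, so the same score is less significant in a larger database.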

Contains information for the core model’s BEGIN node. This is stored as model node 0, and match state 0 is treated as the BEGIN state.

The begin state is mute, so there are no match emission probabilities. The first line is the insert 0 emissions. The second line contains the transitions from the begin state and insert state 0.

These seven numbers are: B → M1, B → I0, B → D1; I0 → M1, I0 → I0; then a 0.0 and a ‘*’, because by convention, nonexistent transitions from the nonexistent delete state 0 are set to log 1 = 0 and log 0 = -∞ = ‘*’.
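The seven transition fields of the BEGIN node can be decoded as follows. This is a minimal sketch, assuming the numbers are stored as negative natural logarithms of probabilities (so 0.0 means probability 1 and ‘*’ means probability 0). The example line is hypothetical, and the labels `D0->M1` and `D0->D1` for the last two fields are names chosen here for the nonexistent delete state 0 transitions.

```python
import math

# Order of the seven fields on the BEGIN node's transition line.
LABELS = ["B->M1", "B->I0", "B->D1", "I0->M1", "I0->I0", "D0->M1", "D0->D1"]

# Hypothetical transition line for node 0 (values invented for illustration).
BEGIN_LINE = "0.01083  4.92694  5.64929  0.61958  0.77255  0.00000        *"


def to_prob(field):
    """Convert a stored field to a probability; '*' stands for log 0."""
    if field == "*":
        return 0.0
    return math.exp(-float(field))


def parse_begin_transitions(line):
    """Return a dict mapping each transition label to its probability."""
    fields = line.split()
    if len(fields) != len(LABELS):
        raise ValueError("expected seven transition fields")
    return {label: to_prob(f) for label, f in zip(LABELS, fields)}


t = parse_begin_transitions(BEGIN_LINE)
for label in LABELS:
    print(f"{label:7s} {t[label]:.4f}")

# Transitions leaving the same state should sum to (approximately) 1.
print(t["B->M1"] + t["B->I0"] + t["B->D1"])
print(t["I0->M1"] + t["I0->I0"])
```

Checking that the outgoing probabilities of each state sum to 1 is a quick sanity test that the fields were read in the right order.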

The MAP annotation, the RF annotation, and the CS annotation for this node.

Line 1: Match emission line

Line 2: Insert emission line

Line 3: State transition line
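The three lines per node can be parsed as sketched below. Several things are assumptions made for this illustration: a four-letter alphabet, values stored as negative natural log probabilities, and a match emission line laid out as node index, emissions, then the MAP, RF and CS annotations in that order. The node block itself is hypothetical.

```python
import math

K = 4  # alphabet size assumed for this sketch

# Hypothetical three-line block for node 1 (values invented for illustration).
NODE_LINES = [
    "      1   1.20397  1.60944  1.20397  1.60944      1 x -",
    "          1.38629  1.38629  1.38629  1.38629",
    "          0.03046  4.19971  4.19971  0.61958  0.77255  0.48576  0.95510",
]


def to_prob(field):
    """Convert a stored field to a probability; '*' stands for log 0."""
    return 0.0 if field == "*" else math.exp(-float(field))


def parse_node(lines, k=K):
    """Parse the match emission, insert emission and transition lines."""
    match_fields = lines[0].split()
    index = int(match_fields[0])
    match = [to_prob(f) for f in match_fields[1:1 + k]]
    map_col, rf, cs = match_fields[1 + k:1 + k + 3]
    return {
        "index": index,
        "match": match,
        # '-' is taken to mean that no map annotation is present.
        "map": None if map_col == "-" else int(map_col),
        "rf": rf,
        "cs": cs,
        "insert": [to_prob(f) for f in lines[1].split()],
        "transitions": [to_prob(f) for f in lines[2].split()],
    }


node = parse_node(NODE_LINES)
print(node["index"], node["map"], node["rf"], node["cs"])
print([round(p, 3) for p in node["match"]])
```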

score: The bit score for this domain.

The biased composition (null2) score correction that was applied to the domain bit score.

The “conditional E-value”, a permissive measure of how reliable this particular domain may be. The conditional E-value is calculated on a smaller search space than the independent E-value. The conditional E-value uses the number of targets that pass the reporting thresholds.

The “independent E-value”, the E-value that the sequence/profile comparison would have received if this were the only domain envelope found in it, excluding any others. This is a stringent measure of how reliable this particular domain may be. The independent E-value uses the total number of targets in the target database.

The start/end of the alignment of this domain with respect to the profile.

The start/end of the alignment of this domain with respect to the sequence.

The start/end of the domain envelope on the sequence, numbered 1..L for a sequence of L residues. The envelope defines a subsequence for which there is substantial probability mass supporting a homologous domain.

The mean posterior probability of aligned residues in the alignment; a measure of how reliable the overall alignment is.
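The difference between the conditional and the independent E-value comes down to the size of the search space. The sketch below assumes that an E-value is the per-domain P-value multiplied by the number of targets considered; the function name, the P-value and the target counts are hypothetical.

```python
def domain_evalues(pvalue, n_db_targets, n_reported_targets):
    """Return (conditional, independent) E-values for one domain.

    The independent E-value scales by every target in the database;
    the conditional E-value scales only by the targets that passed
    the reporting thresholds, so it is never the larger of the two.
    """
    i_evalue = pvalue * n_db_targets
    c_evalue = pvalue * n_reported_targets
    return c_evalue, i_evalue


# Hypothetical numbers: 100,000 database targets, 5 of which were reported.
c_evalue, i_evalue = domain_evalues(1e-6, 100_000, 5)
print(f"c-Evalue = {c_evalue:.2g}, i-Evalue = {i_evalue:.2g}")
```

Because the reported targets are a subset of the database, the conditional value is the permissive one and the independent value is the stringent one, as described above.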