Protein Domain Analysis Using Hidden Markov Models Liangjiang (LJ) Wang [email protected] March 10, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 17


Page 1: Protein Domain Analysis Using Hidden Markov Models

Protein Domain Analysis Using

Hidden Markov Models

Liangjiang (LJ) Wang

[email protected]

March 10, 2005

PLPTH 890 Introduction to Genomic Bioinformatics

Lecture 17

Page 2: Protein Domain Analysis Using Hidden Markov Models

Outline

• Basic concepts and biological problems.

• Search for protein domains:

– The Pfam database,

– Other domain/motif databases.

• Protein domain modeling:

– Hidden Markov Models (HMM),

– Construction of the Pfam protein domain models using HMMER.

Page 3: Protein Domain Analysis Using Hidden Markov Models

Biological Problem #1

You identified a new gene, which might be involved in a very interesting biological process. BLAST search in GenBank resulted in a few homologous sequences with unknown function. What else can you do to understand the function of the gene product and/or to localize the possible conserved domain in the protein?

Page 4: Protein Domain Analysis Using Hidden Markov Models

Biological Problem #2

Suppose there is a novel gene identified in mammals, C. elegans and Drosophila, but not yet in plants. This gene is involved in an interesting biological process (e.g., apoptosis). You are interested in finding the orthologous gene in Arabidopsis. However, BLAST search using each of the known sequences failed to identify an Arabidopsis homologue. What else can you try?

Page 5: Protein Domain Analysis Using Hidden Markov Models

Orthologs, Paralogs and Homologs

[Figure: genes X and Y in an ancestral organism. Speciation into organisms A and B gives X1 and X2; duplication of Y gives Ya and Yb. All of these genes are homologs.]

X1 and X2 are orthologs with the same function.

Paralogs Ya and Yb may have different but related functions.

Page 6: Protein Domain Analysis Using Hidden Markov Models

Protein Domains

Domains represent evolutionarily conserved amino acid sequences carrying functional and structural information of a protein. Domain analysis helps in understanding the biological function of a gene product.

[Figure: bZIP domain]

Page 7: Protein Domain Analysis Using Hidden Markov Models

Protein Domain Analysis Using HMM

[Figure: workflow linking Your Sequence Set, Multiple Sequence Alignment, Hidden Markov Models and HMMER Search]

>TC50726
AIKLNDVKSCQGTAFWMAPEVVRGKVKGYGLPADIWSLGCTVLEMLTGQVPYAPMECISAMFRIGKGELPPVPDTLSRDARDFILQCLKVNPDDRPTAAQLLDHKFVQRSFSQSSGSASPHIPRRS
>UFO_ARATH
MDSTVFINNPSLTLPFSYTFTSSSNSSTTTSTTTDSSSGQWMDGRIWSKLPPPLLDRVIAFLPPPAFFRTRC

Page 8: Protein Domain Analysis Using Hidden Markov Models

Comparison of Search Approaches

Approach     Sensitivity   Speed
BLAST        Low           Very Fast
HMM          High          Fast
Threading    Very High     Very Slow

Page 9: Protein Domain Analysis Using Hidden Markov Models

The Pfam Database

• Pfam is a database of multiple alignments and hidden Markov models (HMMs) of common conserved protein domains.

• The alignments use a non-redundant protein set composed of SWISS-PROT and TrEMBL.

• Pfam consists of parts A and B. Pfam-A contains curated domain families with high-quality alignments. Pfam-B contains families that were generated automatically by clustering the remaining sequences after removal of Pfam-A domains.

• Pfam is available at http://pfam.wustl.edu/.

Page 10: Protein Domain Analysis Using Hidden Markov Models

Other Domain/Motif Databases

• ProDom: http://www.toulouse.inra.fr/prodom.html; contains domain families generated automatically from SWISS-PROT and TrEMBL (compare Pfam-B).

• SMART: Simple Modular Architecture Research Tool; available at http://smart.embl-heidelberg.de/; contains domain families that are widely represented among nuclear, signaling and extracellular proteins.

• TIGRFAMs: http://www.tigr.org/TIGRFAMs; a collection of manually curated protein families represented by hidden Markov models; contains models of full-length proteins and shorter protein regions.

Page 11: Protein Domain Analysis Using Hidden Markov Models

More Domain/Motif Databases

• PROSITE: http://www.expasy.org/prosite/; consists of biologically significant sites, patterns and profiles; uses regular expressions to represent most patterns.

• PRINTS: http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/; a collection of protein fingerprints (conserved motifs, ungapped alignments), which may be used to assign new sequences to known protein families.

• Blocks: http://blocks.fhcrc.org/; consists of short ungapped alignments corresponding to the most highly conserved regions of proteins.

Page 12: Protein Domain Analysis Using Hidden Markov Models

Even More Domain/Motif Databases

• InterPro: http://www.ebi.ac.uk/interpro; an integrated and curated collection of protein families, domains and motifs from PROSITE, Pfam, PRINTS, ProDom, SMART and TIGRFAMs.

• CDD: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd; contains domains derived from Pfam, SMART and models curated at NCBI.

• 3Dee: http://www.compbio.dundee.ac.uk/3Dee/; contains structural domain definitions for all protein chains in the Protein Databank (PDB); clustered by both sequence and structural similarity.

Page 13: Protein Domain Analysis Using Hidden Markov Models

Why So Many Domain/Motif Databases?

• Different representations of patterns:

– PROSITE: regular expression.

– ProDom: multiple alignment and consensus.

– Pfam: multiple alignment and HMM.

• Different approaches or focuses:

– SMART: focused on signaling proteins.

– PRINTS and Blocks: highly conserved segments.

– 3Dee: structural domain definitions.

• “Meta-sites” (databases of databases):

– InterPro: an integrated collection, derived from several domain/motif databases.

Page 14: Protein Domain Analysis Using Hidden Markov Models

Protein Domain Modeling

• Machine learning concepts.

• Hidden Markov Models (HMM).

• HMMER (a software tool for constructing HMMs and searching with them).

• Construction of the Pfam protein domain models.

Page 15: Protein Domain Analysis Using Hidden Markov Models

Machine Learning

• The study of computer algorithms that automatically improve performance through experience.

• In practice, this means: we have a set of examples from which we want to extract some rules (regularities) using computers.

• Two types of machine learning:

– Supervised: learn with a teacher (using a set of input-output training examples).

– Unsupervised: let the machine explore the data space and find some interesting patterns.

Page 16: Protein Domain Analysis Using Hidden Markov Models

Learning from Examples

• Learning refers to the process in which a model is generalized (induced) from given examples (training dataset).

• Error-correction learning: for each of the given examples, a computer program

– makes a prediction based on what was already learned (i.e., model parameters).

– compares the prediction with the given output to calculate the error.

– adjusts the model parameters in some way (learning algorithm) to minimize the error.
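
The three steps of error-correction learning can be sketched with a very small model. The example below is only an illustration, not the lecture's own code: it assumes a one-variable linear model (prediction = w*x + b), a simple "move the parameters in proportion to the error" update, and made-up training examples generated from y = 2x + 1. The names `train_linear`, `learning_rate` and `epochs` are hypothetical.

```python
def train_linear(examples, learning_rate=0.05, epochs=500):
    """Fit prediction = w*x + b by error-correction learning (illustrative sketch)."""
    w, b = 0.0, 0.0  # model parameters, initially "knowing nothing"
    for _ in range(epochs):
        for x, target in examples:
            prediction = w * x + b           # 1. predict from current parameters
            error = target - prediction      # 2. compare with the given output
            w += learning_rate * error * x   # 3. adjust parameters to reduce the error
            b += learning_rate * error
    return w, b


# Made-up input-output training examples following y = 2x + 1.
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_linear(examples)
print(f"w = {w:.3f}, b = {b:.3f}")
```

With these noise-free examples the learned parameters should approach w = 2 and b = 1; with too few or unrepresentative examples the same loop would happily fit a misleading model.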

Page 17: Protein Domain Analysis Using Hidden Markov Models

Common Pitfalls - Training Dataset

[Figure: a data space and the data instances sampled from it, in three panels: too few examples (overfitting), sampling problem, and good.]

(“Garbage in, garbage out”)

Page 18: Protein Domain Analysis Using Hidden Markov Models

Hidden Markov Model (HMM)

• A class of probabilistic models that are generally applicable to time series or linear sequences.

• Widely used in speech recognition since the early 1970s. David Haussler’s group at UC Santa Cruz introduced HMMs for biological sequence profiles in 1994.

• An HMM turns a multiple alignment into a position-specific scoring system that can be used to search for remotely homologous sequences.

Page 19: Protein Domain Analysis Using Hidden Markov Models

The Occasionally Dishonest Casino Problem

The casino has two dice: a fair die and a loaded die. It uses the fair die most of the time, but occasionally (P = 0.05) switches to the loaded die, and it switches back to the fair die with probability 0.1. The loaded die has probability 0.5 of a six and probability 0.1 for each of the numbers one to five. The fair die has probability 0.167 for each number.

Rolls   521462536316562646465251   (Symbol)
Die     FFFFFFFFLLLLLLLLLLFFFFFF   (State/Path)

The state sequence or path is hidden (HMM). Transition probabilities: P(L|F) = 0.05; P(F|F) = 0.95. Emission probabilities: P(6|L) = 0.5; P(6|F) = 0.167.

Page 20: Protein Domain Analysis Using Hidden Markov Models

An HMM for the Casino Problem

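Fair die: emission probabilities 1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6; transition probabilities Fair to Fair 0.95, Fair to Loaded 0.05.

Loaded die: emission probabilities 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2; transition probabilities Loaded to Loaded 0.9, Loaded to Fair 0.1.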
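
To make the casino model concrete, the following sketch encodes the emission and transition probabilities given above and uses them generatively to produce a sequence of rolls together with the hidden die states. It is a minimal illustration rather than part of the lecture: the function name `simulate`, the choice to start with the fair die, and the random seed are all assumptions.

```python
import random

# Transition and emission probabilities of the occasionally dishonest casino.
TRANSITION = {
    "F": {"F": 0.95, "L": 0.05},
    "L": {"F": 0.10, "L": 0.90},
}
EMISSION = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {"1": 0.1, "2": 0.1, "3": 0.1, "4": 0.1, "5": 0.1, "6": 0.5},
}


def simulate(n_rolls, start="F", seed=None):
    """Generate (rolls, path); starting with the fair die is an assumption."""
    rng = random.Random(seed)
    state, rolls, path = start, [], []
    for _ in range(n_rolls):
        faces = list(EMISSION[state])
        weights = [EMISSION[state][f] for f in faces]
        rolls.append(rng.choices(faces, weights=weights)[0])  # emit a symbol
        path.append(state)                                    # record hidden state
        targets = list(TRANSITION[state])
        weights = [TRANSITION[state][t] for t in targets]
        state = rng.choices(targets, weights=weights)[0]      # move to next state
    return "".join(rolls), "".join(path)


rolls, path = simulate(24, seed=1)
print("Rolls", rolls)
print("Die  ", path)
```

An observer sees only the first printed line; the second line is the hidden state path that an HMM algorithm would try to recover.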

Page 21: Protein Domain Analysis Using Hidden Markov Models

An HMM for 5’ Splice Site Recognition

(Eddy, 2004)

States: E = exon; 5 = 5’ splice site; I = intron.

An observation (nucleotide sequence) corresponds to a state path (or paths) through the HMM.

Page 22: Protein Domain Analysis Using Hidden Markov Models

Finding the Best Hidden State Path

(Eddy, 2004)

The probability P of a state path, given the model and an observation (sequence), is the product of all the emission and transition probabilities along the path.

Page 23: Protein Domain Analysis Using Hidden Markov Models

Calculating the Probability of a State Path

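$$
\ln P = \ln\bigl(1.0 \times 0.25 \times (0.9 \times 0.25)^{17} \times 0.1 \times 0.95 \times 1.0 \times 0.4 \times 0.9 \times 0.4 \times 0.9 \times 0.4 \times 0.9 \times 0.1 \times 0.9 \times 0.4 \times 0.9 \times 0.1 \times 0.9 \times 0.4 \times 0.1\bigr) = -41.22
$$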
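
The product of emission and transition probabilities along a path is easiest to check in log space. The sketch below recomputes the log probability of one state path through the 5’ splice site HMM. The parameter values, the 26-nucleotide sequence and the state path are taken from the toy example in Eddy (2004) as an assumption, since the slide figure itself is not reproduced here; the function name `log_path_probability` is hypothetical.

```python
import math

# Toy 5' splice site HMM (values assumed from the example in Eddy, 2004).
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # exon
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},    # 5' splice site
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},      # intron
}
TRANS = {
    ("Start", "E"): 1.0,
    ("E", "E"): 0.9,
    ("E", "5"): 0.1,
    ("5", "I"): 1.0,
    ("I", "I"): 0.9,
    ("I", "End"): 0.1,
}


def log_path_probability(sequence, path):
    """Sum the logs of all transition and emission probabilities along the path."""
    if len(sequence) != len(path):
        raise ValueError("sequence and path must have the same length")
    log_p, previous = 0.0, "Start"
    for symbol, state in zip(sequence, path):
        log_p += math.log(TRANS[(previous, state)])  # move into the state
        log_p += math.log(EMIT[state][symbol])       # emit the nucleotide
        previous = state
    return log_p + math.log(TRANS[(previous, "End")])


sequence = "CTTCATGTGAAAGCAGACGTAAGTCA"
path = "E" * 18 + "5" + "I" * 7
print(round(log_path_probability(sequence, path), 2))  # about -41.22
```

Working with sums of logarithms avoids the numerical underflow that multiplying many small probabilities would cause on longer sequences.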

Page 24: Protein Domain Analysis Using Hidden Markov Models

How to Model a Protein Domain?

A.A.    EDQILIKARNTEAARRSRVIANYL   (Symbol)
DomX?   NNNNNNNNYYYYYYYYYYNNNNNN   (State/Path)

Consider a two-state HMM: Is there a domain X (Yes/No)?

Seq1 KGIQEF--GADWYKVAK--NVGNKSPEQCILRFLQ
Seq2 ALVKKHGQG-EWKTIAS--NLNNRTEQQCQHRWLR
Seq3 SGVRKYGEG-NWSKILLHYKFNNRTSVMLKDRWRT

Is this sufficient for modeling a protein domain? No.

How to represent position-dependent amino acid distribution?

What about insertions and deletions?

Page 25: Protein Domain Analysis Using Hidden Markov Models

An HMM for Protein Domain Recognition

(Eddy, 1996)

States: M = match; D = delete; I = insert.

Page 26: Protein Domain Analysis Using Hidden Markov Models

HMM Parameterization (Training)

• HMM parameters are estimated from the multiple sequence alignment.

– Basic: maximum likelihood estimation.

– Advanced: the MAP construction algorithm.

(See Durbin et al., Biological sequence analysis, p.107-124)

• A high-quality alignment is essential for the model construction. This includes selection of sequences and manual editing of the multiple sequence alignment generated by the ClustalW program.
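
As a rough illustration of maximum likelihood estimation, the sketch below counts the residues in each column of a small alignment (the three sequences shown earlier in the lecture) and turns the counts into emission probabilities. It is a simplification under stated assumptions: every alignment column is treated as a match state, gaps are ignored, transition probabilities are not estimated, and the optional `pseudocount` parameter is a hypothetical add-on rather than the scheme HMMER uses.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

ALIGNMENT = [
    "KGIQEF--GADWYKVAK--NVGNKSPEQCILRFLQ",
    "ALVKKHGQG-EWKTIAS--NLNNRTEQQCQHRWLR",
    "SGVRKYGEG-NWSKILLHYKFNNRTSVMLKDRWRT",
]


def column_emissions(alignment, pseudocount=0.0):
    """Estimate per-column residue probabilities from observed counts."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    emissions = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")  # ignore gaps
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        if total == 0:  # an all-gap column with no pseudocounts
            emissions.append({})
            continue
        emissions.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return emissions


probabilities = column_emissions(ALIGNMENT)
print(probabilities[11]["W"])  # column 12 is W in all three sequences
```

With only three sequences, a residue that happens to be absent from a column receives probability zero, which is one reason more careful estimation schemes and a well-chosen set of sequences matter.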

Page 27: Protein Domain Analysis Using Hidden Markov Models

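Scoring a Sequence with an HMM

• The task is to find the hidden state path with the highest probability, given the model and an observation (sequence).

– The Viterbi algorithm (dynamic programming) finds the single most probable path.

– The forward algorithm sums over all paths to give the total probability of the sequence.

– The backward algorithm, combined with the forward algorithm, gives the posterior probability of each state at each position.

(See Durbin et al., Biological Sequence Analysis, p.55-61)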
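
The Viterbi algorithm can be sketched on the occasionally dishonest casino model from earlier in the lecture. The code below is an illustrative implementation, not the one used by HMMER: it works in log space, and it assumes equal start probabilities for the fair and loaded die because the lecture does not state them. The names `viterbi` and `emission` are hypothetical.

```python
import math

STATES = "FL"
START = {"F": 0.5, "L": 0.5}  # assumption: the lecture gives no start probabilities
TRANSITION = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}


def emission(state, roll):
    """Emission probability of a roll ('1'..'6') from the fair or loaded die."""
    if state == "F":
        return 1 / 6
    return 0.5 if roll == "6" else 0.1


def viterbi(rolls):
    """Return the most probable state path for a string of rolls."""
    if not rolls:
        return ""
    # best[s]: log probability of the best path so far that ends in state s
    best = {s: math.log(START[s]) + math.log(emission(s, rolls[0])) for s in STATES}
    pointers = []
    for roll in rolls[1:]:
        new_best, pointer = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANSITION[p][s]))
            new_best[s] = (best[prev] + math.log(TRANSITION[prev][s])
                           + math.log(emission(s, roll)))
            pointer[s] = prev
        pointers.append(pointer)
        best = new_best
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: best[s])]
    for pointer in reversed(pointers):
        path.append(pointer[path[-1]])
    return "".join(reversed(path))


rolls = "521462536316562646465251"
print(rolls)
print(viterbi(rolls))
```

The decoded path need not match the true hidden path exactly; Viterbi reports only the single most probable explanation of the observed rolls.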

Page 28: Protein Domain Analysis Using Hidden Markov Models

HMM versus PWM

• Advantages:

– An HMM has position-dependent amino acid distributions, which are represented as emission probabilities at each match state. (Also true of a position weight matrix, PWM)

– Insertion/deletion gap penalties are handled using transition probabilities. (Usually not with PWM)

– The possible dependence of an amino acid on its preceding neighbor can be represented using the transition probabilities. (Not with PWM)

• Problems:

– Long-range interactions between amino acids.

– Requirement of multiple sequence alignments.

Page 29: Protein Domain Analysis Using Hidden Markov Models

HMMER

• A software package for constructing and searching HMMs.

• Source code and binary distribution for various platforms (UNIX, Linux and Macintosh PowerPC) are available at http://hmmer.wustl.edu/. Follow the detailed User’s Guide for software installation.

• Multiple sequence alignment: ClustalW or ClustalX (with Windows interface), available at ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/.

• Sequences in FASTA format.

Page 30: Protein Domain Analysis Using Hidden Markov Models

HMMER Programs

• hmmbuild: build a model from a multiple sequence alignment.

• hmmalign: align multiple sequences to an HMM.

• hmmcalibrate: determine appropriate statistical significance parameters for an HMM prior to database searches.

• hmmsearch: search a sequence database with an HMM.

• hmmpfam: search an HMM database with one or more sequences.

• hmmconvert and hmmindex.

Page 31: Protein Domain Analysis Using Hidden Markov Models

Construction of the Pfam HMMs

[Flowchart] Family definition (PROSITE, literature) → ClustalW, editing → Seed alignment (representative, stable) → hmmbuild → HMM profile → hmmalign → Full alignment (complete, volatile).

If the HMM doesn’t find all members, the process loops back to an earlier step.

Page 32: Protein Domain Analysis Using Hidden Markov Models

A Solution to Problem #2

Collect known sequences in literature

Do multiple alignment (ClustalX, editing)

Create an HMM profile using hmmbuild

Search an Arabidopsis sequence dataset using the HMM and hmmsearch
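
The last two steps above map onto a short sequence of HMMER commands. The sketch below only assembles and prints the command lines; it assumes the HMMER 2 command layout (`hmmbuild <hmm file> <alignment file>`, `hmmcalibrate <hmm file>`, `hmmsearch <hmm file> <sequence file>`), and all file names are hypothetical placeholders. Later HMMER releases no longer include hmmcalibrate, so check the User's Guide for the installed version before running anything.

```python
import shutil
import subprocess


def hmmer2_commands(alignment_file, hmm_file, sequence_file):
    """Return the command lines for build, calibrate and search (HMMER 2 layout)."""
    return [
        ["hmmbuild", hmm_file, alignment_file],  # alignment -> HMM profile
        ["hmmcalibrate", hmm_file],              # fit significance parameters
        ["hmmsearch", hmm_file, sequence_file],  # search the sequence dataset
    ]


def run_pipeline(commands):
    """Run each command in order, skipping programs that are not installed."""
    for argv in commands:
        if shutil.which(argv[0]) is None:
            print("skipping (program not found):", " ".join(argv))
            continue
        subprocess.run(argv, check=True)


# Hypothetical file names, for illustration only.
commands = hmmer2_commands(
    "known_family.aln", "family.hmm", "arabidopsis_proteins.fasta"
)
for argv in commands:
    print(" ".join(argv))
```

Calling `run_pipeline(commands)` would execute the steps once the alignment and the Arabidopsis FASTA file actually exist; it is deliberately not called here.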

Page 33: Protein Domain Analysis Using Hidden Markov Models

Other Tools for Protein Pattern Analysis

• SignalP:

– For predicting signal peptide and cleavage site.

– Available at http://www.cbs.dtu.dk/services/SignalP/.

• PSORT:

– For predicting protein localization sites in cells.

– Available at http://psort.nibb.ac.jp/.

• TMHMM:

– For predicting transmembrane segments.

– Available at http://www.cbs.dtu.dk/services/TMHMM/.

Page 34: Protein Domain Analysis Using Hidden Markov Models

Summary

• Hidden Markov Models (HMMs) are well suited to representing protein domains.

• Since HMMs are constructed from aligned sequence families, HMM search is often more sensitive than BLAST for detecting remotely related homologues.

• Resources are available for modeling and searching for protein domains/motifs.

Page 35: Protein Domain Analysis Using Hidden Markov Models

PROSITE vs. Perl RegExp

PDOC00269 (Heat shock hsp70 signature)
PROSITE: [IV]-D-L-G-T-[ST]-x-[SC]
Perl: [IV]DLGT[ST]\w[SC]

PDOC50884 (Part of Zinc finger Dof-type signature)
PROSITE: C-x(2)-C-x(7)-[CS]-x(13)-C-x(2)-C
Perl: C\w{2}C\w{7}[CS]\w{13}C\w{2}C

PDOC00081 (Part of Cytochrome P450 signature)
PROSITE: [FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C
Perl: [FW][SGNH]\w[GD][^F][RKHPT][^P]C

PDOC00036 (Part of bZIP domain signature)
PROSITE: [KR]-x(1,3)-[RKSAQ]-N-{VL}-x-[SAQ](2)-{L}
Perl: [KR]\w{1,3}[RKSAQ]N[^VL]\w[SAQ]{2}[^L]
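
The translation shown in these four examples is mechanical, so it can be sketched as a small function. The version below is an illustration that covers only the PROSITE elements appearing on this slide (single residues, `x`, `[...]`, `{...}` and repeat counts in parentheses); anchors and other PROSITE syntax are not handled, and the function name `prosite_to_regex` is hypothetical. It follows the slide in writing `x` as `\w`, and the resulting patterns also work with Python's `re` module.

```python
import re

# One PROSITE element: a residue, x, [allowed] or {forbidden}, plus optional (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+(?:,\d+)?)\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Perl-style regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            core = r"\w"                      # any residue, as on the slide
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # {F} means "anything but F"
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


hsp70 = prosite_to_regex("[IV]-D-L-G-T-[ST]-x-[SC]")
print(hsp70)
# Made-up test sequence containing one match to the hsp70 signature.
print(re.search(hsp70, "AAIDLGTTYSCAA").group())
```

Unlike an HMM, such a pattern either matches or does not: a single unexpected residue in a conserved position causes the whole signature to be missed.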