Protein function and classification Hsin-Yu Chang

www.ebi.ac.uk

Classifying proteins into families and identifying protein homologues can help scientists to characterise unknown proteins.

Greider and Blackburn discovered telomerase in 1984 and were awarded the Nobel Prize in 2009. Which model organism did they use for this study?

1. Tetrahymena thermophila

2. Saccharomyces cerevisiae

3. Mouse

4. Human

A single Tetrahymena thermophila cell has 40,000 telomeres, whereas a human cell only has 92.

1984: Discovery of telomerase (Greider and Blackburn)

1989: Telomere hypothesis of cell senescence (Szostak)

1995: Clone hTR

1995/1997: Clone hTERT

1997: Telomerase knockout mouse

1998: Ectopic expression of telomerase in normal human epithelial cells causes the extension of their lifespan

1999/2000…: Telomerase/telomere dysfunctions and cancer

Gilson and Ségal-Bendirdjian, Biochimie, 2010.

Can we identify human telomerase from the Tetrahymena protein sequence?

Let’s pretend that human telomerase has not been identified and we only know the protein sequences of Tetrahymena telomerase. How can we find the human telomerase?

BLAST (Basic Local Alignment Search Tool): compares protein sequences to sequence databases and calculates the statistical significance of matches.

BLAST

Advantages:

• Relatively fast

• User friendly

• Very good at recognising similarity between closely related sequences

Drawbacks:

• Sometimes struggles with multi-domain proteins

• Less useful for weakly similar sequences (e.g., divergent homologues)

Using Tetrahymena telomerase protein sequences as a query in BLAST, you will find a few human proteins that have very low identity.

Tetrahymena telomerase and the putative human telomerase (AAC51724.1) show a poor protein sequence match.

Can we presume this protein is a telomerase homologue from humans? Can we find more information about it before pursuing it further?

Telomerase ribonucleoprotein complex - RNA binding domain

Reverse transcriptase domain

Search for protein signatures (such as domains) in AAC51724.1

Plan experiments and find out more!

AAC51724.1 shares 23% identity with Tetrahymena telomerase. It also contains the same domains as telomerase.
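A percent identity such as the 23% quoted above is computed from a pairwise alignment. The following is a minimal sketch, assuming two already-aligned sequences of equal length with "-" marking gaps. The function name `percent_identity` and the toy sequences are illustrative only (they are not real telomerase data), and alignment tools differ in the denominator they use; this sketch counts only columns where neither sequence has a gap.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns (one possible convention)
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Toy alignment (hypothetical sequences, not real telomerase data)
print(round(percent_identity("MKV-LAGT", "MRVALSGT"), 1))
```

A low value on its own does not rule out homology, which is why the slides go on to look for shared domains.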

But where can we search for information about the protein domains?

Protein databases that use signature approaches

[Figure: signature databases, including HAMAP, grouped by focus (structural domains; functional annotation of families/domains; protein features (sites)) and by method (hidden Markov models, fingerprints, profiles, patterns)]

Construction of protein signatures

• Construct a multiple sequence alignment (MSA) from characterised protein sequences.

• Model the pattern of conserved amino acids at specific positions within the MSA.

• Use these models to infer relationships with the characterised sequences.

Three different protein signature approaches

• Patterns: single motif methods

• Fingerprints: multiple motif methods

• Profiles & Hidden Markov Models (HMMs): full alignment methods

Patterns

Sequence alignment → Motif → Pattern sequences → Pattern signature (PS00000)

Pattern sequences:

ALVKLISG
AIVHESAT
CHVRDLSC
CPVESTIS

Regular expression: [AC]-x-V-x(4)-{ED}

Patterns are usually directed against functional sequence features such as: active sites, binding sites, etc.
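A pattern signature of this kind can be tested with an ordinary regular expression. The sketch below translates PROSITE-style syntax into a Python regex, assuming only the basic elements used in these slides: `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, and `(n)` or `(n,m)` for repeats. The function name `prosite_to_regex` is illustrative, anchors and other extensions of the real syntax are not handled, and the test sequences assume the example string on the slide splits into four eight-residue segments.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        # Optional repeat count such as x(4) or x(2,4)
        repeat = ""
        match = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if match:
            element = match.group(1)
            low, high = match.group(2), match.group(3)
            repeat = "{%s}" % low if high is None else "{%s,%s}" % (low, high)
        if element.lower() == "x":
            core = "."  # any residue
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"  # forbidden residues
        else:
            core = element  # a single residue such as V, or a set such as [AC]
        parts.append(core + repeat)
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
for seq in ["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"]:
    print(seq, bool(re.fullmatch(regex, seq)))
```

Because the match is all-or-nothing, a single substitution at a constrained position (for example E or D in the last position) makes the sequence miss the pattern entirely.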

Tubulin signature (PDOC00199): [SAG]-G-G-T-G-[SA]-G

A conserved motif in tubulins

Patterns

Advantages:

• Strict: a pattern with very little variability can produce highly accurate matches

Drawbacks:

• Simple but less flexible

Fingerprints

Fingerprints: a multiple motif approach

Sequence alignment → Define motifs (Motif 1, Motif 2, Motif 3) → Motif sequences → Weight matrices → Fingerprint signature (PR00000)

Telomerase signature (PR01365)

Motif 1 Motif 2 Motif 3 Motif 4

The significance of motif context: order and interval

• Identify small conserved regions in proteins

• Several motifs characterise a family

• Good at modeling the often small differences between closely related proteins

• Distinguish individual subfamilies within protein families, allowing functional characterisation of sequences at a high level of specificity
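The role of motif order and interval can be illustrated with a small sketch. It is a deliberate simplification: each motif is written as a regular expression rather than the weight matrix a real fingerprint uses, and the function name `match_fingerprint`, the `max_gap` parameter, and the toy motifs are all hypothetical. The function reports a match only if every motif is found, in the given order, with no more than `max_gap` residues between consecutive motifs.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int]


def match_fingerprint(sequence: str, motifs: List[str], max_gap: int) -> Optional[List[Span]]:
    """Return the (start, end) span of each motif if all occur in order with
    at most max_gap residues between consecutive motifs; otherwise None."""

    def search_from(index: int, min_start: int, max_start: Optional[int]) -> Optional[List[Span]]:
        if index == len(motifs):
            return []
        # The lookahead enumerates every candidate position, including overlapping ones.
        for hit in re.finditer("(?=(%s))" % motifs[index], sequence):
            start, end = hit.start(1), hit.end(1)
            if start < min_start:
                continue
            if max_start is not None and start > max_start:
                break
            rest = search_from(index + 1, end, end + max_gap)
            if rest is not None:
                return [(start, end)] + rest
        return None

    return search_from(0, 0, None)


# Toy motifs (hypothetical, not taken from PRINTS)
TOY_MOTIFS = ["GDS", "W.K", "HP[ST]"]
print(match_fingerprint("AAGDSAAAWLKAAHPTAA", TOY_MOTIFS, max_gap=5))
print(match_fingerprint("AAHPTAAWLKAAGDSAA", TOY_MOTIFS, max_gap=5))  # same motifs, wrong order
```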

Fingerprints

Amino acids relatively well conserved across all chloride channel protein family members

Amino acids uniquely conserved in chloride channel protein 3 subfamily members.

Profiles & HMMs

Sequence alignment → Define coverage (entire domain or whole protein) → Use entire alignment of domain or protein family → Build model (profile or HMM) → Profile or HMM signature

Profiles & HMMs

Profiles

Start with a multiple sequence alignment

Amino acids at each position in the alignment are scored according to the frequency with which they occur

Scores are weighted according to evolutionary distance using a BLOSUM matrix

• Good at identifying homologues
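The frequency-based scoring behind a profile can be sketched as follows. This is a simplified illustration, not the method used by any particular database: it assumes a gap-free alignment, a uniform background distribution, and a flat pseudocount in place of the BLOSUM-based weighting mentioned above. The names `build_profile`, `score` and `best_hit` are illustrative, and the toy alignment reuses four short example sequences.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """One dict of log-odds scores per alignment column (gap-free alignment assumed)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background, an assumption
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in range(length):
        counts = Counter(seq[column] for seq in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile: List[Dict[str, float]], segment: str) -> float:
    """Sum of column scores; residues outside AMINO_ACIDS raise KeyError."""
    return sum(column[aa] for column, aa in zip(profile, segment))


def best_hit(profile: List[Dict[str, float]], sequence: str) -> Tuple[float, int]:
    """Highest-scoring ungapped window as (score, start index)."""
    width = len(profile)
    return max((score(profile, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


toy_alignment = ["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"]
profile = build_profile(toy_alignment)
print(best_hit(profile, "MMMMALVKLISGMMMM"))
```

Unlike a pattern, a profile gives every residue at every position a score, so a sequence with a few unusual residues can still score well overall.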

HMMs

Start with a multiple sequence alignment

Amino acid frequencies at each position in the alignment and their transition probabilities are encoded

Insertions and deletions are also modelled

Advantages:

• Very good at identifying evolutionarily distant homologues

• Can model very divergent regions of alignment
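How emission and transition probabilities combine in an HMM can be shown with a toy Viterbi calculation. The sketch below is a heavily simplified profile HMM with match and insert states only; delete states are omitted to keep it short, so it is not equivalent to the models built by real HMM software. The function name `viterbi_score`, the parameter values and the three-column toy model are all assumptions made for illustration.

```python
import math
from typing import Dict, List

NEG_INF = float("-inf")


def viterbi_score(sequence: str,
                  match_emissions: List[Dict[str, float]],
                  insert_prob: float = 0.05,
                  extend_prob: float = 0.5,
                  alphabet_size: int = 20) -> float:
    """Log-probability of the best path through a match/insert profile HMM.

    match_emissions[k][aa] is the emission probability of residue aa in column k.
    The path must start in the first match state and end in the last one.
    """
    n_cols, n_res = len(match_emissions), len(sequence)
    if n_cols == 0 or n_res == 0:
        return NEG_INF
    log_mm = math.log(1.0 - insert_prob)   # match -> next match
    log_mi = math.log(insert_prob)         # match -> insert
    log_ii = math.log(extend_prob)         # insert -> same insert
    log_im = math.log(1.0 - extend_prob)   # insert -> next match
    log_q = math.log(1.0 / alphabet_size)  # uniform insert emission

    def log_emit(k: int, aa: str) -> float:
        p = match_emissions[k].get(aa, 0.0)
        return math.log(p) if p > 0 else NEG_INF

    # vm[k][i]: best log-prob with residue i emitted by match state k
    # vi[k][i]: best log-prob with residue i emitted by the insert state after column k
    vm = [[NEG_INF] * n_res for _ in range(n_cols)]
    vi = [[NEG_INF] * n_res for _ in range(n_cols)]
    vm[0][0] = log_emit(0, sequence[0])
    for i in range(1, n_res):
        for k in range(n_cols):
            vi[k][i] = log_q + max(vm[k][i - 1] + log_mi, vi[k][i - 1] + log_ii)
            if k > 0:
                vm[k][i] = log_emit(k, sequence[i]) + max(
                    vm[k - 1][i - 1] + log_mm, vi[k - 1][i - 1] + log_im)
    return vm[n_cols - 1][n_res - 1]


# Toy three-column model (hypothetical probabilities)
toy_model = [{"A": 0.9, "C": 0.1}, {"V": 1.0}, {"G": 0.8, "S": 0.2}]
print(viterbi_score("AVG", toy_model))   # no insertion
print(viterbi_score("AKVG", toy_model))  # one inserted residue, lower score
```

The inserted residue lowers the score but does not abolish the match, which is the behaviour that lets HMMs tolerate divergent regions.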

Three different protein signature approaches

• Patterns: single motif methods

• Fingerprints: multiple motif methods

• Profiles & HMMs (hidden Markov models): full alignment methods

www.ebi.ac.uk/interpro

[Figure: patterns, fingerprints, profiles and HMMs (hidden Markov models) from signature databases, including HAMAP, covering structural domains, functional annotation of families/domains, and protein features (sites)]

The aim of InterPro

Family entry: description, proteins matched and more information.

Domain entry: description, proteins matched and more information.

Site entry: description, proteins matched and more information.

Protein sequences

What is InterPro?

• InterPro is an integrated sequence analysis resource

• It combines predictive models (known as signatures) from different databases

• It provides functional analysis of protein sequences by classifying them into families and predicting domains and important sites

Facts about InterPro

• First release in 1999

• 11 partner databases

• Adds annotation to UniProtKB/TrEMBL

• Provides matches to over 80% of UniProtKB

• Source of >85 million Gene Ontology (GO) mappings to >24 million distinct UniProtKB sequences

• 50,000 unique visitors to the web site per month

• >2 million sequences searched online per month, plus offline searches with the downloadable version of the software

InterPro signature integration process

• Signatures are provided by member databases

• They are scanned against the UniProt database to see which sequences they match

• InterPro curators manually inspect the matches before integrating the signatures into InterPro

• Signatures representing the same entity are integrated together

• Relationships between entries are traced, where possible

• Curators add literature-referenced abstracts, cross-references to other databases, and GO terms

http://www.ebi.ac.uk/interpro/

Search using protein sequences

InterPro entry types

Family: Proteins that share a common evolutionary origin, as reflected in their related functions, sequences or structure. Ex. Telomerase family.

Domain: Distinct functional, structural or sequence units that may exist in a variety of biological contexts. Ex. DNA binding domain.

Repeats: Short sequences typically repeated within a protein. Ex. Tubulin binding repeats in microtubule associated protein Tau.

Sites: PTM, active site, binding site, conserved site. Ex. Phosphorylation sites, ion binding sites, tubulin conserved site.

[Screenshot: an InterPro entry page, showing type, name, identifier, contributing signatures, description, GO terms, references and relationships]

InterPro family and domain relationships

Family relationships in InterPro:

Interleukin-15/Interleukin-21 family (IPR003443)
  Interleukin-15 (IPR020439)
    Interleukin-15 Avian (IPR020451)
    Interleukin-15 Fish (IPR020410)
    Interleukin-15 Mammal (IPR020466)
  Interleukin-21 (IPR028151)
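Family relationships like these form a tree, which can be represented as a child-to-parent mapping. The sketch below is illustrative only: the parent-child arrangement is inferred from the layout of the slide rather than queried from InterPro, and the names `PARENT` and `lineage` are hypothetical.

```python
from typing import Dict, List, Optional

# Child accession -> parent accession (arrangement assumed from the slide layout)
PARENT: Dict[str, Optional[str]] = {
    "IPR003443": None,         # Interleukin-15/Interleukin-21 family
    "IPR020439": "IPR003443",  # Interleukin-15
    "IPR020451": "IPR020439",  # Interleukin-15 Avian
    "IPR020410": "IPR020439",  # Interleukin-15 Fish
    "IPR020466": "IPR020439",  # Interleukin-15 Mammal
    "IPR028151": "IPR003443",  # Interleukin-21
}


def lineage(accession: str) -> List[str]:
    """Accessions from the root family down to the given entry."""
    chain = []
    current: Optional[str] = accession
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain[::-1]


print(" > ".join(lineage("IPR020466")))
```

A protein matching a child entry is expected to match its parent entries as well, so the lineage gives progressively more specific classifications.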

InterPro relationships: domains

Protein kinase-like domain
  Protein kinase domain
    Serine/threonine kinase catalytic domain
    Tyrosine kinase catalytic domain

Gene Ontology

• Allow cross-species and/or cross-database comparisons

• Unify the representation of gene and gene product attributes across species

The Concepts in GO

1. Molecular Function (e.g. protein kinase activity, insulin receptor activity)

2. Biological Process (e.g. cell cycle, microtubule cytoskeleton organisation)

3. Cellular Component

Example GO terms:

GO:0003677 DNA binding

GO:0003721 telomeric template RNA reverse transcriptase activity

GO:0005634 Nucleus

Search using keywords

Summary

• Protein classification could help scientists to gain information about protein functions.

• BLAST is fast and easy to use but has its drawbacks.

• Alternative approach: protein signature databases build models (protein signatures) by using different methods (patterns, fingerprints, profiles and HMMs).

• InterPro integrates these signatures from 11 member databases. It serves as a sequence analysis resource that classifies sequences into protein families and predicts important domains and sites.

Why use InterPro?

• Large amounts of manually curated data

• 35,634 signatures integrated into 25,214 entries

• Cites 38,877 PubMed publications

• Large coverage of protein sequence space

• Regularly updated

• ~ 8 week release schedule

• New signatures added

• Scanned against latest version of UniProtKB

Caution

• InterPro is a predictive protein signature database - results are predictions, and should be treated as such

• InterPro entries are based on signatures supplied to us by our member databases

....this means no signature, no entry!

And one more thing…..

We need your feedback: missing/additional references, reporting problems, requests.

EBI support page: http://www.ebi.ac.uk/support/

The InterPro Team:

Amaia Sangrador

Craig McAnulla

Matthew Fraser

Maxim Scheremetjew

Siew-Yit Yong

Alex Mitchell

Sebastien Pesseat

Sarah Hunter

Gift Nuka

Hsin-Yu Chang

www.ebi.ac.uk/interpro

Twitter: @InterProDB

http://www.ebi.ac.uk/Information/Staff/person_maintx.php?s_person_id=1164

Database | Basis | Institution | Built from | Focus | URL

Pfam | HMM | Sanger Institute | Sequence alignment | Family & Domain based on conserved sequence | http://pfam.sanger.ac.uk/

Gene3D | HMM | UCL | Structure alignment | Structural Domain | http://gene3d.biochem.ucl.ac.uk/Gene3D/

Superfamily | HMM | Uni. of Bristol | Structure alignment | Evolutionary domain relationships | http://supfam.cs.bris.ac.uk/SUPERFAMILY/

SMART | HMM | EMBL Heidelberg | Sequence alignment | Functional domain annotation | http://smart.embl-heidelberg.de/

TIGRFAM | HMM | J. Craig Venter Inst. | Sequence alignment | Microbial Functional Family Classification | http://www.jcvi.org/cms/research/projects/tigrfams/overview/

Panther | HMM | Uni. S. California | Sequence alignment | Family functional classification | http://www.pantherdb.org/

PIRSF | HMM | PIR, Georgetown, Washington D.C. | Sequence alignment | Functional classification | http://pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml

PRINTS | Fingerprints | Uni. of Manchester | Sequence alignment | Family functional classification | http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php

PROSITE | Patterns & Profiles | SIB | Sequence alignment | Functional annotation | http://expasy.org/prosite/

HAMAP | Profiles | SIB | Sequence alignment | Microbial protein family classification | http://expasy.org/sprot/hamap/

ProDom | Sequence clustering | PRABI : Rhône-Alpes Bioinformatics Center | Sequence alignment | Conserved domain prediction | http://prodom.prabi.fr/prodom/current/html/home.php

Thank you!

www.ebi.ac.uk

Twitter: @emblebi

Facebook: EMBLEBI

YouTube: EMBLMedia

The BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences.
