Classifying proteins into families and identifying protein homologues can help scientists to
characterise unknown proteins.
Greider and Blackburn discovered telomerase in 1984 and were awarded the Nobel Prize in 2009. Which model organism did they use for this study?
1. Tetrahymena thermophila
2. Saccharomyces cerevisiae
3. Mouse
4. Human
A single Tetrahymena thermophila cell has
40,000 telomeres, whereas a human cell only has 92.
1984: Discovery of telomerase (Greider and Blackburn)
1989: Telomere hypothesis of cell senescence (Szostak)
1995: Clone hTR
1995/1997: Clone hTERT
1997: Telomerase knockout mouse
1998: Ectopic expression of telomerase in normal human epithelial cells extends their lifespan
1999/2000…: Telomerase/telomere dysfunctions and cancer
Gilson and Ségal-Bendirdjian, Biochimie, 2010.
Let’s pretend that human telomerase has not been
identified and we only know the protein sequences
of Tetrahymena telomerase. How can we find the
human telomerase?
BLAST (Basic Local Alignment Search Tool): compares protein sequences to sequence databases and calculates the statistical significance of matches.
BLAST
Advantages:
• Relatively fast
• User friendly
• Very good at recognising similarity between closely related sequences
Drawbacks:
• Sometimes struggles with multi-domain proteins
• Less useful for weakly similar sequences (e.g., divergent homologues)
Using Tetrahymena telomerase protein sequences as a query in BLAST, you will find a few human proteins that have very low identity.
Can we presume this protein is a telomerase
homologue from humans? Can we find more
information about it before pursuing it further?
Telomerase ribonucleoprotein complex - RNA binding domain
Reverse transcriptase domain
Search for protein signatures (such as domains) in AAC51724.1
Plan experiments and find out more!
AAC51724.1 shares 23% identity with Tetrahymena telomerase. It also contains the
same domains as telomerase.
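As a minimal sketch of what a percent identity figure such as 23% means, the function below counts identical residues in a pairwise alignment. The function name `percent_identity` and the toy sequences are hypothetical (they are not real telomerase data), and the choice to skip gap columns is an assumption: tools differ in the denominator they use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # assumption: gap columns are excluded from the count
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy alignment (made-up sequences, '-' marks a gap)
query = "MKV-LATGQ"
subject = "MRVALSTG-"
print(round(percent_identity(query, subject), 1))
```

A low value like 23% on its own does not establish homology, which is why the slides go on to check for shared domains.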
Structural domains
Functional annotation of families/domains
Protein features (sites)
Hidden Markov Models, Fingerprints, Profiles, Patterns
Protein databases that use signature approaches
HAMAP
Construction of protein signatures
• Construction of a multiple sequence alignment (MSA) from characterised protein sequences.
• Modelling the pattern of conserved amino acids at specific positions within an MSA.
• Using these models to infer relationships with the characterised sequences.
Three different protein signature approaches
Patterns
Single motif methods
Fingerprints
Multiple motif methods
Profiles & Hidden Markov
Models (HMMs)
Full alignment methods
Sequence alignment
Patterns
Sequence alignment → Motif → Pattern signature (PS00000)
Regular expression: [AC]-x-V-x(4)-{ED}
Pattern sequences:
ALVKLISG
AIVHESAT
CHVRDLSC
CPVESTIS
Patterns are usually directed against functional sequence features such as: active sites, binding sites, etc.
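As a rough illustration (not PROSITE's own tooling), the pattern [AC]-x-V-x(4)-{ED} can be translated into a Python regular expression. The helper name `prosite_to_regex` is hypothetical and handles only the syntax elements used on the slide. The pattern sequences are read here as four 8-residue sequences, which is an assumption about how the slide text was laid out.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex.

    Supported elements: [..] allowed residues, {..} forbidden residues,
    x for any residue, a single residue letter, and (n) repeat counts.
    """
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        m = re.fullmatch(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, count = m.groups()
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # anything except these residues
        parts.append(core + (f"{{{count}}}" if count else ""))
    return "".join(parts)


regex = re.compile(prosite_to_regex("[AC]-x-V-x(4)-{ED}"))

# Pattern sequences from the slide, split into 8-residue sequences (assumed)
sequences = ["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"]
for seq in sequences:
    print(seq, bool(regex.fullmatch(seq)))

# A sequence ending in E violates the final {ED} element
print("ALVKLISE", bool(regex.fullmatch("ALVKLISE")))
```

The strictness described on the next slide is visible here: a single forbidden residue at the last position is enough to reject a sequence.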
Patterns
Advantages:
• Strict: a pattern allows very little variability, so it can produce highly accurate matches
Drawbacks:
• Simple but less flexible
Fingerprints: a multiple motif approach
Sequence alignment
Define motifs: Motif 1, Motif 2, Motif 3
Fingerprint signature
PR00000
Motif sequences
Weight matrices
The significance of motif context: order and interval
• Identify small conserved regions in proteins
• Several motifs (1, 2, 3) characterise a family
• Good at modeling the often small differences between closely related proteins
• Distinguish individual subfamilies within protein families, allowing functional characterisation of sequences at a high level of specificity
Fingerprints
Amino acids relatively well conserved across all chloride channel protein family members
Amino acids uniquely conserved in chloride channel protein 3 subfamily members.
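The role of motif order and interval can be sketched as follows. This is a simplification under stated assumptions: real fingerprints score motifs with weight matrices, whereas this sketch uses regular expressions as stand-ins, and the motifs, the sequences, the `max_gap` parameter and the function name `match_fingerprint` are all hypothetical.

```python
import re
from typing import List, Optional, Tuple


def match_fingerprint(sequence: str, motifs: List[str], max_gap: int,
                      start: int = 0, first: bool = True
                      ) -> Optional[List[Tuple[int, int]]]:
    """Return (start, end) positions for every motif if all motifs occur in
    the given order, each at most max_gap residues after the previous one.
    Returns None when no such arrangement exists."""
    if not motifs:
        return []
    pattern = re.compile(motifs[0])
    pos = start
    while True:
        m = pattern.search(sequence, pos)
        if m is None:
            return None
        # The first motif may sit anywhere; later motifs must respect the interval.
        if not first and m.start() - start > max_gap:
            return None
        rest = match_fingerprint(sequence, motifs[1:], max_gap, m.end(), False)
        if rest is not None:
            return [(m.start(), m.end())] + rest
        pos = m.start() + 1  # backtrack: try a later occurrence of this motif


motifs = ["GSG", "P.E", "W[FY]"]  # hypothetical motifs 1, 2 and 3
print(match_fingerprint("MAGSGKLLPTEAAWFQ", motifs, max_gap=5))
print(match_fingerprint("MAWFKLLPTEAAGSGQ", motifs, max_gap=5))  # same motifs, wrong order
```

The second sequence contains all three motifs but in the wrong order, so it is rejected even though each motif matches individually.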
Sequence alignment
Define coverage: entire domain or whole protein
Use the entire alignment of the domain or protein family
Build model (Profile or HMMs)
Profile or HMM signature
Profiles & HMMs
Profiles
Start with a multiple sequence alignment
Amino acids at each position in the alignment are scored according to the frequency
with which they occur
Scores are weighted according to
evolutionary distance using a BLOSUM matrix
• Good at identifying homologues
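A minimal sketch of profile construction and scoring is given below. It departs from the slide in one labelled way: instead of BLOSUM-based weighting, it uses a flat pseudocount and a uniform background, which is an assumption made to keep the example short. The alignment reuses the four pattern sequences from the earlier slide (read as 8-residue sequences), and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column log-odds scores: log2(observed frequency / uniform background).

    Assumes an ungapped alignment of equal-length sequences.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, sequence):
    """Sum the column scores for a sequence of the same length as the profile."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must equal profile length")
    return sum(column[aa] for column, aa in zip(profile, sequence))


alignment = ["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"]
profile = build_profile(alignment)
print(score(profile, "ALVKLISG"))  # a member of the alignment
print(score(profile, "WWWWWWWW"))  # an unrelated sequence
```

Unlike a pattern, the profile gives every sequence a graded score, so a sequence with one unusual residue is penalised rather than rejected outright.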
HMMs
Start with a multiple sequence alignment
Amino acid frequencies at each position in the alignment, and the transition probabilities between positions, are encoded
Insertions and deletions are also modelled
Advantages
• Very good at identifying evolutionarily distant homologues
• Can model very divergent regions of alignment
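To illustrate how an HMM can model insertions and deletions, here is a heavily simplified profile-HMM sketch. The assumptions are significant: transition probabilities are fixed constants that do not depend on position or on the previous state (real profile HMMs estimate them from the alignment), insert states emit residues at background frequency, and the training alignment reuses the four 8-residue pattern sequences from the earlier slide. All names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)

# Assumed transition probabilities, identical at every position (sum to 1).
LOG_TO_MATCH = math.log(0.90)
LOG_TO_INSERT = math.log(0.05)
LOG_TO_DELETE = math.log(0.05)


def match_log_odds(alignment, pseudocount=1.0):
    """Log-odds emission scores for each match state (one per alignment column)."""
    columns = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        columns.append({
            aa: math.log((counts[aa] + pseudocount) / total / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return columns


def viterbi_log_odds(columns, sequence):
    """Best-path log-odds score of a sequence against the model.

    best[i][k] is the best score after consuming i residues and k columns.
    """
    n, length = len(sequence), len(columns)
    best = [[float("-inf")] * (length + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for k in range(length + 1):
            if i == 0 and k == 0:
                continue
            candidates = []
            if i > 0 and k > 0:  # match: consume a residue and a column
                candidates.append(best[i - 1][k - 1] + LOG_TO_MATCH
                                  + columns[k - 1][sequence[i - 1]])
            if i > 0:            # insert: consume a residue only (log-odds 0)
                candidates.append(best[i - 1][k] + LOG_TO_INSERT)
            if k > 0:            # delete: skip a column
                candidates.append(best[i][k - 1] + LOG_TO_DELETE)
            best[i][k] = max(candidates)
    return best[n][length]


columns = match_log_odds(["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"])
member = viterbi_log_odds(columns, "ALVKLISG")
with_insertion = viterbi_log_odds(columns, "ALVKLAISG")  # one extra residue
unrelated = viterbi_log_odds(columns, "WWWWWWWW")
print(member, with_insertion, unrelated)
```

The sequence with an extra residue pays one insertion penalty but still scores well above the unrelated sequence, which is the behaviour that lets HMMs tolerate divergent regions.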
Three different protein signature approaches
Patterns: single motif methods
Fingerprints: multiple motif methods
Profiles & hidden Markov models (HMMs): full alignment methods
Structural domains
Functional annotation of families/domains
Protein features (sites)
Hidden Markov Models, Fingerprints, Profiles, Patterns
HAMAP
The aim of InterPro
Family entry: description, proteins matched and more information.
Domain entry: description, proteins matched and more information.
Site entry: description, proteins matched and more information.
Protein sequences
What is InterPro?
• InterPro is an integrated sequence analysis resource
• It combines predictive models (known as signatures)
from different databases
• It provides functional analysis of protein sequences by
classifying them into families and predicting domains and
important sites
• First release in 1999
• 11 partner databases
• Adds annotation to UniProtKB/TrEMBL
• Provides matches to over 80% of UniProtKB
• Source of >85 million Gene Ontology (GO) mappings to >24 million distinct UniProtKB sequences
• 50,000 unique visitors to the web site per month
• >2 million sequences searched online per month, plus offline searches with the downloadable version of the software
Facts about InterPro
• Signatures are provided by member databases
• They are scanned against the UniProt database to see which
sequences they match
• Curators manually inspect the matches before integrating the
signatures into InterPro
InterPro signature integration process
InterPro curators
InterPro signature integration process
• Signatures representing the same entity are integrated together
• Relationships between entries are traced, where possible
• Curators add literature-referenced abstracts, cross-references to other databases, and GO terms
InterPro entry types
Family: Proteins that share a common evolutionary origin, as reflected in their related functions, sequences or structure. Ex. Telomerase family.
Domain: Distinct functional, structural or sequence units that may exist in a variety of biological contexts. Ex. DNA binding domain.
Repeats: Short sequences typically repeated within a protein. Ex. Tubulin binding repeats in microtubule associated protein Tau.
Sites (PTM, active site, binding site, conserved site): Ex. Phosphorylation sites, ion binding sites, tubulin conserved site.
Family relationships in InterPro:
Interleukin-15/Interleukin-21 family (IPR003443)
  Interleukin-15 (IPR020439)
    Interleukin-15 Avian (IPR020451)
    Interleukin-15 Fish (IPR020410)
    Interleukin-15 Mammal (IPR020466)
  Interleukin-21 (IPR028151)
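The family relationships above can be represented as a simple parent lookup. This is only a sketch: the parent-child links are read off the slide's diagram (an assumption about its layout), the accessions and names are the ones shown on the slide, and the function name `lineage` is hypothetical.

```python
NAMES = {
    "IPR003443": "Interleukin-15/Interleukin-21 family",
    "IPR020439": "Interleukin-15",
    "IPR020451": "Interleukin-15 Avian",
    "IPR020410": "Interleukin-15 Fish",
    "IPR020466": "Interleukin-15 Mammal",
    "IPR028151": "Interleukin-21",
}

# child accession -> parent accession (links assumed from the slide's diagram)
PARENT = {
    "IPR020439": "IPR003443",
    "IPR028151": "IPR003443",
    "IPR020451": "IPR020439",
    "IPR020410": "IPR020439",
    "IPR020466": "IPR020439",
}


def lineage(accession):
    """Return the accessions from the top-level family down to the given entry."""
    path = [accession]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))


for acc in lineage("IPR020466"):
    print(acc, NAMES[acc])
```

A protein matching the most specific entry (here the mammalian subfamily) is implicitly a member of every entry above it in the hierarchy.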
InterPro relationships: domains
Protein kinase-like domain
  Protein kinase domain
    Serine/threonine kinase catalytic domain
    Tyrosine kinase catalytic domain
Gene Ontology
• Allows cross-species and/or cross-database comparisons
• Unifies the representation of gene and gene product attributes across species
The Concepts in GO
1. Molecular Function
• Protein kinase activity
• Insulin receptor activity
2. Biological Process
• Cell cycle
• Microtubule cytoskeleton organisation
3. Cellular Component
GO:0003677 DNA binding
GO:0003721 telomeric template RNA reverse transcriptase activity
GO:0005634 Nucleus
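As a small sketch of how GO annotations might be organised in code, the three example terms can be grouped by GO concept. The aspect labels are assigned by hand here (DNA binding and the reverse transcriptase activity as molecular functions, nucleus as a cellular component), and the function name `group_by_aspect` is hypothetical.

```python
from collections import defaultdict

# (GO identifier, term name, aspect); aspects are assigned by hand for this example
TERMS = [
    ("GO:0003677", "DNA binding", "molecular_function"),
    ("GO:0003721", "telomeric template RNA reverse transcriptase activity",
     "molecular_function"),
    ("GO:0005634", "nucleus", "cellular_component"),
]


def group_by_aspect(terms):
    """Group 'GO:id name' strings under their GO aspect."""
    grouped = defaultdict(list)
    for go_id, name, aspect in terms:
        grouped[aspect].append(f"{go_id} {name}")
    return dict(grouped)


for aspect, entries in group_by_aspect(TERMS).items():
    print(aspect)
    for entry in entries:
        print("  " + entry)
```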
Summary
• Protein classification could help scientists to gain information about protein functions.
• BLAST is fast and easy to use but has its drawbacks.
• Alternative approach: protein signature databases build models (protein signatures) using different methods (patterns, fingerprints, profiles and HMMs).
• InterPro integrates these signatures from 11 member databases. It serves as a sequence analysis resource that classifies sequences into protein families and predicts important domains and sites.
Why use InterPro?
• Large amounts of manually curated data
• 35,634 signatures integrated into 25,214 entries
• Cites 38,877 PubMed publications
• Large coverage of protein sequence space
• Regularly updated
• ~ 8 week release schedule
• New signatures added
• Scanned against latest version of UniProtKB
Caution
We need your feedback!
• Missing/additional references
• Reporting problems
• Requests
• InterPro is a predictive protein signature database - results are predictions, and should be treated as such
• InterPro entries are based on signatures supplied to us by our member databases
....this means no signature, no entry!
EBI support page.
And one more thing…..
The InterPro Team:
Amaia Sangrador
Craig McAnulla
Matthew Fraser
Maxim Scheremetjew
Siew-Yit Yong
Alex Mitchell
Sebastien Pesseat
Sarah Hunter
Gift Nuka
Hsin-Yu Chang
www.ebi.ac.uk/interpro
Twitter: @InterProDB
Database | Basis | Institution | Built from | Focus | URL
Pfam | HMM | Sanger Institute | Sequence alignment | Family & domain based on conserved sequence | http://pfam.sanger.ac.uk/
Gene3D | HMM | UCL | Structure alignment | Structural domain | http://gene3d.biochem.ucl.ac.uk/Gene3D/
Superfamily | HMM | Uni. of Bristol | Structure alignment | Evolutionary domain relationships | http://supfam.cs.bris.ac.uk/SUPERFAMILY/
SMART | HMM | EMBL Heidelberg | Sequence alignment | Functional domain annotation | http://smart.embl-heidelberg.de/
TIGRFAM | HMM | J. Craig Venter Inst. | Sequence alignment | Microbial functional family classification | http://www.jcvi.org/cms/research/projects/tigrfams/overview/
Panther | HMM | Uni. S. California | Sequence alignment | Family functional classification | http://www.pantherdb.org/
PIRSF | HMM | PIR, Georgetown, Washington D.C. | Sequence alignment | Functional classification | http://pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml
PRINTS | Fingerprints | Uni. of Manchester | Sequence alignment | Family functional classification | http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
PROSITE | Patterns & Profiles | SIB | Sequence alignment | Functional annotation | http://expasy.org/prosite/
HAMAP | Profiles | SIB | Sequence alignment | Microbial protein family classification | http://expasy.org/sprot/hamap/
ProDom | Sequence clustering | PRABI : Rhône-Alpes Bioinformatics Center | Sequence alignment | Conserved domain prediction | http://prodom.prabi.fr/prodom/current/html/home.php
The BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences.