A brief on:
Domain Families & Classification
• The discovery of domains in protein structures
• Domains at the sequence level
• Examples of “Domain Resources”
• Domain fusion
• Supra-domains
• Signaling domains and cell function
• InterPro
Evolution by Protein Domains
Classification to Families
We can classify proteins into families by:
– A. Sequence (motifs; proteins)
– B. Structure
– C. Function (annotation)
– D. Evolution
Automatic
Large scale
Automatic
Large scale
Manual
High Quality
Manual
High Quality
Sequence Based Classification
• Proteins as a unit
• Proteins as a combination of domains (functional, structural, sequence)
The Goal:
1. New annotation, new family, family connections (sub/super) …
2. Predictive power (given a new unknown sequence)
Protein Multiple Alignment (Structurally supported)
Q: What is the best way to ‘represent’ this low sequence similarity of ~ 70 aa
Domains can be recognized through sequence similarity
Misannotation due to multidomain proteins
Smith and Zhang. Nat Biotechnol 1997 15:1222-3
Domain of known function
Domain of unknown function
kinase
Kinase-like
A
B
Kinase-like
A is similar to C, and C is similar to B, but A is not similar to B
Multidomain protein C
Annotation
Q: What is the best way to ‘represent’ this low sequence similarity of ~ 70 aa
‘Profile’ PSSM
Regular Expression
HMM
And more…
Multi domain protein families
Impossible to find ‘evolutionary relatedness’ without adding DOMAIN information…
• Domains are the evolutionary units of sequence that comprise the gene coding regions.
• Most genes are built from more than one domain.
• Novel genes can be created by recombination of domains into new domain arrangements.
How is a novel gene born?
From Glycolysis:
[Figure: Glycerone-P → (TIM) → Glyceraldehyde-3P → (GAPDH) → Glycerate-1,3P2 → (PGK1) → Glycerate-3P]
Separate genes: M. genitalium PGK, M. genitalium TIM, M. genitalium GAPDH
Fused genes: Thermotoga maritima PGK+TIM, Phytophthora infestans TIM+GAPDH
Correspondence between functional associations and genes
linked by the fusion method
False Transitivity of Local Alignment
Proteins: CSKP_HUMAN, DLG3_MOUSE, MPP3_HUMAN, K6A1_MOUSE
BLAST E-values shown in the figure: 8e-78, 2e-47, 9e-41, 1e-42
Pairwise similarities are all better than an E-value of 1e-40
If we cluster these proteins, assuming transitivity of local alignment scores, we will cluster K6A1_MOUSE with MPP3_HUMAN
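A minimal sketch of how this happens under single-linkage clustering. The hit pairs below are assumed for illustration only (the figure's actual pair-to-E-value mapping is not reproduced here); the point is that one multidomain protein can bridge two otherwise unrelated proteins.

```python
from itertools import combinations

# ASSUMPTION: illustrative "significant hit" pairs, not the figure's real data.
# CSKP_HUMAN plays the role of the multidomain bridge protein.
hits = {
    frozenset(("K6A1_MOUSE", "CSKP_HUMAN")),
    frozenset(("CSKP_HUMAN", "DLG3_MOUSE")),
    frozenset(("CSKP_HUMAN", "MPP3_HUMAN")),
    frozenset(("DLG3_MOUSE", "MPP3_HUMAN")),
}


def single_linkage(pairs):
    """Cluster proteins by treating every significant hit as transitive."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for pair in pairs:
        a, b = tuple(pair)
        parent[find(a)] = find(b)

    clusters = {}
    for protein in list(parent):
        clusters.setdefault(find(protein), set()).add(protein)
    return list(clusters.values())


clusters = single_linkage(hits)

# Pairs that end up in the same cluster without any direct hit between them.
unsupported = [
    pair
    for cluster in clusters
    for pair in combinations(sorted(cluster), 2)
    if frozenset(pair) not in hits
]
print(clusters)
print(unsupported)
```

Under these assumed hits, all four proteins fall into one cluster even though K6A1_MOUSE and MPP3_HUMAN have no direct hit; clustering at the domain level avoids this.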
Terms used:
Motif = Domain = Signature = Profile = Seed
Family = Cluster
These terms are used interchangeably; they are very (too) flexible.
Domain Classification
(intro to few systems)
Protein Sequence Domain Classification
DOMO
ADDA
EVEREST
InterPro
CDD
MetaFam
ProSite
Pfam
Blocks+
Profile
SBASE
TigrFam
eMotif
SMART
PRINTS
ProDom
Based on different principles and a different focus!
Integration: Data Fusion
InterPro 13,000 entries
Based on UniProt DB
Expert system Pfam
InterPro - >13,000 entries
2006 >8000
Sequence coverage Pfam-A: 75%
Sequence coverage Pfam-B: 19%
Other
Examples: complexity in domains
Identification ? Boundary ? Composition ?
Why domains and not proteins
Reducing false transitivity.
Exposing Mix and Match evolution
Immediate relevance to structural domain-families
Suggesting evolutionary ‘robust units’
Providing models for a family
Why automatic?
Overcoming large amounts of data
Unbiased identification of new families (even without an identified seed / without 3D structural information )
Domains are the building blocks of evolution: some facts...
Pyruvate kinase, PDB:1pkn
3 domains
Each occurs in diverse sets of protein families
Number of domains in proteins ranges from 1 up to tens
Structure-based domains are ~150 aa long. Length varies: some are very short (30-40 aa), others are long (> 500 aa)
Domain definition is somewhat blurred
Domain boundary is an unsolved problem
What is a domain? You know it when you see one
Automatic vs Manual
>13,000 entries
General approaches
• Motif-based databases
– Prosite, Prints, Blocks, eMotif, InterPro
• Domain-based databases
– Pfam, ProDom, Domo, Smart
• Manual/Semi-manual
– Prosite
• Semi-automatic
– Pfam, Smart
• Fully automatic
– ProDom, Blocks, Domo, eMotif
• Use different models (regular expressions, profiles, HMMs)
• Based on each other
Example of a semi-automatic resource
Pfam: Nucleic Acids Research, 2007, 1–8
1. The current release of Pfam (22.0) contains 9318 protein families, covering 73.2% of sequences and 50.8% of residues.
2. Pfam is now based on UniProtKB, NCBI GenPept and metagenomics projects.
3. ~500 new Pfam-A families for PDB sequences and SCOP entries, increasing the amino acid coverage!
4. Clans are built manually (supported by literature, SCOP, etc.): a total of 283 clans comprising 1808 Pfam-A families.
The Power of Integration
Pfam, Prosite, SMART, PRINTS, TigrFam, ProDom
InterPro
SCOP, CATH, FSSP
GO, ENZ, KEGG
TRANSFERASE (METHYLTRANSFERASE) 1adm
Proteins were found to have spatially distinct structural units
Structural domains provide a “clean” definition
In 1974, Michael Rossmann observed that structural domains can recur in different structural contexts
1ht0 – an alcohol dehydrogenase
1i0z – a lactate dehydrogenase
Rossmann fold
Domains can recur in multiple copies in the same protein
Fibronectin protein – 1fnf
A distinct, compact, and stable protein structural unit that folds independently of other such units.
Structural definition of domains
Recurrent domains in diphtheria toxin (1ddt)
The diphtheria toxin is made up of three domains, each of which is involved in a different stage of infection (receptor binding, membrane penetration,
and catalysis of ADP-ribosylation of elongation factor 2). A structural neighbor is depicted next to each domain of diphtheria toxin (middle).
Dominant domain fold types.
Holm and Sander. PROTEINS: Structure, Function, and Genetics 33:88–96 (1998)
701
1,110
1,940
44,327
SCOP – a structural classification of proteins
Updated from Murzin et al. J. Mol. Biol. 247, 536-540.
Families are in turn grouped into superfamilies where sequence similarity is still recognizable and basic biochemical properties are conserved. Superfamilies and families are monophyletic (derived from a common ancestor).
Sequence Biology predominantly proceeds by decomposing proteins into their domains
Protein sequence families are constructed at the domain level
Prosite
A dictionary of functional and structural motifs and domains
Valuable biological information on each family
Each motif/domain/family is represented as a regular expression, a rule or a profile
Models are generated from (usually published) multiple alignments, manually calibrated to ensure selectivity and sensitivity
Patterns do not always cover complete domains, whereas profiles usually span the whole domain
As of June 2002, it contains 1800 patterns and profiles describing 1200 families or domains
G-x(2,3)-[MLIV]-x-P-{K,H}-x(2)-C
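A minimal sketch of how such a pattern can be turned into an ordinary regular expression. The helper `prosite_to_regex` is hypothetical (not part of any Prosite tool), and it accepts the comma inside `{K,H}` as written above, although Prosite itself writes excluded residues without commas, as `{KH}`.

```python
import re

# One pattern element: optional '<' anchor, a core (x, residue, [allowed] or
# {forbidden}), an optional repeat count (n) or (n,m), optional '>' anchor.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z,]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a Prosite-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unrecognized pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1].replace(",", "") + "]"  # forbidden residues
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("G-x(2,3)-[MLIV]-x-P-{K,H}-x(2)-C")
print(regex)

# Made-up test sequences: the second has K at the position where K/H is forbidden.
print(bool(re.search(regex, "GAALAPDAAC")))
print(bool(re.search(regex, "GAALAPKAAC")))
```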
Pos  1     2     3     4     5     6     7     8     9     10    11
A    0     0.25  0.25  1     0.5   0     1     0.5   0     0.25  0
C    0     0     0.25  0     0     0     0     0     0.25  0.25  1
G    1     0.5   0     0     0     0.25  0     0.5   0.75  0.25  0
T    0     0.25  0.5   0     0.5   0.75  0     0     0     0.25  0
OR
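A minimal sketch of using the frequency matrix above as a profile: frequencies are converted to log-odds scores and a sequence is scanned for its best-scoring window. The uniform background (0.25) and the pseudocount (0.01) are assumptions added for illustration; they are not given on the slide.

```python
import math

# Position-specific frequencies copied from the matrix above (11 positions).
FREQ = {
    "A": [0, 0.25, 0.25, 1, 0.5, 0, 1, 0.5, 0, 0.25, 0],
    "C": [0, 0, 0.25, 0, 0, 0, 0, 0, 0.25, 0.25, 1],
    "G": [1, 0.5, 0, 0, 0, 0.25, 0, 0.5, 0.75, 0.25, 0],
    "T": [0, 0.25, 0.5, 0, 0.5, 0.75, 0, 0, 0, 0.25, 0],
}
WIDTH = 11
BACKGROUND = 0.25   # ASSUMPTION: uniform base composition
PSEUDOCOUNT = 0.01  # ASSUMPTION: avoids log(0) for unobserved bases


def log_odds(base: str, pos: int) -> float:
    """Log2 ratio of the smoothed profile frequency to the background."""
    p = (FREQ[base][pos] + PSEUDOCOUNT) / (1 + 4 * PSEUDOCOUNT)
    return math.log2(p / BACKGROUND)


def score(window: str) -> float:
    """Sum of per-position log-odds for a window of length WIDTH."""
    return sum(log_odds(base, i) for i, base in enumerate(window))


def best_window(seq: str):
    """Return (score, offset) of the best-scoring window in seq."""
    return max((score(seq[i:i + WIDTH]), i) for i in range(len(seq) - WIDTH + 1))


# Most frequent base per position (ties go to the first of A, C, G, T).
consensus = "".join(max("ACGT", key=lambda b: FREQ[b][i]) for i in range(WIDTH))
print(consensus, round(score(consensus), 2))
print(best_window("TTTT" + consensus + "TTTT"))
```

Unlike the regular expression, the profile gives every window a graded score, so weak but consistent matches are not rejected outright.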
From the SMART database
Detecting domains at the sequence level
Fusion link
Glycyl-tRNA Synthetase
E. coli: glyQ, glyS (separate genes)
C. trachomatis: CT796 (fused gene)
The fact that glyQ and glyS interact could have been predicted from the fusion protein CT796
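A minimal sketch of the fusion ("Rosetta stone") idea: if two components are separate genes in one genome but appear fused in a single gene elsewhere, the separate genes are predicted to be functionally linked. The function `fusion_links` and the component labels `GlyRS_alpha` / `GlyRS_beta` are hypothetical names used for illustration; a real analysis would derive the components from sequence similarity.

```python
def fusion_links(genomes):
    """Find separate genes in one genome whose components are fused elsewhere.

    genomes maps organism -> {gene name: set of component labels}.
    Returns tuples (organism, separate genes, fused organism, fused gene).
    """
    links = []
    for fused_org, fused_genes in genomes.items():
        for fused_gene, components in fused_genes.items():
            if len(components) < 2:
                continue  # not a composite gene
            for org, genes in genomes.items():
                if org == fused_org:
                    continue
                # Single-component genes that match one part of the fusion.
                singles = {
                    next(iter(comps)): gene
                    for gene, comps in genes.items()
                    if len(comps) == 1 and comps <= components
                }
                if set(singles) == components:
                    links.append(
                        (org, tuple(sorted(singles.values())), fused_org, fused_gene)
                    )
    return links


# ASSUMPTION: component labels are illustrative placeholders.
genomes = {
    "E. coli": {"glyQ": {"GlyRS_alpha"}, "glyS": {"GlyRS_beta"}},
    "C. trachomatis": {"CT796": {"GlyRS_alpha", "GlyRS_beta"}},
}

links = fusion_links(genomes)
print(links)
```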
InterPro
An integrated resource of protein sites and functional domains
The good thing about standards is that there are so many of them to choose from…
Introducing InterPro…
http://www.ebi.ac.uk/interpro/
InterPro entry for a zinc finger domain
• Search by taxonomy:
Example search results for the human protein Sirt1:
Alignment view.
HMM-Logo view.
iPfam: a database of domain-domain interactions based on PDB entries.
Notable advantages:
• Links from the leading databases: UniProt, PDB, InterPro.
• Manual curation of the division into families.
• HMM-based search for global and local sequences.
• A collection of the domain architectures in which the protein appears.
• Phylogenetic and taxonomic trees for finding known homologous proteins.
• Graphical display of the HMM and the alignment.
• Option to download the entire database.
Super-families of domains in InterPro (analogous to superfamilies in SCOP)
Some domains actually contain other domains!
Signaling and Multicellularity
One of the key problems of becoming a multicellular organism is solving the problem of cell signaling.
inactive active inactive
pkinase phosphatase
Phosphorylation can reversibly alter the activity of an enzyme through the combined action of a protein kinase and a protein phosphatase.
signal transduction
Tyrosine phosphorylation is a major mechanism of transmembrane signaling.
Pawson and Scott. Scientific American (2000)
Protein tyrosine kinases (PTKs) add phosphate to tyrosines
SH2 domains (Src-homology 2)
SH2 domains are modules of ~100 amino acids that bind to specific phosphotyrosine (pY)-containing peptide motifs
The Pawson Lab http://www.mshri.on.ca/pawson/domains.html
Pawson, T. et al., Trends in Cell Biology Vol.11 No.12 December 2001
The SH2 domain is found embedded in a wide variety of metazoan proteins that regulate functionally diverse processes.
Several modular domains have been identified that recognize specific sequences on their target acceptor proteins.
Protein modules for the assembly of signaling complexes
Pawson & Scott. Science (1997) 278 2075-2080
One way receptors may amplify their signaling is to use adaptor proteins that provide additional docking sites for modular signaling proteins.
Adaptor proteins
The Order of Domains in the Polypeptide Chains of Src and Abl, and Diagrams of Their Assembled, Autoinhibited States
In both cases, the SH3-SH2 clamp fixes the bilobed kinase domain in an inactive conformation. The domain color codes are SH3, yellow; SH2, green; kinase small lobe, dark blue; kinase large lobe, light blue. The activation loop in the large lobe is red. Connector, linker, and N- and C-terminal extensions are black. In Bcr/Abl, gene fusion has replaced the Abl cap by a long segment of Bcr.
Harrison, S. C. (2003). Cell, 112, 737–740.
Supra-domains in Src and Abl
A supra-domain is defined as a domain combination in a particular N-to-C-terminal orientation that occurs in at least two different domain architectures in different proteins with: (i) different types of domains at the N and C-terminal end of the combination; or (ii) different types of domains at one end and no domain at the other.
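A minimal sketch of this definition for combinations of adjacent domains. It encodes one reading of the two conditions (the flanking domains must differ between two occurrences, either on both sides or on one side with nothing on the other); the function names and the toy architectures with abstract domains A, B, X, Y, ... are assumptions for illustration, not the published procedure.

```python
from collections import defaultdict
from itertools import combinations


def flanking_contexts(architectures, size=2):
    """Map each ordered domain combination to its (N-side, C-side) neighbours.

    A neighbour of None means there is no domain on that side.
    """
    contexts = defaultdict(set)
    for domains in architectures.values():
        for i in range(len(domains) - size + 1):
            combo = tuple(domains[i:i + size])
            left = domains[i - 1] if i > 0 else None
            right = domains[i + size] if i + size < len(domains) else None
            contexts[combo].add((left, right))
    return contexts


def qualifies(ctx1, ctx2):
    """Check conditions (i) and (ii) for two occurrences of a combination."""
    (l1, r1), (l2, r2) = ctx1, ctx2
    # (i) different domains at both the N- and C-terminal ends
    both_ends = None not in (l1, r1, l2, r2) and l1 != l2 and r1 != r2
    # (ii) different domains at one end, no domain at the other
    n_side = r1 is None and r2 is None and None not in (l1, l2) and l1 != l2
    c_side = l1 is None and l2 is None and None not in (r1, r2) and r1 != r2
    return both_ends or n_side or c_side


def supra_domains(architectures, size=2):
    """Return the combinations that satisfy the definition (as read above)."""
    return {
        combo
        for combo, ctxs in flanking_contexts(architectures, size).items()
        if any(qualifies(a, b) for a, b in combinations(ctxs, 2))
    }


# ASSUMPTION: toy domain architectures, listed N-terminus to C-terminus.
architectures = {
    "protein1": ["A", "B", "X"],
    "protein2": ["A", "B", "Y"],
    "protein3": ["Z", "C", "D"],
    "protein4": ["Z", "C", "D", "W"],
}

print(supra_domains(architectures))
```

Here only the A-B combination qualifies: it is followed by different domains (X, Y) with nothing on its N-terminal side, whereas C-D always has the same N-terminal neighbour.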
Supra-domains
Evolutionary units larger than single domains
Vogel C. J Mol Biol. 2004 336 (3) :809-23
N-terminal end C-terminal end
Each represents a different domain architecture
Supra-domain of size 2 and 3
Chothia C. Science 2003 300: 1701-1703
Vogel C. J Mol Biol. 2004 336 (3): 809-23
The P-loop containing nucleotide triphosphate (NTP) hydrolase domain and the translation protein domain occur as one combination in several different translation factors.
This supra-domain occurs in 35 different domain architectures, and five of these are given here.
The building blocks: modular interaction domains in signal transduction
Pawson & Nash. Science (2003) 300 445-452