Proteomics: Analyzing protein space
Protein families
Why proteins?
• Shift of interest from “Genomics” to “Proteomics”
Classification of proteins into groups/families - what is it good for?
• Explosion in biological sequence data => need to organize!
• Understanding relations/hierarchy of groups is interesting as is,
e.g. in evolutionary research.
• For applied research :
– Annotation of new proteins : predicting their function,
structure, cellular localization etc.
– Looking for new folds
Sequence-based classification
• By sequence similarity (domains, motifs
or complete proteins) : Pfam, PROSITE,
SMART, InterPro etc.
• InterPro – synthesizes the data from Pfam,
PROSITE, Prints, ProDom, and SMART.
Considered the “best” domain-based classification
available.
Other kinds of classification
• Global classification:
– Systers, Protomap, CLUSTr
– MetaFam synthesizes global classification data
• By structure similarity: SCOP etc.
• By function: Albumin, RetNet, TumorGenes etc.
ProtoNet
• A long-term project at HUJI led by Michal & Nati Linial.
• Provides automatic global classification of the known proteins.
• Performs hierarchical clustering on a sequence-based metric space of proteins.
• Allows an external protein to be “placed” into the hierarchy.
http://www.protonet.cs.huji.ac.il
Why clustering?
• We want to refine the “similarity” notion, compared to e.g. BLAST
• Exploit transitivity to improve grouping
• Can use a low threshold on similarity:
- uses vast information from low similarities
- allowable because clustering filters noise
Why hierarchical?
• Vertical perspective
• Horizontal perspective
ProtoNet: Pre-Computation
• All-against-all gapped BLAST using BLOSUM62
• SwissProt release 40.28 database (114,033 proteins)
• BLAST identified ~2×10^7 relations between these proteins with relatively high sequence similarity, i.e. an E-score of 100 or less; the E-score of a pair serves as its distance $d(p_1, p_2)$
• Don’t want to lose information => very permissive threshold!
• But still far fewer than all ~6.5×10^9 protein pairs, which would be infeasible to handle
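The ~6.5×10^9 figure is the number of unordered protein pairs in the database. A quick check of that arithmetic, using only the two counts quoted on the slide (the variable names are illustrative):

```python
from math import comb

n_proteins = 114_033              # SwissProt release 40.28, as quoted on the slide
all_pairs = comb(n_proteins, 2)   # unordered pairs of distinct proteins
blast_relations = 2e7             # approximate number of relations kept (E-score <= 100)

fraction_kept = blast_relations / all_pairs

print(f"all pairs:     {all_pairs:.2e}")
print(f"fraction kept: {fraction_kept:.4f}")
```

Even with a very permissive threshold, only a fraction of a percent of all pairs has a recorded similarity, which is what makes the clustering tractable.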
Clustering Method
• First, each protein is considered a singleton cluster
• Next, we iteratively merge pairs of clusters
• At each step we choose to merge the ‘most similar’ pair of clusters
• As we progress, the number of singletons drops
• The clustering process gradually generates a tree of clusters
• Stop whenever we like
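The merging loop described above can be sketched as follows. This is a minimal illustration, not the ProtoNet implementation: the function name, the toy distances, and the use of the arithmetic mean of pairwise distances as the cluster score are all assumptions, and it recomputes every score at every step, which would not scale to a real database.

```python
from itertools import combinations

def mean_distance(a, b, dist):
    """Arithmetic mean of the distances between all cross-cluster pairs."""
    total = sum(dist[frozenset((p, q))] for p in a for q in b)
    return total / (len(a) * len(b))

def agglomerate(n, dist, stop_at=1):
    """Merge the closest pair of clusters until `stop_at` clusters remain.

    dist maps frozenset({i, j}) -> distance for every pair of the n items.
    Returns the merge history, which defines the tree of clusters.
    """
    clusters = [frozenset([i]) for i in range(n)]   # start from singletons
    merges = []
    while len(clusters) > stop_at:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: mean_distance(pair[0], pair[1], dist))
        merges.append((a, b, mean_distance(a, b, dist)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

# Toy data (hypothetical): items 0 and 1 are close, 2 and 3 are fairly close.
toy = {frozenset(pair): 10.0 for pair in combinations(range(4), 2)}
toy[frozenset((0, 1))] = 1.0
toy[frozenset((2, 3))] = 2.0

merges = agglomerate(4, toy)
for a, b, score in merges:
    print(sorted(a), sorted(b), score)
```

Raising `stop_at` corresponds to "stop whenever we like": the merge history up to that point is a forest rather than a single tree.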
How to merge?
• A potential merging score is calculated at each level for every pair of clusters that is a candidate for merging
• At the bottom level (two singletons) it equals $d(p_1, p_2)$
• Higher up, it is designed to reflect the similarity of the clusters
• It depends on the inter-cluster similarities of pairs of proteins, each from a different cluster
Potential merging score of $(C_1, C_2)$
• Arithmetic mean:
$$\frac{1}{|C_1|\,|C_2|} \sum_{(p_1,p_2) \in C_1 \times C_2} d(p_1, p_2)$$
• Geometric mean:
$$\Bigl( \prod_{(p_1,p_2) \in C_1 \times C_2} d(p_1, p_2) \Bigr)^{\frac{1}{|C_1|\,|C_2|}}$$
• Harmonic mean:
$$\Bigl( \frac{1}{|C_1|\,|C_2|} \sum_{(p_1,p_2) \in C_1 \times C_2} \frac{1}{d(p_1, p_2)} \Bigr)^{-1}$$
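The three merging scores differ only in how they average the cross-cluster distances. A minimal sketch, assuming strictly positive distances (the geometric and harmonic means are undefined for a distance of exactly 0); the function names and toy values are illustrative:

```python
import math

def cross_distances(c1, c2, d):
    """All distances d(p1, p2) with p1 in c1 and p2 in c2."""
    return [d(p1, p2) for p1 in c1 for p2 in c2]

def arithmetic_score(c1, c2, d):
    xs = cross_distances(c1, c2, d)
    return sum(xs) / len(xs)

def geometric_score(c1, c2, d):
    xs = cross_distances(c1, c2, d)
    # Averaging logarithms avoids overflow/underflow of a long product.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_score(c1, c2, d):
    xs = cross_distances(c1, c2, d)
    return len(xs) / sum(1.0 / x for x in xs)

# Toy distances (hypothetical values, not real E-scores).
table = {("a", "b"): 1.0, ("a", "c"): 4.0}
d = lambda p1, p2: table[(p1, p2)]

am = arithmetic_score(["a"], ["b", "c"], d)
gm = geometric_score(["a"], ["b", "c"], d)
hm = harmonic_score(["a"], ["b", "c"], d)
print(am, gm, hm)
```

For any set of positive values the harmonic mean is at most the geometric mean, which is at most the arithmetic mean, so the harmonic score is dominated by the strongest (smallest-distance) pairs while the arithmetic score is dominated by the weakest.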
Missing Data Treatment
• For a very-low-similarity pair (outside the ~2×10^7 BLAST relations), its length is defined as
$$d(p_1, p_2) = \mathrm{const} \cdot \max_{p_1', p_2'} d(p_1', p_2')$$
where the maximum is taken over the pairs that do have a recorded distance
• In practice, the merging process should finish when the weight of these “infinite” lengths in the score between new clusters becomes very large (the signal is lost)
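A minimal sketch of this default-distance rule, assuming the recorded BLAST relations are held in a dictionary keyed by unordered pairs. The slide does not give the value of the constant, so `const=10.0` below is purely hypothetical, as are the names and toy values:

```python
def make_distance(known, const=10.0):
    """Return d(p1, p2) that falls back to const * max(known distance).

    known maps frozenset({p1, p2}) -> recorded distance (e.g. an E-score).
    const is a hypothetical value; the slide only says 'const'.
    """
    ceiling = const * max(known.values())

    def d(p1, p2):
        # Pairs without a recorded relation get the large default length.
        return known.get(frozenset((p1, p2)), ceiling)

    return d

# Toy relations (hypothetical values).
known = {
    frozenset(("a", "b")): 1e-30,
    frozenset(("a", "c")): 50.0,
}
d = make_distance(known)

print(d("b", "a"))   # recorded pair, order does not matter
print(d("b", "c"))   # unrecorded pair, gets const * max
```

Because every unrecorded pair receives the same large value, a cluster score dominated by such pairs carries almost no information, which is the stopping signal mentioned above.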
Results: ProtoNet top 20
20 largest clusters in the ProtoNet (Arithmetic) tree at a preselected level
Problem of result assessment: what is a “good” cluster?
• Contains all proteins in the family and no proteins outside it
• But what is a family? Does any keyword define a family?
• Stable as the merging events occur (long lifetime)?
Problem of result assessment: what is a “good” tree?
• Should we trust the resulting forest?
– Which clustering technique is better? Combined?
– Bootstrap?
• Do the clusters correspond to meaningful families of
proteins?
– Validation against InterPro, SCOP etc.
– But we do not want to merely reconstruct them automatically!
• What is the right level/cut to look at the forest?
InterPro Validation
• InterPro annotation allows systematic validation of the generated clustering
• The ‘geometric’ method exhibits high cluster purity
– Corresponds to few false positives (FP)
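One simple way to quantify purity against an external annotation is the share of annotated cluster members that carry the cluster's most common label. The slides do not define purity, so this definition is an assumption; the sketch also simplifies by giving each protein a single label, whereas a real protein can carry several InterPro entries. All identifiers are made up.

```python
from collections import Counter

def cluster_purity(cluster, annotation):
    """Return (dominant label, fraction of annotated members carrying it).

    annotation maps protein id -> one family label; proteins without an
    annotation are ignored. Returns None if no member is annotated.
    """
    labels = [annotation[p] for p in cluster if p in annotation]
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Hypothetical proteins and family labels.
annotation = {"p1": "FAM_A", "p2": "FAM_A", "p3": "FAM_A", "p4": "FAM_B"}
cluster = ["p1", "p2", "p3", "p4", "p5"]   # p5 has no annotation

result = cluster_purity(cluster, annotation)
print(result)
```

Under this definition the annotated members that do not carry the dominant label are the false positives, so high purity means few of them.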
The Domain Problem
• Many proteins are composed of several domains
• The sequence similarity tools used are therefore local in nature:
– The score of comparing two sequences is the edit distance between their most similar subsequences
• This creates a false similarity problem: two proteins that share a single domain can score as highly similar
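To make the local nature of the score concrete, here is a minimal Smith-Waterman-style local alignment score. It is a sketch only: BLAST is a heuristic that uses a substitution matrix (BLOSUM62 in this deck) and affine gap penalties, whereas the match, mismatch and gap values below are hypothetical and the sequences are made up.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best score over all pairs of subsequences of a and b (linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment start anywhere: this is what
            # makes the score local rather than global.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

shared_only = local_alignment_score("WWWW", "WWWW")
with_flanks = local_alignment_score("AAAAWWWW", "CCCCWWWW")
unrelated = local_alignment_score("AAAA", "CCCC")
print(shared_only, with_flanks, unrelated)
```

The unrelated flanking regions do not lower the score at all: the two longer sequences score exactly as well as the shared segment on its own.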
The Modular Nature of Proteins
[Figure: domain architectures of CSKP_HUMAN, DLG3_MOUSE, K6A1_MOUSE and MPP3_HUMAN. Legend: Serine/Threonine protein kinase family active site; Protein kinase C-terminal domain; PDZ domain; SH3 domain; Guanylate kinase. E-scores shown in the figure: 8e-78, 2e-47, 9e-41, 1e-42.]
False Transitivity of Local Alignment
[Figure: the proteins CSKP_HUMAN, DLG3_MOUSE, MPP3_HUMAN and K6A1_MOUSE.]
We ran BLAST using default parameters:
All these pairwise similarities have an E-score better than 1e-40.
If we cluster these proteins, assuming transitivity of local alignment scores, we will cluster K6A1_MOUSE with MPP3_HUMAN.
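The effect can be reproduced with a toy model in which two proteins count as "similar" whenever they share at least one domain, and clusters are formed by following similarity links transitively. The domain sets below are a simplified, illustrative assignment chosen to mirror the situation on the slide; they are assumptions, not data read from the figure.

```python
# Simplified, illustrative domain assignments (assumed, not from the figure).
domains = {
    "K6A1_MOUSE": {"kinase"},
    "CSKP_HUMAN": {"kinase", "PDZ", "SH3", "guanylate_kinase"},
    "DLG3_MOUSE": {"PDZ", "SH3", "guanylate_kinase"},
    "MPP3_HUMAN": {"PDZ", "SH3", "guanylate_kinase"},
}

def locally_similar(a, b):
    """Stand-in for a strong local alignment hit: any shared domain."""
    return bool(domains[a] & domains[b])

def transitive_cluster(start):
    """Everything reachable from `start` through similarity links."""
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for other in domains:
            if other not in seen and locally_similar(current, other):
                seen.add(other)
                stack.append(other)
    return seen

cluster = transitive_cluster("K6A1_MOUSE")
print(sorted(cluster))
print(locally_similar("K6A1_MOUSE", "MPP3_HUMAN"))
```

In this toy model K6A1_MOUSE and MPP3_HUMAN end up in the same cluster through the multi-domain protein CSKP_HUMAN, even though the two share no domain themselves.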
Alternative methods
• Different types of clustering
– Non-binary
– Goal-oriented => semi-guided
– Graph theory insights
• Non-clustering ways of exploring the space of proteins
• Why BLAST E-score???
• Enrichment of the metric using structure