Page 1: Family based approaches Methods which also exploit non-homology based information Protein Annotations from Sequence Data Network based analyses

Protein Annotations from Sequence Data

Family based approaches

Methods which also exploit non-homology based information

Network based analyses

Page 2

Homology based inference of protein functions

[Figure: an ancestral protein (a) gives rise to paralogues (a, b) by duplication and to orthologues (a in species 1, a in species 2) by speciation]

orthologues - often have very similar functions

paralogues - may have related functions

Page 3

MKLNSHHIASNYEASKNFVNILQFEIRENYRSDKDSYKLDMVGSEQYASYP….

search for orthologues

search family resources (orthologues and paralogues)

analyse residue features to predict transmembrane, localisation etc

predict protein interactions

search for conserved residues

Page 4

MKLNSHHIASNYEASKNFVNILQFEIRENYRSDKDSYKLDMVGSEQYASYP….

search for orthologues: HAMAP, eggNOG, COGs, KOGs

search family resources (orthologues and paralogues)

analyse residue features to predict transmembrane, localisation etc

predict protein interactions

search for conserved residues

Page 5

HAMAP families

• Orthologous protein families used for High-quality Automated and Manual Annotation of microbial Proteomes in UniProtKB.

• 1,448 families from Bacteria, Archaea and Plastids, covering over 180,000 UniProtKB/Swiss-Prot entries, available at: http://www.expasy.org/sprot/hamap/families.html

Anne-Lise Veuthey, SIB

Page 6

HAMAP pipeline

UniProtKB/TrEMBL

Profile

Automatic retrieval of sequences matching the profile

HAMAP family rules

Automated annotation

Manual checking of warnings given by the system

Page 7

MKLNSHHIASNYEASKNFVNILQFEIRENYRSDKDSYKLDMVGSEQYASYP….

search for orthologues

analyse residue features to predict transmembrane, localisation etc

predict protein interactions

search for conserved residues

search family resources: SMART, ProtoNet, EVEREST, Gene3D, CATH, InterPro, Pfam, TIGR, PRINTS, SCOP

Page 8

(1) Cluster 4.5 million sequences (510 completed genomes) into protein superfamilies using the APC clustering algorithm

(2) Map domains onto the sequences using HMM technology (CATH & Pfam domains)

335,000 protein superfamilies (orthofams); 189,000 have >5 sequences; 19% are singletons

~11,000 domain superfamilies (2100 from CATH, of known structure, account for ~85% of domains)

BLAST, APC

CATH, Pfam HMM libraries

Page 9

Gene3D - OrthoFams

Functional annotation of selected node

Root Node

30% ID

95% ID

335,000 protein families built using Affinity Propagation Clustering.

Annotated with FunCat, HAMAP, EC, KEGG, GO, IntAct, HPRD, and others.

Benchmarking - 99.9% map to a single HAMAP family

Page 10

Functional Catalogue (FunCat)

• Organized hierarchically with up to six levels.

• ~1307 categories

• Currently 9 organisms incorporated: yeast, human, A. thaliana, …

Dmitrij Frishman, GSF

Page 11

ProtoNet and EVEREST family resources

[Chart: Number of Entries (log scale, 100,000 to 100,000,000) across sequence databases (labels include SWP, U50 and U90); annotations: ProtoNet 5.1; EVEREST 2.0; 2.5M sequences]

Michal Linial, HUJI

Page 12

[Figure: tree with a root and nodes labelled A1-A12, B1-B44, E1, E2 and AE]

2.5M sequences from UniProt

UPGMA efficient clustering algorithm

Benchmarked against Pfam, SCOP
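The UPGMA clustering named on this slide can be illustrated with a minimal sketch. This is a textbook UPGMA on a toy distance table, not ProtoNet's own implementation; the function name `upgma`, the input format and the example distances are assumptions made for illustration.

```python
from itertools import combinations


def upgma(names, distances):
    """Textbook UPGMA: returns the tree as nested tuples of leaf names.

    names     -- list of leaf labels (hypothetical sequence identifiers)
    distances -- dict mapping (a, b) pairs of labels to a distance
    """
    sizes = {name: 1 for name in names}  # active cluster -> number of leaves
    dist = {frozenset(pair): d for pair, d in distances.items()}
    while len(sizes) > 1:
        # Pick the closest pair of active clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        for other in sizes:
            if other in (a, b):
                continue
            # Size-weighted average of the two old distances (the UPGMA rule).
            dist[frozenset((merged, other))] = (
                sizes[a] * dist[frozenset((a, other))]
                + sizes[b] * dist[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes[a] + sizes[b]
        del sizes[a], sizes[b]
    return next(iter(sizes))


toy = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
tree = upgma(["A", "B", "C", "D"], toy)
```

Each internal node of the resulting tree corresponds to a cluster, which is what allows annotations to be examined at several levels of granularity.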

Page 13

ProtoName: safe inference of annotation

(a) For each cluster, an annotation is assigned an Annotation Score if proteins achieve p-value <= 0.001

(b) Only clusters with > 5 proteins are considered

(c) Purity is >0.9 (TP/(TP+FN))

(d) Combination of functional keywords

For each protein, assign the annotations of its cluster and all parents

>40% of the clusters and 65% of proteins assigned a safe ProtoName
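The criteria above, together with the rule that a protein receives the annotations of its cluster and all parents, can be sketched as follows. The `Cluster` class, the function names and the way p-values and purity are stored are hypothetical; only the three thresholds (p-value <= 0.001, more than 5 proteins, purity > 0.9) come from the slide, and criterion (d) is not modelled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cluster:
    name: str
    size: int                                   # number of proteins in the cluster
    parent: Optional["Cluster"] = None
    # Hypothetical layout: keyword -> (p_value, purity)
    candidates: dict = field(default_factory=dict)


def safe_annotations(cluster, max_p=0.001, min_size=5, min_purity=0.9):
    """Keywords of one cluster that pass the slide's three thresholds."""
    if cluster.size <= min_size:
        return []
    return sorted(
        keyword
        for keyword, (p_value, purity) in cluster.candidates.items()
        if p_value <= max_p and purity > min_purity
    )


def protein_annotations(cluster):
    """Annotations for a protein: those of its cluster and of all parents."""
    names = []
    node = cluster
    while node is not None:
        for keyword in safe_annotations(node):
            if keyword not in names:
                names.append(keyword)
        node = node.parent
    return names


root = Cluster("root", 100, None, {"hydrolase": (1e-6, 0.95)})
child = Cluster("child", 12, root,
                {"serine protease": (1e-4, 0.93), "kinase": (0.01, 0.99)})
tiny = Cluster("tiny", 4, child, {"unreliable": (1e-9, 1.0)})
```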

Page 14

protein superfamily

~11% of protein superfamilies in a genome are common to all kingdoms

Page 15

protein superfamily

~11% of protein superfamilies in a genome are common to all kingdoms

common domains

nearly 60% of domains are from ~200 superfamilies common to all major kingdoms

these have been combined in different ways to modulate function

Page 16

Evolution of functional subfamilies within superfamilies

[Figure: species tree built on the small subunit (SSU) ribosomal RNA, with a root and nodes labelled A1-A12, B1-B44, E1, E2 and AE; occurrences of a superfamily are marked (+) on the tree, annotated with COG functional categories]

Page 17

Percentage frequencies of functional shifts within domain superfamilies

Function is predominantly conserved within the same COG functional subcategory or major category

However, there are clearly cases of major functional shifts

Axes: Parent Functions; Child Functions

    [J] [L] [K] [A] [T] [M] [N] [O] [U] [D] [V] [E] [C] [H] [P] [F] [G] [I] [Q] [R] [S]
[J] 29 1 0 0 1.9 0 0 0 0 9 0 0 0 2 0 2 0 0 0 0.8 7
[L] 5 41 2.5 100 0 0 0 10 0 18 0 0.7 0 0 0 2 0 3 0 1.5 0
[K] 5 3 40 0 21 0 0 3 12 0 0 5.2 1 2 5 4 9 0 0 1.5 7
[A] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[T] 5 3 20 0 62 1.6 50 3 6 0 0 1.5 1 0 0 0 5 0 6 2.3 7
[M] 5 0 1.2 0 0 46 0 3 0 0 0 2.2 2 5 5 6 9 12 0 4.6 0
[N] 0 1 1.2 0 3.8 0 20 0 18 0 0 0 0 0 0 0 0 0 0 1.5 0
[O] 1 2 1.2 0 0 0 0 40 12 9 0 1.5 3 0 3 6 0 0 6 0.8 0
[U] 0 1 1.2 0 0 0 30 3 24 0 0 1.5 0 0 0 0 0 0 0 0.8 0
[D] 0 1 0 0 0 0 0 5 0 18 0 0 0 0 0 0 0 0 0 0 0
[V] 1 1 0 0 0.9 3.2 0 3 0 0 67 0 0 0 0 0 0 0 0 0 0
[E] 9 4 7.4 0 2.8 4.8 0 0 0 9 0 27 3 7 0 12 2 3 6 5.4 0
[C] 0 1 0 0 0 3.2 0 3 0 18 0 13 41 17 10 10 4 12 0 16 7
[H] 5 1 0 0 0 0 0 5 6 0 0 5.2 5 25 10 2 9 3 6 3.8 7
[P] 1 9 2.5 0 0 0 0 5 0 0 0 0.7 11 13 41 2 7 3 0 3.1 0
[F] 4 0 1.2 0 1.9 4.8 0 0 0 0 0 1.5 4 2 0 26 0 0 0 1.5 0
[G] 0 2 3.7 0 0.9 9.5 0 3 12 0 0 1.5 6 5 0 2 33 3 0 4.6 13
[I] 7 0 0 0 0 3.2 0 3 0 9 0 2.2 3 0 0 2 2 18 6 2.3 0
[Q] 0 1 1.2 0 1.9 1.6 0 3 0 0 0 2.2 1 2 10 2 2 30 41 2.3 0
[R] 16 19 8.6 0 0.9 14 0 8 12 0 0 25 11 20 10 22 14 9 24 39 47
[S] 4 8 7.4 0 1.9 7.9 0 5 0 9 33 8.9 6 2 5 0 5 3 6 7.7 7

Total number of shifts 75 91 81 2 106 63 10 40 17 11 3 135 97 60 39 50 57 33 17 130 15

Highlighted category groups: metabolism; signal transduction; protein biosynthesis; poorly characterised

Page 18

Superfamily Variation: Structure/Sequence

[Plot: Structural Diversity against Sequence Families; legends: Population in genomes (x 1000); 0-25, 26-50, 51-100, 101-200 and 201+ GO Terms; labelled superfamilies: 3.40.50.720, 3.40.50.300, 3.40.50.150, 2.60.40.10, 1.10.10.10, 2.40.50.140]

~2000 superfamilies

<10% of domain superfamilies (<200) are highly expanded in the genomes and functionally very diverse

Page 19

N-fold increase in functional annotation using pairwise sequence identity thresholds

[Bar chart: N-fold increase in coverage (0 to 8) for Gene3D (6.8%), H.sapiens (5%), A.thaliana (2.7%), C.elegans (1.1%) and B.anthracis (3.7%); series: general thresholds (Domain - 50/80 and 40/80 cut-offs if identical MDA) and family specific thresholds (Domain - Family specific cut-off)]

>50% sequence identity - 90% probability of having related functions

If the domains have the same multidomain context

>30% sequence identity – 90% probability of having related functions
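The two general thresholds stated above can be written as a small decision rule. This is only a sketch of the rule of thumb on the slide; the function name and the strict "greater than" comparison are assumptions, and the family specific thresholds mentioned in the chart are not modelled.

```python
def may_inherit_function(identity_percent, same_mda):
    """Rule of thumb from the slide for transferring a functional annotation.

    identity_percent -- pairwise sequence identity between the two domains
    same_mda         -- True if both domains share the same multidomain context
    """
    # >30% suffices when the multidomain context is the same, otherwise >50%.
    threshold = 30.0 if same_mda else 50.0
    return identity_percent > threshold
```

For example, a pair at 40% identity would qualify only when both relatives have the same multidomain context.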

Page 20

Some superfamilies contain multiple diverse functional subfamilies

Page 21

MKLNSHHIASNYEASKNFVNILQFEIRENYRSDKDSYKLDMVGSEQYASYP….

search for orthologues

analyse residue features to predict transmembrane, localisation etc

predict protein interactions

search family resources (orthologues and paralogues)

search for conserved residues: TreeDet, ScoreCons, GEMMA, ETtrace, SCI-PHY, FunShift

Page 22

Identify functional subfamilies by using information on sequence conserved residue positions

Scorecons - Thornton; TreeDet - Valencia

multiple sequence alignment of relatives from functional subfamily

Score conservation for each position in the alignment using an entropy measure

1 = highly conserved

0 = unconserved

Putative functional site

Structural model
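The entropy-based conservation score described on this slide (1 = highly conserved, 0 = unconserved) can be sketched as a normalised Shannon entropy per alignment column. This is not the Scorecons implementation; the function names, the gap handling and the normalisation by the maximum attainable entropy are assumptions chosen so that the score falls between 0 and 1.

```python
import math
from collections import Counter


def column_conservation(column, alphabet_size=20):
    """1 - normalised Shannon entropy of one alignment column (gaps ignored)."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Largest entropy reachable with n residues over the given alphabet.
    max_entropy = math.log(min(n, alphabet_size))
    if max_entropy == 0:
        return 1.0
    return 1.0 - entropy / max_entropy


def alignment_conservation(alignment):
    """Score every column of an alignment given as equal-length strings."""
    return [column_conservation(col) for col in zip(*alignment)]


scores = alignment_conservation(["ACDE", "ACDF", "ACEG", "ATFH"])
```

In the toy alignment the first column is invariant and scores 1, while the last column has a different residue in every sequence and scores 0.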

Page 23

Phylogenetic trees derived from multiple sequence alignments can be used to identify functional subfamilies

TreeDet - Valencia

SCI-PHY - Sjolander

FunShift - Sonnhammer

ETtrace - Lichtarge

Page 24

TreeDet method for identifying functional subfamilies

Alfonso Valencia group, CNIO

Page 25

domain superfamily

GEMMA: compares sequence profiles (HMMs) between subfamilies using the COMPASS method

sequence subfamily 90% seq. id

putative functional subfamily

clusters sequence relatives predicted to have related functions

Page 26

GeMMA v SCI-PHY using gold standard Babbitt benchmark of 5 large curated superfamilies

[Four bar charts comparing SCI-PHY and GeMMA on the Amidohydrolase, Crotonase, Enolase, Haloacid dehalogenase and Vicinal oxygen chelate superfamilies. Panels: Purity (high is best); Edit distance (low); VI distance (low is best); Deviation from no. singletons (low)]

Page 27

Functional annotation coverage using different strategies

Annotation (EC number) coverage of MEGA family 3.90.1200.10

[Bar chart: coverage of superfamily (%) (0 to 80) by source of annotation: Database annotations; Annotations inherited within S60 clusters; Annotations inherited within GeMMA functional subfamilies. Callouts: experimental annotations; inherit functions at 50% seq. id.; inherit functions by GEMMA]

Page 28

MKLNSHHIASNYEASKNFVNILQFEIRENYRSDKDSYKLDMVGSEQYASYP….

search for orthologues

search family resources

(orthologues and paralogues)

analyse residue features to predict transmembrane: MEMSAT, TMHMM, ENSEMBLE, PONGO

predict protein interactions

search for conserved residues

analyse residue features to predict disorder, signal peptides, localisation: BaCelLo, DISOPRED, FFPred

Page 29

A ’biological’ hydrophobicity scale

(Hessa et al., Nature 433:377 & 450:1026; Bernsel et al. PNAS in press)

Gunnar von Heijne, STO

Page 30

Pongo annotation engine

Seven predictors at the core:

all-α TM topology: (a) TMHMM 2.0 (b) MEMSAT (c) PRODIV (d) ENSEMBLE (e) ENSEMBLE 2.0 (f) TMHMM DOMFIX

signal peptide: (a) SPEP

Rita Casadio, UNIBO

http://pongo.biocomp.unibo.it/pongo

Page 31

Performance of the high scoring methods on the 121 high-resolution chains (from PDB)

Method          Q topography     Q topology
TMHMM           88/121 (73%)     67/121 (55%)
TMHMM DOMFIX    87/121 (72%)     74/121 (61%)
PRODIV          99/121 (82%)     93/121 (77%)
MEMSAT          93/121 (77%)     90/121 (74%)
ENSEMBLE 1.0    105/121 (87%)    92/121 (76%)
ENSEMBLE 2.0    105/121 (87%)    95/121 (79%)

Correct Topography: Correct position of TM helices along the sequence

Correct Topology: Correct position AND correct orientation with respect to the membrane plane
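The two definitions above can be turned into a small scoring sketch. The slide does not say how "correct position" was judged, so the overlap criterion used here (same number of helices, each overlapping its counterpart by at least `min_overlap` residues) is an assumption, as are the function names and the dictionary layout.

```python
def helices_agree(predicted, observed, min_overlap=5):
    """Assumed position test: same helix count and sufficient overlap each."""
    if len(predicted) != len(observed):
        return False
    for (p_start, p_end), (o_start, o_end) in zip(predicted, observed):
        overlap = min(p_end, o_end) - max(p_start, o_start) + 1
        if overlap < min_overlap:
            return False
    return True


def q_scores(predictions, references, min_overlap=5):
    """Fractions of chains with correct topography and correct topology."""
    topography = topology = 0
    for pred, ref in zip(predictions, references):
        if helices_agree(pred["helices"], ref["helices"], min_overlap):
            topography += 1
            # Topology additionally needs the right orientation (N-terminus side).
            if pred["n_term"] == ref["n_term"]:
                topology += 1
    total = len(references)
    return topography / total, topology / total


references = [
    {"helices": [(10, 30), (45, 65)], "n_term": "in"},
    {"helices": [(5, 25)], "n_term": "out"},
]
predictions = [
    {"helices": [(12, 31), (44, 66)], "n_term": "out"},  # positions right, orientation wrong
    {"helices": [(6, 24)], "n_term": "out"},
]
q_topography, q_topology = q_scores(predictions, references)
```

Because topology requires everything topography does plus the orientation, Q topology can never exceed Q topography, which matches every row of the table.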

Page 32

The PONGO engine:

http://pongo.biocomp.unibo.it

Amico M, Finelli M, Rossi I, Zauli A, Elofsson A, Viklund H, von Heijne G, Jones D, Krogh A, Fariselli P, Martelli PL, Casadio R. PONGO: a web server for multiple predictions of all-alpha transmembrane proteins. Nucleic Acids Res 34 (Web server issue): 169-172 (2006)

Page 33

CBS prediction servers

Broad range of prediction servers

Amino acid sequence based methods within:

Protein sorting

Post-translational modifications of proteins

Protein function and structure

Immunological features

Local protein features, e.g. “kinase-specific phosphorylation site”, “nuclear export signal”, “propeptide cleavage site”

Global properties, e.g. “cell cycle regulated”, “secreted via a non-classical pathway”, “member of the nucleolar subproteome”, GO categories, EC categories, ...

Soren Brunak, DTU

Page 34

FFPred: An Integrated Feature based Function Prediction Server for Vertebrate Proteomes

Inferring function using patterns of native disorder in proteins. Lobley, A.E., Swindells, M.B., Orengo, C.A. & Jones, D.T. (2007) PLoS Comput. Biol. 3:e162.

[Diagram: Novel sequence (amino acid sequence); Characteristics (aa, sec. str., disorder, motifs, localisation, transmem); Classification (GO Term SVM); posterior probability estimate]

> 300 GO Term Classifiers for both Molecular Function and Biological Process categories

David Jones, UCL

Page 35

Protein Annotations from Sequence Data

Network based analyses

Page 36

CORUM: the comprehensive resource of mammalian protein complexes

No. of proteins / protein complexes

• consists of 2100 protein complexes

• covers ~3000 different proteins, representing 15% of protein coding genes in mammals

Dmitrij Frishman, GSF

Page 37

MKLNSHHIASNYEASKNFVNILQFEIRENYRSDKDSYKLDMVGSEQYASYP….

search for orthologues

search family resources

(orthologues and paralogues)

analyse residue features to predict transmembrane, disorder etc

search interactions resources: CORUM, IntAct, HPRD, BIND

search for conserved residues

predict interactions: STRING, DIMA, G3D-BioMiner, PROLINKS

Page 38

Gene3D-BioMiner

hiPPI: homology inherited Protein-Protein Interactions

CODA: Co-Occurrence of Domains Analysis

GECO: Gene Expression Correlation

PhyloTuner: Domain family co-evolution detection

Visualisation in Cytoscape

Adding known functional associations, e.g. from FunCat.

Weighted Integration

Page 39

CODA: FUSED DOMAINS

Species 1

Species 2

Method adapted from Enright and Ouzounis, but a new scoring scheme has been developed

BioMiner
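The fused-domain idea behind CODA can be sketched as follows: two domain families found in separate proteins in one species are linked when they occur fused in a single protein in another species. This is a bare illustration of the detection step only; the slide's new scoring scheme is not described and is not reproduced, and the function name and data layout are hypothetical.

```python
from itertools import combinations


def fused_domain_links(species1, species2):
    """Link species 1 proteins whose domains are fused in a species 2 protein.

    Both arguments map a protein identifier to its set of domain families.
    Returns tuples (protein_p, protein_q, domain_in_p, domain_in_q).
    """
    # All domain pairs that co-occur within one protein of species 2.
    fused = set()
    for domains in species2.values():
        fused.update(combinations(sorted(domains), 2))

    links = []
    for (p, dp), (q, dq) in combinations(sorted(species1.items()), 2):
        for a in sorted(dp):
            for b in sorted(dq):
                separate = a != b and a not in dq and b not in dp
                if separate and tuple(sorted((a, b))) in fused:
                    links.append((p, q, a, b))
    return links


species_1 = {"P1": {"A"}, "P2": {"B"}, "P3": {"C"}}
species_2 = {"Q1": {"A", "B"}, "Q2": {"C"}}
links = fused_domain_links(species_1, species_2)
```

Here P1 and P2 are linked because their domains A and B are fused in protein Q1 of the second species, while P3 receives no link.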

Page 40

HiPPI: Protein-protein physical interaction data

Homology Inferred Protein-Protein Interactions

Inherit data provided by HPRD, IntAct, BIND, CORUM

Superfamily A Superfamily B
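Inheriting interactions by homology can be sketched as follows: if a known interaction joins a member of one family to a member of another, query proteins belonging to the same two families are predicted to interact. This is a simplified illustration, not the HiPPI method itself, which may apply further criteria such as sequence identity cut-offs; all names and the data layout are hypothetical.

```python
def inherit_interactions(known_ppis, family_of, query_proteins):
    """Predict interactions among query proteins from family-level evidence.

    known_ppis     -- iterable of (protein, protein) pairs from interaction databases
    family_of      -- dict mapping every protein to its family identifier
    query_proteins -- proteins of the organism being annotated
    """
    # Family pairs supported by at least one known interaction.
    family_pairs = {
        frozenset((family_of[a], family_of[b]))
        for a, b in known_ppis
        if a in family_of and b in family_of
    }
    queries = sorted(p for p in query_proteins if p in family_of)
    predicted = []
    for i, p in enumerate(queries):
        for q in queries[i + 1:]:
            if frozenset((family_of[p], family_of[q])) in family_pairs:
                predicted.append((p, q))
    return predicted


families = {"h1": "FamA", "h2": "FamB", "y1": "FamA", "y2": "FamB", "y3": "FamC"}
predicted = inherit_interactions([("h1", "h2")], families, ["y1", "y2", "y3"])
```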

Page 41

Eisenberg Phylogenetic Profiles for Detecting Functional Associations

presence or absence of superfamily in organism

Superfamily      Organism 1   2   3   4
Superfamily 1             1   0   1   0
Superfamily 2             1   0   1   0
Superfamily 3             0   0   1   1

Functionally Linked

Gene3D Phylogenetic Occurrence Profiles

number of sequence relatives from superfamily in organism

CATH Domain Superfamily   Organism 1   2   3   4
Superfamily 1                     35   0  12  60
Superfamily 2                     12  13  14  11
Superfamily 3                      6   0   0   0

Page 42

Eisenberg Phylogenetic Profiles for Detecting Functional Associations

presence or absence of superfamily in organism

Superfamily      Organism 1   2   3   4
Superfamily 1             1   0   1   0
Superfamily 2             1   0   1   0
Superfamily 3             0   0   1   1

Functionally Linked

Gene3D PhyloTuner Occurrence Profiles

number of sequence relatives from superfamily in organism

CATH Domain Superfamily   Organism 1   2   3   4
Superfamily 1                      7   0   3   0
Superfamily 2                      3   6   4   5
Superfamily 3                      6   0   2   0

Ranea et al. (2007) PLOS Comp. Biol.

Page 43

Phylo-Tuner algorithm

Domain Superfamilies clustered at different levels of sequence identity:

Sup. S30 S35 S40 S50 … (S100)

Phylogenetic Occurrence Profile Matrix

Cluster Level   Genome Occurrence
                Sp1  Sp2  Sp3
Superfam.       8    7    3
s30(a)          6    4    2
s30(b)          2    3    1
s35(a)          6    4    2
s35(b)          2    3    1
s40(a)          6    4    2
s40(b)(i)       0    3    0
s40(b)(ii)      2    0    1
s50(a)          6    4    2
s50(b)(i)       0    3    0
s50(b)(ii)      2    0    1
…               …    …    …

Page 44

Phylo-Tuner

Superfamily X: Sup. S30 S35 S40 S50 … (S100)

            Sp1  Sp2  Sp3  Sp4  Sp5  …  Spn
Cluster 1   6    0    6    9    5    …  9
Cluster 2   4    3    7    5    3    …  5
Cluster 3   1    0    1    0    2    …  1
Cluster 4   0    2    0    0    1    …  6
Cluster 5   1    4    1    4    1    …  4
Cluster 6   0    3    5    2    0    …  1
Cluster 7   4    8    4    8    4    …  8
.           .    .    .    .    .    …  .
Cluster n   0    1    0    1    1    …  0

Superfamily Y: Sup. S30 S35 S40 S50 … (S100)

            Sp1  Sp2  Sp3  Sp4  Sp5  …  Spn
Cluster 1   3    0    6    0    4    …  10
Cluster 2   4    3    7    5    6    …  5
Cluster 3   0    0    1    0    2    …  1
Cluster 4   1    2    1    0    1    …  6
Cluster 5   1    4    0    4    1    …  4
Cluster 6   0    4    5    2    0    …  1
Cluster 7   2    6    4    8    4    …  7
.           .    .    .    .    .    …  .
Cluster n   0    1    0    1    1    …  0

Euclidean distance: $E_{match} \ll E_{all\_rest}$

Zs calculations: $Z_s = (X_i - \bar{x}) / \sigma_x$
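The comparison on this slide, Euclidean distances between occurrence profiles of clusters from two superfamilies followed by a Z-score, can be sketched as below. The Z-score is taken here as a standard score of each distance against all cross-superfamily distances, which is an assumption about the slide's "Zs calculations"; function names and the toy profiles are hypothetical.

```python
import math
from statistics import mean, pstdev


def euclidean(p, q):
    """Euclidean distance between two occurrence profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def profile_z_scores(profiles_x, profiles_y):
    """Z-score of every cross-superfamily cluster pair's profile distance."""
    distances = {
        (i, j): euclidean(px, py)
        for i, px in profiles_x.items()
        for j, py in profiles_y.items()
    }
    mu = mean(distances.values())
    sigma = pstdev(distances.values())
    if sigma == 0:
        return {pair: 0.0 for pair in distances}
    return {pair: (d - mu) / sigma for pair, d in distances.items()}


# Toy profiles: counts of relatives in five species for each cluster.
clusters_x = {"x1": [6, 0, 6, 9, 5], "x2": [0, 2, 0, 0, 1]}
clusters_y = {"y1": [6, 0, 6, 9, 5], "y2": [1, 4, 0, 4, 1]}
z = profile_z_scores(clusters_x, clusters_y)
best_pair = min(z, key=z.get)
```

A strongly negative Z-score marks a pair of clusters whose profiles are much closer to each other than to all the rest, which is the signal the slide summarises as the matching distance being far smaller than the others.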

Page 45

Highly similar profiles correspond to pairs of families with significant similarity in GO functions

[Chart: true positives, false positives and the ratio of true positives to false positives, in score bins of width 0.5 running from -3.5 to 3.5; GO category: Biological process]

Ranea et al. (2007) PLOS Comp. Biol.

Page 46

Performance of Gene3D-BioMiner integrated methods assessed using a yeast genome dataset and semantic similarity of GO terms

[Chart: # hits (log scale, 1 to 100000) against Precision (0.8 to 1); series: GEC, CODA-CATH, CODA-Pfam, hiPPI, Fisher W integration]

Page 47

Phylogenetic domain profiling

PF1 100101110001110001
PF2 011100011101001100
PF3 100101110001110001
PF4 110001000100100000

PPI → DDI (DPEA: Riley et al.)

460 completed genomes!

2 versions:

Domain interactions derived from PDB

Finn et al. 2005 Stein et al. 2005

SIMAP/BOINC for Pfam domain search

known PPIs

predicted PPIs
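The four binary profiles on this slide (PF1 to PF4) can be compared with a Hamming distance, the number of genomes in which one domain family is present and the other absent. The sketch below only illustrates the profile comparison; the function names and the zero-distance cut-off are assumptions rather than the scoring actually used.

```python
from itertools import combinations


def hamming(profile_a, profile_b):
    """Number of genomes in which two presence/absence profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def linked_pairs(profiles, max_distance=0):
    """Pairs of families whose profiles differ in at most max_distance genomes."""
    return [
        (x, y)
        for x, y in combinations(sorted(profiles), 2)
        if hamming(profiles[x], profiles[y]) <= max_distance
    ]


profiles = {
    "PF1": "100101110001110001",
    "PF2": "011100011101001100",
    "PF3": "100101110001110001",
    "PF4": "110001000100100000",
}
pairs = linked_pairs(profiles)
```

PF1 and PF3 have identical profiles, so they are the only pair returned with the strict cut-off.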

Page 48

STRING - functional protein interactions

378 genomes

Interaction evidence: Genomic context; Primary experiments; Pathway databases; Literature mining

New network viewer: Confidence view vs. evidence view; Miniature protein structures

Peer Bork, EMBL

Page 49

Protein interaction networks

Over 2 million interactions in 184 genomes, previously uncharacterised

Filtering out promiscuous domains, excluding implausible interactions

Kamburov A et al. (2007) Denoising inferred functional association networks obtained by gene fusion analysis. BMC Genomics 8(1):460

Denoising Protein Interaction Networks

Christos Ouzounis, CERTH

Page 50

Evaluation of graph-based clustering algorithms for extracting complexes from protein interaction networks

• Evaluation protocol
o Reference complexes: MIPS database
o Test with altered networks: various proportions of random edge addition/removal.
o Testing of all parametric conditions.
o Definition of assessment statistics (Sensitivity, Positive Predictive Value, Accuracy)

Reference network: MIPS complexes Altered network (100% edge additions, 40% removal)

Sylvain Brohée and Jacques van Helden (2006). BMC Bioinformatics 7: 488
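The three assessment statistics named in the protocol can be computed from a contingency table of reference complexes against predicted clusters. The sketch below assumes the clustering-wise definitions usually associated with the cited paper, with Accuracy as the geometric mean of Sensitivity and Positive Predictive Value; the function name and input format are hypothetical.

```python
import math


def clustering_statistics(complexes, clusters):
    """Sensitivity, PPV and accuracy of clusters against reference complexes.

    Both arguments map a name to a set of protein identifiers.
    """
    # Contingency table: proteins shared by complex i and cluster j.
    table = {
        (i, j): len(members_i & members_j)
        for i, members_i in complexes.items()
        for j, members_j in clusters.items()
    }
    # Sensitivity: best-matching cluster per complex, relative to complex sizes.
    complex_total = sum(len(m) for m in complexes.values())
    best_per_complex = sum(
        max(table[(i, j)] for j in clusters) for i in complexes
    )
    sensitivity = best_per_complex / complex_total if complex_total else 0.0
    # PPV: best-matching complex per cluster, relative to column totals.
    column_total = sum(table.values())
    best_per_cluster = sum(
        max(table[(i, j)] for i in complexes) for j in clusters
    )
    ppv = best_per_cluster / column_total if column_total else 0.0
    return sensitivity, ppv, math.sqrt(sensitivity * ppv)


reference = {"c1": {"a", "b", "c"}, "c2": {"d", "e"}}
predicted_clusters = {"k1": {"a", "b"}, "k2": {"c", "d", "e"}}
sn, ppv, acc = clustering_statistics(reference, predicted_clusters)
```

Using the geometric mean keeps a method from scoring well by trading one statistic for the other, for example by placing every protein in a single large cluster.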

Page 51

Acknowledgements

Protein Families

Michal Linial, HUJI, Jerusalem
Anne-Lise Veuthey, SIB, Switzerland
Dmitrij Frishman, GSF, Germany
Alfonso Valencia, CNIO, Spain

Feature Based Prediction

Gunnar von Heijne, STO, Sweden
Rita Casadio, UNIBO, Italy
David Jones, UCL, London
Soren Brunak, DTU, Denmark

Protein Interactions

Christos Ouzounis, CERTH, Greece
Jacques van Helden, ULB, Brussels

Page 52

Network Analysis Tools (NeAT)

• A toolbox for the analysis of networks, clusters and pathways
o Graph-based clustering
o Path finding
o Graph comparisons
o Graph randomization
o Graph alteration
o …

Web site: http://rsat.scmbb.ulb.ac.be/neat/

Jacques van Helden, ULB

Page 53

Network Analysis of QTLs in Mouse

QTL1

QTL2

QTL3

Novel genes can be discovered describing the trait in question

Maps protein interaction network to an inferred QTL network

Assigns functional roles to protein subnetworks on the basis of the phenotypic traits they are mapped to

Christos Ouzounis, CERTH

Page 54

DASMI - Distributed Annotation System for Molecular Interactions

Based on the Distributed Annotation System (DAS)

Interaction servers and visualization clients

DASMI web: Client for integration of protein and domain interactions and function, possible application of quality measures

iPfam: Client for graphical visualization of various domain interaction data sets

Page 55

Proportion of genome sequences which can be assigned to 2100 domain families of known structure in CATH

[Bar chart: genes with structural annotation (0 to 100) by organism (Arabidopsis, C.elegans, Drosophila, Human, Mouse, Yeast); series: Gene3D, Genthreader]

Page 56

Conservation of enzyme function for homologous domains

[Chart: conservation of EC number to 3 levels (%) (function conservation, A.B.C.D/A.B.C.-) against sequence identity (%) in bins 11-20 to 91-100, shown with the distribution of pairwise homologous comparisons (number of pairs of relatives, 0.0E+00 to 1.2E+06); series: Identical MDA - minimum overlap 80% (same MDA) and Different MDA - minimum overlap 80% (different MDA); inset: MDA example with CATH-1, Pfam-1 and Pfam-2 domains]

>50% sequence identity - 90% probability of having related functions

If the domains have the same multidomain architecture (MDA)

>30% sequence identity – 90% probability of having related functions
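Conservation of EC number to 3 levels, the measure plotted on this slide, can be sketched as a comparison of the first three fields of two EC numbers. The function names and the decision to treat an unspecified field ("-") within the compared levels as not conserved are assumptions made for illustration.

```python
def ec_conserved(ec_a, ec_b, levels=3):
    """True if two EC numbers agree on their first `levels` fields."""
    fields_a = ec_a.split(".")[:levels]
    fields_b = ec_b.split(".")[:levels]
    if len(fields_a) < levels or len(fields_b) < levels:
        return False
    # An unspecified field cannot confirm conservation (assumed convention).
    if "-" in fields_a or "-" in fields_b:
        return False
    return fields_a == fields_b


def percent_conserved(pairs, levels=3):
    """Percentage of (ec_a, ec_b) pairs conserved to the given number of levels."""
    if not pairs:
        return 0.0
    conserved = sum(ec_conserved(a, b, levels) for a, b in pairs)
    return 100.0 * conserved / len(pairs)


example_pairs = [
    ("1.1.1.1", "1.1.1.2"),     # same to three levels
    ("1.1.1.1", "1.1.2.1"),     # differs at the third level
    ("3.4.21.4", "3.4.21.-"),   # fourth field unspecified, first three agree
    ("2.7.-.-", "2.7.1.1"),     # third field unspecified
]
result = percent_conserved(example_pairs)
```

Computing this percentage separately for pairs with the same and with different multidomain architecture, within each sequence identity bin, would give the kind of curves the chart describes.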