
Lecture 14: Predicting Protein Interactions


• L12 - Introduction to Protein Structure; Structure Comparison & Classification
• L13 - Predicting protein structure
• L14 - Predicting protein interactions
• L15 - Gene Regulatory Networks
• L16 - Protein Interaction Networks
• L17 - Computable Network Models

Predictions

Last time: protein structure. Now: protein interactions.

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Lindorff-Larsen, Kresten, Stefano Piana, et al. “How Fast-folding Proteins Fold.” Science 334, no. 6055 (2011): 517-20.

Prediction Challenges

• Predict effect of point mutations
• Predict structure of complexes
• Predict all interacting proteins

• DOI: 10.1002/prot.24356

“Simple” challenge: starting with the known structure of a complex, predict how much a mutation changes binding affinity.

Source: Moretti, Rocco, Sarel J. Fleishman, et al. “Community-wide Evaluation of Methods for Predicting the Effect of Mutations on Protein–protein Interactions.” Proteins: Structure, Function, and Bioinformatics 81, no. 11 (2013): 1980-7. © Wiley Periodicals, Inc. All rights reserved.

• DOI: 10.1002/prot.24356

[Figure: area under the curve for predictions (varying the cutoff in the ranking). Labels: BLOSUM; First Round; Second Round (given data for nine random mutations at each position).]

Predicting Structures of Complexes

• Can we use structural data to predict complexes?
• This might be easier than quantitative predictions for site mutants
• But it requires us to solve a docking problem

Docking

Which surface(s) of protein A interact with which surface(s) of protein B?

Courtesy of Nurcan Tuncbag. Used with permission.

Docking

Imagine we wanted to predict which proteins interact with our favorite molecule. For each potential partner:

• Evaluate all possible relative positions and orientations
• Allow for structural rearrangements
• Measure energy of interaction

This approach would be extremely slow. It’s also prone to false positives.

Why?

Reducing the search space

• Efficiently choose potential partners before structural comparisons
• Use prior knowledge of interfaces to focus analysis on particular residues

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement. Tuncbag N, Keskin O, Nussinov R, Gursoy A. http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale. Zhang et al. http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

PRISM’s Rationale

• There are a limited number of protein “architectures”.
• Protein structures can interact via similar architectural motifs even if the overall structures differ.
• Find particular surface regions of proteins that are spatially similar to the complementary partners of a known interface.

PRISM’s Rationale

• Two components:
  – rigid-body structural comparisons of target proteins to known template protein-protein interfaces
  – flexible refinement using a docking energy function
• Evaluate using structural similarity and evolutionary conservation of putative binding residue hot spots

Subtilisin and its inhibitors: although the global folds of subtilisin’s partners are very different, the binding regions are structurally very conserved.

Hotspots

Figure from Clackson & Wells (1995).

Source: Clackson, Tim, and James A. Wells. “A Hot Spot of Binding Energy in a Hormone-Receptor Interface.” Science 267, no. 5196 (1995): 383-6. © American Association for the Advancement of Science. All rights reserved.

• Fewer than 10% of the residues at an interface contribute more than 2 kcal/mol to binding
• Hot spots
  – are rich in Trp, Arg and Tyr
  – occur on pockets on the two proteins that have complementary shapes and distributions of charged and hydrophobic residues
  – can include buried charged residues far from solvent
  – O-ring structure excludes solvent from interface

http://onlinelibrary.wiley.com/doi/10.1002/prot.21396/full

[Figure labels: Ikb, NFkB, ASPP2]

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test:
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement
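Step 1 can be sketched in a few lines. This is a minimal illustration, assuming each chain is given as a mapping from residue id to atom coordinates and that a residue belongs to the interface when any of its atoms lies within a cutoff of the partner chain. The 5 Å default, the toy coordinates and the name `interface_residues` are assumptions for illustration, not PRISM's actual criterion or code.

```python
import math

def interface_residues(chain_a, chain_b, cutoff=5.0):
    """Return the residues of each chain that lie within `cutoff` of the other chain.

    Each chain is a dict: residue id -> list of (x, y, z) atom coordinates.
    """
    iface_a, iface_b = set(), set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            # A residue pair is in contact if any atom pair is close enough.
            close = any(math.dist(p, q) <= cutoff for p in atoms_a for q in atoms_b)
            if close:
                iface_a.add(res_a)
                iface_b.add(res_b)
    return iface_a, iface_b

# Toy coordinates in angstroms (hypothetical, one atom per residue)
chain_a = {"A1": [(0.0, 0.0, 0.0)], "A2": [(20.0, 0.0, 0.0)]}
chain_b = {"B1": [(3.0, 0.0, 0.0)], "B2": [(40.0, 0.0, 0.0)]}
half_a, half_b = interface_residues(chain_a, chain_b)
print(half_a, half_b)
```

The two returned sets play the role of the two half-interfaces that a query surface would then be aligned against in step 2.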

Flowchart

Structural match of template and target does not depend on order of residues.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Tuncbag, Nurcan, Attila Gursoy, et al. “Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary and Structural Similarities at Interfaces using PRISM.” Nature Protocols 6, no. 9 (2011): 1341-54.

Predicted p27 Protein Partners

[Figure: predicted complexes involving p27. Labels: p27, Cdk2, CycA, ERCC1, TFIIH, Cks1. ∆Gcalc = -29.54 kcal/mol; ∆Gcalc = -37.51 kcal/mol; ∆Gcalc = -44.06 kcal/mol.]


Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

PrePPI scores potential templates without building a homology model. Criteria:

• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. “Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale.” Nature 490, no. 7421 (2012): 556-60.

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

[Figure labels: NA, NB]

1. Identify interacting residues in template complex (called NA1 NB3 in rest of paper)
2. Predict interacting residues for the homology models

[Figure labels: NA, NB]

Evaluate based on five measures:

• SIM: structural similarity of NA/MA and NB/MB
• SIZ (number) and COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin A-C et al. Nature 415, 141-147 (2002). Ho Y et al. Nature 415, 180-183 (2002).

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. “Proteomics: Protein Complexes take the Bait.” Nature 415, no. 6868 (2002): 123-4.

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations:

• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

Tagging strategies

Gavin et al. (2002) Nature: TAP-tag (endogenous protein levels). Tandem purification:

1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Ho et al. (2002) Nature: over-expressed proteins and used only one tag.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. “Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes.” Nature 415, no. 6868 (2002): 141-7.

Yeast two-hybrid

How does this compare to mass-spec based approaches?

• Does not require purification – will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Biotechniques. 2008 Apr;44(5):655-62. Ratushny V, Golemis E.

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. “Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System.” Biotechniques 44, no. 5 (2008): 655.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Error Rates

• How can we estimate the error rates?

[Venn diagram: gold standard and experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

[Venn diagram: gold standard, Experiment 1 and Experiment 2. Overlap with the gold standard: true positives from gold standard.]

Data Integration

[Venn diagram: gold standard, Experiment 1 and Experiment 2, with regions labeled I to V]

• I = true positives from gold standard
• II = consensus true positives
• Fraction of consensus present in gold standard = I/II
• Remaining region: mix of true and false positives. Define IV = true positives, V = false positives
• Assume I/II = III/IV

Estimated Error Rates

How complete are current yeast and human protein-interaction networks? G Traver Hart, Arun K Ramani and Edward M Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

[Figure: estimated false positives and true positives, with regions I, II and III]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{\mathrm{I}}{\mathrm{II}} = \frac{\mathrm{III}}{\mathrm{IV}}$$

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. “How Complete are Current Yeast and Human Protein-interaction Networks?” Genome Biology 7, no. 11 (2006): 120.
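The assumption I/II = III/IV can be solved for IV directly. The sketch below does that arithmetic, and it additionally assumes that IV and V together make up the whole "mix of true and false positives" region, as the slide's labels suggest. The counts are hypothetical and chosen only to keep the numbers easy to follow; the function name is illustrative.

```python
def split_mixed_region(region_i, region_ii, region_iii, mixed_region_size):
    """Estimate true (IV) and false (V) positives in the mixed region.

    Solves I/II = III/IV for IV, then takes V as the remainder,
    assuming IV + V = mixed_region_size.
    """
    region_iv = region_iii * region_ii / region_i
    region_v = mixed_region_size - region_iv
    return region_iv, region_v

# Hypothetical region counts (not taken from Hart et al.)
true_pos, false_pos = split_mixed_region(
    region_i=100, region_ii=50, region_iii=200, mixed_region_size=400
)
false_positive_fraction = false_pos / (true_pos + false_pos)
print(true_pos, false_pos, false_positive_fraction)
```

With these made-up counts, a quarter of the mixed region would be estimated as real interactions and the rest as false positives.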

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Note: log-log plot.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. “Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions.” Nature 417, no. 6887 (2002): 399-403.

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$$\text{Data} = (\text{two\_hybrid},\ \text{mass\_spec},\ \text{co\_evolution},\ \text{co\_expression})$$

or

$$\text{Data} = (\text{mass\_spec\_expt\_1},\ \text{mass\_spec\_expt\_2})$$

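Bayes Rule

$$P(\text{true\_PPI} \mid \text{Data}) = \frac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data})}$$

posterior = prior × likelihood / P(Data)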

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\,P(\text{Data} \mid \text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

How do we compute this?

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

We assume the observations are independent (we’ll see how to handle dependence soon).

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
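Under the independence assumption the ranking function becomes a sum of per-observation log ratios. The following is a minimal sketch for binary "detected / not detected" observations; the assay names and detection probabilities are hypothetical stand-ins for values that would be estimated from high-confidence positive and negative sets, and this is not the scoring scheme of Collins et al.

```python
import math

def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum of per-assay log ratios: the log of the product over independent observations.

    observations: dict assay -> True/False (was the pair detected?)
    p_given_true / p_given_false: dict assay -> P(detected | true PPI) / P(detected | false PPI)
    """
    score = 0.0
    for assay, detected in observations.items():
        p_t = p_given_true[assay] if detected else 1.0 - p_given_true[assay]
        p_f = p_given_false[assay] if detected else 1.0 - p_given_false[assay]
        score += math.log(p_t / p_f)
    return score

# Hypothetical detection probabilities
p_given_true = {"two_hybrid": 0.2, "mass_spec": 0.4}
p_given_false = {"two_hybrid": 0.01, "mass_spec": 0.05}

both = log_likelihood_ratio({"two_hybrid": True, "mass_spec": True}, p_given_true, p_given_false)
one = log_likelihood_ratio({"two_hybrid": False, "mass_spec": True}, p_given_true, p_given_false)
none = log_likelihood_ratio({"two_hybrid": False, "mass_spec": False}, p_given_true, p_given_false)
print(both, one, none)
```

Pairs detected by more assays receive a higher score, but a pair seen in only one assay can still rank above zero, which is the point of ranking by this function rather than requiring agreement across all assays.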

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the ranking function.

Collins et al. Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

Source: Collins, Sean R., Patrick Kemmeren, et al. “Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae.” Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50. © American Society for Biochemistry and Molecular Biology. All rights reserved.
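The two ROC axes can be computed by walking down a ranked list of scored pairs and counting hits against the reference sets. This is an illustrative sketch with made-up pairs and scores; `roc_points` is a hypothetical helper, and pairs that fall in neither reference set are simply skipped, which is an assumption rather than a detail taken from the paper.

```python
def roc_points(scored_pairs, positives, negatives):
    """Return (FP/(TN+FP), TP/(TP+FN)) points as the score cutoff is lowered.

    scored_pairs: dict mapping a protein pair to its score
    positives / negatives: sets of high-confidence true and false pairs
    """
    points = [(0.0, 0.0)]
    tp = fp = 0
    for pair in sorted(scored_pairs, key=scored_pairs.get, reverse=True):
        if pair in positives:
            tp += 1
        elif pair in negatives:
            fp += 1
        else:
            continue  # pair is in neither reference set
        # TP + FN is the number of positives; TN + FP is the number of negatives.
        points.append((fp / len(negatives), tp / len(positives)))
    return points

# Hypothetical scores and reference sets
scores = {("a", "b"): 0.9, ("c", "d"): 0.7, ("e", "f"): 0.5, ("g", "h"): 0.3}
positives = {("a", "b"), ("e", "f")}
negatives = {("c", "d"), ("g", "h")}
curve = roc_points(scores, positives, negatives)
print(curve)
```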

Collins et al. Mol Cell Proteomics 2007

[Figure: ROC curve with the same axes. Labels: 9074 interactions; literature curated (excluding 2-hybrid); error rate equivalent to literature curated.]

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason.

In Biology:

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

Bayesian Networks

A “natural” way to think about biological networks.

Predict unknown variables from observations.

Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved.

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. “Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks.” Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1) => $2^N - 1$ parameters
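To see how quickly the full table grows, the sketch below compares the $2^N - 1$ count with the parameter count of a network in which one hidden binary cause has several conditionally independent binary effects. The second count (one prior plus two conditional probabilities per effect) is a standard result added here for contrast, not a figure from the slides, and the function names are illustrative.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters in a complete joint table over N binary variables: 2^N - 1."""
    return 2 ** n_binary_variables - 1

def naive_bayes_parameters(n_observations):
    """Free parameters when one hidden binary cause has n independent binary effects.

    One parameter for the prior on the cause, plus two per effect
    (P(effect | cause true) and P(effect | cause false)).
    """
    return 1 + 2 * n_observations

# One hidden "real interaction" variable plus n assays
for n_assays in (4, 10, 20):
    print(n_assays, full_joint_parameters(n_assays + 1), naive_bayes_parameters(n_assays))
```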

Graphical Structure Expresses our Beliefs

[Diagram: P1-P2 REAL → Detected by X1, …, Detected by Xn. The top node is the cause (often “hidden”); the detections are the effects (observed).]

[Diagram: P1-P2 REAL → Detected by X1, …, Detected by Xj, Detected by Xj+1, …, Detected by Xn, with additional nodes “Membrane” and “Highly expressed”.]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities: P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Network: Smart → Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints.
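The equivalence can be checked numerically: multiplying P(S) by P(G|S) reproduces each joint entry, and the conditional table can be recovered from the joint one. The sketch below uses only the numbers from the two tables; the variable and function names are illustrative.

```python
# Conditional formulation, from the slide
p_s = {"F": 0.7, "T": 0.3}
p_g_given_s = {"F": {"F": 0.95, "T": 0.05}, "T": {"F": 0.2, "T": 0.8}}

# Joint formulation rebuilt as P(S, G) = P(S) * P(G | S)
joint = {(s, g): p_s[s] * p_g_given_s[s][g] for s in "FT" for g in "FT"}

def conditional_from_joint(joint_table, s, g):
    """Recover P(G = g | S = s) from the joint table by normalizing over G."""
    p_s_marginal = sum(joint_table[(s, other)] for other in "FT")
    return joint_table[(s, g)] / p_s_marginal

for key, value in sorted(joint.items()):
    print(key, round(value, 3))
```

Either way there are three free numbers: three independent joint entries (the fourth follows from the sum to 1), or one for P(S) plus one for each row of P(G|S).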

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

“Explaining Away”

[Network nodes: Season; S: Sprinkler; R: Rain; Grass wet; Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, once we know the grass is wet, the knowledge that it is raining influences our beliefs about the sprinkler.

“Explaining Away”

[Network: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
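The coin example is small enough to enumerate. The sketch below computes P(C1=H) under different evidence by summing over the four equally likely outcomes; the function names and the dictionary-based evidence format are illustrative choices.

```python
from itertools import product

def score(c1, c2):
    """Score is 1 when the two coins match, 0 otherwise (as in the table)."""
    return 1 if c1 == c2 else 0

def prob_c1_heads(evidence):
    """P(C1 = H | evidence) by enumerating the four equally likely coin outcomes.

    evidence: dict with optional keys "C2" and "Score".
    """
    total = match = 0.0
    for c1, c2 in product("HT", repeat=2):
        world = {"C1": c1, "C2": c2, "Score": score(c1, c2)}
        if all(world[name] == value for name, value in evidence.items()):
            total += 0.25
            if c1 == "H":
                match += 0.25
    return match / total

print(prob_c1_heads({}))                        # no evidence
print(prob_c1_heads({"C2": "T"}))               # C2 alone tells us nothing about C1
print(prob_c1_heads({"Score": 1, "C2": "T"}))   # with the score known, C2 determines C1
```

Without the score, learning C2 leaves the belief in C1 unchanged; once the score is observed, the two coins become dependent.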

How do we obtain a BN?

• Two problems:
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions:
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist:
  – Gradient descent, EM, Gibbs Sampling, …

Learning Models from Data

[Network: Coin 1 → Score ← Coin 2]

p(C1=H) = ?    p(C1=T) = ?
p(C2=H) = ?    p(C2=T) = ?

C1    C2    Score
H     H     ?
T     T     ?
H     T     ?
T     H     ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …
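For a fully observed network like this one, the maximum-likelihood estimate of each coin's bias is simply the observed frequency of heads. The sketch below shows that counting step on a small hypothetical data set in the D = (C1, C2, S) format; the four tuples are invented for illustration and `ml_estimates` is not a function from any library.

```python
from collections import Counter

def ml_estimates(data):
    """Maximum-likelihood estimates of p(C1=H) and p(C2=H) from (C1, C2, S) tuples."""
    n = len(data)
    c1_counts = Counter(c1 for c1, _, _ in data)
    c2_counts = Counter(c2 for _, c2, _ in data)
    # For a coin, the likelihood is maximized by the observed fraction of heads.
    return c1_counts["H"] / n, c2_counts["H"] / n

# Hypothetical observations
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("H", "T", 0)]
p_c1_heads, p_c2_heads = ml_estimates(data)
print(p_c1_heads, p_c2_heads)
```

When some variables are hidden, counting is no longer enough, which is where the iterative search methods listed on the previous slide come in.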

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved.

A worked “toy” example

Best to work through on your own.

Chain Rule of Probability for Bayes Nets

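[Figure: Bayes net with nodes S (Smart), G (Grade), R (GREs), A (Admit); edges S → G, S → R, G → A, R → A]

P(S, G, R, A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

              = P(S) P(G|S) P(R|S) P(A|G,R)   (why?)

(because of the conditional independence assumptions)

Recall: P(X, Y) = P(X|Y) P(Y)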

96

Prediction with Bayes Nets

S:          P(F)  P(T)
            0.5   0.5

G:   S      P(F)  P(T)
     F      0.9   0.1
     T      0.8   0.2

R:   S      P(F)  P(T)
     F      0.8   0.2
     T      0.2   0.8

A:   G  R   P(F)  P(T)
     F  F   0.9   0.1
     F  T   0.8   0.2
     T  F   0.5   0.5
     T  T   0.2   0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)

           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29

             with the four terms corresponding to (G, R) = (F, F), (F, T), (T, F), (T, T)

Rough grading: being smart doesn't help very much.

GREs are better correlated with intelligence than the grades.
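The sum over G and R can be reproduced in a few lines. This is a sketch that encodes the conditional probability tables as read from the slide; the variable and function names are illustrative, not part of the lecture.

```python
from itertools import product

# Each table stores P(child = True | parents), as read from the slide.
P_G = {False: 0.1, True: 0.2}                    # P(G=T | S)
P_R = {False: 0.2, True: 0.8}                    # P(R=T | S)
P_A = {(False, False): 0.1, (False, True): 0.2,
       (True, False): 0.5, (True, True): 0.8}    # P(A=T | G, R)

def bern(p_true, value):
    """Probability of a boolean value given P(value = True)."""
    return p_true if value else 1.0 - p_true

def p_admit_given_smart(s):
    """P(A=T | S=s): sum over the unobserved G and R."""
    return sum(bern(P_G[s], g) * bern(P_R[s], r) * P_A[(g, r)]
               for g, r in product([False, True], repeat=2))

print(round(p_admit_given_smart(True), 3))  # 0.292
```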

[Figure: Bayes net with nodes S (Smart), G (Grade), R (GREs), A (Admit)]

97

Inference with Bayes Nets

[Figure: Bayes net with nodes S (Smart), G (Grade), R (GREs), A (Admit)]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)

       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.23

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on the previous slide)
P(A=T|S=F) = 0.16 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes' rule: = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Figure: Bayes net with nodes S (Smart), G (Grade), R (GREs), A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F) = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R)   (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.94 / 0.55 = 1.7
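The "tedious but straightforward" step is easy to hand to a computer. Below is a minimal sketch of inference by full enumeration over the four binary variables, using the conditional probability tables from the worked example; the `conditional` helper and its dictionary-based interface are assumptions made for illustration.

```python
from itertools import product

P_S = 0.5                                        # P(S=T)
P_G = {False: 0.1, True: 0.2}                    # P(G=T | S)
P_R = {False: 0.2, True: 0.8}                    # P(R=T | S)
P_A = {(False, False): 0.1, (False, True): 0.2,
       (True, False): 0.5, (True, True): 0.8}    # P(A=T | G, R)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bern(P_S, s) * bern(P_G[s], g) *
            bern(P_R[s], r) * bern(P_A[(g, r)], a))

def conditional(query, evidence):
    """P(query | evidence); both map a variable name to a boolean."""
    num = den = 0.0
    for s, g, r, a in product([False, True], repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(s, g, r, a)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den

p_bad_grades = conditional({"G": False}, {"A": False})
p_bad_gres = conditional({"R": False}, {"A": False})
print(p_bad_grades > p_bad_gres)  # True
```

Enumeration scales as 2^n in the number of variables, so it is only practical for toy networks like this one.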

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Figure: Bayes net in which "P1-P2 REAL" (the cause, often “hidden”) points to the effects (observed): "Detected by X1", …, "Detected by Xn"]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.
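The idea can be sketched by comparing expression-profile distances for reported pairs against random pairs. This is only an illustration: Euclidean distance is an assumed stand-in for the distance d used in the paper, and the profiles and pair lists below are invented toy data.

```python
import math

def distance(profile_a, profile_b):
    """Euclidean distance between two expression profiles (an assumed choice of d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))

def mean_pair_distance(pairs, profiles):
    """Average expression distance over a list of protein pairs."""
    return sum(distance(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

# Hypothetical mRNA expression profiles (three conditions per protein).
profiles = {
    "P1": [1.0, 2.0, 3.0],
    "P2": [1.1, 2.1, 2.9],
    "P3": [3.0, 0.5, 1.0],
    "P4": [0.2, 2.5, 0.1],
}
reported = [("P1", "P2")]                    # hypothetical reported interaction
random_pairs = [("P1", "P3"), ("P2", "P4")]  # hypothetical random pairs

print(mean_pair_distance(reported, profiles) <
      mean_pair_distance(random_pairs, profiles))  # True
```

A data set whose pairs look no more co-expressed than random pairs would be suspected of containing many false positives.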

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103
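One simple way to compare phylogenetic profiles is to count the genomes in which two proteins are both present or both absent. The sketch below uses that plain matching score as an assumption; it is not the improved method of Cokus et al., and the profiles are invented.

```python
def profile_similarity(profile_a, profile_b):
    """Fraction of genomes in which both proteins are present or both absent."""
    matches = sum(a == b for a, b in zip(profile_a, profile_b))
    return matches / len(profile_a)

# Hypothetical presence (1) / absence (0) of three proteins across five genomes.
protein_x = [1, 1, 0, 1, 0]
protein_y = [1, 1, 0, 1, 0]   # same pattern as X: candidate functional partner
protein_z = [0, 1, 1, 0, 1]   # largely unrelated pattern

print(profile_similarity(protein_x, protein_y))  # 1.0
print(profile_similarity(protein_x, protein_z))  # 0.2
```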

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

108

109

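Likelihood ratio:

$$\frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\,P(\mathrm{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

Log likelihood ratio:

$$\log\frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

$$\log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$$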

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \mathrm{pos})}{P(f \mid \mathrm{neg})} = \frac{1114/2150}{81924/573734}$$
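The likelihood for one feature value is just a ratio of two frequencies. The sketch below assumes the flattened numbers on the slide read as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives sharing the feature value; the function name is illustrative.

```python
def likelihood_ratio(n_pos_with_f, n_pos_total, n_neg_with_f, n_neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = n_pos_with_f / n_pos_total
    p_f_given_neg = n_neg_with_f / n_neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as interpreted from the slide (an assumption about the garbled layout).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 1))  # 3.6
```

A value above 1 means the feature value is more common among gold-standard positives than negatives, so observing it raises the odds of a true interaction.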

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected → Compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Predictions

Last time: protein structure. Now: protein interactions.

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Lindorff-Larsen, Kresten, Stefano Piana, et al. "How Fast-folding Proteins Fold." Science 334, no. 6055 (2011): 517-20.

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

2

Prediction Challenges

• Predict effect of point mutations
• Predict structure of complexes
• Predict all interacting proteins

3

• DOI: 10.1002/prot.24356

“Simple” challenge: Starting with the known structure of a complex, predict how much a mutation changes binding affinity.

© Wiley Periodicals, Inc. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Moretti, Rocco, Sarel J. Fleishman, et al. "Community‐wide Evaluation of Methods for Predicting the Effect of Mutations on Protein–protein Interactions." Proteins: Structure, Function, and Bioinformatics 81, no. 11 (2013): 1980-7.

4

• DOI: 10.1002/prot.24356

[Figure: area under curve for predictions (varying cutoff in ranking). Labels: BLOSUM; First Round; Second Round (given data for nine random mutations at each position)]

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

5

Predicting Structures of Complexes

• Can we use structural data to predict complexes?

• This might be easier than quantitative predictions for site mutants.

• But it requires us to solve a docking problem.

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

6

Docking

N Tuncbag

Which surface(s) of protein A interact with which surface of protein B?

Courtesy of Nurcan Tuncbag Used with permission

7

Docking

Imagine we wanted to predict which proteins interact with our favorite molecule. For each potential partner:

• Evaluate all possible relative positions and orientations

• Allow for structural rearrangements

• Measure energy of interaction

This approach would be extremely slow. It's also prone to false positives.

Why?

Courtesy of Nurcan Tuncbag Used with permission

8

Reducing the search space

bull Efficiently choose potential partners before structural comparisons

bull Use prior knowledge of interfaces to focus analysis on particular residues

9

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.

Tuncbag N, Keskin O, Nussinov R, Gursoy A.

http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.

Zhang et al. http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

10

PRISM's Rationale

There are a limited number of protein “architectures”. Protein structures can interact via similar architectural motifs even if the overall structures differ. Find particular surface regions of proteins that are spatially similar to the complementary partners of a known interface.

N. Tuncbag

11

PRISM's Rationale

• Two components
  – rigid-body structural comparisons of target proteins to known template protein-protein interfaces
  – flexible refinement using a docking energy function

• Evaluate using structural similarity and evolutionary conservation of putative binding residue hot spots

N. Tuncbag

12

Subtilisin and its inhibitors

Although the global folds of Subtilisin's partners are very different, the binding regions are structurally very conserved.

N. Tuncbag. Courtesy of Nurcan Tuncbag. Used with permission.

13

Hotspots

Figure from Clackson amp Wells (1995)

copy American Association for the Advancement of Science All rights reserved This content is excludedfrom our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Clackson Tim and James A Wells A Hot Spot of Binding Energy in a Hormone-ReceptorInterface Science 267 no 5196 (1995) 383-6

14

Hotspots

Figure from Clackson amp Wells (1995)

copy American Association for the Advancement of Science All rights reserved This content is excludedfrom our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Clackson Tim and James A Wells A Hot Spot of Binding Energy in a Hormone-ReceptorInterface Science 267 no 5196 (1995) 383-6

15

• Fewer than 10% of the residues at an interface contribute more than 2 kcal/mol to binding

• Hot spots
  – rich in Trp, Arg and Tyr
  – occur on pockets on the two proteins that have complementary shapes and distributions of charged and hydrophobic residues
  – can include buried charged residues far from solvent
  – O-ring structure excludes solvent from interface

http://onlinelibrary.wiley.com/doi/10.1002/prot.21396/full

16

Ikb

NFkB

N Tuncbag

1 Identify interface of template (distance cutoff)

Courtesy of Nurcan Tuncbag Used with permission 17


[Figure labels: Ikb, NFkB, ASPP2. N. Tuncbag]

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement

Courtesy of Nurcan Tuncbag. Used with permission.

21
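Step 1, identifying the template interface with a distance cutoff, can be sketched as follows. This is not PRISM's implementation: the 6.0 Å cutoff, the use of one representative atom per residue, and the toy coordinates are all assumptions chosen for illustration.

```python
import math

def interface_residues(chain_a, chain_b, cutoff=6.0):
    """Return the residues of each chain lying within `cutoff` of the other chain.

    Each chain maps a residue id to the (x, y, z) coordinates of one
    representative atom. The cutoff value is a hypothetical choice.
    """
    iface_a, iface_b = set(), set()
    for res_a, xyz_a in chain_a.items():
        for res_b, xyz_b in chain_b.items():
            if math.dist(xyz_a, xyz_b) <= cutoff:
                iface_a.add(res_a)
                iface_b.add(res_b)
    return iface_a, iface_b

# Toy coordinates (hypothetical); only A10 and B7 are within 6 units.
chain_a = {"A10": (0.0, 0.0, 0.0), "A55": (20.0, 0.0, 0.0)}
chain_b = {"B7": (3.0, 4.0, 0.0), "B90": (40.0, 0.0, 0.0)}

print(interface_residues(chain_a, chain_b))  # ({'A10'}, {'B7'})
```

The two returned sets correspond to the two "half-interfaces" against which a query surface would then be aligned.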

Flowchart

Structural match of template and target does not depend on the order of residues.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Tuncbag, Nurcan, Attila Gursoy, et al. "Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary and Structural Similarities at Interfaces using PRISM." Nature Protocols 6, no. 9 (2011): 1341-54.

22

Predicted p27 Protein Partners

[Figure labels: p27, Cdk2, CycA, ERCC1, TFIIH, Cks1]

ΔGcalc = -29.54 kcal/mol    ΔGcalc = -37.51 kcal/mol    ΔGcalc = -44.06 kcal/mol

N. Tuncbag. Courtesy of Nurcan Tuncbag. Used with permission.

23

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.

Tuncbag N, Keskin O, Nussinov R, Gursoy A.

http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.

Zhang et al. http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

PrePPI scores potential templates without building a homology model.

Criteria:
Geometric similarity between the protomer and template
Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

25

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

26


[Figure labels: NA, NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)

2. Predict interacting residues for the homology models

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

32

[Figure labels: NA, NB]

Evaluate based on five measures:
• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number), COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

35

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002). Ho, Y. et al. Nature 415, 180-183 (2002).

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels), tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Biotechniques 2008 Apr; 44(5): 655-62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification: will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Biotechniques 2008 Apr; 44(5): 655-62. Ratushny V, Golemis E.

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

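Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard and Experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

47

Error Rates

[Venn diagram: Gold standard, Experiment 1, Experiment 2. Overlap with the gold standard: true positives from gold standard.]

48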

Data Integration

[Venn diagram: Gold standard, Experiment 1, Experiment 2; regions labeled I to V]

I = True positives from gold standard

II = Consensus true positives

Fraction of consensus present in gold standard = I/II

The remaining Experiment 1 data (region III and beyond) is a mix of true and false positives.

Define IV = true positives, V = false positives

Assume I/II = III/IV

52

Estimated Error Rates

[Figure: estimated numbers of false positives and true positives; regions I, II and III are labeled]

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I, II) and the unique data (III, IV):

I / II = III / IV
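The proportionality assumption turns directly into an estimate of the unseen true positives. The sketch below is illustrative only: the function name and all counts are hypothetical, and it assumes the unique, non-gold-standard interactions of one experiment split into true positives (IV) and false positives (V).

```python
def estimate_unique_true_false(n_I, n_II, n_III, n_unique_not_gold):
    """Apply I/II = III/IV to split an experiment's unique, non-gold data.

    n_I, n_II, n_III: sizes of regions I, II and III.
    n_unique_not_gold: unique interactions not in the gold standard (IV + V).
    Returns (estimated IV, estimated V).
    """
    n_IV = n_II * n_III / n_I          # rearranged from I/II = III/IV
    n_V = n_unique_not_gold - n_IV     # whatever is left is false positives
    return n_IV, n_V

# Hypothetical region sizes, chosen only to make the arithmetic easy to follow.
n_IV, n_V = estimate_unique_true_false(n_I=100, n_II=300, n_III=50,
                                       n_unique_not_gold=1000)
print(n_IV, n_V)  # 150.0 850.0
```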

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.

53

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Note: log-log plot.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

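Bayes Rule

$$P(\mathrm{true\_PPI} \mid \mathrm{Data}) = \frac{P(\mathrm{true\_PPI})\,P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data})}$$

(posterior = prior × likelihood / P(Data))

58

Naïve Bayes Classification

posterior ∝ prior × likelihood

likelihood ratio = ratio of posterior probabilities

if > 1 classify as true; if < 1 classify as false

How do we compute this?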

59

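Likelihood ratio:

$$\frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\,P(\mathrm{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

Log likelihood ratio:

$$\log\frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

$$\log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$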

61

We assume the observations are independent (we'll see how to handle dependence soon). We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol. Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function:

$$\log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$$
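Under the independence assumption the ranking function becomes a sum of per-assay log ratios. The following sketch shows that computation; the two assays and their conditional probabilities are hypothetical numbers, standing in for values that would be estimated from high-confidence positive and negative interactions.

```python
import math

def ranking_score(observations, p_given_true, p_given_false):
    """Sum over assays of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)]."""
    return sum(math.log(p_given_true[i][obs] / p_given_false[i][obs])
               for i, obs in enumerate(observations))

# Hypothetical per-assay probabilities of "detected" (True) / "not detected" (False).
p_given_true = [{True: 0.6, False: 0.4}, {True: 0.3, False: 0.7}]
p_given_false = [{True: 0.05, False: 0.95}, {True: 0.01, False: 0.99}]

both = ranking_score([True, True], p_given_true, p_given_false)
first_only = ranking_score([True, False], p_given_true, p_given_false)
neither = ranking_score([False, False], p_given_true, p_given_false)

print(both > first_only > neither)  # True
```

A pair seen in only one assay still gets a positive score here, which is the advantage over requiring detection in every assay.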

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by

$$\log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

65

Collins et al., Mol. Cell Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives =    True negatives =

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

66

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives FP(TN+FP)

e de

nc )es

onfi ex

pl-c omgh c

ng h

i omfri

us (es

pute

vtiipo

sC

om

ROC curve

)FN+

(TP

TPROC curve

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD) True negatives = proteins pairs that 1 are annotated to belong to distinct complexes 2 have different sub-cellular locations OR anticorrelated mRNA expression

68 67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

Literature curated (excluding 2-hybrid)

Error rate equivalent to

literature curated

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50
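A minimal sketch of how the points of such a ROC curve can be computed from ranked scores and gold-standard sets. The function name, the toy scores and the pair labels are assumptions for illustration; score ties are not treated specially, and both gold sets are assumed to be non-empty.

```python
def roc_points(scores, gold_positive, gold_negative):
    """Return (FP/(TN+FP), TP/(TP+FN)) points from sweeping a score cutoff.

    scores: dict mapping protein pair -> ranking score (higher = more confident)
    gold_positive / gold_negative: sets of high-confidence true / false pairs
    Pairs outside both gold sets are ignored.
    """
    labeled = [(score, pair in gold_positive) for pair, score in scores.items()
               if pair in gold_positive or pair in gold_negative]
    n_pos = sum(1 for _, is_pos in labeled if is_pos)
    n_neg = len(labeled) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the cutoff one pair at a time, starting from the highest score.
    for _, is_pos in sorted(labeled, key=lambda item: item[0], reverse=True):
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy data (hypothetical).
scores = {"A-B": 5.2, "E-F": 1.7, "G-H": 0.4, "C-D": -0.3}
gold_positive = {"A-B", "G-H"}
gold_negative = {"E-F", "C-D"}
points = roc_points(scores, gold_positive, gold_negative)
```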

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  => 2^N - 1 parameters
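To make the growth concrete, here is a small sketch comparing the number of free parameters in a complete joint table with the number in a factorized model where one hidden binary cause has independent binary effects. The function names are assumptions for illustration.

```python
def full_joint_parameters(n_binary_vars):
    """Free parameters in a complete joint table over n binary variables."""
    return 2 ** n_binary_vars - 1

def naive_bayes_parameters(n_observations):
    """Free parameters for one hidden binary cause with n independent binary effects.

    One for P(cause), plus two per effect:
    P(effect | cause = T) and P(effect | cause = F).
    """
    return 1 + 2 * n_observations

for n_obs in (4, 10, 20):
    n_vars = n_obs + 1  # the observations plus the hidden cause
    print(n_obs, full_joint_parameters(n_vars), naive_bayes_parameters(n_obs))
```

With 20 observations the full table needs about two million parameters, while the factorized model needs 41.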

Graphical Structure Expresses our Beliefs

[Diagram: the node "P1-P2 REAL" linked to the nodes "Detected by X1" … "Detected by Xn".]

Cause (often "hidden"): P1-P2 REAL
Effects (observed): Detected by X1, …, Detected by Xn

[Diagram: the same network next to a second version with two additional nodes, "Membrane" and "Highly expressed", linked to subsets of the detection nodes (Detected by X1 … Detected by Xn; Detected by Xj, Detected by Xj+1, …).]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Diagram: Bayesian network with the nodes "Smart", "Grade inflation", "Grades", "GREs" and "Admit".]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Diagram: the nodes "Smart" and "Grades".]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

The formulations are equivalent, and both require the same number of constraints.
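A short check of the equivalence, using the numbers from the two tables: the joint table is rebuilt from P(S) and P(G|S), and P(S) is recovered from the joint by marginalizing over G. The variable names are assumptions for illustration.

```python
# Conditional formulation: P(S) and P(G | S).
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {False: {False: 0.95, True: 0.05},
               True: {False: 0.2, True: 0.8}}

# Joint formulation: P(S, G), keyed by (S, G).
joint_from_table = {(False, False): 0.665, (False, True): 0.035,
                    (True, False): 0.06, (True, True): 0.24}

# Rebuild the joint with P(S, G) = P(S) * P(G | S).
joint = {(s, g): p_s[s] * p_g_given_s[s][g]
         for s in (False, True) for g in (False, True)}

# Go the other way: recover P(S) by summing the joint over G.
p_s_recovered = {s: sum(joint_from_table[(s, g)] for g in (False, True))
                 for s in (False, True)}
```

Both forms have three free numbers: the joint has four entries that must sum to 1, and the conditional form has one value for P(S) plus one per row of P(G|S).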

Automated Admissions Decisions

[Diagram: Bayesian network with the nodes "Smart", "Grade inflation", "Grades", "GREs" and "Admit".]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

"Explaining Away"

[Diagram: network with the nodes "Season", "Sprinkler" (S), "Rain" (R), "Grass wet" and "Slippery".]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, once we know the grass is wet, the knowledge that it is raining influences our belief that the sprinkler is on.

"Explaining Away"

[Diagram: the nodes "Coin 1" and "Coin 2" both feed into "Score".]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
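The two-coin case is small enough to enumerate. The sketch below (function name assumed for illustration) computes P(C1 = H) under different evidence: without the score, knowing C2 changes nothing; with the score known, C2 determines C1.

```python
from itertools import product

def prob_c1_heads(score=None, c2=None):
    """P(C1 = H | evidence), enumerating the four equally likely coin outcomes."""
    matching = heads = 0.0
    for c1, coin2 in product("HT", repeat=2):
        s = 1 if c1 == coin2 else 0  # the score is 1 when the coins agree
        if score is not None and s != score:
            continue
        if c2 is not None and coin2 != c2:
            continue
        matching += 0.25  # each outcome has probability 0.5 * 0.5
        if c1 == "H":
            heads += 0.25
    return heads / matching

no_evidence = prob_c1_heads()                       # 0.5
given_c2 = prob_c1_heads(c2="T")                    # still 0.5
given_score = prob_c1_heads(score=1)                # still 0.5
given_score_and_c2 = prob_c1_heads(score=1, c2="T") # 0.0
```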

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure. How do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

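Learning Models from Data

[Diagram: the nodes "Coin 1" and "Coin 2" both feed into "Score".]

Parameters to learn: p(C1=H), p(C1=T), p(C2=H), p(C2=T), and the Score for each combination of C1 and C2 (HH, TT, HT, TH).

D = (C1, C2, S) = (H,T,0), (H,H,1), …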
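A minimal sketch of maximum-likelihood parameter learning for the coin network, assuming the structure is known and every variable is observed. In that case the likelihood is maximized by relative frequencies, so no iterative search is needed. The data set `D` is hypothetical apart from its first two triples, and the function name is an assumption.

```python
from collections import Counter

def fit_coin_model(data):
    """Maximum-likelihood estimates from fully observed (C1, C2, Score) triples."""
    n = len(data)
    c1_counts = Counter(c1 for c1, _, _ in data)
    c2_counts = Counter(c2 for _, c2, _ in data)
    p_c1_heads = c1_counts["H"] / n
    p_c2_heads = c2_counts["H"] / n
    # Estimate P(Score = 1 | C1, C2) for each coin combination that was observed.
    pair_counts = Counter((c1, c2) for c1, c2, _ in data)
    score_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
    p_score = {pair: score_counts[pair] / count
               for pair, count in pair_counts.items()}
    return p_c1_heads, p_c2_heads, p_score

# Hypothetical observations (C1, C2, Score).
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0),
     ("H", "H", 1), ("T", "T", 1), ("H", "T", 0), ("T", "H", 0)]
p_c1, p_c2, p_score = fit_coin_model(D)
```

With hidden variables or missing values, counting no longer suffices, which is where search methods such as EM come in.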

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved.

A worked "toy" example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

[Diagram: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit).]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

           = P(S) P(G|S) P(R|S) P(A|G,R)    (why?)

(because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

Prediction with Bayes Nets

[Diagram: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit).]

S:
P(F)   P(T)
0.5    0.5

G:
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R:
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) ≈ 0.29
  (terms in the order G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn't help very much.

GREs are better correlated with intelligence than the grades.
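The prediction step can be reproduced by summing over the unobserved variables. This is a sketch using the conditional probability tables of the toy example; the constant and function names are assumptions for illustration.

```python
from itertools import product

# Each table stores P(node = True | parents), taken from the toy example.
P_G_TRUE = {False: 0.1, True: 0.2}   # keyed by S
P_R_TRUE = {False: 0.2, True: 0.8}   # keyed by S
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}   # keyed by (G, R)

def bernoulli(p_true, value):
    """Probability that a binary variable equals `value` when P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def p_admit_given_smart(s):
    """P(A=T | S=s), summing over the unobserved G and R."""
    total = 0.0
    for g, r in product((False, True), repeat=2):
        total += (bernoulli(P_G_TRUE[s], g) * bernoulli(P_R_TRUE[s], r)
                  * P_A_TRUE[(g, r)])
    return total

p_admit_if_smart = p_admit_given_smart(True)   # the four-term sum, 0.292
```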

Inference with Bayes Nets

[Diagram: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit).]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)

       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) ≈ 0.23

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) ≈ 0.29 (calculated on the previous slide)
P(A=T|S=F) ≈ 0.16 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T,A=T) / P(A=T)
Or, using Bayes' Rule: = P(S=T) P(A=T|S=T) / P(A=T)

Inference with Bayes Nets

[Diagram: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit).]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F,A=F) / P(A=F) = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R)    (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) ≈ 0.94 / 0.55 ≈ 1.7
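The "tedious but straightforward" sums can be left to a few lines of code. This sketch does inference by brute-force enumeration of the joint P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R), using the tables of the toy example; the function names are assumptions, and enumeration like this is only practical for very small networks.

```python
from itertools import product

P_S_TRUE = 0.5
P_G_TRUE = {False: 0.1, True: 0.2}   # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}   # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)

def bernoulli(p_true, value):
    """Probability that a binary variable equals `value` when P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(s, g, r, a):
    """P(S,G,R,A) from the chain rule for this network."""
    return (bernoulli(P_S_TRUE, s) * bernoulli(P_G_TRUE[s], g)
            * bernoulli(P_R_TRUE[s], r) * bernoulli(P_A_TRUE[(g, r)], a))

def probability(**evidence):
    """Marginal probability of the given assignments, e.g. probability(a=False)."""
    total = 0.0
    for s, g, r, a in product((False, True), repeat=4):
        state = {"s": s, "g": g, "r": r, "a": a}
        if all(state[name] == value for name, value in evidence.items()):
            total += joint(s, g, r, a)
    return total

def conditional(query, given):
    """P(query | given); both arguments are dicts over disjoint variables."""
    return probability(**query, **given) / probability(**given)

p_admit = probability(a=True)
p_smart_given_admit = conditional({"s": True}, {"a": True})
p_bad_grades = conditional({"g": False}, {"a": False})
p_bad_gres = conditional({"r": False}, {"a": False})
```

In this network most students have grades below threshold whether or not they are smart, so bad grades come out as the more probable state among rejected students.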

End of worked example

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: the node "P1-P2 REAL" linked to the nodes "Detected by X1" … "Detected by Xn".]

Cause (often "hidden"): P1-P2 REAL
Effects (observed): Detected by X1, …, Detected by Xn

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: two patterns; one is labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments


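$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$$

If the ratio is > 1, classify as true; if it is < 1, classify as false.

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

The prior probability is the same for all interactions, so it does not affect the ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$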

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114 / 2150}{81924 / 573734} \approx 3.6$$
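A minimal sketch of how such a likelihood ratio is estimated from gold-standard counts. The counts are the figures shown above, read as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives sharing one feature value; that reading, and the function and argument names, are assumptions for illustration.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg

L = likelihood_ratio(1114, 2150, 81924, 573734)
```

A value above 1 means the feature value is more common among gold-standard positives than among negatives, so observing it raises the score of a candidate pair.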

Fully connected → Compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small.

TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms

Page 3: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

Prediction Challenges

bull Predict effect of point mutations bull Predict structure of complexes bull Predict all interacting proteins

3

bullDOI 101002prot24356

ldquoSimplerdquo challenge Starting with known structure of a complex predict how much a mutation changes binding affinity

copy Wiley Periodicals Inc All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Moretti Rocco Sarel J Fleishman et al Community‐wide Evaluation of Methods forPredicting the Effect of Mutations on Proteinndashprotein Interactions Proteins Structure Function andBioinformatics 81 no 11 (2013) 1980-7

4

bullDOI 101002prot24356

BLOSUM

First Round

Second Round (Given data for nine random mutations at each position)

Area under curve for predictions (varying cutoff in ranking)

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

5

Predicting Structures of Complexes

bull Can we use structural data to predict complexes

bull This might be easier than quantitative predictions for site mutants

bull But it requires us to solve a docking problem

copy source unknown All rights reserved This content is excluded fromour Creative Commons license For more information seehttpocwmiteduhelpfaq-fair-use

6

Docking

N Tuncbag

Which surface(s) of protein A interactions with which surface of protein B

Courtesy of Nurcan Tuncbag Used with permission

7

Docking Imagine we wanted to predict which proteins interact with our favorite molecule For each potential partner

bullEvaluate all possible relative positions and orientations

bullallow for structural rearrangements

bullmeasure energy of interaction

This approach would be extremely slow Itrsquos also prone to false positives

Why

Courtesy of Nurcan Tuncbag Used with permission

8

Reducing the search space

bull Efficiently choose potential partners before structural comparisons

bull Use prior knowledge of interfaces to focus analysis on particular residues

9

Next

PRISM Fast and accurate modeling of

protein-protein interactions by combining template-interface-based docking with flexible refinement

Tuncbag N Keskin O Nussinov R Gursoy A

httpwwwncbinlmnihgovpubmed22275112

PrePPI Structure-based prediction of

proteinndashprotein interactions on a genome-wide scale

Zhang et al httpwwwnaturecomnatur

ejournalv490n7421fullnature11503html

10

PRISMrsquos Rationale

There are limited number of protein ldquoarchitecturesrdquo Protein structures can interact via similar architectural motifs even if the overall structures differ Find particular surface regions of proteins that are spatially similar to the complementary partners of a known interface

N Tuncbag 11

bull Two components ndash rigid-body structural comparisons of target

proteins to known template protein-protein interfaces

ndash flexible refinement using a docking energy function

bull Evaluate using structural similarity and evolutionary conservation of putative binding residue hot spots

PRISMrsquos Rationale

N Tuncbag 12

Subtilisin and its inhibitors Although global folds of Subtilisinrsquos partners are very different binding regions are structurally very conserved

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 13

Hotspots

Figure from Clackson amp Wells (1995)

copy American Association for the Advancement of Science All rights reserved This content is excludedfrom our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Clackson Tim and James A Wells A Hot Spot of Binding Energy in a Hormone-ReceptorInterface Science 267 no 5196 (1995) 383-6

14

Hotspots

Figure from Clackson amp Wells (1995)

copy American Association for the Advancement of Science All rights reserved This content is excludedfrom our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Clackson Tim and James A Wells A Hot Spot of Binding Energy in a Hormone-ReceptorInterface Science 267 no 5196 (1995) 383-6

15

bull Fewer than 10 of the residues at an interface contribute more than 2 kcalmol to binding

bull Hot spots ndash rich in Trp Arg and Tyr ndash occur on pockets on the two proteins that have

complementary shapes and distributions of charged and hydrophobic residues

ndash can include buried charge residues far from solvent

ndash O-ring structure excludes solvent from interface

httponlinelibrarywileycomdoi101002prot21396full 16

Ikb

NFkB

N Tuncbag

1 Identify interface of template (distance cutoff)

Courtesy of Nurcan Tuncbag Used with permission 17

Ikb

NFkB

N Tuncbag

1 Identify interface of template (distance cutoff)

Courtesy of Nurcan Tuncbag Used with permission 18

Ikb

NFkB

N Tuncbag

1 Identify interface of template (distance cutoff)

Courtesy of Nurcan Tuncbag Used with permission

19

Ikb

NFkB

N Tuncbag

ASPP2 1 Identify interface of template (distance cutoff)

2 Align entire surface of query to half-interfaces

3 Test 1 Overall structural match 2 Structural match of at

hotspots 3 Sequence match at

hotspots

Courtesy of Nurcan Tuncbag Used with permission 20

NFkB

N Tuncbag

ASPP2 1 Identify interface of template (distance cutoff)

2 Align entire surface of query to half-interfaces

3 Test 1 Overall structural match 2 Structural match of at

hotspots 3 Sequence match at

hotspot

4 Flexible refinement

Courtesy of Nurcan Tuncbag Used with permission 21

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Tuncbag Nurcan Attila Gursoy et al Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary andStructural Similarities at Interfaces using PRISM Nature Protocols 6 no 9 (2011) 1341-54

22

Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

∆Gcalc= - 2954 kcalmol ∆Gcalc= - 3751 kcalmol ∆Gcalc= - 4406 kcalmol

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 23

Next

PRISM Fast and accurate modeling of

protein-protein interactions by combining template-interface-based docking with flexible refinement

Tuncbag N Keskin O Nussinov R Gursoy A

httpwwwncbinlmnihgovpubmed22275112

PrePPI Structure-based prediction of

proteinndashprotein interactions on a genome-wide scale

Zhang et al httpwwwnaturecomnatur

ejournalv490n7421fullnature11503html

24

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

PrePPI Scores potential templates without building a homology model Criteria

Geometric similarity between the protomer and template Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

25

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1 Find homologous proteins of known structure (MAMB) 2 Find structural neighbors (NA

i

NBi

)(avg1500 neighborsstructure) 3 Look for structure of a complex containing structural neighbors 4 Align sequences of MAMB to NANB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

26

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1 Find homologous proteins of known structure (MAMB) 2 Find structural neighbors (NAiNBi)(avg1500 neighborsstructure) 3 Look for structure of a complex containing structural neighbors 4 Align sequences of MAMB to NANB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

27

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1 Find homologous proteins of known structure (MAMB) 2 Find structural neighbors (NAiNBi)(avg1500 neighborsstructure) 3 Look for structure of a complex containing structural neighbors 4 Align sequences of MAMB to NANB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

28

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1 Find homologous proteins of known structure (MAMB) 2 Find structural neighbors (NAiNBi)(avg1500 neighborsstructure) 3 Look for structure of a complex containing structural neighbors 4 Align sequences of MAMB to NANB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

29

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1 Find homologous proteins of known structure (MAMB) 2 Find structural neighbors (NAiNBi)(avg1500 neighborsstructure) 3 Look for structure of a complex containing structural neighbors 4 Align sequences of MAMB to NANB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

30

NA

NA

NB

NB

1 Identify interacting residues in template complex (Called NA1 NB3 in rest of paper)

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

31

NA

NA

NB

NB

1 Identify interacting residues in template complex (Called NA1 NB3 in rest of paper)

2 Predict interacting residues for the homology models

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

32

NA

NA

NB

NB

Evaluate based on five measures

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

33

NA

NA

NB

NB

Evaluate based on five measures bullSIM structural similarity of NAMA and NBMB

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 34

Evaluate based on five measures bullSIM structural similarity of NAMA and NBMB bullSIZ (number) COV (fraction) of interaction pairs can be aligned anywhere bullOS subset of SIZ at interface bullOL number of aligned pairs at interface

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

35

ldquoThe final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example residue type evolutionary conservation or statistical propensity to be in proteinndashprotein interfaces)rdquo

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 36

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1 Find homologous proteins of known structure (MAMB) 2 Find structural neighbors (NAiNBi)(avg1500

neighborsstructure) 3 Look for structure of a complex containing structural

neighbors 4 Align sequences of MAMB to NANB based on structure 5 Compute five scores 6 Train Bayesian classifier using ldquogold standardrdquo interactions

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.


We will examine Bayesian classifiers soon


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Detecting protein-protein interactions

Kumar, Anuj, and Michael Snyder. "Proteomics: Protein complexes take the bait." Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002); Ho, Y. et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

Tagging strategies

TAP-tag (endogenous protein levels): tandem purification
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

Yeast two-hybrid

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5): 655-62.

How does this compare to mass-spec based approaches?

• Does not require purification; will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.


Error Rates

• How can we estimate the error rates?

[Figure: Venn diagram of the gold standard and an experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

Error Rates

[Figure: Venn diagram of the gold standard, Experiment 1 and Experiment 2. Interactions that an experiment shares with the gold standard are true positives from the gold standard.]

Data Integration

[Figure: Venn diagram of the gold standard, Experiment 1 and Experiment 2, with regions labeled I to V]

• I = true positives from gold standard
• II = consensus true positives
• Fraction of consensus present in gold standard = I/II
• The interactions unique to one experiment are a mix of true and false positives
• III = the part of that mix found in the gold standard
• Define IV = true positives, V = false positives
• Assume I/II = III/IV

Estimated Error Rates

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

[Figure: Venn diagrams defining regions I, II and III, with the estimated true positives and false positives]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV
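The proportionality argument can be turned into a small calculation. The sketch below is illustrative only: the function name and the region counts are hypothetical and are not values from Hart et al. It solves I/II = III/IV for IV, the estimated number of true positives among the interactions unique to one experiment, and treats the rest of the unique interactions as false positives (region V).

```python
def estimate_unique_true_positives(region_i, region_ii, region_iii, n_unique):
    """Estimate true and false positives among interactions unique to one experiment.

    region_i   : consensus interactions also found in the gold standard (I)
    region_ii  : all consensus interactions, assumed to be true positives (II)
    region_iii : unique interactions also found in the gold standard (III)
    n_unique   : all interactions unique to the experiment (IV + V)
    """
    if region_i <= 0 or region_ii < region_i:
        raise ValueError("need 0 < I <= II")
    # Fraction of known-true interactions that the gold standard contains.
    gold_coverage = region_i / region_ii
    # Solve I/II = III/IV for IV; the estimate cannot exceed the unique set.
    region_iv = min(region_iii / gold_coverage, n_unique)
    # Whatever is left of the unique interactions is counted as false positives.
    region_v = n_unique - region_iv
    return region_iv, region_v, region_v / n_unique


# Hypothetical counts, chosen only to illustrate the arithmetic.
tp, fp, fp_fraction = estimate_unique_true_positives(200, 1000, 100, 2000)
print(tp, fp, fp_fraction)
```

With these made-up counts the gold standard covers 20% of the consensus set, so 100 gold-standard hits among the unique interactions imply roughly 500 true positives and 1,500 false positives.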

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$$\text{Data} = (\text{two\_hybrid}, \text{mass\_spec}, \text{co\_evolution}, \text{co\_expression})$$

or

$$\text{Data} = (\text{mass\_spec\_expt\_1}, \text{mass\_spec\_expt\_2})$$

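Bayes Rule

$$P(\text{true\_PPI} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data})}$$

Here $P(\text{true\_PPI} \mid \text{Data})$ is the posterior, $P(\text{true\_PPI})$ is the prior, and $P(\text{Data} \mid \text{true\_PPI})$ is the likelihood.

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

How do we compute this?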

likelihood ratio =

$$\frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

log likelihood ratio =

$$\log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function =

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

We assume the observations are independent (we'll see how to handle dependence soon).

Ranking function =

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
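Under the independence assumption, the log of the product becomes a sum of per-observation log ratios. The following is a minimal sketch of that ranking, assuming each observation is a binary "detected / not detected" call; the assay names, detection rates and protein pairs are hypothetical and would in practice be estimated from high-confidence positive and negative sets.

```python
import math


def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum of per-assay log likelihood ratios, assuming independent assays.

    observations  : dict assay -> bool (interaction detected or not)
    p_given_true  : dict assay -> P(detected | true_PPI)
    p_given_false : dict assay -> P(detected | false_PPI)
    """
    score = 0.0
    for assay, detected in observations.items():
        p_true = p_given_true[assay]
        p_false = p_given_false[assay]
        if not detected:
            # A negative result contributes P(not detected | class).
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        score += math.log(p_true / p_false)
    return score


# Hypothetical detection rates for two assays.
P_TRUE = {"two_hybrid": 0.20, "mass_spec": 0.50}
P_FALSE = {"two_hybrid": 0.01, "mass_spec": 0.05}

# Hypothetical protein pairs and their assay results.
pairs = {
    "A-B": {"two_hybrid": True, "mass_spec": True},
    "C-D": {"two_hybrid": False, "mass_spec": True},
    "E-F": {"two_hybrid": False, "mass_spec": False},
}

ranked = sorted(
    pairs,
    key=lambda name: log_likelihood_ratio(pairs[name], P_TRUE, P_FALSE),
    reverse=True,
)
print(ranked)
```

A pair seen in both assays ranks highest, but a pair seen in only one assay still receives a positive score, which is the point of ranking instead of requiring detection everywhere.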

von Mering et al. Nature 417, 399-403 (23 May 2002)

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio:

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD).

True negatives = protein pairs that
1. are annotated to belong to distinct complexes, and
2. have different subcellular locations OR anticorrelated mRNA expression.

© American Society for Biochemistry and Molecular Biology. All rights reserved. Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
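The two axes of the ROC curve can be computed directly from a ranked list once high-confidence positives and negatives are available. The sketch below is a generic illustration, not the procedure used by Collins et al.; the function name, scores and labels are hypothetical, and ties between scores are not treated specially.

```python
def roc_points(scores, labels):
    """Return (FP/(TN+FP), TP/(TP+FN)) pairs as the score cutoff is lowered.

    scores : list of floats, higher means more confident
    labels : list of bools, True for a high-confidence positive,
             False for a high-confidence negative
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    # Walk down the ranking; each step admits one more predicted interaction.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores for four protein pairs.
example = roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
print(example)
```

Each point corresponds to one possible cutoff, so choosing a cutoff amounts to choosing where on the curve to operate.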

Collins et al., Mol Cell Proteomics 2007

[Figure: the same ROC curve (x-axis: FP/(TN+FP), computed using high-confidence negatives; y-axis: TP/(TP+FN), computed using high-confidence positives from complexes). Annotations: 9,074 interactions; literature curated (excluding 2-hybrid); error rate equivalent to literature curated.]


Bayesian Networks

A method for using probabilities to reason

In Biology
• Gene regulation
• Signaling
• Prediction

Bayesian Networks
• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1) => 2^N - 1 parameters
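To make the growth concrete, the sketch below counts free parameters for a full joint table and, for comparison, for a network with one hidden binary cause and independent binary observations (the naïve Bayes structure used earlier). The function names are hypothetical and the comparison is an illustration added here, not a figure from the lecture.

```python
def full_joint_parameters(n_binary_vars):
    """Free parameters of a full joint table over N binary variables: 2^N - 1."""
    return 2 ** n_binary_vars - 1


def naive_bayes_parameters(n_observed):
    """Free parameters with one hidden binary cause and n observed binary effects.

    One prior for the cause, plus P(effect = 1 | cause) for each of the
    two cause values and each observed effect.
    """
    return 1 + 2 * n_observed


for n_observed in (4, 10, 20):
    n_vars = n_observed + 1  # observed effects plus the hidden cause
    print(n_observed, full_joint_parameters(n_vars), naive_bayes_parameters(n_observed))
```

The full table grows exponentially with the number of variables, while the factored model grows linearly, which is why the graph structure matters.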

Graphical Structure Expresses our Beliefs

[Figure: node "P1-P2 REAL", the cause (often “hidden”), pointing to the effects (observed): "Detected by X1" … "Detected by Xn"]

[Figure: a second network with "P1-P2 REAL" and the detections "Detected by X1" … "Detected by Xn", "Detected by Xj", "Detected by Xj+1" …, plus the additional nodes "Membrane" and "Highly expressed"]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Figure: network with nodes Smart, Grade inflation, Grades, GREs and Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Figure: Smart → Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

The formulations are equivalent and both require the same number of constraints.
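The equivalence of the two formulations can be checked numerically: multiplying the prior P(S) by the conditional P(G|S) should reproduce every entry of the joint table. A short sketch using the numbers from the tables (the variable names are ours):

```python
# Prior on Smart and conditional table for Grades, as given in the tables.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# P(S, G) = P(S) * P(G | S) for every combination of S and G.
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

for (s, g), p in joint.items():
    print(s, g, round(p, 3))
```

Both forms have three free numbers: the joint table has four entries that must sum to 1, and the conditional form has one prior plus one conditional probability per value of S.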

Automated Admissions Decisions

[Figure: network with nodes Smart, Grade inflation, Grades, GREs and Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

“Explaining Away”

[Figure: network with nodes Season, S (Sprinkler), R (Rain), Grass wet and Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

“Explaining Away”

[Figure: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
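The coin example is small enough to enumerate. The sketch below (helper names are ours) lists the four equally likely outcomes and compares the probability that C1 = H with and without knowledge of the score.

```python
from itertools import product

# All four equally likely outcomes (C1, C2, Score); Score is 1 when the coins match.
outcomes = [(c1, c2, 1 if c1 == c2 else 0) for c1, c2 in product("HT", repeat=2)]


def conditional(event, given):
    """P(event | given) by counting, valid because all outcomes are equally likely."""
    kept = [o for o in outcomes if given(o)]
    return sum(1 for o in kept if event(o)) / len(kept)


def c1_is_heads(outcome):
    return outcome[0] == "H"


# Without the score, C2 tells us nothing about C1.
p_without_score = conditional(c1_is_heads, lambda o: o[1] == "T")
# Once the score is known, C2 determines C1.
p_with_score = conditional(c1_is_heads, lambda o: o[1] == "T" and o[2] == 1)

print(p_without_score, p_with_score)
```

The two coins are independent on their own, but conditioning on their common effect (the score) couples them.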

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function
• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

Learning Models from Data

[Figure: Coin 1 → Score ← Coin 2. The values of p(C1=H), p(C1=T), p(C2=H), p(C2=T) and the Score column of the table (rows HH, TT, HT, TH) are left blank, to be learned from data.]

D = (C1, C2, S) = (H,T,0), (H,H,1), …
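When the structure is known and every variable is observed, the maximum-likelihood parameters of a discrete network are simply relative frequencies. The sketch below illustrates this for the coin network; the function name is ours, and only the first two records of the data set come from the slide, the other two are hypothetical.

```python
from collections import Counter


def mle_coin_parameters(data):
    """Maximum-likelihood estimates by counting, for fully observed (C1, C2, Score) records."""
    n = len(data)
    p_c1_heads = sum(1 for c1, _, _ in data if c1 == "H") / n
    p_c2_heads = sum(1 for _, c2, _ in data if c2 == "H") / n
    # P(Score = 1 | C1, C2): fraction of records with Score 1 for each coin pair seen.
    totals = Counter((c1, c2) for c1, c2, _ in data)
    ones = Counter((c1, c2) for c1, c2, s in data if s == 1)
    p_score_one = {pair: ones[pair] / totals[pair] for pair in totals}
    return p_c1_heads, p_c2_heads, p_score_one


# First two records as on the slide; the last two are made up for illustration.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0)]
p_c1, p_c2, p_score = mle_coin_parameters(D)
print(p_c1, p_c2, p_score)
```

With hidden variables or missing records, counting no longer suffices, which is where search methods such as EM come in.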

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

A worked “toy” example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

[Figure: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)   (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

Prediction with Bayes Nets

[Figure: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

S:   P(F) = 0.5   P(T) = 0.5

G given S:
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R given S:
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A given G and R:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)
           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
             (terms in the order G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn't help very much.

GREs are better correlated with intelligence than the grades.
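The double sum for P(A=T|S=T) can be checked by enumerating the four combinations of G and R. A minimal sketch using the S=T rows of the tables (the variable names are ours):

```python
# Conditional tables restricted to S = T, plus P(A = T | G, R), from the slide.
p_g_given_smart = {False: 0.8, True: 0.2}   # P(G | S=T)
p_r_given_smart = {False: 0.2, True: 0.8}   # P(R | S=T)
p_admit_given_g_r = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}

# Marginalize over the unobserved G and R.
p_admit_given_smart = sum(
    p_g_given_smart[g] * p_r_given_smart[r] * p_admit_given_g_r[(g, r)]
    for g in (False, True)
    for r in (False, True)
)
print(round(p_admit_given_smart, 2))
```

The four products in the sum are exactly the four terms written out above, so the result rounds to 0.29.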

Inference with Bayes Nets

[Figure: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)
       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) calculated on previous slide = 0.29
P(A=T|S=F) = 0.14, calculated analogously to P(A=T|S=T)

P(S=T|A=T) = P(S=T,A=T) / P(A=T)
Or, using Bayes Rule, = P(S=T) P(A=T|S=T) / P(A=T)
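The last step is a direct application of Bayes Rule. The sketch below takes the conditional probabilities exactly as stated on the slide (0.29 and 0.14) as inputs; the function name is ours.

```python
def posterior_from_bayes_rule(prior, likelihood_if_true, likelihood_if_false):
    """Return P(cause | effect) and P(effect) for a binary cause."""
    # Total probability of the effect, summing over both values of the cause.
    evidence = prior * likelihood_if_true + (1.0 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence, evidence


# Values as stated on the slide: P(S=T) = 0.5, P(A=T|S=T) = 0.29, P(A=T|S=F) = 0.14.
p_smart_given_admit, p_admit = posterior_from_bayes_rule(0.5, 0.29, 0.14)
print(round(p_admit, 3), round(p_smart_given_admit, 2))
```

With these inputs, meeting an admitted student raises the probability that she is smart from the prior of 0.5 to roughly two in three.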

Inference with Bayes Nets

[Figure: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F,A=F) / P(A=F)
           = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R)   (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6

End of worked example

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Figure: node "P1-P2 REAL", the cause (often “hidden”), pointing to the effects (observed): "Detected by X1" … "Detected by Xn"]

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1.5, 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

© American Society for Biochemistry and Molecular Biology. All rights reserved. Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7,000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

[Figures from Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.]


Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

Likelihood = L = P(f|pos) / P(f|neg)

[Figure: table of counts, including 1114 / 2150 for the positives and 81924 / 573734 for the negatives]
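The likelihood ratio for one feature value is the ratio of two relative frequencies, one in the positive gold standard and one in the negative gold standard. The sketch below assumes that the four numbers recovered from the slide are counts of the form "pairs with this feature value / all pairs" for the positives (1114 of 2150) and the negatives (81924 of 573734); that pairing is our reading of the garbled figure, and the function name is ours.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), with both probabilities estimated as frequencies."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (assumed pairing, see the text above).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 1))
```

A value of L above 1 means the feature value is more common among gold-standard positives than negatives, so observing it raises the odds that the pair interacts.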

Fully connected → compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small

[Figure: TP/FP as a function of the likelihood ratio cutoff, with a reference line at TP = FP]

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


bullDOI 101002prot24356

ldquoSimplerdquo challenge Starting with known structure of a complex predict how much a mutation changes binding affinity

copy Wiley Periodicals Inc All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Moretti Rocco Sarel J Fleishman et al Community‐wide Evaluation of Methods forPredicting the Effect of Mutations on Proteinndashprotein Interactions Proteins Structure Function andBioinformatics 81 no 11 (2013) 1980-7

4

bullDOI 101002prot24356

BLOSUM

First Round

Second Round (Given data for nine random mutations at each position)

Area under curve for predictions (varying cutoff in ranking)

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

5

Predicting Structures of Complexes

bull Can we use structural data to predict complexes

bull This might be easier than quantitative predictions for site mutants

bull But it requires us to solve a docking problem

copy source unknown All rights reserved This content is excluded fromour Creative Commons license For more information seehttpocwmiteduhelpfaq-fair-use

6

Docking

N Tuncbag

Which surface(s) of protein A interactions with which surface of protein B

Courtesy of Nurcan Tuncbag Used with permission

7

Docking Imagine we wanted to predict which proteins interact with our favorite molecule For each potential partner

bullEvaluate all possible relative positions and orientations

bullallow for structural rearrangements

bullmeasure energy of interaction

This approach would be extremely slow Itrsquos also prone to false positives

Why

Courtesy of Nurcan Tuncbag Used with permission

8

Reducing the search space

bull Efficiently choose potential partners before structural comparisons

bull Use prior knowledge of interfaces to focus analysis on particular residues

9

Next

PRISM Fast and accurate modeling of

protein-protein interactions by combining template-interface-based docking with flexible refinement

Tuncbag N Keskin O Nussinov R Gursoy A

httpwwwncbinlmnihgovpubmed22275112

PrePPI Structure-based prediction of

proteinndashprotein interactions on a genome-wide scale

Zhang et al httpwwwnaturecomnatur

ejournalv490n7421fullnature11503html

10

PRISMrsquos Rationale

There are limited number of protein ldquoarchitecturesrdquo Protein structures can interact via similar architectural motifs even if the overall structures differ Find particular surface regions of proteins that are spatially similar to the complementary partners of a known interface

N Tuncbag 11

bull Two components ndash rigid-body structural comparisons of target

proteins to known template protein-protein interfaces

ndash flexible refinement using a docking energy function

bull Evaluate using structural similarity and evolutionary conservation of putative binding residue hot spots

PRISMrsquos Rationale

N Tuncbag 12

Subtilisin and its inhibitors Although global folds of Subtilisinrsquos partners are very different binding regions are structurally very conserved

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 13

Hotspots

Figure from Clackson amp Wells (1995)

copy American Association for the Advancement of Science All rights reserved This content is excludedfrom our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Clackson Tim and James A Wells A Hot Spot of Binding Energy in a Hormone-ReceptorInterface Science 267 no 5196 (1995) 383-6

14


• Fewer than 10% of the residues at an interface contribute more than 2 kcal/mol to binding

• Hot spots
  – rich in Trp, Arg, and Tyr
  – occur in pockets on the two proteins that have complementary shapes and distributions of charged and hydrophobic residues
  – can include buried charged residues far from solvent
  – O-ring structure excludes solvent from interface

http://onlinelibrary.wiley.com/doi/10.1002/prot.21396/full 16

Ikb

NFkB

N Tuncbag

1 Identify interface of template (distance cutoff)

Courtesy of Nurcan Tuncbag Used with permission 17



NFkB

N Tuncbag

ASPP2

1. Identify interface of template (distance cutoff)

2. Align entire surface of query to half-interfaces

3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots

4. Flexible refinement
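Step 1 (identifying the template interface with a distance cutoff) can be illustrated with a minimal sketch. This is not PRISM's implementation: the function name `interface_residues`, the one-point-per-residue representation, and the 6.0 Å default cutoff are all assumptions chosen for illustration.

```python
from math import dist

def interface_residues(coords_a, coords_b, cutoff=6.0):
    """Return sorted residue indices of chain A and chain B that lie
    within `cutoff` of any residue in the other chain.

    coords_a, coords_b: one (x, y, z) point per residue (e.g. C-alpha).
    The cutoff value is an illustrative assumption, not PRISM's setting.
    """
    near_a, near_b = set(), set()
    for i, p in enumerate(coords_a):
        for j, q in enumerate(coords_b):
            if dist(p, q) <= cutoff:
                near_a.add(i)
                near_b.add(j)
    return sorted(near_a), sorted(near_b)

# Toy coordinates (hypothetical): only the first residue of each chain is in contact.
chain_a = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
chain_b = [(0.0, 5.0, 0.0), (30.0, 0.0, 0.0)]
interface_a, interface_b = interface_residues(chain_a, chain_b)
print(interface_a, interface_b)
```

The two index lists correspond to the two "half-interfaces" against which a query surface would then be aligned in step 2.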

Courtesy of Nurcan Tuncbag Used with permission 21

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Tuncbag, Nurcan, Attila Gursoy, et al. “Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary and Structural Similarities at Interfaces using PRISM.” Nature Protocols 6, no. 9 (2011): 1341-54.

22

Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

ΔGcalc = -29.54 kcal/mol    ΔGcalc = -37.51 kcal/mol    ΔGcalc = -44.06 kcal/mol

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 23


Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

PrePPI scores potential templates without building a homology model. Criteria:

• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. “Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale.” Nature 490, no. 7421 (2012): 556-60.

25

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

26



[Figure: template complex with chains NA and NB, alongside the homology models]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)

2. Predict interacting residues for the homology models

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

32


Evaluate based on five measures

• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number) and COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

35

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 36


1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

38

Outline

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin A-C et al. Nature 415, 141-147 (2002); Ho Y et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

40

Courtesy of Macmillan Publishers LimitedUsed with permissionSource Kumar Anuj and Michael Snyder Proteomics Protein Complexes take the Bait Nature 415 no 6868 (2002) 123-4 40


Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions

• Proteins are in their correct subcellular location

Limitations
• overexpression/tagging can influence results
• only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels): tandem purification
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. “Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes.” Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Biotechniques 2008 Apr; 44(5): 655-62. Ratushny V, Golemis E

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of Cell Signaling Pathwaysusing the Evolving Yeast Two-hybrid System Biotechniques 44 no 5 (2008) 655

44

• Does not require purification, so it will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Biotechniques 2008 Apr44(5)655-62 Ratushny V Golemis E

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of CellSignaling Pathways using the Evolving Yeast Two-hybrid System Biotechniques44 no 5 (2008) 655

45


Error Rates

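• How can we estimate the error rates?

[Figure: Venn diagram of a gold standard set overlapping an experimental set. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]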

47

Error Rates

[Figure: Venn diagram of the gold standard, Experiment 1, and Experiment 2. The overlap with the gold standard contains the true positives from the gold standard.]

48

Data Integration

[Figure: Venn diagram of the gold standard, Experiment 1, and Experiment 2, with regions labeled I to V]

• I = true positives from gold standard
• II = consensus true positives
• III = interactions unique to one experiment that are present in the gold standard
• The rest of the unique interactions are a mix of true and false positives. Define IV = true positives, V = false positives

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

52

Estimated Error Rates

[Figure: Venn diagram with regions I, II, and III, split into false positives and true positives]

How complete are current yeast and human protein-interaction networks? G Traver Hart, Arun K Ramani and Edward M Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{\mathrm{I}}{\mathrm{II}} = \frac{\mathrm{III}}{\mathrm{IV}}$$
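The assumption I/II = III/IV can be solved for IV, which then splits the interactions unique to one experiment into estimated true and false positives. The sketch below only illustrates the arithmetic; the function name and all counts are hypothetical and are not taken from Hart et al.

```python
def split_unique_interactions(n_i, n_ii, n_iii, n_unique_outside_gold):
    """Apply I/II = III/IV to interactions unique to one experiment.

    n_i, n_ii: sizes of consensus regions I and II.
    n_iii: unique interactions that are in the gold standard.
    n_unique_outside_gold: unique interactions not in the gold standard (IV + V).
    """
    n_iv = n_iii * n_ii / n_i                 # solve I/II = III/IV for IV
    n_iv = min(n_iv, n_unique_outside_gold)   # IV cannot exceed what was observed
    n_v = n_unique_outside_gold - n_iv        # remainder is estimated false positives
    false_fraction = n_v / (n_iii + n_unique_outside_gold)
    return n_iv, n_v, false_fraction

# Hypothetical counts, for illustration only.
n_iv, n_v, false_fraction = split_unique_interactions(
    n_i=200, n_ii=100, n_iii=300, n_unique_outside_gold=1000
)
print(n_iv, n_v, round(false_fraction, 3))
```

With these made-up counts, most of the unique interactions outside the gold standard are attributed to region V, which is the kind of conclusion the estimate is designed to expose.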

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by >1 method

• Filter out “sticky” proteins

• Estimate probability of each interaction based on data

• Use external data to predict

55

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

Note log-log plot Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scale DataSets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$$\text{Data} = (\text{two\_hybrid},\ \text{mass\_spec},\ \text{co\_evolution},\ \text{co\_expression})$$

or

$$\text{Data} = (\text{mass\_spec\_expt\_1},\ \text{mass\_spec\_expt\_2})$$

57

Bayes Rule

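$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\ \overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$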

58

Naiumlve Bayes Classification

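$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\ \overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$

likelihood ratio = ratio of posterior probabilities

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})}$$

if > 1 classify as true; if < 1 classify as false

How do we compute this?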

59


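$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\,P(\text{Data} \mid \text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$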

61


We assume the observations are independent (we’ll see how to handle dependence soon).

We can compute these terms if we have a set of high-confidence positive and negative interactions.

Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al. Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function =

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$
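Under the independence assumption, the log of the product becomes a sum of per-observation log ratios. The following is a minimal sketch of that ranking function for binary "detected / not detected" observations; the assay names, detection rates, and protein pairs are hypothetical, and a real analysis would estimate the rates from high-confidence positive and negative sets.

```python
from math import log

def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)] over assays,
    assuming the observations are independent given the class."""
    total = 0.0
    for assay, detected in observations.items():
        p_true = p_given_true[assay] if detected else 1 - p_given_true[assay]
        p_false = p_given_false[assay] if detected else 1 - p_given_false[assay]
        total += log(p_true / p_false)
    return total

# Hypothetical detection rates: P(detected | true_PPI) and P(detected | false_PPI).
p_given_true = {"two_hybrid": 0.30, "mass_spec": 0.50}
p_given_false = {"two_hybrid": 0.01, "mass_spec": 0.05}

# Hypothetical candidate pairs and their assay results.
pairs = {
    "A-B": {"two_hybrid": True, "mass_spec": True},
    "A-C": {"two_hybrid": False, "mass_spec": True},
    "B-C": {"two_hybrid": False, "mass_spec": False},
}
scores = {name: log_likelihood_ratio(obs, p_given_true, p_given_false)
          for name, obs in pairs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Note that a pair detected by only one assay still receives a positive score here, so it is ranked rather than discarded.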

64

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

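Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio $\log\left[P(\text{Data} \mid \text{true\_PPI}) / P(\text{Data} \mid \text{false\_PPI})\right]$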

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives =     True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50
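The two ROC axes, TP/(TP+FN) and FP/(TN+FP), can be computed by sweeping a score threshold over a set of high-confidence positives and negatives. This is a generic sketch, not the procedure of Collins et al.; the scores below are hypothetical.

```python
def roc_points(pos_scores, neg_scores):
    """Return (false positive rate, true positive rate) pairs obtained by
    calling everything with score >= threshold an interaction."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(s >= t for s in pos_scores)   # positives called true
        fp = sum(s >= t for s in neg_scores)   # negatives called true
        fpr = fp / len(neg_scores)             # FP / (TN + FP)
        tpr = tp / len(pos_scores)             # TP / (TP + FN)
        points.append((fpr, tpr))
    return points

# Hypothetical scores for gold-standard positives and negatives.
positives = [5.7, 2.0, 0.4]
negatives = [1.0, -1.0, -2.5, -3.0]
points = roc_points(positives, negatives)
for fpr, tpr in points:
    print(f"{fpr:.2f}\t{tpr:.2f}")
```

Each threshold contributes one point, so the curve runs from (0, 0) at the strictest cutoff to (1, 1) when every pair is accepted.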

Collins et al Mol Cell Proteomics 2007

9074 interactions

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Marked on the plot: “Literature curated (excluding 2-hybrid)” and “Error rate equivalent to literature curated”.]

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50



Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities

• Consist of a graph (network) and a set of probabilities

• These can be “learned” from the data

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown

• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1)

=> $2^N - 1$ parameters
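To get a feel for how quickly $2^N - 1$ grows, the sketch below compares it with the parameter count of a factored model in which one hidden binary class has N binary observations that are independent given the class. The comparison model and the function names are illustrative assumptions, not part of the slide.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters in a complete joint table: 2^N states minus the
    single sum-to-one constraint."""
    return 2 ** n_binary_variables - 1

def naive_bayes_parameters(n_observations):
    """Free parameters when observations are independent given a binary class:
    one for P(class = true), plus P(observation = 1 | class) for each
    observation under each of the two class values."""
    return 1 + 2 * n_observations

for n_obs in (4, 10, 20):
    n_vars = n_obs + 1  # the observations plus the class variable
    print(n_obs, full_joint_parameters(n_vars), naive_bayes_parameters(n_obs))
```

The factored count grows linearly in the number of observations, while the full table doubles with every added variable.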

77

Graphical Structure Expresses our Beliefs

[Figure: the node “P1-P2 REAL” is the cause (often “hidden”). Its children “Detected by X1”, …, “Detected by Xn” are the effects (observed).]

78

Graphical Structure Expresses our Beliefs

[Figure, left: “P1-P2 REAL” with children “Detected by X1”, …, “Detected by Xn”. Right: the same network with the additional nodes “Membrane” and “Highly expressed” alongside “Detected by Xj”, “Detected by Xj+1”, …]

Naïve Bayes assumes all observations are independent

But some observations may be coupled

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Figure: Bayesian network with nodes Smart, Grade inflation, Grades, GREs, and Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Figure: Bayesian network with nodes Smart, Grade inflation, Grades, GREs, and Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Figure: Smart → Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require same number of constraints 85
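The equivalence of the two formulations can be checked numerically: multiplying P(S) by P(G|S) reproduces the joint table, and marginalizing the joint table recovers P(S). A short sketch using the numbers from the slide (the variable names are illustrative):

```python
# Conditional formulation, values from the slide.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Build the joint table P(S, G) = P(S) * P(G | S).
joint = {(s, g): p_s[s] * p_g_given_s[s][g]
         for s in (False, True) for g in (False, True)}

# Go back the other way: marginalize to get P(S), then divide to get P(G | S).
p_s_back = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}
p_g_back = {s: {g: joint[(s, g)] / p_s_back[s] for g in (False, True)}
            for s in (False, True)}

for (s, g), p in joint.items():
    print(s, g, round(p, 3))
```

Both tables are pinned down by three free numbers: the joint has four entries that must sum to one, and the conditional form has P(S=T), P(G=T|S=F), and P(G=T|S=T).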

Automated Admissions Decisions

[Figure: Bayesian network with nodes Smart, Grade inflation, Grades, GREs, and Admit]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

Admit

86

“Explaining Away”

[Figure: Bayesian network with nodes Season, Sprinkler (S), Rain (R), Grass wet, and Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

“Explaining Away”

[Figure: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1  C2  Score
H   H   1
T   T   1
H   T   0
T   H   0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
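The two-coin example is small enough to enumerate exactly. The sketch below builds the joint distribution from the table (score is 1 when the coins match) and compares the belief in C1 = H before and after the score is known; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)

# Joint distribution over (C1, C2, Score); Score is 1 when the coins match.
joint = {}
for c1, c2 in product("HT", repeat=2):
    score = 1 if c1 == c2 else 0
    joint[(c1, c2, score)] = half * half

def prob(event):
    """Total probability of all outcomes for which event(c1, c2, score) holds."""
    return sum(p for outcome, p in joint.items() if event(*outcome))

def conditional(event, given):
    """P(event | given) by direct enumeration."""
    return prob(lambda *o: event(*o) and given(*o)) / prob(given)

def c1_heads(c1, c2, s):
    return c1 == "H"

# Without the score, C2 tells us nothing about C1.
p_c1_given_c2 = conditional(c1_heads, lambda c1, c2, s: c2 == "T")
# Knowing only the score also leaves C1 undetermined.
p_c1_given_score = conditional(c1_heads, lambda c1, c2, s: s == 1)
# Knowing the score AND C2 determines C1 completely.
p_c1_given_both = conditional(c1_heads, lambda c1, c2, s: s == 1 and c2 == "T")

print(p_c1_given_c2, p_c1_given_score, p_c1_given_both)
```

The coins are independent a priori, but once their common child (the score) is observed, learning C2 changes the belief about C1.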

88

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?

• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior

• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Figure: Coin 1 → Score ← Coin 2]

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1  C2  Score
H   H   ?
T   T   ?
H   T   ?
T   H   ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …
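When the structure is known and every variable is observed, the maximum-likelihood parameters are simply observed frequencies. The sketch below fills in the blanks of the coin network from a small data set D; the first two rows follow the slide, and the remaining rows are hypothetical.

```python
from collections import Counter

# Observations (C1, C2, Score). Rows after the first two are made up for illustration.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1),
        ("H", "T", 0), ("T", "H", 0), ("H", "H", 1)]

def mle_heads(data, index):
    """Maximum-likelihood estimate of P(coin = H): the observed fraction of heads."""
    counts = Counter(row[index] for row in data)
    return counts["H"] / len(data)

def mle_score_table(data):
    """Maximum-likelihood estimate of P(Score = 1 | C1, C2) for each observed
    parent configuration: the fraction of rows with that configuration
    in which the score is 1."""
    totals = Counter((c1, c2) for c1, c2, _ in data)
    ones = Counter((c1, c2) for c1, c2, s in data if s == 1)
    return {config: ones[config] / totals[config] for config in totals}

p_c1_heads = mle_heads(data, 0)
p_c2_heads = mle_heads(data, 1)
score_table = mle_score_table(data)
print(p_c1_heads, p_c2_heads, score_table)
```

With hidden variables or missing values, counting no longer suffices, which is where iterative methods such as EM come in.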

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked “toy” example

Best to work through on your own

95

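Chain Rule of Probability for Bayes Nets

[Figure: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

$$P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R)$$

$$= P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R) \quad \text{(why?)}$$

(because of conditional independence assumption)

Recall: $P(X,Y) = P(X \mid Y)\,P(Y)$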

96

Prediction with Bayes Nets

[Figure: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

S:
P(F)  P(T)
0.5   0.5

G (rough grading: being smart doesn’t help very much):
S  P(F)  P(T)
F  0.9   0.1
T  0.8   0.2

R (GREs are better correlated with intelligence than the grades):
S  P(F)  P(T)
F  0.8   0.2
T  0.2   0.8

A:
G  R  P(F)  P(T)
F  F  0.9   0.1
F  T  0.8   0.2
T  F  0.5   0.5
T  T  0.2   0.8

$$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}}\ \sum_{R \in \{F,T\}} P(G \mid S{=}T)\,P(R \mid S{=}T)\,P(A{=}T \mid G,R)$$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29

with the four terms corresponding to (G,R) = (F,F), (F,T), (T,F), (T,T)

97

Inference with Bayes Nets

[Figure: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

$$P(A{=}T) = \sum_{S \in \{F,T\}}\ \sum_{G \in \{F,T\}}\ \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}T \mid G,R)$$

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) ≈ 0.23

P(S=T) = 0.5, P(S=F) = 0.5

P(A=T|S=T), calculated on the previous slide, = 0.29

P(A=T|S=F), calculated analogously to P(A=T|S=T), = 0.16

P(S=T|A=T) = P(S=T, A=T)/P(A=T)

Or, using Bayes Rule: = P(S=T)P(A=T|S=T)/P(A=T)
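Both the prediction and the inference can be checked by brute-force enumeration of the 16 joint states, using the conditional probability tables from the worked example. This is a minimal sketch; the helper names `bern`, `joint`, and `prob` are illustrative.

```python
from itertools import product

# Conditional probability tables from the worked example, stored as P(X = T | parents).
P_S_TRUE = 0.5
P_G_TRUE = {False: 0.1, True: 0.2}            # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}            # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)

def bern(p_true, value):
    """Probability of a binary variable taking `value`, given P(value = True)."""
    return p_true if value else 1 - p_true

def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bern(P_S_TRUE, s) * bern(P_G_TRUE[s], g)
            * bern(P_R_TRUE[s], r) * bern(P_A_TRUE[(g, r)], a))

def prob(**fixed):
    """Marginal probability of the variables fixed by keyword, e.g. prob(a=True)."""
    total = 0.0
    for values in product((False, True), repeat=4):
        assignment = dict(zip("sgra", values))
        if all(assignment[k] == v for k, v in fixed.items()):
            total += joint(**assignment)
    return total

p_admit_given_smart = prob(a=True, s=True) / prob(s=True)
p_admit_given_not_smart = prob(a=True, s=False) / prob(s=False)
p_admit = prob(a=True)
p_smart_given_admit = prob(s=True, a=True) / p_admit

print(round(p_admit_given_smart, 3), round(p_admit_given_not_smart, 3),
      round(p_admit, 3), round(p_smart_given_admit, 3))
```

Enumeration is only practical for toy networks like this one, but it is a useful way to check hand calculations term by term.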

98

Inference with Bayes Nets

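[Figure: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S}\sum_{G} P(S)\,P(G \mid S)\,P(R{=}F \mid S)\,P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S}\sum_{G}\sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}F \mid G,R) \quad \text{(as before)}$$

Tedious but straightforward to compute

$$\frac{P(G{=}F \mid A{=}F)}{P(R{=}F \mid A{=}F)} = \frac{0.94}{0.55} \approx 1.7$$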
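The "tedious but straightforward" sums can be delegated to a few lines of enumeration. The sketch below recomputes both conditional probabilities directly from the conditional probability tables of the worked example; the helper names are illustrative.

```python
from itertools import product

# Conditional probability tables from the worked example, stored as P(X = T | parents).
P_S_TRUE = 0.5
P_G_TRUE = {False: 0.1, True: 0.2}            # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}            # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)

def bern(p_true, value):
    """Probability of a binary variable taking `value`, given P(value = True)."""
    return p_true if value else 1 - p_true

def prob(**fixed):
    """Marginal probability of the variables fixed by keyword (s, g, r, a)."""
    total = 0.0
    for s, g, r, a in product((False, True), repeat=4):
        assignment = {"s": s, "g": g, "r": r, "a": a}
        if all(assignment[k] == v for k, v in fixed.items()):
            total += (bern(P_S_TRUE, s) * bern(P_G_TRUE[s], g)
                      * bern(P_R_TRUE[s], r) * bern(P_A_TRUE[(g, r)], a))
    return total

p_not_admitted = prob(a=False)
p_bad_grades = prob(g=False, a=False) / p_not_admitted   # P(G=F | A=F)
p_bad_gres = prob(r=False, a=False) / p_not_admitted     # P(R=F | A=F)

print(round(p_bad_grades, 2), round(p_bad_gres, 2),
      round(p_bad_grades / p_bad_gres, 1))
```

Bad grades are the more probable explanation largely because low grades are common in this toy model regardless of whether the student is smart.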

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Figure: the node “P1-P2 REAL” is the cause (often “hidden”). Its children “Detected by X1”, …, “Detected by Xn” are the effects (observed).]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al. Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high confidence interactions from small scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. “Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations.” Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled “More likely to interact”]

Courtesy of Cokus et al License CC-BYSource Cokus Shawn Sayaka Mizutani et al An Improved Method for Identifying FunctionallyLinked Proteins using Phylogenetic Profiles BMC Bioinformatics 8 no Suppl 4 (2007) S7

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract 105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical

• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

109

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

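$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\,P(\text{Data} \mid \text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function =

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$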

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114/2150}{81924/573734}$$

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected → Compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\mathrm{Data}\mid\mathrm{true\_PPI})}{P(\mathrm{Data}\mid\mathrm{false\_PPI})}$$

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


• DOI: 10.1002/prot.24356

BLOSUM

First Round

Second Round (Given data for nine random mutations at each position)

Area under curve for predictions (varying cutoff in ranking)

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

5

Predicting Structures of Complexes

• Can we use structural data to predict complexes?

• This might be easier than quantitative predictions for site mutants

• But it requires us to solve a docking problem

copy source unknown All rights reserved This content is excluded fromour Creative Commons license For more information seehttpocwmiteduhelpfaq-fair-use

6

Docking

N Tuncbag

Which surface(s) of protein A interact with which surface(s) of protein B?

Courtesy of Nurcan Tuncbag Used with permission

7

Docking

Imagine we wanted to predict which proteins interact with our favorite molecule. For each potential partner:

• Evaluate all possible relative positions and orientations

• Allow for structural rearrangements

• Measure energy of interaction

This approach would be extremely slow. It's also prone to false positives.

Why?

Courtesy of Nurcan Tuncbag Used with permission

8

Reducing the search space

• Efficiently choose potential partners before structural comparisons

• Use prior knowledge of interfaces to focus analysis on particular residues

9

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.

Tuncbag N, Keskin O, Nussinov R, Gursoy A.

http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.

Zhang et al. http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

10

PRISM's Rationale

There is a limited number of protein "architectures".
Protein structures can interact via similar architectural motifs even if the overall structures differ.
Find particular surface regions of proteins that are spatially similar to the complementary partners of a known interface.

N Tuncbag 11

PRISM's Rationale

• Two components
– rigid-body structural comparisons of target proteins to known template protein-protein interfaces
– flexible refinement using a docking energy function

• Evaluate using structural similarity and evolutionary conservation of putative binding residue hot spots

N Tuncbag 12

Subtilisin and its inhibitors

Although global folds of Subtilisin's partners are very different, binding regions are structurally very conserved.

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 13

Hotspots

Figure from Clackson amp Wells (1995)

copy American Association for the Advancement of Science All rights reserved This content is excludedfrom our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Clackson Tim and James A Wells A Hot Spot of Binding Energy in a Hormone-ReceptorInterface Science 267 no 5196 (1995) 383-6

14

Hotspots

Figure from Clackson amp Wells (1995)

copy American Association for the Advancement of Science All rights reserved This content is excludedfrom our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Clackson Tim and James A Wells A Hot Spot of Binding Energy in a Hormone-ReceptorInterface Science 267 no 5196 (1995) 383-6

15

• Fewer than 10% of the residues at an interface contribute more than 2 kcal/mol to binding

• Hot spots
– rich in Trp, Arg and Tyr
– occur on pockets on the two proteins that have complementary shapes and distributions of charged and hydrophobic residues
– can include buried charged residues far from solvent
– O-ring structure excludes solvent from interface

http://onlinelibrary.wiley.com/doi/10.1002/prot.21396/full

16

Ikb

NFkB

N Tuncbag

1 Identify interface of template (distance cutoff)

Courtesy of Nurcan Tuncbag Used with permission 17


Ikb

NFkB

N Tuncbag

ASPP2

1. Identify interface of template (distance cutoff)

2. Align entire surface of query to half-interfaces

3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots

Courtesy of Nurcan Tuncbag Used with permission 20

NFkB

N Tuncbag

ASPP2

1. Identify interface of template (distance cutoff)

2. Align entire surface of query to half-interfaces

3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots

4. Flexible refinement
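Step 1 of the procedure above (identifying the template interface with a distance cutoff) can be sketched as follows. This is a minimal illustration, not PRISM's implementation: the dictionary-of-coordinates input format, the function name, and the 5 Å cutoff are all assumptions chosen for the example.

```python
import math


def interface_residues(chain_a, chain_b, cutoff=5.0):
    """Return the residues of each chain that have at least one atom
    within `cutoff` (same units as the coordinates) of the other chain.

    chain_a, chain_b: dict mapping residue id -> list of (x, y, z) atoms.
    The input format and the default cutoff are hypothetical.
    """
    iface_a, iface_b = set(), set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            # A residue pair is in contact if any atom pair is close enough.
            if any(math.dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
                iface_a.add(res_a)
                iface_b.add(res_b)
    return iface_a, iface_b


# Toy coordinates: only A1 and B1 are close to each other.
chain_a = {"A1": [(0.0, 0.0, 0.0)], "A2": [(20.0, 0.0, 0.0)]}
chain_b = {"B1": [(3.0, 0.0, 0.0)], "B2": [(40.0, 0.0, 0.0)]}
iface_a, iface_b = interface_residues(chain_a, chain_b)
print(iface_a, iface_b)
```

The all-against-all loop is quadratic in the number of residues; a spatial index would be the usual replacement for real structures.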

Courtesy of Nurcan Tuncbag Used with permission 21

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Tuncbag Nurcan Attila Gursoy et al Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary andStructural Similarities at Interfaces using PRISM Nature Protocols 6 no 9 (2011) 1341-54

22

Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

∆Gcalc= - 2954 kcalmol ∆Gcalc= - 3751 kcalmol ∆Gcalc= - 4406 kcalmol

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 23

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.

Tuncbag N, Keskin O, Nussinov R, Gursoy A.

http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.

Zhang et al. http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

PrePPI scores potential templates without building a homology model.
Criteria:

• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

25

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

26


NA

NA

NB

NB

1 Identify interacting residues in template complex (Called NA1 NB3 in rest of paper)

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

31

NA

NA

NB

NB

1 Identify interacting residues in template complex (Called NA1 NB3 in rest of paper)

2 Predict interacting residues for the homology models

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

32

NA

NA

NB

NB

Evaluate based on five measures

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

33

NA

NA

NB

NB

Evaluate based on five measures
• SIM: structural similarity of NA/MA and NB/MB

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 34

Evaluate based on five measures
• SIM: structural similarity of NA/MA and NB/MB
• SIZ (number), COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

35

"The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces)."

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 36

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

37

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

We will examine Bayesian classifiers soon

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

38

Outline

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

39

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin A-C et al. Nature 415, 141-147 (2002); Ho Y et al. Nature 415, 180-183 (2002)

Detecting protein-protein interactions

What are the likely false positives? What are the likely false negatives?

40

Courtesy of Macmillan Publishers LimitedUsed with permissionSource Kumar Anuj and Michael Snyder Proteomics Protein Complexes take the Bait Nature 415 no 6868 (2002) 123-4 40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions

• Proteins are in their correct subcellular location

Limitations:

41

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions

• Proteins are in their correct subcellular location

Limitations:
• overexpression/tagging can influence results
• only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels)
Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited Used with permissionSource Gavin Anne-Claude Markus Boumlsche et al Functional Organization of the Yeast Proteome bySystematic Analysis of Protein Complexes Nature 415 no 6868 (2002) 141-7

43

Biotechniques. 2008 Apr;44(5):655-62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

Yeast two-hybrid

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of Cell Signaling Pathwaysusing the Evolving Yeast Two-hybrid System Biotechniques 44 no 5 (2008) 655

44

• Does not require purification – will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Biotechniques. 2008 Apr;44(5):655-62. Ratushny V, Golemis E.

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of CellSignaling Pathways using the Evolving Yeast Two-hybrid System Biotechniques44 no 5 (2008) 655

45

Outline

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard and Experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

47

Error Rates

Gold standard

Experiment 1

Experiment 2

True positives

from gold standard

48

Data Integration

Gold standard

Experiment 1

Experiment 2

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

49

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Mix of true and false

positives

III

50

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Define IV =

true positives V =

false positives

IV true

V false

III

51

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

IV true

V false

III

52

Estimated Error Rates

False positives

True positives

How complete are current yeast and human protein-interaction networks? G Traver Hart, Arun K Ramani and Edward M Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

I=

II=

III=

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{I}{II} = \frac{III}{IV}$$
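The proportionality assumption can be turned into a one-line estimate of region IV. The sketch below only rearranges the slide's relation I/II = III/IV; the region counts are made-up placeholders, not values from Hart et al.

```python
def estimate_region_iv(region_i, region_ii, region_iii):
    """Solve I/II = III/IV for IV, the unobserved true-positive count.

    Region names follow the slide's Venn diagram; the inputs are counts
    of protein pairs in regions I, II and III.
    """
    if region_i <= 0:
        raise ValueError("region I must be positive to form the ratio I/II")
    return region_iii * region_ii / region_i


# Hypothetical counts, for illustration only.
region_i, region_ii, region_iii = 50, 100, 30
region_iv = estimate_region_iv(region_i, region_ii, region_iii)
print(f"estimated IV = {region_iv:.0f}")
```

Whatever remains of the data set's unique pairs after subtracting the estimated true positives is attributed to false positives (region V).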

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by > 1 method

• Filter out "sticky" proteins

• Estimate probability of each interaction based on data

• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note log-log plot Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scale DataSets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

56

Finding real interactions

• Estimate probability of each interaction based on data
– How can we compute $P(\mathrm{real\_PPI}\mid\mathrm{Data})$?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

Bayes Rule

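$$P(\mathrm{true\_PPI}\mid\mathrm{Data}) = \frac{P(\mathrm{true\_PPI})\,P(\mathrm{Data}\mid\mathrm{true\_PPI})}{P(\mathrm{Data})}$$

Left-hand side: posterior. Numerator: prior × likelihood.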

58

Naiumlve Bayes Classification

posterior prior likelihood

likelihood ratio = ratio of posterior probabilities

if > 1 classify as true, if < 1 classify as false

How do we compute this?

59

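likelihood ratio (ratio of posterior probabilities): if > 1 classify as true, if < 1 classify as false

log likelihood ratio = $\log\frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log\frac{P(\mathrm{Data}\mid\mathrm{true\_PPI})}{P(\mathrm{Data}\mid\mathrm{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking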

60

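likelihood ratio (ratio of posterior probabilities): if > 1 classify as true, if < 1 classify as false

log likelihood ratio = $\log\frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log\frac{P(\mathrm{Data}\mid\mathrm{true\_PPI})}{P(\mathrm{Data}\mid\mathrm{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = $\log\frac{P(\mathrm{Data}\mid\mathrm{true\_PPI})}{P(\mathrm{Data}\mid\mathrm{false\_PPI})}$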

61

We assume the observations are independent (we'll see how to handle dependence soon)

Ranking function =

$$\log\frac{P(\mathrm{Data}\mid\mathrm{true\_PPI})}{P(\mathrm{Data}\mid\mathrm{false\_PPI})}$$

62

We assume the observations are independent (we'll see how to handle dependence soon)

Ranking function =

$$\log\frac{P(\mathrm{Data}\mid\mathrm{true\_PPI})}{P(\mathrm{Data}\mid\mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i\mid\mathrm{true\_PPI})}{P(\mathrm{Observation}_i\mid\mathrm{false\_PPI})}$$

63

We assume the observations are independent (we'll see how to handle dependence soon).
We can compute these terms if we have a set of high-confidence positive and negative interactions.
Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec see Collins et al. Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function =

$$\log\frac{P(\mathrm{Data}\mid\mathrm{true\_PPI})}{P(\mathrm{Data}\mid\mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i\mid\mathrm{true\_PPI})}{P(\mathrm{Observation}_i\mid\mathrm{false\_PPI})}$$
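Under the independence assumption, the log of the product becomes a sum of per-observation log ratios, which is straightforward to compute. The following is a minimal sketch: the assay names and every conditional probability are invented for illustration and would in practice be estimated from high-confidence positive and negative sets.

```python
import math


def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Naive Bayes ranking score: sum over observations of
    log[P(obs | true_PPI) / P(obs | false_PPI)]."""
    return sum(
        math.log(p_given_true[name][value] / p_given_false[name][value])
        for name, value in observations.items()
    )


# Hypothetical conditional probabilities P(assay outcome | class).
p_given_true = {
    "two_hybrid": {True: 0.30, False: 0.70},
    "mass_spec": {True: 0.60, False: 0.40},
}
p_given_false = {
    "two_hybrid": {True: 0.01, False: 0.99},
    "mass_spec": {True: 0.05, False: 0.95},
}

# Hypothetical protein pairs and which assays detected them.
pairs = {
    "pair_1": {"two_hybrid": True, "mass_spec": True},
    "pair_2": {"two_hybrid": False, "mass_spec": True},
    "pair_3": {"two_hybrid": False, "mass_spec": False},
}

scores = {
    name: log_likelihood_ratio(obs, p_given_true, p_given_false)
    for name, obs in pairs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A pair detected by only one assay still receives a score, so it can be ranked rather than discarded outright.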

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al. Mol Cell Proteomics 2007

ROC curve

x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

True positives = True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al. Mol Cell Proteomics 2007

ROC curve

x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)
True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression
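The two axes of the ROC curve can be computed by sweeping a score cutoff over a labeled set. This is a generic sketch, not the procedure of Collins et al.: the scores and labels are invented, and ties between scores are not handled specially.

```python
def roc_points(scores, labels):
    """Return (FP/(TN+FP), TP/(TP+FN)) after each successively lower cutoff.

    scores: confidence score per protein pair (higher = more confident).
    labels: 1 for a high-confidence positive, 0 for a high-confidence negative.
    Assumes distinct scores and at least one example of each class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _score, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores and gold-standard labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each point corresponds to accepting every pair scored at or above one cutoff, so moving along the curve trades false positives against recovered true positives.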

68 67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al. Mol Cell Proteomics 2007

9074 interactions

ROC curve

x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

Literature curated (excluding 2-hybrid)

Error rate equivalent to literature curated

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al. Mol Cell Proteomics 2007

9074 interactions

ROC curve

x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

• Bayesian Networks are a tool for reasoning with probabilities

• Consist of a graph (network) and a set of probabilities

• These can be "learned" from the data

Bayesian Networks

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown

• N binary variables = $2^N$ states
– only one constraint (sum of all probabilities = 1)

=> $2^N - 1$ parameters
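The growth in parameters is easy to tabulate. The sketch below compares the full joint table with a naive Bayes structure (one binary cause, each binary observation depending only on that cause); the naive Bayes count of 1 + 2n is a standard result stated here for comparison, and the function names are illustrative.

```python
def full_joint_parameters(n_variables):
    """Free parameters in a full joint table over binary variables:
    2**N states, minus one because the probabilities sum to 1."""
    return 2 ** n_variables - 1


def naive_bayes_parameters(n_observations):
    """Free parameters with one binary cause and n binary observations
    that are conditionally independent given the cause:
    1 for the prior, plus 2 per observation (one per state of the cause)."""
    return 1 + 2 * n_observations


for n_obs in (4, 10, 20):
    total_vars = n_obs + 1  # observations plus the hidden cause
    print(n_obs, full_joint_parameters(total_vars), naive_bayes_parameters(n_obs))
```

With four observations the gap is modest (31 versus 9), but the full table doubles with every added variable while the factored model grows linearly.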

77

Graphical Structure Expresses our Beliefs

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed)

78

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naiumlve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

bull The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents.
For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children)

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents)

You meet an admitted student. Is she smart?

Admit

83

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

[Network: Smart → Grades]

Formulations are equivalent and both require the same number of constraints

85
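The equivalence of the two formulations can be verified numerically by multiplying the prior by the conditional table. The sketch below uses the slide's numbers with decimal points restored; the final inference step (P(S=T | G=T)) is an added illustration of how the joint table is used.

```python
import math

# Prior on Smart and conditional table P(Grades | Smart), as on the slide.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint probability P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

# Inference: probability of Smart given grades above threshold.
p_g_true = joint[(False, True)] + joint[(True, True)]
p_smart_given_good_grades = joint[(True, True)] / p_g_true

for (s, g), p in joint.items():
    print(f"S={s!s:5} G={g!s:5} P={p:.3f}")
print(f"P(S=T | G=T) = {p_smart_given_good_grades:.3f}")
```

Both forms carry three free numbers: the joint table has four entries that sum to 1, and the factored form has one prior plus one conditional per value of S.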

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

Admit

86

"Explaining Away"

[Figure: network with nodes Season, Sprinkler (S), Rain (R), Grass wet, and Slippery.]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, once we have seen that the grass is wet, the knowledge that it is raining influences our belief that the sprinkler is on.

87

"Explaining Away"

[Figure: network in which Coin 1 and Coin 2 are both parents of Score.]

p(C1=H) = p(C1=T) = 0.5    p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.

88
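The two-coin example is small enough to check by brute-force enumeration. The following is a minimal sketch (the helper `prob_c1_heads` is a name introduced here for illustration) that lists the four equally likely worlds and conditions on whatever evidence is supplied.

```python
from itertools import product

# Two fair, independent coins; Score is 1 when they match and 0 otherwise.
worlds = [(c1, c2, int(c1 == c2), 0.25) for c1, c2 in product("HT", repeat=2)]

def prob_c1_heads(**evidence):
    """P(C1 = H | evidence), summing over the worlds consistent with the evidence."""
    keep = [(c1, p) for c1, c2, score, p in worlds
            if evidence.get("c2", c2) == c2
            and evidence.get("score", score) == score]
    total = sum(p for _, p in keep)
    return sum(p for c1, p in keep if c1 == "H") / total

print(prob_c1_heads())                   # no evidence
print(prob_c1_heads(c2="T"))             # C2 alone is uninformative about C1
print(prob_c1_heads(score=1))            # Score alone is uninformative too
print(prob_c1_heads(score=1, c2="T"))    # together they pin C1 down
```

Knowing C2 by itself leaves the belief in C1 unchanged; it only becomes informative once the common child, Score, is observed.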

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure. How do we find the parameters?
• Define an objective function and search for parameters that optimize this function.

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

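Learning Models from Data

[Figure: Coin 1 and Coin 2 are parents of Score. The probabilities p(C1=H), p(C1=T), p(C2=H), p(C2=T) and the Score column of the table are left blank, to be learned from data.]

C1   C2   Score
H    H
T    T
H    T
T    H

D = (C1, C2, S) = (H,T,0), (H,H,1), …

92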
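When the structure is known and every variable is observed, the maximum-likelihood parameters of a discrete network are just relative frequencies. The sketch below illustrates this for the coin network; only the first two observations come from the slide, and the rest of `data` is made up for illustration.

```python
from collections import Counter

# Observations D = (C1, C2, Score). The first two tuples appear on the slide;
# the remaining ones are hypothetical.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("H", "T", 0),
        ("T", "H", 0), ("H", "H", 1), ("T", "T", 1), ("H", "H", 1)]

n = len(data)

# Root nodes: the maximum-likelihood estimate is the observed frequency.
p_c1_heads = sum(c1 == "H" for c1, _, _ in data) / n
p_c2_heads = sum(c2 == "H" for _, c2, _ in data) / n

# Child node: estimate P(Score = 1 | C1, C2) separately for each parent setting.
seen = Counter((c1, c2) for c1, c2, _ in data)
ones = Counter((c1, c2) for c1, c2, s in data if s == 1)
p_score1 = {parents: ones[parents] / count for parents, count in seen.items()}

print(p_c1_heads, p_c2_heads)
print(p_score1)
```

With missing or hidden variables the counts are no longer available directly, which is where iterative methods such as EM or Gibbs sampling come in.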

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/

A worked ldquotoyrdquo example

Best to work through on your own

95

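Chain Rule of Probability for Bayes Nets

[Figure: network with S (Smart) → G (Grades), S → R (GREs), G → A (Admit), and R → A.]

\[ P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R) \]

\[ = P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R) \]

(Why? Because of the conditional independence assumptions.)

Recall: P(X,Y) = P(X|Y) P(Y)

96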

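Prediction with Bayes Nets

[Figure: network with S (Smart) → G (Grades), S → R (GREs), G → A (Admit), and R → A.]

S:  P(S=F) = 0.5    P(S=T) = 0.5

G:  S    P(G=F)    P(G=T)
    F    0.9       0.1
    T    0.8       0.2

R:  S    P(R=F)    P(R=T)
    F    0.8       0.2
    T    0.2       0.8

A:  G    R    P(A=F)    P(A=T)
    F    F    0.9       0.1
    F    T    0.8       0.2
    T    F    0.5       0.5
    T    T    0.2       0.8

\[ P(A=T \mid S=T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S=T)\,P(R \mid S=T)\,P(A=T \mid G,R) \]

\[ = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) \approx 0.29 \]

The four terms correspond to (G,R) = (F,F), (F,T), (T,F), (T,T).

Rough grading: being smart doesn't help very much.

GREs are better correlated with intelligence than the grades.

97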
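The sum over G and R can be written as a short loop. This is a minimal sketch using the conditional probability tables as printed on this slide; the dictionary layout and the helper names `bern` and `p_admit_given_smart` are choices made here for illustration.

```python
from itertools import product

F, T = False, True

# Tables from the slide, stored as P(child = T | parents).
p_g_true = {F: 0.1, T: 0.2}                 # P(G=T | S)
p_r_true = {F: 0.2, T: 0.8}                 # P(R=T | S)
p_a_true = {(F, F): 0.1, (F, T): 0.2,
            (T, F): 0.5, (T, T): 0.8}       # P(A=T | G, R)

def bern(p_true, value):
    """Probability of a binary value, given the probability that it is True."""
    return p_true if value else 1.0 - p_true

def p_admit_given_smart(s):
    """P(A=T | S=s): sum the product of the table entries over all G and R."""
    return sum(bern(p_g_true[s], g) * bern(p_r_true[s], r) * p_a_true[(g, r)]
               for g, r in product((F, T), repeat=2))

print(round(p_admit_given_smart(T), 3))
print(round(p_admit_given_smart(F), 3))
```

The same function evaluated at S=F gives the second conditional probability needed for the inference calculation on the following slide.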

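Inference with Bayes Nets

[Figure: network with S (Smart) → G (Grades), S → R (GREs), G → A (Admit), and R → A.]

\[ P(A=T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=T \mid G,R) \]

\[ = P(S=T)\,P(A=T \mid S=T) + P(S=F)\,P(A=T \mid S=F) \approx 0.23 \]

Here P(S=T) = P(S=F) = 0.5, P(A=T|S=T) ≈ 0.29 was calculated on the previous slide, and P(A=T|S=F) ≈ 0.16 is calculated analogously.

\[ P(S=T \mid A=T) = \frac{P(S=T, A=T)}{P(A=T)} \]

Or, using Bayes' rule:

\[ P(S=T \mid A=T) = \frac{P(S=T)\,P(A=T \mid S=T)}{P(A=T)} \]

98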

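Inference with Bayes Nets

[Figure: network with S (Smart) → G (Grades), S → R (GREs), G → A (Admit), and R → A.]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

\[ P(R=F \mid A=F) = \frac{P(R=F, A=F)}{P(A=F)} = \frac{\sum_{S} \sum_{G} P(S)\,P(G \mid S)\,P(R=F \mid S)\,P(A=F \mid G,R=F)}{P(A=F)} \]

\[ P(A=F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=F \mid G,R) \quad \text{(as before)} \]

Tedious but straightforward to compute:

\[ \frac{P(G=F \mid A=F)}{P(R=F \mid A=F)} \approx \frac{0.94}{0.55} \approx 1.7 \]

99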
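The "tedious but straightforward" sums are easy to delegate to a brute-force enumeration over all 16 assignments of (S, G, R, A). The sketch below uses the tables as printed in the worked example; the helpers `joint`, `prob`, and `conditional` are names introduced here, not part of the original slides.

```python
from itertools import product

F, T = False, True

# Tables from the worked example, stored as P(node = T | parents).
p_s_true = 0.5
p_g_true = {F: 0.1, T: 0.2}                                        # P(G=T | S)
p_r_true = {F: 0.2, T: 0.8}                                        # P(R=T | S)
p_a_true = {(F, F): 0.1, (F, T): 0.2, (T, F): 0.5, (T, T): 0.8}    # P(A=T | G, R)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bern(p_s_true, s) * bern(p_g_true[s], g)
            * bern(p_r_true[s], r) * bern(p_a_true[(g, r)], a))

def prob(**fixed):
    """Marginal probability of a partial assignment, e.g. prob(a=T, s=T)."""
    total = 0.0
    for s, g, r, a in product((F, T), repeat=4):
        world = {"s": s, "g": g, "r": r, "a": a}
        if all(world[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total

def conditional(query, given):
    """P(query | given) = P(query, given) / P(given)."""
    return prob(**query, **given) / prob(**given)

p_admit = prob(a=T)
p_smart_given_admit = conditional({"s": T}, {"a": T})
p_bad_grades = conditional({"g": F}, {"a": F})
p_bad_gres = conditional({"r": F}, {"a": F})
print(p_admit, p_smart_given_admit, p_bad_grades, p_bad_gres)
```

Enumeration scales exponentially with the number of nodes, so it is only practical for toy networks like this one.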

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Figure: the cause "P1-P2 REAL" (often "hidden") with the observed effects "Detected by X1" … "Detected by Xn".]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56. © American Society for Biochemistry and Molecular Biology. All rights reserved.

102

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns, one of them labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold-standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved.

108

109

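\[ \text{likelihood ratio} = \frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{true\_PPI})\,P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{false\_PPI})\,P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

If > 1, classify as true; if < 1, classify as false.

\[ \text{log likelihood ratio} = \log \frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

The prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

\[ \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})} \]

110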

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

\[ \text{Likelihood} = L = \frac{P(f \mid \mathrm{pos})}{P(f \mid \mathrm{neg})} = \frac{1114 / 2150}{81924 / 573734} \]

111
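As a quick check on the arithmetic, the likelihood ratio for one feature value can be computed directly from gold-standard counts. This sketch assumes the numbers on the slide read as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives sharing the feature value; the function name is ours.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), each estimated as a fraction of a gold-standard set."""
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as read from the slide (assumed grouping of the digits).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))
```

A value above 1 means the feature value is enriched among gold-standard positives, so observing it raises the odds that a pair truly interacts.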

112


113

Fully connected rarr Compute probabilities for all 16 possible combinations


114

Interpret with caution as numbers are small


115

TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

\[ \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved.

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Predicting Structures of Complexes

• Can we use structural data to predict complexes?
• This might be easier than quantitative predictions for site mutants.
• But it requires us to solve a docking problem.

© Source unknown. All rights reserved.

6

Docking

N Tuncbag

Which surface(s) of protein A interact with which surface(s) of protein B?

Courtesy of Nurcan Tuncbag Used with permission

7

Docking

Imagine we wanted to predict which proteins interact with our favorite molecule. For each potential partner:

• Evaluate all possible relative positions and orientations
• Allow for structural rearrangements
• Measure energy of interaction

This approach would be extremely slow. It's also prone to false positives.

Why?

Courtesy of Nurcan Tuncbag. Used with permission.

8

Reducing the search space

• Efficiently choose potential partners before structural comparisons
• Use prior knowledge of interfaces to focus analysis on particular residues

9

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.
Tuncbag N, Keskin O, Nussinov R, Gursoy A.
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

10

PRISM's Rationale

There are a limited number of protein "architectures". Protein structures can interact via similar architectural motifs even if the overall structures differ. Find particular surface regions of proteins that are spatially similar to the complementary partners of a known interface.

N. Tuncbag

11

PRISM's Rationale

• Two components
  – rigid-body structural comparisons of target proteins to known template protein-protein interfaces
  – flexible refinement using a docking energy function
• Evaluate using structural similarity and evolutionary conservation of putative binding-residue hot spots

N. Tuncbag

12

Subtilisin and its inhibitors

Although the global folds of subtilisin's partners are very different, the binding regions are structurally very conserved.

N. Tuncbag. Courtesy of Nurcan Tuncbag. Used with permission.

13

Hotspots

Figure from Clackson & Wells (1995)

Source: Clackson, Tim, and James A. Wells. "A Hot Spot of Binding Energy in a Hormone-Receptor Interface." Science 267, no. 5196 (1995): 383-6. © American Association for the Advancement of Science. All rights reserved.

14


• Fewer than 10% of the residues at an interface contribute more than 2 kcal/mol to binding
• Hot spots
  – are rich in Trp, Arg, and Tyr
  – occur on pockets on the two proteins that have complementary shapes and distributions of charged and hydrophobic residues
  – can include buried charged residues far from solvent
  – O-ring structure excludes solvent from the interface

http://onlinelibrary.wiley.com/doi/10.1002/prot.21396/full

16

[Figure: template structure with IκB and NF-κB labeled.]

N. Tuncbag

1. Identify interface of template (distance cutoff)

Courtesy of Nurcan Tuncbag. Used with permission.

17


[Figure: IκB, NF-κB, and ASPP2.]

N. Tuncbag

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement

Courtesy of Nurcan Tuncbag. Used with permission.

21
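Step 1 above, identifying the template interface with a distance cutoff, can be sketched in a few lines. This is only an illustration of the idea, not PRISM's implementation: the data layout, the function name `interface_residues`, the 6 Å default cutoff, and the toy coordinates are all assumptions made here.

```python
from math import dist

def interface_residues(chain_a, chain_b, cutoff=6.0):
    """Return the residue ids on each chain that have an atom within `cutoff`
    (in angstroms) of any atom on the other chain.

    Each chain is a dict mapping residue id -> list of (x, y, z) atom coordinates.
    """
    near_a, near_b = set(), set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            if any(dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
                near_a.add(res_a)
                near_b.add(res_b)
    return near_a, near_b

# Toy coordinates (hypothetical, not taken from any real structure).
chain_a = {"A10": [(0.0, 0.0, 0.0)],
           "A11": [(3.0, 0.0, 0.0)],
           "A50": [(40.0, 0.0, 0.0)]}
chain_b = {"B5": [(5.0, 0.0, 0.0)],
           "B90": [(0.0, 60.0, 0.0)]}

side_a, side_b = interface_residues(chain_a, chain_b)
print(sorted(side_a), sorted(side_b))
```

The two returned sets play the role of the "half-interfaces" that a query surface would then be aligned against in step 2. For real structures a spatial index would replace the all-against-all loop.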

Flowchart

Structural match of template and target does not depend on the order of residues.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Tuncbag, Nurcan, Attila Gursoy, et al. "Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary and Structural Similarities at Interfaces using PRISM." Nature Protocols 6, no. 9 (2011): 1341-54.

22

Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

∆Gcalc= - 2954 kcalmol ∆Gcalc= - 3751 kcalmol ∆Gcalc= - 4406 kcalmol

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 23


Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012), doi:10.1038/nature11503

PrePPI scores potential templates without building a homology model. Criteria:

• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

25

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012), doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

26


[Figure: structures labeled NA and NB.]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

© Source unknown. All rights reserved.

32

[Figure: structures labeled NA and NB.]

Evaluate based on five measures:
• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number) and COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

© Source unknown. All rights reserved.

35

"The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces)."

© Source unknown. All rights reserved.

36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012), doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

We will examine Bayesian classifiers soon.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Kumar, Anuj, and Michael Snyder. "Proteomics: Protein complexes take the bait." Nature 415, 123-124 (10 January 2002), doi:10.1038/415123a

Gavin, A.-C., et al. Nature 415, 141-147 (2002). Ho, Y., et al. Nature 415, 180-183 (2002).

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations:
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels). Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5):655-62.

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification, so it will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Figure: overlapping sets "Gold standard" and "Experiment", with regions labeled "False negatives", "True positives", and "True positives and false positives".]

47

Error Rates

[Figure: overlapping sets "Gold standard", "Experiment 1", and "Experiment 2", with a region labeled "True positives from gold standard".]

48

Data Integration

[Figure: overlapping sets "Gold standard", "Experiment 1", and "Experiment 2", with regions labeled I to V.]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

III = interactions unique to one experiment that are present in the gold standard. The remaining unique interactions are a mix of true and false positives.

Define IV = true positives, V = false positives.

Assume I/II = III/IV.

52

Estimated Error Rates

[Figure: estimated numbers of false positives and true positives, with regions I, II, and III marked.]

Hart, G. Traver, Arun K. Ramani, and Edward M. Marcotte. "How complete are current yeast and human protein-interaction networks?" Genome Biology 2006, 7:120, doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

\[ \frac{\mathrm{I}}{\mathrm{II}} = \frac{\mathrm{III}}{\mathrm{IV}} \]

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.

53
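Rearranging the assumption I/II = III/IV gives an estimate of the true positives hidden in the unique data, and the false positives follow by subtraction. The sketch below assumes that regions IV and V together make up all interactions unique to one experiment; the function name and the example counts are hypothetical.

```python
def estimate_unique_true_positives(n_I, n_II, n_III, n_unique):
    """Estimate regions IV and V from the assumption I/II = III/IV.

    n_I      : consensus interactions that are also in the gold standard
    n_II     : consensus interactions (assumed to be all true)
    n_III    : interactions unique to one experiment that are in the gold standard
    n_unique : all interactions unique to one experiment (assumed to be IV + V)
    """
    n_IV = n_III * n_II / n_I        # I/II = III/IV  =>  IV = III * II / I
    n_V = n_unique - n_IV            # everything else is counted as false positive
    return n_IV, n_V, n_V / n_unique

# Hypothetical counts, for illustration only.
n_IV, n_V, false_positive_rate = estimate_unique_true_positives(200, 1000, 300, 5000)
print(n_IV, n_V, false_positive_rate)
```

The estimate is only as good as the assumption that the gold standard samples the consensus and unique data without bias.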

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

von Mering et al. "Comparative assessment of large-scale data sets of protein–protein interactions." Nature 417, 399-403 (23 May 2002), doi:10.1038/nature750

Note: log-log plot.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute \( P(\mathrm{real\_PPI} \mid \mathrm{Data}) \)?

\[ \mathrm{Data} = (\text{two\_hybrid},\ \text{mass\_spec},\ \text{co\_evolution},\ \text{co\_expression}) \]

or

\[ \mathrm{Data} = (\text{mass\_spec\_expt\_1},\ \text{mass\_spec\_expt\_2}) \]

57

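Bayes Rule

\[ \underbrace{P(\mathrm{true\_PPI} \mid \mathrm{Data})}_{\text{posterior}} = \frac{\overbrace{P(\mathrm{true\_PPI})}^{\text{prior}}\ \overbrace{P(\mathrm{Data} \mid \mathrm{true\_PPI})}^{\text{likelihood}}}{P(\mathrm{Data})} \]

58

Naïve Bayes Classification

posterior = prior × likelihood / P(Data)

likelihood ratio = ratio of posterior probabilities:

\[ \frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} \]

If > 1, classify as true; if < 1, classify as false.

How do we compute this?

59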

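\[ \text{likelihood ratio} = \frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{true\_PPI})\,P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{false\_PPI})\,P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

If > 1, classify as true; if < 1, classify as false.

\[ \text{log likelihood ratio} = \log \frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

The prior probability is the same for all interactions, so it does not affect ranking.

\[ \text{Ranking function} = \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

61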

We assume the observations are independent (we'll see how to handle dependence soon).

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol. Cell. Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function:

\[ \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})} \]

64
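Under the independence assumption, the log of the product becomes a sum of per-observation log ratios, which is simple to compute once the per-assay probabilities have been estimated from gold-standard sets. The sketch below is an illustration only: the assay names, the probabilities in `assays`, and the protein pairs are all hypothetical.

```python
from math import log

# Hypothetical per-assay estimates from gold-standard sets:
# (P(detected | true_PPI), P(detected | false_PPI)).
assays = {
    "two_hybrid": (0.20, 0.010),
    "mass_spec":  (0.40, 0.050),
    "co_express": (0.60, 0.300),
}

def log_likelihood_ratio(observations):
    """Sum of per-observation log ratios (naive Bayes independence assumption).

    `observations` maps assay name -> True (detected) or False (not detected).
    Assays that are missing from the dict contribute nothing to the score.
    """
    score = 0.0
    for name, detected in observations.items():
        p_true, p_false = assays[name]
        if not detected:
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        score += log(p_true / p_false)
    return score

pairs = {
    "P1-P2": {"two_hybrid": True, "mass_spec": True, "co_express": True},
    "P3-P4": {"two_hybrid": False, "mass_spec": True},
    "P5-P6": {"two_hybrid": False, "mass_spec": False, "co_express": False},
}
ranked = sorted(pairs, key=lambda k: log_likelihood_ratio(pairs[k]), reverse=True)
print(ranked)
```

Skipping assays with no measurement is one simple way to accommodate missing data; a pair does not need to be detected in every assay to rank highly.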

von Mering et al. "Comparative assessment of large-scale data sets of protein–protein interactions." Nature 417, 399-403 (23 May 2002), doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

65

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives =

True negatives =

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

66

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

68 67
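The two ROC axes can be computed directly from a scored list of pairs and the two gold-standard sets. The sketch below is a minimal illustration with made-up protein names and scores; it is not the procedure used by Collins et al., and pairs that appear in neither gold-standard set are simply skipped (an assumption).

```python
def roc_points(scored_pairs, positives, negatives):
    """Return (FP/(TN+FP), TP/(TP+FN)) after each cutoff, best score first."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for pair, _score in ranked:
        if pair in positives:
            tp += 1
        elif pair in negatives:
            fp += 1
        else:
            continue  # not in either gold-standard set, so it is not scored
        # Gold-standard pairs below the cutoff count as FN (positives) or TN (negatives).
        points.append((fp / len(negatives), tp / len(positives)))
    return points

# Hypothetical gold-standard sets and scores.
positives = {("A", "B"), ("C", "D")}
negatives = {("A", "E"), ("B", "F")}
scored = [(("A", "B"), 3.1), (("A", "E"), 1.2), (("C", "D"), 0.4),
          (("X", "Y"), 0.2), (("B", "F"), -2.0)]
points = roc_points(scored, positives, negatives)
print(points)
```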


Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]

68


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

73

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1) ⇒ $2^N - 1$ parameters

77
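To see how quickly the full joint table grows, the short sketch below compares its parameter count with that of a naïve Bayes network in which one hidden binary node ("interaction is real") is the only parent of each binary detection node. The function names are illustrative, and the naïve Bayes count assumes exactly that structure.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters of a complete joint table over N binary variables."""
    return 2 ** n_binary_variables - 1

def naive_bayes_parameters(n_observations):
    """One prior, plus P(detected | real) and P(detected | not real) per assay."""
    return 1 + 2 * n_observations

for n_obs in (4, 10, 20):
    # The full joint covers the hidden node plus all observation nodes.
    print(n_obs, full_joint_parameters(n_obs + 1), naive_bayes_parameters(n_obs))
```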

Graphical Structure Expresses our Beliefs

[Diagram: node "P1-P2 REAL" with arrows to nodes "Detected by X1", …, "Detected by Xn"]

Cause (often "hidden"): P1-P2 REAL

Effects (observed): Detected by X1, …, Detected by Xn

78

Graphical Structure Expresses our Beliefs

[Diagram, first network: "P1-P2 REAL" with nodes "Detected by X1", …, "Detected by Xn". Second network: "P1-P2 REAL" with nodes "Detected by X1", …, "Detected by Xj", "Detected by Xj+1", …, "Detected by Xn", plus additional nodes "Membrane" and "Highly expressed".]

Naïve Bayes assumes all observations are independent

But some observations may be coupled

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints
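The equivalence of the two tables can be checked numerically. The sketch below uses the slide's numbers; the variable names are illustrative. It builds the joint table from the prior and the conditional table, then recovers the prior and conditional from the joint.

```python
# Prior P(S) and conditional P(G | S), as on the slide (False = F, True = T).
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {False: {False: 0.95, True: 0.05},
               True: {False: 0.2, True: 0.8}}

# Conditional -> joint: P(S, G) = P(S) * P(G | S)
joint = {(s, g): p_s[s] * p_g_given_s[s][g]
         for s in (False, True) for g in (False, True)}

# Joint -> conditional: marginalize over G, then divide.
recovered_p_s = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}
recovered_p_g_true = {s: joint[(s, True)] / recovered_p_s[s] for s in (False, True)}

print(joint)
print(recovered_p_s, recovered_p_g_true)
```

Both forms carry three free numbers: the joint has four entries that sum to 1, and the conditional form has one prior value plus one value per row of the conditional table.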

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

"Explaining Away"

[Diagram nodes: Season; S: Sprinkler; R: Rain; Grass wet; Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

"Explaining Away"

[Diagram: Coin 1 and Coin 2 are both parents of Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.

88
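The coin example is small enough to enumerate. The sketch below lists the four equally likely coin outcomes, attaches the score from the table, and conditions on different pieces of evidence; the helper `prob` is an illustrative name, not part of any library.

```python
from itertools import product

# Each world is (C1, C2, Score); the two fair coins make all four equally likely.
worlds = {(c1, c2, int(c1 == c2)): 0.25 for c1, c2 in product("HT", repeat=2)}

def prob(event, given=lambda w: True):
    """P(event | given) by summing over the enumerated worlds."""
    den = sum(p for w, p in worlds.items() if given(w))
    num = sum(p for w, p in worlds.items() if given(w) and event(w))
    return num / den

c1_heads = lambda w: w[0] == "H"

p_no_evidence = prob(c1_heads)
p_given_c2 = prob(c1_heads, lambda w: w[1] == "T")
p_given_score = prob(c1_heads, lambda w: w[2] == 1)
p_given_score_and_c2 = prob(c1_heads, lambda w: w[2] == 1 and w[1] == "T")
print(p_no_evidence, p_given_c2, p_given_score, p_given_score_and_c2)
```

Without the score, learning C2 leaves the belief about C1 unchanged. Once the score is known, learning C2 fixes C1 completely.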

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data $X_i$
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

91

Learning Models from Data

[Diagram: Coin 1 and Coin 2 are both parents of Score]

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …

92
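When the structure is known and every variable is observed, the maximum-likelihood parameters of a discrete network are just relative frequencies. The sketch below fills in the blank tables for the coin network from a small, made-up data set D; the data values and function name are hypothetical.

```python
from collections import Counter

# Hypothetical observations (C1, C2, Score), in the format of D on the slide.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1),
        ("T", "H", 0), ("H", "H", 1), ("H", "T", 0)]

def ml_estimates(data):
    """Maximum-likelihood estimates by counting (fully observed data only)."""
    n = len(data)
    c1_counts = Counter(c1 for c1, _, _ in data)
    c2_counts = Counter(c2 for _, c2, _ in data)
    pair_counts = Counter((c1, c2) for c1, c2, _ in data)
    score_one = Counter((c1, c2) for c1, c2, s in data if s == 1)
    p_c1_h = c1_counts["H"] / n
    p_c2_h = c2_counts["H"] / n
    # P(Score=1 | C1, C2) for each parent configuration seen in the data.
    p_score1 = {pair: score_one[pair] / count for pair, count in pair_counts.items()}
    return p_c1_h, p_c2_h, p_score1

p_c1_h, p_c2_h, p_score1 = ml_estimates(data)
print(p_c1_h, p_c2_h, p_score1)
```

With hidden variables or missing values, counting no longer suffices and a search method such as EM is needed.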

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

A worked "toy" example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Diagram nodes: S: Smart; G: Grades; R: GREs; A: Admit]

$P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R)$

$= P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R)$ (why? because of the conditional independence assumption)

Recall: $P(X,Y) = P(X \mid Y)\,P(Y)$

96

Prediction with Bayes Nets

[Diagram nodes: S: Smart; G: Grades; R: GREs; A: Admit]

S:
P(S=F)   P(S=T)
0.5      0.5

G given S (rough grading: being smart doesn't help very much):
S   P(G=F)   P(G=T)
F   0.9      0.1
T   0.8      0.2

R given S (GREs are better correlated with intelligence than the grades):
S   P(R=F)   P(R=T)
F   0.8      0.2
T   0.2      0.8

A given G and R:
G   R   P(A=F)   P(A=T)
F   F   0.9      0.1
F   T   0.8      0.2
T   F   0.5      0.5
T   T   0.2      0.8

$P(A=T \mid S=T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S=T)\,P(R \mid S=T)\,P(A=T \mid G,R)$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.292 ≈ 0.29

(terms for (G,R) = (F,F), (F,T), (T,F), (T,T))

97

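Inference with Bayes Nets

[Diagram nodes: S: Smart; G: Grades; R: GREs; A: Admit]

$P(A=T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=T \mid G,R)$

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F)

P(S=T) = 0.5, P(S=F) = 0.5

P(A=T|S=T) ≈ 0.29 (calculated on the previous slide)

P(A=T|S=F), calculated analogously: (0.9)(0.8)(0.1) + (0.9)(0.2)(0.2) + (0.1)(0.8)(0.5) + (0.1)(0.2)(0.8) = 0.164 ≈ 0.16

P(A=T) = (0.5)(0.292) + (0.5)(0.164) = 0.228 ≈ 0.23

P(S=T|A=T) = P(S=T, A=T) / P(A=T)

Or, using Bayes Rule: = P(S=T) P(A=T|S=T) / P(A=T)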

98

Inference with Bayes Nets

[Diagram nodes: S: Smart; G: Grades; R: GREs; A: Admit]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$P(R=F \mid A=F) = \frac{P(R=F, A=F)}{P(A=F)} = \frac{\sum_{S} \sum_{G} P(S)\,P(G \mid S)\,P(R=F \mid S)\,P(A=F \mid G, R=F)}{P(A=F)}$

$P(A=F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=F \mid G,R)$ (as before)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) ≈ 0.94 / 0.55 ≈ 1.7

99
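The "tedious but straightforward" sums can be delegated to a short enumeration. The sketch below encodes the four tables of the toy admissions network and computes the prediction and inference queries by brute force over all 16 assignments; the helper names `bern`, `joint`, and `prob` are illustrative.

```python
from itertools import product

P_S = {False: 0.5, True: 0.5}
P_G_T = {False: 0.1, True: 0.2}   # P(G=T | S)
P_R_T = {False: 0.2, True: 0.8}   # P(R=T | S)
P_A_T = {(False, False): 0.1, (False, True): 0.2,
         (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)

def bern(p_true, value):
    """Probability of a binary value given the probability that it is True."""
    return p_true if value else 1.0 - p_true

def joint(s, g, r, a):
    """Chain rule for this network: P(S) P(G|S) P(R|S) P(A|G,R)."""
    return P_S[s] * bern(P_G_T[s], g) * bern(P_R_T[s], r) * bern(P_A_T[(g, r)], a)

def prob(event, given=lambda s, g, r, a: True):
    """P(event | given) by summing the joint over all 16 assignments."""
    num = den = 0.0
    for world in product((False, True), repeat=4):
        if given(*world):
            p = joint(*world)
            den += p
            if event(*world):
                num += p
    return num / den

p_admit_given_smart = prob(lambda s, g, r, a: a, lambda s, g, r, a: s)
p_admit_given_not_smart = prob(lambda s, g, r, a: a, lambda s, g, r, a: not s)
p_admit = prob(lambda s, g, r, a: a)
p_smart_given_admit = prob(lambda s, g, r, a: s, lambda s, g, r, a: a)
p_bad_grades_given_reject = prob(lambda s, g, r, a: not g, lambda s, g, r, a: not a)
p_bad_gre_given_reject = prob(lambda s, g, r, a: not r, lambda s, g, r, a: not a)
print(p_admit_given_smart, p_admit_given_not_smart, p_admit)
print(p_smart_given_admit, p_bad_grades_given_reject, p_bad_gre_given_reject)
```

Enumeration scales as $2^N$, so it only works for toy networks; larger networks need inference algorithms that exploit the graph structure.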

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: node "P1-P2 REAL" with arrows to nodes "Detected by X1", …, "Detected by Xn"]

Cause (often "hidden"): P1-P2 REAL

Effects (observed): Detected by X1, …, Detected by Xn

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: candidate patterns; one is labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data


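likelihood ratio = $\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})}$; if > 1, classify as true; if < 1, classify as false

log likelihood ratio = $\log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = $\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$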

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not)

Likelihood = L = $\frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$, for example $\frac{1114/2150}{81924/573734} \approx 3.6$

111
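For a discrete feature such as essentiality, each likelihood ratio comes from four counts: how many gold-standard positives and negatives fall into a given bin, and how many positives and negatives there are in total. The sketch below reads the two number runs on this slide as 1114 of 2150 positives and 81924 of 573734 negatives; that split is an assumption about the garbled figures, and the function name is illustrative.

```python
def likelihood_ratio(pos_in_bin, pos_total, neg_in_bin, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_in_bin / pos_total
    p_f_given_neg = neg_in_bin / neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as read from the slide (assumed split of the digit runs).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))
```

A value of L above 1 means the feature value is enriched among gold-standard positives relative to negatives.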

Fully connected → Compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small

115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology, Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Docking

N Tuncbag

Which surface(s) of protein A interact with which surface(s) of protein B?

Courtesy of Nurcan Tuncbag Used with permission

7

Docking

Imagine we wanted to predict which proteins interact with our favorite molecule. For each potential partner:

• Evaluate all possible relative positions and orientations
• Allow for structural rearrangements
• Measure energy of interaction

This approach would be extremely slow. It's also prone to false positives.

Why?

Courtesy of Nurcan Tuncbag Used with permission

8

Reducing the search space

• Efficiently choose potential partners before structural comparisons
• Use prior knowledge of interfaces to focus analysis on particular residues

9

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement. Tuncbag N, Keskin O, Nussinov R, Gursoy A. http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale. Zhang et al. http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

10

PRISM's Rationale

There is a limited number of protein "architectures". Protein structures can interact via similar architectural motifs even if the overall structures differ. Find particular surface regions of proteins that are spatially similar to the complementary partners of a known interface.

N. Tuncbag

PRISM's Rationale

• Two components
  – rigid-body structural comparisons of target proteins to known template protein-protein interfaces
  – flexible refinement using a docking energy function
• Evaluate using structural similarity and evolutionary conservation of putative binding residue hot spots

N. Tuncbag

Subtilisin and its inhibitors

Although the global folds of Subtilisin's partners are very different, the binding regions are structurally very conserved.

N. Tuncbag. Courtesy of Nurcan Tuncbag. Used with permission.

Hotspots

Figure from Clackson & Wells (1995)

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Clackson, Tim, and James A. Wells. "A Hot Spot of Binding Energy in a Hormone-Receptor Interface." Science 267, no. 5196 (1995): 383-6.

• Fewer than 10% of the residues at an interface contribute more than 2 kcal/mol to binding
• Hot spots
  – rich in Trp, Arg and Tyr
  – occur on pockets on the two proteins that have complementary shapes and distributions of charged and hydrophobic residues
  – can include buried charged residues far from solvent
  – O-ring structure excludes solvent from interface

http://onlinelibrary.wiley.com/doi/10.1002/prot.21396/full

[Figure labels: Ikb, NFkB, ASPP2] (N. Tuncbag)

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement

Courtesy of Nurcan Tuncbag Used with permission 21
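Step 1, identifying the template interface with a distance cutoff, can be sketched as follows. This is a minimal illustration, not PRISM's implementation: the input format (residue id mapped to a list of atom coordinates), the 6 Å cutoff, and the toy coordinates are all assumptions chosen for the example.

```python
import math

def interface_residues(chain_a, chain_b, cutoff=6.0):
    """Residues with any atom within `cutoff` of any atom on the other chain.

    chain_a, chain_b: dict mapping residue id -> list of (x, y, z) coordinates.
    """
    iface_a, iface_b = set(), set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            if any(math.dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
                iface_a.add(res_a)
                iface_b.add(res_b)
    return iface_a, iface_b

# Toy coordinates (hypothetical): only A10 and B7 are close to each other.
chain_a = {"A10": [(0.0, 0.0, 0.0)], "A50": [(30.0, 0.0, 0.0)]}
chain_b = {"B7": [(4.0, 0.0, 0.0)], "B90": [(0.0, 40.0, 0.0)]}
iface_a, iface_b = interface_residues(chain_a, chain_b)
print(iface_a, iface_b)
```

The two returned sets correspond to the two half-interfaces of the template; the all-pairs loop is fine for a sketch, though real structures would call for a spatial index.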

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Tuncbag, Nurcan, Attila Gursoy, et al. "Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary and Structural Similarities at Interfaces using PRISM." Nature Protocols 6, no. 9 (2011): 1341-54.

22

Predicted p27 Protein Partners

[Figure labels: p27, Cdk2, CycA, ERCC1, TFIIH, Cks1]

ΔGcalc = −29.54 kcal/mol   ΔGcalc = −37.51 kcal/mol   ΔGcalc = −44.06 kcal/mol

N. Tuncbag. Courtesy of Nurcan Tuncbag. Used with permission.

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement. Tuncbag N, Keskin O, Nussinov R, Gursoy A. http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale. Zhang et al. http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012) doi:10.1038/nature11503

PrePPI: scores potential templates without building a homology model

Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

25

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

30

[Figure labels: NA, NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

Evaluate based on five measures
• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number) and COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

35

"The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces)."

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012) doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

We will examine Bayesian classifiers soon

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002). Ho, Y. et al. Nature 415, 180-183 (2002).

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

40


Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels), tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5): 655-62.

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification: will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5): 655-62.

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: gold standard vs. experiment. Gold standard only = false negatives; overlap = true positives; experiment only = true positives and false positives]

47

Error Rates

[Venn diagram: gold standard, Experiment 1 and Experiment 2; the overlaps with the gold standard are true positives from the gold standard]

48

Data Integration

[Venn diagram: gold standard, Experiment 1 and Experiment 2, with regions I to V]

I = true positives from the gold standard
II = consensus true positives
Fraction of consensus present in gold standard = I/II

The interactions unique to Experiment 1 are a mix of true and false positives. III is the part of them found in the gold standard.

Define IV = true positives and V = false positives among the unique interactions.

Assume I/II = III/IV

52

Estimated Error Rates

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

[Figure: regions I, II and III for each data set, with the estimated false positives and true positives]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.
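The assumption I/II = III/IV can be turned into a small calculation. The sketch below is only an illustration: the function name and the counts are hypothetical, not values from Hart et al., and it simply solves the proportion for IV and takes V as whatever remains of the unique interactions.

```python
def estimate_unique_true_false(i_count, ii_count, iii_count, unique_total):
    """Estimate true (IV) and false (V) positives among unique interactions.

    i_count      -- region I: consensus interactions found in the gold standard
    ii_count     -- region II: all consensus interactions
    iii_count    -- region III: unique interactions found in the gold standard
    unique_total -- all interactions unique to the experiment (IV + V)
    """
    gold_fraction = i_count / ii_count            # I / II
    iv = iii_count / gold_fraction                # from I/II = III/IV
    iv = min(iv, unique_total)                    # cannot exceed what was observed
    v = unique_total - iv                         # the rest are false positives
    return iv, v


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
iv, v = estimate_unique_true_false(i_count=50, ii_count=100,
                                   iii_count=30, unique_total=200)
print(f"estimated true positives: {iv:.0f}, false positives: {v:.0f}")
```

Here half of the consensus is in the gold standard, so the 30 unique gold-standard hits are taken to represent half of the unique true positives.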

53

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$\text{Data} = (\text{two\_hybrid}, \text{mass\_spec}, \text{co\_evolution}, \text{co\_expression})$

or

$\text{Data} = (\text{mass\_spec\_expt\_1}, \text{mass\_spec\_expt\_2})$

57

Bayes Rule

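$$P(\text{true\_PPI} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data})}$$

posterior: $P(\text{true\_PPI} \mid \text{Data})$; prior: $P(\text{true\_PPI})$; likelihood: $P(\text{Data} \mid \text{true\_PPI})$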

58

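Naïve Bayes Classification

[Bayes rule, with the posterior, prior and likelihood terms labeled]

likelihood ratio = ratio of posterior probabilities $= \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})}$

if > 1 classify as true; if < 1 classify as false

How do we compute this?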

59

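likelihood ratio $= \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$; if > 1 classify as true, if < 1 classify as false

log likelihood ratio $= \log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function $= \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$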

61

We assume the observations are independent (we’ll see how to handle dependence soon).

Ranking function $= \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$

63

We can compute the terms of the ranking function if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
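Under the independence assumption, the log of the product becomes a sum of per-observation log likelihood ratios. The following is a minimal sketch of that ranking function; the assay names and all probabilities in `LIKELIHOODS` are made-up placeholders, not values estimated from any of the cited data sets.

```python
import math

# Hypothetical probabilities that each assay reports a pair, given that the
# pair is a true or a false interaction. In practice these would be estimated
# from high-confidence positive and negative sets.
LIKELIHOODS = {
    "two_hybrid":    {"true": 0.30, "false": 0.01},
    "mass_spec":     {"true": 0.50, "false": 0.05},
    "co_expression": {"true": 0.60, "false": 0.20},
}


def log_likelihood_ratio(observations):
    """Sum of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)] over assays.

    observations maps assay name -> True (detected) or False (not detected).
    """
    score = 0.0
    for assay, detected in observations.items():
        p_true = LIKELIHOODS[assay]["true"]
        p_false = LIKELIHOODS[assay]["false"]
        if not detected:
            # Use the complementary probabilities for a negative readout.
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        score += math.log(p_true / p_false)
    return score


all_hits = {"two_hybrid": True, "mass_spec": True, "co_expression": True}
no_hits = {"two_hybrid": False, "mass_spec": False, "co_expression": False}
print(log_likelihood_ratio(all_hits), log_likelihood_ratio(no_hits))
```

A pair does not have to be detected by every assay to rank highly; each observation simply adds or subtracts evidence.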

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio $\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$.

65


Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives; y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
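The two axes of the ROC curve can be computed by sweeping a score cutoff over labeled pairs. This is a generic sketch, not the procedure of Collins et al.; the function name and the toy scores are assumptions, labels are 1 for high-confidence positives and 0 for high-confidence negatives, and tied scores are not treated specially.

```python
def roc_points(scores, labels):
    """Return (FP/(TN+FP), TP/(TP+FN)) pairs as the cutoff is lowered."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]            # cutoff above every score: nothing called
    # Lower the cutoff one pair at a time, from the highest score down.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy example with four scored pairs.
points = roc_points(scores=[0.9, 0.8, 0.4, 0.3], labels=[1, 0, 1, 0])
print(points)
```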

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives; y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: 9074 interactions; literature curated (excluding 2-hybrid); error rate equivalent to literature curated]

68


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian networks are a tool for reasoning with probabilities
• They consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H Uetz, Y2H Ito, IP Gavin, IP Ho)

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1)
  – ⇒ $2^N - 1$ parameters
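To see how quickly the full table grows, the sketch below compares the $2^N - 1$ count with the number of parameters needed when one hidden binary cause has n binary observed effects that are independent given the cause (the structure drawn on the following slides). The function names are illustrative, and the second count assumes one prior plus two conditional probabilities per effect.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters in a complete joint table over N binary variables."""
    return 2 ** n_binary_variables - 1


def one_cause_n_effects_parameters(n_effects):
    """Free parameters for one binary cause with n conditionally independent effects.

    1 for P(cause) plus, for each effect, P(effect | cause=F) and P(effect | cause=T).
    """
    return 1 + 2 * n_effects


for n_effects in (4, 10, 20):
    n_vars = n_effects + 1   # the effects plus the hidden cause
    print(n_vars, full_joint_parameters(n_vars),
          one_cause_n_effects_parameters(n_effects))
```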

77

Graphical Structure Expresses our Beliefs

[Diagram: node "P1-P2 REAL" (cause, often “hidden”) with arrows to nodes "Detected by X1" … "Detected by Xn" (effects, observed)]

78

Graphical Structure Expresses our Beliefs

[Diagram, first network: "P1-P2 REAL" with arrows to "Detected by X1" … "Detected by Xn"]

[Diagram, second network: "P1-P2 REAL", "Detected by X1" … "Detected by Xn" and "Detected by Xj", "Detected by Xj+1" …, with additional nodes "Membrane" and "Highly expressed"]

Naïve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure

In a Bayesian network we don’t need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Diagram: Bayesian network with nodes Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram: Bayesian network with nodes Smart, Grade inflation, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints.

85
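The equivalence of the two formulations can be checked directly: multiplying P(S) by P(G|S) should reproduce the joint table, and marginalizing the joint table should recover the conditional one. The sketch below uses the numbers from this slide; only the variable names are assumptions.

```python
# Conditional formulation, as given on the slide.
p_s = {"F": 0.7, "T": 0.3}
p_g_given_s = {"F": {"F": 0.95, "T": 0.05},
               "T": {"F": 0.2, "T": 0.8}}

# Conditional -> joint: P(S, G) = P(S) * P(G | S)
joint = {(s, g): p_s[s] * p_g_given_s[s][g] for s in "FT" for g in "FT"}

# Joint -> conditional: P(S) = sum over G of P(S, G); P(G | S) = P(S, G) / P(S)
p_s_back = {s: joint[(s, "F")] + joint[(s, "T")] for s in "FT"}
p_g_back = {s: {g: joint[(s, g)] / p_s_back[s] for g in "FT"} for s in "FT"}

for key in sorted(joint):
    print(key, round(joint[key], 3))
```

Both forms need three free numbers: the joint table has four entries that sum to 1, and the conditional form has one prior plus one conditional probability per value of S.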

Automated Admissions Decisions

[Diagram: Bayesian network with nodes Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

“Explaining Away”

[Diagram: nodes Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about the sprinkler.

87

“Explaining Away”

[Diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5   p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
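The coin example is small enough to enumerate. The sketch below lists the four equally likely outcomes and compares P(C1=H) with and without knowledge of the score; the helper names are illustrative, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Each world is (c1, c2, score, probability); the score is 1 when the coins match.
worlds = [(c1, c2, 1 if c1 == c2 else 0, Fraction(1, 4))
          for c1, c2 in product("HT", repeat=2)]


def prob(event, given=lambda w: True):
    """P(event | given), computed by summing over the enumerated worlds."""
    denominator = sum(w[3] for w in worlds if given(w))
    numerator = sum(w[3] for w in worlds if given(w) and event(w))
    return numerator / denominator


def c1_heads(w):
    return w[0] == "H"


print(prob(c1_heads))                                        # no evidence
print(prob(c1_heads, lambda w: w[1] == "T"))                 # C2 known, score unknown
print(prob(c1_heads, lambda w: w[2] == 1))                   # score known, C2 unknown
print(prob(c1_heads, lambda w: w[2] == 1 and w[1] == "T"))   # both known
```

Without the score, learning C2 tells us nothing about C1. Once the score is observed, C2 fully determines C1.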

88

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data $X_i$
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …
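When the structure is known and every variable is observed, the maximum likelihood parameters of a discrete network are just relative frequencies in the data D. The sketch below shows this for the coin network; the function name and the four-record data set are hypothetical.

```python
from collections import Counter


def fit_coin_network(data):
    """Maximum likelihood estimates from fully observed (c1, c2, score) records."""
    n = len(data)
    p_c1_heads = sum(1 for c1, _, _ in data if c1 == "H") / n
    p_c2_heads = sum(1 for _, c2, _ in data if c2 == "H") / n

    # P(score = 1 | c1, c2): fraction of records with score 1 per parent setting.
    totals = Counter((c1, c2) for c1, c2, _ in data)
    ones = Counter((c1, c2) for c1, c2, score in data if score == 1)
    p_score_one = {parents: ones[parents] / totals[parents] for parents in totals}
    return p_c1_heads, p_c2_heads, p_score_one


# Hypothetical observations in the (C1, C2, S) format used on the slide.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0)]
p_c1, p_c2, p_score = fit_coin_network(D)
print(p_c1, p_c2, p_score)
```

Parent settings that never occur in the data get no estimate here; a maximum posterior estimate would fill those in from a prior.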

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Diagram: nodes S (Smart), G (Grades), R (GREs), A (Admit)]

Recall P(X,Y) = P(X|Y)P(Y)

P(S,G,R,A) = P(S)P(G|S)P(R|G,S)P(A|S,G,R)

= P(S)P(G|S)P(R|S)P(A|G,R) (why? because of the conditional independence assumption)

96

Prediction with Bayes Nets

[Diagram: nodes S (Smart), G (Grades), R (GREs), A (Admit)]

S:   P(F)   P(T)
     0.5    0.5

G:   S   P(F)   P(T)
     F   0.9    0.1
     T   0.8    0.2

R:   S   P(F)   P(T)
     F   0.8    0.2
     T   0.2    0.8

A:   G   R   P(F)   P(T)
     F   F   0.9    0.1
     F   T   0.8    0.2
     T   F   0.5    0.5
     T   T   0.2    0.8

$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\, P(R \mid S{=}T)\, P(A{=}T \mid G, R)$

= (0.8)(0.2)(0.1) [G=F, R=F] + (0.8)(0.8)(0.2) [G=F, R=T] + (0.2)(0.2)(0.5) [G=T, R=F] + (0.2)(0.8)(0.8) [G=T, R=T] = 0.29

Rough grading: being smart doesn’t help very much.

GREs are better correlated with intelligence than the grades.
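The prediction sum can be written as a short enumeration over G and R. This sketch uses the conditional probability tables from this slide; the dictionary and function names are illustrative only.

```python
from itertools import product

# Conditional probability tables from the slide, indexed as table[parent][child].
P_G_GIVEN_S = {"F": {"F": 0.9, "T": 0.1}, "T": {"F": 0.8, "T": 0.2}}
P_R_GIVEN_S = {"F": {"F": 0.8, "T": 0.2}, "T": {"F": 0.2, "T": 0.8}}
# P(A = T | G, R), keyed by (G, R).
P_ADMIT_GIVEN_GR = {("F", "F"): 0.1, ("F", "T"): 0.2,
                    ("T", "F"): 0.5, ("T", "T"): 0.8}


def p_admit_given_smart(s):
    """P(A=T | S=s) = sum over G, R of P(G|S) P(R|S) P(A=T|G,R)."""
    return sum(P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * P_ADMIT_GIVEN_GR[(g, r)]
               for g, r in product("FT", repeat=2))


print(round(p_admit_given_smart("T"), 3))
```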

97

Inference with Bayes Nets

[Diagram: nodes S (Smart), G (Grades), R (GREs), A (Admit)]

$P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A{=}T \mid G, R)$

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) = 0.23

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T), calculated on the previous slide, = 0.29
P(A=T|S=F), calculated analogously to P(A=T|S=T), = 0.16

P(S=T|A=T) = P(S=T, A=T)/P(A=T)
Or, using Bayes' rule: = P(S=T)P(A=T|S=T)/P(A=T)

98

Inference with Bayes Nets

[Diagram: nodes S (Smart), G (Grades), R (GREs), A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S} \sum_{G} P(S)\, P(G \mid S)\, P(R{=}F \mid S)\, P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$

$P(A{=}F) = \sum_{S} \sum_{G} \sum_{R} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A{=}F \mid G, R)$ (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.94 / 0.55 = 1.7
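Because the network has only four binary variables, the "tedious" sums can be done by brute-force enumeration of all 16 joint states. The sketch below assumes the conditional probability tables from the "Prediction with Bayes Nets" slide; the helper names `joint`, `prob` and `conditional` are illustrative.

```python
from itertools import product

P_S = {"F": 0.5, "T": 0.5}
P_G_GIVEN_S = {"F": {"F": 0.9, "T": 0.1}, "T": {"F": 0.8, "T": 0.2}}
P_R_GIVEN_S = {"F": {"F": 0.8, "T": 0.2}, "T": {"F": 0.2, "T": 0.8}}
P_A_GIVEN_GR = {("F", "F"): {"F": 0.9, "T": 0.1}, ("F", "T"): {"F": 0.8, "T": 0.2},
                ("T", "F"): {"F": 0.5, "T": 0.5}, ("T", "T"): {"F": 0.2, "T": 0.8}}


def joint(s, g, r, a):
    """Chain rule for this network: P(S) P(G|S) P(R|S) P(A|G,R)."""
    return P_S[s] * P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * P_A_GIVEN_GR[(g, r)][a]


def prob(**fixed):
    """Marginal probability of the fixed assignments, e.g. prob(A="F")."""
    total = 0.0
    for s, g, r, a in product("FT", repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if all(world[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total


def conditional(query, evidence):
    """P(query | evidence) = P(query, evidence) / P(evidence)."""
    return prob(**query, **evidence) / prob(**evidence)


print(round(prob(A="T"), 3))
print(round(conditional({"S": "T"}, {"A": "T"}), 3))
print(round(conditional({"G": "F"}, {"A": "F"}), 3))
print(round(conditional({"R": "F"}, {"A": "F"}), 3))
```

With these tables, a rejected student is more likely to have had bad grades than bad GREs, mostly because bad grades are common for everyone under rough grading.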

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: node "P1-P2 REAL" (cause, often “hidden”) with arrows to nodes "Detected by X1" … "Detected by Xn" (effects, observed)]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: two phylogenetic profile patterns, one labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

108

109

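likelihood ratio $= \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})}$; if > 1 classify as true, if < 1 classify as false

log likelihood ratio $= \log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function $= \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$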

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

Likelihood $= L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114/2150}{81924/573734}$
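The likelihood ratio for a categorical feature is a ratio of two frequencies, one from the positive and one from the negative gold standard. The sketch below takes the counts shown on this slide to mean 1114 of 2150 positives and 81924 of 573734 negatives; that grouping of the digits is an assumption, as is the function name.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), with both terms estimated as frequencies."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (see the assumption stated above).
L_value = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L_value, 2))
```

A value above 1 means the feature value is more common among gold-standard positives than among negatives, so observing it raises the score of a candidate pair.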

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected → compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small


115

[Figure annotations: TP = FP; predictions based on a single data type all have TP/FP < 1]

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Docking

Imagine we wanted to predict which proteins interact with our favorite molecule. For each potential partner:

• Evaluate all possible relative positions and orientations
• Allow for structural rearrangements
• Measure energy of interaction

This approach would be extremely slow. It’s also prone to false positives.

Why?

Courtesy of Nurcan Tuncbag Used with permission

8

Reducing the search space

• Efficiently choose potential partners before structural comparisons
• Use prior knowledge of interfaces to focus analysis on particular residues

9

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.
Tuncbag N, Keskin O, Nussinov R, Gursoy A.
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

10

PRISM’s Rationale

There are a limited number of protein “architectures”. Protein structures can interact via similar architectural motifs even if the overall structures differ. Find particular surface regions of proteins that are spatially similar to the complementary partners of a known interface.

N. Tuncbag 11

PRISM’s Rationale

• Two components
  – rigid-body structural comparisons of target proteins to known template protein-protein interfaces
  – flexible refinement using a docking energy function
• Evaluate using structural similarity and evolutionary conservation of putative binding residue hot spots

N. Tuncbag 12

Subtilisin and its inhibitors

Although the global folds of subtilisin’s partners are very different, the binding regions are structurally very conserved.

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 13

Hotspots

Figure from Clackson & Wells (1995)

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Clackson, Tim, and James A. Wells. "A Hot Spot of Binding Energy in a Hormone-Receptor Interface." Science 267, no. 5196 (1995): 383-6.

15

• Fewer than 10% of the residues at an interface contribute more than 2 kcal/mol to binding
• Hot spots
  – rich in Trp, Arg and Tyr
  – occur on pockets on the two proteins that have complementary shapes and distributions of charged and hydrophobic residues
  – can include buried charged residues far from solvent
  – O-ring structure excludes solvent from interface

http://onlinelibrary.wiley.com/doi/10.1002/prot.21396/full

16

[Figure: template complex IκB/NFκB and query protein ASPP2]

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement

N. Tuncbag. Courtesy of Nurcan Tuncbag. Used with permission. 21
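Step 1, identifying the template interface with a distance cutoff, can be sketched as follows. This is a generic illustration and not PRISM's implementation: the function name, the input layout (residue id mapped to a list of atom coordinates) and the 5 Å default cutoff are all assumptions.

```python
import math


def interface_residues(chain_a, chain_b, cutoff=5.0):
    """Return the residues of each chain that have an atom within `cutoff` of the other chain.

    chain_a, chain_b -- dicts mapping residue id -> list of (x, y, z) atom coordinates
    """
    interface_a, interface_b = set(), set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            # A residue pair is in contact if any two of their atoms are close enough.
            if any(math.dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
                interface_a.add(res_a)
                interface_b.add(res_b)
    return interface_a, interface_b


# Toy coordinates: only residues A1 and B1 are within 5 units of each other.
chain_a = {"A1": [(0.0, 0.0, 0.0)], "A2": [(20.0, 0.0, 0.0)]}
chain_b = {"B1": [(3.0, 0.0, 0.0)], "B2": [(50.0, 0.0, 0.0)]}
print(interface_residues(chain_a, chain_b))
```

The two returned sets correspond to the two half-interfaces against which a query surface would then be aligned.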

Flowchart

Structural match of template and target does not depend on order of residues.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Tuncbag, Nurcan, Attila Gursoy, et al. "Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary and Structural Similarities at Interfaces using PRISM." Nature Protocols 6, no. 9 (2011): 1341-54.

22

Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

∆Gcalc= - 2954 kcalmol ∆Gcalc= - 3751 kcalmol ∆Gcalc= - 4406 kcalmol

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 23

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.
Tuncbag N, Keskin O, Nussinov R, Gursoy A.
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

PrePPI scores potential templates without building a homology model.

Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure


[Figure: template complex of structural neighbors NA and NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

Evaluate based on five measures:
• SIM: structural similarity of NA/MA and NB/MB
• SIZ (number), COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin A-C et al. Nature 415, 141-147 (2002)
Ho Y et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

Tagging strategies

TAP-tag (endogenous protein levels), tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

Yeast two-hybrid

Biotechniques. 2008 Apr; 44(5): 655-62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

• Does not require purification; will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Courtesy of BioTechniques. Used with permission.
Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.


Error Rates

• How can we estimate the error rates?

[Venn diagram: gold standard vs. experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

[Venn diagram: gold standard, Experiment 1, Experiment 2. Region shared with the gold standard: true positives from gold standard.]

Data Integration

[Venn diagram: gold standard, Experiment 1, Experiment 2, with regions labeled I to V]

• I = true positives from gold standard
• II = consensus true positives
• Fraction of consensus present in gold standard = I/II
• The remaining data of Experiment 1 are a mix of true and false positives; III marks its overlap with the gold standard
• Define IV = true positives, V = false positives
• Assume I/II = III/IV

Estimated Error Rates

How complete are current yeast and human protein-interaction networks? G Traver Hart, Arun K Ramani and Edward M Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

[Figure: sizes of regions I, II and III, with estimated false positives and true positives]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{I}{II} = \frac{III}{IV}$$

Courtesy of BioMed Central Ltd. Used with permission.
Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.
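The proportionality assumption I/II = III/IV can be turned into a small calculation. The sketch below takes the slide's ratio at face value; the function name, the argument names, and the counts are hypothetical and are not values from Hart et al.

```python
def estimate_unique_error(n_I, n_II, n_III, n_unique):
    """Estimate true and false positives among interactions seen in only one experiment.

    n_I      : consensus interactions that are also in the gold standard (region I)
    n_II     : consensus interactions, all assumed true (region II)
    n_III    : unique interactions that are also in the gold standard (region III)
    n_unique : all interactions unique to the experiment (IV + V)
    """
    frac_in_gold = n_I / n_II      # fraction of true positives the gold standard covers
    n_IV = n_III / frac_in_gold    # from I/II = III/IV, so IV = III * II / I
    n_V = n_unique - n_IV          # the rest are estimated false positives
    return n_IV, n_V, n_V / n_unique

# Made-up counts, for illustration only.
tp, fp, fp_rate = estimate_unique_error(n_I=100, n_II=400, n_III=50, n_unique=1000)
print(tp, fp, fp_rate)
```

The estimate is only as good as the assumption that the gold standard is unbiased with respect to the experiments.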

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$$\text{Data} = (\text{two\_hybrid},\ \text{mass\_spec},\ \text{co\_evolution},\ \text{co\_expression})$$

or

$$\text{Data} = (\text{mass\_spec\_expt\_1},\ \text{mass\_spec\_expt\_2})$$

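Bayes Rule

$$P(\text{true\_PPI} \mid \text{Data}) = \frac{P(\text{true\_PPI})\, P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data})}$$

posterior = prior × likelihood / P(Data)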

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})\, P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\, P(\text{Data} \mid \text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

How do we compute this?

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

The prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

We assume the observations are independent (we’ll see how to handle dependence soon).

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
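Under the independence assumption, the ranking function is just a sum of per-observation log ratios. The sketch below illustrates this for binary "detected / not detected" observations; the assay names and the conditional probabilities are invented for illustration and would in practice be estimated from high-confidence positive and negative sets.

```python
from math import log

def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)] over independent assays.

    observations  : dict assay -> True/False (was the pair detected?)
    p_given_true  : dict assay -> P(detected | true_PPI)
    p_given_false : dict assay -> P(detected | false_PPI)
    """
    score = 0.0
    for assay, detected in observations.items():
        pt = p_given_true[assay] if detected else 1 - p_given_true[assay]
        pf = p_given_false[assay] if detected else 1 - p_given_false[assay]
        score += log(pt / pf)
    return score

# Hypothetical detection probabilities (not from any paper).
p_true = {"two_hybrid": 0.3, "mass_spec": 0.6}
p_false = {"two_hybrid": 0.01, "mass_spec": 0.05}

both = log_likelihood_ratio({"two_hybrid": True, "mass_spec": True}, p_true, p_false)
one = log_likelihood_ratio({"two_hybrid": True, "mass_spec": False}, p_true, p_false)
neither = log_likelihood_ratio({"two_hybrid": False, "mass_spec": False}, p_true, p_false)
print(both, one, neither)
```

Note that a pair detected by only one assay still receives a score, so pairs can be ranked without requiring detection in every assay.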

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: 9074 interactions; literature curated (excluding 2-hybrid); error rate equivalent to literature curated.]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different subcellular locations OR anticorrelated mRNA expression

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
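The two ROC axes, TP/(TP+FN) and FP/(TN+FP), can be traced by sweeping a score threshold over gold-standard positives and negatives. The following is a minimal sketch with made-up scores and labels; `roc_points` is a hypothetical helper, and tied scores are not treated specially.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    scores : interaction scores (higher = more confident)
    labels : True for gold-standard positives, False for gold-standard negatives
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one prediction at a time, from the highest score down.
    for _, is_pos in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy example: three gold positives and two gold negatives.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [True, True, False, True, False]
points = roc_points(scores, labels)
print(points)
```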


Bayesian Networks

A method for using probabilities to reason

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

A “natural” way to think about biological networks.

Predict unknown variables from observations.

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1)
  => $2^N - 1$ parameters
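To see how quickly the full joint table grows, the count above can be compared with a factored model. The sketch below assumes a naïve Bayes layout (one binary hidden class with binary observations as children), which needs one prior plus two conditional probabilities per observation; the function names are illustrative.

```python
def full_joint_parameters(n_variables):
    """Free parameters of a full joint table over binary variables: 2^N - 1."""
    return 2 ** n_variables - 1

def naive_bayes_parameters(n_observations):
    """One prior for the class, plus P(obs = T | class) for each of the two class values."""
    return 1 + 2 * n_observations

# One hidden "interaction is real" variable plus 4, 10, or 20 observed assays.
for n_obs in (4, 10, 20):
    print(n_obs, full_joint_parameters(n_obs + 1), naive_bayes_parameters(n_obs))
```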

Graphical Structure Expresses our Beliefs

[Diagram: “P1-P2 REAL” (cause, often “hidden”) is the parent of “Detected by X1”, …, “Detected by Xn” (effects, observed)]

[Diagram: second network with nodes P1-P2 REAL; Detected by X1, …, Detected by Xj, Detected by Xj+1, …, Detected by Xn; Membrane; Highly expressed]

• Naïve Bayes assumes all observations are independent
• But some observations may be coupled
• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

Graphical Structure

In a Bayesian network we don’t need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Diagram: Bayesian network with nodes Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).
  Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).
  You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Diagram: Smart is the parent of Grades]

Joint Probability

  S    G (above threshold)    P(S,G)
  F    F                      0.665
  F    T                      0.035
  T    F                      0.06
  T    T                      0.24

Conditional Probability

  P(S=F) = 0.7    P(S=T) = 0.3

  S    P(G=F|S)    P(G=T|S)
  F    0.95        0.05
  T    0.2         0.8

The formulations are equivalent, and both require the same number of constraints.
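The equivalence of the two formulations can be checked directly: multiplying the prior P(S) by the conditional P(G|S) should reproduce each entry of the joint table. A small sketch using the numbers on this slide (the variable names are illustrative):

```python
# Conditional formulation from the slide: P(S) and P(G | S).
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Rebuild the joint table P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

# The joint also lets us reason in the other direction, e.g. P(S=T | G=T).
p_smart_given_good_grades = joint[(True, True)] / (joint[(False, True)] + joint[(True, True)])

for key, value in joint.items():
    print(key, round(value, 3))
```

Both formulations carry three free parameters: the joint table has four entries that sum to 1, and the conditional form has P(S=T), P(G=T|S=F) and P(G=T|S=T).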

Automated Admissions Decisions

[Diagram: Bayesian network with nodes Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

“Explaining Away”

[Diagram: Bayesian network with nodes Season, S (Sprinkler), R (Rain), Grass wet, Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

“Explaining Away”

[Diagram: Coin 1 and Coin 2 are the parents of Score]

  p(C1=H) = p(C1=T) = 0.5
  p(C2=H) = p(C2=T) = 0.5

  C1    C2    Score
  H     H     1
  T     T     1
  H     T     0
  T     H     0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
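The two-coin example is small enough to check by enumerating all four equally likely outcomes. The sketch below uses exact fractions; the helper `prob` is a hypothetical name introduced for this illustration.

```python
from fractions import Fraction
from itertools import product

def score(c1, c2):
    """Score is 1 when the two coins match, 0 otherwise (as in the slide's table)."""
    return 1 if c1 == c2 else 0

# Each world is (C1, C2, Score, probability); the coins are fair and independent.
worlds = [(c1, c2, score(c1, c2), Fraction(1, 4)) for c1, c2 in product("HT", repeat=2)]

def prob(event, given=lambda w: True):
    """P(event | given), computed by summing over the enumerated worlds."""
    numerator = sum(w[3] for w in worlds if given(w) and event(w))
    denominator = sum(w[3] for w in worlds if given(w))
    return numerator / denominator

p_c1 = prob(lambda w: w[0] == "H")
p_c1_given_c2 = prob(lambda w: w[0] == "H", lambda w: w[1] == "T")
p_c1_given_c2_and_score = prob(lambda w: w[0] == "H", lambda w: w[1] == "T" and w[2] == 1)
print(p_c1, p_c1_given_c2, p_c1_given_c2_and_score)
```

Without the score, learning C2 leaves the belief about C1 unchanged; once the score is observed, C2 determines C1 completely.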

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function
• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

Learning Models from Data

[Diagram: Coin 1 and Coin 2 are the parents of Score]

Parameters to learn:

  p(C1=H) = ?    p(C1=T) = ?
  p(C2=H) = ?    p(C2=T) = ?

  C1    C2    Score
  H     H     ?
  T     T     ?
  H     T     ?
  T     H     ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …
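When the structure is known and every variable is observed, the maximum-likelihood parameters are simply observed frequencies. A minimal sketch for the two-coin network follows; only the first two records come from the slide, the remaining records and the function names are made up for illustration.

```python
from collections import Counter

# Training data D = (C1, C2, Score). The first two rows are from the slide;
# the other rows are hypothetical.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]

def mle_marginal(data, index):
    """Maximum-likelihood estimate of P(coin = value) from counts."""
    counts = Counter(row[index] for row in data)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def mle_score_table(data):
    """Maximum-likelihood estimate of P(Score = 1 | C1, C2) for each observed coin pair."""
    pair_counts = Counter((c1, c2) for c1, c2, _ in data)
    hit_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
    return {pair: hit_counts[pair] / n for pair, n in pair_counts.items()}

p_c1 = mle_marginal(data, 0)
p_c2 = mle_marginal(data, 1)
p_score = mle_score_table(data)
print(p_c1, p_c2, p_score)
```

With hidden variables or missing data, counting no longer suffices, which is where iterative methods such as EM come in.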

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

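Chain Rule of Probability for Bayes Nets

[Diagram: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit)]

$$P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R)$$

$$= P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R) \quad \text{(why?)}$$

(because of the conditional independence assumption)

Recall: $P(X,Y) = P(X \mid Y)\,P(Y)$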

Prediction with Bayes Nets

[Diagram: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit)]

  S:    P(F)    P(T)
        0.5     0.5

  G:    S    P(F)    P(T)
        F    0.9     0.1
        T    0.8     0.2

  R:    S    P(F)    P(T)
        F    0.8     0.2
        T    0.2     0.8

  A:    G    R    P(F)    P(T)
        F    F    0.9     0.1
        F    T    0.8     0.2
        T    F    0.5     0.5
        T    T    0.2     0.8

$$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\,P(R \mid S{=}T)\,P(A{=}T \mid G,R)$$

  = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
    (terms for G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn’t help very much.

GREs are better correlated with intelligence than the grades.

Inference with Bayes Nets

[Diagram: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit)]

$$P(A{=}T) = \sum_{S}\sum_{G}\sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}T \mid G,R)$$

  = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

  P(S=T) = 0.5, P(S=F) = 0.5
  P(A=T|S=T) = 0.29 (calculated on previous slide)
  P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes rule: = P(S=T) P(A=T|S=T) / P(A=T)
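Both the prediction P(A=T|S=T) and the inference P(S=T|A=T) can be obtained by brute-force enumeration of the factored joint P(S)P(G|S)P(R|S)P(A|G,R). The sketch below uses the conditional probability tables as transcribed from the slides; the helper names are illustrative, and because the slides quote rounded figures, values computed this way may differ slightly in the second decimal place.

```python
from itertools import product

# Conditional probability tables, stored as the probability of the value True.
P_S_TRUE = 0.5
P_G_TRUE = {True: 0.2, False: 0.1}            # P(G=T | S)
P_R_TRUE = {True: 0.8, False: 0.2}            # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)

def bern(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1 - p_true

def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bern(P_S_TRUE, s) * bern(P_G_TRUE[s], g)
            * bern(P_R_TRUE[s], r) * bern(P_A_TRUE[(g, r)], a))

def prob(query, evidence):
    """P(query | evidence) by summing the joint; both are dicts of name -> bool."""
    names = ("S", "G", "R", "A")
    numerator = denominator = 0.0
    for values in product((False, True), repeat=4):
        world = dict(zip(names, values))
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(*values)
            denominator += p
            if all(world[k] == v for k, v in query.items()):
                numerator += p
    return numerator / denominator

p_admit_given_smart = prob({"A": True}, {"S": True})   # prediction
p_smart_given_admit = prob({"S": True}, {"A": True})   # inference
print(round(p_admit_given_smart, 3), round(p_smart_given_admit, 3))
```

Enumeration is exponential in the number of variables, so it is only practical for toy networks like this one.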

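Inference with Bayes Nets

[Diagram: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S}\sum_{G} P(S)\,P(G \mid S)\,P(R{=}F \mid S)\,P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S}\sum_{G}\sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}F \mid G,R) \quad \text{(as before)}$$

Tedious but straightforward to compute.

$$\frac{P(G{=}F \mid A{=}F)}{P(R{=}F \mid A{=}F)} = \frac{0.92}{0.56} = 1.6$$

End of worked example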

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: “P1-P2 REAL” (cause, often “hidden”) is the parent of “Detected by X1”, …, “Detected by Xn” (effects, observed)]

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1.5: 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled “More likely to interact”]

Courtesy of Cokus et al. License: CC-BY.
Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

likelihood ratio: if > 1, classify as true; if < 1, classify as false.

The prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114 / 2150}{81924 / 573734}$$
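The likelihood ratio for one feature value is a ratio of two frequencies, one measured on gold-standard positives and one on gold-standard negatives. The sketch below reads the counts on this slide as 1114 of 2150 positives and 81924 of 573734 negatives; that grouping of the digits is an assumption about the garbled original, and the variable names are illustrative.

```python
# Counts for one feature value (assumed digit grouping, see the note above).
pos_with_feature, pos_total = 1114, 2150
neg_with_feature, neg_total = 81924, 573734

p_f_given_pos = pos_with_feature / pos_total   # P(f | pos)
p_f_given_neg = neg_with_feature / neg_total   # P(f | neg)
likelihood_ratio = p_f_given_pos / p_f_given_neg

print(round(p_f_given_pos, 3), round(p_f_given_neg, 3), round(likelihood_ratio, 2))
```

A ratio above 1 means the feature value is enriched among gold-standard positives and therefore raises the score of a candidate pair.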

Fully connected → compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small.

TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Reducing the search space

• Efficiently choose potential partners before structural comparisons

• Use prior knowledge of interfaces to focus analysis on particular residues

9

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.
Tuncbag N, Keskin O, Nussinov R, Gursoy A.
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

10

PRISM’s Rationale

There are a limited number of protein “architectures”.
Protein structures can interact via similar architectural motifs even if the overall structures differ.
Find particular surface regions of proteins that are spatially similar to the complementary partners of a known interface.

N Tuncbag 11

PRISM’s Rationale

• Two components
  – rigid-body structural comparisons of target proteins to known template protein-protein interfaces
  – flexible refinement using a docking energy function

• Evaluate using structural similarity and evolutionary conservation of putative binding residue hot spots

N Tuncbag 12

Subtilisin and its inhibitors

Although the global folds of subtilisin’s partners are very different, the binding regions are structurally very conserved.

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 13

Hotspots

Figure from Clackson & Wells (1995)

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Clackson, Tim, and James A. Wells. “A Hot Spot of Binding Energy in a Hormone-Receptor Interface.” Science 267, no. 5196 (1995): 383-6.

14


• Fewer than 10% of the residues at an interface contribute more than 2 kcal/mol to binding

• Hot spots
  – are rich in Trp, Arg and Tyr
  – occur on pockets on the two proteins that have complementary shapes and distributions of charged and hydrophobic residues
  – can include buried charged residues far from solvent
  – O-ring structure excludes solvent from interface

http://onlinelibrary.wiley.com/doi/10.1002/prot.21396/full

16

Ikb

NFkB

N Tuncbag

1 Identify interface of template (distance cutoff)

Courtesy of Nurcan Tuncbag Used with permission 17


Ikb

NFkB

N Tuncbag

ASPP2

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots

Courtesy of Nurcan Tuncbag Used with permission 20

NFkB

N Tuncbag

ASPP2

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement

Courtesy of Nurcan Tuncbag Used with permission 21
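
Step 1 above (identify the interface of a template with a distance cutoff) can be sketched in a few lines. This is only an illustration, not PRISM's actual code: the 5 Å cutoff, the dictionary layout and the toy coordinates are all assumptions made for the example.

```python
import math

def interface_residues(chain_a, chain_b, cutoff=5.0):
    """Return the residues of each chain that have at least one atom within
    `cutoff` angstroms of an atom on the other chain.

    Each chain is a dict: residue id -> list of (x, y, z) atom coordinates.
    The 5.0 default is an assumed value, not one quoted in the lecture.
    """
    near_a, near_b = set(), set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            close = any(math.dist(p, q) <= cutoff
                        for p in atoms_a for q in atoms_b)
            if close:
                near_a.add(res_a)
                near_b.add(res_b)
    return near_a, near_b

# Toy coordinates (hypothetical, not from a real structure).
chain_a = {"A10": [(0.0, 0.0, 0.0)], "A11": [(20.0, 0.0, 0.0)]}
chain_b = {"B5": [(3.0, 0.0, 0.0)], "B6": [(40.0, 0.0, 0.0)]}
iface_a, iface_b = interface_residues(chain_a, chain_b, cutoff=5.0)
```

The two returned sets are the "half-interfaces" that a query surface would then be aligned against in step 2.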

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Tuncbag Nurcan Attila Gursoy et al Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary andStructural Similarities at Interfaces using PRISM Nature Protocols 6 no 9 (2011) 1341-54

22

Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

∆Gcalc= - 2954 kcalmol ∆Gcalc= - 3751 kcalmol ∆Gcalc= - 4406 kcalmol

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 23

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.
Tuncbag N, Keskin O, Nussinov R, Gursoy A.
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

PrePPI: scores potential templates without building a homology model.

Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. “Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale.” Nature 490, no. 7421 (2012): 556-60.

25

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. “Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale.” Nature 490, no. 7421 (2012): 556-60.

26


NA

NA

NB

NB

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

31

NA

NA

NB

NB

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)

2. Predict interacting residues for the homology models

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

32

NA

NA

NB

NB

Evaluate based on five measures

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

33

NA

NA

NB

NB

Evaluate based on five measures:
• SIM: structural similarity of NA/MA and NB/MB

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 34

Evaluate based on five measures:
• SIM: structural similarity of NA/MA and NB/MB
• SIZ (number) and COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

35

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

37

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon.

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

38

Outline

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002); Ho, Y. et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. “Proteomics: Protein Complexes take the Bait.” Nature 415, no. 6868 (2002): 123-4.

40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions

• Proteins are in their correct subcellular location

Limitations

41

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions

• Proteins are in their correct subcellular location

Limitations
• overexpression/tagging can influence results
• only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels), tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. “Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes.” Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Biotechniques 2008 Apr; 44(5): 655-62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. “Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System.” Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification – will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Biotechniques 2008 Apr; 44(5): 655-62. Ratushny V, Golemis E.

45

Outline

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

Gold standard Experiment

False negatives

True positives

True positives

and false

positives

47

Error Rates

Gold standard

Experiment 1

Experiment 2

True positives

from gold standard

48

Data Integration

Gold standard

Experiment 1

Experiment 2

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

49

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Mix of true and false

positives

III

50

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Define IV = true positives, V = false positives

IV true

V false

III

51

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

IV true

V false

III

52

Estimated Error Rates

False positives

True positives

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

I =

II =

III =

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. “How Complete are Current Yeast and Human Protein-interaction Networks?” Genome Biology 7, no. 11 (2006): 120.

53
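
The assumption I/II = III/IV can be turned into a small calculation: solve for IV (the true positives among the interactions unique to one experiment) and treat the remainder as false positives. The sketch below is a minimal illustration; the function names and all counts are hypothetical and are not taken from Hart et al.

```python
def estimate_unique_true_positives(n_i, n_ii, n_iii):
    """Solve I/II = III/IV for IV.

    n_i   : consensus interactions that are also in the gold standard (I)
    n_ii  : all consensus interactions (II)
    n_iii : experiment-unique interactions that are in the gold standard (III)
    """
    if n_i == 0:
        raise ValueError("region I is empty; the ratio I/II cannot be inverted")
    return n_iii * n_ii / n_i

def estimate_false_positive_fraction(n_i, n_ii, n_iii, n_unique):
    """Estimated fraction of the experiment-unique interactions that are false (V)."""
    n_iv = min(estimate_unique_true_positives(n_i, n_ii, n_iii), n_unique)
    return (n_unique - n_iv) / n_unique

# Made-up counts, for illustration only.
n_iv = estimate_unique_true_positives(n_i=50, n_ii=100, n_iii=20)
fp_fraction = estimate_false_positive_fraction(50, 100, 20, n_unique=400)
```

With these invented counts, half of the consensus is in the gold standard, so the 20 gold-standard hits in the unique data imply about 40 true positives out of 400.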

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by >1 method

• Filter out “sticky” proteins

• Estimate probability of each interaction based on data

• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. “Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions.” Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

Bayes Rule

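posterior = prior × likelihood / P(Data):

$P(\text{true\_PPI} \mid \text{Data}) = \frac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data})}$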

58

Naiumlve Bayes Classification

posterior ∝ prior × likelihood

likelihood ratio = ratio of posterior probabilities

if > 1 classify as true; if < 1 classify as false

How do we compute this?

59

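likelihood ratio = $\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking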

60

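likelihood ratio = $\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = $\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$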

61

We assume the observations are independent (we’ll see how to handle dependence soon)

Ranking function = $\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

62

We assume the observations are independent (we’ll see how to handle dependence soon)

Ranking function = $\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$

63

We assume the observations are independent (we’ll see how to handle dependence soon). We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function = $\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$

64
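
Under the independence assumption, the log of the product becomes a sum of per-assay log ratios, which is easy to compute. The sketch below illustrates that ranking function only; the detection rates are invented for the example and would in practice be estimated from high-confidence positive and negative sets.

```python
import math

# Hypothetical detection rates, NOT taken from any paper:
# P(detected | true_PPI) and P(detected | false_PPI) for each assay.
P_DETECT_TRUE = {"two_hybrid": 0.20, "mass_spec": 0.40}
P_DETECT_FALSE = {"two_hybrid": 0.01, "mass_spec": 0.05}

def log_likelihood_ratio(observations):
    """Sum of per-assay log ratios; valid only if the assays are independent."""
    score = 0.0
    for assay, detected in observations.items():
        p_true = P_DETECT_TRUE[assay]
        p_false = P_DETECT_FALSE[assay]
        if not detected:
            # A missed detection is also evidence, with complementary rates.
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        score += math.log(p_true / p_false)
    return score

pairs = {
    "both_assays": {"two_hybrid": True, "mass_spec": True},
    "mass_spec_only": {"two_hybrid": False, "mass_spec": True},
    "neither": {"two_hybrid": False, "mass_spec": False},
}
ranked = sorted(pairs, key=lambda name: log_likelihood_ratio(pairs[name]),
                reverse=True)
```

Because the prior is the same for every pair, sorting by this score gives the same order as sorting by the posterior odds.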

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. “Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions.” Nature 417, no. 6887 (2002): 399-403.

65

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives =
True negatives =

Source: Collins, Sean R., Patrick Kemmeren, et al. “Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae.” Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

66

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: 9074 interactions; literature curated (excluding 2-hybrid); error rate equivalent to literature curated.]

68

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotation: 9074 interactions.]

69
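
The two axes of the ROC curve can be computed directly from scored gold-standard sets by sweeping the score cutoff. The following is a rough sketch under the assumption that each candidate interaction already has a numeric score (for example a log-likelihood ratio); the scores shown are invented.

```python
def roc_points(scores_pos, scores_neg):
    """Return (FPR, TPR) pairs, one per distinct cutoff, highest cutoff first.

    scores_pos : scores of high-confidence positive interactions
    scores_neg : scores of high-confidence negative interactions
    An interaction is called positive when its score is >= the cutoff.
    """
    cutoffs = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    points = [(0.0, 0.0)]  # cutoff above every score: nothing is called
    for cutoff in cutoffs:
        tp = sum(s >= cutoff for s in scores_pos)
        fp = sum(s >= cutoff for s in scores_neg)
        # TPR = TP/(TP+FN), FPR = FP/(TN+FP)
        points.append((fp / len(scores_neg), tp / len(scores_pos)))
    return points

# Invented scores, for illustration only.
points = roc_points(scores_pos=[5.0, 3.0, 1.0], scores_neg=[2.0, 0.5, -1.0])
```

Each cutoff gives one point; lowering the cutoff moves along the curve from (0, 0) toward (1, 1).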


Outline

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities

• Consist of a graph (network) and a set of probabilities

• These can be “learned” from the data

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big.

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown

• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  ⇒ 2^N − 1 parameters

77
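
To see how quickly the full table grows, the count 2^N − 1 can be compared with a factored model. The comparison below assumes a naïve Bayes layout (one hidden binary cause with independent binary observations), which is one possible factorization and is used here only as an illustration.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters of a complete joint table over N binary variables."""
    return 2 ** n_binary_variables - 1

def naive_bayes_parameters(n_observations):
    """Free parameters when one hidden binary cause has n independent binary
    effects: 1 for the prior, plus 2 per effect (one per state of the cause).
    """
    return 1 + 2 * n_observations

# Example: one hidden "interaction is real" variable plus four assays.
full = full_joint_parameters(1 + 4)
factored = naive_bayes_parameters(4)
```

For five binary variables the full table needs 31 numbers while the factored model needs 9, and the gap widens exponentially as variables are added.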

Graphical Structure Expresses our Beliefs

P1-P2 REAL

Detected by X1

Detected by Xn

…

Cause (often “hidden”)

Effects (observed)

78

Graphical Structure Expresses our Beliefs

[Two network diagrams. First: P1-P2 REAL with Detected by X1, …, Detected by Xn. Second: P1-P2 REAL with Detected by X1, …, Detected by Xn, plus Membrane and Highly expressed with Detected by Xj, Detected by Xj+1, …]

79

Graphical Structure Expresses our Beliefs

[Two network diagrams. First: P1-P2 REAL with Detected by X1, …, Detected by Xn. Second: P1-P2 REAL with Detected by X1, …, Detected by Xn, plus Membrane and Highly expressed with Detected by Xj, Detected by Xj+1, …]

Naïve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs

[Two network diagrams. First: P1-P2 REAL with Detected by X1, …, Detected by Xn. Second: P1-P2 REAL with Detected by X1, …, Detected by Xn, plus Membrane and Highly expressed with Detected by Xj, Detected by Xj+1, …]

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Network diagram with nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades, GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Network diagram with nodes: Smart, Grade inflation, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Network diagram with nodes: Smart, Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require same number of constraints

85
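
The equivalence of the two formulations can be checked numerically: multiplying the prior by the conditional table reproduces the joint table, and marginalizing the joint recovers the conditional one. The short sketch below uses the numbers from the slide; the variable names are chosen here for illustration.

```python
# Conditional formulation, values as on the slide.
P_S = {"F": 0.7, "T": 0.3}
P_G_GIVEN_S = {"F": {"F": 0.95, "T": 0.05}, "T": {"F": 0.2, "T": 0.8}}

# Joint formulation: P(S, G) = P(S) * P(G | S).
joint = {(s, g): P_S[s] * P_G_GIVEN_S[s][g] for s in "FT" for g in "FT"}

# Going back: P(S) by summing out G, then P(G | S) = P(S, G) / P(S).
marginal_s = {s: sum(joint[(s, g)] for g in "FT") for s in "FT"}
recovered = {s: {g: joint[(s, g)] / marginal_s[s] for g in "FT"} for s in "FT"}
```

Both forms have three free numbers: the joint table has four entries that must sum to 1, and the conditional form has one prior plus one free entry per row of the conditional table.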

Automated Admissions Decisions

[Network diagram with nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

“Explaining Away”

[Network diagram with nodes: Season, S: Sprinkler, R: Rain, Grass wet, Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

“Explaining Away”

[Network diagram with nodes: Coin 1, Coin 2, Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1  C2  Score
H   H   1
T   T   1
H   T   0
T   H   0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.

88
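
The coin example is small enough to check by enumerating the four equally likely outcomes. The sketch below (helper names are illustrative) shows that C2 alone, or the score alone, tells us nothing about C1, while the two together determine it.

```python
from itertools import product

def prob(event, given=lambda c1, c2, score: True):
    """P(event | given) by enumerating the four equally likely coin outcomes."""
    num = den = 0.0
    for c1, c2 in product("HT", repeat=2):
        score = 1 if c1 == c2 else 0  # Score table from the slide
        if given(c1, c2, score):
            den += 0.25
            if event(c1, c2, score):
                num += 0.25
    return num / den

c1_heads = lambda c1, c2, score: c1 == "H"
p_given_c2 = prob(c1_heads, lambda c1, c2, score: c2 == "T")
p_given_score = prob(c1_heads, lambda c1, c2, score: score == 1)
p_given_both = prob(c1_heads, lambda c1, c2, score: score == 1 and c2 == "T")
```

Here `p_given_c2` and `p_given_score` both stay at 0.5, but `p_given_both` drops to 0: once the score is known, learning C2 changes our belief about C1.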

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?

• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior

• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Network diagram with nodes: Coin 1, Coin 2, Score]

p(C1=H) =      p(C1=T) =
p(C2=H) =      p(C2=T) =

C1  C2  Score
H   H
T   T
H   T
T   H

D = (C1, C2, S) = (H,T,0), (H,H,1), …

92
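
When the structure is known and every variable is observed, the maximum-likelihood parameters are simply normalized counts. The sketch below fills in the blank tables from a small data set; the six observations are invented for illustration and extend the two tuples shown on the slide.

```python
from collections import Counter

# Invented observations of (C1, C2, Score); only the first two are on the slide.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1),
        ("T", "H", 0), ("H", "H", 1), ("H", "T", 0)]

def ml_marginal(data, index, value):
    """Maximum-likelihood estimate of P(variable = value): a relative frequency."""
    return sum(row[index] == value for row in data) / len(data)

def ml_cpt_score(data):
    """Maximum-likelihood estimate of P(Score = 1 | C1, C2) for each parent
    configuration seen in the data."""
    parent_counts = Counter((c1, c2) for c1, c2, _ in data)
    joint_counts = Counter(data)
    return {parents: joint_counts[parents + (1,)] / n
            for parents, n in parent_counts.items()}

p_c1_heads = ml_marginal(data, 0, "H")
p_c2_heads = ml_marginal(data, 1, "H")
cpt = ml_cpt_score(data)
```

Parent configurations that never occur in the data get no estimate here, which is one reason priors (maximum posterior estimation) are often preferred with small data sets.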

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Network diagram with nodes: S: Smart, G: Grades, R: GREs, A: Admit]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

= P(S) P(G|S) P(R|S) P(A|G,R) (why?)

(because of conditional independence assumption)

Recall P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Network diagram with nodes: S: Smart, G: Grades, R: GREs, A: Admit]

S:
P(F)  P(T)
0.5   0.5

G:
S  P(F)  P(T)
F  0.9   0.1
T  0.8   0.2

R:
S  P(F)  P(T)
F  0.8   0.2
T  0.2   0.8

A:
G  R  P(F)  P(T)
F  F  0.9   0.1
F  T  0.8   0.2
T  F  0.5   0.5
T  T  0.2   0.8

$P(A=T \mid S=T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S=T)\,P(R \mid S=T)\,P(A=T \mid G,R)$

= (0.8)(0.2)(0.1) [G=F, R=F] + (0.8)(0.8)(0.2) [G=F, R=T] + (0.2)(0.2)(0.5) [G=T, R=F] + (0.2)(0.8)(0.8) [G=T, R=T] = 0.29

Rough grading: being smart doesn’t help very much

GREs are better correlated with intelligence than the grades

97

Inference with Bayes Nets

[Network diagram with nodes: S: Smart, G: Grades, R: GREs, A: Admit]

$P(A=T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=T \mid G,R)$

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) calculated on previous slide = 0.29
P(A=T|S=F) calculated analogously to P(A=T|S=T) = 0.14

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes Rule: = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$P(R=F \mid A=F) = \frac{P(R=F, A=F)}{P(A=F)} = \frac{\sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} P(S)\, P(G \mid S)\, P(R=F \mid S)\, P(A=F \mid G, R=F)}{P(A=F)}$

$P(A=F) = \sum_{S} \sum_{G} \sum_{R} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A=F \mid G,R) = 1 - P(A=T)$ (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.94 / 0.55 = 1.7
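The "tedious but straightforward" computation can be automated by enumerating all 16 joint states. This is a sketch assuming the conditional probability tables of the worked example; the `probability` helper and its dictionary-based interface are illustrative, and results computed from the tables may differ slightly from hand-rounded figures.

```python
from itertools import product

P_S_TRUE = 0.5
P_G_TRUE = {False: 0.1, True: 0.2}   # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}   # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)

def bernoulli(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bernoulli(P_S_TRUE, s) * bernoulli(P_G_TRUE[s], g)
            * bernoulli(P_R_TRUE[s], r) * bernoulli(P_A_TRUE[(g, r)], a))

def probability(query, evidence):
    """P(query | evidence); both map variable names 'S','G','R','A' to booleans."""
    numerator = denominator = 0.0
    for s, g, r, a in product([False, True], repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if all(world[name] == value for name, value in evidence.items()):
            p = joint(s, g, r, a)
            denominator += p
            if all(world[name] == value for name, value in query.items()):
                numerator += p
    return numerator / denominator

p_bad_grades = probability({"G": False}, {"A": False})
p_bad_gres = probability({"R": False}, {"A": False})
```

Enumeration is exponential in the number of variables, so it only works for toy networks like this one.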

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Figure: Bayesian network. Cause (often "hidden"): P1-P2 REAL. Effects (observed): Detected by X1, …, Detected by Xn]

Properties of real interactions: correlated expression
Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.
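A rough illustration of the observation behind EPR: compare expression-profile distances for a reference set of interacting pairs against other pairs. The profiles and pair lists are hypothetical, and Euclidean distance is an assumed stand-in for d; this is not the EPR index itself.

```python
from math import dist
from statistics import mean

# Hypothetical mRNA expression profiles (one value per condition).
profiles = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [1.1, 2.1, 2.9, 4.2],
    "P3": [4.0, 3.0, 2.0, 1.0],
    "P4": [3.8, 3.1, 2.2, 0.9],
}
reference_pairs = [("P1", "P2"), ("P3", "P4")]   # stand-in for the INT set
other_pairs = [("P1", "P3"), ("P2", "P4")]       # stand-in for random pairs

def expression_distance(a, b):
    """Euclidean distance between two expression profiles (assumed form of d)."""
    return dist(profiles[a], profiles[b])

def mean_distance(pairs):
    return mean(expression_distance(a, b) for a, b in pairs)

reference_mean = mean_distance(reference_pairs)
other_mean = mean_distance(other_pairs)
```

An experimental set whose distance distribution looks more like the reference set than like random pairs is likely to contain a higher fraction of true interactions.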

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profiles of protein pairs; one pattern is labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.
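One simple way to compare phylogenetic profiles is to score how often two proteins are present or absent in the same genomes. The sketch below uses the Jaccard index on hypothetical presence/absence vectors; it is a stand-in for illustration and not the scoring method of Cokus et al.

```python
def jaccard(profile_a, profile_b):
    """Shared presences divided by presences in either profile."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0

# Hypothetical presence (1) / absence (0) of three proteins across six genomes.
protein_x = [1, 0, 1, 1, 0, 1]
protein_y = [1, 0, 1, 1, 0, 1]   # gained and lost together with X
protein_z = [0, 1, 1, 0, 1, 0]   # largely unrelated pattern

similarity_xy = jaccard(protein_x, protein_y)
similarity_xz = jaccard(protein_x, protein_z)
```

Pairs with matching profiles, such as X and Y here, are the candidates for functional linkage.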

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

[Figure from Jansen et al. (2003)]

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/
Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

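likelihood ratio $= \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$
if > 1 classify as true; if < 1 classify as false

log likelihood ratio $= \log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function $= \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i=1}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$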

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

[Figure annotations: 1114 / 2150 and 81924 / 573734]

Likelihood = L = $\frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$
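As a worked instance of L = P(f|pos) / P(f|neg), the sketch below assumes the two figure annotations are counts of gold-standard positives (1114 of 2150) and negatives (81924 of 573734) that share one feature value. That reading, and the function name, are assumptions.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), with each probability estimated from counts."""
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as read from the slide (assumed to be positives and negatives with feature f).
L = likelihood_ratio(1114, 2150, 81924, 573734)
```

Under this reading L is about 3.6: a pair with this feature value is several times more likely among gold-standard positives than among negatives.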

111

112


113

Fully connected → compute probabilities for all 16 possible combinations
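With four binary evidence sources, a fully connected model needs one likelihood ratio per combination of outcomes. This sketch estimates all 16 from gold-standard rows; the data are hypothetical, and the pseudocount is an added assumption to keep ratios finite when a combination is rare.

```python
from collections import Counter
from itertools import product

def joint_likelihood_ratios(pos_rows, neg_rows, n_features=4, pseudocount=1.0):
    """Map each combination of binary outcomes to P(combo | pos) / P(combo | neg)."""
    combos = list(product([0, 1], repeat=n_features))
    pos_counts, neg_counts = Counter(pos_rows), Counter(neg_rows)
    pos_total = len(pos_rows) + pseudocount * len(combos)
    neg_total = len(neg_rows) + pseudocount * len(combos)
    return {
        combo: ((pos_counts[combo] + pseudocount) / pos_total)
        / ((neg_counts[combo] + pseudocount) / neg_total)
        for combo in combos
    }

# Hypothetical gold-standard rows: one tuple of four detection outcomes per protein pair.
pos_rows = [(1, 1, 1, 1)] * 6 + [(1, 1, 0, 1)] * 3 + [(0, 0, 0, 0)] * 1
neg_rows = [(0, 0, 0, 0)] * 40 + [(1, 0, 0, 0)] * 8 + [(1, 1, 1, 1)] * 2
table = joint_likelihood_ratios(pos_rows, neg_rows)
```

Many of the 16 cells are supported by very few pairs, so individual ratios can be unreliable when the gold standard is small.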


114

Interpret with caution as numbers are small


115

[Figure annotation: TP = FP]

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

Cutoff applied to $\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement
Tuncbag N, Keskin O, Nussinov R, Gursoy A
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

10

PRISMrsquos Rationale

• There are a limited number of protein "architectures".
• Protein structures can interact via similar architectural motifs even if the overall structures differ.
• Find particular surface regions of proteins that are spatially similar to the complementary partners of a known interface.

N Tuncbag 11

• Two components
  – rigid-body structural comparisons of target proteins to known template protein-protein interfaces
  – flexible refinement using a docking energy function
• Evaluate using structural similarity and evolutionary conservation of putative binding residue hot spots

PRISMrsquos Rationale

N Tuncbag 12

Subtilisin and its inhibitors

Although the global folds of subtilisin's partners are very different, the binding regions are structurally very conserved.

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 13

Hotspots

Figure from Clackson & Wells (1995)

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/
Source: Clackson, Tim, and James A. Wells. "A Hot Spot of Binding Energy in a Hormone-Receptor Interface." Science 267, no. 5196 (1995): 383-6.

• Fewer than 10% of the residues at an interface contribute more than 2 kcal/mol to binding
• Hot spots
  – are rich in Trp, Arg and Tyr
  – occur in pockets on the two proteins that have complementary shapes and distributions of charged and hydrophobic residues
  – can include buried charged residues far from solvent
  – O-ring structure excludes solvent from interface

http://onlinelibrary.wiley.com/doi/10.1002/prot.21396/full

Ikb

NFkB

N Tuncbag

1 Identify interface of template (distance cutoff)

Courtesy of Nurcan Tuncbag Used with permission 17


[Figure labels: Ikb, NFkB, ASPP2]

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test:
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement

N. Tuncbag. Courtesy of Nurcan Tuncbag. Used with permission.
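Step 1 can be sketched as a pairwise distance test between the two chains of a template complex. The coordinates, residue identifiers and the 6.0 Å cutoff below are hypothetical; the slides do not state the cutoff or atom representation PRISM uses.

```python
from math import dist

def interface_residues(chain_a, chain_b, cutoff=6.0):
    """Return residues of each chain lying within `cutoff` of any residue on the other.

    chain_a, chain_b: dict mapping residue id -> (x, y, z) of one representative atom.
    """
    interface_a, interface_b = set(), set()
    for res_a, xyz_a in chain_a.items():
        for res_b, xyz_b in chain_b.items():
            if dist(xyz_a, xyz_b) <= cutoff:
                interface_a.add(res_a)
                interface_b.add(res_b)
    return interface_a, interface_b

# Hypothetical coordinates in angstroms.
chain_a = {"A10": (0.0, 0.0, 0.0), "A11": (3.8, 0.0, 0.0), "A50": (30.0, 0.0, 0.0)}
chain_b = {"B5": (0.0, 5.0, 0.0), "B6": (20.0, 20.0, 0.0)}
half_interface_a, half_interface_b = interface_residues(chain_a, chain_b)
```

The two returned sets correspond to the half-interfaces against which query surfaces are aligned in step 2.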

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Tuncbag Nurcan Attila Gursoy et al Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary andStructural Similarities at Interfaces using PRISM Nature Protocols 6 no 9 (2011) 1341-54

22

Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

∆Gcalc= - 2954 kcalmol ∆Gcalc= - 3751 kcalmol ∆Gcalc= - 4406 kcalmol

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 23


Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

PrePPI scores potential templates without building a homology model. Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

25

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NA_i, NB_i) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

26


[Figure: template complex of NA and NB]

1. Identify interacting residues in template complex (called NA1, NB3 in the rest of the paper)
2. Predict interacting residues for the homology models

© Source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/

Evaluate based on five measures:
• SIM: structural similarity between NA and MA, and between NB and MB
• SIZ (number) and COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

© Source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/

"The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces)."

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 36

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NA_i, NB_i) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

37

We will examine Bayesian classifiers soon.

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder, Nature 415, 123-124 (10 January 2002), doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002); Ho, Y. et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations:
• overexpression/tagging can influence results
• only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels)
Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5): 655-62

How does this compare to mass-spec based approaches?

• Does not require purification – will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard and Experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives]

[Venn diagram: Gold standard, Experiment 1 and Experiment 2. Overlap of all three: true positives from gold standard]

48

Data Integration

[Venn diagram: Gold standard, Experiment 1 and Experiment 2, with regions I to V]

I = true positives from gold standard
II = consensus true positives
Fraction of consensus present in gold standard = I/II

Data unique to Experiment 1 is a mix of true and false positives; III is the part of it that is in the gold standard.

Define IV = true positives, V = false positives

Assume I/II = III/IV

52

Estimated Error Rates

[Figure: Venn diagram with regions I, II and III; bars showing false positives and true positives]

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte, Genome Biology 2006, 7:120, doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I / II = III / IV
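The assumption I/II = III/IV can be solved for IV, the number of true positives among interactions unique to one experiment. The region sizes below are hypothetical and the function names are illustrative.

```python
def estimate_unique_true_positives(region_i, region_ii, region_iii):
    """Solve I / II = III / IV for IV."""
    return region_iii * region_ii / region_i

def estimate_false_positive_fraction(region_i, region_ii, region_iii, unique_total):
    """Fraction of the experiment-specific interactions estimated to be false positives."""
    true_positives = min(estimate_unique_true_positives(region_i, region_ii, region_iii),
                         unique_total)
    return 1.0 - true_positives / unique_total

# Hypothetical counts: I = 200, II = 500, III = 100, and 1000 interactions unique to one experiment.
region_iv = estimate_unique_true_positives(200, 500, 100)
false_positive_fraction = estimate_false_positive_fraction(200, 500, 100, 1000)
```

With these made-up counts, 250 of the 1000 unique interactions would be estimated as real, a false-positive fraction of 0.75.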

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002), doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

Bayes Rule

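$P(\text{true\_PPI} \mid \text{Data}) = \frac{P(\text{true\_PPI})\, P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data})}$

Here $P(\text{true\_PPI} \mid \text{Data})$ is the posterior, $P(\text{true\_PPI})$ is the prior and $P(\text{Data} \mid \text{true\_PPI})$ is the likelihood.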

58

Naiumlve Bayes Classification

For each class (true_PPI, false_PPI): posterior ∝ prior × likelihood

likelihood ratio = ratio of posterior probabilities $= \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

How do we compute this?

log likelihood ratio $= \log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function $= \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

61

We assume the observations are independent (we'll see how to handle dependence soon).

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol. Cell. Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function $= \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i=1}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$
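Under the independence assumption the log of the product becomes a sum of per-assay log ratios. The sketch below scores a protein pair this way; the assay names and detection probabilities are hypothetical placeholders for values that would be estimated from gold-standard positives and negatives.

```python
from math import log

# Hypothetical (P(detected | true_PPI), P(detected | false_PPI)) for each assay.
ASSAYS = {
    "two_hybrid": (0.30, 0.02),
    "mass_spec": (0.50, 0.05),
    "co_expression": (0.60, 0.30),
}

def log_likelihood_ratio(observations):
    """Sum of per-assay log likelihood ratios.

    observations: dict mapping assay name -> True (detected) or False (not detected).
    Assays missing from the dict contribute nothing to the score.
    """
    score = 0.0
    for assay, detected in observations.items():
        p_true, p_false = ASSAYS[assay]
        if detected:
            score += log(p_true / p_false)
        else:
            score += log((1.0 - p_true) / (1.0 - p_false))
    return score

seen_everywhere = log_likelihood_ratio(
    {"two_hybrid": True, "mass_spec": True, "co_expression": True})
seen_nowhere = log_likelihood_ratio(
    {"two_hybrid": False, "mass_spec": False, "co_expression": False})
```

Skipping assays with no observation is one simple way to accommodate missing data, since each term enters the sum independently.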

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002), doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

65

Collins et al., Mol. Cell. Proteomics 2007

ROC curve
[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)]

True positives = ?
True negatives = ?

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al., Mol. Cell. Proteomics 2007

ROC curve
[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression
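The two axes of the ROC curve can be computed by sweeping a score cutoff over the gold-standard pairs. This is a minimal sketch with hypothetical scores and pair labels; the function name is illustrative.

```python
def roc_points(scores, positives, negatives):
    """Return (FP/(TN+FP), TP/(TP+FN)) for each score cutoff, strictest cutoff first.

    scores: dict mapping pair id -> ranking score.
    positives, negatives: sets of pair ids from the gold standard.
    """
    points = []
    for cutoff in sorted(set(scores.values()), reverse=True):
        predicted = {pair for pair, score in scores.items() if score >= cutoff}
        true_positives = len(predicted & positives)
        false_positives = len(predicted & negatives)
        points.append((false_positives / len(negatives), true_positives / len(positives)))
    return points

# Hypothetical scores for four gold-standard pairs.
scores = {"a-b": 3.0, "c-d": 2.0, "e-f": 1.0, "g-h": 0.5}
positives = {"a-b", "e-f"}
negatives = {"c-d", "g-h"}
curve = roc_points(scores, positives, negatives)
```

Each cutoff yields one point; lowering the cutoff moves up and to the right along the curve.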

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al., Mol. Cell. Proteomics 2007

ROC curve
[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated"]

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian networks are a tool for reasoning with probabilities
• They consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)


75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables ⇒ 2^N states
  – only one constraint (sum of all probabilities = 1)
  ⇒ 2^N − 1 parameters
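To see how quickly the full table grows, the count above can be compared with a factored model. The sketch below assumes, purely for contrast, a network in which one hidden binary cause has n observed binary effects that are independent given the cause; the function names are illustrative.

```python
def full_joint_params(n_variables):
    """Free parameters in a full joint table over binary variables: 2^N - 1."""
    return 2 ** n_variables - 1


def naive_bayes_params(n_effects):
    """One prior for the hidden cause plus P(effect | cause) for each of its
    two states, per observed effect (assumed structure, for comparison only)."""
    return 1 + 2 * n_effects


# One hidden cause plus 4, 10 and 20 observed effects.
comparison = {n: (full_joint_params(n + 1), naive_bayes_params(n)) for n in (4, 10, 20)}
```

With four observed effects the full table needs 31 numbers while the factored model needs 9, and the gap widens exponentially as effects are added.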

77

Graphical Structure Expresses our Beliefs

[Diagram: a cause node (often “hidden”), “P1-P2 REAL”, with arrows to the observed effect nodes “Detected by X1”, …, “Detected by Xn”.]

78

Graphical Structure Expresses our Beliefs

[Diagram: two alternative networks. In the first, “P1-P2 REAL” points to “Detected by X1”, …, “Detected by Xn”. The second adds the nodes “Membrane” and “Highly expressed” alongside the detections “Detected by Xj”, “Detected by Xj+1”, ….]

79

Graphical Structure Expresses our Beliefs

Naïve Bayes assumes all observations are independent, given the hidden cause.

But some observations may be coupled.

80

Graphical Structure Expresses our Beliefs

• The graphical structure can be decided in advance, based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian network we don’t need the full probability distribution. A node is conditionally independent of its other ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Diagram: Bayesian network with nodes Smart, Grade inflation, Grades, GREs, and Admit.]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint probability:

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional probability:

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

The formulations are equivalent, and both require the same number of constraints.

85
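The equivalence of the two formulations can be checked directly: multiplying the prior by the conditional table reproduces the joint table, and dividing the joint by its marginal recovers the conditional. A minimal sketch using the numbers on this slide (the variable and function names are illustrative):

```python
# P(S) and P(G | S) as given on the slide; True/False stand for T/F.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint table: P(S, G) = P(S) * P(G | S)
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}


def conditional_from_joint(joint_table, s, g):
    """Recover P(G=g | S=s) from the joint table by dividing by P(S=s)."""
    p_s_marginal = joint_table[(s, False)] + joint_table[(s, True)]
    return joint_table[(s, g)] / p_s_marginal
```

Both representations carry three free numbers: the joint table has four entries that must sum to 1, and the conditional form has one prior plus one free entry per row.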

Automated Admissions Decisions

[Diagram: the admissions network, with nodes Smart, Grade inflation, Grades, GREs, and Admit.]

In a Bayesian network we don’t need the full probability distribution. The joint probability factors into terms that each depend only on a node and its “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

“Explaining Away”

[Diagram: network with nodes Season, Sprinkler (S), Rain (R), Grass wet, and Slippery.]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about the sprinkler.

87

“Explaining Away”

[Diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1  C2  Score
H   H   1
T   T   1
H   T   0
T   H   0

Does the probability that C1 = H depend on whether C2 = T? Not on its own. But if we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
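The coin example is small enough to check by enumerating all four equally likely outcomes. The sketch below is an illustration only; the helper `prob` is introduced here and relies on the coins being fair, so that every outcome has the same weight.

```python
from fractions import Fraction
from itertools import product

# All (C1, C2, Score) outcomes; Score is 1 when the two coins match.
outcomes = [(c1, c2, 1 if c1 == c2 else 0) for c1, c2 in product("HT", repeat=2)]


def prob(event, given=lambda o: True):
    """P(event | given) by counting equally likely outcomes."""
    kept = [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in kept if event(o)), len(kept))


def c1_is_heads(o):
    return o[0] == "H"


# Without the score, C2 tells us nothing about C1.
p_c1_given_c2 = prob(c1_is_heads, lambda o: o[1] == "T")
# Knowing only the score also leaves C1 at 1/2 ...
p_c1_given_score = prob(c1_is_heads, lambda o: o[2] == 1)
# ... but the score and C2 together determine C1.
p_c1_given_both = prob(c1_is_heads, lambda o: o[2] == 1 and o[1] == "T")
```

Observing the common effect (the score) is what couples the two otherwise independent causes.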

88

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Diagram: Coin 1 → Score ← Coin 2]

Parameters to learn:

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1  C2  Score
H   H   ?
T   T   ?
H   T   ?
T   H   ?

D = {(C1, C2, S)} = {(H, T, 0), (H, H, 1), …}
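When every variable is observed, the maximum-likelihood parameters are just normalized counts. The sketch below assumes a small made-up data set D (the slide lists only its first two entries) and illustrative variable names.

```python
from collections import Counter

# Hypothetical training data: (C1, C2, Score) triples.
data = [
    ("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0),
    ("H", "H", 1), ("H", "T", 0), ("T", "H", 0), ("H", "H", 1),
]
n = len(data)

# Maximum-likelihood estimates for the two root nodes: observed frequencies.
p_c1_heads = sum(1 for c1, _, _ in data if c1 == "H") / n
p_c2_heads = sum(1 for _, c2, _ in data if c2 == "H") / n

# Maximum-likelihood estimate of P(Score=1 | C1, C2): a count ratio per parent state.
pair_counts = Counter((c1, c2) for c1, c2, _ in data)
score_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
p_score_is_one = {pair: score_counts[pair] / pair_counts[pair] for pair in pair_counts}
```

With hidden variables or missing values the counts are no longer available directly, which is where iterative searches such as EM come in.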

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs); G → A (Admit), R → A]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

= P(S) P(G|S) P(R|S) P(A|G,R) (why?)

(because of the conditional independence assumptions)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs); G → A (Admit), R → A]

P(S):
P(F)  P(T)
0.5   0.5

P(G|S):
S   P(F)  P(T)
F   0.9   0.1
T   0.8   0.2

P(R|S):
S   P(F)  P(T)
F   0.8   0.2
T   0.2   0.8

P(A|G,R):
G   R   P(F)  P(T)
F   F   0.9   0.1
F   T   0.8   0.2
T   F   0.5   0.5
T   T   0.2   0.8

P(A=T|S=T) = Σ_{G∈{F,T}} Σ_{R∈{F,T}} P(G|S=T) P(R|S=T) P(A=T|G,R)

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
(terms in the order G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn’t help very much.

GREs are better correlated with intelligence than the grades.
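The double sum on this slide can be reproduced mechanically. The sketch below encodes the three conditional tables as Python dictionaries (True/False stand for T/F); the names are illustrative.

```python
# Conditional probability tables from the slide.
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
# P(A=T | G, R), keyed by (G, R).
P_ADMIT = {(False, False): 0.1, (False, True): 0.2, (True, False): 0.5, (True, True): 0.8}


def p_admit_given_smart(s):
    """P(A=T | S=s): sum over the unobserved G and R."""
    return sum(
        P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * P_ADMIT[(g, r)]
        for g in (False, True)
        for r in (False, True)
    )


p_if_smart = p_admit_given_smart(True)       # about 0.29, as on the slide
p_if_not_smart = p_admit_given_smart(False)  # same sum with the S=F rows
```

Only the rows of P(G|S) and P(R|S) change between the two calls, which is why the S=F case is described as analogous.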

97

Inference with Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs); G → A (Admit), R → A]

P(A=T) = Σ_S Σ_G Σ_R P(S) P(G|S) P(R|S) P(A=T|G,R)

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) ≈ 0.23

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on the previous slide)
P(A=T|S=F) = 0.16 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes’ rule: = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs); G → A (Admit), R → A]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F) = [Σ_S Σ_G P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_S Σ_G Σ_R P(S) P(G|S) P(R|S) P(A=F|G,R) (as before)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) = 0.94 / 0.55 ≈ 1.7
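The "tedious but straightforward" sums can be delegated to a brute-force enumeration over all 16 states of (S, G, R, A). This is a sketch using the tables of the worked example; `query` is a helper introduced here, not a library function.

```python
from itertools import product

P_S = {False: 0.5, True: 0.5}
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_ADMIT = {(False, False): 0.1, (False, True): 0.2, (True, False): 0.5, (True, True): 0.8}


def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a = P_ADMIT[(g, r)] if a else 1.0 - P_ADMIT[(g, r)]
    return P_S[s] * P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * p_a


def query(target, evidence):
    """P(target | evidence); both are dicts over the names 'S', 'G', 'R', 'A'."""
    numerator = denominator = 0.0
    for s, g, r, a in product((False, True), repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(s, g, r, a)
            denominator += p
            if all(world[k] == v for k, v in target.items()):
                numerator += p
    return numerator / denominator


p_smart_given_admitted = query({"S": True}, {"A": True})
p_bad_gres_given_rejected = query({"R": False}, {"A": False})
p_bad_grades_given_rejected = query({"G": False}, {"A": False})
```

Enumeration scales as 2^N, so it is only practical for toy networks like this one; larger networks need the factorization to be exploited explicitly.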

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: a cause node (often “hidden”), “P1-P2 REAL”, with arrows to the observed effect nodes “Detected by X1”, …, “Detected by Xn”.]

101

Properties of real interactions: correlated expression
Expression Profile Reliability (EPR)

Deane et al. Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

More likely to interact

Which pattern below is more likely to represent a pair of interacting proteins?

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data


109

Ratio of posterior probabilities: if > 1 classify as true, if < 1 classify as false.

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

$$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i=1}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$$

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

Likelihood ratio:

$$L = \frac{P(f \mid \mathrm{pos})}{P(f \mid \mathrm{neg})} = \frac{1114/2150}{81924/573734} \approx 3.6$$
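The likelihood ratio for one feature value is a ratio of two frequencies in the gold standard. The sketch below assumes the run-together digits on this slide read as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives; that reading, and the function name, are assumptions.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (assumed split of the garbled digits).
L = likelihood_ratio(1114, 2150, 81924, 573734)
```

A value above 1 means the feature value is seen more often among gold-standard positives than negatives, so observing it raises the odds that a pair interacts.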

111

112


113

Fully connected rarr Compute probabilities for all 16 possible combinations


114

Interpret with caution as numbers are small


115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$


116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


PRISM’s Rationale

There is a limited number of protein “architectures”.

Protein structures can interact via similar architectural motifs even if the overall structures differ.

Find particular surface regions of proteins that are spatially similar to the complementary partners of a known interface.

N Tuncbag 11

PRISM’s Rationale

• Two components
  – rigid-body structural comparisons of target proteins to known template protein-protein interfaces
  – flexible refinement using a docking energy function
• Evaluate using structural similarity and evolutionary conservation of putative binding residue hot spots

N Tuncbag 12

Subtilisin and its inhibitors

Although the global folds of subtilisin’s partners are very different, the binding regions are structurally very conserved.

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 13

Hotspots

Figure from Clackson amp Wells (1995)

copy American Association for the Advancement of Science All rights reserved This content is excludedfrom our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Clackson Tim and James A Wells A Hot Spot of Binding Energy in a Hormone-ReceptorInterface Science 267 no 5196 (1995) 383-6

14


15

• Fewer than 10% of the residues at an interface contribute more than 2 kcal/mol to binding
• Hot spots
  – rich in Trp, Arg and Tyr
  – occur on pockets on the two proteins that have complementary shapes and distributions of charged and hydrophobic residues
  – can include buried charged residues far from solvent
  – O-ring structure excludes solvent from interface
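To get a feel for the 2 kcal/mol threshold, a change in binding free energy can be converted into a fold-change in the dissociation constant through the standard relation ΔΔG = RT ln(Kd,mutant / Kd,wild-type). The sketch below assumes a temperature of 298.15 K, which the slides do not specify; the function name is illustrative.

```python
import math

R_KCAL_PER_MOL_K = 1.987e-3  # gas constant in kcal/(mol*K)


def fold_change_in_kd(ddg_kcal_per_mol, temperature_k=298.15):
    """Fold-change in Kd implied by a loss of binding free energy ddG."""
    return math.exp(ddg_kcal_per_mol / (R_KCAL_PER_MOL_K * temperature_k))


# A residue contributing 2 kcal/mol: losing it weakens binding roughly 30-fold.
fold = fold_change_in_kd(2.0)
```

Under this assumption a single hot-spot residue accounts for more than an order of magnitude in affinity, which is why so few residues dominate the binding energy.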

http://onlinelibrary.wiley.com/doi/10.1002/prot.21396/full

16

[Figure: IκB / NF-κB template complex (N. Tuncbag)]

1. Identify interface of template (distance cutoff)

Courtesy of Nurcan Tuncbag. Used with permission.

19

[Figure: IκB, NF-κB, and ASPP2 (N. Tuncbag)]

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement

21
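Step 1 above, identifying the template interface with a distance cutoff, can be sketched as follows. This is not PRISM's implementation: the 5 Å cutoff, the data layout (residue id mapped to a list of atom coordinates), and the function name are all assumptions made for illustration.

```python
import math


def interface_residues(chain_a, chain_b, cutoff=5.0):
    """Return the residues of each chain that have any atom within `cutoff`
    angstroms of an atom in the other chain.

    Each chain is a dict: residue id -> list of (x, y, z) atom coordinates.
    """
    in_a, in_b = set(), set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            if any(math.dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
                in_a.add(res_a)
                in_b.add(res_b)
    return in_a, in_b


# Toy coordinates, one atom per residue.
chain_a = {"A10": [(0.0, 0.0, 0.0)], "A11": [(20.0, 0.0, 0.0)]}
chain_b = {"B5": [(3.0, 0.0, 0.0)], "B6": [(40.0, 0.0, 0.0)]}
half_interface_a, half_interface_b = interface_residues(chain_a, chain_b)
```

The two returned sets correspond to the two half-interfaces that the query surface is aligned against in step 2.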

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Tuncbag Nurcan Attila Gursoy et al Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary andStructural Similarities at Interfaces using PRISM Nature Protocols 6 no 9 (2011) 1341-54

22

Predicted p27 Protein Partners

[Figure: predicted complexes; labels: p27, Cdk2, CycA, ERCC1, TFIIH, Cks1]

∆Gcalc = −29.54 kcal/mol   ∆Gcalc = −37.51 kcal/mol   ∆Gcalc = −44.06 kcal/mol

N. Tuncbag. Courtesy of Nurcan Tuncbag. Used with permission.

23

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement. Tuncbag N, Keskin O, Nussinov R, Gursoy A. http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale. Zhang et al. http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012), doi:10.1038/nature11503

PrePPI: scores potential templates without building a homology model.

Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

25

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

30

[Figure: template complex of NA and NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

Evaluate based on five measures
• SIM: structural similarity between NA and MA, and between NB and MB
• SIZ (number), COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

© Source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

35

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012), doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002), doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002); Ho, Y. et al. Nature 415, 180-183 (2002)

Detecting protein-protein interactions

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• overexpression/tagging can influence results
• only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels). Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited Used with permissionSource Gavin Anne-Claude Markus Boumlsche et al Functional Organization of the Yeast Proteome bySystematic Analysis of Protein Complexes Nature 415 no 6868 (2002) 141-7

43

Biotechniques 2008 Apr; 44(5): 655-62. Ratushny V, Golemis E.

Yeast two-hybrid

How does this compare to mass-spec based approaches?

• Does not require purification: will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard overlapping Experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

47

Error Rates

[Venn diagram: Gold standard, Experiment 1, and Experiment 2. The region shared with the gold standard holds the true positives from the gold standard.]

48

Data Integration

[Venn diagram: Gold standard, Experiment 1, Experiment 2]

I = true positives from gold standard
II = consensus true positives
Fraction of consensus present in gold standard = I/II

The interactions unique to Experiment 1 are a mix of true and false positives:
III = unique interactions present in the gold standard
IV = true positives
V = false positives

Assume I/II = III/IV

52

Estimated Error Rates

[Figure: estimated false positives and true positives, with regions I, II and III marked]

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120, doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV
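The proportionality I/II = III/IV can be solved for the one unknown, IV, the number of true positives among the interactions unique to one experiment. The sketch below uses hypothetical counts, not the values from Hart et al., and the function names are illustrative.

```python
def estimate_unique_true_positives(region_i, region_ii, region_iii):
    """Solve I/II = III/IV for IV."""
    return region_iii * region_ii / region_i


def estimate_false_positive_fraction(unique_total, region_i, region_ii, region_iii):
    """Fraction of the unique interactions estimated to be false: V / (IV + V)."""
    region_iv = estimate_unique_true_positives(region_i, region_ii, region_iii)
    return (unique_total - region_iv) / unique_total


# Hypothetical counts: 400 consensus interactions, 100 of them in the gold
# standard; 1000 interactions unique to one experiment, 50 in the gold standard.
iv = estimate_unique_true_positives(region_i=100, region_ii=400, region_iii=50)
fp_fraction = estimate_false_positive_fraction(1000, 100, 400, 50)
```

The estimate stands or falls with the no-bias assumption: if the gold standard favors the kind of interaction one experiment detects, I/II and III/IV will not be equal.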

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53


54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002), doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

Bayes Rule

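$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\;\overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$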

58

Naïve Bayes Classification

posterior = prior × likelihood / P(Data)

likelihood ratio = ratio of posterior probabilities = $\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})}$

if > 1 classify as true; if < 1 classify as false

How do we compute this?

59


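likelihood ratio = $\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\,P(\text{Data} \mid \text{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = $\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$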

61

We assume the observations are independent (we’ll see how to handle dependence soon)

Ranking function = $\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

62


We assume the observations are independent (we’ll see how to handle dependence soon).

We can compute these terms if we have a set of high-confidence positive and negative interactions.

Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec see Collins et al., Mol Cell Proteomics 2007, http://www.mcponline.org/content/6/3/439.long

Ranking function = $\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i=1}^{M}\log\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$
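A minimal sketch of the ranking function under the independence assumption may help here. The assay names, the detection rates, and the candidate pairs below are all hypothetical; in practice the rates would be estimated from high-confidence positive and negative sets. Assays with no observation for a pair are simply left out of the sum.

```python
import math

# Hypothetical rates: assay -> (P(detected | true PPI), P(detected | false PPI)).
ASSAY_RATES = {
    "two_hybrid": (0.30, 0.02),
    "mass_spec": (0.50, 0.05),
    "co_expression": (0.60, 0.20),
}


def log_likelihood_ratio(observations, rates=ASSAY_RATES):
    """Sum of per-assay log likelihood ratios for one candidate protein pair.

    observations maps assay name -> True (detected) or False (not detected).
    """
    total = 0.0
    for assay, detected in observations.items():
        p_true, p_false = rates[assay]
        if not detected:
            # Use the probabilities of a negative result instead.
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        total += math.log(p_true / p_false)
    return total


# Hypothetical candidate pairs.
candidates = {
    "P1-P2": {"two_hybrid": True, "mass_spec": True, "co_expression": True},
    "P3-P4": {"two_hybrid": True, "mass_spec": False, "co_expression": False},
    "P5-P6": {"two_hybrid": False, "mass_spec": False, "co_expression": False},
}
ranked = sorted(candidates, key=lambda pair: log_likelihood_ratio(candidates[pair]), reverse=True)
print(ranked)
```

Because each assay contributes its own term, a pair does not have to be detected by every assay to rank highly; one strong assay can outweigh two weak negative results.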

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al Mol Cell Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al Mol Cell Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50
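The two axes of the ROC curve can be computed directly from a list of scored pairs with gold-standard labels. The sketch below is a generic illustration, not the procedure of Collins et al.; the function names, scores, and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FP/(TN+FP), TP/(TP+FN)) points as the score cutoff is lowered.

    labels[i] is True for a gold-standard positive, False for a gold-standard negative.
    """
    n_pos = sum(1 for is_pos in labels if is_pos)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda item: -item[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Pairs with tied scores cross the cutoff together.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores for six candidate pairs.
scores = [6.1, 4.0, 1.4, 0.3, -0.5, -1.7]
labels = [True, True, False, True, False, False]
points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points, auc)
```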

Collins et al Mol Cell Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Labels on the plot: “9074 interactions”, “Literature curated (excluding 2-hybrid)”, “Error rate equivalent to literature curated”.]

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1)
  => $2^N - 1$ parameters
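To see how quickly the full table grows, and how much a graph structure can save, the sketch below counts free parameters for binary variables. The network used is a hypothetical naive Bayes structure (one hidden "real" node with four detection assays as children); the node names are illustrative only.

```python
def full_joint_parameters(n_variables):
    """Free parameters of a full joint table over binary variables: 2^N - 1."""
    return 2 ** n_variables - 1


def bayes_net_parameters(parents):
    """Free parameters of a Bayesian network over binary variables.

    parents maps each node to the list of its parent nodes. A binary node needs
    one free probability for every configuration of its parents.
    """
    return sum(2 ** len(parent_list) for parent_list in parents.values())


# Hypothetical structure: hidden cause "real" with four observed effects.
naive_bayes = {
    "real": [],
    "X1": ["real"],
    "X2": ["real"],
    "X3": ["real"],
    "X4": ["real"],
}
print(full_joint_parameters(len(naive_bayes)))  # full table over 5 binary variables
print(bayes_net_parameters(naive_bayes))        # factored model
```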

77

Graphical Structure Expresses our Beliefs

[Diagram: node “P1-P2 REAL” with arrows to “Detected by X1” … “Detected by Xn”]

Cause (often “hidden”): P1-P2 REAL
Effects (observed): Detected by X1, …, Detected by Xn

78


Graphical Structure Expresses our Beliefs

[Diagram, left: “P1-P2 REAL” with children “Detected by X1” … “Detected by Xn”. Right: the same network with the additional nodes “Membrane” and “Highly expressed” alongside “Detected by Xj”, “Detected by Xj+1”, …]

Naïve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs

[Diagram, left: “P1-P2 REAL” with children “Detected by X1” … “Detected by Xn”. Right: the same network with the additional nodes “Membrane” and “Highly expressed” alongside “Detected by Xj”, “Detected by Xj+1”, …]

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Network nodes: Grade inflation, Smart, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).
  Given Grades, GREs: will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).
  You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Network nodes: Grade inflation, Smart, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs).

We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Network: Smart → Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints

85
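The equivalence of the two formulations can be checked numerically: multiplying the prior P(S) by the conditional P(G|S) reproduces every entry of the joint table. The sketch below uses the numbers from the slide; the variable names are illustrative. It also inverts the table to get P(S=T | G=T), which the joint makes easy to read off.

```python
# Values from the slide: prior on Smart and conditional table for Grades.
P_S = {False: 0.7, True: 0.3}
P_G_GIVEN_S = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint table P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): P_S[s] * P_G_GIVEN_S[s][g]
    for s in (False, True)
    for g in (False, True)
}

# Marginal P(G=T) and the reverse conditional P(S=T | G=T).
p_good_grades = joint[(False, True)] + joint[(True, True)]
p_smart_given_good_grades = joint[(True, True)] / p_good_grades

for key in sorted(joint):
    print(key, round(joint[key], 3))
print(round(p_smart_given_good_grades, 3))
```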

Automated Admissions Decisions

[Network nodes: Grade inflation, Smart, Grades, GREs, Admit]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”.

For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

“Explaining Away”

[Network nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

“Explaining Away”

[Network: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1  C2  Score
H   H   1
T   T   1
H   T   0
T   H   0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
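The two-coin example is small enough to enumerate. The sketch below (helper names are illustrative) computes P(C1=H) under three conditions: knowing only C2, knowing only the score, and knowing both. The coins are independent until the score is observed; once it is, learning C2 changes the belief about C1 completely.

```python
from itertools import product


def joint():
    """Yield (c1, c2, score, probability) for two fair coins; score is 1 when they match."""
    for c1, c2 in product("HT", repeat=2):
        yield c1, c2, int(c1 == c2), 0.25


def prob(event, given=lambda c1, c2, score: True):
    """P(event | given) by summing over the four joint outcomes."""
    numerator = sum(p for c1, c2, s, p in joint() if given(c1, c2, s) and event(c1, c2, s))
    denominator = sum(p for c1, c2, s, p in joint() if given(c1, c2, s))
    return numerator / denominator


def c1_heads(c1, c2, score):
    return c1 == "H"


# Without the score, C2 tells us nothing about C1.
p_given_c2 = prob(c1_heads, given=lambda c1, c2, s: c2 == "T")
# The score alone does not change the belief either.
p_given_score = prob(c1_heads, given=lambda c1, c2, s: s == 1)
# Score and C2 together determine C1.
p_given_score_and_c2 = prob(c1_heads, given=lambda c1, c2, s: s == 1 and c2 == "T")

print(p_given_c2, p_given_score, p_given_score_and_c2)
```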

88

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …
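The two objective functions can be written out as follows. This is a standard textbook form given here as a sketch; the symbol $\theta$ for the model parameters and the assumption that the training examples $X_i$ are independent and identically distributed are not stated on the slide.

```latex
% Maximum likelihood: choose the parameters that make the training data most probable.
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \prod_{i} P(X_i \mid \theta)
                           = \arg\max_{\theta} \sum_{i} \log P(X_i \mid \theta)

% Maximum posterior (MAP): additionally weight each parameter setting by a prior P(\theta).
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta) \prod_{i} P(X_i \mid \theta)
```

With a flat prior the two estimates coincide; the prior matters most when the training set is small.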

91

Learning Models from Data

[Network: Coin 1 → Score ← Coin 2]

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1  C2  Score
H   H   ?
T   T   ?
H   T   ?
T   H   ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …
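When the structure is known and every variable is observed, the maximum likelihood parameters are simply observed frequencies. The sketch below fills in the blank tables by counting; the data set is hypothetical (only its first two records match the ones shown on the slide) and the function name is illustrative.

```python
from collections import Counter


def fit_coin_model(data):
    """Maximum likelihood estimates for the two-coin network, obtained by counting.

    data is a list of (c1, c2, score) records with c1, c2 in {"H", "T"} and score in {0, 1}.
    """
    n = len(data)
    p_c1_heads = sum(1 for c1, _, _ in data if c1 == "H") / n
    p_c2_heads = sum(1 for _, c2, _ in data if c2 == "H") / n
    # Conditional table P(score = 1 | c1, c2), one entry per observed parent configuration.
    totals = Counter((c1, c2) for c1, c2, _ in data)
    ones = Counter((c1, c2) for c1, c2, score in data if score == 1)
    p_score_one = {config: ones[config] / totals[config] for config in totals}
    return p_c1_heads, p_c2_heads, p_score_one


# Hypothetical training data D = (C1, C2, S).
D = [
    ("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0),
    ("H", "H", 1), ("T", "H", 0), ("H", "T", 0), ("H", "H", 1),
]
p_c1, p_c2, score_table = fit_coin_model(D)
print(p_c1, p_c2, score_table)
```

Parent configurations that never occur in the data get no entry at all, which is one reason a prior (the maximum posterior objective) is useful with small data sets.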

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Network: S (Smart) → G (Grades); S → R (GREs); G, R → A (Admit)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)   (why?)

(because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Network: S (Smart) → G (Grades); S → R (GREs); G, R → A (Admit)]

S:  P(F)  P(T)
    0.5   0.5

G:  S  P(F)  P(T)
    F  0.9   0.1
    T  0.8   0.2

R:  S  P(F)  P(T)
    F  0.8   0.2
    T  0.2   0.8

A:  G  R  P(F)  P(T)
    F  F  0.9   0.1
    F  T  0.8   0.2
    T  F  0.5   0.5
    T  T  0.2   0.8

$$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\,P(R \mid S{=}T)\,P(A{=}T \mid G,R)$$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
  with the four terms for (G,R) = (F,F), (F,T), (T,F), (T,T)

Rough grading: being smart doesn’t help very much

GREs are better correlated with intelligence than the grades
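The prediction step can be reproduced by enumerating the unobserved variables. The sketch below encodes the conditional probability tables exactly as printed on the slide (storing only the P(T) column, since P(F) = 1 - P(T)); the helper and variable names are illustrative.

```python
from itertools import product

# P(T) columns of the slide's tables, keyed by the parent values.
P_GRADES_TRUE = {False: 0.1, True: 0.2}   # P(G=T | S)
P_GRE_TRUE = {False: 0.2, True: 0.8}      # P(R=T | S)
P_ADMIT_TRUE = {                          # P(A=T | G, R)
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def bernoulli(p_true, value):
    """Probability that a binary variable with P(True) = p_true takes the given value."""
    return p_true if value else 1.0 - p_true


def p_admit_given_smart(smart):
    """P(A=T | S=smart), summing over the unobserved grades and GREs."""
    total = 0.0
    for grades, gre in product([False, True], repeat=2):
        total += (
            bernoulli(P_GRADES_TRUE[smart], grades)
            * bernoulli(P_GRE_TRUE[smart], gre)
            * P_ADMIT_TRUE[(grades, gre)]
        )
    return total


print(round(p_admit_given_smart(True), 3))
print(round(p_admit_given_smart(False), 3))
```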

97

Inference with Bayes Nets

[Network: S (Smart) → G (Grades); S → R (GREs); G, R → A (Admit)]

$$P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}T \mid G,R)$$

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) ≈ 0.23

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on previous slide)
P(A=T|S=F) = 0.16 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
or, using Bayes’ Rule, = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

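[Network: S (Smart) → G (Grades); S → R (GREs); G, R → A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S} \sum_{G} P(S)\,P(G \mid S)\,P(R{=}F \mid S)\,P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}F \mid G,R) \quad \text{(as before)}$$

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) ≈ 0.94 / 0.55 ≈ 1.7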
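The "tedious but straightforward" sums are easy to delegate to a few lines of code. The sketch below enumerates the full joint distribution using the tables exactly as printed on the "Prediction with Bayes Nets" slide and answers the inference questions by brute force; all names are illustrative, and brute-force enumeration is only practical for toy networks like this one.

```python
from itertools import product

P_SMART_TRUE = 0.5                        # P(S=T)
P_GRADES_TRUE = {False: 0.1, True: 0.2}   # P(G=T | S)
P_GRE_TRUE = {False: 0.2, True: 0.8}      # P(R=T | S)
P_ADMIT_TRUE = {                          # P(A=T | G, R)
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def bernoulli(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (
        bernoulli(P_SMART_TRUE, s)
        * bernoulli(P_GRADES_TRUE[s], g)
        * bernoulli(P_GRE_TRUE[s], r)
        * bernoulli(P_ADMIT_TRUE[(g, r)], a)
    )


def probability(event, given=lambda s, g, r, a: True):
    """P(event | given), summing the joint over all 16 assignments."""
    numerator = denominator = 0.0
    for s, g, r, a in product([False, True], repeat=4):
        if given(s, g, r, a):
            p = joint(s, g, r, a)
            denominator += p
            if event(s, g, r, a):
                numerator += p
    return numerator / denominator


def rejected(s, g, r, a):
    return not a


p_admit = probability(lambda s, g, r, a: a)
p_smart_given_admit = probability(lambda s, g, r, a: s, given=lambda s, g, r, a: a)
p_bad_grades_given_reject = probability(lambda s, g, r, a: not g, given=rejected)
p_bad_gre_given_reject = probability(lambda s, g, r, a: not r, given=rejected)

print(round(p_admit, 3), round(p_smart_given_admit, 3))
print(round(p_bad_grades_given_reject, 3), round(p_bad_gre_given_reject, 3))
```

Under these tables a rejected student is more likely to have had bad grades than bad GREs, largely because bad grades are common for everyone under the rough grading.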

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: node “P1-P2 REAL” with arrows to “Detected by X1” … “Detected by Xn”]

Cause (often “hidden”): P1-P2 REAL
Effects (observed): Detected by X1, …, Detected by Xn

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high confidence interactions from small scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

More likely to interact

Which pattern below is more likely to represent a pair of interacting proteins?

Courtesy of Cokus et al License CC-BYSource Cokus Shawn Sayaka Mizutani et al An Improved Method for Identifying FunctionallyLinked Proteins using Phylogenetic Profiles BMC Bioinformatics 8 no Suppl 4 (2007) S7

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Give appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data


copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

likelihood ratio: if > 1 classify as true; if < 1 classify as false

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = $\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; and NE: one essential and one not)

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114 / 2150}{81924 / 573734} \approx 3.6$$
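The likelihood ratio for one feature value is just a ratio of two frequencies, so it can be checked in a couple of lines. The sketch below uses the four counts shown on the slide; the function and argument names are illustrative.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as shown on the slide: gold-standard positives and negatives with this feature value.
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))
```

A value above 1 means the feature value is enriched among gold-standard positives; under the independence assumption, the ratios for different evidence types are multiplied together.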

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected → Compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


PRISM’s Rationale

• Two components
  – rigid-body structural comparisons of target proteins to known template protein-protein interfaces
  – flexible refinement using a docking energy function
• Evaluate using structural similarity and evolutionary conservation of putative binding residue hot spots

N. Tuncbag

12

Subtilisin and its inhibitors

Although global folds of Subtilisin’s partners are very different, binding regions are structurally very conserved

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 13

Hotspots

Figure from Clackson amp Wells (1995)

copy American Association for the Advancement of Science All rights reserved This content is excludedfrom our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Clackson Tim and James A Wells A Hot Spot of Binding Energy in a Hormone-ReceptorInterface Science 267 no 5196 (1995) 383-6

14


• Fewer than 10% of the residues at an interface contribute more than 2 kcal/mol to binding
• Hot spots
  – rich in Trp, Arg and Tyr
  – occur on pockets on the two proteins that have complementary shapes and distributions of charged and hydrophobic residues
  – can include buried charged residues far from solvent
  – O-ring structure excludes solvent from interface

http://onlinelibrary.wiley.com/doi/10.1002/prot.21396/full

16

Ikb

NFkB

N Tuncbag

1 Identify interface of template (distance cutoff)

Courtesy of Nurcan Tuncbag Used with permission 17
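Step 1 can be sketched as a simple distance test between the two chains of the template complex. This is only an illustration of the idea of a distance cutoff, not PRISM's actual procedure: the 6.0 Å cutoff, the data layout (residue id mapped to a list of atom coordinates), and the toy coordinates are all assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def interface_residues(chain_a, chain_b, cutoff=6.0):
    """Return the residues of each chain that have any atom within `cutoff` of the other chain.

    chain_a and chain_b map residue id -> list of (x, y, z) atom coordinates.
    The cutoff value is an assumption chosen for illustration.
    """
    cutoff_sq = cutoff * cutoff
    interface_a, interface_b = set(), set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            if any(squared_distance(p, q) <= cutoff_sq for p in atoms_a for q in atoms_b):
                interface_a.add(res_a)
                interface_b.add(res_b)
    return interface_a, interface_b


# Toy coordinates (hypothetical), one atom per residue.
chain_a = {"A10": [(0.0, 0.0, 0.0)], "A11": [(3.8, 0.0, 0.0)], "A50": [(30.0, 0.0, 0.0)]}
chain_b = {"B5": [(0.0, 4.0, 0.0)], "B90": [(0.0, 40.0, 0.0)]}
half_a, half_b = interface_residues(chain_a, chain_b)
print(sorted(half_a), sorted(half_b))
```

The two returned sets correspond to the two half-interfaces of the template, against which a query surface could then be aligned.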


[Figure labels: Ikb, NFkB, ASPP2]

N. Tuncbag

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots

Courtesy of Nurcan Tuncbag. Used with permission.

20

[Figure labels: NFkB, ASPP2]

N. Tuncbag

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement

Courtesy of Nurcan Tuncbag. Used with permission.

21

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Tuncbag Nurcan Attila Gursoy et al Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary andStructural Similarities at Interfaces using PRISM Nature Protocols 6 no 9 (2011) 1341-54

22

Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

∆Gcalc = -29.54 kcal/mol   ∆Gcalc = -37.51 kcal/mol   ∆Gcalc = -44.06 kcal/mol

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 23

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.
Tuncbag N, Keskin O, Nussinov R, Gursoy A.
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012), doi:10.1038/nature11503

PrePPI: Scores potential templates without building a homology model

Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

25

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012), doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

26



[Figure labels: NA, NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

32


Evaluate based on five measures
• SIM: structural similarity of NA/MA and NB/MB
• SIZ (number), COV (fraction): interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

35

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 36


1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Kumar, A. and Snyder, M. Proteomics: Protein complexes take the bait. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002); Ho, Y. et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

40

Courtesy of Macmillan Publishers LimitedUsed with permissionSource Kumar Anuj and Michael Snyder Proteomics Protein Complexes take the Bait Nature 415 no 6868 (2002) 123-4 40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

Tagging strategies

TAP-tag (endogenous protein levels), tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

How does this compare to mass-spec based approaches?

• Does not require purification, so it will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5): 655-62

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: gold standard and experiment. Region labels: false negatives; true positives; true positives and false positives.]

Error Rates

[Venn diagram: gold standard, Experiment 1 and Experiment 2, with the true positives from the gold standard marked.]

Data Integration

[Venn diagram: gold standard, Experiment 1 and Experiment 2, with regions I, II, III, IV (true) and V (false) marked.]

I = true positives from gold standard
II = consensus true positives
Fraction of consensus present in gold standard = I/II

The remaining experiment-only data are a mix of true and false positives.
Define IV = true positives, V = false positives.

Assume I/II = III/IV

Estimated Error Rates

Hart GT, Ramani AK, Marcotte EM. How complete are current yeast and human protein-interaction networks? Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

[Figure from Hart et al.: regions I, II and III of the data-set overlap, with false positives and true positives marked.]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV
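The proportionality assumption above can be turned into a small calculation. The sketch below is only an illustration, not the procedure published by Hart et al. It assumes that region III is the part of one experiment's unique data that is present in the gold standard, and that IV + V make up the rest of that unique data; the function name, argument names and counts are hypothetical.

```python
def estimate_unique_region(n_I, n_II, n_III, unique_total):
    """Estimate true (IV) and false (V) positives among interactions
    reported by one experiment only, assuming I/II = III/IV."""
    if n_I <= 0:
        raise ValueError("region I must be non-empty to form the ratio I/II")
    # I/II = III/IV  =>  IV = III * II / I
    n_IV = n_III * n_II / n_I
    # IV cannot exceed the number of unique interactions outside the gold standard
    n_IV = min(n_IV, unique_total)
    n_V = unique_total - n_IV
    return n_IV, n_V


# Hypothetical counts, for illustration only
true_pos, false_pos = estimate_unique_region(n_I=100, n_II=300, n_III=50, unique_total=1000)
print(true_pos, false_pos)  # 150.0 850.0
```

With these made-up counts, 15% of the experiment-only interactions would be estimated as real, which gives a rough false-positive rate for that data set.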

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120


Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

von Mering et al. Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$\text{Data} = (\text{two\_hybrid}, \text{mass\_spec}, \text{co\_evolution}, \text{co\_expression})$

or

$\text{Data} = (\text{mass\_spec\_expt\_1}, \text{mass\_spec\_expt\_2})$

57

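Bayes Rule

$P(\text{true\_PPI} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{true\_PPI}) \, P(\text{true\_PPI})}{P(\text{Data})}$

posterior: $P(\text{true\_PPI} \mid \text{Data})$; prior: $P(\text{true\_PPI})$; likelihood: $P(\text{Data} \mid \text{true\_PPI})$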

58

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities

$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} \cdot \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

How do we compute this?

log likelihood ratio = $\log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function = $\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

We assume the observations are independent (we’ll see how to handle dependence soon).

Ranking function =

$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
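Under the independence assumption the log of the product becomes a sum of per-assay log ratios. The following is a minimal sketch of that ranking function for binary "detected / not detected" observations. It is not the scoring scheme of Collins et al.; the function names, the add-one smoothing and all protein pairs are assumptions made for illustration.

```python
import math

def detection_rates(assay_hits, pairs, pseudocount=1.0):
    """Estimate P(assay detects a pair) within a reference set (smoothed)."""
    rates = {}
    for assay, hits in assay_hits.items():
        detected = sum(1 for p in pairs if p in hits)
        rates[assay] = (detected + pseudocount) / (len(pairs) + 2 * pseudocount)
    return rates

def log_likelihood_ratio(pair, assay_hits, rate_pos, rate_neg):
    """Sum over assays of log P(obs_i | true_PPI) / P(obs_i | false_PPI)."""
    score = 0.0
    for assay, hits in assay_hits.items():
        if pair in hits:
            score += math.log(rate_pos[assay] / rate_neg[assay])
        else:
            score += math.log((1 - rate_pos[assay]) / (1 - rate_neg[assay]))
    return score

# Hypothetical gold-standard sets and assay results
gold_pos = [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")]
gold_neg = [("A", "X"), ("C", "Y"), ("E", "Z"), ("G", "W")]
assay_hits = {
    "two_hybrid": {("A", "B"), ("C", "D"), ("A", "X"), ("P", "Q")},
    "mass_spec": {("A", "B"), ("C", "D"), ("E", "F"), ("P", "Q"), ("R", "S")},
}

rate_pos = detection_rates(assay_hits, gold_pos)
rate_neg = detection_rates(assay_hits, gold_neg)
llr_pq = log_likelihood_ratio(("P", "Q"), assay_hits, rate_pos, rate_neg)
llr_rs = log_likelihood_ratio(("R", "S"), assay_hits, rate_pos, rate_neg)
print(llr_pq, llr_rs)
```

In this toy data the pair seen by both assays scores higher than the pair seen by mass spec alone, but the second pair is not discarded; it is simply ranked lower.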

64

von Mering et al. Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log-likelihood ratio $\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$.

65

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression
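The two axes of the ROC curve can be computed directly from a scored list of pairs plus the two reference sets. The sketch below is a generic illustration under assumed inputs, not the analysis pipeline of Collins et al.; the function name, scores and pair labels are hypothetical.

```python
def roc_points(scores, positives, negatives):
    """Return (FPR, TPR) pairs, one per score threshold, highest threshold first."""
    points = []
    for threshold in sorted(set(scores.values()), reverse=True):
        called = {pair for pair, s in scores.items() if s >= threshold}
        tp = len(called & positives)
        fp = len(called & negatives)
        tpr = tp / len(positives)   # TP / (TP + FN), from high-confidence positives
        fpr = fp / len(negatives)   # FP / (TN + FP), from high-confidence negatives
        points.append((fpr, tpr))
    return points

# Hypothetical scores for three reference positives and two reference negatives
scores = {"p1": 3.0, "n1": 2.5, "p2": 2.0, "p3": 0.5, "n2": 0.1}
positives = {"p1", "p2", "p3"}
negatives = {"n1", "n2"}

points = roc_points(scores, positives, negatives)
for fpr, tpr in points:
    print(round(fpr, 2), round(tpr, 2))
```

Each threshold on the score gives one point; sweeping the threshold from strict to permissive traces the curve from the lower left to the upper right.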


ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  => 2^N - 1 parameters
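To see how quickly the full joint table grows, the short sketch below compares 2^N - 1 with the parameter count of a much simpler structure. The comparison network (one binary cause with n binary children, each depending only on the cause) is chosen here purely for illustration, and the function names are hypothetical.

```python
def full_joint_parameters(n):
    """Free parameters in a full joint table over n binary variables."""
    return 2 ** n - 1

def one_cause_parameters(n_children):
    """Free parameters when one binary cause has n binary children:
    1 for P(cause), plus 2 per child (P(child=T | cause=T) and P(child=T | cause=F))."""
    return 1 + 2 * n_children

for n_children in (4, 10, 20):
    n_vars = n_children + 1  # the children plus the cause itself
    print(n_children, full_joint_parameters(n_vars), one_cause_parameters(n_children))
```

With 20 observed variables the full table needs over two million numbers, while the factored model needs 41.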

77

Graphical Structure Expresses our Beliefs

[Diagram: node “P1-P2 REAL”, the cause (often “hidden”), connected to nodes “Detected by X1” … “Detected by Xn”, the effects (observed).]

78

Graphical Structure Expresses our Beliefs

[Diagram 1: node “P1-P2 REAL” connected to “Detected by X1” … “Detected by Xn”.]

[Diagram 2: node “P1-P2 REAL” connected to “Detected by X1” … “Detected by Xn” and “Detected by Xj”, “Detected by Xj+1” …, with additional nodes “Membrane” and “Highly expressed”.]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Network: Smart → Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints.
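The equivalence of the two formulations can be checked numerically: multiplying P(S) by P(G|S) should reproduce each entry of the joint table. The snippet below is a small check using the numbers from the slide; the variable names are arbitrary.

```python
# Conditional formulation: P(S) and P(G | S), keyed by truth value
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Rebuild the joint table with P(S, G) = P(S) * P(G | S)
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

for (s, g), p in joint.items():
    print(s, g, round(p, 3))

# Going the other way: marginalize the joint to recover P(S=T)
p_s_true = joint[(True, False)] + joint[(True, True)]
print(round(p_s_true, 3))
```

Both forms carry three free numbers: the joint table has four entries minus one sum-to-one constraint, and the conditional form has one number for P(S) plus one per row of P(G|S).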

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

“Explaining Away”

[Network nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

“Explaining Away”

[Network: Coin 1 and Coin 2 are the parents of Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
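The coin example is small enough to enumerate. The sketch below computes P(C1 = H) under different evidence by summing over the four equally likely outcomes; the function name and its keyword arguments are chosen here for illustration.

```python
from itertools import product

def posterior_c1_heads(score=None, c2=None):
    """P(C1 = H | evidence), enumerating the four equally likely coin outcomes."""
    num = den = 0.0
    for c1, c2_val in product("HT", repeat=2):
        s = 1 if c1 == c2_val else 0  # Score is 1 when the coins match
        if score is not None and s != score:
            continue
        if c2 is not None and c2_val != c2:
            continue
        den += 0.25
        if c1 == "H":
            num += 0.25
    return num / den

print(posterior_c1_heads())                 # 0.5
print(posterior_c1_heads(c2="T"))           # 0.5
print(posterior_c1_heads(score=1))          # 0.5
print(posterior_c1_heads(score=1, c2="T"))  # 0.0
```

Without the score, learning C2 tells us nothing about C1. Once the score is observed, C2 fully determines C1, even though the coins were flipped independently.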

88

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

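[Network: Coin 1 and Coin 2 are the parents of Score]

Parameters to learn: p(C1=H), p(C1=T), p(C2=H), p(C2=T), and the Score for each combination of C1 and C2 (H H, T T, H T, T H).

D = (C1, C2, S) = (H,T,0), (H,H,1), …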
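When every variable is observed in every training example, the maximum-likelihood parameters of a Bayes net are just observed frequencies. The sketch below fits the coin network this way. Only the first two data rows come from the slide; the remaining rows and the function name are made up for illustration.

```python
from collections import Counter

def fit_coin_parameters(data):
    """Maximum-likelihood estimates from complete observations (c1, c2, score)."""
    n = len(data)
    c1_counts = Counter(c1 for c1, _, _ in data)
    c2_counts = Counter(c2 for _, c2, _ in data)
    p_c1_heads = c1_counts["H"] / n
    p_c2_heads = c2_counts["H"] / n
    # P(Score = 1 | C1, C2): fraction of rows with score 1 per parent combination
    combo_total = Counter((c1, c2) for c1, c2, _ in data)
    combo_ones = Counter((c1, c2) for c1, c2, s in data if s == 1)
    p_score1 = {combo: combo_ones[combo] / total for combo, total in combo_total.items()}
    return p_c1_heads, p_c2_heads, p_score1

# First two rows as on the slide; the rest are hypothetical
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1),
        ("T", "H", 0), ("H", "H", 1), ("T", "H", 0)]

p_c1, p_c2, score_table = fit_coin_parameters(data)
print(p_c1, p_c2, score_table)
```

With hidden variables or missing entries the counts are no longer available directly, which is where iterative methods such as EM come in.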

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Network: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)   (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Network: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit)]

S:
P(F)   P(T)
0.5    0.5

G:
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R:
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
(terms in order G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn’t help very much.

GREs are better correlated with intelligence than the grades.

97

Inference with Bayes Nets

[Network: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit)]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.23

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T), calculated on the previous slide, = 0.29
P(A=T|S=F), calculated analogously to P(A=T|S=T), = 0.16

P(S=T|A=T) = P(S=T, A=T)/P(A=T)
Or, using Bayes Rule: = P(S=T) P(A=T|S=T)/P(A=T)

98

Inference with Bayes Nets

[Network: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F)/P(A=F) = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R) (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.94/0.55 = 1.7
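The "tedious but straightforward" sums can be left to a few lines of code. The sketch below enumerates all 16 states of the toy network using the conditional probability tables from the prediction slide, and answers conditional queries by brute force. The helper names are hypothetical, and exact enumeration returns unrounded values, so the last digit can differ from figures rounded by hand.

```python
from itertools import product

# Conditional probability tables from the worked example (True = "T")
P_S = {True: 0.5, False: 0.5}
P_G_T = {True: 0.2, False: 0.1}    # P(G=T | S)
P_R_T = {True: 0.8, False: 0.2}    # P(R=T | S)
P_A_T = {(False, False): 0.1, (False, True): 0.2,
         (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)

def bern(p_true, value):
    """Probability of a binary value given the probability of True."""
    return p_true if value else 1.0 - p_true

def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return P_S[s] * bern(P_G_T[s], g) * bern(P_R_T[s], r) * bern(P_A_T[(g, r)], a)

def prob(**fixed):
    """Marginal probability of the assignments in `fixed` (keys: s, g, r, a)."""
    total = 0.0
    for s, g, r, a in product((False, True), repeat=4):
        state = {"s": s, "g": g, "r": r, "a": a}
        if all(state[k] == v for k, v in fixed.items()):
            total += joint(s, g, r, a)
    return total

def cond(query, given):
    """P(query | given); the two dictionaries must not share keys."""
    return prob(**query, **given) / prob(**given)

print(round(cond({"a": True}, {"s": True}), 3))    # prediction: P(A=T | S=T)
print(round(prob(a=True), 3))                      # P(A=T)
print(round(cond({"s": True}, {"a": True}), 3))    # inference: P(S=T | A=T)
print(round(cond({"g": False}, {"a": False}), 3))  # P(G=F | A=F)
print(round(cond({"r": False}, {"a": False}), 3))  # P(R=F | A=F)
```

Brute-force enumeration is fine for four binary variables; for larger networks the same queries need algorithms that exploit the graph structure.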

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: node “P1-P2 REAL”, the cause (often “hidden”), connected to nodes “Detected by X1” … “Detected by Xn”, the effects (observed).]

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure label: More likely to interact]

104

Courtesy of Cokus et al License CC-BYSource Cokus Shawn Sayaka Mizutani et al An Improved Method for Identifying FunctionallyLinked Proteins using Phylogenetic Profiles BMC Bioinformatics 8 no Suppl 4 (2007) S7

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

109

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53


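likelihood ratio = $\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})}$; if > 1 classify as true, if < 1 classify as false

log likelihood ratio = $\log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function =

$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$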

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

Likelihood = L = $\frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$, for example $L = \frac{1114/2150}{81924/573734}$
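For a discrete feature, the likelihood ratio of one feature value is a ratio of two frequencies taken from the gold-standard sets. The sketch below illustrates this. It reads the run-together digits on the slide as the fractions 1114/2150 (positives) and 81924/573734 (negatives), which is an assumption about the original figure; the function and argument names are hypothetical.

```python
def likelihood_ratio(pos_in_bin, pos_total, neg_in_bin, neg_total):
    """L = P(f | pos) / P(f | neg), each probability estimated as a frequency."""
    p_f_given_pos = pos_in_bin / pos_total
    p_f_given_neg = neg_in_bin / neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as read from the slide (assumed split of the digit strings)
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))
```

A value of L above 1 means the feature value is seen more often among gold-standard positives than among negatives, so observing it raises the odds that a pair interacts.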

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected → compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

[Figure annotation: TP = FP]

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Subtilisin and its inhibitors

Although global folds of Subtilisin’s partners are very different, binding regions are structurally very conserved.

N. Tuncbag. Courtesy of Nurcan Tuncbag. Used with permission.

Hotspots

Figure from Clackson & Wells (1995)

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/
Source: Clackson, Tim, and James A. Wells. "A Hot Spot of Binding Energy in a Hormone-Receptor Interface." Science 267, no. 5196 (1995): 383-6.

• Fewer than 10% of the residues at an interface contribute more than 2 kcal/mol to binding
• Hot spots
  – rich in Trp, Arg and Tyr
  – occur on pockets on the two proteins that have complementary shapes and distributions of charged and hydrophobic residues
  – can include buried charged residues far from solvent
  – O-ring structure excludes solvent from interface

http://onlinelibrary.wiley.com/doi/10.1002/prot.21396/full

[Figure labels: Ikb, NFkB, ASPP2. N. Tuncbag]

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement

Courtesy of Nurcan Tuncbag. Used with permission.

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Tuncbag Nurcan Attila Gursoy et al Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary andStructural Similarities at Interfaces using PRISM Nature Protocols 6 no 9 (2011) 1341-54

22

Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

∆Gcalc= - 2954 kcalmol ∆Gcalc= - 3751 kcalmol ∆Gcalc= - 4406 kcalmol

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 23

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.
Tuncbag N, Keskin O, Nussinov R, Gursoy A.
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

PrePPI scores potential templates without building a homology model.

Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

25

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for a structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

26

[Figure: template complex with subunits NA and NB]

1. Identify interacting residues in the template complex (called NA1, NB3 in the rest of the paper)
2. Predict interacting residues for the homology models

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

32

Evaluate based on five measures:
• SIM: structural similarity of NA to MA and of NB to MB
• SIZ (number) and COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at the interface
• OL: number of aligned pairs at the interface

35

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

36

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for a structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train a Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon.

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002); Ho, Y. et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Kumar, Anuj, and Michael Snyder. “Proteomics: Protein Complexes Take the Bait.” Nature 415, no. 6868 (2002): 123-4.

40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations:
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels), tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Gavin, Anne-Claude, Markus Bösche, et al. “Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes.” Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Biotechniques 2008 Apr; 44(5): 655-62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques. Used with permission.
Source: Ratushny, Vladimir, and Erica A. Golemis. “Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System.” Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification – will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Biotechniques 2008 Apr; 44(5): 655-62. Ratushny V, Golemis E.

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard and Experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

47

Error Rates

[Venn diagram: Gold standard, Experiment 1, Experiment 2. The region shared with the gold standard is labeled “True positives from gold standard”.]

48

Data Integration

[Venn diagram: Gold standard, Experiment 1, Experiment 2, with regions labeled I to V]

• I = true positives from the gold standard
• II = consensus true positives
• Fraction of consensus present in gold standard = I/II
• The data unique to Experiment 1 is a mix of true and false positives; III is the part of it that is present in the gold standard
• Define IV = true positives, V = false positives
• Assume I/II = III/IV

52

Estimated Error Rates

How complete are current yeast and human protein-interaction networks? G Traver Hart, Arun K Ramani and Edward M Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

[Figure: estimated false positives and true positives, with values for regions I, II, and III]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{\mathrm{I}}{\mathrm{II}} = \frac{\mathrm{III}}{\mathrm{IV}}$$

Courtesy of BioMed Central Ltd. Used with permission.
Source: Hart, G. Traver, Arun K. Ramani, et al. “How Complete are Current Yeast and Human Protein-interaction Networks?” Genome Biology 7, no. 11 (2006): 120.
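The proportionality assumption I/II = III/IV can be turned into a small calculation. The sketch below is only an illustration: the function name and all counts are hypothetical, and `n_unique` (the total size of the data unique to one experiment, IV + V) is an assumed input.

```python
def estimate_unique_error(n_I, n_II, n_III, n_unique):
    """Estimate true and false positives in the data unique to one experiment.

    n_I:      consensus interactions that are also in the gold standard
    n_II:     all consensus interactions (assumed to be true positives)
    n_III:    unique interactions that are in the gold standard
    n_unique: all unique interactions (IV + V)
    Assumes I/II = III/IV, i.e. the gold standard covers consensus and
    unique true positives equally well.
    """
    coverage = n_I / n_II            # fraction of true positives seen in the gold standard
    n_IV = n_III / coverage          # estimated true positives in the unique data
    n_V = n_unique - n_IV            # the remainder are false positives
    return n_IV, n_V, n_V / n_unique

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
true_pos, false_pos, false_fraction = estimate_unique_error(
    n_I=100, n_II=400, n_III=50, n_unique=1000)
```

With these made-up counts the gold standard covers a quarter of the consensus, so 50 gold-standard hits in the unique data imply about 200 true positives and 800 false positives.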

53

Courtesy of BioMed Central Ltd. Used with permission.
Source: Hart, G. Traver, Arun K. Ramani, et al. “How Complete are Current Yeast and Human Protein-interaction Networks?” Genome Biology 7, no. 11 (2006): 120.

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Von Mering, Christian, Roland Krause, et al. “Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions.” Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$$\text{Data} = (\text{two\_hybrid},\ \text{mass\_spec},\ \text{co\_evolution},\ \text{co\_expression})$$

or

$$\text{Data} = (\text{mass\_spec\_expt\_1},\ \text{mass\_spec\_expt\_2})$$

57

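Bayes Rule

$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\ \overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$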

58

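Naïve Bayes Classification

$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\ \overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$

$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} \quad \text{(ratio of posterior probabilities)}$$

If > 1, classify as true; if < 1, classify as false.

How do we compute this?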

59

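$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\,P(\text{Data} \mid \text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$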

61

We assume the observations are independent (we’ll see how to handle dependence soon).

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i=1}^{M}\log\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al. Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
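As a rough illustration of the ranking function under the independence assumption, the sketch below sums per-assay log likelihood ratios. The assay names follow the data types mentioned earlier in the lecture, but every probability in `LIKELIHOODS` is a made-up placeholder; in practice these would be estimated from high-confidence positive and negative sets.

```python
import math

# Hypothetical values of (P(detected | true PPI), P(detected | false PPI)) per assay.
LIKELIHOODS = {
    "two_hybrid": (0.30, 0.01),
    "mass_spec": (0.50, 0.05),
    "co_expression": (0.60, 0.20),
}

def log_likelihood_ratio(observations):
    """Naive Bayes ranking score: sum over assays of
    log P(observation_i | true PPI) / P(observation_i | false PPI).

    observations: dict mapping assay name -> True (detected) or False (not detected).
    """
    score = 0.0
    for assay, detected in observations.items():
        p_true, p_false = LIKELIHOODS[assay]
        if not detected:
            # A negative result is also evidence: use the complementary probabilities.
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        score += math.log(p_true / p_false)
    return score

# Hypothetical protein pairs and their assay results.
pairs = {
    "P1-P2": {"two_hybrid": True, "mass_spec": True, "co_expression": True},
    "P3-P4": {"two_hybrid": False, "mass_spec": True, "co_expression": False},
    "P5-P6": {"two_hybrid": False, "mass_spec": False, "co_expression": False},
}
ranking = sorted(pairs, key=lambda name: log_likelihood_ratio(pairs[name]), reverse=True)
```

Because the score is a sum, a pair does not need to be detected by every assay to rank highly; one strong observation can outweigh several weak negatives.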

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio.

65

Collins et al. Mol Cell Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Collins, Sean R., Patrick Kemmeren, et al. “Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae.” Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
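The two ROC axes can be computed directly from a scored list of pairs and gold-standard labels. This is a minimal sketch with made-up scores and labels; it assumes all scores are distinct and is not the procedure used by Collins et al.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, one per threshold.

    scores: confidence score for each protein pair (higher = more likely real).
    labels: True for a gold-standard positive, False for a gold-standard negative.
    FPR = FP / (TN + FP) and TPR = TP / (TP + FN), as on the axes of the slide.
    Assumes distinct scores, so each threshold admits exactly one more pair.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, is_positive in sorted(zip(scores, labels), reverse=True):
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical scores and labels for five protein pairs.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [True, True, False, True, False]
points = roc_points(scores, labels)
```

Each point corresponds to accepting every pair at or above one score threshold, so sweeping the threshold from strict to permissive traces the curve from (0, 0) to (1, 1).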

Collins et al. Mol Cell Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: “9074 interactions”; “Literature curated (excluding 2-hybrid)”; “Error rate equivalent to literature curated”.]

68

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities:

$$P(\text{PPI} \mid \text{Y2H}_{\text{Uetz}},\ \text{Y2H}_{\text{Ito}},\ \text{IP}_{\text{Gavin}},\ \text{IP}_{\text{Ho}})$$

75

Bayesian Networks

You could try to write out all the joint probabilities:

$$P(\text{PPI} \mid \text{Rosetta},\ \text{CC},\ \text{GO},\ \text{MIPS},\ \text{ESS})$$

But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Yang, Xia, Joshua L. Deignan, et al. “Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks.” Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1) => $2^N - 1$ parameters
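To make the size of the full joint table concrete, the sketch below compares its parameter count with that of a network in which one hidden binary cause has several observed binary effects that are independent given the cause. Both function names are illustrative, not from the lecture.

```python
def full_joint_params(n_variables):
    """Free parameters in a full joint table over n binary variables: 2^N - 1."""
    return 2 ** n_variables - 1

def naive_bayes_params(n_observations):
    """Free parameters when one hidden binary cause has n observed binary effects
    that are independent given the cause: 1 for P(cause), plus
    P(effect_i = 1 | cause = 0) and P(effect_i = 1 | cause = 1) for each effect.
    """
    return 1 + 2 * n_observations

# One hidden "interaction is real" variable plus n observed assays.
for n_obs in (4, 10, 20):
    print(n_obs, full_joint_params(n_obs + 1), naive_bayes_params(n_obs))
```

The full table grows exponentially while the factored model grows linearly, which is why the graphical structure matters once more than a handful of data sources are combined.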

77

Graphical Structure Expresses our Beliefs

[Network diagram: node “P1-P2 REAL” with arrows to “Detected by X1” … “Detected by Xn”]

• Cause (often “hidden”): P1-P2 REAL
• Effects (observed): Detected by X1, …, Detected by Xn

78

Graphical Structure Expresses our Beliefs

[Network diagram 1: “P1-P2 REAL” with arrows to “Detected by X1” … “Detected by Xn”]

[Network diagram 2: “P1-P2 REAL” with “Detected by X1” … “Detected by Xn” and “Detected by Xj”, “Detected by Xj+1”, …, plus additional nodes “Membrane” and “Highly expressed”]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

80

Graphical Structure Expresses our Beliefs

[Network diagrams as on the previous slide: the naïve structure, and a second structure with additional nodes “Membrane” and “Highly expressed”]

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network, we don’t need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Network diagram with nodes: Grade inflation, Smart, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

Automated Admissions Decisions

[Network diagram with nodes: Grade inflation, Smart, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Network diagram: Smart → Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints.

85
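The equivalence of the two formulations can be checked numerically: multiplying the prior P(S) by the conditional P(G|S) should reproduce each entry of the joint table. The variable names below are illustrative; the numbers are the ones on the slide.

```python
# Conditional formulation: P(S) and P(G | S), with True/False standing for T/F.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint formulation, rebuilt with P(S, G) = P(S) * P(G | S).
joint = {(s, g): p_s[s] * p_g_given_s[s][g]
         for s in (False, True) for g in (False, True)}

for (s, g), p in sorted(joint.items()):
    print(f"S={s!s:5} G={g!s:5} P(S,G)={p:.3f}")
```

Both versions need three free numbers: the joint table has four entries that sum to 1, and the conditional version has P(S=T), P(G=T|S=F), and P(G=T|S=T).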

Automated Admissions Decisions

[Network diagram with nodes: Grade inflation, Smart, Grades, GREs, Admit]

In a Bayesian Network, we don’t need the full probability distribution. The joint probability depends only on “parents”.

For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

“Explaining Away”

[Network diagram with nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

“Explaining Away”

[Network diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
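The two-coin example is small enough to check by enumerating all four outcomes. This sketch (helper names are illustrative) shows that C2 alone tells us nothing about C1, but C2 together with the score determines C1.

```python
from itertools import product

# Joint distribution over (C1, C2, Score): two fair coins, Score = 1 when they match.
joint = {(c1, c2, int(c1 == c2)): 0.25 for c1, c2 in product("HT", repeat=2)}

def prob(event, given=lambda world: True):
    """P(event | given), computed by summing over the four possible worlds."""
    denominator = sum(p for world, p in joint.items() if given(world))
    numerator = sum(p for world, p in joint.items() if given(world) and event(world))
    return numerator / denominator

c1_is_heads = lambda world: world[0] == "H"

p_prior = prob(c1_is_heads)
p_given_c2 = prob(c1_is_heads, lambda w: w[1] == "T")
p_given_c2_and_score = prob(c1_is_heads, lambda w: w[1] == "T" and w[2] == 0)
```

Here `p_prior` and `p_given_c2` are both 0.5, while `p_given_c2_and_score` is 1.0: once the common effect is observed, the two causes become dependent.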

88

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Network diagram: Coin 1 → Score ← Coin 2]

Parameters to learn:

p(C1=H) = ?    p(C1=T) = ?
p(C2=H) = ?    p(C2=T) = ?

C1    C2    Score
H     H     ?
T     T     ?
H     T     ?
T     H     ?

D = {(C1, C2, S)} = (H,T,0), (H,H,1), …
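When every variable is observed, maximum-likelihood parameters for a discrete network reduce to counting. The sketch below estimates the coin and score tables from a data set D; only the first two rows match the slide, the rest are hypothetical, and the function names are illustrative.

```python
from collections import Counter

# Observations (C1, C2, Score). Only the first two rows appear on the slide;
# the remaining rows are made up for this example.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1),
     ("T", "H", 0), ("H", "H", 1), ("T", "H", 0)]

def mle_marginal(data, index):
    """Maximum-likelihood estimate of a root node's distribution: relative frequencies."""
    counts = Counter(row[index] for row in data)
    return {value: n / len(data) for value, n in counts.items()}

def mle_score_given_coins(data):
    """Maximum-likelihood estimate of P(Score = 1 | C1, C2) for each observed coin pair."""
    totals = Counter((c1, c2) for c1, c2, _ in data)
    ones = Counter((c1, c2) for c1, c2, s in data if s == 1)
    return {pair: ones[pair] / n for pair, n in totals.items()}

p_c1 = mle_marginal(D, 0)
p_c2 = mle_marginal(D, 1)
p_score1 = mle_score_given_coins(D)
```

With hidden variables or missing values the counts are no longer directly available, which is where the search algorithms listed earlier (for example EM or Gibbs sampling) come in.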

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

95

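Chain Rule of Probability for Bayes Nets

[Network diagram with nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

$$P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R)$$

$$= P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R) \quad \text{(why?)}$$

(because of the conditional independence assumption)

Recall: $P(X,Y) = P(X \mid Y)\,P(Y)$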

96

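Prediction with Bayes Nets

[Network diagram with nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

S:
P(F)    P(T)
0.5     0.5

G:
S    P(F)    P(T)
F    0.9     0.1
T    0.8     0.2

Rough grading: being smart doesn’t help very much.

R:
S    P(F)    P(T)
F    0.8     0.2
T    0.2     0.8

GREs are better correlated with intelligence than the grades.

A:
G    R    P(F)    P(T)
F    F    0.9     0.1
F    T    0.8     0.2
T    F    0.5     0.5
T    T    0.2     0.8

$$P(A=T \mid S=T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S=T)\,P(R \mid S=T)\,P(A=T \mid G,R)$$

$$= \underbrace{(0.8)(0.2)(0.1)}_{G=F,\,R=F} + \underbrace{(0.8)(0.8)(0.2)}_{G=F,\,R=T} + \underbrace{(0.2)(0.2)(0.5)}_{G=T,\,R=F} + \underbrace{(0.2)(0.8)(0.8)}_{G=T,\,R=T} = 0.29$$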
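The double sum for the prediction can be written out directly. This sketch encodes the conditional probability tables from the slide (storing only the probability of T for each row) and sums over the unobserved Grades and GREs; the helper names are illustrative.

```python
# Conditional probability tables from the slide; each value is the probability of T.
P_G_TRUE = {False: 0.1, True: 0.2}        # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}        # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)

def p(prob_true, value):
    """Probability of a binary variable taking `value`, given its probability of True."""
    return prob_true if value else 1.0 - prob_true

def p_admit_given_smart(s):
    """P(A=T | S=s): sum over G and R of P(G|S) * P(R|S) * P(A=T | G, R)."""
    total = 0.0
    for g in (False, True):
        for r in (False, True):
            total += p(P_G_TRUE[s], g) * p(P_R_TRUE[s], r) * P_A_TRUE[(g, r)]
    return total

p_admit_if_smart = p_admit_given_smart(True)
p_admit_if_not_smart = p_admit_given_smart(False)
```

Evaluating both cases gives about 0.292 for S=T and about 0.164 for S=F with these tables.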

97

Inference with Bayes Nets

[Network diagram with nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

$$P(A=T) = \sum_{S}\sum_{G}\sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=T \mid G,R)$$

$$= P(S=T)\,P(A=T \mid S=T) + P(S=F)\,P(A=T \mid S=F) = 0.23$$

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T), calculated on the previous slide, = 0.29
P(A=T|S=F), calculated analogously to P(A=T|S=T), = 0.16

$$P(S=T \mid A=T) = \frac{P(S=T, A=T)}{P(A=T)}$$

Or, using Bayes Rule:

$$P(S=T \mid A=T) = \frac{P(S=T)\,P(A=T \mid S=T)}{P(A=T)}$$

98

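Inference with Bayes Nets

[Network diagram with nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

$$P(R=F \mid A=F) = \frac{P(R=F, A=F)}{P(A=F)} = \frac{\sum_{S}\sum_{G} P(S)\,P(G \mid S)\,P(R=F \mid S)\,P(A=F \mid G,R=F)}{P(A=F)}$$

$$P(A=F) = \sum_{S}\sum_{G}\sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=F \mid G,R) \quad \text{(as before)}$$

Tedious but straightforward to compute:

$$\frac{P(G=F \mid A=F)}{P(R=F \mid A=F)} = \frac{0.94}{0.55} = 1.7$$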
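Because the toy network has only four binary variables, any conditional probability can be obtained by brute-force enumeration of the 16 joint states. The sketch below uses the conditional probability tables given for this worked example; `query` and the other names are illustrative, and enumeration is only practical for very small networks.

```python
from itertools import product

# Tables from the worked example; each value is the probability of T.
P_S_TRUE = 0.5
P_G_TRUE = {False: 0.1, True: 0.2}        # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}        # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)
NAMES = ("S", "G", "R", "A")

def p(prob_true, value):
    return prob_true if value else 1.0 - prob_true

def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (p(P_S_TRUE, s) * p(P_G_TRUE[s], g)
            * p(P_R_TRUE[s], r) * p(P_A_TRUE[(g, r)], a))

def query(target, evidence):
    """P(target | evidence); both are dicts mapping a variable name to True/False."""
    numerator = denominator = 0.0
    for values in product((True, False), repeat=4):
        world = dict(zip(NAMES, values))
        if all(world[k] == v for k, v in evidence.items()):
            prob = joint(*values)
            denominator += prob
            if all(world[k] == v for k, v in target.items()):
                numerator += prob
    return numerator / denominator

p_smart_given_admit = query({"S": True}, {"A": True})
p_bad_grades_given_reject = query({"G": False}, {"A": False})
p_bad_gres_given_reject = query({"R": False}, {"A": False})
```

With these tables a rejected student is more likely to have had bad grades than bad GREs, largely because bad grades are common in this model regardless of ability.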

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Network diagram: node “P1-P2 REAL” with arrows to “Detected by X1” … “Detected by Xn”]

• Cause (often “hidden”): P1-P2 REAL
• Effects (observed): Detected by X1, …, Detected by Xn

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al. Mol & Cell Proteomics (2002) 1.5, 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. “Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations.” Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: candidate patterns; one is labeled “More likely to interact”]

Courtesy of Cokus et al. License: CC-BY.
Source: Cokus, Shawn, Sayaka Mizutani, et al. “An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles.” BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

108

109

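$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})}$$

If > 1, classify as true; if < 1, classify as false.

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$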

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114/2150}{81924/573734}$$
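The likelihood ratio for one category of a discrete feature is a ratio of two frequencies. The sketch below reads the counts on the slide as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives; that reading of the flattened digits, and the function name, are assumptions.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), with both probabilities estimated as
    relative frequencies in the gold-standard positive and negative sets."""
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as read from the slide (assumed to be 1114/2150 and 81924/573734).
L = likelihood_ratio(1114, 2150, 81924, 573734)
```

Under this reading L comes out at roughly 3.6, meaning the feature value is several times more common among gold-standard positives than among negatives.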

111

112

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

113

Fully connected → Compute probabilities for all 16 possible combinations

114

Interpret with caution, as numbers are small

115

TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Hotspots

Figure from Clackson & Wells (1995)

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Clackson, Tim, and James A. Wells. "A Hot Spot of Binding Energy in a Hormone-Receptor Interface." Science 267, no. 5196 (1995): 383-6.

14

• Fewer than 10% of the residues at an interface contribute more than 2 kcal/mol to binding
• Hot spots
  – are rich in Trp, Arg and Tyr
  – occur on pockets on the two proteins that have complementary shapes and distributions of charged and hydrophobic residues
  – can include buried charged residues far from solvent
  – O-ring structure excludes solvent from interface

http://onlinelibrary.wiley.com/doi/10.1002/prot.21396/full 16

[Figure labels: IκB, NF-κB. Figure credit: N. Tuncbag]

1. Identify interface of template (distance cutoff)

Courtesy of Nurcan Tuncbag. Used with permission. 17

[Figure labels: IκB, NF-κB, ASPP2. Figure credit: N. Tuncbag]

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test:
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots

Courtesy of Nurcan Tuncbag. Used with permission. 20

[Figure labels: NF-κB, ASPP2. Figure credit: N. Tuncbag]

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test:
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement

Courtesy of Nurcan Tuncbag Used with permission 21
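
Step 1 (identifying the template interface with a distance cutoff) can be sketched in a few lines of Python. This is only an illustration, not PRISM's implementation: the function name `interface_residues`, the 5 Å default cutoff, the dictionary layout and the toy coordinates are all assumptions made for the example.

```python
import math


def interface_residues(chain_a, chain_b, cutoff=5.0):
    """Return the residues of each chain that have at least one atom within
    `cutoff` angstroms of any atom in the other chain.

    chain_a, chain_b: dict mapping residue ID -> list of (x, y, z) atom coordinates.
    """
    iface_a, iface_b = set(), set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            # A residue pair is in contact if any atom pair is close enough.
            in_contact = any(
                math.dist(p, q) <= cutoff for p in atoms_a for q in atoms_b
            )
            if in_contact:
                iface_a.add(res_a)
                iface_b.add(res_b)
    return iface_a, iface_b


# Toy coordinates (hypothetical, not taken from any PDB entry).
chain_a = {"A10": [(0.0, 0.0, 0.0)], "A11": [(20.0, 0.0, 0.0)]}
chain_b = {"B5": [(3.0, 0.0, 0.0)], "B6": [(40.0, 0.0, 0.0)]}

iface_a, iface_b = interface_residues(chain_a, chain_b)
print(sorted(iface_a), sorted(iface_b))
```

The all-against-all loop is quadratic in the number of residues; a real pipeline would typically use a spatial index, which is omitted here for clarity.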

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Tuncbag, Nurcan, Attila Gursoy, et al. "Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary and Structural Similarities at Interfaces using PRISM." Nature Protocols 6, no. 9 (2011): 1341-54.

22

Predicted p27 Protein Partners

[Figure: predicted complexes of p27 with partner proteins. Labels: p27, Cdk2, CycA, ERCC1, TFIIH, Cks1]

ΔGcalc = -29.54 kcal/mol   ΔGcalc = -37.51 kcal/mol   ΔGcalc = -44.06 kcal/mol

N. Tuncbag. Courtesy of Nurcan Tuncbag. Used with permission. 23

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.
Tuncbag N, Keskin O, Nussinov R, Gursoy A.
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

PrePPI scores potential templates without building a homology model.

Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

25

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1,500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

26

[Figure labels: NA, NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

© Source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

32

Evaluate based on five measures:
• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number), COV (fraction): interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

© Source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

35

"The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces)."

© Source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. 36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1,500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

37

We will examine Bayesian classifiers soon

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin A-C et al. Nature 415, 141-147 (2002)
Ho Y et al. Nature 415, 180-183 (2002)

What are the likely false positives?
What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4. 40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels)

Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Biotechniques. 2008 Apr;44(5):655-62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification – will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Biotechniques. 2008 Apr;44(5):655-62. Ratushny V, Golemis E.

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard and Experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

47

Error Rates

[Venn diagram: Gold standard, Experiment 1, Experiment 2. Labeled region: true positives from gold standard.]

48

Data Integration

[Venn diagram: Gold standard, Experiment 1, Experiment 2. Labeled regions: I, II]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II 49

Data Integration

[Venn diagram: Gold standard and Experiment 1. Labeled regions: I, II, III, and "mix of true and false positives"]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

50

Data Integration

[Venn diagram: Gold standard and Experiment 1. Labeled regions: I, II, III, IV (true), V (false)]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Define IV = true positives, V = false positives

51

Data Integration

[Venn diagram: Gold standard and Experiment 1. Labeled regions: I, II, III, IV (true), V (false)]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

52

Estimated Error Rates

False positives

True positives

How complete are current yeast and human protein-interaction networks? G Traver Hart, Arun K Ramani and Edward M Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{\mathrm{I}}{\mathrm{II}} = \frac{\mathrm{III}}{\mathrm{IV}}$$

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.

53
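
The assumption I/II = III/IV can be rearranged to estimate the number of true positives among the interactions seen in only one experiment. The sketch below uses made-up counts and one possible reading of the regions: II is taken as all consensus interactions and IV as all true interactions unique to one experiment, matching "fraction of consensus present in gold standard = I/II". The function and variable names are hypothetical.

```python
def estimate_unique_true_positives(n_i, n_ii, n_iii):
    """Solve I/II = III/IV for IV."""
    if n_i == 0:
        raise ValueError("region I must be non-empty to form the ratio I/II")
    return n_iii * n_ii / n_i


# Hypothetical counts, for illustration only.
n_i = 50        # consensus interactions that are in the gold standard
n_ii = 200      # all consensus interactions (assumed to be true positives)
n_iii = 30      # interactions unique to one experiment that are in the gold standard
n_unique = 400  # all interactions unique to that experiment

n_iv = estimate_unique_true_positives(n_i, n_ii, n_iii)  # estimated true positives
n_v = n_unique - n_iv                                    # estimated false positives
false_positive_fraction = n_v / n_unique

print(n_iv, n_v, false_positive_fraction)
```

With these counts, a quarter of the consensus set is covered by the gold standard, so the 30 unique gold-standard hits imply roughly 120 unique true positives.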

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$$\text{Data} = (\text{two\_hybrid},\ \text{mass\_spec},\ \text{co\_evolution},\ \text{co\_expression})$$

or

$$\text{Data} = (\text{mass\_spec\_expt\_1},\ \text{mass\_spec\_expt\_2})$$

57

Bayes Rule

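$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\ \overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$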

58

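Naïve Bayes Classification

$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\ \overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$

likelihood ratio = ratio of posterior probabilities

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})}$$

if > 1 classify as true; if < 1 classify as false

How do we compute this?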

59

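$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$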

61

We assume the observations are independent (we'll see how to handle dependence soon).

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol. Cell. Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i=1}^{M}\log\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

64
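
Under the independence assumption, the ranking function is a sum of per-assay log likelihood ratios. The following sketch shows the arithmetic for binary "detected / not detected" observations. The detection rates and protein-pair names are hypothetical; in practice they would be estimated from high-confidence positive and negative sets, and the exact form depends on the data type.

```python
import math


def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum over assays of log[P(obs_i | true PPI) / P(obs_i | false PPI)].

    observations: dict assay -> bool (was the pair detected by this assay?)
    p_given_true, p_given_false: dict assay -> P(detected | true / false PPI)
    """
    score = 0.0
    for assay, detected in observations.items():
        p_true = p_given_true[assay] if detected else 1.0 - p_given_true[assay]
        p_false = p_given_false[assay] if detected else 1.0 - p_given_false[assay]
        score += math.log(p_true / p_false)
    return score


# Hypothetical detection rates for two assays.
p_given_true = {"two_hybrid": 0.20, "mass_spec": 0.40}
p_given_false = {"two_hybrid": 0.01, "mass_spec": 0.05}

pairs = {
    "P1-P2": {"two_hybrid": True, "mass_spec": True},
    "P3-P4": {"two_hybrid": False, "mass_spec": True},
    "P5-P6": {"two_hybrid": False, "mass_spec": False},
}

scores = {
    name: log_likelihood_ratio(obs, p_given_true, p_given_false)
    for name, obs in pairs.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A pair seen by only one assay still receives a positive score here, which is the point of ranking by likelihood ratio rather than requiring detection in every assay.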

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

65

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives =
True negatives =

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

66

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different subcellular locations OR anticorrelated mRNA expression

67

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
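
The two ROC axes, TP/(TP+FN) and FP/(TN+FP), can be computed by sweeping a threshold down a ranked list of scored pairs. This is a generic sketch rather than the procedure used by Collins et al.; the scores and labels are invented, and tied scores are not treated specially.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points obtained by
    lowering the score threshold one pair at a time.

    labels: True for high-confidence positives, False for high-confidence negatives.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        # x = FP / (TN + FP), y = TP / (TP + FN)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores for four protein pairs.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [True, False, True, False]
points = roc_points(scores, labels)
print(points)
```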

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]

68

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

73

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities:

P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

75

Bayesian Networks

You could try to write out all the joint probabilities:

P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1) ⇒ $2^N - 1$ parameters

77
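
To see how quickly the full joint table grows, the count of 2^N − 1 parameters can be compared with a factorized model. The comparison model below is an assumption chosen for illustration: one binary hidden cause with N binary observations that are independent given the cause, which needs one prior plus two conditional probabilities per observation.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters in a complete joint table over binary variables."""
    return 2 ** n_binary_variables - 1


def naive_bayes_parameters(n_observations):
    """One prior P(cause) plus P(observation | cause) for each of the two
    cause values, for every observation."""
    return 1 + 2 * n_observations


for n_obs in (4, 10, 20):
    # The full joint covers the cause plus all observations: n_obs + 1 variables.
    print(n_obs, full_joint_parameters(n_obs + 1), naive_bayes_parameters(n_obs))
```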

Graphical Structure Expresses our Beliefs

[Graph: P1-P2 REAL → Detected by X1, …, Detected by Xn]

Cause (often "hidden"): P1-P2 REAL
Effects (observed): Detected by X1, …, Detected by Xn

78

Graphical Structure Expresses our Beliefs

[Graph 1 nodes: P1-P2 REAL; Detected by X1, …, Detected by Xn]
[Graph 2 nodes: P1-P2 REAL; Detected by X1, …, Detected by Xj, Detected by Xj+1, …, Detected by Xn; Membrane; Highly expressed]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

80

Graphical Structure Expresses our Beliefs

[Graph 1 nodes: P1-P2 REAL; Detected by X1, …, Detected by Xn]
[Graph 2 nodes: P1-P2 REAL; Detected by X1, …, Detected by Xj, Detected by Xj+1, …, Detected by Xn; Membrane; Highly expressed]

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Graph nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades, GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Graph nodes: Smart, Grade inflation, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Graph: Smart → Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints. 85
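
The equivalence of the two formulations can be checked numerically: multiplying P(S) by P(G|S) reproduces each entry of the joint table, and summing the joint over G recovers P(S). The short script below does only this arithmetic; the variable names are chosen for the example.

```python
# Conditional formulation from the slide (False = F, True = T).
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint formulation: P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

# Going back: marginalize over G to recover P(S).
p_s_recovered = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}

for (s, g), p in joint.items():
    print(f"S={s!s:5} G={g!s:5} P(S,G)={p:.3f}")
```

Both tables have three free parameters: the joint has four entries that sum to 1, and the conditional form has one prior plus one conditional probability per value of S.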

Automated Admissions Decisions

[Graph nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

"Explaining Away"

[Graph nodes: Season; S: Sprinkler; R: Rain; Grass wet; Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about the sprinkler.

87

"Explaining Away"

[Graph: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.

88
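
The two-coin example is small enough to enumerate exactly. The sketch below builds the joint distribution from the table above and compares P(C1=H | C2=T) with P(C1=H | C2=T, Score=1). The helper names `prob` and `conditional` are made up for this illustration.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)

# Joint distribution over (C1, C2, Score); Score is 1 when the coins match.
joint = {}
for c1, c2 in product("HT", repeat=2):
    score = 1 if c1 == c2 else 0
    joint[(c1, c2, score)] = half * half


def prob(condition):
    """Total probability of all outcomes satisfying `condition`."""
    return sum(p for outcome, p in joint.items() if condition(*outcome))


def conditional(event, given):
    """P(event | given) by direct enumeration."""
    return prob(lambda c1, c2, s: event(c1, c2, s) and given(c1, c2, s)) / prob(given)


# Without the score, C2 tells us nothing about C1.
p_c1_given_c2 = conditional(
    lambda c1, c2, s: c1 == "H", lambda c1, c2, s: c2 == "T"
)

# Once the score is known, C2 determines C1.
p_c1_given_c2_and_score = conditional(
    lambda c1, c2, s: c1 == "H", lambda c1, c2, s: c2 == "T" and s == 1
)

print(p_c1_given_c2, p_c1_given_c2_and_score)
```

The coins are independent a priori, but conditioning on their common effect (the score) couples them.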

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

91

Learning Models from Data

[Graph: Coin 1 → Score ← Coin 2]

p(C1=H) =      p(C1=T) =
p(C2=H) =      p(C2=T) =

C1   C2   Score
H    H
T    T
H    T
T    H

D = {(C1, C2, S)} = (H,T,0), (H,H,1), …

92
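
When the structure is known and every variable is observed, the maximum-likelihood parameters are simply observed frequencies. The sketch below fills in the blank tables for the coin network from a small data set D. Only the first two rows come from the slide; the remaining rows and the function names are invented for the example.

```python
from collections import Counter

# Observations (C1, C2, Score); rows after the first two are hypothetical.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]


def mle_marginal(rows, index):
    """Maximum-likelihood estimate of P(variable) as a relative frequency."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def mle_score_table(rows):
    """Maximum-likelihood estimate of P(Score=1 | C1, C2)."""
    totals, ones = Counter(), Counter()
    for c1, c2, score in rows:
        totals[(c1, c2)] += 1
        ones[(c1, c2)] += score
    return {parents: ones[parents] / totals[parents] for parents in totals}


p_c1 = mle_marginal(data, 0)
p_c2 = mle_marginal(data, 1)
p_score = mle_score_table(data)
print(p_c1, p_c2, p_score)
```

With so few rows the estimates are noisy; a maximum-posterior estimate would add a prior (for example pseudocounts) to each table.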

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Graph: S (Smart) → G (Grades); S → R (GREs); G → A (Admit); R → A]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)   (why?)

(because of conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Graph: S (Smart) → G (Grades); S → R (GREs); G → A (Admit); R → A]

S:
P(F)   P(T)
0.5    0.5

G:
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R:
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

$$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\,P(R \mid S{=}T)\,P(A{=}T \mid G, R)$$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
(terms in order of G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn't help very much

GREs are better correlated with intelligence than the grades

97
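
The double sum for P(A=T | S=T) can be evaluated by enumerating the four (G, R) combinations. The sketch below stores each table as the probability that the variable is true given its parents, using the values on this slide; the names `P_G_TRUE`, `bernoulli` and `p_admit_given_smart` are illustrative.

```python
from itertools import product

# Each table stores P(variable = T | parents); False = F, True = T.
P_G_TRUE = {False: 0.1, True: 0.2}   # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}   # P(R=T | S)
P_A_TRUE = {                         # P(A=T | G, R)
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def p_admit_given_smart(s):
    """P(A=T | S=s) = sum over G, R of P(G|S) P(R|S) P(A=T|G,R)."""
    total = 0.0
    for g, r in product((False, True), repeat=2):
        total += bernoulli(P_G_TRUE[s], g) * bernoulli(P_R_TRUE[s], r) * P_A_TRUE[(g, r)]
    return total


print(round(p_admit_given_smart(True), 3))
```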

Inference with Bayes Nets

[Graph: S (Smart) → G (Grades); S → R (GREs); G → A (Admit); R → A]

$$P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}T \mid G, R)$$

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5   P(S=F) = 0.5
P(A=T|S=T) calculated on previous slide = 0.29
P(A=T|S=F) calculated analogously to P(A=T|S=T) = 0.14

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes Rule: = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

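[Graph: S (Smart) → G (Grades); S → R (GREs); G → A (Admit); R → A]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S} \sum_{G} P(S)\,P(G \mid S)\,P(R{=}F \mid S)\,P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}F \mid G, R) \quad \text{(as before)}$$

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6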

99
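
The "tedious but straightforward" sums can be handed to a short enumeration over all 16 states of (S, G, R, A). The sketch below uses the conditional probability tables from the prediction slide as transcribed here; the helper names are invented. With these tables, exact enumeration gives values close to, but not identical with, the rounded figures quoted on the slides (for example, P(A=T|S=F) evaluates to 0.164 rather than 0.14), so the printed numbers should be read as a check on the method rather than a reproduction of the slide.

```python
from itertools import product

BOOL = (False, True)

# Each table stores P(variable = T | parents); False = F, True = T.
P_S_TRUE = 0.5
P_G_TRUE = {False: 0.1, True: 0.2}   # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}   # P(R=T | S)
P_A_TRUE = {                         # P(A=T | G, R)
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def bernoulli(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(s, g, r, a):
    """Chain rule for this network: P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (
        bernoulli(P_S_TRUE, s)
        * bernoulli(P_G_TRUE[s], g)
        * bernoulli(P_R_TRUE[s], r)
        * bernoulli(P_A_TRUE[(g, r)], a)
    )


def prob(event):
    return sum(joint(*state) for state in product(BOOL, repeat=4) if event(*state))


def conditional(event, given):
    return prob(lambda *v: event(*v) and given(*v)) / prob(given)


p_admit = prob(lambda s, g, r, a: a)
p_smart_given_admit = conditional(lambda s, g, r, a: s, lambda s, g, r, a: a)
p_bad_grades_given_reject = conditional(lambda s, g, r, a: not g, lambda s, g, r, a: not a)
p_bad_gres_given_reject = conditional(lambda s, g, r, a: not r, lambda s, g, r, a: not a)

print(round(p_admit, 3), round(p_smart_given_admit, 3))
print(round(p_bad_grades_given_reject, 3), round(p_bad_gres_given_reject, 3))
```

Either way the qualitative conclusion is the same: a rejected student is more likely to have had bad grades than bad GREs, largely because bad grades are common for everyone under this grading scheme.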

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Graph: P1-P2 REAL → Detected by X1, …, Detected by Xn]

Cause (often "hidden"): P1-P2 REAL
Effects (observed): Detected by X1, …, Detected by Xn

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al. Mol. & Cell. Proteomics (2002) 1.5, 349-356

INT = high confidence interactions from small scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract 105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold-standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/
Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

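likelihood ratio = ratio of posterior probabilities: if > 1 classify as true, if < 1 classify as false

$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$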

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not)

1114 / 2150

81924 / 573734

$$\text{Likelihood} = L = \frac{P(f \mid pos)}{P(f \mid neg)}$$
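The likelihood for one feature value can be computed directly from gold-standard counts. The sketch below assumes that the digit strings on the slide read as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives having that feature value; that reading is an assumption, and the function name is hypothetical.

```python
def likelihood_ratio(pos_with_value, pos_total, neg_with_value, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_with_value / pos_total
    p_f_given_neg = neg_with_value / neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as read from the slide (an assumption about the garbled digits).
L_value = likelihood_ratio(1114, 2150, 81924, 573734)
```

A value of L above 1 means the feature value is more common among gold-standard positives than among negatives, so observing it raises the odds that the pair interacts.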

Fully connected → compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Hotspots

Figure from Clackson & Wells (1995)

Source: Clackson, Tim, and James A. Wells. "A Hot Spot of Binding Energy in a Hormone-Receptor Interface." Science 267, no. 5196 (1995): 383-6.

• Fewer than 10% of the residues at an interface contribute more than 2 kcal/mol to binding
• Hot spots
  – are rich in Trp, Arg and Tyr
  – occur on pockets on the two proteins that have complementary shapes and distributions of charged and hydrophobic residues
  – can include buried charged residues far from solvent
  – O-ring structure excludes solvent from interface

http://onlinelibrary.wiley.com/doi/10.1002/prot.21396/full

[Figure: template complex IkB / NFkB, with query ASPP2. N. Tuncbag]

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement

Courtesy of Nurcan Tuncbag. Used with permission.
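Step 1 (identifying the template interface with a distance cutoff) can be sketched as follows. This is a minimal illustration, not PRISM's implementation: it assumes one representative coordinate per residue and a 5 Å cutoff, and the residue names and coordinates are made up.

```python
import math

def interface_residues(chain_a, chain_b, cutoff=5.0):
    """Return residues of each chain lying within `cutoff` of any residue in the other chain.

    chain_a, chain_b: dict mapping residue id -> (x, y, z) representative coordinate.
    """
    a_hits, b_hits = set(), set()
    for res_a, xyz_a in chain_a.items():
        for res_b, xyz_b in chain_b.items():
            if math.dist(xyz_a, xyz_b) <= cutoff:
                a_hits.add(res_a)
                b_hits.add(res_b)
    return sorted(a_hits), sorted(b_hits)

# Hypothetical coordinates in angstroms.
chain_a = {"A10": (0.0, 0.0, 0.0), "A11": (3.0, 0.0, 0.0), "A50": (30.0, 0.0, 0.0)}
chain_b = {"B7": (4.0, 0.0, 0.0), "B90": (60.0, 0.0, 0.0)}

iface_a, iface_b = interface_residues(chain_a, chain_b, cutoff=5.0)
```

The two returned lists are the "half-interfaces" in the sense of step 2: the residues of each partner that contact the other.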

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Tuncbag, Nurcan, Attila Gursoy, et al. "Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary and Structural Similarities at Interfaces using PRISM." Nature Protocols 6, no. 9 (2011): 1341-54.

Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

∆Gcalc= - 2954 kcalmol ∆Gcalc= - 3751 kcalmol ∆Gcalc= - 4406 kcalmol

N. Tuncbag. Courtesy of Nurcan Tuncbag. Used with permission.

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.
Tuncbag N, Keskin O, Nussinov R, Gursoy A.
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

PrePPI scores potential templates without building a homology model.

Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

[Figure: template complex of NA and NB, with the homology models aligned to it]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

Evaluate based on five measures:
• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number), COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

"The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces)."

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

We will examine Bayesian classifiers soon.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002). Ho, Y. et al. Nature 415, 180-183 (2002).

Detecting protein-protein interactions

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

Tagging strategies

TAP-tag (endogenous protein levels): tandem purification
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

Yeast two-hybrid

Biotechniques 2008 Apr; 44(5): 655-62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

• Does not require purification: will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Error Rates

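• How can we estimate the error rates?

[Venn diagram: gold standard overlapping one experiment. Gold standard only = false negatives; overlap = true positives; experiment only = true positives and false positives]

Error Rates

[Venn diagram: gold standard, Experiment 1 and Experiment 2. The overlap with the gold standard = true positives from gold standard]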

Data Integration

[Venn diagram: gold standard, Experiment 1 and Experiment 2, with regions labeled I, II and III]

I = true positives from gold standard
II = consensus true positives
Fraction of consensus present in gold standard = I/II

The remaining experiment-specific interactions are a mix of true and false positives.
Define IV = true positives, V = false positives

Assume I/II = III/IV

Estimated Error Rates

[Figure: estimated false positives and true positives, with definitions of regions I, II and III]

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{I}{II} = \frac{III}{IV}$$

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.
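The assumption I/II = III/IV can be solved for IV, the number of true positives among the interactions unique to one experiment. The sketch below uses made-up counts; treating the false positives V as the unique total minus IV is an assumption about how the regions partition the data.

```python
def estimate_unique_true_positives(region_i, region_ii, region_iii):
    """Solve I/II = III/IV for IV."""
    if region_i == 0:
        raise ValueError("region I must be non-zero to estimate IV")
    return region_iii * region_ii / region_i

# Hypothetical counts of interactions in each region.
region_i = 200      # consensus interactions also found in the gold standard
region_ii = 400     # all consensus interactions (assumed true)
region_iii = 150    # experiment-specific interactions found in the gold standard
unique_total = 1000 # all experiment-specific interactions (assumed to be IV + V)

region_iv = estimate_unique_true_positives(region_i, region_ii, region_iii)
region_v = unique_total - region_iv
false_positive_fraction = region_v / unique_total
```

With these numbers half of the consensus is in the gold standard, so the 150 gold-standard hits in the unique data imply roughly 300 true positives and 700 false positives.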

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

Bayes Rule

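$$P(\text{true\_PPI} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data})}$$

posterior = P(true_PPI | Data); prior = P(true_PPI); likelihood = P(Data | true_PPI)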

Naïve Bayes Classification

posterior ∝ prior × likelihood

likelihood ratio = ratio of posterior probabilities

if > 1 classify as true; if < 1 classify as false

How do we compute this?


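likelihood ratio = ratio of posterior probabilities: if > 1 classify as true, if < 1 classify as false

$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$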


We assume the observations are independent (we'll see how to handle dependence soon).

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al. Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$
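Under the independence assumption, the ranking function is a sum of per-observation log ratios. The sketch below is a minimal illustration: the assay names and the detection rates P(detected | true_PPI) and P(detected | false_PPI) are hypothetical, standing in for values that would be estimated from high-confidence positive and negative sets.

```python
import math

# Hypothetical rates: assay -> (P(detected | true_PPI), P(detected | false_PPI)).
ASSAY_RATES = {
    "two_hybrid": (0.20, 0.01),
    "mass_spec": (0.40, 0.05),
    "co_expression": (0.60, 0.30),
}

def log_likelihood_ratio(observations):
    """Sum over assays of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)].

    observations: dict mapping assay name -> True (detected) or False (not detected).
    """
    total = 0.0
    for assay, detected in observations.items():
        p_true, p_false = ASSAY_RATES[assay]
        if not detected:
            # Use the probability of *not* detecting the pair under each hypothesis.
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        total += math.log(p_true / p_false)
    return total

candidate_pairs = {
    "P1-P2": {"two_hybrid": True, "mass_spec": True, "co_expression": True},
    "P3-P4": {"two_hybrid": False, "mass_spec": True, "co_expression": False},
    "P5-P6": {"two_hybrid": False, "mass_spec": False, "co_expression": False},
}

# Rank pairs from most to least likely to be real interactions.
ranking = sorted(candidate_pairs,
                 key=lambda name: log_likelihood_ratio(candidate_pairs[name]),
                 reverse=True)
```

Because the prior is the same for every pair, sorting by this sum gives the same order as sorting by the posterior odds.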

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio.

ROC curve

Collins et al. Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations on the curve: 9074 interactions; literature curated (excluding 2-hybrid); error rate equivalent to literature curated]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
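The two axes of the ROC curve, TP/(TP+FN) and FP/(TN+FP), can be traced by lowering a score cutoff one pair at a time. The following is a minimal sketch with made-up scores and labels; it assumes all scores are distinct and that both positives and negatives are present.

```python
def roc_points(scored_pairs):
    """scored_pairs: list of (score, is_positive). Returns [(FPR, TPR), ...], one per cutoff."""
    n_pos = sum(1 for _, is_pos in scored_pairs if is_pos)
    n_neg = len(scored_pairs) - n_pos
    points = [(0.0, 0.0)]  # cutoff above every score: nothing is called positive
    tp = fp = 0
    for _, is_pos in sorted(scored_pairs, key=lambda p: p[0], reverse=True):
        if is_pos:
            tp += 1
        else:
            fp += 1
        # FPR = FP/(TN+FP) uses the negatives; TPR = TP/(TP+FN) uses the positives.
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical gold-standard pairs: (ranking score, is it a known positive?).
example = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.2, False)]
curve = roc_points(example)
```

Each point corresponds to one likelihood cutoff; a curve that rises steeply near the left edge means many true positives are recovered before false positives accumulate.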

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• They consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

Bayesian Networks

You could try to write out all the joint probabilities
P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities
P(PPI | Rosetta, CC, GO, MIPS, ESS)
But some problems are too big

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1) => 2^N - 1 parameters
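To see how quickly the full joint table grows, the parameter count 2^N - 1 can be compared with a naïve Bayes model over the same binary variables. The naïve Bayes count below (one prior plus two conditional probabilities per observation) is a standard result added for comparison, not something stated on the slide.

```python
def full_joint_parameters(n_variables):
    """Free parameters in a full joint table over n binary variables: 2^N - 1."""
    return 2 ** n_variables - 1

def naive_bayes_parameters(n_observations):
    """One prior P(class=T), plus P(obs=T | class=T) and P(obs=T | class=F) per observation."""
    return 1 + 2 * n_observations

# One hidden class variable plus 4 or 20 observed binary variables.
small_full = full_joint_parameters(1 + 4)
small_naive = naive_bayes_parameters(4)
large_full = full_joint_parameters(1 + 20)
large_naive = naive_bayes_parameters(20)
```

With 20 observations the full table needs about two million parameters, while the factored model needs 41, which is why the graph structure matters.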

Graphical Structure Expresses our Beliefs

[Diagram: node "P1-P2 REAL" is the cause (often "hidden"); nodes "Detected by X1" … "Detected by Xn" are the effects (observed)]

[Second diagram: the same network with additional nodes "Membrane" and "Highly expressed" connected to some of the detection nodes ("Detected by Xj", "Detected by Xj+1", …)]

Naïve Bayes assumes all observations are independent

But some observations may be coupled

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Network diagram with nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).
Given Grades, GREs: will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).
You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Network diagram: Smart → Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints
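The equivalence of the two formulations can be checked by rebuilding the joint table from the conditional one, P(S,G) = P(S) P(G|S). The sketch below reads the slide's numbers with their decimal points restored (0.7, 0.95, and so on); the final line, which inverts the table with Bayes rule, is an added illustration.

```python
# Conditional formulation: P(S) and P(G | S).
p_s = {"F": 0.7, "T": 0.3}
p_g_given_s = {"F": {"F": 0.95, "T": 0.05},
               "T": {"F": 0.2, "T": 0.8}}

# Joint formulation: P(S, G) = P(S) * P(G | S).
joint = {(s, g): p_s[s] * p_g_given_s[s][g] for s in "FT" for g in "FT"}

# Inference in the other direction: P(S=T | G=T) from the joint table.
p_smart_given_good_grades = joint[("T", "T")] / (joint[("F", "T")] + joint[("T", "T")])
```

Both tables carry three free numbers: the joint has four entries that must sum to 1, and the conditional form has one prior plus one free entry per row.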

Automated Admissions Decisions

[Network diagram with nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

"Explaining Away"

[Network diagram with nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

“Explaining Away”

[Network diagram. Nodes: Coin 1, Coin 2, Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.

88
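The two-coin example is small enough to enumerate. The sketch below assumes the network on the slide (two fair coins, Score = 1 when they match) and compares beliefs about Coin 1 with and without knowledge of the score; the helper `prob` is a hypothetical name introduced for illustration.

```python
from itertools import product

# Two fair coins; Score is 1 when the coins match and 0 otherwise (from the slide).
outcomes = [(c1, c2, int(c1 == c2)) for c1, c2 in product("HT", repeat=2)]
weight = 0.25  # each (C1, C2) pair is equally likely

def prob(event, given=lambda o: True):
    """P(event | given), computed by enumerating the four equally likely outcomes."""
    denom = sum(weight for o in outcomes if given(o))
    numer = sum(weight for o in outcomes if given(o) and event(o))
    return numer / denom

c1_heads = lambda o: o[0] == "H"

# Without the score, learning C2 tells us nothing about C1.
p_marginal = prob(c1_heads)
p_given_c2 = prob(c1_heads, given=lambda o: o[1] == "T")

# Once Score = 1 is observed, C2 determines C1.
p_given_score = prob(c1_heads, given=lambda o: o[2] == 1)
p_given_score_c2 = prob(c1_heads, given=lambda o: o[2] == 1 and o[1] == "T")

print(p_marginal, p_given_c2, p_given_score, p_given_score_c2)
```

The first three values are 0.5 and the last is 0: the coins are independent a priori but become dependent once their common effect is observed.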

How do we obtain a BN?

• Two problems
  – learning the graph structure
    • NP-complete
    • approximation algorithms
  – learning the probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data X_i
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Network diagram. Nodes: Coin 1, Coin 2, Score]

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = {(C1, C2, S)} = {(H,T,0), (H,H,1), …}

92
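When the structure is known and every variable is observed, maximum-likelihood parameters reduce to counting. The sketch below is a minimal illustration for the two-coin network; only the first two tuples of `D` appear on the slide, and the rest of the data set is made up.

```python
from collections import Counter

# Hypothetical training data D = [(C1, C2, Score), ...]. The first two tuples are
# the ones shown on the slide; the remaining ones are invented for illustration.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0),
     ("H", "H", 1), ("T", "H", 0), ("H", "T", 0), ("H", "H", 1)]

n = len(D)

# Root nodes: the maximum-likelihood estimate is the observed frequency.
p_c1_heads = sum(c1 == "H" for c1, _, _ in D) / n
p_c2_heads = sum(c2 == "H" for _, c2, _ in D) / n

# Child node: estimate P(Score=1 | C1, C2) separately for each parent configuration.
parent_counts = Counter((c1, c2) for c1, c2, _ in D)
score_counts = Counter((c1, c2) for c1, c2, s in D if s == 1)
p_score1 = {pa: score_counts[pa] / parent_counts[pa] for pa in parent_counts}

print(p_c1_heads, p_c2_heads, p_score1)
```

With hidden variables or missing observations the counts are no longer available directly, which is where iterative methods such as EM or Gibbs sampling come in.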

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own.

95

Chain Rule of Probability for Bayes Nets

[Network diagram. Nodes: S (Smart), G (Grades), R (GREs), A (Admit). Edges: S → G, S → R, G → A, R → A]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)

(Why? Because of the conditional independence assumption.)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Network diagram. Nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

S:
P(S=F)   P(S=T)
0.5      0.5

G given S:
S   P(G=F)   P(G=T)
F   0.9      0.1
T   0.8      0.2

R given S:
S   P(R=F)   P(R=T)
F   0.8      0.2
T   0.2      0.8

A given G and R:
G   R   P(A=F)   P(A=T)
F   F   0.9      0.1
F   T   0.8      0.2
T   F   0.5      0.5
T   T   0.2      0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)
           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29

(The four terms correspond to G,R = F,F; F,T; T,F; T,T.)

Rough grading: being smart doesn't help very much.

GREs are better correlated with intelligence than the grades.

97
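The sum over G and R can be written as a short enumeration. This is a sketch that hard-codes the conditional probability tables from the slide; the helper names `bern` and `p_admit_given_smart` are illustrative and not part of the lecture.

```python
from itertools import product

# Conditional probability tables from the slide, stored as P(child = True | parents).
p_g_true = {False: 0.1, True: 0.2}                      # P(G=T | S)
p_r_true = {False: 0.2, True: 0.8}                      # P(R=T | S)
p_a_true = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}      # P(A=T | G, R)

def bern(p_true, value):
    """Probability of a binary value when P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def p_admit_given_smart(s):
    """P(A=T | S=s), summing over the unobserved G and R."""
    return sum(bern(p_g_true[s], g) * bern(p_r_true[s], r) * p_a_true[(g, r)]
               for g, r in product((False, True), repeat=2))

print(round(p_admit_given_smart(True), 3))  # 0.292
```

Calling `p_admit_given_smart(False)` gives the corresponding value for S=F using the S=F rows of the same tables.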

Inference with Bayes Nets

[Network diagram. Nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)
       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.23

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on the previous slide)
P(A=T|S=F) = 0.16 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes' rule: P(S=T|A=T) = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Network diagram. Nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

P(R=F|A=F) = P(R=F, A=F) / P(A=F)
           = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R)   (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.94 / 0.55 = 1.7

99
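The "tedious but straightforward" sums can be delegated to a brute-force enumeration of all 16 states. The sketch below assumes the tables from the "Prediction with Bayes Nets" slide and the factorization P(S) P(G|S) P(R|S) P(A|G,R); the function names are hypothetical.

```python
from itertools import product

p_s_true = 0.5
p_g_true = {False: 0.1, True: 0.2}                      # P(G=T | S)
p_r_true = {False: 0.2, True: 0.8}                      # P(R=T | S)
p_a_true = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}      # P(A=T | G, R)

def bern(p_true, value):
    """Probability of a binary value when P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bern(p_s_true, s) * bern(p_g_true[s], g)
            * bern(p_r_true[s], r) * bern(p_a_true[(g, r)], a))

def prob(event, given=lambda s, g, r, a: True):
    """P(event | given), summing the joint over all 16 assignments."""
    states = list(product((False, True), repeat=4))
    denom = sum(joint(*x) for x in states if given(*x))
    numer = sum(joint(*x) for x in states if given(*x) and event(*x))
    return numer / denom

not_admitted = lambda s, g, r, a: not a
p_bad_grades = prob(lambda s, g, r, a: not g, given=not_admitted)   # P(G=F | A=F)
p_bad_gres = prob(lambda s, g, r, a: not r, given=not_admitted)     # P(R=F | A=F)
p_smart_given_admit = prob(lambda s, g, r, a: s,
                           given=lambda s, g, r, a: a)              # P(S=T | A=T)

print(p_bad_grades, p_bad_gres, p_smart_given_admit)
```

Enumeration scales as 2^N and is only practical for toy networks like this one; larger networks need the factorization to be exploited explicitly.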

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Network diagram. Cause (often “hidden”): P1-P2 REAL. Effects (observed): Detected by X1, …, Detected by Xn]

101

Properties of real interactions: correlated expression
Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1.5: 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. “Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations.” Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: two patterns, one labeled “More likely to interact”]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. “An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles.” BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold-standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

109

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

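likelihood ratio (ratio of posterior probabilities):

$$\frac{P(\text{true\_PPI}\mid\text{Data})}{P(\text{false\_PPI}\mid\text{Data})} = \frac{P(\text{Data}\mid\text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data}\mid\text{false\_PPI})\,P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

log likelihood ratio:

$$\log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data}\mid\text{true\_PPI})}{P(\text{Data}\mid\text{false\_PPI})}$$

The prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

$$\log\frac{P(\text{Data}\mid\text{true\_PPI})}{P(\text{Data}\mid\text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i\mid\text{true\_PPI})}{P(\text{Observation}_i\mid\text{false\_PPI})}$$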

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

$$\text{Likelihood} = L = \frac{P(f\mid\text{pos})}{P(f\mid\text{neg})} = \frac{1114/2150}{81924/573734} \approx 3.6$$

111
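The likelihood for one value of a discrete feature is a ratio of two observed fractions. The sketch below reads the digits on the slide as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives; that split of the run-together digits is an assumption, and `likelihood_ratio` is a hypothetical helper name.

```python
def likelihood_ratio(n_pos_with_f, n_pos_total, n_neg_with_f, n_neg_total):
    """L = P(f | pos) / P(f | neg), each estimated as a fraction of a gold-standard set."""
    p_f_given_pos = n_pos_with_f / n_pos_total
    p_f_given_neg = n_neg_with_f / n_neg_total
    return p_f_given_pos / p_f_given_neg

# Assumed reading of the slide's counts (see the note above).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 1))  # 3.6
```

A value above 1 means the feature value is enriched among gold-standard positives relative to negatives.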

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected rarr Compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data}\mid\text{true\_PPI})}{P(\text{Data}\mid\text{false\_PPI})}$$

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


• Fewer than 10% of the residues at an interface contribute more than 2 kcal/mol to binding
• Hot spots
  – rich in Trp, Arg and Tyr
  – occur on pockets on the two proteins that have complementary shapes and distributions of charged and hydrophobic residues
  – can include buried charged residues far from solvent
  – O-ring structure excludes solvent from interface

http://onlinelibrary.wiley.com/doi/10.1002/prot.21396/full

16

[Figure: IkB and NFkB structures. N. Tuncbag]

1. Identify interface of template (distance cutoff)

Courtesy of Nurcan Tuncbag. Used with permission.

17


[Figure: IkB, NFkB and ASPP2 structures. N. Tuncbag]

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots

Courtesy of Nurcan Tuncbag. Used with permission.

20

[Figure: NFkB and ASPP2 structures. N. Tuncbag]

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement

Courtesy of Nurcan Tuncbag. Used with permission.

21
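Step 1 (identifying the template interface with a distance cutoff) can be sketched in a few lines. This is a generic illustration, not PRISM's actual criterion: the 5 Å cutoff, the data layout, and the function name `interface_residues` are all assumptions made for the example.

```python
from math import dist

def interface_residues(chain_a, chain_b, cutoff=5.0):
    """Return (residues of A, residues of B) that have any atom pair within `cutoff` angstroms.

    Each chain is a dict mapping a residue id to a list of (x, y, z) atom coordinates.
    """
    iface_a, iface_b = set(), set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            if any(dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
                iface_a.add(res_a)
                iface_b.add(res_b)
    return iface_a, iface_b

# Toy coordinates, invented for illustration only.
chain_a = {"A10": [(0.0, 0.0, 0.0)], "A11": [(20.0, 0.0, 0.0)]}
chain_b = {"B5": [(3.0, 0.0, 0.0)], "B6": [(40.0, 0.0, 0.0)]}
print(interface_residues(chain_a, chain_b))  # ({'A10'}, {'B5'})
```

The all-against-all loop is quadratic in the number of atoms; a real implementation would typically use a spatial index, but the logic of the cutoff test is the same.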

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Tuncbag, Nurcan, Attila Gursoy, et al. “Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary and Structural Similarities at Interfaces using PRISM.” Nature Protocols 6, no. 9 (2011): 1341-54.

22

Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

ΔGcalc = -29.54 kcal/mol   ΔGcalc = -37.51 kcal/mol   ΔGcalc = -44.06 kcal/mol

N. Tuncbag. Courtesy of Nurcan Tuncbag. Used with permission.

23

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.
Tuncbag N, Keskin O, Nussinov R, Gursoy A.
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012), doi:10.1038/nature11503

PrePPI scores potential templates without building a homology model. Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. “Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale.” Nature 490, no. 7421 (2012): 556-60.

25

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

26


[Figure: template complex of structural neighbors NA and NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

32

[Figure: template complex of structural neighbors NA and NB]

Evaluate based on five measures:
• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number) and COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

35

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 36

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon.

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002), doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002)
Ho, Y. et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. “Proteomics: Protein Complexes take the Bait.” Nature 415, no. 6868 (2002): 123-4.

40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations:
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels)
Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. “Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes.” Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5): 655-62

How does this compare to mass-spec based approaches?
• Does not require purification: will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. “Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System.” Biotechniques 44, no. 5 (2008): 655.

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard and Experiment. Region labels: False negatives; True positives; True positives and false positives]

47

Error Rates

[Venn diagram: Gold standard, Experiment 1, Experiment 2. Region label: True positives from gold standard]

48

Data Integration

[Venn diagram: Gold standard, Experiment 1, Experiment 2, with regions labeled I to V]

I = true positives from gold standard
II = consensus true positives

Fraction of consensus present in gold standard = I/II

The non-consensus data are a mix of true and false positives. III = non-consensus interactions present in the gold standard. Define IV = true positives, V = false positives.

Assume I/II = III/IV

52

Estimated Error Rates

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120, doi:10.1186/gb-2006-7-11-120

[Figure: regions I, II and III; bars of false positives and true positives]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. “How Complete are Current Yeast and Human Protein-interaction Networks?” Genome Biology 7, no. 11 (2006): 120.

53
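The assumption I/II = III/IV turns directly into an estimate of how many non-consensus interactions are real. The sketch below solves for IV and derives a false-positive fraction; all counts are hypothetical and the function name is introduced only for illustration.

```python
def estimate_unique_true_positives(n_I, n_II, n_III):
    """Solve I/II = III/IV for IV, the number of true positives in the unique data."""
    if n_I == 0:
        raise ValueError("need at least one consensus interaction in the gold standard")
    return n_III * n_II / n_I

# Hypothetical counts, not taken from the lecture or from Hart et al.
n_I = 200       # consensus interactions that are also in the gold standard
n_II = 1000     # all consensus interactions (assumed to be true positives)
n_III = 300     # unique interactions that are also in the gold standard
n_unique = 4000 # all interactions seen in only one experiment

n_IV = estimate_unique_true_positives(n_I, n_II, n_III)  # estimated true positives
n_V = n_unique - n_IV                                    # estimated false positives
false_positive_fraction = n_V / n_unique

print(n_IV, n_V, false_positive_fraction)  # 1500.0 2500.0 0.625
```

The estimate is only as good as the no-bias assumption: if the gold standard favors the kinds of interactions that both experiments detect, IV will be overestimated.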

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002), doi:10.1038/nature750

Note: log-log plot.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. “Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions.” Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

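Bayes Rule

$$\underbrace{P(\text{true\_PPI}\mid\text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\;\overbrace{P(\text{Data}\mid\text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$

58

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities:

$$\frac{P(\text{true\_PPI}\mid\text{Data})}{P(\text{false\_PPI}\mid\text{Data})} = \frac{P(\text{Data}\mid\text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data}\mid\text{false\_PPI})\,P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

How do we compute this?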

59

likelihood ratio: if > 1, classify as true; if < 1, classify as false.

log likelihood ratio:

$$\log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data}\mid\text{true\_PPI})}{P(\text{Data}\mid\text{false\_PPI})}$$

The prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

$$\log\frac{P(\text{Data}\mid\text{true\_PPI})}{P(\text{Data}\mid\text{false\_PPI})}$$

61

We assume the observations are independent (we'll see how to handle dependence soon).

Ranking function:

$$\log\frac{P(\text{Data}\mid\text{true\_PPI})}{P(\text{Data}\mid\text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i\mid\text{true\_PPI})}{P(\text{Observation}_i\mid\text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol. Cell. Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

64
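Under the independence assumption the ranking function is a sum of per-observation log ratios. The sketch below ranks two protein pairs this way; the assay names follow the slides, but every probability in `assays` is invented for illustration and would in practice be estimated from gold-standard positives and negatives.

```python
from math import log

# Hypothetical per-assay likelihoods, invented for illustration:
# (P(detected | true PPI), P(detected | false PPI)).
assays = {
    "two_hybrid": (0.20, 0.01),
    "mass_spec": (0.40, 0.05),
    "co_expression": (0.60, 0.30),
}

def log_likelihood_ratio(observations):
    """Sum log[P(obs_i | true) / P(obs_i | false)] over assays, assuming independence.

    `observations` maps an assay name to True (detected) or False (not detected);
    assays missing from the dict are simply skipped.
    """
    score = 0.0
    for name, detected in observations.items():
        p_true, p_false = assays[name]
        if not detected:
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        score += log(p_true / p_false)
    return score

pairs = {
    "P1-P2": {"two_hybrid": True, "mass_spec": True, "co_expression": True},
    "P3-P4": {"two_hybrid": False, "mass_spec": True, "co_expression": False},
}
ranked = sorted(pairs, key=lambda p: log_likelihood_ratio(pairs[p]), reverse=True)
print(ranked)
```

Skipping assays that were never run on a pair is one simple way to accommodate missing data, since an absent term contributes a log ratio of zero.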

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002), doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. “Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions.” Nature 417, no. 6887 (2002): 399-403.

65

ROC curve

Collins et al., Mol. Cell. Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Collins, Sean R., Patrick Kemmeren, et al. “Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae.” Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
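The two axes of the ROC curve can be computed by sweeping a score threshold over labeled gold-standard pairs. This is a generic sketch with made-up scores and labels, not the procedure or data of Collins et al.; `roc_points` is a hypothetical helper.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by thresholding at each distinct score, high to low.

    FPR = FP / (TN + FP) and TPR = TP / (TP + FN); a pair is called positive
    when its score is at or above the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical scores; label True = gold-standard positive, False = gold-standard negative.
scores = [4.2, 3.1, 2.5, 1.0, 0.3, -0.5]
labels = [True, True, False, True, False, False]
for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each threshold gives one point on the curve; lowering the threshold moves up and to the right until every pair is called positive at (1, 1).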

ROC curve

Collins et al., Mol. Cell. Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: “9074 interactions”; “Literature curated (excluding 2-hybrid)”; “Error rate equivalent to literature curated”]

68

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Collins, Sean R., Patrick Kemmeren, et al. “Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae.” Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

• Bayesian networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

Bayesian Networks

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  ⇒ 2^N − 1 parameters
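To make the growth concrete, here is a small sketch that compares the full joint table with a factorized model in which one hidden binary cause has several independent binary effects. The function names are illustrative and do not come from the lecture.

```python
def full_joint_params(n_vars: int) -> int:
    """Free parameters in a full joint table over n binary variables."""
    return 2 ** n_vars - 1


def one_cause_params(n_obs: int) -> int:
    """Free parameters when one binary cause has n_obs independent binary effects.

    One parameter for P(cause), plus two per observation:
    P(obs=T | cause=T) and P(obs=T | cause=F).
    """
    return 1 + 2 * n_obs


for n_obs in (4, 10, 20):
    n_vars = n_obs + 1  # the observations plus the hidden cause
    print(n_obs, full_joint_params(n_vars), one_cause_params(n_obs))
```

The full table doubles with every added variable, while the factorized model grows linearly, which is why the graph structure matters.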

77

Graphical Structure Expresses our Beliefs

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed)

78

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naïve Bayes assumes all observations are independent given the hidden cause

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

bull The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure

In a Bayesian Network, we don’t need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children)

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents)

You meet an admitted student. Is she smart?

Admit

83

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability (Smart → Grades)

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

The formulations are equivalent, and both require the same number of constraints.

85
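The equivalence of the two formulations can be checked numerically. The sketch below uses the slide's values for P(S) and P(G|S), builds the joint table from them, and then recovers the conditionals from the joint; the variable names are illustrative.

```python
# Conditional formulation, values from the slide
p_s = {False: 0.7, True: 0.3}                      # P(S)
p_g_given_s = {False: {False: 0.95, True: 0.05},   # P(G | S=F)
               True:  {False: 0.20, True: 0.80}}   # P(G | S=T)

# Joint formulation: P(S, G) = P(S) * P(G | S)
joint = {(s, g): p_s[s] * p_g_given_s[s][g]
         for s in (False, True) for g in (False, True)}

for (s, g), p in joint.items():
    print(f"S={s!s:5} G={g!s:5} P(S,G)={p:.3f}")

# Going back: marginalize to get P(S), then divide to get P(G | S)
p_s_back = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}
p_g_back = {s: {g: joint[(s, g)] / p_s_back[s] for g in (False, True)}
            for s in (False, True)}
```

Both representations carry three free numbers: the joint has four entries that must sum to 1, and the conditional form has one for P(S) plus one per row of P(G|S).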

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network, we don’t need the full probability distribution. The joint probability depends only on “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

Admit

86

ldquoExplaining Awayrdquo Season

S Sprinkler

R Rain

Grass wet

Does the probability that it’s raining depend on whether the sprinkler is on?

Slippery

In a causal sense, clearly not. But in a probabilistic model, once we observe that the grass is wet, the knowledge that it is raining influences our belief that the sprinkler is on.

87

Coin 1

Coin 2

Score

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
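The two-coin example is small enough to enumerate. The sketch below computes P(C1=H) under three conditions, assuming two fair, independent coins and Score = 1 when they match, as in the table. The helper `prob` is illustrative.

```python
from itertools import product

# Two fair, independent coins; Score = 1 if they match, else 0.
outcomes = [(c1, c2, 1 if c1 == c2 else 0, 0.25)
            for c1, c2 in product("HT", repeat=2)]


def prob(event, given=lambda c1, c2, s: True):
    """P(event | given) by enumerating the four equally likely outcomes."""
    num = sum(p for c1, c2, s, p in outcomes
              if given(c1, c2, s) and event(c1, c2, s))
    den = sum(p for c1, c2, s, p in outcomes if given(c1, c2, s))
    return num / den


def c1_heads(c1, c2, s):
    return c1 == "H"


p_given_c2 = prob(c1_heads, given=lambda c1, c2, s: c2 == "T")
p_given_score = prob(c1_heads, given=lambda c1, c2, s: s == 1)
p_given_both = prob(c1_heads, given=lambda c1, c2, s: s == 1 and c2 == "T")
print(p_given_c2, p_given_score, p_given_both)
```

Without the score, learning C2 leaves P(C1=H) at 0.5. Once the score is known, learning C2 fixes C1 completely, which is the coupling the slide describes.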

ldquoExplaining Awayrdquo

88

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior

• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D = (C1, C2, S) = (H,T,0), (H,H,1), …
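When the structure is known and every variable is observed, the maximum-likelihood parameters are simply observed frequencies. A minimal sketch follows; the first two records match the slide's examples and the rest are made up for illustration.

```python
from collections import Counter

# Hypothetical observations (C1, C2, Score); only the first two come from
# the slide, the remaining rows are invented for this example.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0),
        ("H", "H", 1), ("H", "T", 0), ("T", "H", 0), ("H", "H", 1)]

n = len(data)
c1_counts = Counter(c1 for c1, _, _ in data)
c2_counts = Counter(c2 for _, c2, _ in data)

# Maximum-likelihood estimates of the root probabilities are frequencies.
p_c1_heads = c1_counts["H"] / n
p_c2_heads = c2_counts["H"] / n

# P(Score=1 | C1, C2): fraction of rows with Score=1 for each parent setting.
pair_counts = Counter((c1, c2) for c1, c2, _ in data)
score_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
p_score1 = {pair: score_counts[pair] / total
            for pair, total in pair_counts.items()}

print(p_c1_heads, p_c2_heads, p_score1)
```

With hidden variables or missing data, counting is no longer enough and an iterative method such as EM is needed.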

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G: Grades

S: Smart

R: GREs

A: Admit

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

= P(S) P(G|S) P(R|S) P(A|G,R) (why?)

(because of the conditional independence assumptions: R is independent of G given S, and A is independent of S given G and R)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

Network: S (Smart) → G (Grades), S → R (GREs); G, R → A (Admit)

S:   P(F)   P(T)
     0.5    0.5

G:   S   P(F)   P(T)
     F   0.9    0.1
     T   0.8    0.2

R:   S   P(F)   P(T)
     F   0.8    0.2
     T   0.2    0.8

A:   G   R   P(F)   P(T)
     F   F   0.9    0.1
     F   T   0.8    0.2
     T   F   0.5    0.5
     T   T   0.2    0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
  (terms in order G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn’t help very much

GREs are better correlated with intelligence than the grades
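The sum over the unobserved G and R can be written directly as code. This sketch uses the conditional probability tables from this slide; the dictionary layout and function name are illustrative choices.

```python
from itertools import product

F, T = False, True
p_g = {F: {F: 0.9, T: 0.1}, T: {F: 0.8, T: 0.2}}   # p_g[s][g] = P(G=g | S=s)
p_r = {F: {F: 0.8, T: 0.2}, T: {F: 0.2, T: 0.8}}   # p_r[s][r] = P(R=r | S=s)
p_admit = {(F, F): 0.1, (F, T): 0.2,               # P(A=T | G, R)
           (T, F): 0.5, (T, T): 0.8}


def p_admit_given_smart(s: bool) -> float:
    """P(A=T | S=s), summing over the unobserved G and R."""
    return sum(p_g[s][g] * p_r[s][r] * p_admit[(g, r)]
               for g, r in product((F, T), repeat=2))


print(round(p_admit_given_smart(T), 3))
```

Calling `p_admit_given_smart(F)` gives the corresponding value for S=F from the same tables.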

97

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.23

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on the previous slide)
P(A=T|S=F) = 0.16 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes’ rule: = P(S=T) P(A=T|S=T) / P(A=T) ≈ 0.64

98

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F) = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S} Σ_{G} Σ_{R} P(S) P(G|S) P(R|S) P(A=F|G,R) = 1 − P(A=T)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) = 0.94 / 0.55 = 1.7
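Because the network has only four binary variables, these posteriors can be obtained by brute-force enumeration of the joint distribution. The sketch below assumes the conditional probability tables of the worked example as transcribed here; the helpers `joint` and `prob` are illustrative.

```python
from itertools import product

F, T = False, True
p_s = {F: 0.5, T: 0.5}                               # P(S)
p_g = {F: {F: 0.9, T: 0.1}, T: {F: 0.8, T: 0.2}}     # P(G | S)
p_r = {F: {F: 0.8, T: 0.2}, T: {F: 0.2, T: 0.8}}     # P(R | S)
p_a_true = {(F, F): 0.1, (F, T): 0.2,                # P(A=T | G, R)
            (T, F): 0.5, (T, T): 0.8}


def joint(s, g, r, a):
    """P(S, G, R, A) from the chain rule P(S) P(G|S) P(R|S) P(A|G,R)."""
    pa = p_a_true[(g, r)] if a else 1.0 - p_a_true[(g, r)]
    return p_s[s] * p_g[s][g] * p_r[s][r] * pa


def prob(**fixed):
    """Marginal probability of the assignments in `fixed`, e.g. prob(a=F, g=F)."""
    total = 0.0
    for s, g, r, a in product((F, T), repeat=4):
        world = {"s": s, "g": g, "r": r, "a": a}
        if all(world[k] == v for k, v in fixed.items()):
            total += joint(s, g, r, a)
    return total


p_bad_grades = prob(g=F, a=F) / prob(a=F)   # P(G=F | A=F)
p_bad_gres = prob(r=F, a=F) / prob(a=F)     # P(R=F | A=F)
p_smart = prob(s=T, a=T) / prob(a=T)        # P(S=T | A=T)
print(round(p_bad_grades, 2), round(p_bad_gres, 2), round(p_smart, 2))
```

Enumeration scales as 2^N, so it is only practical for toy networks like this one; larger networks need the factorization to be exploited during inference.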

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed) 101

Properties of real interactions: correlated expression
Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations. Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al BMC Bioinformatics 2007 8(Suppl 4)S7 doi1011861471-2105-8-S4-S7

More likely to interact

Which pattern below is more likely to represent a pair of interacting proteins

104

Courtesy of Cokus et al License CC-BYSource Cokus Shawn Sayaka Mizutani et al An Improved Method for Identifying FunctionallyLinked Proteins using Phylogenetic Profiles BMC Bioinformatics 8 no Suppl 4 (2007) S7

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

109

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

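likelihood ratio (ratio of posterior probabilities) =

$$\frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} \cdot \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio =

$$\log \frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function =

$$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})} = \sum_{i}^{M} \log \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$$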

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

Likelihood = L = P(f|pos) / P(f|neg)

e.g. L = (1114/2150) / (81924/573734) ≈ 3.6
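The likelihood for one feature value is a ratio of two gold-standard frequencies. The sketch below assumes the digit runs on the slide read as 1114 of 2150 positives and 81924 of 573734 negatives; that reading, and the function name, are assumptions rather than something the slide states.

```python
def likelihood_ratio(pos_with_f: int, pos_total: int,
                     neg_with_f: int, neg_total: int) -> float:
    """L = P(f | pos) / P(f | neg), each estimated as a gold-standard frequency."""
    return (pos_with_f / pos_total) / (neg_with_f / neg_total)


# Assumed reading of the slide's counts: 1114 of 2150 gold-standard positives
# and 81924 of 573734 gold-standard negatives share the feature value f.
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))
```

A value of L above 1 means the feature value is enriched among gold-standard positives relative to negatives.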

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected → Compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Ikb

NFkB

N Tuncbag

1 Identify interface of template (distance cutoff)

Courtesy of Nurcan Tuncbag Used with permission 17


Ikb

NFkB

N Tuncbag

ASPP2

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots

Courtesy of Nurcan Tuncbag Used with permission 20

NFkB

N Tuncbag

ASPP2

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement
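Step 1 above, identifying the template interface with a distance cutoff, can be sketched as follows. This is not PRISM's implementation: the one-representative-atom-per-residue simplification, the 6.0 Å cutoff, and the toy coordinates are all assumptions made for illustration.

```python
from math import dist

Residue = tuple[str, tuple[float, float, float]]  # (label, representative atom xyz)


def interface_residues(chain_a: list[Residue], chain_b: list[Residue],
                       cutoff: float) -> tuple[set[str], set[str]]:
    """Return the residues of each chain lying within `cutoff` of the other chain."""
    side_a, side_b = set(), set()
    for name_a, xyz_a in chain_a:
        for name_b, xyz_b in chain_b:
            if dist(xyz_a, xyz_b) <= cutoff:
                side_a.add(name_a)
                side_b.add(name_b)
    return side_a, side_b


# Toy coordinates in angstroms, invented for illustration.
chain_a = [("A:LEU10", (0.0, 0.0, 0.0)), ("A:GLY11", (3.8, 0.0, 0.0)),
           ("A:SER50", (30.0, 0.0, 0.0))]
chain_b = [("B:TYR5", (2.0, 4.0, 0.0)), ("B:ASP90", (40.0, 40.0, 0.0))]

half_a, half_b = interface_residues(chain_a, chain_b, cutoff=6.0)
print(sorted(half_a), sorted(half_b))
```

The two returned sets correspond to the two half-interfaces against which a query surface would be aligned in step 2.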

Courtesy of Nurcan Tuncbag Used with permission 21

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Tuncbag Nurcan Attila Gursoy et al Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary andStructural Similarities at Interfaces using PRISM Nature Protocols 6 no 9 (2011) 1341-54

22

Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

∆Gcalc= - 2954 kcalmol ∆Gcalc= - 3751 kcalmol ∆Gcalc= - 4406 kcalmol

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 23

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement
Tuncbag N, Keskin O, Nussinov R, Gursoy A
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

PrePPI Scores potential templates without building a homology model Criteria

Geometric similarity between the protomer and template Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

25

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

26


NA

NA

NB

NB

1 Identify interacting residues in template complex (Called NA1 NB3 in rest of paper)

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

31

NA

NA

NB

NB

1 Identify interacting residues in template complex (Called NA1 NB3 in rest of paper)

2 Predict interacting residues for the homology models

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

32

NA

NA

NB

NB

Evaluate based on five measures

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

33

NA

NA

NB

NB

Evaluate based on five measures bullSIM structural similarity of NAMA and NBMB

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 34

Evaluate based on five measures
• SIM: structural similarity of NA/MA and NB/MB
• SIZ (number), COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

35

ldquoThe final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example residue type evolutionary conservation or statistical propensity to be in proteinndashprotein interfaces)rdquo

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 36

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

37

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

38

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

39

Proteomics Protein complexes take the bait Anuj Kumar and Michael Snyder Nature 415 123-124(10 January 2002) doi101038415123a

Gavin A-C et al Nature 415 141-147 (2002) Ho Y et al Nature 415 180-183 (2002)

Detecting protein-protein interactions

What are the likely false positives What are the likely false negatives

40

Courtesy of Macmillan Publishers LimitedUsed with permissionSource Kumar Anuj and Michael Snyder Proteomics Protein Complexes take the Bait Nature 415 no 6868 (2002) 123-4 40

Mass-spec for protein-protein interactions

bull Extremely efficient method for detecting interactions

bull Proteins are in their correct subcellular location

Limitations

41

Mass-spec for protein-protein interactions

bull Extremely efficient method for detecting interactions

bull Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels)
Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al (2002) Nature

Ho et al (2002) Nature over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited Used with permissionSource Gavin Anne-Claude Markus Boumlsche et al Functional Organization of the Yeast Proteome bySystematic Analysis of Protein Complexes Nature 415 no 6868 (2002) 141-7

43

Biotechniques 2008 Apr44(5)655-62 Ratushny V Golemis E

How does this compare to mass-spec based approaches

Yeast two-hybrid

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of Cell Signaling Pathwaysusing the Evolving Yeast Two-hybrid System Biotechniques 44 no 5 (2008) 655

44

• Does not require purification – will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Biotechniques 2008 Apr44(5)655-62 Ratushny V Golemis E

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of CellSignaling Pathways using the Evolving Yeast Two-hybrid System Biotechniques44 no 5 (2008) 655

45

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

46

Error Rates

bull How can we estimate the error rates

Gold standard Experiment

False negatives

True positives

True positives

and false

positives

47

Error Rates

Gold standard

Experiment 1

Experiment 2

True positives

from gold standard

48

Data Integration

Gold standard

Experiment 1

Experiment 2

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

49

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Mix of true and false

positives

III

50

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Define IV =

true positives V =

false positives

IV true

V false

III

51

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

IV true

V false

III

52

Estimated Error Rates

False positives

True positives

How complete are current yeast and human protein-interaction networks? G Traver Hart, Arun K Ramani and Edward M Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

I=

II=

III=

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV
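The proportionality assumption turns into a two-line estimate of the false-positive count in an experiment's unique interactions. In this sketch, II is taken to be the whole consensus set (including I) and IV all true unique interactions (including III); that reading of the regions, the function name, and the counts are assumptions for illustration.

```python
def estimate_false_positives(region_i: int, region_ii: int,
                             region_iii: int, unique_total: int) -> dict:
    """Apply I/II = III/IV to split an experiment's unique interactions.

    region_i     : consensus interactions that are also in the gold standard
    region_ii    : all consensus interactions (assumed to be true)
    region_iii   : unique interactions that are also in the gold standard
    unique_total : all interactions unique to this experiment (IV + V)
    """
    region_iv = region_iii * region_ii / region_i   # estimated true positives
    region_v = unique_total - region_iv             # estimated false positives
    return {"IV": region_iv, "V": region_v,
            "false_positive_rate": region_v / unique_total}


# Hypothetical counts, for illustration only.
result = estimate_false_positives(region_i=200, region_ii=1000,
                                  region_iii=100, unique_total=2000)
print(result)
```

The estimate is only as good as the no-bias assumption: if the gold standard favors interactions that one assay detects more readily, I/II and III/IV will differ.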

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

Note log-log plot Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scale DataSets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

Bayes Rule

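$$P(\mathrm{true\_PPI} \mid \mathrm{Data}) = \frac{P(\mathrm{true\_PPI}) \, P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data})}$$

posterior = prior × likelihood / P(Data)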

58

Naiumlve Bayes Classification

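posterior ∝ prior × likelihood

likelihood ratio = ratio of posterior probabilities =

$$\frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} \cdot \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

How do we compute this?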

59


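likelihood ratio (ratio of posterior probabilities) =

$$\frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} \cdot \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio =

$$\log \frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function =

$$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$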

61

We assume the observations are independent (we’ll see how to handle dependence soon)

Ranking function =

$$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

62

We assume the observations are independent (we’ll see how to handle dependence soon)

Ranking function =

$$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$$

63

We assume the observations are independent (wersquoll see how to handle dependence soon) We can compute these terms if we have a set of high-confidence positive and negative interactions Exactly how we compute the terms depends on the type of data For affinity purificationmass spec see Collins et al Mol Cell Proteomics 2007 httpwwwmcponlineorgcontent63439long

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod
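The ranking function can be sketched in a few lines of Python. This is a minimal illustration under the independence assumption only; the assay names and the detection rates in `p_true` and `p_false` are made-up placeholders, not values from Collins et al. or any other data set.

```python
import math

def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum of per-assay log ratios; valid only if the assays are independent."""
    score = 0.0
    for assay, detected in observations.items():
        # Probability of this outcome (detected or not) under each hypothesis.
        p_t = p_given_true[assay] if detected else 1.0 - p_given_true[assay]
        p_f = p_given_false[assay] if detected else 1.0 - p_given_false[assay]
        score += math.log(p_t / p_f)
    return score

# Hypothetical detection rates, as if estimated from gold-standard sets.
p_true = {"two_hybrid": 0.30, "mass_spec": 0.50}
p_false = {"two_hybrid": 0.01, "mass_spec": 0.05}

both = log_likelihood_ratio({"two_hybrid": True, "mass_spec": True}, p_true, p_false)
one = log_likelihood_ratio({"two_hybrid": True, "mass_spec": False}, p_true, p_false)
none = log_likelihood_ratio({"two_hybrid": False, "mass_spec": False}, p_true, p_false)
```

Because the logarithm turns the product into a sum, each assay contributes its own additive piece of evidence, and a pair seen in only one assay still gets a usable score.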

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio, log [P(Data | true_PPI) / P(Data | false_PPI)].

65

Collins et al Mol Cell Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = ?
True negatives = ?

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD).

True negatives = protein pairs that
1. are annotated as belonging to distinct complexes, and
2. have different subcellular locations OR anticorrelated mRNA expression.
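Given scores and the two gold-standard sets, the ROC points follow directly from the axis definitions. The sketch below is illustrative only: the protein pairs, scores, and set contents are invented, and the function name `roc_points` is not from Collins et al.

```python
def roc_points(scores, gold_positive, gold_negative):
    """Return (FPR, TPR) pairs, one per score cutoff, highest cutoff first."""
    points = []
    for cutoff in sorted(set(scores.values()), reverse=True):
        predicted = {pair for pair, s in scores.items() if s >= cutoff}
        tp = len(predicted & gold_positive)
        fp = len(predicted & gold_negative)
        tpr = tp / len(gold_positive)   # TP / (TP + FN)
        fpr = fp / len(gold_negative)   # FP / (TN + FP)
        points.append((fpr, tpr))
    return points

# Toy data: four scored pairs, two gold positives, two gold negatives.
scores = {("A", "B"): 3.0, ("A", "C"): 2.0, ("B", "D"): 1.0, ("C", "D"): -1.0}
gold_pos = {("A", "B"), ("B", "D")}
gold_neg = {("A", "C"), ("C", "D")}
curve = roc_points(scores, gold_pos, gold_neg)
```

Pairs that belong to neither gold-standard set do not affect the curve, which is why the quality of both reference sets matters so much.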

68 67


Collins et al Mol Cell Proteomics 2007

9074 interactions

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]

68


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)


75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big.

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  => 2^N − 1 parameters
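A quick count shows how fast the full table grows compared with a model that assumes the observations are independent given the class. This is a sketch; the function names are illustrative, and the naïve Bayes count assumes one binary class variable plus n binary observations.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters of a full joint table over N binary variables: 2^N - 1."""
    return 2 ** n_binary_variables - 1

def naive_bayes_parameters(n_observations):
    """One prior P(class) plus P(obs_i | class) for each of the two class values."""
    return 1 + 2 * n_observations

# One hidden "real interaction" variable plus n observed assays.
comparison = {
    n: (full_joint_parameters(n + 1), naive_bayes_parameters(n))
    for n in (4, 10, 20)
}
```

The full table grows exponentially in the number of assays while the factored model grows linearly, which is the practical motivation for imposing graphical structure.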

77

Graphical Structure Expresses our Beliefs

[Diagram: "P1-P2 REAL", the cause (often “hidden”), points to the effects (observed): "Detected by X1", …, "Detected by Xn".]

78

Graphical Structure Expresses our Beliefs

[Diagram, first network: "P1-P2 REAL" points to "Detected by X1", …, "Detected by Xn". Second network: "P1-P2 REAL" with "Detected by X1", …, "Detected by Xn" and "Detected by Xj", "Detected by Xj+1", …, plus the additional nodes "Membrane" and "Highly expressed".]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs).

We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints.

85
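The equivalence of the two formulations can be checked numerically: multiplying the prior by the conditional table reproduces the joint table. The sketch below uses the numbers from the slide; the variable names are illustrative.

```python
# Prior on Smart and conditional table for Grades, as given on the slide.
p_s = {"F": 0.7, "T": 0.3}
p_g_given_s = {"F": {"F": 0.95, "T": 0.05}, "T": {"F": 0.2, "T": 0.8}}

# Joint table: P(S, G) = P(S) * P(G | S)
joint = {(s, g): p_s[s] * p_g_given_s[s][g] for s in "FT" for g in "FT"}

# Going the other way: infer Smart from good grades with Bayes' rule.
p_smart_given_good_grades = joint[("T", "T")] / (joint[("T", "T")] + joint[("F", "T")])
```

Both forms have three free parameters: the joint table has four entries that sum to one, and the conditional form has one prior plus one free entry per row.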

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on “parents”.

For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

“Explaining Away”

[Diagram nodes: Season, S (Sprinkler), R (Rain), Grass wet, Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

“Explaining Away”

[Diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
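The two-coin example is small enough to check by enumeration. The sketch below lists the four equally likely worlds and conditions on different evidence; the helper `prob` is an illustrative name, not part of any library.

```python
from itertools import product

# Each world is (C1, C2, Score); fair independent coins make all four equally likely.
worlds = [(c1, c2, int(c1 == c2)) for c1, c2 in product("HT", repeat=2)]

def prob(event, given=lambda w: True):
    """P(event | given), computed by counting equally likely worlds."""
    consistent = [w for w in worlds if given(w)]
    return sum(1 for w in consistent if event(w)) / len(consistent)

c1_heads = lambda w: w[0] == "H"

p_prior = prob(c1_heads)
p_given_c2 = prob(c1_heads, lambda w: w[1] == "T")
p_given_score = prob(c1_heads, lambda w: w[2] == 1)
p_given_score_and_c2 = prob(c1_heads, lambda w: w[2] == 1 and w[1] == "T")
```

Without the score, learning C2 tells us nothing about C1. Once the score is observed, the two coins become coupled: knowing C2 then fixes C1 completely.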

88

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data X_i
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

91

Learning Models from Data

[Diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = ?    p(C1=T) = ?
p(C2=H) = ?    p(C2=T) = ?

C1    C2    Score
H     H     ?
T     T     ?
H     T     ?
T     H     ?

D = (C1, C2, S) = (H, T, 0), (H, H, 1), …
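When the structure is known and every variable is observed, the maximum-likelihood parameters are just relative frequencies. The sketch below illustrates this on a made-up data set `data` in the (C1, C2, S) format; only the first two tuples come from the slide, and the function names are illustrative.

```python
from collections import Counter

# Hypothetical training data D = (C1, C2, S); the last three tuples are invented.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]

def ml_estimate(values):
    """Maximum-likelihood estimate of a discrete distribution: relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def ml_score_table(rows):
    """Estimate P(Score=1 | C1, C2) separately for each observed parent configuration."""
    by_parents = {}
    for c1, c2, s in rows:
        by_parents.setdefault((c1, c2), []).append(s)
    return {parents: sum(s) / len(s) for parents, s in by_parents.items()}

p_c1 = ml_estimate(c1 for c1, _, _ in data)
p_c2 = ml_estimate(c2 for _, c2, _ in data)
p_score1 = ml_score_table(data)
```

With few observations these estimates are noisy, which is one reason a maximum-posterior objective with a prior (for example pseudocounts) is often preferred.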

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Diagram nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)    (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Diagram nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

S:
P(F)    P(T)
0.5     0.5

G:
S    P(F)    P(T)
F    0.9     0.1
T    0.8     0.2

R:
S    P(F)    P(T)
F    0.8     0.2
T    0.2     0.8

A:
G    R    P(F)    P(T)
F    F    0.9     0.1
F    T    0.8     0.2
T    F    0.5     0.5
T    T    0.2     0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)
           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
             [terms in order (G,R) = (F,F), (F,T), (T,F), (T,T)]

Rough grading: being smart doesn't help very much.

GREs are better correlated with intelligence than the grades.
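The double sum can be checked with a few lines of Python. This sketch assumes the conditional tables read G given S, R given S, and A given (G, R) as transcribed from the slide; the dictionary layout and function name are illustrative. With these tables the S=T case reproduces the quoted 0.29, while the S=F case evaluates to about 0.16, so treat the table values as an assumption about what the slide intended.

```python
# p_g[s][g] = P(G=g | S=s); p_r[s][r] = P(R=r | S=s); p_admit[(g, r)] = P(A=T | G=g, R=r)
p_g = {"F": {"F": 0.9, "T": 0.1}, "T": {"F": 0.8, "T": 0.2}}
p_r = {"F": {"F": 0.8, "T": 0.2}, "T": {"F": 0.2, "T": 0.8}}
p_admit = {("F", "F"): 0.1, ("F", "T"): 0.2, ("T", "F"): 0.5, ("T", "T"): 0.8}

def p_admit_given_smart(s):
    """P(A=T | S=s): sum over the unobserved grades G and GREs R."""
    return sum(
        p_g[s][g] * p_r[s][r] * p_admit[(g, r)]
        for g in "FT"
        for r in "FT"
    )

p_admit_if_smart = p_admit_given_smart("T")
p_admit_if_not_smart = p_admit_given_smart("F")
```

Summing out G and R is cheap here because the network factorizes; the same calculation on a full joint table would need every entry.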

97

Inference with Bayes Nets

[Diagram nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)
       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T), calculated on the previous slide, = 0.29
P(A=T|S=F), calculated analogously to P(A=T|S=T), = 0.14

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes Rule: = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Diagram nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F)
           = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S} Σ_{G} Σ_{R} P(S) P(G|S) P(R|S) P(A=F|G,R)    (as before)

Tedious but straightforward to compute.

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6
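The "tedious" sums are easy to automate by enumerating all 16 assignments of (S, G, R, A). The sketch below assumes the conditional tables as transcribed on the prediction slide; names such as `joint` and `prob` are illustrative. The results land close to, though not exactly on, the rounded values quoted on the slides.

```python
from itertools import product

P_S = {"F": 0.5, "T": 0.5}
P_G = {"F": {"F": 0.9, "T": 0.1}, "T": {"F": 0.8, "T": 0.2}}   # P_G[s][g]
P_R = {"F": {"F": 0.8, "T": 0.2}, "T": {"F": 0.2, "T": 0.8}}   # P_R[s][r]
P_A_TRUE = {("F", "F"): 0.1, ("F", "T"): 0.2, ("T", "F"): 0.5, ("T", "T"): 0.8}

def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R), the factorization of the network."""
    p_a = P_A_TRUE[(g, r)] if a == "T" else 1.0 - P_A_TRUE[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * p_a

def prob(event, given=lambda s, g, r, a: True):
    """P(event | given) by brute-force enumeration of all assignments."""
    numerator = denominator = 0.0
    for s, g, r, a in product("FT", repeat=4):
        if given(s, g, r, a):
            p = joint(s, g, r, a)
            denominator += p
            if event(s, g, r, a):
                numerator += p
    return numerator / denominator

rejected = lambda s, g, r, a: a == "F"
p_smart_given_admit = prob(lambda s, g, r, a: s == "T", lambda s, g, r, a: a == "T")
p_bad_gre_given_reject = prob(lambda s, g, r, a: r == "F", rejected)
p_bad_grades_given_reject = prob(lambda s, g, r, a: g == "F", rejected)
```

Enumeration scales as 2^N and is only practical for toy networks; larger networks need inference algorithms that exploit the graph structure.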

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: "P1-P2 REAL", the cause (often “hidden”), points to the effects (observed): "Detected by X1", …, "Detected by Xn".]

101

Properties of real interactions: correlated expression
Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102


Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: two patterns; one is labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data


109

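$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})\; P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\; P(\text{Data} \mid \text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

The prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$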

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

[Figure annotations: 1114 / 2150 and 81924 / 573734]

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$$
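For a categorical feature, each term of L is a ratio of two frequencies taken from the gold-standard sets. The sketch below assumes the digits on the slide read as 1,114 of 2,150 gold-standard positives and 81,924 of 573,734 gold-standard negatives falling in one category; that reading is an assumption about the figure, and the function name is illustrative.

```python
def likelihood_ratio(pos_in_bin, pos_total, neg_in_bin, neg_total):
    """L = P(f | pos) / P(f | neg), with both terms estimated as simple frequencies."""
    p_f_given_pos = pos_in_bin / pos_total
    p_f_given_neg = neg_in_bin / neg_total
    return p_f_given_pos / p_f_given_neg

# Assumed reading of the slide's counts for a single essentiality category.
L_example = likelihood_ratio(1114, 2150, 81924, 573734)
```

A value above 1 means the category is enriched among gold-standard positives, so observing it raises the odds that a pair truly interacts.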

111

112


113

Fully connected → compute probabilities for all 16 possible combinations

114

Interpret with caution as numbers are small


115

[Figure annotation: TP = FP]

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


[Figure labels: IkB, NFkB, ASPP2 (N. Tuncbag)]

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test:
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement

Courtesy of Nurcan Tuncbag. Used with permission.
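Step 1 can be illustrated with a simple distance test between the atoms of two chains. This is a sketch, not PRISM's implementation: the 5.0 Å default cutoff, the residue identifiers, the coordinates, and the function name are all assumptions made for the example.

```python
import math

def interface_residues(chain_a, chain_b, cutoff=5.0):
    """Residues with at least one atom within `cutoff` of any atom of the partner chain.

    Each chain maps a residue id to a list of (x, y, z) atom coordinates.
    """
    interface_a, interface_b = set(), set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            if any(math.dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
                interface_a.add(res_a)
                interface_b.add(res_b)
    return interface_a, interface_b

# Toy coordinates: one close pair of residues and two residues far from everything.
chain_a = {"A10": [(0.0, 0.0, 0.0)], "A11": [(20.0, 0.0, 0.0)]}
chain_b = {"B5": [(3.0, 0.0, 0.0)], "B6": [(40.0, 0.0, 0.0)]}
half_interface_a, half_interface_b = interface_residues(chain_a, chain_b)
```

The two returned sets correspond to the half-interfaces that a query surface would then be aligned against in step 2. The all-against-all loop is quadratic in the number of atoms; real structures would call for a spatial index.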

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Tuncbag Nurcan Attila Gursoy et al Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary andStructural Similarities at Interfaces using PRISM Nature Protocols 6 no 9 (2011) 1341-54

22

Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

∆Gcalc = −29.54 kcal/mol    ∆Gcalc = −37.51 kcal/mol    ∆Gcalc = −44.06 kcal/mol

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 23

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.
Tuncbag N, Keskin O, Nussinov R, Gursoy A.
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012) doi:10.1038/nature11503

PrePPI scores potential templates without building a homology model.

Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

25

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NA_i, NB_i) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

30

[Figure labels: NA, NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

Evaluate based on five measures:
• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number), COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

35

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012) doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NA_i, NB_i) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon.

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder, Nature 415, 123-124 (10 January 2002) doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002); Ho, Y. et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations:
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels)
Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Biotechniques. 2008 Apr;44(5):655-62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

• Does not require purification: will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: "Gold standard" and "Experiment" overlap. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

47

Error Rates

[Venn diagram: "Gold standard", "Experiment 1", "Experiment 2". Overlap with the gold standard: true positives from gold standard.]

48

Data Integration

[Figure: Venn diagram of the gold standard, Experiment 1, and Experiment 2, with regions labeled I to V]

• I = true positives from the gold standard (consensus interactions that are also present in the gold standard)
• II = consensus true positives
• Fraction of consensus present in gold standard = I/II
• III = gold-standard interactions among the data unique to one experiment
• The data unique to one experiment are a mix of true and false positives. Define IV = true positives, V = false positives
• Assume I/II = III/IV

52

Estimated Error Rates

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

[Figure: estimated true positives and false positives, with regions I, II, and III marked]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{I}{II} = \frac{III}{IV}$$
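The assumption I/II = III/IV can be rearranged to IV = III × II / I, which gives an estimate of the true positives hidden in the data unique to one experiment. The sketch below illustrates that arithmetic only; the function name and the counts are hypothetical and are not taken from Hart et al.

```python
def estimate_unique_error(region_i, region_ii, region_iii, unique_total):
    """Estimate true (IV) and false (V) positives among interactions unique to one experiment.

    Uses the assumption I/II = III/IV, i.e. IV = III * II / I.
    """
    if region_i <= 0:
        raise ValueError("region I must be positive to form the ratio I/II")
    coverage = region_i / region_ii            # fraction of consensus found in the gold standard
    region_iv = region_iii / coverage          # estimated true positives in the unique data
    region_iv = min(region_iv, unique_total)   # cannot exceed the number of unique interactions
    region_v = unique_total - region_iv        # the remainder are estimated false positives
    return region_iv, region_v, region_v / unique_total


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
iv, v, fp_fraction = estimate_unique_error(
    region_i=100, region_ii=200, region_iii=50, unique_total=400
)
print(iv, v, fp_fraction)
```

With these made-up counts, half of the consensus is in the gold standard, so the 50 gold-standard hits in the unique data imply about 100 true positives there.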

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$$\text{Data} = (\text{two\_hybrid},\ \text{mass\_spec},\ \text{co\_evolution},\ \text{co\_expression})$$

or

$$\text{Data} = (\text{mass\_spec\_expt\_1},\ \text{mass\_spec\_expt\_2})$$

57

Bayes Rule

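$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\ \overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$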

58

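Naïve Bayes Classification

posterior ∝ prior × likelihood

likelihood ratio = ratio of posterior probabilities:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

How do we compute this?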

59

$$\text{likelihood ratio} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

61

We assume the observations are independent (we’ll see how to handle dependence soon).

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
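Under the independence assumption the log of the product becomes a sum of per-observation log ratios. The following is a minimal sketch of that ranking function, assuming each observation is a binary "detected / not detected" flag and that the conditional probabilities are estimated from gold-standard positives and negatives with a pseudocount. The function names, the pseudocount, and the toy training set are illustrative assumptions, not the procedure of Collins et al.

```python
import math


def fit_detection_rates(labeled_pairs, n_assays, pseudocount=1.0):
    """Estimate P(observation_i = 1 | label) from gold-standard pairs.

    labeled_pairs: list of (observations, is_true), observations being a tuple of 0/1 flags.
    The pseudocount (an assumption of this sketch) avoids zero probabilities.
    """
    rates = {}
    for label in (True, False):
        rows = [obs for obs, is_true in labeled_pairs if is_true == label]
        rates[label] = [
            (sum(row[i] for row in rows) + pseudocount) / (len(rows) + 2 * pseudocount)
            for i in range(n_assays)
        ]
    return rates


def log_likelihood_ratio(observations, rates):
    """Sum over assays of log P(obs_i | true_PPI) / P(obs_i | false_PPI)."""
    score = 0.0
    for i, detected in enumerate(observations):
        p_true = rates[True][i] if detected else 1.0 - rates[True][i]
        p_false = rates[False][i] if detected else 1.0 - rates[False][i]
        score += math.log(p_true / p_false)
    return score


# Hypothetical gold standard: two assays, four positives and four negatives.
training = [
    ((1, 1), True), ((1, 0), True), ((1, 1), True), ((0, 1), True),
    ((0, 0), False), ((0, 0), False), ((1, 0), False), ((0, 0), False),
]
rates = fit_detection_rates(training, n_assays=2)
candidates = [(1, 1), (1, 0), (0, 1), (0, 0)]
ranked = sorted(candidates, key=lambda obs: log_likelihood_ratio(obs, rates), reverse=True)
print(ranked)
```

A pair does not need to be detected by every assay to rank highly; each assay simply adds or subtracts evidence in proportion to how informative it is.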

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the ranking function $\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

65

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives =
True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression
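The two axes of the ROC curve can be traced by sweeping a score cutoff over pairs from the gold-standard positive and negative sets. This is a minimal sketch under the assumption that every pair carries a single confidence score and that ties are ignored; the scores and labels are made up for illustration.

```python
def roc_points(scored_pairs):
    """Return (FP/(TN+FP), TP/(TP+FN)) points as the score cutoff is lowered.

    scored_pairs: list of (score, is_positive) for gold-standard pairs only.
    Ties in score are not treated specially in this sketch.
    """
    n_pos = sum(1 for _, is_positive in scored_pairs if is_positive)
    n_neg = len(scored_pairs) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, is_positive in sorted(scored_pairs, key=lambda pair: -pair[0]):
        if is_positive:
            tp += 1   # a gold-standard positive is now above the cutoff
        else:
            fp += 1   # a gold-standard negative is now above the cutoff
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores for three gold-standard positives and two negatives.
points = roc_points([(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.2, False)])
print(points)
```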

67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotation: "9074 interactions".]

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1) => 2^N - 1 parameters
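To make the growth concrete, the sketch below compares the 2^N - 1 parameters of a full joint table with the parameter count of a factored model. The factored count assumes one binary hidden cause with n binary effects that are independent given the cause (the naïve Bayes structure); that structure and the function names are assumptions of this illustration.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters in a complete joint table over N binary variables."""
    return 2 ** n_binary_variables - 1


def naive_bayes_parameters(n_effects):
    """Free parameters for one binary cause with n conditionally independent binary effects.

    One parameter for P(cause), plus P(effect = 1 | cause) for each of the two cause states.
    """
    return 1 + 2 * n_effects


for n_effects in (4, 10, 20):
    n_variables = n_effects + 1  # the effects plus the hidden cause
    print(n_effects, full_joint_parameters(n_variables), naive_bayes_parameters(n_effects))
```

The full table grows exponentially with the number of variables, while the factored model grows linearly.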

77

Graphical Structure Expresses our Beliefs

[Figure: network in which the node "P1-P2 REAL" is the cause (often “hidden”) and the nodes "Detected by X1" … "Detected by Xn" are the effects (observed)]

78

Graphical Structure Expresses our Beliefs

[Figure: two networks. First: "P1-P2 REAL" with "Detected by X1" … "Detected by Xn". Second: "P1-P2 REAL" with "Detected by X1" … "Detected by Xn" and "Detected by Xj", "Detected by Xj+1", …, plus the additional nodes "Membrane" and "Highly expressed".]

• Naïve Bayes assumes all observations are independent
• But some observations may be coupled
• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Figure: network with nodes Smart, Grade inflation, Grades, GREs, and Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).
  Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).
  You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Figure: network with nodes Smart, Grade inflation, Grades, GREs, and Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Figure: network Smart → Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require same number of constraints

85
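The equivalence of the two formulations can be checked numerically: multiplying the prior by the conditional table reproduces the joint table, and normalizing the joint table recovers the conditional one. This is a small sketch using the numbers on the slide; the variable names are illustrative.

```python
# Prior and conditional tables from the slide ("F"/"T" for false/true).
p_s = {"F": 0.7, "T": 0.3}
p_g_given_s = {"F": {"F": 0.95, "T": 0.05}, "T": {"F": 0.2, "T": 0.8}}

# Joint table: P(S, G) = P(S) * P(G | S)
joint = {(s, g): p_s[s] * p_g_given_s[s][g] for s in p_s for g in ("F", "T")}

# Going back: P(S) by summing over G, then P(G | S) = P(S, G) / P(S)
marginal_s = {s: joint[(s, "F")] + joint[(s, "T")] for s in p_s}
recovered = {s: {g: joint[(s, g)] / marginal_s[s] for g in ("F", "T")} for s in p_s}

for key in sorted(joint):
    print(key, round(joint[key], 3))
```

Both descriptions have three free numbers: three joint entries (the fourth follows from the sum-to-one constraint), or P(S=T), P(G=T|S=F), and P(G=T|S=T).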

Automated Admissions Decisions

[Figure: network with nodes Smart, Grade inflation, Grades, GREs, and Admit]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

“Explaining Away”

[Figure: network with nodes Season, Sprinkler (S), Rain (R), Grass wet, and Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

“Explaining Away”

[Figure: network Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
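The coin example can be checked by enumerating the four equally likely outcomes. The sketch below (the helper name is an illustrative choice) shows that C2 alone tells us nothing about C1, but once the score is known, C2 determines C1.

```python
from itertools import product

# Two fair coins, so each (C1, C2) outcome has probability 0.25; Score = 1 when the coins match.
outcomes = [(c1, c2, int(c1 == c2)) for c1, c2 in product("HT", repeat=2)]
NAMES = ("c1", "c2", "score")


def prob_c1_heads(**evidence):
    """P(C1 = H | evidence), by counting equally likely outcomes consistent with the evidence."""
    consistent = [
        o for o in outcomes
        if all(o[NAMES.index(name)] == value for name, value in evidence.items())
    ]
    heads = [o for o in consistent if o[0] == "H"]
    return len(heads) / len(consistent)


print(prob_c1_heads())                   # no evidence
print(prob_c1_heads(c2="T"))             # C2 alone is uninformative
print(prob_c1_heads(score=1))            # the score alone is uninformative too
print(prob_c1_heads(score=1, c2="T"))    # score and C2 together pin down C1
```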

88

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

91

Learning Models from Data

[Figure: network Coin 1 → Score ← Coin 2]

Parameters to learn: p(C1=H), p(C1=T), p(C2=H), p(C2=T), and the Score entry for each (C1, C2) combination: (H,H), (T,T), (H,T), (T,H)

D = (C1, C2, S) = (H,T,0), (H,H,1), …
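When the structure is known and every variable is observed, the maximum-likelihood parameters are simply relative frequencies in the data. The sketch below fills in the coin tables that way; the data set beyond the two records shown on the slide is hypothetical, and the function name is an illustrative choice.

```python
from collections import Counter


def mle_parameters(data):
    """Maximum-likelihood estimates for the coin network from fully observed records.

    data: list of (c1, c2, score) records. Returns P(C1=H), P(C2=H) and P(Score=1 | C1, C2).
    """
    n = len(data)
    p_c1_heads = sum(1 for c1, _, _ in data if c1 == "H") / n
    p_c2_heads = sum(1 for _, c2, _ in data if c2 == "H") / n
    pair_counts = Counter((c1, c2) for c1, c2, _ in data)
    score_counts = Counter((c1, c2) for c1, c2, score in data if score == 1)
    p_score = {pair: score_counts[pair] / count for pair, count in pair_counts.items()}
    return p_c1_heads, p_c2_heads, p_score


# The first two records are from the slide; the rest are made up for illustration.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1), ("T", "T", 1)]
p_c1, p_c2, p_score = mle_parameters(D)
print(p_c1, p_c2, p_score)
```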

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Figure: network with nodes S (Smart), G (Grades), R (GREs), and A (Admit)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

           = P(S) P(G|S) P(R|S) P(A|G,R)   (why?)

(because of conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Figure: network with nodes S (Smart), G (Grades), R (GREs), and A (Admit)]

S:
P(F)    P(T)
0.5     0.5

G:
S    P(F)    P(T)
F    0.9     0.1
T    0.8     0.2

R:
S    P(F)    P(T)
F    0.8     0.2
T    0.2     0.8

A:
G    R    P(F)    P(T)
F    F    0.9     0.1
F    T    0.8     0.2
T    F    0.5     0.5
T    T    0.2     0.8

$$P(A=T \mid S=T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S=T)\,P(R \mid S=T)\,P(A=T \mid G,R)$$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
  (terms in order of G,R = F,F; F,T; T,F; T,T)

• Rough grading: being smart doesn’t help very much
• GREs are better correlated with intelligence than the grades
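The double sum on this slide can be reproduced directly from the conditional probability tables. This is a minimal sketch; the dictionary layout and function names are illustrative choices, and only the P(T) column of each table is stored because P(F) = 1 - P(T).

```python
# P(G=T | S), P(R=T | S) and P(A=T | G, R), copied from the tables on the slide.
P_G_TRUE = {"F": 0.1, "T": 0.2}
P_R_TRUE = {"F": 0.2, "T": 0.8}
P_A_TRUE = {("F", "F"): 0.1, ("F", "T"): 0.2, ("T", "F"): 0.5, ("T", "T"): 0.8}


def p_value(p_true, value):
    """Probability of a binary value given the probability that it is 'T'."""
    return p_true if value == "T" else 1.0 - p_true


def p_admit_given_smart(s):
    """P(A=T | S=s): sum over G and R of P(G|S) P(R|S) P(A=T|G,R)."""
    total = 0.0
    for g in "FT":
        for r in "FT":
            total += p_value(P_G_TRUE[s], g) * p_value(P_R_TRUE[s], r) * P_A_TRUE[(g, r)]
    return total


print(round(p_admit_given_smart("T"), 3))
```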

97

Inference with Bayes Nets

[Figure: network with nodes S (Smart), G (Grades), R (GREs), and A (Admit)]

$$P(A=T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=T \mid G,R)$$

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on previous slide)
P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes’ Rule: = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

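[Figure: network with nodes S (Smart), G (Grades), R (GREs), and A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$$P(R=F \mid A=F) = \frac{P(R=F, A=F)}{P(A=F)} = \frac{\sum_{S} \sum_{G} P(S)\,P(G \mid S)\,P(R=F \mid S)\,P(A=F \mid G, R=F)}{P(A=F)}$$

$$P(A=F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=F \mid G,R) \quad \text{(as before)}$$

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6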
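The "tedious" part can be delegated to brute-force enumeration of the 16 joint states. The sketch below is one way to do that, with an illustrative table layout and function names. Enumerating the tables as transcribed on the prediction slide gives P(A=T|S=F) ≈ 0.16 and P(A=T) ≈ 0.23, slightly different from the figures quoted on the slides, so the printed values are best treated as approximate; the qualitative conclusion (bad grades are the more likely explanation) is unchanged.

```python
from itertools import product

# Conditional probability tables from the prediction slide (True = "T", False = "F").
P_S = {True: 0.5, False: 0.5}
P_G_TRUE = {True: 0.2, False: 0.1}   # P(G=T | S)
P_R_TRUE = {True: 0.8, False: 0.2}   # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)


def bern(p_true, value):
    """Probability of a binary value given the probability that it is True."""
    return p_true if value else 1.0 - p_true


def joint(s, g, r, a):
    """Chain rule for the network: P(S) P(G|S) P(R|S) P(A|G,R)."""
    return P_S[s] * bern(P_G_TRUE[s], g) * bern(P_R_TRUE[s], r) * bern(P_A_TRUE[(g, r)], a)


def prob(**fixed):
    """Marginal probability of the fixed assignments, summing the joint over everything else."""
    total = 0.0
    for s, g, r, a in product([False, True], repeat=4):
        assignment = {"s": s, "g": g, "r": r, "a": a}
        if all(assignment[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total


def conditional(query, given):
    """P(query | given) = P(query, given) / P(given)."""
    return prob(**query, **given) / prob(**given)


print(round(prob(a=True), 3))                                  # P(A=T)
print(round(conditional({"s": True}, {"a": True}), 3))         # P(S=T | A=T)
print(round(conditional({"g": False}, {"a": False}), 3))       # P(G=F | A=F)
print(round(conditional({"r": False}, {"a": False}), 3))       # P(R=F | A=F)
```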

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Figure: network in which the node "P1-P2 REAL" is the cause (often “hidden”) and the nodes "Detected by X1" … "Detected by Xn" are the effects (observed)]

101

Properties of real interactions: correlated expression
Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1.5: 349-356

INT = high confidence interactions from small scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one pattern is labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

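$$\text{likelihood ratio} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$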

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114/2150}{81924/573734} \approx 3.6$$
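The likelihood ratio for one feature value is the fraction of gold-standard positives showing that value divided by the fraction of gold-standard negatives showing it. The sketch below evaluates this for the counts on the slide. Reading the run-together digits as 1114 of 2150 positives and 81924 of 573734 negatives is an assumption about how the extracted numbers should be grouped, and the function name is illustrative.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), each estimated as a relative frequency."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (the digit grouping is an assumption).
l_value = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(l_value, 2))
```

A value above 1 means the feature value is enriched among gold-standard positives and therefore counts as evidence for a real interaction.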

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected rarr Compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

[Figure annotations: "TP = FP"; "predictions based on a single data type all have TP/FP < 1"]

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


[Figure: template complex IkB/NFkB and query ASPP2 (N. Tuncbag)]

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test:
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement

Courtesy of Nurcan Tuncbag. Used with permission.

21

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Tuncbag Nurcan Attila Gursoy et al Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary andStructural Similarities at Interfaces using PRISM Nature Protocols 6 no 9 (2011) 1341-54

22

Predicted p27 Protein Partners

[Figure: predicted complexes of p27 with partner proteins; labels: p27, Cdk2, CycA, ERCC1, TFIIH, Cks1]

ΔGcalc = -29.54 kcal/mol    ΔGcalc = -37.51 kcal/mol    ΔGcalc = -44.06 kcal/mol

N. Tuncbag. Courtesy of Nurcan Tuncbag. Used with permission.

23

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.
Tuncbag N, Keskin O, Nussinov R, Gursoy A.
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

PrePPI: scores potential templates without building a homology model

Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

25

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

30

[Figure: template complex with chains labeled NA and NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

32

[Figure: template complex with chains labeled NA and NB]

Evaluate based on five measures:
• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number), COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

35

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

We will examine Bayesian classifiers soon


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Detecting protein-protein interactions

Kumar A, Snyder M. Proteomics: Protein complexes take the bait. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a
Gavin A-C et al. Nature 415, 141-147 (2002). Ho Y et al. Nature 415, 180-183 (2002).

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

Tagging strategies

Gavin et al. (2002) Nature: TAP-tag (endogenous protein levels), tandem purification
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-calmodulin purification
4. EGTA to elute

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

Yeast two-hybrid

Ratushny V, Golemis E. Biotechniques 2008 Apr;44(5):655-62.

How does this compare to mass-spec based approaches?

• Does not require purification, so it will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard and Experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

[Venn diagram: Gold standard, Experiment 1, Experiment 2. Overlap with the gold standard: true positives from gold standard.]

Data Integration

[Venn diagram: Gold standard, Experiment 1, Experiment 2, with regions labeled I to V]

• I = true positives from gold standard
• II = consensus true positives
• Fraction of consensus present in gold standard = I/II
• The data unique to one experiment are a mix of true and false positives; III = the part of these data found in the gold standard
• Define IV = true positives, V = false positives (in the unique data)
• Assume I/II = III/IV

Estimated Error Rates

Hart GT, Ramani AK, Marcotte EM. How complete are current yeast and human protein-interaction networks? Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

[Figure: estimated true positives and false positives, with values for regions I, II, and III]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{\mathrm{I}}{\mathrm{II}} = \frac{\mathrm{III}}{\mathrm{IV}}$$
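The assumption I/II = III/IV can be rearranged to estimate the unknown number of true positives in the unique data, IV = III × II / I. The following is a minimal sketch of that arithmetic; the function name and all counts are made up for illustration and are not values from Hart et al.

```python
def estimate_unique_errors(n_I, n_II, n_III, n_unique):
    """Estimate true/false positives among interactions unique to one experiment.

    n_I      : consensus interactions that are also in the gold standard
    n_II     : all consensus interactions (assumed to be true positives)
    n_III    : unique interactions that are also in the gold standard
    n_unique : all interactions unique to this experiment (IV + V)
    """
    # Slide assumption: I/II = III/IV, hence IV = III * II / I.
    n_IV = n_III * n_II / n_I
    # An estimate cannot exceed the number of unique interactions observed.
    n_IV = min(n_IV, n_unique)
    n_V = n_unique - n_IV
    return n_IV, n_V, n_V / n_unique


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
tp, fp, fp_fraction = estimate_unique_errors(n_I=100, n_II=400, n_III=50, n_unique=1000)
```

With these illustrative counts, a quarter of the consensus is in the gold standard, so the 50 gold-standard hits in the unique data imply about 200 true positives overall.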

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

von Mering et al. Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$$\text{Data} = (\text{two\_hybrid}, \text{mass\_spec}, \text{co\_evolution}, \text{co\_expression})$$

or

$$\text{Data} = (\text{mass\_spec\_expt\_1}, \text{mass\_spec\_expt\_2})$$

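Bayes Rule

$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}} \; \overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})}$$

If > 1, classify as true; if < 1, classify as false.

How do we compute this?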

$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

We assume the observations are independent (we'll see how to handle dependence soon).

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i}^{M}\log\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
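Under the independence assumption, the ranking function is a sum of per-observation log ratios. A minimal sketch follows, assuming each observation is a binary "detected / not detected" outcome; the assay names and detection probabilities are hypothetical placeholders, not values estimated from any data set.

```python
import math


def log_likelihood_ratio(observations, p_detect_true, p_detect_false):
    """Sum of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)] over assays.

    observations   : dict assay -> bool (was the pair detected?)
    p_detect_true  : dict assay -> P(detected | true_PPI)
    p_detect_false : dict assay -> P(detected | false_PPI)
    """
    total = 0.0
    for assay, detected in observations.items():
        p_t = p_detect_true[assay] if detected else 1.0 - p_detect_true[assay]
        p_f = p_detect_false[assay] if detected else 1.0 - p_detect_false[assay]
        total += math.log(p_t / p_f)
    return total


# Hypothetical detection rates; in practice these come from gold-standard sets.
P_TRUE = {"two_hybrid": 0.2, "mass_spec": 0.4}
P_FALSE = {"two_hybrid": 0.01, "mass_spec": 0.05}

score_both = log_likelihood_ratio({"two_hybrid": True, "mass_spec": True}, P_TRUE, P_FALSE)
score_one = log_likelihood_ratio({"two_hybrid": False, "mass_spec": True}, P_TRUE, P_FALSE)
score_none = log_likelihood_ratio({"two_hybrid": False, "mass_spec": False}, P_TRUE, P_FALSE)
```

Ranking by this score means a pair seen in only one assay can still rank well if that assay is reliable, rather than being discarded for missing the other.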

von Mering et al. Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by $\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.
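The two ROC axes can be computed directly from scored pairs and the two gold-standard sets. The sketch below is one simple way to do it, sweeping a score threshold; the pair names, scores, and set contents are invented for illustration and are not from Collins et al.

```python
def roc_points(scores, positives, negatives):
    """Return (FP/(TN+FP), TP/(TP+FN)) at each score threshold, high to low.

    scores    : dict pair -> score (higher means more confident)
    positives : set of high-confidence true pairs
    negatives : set of high-confidence false pairs
    """
    points = []
    for threshold in sorted(set(scores.values()), reverse=True):
        predicted = {pair for pair, s in scores.items() if s >= threshold}
        tp = len(predicted & positives)
        fp = len(predicted & negatives)
        fn = len(positives) - tp
        tn = len(negatives) - fp
        points.append((fp / (tn + fp), tp / (tp + fn)))
    return points


# Hypothetical scored pairs and gold-standard sets.
scores = {("A", "B"): 3.0, ("A", "C"): 2.0, ("B", "D"): 1.0, ("C", "D"): 0.5}
positives = {("A", "B"), ("B", "D")}
negatives = {("A", "C"), ("C", "D")}
curve = roc_points(scores, positives, negatives)
```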

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Marked on the curve: 9074 interactions; literature curated (excluding 2-hybrid). Annotation: error rate equivalent to literature curated.]

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

In Biology
• Gene regulation
• Signaling
• Prediction

Bayesian Networks
• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

Bayesian Networks
• A “natural” way to think about biological networks
• Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities:

$$P(\text{PPI} \mid \text{Y2H}_{\text{Uetz}}, \text{Y2H}_{\text{Ito}}, \text{IP}_{\text{Gavin}}, \text{IP}_{\text{Ho}})$$

$$P(\text{PPI} \mid \text{Rosetta}, \text{CC}, \text{GO}, \text{MIPS}, \text{ESS})$$

But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1) ⇒ $2^N - 1$ parameters
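To make the size argument concrete, the sketch below counts free parameters for a full joint table over N binary variables and, for comparison, for a naive Bayes network with one binary class node and binary observations. The second count is a standard result added here for contrast; it is not stated on the slide, and both function names are illustrative.

```python
def full_joint_parameters(n_variables):
    """Free parameters of a full joint table over binary variables: 2^N - 1."""
    return 2 ** n_variables - 1


def naive_bayes_parameters(n_observations):
    """Free parameters of a naive Bayes net with a binary class node.

    One prior P(class = true), plus P(obs = true | class) for each
    observation and each of the two class values.
    """
    return 1 + 2 * n_observations


# One hidden "interaction is real" node plus four binary assays = 5 variables.
full = full_joint_parameters(5)
naive = naive_bayes_parameters(4)
```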

Graphical Structure Expresses our Beliefs

[Network: "P1-P2 REAL" is the cause (often “hidden”); "Detected by X1", …, "Detected by Xn" are the effects (observed)]

[Second network: "P1-P2 REAL" with "Detected by X1", …, "Detected by Xn", "Detected by Xj", "Detected by Xj+1", …, and the additional nodes "Membrane" and "Highly expressed"]

• Naïve Bayes assumes all observations are independent
• But some observations may be coupled
• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).
Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).
You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Network nodes: Smart, Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints.
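The equivalence of the two formulations can be checked by multiplying the prior by the conditional table, P(S,G) = P(S) P(G|S). A short sketch using the numbers from the slide (variable names are illustrative):

```python
# Prior on Smart and conditional table for Grades given Smart (from the slide).
P_S = {False: 0.7, True: 0.3}
P_G_GIVEN_S = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Rebuild the joint table: P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): P_S[s] * P_G_GIVEN_S[s][g]
    for s in (False, True)
    for g in (False, True)
}

# Going the other way: recover the prior by summing the joint over G.
p_smart = joint[(True, False)] + joint[(True, True)]
```

Both tables have three free numbers: the joint has four entries that must sum to 1, and the conditional form has one prior plus one free entry per row.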

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on “parents”. For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

“Explaining Away”

[Network nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

“Explaining Away”

[Network nodes: Coin 1, Coin 2, Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
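The coin example is small enough to check by enumerating the four equally likely outcomes. The sketch below (function name is illustrative) computes P(C1 = H) under different evidence, assuming two fair, independent coins and Score = 1 exactly when the coins match, as in the table.

```python
from itertools import product


def prob_c1_heads(evidence):
    """P(C1 = 'H' | evidence), by enumerating all four coin outcomes.

    evidence: dict over the keys 'C1', 'C2', 'Score' with observed values.
    """
    numerator = denominator = 0.0
    for c1, c2 in product("HT", repeat=2):
        world = {"C1": c1, "C2": c2, "Score": 1 if c1 == c2 else 0}
        if all(world[name] == value for name, value in evidence.items()):
            denominator += 0.25  # each outcome of two fair coins has P = 1/4
            if c1 == "H":
                numerator += 0.25
    return numerator / denominator


no_evidence = prob_c1_heads({})
given_c2 = prob_c1_heads({"C2": "T"})
given_score = prob_c1_heads({"Score": 1})
given_score_and_c2 = prob_c1_heads({"Score": 1, "C2": "T"})
```

Without the score, learning C2 leaves the belief in C1 unchanged; once the score is known, learning C2 determines C1 completely.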

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure. How do we find the parameters?
• Define an objective function and search for parameters that optimize this function
• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

Learning Models from Data

[Network nodes: Coin 1, Coin 2, Score; the probabilities below are to be learned]

p(C1=H) = ?    p(C1=T) = ?
p(C2=H) = ?    p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = (C1, C2, S) = (H, T, 0), (H, H, 1), …
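When the structure is known and every variable is observed, maximum-likelihood parameters are just relative frequencies in the data. A minimal sketch for the coin network follows; only the first two records come from the slide, the remaining records are hypothetical, and the function name is illustrative.

```python
from collections import Counter


def estimate_coin_model(data):
    """Maximum-likelihood estimates from fully observed (C1, C2, Score) records."""
    n = len(data)
    p_c1_heads = sum(1 for c1, _, _ in data if c1 == "H") / n
    p_c2_heads = sum(1 for _, c2, _ in data if c2 == "H") / n
    # P(Score = 1 | C1, C2): fraction of records with Score 1 per coin pair.
    pair_counts = Counter((c1, c2) for c1, c2, _ in data)
    score_one_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
    p_score_one = {pair: score_one_counts[pair] / pair_counts[pair] for pair in pair_counts}
    return p_c1_heads, p_c2_heads, p_score_one


# First two records are from the slide; the last two are made up.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0)]
p_c1, p_c2, p_score = estimate_coin_model(D)
```

With hidden variables or missing values, counting no longer suffices, which is where the search algorithms listed earlier (EM, Gibbs sampling) come in.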

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

A worked “toy” example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

[Network nodes: S = Smart, G = Grades, R = GREs, A = Admit]

Recall: P(X,Y) = P(X|Y) P(Y)

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)    (why? because of the conditional independence assumption)

Prediction with Bayes Nets

[Network nodes: S = Smart, G = Grades, R = GREs, A = Admit]

S:   P(F)   P(T)
     0.5    0.5

G:   S    P(F)   P(T)
     F    0.9    0.1
     T    0.8    0.2
(Rough grading: being smart doesn't help very much)

R:   S    P(F)   P(T)
     F    0.8    0.2
     T    0.2    0.8
(GREs are better correlated with intelligence than the grades)

A:   G    R    P(F)   P(T)
     F    F    0.9    0.1
     F    T    0.8    0.2
     T    F    0.5    0.5
     T    T    0.2    0.8

$$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\, P(R \mid S{=}T)\, P(A{=}T \mid G, R)$$

$$= \underbrace{(0.8)(0.2)(0.1)}_{G,R = F,F} + \underbrace{(0.8)(0.8)(0.2)}_{F,T} + \underbrace{(0.2)(0.2)(0.5)}_{T,F} + \underbrace{(0.2)(0.8)(0.8)}_{T,T} = 0.29$$
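The double sum above is easy to reproduce in code. The sketch below encodes the conditional tables from this slide (dictionary and function names are illustrative) and sums over G and R for a given value of S.

```python
# Conditional probability tables from the slide; True/False stand for T/F.
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
# P(A = T | G, R), keyed by (G, R).
P_ADMIT = {(False, False): 0.1, (False, True): 0.2, (True, False): 0.5, (True, True): 0.8}


def p_admit_given_smart(s):
    """P(A = T | S = s) = sum over G, R of P(G|s) P(R|s) P(A=T|G,R)."""
    return sum(
        P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * P_ADMIT[(g, r)]
        for g in (False, True)
        for r in (False, True)
    )


p_admit_if_smart = p_admit_given_smart(True)
p_admit_if_not_smart = p_admit_given_smart(False)
```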

Inference with Bayes Nets

[Network nodes: S = Smart, G = Grades, R = GREs, A = Admit]

$$P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A{=}T \mid G, R)$$

$$= P(S{=}T)\, P(A{=}T \mid S{=}T) + P(S{=}F)\, P(A{=}T \mid S{=}F) = 0.228$$

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.292 (calculated on the previous slide)
P(A=T|S=F) = 0.164 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes Rule: P(S=T|A=T) = P(S=T) P(A=T|S=T) / P(A=T)

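Inference with Bayes Nets

[Network nodes: S = Smart, G = Grades, R = GREs, A = Admit]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F):

$$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S} \sum_{G} P(S)\, P(G \mid S)\, P(R{=}F \mid S)\, P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S} \sum_{G} \sum_{R} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A{=}F \mid G, R) \quad \text{(as before)}$$

Tedious but straightforward to compute:

$$\frac{P(G{=}F \mid A{=}F)}{P(R{=}F \mid A{=}F)} = \frac{0.94}{0.55} = 1.7$$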
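Since the network has only four binary variables, every query in the worked example can be answered by enumerating all 16 joint states. The sketch below does this by brute force using the conditional tables from the "Prediction with Bayes Nets" slide; the helper names are illustrative, and this is not an efficient inference algorithm for larger networks.

```python
from itertools import product

P_S = {False: 0.5, True: 0.5}
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_ADMIT = {(False, False): 0.1, (False, True): 0.2, (True, False): 0.5, (True, True): 0.8}


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R), the Bayes-net chain rule."""
    p_a = P_ADMIT[(g, r)] if a else 1.0 - P_ADMIT[(g, r)]
    return P_S[s] * P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * p_a


def prob(query, evidence):
    """P(query | evidence); both are dicts over the names 'S', 'G', 'R', 'A'."""
    numerator = denominator = 0.0
    for s, g, r, a in product((False, True), repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(s, g, r, a)
            denominator += p
            if all(world[k] == v for k, v in query.items()):
                numerator += p
    return numerator / denominator


p_admit = prob({"A": True}, {})
p_smart_if_admitted = prob({"S": True}, {"A": True})
p_bad_grades_if_rejected = prob({"G": False}, {"A": False})
p_bad_gres_if_rejected = prob({"R": False}, {"A": False})
```

Bad grades are common even among smart students in this model, which is why they are the more likely observation for a rejected student.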

End of worked example

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Network: "P1-P2 REAL" is the cause (often “hidden”); "Detected by X1", …, "Detected by Xn" are the effects (observed)]

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)
Deane et al., Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments
d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data


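$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$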

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114 / 2150}{81924 / 573734}$$
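The likelihood ratio for one feature value is a ratio of two frequencies measured on the gold-standard sets. The sketch below evaluates it for the two fractions shown on the slide; reading them as "pairs with this feature value / all gold-standard pairs" for the positive and negative sets is an interpretation of the slide, and the function name is illustrative.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), each estimated as a simple frequency."""
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as they appear on the slide.
L = likelihood_ratio(1114, 2150, 81924, 573734)
```

A value of L above 1 means the feature value is enriched among gold-standard positives and so raises the ranking score of any pair that has it.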

Fully connected → compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

[Figure annotations: line where TP = FP; predictions based on a single data type all have TP/FP < 1]

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


[Figure labels: IκB, NF-κB, ASPP2] (N. Tuncbag)

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test
   1. Overall structural match
   2. Structural match at hotspots
   3. Sequence match at hotspots
4. Flexible refinement

Courtesy of Nurcan Tuncbag. Used with permission.
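Step 1, identifying the template interface with a distance cutoff, can be sketched as a pairwise distance test between residues of the two chains. This is only an illustration of the idea, not PRISM's implementation: the one-coordinate-per-residue representation, the 6.0 Å default cutoff, and the residue names are all assumptions made for the example.

```python
import math


def interface_residues(chain_a, chain_b, cutoff=6.0):
    """Return the residues of each chain that lie within `cutoff` of the other chain.

    chain_a, chain_b : dict residue_id -> (x, y, z), one representative
                       coordinate per residue (an assumption of this sketch)
    cutoff           : distance threshold in the same units as the coordinates
                       (6.0 is a placeholder, not a value from the slides)
    """
    iface_a, iface_b = set(), set()
    for res_a, xyz_a in chain_a.items():
        for res_b, xyz_b in chain_b.items():
            if math.dist(xyz_a, xyz_b) <= cutoff:
                iface_a.add(res_a)
                iface_b.add(res_b)
    return iface_a, iface_b


# Hypothetical coordinates for two tiny chains.
chain_a = {"A10": (0.0, 0.0, 0.0), "A11": (20.0, 0.0, 0.0)}
chain_b = {"B5": (3.0, 4.0, 0.0), "B6": (40.0, 0.0, 0.0)}
half_interface_a, half_interface_b = interface_residues(chain_a, chain_b)
```

The two returned sets correspond to the "half-interfaces" that the query surface is aligned against in step 2.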

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Tuncbag, Nurcan, Attila Gursoy, et al. "Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary and Structural Similarities at Interfaces using PRISM." Nature Protocols 6, no. 9 (2011): 1341-54.

Predicted p27 Protein Partners

[Figure: predicted complexes of p27; labels: p27, Cdk2, CycA, ERCC1, TFIIH, Cks1]

ΔGcalc = -29.54 kcal/mol    ΔGcalc = -37.51 kcal/mol    ΔGcalc = -44.06 kcal/mol

Courtesy of Nurcan Tuncbag. Used with permission.

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.
Tuncbag N, Keskin O, Nussinov R, Gursoy A.
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

PrePPI scores potential templates without building a homology model.

Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure


[Figure: template complex of NA and NB, with the homology models of MA and MB superimposed]

1. Identify interacting residues in the template complex (called NA1, NB3 in the rest of the paper)
2. Predict interacting residues for the homology models

Evaluate based on five measures:
• SIM: structural similarity of NA to MA and of NB to MB
• SIZ (number) and COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at the interface
• OL: number of aligned pairs at the interface

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Detecting protein-protein interactions

What are the likely false positives? What are the likely false negatives?

Kumar, Anuj, and Michael Snyder. Proteomics: Protein complexes take the bait. Nature 415, 123–124 (10 January 2002). doi:10.1038/415123a

Gavin A.-C. et al., Nature 415, 141–147 (2002); Ho Y. et al., Nature 415, 180–183 (2002)

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations:
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

Tagging strategies

TAP-tag (endogenous protein levels), tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

Yeast two-hybrid

How does this compare to mass-spec based approaches?

• Does not require purification, so it will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Ratushny V., Golemis E. Biotechniques 2008 Apr; 44(5): 655-62.

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard vs. Experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

Error Rates

[Venn diagram: Gold standard, Experiment 1, Experiment 2. Interactions that an experiment shares with the gold standard are true positives from the gold standard.]

Data Integration

[Venn diagram: Gold standard, Experiment 1, Experiment 2, with regions I to V]

• I = true positives from gold standard
• II = consensus true positives
• Fraction of consensus present in gold standard = I/II
• The rest of Experiment 1 is a mix of true and false positives; III is the part present in the gold standard
• Define IV = true positives, V = false positives
• Assume I/II = III/IV

Estimated Error Rates

[Figure: estimated false positives and true positives per data set, with regions I, II and III marked]

Hart G.T., Ramani A.K., Marcotte E.M. How complete are current yeast and human protein-interaction networks? Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{I}{II} = \frac{III}{IV}$$

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120
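The assumption I/II = III/IV can be turned into a small calculation. The sketch below is only an illustration: the function names and the counts are hypothetical, and it assumes that the experiment-specific interactions outside the gold standard split into true positives (IV) and false positives (V).

```python
def estimate_region_iv(n_i, n_ii, n_iii):
    """Solve I/II = III/IV for IV (unique true positives outside the gold standard)."""
    if n_i <= 0:
        raise ValueError("region I must be non-empty to form the ratio I/II")
    return n_iii * n_ii / n_i


def estimate_false_positive_fraction(n_i, n_ii, n_iii, n_unique_outside_gold):
    """Estimate V / (all interactions reported by the experiment).

    n_unique_outside_gold = IV + V: interactions seen only in this
    experiment and absent from the gold standard.
    """
    n_iv = min(estimate_region_iv(n_i, n_ii, n_iii), n_unique_outside_gold)
    n_v = n_unique_outside_gold - n_iv
    total = n_i + n_ii + n_iii + n_unique_outside_gold
    return n_v / total


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
region_iv = estimate_region_iv(n_i=100, n_ii=50, n_iii=200)
fp_fraction = estimate_false_positive_fraction(100, 50, 200, n_unique_outside_gold=400)
print(region_iv, fp_fraction)
```

With these made-up counts, IV is estimated as 200 × 50 / 100 = 100, so 300 of the 400 unexplained interactions are counted as false positives.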

53


Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399–403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$$\text{Data} = (\text{two\_hybrid},\ \text{mass\_spec},\ \text{co\_evolution},\ \text{co\_expression})$$

or

$$\text{Data} = (\text{mass\_spec\_expt\_1},\ \text{mass\_spec\_expt\_2})$$

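Bayes Rule

$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\ \overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$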

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})\ P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\ P(\text{Data} \mid \text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

How do we compute this?

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

We assume the observations are independent (we'll see how to handle dependence soon).

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399–403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$
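The ranking function under the independence assumption can be sketched in a few lines. This is a minimal illustration, not the method of any cited paper: the assay names, the detection probabilities, and the protein pairs are all hypothetical, and each observation is treated as a binary detected / not detected outcome.

```python
import math

# Hypothetical per-assay detection probabilities, as would be estimated from
# high-confidence positive and negative interaction sets.
P_DETECT_GIVEN_TRUE = {"two_hybrid": 0.2, "mass_spec": 0.5}
P_DETECT_GIVEN_FALSE = {"two_hybrid": 0.01, "mass_spec": 0.05}


def log_likelihood_ratio(observations):
    """Sum of per-assay log ratios; valid only if assays are independent."""
    score = 0.0
    for assay, detected in observations.items():
        p_true = P_DETECT_GIVEN_TRUE[assay]
        p_false = P_DETECT_GIVEN_FALSE[assay]
        if not detected:
            # Probability of *not* detecting the pair in this assay.
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        score += math.log(p_true / p_false)
    return score


# Hypothetical candidate pairs and what each assay reported.
candidates = {
    ("A", "B"): {"two_hybrid": True, "mass_spec": True},
    ("A", "C"): {"two_hybrid": False, "mass_spec": True},
    ("B", "C"): {"two_hybrid": False, "mass_spec": False},
}

ranked = sorted(candidates, key=lambda pair: log_likelihood_ratio(candidates[pair]), reverse=True)
for pair in ranked:
    print(pair, round(log_likelihood_ratio(candidates[pair]), 3))
```

A pair seen in only one assay still receives a positive score here, which is the point of ranking by the ratio rather than requiring detection in every assay.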

65

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: 9074 interactions; literature curated (excluding 2-hybrid); error rate equivalent to literature curated.]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/
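The two axes of the ROC curve can be computed directly from a scored list of pairs plus the high-confidence positive and negative sets. The sketch below assumes a simple "predict every pair scoring at or above the cutoff" rule; the pair names and scores are hypothetical.

```python
def roc_points(scores, positives, negatives):
    """Return (FP/(TN+FP), TP/(TP+FN)) for each score cutoff, strictest first."""
    points = []
    for cutoff in sorted(set(scores.values()), reverse=True):
        predicted = {pair for pair, score in scores.items() if score >= cutoff}
        true_positive_rate = len(predicted & positives) / len(positives)
        false_positive_rate = len(predicted & negatives) / len(negatives)
        points.append((false_positive_rate, true_positive_rate))
    return points


# Hypothetical scores: p* are gold-standard positives, n* are negatives.
scores = {"p1": 3.0, "p2": 2.0, "n1": 1.5, "p3": 1.0, "n2": 0.5}
positives = {"p1", "p2", "p3"}
negatives = {"n1", "n2"}

points = roc_points(scores, positives, negatives)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the cutoff moves up and to the right along the curve: more true positives are recovered at the cost of more false positives.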

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

In Biology
• Gene regulation
• Signaling
• Prediction

Bayesian Networks
• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

A “natural” way to think about biological networks.

Predict unknown variables from observations.

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities:
• P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)
• P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1) ⇒ $2^N - 1$ parameters
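To see how quickly the full joint table grows, the sketch below compares its parameter count with that of a naïve Bayes network (one hidden binary cause with binary observed effects). The naïve Bayes count of 1 + 2n is a derived illustration, not a figure from the slides: one parameter for the cause, plus two per observation, P(X=T | cause=T) and P(X=T | cause=F).

```python
def full_joint_parameters(n_binary_variables):
    """2^N states, minus one because the probabilities must sum to 1."""
    return 2 ** n_binary_variables - 1


def naive_bayes_parameters(n_observations):
    """One prior for the binary cause, two conditionals per binary observation."""
    return 1 + 2 * n_observations


# One interaction variable plus n assays, for a few illustrative sizes.
for n_assays in (4, 10, 20):
    total_variables = n_assays + 1
    print(n_assays, full_joint_parameters(total_variables), naive_bayes_parameters(n_assays))
```

For four assays this is 31 parameters against 9; for twenty assays the full table needs over two million.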

77

Graphical Structure Expresses our Beliefs

[Network: P1-P2 REAL (cause, often “hidden”) → Detected by X1, …, Detected by Xn (effects, observed)]

[Extended network: P1-P2 REAL → Detected by X1, …, Detected by Xn, Detected by Xj, Detected by Xj+1, …, with additional nodes “Membrane” and “Highly expressed”]

• Naïve Bayes assumes all observations are independent
• But some observations may be coupled
• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Network: Smart (S) → Grades (G)]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints.
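The equivalence of the two formulations can be checked numerically: multiplying the prior P(S) by the conditional P(G|S) should reproduce each entry of the joint table. The sketch below uses the numbers from the tables; the variable names are only illustrative.

```python
import math

# Values copied from the joint and conditional probability tables.
P_JOINT = {(False, False): 0.665, (False, True): 0.035,
           (True, False): 0.06, (True, True): 0.24}
P_SMART = {False: 0.7, True: 0.3}
P_GRADES_GIVEN_SMART = {False: {False: 0.95, True: 0.05},
                        True: {False: 0.2, True: 0.8}}

# Rebuild the joint table from P(S) * P(G|S).
rebuilt = {
    (s, g): P_SMART[s] * P_GRADES_GIVEN_SMART[s][g]
    for s in (False, True)
    for g in (False, True)
}

for key, value in rebuilt.items():
    print(key, round(value, 3), math.isclose(value, P_JOINT[key]))
```

Both forms need three free numbers: the joint table has four entries that sum to 1, and the conditional form has P(S=T), P(G=T|S=F) and P(G=T|S=T).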

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

“Explaining Away”

[Network nodes: Season; S: Sprinkler; R: Rain; Grass wet; Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

“Explaining Away”

[Network: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
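The two-coin example is small enough to enumerate exactly. The sketch below (helper names are illustrative) computes P(C1=H) under different evidence and shows that C2 only becomes informative about C1 once the score is known.

```python
from fractions import Fraction
from itertools import product


def score(c1, c2):
    """Score is 1 when the coins match and 0 otherwise, as in the table."""
    return 1 if c1 == c2 else 0


def prob_c1_heads(evidence):
    """P(C1=H | evidence) by summing over the four equally likely coin outcomes."""
    numerator = denominator = Fraction(0)
    for c1, c2 in product("HT", repeat=2):
        world = {"C1": c1, "C2": c2, "Score": score(c1, c2)}
        if any(world[name] != value for name, value in evidence.items()):
            continue  # this outcome contradicts the evidence
        weight = Fraction(1, 4)  # two fair, independent coins
        denominator += weight
        if c1 == "H":
            numerator += weight
    return numerator / denominator


print(prob_c1_heads({}))                          # no evidence
print(prob_c1_heads({"C2": "H"}))                 # C2 alone tells us nothing
print(prob_c1_heads({"Score": 1}))                # score alone tells us nothing
print(prob_c1_heads({"Score": 1, "C2": "H"}))     # together they fix C1
```

Without the score the coins are independent; conditioning on their common effect couples them.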

88

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function
• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

Learning Models from Data

[Network: Coin 1 → Score ← Coin 2]

Parameters to learn: p(C1=H), p(C1=T), p(C2=H), p(C2=T), and the Score for each combination of C1 and C2 (H H, T T, H T, T H).

D = (C1, C2, S) = (H, T, 0), (H, H, 1), …
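When every variable is observed, maximum likelihood for this network reduces to counting. The sketch below is an illustration under that assumption: only the first two rows of the data set come from the slide, and the remaining rows are hypothetical.

```python
from collections import Counter, defaultdict

# (C1, C2, Score). The first two observations are from the slide;
# the rest are hypothetical, added so there is something to count.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]


def mle_heads_probability(observations, coin_index):
    """Maximum-likelihood estimate of p(coin = H): the observed fraction of heads."""
    flips = [row[coin_index] for row in observations]
    return flips.count("H") / len(flips)


def mle_score_table(observations):
    """Estimate P(Score | C1, C2) as relative frequencies within each (C1, C2) group."""
    counts = defaultdict(Counter)
    for c1, c2, s in observations:
        counts[(c1, c2)][s] += 1
    return {
        parents: {s: n / sum(counter.values()) for s, n in counter.items()}
        for parents, counter in counts.items()
    }


p_c1_heads = mle_heads_probability(data, 0)
p_c2_heads = mle_heads_probability(data, 1)
score_table = mle_score_table(data)
print(p_c1_heads, p_c2_heads, score_table)
```

With hidden variables or missing values, counting no longer suffices and iterative methods such as EM are used instead.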

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/

A worked “toy” example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

[Network: S (Smart) → G (Grades), S → R (GREs); G and R → A (Admit)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)    (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

Prediction with Bayes Nets

[Network: S (Smart) → G (Grades), S → R (GREs); G and R → A (Admit)]

P(S):
P(F)   P(T)
0.5    0.5

P(G|S):
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

P(R|S):
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

P(A|G,R):
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

$$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\, P(R \mid S{=}T)\, P(A{=}T \mid G, R)$$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
(terms in the order G,R = F,F; F,T; T,F; T,T)

• Rough grading: being smart doesn't help very much
• GREs are better correlated with intelligence than the grades

Inference with Bayes Nets

[Network: S (Smart) → G (Grades), S → R (GREs); G and R → A (Admit)]

$$P(A{=}T) = \sum_{S} \sum_{G} \sum_{R} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A{=}T \mid G, R)$$

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) ≈ 0.23

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on previous slide)
P(A=T|S=F) = 0.16 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes Rule: = P(S=T) P(A=T|S=T) / P(A=T)

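Inference with Bayes Nets

[Network: S (Smart) → G (Grades), S → R (GREs); G and R → A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

$$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S} \sum_{G} P(S)\, P(G \mid S)\, P(R{=}F \mid S)\, P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S} \sum_{G} \sum_{R} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A{=}F \mid G, R) \quad \text{(as before)}$$

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) ≈ 0.94 / 0.55 ≈ 1.7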
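The "tedious but straightforward" sums are easy to delegate to a brute-force enumeration over all 16 assignments of (S, G, R, A). The sketch below uses the conditional probability tables from the prediction slide; the function and variable names are illustrative, and the approach only scales to very small networks.

```python
from itertools import product

# Conditional probability tables from the worked example (probability of True).
P_S_TRUE = 0.5
P_G_TRUE_GIVEN_S = {False: 0.1, True: 0.2}
P_R_TRUE_GIVEN_S = {False: 0.2, True: 0.8}
P_A_TRUE_GIVEN_GR = {(False, False): 0.1, (False, True): 0.2,
                     (True, False): 0.5, (True, True): 0.8}


def bernoulli(p_true, value):
    """Probability of a binary value given the probability that it is True."""
    return p_true if value else 1.0 - p_true


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bernoulli(P_S_TRUE, s)
            * bernoulli(P_G_TRUE_GIVEN_S[s], g)
            * bernoulli(P_R_TRUE_GIVEN_S[s], r)
            * bernoulli(P_A_TRUE_GIVEN_GR[(g, r)], a))


def probability(query, evidence=None):
    """P(query | evidence), summing the joint over all consistent assignments."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for s, g, r, a in product([False, True], repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(s, g, r, a)
        denominator += p
        if all(world[k] == v for k, v in query.items()):
            numerator += p
    return numerator / denominator


print(round(probability({"A": True}, {"S": True}), 3))    # prediction
print(round(probability({"A": True}), 3))                 # marginal
print(round(probability({"S": True}, {"A": True}), 3))    # inference
print(round(probability({"G": False}, {"A": False}), 3))  # bad grades given rejection
print(round(probability({"R": False}, {"A": False}), 3))  # bad GREs given rejection
```

The last two lines answer the rejection question: under these tables, bad grades are the more probable observation for a rejected student.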

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Network: P1-P2 REAL (cause, often “hidden”) → Detected by X1, …, Detected by Xn (effects, observed)]

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

• INT = high confidence interactions from small scale experiments
• d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is marked “More likely to interact”]

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data


Likelihood ratio: if > 1, classify as true; if < 1, classify as false.

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$$

$$L = \frac{1114 / 2150}{81924 / 573734} \approx 3.6$$

Fully connected → compute probabilities for all 16 possible combinations.

Interpret with caution, as numbers are small.

How many gold-standard events do we score correctly at different likelihood cutoffs?

[Figure annotations: the line TP = FP; predictions based on a single data type all have TP/FP < 1]

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms

Page 21: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

NFkB

N Tuncbag

ASPP2

1. Identify interface of template (distance cutoff)
2. Align entire surface of query to half-interfaces
3. Test: (1) overall structural match, (2) structural match at hotspots, (3) sequence match at hotspot
4. Flexible refinement
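
Step 1 can be sketched as follows. This is a minimal illustration, not PRISM's implementation: it assumes each chain is given as one representative point per residue, the 6 Å cutoff is an arbitrary placeholder (the slide does not state the value), and the function name `interface_residues` is hypothetical.

```python
import math

def interface_residues(chain_a, chain_b, cutoff=6.0):
    """Return the residues of each chain that lie within `cutoff` of the other chain.

    chain_a, chain_b: lists of (label, (x, y, z)), one point per residue.
    The cutoff value is an illustrative assumption, not PRISM's setting.
    """
    near_a, near_b = set(), set()
    for label_a, xyz_a in chain_a:
        for label_b, xyz_b in chain_b:
            if math.dist(xyz_a, xyz_b) <= cutoff:
                near_a.add(label_a)
                near_b.add(label_b)
    return near_a, near_b

# Toy coordinates, made up for illustration.
chain_a = [("A1", (0.0, 0.0, 0.0)), ("A2", (10.0, 0.0, 0.0))]
chain_b = [("B1", (3.0, 4.0, 0.0)), ("B2", (30.0, 0.0, 0.0))]
half_a, half_b = interface_residues(chain_a, chain_b)
```

The two returned sets play the role of the two "half-interfaces" that the query surface is aligned against in step 2.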

Courtesy of Nurcan Tuncbag Used with permission 21

Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Tuncbag Nurcan Attila Gursoy et al Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary andStructural Similarities at Interfaces using PRISM Nature Protocols 6 no 9 (2011) 1341-54

22

Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

ΔGcalc = -29.54 kcal/mol   ΔGcalc = -37.51 kcal/mol   ΔGcalc = -44.06 kcal/mol

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 23

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement
Tuncbag N, Keskin O, Nussinov R, Gursoy A
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

PrePPI scores potential templates without building a homology model. Criteria:

• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

25

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1,500 neighbors per structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

26

NA

NA

NB

NB

1 Identify interacting residues in template complex (Called NA1 NB3 in rest of paper)

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

31

NA

NA

NB

NB

1 Identify interacting residues in template complex (Called NA1 NB3 in rest of paper)

2 Predict interacting residues for the homology models

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

32

NA

NA

NB

NB

Evaluate based on five measures

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

33

NA

NA

NB

NB

Evaluate based on five measures:
• SIM: structural similarity of NA/MA and NB/MB

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 34

Evaluate based on five measures:
• SIM: structural similarity of NA/MA and NB/MB
• SIZ (number), COV (fraction): interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

35

"The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces)."

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 36

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1,500 neighbors per structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

37

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1,500 neighbors per structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

We will examine Bayesian classifiers soon

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin A-C et al. Nature 415, 141-147 (2002); Ho Y et al. Nature 415, 180-183 (2002)

Detecting protein-protein interactions

What are the likely false positives?
What are the likely false negatives?

40

Courtesy of Macmillan Publishers LimitedUsed with permissionSource Kumar Anuj and Michael Snyder Proteomics Protein Complexes take the Bait Nature 415 no 6868 (2002) 123-4 40

Mass-spec for protein-protein interactions

bull Extremely efficient method for detecting interactions

bull Proteins are in their correct subcellular location

Limitations

41

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations:
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels), tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al (2002) Nature

Ho et al (2002) Nature over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited Used with permissionSource Gavin Anne-Claude Markus Boumlsche et al Functional Organization of the Yeast Proteome bySystematic Analysis of Protein Complexes Nature 415 no 6868 (2002) 141-7

43

Biotechniques 2008 Apr44(5)655-62 Ratushny V Golemis E

How does this compare to mass-spec based approaches

Yeast two-hybrid

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of Cell Signaling Pathwaysusing the Evolving Yeast Two-hybrid System Biotechniques 44 no 5 (2008) 655

44

• Does not require purification, so it will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Biotechniques 2008 Apr44(5)655-62 Ratushny V Golemis E

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of CellSignaling Pathways using the Evolving Yeast Two-hybrid System Biotechniques44 no 5 (2008) 655

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

bull How can we estimate the error rates

Gold standard Experiment

False negatives

True positives

True positives

and false

positives

47

Error Rates

Gold standard

Experiment 1

Experiment 2

True positives

from gold standard

48

Data Integration

Gold standard

Experiment 1

Experiment 2

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

49

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Mix of true and false

positives

III

50

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Define IV =

true positives V =

false positives

IV true

V false

III

51

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

IV true

V false

III

52

Estimated Error Rates

False positives

True positives

How complete are current yeast and human protein-interaction networks? G Traver Hart, Arun K Ramani and Edward M Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

I=

II=

III=

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV): I/II = III/IV
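
The assumption I/II = III/IV can be turned into an estimate of the number of true positives outside the consensus set. The sketch below is only an illustration: the counts are made up, and the function names and the `n_unique` argument (total non-consensus interactions, IV + V) are hypothetical.

```python
def estimate_true_positives(n_I, n_II, n_III):
    """Estimate IV (true positives among non-consensus data) from I/II = III/IV."""
    if n_I == 0:
        raise ValueError("region I is empty; the ratio I/II is undefined")
    return n_III * n_II / n_I

def estimate_false_positive_fraction(n_I, n_II, n_III, n_unique):
    """Fraction of non-consensus interactions estimated to be false: V / (IV + V)."""
    n_IV = estimate_true_positives(n_I, n_II, n_III)
    n_V = max(n_unique - n_IV, 0.0)
    return n_V / n_unique

# Made-up counts for illustration only.
n_IV = estimate_true_positives(n_I=50, n_II=100, n_III=200)
fp_fraction = estimate_false_positive_fraction(50, 100, 200, n_unique=1000)
```

With these toy numbers half of the consensus set is in the gold standard, so the 200 gold-standard hits in the unique data imply about 400 true positives there.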

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict
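
The first two filters can be sketched as below. This is an assumption-laden illustration: "sticky" is approximated here by a hypothetical `max_degree` threshold on the number of distinct partners, which the slides do not define, and the protein names are placeholders.

```python
from collections import Counter

def filter_interactions(datasets, min_support=2, max_degree=50):
    """Keep pairs reported by at least `min_support` datasets, then drop sticky proteins.

    datasets: list of iterables of (protein_a, protein_b) pairs.
    max_degree: hypothetical stickiness threshold on the number of distinct
                partners a protein has across all datasets.
    """
    support = Counter()
    partners = {}
    for dataset in datasets:
        # Treat pairs as unordered and count each pair once per dataset.
        for pair in {frozenset(p) for p in dataset}:
            if len(pair) != 2:
                continue  # skip self-interactions
            support[pair] += 1
            a, b = tuple(pair)
            partners.setdefault(a, set()).add(b)
            partners.setdefault(b, set()).add(a)
    sticky = {p for p, ps in partners.items() if len(ps) > max_degree}
    return {pair for pair, n in support.items()
            if n >= min_support and not (pair & sticky)}

y2h = [("A", "B"), ("S", "A"), ("S", "B"), ("S", "C"), ("S", "D")]
ms = [("B", "A"), ("S", "A"), ("C", "D")]
kept = filter_interactions([y2h, ms], min_support=2, max_degree=3)
```

Here protein S has four partners and is treated as sticky, so S-A is discarded even though both datasets report it.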

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note log-log plot Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scale DataSets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$$\text{Data} = (\text{two\_hybrid},\ \text{mass\_spec},\ \text{co\_evolution},\ \text{co\_expression})$$

or

$$\text{Data} = (\text{mass\_spec\_expt\_1},\ \text{mass\_spec\_expt\_2})$$

57

Bayes Rule

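$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\ \overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$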

58

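Naïve Bayes Classification

$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\ \overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$

likelihood ratio = ratio of posterior probabilities

if > 1 classify as true; if < 1 classify as false

How do we compute this?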

59

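$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

$$\log(\text{likelihood ratio}) = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking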

60

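$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

$$\log(\text{likelihood ratio}) = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$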

61

We assume the observations are independent (we'll see how to handle dependence soon)

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

62

We assume the observations are independent (we'll see how to handle dependence soon)

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

63

We assume the observations are independent (we'll see how to handle dependence soon). We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$
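
Under the independence assumption the ranking function is a sum of per-observation log ratios. The sketch below shows the arithmetic only; the conditional probabilities are hypothetical stand-ins for values that would be estimated from high-confidence positive and negative sets, and the assay names are placeholders.

```python
import math

def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum of per-observation log ratios (the log of the naive Bayes product).

    observations: dict assay -> observed value (here True/False for "detected")
    p_given_true / p_given_false: dict assay -> dict value -> probability
    """
    total = 0.0
    for assay, value in observations.items():
        total += math.log(p_given_true[assay][value] / p_given_false[assay][value])
    return total

# Hypothetical conditional probabilities, as if estimated from gold standards.
p_true = {"two_hybrid": {True: 0.3, False: 0.7},
          "mass_spec": {True: 0.6, False: 0.4}}
p_false = {"two_hybrid": {True: 0.01, False: 0.99},
           "mass_spec": {True: 0.05, False: 0.95}}

both = log_likelihood_ratio({"two_hybrid": True, "mass_spec": True}, p_true, p_false)
one = log_likelihood_ratio({"two_hybrid": False, "mass_spec": True}, p_true, p_false)
none = log_likelihood_ratio({"two_hybrid": False, "mass_spec": False}, p_true, p_false)
```

A pair seen in only one assay still receives a positive score in this toy setup, which is the point of ranking by likelihood rather than requiring detection in every assay.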

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al Mol Cell Proteomics 2007

ROC curve
x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

True positives =
True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al Mol Cell Proteomics 2007

ROC curve
x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)
True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression
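
The two axes can be computed by sweeping a score threshold over labelled gold-standard pairs. This is a generic sketch of an ROC calculation, not the procedure of Collins et al.; the scores and labels are toy values.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by thresholding at each distinct score.

    scores: list of floats (higher = more confident the interaction is real)
    labels: list of bools (True = gold-standard positive, False = gold-standard negative)
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / n_neg, tp / n_pos))  # FP/(TN+FP), TP/(TP+FN)
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False]
curve = roc_points(scores, labels)
```

Each point corresponds to one cutoff on the ranking score; lowering the cutoff moves up and to the right along the curve.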

68 67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

ROC curve
x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

Literature curated (excluding 2-hybrid)

Error rate equivalent to

literature curated

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

ROC curve
x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

Bayesian Networks

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1) => 2^N − 1 parameters
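
To see how quickly the full table grows, the sketch below compares the 2^N − 1 count with the parameter count of a naive Bayes structure (one binary class node with n binary children). The naive Bayes count is a standard result added here for comparison, not a figure from the slides.

```python
def full_joint_parameters(n_binary):
    """Free parameters in a complete joint table over n binary variables."""
    return 2 ** n_binary - 1

def naive_bayes_parameters(n_observations):
    """Free parameters for one binary class node with n binary children.

    1 for the class prior, plus 2 per child: P(child = 1 | class) for each class value.
    """
    return 1 + 2 * n_observations

# n observed variables plus the hidden class variable gives n + 1 variables in the joint.
counts = {n: (full_joint_parameters(n + 1), naive_bayes_parameters(n)) for n in (4, 10, 20)}
```

For 20 observed variables the complete table needs about two million parameters while the factored model needs 41.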

77

Graphical Structure Expresses our Beliefs

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed)

78

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naiumlve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

Admit

83

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions (Smart → Grades)

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints.

85
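
The equivalence of the two formulations can be checked numerically: multiplying the prior by the conditional table reproduces the joint table, and the joint table can be converted back. A small sketch using the numbers on this slide (variable names are my own):

```python
from itertools import product

p_s = {False: 0.7, True: 0.3}                      # P(S)
p_g_given_s = {False: {False: 0.95, True: 0.05},   # P(G | S=F)
               True: {False: 0.2, True: 0.8}}      # P(G | S=T)

# Joint table from the conditional formulation: P(S, G) = P(S) * P(G | S)
joint = {(s, g): p_s[s] * p_g_given_s[s][g]
         for s, g in product([False, True], repeat=2)}

# And back again: marginal P(S) and conditional P(G | S) from the joint table
marginal_s = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}
conditional = {s: {g: joint[(s, g)] / marginal_s[s] for g in (False, True)}
               for s in (False, True)}
```

Both forms have three free parameters: the joint table has four entries constrained to sum to 1, and the conditional form has one prior plus one conditional per value of S.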

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

Admit

86

"Explaining Away"

Nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

"Explaining Away"

Nodes: Coin 1, Coin 2 → Score

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1  C2  Score
H   H   1
T   T   1
H   T   0
T   H   0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
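
The two-coin example is small enough to enumerate. The sketch below computes P(C1 = H) under different evidence, assuming two fair, independent coins and Score = 1 when they match, as in the table for this slide; the function name is hypothetical.

```python
from itertools import product

def p_c1_heads(score=None, c2=None):
    """P(C1 = H | optional evidence) by enumerating the four equally likely outcomes."""
    num = den = 0.0
    for c1, c2_val in product("HT", repeat=2):
        s = 1 if c1 == c2_val else 0  # Score = 1 when the coins match
        if score is not None and s != score:
            continue
        if c2 is not None and c2_val != c2:
            continue
        den += 0.25
        if c1 == "H":
            num += 0.25
    return num / den

no_evidence = p_c1_heads()
c2_only = p_c1_heads(c2="T")
score_only = p_c1_heads(score=1)
score_and_c2 = p_c1_heads(score=1, c2="T")
```

Knowing C2 alone leaves the belief about C1 unchanged; once the score is also observed, C2 fully determines C1.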

88

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D = (C1, C2, S) = (H, T, 0), (H, H, 1), …
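
When the structure is known and every variable is observed, the maximum-likelihood parameters of discrete tables are relative frequencies in the data. A minimal sketch for the two-coin network follows; only the first two tuples of D come from the slide, the rest are made up, and the function name is hypothetical.

```python
from collections import Counter

def mle_coin_parameters(data):
    """Maximum-likelihood estimates from fully observed tuples (c1, c2, score).

    Returns P(C1=H), P(C2=H) and P(Score=1 | C1, C2) as relative frequencies.
    """
    n = len(data)
    p_c1_h = sum(1 for c1, _, _ in data if c1 == "H") / n
    p_c2_h = sum(1 for _, c2, _ in data if c2 == "H") / n
    pair_counts = Counter((c1, c2) for c1, c2, _ in data)
    score_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
    p_score = {pair: score_counts[pair] / count for pair, count in pair_counts.items()}
    return p_c1_h, p_c2_h, p_score

# Toy data set; only the first two tuples appear on the slide.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1),
     ("T", "H", 0), ("H", "H", 1), ("T", "H", 0)]
p_c1_h, p_c2_h, p_score = mle_coin_parameters(D)
```

With hidden variables or missing values, counting no longer suffices and the search methods listed earlier (for example EM) are used instead.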

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked "toy" example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

Nodes: S = Smart, G = Grades, R = GREs, A = Admit

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)   (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

Nodes: S = Smart, G = Grades, R = GREs, A = Admit

S:
P(S=F)   P(S=T)
0.5      0.5

G given S:
S   P(G=F)   P(G=T)
F   0.9      0.1
T   0.8      0.2

R given S:
S   P(R=F)   P(R=T)
F   0.8      0.2
T   0.2      0.8

A given G, R:
G   R   P(A=F)   P(A=T)
F   F   0.9      0.1
F   T   0.8      0.2
T   F   0.5      0.5
T   T   0.2      0.8

P(A=T|S=T) = Σ_G Σ_R P(G|S=T) P(R|S=T) P(A=T|G,R), with G and R each summed over {F, T}
           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
             terms in order (G,R) = (F,F), (F,T), (T,F), (T,T)

Rough grading: being smart doesn't help very much.

GREs are better correlated with intelligence than the grades.
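
The prediction sum can be checked by enumeration. The sketch below encodes the conditional tables as transcribed on this slide (variable and function names are my own). With these tables the S = T case evaluates to 0.292, matching the 0.29 above; the S = F case evaluates to about 0.164, which is somewhat higher than the 0.14 quoted on the following slide, so the transcribed tables may not be exactly those used for that figure.

```python
from itertools import product

F, T = False, True
p_g = {F: {F: 0.9, T: 0.1}, T: {F: 0.8, T: 0.2}}   # p_g[s][g] = P(G=g | S=s)
p_r = {F: {F: 0.8, T: 0.2}, T: {F: 0.2, T: 0.8}}   # p_r[s][r] = P(R=r | S=s)
p_a = {(F, F): {F: 0.9, T: 0.1}, (F, T): {F: 0.8, T: 0.2},
       (T, F): {F: 0.5, T: 0.5}, (T, T): {F: 0.2, T: 0.8}}  # p_a[(g, r)][a]

def p_admit_given_smart(s, a=T):
    """P(A=a | S=s): sum out G and R using P(G|S) P(R|S) P(A|G,R)."""
    return sum(p_g[s][g] * p_r[s][r] * p_a[(g, r)][a]
               for g, r in product((F, T), repeat=2))

admit_if_smart = p_admit_given_smart(T)
admit_if_not = p_admit_given_smart(F)
```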

97

Inference with Bayes Nets

Nodes: S = Smart, G = Grades, R = GREs, A = Admit

P(A=T) = Σ_S Σ_G Σ_R P(S) P(G|S) P(R|S) P(A=T|G,R), with S, G and R each summed over {F, T}
       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on previous slide)
P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
or, using Bayes Rule, = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

Nodes: S = Smart, G = Grades, R = GREs, A = Admit

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F) = [Σ_S Σ_G P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_S Σ_G Σ_R P(S) P(G|S) P(R|S) P(A=F|G,R) (as before)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6
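
The "tedious but straightforward" sums can be delegated to a brute-force enumeration of the joint distribution. The sketch below assumes the conditional tables as transcribed on the prediction slide; names are hypothetical. With those tables the posteriors come out near 0.94 and 0.55 rather than exactly the 0.92 and 0.56 quoted above, but the conclusion is the same: bad grades are the more likely explanation.

```python
from itertools import product

F, T = False, True
p_s = {F: 0.5, T: 0.5}
p_g = {F: {F: 0.9, T: 0.1}, T: {F: 0.8, T: 0.2}}   # P(G | S)
p_r = {F: {F: 0.8, T: 0.2}, T: {F: 0.2, T: 0.8}}   # P(R | S)
p_a = {(F, F): {F: 0.9, T: 0.1}, (F, T): {F: 0.8, T: 0.2},
       (T, F): {F: 0.5, T: 0.5}, (T, T): {F: 0.2, T: 0.8}}  # P(A | G, R)

def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return p_s[s] * p_g[s][g] * p_r[s][r] * p_a[(g, r)][a]

def probability(**fixed):
    """Sum the joint over all assignments consistent with `fixed` (keys: s, g, r, a)."""
    total = 0.0
    for s, g, r, a in product((F, T), repeat=4):
        assignment = {"s": s, "g": g, "r": r, "a": a}
        if all(assignment[k] == v for k, v in fixed.items()):
            total += joint(s, g, r, a)
    return total

p_smart_given_admit = probability(s=T, a=T) / probability(a=T)
p_bad_grades_given_reject = probability(g=F, a=F) / probability(a=F)
p_bad_gres_given_reject = probability(r=F, a=F) / probability(a=F)
```

Enumeration costs 2^4 terms here; for larger networks the factorisation is exploited instead of summing the whole joint.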

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed) 101

Properties of real interactions: correlated expression
Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source Deane Charlotte M Łukasz Salwiński et al Protein Interactions Two Methods for Aassessment ofthe Reliability of High Throughput Observations Molecular amp Cellular Proteomics 1 no 5 (2002) 349-56

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

More likely to interact

Which pattern below is more likely to represent a pair of interacting proteins

104

Courtesy of Cokus et al License CC-BYSource Cokus Shawn Sayaka Mizutani et al An Improved Method for Identifying FunctionallyLinked Proteins using Phylogenetic Profiles BMC Bioinformatics 8 no Suppl 4 (2007) S7

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

109

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

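$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

$$\log(\text{likelihood ratio}) = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$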

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}, \quad \text{e.g.}\ \frac{1114/2150}{81924/573734}$$
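
The likelihood ratio for one feature value is a ratio of two relative frequencies in the gold-standard sets. In the sketch below the four counts are an assumed reading of the digits shown on this slide (1114 of 2150 positives and 81924 of 573734 negatives sharing the feature value); treat them as illustrative rather than authoritative. The function name is hypothetical.

```python
def likelihood_ratio(n_pos_with_f, n_pos_total, n_neg_with_f, n_neg_total):
    """L = P(f | pos) / P(f | neg), each estimated as a relative frequency."""
    p_f_pos = n_pos_with_f / n_pos_total
    p_f_neg = n_neg_with_f / n_neg_total
    return p_f_pos / p_f_neg

# Assumed reading of the slide's digits; see the note above.
L = likelihood_ratio(1114, 2150, 81924, 573734)
```

A value of L above 1 means the feature value is enriched among gold-standard positives relative to negatives.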

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected → compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TF=FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology, Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Flowchart

Structural match of template and target does not depend on order of residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Tuncbag Nurcan Attila Gursoy et al Predicting Protein-protein Interactions on a Proteome Scale by Matching Evolutionary andStructural Similarities at Interfaces using PRISM Nature Protocols 6 no 9 (2011) 1341-54

22

Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

∆Gcalc= - 2954 kcalmol ∆Gcalc= - 3751 kcalmol ∆Gcalc= - 4406 kcalmol

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 23

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement.
Tuncbag N, Keskin O, Nussinov R, Gursoy A.
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale.
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

PrePPI: scores potential templates without building a homology model. Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

25

Structure-based prediction of proteinndashprotein interactions on a genome-wide scale Nature 490 556ndash560 (25 October 2012) doi101038nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

26


[Figure: template complex with chains NA and NB]

1. Identify interacting residues in the template complex (called NA1, NB3 in the rest of the paper)

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

31

[Figure: template complex with chains NA and NB]

1. Identify interacting residues in the template complex (called NA1, NB3 in the rest of the paper)
2. Predict interacting residues for the homology models

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

32

[Figure: template complex with chains NA and NB]

Evaluate based on five measures

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

33

[Figure: template complex with chains NA and NB]

Evaluate based on five measures:
• SIM: structural similarity of NA, MA and NB, MB

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 34

Evaluate based on five measures:
• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number), COV (fraction): interaction pairs that can be aligned anywhere
• OS: subset of SIZ at the interface
• OL: number of aligned pairs at the interface

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

35

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

37

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Kumar A, Snyder M. Proteomics: Protein complexes take the bait. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin A-C et al. Nature 415, 141-147 (2002); Ho Y et al. Nature 415, 180-183 (2002)

Detecting protein-protein interactions

What are the likely false positives? What are the likely false negatives?

40

Courtesy of Macmillan Publishers LimitedUsed with permissionSource Kumar Anuj and Michael Snyder Proteomics Protein Complexes take the Bait Nature 415 no 6868 (2002) 123-4 40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations:

41

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations:
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels), Gavin et al. (2002) Nature. Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-calmodulin purification
4. EGTA to elute

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited Used with permissionSource Gavin Anne-Claude Markus Boumlsche et al Functional Organization of the Yeast Proteome bySystematic Analysis of Protein Complexes Nature 415 no 6868 (2002) 141-7

43

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5): 655-62.

How does this compare to mass-spec based approaches?

Yeast two-hybrid

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of Cell Signaling Pathwaysusing the Evolving Yeast Two-hybrid System Biotechniques 44 no 5 (2008) 655

44

• Does not require purification, so it will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5): 655-62.

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of CellSignaling Pathways using the Evolving Yeast Two-hybrid System Biotechniques44 no 5 (2008) 655

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard overlapping Experiment]
Gold standard only: false negatives
Overlap: true positives
Experiment only: true positives and false positives

47

Error Rates

Gold standard

Experiment 1

Experiment 2

True positives

from gold standard

48

Data Integration

Gold standard

Experiment 1

Experiment 2

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Mix of true and false

positives

III

50

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Define IV =

true positives V =

false positives

IV true

V false

III

51

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

IV true

V false

III

52

Estimated Error Rates

False positives

True positives

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$\dfrac{\text{I}}{\text{II}} = \dfrac{\text{III}}{\text{IV}}$
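The proportionality assumption can be turned into a small calculation. The sketch below is illustrative only: it reads I as the consensus interactions also found in the gold standard, II as all consensus interactions, III as the single-experiment interactions found in the gold standard, and IV as the (unknown) number of true positives among the single-experiment interactions. The function names and the counts are hypothetical, not taken from the paper.

```python
def solve_region_iv(region_i, region_ii, region_iii):
    """Solve I/II = III/IV for IV (estimated true positives in the unique data)."""
    if region_i == 0:
        raise ValueError("region I must be non-zero to form the ratio I/II")
    return region_iii * region_ii / region_i


def false_positive_fraction(region_i, region_ii, region_iii, n_unique):
    """Estimated fraction of single-experiment interactions that are false.

    n_unique is the total number of interactions seen only in this experiment
    (an assumed input; it equals IV + V in the slide's notation).
    """
    region_iv = min(solve_region_iv(region_i, region_ii, region_iii), n_unique)
    return (n_unique - region_iv) / n_unique


# Hypothetical counts, for illustration only
estimated_true = solve_region_iv(region_i=300, region_ii=600, region_iii=100)
estimated_fp = false_positive_fraction(300, 600, 100, n_unique=1000)
print(estimated_true, estimated_fp)
```

The estimate is only as good as the no-bias assumption: if the gold standard favors one experiment, the ratio I/II no longer transfers to the unique data.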

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

Bayes Rule

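$P(\text{true\_PPI} \mid \text{Data}) = \dfrac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data})}$

posterior = likelihood × prior / P(Data)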

58

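Naïve Bayes Classification

$P(\text{true\_PPI} \mid \text{Data}) = \dfrac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data})}$ (posterior = likelihood × prior / P(Data))

likelihood ratio = ratio of posterior probabilities $= \dfrac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})}$

If > 1, classify as true; if < 1, classify as false

How do we compute this?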

59

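likelihood ratio $= \dfrac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \dfrac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$

If > 1, classify as true; if < 1, classify as false

log likelihood ratio $= \log \dfrac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking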

60

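likelihood ratio $= \dfrac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \dfrac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$

If > 1, classify as true; if < 1, classify as false

log likelihood ratio $= \log \dfrac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function $= \log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$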

61

We assume the observations are independent (we'll see how to handle dependence soon)

Ranking function $= \log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

62

We assume the observations are independent (we'll see how to handle dependence soon)

Ranking function $= \log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \dfrac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$

63

We assume the observations are independent (we'll see how to handle dependence soon). We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function $= \log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \dfrac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$
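As a rough illustration of the ranking function, the sketch below sums per-assay log-likelihood ratios under the independence assumption. The assay names and detection probabilities are made up for the example; in practice they would be estimated from high-confidence positive and negative interactions, and how that is done depends on the data type.

```python
import math

# Hypothetical values: (P(detected | true PPI), P(detected | false PPI)) per assay
P_DETECT = {
    "two_hybrid": (0.30, 0.01),
    "mass_spec": (0.50, 0.02),
    "co_expression": (0.60, 0.20),
}


def log_likelihood_ratio(observations):
    """Naive Bayes ranking score for one protein pair.

    observations maps assay name -> True/False (detected or not).
    Independence turns the log of the product into a sum of logs.
    """
    score = 0.0
    for assay, detected in observations.items():
        p_true, p_false = P_DETECT[assay]
        if not detected:
            # Use the complementary probabilities for a negative observation
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        score += math.log(p_true / p_false)
    return score


pair_a = {"two_hybrid": True, "mass_spec": True, "co_expression": True}
pair_b = {"two_hybrid": False, "mass_spec": False, "co_expression": False}
print(log_likelihood_ratio(pair_a), log_likelihood_ratio(pair_b))
```

A pair does not need to be detected by every assay to rank highly; each observation simply shifts the score up or down by its own log ratio.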

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log-likelihood ratio

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al Mol Cell Proteomics 2007

ROC curve
x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

True positives =
True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al Mol Cell Proteomics 2007

ROC curve
x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different subcellular locations OR anticorrelated mRNA expression
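The two axes can be computed by sweeping a score cutoff over pairs with known gold-standard labels. This is a minimal sketch, not the procedure used by Collins et al.: the scores and labels are hypothetical, and tied scores are not handled specially.

```python
def roc_points(scores, labels):
    """Return (FP/(TN+FP), TP/(TP+FN)) after each successively lower cutoff.

    scores: ranking scores, higher means more likely a real interaction.
    labels: True for gold-standard positives, False for gold-standard negatives.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores for five gold-standard pairs
example_scores = [3.2, 2.5, 1.1, 0.4, -0.7]
example_labels = [True, True, False, True, False]
example_curve = roc_points(example_scores, example_labels)
print(example_curve)
```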

68 67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

ROC curve
x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

Literature curated (excluding 2-hybrid)

Error rate equivalent to literature curated

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

ROC curve
x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

Bayesian Networks

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1)
  ⇒ $2^N - 1$ parameters
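To see how quickly the full table grows, the sketch below compares its parameter count with that of a naïve Bayes structure (one hidden binary cause, each binary observation depending only on that cause). The function names are illustrative, and the naïve Bayes count is a standard result rather than something stated on the slide.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters in a complete joint table: 2^N states minus one constraint."""
    return 2 ** n_binary_variables - 1


def naive_bayes_parameters(n_binary_observations):
    """One prior for the hidden cause plus P(obs | cause) for each of its two values."""
    return 1 + 2 * n_binary_observations


# One hidden "real interaction" variable plus 4, 10 or 20 binary observations
for n_obs in (4, 10, 20):
    print(n_obs, full_joint_parameters(n_obs + 1), naive_bayes_parameters(n_obs))
```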

77

Graphical Structure Expresses our Beliefs

[Diagram: “P1-P2 REAL” → “Detected by X1”, …, “Detected by Xn”]

Cause (often “hidden”): P1-P2 REAL

Effects (observed): Detected by X1, …, Detected by Xn

78

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naïve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children)

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents)

You meet an admitted student. Is she smart?

Admit

83

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability (Smart → Grades)

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints
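The equivalence of the two formulations can be checked numerically. The sketch below rebuilds the joint table from the prior and the conditional table using P(S,G) = P(S)P(G|S); the variable names are illustrative.

```python
# Prior P(S) and conditional P(G | S), as given in the tables
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint probability P(S, G) = P(S) * P(G | S)
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

for (s, g), p in joint.items():
    print(f"S={s!s:5} G={g!s:5} P(S,G)={p:.3f}")

# Either form has three free parameters: the joint has four entries that sum
# to 1; the factored form has one for P(S) and one per row of P(G | S).
```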

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

Admit

86

“Explaining Away”

[Diagram nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

Coin 1

Coin 2

Score

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
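The coin example is small enough to check by enumeration. In this sketch (helper names are illustrative), the two coins are independent until the score is observed, at which point knowing C2 determines C1.

```python
from itertools import product

# The four equally likely outcomes of two fair coins; Score = 1 when they match
OUTCOMES = [
    {"C1": c1, "C2": c2, "Score": 1 if c1 == c2 else 0}
    for c1, c2 in product("HT", repeat=2)
]


def conditional(target, given):
    """P(target | given), counting equally likely outcomes."""
    consistent = [o for o in OUTCOMES if all(o[k] == v for k, v in given.items())]
    hits = [o for o in consistent if all(o[k] == v for k, v in target.items())]
    return len(hits) / len(consistent)


print(conditional({"C1": "H"}, {"C2": "T"}))              # score unknown
print(conditional({"C1": "H"}, {"C2": "T", "Score": 1}))  # coins must match
print(conditional({"C1": "H"}, {"C2": "T", "Score": 0}))  # coins must differ
```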

“Explaining Away”

88

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D = (C1, C2, S) = (H,T,0), (H,H,1), …
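For fully observed discrete data with a known structure, the maximum-likelihood parameters are simply normalized counts. The sketch below fills in the empty tables from a small, made-up data set D; the data and function names are assumptions for illustration.

```python
from collections import Counter

# Hypothetical observations (C1, C2, Score)
data = [
    ("H", "T", 0), ("H", "H", 1), ("T", "T", 1),
    ("T", "H", 0), ("H", "H", 1), ("T", "H", 0),
]


def mle_marginal(rows, index):
    """Maximum-likelihood estimate of p(C = value) for the coin at `index`."""
    counts = Counter(row[index] for row in rows)
    return {value: n / len(rows) for value, n in counts.items()}


def mle_score_table(rows):
    """Maximum-likelihood estimate of p(Score | C1, C2) for observed parent states."""
    joint_counts = Counter(rows)
    parent_counts = Counter((c1, c2) for c1, c2, _ in rows)
    return {
        (c1, c2, s): n / parent_counts[(c1, c2)]
        for (c1, c2, s), n in joint_counts.items()
    }


print(mle_marginal(data, 0))
print(mle_marginal(data, 1))
print(mle_score_table(data))
```

Parent combinations that never occur in the data get no estimate at all, which is one reason priors (maximum posterior) are often preferred for small data sets.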

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

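[Network: S (Smart) → G (Grades), S → R (GREs); G, R → A (Admit)]

$P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R)$

$= P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R)$ (why?)

(because of the conditional independence assumption)

Recall: $P(X,Y) = P(X \mid Y)\,P(Y)$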

96

Prediction with Bayes Nets

[Network: S (Smart) → G (Grades), S → R (GREs); G, R → A (Admit)]

S:
P(F)   P(T)
0.5    0.5

G given S:
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R given S:
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A given G, R:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

$P(A=T \mid S=T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S=T)\,P(R \mid S=T)\,P(A=T \mid G,R)$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
(terms in order (G,R) = (F,F), (F,T), (T,F), (T,T))

Rough grading: being smart doesn't help very much

GREs are better correlated with intelligence than the grades
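The prediction step can be reproduced with a few lines of code. This sketch encodes the conditional probability tables as transcribed on this slide (True/False stand for T/F) and sums over the unobserved G and R; the dictionary and function names are illustrative.

```python
# P_G[s][g] = P(G=g | S=s); P_R[s][r] = P(R=r | S=s)
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
# P_A_TRUE[(g, r)] = P(A=T | G=g, R=r)
P_A_TRUE = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def p_admit_given_smart(s):
    """P(A=T | S=s): marginalize over grades G and GREs R."""
    return sum(
        P_G[s][g] * P_R[s][r] * P_A_TRUE[(g, r)]
        for g in (False, True)
        for r in (False, True)
    )


print(round(p_admit_given_smart(True), 3))
print(round(p_admit_given_smart(False), 3))
```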

97

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

$P(A=T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=T \mid G,R)$

$= P(S=T)\,P(A=T \mid S=T) + P(S=F)\,P(A=T \mid S=F) = 0.23$

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) calculated on previous slide = 0.29
P(A=T|S=F) calculated analogously to P(A=T|S=T) = 0.16

$P(S=T \mid A=T) = \dfrac{P(S=T, A=T)}{P(A=T)}$

Or, using Bayes' Rule: $= \dfrac{P(S=T)\,P(A=T \mid S=T)}{P(A=T)}$

98

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

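If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute $P(R=F \mid A=F)$ and $P(G=F \mid A=F)$

$P(R=F \mid A=F) = \dfrac{P(R=F, A=F)}{P(A=F)} = \dfrac{\sum_{S} \sum_{G} P(S)\,P(G \mid S)\,P(R=F \mid S)\,P(A=F \mid G, R=F)}{P(A=F)}$

$P(A=F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=F \mid G,R)$ (as before)

Tedious but straightforward to compute

$\dfrac{P(G=F \mid A=F)}{P(R=F \mid A=F)} = \dfrac{0.94}{0.55} \approx 1.7$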
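The "tedious but straightforward" sums are easy to automate by enumerating all 16 assignments of (S, G, R, A). The sketch below is a brute-force illustration that uses the conditional probability tables as transcribed in this worked example; the helper names are hypothetical, and this approach only scales to very small networks.

```python
from itertools import product

P_S = {False: 0.5, True: 0.5}
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a = P_A_TRUE[(g, r)] if a else 1.0 - P_A_TRUE[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * p_a


def probability(**fixed):
    """Marginal probability of the fixed variables, e.g. probability(A=True)."""
    total = 0.0
    for s, g, r, a in product((False, True), repeat=4):
        assignment = {"S": s, "G": g, "R": r, "A": a}
        if all(assignment[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total


def conditional(target, given):
    """P(target | given); target and given must name different variables."""
    return probability(**target, **given) / probability(**given)


print(round(probability(A=True), 3))
print(round(conditional({"S": True}, {"A": True}), 3))
print(round(conditional({"G": False}, {"A": False}), 3))
print(round(conditional({"R": False}, {"A": False}), 3))
```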

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed) 101

Properties of real interactions: correlated expression
Expression Profile Reliability (EPR)

Deane et al. Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high confidence interactions from

small scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

More likely to interact

Which pattern below is more likely to represent a pair of interacting proteins?

104

Courtesy of Cokus et al License CC-BYSource Cokus Shawn Sayaka Mizutani et al An Improved Method for Identifying FunctionallyLinked Proteins using Phylogenetic Profiles BMC Bioinformatics 8 no Suppl 4 (2007) S7

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.


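likelihood ratio (ratio of posterior probabilities):

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

log likelihood ratio:

$$\log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$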

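Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

Likelihood:

$$L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$$

Values shown on the slide: P(f | pos) = 1114/2150 and P(f | neg) = 81924/573734.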
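The likelihood ratio above can be evaluated directly from gold-standard counts. The sketch below is only an illustration: it assumes the two fractions on the slide mean "gold-standard positives with the feature / all gold-standard positives" and the same for negatives. The function name `likelihood_ratio` is invented here.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts taken from the fractions on the slide (interpretation assumed, see above).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(f"L = {L:.2f}")
```

A value above 1 means the feature is enriched among gold-standard positives, so observing it raises the estimated odds that a pair interacts.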

Fully connected → compute probabilities for all 16 possible combinations

Interpret with caution, as the numbers are small

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology, Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Predicted p27 Protein Partners

p27

Cdk2

CycA

ERCC1

TFIIH

Cks1

∆Gcalc= - 2954 kcalmol ∆Gcalc= - 3751 kcalmol ∆Gcalc= - 4406 kcalmol

N Tuncbag Courtesy of Nurcan Tuncbag Used with permission 23

Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement
Tuncbag N, Keskin O, Nussinov R, Gursoy A
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012), doi:10.1038/nature11503

PrePPI scores potential templates without building a homology model. Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.


[Figure: template complex NA–NB and the corresponding homology models.]

1. Identify interacting residues in template complex (called NA1 NB3 in rest of paper)
2. Predict interacting residues for the homology models

Evaluate based on five measures:
• SIM: structural similarity of NA to MA and of NB to MB
• SIZ (number), COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012), doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002), doi:10.1038/415123a

Gavin A-C et al. Nature 415, 141-147 (2002); Ho Y et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

Tagging strategies

TAP-tag (endogenous protein levels), tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

Yeast two-hybrid

Biotechniques 2008 Apr; 44(5): 655-62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

• Does not require purification: will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Error Rates

• How can we estimate the error rates?

[Figure: Venn diagram of a gold standard and an experiment. Gold-standard interactions missed by the experiment are false negatives, the overlap contains true positives, and the rest of the experiment is a mix of true positives and false positives.]

Data Integration

[Figure: Venn diagram of the gold standard, Experiment 1 and Experiment 2, with regions I to V.]

I = true positives from gold standard (consensus interactions that are also in the gold standard)
II = consensus true positives (interactions reported by both experiments)
Fraction of consensus present in gold standard = I/II
III = interactions unique to Experiment 1 that are also in the gold standard
The rest of Experiment 1 is a mix of true and false positives. Define IV = true positives, V = false positives.
Assume I/II = III/IV

Estimated Error Rates

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120, doi:10.1186/gb-2006-7-11-120

[Figure: overlap of the data sets with regions I, II and III marked, and the estimated split into true positives and false positives.]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{I}{II} = \frac{III}{IV}$$
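The proportionality assumption can be turned into an error-rate estimate with a few lines of arithmetic. This is a minimal sketch: the counts are hypothetical placeholders (not the values from Hart et al.), and both function names are invented for illustration.

```python
def estimate_unique_true_positives(region_i, region_ii, region_iii):
    """Solve I/II = III/IV for IV, the number of true positives in the unique data."""
    if region_i == 0:
        raise ValueError("region I must be non-zero to estimate IV")
    return region_iii * region_ii / region_i


def estimate_false_positive_fraction(region_i, region_ii, region_iii, unique_total):
    """Fraction of the unique data estimated to be false positives (V / unique_total)."""
    region_iv = estimate_unique_true_positives(region_i, region_ii, region_iii)
    region_iv = min(region_iv, unique_total)  # cannot exceed what was observed
    return 1.0 - region_iv / unique_total


# Hypothetical counts, for illustration only.
I, II, III = 50, 200, 30      # gold-standard overlap, consensus, unique-in-gold
UNIQUE_TOTAL = 600            # all interactions seen only in this experiment

iv = estimate_unique_true_positives(I, II, III)
fp_fraction = estimate_false_positive_fraction(I, II, III, UNIQUE_TOTAL)
print(iv, fp_fraction)
```

With these placeholder counts, a quarter of the consensus is in the gold standard, so the 30 gold-standard hits in the unique data are scaled up by a factor of four to estimate IV.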

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002), doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

Bayes Rule

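$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\;\overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})}$$

If > 1, classify as true; if < 1, classify as false.

How do we compute this?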

likelihood ratio:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

log likelihood ratio:

$$\log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

We assume the observations are independent (we’ll see how to handle dependence soon).

Ranking function:

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol. Cell. Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
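Under the independence assumption, the ranking function is a sum of per-assay log ratios. The sketch below illustrates that sum for binary "detected / not detected" observations. The detection rates and pair names are hypothetical; in practice they would be estimated from gold-standard positives and negatives.

```python
import math


def log_likelihood_ratio(observations, p_detect_given_true, p_detect_given_false):
    """Sum of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)] over independent assays."""
    score = 0.0
    for assay, detected in observations.items():
        p_true = p_detect_given_true[assay]
        p_false = p_detect_given_false[assay]
        if not detected:  # use the complementary probabilities for a negative result
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        score += math.log(p_true / p_false)
    return score


# Hypothetical per-assay detection rates (assumed, for illustration only).
P_DETECT_GIVEN_TRUE = {"two_hybrid": 0.20, "mass_spec": 0.40}
P_DETECT_GIVEN_FALSE = {"two_hybrid": 0.01, "mass_spec": 0.05}

candidates = {
    "pair_1": {"two_hybrid": True, "mass_spec": True},
    "pair_2": {"two_hybrid": False, "mass_spec": True},
    "pair_3": {"two_hybrid": False, "mass_spec": False},
}

scores = {
    name: log_likelihood_ratio(obs, P_DETECT_GIVEN_TRUE, P_DETECT_GIVEN_FALSE)
    for name, obs in candidates.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the prior term is the same for every pair, leaving it out changes the scores by a constant and does not change the order.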

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002), doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio.

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
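The two axes of the ROC curve can be computed by sweeping a score cutoff over a labeled set. The following is a rough sketch with made-up scores and labels. The function `roc_points` is an illustrative helper rather than anything from Collins et al., and for brevity it ignores tied scores.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs as the score cutoff is lowered one item at a time.

    FPR = FP / (TN + FP), computed against the gold-standard negatives.
    TPR = TP / (TP + FN), computed against the gold-standard positives.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores; 1 marks a gold-standard positive, 0 a gold-standard negative.
example_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
example_labels = [1, 1, 0, 1, 0, 0]
points = roc_points(example_scores, example_labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A curve that rises steeply near FPR = 0 means the highest-scoring pairs are almost all gold-standard positives.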

Collins et al., Mol. Cell. Proteomics 2007

[Figure: ROC curve with the same axes, FP/(TN+FP) against TP/(TP+FN). Annotations: 9074 interactions; literature curated (excluding 2-hybrid); error rate equivalent to literature curated.]

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

Bayesian Networks

A “natural” way to think about biological networks.

Predict unknown variables from observations.

You could try to write out all the joint probabilities:
P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)
P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big.

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1) => 2^N - 1 parameters
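To see how quickly the full joint table grows, compare it with a factored model. The sketch below assumes a naïve Bayes structure, with one binary hidden class and N binary observations that are independent given the class. That structure is our choice of comparison, and the function names are illustrative.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters in a complete joint table: 2^N states minus one constraint."""
    return 2 ** n_binary_variables - 1


def naive_bayes_parameters(n_observations):
    """One prior P(class) plus P(obs=1 | class) for each observation and each class value."""
    return 1 + 2 * n_observations


for n_obs in (4, 10, 20):
    total_variables = n_obs + 1  # the observations plus the hidden class
    print(n_obs, full_joint_parameters(total_variables), naive_bayes_parameters(n_obs))
```

The factored count grows linearly with the number of observations, while the full joint table doubles with every added variable.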

77

Graphical Structure Expresses our Beliefs

[Figure: network in which the node “P1-P2 REAL” is the cause (often “hidden”) and the nodes “Detected by X1” … “Detected by Xn” are the effects (observed). A second network also includes the nodes “Membrane” and “Highly expressed”, linked to a subset of the detection nodes (“Detected by Xj”, “Detected by Xj+1”, …).]

• Naïve Bayes assumes all observations are independent
• But some observations may be coupled
• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Figure: Bayesian network with nodes Smart, Grade inflation, Grades, GREs and Admit.]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).
Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).
You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Figure: two-node network, Smart → Grades.]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7, P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints.
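The equivalence of the two formulations can be checked numerically: multiplying the prior by the conditional table reproduces the joint table. A short sketch using the numbers from the slide, with variable names chosen here for illustration:

```python
# Prior on S (Smart) and conditional table for G (Grades above threshold) given S.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint table via the product rule: P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

for (s, g), p in joint.items():
    print(f"S={s!s:5}  G={g!s:5}  P(S,G)={p:.3f}")

# Going the other way: recover a conditional from the joint table.
p_g_true = joint[(False, True)] + joint[(True, True)]
p_smart_given_good_grades = joint[(True, True)] / p_g_true
print(f"P(S=T | G=T) = {p_smart_given_good_grades:.3f}")
```

Both tables have three free parameters: the joint has four entries that sum to 1, and the conditional form has one prior plus one entry per row of the conditional table.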

Automated Admissions Decisions

[Figure: Bayesian network with nodes Smart, Grade inflation, Grades, GREs and Admit.]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

“Explaining Away”

[Figure: Bayesian network with nodes Season, S (Sprinkler), R (Rain), Grass wet and Slippery.]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

“Explaining Away”

[Figure: Coin 1 and Coin 2 are both parents of Score.]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
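The coin example is small enough to check by brute-force enumeration. The sketch below assumes two fair, independent coins and a score of 1 when they match, as in the table. The helper names are illustrative.

```python
from itertools import product


def joint_outcomes():
    """Yield (c1, c2, score, probability) for two fair, independent coins."""
    for c1, c2 in product("HT", repeat=2):
        score = 1 if c1 == c2 else 0
        yield c1, c2, score, 0.25


def prob_c1_heads(condition):
    """P(C1 = H | condition), where condition is a predicate over (c1, c2, score)."""
    numerator = sum(p for c1, c2, s, p in joint_outcomes() if condition(c1, c2, s) and c1 == "H")
    denominator = sum(p for c1, c2, s, p in joint_outcomes() if condition(c1, c2, s))
    return numerator / denominator


# Without the score, C2 tells us nothing about C1.
p_given_c2_tails = prob_c1_heads(lambda c1, c2, s: c2 == "T")

# Once the score is known, C2 becomes fully informative about C1.
p_given_score = prob_c1_heads(lambda c1, c2, s: s == 1)
p_given_score_and_c2_tails = prob_c1_heads(lambda c1, c2, s: s == 1 and c2 == "T")
p_given_score_and_c2_heads = prob_c1_heads(lambda c1, c2, s: s == 1 and c2 == "H")

print(p_given_c2_tails, p_given_score, p_given_score_and_c2_tails, p_given_score_and_c2_heads)
```

The two coins are independent a priori, but conditioning on their common child (the score) couples them.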

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

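Learning Models from Data

[Figure: the Coin 1, Coin 2 and Score network with its parameters left blank: p(C1=H), p(C1=T), p(C2=H), p(C2=T) and the Score column of the table (rows HH, TT, HT, TH) are to be estimated from data.]

D = (C1, C2, S) = (H,T,0), (H,H,1), …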
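For fully observed discrete data with a known structure, the maximum-likelihood parameters are just relative frequencies. The sketch below fills in the blank tables by counting. The data set is invented for illustration and only follows the (C1, C2, S) format of D.

```python
from collections import Counter

# Hypothetical observations in the (C1, C2, Score) format used for D.
data = [
    ("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0),
    ("H", "H", 1), ("T", "H", 0), ("H", "T", 0), ("H", "H", 1),
]

n = len(data)

# Root nodes: the maximum-likelihood estimate is the observed frequency.
p_c1_heads = sum(1 for c1, _, _ in data if c1 == "H") / n
p_c2_heads = sum(1 for _, c2, _ in data if c2 == "H") / n

# Child node: estimate P(Score = 1 | C1, C2) separately for each parent configuration.
parent_counts = Counter((c1, c2) for c1, c2, _ in data)
score_one_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
p_score_one = {
    parents: score_one_counts[parents] / count
    for parents, count in parent_counts.items()
}

print(p_c1_heads, p_c2_heads)
print(p_score_one)
```

With hidden variables or missing values, counting is no longer enough, and iterative methods such as EM are used instead.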

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

Nodes: S = Smart, G = Grades, R = GREs, A = Admit

Recall: P(X,Y) = P(X|Y)P(Y)

$$P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R)$$

$$= P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R)$$

(Why? Because of the conditional independence assumption.)

Prediction with Bayes Nets

Nodes: S = Smart, G = Grades, R = GREs, A = Admit

S:
P(F)    P(T)
0.5     0.5

G given S:
S    P(F)    P(T)
F    0.9     0.1
T    0.8     0.2

R given S:
S    P(F)    P(T)
F    0.8     0.2
T    0.2     0.8

A given G and R:
G    R    P(F)    P(T)
F    F    0.9     0.1
F    T    0.8     0.2
T    F    0.5     0.5
T    T    0.2     0.8

$$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\,P(R \mid S{=}T)\,P(A{=}T \mid G,R)$$

= (0.8)(0.2)(0.1) [G=F, R=F] + (0.8)(0.8)(0.2) [G=F, R=T] + (0.2)(0.2)(0.5) [G=T, R=F] + (0.2)(0.8)(0.8) [G=T, R=T] = 0.29

Rough grading: being smart doesn’t help very much.

GREs are better correlated with intelligence than the grades.
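The double sum over G and R is easy to reproduce in code. The sketch below uses the conditional tables as transcribed from this slide, with dictionary names chosen for illustration. With these tables, the analogous sum for S=F evaluates to 0.164, slightly higher than the rounded 0.14 quoted on the next slide.

```python
# Conditional probability tables as transcribed from the slide (False = F, True = T).
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_ADMIT_GIVEN_GR = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def p_admit_given_smart(smart):
    """P(A=T | S=smart): marginalize the unobserved Grades and GREs."""
    total = 0.0
    for g in (False, True):
        for r in (False, True):
            total += P_G_GIVEN_S[smart][g] * P_R_GIVEN_S[smart][r] * P_ADMIT_GIVEN_GR[(g, r)]
    return total


print(f"P(A=T | S=T) = {p_admit_given_smart(True):.3f}")
print(f"P(A=T | S=F) = {p_admit_given_smart(False):.3f}")
```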

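Inference with Bayes Nets

Nodes: S = Smart, G = Grades, R = GREs, A = Admit

$$P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}T \mid G,R)$$

$$= P(S{=}T)\,P(A{=}T \mid S{=}T) + P(S{=}F)\,P(A{=}T \mid S{=}F) = 0.21$$

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T), calculated on the previous slide, = 0.29
P(A=T|S=F), calculated analogously to P(A=T|S=T), = 0.14

$$P(S{=}T \mid A{=}T) = \frac{P(S{=}T, A{=}T)}{P(A{=}T)}$$

Or, using Bayes Rule:

$$P(S{=}T \mid A{=}T) = \frac{P(S{=}T)\,P(A{=}T \mid S{=}T)}{P(A{=}T)}$$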

Inference with Bayes Nets

[Network: S (Smart) → G (Grades); S → R (GREs); G, R → A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F | A=F) and P(G=F | A=F)

\[ P(R=F \mid A=F) = \frac{P(R=F, A=F)}{P(A=F)} = \frac{\sum_{S} \sum_{G} P(S)\, P(G \mid S)\, P(R=F \mid S)\, P(A=F \mid G, R=F)}{P(A=F)} \]

\[ P(A=F) = \sum_{S} \sum_{G} \sum_{R} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A=F \mid G, R) \]  (as before)

Tedious but straightforward to compute

P(G=F | A=F) / P(R=F | A=F) = 0.92 / 0.56 = 1.6

99
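The "tedious but straightforward" inference queries of the worked example can be delegated to a generic enumeration over all 16 assignments of (S, G, R, A). This is a sketch under the assumption that the tables are exactly as transcribed earlier (P(S=T) = 0.5, S → G, S → R, (G, R) → A); the helper names `joint`, `prob` and `cond` are illustrative. With those tables the enumeration gives P(A=T) ≈ 0.23, P(R=F|A=F) ≈ 0.55 and P(G=F|A=F) ≈ 0.94, close to but not identical to the rounded values quoted on the slides, so the original table entries may differ slightly from the transcription.

```python
from itertools import product

B = [False, True]
P_S_T = 0.5                                        # P(S=T)
P_G_T = {False: 0.1, True: 0.2}                    # P(G=T | S)
P_R_T = {False: 0.2, True: 0.8}                    # P(R=T | S)
P_A_T = {(False, False): 0.1, (False, True): 0.2,
         (True, False): 0.5, (True, True): 0.8}    # P(A=T | G, R)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bern(P_S_T, s) * bern(P_G_T[s], g) *
            bern(P_R_T[s], r) * bern(P_A_T[(g, r)], a))

def prob(**fixed):
    """Marginal probability of the assignments in `fixed` (keys: s, g, r, a)."""
    total = 0.0
    for s, g, r, a in product(B, repeat=4):
        world = {"s": s, "g": g, "r": r, "a": a}
        if all(world[k] == v for k, v in fixed.items()):
            total += joint(s, g, r, a)
    return total

def cond(query, given):
    """Conditional probability P(query | given)."""
    return prob(**query, **given) / prob(**given)

print("P(A=T)     =", round(prob(a=True), 3))
print("P(S=T|A=T) =", round(cond({"s": True}, {"a": True}), 3))
print("P(R=F|A=F) =", round(cond({"r": False}, {"a": False}), 3))
print("P(G=F|A=F) =", round(cond({"g": False}, {"a": False}), 3))
```

Either way, the qualitative conclusion is the same: among rejected students, bad grades are more probable than bad GREs.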

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: "P1-P2 REAL" is the cause (often "hidden"); "Detected by X1", …, "Detected by Xn" are the effects (observed)]

101

Properties of real interactions: correlated expression
Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

108

109

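\[ \text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})} \]

if > 1 classify as true; if < 1 classify as false

\[ \text{log likelihood ratio} = \log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} \]

Prior probability is the same for all interactions, so it does not affect ranking

\[ \text{Ranking function} = \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i=1}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i=1}^{M} \log \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} \]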

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not)

\[ \text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114 / 2150}{81924 / 573734} \]

111
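The likelihood ratio for one feature value is a ratio of two frequencies estimated from the gold-standard sets. A minimal sketch, assuming the digit runs on the slide are the counts 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives in the EE category (that reading of the extracted numbers is an assumption, and the function name is illustrative):

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_pos = pos_with_f / pos_total
    p_f_neg = neg_with_f / neg_total
    return p_f_pos / p_f_neg

# Counts assumed to be those shown on the slide for EE (both essential).
L_EE = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L_EE, 2))
```

A value above 1 means that observing this feature value makes a pair more likely to be a true interaction than the prior alone would suggest.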

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected → compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

[Plot annotation: TP = FP]

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

\[ \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} \]

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Next

PRISM: Fast and accurate modeling of protein-protein interactions by combining template-interface-based docking with flexible refinement
Tuncbag N, Keskin O, Nussinov R, Gursoy A
http://www.ncbi.nlm.nih.gov/pubmed/22275112

PrePPI: Structure-based prediction of protein–protein interactions on a genome-wide scale
Zhang et al.
http://www.nature.com/nature/journal/v490/n7421/full/nature11503.html

24

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012), doi:10.1038/nature11503

PrePPI scores potential templates without building a homology model

Criteria:
• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

25

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012), doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

26

[Figure: template complex NA-NB alongside the homology models]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

32

Evaluate based on five measures
• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number), COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

35

"The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces)."

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012), doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

37

We will examine Bayesian classifiers soon

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002), doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002); Ho, Y. et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels)
Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5): 655-62

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification – will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard and Experiment. Gold standard only = false negatives; overlap = true positives; Experiment only = true positives and false positives]

47

Error Rates

[Venn diagram: Gold standard, Experiment 1 and Experiment 2. Interactions shared with the gold standard = true positives from gold standard]

48

Data Integration

[Venn diagram: Gold standard, Experiment 1 and Experiment 2, with regions I, II, III, IV (true) and V (false)]

I = true positives from gold standard
II = consensus true positives
Fraction of consensus present in gold standard = I/II

The remaining interactions from a single experiment are a mix of true and false positives
Define IV = true positives, V = false positives

Assume I/II = III/IV

52

Estimated Error Rates

[Figure: estimated false positives and true positives; Venn diagram with counts for regions I, II and III]

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120, doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV): I/II = III/IV

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53
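The proportionality assumption I/II = III/IV can be turned into a back-of-the-envelope error estimate. The sketch below assumes the regions mean: I = consensus pairs in the gold standard, II = consensus pairs not in it, III = single-experiment pairs in the gold standard, and IV + V = single-experiment pairs not in it. The region sizes are made up for illustration and the function name is hypothetical.

```python
def estimate_unique_true_positives(n_I, n_II, n_III, n_unique_not_gold):
    """
    Apply I/II = III/IV to split the interactions seen in only one
    experiment (and absent from the gold standard) into estimated
    true positives (IV) and false positives (V).
    """
    n_IV = n_III * n_II / n_I              # rearranged from I/II = III/IV
    n_IV = min(n_IV, n_unique_not_gold)    # cannot exceed what was observed
    n_V = n_unique_not_gold - n_IV
    return n_IV, n_V

# Made-up region sizes, for illustration only.
n_IV, n_V = estimate_unique_true_positives(
    n_I=200, n_II=400, n_III=50, n_unique_not_gold=1000)
false_positive_fraction = n_V / (50 + 1000)
print(n_IV, n_V, round(false_positive_fraction, 3))
```

The estimate is only as good as the assumption that the gold standard is unbiased with respect to the experiments being compared.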

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002), doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

Bayes Rule

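\[ \underbrace{P(\text{real\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{real\_PPI})}^{\text{prior}}\; \overbrace{P(\text{Data} \mid \text{real\_PPI})}^{\text{likelihood}}}{P(\text{Data})} \]

58

Naïve Bayes Classification

posterior ∝ prior × likelihood

likelihood ratio = ratio of posterior probabilities

\[ \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})} \]

if > 1 classify as true; if < 1 classify as false

How do we compute this?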

59

\[ \text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})} \]

if > 1 classify as true; if < 1 classify as false

\[ \text{log likelihood ratio} = \log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} \]

Prior probability is the same for all interactions, so it does not affect ranking

\[ \text{Ranking function} = \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} \]

61

We assume the observations are independent (we'll see how to handle dependence soon). We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol. Cell. Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

\[ \text{Ranking function} = \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i=1}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i=1}^{M} \log \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} \]

64
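Under the independence assumption, the ranking function is a sum of per-assay log ratios, so it is easy to compute once the per-assay detection rates are known. A minimal sketch; the assay names, detection rates and protein pairs are hypothetical, and in practice the rates would be estimated from high-confidence positive and negative sets:

```python
import math

def log_likelihood_ratio(observations, p_obs_true, p_obs_false):
    """
    Naive Bayes ranking score: sum over assays of
    log [ P(observation_i | true_PPI) / P(observation_i | false_PPI) ].
    """
    score = 0.0
    for assay, detected in observations.items():
        p_t = p_obs_true[assay] if detected else 1.0 - p_obs_true[assay]
        p_f = p_obs_false[assay] if detected else 1.0 - p_obs_false[assay]
        score += math.log(p_t / p_f)
    return score

# Hypothetical detection rates: P(detected | true_PPI), P(detected | false_PPI).
p_detect_true = {"two_hybrid": 0.20, "mass_spec": 0.40}
p_detect_false = {"two_hybrid": 0.01, "mass_spec": 0.05}

# Hypothetical observations for three candidate pairs.
pairs = {
    "P1-P2": {"two_hybrid": True, "mass_spec": True},
    "P3-P4": {"two_hybrid": True, "mass_spec": False},
    "P5-P6": {"two_hybrid": False, "mass_spec": False},
}
ranked = sorted(
    pairs,
    key=lambda k: log_likelihood_ratio(pairs[k], p_detect_true, p_detect_false),
    reverse=True,
)
print(ranked)
```

Note that a pair detected by only one assay still receives a score, which is the advantage over requiring detection in every assay.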

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002), doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by

\[ \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} \]

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

65

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Plot. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)]

True positives = ?   True negatives = ?

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Plot. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50
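The two axes of the ROC curve can be computed directly from scored pairs and gold-standard labels by sweeping the score cutoff. A minimal sketch with made-up scores and labels; ties between scores are not treated specially, and the function name is illustrative:

```python
def roc_points(scores, labels):
    """
    Return (FPR, TPR) pairs obtained by lowering the score cutoff one
    example at a time. labels: True for gold-standard positives,
    False for gold-standard negatives.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        # x = FP/(TN+FP), y = TP/(TP+FN)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Made-up scores and gold-standard labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [True, True, False, True, False]
pts = roc_points(scores, labels)
print(pts)
```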

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Plot. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated"]

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Plot. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotation: "9074 interactions"]

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

73

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  => 2^N - 1 parameters

77
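To see how quickly the full joint table grows, compare its parameter count with that of a factorized model. The sketch below uses, as one illustrative factorization, a single binary cause with independent binary effects (the naive Bayes structure); both function names are made up for this example:

```python
def full_joint_parameters(n):
    """Free parameters in a joint table over n binary variables."""
    return 2 ** n - 1

def naive_bayes_parameters(n_effects):
    """
    One binary cause with n_effects binary children:
    1 parameter for the cause, plus 2 per child (one per cause state).
    """
    return 1 + 2 * n_effects

for n_effects in (4, 10, 20):
    n = n_effects + 1  # the cause plus its effects
    print(n, full_joint_parameters(n), naive_bayes_parameters(n_effects))
```

The factorized count grows linearly with the number of variables, while the full table doubles with each added variable.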

Graphical Structure Expresses our Beliefs

[Diagram: "P1-P2 REAL" is the cause (often "hidden"); "Detected by X1", …, "Detected by Xn" are the effects (observed)]

78

Graphical Structure Expresses our Beliefs

[Diagram 1 nodes: "P1-P2 REAL", "Detected by X1", …, "Detected by Xn"]

[Diagram 2 nodes: "P1-P2 REAL", "Detected by X1", …, "Detected by Xn", "Detected by Xj", "Detected by Xj+1", …, "Membrane", "Highly expressed"]

Naïve Bayes assumes all observations are independent

But some observations may be coupled

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution.

A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Network nodes: Grade inflation, Smart, Grades, GREs, Admit]

Prediction: we observe the "causes" (roots, parents) and want to predict the "effects" (leaves, children)

Given Grades, GREs, will we admit?

Inference: we observe the "effects" (leaves, children) and want to infer the hidden values of the "causes" (roots, parents)

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Network nodes: Grade inflation, Smart, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities: P(admit, grade inflation, smart, grades, GREs)

We will find conditional probabilities to be very useful

84

Automated Admissions Decisions

[Network: Smart → Grades]

Joint Probability

  S   G (above threshold)   P(S, G)
  F   F                     0.665
  F   T                     0.035
  T   F                     0.06
  T   T                     0.24

Conditional Probability

  P(S=F) = 0.7   P(S=T) = 0.3

  S   P(G=F|S)   P(G=T|S)
  F   0.95       0.05
  T   0.2        0.8

Formulations are equivalent and both require the same number of constraints

85
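The equivalence of the two formulations can be checked by rebuilding the joint table from the conditional one via P(S, G) = P(S) P(G|S). A minimal sketch using the values on this slide (the variable names are illustrative):

```python
# Conditional formulation, values from the slide.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {False: {False: 0.95, True: 0.05},
               True: {False: 0.2, True: 0.8}}

# Rebuild the joint table P(S, G) = P(S) * P(G | S).
joint = {(s, g): p_s[s] * p_g_given_s[s][g]
         for s in (False, True) for g in (False, True)}

for (s, g), p in joint.items():
    print(s, g, round(p, 3))
```

Both forms have three free parameters: three independent joint entries, or P(S=T) plus one P(G=T|S) per value of S.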

Automated Admissions Decisions

[Network nodes: Grade inflation, Smart, Grades, GREs, Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents".

For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

"Explaining Away"

[Network nodes: Season, S (Sprinkler), R (Rain), Grass wet, Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

"Explaining Away"

[Network: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

  C1   C2   Score
  H    H    1
  T    T    1
  H    T    0
  T    H    0

Does the probability that C1 = H depend on whether C2 = T?

If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.

88
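The coin example is small enough to enumerate. This sketch (helper names are illustrative) lists the four equally likely outcomes and shows that C2 tells us nothing about C1 on its own, but becomes informative once the score is known.

```python
from itertools import product

# Two fair, independent coins; Score = 1 when they match, else 0 (the slide's table).
# Each world is (C1, C2, Score, probability).
worlds = [(c1, c2, int(c1 == c2), 0.25) for c1, c2 in product("HT", repeat=2)]

def prob(event, given=lambda w: True):
    """P(event | given), by summing over the enumerated worlds."""
    den = sum(w[3] for w in worlds if given(w))
    num = sum(w[3] for w in worlds if given(w) and event(w))
    return num / den

c1_heads = lambda w: w[0] == "H"

p_marginal = prob(c1_heads)                                   # no evidence
p_given_c2 = prob(c1_heads, lambda w: w[1] == "T")            # C2 alone
p_given_score = prob(c1_heads, lambda w: w[2] == 1)           # score alone
p_given_both = prob(c1_heads, lambda w: w[2] == 1 and w[1] == "T")
```

Without the score, learning C2 leaves P(C1=H) at 0.5. With Score = 1 observed, learning C2 = T forces C1 = T, so P(C1=H) drops to 0.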

How do we obtain a BN?

• Two problems:
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function.

Learning Models from Data

• Two common objective functions:
  – Maximum likelihood
    • Define the likelihood over training data X_i
  – Maximum posterior
• Good search algorithms exist:
  – Gradient descent, EM, Gibbs sampling, …

Learning Models from Data

[Network nodes: Coin 1, Coin 2, Score; parameters left blank, to be learned]

p(C1=H) =      p(C1=T) =
p(C2=H) =      p(C2=T) =

C1   C2   Score
H    H
T    T
H    T
T    H

D = (C1, C2, S) = (H,T,0), (H,H,1), …
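When the structure is known and every variable is observed, the maximum-likelihood parameters are just observed frequencies. The sketch below assumes a small made-up data set: only the first two observations, (H,T,0) and (H,H,1), come from the slide, and the rest are hypothetical.

```python
from collections import Counter

# Observations (C1, C2, Score). Only the first two appear on the slide;
# the remaining rows are hypothetical, added so there is something to count.
data = [
    ("H", "T", 0), ("H", "H", 1),
    ("T", "T", 1), ("T", "H", 0), ("H", "H", 1), ("H", "T", 0),
]

n = len(data)

# Maximum-likelihood estimates for the root nodes are relative frequencies.
p_c1_h = sum(1 for c1, _, _ in data if c1 == "H") / n
p_c2_h = sum(1 for _, c2, _ in data if c2 == "H") / n

# For the child node, estimate P(Score=1 | C1, C2) within each parent configuration.
pair_counts = Counter((c1, c2) for c1, c2, _ in data)
score_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
p_score1 = {pair: score_counts[pair] / pair_counts[pair] for pair in pair_counts}
```

With hidden variables or missing values this counting no longer applies directly, which is where search methods such as EM come in.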

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

A worked “toy” example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

[Network nodes: S = Smart, G = Grades, R = GREs, A = Admit]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)   (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

Prediction with Bayes Nets

[Network nodes: S = Smart, G = Grades, R = GREs, A = Admit]

S:
P(F)   P(T)
0.5    0.5

G given S:
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R given S:
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A given G, R:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)
           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
             (terms in order: G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn’t help very much.

GREs are better correlated with intelligence than the grades.
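The double sum can be evaluated mechanically. This is a minimal sketch using the conditional probability tables as transcribed from the slide; the dictionary layout and function name are illustrative choices.

```python
from itertools import product

B = (False, True)

# Conditional probability tables as transcribed from the slide.
p_g = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # p_g[s][g] = P(G=g | S=s)
p_r = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # p_r[s][r] = P(R=r | S=s)
p_a_true = {  # P(A=T | G, R), keyed by (g, r)
    (False, False): 0.1, (False, True): 0.2,
    (True, False): 0.5, (True, True): 0.8,
}

def p_admit_given_smart(s):
    """P(A=T | S=s): sum over the unobserved G and R."""
    return sum(
        p_g[s][g] * p_r[s][r] * p_a_true[(g, r)]
        for g, r in product(B, repeat=2)
    )

print(round(p_admit_given_smart(True), 3))
print(round(p_admit_given_smart(False), 3))
```

For S=T this reproduces the four-term sum on the slide (0.292, which the slide rounds to 0.29).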

Inference with Bayes Nets

[Network nodes: S = Smart, G = Grades, R = GREs, A = Admit]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)
       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

where
P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on the previous slide)
P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes’ rule: P(S=T|A=T) = P(S=T) P(A=T|S=T) / P(A=T)

Inference with Bayes Nets

[Network nodes: S = Smart, G = Grades, R = GREs, A = Admit]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

P(R=F|A=F) = P(R=F, A=F) / P(A=F)
           = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R)   (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6
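The “tedious but straightforward” sums are easy to delegate to a few lines of code. The sketch below enumerates all 16 joint states of (S, G, R, A) from the tables as transcribed on the Prediction slide; the helper names are illustrative. Direct enumeration from those tables gives P(A=T) of about 0.23, P(G=F|A=F) of about 0.94 and P(R=F|A=F) of about 0.55, slightly different from the rounded figures quoted on the slides (0.21, 0.92, 0.56). The source of that gap is not clear from the slides, but the qualitative conclusion is the same.

```python
from itertools import product

B = (False, True)

# Tables as transcribed from the "Prediction with Bayes Nets" slide.
p_s = {False: 0.5, True: 0.5}
p_g = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P(G | S)
p_r = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P(R | S)
p_a_true = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}                     # P(A=T | G, R)

def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    pa = p_a_true[(g, r)]
    return p_s[s] * p_g[s][g] * p_r[s][r] * (pa if a else 1 - pa)

def prob(event, given=lambda s, g, r, a: True):
    """P(event | given) by brute-force enumeration of all 16 states."""
    states = list(product(B, repeat=4))
    den = sum(joint(*w) for w in states if given(*w))
    num = sum(joint(*w) for w in states if given(*w) and event(*w))
    return num / den

admitted = lambda s, g, r, a: a
rejected = lambda s, g, r, a: not a

p_admit = prob(admitted)
p_smart_given_admit = prob(lambda s, g, r, a: s, admitted)
p_bad_grades_given_reject = prob(lambda s, g, r, a: not g, rejected)
p_bad_gres_given_reject = prob(lambda s, g, r, a: not r, rejected)
```

Brute-force enumeration is exponential in the number of variables, so it only suits toy networks like this one.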

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Network: “P1-P2 REAL” is the cause (often “hidden”); “Detected by X1”, …, “Detected by Xn” are the effects (observed)]

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. “Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations.” Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

Co-evolution

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled “More likely to interact”]

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. “An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles.” BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

[Figures from Jansen et al. (2003)]

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

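\[ \text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})} \]

If > 1, classify as true; if < 1, classify as false.

\[ \text{log likelihood ratio} = \log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} \]

The prior probability is the same for all interactions, so it does not affect ranking.

\[ \text{Ranking function} = \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} \]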

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

\[ \text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114 / 2150}{81924 / 573734} \]
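The likelihood ratio for one feature value is a ratio of two observed frequencies. This sketch assumes the runs of digits on the slide are the fractions 1114/2150 (gold-standard positives) and 81924/573734 (gold-standard negatives) for a single essentiality category; that reading, and the variable names, are assumptions.

```python
# Counts as read from the slide (assumed to be: pairs with this feature value
# among gold-standard positives, and among gold-standard negatives).
feature_pos, total_pos = 1114, 2150
feature_neg, total_neg = 81924, 573734

p_f_given_pos = feature_pos / total_pos   # P(f | pos)
p_f_given_neg = feature_neg / total_neg   # P(f | neg)

# L > 1 means this feature value is enriched among true interactions.
likelihood_ratio = p_f_given_pos / p_f_given_neg
print(round(likelihood_ratio, 2))
```

Under that reading the ratio comes out at roughly 3.6, so pairs with this feature value are several times more common among the positives than among the negatives.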

[Figures from Jansen et al. (2003)]

Fully connected → compute probabilities for all 16 possible combinations.

Interpret with caution, as numbers are small.

TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

\[ \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} \]

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology, Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Page 25: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

PrePPI: scores potential templates without building a homology model. Criteria:

• Geometric similarity between the protomer and template
• Statistics based on preservation of contact residues

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. “Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale.” Nature 490, no. 7421 (2012): 556-60.


[Figure: template complex with chains NA and NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

Evaluate based on five measures:

• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number), COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. “Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale.” Nature 490, no. 7421 (2012): 556-60.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002)
Ho, Y. et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. “Proteomics: Protein Complexes take the Bait.” Nature 415, no. 6868 (2002): 123-4.

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations:
• overexpression/tagging can influence results
• only long-lived complexes will be detected

Tagging strategies

TAP-tag (endogenous protein levels). Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. “Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes.” Nature 415, no. 6868 (2002): 141-7.

Yeast two-hybrid

Biotechniques. 2008 Apr; 44(5): 655-62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

• Does not require purification: will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. “Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System.” Biotechniques 44, no. 5 (2008): 655.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard and Experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

Error Rates

[Venn diagram: Gold standard, Experiment 1, Experiment 2; the overlap with the gold standard gives true positives from the gold standard]

Data Integration

[Venn diagram: Gold standard, Experiment 1, Experiment 2, with regions I to V]

I = true positives from gold standard
II = consensus true positives

Fraction of consensus present in gold standard = I/II

Outside the consensus, Experiment 1 contains a mix of true and false positives; region III is the part of this mix that is present in the gold standard. Define:

IV = true positives
V = false positives

Assume I/II = III/IV

Estimated Error Rates

[Figure: estimated false positives and true positives per data set; the counts for regions I, II and III are given in the original figure]

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. “How Complete are Current Yeast and Human Protein-interaction Networks?” Genome Biology 7, no. 11 (2006): 120.
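The proportionality assumption can be solved for the unknown number of true positives in the unique data. The sketch below uses made-up counts purely for illustration (the real values are in the figure, which is not reproduced here), and the function and variable names are hypothetical.

```python
def estimate_unique_true_positives(region_i, region_ii, region_iii):
    """Solve I/II = III/IV for IV, the true positives among the unique data."""
    return region_iii * region_ii / region_i

# Hypothetical counts, for illustration only.
region_i = 300       # consensus interactions also found in the gold standard
region_ii = 1000     # consensus interactions (all assumed true)
region_iii = 600     # unique interactions also found in the gold standard
unique_total = 5000  # all interactions unique to one experiment

region_iv = estimate_unique_true_positives(region_i, region_ii, region_iii)
region_v = unique_total - region_iv                  # estimated false positives
false_positive_fraction = region_v / unique_total
```

With these invented numbers, 30% of the consensus is in the gold standard, so 600 gold-standard hits imply about 2000 true positives and a 60% false-positive fraction among the unique interactions.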

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

[Figure from von Mering et al. (2002). Note: log-log plot.]

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. “Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions.” Nature 417, no. 6887 (2002): 399-403.

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

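Bayes Rule

\[ \underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}\;\overbrace{P(\text{true\_PPI})}^{\text{prior}}}{P(\text{Data})} \]

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities

If > 1, classify as true; if < 1, classify as false.

How do we compute this?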

\[ \text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})} \]

If > 1, classify as true; if < 1, classify as false.

\[ \text{log likelihood ratio} = \log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} \]

The prior probability is the same for all interactions, so it does not affect ranking.

\[ \text{Ranking function} = \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} \]

We assume the observations are independent (we’ll see how to handle dependence soon):

\[ \text{Ranking function} = \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} \]

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol. Cell. Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
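Under the independence assumption, the log of the product becomes a sum of per-assay log ratios. The sketch below scores a few protein pairs this way; the assay names, the detection probabilities and the pair labels are all hypothetical stand-ins for values that would be estimated from gold-standard positive and negative sets.

```python
import math

# Hypothetical per-assay probabilities: (P(detected | true PPI), P(detected | false PPI)).
assays = {
    "two_hybrid": (0.20, 0.01),
    "mass_spec": (0.40, 0.05),
    "co_expression": (0.60, 0.30),
}

def log_likelihood_ratio(observations):
    """Sum of per-observation log ratios (the log of the product over observations)."""
    score = 0.0
    for name, detected in observations.items():
        p_true, p_false = assays[name]
        if not detected:
            # A non-detection is also evidence: use the complementary probabilities.
            p_true, p_false = 1 - p_true, 1 - p_false
        score += math.log(p_true / p_false)
    return score

# Hypothetical protein pairs and which assays detected them.
pairs = {
    "P1-P2": {"two_hybrid": True, "mass_spec": True, "co_expression": True},
    "P3-P4": {"two_hybrid": False, "mass_spec": True, "co_expression": False},
    "P5-P6": {"two_hybrid": False, "mass_spec": False, "co_expression": False},
}

ranked = sorted(pairs, key=lambda k: log_likelihood_ratio(pairs[k]), reverse=True)
```

Because the prior is shared by all pairs, sorting by this score gives the same order as sorting by posterior odds. A pair need not be detected by every assay to rank well.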

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. “Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions.” Nature 417, no. 6887 (2002): 399-403.

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve. Horizontal axis: FP/(TN+FP), computed using high-confidence negatives. Vertical axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: 9074 interactions; literature curated (excluding 2-hybrid); error rate equivalent to literature curated.]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

Source: Collins, Sean R., Patrick Kemmeren, et al. “Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae.” Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.
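The two axes of the ROC curve can be computed by sweeping a score cutoff over labeled examples. This is a generic sketch, not the procedure from Collins et al.; the scores and labels are invented, and ties between scores are not treated specially.

```python
def roc_points(scores, labels):
    """Return (FP/(TN+FP), TP/(TP+FN)) pairs as the cutoff is lowered past each example."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    n_pos = sum(labels)             # TP + FN
    n_neg = len(labels) - n_pos     # TN + FP
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical interaction scores; label 1 = high-confidence positive, 0 = high-confidence negative.
scores = [5.8, 3.1, 1.3, 0.2, -0.4, -1.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
```

Each cutoff yields one point. A good scoring function climbs toward a true-positive rate of 1 while the false-positive rate is still small.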

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

Bayesian Networks

A “natural” way to think about biological networks.

Predict unknown variables from observations.

[Figure from Jansen et al. (2003)]

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. “Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks.” Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  => 2^N - 1 parameters
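The gap between 2^N - 1 and what a Bayesian network needs is easy to quantify for binary variables. In the sketch below, the five-node parent lists mirror the admissions example used elsewhere in these slides; the exact edges are an assumption inferred from the text, and the function names are illustrative.

```python
def full_joint_params(n_binary_vars):
    """Free parameters of an unrestricted joint table over N binary variables."""
    return 2 ** n_binary_vars - 1

def bayes_net_params(parents):
    """Free parameters of a binary Bayes net: one per node per parent configuration."""
    return sum(2 ** len(node_parents) for node_parents in parents.values())

# Assumed structure (edges inferred from the admissions example, not stated as a list).
admissions = {
    "Smart": [],
    "GradeInflation": [],
    "Grades": ["Smart", "GradeInflation"],
    "GREs": ["Smart"],
    "Admit": ["Grades", "GREs"],
}

n_full = full_joint_params(len(admissions))
n_network = bayes_net_params(admissions)
```

For five variables the difference is modest (31 versus 12 under these assumed edges), but the full table doubles with every added variable while the network grows only with the number of parents per node.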

Graphical Structure Expresses our Beliefs

[Network 1: “P1-P2 REAL” is the cause (often “hidden”); “Detected by X1”, …, “Detected by Xn” are the effects (observed)]

[Network 2: “P1-P2 REAL” with “Detected by X1”, …, “Detected by Xn”, “Detected by Xj”, “Detected by Xj+1”, …, plus additional nodes “Membrane” and “Highly expressed”]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data.

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

Nodes: Smart, Grade inflation, Grades, GREs, Admit

Prediction: we observe the "causes" (roots, parents) and want to predict the "effects" (leaves, children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves, children) and want to infer the hidden values of the "causes" (roots, parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

Nodes: Smart, Grade inflation, Grades, GREs, Admit

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

Nodes: Smart (S), Grades (G); G = T means grades above threshold

Joint Probability

S   G   P(S,G)
F   F   0.665
F   T   0.035
T   F   0.06
T   T   0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

The formulations are equivalent, and both require the same number of independent parameters (three).

85
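As a quick check of that equivalence, the sketch below rebuilds the joint table from the conditional one with the product rule P(S,G) = P(S) P(G|S), and then recovers the conditional from the joint. The variable and function names are illustrative assumptions; the numbers are the ones on the slide.

```python
# Conditional formulation: P(S) and P(G | S), with False = F and True = T.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint formulation via the product rule P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}


def marginal_s(joint_table, s):
    """P(S = s), obtained by summing the joint over G."""
    return sum(p for (s_value, _), p in joint_table.items() if s_value == s)


def conditional_g(joint_table, g, s):
    """P(G = g | S = s), recovered from the joint table."""
    return joint_table[(s, g)] / marginal_s(joint_table, s)


for key in sorted(joint):
    print(key, round(joint[key], 3))
print(round(conditional_g(joint, True, True), 3))  # 0.8
```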

Automated Admissions Decisions

Nodes: Smart, Grade inflation, Grades, GREs, Admit

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

"Explaining Away"

Nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, once we see that the grass is wet, the knowledge that it is raining influences our belief about the sprinkler.

87

"Explaining Away"

Nodes: Coin 1 (C1) and Coin 2 (C2) are the parents of Score

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
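The coin example is small enough to enumerate. The sketch below, with helper names that are assumptions for illustration, lists the four equally likely outcomes and compares the belief in C1 = H with and without knowledge of the score.

```python
from itertools import product

# Two fair, independent coins; Score is 1 when they match and 0 otherwise.
OUTCOMES = [
    {"C1": c1, "C2": c2, "Score": 1 if c1 == c2 else 0, "p": 0.25}
    for c1, c2 in product("HT", repeat=2)
]


def consistent(outcome, assignment):
    """True if the outcome agrees with every variable in `assignment`."""
    return all(outcome[name] == value for name, value in assignment.items())


def prob(query, given=None):
    """P(query | given), computed by summing over the four outcomes."""
    given = given or {}
    denominator = sum(o["p"] for o in OUTCOMES if consistent(o, given))
    numerator = sum(
        o["p"] for o in OUTCOMES if consistent(o, given) and consistent(o, query)
    )
    return numerator / denominator


print(prob({"C1": "H"}))                           # 0.5
print(prob({"C1": "H"}, {"C2": "T"}))              # 0.5
print(prob({"C1": "H"}, {"Score": 1}))             # 0.5
print(prob({"C1": "H"}, {"Score": 1, "C2": "T"}))  # 0.0
```

Without the score, learning C2 leaves the belief in C1 unchanged. Once the score is observed, C2 fully determines C1, even though neither coin causally affects the other.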

88

How do we obtain a BN

• Two problems
  – learning the graph structure
    • NP-complete
    • approximation algorithms
  – learning the probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?

• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data X_i
  – Maximum posterior

• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

Nodes: Coin 1 (C1) and Coin 2 (C2) are the parents of Score

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …
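When the structure is known and every variable is observed, maximum-likelihood estimates for discrete tables reduce to counting. The sketch below illustrates this for the coin network; the data set is hypothetical (only its first two tuples mirror the slide) and the function names are assumptions.

```python
from collections import Counter

# Hypothetical training data D = (C1, C2, Score). Only the first two tuples
# mirror the slide; the rest are made up for illustration.
data = [
    ("H", "T", 0),
    ("H", "H", 1),
    ("T", "T", 1),
    ("H", "T", 0),
    ("T", "H", 0),
    ("H", "H", 1),
]


def ml_marginal(samples, index):
    """Maximum-likelihood estimate of a coin's distribution (relative counts)."""
    counts = Counter(sample[index] for sample in samples)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def ml_conditional_score(samples):
    """Maximum-likelihood estimate of P(Score = 1 | C1, C2) per observed pair."""
    totals = Counter((c1, c2) for c1, c2, _ in samples)
    ones = Counter((c1, c2) for c1, c2, score in samples if score == 1)
    return {pair: ones[pair] / n for pair, n in totals.items()}


print(ml_marginal(data, 0))
print(ml_marginal(data, 1))
print(ml_conditional_score(data))
```

With hidden variables or missing values the counts are no longer available directly, which is where iterative searches such as EM or Gibbs sampling come in.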

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

A worked "toy" example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

Nodes: S = Smart, G = Grades, R = GREs, A = Admit

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

           = P(S) P(G|S) P(R|S) P(A|G,R)   (why? because of the conditional independence assumptions)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

Nodes: S = Smart, G = Grades, R = GREs, A = Admit

P(S):
P(F)   P(T)
0.5    0.5

P(G|S):
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

Rough grading: being smart doesn't help very much.

P(R|S):
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

GREs are better correlated with intelligence than the grades.

P(A|G,R):
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

$$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\, P(R \mid S{=}T)\, P(A{=}T \mid G, R)$$

= (0.8)(0.2)(0.1) [G=F, R=F] + (0.8)(0.8)(0.2) [G=F, R=T] + (0.2)(0.2)(0.5) [G=T, R=F] + (0.2)(0.8)(0.8) [G=T, R=T] = 0.29
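The double sum above can be checked with a few lines of code. This is a minimal sketch assuming the conditional probability tables as read from the slide; the constant and function names are illustrative.

```python
from itertools import product

# Each table stores P(variable = True | parents); False = F, True = T.
P_G_TRUE = {False: 0.1, True: 0.2}   # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}   # P(R=T | S)
P_A_TRUE = {                         # P(A=T | G, R)
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def bernoulli(p_true, value):
    """Probability that a binary variable with P(True) = p_true equals `value`."""
    return p_true if value else 1.0 - p_true


def p_admit_given_smart(smart):
    """P(A=T | S=smart): sum over both values of G and R."""
    total = 0.0
    for g, r in product((False, True), repeat=2):
        total += (
            bernoulli(P_G_TRUE[smart], g)
            * bernoulli(P_R_TRUE[smart], r)
            * P_A_TRUE[(g, r)]
        )
    return total


print(round(p_admit_given_smart(True), 3))  # 0.292
```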

97

Inference with Bayes Nets

Nodes: S = Smart, G = Grades, R = GREs, A = Admit

$$P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A{=}T \mid G, R)$$

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) ≈ 0.23

P(S=T) = 0.5, P(S=F) = 0.5

P(A=T|S=T) = 0.29 (calculated on the previous slide)

P(A=T|S=F) = (0.9)(0.8)(0.1) + (0.9)(0.2)(0.2) + (0.1)(0.8)(0.5) + (0.1)(0.2)(0.8) ≈ 0.16 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)

Or, using Bayes' rule: = P(S=T) P(A=T|S=T) / P(A=T) ≈ 0.64

98

Inference with Bayes Nets

Nodes: S = Smart, G = Grades, R = GREs, A = Admit

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S} \sum_{G} P(S)\, P(G \mid S)\, P(R{=}F \mid S)\, P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S} \sum_{G} \sum_{R} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A{=}F \mid G, R) \quad \text{(as before)}$$

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) ≈ 0.94 / 0.55 ≈ 1.7
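The "tedious but straightforward" sums are a natural job for brute-force enumeration. The sketch below assumes the toy network's tables as given on the earlier slide and answers any conditional query by summing the factorized joint over all 16 assignments; `query` and the other names are illustrative, and enumeration like this only scales to small networks.

```python
from itertools import product

# Each table stores P(variable = True | parents); False = F, True = T.
P_S_TRUE = 0.5
P_G_TRUE = {False: 0.1, True: 0.2}   # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}   # P(R=T | S)
P_A_TRUE = {                         # P(A=T | G, R)
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}
NAMES = ("S", "G", "R", "A")


def bernoulli(p_true, value):
    """Probability that a binary variable with P(True) = p_true equals `value`."""
    return p_true if value else 1.0 - p_true


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (
        bernoulli(P_S_TRUE, s)
        * bernoulli(P_G_TRUE[s], g)
        * bernoulli(P_R_TRUE[s], r)
        * bernoulli(P_A_TRUE[(g, r)], a)
    )


def query(target, evidence):
    """P(target | evidence); both are dicts such as {"G": False}."""
    numerator = denominator = 0.0
    for values in product((False, True), repeat=4):
        world = dict(zip(NAMES, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(*values)
        denominator += p
        if all(world[name] == value for name, value in target.items()):
            numerator += p
    return numerator / denominator


print(round(query({"S": True}, {"A": True}), 3))
print(round(query({"G": False}, {"A": False}), 3))
print(round(query({"R": False}, {"A": False}), 3))
```

Under these tables a rejected student is more likely to have had bad grades than bad GREs, largely because bad grades are common regardless of S.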

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

Cause (often "hidden"): P1-P2 REAL

Effects (observed): Detected by X1, …, Detected by Xn

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data


108


109

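likelihood ratio (ratio of posterior probabilities):

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio:

$$\log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$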

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114 / 2150}{81924 / 573734} \approx 3.6$$
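The likelihood for one feature value is just a ratio of two frequencies in the gold-standard sets. The sketch below assumes the slide's digits read as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives sharing the feature value; that reading of the figures, and the function name, are assumptions.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (assumed split of the printed digits).
ratio = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(ratio, 1))  # 3.6
```

A value above 1 means the feature value is enriched among gold-standard positives relative to negatives.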

111

112

113

Fully connected → compute probabilities for all 16 possible combinations

114

Interpret with caution, as numbers are small

115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

26


[Figure: structures labeled NA and NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)

2. Predict interacting residues for the homology models

32

[Figure: structures labeled NA and NB]

Evaluate based on five measures:
• SIM: structural similarity of NA/MA and NB/MB
• SIZ (number) and COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

35

"The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces)."

36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

37

We will examine Bayesian classifiers soon

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Kumar, Anuj, and Michael Snyder. "Proteomics: Protein complexes take the bait." Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C., et al. Nature 415, 141-147 (2002); Ho, Y., et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission.

40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels), tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5): 655-62

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification: will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

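Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard and Experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

47

Error Rates

[Venn diagram: Gold standard, Experiment 1, Experiment 2. Overlap with the gold standard: true positives from gold standard.]

48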

Data Integration

[Venn diagram: Gold standard, Experiment 1, Experiment 2, with regions I to V]

I = true positives from gold standard

II = consensus true positives

Fraction of consensus present in gold standard = I/II

The non-consensus data are a mix of true and false positives; III is the part of them that overlaps the gold standard.

Define IV = true positives, V = false positives

Assume I/II = III/IV

52

Estimated Error Rates

Hart, G. Traver, Arun K. Ramani, and Edward M. Marcotte. "How complete are current yeast and human protein-interaction networks?" Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

[Figure: estimated false positives and true positives, with values for regions I, II, and III]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.

53
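The assumption I/II = III/IV can be solved for IV, the estimated number of true positives in the non-consensus data, and the remainder is then the estimated false positives V. The sketch below shows the arithmetic with made-up counts; the numbers and function names are hypothetical, not values from the paper.

```python
def estimate_true_positives(region_i, region_ii, region_iii):
    """Solve I/II = III/IV for IV (true positives in the non-consensus data)."""
    return region_iii * region_ii / region_i


def estimate_false_fraction(region_i, region_ii, region_iii, unique_total):
    """Estimated fraction of non-consensus interactions that are false, V/(IV+V)."""
    region_iv = estimate_true_positives(region_i, region_ii, region_iii)
    region_v = unique_total - region_iv
    return region_v / unique_total


# Hypothetical counts: 100 of 400 consensus interactions are in the gold
# standard, and 50 of 1000 non-consensus interactions are.
print(estimate_true_positives(100, 400, 50))        # 200.0
print(estimate_false_fraction(100, 400, 50, 1000))  # 0.8
```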

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

von Mering et al. "Comparative assessment of large-scale data sets of protein–protein interactions." Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

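Bayes Rule

$$P(\text{true\_PPI} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data})}$$

posterior: P(true_PPI | Data); prior: P(true_PPI); likelihood: P(Data | true_PPI)

58

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})}$$

if > 1 classify as true; if < 1 classify as false

How do we compute this?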

59

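likelihood ratio (ratio of posterior probabilities):

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio:

$$\log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$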

61

We assume the observations are independent (we'll see how to handle dependence soon).

Ranking function:

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
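Because the log of a product is a sum of logs, the ranking function can be computed by adding one log likelihood ratio per observation. The sketch below assumes two binary assays with made-up likelihood tables; the assay names, the probabilities, and the function name are all hypothetical.

```python
import math

# Hypothetical likelihood tables. For each assay and each outcome
# (detected or not), store (P(outcome | true PPI), P(outcome | false PPI)).
LIKELIHOODS = {
    "two_hybrid": {True: (0.30, 0.01), False: (0.70, 0.99)},
    "mass_spec": {True: (0.50, 0.05), False: (0.50, 0.95)},
}


def log_likelihood_ratio(observations, table=LIKELIHOODS):
    """Sum of per-observation log ratios, assuming independent observations.

    `observations` maps assay name to True/False. Assays that were not run
    are simply left out, so they contribute nothing to the score.
    """
    score = 0.0
    for assay, detected in observations.items():
        p_if_true, p_if_false = table[assay][detected]
        score += math.log(p_if_true / p_if_false)
    return score


candidates = {
    "pair_1": {"two_hybrid": True, "mass_spec": True},
    "pair_2": {"two_hybrid": False, "mass_spec": True},
    "pair_3": {"two_hybrid": False, "mass_spec": False},
    "pair_4": {"mass_spec": True},  # two-hybrid result missing
}
ranked = sorted(candidates, key=lambda name: log_likelihood_ratio(candidates[name]),
                reverse=True)
print(ranked)
```

Leaving out a missing assay is one simple way to handle incomplete data; whether that is appropriate depends on why the measurement is missing.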

64

von Mering et al. "Comparative assessment of large-scale data sets of protein–protein interactions." Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

65

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = ?
True negatives = ?

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

66

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67
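Each point on such a curve comes from one score cutoff: everything scoring at or above the cutoff is called an interaction, and the two axis values are computed against the gold-standard positives and negatives. The sketch below uses made-up scores and labels; the function name and data are assumptions for illustration.

```python
def roc_points(scores, labels):
    """Return (FP/(TN+FP), TP/(TP+FN)) pairs, one per cutoff, highest cutoff first.

    `labels` holds True for gold-standard positives and False for
    gold-standard negatives; a pair is predicted positive when its score
    is at or above the cutoff.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    points = []
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and not y)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical log likelihood ratios and gold-standard labels.
scores = [5.7, 3.4, 2.3, 0.1, -1.0, -2.5]
labels = [True, True, False, True, False, False]

for false_positive_rate, true_positive_rate in roc_points(scores, labels):
    print(round(false_positive_rate, 2), round(true_positive_rate, 2))
```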

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]

68


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

73

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables ⇒ 2^N states
  – only one constraint (sum of all probabilities = 1) ⇒ 2^N − 1 parameters
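To make the scaling concrete, here is a minimal Python sketch that compares the number of free parameters in a full joint table with the number needed when one hidden binary cause has several binary observed effects that are independent given the cause. The function names are illustrative assumptions, not part of the lecture.

```python
def full_joint_params(n_vars: int) -> int:
    """Free parameters of a full joint table over n_vars binary variables."""
    # 2^N states, minus one because the probabilities must sum to 1.
    return 2 ** n_vars - 1


def naive_bayes_params(n_obs: int) -> int:
    """Free parameters when n_obs binary effects depend only on one binary cause."""
    # One prior P(cause=T), plus P(obs=T | cause=T) and P(obs=T | cause=F)
    # for each observed effect.
    return 1 + 2 * n_obs


if __name__ == "__main__":
    for n_obs in (4, 10, 20):
        n_vars = n_obs + 1  # the observations plus the hidden cause
        print(n_obs, full_joint_params(n_vars), naive_bayes_params(n_obs))
```

The full table grows exponentially with the number of variables, while the factored model grows linearly.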

Graphical Structure Expresses our Beliefs

[Diagram: the node "P1-P2 REAL" points to the nodes "Detected by X1", …, "Detected by Xn".]

Cause (often “hidden”): P1-P2 REAL
Effects (observed): Detected by X1, …, Detected by Xn


Graphical Structure Expresses our Beliefs

[Diagram, left: "P1-P2 REAL" points to "Detected by X1", …, "Detected by Xn".]
[Diagram, right: "P1-P2 REAL" with "Detected by X1", …, "Detected by Xn" and "Detected by Xj", "Detected by Xj+1", …, plus the additional nodes "Membrane" and "Highly expressed".]

Naïve Bayes assumes all observations are independent, given whether the interaction is real.

But some observations may be coupled.

Graphical Structure Expresses our Beliefs

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution.

A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs).

We will find conditional probabilities to be very useful.

Automated Admissions Decisions

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability (Smart, Grades)

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints.
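As a quick check that the two formulations carry the same information, the following Python sketch rebuilds the joint table from the prior P(S) and the conditional table P(G|S) given on the slide. The variable names are illustrative assumptions.

```python
from itertools import product

# Prior and conditional table as given on the slide.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},  # P(G | S=F)
    True: {False: 0.2, True: 0.8},     # P(G | S=T)
}

# Joint probability via P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s, g in product((False, True), repeat=2)
}

# Going back: marginalize the joint to recover the prior P(S=T).
p_s_true_from_joint = joint[(True, False)] + joint[(True, True)]

if __name__ == "__main__":
    for (s, g), p in joint.items():
        print(f"S={s!s:5} G={g!s:5} P(S,G)={p:.3f}")
    print("P(S=T) recovered from joint:", round(p_s_true_from_joint, 3))
```

Both representations are fixed by three free numbers: the joint table has four entries that sum to 1, and the factored form has one prior plus one conditional probability per value of S.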

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”.

For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

“Explaining Away”

[Network nodes: Season; S: Sprinkler; R: Rain; Grass wet; Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about whether the sprinkler is on, once we know that the grass is wet.

“Explaining Away”

[Network nodes: Coin 1, Coin 2, Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
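The two-coin example is small enough to enumerate. The sketch below, with illustrative helper names, lists the four equally likely outcomes and compares P(C1=H) under different kinds of evidence.

```python
from itertools import product

# Each world is (C1, C2, Score); the score is 1 when the coins match.
# Two fair, independent coins give each world probability 0.25.
worlds = {(c1, c2, int(c1 == c2)): 0.25 for c1, c2 in product("HT", repeat=2)}


def cond_prob(event, given):
    """P(event | given), computed by summing over the enumerated worlds."""
    denom = sum(p for w, p in worlds.items() if given(w))
    numer = sum(p for w, p in worlds.items() if given(w) and event(w))
    return numer / denom


def c1_is_heads(w):
    return w[0] == "H"


# Without the score, C2 tells us nothing about C1.
p_c1 = cond_prob(c1_is_heads, lambda w: True)
p_c1_given_c2 = cond_prob(c1_is_heads, lambda w: w[1] == "T")

# Once the score is known, C2 becomes informative about C1.
p_c1_given_score = cond_prob(c1_is_heads, lambda w: w[2] == 1)
p_c1_given_score_c2 = cond_prob(c1_is_heads, lambda w: w[2] == 1 and w[1] == "T")

if __name__ == "__main__":
    print(p_c1, p_c1_given_c2, p_c1_given_score, p_c1_given_score_c2)
```

The coins are independent a priori, but conditioning on their common effect (the score) couples them.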

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

Learning Models from Data

[Network nodes: Coin 1, Coin 2, Score]

Parameters to learn:

p(C1=H) = ?    p(C1=T) = ?
p(C2=H) = ?    p(C2=T) = ?

C1    C2    Score
H     H     ?
T     T     ?
H     T     ?
T     H     ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …
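When every variable is observed in every training example, the maximum-likelihood estimates of discrete probabilities are simply relative frequencies. The sketch below illustrates this for the coin network; the data set is hypothetical (only its first two records appear on the slide) and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical training data D = (C1, C2, S); only the first two
# records come from the slide, the rest are made up for illustration.
data = [
    ("H", "T", 0),
    ("H", "H", 1),
    ("T", "T", 1),
    ("T", "H", 0),
    ("H", "H", 1),
    ("T", "H", 0),
]

n = len(data)

# Maximum-likelihood estimates of the coin priors: observed frequencies.
p_c1_heads = sum(c1 == "H" for c1, _, _ in data) / n
p_c2_heads = sum(c2 == "H" for _, c2, _ in data) / n

# Maximum-likelihood estimate of P(Score=1 | C1, C2): the fraction of
# records with each (C1, C2) combination that have Score = 1.
pair_counts = Counter((c1, c2) for c1, c2, _ in data)
score_one_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
p_score_one = {
    pair: score_one_counts[pair] / count for pair, count in pair_counts.items()
}

if __name__ == "__main__":
    print("P(C1=H) =", p_c1_heads)
    print("P(C2=H) =", round(p_c2_heads, 3))
    for pair, p in sorted(p_score_one.items()):
        print(pair, "P(Score=1) =", p)
```

With hidden variables or missing values, counting no longer suffices, which is where iterative methods such as EM come in.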

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

[Network nodes: S: Smart; G: Grades; R: GREs; A: Admit]

Recall: P(X,Y) = P(X|Y) P(Y)

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)

(Why? Because of the conditional independence assumption.)

Prediction with Bayes Nets

[Network nodes: S: Smart; G: Grades; R: GREs; A: Admit]

S:
P(F)    P(T)
0.5     0.5

G:
S    P(F)    P(T)
F    0.9     0.1
T    0.8     0.2

Rough grading: being smart doesn’t help very much.

R:
S    P(F)    P(T)
F    0.8     0.2
T    0.2     0.8

GREs are better correlated with intelligence than the grades.

A:
G    R    P(F)    P(T)
F    F    0.9     0.1
F    T    0.8     0.2
T    F    0.5     0.5
T    T    0.2     0.8

$$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\, P(R \mid S{=}T)\, P(A{=}T \mid G, R)$$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29

(terms in the order G,R = F,F; F,T; T,F; T,T)
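The double sum above can be checked with a few lines of Python. This is a minimal sketch using the conditional probability tables from the slide; the dictionary layout and function name are assumptions made for illustration.

```python
from itertools import product

BOOL = (False, True)

# Conditional probability tables from the slide.
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P_G[s][g]
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P_R[s][r]
P_A_TRUE = {  # P(A=T | G, R), keyed by (g, r)
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def p_admit_given_smart(s: bool) -> float:
    """P(A=T | S=s): sum over the unobserved Grades and GREs."""
    return sum(
        P_G[s][g] * P_R[s][r] * P_A_TRUE[(g, r)]
        for g, r in product(BOOL, repeat=2)
    )


if __name__ == "__main__":
    print(round(p_admit_given_smart(True), 3))  # 0.292
```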

Inference with Bayes Nets

[Network nodes: S: Smart; G: Grades; R: GREs; A: Admit]

$$P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A{=}T \mid G, R)$$

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on the previous slide)
P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)

Or, using Bayes' Rule: P(S=T|A=T) = P(S=T) P(A=T|S=T) / P(A=T)

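Inference with Bayes Nets

[Network nodes: S: Smart; G: Grades; R: GREs; A: Admit]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F):

$$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S} \sum_{G} P(S)\, P(G \mid S)\, P(R{=}F \mid S)\, P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S} \sum_{G} \sum_{R} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A{=}F \mid G, R) \quad \text{(as before)}$$

Tedious but straightforward to compute:

$$\frac{P(G{=}F \mid A{=}F)}{P(R{=}F \mid A{=}F)} = \frac{0.92}{0.56} = 1.6$$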
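Because the network has only four binary variables, any conditional probability can be obtained by brute-force enumeration of the 16 joint states. The sketch below does this with the tables from the worked example; the helper names and the dictionary-based query interface are assumptions. Note that direct enumeration over the tables as printed gives P(A=T|S=F) ≈ 0.164 and P(A=T) ≈ 0.228, so the posteriors come out slightly different from the rounded values quoted on the slides.

```python
from itertools import product

BOOL = (False, True)

# Tables from the worked example.
P_S = {False: 0.5, True: 0.5}
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P_G[s][g]
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P_R[s][r]
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}  # P(A=T | g, r)


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a = P_A_TRUE[(g, r)] if a else 1.0 - P_A_TRUE[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * p_a


def prob(query, evidence):
    """P(query | evidence); both are dicts over the keys 's', 'g', 'r', 'a'."""
    numer = denom = 0.0
    for s, g, r, a in product(BOOL, repeat=4):
        world = {"s": s, "g": g, "r": r, "a": a}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(s, g, r, a)
            denom += p
            if all(world[k] == v for k, v in query.items()):
                numer += p
    return numer / denom


p_smart_given_admit = prob({"s": True}, {"a": True})
p_bad_gre_given_reject = prob({"r": False}, {"a": False})
p_bad_grades_given_reject = prob({"g": False}, {"a": False})

if __name__ == "__main__":
    print("P(S=T | A=T) =", round(p_smart_given_admit, 3))
    print("P(R=F | A=F) =", round(p_bad_gre_given_reject, 3))
    print("P(G=F | A=F) =", round(p_bad_grades_given_reject, 3))
```

Enumeration scales exponentially with the number of variables, so it only works for toy networks like this one.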

End of worked example

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: the node "P1-P2 REAL" points to the nodes "Detected by X1", …, "Detected by Xn".]

Cause (often “hidden”): P1-P2 REAL
Effects (observed): Detected by X1, …, Detected by Xn

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY.
Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Give appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

[Figures from Jansen et al., Science 302, no. 5644 (2003): 449-53.]

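likelihood ratio =

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

log likelihood ratio =

$$\log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function =

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$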
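A minimal Python sketch of this ranking function follows. The per-assay likelihoods and the candidate protein pairs are hypothetical numbers chosen only for illustration; in practice they would be estimated from gold-standard positives and negatives. Treating a missing observation as contributing nothing to the sum is one simple way to accommodate missing data and is an assumption of this sketch.

```python
import math

# Hypothetical likelihoods: (P(detected | true PPI), P(detected | false PPI)).
LIKELIHOODS = {
    "two_hybrid": (0.20, 0.01),
    "mass_spec": (0.50, 0.05),
    "co_expression": (0.60, 0.30),
}


def log_likelihood_ratio(observations):
    """Sum of per-observation log likelihood ratios (naive Bayes assumption).

    `observations` maps an assay name to True (detected) or False (not
    detected). Assays that are absent from the dict are skipped.
    """
    total = 0.0
    for assay, detected in observations.items():
        p_true, p_false = LIKELIHOODS[assay]
        if not detected:
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        total += math.log(p_true / p_false)
    return total


# Hypothetical candidate pairs with their observed evidence.
pairs = {
    "P1-P2": {"two_hybrid": True, "mass_spec": True, "co_expression": True},
    "P3-P4": {"two_hybrid": False, "mass_spec": True, "co_expression": False},
    "P5-P6": {"mass_spec": False},  # other assays were not run
}

ranked = sorted(pairs, key=lambda name: log_likelihood_ratio(pairs[name]), reverse=True)

if __name__ == "__main__":
    for name in ranked:
        print(name, round(log_likelihood_ratio(pairs[name]), 2))
```

Summing logs rather than multiplying ratios avoids numerical underflow when many observations are combined.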

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid pos)}{P(f \mid neg)} = \frac{1114 / 2150}{81924 / 573734}$$
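The arithmetic behind this likelihood ratio can be sketched as follows. The reading of the slide's digits as "1114 of 2150 gold-standard positives" and "81924 of 573734 gold-standard negatives" for one essentiality category is an assumption, and the variable names are illustrative.

```python
# Counts as read from the slide (assumed interpretation): pairs with a
# given essentiality value among gold-standard positives and negatives.
pos_with_feature, pos_total = 1114, 2150
neg_with_feature, neg_total = 81924, 573734

# Conditional probabilities of seeing the feature in each class.
p_f_given_pos = pos_with_feature / pos_total
p_f_given_neg = neg_with_feature / neg_total

# Likelihood ratio L = P(f | pos) / P(f | neg).
likelihood_ratio = p_f_given_pos / p_f_given_neg

if __name__ == "__main__":
    print(f"P(f|pos) = {p_f_given_pos:.3f}")
    print(f"P(f|neg) = {p_f_given_neg:.3f}")
    print(f"L = {likelihood_ratio:.2f}")
```

A ratio above 1 means the feature value is enriched among true interactions and therefore counts as evidence in favor of an interaction.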


Fully connected → compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small

[Figure from Jansen et al., Science 302, no. 5644 (2003): 449-53. Annotations: the line TP = FP; predictions based on a single data type all have TP/FP < 1.]

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.



[Figure: template complex NA, NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.


Evaluate based on five measures

• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number), COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002)
Ho, Y. et al. Nature 415, 180-183 (2002)

What are the likely false positives?
What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.


Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• overexpression/tagging can influence results
• only long-lived complexes will be detected

Tagging strategies

TAP-tag (endogenous protein levels), tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

Yeast two-hybrid

Biotechniques. 2008 Apr;44(5):655-62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

• Does not require purification – will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Courtesy of BioTechniques. Used with permission.
Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard and Experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

Error Rates

[Venn diagram: Gold standard, Experiment 1, Experiment 2. Overlap with the gold standard: true positives from gold standard.]

Data Integration

[Venn diagram: Gold standard, Experiment 1, Experiment 2, with regions labeled I, II, III, IV, V.]

• I = true positives from gold standard
• II = consensus true positives
• Fraction of consensus present in gold standard = I/II
• The remainder of Experiment 1 is a mix of true and false positives. Define IV = true positives, V = false positives.
• Assume I/II = III/IV

Estimated Error Rates

[Figure: estimated false positives and true positives; regions I, II, III.]

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani, and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I / II = III / IV

Courtesy of BioMed Central Ltd. Used with permission.
Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.
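The proportionality assumption I/II = III/IV can be turned into a rough error-rate estimate. The Python sketch below uses hypothetical region counts (none of these numbers come from the paper), and it assumes that IV and V together make up the interactions unique to one experiment.

```python
def estimate_unique_true_positives(region_i, region_ii, region_iii):
    """Solve I/II = III/IV for IV, the true positives in the unique data."""
    return region_iii * region_ii / region_i


# Hypothetical counts, for illustration only.
region_i = 300      # consensus interactions also found in the gold standard
region_ii = 600     # consensus interactions (assumed all true)
region_iii = 200    # unique interactions also found in the gold standard
unique_total = 2000  # all interactions unique to this experiment (IV + V)

region_iv = estimate_unique_true_positives(region_i, region_ii, region_iii)
region_v = unique_total - region_iv          # estimated false positives
false_positive_fraction = region_v / unique_total

if __name__ == "__main__":
    print("Estimated true positives (IV):", region_iv)
    print("Estimated false positives (V):", region_v)
    print("Estimated false-positive fraction:", false_positive_fraction)
```

The estimate is only as good as the assumption that the gold standard samples true interactions without bias toward either experiment.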

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

[Figure: note log-log plot.]

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

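Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

Bayes Rule

$$\underbrace{P(\text{real\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{real\_PPI})}^{\text{prior}}\; \overbrace{P(\text{Data} \mid \text{real\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$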

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

How do we compute this?

log likelihood ratio =

$$\log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function =

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$


We assume the observations are independent (we’ll see how to handle dependence soon).

Ranking function =

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions.

Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the ranking function (the log likelihood ratio).

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
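The two axes of the ROC curve can be computed directly from scored gold-standard sets. The sketch below sweeps a score cutoff over hypothetical scores for high-confidence positives and negatives; the scores and the function name are illustrative assumptions, not values from Collins et al.

```python
def roc_points(scores_pos, scores_neg):
    """Return (FPR, TPR) points obtained by lowering the score cutoff.

    TPR = TP / (TP + FN) uses the high-confidence positives;
    FPR = FP / (TN + FP) uses the high-confidence negatives.
    """
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    points = [(0.0, 0.0)]  # cutoff above every score: nothing is called true
    for t in thresholds:
        tp = sum(s >= t for s in scores_pos)
        fp = sum(s >= t for s in scores_neg)
        points.append((fp / len(scores_neg), tp / len(scores_pos)))
    return points


# Hypothetical ranking scores (for example, log likelihood ratios).
scores_pos = [3.2, 2.5, 1.1, 0.4]
scores_neg = [1.5, 0.2, -0.7, -1.3]

points = roc_points(scores_pos, scores_neg)

if __name__ == "__main__":
    for fpr, tpr in points:
        print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A good ranking function reaches a high true-positive rate while the false-positive rate is still low, so its curve hugs the upper-left corner.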

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve; 9074 interactions. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1) => 2^N − 1 parameters
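To see how quickly the full table grows, the count can be compared with a network in which one hidden binary cause has several binary observations that are independent given the cause (the structure the lecture uses for "P1-P2 REAL" and its detections). This is a rough sketch; the function names are illustrative, and the count of 1 + 2n for the simpler network is a standard result rather than something stated on the slide.

```python
def full_joint_parameters(n_binary_vars: int) -> int:
    """Free parameters of a complete joint table over N binary variables: 2^N - 1."""
    return 2 ** n_binary_vars - 1

def naive_bayes_parameters(n_observations: int) -> int:
    """One parameter for P(cause) plus two per observation:
    P(obs = T | cause = T) and P(obs = T | cause = F)."""
    return 1 + 2 * n_observations

for n_obs in (4, 10, 20):
    # The full joint covers the hidden cause plus all observations
    print(n_obs, full_joint_parameters(n_obs + 1), naive_bayes_parameters(n_obs))
```

With four observations the full table needs 31 parameters and the factored network 9; the gap widens exponentially as observations are added.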

77

Graphical Structure Expresses our Beliefs

[Figure: network in which the node "P1-P2 REAL", the cause (often "hidden"), is the parent of the observed effects "Detected by X1" … "Detected by Xn".]

Graphical Structure Expresses our Beliefs

[Figure: two alternative networks. In the first, "P1-P2 REAL" is linked to "Detected by X1" … "Detected by Xn". The second also contains "Detected by Xj", "Detected by Xj+1" … together with the nodes "Membrane" and "Highly expressed".]

Naïve Bayes assumes all observations are independent of one another, given whether the interaction is real

But some observations may be coupled

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Figure: network with nodes Smart, Grade inflation, Grades, GREs and Admit.]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Figure: Smart is the parent of Grades.]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints
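The equivalence of the two formulations can be checked numerically: multiplying P(S) by P(G|S) reproduces the joint table, and marginalizing the joint table recovers the conditional one. A minimal sketch using the slide's numbers (the variable names are illustrative):

```python
# Conditional formulation from the slide: P(S) and P(G | S)
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint formulation: P(S, G) = P(S) * P(G | S)
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

# Going back: marginalize over G to get P(S), then divide to get P(G | S)
p_s_back = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}
p_g_given_s_back = {
    s: {g: joint[(s, g)] / p_s_back[s] for g in (False, True)}
    for s in (False, True)
}
```

Both versions have three free numbers: the joint table has four entries that must sum to 1, while the conditional version has one free value for P(S) and one per row of P(G|S).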

Automated Admissions Decisions

[Figure: network with nodes Smart, Grade inflation, Grades, GREs and Admit.]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

"Explaining Away"

[Figure: network with nodes Season, Sprinkler (S), Rain (R), Grass wet and Slippery.]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about whether the sprinkler is on.

"Explaining Away"

[Figure: network in which Coin 1 and Coin 2 are the parents of Score.]

p(C1=H) = p(C1=T) = 0.5   p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
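The coin example is small enough to enumerate. The sketch below lists the four equally likely outcomes and compares beliefs about Coin 1 with and without knowledge of the score; the helper `prob` is an illustrative name, not part of the lecture.

```python
from itertools import product

# The four equally likely outcomes (C1, C2, Score); Score is 1 when the coins match
states = [(c1, c2, int(c1 == c2)) for c1, c2 in product("HT", repeat=2)]

def prob(event, given=lambda s: True):
    """P(event | given) by enumeration; each state has probability 0.25."""
    den = sum(0.25 for s in states if given(s))
    num = sum(0.25 for s in states if given(s) and event(s))
    return num / den

c1_heads = lambda s: s[0] == "H"

p_c1h = prob(c1_heads)
p_c1h_given_c2t = prob(c1_heads, lambda s: s[1] == "T")
p_c1h_given_score1 = prob(c1_heads, lambda s: s[2] == 1)
p_c1h_given_score1_c2t = prob(c1_heads, lambda s: s[2] == 1 and s[1] == "T")
```

Without the score, learning C2 leaves P(C1=H) at 0.5. Once the score is known to be 1, learning that C2 = T forces C1 = T, so the two coins become dependent through their shared child.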

88

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure. How do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

Learning Models from Data

[Figure: network in which Coin 1 and Coin 2 are the parents of Score.]

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …
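When every variable is observed in the training data, the maximum-likelihood parameters of this network are simply observed frequencies. The sketch below assumes a small made-up data set; only its first two records come from the slide, and the variable names are illustrative.

```python
from collections import Counter

# Hypothetical training data D = (C1, C2, Score); the first two records
# are the ones shown on the slide, the rest are invented for illustration
data = [
    ("H", "T", 0), ("H", "H", 1), ("T", "T", 1),
    ("H", "H", 1), ("T", "H", 0), ("H", "T", 0),
]

n = len(data)

# Maximum-likelihood estimates for the root nodes are observed frequencies
p_c1_h = sum(1 for c1, _, _ in data if c1 == "H") / n
p_c2_h = sum(1 for _, c2, _ in data if c2 == "H") / n

# For the child node, estimate P(Score = 1 | C1, C2) per parent configuration
pair_counts = Counter((c1, c2) for c1, c2, _ in data)
score1_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
p_score1 = {pair: score1_counts[pair] / pair_counts[pair] for pair in pair_counts}
```

Parent configurations that never occur in the data get no estimate here, which is one reason a maximum-posterior objective with a prior can be preferable for small data sets.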

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Figure: network with nodes S (Smart), G (Grades), R (GREs) and A (Admit).]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

= P(S) P(G|S) P(R|S) P(A|G,R) (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

Prediction with Bayes Nets

[Figure: network with nodes S (Smart), G (Grades), R (GREs) and A (Admit).]

S:
P(F)   P(T)
0.5    0.5

G:
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R:
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
(terms in the order G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn't help very much

GREs are better correlated with intelligence than the grades
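The double sum on this slide can be written as a short loop over the values of G and R. The sketch below copies the conditional probability tables as transcribed from the slide; the dictionary and function names are illustrative.

```python
from itertools import product

# P_G[s][g] = P(G = g | S = s), P_R[s][r] = P(R = r | S = s)
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
# P_A_TRUE[(g, r)] = P(A = T | G = g, R = r)
P_A_TRUE = {
    (False, False): 0.1, (False, True): 0.2,
    (True, False): 0.5, (True, True): 0.8,
}

def p_admit_given_smart(s: bool) -> float:
    """P(A = T | S = s): sum over the unobserved G and R."""
    return sum(
        P_G[s][g] * P_R[s][r] * P_A_TRUE[(g, r)]
        for g, r in product((False, True), repeat=2)
    )

p_if_smart = p_admit_given_smart(True)
p_if_not_smart = p_admit_given_smart(False)
```

For S = T the four terms match the slide and sum to 0.292. With these same tables the S = F case comes out at about 0.16.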

97

Inference with Bayes Nets

[Figure: network with nodes S (Smart), G (Grades), R (GREs) and A (Admit).]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on previous slide)
P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)

Or, using Bayes Rule: = P(S=T) P(A=T|S=T) / P(A=T)

Inference with Bayes Nets

[Figure: network with nodes S (Smart), G (Grades), R (GREs) and A (Admit).]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F) = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R) (as before)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6
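The "tedious but straightforward" sums can be delegated to a brute-force enumeration over all 16 assignments of (S, G, R, A). The sketch below uses the tables as transcribed from the worked example; `joint` and `prob` are illustrative helper names. With these tables the enumeration gives P(A=T) ≈ 0.23, P(G=F|A=F) ≈ 0.94 and P(R=F|A=F) ≈ 0.55, close to but not identical to the figures quoted on the slides, and the qualitative conclusion is the same.

```python
from itertools import product

B = (False, True)
P_S = {False: 0.5, True: 0.5}
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P(G | S)
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P(R | S)
P_A_TRUE = {  # P(A = T | G, R)
    (False, False): 0.1, (False, True): 0.2,
    (True, False): 0.5, (True, True): 0.8,
}

def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a = P_A_TRUE[(g, r)] if a else 1.0 - P_A_TRUE[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * p_a

def prob(event, given=lambda s, g, r, a: True):
    """P(event | given), summing the joint over all 16 assignments."""
    worlds = list(product(B, repeat=4))
    den = sum(joint(*w) for w in worlds if given(*w))
    num = sum(joint(*w) for w in worlds if given(*w) and event(*w))
    return num / den

admitted = lambda s, g, r, a: a
rejected = lambda s, g, r, a: not a

p_admit = prob(admitted)
p_smart_given_admit = prob(lambda s, g, r, a: s, admitted)
p_bad_grades_given_reject = prob(lambda s, g, r, a: not g, rejected)
p_bad_gres_given_reject = prob(lambda s, g, r, a: not r, rejected)
```

Enumeration scales as 2^N, so it only works for toy networks like this one.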

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Figure: network in which the node "P1-P2 REAL", the cause (often "hidden"), is the parent of the observed effects "Detected by X1" … "Detected by Xn".]

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

More likely to interact

Courtesy of Cokus et al. License: CC-BY.
Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data


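likelihood ratio = $\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = $\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$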

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

Likelihood = L = $\frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$, e.g. (1114/2150) / (81924/573734) ≈ 3.6
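The likelihood L for one feature value is a ratio of two frequencies, each counted in the gold-standard sets. In the sketch below the counts are read from the number fragments on this slide, assuming the digits split as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives; that split is an assumption, and the function name is illustrative.

```python
def likelihood_ratio(pos_with_f: int, pos_total: int,
                     neg_with_f: int, neg_total: int) -> float:
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    return (pos_with_f / pos_total) / (neg_with_f / neg_total)

# Counts as read from the slide fragments (assumed digit split, see above)
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 1))
```

A value above 1 means the feature value is seen more often among gold-standard positives than among negatives, so observing it raises the odds that a pair interacts.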

111

Fully connected → Compute probabilities for all 16 possible combinations

Interpret with caution as numbers are small

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60


[Figure: template complex of NA and NB, with the homology models superimposed.]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)

2. Predict interacting residues for the homology models

Evaluate based on five measures:
• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number), COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

"The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces)."

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

We will examine Bayesian classifiers soon

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002). Ho, Y. et al. Nature 415, 180-183 (2002).

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• overexpression/tagging can influence results
• only long-lived complexes will be detected

Tagging strategies

TAP-tag (endogenous protein levels)
Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

Yeast two-hybrid

Biotechniques. 2008 Apr;44(5):655-62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

• Does not require purification – will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Courtesy of BioTechniques. Used with permission.
Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Error Rates

• How can we estimate the error rates?

[Figure: Venn diagram of a gold standard and an experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

[Figure: Venn diagram of the gold standard, Experiment 1 and Experiment 2, marking the true positives from the gold standard.]

Data Integration

[Figure: Venn diagram of the gold standard, Experiment 1 and Experiment 2, with regions I to V.]

I = true positives from gold standard
II = consensus true positives
Fraction of consensus present in gold standard = I/II

The part of Experiment 1 outside the consensus is a mix of true and false positives. Region III is its overlap with the gold standard.

Define IV = true positives, V = false positives

Assume I/II = III/IV

Estimated Error Rates

[Figure: estimated false positives and true positives.]

How complete are current yeast and human protein-interaction networks? G Traver Hart, Arun K Ramani and Edward M Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I, II) and the unique data (III, IV): I/II = III/IV
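Rearranging I/II = III/IV gives the number of true positives in the unique data, IV = III × II / I, and the false positives V are whatever remains. A small sketch with made-up counts (none of the numbers below come from the paper, and the function name is illustrative):

```python
def estimate_unique_true_positives(i: float, ii: float, iii: float) -> float:
    """Solve I/II = III/IV for IV, the true positives in the unique data."""
    return iii * ii / i

# Hypothetical counts, for illustration only
region_i = 120      # consensus interactions that are also in the gold standard
region_ii = 300     # all consensus interactions (assumed true positives)
region_iii = 40     # unique interactions that are also in the gold standard
unique_total = 500  # all interactions unique to one experiment

region_iv = estimate_unique_true_positives(region_i, region_ii, region_iii)
region_v = unique_total - region_iv            # estimated false positives
false_positive_fraction = region_v / unique_total
```

The estimate leans entirely on the stated assumption that the gold standard is equally likely to contain a true interaction whether or not both experiments found it.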

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120


Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

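Bayes Rule

$P(\text{true\_PPI} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data})}$

posterior: $P(\text{true\_PPI} \mid \text{Data})$; prior: $P(\text{true\_PPI})$; likelihood: $P(\text{Data} \mid \text{true\_PPI})$

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities = $\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

How do we compute this?

log likelihood ratio = $\log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = $\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$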

We assume the observations are independent (we'll see how to handle dependence soon)

Ranking function = $\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$

We can compute these terms if we have a set of high-confidence positive and negative interactions

Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
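Under the independence assumption the ranking function becomes a sum of per-observation log ratios. The sketch below shows the idea for binary observations; the per-assay probabilities, assay names and protein pairs are all hypothetical, and in practice the probabilities would be estimated from high-confidence positive and negative interactions.

```python
import math
from typing import Dict, Tuple

# (P(detected | true_PPI), P(detected | false_PPI)) per assay: hypothetical values
LIKELIHOODS: Dict[str, Tuple[float, float]] = {
    "two_hybrid": (0.30, 0.01),
    "mass_spec": (0.50, 0.05),
    "co_expression": (0.60, 0.30),
}

def log_likelihood_ratio(observations: Dict[str, bool]) -> float:
    """Sum of log P(obs_i | true_PPI) / P(obs_i | false_PPI) over the assays given."""
    total = 0.0
    for name, detected in observations.items():
        p_true, p_false = LIKELIHOODS[name]
        if not detected:
            # Not being detected is also evidence
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        total += math.log(p_true / p_false)
    return total

# Hypothetical protein pairs and their assay results
pairs = {
    "P1-P2": {"two_hybrid": True, "mass_spec": True, "co_expression": True},
    "P3-P4": {"two_hybrid": False, "mass_spec": True, "co_expression": False},
    "P5-P6": {"two_hybrid": False, "mass_spec": False, "co_expression": False},
}
ranking = sorted(pairs, key=lambda p: log_likelihood_ratio(pairs[p]), reverse=True)
```

A pair detected by a single reliable assay can still outrank one detected by nothing, which is the advantage over requiring detection in every assay.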

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by $\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

TPROC curve

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD) True negatives = proteins pairs that 1 are annotated to belong to distinct complexes 2 have different sub-cellular locations OR anticorrelated mRNA expression

68 67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

Literature curated (excluding 2-hybrid)

Error rate equivalent to

literature curated

68


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities:

$P(\mathrm{PPI} \mid \mathrm{Y2H}_{\mathrm{Uetz}}, \mathrm{Y2H}_{\mathrm{Ito}}, \mathrm{IP}_{\mathrm{Gavin}}, \mathrm{IP}_{\mathrm{Ho}})$

75

Bayesian Networks

You could try to write out all the joint probabilities:

$P(\mathrm{PPI} \mid \mathrm{Rosetta}, \mathrm{CC}, \mathrm{GO}, \mathrm{MIPS}, \mathrm{ESS})$

But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Yang, Xia, Joshua L. Deignan, et al. “Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks.” Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1)
  ⇒ $2^N - 1$ parameters
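To make the counting concrete, the sketch below compares the number of free parameters in a full joint table with the number needed when one hidden binary cause has several binary effects (the naive Bayes structure used later in the lecture). The function names are illustrative and do not come from the slides.

```python
def full_joint_params(n: int) -> int:
    """Free parameters in a full joint table over n binary variables."""
    # 2**n states, minus one for the sum-to-one constraint.
    return 2 ** n - 1


def naive_bayes_params(n_children: int) -> int:
    """Free parameters when one binary cause has n_children binary effects.

    One parameter for the prior on the cause, plus two per effect:
    P(effect=T | cause=F) and P(effect=T | cause=T).
    """
    return 1 + 2 * n_children


for n_children in (4, 10, 20):
    n_total = n_children + 1  # effects plus the hidden cause
    print(n_total, full_joint_params(n_total), naive_bayes_params(n_children))
```

The full table grows exponentially with the number of variables, while the factored network grows linearly, which is the practical motivation for the graphical structure.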

77

Graphical Structure Expresses our Beliefs

[Network: P1-P2 REAL → Detected by X1, …, Detected by Xn]

Cause (often “hidden”): P1-P2 REAL

Effects (observed): Detected by X1, …, Detected by Xn

78


Graphical Structure Expresses our Beliefs

[Network 1: P1-P2 REAL → Detected by X1, …, Detected by Xn]

[Network 2: P1-P2 REAL → Detected by X1, …, Detected by Xn, Detected by Xj, Detected by Xj+1, …; additional nodes: Membrane, Highly expressed]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

80

Graphical Structure Expresses our Beliefs

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability (Smart → Grades)

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

The two formulations are equivalent, and both require the same number of constraints.

85
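As a quick check of that equivalence, the joint table can be rebuilt from the prior on S and the conditional table for G. This is a minimal sketch using the numbers on the slide; the variable names are ours.

```python
from itertools import product

# Prior on Smart and conditional table for Grades, as given on the slide.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},  # S=F row
    True: {False: 0.2, True: 0.8},     # S=T row
}

# P(S, G) = P(S) * P(G | S)
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s, g in product((False, True), repeat=2)
}

for (s, g), p in joint.items():
    print(f"S={s!s:5} G={g!s:5} P(S,G)={p:.3f}")
```

Both forms need three free numbers: the joint table has four entries with one sum-to-one constraint, and the conditional form has one prior plus one entry per row of the conditional table.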

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”.

For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

“Explaining Away”

[Network nodes: Season, S: Sprinkler, R: Rain, Grass wet, Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

“Explaining Away”

[Network: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5    p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
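The coin example is small enough to enumerate. The sketch below lists the four equally likely worlds and compares conditional probabilities with and without knowledge of the score; the helper `prob` is our own illustrative function.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)

# Each world is (C1, C2, Score, probability); Score is 1 when the coins match.
worlds = [
    (c1, c2, 1 if c1 == c2 else 0, half * half)
    for c1, c2 in product("HT", repeat=2)
]


def prob(event, given=lambda w: True):
    """P(event | given), computed by summing over the enumerated worlds."""
    numerator = sum(w[3] for w in worlds if given(w) and event(w))
    denominator = sum(w[3] for w in worlds if given(w))
    return numerator / denominator


def c1_heads(w):
    return w[0] == "H"


p_marginal = prob(c1_heads)
p_given_c2 = prob(c1_heads, lambda w: w[1] == "T")
p_given_score = prob(c1_heads, lambda w: w[2] == 1)
p_given_score_and_c2 = prob(c1_heads, lambda w: w[2] == 1 and w[1] == "T")

print(p_marginal, p_given_c2, p_given_score, p_given_score_and_c2)
```

Without the score, learning C2 leaves the belief about C1 unchanged. Once the score is known, learning C2 determines C1 completely, which is the coupling the slide describes.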

88

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

91

Learning Models from Data

[Network: Coin 1 → Score ← Coin 2]

p(C1=H) =     p(C1=T) =
p(C2=H) =     p(C2=T) =

C1   C2   Score
H    H
T    T
H    T
T    H

D = {(C1, C2, S)} = (H,T,0), (H,H,1), …
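For fully observed discrete data like D, the maximum-likelihood parameters are simply observed frequencies. The sketch below fills in the blank tables by counting; the five data rows are hypothetical (the slide shows only the first two), and the function names are ours.

```python
from collections import Counter

# Hypothetical training data: (C1, C2, Score) triples.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]


def mle_coin(rows, index):
    """Maximum-likelihood estimate of P(coin = H) and P(coin = T)."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {side: counts[side] / total for side in ("H", "T")}


def mle_score_table(rows):
    """Maximum-likelihood estimate of P(Score = 1 | C1, C2) for observed pairs."""
    totals = Counter((c1, c2) for c1, c2, _ in rows)
    ones = Counter((c1, c2) for c1, c2, score in rows if score == 1)
    return {pair: ones[pair] / n for pair, n in totals.items()}


p_c1 = mle_coin(data, 0)
p_c2 = mle_coin(data, 1)
score_table = mle_score_table(data)
print(p_c1, p_c2, score_table)
```

Counting works here because every variable is observed in every row; with hidden variables or missing values, iterative methods such as EM would be needed instead.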

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Network nodes: S: Smart, G: Grades, R: GREs, A: Admit]

$P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R)$

$= P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R)$ (why?)

(because of the conditional independence assumption)

Recall: $P(X,Y) = P(X \mid Y)\,P(Y)$

96

Prediction with Bayes Nets

[Network nodes: S: Smart, G: Grades, R: GREs, A: Admit]

S:
P(F)   P(T)
0.5    0.5

G:
S    P(F)   P(T)
F    0.9    0.1
T    0.8    0.2

R:
S    P(F)   P(T)
F    0.8    0.2
T    0.2    0.8

A:
G    R    P(F)   P(T)
F    F    0.9    0.1
F    T    0.8    0.2
T    F    0.5    0.5
T    T    0.2    0.8

$P(A=T \mid S=T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S=T)\,P(R \mid S=T)\,P(A=T \mid G,R)$

$= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29$

(terms in order: G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn’t help very much.

GREs are better correlated with intelligence than the grades.
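The double sum can be checked with a few lines of code. This is a minimal sketch that transcribes the conditional tables above into dictionaries (names are ours) and sums over G and R.

```python
# Conditional tables from the slide, indexed as table[parent][child].
p_g = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P(G | S)
p_r = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P(R | S)
# P(A=T | G, R), keyed by (G, R).
p_a_true = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def p_admit_given_smart(s: bool) -> float:
    """P(A=T | S=s) = sum over G, R of P(G|S) P(R|S) P(A=T|G,R)."""
    return sum(
        p_g[s][g] * p_r[s][r] * p_a_true[(g, r)]
        for g in (False, True)
        for r in (False, True)
    )


print(round(p_admit_given_smart(True), 3))
print(round(p_admit_given_smart(False), 3))
```

With these tables the sum for S=T evaluates to 0.292, matching the rounded 0.29 above.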

97

Inference with Bayes Nets

[Network nodes: S: Smart, G: Grades, R: GREs, A: Admit]

$P(A=T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=T \mid G,R)$

$= P(S=T)\,P(A=T \mid S=T) + P(S=F)\,P(A=T \mid S=F) = 0.21$

where
P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on previous slide)
P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T))

$P(S=T \mid A=T) = \frac{P(S=T, A=T)}{P(A=T)}$

Or, using Bayes Rule: $P(S=T \mid A=T) = \frac{P(S=T)\,P(A=T \mid S=T)}{P(A=T)}$

98

Inference with Bayes Nets

[Network nodes: S: Smart, G: Grades, R: GREs, A: Admit]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

$P(R=F \mid A=F) = \frac{P(R=F, A=F)}{P(A=F)} = \frac{\sum_{S} \sum_{G} P(S)\,P(G \mid S)\,P(R=F \mid S)\,P(A=F \mid G, R=F)}{P(A=F)}$

$P(A=F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=F \mid G,R)$ (as before)

Tedious but straightforward to compute:

$\frac{P(G=F \mid A=F)}{P(R=F \mid A=F)} = \frac{0.92}{0.56} = 1.6$
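The "tedious" sums are easy to automate by enumerating all 16 states of the four-node network. The sketch below is one way to do it; the `query` helper and variable names are our own. With the tables as transcribed earlier, exact enumeration gives P(A=T) ≈ 0.228, P(G=F|A=F) ≈ 0.94 and P(R=F|A=F) ≈ 0.55, so small differences from the rounded figures quoted above are to be expected.

```python
from itertools import product

BOOL = (False, True)
p_s = {False: 0.5, True: 0.5}
p_g = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P(G | S)
p_r = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P(R | S)
p_a_true = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}  # P(A=T | G, R)


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    pa = p_a_true[(g, r)]
    return p_s[s] * p_g[s][g] * p_r[s][r] * (pa if a else 1.0 - pa)


def query(target, evidence):
    """P(target | evidence); both are dicts such as {"A": False}."""
    numerator = denominator = 0.0
    for s, g, r, a in product(BOOL, repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(s, g, r, a)
            denominator += p
            if all(world[k] == v for k, v in target.items()):
                numerator += p
    return numerator / denominator


p_bad_gre = query({"R": False}, {"A": False})
p_bad_grades = query({"G": False}, {"A": False})
p_smart_if_admitted = query({"S": True}, {"A": True})
print(round(p_bad_gre, 3), round(p_bad_grades, 3), round(p_smart_if_admitted, 3))
```

Brute-force enumeration is only practical for small networks, but it makes the structure of the calculation explicit: every query is a ratio of two sums over the joint distribution.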

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Network: P1-P2 REAL → Detected by X1, …, Detected by Xn]

Cause (often “hidden”): P1-P2 REAL

Effects (observed): Detected by X1, …, Detected by Xn

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. “Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations.” Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled “More likely to interact”]

Courtesy of Cokus et al. License: CC-BY.
Source: Cokus, Shawn, Sayaka Mizutani, et al. “An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles.” BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

108

109

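likelihood ratio = $\frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\,P(\mathrm{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log \frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function = $\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$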
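Under the independence assumption, the ranking function becomes a sum of per-observation log ratios. The sketch below illustrates this for two binary assays; the detection probabilities are made-up placeholders, not values from the lecture or from any paper, and the function name is ours.

```python
import math

# Hypothetical P(assay detects pair | class). In practice these are estimated
# from high-confidence positive and negative training sets.
likelihoods = {
    "two_hybrid": {"true": 0.30, "false": 0.01},
    "mass_spec": {"true": 0.40, "false": 0.05},
}


def log_likelihood_ratio(observed):
    """Sum over observations of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)].

    observed maps an assay name to True (detected) or False (not detected).
    """
    total = 0.0
    for assay, detected in observed.items():
        p_true = likelihoods[assay]["true"]
        p_false = likelihoods[assay]["false"]
        if not detected:
            # Probability of a non-detection under each class.
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        total += math.log(p_true / p_false)
    return total


both = log_likelihood_ratio({"two_hybrid": True, "mass_spec": True})
neither = log_likelihood_ratio({"two_hybrid": False, "mass_spec": False})
print(round(both, 3), round(neither, 3))
```

Because the prior term is identical for every candidate pair, sorting pairs by this sum gives the same order as sorting by the full posterior odds.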

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

Likelihood $= L = \frac{P(f \mid \mathrm{pos})}{P(f \mid \mathrm{neg})} = \frac{1114 / 2150}{81924 / 573734}$
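The arithmetic for this likelihood can be sketched as follows, reading the four numbers on the slide as counts: 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives fall in the feature bin. That reading of the garbled figures, and the function name, are our assumptions.

```python
def likelihood_ratio(pos_in_bin, pos_total, neg_in_bin, neg_total):
    """L = P(f | pos) / P(f | neg), with each probability estimated by counting."""
    p_f_given_pos = pos_in_bin / pos_total
    p_f_given_neg = neg_in_bin / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (assumed interpretation).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))
```

A value of L above 1 means the feature value is enriched among positives, so observing it raises the odds that the pair truly interacts.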

111

112


113

Fully connected → compute probabilities for all 16 possible combinations

114

Interpret with caution as numbers are small


115

TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Page 29: Lecture 14: Predicting Protein Interactions

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. “Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale.” Nature 490, no. 7421 (2012): 556-60.

29



[Figure: template complex and homology models; panels labeled NA and NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

32


Evaluate based on five measures:
• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number), COV (fraction): interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

35

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

37

We will examine Bayesian classifiers soon.

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

“Proteomics: Protein complexes take the bait.” Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C., et al. Nature 415, 141-147 (2002); Ho, Y., et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Kumar, Anuj, and Michael Snyder. “Proteomics: Protein Complexes take the Bait.” Nature 415, no. 6868 (2002): 123-4.

40


Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations:
• overexpression/tagging can influence results
• only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels), tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Gavin, Anne-Claude, Markus Bösche, et al. “Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes.” Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Ratushny V., Golemis E. Biotechniques 2008 Apr; 44(5): 655-62.

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques. Used with permission.
Source: Ratushny, Vladimir, and Erica A. Golemis. “Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System.” Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification, so it will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard and Experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

47

Error Rates

[Venn diagram: Gold standard, Experiment 1, Experiment 2. Labeled region: true positives from gold standard.]

48

Data Integration

[Venn diagram: Gold standard, Experiment 1, Experiment 2, with regions I to V]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

The part of Experiment 1 outside the consensus is a mix of true and false positives; III marks its overlap with the gold standard.

Define IV = true positives, V = false positives

Assume I/II = III/IV

52

Estimated Error Rates

[Figure legend: False positives, True positives; regions I, II, III]

“How complete are current yeast and human protein-interaction networks?” G. Traver Hart, Arun K. Ramani, and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$\frac{I}{II} = \frac{III}{IV}$

Courtesy of BioMed Central Ltd. Used with permission.
Source: Hart, G. Traver, Arun K. Ramani, et al. “How Complete are Current Yeast and Human Protein-interaction Networks?” Genome Biology 7, no. 11 (2006): 120.
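Solving I/II = III/IV for IV gives an estimate of the number of true positives among the interactions unique to one experiment. The sketch below works through that arithmetic with hypothetical counts; the function names, the example numbers, and the assumption that the unique set is exactly IV + V are ours.

```python
def estimate_unique_true_positives(region_i, region_ii, region_iii):
    """Solve I/II = III/IV for IV, the true positives in the unique data."""
    return region_iii * region_ii / region_i


def estimate_false_positive_fraction(region_i, region_ii, region_iii, unique_total):
    """Fraction of the unique data estimated to be false positives (V / (IV + V)).

    Assumes the unique data set is exactly IV (true) plus V (false).
    """
    region_iv = estimate_unique_true_positives(region_i, region_ii, region_iii)
    region_v = unique_total - region_iv
    return region_v / unique_total


# Hypothetical counts, for illustration only.
iv = estimate_unique_true_positives(region_i=100, region_ii=400, region_iii=50)
fp_fraction = estimate_false_positive_fraction(100, 400, 50, unique_total=1000)
print(iv, fp_fraction)
```

The estimate relies on the stated assumption that the gold standard samples true interactions equally well from the common and unique sets.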

53


54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

“Comparative assessment of large-scale data sets of protein–protein interactions.” von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Von Mering, Christian, Roland Krause, et al. “Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions.” Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\mathrm{real\_PPI} \mid \mathrm{Data})$?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

Bayes Rule

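$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

where $P(A \mid B)$ is the posterior, $P(A)$ is the prior, and $P(B \mid A)$ is the likelihood.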

58

Naiumlve Bayes Classification

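$P(\mathrm{true\_PPI} \mid \mathrm{Data}) = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data})}$ (posterior, likelihood, prior)

likelihood ratio = ratio of posterior probabilities = $\frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})}$

if > 1 classify as true; if < 1 classify as false

How do we compute this?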

59


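likelihood ratio = $\frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\,P(\mathrm{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log \frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function = $\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$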

61


We assume the observations are independent (we’ll see how to handle dependence soon).

We can compute these terms if we have a set of high-confidence positive and negative interactions.

Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol. Cell. Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function = $\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$

64

“Comparative assessment of large-scale data sets of protein–protein interactions.” von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by $\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$.

65

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

TP(T

P+F

N)

ROC curve

True positives = True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives FP(TN+FP)

e de

nc )es

onfi ex

pl-c omgh c

ng h

i omfri

us (es

pute

vtiipo

sC

om

ROC curve

)FN+

(TP

TPROC curve

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD) True negatives = proteins pairs that 1 are annotated to belong to distinct complexes 2 have different sub-cellular locations OR anticorrelated mRNA expression

68 67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

Literature curated (excluding 2-hybrid)

Error rate equivalent to

literature curated

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

Bayesian Networks

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big.

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1)
  ⇒ $2^N - 1$ parameters
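To see how quickly the full table grows, the count above can be compared with a much sparser model. The sketch below assumes, purely for illustration, a star-shaped network with one binary hidden cause and n binary observed effects; the function names are hypothetical.

```python
def full_joint_parameters(n_vars):
    """Free parameters in a complete joint table over n binary variables."""
    return 2 ** n_vars - 1

def star_network_parameters(n_observations):
    """Free parameters when one binary cause has n binary children.

    1 for the prior on the cause, plus 2 per observation:
    P(obs=1 | cause=0) and P(obs=1 | cause=1).
    """
    return 1 + 2 * n_observations

for n_obs in (4, 10, 20):
    n_vars = n_obs + 1  # the observations plus the hidden cause
    print(n_obs, full_joint_parameters(n_vars), star_network_parameters(n_obs))
```

With 20 observations the full table needs about two million numbers, while the star-shaped network needs 41.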

77

Graphical Structure Expresses our Beliefs

[Diagram: node “P1-P2 REAL” connected to nodes “Detected by X1”, …, “Detected by Xn”]

Cause (often “hidden”): P1-P2 REAL

Effects (observed): Detected by X1, …, Detected by Xn

78

Graphical Structure Expresses our Beliefs

[Diagram, first network: P1-P2 REAL with Detected by X1, …, Detected by Xn. Second network: P1-P2 REAL with Detected by X1, …, Detected by Xn, Detected by Xj, Detected by Xj+1, …, plus the extra nodes Membrane and Highly expressed.]

79

Graphical Structure Expresses our Beliefs

[Diagram: the same two networks, without and with the extra nodes Membrane and Highly expressed.]

Naïve Bayes assumes all observations are independent (given whether the interaction is real).

But some observations may be coupled.

80

Graphical Structure Expresses our Beliefs

[Diagram: the same two networks, without and with the extra nodes Membrane and Highly expressed.]

• The graphical structure can be decided in advance, based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram nodes: Smart, Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints.

85
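The equivalence of the two tables can be checked directly: multiplying P(S) by P(G|S) reproduces every entry of the joint table, and marginalizing the joint table gives P(S) and P(G|S) back. The variable names in this short sketch are illustrative only.

```python
p_s = {"F": 0.7, "T": 0.3}                      # P(S)
p_g_given_s = {"F": {"F": 0.95, "T": 0.05},     # P(G | S=F)
               "T": {"F": 0.2, "T": 0.8}}       # P(G | S=T)

# Joint table: P(S, G) = P(S) * P(G | S)
joint = {(s, g): p_s[s] * p_g_given_s[s][g] for s in p_s for g in ("F", "T")}

# Going back: P(S) by marginalizing over G, then P(G | S) by dividing.
p_s_back = {s: joint[(s, "F")] + joint[(s, "T")] for s in p_s}
p_g_back = {s: {g: joint[(s, g)] / p_s_back[s] for g in ("F", "T")} for s in p_s}

for key in sorted(joint):
    print(key, round(joint[key], 3))
```

Both forms have three free parameters: the joint table has four entries that must sum to 1, and the conditional form has one prior plus one conditional per value of S.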

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

“Explaining Away”

[Diagram nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, once we observe that the grass is wet, the knowledge that it is raining influences our belief about the sprinkler.

87

“Explaining Away”

[Diagram nodes: Coin 1, Coin 2, Score]

p(C1=H) = p(C1=T) = 0.5   p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
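The coin example is small enough to check by enumerating all four equally likely outcomes. The helper name `prob_c1_heads` in this minimal sketch is hypothetical.

```python
from itertools import product

# All four (C1, C2) outcomes are equally likely; Score = 1 when the coins match.
worlds = [(c1, c2, 1 if c1 == c2 else 0) for c1, c2 in product("HT", repeat=2)]

def prob_c1_heads(condition):
    """P(C1=H | condition), by counting equally likely worlds."""
    kept = [w for w in worlds if condition(w)]
    return sum(1 for w in kept if w[0] == "H") / len(kept)

p_prior = prob_c1_heads(lambda w: True)               # no evidence
p_given_c2 = prob_c1_heads(lambda w: w[1] == "T")     # C2 alone tells us nothing
p_given_score = prob_c1_heads(lambda w: w[2] == 1)    # Score alone tells us nothing
p_given_both = prob_c1_heads(lambda w: w[2] == 1 and w[1] == "T")  # together they do
print(p_prior, p_given_c2, p_given_score, p_given_both)
```

The two coins are independent until the score is observed; once it is, learning C2 fixes C1.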

88

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data X_i
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Diagram nodes: Coin 1, Coin 2, Score]

p(C1=H) =     p(C1=T) =
p(C2=H) =     p(C2=T) =

C1   C2   Score
H    H
T    T
H    T
T    H

D = (C1, C2, S) = (H,T,0), (H,H,1), …
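When the structure is known and every variable is observed, the maximum-likelihood parameters are simply observed frequencies. The sketch below assumes fully observed data in the (C1, C2, S) format shown above; the function name `fit_coin_network` and the sample data are hypothetical.

```python
from collections import Counter

def fit_coin_network(data):
    """Maximum-likelihood estimates for a network with Coin 1 and Coin 2 as parents of Score.

    data: list of (c1, c2, score) tuples with c1, c2 in {"H", "T"}.
    Returns P(C1=H), P(C2=H), and P(Score=1 | C1, C2) for each observed (C1, C2).
    """
    n = len(data)
    p_c1_h = sum(1 for c1, _, _ in data if c1 == "H") / n
    p_c2_h = sum(1 for _, c2, _ in data if c2 == "H") / n
    pair_counts = Counter((c1, c2) for c1, c2, _ in data)
    score_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
    p_score = {pair: score_counts[pair] / count for pair, count in pair_counts.items()}
    return p_c1_h, p_c2_h, p_score

# Hypothetical observations, for illustration only.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]
p_c1_h, p_c2_h, p_score = fit_coin_network(D)
```

With hidden variables or missing values, counting no longer suffices, which is where search methods such as EM or Gibbs sampling come in.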

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Diagram nodes: S = Smart, G = Grades, R = GREs, A = Admit]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

= P(S) P(G|S) P(R|S) P(A|G,R)   (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

S:   P(S=F) = 0.5   P(S=T) = 0.5

G given S:
S   P(G=F)   P(G=T)
F   0.9      0.1
T   0.8      0.2

R given S:
S   P(R=F)   P(R=T)
F   0.8      0.2
T   0.2      0.8

A given G and R:
G   R   P(A=F)   P(A=T)
F   F   0.9      0.1
F   T   0.8      0.2
T   F   0.5      0.5
T   T   0.2      0.8

$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\, P(R \mid S{=}T)\, P(A{=}T \mid G, R)$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
(terms in order: G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn’t help very much.

GREs are better correlated with intelligence than the grades.

[Diagram nodes: S = Smart, G = Grades, R = GREs, A = Admit]
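The sum over G and R can be reproduced in a few lines. This is a minimal sketch that stores each table as P(child = T) given the parent values; the helper names `bern` and `p_admit_given_smart` are hypothetical.

```python
# Conditional probability tables from the slide, stored as P(child = T | parents).
P_G_T = {"F": 0.1, "T": 0.2}                      # P(G=T | S)
P_R_T = {"F": 0.2, "T": 0.8}                      # P(R=T | S)
P_A_T = {("F", "F"): 0.1, ("F", "T"): 0.2,
         ("T", "F"): 0.5, ("T", "T"): 0.8}        # P(A=T | G, R)

def bern(p_true, value):
    """Probability of `value` ("T" or "F") given P(value = "T")."""
    return p_true if value == "T" else 1.0 - p_true

def p_admit_given_smart(s):
    """P(A=T | S=s), summing out G and R."""
    total = 0.0
    for g in "FT":
        for r in "FT":
            total += bern(P_G_T[s], g) * bern(P_R_T[s], r) * P_A_T[(g, r)]
    return total

print(round(p_admit_given_smart("T"), 3))
```

For S=T this gives 0.292, matching the 0.29 on the slide.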

97

Inference with Bayes Nets

[Diagram nodes: S = Smart, G = Grades, R = GREs, A = Admit]

$P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A{=}T \mid G, R)$

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5   P(S=F) = 0.5
P(A=T|S=T), calculated on previous slide = 0.29
P(A=T|S=F), calculated analogously to P(A=T|S=T) = 0.14

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes rule: = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Diagram nodes: S = Smart, G = Grades, R = GREs, A = Admit]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S} \sum_{G} P(S)\, P(G \mid S)\, P(R{=}F \mid S)\, P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$

$P(A{=}F) = \sum_{S} \sum_{G} \sum_{R} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A{=}F \mid G, R)$   (as before)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6
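The "tedious" sums are easy to automate by enumerating all 16 joint states. The sketch below uses the tables as transcribed here, and the helper names `joint` and `prob` are hypothetical. With these tables the enumeration gives values close to, but not identical with, the figures quoted on the slides (for example P(A=T) comes out near 0.23 rather than 0.21), which may reflect rounding on the slides.

```python
from itertools import product

P_S_T = 0.5                                       # P(S=T)
P_G_T = {"F": 0.1, "T": 0.2}                      # P(G=T | S)
P_R_T = {"F": 0.2, "T": 0.8}                      # P(R=T | S)
P_A_T = {("F", "F"): 0.1, ("F", "T"): 0.2,
         ("T", "F"): 0.5, ("T", "T"): 0.8}        # P(A=T | G, R)

def bern(p_true, value):
    """Probability of `value` ("T" or "F") given P(value = "T")."""
    return p_true if value == "T" else 1.0 - p_true

def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bern(P_S_T, s) * bern(P_G_T[s], g) *
            bern(P_R_T[s], r) * bern(P_A_T[(g, r)], a))

def prob(**fixed):
    """Marginal probability of the assignments in `fixed` (keys: s, g, r, a)."""
    total = 0.0
    for s, g, r, a in product("FT", repeat=4):
        world = {"s": s, "g": g, "r": r, "a": a}
        if all(world[k] == v for k, v in fixed.items()):
            total += joint(s, g, r, a)
    return total

p_bad_gre = prob(r="F", a="F") / prob(a="F")      # P(R=F | A=F)
p_bad_grades = prob(g="F", a="F") / prob(a="F")   # P(G=F | A=F)
p_smart = prob(s="T", a="T") / prob(a="T")        # P(S=T | A=T)
print(round(p_bad_grades, 2), round(p_bad_gre, 2), round(p_smart, 2))
```

Either way the conclusion is the same: a rejected student is more likely to have had bad grades than bad GREs, because bad grades are common regardless of ability in this toy model.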

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: node “P1-P2 REAL” connected to nodes “Detected by X1”, …, “Detected by Xn”]

Cause (often “hidden”): P1-P2 REAL

Effects (observed): Detected by X1, …, Detected by Xn

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al. Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

More likely to interact

Which pattern below is more likely to represent a pair of interacting proteins

104

Courtesy of Cokus et al License CC-BYSource Cokus Shawn Sayaka Mizutani et al An Improved Method for Identifying FunctionallyLinked Proteins using Phylogenetic Profiles BMC Bioinformatics 8 no Suppl 4 (2007) S7

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

109

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

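likelihood ratio = $\frac{P(true\_PPI \mid Data)}{P(false\_PPI \mid Data)} = \frac{P(Data \mid true\_PPI)\, P(true\_PPI)}{P(Data \mid false\_PPI)\, P(false\_PPI)}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log \frac{P(true\_PPI)}{P(false\_PPI)} + \log \frac{P(Data \mid true\_PPI)}{P(Data \mid false\_PPI)}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = $\log \frac{P(Data \mid true\_PPI)}{P(Data \mid false\_PPI)} = \log \prod_{i=1}^{M} \frac{P(Observation_i \mid true\_PPI)}{P(Observation_i \mid false\_PPI)}$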
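Because the log of a product is a sum of logs, the ranking function reduces to adding one log likelihood ratio per observation. The sketch below assumes the per-observation likelihoods have already been estimated from gold-standard data; the function name, the evidence names and every number in the tables are hypothetical.

```python
import math

def log_likelihood_ratio(observations, likelihoods):
    """Sum of per-observation log likelihood ratios (naive Bayes).

    observations: dict mapping evidence name -> observed value.
    likelihoods:  dict mapping evidence name ->
                  {value: (P(value | true_PPI), P(value | false_PPI))}.
    Evidence that is missing from `observations` is simply skipped.
    """
    score = 0.0
    for name, value in observations.items():
        p_true, p_false = likelihoods[name][value]
        score += math.log(p_true / p_false)
    return score

# Hypothetical likelihood tables, for illustration only.
likelihoods = {
    "two_hybrid": {True: (0.30, 0.01), False: (0.70, 0.99)},
    "mass_spec":  {True: (0.40, 0.05), False: (0.60, 0.95)},
}
pairs = {
    "P1-P2": {"two_hybrid": True, "mass_spec": True},
    "P3-P4": {"two_hybrid": False, "mass_spec": True},
    "P5-P6": {"two_hybrid": False},
}
ranking = sorted(pairs, key=lambda p: log_likelihood_ratio(pairs[p], likelihoods),
                 reverse=True)
```

Skipping absent evidence is one simple way to accommodate missing data: an unobserved source contributes a ratio of 1, that is, a log ratio of 0.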

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

1114/2150

81924/573734

Likelihood = $L = \frac{P(f \mid pos)}{P(f \mid neg)}$
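For a categorical feature such as essentiality, each likelihood is just a fraction of the gold-standard set. The sketch below assumes that the run-together digits on the slide split as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives in a single bin; that reading, and the function name, are assumptions of this example.

```python
def likelihood_ratio(pos_in_bin, pos_total, neg_in_bin, neg_total):
    """L = P(f | pos) / P(f | neg) for one value f of a categorical feature."""
    return (pos_in_bin / pos_total) / (neg_in_bin / neg_total)

# Counts as read from the slide (assumed split of the digits, see lead-in).
L_bin = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L_bin, 1))
```

A ratio above 1 means that observing this feature value makes a pair more likely to be a true interaction than it was before the observation.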

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected → compute probabilities for all 16 possible combinations ($2^4$, for four binary data sets)
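In the fully connected case no independence is assumed, so a likelihood ratio is estimated separately for each joint detection pattern. The sketch below counts gold-standard pairs per pattern; the pseudocount, the function name and the toy data are assumptions of this example, not part of the published method.

```python
from collections import Counter
from itertools import product

def fully_connected_likelihoods(positives, negatives, n_datasets=4, pseudocount=1.0):
    """Likelihood ratio for every joint detection pattern of n binary data sets.

    positives / negatives: lists of tuples such as (1, 0, 0, 1), one per
    gold-standard pair, recording which data sets detected that pair.
    The pseudocount keeps ratios finite for patterns that were never seen.
    """
    pos_counts, neg_counts = Counter(positives), Counter(negatives)
    n_patterns = 2 ** n_datasets
    ratios = {}
    for pattern in product((0, 1), repeat=n_datasets):
        p_pos = (pos_counts[pattern] + pseudocount) / (len(positives) + pseudocount * n_patterns)
        p_neg = (neg_counts[pattern] + pseudocount) / (len(negatives) + pseudocount * n_patterns)
        ratios[pattern] = p_pos / p_neg
    return ratios

# Hypothetical gold-standard detections, for illustration only.
pos = [(1, 1, 0, 0)] * 5 + [(0, 0, 1, 1)] * 3 + [(0, 0, 0, 0)] * 2
neg = [(0, 0, 0, 0)] * 90 + [(1, 0, 0, 0)] * 10
ratios = fully_connected_likelihoods(pos, neg)
```

Many of the 16 patterns are backed by very few gold-standard pairs, which is why estimates of this kind need careful interpretation.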

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TF=FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log \frac{P(Data \mid true\_PPI)}{P(Data \mid false\_PPI)}$

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

30

[Figure labels: NA, NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

31

[Figure labels: NA, NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

32

[Figure labels: NA, NB]

Evaluate based on five measures

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

33

[Figure labels: NA, NB]

Evaluate based on five measures
• SIM: structural similarity of NA/MA and NB/MB

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 34

Evaluate based on five measures
• SIM: structural similarity of NA/MA and NB/MB
• SIZ (number), COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

35

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

37

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin A-C et al. Nature 415, 141-147 (2002)
Ho Y et al. Nature 415, 180-183 (2002)

Detecting protein-protein interactions

What are the likely false positives? What are the likely false negatives?

40

Courtesy of Macmillan Publishers LimitedUsed with permissionSource Kumar Anuj and Michael Snyder Proteomics Protein Complexes take the Bait Nature 415 no 6868 (2002) 123-4 40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations

41

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels)

Tandem purification
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al (2002) Nature

Ho et al (2002) Nature over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited Used with permissionSource Gavin Anne-Claude Markus Boumlsche et al Functional Organization of the Yeast Proteome bySystematic Analysis of Protein Complexes Nature 415 no 6868 (2002) 141-7

43

Biotechniques 2008 Apr44(5)655-62 Ratushny V Golemis E

How does this compare to mass-spec based approaches

Yeast two-hybrid

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of Cell Signaling Pathwaysusing the Evolving Yeast Two-hybrid System Biotechniques 44 no 5 (2008) 655

44

• Does not require purification, so it will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Biotechniques 2008 Apr44(5)655-62 Ratushny V Golemis E

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of CellSignaling Pathways using the Evolving Yeast Two-hybrid System Biotechniques44 no 5 (2008) 655

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard and Experiment. Region labels: false negatives; true positives; true positives and false positives.]

47

Error Rates

Gold standard

Experiment 1

Experiment 2

True positives

from gold standard

48

Data Integration

[Venn diagram: Gold standard, Experiment 1, Experiment 2; regions I and II]

I = true positives from gold standard
II = consensus true positives

Fraction of consensus present in gold standard = I/II

49

Data Integration

[Venn diagram: Gold standard, Experiment 1; regions I, II, III; label: mix of true and false positives]

I = true positives from gold standard
II = consensus true positives

Fraction of consensus present in gold standard = I/II

50

Data Integration

[Venn diagram: Gold standard, Experiment 1; regions I, II, III, IV (true), V (false)]

I = true positives from gold standard
II = consensus true positives

Fraction of consensus present in gold standard = I/II

Define IV = true positives, V = false positives

51

Data Integration

[Venn diagram: Gold standard, Experiment 1; regions I, II, III, IV (true), V (false)]

I = true positives from gold standard
II = consensus true positives

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

52

Estimated Error Rates

False positives

True positives

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

I =

II =

III =

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$\frac{I}{II} = \frac{III}{IV}$
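Rearranging the assumption I/II = III/IV gives an estimate of the number of true positives among the interactions seen in only one experiment. The sketch below illustrates that arithmetic; the function names and all counts are hypothetical and are not values from Hart et al.

```python
def estimate_unique_true_positives(i, ii, iii):
    """Estimate region IV from the assumption I / II = III / IV.

    i:   consensus interactions that are also in the gold standard
    ii:  all consensus interactions (assumed to be true positives)
    iii: interactions unique to one experiment that are in the gold standard
    """
    return iii * ii / i

def estimated_false_positive_rate(i, ii, iii, n_unique):
    """Fraction of the experiment-unique interactions estimated to be false (V)."""
    iv = estimate_unique_true_positives(i, ii, iii)
    v = max(n_unique - iv, 0.0)
    return v / n_unique

# Hypothetical counts, for illustration only.
iv = estimate_unique_true_positives(i=100, ii=400, iii=50)
fp_rate = estimated_false_positive_rate(i=100, ii=400, iii=50, n_unique=1000)
```

With these toy counts a quarter of the consensus is in the gold standard, so 50 gold-standard hits among the unique data imply about 200 true interactions, and the remaining 800 of 1000 are estimated to be false positives.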

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Note log-log plot Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scale DataSets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(real\_PPI \mid Data)$?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

Bayes Rule

posterior prior likelihood

58

Naïve Bayes Classification

[Equation labels: posterior, prior, likelihood]

likelihood ratio = ratio of posterior probabilities

if > 1 classify as true; if < 1 classify as false

How do we compute this?

59

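likelihood ratio = $\frac{P(true\_PPI \mid Data)}{P(false\_PPI \mid Data)} = \frac{P(Data \mid true\_PPI)\, P(true\_PPI)}{P(Data \mid false\_PPI)\, P(false\_PPI)}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log \frac{P(true\_PPI)}{P(false\_PPI)} + \log \frac{P(Data \mid true\_PPI)}{P(Data \mid false\_PPI)}$

Prior probability is the same for all interactions, so it does not affect ranking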

60

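likelihood ratio = $\frac{P(true\_PPI \mid Data)}{P(false\_PPI \mid Data)} = \frac{P(Data \mid true\_PPI)\, P(true\_PPI)}{P(Data \mid false\_PPI)\, P(false\_PPI)}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log \frac{P(true\_PPI)}{P(false\_PPI)} + \log \frac{P(Data \mid true\_PPI)}{P(Data \mid false\_PPI)}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = $\log \frac{P(Data \mid true\_PPI)}{P(Data \mid false\_PPI)}$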

61

We assume the observations are independent (we’ll see how to handle dependence soon)

Ranking function = $\log \frac{P(Data \mid true\_PPI)}{P(Data \mid false\_PPI)}$

62

We assume the observations are independent (we’ll see how to handle dependence soon)

Ranking function = $\log \frac{P(Data \mid true\_PPI)}{P(Data \mid false\_PPI)} = \log \prod_{i=1}^{M} \frac{P(Observation_i \mid true\_PPI)}{P(Observation_i \mid false\_PPI)}$

63

We assume the observations are independent (we’ll see how to handle dependence soon).

We can compute these terms if we have a set of high-confidence positive and negative interactions.

Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al. Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function = $\log \frac{P(Data \mid true\_PPI)}{P(Data \mid false\_PPI)} = \log \prod_{i=1}^{M} \frac{P(Observation_i \mid true\_PPI)}{P(Observation_i \mid false\_PPI)}$

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives FP(TN+FP)

e de

nc )es

onfi ex

pl-c omgh c

ng h

i omfri

us (es

pute

vtiipo

sC

om

ROC curve

)FN+

(TP

TPROC curve

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD) True negatives = proteins pairs that 1 are annotated to belong to distinct complexes 2 have different sub-cellular locations OR anticorrelated mRNA expression

68 67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

Literature curated (excluding 2-hybrid)

Error rate equivalent to

literature curated

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

Bayesian Networks

73

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: $P(\text{PPI} \mid \text{Y2H}_{\text{Uetz}}, \text{Y2H}_{\text{Ito}}, \text{IP}_{\text{Gavin}}, \text{IP}_{\text{Ho}})$


75

Bayesian Networks

You could try to write out all the joint probabilities: $P(\text{PPI} \mid \text{Rosetta}, \text{CC}, \text{GO}, \text{MIPS}, \text{ESS})$

But some problems are too big.

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1) => $2^N - 1$ parameters
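To make the growth concrete, here is a minimal sketch that compares the size of a full joint table with the size of a factored model. The function names are illustrative, and the factored count assumes one hidden binary cause whose binary effects are independent given the cause (the structure drawn on the following slides).

```python
def full_joint_params(n_vars: int) -> int:
    """Free parameters in a full joint table over n_vars binary variables."""
    return 2 ** n_vars - 1


def naive_bayes_params(n_effects: int) -> int:
    """Free parameters for one binary cause with n_effects binary effects
    that are independent given the cause: 1 prior + 2 per effect."""
    return 1 + 2 * n_effects


for n_effects in (4, 10, 20):
    n_vars = n_effects + 1  # the effects plus the hidden cause
    print(n_effects, full_joint_params(n_vars), naive_bayes_params(n_effects))
```

The full table grows exponentially with the number of variables, while the factored count grows linearly.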

77

Graphical Structure Expresses our Beliefs

[Diagram: a cause node (often "hidden"), "P1-P2 REAL", with arrows to the effect nodes (observed) "Detected by X1", …, "Detected by Xn".]

78

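Graphical Structure Expresses our Beliefs

[Diagram, first network: "P1-P2 REAL" with arrows to "Detected by X1", …, "Detected by Xn". Second network: "P1-P2 REAL" with detection nodes "Detected by X1", …, "Detected by Xj", "Detected by Xj+1", …, "Detected by Xn", plus additional nodes "Membrane" and "Highly expressed".]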

79

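Graphical Structure Expresses our Beliefs

[Diagram, first network: "P1-P2 REAL" with arrows to "Detected by X1", …, "Detected by Xn". Second network: "P1-P2 REAL" with detection nodes "Detected by X1", …, "Detected by Xj", "Detected by Xj+1", …, "Detected by Xn", plus additional nodes "Membrane" and "Highly expressed".]

Naïve Bayes assumes all observations are independent (given whether the interaction is real).

But some observations may be coupled.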

80

Graphical Structure Expresses our Beliefs

[Diagram, first network: "P1-P2 REAL" with arrows to "Detected by X1", …, "Detected by Xn". Second network: "P1-P2 REAL" with detection nodes "Detected by X1", …, "Detected by Xj", "Detected by Xj+1", …, "Detected by Xn", plus additional nodes "Membrane" and "Highly expressed".]

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Diagram: nodes "Smart", "Grade inflation", "Grades", "GREs", and "Admit"]

Prediction: we observe the "causes" (roots, parents) and want to predict the "effects" (leaves, children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves, children) and want to infer the hidden values of the "causes" (roots, parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram: nodes "Smart", "Grade inflation", "Grades", "GREs", and "Admit"]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram: "Smart" → "Grades"]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

The formulations are equivalent, and both require the same number of independent parameters (three here).

85
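The equivalence of the two formulations can be checked directly. The sketch below uses the numbers from the slide; the variable names are illustrative. It builds the joint table from P(S) and P(G|S), then recovers P(S) from the joint by marginalising.

```python
# Conditional formulation: P(S) and P(G | S), values from the slide.
p_s = {"F": 0.7, "T": 0.3}
p_g_given_s = {"F": {"F": 0.95, "T": 0.05},
               "T": {"F": 0.2, "T": 0.8}}

# Joint formulation: P(S, G) = P(S) * P(G | S).
joint = {(s, g): p_s[s] * p_g_given_s[s][g]
         for s in ("F", "T") for g in ("F", "T")}

for (s, g), p in joint.items():
    print(f"S={s} G={g} P(S,G)={p:.3f}")

# Going back: P(S) is the joint summed over G.
p_s_back = {s: joint[(s, "F")] + joint[(s, "T")] for s in ("F", "T")}
```

The four joint entries sum to 1, so only three are free, the same count as one prior plus one conditional per value of S.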

Automated Admissions Decisions

[Diagram: nodes "Smart", "Grade inflation", "Grades", "GREs", and "Admit"]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

"Explaining Away"

[Diagram: network with nodes "Season", "Sprinkler" (S), "Rain" (R), "Grass wet", and "Slippery"]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

"Explaining Away"

[Diagram: "Coin 1" and "Coin 2" → "Score"]

p(C1=H) = p(C1=T) = 0.5   p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
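A brute-force enumeration of the two-coin example shows the effect numerically. This is a minimal sketch: the helper `prob` and the lambda names are illustrative, and the score is taken to be 1 when the coins match, as in the table.

```python
from itertools import product

# Two fair, independent coins; Score = 1 when they match, 0 otherwise.
states = [(c1, c2, 1 if c1 == c2 else 0, 0.25)
          for c1, c2 in product("HT", repeat=2)]


def prob(event, given=lambda c1, c2, s: True):
    """P(event | given) by enumerating the four equally likely outcomes."""
    num = sum(p for c1, c2, s, p in states if given(c1, c2, s) and event(c1, c2, s))
    den = sum(p for c1, c2, s, p in states if given(c1, c2, s))
    return num / den


c1_heads = lambda c1, c2, s: c1 == "H"

p_marginal = prob(c1_heads)
p_given_c2 = prob(c1_heads, lambda c1, c2, s: c2 == "T")
p_given_score_and_c2 = prob(c1_heads, lambda c1, c2, s: s == 1 and c2 == "T")

print(p_marginal, p_given_c2, p_given_score_and_c2)
```

Without the score, learning C2 tells us nothing about C1. Once the score is observed, C2 fully determines C1.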

88

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

91

Learning Models from Data

[Diagram: "Coin 1" and "Coin 2" → "Score"]

Parameters to learn: p(C1=H), p(C1=T), p(C2=H), p(C2=T), and the Score for each (C1, C2) combination: (H,H), (T,T), (H,T), (T,H).

D = (C1, C2, S) = (H,T,0), (H,H,1), …
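For fully observed discrete data like D, the maximum likelihood estimates reduce to counting. The sketch below assumes a small hypothetical data set: only the first two tuples appear on the slide, and the rest are made up for illustration. The function names are also illustrative.

```python
from collections import Counter

# Hypothetical observations D = (C1, C2, S). Only the first two tuples
# come from the slide; the others are invented for this example.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0),
        ("H", "H", 1), ("T", "T", 1), ("H", "T", 0), ("H", "H", 1)]


def mle_heads(data, index):
    """Maximum likelihood estimate of p(coin = H): the fraction of heads."""
    counts = Counter(row[index] for row in data)
    return counts["H"] / len(data)


def mle_score_table(data):
    """Estimate P(S=1 | C1, C2) as the fraction of matching rows with S=1."""
    totals, ones = Counter(), Counter()
    for c1, c2, s in data:
        totals[(c1, c2)] += 1
        ones[(c1, c2)] += s
    return {key: ones[key] / totals[key] for key in totals}


p_c1 = mle_heads(data, 0)
p_c2 = mle_heads(data, 1)
score_table = mle_score_table(data)
print(p_c1, p_c2, score_table)
```

With hidden variables or missing entries, counting no longer suffices, which is where iterative methods such as EM come in.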

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked "toy" example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

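[Diagram: S (Smart) → G (Grades) and R (GREs); G and R → A (Admit)]

$$P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R)$$

$$= P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R) \quad \text{(why?)}$$

(because of the conditional independence assumptions)

Recall $P(X,Y) = P(X \mid Y)\,P(Y)$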

96

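Prediction with Bayes Nets

[Diagram: S (Smart) → G (Grades) and R (GREs); G and R → A (Admit)]

P(S):
P(F)   P(T)
0.5    0.5

P(G | S):
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

P(R | S):
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

P(A | G, R):
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

Rough grading: being smart doesn't help very much.

GREs are better correlated with intelligence than the grades.

$$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\,P(R \mid S{=}T)\,P(A{=}T \mid G,R)$$

$$= \underbrace{(0.8)(0.2)(0.1)}_{G=F,\,R=F} + \underbrace{(0.8)(0.8)(0.2)}_{G=F,\,R=T} + \underbrace{(0.2)(0.2)(0.5)}_{G=T,\,R=F} + \underbrace{(0.2)(0.8)(0.8)}_{G=T,\,R=T} = 0.29$$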
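The sum over the unobserved grades and GREs can be written as a short loop. This is a minimal sketch using the conditional probability tables as transcribed from the slide; the dictionary and function names are illustrative.

```python
from itertools import product

# Conditional probability tables from the slide (True stands for "T").
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_ADMIT_GIVEN_GR = {(False, False): 0.1, (False, True): 0.2,
                    (True, False): 0.5, (True, True): 0.8}


def p_admit_given_smart(smart: bool) -> float:
    """P(A=T | S=smart), summing over the unobserved G and R."""
    return sum(P_G_GIVEN_S[smart][g] * P_R_GIVEN_S[smart][r] * P_ADMIT_GIVEN_GR[(g, r)]
               for g, r in product((False, True), repeat=2))


print(round(p_admit_given_smart(True), 3))
```

The same function evaluated at `smart=False` gives the term needed for the marginal P(A=T).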

97

Inference with Bayes Nets

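[Diagram: S (Smart) → G (Grades) and R (GREs); G and R → A (Admit)]

$$P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}T \mid G,R)$$

$$= P(S{=}T)\,P(A{=}T \mid S{=}T) + P(S{=}F)\,P(A{=}T \mid S{=}F) \approx 0.23$$

P(S=T) = 0.5 and P(S=F) = 0.5. P(A=T|S=T) ≈ 0.29 was calculated on the previous slide. P(A=T|S=F) is calculated analogously from the same tables: (0.9)(0.8)(0.1) + (0.9)(0.2)(0.2) + (0.1)(0.8)(0.5) + (0.1)(0.2)(0.8) ≈ 0.16.

$$P(S{=}T \mid A{=}T) = \frac{P(S{=}T, A{=}T)}{P(A{=}T)}$$

Or, using Bayes' rule:

$$P(S{=}T \mid A{=}T) = \frac{P(S{=}T)\,P(A{=}T \mid S{=}T)}{P(A{=}T)}$$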

98

Inference with Bayes Nets

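[Diagram: S (Smart) → G (Grades) and R (GREs); G and R → A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F):

$$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S}\sum_{G} P(S)\,P(G \mid S)\,P(R{=}F \mid S)\,P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S}\sum_{G}\sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}F \mid G,R) = 1 - P(A{=}T) \quad \text{(as before)}$$

P(G=F|A=F) is computed the same way, with G fixed to F and the sum taken over S and R.

Tedious but straightforward to compute. With the tables above:

$$\frac{P(G{=}F \mid A{=}F)}{P(R{=}F \mid A{=}F)} \approx \frac{0.94}{0.55} \approx 1.7$$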
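The tedious part can be handed to a generic enumeration routine. The sketch below is one possible way to do it, not the lecture's own code: it builds the joint from the factorisation P(S)P(G|S)P(R|S)P(A|G,R), using the tables as transcribed from the worked example, and answers any conditional query by summing over all 16 worlds. All names are illustrative.

```python
from itertools import product

B = (False, True)
P_S = {False: 0.5, True: 0.5}
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P(G | S)
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P(R | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}                     # P(A=T | G, R)


def joint(s, g, r, a):
    """P(S, G, R, A) from the network factorisation."""
    p_a = P_A_TRUE[(g, r)] if a else 1.0 - P_A_TRUE[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * p_a


def prob(query, evidence):
    """P(query | evidence); both are dicts over the names 's', 'g', 'r', 'a'."""
    num = den = 0.0
    for s, g, r, a in product(B, repeat=4):
        world = {"s": s, "g": g, "r": r, "a": a}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(s, g, r, a)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den


p_admit = prob({"a": True}, {})
p_smart_given_admit = prob({"s": True}, {"a": True})
p_bad_grades = prob({"g": False}, {"a": False})
p_bad_gres = prob({"r": False}, {"a": False})
print(p_admit, p_smart_given_admit, p_bad_grades, p_bad_gres)
```

Enumeration is exponential in the number of variables, so it only suits toy networks like this one.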

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: cause (often "hidden") "P1-P2 REAL" → effects (observed) "Detected by X1", …, "Detected by Xn"]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al. Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high confidence interactions from small scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102


Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7 doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: candidate patterns, one labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

109


109

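likelihood ratio $= \dfrac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \dfrac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$: if > 1 classify as true, if < 1 classify as false

log likelihood ratio $= \log\dfrac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function $= \log\dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\dfrac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$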
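Because the log of a product is a sum of logs, the ranking function can be computed one observation at a time. The sketch below is illustrative only: the assay names and all probabilities are made up, and the observations are assumed independent given whether the interaction is real.

```python
import math


def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum of per-observation log ratios (naive Bayes assumption:
    observations are independent given whether the interaction is real)."""
    return sum(math.log(p_given_true[name][value] / p_given_false[name][value])
               for name, value in observations.items())


# Hypothetical detection probabilities for two assays (made-up numbers).
P_TRUE = {"two_hybrid": {True: 0.30, False: 0.70},
          "mass_spec": {True: 0.60, False: 0.40}}
P_FALSE = {"two_hybrid": {True: 0.01, False: 0.99},
           "mass_spec": {True: 0.05, False: 0.95}}

pairs = {
    "P1-P2": {"two_hybrid": True, "mass_spec": True},
    "P1-P3": {"two_hybrid": False, "mass_spec": True},
    "P2-P3": {"two_hybrid": False, "mass_spec": False},
}

scores = {pair: log_likelihood_ratio(obs, P_TRUE, P_FALSE)
          for pair, obs in pairs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A pair detected by both assays ranks first, and a single detection still contributes evidence instead of being discarded.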

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

P(EE | pos) = 1114/2150   P(EE | neg) = 81924/573734

Likelihood $= L = \dfrac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$
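The likelihood for one feature value is a ratio of two gold-standard fractions. The sketch below assumes a particular reading of the slide's run-together digits, namely 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives in one essentiality category. That reading is an assumption, and the function name is illustrative.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), each estimated as a gold-standard fraction."""
    return (pos_with_f / pos_total) / (neg_with_f / neg_total)


# Assumed reading of the slide's digits: 1114/2150 and 81924/573734.
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(f"L = {L:.2f}")
```

A value of L above 1 means the feature value is enriched among gold-standard positives relative to negatives.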

111

112


113

Fully connected → compute probabilities for all 16 possible combinations
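With four binary data sets there are 2^4 = 16 joint outcomes, and a fully connected model estimates a likelihood ratio for each one without assuming independence. The sketch below is a hypothetical illustration: the gold-standard rows are made up, and the pseudocount is an added assumption to avoid division by zero when a combination has no observations.

```python
from collections import Counter
from itertools import product


def combination_likelihood_ratios(pos_rows, neg_rows, n_sources=4, pseudocount=1.0):
    """L(combo) = P(combo | pos) / P(combo | neg) for every joint outcome of
    n_sources binary data sets, with no independence assumption between them."""
    combos = list(product((0, 1), repeat=n_sources))
    pos_counts, neg_counts = Counter(pos_rows), Counter(neg_rows)
    pos_total = len(pos_rows) + pseudocount * len(combos)
    neg_total = len(neg_rows) + pseudocount * len(combos)
    return {c: ((pos_counts[c] + pseudocount) / pos_total) /
               ((neg_counts[c] + pseudocount) / neg_total)
            for c in combos}


# Made-up gold-standard rows: one 0/1 detection flag per data set.
pos = [(1, 1, 0, 1), (1, 0, 0, 1), (1, 1, 1, 1), (0, 0, 0, 1)]
neg = [(0, 0, 0, 0)] * 20 + [(1, 0, 0, 0), (0, 0, 0, 1)]
ratios = combination_likelihood_ratios(pos, neg)
print(len(ratios))
```

Many of the 16 cells are supported by very few gold-standard pairs, so the resulting ratios are noisy.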


114

Interpret with caution as numbers are small


115

TF=FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log\dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$


116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


[Diagram: template complex of NA and NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

31

[Diagram: template complex of NA and NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)
2. Predict interacting residues for the homology models

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

32

[Diagram: template complex of NA and NB]

Evaluate based on five measures

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

33

[Diagram: template complex of NA and NB]

Evaluate based on five measures
• SIM: structural similarity of NA, MA and NB, MB

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 34

Evaluate based on five measures
• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number) and COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

35

"The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces)."

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012) doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

37


We will examine Bayesian classifiers soon

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002) doi:10.1038/415123a

Gavin A-C et al. Nature 415, 141-147 (2002); Ho Y et al. Nature 415, 180-183 (2002)

Detecting protein-protein interactions

What are the likely false positives? What are the likely false negatives?

40

Courtesy of Macmillan Publishers LimitedUsed with permissionSource Kumar Anuj and Michael Snyder Proteomics Protein Complexes take the Bait Nature 415 no 6868 (2002) 123-4 40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations:

41

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations:
• overexpression/tagging can influence results
• only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels)

Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Biotechniques 2008 Apr;44(5):655-62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification: will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Biotechniques 2008 Apr;44(5):655-62. Ratushny V, Golemis E.

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: "Gold standard" and "Experiment". Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

47

Error Rates

[Venn diagram: "Gold standard", "Experiment 1", and "Experiment 2", with the overlap labeled "True positives from gold standard"]

48

Data Integration

[Venn diagram: "Gold standard", "Experiment 1", and "Experiment 2", with regions labeled I and II]

I = True positives from gold standard

II = Consensus true positives

Fraction of consensus present in gold standard = I/II

49

Data Integration

[Venn diagram: "Gold standard" and "Experiment 1", with regions labeled I, II, and III and a remaining region labeled "Mix of true and false positives"]

I = True positives from gold standard

II = Consensus true positives

Fraction of consensus present in gold standard = I/II

50

Data Integration

[Venn diagram: "Gold standard" and "Experiment 1", with regions labeled I, II, III, IV (true), and V (false)]

I = True positives from gold standard

II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Define IV = true positives, V = false positives

51

Data Integration

[Venn diagram: "Gold standard" and "Experiment 1", with regions labeled I, II, III, IV (true), and V (false)]

I = True positives from gold standard

II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

52

Estimated Error Rates

[Figure: estimated false positives and true positives, with values given for regions I, II, and III]

How complete are current yeast and human protein-interaction networks? G Traver Hart, Arun K Ramani and Edward M Marcotte. Genome Biology 2006, 7:120 doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{\text{I}}{\text{II}} = \frac{\text{III}}{\text{IV}}$$
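The proportionality I/II = III/IV can be solved for IV, which then gives an estimate of the false positives V among the interactions unique to one experiment. The sketch below uses made-up counts, since the slide's values for regions I, II and III were in the figure. The function names are illustrative.

```python
def estimate_unique_true_positives(region_i, region_ii, region_iii):
    """Solve I/II = III/IV for IV, the estimated number of true positives
    among the interactions unique to one experiment."""
    return region_iii * region_ii / region_i


def false_positive_fraction(region_i, region_ii, region_iii, n_unique):
    """Estimated V / (IV + V), where n_unique = IV + V is the number of
    interactions reported only by this experiment."""
    iv = estimate_unique_true_positives(region_i, region_ii, region_iii)
    return max(0.0, 1.0 - iv / n_unique)


# Made-up counts purely for illustration.
iv = estimate_unique_true_positives(region_i=100, region_ii=400, region_iii=50)
fp = false_positive_fraction(100, 400, 50, n_unique=1000)
print(iv, fp)
```

The estimate depends entirely on the gold standard being unbiased with respect to the experiments being compared.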

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$\text{Data} = (\text{two\_hybrid}, \text{mass\_spec}, \text{co\_evolution}, \text{co\_expression})$

or

$\text{Data} = (\text{mass\_spec\_expt\_1}, \text{mass\_spec\_expt\_2})$

57

Bayes Rule

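$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\;\overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$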

58

Naiumlve Bayes Classification

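$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\;\overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$

likelihood ratio = ratio of posterior probabilities $= \dfrac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})}$

if > 1 classify as true, if < 1 classify as false

How do we compute this?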
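One way to compute the ratio is to apply Bayes' rule to the numerator and the denominator, so that P(Data) cancels. The sketch below uses made-up numbers for the prior and the two likelihoods; the function names are illustrative.

```python
def posterior_odds(prior_true, likelihood_true, likelihood_false):
    """P(true | Data) / P(false | Data) via Bayes' rule; P(Data) cancels."""
    prior_false = 1.0 - prior_true
    return (likelihood_true * prior_true) / (likelihood_false * prior_false)


def classify(prior_true, likelihood_true, likelihood_false):
    """Call the interaction real when the posterior odds exceed 1."""
    return posterior_odds(prior_true, likelihood_true, likelihood_false) > 1.0


# Made-up numbers: interactions are rare a priori, the assay is informative.
odds = posterior_odds(prior_true=0.001, likelihood_true=0.5, likelihood_false=0.01)
posterior = odds / (1.0 + odds)
print(odds, posterior, classify(0.001, 0.5, 0.01))
```

Even a 50-fold likelihood ratio does not overcome a prior of 1 in 1000, which is why several independent lines of evidence are usually combined.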

59

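likelihood ratio $= \dfrac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \dfrac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$: if > 1 classify as true, if < 1 classify as false

log likelihood ratio $= \log\dfrac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.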

60

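likelihood ratio $= \dfrac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \dfrac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$: if > 1 classify as true, if < 1 classify as false

log likelihood ratio $= \log\dfrac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function $= \log\dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$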

61

We assume the observations are independent (we'll see how to handle dependence soon).

Ranking function $= \log\dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

62

We assume the observations are independent (we'll see how to handle dependence soon).

Ranking function $= \log\dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\dfrac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$

63

We assume the observations are independent (we'll see how to handle dependence soon).

We can compute these terms if we have a set of high-confidence positive and negative interactions.

Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al. Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function $= \log\dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\dfrac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives =

True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression
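Given scores for the gold-standard positives and negatives, the two axes of the ROC curve follow directly from the definitions TP/(TP+FN) and FP/(TN+FP). The sketch below is a minimal illustration with made-up scores; the function name and the convention that a pair is called positive when its score is at or above the threshold are assumptions.

```python
def roc_points(scores_pos, scores_neg):
    """Return (FPR, TPR) pairs, one per distinct score threshold, highest first.
    A pair is called positive when its score is >= the threshold."""
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(s >= t for s in scores_pos)  # positives called positive
        fp = sum(s >= t for s in scores_neg)  # negatives called positive
        tpr = tp / len(scores_pos)            # TP / (TP + FN)
        fpr = fp / len(scores_neg)            # FP / (TN + FP)
        points.append((fpr, tpr))
    return points


# Made-up scores for gold-standard positives and negatives.
points = roc_points([5.9, 2.1, 1.4, -0.3], [1.4, -1.2, -2.0, -2.5])
print(points)
```

Each threshold on the score gives one point, and sweeping the threshold from high to low traces the curve from (0, 0) to (1, 1).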

68 67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

Literature curated (excluding 2-hybrid)

Error rate equivalent to

literature curated

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

bull Bayesian Networks are a tool for reasoning with probabilities

bull Consist of a graph (network) and a set of probabilities

bull These can be ldquolearnedrdquo from the data

Bayesian Networks

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|Y2HUetz Y2HIto IPGavin IPHo)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|RosettaCCGOMIPSESS) But some problems are too big

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown

• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1) => 2^N - 1 parameters
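To make the growth concrete, the following sketch counts free parameters for a full joint table and, for comparison, for a network in which one hidden binary cause has several binary observed effects. The function names are illustrative and are not from the lecture.

```python
def joint_table_parameters(n_binary_vars: int) -> int:
    """Free parameters in a full joint table: 2^N states minus the sum-to-one constraint."""
    return 2 ** n_binary_vars - 1


def single_cause_parameters(n_effects: int) -> int:
    """Free parameters when one binary cause has n_effects binary children.

    One parameter for P(cause), plus two per effect:
    P(effect=T | cause=T) and P(effect=T | cause=F).
    """
    return 1 + 2 * n_effects


for n_effects in (4, 10, 20):
    n_vars = n_effects + 1  # the hidden cause plus its observed effects
    print(n_vars, joint_table_parameters(n_vars), single_cause_parameters(n_effects))
```

The joint table grows exponentially with the number of variables, while the factored form grows linearly in the number of effects.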

77

Graphical Structure Expresses our Beliefs

[Figure: a network with the node "P1-P2 REAL" and the nodes "Detected by X1" … "Detected by Xn"]

Cause (often "hidden")

Effects (observed)

78

Graphical Structure Expresses our Beliefs

[Figure: two networks. Left: "P1-P2 REAL" with "Detected by X1" … "Detected by Xn". Right: "P1-P2 REAL" with "Detected by X1" … "Detected by Xn", "Detected by Xj", "Detected by Xj+1" …, plus the nodes "Membrane" and "Highly expressed"]

79

Graphical Structure Expresses our Beliefs

[Figure: two networks. Left: "P1-P2 REAL" with "Detected by X1" … "Detected by Xn". Right: "P1-P2 REAL" with "Detected by X1" … "Detected by Xn", "Detected by Xj", "Detected by Xj+1" …, plus the nodes "Membrane" and "Highly expressed"]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

80

Graphical Structure Expresses our Beliefs

[Figure: two networks. Left: "P1-P2 REAL" with "Detected by X1" … "Detected by Xn". Right: "P1-P2 REAL" with "Detected by X1" … "Detected by Xn", "Detected by Xj", "Detected by Xj+1" …, plus the nodes "Membrane" and "Highly expressed"]

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Figure: network with nodes "Smart", "Grade inflation", "Grades", "GREs", and "Admit"]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Figure: network with nodes "Smart", "Grade inflation", "Grades", "GREs", and "Admit"]

Making predictions/inferences requires knowing the joint probabilities: P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

[Figure: nodes "Smart" and "Grades"]

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent, and both require the same number of constraints.

85
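The equivalence of the two formulations can be checked numerically. The sketch below assumes the slide's entries read as 0.665, 0.035, 0.06, 0.24 for the joint table and 0.7/0.3, 0.95/0.05, 0.2/0.8 for the conditional form; the variable names are illustrative.

```python
import itertools

# Conditional formulation: a prior on S and a conditional table for G given S.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint formulation as listed on the slide, keyed by (S, G).
joint_from_slide = {
    (False, False): 0.665,
    (False, True): 0.035,
    (True, False): 0.06,
    (True, True): 0.24,
}

# Rebuild the joint from the conditional form: P(S, G) = P(S) * P(G | S).
joint_rebuilt = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s, g in itertools.product((False, True), repeat=2)
}

for key in joint_from_slide:
    assert abs(joint_rebuilt[key] - joint_from_slide[key]) < 1e-9

# Going the other way: recover P(S) and P(G | S) from the joint table.
p_s_recovered = {
    s: sum(joint_from_slide[(s, g)] for g in (False, True)) for s in (False, True)
}
p_g_given_s_recovered = {
    s: {g: joint_from_slide[(s, g)] / p_s_recovered[s] for g in (False, True)}
    for s in (False, True)
}
```

Both forms have three free parameters: four joint entries minus one sum-to-one constraint, or one prior value plus one conditional value per state of S.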

Automated Admissions Decisions

[Figure: network with nodes "Smart", "Grade inflation", "Grades", "GREs", and "Admit"]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

"Explaining Away"

[Figure: network with nodes "Season", "Sprinkler" (S), "Rain" (R), "Grass wet", and "Slippery"]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

"Explaining Away"

[Figure: nodes "Coin 1", "Coin 2", and "Score"]

p(C1=H) = p(C1=T) = 0.5   p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
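The coin example can be worked through by enumerating the four outcomes. This is a minimal sketch assuming two fair, independent coins and a score of 1 when they match; the function names are illustrative.

```python
from itertools import product

# Two fair, independent coins; Score is 1 when they match and 0 otherwise.
P_HEADS = 0.5


def score(c1: str, c2: str) -> int:
    return 1 if c1 == c2 else 0


def joint(c1: str, c2: str) -> float:
    """P(C1, C2): the coins are independent a priori."""
    p1 = P_HEADS if c1 == "H" else 1 - P_HEADS
    p2 = P_HEADS if c2 == "H" else 1 - P_HEADS
    return p1 * p2


def prob_c1_heads(c2=None, observed_score=None) -> float:
    """P(C1 = H | evidence), by summing the joint over consistent outcomes."""
    consistent = [
        (c1, c2_val)
        for c1, c2_val in product("HT", repeat=2)
        if (c2 is None or c2_val == c2)
        and (observed_score is None or score(c1, c2_val) == observed_score)
    ]
    total = sum(joint(c1, c2_val) for c1, c2_val in consistent)
    heads = sum(joint(c1, c2_val) for c1, c2_val in consistent if c1 == "H")
    return heads / total


print(prob_c1_heads())                          # no evidence
print(prob_c1_heads(c2="T"))                    # C2 known, Score unknown
print(prob_c1_heads(observed_score=1))          # Score known, C2 unknown
print(prob_c1_heads(c2="T", observed_score=1))  # both known
```

Without the score, learning C2 leaves the belief in C1 = H at 0.5. Once Score = 1 is observed, learning that C2 = T drives it to 0, which is the coupling the slide describes.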

88

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?

• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior

• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Figure: nodes "Coin 1", "Coin 2", and "Score"]

p(C1=H) = ?   p(C1=T) = ?   p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …
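When the structure is known and every variable is observed, the maximum-likelihood parameters of a discrete network are the observed frequencies. The sketch below illustrates this for the coin network; only the first two records come from the slide, the remaining records and all function names are made up for illustration.

```python
from collections import Counter

# Hypothetical training data D = [(C1, C2, Score), ...]; only the first two
# records appear on the slide, the rest are invented for this example.
data = [
    ("H", "T", 0),
    ("H", "H", 1),
    ("T", "T", 1),
    ("T", "H", 0),
    ("H", "H", 1),
    ("T", "H", 0),
]


def estimate_marginal(records, index):
    """Maximum-likelihood estimate of P(coin = H) from observed frequencies."""
    heads = sum(1 for rec in records if rec[index] == "H")
    return heads / len(records)


def estimate_score_table(records):
    """Maximum-likelihood estimate of P(Score = 1 | C1, C2) for each observed (C1, C2)."""
    totals = Counter((c1, c2) for c1, c2, _ in records)
    ones = Counter((c1, c2) for c1, c2, s in records if s == 1)
    return {pair: ones[pair] / n for pair, n in totals.items()}


p_c1_heads = estimate_marginal(data, 0)
p_c2_heads = estimate_marginal(data, 1)
score_table = estimate_score_table(data)
```

With hidden variables or missing values, counting is no longer enough, which is where search methods such as EM come in.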

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

A worked "toy" example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Figure: network with nodes "Smart" (S), "Grades" (G), "GREs" (R), and "Admit" (A)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

= P(S) P(G|S) P(R|S) P(A|G,R) (why?)

(because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Figure: network with nodes "Smart" (S), "Grades" (G), "GREs" (R), and "Admit" (A)]

S:
P(F)   P(T)
0.5    0.5

G given S:
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R given S:
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A given G and R:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)

= (0.8)(0.2)(0.1) [G=F, R=F] + (0.8)(0.8)(0.2) [G=F, R=T] + (0.2)(0.2)(0.5) [G=T, R=F] + (0.2)(0.8)(0.8) [G=T, R=T] = 0.29

Rough grading: being smart doesn't help very much

GREs are better correlated with intelligence than the grades
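The prediction sum can be reproduced in a few lines. This sketch uses the conditional probability tables as transcribed from the slide (True for T, False for F); the names are illustrative.

```python
from itertools import product

# Conditional probability tables transcribed from the slide (True = T, False = F).
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
# P(A = T | G, R), keyed by (G, R).
P_ADMIT_GIVEN_GR = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def prob_admit_given_smart(smart: bool) -> float:
    """P(A=T | S=smart): sum over the unobserved G and R."""
    return sum(
        P_G_GIVEN_S[smart][g] * P_R_GIVEN_S[smart][r] * P_ADMIT_GIVEN_GR[(g, r)]
        for g, r in product((False, True), repeat=2)
    )


p_admit_if_smart = prob_admit_given_smart(True)
print(round(p_admit_if_smart, 3))  # 0.292, the 0.29 on the slide
```

Calling `prob_admit_given_smart(False)` gives the analogous value for S = F.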

97

Inference with Bayes Nets

[Figure: network with nodes "Smart" (S), "Grades" (G), "GREs" (R), and "Admit" (A)]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.23

P(S=T) = 0.5, P(S=F) = 0.5. P(A=T|S=T) = 0.29 was calculated on the previous slide; P(A=T|S=F) = 0.16 is calculated analogously.

P(S=T|A=T) = P(S=T, A=T) / P(A=T). Or, using Bayes' Rule: = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Figure: network with nodes "Smart" (S), "Grades" (G), "GREs" (R), and "Admit" (A)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F) = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R) (as before)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) = 0.94 / 0.55 = 1.7
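The "tedious but straightforward" sums can be delegated to brute-force enumeration over the 16 joint states. This is a sketch using the tables from the "Prediction with Bayes Nets" slide; the helper names (`probability`, `conditional`) are illustrative and not part of the lecture.

```python
from itertools import product

BOOLS = (False, True)

# Tables transcribed from the "Prediction with Bayes Nets" slide (True = T, False = F).
P_S = {False: 0.5, True: 0.5}
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_A_TRUE_GIVEN_GR = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def joint(s: bool, g: bool, r: bool, a: bool) -> float:
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a_true = P_A_TRUE_GIVEN_GR[(g, r)]
    p_a = p_a_true if a else 1.0 - p_a_true
    return P_S[s] * P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * p_a


def probability(**fixed: bool) -> float:
    """Marginal probability of the assignments in `fixed`, summing out the rest."""
    names = ("s", "g", "r", "a")
    total = 0.0
    for values in product(BOOLS, repeat=4):
        world = dict(zip(names, values))
        if all(world[name] == value for name, value in fixed.items()):
            total += joint(**world)
    return total


def conditional(query: dict, evidence: dict) -> float:
    """P(query | evidence) = P(query, evidence) / P(evidence)."""
    return probability(**query, **evidence) / probability(**evidence)


p_smart_given_admit = conditional({"s": True}, {"a": True})
p_bad_grades_given_reject = conditional({"g": False}, {"a": False})
p_bad_gres_given_reject = conditional({"r": False}, {"a": False})
```

Enumeration is only practical for toy networks like this one, since the number of joint states doubles with every added binary variable.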

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Figure: a network with the node "P1-P2 REAL" and the nodes "Detected by X1" … "Detected by Xn"]

Cause (often "hidden")

Effects (observed)

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al. Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: patterns, one labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical

• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

109


108


109

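$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

$$\text{log likelihood ratio} = \log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$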

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114/2150}{81924/573734} \approx 3.6$$
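The likelihood for one feature value is a ratio of two gold-standard fractions. The sketch below assumes the slide's figures are 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives sharing the feature value; the function name is illustrative.

```python
def likelihood_ratio(pos_with_feature: int, pos_total: int,
                     neg_with_feature: int, neg_total: int) -> float:
    """L = P(f | pos) / P(f | neg), each estimated as a fraction of the gold standard."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (see the assumption stated above).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))
```

A value of L above 1 means the feature value is enriched among gold-standard positives relative to negatives.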

111

112


113

Fully connected → Compute probabilities for all 16 possible combinations


114

Interpret with caution as numbers are small


115

TF=FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$


116

Summary

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


[Figure: structures labeled NA and NB]

1. Identify interacting residues in template complex (called NA1, NB3 in rest of paper)

2. Predict interacting residues for the homology models

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

32

[Figure: structures labeled NA and NB]

Evaluate based on five measures

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

33

[Figure: structures labeled NA and NB]

Evaluate based on five measures
• SIM: structural similarity of NA, MA and NB, MB

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 34

Evaluate based on five measures
• SIM: structural similarity of NA, MA and NB, MB
• SIZ (number), COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at interface
• OL: number of aligned pairs at interface

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use

35

"The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces)."

copy source unknown All rights reserved This content is excluded from our CreativeCommons license For more information see httpocwmiteduhelpfaq-fair-use 36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

37

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

We will examine Bayesian classifiers soon

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

38

Outline

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002). Ho, Y. et al. Nature 415, 180-183 (2002).

What are the likely false positives? What are the likely false negatives?

40

Courtesy of Macmillan Publishers LimitedUsed with permissionSource Kumar Anuj and Michael Snyder Proteomics Protein Complexes take the Bait Nature 415 no 6868 (2002) 123-4 40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions

• Proteins are in their correct subcellular location

Limitations:

41

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions

• Proteins are in their correct subcellular location

Limitations:
• overexpression/tagging can influence results
• only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels)
Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited Used with permissionSource Gavin Anne-Claude Markus Boumlsche et al Functional Organization of the Yeast Proteome bySystematic Analysis of Protein Complexes Nature 415 no 6868 (2002) 141-7

43

Yeast two-hybrid

Biotechniques. 2008 Apr; 44(5): 655-62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of Cell Signaling Pathwaysusing the Evolving Yeast Two-hybrid System Biotechniques 44 no 5 (2008) 655

44

• Does not require purification – will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Biotechniques. 2008 Apr; 44(5): 655-62. Ratushny V, Golemis E.

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of CellSignaling Pathways using the Evolving Yeast Two-hybrid System Biotechniques44 no 5 (2008) 655

45

Outline

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Figure: Venn diagram of "Gold standard" and "Experiment". Regions: false negatives; true positives; true positives and false positives]

47

Error Rates

[Figure: Venn diagram of "Gold standard", "Experiment 1", and "Experiment 2", with the region "True positives from gold standard" marked]

48

Data Integration

[Figure: Venn diagram of "Gold standard", "Experiment 1", and "Experiment 2", with regions I and II marked]

I = True positives from gold standard

II = Consensus true positives

Fraction of consensus present in gold standard = I/II

49

Data Integration

[Figure: Venn diagram of "Gold standard" and "Experiment 1", with regions I, II, and III marked]

I = True positives from gold standard

II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Mix of true and false positives

50

Data Integration

[Figure: Venn diagram of "Gold standard" and "Experiment 1", with regions I, II, III, IV (true), and V (false) marked]

I = True positives from gold standard

II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Define IV = true positives, V = false positives

51

Data Integration

[Figure: Venn diagram of "Gold standard" and "Experiment 1", with regions I, II, III, IV (true), and V (false) marked]

I = True positives from gold standard

II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

52

Estimated Error Rates

[Figure: regions I, II, and III, with false positives and true positives marked]

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani, and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV): I/II = III/IV
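The assumption I/II = III/IV can be rearranged to estimate the unknown number of true positives in the unique data. The sketch below uses made-up counts purely to illustrate the arithmetic; the function names and the definition of the false-positive fraction as V/(IV+V) are assumptions for this example.

```python
def estimate_unique_true_positives(i_gold_in_consensus: int,
                                   ii_consensus: int,
                                   iii_gold_in_unique: int) -> float:
    """Solve I / II = III / IV for IV, the number of true positives in the unique data."""
    return iii_gold_in_unique * ii_consensus / i_gold_in_consensus


def estimate_false_positive_fraction(unique_total: int, iv_true_positives: float) -> float:
    """Fraction of the unique data that is false positive: V / (IV + V)."""
    return (unique_total - iv_true_positives) / unique_total


# Hypothetical counts, chosen only to illustrate the arithmetic.
I, II, III, unique_total = 200, 400, 300, 2000
IV = estimate_unique_true_positives(I, II, III)
fp_fraction = estimate_false_positive_fraction(unique_total, IV)
print(IV, fp_fraction)
```

With these invented counts, half of the consensus is in the gold standard, so the 300 gold-standard hits in the unique data imply about 600 true positives there.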

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by >1 method

• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

Bayes Rule

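$$P(\text{true\_PPI} \mid \text{Data}) = \frac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data})}$$

posterior = P(true_PPI | Data); prior = P(true_PPI); likelihood = P(Data | true_PPI)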

58

Naiumlve Bayes Classification

[Equation: Bayes' rule, with the posterior, prior, and likelihood terms labeled]

likelihood ratio = ratio of posterior probabilities

If > 1, classify as true; if < 1, classify as false.

How do we compute this?

59

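$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

$$\text{log likelihood ratio} = \log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.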

60

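$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

$$\text{log likelihood ratio} = \log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$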

61

We assume the observations are independent (we'll see how to handle dependence soon).

$$\text{Ranking function} = \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

62

We assume the observations are independent (we'll see how to handle dependence soon).

$$\text{Ranking function} = \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

63

We assume the observations are independent (we'll see how to handle dependence soon). We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al. Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

$$\text{Ranking function} = \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$
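Under the independence assumption, the ranking function is a sum of per-observation log likelihood ratios. The sketch below illustrates this with invented per-assay likelihoods; the assay names, probabilities, and protein-pair labels are all hypothetical, and in practice the likelihoods would be estimated from high-confidence positive and negative sets.

```python
import math

# Hypothetical per-assay likelihoods: for each outcome (detected or not),
# the pair (P(outcome | true PPI), P(outcome | false PPI)).
LIKELIHOODS = {
    "two_hybrid": {True: (0.30, 0.02), False: (0.70, 0.98)},
    "mass_spec": {True: (0.40, 0.05), False: (0.60, 0.95)},
    "co_expression": {True: (0.60, 0.30), False: (0.40, 0.70)},
}


def ranking_score(observations: dict) -> float:
    """Sum of log likelihood ratios over the observed assays (naive Bayes).

    Assays missing from `observations` contribute nothing to the sum.
    """
    score = 0.0
    for assay, detected in observations.items():
        p_true, p_false = LIKELIHOODS[assay][detected]
        score += math.log(p_true / p_false)
    return score


pairs = {
    "pair_1": {"two_hybrid": True, "mass_spec": True, "co_expression": True},
    "pair_2": {"two_hybrid": True, "mass_spec": False},
    "pair_3": {"two_hybrid": False, "mass_spec": False, "co_expression": False},
}
ranked = sorted(pairs, key=lambda name: ranking_score(pairs[name]), reverse=True)
print(ranked)
```

Working in log space turns the product into a sum, which avoids numerical underflow when many observations are combined.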

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al. Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)]

True positives = True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al. Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression
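The two axes of the ROC curve can be computed by sweeping a score threshold over the high-confidence positive and negative sets. This is a minimal sketch with invented scores; the function name and the convention that a pair is called an interaction when its score is at or above the threshold are assumptions for illustration.

```python
def roc_points(scores_pos, scores_neg):
    """Return (FPR, TPR) pairs, one per distinct score threshold, highest threshold first.

    scores_pos: scores of high-confidence positive pairs.
    scores_neg: scores of high-confidence negative pairs.
    A pair is called an interaction when its score is >= the threshold.
    """
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(s >= t for s in scores_pos)
        fp = sum(s >= t for s in scores_neg)
        fn = len(scores_pos) - tp
        tn = len(scores_neg) - fp
        points.append((fp / (tn + fp), tp / (tp + fn)))
    return points


# Hypothetical scores for illustration only.
positives = [0.9, 0.8, 0.6, 0.3]
negatives = [0.7, 0.4, 0.2, 0.1]
curve = roc_points(positives, negatives)
```

Lowering the threshold moves along the curve from the lower left toward the upper right, trading more true positives for more false positives.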

68 67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al. Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated"]

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al. Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotation: "9074 interactions"]

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

bull Bayesian Networks are a tool for reasoning with probabilities

bull Consist of a graph (network) and a set of probabilities

bull These can be ldquolearnedrdquo from the data

Bayesian Networks

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|Y2HUetz Y2HIto IPGavin IPHo)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|RosettaCCGOMIPSESS) But some problems are too big

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

bull Complete joint probability tables are large and often unknown

bull N binary variables = 2N states ndash only one constraint (sum of all probabilities =1)

=gt 2N - 1 parameters

77

Graphical Structure Expresses our Beliefs

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed)

78

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naiumlve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

bull The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure In a Bayesian Network we donrsquot need the full probability distribution A node is independent of its ancestors given its parents For example The activity of a gene does not depend on the activity of TF A1 once I know TF B2 82

Automated Admissions Decisions

[Diagram: nodes "Smart", "Grade inflation", "Grades", "GREs", "Admit".]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram: nodes "Smart", "Grade inflation", "Grades", "GREs", "Admit".]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram: "Smart" → "Grades"]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints.

85
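The equivalence of the two tables can be checked numerically. The sketch below rebuilds the joint table from P(S) and P(G|S), then recovers the conditional form from the joint; the variable names are illustrative choices, and the numbers are the ones on the slide.

```python
# P(S) and P(G | S) as given on the slide (False = F, True = T).
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {False: {False: 0.95, True: 0.05},
               True: {False: 0.2, True: 0.8}}

# Conditional form -> joint form: P(S, G) = P(S) * P(G | S)
joint = {(s, g): p_s[s] * p_g_given_s[s][g]
         for s in (False, True) for g in (False, True)}

# Joint form -> conditional form: marginalize, then divide.
p_s_back = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}
p_g_back = {s: {g: joint[(s, g)] / p_s_back[s] for g in (False, True)}
            for s in (False, True)}

for key in sorted(joint):
    print(key, round(joint[key], 3))
```

Either way there are three free numbers: the joint has four entries that must sum to one, and the conditional form has one free entry for P(S) plus one per row of P(G|S).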

Automated Admissions Decisions

[Diagram: nodes "Smart", "Grade inflation", "Grades", "GREs", "Admit".]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

"Explaining Away"

[Diagram: nodes "Season", "Sprinkler" (S), "Rain" (R), "Grass wet", "Slippery".]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about whether the sprinkler is on.

87

"Explaining Away"

[Diagram: "Coin 1" → "Score" ← "Coin 2"]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
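The two-coin example is small enough to enumerate. The following sketch (helper names are assumptions made for illustration) lists the four equally likely worlds and compares the belief in C1 = H with and without knowledge of the score.

```python
from itertools import product

# Each world is (C1, C2, Score); Score is 1 when the coins match.
worlds = {(c1, c2, int(c1 == c2)): 0.25 for c1, c2 in product("HT", repeat=2)}


def prob(event, given=lambda w: True):
    """P(event | given), by summing over the enumerated worlds."""
    den = sum(p for w, p in worlds.items() if given(w))
    num = sum(p for w, p in worlds.items() if given(w) and event(w))
    return num / den


def c1_heads(w):
    return w[0] == "H"


print(prob(c1_heads))                                       # 0.5
print(prob(c1_heads, lambda w: w[1] == "T"))                # 0.5
print(prob(c1_heads, lambda w: w[2] == 1))                  # 0.5
print(prob(c1_heads, lambda w: w[2] == 1 and w[1] == "T"))  # 0.0
```

Without the score, learning C2 tells us nothing about C1. Once the score is observed, C2 fully determines C1, even though neither coin causally affects the other.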

88

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?

• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior

• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Diagram: "Coin 1" → "Score" ← "Coin 2"]

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = {(C1, C2, S)} = {(H, T, 0), (H, H, 1), …}
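When the structure is known and every variable is observed, maximum likelihood reduces to counting. The sketch below estimates p(C1) from a small data set; the six observations are hypothetical (the slide only shows the first two), and the `pseudocount` argument is an illustrative way to mimic a prior, not something the lecture specifies.

```python
from collections import Counter

# Hypothetical training data D = [(C1, C2, Score), ...]
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1),
        ("T", "H", 0), ("H", "H", 1), ("H", "T", 0)]


def estimate(values, outcomes, pseudocount=0.0):
    """Relative-frequency estimate; pseudocount > 0 smooths toward uniform."""
    counts = Counter(values)
    total = len(values) + pseudocount * len(outcomes)
    return {o: (counts[o] + pseudocount) / total for o in outcomes}


c1_values = [c1 for c1, _, _ in data]
p_c1_ml = estimate(c1_values, "HT")                         # maximum likelihood
p_c1_smoothed = estimate(c1_values, "HT", pseudocount=1.0)  # prior-like smoothing

print(p_c1_ml)
print(p_c1_smoothed)
```

With a pseudocount of 1 the estimate coincides with the maximum-posterior value under a symmetric Beta(2, 2) prior, which is one simple instance of the "maximum posterior" objective.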

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked "toy" example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs); G → A (Admit), R → A.]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

           = P(S) P(G|S) P(R|S) P(A|G,R)   (why?)

(because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs); G → A (Admit), R → A.]

S:
P(F)   P(T)
0.5    0.5

G:
S    P(F)   P(T)
F    0.9    0.1
T    0.8    0.2

R:
S    P(F)   P(T)
F    0.8    0.2
T    0.2    0.8

A:
G    R    P(F)   P(T)
F    F    0.9    0.1
F    T    0.8    0.2
T    F    0.5    0.5
T    T    0.2    0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)

           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
             (terms in order G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn't help very much.

GREs are better correlated with intelligence than the grades.
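The sum on this slide can be reproduced directly from the conditional probability tables. This is a minimal sketch using the table values as transcribed above; the dictionary layout and function name are illustrative choices.

```python
from itertools import product

# P_G[s][g] = P(G=g | S=s); P_R[s][r] = P(R=r | S=s); False = F, True = T
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
# P_A_TRUE[(g, r)] = P(A=T | G=g, R=r)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}


def p_admit_given_smart(s: bool) -> float:
    """P(A=T | S=s), summing over the unobserved G and R."""
    return sum(P_G[s][g] * P_R[s][r] * P_A_TRUE[(g, r)]
               for g, r in product((False, True), repeat=2))


print(round(p_admit_given_smart(True), 3))  # 0.292
```

The four products are exactly the four terms written out on the slide, so the result rounds to the quoted 0.29.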

97

Inference with Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs); G → A (Admit), R → A.]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)

       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5    P(S=F) = 0.5
P(A=T|S=T) calculated on previous slide = 0.29
P(A=T|S=F) calculated analogously to P(A=T|S=T) = 0.14

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes Rule: = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs); G → A (Admit), R → A.]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F) = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R)   (as before)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6
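The "tedious but straightforward" sums can be handed to a brute-force enumerator. The sketch below is one way to do it, using the tables from the prediction slide as transcribed; `joint` and `query` are hypothetical helper names. With these exact table values, enumeration gives P(A=T) ≈ 0.23 and P(A=T|S=F) ≈ 0.16, slightly different from the 0.21 and 0.14 quoted on the slides, so the slide figures are best read as approximate. The qualitative answer is the same.

```python
from itertools import product

P_S = {False: 0.5, True: 0.5}
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P(G | S)
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P(R | S)
P_A = {(False, False): 0.1, (False, True): 0.2,
       (True, False): 0.5, (True, True): 0.8}                          # P(A=T | G, R)


def joint(s, g, r, a):
    """P(S, G, R, A) from the chain rule P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a = P_A[(g, r)] if a else 1.0 - P_A[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * p_a


def query(target, evidence):
    """P(target | evidence); both map 'S', 'G', 'R', 'A' to booleans."""
    num = den = 0.0
    for values in product((False, True), repeat=4):
        world = dict(zip("SGRA", values))
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(*values)
            den += p
            if all(world[k] == v for k, v in target.items()):
                num += p
    return num / den


p_smart = query({"S": True}, {"A": True})        # P(S=T | A=T)
p_bad_gre = query({"R": False}, {"A": False})    # P(R=F | A=F)
p_bad_grade = query({"G": False}, {"A": False})  # P(G=F | A=F)
print(round(p_smart, 3), round(p_bad_gre, 3), round(p_bad_grade, 3))
```

Enumeration scales as 2^N, which is fine for four variables but is exactly why larger networks need the approximate inference methods mentioned earlier.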

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: a cause node "P1-P2 REAL" (often "hidden") with arrows to effect nodes "Detected by X1", …, "Detected by Xn" (observed).]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

108

109

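likelihood ratio =

\[ \frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\, P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\, P(\mathrm{false\_PPI})} \]

if > 1 classify as true; if < 1 classify as false

log likelihood ratio =

\[ \log \frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function =

\[ \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})} \]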

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

\[ \text{Likelihood} = L = \frac{P(f \mid \mathrm{pos})}{P(f \mid \mathrm{neg})} = \frac{1114 / 2150}{81924 / 573734} \]
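A likelihood ratio for a discrete feature value is just a ratio of two frequencies, one measured in the gold-standard positives and one in the gold-standard negatives. The sketch below assumes the run-together digits on this slide split as 1114 of 2150 positives and 81924 of 573734 negatives for the EE (both essential) value; that split and the function name are assumptions made for illustration.

```python
def likelihood_ratio(pos_with_f: int, pos_total: int,
                     neg_with_f: int, neg_total: int) -> float:
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    return (pos_with_f / pos_total) / (neg_with_f / neg_total)


# Assumed reading of the slide's counts for the EE feature value.
L_EE = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L_EE, 2))
```

A value above 1 means the feature value is more common among known interacting pairs than among non-interacting pairs, so observing it raises the odds that a pair interacts.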

111

112


113

Fully connected → Compute probabilities for all 16 possible combinations

114

Interpret with caution as numbers are small


115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

\[ \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


[Figure: template complex of structural neighbors, labeled NA and NB.]

Evaluate based on five measures

© Source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

33

Evaluate based on five measures
• SIM: structural similarity of NA to MA and NB to MB

34

Evaluate based on five measures
• SIM: structural similarity of NA to MA and NB to MB
• SIZ (number) / COV (fraction): interaction pairs that can be aligned anywhere
• OS: subset of SIZ at the interface
• OL: number of aligned pairs at the interface

35

"The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces)."

36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg: 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

37

We will examine Bayesian classifiers soon


38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002); Ho, Y. et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

40

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes Take the Bait." Nature 415, no. 6868 (2002): 123-4.

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations

41

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels)
Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5): 655-62.

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification: will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: "Gold standard" and "Experiment" sets. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

47

Error Rates

[Venn diagram: "Gold standard", "Experiment 1", and "Experiment 2" sets; the overlap with the gold standard is labeled "True positives from gold standard".]

48

Data Integration

[Venn diagram: "Gold standard", "Experiment 1", and "Experiment 2" sets, with regions I and II marked.]

I = true positives from gold standard

II = consensus true positives

Fraction of consensus present in gold standard = I/II

49

Data Integration

[Venn diagram: "Gold standard" and "Experiment 1" sets, with regions I, II, and III marked.]

I = true positives from gold standard

II = consensus true positives

Fraction of consensus present in gold standard = I/II

III = mix of true and false positives

50

Data Integration

[Venn diagram: "Gold standard" and "Experiment 1" sets, with regions I, II, III, IV (true), and V (false) marked.]

I = true positives from gold standard

II = consensus true positives

Fraction of consensus present in gold standard = I/II

Define IV = true positives, V = false positives

51

Data Integration

[Venn diagram: "Gold standard" and "Experiment 1" sets, with regions I, II, III, IV (true), and V (false) marked.]

I = true positives from gold standard

II = consensus true positives

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

52

Estimated Error Rates

[Figure legend: "False positives", "True positives".]

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

I =      II =      III =

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.
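The assumption I/II = III/IV can be rearranged to estimate how many of an experiment's unique interactions are real. The sketch below works through that arithmetic; the function name and all counts are hypothetical, since the slide's values for I, II, and III did not survive.

```python
def estimate_unique_split(i: float, ii: float, iii: float, unique_total: float):
    """Estimate true (IV) and false (V) positives among unique interactions.

    i            consensus interactions that are in the gold standard
    ii           all consensus interactions (assumed true)
    iii          unique interactions that are in the gold standard
    unique_total all interactions reported only by this experiment
    """
    iv = iii * ii / i          # from I/II = III/IV
    v = unique_total - iv      # whatever is left is counted as false
    return iv, v, v / unique_total


# Hypothetical counts, for illustration only.
iv, v, false_fraction = estimate_unique_split(200, 400, 100, 1000)
print(iv, v, false_fraction)  # 200.0 800.0 0.8
```

The estimate is only as good as the no-bias assumption: if the gold standard favors the kind of interaction one assay detects, I/II and III/IV will differ.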

53


54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

Bayes Rule

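\[ \underbrace{P(\mathrm{true\_PPI} \mid \mathrm{Data})}_{\text{posterior}} = \frac{\overbrace{P(\mathrm{true\_PPI})}^{\text{prior}}\; \overbrace{P(\mathrm{Data} \mid \mathrm{true\_PPI})}^{\text{likelihood}}}{P(\mathrm{Data})} \]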

58

Naïve Bayes Classification

posterior ∝ prior × likelihood

likelihood ratio = ratio of posterior probabilities

if > 1 classify as true; if < 1 classify as false

How do we compute this?

59

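likelihood ratio =

\[ \frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\, P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\, P(\mathrm{false\_PPI})} \]

if > 1 classify as true; if < 1 classify as false

log likelihood ratio =

\[ \log \frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function =

\[ \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]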

61

We assume the observations are independent (we'll see how to handle dependence soon)

Ranking function =

\[ \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

62

We assume the observations are independent (we'll see how to handle dependence soon). We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol. Cell. Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function =

\[ \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})} \]
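Under the independence assumption the ranking function is a sum of per-observation log ratios. The sketch below illustrates this with three assays; the assay names mirror the data types mentioned earlier, but every probability in `ASSAYS` is a made-up placeholder that would in practice be estimated from high-confidence positive and negative sets.

```python
import math

# Hypothetical (P(detected | true_PPI), P(detected | false_PPI)) per assay.
ASSAYS = {
    "two_hybrid": (0.30, 0.01),
    "mass_spec": (0.50, 0.05),
    "co_expression": (0.60, 0.20),
}


def log_likelihood_ratio(observations: dict) -> float:
    """Sum of log[P(obs_i | true) / P(obs_i | false)] over the assays observed.

    Assays missing from `observations` contribute nothing, which is one
    simple way to handle missing data.
    """
    total = 0.0
    for name, detected in observations.items():
        p_true, p_false = ASSAYS[name]
        if not detected:
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        total += math.log(p_true / p_false)
    return total


pair_a = {"two_hybrid": True, "mass_spec": True, "co_expression": True}
pair_b = {"two_hybrid": False, "mass_spec": True}
for pair in (pair_a, pair_b):
    print(round(log_likelihood_ratio(pair), 2))
```

Because the prior is shared by all pairs, sorting by this score gives the same order as sorting by posterior probability.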

64

von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log-likelihood ratio.

65

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives =
True negatives =

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

66

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67
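Given scores for gold-standard positives and negatives, the two ROC axes can be traced by lowering the score cutoff one pair at a time. This is a minimal sketch with made-up scores and labels; it ignores tied scores, and `roc_points` is an illustrative name rather than anything from the cited paper.

```python
def roc_points(scores, labels):
    """Return (FP/(TN+FP), TP/(TP+FN)) points as the cutoff is lowered.

    labels: True for gold-standard positives, False for negatives.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit pairs from highest to lowest score.
    for _, is_positive in sorted(zip(scores, labels), key=lambda x: -x[0]):
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores for four protein pairs.
pts = roc_points([0.9, 0.8, 0.7, 0.4], [True, True, False, True])
print(pts)
```

A curve that climbs steeply before moving right means most high-scoring pairs are gold-standard positives.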

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve as on the previous slide, annotated "9074 interactions", "Literature curated (excluding 2-hybrid)", and "Error rate equivalent to literature curated".]

68

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve as on the previous slide, annotated "9074 interactions".]

69

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

73

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

74

Bayesian Networks

You could try to write out all the joint probabilities P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

75

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|RosettaCCGOMIPSESS) But some problems are too big

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

bull Complete joint probability tables are large and often unknown

bull N binary variables = 2N states ndash only one constraint (sum of all probabilities =1)

=gt 2N - 1 parameters

77

Graphical Structure Expresses our Beliefs

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed)

78

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naiumlve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

bull The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure In a Bayesian Network we donrsquot need the full probability distribution A node is independent of its ancestors given its parents For example The activity of a gene does not depend on the activity of TF A1 once I know TF B2 82

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Prediction we observe the ldquocausesrdquo (roots parents) and want to predict the ldquoeffectsrdquo (leaveschildren)

Given Grades GREs will we admit

Inference we observe the ldquoeffectsrdquo (leaves children) and want to infer the hidden values of the ldquocausesrdquo (rootsparents)

You meet an admitted the student Is she smart

Admit

83

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictionsinferences requires knowing the joint probabilities P(admit grade inflation smart grades GREs) We will find conditional probabilities to be very useful

84

S G (above

threshold)

P(SG)

F F 0665

F T 0035

T F 006

T T 024

Automated Admissions Decisions Joint Probability

S P(G=F|S) P(G=T|S)

F 095 005

T 02 08

P(S=F)=07 P(S=T)=03

Grades

Smart

Conditional Probability

Formulations are equivalent and both require same number of constraints 85

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we donrsquot need the full probability distribution The joint probability depends only on ldquoparentsrdquo For example GRE scores do not depend on the level of grade inflation at the school but the grades do

Admit

86

ldquoExplaining Awayrdquo Season

S Sprinkler

R Rain

Grass wet

Does the probability that itrsquos raining depend on whether the sprinkler is on

Slippery

In a causal sense clearly not But in a probabilistic model the knowledge that it is raining influences our beliefs

87

Coin 1

Coin 2

Score

p(C2=H)=p(C2=T) = 5 p(C1=H)=p(C1=T) = 5

C1 C2 Score

H H 1

T T 1

H T 0

T H 0

Does the prob that C1 =H on depend on whether C2 =T If we know the score then our belief in the state of C1 is influenced by our belief in the state C2

ldquoExplaining Awayrdquo

88

How do we obtain a BN

bull Two problems ndash learning graph structure

bull NP-complete bull approximation algorithms

ndash probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions:
  – Maximum likelihood
    • Define the likelihood over training data X_i
  – Maximum posterior

• Good search algorithms exist:
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Figure: Coin 1 and Coin 2 are the parents of Score]

Parameters to learn: p(C1=H), p(C1=T), p(C2=H), p(C2=T), and the Score entry for each row:

C1  C2  Score
H   H
T   T
H   T
T   H

D = (C1, C2, S) = (H,T,0), (H,H,1), …
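For fully observed discrete data, the maximum-likelihood parameters are simply relative frequencies. The sketch below assumes a small made-up data set D (only its first two observations come from the slide) and hypothetical helper names.

```python
from collections import Counter

# Hypothetical training data D = [(C1, C2, Score), ...]; only the first two
# observations appear on the slide, the rest are invented for illustration.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0),
        ("H", "H", 1), ("T", "H", 0), ("H", "T", 0), ("H", "H", 1)]

def mle_marginal(values):
    """Maximum-likelihood estimate of a discrete distribution: relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: n / total for v, n in counts.items()}

def mle_score_table(rows):
    """MLE of P(Score=1 | C1, C2) for every parent configuration seen in the data."""
    seen = Counter((c1, c2) for c1, c2, _ in rows)
    ones = Counter((c1, c2) for c1, c2, s in rows if s == 1)
    return {cfg: ones[cfg] / n for cfg, n in seen.items()}

p_c1 = mle_marginal(c1 for c1, _, _ in data)
p_c2 = mle_marginal(c2 for _, c2, _ in data)
p_score = mle_score_table(data)
print(p_c1, p_c2, p_score)
```

With hidden variables or missing observations, counting is no longer enough, which is where search methods such as EM or Gibbs sampling come in.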

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Figure: network over S (Smart), G (Grades), R (GREs), and A (Admit)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)   (why? because of the conditional independence assumptions)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Figure: network with S (Smart) → G (Grades), S → R (GREs), and G, R → A (Admit)]

P(S):
P(S=F)  P(S=T)
0.5     0.5

P(G|S):
S   P(G=F)  P(G=T)
F   0.9     0.1
T   0.8     0.2
Rough grading: being smart doesn't help very much.

P(R|S):
S   P(R=F)  P(R=T)
F   0.8     0.2
T   0.2     0.8
GREs are better correlated with intelligence than the grades.

P(A|G,R):
G   R   P(A=F)  P(A=T)
F   F   0.9     0.1
F   T   0.8     0.2
T   F   0.5     0.5
T   T   0.2     0.8

P(A=T|S=T) = Σ_G Σ_R P(G|S=T) P(R|S=T) P(A=T|G,R),   G = F,T;  R = F,T
           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
             (terms in the order G,R = F,F; F,T; T,F; T,T)
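The prediction step is a sum over the unobserved variables G and R. A minimal sketch, using the table values as transcribed from this slide (the dictionary and function names are ours):

```python
from itertools import product

# Conditional probability tables as transcribed from the slide (True/False = T/F).
P_S = {False: 0.5, True: 0.5}
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_ADMIT_GIVEN_GR = {(False, False): 0.1, (False, True): 0.2,
                    (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)

def p_admit_given_smart(s):
    """P(A=T | S=s): sum out G and R."""
    return sum(P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * P_ADMIT_GIVEN_GR[(g, r)]
               for g, r in product((False, True), repeat=2))

print(f"P(A=T | S=T) = {p_admit_given_smart(True):.3f}")
```

The four products inside the sum are exactly the four terms written out on the slide.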

97

Inference with Bayes Nets

[Figure: network with S (Smart) → G (Grades), S → R (GREs), and G, R → A (Admit)]

P(A=T) = Σ_S Σ_G Σ_R P(S) P(G|S) P(R|S) P(A=T|G,R),   S = F,T;  G = F,T;  R = F,T
       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

where P(S=T) = 0.5, P(S=F) = 0.5, P(A=T|S=T) = 0.29 (calculated on the previous slide), and P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T)).

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
or, using Bayes' rule,
           = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Figure: network with S (Smart) → G (Grades), S → R (GREs), and G, R → A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

P(R=F|A=F) = P(R=F, A=F) / P(A=F)
           = [Σ_S Σ_G P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_S Σ_G Σ_R P(S) P(G|S) P(R|S) P(A=F|G,R)   (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6
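The "tedious but straightforward" sums can be delegated to brute-force enumeration of the joint distribution. The sketch below is our own illustration using the tables as transcribed here; exact enumeration with these values gives results close to, but not identical with, the rounded figures quoted on the slides, so treat the slide numbers as approximate.

```python
from itertools import product

TF = (False, True)
P_S = {False: 0.5, True: 0.5}
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}   # P(G | S)
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}   # P(R | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}                      # P(A=T | G, R)

def joint(s, g, r, a):
    """P(S, G, R, A) from the chain rule for this network."""
    p_a = P_A_TRUE[(g, r)] if a else 1 - P_A_TRUE[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * p_a

def prob(query, evidence):
    """P(query | evidence); both are dicts such as {"S": True}."""
    def total(fixed):
        return sum(joint(s, g, r, a)
                   for s, g, r, a in product(TF, repeat=4)
                   if all(dict(S=s, G=g, R=r, A=a)[k] == v for k, v in fixed.items()))
    return total({**query, **evidence}) / total(evidence)

p_smart_given_admit = prob({"S": True}, {"A": True})
p_bad_grades = prob({"G": False}, {"A": False})
p_bad_gres = prob({"R": False}, {"A": False})
print(p_smart_given_admit, p_bad_grades, p_bad_gres)
```

Enumeration scales as 2^N, so it only works for toy networks; it is useful here as a check on hand calculations.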

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Figure: “P1-P2 REAL” (cause, often “hidden”) is the parent of “Detected by X1”, …, “Detected by Xn” (effects, observed)]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1.5: 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. “Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations.” Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: two patterns; one is labeled “More likely to interact”]

Courtesy of Cokus et al. License: CC-BY.
Source: Cokus, Shawn, Sayaka Mizutani, et al. “An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles.” BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

[Figure from Jansen et al. (2003)]

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

108

[Figure from Jansen et al. (2003)]

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

109

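$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

$$\text{log likelihood ratio} = \log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i=1}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i=1}^{M} \log \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$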

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114 / 2150}{81924 / 573734}$$
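The likelihood for one feature value is a ratio of two relative frequencies in the gold-standard sets. A small sketch with a hypothetical helper function; the counts are the ones read from this slide's example.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as read from the slide's example.
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(f"L = {L:.2f}")
```

A value above 1 means the feature value is enriched among gold-standard positives relative to negatives.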

111

112

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

113

Fully connected → compute probabilities for all 16 possible combinations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.
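With four binary data sets there are 2^4 = 16 detection patterns, and a fully connected model estimates one likelihood ratio per pattern instead of multiplying per-data-set ratios. The sketch below uses a tiny invented gold standard and a hypothetical function name, purely to show the bookkeeping.

```python
from collections import Counter
from itertools import product

def combination_likelihood_ratios(positives, negatives, n_datasets=4):
    """L for every binary detection pattern, e.g. (1, 0, 0, 1), from gold-standard pairs.

    positives / negatives: lists of tuples of 0/1 flags, one flag per data set.
    Patterns never seen among the negatives get None (the ratio is undefined).
    """
    pos_counts, neg_counts = Counter(positives), Counter(negatives)
    ratios = {}
    for pattern in product((0, 1), repeat=n_datasets):
        p_pos = pos_counts[pattern] / len(positives)
        p_neg = neg_counts[pattern] / len(negatives)
        ratios[pattern] = p_pos / p_neg if p_neg > 0 else None
    return ratios

# Hypothetical toy gold standard: 4 data sets, a handful of protein pairs.
positives = [(1, 1, 1, 1), (1, 1, 0, 0), (1, 1, 0, 0), (0, 0, 0, 0)]
negatives = [(0, 0, 0, 0)] * 8 + [(1, 1, 0, 0), (0, 0, 1, 0)]
ratios = combination_likelihood_ratios(positives, negatives)
print(len(ratios), ratios[(1, 1, 0, 0)], ratios[(1, 1, 1, 1)])
```

Many patterns are supported by very few gold-standard pairs, or none, which is why estimates for rare combinations should be read with caution.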

114

Interpret with caution, as numbers are small.

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

115

[Figure annotation: TP = FP]

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


[Figure: structural neighbors NA and NB]

Evaluate based on five measures:
• SIM: structural similarity of NA/MA and NB/MB

© Source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

34

Evaluate based on five measures:
• SIM: structural similarity of NA/MA and NB/MB
• SIZ (number) and COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at the interface
• OL: number of aligned pairs at the interface

© Source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

35

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

© Source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012), doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using “gold standard” interactions

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. “Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale.” Nature 490, no. 7421 (2012): 556-60.

37


We will examine Bayesian classifiers soon

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. “Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale.” Nature 490, no. 7421 (2012): 556-60.

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

“Proteomics: Protein complexes take the bait.” Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002), doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002); Ho, Y. et al. Nature 415, 180-183 (2002)

What are the likely false positives? What are the likely false negatives?

40

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Kumar, Anuj, and Michael Snyder. “Proteomics: Protein Complexes take the Bait.” Nature 415, no. 6868 (2002): 123-4.


Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations:
• overexpression/tagging can influence results
• only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels), tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Gavin, Anne-Claude, Markus Bösche, et al. “Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes.” Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5): 655-62.

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques. Used with permission.
Source: Ratushny, Vladimir, and Erica A. Golemis. “Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System.” Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification – will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5): 655-62.

Courtesy of BioTechniques. Used with permission.
Source: Ratushny, Vladimir, and Erica A. Golemis. “Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System.” Biotechniques 44, no. 5 (2008): 655.

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Figure: Venn diagram of a gold standard and an experiment. Region labels: false negatives; true positives; true positives and false positives]

47

Error Rates

[Figure: Venn diagram of the gold standard, Experiment 1, and Experiment 2. Highlighted region: true positives from gold standard]

48

Data Integration

[Figure: Venn diagram of the gold standard, Experiment 1, and Experiment 2, with regions I to V marked]

I = true positives from gold standard
II = consensus true positives
Fraction of consensus present in gold standard = I/II

49

The data unique to one experiment are a mix of true and false positives; region III is marked within them.

50

Define IV = true positives, V = false positives.

51

Assume I/II = III/IV

52

Estimated Error Rates

[Figure: estimated false positives and true positives, with regions I, II, and III marked]

“How complete are current yeast and human protein-interaction networks?” G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120, doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV
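Given the assumption I/II = III/IV, the number of true positives in the unique data follows from three counts. The sketch below is only an illustration: the function name, the region sizes, and the total size of the unique data are all hypothetical, and it treats the unique data as IV + V.

```python
def estimate_unique_true_positives(n_I, n_II, n_III):
    """Solve I/II = III/IV for IV, the number of true positives in the unique data."""
    if n_I == 0:
        raise ValueError("region I is empty; the ratio I/II is undefined")
    return n_III * n_II / n_I

# Hypothetical region sizes, for illustration only.
n_I, n_II, n_III = 200, 500, 80
n_unique = 1000          # hypothetical: all interactions unique to one experiment

n_IV = estimate_unique_true_positives(n_I, n_II, n_III)   # estimated true positives
n_V = n_unique - n_IV                                      # estimated false positives
false_positive_fraction = n_V / n_unique
print(n_IV, n_V, false_positive_fraction)
```

The estimate is only as good as the no-bias assumption: if the gold standard favors one experiment's kind of interaction, the ratio I/II will not transfer to the unique data.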

Courtesy of BioMed Central Ltd. Used with permission.
Source: Hart, G. Traver, Arun K. Ramani, et al. “How Complete are Current Yeast and Human Protein-interaction Networks?” Genome Biology 7, no. 11 (2006): 120.

53

Courtesy of BioMed Central Ltd. Used with permission.
Source: Hart, G. Traver, Arun K. Ramani, et al. “How Complete are Current Yeast and Human Protein-interaction Networks?” Genome Biology 7, no. 11 (2006): 120.

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

“Comparative assessment of large-scale data sets of protein–protein interactions.” von Mering et al. Nature 417, 399-403 (23 May 2002), doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Von Mering, Christian, Roland Krause, et al. “Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions.” Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

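Bayes Rule

$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}} \; \overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$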

58

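Naïve Bayes Classification

$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}} \; \overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$

likelihood ratio = ratio of posterior probabilities

if > 1 classify as true; if < 1 classify as false

How do we compute this?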

59


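$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

$$\text{log likelihood ratio} = \log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$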

61


We assume the observations are independent (we'll see how to handle dependence soon). We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol. Cell. Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

$$\text{Ranking function} = \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i=1}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i=1}^{M} \log \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$
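Under the independence assumption, the ranking function is a sum of per-observation log likelihood ratios. A minimal sketch follows; the assay names, the detection probabilities, and the protein pairs are all hypothetical values standing in for numbers that would be estimated from gold-standard sets.

```python
import math

# Hypothetical per-assay likelihoods estimated from gold-standard sets:
# P(detected | true_PPI) and P(detected | false_PPI) for each assay.
ASSAYS = {
    "two_hybrid":    {"p_true": 0.20, "p_false": 0.010},
    "mass_spec":     {"p_true": 0.40, "p_false": 0.020},
    "co_expression": {"p_true": 0.60, "p_false": 0.300},
}

def ranking_score(observations):
    """Sum of per-observation log likelihood ratios (naive Bayes, independence assumed).

    observations maps assay name -> True (detected) / False (not detected).
    Assays missing from the dict are skipped, i.e. treated as unobserved.
    """
    score = 0.0
    for assay, detected in observations.items():
        p_true = ASSAYS[assay]["p_true"]
        p_false = ASSAYS[assay]["p_false"]
        if not detected:                       # use P(not detected | class)
            p_true, p_false = 1 - p_true, 1 - p_false
        score += math.log(p_true / p_false)
    return score

pairs = {
    "A-B": {"two_hybrid": True, "mass_spec": True, "co_expression": True},
    "C-D": {"two_hybrid": False, "mass_spec": True},
    "E-F": {"two_hybrid": False, "mass_spec": False, "co_expression": False},
}
ranked = sorted(pairs, key=lambda name: ranking_score(pairs[name]), reverse=True)
print(ranked)
```

Because each assay contributes its own term, a pair does not need to be detected by every assay to rank highly, and missing data simply contribute nothing.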

64

“Comparative assessment of large-scale data sets of protein–protein interactions.” von Mering et al. Nature 417, 399-403 (23 May 2002), doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio.

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Von Mering, Christian, Roland Krause, et al. “Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions.” Nature 417, no. 6887 (2002): 399-403.

65


Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD).

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67
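The two axes of the ROC curve can be computed from scored gold-standard pairs by sweeping a threshold. The sketch below uses invented scores and a hypothetical function name; it illustrates the definitions only and does not reproduce the analysis in Collins et al.

```python
def roc_points(scores_pos, scores_neg):
    """(FPR, TPR) at each score threshold, from high-confidence positives/negatives."""
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(s >= t for s in scores_pos)
        fp = sum(s >= t for s in scores_neg)
        tpr = tp / len(scores_pos)      # TP / (TP + FN)
        fpr = fp / len(scores_neg)      # FP / (TN + FP)
        points.append((fpr, tpr))
    return points

# Hypothetical scores for gold-standard pairs.
positives = [0.9, 0.8, 0.6, 0.3]
negatives = [0.7, 0.4, 0.2, 0.1]
curve = roc_points(positives, negatives)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1); a better scoring function rises faster on the left.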

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Collins, Sean R., Patrick Kemmeren, et al. “Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae.” Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: “9074 interactions”; “Literature curated (excluding 2-hybrid)”; “Error rate equivalent to literature curated”.]

68

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Collins, Sean R., Patrick Kemmeren, et al. “Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae.” Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotation: “9074 interactions”.]

69

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Collins, Sean R., Patrick Kemmeren, et al. “Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae.” Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Yang, Xia, Joshua L. Deignan, et al. “Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks.” Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1) => 2^N − 1 parameters
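To make the growth concrete, the sketch below counts free parameters for a full joint table and, for comparison, for a network in which each binary node depends only on its parents. The function names and the example network are our own illustration.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters in a full joint table over N binary variables: 2^N - 1."""
    return 2 ** n_binary_variables - 1

def bayes_net_parameters(parents):
    """Free parameters in a Bayesian network over binary variables.

    parents maps each node to the list of its parents; a node with k binary
    parents needs one free parameter per parent configuration, i.e. 2^k.
    """
    return sum(2 ** len(p) for p in parents.values())

# Hypothetical naive-Bayes-style network: one hidden cause, four observed effects.
naive = {"PPI": [], "X1": ["PPI"], "X2": ["PPI"], "X3": ["PPI"], "X4": ["PPI"]}

print(full_joint_parameters(5), bayes_net_parameters(naive))
print(full_joint_parameters(30))
```

For five variables the saving is modest, but the full table doubles with every added variable while the factored count grows only with the number of nodes and their parents.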

77

Graphical Structure Expresses our Beliefs

[Figure: “P1-P2 REAL” (cause, often “hidden”) is the parent of “Detected by X1”, …, “Detected by Xn” (effects, observed)]

78


Graphical Structure Expresses our Beliefs

[Figure: two networks. First: “P1-P2 REAL” with children “Detected by X1”, …, “Detected by Xn”. Second: “P1-P2 REAL” with children “Detected by X1”, …, “Detected by Xn” and “Detected by Xj”, “Detected by Xj+1”, …, plus the nodes “Membrane” and “Highly expressed”]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

80

Graphical Structure Expresses our Beliefs

[Figure: two networks. First: “P1-P2 REAL” with children “Detected by X1”, …, “Detected by Xn”. Second: “P1-P2 REAL” with children “Detected by X1”, …, “Detected by Xn” and “Detected by Xj”, “Detected by Xj+1”, …, plus the nodes “Membrane” and “Highly expressed”]

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data.

81

Graphical Structure

In a Bayesian network we don't need the full probability distribution: a node is independent of its ancestors given its parents. For example, the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Prediction we observe the ldquocausesrdquo (roots parents) and want to predict the ldquoeffectsrdquo (leaveschildren)

Given Grades GREs will we admit

Inference we observe the ldquoeffectsrdquo (leaves children) and want to infer the hidden values of the ldquocausesrdquo (rootsparents)

You meet an admitted the student Is she smart

Admit

83

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictionsinferences requires knowing the joint probabilities P(admit grade inflation smart grades GREs) We will find conditional probabilities to be very useful

84

S G (above

threshold)

P(SG)

F F 0665

F T 0035

T F 006

T T 024

Automated Admissions Decisions Joint Probability

S P(G=F|S) P(G=T|S)

F 095 005

T 02 08

P(S=F)=07 P(S=T)=03

Grades

Smart

Conditional Probability

Formulations are equivalent and both require same number of constraints 85

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we donrsquot need the full probability distribution The joint probability depends only on ldquoparentsrdquo For example GRE scores do not depend on the level of grade inflation at the school but the grades do

Admit

86

ldquoExplaining Awayrdquo Season

S Sprinkler

R Rain

Grass wet

Does the probability that itrsquos raining depend on whether the sprinkler is on

Slippery

In a causal sense clearly not But in a probabilistic model the knowledge that it is raining influences our beliefs

87

Coin 1

Coin 2

Score

p(C2=H)=p(C2=T) = 5 p(C1=H)=p(C1=T) = 5

C1 C2 Score

H H 1

T T 1

H T 0

T H 0

Does the prob that C1 =H on depend on whether C2 =T If we know the score then our belief in the state of C1 is influenced by our belief in the state C2

ldquoExplaining Awayrdquo

88

How do we obtain a BN

bull Two problems ndash learning graph structure

bull NP-complete bull approximation algorithms

ndash probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

bull Two common objective functions ndash Maximum likelihood

bull Define the likelihood over training data Xi

ndash Maximum posterior

bull Good search algorithms exist ndash Gradient descent EM Gibbs Sampling hellip

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D= (C1C2S) = (HT0) (HH1) hellip

92

Learning Models from Data

bull Searching for the BN structure NP-complete ndash Too many possible structures to evaluate all of

them even for very small networks ndash Many algorithms have been proposed ndash Incorporated some prior knowledge can reduce

the search space bull Which measurements are likely independent bull Which nodes should regulate transcription bull Which should cause changes in phosphorylation

93

Resources to learn more

Kevin Murphyrsquos tutorial httpwwwcsubcca~murphykBayesbnintrohtml

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets S

P(F) P(T) 05 05

G S P(F) P(T) F 09 01 T 08 02 R

S P(F) P(T) F 08 02 T 02 08

A G R P(F) P(T) F F 09 01 F T 08 02 T F 05 05 T T 02 08

P(A=T|S=T) = ΣΣ P(G|S=T)P(R|S=T)P(A=T|GR) G=FT R=FT

= (08)(02)(01) + (08)(08)(02) + (02)(02)(05) + (02)(08)(08) =29 F F F T T F T T

Rough grading being smart doesnrsquot help very much

GREs are better correlated with intelligence than the grades

G Grade

s

S Smart

R GREs

A Admit

97

Inference with Bayes Nets

[Diagram nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

$P(A=T) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=T \mid G, R)$

$= P(S=T)\,P(A=T \mid S=T) + P(S=F)\,P(A=T \mid S=F) = 0.21$

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on the previous slide)
P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T))

$P(S=T \mid A=T) = \dfrac{P(S=T, A=T)}{P(A=T)}$, or, using Bayes’ rule, $= \dfrac{P(S=T)\,P(A=T \mid S=T)}{P(A=T)}$

98

Inference with Bayes Nets

[Diagram nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$P(R=F \mid A=F) = \dfrac{P(R=F, A=F)}{P(A=F)} = \dfrac{\sum_{S} \sum_{G} P(S)\,P(G \mid S)\,P(R=F \mid S)\,P(A=F \mid G, R=F)}{P(A=F)}$

$P(A=F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=F \mid G, R)$ (as before)

Tedious but straightforward to compute

$\dfrac{P(G=F \mid A=F)}{P(R=F \mid A=F)} = \dfrac{0.92}{0.56} = 1.6$
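The "tedious but straightforward" sums can be delegated to brute-force enumeration over all 16 joint states. This is a minimal sketch assuming the tables from the worked example with decimal points restored; because the tables are read from garbled text, the exact values it yields may differ slightly from the rounded figures quoted on the slides. The helper names `joint` and `prob` are illustrative.

```python
from itertools import product

p_s = {False: 0.5, True: 0.5}
p_g_given_s = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
p_r_given_s = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
p_a_true_given_gr = {(False, False): 0.1, (False, True): 0.2,
                     (True, False): 0.5, (True, True): 0.8}

def joint(s, g, r, a):
    """P(S, G, R, A) from the chain rule for this network."""
    p_a_true = p_a_true_given_gr[(g, r)]
    p_a = p_a_true if a else 1.0 - p_a_true
    return p_s[s] * p_g_given_s[s][g] * p_r_given_s[s][r] * p_a

def prob(**fixed):
    """Probability of the fixed assignments (keys s, g, r, a), summing out the rest."""
    total = 0.0
    for s, g, r, a in product([False, True], repeat=4):
        world = {"s": s, "g": g, "r": r, "a": a}
        if all(world[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total

p_bad_gre_given_reject = prob(r=False, a=False) / prob(a=False)
p_bad_grades_given_reject = prob(g=False, a=False) / prob(a=False)
p_smart_given_admit = prob(s=True, a=True) / prob(a=True)
```

Enumeration is exponential in the number of variables, so it only works for toy networks like this one.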

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: P1-P2 REAL, the cause (often “hidden”), with effects (observed): Detected by X1, …, Detected by Xn]

101

Properties of real interactions: correlated expression
Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1(5): 349–356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

102

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349–56. © American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns, one labeled “More likely to interact”]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449–53.

108

109

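likelihood ratio = $\dfrac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \dfrac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\,P(\text{Data} \mid \text{false\_PPI})}$; if > 1, classify as true; if < 1, classify as false

log likelihood ratio = $\log \dfrac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = $\log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i=1}^{M} \dfrac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$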

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

Likelihood = $L = \dfrac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \dfrac{1114 / 2150}{81924 / 573734}$
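The likelihood ratio for one category of a discrete feature is a ratio of two fractions counted in the gold-standard sets. The sketch below assumes that the run-together digits on this slide are the counts 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives for one essentiality category; that reading is an interpretation of garbled text, and `likelihood_ratio` is a name chosen for this example.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), each estimated as a fraction of the gold standard."""
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg

# Assumed reading of the slide's figures (see the note above).
L = likelihood_ratio(1114, 2150, 81924, 573734)
```

A value above 1 means the feature value is enriched among known interactions relative to known non-interactions.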

111

112


113

Fully connected → compute probabilities for all 16 possible combinations
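For a fully connected network over four binary data sources, each of the 2^4 = 16 joint outcomes gets its own likelihood ratio, estimated directly from gold-standard counts. The following is a sketch under stated assumptions: the toy gold-standard lists are invented, and the pseudocount is a smoothing choice made here to avoid dividing by zero, not something taken from the slides.

```python
from collections import Counter
from itertools import product

def full_likelihood_ratios(positives, negatives, n_sources=4, pseudocount=1.0):
    """Likelihood ratio for every joint outcome of n binary data sources.

    positives / negatives: lists of 0/1 tuples, one tuple per gold-standard pair.
    """
    pos_counts, neg_counts = Counter(positives), Counter(negatives)
    n_combos = 2 ** n_sources
    ratios = {}
    for combo in product([0, 1], repeat=n_sources):
        p_pos = (pos_counts[combo] + pseudocount) / (len(positives) + pseudocount * n_combos)
        p_neg = (neg_counts[combo] + pseudocount) / (len(negatives) + pseudocount * n_combos)
        ratios[combo] = p_pos / p_neg
    return ratios

# Invented toy gold standard: which of four experiments detected each pair.
toy_positives = [(1, 1, 1, 1)] * 3 + [(1, 0, 1, 0)]
toy_negatives = [(0, 0, 0, 0)] * 5 + [(1, 0, 0, 0)]
ratios = full_likelihood_ratios(toy_positives, toy_negatives)
```

Because many of the 16 cells contain few or no gold-standard pairs, the individual ratios are noisy, which is the reason for caution with small numbers.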

114

Interpret with caution as numbers are small


115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Evaluate based on five measures:
• SIM: structural similarity of NA to MA and of NB to MB
• SIZ (number) and COV (fraction) of interaction pairs that can be aligned anywhere
• OS: subset of SIZ at the interface
• OL: number of aligned pairs at the interface

© Source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

35

“The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces).”

36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for the structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train a Bayesian classifier using “gold standard” interactions

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556–60.

37

We will examine Bayesian classifiers soon


38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123–124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141–147 (2002); Ho, Y. et al. Nature 415, 180–183 (2002)

What are the likely false positives? What are the likely false negatives?

40

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes Take the Bait." Nature 415, no. 6868 (2002): 123–4.

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels): tandem purification
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141–7.

43

Yeast two-hybrid

BioTechniques 2008 Apr; 44(5): 655–62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." BioTechniques 44, no. 5 (2008): 655.

44

• Does not require purification, so it will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

45

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard and Experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

47

Error Rates

[Venn diagram: Gold standard, Experiment 1, Experiment 2. Overlaps with the gold standard are labeled “True positives from gold standard”.]

48

Data Integration

[Venn diagram: Gold standard, Experiment 1, Experiment 2, with regions I and II]

I = true positives from gold standard
II = consensus true positives

Fraction of consensus present in gold standard = I/II

49

Data Integration

[Venn diagram labels: Gold standard, Experiment 1, I, II, III, “Mix of true and false positives”]

I = true positives from gold standard
II = consensus true positives

Fraction of consensus present in gold standard = I/II

50

Data Integration

[Venn diagram labels: Gold standard, Experiment 1, I, II, III, IV (true), V (false)]

I = true positives from gold standard
II = consensus true positives

Fraction of consensus present in gold standard = I/II

Define IV = true positives, V = false positives

51

Data Integration

[Venn diagram labels: Gold standard, Experiment 1, I, II, III, IV (true), V (false)]

I = true positives from gold standard
II = consensus true positives

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

52

Estimated Error Rates

[Figure: estimated true positives and false positives, with regions I, II and III]

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$\dfrac{I}{II} = \dfrac{III}{IV}$

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.
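The proportionality assumption I/II = III/IV can be turned into a small calculation. In this sketch, region III is taken to be the part of one experiment's unique data that is found in the gold standard, IV the (unknown) number of true positives in that unique data, and V the remaining false positives; this reading of the regions, the function name and the counts used are all assumptions for illustration.

```python
def estimate_false_positives(n_i, n_ii, n_iii, n_unique):
    """Solve I/II = III/IV for IV, then V = unique - IV.

    n_i:      consensus pairs that are also in the gold standard (region I)
    n_ii:     all consensus pairs, assumed to be true positives (region II)
    n_iii:    experiment-only pairs that are in the gold standard (region III)
    n_unique: all experiment-only pairs (IV + V)
    """
    n_iv = n_iii * n_ii / n_i       # estimated true positives among unique pairs
    n_v = n_unique - n_iv           # estimated false positives
    return n_iv, n_v, n_v / n_unique

# Hypothetical counts, not taken from the paper.
true_pos, false_pos, false_pos_fraction = estimate_false_positives(100, 400, 50, 1000)
```

The estimate is only as good as the assumption that the gold standard samples the consensus and unique data without bias.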

53

[Figure from Hart et al., Genome Biology 7, no. 11 (2006): 120. Courtesy of BioMed Central Ltd. Used with permission.]

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399–403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399–403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

Bayes Rule

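$P(\text{true\_PPI} \mid \text{Data}) = \dfrac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data})}$

posterior: $P(\text{true\_PPI} \mid \text{Data})$; prior: $P(\text{true\_PPI})$; likelihood: $P(\text{Data} \mid \text{true\_PPI})$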

58

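Naïve Bayes Classification

$P(\text{true\_PPI} \mid \text{Data}) = \dfrac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data})}$ (posterior, prior, likelihood)

likelihood ratio = ratio of posterior probabilities

if > 1, classify as true; if < 1, classify as false

How do we compute this?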

59

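likelihood ratio = $\dfrac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \dfrac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\,P(\text{Data} \mid \text{false\_PPI})}$; if > 1, classify as true; if < 1, classify as false

log likelihood ratio = $\log \dfrac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = $\log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$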

61

We assume the observations are independent (we’ll see how to handle dependence soon).

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function = $\log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i=1}^{M} \dfrac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$
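Under the independence assumption the log of the product becomes a sum of per-observation log ratios, which is easy to compute. The sketch below uses made-up conditional probabilities for two assays; in practice these would be estimated from high-confidence positive and negative sets, and the names `log_likelihood_ratio`, `p_given_true` and `p_given_false` are illustrative.

```python
import math

def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Naive Bayes ranking score: sum of per-observation log likelihood ratios."""
    return sum(
        math.log(p_given_true[name][value] / p_given_false[name][value])
        for name, value in observations.items()
    )

# Hypothetical values of P(observation | true_PPI) and P(observation | false_PPI).
p_given_true = {"two_hybrid": {True: 0.3, False: 0.7},
                "mass_spec": {True: 0.6, False: 0.4}}
p_given_false = {"two_hybrid": {True: 0.01, False: 0.99},
                 "mass_spec": {True: 0.05, False: 0.95}}

score_both = log_likelihood_ratio({"two_hybrid": True, "mass_spec": True},
                                  p_given_true, p_given_false)
score_ms_only = log_likelihood_ratio({"two_hybrid": False, "mass_spec": True},
                                     p_given_true, p_given_false)
```

A pair detected by only one assay still receives a positive score here, so it can be ranked rather than discarded outright.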

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399–403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio.

65

Collins et al., Mol Cell Proteomics 2007

ROC curve

[Figure: ROC curve. Vertical axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Horizontal axis: FP/(TN+FP), computed using high-confidence negatives.]

True positives = ?   True negatives = ?

66

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439–50. © American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Collins et al., Mol Cell Proteomics 2007

ROC curve

[Figure: ROC curve. Vertical axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Horizontal axis: FP/(TN+FP), computed using high-confidence negatives.]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67
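The two axes of the ROC curve, TP/(TP+FN) and FP/(TN+FP), can be traced by sweeping a score threshold over a labeled gold standard. This is a minimal sketch with invented scores and labels; it does not handle tied scores specially, and `roc_points` is a name chosen for this example.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold one pair at a time.

    labels: True for gold-standard positives, False for gold-standard negatives.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, is_pos in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if is_pos:
            tp += 1
        else:
            fp += 1
        # FPR = FP / (TN + FP), TPR = TP / (TP + FN)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Invented interaction scores and gold-standard labels.
points = roc_points([0.9, 0.8, 0.3, 0.1], [True, False, True, False])
```

A scoring scheme that ranks all positives above all negatives would reach the top-left corner before moving right.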

Collins et al., Mol Cell Proteomics 2007

ROC curve

[Figure: ROC curve. Vertical axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Horizontal axis: FP/(TN+FP), computed using high-confidence negatives. Annotations: 9074 interactions; literature curated (excluding 2-hybrid); error rate equivalent to literature curated.]

68

Collins et al., Mol Cell Proteomics 2007

ROC curve

[Figure: ROC curve. Vertical axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Horizontal axis: FP/(TN+FP), computed using high-confidence negatives. Annotation: 9074 interactions.]

69

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415–23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1)
  ⇒ $2^N - 1$ parameters
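A quick comparison shows why the full joint table becomes impractical. The sketch contrasts the 2^N − 1 free parameters of a full joint over N binary variables with a naive Bayes model that has one hidden binary cause and binary observations; the naive Bayes count (one prior plus two conditionals per observation) is a standard result added here for illustration, and both function names are made up.

```python
def full_joint_parameters(n_variables):
    """Free parameters in a full joint table over n binary variables."""
    return 2 ** n_variables - 1

def naive_bayes_parameters(n_observations):
    """One prior for the binary cause plus P(obs=1 | cause) for each cause value."""
    return 1 + 2 * n_observations

# One hidden cause plus 4, 10 or 20 binary observations
comparison = {n: (full_joint_parameters(n + 1), naive_bayes_parameters(n))
              for n in (4, 10, 20)}
```

The graph structure is what buys the reduction: each independence assumption removes parameters that would otherwise have to be estimated from data.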

77

Graphical Structure Expresses our Beliefs

[Diagram: P1-P2 REAL, the cause (often “hidden”), with effects (observed): Detected by X1, …, Detected by Xn]

78

Graphical Structure Expresses our Beliefs

[Diagram, first network: P1-P2 REAL with Detected by X1, …, Detected by Xn. Second network: P1-P2 REAL with Detected by X1, …, Detected by Xn, Detected by Xj, Detected by Xj+1, …, plus additional nodes Membrane and Highly expressed.]

79

Graphical Structure Expresses our Beliefs

[Diagram: the two networks from the previous slide]

Naïve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs

[Diagram: the two networks from the previous slide]

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Diagram nodes: Grade inflation, Smart, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram nodes: Grade inflation, Smart, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram nodes: Smart, Grades]

Joint Probability

S  G (above threshold)  P(S,G)
F  F                    0.665
F  T                    0.035
T  F                    0.06
T  T                    0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S  P(G=F|S)  P(G=T|S)
F  0.95      0.05
T  0.2       0.8

Formulations are equivalent, and both require the same number of constraints.

85
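The equivalence of the two formulations can be checked numerically: multiplying the prior by the conditional table gives the joint table, and normalizing the joint by its marginal recovers the conditional. This sketch uses the slide's numbers with decimal points restored, with `True`/`False` standing for T/F.

```python
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {False: {False: 0.95, True: 0.05},
               True: {False: 0.2, True: 0.8}}

# Conditional formulation -> joint table P(S, G)
joint = {(s, g): p_s[s] * p_g_given_s[s][g]
         for s in (False, True) for g in (False, True)}

# Joint table -> marginal P(S) and conditional P(G | S)
marginal_s = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}
recovered = {s: {g: joint[(s, g)] / marginal_s[s] for g in (False, True)}
             for s in (False, True)}
```

Both forms need three free numbers here: three independent joint entries, or one prior plus two conditionals.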

Automated Admissions Decisions

[Diagram nodes: Grade inflation, Smart, Grades, GREs, Admit]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

Admit

86

“Explaining Away”

[Diagram nodes: Season, S (Sprinkler), R (Rain), Grass wet, Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

“Explaining Away”

[Diagram nodes: Coin 1, Coin 2, Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1  C2  Score
H   H   1
T   T   1
H   T   0
T   H   0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
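The two-coin example can be verified by enumerating the four equally likely outcomes. This is a small sketch assuming fair, independent coins and a score of 1 when the coins match, as in the table; the helper names `score` and `prob` are illustrative.

```python
from itertools import product

def score(c1, c2):
    """Score is 1 when the two coins match, 0 otherwise."""
    return 1 if c1 == c2 else 0

# Four equally likely worlds: (C1, C2, Score, probability)
worlds = [(c1, c2, score(c1, c2), 0.25) for c1, c2 in product("HT", repeat=2)]

def prob(predicate):
    """Total probability of the worlds satisfying the predicate."""
    return sum(p for c1, c2, s, p in worlds if predicate(c1, c2, s))

# Without the score, C2 tells us nothing about C1.
p_c1_h = prob(lambda c1, c2, s: c1 == "H")
p_c1_h_given_c2_t = (prob(lambda c1, c2, s: c1 == "H" and c2 == "T")
                     / prob(lambda c1, c2, s: c2 == "T"))

# Once the score is known, C2 determines C1.
p_c1_h_given_s1 = (prob(lambda c1, c2, s: c1 == "H" and s == 1)
                   / prob(lambda c1, c2, s: s == 1))
p_c1_h_given_s1_c2_t = (prob(lambda c1, c2, s: c1 == "H" and s == 1 and c2 == "T")
                        / prob(lambda c1, c2, s: s == 1 and c2 == "T"))
```

The coins are independent until their common effect is observed; conditioning on the score couples them.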

88

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D= (C1C2S) = (HT0) (HH1) hellip

92

Learning Models from Data

bull Searching for the BN structure NP-complete ndash Too many possible structures to evaluate all of

them even for very small networks ndash Many algorithms have been proposed ndash Incorporated some prior knowledge can reduce

the search space bull Which measurements are likely independent bull Which nodes should regulate transcription bull Which should cause changes in phosphorylation

93

Resources to learn more

Kevin Murphyrsquos tutorial httpwwwcsubcca~murphykBayesbnintrohtml

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets S

P(F) P(T) 05 05

G S P(F) P(T) F 09 01 T 08 02 R

S P(F) P(T) F 08 02 T 02 08

A G R P(F) P(T) F F 09 01 F T 08 02 T F 05 05 T T 02 08

P(A=T|S=T) = ΣΣ P(G|S=T)P(R|S=T)P(A=T|GR) G=FT R=FT

= (08)(02)(01) + (08)(08)(02) + (02)(02)(05) + (02)(08)(08) =29 F F F T T F T T

Rough grading being smart doesnrsquot help very much

GREs are better correlated with intelligence than the grades

G Grade

s

S Smart

R GREs

A Admit

97

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(A=T) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) S=FT G=FT

R=FT

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) = 021 P(S=T) = 05 P(S=F) = 05 P(A=T|S=F) calculated analogously to P(A=T|S=T)

P(S=T) = 05 P(A=T|S=T) calculated on previous slide =29 P(A=T|S=F) =14

P(S=T|A=T) = P(S=TA=T)P(A=T) Or using Bayes Rule = P(S=T)P(A=T|S=T)P(A=T)

98

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

If a student is not admitted is it more likely they had bad GREs or bad grades

Compute P(R=F|A=F) and P(G=F|A=T)

P(R=F|A=F) = P(R=FA=F) P(A=F) = [Σ Σ P(S)P(G=F)P(R)P(A=F)]P(A=F) G=FT

R=FT

P(A=F) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) (as before)

Tedious but straightforward to compute

P(G=F|A=T) = 92 = 16 P(R=F|A=F) 56

99

End of worked example

100

Goal

bull Estimate interaction probability using ndash Affinity capture ndash Two-hybrid ndash Less physical data

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed) 101

Properties of real interactions correlated expression Expression Profile Reliability (EPR)

Deane et al Mol amp Cell Proteomics (2002) 15 349-356

INT = high confidence interactions from

small scale experiments

d = ldquodistancerdquo that measures the difference between two mRNA expression profiles

Note proteins involved in ldquotruerdquo protein-protein interactions have more similar mRNA expression profiles than random pairs Use this to assess how good an experimental set

of interactions is

Source Deane Charlotte M Łukasz Salwiński et al Protein Interactions Two Methods for Aassessment ofthe Reliability of High Throughput Observations Molecular amp Cellular Proteomics 1 no 5 (2002) 349-56

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al BMC Bioinformatics 2007 8(Suppl 4)S7 doi1011861471-2105-8-S4-S7

More likely to interact

Which pattern below is more likely to represent a pair of interacting proteins

104

Courtesy of Cokus et al License CC-BYSource Cokus Shawn Sayaka Mizutani et al An Improved Method for Identifying FunctionallyLinked Proteins using Phylogenetic Profiles BMC Bioinformatics 8 no Suppl 4 (2007) S7

103

Rosetta Stone

bull Look for genes that are fused in some organisms ndash Almost 7000 pairs found in

E coli ndash gt6 of known interactions

can be found with this method

ndash Not very common in eukaryotes

104

Integrating diverse data

httpwwwsciencemagorgcontent3025644449abstract 105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

109

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

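likelihood ratio = ratio of posterior probabilities:

\[ \frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\,P(\mathrm{false\_PPI})} \]

if > 1 classify as true; if < 1 classify as false

\[ \text{log likelihood ratio} = \log\frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

Prior probability is the same for all interactions, so it does not affect ranking.

\[ \text{Ranking function} = \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})} \]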

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

Likelihood ratio: L = P(f | pos) / P(f | neg)

Example values on the slide: P(f | pos) = 1114/2150 and P(f | neg) = 81924/573734
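The likelihood ratio for one feature value is just a ratio of two frequencies, one measured in the gold-standard positives and one in the gold-standard negatives. The sketch below assumes the digits on the slide read as 1114 of 2150 positive pairs and 81924 of 573734 negative pairs; that reading, and the function name, are assumptions.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as read from the slide (assumed reading of the run-together digits).
lr = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(lr, 2))
```

A value above 1 means the feature value is seen more often among true interactions than among non-interactions, so observing it raises the score of a candidate pair.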

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected → compute probabilities for all 16 possible combinations
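With four binary evidence sources there are 2^4 = 16 joint outcomes, and a fully connected model needs a likelihood ratio for each one. A minimal sketch follows; the pseudocount smoothing and the toy gold-standard lists are assumptions added so that sparsely populated combinations do not divide by zero.

```python
from collections import Counter
from itertools import product

def combination_likelihood_ratios(positives, negatives, n_sources=4, pseudocount=1.0):
    """Likelihood ratio for every joint outcome of n_sources binary data sets."""
    combos = list(product((0, 1), repeat=n_sources))
    pos_counts = Counter(positives)
    neg_counts = Counter(negatives)
    ratios = {}
    for combo in combos:
        # Smoothed estimates of P(combo | positive) and P(combo | negative).
        p_pos = (pos_counts[combo] + pseudocount) / (len(positives) + pseudocount * len(combos))
        p_neg = (neg_counts[combo] + pseudocount) / (len(negatives) + pseudocount * len(combos))
        ratios[combo] = p_pos / p_neg
    return ratios

# Hypothetical gold-standard pairs, each described by four 0/1 detections.
toy_positives = [(1, 1, 1, 1)] * 3 + [(0, 0, 0, 0)]
toy_negatives = [(0, 0, 0, 0)] * 9 + [(1, 1, 1, 1)]
ratios = combination_likelihood_ratios(toy_positives, toy_negatives)
```

Combinations that occur only a handful of times in the gold standard give unstable ratios, which is why small counts need careful interpretation.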

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

\[ \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology, Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


"The final two scores reflect whether the residues that appear in the model interface have properties consistent with those that mediate known PPIs (for example, residue type, evolutionary conservation, or statistical propensity to be in protein–protein interfaces)."

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

36

Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

37

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train Bayesian classifier using "gold standard" interactions

We will examine Bayesian classifiers soon

Courtesy of Macmillan Publishers Limited Used with permissionSource Zhang Qiangfeng Cliff Donald Petrey et al Structure-based Prediction of Protein-proteinInteractions on a Genome-wide Scale Nature 490 no 7421 (2012) 556-60

38

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

39

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002); Ho, Y. et al. Nature 415, 180-183 (2002)

Detecting protein-protein interactions

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

40

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations

41

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels)
Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

43

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5): 655-62

How does this compare to mass-spec based approaches?

Yeast two-hybrid

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification, so it will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Ratushny V, Golemis E. Biotechniques 2008 Apr; 44(5): 655-62

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: "Gold standard" and "Experiment" overlap. Region labels: "False negatives", "True positives", and "True positives and false positives"]

47

Error Rates

Gold standard

Experiment 1

Experiment 2

True positives

from gold standard

48

Data Integration

Gold standard

Experiment 1

Experiment 2

I = true positives from gold standard

II = consensus true positives

Fraction of consensus present in gold standard = I/II

49

Data Integration

Gold standard

Experiment 1

I = true positives from gold standard

II = consensus true positives

Fraction of consensus present in gold standard = I/II

[Venn diagram labels: I, II, III; the region outside the consensus is a mix of true and false positives]

50

Data Integration

Gold standard

Experiment 1

I = true positives from gold standard

II = consensus true positives

Fraction of consensus present in gold standard = I/II

Define IV = true positives, V = false positives

[Venn diagram labels: I, II, III, IV (true), V (false)]

51

Data Integration

Gold standard

Experiment 1

I = true positives from gold standard

II = consensus true positives

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

[Venn diagram labels: I, II, III, IV (true), V (false)]

52

Estimated Error Rates

False positives

True positives

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

I=

II=

III=

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV
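The proportionality assumption can be solved for the unknown number of true positives among the interactions unique to one experiment. The sketch below uses made-up counts and hypothetical function names; region labels follow the slide (I, II, III, IV, V).

```python
def estimate_unique_true_positives(region_i, region_ii, region_iii):
    """Solve I/II = III/IV for IV, the true positives unique to one experiment."""
    if region_i <= 0:
        raise ValueError("region I must be positive to form the ratio I/II")
    return region_iii * region_ii / region_i

def estimate_unique_false_positives(unique_total, region_iv):
    """V = unique interactions that are not estimated to be true positives."""
    return max(unique_total - region_iv, 0.0)

# Hypothetical counts, for illustration only.
region_i = 50        # consensus interactions also in the gold standard
region_ii = 200      # all consensus interactions (assumed true)
region_iii = 30      # unique interactions also in the gold standard
unique_total = 400   # all interactions unique to this experiment

region_iv = estimate_unique_true_positives(region_i, region_ii, region_iii)
region_v = estimate_unique_false_positives(unique_total, region_iv)
false_positive_rate = region_v / unique_total
```

The estimate depends entirely on the gold standard sampling the common and unique data without bias, as the slide states.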

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

Bayes Rule

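\[ \underbrace{P(\mathrm{true\_PPI} \mid \mathrm{Data})}_{\text{posterior}} = \frac{\overbrace{P(\mathrm{true\_PPI})}^{\text{prior}}\;\overbrace{P(\mathrm{Data} \mid \mathrm{true\_PPI})}^{\text{likelihood}}}{P(\mathrm{Data})} \]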

58

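Naïve Bayes Classification

\[ \underbrace{P(\mathrm{true\_PPI} \mid \mathrm{Data})}_{\text{posterior}} = \frac{\overbrace{P(\mathrm{true\_PPI})}^{\text{prior}}\;\overbrace{P(\mathrm{Data} \mid \mathrm{true\_PPI})}^{\text{likelihood}}}{P(\mathrm{Data})} \]

likelihood ratio = ratio of posterior probabilities

if > 1 classify as true; if < 1 classify as false

How do we compute this?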

59

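likelihood ratio = ratio of posterior probabilities:

\[ \frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\,P(\mathrm{false\_PPI})} \]

if > 1 classify as true; if < 1 classify as false

\[ \text{log likelihood ratio} = \log\frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

Prior probability is the same for all interactions, so it does not affect ranking.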

60

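likelihood ratio = ratio of posterior probabilities:

\[ \frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\,P(\mathrm{false\_PPI})} \]

if > 1 classify as true; if < 1 classify as false

\[ \text{log likelihood ratio} = \log\frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

Prior probability is the same for all interactions, so it does not affect ranking.

\[ \text{Ranking function} = \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]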

61

We assume the observations are independent (we'll see how to handle dependence soon).

\[ \text{Ranking function} = \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

62

We assume the observations are independent (we'll see how to handle dependence soon).

\[ \text{Ranking function} = \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})} \]

63

We assume the observations are independent (we'll see how to handle dependence soon). We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol. Cell. Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

\[ \text{Ranking function} = \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})} \]
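Under the independence assumption the log of the product becomes a sum of per-observation log-likelihood ratios, which is easy to compute. The sketch below is illustrative only: the observation names and all probabilities are hypothetical, not values from Collins et al.

```python
import math

def ranking_score(observations, p_given_true, p_given_false):
    """Sum of log-likelihood ratios over independent observations (naive Bayes)."""
    score = 0.0
    for name, value in observations.items():
        score += math.log(p_given_true[name][value] / p_given_false[name][value])
    return score

# Hypothetical P(observation | true_PPI) and P(observation | false_PPI).
p_given_true = {
    "two_hybrid": {True: 0.3, False: 0.7},
    "mass_spec": {True: 0.6, False: 0.4},
}
p_given_false = {
    "two_hybrid": {True: 0.01, False: 0.99},
    "mass_spec": {True: 0.05, False: 0.95},
}

both = ranking_score({"two_hybrid": True, "mass_spec": True}, p_given_true, p_given_false)
one = ranking_score({"two_hybrid": False, "mass_spec": True}, p_given_true, p_given_false)
none = ranking_score({"two_hybrid": False, "mass_spec": False}, p_given_true, p_given_false)
```

Because the prior is shared by every candidate pair, sorting pairs by this score gives the same order as sorting by posterior probability.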

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

65

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)]

True positives =    True negatives =

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67
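Each point on an ROC curve comes from choosing a score cutoff and counting how many gold-standard positives and negatives score at or above it. A minimal sketch with hypothetical scores and labels:

```python
def roc_points(scored_pairs, thresholds):
    """Return (false positive rate, true positive rate) for each score cutoff.

    scored_pairs is a list of (score, is_positive) tuples from a gold standard.
    """
    n_pos = sum(1 for _, is_positive in scored_pairs if is_positive)
    n_neg = len(scored_pairs) - n_pos
    points = []
    for cutoff in thresholds:
        tp = sum(1 for s, is_positive in scored_pairs if is_positive and s >= cutoff)
        fp = sum(1 for s, is_positive in scored_pairs if not is_positive and s >= cutoff)
        # FPR = FP/(TN+FP) and TPR = TP/(TP+FN), matching the axis labels.
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical scored gold-standard pairs.
toy_pairs = [(0.9, True), (0.8, True), (0.7, False),
             (0.4, True), (0.2, False), (0.1, False)]
points = roc_points(toy_pairs, thresholds=[0.75, 0.3])
```

Lowering the cutoff moves the point up and to the right: more true interactions are recovered, at the cost of more false positives.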

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated"]

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotation: "9074 interactions"]

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

Bayesian Networks

73

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  ⇒ 2^N − 1 parameters
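The growth in table size is easy to check numerically. The sketch below compares the full joint table with a structure in which one hidden binary cause has n independent binary effects; that second count is an added illustration, not a figure from the slide.

```python
def full_joint_parameters(n_variables):
    """Free parameters in a full joint table over binary variables: 2^N - 1."""
    return 2 ** n_variables - 1

def single_cause_parameters(n_effects):
    """One prior for the cause plus P(effect | cause) for each of its two states."""
    return 1 + 2 * n_effects

# One hidden cause plus four observed effects = five binary variables in total.
full_table = full_joint_parameters(5)
factored = single_cause_parameters(4)
```

The full table doubles with every added variable, while the factored count grows linearly with the number of effects.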

77

Graphical Structure Expresses our Beliefs

[Diagram: "P1-P2 REAL" (cause, often "hidden") with children "Detected by X1", …, "Detected by Xn" (effects, observed)]

78

Graphical Structure Expresses our Beliefs

[Diagram 1: "P1-P2 REAL" with children "Detected by X1", …, "Detected by Xn"]

[Diagram 2: "P1-P2 REAL" with children "Detected by X1", …, "Detected by Xj", "Detected by Xj+1", …, "Detected by Xn"; additional nodes "Membrane" and "Highly expressed"]

79

Graphical Structure Expresses our Beliefs

[Diagram 1: "P1-P2 REAL" with children "Detected by X1", …, "Detected by Xn"]

[Diagram 2: "P1-P2 REAL" with children "Detected by X1", …, "Detected by Xj", "Detected by Xj+1", …, "Detected by Xn"; additional nodes "Membrane" and "Highly expressed"]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

80

Graphical Structure Expresses our Beliefs

[Diagram 1: "P1-P2 REAL" with children "Detected by X1", …, "Detected by Xn"]

[Diagram 2: "P1-P2 REAL" with children "Detected by X1", …, "Detected by Xj", "Detected by Xj+1", …, "Detected by Xn"; additional nodes "Membrane" and "Highly expressed"]

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Diagram: nodes Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram: nodes Smart, Grade inflation, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints.

85
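The equivalence of the two formulations can be checked by multiplying the prior and the conditional table and comparing with the joint table. The numbers are the slide's, with decimal points restored; the variable names are illustrative.

```python
# Conditional formulation: P(S) and P(G | S).
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint formulation recovered via P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

for (s, g), p in sorted(joint.items()):
    print(f"S={s!s:5} G={g!s:5} P(S,G)={p:.3f}")
```

Both versions have three free numbers: the joint table has four entries that must sum to 1, and the conditional version has one prior plus one free entry per row.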

Automated Admissions Decisions

[Diagram: nodes Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

"Explaining Away"

[Diagram: nodes Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

"Explaining Away"

[Diagram: nodes Coin 1, Coin 2, Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
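The two-coin example is small enough to enumerate exactly. The sketch below assumes two fair, independent coins and a score of 1 when they match, as in the table; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def score(c1, c2):
    """Score is 1 when the two coins match, otherwise 0."""
    return 1 if c1 == c2 else 0

def prob_c1_heads(score_value=None, c2_value=None):
    """P(C1 = H | optional evidence on Score and C2), by enumerating outcomes."""
    numerator = Fraction(0)
    denominator = Fraction(0)
    for c1, c2 in product("HT", repeat=2):
        if score_value is not None and score(c1, c2) != score_value:
            continue
        if c2_value is not None and c2 != c2_value:
            continue
        weight = Fraction(1, 4)  # each of the four outcomes is equally likely
        denominator += weight
        if c1 == "H":
            numerator += weight
    return numerator / denominator

no_evidence = prob_c1_heads()
given_c2 = prob_c1_heads(c2_value="T")
given_score = prob_c1_heads(score_value=1)
given_score_and_c2 = prob_c1_heads(score_value=1, c2_value="T")
```

Without the score, learning C2 tells us nothing about C1. Once the score is known, C2 fully determines C1, even though neither coin causes the other.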

88

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D = (C1, C2, S) = (H,T,0), (H,H,1), …
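For fully observed data like this, the maximum-likelihood estimate of each coin's bias is simply the observed frequency. In the sketch below the first two observations are the ones shown on the slide; the remaining two are hypothetical additions so that the example has something to count.

```python
from collections import Counter

def estimate_heads_probability(observations, coin_index):
    """Maximum-likelihood estimate of P(coin = H): heads count / total count."""
    counts = Counter(obs[coin_index] for obs in observations)
    return counts["H"] / len(observations)

# (C1, C2, Score) tuples; the last two rows are made up for illustration.
data = [
    ("H", "T", 0),
    ("H", "H", 1),
    ("T", "T", 1),
    ("H", "T", 0),
]

p_c1_heads = estimate_heads_probability(data, coin_index=0)
p_c2_heads = estimate_heads_probability(data, coin_index=1)
```

With hidden variables or missing values the counts are no longer directly available, which is where iterative methods such as EM come in.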

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked "toy" example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Diagram: S (Smart), G (Grades), R (GREs), A (Admit)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

= P(S) P(G|S) P(R|S) P(A|G,R) (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Diagram: S (Smart), G (Grades), R (GREs), A (Admit)]

S:
P(S=F)   P(S=T)
0.5      0.5

G given S:
S   P(G=F)   P(G=T)
F   0.9      0.1
T   0.8      0.2

R given S:
S   P(R=F)   P(R=T)
F   0.8      0.2
T   0.2      0.8

A given G, R:
G   R   P(A=F)   P(A=T)
F   F   0.9      0.1
F   T   0.8      0.2
T   F   0.5      0.5
T   T   0.2      0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
(terms in order: G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn't help very much.

GREs are better correlated with intelligence than the grades.
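The prediction step can be reproduced by summing over the unobserved variables G and R. The tables below are the slide's values with decimal points restored; the dictionary layout and function name are just one possible encoding.

```python
# Conditional probability tables, keyed by parent value(s) then child value.
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_ADMIT_GIVEN_GR = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}

def p_admit_given_smart(smart):
    """P(A=T | S=smart) = sum over G, R of P(G|S) P(R|S) P(A=T|G,R)."""
    total = 0.0
    for grades in (False, True):
        for gres in (False, True):
            total += (P_G_GIVEN_S[smart][grades]
                      * P_R_GIVEN_S[smart][gres]
                      * P_ADMIT_GIVEN_GR[(grades, gres)])
    return total

p_admit_smart = p_admit_given_smart(True)
print(round(p_admit_smart, 3))
```

The four terms in the loop are exactly the four products written out on the slide.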

97

Inference with Bayes Nets

[Diagram: S (Smart), G (Grades), R (GREs), A (Admit)]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) calculated on previous slide = 0.29
P(A=T|S=F) calculated analogously to P(A=T|S=T) = 0.14

P(S=T|A=T) = P(S=T,A=T) / P(A=T)
Or, using Bayes' Rule: = P(S=T) P(A=T|S=T) / P(A=T)
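The inference step applies Bayes' rule on top of the prediction sums. The sketch below recomputes everything from the tables as transcribed on the previous slide; note that with those tables the enumeration gives P(A=T|S=F) ≈ 0.164 and P(A=T) ≈ 0.228, slightly different from the 0.14 and 0.21 quoted here, so treat the slide's rounded figures with some care.

```python
P_S = {False: 0.5, True: 0.5}
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_ADMIT_GIVEN_GR = {(False, False): 0.1, (False, True): 0.2,
                    (True, False): 0.5, (True, True): 0.8}

def p_admit_given_smart(smart):
    """P(A=T | S=smart), summing out grades and GREs."""
    return sum(
        P_G_GIVEN_S[smart][g] * P_R_GIVEN_S[smart][r] * P_ADMIT_GIVEN_GR[(g, r)]
        for g in (False, True)
        for r in (False, True)
    )

# Marginal P(A=T) = sum over S of P(S) P(A=T | S).
p_admit = sum(P_S[s] * p_admit_given_smart(s) for s in (False, True))

# Bayes' rule: P(S=T | A=T) = P(S=T) P(A=T | S=T) / P(A=T).
p_smart_given_admit = P_S[True] * p_admit_given_smart(True) / p_admit
```

Meeting an admitted student raises the belief that she is smart above the prior of 0.5, because admission is more likely for smart students.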

98

Inference with Bayes Nets

[Diagram: S (Smart), G (Grades), R (GREs), A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F,A=F) / P(A=F) = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R) (as before)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 ≈ 1.6
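The "tedious but straightforward" sums can be delegated to a short enumeration over the full joint distribution. This is a sketch using the tables as transcribed in the worked example; with those tables the enumeration gives roughly 0.94 and 0.55, close to but not identical to the rounded 0.92 and 0.56 quoted on the slide.

```python
from itertools import product

P_S = {False: 0.5, True: 0.5}
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_ADMIT_GIVEN_GR = {(False, False): 0.1, (False, True): 0.2,
                    (True, False): 0.5, (True, True): 0.8}

def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_admit = P_ADMIT_GIVEN_GR[(g, r)]
    return P_S[s] * P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * (p_admit if a else 1 - p_admit)

def probability(**fixed):
    """Marginal probability of the fixed assignments, e.g. probability(g=False, a=False)."""
    total = 0.0
    for s, g, r, a in product((False, True), repeat=4):
        values = {"s": s, "g": g, "r": r, "a": a}
        if all(values[name] == wanted for name, wanted in fixed.items()):
            total += joint(s, g, r, a)
    return total

p_not_admitted = probability(a=False)
p_bad_grades = probability(g=False, a=False) / p_not_admitted
p_bad_gres = probability(r=False, a=False) / p_not_admitted
```

Either way the conclusion is the same: a rejected student is more likely to have had bad grades than bad GREs, largely because bad grades are common under rough grading.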

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: "P1-P2 REAL" (cause, often "hidden") with children "Detected by X1", …, "Detected by Xn" (effects, observed)]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one pattern is labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

109

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

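likelihood ratio = ratio of posterior probabilities:

\[ \frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\,P(\mathrm{false\_PPI})} \]

if > 1 classify as true; if < 1 classify as false

\[ \text{log likelihood ratio} = \log\frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} \]

Prior probability is the same for all interactions, so it does not affect ranking.

\[ \text{Ranking function} = \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})} \]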

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

Likelihood ratio: L = P(f | pos) / P(f | neg)

Example values on the slide: P(f | pos) = 1114/2150 and P(f | neg) = 81924/573734

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected → compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TF=FP

prediction based on single data type all have TPFPlt1

How many gold-standard events do we score correctly at different likelihood cutoffs

( | _ )log( | _ )

P Data true PPIP Data false PPI

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560 (25 October 2012). doi:10.1038/nature11503

1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for the structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train a Bayesian classifier using “gold standard” interactions

We will examine Bayesian classifiers soon.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Detecting protein-protein interactions

Proteomics: Protein complexes take the bait. Anuj Kumar and Michael Snyder. Nature 415, 123-124 (10 January 2002). doi:10.1038/415123a

Gavin, A.-C. et al. Nature 415, 141-147 (2002). Ho, Y. et al. Nature 415, 180-183 (2002).

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes take the Bait." Nature 415, no. 6868 (2002): 123-4.

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

Tagging strategies

TAP-tag (endogenous protein levels). Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature: TAP-tag

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

Yeast two-hybrid

Ratushny V., Golemis E. Biotechniques 2008 Apr; 44(5): 655-62.

How does this compare to mass-spec based approaches?

• Does not require purification, so it will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

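Error Rates

• How can we estimate the error rates?

[Venn diagram: "Gold standard" and "Experiment". Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

[Venn diagram: "Gold standard", "Experiment 1" and "Experiment 2". The overlap with the gold standard holds the true positives from the gold standard.]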

Data Integration

[Venn diagram: "Gold standard", "Experiment 1" and "Experiment 2", with regions I to V marked.]

I = true positives from the gold standard
II = consensus true positives
III = interactions unique to one experiment that are in the gold standard

Fraction of consensus present in gold standard = I/II

The interactions unique to one experiment are a mix of true and false positives. Define IV = true positives, V = false positives.

Assume I/II = III/IV

Estimated Error Rates

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

[Figure: overlapping data sets with regions I, II and III marked; estimated true positives and false positives.]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of true positives in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{\mathrm{I}}{\mathrm{II}} = \frac{\mathrm{III}}{\mathrm{IV}}$$

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.
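The proportionality assumption I/II = III/IV can be rearranged to estimate IV, the number of true positives among the interactions unique to one experiment. The following is a minimal sketch of that arithmetic only; the function names and all counts are hypothetical and are not taken from Hart et al.

```python
def estimate_unique_true_positives(n_i, n_ii, n_iii):
    """Solve I/II = III/IV for IV.

    n_i   -- consensus interactions found in the gold standard (region I)
    n_ii  -- consensus true positives (region II)
    n_iii -- unique interactions found in the gold standard (region III)
    """
    if n_i <= 0:
        raise ValueError("region I must be non-empty to estimate IV")
    return n_iii * n_ii / n_i


def estimate_false_positive_fraction(n_i, n_ii, n_iii, n_unique_total):
    """Estimated fraction of the unique interactions that are false positives."""
    n_iv = estimate_unique_true_positives(n_i, n_ii, n_iii)
    n_iv = min(n_iv, n_unique_total)  # cannot exceed what was observed
    return 1.0 - n_iv / n_unique_total


# Hypothetical counts, for illustration only.
n_iv = estimate_unique_true_positives(n_i=200, n_ii=400, n_iii=150)
fp_fraction = estimate_false_positive_fraction(200, 400, 150, n_unique_total=1000)
print(n_iv, fp_fraction)
```

With these invented counts, half of the consensus is in the gold standard, so the 150 unique gold-standard hits imply roughly 300 unique true positives out of 1000 unique interactions.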

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

[Figure. Note: log-log plot.]

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$$\text{Data} = (\text{two\_hybrid},\ \text{mass\_spec},\ \text{co\_evolution},\ \text{co\_expression})$$

or

$$\text{Data} = (\text{mass\_spec\_expt\_1},\ \text{mass\_spec\_expt\_2})$$

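Bayes Rule

$$P(\text{true\_PPI} \mid \text{Data}) = \frac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data})}$$

posterior = prior × likelihood / P(Data)

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})}$$

If > 1, classify as true; if < 1, classify as false.

How do we compute this?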

likelihood ratio:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

log likelihood ratio:

$$\log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

We assume the observations are independent (we'll see how to handle dependence soon), so the ranking function factorizes over the M observations:

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i=1}^{M}\log\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
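The ranking function above can be sketched in a few lines. This is only an illustration of the naïve Bayes sum of log ratios: the assay names and detection probabilities below are hypothetical, and in practice they would be estimated from high-confidence positive and negative interactions.

```python
import math


def log_likelihood_ratio(observations, p_detect_given_true, p_detect_given_false):
    """Naive Bayes ranking score: sum over assays of log P(obs|true)/P(obs|false).

    observations -- dict mapping assay name to True (detected) or False
    p_detect_given_true / p_detect_given_false -- dicts mapping assay name to
        the probability that the assay reports a true / false interaction
    """
    score = 0.0
    for assay, detected in observations.items():
        p_true = p_detect_given_true[assay]
        p_false = p_detect_given_false[assay]
        if not detected:
            # Probability of the assay staying silent.
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        score += math.log(p_true / p_false)
    return score


# Hypothetical per-assay detection probabilities.
p_true = {"two_hybrid": 0.3, "mass_spec": 0.5}
p_false = {"two_hybrid": 0.01, "mass_spec": 0.05}

both = log_likelihood_ratio({"two_hybrid": True, "mass_spec": True}, p_true, p_false)
neither = log_likelihood_ratio({"two_hybrid": False, "mass_spec": False}, p_true, p_false)
print(both, neither)
```

A pair detected by both assays gets a positive score (more likely true than false), while a pair detected by neither gets a negative one; because the prior is the same for every pair, sorting by this score is enough for ranking.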

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the ranking function (the log likelihood ratio).

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

ROC curve

Collins et al. Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
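The two axes of the ROC curve can be computed directly from scored, labeled pairs. The sketch below assumes each protein pair carries a ranking score and a label from the high-confidence positive or negative set; the function name and the example scores are hypothetical, and ties between scores are not treated specially.

```python
def roc_points(scores, labels):
    """Return (FP/(TN+FP), TP/(TP+FN)) pairs as the score cutoff is lowered.

    scores -- ranking score for each protein pair
    labels -- True for a high-confidence positive, False for a negative
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]  # cutoff above every score: nothing is called positive
    tp = fp = 0
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores for two positives and two negatives.
points = roc_points([5.7, 2.1, 0.3, -1.0], [True, False, True, False])
print(points)
```

Each point corresponds to one choice of likelihood cutoff, so picking a cutoff amounts to picking an acceptable false positive rate on this curve.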

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities:
P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities:
P(PPI | Rosetta, CC, GO, MIPS, ESS)
But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  => 2^N - 1 parameters
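To make the size argument concrete, the sketch below compares the 2^N - 1 free parameters of a full joint table with the count for a naïve Bayes network. The naïve Bayes count (one prior plus two conditional probabilities per binary observation) is a standard result added here for comparison; it is not stated on the slide, and the function names are hypothetical.

```python
def full_joint_parameters(n_variables):
    """Free parameters in a full joint table over n binary variables."""
    return 2 ** n_variables - 1


def naive_bayes_parameters(n_observations):
    """Free parameters for one binary hidden cause with n binary effects.

    One prior P(cause) plus P(effect | cause) for each of the two cause states.
    """
    return 1 + 2 * n_observations


# PPI plus four experiments (Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho): 5 variables.
print(full_joint_parameters(5), naive_bayes_parameters(4))
# With 20 observations the gap becomes very large.
print(full_joint_parameters(21), naive_bayes_parameters(20))
```

The full table grows exponentially with the number of variables, while the factored model grows linearly, which is the practical motivation for encoding independence assumptions in the graph.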

Graphical Structure Expresses our Beliefs

[Diagram: node "P1-P2 REAL" (cause, often “hidden”) with arrows to "Detected by X1" … "Detected by Xn" (effects, observed).]

[Diagram: the same network next to a second version with detection nodes "Detected by X1" … "Detected by Xj", "Detected by Xj+1" … "Detected by Xn" and additional nodes "Membrane" and "Highly expressed".]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution.

A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Diagram: network with nodes "Smart", "Grade inflation", "Grades", "GREs" and "Admit".]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints.
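The equivalence of the two formulations can be checked by multiplying the prior by the conditional table, P(S,G) = P(S) P(G|S). The sketch below uses the numbers from the slide; the variable names are my own.

```python
# Prior on Smart and conditional table for Grades given Smart (from the slide).
p_smart = {False: 0.7, True: 0.3}
p_grades_given_smart = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Rebuild the joint table entry by entry: P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_smart[s] * p_grades_given_smart[s][g]
    for s in (False, True)
    for g in (False, True)
}

for (s, g), p in joint.items():
    print(s, g, round(p, 3))
```

Both descriptions have three free parameters: the joint table has four entries that must sum to 1, and the conditional form has one prior plus one conditional probability per value of S.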

Automated Admissions Decisions

[Diagram: network with nodes "Smart", "Grade inflation", "Grades", "GREs" and "Admit".]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”.

For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

“Explaining Away”

[Diagram: network with nodes "Season", "Sprinkler" (S), "Rain" (R), "Grass wet" and "Slippery".]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

“Explaining Away”

[Diagram: "Coin 1" and "Coin 2" both point to "Score".]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
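The two-coin example is small enough to enumerate. The sketch below computes P(C1 = H) with and without conditioning on the score and on C2, using the fair-coin probabilities and the score table from the slide; the function name is hypothetical.

```python
from itertools import product


def prob_c1_heads(score=None, c2=None):
    """P(C1 = H | optional evidence), by enumerating the four coin outcomes."""
    numerator = denominator = 0.0
    for c1_value, c2_value in product("HT", repeat=2):
        outcome_score = 1 if c1_value == c2_value else 0
        if score is not None and outcome_score != score:
            continue
        if c2 is not None and c2_value != c2:
            continue
        denominator += 0.25  # each outcome of two fair coins has probability 1/4
        if c1_value == "H":
            numerator += 0.25
    return numerator / denominator


print(prob_c1_heads())                   # no evidence
print(prob_c1_heads(c2="T"))             # C2 alone tells us nothing about C1
print(prob_c1_heads(score=1))            # the score alone does not either
print(prob_c1_heads(score=1, c2="T"))    # together they determine C1
```

Without the score, C1 and C2 are independent; once the score is observed, learning C2 changes the belief about C1, which is the coupling the slide describes.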

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

Learning Models from Data

[Diagram: "Coin 1" and "Coin 2" both point to "Score".]

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …
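For fully observed data and a known structure, the maximum likelihood parameters are just observed frequencies. The sketch below assumes that setting; only the first two records of the data set come from the slide, and the remaining records and the function name are hypothetical.

```python
from collections import Counter


def ml_estimate(data):
    """Maximum likelihood estimates for the two-coin network by counting.

    data -- list of (c1, c2, score) records with c1, c2 in {"H", "T"}
    Returns p(C1=H), p(C2=H) and a dict of P(score=1 | c1, c2).
    """
    n = len(data)
    p_c1_heads = sum(1 for c1, _, _ in data if c1 == "H") / n
    p_c2_heads = sum(1 for _, c2, _ in data if c2 == "H") / n

    pair_counts = Counter((c1, c2) for c1, c2, _ in data)
    score_one_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
    p_score_one = {
        pair: score_one_counts[pair] / count for pair, count in pair_counts.items()
    }
    return p_c1_heads, p_c2_heads, p_score_one


# First two records are from the slide; the last two are invented.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0)]
p_c1, p_c2, p_score = ml_estimate(data)
print(p_c1, p_c2, p_score)
```

With hidden variables or missing values, counting is no longer enough, and this is where the search algorithms listed on the previous slide (for example EM) come in.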

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

[Diagram: S (Smart) points to G (Grades) and R (GREs); G and R point to A (Admit).]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

= P(S) P(G|S) P(R|S) P(A|G,R)   (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

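Prediction with Bayes Nets

[Diagram: S (Smart) points to G (Grades) and R (GREs); G and R point to A (Admit).]

S:
P(S=F)   P(S=T)
0.5      0.5

G given S (rough grading: being smart doesn’t help very much):
S   P(G=F)   P(G=T)
F   0.9      0.1
T   0.8      0.2

R given S (GREs are better correlated with intelligence than the grades):
S   P(R=F)   P(R=T)
F   0.8      0.2
T   0.2      0.8

A given G and R:
G   R   P(A=F)   P(A=T)
F   F   0.9      0.1
F   T   0.8      0.2
T   F   0.5      0.5
T   T   0.2      0.8

$$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\,P(R \mid S{=}T)\,P(A{=}T \mid G, R)$$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29

The four terms correspond to (G,R) = (F,F), (F,T), (T,F), (T,T).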
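The sum over G and R can be checked by enumeration. The sketch below encodes the conditional tables as I read them from the slide (each table stored as the probability of the value T) and reproduces the P(A=T | S=T) calculation; the variable and function names are my own.

```python
from itertools import product

# Conditional tables, stored as P(variable = True | parents).
p_g_true_given_s = {False: 0.1, True: 0.2}
p_r_true_given_s = {False: 0.2, True: 0.8}
p_a_true_given_gr = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def bernoulli(p_true, value):
    """Probability of a binary value given the probability of True."""
    return p_true if value else 1.0 - p_true


def p_admit_given_smart(smart):
    """P(A=T | S=smart): sum out Grades and GREs."""
    total = 0.0
    for g, r in product((False, True), repeat=2):
        total += (
            bernoulli(p_g_true_given_s[smart], g)
            * bernoulli(p_r_true_given_s[smart], r)
            * p_a_true_given_gr[(g, r)]
        )
    return total


print(round(p_admit_given_smart(True), 3))
```

The four products in the loop are the same four terms written out on the slide, in the order (F,F), (F,T), (T,F), (T,T).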

Inference with Bayes Nets

[Diagram: S (Smart) points to G (Grades) and R (GREs); G and R point to A (Admit).]

$$P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}T \mid G, R)$$

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on the previous slide)
P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes Rule: = P(S=T) P(A=T|S=T) / P(A=T)

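Inference with Bayes Nets

[Diagram: S (Smart) points to G (Grades) and R (GREs); G and R point to A (Admit).]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S} \sum_{G} P(S)\,P(G \mid S)\,P(R{=}F \mid S)\,P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}F \mid G, R) \quad \text{(as before)}$$

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6

End of worked example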

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: node "P1-P2 REAL" (cause, often “hidden”) with arrows to "Detected by X1" … "Detected by Xn" (effects, observed).]

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al. Mol. & Cell. Proteomics (2002) 1.5: 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: two phylogenetic profile patterns; one is labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

[Figures from Jansen et al. (2003).]

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

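likelihood ratio:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

log likelihood ratio:

$$\log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$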

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

[Table residue: 1114/2150 and 81924/573734]

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$$
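The likelihood ratio for a single feature value is a ratio of two frequencies. The sketch below reads the two digit runs on the slide as the fractions 1114/2150 (gold-standard positives with this feature value) and 81924/573734 (gold-standard negatives with it); where the stripped slashes fall is my assumption, as is the function name.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), with both terms estimated as frequencies."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (assumed split of the digit runs).
lr = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(lr, 2))
```

A value above 1 means the feature value is seen more often among gold-standard positives than among negatives, so observing it raises the ranking score for a protein pair.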

[Figures from Jansen et al. (2003).]

Fully connected → compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small.

[Figure: TP/FP as a function of likelihood cutoff, with the line TP = FP marked.]

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions


  • Slide Number 1
  • Predictions
  • Prediction Challenges
  • Slide Number 4
  • Slide Number 5
  • Predicting Structures of Complexes
  • Docking
  • Docking
  • Reducing the search space
  • Next
1. Find homologous proteins of known structure (MA, MB)
2. Find structural neighbors (NAi, NBi) (avg. 1500 neighbors/structure)
3. Look for the structure of a complex containing structural neighbors
4. Align sequences of MA, MB to NA, NB based on structure
5. Compute five scores
6. Train a Bayesian classifier using "gold standard" interactions

We will examine Bayesian classifiers soon.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Zhang, Qiangfeng Cliff, Donald Petrey, et al. "Structure-based Prediction of Protein-protein Interactions on a Genome-wide Scale." Nature 490, no. 7421 (2012): 556-60.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Detecting protein-protein interactions

What are the likely false positives? What are the likely false negatives?

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Kumar, Anuj, and Michael Snyder. "Proteomics: Protein Complexes Take the Bait." Nature 415, no. 6868 (2002): 123-4. doi:10.1038/415123a

Gavin A-C et al. Nature 415, 141-147 (2002); Ho Y et al. Nature 415, 180-183 (2002)

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• Overexpression/tagging can influence results
• Only long-lived complexes will be detected

Tagging strategies

TAP-tag (endogenous protein levels). Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

Yeast two-hybrid

How does this compare to mass-spec based approaches?
• Does not require purification – will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655-62.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Error Rates

• How can we estimate the error rates?

[Venn diagram: gold standard vs. experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

Error Rates

[Venn diagram: gold standard, Experiment 1, and Experiment 2, with a region labeled "True positives from gold standard".]

Data Integration

[Venn diagram: gold standard, Experiment 1, and Experiment 2, with regions I to V.]

I = true positives from gold standard
II = consensus true positives
Fraction of consensus present in gold standard = I/II

The interactions found only in Experiment 1 are a mix of true and false positives. Define IV = true positives and V = false positives; III is the part of this unique data that is present in the gold standard.

Assume I/II = III/IV

Estimated Error Rates

[Figure: estimated false positives and true positives, with regions I, II, and III.]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV): I/II = III/IV

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, and Edward M. Marcotte. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120. doi:10.1186/gb-2006-7-11-120
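The proportionality assumption I/II = III/IV can be turned into a small calculation. The sketch below is illustrative only: the function name, argument names, and example counts are made up here and are not taken from Hart et al. It solves the assumption for IV and treats the rest of the unique data as false positives (V).

```python
def estimate_unique_error(region_i, region_ii, region_iii, unique_total):
    """Estimate true and false positives among interactions unique to one data set.

    region_i     -- consensus interactions that are also in the gold standard (I)
    region_ii    -- all consensus interactions (II)
    region_iii   -- unique interactions that are also in the gold standard (III)
    unique_total -- all unique interactions (IV + V)

    Assumes I/II = III/IV, i.e. the gold standard covers true positives
    equally well in the consensus and in the unique data.
    """
    if region_i <= 0 or region_ii <= 0 or unique_total <= 0:
        raise ValueError("counts must be positive")
    # Solve I/II = III/IV for IV, capped at the size of the unique data.
    true_positives = min(region_iii * region_ii / region_i, unique_total)
    false_positives = unique_total - true_positives
    return true_positives, false_positives, false_positives / unique_total


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
iv, v, fp_rate = estimate_unique_error(100, 400, 50, 1000)
print(iv, v, fp_rate)
```

With these invented counts the gold standard covers a quarter of the consensus, so 50 gold-standard hits in the unique data imply about 200 true positives and 800 false positives.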

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

[Figure from von Mering et al. Note: log-log plot.]

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403. doi:10.1038/nature750

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$\text{Data} = (\text{two\_hybrid}, \text{mass\_spec}, \text{co\_evolution}, \text{co\_expression})$

or

$\text{Data} = (\text{mass\_spec\_expt\_1}, \text{mass\_spec\_expt\_2})$

Bayes Rule

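$$P(\text{true\_PPI} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data})}$$

posterior = $P(\text{true\_PPI} \mid \text{Data})$; prior = $P(\text{true\_PPI})$; likelihood = $P(\text{Data} \mid \text{true\_PPI})$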

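Naïve Bayes Classification

For each class (true_PPI, false_PPI), the posterior is proportional to prior × likelihood.

likelihood ratio = ratio of posterior probabilities:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

How do we compute this?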

likelihood ratio $= \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$: if > 1, classify as true; if < 1, classify as false

log likelihood ratio $= \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function $= \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

We assume the observations are independent (we'll see how to handle dependence soon).

Ranking function $= \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
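As a rough illustration of the ranking function under the independence assumption, the sketch below sums per-assay log ratios. The assay names and all probabilities are hypothetical placeholders; in practice they would be estimated from high-confidence positive and negative sets, and the way each term is computed depends on the data type.

```python
import math

# Hypothetical values: (P(detected | true_PPI), P(detected | false_PPI)) per assay.
ASSAY_PROBS = {
    "two_hybrid": (0.30, 0.02),
    "mass_spec": (0.50, 0.05),
    "co_expression": (0.60, 0.30),
}


def log_likelihood_ratio(observations, assay_probs=ASSAY_PROBS):
    """Sum of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)] over assays.

    observations maps an assay name to True (detected) or False (not detected).
    Independence of the assays is assumed, so the product of the ratios
    becomes a sum of logs.
    """
    score = 0.0
    for assay, detected in observations.items():
        p_true, p_false = assay_probs[assay]
        if detected:
            score += math.log(p_true / p_false)
        else:
            score += math.log((1.0 - p_true) / (1.0 - p_false))
    return score


def rank_pairs(pairs):
    """Return protein-pair names sorted from most to least likely to interact."""
    return sorted(pairs, key=lambda name: log_likelihood_ratio(pairs[name]), reverse=True)


pairs = {
    "A-B": {"two_hybrid": True, "mass_spec": True, "co_expression": True},
    "C-D": {"two_hybrid": False, "mass_spec": False, "co_expression": True},
    "E-F": {"two_hybrid": False, "mass_spec": False, "co_expression": False},
}
print(rank_pairs(pairs))
```

A pair does not need to be detected by every assay to rank well; each observation just adds or subtracts evidence.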

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio.

von Mering et al. "Comparative assessment of large-scale data sets of protein–protein interactions." Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

ROC curve

[Figure: ROC curve from Collins et al., Mol Cell Proteomics 2007. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
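The two axes of the ROC curve can be computed directly from scored pairs with known labels. The following is a minimal sketch, not the procedure used by Collins et al.; the function name and the toy scores are assumptions made for illustration.

```python
def roc_points(scores, labels):
    """Return (FP/(TN+FP), TP/(TP+FN)) points as the score cutoff is lowered.

    scores -- one score per protein pair (higher = more likely real)
    labels -- True for gold-standard positives, False for gold-standard negatives
    """
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for index, (score, is_positive) in enumerate(ranked):
        if is_positive:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all pairs sharing this score are counted.
        if index + 1 == len(ranked) or ranked[index + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


# Toy example: two positives scored above two negatives.
print(roc_points([0.9, 0.8, 0.7, 0.6], [True, True, False, False]))
```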

ROC curve

[Figure: the same ROC plot (Collins et al., Mol Cell Proteomics 2007), annotated: 9074 interactions; literature curated (excluding 2-hybrid); error rate equivalent to literature curated.]

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

In Biology
• Gene regulation
• Signaling
• Prediction

Bayesian Networks
• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1) => $2^N - 1$ parameters
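To get a feel for how quickly the full joint table grows, the sketch below compares it with a naïve Bayes structure. The naïve Bayes count is an assumption of this example: one binary hidden class node plus n binary observations, each needing one probability per class value.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters in a complete joint table over N binary variables."""
    return 2 ** n_binary_variables - 1


def naive_bayes_parameters(n_observations):
    """Free parameters for one binary class node and n binary observations.

    One prior for the class, plus P(observation = 1 | class) for each of the
    two class values and each observation.
    """
    return 1 + 2 * n_observations


for n_obs in (4, 10, 20):
    # The full joint covers the class node plus the observations.
    print(n_obs, full_joint_parameters(n_obs + 1), naive_bayes_parameters(n_obs))
```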

Graphical Structure Expresses our Beliefs

[Diagram: "P1-P2 REAL" is the cause (often "hidden"); "Detected by X1", …, "Detected by Xn" are the effects (observed).]

[Diagram: a second network with the same cause and detection nodes ("Detected by X1", …, "Detected by Xn", "Detected by Xj", "Detected by Xj+1", …) plus the nodes "Membrane" and "Highly expressed".]

Naïve Bayes assumes all observations are independent. But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Diagram: Bayesian network with nodes Smart, Grade inflation, Grades, GREs, and Admit.]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints.
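The equivalence of the two formulations can be checked numerically. The sketch below uses the numbers from the tables above; the function and variable names are invented for this example.

```python
# Conditional formulation: P(S) and P(G | S), with True/False for T/F.
P_SMART = {False: 0.7, True: 0.3}
P_GRADE_GIVEN_SMART = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}


def joint_from_conditionals(p_s, p_g_given_s):
    """Build P(S, G) = P(S) * P(G | S)."""
    return {(s, g): p_s[s] * p_g_given_s[s][g] for s in p_s for g in p_g_given_s[s]}


def conditionals_from_joint(joint):
    """Recover P(S) and P(G | S) from the joint table."""
    p_s = {}
    for (s, _g), p in joint.items():
        p_s[s] = p_s.get(s, 0.0) + p
    p_g_given_s = {s: {} for s in p_s}
    for (s, g), p in joint.items():
        p_g_given_s[s][g] = p / p_s[s]
    return p_s, p_g_given_s


joint = joint_from_conditionals(P_SMART, P_GRADE_GIVEN_SMART)
for key in sorted(joint):
    print(key, round(joint[key], 3))
```

Both tables have three free parameters: the joint has four entries that must sum to 1, and the conditional form has P(S=T) plus one P(G=T|S) per value of S.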

Automated Admissions Decisions

[Diagram: Bayesian network with nodes Smart, Grade inflation, Grades, GREs, and Admit.]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

"Explaining Away"

[Diagram: network with nodes Season, Sprinkler (S), Rain (R), Grass wet, and Slippery.]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

"Explaining Away"

[Diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
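The two-coin example is small enough to enumerate. The sketch below (function names are made up for illustration) builds the joint distribution and shows that C2 tells us nothing about C1 on its own, but becomes informative once the score is known.

```python
from itertools import product


def joint_distribution():
    """Joint over (C1, C2, Score): fair independent coins, Score = 1 if they match."""
    joint = {}
    for c1, c2 in product("HT", repeat=2):
        score = 1 if c1 == c2 else 0
        joint[(c1, c2, score)] = 0.5 * 0.5
    return joint


def prob_c1_heads(joint, c2=None, score=None):
    """P(C1 = H | evidence), where evidence may fix C2 and/or Score."""
    numerator = denominator = 0.0
    for (c1, c2_value, score_value), p in joint.items():
        if c2 is not None and c2_value != c2:
            continue
        if score is not None and score_value != score:
            continue
        denominator += p
        if c1 == "H":
            numerator += p
    return numerator / denominator


joint = joint_distribution()
print(prob_c1_heads(joint))                     # no evidence
print(prob_c1_heads(joint, c2="T"))             # C2 alone
print(prob_c1_heads(joint, c2="T", score=0))    # C2 and the score
```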

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function
• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

Learning Models from Data

[Diagram: Coin 1 → Score ← Coin 2]

Parameters to learn: p(C1=H), p(C1=T), p(C2=H), p(C2=T), and the Score for each combination (C1, C2) = (H,H), (T,T), (H,T), (T,H).

D = (C1, C2, S) = (H,T,0), (H,H,1), …
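For a network this simple and fully observed data, the maximum likelihood parameters are just observed frequencies. The sketch below illustrates that under those assumptions; the function name is invented, and only the first two records of the data set come from the slide (the rest are made up to complete the example).

```python
from collections import Counter


def max_likelihood_estimates(data):
    """Estimate p(C1=H), p(C2=H), and P(Score=1 | C1, C2) by counting.

    data is a list of (c1, c2, score) records with c1, c2 in {"H", "T"}.
    """
    n = len(data)
    if n == 0:
        raise ValueError("need at least one record")
    c1_counts = Counter(c1 for c1, _c2, _s in data)
    c2_counts = Counter(c2 for _c1, c2, _s in data)
    pair_counts = Counter((c1, c2) for c1, c2, _s in data)
    score_one_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
    p_score_one = {
        pair: score_one_counts[pair] / count for pair, count in pair_counts.items()
    }
    return c1_counts["H"] / n, c2_counts["H"] / n, p_score_one


# First two records as on the slide; the remaining two are hypothetical.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0)]
print(max_likelihood_estimates(D))
```

With hidden variables or missing values, counting is no longer enough, which is where search methods such as EM come in.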

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked "toy" example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs); G and R → A (Admit)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)   (why? because of the conditional independence assumption)

Recall P(X,Y) = P(X|Y) P(Y)

Prediction with Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs); G and R → A (Admit)]

S:
P(F)    P(T)
0.5     0.5

G (rough grading: being smart doesn't help very much):
S    P(F)    P(T)
F    0.9     0.1
T    0.8     0.2

R (GREs are better correlated with intelligence than the grades):
S    P(F)    P(T)
F    0.8     0.2
T    0.2     0.8

A:
G    R    P(F)    P(T)
F    F    0.9     0.1
F    T    0.8     0.2
T    F    0.5     0.5
T    T    0.2     0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)
= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
(the four terms correspond to (G,R) = (F,F), (F,T), (T,F), (T,T))
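The prediction sum can be reproduced by enumerating the hidden intermediate variables G and R. This is a sketch using the tables as transcribed here, with True/False standing in for T/F; the names are illustrative.

```python
from itertools import product

# P(variable = T | parents); P(variable = F | parents) is one minus this.
P_G_TRUE = {False: 0.1, True: 0.2}   # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}   # P(R=T | S)
P_A_TRUE = {                         # P(A=T | G, R)
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def bernoulli(p_true, value):
    """Probability of a binary value given P(value = True)."""
    return p_true if value else 1.0 - p_true


def p_admit_given_smart(smart):
    """P(A=T | S=smart) = sum over G, R of P(G|S) P(R|S) P(A=T|G,R)."""
    total = 0.0
    for g, r in product([False, True], repeat=2):
        total += (
            bernoulli(P_G_TRUE[smart], g)
            * bernoulli(P_R_TRUE[smart], r)
            * P_A_TRUE[(g, r)]
        )
    return total


print(round(p_admit_given_smart(True), 3))
print(round(p_admit_given_smart(False), 3))
```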

Inference with Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs); G and R → A (Admit)]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)
= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) ≈ 0.23

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on the previous slide)
P(A=T|S=F) ≈ 0.16, calculated analogously to P(A=T|S=T): (0.9)(0.8)(0.1) + (0.9)(0.2)(0.2) + (0.1)(0.8)(0.5) + (0.1)(0.2)(0.8) = 0.164

P(S=T|A=T) = P(S=T,A=T) / P(A=T)
Or, using Bayes' Rule, = P(S=T) P(A=T|S=T) / P(A=T)

Inference with Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs); G and R → A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F,A=F) / P(A=F) = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S} Σ_{G} Σ_{R} P(S) P(G|S) P(R|S) P(A=F|G,R) = 1 − P(A=T) (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) ≈ 0.94 / 0.55 ≈ 1.7
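Because the network has only four binary variables, the tedious sums can be done by brute-force enumeration of all 16 worlds. The sketch below uses the conditional probability tables as transcribed in this worked example; the helper names are invented. Enumerating these tables gives P(A=T|S=F) ≈ 0.164 and P(A=T) ≈ 0.228.

```python
from itertools import product

P_S_TRUE = 0.5
P_G_TRUE = {False: 0.1, True: 0.2}   # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}   # P(R=T | S)
P_A_TRUE = {                         # P(A=T | G, R)
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def bernoulli(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (
        bernoulli(P_S_TRUE, s)
        * bernoulli(P_G_TRUE[s], g)
        * bernoulli(P_R_TRUE[s], r)
        * bernoulli(P_A_TRUE[(g, r)], a)
    )


def probability(**fixed):
    """Marginal probability of the fixed assignments, e.g. probability(a=False)."""
    total = 0.0
    for s, g, r, a in product([False, True], repeat=4):
        world = {"s": s, "g": g, "r": r, "a": a}
        if all(world[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total


def conditional(query, evidence):
    """P(query | evidence), both given as dicts over s, g, r, a."""
    return probability(**query, **evidence) / probability(**evidence)


print(round(probability(a=True), 3))                       # P(A=T)
print(round(conditional({"s": True}, {"a": True}), 3))     # P(S=T | A=T)
print(round(conditional({"g": False}, {"a": False}), 3))   # P(G=F | A=F)
print(round(conditional({"r": False}, {"a": False}), 3))   # P(R=F | A=F)
```

Bad grades are the more probable of the two among rejected students, but note that bad grades are also far more common to begin with under the rough grading table.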

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: "P1-P2 REAL" is the cause (often "hidden"); "Detected by X1", …, "Detected by Xn" are the effects (observed).]

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7. doi:10.1186/1471-2105-8-S4-S7

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

[Figures from Jansen et al.]

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

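likelihood ratio $= \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$: if > 1, classify as true; if < 1, classify as false

log likelihood ratio $= \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function $= \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$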

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

Likelihood $= L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114/2150}{81924/573734}$

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.
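The likelihood for one feature value is a ratio of two frequencies. The sketch below assumes the run-together digits on the slide split as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives; that reading, and the function name, are assumptions of this example.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    if min(pos_total, neg_total, neg_with_feature) <= 0:
        raise ValueError("totals and the negative count must be positive")
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (assumed split of the garbled digits).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))
```

A value above 1 means the feature value is enriched among gold-standard positives relative to negatives.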

Fully connected → Compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

[Figure annotations: TP = FP; predictions based on a single data type all have TP/FP < 1.]

Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

39

Proteomics Protein complexes take the bait Anuj Kumar and Michael Snyder Nature 415 123-124(10 January 2002) doi101038415123a

Gavin A-C et al Nature 415 141-147 (2002) Ho Y et al Nature 415 180-183 (2002)

Detecting protein-protein interactions

What are the likely false positives What are the likely false negatives

40

Courtesy of Macmillan Publishers LimitedUsed with permissionSource Kumar Anuj and Michael Snyder Proteomics Protein Complexes take the Bait Nature 415 no 6868 (2002) 123-4 40

Mass-spec for protein-protein interactions

bull Extremely efficient method for detecting interactions

bull Proteins are in their correct subcellular location

Limitations

41

Mass-spec for protein-protein interactions

bull Extremely efficient method for detecting interactions

bull Proteins are in their correct subcellular location

Limitations bull overexpressiontagging can influence results bull only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (Endogenous protein levels) Tandem purification 1 Protein A-IgG purification 2 Cleave TEV site to elute 3 CBP-Calmodulin purification 4 EGTA to elute

Gavin et al (2002) Nature

Ho et al (2002) Nature over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited Used with permissionSource Gavin Anne-Claude Markus Boumlsche et al Functional Organization of the Yeast Proteome bySystematic Analysis of Protein Complexes Nature 415 no 6868 (2002) 141-7

43

Biotechniques 2008 Apr44(5)655-62 Ratushny V Golemis E

How does this compare to mass-spec based approaches

Yeast two-hybrid

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of Cell Signaling Pathwaysusing the Evolving Yeast Two-hybrid System Biotechniques 44 no 5 (2008) 655

44

bullDoes not require purification ndash will pick up more transient interactions bullBiased against proteins that do not express well or are incompatible with the nucleus

Biotechniques 2008 Apr44(5)655-62 Ratushny V Golemis E

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of CellSignaling Pathways using the Evolving Yeast Two-hybrid System Biotechniques44 no 5 (2008) 655

45

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

46

Error Rates

bull How can we estimate the error rates

Gold standard Experiment

False negatives

True positives

True positives

and false

positives

47

Error Rates

Gold standard

Experiment 1

Experiment 2

True positives

from gold standard

48

Data Integration

[Venn diagram: Gold standard, Experiment 1, Experiment 2, with regions I and II]
I = True positives from gold standard
II = Consensus true positives
Fraction of consensus present in gold standard = I/II

Data Integration

[Venn diagram: Gold standard, Experiment 1, with regions I, II and III]
I = True positives from gold standard
II = Consensus true positives
Fraction of consensus present in gold standard = I/II
The rest of Experiment 1 is a mix of true and false positives.

Data Integration

[Venn diagram: Gold standard, Experiment 1, with regions I, II, III, IV (true) and V (false)]
I = True positives from gold standard
II = Consensus true positives
Fraction of consensus present in gold standard = I/II
Define IV = true positives, V = false positives

Data Integration

[Venn diagram: Gold standard, Experiment 1, with regions I, II, III, IV (true) and V (false)]
I = True positives from gold standard
II = Consensus true positives
Fraction of consensus present in gold standard = I/II
Assume I/II = III/IV

Estimated Error Rates

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

[Figure: Venn diagrams with regions I, II and III; legend: false positives, true positives]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

\[ \frac{\mathrm{I}}{\mathrm{II}} = \frac{\mathrm{III}}{\mathrm{IV}} \]

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.
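The assumption I/II = III/IV can be turned into a quick estimate of how many experiment-specific interactions are real. The sketch below is only an illustration: it reads region I as consensus pairs found in the gold standard and region II as consensus pairs outside it, and the function name and all counts are hypothetical, not values from Hart et al.

```python
def estimate_unique_regions(n_I, n_II, n_III, n_unique_not_gold):
    """Split experiment-specific, non-gold interactions into true (IV) and false (V).

    n_I   : consensus interactions that are in the gold standard
    n_II  : consensus interactions that are not in the gold standard
    n_III : experiment-specific interactions that are in the gold standard
    n_unique_not_gold : experiment-specific interactions outside the gold
                        standard (regions IV and V together)
    """
    # Assumption from the slide: I / II = III / IV  =>  IV = III * II / I
    n_IV = n_III * n_II / n_I
    # Whatever remains of the non-gold, experiment-specific set is region V.
    n_V = max(n_unique_not_gold - n_IV, 0.0)
    # Fraction of all experiment-specific interactions estimated to be false.
    false_fraction = n_V / (n_III + n_unique_not_gold)
    return n_IV, n_V, false_fraction


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
n_IV, n_V, false_fraction = estimate_unique_regions(
    n_I=200, n_II=100, n_III=300, n_unique_not_gold=600
)
```

With these made-up counts the experiment-specific set would be split evenly between estimated true and false interactions; real data sets would of course give different proportions.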


Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute \( P(\text{real\_PPI} \mid \text{Data}) \)?

\[ \text{Data} = (\text{two\_hybrid},\ \text{mass\_spec},\ \text{co\_evolution},\ \text{co\_expression}) \]
or
\[ \text{Data} = (\text{mass\_spec\_expt\_1},\ \text{mass\_spec\_expt\_2}) \]

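Bayes Rule

\[ P(\text{true\_PPI} \mid \text{Data}) = \frac{P(\text{true\_PPI})\, P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data})} \]

posterior: \( P(\text{true\_PPI} \mid \text{Data}) \); prior: \( P(\text{true\_PPI}) \); likelihood: \( P(\text{Data} \mid \text{true\_PPI}) \)

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities:

\[ \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})\, P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\, P(\text{Data} \mid \text{false\_PPI})} \]

if > 1 classify as true; if < 1 classify as false

How do we compute this?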

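likelihood ratio:

\[ \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})\, P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\, P(\text{Data} \mid \text{false\_PPI})} \]

if > 1 classify as true; if < 1 classify as false

log likelihood ratio:

\[ \log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} \]

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

\[ \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} \]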

We assume the observations are independent (we'll see how to handle dependence soon).

Ranking function:

\[ \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i=1}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i=1}^{M} \log \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} \]

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec see Collins et al. Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
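Under the independence assumption, the ranking function is just a sum of per-assay log ratios. A minimal Python sketch of that idea follows; the assay names, detection probabilities and protein pairs are all hypothetical, standing in for values one would estimate from high-confidence positive and negative sets.

```python
import math


def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum of per-assay log likelihood ratios for one candidate pair.

    observations  : dict assay -> bool (was the pair detected by that assay?)
    p_given_true  : dict assay -> P(detected | true_PPI)
    p_given_false : dict assay -> P(detected | false_PPI)
    """
    score = 0.0
    for assay, detected in observations.items():
        pt, pf = p_given_true[assay], p_given_false[assay]
        if not detected:
            # Use the probability of *not* being detected under each hypothesis.
            pt, pf = 1.0 - pt, 1.0 - pf
        score += math.log(pt / pf)
    return score


# Hypothetical detection rates, as if estimated from gold-standard sets.
P_TRUE = {"two_hybrid": 0.20, "mass_spec": 0.40}
P_FALSE = {"two_hybrid": 0.01, "mass_spec": 0.05}

candidates = {
    ("A", "B"): {"two_hybrid": True, "mass_spec": True},
    ("A", "C"): {"two_hybrid": False, "mass_spec": True},
    ("B", "D"): {"two_hybrid": False, "mass_spec": False},
}

# Rank candidate pairs from most to least likely to be real.
ranked = sorted(
    candidates,
    key=lambda pair: log_likelihood_ratio(candidates[pair], P_TRUE, P_FALSE),
    reverse=True,
)
```

Because the prior is dropped, the scores are only meaningful relative to one another: they order the candidates but are not posterior probabilities.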

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by

\[ \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} \]

ROC curve

Collins et al. Mol Cell Proteomics 2007

[ROC plot. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = ?
True negatives = ?

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

ROC curve

Collins et al. Mol Cell Proteomics 2007

[ROC plot. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD).

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression
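The two axes of the ROC plot can be computed by sweeping a cutoff down a ranked list of scored pairs. The following Python sketch shows one simple way to do it; the function name, scores and gold-standard sets are hypothetical, and for brevity it ignores tied scores and gold-standard pairs that were never scored.

```python
def roc_points(scores, positives, negatives):
    """Return (FPR, TPR) pairs obtained by lowering a score cutoff step by step.

    scores    : dict pair -> confidence score (higher = more likely real)
    positives : set of gold-standard true pairs
    negatives : set of gold-standard false pairs
    """
    # Only pairs with a gold-standard label contribute to the curve.
    labelled = [(score, pair in positives)
                for pair, score in scores.items()
                if pair in positives or pair in negatives]
    labelled.sort(key=lambda item: item[0], reverse=True)

    n_pos = sum(1 for _, is_pos in labelled if is_pos)
    n_neg = len(labelled) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_pos in labelled:
        if is_pos:
            tp += 1
        else:
            fp += 1
        # x = FP / (TN + FP), y = TP / (TP + FN)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores and gold-standard sets.
example_scores = {("A", "B"): 0.9, ("A", "C"): 0.8, ("X", "Y"): 0.7,
                  ("B", "D"): 0.4, ("C", "E"): 0.1}
gold_positives = {("A", "B"), ("B", "D")}
gold_negatives = {("A", "C"), ("C", "E")}
curve = roc_points(example_scores, gold_positives, gold_negatives)
```

The unlabelled pair ("X", "Y") is skipped, which mirrors the point that error rates can only be measured on pairs covered by the high-confidence sets.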

ROC curve

Collins et al. Mol Cell Proteomics 2007

[ROC plot. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: 9074 interactions; literature curated (excluding 2-hybrid); error rate equivalent to literature curated.]

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities:
P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

Bayesian Networks

You could try to write out all the joint probabilities:
P(PPI | Rosetta, CC, GO, MIPS, ESS)
But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1) => 2^N - 1 parameters
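To see how quickly the full joint table grows, the short Python sketch below compares its parameter count with that of a simple factored model. The factored model used for comparison (one binary hidden cause with independent binary effects) is an assumption made here for illustration, and both function names are hypothetical.

```python
def full_joint_parameters(n_binary_variables):
    # 2^N states, minus one because the probabilities must sum to 1.
    return 2 ** n_binary_variables - 1


def single_cause_parameters(n_binary_effects):
    # One prior P(cause = T), plus P(effect_i = T | cause) for each of the
    # two values of the cause: 1 + 2 * n parameters in total.
    return 1 + 2 * n_binary_effects


for n_effects in (4, 10, 20):
    n_variables = n_effects + 1  # the effects plus the single cause
    print(n_effects,
          full_joint_parameters(n_variables),
          single_cause_parameters(n_effects))
```

The full table grows exponentially with the number of variables while the factored model grows linearly, which is the practical motivation for imposing graph structure.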

Graphical Structure Expresses our Beliefs

[Diagram: P1-P2 REAL linked to Detected by X1, …, Detected by Xn]
Cause (often “hidden”): P1-P2 REAL
Effects (observed): Detected by X1, …, Detected by Xn

Graphical Structure Expresses our Beliefs

[Diagram 1: P1-P2 REAL linked to Detected by X1, …, Detected by Xn]
[Diagram 2: P1-P2 REAL linked to Detected by X1, …, Detected by Xn, Detected by Xj, Detected by Xj+1, …; additional nodes: Membrane, Highly expressed]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

Graphical Structure

In a Bayesian Network we don't need the full probability distribution.
A node is independent of its ancestors given its parents.
For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Network nodes: Grade inflation, Smart, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).
Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).
You meet an admitted student. Is she smart?

Automated Admissions Decisions

[Network nodes: Grade inflation, Smart, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs).
We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Network: Smart, Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints.
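The equivalence of the two formulations can be checked numerically. The Python sketch below rebuilds the joint table from the prior and conditional tables, then recovers the conditional table from the joint; the variable names are illustrative choices, while the probabilities are the ones in the tables above.

```python
from itertools import product

P_S = {False: 0.7, True: 0.3}                       # P(S)
P_G_GIVEN_S = {False: {False: 0.95, True: 0.05},    # P(G | S=F)
               True: {False: 0.20, True: 0.80}}     # P(G | S=T)

# Joint table: P(S, G) = P(S) * P(G | S).
joint = {(s, g): P_S[s] * P_G_GIVEN_S[s][g]
         for s, g in product([False, True], repeat=2)}

# Going the other way: marginalize the joint to get P(S),
# then divide to recover P(G | S).
marginal_S = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}
recovered = {s: {g: joint[(s, g)] / marginal_S[s] for g in (False, True)}
             for s in (False, True)}
```

Both forms carry three free numbers: the joint table has four entries constrained to sum to 1, and the conditional form has one prior plus one free entry per row.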

Automated Admissions Decisions

[Network nodes: Grade inflation, Smart, Grades, GREs, Admit]

In a Bayesian Network we don't need the full probability distribution.
The joint probability depends only on “parents”.
For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

“Explaining Away”

[Network nodes: Season, S (Sprinkler), R (Rain), Grass wet, Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

“Explaining Away”

[Network nodes: Coin 1, Coin 2, Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T?
If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
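The coin example is small enough to enumerate directly. In the Python sketch below (helper names are illustrative), conditioning on Coin 2 alone leaves the belief about Coin 1 unchanged, but once the score is also known the two coins become coupled.

```python
from itertools import product

# Two fair coins; Score = 1 if they match and 0 otherwise, as in the table.
states = [(c1, c2, int(c1 == c2)) for c1, c2 in product("HT", repeat=2)]
prob = {state: 0.25 for state in states}


def conditional(query, given):
    """P(query | given), where both are predicates over (c1, c2, score)."""
    denom = sum(p for s, p in prob.items() if given(s))
    numer = sum(p for s, p in prob.items() if given(s) and query(s))
    return numer / denom


def c1_is_heads(state):
    return state[0] == "H"


p_prior = conditional(c1_is_heads, lambda s: True)
p_given_c2 = conditional(c1_is_heads, lambda s: s[1] == "T")
p_given_score = conditional(c1_is_heads, lambda s: s[2] == 1)
p_given_score_and_c2 = conditional(c1_is_heads,
                                   lambda s: s[2] == 1 and s[1] == "T")
```

Here `p_prior` and `p_given_c2` are equal, while `p_given_score_and_c2` differs from `p_given_score`: observing the common effect is what makes one cause informative about the other.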

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data {Xi}
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

Learning Models from Data

[Network nodes: Coin 1, Coin 2, Score]

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = {(C1, C2, S)} = {(H,T,0), (H,H,1), …}
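When the structure is known and every variable is observed, maximum-likelihood parameters for discrete nodes reduce to counting. The Python sketch below illustrates this for the two-coin network; only the first two tuples of the data set come from the slide, and the rest are invented for illustration.

```python
from collections import Counter

# Hypothetical training data D = {(C1, C2, Score)}; only the first two
# tuples appear on the slide, the remainder are made up.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("H", "T", 0),
     ("T", "H", 0), ("H", "H", 1), ("T", "T", 1), ("H", "H", 1)]

n = len(D)

# Root nodes: the ML estimate of P(C = H) is the observed frequency of H.
p_c1_heads = sum(c1 == "H" for c1, _, _ in D) / n
p_c2_heads = sum(c2 == "H" for _, c2, _ in D) / n

# Child node: estimate P(Score = 1 | C1, C2) separately for each
# configuration of the parents.
parent_counts = Counter((c1, c2) for c1, c2, _ in D)
score_one_counts = Counter((c1, c2) for c1, c2, s in D if s == 1)
p_score_one = {parents: score_one_counts[parents] / count
               for parents, count in parent_counts.items()}
```

With hidden variables or missing observations, counting no longer suffices, which is where iterative methods such as EM or Gibbs sampling come in.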

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

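Chain Rule of Probability for Bayes Nets

[Network nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

\[ P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R) \]
\[ \phantom{P(S,G,R,A)} = P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R) \]

Why? Because of the conditional independence assumption.

Recall: \( P(X,Y) = P(X \mid Y)\,P(Y) \)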

Prediction with Bayes Nets

[Network nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

S:
P(F)   P(T)
0.5    0.5

G:
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R:
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

\[ P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\,P(R \mid S{=}T)\,P(A{=}T \mid G,R) \]
\[ = \underbrace{(0.8)(0.2)(0.1)}_{G,R = F,F} + \underbrace{(0.8)(0.8)(0.2)}_{F,T} + \underbrace{(0.2)(0.2)(0.5)}_{T,F} + \underbrace{(0.2)(0.8)(0.8)}_{T,T} = 0.29 \]

Rough grading: being smart doesn't help very much.

GREs are better correlated with intelligence than the grades.

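Inference with Bayes Nets

[Network nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

\[ P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}T \mid G,R) \]
\[ = P(S{=}T)\,P(A{=}T \mid S{=}T) + P(S{=}F)\,P(A{=}T \mid S{=}F) = 0.23 \]

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T), calculated on the previous slide, = 0.29
P(A=T|S=F), calculated analogously to P(A=T|S=T), = 0.16

\[ P(S{=}T \mid A{=}T) = \frac{P(S{=}T, A{=}T)}{P(A{=}T)} \]
Or, using Bayes Rule:
\[ P(S{=}T \mid A{=}T) = \frac{P(S{=}T)\,P(A{=}T \mid S{=}T)}{P(A{=}T)} \]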

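Inference with Bayes Nets

[Network nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

\[ P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} P(S)\,P(G \mid S)\,P(R{=}F \mid S)\,P(A{=}F \mid G, R{=}F)}{P(A{=}F)} \]

\[ P(A{=}F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}F \mid G,R) \quad \text{(as before)} \]

Tedious but straightforward to compute

\[ \frac{P(G{=}F \mid A{=}F)}{P(R{=}F \mid A{=}F)} = \frac{0.94}{0.55} = 1.7 \]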
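The "tedious but straightforward" sums can be delegated to a few lines of code. The Python sketch below enumerates all sixteen states of the toy admissions network using the conditional probability tables from the worked example; the helper names `joint` and `prob` are illustrative, and brute-force enumeration is used only because the network is tiny.

```python
from itertools import product

TF = (False, True)
P_S = {False: 0.5, True: 0.5}                                           # P(S)
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}   # P(G | S)
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}   # P(R | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,                    # P(A=T | G, R)
            (True, False): 0.5, (True, True): 0.8}


def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_admit = P_A_TRUE[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * (p_admit if a else 1.0 - p_admit)


def prob(**fixed):
    """Probability of the assignments in `fixed`, summing out everything else."""
    total = 0.0
    for s, g, r, a in product(TF, repeat=4):
        values = {"s": s, "g": g, "r": r, "a": a}
        if all(values[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total


# Prediction: from cause to effect.
p_admit_given_smart = prob(a=True, s=True) / prob(s=True)
# Inference: from effect back to cause.
p_admit = prob(a=True)
p_smart_given_admit = prob(s=True, a=True) / p_admit
# Which is the likelier explanation for a rejection?
p_bad_grades_given_reject = prob(g=False, a=False) / prob(a=False)
p_bad_gres_given_reject = prob(r=False, a=False) / prob(a=False)
```

Enumeration scales exponentially with the number of variables, so larger networks need the factorization to be exploited more carefully; for four binary nodes it is a convenient way to check hand calculations.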

End of worked example

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: P1-P2 REAL linked to Detected by X1, …, Detected by Xn]
Cause (often “hidden”): P1-P2 REAL
Effects (observed): Detected by X1, …, Detected by Xn

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al. Mol & Cell Proteomics (2002) 1.5: 349-356

INT = high confidence interactions from small scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled "More likely to interact"]

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.
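One simple way to compare two phylogenetic profiles is to count the genomes in which both proteins are present or both are absent. The Python sketch below uses that matching fraction purely as an illustration; it is not the scoring method of Cokus et al., and the function name and profiles are hypothetical.

```python
def profile_similarity(profile_a, profile_b):
    """Fraction of genomes in which two proteins are both present or both absent."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genomes")
    matches = sum(a == b for a, b in zip(profile_a, profile_b))
    return matches / len(profile_a)


# Hypothetical presence (1) / absence (0) profiles across eight genomes.
protein_1 = [1, 1, 0, 0, 1, 0, 1, 1]
protein_2 = [1, 1, 0, 0, 1, 0, 1, 0]   # nearly the same pattern as protein_1
protein_3 = [0, 1, 1, 0, 0, 1, 0, 1]   # largely unrelated pattern

similar_pair = profile_similarity(protein_1, protein_2)
dissimilar_pair = profile_similarity(protein_1, protein_3)
```

Pairs whose profiles match across many genomes are the candidates for co-evolution; a realistic score would also account for how common each protein is and for the relatedness of the genomes.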

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Give appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

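likelihood ratio:

\[ \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})\, P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\, P(\text{Data} \mid \text{false\_PPI})} \]

if > 1 classify as true; if < 1 classify as false

log likelihood ratio:

\[ \log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} \]

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

\[ \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i=1}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} \]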

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

\[ \text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114 / 2150}{81924 / 573734} \]
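For a discrete feature, the likelihood ratio is a ratio of two observed frequencies in the gold-standard sets. The Python sketch below works through the arithmetic, assuming the slide's numbers denote 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives sharing the feature value; the function name is hypothetical.

```python
def likelihood_ratio(n_pos_with_f, n_pos_total, n_neg_with_f, n_neg_total):
    """L = P(f | pos) / P(f | neg), with both terms estimated as frequencies."""
    p_f_given_pos = n_pos_with_f / n_pos_total
    p_f_given_neg = n_neg_with_f / n_neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (an assumption about the original layout).
L_example = likelihood_ratio(1114, 2150, 81924, 573734)
```

A value of L above 1 means the feature value is more common among gold-standard positives than negatives, so observing it raises the score of a candidate pair.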

Fully connected → Compute probabilities for all 16 possible combinations

Interpret with caution as numbers are small

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

\[ \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} \]

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Tagging strategies

TAP-tag (Endogenous protein levels) Tandem purification 1 Protein A-IgG purification 2 Cleave TEV site to elute 3 CBP-Calmodulin purification 4 EGTA to elute

Gavin et al (2002) Nature

Ho et al (2002) Nature over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited Used with permissionSource Gavin Anne-Claude Markus Boumlsche et al Functional Organization of the Yeast Proteome bySystematic Analysis of Protein Complexes Nature 415 no 6868 (2002) 141-7

43

Biotechniques 2008 Apr44(5)655-62 Ratushny V Golemis E

How does this compare to mass-spec based approaches

Yeast two-hybrid

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of Cell Signaling Pathwaysusing the Evolving Yeast Two-hybrid System Biotechniques 44 no 5 (2008) 655

44

bullDoes not require purification ndash will pick up more transient interactions bullBiased against proteins that do not express well or are incompatible with the nucleus

Biotechniques 2008 Apr44(5)655-62 Ratushny V Golemis E

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of CellSignaling Pathways using the Evolving Yeast Two-hybrid System Biotechniques44 no 5 (2008) 655

45

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

46

Error Rates

bull How can we estimate the error rates

Gold standard Experiment

False negatives

True positives

True positives

and false

positives

47

Error Rates

Gold standard

Experiment 1

Experiment 2

True positives

from gold standard

48

Data Integration

Gold standard

Experiment 1

Experiment 2

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard=III 49

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard=III

Mix of true and false

positives

III

50

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard=III

Define IV =

true positives V =

false positives

IV true

V false

III

51

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard=III

Assume III=IIIIV

IV true

V false

III

52

Estimated Error Rates

False positives

True positives

How complete are current yeast and human protein-interaction networks G Traver Hart Arun K Ramani and Edward M Marcotte Genome Biology 2006 7120doi101186gb-2006-7-11-120

I=

II=

III=

Assume that all of regions I and II are true positives If MIPS has no bias toward either Krogan or Gavin then the fraction of TP in MIPS will be the same in the common data (III) and the unique data (IIIIV) I = III II IV

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

bull Take only those that are reported by gt1 method

bull Filter out ldquostickyrdquo proteins bull Estimate probability of each interaction based

on data bull Use external data to predict

55

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

Note log-log plot Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scale DataSets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

56

Finding real interactions

bull Estimate probability of each interaction based on data ndash How can we compute ( )_ |P real PPI Data

( _ _ _ _ )Data two hybrid mass spec co evolution co expression=

( )Data mass_spec_expt_1mass_spec_expt_2=

or

57

Bayes Rule

posterior prior likelihood

58

Naiumlve Bayes Classification

posterior prior likelihood

likelihood ratio = ratio of posterior probabilities

if gt 1 classify as true if lt 1 classify as false

How do we compute this

59

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

60

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

Ranking function = ( | _ )log( | _ )

P Data true PPIP Data false PPI

61

We assume the observations are independent (wersquoll see how to handle dependence soon)

Ranking function =

( | _ )log( | _ )

P Data true PPIP Data false PPI

62

We assume the observations are independent (wersquoll see how to handle dependence soon)

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

63

We assume the observations are independent (wersquoll see how to handle dependence soon) We can compute these terms if we have a set of high-confidence positive and negative interactions Exactly how we compute the terms depends on the type of data For affinity purificationmass spec see Collins et al Mol Cell Proteomics 2007 httpwwwmcponlineorgcontent63439long

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

64

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

TP(T

P+F

N)

ROC curve

True positives = True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67
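The two ROC axes can be computed directly from a scored list of pairs and the two gold-standard sets. The sketch below is an illustration under stated assumptions: the scores and set memberships are hypothetical, and every gold-standard pair that is not called at a given cutoff counts as a false negative (positives) or true negative (negatives).

```python
def roc_points(scores, positives, negatives):
    """Return (FPR, TPR) points obtained by sweeping the score cutoff downward."""
    points = []
    for cutoff in sorted(set(scores.values()), reverse=True):
        called = {pair for pair, score in scores.items() if score >= cutoff}
        tp = len(called & positives)
        fp = len(called & negatives)
        tpr = tp / len(positives)  # TP / (TP + FN)
        fpr = fp / len(negatives)  # FP / (TN + FP)
        points.append((fpr, tpr))
    return points


# Hypothetical scores for four protein pairs.
scores = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.5}
positives = {"a", "c"}  # e.g. pairs within one curated complex
negatives = {"b", "d"}  # e.g. pairs in distinct complexes and compartments

roc = roc_points(scores, positives, negatives)
```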


Collins et al Mol Cell Proteomics 2007

9074 interactions

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]

68


Collins et al Mol Cell Proteomics 2007

9074 interactions

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

69


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

73

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)


75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1) => 2^N - 1 parameters

77
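To see how quickly the full joint table grows, the count above can be compared with the number of parameters a naive Bayes structure needs. The sketch below is illustrative; the function names are hypothetical, and the naive Bayes count assumes one hidden binary cause with binary observations that are independent given the cause.

```python
def full_joint_parameters(n_variables):
    """Free parameters of a full joint table over binary variables: 2^N - 1."""
    return 2 ** n_variables - 1


def naive_bayes_parameters(n_observations):
    """One prior for the hidden cause, plus P(X=1 | cause) for each of the
    two cause states, for every observation."""
    return 1 + 2 * n_observations


# Example: one hidden "interaction is real" node plus four assays.
full = full_joint_parameters(5)
naive = naive_bayes_parameters(4)
```

With four assays the full table needs 31 parameters and the naive Bayes structure needs 9; the gap widens exponentially as observations are added.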

Graphical Structure Expresses our Beliefs

P1-P2 REAL: cause (often "hidden")

Detected by X1, …, Detected by Xn: effects (observed)

78

Graphical Structure Expresses our Beliefs

[Diagram 1 nodes: P1-P2 REAL; Detected by X1, …, Detected by Xn]

[Diagram 2 nodes: P1-P2 REAL; Detected by X1, …, Detected by Xn; Detected by Xj, Detected by Xj+1, …; Membrane; Highly expressed]

79

Graphical Structure Expresses our Beliefs

[Diagram 1 nodes: P1-P2 REAL; Detected by X1, …, Detected by Xn]

[Diagram 2 nodes: P1-P2 REAL; Detected by X1, …, Detected by Xn; Detected by Xj, Detected by Xj+1, …; Membrane; Highly expressed]

Naïve Bayes assumes all observations are independent (given whether the interaction is real)

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs

[Diagram 1 nodes: P1-P2 REAL; Detected by X1, …, Detected by Xn]

[Diagram 2 nodes: P1-P2 REAL; Detected by X1, …, Detected by Xn; Detected by Xj, Detected by Xj+1, …; Membrane; Highly expressed]

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs).

We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram nodes: Smart, Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints

85
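The equivalence of the two formulations can be checked numerically. The short sketch below rebuilds the joint table from the prior and the conditional table given on this slide; the variable names are hypothetical, and the final line shows one inference that follows from the joint table.

```python
# Prior P(S) and conditional P(G | S), with False/True standing for F/T.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint P(S, G) = P(S) * P(G | S)
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

# Inference from the joint table: P(S=T | G=T)
p_smart_given_good_grades = joint[(True, True)] / (
    joint[(True, True)] + joint[(False, True)]
)
```

Both tables carry three free numbers: the joint table has four entries that sum to 1, and the conditional form has one prior plus one entry per row.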

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents".

For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

"Explaining Away"

[Diagram nodes: Season; Sprinkler (S); Rain (R); Grass wet; Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

"Explaining Away"

[Diagram nodes: Coin 1, Coin 2, Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.

88
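The coin example is small enough to enumerate. The sketch below, with hypothetical helper names, computes P(C1=H) under different conditions using the fair-coin probabilities and the score table from the slide (score 1 when the coins match, 0 otherwise).

```python
from itertools import product


def prob(event, given=lambda c1, c2, score: True):
    """P(event | given) by enumerating the four equally likely coin outcomes."""
    numerator = denominator = 0.0
    for c1, c2 in product("HT", repeat=2):
        score = 1 if c1 == c2 else 0
        p = 0.5 * 0.5  # two independent fair coins
        if given(c1, c2, score):
            denominator += p
            if event(c1, c2, score):
                numerator += p
    return numerator / denominator


def c1_is_heads(c1, c2, score):
    return c1 == "H"


p_prior = prob(c1_is_heads)
p_given_c2 = prob(c1_is_heads, lambda c1, c2, s: c2 == "T")
p_given_score = prob(c1_is_heads, lambda c1, c2, s: s == 1)
p_given_both = prob(c1_is_heads, lambda c1, c2, s: s == 1 and c2 == "T")
```

Knowing C2 alone leaves the belief about C1 at 0.5, and so does knowing the score alone. Knowing both fixes C1 completely: the coins become dependent once their common effect is observed.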

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure. How do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

91

Learning Models from Data

[Diagram nodes: Coin 1, Coin 2, Score]

p(C1=H) =      p(C1=T) =
p(C2=H) =      p(C2=T) =

C1   C2   Score
H    H
T    T
H    T
T    H

D = (C1, C2, S) = (H,T,0), (H,H,1), …

92
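When the structure is known and every variable is observed, the maximum-likelihood parameters are simply relative frequencies. The sketch below illustrates this for the coin network; the five observations are hypothetical (only the first two appear on the slide), and the helper names are made up for the example.

```python
from collections import Counter

# Hypothetical fully observed data set D of (C1, C2, Score) tuples.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]


def mle_marginal(values):
    """Maximum-likelihood estimate of a marginal: relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def mle_score_table(observations):
    """Maximum-likelihood estimate of P(Score | C1, C2) by counting."""
    joint = Counter(observations)
    parents = Counter((c1, c2) for c1, c2, _ in observations)
    return {
        (c1, c2, score): n / parents[(c1, c2)]
        for (c1, c2, score), n in joint.items()
    }


p_c1 = mle_marginal(c1 for c1, _, _ in data)
p_c2 = mle_marginal(c2 for _, c2, _ in data)
p_score = mle_score_table(data)
```

With hidden variables or missing observations, counting is no longer enough, which is where iterative methods such as EM come in.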

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked "toy" example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Diagram nodes: S: Smart; G: Grades; R: GREs; A: Admit]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

= P(S) P(G|S) P(R|S) P(A|G,R) (why?)

(because of conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Diagram nodes: S: Smart; G: Grades; R: GREs; A: Admit]

S:
P(F)   P(T)
0.5    0.5

G:
S    P(F)   P(T)
F    0.9    0.1
T    0.8    0.2

R:
S    P(F)   P(T)
F    0.8    0.2
T    0.2    0.8

A:
G    R    P(F)   P(T)
F    F    0.9    0.1
F    T    0.8    0.2
T    F    0.5    0.5
T    T    0.2    0.8

$P(A=T \mid S=T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S=T)\, P(R \mid S=T)\, P(A=T \mid G,R)$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
(terms in order: G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn't help very much

GREs are better correlated with intelligence than the grades

97

Inference with Bayes Nets

[Diagram nodes: S: Smart; G: Grades; R: GREs; A: Admit]

$P(A=T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A=T \mid G,R)$

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) calculated on previous slide = 0.29
P(A=T|S=F) calculated analogously to P(A=T|S=T) = 0.14

P(S=T|A=T) = P(S=T, A=T) / P(A=T)

Or using Bayes Rule: = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Diagram nodes: S: Smart; G: Grades; R: GREs; A: Admit]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$P(R=F \mid A=F) = \dfrac{P(R=F, A=F)}{P(A=F)} = \dfrac{\sum_{S} \sum_{G} P(S)\, P(G \mid S)\, P(R=F \mid S)\, P(A=F \mid G, R=F)}{P(A=F)}$

$P(A=F) = \sum_{S} \sum_{G} \sum_{R} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A=F \mid G,R)$ (as before)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6

99
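The "tedious but straightforward" sums in the worked example can be done by brute-force enumeration over the 16 states of the network. The sketch below uses the conditional probability tables as transcribed on the "Prediction with Bayes Nets" slide; the function and variable names are hypothetical. With those tables, exact enumeration gives P(A=T|S=F) of about 0.16 and P(A=T) of about 0.23, slightly different from the rounded values quoted on the slides.

```python
from itertools import product

# Conditional probability tables (False/True stand for F/T).
P_S = {False: 0.5, True: 0.5}
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P_G[s][g]
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P_R[s][r]
P_ADMIT = {  # P(A=T | G, R), keyed by (g, r)
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a = P_ADMIT[(g, r)] if a else 1.0 - P_ADMIT[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * p_a


def prob(event, given=lambda s, g, r, a: True):
    """P(event | given) by summing the joint over all 16 states."""
    numerator = denominator = 0.0
    for s, g, r, a in product((False, True), repeat=4):
        p = joint(s, g, r, a)
        if given(s, g, r, a):
            denominator += p
            if event(s, g, r, a):
                numerator += p
    return numerator / denominator


p_admit_given_smart = prob(lambda s, g, r, a: a, lambda s, g, r, a: s)
p_admit = prob(lambda s, g, r, a: a)
p_smart_given_admit = prob(lambda s, g, r, a: s, lambda s, g, r, a: a)
p_bad_grades_given_reject = prob(lambda s, g, r, a: not g, lambda s, g, r, a: not a)
p_bad_gres_given_reject = prob(lambda s, g, r, a: not r, lambda s, g, r, a: not a)
```

Enumeration scales as 2^N and is only practical for toy networks; larger networks need inference algorithms that exploit the graph structure.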

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

P1-P2 REAL: cause (often "hidden")

Detected by X1, …, Detected by Xn: effects (observed)

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al. Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high confidence interactions from small scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102
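The idea of comparing expression-profile distances for interacting and random pairs can be sketched as follows. This is only an illustration: the slide does not specify the distance measure, so Euclidean distance is an assumption here, and the protein names and expression profiles are hypothetical.

```python
import math


def profile_distance(profile_a, profile_b):
    """Euclidean distance between two expression profiles (an assumed choice of d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


def mean_pair_distance(pairs, profiles):
    """Average distance over a set of protein pairs."""
    return sum(profile_distance(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


# Hypothetical mRNA expression profiles across three conditions.
profiles = {
    "P1": [1.0, 2.0, 3.0],
    "P2": [1.1, 2.1, 2.9],
    "P3": [3.0, 1.0, 0.0],
    "P4": [0.0, 0.0, 1.0],
}
int_pairs = [("P1", "P2")]                    # stands in for the INT set
random_pairs = [("P1", "P3"), ("P2", "P4")]   # stands in for random pairs

mean_int = mean_pair_distance(int_pairs, profiles)
mean_random = mean_pair_distance(random_pairs, profiles)
```

An experimental data set whose pairs look more like the INT distribution than the random distribution would be judged more reliable.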

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7 doi:10.1186/1471-2105-8-S4-S7

More likely to interact

Which pattern below is more likely to represent a pair of interacting proteins?

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

109

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

likelihood ratio: if > 1 classify as true, if < 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = $\log \dfrac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i}^{M} \dfrac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

$\text{Likelihood} = L = \dfrac{P(f \mid \mathrm{pos})}{P(f \mid \mathrm{neg})} = \dfrac{1114/2150}{81924/573734}$

111
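The likelihood for one feature value is a ratio of two frequencies taken from the gold-standard sets. The sketch below evaluates it for the numbers on this slide, read here as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives; that reading of the run-together digits is an assumption, and the function name is hypothetical.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (assumed to be two fractions).
L_value = likelihood_ratio(1114, 2150, 81924, 573734)
```

Under that reading the ratio comes out at roughly 3.6, meaning the feature value is about 3.6 times more frequent among gold-standard positives than among negatives.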

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected → Compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TF=FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log \dfrac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations

41

Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• overexpression/tagging can influence results
• only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels)

Tandem purification
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

How does this compare to mass-spec based approaches?

Biotechniques. 2008 Apr;44(5):655-62. Ratushny V, Golemis E.

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification, so it will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus


45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard and Experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

47

Error Rates

[Venn diagram: Gold standard, Experiment 1, Experiment 2. Labeled region: true positives from gold standard.]

48

Data Integration

[Venn diagram: Gold standard, Experiment 1, Experiment 2, with regions I and II]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

49

Data Integration

[Venn diagram: Gold standard and Experiment 1, with regions I, II, and III]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Mix of true and false positives

50

Data Integration

[Venn diagram: Gold standard and Experiment 1, with regions I, II, III, IV (true), and V (false)]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Define IV = true positives, V = false positives

51

Data Integration

[Venn diagram: Gold standard and Experiment 1, with regions I, II, III, IV (true), and V (false)]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

52

Estimated Error Rates

[Figure labels: False positives; True positives; I =; II =; III =]

How complete are current yeast and human protein-interaction networks? G Traver Hart, Arun K Ramani and Edward M Marcotte. Genome Biology 2006, 7:120 doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.

53
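The assumption I/II = III/IV can be turned into an estimate of the number of true and false positives among the interactions unique to one experiment. The sketch below is a hedged reading of the slides, not the published procedure: it assumes III counts the unique interactions found in the gold standard, IV counts all true positives among the unique interactions, and V is whatever remains. All counts and function names are hypothetical.

```python
def estimate_unique_true_positives(n_I, n_II, n_III):
    """Solve I/II = III/IV for IV."""
    return n_III * n_II / n_I


def estimate_false_positive_fraction(n_I, n_II, n_III, n_unique):
    """Fraction of the experiment-specific interactions estimated to be false (V)."""
    n_IV = estimate_unique_true_positives(n_I, n_II, n_III)
    n_V = n_unique - n_IV
    return n_V / n_unique


# Hypothetical counts: 100 of 400 consensus interactions are in the gold
# standard; 50 of 1000 experiment-specific interactions are in the gold standard.
n_IV = estimate_unique_true_positives(100, 400, 50)
fp_fraction = estimate_false_positive_fraction(100, 400, 50, 1000)
```

In this made-up case the gold standard covers a quarter of the consensus set, so the 50 gold-standard hits imply about 200 true positives and about 800 false positives among the 1000 unique interactions.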

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein-protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein-protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

Bayes Rule

posterior prior likelihood

58
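For reference, the standard statement of Bayes' rule that the labels above annotate can be written as follows. The symbols true_PPI and Data are borrowed from the ranking-function slides later in the lecture; the exact notation used on this slide is an assumption.

```latex
\underbrace{P(\mathrm{true\_PPI} \mid \mathrm{Data})}_{\text{posterior}}
  = \frac{\overbrace{P(\mathrm{Data} \mid \mathrm{true\_PPI})}^{\text{likelihood}}
          \;\overbrace{P(\mathrm{true\_PPI})}^{\text{prior}}}
         {P(\mathrm{Data})}
```

The denominator P(Data) is the same for the true and false hypotheses, so it cancels when the two posteriors are compared as a ratio.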

Naïve Bayes Classification

[Equation labels: posterior, prior, likelihood]

likelihood ratio = ratio of posterior probabilities

if > 1 classify as true, if < 1 classify as false

How do we compute this?

59

likelihood ratio: if > 1 classify as true, if < 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions, so it does not affect ranking

60

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

Ranking function = ( | _ )log( | _ )

P Data true PPIP Data false PPI

61

We assume the observations are independent (wersquoll see how to handle dependence soon)

Ranking function =

( | _ )log( | _ )

P Data true PPIP Data false PPI

62

We assume the observations are independent (wersquoll see how to handle dependence soon)

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

63

We assume the observations are independent (wersquoll see how to handle dependence soon) We can compute these terms if we have a set of high-confidence positive and negative interactions Exactly how we compute the terms depends on the type of data For affinity purificationmass spec see Collins et al Mol Cell Proteomics 2007 httpwwwmcponlineorgcontent63439long

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

64

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

TP(T

P+F

N)

ROC curve

True positives = True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives FP(TN+FP)

e de

nc )es

onfi ex

pl-c omgh c

ng h

i omfri

us (es

pute

vtiipo

sC

om

ROC curve

)FN+

(TP

TPROC curve

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD) True negatives = proteins pairs that 1 are annotated to belong to distinct complexes 2 have different sub-cellular locations OR anticorrelated mRNA expression

68 67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

Literature curated (excluding 2-hybrid)

Error rate equivalent to

literature curated

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

bull Bayesian Networks are a tool for reasoning with probabilities

bull Consist of a graph (network) and a set of probabilities

bull These can be ldquolearnedrdquo from the data

Bayesian Networks

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|Y2HUetz Y2HIto IPGavin IPHo)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|RosettaCCGOMIPSESS) But some problems are too big

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

bull Complete joint probability tables are large and often unknown

bull N binary variables = 2N states ndash only one constraint (sum of all probabilities =1)

=gt 2N - 1 parameters

77

Graphical Structure Expresses our Beliefs

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed)

78

Graphical Structure Expresses our Beliefs

[Figure: two networks. In the first, "P1-P2 REAL" points to "Detected by X1" … "Detected by Xn". In the second, the same network also contains the nodes "Membrane" and "Highly expressed" alongside "Detected by Xj", "Detected by Xj+1", …]

• Naïve Bayes assumes all observations are independent (given the cause)
• But some observations may be coupled
• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Figure: network with nodes "Grade inflation", "Smart", "Grades", "GREs", and "Admit".]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).
  Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).
  You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Figure: network with nodes "Grade inflation", "Smart", "Grades", "GREs", and "Admit".]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Figure: Smart → Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints.

85
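As a quick check that the two formulations carry the same information, the following sketch rebuilds the joint table from the prior and the conditional table, then recovers a conditional from the joint. The variable and function names are illustrative; the numbers are the ones on the slide.

```python
import math

# Prior P(S) and conditional P(G | S), copied from the tables above.
p_smart = {False: 0.7, True: 0.3}
p_grade_given_smart = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint table: P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_smart[s] * p_grade_given_smart[s][g]
    for s in (False, True)
    for g in (False, True)
}


def recover_prior(joint_table, s):
    """Marginalize G out of the joint to get P(S = s)."""
    return sum(p for (s_val, _), p in joint_table.items() if s_val == s)


def recover_conditional(joint_table, g, s):
    """P(G = g | S = s) = P(S = s, G = g) / P(S = s)."""
    return joint_table[(s, g)] / recover_prior(joint_table, s)


for (s, g), p in sorted(joint.items()):
    print(f"S={s!s:5} G={g!s:5} P(S,G)={p:.3f}")
```

Both tables have three free parameters: the joint has four entries that sum to 1, and the factored form has one prior plus one conditional per value of S.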

Automated Admissions Decisions

[Figure: network with nodes "Grade inflation", "Smart", "Grades", "GREs", and "Admit".]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

"Explaining Away"

[Figure: network with nodes "Season", "Sprinkler" (S), "Rain" (R), "Grass wet", and "Slippery".]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, once we know the grass is wet, the knowledge that it is raining influences our beliefs about the sprinkler.

87

"Explaining Away"

[Figure: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5    p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
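The coin example is small enough to enumerate. The sketch below (helper names are illustrative) lists the four equally likely outcomes and compares what we believe about Coin 1 with and without knowing the score.

```python
from itertools import product

# Each row: (coin1, coin2, score, probability); score is 1 when the coins match.
outcomes = [(c1, c2, int(c1 == c2), 0.25) for c1, c2 in product("HT", repeat=2)]


def probability(event, given=lambda row: True):
    """P(event | given), summing over the four equally likely outcomes."""
    denominator = sum(row[3] for row in outcomes if given(row))
    numerator = sum(row[3] for row in outcomes if given(row) and event(row))
    return numerator / denominator


def c1_heads(row):
    return row[0] == "H"


# Without the score, Coin 2 tells us nothing about Coin 1.
p_marginal = probability(c1_heads)
p_given_c2 = probability(c1_heads, lambda row: row[1] == "T")

# Once the score is known, Coin 2 determines Coin 1.
p_given_score = probability(c1_heads, lambda row: row[2] == 1)
p_given_score_and_c2 = probability(
    c1_heads, lambda row: row[2] == 1 and row[1] == "T"
)

print(p_marginal, p_given_c2, p_given_score, p_given_score_and_c2)
```

The two coins are independent on their own, but they become dependent as soon as their common effect (the score) is observed.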

88

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?

• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data X_i
  – Maximum posterior

• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Figure: Coin 1 → Score ← Coin 2]

p(C1=H) = ?    p(C1=T) = ?
p(C2=H) = ?    p(C2=T) = ?

C1    C2    Score
H     H     ?
T     T     ?
H     T     ?
T     H     ?

D = (C1, C2, S) = (H, T, 0), (H, H, 1), …
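When the structure is known and every variable is observed, maximum-likelihood estimates of the table entries reduce to counting. The sketch below illustrates this for the coin network; the five data rows are hypothetical (only the first two appear on the slide) and the function names are illustrative.

```python
from collections import Counter

# Hypothetical training data D = (C1, C2, Score); only the first two rows
# come from the slide, the rest are made up for illustration.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]


def mle_coin(rows, index):
    """Maximum-likelihood estimate of P(coin = side): relative frequency."""
    counts = Counter(row[index] for row in rows)
    return {side: counts[side] / len(rows) for side in "HT"}


def mle_score_given_coins(rows):
    """Maximum-likelihood estimate of P(Score = 1 | C1, C2) for each
    parent configuration that occurs in the data."""
    totals = Counter((c1, c2) for c1, c2, _ in rows)
    ones = Counter((c1, c2) for c1, c2, score in rows if score == 1)
    return {parents: ones[parents] / n for parents, n in totals.items()}


p_c1 = mle_coin(data, 0)
p_c2 = mle_coin(data, 1)
p_score = mle_score_given_coins(data)
print(p_c1, p_c2, p_score)
```

With missing or hidden variables, simple counting no longer works, which is where search methods such as EM come in.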

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked "toy" example

Best to work through on your own

95

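Chain Rule of Probability for Bayes Nets

[Figure: network S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A.]

$$P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R)$$

$$= P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R) \quad \text{(why?)}$$

(because of the conditional independence assumption)

Recall: $P(X,Y) = P(X \mid Y)\,P(Y)$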

96

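Prediction with Bayes Nets

[Figure: network S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A.]

P(S):
P(S=F)    P(S=T)
0.5       0.5

P(G|S):
S    P(G=F)    P(G=T)
F    0.9       0.1
T    0.8       0.2

P(R|S):
S    P(R=F)    P(R=T)
F    0.8       0.2
T    0.2       0.8

P(A|G,R):
G    R    P(A=F)    P(A=T)
F    F    0.9       0.1
F    T    0.8       0.2
T    F    0.5       0.5
T    T    0.2       0.8

$$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\,P(R \mid S{=}T)\,P(A{=}T \mid G,R)$$

$$= \underbrace{(0.8)(0.2)(0.1)}_{G=F,\,R=F} + \underbrace{(0.8)(0.8)(0.2)}_{G=F,\,R=T} + \underbrace{(0.2)(0.2)(0.5)}_{G=T,\,R=F} + \underbrace{(0.2)(0.8)(0.8)}_{G=T,\,R=T} \approx 0.29$$

Rough grading: being smart doesn't help very much.

GREs are better correlated with intelligence than the grades.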
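The double sum for P(A=T|S) can be checked by direct enumeration. This is a minimal sketch using the conditional probability tables from this slide; the dictionary and function names are illustrative.

```python
from itertools import product
import math

# Conditional probability tables from the slide (False = F, True = T).
p_grade_given_smart = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
p_gre_given_smart = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
# P(A=T | G, R), keyed by (grade, gre).
p_admit_given_grade_gre = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def p_admit_given_smart(smart):
    """P(A=T | S=smart): sum over the four (G, R) combinations."""
    return sum(
        p_grade_given_smart[smart][g]
        * p_gre_given_smart[smart][r]
        * p_admit_given_grade_gre[(g, r)]
        for g, r in product((False, True), repeat=2)
    )


print(round(p_admit_given_smart(True), 3))   # 0.292
print(round(p_admit_given_smart(False), 3))  # 0.164
```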

97

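Inference with Bayes Nets

[Figure: network S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A.]

$$P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}T \mid G,R)$$

$$= P(S{=}T)\,P(A{=}T \mid S{=}T) + P(S{=}F)\,P(A{=}T \mid S{=}F) \approx 0.23$$

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) ≈ 0.29 (calculated on the previous slide)
P(A=T|S=F) ≈ 0.16 (calculated analogously to P(A=T|S=T))

$$P(S{=}T \mid A{=}T) = \frac{P(S{=}T, A{=}T)}{P(A{=}T)}$$

Or, using Bayes Rule:

$$P(S{=}T \mid A{=}T) = \frac{P(S{=}T)\,P(A{=}T \mid S{=}T)}{P(A{=}T)}$$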

98

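Inference with Bayes Nets

[Figure: network S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A.]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

$$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} P(S)\,P(G \mid S)\,P(R{=}F \mid S)\,P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}F \mid G,R) \quad \text{(as before)}$$

Tedious but straightforward to compute:

$$\frac{P(G{=}F \mid A{=}F)}{P(R{=}F \mid A{=}F)} \approx \frac{0.94}{0.55} \approx 1.7$$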
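The "tedious but straightforward" sums are easy to hand to a few lines of code. The sketch below enumerates the full joint distribution defined by the tables on the Prediction slide and answers conditional queries by summing matching entries; the helper names are illustrative, and brute-force enumeration is only practical for networks this small.

```python
from itertools import product

BOOL = (False, True)
P_S = {False: 0.5, True: 0.5}
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P(G | S)
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P(R | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}                     # P(A=T | G, R)


def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a = P_A_TRUE[(g, r)] if a else 1.0 - P_A_TRUE[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * p_a


def probability(event, given=lambda s, g, r, a: True):
    """P(event | given) by summing the joint over all 16 assignments."""
    numerator = denominator = 0.0
    for s, g, r, a in product(BOOL, repeat=4):
        if given(s, g, r, a):
            p = joint(s, g, r, a)
            denominator += p
            if event(s, g, r, a):
                numerator += p
    return numerator / denominator


def rejected(s, g, r, a):
    return not a


p_admit = probability(lambda s, g, r, a: a)
p_smart_given_admit = probability(lambda s, g, r, a: s, given=lambda s, g, r, a: a)
p_bad_grades_given_reject = probability(lambda s, g, r, a: not g, given=rejected)
p_bad_gre_given_reject = probability(lambda s, g, r, a: not r, given=rejected)

print(p_admit, p_smart_given_admit)
print(p_bad_grades_given_reject, p_bad_gre_given_reject)
```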

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Figure: the node "P1-P2 REAL" (cause, often "hidden") points to the nodes "Detected by X1" … "Detected by Xn" (effects, observed).]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: candidate patterns; one is labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

108

109

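likelihood ratio = $\dfrac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \dfrac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\,P(\mathrm{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log\dfrac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log\dfrac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function =

$$\log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$$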

110

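Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

Likelihood = L = $\dfrac{P(f \mid \mathrm{pos})}{P(f \mid \mathrm{neg})}$

Example counts from the table: P(f | pos) = 1114/2150, P(f | neg) = 81924/573734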
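As a worked instance of L = P(f|pos)/P(f|neg), the sketch below plugs in the counts that appear on this slide. Reading the digit strings as 1114 out of 2150 gold-standard positives and 81924 out of 573734 gold-standard negatives is an assumption about the garbled table, and the function name is illustrative.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), each estimated as a fraction of the
    gold-standard pairs that show feature value f."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (assumed split of the digit strings).
example_ratio = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(example_ratio, 1))
```

A ratio above 1 means the feature value is enriched among gold-standard positives relative to negatives.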

111

112

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

113

Fully connected → Compute probabilities for all 16 possible combinations

114

Interpret with caution, as numbers are small

115

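[Figure: TP/FP plotted against the likelihood cutoff; a line marks TP = FP.]

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

Likelihood cutoff: $\log\dfrac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$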


116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Mass-spec for protein-protein interactions

• Extremely efficient method for detecting interactions
• Proteins are in their correct subcellular location

Limitations
• overexpression/tagging can influence results
• only long-lived complexes will be detected

42

Tagging strategies

TAP-tag (endogenous protein levels)

Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Biotechniques. 2008 Apr;44(5):655-62. Ratushny V, Golemis E.

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification, so it will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

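Error Rates

• How can we estimate the error rates?

[Figure: Venn diagram of a gold standard and an experiment. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

47

Error Rates

[Figure: Venn diagram of the gold standard, Experiment 1, and Experiment 2; the overlap with the gold standard is labeled "True positives from gold standard".]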

48

Data Integration

[Figure: Venn diagram of the gold standard, Experiment 1, and Experiment 2, built up over several slides with regions I through V.]

I = true positives from gold standard
II = consensus true positives
Fraction of consensus present in gold standard = I/II

III = data unique to one experiment that is present in the gold standard; the rest of the unique data is a mix of true and false positives

Define IV = true positives, V = false positives

Assume I/II = III/IV

52

Estimated Error Rates

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120, doi:10.1186/gb-2006-7-11-120

[Figure: overlapping data sets with regions I, II, and III marked, and the unique data split into true positives and false positives.]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{I}{II} = \frac{III}{IV}$$
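Rearranging I/II = III/IV gives IV = III × II / I, and the false positives V are whatever remains of the unique data. A minimal sketch of that arithmetic follows; the function name and the counts in the usage line are hypothetical, not values from the paper.

```python
def estimate_unique_true_positives(region_i, region_ii, region_iii, unique_total):
    """Estimate true (IV) and false (V) positives in the unique data.

    region_i     : consensus interactions that are in the gold standard
    region_ii    : all consensus interactions (assumed true)
    region_iii   : unique interactions that are in the gold standard
    unique_total : all interactions unique to one experiment
    """
    if region_i == 0:
        raise ValueError("region I must be non-zero to estimate the fraction")
    region_iv = region_iii * region_ii / region_i  # from I/II = III/IV
    region_v = unique_total - region_iv            # remaining unique data
    return region_iv, region_v


# Hypothetical counts for illustration only.
true_pos, false_pos = estimate_unique_true_positives(50, 100, 30, 200)
print(true_pos, false_pos, false_pos / 200)
```

The estimate is only as good as the assumption that the gold standard samples consensus and unique true positives at the same rate.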

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.

53

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\mathrm{real\_PPI} \mid \mathrm{Data})$?

$\mathrm{Data} = (\mathrm{two\_hybrid}, \mathrm{mass\_spec}, \mathrm{co\_evolution}, \mathrm{co\_expression})$

or

$\mathrm{Data} = (\mathrm{mass\_spec\_expt\_1}, \mathrm{mass\_spec\_expt\_2})$

57

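Bayes Rule

$$\underbrace{P(\mathrm{true\_PPI} \mid \mathrm{Data})}_{\text{posterior}} = \frac{\overbrace{P(\mathrm{true\_PPI})}^{\text{prior}}\;\overbrace{P(\mathrm{Data} \mid \mathrm{true\_PPI})}^{\text{likelihood}}}{P(\mathrm{Data})}$$

58

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities

if > 1 classify as true; if < 1 classify as false

How do we compute this?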

59

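likelihood ratio = $\dfrac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \dfrac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\,P(\mathrm{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log\dfrac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log\dfrac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function = $\log\dfrac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$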

61

We assume the observations are independent (we'll see how to handle dependence soon).

Ranking function =

$$\log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al. Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
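Under the independence assumption, the log of the product becomes a sum of per-observation log ratios, which is easy to compute. The sketch below is illustrative only: the assay names and the detection rates are hypothetical stand-ins for values that would be estimated from high-confidence positive and negative sets.

```python
import math

# Hypothetical rates: (P(detected | true_PPI), P(detected | false_PPI)).
assay_rates = {
    "two_hybrid": (0.20, 0.01),
    "mass_spec": (0.40, 0.05),
    "co_expression": (0.60, 0.30),
}


def log_likelihood_ratio(observations):
    """Sum of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)] over the
    assays in `observations` (a dict of assay name -> detected or not).
    Assays that were not run are simply left out of the dict."""
    score = 0.0
    for assay, detected in observations.items():
        p_true, p_false = assay_rates[assay]
        if not detected:
            # Use the complementary probabilities for a negative readout.
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        score += math.log(p_true / p_false)
    return score


seen_twice = log_likelihood_ratio({"two_hybrid": True, "mass_spec": True})
seen_once = log_likelihood_ratio({"two_hybrid": True, "mass_spec": False})
print(seen_twice, seen_once)
```

Candidate pairs can then be sorted by this score; a pair missing from one assay is penalized rather than discarded.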

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

65

ROC curve

Collins et al. Mol Cell Proteomics 2007

[Figure: ROC curve.]
x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

True positives =
True negatives =

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

66

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

ROC curve

Collins et al. Mol Cell Proteomics 2007

[Figure: ROC curve.]
x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression
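Given scored pairs and gold-standard labels, the two axes of the ROC curve can be computed by sweeping a threshold over the scores. This is a minimal sketch with made-up scores and labels; the function name is illustrative, and it assumes at least one positive and one negative in the gold standard.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per
    distinct score threshold, from the strictest threshold to the loosest.

    labels[i] is True for a gold-standard positive, False for a negative.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, is_pos in zip(scores, labels) if s >= threshold and is_pos)
        fp = sum(1 for s, is_pos in zip(scores, labels) if s >= threshold and not is_pos)
        # x = FP/(TN+FP), y = TP/(TP+FN)
        points.append((fp / negatives, tp / positives))
    return points


# Made-up example: six scored pairs, three positives and three negatives.
example_scores = [3.2, 2.5, 1.1, 0.4, -0.7, -1.5]
example_labels = [True, True, False, True, False, False]
example_curve = roc_points(example_scores, example_labels)
for fpr, tpr in example_curve:
    print(f"FPR={fpr:.2f} TPR={tpr:.2f}")
```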

67

ROC curve

Collins et al. Mol Cell Proteomics 2007

[Figure: ROC curve.]
x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)
Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated"

68

69

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

73

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations


Positive training data MIPS bull Hand-curated from literature Negative training data bull Proteins in different subcellular compartments

107

Integrating diverse data

109

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

110

Protein pairs in the essentiality data can take on three discrete values (EE both essential NN both non-essential and NE one essential and one not)

11142150

81924573734

Likelihood=L= )|()|(

negfPposfP

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected rarr Compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TF=FP

prediction based on single data type all have TPFPlt1

How many gold-standard events do we score correctly at different likelihood cutoffs

( | _ )log( | _ )

P Data true PPIP Data false PPI

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

117

MIT OpenCourseWarehttpocwmitedu

791J 20490J 20390J 736J 6802J 6874J HST506J Foundations of Computational and Systems BiologySpring 2014

For information about citing these materials or our Terms of Use visit httpocwmiteduterms

  • Slide Number 1
  • Predictions
  • Prediction Challenges
  • Slide Number 4
  • Slide Number 5
  • Predicting Structures of Complexes
  • Docking
  • Docking
  • Reducing the search space
  • Next
  • PRISMrsquos Rationale
  • Slide Number 12
  • Subtilisin and its inhibitors
  • Hotspots
  • Hotspots
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Slide Number 19
  • Slide Number 20
  • Slide Number 21
  • Flowchart
  • Slide Number 23
  • Next
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Slide Number 30
  • Slide Number 31
  • Slide Number 32
  • Slide Number 33
  • Slide Number 34
  • Slide Number 35
  • Slide Number 36
  • Slide Number 37
  • Slide Number 38
  • Outline
  • Detecting protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Tagging strategies
  • Yeast two-hybrid
  • Slide Number 45
  • Outline
  • Error Rates
  • Error Rates
  • Data Integration
  • Data Integration
  • Data Integration
  • Data Integration
  • Estimated Error Rates
  • Slide Number 54
  • Finding real interactions
  • Slide Number 56
  • Finding real interactions
  • Bayes Rule
  • Naiumlve Bayes Classification
  • Slide Number 61
  • Slide Number 62
  • Slide Number 63
  • Slide Number 64
  • Slide Number 65
  • Slide Number 66
  • ROC curve
  • ROC curve
  • ROC curve
  • ROC curve
  • Outline
  • Bayesian Networks
  • In Biology
  • Slide Number 74
  • Slide Number 75
  • Slide Number 76
  • Slide Number 77
  • Bayesian Networks
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Slide Number 83
  • Slide Number 84
  • Slide Number 85
  • Slide Number 86
  • Slide Number 87
  • ldquoExplaining Awayrdquo
  • Slide Number 89
  • How do we obtain a BN
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Resources to learn more
  • A worked ldquotoyrdquo example
  • Chain Rule of Probability for Bayes Nets
  • Prediction with Bayes Nets
  • Inference with Bayes Nets
  • Inference with Bayes Nets
  • End of worked example
  • Goal
  • Slide Number 103
  • Co-evolution
  • Rosetta Stone
  • Integrating diverse data
  • Advantage of Bayesian Networks
  • Requirement of Bayesian Classification
  • Integrating diverse data
  • Slide Number 110
  • Slide Number 111
  • Slide Number 112
  • Slide Number 113
  • Slide Number 114
  • Slide Number 115
  • Slide Number 116
  • Slide Number 117
  • Summary
Page 43: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

Tagging strategies

TAP-tag (endogenous protein levels)

Tandem purification:
1. Protein A-IgG purification
2. Cleave TEV site to elute
3. CBP-Calmodulin purification
4. EGTA to elute

Gavin et al. (2002) Nature

Ho et al. (2002) Nature: over-expressed proteins and used only one tag

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Gavin, Anne-Claude, Markus Bösche, et al. "Functional Organization of the Yeast Proteome by Systematic Analysis of Protein Complexes." Nature 415, no. 6868 (2002): 141-7.

43

Yeast two-hybrid

Ratushny V, Golemis E. Biotechniques 2008 Apr;44(5):655-62

How does this compare to mass-spec based approaches?

Courtesy of BioTechniques. Used with permission. Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." Biotechniques 44, no. 5 (2008): 655.

44

• Does not require purification, so it will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Ratushny V, Golemis E. Biotechniques 2008 Apr;44(5):655-62

45

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

46

Error Rates

• How can we estimate the error rates?

[Venn diagram: Gold standard vs. Experiment. Labeled regions: false negatives; true positives; true positives and false positives.]

47

Error Rates

[Venn diagram: Gold standard, Experiment 1 and Experiment 2. Labeled region: true positives from gold standard.]

48

Data Integration

[Venn diagram: Gold standard, Experiment 1 and Experiment 2, with regions labeled I and II.]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

49

Data Integration

[Same diagram, with region III added.]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Mix of true and false positives

50

Data Integration

[Same diagram, with regions IV (true) and V (false) added.]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Define IV = true positives, V = false positives

51

Data Integration

[Same diagram, with regions I to V.]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

52

Estimated Error Rates

[Figure: estimated false positives and true positives, with regions I, II and III marked.]

"How complete are current yeast and human protein-interaction networks?" G. Traver Hart, Arun K. Ramani and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of true positives (TP) in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV
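A minimal sketch of how the assumption I/II = III/IV can be turned into an error-rate estimate. The function names and the counts in the usage note are hypothetical and are not taken from Hart et al.; the region labels follow the slides, with IV and V taken to partition the interactions unique to one experiment.

```python
def estimate_true_positives(n_i, n_ii, n_iii):
    """Estimate IV, the number of true positives among the interactions
    unique to one experiment, from the assumption I/II = III/IV."""
    if n_i <= 0:
        raise ValueError("region I must be non-empty to estimate IV")
    return n_iii * n_ii / n_i


def estimate_false_positive_fraction(n_i, n_ii, n_iii, n_unique):
    """Fraction of the unique interactions estimated to be false positives,
    V / (IV + V), taking IV + V = n_unique."""
    n_iv = min(estimate_true_positives(n_i, n_ii, n_iii), n_unique)
    return 1.0 - n_iv / n_unique
```

With made-up counts I = 300, II = 1000, III = 150 and 2000 unique interactions, the estimate is IV = 500 true positives, which corresponds to a false-positive fraction of 0.75.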

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.

53

Source: Hart et al. Genome Biology 7, no. 11 (2006): 120.

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

"Comparative assessment of large-scale data sets of protein-protein interactions." von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein-protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  - How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

57

Bayes Rule

P(true_PPI | Data) = P(Data | true_PPI) P(true_PPI) / P(Data)

posterior = likelihood × prior / P(Data)

58

Naïve Bayes Classification

posterior ∝ prior × likelihood

likelihood ratio = ratio of posterior probabilities = P(true_PPI | Data) / P(false_PPI | Data)

if > 1 classify as true; if < 1 classify as false

How do we compute this?

59

likelihood ratio = P(true_PPI | Data) / P(false_PPI | Data) = [P(Data | true_PPI) P(true_PPI)] / [P(Data | false_PPI) P(false_PPI)]

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = log [P(true_PPI) / P(false_PPI)] + log [P(Data | true_PPI) / P(Data | false_PPI)]

Prior probability is the same for all interactions, so it does not affect ranking

60

likelihood ratio = P(true_PPI | Data) / P(false_PPI | Data)

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = log [P(true_PPI) / P(false_PPI)] + log [P(Data | true_PPI) / P(Data | false_PPI)]

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = log [P(Data | true_PPI) / P(Data | false_PPI)]

61

We assume the observations are independent (we'll see how to handle dependence soon)

Ranking function = log [P(Data | true_PPI) / P(Data | false_PPI)]

62

We assume the observations are independent (we'll see how to handle dependence soon)

Ranking function = log [P(Data | true_PPI) / P(Data | false_PPI)] = log ∏_{i=1}^{M} [P(Observation_i | true_PPI) / P(Observation_i | false_PPI)]

63

We assume the observations are independent (we'll see how to handle dependence soon).

We can compute these terms if we have a set of high-confidence positive and negative interactions.

Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al. Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function = log [P(Data | true_PPI) / P(Data | false_PPI)] = log ∏_{i=1}^{M} [P(Observation_i | true_PPI) / P(Observation_i | false_PPI)]
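The ranking function can be sketched as a sum of per-observation log ratios, since the log of a product of independent terms is a sum of logs. The conditional probabilities and protein-pair names below are hypothetical placeholders; in practice they would be estimated from high-confidence positive and negative sets.

```python
import math


def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum over independent observations of
    log P(obs_i | true_PPI) / P(obs_i | false_PPI)."""
    total = 0.0
    for name, value in observations.items():
        total += math.log(p_given_true[name][value] / p_given_false[name][value])
    return total


# Hypothetical P(assay result | true_PPI) and P(assay result | false_PPI).
p_true = {"two_hybrid": {True: 0.30, False: 0.70},
          "mass_spec": {True: 0.60, False: 0.40}}
p_false = {"two_hybrid": {True: 0.03, False: 0.97},
           "mass_spec": {True: 0.10, False: 0.90}}

# Hypothetical observations for three candidate pairs.
pairs = {
    "P1-P2": {"two_hybrid": True, "mass_spec": True},
    "P1-P3": {"two_hybrid": False, "mass_spec": True},
    "P2-P3": {"two_hybrid": False, "mass_spec": False},
}

# Rank pairs from most to least likely to be a true interaction.
ranked = sorted(
    pairs,
    key=lambda k: log_likelihood_ratio(pairs[k], p_true, p_false),
    reverse=True,
)
```

A pair detected by only one assay can still receive a positive score, so this ranking is less strict than requiring detection in every assay.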

64

"Comparative assessment of large-scale data sets of protein-protein interactions." von Mering et al. Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by log [P(Data | true_PPI) / P(Data | false_PPI)].

Source: Von Mering et al. Nature 417, no. 6887 (2002): 399-403.

65

ROC curve

Collins et al. Mol Cell Proteomics 2007

[ROC plot. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

66

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

ROC curve

Collins et al. Mol Cell Proteomics 2007

[ROC plot. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67
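A minimal sketch of how the two ROC axes can be computed from scored gold-standard pairs. The function name, scores and cutoffs are hypothetical; a pair is called positive when its score is at or above the cutoff.

```python
def roc_points(scores_pos, scores_neg, cutoffs):
    """Return (FP/(TN+FP), TP/(TP+FN)) for each score cutoff.

    scores_pos: scores of high-confidence positive pairs
    scores_neg: scores of high-confidence negative pairs
    """
    points = []
    for cutoff in cutoffs:
        tp = sum(s >= cutoff for s in scores_pos)
        fn = len(scores_pos) - tp
        fp = sum(s >= cutoff for s in scores_neg)
        tn = len(scores_neg) - fp
        points.append((fp / (tn + fp), tp / (tp + fn)))
    return points


# Hypothetical log likelihood ratio scores.
positives = [2.5, 1.8, 0.9, -0.2]
negatives = [1.0, -0.5, -1.2, -2.0]
curve = roc_points(positives, negatives, cutoffs=[2.0, 0.0, -3.0])
```

Lowering the cutoff moves along the curve toward the upper right: more true positives are recovered at the cost of more false positives.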

ROC curve

Collins et al. Mol Cell Proteomics 2007

[ROC plot. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: 9074 interactions; literature curated (excluding 2-hybrid); error rate equivalent to literature curated.]

68

ROC curve

Collins et al. Mol Cell Proteomics 2007

[ROC plot. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotation: 9074 interactions.]

69

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

73

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

Source: Jansen et al. Science 302, no. 5644 (2003): 449-53.

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  - only one constraint (sum of all probabilities = 1)
    => 2^N - 1 parameters
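To make the size argument concrete, the sketch below compares the parameter count of a full joint table with that of a network in which one binary hidden cause has n binary observed effects that are independent given the cause. The function names are hypothetical, and the second count assumes exactly that naive structure.

```python
def full_joint_parameters(n_variables):
    """Free parameters of a complete joint table over binary variables:
    2^N states minus the one sum-to-one constraint."""
    return 2 ** n_variables - 1


def naive_bayes_parameters(n_observations):
    """Free parameters when one binary cause has n binary effects that are
    independent given the cause: 1 for P(cause), plus P(effect=T | cause=T)
    and P(effect=T | cause=F) for each effect."""
    return 1 + 2 * n_observations


# One hidden cause plus four observed assays: five binary variables in total.
full = full_joint_parameters(5)
naive = naive_bayes_parameters(4)
```

For five variables the full table needs 31 parameters and the factored model needs 9; the gap grows exponentially with the number of observations.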

77

Graphical Structure Expresses our Beliefs

[Network diagram: the cause "P1-P2 REAL" (often "hidden") points to the observed effects "Detected by X1" ... "Detected by Xn".]

78

Graphical Structure Expresses our Beliefs

[Two network diagrams. First: "P1-P2 REAL" with "Detected by X1" ... "Detected by Xn". Second: "P1-P2 REAL" with "Detected by X1" ... "Detected by Xn", "Detected by Xj", "Detected by Xj+1", ... and the additional nodes "Membrane" and "Highly expressed".]

79

Graphical Structure Expresses our Beliefs

[Same two diagrams.]

Naïve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs

[Same two diagrams.]

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Network diagram with nodes: Grade inflation, Smart, Grades, GREs, Admit.]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Network diagram with nodes: Grade inflation, Smart, Grades, GREs, Admit.]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Network diagram with nodes: Smart, Grades.]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints

85
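The equivalence of the two formulations can be checked directly: multiplying the prior P(S) by the conditional P(G|S) reproduces the joint table. This is a small sketch using the numbers from the slide; the variable names are arbitrary.

```python
# Prior and conditional tables from the slide (False = F, True = T).
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

# Going the other way: recover P(S) and P(G | S) from the joint table.
recovered_p_s = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}
recovered_p_g_given_s = {
    s: {g: joint[(s, g)] / recovered_p_s[s] for g in (False, True)}
    for s in (False, True)
}
```

Both forms have three free parameters: the joint table has four entries with one sum-to-one constraint, and the conditional form has one number for P(S) plus one for each row of P(G|S).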

Automated Admissions Decisions

[Network diagram with nodes: Grade inflation, Smart, Grades, GREs, Admit.]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

"Explaining Away"

[Network diagram with nodes: Season, S (Sprinkler), R (Rain), Grass wet, Slippery.]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

"Explaining Away"

[Network diagram with nodes: Coin 1, Coin 2, Score.]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
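The two-coin example is small enough to check by brute-force enumeration. The sketch below (function names are hypothetical) computes P(C1=H | evidence) for two fair, independent coins, with Score = 1 when the coins match, as in the table.

```python
from itertools import product


def score(c1, c2):
    """Score from the table: 1 if the two coins match, 0 otherwise."""
    return 1 if c1 == c2 else 0


def prob_c1_heads(evidence):
    """P(C1=H | evidence) by enumerating the four equally likely outcomes.
    evidence is a dict such as {"Score": 1, "C2": "T"}."""
    numerator = 0.0
    denominator = 0.0
    for c1, c2 in product("HT", repeat=2):
        world = {"C1": c1, "C2": c2, "Score": score(c1, c2)}
        if all(world[key] == value for key, value in evidence.items()):
            denominator += 0.25
            if c1 == "H":
                numerator += 0.25
    return numerator / denominator
```

Without the score, learning C2 changes nothing: `prob_c1_heads({"C2": "T"})` is 0.5. Once the score is known, C2 becomes informative: `prob_c1_heads({"Score": 1, "C2": "T"})` is 0.0, while `prob_c1_heads({"Score": 1, "C2": "H"})` is 1.0.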

88

How do we obtain a BN

• Two problems
  - learning graph structure
    • NP-complete
    • approximation algorithms
  - probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  - Maximum likelihood
    • Define the likelihood over training data Xi
  - Maximum posterior
• Good search algorithms exist
  - Gradient descent, EM, Gibbs sampling, ...

91

Learning Models from Data

[Network diagram with nodes: Coin 1, Coin 2, Score.]

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = (C1, C2, S) = (H,T,0), (H,H,1), ...
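When the structure is known and every variable is observed, the maximum-likelihood estimate of each coin's bias is simply its observed frequency. A minimal sketch follows; the function name and the four data rows are made up for illustration (only the first two rows appear on the slide).

```python
from collections import Counter


def ml_estimate(data):
    """Maximum-likelihood estimates of p(C1=H) and p(C2=H) from fully
    observed rows (c1, c2, score): relative frequencies of heads."""
    n = len(data)
    c1_counts = Counter(row[0] for row in data)
    c2_counts = Counter(row[1] for row in data)
    return {"p(C1=H)": c1_counts["H"] / n, "p(C2=H)": c2_counts["H"] / n}


# Hypothetical training data D = (C1, C2, S).
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("H", "T", 0)]
estimates = ml_estimate(D)
```

With hidden variables or missing values, counting is no longer enough, which is where iterative methods such as EM come in.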

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  - Too many possible structures to evaluate all of them, even for very small networks
  - Many algorithms have been proposed
  - Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked "toy" example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Network diagram with nodes: S (Smart), G (Grades), R (GREs), A (Admit).]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)   (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Network diagram with nodes: S (Smart), G (Grades), R (GREs), A (Admit).]

S:
P(F)   P(T)
0.5    0.5

G:
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R:
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)
           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
             with (G,R) = (F,F), (F,T), (T,F), (T,T) respectively

Rough grading: being smart doesn't help very much

GREs are better correlated with intelligence than the grades
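The prediction sum can be reproduced in a few lines. This sketch encodes the conditional probability tables as transcribed above (storing only the P(T) column) and sums over G and R; the helper names are hypothetical.

```python
# P(G=T | S), P(R=T | S) and P(A=T | G, R), keyed by the parent values.
P_G_T = {False: 0.1, True: 0.2}
P_R_T = {False: 0.2, True: 0.8}
P_A_T = {(False, False): 0.1, (False, True): 0.2,
         (True, False): 0.5, (True, True): 0.8}


def bern(p_true, value):
    """Probability of a binary value given the probability that it is True."""
    return p_true if value else 1.0 - p_true


def p_admit_given_smart(s):
    """P(A=T | S=s) = sum over G, R of P(G|S=s) P(R|S=s) P(A=T|G,R)."""
    total = 0.0
    for g in (False, True):
        for r in (False, True):
            total += bern(P_G_T[s], g) * bern(P_R_T[s], r) * P_A_T[(g, r)]
    return total
```

With these tables, `p_admit_given_smart(True)` evaluates to 0.292, matching the 0.29 on the slide, and `p_admit_given_smart(False)` evaluates to 0.164, somewhat higher than the 0.14 quoted on the following slide.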

97

Inference with Bayes Nets

[Network diagram with nodes: S (Smart), G (Grades), R (GREs), A (Admit).]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)
       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) calculated on previous slide = 0.29
P(A=T|S=F) calculated analogously to P(A=T|S=T) = 0.14

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes Rule: P(S=T|A=T) = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Network diagram with nodes: S (Smart), G (Grades), R (GREs), A (Admit).]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F)
           = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R)   (as before)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6
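The "tedious but straightforward" sums can be delegated to a short enumeration over all 16 joint states. This is a sketch under the assumption that the conditional probability tables are exactly those transcribed on the "Prediction with Bayes Nets" slide; the function names are hypothetical.

```python
from itertools import product

P_S_T = 0.5                              # P(S=T)
P_G_T = {False: 0.1, True: 0.2}          # P(G=T | S)
P_R_T = {False: 0.2, True: 0.8}          # P(R=T | S)
P_A_T = {(False, False): 0.1, (False, True): 0.2,
         (True, False): 0.5, (True, True): 0.8}  # P(A=T | G, R)
NAMES = ("S", "G", "R", "A")


def bern(p_true, value):
    """Probability of a binary value given the probability that it is True."""
    return p_true if value else 1.0 - p_true


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bern(P_S_T, s) * bern(P_G_T[s], g)
            * bern(P_R_T[s], r) * bern(P_A_T[(g, r)], a))


def prob(query, evidence=None):
    """P(query | evidence); both are dicts such as {"A": False}."""
    evidence = evidence or {}
    numerator = 0.0
    denominator = 0.0
    for values in product((False, True), repeat=4):
        world = dict(zip(NAMES, values))
        p = joint(*values)
        if all(world[k] == v for k, v in evidence.items()):
            denominator += p
            if all(world[k] == v for k, v in query.items()):
                numerator += p
    return numerator / denominator
```

With the tables as transcribed, `prob({"A": True})` is about 0.23, `prob({"G": False}, {"A": False})` is about 0.94 and `prob({"R": False}, {"A": False})` is about 0.55. These are close to, but not identical with, the rounded values quoted on the slides; in both cases bad grades come out as the more probable of the two given a rejection.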

99

End of worked example

100

Goal

• Estimate interaction probability using
  - Affinity capture
  - Two-hybrid
  - Less physical data

[Network diagram: the cause "P1-P2 REAL" (often "hidden") points to the observed effects "Detected by X1" ... "Detected by Xn".]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al. Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high confidence interactions from small scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns, one of which is marked "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  - Almost 7000 pairs found in E. coli
  - >6% of known interactions can be found with this method
  - Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  - Independent from evidence
  - Large
  - No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

108

Source: Jansen et al. Science 302, no. 5644 (2003): 449-53.

109

likelihood ratio = P(true_PPI | Data) / P(false_PPI | Data)

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = log [P(true_PPI) / P(false_PPI)] + log [P(Data | true_PPI) / P(Data | false_PPI)]

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = log [P(Data | true_PPI) / P(Data | false_PPI)] = log ∏_{i=1}^{M} [P(Observation_i | true_PPI) / P(Observation_i | false_PPI)]

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not)

Likelihood = L = P(f | pos) / P(f | neg) = (1114 / 2150) / (81924 / 573734)
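As a worked check of the likelihood L = P(f | pos) / P(f | neg), the sketch below computes it from gold-standard counts. The function name is hypothetical, and the counts are read from the slide's run-together digits on the assumption that they split as 1114/2150 (positives) and 81924/573734 (negatives); which essentiality category they belong to is not stated in the transcript.

```python
def likelihood_ratio(pos_with_value, pos_total, neg_with_value, neg_total):
    """L = P(f | pos) / P(f | neg), with each probability estimated as the
    fraction of gold-standard pairs showing feature value f."""
    p_f_given_pos = pos_with_value / pos_total
    p_f_given_neg = neg_with_value / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (assumed split of the garbled digits).
L = likelihood_ratio(1114, 2150, 81924, 573734)
```

Under that reading L comes out a little above 3.6, meaning this feature value is several times more common among gold-standard positives than among negatives.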

111

112

Source: Jansen et al. Science 302, no. 5644 (2003): 449-53.

113

Fully connected → Compute probabilities for all 16 possible combinations

Source: Jansen et al. Science 302, no. 5644 (2003): 449-53.

114

Interpret with caution, as numbers are small

Source: Jansen et al. Science 302, no. 5644 (2003): 449-53.

115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

log [P(Data | true_PPI) / P(Data | false_PPI)]

Source: Jansen et al. Science 302, no. 5644 (2003): 449-53.

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Page 44: Lecture 14: Predicting Protein Interactions

Yeast two-hybrid

How does this compare to mass-spec-based approaches?

Source: Ratushny, Vladimir, and Erica A. Golemis. "Resolving the Network of Cell Signaling Pathways using the Evolving Yeast Two-hybrid System." BioTechniques 44, no. 5 (2008): 655-62. Courtesy of BioTechniques. Used with permission.

• Does not require purification – will pick up more transient interactions
• Biased against proteins that do not express well or are incompatible with the nucleus

Source: Ratushny and Golemis, BioTechniques 44, no. 5 (2008): 655-62. Courtesy of BioTechniques. Used with permission.

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Error Rates

• How can we estimate the error rates?

[Venn diagram: gold standard vs. experiment. Gold-standard interactions missed by the experiment are false negatives; the overlap contains true positives; interactions found only by the experiment are a mix of true positives and false positives.]

[Venn diagram: gold standard, Experiment 1, Experiment 2. The overlaps with the gold standard are true positives from the gold standard.]

Data Integration

[Venn diagram: gold standard, Experiment 1, Experiment 2]

I = true positives from gold standard
II = consensus true positives

Fraction of consensus present in gold standard = I/II

For the interactions found only by Experiment 1:

III = interactions also present in the gold standard
The remainder is a mix of true and false positives. Define IV = true positives, V = false positives.

Assume I/II = III/IV

Estimated Error Rates

[Figure: estimated numbers of true positives and false positives, with regions I, II and III labeled]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I, II) and the unique data (III, IV):

I/II = III/IV

Source: Hart, G. Traver, Arun K. Ramani, and Edward M. Marcotte. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120. doi:10.1186/gb-2006-7-11-120. Courtesy of BioMed Central Ltd. Used with permission.
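The proportionality assumption I/II = III/IV can be turned into a small calculation. The sketch below is only an illustration: the function name, the argument names and all counts are hypothetical, and it assumes the regions are disjoint Venn regions (II is the consensus outside the gold standard, and IV + V together make up the unique interactions outside the gold standard).

```python
def split_unique_interactions(n_I, n_II, n_III, n_unique_outside_gold):
    """Estimate regions IV (true) and V (false) from the assumption I/II = III/IV.

    n_I   -- consensus interactions that are in the gold standard
    n_II  -- consensus interactions that are not in the gold standard
    n_III -- interactions unique to one experiment that are in the gold standard
    n_unique_outside_gold -- interactions unique to that experiment and absent
                             from the gold standard (regions IV + V)
    """
    if n_I <= 0:
        raise ValueError("region I must be non-empty")
    n_IV = n_III * n_II / n_I                 # rearranged from I/II = III/IV
    n_IV = min(n_IV, n_unique_outside_gold)   # cannot exceed what was observed
    n_V = n_unique_outside_gold - n_IV
    return n_IV, n_V


# Hypothetical counts, for illustration only.
n_IV, n_V = split_unique_interactions(n_I=200, n_II=800, n_III=100,
                                      n_unique_outside_gold=1900)
false_positive_fraction = n_V / (100 + 1900)  # share of all unique interactions
print(f"IV = {n_IV:.0f}, V = {n_V:.0f}, false-positive fraction = {false_positive_fraction:.2f}")
```

The estimate is only as good as the "no bias" assumption: if the gold standard favors the kinds of interactions one experiment detects, I/II and III/IV need not agree.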

[Figure from Hart et al., Genome Biology 7, no. 11 (2006): 120. Courtesy of BioMed Central Ltd. Used with permission.]

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

[Figure from von Mering et al. Note: log-log plot.]

Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403. doi:10.1038/nature750. Courtesy of Macmillan Publishers Limited. Used with permission.

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

Bayes Rule

$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\;\overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\,P(\text{Data} \mid \text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

How do we compute this?

log likelihood ratio:

$$\log\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

We assume the observations are independent (we'll see how to handle dependence soon):

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
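Under the independence assumption the ranking function is a sum of per-observation log likelihood ratios. The following is a minimal sketch of that sum, assuming each observation is a yes/no detection; the assay names, detection probabilities and protein pairs are hypothetical and are not taken from any of the cited papers.

```python
import math


def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum over assays of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)]."""
    score = 0.0
    for assay, detected in observations.items():
        # Probability of this outcome (detected or not) under each hypothesis.
        p_t = p_given_true[assay] if detected else 1.0 - p_given_true[assay]
        p_f = p_given_false[assay] if detected else 1.0 - p_given_false[assay]
        score += math.log(p_t / p_f)
    return score


# Hypothetical P(detected | true_PPI) and P(detected | false_PPI) per assay.
p_true = {"two_hybrid": 0.20, "mass_spec": 0.40}
p_false = {"two_hybrid": 0.01, "mass_spec": 0.05}

pairs = {
    "A-B": {"two_hybrid": True, "mass_spec": True},
    "A-C": {"two_hybrid": False, "mass_spec": True},
    "B-C": {"two_hybrid": False, "mass_spec": False},
}

ranked = sorted(pairs,
                key=lambda name: log_likelihood_ratio(pairs[name], p_true, p_false),
                reverse=True)
print(ranked)
```

A pair detected by only one assay still receives a positive score here, which is the practical advantage over requiring detection in every assay.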

Instead of requiring an interaction to be detected in all assays, we can rank by

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Source: Von Mering et al., Nature 417, no. 6887 (2002): 399-403. doi:10.1038/nature750. Courtesy of Macmillan Publishers Limited. Used with permission.

ROC curve

Collins et al., Mol Cell Proteomics 2007

x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

Annotations on the curve:
• 9074 interactions
• Literature curated (excluding 2-hybrid)
• Error rate equivalent to literature curated

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50. © American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.
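The two axes of the ROC curve, TP/(TP+FN) and FP/(TN+FP), can be traced out by lowering a score cutoff one example at a time. This is a minimal sketch with hypothetical scores and labels (1 = high-confidence positive, 0 = high-confidence negative); it is not the procedure used by Collins et al., and tied scores are not treated specially.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points as the cutoff is lowered past each example."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Visit examples from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i]:
            tp += 1   # a gold-standard positive is now above the cutoff
        else:
            fp += 1   # a gold-standard negative is now above the cutoff
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical log-likelihood-ratio scores and gold-standard labels.
scores = [3.2, 2.1, 0.5, -0.4, -1.3]
labels = [1, 1, 0, 1, 0]
points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}  TPR = {tpr:.2f}")
```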

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big.

Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23. Courtesy of Macmillan Publishers Limited. Used with permission.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  => 2^N - 1 parameters

Graphical Structure Expresses our Beliefs

[Diagram: node "P1-P2 REAL", the cause (often "hidden"), with arrows to its effects (observed): "Detected by X1", …, "Detected by Xn"]

[Diagram: the same network, shown next to a second network that also contains the nodes "Membrane" and "Highly expressed" alongside "Detected by Xj", "Detected by Xj+1", …]

Naïve Bayes assumes all observations are independent

But some observations may be coupled

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.
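One way to see what the graph buys us is to count parameters. The sketch below compares a full joint table over N binary variables with a network in which each binary node stores one probability per configuration of its parents. The node names (PPI, X1 to X4) and the helper functions are illustrative assumptions.

```python
def full_joint_parameters(n_vars):
    """Free parameters of a full joint table over n binary variables."""
    return 2 ** n_vars - 1


def bayes_net_parameters(parents):
    """Free parameters when each binary node has one number per parent configuration.

    `parents` maps each node to the list of its parent nodes.
    """
    return sum(2 ** len(node_parents) for node_parents in parents.values())


# Hypothetical naive-Bayes structure: one hidden cause, four observed detections.
naive_structure = {
    "PPI": [],
    "X1": ["PPI"],
    "X2": ["PPI"],
    "X3": ["PPI"],
    "X4": ["PPI"],
}

n_full = full_joint_parameters(len(naive_structure))
n_net = bayes_net_parameters(naive_structure)
print(n_full, n_net)  # 31 9
```

The gap widens quickly: every extra observed node doubles the full table but adds only two numbers to this structure.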

Automated Admissions Decisions

[Diagram: Bayesian network with nodes Smart, Grade inflation, Grades, GREs and Admit]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints.
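The equivalence of the two formulations can be checked directly: multiplying the prior P(S) by the conditional P(G|S) reproduces every entry of the joint table. A minimal sketch using the numbers from the tables (the variable names are illustrative):

```python
# Prior and conditional tables (False = F, True = T).
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# P(S, G) = P(S) * P(G | S)
joint = {(s, g): p_s[s] * p_g_given_s[s][g]
         for s in (False, True) for g in (False, True)}

for (s, g), p in joint.items():
    print(f"S={'T' if s else 'F'} G={'T' if g else 'F'}  P(S,G) = {p:.3f}")
```

Both forms need three free numbers: the joint table has four entries that must sum to 1, and the conditional form needs P(S=T), P(G=T|S=F) and P(G=T|S=T).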

Automated Admissions Decisions

[Diagram: Bayesian network with nodes Smart, Grade inflation, Grades, GREs and Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

"Explaining Away"

[Diagram: network with nodes Season, Sprinkler (S), Rain (R), Grass wet and Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

"Explaining Away"

[Diagram: Coin 1 and Coin 2 both point to Score]

p(C1=H) = p(C1=T) = 0.5    p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
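The coin example is small enough to enumerate. In the sketch below the four (C1, C2) outcomes are equally likely because both coins are fair, so conditional probabilities reduce to counting the outcomes consistent with the evidence; the function name and keyword arguments are illustrative.

```python
from itertools import product

# Two fair coins; Score = 1 when they match, 0 otherwise.
# Each of the four worlds has probability 0.25.
worlds = [(c1, c2, int(c1 == c2)) for c1, c2 in product("HT", repeat=2)]


def p_c1_heads(**evidence):
    """P(C1 = H | evidence); evidence may fix `c2` and/or `score`."""
    keep = [w for w in worlds
            if evidence.get("c2", w[1]) == w[1]
            and evidence.get("score", w[2]) == w[2]]
    return sum(w[0] == "H" for w in keep) / len(keep)


print(p_c1_heads())                   # 0.5
print(p_c1_heads(c2="T"))             # 0.5: the coins are independent a priori
print(p_c1_heads(score=1))            # 0.5
print(p_c1_heads(score=1, c2="T"))    # 0.0: given the score, C2 determines C1
```

Without the score, learning C2 tells us nothing about C1; once the common effect is observed, the two causes become coupled.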

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

Learning Models from Data

[Diagram: Coin 1 and Coin 2 both point to Score]

Parameters to learn: p(C1=H), p(C1=T), p(C2=H), p(C2=T), and the Score for each combination (C1, C2) = (H,H), (T,T), (H,T), (T,H)

D = {(C1, C2, S)} = (H,T,0), (H,H,1), …
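When the structure is known and every variable is observed, the maximum-likelihood estimates are simply observed frequencies. The sketch below illustrates this for the coin network; only the first two records come from the example above, the remaining records are hypothetical, and the helper name is an assumption.

```python
from collections import Counter, defaultdict

# Training data D as (C1, C2, Score) records; all but the first two are made up.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]


def heads_frequency(rows, column):
    """Maximum-likelihood estimate of P(coin = H): the fraction of heads observed."""
    return sum(row[column] == "H" for row in rows) / len(rows)


p_c1_heads = heads_frequency(data, 0)
p_c2_heads = heads_frequency(data, 1)

# Conditional table P(Score = 1 | C1, C2), estimated by counting per parent setting.
score_counts = defaultdict(Counter)
for c1, c2, score in data:
    score_counts[(c1, c2)][score] += 1
p_score_is_1 = {pair: counts[1] / sum(counts.values())
                for pair, counts in score_counts.items()}

print(p_c1_heads, p_c2_heads)
print(p_score_is_1)
```

With hidden variables or missing records, counting is no longer enough, which is where search procedures such as EM or Gibbs sampling come in.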

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

A worked "toy" example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

[Diagram: S (Smart) points to G (Grades) and R (GREs); G and R point to A (Admit)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)    (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

Prediction with Bayes Nets

[Diagram: S (Smart) points to G (Grades) and R (GREs); G and R point to A (Admit)]

P(S):
P(F)    P(T)
0.5     0.5

P(G|S):
S    P(F)    P(T)
F    0.9     0.1
T    0.8     0.2

P(R|S):
S    P(F)    P(T)
F    0.8     0.2
T    0.2     0.8

P(A|G,R):
G    R    P(F)    P(T)
F    F    0.9     0.1
F    T    0.8     0.2
T    F    0.5     0.5
T    T    0.2     0.8

Rough grading: being smart doesn't help very much (see P(G|S)).

GREs are better correlated with intelligence than the grades (see P(R|S)).

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)
           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29

The four terms correspond to (G,R) = (F,F), (F,T), (T,F), (T,T).
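The double sum over G and R is easy to check by enumeration. The sketch below encodes the conditional tables as transcribed here (the dictionary and function names are illustrative) and evaluates P(A=T|S) for both values of S. With these tables the S=T case gives 0.292, matching the 0.29 above; the S=F case comes out at 0.164.

```python
from itertools import product

# Conditional probability tables, stored as the probability of the value T.
P_G_TRUE = {False: 0.1, True: 0.2}            # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}            # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)


def bernoulli(p_true, value):
    """Probability that a binary variable takes `value`, given P(value is True)."""
    return p_true if value else 1.0 - p_true


def p_admit_given_smart(s):
    """P(A=T | S=s), summing over the unobserved G and R."""
    return sum(bernoulli(P_G_TRUE[s], g) * bernoulli(P_R_TRUE[s], r) * P_A_TRUE[(g, r)]
               for g, r in product((False, True), repeat=2))


print(round(p_admit_given_smart(True), 3))    # 0.292
print(round(p_admit_given_smart(False), 3))   # 0.164
```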

Inference with Bayes Nets

[Diagram: S (Smart) points to G (Grades) and R (GREs); G and R point to A (Admit)]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)
       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on the previous slide)
P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes Rule: P(S=T|A=T) = P(S=T) P(A=T|S=T) / P(A=T)

Inference with Bayes Nets

[Diagram: S (Smart) points to G (Grades) and R (GREs); G and R point to A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F)
           = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R)    (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6
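The "tedious but straightforward" sums can be delegated to brute-force enumeration of the 16 joint states. The sketch below assumes the conditional tables of the worked example exactly as transcribed; the helper names `joint`, `prob` and `conditional` are illustrative. With these tables the enumeration gives P(A=T) ≈ 0.23, P(S=T|A=T) ≈ 0.64, P(R=F|A=F) ≈ 0.55 and P(G=F|A=F) ≈ 0.94, close to but not identical with the rounded figures quoted in the example, so the quoted values are best read as approximate.

```python
from itertools import product

P_S_TRUE = 0.5
P_G_TRUE = {False: 0.1, True: 0.2}            # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}            # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)


def bernoulli(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bernoulli(P_S_TRUE, s) * bernoulli(P_G_TRUE[s], g)
            * bernoulli(P_R_TRUE[s], r) * bernoulli(P_A_TRUE[(g, r)], a))


def prob(**fixed):
    """Marginal probability of the assignments in `fixed` (keys among s, g, r, a)."""
    total = 0.0
    for s, g, r, a in product((False, True), repeat=4):
        state = {"s": s, "g": g, "r": r, "a": a}
        if all(state[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total


def conditional(query, evidence):
    """P(query | evidence) = P(query, evidence) / P(evidence)."""
    return prob(**query, **evidence) / prob(**evidence)


p_admit = prob(a=True)
p_smart_given_admit = conditional({"s": True}, {"a": True})
p_bad_gre = conditional({"r": False}, {"a": False})
p_bad_grades = conditional({"g": False}, {"a": False})
print(round(p_admit, 3), round(p_smart_given_admit, 3))
print(round(p_bad_gre, 3), round(p_bad_grades, 3))
```

Enumeration scales as 2^N, so it is only practical for toy networks like this one.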

End of worked example

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: node "P1-P2 REAL", the cause (often "hidden"), with arrows to its effects (observed): "Detected by X1", …, "Detected by Xn"]

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56. © American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

Co-evolution

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled "More likely to interact"]

Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, Suppl. 4 (2007): S7. doi:10.1186/1471-2105-8-S4-S7. Courtesy of Cokus et al. License: CC-BY.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

[Figures from Jansen et al., "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved.]

likelihood ratio: if > 1 classify as true; if < 1 classify as false

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114/2150}{81924/573734}$$
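The likelihood L = P(f|pos)/P(f|neg) for one feature value is a ratio of two observed fractions. The sketch below evaluates it for the counts on this slide, reading the run-together digits as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives; that reading, like the function name, is an assumption.

```python
def likelihood_ratio(n_pos_with_f, n_pos_total, n_neg_with_f, n_neg_total):
    """L = P(f | pos) / P(f | neg), with both probabilities estimated as fractions."""
    p_f_given_pos = n_pos_with_f / n_pos_total
    p_f_given_neg = n_neg_with_f / n_neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (assumed split of the garbled digits).
L_value = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L_value, 2))
```

A value above 1 means the feature value is enriched among gold-standard positives relative to gold-standard negatives.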

[Figure from Jansen et al. (2003)]

Fully connected → compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small

Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.
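For a fully connected model over four yes/no data sets, every one of the 2^4 = 16 detection patterns gets its own likelihood ratio, estimated by counting gold-standard positives and negatives that show that pattern. The sketch below uses hypothetical data; the pseudocount is an addition of this sketch (one common way to keep rare patterns from producing zero or infinite ratios) and is not taken from the source.

```python
from collections import Counter
from itertools import product


def combination_likelihoods(pos_rows, neg_rows, n_assays=4, pseudocount=1.0):
    """Likelihood ratio P(pattern | pos) / P(pattern | neg) for every 0/1 pattern."""
    patterns = list(product((0, 1), repeat=n_assays))
    pos_counts, neg_counts = Counter(pos_rows), Counter(neg_rows)
    pos_total = len(pos_rows) + pseudocount * len(patterns)
    neg_total = len(neg_rows) + pseudocount * len(patterns)
    return {
        pattern: ((pos_counts[pattern] + pseudocount) / pos_total)
                 / ((neg_counts[pattern] + pseudocount) / neg_total)
        for pattern in patterns
    }


# Hypothetical gold-standard pairs, each a tuple of four 0/1 detections.
positives = [(1, 1, 0, 0)] * 8 + [(0, 0, 0, 0)] * 2
negatives = [(0, 0, 0, 0)] * 95 + [(1, 0, 0, 0)] * 5

ratios = combination_likelihoods(positives, negatives)
print(len(ratios))
print(round(ratios[(1, 1, 0, 0)], 2), round(ratios[(0, 0, 0, 0)], 2))
```

Patterns backed by only a handful of gold-standard pairs yield unstable ratios, which is why small counts call for caution.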

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

TP = FP

Predictions based on a single data type all have TP/FP < 1

Source: Jansen et al., Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved.

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms

Page 45: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

bullDoes not require purification ndash will pick up more transient interactions bullBiased against proteins that do not express well or are incompatible with the nucleus

Biotechniques 2008 Apr44(5)655-62 Ratushny V Golemis E

Courtesy of BioTechniques Used with permissionSource Ratushny Vladimi and Erica A Golemis Resolving the Network of CellSignaling Pathways using the Evolving Yeast Two-hybrid System Biotechniques44 no 5 (2008) 655

45

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

46

Error Rates

bull How can we estimate the error rates

Gold standard Experiment

False negatives

True positives

True positives

and false

positives

47

Error Rates

Gold standard

Experiment 1

Experiment 2

True positives

from gold standard

48

Data Integration

Gold standard

Experiment 1

Experiment 2

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard=III 49

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard=III

Mix of true and false

positives

III

50

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard=III

Define IV =

true positives V =

false positives

IV true

V false

III

51

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard=III

Assume III=IIIIV

IV true

V false

III

52

Estimated Error Rates

False positives

True positives

How complete are current yeast and human protein-interaction networks G Traver Hart Arun K Ramani and Edward M Marcotte Genome Biology 2006 7120doi101186gb-2006-7-11-120

I=

II=

III=

Assume that all of regions I and II are true positives If MIPS has no bias toward either Krogan or Gavin then the fraction of TP in MIPS will be the same in the common data (III) and the unique data (IIIIV) I = III II IV

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

bull Take only those that are reported by gt1 method

bull Filter out ldquostickyrdquo proteins bull Estimate probability of each interaction based

on data bull Use external data to predict

55

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

Note log-log plot Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scale DataSets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$\text{Data} = (\text{two\_hybrid}, \text{mass\_spec}, \text{co\_evolution}, \text{co\_expression})$

or

$\text{Data} = (\text{mass\_spec\_expt\_1}, \text{mass\_spec\_expt\_2})$

57

Bayes Rule

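posterior = prior × likelihood / P(Data):

$$P(\text{real\_PPI} \mid \text{Data}) = \frac{P(\text{real\_PPI})\, P(\text{Data} \mid \text{real\_PPI})}{P(\text{Data})}$$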

58

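Naïve Bayes Classification

posterior ∝ prior × likelihood

Ratio of posterior probabilities = prior ratio × likelihood ratio:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} \cdot \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

How do we compute this?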

59

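likelihood ratio $= \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$: if > 1, classify as true; if < 1, classify as false

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = log likelihood ratio:

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$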

61

We assume the observations are independent (we’ll see how to handle dependence soon). We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function =

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i}^{M} \log \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$
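As a rough illustration of the ranking function, the sketch below scores a protein pair from binary "detected / not detected" observations, estimating each term from counts in high-confidence positive and negative sets. The assay names, counts, and the pseudocount smoothing are assumptions made for the example; this is not the scoring scheme of Collins et al.

```python
import math


def log_likelihood_ratio(observations, pos_counts, neg_counts, n_pos, n_neg, pseudocount=1.0):
    """Naive Bayes score: sum over assays of log[P(obs_i | true) / P(obs_i | false)].

    observations -- dict: assay name -> True if the pair was detected by that assay
    pos_counts   -- dict: assay name -> number of gold-standard positives it detected
    neg_counts   -- dict: assay name -> number of gold-standard negatives it detected
    n_pos, n_neg -- sizes of the positive and negative gold-standard sets
    """
    score = 0.0
    for assay, detected in observations.items():
        # Smoothed estimates of P(detected | true PPI) and P(detected | false PPI).
        p_det_true = (pos_counts[assay] + pseudocount) / (n_pos + 2 * pseudocount)
        p_det_false = (neg_counts[assay] + pseudocount) / (n_neg + 2 * pseudocount)
        if detected:
            score += math.log(p_det_true / p_det_false)
        else:
            score += math.log((1 - p_det_true) / (1 - p_det_false))
    return score


# Hypothetical gold-standard counts, for illustration only.
pos_counts = {"two_hybrid": 30, "mass_spec": 60}
neg_counts = {"two_hybrid": 5, "mass_spec": 20}

score_both = log_likelihood_ratio({"two_hybrid": True, "mass_spec": True}, pos_counts, neg_counts, 100, 1000)
score_none = log_likelihood_ratio({"two_hybrid": False, "mass_spec": False}, pos_counts, neg_counts, 100, 1000)
print(score_both, score_none)
```

Because the terms add, a pair missed by one assay can still rank highly if another assay provides strong evidence for it.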

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio $\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

65

Collins et al Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

66

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Collins et al Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD).

True negatives = protein pairs that
1. are annotated to belong to distinct complexes, and
2. have different sub-cellular locations OR anticorrelated mRNA expression.
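Given such positive and negative reference sets, the ROC curve can be traced by sweeping a score cutoff and recording FP/(TN+FP) against TP/(TP+FN). The sketch below is a minimal version that assumes every pair has a numeric score and ignores tied scores; the scores and labels are made up.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per cutoff.

    scores -- ranking score for each protein pair (higher = more likely real)
    labels -- True for gold-standard positives, False for gold-standard negatives
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Lower the cutoff one pair at a time, from the highest score down.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))  # FP/(TN+FP), TP/(TP+FN)
    return points


# Made-up scores and labels, for illustration only.
points = roc_points([4.0, 2.5, 1.0, -0.5, -2.0], [True, False, True, False, False])
print(points)
```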

67

Collins et al Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]

68

Collins et al Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotation: "9074 interactions".]

69

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1) ⇒ $2^N - 1$ parameters
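To make the size of the problem concrete, the sketch below compares the $2^N - 1$ parameters of a full joint table with the number needed when each binary variable depends only on a few parents in a graph. The node names and the naive-Bayes-style structure (one hidden cause, four assays) are hypothetical choices for the example.

```python
def full_joint_parameters(n_binary_vars):
    """Free parameters in a complete joint table over binary variables."""
    return 2 ** n_binary_vars - 1


def bayes_net_parameters(parents):
    """Free parameters when every binary node depends only on its parents.

    parents -- dict: node name -> list of parent node names.
    Each node needs one probability per configuration of its parents.
    """
    return sum(2 ** len(node_parents) for node_parents in parents.values())


# Hypothetical structure: one hidden cause ("real") and four observed assays.
assays = ["X1", "X2", "X3", "X4"]
structure = {"real": []}
structure.update({assay: ["real"] for assay in assays})

print(full_joint_parameters(len(structure)))  # full table over 5 variables
print(bayes_net_parameters(structure))        # factored model
```

The gap widens quickly: the full table doubles with every added variable, while the factored count grows by only two per added assay in this structure.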

77

Graphical Structure Expresses our Beliefs

[Diagram: P1-P2 REAL → Detected by X1, …, Detected by Xn]

Cause (often “hidden”): P1-P2 REAL

Effects (observed): Detected by X1, …, Detected by Xn

78

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naïve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Diagram nodes: Grade inflation, Smart, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram nodes: Grade inflation, Smart, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities: P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints.

85
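The equivalence of the two formulations can be checked directly: multiplying P(S) by P(G|S) gives the joint table, and dividing the joint by the marginal P(S) recovers the conditional table. A short sketch using the numbers on this slide (the variable names are illustrative):

```python
# P(S) and P(G | S) from the conditional formulation; keys are True/False.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Conditional -> joint: P(S, G) = P(S) * P(G | S)
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

# Joint -> conditional: P(G | S) = P(S, G) / P(S), with P(S) = sum over G of P(S, G)
marginal_s = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}
recovered = {
    s: {g: joint[(s, g)] / marginal_s[s] for g in (False, True)}
    for s in (False, True)
}

for (s, g), p in joint.items():
    print(f"S={s!s:5} G={g!s:5} P(S,G)={p:.3f}")
```

Both tables have three free parameters: the joint has four entries summing to 1, and the conditional form has one for P(S) plus one per row of P(G|S).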

Automated Admissions Decisions

[Diagram nodes: Grade inflation, Smart, Grades, GREs, Admit]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

“Explaining Away”

[Diagram nodes: Season, S: Sprinkler, R: Rain, Grass wet, Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, once we have observed that the grass is wet, the knowledge that it is raining influences our beliefs about the sprinkler.

87

“Explaining Away”

[Diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5   p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
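The two-coin example is small enough to enumerate. The sketch below builds the joint distribution over (C1, C2, Score) and compares P(C1=H) with and without conditioning; the helper `prob` is an illustrative name, not part of any library.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)

# Joint distribution over (c1, c2, score): two fair, independent coins,
# and Score = 1 exactly when the coins match.
worlds = {}
for c1, c2 in product("HT", repeat=2):
    score = 1 if c1 == c2 else 0
    worlds[(c1, c2, score)] = half * half


def prob(event, given=lambda w: True):
    """P(event | given), computed by summing over the enumerated worlds."""
    denominator = sum(p for w, p in worlds.items() if given(w))
    numerator = sum(p for w, p in worlds.items() if given(w) and event(w))
    return numerator / denominator


c1_heads = lambda w: w[0] == "H"

p_prior = prob(c1_heads)
p_given_c2 = prob(c1_heads, given=lambda w: w[1] == "T")
p_given_score = prob(c1_heads, given=lambda w: w[2] == 1)
p_given_both = prob(c1_heads, given=lambda w: w[2] == 1 and w[1] == "T")
print(p_prior, p_given_c2, p_given_score, p_given_both)
```

Without the score, learning C2 tells us nothing about C1. Once the score is known, C2 determines C1 completely, even though neither coin causally affects the other.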

88

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data X_i
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Diagram: Coin 1 → Score ← Coin 2]

Parameters to learn: p(C1=H), p(C1=T), p(C2=H), p(C2=T), and the Score for each combination of C1 and C2 (H H, T T, H T, T H)

D = (C1, C2, S) = (H, T, 0), (H, H, 1), …
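When the structure is known and every variable is observed in each record, the maximum-likelihood parameters are just observed frequencies. The sketch below fits the two-coin model by counting; the four records are made up for illustration (only the first two appear on the slide), and `fit_coin_model` is a hypothetical helper name.

```python
from collections import Counter


def fit_coin_model(data):
    """Maximum-likelihood estimates for the Coin 1 -> Score <- Coin 2 network.

    data -- list of (c1, c2, score) records with c1, c2 in {"H", "T"} and score in {0, 1}
    Returns p(C1=H), p(C2=H), and P(Score=1 | C1, C2) for each observed (C1, C2).
    """
    n = len(data)
    p_c1_heads = sum(1 for c1, _, _ in data if c1 == "H") / n
    p_c2_heads = sum(1 for _, c2, _ in data if c2 == "H") / n
    # Conditional table for Score: count within each parent configuration.
    totals = Counter((c1, c2) for c1, c2, _ in data)
    ones = Counter((c1, c2) for c1, c2, score in data if score == 1)
    p_score_one = {config: ones[config] / totals[config] for config in totals}
    return p_c1_heads, p_c2_heads, p_score_one


# Made-up training records, for illustration only.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0)]
p_c1_heads, p_c2_heads, p_score_one = fit_coin_model(D)
print(p_c1_heads, p_c2_heads, p_score_one)
```

With hidden variables or missing entries, simple counting no longer works, which is where search methods such as EM or Gibbs sampling come in.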

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Diagram nodes: S: Smart, G: Grades, R: GREs, A: Admit]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

= P(S) P(G|S) P(R|S) P(A|G,R)   (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Diagram nodes: S: Smart, G: Grades, R: GREs, A: Admit]

S:
P(F)   P(T)
0.5    0.5

G given S:
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R given S:
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A given G and R:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)

= (0.8)(0.2)(0.1) [G=F, R=F] + (0.8)(0.8)(0.2) [G=F, R=T] + (0.2)(0.2)(0.5) [G=T, R=F] + (0.2)(0.8)(0.8) [G=T, R=T] = 0.29

Rough grading: being smart doesn’t help very much.

GREs are better correlated with intelligence than the grades.
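The prediction sum can be reproduced by enumerating the four (G, R) combinations. The sketch below encodes the conditional tables from this slide as dictionaries (the names are illustrative) and evaluates P(A=T|S) for both values of S.

```python
from itertools import product

# P_G[s][g] = P(G=g | S=s) and P_R[s][r] = P(R=r | S=s), from the slide's tables.
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
# P_A_TRUE[(g, r)] = P(A=T | G=g, R=r)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2, (True, False): 0.5, (True, True): 0.8}


def p_admit_given_smart(s):
    """P(A=T | S=s) = sum over G and R of P(G|S) P(R|S) P(A=T|G,R)."""
    return sum(
        P_G[s][g] * P_R[s][r] * P_A_TRUE[(g, r)]
        for g, r in product((False, True), repeat=2)
    )


p_admit_smart = p_admit_given_smart(True)
p_admit_not_smart = p_admit_given_smart(False)
print(round(p_admit_smart, 3), round(p_admit_not_smart, 3))
```

With the tables as written here, the sum for S=T is 0.292, matching the 0.29 on the slide.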

97

Inference with Bayes Nets

[Diagram nodes: S: Smart, G: Grades, R: GREs, A: Admit]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5

P(A=T|S=T) = 0.29 (calculated on the previous slide)

P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)

Or, using Bayes Rule: = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Diagram nodes: S: Smart, G: Grades, R: GREs, A: Admit]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F) = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R)   (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6
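Because the network has only four binary variables, these inference queries can be answered by brute-force enumeration of the joint distribution. The sketch below uses the tables from the worked example as transcribed here; the helper `prob` is an illustrative name. The exact sums come out slightly different from the rounded figures quoted on the slides (for example, roughly 0.23 rather than 0.21 for P(A=T)), which may reflect rounding or a transcription difference in the tables, so treat the output as a check on the method rather than on the slide's numbers.

```python
from itertools import product

TF = (False, True)
P_S = {False: 0.5, True: 0.5}
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P(G | S)
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P(R | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2, (True, False): 0.5, (True, True): 0.8}


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a = P_A_TRUE[(g, r)] if a else 1.0 - P_A_TRUE[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * p_a


def prob(query, evidence=None):
    """P(query | evidence); both are dicts mapping 'S', 'G', 'R', 'A' to True/False."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for s, g, r, a in product(TF, repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if any(world[name] != value for name, value in evidence.items()):
            continue  # inconsistent with the evidence
        p = joint(s, g, r, a)
        denominator += p
        if all(world[name] == value for name, value in query.items()):
            numerator += p
    return numerator / denominator


p_admit = prob({"A": True})
p_smart_given_admit = prob({"S": True}, {"A": True})
p_bad_grades = prob({"G": False}, {"A": False})
p_bad_gres = prob({"R": False}, {"A": False})
print(round(p_admit, 3), round(p_smart_given_admit, 3))
print(round(p_bad_grades, 3), round(p_bad_gres, 3), round(p_bad_grades / p_bad_gres, 2))
```

Enumeration scales as $2^N$, so it is only practical for toy networks like this one.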

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: P1-P2 REAL (cause, often “hidden”) → Detected by X1, …, Detected by Xn (effects, observed)]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1.5: 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure annotation: "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

109

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

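likelihood ratio $= \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$: if > 1, classify as true; if < 1, classify as false

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = log likelihood ratio:

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$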

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

[Table annotations: 1114/2150 and 81924/573734]

Likelihood = $L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$
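As a small worked instance of this likelihood ratio, the sketch below assumes the two numbers annotated on the slide are counts: 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives sharing one essentiality value. That reading of the garbled annotation is an assumption, and the function name is illustrative.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide annotation (assumed interpretation).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))
```

A value of L above 1 means the feature value is seen more often among known interacting pairs than among non-interacting pairs, so observing it raises the score of a candidate pair.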

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected → compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TF=FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Page 46: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

46

Error Rates

bull How can we estimate the error rates

Gold standard Experiment

False negatives

True positives

True positives

and false

positives

47

Error Rates

Gold standard

Experiment 1

Experiment 2

True positives

from gold standard

48

Data Integration

Gold standard

Experiment 1

Experiment 2

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard=III 49

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard=III

Mix of true and false

positives

III

50

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard=III

Define IV =

true positives V =

false positives

IV true

V false

III

51

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard=III

Assume III=IIIIV

IV true

V false

III

52

Estimated Error Rates

False positives

True positives

How complete are current yeast and human protein-interaction networks G Traver Hart Arun K Ramani and Edward M Marcotte Genome Biology 2006 7120doi101186gb-2006-7-11-120

I=

II=

III=

Assume that all of regions I and II are true positives If MIPS has no bias toward either Krogan or Gavin then the fraction of TP in MIPS will be the same in the common data (III) and the unique data (IIIIV) I = III II IV

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

bull Take only those that are reported by gt1 method

bull Filter out ldquostickyrdquo proteins bull Estimate probability of each interaction based

on data bull Use external data to predict

55

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

Note log-log plot Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scale DataSets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

56

Finding real interactions

bull Estimate probability of each interaction based on data ndash How can we compute ( )_ |P real PPI Data

( _ _ _ _ )Data two hybrid mass spec co evolution co expression=

( )Data mass_spec_expt_1mass_spec_expt_2=

or

57

Bayes Rule

posterior prior likelihood

58

Naiumlve Bayes Classification

posterior prior likelihood

likelihood ratio = ratio of posterior probabilities

if gt 1 classify as true if lt 1 classify as false

How do we compute this

59

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

60

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

Ranking function = ( | _ )log( | _ )

P Data true PPIP Data false PPI

61

We assume the observations are independent (wersquoll see how to handle dependence soon)

Ranking function =

( | _ )log( | _ )

P Data true PPIP Data false PPI

62

We assume the observations are independent (wersquoll see how to handle dependence soon)

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

63

We assume the observations are independent (wersquoll see how to handle dependence soon) We can compute these terms if we have a set of high-confidence positive and negative interactions Exactly how we compute the terms depends on the type of data For affinity purificationmass spec see Collins et al Mol Cell Proteomics 2007 httpwwwmcponlineorgcontent63439long

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

64

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

TP(T

P+F

N)

ROC curve

True positives = True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives FP(TN+FP)

e de

nc )es

onfi ex

pl-c omgh c

ng h

i omfri

us (es

pute

vtiipo

sC

om

ROC curve

)FN+

(TP

TPROC curve

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD) True negatives = proteins pairs that 1 are annotated to belong to distinct complexes 2 have different sub-cellular locations OR anticorrelated mRNA expression

68 67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

Literature curated (excluding 2-hybrid)

Error rate equivalent to

literature curated

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

bull Bayesian Networks are a tool for reasoning with probabilities

bull Consist of a graph (network) and a set of probabilities

bull These can be ldquolearnedrdquo from the data

Bayesian Networks

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|Y2HUetz Y2HIto IPGavin IPHo)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|RosettaCCGOMIPSESS) But some problems are too big

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

bull Complete joint probability tables are large and often unknown

bull N binary variables = 2N states ndash only one constraint (sum of all probabilities =1)

=gt 2N - 1 parameters

77

Graphical Structure Expresses our Beliefs

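[Diagram: a hidden cause node “P1-P2 REAL” with arrows to the observed effect nodes “Detected by X1” … “Detected by Xn”]

Cause (often “hidden”): P1-P2 REAL

Effects (observed): Detected by X1, …, Detected by Xn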

78

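Graphical Structure Expresses our Beliefs

[Diagram: two alternative network structures. Left: “P1-P2 REAL” with child nodes “Detected by X1” … “Detected by Xn”. Right: the same network with further observation nodes “Detected by Xj”, “Detected by Xj+1”, … and two additional nodes, “Membrane” and “Highly expressed”.]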

79

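Graphical Structure Expresses our Beliefs

[Diagram: two alternative network structures. Left: “P1-P2 REAL” with child nodes “Detected by X1” … “Detected by Xn”. Right: the same network with further observation nodes “Detected by Xj”, “Detected by Xj+1”, … and two additional nodes, “Membrane” and “Highly expressed”.]

Naïve Bayes assumes all observations are independent (given whether the interaction is real)

But some observations may be coupled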

80

Graphical Structure Expresses our Beliefs

[Diagram: two alternative network structures. Left: “P1-P2 REAL” with child nodes “Detected by X1” … “Detected by Xn”. Right: the same network with further observation nodes “Detected by Xj”, “Detected by Xj+1”, … and two additional nodes, “Membrane” and “Highly expressed”.]

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82
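The independence statement can be checked numerically on a three-node chain TF A1 → TF B2 → gene. This is a sketch only: the probabilities below are made up for illustration, and the chain structure is an assumption read off the example sentence.

```python
from itertools import product

# Hypothetical numbers, not from the lecture.
p_a = {True: 0.3, False: 0.7}               # P(A1 active)
p_b_given_a = {True: 0.9, False: 0.2}       # P(B2 active | A1)
p_gene_given_b = {True: 0.8, False: 0.1}    # P(gene on | B2)


def joint(a, b, g):
    """P(A1=a, B2=b, gene=g) for the chain A1 -> B2 -> gene."""
    pb = p_b_given_a[a] if b else 1 - p_b_given_a[a]
    pg = p_gene_given_b[b] if g else 1 - p_gene_given_b[b]
    return p_a[a] * pb * pg


def p_gene_on(b, a=None):
    """P(gene on | B2=b), optionally also conditioning on A1=a."""
    rows = [r for r in product((True, False), repeat=3)
            if r[1] == b and (a is None or r[0] == a)]
    total = sum(joint(*r) for r in rows)
    return sum(joint(*r) for r in rows if r[2]) / total


print(p_gene_on(True), p_gene_on(True, a=True), p_gene_on(True, a=False))
```

All three printed values agree: once B2 is known, also knowing A1 changes nothing.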

Automated Admissions Decisions

[Diagram: network with nodes “Smart”, “Grade inflation”, “Grades”, “GREs” and “Admit”]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).
  Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).
  You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram: network with nodes “Smart”, “Grade inflation”, “Grades”, “GREs” and “Admit”]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints.

85
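The equivalence of the two tables can be verified directly, since P(S,G) = P(S) P(G|S). The sketch below rebuilds the joint table from the conditional one and then recovers the conditional table again; the variable names are illustrative.

```python
# Conditional formulation, using the numbers from the slide.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {False: {False: 0.95, True: 0.05},
               True: {False: 0.2, True: 0.8}}

# Joint formulation: P(S, G) = P(S) * P(G | S).
joint = {(s, g): p_s[s] * p_g_given_s[s][g]
         for s in (False, True) for g in (False, True)}

# Going back: marginalise to get P(S), then divide to get P(G | S).
p_s_back = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}
p_g_back = {s: {g: joint[(s, g)] / p_s_back[s] for g in (False, True)}
            for s in (False, True)}

for key, value in joint.items():
    print(key, round(value, 3))
```

Both forms have three free numbers: three joint entries (the fourth follows from the sum-to-one constraint), or one prior plus two conditional probabilities.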

Automated Admissions Decisions

[Diagram: network with nodes “Smart”, “Grade inflation”, “Grades”, “GREs” and “Admit”]

In a Bayesian Network we don’t need the full probability distribution. The joint probability factors into terms that each depend only on a node’s “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

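“Explaining Away”

[Diagram: network with nodes “Season”, “Sprinkler” (S), “Rain” (R), “Grass wet” and “Slippery”]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about the sprinkler: once we see that the grass is wet, rain “explains away” the need for the sprinkler to have been on.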

87

“Explaining Away”

[Diagram: nodes “Coin 1” and “Coin 2” with arrows to “Score”]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
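The coin example is small enough to enumerate. The sketch below counts the four equally likely outcomes of two fair coins and compares P(C1=H) under different conditions; the helper `prob` is illustrative.

```python
from itertools import product

# Four equally likely worlds (C1, C2, Score); Score is 1 when the coins match.
worlds = [(c1, c2, 1 if c1 == c2 else 0) for c1, c2 in product("HT", repeat=2)]


def prob(event, given=lambda w: True):
    """P(event | given), by counting equally likely worlds."""
    kept = [w for w in worlds if given(w)]
    return sum(1 for w in kept if event(w)) / len(kept)


def c1_heads(w):
    return w[0] == "H"


p_marginal = prob(c1_heads)
p_given_c2 = prob(c1_heads, lambda w: w[1] == "T")
p_given_score = prob(c1_heads, lambda w: w[2] == 1)
p_given_both = prob(c1_heads, lambda w: w[1] == "T" and w[2] == 1)
print(p_marginal, p_given_c2, p_given_score, p_given_both)
```

Knowing C2 alone, or the score alone, leaves P(C1=H) at 0.5. Knowing both pins C1 down completely: the two parents become dependent once their common child is observed.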

88

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?

• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data $X_i$
  – Maximum posterior

• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

91

Learning Models from Data

[Diagram: nodes “Coin 1” and “Coin 2” with arrows to “Score”]

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …
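When every variable is observed and the structure is known, the maximum-likelihood parameters of a discrete network are just relative frequencies. The sketch below estimates them by counting. Only the first two records come from the slide; the rest of the data set is hypothetical and added for illustration.

```python
from collections import Counter

# (C1, C2, Score) records. The first two are from the slide; the remaining
# four are invented so that there is something to count.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1),
        ("H", "T", 0), ("T", "H", 0), ("H", "H", 1)]
n = len(data)

# Root nodes: the maximum-likelihood estimate is the observed fraction.
p_c1_heads = sum(c1 == "H" for c1, _, _ in data) / n
p_c2_heads = sum(c2 == "H" for _, c2, _ in data) / n

# Child node: P(Score=1 | C1, C2) is estimated per parent configuration.
pair_counts = Counter((c1, c2) for c1, c2, _ in data)
score_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
p_score1 = {pair: score_counts[pair] / count
            for pair, count in pair_counts.items()}

print(p_c1_heads, p_c2_heads, p_score1)
```

With hidden variables or missing values, counting no longer suffices, which is where iterative methods such as EM come in.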

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)   (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

S:  P(F) = 0.5   P(T) = 0.5

G given S:
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R given S:
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A given G and R:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)
           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
             (the four terms are G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn’t help very much.

GREs are better correlated with intelligence than the grades.
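The prediction sum is short enough to evaluate by machine. The following sketch encodes the three conditional tables from this slide and sums over the unobserved G and R; the dictionary layout and function name are illustrative choices.

```python
from itertools import product

F, T = False, True

# Conditional probability tables from the slide.
p_g = {F: {F: 0.9, T: 0.1}, T: {F: 0.8, T: 0.2}}   # p_g[s][g] = P(G=g | S=s)
p_r = {F: {F: 0.8, T: 0.2}, T: {F: 0.2, T: 0.8}}   # p_r[s][r] = P(R=r | S=s)
p_a_true = {(F, F): 0.1, (F, T): 0.2,              # P(A=T | G, R)
            (T, F): 0.5, (T, T): 0.8}


def p_admit_given_smart(s):
    """P(A=T | S=s): sum over the unobserved grades G and GREs R."""
    return sum(p_g[s][g] * p_r[s][r] * p_a_true[(g, r)]
               for g, r in product((F, T), repeat=2))


print(round(p_admit_given_smart(T), 3))   # 0.292, the slide's 0.29
```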

97

Inference with Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)
       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

where P(S=T) = 0.5, P(S=F) = 0.5, P(A=T|S=T) = 0.29 (calculated on the previous slide) and P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T)).

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
or, using Bayes’ rule,
P(S=T|A=T) = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

P(R=F|A=F) = P(R=F, A=F) / P(A=F) = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R)   (as before)

Tedious but straightforward to compute.

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6
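The “tedious but straightforward” part can be delegated to brute-force enumeration of the 16 joint states. This is a sketch under the assumption that the conditional tables are exactly those printed on the “Prediction with Bayes Nets” slide; the helper names are illustrative. Evaluated that way, P(A=T|S=F) comes out as 0.164 and P(A=T) as 0.228, slightly different from the 0.14 and 0.21 quoted on the slides, so the quoted downstream figures are best read as approximate.

```python
from itertools import product

F, T = False, True
p_s = {F: 0.5, T: 0.5}
p_g = {F: {F: 0.9, T: 0.1}, T: {F: 0.8, T: 0.2}}   # P(G | S)
p_r = {F: {F: 0.8, T: 0.2}, T: {F: 0.2, T: 0.8}}   # P(R | S)
p_a_true = {(F, F): 0.1, (F, T): 0.2, (T, F): 0.5, (T, T): 0.8}


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    pa = p_a_true[(g, r)]
    return p_s[s] * p_g[s][g] * p_r[s][r] * (pa if a else 1 - pa)


def prob(event, given=lambda s, g, r, a: True):
    """P(event | given) by summing the joint over all 16 states."""
    num = den = 0.0
    for state in product((F, T), repeat=4):
        if given(*state):
            weight = joint(*state)
            den += weight
            if event(*state):
                num += weight
    return num / den


p_admit = prob(lambda s, g, r, a: a)
p_smart_given_admit = prob(lambda s, g, r, a: s, lambda s, g, r, a: a)
p_bad_grades = prob(lambda s, g, r, a: not g, lambda s, g, r, a: not a)
p_bad_gres = prob(lambda s, g, r, a: not r, lambda s, g, r, a: not a)
print(p_admit, p_smart_given_admit, p_bad_grades, p_bad_gres)
```

The qualitative conclusion is unchanged: a rejected student is considerably more likely to have had bad grades than bad GREs.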

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: hidden cause “P1-P2 REAL” with arrows to the observed effects “Detected by X1” … “Detected by Xn”]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al. Mol. & Cell. Proteomics (2002) 1.5: 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. “Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations.” Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns for protein pairs; one pattern is marked “More likely to interact”]

Courtesy of Cokus et al. License: CC-BY.
Source: Cokus, Shawn, Sayaka Mizutani, et al. “An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles.” BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data


108


109

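likelihood ratio = $\frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\,P(\mathrm{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log\frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = $\log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$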

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

Likelihood = L = $\frac{P(f \mid \mathrm{pos})}{P(f \mid \mathrm{neg})} = \frac{1114 / 2150}{81924 / 573734} \approx 3.6$
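The likelihood for one feature value is a ratio of two frequencies, each taken from a gold-standard set. A minimal sketch follows; it assumes the four numbers on the slide are, in order, positives with the feature value, all positives, negatives with the feature value, and all negatives. The function name is illustrative.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), with each probability estimated as a
    fraction of the corresponding gold-standard set."""
    return (pos_with_f / pos_total) / (neg_with_f / neg_total)


L_example = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L_example, 2))
```

A value above 1 means the feature value is enriched among gold-standard positives and so counts as evidence for a real interaction.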

111

112


113

Fully connected rarr Compute probabilities for all 16 possible combinations


114

Interpret with caution as numbers are small


115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Error Rates

• How can we estimate the error rates?

[Venn diagram: “Gold standard” and “Experiment” overlap. Gold standard only: false negatives. Overlap: true positives. Experiment only: true positives and false positives.]

47

Error Rates

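[Venn diagram: “Gold standard”, “Experiment 1” and “Experiment 2”; the part of the experimental data shared with the gold standard is marked “True positives from gold standard”]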

48

Data Integration

[Venn diagram: “Gold standard”, “Experiment 1” and “Experiment 2”, with regions I and II marked]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

49

Data Integration

[Venn diagram: “Gold standard” and “Experiment 1”, with regions I, II and III marked; the rest of Experiment 1 is a mix of true and false positives]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

50

Data Integration

[Venn diagram: “Gold standard” and “Experiment 1”, with regions I, II, III, IV (true) and V (false) marked]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Define IV = true positives, V = false positives

51

Data Integration

[Venn diagram: “Gold standard” and “Experiment 1”, with regions I, II, III, IV (true) and V (false) marked]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

52

Estimated Error Rates

[Figure: Venn diagram of the data sets with regions I, II and III labelled, and the estimated true positives and false positives]

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani, and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I / II = III / IV
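Rearranging the proportionality gives IV = III × II / I, and whatever remains of the experiment-only data is the estimated false-positive set V. The sketch below applies that arithmetic. It is an illustration under stated assumptions: the counts are made up, and II and IV are taken to be measured in the same way as each other (here, II as the consensus set and IV as the true positives among pairs unique to one experiment).

```python
def estimate_true_false(region_i, region_ii, region_iii, unique_total):
    """Estimate true (IV) and false (V) positives among pairs seen in only
    one experiment, assuming I/II = III/IV."""
    coverage = region_i / region_ii             # share of real pairs in the gold standard
    region_iv = min(region_iii / coverage, unique_total)
    region_v = unique_total - region_iv
    return region_iv, region_v


# Hypothetical counts, for illustration only.
true_pos, false_pos = estimate_true_false(200, 400, 150, 1000)
print(true_pos, false_pos, false_pos / 1000)
```

With these made-up counts the gold standard covers half of the consensus, so the 150 gold-standard hits among the unique pairs imply about 300 real interactions and about 700 false positives.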

Courtesy of BioMed Central Ltd. Used with permission.
Source: Hart, G. Traver, Arun K. Ramani, et al. “How Complete are Current Yeast and Human Protein-interaction Networks?” Genome Biology 7, no. 11 (2006): 120.

53


54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Von Mering, Christian, Roland Krause, et al. “Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions.” Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\mathrm{real\_PPI} \mid \mathrm{Data})$?

$\mathrm{Data} = (\mathrm{two\_hybrid},\ \mathrm{mass\_spec},\ \mathrm{co\_evolution},\ \mathrm{co\_expression})$

or

$\mathrm{Data} = (\mathrm{mass\_spec\_expt\_1},\ \mathrm{mass\_spec\_expt\_2})$

57

Bayes Rule

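$P(\mathrm{true\_PPI} \mid \mathrm{Data}) = \frac{P(\mathrm{true\_PPI})\,P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data})}$

posterior = prior × likelihood / P(Data)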

58

Naiumlve Bayes Classification

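$P(\mathrm{true\_PPI} \mid \mathrm{Data}) = \frac{P(\mathrm{true\_PPI})\,P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data})}$

posterior = prior × likelihood / P(Data)

likelihood ratio = ratio of posterior probabilities

if > 1 classify as true; if < 1 classify as false

How do we compute this?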

59

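likelihood ratio = $\frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\,P(\mathrm{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log\frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

60

likelihood ratio = $\frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\,P(\mathrm{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log\frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = $\log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$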

61

We assume the observations are independent (we’ll see how to handle dependence soon)

Ranking function = $\log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$

62

We assume the observations are independent (we’ll see how to handle dependence soon)

Ranking function = $\log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$

63

We assume the observations are independent (we’ll see how to handle dependence soon).

We can compute these terms if we have a set of high-confidence positive and negative interactions.

Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec see Collins et al. Mol. Cell. Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function = $\log\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$
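Under the independence assumption the log of the product becomes a sum of per-observation log ratios, which is easy to compute and to rank by. The sketch below treats each evidence source as a yes/no detection. The source names and probabilities are hypothetical placeholders; in practice they would be estimated from high-confidence positive and negative sets.

```python
import math

# Hypothetical P(detected | true PPI) and P(detected | false PPI) per source.
likelihoods = {
    "two_hybrid": {"true": 0.30, "false": 0.02},
    "mass_spec": {"true": 0.50, "false": 0.05},
    "co_expression": {"true": 0.60, "false": 0.30},
}


def ranking_score(observed):
    """Sum of log likelihood ratios over the observed evidence.

    `observed` maps a source name to True (detected) or False (not detected).
    Sources missing from `observed` are skipped, i.e. treated as missing data.
    """
    score = 0.0
    for source, detected in observed.items():
        p_true = likelihoods[source]["true"]
        p_false = likelihoods[source]["false"]
        if detected:
            score += math.log(p_true / p_false)
        else:
            score += math.log((1 - p_true) / (1 - p_false))
    return score


candidates = {
    "pair_A": {"two_hybrid": True, "mass_spec": True, "co_expression": True},
    "pair_B": {"two_hybrid": False, "mass_spec": True},
    "pair_C": {"two_hybrid": False, "mass_spec": False, "co_expression": False},
}
ranked = sorted(candidates, key=lambda name: ranking_score(candidates[name]),
                reverse=True)
print(ranked)
```

Because the prior is dropped, the score can rank candidate pairs but is not itself a posterior probability.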

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log-likelihood ratio (the ranking function).

65

Collins et al. Mol. Cell. Proteomics 2007

ROC curve

[Plot: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

Source: Collins, Sean R., Patrick Kemmeren, et al. “Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae.” Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

66

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Collins et al. Mol. Cell. Proteomics 2007

ROC curve

[Plot: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD).

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression
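Given scores for a set of gold-standard positives and negatives, the two axes of the ROC curve can be traced by sweeping the score cutoff from high to low. The following is a minimal sketch with made-up scores and labels; ties between scores are not treated specially.

```python
def roc_points(scores, labels):
    """Return (FP/(TN+FP), TP/(TP+FN)) pairs, lowering the cutoff one item
    at a time. `labels` holds 1 for gold-standard positives, 0 for negatives."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores and gold-standard labels.
points = roc_points([3.1, 2.4, 1.7, 0.2, -0.5], [1, 1, 0, 1, 0])
for false_positive_rate, true_positive_rate in points:
    print(false_positive_rate, true_positive_rate)
```

A good scoring function rises steeply along the y-axis before moving right, meaning that most positives are recovered before many negatives are admitted.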

67

Collins et al. Mol. Cell. Proteomics 2007

ROC curve

[Plot: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: “9074 interactions”; “Literature curated (excluding 2-hybrid)”; “Error rate equivalent to literature curated”.]

68

Collins et al. Mol. Cell. Proteomics 2007

ROC curve

[Plot: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotation: “9074 interactions”.]

69

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70


Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

110

Protein pairs in the essentiality data can take on three discrete values (EE both essential NN both non-essential and NE one essential and one not)

11142150

81924573734

Likelihood=L= )|()|(

negfPposfP

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected rarr Compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TF=FP

prediction based on single data type all have TPFPlt1

How many gold-standard events do we score correctly at different likelihood cutoffs

( | _ )log( | _ )

P Data true PPIP Data false PPI

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

117

MIT OpenCourseWarehttpocwmitedu

791J 20490J 20390J 736J 6802J 6874J HST506J Foundations of Computational and Systems BiologySpring 2014

For information about citing these materials or our Terms of Use visit httpocwmiteduterms

Error Rates

Venn diagram: gold standard, Experiment 1, and Experiment 2. The part of an experiment that overlaps the gold standard gives its true positives from the gold standard.

Data Integration

Venn diagram: gold standard, Experiment 1, and Experiment 2.

I = True positives from gold standard
II = Consensus true positives
Fraction of consensus present in gold standard = I/II

The interactions seen in only one experiment are a mix of true and false positives. III marks the ones that are present in the gold standard.

Define IV = true positives, V = false positives

Assume I/II = III/IV

Estimated Error Rates

Hart GT, Ramani AK, Marcotte EM. How complete are current yeast and human protein-interaction networks? Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Figure: estimated true positives and false positives in each data set, with regions I, II, and III marked.

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.
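The proportionality assumption I/II = III/IV can be turned into a small calculation. The sketch below is only an illustration: the function names and the example counts are hypothetical (they are not taken from Hart et al.), and it assumes the unique data of one experiment splits into IV (true) and V (false).

```python
def estimate_region_iv(n_i, n_ii, n_iii):
    """Solve I/II = III/IV for IV, the estimated true positives in the unique data."""
    if n_i <= 0:
        raise ValueError("region I must be non-empty to form the ratio I/II")
    return n_iii * n_ii / n_i


def estimate_false_positive_fraction(n_unique, n_i, n_ii, n_iii):
    """Fraction of the unique data estimated to be false positives (region V)."""
    # IV cannot exceed the number of unique interactions actually observed.
    n_iv = min(estimate_region_iv(n_i, n_ii, n_iii), n_unique)
    n_v = n_unique - n_iv
    return n_v / n_unique


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example_iv = estimate_region_iv(n_i=50, n_ii=200, n_iii=100)
example_fp = estimate_false_positive_fraction(n_unique=1000, n_i=50, n_ii=200, n_iii=100)
print(example_iv, example_fp)
```

With these made-up counts, a quarter of the consensus is in the gold standard, so the 100 unique interactions found in the gold standard are taken to represent 400 true positives in total.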

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Note: log-log plot.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

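Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$\text{Data} = (\text{two\_hybrid}, \text{mass\_spec}, \text{co\_evolution}, \text{co\_expression})$

or

$\text{Data} = (\text{mass\_spec\_expt\_1}, \text{mass\_spec\_expt\_2})$

Bayes Rule

$$P(\text{real\_PPI} \mid \text{Data}) = \frac{P(\text{real\_PPI})\, P(\text{Data} \mid \text{real\_PPI})}{P(\text{Data})}$$

posterior = $P(\text{real\_PPI} \mid \text{Data})$, prior = $P(\text{real\_PPI})$, likelihood = $P(\text{Data} \mid \text{real\_PPI})$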

Naïve Bayes Classification

posterior ∝ prior × likelihood

likelihood ratio = ratio of posterior probabilities

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})\, P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\, P(\text{Data} \mid \text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

How do we compute this?

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

We assume the observations are independent (we’ll see how to handle dependence soon):

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
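Under the independence assumption, the ranking function is a sum of per-observation log ratios. The sketch below illustrates that idea only; the assay names follow the slide, but the detection probabilities and the two candidate pairs are invented for the example, and a real analysis would estimate them from gold-standard positives and negatives.

```python
import math

# Hypothetical per-assay detection probabilities (assumptions, not measured values).
P_DETECT_GIVEN_TRUE = {"two_hybrid": 0.3, "mass_spec": 0.5}
P_DETECT_GIVEN_FALSE = {"two_hybrid": 0.01, "mass_spec": 0.05}


def ranking_score(observations):
    """Sum of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)] over the assays observed.

    `observations` maps assay name -> True (detected) or False (tested, not detected).
    Assays missing from the dict are simply skipped.
    """
    score = 0.0
    for assay, detected in observations.items():
        p_true = P_DETECT_GIVEN_TRUE[assay]
        p_false = P_DETECT_GIVEN_FALSE[assay]
        if not detected:
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        score += math.log(p_true / p_false)
    return score


both = ranking_score({"two_hybrid": True, "mass_spec": True})
one = ranking_score({"two_hybrid": False, "mass_spec": True})
neither = ranking_score({"two_hybrid": False, "mass_spec": False})
ranked = sorted([("pair_A", one), ("pair_B", both), ("pair_C", neither)],
                key=lambda item: item[1], reverse=True)
print(ranked)
```

A pair seen in only one assay still receives a score, so nothing has to be detected everywhere to be ranked.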

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al. Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

Collins et al. Mol Cell Proteomics 2007

ROC curve

x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

Marked on the curve:
• 9074 interactions
• Literature curated (excluding 2-hybrid)
• Error rate equivalent to literature curated

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
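The two axes of the ROC curve can be computed directly from scored pairs once gold-standard positives and negatives are available. This is a minimal sketch with made-up scores and labels; it is not the procedure used by Collins et al., only the generic definition of the two rates at each score cutoff.

```python
def roc_points(scores, labels):
    """Return (FP/(TN+FP), TP/(TP+FN)) for every distinct score cutoff.

    labels[i] is True for a gold-standard positive and False for a gold-standard negative.
    A pair is called an interaction when its score is >= the cutoff.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    points = []
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and not y)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores (e.g. log likelihood ratios) and gold-standard labels.
example_scores = [3.0, 2.0, 1.0, 0.5, 0.1]
example_labels = [True, True, False, True, False]
example_curve = roc_points(example_scores, example_labels)
print(example_curve)
```

Lowering the cutoff moves along the curve from the lower left (few calls, few errors) to the upper right (everything called an interaction).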

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities:

P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)
P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1) => 2^N − 1 parameters
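To see how quickly 2^N − 1 grows compared with a factored model, the sketch below counts free parameters for binary variables. The helper names are hypothetical, and the structure used (one hidden PPI node as the only parent of four detection nodes) is an assumption chosen for illustration.

```python
def full_joint_parameters(n_variables):
    """Free parameters of a complete joint table over binary variables: 2^N - 1."""
    return 2 ** n_variables - 1


def bayes_net_parameters(parents):
    """Free parameters of a binary Bayes net: one per parent configuration per node."""
    return sum(2 ** len(node_parents) for node_parents in parents.values())


# Assumed structure: the hidden PPI node is the only parent of each detection node.
ppi_parents = {
    "PPI": [],
    "Y2H_Uetz": ["PPI"],
    "Y2H_Ito": ["PPI"],
    "IP_Gavin": ["PPI"],
    "IP_Ho": ["PPI"],
}
n_joint = full_joint_parameters(len(ppi_parents))
n_factored = bayes_net_parameters(ppi_parents)
print(n_joint, n_factored)
```

For these five variables the complete table needs 31 numbers while the factored model needs 9, and the gap widens exponentially as more evidence types are added.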

Graphical Structure Expresses our Beliefs

Diagram: “P1-P2 REAL” (cause, often “hidden”) → “Detected by X1” … “Detected by Xn” (effects, observed)

Second diagram: “P1-P2 REAL” → “Detected by X1” … “Detected by Xj”, “Detected by Xj+1” … “Detected by Xn”, with extra nodes “Membrane” and “Highly expressed”

Naïve Bayes assumes all observations are independent

But some observations may be coupled

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

Network nodes: Smart, Grade inflation, Grades, GREs, Admit

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades, GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

Network: Smart → Grades

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints
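The equivalence of the two formulations can be checked by multiplying the prior by the conditional table and comparing with the joint table. The variable names below are illustrative; the numbers are the ones on the slide.

```python
# Prior on Smart and conditional table for Grades given Smart (values from the slide).
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Rebuild the joint table P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

# Going the other way: the joint table also answers inference questions,
# e.g. how likely a student with good grades is to be smart.
p_good_grades = joint[(False, True)] + joint[(True, True)]
p_smart_given_good_grades = joint[(True, True)] / p_good_grades

for key in sorted(joint):
    print(key, round(joint[key], 3))
print(round(p_smart_given_good_grades, 3))
```

Both descriptions need three free numbers: three entries of the joint table (the fourth follows from normalization), or one prior plus one conditional probability per value of S.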

Automated Admissions Decisions

Network nodes: Smart, Grade inflation, Grades, GREs, Admit

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

“Explaining Away”

Network nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about the sprinkler.

“Explaining Away”

Network: Coin 1 → Score ← Coin 2

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
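The coin example is small enough to enumerate exactly. The sketch below (function name is illustrative) conditions on the score and on Coin 2 by brute force, using the table above where Score = 1 when the coins match.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)


def score(c1, c2):
    """Score is 1 when the two coins match, 0 otherwise (table on the slide)."""
    return 1 if c1 == c2 else 0


def prob_c1_heads(score_obs=None, c2_obs=None):
    """P(C1 = H | optional evidence) by summing over the four coin outcomes."""
    numerator = Fraction(0)
    denominator = Fraction(0)
    for c1, c2 in product("HT", repeat=2):
        if c2_obs is not None and c2 != c2_obs:
            continue
        if score_obs is not None and score(c1, c2) != score_obs:
            continue
        weight = HALF * HALF  # the coins are independent and fair
        denominator += weight
        if c1 == "H":
            numerator += weight
    return numerator / denominator


print(prob_c1_heads())                          # no evidence
print(prob_c1_heads(c2_obs="T"))                # C2 alone tells us nothing about C1
print(prob_c1_heads(score_obs=1))               # the score alone tells us nothing either
print(prob_c1_heads(score_obs=1, c2_obs="T"))   # together they determine C1
```

The coins are independent until their common effect is observed; once the score is known, learning C2 changes what we believe about C1.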

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

Learning Models from Data

Network: Coin 1 → Score ← Coin 2

Parameters to be learned: p(C1=H), p(C1=T), p(C2=H), p(C2=T), and the Score entry for each combination of C1 and C2:

C1   C2   Score
H    H
T    T
H    T
T    H

D = (C1, C2, S) = (H,T,0), (H,H,1), …
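When every variable is observed, the maximum-likelihood parameters of this network are just relative frequencies in D. The sketch below assumes a small data set: the first two records are the ones shown on the slide, and the remaining four are invented to complete the example.

```python
from collections import Counter

# (C1, C2, Score) records. Only the first two appear on the slide; the rest are made up.
data = [
    ("H", "T", 0),
    ("H", "H", 1),
    ("T", "T", 1),
    ("T", "H", 0),
    ("H", "H", 1),
    ("H", "T", 0),
]

# Maximum-likelihood estimates for the root nodes: fraction of heads.
p_c1_heads = sum(1 for c1, _, _ in data if c1 == "H") / len(data)
p_c2_heads = sum(1 for _, c2, _ in data if c2 == "H") / len(data)

# Maximum-likelihood estimate of P(Score = 1 | C1, C2): a frequency per parent configuration.
config_counts = Counter((c1, c2) for c1, c2, _ in data)
score_one_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
p_score_one = {
    config: score_one_counts[config] / count
    for config, count in config_counts.items()
}

print(p_c1_heads, p_c2_heads)
print(p_score_one)
```

With hidden variables or missing entries, counting is no longer enough and the search algorithms named on the previous slide (for example EM) take over.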

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use

A worked “toy” example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

Network: S (Smart) → G (Grades), S → R (GREs), G and R → A (Admit)

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)   (why?)

(because of conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

Prediction with Bayes Nets

Network: S (Smart) → G (Grades), S → R (GREs), G and R → A (Admit)

S:
P(F)   P(T)
0.5    0.5

G:
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R:
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)
           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29

The four terms correspond to (G,R) = (F,F), (F,T), (T,F), (T,T).

Rough grading: being smart doesn’t help very much

GREs are better correlated with intelligence than the grades

Inference with Bayes Nets

Network: S (Smart) → G (Grades), S → R (GREs), G and R → A (Admit)

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)
       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T), calculated on previous slide, = 0.29
P(A=T|S=F), calculated analogously to P(A=T|S=T), = 0.14

P(S=T|A=T) = P(S=T,A=T) / P(A=T)
Or, using Bayes Rule: = P(S=T) P(A=T|S=T) / P(A=T)

Inference with Bayes Nets

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F,A=F) / P(A=F) = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R)   (as before)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6
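The "tedious but straightforward" sums can be left to a few lines of code. The sketch below enumerates all 16 states of the toy network using the conditional tables as transcribed from the Prediction slide; the helper names are illustrative. With those tables the enumeration gives about 0.29 for P(A=T|S=T), as on the slide, but about 0.16 for P(A=T|S=F), 0.23 for P(A=T), and 0.94 and 0.55 for the two "not admitted" queries, so the rounded figures quoted on the slides (0.14, 0.21, 0.92, 0.56) should be treated with some caution.

```python
from itertools import product

# Conditional tables as transcribed from the slide; each entry is P(node = True | parents).
P_S_TRUE = 0.5
P_G_TRUE = {False: 0.1, True: 0.2}          # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}          # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}  # P(A=T | G, R)


def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bernoulli(P_S_TRUE, s) * bernoulli(P_G_TRUE[s], g)
            * bernoulli(P_R_TRUE[s], r) * bernoulli(P_A_TRUE[(g, r)], a))


def prob(query, evidence=None):
    """P(query | evidence) by summing the joint over all 16 states."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for s, g, r, a in product([False, True], repeat=4):
        state = {"S": s, "G": g, "R": r, "A": a}
        if any(state[name] != value for name, value in evidence.items()):
            continue
        p = joint(s, g, r, a)
        denominator += p
        if all(state[name] == value for name, value in query.items()):
            numerator += p
    return numerator / denominator


print(round(prob({"A": True}, {"S": True}), 3))    # prediction
print(round(prob({"S": True}, {"A": True}), 3))    # inference
print(round(prob({"G": False}, {"A": False}), 3))
print(round(prob({"R": False}, {"A": False}), 3))
```

Either way the qualitative conclusion of the slide holds: among rejected students, bad grades are more probable than bad GREs.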

End of worked example

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

Diagram: “P1-P2 REAL” (cause, often “hidden”) → “Detected by X1” … “Detected by Xn” (effects, observed)

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al. Mol & Cell Proteomics (2002) 1(5): 349-356

INT = high confidence interactions from small scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7 doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

Figure: two phylogenetic profile patterns, one marked “More likely to interact”.

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Give appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

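$$\text{likelihood ratio} = \frac{P(\text{true\_PPI})\, P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\, P(\text{Data} \mid \text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$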

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$$

Values shown on the slide: 1114/2150 and 81924/573734
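The likelihood ratio for one feature value is a ratio of two frequencies, one from the gold-standard positives and one from the negatives. The sketch below assumes that the two fractions on the slide are those frequencies, with the first taken from the positives and the second from the negatives; that assignment is an assumption, and the function name is illustrative.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), with each probability estimated as a frequency."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts read off the slide; which set each belongs to is assumed here.
example_L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(example_L, 2))
```

Under this reading the feature value is several times more common among positives than among negatives, so observing it raises the odds that a pair truly interacts.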

Fully connected → Compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms

Data Integration

[Venn diagram of three interaction sets: Gold standard, Experiment 1, Experiment 2]

I = True positives from gold standard
II = Consensus true positives

Fraction of consensus present in gold standard = I/II

The interactions reported by only one experiment are a mix of true and false positives. Of these, III are present in the gold standard.

Define IV = true positives, V = false positives (among the interactions unique to one experiment)

Assume I/II = III/IV

52
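The assumption I/II = III/IV can be rearranged to estimate the number of true positives among the interactions unique to one experiment: IV = III × II / I. The sketch below only illustrates that arithmetic. The function name and all counts are hypothetical, and it assumes the unique interactions split exactly into IV (true) and V (false).

```python
def estimate_unique_error(n_I, n_II, n_III, n_unique):
    """Estimate IV (true) and V (false) among interactions unique to one experiment.

    Uses the assumption I/II = III/IV, so IV = III * II / I,
    and treats the unique interactions as IV + V.
    """
    if n_I <= 0:
        raise ValueError("region I must contain at least one interaction")
    n_IV = n_III * n_II / n_I
    n_V = n_unique - n_IV
    return n_IV, n_V


# Hypothetical counts, for illustration only (not taken from any data set).
n_IV, n_V = estimate_unique_error(n_I=300, n_II=400, n_III=150, n_unique=500)
print(n_IV, n_V)  # 200.0 300.0
```

With these made-up counts, 75% of the consensus is in the gold standard, so the 150 unique gold-standard hits are taken to represent 75% of the unique true positives.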

Estimated Error Rates

[Figure: estimated false positives and true positives, with regions I, II and III labeled]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV): I/II = III/IV.

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, and Edward M. Marcotte. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120. doi:10.1186/gb-2006-7-11-120

53

[Figure. Source: Hart et al., Genome Biology 7, no. 11 (2006): 120. Courtesy of BioMed Central Ltd. Used with permission.]

54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

[Figure. Note: log-log plot.]

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403. doi:10.1038/nature750

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$\text{Data} = (\text{two\_hybrid}, \text{mass\_spec}, \text{co\_evolution}, \text{co\_expression})$

or

$\text{Data} = (\text{mass\_spec\_expt\_1}, \text{mass\_spec\_expt\_2})$

57

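Bayes Rule

$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}} \; \overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$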

58

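Naïve Bayes Classification

$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}} \; \overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$

likelihood ratio = ratio of posterior probabilities

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\,P(\text{Data} \mid \text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

How do we compute this?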

59

$$\text{likelihood ratio} = \frac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\,P(\text{Data} \mid \text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

We assume the observations are independent (we'll see how to handle dependence soon):

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i=1}^{M}\log\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007, http://www.mcponline.org/content/6/3/439.long

64
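As a minimal sketch of the ranking function under the independence assumption, the log likelihood ratio of a candidate pair is a sum of per-assay terms. The assay names and detection probabilities below are hypothetical placeholders; in practice they would be estimated from high-confidence positive and negative sets.

```python
import math


def log_likelihood_ratio(observations, p_detect_true, p_detect_false):
    """Sum of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)] over assays.

    observations maps assay name -> True (detected) or False (not detected).
    p_detect_true / p_detect_false give P(detected | true_PPI) and
    P(detected | false_PPI) for each assay.
    """
    score = 0.0
    for assay, detected in observations.items():
        p_true = p_detect_true[assay] if detected else 1.0 - p_detect_true[assay]
        p_false = p_detect_false[assay] if detected else 1.0 - p_detect_false[assay]
        score += math.log(p_true / p_false)
    return score


# Hypothetical detection probabilities, for illustration only.
p_detect_true = {"two_hybrid": 0.2, "mass_spec": 0.5}
p_detect_false = {"two_hybrid": 0.01, "mass_spec": 0.05}

pair_a = {"two_hybrid": True, "mass_spec": True}    # seen in both assays
pair_b = {"two_hybrid": False, "mass_spec": True}   # seen in one assay only

score_a = log_likelihood_ratio(pair_a, p_detect_true, p_detect_false)
score_b = log_likelihood_ratio(pair_b, p_detect_true, p_detect_false)
print(score_a > score_b)  # True: pair_a ranks above pair_b
```

Because a non-detection also contributes a term, a pair missed by one assay is penalized rather than discarded, which is the point of ranking instead of requiring detection everywhere.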

Instead of requiring an interaction to be detected in all assays, we can rank by $\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403. doi:10.1038/nature750

65

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

69
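The two axes of the ROC curve can be computed by sweeping a cutoff down a ranked list of scored pairs. The following is a rough sketch, not the procedure used by Collins et al.: the function name, scores and labels are hypothetical, label 1 marks a high-confidence positive and 0 a high-confidence negative, and tied scores are not treated specially.

```python
def roc_points(scores, labels):
    """Return (FP/(TN+FP), TP/(TP+FN)) after each cutoff, highest score first."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]  # cutoff above every score: nothing is called positive
    tp = fp = 0
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores and gold-standard labels, for illustration only.
points = roc_points([3.0, 2.0, 1.0, 0.5], [1, 0, 1, 0])
print(points)
# [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```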

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

73

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities $P(\text{PPI} \mid \text{Y2H}_{\text{Uetz}}, \text{Y2H}_{\text{Ito}}, \text{IP}_{\text{Gavin}}, \text{IP}_{\text{Ho}})$

© American Association for the Advancement of Science. All rights reserved. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

75

Bayesian Networks

You could try to write out all the joint probabilities $P(\text{PPI} \mid \text{Rosetta}, \text{CC}, \text{GO}, \text{MIPS}, \text{ESS})$

But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1) => $2^N - 1$ parameters

77
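To get a feel for how quickly $2^N - 1$ grows, the sketch below compares a full joint table with a naïve Bayes factorization. The naïve Bayes count is an assumption made for illustration: one binary hidden cause with a single prior parameter, plus two detection probabilities per binary observation. The function names are hypothetical.

```python
def full_joint_parameters(n_variables):
    """Free parameters in a full joint table over binary variables: 2^N - 1."""
    return 2 ** n_variables - 1


def naive_bayes_parameters(n_observations):
    """One prior for the binary cause, plus P(obs=T | cause) for each of
    the two cause states and each observation: 1 + 2n."""
    return 1 + 2 * n_observations


# One interaction variable plus n assays gives N = n + 1 binary variables.
for n_assays in (4, 10, 20):
    print(n_assays,
          full_joint_parameters(n_assays + 1),
          naive_bayes_parameters(n_assays))
# 4 31 9
# 10 2047 21
# 20 2097151 41
```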

Graphical Structure Expresses our Beliefs

[Diagram: the node "P1-P2 REAL" is the cause (often "hidden"); the nodes "Detected by X1", …, "Detected by Xn" are the effects (observed)]

78

Graphical Structure Expresses our Beliefs

[Diagram: two networks. Left: "P1-P2 REAL" with "Detected by X1", …, "Detected by Xn". Right: "P1-P2 REAL" with "Detected by X1", …, "Detected by Xn", "Detected by Xj", "Detected by Xj+1", …, plus the additional nodes "Membrane" and "Highly expressed"]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution.

A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs).

We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Network: Smart, Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require same number of constraints.

85
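The equivalence of the two formulations can be checked numerically: multiplying P(S) by P(G|S) reproduces every entry of the joint table, and dividing the joint by its marginal recovers the conditional. This is a small sketch using the numbers from the tables; the variable names are illustrative.

```python
# P(S) and P(G | S) from the conditional formulation (False = F, True = T).
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint probability P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}
for (s, g), p in joint.items():
    print(s, g, round(p, 3))

# Going back: P(G=T | S=T) = P(S=T, G=T) / P(S=T), with P(S=T) a marginal.
p_s_true = joint[(True, False)] + joint[(True, True)]
p_g_true_given_s_true = joint[(True, True)] / p_s_true
```

Both tables have three free parameters: the joint has four entries that must sum to 1, and the conditional form has P(S=T), P(G=T|S=F) and P(G=T|S=T).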

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents".

For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

"Explaining Away"

[Network nodes: Season, S (Sprinkler), R (Rain), Grass wet, Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about the sprinkler once we have observed a shared effect such as wet grass.

87

"Explaining Away"

[Network: Coin 1 and Coin 2 are the parents of Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1 = H depend on whether C2 = T?

If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.

88
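The coin example is small enough to enumerate. The sketch below (function names are illustrative) conditions on different pieces of evidence and shows that C2 tells us nothing about C1 on its own, but becomes informative once the score is known.

```python
from fractions import Fraction
from itertools import product


def score(c1, c2):
    """Score is 1 when the two coins match and 0 otherwise."""
    return 1 if c1 == c2 else 0


def prob_c1_heads(**evidence):
    """P(C1 = H | evidence), by enumerating the four equally likely outcomes."""
    numerator = denominator = Fraction(0)
    for c1, c2 in product("HT", repeat=2):
        world = {"c1": c1, "c2": c2, "score": score(c1, c2)}
        if any(world[name] != value for name, value in evidence.items()):
            continue  # outcome is inconsistent with the evidence
        weight = Fraction(1, 4)  # two fair, independent coins
        denominator += weight
        if c1 == "H":
            numerator += weight
    return numerator / denominator


print(prob_c1_heads())                   # 1/2
print(prob_c1_heads(c2="T"))             # 1/2: C2 alone is uninformative
print(prob_c1_heads(score=1))            # 1/2
print(prob_c1_heads(score=1, c2="T"))    # 0: with the score known, C2 decides C1
print(prob_c1_heads(score=1, c2="H"))    # 1
```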

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data $X_i$: $L(\theta) = \prod_i P(X_i \mid \theta)$
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Network: Coin 1 and Coin 2 are the parents of Score]

Parameters to be learned from the data:

p(C1=H) =        p(C1=T) =
p(C2=H) =        p(C2=T) =

C1    C2    Score
H     H
T     T
H     T
T     H

D = (C1, C2, S) = (H,T,0), (H,H,1), …

92
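When every variable is observed, the maximum-likelihood parameters of this small network are just relative frequencies in D. The sketch below assumes fully observed data; only the first two records come from the slide, and the rest are hypothetical additions so that the counts are non-trivial.

```python
from collections import Counter


def fit_coin_network(data):
    """Maximum-likelihood estimates for the Coin 1 / Coin 2 / Score network.

    data is a list of (c1, c2, score) records. Returns p(C1=H), p(C2=H) and
    a dict mapping each observed (c1, c2) pair to P(Score=1 | c1, c2).
    """
    n = len(data)
    p_c1_heads = sum(1 for c1, _, _ in data if c1 == "H") / n
    p_c2_heads = sum(1 for _, c2, _ in data if c2 == "H") / n
    pair_counts = Counter((c1, c2) for c1, c2, _ in data)
    score1_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
    p_score1 = {
        pair: score1_counts[pair] / count for pair, count in pair_counts.items()
    }
    return p_c1_heads, p_c2_heads, p_score1


# First two records as on the slide; the remaining three are hypothetical.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]
p_c1_heads, p_c2_heads, p_score1 = fit_coin_network(D)
print(p_c1_heads, p_c2_heads)  # 0.6 0.6
print(p_score1[("H", "H")], p_score1[("H", "T")])  # 1.0 0.0
```

With hidden variables or missing values, counting is no longer enough, which is where search methods such as EM come in.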

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

A worked "toy" example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Network: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

           = P(S) P(G|S) P(R|S) P(A|G,R)    (why?)

(because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

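Prediction with Bayes Nets

[Network: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit)]

S:
P(F)    P(T)
0.5     0.5

G (rough grading: being smart doesn't help very much):
S    P(F)    P(T)
F    0.9     0.1
T    0.8     0.2

R (GREs are better correlated with intelligence than the grades):
S    P(F)    P(T)
F    0.8     0.2
T    0.2     0.8

A:
G    R    P(F)    P(T)
F    F    0.9     0.1
F    T    0.8     0.2
T    F    0.5     0.5
T    T    0.2     0.8

$$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\,P(R \mid S{=}T)\,P(A{=}T \mid G, R)$$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29

with the four terms corresponding to (G,R) = (F,F), (F,T), (T,F), (T,T).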

97
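The prediction step is a double sum over the unobserved parents of A. A minimal sketch that reproduces it from the conditional probability tables is shown below; the dictionary layout and function names are illustrative choices, not part of the lecture.

```python
# Conditional probability tables, stored as P(variable = T | parents).
P_G_TRUE = {False: 0.1, True: 0.2}          # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}          # P(R=T | S)
P_A_TRUE = {                                # P(A=T | G, R)
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def bernoulli(p_true, value):
    """Probability of a binary outcome, given the probability that it is True."""
    return p_true if value else 1.0 - p_true


def p_admit_given_smart(s):
    """P(A=T | S=s) = sum over G, R of P(G|S) P(R|S) P(A=T|G,R)."""
    total = 0.0
    for g in (False, True):
        for r in (False, True):
            total += (
                bernoulli(P_G_TRUE[s], g)
                * bernoulli(P_R_TRUE[s], r)
                * P_A_TRUE[(g, r)]
            )
    return total


print(round(p_admit_given_smart(True), 3))   # 0.292
print(round(p_admit_given_smart(False), 3))  # 0.164
```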

Inference with Bayes Nets

[Network: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit)]

$$P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}T \mid G, R)$$

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) ≈ 0.23

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on previous slide)
P(A=T|S=F) = 0.16 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)

Or, using Bayes Rule: = P(S=T) P(A=T|S=T) / P(A=T)

98

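Inference with Bayes Nets

[Network: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S}\sum_{G} P(S)\,P(G \mid S)\,P(R{=}F \mid S)\,P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S}\sum_{G}\sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}F \mid G, R) \quad \text{(as before)}$$

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.94 / 0.55 ≈ 1.7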

99
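The "tedious but straightforward" sums can be delegated to a brute-force enumeration of the 16 joint states. The sketch below builds the joint from the chain rule P(S)P(G|S)P(R|S)P(A|G,R) and answers any conditional query by summing matching states; the `conditional` helper and its dictionary-based interface are illustrative assumptions, and this approach only scales to very small networks.

```python
from itertools import product

P_S_TRUE = 0.5
P_G_TRUE = {False: 0.1, True: 0.2}          # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}          # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)


def bernoulli(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bernoulli(P_S_TRUE, s) * bernoulli(P_G_TRUE[s], g)
            * bernoulli(P_R_TRUE[s], r) * bernoulli(P_A_TRUE[(g, r)], a))


def conditional(query, evidence):
    """P(query | evidence); both are dicts such as {"G": False}."""
    numerator = denominator = 0.0
    for s, g, r, a in product((False, True), repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(s, g, r, a)
        denominator += p
        if all(world[k] == v for k, v in query.items()):
            numerator += p
    return numerator / denominator


p_smart_if_admitted = conditional({"S": True}, {"A": True})
p_bad_grades = conditional({"G": False}, {"A": False})
p_bad_gres = conditional({"R": False}, {"A": False})
print(round(p_smart_if_admitted, 2))        # 0.64
print(round(p_bad_grades, 2))               # 0.94
print(round(p_bad_gres, 2))                 # 0.55
print(round(p_bad_grades / p_bad_gres, 1))  # 1.7
```

Under these tables, a rejected student is more likely to have had bad grades than bad GREs, largely because bad grades are common for everyone (P(G=F) = 0.85).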

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: the node "P1-P2 REAL" is the cause (often "hidden"); the nodes "Detected by X1", …, "Detected by Xn" are the effects (observed)]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1.5: 349-356

[Figure legend]
INT = high-confidence interactions from small-scale experiments
d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

Co-evolution

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is marked "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, Suppl 4 (2007): S7. doi:10.1186/1471-2105-8-S4-S7

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Give appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

[Figure. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved.]

108

[Figure. Source: Jansen et al., Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved.]

109

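$$\text{likelihood ratio} = \frac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{false\_PPI})\,P(\text{Data} \mid \text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \sum_{i=1}^{M}\log\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$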

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114/2150}{81924/573734} \approx 3.6$$

111
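The likelihood ratio for a discrete feature value is a ratio of two frequencies counted in the gold-standard sets. A short sketch of that arithmetic follows, using the counts shown above; the function and argument names are illustrative.

```python
def likelihood_ratio(n_pos_with_f, n_pos_total, n_neg_with_f, n_neg_total):
    """L = P(f | pos) / P(f | neg), with both terms estimated by counting
    gold-standard positives and negatives that show feature value f."""
    p_f_given_pos = n_pos_with_f / n_pos_total
    p_f_given_neg = n_neg_with_f / n_neg_total
    return p_f_given_pos / p_f_given_neg


L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 1))  # 3.6
```

A value above 1 means the feature value is enriched among gold-standard positives, so observing it raises the odds that a pair truly interacts.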

112

[Figure. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved.]

113

Fully connected → Compute probabilities for all 16 possible combinations

[Figure. Source: Jansen et al., Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved.]

114

Interpret with caution, as numbers are small.

[Figure. Source: Jansen et al., Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved.]

115

[Figure annotations: "TP = FP"; predictions based on a single data type all have TP/FP < 1]

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Source: Jansen et al., Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved.

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard=III

Mix of true and false

positives

III

50

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard=III

Define IV =

true positives V =

false positives

IV true

V false

III

51

Data Integration

Gold standard

Experiment 1

I =True positives

from gold standard

II=Consensus true

positives

I

II

Fraction of consensus present in gold standard=III

Assume III=IIIIV

IV true

V false

III

52

Estimated Error Rates

False positives

True positives

How complete are current yeast and human protein-interaction networks G Traver Hart Arun K Ramani and Edward M Marcotte Genome Biology 2006 7120doi101186gb-2006-7-11-120

I=

II=

III=

Assume that all of regions I and II are true positives If MIPS has no bias toward either Krogan or Gavin then the fraction of TP in MIPS will be the same in the common data (III) and the unique data (IIIIV) I = III II IV

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein-interaction Networks Genome Biology 7 no 11 (2006) 120

53

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

bull Take only those that are reported by gt1 method

bull Filter out ldquostickyrdquo proteins bull Estimate probability of each interaction based

on data bull Use external data to predict

55

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

Note log-log plot Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scale DataSets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

56

Finding real interactions

bull Estimate probability of each interaction based on data ndash How can we compute ( )_ |P real PPI Data

( _ _ _ _ )Data two hybrid mass spec co evolution co expression=

( )Data mass_spec_expt_1mass_spec_expt_2=

or

57

Bayes Rule

posterior prior likelihood

58

Naiumlve Bayes Classification

posterior prior likelihood

likelihood ratio = ratio of posterior probabilities

if gt 1 classify as true if lt 1 classify as false

How do we compute this

59

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

60

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

Ranking function = ( | _ )log( | _ )

P Data true PPIP Data false PPI

61

We assume the observations are independent (wersquoll see how to handle dependence soon)

Ranking function =

( | _ )log( | _ )

P Data true PPIP Data false PPI

62

We assume the observations are independent (wersquoll see how to handle dependence soon)

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

63

We assume the observations are independent (wersquoll see how to handle dependence soon) We can compute these terms if we have a set of high-confidence positive and negative interactions Exactly how we compute the terms depends on the type of data For affinity purificationmass spec see Collins et al Mol Cell Proteomics 2007 httpwwwmcponlineorgcontent63439long

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

64

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

TP(T

P+F

N)

ROC curve

True positives = True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives FP(TN+FP)

e de

nc )es

onfi ex

pl-c omgh c

ng h

i omfri

us (es

pute

vtiipo

sC

om

ROC curve

)FN+

(TP

TPROC curve

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD) True negatives = proteins pairs that 1 are annotated to belong to distinct complexes 2 have different sub-cellular locations OR anticorrelated mRNA expression

68 67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

Literature curated (excluding 2-hybrid)

Error rate equivalent to

literature curated

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• They consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)


Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1)
  => $2^N - 1$ parameters
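To make the growth concrete, the sketch below compares the parameter count of a complete joint table with that of a simple network in which one hidden binary cause has n observed binary effects. The network shape and the function names are assumptions chosen for illustration.

```python
def full_joint_parameters(n_variables):
    """Free parameters in a complete joint table over binary variables."""
    return 2 ** n_variables - 1


def one_cause_network_parameters(n_effects):
    """One prior for the hidden cause, plus P(effect | cause) for each
    of the two cause states, for every observed effect."""
    return 1 + 2 * n_effects


# One hidden cause plus n observed effects gives n + 1 variables in total.
small = (full_joint_parameters(4 + 1), one_cause_network_parameters(4))
large = (full_joint_parameters(20 + 1), one_cause_network_parameters(20))
```

With 4 observed effects the full table needs 31 parameters against 9 for the factored network; with 20 effects it is about two million against 41.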

Graphical Structure Expresses our Beliefs

[Diagram: P1-P2 REAL → Detected by X1, …, Detected by Xn]

Cause (often “hidden”): P1-P2 REAL

Effects (observed): Detected by X1, …, Detected by Xn

Graphical Structure Expresses our Beliefs

[Diagram, first network: P1-P2 REAL → Detected by X1, …, Detected by Xn]

[Diagram, second network: P1-P2 REAL with observations Detected by X1, …, Detected by Xn, Detected by Xj, Detected by Xj+1, …, plus additional nodes Membrane and Highly expressed]

Naïve Bayes assumes all observations are independent, given whether the interaction is real.

But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution.

A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Network nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs).

We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Network: Smart → Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints.
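The equivalence of the two formulations can be checked directly: multiplying the prior P(S) by the conditional P(G|S) should reproduce every entry of the joint table. The snippet below is a small check using the numbers on the slide; the variable names are chosen here for illustration.

```python
# Prior and conditional tables from the slide (False = F, True = T).
p_smart = {False: 0.7, True: 0.3}
p_grade_given_smart = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Chain rule: P(S, G) = P(S) * P(G | S)
joint = {
    (s, g): p_smart[s] * p_grade_given_smart[s][g]
    for s in (False, True)
    for g in (False, True)
}
```

Both forms have three free parameters: the joint table has four entries that must sum to 1, and the conditional form has one prior plus one conditional probability per value of S.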

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”.

For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

“Explaining Away”

[Network nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about whether the sprinkler is on.

“Explaining Away”

[Network: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
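The coin example is small enough to enumerate. The sketch below lists the four equally likely coin outcomes, derives the score from the table above, and computes P(C1=H) under different evidence. The helper names `joint` and `prob_c1_heads` are illustrative.

```python
from itertools import product


def joint():
    """Yield (c1, c2, score, probability) for two fair, independent coins."""
    for c1, c2 in product("HT", repeat=2):
        score = 1 if c1 == c2 else 0  # score table from the slide
        yield c1, c2, score, 0.25


def prob_c1_heads(**evidence):
    """P(C1 = H | evidence), where evidence may fix c2 and/or score."""
    numerator = denominator = 0.0
    for c1, c2, score, p in joint():
        state = {"c2": c2, "score": score}
        if all(state[key] == value for key, value in evidence.items()):
            denominator += p
            if c1 == "H":
                numerator += p
    return numerator / denominator


no_evidence = prob_c1_heads()
given_c2 = prob_c1_heads(c2="T")
given_c2_and_score = prob_c1_heads(c2="T", score=0)
```

Knowing C2 alone leaves the belief in C1 at 0.5, but once the score is also known, C2 fully determines C1.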

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

Learning Models from Data

[Network: Coin 1 → Score ← Coin 2]

p(C1=H) =     p(C1=T) =
p(C2=H) =     p(C2=T) =

C1   C2   Score
H    H
T    T
H    T
T    H

(the probabilities and the score table are to be learned)

D = (C1, C2, S) = (H,T,0), (H,H,1), …
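When the structure is known and every variable is observed, the maximum-likelihood parameters reduce to frequencies in the data. The sketch below assumes a small hypothetical data set D: only the first two rows come from the slide, and the remaining rows are invented for illustration. The function names are also illustrative.

```python
from collections import Counter

# Rows are (C1, C2, Score). The first two rows appear on the slide;
# the rest are hypothetical.
data = [
    ("H", "T", 0), ("H", "H", 1), ("T", "T", 1),
    ("T", "H", 0), ("H", "H", 1), ("H", "T", 0),
]


def mle_heads(rows, index):
    """Maximum-likelihood estimate of P(coin = H): the observed frequency."""
    counts = Counter(row[index] for row in rows)
    return counts["H"] / len(rows)


def mle_score_table(rows):
    """Maximum-likelihood estimate of P(Score = 1 | C1, C2) for each observed pair."""
    totals = Counter((c1, c2) for c1, c2, _ in rows)
    ones = Counter((c1, c2) for c1, c2, s in rows if s == 1)
    return {pair: ones[pair] / n for pair, n in totals.items()}


p_c1_heads = mle_heads(data, 0)
p_c2_heads = mle_heads(data, 1)
score_table = mle_score_table(data)
```

With hidden variables or missing observations, simple counting no longer works and iterative methods such as EM are used instead.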

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?
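To see how quickly the number of candidate structures grows, one can count the directed acyclic graphs on n labeled nodes. The sketch below uses Robinson's recurrence, a standard combinatorial result that is brought in here for illustration and is not part of the lecture; `count_dags` is a hypothetical name.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of directed acyclic graphs on n labeled nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


dag_counts = [count_dags(n) for n in range(1, 6)]
```

Five nodes already allow 29,281 structures, which is why exhaustive search is impractical and prior knowledge is used to prune the space.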

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

A worked “toy” example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

[Network nodes: S = Smart, G = Grades, R = GREs, A = Admit]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

= P(S) P(G|S) P(R|S) P(A|G,R)   (why?)

(because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

Prediction with Bayes Nets

[Network nodes: S = Smart, G = Grades, R = GREs, A = Admit]

P(S):
P(S=F)   P(S=T)
0.5      0.5

P(G|S):
S   P(G=F)   P(G=T)
F   0.9      0.1
T   0.8      0.2
(Rough grading: being smart doesn’t help very much)

P(R|S):
S   P(R=F)   P(R=T)
F   0.8      0.2
T   0.2      0.8
(GREs are better correlated with intelligence than the grades)

P(A|G,R):
G   R   P(A=F)   P(A=T)
F   F   0.9      0.1
F   T   0.8      0.2
T   F   0.5      0.5
T   T   0.2      0.8

P(A=T|S=T) = Σ_G Σ_R P(G|S=T) P(R|S=T) P(A=T|G,R), with G, R ∈ {F, T}

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29

(the four terms correspond to (G,R) = (F,F), (F,T), (T,F), (T,T))
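The sum over G and R can be reproduced by direct enumeration. The sketch below hard-codes the conditional probability tables from the slide; the dictionary layout and the function name `p_admit_given_smart` are choices made here for illustration.

```python
from itertools import product

# Conditional probability tables from the slide (False = F, True = T).
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
# P(A = T | G, R), keyed by (g, r)
P_ADMIT = {(False, False): 0.1, (False, True): 0.2, (True, False): 0.5, (True, True): 0.8}


def p_admit_given_smart(s):
    """P(A = T | S = s): sum over the unobserved grades G and GREs R."""
    return sum(
        P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * P_ADMIT[(g, r)]
        for g, r in product((False, True), repeat=2)
    )


p_admit_if_smart = p_admit_given_smart(True)
```

The result is 0.292, which rounds to the 0.29 shown on the slide.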

Inference with Bayes Nets

P(A=T) = Σ_S Σ_G Σ_R P(S) P(G|S) P(R|S) P(A=T|G,R), with S, G, R ∈ {F, T}

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on the previous slide)
P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)

Or, using Bayes Rule: = P(S=T) P(A=T|S=T) / P(A=T)

Inference with Bayes Nets

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F) = [Σ_S Σ_G P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_S Σ_G Σ_R P(S) P(G|S) P(R|S) P(A=F|G,R)   (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6
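The "tedious but straightforward" sums can be delegated to a brute-force enumeration over all 16 states of (S, G, R, A). The sketch below uses the tables from the prediction slide; the helper names are illustrative. With the tables as transcribed here, exact enumeration gives values slightly different from the rounded figures quoted on the slides (for example P(A=T) ≈ 0.23 rather than 0.21), but the qualitative conclusion is the same: a rejected student is more likely to have had bad grades than bad GREs.

```python
from itertools import product

BOOL = (False, True)
P_S = {False: 0.5, True: 0.5}
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
# P(A = T | G, R), keyed by (g, r)
P_ADMIT = {(False, False): 0.1, (False, True): 0.2, (True, False): 0.5, (True, True): 0.8}


def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a = P_ADMIT[(g, r)] if a else 1.0 - P_ADMIT[(g, r)]
    return P_S[s] * P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * p_a


def probability(**fixed):
    """Marginal probability of the fixed assignments, summing out everything else."""
    total = 0.0
    for s, g, r, a in product(BOOL, repeat=4):
        state = {"s": s, "g": g, "r": r, "a": a}
        if all(state[key] == value for key, value in fixed.items()):
            total += joint(s, g, r, a)
    return total


def conditional(query, given):
    """P(query | given), both passed as dicts over disjoint variables."""
    return probability(**query, **given) / probability(**given)


p_admit = probability(a=True)
p_smart_given_admit = conditional({"s": True}, {"a": True})
p_bad_grades_given_reject = conditional({"g": False}, {"a": False})
p_bad_gres_given_reject = conditional({"r": False}, {"a": False})
```

Enumeration scales exponentially with the number of variables, so it only suits toy networks like this one.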

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: P1-P2 REAL → Detected by X1, …, Detected by Xn]

Cause (often “hidden”): P1-P2 REAL

Effects (observed): Detected by X1, …, Detected by Xn

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: candidate patterns, one annotated “More likely to interact”]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data


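likelihood ratio = $\frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\, P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\, P(\mathrm{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log \frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function = $\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$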
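Under the independence assumption, the log of the product becomes a sum of per-observation log likelihood ratios. The sketch below shows that sum for a single protein pair. The assay names and the per-assay probabilities are hypothetical placeholders, not values from the lecture or from any data set.

```python
import math

# Hypothetical values: (P(detected | true_PPI), P(detected | false_PPI)) per assay.
ASSAY_LIKELIHOODS = {
    "two_hybrid": (0.20, 0.01),
    "mass_spec": (0.40, 0.05),
    "co_expression": (0.60, 0.30),
}


def ranking_score(observations):
    """Sum of log likelihood ratios over the assays that were actually run.

    `observations` maps assay name -> True (detected) or False (not detected).
    Assays missing from the dict contribute nothing to the score.
    """
    score = 0.0
    for assay, detected in observations.items():
        p_true, p_false = ASSAY_LIKELIHOODS[assay]
        if not detected:
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        score += math.log(p_true / p_false)
    return score


both_detected = ranking_score({"two_hybrid": True, "mass_spec": True})
one_missed = ranking_score({"two_hybrid": True, "mass_spec": False})
```

Because an assay that was never run simply contributes no term, this form handles missing data naturally, and pairs can be ranked by score without knowing the prior.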

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

Likelihood = L = $\frac{P(f \mid \mathrm{pos})}{P(f \mid \mathrm{neg})} = \frac{1114/2150}{81924/573734} \approx 3.6$


Fully connected → compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Data Integration

[Venn diagram: Gold standard and Experiment 1, with regions I, II, III, IV (true), V (false)]

I = True positives from gold standard

II = Consensus true positives

Fraction of consensus present in gold standard = I/II

Define IV = true positives, V = false positives

Assume I/II = III/IV

Estimated Error Rates

[Figure: overlap of the data sets, with false positives, true positives, and regions I, II, III marked]

Hart, G. Traver, Arun K. Ramani, and Edward M. Marcotte. "How complete are current yeast and human protein-interaction networks?" Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

I/II = III/IV

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.

Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

von Mering et al. "Comparative assessment of large-scale data sets of protein–protein interactions." Nature 417, 399-403 (23 May 2002). doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute P(real_PPI | Data)?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)

or

Data = (mass_spec_expt_1, mass_spec_expt_2)

Bayes Rule

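$P(\mathrm{true\_PPI} \mid \mathrm{Data}) = \frac{P(\mathrm{true\_PPI})\, P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data})}$

(posterior = prior × likelihood / P(Data))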

Naïve Bayes Classification

posterior ∝ prior × likelihood

likelihood ratio = ratio of posterior probabilities

if > 1 classify as true; if < 1 classify as false

How do we compute this?

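likelihood ratio = $\frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\, P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\, P(\mathrm{false\_PPI})}$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio = $\log \frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function = $\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$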

We assume the observations are independent (we’ll see how to handle dependence soon).

Ranking function = $\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol. Cell. Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
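For a simple binary observation, each term of the product can be estimated by counting how often the observation occurs among the high-confidence positives and negatives. The sketch below is one way to do that, with an optional pseudocount to avoid dividing by zero. The function name, the pseudocount choice, and the example counts are assumptions for illustration; this is not the scoring scheme of Collins et al.

```python
def likelihood_ratio(pos_with_obs, pos_total, neg_with_obs, neg_total, pseudocount=1.0):
    """Estimate P(observation | true_PPI) / P(observation | false_PPI) from
    gold-standard counts. The pseudocount smooths estimates from small counts."""
    p_obs_given_true = (pos_with_obs + pseudocount) / (pos_total + 2 * pseudocount)
    p_obs_given_false = (neg_with_obs + pseudocount) / (neg_total + 2 * pseudocount)
    return p_obs_given_true / p_obs_given_false


# Hypothetical counts: the assay detects 50 of 100 gold-standard positives
# and 10 of 1000 gold-standard negatives.
unsmoothed = likelihood_ratio(50, 100, 10, 1000, pseudocount=0.0)
smoothed = likelihood_ratio(50, 100, 10, 1000)
```

A ratio well above 1 means the observation is strong evidence for a real interaction; continuous scores would need binning or a fitted distribution instead of simple counts.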

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio.

[Figure from von Mering et al., Nature 417, 399-403 (2002)]

Collins et al., Mol. Cell. Proteomics 2007

ROC curve
x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD).

True negatives = protein pairs that (1) are annotated to belong to distinct complexes, (2) have different sub-cellular locations OR anticorrelated mRNA expression.

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

Literature curated (excluding 2-hybrid)

Error rate equivalent to

literature curated

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

bull Bayesian Networks are a tool for reasoning with probabilities

bull Consist of a graph (network) and a set of probabilities

bull These can be ldquolearnedrdquo from the data

Bayesian Networks

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|Y2HUetz Y2HIto IPGavin IPHo)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|RosettaCCGOMIPSESS) But some problems are too big

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

bull Complete joint probability tables are large and often unknown

bull N binary variables = 2N states ndash only one constraint (sum of all probabilities =1)

=gt 2N - 1 parameters

77

Graphical Structure Expresses our Beliefs

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed)

78

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naiumlve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

bull The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure In a Bayesian Network we donrsquot need the full probability distribution A node is independent of its ancestors given its parents For example The activity of a gene does not depend on the activity of TF A1 once I know TF B2 82

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Prediction we observe the ldquocausesrdquo (roots parents) and want to predict the ldquoeffectsrdquo (leaveschildren)

Given Grades GREs will we admit

Inference we observe the ldquoeffectsrdquo (leaves children) and want to infer the hidden values of the ldquocausesrdquo (rootsparents)

You meet an admitted the student Is she smart

Admit

83

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictionsinferences requires knowing the joint probabilities P(admit grade inflation smart grades GREs) We will find conditional probabilities to be very useful

84

S G (above

threshold)

P(SG)

F F 0665

F T 0035

T F 006

T T 024

Automated Admissions Decisions Joint Probability

S P(G=F|S) P(G=T|S)

F 095 005

T 02 08

P(S=F)=07 P(S=T)=03

Grades

Smart

Conditional Probability

Formulations are equivalent and both require same number of constraints 85

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we donrsquot need the full probability distribution The joint probability depends only on ldquoparentsrdquo For example GRE scores do not depend on the level of grade inflation at the school but the grades do

Admit

86

ldquoExplaining Awayrdquo Season

S Sprinkler

R Rain

Grass wet

Does the probability that itrsquos raining depend on whether the sprinkler is on

Slippery

In a causal sense clearly not But in a probabilistic model the knowledge that it is raining influences our beliefs

87

Coin 1

Coin 2

Score

p(C2=H)=p(C2=T) = 5 p(C1=H)=p(C1=T) = 5

C1 C2 Score

H H 1

T T 1

H T 0

T H 0

Does the prob that C1 =H on depend on whether C2 =T If we know the score then our belief in the state of C1 is influenced by our belief in the state C2

ldquoExplaining Awayrdquo

88

How do we obtain a BN

bull Two problems ndash learning graph structure

bull NP-complete bull approximation algorithms

ndash probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

bull Two common objective functions ndash Maximum likelihood

bull Define the likelihood over training data Xi

ndash Maximum posterior

bull Good search algorithms exist ndash Gradient descent EM Gibbs Sampling hellip

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D= (C1C2S) = (HT0) (HH1) hellip

92

Learning Models from Data

bull Searching for the BN structure NP-complete ndash Too many possible structures to evaluate all of

them even for very small networks ndash Many algorithms have been proposed ndash Incorporated some prior knowledge can reduce

the search space bull Which measurements are likely independent bull Which nodes should regulate transcription bull Which should cause changes in phosphorylation

93

Resources to learn more

Kevin Murphyrsquos tutorial httpwwwcsubcca~murphykBayesbnintrohtml

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets S

P(F) P(T) 05 05

G S P(F) P(T) F 09 01 T 08 02 R

S P(F) P(T) F 08 02 T 02 08

A G R P(F) P(T) F F 09 01 F T 08 02 T F 05 05 T T 02 08

P(A=T|S=T) = ΣΣ P(G|S=T)P(R|S=T)P(A=T|GR) G=FT R=FT

= (08)(02)(01) + (08)(08)(02) + (02)(02)(05) + (02)(08)(08) =29 F F F T T F T T

Rough grading being smart doesnrsquot help very much

GREs are better correlated with intelligence than the grades

G Grade

s

S Smart

R GREs

A Admit

97

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(A=T) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) S=FT G=FT

R=FT

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) = 021 P(S=T) = 05 P(S=F) = 05 P(A=T|S=F) calculated analogously to P(A=T|S=T)

P(S=T) = 05 P(A=T|S=T) calculated on previous slide =29 P(A=T|S=F) =14

P(S=T|A=T) = P(S=TA=T)P(A=T) Or using Bayes Rule = P(S=T)P(A=T|S=T)P(A=T)

98

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

If a student is not admitted is it more likely they had bad GREs or bad grades

Compute P(R=F|A=F) and P(G=F|A=T)

P(R=F|A=F) = P(R=FA=F) P(A=F) = [Σ Σ P(S)P(G=F)P(R)P(A=F)]P(A=F) G=FT

R=FT

P(A=F) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) (as before)

Tedious but straightforward to compute

P(G=F|A=T) = 92 = 16 P(R=F|A=F) 56

99

End of worked example

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Figure: Bayes net in which the cause "P1-P2 REAL" (often "hidden") points to the observed effects "Detected by X1", …, "Detected by Xn"]

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1.5: 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns for protein pairs; one pattern is labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY.
Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

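likelihood ratio = $\dfrac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \dfrac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$; if > 1 classify as true, if < 1 classify as false

log likelihood ratio = $\log\dfrac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$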

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$$

e.g. $P(f \mid \text{pos}) = 1114/2150$ and $P(f \mid \text{neg}) = 81924/573734$

Fully connected → Compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Data Integration

[Figure: overlap between the gold standard and Experiment 1, with regions labeled I to V]

I = True positives from gold standard
II = Consensus true positives
IV = true; V = false

Fraction of consensus present in gold standard = I/II

Assume I/II = III/IV

Estimated Error Rates

How complete are current yeast and human protein-interaction networks? G. Traver Hart, Arun K. Ramani, and Edward M. Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

[Figure: overlapping interaction data sets with regions I, II, and III, split into true positives and false positives]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{\mathrm{I}}{\mathrm{II}} = \frac{\mathrm{III}}{\mathrm{IV}}$$

Courtesy of BioMed Central Ltd. Used with permission.
Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.
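The assumption I/II = III/IV can be rearranged to estimate the number of true positives among the interactions unique to one experiment: IV = III × II / I. The sketch below only illustrates that arithmetic. The function names and the counts in the example are hypothetical and are not taken from Hart et al.

```python
def estimate_unique_true_positives(region_i, region_ii, region_iii):
    """Solve I/II = III/IV for IV.

    region_i   : consensus interactions that are also in the gold standard
    region_ii  : all consensus interactions (assumed to be true positives)
    region_iii : unique interactions that are also in the gold standard
    """
    if region_i <= 0:
        raise ValueError("region I must be positive to estimate the ratio")
    return region_iii * region_ii / region_i


def estimate_false_positives(total_unique, region_iv):
    """Region V: unique interactions not accounted for by the estimate of IV."""
    return total_unique - region_iv


# Hypothetical counts, for illustration only.
iv = estimate_unique_true_positives(region_i=200, region_ii=500, region_iii=100)
v = estimate_false_positives(total_unique=1000, region_iv=iv)
print(iv, v)  # 250.0 750.0
```

The estimate is only as good as the no-bias assumption stated on the slide: if the gold standard favors one data set, the ratio I/II will not carry over to the unique data.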

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

Data = (two_hybrid, mass_spec, co_evolution, co_expression)
or
Data = (mass_spec_expt_1, mass_spec_expt_2)

Bayes Rule

$$\underbrace{P(\text{real\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{Data} \mid \text{real\_PPI})}^{\text{likelihood}}\;\overbrace{P(\text{real\_PPI})}^{\text{prior}}}{P(\text{Data})}$$

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities; if > 1 classify as true, if < 1 classify as false

How do we compute this?

likelihood ratio = $\dfrac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \dfrac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$; if > 1 classify as true, if < 1 classify as false

log likelihood ratio = $\log\dfrac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

We assume the observations are independent (we'll see how to handle dependence soon), so the ranking function factorizes:

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i=1}^{M} \log\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol. Cell. Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
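Under the independence assumption the ranking function is a sum of per-observation log ratios, which the following minimal sketch computes. The assay names echo the data types listed earlier, but every probability in `ASSAYS` is a made-up illustrative value; in practice these would be estimated from high-confidence positive and negative sets. This is not the scoring scheme of Collins et al.

```python
import math

# Hypothetical values, for illustration only. Each entry is
# (P(detected | true PPI), P(detected | false PPI)) for one assay.
ASSAYS = {
    "two_hybrid": (0.30, 0.02),
    "mass_spec": (0.50, 0.05),
    "co_expression": (0.60, 0.30),
}


def log_likelihood_ratio(observations):
    """Naive Bayes ranking score: sum of per-observation log ratios.

    `observations` maps assay name -> True (detected) or False (not detected).
    Assays missing from the dict are treated as unobserved and contribute 0.
    """
    score = 0.0
    for assay, detected in observations.items():
        p_true, p_false = ASSAYS[assay]
        if not detected:
            # Use the complementary probabilities for a negative result.
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        score += math.log(p_true / p_false)
    return score


pairs = {
    "pair_1": {"two_hybrid": True, "mass_spec": True, "co_expression": True},
    "pair_2": {"two_hybrid": False, "mass_spec": True},
}
ranked = sorted(pairs, key=lambda name: log_likelihood_ratio(pairs[name]),
                reverse=True)
print(ranked)  # ['pair_1', 'pair_2']
```

Because the prior odds are the same for every pair, omitting them changes every score by the same constant and leaves the ranking unchanged.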

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio $\log\dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Collins et al., Mol. Cell. Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated"]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
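An ROC curve like the one described here plots TP/(TP+FN) against FP/(TN+FP) as the score cutoff is lowered. The sketch below shows one simple way to trace those points from a scored list with gold-standard labels. The function name and the example scores are hypothetical, and tied scores are not given any special treatment.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    scores : ranking score for each protein pair (higher = more confident)
    labels : True for gold-standard positives, False for gold-standard negatives
    One point is emitted per pair as the cutoff moves from the highest score
    to the lowest; tied scores are simply taken in sorted order.
    """
    n_pos = sum(1 for is_pos in labels if is_pos)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
        # TP/(TP+FN) = tp / n_pos and FP/(TN+FP) = fp / n_neg.
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores and labels, for illustration only.
print(roc_points([4.0, 3.0, 2.0, 1.0], [True, False, True, False]))
```

A curve that rises steeply near the left edge means many gold-standard positives are recovered before many gold-standard negatives are admitted.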

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

[Figure from Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science.]

Bayesian Networks

You could try to write out all the joint probabilities:
P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)
P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1) => $2^N - 1$ parameters
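To make the parameter count concrete, the sketch below compares the $2^N - 1$ free parameters of a full joint table with the number needed once the distribution is factorized along a graph, where each binary node needs one free parameter per configuration of its parents. The four-node example network (`EXAMPLE_NET`) is an illustrative assumption, as are the function names.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters in a complete joint table over N binary variables."""
    return 2 ** n_binary_variables - 1


def bayes_net_parameters(parents):
    """Free parameters in a Bayes net over binary variables.

    `parents` maps each node to the list of its parent nodes. A binary node
    needs one number, P(node=True | parent configuration), for each of the
    2**k configurations of its k parents.
    """
    return sum(2 ** len(node_parents) for node_parents in parents.values())


# Illustrative four-node network: S -> G, S -> R, and G, R -> A.
EXAMPLE_NET = {"S": [], "G": ["S"], "R": ["S"], "A": ["G", "R"]}

print(full_joint_parameters(len(EXAMPLE_NET)))  # 15
print(bayes_net_parameters(EXAMPLE_NET))        # 9
```

The saving grows quickly with network size as long as each node has only a few parents.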

Graphical Structure Expresses our Beliefs

[Figure: network in which the cause "P1-P2 REAL" (often "hidden") points to the observed effects "Detected by X1", …, "Detected by Xn"]

[Figure: a second network with "P1-P2 REAL", "Detected by X1", …, "Detected by Xn", "Detected by Xj", "Detected by Xj+1", …, and the additional nodes "Membrane" and "Highly expressed"]

Naïve Bayes assumes all observations are independent

But some observations may be coupled

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Figure: Bayes net with nodes Grade inflation, Smart, Grades, GREs, and Admit]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades, GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Figure: network with nodes Smart and Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7, P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints.
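The equivalence of the two formulations can be verified numerically: multiplying P(S) by P(G|S) reproduces the joint table, and marginalizing the joint recovers the conditional one. A short sketch using the values from this slide (the variable names are illustrative):

```python
# Conditional formulation, transcribed from the slide.
P_S = {False: 0.7, True: 0.3}                          # P(S)
P_G_GIVEN_S = {False: {False: 0.95, True: 0.05},       # P(G | S=F)
               True: {False: 0.2, True: 0.8}}          # P(G | S=T)

# Joint formulation: P(S, G) = P(S) * P(G | S).
joint = {(s, g): P_S[s] * P_G_GIVEN_S[s][g]
         for s in (False, True) for g in (False, True)}

# Going back: marginalize the joint to get P(S), then divide to get P(G | S).
marginal_s = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}
recovered = {s: {g: joint[(s, g)] / marginal_s[s] for g in (False, True)}
             for s in (False, True)}

for (s, g), p in joint.items():
    print(s, g, round(p, 3))
```

Both tables carry three free numbers: the joint has four entries that must sum to 1, and the conditional form has P(S=T), P(G=T|S=F), and P(G=T|S=T).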

Automated Admissions Decisions

[Figure: Bayes net with nodes Grade inflation, Smart, Grades, GREs, and Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

"Explaining Away"

[Figure: Bayes net with nodes Season, S (Sprinkler), R (Rain), Grass wet, and Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

"Explaining Away"

[Figure: network with nodes Coin 1, Coin 2, and Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
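The two-coin example is small enough to enumerate. The sketch below assumes fair, independent coins and a score of 1 when the coins match, as in the table above; the function name is illustrative. It shows that C2 carries no information about C1 on its own but determines C1 completely once the score is known.

```python
from itertools import product


def p_c1_heads(score=None, c2=None):
    """P(C1 = H | evidence), enumerating the four equally likely outcomes.

    score : observed score (1 if the coins match, 0 otherwise) or None
    c2    : observed state of coin 2 ("H" or "T") or None
    """
    numerator = denominator = 0.0
    for c1_state, c2_state in product("HT", repeat=2):
        outcome_score = 1 if c1_state == c2_state else 0
        if score is not None and outcome_score != score:
            continue
        if c2 is not None and c2_state != c2:
            continue
        denominator += 0.25          # each outcome has probability 0.5 * 0.5
        if c1_state == "H":
            numerator += 0.25
    return numerator / denominator


print(p_c1_heads())                   # 0.5
print(p_c1_heads(c2="T"))             # 0.5  (coins are independent)
print(p_c1_heads(score=1))            # 0.5
print(p_c1_heads(score=1, c2="T"))    # 0.0  (matching score forces C1 = T)
print(p_c1_heads(score=1, c2="H"))    # 1.0
```

Observing the common effect (the score) is what couples the two otherwise independent causes.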

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data $X_i$
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

Learning Models from Data

[Figure: network with nodes Coin 1, Coin 2, and Score]

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …
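For a discrete network with known structure and fully observed data, the maximum-likelihood parameters are simply observed frequencies. The sketch below estimates the coin probabilities that way. Only the first two tuples of `DATA` appear on the slide; the remaining tuples are hypothetical, added so the example has something to count, and the function name is illustrative.

```python
from collections import Counter

# Training data D = (C1, C2, Score). The first two tuples are from the slide;
# the rest are hypothetical, for illustration only.
DATA = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]


def ml_estimate(data, column, value):
    """Maximum-likelihood estimate of P(column == value): its observed frequency."""
    counts = Counter(row[column] for row in data)
    return counts[value] / len(data)


print(ml_estimate(DATA, column=0, value="H"))  # estimate of p(C1=H): 0.6
print(ml_estimate(DATA, column=1, value="H"))  # estimate of p(C2=H): 0.6
```

With hidden variables or missing observations the counts are no longer available directly, which is where search procedures such as EM come in.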

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent
    • Which nodes should regulate transcription
    • Which should cause changes in phosphorylation



Estimated Error Rates

How complete are current yeast and human protein-interaction networks? G Traver Hart, Arun K Ramani, and Edward M Marcotte. Genome Biology 2006, 7:120. doi:10.1186/gb-2006-7-11-120

[Figure: regions I, II, III, and IV of the overlap among the Krogan, Gavin, and MIPS data sets, with false positives and true positives labeled. The values of I, II, and III are given in the figure.]

Assume that all of regions I and II are true positives. If MIPS has no bias toward either Krogan or Gavin, then the fraction of TP in MIPS will be the same in the common data (I/II) and the unique data (III/IV):

$$\frac{I}{II} = \frac{III}{IV}$$
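The proportion above can be solved for the unknown region IV. The sketch below assumes that IV is the quantity being estimated; the function name and the counts are hypothetical and are not taken from Hart et al.

```python
def estimate_region_iv(region_i, region_ii, region_iii):
    """Solve I/II = III/IV for IV, given counts for regions I, II, and III."""
    if region_i <= 0:
        raise ValueError("region I must be a positive count")
    return region_ii * region_iii / region_i


# Hypothetical counts, for illustration only.
estimated_iv = estimate_region_iv(region_i=50, region_ii=200, region_iii=30)
print(estimated_iv)
```

With these made-up counts the estimate is 200 × 30 / 50 = 120 pairs.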

Courtesy of BioMed Central Ltd. Used with permission. Source: Hart, G. Traver, Arun K. Ramani, et al. "How Complete are Current Yeast and Human Protein-interaction Networks?" Genome Biology 7, no. 11 (2006): 120.

53


54

Finding real interactions

• Take only those that are reported by >1 method
• Filter out "sticky" proteins
• Estimate probability of each interaction based on data
• Use external data to predict

55

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Note: log-log plot.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$$\text{Data} = (\text{two\_hybrid},\ \text{mass\_spec},\ \text{co\_evolution},\ \text{co\_expression})$$

or

$$\text{Data} = (\text{mass\_spec\_expt\_1},\ \text{mass\_spec\_expt\_2})$$

57

Bayes Rule

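$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}\ \overbrace{P(\text{true\_PPI})}^{\text{prior}}}{P(\text{Data})}$$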

58

Naiumlve Bayes Classification

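$$\underbrace{\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})}}_{\text{posterior}} = \underbrace{\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})}}_{\text{prior}} \cdot \underbrace{\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}}_{\text{likelihood}}$$

likelihood ratio = ratio of posterior probabilities

If > 1, classify as true; if < 1, classify as false.

How do we compute this?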

59


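likelihood ratio $= \dfrac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \dfrac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$

If > 1, classify as true; if < 1, classify as false.

log likelihood ratio $= \log\dfrac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function $= \log\dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$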

61


We assume the observations are independent (we'll see how to handle dependence soon).

We can compute these terms if we have a set of high-confidence positive and negative interactions.

Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function:

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$
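As a rough illustration of the ranking function, the sketch below scores candidate pairs by summing per-assay log ratios, which equals the log of the product when the observations are independent. The assay names, detection probabilities, and protein pairs are all hypothetical; in practice the probabilities would be estimated from high-confidence positive and negative sets.

```python
import math


def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum of per-assay log ratios (the log of the product under independence).

    observations maps assay name -> True/False (detected or not).
    p_given_true / p_given_false map assay name -> P(detected | true/false PPI).
    """
    score = 0.0
    for assay, detected in observations.items():
        p_true = p_given_true[assay] if detected else 1.0 - p_given_true[assay]
        p_false = p_given_false[assay] if detected else 1.0 - p_given_false[assay]
        score += math.log(p_true / p_false)
    return score


# Hypothetical detection rates, for illustration only.
p_given_true = {"two_hybrid": 0.2, "mass_spec": 0.4}
p_given_false = {"two_hybrid": 0.01, "mass_spec": 0.05}

pairs = {
    "A-B": {"two_hybrid": True, "mass_spec": True},
    "A-C": {"two_hybrid": False, "mass_spec": True},
    "B-C": {"two_hybrid": False, "mass_spec": False},
}

ranked = sorted(
    pairs,
    key=lambda name: log_likelihood_ratio(pairs[name], p_given_true, p_given_false),
    reverse=True,
)
print(ranked)
```

A pair detected by only one assay still receives a positive score here, which is the point of ranking rather than requiring agreement among all assays.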

64

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio.


65

Collins et al., Mol Cell Proteomics 2007

ROC curve
x-axis: FP/(TN+FP), computed using high-confidence negatives
y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

True positives =
True negatives =

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

66

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use

Collins et al., Mol Cell Proteomics 2007

ROC curve (x-axis: FP/(TN+FP), computed using high-confidence negatives; y-axis: TP/(TP+FN), computed using high-confidence positives from complexes)

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67
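To make the two ROC axes concrete, here is a minimal sketch that computes FP/(TN+FP) and TP/(TP+FN) at each score threshold from a gold standard. The scores and labels are invented; `roc_points` is a hypothetical helper, not the procedure used by Collins et al.

```python
def roc_points(scores, labels):
    """Return (FP/(TN+FP), TP/(TP+FN)) pairs, one per threshold, highest first.

    labels: True for gold-standard positives, False for gold-standard negatives.
    A pair is called an interaction when its score is >= the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical ranking scores and gold-standard labels.
scores = [3.2, 2.1, 0.5, -0.4, -1.3]
labels = [True, True, False, True, False]
points = roc_points(scores, labels)
for false_positive_rate, true_positive_rate in points:
    print(false_positive_rate, true_positive_rate)
```

Lowering the threshold moves along the curve toward the upper right: more true positives are recovered, at the cost of more false positives.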

Collins et al., Mol Cell Proteomics 2007

ROC curve (x-axis: FP/(TN+FP); y-axis: TP/(TP+FN)), with annotations:
• 9074 interactions
• Literature curated (excluding 2-hybrid)
• Error rate equivalent to literature curated

68


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

73

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho).

75

Bayesian Networks

You could try to write out all the joint probabilities P(PPI | Rosetta, CC, GO, MIPS, ESS).

But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1)
  ⇒ $2^N - 1$ parameters
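To get a feel for how quickly the full joint table grows, the sketch below compares $2^N - 1$ with the parameter count of a naive Bayes structure (one hidden binary cause, each binary observation depending only on that cause). The naive Bayes count is an added illustration, not something stated on the slide, and both function names are hypothetical.

```python
def full_joint_parameters(n_variables):
    """Free parameters in a complete joint table over n binary variables."""
    return 2 ** n_variables - 1


def naive_bayes_parameters(n_observations):
    """One prior P(cause=T), plus P(obs=T | cause=T) and P(obs=T | cause=F) per observation."""
    return 1 + 2 * n_observations


# Example: one hidden interaction variable plus k observed assays.
for k in (4, 10, 20):
    print(k, full_joint_parameters(k + 1), naive_bayes_parameters(k))
```

With four assays the full table needs 31 numbers against 9 for the factored model, and the gap widens exponentially as assays are added.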

77

Graphical Structure Expresses our Beliefs

[Figure: the node "P1-P2 REAL" is the cause (often "hidden"); the nodes "Detected by X1", …, "Detected by Xn" are the effects (observed).]

78


Graphical Structure Expresses our Beliefs

[Figure: two networks. In the first, "P1-P2 REAL" is connected to "Detected by X1", …, "Detected by Xn". The second also contains "Detected by Xj", "Detected by Xj+1", … and the additional nodes "Membrane" and "Highly expressed".]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

80

Graphical Structure Expresses our Beliefs

[Figure: two networks. In the first, "P1-P2 REAL" is connected to "Detected by X1", …, "Detected by Xn". The second also contains "Detected by Xj", "Detected by Xj+1", … and the additional nodes "Membrane" and "Highly expressed".]

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Figure: network with nodes Smart, Grade inflation, Grades, GREs, and Admit.]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).
Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).
You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Figure: network with nodes Smart, Grade inflation, Grades, GREs, and Admit.]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability (Smart → Grades)

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

The formulations are equivalent, and both require the same number of constraints.

85
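The equivalence of the two formulations can be checked numerically: multiplying P(S) by P(G|S) should reproduce every entry of the joint table. A minimal sketch, using the values from the slide (the dictionary names are hypothetical):

```python
# Conditional formulation, values from the slide (False = F, True = T).
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Rebuild the joint table with P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

# Joint table as printed on the slide, keyed by (S, G).
expected = {
    (False, False): 0.665,
    (False, True): 0.035,
    (True, False): 0.06,
    (True, True): 0.24,
}

for key in sorted(joint):
    print(key, round(joint[key], 3), expected[key])
```

Both forms need three free numbers: the joint table has four entries that sum to 1, and the conditional form has P(S=T) plus one entry per row of P(G|S).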

Automated Admissions Decisions

[Figure: network with nodes Smart, Grade inflation, Grades, GREs, and Admit.]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

"Explaining Away"

[Figure: network with nodes Season, Sprinkler (S), Rain (R), Grass wet, and Slippery.]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

"Explaining Away"

[Figure: network with nodes Coin 1, Coin 2, and Score.]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
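The two-coin example is small enough to check by enumeration. The sketch below, with a hypothetical helper `posterior_c1_heads`, lists the four equally likely outcomes and conditions on whatever evidence is supplied.

```python
from itertools import product


def posterior_c1_heads(evidence):
    """P(C1 = H | evidence), enumerating the four equally likely coin outcomes.

    evidence maps any of "C1", "C2", "Score" to an observed value.
    Score is 1 when the coins match and 0 otherwise, as in the table.
    """
    matching = 0.0
    matching_and_heads = 0.0
    for c1, c2 in product("HT", repeat=2):
        world = {"C1": c1, "C2": c2, "Score": 1 if c1 == c2 else 0}
        if all(world[name] == value for name, value in evidence.items()):
            matching += 0.25
            if c1 == "H":
                matching_and_heads += 0.25
    return matching_and_heads / matching


print(posterior_c1_heads({}))                        # no evidence
print(posterior_c1_heads({"C2": "T"}))               # coin 2 alone
print(posterior_c1_heads({"Score": 1}))              # score alone
print(posterior_c1_heads({"Score": 1, "C2": "T"}))   # score and coin 2
```

Knowing C2 alone leaves the belief in C1 at 0.5, but once the score is also known, C2 determines C1 completely. The coins are independent a priori and become dependent given their common effect.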

88

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure. How do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

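[Figure: network with nodes Coin 1, Coin 2, and Score; the entries below are left blank, to be learned from data.]

p(C1=H) =        p(C1=T) =
p(C2=H) =        p(C2=T) =

C1    C2    Score
H     H
T     T
H     T
T     H

D = (C1, C2, S) = (H,T,0), (H,H,1), …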
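When the structure is known and every variable is observed, the maximum-likelihood parameters reduce to relative frequencies in the data. The sketch below assumes that fully observed setting; only the first two records come from the slide, and the rest are invented to make the example runnable.

```python
from collections import Counter

# Records are (C1, C2, Score). The first two appear on the slide;
# the remaining three are hypothetical.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]


def ml_estimate(records):
    """Maximum-likelihood estimates by counting.

    Returns P(C1=H), P(C2=H), and P(Score=1 | C1, C2) for each observed coin pair.
    """
    n = len(records)
    p_c1_heads = sum(1 for c1, _, _ in records if c1 == "H") / n
    p_c2_heads = sum(1 for _, c2, _ in records if c2 == "H") / n
    pair_counts = Counter((c1, c2) for c1, c2, _ in records)
    score_one_counts = Counter((c1, c2) for c1, c2, score in records if score == 1)
    p_score_one = {pair: score_one_counts[pair] / count for pair, count in pair_counts.items()}
    return p_c1_heads, p_c2_heads, p_score_one


p_c1_heads, p_c2_heads, p_score_one = ml_estimate(data)
print(p_c1_heads, p_c2_heads)
print(p_score_one)
```

With hidden variables or missing values, counting is no longer enough, which is where iterative methods such as EM come in.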

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use

A worked "toy" example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

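[Figure: network with nodes S (Smart), G (Grades), R (GREs), and A (Admit).]

$$P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R)$$

$$= P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R) \quad \text{(why?)}$$

(because of the conditional independence assumption)

Recall: $P(X,Y) = P(X \mid Y)\,P(Y)$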

96

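Prediction with Bayes Nets

[Figure: network with nodes S (Smart), G (Grades), R (GREs), and A (Admit).]

S:    P(F)    P(T)
      0.5     0.5

G:    S    P(F)    P(T)
      F    0.9     0.1
      T    0.8     0.2
Rough grading: being smart doesn't help very much.

R:    S    P(F)    P(T)
      F    0.8     0.2
      T    0.2     0.8
GREs are better correlated with intelligence than the grades.

A:    G    R    P(F)    P(T)
      F    F    0.9     0.1
      F    T    0.8     0.2
      T    F    0.5     0.5
      T    T    0.2     0.8

$$P(A=T \mid S=T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S=T)\,P(R \mid S=T)\,P(A=T \mid G,R)$$

$$= \underbrace{(0.8)(0.2)(0.1)}_{G=F,\ R=F} + \underbrace{(0.8)(0.8)(0.2)}_{G=F,\ R=T} + \underbrace{(0.2)(0.2)(0.5)}_{G=T,\ R=F} + \underbrace{(0.2)(0.8)(0.8)}_{G=T,\ R=T} = 0.29$$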
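The double sum for P(A=T | S=T) can be reproduced directly from the conditional probability tables. A minimal sketch, with the table values taken from the slide and hypothetical names for the dictionaries and the helper function:

```python
TF = (False, True)

# P_G[s][g] = P(G=g | S=s) and P_R[s][r] = P(R=r | S=s), values from the slide.
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
# P_A_TRUE[(g, r)] = P(A=T | G=g, R=r).
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2, (True, False): 0.5, (True, True): 0.8}


def p_admit_given_smart(s):
    """P(A=T | S=s): sum over grades and GREs of P(G|S) P(R|S) P(A=T|G,R)."""
    return sum(P_G[s][g] * P_R[s][r] * P_A_TRUE[(g, r)] for g in TF for r in TF)


print(round(p_admit_given_smart(True), 3))
```

The four terms are 0.016, 0.128, 0.02, and 0.128, which sum to 0.292 and round to the 0.29 shown on the slide.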

97

Inference with Bayes Nets

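[Figure: network with nodes S (Smart), G (Grades), R (GREs), and A (Admit).]

$$P(A=T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=T \mid G,R)$$

$$= P(S=T)\,P(A=T \mid S=T) + P(S=F)\,P(A=T \mid S=F) = 0.21$$

where P(S=T) = 0.5, P(S=F) = 0.5, P(A=T|S=T) = 0.29 (calculated on the previous slide), and P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T)).

$$P(S=T \mid A=T) = \frac{P(S=T, A=T)}{P(A=T)}$$

or, using Bayes Rule,

$$P(S=T \mid A=T) = \frac{P(S=T)\,P(A=T \mid S=T)}{P(A=T)}$$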

98

Inference with Bayes Nets

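[Figure: network with nodes S (Smart), G (Grades), R (GREs), and A (Admit).]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

$$P(R=F \mid A=F) = \frac{P(R=F, A=F)}{P(A=F)} = \frac{\sum_{S}\sum_{G} P(S)\,P(G \mid S)\,P(R=F \mid S)\,P(A=F \mid G, R=F)}{P(A=F)}$$

$$P(A=F) = \sum_{S}\sum_{G}\sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=F \mid G,R) \quad \text{(as before)}$$

Tedious but straightforward to compute:

$$\frac{P(G=F \mid A=F)}{P(R=F \mid A=F)} = \frac{0.92}{0.56} = 1.6$$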
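The "tedious but straightforward" sums can be handed to a short enumeration over all 16 joint states. This is a sketch under the assumption that the conditional probability tables are as transcribed on the "Prediction with Bayes Nets" slide; the function names are hypothetical. Note that direct enumeration from those tables gives P(A=T) ≈ 0.228 and a ratio of about 1.7, close to but not identical with the rounded 0.21 and 1.6 quoted on the slides.

```python
from itertools import product

TF = (False, True)
P_S = {False: 0.5, True: 0.5}
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P_G[s][g]
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P_R[s][r]
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2, (True, False): 0.5, (True, True): 0.8}


def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a = P_A_TRUE[(g, r)] if a else 1.0 - P_A_TRUE[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * p_a


def prob(**fixed):
    """Marginal probability of the assignments in `fixed` (keys: s, g, r, a)."""
    total = 0.0
    for s, g, r, a in product(TF, repeat=4):
        world = {"s": s, "g": g, "r": r, "a": a}
        if all(world[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total


def conditional(query, given):
    """P(query | given), with both arguments as dicts of assignments."""
    return prob(**query, **given) / prob(**given)


p_admit = prob(a=True)
p_smart_given_admit = conditional({"s": True}, {"a": True})
p_bad_grades = conditional({"g": False}, {"a": False})
p_bad_gres = conditional({"r": False}, {"a": False})
print(round(p_admit, 3), round(p_smart_given_admit, 3))
print(round(p_bad_grades, 3), round(p_bad_gres, 3), round(p_bad_grades / p_bad_gres, 2))
```

Either way the qualitative answer is the same: a rejected student is more likely to have had bad grades than bad GREs, largely because bad grades are common for everyone under this grading scheme.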

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Figure: the node "P1-P2 REAL" is the cause (often "hidden"); the nodes "Detected by X1", …, "Detected by Xn" are the effects (observed).]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1.5, 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: candidate patterns, one of which is marked "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

108

109

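likelihood ratio $= \dfrac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$: if > 1, classify as true; if < 1, classify as false.

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$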

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

1114/2150

81924/573734

$$\text{Likelihood ratio} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$$
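As a worked instance of L, the sketch below assumes the two numbers on the slide are the fractions 1114/2150 (gold-standard positive pairs carrying the feature value, out of all positives) and 81924/573734 (the same for gold-standard negatives). That reading of the garbled figure, and which of the three essentiality values it refers to, is an assumption; the function name is hypothetical.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), with each probability estimated as a fraction of counts."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts read from the slide, under the assumption stated above.
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 1))
```

A value of L above 1 means the feature value is enriched among gold-standard positives, so observing it raises the odds that a pair interacts.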

111

112


113

Fully connected → compute probabilities for all 16 possible combinations

114

Interpret with caution, as numbers are small.

115

[Figure annotations: the line TP = FP; predictions based on a single data type all have TP/FP < 1.]

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms

Page 54: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

Courtesy of BioMed Central Ltd Used with permissionSource Hart G Traver Arun K Ramani et al How Complete are Current Yeast and Human Protein interaction NetworksGenome Biology 7 no 11 (2006) 120

54

Finding real interactions

bull Take only those that are reported by gt1 method

bull Filter out ldquostickyrdquo proteins bull Estimate probability of each interaction based

on data bull Use external data to predict

55

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

Note log-log plot Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scale DataSets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

56

Finding real interactions

bull Estimate probability of each interaction based on data ndash How can we compute ( )_ |P real PPI Data

( _ _ _ _ )Data two hybrid mass spec co evolution co expression=

( )Data mass_spec_expt_1mass_spec_expt_2=

or

57

Bayes Rule

posterior prior likelihood

58

Naiumlve Bayes Classification

posterior prior likelihood

likelihood ratio = ratio of posterior probabilities

if gt 1 classify as true if lt 1 classify as false

How do we compute this

59

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

60

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

Ranking function = ( | _ )log( | _ )

P Data true PPIP Data false PPI

61

We assume the observations are independent (wersquoll see how to handle dependence soon)

Ranking function =

( | _ )log( | _ )

P Data true PPIP Data false PPI

62

We assume the observations are independent (wersquoll see how to handle dependence soon)

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

63

We assume the observations are independent (wersquoll see how to handle dependence soon) We can compute these terms if we have a set of high-confidence positive and negative interactions Exactly how we compute the terms depends on the type of data For affinity purificationmass spec see Collins et al Mol Cell Proteomics 2007 httpwwwmcponlineorgcontent63439long

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

64

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

TP(T

P+F

N)

ROC curve

True positives = True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives FP(TN+FP)

e de

nc )es

onfi ex

pl-c omgh c

ng h

i omfri

us (es

pute

vtiipo

sC

om

ROC curve

)FN+

(TP

TPROC curve

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD) True negatives = proteins pairs that 1 are annotated to belong to distinct complexes 2 have different sub-cellular locations OR anticorrelated mRNA expression

68 67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

Literature curated (excluding 2-hybrid)

Error rate equivalent to

literature curated

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

bull Bayesian Networks are a tool for reasoning with probabilities

bull Consist of a graph (network) and a set of probabilities

bull These can be ldquolearnedrdquo from the data

Bayesian Networks

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|Y2HUetz Y2HIto IPGavin IPHo)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|RosettaCCGOMIPSESS) But some problems are too big

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  ⇒ 2^N − 1 parameters
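To make the size argument concrete, here is a minimal sketch that counts free parameters. The function names are illustrative, and the comparison with a one-cause, many-effects (naïve Bayes) structure is an assumption added for contrast; it is not taken from the slide.

```python
def full_joint_parameters(n_binary_variables: int) -> int:
    """Free parameters in a complete joint table over N binary variables."""
    # 2**N states, minus one because the probabilities must sum to 1.
    return 2 ** n_binary_variables - 1


def naive_bayes_parameters(n_observations: int) -> int:
    """Free parameters when one hidden binary cause has N binary effects.

    One parameter for the prior on the cause, plus two per observation:
    P(observed | cause true) and P(observed | cause false).
    """
    return 1 + 2 * n_observations


for n in (4, 10, 20):
    # The full joint covers the hidden cause plus the n observations.
    print(n, full_joint_parameters(n + 1), naive_bayes_parameters(n))
```

The full table grows exponentially with the number of variables, while the factored structure grows linearly.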

Graphical Structure Expresses our Beliefs

[Diagram: a hidden cause node “P1-P2 REAL” points to the observed effect nodes “Detected by X1” … “Detected by Xn”. Labels: cause (often “hidden”), effects (observed).]

[Diagram: the same network shown next to an extended network in which “P1-P2 REAL” has the children “Detected by X1” … “Detected by Xn” and “Detected by Xj”, “Detected by Xj+1” …, with additional nodes “Membrane” and “Highly expressed”.]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Diagram: Bayesian network with nodes “Smart”, “Grade inflation”, “Grades”, “GREs” and “Admit”.]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Diagram: “Smart” points to “Grades”.]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints.
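The equivalence of the two formulations can be checked numerically. The sketch below rebuilds the joint table from P(S) and P(G|S) as given on the slide, then recovers the conditional table from the joint; the variable names are illustrative.

```python
# Conditional formulation from the slide: P(S) and P(G | S).
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint formulation: P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

for (s, g), p in joint.items():
    print(f"S={s!s:5} G={g!s:5} P(S,G)={p:.3f}")

# Going the other way: recover P(S) and P(G | S) from the joint table.
recovered_p_s = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}
recovered_p_g_given_s = {
    s: {g: joint[(s, g)] / recovered_p_s[s] for g in (False, True)}
    for s in (False, True)
}
```

Both representations need three free numbers: three joint entries (the fourth follows from the sum-to-one constraint), or P(S=T) plus one entry per row of the conditional table.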

Automated Admissions Decisions

[Diagram: Bayesian network with nodes “Smart”, “Grade inflation”, “Grades”, “GREs” and “Admit”.]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

“Explaining Away”

[Diagram: Bayesian network with nodes “Season”, “Sprinkler” (S), “Rain” (R), “Grass wet” and “Slippery”.]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

“Explaining Away”

[Diagram: “Coin 1” and “Coin 2” both point to “Score”.]

p(C1=H) = p(C1=T) = 0.5    p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
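The two-coin example is small enough to enumerate. This is a minimal sketch with illustrative helper names: it computes P(C1 = H) under different evidence by summing over the four equally likely outcomes.

```python
from itertools import product

# Two fair, independent coins; Score is 1 when they match, else 0.
outcomes = [
    (c1, c2, 1 if c1 == c2 else 0, 0.25)
    for c1, c2 in product("HT", repeat=2)
]


def prob(event, given=lambda c1, c2, s: True):
    """P(event | given), by summing over the four equally likely outcomes."""
    numerator = sum(
        p for c1, c2, s, p in outcomes if given(c1, c2, s) and event(c1, c2, s)
    )
    denominator = sum(p for c1, c2, s, p in outcomes if given(c1, c2, s))
    return numerator / denominator


def c1_heads(c1, c2, s):
    return c1 == "H"


# Without the score, C2 tells us nothing about C1.
print(prob(c1_heads))                                               # 0.5
print(prob(c1_heads, given=lambda c1, c2, s: c2 == "T"))            # 0.5
# Once Score = 1 is observed, C2 = T forces C1 = T.
print(prob(c1_heads, given=lambda c1, c2, s: s == 1))               # 0.5
print(prob(c1_heads, given=lambda c1, c2, s: s == 1 and c2 == "T")) # 0.0
```

The coins are independent on their own, but become dependent once their common child, Score, is observed.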

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data X_i
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

Learning Models from Data

[Diagram: “Coin 1” and “Coin 2” both point to “Score”.]

p(C1=H) =      p(C1=T) =
p(C2=H) =      p(C2=T) =

C1    C2    Score
H     H
T     T
H     T
T     H

D = {(C1, C2, S)} = {(H, T, 0), (H, H, 1), …}
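When the structure is known and all variables are observed, the maximum-likelihood parameters of a discrete network reduce to observed frequencies. The sketch below fills in the blank tables for the coin network from a small data set D. The rows of D beyond the two shown on the slide are hypothetical, and the function names are illustrative.

```python
from collections import Counter

# Hypothetical training data D = [(C1, C2, Score), ...]; illustrative values only.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]


def mle_coin(data, index):
    """Maximum-likelihood estimate of P(coin = H): the observed fraction of heads."""
    heads = sum(1 for row in data if row[index] == "H")
    return heads / len(data)


def mle_score_table(data):
    """Maximum-likelihood estimate of P(Score = 1 | C1, C2) from counts."""
    totals = Counter((c1, c2) for c1, c2, _ in data)
    ones = Counter((c1, c2) for c1, c2, s in data if s == 1)
    return {pair: ones[pair] / totals[pair] for pair in totals}


print(mle_coin(D, 0), mle_coin(D, 1))
print(mle_score_table(D))
```

With hidden variables or missing entries, counting no longer suffices, which is where iterative methods such as EM come in.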

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use

A worked “toy” example

Best to work through on your own

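Chain Rule of Probability for Bayes Nets

[Diagram: Bayesian network with nodes S (Smart), G (Grades), R (GREs) and A (Admit).]

$$P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R)$$
$$= P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R) \quad \text{(why?)}$$

(because of the conditional independence assumption)

Recall: $P(X,Y) = P(X \mid Y)\,P(Y)$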

Prediction with Bayes Nets

[Diagram: Bayesian network with nodes S (Smart), G (Grades), R (GREs) and A (Admit).]

S:
P(F)    P(T)
0.5     0.5

G:
S    P(F)    P(T)
F    0.9     0.1
T    0.8     0.2

R:
S    P(F)    P(T)
F    0.8     0.2
T    0.2     0.8

A:
G    R    P(F)    P(T)
F    F    0.9     0.1
F    T    0.8     0.2
T    F    0.5     0.5
T    T    0.2     0.8

$$P(A=T \mid S=T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S=T)\,P(R \mid S=T)\,P(A=T \mid G,R)$$
$$= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29$$

The four terms correspond to (G,R) = (F,F), (F,T), (T,F), (T,T).

Rough grading: being smart doesn’t help very much.

GREs are better correlated with intelligence than the grades.
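The prediction step can be reproduced in a few lines. This sketch encodes the conditional probability tables from the slide (True/False stand for T/F) and sums over the unobserved grades and GREs; the names are illustrative.

```python
# Conditional probability tables from the slide (True/False for T/F).
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
# P(A = T | G, R), keyed by (G, R).
P_ADMIT_GIVEN_GR = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def p_admit_given_smart(smart: bool) -> float:
    """P(A=T | S=smart), summing over the unobserved G and R."""
    total = 0.0
    for g in (False, True):
        for r in (False, True):
            total += (
                P_G_GIVEN_S[smart][g]
                * P_R_GIVEN_S[smart][r]
                * P_ADMIT_GIVEN_GR[(g, r)]
            )
    return total


print(round(p_admit_given_smart(True), 3))  # 0.292
```

Calling `p_admit_given_smart(False)` performs the analogous sum for S = F.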

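Inference with Bayes Nets

[Diagram: Bayesian network with nodes S (Smart), G (Grades), R (GREs) and A (Admit).]

$$P(A=T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=T \mid G,R)$$
$$= P(S=T)\,P(A=T \mid S=T) + P(S=F)\,P(A=T \mid S=F) \approx 0.23$$

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T), calculated on the previous slide, = 0.29
P(A=T|S=F), calculated analogously to P(A=T|S=T), = 0.16

$$P(S=T \mid A=T) = \frac{P(S=T, A=T)}{P(A=T)}$$

Or, using Bayes Rule:

$$P(S=T \mid A=T) = \frac{P(S=T)\,P(A=T \mid S=T)}{P(A=T)}$$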

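Inference with Bayes Nets

[Diagram: Bayesian network with nodes S (Smart), G (Grades), R (GREs) and A (Admit).]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$$P(R=F \mid A=F) = \frac{P(R=F, A=F)}{P(A=F)} = \frac{\sum_{S}\sum_{G} P(S)\,P(G \mid S)\,P(R=F \mid S)\,P(A=F \mid G, R=F)}{P(A=F)}$$

$$P(A=F) = \sum_{S}\sum_{G}\sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=F \mid G,R) \quad \text{(as before)}$$

Tedious but straightforward to compute:

$$\frac{P(G=F \mid A=F)}{P(R=F \mid A=F)} = \frac{0.94}{0.55} \approx 1.7$$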
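For a network this small, the “tedious” sums can be handed to a brute-force enumeration over all 16 assignments. The sketch below uses the conditional probability tables from the Prediction slide; the helper names (`joint`, `probability`, `conditional`) are illustrative, and enumeration is only practical for toy networks.

```python
from itertools import product

# Conditional probability tables from the worked example (True/False for T/F).
P_S = {False: 0.5, True: 0.5}
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_A_TRUE_GIVEN_GR = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a = P_A_TRUE_GIVEN_GR[(g, r)] if a else 1.0 - P_A_TRUE_GIVEN_GR[(g, r)]
    return P_S[s] * P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * p_a


def probability(**fixed):
    """Marginal probability that the named variables take the given values."""
    total = 0.0
    for s, g, r, a in product((False, True), repeat=4):
        assignment = {"s": s, "g": g, "r": r, "a": a}
        if all(assignment[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total


def conditional(query, evidence):
    """P(query | evidence), both given as dicts of variable name -> value."""
    return probability(**query, **evidence) / probability(**evidence)


p_admit = probability(a=True)
p_smart_given_admit = conditional({"s": True}, {"a": True})
p_bad_grades_given_reject = conditional({"g": False}, {"a": False})
p_bad_gres_given_reject = conditional({"r": False}, {"a": False})

print(p_admit, p_smart_given_admit)
print(p_bad_grades_given_reject, p_bad_gres_given_reject)
```

Comparing the last two values answers the question on the slide: whichever is larger is the more likely explanation for a rejection.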

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: a hidden cause node “P1-P2 REAL” points to the observed effect nodes “Detected by X1” … “Detected by Xn”. Labels: cause (often “hidden”), effects (observed).]

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1.5: 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. “Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations.” Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use
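As a rough illustration of the distance d, the sketch below compares mRNA expression profiles with a Euclidean distance. The choice of Euclidean distance, the function name and all profile values are assumptions made for this example; the slide only says that d measures the difference between two profiles.

```python
import math


def expression_distance(profile_a, profile_b):
    """Euclidean distance between two equal-length expression profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same conditions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


# Hypothetical log-ratio profiles over four conditions.
gene_x = [1.0, 0.5, -0.2, 2.0]
gene_y = [0.9, 0.6, -0.1, 1.8]   # similar to gene_x
gene_z = [-1.0, 2.0, 1.5, -0.5]  # dissimilar to gene_x

print(expression_distance(gene_x, gene_y) < expression_distance(gene_x, gene_z))  # True
```

A set of interactions dominated by small distances looks more like the high-confidence set than like random pairs.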

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: candidate patterns, one annotated “More likely to interact”.]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. “An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles.” BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Give appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

[Figures: Jansen et al., Science 302, no. 5644 (2003): 449-53.]

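$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$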

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114/2150}{81924/573734}$$
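The likelihood for one feature value is a ratio of two frequencies taken from the gold-standard sets. The sketch below evaluates it for the counts printed on the slide; the function and argument names are illustrative, and reading the four numbers as "pairs with value f" over "all pairs" in each gold-standard set is an interpretation of the slide.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as they appear on the slide.
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 1))  # 3.6
```

A value above 1 means the feature value is enriched among gold-standard positives relative to gold-standard negatives.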

Fully connected → Compute probabilities for all 16 possible combinations

Interpret with caution as numbers are small

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Finding real interactions

• Take only those that are reported by >1 method
• Filter out “sticky” proteins
• Estimate probability of each interaction based on data
• Use external data to predict

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. “Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions.” Nature 417, no. 6887 (2002): 399-403.

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\text{real\_PPI} \mid \text{Data})$?

$$\text{Data} = (\text{two\_hybrid}, \text{mass\_spec}, \text{co\_evolution}, \text{co\_expression})$$

or

$$\text{Data} = (\text{mass\_spec\_expt\_1}, \text{mass\_spec\_expt\_2})$$

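Bayes Rule

$$\underbrace{P(\text{true\_PPI} \mid \text{Data})}_{\text{posterior}} = \frac{\overbrace{P(\text{true\_PPI})}^{\text{prior}}\;\overbrace{P(\text{Data} \mid \text{true\_PPI})}^{\text{likelihood}}}{P(\text{Data})}$$

Naïve Bayes Classification

likelihood ratio = ratio of posterior probabilities

if > 1 classify as true; if < 1 classify as false

How do we compute this?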

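$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$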
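A short check of why the prior can be dropped when ranking: the log posterior odds are the log likelihood ratio plus a constant that is identical for every candidate pair. The pair names and log likelihood ratios below are hypothetical, as are the function names.

```python
import math


def log_posterior_odds(log_likelihood_ratio, prior_true):
    """log[P(true|Data) / P(false|Data)] = log prior odds + log likelihood ratio."""
    return math.log(prior_true / (1.0 - prior_true)) + log_likelihood_ratio


# Hypothetical log likelihood ratios for three candidate interactions.
candidates = {"A-B": 2.3, "C-D": -0.7, "E-F": 0.4}


def ranking(prior_true):
    """Candidate pairs sorted from most to least likely under the given prior."""
    scores = {
        pair: log_posterior_odds(llr, prior_true)
        for pair, llr in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


print(ranking(0.5) == ranking(0.001))  # True
```

The prior still matters for the classification threshold (posterior ratio above or below 1), just not for the order.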

We assume the observations are independent (we’ll see how to handle dependence soon).

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol. Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long
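Under the independence assumption, the log of the product becomes a sum of per-observation log likelihood ratios. The sketch below scores and ranks a few protein pairs that way. The assay names, likelihood values and pair names are all hypothetical; in practice each likelihood would be estimated from high-confidence positive and negative sets.

```python
import math

# Hypothetical per-assay likelihoods:
# (assay, observed value) -> (P(observation | true_PPI), P(observation | false_PPI)).
LIKELIHOODS = {
    ("two_hybrid", True): (0.30, 0.01),
    ("two_hybrid", False): (0.70, 0.99),
    ("mass_spec", True): (0.60, 0.05),
    ("mass_spec", False): (0.40, 0.95),
}


def ranking_score(observations):
    """Sum of per-observation log likelihood ratios (naive Bayes)."""
    score = 0.0
    for assay, value in observations.items():
        p_true, p_false = LIKELIHOODS[(assay, value)]
        score += math.log(p_true / p_false)
    return score


pairs = {
    "P1-P2": {"two_hybrid": True, "mass_spec": True},
    "P3-P4": {"two_hybrid": False, "mass_spec": True},
    "P5-P6": {"two_hybrid": False, "mass_spec": False},
}
ranked = sorted(pairs, key=lambda name: ranking_score(pairs[name]), reverse=True)
print(ranked)  # ['P1-P2', 'P3-P4', 'P5-P6']
```

A pair that was never tested in some assay simply contributes no term for it, which is one simple way to handle missing data in this scheme.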

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Collins et al., Mol. Cell Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: “9074 interactions”; “Literature curated (excluding 2-hybrid)”; “Error rate equivalent to literature curated”.]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

Source: Collins, Sean R., Patrick Kemmeren, et al. “Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae.” Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
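The two axes of the ROC curve can be traced by sweeping a score cutoff over gold-standard pairs. This is a minimal sketch with hypothetical scores and labels and an illustrative function name. It assumes both classes are present and does not treat tied scores specially.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs as the score cutoff is lowered.

    scores: ranking score per protein pair.
    labels: True for gold-standard positives, False for gold-standard negatives.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    ordered = sorted(zip(scores, labels), key=lambda item: item[0], reverse=True)
    for _, is_positive in ordered:
        if is_positive:
            tp += 1
        else:
            fp += 1
        # TPR = TP / (TP + FN); FPR = FP / (TN + FP).
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores and gold-standard labels.
scores = [5.9, 2.1, 1.4, 0.3, -1.2, -2.0]
labels = [True, True, False, True, False, False]
print(roc_points(scores, labels))
```

Each point corresponds to one cutoff; a scoring scheme that ranks positives above negatives pushes the curve toward the top-left corner.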

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

bull Bayesian Networks are a tool for reasoning with probabilities

bull Consist of a graph (network) and a set of probabilities

bull These can be ldquolearnedrdquo from the data

Bayesian Networks

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|Y2HUetz Y2HIto IPGavin IPHo)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|RosettaCCGOMIPSESS) But some problems are too big

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

bull Complete joint probability tables are large and often unknown

bull N binary variables = 2N states ndash only one constraint (sum of all probabilities =1)

=gt 2N - 1 parameters

77

Graphical Structure Expresses our Beliefs

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed)

78

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naiumlve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

bull The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure In a Bayesian Network we donrsquot need the full probability distribution A node is independent of its ancestors given its parents For example The activity of a gene does not depend on the activity of TF A1 once I know TF B2 82

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Prediction we observe the ldquocausesrdquo (roots parents) and want to predict the ldquoeffectsrdquo (leaveschildren)

Given Grades GREs will we admit

Inference we observe the ldquoeffectsrdquo (leaves children) and want to infer the hidden values of the ldquocausesrdquo (rootsparents)

You meet an admitted the student Is she smart

Admit

83

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictionsinferences requires knowing the joint probabilities P(admit grade inflation smart grades GREs) We will find conditional probabilities to be very useful

84

S G (above

threshold)

P(SG)

F F 0665

F T 0035

T F 006

T T 024

Automated Admissions Decisions Joint Probability

S P(G=F|S) P(G=T|S)

F 095 005

T 02 08

P(S=F)=07 P(S=T)=03

Grades

Smart

Conditional Probability

Formulations are equivalent and both require same number of constraints 85

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we donrsquot need the full probability distribution The joint probability depends only on ldquoparentsrdquo For example GRE scores do not depend on the level of grade inflation at the school but the grades do

Admit

86

ldquoExplaining Awayrdquo Season

S Sprinkler

R Rain

Grass wet

Does the probability that itrsquos raining depend on whether the sprinkler is on

Slippery

In a causal sense clearly not But in a probabilistic model the knowledge that it is raining influences our beliefs

87

Coin 1

Coin 2

Score

p(C2=H)=p(C2=T) = 5 p(C1=H)=p(C1=T) = 5

C1 C2 Score

H H 1

T T 1

H T 0

T H 0

Does the prob that C1 =H on depend on whether C2 =T If we know the score then our belief in the state of C1 is influenced by our belief in the state C2

ldquoExplaining Awayrdquo

88

How do we obtain a BN

bull Two problems ndash learning graph structure

bull NP-complete bull approximation algorithms

ndash probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

bull Two common objective functions ndash Maximum likelihood

bull Define the likelihood over training data Xi

ndash Maximum posterior

bull Good search algorithms exist ndash Gradient descent EM Gibbs Sampling hellip

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D= (C1C2S) = (HT0) (HH1) hellip

92

Learning Models from Data

bull Searching for the BN structure NP-complete ndash Too many possible structures to evaluate all of

them even for very small networks ndash Many algorithms have been proposed ndash Incorporated some prior knowledge can reduce

the search space bull Which measurements are likely independent bull Which nodes should regulate transcription bull Which should cause changes in phosphorylation

93

Resources to learn more

Kevin Murphyrsquos tutorial httpwwwcsubcca~murphykBayesbnintrohtml

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets S

P(F) P(T) 05 05

G S P(F) P(T) F 09 01 T 08 02 R

S P(F) P(T) F 08 02 T 02 08

A G R P(F) P(T) F F 09 01 F T 08 02 T F 05 05 T T 02 08

P(A=T|S=T) = ΣΣ P(G|S=T)P(R|S=T)P(A=T|GR) G=FT R=FT

= (08)(02)(01) + (08)(08)(02) + (02)(02)(05) + (02)(08)(08) =29 F F F T T F T T

Rough grading being smart doesnrsquot help very much

GREs are better correlated with intelligence than the grades

G Grade

s

S Smart

R GREs

A Admit

97

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(A=T) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) S=FT G=FT

R=FT

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) = 021 P(S=T) = 05 P(S=F) = 05 P(A=T|S=F) calculated analogously to P(A=T|S=T)

P(S=T) = 05 P(A=T|S=T) calculated on previous slide =29 P(A=T|S=F) =14

P(S=T|A=T) = P(S=TA=T)P(A=T) Or using Bayes Rule = P(S=T)P(A=T|S=T)P(A=T)

98

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

If a student is not admitted is it more likely they had bad GREs or bad grades

Compute P(R=F|A=F) and P(G=F|A=T)

P(R=F|A=F) = P(R=FA=F) P(A=F) = [Σ Σ P(S)P(G=F)P(R)P(A=F)]P(A=F) G=FT

R=FT

P(A=F) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) (as before)

Tedious but straightforward to compute

P(G=F|A=T) = 92 = 16 P(R=F|A=F) 56

99

End of worked example

100

Goal

bull Estimate interaction probability using ndash Affinity capture ndash Two-hybrid ndash Less physical data

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed) 101

Properties of real interactions correlated expression Expression Profile Reliability (EPR)

Deane et al Mol amp Cell Proteomics (2002) 15 349-356

INT = high confidence interactions from

small scale experiments

d = ldquodistancerdquo that measures the difference between two mRNA expression profiles

Note proteins involved in ldquotruerdquo protein-protein interactions have more similar mRNA expression profiles than random pairs Use this to assess how good an experimental set

of interactions is

Source Deane Charlotte M Łukasz Salwiński et al Protein Interactions Two Methods for Aassessment ofthe Reliability of High Throughput Observations Molecular amp Cellular Proteomics 1 no 5 (2002) 349-56

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al BMC Bioinformatics 2007 8(Suppl 4)S7 doi1011861471-2105-8-S4-S7

More likely to interact

Which pattern below is more likely to represent a pair of interacting proteins

104

Courtesy of Cokus et al License CC-BYSource Cokus Shawn Sayaka Mizutani et al An Improved Method for Identifying FunctionallyLinked Proteins using Phylogenetic Profiles BMC Bioinformatics 8 no Suppl 4 (2007) S7

103

Rosetta Stone

bull Look for genes that are fused in some organisms ndash Almost 7000 pairs found in

E coli ndash gt6 of known interactions

can be found with this method

ndash Not very common in eukaryotes

104

Integrating diverse data

httpwwwsciencemagorgcontent3025644449abstract 105

Advantage of Bayesian Networks

bull Data can be a mix of types numerical and categorical

bull Accommodates missing data bull Give appropriate weights to different sources bull Results can be interpreted easily

106

Requirement of Bayesian Classification

bull Gold standard training data ndash Independent from evidence ndash Large ndash No systematic bias

Positive training data MIPS bull Hand-curated from literature Negative training data bull Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

108

109

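likelihood ratio =

$$\frac{P(\mathrm{true\_PPI}\mid \mathrm{Data})}{P(\mathrm{false\_PPI}\mid \mathrm{Data})} = \frac{P(\mathrm{Data}\mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data}\mid \mathrm{false\_PPI})\,P(\mathrm{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio =

$$\log\frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log\frac{P(\mathrm{Data}\mid \mathrm{true\_PPI})}{P(\mathrm{Data}\mid \mathrm{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function =

$$\log\frac{P(\mathrm{Data}\mid \mathrm{true\_PPI})}{P(\mathrm{Data}\mid \mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i\mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i\mid \mathrm{false\_PPI})}$$

110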

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

$$\mathrm{Likelihood} = L = \frac{P(f\mid \mathrm{pos})}{P(f\mid \mathrm{neg})}$$

Example: P(f | pos) = 1114/2150 and P(f | neg) = 81924/573734

111
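As a rough check of the likelihood formula, the ratio can be computed directly from gold-standard counts. The four counts below are our reading of the digit strings on this slide (1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives showing the feature value); treat that reading, and the function name, as assumptions.

```python
def likelihood_ratio(n_pos_with_f, n_pos_total, n_neg_with_f, n_neg_total):
    """L = P(f | pos) / P(f | neg), each estimated as a simple fraction of counts."""
    p_f_given_pos = n_pos_with_f / n_pos_total
    p_f_given_neg = n_neg_with_f / n_neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as read from the slide (assumed split of the digit strings).
L_example = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L_example, 2))
```

A value above 1 means the feature value is enriched among true interactions and raises a pair's rank; a value below 1 lowers it.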

112

113

Fully connected → compute probabilities for all 16 possible combinations

114

Interpret with caution, as the numbers are small

115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\mathrm{Data}\mid \mathrm{true\_PPI})}{P(\mathrm{Data}\mid \mathrm{false\_PPI})}$$

116
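The question on this slide can be sketched as a simple count: for each cutoff on the log likelihood ratio, how many gold-standard positives (TP) and gold-standard negatives (FP) score at or above it? The scored pairs and cutoffs below are hypothetical values chosen only to illustrate the bookkeeping.

```python
# Hypothetical (log-likelihood-ratio score, gold-standard label) pairs.
scored_pairs = [
    (4.2, True), (3.1, True), (2.5, False), (1.9, True),
    (0.7, False), (0.2, True), (-0.5, False), (-1.3, False),
]

def tp_fp_at_cutoff(pairs, cutoff):
    """Count gold-standard positives and negatives scoring at or above cutoff."""
    tp = sum(1 for score, is_true in pairs if score >= cutoff and is_true)
    fp = sum(1 for score, is_true in pairs if score >= cutoff and not is_true)
    return tp, fp

for cutoff in (3.0, 1.0, 0.0):
    tp, fp = tp_fp_at_cutoff(scored_pairs, cutoff)
    ratio = tp / fp if fp else float("inf")
    print(f"cutoff={cutoff:>4}: TP={tp} FP={fp} TP/FP={ratio:.2f}")
```

Raising the cutoff trades coverage for accuracy: fewer pairs pass, but a larger share of them are gold-standard positives.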

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

  • Slide Number 1
  • Predictions
  • Prediction Challenges
  • Slide Number 4
  • Slide Number 5
  • Predicting Structures of Complexes
  • Docking
  • Docking
  • Reducing the search space
  • Next
  • PRISM’s Rationale
  • Slide Number 12
  • Subtilisin and its inhibitors
  • Hotspots
  • Hotspots
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Slide Number 19
  • Slide Number 20
  • Slide Number 21
  • Flowchart
  • Slide Number 23
  • Next
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Slide Number 30
  • Slide Number 31
  • Slide Number 32
  • Slide Number 33
  • Slide Number 34
  • Slide Number 35
  • Slide Number 36
  • Slide Number 37
  • Slide Number 38
  • Outline
  • Detecting protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Tagging strategies
  • Yeast two-hybrid
  • Slide Number 45
  • Outline
  • Error Rates
  • Error Rates
  • Data Integration
  • Data Integration
  • Data Integration
  • Data Integration
  • Estimated Error Rates
  • Slide Number 54
  • Finding real interactions
  • Slide Number 56
  • Finding real interactions
  • Bayes Rule
  • Naïve Bayes Classification
  • Slide Number 61
  • Slide Number 62
  • Slide Number 63
  • Slide Number 64
  • Slide Number 65
  • Slide Number 66
  • ROC curve
  • ROC curve
  • ROC curve
  • ROC curve
  • Outline
  • Bayesian Networks
  • In Biology
  • Slide Number 74
  • Slide Number 75
  • Slide Number 76
  • Slide Number 77
  • Bayesian Networks
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Slide Number 83
  • Slide Number 84
  • Slide Number 85
  • Slide Number 86
  • Slide Number 87
  • “Explaining Away”
  • Slide Number 89
  • How do we obtain a BN
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Resources to learn more
  • A worked “toy” example
  • Chain Rule of Probability for Bayes Nets
  • Prediction with Bayes Nets
  • Inference with Bayes Nets
  • Inference with Bayes Nets
  • End of worked example
  • Goal
  • Slide Number 103
  • Co-evolution
  • Rosetta Stone
  • Integrating diverse data
  • Advantage of Bayesian Networks
  • Requirement of Bayesian Classification
  • Integrating diverse data
  • Slide Number 110
  • Slide Number 111
  • Slide Number 112
  • Slide Number 113
  • Slide Number 114
  • Slide Number 115
  • Slide Number 116
  • Slide Number 117
  • Summary

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Note: log-log plot

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. “Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions.” Nature 417, no. 6887 (2002): 399-403.

56

Finding real interactions

• Estimate probability of each interaction based on data
  – How can we compute $P(\mathrm{real\_PPI}\mid \mathrm{Data})$?

$$\mathrm{Data} = (\text{two\_hybrid},\ \text{mass\_spec},\ \text{co\_evolution},\ \text{co\_expression})$$

or

$$\mathrm{Data} = (\text{mass\_spec\_expt\_1},\ \text{mass\_spec\_expt\_2})$$

57

Bayes Rule

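$$\underbrace{P(\mathrm{true\_PPI}\mid \mathrm{Data})}_{\text{posterior}} = \frac{\overbrace{P(\mathrm{true\_PPI})}^{\text{prior}}\ \overbrace{P(\mathrm{Data}\mid \mathrm{true\_PPI})}^{\text{likelihood}}}{P(\mathrm{Data})}$$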

58

Naïve Bayes Classification

posterior ∝ prior × likelihood

likelihood ratio = ratio of posterior probabilities

$$\frac{P(\mathrm{true\_PPI}\mid \mathrm{Data})}{P(\mathrm{false\_PPI}\mid \mathrm{Data})}$$

if > 1 classify as true; if < 1 classify as false

How do we compute this?

59

likelihood ratio =

$$\frac{P(\mathrm{true\_PPI}\mid \mathrm{Data})}{P(\mathrm{false\_PPI}\mid \mathrm{Data})} = \frac{P(\mathrm{Data}\mid \mathrm{true\_PPI})\,P(\mathrm{true\_PPI})}{P(\mathrm{Data}\mid \mathrm{false\_PPI})\,P(\mathrm{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio =

$$\log\frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log\frac{P(\mathrm{Data}\mid \mathrm{true\_PPI})}{P(\mathrm{Data}\mid \mathrm{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function =

$$\log\frac{P(\mathrm{Data}\mid \mathrm{true\_PPI})}{P(\mathrm{Data}\mid \mathrm{false\_PPI})}$$

61

We assume the observations are independent (we’ll see how to handle dependence soon)

Ranking function =

$$\log\frac{P(\mathrm{Data}\mid \mathrm{true\_PPI})}{P(\mathrm{Data}\mid \mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i\mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i\mid \mathrm{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

64
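Under the independence assumption, the log of the product becomes a sum of per-observation log ratios, which is easy to sketch. The assay names, detection rates, and candidate pairs below are hypothetical; in practice the rates would be estimated from high-confidence positive and negative sets.

```python
import math

# Hypothetical per-assay rates estimated from gold standards:
# (P(detected | true_PPI), P(detected | false_PPI)).
assay_rates = {
    "two_hybrid": (0.20, 0.010),
    "mass_spec": (0.40, 0.050),
    "co_expression": (0.60, 0.300),
}

def log_likelihood_ratio(observations):
    """Sum per-assay log ratios, assuming assays are independent given the class."""
    total = 0.0
    for assay, detected in observations.items():
        p_true, p_false = assay_rates[assay]
        if not detected:
            # A non-detection is also evidence: use the complementary rates.
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        total += math.log(p_true / p_false)
    return total

candidates = {
    "pair_1": {"two_hybrid": True, "mass_spec": True, "co_expression": True},
    "pair_2": {"two_hybrid": False, "mass_spec": True, "co_expression": False},
    "pair_3": {"two_hybrid": False, "mass_spec": False, "co_expression": False},
}
ranking = sorted(candidates, key=lambda p: log_likelihood_ratio(candidates[p]), reverse=True)
print(ranking)
```

Because the prior is the same for every pair, sorting by this sum gives the same order as sorting by posterior probability.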

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the log likelihood ratio.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. “Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions.” Nature 417, no. 6887 (2002): 399-403.

65

Collins et al., Mol Cell Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

Source: Collins, Sean R., Patrick Kemmeren, et al. “Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae.” Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

67
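The two axes of the ROC curve can be computed by sweeping a score cutoff over pairs with gold-standard labels. This is a minimal sketch with hypothetical scores and labels; it does not reproduce the Collins et al. scoring, and it does not treat tied scores specially.

```python
def roc_points(scores, labels):
    """Return (FP/(TN+FP), TP/(TP+FN)) pairs as the cutoff sweeps from high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores; True marks a gold-standard positive, False a gold-standard negative.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False, False]
curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

A scoring scheme that ranks every positive above every negative would give an area of 1; random ranking gives about 0.5.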

Collins et al., Mol Cell Proteomics 2007

ROC curve

[Figure: the same ROC curve (x-axis: FP/(TN+FP); y-axis: TP/(TP+FN)), annotated with “9074 interactions”, “Literature curated (excluding 2-hybrid)”, and “Error rate equivalent to literature curated”.]

69

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H Uetz, Y2H Ito, IP Gavin, IP Ho)

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. “Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks.” Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  ⇒ 2^N − 1 parameters

77
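To see how quickly the full table grows, compare it with a naïve Bayes model that has one binary class variable and N binary observations (one prior plus two conditional probabilities per observation). The function names are ours, chosen for illustration.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters in a complete joint table: 2^N states minus one constraint."""
    return 2 ** n_binary_variables - 1

def naive_bayes_parameters(n_binary_observations):
    """One class prior plus P(observation | class) for each of the two classes."""
    return 1 + 2 * n_binary_observations

for n in (4, 10, 20):
    # The full joint covers the class variable plus the n observations.
    print(n, full_joint_parameters(n + 1), naive_bayes_parameters(n))
```

The gap between the two columns is the reason a graph structure with few parents per node is attractive when training data are limited.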

Graphical Structure Expresses our Beliefs

[Diagram: node “P1-P2 REAL” (cause, often “hidden”) with arrows to nodes “Detected by X1” … “Detected by Xn” (effects, observed).]

78

Graphical Structure Expresses our Beliefs

[Diagram 1: node “P1-P2 REAL” with arrows to “Detected by X1” … “Detected by Xn”.]

Naïve Bayes assumes all observations are independent (given whether the interaction is real).

[Diagram 2: node “P1-P2 REAL” with “Detected by X1” … “Detected by Xn”, plus additional nodes “Membrane” and “Highly expressed” alongside “Detected by Xj”, “Detected by Xj+1”, ….]

But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data.

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints.

85
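The equivalence of the two formulations can be checked numerically: multiplying the prior by the conditional table reproduces the joint table, and marginalizing the joint recovers the prior and the conditionals. The variable names are ours; the numbers are those in the tables.

```python
# Conditional formulation, as given in the tables.
p_smart = {"F": 0.7, "T": 0.3}
p_grades_given_smart = {
    "F": {"F": 0.95, "T": 0.05},
    "T": {"F": 0.20, "T": 0.80},
}

# Joint formulation: P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_smart[s] * p_grades_given_smart[s][g]
    for s in "FT"
    for g in "FT"
}

# Going back: marginalize to recover P(S), then divide to recover P(G | S).
recovered_p_smart = {s: sum(joint[(s, g)] for g in "FT") for s in "FT"}
recovered_conditional = {
    s: {g: joint[(s, g)] / recovered_p_smart[s] for g in "FT"} for s in "FT"
}

for key in sorted(joint):
    print(key, round(joint[key], 3))
```

Either form needs three independent numbers: three free entries in the joint table, or one prior plus one conditional per value of S.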

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

“Explaining Away”

[Diagram nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

“Explaining Away”

[Diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1  C2  Score
H   H   1
T   T   1
H   T   0
T   H   0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.

88
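The coin example is small enough to enumerate. The sketch below lists the four equally likely outcomes and conditions on different pieces of evidence; the helper names are ours.

```python
from itertools import product

def joint_outcomes():
    """Yield (c1, c2, score, probability) for two fair, independent coins."""
    for c1, c2 in product("HT", repeat=2):
        score = 1 if c1 == c2 else 0
        yield c1, c2, score, 0.25

def prob(event, given=lambda c1, c2, s: True):
    """P(event | given) by summing over the four outcomes."""
    num = sum(p for c1, c2, s, p in joint_outcomes() if given(c1, c2, s) and event(c1, c2, s))
    den = sum(p for c1, c2, s, p in joint_outcomes() if given(c1, c2, s))
    return num / den

c1_heads = lambda c1, c2, s: c1 == "H"

print(prob(c1_heads))                                            # no evidence
print(prob(c1_heads, lambda c1, c2, s: c2 == "T"))               # C2 known, score unknown
print(prob(c1_heads, lambda c1, c2, s: s == 1))                  # score known, C2 unknown
print(prob(c1_heads, lambda c1, c2, s: s == 1 and c2 == "T"))    # score and C2 known
```

Without the score, learning C2 tells us nothing about C1. Once the score is observed, learning C2 determines C1 completely, which is the coupling the slide describes.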

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data X_i
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

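Learning Models from Data

[Diagram: Coin 1 → Score ← Coin 2]

Parameters to be learned from the data: p(C1=H), p(C1=T), p(C2=H), p(C2=T), and the Score for each combination of (C1, C2): (H,H), (T,T), (H,T), (T,H).

D = (C1, C2, S) = (H,T,0), (H,H,1), …

92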
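With the structure fixed and every variable observed, maximum-likelihood estimation reduces to counting. The data set below is hypothetical (the slide shows only the first two records), and the function names are ours.

```python
from collections import Counter

# Hypothetical training data D = [(C1, C2, S), ...].
data = [
    ("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("H", "H", 1),
    ("T", "H", 0), ("H", "T", 0), ("T", "T", 1), ("H", "H", 1),
]

def mle_heads(rows, index):
    """Maximum-likelihood estimate of P(coin = H): the observed fraction of heads."""
    counts = Counter(row[index] for row in rows)
    return counts["H"] / len(rows)

def mle_score_table(rows):
    """Estimate P(Score = 1 | C1, C2) for every coin combination seen in the data."""
    totals = Counter((c1, c2) for c1, c2, _ in rows)
    ones = Counter((c1, c2) for c1, c2, s in rows if s == 1)
    return {pair: ones[pair] / totals[pair] for pair in totals}

p_c1_heads = mle_heads(data, 0)
p_c2_heads = mle_heads(data, 1)
score_table = mle_score_table(data)
print(p_c1_heads, p_c2_heads, score_table)
```

Combinations that never appear in the data get no estimate at all, which is one reason priors (maximum posterior) or pseudocounts are often used with small data sets.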

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

$$P(S,G,R,A) = P(S)\,P(G\mid S)\,P(R\mid G,S)\,P(A\mid S,G,R)$$

$$= P(S)\,P(G\mid S)\,P(R\mid S)\,P(A\mid G,R)$$

Why? Because of the conditional independence assumption.

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Diagram: S (Smart), G (Grades), R (GREs), A (Admit)]

P(S):
P(S=F)  P(S=T)
0.5     0.5

P(G | S):
S   P(G=F)  P(G=T)
F   0.9     0.1
T   0.8     0.2

P(R | S):
S   P(R=F)  P(R=T)
F   0.8     0.2
T   0.2     0.8

P(A | G, R):
G   R   P(A=F)  P(A=T)
F   F   0.9     0.1
F   T   0.8     0.2
T   F   0.5     0.5
T   T   0.2     0.8

$$P(A{=}T\mid S{=}T) = \sum_{G\in\{F,T\}}\ \sum_{R\in\{F,T\}} P(G\mid S{=}T)\,P(R\mid S{=}T)\,P(A{=}T\mid G,R)$$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29

(terms in order of G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn’t help very much.

GREs are better correlated with intelligence than the grades.

97

Inference with Bayes Nets

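[Diagram: S (Smart), G (Grades), R (GREs), A (Admit)]

$$P(A{=}T) = \sum_{S\in\{F,T\}}\ \sum_{G\in\{F,T\}}\ \sum_{R\in\{F,T\}} P(S)\,P(G\mid S)\,P(R\mid S)\,P(A{=}T\mid G,R)$$

$$= P(S{=}T)\,P(A{=}T\mid S{=}T) + P(S{=}F)\,P(A{=}T\mid S{=}F) = 0.23$$

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on previous slide)
P(A=T|S=F) = 0.16 (calculated analogously to P(A=T|S=T))

$$P(S{=}T\mid A{=}T) = \frac{P(S{=}T,\,A{=}T)}{P(A{=}T)}$$

Or, using Bayes’ rule:

$$P(S{=}T\mid A{=}T) = \frac{P(S{=}T)\,P(A{=}T\mid S{=}T)}{P(A{=}T)}$$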

98

Inference with Bayes Nets

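[Diagram: S (Smart), G (Grades), R (GREs), A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$$P(R{=}F\mid A{=}F) = \frac{P(R{=}F,\,A{=}F)}{P(A{=}F)} = \frac{\sum_{S}\sum_{G} P(S)\,P(G\mid S)\,P(R{=}F\mid S)\,P(A{=}F\mid G,R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S}\sum_{G}\sum_{R} P(S)\,P(G\mid S)\,P(R\mid S)\,P(A{=}F\mid G,R) = 1 - P(A{=}T)$$

Tedious but straightforward to compute:

$$\frac{P(G{=}F\mid A{=}F)}{P(R{=}F\mid A{=}F)} = \frac{0.94}{0.55} = 1.7$$

99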
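Because the toy network has only four binary variables, every query in the worked example can be answered by brute-force enumeration of the 16 joint states. The sketch below encodes the conditional probability tables from the “Prediction with Bayes Nets” slide; the function and variable names are ours, and enumeration is just one simple way to do the sums.

```python
from itertools import product

# Conditional probability tables from the worked example (True = T, False = F).
P_S = {True: 0.5, False: 0.5}
P_G_TRUE = {True: 0.2, False: 0.1}   # P(G=T | S)
P_R_TRUE = {True: 0.8, False: 0.2}   # P(R=T | S)
P_A_TRUE = {                         # P(A=T | G, R)
    (False, False): 0.1, (False, True): 0.2,
    (True, False): 0.5, (True, True): 0.8,
}

def bernoulli(p_true, value):
    """Probability of a binary value given the probability that it is True."""
    return p_true if value else 1.0 - p_true

def joint(s, g, r, a):
    """Chain rule for this network: P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (P_S[s] * bernoulli(P_G_TRUE[s], g)
            * bernoulli(P_R_TRUE[s], r) * bernoulli(P_A_TRUE[(g, r)], a))

def probability(query, evidence=None):
    """P(query | evidence); both arguments are dicts such as {"A": True}."""
    evidence = evidence or {}
    num = den = 0.0
    for s, g, r, a in product((False, True), repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(s, g, r, a)
        den += p
        if all(world[k] == v for k, v in query.items()):
            num += p
    return num / den

print(probability({"A": True}, {"S": True}))    # prediction
print(probability({"A": True}, {"S": False}))
print(probability({"A": True}))
print(probability({"S": True}, {"A": True}))    # inference
print(probability({"G": False}, {"A": False}))
print(probability({"R": False}, {"A": False}))
```

Enumeration scales as 2^N, so it only works for toy networks; larger networks need the factorization to be exploited by an inference algorithm.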

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: node “P1-P2 REAL” (cause, often “hidden”) with arrows to nodes “Detected by X1” … “Detected by Xn” (effects, observed).]

101

Properties of real interactions: correlated expression
Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1.5: 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. “Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations.” Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Page 57: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

Finding real interactions

bull Estimate probability of each interaction based on data ndash How can we compute ( )_ |P real PPI Data

( _ _ _ _ )Data two hybrid mass spec co evolution co expression=

( )Data mass_spec_expt_1mass_spec_expt_2=

or

57

Bayes Rule

posterior prior likelihood

58

Naiumlve Bayes Classification

posterior prior likelihood

likelihood ratio = ratio of posterior probabilities

if gt 1 classify as true if lt 1 classify as false

How do we compute this

59

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

60

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

Ranking function = ( | _ )log( | _ )

P Data true PPIP Data false PPI

61

We assume the observations are independent (wersquoll see how to handle dependence soon)

Ranking function =

( | _ )log( | _ )

P Data true PPIP Data false PPI

62

We assume the observations are independent (wersquoll see how to handle dependence soon)

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

63

We assume the observations are independent (wersquoll see how to handle dependence soon) We can compute these terms if we have a set of high-confidence positive and negative interactions Exactly how we compute the terms depends on the type of data For affinity purificationmass spec see Collins et al Mol Cell Proteomics 2007 httpwwwmcponlineorgcontent63439long

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

64

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

TP(T

P+F

N)

ROC curve

True positives = True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives FP(TN+FP)

e de

nc )es

onfi ex

pl-c omgh c

ng h

i omfri

us (es

pute

vtiipo

sC

om

ROC curve

)FN+

(TP

TPROC curve

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD) True negatives = proteins pairs that 1 are annotated to belong to distinct complexes 2 have different sub-cellular locations OR anticorrelated mRNA expression

68 67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al., Mol. Cell. Proteomics 2007

[Figure: ROC curve with the same axes: FP/(TN+FP), computed using high-confidence negatives, and TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: 9074 interactions; literature curated (excluding 2-hybrid); error rate equivalent to literature curated.]

Collins et al., Mol. Cell. Proteomics 2007

[Figure: ROC curve with the same axes. Annotation: 9074 interactions.]

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  => 2^N − 1 parameters
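To make the growth concrete, here is a small sketch comparing the size of a full joint table with a factored model. The factored count assumes one binary hidden cause with conditionally independent binary observations, which is an illustrative assumption rather than something stated on this slide; both function names are hypothetical.

```python
def full_joint_parameters(n_variables):
    """Free parameters in a full joint table over n binary variables: 2^N - 1."""
    return 2 ** n_variables - 1


def factored_parameters(n_observations):
    """Free parameters for one binary cause with n conditionally independent binary effects.

    One parameter for the prior on the cause, plus two per effect
    (one for each state of the cause).
    """
    return 1 + 2 * n_observations


for n_obs in (4, 10, 20):
    # The full table covers the cause plus all observations.
    print(n_obs, full_joint_parameters(n_obs + 1), factored_parameters(n_obs))
```

The full table grows exponentially with the number of variables, while the factored count grows linearly.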

Graphical Structure Expresses our Beliefs

[Diagram: node "P1-P2 REAL" (cause, often "hidden") with arrows to "Detected by X1" … "Detected by Xn" (effects, observed).]

Graphical Structure Expresses our Beliefs

[Diagram, first network: "P1-P2 REAL" with "Detected by X1" … "Detected by Xn". Second network: "P1-P2 REAL" with "Detected by X1" … "Detected by Xn" and "Detected by Xj", "Detected by Xj+1" …, plus the nodes "Membrane" and "Highly expressed".]

Naïve Bayes assumes all observations are independent

But some observations may be coupled

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Diagram: Bayesian network with nodes Smart, Grade inflation, Grades, GREs, and Admit.]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

Automated Admissions Decisions

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints.
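The equivalence of the two formulations can be checked directly: multiplying the prior P(S) by the conditional P(G|S) should reproduce the joint table. The sketch below uses the numbers from the two tables; the variable names are illustrative.

```python
# Prior P(S) and conditional P(G | S), with False/True standing for F/T.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},  # P(G | S=F)
    True: {False: 0.2, True: 0.8},     # P(G | S=T)
}

# Rebuild the joint table: P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

for (s, g), p in sorted(joint.items()):
    print(s, g, round(p, 3))
```

Both forms have three free parameters: the joint table has four entries that must sum to 1, and the conditional form needs P(S=T), P(G=T|S=F), and P(G=T|S=T).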

Automated Admissions Decisions

[Diagram: Bayesian network with nodes Smart, Grade inflation, Grades, GREs, and Admit.]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

"Explaining Away"

[Diagram: network with nodes Season, Sprinkler (S), Rain (R), Grass wet, and Slippery.]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

"Explaining Away"

[Diagram: Coin 1 and Coin 2 are both parents of Score.]

p(C1=H) = p(C1=T) = 0.5   p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
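The two-coin case is small enough to enumerate. The sketch below lists the four equally likely outcomes and computes P(C1=H) under different conditions by counting; the helper name `prob_c1_heads` is hypothetical.

```python
from itertools import product


def score(c1, c2):
    """Score is 1 when the two coins match, 0 otherwise."""
    return 1 if c1 == c2 else 0


# Two fair, independent coins: each of the four outcomes has probability 0.25,
# so conditional probabilities reduce to counting outcomes.
outcomes = [(c1, c2, score(c1, c2)) for c1, c2 in product("HT", repeat=2)]


def prob_c1_heads(condition):
    """P(C1 = H | condition), computed over the equally likely outcomes."""
    kept = [o for o in outcomes if condition(o)]
    return sum(1 for o in kept if o[0] == "H") / len(kept)


p_prior = prob_c1_heads(lambda o: True)
p_given_c2 = prob_c1_heads(lambda o: o[1] == "T")
p_given_score = prob_c1_heads(lambda o: o[2] == 1)
p_given_score_and_c2 = prob_c1_heads(lambda o: o[2] == 1 and o[1] == "T")
```

Without the score, learning C2 leaves P(C1=H) at 0.5. Once the score is known to be 1, learning that C2 = T drives P(C1=H) to 0: the coins are independent a priori but become dependent given their common effect.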

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

Learning Models from Data

[Diagram: Coin 1 and Coin 2 are both parents of Score.]

p(C1=H) = ?   p(C1=T) = ?   p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …
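When the structure is known and every variable is observed, maximum-likelihood estimates for discrete tables reduce to counting. The sketch below fits the two-coin network this way; the function name `fit_coin_network` and the four-record dataset are hypothetical (only its first two records appear in D above).

```python
from collections import Counter


def fit_coin_network(data):
    """Maximum-likelihood estimates for the two-coin network, obtained by counting.

    data: list of (c1, c2, score) tuples with c1, c2 in {"H", "T"} and score in {0, 1}.
    Returns P(C1=H), P(C2=H), and P(Score=1 | C1, C2) for each observed (C1, C2) pair.
    """
    n = len(data)
    p_c1_h = sum(c1 == "H" for c1, _, _ in data) / n
    p_c2_h = sum(c2 == "H" for _, c2, _ in data) / n
    pair_counts = Counter((c1, c2) for c1, c2, _ in data)
    score_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
    # Fraction of records with Score = 1 among records sharing the same (C1, C2).
    p_score1 = {pair: score_counts[pair] / count for pair, count in pair_counts.items()}
    return p_c1_h, p_c2_h, p_score1


# Hypothetical dataset of fully observed records.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0)]
p_c1_h, p_c2_h, p_score1 = fit_coin_network(D)
```

With hidden variables or missing records, counting no longer suffices and an iterative method such as EM would be needed instead.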

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked "toy" example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

[Diagram: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit).]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)   (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

Prediction with Bayes Nets

[Diagram: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit).]

S:
P(F)   P(T)
0.5    0.5

G:
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R:
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)
           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
(the four terms are for (G,R) = (F,F), (F,T), (T,F), (T,T), respectively)

Rough grading: being smart doesn't help very much

GREs are better correlated with intelligence than the grades
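The double sum above can be written as a short loop over the values of G and R. This sketch uses the conditional probability tables from this slide, with False/True standing for F/T; the names `P_G`, `P_R`, `P_A_TRUE`, and `p_admit_given_smart` are illustrative.

```python
from itertools import product

# Conditional probability tables, indexed as table[parent_value][child_value].
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P(G | S)
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P(R | S)
# P(A=T | G, R), keyed by (G, R).
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}


def p_admit_given_smart(s):
    """P(A=T | S=s): sum over G and R of P(G|S) P(R|S) P(A=T|G,R)."""
    return sum(P_G[s][g] * P_R[s][r] * P_A_TRUE[(g, r)]
               for g, r in product((False, True), repeat=2))


print(round(p_admit_given_smart(True), 3))  # 0.292
```

Calling `p_admit_given_smart(False)` evaluates the analogous sum for a student who is not smart.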

Inference with Bayes Nets

[Diagram: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit).]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)
       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = (0.5)(0.292) + (0.5)(0.164) ≈ 0.23

P(S=T) = 0.5   P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on the previous slide)
P(A=T|S=F) = (0.9)(0.8)(0.1) + (0.9)(0.2)(0.2) + (0.1)(0.8)(0.5) + (0.1)(0.2)(0.8) = 0.16 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes Rule: = P(S=T) P(A=T|S=T) / P(A=T)

Inference with Bayes Nets

[Diagram: S (Smart) is the parent of G (Grades) and R (GREs); G and R are the parents of A (Admit).]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F) = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R)   (as before)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) ≈ 0.94 / 0.55 ≈ 1.7
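Because the network has only four binary variables, these inference queries can be answered by brute-force enumeration of all 16 joint states. The sketch below uses the tables from the worked example, with False/True standing for F/T; the helper names `joint` and `prob` are hypothetical, and this is an illustration rather than an efficient inference algorithm.

```python
from itertools import product

TF = (False, True)
P_S = {False: 0.5, True: 0.5}                                          # P(S)
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P(G | S)
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P(R | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}                     # P(A=T | G, R)


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a_true = P_A_TRUE[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * (p_a_true if a else 1 - p_a_true)


def prob(**fixed):
    """Probability of the fixed assignments, summing the joint over all other variables."""
    total = 0.0
    for s, g, r, a in product(TF, repeat=4):
        world = {"s": s, "g": g, "r": r, "a": a}
        if all(world[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total


p_smart_given_admit = prob(s=True, a=True) / prob(a=True)
p_bad_grades_given_reject = prob(g=False, a=False) / prob(a=False)
p_bad_gres_given_reject = prob(r=False, a=False) / prob(a=False)
```

With the tables as transcribed here, enumeration gives P(A=T) ≈ 0.228, P(S=T|A=T) ≈ 0.64, P(G=F|A=F) ≈ 0.94, and P(R=F|A=F) ≈ 0.55. Bad grades are common whether or not a student is admitted, so they say less about rejection than the raw probabilities suggest.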

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: node "P1-P2 REAL" (cause, often "hidden") with arrows to "Detected by X1" … "Detected by Xn" (effects, observed).]

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

[Figures: Jansen et al., Science 302, no. 5644 (2003): 449-53.]

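likelihood ratio =
$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$
if > 1 classify as true; if < 1 classify as false

log likelihood ratio =
$$\log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function =
$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$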
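Under the independence assumption, the log of the product becomes a sum of per-observation log likelihood ratios, which is straightforward to compute. The sketch below is illustrative only: the function name `log_likelihood_ratio`, the assay names, and all conditional probabilities are hypothetical placeholders, not values from any dataset.

```python
import math


def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum of per-observation log likelihood ratios (observations assumed independent).

    observations: dict mapping evidence name -> observed value.
    p_given_true[name][value]  = P(Observation = value | true_PPI)
    p_given_false[name][value] = P(Observation = value | false_PPI)
    Evidence sources absent from `observations` contribute nothing to the sum.
    """
    return sum(
        math.log(p_given_true[name][value] / p_given_false[name][value])
        for name, value in observations.items()
    )


# Hypothetical conditional probabilities for two binary assays.
p_true = {"assay_1": {True: 0.4, False: 0.6}, "assay_2": {True: 0.2, False: 0.8}}
p_false = {"assay_1": {True: 0.01, False: 0.99}, "assay_2": {True: 0.02, False: 0.98}}

score_both = log_likelihood_ratio({"assay_1": True, "assay_2": True}, p_true, p_false)
score_one = log_likelihood_ratio({"assay_1": True, "assay_2": False}, p_true, p_false)
score_missing = log_likelihood_ratio({"assay_1": True}, p_true, p_false)
```

Pairs are then ranked by this score. A pair that was never tested in some assay simply omits that term, which is one way missing data can be accommodated.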

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1{,}114 / 2{,}150}{81{,}924 / 573{,}734}$$
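The likelihood L = P(f|pos) / P(f|neg) for a discrete feature value can be estimated from gold-standard counts. The sketch below assumes the digit runs on the slide read as 1,114 of 2,150 gold-standard positives and 81,924 of 573,734 gold-standard negatives sharing the feature value; that reading, and the function name `likelihood_ratio`, are assumptions.

```python
def likelihood_ratio(n_pos_with_f, n_pos_total, n_neg_with_f, n_neg_total):
    """L = P(f | pos) / P(f | neg), with each probability estimated as a count fraction."""
    p_f_given_pos = n_pos_with_f / n_pos_total
    p_f_given_neg = n_neg_with_f / n_neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (assumed split of the digits).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 1))  # 3.6
```

A value above 1 means the feature value is enriched among gold-standard positives, so observing it raises the odds that a pair interacts.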

Fully connected → Compute probabilities for all 16 possible combinations
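For a fully connected model, no independence is assumed, so a likelihood ratio is estimated separately for each joint pattern of evidence. The sketch below assumes four binary assays (hence 2^4 = 16 patterns); the function name `combination_likelihoods` and the toy gold-standard vectors are hypothetical.

```python
from collections import Counter
from itertools import product


def combination_likelihoods(pos_vectors, neg_vectors, n_assays=4):
    """Likelihood ratio for every joint pattern of n binary assays.

    pos_vectors / neg_vectors: one tuple of booleans per gold-standard pair,
    recording which assays detected it.
    Patterns with no positive or no negative examples map to None, because
    the ratio cannot be estimated from zero counts.
    """
    pos_counts = Counter(pos_vectors)
    neg_counts = Counter(neg_vectors)
    ratios = {}
    for pattern in product((False, True), repeat=n_assays):
        if pos_counts[pattern] == 0 or neg_counts[pattern] == 0:
            ratios[pattern] = None
        else:
            p_pos = pos_counts[pattern] / len(pos_vectors)
            p_neg = neg_counts[pattern] / len(neg_vectors)
            ratios[pattern] = p_pos / p_neg
    return ratios


# Hypothetical toy gold standard.
toy_pos = [(True, True, False, False)] * 3 + [(False, False, False, False)]
toy_neg = [(False, False, False, False)] * 9 + [(True, True, False, False)]
ratios = combination_likelihoods(toy_pos, toy_neg)
```

Many patterns end up with few or no examples, which is why estimates from sparsely populated combinations should be read cautiously.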

Interpret with caution as numbers are small

[Figure: Jansen et al., Science 302, no. 5644 (2003): 449-53. The line TP = FP is marked.]

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Bayes Rule

[Equation: Bayes' rule, with terms labeled posterior, prior, and likelihood.]

Naïve Bayes Classification

[Equation: Bayes' rule, with terms labeled posterior, prior, and likelihood.]

likelihood ratio = ratio of posterior probabilities
$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$
if > 1 classify as true; if < 1 classify as false

How do we compute this?

log likelihood ratio =
$$\log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking

We assume the observations are independent (we'll see how to handle dependence soon)

Ranking function =
$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec see Collins et al., Mol. Cell. Proteomics 2007: http://www.mcponline.org/content/6/3/439.long


109

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

110

Protein pairs in the essentiality data can take on three discrete values (EE both essential NN both non-essential and NE one essential and one not)

11142150

81924573734

Likelihood=L= )|()|(

negfPposfP

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected rarr Compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TF=FP

prediction based on single data type all have TPFPlt1

How many gold-standard events do we score correctly at different likelihood cutoffs

( | _ )log( | _ )

P Data true PPIP Data false PPI

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

117

MIT OpenCourseWarehttpocwmitedu

791J 20490J 20390J 736J 6802J 6874J HST506J Foundations of Computational and Systems BiologySpring 2014

For information about citing these materials or our Terms of Use visit httpocwmiteduterms

  • Slide Number 1
  • Predictions
  • Prediction Challenges
  • Slide Number 4
  • Slide Number 5
  • Predicting Structures of Complexes
  • Docking
  • Docking
  • Reducing the search space
  • Next
  • PRISMrsquos Rationale
  • Slide Number 12
  • Subtilisin and its inhibitors
  • Hotspots
  • Hotspots
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Slide Number 19
  • Slide Number 20
  • Slide Number 21
  • Flowchart
  • Slide Number 23
  • Next
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Slide Number 30
  • Slide Number 31
  • Slide Number 32
  • Slide Number 33
  • Slide Number 34
  • Slide Number 35
  • Slide Number 36
  • Slide Number 37
  • Slide Number 38
  • Outline
  • Detecting protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Tagging strategies
  • Yeast two-hybrid
  • Slide Number 45
  • Outline
  • Error Rates
  • Error Rates
  • Data Integration
  • Data Integration
  • Data Integration
  • Data Integration
  • Estimated Error Rates
  • Slide Number 54
  • Finding real interactions
  • Slide Number 56
  • Finding real interactions
  • Bayes Rule
  • Naiumlve Bayes Classification
  • Slide Number 61
  • Slide Number 62
  • Slide Number 63
  • Slide Number 64
  • Slide Number 65
  • Slide Number 66
  • ROC curve
  • ROC curve
  • ROC curve
  • ROC curve
  • Outline
  • Bayesian Networks
  • In Biology
  • Slide Number 74
  • Slide Number 75
  • Slide Number 76
  • Slide Number 77
  • Bayesian Networks
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Slide Number 83
  • Slide Number 84
  • Slide Number 85
  • Slide Number 86
  • Slide Number 87
  • ldquoExplaining Awayrdquo
  • Slide Number 89
  • How do we obtain a BN
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Resources to learn more
  • A worked ldquotoyrdquo example
  • Chain Rule of Probability for Bayes Nets
  • Prediction with Bayes Nets
  • Inference with Bayes Nets
  • Inference with Bayes Nets
  • End of worked example
  • Goal
  • Slide Number 103
  • Co-evolution
  • Rosetta Stone
  • Integrating diverse data
  • Advantage of Bayesian Networks
  • Requirement of Bayesian Classification
  • Integrating diverse data
  • Slide Number 110
  • Slide Number 111
  • Slide Number 112
  • Slide Number 113
  • Slide Number 114
  • Slide Number 115
  • Slide Number 116
  • Slide Number 117
  • Summary

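Naïve Bayes Classification

Bayes' rule: posterior = prior × likelihood / P(Data)

$$P(\text{true\_PPI} \mid \text{Data}) = \frac{P(\text{true\_PPI})\,P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data})}$$

Ratio of posterior probabilities = prior ratio × likelihood ratio

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} \cdot \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

If the ratio is > 1, classify as true; if < 1, classify as false.

How do we compute this?

59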

Taking logs of the ratio of posterior probabilities:

$$\log\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

The second term is the log likelihood ratio. The prior probability is the same for all interactions, so the first term does not affect ranking.

Ranking function:

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

61

We assume the observations are independent (we'll see how to handle dependence soon):

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i=1}^{M}\log\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

64
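The ranking function can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the assay names and the detection rates in `ASSAYS` are made up (in practice they would be estimated from high-confidence positive and negative sets), and each observation is treated as a simple detected / not-detected outcome.

```python
import math

# Hypothetical per-assay rates: P(detected | true_PPI) and P(detected | false_PPI).
# These numbers are invented for illustration, not taken from the lecture.
ASSAYS = {
    "two_hybrid":       {"p_true": 0.20, "p_false": 0.01},
    "affinity_capture": {"p_true": 0.40, "p_false": 0.05},
    "coexpression":     {"p_true": 0.60, "p_false": 0.30},
}

def log_likelihood_ratio(observations):
    """Sum of per-observation log likelihood ratios (independence assumed)."""
    total = 0.0
    for assay, detected in observations.items():
        p_true = ASSAYS[assay]["p_true"]
        p_false = ASSAYS[assay]["p_false"]
        if detected:
            total += math.log(p_true / p_false)
        else:
            total += math.log((1 - p_true) / (1 - p_false))
    return total

# Hypothetical protein pairs; an assay missing from a dict simply adds no term.
pairs = {
    "A-B": {"two_hybrid": True, "affinity_capture": True, "coexpression": True},
    "C-D": {"two_hybrid": False, "affinity_capture": True, "coexpression": False},
    "E-F": {"two_hybrid": False, "affinity_capture": False, "coexpression": False},
}
ranked = sorted(pairs, key=lambda name: log_likelihood_ratio(pairs[name]), reverse=True)
print(ranked)
```

Because the prior term is dropped, the scores are only meaningful for ranking pairs against each other, not as posterior probabilities.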

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank by the ranking function (the log likelihood ratio).

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

65

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces Cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
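As a minimal sketch of how the two ROC axes are computed, the snippet below sweeps a score cutoff over a scored list of pairs. The scores and gold-standard labels are hypothetical; `roc_points` is an illustrative helper, not code from Collins et al.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per cutoff, from the strictest cutoff down.

    labels: True for gold-standard positives, False for gold-standard negatives.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and not y)
        # x = FP/(TN+FP), y = TP/(TP+FN)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical scores (e.g. log likelihood ratios) and gold-standard labels
scores = [5.8, 3.1, 1.3, 0.2, -1.2, -2.0]
labels = [True, True, False, True, False, False]
curve = roc_points(scores, labels)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each cutoff yields one point; lowering the cutoff can only move the point up or to the right.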

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]

68

ROC curve

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotation: "9074 interactions".]

69

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1) => $2^N - 1$ parameters
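To get a feel for how quickly the full joint table grows, the sketch below compares its parameter count with that of a naïve Bayes model over the same variables. The naïve Bayes count (one prior plus two conditional probabilities per binary observation) is added here for comparison and is an assumption of this illustration, not something stated on the slide.

```python
def full_joint_parameters(n_binary_vars):
    """Free parameters in a complete joint table over n binary variables."""
    return 2 ** n_binary_vars - 1

def naive_bayes_parameters(n_observations):
    """One prior for a binary class, plus P(obs=1 | class) for each class value."""
    return 1 + 2 * n_observations

for n_obs in (4, 10, 20):
    # The full joint covers the class variable plus n_obs observations.
    print(n_obs, full_joint_parameters(n_obs + 1), naive_bayes_parameters(n_obs))
```

With 20 observations the full joint needs about two million parameters, while the factored model needs 41.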

77

Graphical Structure Expresses our Beliefs

[Diagram: node "P1-P2 REAL" with arrows to "Detected by X1", …, "Detected by Xn".]

Cause (often “hidden”): P1-P2 REAL

Effects (observed): Detected by X1, …, Detected by Xn

78

Graphical Structure Expresses our Beliefs

[Diagram 1: node "P1-P2 REAL" with arrows to "Detected by X1", …, "Detected by Xn".]

[Diagram 2: node "P1-P2 REAL" with "Detected by X1", …, "Detected by Xn", …, "Detected by Xj", "Detected by Xj+1", …, plus additional nodes "Membrane" and "Highly expressed".]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution.

A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82
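The independence statement can be checked numerically on a toy chain A1 → B2 → Gene. The probability tables below are hypothetical values chosen for illustration; the point is only that, once B2 is fixed, changing A1 leaves the probability of the gene being active unchanged.

```python
from itertools import product

# Hypothetical conditional probability tables for the chain A1 -> B2 -> Gene
P_A1 = {0: 0.7, 1: 0.3}
P_B2_GIVEN_A1 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}    # [a1][b2]
P_GENE_GIVEN_B2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}  # [b2][gene]

# Joint distribution built from the factorization implied by the graph
joint = {
    (a, b, g): P_A1[a] * P_B2_GIVEN_A1[a][b] * P_GENE_GIVEN_B2[b][g]
    for a, b, g in product((0, 1), repeat=3)
}

def p_gene_on_given(a1, b2):
    """P(Gene=1 | A1=a1, B2=b2), computed from the joint."""
    return joint[(a1, b2, 1)] / (joint[(a1, b2, 0)] + joint[(a1, b2, 1)])

def p_gene_on_given_a1_only(a1):
    """P(Gene=1 | A1=a1), with B2 summed out."""
    on = sum(joint[(a1, b, 1)] for b in (0, 1))
    total = sum(joint[(a1, b, g)] for b in (0, 1) for g in (0, 1))
    return on / total

print(p_gene_on_given(0, 1), p_gene_on_given(1, 1))            # equal once B2 is known
print(p_gene_on_given_a1_only(0), p_gene_on_given_a1_only(1))  # differ when B2 is unknown
```

When B2 is not observed, A1 still carries information about the gene, which is why the two unconditional values differ.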

Automated Admissions Decisions

[Diagram: nodes "Grade inflation", "Smart", "Grades", "GREs", "Admit".]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram: nodes "Grade inflation", "Smart", "Grades", "GREs", "Admit".]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs).

We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

The formulations are equivalent, and both require the same number of constraints.

85
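The equivalence of the two formulations can be checked directly: multiplying the prior by the conditional table reproduces the joint table, and marginalizing the joint recovers the prior and the conditionals. The sketch below uses only the numbers from the two tables; the variable names are illustrative.

```python
# Values from the conditional formulation: P(S) and P(G | S)
p_s = {"F": 0.7, "T": 0.3}
p_g_given_s = {"F": {"F": 0.95, "T": 0.05}, "T": {"F": 0.2, "T": 0.8}}

# Conditional -> joint: P(S, G) = P(S) * P(G | S)
joint = {(s, g): p_s[s] * p_g_given_s[s][g] for s in "FT" for g in "FT"}

# Joint -> conditional: marginalize G to get P(S), then divide
recovered_p_s = {s: joint[(s, "F")] + joint[(s, "T")] for s in "FT"}
recovered_cond = {
    s: {g: joint[(s, g)] / recovered_p_s[s] for g in "FT"} for s in "FT"
}

for key, value in joint.items():
    print(key, round(value, 3))
```

Both forms have three free numbers: the joint has four entries that must sum to 1, and the conditional form has one prior plus one probability per row.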

Automated Admissions Decisions

[Diagram: nodes "Grade inflation", "Smart", "Grades", "GREs", "Admit".]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”.

For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

“Explaining Away”

[Diagram: nodes "Season", "Sprinkler" (S), "Rain" (R), "Grass wet", "Slippery".]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, once we know the grass is wet, the knowledge that it is raining influences our beliefs about whether the sprinkler is on.

87

“Explaining Away”

[Diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.

88
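The two-coin example is small enough to enumerate. The sketch below builds the joint distribution from the table (two fair coins, Score = 1 when they match) and compares the probability that C1 = H under different observations; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

QUARTER = Fraction(1, 4)  # two independent fair coins
# Joint over (C1, C2, Score); Score is 1 when the coins match, else 0
joint = {(c1, c2, int(c1 == c2)): QUARTER for c1, c2 in product("HT", repeat=2)}

def prob(event, given=lambda c1, c2, score: True):
    """P(event | given), by summing over the consistent outcomes."""
    consistent = {k: p for k, p in joint.items() if given(*k)}
    return sum(p for k, p in consistent.items() if event(*k)) / sum(consistent.values())

def c1_is_heads(c1, c2, score):
    return c1 == "H"

p_prior = prob(c1_is_heads)
p_given_c2 = prob(c1_is_heads, lambda c1, c2, score: c2 == "T")
p_given_score = prob(c1_is_heads, lambda c1, c2, score: score == 1)
p_given_both = prob(c1_is_heads, lambda c1, c2, score: score == 1 and c2 == "T")
print(p_prior, p_given_c2, p_given_score, p_given_both)
```

Knowing C2 alone, or the score alone, leaves the belief in C1 = H at 1/2. Only when the score is known does learning C2 change it, here all the way to 0.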

How do we obtain a BN

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data $X_i$
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Diagram: Coin 1 → Score ← Coin 2]

Parameters to be learned from the data:

p(C1=H) = ?    p(C1=T) = ?
p(C2=H) = ?    p(C2=T) = ?

C1    C2    Score
H     H     ?
T     T     ?
H     T     ?
T     H     ?

D = {(C1, C2, S)} = (H,T,0), (H,H,1), …
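For a network with known structure and fully observed discrete data, the maximum-likelihood estimate of each table entry is just a relative frequency. The sketch below illustrates this by counting; only the first two tuples of `D` come from the slide, and the remaining rows are hypothetical.

```python
from collections import Counter

# Training data D = [(C1, C2, Score), ...]. Only the first two tuples appear
# in the lecture; the rest are made up for illustration.
D = [
    ("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0),
    ("H", "H", 1), ("T", "H", 0), ("H", "T", 0), ("H", "H", 1),
]

def mle_coin(data, index):
    """Maximum-likelihood estimate of P(coin = H) for the coin in column `index`."""
    counts = Counter(row[index] for row in data)
    return counts["H"] / len(data)

def mle_score_table(data):
    """Maximum-likelihood estimate of P(Score = 1 | C1, C2) for each observed pair."""
    totals = Counter((c1, c2) for c1, c2, _ in data)
    ones = Counter((c1, c2) for c1, c2, s in data if s == 1)
    return {pair: ones[pair] / n for pair, n in totals.items()}

print(mle_coin(D, 0), mle_coin(D, 1))
print(mle_score_table(D))
```

Combinations of C1 and C2 that never occur in the data get no estimate at all, which is one reason a prior (the maximum-posterior objective) can help with small data sets.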

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

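[Diagram: nodes S (Smart), G (Grades), R (GREs), A (Admit).]

$$P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R)$$

$$= P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R)$$ (why?)

(because of the conditional independence assumptions)

Recall: $P(X,Y) = P(X \mid Y)\,P(Y)$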

96

Prediction with Bayes Nets

[Diagram: nodes S (Smart), G (Grades), R (GREs), A (Admit).]

P(S):
P(S=F)    P(S=T)
0.5       0.5

P(G|S):
S    P(G=F)    P(G=T)
F    0.9       0.1
T    0.8       0.2

P(R|S):
S    P(R=F)    P(R=T)
F    0.8       0.2
T    0.2       0.8

P(A|G,R):
G    R    P(A=F)    P(A=T)
F    F    0.9       0.1
F    T    0.8       0.2
T    F    0.5       0.5
T    T    0.2       0.8

$$P(A=T \mid S=T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S=T)\,P(R \mid S=T)\,P(A=T \mid G,R)$$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29

(terms in order: G=F,R=F; G=F,R=T; G=T,R=F; G=T,R=T)

Rough grading: being smart doesn’t help very much.

GREs are better correlated with intelligence than the grades.

97
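The prediction sum can be reproduced with a short enumeration over G and R. The sketch below uses the tables as transcribed on this slide; the dictionary and function names are illustrative. For S=T it gives 0.292, matching the 0.29 above. For S=F the same sum over these tables gives 0.164, somewhat higher than the 0.14 quoted on the following slide, so the quoted values are best read as approximate.

```python
from itertools import product

# Tables as transcribed from the slide
P_G_GIVEN_S = {"F": {"F": 0.9, "T": 0.1}, "T": {"F": 0.8, "T": 0.2}}  # [s][g]
P_R_GIVEN_S = {"F": {"F": 0.8, "T": 0.2}, "T": {"F": 0.2, "T": 0.8}}  # [s][r]
P_ADMIT = {("F", "F"): 0.1, ("F", "T"): 0.2,
           ("T", "F"): 0.5, ("T", "T"): 0.8}                          # P(A=T | g, r)

def p_admit_given_smart(s):
    """P(A=T | S=s): sum over the unobserved grades G and GREs R."""
    return sum(
        P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * P_ADMIT[(g, r)]
        for g, r in product("FT", repeat=2)
    )

print(round(p_admit_given_smart("T"), 3))
print(round(p_admit_given_smart("F"), 3))
```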

Inference with Bayes Nets

[Diagram: nodes S (Smart), G (Grades), R (GREs), A (Admit).]

$$P(A=T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=T \mid G,R)$$

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on the previous slide)
P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)

Or, using Bayes' rule: P(S=T|A=T) = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Diagram: nodes S (Smart), G (Grades), R (GREs), A (Admit).]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

$$P(R=F \mid A=F) = \frac{P(R=F, A=F)}{P(A=F)} = \frac{\sum_{S} \sum_{G} P(S)\,P(G \mid S)\,P(R=F \mid S)\,P(A=F \mid G, R=F)}{P(A=F)}$$

$$P(A=F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=F \mid G,R) = 1 - P(A=T)$$ (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6

99
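The "tedious but straightforward" sums are easy to hand to a computer. The sketch below enumerates all 16 states of (S, G, R, A) using the tables from the prediction slide and the factorization P(S)P(G|S)P(R|S)P(A|G,R); the helper names are illustrative. With the tables as transcribed, the results come out close to, but not identical with, the rounded values quoted on the slides (for example P(A=T) is about 0.23 rather than 0.21), so the comparison between bad grades and bad GREs is the part to focus on.

```python
from itertools import product

P_S = {"F": 0.5, "T": 0.5}
P_G_GIVEN_S = {"F": {"F": 0.9, "T": 0.1}, "T": {"F": 0.8, "T": 0.2}}  # [s][g]
P_R_GIVEN_S = {"F": {"F": 0.8, "T": 0.2}, "T": {"F": 0.2, "T": 0.8}}  # [s][r]
P_ADMIT = {("F", "F"): 0.1, ("F", "T"): 0.2,
           ("T", "F"): 0.5, ("T", "T"): 0.8}                          # P(A=T | g, r)

def joint(s, g, r, a):
    """P(S=s, G=g, R=r, A=a) from the Bayes net factorization."""
    p_a = P_ADMIT[(g, r)] if a == "T" else 1 - P_ADMIT[(g, r)]
    return P_S[s] * P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * p_a

def prob(**fixed):
    """Marginal probability of the given assignments, e.g. prob(a="T")."""
    total = 0.0
    for s, g, r, a in product("FT", repeat=4):
        values = {"s": s, "g": g, "r": r, "a": a}
        if all(values[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total

p_admit = prob(a="T")
p_smart_given_admit = prob(s="T", a="T") / p_admit
p_bad_grades_given_reject = prob(g="F", a="F") / prob(a="F")
p_bad_gres_given_reject = prob(r="F", a="F") / prob(a="F")

print(round(p_admit, 3), round(p_smart_given_admit, 3))
print(round(p_bad_grades_given_reject, 3), round(p_bad_gres_given_reject, 3))
```

Either way, a rejected student is more likely to have had bad grades than bad GREs, largely because bad grades are common in this model whether or not the student is smart.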

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: node "P1-P2 REAL" with arrows to "Detected by X1", …, "Detected by Xn".]

Cause (often “hidden”)

Effects (observed)

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.
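As a rough illustration of an expression-profile distance, the sketch below uses plain Euclidean distance between two profiles. This is an assumption made for the example; the exact definition of d used by Deane et al. is not given here, and the profiles are hypothetical.

```python
import math

def profile_distance(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical log-ratio expression profiles over four conditions
expr = {
    "P1": [1.0, 2.0, -1.0, 0.5],
    "P2": [1.1, 1.8, -0.9, 0.4],
    "P3": [-2.0, 0.1, 1.5, -1.0],
}

d_candidate = profile_distance(expr["P1"], expr["P2"])  # similar profiles
d_unrelated = profile_distance(expr["P1"], expr["P3"])  # dissimilar profiles
print(round(d_candidate, 3), round(d_unrelated, 3))
```

Comparing the distribution of such distances for a candidate interaction set with the distributions for trusted interactions and for random pairs is the idea behind using expression to judge data-set quality.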

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: candidate patterns; one is labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103
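One simple way to compare phylogenetic profiles is to count the genomes in which two genes are both present or both absent. The sketch below is a deliberately basic stand-in, not the improved method of Cokus et al., and the presence/absence vectors are hypothetical.

```python
def profile_similarity(a, b):
    """Fraction of genomes in which two genes are both present or both absent."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical presence (1) / absence (0) of each gene across eight genomes
profiles = {
    "geneA": [1, 1, 0, 0, 1, 0, 1, 1],
    "geneB": [1, 1, 0, 0, 1, 0, 1, 0],
    "geneC": [0, 1, 1, 0, 0, 1, 0, 1],
}

sim_ab = profile_similarity(profiles["geneA"], profiles["geneB"])
sim_ac = profile_similarity(profiles["geneA"], profiles["geneC"])
print(sim_ab, sim_ac)
```

Genes that are gained and lost together across genomes score high, which is taken as a hint that their products may be functionally linked.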

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data


108


109

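likelihood ratio = $\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$; if > 1, classify as true; if < 1, classify as false.

The prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$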

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$$

P(f|pos) = 1114/2150

P(f|neg) = 81924/573734
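The likelihood ratio for one feature value is a ratio of two relative frequencies. The sketch below plugs in the two fractions shown on the slide, 1114/2150 and 81924/573734; the function and argument names are illustrative.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg

# Counts taken from the two fractions on the slide
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))
```

A value of about 3.6 means this feature value is observed roughly 3.6 times more often among gold-standard positives than among gold-standard negatives.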

111

112


113

Fully connected → compute probabilities for all 16 possible combinations
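In a fully connected model there is no independence assumption, so a likelihood ratio is estimated separately for every joint pattern of detections. With four binary data sets that is 2^4 = 16 patterns. The sketch below shows the bookkeeping; the data-set names follow the four sets named earlier in the lecture, while the gold-standard observations are hypothetical.

```python
from collections import Counter
from itertools import product

DATASETS = ("Y2H_Uetz", "Y2H_Ito", "IP_Gavin", "IP_Ho")

def combination_likelihood_ratios(positives, negatives):
    """Likelihood ratio for each of the 2^4 detection patterns.

    positives / negatives: lists of 0/1 tuples, one per gold-standard pair.
    Returns None where no negative shows the pattern (ratio undefined).
    """
    pos_counts = Counter(positives)
    neg_counts = Counter(negatives)
    ratios = {}
    for combo in product((0, 1), repeat=len(DATASETS)):
        p_pos = pos_counts[combo] / len(positives)
        p_neg = neg_counts[combo] / len(negatives)
        ratios[combo] = p_pos / p_neg if p_neg > 0 else None
    return ratios

# Hypothetical gold-standard observations
positives = [(1, 1, 1, 1)] * 3 + [(0, 0, 1, 1)] * 5 + [(0, 0, 0, 0)] * 2
negatives = [(0, 0, 0, 0)] * 90 + [(0, 0, 1, 1)] * 5 + [(1, 0, 0, 0)] * 5

ratios = combination_likelihood_ratios(positives, negatives)
print(len(ratios), ratios[(0, 0, 1, 1)], ratios[(1, 1, 1, 1)])
```

Many patterns are seen only a handful of times, or never, in one of the gold-standard sets, so the corresponding ratios rest on very small counts.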

114

Interpret with caution as numbers are small


115

[Figure: TP/FP at different likelihood cutoffs, with a reference line at TP = FP.]

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

  • Slide Number 1
  • Predictions
  • Prediction Challenges
  • Slide Number 4
  • Slide Number 5
  • Predicting Structures of Complexes
  • Docking
  • Docking
  • Reducing the search space
  • Next
  • PRISM’s Rationale
  • Slide Number 12
  • Subtilisin and its inhibitors
  • Hotspots
  • Hotspots
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Slide Number 19
  • Slide Number 20
  • Slide Number 21
  • Flowchart
  • Slide Number 23
  • Next
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Slide Number 30
  • Slide Number 31
  • Slide Number 32
  • Slide Number 33
  • Slide Number 34
  • Slide Number 35
  • Slide Number 36
  • Slide Number 37
  • Slide Number 38
  • Outline
  • Detecting protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Tagging strategies
  • Yeast two-hybrid
  • Slide Number 45
  • Outline
  • Error Rates
  • Error Rates
  • Data Integration
  • Data Integration
  • Data Integration
  • Data Integration
  • Estimated Error Rates
  • Slide Number 54
  • Finding real interactions
  • Slide Number 56
  • Finding real interactions
  • Bayes Rule
  • Naïve Bayes Classification
  • Slide Number 61
  • Slide Number 62
  • Slide Number 63
  • Slide Number 64
  • Slide Number 65
  • Slide Number 66
  • ROC curve
  • ROC curve
  • ROC curve
  • ROC curve
  • Outline
  • Bayesian Networks
  • In Biology
  • Slide Number 74
  • Slide Number 75
  • Slide Number 76
  • Slide Number 77
  • Bayesian Networks
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Slide Number 83
  • Slide Number 84
  • Slide Number 85
  • Slide Number 86
  • Slide Number 87
  • “Explaining Away”
  • Slide Number 89
  • How do we obtain a BN
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Resources to learn more
  • A worked “toy” example
  • Chain Rule of Probability for Bayes Nets
  • Prediction with Bayes Nets
  • Inference with Bayes Nets
  • Inference with Bayes Nets
  • End of worked example
  • Goal
  • Slide Number 103
  • Co-evolution
  • Rosetta Stone
  • Integrating diverse data
  • Advantage of Bayesian Networks
  • Requirement of Bayesian Classification
  • Integrating diverse data
  • Slide Number 110
  • Slide Number 111
  • Slide Number 112
  • Slide Number 113
  • Slide Number 114
  • Slide Number 115
  • Slide Number 116
  • Slide Number 117
  • Summary
Page 60: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

60

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

Ranking function = ( | _ )log( | _ )

P Data true PPIP Data false PPI

61

We assume the observations are independent (wersquoll see how to handle dependence soon)

Ranking function =

( | _ )log( | _ )

P Data true PPIP Data false PPI

62

We assume the observations are independent (wersquoll see how to handle dependence soon)

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

63

We assume the observations are independent (wersquoll see how to handle dependence soon) We can compute these terms if we have a set of high-confidence positive and negative interactions Exactly how we compute the terms depends on the type of data For affinity purificationmass spec see Collins et al Mol Cell Proteomics 2007 httpwwwmcponlineorgcontent63439long

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

64

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

TP(T

P+F

N)

ROC curve

True positives = True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD).

True negatives = protein pairs that:
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression
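The two axes of the ROC curve can be computed directly from a scored gold standard. The sketch below assumes each gold-standard pair has a ranking score and a label; the scores and labels are made-up values for illustration, and `roc_points` is a hypothetical helper name.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per score cutoff, highest cutoff first.

    labels: True for gold-standard positives, False for gold-standard negatives.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for cutoff in sorted(set(scores), reverse=True):
        # Everything scoring at or above the cutoff is predicted to be real.
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and not y)
        # x-axis: FP/(TN+FP); y-axis: TP/(TP+FN)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores for six gold-standard pairs.
scores = [3.2, 2.5, 1.1, 0.4, -0.8, -1.5]
labels = [True, True, False, True, False, False]
curve = roc_points(scores, labels)
```

Lowering the cutoff moves along the curve from the lower left (few predictions, few errors) to the upper right (everything predicted).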

67

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: “9074 interactions”; “Literature curated (excluding 2-hybrid)”; “Error rate equivalent to literature curated”.]

68


Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotation: “9074 interactions”.]

69


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

73

Bayesian Networks

A “natural” way to think about biological networks.

Predict unknown variables from observations.

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)


75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Yang, Xia, Joshua L. Deignan, et al. “Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks.” Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1)
  ⇒ $2^N - 1$ parameters
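To see how quickly the full table grows, the count can be compared with a factored model. The sketch below assumes, for illustration, a structure with one hidden binary cause and independent binary observations (the naive Bayes form used for the ranking function); the function names are hypothetical.

```python
def full_joint_parameters(n_variables):
    """Free parameters of a complete joint table over n binary variables."""
    return 2 ** n_variables - 1


def naive_bayes_parameters(n_observations):
    """One hidden binary cause with n binary effects:
    1 prior, plus P(effect=T | cause) for each of the 2 cause states per effect."""
    return 1 + 2 * n_observations


for n_obs in (4, 10, 20):
    n_vars = n_obs + 1  # the observations plus the hidden cause
    print(n_obs, full_joint_parameters(n_vars), naive_bayes_parameters(n_obs))
```

The full table grows exponentially with the number of variables, while the factored model grows linearly, which is the practical motivation for expressing independence in the graph.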

77

Graphical Structure Expresses our Beliefs

[Diagram: “P1-P2 REAL” (cause, often “hidden”) with child nodes “Detected by X1” … “Detected by Xn” (effects, observed).]

78

Graphical Structure Expresses our Beliefs

[Diagram 1: “P1-P2 REAL” with child nodes “Detected by X1” … “Detected by Xn”.]

[Diagram 2: “P1-P2 REAL” with child nodes “Detected by X1” … “Detected by Xj”, “Detected by Xj+1” … “Detected by Xn”, plus additional nodes “Membrane” and “Highly expressed”.]

79

Graphical Structure Expresses our Beliefs

[Diagram 1: “P1-P2 REAL” with child nodes “Detected by X1” … “Detected by Xn”.]

Naïve Bayes assumes all observations are independent.

[Diagram 2: “P1-P2 REAL” with child nodes “Detected by X1” … “Detected by Xj”, “Detected by Xj+1” … “Detected by Xn”, plus additional nodes “Membrane” and “Highly expressed”.]

But some observations may be coupled.

80

Graphical Structure Expresses our Beliefs

[Diagram 1: “P1-P2 REAL” with child nodes “Detected by X1” … “Detected by Xn”.]

[Diagram 2: “P1-P2 REAL” with child nodes “Detected by X1” … “Detected by Xj”, “Detected by Xj+1” … “Detected by Xn”, plus additional nodes “Membrane” and “Highly expressed”.]

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Diagram: Bayesian network with nodes “Smart”, “Grade inflation”, “Grades”, “GREs”, and “Admit”.]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).
  Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).
  You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram: Bayesian network with nodes “Smart”, “Grade inflation”, “Grades”, “GREs”, and “Admit”.]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram: “Smart” → “Grades”]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent, and both require the same number of constraints.

85
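The equivalence of the two formulations can be checked numerically: multiplying the prior by the conditional table reproduces the joint table, and marginalizing the joint recovers the prior. A minimal sketch using the numbers on this slide (the variable names are illustrative):

```python
# P(S): prior probability that the student is smart.
p_s = {False: 0.7, True: 0.3}

# P(G | S): probability that grades are above threshold, indexed [S][G].
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint from the conditional form: P(S, G) = P(S) * P(G | S)
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

# Back from the joint: P(S) by summing over G.
p_s_back = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}

for key, value in sorted(joint.items()):
    print(key, round(value, 3))
```

Both forms have three free parameters: the joint table has four entries that must sum to 1, and the conditional form has one prior plus one conditional per value of S.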

Automated Admissions Decisions

[Diagram: Bayesian network with nodes “Smart”, “Grade inflation”, “Grades”, “GREs”, and “Admit”.]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

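“Explaining Away”

[Diagram: Bayesian network with nodes “Season”, “Sprinkler” (S), “Rain” (R), “Grass wet”, and “Slippery”.]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, once we observe that the grass is wet, the knowledge that it is raining influences our beliefs about the sprinkler: rain already explains the wet grass, so the sprinkler becomes less likely.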

87

“Explaining Away”

[Diagram: “Coin 1” and “Coin 2” are parents of “Score”.]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
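The coin example is small enough to check by enumerating all four outcomes. The sketch below assumes two fair, independent coins and a score of 1 when they match, as in the table; `prob_c1_heads` is a hypothetical helper.

```python
from itertools import product

# Each world is (C1, C2, Score, probability); Score = 1 when the coins match.
worlds = [(c1, c2, int(c1 == c2), 0.25) for c1, c2 in product("HT", repeat=2)]


def prob_c1_heads(**evidence):
    """P(C1=H | evidence) by enumeration; evidence keys may be 'c2' and 'score'."""
    def consistent(c2, score):
        observed = {"c2": c2, "score": score}
        return all(observed[key] == value for key, value in evidence.items())

    total = sum(p for c1, c2, s, p in worlds if consistent(c2, s))
    heads = sum(p for c1, c2, s, p in worlds if c1 == "H" and consistent(c2, s))
    return heads / total


print(prob_c1_heads())                  # no evidence
print(prob_c1_heads(c2="T"))            # C2 alone tells us nothing about C1
print(prob_c1_heads(score=1, c2="T"))   # with the score known, C2 fixes C1
```

Without the score, learning C2 leaves the belief in C1 unchanged; once the score is observed, the two coins become dependent even though neither causes the other.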

88

How do we obtain a BN?

• Two problems:
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data $X_i$
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

91

Learning Models from Data

[Diagram: “Coin 1” and “Coin 2” are parents of “Score”.]

p(C1=H) = ?    p(C1=T) = ?
p(C2=H) = ?    p(C2=T) = ?

C1    C2    Score
H     H     ?
T     T     ?
H     T     ?
T     H     ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …
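When every variable is observed, the maximum-likelihood parameters are just observed frequencies. The sketch below illustrates that case only; the data set is hypothetical (the slide shows only the first two observations), and the helper names are assumptions.

```python
from collections import Counter

# Hypothetical observations D = (C1, C2, S); not data from the lecture.
data = [
    ("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("H", "T", 0),
    ("T", "H", 0), ("H", "H", 1), ("T", "T", 1), ("H", "H", 1),
]


def mle_heads(rows, index):
    """Maximum-likelihood estimate of P(coin=H): the observed frequency."""
    counts = Counter(row[index] for row in rows)
    return counts["H"] / len(rows)


def mle_score_table(rows):
    """Estimate P(Score=1 | C1, C2) for each coin combination seen in the data."""
    totals, ones = Counter(), Counter()
    for c1, c2, score in rows:
        totals[(c1, c2)] += 1
        ones[(c1, c2)] += score
    return {combo: ones[combo] / totals[combo] for combo in totals}


p_c1_heads = mle_heads(data, 0)
p_c2_heads = mle_heads(data, 1)
score_table = mle_score_table(data)
```

With hidden variables or missing values, counting is no longer enough, which is where search methods such as EM or Gibbs sampling come in.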

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

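[Diagram: S (Smart) → G (Grades); S → R (GREs); G and R → A (Admit).]

$$P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R)$$

$$= P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R) \quad \text{(why?)}$$

(because of the conditional independence assumption)

Recall: $P(X,Y) = P(X \mid Y)\,P(Y)$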

96

Prediction with Bayes Nets

[Diagram: S (Smart) → G (Grades); S → R (GREs); G and R → A (Admit).]

S:
P(F)    P(T)
0.5     0.5

G:
S    P(F)    P(T)
F    0.9     0.1
T    0.8     0.2

R:
S    P(F)    P(T)
F    0.8     0.2
T    0.2     0.8

A:
G    R    P(F)    P(T)
F    F    0.9     0.1
F    T    0.8     0.2
T    F    0.5     0.5
T    T    0.2     0.8

$$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\,P(R \mid S{=}T)\,P(A{=}T \mid G,R)$$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
(terms in order: G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn’t help very much.

GREs are better correlated with intelligence than the grades.
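The double sum for P(A=T|S=T) can be reproduced by enumerating the four combinations of G and R. The sketch below uses the conditional tables as they appear on this slide; the dictionary layout and function name are illustrative choices.

```python
F, T = False, True

# Conditional tables as transcribed from the slide.
p_g = {F: {F: 0.9, T: 0.1}, T: {F: 0.8, T: 0.2}}  # P(G | S), indexed [S][G]
p_r = {F: {F: 0.8, T: 0.2}, T: {F: 0.2, T: 0.8}}  # P(R | S), indexed [S][R]
p_admit = {(F, F): 0.1, (F, T): 0.2, (T, F): 0.5, (T, T): 0.8}  # P(A=T | G, R)


def p_admit_given_smart(s):
    """P(A=T | S=s): sum out grades G and GREs R."""
    return sum(
        p_g[s][g] * p_r[s][r] * p_admit[(g, r)]
        for g in (F, T)
        for r in (F, T)
    )


print(round(p_admit_given_smart(T), 3))
print(round(p_admit_given_smart(F), 3))
```

For S=T this reproduces the four-term sum above. For S=F, the tables as transcribed here give roughly 0.16, which may differ slightly from the value quoted elsewhere in the slides; the tables or the quoted value may have been altered in transcription.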

97

Inference with Bayes Nets

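[Diagram: S (Smart) → G (Grades); S → R (GREs); G and R → A (Admit).]

$$P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}T \mid G,R)$$

$$= P(S{=}T)\,P(A{=}T \mid S{=}T) + P(S{=}F)\,P(A{=}T \mid S{=}F) = 0.21$$

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T), calculated on the previous slide, = 0.29
P(A=T|S=F), calculated analogously to P(A=T|S=T), = 0.14

$$P(S{=}T \mid A{=}T) = \frac{P(S{=}T, A{=}T)}{P(A{=}T)}$$

Or, using Bayes’ Rule:

$$P(S{=}T \mid A{=}T) = \frac{P(S{=}T)\,P(A{=}T \mid S{=}T)}{P(A{=}T)}$$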
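Inference by enumeration generalizes this calculation: build the joint from the factorization P(S)P(G|S)P(R|S)P(A|G,R), then divide the probability of worlds matching both the query and the evidence by the probability of worlds matching the evidence. The sketch below uses the conditional tables as transcribed from the prediction slide; `joint` and `query` are hypothetical helper names, and the values it produces may differ in the second decimal place from those quoted on the slide.

```python
from itertools import product

F, T = False, True
P_S = {F: 0.5, T: 0.5}
P_G = {F: {F: 0.9, T: 0.1}, T: {F: 0.8, T: 0.2}}  # P(G | S), indexed [S][G]
P_R = {F: {F: 0.8, T: 0.2}, T: {F: 0.2, T: 0.8}}  # P(R | S), indexed [S][R]
P_A = {(F, F): 0.1, (F, T): 0.2, (T, F): 0.5, (T, T): 0.8}  # P(A=T | G, R)


def joint(s, g, r, a):
    """P(S, G, R, A) from the Bayes net factorization."""
    p_a_true = P_A[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * (p_a_true if a else 1.0 - p_a_true)


def query(target, evidence):
    """P(target | evidence); both are dicts keyed by 'S', 'G', 'R', 'A'."""
    numerator = denominator = 0.0
    for s, g, r, a in product((F, T), repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(s, g, r, a)
            denominator += p
            if all(world[k] == v for k, v in target.items()):
                numerator += p
    return numerator / denominator


p_admit = query({"A": T}, {})
p_smart_given_admit = query({"S": T}, {"A": T})
print(round(p_admit, 3), round(p_smart_given_admit, 3))
```

Enumeration is exponential in the number of variables, so it only works for toy networks like this one; it is useful mainly as a check on hand calculations.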

98

Inference with Bayes Nets

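[Diagram: S (Smart) → G (Grades); S → R (GREs); G and R → A (Admit).]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

$$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} P(S)\,P(G \mid S)\,P(R{=}F \mid S)\,P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}F \mid G,R) \quad \text{(as before)}$$

Tedious but straightforward to compute:

$$\frac{P(G{=}F \mid A{=}F)}{P(R{=}F \mid A{=}F)} = \frac{0.92}{0.56} \approx 1.6$$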
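The "tedious but straightforward" sums can be left to a short script. This sketch again uses the conditional tables as transcribed from the prediction slide and sums the factorized joint directly; the variable names are illustrative, and the computed values may differ slightly from the rounded figures quoted on the slide.

```python
from itertools import product

F, T = False, True
P_S = {F: 0.5, T: 0.5}
P_G = {F: {F: 0.9, T: 0.1}, T: {F: 0.8, T: 0.2}}  # P(G | S), indexed [S][G]
P_R = {F: {F: 0.8, T: 0.2}, T: {F: 0.2, T: 0.8}}  # P(R | S), indexed [S][R]
P_A = {(F, F): 0.1, (F, T): 0.2, (T, F): 0.5, (T, T): 0.8}  # P(A=T | G, R)


def p_joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a_true = P_A[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * (p_a_true if a else 1.0 - p_a_true)


# P(A=F): sum over all values of S, G, R.
p_not_admit = sum(p_joint(s, g, r, F) for s, g, r in product((F, T), repeat=3))

# P(G=F | A=F): fix G=F and A=F, sum over S and R.
p_bad_grades = sum(p_joint(s, F, r, F) for s, r in product((F, T), repeat=2)) / p_not_admit

# P(R=F | A=F): fix R=F and A=F, sum over S and G.
p_bad_gres = sum(p_joint(s, g, F, F) for s, g in product((F, T), repeat=2)) / p_not_admit

print(round(p_bad_grades, 2), round(p_bad_gres, 2))
```

Either way the conclusion is the same: a rejected student is more likely to have had bad grades than bad GREs, largely because bad grades are common for everyone under the rough grading in the G table.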

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: “P1-P2 REAL” (cause, often “hidden”) with child nodes “Detected by X1” … “Detected by Xn” (effects, observed).]

101

Properties of real interactions: correlated expression
Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1.5, 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. “Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations.” Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102


Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: candidate patterns; one is labeled “More likely to interact”.]

Courtesy of Cokus et al. License: CC-BY.
Source: Cokus, Shawn, Sayaka Mizutani, et al. “An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles.” BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

108

109

likelihood ratio = [equation not recovered]; if > 1, classify as true; if < 1, classify as false

log likelihood ratio = [equation not recovered]

Prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114 / 2150}{81924 / 573734}$$
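The likelihood for one feature value is a ratio of two frequencies, each counted within a gold-standard set. The sketch below assumes the numbers on this slide read as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives sharing the same feature value; that reading, and the function name, are assumptions for illustration.

```python
def likelihood_ratio(n_pos_with_f, n_pos_total, n_neg_with_f, n_neg_total):
    """L = P(f | pos) / P(f | neg), with each probability estimated by counting."""
    p_f_given_pos = n_pos_with_f / n_pos_total
    p_f_given_neg = n_neg_with_f / n_neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (assumed grouping of the digits).
L_example = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L_example, 2))
```

A ratio above 1 means the feature value is enriched among gold-standard positives, so observing it raises the odds that a pair truly interacts.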

111

112


113

Fully connected → Compute probabilities for all 16 possible combinations
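With four binary data sets there are 2^4 = 16 detection patterns, and a fully connected model estimates a likelihood ratio for each pattern directly instead of multiplying per-data-set terms. The sketch below is one way this could be done by counting; the pseudocount smoothing is an added assumption (not from the slides) to avoid zero or infinite ratios for rarely seen patterns, and the data set names follow the earlier slide listing Y2H and IP experiments.

```python
from collections import Counter
from itertools import product

DATASETS = ("Y2H_Uetz", "Y2H_Ito", "IP_Gavin", "IP_Ho")


def joint_likelihood_ratios(positives, negatives, pseudocount=1.0):
    """Likelihood ratio for each of the 16 detection patterns.

    positives / negatives: lists of 4-tuples of booleans, one tuple per
    gold-standard pair, ordered as in DATASETS.
    """
    pos_counts, neg_counts = Counter(positives), Counter(negatives)
    patterns = list(product((False, True), repeat=len(DATASETS)))
    pos_total = len(positives) + pseudocount * len(patterns)
    neg_total = len(negatives) + pseudocount * len(patterns)
    ratios = {}
    for pattern in patterns:
        p_pos = (pos_counts[pattern] + pseudocount) / pos_total
        p_neg = (neg_counts[pattern] + pseudocount) / neg_total
        ratios[pattern] = p_pos / p_neg
    return ratios
```

Because each of the 16 cells is estimated separately, cells with few gold-standard pairs give noisy ratios, which is why small counts call for caution.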


114

Interpret with caution as numbers are small


115

TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Positive training data MIPS bull Hand-curated from literature Negative training data bull Proteins in different subcellular compartments

107

Integrating diverse data

109

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

110

Protein pairs in the essentiality data can take on three discrete values (EE both essential NN both non-essential and NE one essential and one not)

11142150

81924573734

Likelihood=L= )|()|(

negfPposfP

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected rarr Compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TF=FP

prediction based on single data type all have TPFPlt1

How many gold-standard events do we score correctly at different likelihood cutoffs

( | _ )log( | _ )

P Data true PPIP Data false PPI

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

117

MIT OpenCourseWarehttpocwmitedu

791J 20490J 20390J 736J 6802J 6874J HST506J Foundations of Computational and Systems BiologySpring 2014

For information about citing these materials or our Terms of Use visit httpocwmiteduterms

  • Slide Number 1
  • Predictions
  • Prediction Challenges
  • Slide Number 4
  • Slide Number 5
  • Predicting Structures of Complexes
  • Docking
  • Docking
  • Reducing the search space
  • Next
  • PRISMrsquos Rationale
  • Slide Number 12
  • Subtilisin and its inhibitors
  • Hotspots
  • Hotspots
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Slide Number 19
  • Slide Number 20
  • Slide Number 21
  • Flowchart
  • Slide Number 23
  • Next
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Slide Number 30
  • Slide Number 31
  • Slide Number 32
  • Slide Number 33
  • Slide Number 34
  • Slide Number 35
  • Slide Number 36
  • Slide Number 37
  • Slide Number 38
  • Outline
  • Detecting protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Tagging strategies
  • Yeast two-hybrid
  • Slide Number 45
  • Outline
  • Error Rates
  • Error Rates
  • Data Integration
  • Data Integration
  • Data Integration
  • Data Integration
  • Estimated Error Rates
  • Slide Number 54
  • Finding real interactions
  • Slide Number 56
  • Finding real interactions
  • Bayes Rule
  • Naiumlve Bayes Classification
  • Slide Number 61
  • Slide Number 62
  • Slide Number 63
  • Slide Number 64
  • Slide Number 65
  • Slide Number 66
  • ROC curve
  • ROC curve
  • ROC curve
  • ROC curve
  • Outline
  • Bayesian Networks
  • In Biology
  • Slide Number 74
  • Slide Number 75
  • Slide Number 76
  • Slide Number 77
  • Bayesian Networks
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Slide Number 83
  • Slide Number 84
  • Slide Number 85
  • Slide Number 86
  • Slide Number 87
  • ldquoExplaining Awayrdquo
  • Slide Number 89
  • How do we obtain a BN
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Resources to learn more
  • A worked ldquotoyrdquo example
  • Chain Rule of Probability for Bayes Nets
  • Prediction with Bayes Nets
  • Inference with Bayes Nets
  • Inference with Bayes Nets
  • End of worked example
  • Goal
  • Slide Number 103
  • Co-evolution
  • Rosetta Stone
  • Integrating diverse data
  • Advantage of Bayesian Networks
  • Requirement of Bayesian Classification
  • Integrating diverse data
  • Slide Number 110
  • Slide Number 111
  • Slide Number 112
  • Slide Number 113
  • Slide Number 114
  • Slide Number 115
  • Slide Number 116
  • Slide Number 117
  • Summary
Page 62: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

We assume the observations are independent (we'll see how to handle dependence soon). We can compute these terms if we have a set of high-confidence positive and negative interactions. Exactly how we compute the terms depends on the type of data. For affinity purification/mass spec, see Collins et al., Mol Cell Proteomics 2007: http://www.mcponline.org/content/6/3/439.long

Ranking function =

$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i=1}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i=1}^{M} \log \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$
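The ranking function can be sketched in a few lines of Python. This is only an illustration under the independence assumption stated on the slide: the assay names and the detection rates are hypothetical placeholders, standing in for values that would be estimated from high-confidence positive and negative sets.

```python
import math

def log_likelihood_ratio(observations, p_detect_given_true, p_detect_given_false):
    """Sum the per-assay log ratios, assuming the assays are independent."""
    score = 0.0
    for assay, detected in observations.items():
        p_true = p_detect_given_true[assay]
        p_false = p_detect_given_false[assay]
        if not detected:
            # Probability of NOT seeing the pair in this assay.
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        score += math.log(p_true / p_false)
    return score

# Hypothetical detection rates (not taken from any paper).
P_DETECT_GIVEN_TRUE = {"two_hybrid": 0.20, "affinity_ms": 0.40}
P_DETECT_GIVEN_FALSE = {"two_hybrid": 0.01, "affinity_ms": 0.05}

# Hypothetical protein pairs and the assays that detected them.
pairs = {
    "A-B": {"two_hybrid": True, "affinity_ms": True},
    "C-D": {"two_hybrid": False, "affinity_ms": True},
    "E-F": {"two_hybrid": False, "affinity_ms": False},
}

scores = {
    name: log_likelihood_ratio(obs, P_DETECT_GIVEN_TRUE, P_DETECT_GIVEN_FALSE)
    for name, obs in pairs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A pair seen in only one assay still gets a positive score here, which is the point of ranking by the summed log ratio rather than requiring detection everywhere.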

Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays, we can rank interactions by the log-likelihood ratio.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

Collins et al., Mol Cell Proteomics 2007

ROC curve

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations on the plot: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD).

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.
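The two axes of the ROC curve can be computed directly from a scored list and the two gold-standard sets. The sketch below assumes every scored pair belongs to either the positive or the negative set; the pair names and scores are made up for illustration.

```python
def roc_points(scored_pairs, positives, negatives):
    """Return (FP/(TN+FP), TP/(TP+FN)) at each score cutoff, strictest first."""
    points = []
    for cutoff in sorted(set(scored_pairs.values()), reverse=True):
        predicted = {pair for pair, s in scored_pairs.items() if s >= cutoff}
        tp = len(predicted & positives)
        fp = len(predicted & negatives)
        fn = len(positives) - tp
        tn = len(negatives) - fp
        points.append((fp / (tn + fp), tp / (tp + fn)))
    return points

# Hypothetical scores and gold-standard sets.
toy_scores = {"A-B": 3.0, "C-D": 1.5, "E-F": 0.2, "G-H": -1.0}
gold_positives = {"A-B", "E-F"}
gold_negatives = {"C-D", "G-H"}

points = roc_points(toy_scores, gold_positives, gold_negatives)
print(points)
```

Each cutoff gives one point; lowering the cutoff moves up and to the right along the curve.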

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

Bayesian Networks

A "natural" way to think about biological networks.

Predict unknown variables from observations.

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables ⇒ 2^N states
  – only one constraint (sum of all probabilities = 1) ⇒ 2^N − 1 parameters
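To see how quickly the full joint table grows, the count can be compared with a naive Bayes model (one hidden binary class with N − 1 binary observations). The naive Bayes count below is a standard result rather than something stated on the slide: one prior, plus two conditional probabilities per observed variable.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters in a complete joint table over binary variables."""
    return 2 ** n_binary_variables - 1

def naive_bayes_parameters(n_observed):
    """One class prior plus P(obs = T | class) for both class values."""
    return 1 + 2 * n_observed

for n_observed in (4, 10, 20):
    total_variables = n_observed + 1  # the observations plus the hidden class
    print(n_observed,
          full_joint_parameters(total_variables),
          naive_bayes_parameters(n_observed))
```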

Graphical Structure Expresses our Beliefs

[Diagram (naive Bayes): a hidden cause node "P1-P2 REAL" with the observed effect nodes "Detected by X1", …, "Detected by Xn". The cause is often "hidden"; the effects are observed.]

[Diagram (coupled observations): "P1-P2 REAL" with the nodes "Detected by X1", …, "Detected by Xn" and "Detected by Xj", "Detected by Xj+1", …, plus the additional nodes "Membrane" and "Highly expressed".]

Naïve Bayes assumes all observations are independent, given whether the interaction is real.

But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents. For example, the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Diagram: Bayesian network with the nodes Smart, Grade inflation, Grades, GREs, and Admit.]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

  S   G (above threshold)   P(S,G)
  F   F                     0.665
  F   T                     0.035
  T   F                     0.06
  T   T                     0.24

Conditional Probability

  P(S=F) = 0.7   P(S=T) = 0.3

  S   P(G=F|S)   P(G=T|S)
  F   0.95       0.05
  T   0.2        0.8

The two formulations are equivalent, and both require the same number of constraints.
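A quick check that the two formulations agree: multiplying the prior P(S) by the conditional P(G|S) should reproduce every entry of the joint table. The sketch uses only the numbers from the slide; the variable names are arbitrary.

```python
# Prior and conditional tables from the slide (False = F, True = T).
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint table built with the product rule P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

for (s, g), p in joint.items():
    print(s, g, round(p, 3))

# Both forms have three free parameters: the joint has four entries that
# sum to 1, and the conditional form has P(S=T), P(G=T|S=F), P(G=T|S=T).
```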

Automated Admissions Decisions

[Diagram: Bayesian network with the nodes Smart, Grade inflation, Grades, GREs, and Admit.]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

"Explaining Away"

[Diagram: Bayesian network with the nodes Season, Sprinkler (S), Rain (R), Grass wet, and Slippery.]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about the sprinkler.

"Explaining Away"

[Diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

  C1   C2   Score
  H    H    1
  T    T    1
  H    T    0
  T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
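The coin example is small enough to enumerate. The sketch below lists the four equally likely outcomes and conditions by counting; the helper name `prob` is arbitrary. It shows that C1 and C2 are independent on their own but become coupled once the score is known.

```python
from fractions import Fraction
from itertools import product

def score(c1, c2):
    """Score is 1 when the two coins match, 0 otherwise (table on the slide)."""
    return 1 if c1 == c2 else 0

# Four equally likely worlds: (C1, C2, Score).
worlds = [(c1, c2, score(c1, c2)) for c1, c2 in product("HT", repeat=2)]

def prob(event, given=lambda w: True):
    """P(event | given) by counting, valid because the worlds are equiprobable."""
    consistent = [w for w in worlds if given(w)]
    return Fraction(sum(1 for w in consistent if event(w)), len(consistent))

c1_heads = lambda w: w[0] == "H"

p_prior = prob(c1_heads)
p_given_c2 = prob(c1_heads, lambda w: w[1] == "T")
p_given_score = prob(c1_heads, lambda w: w[2] == 1)
p_given_score_and_c2 = prob(c1_heads, lambda w: w[2] == 1 and w[1] == "T")

print(p_prior, p_given_c2, p_given_score, p_given_score_and_c2)
```

Knowing C2 alone leaves the belief in C1 unchanged, but knowing both the score and C2 determines C1 completely.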

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data X_i
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

Learning Models from Data

[Diagram: Coin 1 → Score ← Coin 2]

Parameters to learn from the data:

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

  C1   C2   Score
  H    H    ?
  T    T    ?
  H    T    ?
  T    H    ?

D = (C1, C2, S) = (H, T, 0), (H, H, 1), …
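When every variable is observed, the maximum-likelihood estimates for this network are just relative frequencies. The sketch below assumes a small, made-up data set D (only its first two records appear on the slide) and the hypothetical helper names `mle_heads` and `mle_score_table`.

```python
from collections import Counter

# Hypothetical training data D = (C1, C2, S); only the first two rows
# appear on the slide, the rest are invented for illustration.
data = [
    ("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0),
    ("H", "H", 1), ("T", "H", 0), ("H", "T", 0), ("H", "H", 1),
]

def mle_heads(records, coin_index):
    """Maximum-likelihood estimate of p(coin = H): the observed frequency."""
    return sum(1 for row in records if row[coin_index] == "H") / len(records)

def mle_score_table(records):
    """Estimate P(Score = 1 | C1, C2) for each observed parent configuration."""
    totals, ones = Counter(), Counter()
    for c1, c2, s in records:
        totals[(c1, c2)] += 1
        ones[(c1, c2)] += s
    return {parents: ones[parents] / totals[parents] for parents in totals}

p_c1_heads = mle_heads(data, 0)
p_c2_heads = mle_heads(data, 1)
score_table = mle_score_table(data)
print(p_c1_heads, p_c2_heads, score_table)
```

With hidden variables or missing values, counting is no longer enough, which is where search methods such as EM come in.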

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?
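To get a feel for "too many possible structures", the number of directed acyclic graphs on n labeled nodes can be computed with Robinson's recurrence. This recurrence is a standard combinatorial result brought in here for illustration; it is not part of the lecture.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of DAGs on n labeled nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

for n in range(1, 11):
    print(n, count_dags(n))
```

Even five nodes already admit tens of thousands of candidate structures, so exhaustive scoring is only feasible for toy networks.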

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked "toy" example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

[Diagram: S (Smart) → G (Grades); S → R (GREs); G → A (Admit); R → A]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

           = P(S) P(G|S) P(R|S) P(A|G,R)   (why? because of the conditional independence assumptions)

Recall: P(X,Y) = P(X|Y) P(Y)

Prediction with Bayes Nets

[Diagram: S (Smart) → G (Grades); S → R (GREs); G → A (Admit); R → A]

P(S):

  P(S=F)   P(S=T)
  0.5      0.5

P(G | S):

  S   P(G=F)   P(G=T)
  F   0.9      0.1
  T   0.8      0.2

P(R | S):

  S   P(R=F)   P(R=T)
  F   0.8      0.2
  T   0.2      0.8

P(A | G, R):

  G   R   P(A=F)   P(A=T)
  F   F   0.9      0.1
  F   T   0.8      0.2
  T   F   0.5      0.5
  T   T   0.2      0.8

Rough grading: being smart doesn't help very much.

GREs are better correlated with intelligence than the grades.

$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\, P(R \mid S{=}T)\, P(A{=}T \mid G, R)$

= (0.8)(0.2)(0.1) [G=F, R=F] + (0.8)(0.8)(0.2) [G=F, R=T] + (0.2)(0.2)(0.5) [G=T, R=F] + (0.2)(0.8)(0.8) [G=T, R=T] = 0.29

Inference with Bayes Nets

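[Diagram: S (Smart) → G (Grades); S → R (GREs); G → A (Admit); R → A]

$P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A{=}T \mid G, R)$

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) ≈ (0.5)(0.29) + (0.5)(0.16) ≈ 0.23

P(S=T) = 0.5, P(S=F) = 0.5

P(A=T|S=T) = 0.29 (calculated on the previous slide)

P(A=T|S=F) = (0.9)(0.8)(0.1) + (0.9)(0.2)(0.2) + (0.1)(0.8)(0.5) + (0.1)(0.2)(0.8) = 0.16 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T,A=T) / P(A=T), or, using Bayes' rule, = P(S=T) P(A=T|S=T) / P(A=T) ≈ 0.64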

Inference with Bayes Nets

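[Diagram: S (Smart) → G (Grades); S → R (GREs); G → A (Admit); R → A]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S} \sum_{G} P(S)\, P(G \mid S)\, P(R{=}F \mid S)\, P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$

$P(A{=}F) = \sum_{S} \sum_{G} \sum_{R} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A{=}F \mid G, R) = 1 - P(A{=}T)$ (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) ≈ 0.94 / 0.55 ≈ 1.7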

End of worked example
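The "tedious but straightforward" sums in the worked example can be checked by brute-force enumeration of the 16 joint states. The sketch below uses the conditional probability tables as transcribed in this example (S → G, S → R, G and R → A); the function names are arbitrary.

```python
from itertools import product

# Tables from the worked example; each entry is the probability of T.
P_S_TRUE = 0.5
P_G_TRUE_GIVEN_S = {False: 0.1, True: 0.2}
P_R_TRUE_GIVEN_S = {False: 0.2, True: 0.8}
P_A_TRUE_GIVEN_GR = {
    (False, False): 0.1, (False, True): 0.2,
    (True, False): 0.5, (True, True): 0.8,
}

def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with P(T) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bernoulli(P_S_TRUE, s)
            * bernoulli(P_G_TRUE_GIVEN_S[s], g)
            * bernoulli(P_R_TRUE_GIVEN_S[s], r)
            * bernoulli(P_A_TRUE_GIVEN_GR[(g, r)], a))

def prob(query, evidence=lambda s, g, r, a: True):
    """P(query | evidence) by summing the joint over all 16 states."""
    numerator = denominator = 0.0
    for state in product((False, True), repeat=4):
        if evidence(*state):
            p = joint(*state)
            denominator += p
            if query(*state):
                numerator += p
    return numerator / denominator

p_admit_given_smart = prob(lambda s, g, r, a: a, lambda s, g, r, a: s)
p_admit_given_not_smart = prob(lambda s, g, r, a: a, lambda s, g, r, a: not s)
p_admit = prob(lambda s, g, r, a: a)
p_smart_given_admit = prob(lambda s, g, r, a: s, lambda s, g, r, a: a)
p_bad_grades_given_reject = prob(lambda s, g, r, a: not g, lambda s, g, r, a: not a)
p_bad_gres_given_reject = prob(lambda s, g, r, a: not r, lambda s, g, r, a: not a)

for name, value in [
    ("P(A=T|S=T)", p_admit_given_smart),
    ("P(A=T|S=F)", p_admit_given_not_smart),
    ("P(A=T)", p_admit),
    ("P(S=T|A=T)", p_smart_given_admit),
    ("P(G=F|A=F)", p_bad_grades_given_reject),
    ("P(R=F|A=F)", p_bad_gres_given_reject),
]:
    print(name, round(value, 3))
```

With these tables, a rejected student is more likely to have had bad grades than bad GREs, mostly because bad grades are so common under the rough grading.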


Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: cause (often "hidden"): "P1-P2 REAL"; effects (observed): "Detected by X1", …, "Detected by Xn".]

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1.5: 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56. © American Society for Biochemistry and Molecular Biology. All rights reserved.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns; one is labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.
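One simple way to compare phylogenetic profiles is to record, for each protein, whether a homolog is present (1) or absent (0) in each genome, and then measure how often two profiles agree. The sketch below uses the fraction of matching positions on made-up profiles; it is a deliberately crude stand-in and is not the scoring method of Cokus et al.

```python
def profile_similarity(profile_a, profile_b):
    """Fraction of genomes in which two presence/absence profiles agree."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    matches = sum(a == b for a, b in zip(profile_a, profile_b))
    return matches / len(profile_a)

# Hypothetical profiles over eight genomes (1 = homolog present).
protein_1 = [1, 1, 0, 0, 1, 0, 1, 1]
protein_2 = [1, 1, 0, 0, 1, 0, 1, 0]  # gained and lost together with protein_1
protein_3 = [0, 1, 1, 0, 0, 1, 0, 1]  # largely unrelated pattern

similar = profile_similarity(protein_1, protein_2)
dissimilar = profile_similarity(protein_1, protein_3)
print(similar, dissimilar)
```

Pairs whose profiles track each other across many genomes are the ones that co-evolution arguments would flag as candidate functional partners.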

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved.

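likelihood ratio = $\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\, P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\, P(\text{false\_PPI})}$; if > 1, classify as true; if < 1, classify as false

log likelihood ratio = $\log \frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function =

$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i=1}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$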

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

Likelihood ratio: $L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$

For example, $P(f \mid \text{pos}) = 1114 / 2150$ and $P(f \mid \text{neg}) = 81924 / 573734$.

Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved.
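The likelihood ratio for one feature value is a ratio of two frequencies, so it can be computed from four counts. The sketch below assumes the figures on the slide pair up as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives; that pairing is a reading of the garbled numbers, not something the text states outright.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), each estimated as a simple frequency."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as read from the slide (assumed pairing, see above).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))
```

A value above 1 means the feature value is seen more often among gold-standard positives than among negatives, so observing it raises the odds that a pair interacts.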

Fully connected → Compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small

Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved.
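For a fully connected model over four binary data sets there is no independence shortcut: each of the 2^4 = 16 detection patterns gets its own likelihood ratio, estimated from how often that exact pattern occurs among gold-standard positives and negatives. The sketch below assumes the four data sets are the ones named earlier (Uetz, Ito, Gavin, Ho) and uses invented gold-standard observations.

```python
from collections import Counter
from itertools import product

DATASETS = ("Uetz", "Ito", "Gavin", "Ho")  # assumed to be the four data sets

def combination_likelihood_ratios(positives, negatives):
    """Likelihood ratio for every detection pattern across the data sets."""
    pos_counts, neg_counts = Counter(positives), Counter(negatives)
    ratios = {}
    for pattern in product((0, 1), repeat=len(DATASETS)):
        p_pos = pos_counts[pattern] / len(positives)
        p_neg = neg_counts[pattern] / len(negatives)
        # With no negatives for a pattern the ratio is undefined; small
        # counts elsewhere also make the estimates unreliable.
        ratios[pattern] = p_pos / p_neg if p_neg > 0 else None
    return ratios

# Hypothetical gold-standard observations, one tuple per protein pair.
toy_positives = [(1, 1, 1, 1)] * 3 + [(0, 0, 1, 1)] * 5 + [(0, 0, 0, 0)] * 2
toy_negatives = [(0, 0, 0, 0)] * 90 + [(0, 0, 1, 1)] * 5 + [(1, 0, 0, 0)] * 5

ratios = combination_likelihood_ratios(toy_positives, toy_negatives)
for pattern, ratio in ratios.items():
    print(dict(zip(DATASETS, pattern)), ratio)
```

Most of the 16 cells are empty or nearly empty in this toy set, which illustrates why the resulting ratios need to be interpreted with caution.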

TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science. All rights reserved.

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

  • Slide Number 1
  • Predictions
  • Prediction Challenges
  • Slide Number 4
  • Slide Number 5
  • Predicting Structures of Complexes
  • Docking
  • Docking
  • Reducing the search space
  • Next
  • PRISMrsquos Rationale
  • Slide Number 12
  • Subtilisin and its inhibitors
  • Hotspots
  • Hotspots
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Slide Number 19
  • Slide Number 20
  • Slide Number 21
  • Flowchart
  • Slide Number 23
  • Next
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Slide Number 30
  • Slide Number 31
  • Slide Number 32
  • Slide Number 33
  • Slide Number 34
  • Slide Number 35
  • Slide Number 36
  • Slide Number 37
  • Slide Number 38
  • Outline
  • Detecting protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Tagging strategies
  • Yeast two-hybrid
  • Slide Number 45
  • Outline
  • Error Rates
  • Error Rates
  • Data Integration
  • Data Integration
  • Data Integration
  • Data Integration
  • Estimated Error Rates
  • Slide Number 54
  • Finding real interactions
  • Slide Number 56
  • Finding real interactions
  • Bayes Rule
  • Naiumlve Bayes Classification
  • Slide Number 61
  • Slide Number 62
  • Slide Number 63
  • Slide Number 64
  • Slide Number 65
  • Slide Number 66
  • ROC curve
  • ROC curve
  • ROC curve
  • ROC curve
  • Outline
  • Bayesian Networks
  • In Biology
  • Slide Number 74
  • Slide Number 75
  • Slide Number 76
  • Slide Number 77
  • Bayesian Networks
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Slide Number 83
  • Slide Number 84
  • Slide Number 85
  • Slide Number 86
  • Slide Number 87
  • ldquoExplaining Awayrdquo
  • Slide Number 89
  • How do we obtain a BN
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Resources to learn more
  • A worked ldquotoyrdquo example
  • Chain Rule of Probability for Bayes Nets
  • Prediction with Bayes Nets
  • Inference with Bayes Nets
  • Inference with Bayes Nets
  • End of worked example
  • Goal
  • Slide Number 103
  • Co-evolution
  • Rosetta Stone
  • Integrating diverse data
  • Advantage of Bayesian Networks
  • Requirement of Bayesian Classification
  • Integrating diverse data
  • Slide Number 110
  • Slide Number 111
  • Slide Number 112
  • Slide Number 113
  • Slide Number 114
  • Slide Number 115
  • Slide Number 116
  • Slide Number 117
  • Summary
Page 63: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

We assume the observations are independent (wersquoll see how to handle dependence soon)

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

63

We assume the observations are independent (wersquoll see how to handle dependence soon) We can compute these terms if we have a set of high-confidence positive and negative interactions Exactly how we compute the terms depends on the type of data For affinity purificationmass spec see Collins et al Mol Cell Proteomics 2007 httpwwwmcponlineorgcontent63439long

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

64

Comparative assessment of large-scale data sets of proteinndashprotein interactions von Mering et al Nature 417 399-403 (23 May 2002) | doi101038nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited Used with permissionSource Von Mering Christian Roland Krause et al Comparative Assessment of Large-scaleData Sets of Proteinndashprotein Interactions Nature 417 no 6887 (2002) 399-403

65

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

TP(T

P+F

N)

ROC curve

True positives = True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD).

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression
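The two ROC axes can be computed directly from a scored list and the gold-standard sets. The sketch below is an illustration, not the procedure used by Collins et al.; the pair labels, scores, and gold-standard memberships are hypothetical.

```python
def roc_points(scores, positives, negatives):
    """Return (FPR, TPR) at each score cutoff, from the strictest cutoff down.

    TPR = TP / (TP + FN), measured on the high-confidence positives.
    FPR = FP / (TN + FP), measured on the high-confidence negatives.
    """
    points = []
    for cutoff in sorted(set(scores.values()), reverse=True):
        predicted = {pair for pair, s in scores.items() if s >= cutoff}
        tp = len(predicted & positives)
        fp = len(predicted & negatives)
        points.append((fp / len(negatives), tp / len(positives)))
    return points

# Hypothetical scores and gold-standard sets.
scores = {"a": 0.9, "b": 0.8, "c": 0.6, "d": 0.4, "e": 0.1}
positives = {"a", "b", "d"}
negatives = {"c", "e"}

curve = roc_points(scores, positives, negatives)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the cutoff can only add predictions, so both rates are non-decreasing along the curve, and the last point (everything predicted) is always (1, 1).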

Collins et al., Mol Cell Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

Bayesian Networks

A method for using probabilities to reason

In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1) => 2^N - 1 parameters
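To get a feel for how quickly the full table grows, the count 2^N - 1 can be compared with the number of parameters in a factored model. The comparison structure used here (one binary hidden cause with n binary observed effects that are independent given the cause) is an assumption chosen for illustration.

```python
def full_joint_parameters(n_binary_variables):
    """Free parameters in a complete joint table: 2^N states minus one constraint."""
    return 2 ** n_binary_variables - 1

def single_cause_parameters(n_observations):
    """Free parameters when one binary cause has n independent binary effects.

    One parameter for the prior on the cause, plus two per effect:
    P(effect = T | cause = F) and P(effect = T | cause = T).
    """
    return 1 + 2 * n_observations

for n_obs in (4, 10, 20):
    n_vars = n_obs + 1  # the observed effects plus the hidden cause
    print(n_obs, full_joint_parameters(n_vars), single_cause_parameters(n_obs))
```

With four observed assays, the full table needs 31 parameters and the factored model needs 9; with twenty, the full table needs more than two million while the factored model needs 41.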

Graphical Structure Expresses our Beliefs

Cause (often "hidden"): P1-P2 REAL
Effects (observed): Detected by X1, …, Detected by Xn

Graphical Structure Expresses our Beliefs

[Two diagrams. First: P1-P2 REAL with Detected by X1, …, Detected by Xn. Second: P1-P2 REAL with Detected by X1, …, Detected by Xn, Detected by Xj, Detected by Xj+1, …, together with the nodes "Membrane" and "Highly expressed".]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Diagram nodes: Grade inflation, Smart, Grades, GREs, Admit]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

Automated Admissions Decisions

[Diagram nodes: Grade inflation, Smart, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

[Diagram nodes: Smart, Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints.
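The equivalence of the two formulations can be checked numerically: multiplying the prior P(S) by the conditional P(G|S) should reproduce each entry of the joint table. A minimal sketch using the numbers from the slide (the variable names are my own):

```python
# Conditional formulation, as given on the slide.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint formulation, rebuilt from the conditional one: P(S, G) = P(S) P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

for (s, g), p in joint.items():
    print(f"S={s!s:5}  G={g!s:5}  P(S,G)={p:.3f}")

# Going the other way: recover a conditional from the joint table.
p_g_true = joint[(False, True)] + joint[(True, True)]
p_s_true_given_g_true = joint[(True, True)] / p_g_true
print(f"P(S=T | G=T) = {p_s_true_given_g_true:.3f}")
```

Both formulations have three free parameters: the joint table has four entries that must sum to 1, and the conditional form has P(S=T), P(G=T|S=F), and P(G=T|S=T).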

Automated Admissions Decisions

[Diagram nodes: Grade inflation, Smart, Grades, GREs, Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents".

For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

"Explaining Away"

[Diagram nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

"Explaining Away"

[Diagram nodes: Coin 1, Coin 2, Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
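The coin example is small enough to enumerate. The sketch below lists all four equally likely outcomes and conditions on different pieces of evidence; the function names are my own, and the score is taken to be 1 exactly when the coins match, as in the table.

```python
from itertools import product

def joint():
    """Yield (c1, c2, score, probability) for two independent fair coins."""
    for c1, c2 in product("HT", repeat=2):
        yield c1, c2, int(c1 == c2), 0.25

def prob_c1_heads(**evidence):
    """P(C1 = H | evidence), where evidence may fix c2 and/or score."""
    num = den = 0.0
    for c1, c2, score, p in joint():
        state = {"c2": c2, "score": score}
        if all(state[k] == v for k, v in evidence.items()):
            den += p
            if c1 == "H":
                num += p
    return num / den

print(prob_c1_heads())                   # no evidence
print(prob_c1_heads(c2="T"))             # C2 alone tells us nothing about C1
print(prob_c1_heads(score=1))            # the score alone tells us nothing either
print(prob_c1_heads(score=1, c2="T"))    # together they determine C1
```

Without the score, learning C2 leaves the belief in C1 at 0.5. Once the score is known, learning C2 changes the belief in C1 completely, even though neither coin causally affects the other.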

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data X_i
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

Learning Models from Data

[Diagram nodes: Coin 1, Coin 2, Score]

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H
T    T
H    T
T    H

D = {(C1, C2, S)} = {(H,T,0), (H,H,1), …}
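When every variable is observed, maximum-likelihood estimates for discrete parameters reduce to counting. The sketch below illustrates this for the coin probabilities. The data set is hypothetical: the slide gives only the first two observations, and the remaining three are invented here purely for illustration.

```python
from collections import Counter

# Hypothetical training data D as (C1, C2, Score) triples.
# Only the first two triples appear on the slide; the rest are made up.
data = [
    ("H", "T", 0),
    ("H", "H", 1),
    ("T", "T", 1),
    ("H", "H", 1),
    ("T", "H", 0),
]

def ml_estimate(values):
    """Maximum-likelihood estimate of a discrete distribution: relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

p_c1 = ml_estimate(c1 for c1, _, _ in data)
p_c2 = ml_estimate(c2 for _, c2, _ in data)

# Conditional table P(Score | C1, C2): count scores within each parent configuration.
p_score = {}
for parents in {(c1, c2) for c1, c2, _ in data}:
    p_score[parents] = ml_estimate(s for c1, c2, s in data if (c1, c2) == parents)

print(p_c1, p_c2, p_score)
```

With hidden variables or missing values, simple counting no longer applies, which is where search methods such as EM or Gibbs sampling come in.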

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

A worked "toy" example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

[Diagram nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)   (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

Prediction with Bayes Nets

[Diagram nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

S:
P(S=F)   P(S=T)
0.5      0.5

G (rough grading: being smart doesn't help very much):
S    P(G=F)   P(G=T)
F    0.9      0.1
T    0.8      0.2

R (GREs are better correlated with intelligence than the grades):
S    P(R=F)   P(R=T)
F    0.8      0.2
T    0.2      0.8

A:
G    R    P(A=F)   P(A=T)
F    F    0.9      0.1
F    T    0.8      0.2
T    F    0.5      0.5
T    T    0.2      0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)
           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
             (terms in order G,R = F,F; F,T; T,F; T,T)

Inference with Bayes Nets

[Diagram nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)
       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.23

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on the previous slide)
P(A=T|S=F) = 0.16 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T,A=T) / P(A=T)
Or, using Bayes' Rule: P(S=T|A=T) = P(S=T) P(A=T|S=T) / P(A=T)

Inference with Bayes Nets

[Diagram nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

P(R=F|A=F) = P(R=F,A=F) / P(A=F) = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S} Σ_{G} Σ_{R} P(S) P(G|S) P(R|S) P(A=F|G,R)   (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.94 / 0.55 = 1.7
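The "tedious but straightforward" sums in the worked example can be handed to a short enumeration routine. The sketch below encodes the four conditional probability tables from the "Prediction with Bayes Nets" slide and answers queries by summing the factored joint over all 16 states. The helper names (`bern`, `joint`, `prob`) are my own, and brute-force enumeration is just one simple way to do inference in a network this small.

```python
from itertools import product

# Conditional probability tables from the worked example (probability of True).
P_S = {True: 0.5, False: 0.5}                      # P(S)
P_G_T = {False: 0.1, True: 0.2}                    # P(G=T | S)
P_R_T = {False: 0.2, True: 0.8}                    # P(R=T | S)
P_A_T = {(False, False): 0.1, (False, True): 0.2,  # P(A=T | G, R)
         (True, False): 0.5, (True, True): 0.8}

def bern(p_true, value):
    """Probability of a binary value given the probability of True."""
    return p_true if value else 1.0 - p_true

def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return P_S[s] * bern(P_G_T[s], g) * bern(P_R_T[s], r) * bern(P_A_T[(g, r)], a)

def prob(query, evidence=None):
    """P(query | evidence) by summing the joint over all consistent states."""
    evidence = evidence or {}
    num = den = 0.0
    for s, g, r, a in product([False, True], repeat=4):
        state = {"S": s, "G": g, "R": r, "A": a}
        if all(state[k] == v for k, v in evidence.items()):
            p = joint(s, g, r, a)
            den += p
            if all(state[k] == v for k, v in query.items()):
                num += p
    return num / den

print(prob({"A": True}, {"S": True}))    # ≈ 0.292
print(prob({"A": True}, {"S": False}))   # ≈ 0.164
print(prob({"A": True}))                 # ≈ 0.228
print(prob({"S": True}, {"A": True}))    # ≈ 0.640
print(prob({"G": False}, {"A": False}))  # ≈ 0.938
print(prob({"R": False}, {"A": False}))  # ≈ 0.552
```

Under these tables a rejected student almost certainly had bad grades, but that is mostly because bad grades are common regardless of ability; bad GREs are less certain but more informative about the hidden variable S.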

End of worked example

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

Cause (often "hidden"): P1-P2 REAL
Effects (observed): Detected by X1, …, Detected by Xn

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: candidate patterns, one of them labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Give appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data


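$$\text{likelihood ratio} = \frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\, P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\, P(\mathrm{false\_PPI})}$$

If > 1, classify as true; if < 1, classify as false.

$$\text{log likelihood ratio} = \log \frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

$$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$$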

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \mathrm{pos})}{P(f \mid \mathrm{neg})}$$

[Table annotations: 1114/2150 and 81924/573734]
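The likelihood L for one feature value is just a ratio of two frequencies from the gold standard. The sketch below plugs in the two fractions annotated on the slide. Reading 1114/2150 as the positive-set fraction and 81924/573734 as the negative-set fraction is my interpretation of the annotations (the negative set being far larger), and the function name is hypothetical.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), with each term estimated as a frequency."""
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as annotated on the slide (assignment to pos/neg is assumed).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(f"P(f|pos) = {1114 / 2150:.3f}")
print(f"P(f|neg) = {81924 / 573734:.3f}")
print(f"L = {L:.2f}")
```

A value of L above 1 means the feature value is more frequent among gold-standard positives than negatives, so observing it raises the score of a candidate pair.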

Fully connected → Compute probabilities for all 16 possible combinations

Interpret with caution as numbers are small

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

110

Protein pairs in the essentiality data can take on three discrete values (EE both essential NN both non-essential and NE one essential and one not)

11142150

81924573734

Likelihood=L= )|()|(

negfPposfP

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected rarr Compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TF=FP

prediction based on single data type all have TPFPlt1

How many gold-standard events do we score correctly at different likelihood cutoffs

( | _ )log( | _ )

P Data true PPIP Data false PPI

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

117

MIT OpenCourseWarehttpocwmitedu

791J 20490J 20390J 736J 6802J 6874J HST506J Foundations of Computational and Systems BiologySpring 2014

For information about citing these materials or our Terms of Use visit httpocwmiteduterms


Comparative assessment of large-scale data sets of protein–protein interactions. von Mering et al., Nature 417, 399-403 (23 May 2002) | doi:10.1038/nature750

Instead of requiring an interaction to be detected in all assays we can rank by

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Von Mering, Christian, Roland Krause, et al. "Comparative Assessment of Large-scale Data Sets of Protein–protein Interactions." Nature 417, no. 6887 (2002): 399-403.

65

Collins et al., Mol Cell Proteomics 2007

ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives; y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

66

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

Collins et al., Mol Cell Proteomics 2007

ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives; y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD)

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67
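The two axes defined above can be sketched in a few lines of Python. This is only an illustration: the function name `roc_points` and the confidence scores are hypothetical, and the gold-standard positives and negatives are assumed to already be available as lists of scores.

```python
def roc_points(pos_scores, neg_scores):
    """Return (FPR, TPR) pairs, one per score cutoff, from gold-standard scores."""
    cutoffs = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]  # cutoff above every score: nothing is called an interaction
    for c in cutoffs:
        tp = sum(s >= c for s in pos_scores)  # gold positives called interacting
        fp = sum(s >= c for s in neg_scores)  # gold negatives called interacting
        fn = len(pos_scores) - tp
        tn = len(neg_scores) - fp
        points.append((fp / (tn + fp), tp / (tp + fn)))
    return points


# Hypothetical confidence scores, for illustration only
gold_pos = [0.9, 0.8, 0.6, 0.3]
gold_neg = [0.7, 0.4, 0.2, 0.1]
curve = roc_points(gold_pos, gold_neg)
```

Each point corresponds to one cutoff on the score; lowering the cutoff moves up and to the right along the curve.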


Collins et al., Mol Cell Proteomics 2007

9074 interactions

ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives; y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

Plot annotations: Literature curated (excluding 2-hybrid); Error rate equivalent to literature curated

68


Collins et al., Mol Cell Proteomics 2007

9074 interactions

ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives; y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes)

69


Outline

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation

• Signaling

• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities

• Consist of a graph (network) and a set of probabilities

• These can be “learned” from the data

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)


75

Bayesian Networks

You could try to write out all the joint probabilities P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown

• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1) ⇒ 2^N − 1 parameters
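To make the count concrete, the sketch below compares a full joint table with a factored network over binary variables. The helper names and the five-node "one hidden cause, four detection methods" structure are assumptions chosen for illustration, not taken from the slides.

```python
def full_joint_parameters(n_binary_vars):
    """Free parameters of a complete joint table: 2^N states minus one constraint."""
    return 2 ** n_binary_vars - 1


def bayes_net_parameters(parents):
    """Free parameters of a Bayes net over binary nodes.

    Each node needs one probability per configuration of its parents.
    `parents` maps a node name to the list of its parent names.
    """
    return sum(2 ** len(p) for p in parents.values())


# Hypothetical structure: one hidden cause (PPI) and four observed detections
naive = {"PPI": [], "X1": ["PPI"], "X2": ["PPI"], "X3": ["PPI"], "X4": ["PPI"]}

n_full = full_joint_parameters(len(naive))  # 2^5 - 1
n_factored = bayes_net_parameters(naive)    # 1 + 4 * 2
```

With five binary variables the full table needs 31 numbers, while this factored structure needs 9, and the gap widens exponentially as variables are added.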

77

Graphical Structure Expresses our Beliefs

[Diagram: P1-P2 REAL = cause (often “hidden”); Detected by X1, …, Detected by Xn = effects (observed)]

78

Graphical Structure Expresses our Beliefs

[Diagram, left; nodes: P1-P2 REAL; Detected by X1, …, Detected by Xn]

[Diagram, right; nodes: P1-P2 REAL; Detected by X1, …, Detected by Xn; Detected by Xj, Detected by Xj+1, …; Membrane; Highly expressed]

79

Graphical Structure Expresses our Beliefs

[Diagram, left; nodes: P1-P2 REAL; Detected by X1, …, Detected by Xn]

[Diagram, right; nodes: P1-P2 REAL; Detected by X1, …, Detected by Xn; Detected by Xj, Detected by Xj+1, …; Membrane; Highly expressed]

Naïve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs

[Diagram, left; nodes: P1-P2 REAL; Detected by X1, …, Detected by Xn]

[Diagram, right; nodes: P1-P2 REAL; Detected by X1, …, Detected by Xn; Detected by Xj, Detected by Xj+1, …; Membrane; Highly expressed]

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution

A node is independent of its ancestors given its parents

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2

82

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children)

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents)

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs)

We will find conditional probabilities to be very useful

84

Automated Admissions Decisions

[Diagram nodes: Smart, Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints

85
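The equivalence of the two formulations can be checked numerically. The sketch below is an illustration using the numbers from the two tables; the variable names are my own. It builds the joint table from P(S) and P(G|S), then recovers the conditional table from the joint.

```python
# Conditional formulation, values taken from the slide
p_s = {"F": 0.7, "T": 0.3}
p_g_given_s = {"F": {"F": 0.95, "T": 0.05},   # p_g_given_s[s][g] = P(G=g | S=s)
               "T": {"F": 0.2, "T": 0.8}}

# Joint formulation: P(S, G) = P(S) * P(G | S)
joint = {(s, g): p_s[s] * p_g_given_s[s][g] for s in "FT" for g in "FT"}

# Going back: marginalise G to get P(S), then divide to get P(G | S)
p_s_back = {s: joint[(s, "F")] + joint[(s, "T")] for s in "FT"}
p_g_back = {s: {g: joint[(s, g)] / p_s_back[s] for g in "FT"} for s in "FT"}
```

Both directions use three free numbers: three independent entries of the joint table, or P(S=T) plus one row entry for each value of S.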

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don’t need the full probability distribution

The joint probability depends only on “parents”

For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do

86

“Explaining Away”

[Diagram nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs

87

“Explaining Away”

[Diagram nodes: Coin 1, Coin 2, Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2

88
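The two-coin example is small enough to enumerate. The following sketch (helper names are my own) conditions on different pieces of evidence and shows that C2 only becomes informative about C1 once the score is known.

```python
from itertools import product


def score(c1, c2):
    """Score is 1 when the coins match, 0 otherwise (as in the table)."""
    return 1 if c1 == c2 else 0


# Joint over worlds (C1, C2, Score): two fair, independent coins
joint = {(c1, c2, score(c1, c2)): 0.25 for c1, c2 in product("HT", repeat=2)}


def prob(event, given=lambda w: True):
    """P(event | given), by summing over the enumerated worlds."""
    num = sum(p for w, p in joint.items() if given(w) and event(w))
    den = sum(p for w, p in joint.items() if given(w))
    return num / den


def c1_heads(w):
    return w[0] == "H"


p_prior = prob(c1_heads)                                           # no evidence
p_given_c2 = prob(c1_heads, lambda w: w[1] == "T")                 # C2 alone
p_given_score = prob(c1_heads, lambda w: w[2] == 1)                # score alone
p_given_both = prob(c1_heads, lambda w: w[2] == 1 and w[1] == "T") # score and C2
```

Without the score, learning C2 leaves the belief in C1 at 0.5; with Score = 1 known, learning C2 = T forces C1 = T.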

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?

• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior

• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

91

Learning Models from Data

[Diagram nodes: Coin 1, Coin 2, Score]

p(C1=H) =      p(C1=T) =
p(C2=H) =      p(C2=T) =

C1   C2   Score
H    H
T    T
H    T
T    H

D = (C1, C2, S) = (H,T,0), (H,H,1), …
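When every variable is observed, the maximum-likelihood estimates for a discrete network reduce to ratios of counts. The sketch below illustrates this for the coin network; only the first two data tuples come from the slide, and the remaining three are hypothetical additions so that every table row is observed at least once.

```python
from collections import Counter

# (C1, C2, Score) observations; the last three tuples are made up for illustration
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]


def mle_marginal(rows, index, value):
    """Maximum-likelihood estimate of P(variable = value) as a relative frequency."""
    return sum(1 for row in rows if row[index] == value) / len(rows)


def mle_score_table(rows):
    """Estimate P(Score = 1 | C1, C2) as a ratio of counts for each observed pair."""
    seen = Counter((c1, c2) for c1, c2, _ in rows)
    ones = Counter((c1, c2) for c1, c2, s in rows if s == 1)
    return {pair: ones[pair] / n for pair, n in seen.items()}


p_c1_heads = mle_marginal(data, 0, "H")
p_c2_heads = mle_marginal(data, 1, "H")
score_table = mle_score_table(data)
```

With hidden variables or missing observations the counts are no longer available directly, which is where iterative methods such as EM come in.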

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Diagram nodes: S = Smart, G = Grades, R = GREs, A = Admit]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

= P(S) P(G|S) P(R|S) P(A|G,R) (why?)

(because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Diagram nodes: S = Smart, G = Grades, R = GREs, A = Admit]

S:   P(F)   P(T)
     0.5    0.5

G:   S   P(F)   P(T)
     F   0.9    0.1
     T   0.8    0.2
Rough grading: being smart doesn’t help very much

R:   S   P(F)   P(T)
     F   0.8    0.2
     T   0.2    0.8
GREs are better correlated with intelligence than the grades

A:   G   R   P(F)   P(T)
     F   F   0.9    0.1
     F   T   0.8    0.2
     T   F   0.5    0.5
     T   T   0.2    0.8

P(A=T|S=T) = Σ_G Σ_R P(G|S=T) P(R|S=T) P(A=T|G,R), summing over G = F,T and R = F,T

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29

(terms in the order (G,R) = (F,F), (F,T), (T,F), (T,T))

97

Inference with Bayes Nets

[Diagram nodes: S = Smart, G = Grades, R = GREs, A = Admit]

P(A=T) = Σ_S Σ_G Σ_R P(S) P(G|S) P(R|S) P(A=T|G,R), summing over S, G, R = F,T

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

P(S=T) = 0.5, P(S=F) = 0.5

P(A=T|S=T) calculated on previous slide = 0.29

P(A=T|S=F) calculated analogously to P(A=T|S=T) = 0.14

P(S=T|A=T) = P(S=T, A=T) / P(A=T)

Or using Bayes Rule: = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Diagram nodes: S = Smart, G = Grades, R = GREs, A = Admit]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F) = [Σ_S Σ_G P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F), summing over S = F,T and G = F,T

P(A=F) = Σ_S Σ_G Σ_R P(S) P(G|S) P(R|S) P(A=F|G,R) (as before)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6
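The "tedious but straightforward" sums of the worked example can be left to a short enumeration. The sketch below is an illustration, not the lecture's code: the function names are my own, and the tables are typed in as they appear on the Prediction slide. On those tables it gives about 0.292 for P(A=T|S=T) and about 0.164 for P(A=T|S=F), so it does not exactly reproduce the figures quoted on the slides for P(A=T|S=F) and P(A=T); the qualitative conclusion (bad grades are the more likely explanation) is the same.

```python
from itertools import product

# Conditional probability tables as printed on the Prediction slide
P_S = {"F": 0.5, "T": 0.5}
P_G = {"F": {"F": 0.9, "T": 0.1}, "T": {"F": 0.8, "T": 0.2}}  # P_G[s][g] = P(G=g | S=s)
P_R = {"F": {"F": 0.8, "T": 0.2}, "T": {"F": 0.2, "T": 0.8}}  # P_R[s][r] = P(R=r | S=s)
P_A = {("F", "F"): {"F": 0.9, "T": 0.1}, ("F", "T"): {"F": 0.8, "T": 0.2},
       ("T", "F"): {"F": 0.5, "T": 0.5}, ("T", "T"): {"F": 0.2, "T": 0.8}}  # P_A[(g, r)][a]


def joint(s, g, r, a):
    """Chain rule for this network: P(S) P(G|S) P(R|S) P(A|G,R)."""
    return P_S[s] * P_G[s][g] * P_R[s][r] * P_A[(g, r)][a]


def prob(query, evidence=None):
    """P(query | evidence) by brute-force enumeration of all 16 worlds."""
    evidence = evidence or {}
    num = den = 0.0
    for s, g, r, a in product("FT", repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if any(world[k] != v for k, v in evidence.items()):
            continue  # world is inconsistent with the evidence
        p = joint(s, g, r, a)
        den += p
        if all(world[k] == v for k, v in query.items()):
            num += p
    return num / den


p_admit_if_smart = prob({"A": "T"}, {"S": "T"})
p_smart_if_admit = prob({"S": "T"}, {"A": "T"})
p_bad_grades = prob({"G": "F"}, {"A": "F"})
p_bad_gres = prob({"R": "F"}, {"A": "F"})
```

Enumeration is exponential in the number of variables, so it only works for toy networks like this one.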

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: P1-P2 REAL = cause (often “hidden”); Detected by X1, …, Detected by Xn = effects (observed)]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1.5: 349-356

INT = high confidence interactions from small scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure annotation: More likely to interact]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical

• Accommodates missing data

• Gives appropriate weights to different sources

• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

108


109

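$$\text{likelihood ratio} = \frac{P(\mathrm{Data}\mid \mathrm{true\_PPI})}{P(\mathrm{Data}\mid \mathrm{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

$$\text{log likelihood ratio} = \log\frac{P(\mathrm{Data}\mid \mathrm{true\_PPI})}{P(\mathrm{Data}\mid \mathrm{false\_PPI})}$$

Prior probability is the same for all interactions -- does not affect ranking

$$\text{Ranking function} = \log\frac{P(\mathrm{Data}\mid \mathrm{true\_PPI})}{P(\mathrm{Data}\mid \mathrm{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\mathrm{Observation}_i\mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i\mid \mathrm{false\_PPI})}$$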
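Under the naïve Bayes independence assumption, the ranking function is a sum of per-observation log ratios. The sketch below illustrates that; the dataset names and all probabilities in it are hypothetical placeholders, not values from Jansen et al.

```python
import math


def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum over data sources of log P(obs_i | true_PPI) / P(obs_i | false_PPI).

    A source with missing data is simply left out of `observations`,
    so it contributes nothing to the sum.
    """
    return sum(
        math.log(p_given_true[name][value] / p_given_false[name][value])
        for name, value in observations.items()
    )


# Hypothetical detection probabilities estimated from gold-standard sets
p_true = {"two_hybrid": {True: 0.3, False: 0.7},
          "affinity_capture": {True: 0.5, False: 0.5}}
p_false = {"two_hybrid": {True: 0.01, False: 0.99},
           "affinity_capture": {True: 0.05, False: 0.95}}

both_detected = log_likelihood_ratio(
    {"two_hybrid": True, "affinity_capture": True}, p_true, p_false)
only_two_hybrid = log_likelihood_ratio({"two_hybrid": True}, p_true, p_false)
```

Candidate pairs can then be sorted by this score; because the prior is the same for every pair, adding it would shift all scores by a constant and leave the order unchanged.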

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

Likelihood = L = P(f | pos) / P(f | neg) = (1114/2150) / (81924/573734)
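As a small worked illustration, the likelihood ratio for one feature value is the fraction of gold-standard positives showing it divided by the fraction of gold-standard negatives showing it. The counts below assume that the two number strings on this slide read as 1114 of 2150 positives and 81924 of 573734 negatives; that split is an assumption about the extracted digits, and the function name is my own.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), each probability estimated as a fraction of counts."""
    return (pos_with_f / pos_total) / (neg_with_f / neg_total)


# Assumed reading of the slide's numbers (see note above)
L_example = likelihood_ratio(1114, 2150, 81924, 573734)
```

A value above 1 means the feature value is enriched among true interactions relative to non-interacting pairs.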

111

112


113

Fully connected → Compute probabilities for all 16 possible combinations


114

Interpret with caution as numbers are small


115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\mathrm{Data}\mid \mathrm{true\_PPI})}{P(\mathrm{Data}\mid \mathrm{false\_PPI})}$$


116

Summary

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

TP(T

P+F

N)

ROC curve

True positives = True negatives =

Source Collins Sean R Patrick Kemmeren et al Toward a Comprehensive Atlas of the PhysicalInteractome of Saccharomyces Cerevisiae Molecular amp Cellular Proteomics 6 no 3 (2007) 439-50

66

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Collins et al Mol Cell Proteomics 2007

Compute using high-confidence negatives FP(TN+FP)

e de

nc )es

onfi ex

pl-c omgh c

ng h

i omfri

us (es

pute

vtiipo

sC

om

ROC curve

)FN+

(TP

TPROC curve

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD) True negatives = proteins pairs that 1 are annotated to belong to distinct complexes 2 have different sub-cellular locations OR anticorrelated mRNA expression

68 67

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

Literature curated (excluding 2-hybrid)

Error rate equivalent to

literature curated

68

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

bull Bayesian Networks are a tool for reasoning with probabilities

bull Consist of a graph (network) and a set of probabilities

bull These can be ldquolearnedrdquo from the data

Bayesian Networks

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|Y2HUetz Y2HIto IPGavin IPHo)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown

• N binary variables = $2^N$ states
  – only one constraint (sum of all probabilities = 1)
  ⇒ $2^N - 1$ parameters

77
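To make the size argument concrete, the following minimal sketch compares the parameter count of a full joint table with that of a network in which one binary hidden cause has several binary observed effects. The function names are hypothetical, and the second count assumes the effects are conditionally independent given the cause.

```python
def full_joint_params(n_vars: int) -> int:
    """Free parameters in a full joint table over n_vars binary variables."""
    return 2 ** n_vars - 1


def single_cause_params(n_effects: int) -> int:
    """Free parameters for one binary cause with n_effects binary effects.

    One parameter for the prior on the cause, plus two per effect
    (one for each state of the cause). Assumes the effects are
    conditionally independent given the cause.
    """
    return 1 + 2 * n_effects


if __name__ == "__main__":
    for n_effects in (4, 10, 20):
        n_vars = n_effects + 1  # the effects plus the hidden cause
        print(n_effects, full_joint_params(n_vars), single_cause_params(n_effects))
```

The gap between the two counts grows exponentially with the number of observed variables, which is the motivation for imposing a graphical structure.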

Graphical Structure Expresses our Beliefs

[Diagram: the cause (often “hidden”) node "P1-P2 REAL" points to the effect (observed) nodes "Detected by X1", …, "Detected by Xn".]

78

Graphical Structure Expresses our Beliefs

[Diagram: naïve Bayes network ("P1-P2 REAL" with effect nodes "Detected by X1", …, "Detected by Xn") beside a network that adds the nodes "Membrane" and "Highly expressed" alongside "Detected by Xj", "Detected by Xj+1", ….]

79

Graphical Structure Expresses our Beliefs

[Diagram: naïve Bayes network ("P1-P2 REAL" with effect nodes "Detected by X1", …, "Detected by Xn") beside a network that adds the nodes "Membrane" and "Highly expressed" alongside "Detected by Xj", "Detected by Xj+1", ….]

Naïve Bayes assumes all observations are independent (given whether the interaction is real).

But some observations may be coupled.

80

Graphical Structure Expresses our Beliefs

[Diagram: naïve Bayes network ("P1-P2 REAL" with effect nodes "Detected by X1", …, "Detected by Xn") beside a network that adds the nodes "Membrane" and "Highly expressed" alongside "Detected by Xj", "Detected by Xj+1", ….]

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution.

A node is independent of its other ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Diagram: Bayesian network with nodes "Smart", "Grade inflation", "Grades", "GREs", and "Admit".]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram: Bayesian network with nodes "Smart", "Grade inflation", "Grades", "GREs", and "Admit".]

Making predictions/inferences requires knowing the joint probabilities: P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram: "Smart" → "Grades".]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

The two formulations are equivalent, and both require the same number of constraints (three free parameters each).

85
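The equivalence of the two formulations can be checked numerically. This is a small sketch using the numbers from the slide; the dictionary layout and variable names are illustrative choices.

```python
# Prior on Smart (S) and conditional table for Grades (G) given S,
# using the values shown on the slide.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Rebuild the joint table with P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

if __name__ == "__main__":
    for (s, g), p in joint.items():
        print(f"S={s!s:5} G={g!s:5} P(S,G)={p:.3f}")
```

The four products reproduce the joint table (0.665, 0.035, 0.06, 0.24), and they sum to 1.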

Automated Admissions Decisions

[Diagram: Bayesian network with nodes "Smart", "Grade inflation", "Grades", "GREs", and "Admit".]

In a Bayesian Network we don’t need the full probability distribution. The probability of each node depends only on its “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

Admit

86

“Explaining Away”

[Diagram: network with nodes "Season", "Sprinkler" (S), "Rain" (R), "Grass wet", and "Slippery".]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about the sprinkler.

87

“Explaining Away”

[Diagram: "Coin 1" and "Coin 2" → "Score".]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.

88
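The coin example can be worked by brute-force enumeration. In this sketch the helper names (`joint`, `prob`) are hypothetical; the model is the one on the slide: two fair coins, and a score of 1 when they match.

```python
from itertools import product


def joint():
    """Yield (c1, c2, score, probability) for two fair, independent coins."""
    for c1, c2 in product("HT", repeat=2):
        score = 1 if c1 == c2 else 0
        yield c1, c2, score, 0.25


def prob(event, given=lambda c1, c2, s: True):
    """P(event | given), computed by summing over the joint distribution."""
    num = sum(p for c1, c2, s, p in joint() if given(c1, c2, s) and event(c1, c2, s))
    den = sum(p for c1, c2, s, p in joint() if given(c1, c2, s))
    return num / den


def c1_heads(c1, c2, s):
    return c1 == "H"


p_marginal = prob(c1_heads)
p_given_c2 = prob(c1_heads, lambda c1, c2, s: c2 == "T")
p_given_score = prob(c1_heads, lambda c1, c2, s: s == 1)
p_given_both = prob(c1_heads, lambda c1, c2, s: s == 1 and c2 == "T")

if __name__ == "__main__":
    print(p_marginal, p_given_c2, p_given_score, p_given_both)
```

Without the score, learning C2 leaves the belief about C1 at 0.5. Once the score is known to be 1, learning that C2 = T drives P(C1 = H) to 0: the coins are independent a priori but become coupled when their common effect is observed.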

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?

• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data X_i
  – Maximum posterior

• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91
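The likelihood formula itself does not appear in the extracted slide text. A standard way to write the two objectives, assuming independent training examples $X_i$ and writing the model parameters as $\theta$ (a symbol introduced here, not taken from the slide), is:

```latex
\hat{\theta}_{\mathrm{ML}}
  = \arg\max_{\theta} \prod_{i} P(X_i \mid \theta)
\qquad
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta} \; P(\theta) \prod_{i} P(X_i \mid \theta)
```

The maximum-posterior objective differs from maximum likelihood only by the prior term $P(\theta)$.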

Learning Models from Data

[Diagram: "Coin 1" and "Coin 2" → "Score".]

p(C1=H) =        p(C1=T) =
p(C2=H) =        p(C2=T) =

C1   C2   Score
H    H
T    T
H    T
T    H

D = (C1, C2, S) = (H, T, 0), (H, H, 1), …

92
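For a network of this kind with fully observed data, maximum-likelihood estimates reduce to counting. The sketch below assumes a small hypothetical dataset: only the first two observations, (H, T, 0) and (H, H, 1), come from the slide, and the rest are made up for illustration.

```python
from collections import Counter

# Observations (C1, C2, Score). Only the first two appear on the slide;
# the remaining rows are hypothetical.
data = [
    ("H", "T", 0),
    ("H", "H", 1),
    ("T", "T", 1),
    ("T", "H", 0),
    ("H", "H", 1),
    ("H", "T", 0),
]


def fit_by_counting(observations):
    """Maximum-likelihood estimates for the coin network from complete data."""
    n = len(observations)
    p_c1_heads = sum(c1 == "H" for c1, _, _ in observations) / n
    p_c2_heads = sum(c2 == "H" for _, c2, _ in observations) / n
    pair_counts = Counter((c1, c2) for c1, c2, _ in observations)
    score_one = Counter((c1, c2) for c1, c2, s in observations if s == 1)
    # P(Score = 1 | C1, C2), for the coin pairs that occur in the data.
    p_score = {pair: score_one[pair] / count for pair, count in pair_counts.items()}
    return p_c1_heads, p_c2_heads, p_score


p_c1_heads, p_c2_heads, p_score = fit_by_counting(data)

if __name__ == "__main__":
    print(p_c1_heads, p_c2_heads, p_score)
```

With hidden variables or missing entries, simple counting no longer suffices, which is where iterative methods such as EM come in.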

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Diagram: network with nodes S (Smart), G (Grades), R (GREs), and A (Admit).]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

           = P(S) P(G|S) P(R|S) P(A|G,R)   (why?)

(because of the conditional independence assumptions)

Recall: P(X,Y) = P(X|Y) P(Y)

96

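Prediction with Bayes Nets

[Diagram: network with nodes S (Smart), G (Grades), R (GREs), and A (Admit).]

P(S):
P(F)   P(T)
0.5    0.5

P(G|S):
S    P(F)   P(T)
F    0.9    0.1
T    0.8    0.2

Rough grading: being smart doesn’t help very much.

P(R|S):
S    P(F)   P(T)
F    0.8    0.2
T    0.2    0.8

GREs are better correlated with intelligence than the grades.

P(A|G,R):
G    R    P(F)   P(T)
F    F    0.9    0.1
F    T    0.8    0.2
T    F    0.5    0.5
T    T    0.2    0.8

$$P(A=T \mid S=T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S=T)\, P(R \mid S=T)\, P(A=T \mid G, R)$$

$$= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29$$

The four terms correspond to (G, R) = (F, F), (F, T), (T, F), (T, T).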

97
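The sum on this slide can be reproduced in a few lines. This is a minimal sketch that hard-codes the conditional probability tables as transcribed above; the dictionary layout and the function name are illustrative.

```python
from itertools import product

# Conditional probability tables from the slide.
# P_G[s][g] = P(G=g | S=s), P_R[s][r] = P(R=r | S=s),
# P_A_TRUE[(g, r)] = P(A=T | G=g, R=r).
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_A_TRUE = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def p_admit_given_smart(s: bool) -> float:
    """P(A=T | S=s): sum over the unobserved intermediate nodes G and R."""
    return sum(
        P_G[s][g] * P_R[s][r] * P_A_TRUE[(g, r)]
        for g, r in product((False, True), repeat=2)
    )


if __name__ == "__main__":
    print(round(p_admit_given_smart(True), 3))
```

For S = T this gives 0.292, matching the 0.29 on the slide.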

Inference with Bayes Nets

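[Diagram: network with nodes S (Smart), G (Grades), R (GREs), and A (Admit).]

$$P(A=T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A=T \mid G, R)$$

$$= P(S=T)\, P(A=T \mid S=T) + P(S=F)\, P(A=T \mid S=F) = 0.21$$

P(S=T) = 0.5, P(S=F) = 0.5. P(A=T|S=T), calculated on the previous slide, = 0.29. P(A=T|S=F), calculated analogously to P(A=T|S=T), = 0.14.

$$P(S=T \mid A=T) = \frac{P(S=T, A=T)}{P(A=T)}$$

Or, using Bayes’ Rule:

$$= \frac{P(S=T)\, P(A=T \mid S=T)}{P(A=T)}$$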

98
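The inference step can be sketched the same way: marginalize to get P(A=T), then apply Bayes' rule. The tables are those transcribed on the prediction slide, and the names are illustrative. Note that, computed directly from the tables as transcribed, P(A=T|S=F) comes out as 0.164 and P(A=T) as 0.228, slightly different from the 0.14 and 0.21 quoted on the slide; the slide figures may have been rounded or based on slightly different tables.

```python
from itertools import product

P_S = {False: 0.5, True: 0.5}
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_A_TRUE = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def p_admit_given_smart(s: bool) -> float:
    """P(A=T | S=s), summing over G and R."""
    return sum(
        P_G[s][g] * P_R[s][r] * P_A_TRUE[(g, r)]
        for g, r in product((False, True), repeat=2)
    )


# Marginal probability of admission: sum over the hidden cause S.
p_admit = sum(P_S[s] * p_admit_given_smart(s) for s in (False, True))

# Bayes' rule: P(S=T | A=T) = P(S=T) P(A=T | S=T) / P(A=T).
p_smart_given_admit = P_S[True] * p_admit_given_smart(True) / p_admit

if __name__ == "__main__":
    print(round(p_admit, 3), round(p_smart_given_admit, 3))
```

Under these tables, meeting an admitted student raises the probability that she is smart from the prior of 0.5 to about 0.64.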

Inference with Bayes Nets

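[Diagram: network with nodes S (Smart), G (Grades), R (GREs), and A (Admit).]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

$$P(R=F \mid A=F) = \frac{P(R=F, A=F)}{P(A=F)} = \frac{\sum_{S} \sum_{G} P(S)\, P(G \mid S)\, P(R=F \mid S)\, P(A=F \mid G, R=F)}{P(A=F)}$$

$$P(A=F) = \sum_{S} \sum_{G} \sum_{R} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A=F \mid G, R) \quad \text{(as before)}$$

Tedious but straightforward to compute.

$$\frac{P(G=F \mid A=F)}{P(R=F \mid A=F)} = \frac{0.92}{0.56} = 1.6$$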

99
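The "tedious but straightforward" computation is easy to automate by enumerating the full joint distribution. This sketch assumes the tables transcribed on the prediction slide; `joint` and `prob` are hypothetical helper names. The values it produces (about 0.94 and 0.55) are close to, but not identical with, the 0.92 and 0.56 quoted on the slide.

```python
from itertools import product

BOOLS = (False, True)
P_S = {False: 0.5, True: 0.5}
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_A_TRUE = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a = P_A_TRUE[(g, r)] if a else 1.0 - P_A_TRUE[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * p_a


def prob(query, evidence):
    """P(query | evidence); both are predicates over (s, g, r, a)."""
    num = den = 0.0
    for state in product(BOOLS, repeat=4):
        if evidence(*state):
            p = joint(*state)
            den += p
            if query(*state):
                num += p
    return num / den


def not_admitted(s, g, r, a):
    return not a


p_bad_grades = prob(lambda s, g, r, a: not g, not_admitted)
p_bad_gres = prob(lambda s, g, r, a: not r, not_admitted)

if __name__ == "__main__":
    print(round(p_bad_grades, 3), round(p_bad_gres, 3))
```

Either way the conclusion is the same: a rejected student is more likely to have had bad grades than bad GREs, largely because bad grades are common in this model regardless of ability.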

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: the cause (often “hidden”) node "P1-P2 REAL" points to the effect (observed) nodes "Detected by X1", …, "Detected by Xn".]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: two patterns, one of which is labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical

• Accommodates missing data

• Gives appropriate weights to different sources

• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

108

109

Likelihood ratio: if > 1, classify as true; if < 1, classify as false.

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function = log likelihood ratio:

$$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$$

110
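A ranking function of this form can be sketched as a sum of per-source log likelihood ratios. Everything numeric below is hypothetical: the source names and the probabilities in `LIKELIHOODS` are made-up illustrations, not values from the lecture or from Jansen et al. The sketch also assumes the observations are conditionally independent (naïve Bayes).

```python
import math

# Hypothetical per-source likelihoods:
# source -> (P(detected | true PPI), P(detected | false PPI)).
LIKELIHOODS = {
    "two_hybrid": (0.30, 0.02),
    "affinity_capture": (0.50, 0.05),
    "coexpression": (0.60, 0.30),
}


def log_likelihood_ratio(observations):
    """Sum of per-source log likelihood ratios under the naive Bayes assumption.

    `observations` maps a source name to True (detected) or False (not
    detected). Sources absent from the mapping are treated as missing
    data and contribute nothing to the score.
    """
    total = 0.0
    for source, detected in observations.items():
        p_true, p_false = LIKELIHOODS[source]
        if not detected:
            p_true, p_false = 1.0 - p_true, 1.0 - p_false
        total += math.log(p_true / p_false)
    return total


# Hypothetical candidate pairs with different evidence patterns.
candidates = {
    "pair_a": {"two_hybrid": True, "affinity_capture": True, "coexpression": True},
    "pair_b": {"two_hybrid": False, "affinity_capture": True},
    "pair_c": {"two_hybrid": False, "affinity_capture": False, "coexpression": False},
}
ranking = sorted(candidates, key=lambda k: log_likelihood_ratio(candidates[k]), reverse=True)

if __name__ == "__main__":
    for name in ranking:
        print(name, round(log_likelihood_ratio(candidates[name]), 2))
```

Because the prior is the same for every pair, adding the log prior odds would shift every score by the same constant and leave the ranking unchanged.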

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

[Figure annotations: 1,114 / 2,150 and 81,924 / 573,734]

$$\text{Likelihood} = L = \frac{P(f \mid \mathrm{pos})}{P(f \mid \mathrm{neg})}$$

111
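As a sketch of how L is obtained from counts, the snippet below reads the two annotations on this slide as fractions of the gold-standard positives and negatives that carry a given essentiality value. That reading is an assumption: the extracted digits are run together, and the slide text does not say which of the three values (EE, NN, NE) they refer to.

```python
# Assumed reading of the slide annotations (see the caveat above):
# 1,114 of 2,150 gold-standard positives and 81,924 of 573,734
# gold-standard negatives share the same essentiality value f.
pos_with_f, pos_total = 1114, 2150
neg_with_f, neg_total = 81924, 573734

p_f_given_pos = pos_with_f / pos_total
p_f_given_neg = neg_with_f / neg_total

# Likelihood ratio L = P(f | pos) / P(f | neg).
likelihood_ratio = p_f_given_pos / p_f_given_neg

if __name__ == "__main__":
    print(round(p_f_given_pos, 3), round(p_f_given_neg, 3), round(likelihood_ratio, 1))
```

Under this reading, L is above 1, so observing this value of f counts as evidence in favor of a true interaction.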

112

113

Fully connected → compute probabilities for all 16 possible combinations

114

Interpret with caution as numbers are small

115

TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

116

Summary

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

ROC curve

Collins et al., Mol. Cell. Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

True positives = interactions between proteins that occur in a complex annotated in a human-curated database (MIPS or SGD).

True negatives = protein pairs that
1. are annotated to belong to distinct complexes
2. have different sub-cellular locations OR anticorrelated mRNA expression

67
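The two axes of the ROC curve can be computed directly from scored gold-standard sets. The sketch below uses made-up scores; the function name and the example data are hypothetical, and only the axis definitions, FP/(TN+FP) and TP/(TP+FN), come from the slide.

```python
def roc_points(scores_pos, scores_neg, thresholds):
    """Return (false positive rate, true positive rate) for each threshold.

    scores_pos: scores of high-confidence positive pairs.
    scores_neg: scores of high-confidence negative pairs.
    A pair is predicted to interact when its score is >= the threshold.
    """
    points = []
    for t in thresholds:
        tp = sum(s >= t for s in scores_pos)
        fn = len(scores_pos) - tp
        fp = sum(s >= t for s in scores_neg)
        tn = len(scores_neg) - fp
        points.append((fp / (tn + fp), tp / (tp + fn)))
    return points


# Hypothetical scores for illustration only.
example_pos = [0.9, 0.8, 0.4]
example_neg = [0.7, 0.3, 0.2, 0.1]
example_curve = roc_points(example_pos, example_neg, [1.0, 0.5, 0.0])

if __name__ == "__main__":
    for fpr, tpr in example_curve:
        print(round(fpr, 2), round(tpr, 2))
```

Sweeping the threshold from high to low traces the curve from (0, 0) to (1, 1); a better scoring scheme reaches a high true positive rate while the false positive rate is still low.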

ROC curve

Collins et al., Mol. Cell. Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "9074 interactions"; "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]

68

ROC curve

Collins et al., Mol. Cell. Proteomics 2007

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes). Annotation: "9074 interactions".]

69

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

bull Bayesian Networks are a tool for reasoning with probabilities

bull Consist of a graph (network) and a set of probabilities

bull These can be ldquolearnedrdquo from the data

Bayesian Networks

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|Y2HUetz Y2HIto IPGavin IPHo)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|RosettaCCGOMIPSESS) But some problems are too big

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

bull Complete joint probability tables are large and often unknown

bull N binary variables = 2N states ndash only one constraint (sum of all probabilities =1)

=gt 2N - 1 parameters

77

Graphical Structure Expresses our Beliefs

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed)

78

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naiumlve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

bull The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure In a Bayesian Network we donrsquot need the full probability distribution A node is independent of its ancestors given its parents For example The activity of a gene does not depend on the activity of TF A1 once I know TF B2 82

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Prediction we observe the ldquocausesrdquo (roots parents) and want to predict the ldquoeffectsrdquo (leaveschildren)

Given Grades GREs will we admit

Inference we observe the ldquoeffectsrdquo (leaves children) and want to infer the hidden values of the ldquocausesrdquo (rootsparents)

You meet an admitted the student Is she smart

Admit

83

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictionsinferences requires knowing the joint probabilities P(admit grade inflation smart grades GREs) We will find conditional probabilities to be very useful

84

S G (above

threshold)

P(SG)

F F 0665

F T 0035

T F 006

T T 024

Automated Admissions Decisions Joint Probability

S P(G=F|S) P(G=T|S)

F 095 005

T 02 08

P(S=F)=07 P(S=T)=03

Grades

Smart

Conditional Probability

Formulations are equivalent and both require same number of constraints 85

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we donrsquot need the full probability distribution The joint probability depends only on ldquoparentsrdquo For example GRE scores do not depend on the level of grade inflation at the school but the grades do

Admit

86

ldquoExplaining Awayrdquo Season

S Sprinkler

R Rain

Grass wet

Does the probability that itrsquos raining depend on whether the sprinkler is on

Slippery

In a causal sense clearly not But in a probabilistic model the knowledge that it is raining influences our beliefs

87

Coin 1

Coin 2

Score

p(C2=H)=p(C2=T) = 5 p(C1=H)=p(C1=T) = 5

C1 C2 Score

H H 1

T T 1

H T 0

T H 0

Does the prob that C1 =H on depend on whether C2 =T If we know the score then our belief in the state of C1 is influenced by our belief in the state C2

ldquoExplaining Awayrdquo

88

How do we obtain a BN

bull Two problems ndash learning graph structure

bull NP-complete bull approximation algorithms

ndash probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

bull Two common objective functions ndash Maximum likelihood

bull Define the likelihood over training data Xi

ndash Maximum posterior

bull Good search algorithms exist ndash Gradient descent EM Gibbs Sampling hellip

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D= (C1C2S) = (HT0) (HH1) hellip

92

Learning Models from Data

bull Searching for the BN structure NP-complete ndash Too many possible structures to evaluate all of

them even for very small networks ndash Many algorithms have been proposed ndash Incorporated some prior knowledge can reduce

the search space bull Which measurements are likely independent bull Which nodes should regulate transcription bull Which should cause changes in phosphorylation

93

Resources to learn more

Kevin Murphyrsquos tutorial httpwwwcsubcca~murphykBayesbnintrohtml

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets S

P(F) P(T) 05 05

G S P(F) P(T) F 09 01 T 08 02 R

S P(F) P(T) F 08 02 T 02 08

A G R P(F) P(T) F F 09 01 F T 08 02 T F 05 05 T T 02 08

P(A=T|S=T) = ΣΣ P(G|S=T)P(R|S=T)P(A=T|GR) G=FT R=FT

= (08)(02)(01) + (08)(08)(02) + (02)(02)(05) + (02)(08)(08) =29 F F F T T F T T

Rough grading being smart doesnrsquot help very much

GREs are better correlated with intelligence than the grades

G Grade

s

S Smart

R GREs

A Admit

97

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(A=T) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) S=FT G=FT

R=FT

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) = 021 P(S=T) = 05 P(S=F) = 05 P(A=T|S=F) calculated analogously to P(A=T|S=T)

P(S=T) = 05 P(A=T|S=T) calculated on previous slide =29 P(A=T|S=F) =14

P(S=T|A=T) = P(S=TA=T)P(A=T) Or using Bayes Rule = P(S=T)P(A=T|S=T)P(A=T)

98

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

If a student is not admitted is it more likely they had bad GREs or bad grades

Compute P(R=F|A=F) and P(G=F|A=T)

P(R=F|A=F) = P(R=FA=F) P(A=F) = [Σ Σ P(S)P(G=F)P(R)P(A=F)]P(A=F) G=FT

R=FT

P(A=F) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) (as before)

Tedious but straightforward to compute

P(G=F|A=T) = 92 = 16 P(R=F|A=F) 56

99

End of worked example

100

Goal

bull Estimate interaction probability using ndash Affinity capture ndash Two-hybrid ndash Less physical data

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed) 101

Properties of real interactions correlated expression Expression Profile Reliability (EPR)

Deane et al Mol amp Cell Proteomics (2002) 15 349-356

INT = high confidence interactions from

small scale experiments

d = ldquodistancerdquo that measures the difference between two mRNA expression profiles

Note proteins involved in ldquotruerdquo protein-protein interactions have more similar mRNA expression profiles than random pairs Use this to assess how good an experimental set

of interactions is

Source Deane Charlotte M Łukasz Salwiński et al Protein Interactions Two Methods for Aassessment ofthe Reliability of High Throughput Observations Molecular amp Cellular Proteomics 1 no 5 (2002) 349-56

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al BMC Bioinformatics 2007 8(Suppl 4)S7 doi1011861471-2105-8-S4-S7

More likely to interact

Which pattern below is more likely to represent a pair of interacting proteins

104

Courtesy of Cokus et al License CC-BYSource Cokus Shawn Sayaka Mizutani et al An Improved Method for Identifying FunctionallyLinked Proteins using Phylogenetic Profiles BMC Bioinformatics 8 no Suppl 4 (2007) S7

103

Rosetta Stone

bull Look for genes that are fused in some organisms ndash Almost 7000 pairs found in

E coli ndash gt6 of known interactions

can be found with this method

ndash Not very common in eukaryotes

104

Integrating diverse data

httpwwwsciencemagorgcontent3025644449abstract 105

Advantage of Bayesian Networks

bull Data can be a mix of types numerical and categorical

bull Accommodates missing data bull Give appropriate weights to different sources bull Results can be interpreted easily

106

Requirement of Bayesian Classification

bull Gold standard training data ndash Independent from evidence ndash Large ndash No systematic bias

Positive training data MIPS bull Hand-curated from literature Negative training data bull Proteins in different subcellular compartments

107

Integrating diverse data

109

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

110

Protein pairs in the essentiality data can take on three discrete values (EE both essential NN both non-essential and NE one essential and one not)

11142150

81924573734

Likelihood=L= )|()|(

negfPposfP

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected rarr Compute probabilities for all 16 possible combinations

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

114

Interpret with caution as numbers are small

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic DataScience 302 no 5644 (2003) 449-53

115

TF=FP

prediction based on single data type all have TPFPlt1

How many gold-standard events do we score correctly at different likelihood cutoffs

( | _ )log( | _ )

P Data true PPIP Data false PPI

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

116

Summary

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

117

MIT OpenCourseWarehttpocwmitedu

791J 20490J 20390J 736J 6802J 6874J HST506J Foundations of Computational and Systems BiologySpring 2014

For information about citing these materials or our Terms of Use visit httpocwmiteduterms


ROC curve

Collins et al., Mol Cell Proteomics 2007 (9074 interactions)

[Figure: ROC curve. x-axis: false positive rate, FP/(TN+FP), computed using high-confidence negatives. y-axis: true positive rate, TP/(TP+FN), computed using high-confidence positives (from complexes). Annotations: "Literature curated (excluding 2-hybrid)"; "Error rate equivalent to literature curated".]

68

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Collins, Sean R., Patrick Kemmeren, et al. "Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae." Molecular & Cellular Proteomics 6, no. 3 (2007): 439-50.

ROC curve

Collins et al., Mol Cell Proteomics 2007 (9074 interactions)

[Figure: ROC curve. x-axis: FP/(TN+FP), computed using high-confidence negatives. y-axis: TP/(TP+FN), computed using high-confidence positives (from complexes).]

69

Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• They consist of a graph (network) and a set of probabilities
• These can be “learned” from the data

73

Bayesian Networks

A “natural” way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

Source: Jansen et al., Science 302, no. 5644 (2003): 449-53.

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  ⇒ 2^N − 1 parameters

77
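To make the size of a complete joint table concrete, here is a minimal sketch that compares the 2^N − 1 free parameters of a full joint table with the count for a network in which one binary cause has conditionally independent binary effects. The function names are hypothetical, and the simple one-cause structure is an assumption chosen for illustration.

```python
def full_joint_params(n_vars: int) -> int:
    """Free parameters in a complete joint table over n binary variables."""
    return 2 ** n_vars - 1


def one_cause_params(n_effects: int) -> int:
    """Free parameters when one binary cause has n conditionally
    independent binary effects: P(cause) needs one number, and each
    effect needs P(effect=T | cause) for both states of the cause."""
    return 1 + 2 * n_effects


# Compare the two counts for a hidden cause plus n observed effects.
for n in (4, 10, 20):
    print(n, full_joint_params(n + 1), one_cause_params(n))
```

The full table grows exponentially with the number of variables, while the factored form grows linearly.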

Graphical Structure Expresses our Beliefs

[Diagram: node "P1-P2 REAL" with arrows to "Detected by X1", …, "Detected by Xn"]

Cause (often “hidden”): P1-P2 REAL
Effects (observed): Detected by X1, …, Detected by Xn

78

Graphical Structure Expresses our Beliefs

[Diagram, two alternative networks. First: "P1-P2 REAL" with "Detected by X1", …, "Detected by Xn". Second: "P1-P2 REAL" with "Detected by X1", …, "Detected by Xj", "Detected by Xj+1", …, "Detected by Xn", plus the additional nodes "Membrane" and "Highly expressed".]

79

Naïve Bayes assumes all observations are independent given the hidden cause.

But some observations may be coupled.

80

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution.

A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

Formulations are equivalent and both require the same number of constraints.

85
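The equivalence of the two formulations can be checked directly with the product rule P(S,G) = P(S) P(G|S). The sketch below uses the numbers from the slide; the variable names are illustrative only.

```python
import math

# Conditional formulation from the slide: P(S) and P(G | S).
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},
    True: {False: 0.2, True: 0.8},
}

# Joint table as printed on the slide, keyed by (S, G).
joint_from_slide = {
    (False, False): 0.665,
    (False, True): 0.035,
    (True, False): 0.06,
    (True, True): 0.24,
}

# Rebuild the joint with the product rule P(S, G) = P(S) * P(G | S).
joint_rebuilt = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

# Every rebuilt entry should match the printed joint table.
for key, value in joint_from_slide.items():
    assert math.isclose(joint_rebuilt[key], value, abs_tol=1e-9)
```

Both forms carry three free numbers: the joint table has four entries that must sum to 1, and the conditional form has P(S=T), P(G=T|S=F), and P(G=T|S=T).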

Automated Admissions Decisions

[Diagram nodes: Smart, Grade inflation, Grades, GREs, Admit]

In a Bayesian Network we don’t need the full probability distribution. The joint probability depends only on “parents”. For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

“Explaining Away”

[Diagram nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

“Explaining Away”

[Diagram: Coin 1 and Coin 2 each point to Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.

88
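The two-coin example is small enough to enumerate exhaustively. The following sketch, with a hypothetical helper `prob`, lists the four equally likely outcomes and shows that the coins are independent until the score is known.

```python
from fractions import Fraction
from itertools import product

# The four equally likely outcomes of two fair coins. Score is 1 when the
# coins match and 0 otherwise, as in the table on the slide.
worlds = [
    {"C1": c1, "C2": c2, "Score": int(c1 == c2), "p": Fraction(1, 4)}
    for c1, c2 in product("HT", repeat=2)
]


def matches(world, condition):
    """True when the world agrees with every variable in the condition."""
    return all(world[name] == value for name, value in condition.items())


def prob(query, given=None):
    """P(query | given); both arguments map variable names to values."""
    given = given or {}
    denom = sum(w["p"] for w in worlds if matches(w, given))
    numer = sum(w["p"] for w in worlds
                if matches(w, given) and matches(w, query))
    return numer / denom


print(prob({"C1": "H"}, {"C2": "T"}))              # without the score
print(prob({"C1": "H"}, {"Score": 1, "C2": "T"}))  # score known to be 1
print(prob({"C1": "H"}, {"Score": 0, "C2": "T"}))  # score known to be 0
```

Knowing C2 alone leaves P(C1=H) at 1/2, but once the score is also observed, C2 fully determines C1.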

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

91

Learning Models from Data

[Diagram: Coin 1 and Coin 2 each point to Score]

Parameters to be learned from the data:

p(C1=H) = ?    p(C1=T) = ?
p(C2=H) = ?    p(C2=T) = ?

C1    C2    Score
H     H     ?
T     T     ?
H     T     ?
T     H     ?

D = (C1, C2, S) = (H,T,0), (H,H,1), …

92
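When the structure is known and every variable is observed, maximum-likelihood estimates for discrete tables reduce to relative frequencies. The sketch below illustrates this for the coin network. Only the first two triples of the training set appear on the slide; the remaining triples are invented for illustration, and all variable names are hypothetical.

```python
from collections import Counter

# Hypothetical training set of (C1, C2, Score) triples. Only the first two
# triples come from the slide; the rest are made up for this example.
data = [
    ("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0),
    ("H", "H", 1), ("T", "H", 0), ("H", "T", 0), ("H", "H", 1),
]

n = len(data)
c1_counts = Counter(c1 for c1, _, _ in data)
c2_counts = Counter(c2 for _, c2, _ in data)

# With fully observed discrete data, the maximum-likelihood estimate of
# each probability is the observed relative frequency.
p_c1_h = c1_counts["H"] / n
p_c2_h = c2_counts["H"] / n

# P(Score = 1 | C1, C2) is estimated the same way within each parent setting.
pair_counts = Counter((c1, c2) for c1, c2, _ in data)
score1_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
p_score1 = {pair: score1_counts[pair] / pair_counts[pair]
            for pair in pair_counts}

print(p_c1_h, p_c2_h)
print(p_score1)
```

With hidden variables or missing values, counting is no longer enough, which is where iterative methods such as EM come in.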

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Diagram nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)    (why? because of the conditional independence assumption)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Diagram nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

S:
P(F)    P(T)
0.5     0.5

G given S:
S    P(F)    P(T)
F    0.9     0.1
T    0.8     0.2

R given S:
S    P(F)    P(T)
F    0.8     0.2
T    0.2     0.8

A given G and R:
G    R    P(F)    P(T)
F    F    0.9     0.1
F    T    0.8     0.2
T    F    0.5     0.5
T    T    0.2     0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)
           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
             [terms for (G,R) = (F,F), (F,T), (T,F), (T,T)]

Rough grading: being smart doesn’t help very much.

GREs are better correlated with intelligence than the grades.

97
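The sum over G and R can be carried out mechanically. The following sketch encodes the tables as transcribed from the slide (storing only the probability of True for each variable) and evaluates P(A=T | S) by enumeration. The names `bernoulli` and `p_admit_given_smart` are hypothetical helpers, not part of the lecture.

```python
from itertools import product

# Conditional probability tables as transcribed from the slide. Each value
# is the probability that the child variable is True.
P_G_TRUE = {False: 0.1, True: 0.2}                      # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}                      # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}      # P(A=T | G, R)


def bernoulli(p_true, value):
    """Probability of a binary value given the probability of True."""
    return p_true if value else 1.0 - p_true


def p_admit_given_smart(s):
    """P(A=T | S=s) = sum over G, R of P(G|s) P(R|s) P(A=T|G,R)."""
    return sum(
        bernoulli(P_G_TRUE[s], g) * bernoulli(P_R_TRUE[s], r) * P_A_TRUE[(g, r)]
        for g, r in product((False, True), repeat=2)
    )


print(round(p_admit_given_smart(True), 3))
print(round(p_admit_given_smart(False), 3))
```

For S=T this reproduces the four-term sum on the slide (0.292, quoted as 0.29). For S=F, exact enumeration of the tables as transcribed here gives 0.164, so figures quoted for that case elsewhere in the worked example are best treated as approximate.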

Inference with Bayes Nets

[Diagram nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)
       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.21

where P(S=T) = 0.5, P(S=F) = 0.5, P(A=T|S=T) = 0.29 (calculated on the previous slide), and P(A=T|S=F) = 0.14 (calculated analogously to P(A=T|S=T)).

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes’ Rule: P(S=T|A=T) = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Diagram nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

P(R=F|A=F) = P(R=F, A=F) / P(A=F)
           = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R)    (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6

99
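The "tedious but straightforward" sums can be delegated to a few lines of code. This is a minimal sketch of inference by full enumeration over the 16 joint assignments, using the conditional tables as transcribed from the earlier slide; `joint` and `prob` are hypothetical helper names.

```python
from itertools import product

# Conditional probability tables as transcribed from the slides. Each value
# is the probability that the child variable is True.
P_S_TRUE = 0.5
P_G_TRUE = {False: 0.1, True: 0.2}                      # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}                      # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}      # P(A=T | G, R)

NAMES = ("S", "G", "R", "A")


def bernoulli(p_true, value):
    """Probability of a binary value given the probability of True."""
    return p_true if value else 1.0 - p_true


def joint(s, g, r, a):
    """Chain rule for this network: P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bernoulli(P_S_TRUE, s) * bernoulli(P_G_TRUE[s], g)
            * bernoulli(P_R_TRUE[s], r) * bernoulli(P_A_TRUE[(g, r)], a))


def prob(query, given=None):
    """P(query | given), summing the joint over all 16 assignments."""
    given = given or {}
    numer = denom = 0.0
    for values in product((False, True), repeat=4):
        world = dict(zip(NAMES, values))
        if all(world[k] == v for k, v in given.items()):
            p = joint(*values)
            denom += p
            if all(world[k] == v for k, v in query.items()):
                numer += p
    return numer / denom


p_admit = prob({"A": True})
p_smart_given_admit = prob({"S": True}, {"A": True})
p_bad_grades = prob({"G": False}, {"A": False})
p_bad_gres = prob({"R": False}, {"A": False})
print(p_admit, p_smart_given_admit, p_bad_grades, p_bad_gres)
```

With the tables as transcribed, exact enumeration gives P(A=T) ≈ 0.228, P(G=F|A=F) ≈ 0.94 and P(R=F|A=F) ≈ 0.55, close to but not identical with the 0.21, 0.92 and 0.56 quoted on the slides. The conclusion is the same either way: a rejected student is more likely to have had bad grades than bad GREs.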

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: node "P1-P2 REAL" with arrows to "Detected by X1", …, "Detected by Xn"]

Cause (often “hidden”): P1-P2 REAL
Effects (observed): Detected by X1, …, Detected by Xn

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol & Cell Proteomics (2002) 1.5: 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: candidate patterns; one is labeled “More likely to interact”]

Courtesy of Cokus et al. License: CC-BY.
Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

[Figure from Jansen et al.]

Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

108

[Figure from Jansen et al.]

109

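likelihood ratio:

$$\frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

if > 1, classify as true; if < 1, classify as false

log likelihood ratio:

$$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function:

$$\prod_{i}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$$

110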
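As an illustration of ranking by likelihood ratio, the sketch below scores protein pairs by summing per-observation log ratios, which is the log of the product of ratios. It assumes the observations are conditionally independent. The evidence names, the likelihood values, and the protein pair labels are all invented for this example and do not come from the lecture or from Jansen et al.

```python
import math

# Hypothetical likelihoods for each binary evidence type, given as
# (P(observed | true PPI), P(observed | false PPI)). Invented numbers.
LIKELIHOODS = {
    "coexpressed": (0.60, 0.20),
    "same_go_term": (0.50, 0.10),
    "both_essential": (0.50, 0.15),
}


def log_likelihood_ratio(observations):
    """Sum of per-observation log ratios (log of the product of ratios).

    `observations` maps an evidence name to True or False. Evidence types
    that are missing from the dict are skipped, so missing data simply
    contributes nothing to the score.
    """
    total = 0.0
    for name, observed in observations.items():
        p_pos, p_neg = LIKELIHOODS[name]
        if not observed:
            p_pos, p_neg = 1.0 - p_pos, 1.0 - p_neg
        total += math.log(p_pos / p_neg)
    return total


# Hypothetical protein pairs and their observed evidence.
pairs = {
    "P1-P2": {"coexpressed": True, "same_go_term": True, "both_essential": True},
    "P3-P4": {"coexpressed": True, "same_go_term": False, "both_essential": False},
    "P5-P6": {"coexpressed": False, "same_go_term": False, "both_essential": False},
}

ranked = sorted(pairs, key=lambda pair: log_likelihood_ratio(pairs[pair]),
                reverse=True)
print(ranked)
```

A positive score corresponds to a likelihood ratio above 1. Because the prior is the same for every pair, sorting by this score gives the same order as sorting by posterior probability.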

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

$$\mathrm{Likelihood} = L = \frac{P(f \mid \mathrm{pos})}{P(f \mid \mathrm{neg})} = \frac{1114 / 2150}{81924 / 573734}$$

111
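The likelihood for one feature value is a ratio of two frequencies estimated from the gold-standard sets. The sketch below evaluates it under the assumption that the two digit runs on the slide are the fractions 1114/2150 (gold-standard positives) and 81924/573734 (gold-standard negatives). That split is a reading of the garbled slide text, and the function name is hypothetical.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (assumed split of the digit runs).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))
```

Under that reading, the feature value is between three and four times more frequent among gold-standard positives than among gold-standard negatives.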

112

[Figure from Jansen et al., Science 302, no. 5644 (2003): 449-53]

113

Fully connected → Compute probabilities for all 16 possible combinations

[Figure from Jansen et al., Science 302, no. 5644 (2003): 449-53]

114

Interpret with caution, as numbers are small

[Figure from Jansen et al., Science 302, no. 5644 (2003): 449-53]

115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

[Figure from Jansen et al., Science 302, no. 5644 (2003): 449-53]

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Collins et al Mol Cell Proteomics 2007

9074 interactions

Compute using high-confidence negatives

FP(TN+FP)

Com

pute

usi

ng h

igh-

conf

iden

ce

posi

tives

(fro

m c

ompl

exes

)

ROC curve

TP(T

P+F

N)

ROC curve

69

copy American Society for Biochemistry and Molecular Biology All rights

reserved This content is excluded from our Creative Commons license

For more information see httpocwmiteduhelpfaq-fair-useSource Collins Sean R Patrick Kemmeren et al Toward a ComprehensiveAtlas of the Physical Interactome of Saccharomyces Cerevisiae Molecular amp

Cellular Proteomics 6 no 3 (2007) 439-50

Outline

bull Structural prediction of protein-protein interactions

bull High-throughput measurement of protein-protein interactions

bull Estimating interaction probabilities bull Bayes Net predictions of protein-protein

interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

bull Bayesian Networks are a tool for reasoning with probabilities

bull Consist of a graph (network) and a set of probabilities

bull These can be ldquolearnedrdquo from the data

Bayesian Networks

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|Y2HUetz Y2HIto IPGavin IPHo)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|RosettaCCGOMIPSESS) But some problems are too big

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

bull Complete joint probability tables are large and often unknown

bull N binary variables = 2N states ndash only one constraint (sum of all probabilities =1)

=gt 2N - 1 parameters

77

Graphical Structure Expresses our Beliefs

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed)

78

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naiumlve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

bull The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure In a Bayesian Network we donrsquot need the full probability distribution A node is independent of its ancestors given its parents For example The activity of a gene does not depend on the activity of TF A1 once I know TF B2 82

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Prediction we observe the ldquocausesrdquo (roots parents) and want to predict the ldquoeffectsrdquo (leaveschildren)

Given Grades GREs will we admit

Inference we observe the ldquoeffectsrdquo (leaves children) and want to infer the hidden values of the ldquocausesrdquo (rootsparents)

You meet an admitted the student Is she smart

Admit

83

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictionsinferences requires knowing the joint probabilities P(admit grade inflation smart grades GREs) We will find conditional probabilities to be very useful

84

S G (above

threshold)

P(SG)

F F 0665

F T 0035

T F 006

T T 024

Automated Admissions Decisions Joint Probability

S P(G=F|S) P(G=T|S)

F 095 005

T 02 08

P(S=F)=07 P(S=T)=03

Grades

Smart

Conditional Probability

Formulations are equivalent and both require same number of constraints 85

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we donrsquot need the full probability distribution The joint probability depends only on ldquoparentsrdquo For example GRE scores do not depend on the level of grade inflation at the school but the grades do

Admit

86

ldquoExplaining Awayrdquo Season

S Sprinkler

R Rain

Grass wet

Does the probability that itrsquos raining depend on whether the sprinkler is on

Slippery

In a causal sense clearly not But in a probabilistic model the knowledge that it is raining influences our beliefs

87

Coin 1

Coin 2

Score

p(C2=H)=p(C2=T) = 5 p(C1=H)=p(C1=T) = 5

C1 C2 Score

H H 1

T T 1

H T 0

T H 0

Does the prob that C1 =H on depend on whether C2 =T If we know the score then our belief in the state of C1 is influenced by our belief in the state C2

ldquoExplaining Awayrdquo

88

How do we obtain a BN

bull Two problems ndash learning graph structure

bull NP-complete bull approximation algorithms

ndash probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

bull Two common objective functions ndash Maximum likelihood

bull Define the likelihood over training data Xi

ndash Maximum posterior

bull Good search algorithms exist ndash Gradient descent EM Gibbs Sampling hellip

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D= (C1C2S) = (HT0) (HH1) hellip

92

Learning Models from Data

bull Searching for the BN structure NP-complete ndash Too many possible structures to evaluate all of

them even for very small networks ndash Many algorithms have been proposed ndash Incorporated some prior knowledge can reduce

the search space bull Which measurements are likely independent bull Which nodes should regulate transcription bull Which should cause changes in phosphorylation

93

Resources to learn more

Kevin Murphyrsquos tutorial httpwwwcsubcca~murphykBayesbnintrohtml

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets S

P(F) P(T) 05 05

G S P(F) P(T) F 09 01 T 08 02 R

S P(F) P(T) F 08 02 T 02 08

A G R P(F) P(T) F F 09 01 F T 08 02 T F 05 05 T T 02 08

P(A=T|S=T) = ΣΣ P(G|S=T)P(R|S=T)P(A=T|GR) G=FT R=FT

= (08)(02)(01) + (08)(08)(02) + (02)(02)(05) + (02)(08)(08) =29 F F F T T F T T

Rough grading being smart doesnrsquot help very much

GREs are better correlated with intelligence than the grades

G Grade

s

S Smart

R GREs

A Admit

97

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(A=T) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) S=FT G=FT

R=FT

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) = 021 P(S=T) = 05 P(S=F) = 05 P(A=T|S=F) calculated analogously to P(A=T|S=T)

P(S=T) = 05 P(A=T|S=T) calculated on previous slide =29 P(A=T|S=F) =14

P(S=T|A=T) = P(S=TA=T)P(A=T) Or using Bayes Rule = P(S=T)P(A=T|S=T)P(A=T)

98

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

If a student is not admitted is it more likely they had bad GREs or bad grades

Compute P(R=F|A=F) and P(G=F|A=T)

P(R=F|A=F) = P(R=FA=F) P(A=F) = [Σ Σ P(S)P(G=F)P(R)P(A=F)]P(A=F) G=FT

R=FT

P(A=F) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) (as before)

Tedious but straightforward to compute

P(G=F|A=T) = 92 = 16 P(R=F|A=F) 56

99

End of worked example

100

Goal

bull Estimate interaction probability using ndash Affinity capture ndash Two-hybrid ndash Less physical data

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed) 101

Properties of real interactions correlated expression Expression Profile Reliability (EPR)

Deane et al Mol amp Cell Proteomics (2002) 15 349-356

INT = high confidence interactions from

small scale experiments

d = ldquodistancerdquo that measures the difference between two mRNA expression profiles

Note proteins involved in ldquotruerdquo protein-protein interactions have more similar mRNA expression profiles than random pairs Use this to assess how good an experimental set

of interactions is

Source Deane Charlotte M Łukasz Salwiński et al Protein Interactions Two Methods for Aassessment ofthe Reliability of High Throughput Observations Molecular amp Cellular Proteomics 1 no 5 (2002) 349-56

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use


Outline

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

70

Bayesian Networks

A method for using probabilities to reason

71

In Biology

• Gene regulation
• Signaling
• Prediction

72

Bayesian Networks

• Bayesian networks are a tool for reasoning with probabilities
• They consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

73

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

75

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

74

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

[Figure from Jansen et al., Science 302, no. 5644 (2003): 449-53.]

75

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS). But some problems are too big.

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  ⇒ 2^N − 1 parameters
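To make the parameter count concrete, here is a minimal sketch that compares a full joint table with a factored network in which each binary node depends only on its parents. The five-node graph and the function names are hypothetical, chosen only for illustration.

```python
def full_joint_params(n_vars: int) -> int:
    """Free parameters in a full joint table over n binary variables."""
    return 2 ** n_vars - 1


def factored_params(parents: dict) -> int:
    """Free parameters when each binary node depends only on its parents.

    A node with k binary parents needs one number, P(node=True | parents),
    for each of the 2**k parent configurations.
    """
    return sum(2 ** len(p) for p in parents.values())


# Hypothetical graph: one hidden cause ("PPI") and four observed detectors.
graph = {"PPI": [], "X1": ["PPI"], "X2": ["PPI"], "X3": ["PPI"], "X4": ["PPI"]}

print(full_joint_params(len(graph)))  # 31
print(factored_params(graph))         # 9
```

The gap widens quickly: the full table doubles with every added variable, while the factored count grows only with the number of nodes and the size of each parent set.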

77

Graphical Structure Expresses our Beliefs

[Diagram: the node "P1-P2 REAL" (the cause, often "hidden") points to the nodes "Detected by X1" … "Detected by Xn" (the effects, observed).]

78

Graphical Structure Expresses our Beliefs

[Diagram, left: "P1-P2 REAL" points to "Detected by X1" … "Detected by Xn". Right: "P1-P2 REAL" with "Detected by X1" … "Detected by Xj", "Detected by Xj+1" … "Detected by Xn", plus the additional nodes "Membrane" and "Highly expressed".]

79

Graphical Structure Expresses our Beliefs

[Diagram, left: "P1-P2 REAL" points to "Detected by X1" … "Detected by Xn". Right: "P1-P2 REAL" with "Detected by X1" … "Detected by Xj", "Detected by Xj+1" … "Detected by Xn", plus the additional nodes "Membrane" and "Highly expressed".]

Naïve Bayes assumes all observations are independent given the class.

But some observations may be coupled.

80

Graphical Structure Expresses our Beliefs

[Diagram, left: "P1-P2 REAL" points to "Detected by X1" … "Detected by Xn". Right: "P1-P2 REAL" with "Detected by X1" … "Detected by Xj", "Detected by Xj+1" … "Detected by Xn", plus the additional nodes "Membrane" and "Highly expressed".]

• The graphical structure can be decided in advance, based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian network we don't need the full probability distribution: a node is independent of its ancestors given its parents. For example, the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Diagram with nodes: Smart, Grade inflation, Grades, GREs, Admit.]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).
  Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).
  You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Diagram with nodes: Smart, Grade inflation, Grades, GREs, Admit.]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

The two formulations are equivalent, and both require the same number of constraints.

85
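The equivalence of the two formulations can be checked directly. The sketch below rebuilds the joint table from P(S) and P(G|S) using the numbers on the slide, then recovers the conditional table from the joint; the variable names are illustrative.

```python
import math

# Conditional formulation, taken from the slide.
p_s = {"F": 0.7, "T": 0.3}
p_g_given_s = {"F": {"F": 0.95, "T": 0.05},
               "T": {"F": 0.2, "T": 0.8}}

# Joint formulation: P(S, G) = P(S) * P(G | S).
joint = {(s, g): p_s[s] * p_g_given_s[s][g]
         for s in ("F", "T") for g in ("F", "T")}

# Going back: P(S) by marginalizing over G, then P(G | S) = P(S, G) / P(S).
p_s_back = {s: joint[(s, "F")] + joint[(s, "T")] for s in ("F", "T")}
p_g_back = {s: {g: joint[(s, g)] / p_s_back[s] for g in ("F", "T")}
            for s in ("F", "T")}

for (s, g), p in joint.items():
    print(s, g, round(p, 3))
```

Both representations carry three free numbers: the joint has four entries that must sum to 1, and the conditional form has one number for P(S) plus one per row of P(G|S).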

Automated Admissions Decisions

[Diagram with nodes: Smart, Grade inflation, Grades, GREs, Admit.]

In a Bayesian network we don't need the full probability distribution. The joint probability depends only on "parents". For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

"Explaining Away"

[Diagram with nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery.]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about whether the sprinkler is on.

87

"Explaining Away"

[Diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.

88
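The two-coin example is small enough to enumerate. The following sketch (helper names are illustrative) shows that the coins are independent until the score is observed, at which point learning C2 changes our belief about C1.

```python
from itertools import product


def score(c1: str, c2: str) -> int:
    """Score is 1 when the two coins match, 0 otherwise (as in the table)."""
    return 1 if c1 == c2 else 0


# Joint distribution over (C1, C2, Score) for two fair, independent coins.
joint = {(c1, c2, score(c1, c2)): 0.25 for c1, c2 in product("HT", repeat=2)}


def prob(pred, given=lambda w: True):
    """P(pred | given), computed by summing over the enumerated worlds."""
    den = sum(p for w, p in joint.items() if given(w))
    num = sum(p for w, p in joint.items() if given(w) and pred(w))
    return num / den


c1_heads = lambda w: w[0] == "H"

print(prob(c1_heads))                                            # 0.5
print(prob(c1_heads, lambda w: w[1] == "T"))                     # 0.5
print(prob(c1_heads, lambda w: w[2] == 1))                       # 0.5
print(prob(c1_heads, lambda w: w[2] == 1 and w[1] == "T"))       # 0.0
```

Without the score, knowing C2 leaves P(C1=H) at 0.5. Once Score = 1 is known, C2 = T forces C1 = T.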

How do we obtain a BN

• Two problems
  – learning graph structure
     • NP-complete
     • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
     • Define the likelihood over training data X_i
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = {(C1, C2, S)} = {(H,T,0), (H,H,1), …}

92
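When the structure is known and every variable is observed, maximum-likelihood parameters reduce to relative frequencies. The sketch below estimates the coin network's tables by counting. Only the first two tuples of the data set come from the slide; the rest are made up for illustration.

```python
from collections import Counter

# Training data D = [(C1, C2, Score), ...].
# Only (H,T,0) and (H,H,1) appear on the slide; the rest are hypothetical.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0),
        ("H", "H", 1), ("T", "H", 0), ("H", "T", 0), ("H", "H", 1)]


def mle_marginal(values):
    """Maximum-likelihood estimate of a discrete distribution: count / total."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


p_c1 = mle_marginal(c1 for c1, _, _ in data)
p_c2 = mle_marginal(c2 for _, c2, _ in data)

# P(Score=1 | C1, C2): fraction of each observed (C1, C2) pair with score 1.
pair_counts = Counter((c1, c2) for c1, c2, _ in data)
score1_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
p_score1 = {pair: score1_counts[pair] / n for pair, n in pair_counts.items()}

print(p_c1)
print(p_c2)
print(p_score1)
```

A maximum-posterior estimate would follow the same counting scheme with pseudocounts added to each cell. With hidden variables, counting is no longer enough, which is where search methods such as EM come in.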

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
     • Which measurements are likely independent?
     • Which nodes should regulate transcription?
     • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked "toy" example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)    (Why? Because of the conditional independence assumptions.)

Recall: P(X,Y) = P(X|Y) P(Y)

96

Prediction with Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

P(S):
  P(S=F)   P(S=T)
  0.5      0.5

P(G|S) (rough grading: being smart doesn't help very much):
  S    P(G=F)   P(G=T)
  F    0.9      0.1
  T    0.8      0.2

P(R|S) (GREs are better correlated with intelligence than the grades):
  S    P(R=F)   P(R=T)
  F    0.8      0.2
  T    0.2      0.8

P(A|G,R):
  G    R    P(A=F)   P(A=T)
  F    F    0.9      0.1
  F    T    0.8      0.2
  T    F    0.5      0.5
  T    T    0.2      0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)
           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8)
             [G,R = F,F]       [G,R = F,T]       [G,R = T,F]       [G,R = T,T]
           = 0.292 ≈ 0.29

97
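The prediction above is a sum over the unobserved intermediate nodes G and R. A minimal sketch of that sum, using the conditional probability tables from the slide (the function and variable names are illustrative):

```python
from itertools import product

# Each table stores P(variable = True | parents), as given on the slide.
P_G_T = {False: 0.1, True: 0.2}               # P(G=T | S)
P_R_T = {False: 0.2, True: 0.8}               # P(R=T | S)
P_A_T = {(False, False): 0.1, (False, True): 0.2,
         (True, False): 0.5, (True, True): 0.8}  # P(A=T | G, R)


def bern(p_true: float, value: bool) -> float:
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def p_admit_given_smart(s: bool) -> float:
    """P(A=T | S=s): marginalize over grades G and GREs R."""
    return sum(bern(P_G_T[s], g) * bern(P_R_T[s], r) * P_A_T[(g, r)]
               for g, r in product([False, True], repeat=2))


print(round(p_admit_given_smart(True), 3))   # 0.292
print(round(p_admit_given_smart(False), 3))  # 0.164
```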

Inference with Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)
       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F)
       = (0.5)(0.292) + (0.5)(0.164) = 0.228 ≈ 0.23

  P(S=T) = 0.5, P(S=F) = 0.5
  P(A=T|S=T) = 0.292 ≈ 0.29 (calculated on the previous slide)
  P(A=T|S=F) = 0.164 ≈ 0.16 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T,A=T) / P(A=T)
or, using Bayes' rule,
           = P(S=T) P(A=T|S=T) / P(A=T) = (0.5)(0.292) / 0.228 ≈ 0.64

98

Inference with Bayes Nets

[Diagram: S (Smart) → G (Grades), S → R (GREs), G → A (Admit), R → A]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

P(R=F|A=F) = P(R=F,A=F) / P(A=F)
           = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R) = 1 − P(A=T)    (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) ≈ 0.94 / 0.55 ≈ 1.7

99
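The "tedious but straightforward" inference can be done by brute-force enumeration of all 16 joint states. The sketch below uses the tables from the worked example; the `prob` helper is a hypothetical name, and enumeration is only practical for networks this small.

```python
from itertools import product

# P(variable = True | parents), from the worked example's tables.
P_S_T = 0.5
P_G_T = {False: 0.1, True: 0.2}               # P(G=T | S)
P_R_T = {False: 0.2, True: 0.8}               # P(R=T | S)
P_A_T = {(False, False): 0.1, (False, True): 0.2,
         (True, False): 0.5, (True, True): 0.8}  # P(A=T | G, R)


def bern(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(s, g, r, a):
    """Chain rule for this network: P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bern(P_S_T, s) * bern(P_G_T[s], g)
            * bern(P_R_T[s], r) * bern(P_A_T[(g, r)], a))


def prob(query, evidence):
    """P(query | evidence); both are dicts over the keys 'S', 'G', 'R', 'A'."""
    num = den = 0.0
    for s, g, r, a in product([False, True], repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(s, g, r, a)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den


print(round(prob({"S": True}, {"A": True}), 2))    # 0.64
print(round(prob({"G": False}, {"A": False}), 2))  # 0.94
print(round(prob({"R": False}, {"A": False}), 2))  # 0.55
```

Bad grades are the more probable condition among rejected students, largely because bad grades are common in this model whether or not the student is smart.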

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: the node "P1-P2 REAL" (the cause, often "hidden") points to the nodes "Detected by X1" … "Detected by Xn" (the effects, observed).]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: two patterns, one of them labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

[Figure from Jansen et al., Science 302, no. 5644 (2003): 449-53.]

108

[Figure from Jansen et al., Science 302, no. 5644 (2003): 449-53.]

109

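Likelihood ratio: if > 1, classify as true; if < 1, classify as false.

$$\text{likelihood ratio} = \frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

$$\text{log likelihood ratio} = \log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

The prior probability is the same for all interactions, so it does not affect ranking.

$$\text{Ranking function} = \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$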

110
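As a rough illustration of the ranking function, the sketch below scores protein pairs by summing per-observation log likelihood ratios, which equals the log of the product under the naïve Bayes independence assumption. The evidence names, probabilities, and protein pairs are all invented for the example and are not taken from the lecture or the paper.

```python
import math

# Hypothetical P(observation value | true_PPI) and P(observation value | false_PPI).
p_given_true = {"y2h": {True: 0.3, False: 0.7},
                "coexpressed": {True: 0.6, False: 0.4}}
p_given_false = {"y2h": {True: 0.01, False: 0.99},
                 "coexpressed": {True: 0.2, False: 0.8}}


def log_likelihood_ratio(observations: dict) -> float:
    """Sum over observations i of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)]."""
    return sum(math.log(p_given_true[name][value] / p_given_false[name][value])
               for name, value in observations.items())


# Hypothetical protein pairs and their observed evidence.
pairs = {
    "A-B": {"y2h": True, "coexpressed": True},
    "C-D": {"y2h": False, "coexpressed": True},
    "E-F": {"y2h": False, "coexpressed": False},
}

scores = {pair: log_likelihood_ratio(obs) for pair, obs in pairs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for pair in ranked:
    print(pair, round(scores[pair], 2))
```

Adding the log prior odds would shift every score by the same constant, so the ordering of pairs would not change.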

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$$

$$L = \frac{1114 / 2150}{81924 / 573734} \approx 3.6$$

111
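The likelihood for one feature value is just a ratio of two frequencies in the gold-standard sets. A minimal sketch using the counts shown on the slide (the function and argument names are illustrative):

```python
def likelihood_ratio(pos_count: int, pos_total: int,
                     neg_count: int, neg_total: int) -> float:
    """L = P(f | pos) / P(f | neg), with each probability estimated as a
    fraction of the gold-standard positive or negative set."""
    return (pos_count / pos_total) / (neg_count / neg_total)


# Counts as they appear on the slide: 1114 of 2150 gold-standard positives
# and 81924 of 573734 gold-standard negatives share the feature value.
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))  # 3.63
```

A value above 1 means the feature value is more common among gold-standard positives than negatives, so observing it raises the odds that the pair interacts.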

112

[Figure from Jansen et al., Science 302, no. 5644 (2003): 449-53.]

113

Fully connected → compute probabilities for all 16 possible combinations

[Figure from Jansen et al., Science 302, no. 5644 (2003): 449-53.]

114

Interpret with caution, as numbers are small.

[Figure from Jansen et al., Science 302, no. 5644 (2003): 449-53.]

115

TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

[Figure from Jansen et al., Science 302, no. 5644 (2003): 449-53.]

116
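One way to answer the cutoff question is to count true and false positives among gold-standard pairs scoring at or above each cutoff. The sketch below does this on a made-up list of (log likelihood ratio, is gold-standard positive) pairs; the scores and cutoffs are illustrative only.

```python
def tp_fp_at_cutoffs(scored, cutoffs):
    """For each cutoff, count gold-standard positives (TP) and negatives (FP)
    whose score is at or above the cutoff."""
    result = {}
    for cutoff in cutoffs:
        tp = sum(1 for s, is_pos in scored if s >= cutoff and is_pos)
        fp = sum(1 for s, is_pos in scored if s >= cutoff and not is_pos)
        result[cutoff] = (tp, fp)
    return result


# Hypothetical (score, is_gold_standard_positive) pairs.
scored = [(4.5, True), (3.2, True), (2.9, False), (1.1, True),
          (0.4, False), (-0.3, False), (-1.0, True), (-2.2, False)]

counts = tp_fp_at_cutoffs(scored, cutoffs=[0, 2, 4])
for cutoff, (tp, fp) in counts.items():
    print(cutoff, tp, fp)
```

Raising the cutoff trades coverage for precision: fewer gold-standard positives are recovered, but the TP/FP ratio among the predictions improves.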

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Bayesian Networks

A method for using probabilities to reason

71

In Biology

bull Gene regulation bull Signaling bull Prediction

72

bull Bayesian Networks are a tool for reasoning with probabilities

bull Consist of a graph (network) and a set of probabilities

bull These can be ldquolearnedrdquo from the data

Bayesian Networks

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|Y2HUetz Y2HIto IPGavin IPHo)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|RosettaCCGOMIPSESS) But some problems are too big

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

bull Complete joint probability tables are large and often unknown

bull N binary variables = 2N states ndash only one constraint (sum of all probabilities =1)

=gt 2N - 1 parameters

77

Graphical Structure Expresses our Beliefs

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed)

78

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naiumlve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

bull The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure In a Bayesian Network we donrsquot need the full probability distribution A node is independent of its ancestors given its parents For example The activity of a gene does not depend on the activity of TF A1 once I know TF B2 82

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Prediction we observe the ldquocausesrdquo (roots parents) and want to predict the ldquoeffectsrdquo (leaveschildren)

Given Grades GREs will we admit

Inference we observe the ldquoeffectsrdquo (leaves children) and want to infer the hidden values of the ldquocausesrdquo (rootsparents)

You meet an admitted the student Is she smart

Admit

83

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictionsinferences requires knowing the joint probabilities P(admit grade inflation smart grades GREs) We will find conditional probabilities to be very useful

84

S G (above

threshold)

P(SG)

F F 0665

F T 0035

T F 006

T T 024

Automated Admissions Decisions Joint Probability

S P(G=F|S) P(G=T|S)

F 095 005

T 02 08

P(S=F)=07 P(S=T)=03

Grades

Smart

Conditional Probability

Formulations are equivalent and both require same number of constraints 85

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we donrsquot need the full probability distribution The joint probability depends only on ldquoparentsrdquo For example GRE scores do not depend on the level of grade inflation at the school but the grades do

Admit

86

ldquoExplaining Awayrdquo Season

S Sprinkler

R Rain

Grass wet

Does the probability that itrsquos raining depend on whether the sprinkler is on

Slippery

In a causal sense clearly not But in a probabilistic model the knowledge that it is raining influences our beliefs

87

Coin 1

Coin 2

Score

p(C2=H)=p(C2=T) = 5 p(C1=H)=p(C1=T) = 5

C1 C2 Score

H H 1

T T 1

H T 0

T H 0

Does the prob that C1 =H on depend on whether C2 =T If we know the score then our belief in the state of C1 is influenced by our belief in the state C2

ldquoExplaining Awayrdquo

88

How do we obtain a BN

bull Two problems ndash learning graph structure

bull NP-complete bull approximation algorithms

ndash probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

bull Two common objective functions ndash Maximum likelihood

bull Define the likelihood over training data Xi

ndash Maximum posterior

bull Good search algorithms exist ndash Gradient descent EM Gibbs Sampling hellip

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D= (C1C2S) = (HT0) (HH1) hellip

92

Learning Models from Data

bull Searching for the BN structure NP-complete ndash Too many possible structures to evaluate all of

them even for very small networks ndash Many algorithms have been proposed ndash Incorporated some prior knowledge can reduce

the search space bull Which measurements are likely independent bull Which nodes should regulate transcription bull Which should cause changes in phosphorylation

93

Resources to learn more

Kevin Murphyrsquos tutorial httpwwwcsubcca~murphykBayesbnintrohtml

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets S

P(F) P(T) 05 05

G S P(F) P(T) F 09 01 T 08 02 R

S P(F) P(T) F 08 02 T 02 08

A G R P(F) P(T) F F 09 01 F T 08 02 T F 05 05 T T 02 08

P(A=T|S=T) = ΣΣ P(G|S=T)P(R|S=T)P(A=T|GR) G=FT R=FT

= (08)(02)(01) + (08)(08)(02) + (02)(02)(05) + (02)(08)(08) =29 F F F T T F T T

Rough grading being smart doesnrsquot help very much

GREs are better correlated with intelligence than the grades

G Grade

s

S Smart

R GREs

A Admit

97

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(A=T) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) S=FT G=FT

R=FT

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) = 021 P(S=T) = 05 P(S=F) = 05 P(A=T|S=F) calculated analogously to P(A=T|S=T)

P(S=T) = 05 P(A=T|S=T) calculated on previous slide =29 P(A=T|S=F) =14

P(S=T|A=T) = P(S=TA=T)P(A=T) Or using Bayes Rule = P(S=T)P(A=T|S=T)P(A=T)

98

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

If a student is not admitted is it more likely they had bad GREs or bad grades

Compute P(R=F|A=F) and P(G=F|A=T)

P(R=F|A=F) = P(R=FA=F) P(A=F) = [Σ Σ P(S)P(G=F)P(R)P(A=F)]P(A=F) G=FT

R=FT

P(A=F) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) (as before)

Tedious but straightforward to compute

P(G=F|A=T) = 92 = 16 P(R=F|A=F) 56

99

End of worked example

100

Goal

bull Estimate interaction probability using ndash Affinity capture ndash Two-hybrid ndash Less physical data

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed) 101

Properties of real interactions correlated expression Expression Profile Reliability (EPR)

Deane et al Mol amp Cell Proteomics (2002) 15 349-356

INT = high confidence interactions from

small scale experiments

d = ldquodistancerdquo that measures the difference between two mRNA expression profiles

Note proteins involved in ldquotruerdquo protein-protein interactions have more similar mRNA expression profiles than random pairs Use this to assess how good an experimental set

of interactions is

Source Deane Charlotte M Łukasz Salwiński et al Protein Interactions Two Methods for Aassessment ofthe Reliability of High Throughput Observations Molecular amp Cellular Proteomics 1 no 5 (2002) 349-56

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: candidate patterns, one annotated "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold-standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature
Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.
Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Likelihood ratio: if > 1, classify as true; if < 1, classify as false

Log likelihood ratio: the prior probability is the same for all interactions, so it does not affect ranking

Ranking function = log [ P(Data | true_PPI) / P(Data | false_PPI) ]
                 = log Π_{i=1}^{M} [ P(Observation_i | true_PPI) / P(Observation_i | false_PPI) ]
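
The ranking function can be sketched in a few lines. This is a minimal illustration, assuming independent evidence sources (the naive Bayes case) and entirely hypothetical conditional probabilities; the source names and the helper `log_likelihood_ratio` are invented for the example.

```python
import math


def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum over sources of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)].

    Sources missing from `observations` contribute nothing to the sum.
    """
    total = 0.0
    for source, value in observations.items():
        total += math.log(p_given_true[source][value] / p_given_false[source][value])
    return total


# Hypothetical P(observation | true_PPI) and P(observation | false_PPI) per source.
p_given_true = {
    "two_hybrid": {True: 0.30, False: 0.70},
    "affinity_capture": {True: 0.40, False: 0.60},
}
p_given_false = {
    "two_hybrid": {True: 0.01, False: 0.99},
    "affinity_capture": {True: 0.02, False: 0.98},
}

# Hypothetical protein pairs and what each experiment reported for them.
pairs = {
    "pair_a": {"two_hybrid": True, "affinity_capture": True},
    "pair_b": {"two_hybrid": True, "affinity_capture": False},
    "pair_c": {"two_hybrid": False, "affinity_capture": False},
}

scores = {name: log_likelihood_ratio(obs, p_given_true, p_given_false)
          for name, obs in pairs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: {scores[name]:+.2f}")
```

Working in log space turns the product over observations into a sum, which avoids numerical underflow when many evidence sources are combined.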

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not)

Likelihood = L = P(f | pos) / P(f | neg)

Example: L = (1114 / 2150) / (81924 / 573734) ≈ 3.6
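
For a discrete feature, the likelihood is just a ratio of two frequencies in the gold-standard sets. The sketch below takes the figures on the slide to mean 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives; that reading of the numbers is an interpretation, and the function name is hypothetical.

```python
def likelihood_ratio(pos_count, pos_total, neg_count, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_count / pos_total
    p_f_given_neg = neg_count / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (interpretation of the quoted figures).
L = likelihood_ratio(pos_count=1114, pos_total=2150, neg_count=81924, neg_total=573734)
print(f"L = {L:.1f}")
```

A value of L above 1 means the feature value is seen more often among gold-standard positives than negatives, so observing it raises the odds that a pair truly interacts.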

Fully connected → compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

log [ P(Data | true_PPI) / P(Data | false_PPI) ]

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


In Biology

• Gene regulation
• Signaling
• Prediction

Bayesian Networks

• Bayesian Networks are a tool for reasoning with probabilities
• Consist of a graph (network) and a set of probabilities
• These can be "learned" from the data

Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.
Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS)
But some problems are too big

Courtesy of Macmillan Publishers Limited. Used with permission. Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1) => 2^N − 1 parameters
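
To get a feel for how quickly the full joint table grows, the parameter count can be tabulated. The comparison with a naive Bayes structure (one hidden binary cause with N binary observations) is an illustration added here, not a figure taken from the slide; both function names are hypothetical.

```python
def full_joint_parameters(n_variables):
    """Free parameters in a full joint table over binary variables: 2^N - 1."""
    return 2 ** n_variables - 1


def naive_bayes_parameters(n_observations):
    """Free parameters for one binary cause with n binary observations.

    One prior for the cause, plus P(observation=T | cause) for each of the
    two cause states and each observation.
    """
    return 1 + 2 * n_observations


for n_obs in (4, 10, 20):
    n_vars = n_obs + 1  # the observations plus the hidden cause
    print(f"{n_obs:2d} observations: full joint = {full_joint_parameters(n_vars):>9,d}, "
          f"naive Bayes = {naive_bayes_parameters(n_obs)}")
```

The gap between exponential and linear growth is the practical reason for encoding independence assumptions in the graph.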

Graphical Structure Expresses our Beliefs

[Network diagram: "P1-P2 REAL" (cause, often "hidden") → "Detected by X1", …, "Detected by Xn" (effects, observed)]

Graphical Structure Expresses our Beliefs

[Network diagrams: (left) "P1-P2 REAL" → "Detected by X1", …, "Detected by Xn"; (right) the same network with additional nodes "Membrane" and "Highly expressed" alongside "Detected by Xj", "Detected by Xj+1", …]

Naïve Bayes assumes all observations are conditionally independent, given whether the interaction is real

But some observations may be coupled

• The graphical structure can be decided in advance based on knowledge of the system, or learned from the data

Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents.

For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Network diagram with nodes: Grade inflation, Smart, Grades, GREs, Admit]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

Automated Admissions Decisions

[Network diagram with nodes: Grade inflation, Smart, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities: P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability (Smart → Grades)

P(S=F) = 0.7, P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent and both require the same number of constraints.
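
The equivalence of the two formulations follows from the product rule P(S,G) = P(S) P(G|S). A short sketch, using the numbers from the tables above (the dictionary names are illustrative), goes in both directions:

```python
# Conditional formulation, with values taken from the tables above.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {False: {False: 0.95, True: 0.05},
               True: {False: 0.2, True: 0.8}}

# Product rule: P(S, G) = P(S) * P(G | S).
joint = {(s, g): p_s[s] * p_g_given_s[s][g]
         for s in (False, True) for g in (False, True)}

# Going back: marginalize to get P(S), then divide to recover P(G | S).
p_s_recovered = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}
p_g_given_s_recovered = {s: {g: joint[(s, g)] / p_s_recovered[s] for g in (False, True)}
                         for s in (False, True)}

for (s, g), p in joint.items():
    print(f"P(S={s}, G={g}) = {p:.3f}")
```

Both descriptions have three free parameters: the joint table has four entries that must sum to 1, and the conditional form has one prior plus one free entry per row of P(G|S).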

Automated Admissions Decisions

[Network diagram with nodes: Grade inflation, Smart, Grades, GREs, Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

"Explaining Away"

[Network diagram with nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

"Explaining Away"

[Network diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1=H depend on whether C2=T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
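
The two-coin example is small enough to enumerate directly. The sketch below assumes fair, independent coins and a score of 1 when the coins match, as in the table; the helper name is hypothetical.

```python
from itertools import product

# The four equally likely worlds: (C1, C2, Score), with Score = 1 when the coins match.
worlds = [(c1, c2, 1 if c1 == c2 else 0) for c1, c2 in product("HT", repeat=2)]


def prob_c1_heads(condition):
    """P(C1=H | condition); counting works because all worlds are equally likely."""
    kept = [w for w in worlds if condition(w)]
    return sum(1 for w in kept if w[0] == "H") / len(kept)


no_evidence = prob_c1_heads(lambda w: True)
given_c2_tails = prob_c1_heads(lambda w: w[1] == "T")
given_score = prob_c1_heads(lambda w: w[2] == 1)
given_score_and_c2_tails = prob_c1_heads(lambda w: w[2] == 1 and w[1] == "T")

print("P(C1=H)                  =", no_evidence)
print("P(C1=H | C2=T)           =", given_c2_tails)
print("P(C1=H | Score=1)        =", given_score)
print("P(C1=H | Score=1, C2=T)  =", given_score_and_c2_tails)
```

Without the score, learning C2 tells us nothing about C1. Once the common effect (the score) is observed, the two causes become dependent.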

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs Sampling, …

Learning Models from Data

[Network diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = ___   p(C1=T) = ___
p(C2=H) = ___   p(C2=T) = ___

C1   C2   Score
H    H    ___
T    T    ___
H    T    ___
T    H    ___

D = (C1, C2, S) = (H,T,0), (H,H,1), …
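
When every variable is observed and the structure is known, the maximum-likelihood parameters of a discrete network reduce to relative frequencies. The sketch below assumes fully observed data; only the first two tuples come from the slide, and the rest of the data set is hypothetical.

```python
from collections import Counter

# Training data D = (C1, C2, Score). Only the first two tuples appear on the
# slide; the remaining ones are hypothetical, added to make the example run.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]

n = len(data)
# Maximum-likelihood estimates for the root nodes are simple frequencies.
p_c1_heads = sum(1 for c1, _, _ in data if c1 == "H") / n
p_c2_heads = sum(1 for _, c2, _ in data if c2 == "H") / n

# For the child node, estimate P(Score=1 | C1, C2) separately for each parent setting.
pair_counts = Counter((c1, c2) for c1, c2, _ in data)
score1_counts = Counter((c1, c2) for c1, c2, score in data if score == 1)
p_score1_given_coins = {pair: score1_counts[pair] / count
                        for pair, count in pair_counts.items()}

print("p(C1=H) =", p_c1_heads)
print("p(C2=H) =", p_c2_heads)
for pair, p in sorted(p_score1_given_coins.items()):
    print(f"P(Score=1 | C1={pair[0]}, C2={pair[1]}) = {p}")
```

With hidden variables or missing values, counting is no longer enough, which is where iterative methods such as EM or Gibbs sampling come in.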

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

A worked "toy" example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

[Network diagram: S (Smart) → G (Grades); S → R (GREs); G, R → A (Admit)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
           = P(S) P(G|S) P(R|S) P(A|G,R)
             (why? because of the conditional independence assumptions)

Recall: P(X,Y) = P(X|Y) P(Y)

Prediction with Bayes Nets

S:
P(S=F)   P(S=T)
0.5      0.5

G given S:
S   P(G=F)   P(G=T)
F   0.9      0.1
T   0.8      0.2

R given S:
S   P(R=F)   P(R=T)
F   0.8      0.2
T   0.2      0.8

A given G and R:
G   R   P(A=F)   P(A=T)
F   F   0.9      0.1
F   T   0.8      0.2
T   F   0.5      0.5
T   T   0.2      0.8

P(A=T|S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)
           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
             [terms for (G,R) = (F,F), (F,T), (T,F), (T,T)]

Rough grading: being smart doesn't help very much

GREs are better correlated with intelligence than the grades

[Network diagram: S (Smart) → G (Grades); S → R (GREs); G, R → A (Admit)]
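
The prediction step can be written as a double sum over the unobserved intermediate variables. This is a minimal sketch using the tables above; the names are illustrative, and the same function also gives the analogous value for S=F.

```python
# Conditional probability tables from the slide, keyed by False/True.
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_ADMIT_GIVEN_GR = {(False, False): 0.1, (False, True): 0.2,
                    (True, False): 0.5, (True, True): 0.8}


def p_admit_given_smart(s):
    """P(A=T | S=s): sum over grades G and GREs R, which are not observed."""
    return sum(P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * P_ADMIT_GIVEN_GR[(g, r)]
               for g in (False, True)
               for r in (False, True))


print(f"P(A=T|S=T) = {p_admit_given_smart(True):.3f}")
print(f"P(A=T|S=F) = {p_admit_given_smart(False):.3f}")
```

Because S is given, P(S) itself never enters the calculation; only the tables downstream of S are needed.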



bull Bayesian Networks are a tool for reasoning with probabilities

bull Consist of a graph (network) and a set of probabilities

bull These can be ldquolearnedrdquo from the data

Bayesian Networks

73

Bayesian Networks

A ldquonaturalrdquo way to think about biological networks

Predict unknown variables from observations

75

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

74

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|Y2HUetz Y2HIto IPGavin IPHo)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|RosettaCCGOMIPSESS) But some problems are too big

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

bull Complete joint probability tables are large and often unknown

bull N binary variables = 2N states ndash only one constraint (sum of all probabilities =1)

=gt 2N - 1 parameters

77

Graphical Structure Expresses our Beliefs

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed)

78

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naiumlve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

bull The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure In a Bayesian Network we donrsquot need the full probability distribution A node is independent of its ancestors given its parents For example The activity of a gene does not depend on the activity of TF A1 once I know TF B2 82

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Prediction we observe the ldquocausesrdquo (roots parents) and want to predict the ldquoeffectsrdquo (leaveschildren)

Given Grades GREs will we admit

Inference we observe the ldquoeffectsrdquo (leaves children) and want to infer the hidden values of the ldquocausesrdquo (rootsparents)

You meet an admitted the student Is she smart

Admit

83

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictionsinferences requires knowing the joint probabilities P(admit grade inflation smart grades GREs) We will find conditional probabilities to be very useful

84

S G (above

threshold)

P(SG)

F F 0665

F T 0035

T F 006

T T 024

Automated Admissions Decisions Joint Probability

S P(G=F|S) P(G=T|S)

F 095 005

T 02 08

P(S=F)=07 P(S=T)=03

Grades

Smart

Conditional Probability

Formulations are equivalent and both require same number of constraints 85

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we donrsquot need the full probability distribution The joint probability depends only on ldquoparentsrdquo For example GRE scores do not depend on the level of grade inflation at the school but the grades do

Admit

86

ldquoExplaining Awayrdquo Season

S Sprinkler

R Rain

Grass wet

Does the probability that itrsquos raining depend on whether the sprinkler is on

Slippery

In a causal sense clearly not But in a probabilistic model the knowledge that it is raining influences our beliefs

87

Coin 1

Coin 2

Score

p(C2=H)=p(C2=T) = 5 p(C1=H)=p(C1=T) = 5

C1 C2 Score

H H 1

T T 1

H T 0

T H 0

Does the prob that C1 =H on depend on whether C2 =T If we know the score then our belief in the state of C1 is influenced by our belief in the state C2

ldquoExplaining Awayrdquo

88

How do we obtain a BN

bull Two problems ndash learning graph structure

bull NP-complete bull approximation algorithms

ndash probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

bull Two common objective functions ndash Maximum likelihood

bull Define the likelihood over training data Xi

ndash Maximum posterior

bull Good search algorithms exist ndash Gradient descent EM Gibbs Sampling hellip

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D= (C1C2S) = (HT0) (HH1) hellip

92

Learning Models from Data

bull Searching for the BN structure NP-complete ndash Too many possible structures to evaluate all of

them even for very small networks ndash Many algorithms have been proposed ndash Incorporated some prior knowledge can reduce

the search space bull Which measurements are likely independent bull Which nodes should regulate transcription bull Which should cause changes in phosphorylation

93

Resources to learn more

Kevin Murphyrsquos tutorial httpwwwcsubcca~murphykBayesbnintrohtml

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets S

P(F) P(T) 05 05

G S P(F) P(T) F 09 01 T 08 02 R

S P(F) P(T) F 08 02 T 02 08

A G R P(F) P(T) F F 09 01 F T 08 02 T F 05 05 T T 02 08

P(A=T|S=T) = ΣΣ P(G|S=T)P(R|S=T)P(A=T|GR) G=FT R=FT

= (08)(02)(01) + (08)(08)(02) + (02)(02)(05) + (02)(08)(08) =29 F F F T T F T T

Rough grading being smart doesnrsquot help very much

GREs are better correlated with intelligence than the grades

G Grade

s

S Smart

R GREs

A Admit

97

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(A=T) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) S=FT G=FT

R=FT

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) = 021 P(S=T) = 05 P(S=F) = 05 P(A=T|S=F) calculated analogously to P(A=T|S=T)

P(S=T) = 05 P(A=T|S=T) calculated on previous slide =29 P(A=T|S=F) =14

P(S=T|A=T) = P(S=TA=T)P(A=T) Or using Bayes Rule = P(S=T)P(A=T|S=T)P(A=T)

98

Inference with Bayes Nets

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F)
           = [Σ_S Σ_G P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_S Σ_G Σ_R P(S) P(G|S) P(R|S) P(A=F|G,R) = 1 − P(A=T)     (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.94 / 0.55 = 1.7
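The "tedious but straightforward" computation can be delegated to a small enumeration routine. This is a sketch under the toy example's tables; the helper `prob` and the variable names `s`, `g`, `r`, `a` are our own conventions.

```python
from itertools import product

BOOL = (False, True)  # False = F, True = T
NAMES = ("s", "g", "r", "a")

P_S = {False: 0.5, True: 0.5}
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_ADMIT_GIVEN_GR = {
    (False, False): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (True, True): 0.8,
}


def joint(s: bool, g: bool, r: bool, a: bool) -> float:
    """P(S=s, G=g, R=r, A=a) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_admit_true = P_ADMIT_GIVEN_GR[(g, r)]
    p_a = p_admit_true if a else 1.0 - p_admit_true
    return P_S[s] * P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * p_a


def prob(**evidence: bool) -> float:
    """Probability that the named variables (s, g, r, a) take the given values."""
    total = 0.0
    for values in product(BOOL, repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if all(world[name] == value for name, value in evidence.items()):
            total += joint(**world)
    return total


p_not_admitted = prob(a=False)
p_bad_grades = prob(g=False, a=False) / p_not_admitted  # P(G=F | A=F)
p_bad_gres = prob(r=False, a=False) / p_not_admitted    # P(R=F | A=F)

print(f"P(G=F | A=F) = {p_bad_grades:.2f}")
print(f"P(R=F | A=F) = {p_bad_gres:.2f}")
print(f"ratio = {p_bad_grades / p_bad_gres:.1f}")
```

With these tables, bad grades are the more likely explanation, largely because bad grades are common regardless of whether the student is smart.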

End of worked example

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Figure: network in which the cause (often "hidden"), P1-P2 REAL, points to the observed effects: Detected by X1, …, Detected by Xn.]

Properties of real interactions: correlated expression
Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profiles of protein pairs; one pattern is labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY.
Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.
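The idea behind phylogenetic profiles can be illustrated with a simple matching score: proteins that are present or absent in the same set of genomes are candidates for a functional link. The sketch below is not the method of Cokus et al., which is more refined; the profiles and protein names are made up for illustration.

```python
def profile_similarity(profile_a: str, profile_b: str) -> float:
    """Fraction of genomes in which both proteins are present or both are absent.

    Each profile is a string of '1' (homolog present) and '0' (absent),
    with one character per genome, in the same genome order.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    matches = sum(x == y for x, y in zip(profile_a, profile_b))
    return matches / len(profile_a)


# Hypothetical profiles across ten genomes.
profiles = {
    "P1": "1101001101",
    "P2": "1101001001",  # nearly identical to P1
    "P3": "0010110010",  # complementary to P1
}

print(profile_similarity(profiles["P1"], profiles["P2"]))  # 0.9
print(profile_similarity(profiles["P1"], profiles["P3"]))  # 0.0
```

A plain matching score ignores how the genomes are related to one another, which is one reason published methods use more careful statistics.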

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantages of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold-standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

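likelihood ratio =

$$\frac{P(\mathrm{true\_PPI} \mid \mathrm{Data})}{P(\mathrm{false\_PPI} \mid \mathrm{Data})} = \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})\, P(\mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})\, P(\mathrm{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio =

$$\log \frac{P(\mathrm{true\_PPI})}{P(\mathrm{false\_PPI})} + \log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function =

$$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i=1}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})} = \sum_{i=1}^{M} \log \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$$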
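A naïve Bayes ranking function of this form can be sketched in a few lines. The detection probabilities and dataset names below are invented for illustration and are not taken from the lecture or from Jansen et al.; the function name is also our own.

```python
import math


def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum of per-observation log likelihood ratios (naive Bayes ranking score).

    observations:  {dataset name: observed value}, missing datasets simply omitted
    p_given_true:  {dataset name: {value: P(value | true PPI)}}
    p_given_false: {dataset name: {value: P(value | false PPI)}}
    """
    return sum(
        math.log(p_given_true[name][value] / p_given_false[name][value])
        for name, value in observations.items()
    )


# Hypothetical conditional probabilities for two experiments.
p_true = {"y2h": {True: 0.4, False: 0.6}, "affinity": {True: 0.5, False: 0.5}}
p_false = {"y2h": {True: 0.01, False: 0.99}, "affinity": {True: 0.05, False: 0.95}}

both = log_likelihood_ratio({"y2h": True, "affinity": True}, p_true, p_false)
only_y2h = log_likelihood_ratio({"y2h": True}, p_true, p_false)  # affinity data missing

print(f"detected by both: {both:.2f}")
print(f"detected by two-hybrid only (affinity missing): {only_y2h:.2f}")
```

Because the score is a sum over whichever observations are available, a pair with missing data can still be ranked, which is one of the practical advantages listed earlier.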

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

$$\mathrm{Likelihood} = L = \frac{P(f \mid \mathrm{pos})}{P(f \mid \mathrm{neg})} = \frac{1114 / 2150}{81924 / 573734} \approx 3.6$$
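The likelihood ratio for one feature value is just a ratio of two frequencies in the gold-standard sets. The sketch below reproduces the arithmetic with the four counts shown on the slide; that they refer to a single essentiality category is our reading, since the slide's table did not survive extraction.

```python
def likelihood_ratio(pos_with_feature: int, pos_total: int,
                     neg_with_feature: int, neg_total: int) -> float:
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as they appear on the slide: 1114 of 2150 gold-standard positives and
# 81924 of 573734 gold-standard negatives share the feature value.
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 1))  # 3.6
```

A value above 1 means the feature value is enriched among known interactions relative to known non-interactions.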


Fully connected → Compute probabilities for all 16 possible combinations
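For a fully connected model, one likelihood ratio is estimated per combination of detection outcomes instead of per dataset. The following sketch assumes four binary datasets (hence 2^4 = 16 combinations) and uses made-up gold-standard pairs; combinations with no positives or no negatives are skipped because no ratio can be estimated for them.

```python
from collections import Counter


def combination_likelihood_ratios(pos_pairs, neg_pairs):
    """Likelihood ratio for each observed combination of detection flags.

    pos_pairs / neg_pairs: one tuple of 0/1 flags per gold-standard pair,
    e.g. (1, 1, 0, 0) = detected by the first two datasets only.
    """
    pos_counts, neg_counts = Counter(pos_pairs), Counter(neg_pairs)
    ratios = {}
    for combo in set(pos_counts) | set(neg_counts):
        if pos_counts[combo] == 0 or neg_counts[combo] == 0:
            continue  # cannot estimate a ratio from an empty cell
        p_pos = pos_counts[combo] / len(pos_pairs)
        p_neg = neg_counts[combo] / len(neg_pairs)
        ratios[combo] = p_pos / p_neg
    return ratios


# Hypothetical gold-standard data.
positives = [(1, 1, 0, 0)] * 3 + [(0, 0, 0, 0)] * 1
negatives = [(1, 1, 0, 0)] * 1 + [(0, 0, 0, 0)] * 9

for combo, ratio in sorted(combination_likelihood_ratios(positives, negatives).items()):
    print(combo, round(ratio, 2))
```

With real data many of the 16 cells contain very few pairs, so the corresponding ratios are noisy.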

Interpret with caution as numbers are small


TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$$
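Evaluating a cutoff amounts to counting how many gold-standard positives and negatives score at or above it. The scores below are invented; in practice they would be the log likelihood ratios assigned to gold-standard pairs.

```python
def tp_fp_at_cutoff(scored_pairs, cutoff):
    """Count true and false positives among pairs scoring at or above the cutoff.

    scored_pairs: list of (log likelihood ratio, is_gold_standard_positive).
    """
    tp = fp = 0
    for score, is_gold_positive in scored_pairs:
        if score >= cutoff:
            if is_gold_positive:
                tp += 1
            else:
                fp += 1
    return tp, fp


# Hypothetical scored gold-standard pairs.
scored = [(5.2, True), (3.1, True), (2.4, False),
          (0.7, True), (0.2, False), (-1.5, False)]

for cutoff in (0.0, 2.0, 4.0):
    tp, fp = tp_fp_at_cutoff(scored, cutoff)
    print(f"cutoff {cutoff}: TP={tp}, FP={fp}")
```

Raising the cutoff typically trades sensitivity (fewer true positives recovered) for a better TP/FP ratio.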

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Bayesian Networks

A "natural" way to think about biological networks

Predict unknown variables from observations

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Y2H_Uetz, Y2H_Ito, IP_Gavin, IP_Ho)

Bayesian Networks

You could try to write out all the joint probabilities: P(PPI | Rosetta, CC, GO, MIPS, ESS)

But some problems are too big

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – Only one constraint (sum of all probabilities = 1) ⇒ 2^N − 1 parameters
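To make the parameter count concrete, the sketch below compares a full joint table with a factored network. The second function assumes binary nodes and counts one free parameter per node per configuration of its parents; the example network (one hidden cause with four observed effects) is our own choice for illustration.

```python
def full_joint_parameters(n_binary_variables: int) -> int:
    """Free parameters in a complete joint table over binary variables."""
    return 2 ** n_binary_variables - 1


def bayes_net_parameters(parent_counts: list[int]) -> int:
    """Free parameters in a Bayesian network of binary nodes.

    parent_counts lists the number of parents of each node; a node with
    k parents needs one probability for each of its 2**k parent configurations.
    """
    return sum(2 ** k for k in parent_counts)


n_evidence = 4  # e.g. four experimental datasets, each depending only on the hidden cause
print(full_joint_parameters(n_evidence + 1))         # 31
print(bayes_net_parameters([0] + [1] * n_evidence))  # 9
```

The gap widens quickly: the full table doubles with every added variable, while the factored count grows linearly when each new node has a single parent.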

Graphical Structure Expresses our Beliefs

[Figure: network in which the cause (often "hidden"), P1-P2 REAL, points to the observed effects: Detected by X1, …, Detected by Xn.]

Graphical Structure Expresses our Beliefs

[Figure: two networks. In the first, P1-P2 REAL points to Detected by X1, …, Detected by Xn. The second adds the nodes "Membrane" and "Highly expressed" alongside Detected by Xj, Detected by Xj+1, ….]

Naïve Bayes assumes all observations are independent (given whether the interaction is real)

But some observations may be coupled

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

Graphical Structure

In a Bayesian network we don't need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know the activity of TF B2.

Automated Admissions Decisions

[Figure: network with nodes Grade inflation, Smart, Grades, GREs, and Admit.]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).
  Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).
  You meet an admitted student. Is she smart?

Automated Admissions Decisions

Making predictions/inferences requires knowing the joint probabilities: P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

Joint Probability

  S   G (above threshold)   P(S,G)
  F   F                     0.665
  F   T                     0.035
  T   F                     0.06
  T   T                     0.24

Conditional Probability (network: Smart → Grades)

  P(S=F) = 0.7   P(S=T) = 0.3

  S   P(G=F|S)   P(G=T|S)
  F   0.95       0.05
  T   0.2        0.8

The formulations are equivalent, and both require the same number of independent parameters (three).
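The equivalence of the two formulations can be checked numerically: multiplying P(S) by P(G|S) reproduces the joint table, and marginalizing the joint recovers the conditional table. This is a small sketch with names of our own choosing.

```python
BOOL = (False, True)  # False = F, True = T

P_S = {False: 0.7, True: 0.3}
P_G_GIVEN_S = {False: {False: 0.95, True: 0.05}, True: {False: 0.2, True: 0.8}}

# Conditional -> joint: P(S, G) = P(S) P(G | S).
joint = {(s, g): P_S[s] * P_G_GIVEN_S[s][g] for s in BOOL for g in BOOL}

# Joint -> conditional: marginalize to get P(S), then P(G | S) = P(S, G) / P(S).
marginal_s = {s: joint[(s, False)] + joint[(s, True)] for s in BOOL}
recovered = {s: {g: joint[(s, g)] / marginal_s[s] for g in BOOL} for s in BOOL}

for (s, g), p in joint.items():
    print(f"S={s!s:5} G={g!s:5} P(S,G)={p:.3f}")
```

Either representation can be converted to the other without loss; the conditional form becomes the more economical one once conditional independencies let some tables shrink.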

Automated Admissions Decisions

[Figure: network with nodes Grade inflation, Smart, Grades, GREs, and Admit.]

In a Bayesian network we don't need the full probability distribution. The joint probability depends only on "parents". For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

"Explaining Away"

[Figure: network with nodes Season, Sprinkler (S), Rain (R), Grass wet, and Slippery.]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about the sprinkler once we have observed that the grass is wet.

"Explaining Away"

[Figure: network Coin 1 → Score ← Coin 2.]

p(C1=H) = p(C1=T) = 0.5     p(C2=H) = p(C2=T) = 0.5

  C1   C2   Score
  H    H    1
  T    T    1
  H    T    0
  T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
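The two-coin example is small enough to verify by listing all four equally likely outcomes. The sketch below uses exact fractions; the helper names `prob` and `cond` are our own.

```python
from fractions import Fraction
from itertools import product


def score(c1: str, c2: str) -> int:
    """Score is 1 when the two coins match and 0 otherwise."""
    return 1 if c1 == c2 else 0


# Four equally likely worlds: (C1, C2, Score), each with probability 1/4.
worlds = [(c1, c2, score(c1, c2)) for c1, c2 in product("HT", repeat=2)]
weight = Fraction(1, len(worlds))


def prob(event) -> Fraction:
    """Probability of the set of worlds for which event(c1, c2, s) is true."""
    return sum((weight for w in worlds if event(*w)), Fraction(0))


def cond(event, given) -> Fraction:
    """Conditional probability P(event | given)."""
    return prob(lambda *w: event(*w) and given(*w)) / prob(given)


def c1_heads(c1, c2, s):
    return c1 == "H"


print(cond(c1_heads, lambda c1, c2, s: c2 == "T"))             # 1/2
print(cond(c1_heads, lambda c1, c2, s: s == 1))                # 1/2
print(cond(c1_heads, lambda c1, c2, s: s == 1 and c2 == "T"))  # 0
```

Without the score, learning C2 tells us nothing about C1; once the score is known, C2 determines C1 completely.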

How do we obtain a BN?

• Two problems
  – Learning graph structure
    • NP-complete
    • Approximation algorithms
  – Probability distributions

Learning Models from Data

• Assume we know the structure. How do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

Learning Models from Data

[Figure: network Coin 1 → Score ← Coin 2, with the parameters to be learned.]

p(C1=H) = ?   p(C1=T) = ?     p(C2=H) = ?   p(C2=T) = ?

  C1   C2   Score
  H    H    ?
  T    T    ?
  H    T    ?
  T    H    ?

D = {(C1, C2, S)} = (H,T,0), (H,H,1), …
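For fully observed discrete data, the maximum-likelihood parameters of a known structure are simply observed frequencies. The sketch below estimates the coin network's parameters by counting; the data set is hypothetical beyond the two tuples shown on the slide, and the function name is our own.

```python
from collections import Counter


def estimate_coin_parameters(data):
    """Maximum-likelihood estimates for the Coin 1 -> Score <- Coin 2 network.

    data: list of (c1, c2, score) observations with c1, c2 in {"H", "T"}.
    Returns P(C1=H), P(C2=H), and P(Score=1 | C1, C2) for each observed pair.
    """
    n = len(data)
    p_c1_heads = sum(c1 == "H" for c1, _, _ in data) / n
    p_c2_heads = sum(c2 == "H" for _, c2, _ in data) / n

    pair_counts = Counter((c1, c2) for c1, c2, _ in data)
    score_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
    p_score_one = {pair: score_counts[pair] / count for pair, count in pair_counts.items()}
    return p_c1_heads, p_c2_heads, p_score_one


# First two tuples are from the slide; the rest are made up for illustration.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0),
     ("H", "H", 1), ("T", "H", 0)]

p_c1, p_c2, p_score = estimate_coin_parameters(D)
print(p_c1, p_c2)
print(p_score)
```

When some variables are hidden, counting is no longer possible directly, which is where iterative methods such as EM come in.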



Bayesian Networks

You could try to write out all the joint probabilities P(PPI|Y2HUetz Y2HIto IPGavin IPHo)

copy American Association for the Advancement of Science All rights reservedThis content is excluded from our Creative Commons license For moreinformation see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approachfor Predicting Protein-protein Interactions from Genomic Data Science 302no 5644 (2003) 449-53

75

Bayesian Networks

You could try to write out all the joint probabilities P(PPI|RosettaCCGOMIPSESS) But some problems are too big

Courtesy of Macmillan Publishers Limited Used with permissionSource Yang Xia Joshua L Deignan et al Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and NetworksNature genetics 41 no 4 (2009) 415-23

76

Bayesian Networks

bull Complete joint probability tables are large and often unknown

bull N binary variables = 2N states ndash only one constraint (sum of all probabilities =1)

=gt 2N - 1 parameters

77

Graphical Structure Expresses our Beliefs

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed)

78

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naiumlve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

bull The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure In a Bayesian Network we donrsquot need the full probability distribution A node is independent of its ancestors given its parents For example The activity of a gene does not depend on the activity of TF A1 once I know TF B2 82

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Prediction we observe the ldquocausesrdquo (roots parents) and want to predict the ldquoeffectsrdquo (leaveschildren)

Given Grades GREs will we admit

Inference we observe the ldquoeffectsrdquo (leaves children) and want to infer the hidden values of the ldquocausesrdquo (rootsparents)

You meet an admitted the student Is she smart

Admit

83

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictionsinferences requires knowing the joint probabilities P(admit grade inflation smart grades GREs) We will find conditional probabilities to be very useful

84

S G (above

threshold)

P(SG)

F F 0665

F T 0035

T F 006

T T 024

Automated Admissions Decisions Joint Probability

S P(G=F|S) P(G=T|S)

F 095 005

T 02 08

P(S=F)=07 P(S=T)=03

Grades

Smart

Conditional Probability

Formulations are equivalent and both require same number of constraints 85

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we donrsquot need the full probability distribution The joint probability depends only on ldquoparentsrdquo For example GRE scores do not depend on the level of grade inflation at the school but the grades do

Admit

86

ldquoExplaining Awayrdquo Season

S Sprinkler

R Rain

Grass wet

Does the probability that itrsquos raining depend on whether the sprinkler is on

Slippery

In a causal sense clearly not But in a probabilistic model the knowledge that it is raining influences our beliefs

87

Coin 1

Coin 2

Score

p(C2=H)=p(C2=T) = 5 p(C1=H)=p(C1=T) = 5

C1 C2 Score

H H 1

T T 1

H T 0

T H 0

Does the prob that C1 =H on depend on whether C2 =T If we know the score then our belief in the state of C1 is influenced by our belief in the state C2

ldquoExplaining Awayrdquo

88

How do we obtain a BN

bull Two problems ndash learning graph structure

bull NP-complete bull approximation algorithms

ndash probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

bull Two common objective functions ndash Maximum likelihood

bull Define the likelihood over training data Xi

ndash Maximum posterior

bull Good search algorithms exist ndash Gradient descent EM Gibbs Sampling hellip

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D= (C1C2S) = (HT0) (HH1) hellip

92

Learning Models from Data

bull Searching for the BN structure NP-complete ndash Too many possible structures to evaluate all of

them even for very small networks ndash Many algorithms have been proposed ndash Incorporated some prior knowledge can reduce

the search space bull Which measurements are likely independent bull Which nodes should regulate transcription bull Which should cause changes in phosphorylation

93

Resources to learn more

Kevin Murphyrsquos tutorial httpwwwcsubcca~murphykBayesbnintrohtml

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets S

P(F) P(T) 05 05

G S P(F) P(T) F 09 01 T 08 02 R

S P(F) P(T) F 08 02 T 02 08

A G R P(F) P(T) F F 09 01 F T 08 02 T F 05 05 T T 02 08

P(A=T|S=T) = ΣΣ P(G|S=T)P(R|S=T)P(A=T|GR) G=FT R=FT

= (08)(02)(01) + (08)(08)(02) + (02)(02)(05) + (02)(08)(08) =29 F F F T T F T T

Rough grading being smart doesnrsquot help very much

GREs are better correlated with intelligence than the grades

G Grade

s

S Smart

R GREs

A Admit

97

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(A=T) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) S=FT G=FT

R=FT

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) = 021 P(S=T) = 05 P(S=F) = 05 P(A=T|S=F) calculated analogously to P(A=T|S=T)

P(S=T) = 05 P(A=T|S=T) calculated on previous slide =29 P(A=T|S=F) =14

P(S=T|A=T) = P(S=TA=T)P(A=T) Or using Bayes Rule = P(S=T)P(A=T|S=T)P(A=T)

98

Inference with Bayes Nets

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F) = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_S Σ_G Σ_R P(S) P(G|S) P(R|S) P(A=F|G,R) = 1 − P(A=T)   (as before)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) = 0.92 / 0.56 = 1.6

99
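The "tedious but straightforward" sums can be left to a few lines of code. The sketch below performs inference by brute-force enumeration of the joint distribution, assuming the tables from the "Prediction with Bayes Nets" slide; the helper names `joint`, `prob` and `conditional` are illustrative.

```python
from itertools import product

BOOL = (False, True)
P_S = {False: 0.5, True: 0.5}
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P_G[s][g]
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P_R[s][r]
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}                     # P(A=T | G, R)


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a = P_A_TRUE[(g, r)] if a else 1.0 - P_A_TRUE[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * p_a


def prob(**fixed):
    """Marginal probability of the assignments in `fixed`, e.g. prob(r=False, a=False)."""
    total = 0.0
    for s, g, r, a in product(BOOL, repeat=4):
        world = {"s": s, "g": g, "r": r, "a": a}
        if all(world[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total


def conditional(query, evidence):
    """P(query | evidence) = P(query, evidence) / P(evidence)."""
    return prob(**query, **evidence) / prob(**evidence)


if __name__ == "__main__":
    print("P(S=T|A=T) =", round(conditional({"s": True}, {"a": True}), 3))
    print("P(G=F|A=F) =", round(conditional({"g": False}, {"a": False}), 3))
    print("P(R=F|A=F) =", round(conditional({"r": False}, {"a": False}), 3))
```

With the tables as transcribed, enumeration gives roughly 0.94 for P(G=F|A=F) and 0.55 for P(R=F|A=F). These differ slightly from the rounded values quoted on the slide, but the conclusion is the same: a rejected student is more likely to have had bad grades than bad GREs.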

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: "P1-P2 REAL" (cause, often "hidden") → "Detected by X1", …, "Detected by Xn" (effects, observed)]

101

Properties of real interactions: correlated expression
Expression Profile Reliability (EPR)

Deane et al. Mol & Cell Proteomics (2002) 1.5, 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure label: "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY.
Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103
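The co-evolution idea can be illustrated with phylogenetic profiles: strings recording in which genomes a protein has a homolog. The sketch below compares profiles with a plain Hamming distance. This is a deliberately simple stand-in and not the improved method of Cokus et al.; the protein names and profiles are invented for illustration.

```python
def profile_distance(profile_a: str, profile_b: str) -> int:
    """Hamming distance between two presence/absence profiles of equal length."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


# Hypothetical profiles over eight genomes: '1' = homolog present, '0' = absent.
profiles = {
    "protA": "11010011",
    "protB": "11010011",  # gained and lost together with protA
    "protC": "10101100",  # almost complementary to protA
}

if __name__ == "__main__":
    print(profile_distance(profiles["protA"], profiles["protB"]))
    print(profile_distance(profiles["protA"], profiles["protC"]))
```

A small distance means the two proteins tend to be present or absent in the same genomes, which is the pattern expected for functionally linked pairs.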

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold-standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use
Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

108


109

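likelihood ratio:

$$\frac{P(\text{true\_PPI} \mid \text{Data})}{P(\text{false\_PPI} \mid \text{Data})} = \frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio:

$$\log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function:

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$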

110
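The ranking function can be sketched in a few lines. Because the log of a product is a sum of logs, each observation contributes one additive term. The evidence names and likelihood values below are hypothetical, and the sketch scores only the observations listed for each pair, which is a simplification.

```python
import math

# Hypothetical per-evidence likelihoods:
# (P(observation | true_PPI), P(observation | false_PPI)).
LIKELIHOODS = {
    "two_hybrid_detected": (0.30, 0.01),
    "coexpressed": (0.60, 0.20),
    "same_compartment": (0.90, 0.50),
}


def log_likelihood_ratio(observations):
    """Sum of per-observation log ratios, i.e. the log of the product over i."""
    return sum(
        math.log(LIKELIHOODS[name][0] / LIKELIHOODS[name][1]) for name in observations
    )


def rank_pairs(pairs):
    """Sort protein pairs from most to least likely to interact."""
    return sorted(pairs, key=lambda pair: log_likelihood_ratio(pairs[pair]), reverse=True)


candidate_pairs = {
    ("P1", "P2"): ["two_hybrid_detected", "coexpressed"],
    ("P3", "P4"): ["same_compartment"],
}

if __name__ == "__main__":
    for pair in rank_pairs(candidate_pairs):
        print(pair, round(log_likelihood_ratio(candidate_pairs[pair]), 2))
```

Adding the log prior odds would shift every score by the same constant, so the ordering returned by `rank_pairs` would not change.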

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114 / 2150}{81924 / 573734} \approx 3.6$$

111
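The likelihood ratio for one feature value is just a ratio of two frequencies estimated from the gold-standard sets. The sketch below assumes the slide's figures read as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives; the function name is illustrative.

```python
def likelihood_ratio(pos_with_f: int, pos_total: int, neg_with_f: int, neg_total: int) -> float:
    """L = P(f | pos) / P(f | neg), with both probabilities estimated as frequencies."""
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (an assumption about how the figures pair up).
L = likelihood_ratio(1114, 2150, 81924, 573734)

if __name__ == "__main__":
    print(round(L, 1))
```

A value of L above 1 means the feature value is more common among true interactions than among non-interacting pairs, so observing it raises the odds that a pair interacts.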

112


113

Fully connected → compute probabilities for all 16 possible combinations


114

Interpret with caution as numbers are small


115

[Figure label: TP = FP]

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

[Axis label: log [P(Data | true_PPI) / P(Data | false_PPI)]]


116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


Bayesian Networks

You could try to write out all the joint probabilities P(PPI | Rosetta, CC, GO, MIPS, ESS), but some problems are too big

Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Yang, Xia, Joshua L. Deignan, et al. "Validation of Candidate Causal Genes for Obesity that Affect Shared Metabolic Pathways and Networks." Nature Genetics 41, no. 4 (2009): 415-23.

76

Bayesian Networks

• Complete joint probability tables are large and often unknown
• N binary variables = 2^N states
  – only one constraint (sum of all probabilities = 1)
  => 2^N − 1 parameters

77
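The parameter count grows quickly, which is the motivation for imposing graphical structure. The sketch below compares the full joint table with a structure in which one hidden binary cause has several binary effects that depend only on that cause. Both function names are illustrative, and treating every variable as binary is an assumption.

```python
def full_joint_parameters(n_vars: int) -> int:
    """Free parameters of an unrestricted joint table over n binary variables."""
    return 2 ** n_vars - 1


def naive_bayes_parameters(n_observations: int) -> int:
    """One hidden binary cause with n binary effects that depend only on the cause.

    1 parameter for P(cause), plus 2 per effect:
    P(effect=T | cause=F) and P(effect=T | cause=T).
    """
    return 1 + 2 * n_observations


if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        print(n, full_joint_parameters(n + 1), naive_bayes_parameters(n))
```

For one cause and five observed effects (six binary variables in total) the full table needs 63 parameters, while the structured model needs 11.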

Graphical Structure Expresses our Beliefs

[Diagram: "P1-P2 REAL" (cause, often "hidden") → "Detected by X1", …, "Detected by Xn" (effects, observed)]

78

Graphical Structure Expresses our Beliefs

[Diagram, first network: "P1-P2 REAL" → "Detected by X1", …, "Detected by Xn"]

[Diagram, second network: "P1-P2 REAL" → "Detected by X1", …, "Detected by Xj", "Detected by Xj+1", …, "Detected by Xn", with additional nodes "Membrane" and "Highly expressed"]

79

Graphical Structure Expresses our Beliefs

Naïve Bayes assumes all observations are independent (given whether the interaction is real)

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs

• The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure

In a Bayesian Network we don't need the full probability distribution.
A node is independent of its ancestors given its parents.
For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Diagram nodes: Grade inflation, Smart, Grades, GREs, Admit]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children)

Given Grades, GREs: will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents)

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs)

We will find conditional probabilities to be very useful

84

Automated Admissions Decisions

[Diagram: Smart → Grades]

Joint Probability

  S   G (above threshold)   P(S,G)
  F   F                     0.665
  F   T                     0.035
  T   F                     0.06
  T   T                     0.24

Conditional Probability

  P(S=F) = 0.7   P(S=T) = 0.3

  S   P(G=F|S)   P(G=T|S)
  F   0.95       0.05
  T   0.2        0.8

Formulations are equivalent and both require the same number of constraints

85
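The equivalence of the two formulations can be verified numerically by rebuilding the joint table from P(S) and P(G|S). The sketch below uses the values from this slide (True stands for T, False for F); the variable names are illustrative.

```python
import math

P_S = {False: 0.7, True: 0.3}                      # P(S)
P_G_GIVEN_S = {False: {False: 0.95, True: 0.05},   # P(G | S=F)
               True: {False: 0.2, True: 0.8}}      # P(G | S=T)
JOINT = {(False, False): 0.665, (False, True): 0.035,
         (True, False): 0.06, (True, True): 0.24}  # P(S, G) from the slide

# Rebuild the joint table from the conditional formulation: P(S, G) = P(S) P(G | S).
rebuilt = {(s, g): P_S[s] * P_G_GIVEN_S[s][g] for s in P_S for g in (False, True)}

matches = all(math.isclose(rebuilt[key], JOINT[key]) for key in JOINT)

if __name__ == "__main__":
    print(matches)
```

Both forms have three free numbers: the joint table has four entries constrained to sum to 1, and the conditional form needs P(S=T) plus one entry per row of P(G|S).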

Automated Admissions Decisions

In a Bayesian Network we don't need the full probability distribution

The joint probability depends only on "parents"

For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do

86

"Explaining Away"

[Diagram nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about the sprinkler

87

"Explaining Away"

[Diagram: Coin 1 → Score ← Coin 2]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

  C1   C2   Score
  H    H    1
  T    T    1
  H    T    0
  T    H    0

Does the probability that C1=H depend on whether C2=T?
If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2

88
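The two-coin example is small enough to enumerate, which makes the effect explicit. The sketch below computes P(C1=H) under different evidence, assuming fair independent coins and a score of 1 when the coins match, as in the table above; the function names are illustrative.

```python
from itertools import product

COINS = ("H", "T")


def joint():
    """Yield (c1, c2, score, probability) for two fair, independent coins."""
    for c1, c2 in product(COINS, repeat=2):
        score = 1 if c1 == c2 else 0
        yield c1, c2, score, 0.25


def p_c1_heads(score=None, c2=None):
    """P(C1=H | optional evidence on Score and C2)."""
    consistent = [
        (c1, p) for c1, coin2, s, p in joint()
        if (score is None or s == score) and (c2 is None or coin2 == c2)
    ]
    total = sum(p for _, p in consistent)
    return sum(p for c1, p in consistent if c1 == "H") / total


if __name__ == "__main__":
    print(p_c1_heads())                 # no evidence
    print(p_c1_heads(c2="T"))           # C2 alone tells us nothing about C1
    print(p_c1_heads(score=1))          # the score alone tells us nothing either
    print(p_c1_heads(score=1, c2="T"))  # together they determine C1
```

Without the score, learning C2 leaves P(C1=H) at 0.5. Once the score is known, C2 fixes C1 completely: the two causes become dependent when their common effect is observed.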

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

[Diagram: Coin 1 → Score ← Coin 2]

Parameters to be learned from the data:

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

  C1   C2   Score
  H    H    ?
  T    T    ?
  H    T    ?
  T    H    ?

D = (C1, C2, S) = (H, T, 0), (H, H, 1), …

92
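When every variable is observed in every training example, the maximum-likelihood parameters of a discrete Bayes net reduce to observed frequencies, so no iterative search is needed. The sketch below illustrates this for the two-coin network. The data set is hypothetical: only its first two rows appear on the slide, and the rest are invented for illustration.

```python
from collections import Counter

# Hypothetical observations (C1, C2, Score); rows after the first two are invented.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("H", "T", 0), ("T", "H", 0)]


def ml_estimate_heads(observations, coin_index):
    """Maximum-likelihood estimate of p(coin=H): the observed fraction of heads."""
    counts = Counter(row[coin_index] for row in observations)
    return counts["H"] / len(observations)


def ml_score_table(observations):
    """Estimate P(Score=1 | C1, C2) for each coin combination seen in the data."""
    seen = Counter((c1, c2) for c1, c2, _ in observations)
    ones = Counter((c1, c2) for c1, c2, s in observations if s == 1)
    return {combo: ones[combo] / n for combo, n in seen.items()}


if __name__ == "__main__":
    print("p(C1=H) =", ml_estimate_heads(data, 0))
    print("p(C2=H) =", ml_estimate_heads(data, 1))
    print(ml_score_table(data))
```

Search procedures such as gradient descent or EM become necessary when some variables are hidden or the data are incomplete, because the counts used here are then not directly available.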

Page 77: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

Bayesian Networks

bull Complete joint probability tables are large and often unknown

bull N binary variables = 2N states ndash only one constraint (sum of all probabilities =1)

=gt 2N - 1 parameters

77

Graphical Structure Expresses our Beliefs

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed)

78

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naiumlve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

bull The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure In a Bayesian Network we donrsquot need the full probability distribution A node is independent of its ancestors given its parents For example The activity of a gene does not depend on the activity of TF A1 once I know TF B2 82

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Prediction we observe the ldquocausesrdquo (roots parents) and want to predict the ldquoeffectsrdquo (leaveschildren)

Given Grades GREs will we admit

Inference we observe the ldquoeffectsrdquo (leaves children) and want to infer the hidden values of the ldquocausesrdquo (rootsparents)

You meet an admitted the student Is she smart

Admit

83

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictionsinferences requires knowing the joint probabilities P(admit grade inflation smart grades GREs) We will find conditional probabilities to be very useful

84

S G (above

threshold)

P(SG)

F F 0665

F T 0035

T F 006

T T 024

Automated Admissions Decisions Joint Probability

S P(G=F|S) P(G=T|S)

F 095 005

T 02 08

P(S=F)=07 P(S=T)=03

Grades

Smart

Conditional Probability

Formulations are equivalent and both require same number of constraints 85

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we donrsquot need the full probability distribution The joint probability depends only on ldquoparentsrdquo For example GRE scores do not depend on the level of grade inflation at the school but the grades do

Admit

86

ldquoExplaining Awayrdquo Season

S Sprinkler

R Rain

Grass wet

Does the probability that itrsquos raining depend on whether the sprinkler is on

Slippery

In a causal sense clearly not But in a probabilistic model the knowledge that it is raining influences our beliefs

87

Coin 1

Coin 2

Score

p(C2=H)=p(C2=T) = 5 p(C1=H)=p(C1=T) = 5

C1 C2 Score

H H 1

T T 1

H T 0

T H 0

Does the prob that C1 =H on depend on whether C2 =T If we know the score then our belief in the state of C1 is influenced by our belief in the state C2

ldquoExplaining Awayrdquo

88

How do we obtain a BN

bull Two problems ndash learning graph structure

bull NP-complete bull approximation algorithms

ndash probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

bull Two common objective functions ndash Maximum likelihood

bull Define the likelihood over training data Xi

ndash Maximum posterior

bull Good search algorithms exist ndash Gradient descent EM Gibbs Sampling hellip

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D= (C1C2S) = (HT0) (HH1) hellip

92

Learning Models from Data

bull Searching for the BN structure NP-complete ndash Too many possible structures to evaluate all of

them even for very small networks ndash Many algorithms have been proposed ndash Incorporated some prior knowledge can reduce

the search space bull Which measurements are likely independent bull Which nodes should regulate transcription bull Which should cause changes in phosphorylation

93

Resources to learn more

Kevin Murphyrsquos tutorial httpwwwcsubcca~murphykBayesbnintrohtml

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets S

P(F) P(T) 05 05

G S P(F) P(T) F 09 01 T 08 02 R

S P(F) P(T) F 08 02 T 02 08

A G R P(F) P(T) F F 09 01 F T 08 02 T F 05 05 T T 02 08

P(A=T|S=T) = ΣΣ P(G|S=T)P(R|S=T)P(A=T|GR) G=FT R=FT

= (08)(02)(01) + (08)(08)(02) + (02)(02)(05) + (02)(08)(08) =29 F F F T T F T T

Rough grading being smart doesnrsquot help very much

GREs are better correlated with intelligence than the grades

G Grade

s

S Smart

R GREs

A Admit

97

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(A=T) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) S=FT G=FT

R=FT

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) = 021 P(S=T) = 05 P(S=F) = 05 P(A=T|S=F) calculated analogously to P(A=T|S=T)

P(S=T) = 05 P(A=T|S=T) calculated on previous slide =29 P(A=T|S=F) =14

P(S=T|A=T) = P(S=TA=T)P(A=T) Or using Bayes Rule = P(S=T)P(A=T|S=T)P(A=T)

98

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

If a student is not admitted is it more likely they had bad GREs or bad grades

Compute P(R=F|A=F) and P(G=F|A=T)

P(R=F|A=F) = P(R=FA=F) P(A=F) = [Σ Σ P(S)P(G=F)P(R)P(A=F)]P(A=F) G=FT

R=FT

P(A=F) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) (as before)

Tedious but straightforward to compute

P(G=F|A=T) = 92 = 16 P(R=F|A=F) 56

99

End of worked example

100

Goal

bull Estimate interaction probability using ndash Affinity capture ndash Two-hybrid ndash Less physical data

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed) 101

Properties of real interactions correlated expression Expression Profile Reliability (EPR)

Deane et al Mol amp Cell Proteomics (2002) 15 349-356

INT = high confidence interactions from

small scale experiments

d = ldquodistancerdquo that measures the difference between two mRNA expression profiles

Note proteins involved in ldquotruerdquo protein-protein interactions have more similar mRNA expression profiles than random pairs Use this to assess how good an experimental set

of interactions is

Source Deane Charlotte M Łukasz Salwiński et al Protein Interactions Two Methods for Aassessment ofthe Reliability of High Throughput Observations Molecular amp Cellular Proteomics 1 no 5 (2002) 349-56

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns, one labeled “More likely to interact”]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. “An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles.” BMC Bioinformatics 8, no. Suppl 4 (2007): S7.
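A phylogenetic profile records, for each genome, whether a protein is present or absent; proteins that are gained and lost together are candidates for functional linkage. The sketch below illustrates that comparison with a simple agreement score. It is an assumption-laden toy, not the method of Cokus et al.: the profiles are invented and the function name `profile_agreement` is hypothetical.

```python
def profile_agreement(profile_a, profile_b):
    """Fraction of genomes in which two proteins are both present or both absent."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    matches = sum(a == b for a, b in zip(profile_a, profile_b))
    return matches / len(profile_a)

# Hypothetical presence (1) / absence (0) calls across eight genomes.
protein_a = [1, 0, 1, 1, 0, 0, 1, 0]
protein_b = [1, 0, 1, 1, 0, 0, 1, 0]   # same pattern as protein_a
protein_c = [0, 1, 1, 0, 1, 0, 0, 1]   # unrelated pattern

print(profile_agreement(protein_a, protein_b))
print(profile_agreement(protein_a, protein_c))
```

Raw agreement ignores how closely related the genomes are, so real methods weight or correct for phylogeny; the toy score only conveys the intuition.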

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

108

109

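likelihood ratio = $\dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$; if > 1 classify as true, if < 1 classify as false

log likelihood ratio = $\log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function = $\log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i=1}^{M} \dfrac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})} = \sum_{i=1}^{M} \log \dfrac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$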
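The ranking function can be sketched in a few lines. This is an illustrative example under the naïve Bayes independence assumption, in which the log of the product becomes a sum of per-observation log ratios; the evidence names, probabilities and protein pairs are all hypothetical values chosen for the demonstration.

```python
import math

def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum of per-observation log ratios (naive Bayes independence assumed)."""
    return sum(
        math.log(p_given_true[name][value] / p_given_false[name][value])
        for name, value in observations.items()
    )

# Hypothetical conditional probabilities for two evidence types.
p_given_true = {
    "two_hybrid": {True: 0.30, False: 0.70},
    "coexpressed": {True: 0.60, False: 0.40},
}
p_given_false = {
    "two_hybrid": {True: 0.01, False: 0.99},
    "coexpressed": {True: 0.20, False: 0.80},
}

# Hypothetical observations for three candidate protein pairs.
pairs = {
    "A-B": {"two_hybrid": True, "coexpressed": True},
    "C-D": {"two_hybrid": False, "coexpressed": True},
    "E-F": {"two_hybrid": False, "coexpressed": False},
}

scores = {pair: log_likelihood_ratio(obs, p_given_true, p_given_false)
          for pair, obs in pairs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the function only sums over the observations it is given, a pair with a missing measurement can simply omit that key, which is one way to accommodate missing data.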

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

Likelihood = L = $\dfrac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \dfrac{1114/2150}{81924/573734}$
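The likelihood for one feature value is just a ratio of two frequencies in the gold-standard sets. The sketch below assumes the two number strings on the slide are read as 1114/2150 (gold-standard positives with the feature over all positives) and 81924/573734 (the same for negatives); that reading is an assumption about the extraction, and the function name is hypothetical.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as read from the slide (assumed split of the fused digits).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 1))
```

Under that reading the feature value is about 3.6 times more frequent among gold-standard positives than among negatives.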

111

112


113

Fully connected → compute probabilities for all 16 possible combinations
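In a fully connected network the evidence sources are not assumed independent, so a likelihood ratio is estimated for each joint outcome rather than for each source separately. The sketch below assumes four binary sources (which gives 2^4 = 16 combinations) and made-up gold-standard examples; the pseudocount is an addition of this sketch to keep rarely seen combinations from producing a division by zero, not something stated on the slide.

```python
from collections import Counter
from itertools import product

def joint_likelihood_ratios(pos_examples, neg_examples, n_sources=4, pseudocount=1.0):
    """L(combo) = P(combo | pos) / P(combo | neg) for every joint outcome."""
    combos = list(product([0, 1], repeat=n_sources))
    pos_counts, neg_counts = Counter(pos_examples), Counter(neg_examples)
    pos_total = len(pos_examples) + pseudocount * len(combos)
    neg_total = len(neg_examples) + pseudocount * len(combos)
    return {
        c: ((pos_counts[c] + pseudocount) / pos_total)
           / ((neg_counts[c] + pseudocount) / neg_total)
        for c in combos
    }

# Hypothetical gold-standard examples: one tuple of four binary calls per pair.
pos = [(1, 1, 0, 0)] * 5 + [(1, 0, 0, 0)] * 3 + [(0, 0, 0, 0)] * 2
neg = [(0, 0, 0, 0)] * 40 + [(1, 0, 0, 0)] * 8 + [(0, 1, 0, 0)] * 2

ratios = joint_likelihood_ratios(pos, neg)
print(len(ratios))
```

Most of the 16 cells are supported by very few examples, so their ratios are dominated by the pseudocount and should be read with caution.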

114

Interpret with caution as numbers are small


115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

[Figure axis: $\log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$]
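Counting true and false positives at a series of cutoffs is a simple loop over scored gold-standard pairs. This is a minimal sketch with invented scores and labels; the function name `tp_fp_at_cutoffs` is hypothetical.

```python
def tp_fp_at_cutoffs(scored_pairs, cutoffs):
    """For each cutoff, count gold positives (TP) and gold negatives (FP) scoring at or above it."""
    results = {}
    for cutoff in cutoffs:
        tp = sum(1 for score, is_true in scored_pairs if score >= cutoff and is_true)
        fp = sum(1 for score, is_true in scored_pairs if score >= cutoff and not is_true)
        results[cutoff] = (tp, fp)
    return results

# Hypothetical (log likelihood ratio, gold-standard label) pairs.
scored_pairs = [(4.5, True), (3.1, True), (2.2, False), (1.0, True),
                (0.4, False), (-0.3, False), (-1.2, True), (-2.0, False)]

counts = tp_fp_at_cutoffs(scored_pairs, cutoffs=[3.0, 0.0])
for cutoff, (tp, fp) in counts.items():
    print(cutoff, tp, fp)
```

Raising the cutoff trades coverage for precision: fewer gold-standard positives are recovered, but a larger share of the calls are correct.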

116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology, Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Graphical Structure Expresses our Beliefs

[Figure: Bayes net in which the cause (often “hidden”), “P1-P2 REAL”, points to the observed effects “Detected by X1”, …, “Detected by Xn”]

78

Graphical Structure Expresses our Beliefs

[Figure: two Bayes nets. First: “P1-P2 REAL” with “Detected by X1”, …, “Detected by Xn”. Second: “P1-P2 REAL” with “Detected by X1”, …, “Detected by Xn”, “Detected by Xj”, “Detected by Xj+1”, …, plus the additional nodes “Membrane” and “Highly expressed”]

79

Graphical Structure Expresses our Beliefs

[Figure: two Bayes nets. First: “P1-P2 REAL” with “Detected by X1”, …, “Detected by Xn”. Second: the same nodes plus “Detected by Xj”, “Detected by Xj+1”, …, “Membrane” and “Highly expressed”]

Naïve Bayes assumes all observations are independent of one another, given whether the interaction is real

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs

[Figure: two Bayes nets. First: “P1-P2 REAL” with “Detected by X1”, …, “Detected by Xn”. Second: the same nodes plus “Detected by Xj”, “Detected by Xj+1”, …, “Membrane” and “Highly expressed”]

• The graphical structure can be decided in advance, based on knowledge of the system, or learned from the data

81

Graphical Structure

In a Bayesian Network we don’t need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Figure: Bayes net with nodes Grade inflation, Smart, Grades, GREs, Admit]

Prediction: we observe the “causes” (roots/parents) and want to predict the “effects” (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the “effects” (leaves/children) and want to infer the hidden values of the “causes” (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Figure: Bayes net with nodes Grade inflation, Smart, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7, P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

[Figure: Bayes net with nodes Smart and Grades]

Formulations are equivalent and both require the same number of constraints.

85
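The equivalence of the two formulations can be checked by multiplying the conditional table by the prior. The sketch below reads the slide's values with decimal points restored (an assumption about the extraction) and rebuilds the joint table from P(S) and P(G|S); variable names are hypothetical.

```python
from itertools import product

p_s = {False: 0.7, True: 0.3}                       # P(S)
p_g_given_s = {False: {False: 0.95, True: 0.05},    # P(G | S=F)
               True: {False: 0.2, True: 0.8}}       # P(G | S=T)

# P(S, G) = P(S) * P(G | S) for every combination of S and G.
joint = {(s, g): p_s[s] * p_g_given_s[s][g]
         for s, g in product([False, True], repeat=2)}

for (s, g), p in joint.items():
    print(s, g, round(p, 3))
```

Both forms have three free numbers: the joint table has four entries that must sum to one, and the conditional form has one free prior plus one free entry per row of P(G|S).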

Automated Admissions Decisions

[Figure: Bayes net with nodes Grade inflation, Smart, Grades, GREs, Admit]

In a Bayesian Network we don’t need the full probability distribution. The joint probability factors into terms that depend only on each node’s “parents”. For example: GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

“Explaining Away”

[Figure: Bayes net with nodes Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about the sprinkler.

87

“Explaining Away”

[Figure: Bayes net with nodes Coin 1, Coin 2, Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1    C2    Score
H     H     1
T     T     1
H     T     0
T     H     0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
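The two-coin example is small enough to enumerate. The sketch below is an illustration added here, assuming two fair, independent coins and a score of 1 when they match; it computes P(C1=H) with and without knowledge of the score. The function name `prob_c1_heads` is hypothetical.

```python
from itertools import product

def score(c1, c2):
    """Score is 1 when the two coins match, 0 otherwise."""
    return 1 if c1 == c2 else 0

# Four equally likely worlds for two fair, independent coins.
worlds = [(c1, c2, 0.25) for c1, c2 in product("HT", repeat=2)]

def prob_c1_heads(c2=None, s=None):
    """P(C1=H | optional evidence on C2 and Score), by enumeration."""
    consistent = [(c1, p) for c1, w2, p in worlds
                  if (c2 is None or w2 == c2) and (s is None or score(c1, w2) == s)]
    total = sum(p for _, p in consistent)
    return sum(p for c1, p in consistent if c1 == "H") / total

print(prob_c1_heads())               # no evidence
print(prob_c1_heads(c2="T"))         # C2 known, score unknown
print(prob_c1_heads(c2="T", s=1))    # C2 and score both known
```

Knowing C2 alone leaves the belief about C1 unchanged, but once the score is also known, C2 determines C1 completely: the parents become dependent when their common child is observed.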

88

How do we obtain a BN?

• Two problems
  – Learning graph structure
    • NP-complete
    • Approximation algorithms
  – Learning probability distributions

89

Learning Models from Data

• Assume we know the structure. How do we find the parameters?
• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data X_i
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91

Learning Models from Data

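[Figure: Bayes net with nodes Coin 1, Coin 2, Score]

Parameters to be learned from the data:

p(C1=H) = ?, p(C1=T) = ?
p(C2=H) = ?, p(C2=T) = ?

C1    C2    Score
H     H     ?
T     T     ?
H     T     ?
T     H     ?

D = (C1, C2, S) = (H, T, 0), (H, H, 1), …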
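When the structure is known and every variable is observed, maximum likelihood estimation of the tables reduces to counting. The sketch below assumes a small made-up data set D in the slide's (C1, C2, Score) format; only the first two triples come from the slide, the rest are hypothetical.

```python
from collections import Counter

# Hypothetical training data D: (C1, C2, Score) triples.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("H", "T", 0),
        ("T", "H", 0), ("H", "H", 1), ("T", "T", 1), ("H", "H", 1)]

n = len(data)
# Maximum likelihood estimates of the root probabilities are observed frequencies.
p_c1_heads = sum(c1 == "H" for c1, _, _ in data) / n
p_c2_heads = sum(c2 == "H" for _, c2, _ in data) / n

# P(Score=1 | C1, C2): fraction of rows with each parent configuration that score 1.
parent_counts = Counter((c1, c2) for c1, c2, _ in data)
score_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
p_score1 = {parents: score_counts[parents] / count
            for parents, count in parent_counts.items()}

print(p_c1_heads, p_c2_heads)
print(p_score1)
```

With hidden variables or missing values the counts are no longer available directly, which is where iterative methods such as EM come in.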

92

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

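[Figure: Bayes net with nodes S (Smart), G (Grade), R (GREs), A (Admit)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

= P(S) P(G|S) P(R|S) P(A|G,R) (why?)

(because of the conditional independence assumptions: R is independent of G given S, and A is independent of S given G and R)

Recall: P(X,Y) = P(X|Y) P(Y)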

96

Prediction with Bayes Nets

P(S):
P(F)    P(T)
0.5     0.5

P(G|S):
S    P(F)    P(T)
F    0.9     0.1
T    0.8     0.2

P(R|S):
S    P(F)    P(T)
F    0.8     0.2
T    0.2     0.8

P(A|G,R):
G    R    P(F)    P(T)
F    F    0.9     0.1
F    T    0.8     0.2
T    F    0.5     0.5
T    T    0.2     0.8

P(A=T|S=T) = Σ_G Σ_R P(G|S=T) P(R|S=T) P(A=T|G,R), summing over G = F,T and R = F,T

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
(the four terms correspond to G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn’t help very much

GREs are better correlated with intelligence than the grades

[Figure: Bayes net with nodes S (Smart), G (Grade), R (GREs), A (Admit)]
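The four-term sum for P(A=T|S=T) can be reproduced with a short loop over the values of G and R. This is a minimal sketch, assuming the table entries as read from the slide with decimal points restored; the dictionary and function names are hypothetical.

```python
from itertools import product

p_g_true = {False: 0.1, True: 0.2}   # P(G=T | S)
p_r_true = {False: 0.2, True: 0.8}   # P(R=T | S)
p_a_true = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)

def p_admit_given_smart(s):
    """P(A=T | S=s): marginalize G and R out of P(G|S) P(R|S) P(A=T|G,R)."""
    total = 0.0
    for g, r in product([False, True], repeat=2):
        p_g = p_g_true[s] if g else 1 - p_g_true[s]
        p_r = p_r_true[s] if r else 1 - p_r_true[s]
        total += p_g * p_r * p_a_true[(g, r)]
    return total

print(round(p_admit_given_smart(True), 3))
```

For S=T the loop gives 0.292, which rounds to the 0.29 quoted above.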

97

Page 79: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

79

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

Naiumlve Bayes assumes all observations are independent

But some observations may be coupled

80

Graphical Structure Expresses our Beliefs P1-P2

REAL

Detected by X1

Detected by Xn

hellip

P1-P2 REAL

Detected by X1

Detected by Xn hellip Detected

by Xj Detected by Xj+1 hellip

Membrane Highly expressed

bull The graphical structure can be decided in advance based on knowledge of the system or learned from the data

81

Graphical Structure In a Bayesian Network we donrsquot need the full probability distribution A node is independent of its ancestors given its parents For example The activity of a gene does not depend on the activity of TF A1 once I know TF B2 82

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Prediction we observe the ldquocausesrdquo (roots parents) and want to predict the ldquoeffectsrdquo (leaveschildren)

Given Grades GREs will we admit

Inference we observe the ldquoeffectsrdquo (leaves children) and want to infer the hidden values of the ldquocausesrdquo (rootsparents)

You meet an admitted the student Is she smart

Admit

83

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictionsinferences requires knowing the joint probabilities P(admit grade inflation smart grades GREs) We will find conditional probabilities to be very useful

84

S G (above

threshold)

P(SG)

F F 0665

F T 0035

T F 006

T T 024

Automated Admissions Decisions Joint Probability

S P(G=F|S) P(G=T|S)

F 095 005

T 02 08

P(S=F)=07 P(S=T)=03

Grades

Smart

Conditional Probability

Formulations are equivalent and both require same number of constraints 85

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we donrsquot need the full probability distribution The joint probability depends only on ldquoparentsrdquo For example GRE scores do not depend on the level of grade inflation at the school but the grades do

Admit

86

ldquoExplaining Awayrdquo Season

S Sprinkler

R Rain

Grass wet

Does the probability that itrsquos raining depend on whether the sprinkler is on

Slippery

In a causal sense clearly not But in a probabilistic model the knowledge that it is raining influences our beliefs

87

Coin 1

Coin 2

Score

p(C2=H)=p(C2=T) = 5 p(C1=H)=p(C1=T) = 5

C1 C2 Score

H H 1

T T 1

H T 0

T H 0

Does the prob that C1 =H on depend on whether C2 =T If we know the score then our belief in the state of C1 is influenced by our belief in the state C2

ldquoExplaining Awayrdquo

88

How do we obtain a BN

bull Two problems ndash learning graph structure

bull NP-complete bull approximation algorithms

ndash probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

bull Two common objective functions ndash Maximum likelihood

bull Define the likelihood over training data Xi

ndash Maximum posterior

bull Good search algorithms exist ndash Gradient descent EM Gibbs Sampling hellip

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D= (C1C2S) = (HT0) (HH1) hellip

92

Learning Models from Data

bull Searching for the BN structure NP-complete ndash Too many possible structures to evaluate all of

them even for very small networks ndash Many algorithms have been proposed ndash Incorporated some prior knowledge can reduce

the search space bull Which measurements are likely independent bull Which nodes should regulate transcription bull Which should cause changes in phosphorylation

93

Resources to learn more

Kevin Murphyrsquos tutorial httpwwwcsubcca~murphykBayesbnintrohtml

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets S

P(F) P(T) 05 05

G S P(F) P(T) F 09 01 T 08 02 R

S P(F) P(T) F 08 02 T 02 08

A G R P(F) P(T) F F 09 01 F T 08 02 T F 05 05 T T 02 08

P(A=T|S=T) = ΣΣ P(G|S=T)P(R|S=T)P(A=T|GR) G=FT R=FT

= (08)(02)(01) + (08)(08)(02) + (02)(02)(05) + (02)(08)(08) =29 F F F T T F T T

Rough grading being smart doesnrsquot help very much

GREs are better correlated with intelligence than the grades

G Grade

s

S Smart

R GREs

A Admit

97

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(A=T) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) S=FT G=FT

R=FT

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) = 021 P(S=T) = 05 P(S=F) = 05 P(A=T|S=F) calculated analogously to P(A=T|S=T)

P(S=T) = 05 P(A=T|S=T) calculated on previous slide =29 P(A=T|S=F) =14

P(S=T|A=T) = P(S=TA=T)P(A=T) Or using Bayes Rule = P(S=T)P(A=T|S=T)P(A=T)

98

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

If a student is not admitted is it more likely they had bad GREs or bad grades

Compute P(R=F|A=F) and P(G=F|A=T)

P(R=F|A=F) = P(R=FA=F) P(A=F) = [Σ Σ P(S)P(G=F)P(R)P(A=F)]P(A=F) G=FT

R=FT

P(A=F) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) (as before)

Tedious but straightforward to compute

P(G=F|A=T) = 92 = 16 P(R=F|A=F) 56

99

End of worked example

100

Goal

bull Estimate interaction probability using ndash Affinity capture ndash Two-hybrid ndash Less physical data

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed) 101

Properties of real interactions correlated expression Expression Profile Reliability (EPR)

Deane et al Mol amp Cell Proteomics (2002) 15 349-356

INT = high confidence interactions from

small scale experiments

d = ldquodistancerdquo that measures the difference between two mRNA expression profiles

Note proteins involved in ldquotruerdquo protein-protein interactions have more similar mRNA expression profiles than random pairs Use this to assess how good an experimental set

of interactions is

Source Deane Charlotte M Łukasz Salwiński et al Protein Interactions Two Methods for Aassessment ofthe Reliability of High Throughput Observations Molecular amp Cellular Proteomics 1 no 5 (2002) 349-56

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al BMC Bioinformatics 2007 8(Suppl 4)S7 doi1011861471-2105-8-S4-S7

More likely to interact

Which pattern below is more likely to represent a pair of interacting proteins

104

Courtesy of Cokus et al License CC-BYSource Cokus Shawn Sayaka Mizutani et al An Improved Method for Identifying FunctionallyLinked Proteins using Phylogenetic Profiles BMC Bioinformatics 8 no Suppl 4 (2007) S7

103

Rosetta Stone

bull Look for genes that are fused in some organisms ndash Almost 7000 pairs found in

E coli ndash gt6 of known interactions

can be found with this method

ndash Not very common in eukaryotes

104

Integrating diverse data

httpwwwsciencemagorgcontent3025644449abstract 105

Advantage of Bayesian Networks

bull Data can be a mix of types numerical and categorical

bull Accommodates missing data bull Give appropriate weights to different sources bull Results can be interpreted easily

106

Requirement of Bayesian Classification

bull Gold standard training data ndash Independent from evidence ndash Large ndash No systematic bias

Positive training data MIPS bull Hand-curated from literature Negative training data bull Proteins in different subcellular compartments

107

Integrating diverse data

109

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i

P Observation true PPIP Data true PPIP Data false PPI P Observation false PPI

=

prod

110

Protein pairs in the essentiality data can take on three discrete values (EE both essential NN both non-essential and NE one essential and one not)

11142150

81924573734

Likelihood=L= )|()|(

negfPposfP

111

112


Fully connected → compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small.

TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$
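Counting gold-standard hits at different cutoffs could look like the sketch below. The scores and labels in `scored` are hypothetical, and the function name is an assumption made for this example.

```python
def tp_fp_at_cutoff(scored_pairs, cutoff):
    """Count gold-standard positives (TP) and negatives (FP) scoring >= cutoff."""
    tp = sum(1 for score, is_positive in scored_pairs if score >= cutoff and is_positive)
    fp = sum(1 for score, is_positive in scored_pairs if score >= cutoff and not is_positive)
    return tp, fp


# Hypothetical (log likelihood ratio, gold-standard label) pairs.
scored = [(5.2, True), (4.1, True), (3.0, False), (2.5, True),
          (1.0, False), (0.2, False), (-1.3, True), (-2.0, False)]

# TP and FP counts at three example cutoffs.
table = {cutoff: tp_fp_at_cutoff(scored, cutoff) for cutoff in (4.0, 2.0, 0.0)}
```

Lowering the cutoff recovers more true positives but admits more false positives, which is the trade-off the TP/FP ratio summarizes.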

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


Graphical Structure Expresses our Beliefs

[Figure: two network diagrams. Nodes in the first: "P1-P2 REAL", "Detected by X1", …, "Detected by Xn". The second adds "Detected by Xj", "Detected by Xj+1", "Membrane", and "Highly expressed".]

Naïve Bayes assumes all observations are independent.

But some observations may be coupled.

• The graphical structure can be decided in advance, based on knowledge of the system, or learned from the data.

Graphical Structure

In a Bayesian network we don't need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

Automated Admissions Decisions

[Figure: Bayesian network with nodes "Grade inflation", "Smart", "Grades", "GREs", and "Admit".]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given Grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

Automated Admissions Decisions

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

Automated Admissions Decisions

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

[Figure: network with nodes "Smart" and "Grades".]

Formulations are equivalent and both require the same number of constraints.
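The equivalence of the two formulations can be checked numerically. The sketch below rebuilds the joint table from P(S) and P(G|S) using the slide's values, then recovers the conditional table from the joint; the variable names are chosen for this example only.

```python
# Values from the slide: P(S) and P(G | S), with False/True for F/T.
p_s = {False: 0.7, True: 0.3}
p_g_given_s = {
    False: {False: 0.95, True: 0.05},  # P(G | S=F)
    True: {False: 0.2, True: 0.8},     # P(G | S=T)
}

# Joint table from the factorization P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in (False, True)
    for g in (False, True)
}

# Going back: marginalize to get P(S), then divide to recover P(G | S).
marginal_s = {s: joint[(s, False)] + joint[(s, True)] for s in (False, True)}
recovered = {
    s: {g: joint[(s, g)] / marginal_s[s] for g in (False, True)}
    for s in (False, True)
}
```

Both tables carry three free numbers: the joint has four entries that sum to 1, and the conditional form has one for P(S) plus one per row of P(G|S).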

Automated Admissions Decisions

In a Bayesian network we don't need the full probability distribution. The joint probability depends only on "parents". For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

"Explaining Away"

[Figure: network with nodes "Season", "Sprinkler" (S), "Rain" (R), "Grass wet", and "Slippery".]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about the sprinkler.

"Explaining Away"

[Figure: network with nodes "Coin 1", "Coin 2", and "Score".]

p(C1=H) = p(C1=T) = 0.5    p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
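The coin example is small enough to enumerate. The sketch below builds the joint distribution for two fair coins and a score that is 1 when they match, then compares the belief that C1 = H under different evidence. The helper names are hypothetical.

```python
from itertools import product

# Two fair, independent coins; Score = 1 if they match, else 0.
joint = {}
for c1, c2 in product("HT", repeat=2):
    score = 1 if c1 == c2 else 0
    joint[(c1, c2, score)] = 0.25


def prob(event, given=lambda outcome: True):
    """P(event | given), computed by summing the joint over matching outcomes."""
    denom = sum(p for o, p in joint.items() if given(o))
    numer = sum(p for o, p in joint.items() if given(o) and event(o))
    return numer / denom


def c1_heads(outcome):
    return outcome[0] == "H"


p_marginal = prob(c1_heads)                                          # no evidence
p_given_c2 = prob(c1_heads, lambda o: o[1] == "T")                   # C2 known, score unknown
p_given_score = prob(c1_heads, lambda o: o[2] == 1)                  # score known, C2 unknown
p_given_both = prob(c1_heads, lambda o: o[2] == 1 and o[1] == "T")   # score and C2 known
```

Without the score, learning C2 leaves the belief about C1 unchanged. Once the score is known, learning C2 fixes C1 completely.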

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure. How do we find the parameters?
• Define an objective function and search for parameters that optimize this function.

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

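Learning Models from Data

[Figure: network with nodes "Coin 1", "Coin 2", and "Score"; the parameters below are to be learned.]

p(C1=H) = ?    p(C1=T) = ?
p(C2=H) = ?    p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = {(C1, C2, S)} = (H,T,0), (H,H,1), …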
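For fully observed discrete data, the maximum-likelihood parameters are relative frequencies, so they can be filled in by counting. The sketch below uses a made-up data set `data`; only its first two rows echo the slide's D, and the rest are invented for illustration.

```python
from collections import Counter

# Hypothetical training data D of (C1, C2, Score) observations.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("H", "T", 0),
        ("T", "H", 0), ("H", "H", 1), ("T", "T", 1), ("H", "H", 1)]


def mle_marginal(values):
    """Maximum-likelihood estimate of a discrete distribution: relative frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


p_c1 = mle_marginal(c1 for c1, _, _ in data)
p_c2 = mle_marginal(c2 for _, c2, _ in data)

# P(Score = 1 | C1, C2): fraction of rows with that coin pair whose score is 1.
pair_counts = Counter((c1, c2) for c1, c2, _ in data)
score_one_counts = Counter((c1, c2) for c1, c2, s in data if s == 1)
p_score1 = {pair: score_one_counts[pair] / n for pair, n in pair_counts.items()}
```

With hidden variables or missing values, counting no longer suffices, which is where search methods such as EM come in.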

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked "toy" example

Best to work through on your own.

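Chain Rule of Probability for Bayes Nets

[Figure: network with nodes S (Smart), G (Grades), R (GREs), and A (Admit).]

$$P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R)$$

$$= P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R) \quad \text{(why?)}$$

(because of the conditional independence assumptions)

Recall: $P(X,Y) = P(X \mid Y)\,P(Y)$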

Prediction with Bayes Nets

[Figure: network with nodes S (Smart), G (Grades), R (GREs), and A (Admit).]

S:
P(F)   P(T)
0.5    0.5

G (rough grading: being smart doesn't help very much):
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R (GREs are better correlated with intelligence than the grades):
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

$$P(A{=}T \mid S{=}T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S{=}T)\,P(R \mid S{=}T)\,P(A{=}T \mid G,R)$$

= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) ≈ 0.29, with one term each for (G,R) = (F,F), (F,T), (T,F), (T,T).
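The prediction sum is easy to reproduce in code. The sketch below encodes the conditional probability tables from the slide (with False/True standing for F/T) and evaluates P(A=T|S) for both values of S; the names `P_G`, `P_R`, `P_A_TRUE`, and `p_admit_given_smart` are chosen for this example only.

```python
# Conditional probability tables from the slide, keyed by False/True.
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P(G | S)
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P(R | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}                      # P(A=T | G, R)


def p_admit_given_smart(s):
    """P(A=T | S=s): sum over the unobserved G and R."""
    return sum(
        P_G[s][g] * P_R[s][r] * P_A_TRUE[(g, r)]
        for g in (False, True)
        for r in (False, True)
    )


p_admit_if_smart = p_admit_given_smart(True)
p_admit_if_not_smart = p_admit_given_smart(False)
```

Evaluated directly from these tables, the two values come out at about 0.29 and about 0.16.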

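Inference with Bayes Nets

[Figure: network with nodes S (Smart), G (Grades), R (GREs), and A (Admit).]

$$P(A{=}T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}T \mid G,R)$$

$$= P(S{=}T)\,P(A{=}T \mid S{=}T) + P(S{=}F)\,P(A{=}T \mid S{=}F) \approx 0.23$$

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) ≈ 0.29 (calculated on the previous slide)
P(A=T|S=F) ≈ 0.16 (calculated analogously to P(A=T|S=T); the tables give 0.164)

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes' rule: P(S=T|A=T) = P(S=T) P(A=T|S=T) / P(A=T)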

Inference with Bayes Nets

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

$$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S} \sum_{G} P(S)\,P(G \mid S)\,P(R{=}F \mid S)\,P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A{=}F \mid G,R) \quad \text{(as before)}$$

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) ≈ 0.94 / 0.55 ≈ 1.7
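The "tedious but straightforward" computations can be handed to a brute-force enumeration over all 16 assignments of (S, G, R, A). This is a sketch under the tables given in the worked example; the function names `joint` and `prob` are hypothetical, and enumeration is used only because the network is tiny.

```python
from itertools import product

# Tables from the worked example, keyed by False/True for F/T.
P_S = {False: 0.5, True: 0.5}
P_G = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}  # P(G | S)
P_R = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}  # P(R | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}                      # P(A=T | G, R)


def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    p_a = P_A_TRUE[(g, r)] if a else 1.0 - P_A_TRUE[(g, r)]
    return P_S[s] * P_G[s][g] * P_R[s][r] * p_a


def prob(query, evidence):
    """P(query | evidence); both are dicts over the variable names S, G, R, A."""
    numer = denom = 0.0
    for s, g, r, a in product((False, True), repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(s, g, r, a)
            denom += p
            if all(world[k] == v for k, v in query.items()):
                numer += p
    return numer / denom


p_admit = prob({"A": True}, {})
p_smart_given_admit = prob({"S": True}, {"A": True})
p_bad_grades_given_reject = prob({"G": False}, {"A": False})
p_bad_gres_given_reject = prob({"R": False}, {"A": False})
```

Comparing the last two values answers the slide's question: under these tables, bad grades are the more likely observation for a rejected student, largely because bad grades are common for everyone.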

End of worked example

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Figure: network in which "P1-P2 REAL" is the cause (often "hidden") and "Detected by X1", …, "Detected by Xn" are the effects (observed).]

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al. Mol. & Cell. Proteomics (2002) 1.5: 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: candidate patterns, one of them labeled "More likely to interact".]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.
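One simple way to compare phylogenetic profiles is to count the genomes in which exactly one of the two proteins is present. The sketch below uses that plain Hamming distance, which is a stand-in and not the improved method of Cokus et al.; the profiles and protein names are invented for illustration.

```python
def hamming_distance(profile_a, profile_b):
    """Number of genomes in which exactly one of the two proteins is present."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


# Hypothetical presence (1) / absence (0) profiles across eight genomes.
profiles = {
    "P1": [1, 0, 1, 1, 0, 0, 1, 0],
    "P2": [1, 0, 1, 1, 0, 0, 1, 1],
    "P3": [0, 1, 0, 1, 1, 0, 0, 1],
}

d_12 = hamming_distance(profiles["P1"], profiles["P2"])
d_13 = hamming_distance(profiles["P1"], profiles["P3"])
```

Here P1 and P2 are gained and lost together in almost every genome, so under the co-evolution argument they would be the better candidate pair.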

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract


Graphical Structure

In a Bayesian Network we don't need the full probability distribution. A node is independent of its ancestors given its parents. For example: the activity of a gene does not depend on the activity of TF A1 once I know TF B2.

82

Automated Admissions Decisions

[Network diagram; nodes: Grade inflation, Smart, Grades, GREs, Admit]

Prediction: we observe the "causes" (roots/parents) and want to predict the "effects" (leaves/children).

Given grades and GREs, will we admit?

Inference: we observe the "effects" (leaves/children) and want to infer the hidden values of the "causes" (roots/parents).

You meet an admitted student. Is she smart?

83

Automated Admissions Decisions

[Network diagram; nodes: Grade inflation, Smart, Grades, GREs, Admit]

Making predictions/inferences requires knowing the joint probabilities P(admit, grade inflation, smart, grades, GREs). We will find conditional probabilities to be very useful.

84

Automated Admissions Decisions

[Network diagram; nodes: Smart, Grades]

Joint Probability

S    G (above threshold)    P(S,G)
F    F                      0.665
F    T                      0.035
T    F                      0.06
T    T                      0.24

Conditional Probability

P(S=F) = 0.7    P(S=T) = 0.3

S    P(G=F|S)    P(G=T|S)
F    0.95        0.05
T    0.2         0.8

The two formulations are equivalent, and both require the same number of constraints.

85
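The equivalence of the two formulations can be checked numerically. The following is a minimal sketch using the numbers from the slide; the variable names are chosen here for illustration and are not part of the lecture.

```python
# Conditional formulation from the slide: P(S) and P(G | S).
p_s = {"F": 0.7, "T": 0.3}
p_g_given_s = {
    "F": {"F": 0.95, "T": 0.05},
    "T": {"F": 0.2, "T": 0.8},
}

# Joint formulation: P(S, G) = P(S) * P(G | S).
joint = {
    (s, g): p_s[s] * p_g_given_s[s][g]
    for s in p_s
    for g in ("F", "T")
}

# Going back: P(S) by summing out G, then P(G | S) = P(S, G) / P(S).
marginal_s = {s: sum(joint[(s, g)] for g in ("F", "T")) for s in p_s}
recovered = {
    s: {g: joint[(s, g)] / marginal_s[s] for g in ("F", "T")}
    for s in p_s
}

for (s, g), p in sorted(joint.items()):
    print(f"P(S={s}, G={g}) = {p:.3f}")
```

Both tables carry three free numbers: the joint has four entries that must sum to 1, and the conditional form has one number for P(S) plus one per row of P(G|S).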

Automated Admissions Decisions

[Network diagram; nodes: Grade inflation, Smart, Grades, GREs, Admit]

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on each node's "parents". For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86
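As a hedged illustration of the saving, assume the edges are Grade inflation → Grades, Smart → Grades, Smart → GREs, Grades → Admit and GREs → Admit. The slide text confirms only that grades depend on grade inflation and GREs do not; the remaining edges and the single-letter symbols I, S, G, R, A are assumptions made here. Under that structure the joint over the five binary variables would factor as:

```latex
P(A, I, S, G, R) = P(I)\,P(S)\,P(G \mid I, S)\,P(R \mid S)\,P(A \mid G, R)
```

With this assumed structure the factored form needs 1 + 1 + 4 + 2 + 4 = 12 free parameters, compared with 2^5 - 1 = 31 for the full joint table.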

"Explaining Away"

[Network diagram; nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery]

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs about the sprinkler.

87

"Explaining Away"

[Network diagram; nodes: Coin 1, Coin 2, Score]

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.

88
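The two-coin example is small enough to enumerate. The sketch below is illustrative only; the helper names `consistent` and `prob` are made up here. It computes the relevant conditional probabilities directly from the four equally likely outcomes.

```python
from itertools import product

# Joint distribution over two fair coins; Score is 1 when they match, else 0.
outcomes = [
    {"C1": c1, "C2": c2, "Score": 1 if c1 == c2 else 0, "p": 0.25}
    for c1, c2 in product("HT", repeat=2)
]

def consistent(outcome, assignment):
    """True if the outcome agrees with every variable in the assignment."""
    return all(outcome[name] == value for name, value in assignment.items())

def prob(event, given=None):
    """P(event | given), by summing the four joint outcomes."""
    given = given or {}
    denom = sum(o["p"] for o in outcomes if consistent(o, given))
    numer = sum(o["p"] for o in outcomes
                if consistent(o, given) and consistent(o, event))
    return numer / denom

print(prob({"C1": "H"}))                           # marginal
print(prob({"C1": "H"}, {"C2": "T"}))              # C2 alone
print(prob({"C1": "H"}, {"Score": 1}))             # score alone
print(prob({"C1": "H"}, {"Score": 1, "C2": "T"}))  # score and C2 together
```

Knowing C2 alone, or the score alone, leaves P(C1=H) at 0.5. Only once the score is known does the state of C2 change our belief about C1.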

How do we obtain a BN?

• Two problems
  – learning graph structure
    • NP-complete
    • approximation algorithms
  – probability distributions

89

Learning Models from Data

• Assume we know the structure: how do we find the parameters?

• Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data Xi
  – Maximum posterior

• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

91
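The likelihood formula on this slide did not survive extraction. For reference, the standard textbook forms of the two objectives are sketched below, assuming independent training examples. The symbols θ (model parameters) and N (number of training examples) are introduced here and are not taken from the slide.

```latex
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \prod_{i=1}^{N} P(X_i \mid \theta)
\qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \; P(\theta) \prod_{i=1}^{N} P(X_i \mid \theta)
```

The only difference is the prior P(θ), which the maximum-posterior objective multiplies into the likelihood.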

Learning Models from Data

[Network diagram; nodes: Coin 1, Coin 2, Score]

p(C1=H) = ?    p(C1=T) = ?
p(C2=H) = ?    p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = {(C1, C2, S)} = {(H, T, 0), (H, H, 1), …}

92
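For fully observed discrete data, the maximum-likelihood parameters are just relative frequencies. The sketch below assumes a small made-up data set; only its first two triples come from the slide, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical training set D of (C1, C2, Score) triples; only the first two
# entries come from the slide, the rest are made up for illustration.
D = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0),
     ("H", "H", 1), ("T", "H", 0), ("H", "T", 0), ("H", "H", 1)]

def ml_marginal(data, index):
    """Maximum-likelihood estimate of P(coin = value): relative frequency."""
    counts = Counter(row[index] for row in data)
    return {value: n / len(data) for value, n in counts.items()}

def ml_score_table(data):
    """Maximum-likelihood estimate of P(Score = 1 | C1, C2) per parent setting."""
    totals, ones = Counter(), Counter()
    for c1, c2, score in data:
        totals[(c1, c2)] += 1
        ones[(c1, c2)] += score
    return {parents: ones[parents] / totals[parents] for parents in totals}

p_c1 = ml_marginal(D, 0)
p_c2 = ml_marginal(D, 1)
p_score = ml_score_table(D)
print(p_c1, p_c2, p_score)
```

With so few observations the coin estimates are noisy (5/8 rather than 0.5), which is one motivation for adding a prior and using the maximum-posterior objective instead.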

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?

93
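To get a feel for how quickly the number of candidate structures grows, the sketch below counts the directed acyclic graphs on n labeled nodes using Robinson's recurrence. The recurrence is standard combinatorics and was not shown in the lecture; the function name is chosen here.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of directed acyclic graphs on n labeled nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

for n in range(1, 8):
    print(n, count_dags(n))
```

Five nodes already allow 29281 structures, so exhaustive scoring is impractical beyond toy networks.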

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

94

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked "toy" example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

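[Network diagram; nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

P(S,G,R,A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)

           = P(S) P(G|S) P(R|S) P(A|G,R)   (why?)

(because of the conditional independence assumptions: R is independent of G given S, and A is independent of S given G and R)

Recall: P(X,Y) = P(X|Y) P(Y)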

96

Prediction with Bayes Nets

[Network diagram; nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

S:
P(F)   P(T)
0.5    0.5

G:
S    P(F)   P(T)
F    0.9    0.1
T    0.8    0.2

R:
S    P(F)   P(T)
F    0.8    0.2
T    0.2    0.8

A:
G    R    P(F)   P(T)
F    F    0.9    0.1
F    T    0.8    0.2
T    F    0.5    0.5
T    T    0.2    0.8

P(A=T|S=T) = Σ_{G∈{F,T}} Σ_{R∈{F,T}} P(G|S=T) P(R|S=T) P(A=T|G,R)

           = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
             (terms in the order G,R = F,F; F,T; T,F; T,T)

Rough grading: being smart doesn't help very much.

GREs are better correlated with intelligence than the grades.

97
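The prediction step can be reproduced in a few lines. This is a minimal sketch that encodes the conditional probability tables of the toy example, with Python's True/False standing in for T/F; the names are chosen here for illustration.

```python
from itertools import product

# Conditional probability tables of the toy example; True/False stand for T/F.
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_ADMIT_GIVEN_GR = {(False, False): 0.1, (False, True): 0.2,
                    (True, False): 0.5, (True, True): 0.8}

def p_admit_given_smart(s):
    """P(A=T | S=s): sum out grades G and GREs R."""
    return sum(
        P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * P_ADMIT_GIVEN_GR[(g, r)]
        for g, r in product((False, True), repeat=2)
    )

print(round(p_admit_given_smart(True), 3))
print(round(p_admit_given_smart(False), 3))
```

With the tables as given, the two calls evaluate to 0.292 and 0.164.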

Inference with Bayes Nets

[Network diagram; nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

P(A=T) = Σ_{S∈{F,T}} Σ_{G∈{F,T}} Σ_{R∈{F,T}} P(S) P(G|S) P(R|S) P(A=T|G,R)

       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = (0.5)(0.29) + (0.5)(0.16) = 0.23

P(S=T) = 0.5, P(S=F) = 0.5

P(A=T|S=T) calculated on the previous slide = 0.29

P(A=T|S=F) calculated analogously to P(A=T|S=T):
= (0.9)(0.8)(0.1) + (0.9)(0.2)(0.2) + (0.1)(0.8)(0.5) + (0.1)(0.2)(0.8) = 0.16

P(S=T|A=T) = P(S=T, A=T) / P(A=T)

Or, using Bayes' rule: P(S=T|A=T) = P(S=T) P(A=T|S=T) / P(A=T)

98

Inference with Bayes Nets

[Network diagram; nodes: S (Smart), G (Grades), R (GREs), A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F) = [Σ_{S∈{F,T}} Σ_{G∈{F,T}} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S∈{F,T}} Σ_{G∈{F,T}} Σ_{R∈{F,T}} P(S) P(G|S) P(R|S) P(A=F|G,R)   (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) = 0.94 / 0.55 = 1.7

99
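The "tedious but straightforward" sums are easy to delegate to a brute-force enumeration over all 16 assignments of (S, G, R, A). The sketch below uses the toy example's tables; `joint` and `prob` are hypothetical helper names, and this is plain enumeration rather than any particular inference library.

```python
from itertools import product

TF = (False, True)
P_S = {False: 0.5, True: 0.5}
P_G_GIVEN_S = {False: {False: 0.9, True: 0.1}, True: {False: 0.8, True: 0.2}}
P_R_GIVEN_S = {False: {False: 0.8, True: 0.2}, True: {False: 0.2, True: 0.8}}
P_A_TRUE_GIVEN_GR = {(False, False): 0.1, (False, True): 0.2,
                     (True, False): 0.5, (True, True): 0.8}

def joint(s, g, r, a):
    """P(S=s, G=g, R=r, A=a) from the chain rule for this network."""
    p_a_true = P_A_TRUE_GIVEN_GR[(g, r)]
    p_a = p_a_true if a else 1.0 - p_a_true
    return P_S[s] * P_G_GIVEN_S[s][g] * P_R_GIVEN_S[s][r] * p_a

def prob(query, evidence=None):
    """P(query | evidence); both are dicts keyed by 'S', 'G', 'R', 'A'."""
    evidence = evidence or {}
    numer = denom = 0.0
    for s, g, r, a in product(TF, repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(s, g, r, a)
        denom += p
        if all(world[k] == v for k, v in query.items()):
            numer += p
    return numer / denom

print(round(prob({"A": True}), 3))
print(round(prob({"S": True}, {"A": True}), 3))
print(round(prob({"G": False}, {"A": False}), 3))
print(round(prob({"R": False}, {"A": False}), 3))
```

With these tables the enumeration gives P(A=T) ≈ 0.228, P(S=T|A=T) ≈ 0.64, P(G=F|A=F) ≈ 0.94 and P(R=F|A=F) ≈ 0.55, so a rejected student is more likely to have had bad grades than bad GREs.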

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Network diagram: the cause (often "hidden") is "P1-P2 REAL"; the effects (observed) are "Detected by X1", …, "Detected by Xn"]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure not reproduced; one of the patterns is labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical

• Accommodates missing data

• Gives appropriate weights to different sources

• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

108


109

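likelihood ratio =

$$\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \prod_{i=1}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

if > 1, classify as true; if < 1, classify as false

log likelihood ratio =

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i=1}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect the ranking.

Ranking function =

$$\sum_{i=1}^{M} \log \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$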

110
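A naive Bayes ranking of candidate pairs by summed log likelihood ratio might look like the sketch below. The evidence sources, their conditional probabilities and the protein pair names are all hypothetical placeholders, not values from Jansen et al.

```python
import math

def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum of per-source log ratios log[P(obs_i | true_PPI) / P(obs_i | false_PPI)]."""
    return sum(
        math.log(p_given_true[source][value] / p_given_false[source][value])
        for source, value in observations.items()
    )

# Hypothetical conditional probabilities for two evidence sources.
p_true = {"two_hybrid": {True: 0.3, False: 0.7},
          "coexpression": {"high": 0.6, "low": 0.4}}
p_false = {"two_hybrid": {True: 0.01, False: 0.99},
           "coexpression": {"high": 0.2, "low": 0.8}}

# Hypothetical candidate pairs and what each source reported for them.
pairs = {
    "P1-P2": {"two_hybrid": True, "coexpression": "high"},
    "P3-P4": {"two_hybrid": False, "coexpression": "high"},
    "P5-P6": {"two_hybrid": False, "coexpression": "low"},
}

ranked = sorted(
    pairs,
    key=lambda name: log_likelihood_ratio(pairs[name], p_true, p_false),
    reverse=True,
)
print(ranked)
```

A source with no measurement for a pair can simply be left out of that pair's dictionary, in which case it contributes nothing to the sum. This is one simple way to accommodate missing data.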

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114 / 2150}{81924 / 573734} \approx 3.6$$

111
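The likelihood ratio for one evidence category is a ratio of two gold-standard fractions. The sketch below assumes the slide's digit strings read as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives; that parsing is an assumption, as is the function name.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), each estimated as a fraction of the gold standard."""
    p_f_given_pos = pos_with_f / pos_total
    p_f_given_neg = neg_with_f / neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as read from the slide (parsing of the digit strings is an assumption).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))
```

A value above 1 means the feature is more common among gold-standard positives than among negatives, so observing it raises the odds that a pair truly interacts.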

112


113

Fully connected → compute probabilities for all 16 possible combinations


114

Interpret with caution as numbers are small


115

[Figure annotation: TP = FP]

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$


116
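Scoring the gold standard at different cutoffs amounts to counting, for each cutoff, how many gold-standard positives (TP) and negatives (FP) score at or above it. The sketch below uses made-up scores and labels purely for illustration.

```python
def tp_fp_at_cutoffs(scored_pairs, cutoffs):
    """For each cutoff, count gold-standard positives (TP) and negatives (FP) scoring >= cutoff."""
    rows = []
    for cutoff in cutoffs:
        tp = sum(1 for score, is_true in scored_pairs if score >= cutoff and is_true)
        fp = sum(1 for score, is_true in scored_pairs if score >= cutoff and not is_true)
        rows.append((cutoff, tp, fp))
    return rows

# Hypothetical (log likelihood ratio, gold-standard label) pairs.
scored = [(4.1, True), (3.2, True), (2.5, False), (1.9, True),
          (0.7, False), (0.2, True), (-0.5, False), (-1.3, False)]

for cutoff, tp, fp in tp_fp_at_cutoffs(scored, cutoffs=[3.0, 1.0, 0.0, -2.0]):
    ratio = tp / fp if fp else float("inf")
    print(f"cutoff={cutoff:5.1f}  TP={tp}  FP={fp}  TP/FP={ratio:.2f}")
```

Lowering the cutoff recovers more true interactions but admits false positives faster, so TP/FP falls toward 1 and below.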

Summary

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

117



Page 84: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

Admit

Making predictionsinferences requires knowing the joint probabilities P(admit grade inflation smart grades GREs) We will find conditional probabilities to be very useful

84

S G (above

threshold)

P(SG)

F F 0665

F T 0035

T F 006

T T 024

Automated Admissions Decisions Joint Probability

S P(G=F|S) P(G=T|S)

F 095 005

T 02 08

P(S=F)=07 P(S=T)=03

Grades

Smart

Conditional Probability

Formulations are equivalent and both require same number of constraints 85

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we donrsquot need the full probability distribution The joint probability depends only on ldquoparentsrdquo For example GRE scores do not depend on the level of grade inflation at the school but the grades do

Admit

86

ldquoExplaining Awayrdquo Season

S Sprinkler

R Rain

Grass wet

Does the probability that itrsquos raining depend on whether the sprinkler is on

Slippery

In a causal sense clearly not But in a probabilistic model the knowledge that it is raining influences our beliefs

87

Coin 1

Coin 2

Score

p(C2=H)=p(C2=T) = 5 p(C1=H)=p(C1=T) = 5

C1 C2 Score

H H 1

T T 1

H T 0

T H 0

Does the prob that C1 =H on depend on whether C2 =T If we know the score then our belief in the state of C1 is influenced by our belief in the state C2

ldquoExplaining Awayrdquo

88

How do we obtain a BN

bull Two problems ndash learning graph structure

bull NP-complete bull approximation algorithms

ndash probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

bull Two common objective functions ndash Maximum likelihood

bull Define the likelihood over training data Xi

ndash Maximum posterior

bull Good search algorithms exist ndash Gradient descent EM Gibbs Sampling hellip

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D= (C1C2S) = (HT0) (HH1) hellip

92

Learning Models from Data

bull Searching for the BN structure NP-complete ndash Too many possible structures to evaluate all of

them even for very small networks ndash Many algorithms have been proposed ndash Incorporated some prior knowledge can reduce

the search space bull Which measurements are likely independent bull Which nodes should regulate transcription bull Which should cause changes in phosphorylation

93

Resources to learn more

Kevin Murphyrsquos tutorial httpwwwcsubcca~murphykBayesbnintrohtml

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets

S:
P(F)   P(T)
0.5    0.5

G (rough grading: being smart doesn't help very much):
S   P(F)   P(T)
F   0.9    0.1
T   0.8    0.2

R (GREs are better correlated with intelligence than the grades):
S   P(F)   P(T)
F   0.8    0.2
T   0.2    0.8

A:
G   R   P(F)   P(T)
F   F   0.9    0.1
F   T   0.8    0.2
T   F   0.5    0.5
T   T   0.2    0.8

$P(A=T \mid S=T) = \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(G \mid S=T)\,P(R \mid S=T)\,P(A=T \mid G,R)$

$= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29$, where the four terms correspond to (G,R) = (F,F), (F,T), (T,F), (T,T)

Network: S (Smart) → G (Grades), S → R (GREs), and G, R → A (Admit)
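The sum over the unobserved G and R can be checked by direct enumeration. This is a minimal sketch that encodes the conditional probability tables as transcribed from this slide; the variable and function names are illustrative choices.

```python
from itertools import product

# Each entry is P(child = True | parent values), as transcribed from the slide.
P_G = {False: 0.1, True: 0.2}                       # P(G=T | S)
P_R = {False: 0.2, True: 0.8}                       # P(R=T | S)
P_A = {(False, False): 0.1, (False, True): 0.2,
       (True, False): 0.5, (True, True): 0.8}       # P(A=T | G, R)

def bern(p_true, value):
    """Probability that a binary variable takes `value`, given P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def p_admit_given_smart(s):
    """P(A=T | S=s), summing over the unobserved G and R."""
    return sum(
        bern(P_G[s], g) * bern(P_R[s], r) * P_A[(g, r)]
        for g, r in product([False, True], repeat=2)
    )

print(round(p_admit_given_smart(True), 3))
```

For S = T the enumeration gives about 0.292, in line with the rounded 0.29 on the slide. Calling the same function with `False` gives the analogous quantity for S = F.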

97

Inference with Bayes Nets

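Network: S (Smart) → G (Grades), S → R (GREs), and G, R → A (Admit)

$P(A=T) = \sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=T \mid G,R)$

$= P(S=T)\,P(A=T \mid S=T) + P(S=F)\,P(A=T \mid S=F) = 0.21$

where $P(S=T) = 0.5$, $P(S=F) = 0.5$, $P(A=T \mid S=T) = 0.29$ (calculated on the previous slide), and $P(A=T \mid S=F) = 0.14$ (calculated analogously to $P(A=T \mid S=T)$).

$P(S=T \mid A=T) = P(S=T, A=T) / P(A=T)$

Or, using Bayes' rule: $P(S=T \mid A=T) = P(S=T)\,P(A=T \mid S=T) / P(A=T)$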

98

Inference with Bayes Nets

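Network: S (Smart) → G (Grades), S → R (GREs), and G, R → A (Admit)

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute $P(R=F \mid A=F)$ and $P(G=F \mid A=F)$

$P(R=F \mid A=F) = \frac{P(R=F, A=F)}{P(A=F)} = \frac{\sum_{S \in \{F,T\}} \sum_{G \in \{F,T\}} P(S)\,P(G \mid S)\,P(R=F \mid S)\,P(A=F \mid G, R=F)}{P(A=F)}$

$P(A=F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=F \mid G,R)$ (as before)

Tedious but straightforward to compute:

$\frac{P(G=F \mid A=F)}{P(R=F \mid A=F)} = \frac{0.92}{0.56} = 1.6$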
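The "tedious but straightforward" computation can be handed to a brute-force enumeration over all 16 assignments. This is a sketch only: the tables are the ones transcribed from the "Prediction with Bayes Nets" slide, and the helper names (`joint`, `prob`, `conditional`) are illustrative.

```python
from itertools import product

# P(child = True | parent values), as transcribed from the earlier slide.
P_S_TRUE = 0.5
P_G_TRUE = {False: 0.1, True: 0.2}                  # keyed by S
P_R_TRUE = {False: 0.2, True: 0.8}                  # keyed by S
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}  # keyed by (G, R)

def bern(p_true, value):
    """Probability that a binary variable takes `value`, given P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(s, g, r, a):
    """P(S=s, G=g, R=r, A=a) from the chain-rule factorization."""
    return (bern(P_S_TRUE, s) * bern(P_G_TRUE[s], g)
            * bern(P_R_TRUE[s], r) * bern(P_A_TRUE[(g, r)], a))

def prob(**fixed):
    """Marginal probability of the assignments in `fixed`, e.g. prob(r=False, a=False)."""
    total = 0.0
    for s, g, r, a in product([False, True], repeat=4):
        world = {"s": s, "g": g, "r": r, "a": a}
        if all(world[k] == v for k, v in fixed.items()):
            total += joint(s, g, r, a)
    return total

def conditional(query, evidence):
    """P(query | evidence) by enumeration."""
    return prob(**query, **evidence) / prob(**evidence)

bad_grades = conditional({"g": False}, {"a": False})
bad_gres = conditional({"r": False}, {"a": False})
print(round(bad_grades, 2), round(bad_gres, 2), round(bad_grades / bad_gres, 2))
```

With the tables as transcribed here, the enumeration gives values close to, though not identical with, the rounded figures quoted on the slides. The qualitative conclusion is the same: bad grades are the more likely explanation for a rejection.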

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

Network: "P1-P2 REAL" (cause, often "hidden") → "Detected by X1", …, "Detected by Xn" (effects, observed)

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

More likely to interact

Which pattern below is more likely to represent a pair of interacting proteins?

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.
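The intuition behind phylogenetic profiles is that proteins which are present or absent together across genomes are more likely to be functionally linked. The sketch below scores that with a plain agreement fraction. This simple score is an assumption made for illustration and is not the improved method of Cokus et al.; the profiles are invented.

```python
def profile_agreement(profile_a, profile_b):
    """Fraction of genomes in which two proteins are both present or both absent."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    matches = sum(a == b for a, b in zip(profile_a, profile_b))
    return matches / len(profile_a)

# Hypothetical presence (1) / absence (0) calls across eight genomes.
protein_x = [1, 0, 1, 1, 0, 0, 1, 0]
protein_y = [1, 0, 1, 1, 0, 0, 1, 1]   # nearly the same pattern as protein_x
protein_z = [0, 1, 1, 0, 1, 0, 0, 1]   # largely unrelated pattern

print(profile_agreement(protein_x, protein_y))
print(profile_agreement(protein_x, protein_z))
```

A raw agreement fraction ignores how the genomes are related to one another, which is one reason more careful scoring schemes exist.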

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical

• Accommodates missing data

• Gives appropriate weights to different sources

• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold-standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

109

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

108


109

likelihood ratio: if > 1, classify as true; if < 1, classify as false

log likelihood ratio =

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function =

$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\mathrm{Observation}_i \mid \mathrm{true\_PPI})}{P(\mathrm{Observation}_i \mid \mathrm{false\_PPI})}$
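Under the naive Bayes assumption the ranking function becomes a sum of per-source log likelihood ratios, and a missing observation can simply be skipped. The sketch below illustrates this; the source names, likelihood tables, and protein pairs are all invented for illustration.

```python
import math

def log_likelihood_ratio(observations, likelihoods):
    """Sum of per-source log likelihood ratios under the naive Bayes assumption.

    observations: dict mapping source name -> observed value (None = missing).
    likelihoods: dict mapping source name ->
        {value: (P(value | true PPI), P(value | false PPI))}.
    Missing observations contribute nothing (a ratio of 1).
    """
    total = 0.0
    for source, value in observations.items():
        if value is None:
            continue
        p_true, p_false = likelihoods[source][value]
        total += math.log(p_true / p_false)
    return total

# Hypothetical likelihood tables, invented for illustration only.
likelihoods = {
    "two_hybrid": {True: (0.30, 0.01), False: (0.70, 0.99)},
    "coexpression": {"high": (0.50, 0.10), "low": (0.50, 0.90)},
}

pairs = {
    "P1-P2": {"two_hybrid": True, "coexpression": "high"},
    "P3-P4": {"two_hybrid": False, "coexpression": None},
    "P5-P6": {"two_hybrid": None, "coexpression": "low"},
}

ranked = sorted(pairs, key=lambda p: log_likelihood_ratio(pairs[p], likelihoods),
                reverse=True)
print(ranked)
```

Because the prior is the same for every pair, adding its log to each score would shift all scores equally and leave the ranking unchanged.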

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

Likelihood $= L = \frac{P(f \mid \mathrm{pos})}{P(f \mid \mathrm{neg})} = \frac{1114/2150}{81924/573734}$
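The likelihood for one feature value is a ratio of two fractions of gold-standard pairs. The sketch below does that arithmetic, assuming the two fractions on the slide read 1114/2150 (gold-standard positives with the feature value) and 81924/573734 (gold-standard negatives with it). That reading of the flattened digits is an assumption.

```python
def likelihood_ratio(pos_with_f, pos_total, neg_with_f, neg_total):
    """L = P(f | pos) / P(f | neg), each estimated as a fraction of gold-standard pairs."""
    p_f_pos = pos_with_f / pos_total
    p_f_neg = neg_with_f / neg_total
    return p_f_pos / p_f_neg

L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))
```

A value of L above 1 means the feature value is more common among gold-standard positives than negatives, so observing it raises the odds that the pair interacts.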

111

112


113

Fully connected → compute probabilities for all 16 possible combinations


114

Interpret with caution as numbers are small


115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log \frac{P(\mathrm{Data} \mid \mathrm{true\_PPI})}{P(\mathrm{Data} \mid \mathrm{false\_PPI})}$
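Sweeping a cutoff over the scores and counting gold-standard positives and negatives above it yields the TP and FP numbers behind a plot of this kind. The sketch below assumes each pair carries a log-likelihood-ratio score and a gold-standard label; the scores and labels are invented.

```python
def tp_fp_at_cutoffs(scored_pairs, cutoffs):
    """Count gold-standard positives (TP) and negatives (FP) scoring at or above each cutoff.

    scored_pairs: list of (score, is_gold_positive) tuples.
    Returns {cutoff: (tp, fp)}.
    """
    result = {}
    for cutoff in cutoffs:
        tp = sum(1 for score, positive in scored_pairs if score >= cutoff and positive)
        fp = sum(1 for score, positive in scored_pairs if score >= cutoff and not positive)
        result[cutoff] = (tp, fp)
    return result

# Hypothetical log-likelihood-ratio scores with gold-standard labels.
scored = [(6.2, True), (5.1, True), (4.0, False), (3.3, True),
          (1.5, False), (0.4, False), (-0.8, True), (-2.0, False)]

counts = tp_fp_at_cutoffs(scored, [5.0, 3.0, 0.0])
for cutoff, (tp, fp) in counts.items():
    ratio = tp / fp if fp else float("inf")
    print(cutoff, tp, fp, ratio)
```

Lowering the cutoff recovers more true positives but admits false positives faster, so TP/FP falls; the cutoff at which TP = FP is a natural reference point.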


116

Summary

• Structural prediction of protein-protein interactions

• High-throughput measurement of protein-protein interactions

• Estimating interaction probabilities

• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

  • Slide Number 1
  • Predictions
  • Prediction Challenges
  • Slide Number 4
  • Slide Number 5
  • Predicting Structures of Complexes
  • Docking
  • Docking
  • Reducing the search space
  • Next
  • PRISM's Rationale
  • Slide Number 12
  • Subtilisin and its inhibitors
  • Hotspots
  • Hotspots
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Slide Number 19
  • Slide Number 20
  • Slide Number 21
  • Flowchart
  • Slide Number 23
  • Next
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Slide Number 30
  • Slide Number 31
  • Slide Number 32
  • Slide Number 33
  • Slide Number 34
  • Slide Number 35
  • Slide Number 36
  • Slide Number 37
  • Slide Number 38
  • Outline
  • Detecting protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Tagging strategies
  • Yeast two-hybrid
  • Slide Number 45
  • Outline
  • Error Rates
  • Error Rates
  • Data Integration
  • Data Integration
  • Data Integration
  • Data Integration
  • Estimated Error Rates
  • Slide Number 54
  • Finding real interactions
  • Slide Number 56
  • Finding real interactions
  • Bayes Rule
  • Naïve Bayes Classification
  • Slide Number 61
  • Slide Number 62
  • Slide Number 63
  • Slide Number 64
  • Slide Number 65
  • Slide Number 66
  • ROC curve
  • ROC curve
  • ROC curve
  • ROC curve
  • Outline
  • Bayesian Networks
  • In Biology
  • Slide Number 74
  • Slide Number 75
  • Slide Number 76
  • Slide Number 77
  • Bayesian Networks
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Slide Number 83
  • Slide Number 84
  • Slide Number 85
  • Slide Number 86
  • Slide Number 87
  • "Explaining Away"
  • Slide Number 89
  • How do we obtain a BN
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Resources to learn more
  • A worked "toy" example
  • Chain Rule of Probability for Bayes Nets
  • Prediction with Bayes Nets
  • Inference with Bayes Nets
  • Inference with Bayes Nets
  • End of worked example
  • Goal
  • Slide Number 103
  • Co-evolution
  • Rosetta Stone
  • Integrating diverse data
  • Advantage of Bayesian Networks
  • Requirement of Bayesian Classification
  • Integrating diverse data
  • Slide Number 110
  • Slide Number 111
  • Slide Number 112
  • Slide Number 113
  • Slide Number 114
  • Slide Number 115
  • Slide Number 116
  • Slide Number 117
  • Summary

Automated Admissions Decisions

Network: Smart → Grades

Joint Probability

S   G (above threshold)   P(S,G)
F   F                     0.665
F   T                     0.035
T   F                     0.06
T   T                     0.24

Conditional Probability

P(S=F) = 0.7   P(S=T) = 0.3

S   P(G=F|S)   P(G=T|S)
F   0.95       0.05
T   0.2        0.8

Formulations are equivalent, and both require the same number of constraints.

85
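The equivalence of the two formulations can be checked by rebuilding the joint table from the prior and the conditional table, and by counting free parameters in each form. This is a minimal sketch; the dictionary layout is an illustrative choice.

```python
prior_s = {"F": 0.7, "T": 0.3}
cond_g_given_s = {"F": {"F": 0.95, "T": 0.05},
                  "T": {"F": 0.2, "T": 0.8}}

# Joint table rebuilt from the conditional formulation: P(S, G) = P(S) * P(G | S).
joint = {(s, g): prior_s[s] * cond_g_given_s[s][g]
         for s in prior_s for g in ("F", "T")}

for (s, g), p in joint.items():
    print(s, g, round(p, 3))

# Free parameters: the joint needs 4 - 1 = 3 numbers (its entries sum to 1);
# the conditional form needs 1 for P(S) plus 1 per row of P(G | S), also 3.
joint_params = len(joint) - 1
conditional_params = (len(prior_s) - 1) + len(cond_g_given_s) * (2 - 1)
print(joint_params, conditional_params)
```

With only two variables and one edge nothing is saved; the savings appear once some variables are not parents of others.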

Automated Admissions Decisions

Network nodes: Smart, Grade inflation, Grades, GREs, Admit

In a Bayesian Network we don't need the full probability distribution. The joint probability depends only on "parents". For example, GRE scores do not depend on the level of grade inflation at the school, but the grades do.

86

"Explaining Away"

Network nodes: Season, Sprinkler (S), Rain (R), Grass wet, Slippery

Does the probability that it's raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

87

Network: Coin 1 → Score ← Coin 2

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.

"Explaining Away"
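The coin example is small enough to enumerate, which makes the dependence explicit. The sketch below assumes two fair, independent coins and a score of 1 exactly when they match, as in the table; the function names are illustrative.

```python
from itertools import product

def score(c1, c2):
    """Score is 1 when the two coins match and 0 otherwise, as in the table."""
    return 1 if c1 == c2 else 0

# Two fair, independent coins: each of the four outcomes has probability 0.25.
worlds = [(c1, c2, score(c1, c2), 0.25) for c1, c2 in product("HT", repeat=2)]

def p_c1_heads(**evidence):
    """P(C1 = H | evidence), where evidence may fix c2 and/or s."""
    def consistent(c2, s):
        return all({"c2": c2, "s": s}[k] == v for k, v in evidence.items())
    num = sum(p for c1, c2, s, p in worlds if c1 == "H" and consistent(c2, s))
    den = sum(p for c1, c2, s, p in worlds if consistent(c2, s))
    return num / den

print(p_c1_heads())               # no evidence
print(p_c1_heads(c2="T"))         # C2 alone says nothing about C1
print(p_c1_heads(s=1))            # the score alone is also uninformative here
print(p_c1_heads(s=1, c2="T"))    # score and C2 together determine C1
```

The coins are independent a priori, yet once the score is known, learning C2 changes the belief about C1. That is the pattern the slide calls "explaining away".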

88

Page 86: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

Grades

Smart

Automated Admissions Decisions Grade

inflation

GREs

In a Bayesian Network we donrsquot need the full probability distribution The joint probability depends only on ldquoparentsrdquo For example GRE scores do not depend on the level of grade inflation at the school but the grades do

Admit

86

ldquoExplaining Awayrdquo Season

S Sprinkler

R Rain

Grass wet

Does the probability that itrsquos raining depend on whether the sprinkler is on

Slippery

In a causal sense clearly not But in a probabilistic model the knowledge that it is raining influences our beliefs

87

Coin 1

Coin 2

Score

p(C2=H)=p(C2=T) = 5 p(C1=H)=p(C1=T) = 5

C1 C2 Score

H H 1

T T 1

H T 0

T H 0

Does the prob that C1 =H on depend on whether C2 =T If we know the score then our belief in the state of C1 is influenced by our belief in the state C2

ldquoExplaining Awayrdquo

88

How do we obtain a BN

bull Two problems ndash learning graph structure

bull NP-complete bull approximation algorithms

ndash probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

bull Two common objective functions ndash Maximum likelihood

bull Define the likelihood over training data Xi

ndash Maximum posterior

bull Good search algorithms exist ndash Gradient descent EM Gibbs Sampling hellip

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D= (C1C2S) = (HT0) (HH1) hellip

92

Learning Models from Data

bull Searching for the BN structure NP-complete ndash Too many possible structures to evaluate all of

them even for very small networks ndash Many algorithms have been proposed ndash Incorporated some prior knowledge can reduce

the search space bull Which measurements are likely independent bull Which nodes should regulate transcription bull Which should cause changes in phosphorylation

93

Resources to learn more

Kevin Murphyrsquos tutorial httpwwwcsubcca~murphykBayesbnintrohtml

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets S

P(F) P(T) 05 05

G S P(F) P(T) F 09 01 T 08 02 R

S P(F) P(T) F 08 02 T 02 08

A G R P(F) P(T) F F 09 01 F T 08 02 T F 05 05 T T 02 08

P(A=T|S=T) = ΣΣ P(G|S=T)P(R|S=T)P(A=T|GR) G=FT R=FT

= (08)(02)(01) + (08)(08)(02) + (02)(02)(05) + (02)(08)(08) =29 F F F T T F T T

Rough grading being smart doesnrsquot help very much

GREs are better correlated with intelligence than the grades

G Grade

s

S Smart

R GREs

A Admit

97

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(A=T) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) S=FT G=FT

R=FT

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) = 021 P(S=T) = 05 P(S=F) = 05 P(A=T|S=F) calculated analogously to P(A=T|S=T)

P(S=T) = 05 P(A=T|S=T) calculated on previous slide =29 P(A=T|S=F) =14

P(S=T|A=T) = P(S=TA=T)P(A=T) Or using Bayes Rule = P(S=T)P(A=T|S=T)P(A=T)

98

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

If a student is not admitted is it more likely they had bad GREs or bad grades

Compute P(R=F|A=F) and P(G=F|A=T)

P(R=F|A=F) = P(R=FA=F) P(A=F) = [Σ Σ P(S)P(G=F)P(R)P(A=F)]P(A=F) G=FT

R=FT

P(A=F) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) (as before)

Tedious but straightforward to compute

P(G=F|A=T) = 92 = 16 P(R=F|A=F) 56

99

End of worked example

100

Goal

bull Estimate interaction probability using ndash Affinity capture ndash Two-hybrid ndash Less physical data

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed) 101

Properties of real interactions correlated expression Expression Profile Reliability (EPR)

Deane et al Mol amp Cell Proteomics (2002) 15 349-356

INT = high confidence interactions from

small scale experiments

d = ldquodistancerdquo that measures the difference between two mRNA expression profiles

Note proteins involved in ldquotruerdquo protein-protein interactions have more similar mRNA expression profiles than random pairs Use this to assess how good an experimental set

of interactions is

Source Deane Charlotte M Łukasz Salwiński et al Protein Interactions Two Methods for Aassessment ofthe Reliability of High Throughput Observations Molecular amp Cellular Proteomics 1 no 5 (2002) 349-56

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al BMC Bioinformatics 2007 8(Suppl 4)S7 doi1011861471-2105-8-S4-S7

More likely to interact

Which pattern below is more likely to represent a pair of interacting proteins

104

Courtesy of Cokus et al License CC-BYSource Cokus Shawn Sayaka Mizutani et al An Improved Method for Identifying FunctionallyLinked Proteins using Phylogenetic Profiles BMC Bioinformatics 8 no Suppl 4 (2007) S7

103

Rosetta Stone

bull Look for genes that are fused in some organisms ndash Almost 7000 pairs found in

E coli ndash gt6 of known interactions

can be found with this method

ndash Not very common in eukaryotes

104

Integrating diverse data

httpwwwsciencemagorgcontent3025644449abstract 105

Advantage of Bayesian Networks

bull Data can be a mix of types numerical and categorical

bull Accommodates missing data bull Give appropriate weights to different sources bull Results can be interpreted easily

106

Requirement of Bayesian Classification

bull Gold standard training data ndash Independent from evidence ndash Large ndash No systematic bias

Positive training data MIPS bull Hand-curated from literature Negative training data bull Proteins in different subcellular compartments

107

Integrating diverse data

109

copy American Association for the Advancement of Science All rights reserved This content is excluded fromour Creative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data science 302 no 5644 (2003) 449-53

108

copy American Association for the Advancement of Science All rights reserved This content is excluded from ourCreative Commons license For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-proteinInteractions from Genomic Data Science 302 no 5644 (2003) 449-53

109

likelihood ratio = if gt 1 classify as true if lt 1 classify as false

log likelihood ratio =

Prior probability is the same for all interactions --does not affect ranking

Ranking function =

( | _ )( | _ )log( | _ ) ( | _ )

Mi

i i



“Explaining Away”

Network nodes: Season; S (Sprinkler); R (Rain); Grass wet; Slippery

Does the probability that it’s raining depend on whether the sprinkler is on?

In a causal sense, clearly not. But in a probabilistic model, the knowledge that it is raining influences our beliefs.

“Explaining Away”

Nodes: Coin 1, Coin 2, Score (Score depends on both coins)

p(C1=H) = p(C1=T) = 0.5
p(C2=H) = p(C2=T) = 0.5

C1   C2   Score
H    H    1
T    T    1
H    T    0
T    H    0

Does the probability that C1 = H depend on whether C2 = T? If we know the score, then our belief in the state of C1 is influenced by our belief in the state of C2.
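
The coin example can be checked numerically. The following is a minimal sketch, assuming two fair coins and a Score of 1 exactly when the coins match, as in the table; the helper names `score`, `joint`, and `prob` are illustrative and not from the lecture.

```python
from itertools import product

def score(c1, c2):
    # Score is 1 when the two coins show the same face, else 0 (as in the table).
    return 1 if c1 == c2 else 0

# Joint distribution over (C1, C2, Score) for two independent fair coins.
joint = {(c1, c2, score(c1, c2)): 0.25 for c1, c2 in product("HT", repeat=2)}

def prob(event, given=lambda c1, c2, s: True):
    """P(event | given), computed by summing entries of the joint table."""
    num = sum(p for k, p in joint.items() if event(*k) and given(*k))
    den = sum(p for k, p in joint.items() if given(*k))
    return num / den

c1_heads = lambda c1, c2, s: c1 == "H"

# Without the score, C2 tells us nothing about C1.
p_no_score = prob(c1_heads, lambda c1, c2, s: c2 == "T")
# Knowing only that the score is 1 leaves C1 at 50/50.
p_score_only = prob(c1_heads, lambda c1, c2, s: s == 1)
# Knowing the score AND C2 pins C1 down completely.
p_score_and_c2 = prob(c1_heads, lambda c1, c2, s: s == 1 and c2 == "T")
```

Here `p_no_score` and `p_score_only` are both 0.5, while `p_score_and_c2` is 0: the coins are independent until the shared effect is observed.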

How do we obtain a BN?

• Two problems
  – learning graph structure
      • NP-complete
      • approximation algorithms
  – probability distributions

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
      • Define the likelihood over training data Xi
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …
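
The formulas for these two objectives did not survive in the slide text. In standard notation they can be sketched as follows, assuming independent training examples; the symbols θ (model parameters) and N (number of training examples) are assumptions, not taken from the slides.

```latex
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \prod_{i=1}^{N} P(X_i \mid \theta)
\qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \; P(\theta) \prod_{i=1}^{N} P(X_i \mid \theta)
```

The maximum-posterior objective differs from maximum likelihood only by the prior term P(θ).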

Learning Models from Data

Nodes: Coin 1, Coin 2, Score

Parameters to estimate:

p(C1=H) = ?   p(C1=T) = ?
p(C2=H) = ?   p(C2=T) = ?

C1   C2   Score
H    H    ?
T    T    ?
H    T    ?
T    H    ?

D = {(C1, C2, S)} = (H, T, 0), (H, H, 1), …
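
When the structure is known and every variable is observed, maximum-likelihood estimates for discrete tables reduce to counting. A minimal sketch under those assumptions follows; the data set is hypothetical (only its first two tuples appear on the slide), and `ml_estimates` is an illustrative name.

```python
from collections import Counter

# Hypothetical observations (C1, C2, Score); only the first two come from the slide.
data = [("H", "T", 0), ("H", "H", 1), ("T", "T", 1), ("T", "H", 0), ("H", "H", 1)]

def ml_estimates(observations):
    """Maximum-likelihood parameters for the Coin1 -> Score <- Coin2 network."""
    n = len(observations)
    c1_counts = Counter(o[0] for o in observations)
    c2_counts = Counter(o[1] for o in observations)
    p_c1_heads = c1_counts["H"] / n
    p_c2_heads = c2_counts["H"] / n
    # P(Score=1 | C1, C2): fraction of score-1 rows for each observed parent setting.
    parent_counts = Counter((o[0], o[1]) for o in observations)
    score1_counts = Counter((o[0], o[1]) for o in observations if o[2] == 1)
    p_score1 = {pa: score1_counts[pa] / cnt for pa, cnt in parent_counts.items()}
    return p_c1_heads, p_c2_heads, p_score1

p_c1_heads, p_c2_heads, p_score1 = ml_estimates(data)
```

With missing values or hidden nodes, counting no longer suffices, which is where search procedures such as EM come in.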

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
      • Which measurements are likely independent?
      • Which nodes should regulate transcription?
      • Which should cause changes in phosphorylation?
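
To get a feel for "too many possible structures", the number of labeled directed acyclic graphs on n nodes can be counted with Robinson's recurrence. This is a side calculation, not something from the lecture, and `count_dags` is an illustrative name.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes, via Robinson's inclusion-exclusion recurrence."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

# Candidate structures for networks of 1 to 5 nodes.
small_counts = [count_dags(n) for n in range(1, 6)]
```

`small_counts` is `[1, 3, 25, 543, 29281]`, so even five variables already give tens of thousands of candidate graphs.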

Resources to learn more

Kevin Murphy’s tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked “toy” example

Best to work through on your own

Chain Rule of Probability for Bayes Nets

Nodes: S (Smart), G (Grade), R (GREs), A (Admit)
Edges: S → G, S → R, G → A, R → A

P(S, G, R, A) = P(S) P(G|S) P(R|G,S) P(A|S,G,R)
              = P(S) P(G|S) P(R|S) P(A|G,R)   (why?)

(because of the conditional independence assumptions)

Recall: P(X, Y) = P(X|Y) P(Y)

Prediction with Bayes Nets

Nodes: S (Smart), G (Grade), R (GREs), A (Admit)

S:
P(F)   P(T)
0.5    0.5

G (rough grading: being smart doesn’t help very much):
S    P(F)   P(T)
F    0.9    0.1
T    0.8    0.2

R (GREs are better correlated with intelligence than the grades):
S    P(F)   P(T)
F    0.8    0.2
T    0.2    0.8

A:
G    R    P(F)   P(T)
F    F    0.9    0.1
F    T    0.8    0.2
T    F    0.5    0.5
T    T    0.2    0.8

P(A=T | S=T) = Σ_{G=F,T} Σ_{R=F,T} P(G|S=T) P(R|S=T) P(A=T|G,R)
             = (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29
   (G, R)    =    (F, F)           (F, T)           (T, F)           (T, T)
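
The double sum on this slide can be reproduced directly. This is a minimal sketch using the conditional probability tables as transcribed here; the names `bern` and `p_admit_given_smart` are illustrative.

```python
from itertools import product

# Conditional probability tables as read from the slide (True = T, False = F).
p_g_true = {False: 0.1, True: 0.2}                   # P(G=T | S)
p_r_true = {False: 0.2, True: 0.8}                   # P(R=T | S)
p_a_true = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)

def bern(p_true, value):
    """Probability of a boolean value given P(value is True)."""
    return p_true if value else 1.0 - p_true

def p_admit_given_smart(s):
    """P(A=T | S=s): sum over both values of G and R."""
    return sum(
        bern(p_g_true[s], g) * bern(p_r_true[s], r) * p_a_true[(g, r)]
        for g, r in product([False, True], repeat=2)
    )

p_admit_if_smart = p_admit_given_smart(True)
p_admit_if_not_smart = p_admit_given_smart(False)
```

With these tables, `p_admit_if_smart` is 0.292 and `p_admit_if_not_smart` is 0.164.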

Inference with Bayes Nets

Nodes: S (Smart), G (Grade), R (GREs), A (Admit)

P(A=T) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=T|G,R)
       = P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) = 0.23

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) = 0.29 (calculated on the previous slide)
P(A=T|S=F) = (0.9)(0.8)(0.1) + (0.9)(0.2)(0.2) + (0.1)(0.8)(0.5) + (0.1)(0.2)(0.8) = 0.16 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
Or, using Bayes’ rule: P(S=T|A=T) = P(S=T) P(A=T|S=T) / P(A=T)

Inference with Bayes Nets

Nodes: S (Smart), G (Grade), R (GREs), A (Admit)

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

P(R=F|A=F) = P(R=F, A=F) / P(A=F)
           = [Σ_{S=F,T} Σ_{G=F,T} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S=F,T} Σ_{G=F,T} Σ_{R=F,T} P(S) P(G|S) P(R|S) P(A=F|G,R) = 1 − P(A=T)   (as before)

Tedious but straightforward to compute

P(G=F|A=F) / P(R=F|A=F) = 0.94 / 0.55 = 1.7
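
The "tedious but straightforward" sums can be left to a few lines of code. The sketch below does inference by brute-force enumeration of all 16 joint states, using the tables from the "Prediction with Bayes Nets" slide; the function names and the dictionary-based query interface are assumptions made for illustration.

```python
from itertools import product

P_S_TRUE = 0.5
P_G_TRUE = {False: 0.1, True: 0.2}                   # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}                   # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}   # P(A=T | G, R)

def bern(p_true, value):
    """Probability of a boolean value given P(value is True)."""
    return p_true if value else 1.0 - p_true

def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bern(P_S_TRUE, s) * bern(P_G_TRUE[s], g)
            * bern(P_R_TRUE[s], r) * bern(P_A_TRUE[(g, r)], a))

def query(target, evidence):
    """P(target | evidence); both are dicts mapping 's', 'g', 'r', 'a' to booleans."""
    num = den = 0.0
    for s, g, r, a in product([False, True], repeat=4):
        world = {"s": s, "g": g, "r": r, "a": a}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(s, g, r, a)
            den += p
            if all(world[k] == v for k, v in target.items()):
                num += p
    return num / den

p_bad_grades = query({"g": False}, {"a": False})      # P(G=F | A=F)
p_bad_gres = query({"r": False}, {"a": False})        # P(R=F | A=F)
p_smart_if_admitted = query({"s": True}, {"a": True})  # P(S=T | A=T)
```

Enumeration is exponential in the number of variables, so it only suits toy networks like this one.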

End of worked example

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

Diagram: “P1-P2 REAL” is the cause (often “hidden”); “Detected by X1”, …, “Detected by Xn” are its effects (observed).

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1.5: 349-356

INT = high-confidence interactions from small-scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. “Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations.” Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
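
One way to turn this observation into a number is to treat the distance histogram of an experimental set as a mixture of the histogram for trusted interactions (INT) and the histogram for random pairs, then estimate the mixing fraction. The sketch below does this by least squares. It is written in the spirit of EPR rather than as the published procedure, and the histograms and the name `mixture_fraction` are hypothetical.

```python
def mixture_fraction(h_exp, h_int, h_non):
    """Least-squares alpha such that h_exp is close to alpha*h_int + (1 - alpha)*h_non.

    All three arguments are distance histograms over the same bins, normalized to sum to 1.
    """
    num = sum((e - n) * (i - n) for e, i, n in zip(h_exp, h_int, h_non))
    den = sum((i - n) ** 2 for i, n in zip(h_int, h_non))
    alpha = num / den
    return min(1.0, max(0.0, alpha))  # clamp to a valid fraction

# Hypothetical histograms over four distance bins (small d first).
h_int = [0.50, 0.30, 0.15, 0.05]   # trusted interactions: mostly small distances
h_non = [0.10, 0.20, 0.30, 0.40]   # random pairs: mostly large distances
# Synthetic "experimental" set built as 30% true and 70% random pairs.
h_exp = [0.3 * i + 0.7 * n for i, n in zip(h_int, h_non)]

alpha = mixture_fraction(h_exp, h_int, h_non)
```

On this synthetic input the estimate recovers the 0.3 fraction used to build it; real data would need a choice of distance measure and binning.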

Co-evolution

Cokus et al. BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: patterns from Cokus et al.; one is labeled “More likely to interact”]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. “An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles.” BMC Bioinformatics 8, no. Suppl 4 (2007): S7.
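
A phylogenetic profile records, genome by genome, whether a protein is present. As a rough sketch of how two profiles might be compared, the code below scores the fraction of genomes where they agree. The profiles are hypothetical, and simple agreement is a stand-in chosen for brevity, not the scoring method of the cited paper.

```python
def profile_agreement(profile_a, profile_b):
    """Fraction of genomes in which two presence/absence profiles agree."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    matches = sum(a == b for a, b in zip(profile_a, profile_b))
    return matches / len(profile_a)

# Hypothetical presence (1) / absence (0) of three proteins across eight genomes.
prot_x = [1, 1, 0, 0, 1, 0, 1, 1]
prot_y = [1, 1, 0, 0, 1, 0, 1, 0]   # nearly the same pattern as prot_x
prot_z = [0, 1, 1, 0, 0, 1, 0, 1]   # unrelated pattern

score_xy = profile_agreement(prot_x, prot_y)
score_xz = profile_agreement(prot_x, prot_z)
```

Proteins gained and lost together (`score_xy` = 0.875) are better candidates for a functional link than proteins with unrelated patterns (`score_xz` = 0.375).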

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes
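
The fusion idea can be illustrated with a toy search. In practice fusions are found by sequence alignment; the sketch below simplifies each protein to a set of domain labels, which is an assumption made only to keep the example short. All gene names, domain names, and the function `rosetta_pairs` are hypothetical.

```python
from itertools import combinations

def rosetta_pairs(query_genome, reference_genome):
    """Pairs of separate query proteins whose domains co-occur in one reference protein."""
    pairs = set()
    names = sorted(query_genome)
    for name_a, name_b in combinations(names, 2):
        dom_a, dom_b = query_genome[name_a], query_genome[name_b]
        if dom_a & dom_b:
            continue  # skip pairs that already share a domain
        for fused_domains in reference_genome.values():
            # A "Rosetta Stone" protein contains the domains of both query proteins.
            if dom_a <= fused_domains and dom_b <= fused_domains:
                pairs.add((name_a, name_b))
                break
    return pairs

# Hypothetical proteins described by their domain sets.
query = {"geneA": {"d1"}, "geneB": {"d2"}, "geneC": {"d3"}}
reference = {"fusedAB": {"d1", "d2"}, "other": {"d3", "d4"}}

predicted = rosetta_pairs(query, reference)
```

Only geneA and geneB are predicted to be linked, because their domains appear together in the single reference protein `fusedAB`.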

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. “A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data.” Science 302, no. 5644 (2003): 449-53.

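likelihood ratio = P(Data | true_PPI) / P(Data | false_PPI): if > 1, classify as true; if < 1, classify as false

log likelihood ratio = log [P(Data | true_PPI) / P(Data | false_PPI)]

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function:

$$\log\frac{P(\text{Data}\mid\text{true\_PPI})}{P(\text{Data}\mid\text{false\_PPI})} = \log\prod_{i=1}^{M}\frac{P(\text{Observation}_i\mid\text{true\_PPI})}{P(\text{Observation}_i\mid\text{false\_PPI})}$$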
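
Under the naive Bayes assumption, the ranking function is a sum of per-source log ratios. The sketch below ranks a few protein pairs this way. Every probability, source name, and pair name in it is hypothetical, chosen only to show the mechanics, including how a source with no measurement for a pair is skipped.

```python
import math

# Hypothetical likelihoods per evidence source, keyed by the observed value:
# (P(observation | true PPI), P(observation | false PPI)).
likelihoods = {
    "two_hybrid":       {True: (0.30, 0.01), False: (0.70, 0.99)},
    "affinity_capture": {True: (0.40, 0.05), False: (0.60, 0.95)},
    "coexpressed":      {True: (0.60, 0.20), False: (0.40, 0.80)},
}

def log_likelihood_ratio(observations):
    """Sum of per-source log ratios; sources absent from `observations` are skipped."""
    total = 0.0
    for source, value in observations.items():
        p_true, p_false = likelihoods[source][value]
        total += math.log(p_true / p_false)
    return total

# Hypothetical protein pairs; P5-P6 has no affinity-capture measurement.
pairs = {
    "P1-P2": {"two_hybrid": True, "affinity_capture": True, "coexpressed": True},
    "P3-P4": {"two_hybrid": False, "affinity_capture": False, "coexpressed": True},
    "P5-P6": {"two_hybrid": True, "coexpressed": False},
}

ranked = sorted(pairs, key=lambda name: log_likelihood_ratio(pairs[name]), reverse=True)
```

Because the prior is shared by all pairs, sorting by this score gives the same order as sorting by posterior probability.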

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

Likelihood = L = P(f | pos) / P(f | neg)

Counts highlighted in the figure: 1114/2150 (positives) and 81924/573734 (negatives)
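
The likelihood ratio for one feature value follows from gold-standard counts. The sketch below assumes the two digit runs on this slide read as 1114 of 2150 positive pairs and 81924 of 573734 negative pairs; that reading of the garbled figure text is an assumption, and `likelihood_ratio` is an illustrative name.

```python
def likelihood_ratio(pos_count, pos_total, neg_count, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    return (pos_count / pos_total) / (neg_count / neg_total)

# Assumed reading of the slide's numbers (see the note above).
L_example = likelihood_ratio(1114, 2150, 81924, 573734)
```

Under that reading, `L_example` is about 3.6, meaning the feature value is roughly 3.6 times more common among gold-standard positives than negatives.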

Fully connected → Compute probabilities for all 16 possible combinations

Figure source: Jansen, Ronald, Haiyuan Yu, et al. Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science.
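
When evidence sources are not treated as independent, a likelihood ratio is estimated for each joint combination rather than for each source. The sketch below assumes the 16 combinations come from four binary evidence sources (2^4 = 16). The gold-standard data are hypothetical, and combinations that lack counts are left unestimated.

```python
from collections import Counter
from itertools import product

def joint_likelihood_ratios(positives, negatives):
    """Likelihood ratio for each of the 16 combinations of four binary sources."""
    pos_counts, neg_counts = Counter(positives), Counter(negatives)
    ratios = {}
    for combo in product([0, 1], repeat=4):
        p, n = pos_counts[combo], neg_counts[combo]
        if p == 0 or n == 0:
            ratios[combo] = None  # too few observations to estimate a ratio
        else:
            ratios[combo] = (p / len(positives)) / (n / len(negatives))
    return ratios

# Hypothetical gold-standard pairs: each tuple says which of 4 datasets detected the pair.
positives = [(1, 1, 0, 0)] * 6 + [(0, 0, 0, 0)] * 4
negatives = [(1, 1, 0, 0)] * 1 + [(0, 0, 0, 0)] * 99

ratios = joint_likelihood_ratios(positives, negatives)
```

Most of the 16 cells are empty in this toy data, which mirrors the practical problem: the joint table needs far more gold-standard examples than the per-source tables do.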

Interpret with caution as numbers are small

Figure source: Jansen, Ronald, Haiyuan Yu, et al. Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science.

How many gold-standard events do we score correctly at different likelihood cutoffs?

Plot annotations:
• TP = FP
• Predictions based on a single data type all have TP/FP < 1

Axis: log [P(Data | true_PPI) / P(Data | false_PPI)]

Figure source: Jansen, Ronald, Haiyuan Yu, et al. Science 302, no. 5644 (2003): 449-53. © American Association for the Advancement of Science.
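
The question on this slide amounts to counting true and false positives among gold-standard pairs at each score cutoff. A minimal sketch follows; the scores and labels are hypothetical, and `tp_fp_at_cutoffs` is an illustrative name.

```python
def tp_fp_at_cutoffs(scored_pairs, cutoffs):
    """For each cutoff, count gold-standard positives and negatives scoring at or above it."""
    results = []
    for cutoff in cutoffs:
        tp = sum(1 for score, is_true in scored_pairs if score >= cutoff and is_true)
        fp = sum(1 for score, is_true in scored_pairs if score >= cutoff and not is_true)
        results.append((cutoff, tp, fp))
    return results

# Hypothetical (log-likelihood-ratio score, gold-standard label) pairs.
scored = [(5.2, True), (4.1, True), (3.0, False), (2.5, True),
          (1.0, False), (0.5, False), (-0.3, True), (-1.2, False)]

table = tp_fp_at_cutoffs(scored, [4.0, 2.0, 0.0])
```

In this toy data, lowering the cutoff from 4.0 to 0.0 moves TP/FP from 2/0 through 3/1 down to 3/3, the TP = FP point.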

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Page 89: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

How do we obtain a BN

bull Two problems ndash learning graph structure

bull NP-complete bull approximation algorithms

ndash probability distributions

89

Learning Models from Data

bull Assume we know the structure how do we find the parameters

bull Define an objective function and search for parameters that optimize this function

90

Learning Models from Data

bull Two common objective functions ndash Maximum likelihood

bull Define the likelihood over training data Xi

ndash Maximum posterior

bull Good search algorithms exist ndash Gradient descent EM Gibbs Sampling hellip

91

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D= (C1C2S) = (HT0) (HH1) hellip

92

Learning Models from Data

bull Searching for the BN structure NP-complete ndash Too many possible structures to evaluate all of

them even for very small networks ndash Many algorithms have been proposed ndash Incorporated some prior knowledge can reduce

the search space bull Which measurements are likely independent bull Which nodes should regulate transcription bull Which should cause changes in phosphorylation

93

Resources to learn more

Kevin Murphyrsquos tutorial httpwwwcsubcca~murphykBayesbnintrohtml

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets S

P(F) P(T) 05 05

G S P(F) P(T) F 09 01 T 08 02 R

S P(F) P(T) F 08 02 T 02 08

A G R P(F) P(T) F F 09 01 F T 08 02 T F 05 05 T T 02 08

P(A=T|S=T) = ΣΣ P(G|S=T)P(R|S=T)P(A=T|GR) G=FT R=FT

= (08)(02)(01) + (08)(08)(02) + (02)(02)(05) + (02)(08)(08) =29 F F F T T F T T

Rough grading being smart doesnrsquot help very much

GREs are better correlated with intelligence than the grades

G Grade

s

S Smart

R GREs

A Admit

97

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(A=T) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) S=FT G=FT

R=FT

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) = 021 P(S=T) = 05 P(S=F) = 05 P(A=T|S=F) calculated analogously to P(A=T|S=T)

P(S=T) = 05 P(A=T|S=T) calculated on previous slide =29 P(A=T|S=F) =14

P(S=T|A=T) = P(S=TA=T)P(A=T) Or using Bayes Rule = P(S=T)P(A=T|S=T)P(A=T)

98

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

If a student is not admitted is it more likely they had bad GREs or bad grades

Compute P(R=F|A=F) and P(G=F|A=T)

P(R=F|A=F) = P(R=FA=F) P(A=F) = [Σ Σ P(S)P(G=F)P(R)P(A=F)]P(A=F) G=FT

R=FT

P(A=F) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) (as before)

Tedious but straightforward to compute

P(G=F|A=T) = 92 = 16 P(R=F|A=F) 56

99

End of worked example

100

Goal

bull Estimate interaction probability using ndash Affinity capture ndash Two-hybrid ndash Less physical data

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed) 101

Properties of real interactions correlated expression Expression Profile Reliability (EPR)

Deane et al Mol amp Cell Proteomics (2002) 15 349-356

INT = high confidence interactions from

small scale experiments

d = ldquodistancerdquo that measures the difference between two mRNA expression profiles

Note proteins involved in ldquotruerdquo protein-protein interactions have more similar mRNA expression profiles than random pairs Use this to assess how good an experimental set

of interactions is

Source Deane Charlotte M Łukasz Salwiński et al Protein Interactions Two Methods for Aassessment ofthe Reliability of High Throughput Observations Molecular amp Cellular Proteomics 1 no 5 (2002) 349-56

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: two candidate patterns; one is labeled "More likely to interact"]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.
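One way to make "which pattern is more likely" concrete is to score how similar two profiles are. The sketch below assumes the patterns are phylogenetic profiles (presence or absence of a protein across genomes), as the cited paper's title suggests, and uses mutual information as the similarity score. Both the profiles and the choice of score are illustrative assumptions.

```python
import math
from collections import Counter

def mutual_information(p, q):
    """Mutual information (in bits) between two equal-length discrete profiles."""
    n = len(p)
    joint = Counter(zip(p, q))
    count_p = Counter(p)
    count_q = Counter(q)
    mi = 0.0
    for (a, b), c in joint.items():
        # (c/n) * log2( P(a,b) / (P(a) P(b)) )
        mi += (c / n) * math.log2(c * n / (count_p[a] * count_q[b]))
    return mi

# Hypothetical presence (1) / absence (0) of three proteins across eight genomes.
protein_x = [1, 1, 0, 0, 1, 0, 1, 0]
protein_y = [1, 1, 0, 0, 1, 0, 1, 0]   # same pattern as protein_x
protein_z = [1, 0, 1, 0, 1, 0, 1, 0]   # only loosely related to protein_x

mi_xy = mutual_information(protein_x, protein_y)
mi_xz = mutual_information(protein_x, protein_z)
print(round(mi_xy, 3), round(mi_xz, 3))
```

Proteins that are gained and lost together across genomes score high, which is the co-evolution signal the slide points to.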

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature
Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

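likelihood ratio $= \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$: if > 1 classify as true, if < 1 classify as false

log likelihood ratio $= \log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function $= \prod_{i=1}^{M} \dfrac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$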
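A minimal sketch of this ranking, assuming the observations are conditionally independent given the interaction status, so that the product becomes a sum of logs. The evidence sources and every probability in the table are hypothetical values chosen for illustration.

```python
import math

# Hypothetical likelihoods, for illustration only:
# source -> observed value -> (P(obs | true PPI), P(obs | false PPI))
LIKELIHOODS = {
    "two_hybrid":       {True: (0.30, 0.01), False: (0.70, 0.99)},
    "affinity_capture": {True: (0.40, 0.05), False: (0.60, 0.95)},
    "coexpressed":      {True: (0.60, 0.30), False: (0.40, 0.70)},
}

def log_likelihood_ratio(observations):
    """Sum of per-source log likelihood ratios; sources with no data are skipped."""
    total = 0.0
    for source, value in observations.items():
        p_true, p_false = LIKELIHOODS[source][value]
        total += math.log(p_true / p_false)
    return total

candidate_pairs = {
    "P1-P2": {"two_hybrid": True, "affinity_capture": True, "coexpressed": True},
    "P3-P4": {"two_hybrid": False, "coexpressed": True},   # affinity capture not measured
    "P5-P6": {"two_hybrid": False, "affinity_capture": False, "coexpressed": False},
}

ranking = sorted(candidate_pairs,
                 key=lambda pair: log_likelihood_ratio(candidate_pairs[pair]),
                 reverse=True)
print(ranking)
```

Skipping an unmeasured source is equivalent to giving it a likelihood ratio of 1, which is one simple way to accommodate missing data.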

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

Likelihood $= L = \dfrac{P(f \mid \text{pos})}{P(f \mid \text{neg})}$

Values shown on the slide: 1114/2150 and 81924/573734
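As a quick check of the formula, the sketch below plugs in the two number groups from the slide. It assumes they are counts: 1114 of 2150 gold-standard positive pairs and 81924 of 573734 gold-standard negative pairs share the feature value f. The function name is illustrative.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), with both probabilities estimated from counts."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg

L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 1))
```

Under that reading, the feature value is about 3.6 times more common among positives than among negatives.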


Fully connected → compute probabilities for all 16 possible combinations

Interpret with caution, as numbers are small.
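A sketch of what "all 16 combinations" involves, assuming four binary evidence sources (2^4 = 16). Each combination gets its own likelihood ratio estimated directly from gold-standard counts, and combinations backed by few examples are flagged. The data, the `min_count` threshold, and the function name are hypothetical.

```python
from collections import Counter
from itertools import product

def joint_likelihood_ratios(pos_rows, neg_rows, n_sources=4, min_count=5):
    """Map each evidence combination to (likelihood ratio or None, low_count flag)."""
    pos_counts = Counter(pos_rows)
    neg_counts = Counter(neg_rows)
    table = {}
    for combo in product((0, 1), repeat=n_sources):
        p, n = pos_counts[combo], neg_counts[combo]
        if p == 0 or n == 0:
            ratio = None   # cannot be estimated from the available examples
        else:
            ratio = (p / len(pos_rows)) / (n / len(neg_rows))
        table[combo] = (ratio, min(p, n) < min_count)
    return table

# Hypothetical gold-standard pairs, each described by four 0/1 detections.
pos_rows = [(1, 1, 0, 0)] * 6 + [(0, 0, 0, 0)] * 4
neg_rows = [(1, 1, 0, 0)] * 1 + [(0, 0, 0, 0)] * 99

table = joint_likelihood_ratios(pos_rows, neg_rows)
ratio, low_count = table[(1, 1, 0, 0)]
print(len(table), ratio, low_count)
```

The large ratio for (1, 1, 0, 0) rests on a single negative example, which is exactly the situation the caution about small numbers refers to.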

[Figure: TP/FP at different likelihood-ratio cutoffs, with the level TP = FP marked]

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

Cutoff: $\log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$
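The question on this slide amounts to counting true and false positives above each cutoff. A minimal sketch, using made-up log likelihood ratio scores and gold-standard labels:

```python
def tp_fp_at_cutoffs(scored_pairs, cutoffs):
    """For each cutoff, count gold-standard positives (TP) and negatives (FP) scoring at or above it."""
    results = {}
    for cutoff in cutoffs:
        tp = sum(1 for score, is_positive in scored_pairs
                 if score >= cutoff and is_positive)
        fp = sum(1 for score, is_positive in scored_pairs
                 if score >= cutoff and not is_positive)
        ratio = tp / fp if fp else float("inf")
        results[cutoff] = (tp, fp, ratio)
    return results

# Hypothetical (log likelihood ratio, is gold-standard positive) pairs.
scored_pairs = [(3.0, True), (2.5, True), (2.0, False),
                (1.0, True), (0.5, False), (-1.0, False)]

results = tp_fp_at_cutoffs(scored_pairs, cutoffs=[0.0, 2.0])
print(results)
```

Raising the cutoff keeps fewer pairs but typically improves TP/FP, which is the trade-off the figure explores.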

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Learning Models from Data

• Assume we know the structure; how do we find the parameters?
• Define an objective function and search for parameters that optimize this function

Learning Models from Data

• Two common objective functions
  – Maximum likelihood
    • Define the likelihood over training data X_i
  – Maximum posterior
• Good search algorithms exist
  – Gradient descent, EM, Gibbs sampling, …

Learning Models from Data

[Figure: network with nodes Coin 1, Coin 2, and Score]

p(C1=H) =    p(C1=T) =
p(C2=H) =    p(C2=T) =

C1  C2  Score
H   H
T   T
H   T
T   H

D = (C1, C2, S) = (H,T,0), (H,H,1), …
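For a table like p(C1=H), the maximum likelihood estimate is simply the observed frequency, and a maximum posterior estimate adds pseudo-counts from a prior. The sketch below assumes a Beta(2, 2) prior. Only the first two records come from the slide; the rest are made up to have something to count.

```python
from collections import Counter

# (C1, C2, Score) records; the first two are from the slide, the rest are hypothetical.
data = [("H", "T", 0), ("H", "H", 1), ("T", "H", 0), ("H", "H", 1), ("T", "T", 0)]

def ml_estimate(values, outcome):
    """Maximum likelihood estimate: the observed frequency of the outcome."""
    return Counter(values)[outcome] / len(values)

def map_estimate(values, outcome, alpha=2.0, beta=2.0):
    """Maximum posterior estimate under a Beta(alpha, beta) prior (posterior mode)."""
    k = Counter(values)[outcome]
    n = len(values)
    return (k + alpha - 1) / (n + alpha + beta - 2)

coin1 = [c1 for c1, _, _ in data]
p_ml = ml_estimate(coin1, "H")
p_map = map_estimate(coin1, "H")
print(p_ml, round(p_map, 3))
```

With complete data both objectives have closed-form optima; search methods such as EM or gradient descent become necessary when some variables are unobserved.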

Learning Models from Data

• Searching for the BN structure: NP-complete
  – Too many possible structures to evaluate all of them, even for very small networks
  – Many algorithms have been proposed
  – Incorporating some prior knowledge can reduce the search space
    • Which measurements are likely independent?
    • Which nodes should regulate transcription?
    • Which should cause changes in phosphorylation?
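To get a feel for how quickly the number of candidate structures grows, the sketch below counts labeled directed acyclic graphs using Robinson's recurrence. The recurrence is a standard result and is not taken from the slides.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))

for n in range(1, 11):
    print(n, count_dags(n))
```

Four nodes already allow 543 structures and five allow 29281, so exhaustive search becomes impractical quickly. This is why prior knowledge that rules out edges is so valuable.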

Resources to learn more

Kevin Murphy's tutorial: http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

© Fabio Gagliardi Cozman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

A worked "toy" example

Best to work through on your own

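Chain Rule of Probability for Bayes Nets

[Figure: Bayesian network with nodes S (Smart), G (Grade), R (GREs), A (Admit)]

$P(S,G,R,A) = P(S)\,P(G \mid S)\,P(R \mid G,S)\,P(A \mid S,G,R)$

$= P(S)\,P(G \mid S)\,P(R \mid S)\,P(A \mid G,R)$ (why? because of the conditional independence assumption)

Recall: $P(X,Y) = P(X \mid Y)\,P(Y)$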
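The "why" can be spelled out in two steps. The derivation below assumes the edges S → G, S → R, G → A, and R → A, which is the structure implied by the simplified factorization.

```latex
% Step 1: given its parent S, the GRE score R does not depend on the grade G.
P(R \mid G, S) = P(R \mid S)

% Step 2: given its parents G and R, admission A does not depend on S.
P(A \mid S, G, R) = P(A \mid G, R)

% Substituting both into the general chain rule:
P(S, G, R, A) = P(S)\, P(G \mid S)\, P(R \mid G, S)\, P(A \mid S, G, R)
              = P(S)\, P(G \mid S)\, P(R \mid S)\, P(A \mid G, R)
```

In general, each factor conditions only on the node's parents in the graph, which is what keeps the tables small.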

Prediction with Bayes Nets

[Figure: Bayesian network with nodes S (Smart), G (Grade), R (GREs), A (Admit)]

S:
P(F)  P(T)
0.5   0.5

G (rough grading: being smart doesn't help very much):
S  P(F)  P(T)
F  0.9   0.1
T  0.8   0.2

R (GREs are better correlated with intelligence than the grades):
S  P(F)  P(T)
F  0.8   0.2
T  0.2   0.8

A:
G  R  P(F)  P(T)
F  F  0.9   0.1
F  T  0.8   0.2
T  F  0.5   0.5
T  T  0.2   0.8

$P(A=T \mid S=T) = \sum_{G}\sum_{R} P(G \mid S=T)\,P(R \mid S=T)\,P(A=T \mid G,R)$

$= (0.8)(0.2)(0.1) + (0.8)(0.8)(0.2) + (0.2)(0.2)(0.5) + (0.2)(0.8)(0.8) = 0.29$

(the four terms correspond to G,R = F,F; F,T; T,F; T,T)
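The same sum can be checked in code. The sketch below assumes the tables as transcribed above and uses illustrative names. With these tables the S=T case reproduces the 0.29 on this slide. The S=F case evaluates to about 0.16, slightly higher than the figure quoted on the following slide, so treat the second decimal of the downstream numbers with some caution.

```python
# Conditional probability tables as transcribed from this slide (assumed correct).
P_S_TRUE = 0.5
P_G_TRUE_GIVEN_S = {False: 0.1, True: 0.2}   # P(G=T | S)
P_R_TRUE_GIVEN_S = {False: 0.2, True: 0.8}   # P(R=T | S)
P_A_TRUE_GIVEN_GR = {(False, False): 0.1, (False, True): 0.2,
                     (True, False): 0.5, (True, True): 0.8}  # P(A=T | G, R)

def bernoulli(p_true, value):
    """Probability of a binary value given the probability that it is True."""
    return p_true if value else 1.0 - p_true

def p_admit_given_smart(s):
    """P(A=T | S=s), summing over the unobserved G and R."""
    total = 0.0
    for g in (False, True):
        for r in (False, True):
            total += (bernoulli(P_G_TRUE_GIVEN_S[s], g)
                      * bernoulli(P_R_TRUE_GIVEN_S[s], r)
                      * P_A_TRUE_GIVEN_GR[(g, r)])
    return total

p_if_smart = p_admit_given_smart(True)
p_if_not_smart = p_admit_given_smart(False)
p_admit = P_S_TRUE * p_if_smart + (1 - P_S_TRUE) * p_if_not_smart
print(round(p_if_smart, 3), round(p_if_not_smart, 3), round(p_admit, 3))
```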

Inference with Bayes Nets

[Figure: Bayesian network with nodes S (Smart), G (Grade), R (GREs), A (Admit)]

$P(A=T) = \sum_{S}\sum_{G}\sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=T \mid G,R)$

$= P(S=T)\,P(A=T \mid S=T) + P(S=F)\,P(A=T \mid S=F) = 0.21$

where $P(S=T) = 0.5$ and $P(S=F) = 0.5$; $P(A=T \mid S=T) = 0.29$ was calculated on the previous slide, and $P(A=T \mid S=F) = 0.14$ is calculated analogously.

$P(S=T \mid A=T) = \dfrac{P(S=T, A=T)}{P(A=T)}$, or using Bayes' rule, $= \dfrac{P(S=T)\,P(A=T \mid S=T)}{P(A=T)}$
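Plugging the slide's rounded values into Bayes' rule gives the posterior. The result below inherits whatever rounding those inputs carry.

```latex
P(S{=}T \mid A{=}T)
  = \frac{P(S{=}T)\, P(A{=}T \mid S{=}T)}{P(A{=}T)}
  = \frac{0.5 \times 0.29}{0.21}
  \approx 0.69
```

Learning that a student was admitted therefore raises the belief that they are smart from the prior of 0.5 to roughly 0.7.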

Learning Models from Data

Coin 1

Coin 2

Score

p(C2=H)= p(C2=T) =

p(C1=H)= p(C1=T) =

C1 C2 Score

H H

T T

H T

T H

D= (C1C2S) = (HT0) (HH1) hellip

92

Learning Models from Data

bull Searching for the BN structure NP-complete ndash Too many possible structures to evaluate all of

them even for very small networks ndash Many algorithms have been proposed ndash Incorporated some prior knowledge can reduce

the search space bull Which measurements are likely independent bull Which nodes should regulate transcription bull Which should cause changes in phosphorylation

93

Resources to learn more

Kevin Murphyrsquos tutorial httpwwwcsubcca~murphykBayesbnintrohtml

94

copy Fabio Gagliardi Cozman All rights reserved This content is excluded from our Creative

Commons license For more information see httpocwmiteduhelpfaq-fair-use

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets S

P(F) P(T) 05 05

G S P(F) P(T) F 09 01 T 08 02 R

S P(F) P(T) F 08 02 T 02 08

A G R P(F) P(T) F F 09 01 F T 08 02 T F 05 05 T T 02 08

P(A=T|S=T) = ΣΣ P(G|S=T)P(R|S=T)P(A=T|GR) G=FT R=FT

= (08)(02)(01) + (08)(08)(02) + (02)(02)(05) + (02)(08)(08) =29 F F F T T F T T

Rough grading being smart doesnrsquot help very much

GREs are better correlated with intelligence than the grades

G Grade

s

S Smart

R GREs

A Admit

97

Inference with Bayes Nets

[Network diagram: S (Smart) → G (Grade); S → R (GREs); G and R → A (Admit)]

P(A=T) = Σ_{S∈{F,T}} Σ_{G∈{F,T}} Σ_{R∈{F,T}} P(S) P(G|S) P(R|S) P(A=T|G,R)

= P(S=T) P(A=T|S=T) + P(S=F) P(A=T|S=F) ≈ 0.23

P(S=T) = 0.5, P(S=F) = 0.5
P(A=T|S=T) ≈ 0.29 (calculated on the previous slide)
P(A=T|S=F) = (0.9)(0.8)(0.1) + (0.9)(0.2)(0.2) + (0.1)(0.8)(0.5) + (0.1)(0.2)(0.8) ≈ 0.16 (calculated analogously to P(A=T|S=T))

P(S=T|A=T) = P(S=T, A=T) / P(A=T)
or, using Bayes’ rule, = P(S=T) P(A=T|S=T) / P(A=T)
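The two steps on this slide, total probability followed by Bayes' rule, are short enough to write out. The sketch below takes the two conditional probabilities as inputs, using the unrounded values obtained by enumerating the slide's tables (0.292 and 0.164); the function names are hypothetical.

```python
def total_probability(prior_true, lik_if_true, lik_if_false):
    """P(E) = P(H) P(E|H) + P(not H) P(E|not H)."""
    return prior_true * lik_if_true + (1.0 - prior_true) * lik_if_false

def posterior(prior_true, lik_if_true, lik_if_false):
    """Bayes' rule: P(H|E) = P(H) P(E|H) / P(E)."""
    evidence = total_probability(prior_true, lik_if_true, lik_if_false)
    return prior_true * lik_if_true / evidence

# H is "S = T" (smart), E is "A = T" (admitted).
p_admit = total_probability(0.5, 0.292, 0.164)
p_smart_given_admit = posterior(0.5, 0.292, 0.164)
print(round(p_admit, 3), round(p_smart_given_admit, 2))  # 0.228 0.64
```

Admission raises the probability that the student is smart from the prior of 0.5 to roughly 0.64 under these tables.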

98

Inference with Bayes Nets

[Network diagram: S (Smart) → G (Grade); S → R (GREs); G and R → A (Admit)]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F).

P(R=F|A=F) = P(R=F, A=F) / P(A=F) = [Σ_{S∈{F,T}} Σ_{G∈{F,T}} P(S) P(G|S) P(R=F|S) P(A=F|G,R=F)] / P(A=F)

P(A=F) = Σ_{S∈{F,T}} Σ_{G∈{F,T}} Σ_{R∈{F,T}} P(S) P(G|S) P(R|S) P(A=F|G,R) = 1 − P(A=T) (as before)

Tedious but straightforward to compute:

P(G=F|A=F) / P(R=F|A=F) ≈ 0.94 / 0.55 ≈ 1.7
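The "tedious but straightforward" part can be handed to a brute-force enumeration of the full joint distribution, which is feasible here because there are only 16 joint states. This is a sketch under the slide's tables; the helper `probability` and its keyword interface are illustrative choices.

```python
from itertools import product

# P(variable = True | parents), read off the slide's tables.
P_S_T = 0.5
P_G_T = {False: 0.1, True: 0.2}
P_R_T = {False: 0.2, True: 0.8}
P_A_T = {(False, False): 0.1, (False, True): 0.2,
         (True, False): 0.5, (True, True): 0.8}

def pr(p_true, value):
    """P(X = value) for a binary variable with P(X = True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(s, g, r, a):
    """P(S, G, R, A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return pr(P_S_T, s) * pr(P_G_T[s], g) * pr(P_R_T[s], r) * pr(P_A_T[(g, r)], a)

def probability(**fixed):
    """Sum the joint over every assignment that agrees with the fixed values."""
    total = 0.0
    for s, g, r, a in product((False, True), repeat=4):
        world = {"s": s, "g": g, "r": r, "a": a}
        if all(world[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total

p_not_admitted = probability(a=False)
p_bad_grades = probability(g=False, a=False) / p_not_admitted
p_bad_gres = probability(r=False, a=False) / p_not_admitted
print(round(p_bad_grades, 2), round(p_bad_gres, 2))  # 0.94 0.55
```

Bad grades come out as more probable than bad GREs for a rejected student, largely because bad grades are common for everyone in this model. Enumeration scales exponentially with the number of variables, so it is only practical for toy networks like this one.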

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Diagram: the hidden cause “P1-P2 REAL” points to the observed effects “Detected by X1”, …, “Detected by Xn”]

Cause (often “hidden”)

Effects (observed)

101

Properties of real interactions: correlated expression
Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high confidence interactions from small scale experiments

d = “distance” that measures the difference between two mRNA expression profiles

Note: proteins involved in “true” protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.
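As a rough illustration of the idea, the sketch below compares the average profile distance of a candidate interaction set with that of random pairs. The slide only says that d is a "distance"; Euclidean distance is an assumption made here, and the protein names and expression values are invented.

```python
import math

def profile_distance(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of conditions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_pair_distance(profiles, pairs):
    """Average distance over a list of (protein, protein) pairs."""
    return sum(profile_distance(profiles[p], profiles[q]) for p, q in pairs) / len(pairs)

# Hypothetical mRNA expression profiles across three conditions.
profiles = {
    "P1": [1.0, 2.0, 3.0],
    "P2": [1.1, 2.1, 2.9],
    "P3": [3.0, 0.5, 1.0],
    "P4": [0.2, 2.5, 0.1],
}
candidate_pairs = [("P1", "P2")]
random_pairs = [("P1", "P3"), ("P2", "P4")]

print(mean_pair_distance(profiles, candidate_pairs))
print(mean_pair_distance(profiles, random_pairs))
```

A data set whose pairs look no closer than random pairs is likely to contain many false positives; the published method works with the full distribution of distances rather than just the mean.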

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

102

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: candidate patterns for pairs of proteins, with one marked “More likely to interact”]

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

108

109

Likelihood ratio = P(Data | true_PPI) / P(Data | false_PPI)
                 = ∏_{i=1..M} P(Observation_i | true_PPI) / P(Observation_i | false_PPI)

If > 1, classify as true; if < 1, classify as false.

Log likelihood ratio = log [P(Data | true_PPI) / P(Data | false_PPI)]

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function = Σ_{i=1..M} log [P(Observation_i | true_PPI) / P(Observation_i | false_PPI)]
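A minimal sketch of the ranking function follows, assuming the evidence sources are conditionally independent given the true/false label (the naive Bayes assumption behind the product). The source names and probability tables are invented for illustration. A source with no observation for a pair is simply left out of the sum, which is one way missing data can be accommodated.

```python
import math

# Hypothetical P(observation | true PPI) and P(observation | false PPI).
P_GIVEN_TRUE = {
    "two_hybrid": {True: 0.3, False: 0.7},
    "coexpression": {"high": 0.6, "low": 0.4},
}
P_GIVEN_FALSE = {
    "two_hybrid": {True: 0.01, False: 0.99},
    "coexpression": {"high": 0.2, "low": 0.8},
}

def log_likelihood_ratio(observations):
    """Sum of per-source log likelihood ratios for one protein pair.

    `observations` maps a source name to its observed value; sources
    that are missing from the mapping contribute nothing to the sum.
    """
    total = 0.0
    for source, value in observations.items():
        total += math.log(P_GIVEN_TRUE[source][value] / P_GIVEN_FALSE[source][value])
    return total

pair_a = {"two_hybrid": True, "coexpression": "high"}
pair_b = {"coexpression": "low"}          # no two-hybrid measurement
ranked = sorted([("A", pair_a), ("B", pair_b)],
                key=lambda item: log_likelihood_ratio(item[1]), reverse=True)
print([name for name, _ in ranked])
```

Working in log space turns the product into a sum and avoids numerical underflow when many sources are combined.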

110

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

Likelihood ratio = L = P(f | pos) / P(f | neg) = (1114 / 2150) / (81924 / 573734) ≈ 3.6
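The likelihood ratio for one feature value is just a ratio of two frequencies in the gold-standard sets. The sketch below reads the slide's figures as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives sharing the feature value; that reading of the garbled numbers is an assumption, and the function name is hypothetical.

```python
def likelihood_ratio(pos_with_feature, pos_total, neg_with_feature, neg_total):
    """L = P(f | pos) / P(f | neg), with both terms estimated from counts."""
    p_f_given_pos = pos_with_feature / pos_total
    p_f_given_neg = neg_with_feature / neg_total
    return p_f_given_pos / p_f_given_neg

lr = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(lr, 1))  # 3.6
```

A value above 1 means the feature is enriched among true interactions, so observing it should raise a pair's score.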

111

112


113

Fully connected → compute probabilities for all 16 possible combinations


114

Interpret with caution as numbers are small


115

TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

[Plot axis: log [P(Data | true_PPI) / P(Data | false_PPI)]]
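The question of how many gold-standard pairs are scored correctly at each cutoff can be answered by a simple count. The sketch below uses invented likelihood ratios and labels; a pair is called positive when its likelihood ratio is at or above the cutoff, which is an assumed convention.

```python
def tp_fp_at_cutoff(scored_pairs, cutoff):
    """Count true and false positives among pairs scoring at or above the cutoff.

    `scored_pairs` is a list of (likelihood_ratio, is_gold_standard_positive).
    """
    tp = sum(1 for lr, is_pos in scored_pairs if lr >= cutoff and is_pos)
    fp = sum(1 for lr, is_pos in scored_pairs if lr >= cutoff and not is_pos)
    return tp, fp

# Hypothetical scores for gold-standard positives (True) and negatives (False).
scored = [(600.0, True), (50.0, True), (50.0, False),
          (2.0, False), (0.5, True), (0.1, False)]

for cutoff in (1.0, 10.0, 100.0):
    tp, fp = tp_fp_at_cutoff(scored, cutoff)
    print(cutoff, tp, fp)
```

Raising the cutoff trades coverage for accuracy: fewer true positives are recovered, but the TP/FP ratio improves.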


116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Page 95: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

A worked ldquotoyrdquo example

Best to work through on your own

95

Chain Rule of Probability for Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(SGRD) = P(S)P(G|S)P(R|GS)P(A|SGR)

= P(S)P(G|S)P(R|S)P(A|GR) (why)

(because of conditional independence assumption)

Recall P(XY) = P(X|Y)P(Y)

96

Prediction with Bayes Nets S

P(F) P(T) 05 05

G S P(F) P(T) F 09 01 T 08 02 R

S P(F) P(T) F 08 02 T 02 08

A G R P(F) P(T) F F 09 01 F T 08 02 T F 05 05 T T 02 08

P(A=T|S=T) = ΣΣ P(G|S=T)P(R|S=T)P(A=T|GR) G=FT R=FT

= (08)(02)(01) + (08)(08)(02) + (02)(02)(05) + (02)(08)(08) =29 F F F T T F T T

Rough grading being smart doesnrsquot help very much

GREs are better correlated with intelligence than the grades

G Grade

s

S Smart

R GREs

A Admit

97

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(A=T) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) S=FT G=FT

R=FT

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) = 021 P(S=T) = 05 P(S=F) = 05 P(A=T|S=F) calculated analogously to P(A=T|S=T)

P(S=T) = 05 P(A=T|S=T) calculated on previous slide =29 P(A=T|S=F) =14

P(S=T|A=T) = P(S=TA=T)P(A=T) Or using Bayes Rule = P(S=T)P(A=T|S=T)P(A=T)

98

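Inference with Bayes Nets

[Figure: Bayes net with nodes S (Smart), G (Grade), R (GREs), A (Admit); edges S → G, S → R, G → A, R → A]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F | A=F) and P(G=F | A=F)

$P(R=F \mid A=F) = P(R=F, A=F) / P(A=F) = \left[\sum_{S} \sum_{G} P(S)\,P(G \mid S)\,P(R=F \mid S)\,P(A=F \mid G,R=F)\right] / P(A=F)$

$P(A=F) = \sum_{S} \sum_{G} \sum_{R} P(S)\,P(G \mid S)\,P(R \mid S)\,P(A=F \mid G,R)$ (as before)

Tedious but straightforward to compute

$P(G=F \mid A=F) / P(R=F \mid A=F) \approx 0.94 / 0.55 \approx 1.7$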
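The "tedious but straightforward" sums can be delegated to brute-force enumeration over all 16 assignments of (S, G, R, A). This is a sketch under the assumption that the tables read as transcribed in the worked example; the helper names `joint`, `prob`, and `conditional` are illustrative. The printed values are exact for those tables, so they may differ in the second decimal from figures rounded by hand.

```python
from itertools import product

# P(variable = True | parents), as transcribed from the worked example.
P_S_TRUE = 0.5
P_G_TRUE = {False: 0.1, True: 0.2}                    # P(G=T | S)
P_R_TRUE = {False: 0.2, True: 0.8}                    # P(R=T | S)
P_A_TRUE = {(False, False): 0.1, (False, True): 0.2,
            (True, False): 0.5, (True, True): 0.8}    # P(A=T | G, R)


def bernoulli(p_true, value):
    """Return P(X = value) for a binary X with P(X = True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(s, g, r, a):
    """P(S,G,R,A) = P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (bernoulli(P_S_TRUE, s) * bernoulli(P_G_TRUE[s], g)
            * bernoulli(P_R_TRUE[s], r) * bernoulli(P_A_TRUE[(g, r)], a))


def prob(fixed):
    """Marginal probability that the variables named in `fixed` take those values."""
    total = 0.0
    for s, g, r, a in product([False, True], repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if all(world[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total


def conditional(query, evidence):
    """P(query | evidence) = P(query, evidence) / P(evidence)."""
    return prob({**query, **evidence}) / prob(evidence)


print(f"P(A=T)       = {prob({'A': True}):.3f}")                           # 0.228
print(f"P(S=T | A=T) = {conditional({'S': True}, {'A': True}):.3f}")       # 0.640
print(f"P(G=F | A=F) = {conditional({'G': False}, {'A': False}):.3f}")     # 0.938
print(f"P(R=F | A=F) = {conditional({'R': False}, {'A': False}):.3f}")     # 0.552
```

Enumeration is exponential in the number of variables, which is acceptable for a four-node toy network but not for large networks.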

99

End of worked example

100

Goal

• Estimate interaction probability using
  – Affinity capture
  – Two-hybrid
  – Less physical data

[Figure: Bayes net in which the cause (often "hidden"), "P1-P2 REAL", points to the observed effects "Detected by X1", …, "Detected by Xn"]

101

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1(5): 349-356

INT = high-confidence interactions from small-scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.
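To make the idea concrete, here is a minimal sketch that compares the average profile distance of a candidate interaction set with that of background pairs. It is not the published EPR procedure: the choice of Euclidean distance for d, the gene names, and the expression values are all assumptions made for illustration.

```python
import math


def profile_distance(profile_a, profile_b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


def mean_pair_distance(pairs, expression):
    """Average distance d over a list of (gene, gene) pairs."""
    distances = [profile_distance(expression[g1], expression[g2])
                 for g1, g2 in pairs]
    return sum(distances) / len(distances)


# Hypothetical expression profiles measured across four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 2.0],
    "geneB": [1.1, 2.1, 2.9, 2.0],
    "geneC": [3.0, 0.5, 0.2, 4.0],
    "geneD": [0.2, 3.5, 1.0, 0.1],
}
candidate_pairs = [("geneA", "geneB")]
background_pairs = [("geneA", "geneC"), ("geneB", "geneD"), ("geneC", "geneD")]

print(mean_pair_distance(candidate_pairs, expression))
print(mean_pair_distance(background_pairs, expression))
```

A candidate set whose mean distance is close to the background value would look more like random pairs than like real interactions.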

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.

102

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7, doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

[Figure: phylogenetic profile patterns for protein pairs; one pattern is labeled "More likely to interact"]
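One simple way to compare such patterns is to score how often two proteins are present in the same genomes. The sketch below uses a Jaccard index over presence/absence vectors; this is a deliberately simplified stand-in, not the scoring used by Cokus et al., and the profiles are hypothetical.

```python
def jaccard(profile_a, profile_b):
    """Shared presences divided by genomes where either protein is present."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


# Hypothetical presence (1) / absence (0) of each protein across eight genomes.
protein_1 = [1, 0, 1, 1, 0, 0, 1, 0]
protein_2 = [1, 0, 1, 1, 0, 0, 1, 0]   # same pattern as protein_1
protein_3 = [0, 1, 1, 0, 1, 0, 0, 1]   # mostly different pattern

print(jaccard(protein_1, protein_2))   # 1.0
print(jaccard(protein_1, protein_3))
```

Pairs that are gained and lost together across genomes score near 1 and are better candidates for a functional link than pairs with unrelated patterns.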


Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.

103

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes

104

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

106

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use.
Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

108

109

likelihood ratio: if > 1, classify as true; if < 1, classify as false

log likelihood ratio $= \log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function $= \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$
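The ranking function can be sketched as a sum of per-source log likelihood ratios, which is the log of the product over observations. The source names, observation categories, and probabilities below are hypothetical placeholders, not values from the lecture or from Jansen et al.; sources with no observation for a pair are skipped, which is one simple way to accommodate missing data.

```python
import math


def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum over observed sources of log[P(obs | true_PPI) / P(obs | false_PPI)]."""
    return sum(
        math.log(p_given_true[source][value] / p_given_false[source][value])
        for source, value in observations.items()
    )


# Hypothetical conditional probabilities for two evidence sources.
p_given_true = {
    "two_hybrid": {True: 0.30, False: 0.70},
    "coexpression": {"high": 0.60, "low": 0.40},
}
p_given_false = {
    "two_hybrid": {True: 0.01, False: 0.99},
    "coexpression": {"high": 0.20, "low": 0.80},
}

# Hypothetical protein pairs; pair_2 has no two-hybrid measurement.
pairs = {
    "pair_1": {"two_hybrid": True, "coexpression": "high"},
    "pair_2": {"coexpression": "low"},
}

scores = {name: log_likelihood_ratio(obs, p_given_true, p_given_false)
          for name, obs in pairs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)   # ['pair_1', 'pair_2']
```

Because the prior odds are the same for every pair, adding their logarithm would shift all scores equally and leave this ordering unchanged.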

110

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not)

Likelihood $= L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114 / 2150}{81924 / 573734} \approx 3.6$
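As a quick arithmetic check, the likelihood ratio for one feature value is the fraction of gold-standard positives showing it divided by the fraction of gold-standard negatives showing it. The sketch assumes the two digit runs on the slide read as the fractions 1114/2150 (positives) and 81924/573734 (negatives); the function name is illustrative.

```python
def likelihood_ratio(pos_count, pos_total, neg_count, neg_total):
    """L = P(f | pos) / P(f | neg), estimated from gold-standard counts."""
    p_f_given_pos = pos_count / pos_total
    p_f_given_neg = neg_count / neg_total
    return p_f_given_pos / p_f_given_neg


# Counts as read from the slide (an assumption about the garbled digits).
L = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L, 2))   # 3.63
```

A value above 1 means the feature value is enriched among gold-standard positives relative to negatives.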

111

112


113

Fully connected → compute probabilities for all 16 possible combinations
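With four binary evidence sources there are 2^4 = 16 joint outcomes, and a fully connected model estimates one likelihood ratio per outcome instead of multiplying per-source ratios. The following sketch counts each combination in gold-standard positives and negatives; the tiny data set is hypothetical, and returning `None` for combinations never seen among negatives is an illustrative choice.

```python
from collections import Counter
from itertools import product


def joint_likelihood_ratios(positives, negatives, n_sources=4):
    """Likelihood ratio for every combination of n_sources binary observations."""
    pos_counts = Counter(positives)
    neg_counts = Counter(negatives)
    ratios = {}
    for combo in product([0, 1], repeat=n_sources):
        p_pos = pos_counts[combo] / len(positives)
        p_neg = neg_counts[combo] / len(negatives)
        # The ratio is undefined when no negative example shows this combination.
        ratios[combo] = p_pos / p_neg if p_neg > 0 else None
    return ratios


# Hypothetical gold-standard pairs; each tuple records detection by four data sets.
positives = [(1, 1, 1, 1)] * 3 + [(1, 0, 1, 0)]
negatives = [(0, 0, 0, 0)] * 8 + [(1, 0, 1, 0)] + [(1, 1, 1, 1)]

ratios = joint_likelihood_ratios(positives, negatives)
print(len(ratios))            # 16
print(ratios[(1, 1, 1, 1)])
```

Most of the 16 cells are empty or nearly empty in this toy data, which illustrates why estimates from small counts should be read with caution.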


114

Interpret with caution as numbers are small


115

TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$
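Counting true and false positives at several cutoffs can be sketched as follows. The scored pairs and the cutoffs are hypothetical; each pair carries a likelihood ratio and a flag saying whether it is a gold-standard positive.

```python
def tp_fp_at_cutoffs(scored_pairs, cutoffs):
    """For each cutoff, count gold-standard positives and negatives scoring at or above it."""
    results = []
    for cutoff in cutoffs:
        tp = sum(1 for lr, is_true in scored_pairs if lr >= cutoff and is_true)
        fp = sum(1 for lr, is_true in scored_pairs if lr >= cutoff and not is_true)
        results.append((cutoff, tp, fp))
    return results


# Hypothetical (likelihood ratio, is gold-standard positive) pairs.
scored_pairs = [(100.0, True), (50.0, True), (20.0, False),
                (5.0, True), (2.0, False), (0.5, False)]

for cutoff, tp, fp in tp_fp_at_cutoffs(scored_pairs, [1, 10, 60]):
    print(cutoff, tp, fp)
# 1 3 2
# 10 2 1
# 60 1 0
```

Raising the cutoff trades coverage (fewer true positives recovered) for a higher TP/FP ratio.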


116

Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

117

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.



Page 98: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

P(A=T) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) S=FT G=FT

R=FT

= P(S=T)P(A=T|S=T) + P(S=F)P(A=T|S=F) = 021 P(S=T) = 05 P(S=F) = 05 P(A=T|S=F) calculated analogously to P(A=T|S=T)

P(S=T) = 05 P(A=T|S=T) calculated on previous slide =29 P(A=T|S=F) =14

P(S=T|A=T) = P(S=TA=T)P(A=T) Or using Bayes Rule = P(S=T)P(A=T|S=T)P(A=T)

98

Inference with Bayes Nets

G Grade

s

S Smart

R GREs

A Admit

If a student is not admitted is it more likely they had bad GREs or bad grades

Compute P(R=F|A=F) and P(G=F|A=T)

P(R=F|A=F) = P(R=FA=F) P(A=F) = [Σ Σ P(S)P(G=F)P(R)P(A=F)]P(A=F) G=FT

R=FT

P(A=F) = Σ Σ Σ P(S)P(G|S)P(R|S)P(A=T|GR) (as before)

Tedious but straightforward to compute

P(G=F|A=T) = 92 = 16 P(R=F|A=F) 56

99

End of worked example

100

Goal

bull Estimate interaction probability using ndash Affinity capture ndash Two-hybrid ndash Less physical data

P1-P2 REAL

Detected by X1

Detected by Xn

hellip

Cause (often ldquohiddenrdquo)

Effects (observed) 101

Properties of real interactions correlated expression Expression Profile Reliability (EPR)

Deane et al Mol amp Cell Proteomics (2002) 15 349-356

INT = high confidence interactions from

small scale experiments

d = ldquodistancerdquo that measures the difference between two mRNA expression profiles

Note proteins involved in ldquotruerdquo protein-protein interactions have more similar mRNA expression profiles than random pairs Use this to assess how good an experimental set

of interactions is

Source Deane Charlotte M Łukasz Salwiński et al Protein Interactions Two Methods for Aassessment ofthe Reliability of High Throughput Observations Molecular amp Cellular Proteomics 1 no 5 (2002) 349-56

102

copy American Society for Biochemistry and Molecular Biology All rights reserved This content is excluded

from our Creative Commons license For more information see httpocwmiteduhelpfaq-fair-use

Co-evolution

Cokus et al BMC Bioinformatics 2007 8(Suppl 4)S7 doi1011861471-2105-8-S4-S7

More likely to interact

Which pattern below is more likely to represent a pair of interacting proteins

104

Courtesy of Cokus et al License CC-BYSource Cokus Shawn Sayaka Mizutani et al An Improved Method for Identifying FunctionallyLinked Proteins using Phylogenetic Profiles BMC Bioinformatics 8 no Suppl 4 (2007) S7

103

Rosetta Stone

bull Look for genes that are fused in some organisms ndash Almost 7000 pairs found in

E coli ndash gt6 of known interactions

can be found with this method

ndash Not very common in eukaryotes

104

Integrating diverse data

httpwwwsciencemagorgcontent3025644449abstract 105

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

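likelihood ratio $= \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$; if > 1 classify as true, if < 1 classify as false

log likelihood ratio $= \log \dfrac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$

Prior probability is the same for all interactions, so it does not affect ranking

Ranking function $= \displaystyle\prod_{i=1}^{M} \dfrac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$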
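The ranking function is a product of per-observation likelihood ratios, so in log space it becomes a sum. The sketch below assumes the evidence sources are conditionally independent (the naive Bayes assumption). The probability tables, source names, and the function `log_likelihood_ratio` are hypothetical, not values from the lecture.

```python
import math

def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum over sources of log[P(obs_i | true_PPI) / P(obs_i | false_PPI)].

    A source with no observation (None) is skipped, which is the same as
    giving it a likelihood ratio of 1.
    """
    total = 0.0
    for source, value in observations.items():
        if value is None:
            continue
        total += math.log(p_given_true[source][value] / p_given_false[source][value])
    return total

# Hypothetical conditional probabilities of a detection (True) or not (False).
p_true = {"two_hybrid": {True: 0.3, False: 0.7},
          "affinity_capture": {True: 0.4, False: 0.6}}
p_false = {"two_hybrid": {True: 0.01, False: 0.99},
           "affinity_capture": {True: 0.02, False: 0.98}}

pair_a = {"two_hybrid": True, "affinity_capture": True}
pair_b = {"two_hybrid": False, "affinity_capture": None}  # second source missing

score_a = log_likelihood_ratio(pair_a, p_true, p_false)
score_b = log_likelihood_ratio(pair_b, p_true, p_false)
ranked = sorted([("pair_a", score_a), ("pair_b", score_b)], key=lambda t: t[1], reverse=True)
```

A positive score corresponds to a likelihood ratio above 1 (classify as true), and a negative score to a ratio below 1. Skipping missing sources is one simple way to accommodate missing data.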

Protein pairs in the essentiality data can take on three discrete values (EE, both essential; NN, both non-essential; and NE, one essential and one not).

Likelihood $= L = \dfrac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \dfrac{1114/2150}{81924/573734}$
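The likelihood for one feature value is a ratio of two frequencies, one measured in the gold-standard positives and one in the negatives. The sketch below takes the slide's numbers as the fractions 1114/2150 and 81924/573734; this reading of the extracted digits is an assumption, and the function name `likelihood_ratio` is hypothetical.

```python
def likelihood_ratio(pos_with_value, pos_total, neg_with_value, neg_total):
    """L = P(f | pos) / P(f | neg), with each probability estimated as a frequency."""
    p_f_given_pos = pos_with_value / pos_total
    p_f_given_neg = neg_with_value / neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as read from the slide (assumed split into numerators and denominators).
L_example = likelihood_ratio(1114, 2150, 81924, 573734)
print(round(L_example, 2))
```

Under that reading, the feature value is about 3.6 times more frequent among gold-standard positives than among negatives.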


Fully connected → compute probabilities for all 16 possible combinations
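A fully connected network does not multiply per-source ratios. Instead, it estimates one likelihood ratio for each joint combination of evidence. The sketch below assumes the 16 combinations come from four binary evidence sources (2^4 = 16). The gold-standard rows and the function name `joint_likelihood_ratios` are hypothetical.

```python
from collections import Counter
from itertools import product

def joint_likelihood_ratios(pos_rows, neg_rows, n_sources=4):
    """Likelihood ratio for every joint combination of binary evidence.

    Returns {combination: ratio}. The ratio is None when the combination never
    occurs among the negatives, since the estimate is then undefined.
    """
    pos_counts = Counter(pos_rows)
    neg_counts = Counter(neg_rows)
    ratios = {}
    for combo in product((0, 1), repeat=n_sources):
        p_pos = pos_counts[combo] / len(pos_rows)
        p_neg = neg_counts[combo] / len(neg_rows)
        ratios[combo] = p_pos / p_neg if p_neg > 0 else None
    return ratios

# Hypothetical gold-standard rows: one tuple of four detections per protein pair.
pos_rows = [(1, 1, 1, 1), (1, 1, 0, 1), (1, 1, 1, 1), (0, 1, 1, 0)]
neg_rows = [(0, 0, 0, 0)] * 6 + [(1, 1, 1, 1), (0, 1, 1, 0)]

ratios = joint_likelihood_ratios(pos_rows, neg_rows)
```

With so few rows, most combinations have zero or one count, so their ratios are unreliable or undefined. The same problem arises whenever the gold standard is small relative to the number of combinations.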


Interpret with caution as numbers are small


TP = FP

Predictions based on a single data type all have TP/FP < 1

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$
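Scoring the gold standard at several cutoffs can be sketched as follows. The scores stand in for log likelihood ratios, and the gold-standard labels and the function name `tp_fp_at_cutoff` are hypothetical. Pairs scoring at or above the cutoff are counted as predicted interactions.

```python
def tp_fp_at_cutoff(scored_pairs, cutoff):
    """Count gold-standard positives (TP) and negatives (FP) scoring at or above the cutoff."""
    tp = sum(1 for score, is_true in scored_pairs if score >= cutoff and is_true)
    fp = sum(1 for score, is_true in scored_pairs if score >= cutoff and not is_true)
    return tp, fp

# Hypothetical (log likelihood ratio, gold-standard label) pairs.
scored = [(5.2, True), (4.1, True), (3.0, False), (2.5, True),
          (1.0, False), (0.2, False), (-0.5, False), (-1.3, True)]

table = {cutoff: tp_fp_at_cutoff(scored, cutoff) for cutoff in (0.0, 2.0, 4.0)}
for cutoff, (tp, fp) in table.items():
    ratio = tp / fp if fp else float("inf")
    print(f"cutoff={cutoff:>4}: TP={tp} FP={fp} TP/FP={ratio}")
```

Raising the cutoff trades coverage for accuracy: fewer gold-standard positives are recovered, but the TP/FP ratio increases.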


Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


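Inference with Bayes Nets

[Network nodes: S = Smart, G = Grades, R = GREs, A = Admit]

If a student is not admitted, is it more likely they had bad GREs or bad grades?

Compute P(R=F|A=F) and P(G=F|A=F)

$$P(R{=}F \mid A{=}F) = \frac{P(R{=}F, A{=}F)}{P(A{=}F)} = \frac{\sum_{S} \sum_{G \in \{F,T\}} P(S)\, P(G \mid S)\, P(R{=}F \mid S)\, P(A{=}F \mid G, R{=}F)}{P(A{=}F)}$$

$$P(A{=}F) = \sum_{S} \sum_{G \in \{F,T\}} \sum_{R \in \{F,T\}} P(S)\, P(G \mid S)\, P(R \mid S)\, P(A{=}F \mid G, R) \quad \text{(as before)}$$

Tedious but straightforward to compute.

$$\frac{P(G{=}F \mid A{=}F)}{P(R{=}F \mid A{=}F)} = \frac{0.92}{0.56} \approx 1.6$$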
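The "tedious but straightforward" sums can be carried out by enumerating every assignment of the four variables. The conditional probability tables below are hypothetical placeholders, not the lecture's values, so the results will not reproduce the 0.92 and 0.56 quoted on the slide. Only the factorization P(S) P(G|S) P(R|S) P(A|G,R) is taken from the slide.

```python
from itertools import product

# Hypothetical CPTs (assumed for illustration; each entry is P(variable = True | parents)).
P_S = {True: 0.3, False: 0.7}                 # P(S = s), listed for both values
P_G_given_S = {True: 0.9, False: 0.4}         # P(G=T | S)
P_R_given_S = {True: 0.8, False: 0.3}         # P(R=T | S)
P_A_given_GR = {(True, True): 0.9, (True, False): 0.5,
                (False, True): 0.3, (False, False): 0.05}  # P(A=T | G, R)

def bern(p_true, value):
    """Probability of a boolean value given the probability that it is True."""
    return p_true if value else 1.0 - p_true

def joint(s, g, r, a):
    """Chain rule for this network: P(S) P(G|S) P(R|S) P(A|G,R)."""
    return (P_S[s] * bern(P_G_given_S[s], g) * bern(P_R_given_S[s], r)
            * bern(P_A_given_GR[(g, r)], a))

def probability(**fixed):
    """Probability of the fixed assignments, summing the joint over all other variables."""
    total = 0.0
    for s, g, r, a in product((True, False), repeat=4):
        world = {"S": s, "G": g, "R": r, "A": a}
        if all(world[name] == value for name, value in fixed.items()):
            total += joint(s, g, r, a)
    return total

p_not_admitted = probability(A=False)
p_bad_gre = probability(R=False, A=False) / p_not_admitted       # P(R=F | A=F)
p_bad_grades = probability(G=False, A=False) / p_not_admitted    # P(G=F | A=F)
```

Enumeration is exponential in the number of variables, which is acceptable for four nodes but is the reason larger networks need more efficient inference methods.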

End of worked example

Page 102: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

Properties of real interactions: correlated expression

Expression Profile Reliability (EPR)

Deane et al., Mol. & Cell. Proteomics (2002) 1.5: 349-356

INT = high confidence interactions from small scale experiments

d = "distance" that measures the difference between two mRNA expression profiles

Note: proteins involved in "true" protein-protein interactions have more similar mRNA expression profiles than random pairs. Use this to assess how good an experimental set of interactions is.

Source: Deane, Charlotte M., Łukasz Salwiński, et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics 1, no. 5 (2002): 349-56.

© American Society for Biochemistry and Molecular Biology. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
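
One way to turn the expression-profile comparison into a number is sketched below. It assumes (this is not stated on the slide) that the distribution of d for a test set can be modeled as a mixture of the INT distribution and the random-pair distribution, and it estimates the mixing weight by least squares. The function names, the Euclidean choice for d, and the histogram values are all illustrative.

```python
import math

def distance(profile_a, profile_b):
    """Euclidean distance d between two equal-length mRNA expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))

def true_fraction(h_test, h_int, h_random):
    """Least-squares alpha for h_test ~ alpha * h_int + (1 - alpha) * h_random.

    Each argument is a normalized histogram of d over the same bins.
    The estimate is clipped to the range [0, 1].
    """
    num = sum((t - r) * (i - r) for t, i, r in zip(h_test, h_int, h_random))
    den = sum((i - r) ** 2 for i, r in zip(h_int, h_random))
    if den == 0:
        raise ValueError("INT and random histograms are identical")
    return min(1.0, max(0.0, num / den))

# Hypothetical histograms over three distance bins (small, medium, large d).
h_int = [0.6, 0.3, 0.1]      # high-confidence interactions: mostly small d
h_random = [0.1, 0.3, 0.6]   # random pairs: mostly large d
h_test = [0.35, 0.3, 0.35]   # a high-throughput data set to be assessed
alpha = true_fraction(h_test, h_int, h_random)
```

Here alpha is read as the estimated fraction of the test set that behaves like INT; a value near 0 would suggest the set is hard to distinguish from random pairs.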

Co-evolution

Cokus et al., BMC Bioinformatics 2007, 8(Suppl 4):S7. doi:10.1186/1471-2105-8-S4-S7

Which pattern below is more likely to represent a pair of interacting proteins?

(Figure label: More likely to interact)

Courtesy of Cokus et al. License: CC-BY. Source: Cokus, Shawn, Sayaka Mizutani, et al. "An Improved Method for Identifying Functionally Linked Proteins using Phylogenetic Profiles." BMC Bioinformatics 8, no. Suppl 4 (2007): S7.
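
The question on this slide can be made concrete with a small sketch. It assumes each protein is summarized by a presence/absence profile across genomes and scores a pair by Jaccard similarity. Jaccard is a simple stand-in chosen for illustration, not the scoring used by Cokus et al., and the profiles are made up.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles (1 = gene present)."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0

# Hypothetical profiles across six genomes.
protein_x = [1, 1, 0, 1, 0, 0]
protein_y = [1, 1, 0, 1, 0, 1]   # nearly the same pattern as protein_x
protein_z = [0, 1, 1, 0, 1, 0]   # mostly a different pattern

score_xy = jaccard(protein_x, protein_y)
score_xz = jaccard(protein_x, protein_z)
```

Under this scoring, the pair with the more similar pattern of presence and absence gets the higher score and would be ranked as the more likely functional link.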

Rosetta Stone

• Look for genes that are fused in some organisms
  – Almost 7000 pairs found in E. coli
  – >6% of known interactions can be found with this method
  – Not very common in eukaryotes
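
The fusion search above can be sketched as follows, assuming homology hits are already available as alignments of query genes onto candidate fused genes in another organism. The input format, the gene names, and the coordinates are hypothetical; a real pipeline would also need alignment-quality filters.

```python
from itertools import combinations

def rosetta_stone_pairs(hits):
    """Find query genes that align to non-overlapping regions of one target gene.

    hits: list of (query_gene, target_gene, start, end), where start and end
    are positions of the alignment on the target gene.
    Returns a set of sorted (query_a, query_b) tuples.
    """
    by_target = {}
    for query, target, start, end in hits:
        by_target.setdefault(target, []).append((query, start, end))
    pairs = set()
    for regions in by_target.values():
        for (qa, sa, ea), (qb, sb, eb) in combinations(regions, 2):
            non_overlapping = ea < sb or eb < sa
            if qa != qb and non_overlapping:
                pairs.add(tuple(sorted((qa, qb))))
    return pairs

hits = [
    ("geneA", "fusedAB", 1, 400),
    ("geneB", "fusedAB", 450, 900),
    ("geneC", "fusedAB", 350, 500),   # overlaps both regions, so it is not paired
    ("geneA", "otherX", 10, 390),     # a single hit cannot form a pair
]
pairs = rosetta_stone_pairs(hits)
```

Requiring non-overlapping regions is the design choice that separates a fusion of two distinct genes from two paralogs that hit the same domain.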

Integrating diverse data

http://www.sciencemag.org/content/302/5644/449.abstract

Advantage of Bayesian Networks

• Data can be a mix of types: numerical and categorical
• Accommodates missing data
• Gives appropriate weights to different sources
• Results can be interpreted easily

Requirement of Bayesian Classification

• Gold standard training data
  – Independent from evidence
  – Large
  – No systematic bias

Positive training data: MIPS
• Hand-curated from literature

Negative training data:
• Proteins in different subcellular compartments
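
A minimal sketch of how the negative training set could be assembled, assuming localization annotations are available as a mapping from protein to a set of compartments. The data structure and protein names are hypothetical, and treating any shared compartment as disqualifying is one possible reading of "different subcellular compartments".

```python
from itertools import combinations

def negative_pairs(localization):
    """Pairs of proteins whose annotated compartments do not overlap.

    localization: dict mapping protein name -> set of compartment names.
    Proteins without any annotation are skipped.
    """
    annotated = sorted(p for p, comps in localization.items() if comps)
    return [
        (a, b)
        for a, b in combinations(annotated, 2)
        if not (localization[a] & localization[b])
    ]

localization = {
    "P1": {"nucleus"},
    "P2": {"mitochondrion"},
    "P3": {"nucleus", "cytoplasm"},
    "P4": set(),                      # unannotated, so never used as a negative
}
negatives = negative_pairs(localization)
```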

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

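likelihood ratio =

$$\frac{P(\text{Data} \mid \text{true\_PPI})\,P(\text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})\,P(\text{false\_PPI})}$$

if > 1 classify as true; if < 1 classify as false

log likelihood ratio =

$$\log\frac{P(\text{true\_PPI})}{P(\text{false\_PPI})} + \log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function =

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log\prod_{i}^{M}\frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$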
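
The ranking function can be sketched in code under the naive Bayes assumption that observations are conditionally independent, so the log of the product becomes a sum of per-observation log ratios. The evidence sources, their values, and all probabilities below are hypothetical; skipping a missing observation is one simple way to accommodate missing data.

```python
import math

def ranking_score(observations, likelihood_tables):
    """Sum of log likelihood ratios over the available observations.

    observations: dict source -> observed value (None if missing).
    likelihood_tables: dict source -> {value: (P(value | true_PPI),
                                               P(value | false_PPI))}.
    Missing observations are skipped, i.e. they contribute a ratio of 1.
    """
    score = 0.0
    for source, value in observations.items():
        if value is None:
            continue
        p_true, p_false = likelihood_tables[source][value]
        score += math.log(p_true / p_false)
    return score

# Hypothetical conditional probabilities for two evidence sources.
tables = {
    "coexpression": {"high": (0.6, 0.2), "low": (0.4, 0.8)},
    "same_function": {True: (0.5, 0.1), False: (0.5, 0.9)},
}
pair_1 = {"coexpression": "high", "same_function": True}
pair_2 = {"coexpression": "low", "same_function": None}
score_1 = ranking_score(pair_1, tables)
score_2 = ranking_score(pair_2, tables)
```

A positive score means the evidence favors a true interaction; because the prior term is the same for every pair, leaving it out does not change the ordering.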

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114/2150}{81924/573734}$$
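
As a worked check of the likelihood calculation, the sketch below estimates each conditional probability as a fraction of gold-standard pairs. It reads the counts on this slide as 1114 of 2150 gold-standard positives and 81924 of 573734 gold-standard negatives, which is an interpretation rather than something stated explicitly; the function name is illustrative.

```python
def likelihood_ratio(pos_count, pos_total, neg_count, neg_total):
    """L = P(f | pos) / P(f | neg), each probability estimated as a fraction."""
    p_f_given_pos = pos_count / pos_total
    p_f_given_neg = neg_count / neg_total
    return p_f_given_pos / p_f_given_neg

# Counts as read from the slide (an interpretation of the extracted digits).
L = likelihood_ratio(1114, 2150, 81924, 573734)
```

With these counts L comes out above 1, so observing this feature value raises the odds that a pair truly interacts.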


Fully connected → Compute probabilities for all 16 possible combinations
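
A sketch of the fully connected case, assuming four binary evidence sources (which gives 2^4 = 16 combinations). Instead of multiplying per-source ratios, each combination gets its own likelihood ratio estimated directly from gold-standard counts. The pseudocount is an addition made here to keep ratios finite when a combination is rare; the data are made up.

```python
from collections import Counter
from itertools import product

def joint_likelihood_ratios(positives, negatives, n_sources=4, pseudocount=1.0):
    """Likelihood ratio for every combination of n_sources binary observations.

    positives / negatives: lists of tuples of 0/1 flags, one tuple per
    gold-standard pair. A pseudocount keeps ratios finite for rare combinations.
    """
    combos = list(product((0, 1), repeat=n_sources))
    pos_counts, neg_counts = Counter(positives), Counter(negatives)
    pos_total = len(positives) + pseudocount * len(combos)
    neg_total = len(negatives) + pseudocount * len(combos)
    ratios = {}
    for combo in combos:
        p_pos = (pos_counts[combo] + pseudocount) / pos_total
        p_neg = (neg_counts[combo] + pseudocount) / neg_total
        ratios[combo] = p_pos / p_neg
    return ratios

# Hypothetical gold-standard observations for four binary sources.
positives = [(1, 1, 0, 0)] * 7 + [(0, 0, 0, 0)] * 9
negatives = [(1, 1, 0, 0)] * 1 + [(0, 0, 0, 0)] * 15
ratios = joint_likelihood_ratios(positives, negatives)
```

Combinations that never occur in either set end up with a ratio of 1 here, which is one reason estimates built on small counts deserve caution.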

Interpret with caution as numbers are small


TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log\frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$
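
The cutoff question can be explored with a small sketch: given pairs scored by log likelihood ratio and labeled by the gold standard, count true and false positives at each cutoff. The scores, labels, and cutoffs are made up for illustration.

```python
def tp_fp_at_cutoffs(scored_pairs, cutoffs):
    """Count true and false positives among pairs scoring at or above each cutoff.

    scored_pairs: list of (log_likelihood_ratio, is_gold_positive).
    Returns a list of (cutoff, TP, FP) tuples.
    """
    results = []
    for cutoff in cutoffs:
        tp = sum(1 for score, is_pos in scored_pairs if score >= cutoff and is_pos)
        fp = sum(1 for score, is_pos in scored_pairs if score >= cutoff and not is_pos)
        results.append((cutoff, tp, fp))
    return results

# Hypothetical scored gold-standard pairs.
scored = [(3.2, True), (2.5, True), (2.1, False), (1.0, True),
          (0.4, False), (-0.5, False), (-1.2, True), (-2.0, False)]
table = tp_fp_at_cutoffs(scored, cutoffs=[2.0, 0.0, -3.0])
```

Lowering the cutoff recovers more gold-standard positives but admits more false positives, so the TP/FP ratio tends to fall toward the point where TP = FP.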

Advantage of Bayesian Networks

bull Data can be a mix of types numerical and categorical

bull Accommodates missing data bull Give appropriate weights to different sources bull Results can be interpreted easily

106

Requirement of Bayesian Classification

bull Gold standard training data ndash Independent from evidence ndash Large ndash No systematic bias

Positive training data MIPS bull Hand-curated from literature Negative training data bull Proteins in different subcellular compartments

107

Integrating diverse data

© American Association for the Advancement of Science. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/. Source: Jansen, Ronald, Haiyuan Yu, et al. "A Bayesian Networks Approach for Predicting Protein-protein Interactions from Genomic Data." Science 302, no. 5644 (2003): 449-53.

Likelihood ratio: if > 1, classify as true; if < 1, classify as false.

Prior probability is the same for all interactions, so it does not affect ranking.

Ranking function (log likelihood ratio) =

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})} = \log \prod_{i}^{M} \frac{P(\text{Observation}_i \mid \text{true\_PPI})}{P(\text{Observation}_i \mid \text{false\_PPI})}$$
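The ranking function can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the evidence sources, their categories, and every probability below are hypothetical values chosen for the example, not numbers from the lecture or from Jansen et al. The sketch also assumes the observations are conditionally independent, so that the log of the product becomes a sum, and it treats a missing observation as contributing nothing to the score.

```python
import math

def log_likelihood_ratio(observations, p_given_true, p_given_false):
    """Sum of per-observation log ratios (equals the log of the product over i)."""
    total = 0.0
    for source, value in observations.items():
        if value is None:
            # Missing data: skip this source, so it adds log(1) = 0.
            continue
        total += math.log(p_given_true[source][value] / p_given_false[source][value])
    return total

# Hypothetical conditional probabilities P(Observation_i | true_PPI / false_PPI).
p_given_true = {
    "coexpression": {"high": 0.6, "low": 0.4},
    "essentiality": {"EE": 0.5, "other": 0.5},
}
p_given_false = {
    "coexpression": {"high": 0.2, "low": 0.8},
    "essentiality": {"EE": 0.1, "other": 0.9},
}

pair = {"coexpression": "high", "essentiality": "EE"}
score = log_likelihood_ratio(pair, p_given_true, p_given_false)

# Log ratio > 0 corresponds to likelihood ratio > 1, i.e. "classify as true".
predicted_true = score > 0

# A pair with one missing observation is still scored from the rest.
partial_score = log_likelihood_ratio(
    {"coexpression": "high", "essentiality": None}, p_given_true, p_given_false
)
```

Because the prior is the same for every pair, sorting candidate pairs by `score` gives the same order as sorting by posterior odds.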

Protein pairs in the essentiality data can take on three discrete values (EE: both essential; NN: both non-essential; NE: one essential and one not).

$$\text{Likelihood} = L = \frac{P(f \mid \text{pos})}{P(f \mid \text{neg})} = \frac{1114 / 2150}{81924 / 573734}$$
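As a worked check of the likelihood for one essentiality category, the arithmetic can be written out as below. The sketch assumes that the digit strings on the slide are the fractions 1114/2150 (gold-standard positives in the category over all positives) and 81924/573734 (gold-standard negatives in the category over all negatives); the function and variable names are illustrative.

```python
def category_likelihood(count_pos, total_pos, count_neg, total_neg):
    """L = P(f | pos) / P(f | neg) for one discrete feature value f."""
    p_f_given_pos = count_pos / total_pos
    p_f_given_neg = count_neg / total_neg
    return p_f_given_pos / p_f_given_neg

# Counts as read from the slide (assumed reading of the run-together digits).
likelihood = category_likelihood(1114, 2150, 81924, 573734)
print(round(likelihood, 1))  # 3.6
```

A value above 1 means that a pair falling in this category is more common among gold-standard positives than among gold-standard negatives.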

Fully connected → Compute probabilities for all 16 possible combinations
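A fully connected model does not factor the evidence into independent terms; it needs one likelihood ratio per joint combination of observations. The sketch below assumes the 16 combinations arise from four binary evidence sources (2^4 = 16), which the slide text does not state explicitly, and it uses small made-up gold-standard sets purely for illustration.

```python
from collections import Counter
from itertools import product

def joint_likelihood_ratios(pos_pairs, neg_pairs, n_sources=4):
    """Likelihood ratio for every joint combination of binary observations."""
    pos_counts = Counter(pos_pairs)
    neg_counts = Counter(neg_pairs)
    ratios = {}
    for combo in product((0, 1), repeat=n_sources):
        p_pos = pos_counts[combo] / len(pos_pairs)
        p_neg = neg_counts[combo] / len(neg_pairs)
        # With no negatives in a cell the ratio is undefined; report None.
        ratios[combo] = p_pos / p_neg if p_neg > 0 else None
    return ratios

# Hypothetical gold-standard observations (one tuple of four 0/1 flags per pair).
pos_pairs = [(1, 1, 0, 0), (1, 1, 0, 0), (0, 0, 0, 0), (1, 0, 0, 0)]
neg_pairs = [(0, 0, 0, 0)] * 8 + [(1, 0, 0, 0)] * 2

ratios = joint_likelihood_ratios(pos_pairs, neg_pairs)
```

Most of the 16 cells are empty or nearly empty even in this toy example, which is why estimates from sparsely populated combinations need cautious interpretation.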


Interpret with caution as numbers are small


TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$
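Counting gold-standard hits at different cutoffs can be sketched as follows. The scores and labels are invented for illustration, and the function name is hypothetical; each pair carries a log likelihood ratio and a flag saying whether it is a gold-standard positive.

```python
def tp_fp_at_cutoffs(scored_pairs, cutoffs):
    """Return {cutoff: (TP, FP)} for pairs scoring at or above each cutoff."""
    results = {}
    for cutoff in cutoffs:
        tp = sum(1 for score, is_pos in scored_pairs if score >= cutoff and is_pos)
        fp = sum(1 for score, is_pos in scored_pairs if score >= cutoff and not is_pos)
        results[cutoff] = (tp, fp)
    return results

# Hypothetical (log likelihood ratio, is gold-standard positive) pairs.
scored_pairs = [
    (3.0, True), (2.5, True), (2.0, False),
    (1.0, True), (0.5, False), (-1.0, False),
]

counts = tp_fp_at_cutoffs(scored_pairs, cutoffs=[0.0, 2.0, 2.5])
```

Raising the cutoff trades coverage for precision: fewer pairs pass, but the ratio of true to false positives among them tends to rise.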


Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions

MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

  • Slide Number 1
  • Predictions
  • Prediction Challenges
  • Slide Number 4
  • Slide Number 5
  • Predicting Structures of Complexes
  • Docking
  • Docking
  • Reducing the search space
  • Next
  • PRISMrsquos Rationale
  • Slide Number 12
  • Subtilisin and its inhibitors
  • Hotspots
  • Hotspots
  • Slide Number 16
  • Slide Number 17
  • Slide Number 18
  • Slide Number 19
  • Slide Number 20
  • Slide Number 21
  • Flowchart
  • Slide Number 23
  • Next
  • Slide Number 25
  • Slide Number 26
  • Slide Number 27
  • Slide Number 28
  • Slide Number 29
  • Slide Number 30
  • Slide Number 31
  • Slide Number 32
  • Slide Number 33
  • Slide Number 34
  • Slide Number 35
  • Slide Number 36
  • Slide Number 37
  • Slide Number 38
  • Outline
  • Detecting protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Mass-spec for protein-protein interactions
  • Tagging strategies
  • Yeast two-hybrid
  • Slide Number 45
  • Outline
  • Error Rates
  • Error Rates
  • Data Integration
  • Data Integration
  • Data Integration
  • Data Integration
  • Estimated Error Rates
  • Slide Number 54
  • Finding real interactions
  • Slide Number 56
  • Finding real interactions
  • Bayes Rule
  • Naiumlve Bayes Classification
  • Slide Number 61
  • Slide Number 62
  • Slide Number 63
  • Slide Number 64
  • Slide Number 65
  • Slide Number 66
  • ROC curve
  • ROC curve
  • ROC curve
  • ROC curve
  • Outline
  • Bayesian Networks
  • In Biology
  • Slide Number 74
  • Slide Number 75
  • Slide Number 76
  • Slide Number 77
  • Bayesian Networks
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Graphical Structure Expresses our Beliefs
  • Slide Number 83
  • Slide Number 84
  • Slide Number 85
  • Slide Number 86
  • Slide Number 87
  • ldquoExplaining Awayrdquo
  • Slide Number 89
  • How do we obtain a BN
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Learning Models from Data
  • Resources to learn more
  • A worked ldquotoyrdquo example
  • Chain Rule of Probability for Bayes Nets
  • Prediction with Bayes Nets
  • Inference with Bayes Nets
  • Inference with Bayes Nets
  • End of worked example
  • Goal
  • Slide Number 103
  • Co-evolution
  • Rosetta Stone
  • Integrating diverse data
  • Advantage of Bayesian Networks
  • Requirement of Bayesian Classification
  • Integrating diverse data
  • Slide Number 110
  • Slide Number 111
  • Slide Number 112
  • Slide Number 113
  • Slide Number 114
  • Slide Number 115
  • Slide Number 116
  • Slide Number 117
  • Summary
Page 111: Lecture 14: Predicting Protein Interactions › courses › biology › 7-91j-foundations-of...Look for structure of a complex containing structural neighborsAlign sequences of MA,MB

Protein pairs in the essentiality data can take on three discrete values (EE both essential NN both non-essential and NE one essential and one not)

11142150

81924573734

Likelihood=L= )|()|(

negfPposfP

111

112

copy American Association for the Advancement of Science All rights reserved This content is excluded from our Creative Commonslicense For more information see httpocwmiteduhelpfaq-fair-useSource Jansen Ronald Haiyuan Yu et al A Bayesian Networks Approach for Predicting Protein-protein Interactions from GenomicData Science 302 no 5644 (2003) 449-53

113

Fully connected rarr Compute probabilities for all 16 possible combinations


Interpret with caution, as the numbers are small.


TP = FP

Predictions based on a single data type all have TP/FP < 1.

How many gold-standard events do we score correctly at different likelihood cutoffs?

$$\log \frac{P(\text{Data} \mid \text{true\_PPI})}{P(\text{Data} \mid \text{false\_PPI})}$$
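Scoring gold-standard pairs at different cutoffs amounts to counting true and false positives among the pairs whose log-likelihood ratio clears each threshold. The sketch below uses hypothetical scores and labels, not values from the paper, and the function name `tp_fp_at_cutoff` is an assumption.

```python
import math


def tp_fp_at_cutoff(scored_pairs, cutoff):
    """Count true and false positives among pairs with log-likelihood ratio >= cutoff."""
    tp = sum(1 for llr, is_true in scored_pairs if llr >= cutoff and is_true)
    fp = sum(1 for llr, is_true in scored_pairs if llr >= cutoff and not is_true)
    return tp, fp


# Hypothetical (log-likelihood ratio, gold-standard label) pairs.
toy_pairs = [
    (math.log(600.0), True),
    (math.log(50.0), True),
    (math.log(50.0), False),
    (math.log(3.6), True),
    (math.log(3.6), False),
    (math.log(0.5), False),
]

for ratio_cutoff in (1.0, 10.0, 100.0):
    tp, fp = tp_fp_at_cutoff(toy_pairs, math.log(ratio_cutoff))
    print(f"L >= {ratio_cutoff:g}: TP={tp}, FP={fp}")
```

Raising the cutoff trades coverage for accuracy: fewer gold-standard positives are recovered, but the ratio TP/FP improves.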


Summary

• Structural prediction of protein-protein interactions
• High-throughput measurement of protein-protein interactions
• Estimating interaction probabilities
• Bayes Net predictions of protein-protein interactions


MIT OpenCourseWare
http://ocw.mit.edu

7.91J / 20.490J / 20.390J / 7.36J / 6.802J / 6.874J / HST.506J Foundations of Computational and Systems Biology
Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
